Custom Knowledge Graph Extraction
This guide demonstrates how to customize the knowledge graph extraction process using your own prompts and validation schemas. This allows you to define specific extraction rules, output formats, and structural constraints for your knowledge graph.
Overview
The custom extraction process involves three main components:
Custom Prompt: Defines the instructions given to the LLM for extracting triples.
Custom Schema: Defines the JSON schema used to validate the extracted triples.
Extraction Script: Configures the
KnowledgeGraphExtractorto use your custom files.
1. Creating a Custom Prompt
Create a JSON file (e.g., custom_prompt.json) to define your extraction prompts. The file should support language keys (like "en") and contain specific instructions for the system and the extraction tasks.
You can define multiple keys (e.g., "triple_extraction", "time_extraction") to represent different stages of extraction. The system will iterate through each key and perform the corresponding extraction task.
Example custom_prompt.json:
{
"en": {
"system": "You are a helpful assistant",
"triple_extraction": "You are an expert knowledge graph constructor.\nYour task is to extract factual information from the provided text and represent it strictly as a ***JSON array*** of knowledge graph triples.\n\n### Output Format\n- The output must be a **JSON array**.\n- Each element in the array must be a **JSON object** with exactly three non-empty keys:\n - \"subject\": the main entity, concept, event, or attribute.\n - \"relation\": a concise, descriptive phrase or verb that describes the relationship.\n - \"object\": the entity, concept, value, event, or attribute that the subject has a relationship with.\n\n### Constraints\n- **Do not include any text other than the JSON output.**\n- Extract **all possible and relevant triples**.\n- If no triples can be extracted, return exactly: `[]`.",
"time_extraction": "You are an expert in temporal extraction..."
}
}
2. Creating a Custom Schema
Create a JSON schema file (e.g., custom_schema.json) to enforce the structure of the extracted data. This ensures that the LLM output conforms to your expected format.
Important: If you defined multiple extraction keys in your custom_prompt.json (e.g., "triple_extraction", "time_extraction"), you must provide a corresponding schema for each key in this file.
Example custom_schema.json:
{
"triple_extraction": {
"type": "array",
"items": {
"type": "object",
"properties": {
"subject": { "type": "string" },
"relation": { "type": "string" },
"object": { "type": "string" }
},
"required": ["subject", "relation", "object"]
}
},
"time_extraction": {
"type": "array",
"items": {
"type": "object",
"properties": {
"event": { "type": "string" },
"timestamp": { "type": "string" }
},
"required": ["event", "timestamp"]
}
}
}
3. Running Custom Extraction
Use the KnowledgeGraphExtractor class with a ProcessingConfig that points to your custom prompt and schema files.
Example Script (custom_kg_extraction.py):
from atlas_rag.kg_construction.triple_extraction import KnowledgeGraphExtractor
from atlas_rag.kg_construction.triple_config import ProcessingConfig
from atlas_rag.llm_generator import LLMGenerator
from openai import OpenAI
# Initialize LLM
client = OpenAI(base_url="http://0.0.0.0:8129/v1", api_key="EMPTY")
triple_generator = LLMGenerator(client=client, model_name="Qwen/Qwen2.5-7B-Instruct")
# Configure Extraction with Custom Files
kg_extraction_config = ProcessingConfig(
model_path="Qwen/Qwen2.5-7B-Instruct",
data_directory="benchmark_data",
filename_pattern="musique",
output_directory="example",
triple_extraction_prompt_path='example/example_scripts/custom_extraction/custom_benchmark/custom_prompt.json', # Path to your custom prompt
triple_extraction_schema_path='example/example_scripts/custom_extraction/custom_benchmark/custom_schema.json', # Path to your custom schema
)
# Run Extraction
kg_extractor = KnowledgeGraphExtractor(model=triple_generator, config=kg_extraction_config)
kg_extractor.run_extraction()
kg_extractor.convert_json_to_csv()
kg_extractor.convert_to_graphml()