Custom Knowledge Graph Extraction

This guide demonstrates how to customize the knowledge graph extraction process using your own prompts and validation schemas. This allows you to define specific extraction rules, output formats, and structural constraints for your knowledge graph.

Overview

The custom extraction process involves three main components:

Custom Prompt: Defines the instructions given to the LLM for extracting triples.
Custom Schema: Defines the JSON schema used to validate the extracted triples.
Extraction Script: Configures the KnowledgeGraphExtractor to use your custom files.

1. Creating a Custom Prompt

Create a JSON file (e.g., custom_prompt.json) to define your extraction prompts. The file should support language keys (like "en") and contain specific instructions for the system and the extraction tasks.

You can define multiple keys (e.g., "triple_extraction", "time_extraction") to represent different stages of extraction. The system will iterate through each key and perform the corresponding extraction task.

Example custom_prompt.json:

{
  "en": {
    "system": "You are a helpful assistant",
    "triple_extraction": "You are an expert knowledge graph constructor.\nYour task is to extract factual information from the provided text and represent it strictly as a ***JSON array*** of knowledge graph triples.\n\n### Output Format\n- The output must be a **JSON array**.\n- Each element in the array must be a **JSON object** with exactly three non-empty keys:\n  - \"subject\": the main entity, concept, event, or attribute.\n  - \"relation\": a concise, descriptive phrase or verb that describes the relationship.\n  - \"object\": the entity, concept, value, event, or attribute that the subject has a relationship with.\n\n### Constraints\n- **Do not include any text other than the JSON output.**\n- Extract **all possible and relevant triples**.\n- If no triples can be extracted, return exactly: `[]`.",
    "event_entity_extraction": "You are an expert in event entity extraction...",
    "event_event_extraction": "You are an expert in event-event relation extraction..."
  }
}

2. Creating a Custom Schema

Create a JSON schema file (e.g., custom_schema.json) to enforce the structure of the extracted data. This ensures that the LLM output conforms to your expected format.

Important: If you defined multiple extraction keys in your custom_prompt.json (e.g., "triple_extraction", "time_extraction"), you must provide a corresponding schema for each key in this file.

(Caution: Shcema Induction (Concept Generation) and retrievers may not handle node_type other than “entity” or “event” properly. Custom node types are set for custom use cases but may not be fully supported in all components in atlas-rag.)

Example custom_schema.json:

{   
    // One-to-one example schema
    "triple_extraction": {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "subject": { "type": "string", "node_type": "entity" },
                "relation": { "type": "string" },
                "object": { "type": "string", "node_type": "entity" }
            },
            "required": ["subject", "relation", "object"]
        }
    },
    // One-to-many example schema (event-to-entities)
    "event_entity_extraction": { 
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "event": { "type": "string", "node_type": "event" },
                // relation can be added if needed for scehma validation
                "entity": { "type": "array", "items": { "type": "string", "node_type": "entity" } },
            },
            "required": ["event", "entity"]
        }
    },

    "event_event_extraction": { 
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "event1": { "type": "string", "node_type": "event" },
                "relation": { "type": "string" },
                "event2": { "type": "string", "node_type": "event" }
            },
            "required": ["event1", "relation", "event2"]
        }
    }
}

3. Running Custom Extraction

Use the KnowledgeGraphExtractor class with a ProcessingConfig that points to your custom prompt and schema files.

Example Script (custom_kg_extraction.py):

from atlas_rag.kg_construction.triple_extraction import KnowledgeGraphExtractor
from atlas_rag.kg_construction.triple_config import ProcessingConfig
from atlas_rag.llm_generator import LLMGenerator
from openai import OpenAI

# Initialize LLM
client = OpenAI(base_url="http://0.0.0.0:8129/v1", api_key="EMPTY")
triple_generator = LLMGenerator(client=client, model_name="Qwen/Qwen2.5-7B-Instruct")

# Configure Extraction with Custom Files
kg_extraction_config = ProcessingConfig(
      model_path="Qwen/Qwen2.5-7B-Instruct",
      data_directory="benchmark_data",
      filename_pattern="musique",
      output_directory="example",
      triple_extraction_prompt_path='example/example_scripts/custom_extraction/custom_benchmark/custom_prompt.json', # Path to your custom prompt
      triple_extraction_schema_path='example/example_scripts/custom_extraction/custom_benchmark/custom_schema.json', # Path to your custom schema
)

# Run Extraction
kg_extractor = KnowledgeGraphExtractor(model=triple_generator, config=kg_extraction_config)
kg_extractor.run_extraction()
kg_extractor.convert_json_to_csv()
kg_extractor.convert_to_graphml()