File Formats & Data Loading

AutoSchemaKG accepts raw text corpora for knowledge graph construction and standard QA datasets for benchmarking.

Input Data Formats

1. Raw Text (Corpus)

For building a Knowledge Graph from scratch, the system typically expects a directory of text files or a structured JSON/JSONL corpus.

JSONL Format (Recommended): Each line is a JSON object representing a document.

{"id": "title_1", "text": "Full text content of the document...", "metadata": {"lang": "en"}}
{"id": "title_2", "text": "Another document content...", "metadata": {"lang": "en"}}

Directory of Text Files: You can also point data_directory to a folder containing .txt or .md files. The filename is often used as the document ID.
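
As an illustration of the two corpus layouts above, the following is a minimal sketch of how such input could be read into a common list of records. The function names load_jsonl_corpus and load_text_directory are hypothetical helpers written for this example, not part of the AutoSchemaKG API, and the "lang" default is an assumption.

```python
import json
from pathlib import Path


def load_jsonl_corpus(path):
    """Read a JSONL corpus: one JSON object per non-empty line."""
    docs = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            # "id" and "text" are the fields shown in the JSONL example.
            if "id" not in record or "text" not in record:
                raise ValueError(f"{path}:{line_no}: missing 'id' or 'text'")
            record.setdefault("metadata", {})
            docs.append(record)
    return docs


def load_text_directory(directory, lang="en"):
    """Build the same records from .txt/.md files, using the file stem as ID."""
    docs = []
    for file in sorted(Path(directory).iterdir()):
        if file.is_file() and file.suffix.lower() in {".txt", ".md"}:
            docs.append({
                "id": file.stem,
                "text": file.read_text(encoding="utf-8"),
                "metadata": {"lang": lang},  # assumed default, not read from the file
            })
    return docs
```

Both helpers return dictionaries with the same id, text, and metadata keys, so downstream code does not need to know which layout the corpus came from.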

2. Benchmark Datasets

For evaluation, the system supports standard QA dataset formats like HotpotQA, 2WikiMultihopQA, and MuSiQue.

Standard QA Format (JSON):

[
  {
    "_id": "5a7a06935542990198eaf050",
    "question": "Which magazine was published first, Arthur's Magazine or First for Women?",
    "answer": "Arthur's Magazine",
    "supporting_facts": [["Arthur's Magazine", 0], ["First for Women", 0]],
    "context": [
      ["Arthur's Magazine", ["Arthur's Magazine (1844–1846) was an American literary periodical..."]],
      ["First for Women", ["First for Women is a woman's magazine published by Bauer Media Group..."]]
    ]
  }
]
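
To make the structure of this format concrete, here is a small sketch that loads such a file and resolves each supporting_facts pair (a title and a sentence index) to the sentence it points at in context. The helper names are hypothetical and only illustrate the layout shown above; they are not taken from the AutoSchemaKG codebase.

```python
import json


def load_qa_dataset(path):
    """Load a JSON file containing a list of QA examples."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def supporting_sentences(example):
    """Resolve each [title, sentence_index] pair to its sentence text."""
    # "context" is a list of [title, [sentence, ...]] pairs.
    sentences_by_title = {title: sents for title, sents in example["context"]}
    resolved = []
    for title, idx in example["supporting_facts"]:
        sents = sentences_by_title.get(title, [])
        if 0 <= idx < len(sents):  # skip indices that fall outside the paragraph
            resolved.append((title, sents[idx]))
    return resolved
```

Applied to the example record above, supporting_sentences returns one (title, sentence) pair for each of the two magazines.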

PDF & Document Processing

AutoSchemaKG includes utilities to convert unstructured documents (PDFs) into Markdown/Text for processing.

Note: Complete example scripts and configuration details are available in the GitHub repository.

Workflow

The general pipeline for PDFs is:

  1. Convert to Markdown: Use the provided tools (based on marker-pdf) to extract text and structure.

  2. Convert to JSON: Transform the Markdown output into the JSONL format required by AutoSchemaKG.

  3. KG Extraction: Run the KnowledgeGraphExtractor on the processed data.
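
Step 2 of the pipeline above can be pictured with the following sketch, which writes one JSONL record per Markdown file. It is a simplified stand-in written for illustration, not the markdown_to_json module shipped with AutoSchemaKG; the function name, the use of the file stem as the ID, and the "lang" metadata value are all assumptions.

```python
import json
from pathlib import Path


def markdown_dir_to_jsonl(md_dir, out_path, lang="en"):
    """Write one JSON line per Markdown file; return the number of documents written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for md_file in sorted(Path(md_dir).rglob("*.md")):
            text = md_file.read_text(encoding="utf-8").strip()
            if not text:
                continue  # skip empty conversions
            record = {
                "id": md_file.stem,          # assumed ID convention
                "text": text,
                "metadata": {"lang": lang},  # assumed metadata
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```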

Quick Start

  1. Install Dependencies:

    # Using a separate environment is recommended
    conda create --name pdf-marker pip python=3.10
    conda activate pdf-marker
    pip install 'marker-pdf[full]' google-genai
    
  2. Configure: Edit config.yaml to set your LLM service (Azure OpenAI or Gemini) and input/output paths.

  3. Run Conversion:

    # Convert PDF to Markdown
    bash run.sh
    
    # Convert Markdown to JSON (from AutoSchemaKG root)
    python -m atlas_rag.kg_construction.utils.md_processing.markdown_to_json \
        --input example_data/md_data \
        --output example_data
    

Output Formats

1. GraphML

The final knowledge graph is typically exported as a .graphml file, which can be loaded with NetworkX (for example, via networkx.read_graphml).
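
A minimal sketch of inspecting an exported graph, assuming the networkx package is installed. The helper name summarize_graphml is hypothetical, and the node and edge attributes stored in a real export depend on the pipeline configuration, so the sketch only reports basic counts.

```python
import networkx as nx


def summarize_graphml(path):
    """Load a GraphML file and report basic size information."""
    graph = nx.read_graphml(path)
    return {
        "nodes": graph.number_of_nodes(),
        "edges": graph.number_of_edges(),
        "directed": graph.is_directed(),
    }
```

From there, graph.nodes(data=True) and graph.edges(data=True) expose whatever attributes the export actually contains.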

2. CSV (Triples & Concepts)

Intermediate results are stored as CSVs:

  • Triples CSV: triple_nodes, triple_edges, text_nodes, text_edges

  • Concepts CSV: triple_edges, concept_edges, concept_nodes
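
Because the column layout of these files is not documented here, a reasonable first step is to inspect their headers and row counts. The sketch below does this with the standard library only; describe_csvs is a hypothetical helper, and it assumes each file has a header row and lives in a single output directory.

```python
import csv
from pathlib import Path


def describe_csvs(directory):
    """Map each CSV file name to its header columns and number of data rows."""
    summary = {}
    for csv_file in sorted(Path(directory).glob("*.csv")):
        with open(csv_file, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            header = next(reader, [])        # first row is assumed to be the header
            rows = sum(1 for _ in reader)    # remaining rows are data
        summary[csv_file.name] = {"columns": header, "rows": rows}
    return summary
```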

3. NetworkX / GraphDatabase

For programmatic access, graphs are handled in memory as NetworkX objects or stored in a graph database.