# File Formats & Data Loading

AutoSchemaKG supports various input formats for knowledge graph construction and benchmarking.

## Input Data Formats

### 1. Raw Text (Corpus)

For building a Knowledge Graph from scratch, the system typically expects a directory of text files or a structured JSON/JSONL corpus.

**JSONL Format (Recommended):**
Each line is a JSON object representing a document.

```json
{"id": "title_1", "text": "Full text content of the document...", "metadata": {"lang": "en"}}
{"id": "title_2", "text": "Another document content...", "metadata": {"lang": "en"}}
```

**Directory of Text Files:**
You can also point `data_directory` to a folder containing `.txt` or `.md` files. The filename is often used as the document ID.

### 2. Benchmark Datasets

For evaluation, the system supports standard QA dataset formats such as HotpotQA, 2WikiMultihopQA, and MuSiQue.

**Standard QA Format (JSON):**

```json
[
  {
    "_id": "5a7a06935542990198eaf050",
    "question": "Which magazine was published first, Arthur's Magazine or First for Women?",
    "answer": "Arthur's Magazine",
    "supporting_facts": [["Arthur's Magazine", 0], ["First for Women", 0]],
    "context": [
      ["Arthur's Magazine", ["Arthur's Magazine (1844–1846) was an American literary periodical..."]],
      ["First for Women", ["First for Women is a woman's magazine published by Bauer Media Group..."]]
    ]
  }
]
```

## PDF & Document Processing

AutoSchemaKG includes utilities to convert unstructured documents (PDFs) into Markdown/text for processing.

> **Note:** Complete example scripts and configuration details are available in the [GitHub repository](https://github.com/HKUST-KnowComp/AutoSchemaKG/tree/main/example/pdf_md_conversion).

### Workflow

The general pipeline for PDFs is:

1. **Convert to Markdown**: Use the provided tools (based on `marker-pdf`) to extract text and structure.
2. **Convert to JSON**: Transform the Markdown output into the JSONL format required by AutoSchemaKG.
3. **KG Extraction**: Run the `KnowledgeGraphExtractor` on the processed data.

### Quick Start

1. **Install Dependencies**:

   ```bash
   # Creating a separate environment is recommended
   conda create --name pdf-marker pip python=3.10
   conda activate pdf-marker
   pip install 'marker-pdf[full]' google-genai
   ```

2. **Configure**: Edit `config.yaml` to set your LLM service (Azure OpenAI or Gemini) and input/output paths.

3. **Run Conversion**:

   ```bash
   # Convert PDF to Markdown
   bash run.sh

   # Convert Markdown to JSON (from the AutoSchemaKG root)
   python -m atlas_rag.kg_construction.utils.md_processing.markdown_to_json \
       --input example_data/md_data \
       --output example_data
   ```

## Output Formats

### 1. GraphML

The final Knowledge Graph is often exported as `.graphml`, which can be loaded with NetworkX.

### 2. CSV (Triples & Concepts)

Intermediate results are stored as CSVs:

- **Triples CSV**: `triple_nodes`, `triple_edges`, `text_nodes`, `text_edges`
- **Concepts CSV**: `triple_edges`, `concept_edges`, `concept_nodes`

### 3. NetworkX / Graph Database

For programmatic access, graphs are manipulated as NetworkX objects or stored in a graph database.
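As a quick illustration of programmatic access, the sketch below loads an exported `.graphml` file with NetworkX and inspects a few nodes and edges. The path `kg.graphml` is a placeholder; substitute the file produced by your own extraction run, and note that the attribute names on nodes and edges depend on that run's output.

```python
# Minimal sketch: inspect an exported AutoSchemaKG graph with NetworkX.
# "kg.graphml" is a placeholder path; point it at your own exported file.
import networkx as nx

G = nx.read_graphml("kg.graphml")

print(f"{G.number_of_nodes()} nodes, {G.number_of_edges()} edges")

# GraphML preserves node/edge attributes; print a small sample of each.
for node, attrs in list(G.nodes(data=True))[:5]:
    print(node, attrs)

for u, v, attrs in list(G.edges(data=True))[:5]:
    print(u, "->", v, attrs)
```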