# Hosting Existing Billion-Scale Knowledge Graphs

This guide explains how to host and use our pre-constructed billion-scale knowledge graphs (Wiki, Pes2o, Common Crawl) using Neo4j Community Edition.

## Option 1: Quick Start (Pre-built Server)

The easiest way to get started is to download the pre-configured Neo4j server with the data already imported.

1. **Download**: Get the zipped server from our [Huggingface Dataset](https://huggingface.co/datasets/gzone0111/AutoSchemaKG). You can download the specific dataset you need:

   | Dataset | File Name | Size | Description |
   |---------|-----------|------|-------------|
   | **Common Crawl** | `neo4j-server-cc.zip` | ~213 GB | Large-scale web crawl data. |
   | **Wiki** | `neo4j-server-wiki.zip` | ~74.1 GB | Wikipedia-based knowledge graph. |
   | **Pes2o** | `neo4j-server-pes2o.zip` | ~53.2 GB | Academic papers dataset. |

   **Download via CLI:**

   ```bash
   # Install huggingface-cli
   pip install huggingface_hub

   # Download a specific file (e.g., Wiki)
   hf download gzone0111/AutoSchemaKG 'ATLAS Neo4j Server Zip/neo4j-server-wiki.zip' --local-dir . --repo-type dataset
   ```

2. **Unzip**: Extract the downloaded file.

3. **Run**: Start the server using the start script in the unzipped folder.

   ```bash
   # Replace {dataset-name} with wiki, pes2o, or cc
   ./neo4j-server-{dataset-name}/bin/neo4j start
   ```

## Option 2: Build from Source

If you prefer to build the database yourself from the raw CSV files, follow these steps.

### 1. Setup Neo4j

We provide scripts to download Neo4j Community Edition, install the required plugins (APOC, GDS), and configure the environment.

```bash
cd neo4j_scripts
sh get_neo4j_cc.sh     # For Common Crawl
sh get_neo4j_pes2o.sh  # For Pes2o
sh get_neo4j_wiki.sh   # For Wiki
```

**Configuration**:

- Copy `AutoschemaKG/neo4j_scripts/neo4j.conf` to `neo4j-server-{dataset}/conf/neo4j.conf`.
- Update `dbms.default_database` to your desired dataset name (e.g., `wiki-csv-json-text`).
- Configure ports (Bolt, HTTP, HTTPS) to avoid conflicts if running multiple servers.

### 2. Prepare Data

1. **Download Data**: Download the CSV dumps from our [Huggingface Dataset](https://huggingface.co/datasets/gzone0111/AutoSchemaKG/tree/main). You need to download all the zip files from the `ATLAS Neo4j Dump` folder.

   **Download via CLI:**

   ```bash
   hf download gzone0111/AutoSchemaKG --include "ATLAS Neo4j Dump/*" --local-dir . --repo-type dataset
   ```

2. **Decompress**: Run the `decompress_csv_files.sh` script to decompress all zip files in parallel into a `decompressed` directory, then move the files to the `./import` directory of your Neo4j server.

   **Storage Requirements:** Ensure you have sufficient disk space. Approximate sizes after decompression and import:

   | Directory | Size |
   |-----------|------|
   | `./neo4j-server-wiki` | 342 GB |
   | `./neo4j-server-cc` | 907 GB |
   | `./neo4j-server-pes2o` | 249 GB |
   | `./import` (Raw CSVs) | 2.3 TB |

3. **Add Numeric IDs**: (Optional if using the provided processed CSVs) If building from raw extraction output, you may need to add numeric IDs for vector indexing.

### 3. Import Data

Use `neo4j-admin import` to load the CSVs. This is much faster than Cypher `LOAD CSV` for large datasets.
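Before running the offline import, it can help to confirm that the decompressed CSVs actually landed in `./import` and that their headers parse. Below is a minimal, optional Python sketch for that check; the file names are taken from the Wiki import example that follows, so adjust them for Pes2o or CC.

```python
import csv
from pathlib import Path

# Location of the decompressed CSVs (the Neo4j server's import directory)
import_dir = Path("./import")

# Node files from the Wiki import example below; swap in the pes2o/cc names as needed
expected_files = [
    "text_nodes_en_simple_wiki_v0_from_json_with_numeric_id.csv",
    "triple_nodes_en_simple_wiki_v0_from_json_without_emb_with_numeric_id.csv",
    "concept_nodes_en_simple_wiki_v0_from_json_without_emb.csv",
]

for name in expected_files:
    path = import_dir / name
    if not path.is_file():
        print(f"MISSING: {name}")
        continue
    with path.open(newline="", encoding="utf-8") as f:
        header = next(csv.reader(f))
    # Print the column count and the first few header fields as a quick sanity check
    print(f"{name}: {len(header)} columns, header starts with {header[:4]}")
```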
**Example: Importing the Wiki Graph**

```bash
./neo4j-server-wiki/bin/neo4j-admin database import full wiki-csv-json-text \
  --nodes=./import/text_nodes_en_simple_wiki_v0_from_json_with_numeric_id.csv \
  --nodes=./import/triple_nodes_en_simple_wiki_v0_from_json_without_emb_with_numeric_id.csv \
  --nodes=./import/concept_nodes_en_simple_wiki_v0_from_json_without_emb.csv \
  --relationships=./import/text_edges_en_simple_wiki_v0_from_json.csv \
  --relationships=./import/triple_edges_en_simple_wiki_v0_from_json_without_emb_full_concept_with_numeric_id.csv \
  --relationships=./import/concept_edges_en_simple_wiki_v0_from_json_without_emb.csv \
  --overwrite-destination \
  --multiline-fields=true \
  --id-type=string \
  --verbose \
  --skip-bad-relationships=true
```

*(Refer to the notebook or repository scripts for the Pes2o and CC import commands.)*

## Hosting the RAG API

Once the Neo4j server is running, you can host the ATLAS RAG API to perform retrieval.

```bash
python example/example_scripts/neo4j_kg/atlas_api_server_demo.py
```

Ensure you configure the `LargeKGConfig` in the script to point to your Neo4j instance (URI, username, password) and the correct FAISS indices.

## Usage Example

You can query the hosted RAG API using an OpenAI-compatible client (ref: `example/example_scripts/neo4j_kg/atlas_api_client_demo.py`).

```python
from openai import OpenAI

# Point the client at your hosted API
base_url = "http://0.0.0.0:10089/v1/"
client = OpenAI(api_key="EMPTY", base_url=base_url)

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that answers questions based on the knowledge graph.",
    },
    {
        "role": "user",
        "content": "Question: Who is Alex Mercer?",
    },
]

response = client.chat.completions.create(
    model="llama",
    messages=messages,
    max_tokens=2048,
    temperature=0.5,
    extra_body={
        "retriever_config": {
            "topN": 5,
            "number_of_source_nodes_per_ner": 1,
            "sampling_area": 10,
        }
    },
)

print(response.choices[0].message.content)
```
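If the API responds but retrieval comes back empty, it can be worth confirming that the Neo4j instance itself is reachable and populated. The sketch below uses the official `neo4j` Python driver; the URI, credentials, and database name shown are assumptions — use the values you set in `neo4j.conf` and `LargeKGConfig`.

```python
from neo4j import GraphDatabase

# Assumed connection details -- replace with the values from your neo4j.conf / LargeKGConfig
uri = "bolt://localhost:7687"
auth = ("neo4j", "your-password")
database = "wiki-csv-json-text"

driver = GraphDatabase.driver(uri, auth=auth)
try:
    with driver.session(database=database) as session:
        # Total node count is served from Neo4j's count store, so it is cheap even on billion-node graphs
        node_count = session.run("MATCH (n) RETURN count(n) AS c").single()["c"]
        labels = session.run("CALL db.labels() YIELD label RETURN collect(label) AS labels").single()["labels"]
    print(f"Connected. {node_count} nodes, labels: {labels}")
finally:
    driver.close()
```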