Hosting Existing Billion-Scale Knowledge Graphs

This guide explains how to host and use our pre-constructed billion-scale knowledge graphs (Wiki, Pes2o, Common Crawl) using Neo4j Community Edition.

Option 1: Quick Start (Pre-built Server)

The easiest way to get started is to download the pre-configured Neo4j server with the data already imported.

  1. Download: Get the zipped server from our Huggingface Dataset.

    You can download the specific dataset you need:

    Dataset         File Name                Size       Description
    -------         ---------                ----       -----------
    Common Crawl    neo4j-server-cc.zip      ~213 GB    Large-scale web crawl data.
    Wiki            neo4j-server-wiki.zip    ~74.1 GB   Wikipedia-based knowledge graph.
    Pes2o           neo4j-server-pes2o.zip   ~53.2 GB   Academic papers dataset.

    Download via CLI:

    # Install the Hugging Face Hub client, which provides the hf command
    pip install huggingface_hub

    # Download a specific file (e.g., Wiki)
    hf download gzone0111/AutoSchemaKG 'ATLAS Neo4j Server Zip/neo4j-server-wiki.zip' --local-dir . --repo-type dataset
    
  2. Unzip: Extract the downloaded file.

  3. Run: Start the server using the neo4j launcher script in the bin/ directory of the unzipped folder.

# Replace {dataset-name} with wiki, pes2o, or cc
./neo4j-server-{dataset-name}/bin/neo4j start
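
To confirm the server came up, check its status or open a Cypher shell against it. The Bolt address below assumes the default port, and the password is whatever you configured for the neo4j user:

# Check the process status
./neo4j-server-{dataset-name}/bin/neo4j status

# Or connect interactively
./neo4j-server-{dataset-name}/bin/cypher-shell -a bolt://localhost:7687 -u neo4j -p <your-password>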

Option 2: Build from Source

If you prefer to build the database yourself from the raw CSV files, follow these steps.

1. Setup Neo4j

We provide scripts to download Neo4j Community Edition, install required plugins (APOC, GDS), and configure the environment.

cd neo4j_scripts
sh get_neo4j_cc.sh    # For Common Crawl
sh get_neo4j_pes2o.sh # For Pes2o
sh get_neo4j_wiki.sh  # For Wiki
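
If you prefer to set things up by hand, the scripts roughly boil down to the following. The Neo4j version and target directory name are illustrative; GDS is a separate download from the Neo4j Graph Data Science releases page:

# Download and unpack Neo4j Community Edition (version is illustrative)
curl -O https://dist.neo4j.org/neo4j-community-5.26.0-unix.tar.gz
tar -xzf neo4j-community-5.26.0-unix.tar.gz
mv neo4j-community-5.26.0 neo4j-server-wiki

# APOC Core ships in the labs/ folder of the distribution; copy it into plugins/ to enable it
cp neo4j-server-wiki/labs/apoc-*-core.jar neo4j-server-wiki/plugins/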

Configuration:

  • Copy AutoschemaKG/neo4j_scripts/neo4j.conf to neo4j-server-{dataset}/conf/neo4j.conf.

  • Update dbms.default_database to your desired dataset name (e.g., wiki-csv-json-text).

  • Configure ports (Bolt, HTTP, HTTPS) to avoid conflicts if running multiple servers.
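
For example, a Wiki server configured to avoid the default ports might contain the following lines. The setting names follow Neo4j 4.x, matching dbms.default_database above (Neo4j 5 renames these to initial.dbms.default_database, server.bolt.listen_address, and so on), and the port numbers are illustrative:

# Serve the imported Wiki database by default
dbms.default_database=wiki-csv-json-text

# Shift ports so several servers can coexist on one machine
dbms.connector.bolt.listen_address=:7688
dbms.connector.http.listen_address=:7475
dbms.connector.https.listen_address=:7476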

2. Prepare Data

  1. Download Data: Download the CSV dumps from our Huggingface Dataset. You need to download all the zip files from the ATLAS Neo4j Dump folder.

    Download via CLI:

    hf download gzone0111/AutoSchemaKG --include "ATLAS Neo4j Dump/*" --local-dir . --repo-type dataset
    
  2. Decompress: Run the decompress_csv_files.sh script to decompress all zip files in parallel into a directory named decompressed, then move the resulting CSV files to the ./import directory of your Neo4j server.

    Storage Requirements: Ensure you have sufficient disk space. Approximate sizes after import and decompression:

    Directory               Size
    ---------               ----
    ./neo4j-server-wiki     342 GB
    ./neo4j-server-cc       907 GB
    ./neo4j-server-pes2o    249 GB
    ./import (Raw CSVs)     2.3 TB

  3. Add Numeric IDs: (Optional if using the provided processed CSVs) If you are building from raw extraction output, you may need to add numeric IDs for vector indexing; see the sketch below.
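
The raw CSV schema depends on your extraction run, but the transformation itself is just appending a monotonically increasing integer column for the FAISS index to key on. A minimal streaming sketch in Python follows; the numeric_id column name and the file names are assumptions, so check the repository scripts for the exact convention:

import csv

def add_numeric_ids(src_path, dst_path, id_column="numeric_id"):
    """Stream a large CSV and append an incrementing integer ID column."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)  # handles the quoted multiline fields in these dumps
        writer = csv.writer(dst)
        writer.writerow(next(reader) + [id_column])  # extend the header row
        for i, row in enumerate(reader):
            writer.writerow(row + [str(i)])

add_numeric_ids(
    "triple_nodes_en_simple_wiki_v0_from_json_without_emb.csv",
    "triple_nodes_en_simple_wiki_v0_from_json_without_emb_with_numeric_id.csv",
)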

3. Import Data

Use neo4j-admin import to load the CSVs. This is much faster than Cypher LOAD CSV for large datasets.

Example: Importing Wiki Graph

./neo4j-server-wiki/bin/neo4j-admin database import full wiki-csv-json-text \
    --nodes=./import/text_nodes_en_simple_wiki_v0_from_json_with_numeric_id.csv \
    --nodes=./import/triple_nodes_en_simple_wiki_v0_from_json_without_emb_with_numeric_id.csv \
    --nodes=./import/concept_nodes_en_simple_wiki_v0_from_json_without_emb.csv \
    --relationships=./import/text_edges_en_simple_wiki_v0_from_json.csv \
    --relationships=./import/triple_edges_en_simple_wiki_v0_from_json_without_emb_full_concept_with_numeric_id.csv \
    --relationships=./import/concept_edges_en_simple_wiki_v0_from_json_without_emb.csv \
    --overwrite-destination \
    --multiline-fields=true \
    --id-type=string \
    --verbose \
    --skip-bad-relationships=true

(Refer to the notebook or repository scripts for Pes2o and CC import commands)
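
Once the import completes, start the server and sanity-check the graph before moving on; the node count should be non-zero if the import succeeded:

./neo4j-server-wiki/bin/neo4j start

# Adjust credentials to your setup
echo "MATCH (n) RETURN count(n);" | ./neo4j-server-wiki/bin/cypher-shell -u neo4j -p <your-password> -d wiki-csv-json-text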

Hosting the RAG API

Once the Neo4j server is running, you can host the ATLAS RAG API to perform retrieval.

python example/example_scripts/neo4j_kg/atlas_api_server_demo.py

Ensure you configure the LargeKGConfig in the script to point to your Neo4j instance (URI, username, password) and the correct FAISS indices.
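
The exact fields of LargeKGConfig depend on the atlas_rag version you have installed, so treat the snippet below as a shape rather than the API: the import path, field names, and index paths are all illustrative, and the demo script is the authoritative reference.

# All names below are illustrative; see atlas_api_server_demo.py for the real API.
from atlas_rag import LargeKGConfig  # hypothetical import path

kg_config = LargeKGConfig(
    neo4j_uri="bolt://localhost:7687",              # your Neo4j Bolt endpoint
    neo4j_user="neo4j",
    neo4j_password="<your-password>",
    faiss_node_index="./indices/wiki_nodes.index",  # FAISS index over node embeddings
    faiss_text_index="./indices/wiki_texts.index",  # FAISS index over passage embeddings
)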

Usage Example

You can query the hosted RAG API using an OpenAI-compatible client. (ref: example/example_scripts/neo4j_kg/atlas_api_client_demo.py)

from openai import OpenAI

# Point to your hosted API
base_url = "http://0.0.0.0:10089/v1/"
client = OpenAI(api_key="EMPTY", base_url=base_url)

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that answers questions based on the knowledge graph.",
    },
    {
        "role": "user",
        "content": "Question: Who is Alex Mercer?",
    },
]

response = client.chat.completions.create(
    model="llama",
    messages=messages,
    max_tokens=2048,
    temperature=0.5,
    # Retrieval parameters are forwarded to the server via the OpenAI-compatible extra_body field
    extra_body={
        "retriever_config": {
            "topN": 5,
            "number_of_source_nodes_per_ner": 1,
            "sampling_area": 10,
        }
    },
)

print(response.choices[0].message.content)