Hosting Existing Billion-Scale Knowledge Graphs
This guide explains how to host and use our pre-constructed billion-scale knowledge graphs (Wiki, Pes2o, Common Crawl) using Neo4j Community Edition.
Option 1: Quick Start (Pre-built Server)
The easiest way to get started is to download the pre-configured Neo4j server with the data already imported.
Download: Get the zipped server from our Huggingface Dataset.
You can download the specific dataset you need:
Dataset
File Name
Size
Description
Common Crawl
neo4j-server-cc.zip~213 GB
Large-scale web crawl data.
Wiki
neo4j-server-wiki.zip~74.1 GB
Wikipedia-based knowledge graph.
Pes2o
neo4j-server-pes2o.zip~53.2 GB
Academic papers dataset.
Download via CLI:
# Install huggingface-cli pip install huggingface_hub # Download specific file (e.g., Wiki) hf download gzone0111/AutoSchemaKG 'ATLAS Neo4j Server Zip/neo4j-server-wiki.zip' --local-dir . --repo-type dataset
Unzip: Extract the downloaded file.
Run: Start the server using the bash file in the unzipped folder.
# Replace {dataset-name} with wiki, pes2o, or cc
./neo4j-server-{dataset-name}/bin/neo4j start
Option 2: Build from Source
If you prefer to build the database yourself from the raw CSV files, follow these steps.
1. Setup Neo4j
We provide scripts to download Neo4j Community Edition, install required plugins (APOC, GDS), and configure the environment.
cd neo4j_scripts
sh get_neo4j_cc.sh # For Common Crawl
sh get_neo4j_pes2o.sh # For Pes2o
sh get_neo4j_wiki.sh # For Wiki
Configuration:
Copy
AutoschemaKG/neo4j_scripts/neo4j.conftoneo4j-server-{dataset}/conf/neo4j.conf.Update
dbms.default_databaseto your desired dataset name (e.g.,wiki-csv-json-text).Configure ports (Bolt, HTTP, HTTPS) to avoid conflicts if running multiple servers.
2. Prepare Data
Download Data: Download the CSV dumps from our Huggingface Dataset. You need to download all the zip files from the
ATLAS Neo4j Dumpfolder.Download via CLI:
hf download gzone0111/AutoSchemaKG --include "ATLAS Neo4j Dump/*" --local-dir . --repo-type dataset
Decompress: Run the
decompress_csv_files.shscript to decompress all zip files in parallel to adecompresseddirectory. Then move the files to the./importdirectory of your Neo4j server.Storage Requirements: Ensure you have sufficient disk space. Approximate sizes after import and decompression:
Directory
Size
./neo4j-server-wiki342 GB
./neo4j-server-cc907 GB
./neo4j-server-pes2o249 GB
./import(Raw CSVs)2.3 TB
Add Numeric IDs: (Optional if using provided processed CSVs) If building from raw extraction output, you may need to add numeric IDs for vector indexing.
3. Import Data
Use neo4j-admin import to load the CSVs. This is much faster than Cypher LOAD CSV for large datasets.
Example: Importing Wiki Graph
./neo4j-server-wiki/bin/neo4j-admin database import full wiki-csv-json-text \
--nodes=./import/text_nodes_en_simple_wiki_v0_from_json_with_numeric_id.csv \
./import/triple_nodes_en_simple_wiki_v0_from_json_without_emb_with_numeric_id.csv \
./import/concept_nodes_en_simple_wiki_v0_from_json_without_emb.csv \
--relationships=./import/text_edges_en_simple_wiki_v0_from_json.csv \
./import/triple_edges_en_simple_wiki_v0_from_json_without_emb_full_concept_with_numeric_id.csv \
./import/concept_edges_en_simple_wiki_v0_from_json_without_emb.csv \
--overwrite-destination \
--multiline-fields=true \
--id-type=string \
--verbose --skip-bad-relationships=true
(Refer to the notebook or repository scripts for Pes2o and CC import commands)
Hosting the RAG API
Once the Neo4j server is running, you can host the ATLAS RAG API to perform retrieval.
python example/example_scripts/neo4j_kg/atlas_api_server_demo.py
Ensure you configure the LargeKGConfig in the script to point to your Neo4j instance (URI, username, password) and the correct FAISS indices.
Usage Example
You can query the hosted RAG API using an OpenAI-compatible client. (ref: example/example_scripts/neo4j_kg/atlas_api_client_demo.py)
from openai import OpenAI
# Point to your hosted API
base_url = "http://0.0.0.0:10089/v1/"
client = OpenAI(api_key="EMPTY", base_url=base_url)
message = [
{
"role": "system",
"content": "You are a helpful assistant that answers questions based on the knowledge graph.",
},
{
"role": "user",
"content": "Question: Who is Alex Mercer?",
}
]
response = client.chat.completions.create(
model="llama",
messages=message,
max_tokens=2048,
temperature=0.5,
extra_body = {
"retriever_config": {
"topN": 5,
"number_of_source_nodes_per_ner": 1,
"sampling_area": 10
}
}
)
print(response.choices[0].message.content)