Multilingual Support
AutoSchemaKG provides built-in support for processing documents in multiple languages, with the ability to extend to additional languages through custom prompts.
Built-in Language Support
AutoSchemaKG comes with pre-configured prompts for the following languages:
English (
en)Simplified Chinese (
zh-CN) - Mainland ChinaTraditional Chinese (
zh-HK) - Hong Kong
These languages have complete support for:
Triple extraction (entity relations, event entities, event relations)
Concept generation (events, entities, relations)
Using Built-in Languages
1. Prepare Your Data
Each document in your corpus must include language metadata:
[
{
"id": "1",
"text": "The quick brown fox jumps over the lazy dog.",
"metadata": {
"lang": "en"
}
},
{
"id": "2",
"text": "话说天下大势,分久必合,合久必分。",
"metadata": {
"lang": "zh-CN"
}
},
{
"id": "3",
"text": "話說天下大勢,分久必合,合久必分。",
"metadata": {
"lang": "zh-HK"
}
}
]
Required Fields:
id: Unique document identifiertext: Document content in the specified languagemetadata.lang: Language code (en,zh-CN, orzh-HK)
Note: If
metadata.langis not specified, the system defaults to English (en).
2. Run Knowledge Graph Extraction
from atlas_rag.kg_construction.triple_extraction import KnowledgeGraphExtractor
from atlas_rag.kg_construction.triple_config import ProcessingConfig
from atlas_rag.llm_generator import LLMGenerator
from openai import OpenAI
# Initialize LLM client
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="EMPTY"
)
model_name = "Qwen/Qwen2.5-7B-Instruct"
triple_generator = LLMGenerator(client, model_name=model_name)
# Configure extraction
kg_extraction_config = ProcessingConfig(
model_path=model_name,
data_directory="example_data/multilingual_data",
filename_pattern="RomanceOfTheThreeKingdom",
batch_size_triple=16,
batch_size_concept=64,
output_directory="generated/multilingual_output",
max_new_tokens=8192
)
# Create extractor
kg_extractor = KnowledgeGraphExtractor(
model=triple_generator,
config=kg_extraction_config
)
# Extract triples (automatically uses correct language from metadata)
kg_extractor.run_extraction()
# Convert to CSV
kg_extractor.convert_json_to_csv()
# Generate concepts - specify language explicitly
kg_extractor.generate_concept_csv_temp(language='zh-CN') # For Simplified Chinese
# kg_extractor.generate_concept_csv_temp(language='zh-HK') # For Traditional Chinese
# kg_extractor.generate_concept_csv_temp(language='en') # For English
3. How Language Detection Works
Automatic Triple Extraction:
The system reads
metadata.langfrom each documentAutomatically selects the corresponding prompt from
TRIPLE_INSTRUCTIONSApplies language-specific extraction rules
Manual Concept Generation:
You must explicitly specify the language when generating concepts
Use the same language code as in your document metadata
Adding Custom Language Support
To add support for additional languages (e.g., Japanese, Korean, French), you need to create custom prompt files.
Step 1: Create Custom Triple Extraction Prompts
Create a JSON file with prompts for your target language(s):
custom_prompts/multilingual_triple_prompt.json:
{
"ja": {
"system": "あなたは常に有効なJSON配列形式で応答する役立つアシスタントです",
"entity_relation": "与えられたテキストから重要なエンティティとそれらの関係を抽出し、簡潔に要約してください。関係は、先頭エンティティと末尾エンティティの情報を繰り返すことなく、エンティティ間の接続を簡潔に捉える必要があります。\n\n次のJSON形式で厳密に出力してください:\n[\n {\n \"Head\": \"{名詞}\",\n \"Relation\": \"{動詞}\",\n \"Tail\": \"{名詞}\"\n }...\n]\n\n以下はテキストです:",
"event_entity": "...",
"event_relation": "..."
},
"ko": {
"system": "당신은 항상 유효한 JSON 배열 형식으로 응답하는 유용한 도우미입니다",
"entity_relation": "주어진 텍스트에서 중요한 개체와 그들 간의 관계를 추출하고 간결하게 요약하세요...",
"event_entity": "...",
"event_relation": "..."
}
}
Step 2: Create Custom Concept Generation Prompts
Create prompts for concept abstraction in your target language:
custom_prompts/multilingual_concept_prompt.json:
{
"ja": {
"event": "イベントを提供します。このイベントの抽象的なイベントとして、1〜2語を含むいくつかのフレーズを提供する必要があります...",
"entity": "エンティティを提供します。このエンティティの抽象的なエンティティとして、1〜2語を含むいくつかのフレーズを提供する必要があります...",
"relation": "関係を提供します。この関係の抽象的な関係として、1〜2語を含むいくつかのフレーズを提供する必要があります..."
},
"ko": {
"event": "이벤트를 제공합니다. 이 이벤트의 추상적 이벤트에 대해 1-2단어를 포함하는 여러 구문을 제공해야 합니다...",
"entity": "엔티티를 제공합니다. 이 엔티티의 추상적 엔티티에 대해 1-2단어를 포함하는 여러 구문을 제공해야 합니다...",
"relation": "관계를 제공합니다. 이 관계의 추상적 관계에 대해 1-2단어를 포함하는 여러 구문을 제공해야 합니다..."
}
}
Step 3: Configure with Custom Prompts
from atlas_rag.kg_construction.triple_extraction import KnowledgeGraphExtractor
from atlas_rag.kg_construction.triple_config import ProcessingConfig
from atlas_rag.llm_generator import LLMGenerator
# Configure with custom prompts
kg_extraction_config = ProcessingConfig(
model_path="Qwen/Qwen2.5-7B-Instruct",
data_directory="example_data/japanese_data",
filename_pattern="japanese_corpus",
batch_size_triple=16,
batch_size_concept=64,
output_directory="generated/japanese_output",
# Specify custom prompt files
triple_extraction_prompt_path="custom_prompts/multilingual_triple_prompt.json",
triple_extraction_schema_path="custom_prompts/custom_schema.json", # Optional
max_new_tokens=8192
)
kg_extractor = KnowledgeGraphExtractor(
model=triple_generator,
config=kg_extraction_config
)
# Run extraction with custom prompts
kg_extractor.run_extraction()
kg_extractor.convert_json_to_csv()
# Generate concepts for Japanese
kg_extractor.generate_concept_csv_temp(language='ja')
Step 4: Update Your Data
Make sure your corpus includes the correct language codes:
[
{
"id": "1",
"text": "吾輩は猫である。名前はまだ無い。",
"metadata": {
"lang": "ja"
}
},
{
"id": "2",
"text": "나는 고양이로소이다. 이름은 아직 없다.",
"metadata": {
"lang": "ko"
}
}
]
Prompt Structure Reference
When creating custom language prompts, follow this structure:
Triple Extraction Prompts
{
"<lang_code>": {
"system": "System message for the LLM in target language",
"entity_relation": "Instructions for extracting entity-relation-entity triples",
"event_entity": "Instructions for extracting event-entity relationships",
"event_relation": "Instructions for extracting event-event relationships"
}
}
Concept Generation Prompts
{
"<lang_code>": {
"event": "Instructions for generating abstract concepts from events",
"entity": "Instructions for generating abstract concepts from entities",
"relation": "Instructions for generating abstract concepts from relations"
}
}
Example Files
Sample multilingual data:
example/example_data/multilingual_data/Built-in prompts:
atlas_rag/llm_generator/prompt/triple_extraction_prompt.pyComplete tutorial:
example/multilingual_processing.md
Tips for Custom Languages
Prompt Quality: High-quality, native-language prompts yield better extraction results
Test Incrementally: Start with a small dataset to verify your prompts work correctly
JSON Format: Ensure your prompts explicitly request JSON output format
Examples: Include few-shot examples in your target language when possible
Language Models: Use multilingual or language-specific models for best results (e.g., Qwen for Chinese, GPT-4 for general multilingual support)