# Bilingual Term Extractor

A powerful AI-driven tool for extracting, normalizing, and standardizing bilingual terminology from parallel texts, designed for legal, technical, and professional documents.
## Features

- 🤖 AI-Powered Extraction: Intelligently identifies bilingual term pairs using advanced LLMs (GPT-4, Claude, DeepSeek, etc.)
- 🔍 Quality Control: Automatically evaluates term alignment quality and filters low-quality results
- 📝 Intelligent Normalization:
  - Chinese: Traditional/Simplified unification, structural markers (第XX条)
  - English: Singular/plural normalization, verb tense unification, structural markers (Article XX)
  - Japanese: Notation unification, okurigana standardization, structural markers
- 🎯 Deduplication & Standardization: Intelligently merges synonym variants and selects best translations
- ⚡ High Performance: Supports concurrent processing for large-scale documents
- 🌐 Multilingual Support: Chinese, English, Japanese, and more
## Requirements

- Python 3.8+
- OpenAI API key (or OpenAI-compatible API)
## Installation

```bash
# Clone the repository
git clone https://github.com/wang-h/bilingual_term_extractor.git
cd bilingual_term_extractor

# Install dependencies
pip install -r requirements.txt

# Configure environment variables
cp .env.example .env

# Edit .env and set your API keys:
# OPENAI_API_KEY=your-api-key-here
# OPENAI_BASE_URL=https://api.openai.com/v1   # Optional: for OpenAI-compatible APIs
# OPENAI_API_MODEL=gpt-4o-mini                # Optional: default model
```

## Quick Start

```python
import asyncio

from src.agents.bilingual_term_extract import BilingualTermExtractAgent
from src.agents.bilingual_term_quality_check import BilingualTermQualityCheckAgent
from src.agents.bilingual_term_normalization import TermNormalizationAgent
from src.agents.bilingual_term_standardization import BilingualTermStandardizationAgent


async def extract_terms():
    # Source text (Chinese)
    source_text = """
    第三条 劳动者享有平等就业和选择职业的权利、取得劳动报酬的权利...
    """

    # Target text (English)
    target_text = """
    Article 3: Workers shall have the right to employment on an equal basis...
    """

    # Stage 1: Extract terms
    extract_agent = BilingualTermExtractAgent(locale='zh')
    extracted = await extract_agent.run({
        'source_text': source_text,
        'target_text': target_text,
        'src_lang': 'zh',
        'tgt_lang': 'en'
    }, None)

    # Stage 2: Quality check
    quality_agent = BilingualTermQualityCheckAgent(locale='zh')
    filtered = await quality_agent.run({
        'terms': [t.__dict__ for t in extracted],
        'source_text': source_text,
        'target_text': target_text,
        'src_lang': 'zh',
        'tgt_lang': 'en'
    }, None)

    # Stage 3: Normalize
    norm_agent = TermNormalizationAgent(locale='zh')
    normalized = await norm_agent.run({
        'terms': [t.__dict__ for t in filtered],
        'src_lang': 'zh',
        'tgt_lang': 'en'
    }, None)

    # Stage 4: Standardize
    std_agent = BilingualTermStandardizationAgent(locale='zh')
    final_terms = await std_agent.execute({
        'terms': [t.__dict__ for t in normalized]
    }, None)
    return final_terms


# Run the pipeline
asyncio.run(extract_terms())
```

### Command Line

```bash
python term_extract.py test_data/sample_zh_en_100.json -o outputs --checkpoint outputs/checkpoint.json
```

## Processing Pipeline

```
Raw Bilingual Texts
        ↓
[Stage 1] Term Extraction (BilingualTermExtractAgent)
  ├─ AI identifies term pairs
  ├─ Extracts context information
  └─ Confidence scoring
        ↓
[Stage 2] Quality Check (BilingualTermQualityCheckAgent)
  ├─ Semantic consistency validation
  ├─ Term accuracy evaluation
  └─ Filters low-quality results
        ↓
[Stage 3] Term Normalization (TermNormalizationAgent)
  ├─ Format standardization
  │   ├─ Chinese: Traditional/Simplified, "第36条"→"第XX条"
  │   ├─ English: Singular/plural, "Article 36"→"Article XX"
  │   └─ Japanese: Notation, "第36条"→"第XX条"
  ├─ Tense unification (English)
  └─ Abbreviation standardization
        ↓
[Stage 4] Deduplication & Standardization (BilingualTermStandardizationAgent)
  ├─ Deduplicate by normalized forms
  ├─ Merge synonym variants
  └─ Select best translations
        ↓
Final Standardized Terminology
```
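The deduplication logic of Stage 4 can be sketched in plain Python. This is an illustrative approximation, not the agent's implementation; `deduplicate_terms` is a hypothetical helper that keeps the highest-scoring translation per source term and pools occurrence counts:

```python
from collections import defaultdict

def deduplicate_terms(terms):
    """Keep, for each source term, the best-scored translation; pool occurrence counts."""
    groups = defaultdict(list)
    for term in terms:
        groups[term["source_term"]].append(term)
    merged = []
    for variants in groups.values():
        # Pick the candidate with the highest combined score
        best = dict(max(variants, key=lambda t: t.get("combined_score", 0.0)))
        best["occurrence_count"] = sum(v.get("occurrence_count", 1) for v in variants)
        merged.append(best)
    return merged
```

The real agent additionally merges synonym variants by their normalized forms with LLM judgment; the sketch only groups exact source-term matches.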
## Normalization Examples

### Chinese
- Traditional/Simplified: 協議 → 协议
- Abbreviation: 有限公司 → 有限责任公司
- Structural Markers:
  - 第36条 → 第XX条
  - 第三十六条 → 第XX条
  - 第40条第一项 → 第XX条第XX项
  - 第二章 → 第XX章
  - (一) → (XX)

### English
- Singular/Plural: contracts → contract/contracts
- Verb Tense: terminated → terminate
- Structural Markers:
  - Article 36 → Article XX
  - Section 5 → Section XX
  - Chapter 3 → Chapter XX
  - Paragraph 2 → Paragraph XX

### Japanese
- Notation: けいやく → 契約
- Okurigana: following Cabinet Notification standards
- Structural Markers:
  - 第36条 → 第XX条
  - 第三十六条 → 第XX条
  - 第2章 → 第XX章
## Project Structure

```
bilingual_term_extractor/
├── src/
│   ├── agents/                  # Core Agent modules
│   │   ├── base.py              # Base Agent class
│   │   ├── bilingual_term_extract.py
│   │   ├── bilingual_term_quality_check.py
│   │   ├── bilingual_term_normalization.py
│   │   └── bilingual_term_standardization.py
│   ├── lib/                     # Utility libraries
│   │   └── llm_client.py        # LLM client
│   └── workflows/               # Workflows
│       └── bilingual_term_extract.py
├── examples/                    # Example scripts
│   ├── simple_extract.py        # Simple example
│   └── concurrent_bilingual_term_extract_v2.py  # Concurrent processing
├── outputs/                     # Output directory
├── requirements.txt
├── README.md                    # English documentation
├── README_zh.md                 # Chinese documentation
└── README_ja.md                 # Japanese documentation
```
## Supported LLM Providers

- OpenAI (GPT-4, GPT-4-turbo, GPT-3.5)
- Anthropic (Claude 3.5 Sonnet, Claude 3 Opus)
- Other OpenAI-compatible APIs (e.g., DeepSeek)
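Providers are switched through the OpenAI-compatible environment variables from the installation step. A minimal sketch of how such settings might be resolved; `resolve_llm_config` is an illustrative helper, not part of the package API:

```python
import os

def resolve_llm_config(env=None):
    """Read OpenAI-compatible settings, falling back to the .env.example defaults."""
    if env is None:
        env = os.environ
    return {
        "api_key": env.get("OPENAI_API_KEY", ""),
        "base_url": env.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        "model": env.get("OPENAI_API_MODEL", "gpt-4o-mini"),
    }
```

Pointing `OPENAI_BASE_URL` at another provider's endpoint is all that is needed, as long as that provider speaks the OpenAI API.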
## Configuration

Key parameters can be tuned in the corresponding agents:

```python
# Quality check batch size
batch_size = 10

# Maximum target terms per source term
max_targets_per_source = 3

# Scoring weights (combined_score blends extraction confidence and quality score)
confidence_weight = 0.4
quality_weight = 0.6
```

## Output Format

Example of standardized term output:
```json
{
  "source_term": "劳动报酬",
  "target_term": "remuneration for work",
  "original_source_term": "劳动报酬",
  "original_target_term": "remuneration for work",
  "category": "Legal Concept",
  "confidence": 0.95,
  "quality_score": 0.92,
  "combined_score": 0.93,
  "law": "Labor Law",
  "domain": "LaborLaw",
  "year": 1995,
  "occurrence_count": 3
}
```

## Concurrent Processing

Use `concurrent_bilingual_term_extract_v2.py` for large-scale document processing:
```bash
python examples/concurrent_bilingual_term_extract_v2.py \
    --input data/parallel_texts.json \
    --output outputs/ \
    --max-workers 5
```

## Custom Normalization Rules

Customize normalization behavior by modifying the prompt templates in the agents:
```python
# Add custom rules in TermNormalizationAgent
custom_rules = """
7. **Custom Rules**: Your domain-specific rules
   - Example: specialized terminology handling
"""
```

## Contributing

Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
## License

This project is licensed under the MIT License; see the LICENSE file for details.
## Links

- Project Homepage: https://github.com/wang-h/bilingual_term_extractor
- Issue Tracker: https://github.com/wang-h/bilingual_term_extractor/issues
## Acknowledgments

- OpenAI for the GPT models
- Anthropic for the Claude models
- All contributors
**Note**: Using this tool requires valid LLM API keys. Please ensure compliance with the relevant terms of service.