
An LLM-based Multi-Agent System for Bilingual Legal Term Extraction

English | 中文 | 日本語

A powerful AI-driven tool for extracting, normalizing, and standardizing bilingual terminology from parallel texts, designed for legal, technical, and professional documents.

✨ Key Features

  • 🤖 AI-Powered Extraction: Intelligently identifies bilingual term pairs using advanced LLMs (GPT-4, Claude, DeepSeek, etc.)
  • 🔍 Quality Control: Automatically evaluates term alignment quality and filters low-quality results
  • 📝 Intelligent Normalization:
    • Chinese: Traditional/Simplified unification, structural markers (第XX条)
    • English: Singular/plural normalization, verb tense unification, structural markers (Article XX)
    • Japanese: Notation unification, okurigana standardization, structural markers
  • 🎯 Deduplication & Standardization: Intelligently merges synonym variants and selects best translations
  • ⚡ High Performance: Supports concurrent processing for large-scale documents
  • 🌐 Multilingual Support: Chinese, English, Japanese, and more

📦 Installation

Requirements

  • Python 3.8+
  • OpenAI API key (or OpenAI-compatible API)

Quick Install

# Clone the repository
git clone https://github.com/wang-h/bilingual_term_extractor.git
cd bilingual_term_extractor

# Install dependencies
pip install -r requirements.txt

# Configure environment variables
cp .env.example .env
# Edit .env and set your API keys:
# OPENAI_API_KEY=your-api-key-here
# OPENAI_BASE_URL=https://api.openai.com/v1  # Optional: for OpenAI-compatible APIs
# OPENAI_API_MODEL=gpt-4o-mini  # Optional: default model

🚀 Quick Start

Basic Usage

import asyncio
from src.agents.bilingual_term_extract import BilingualTermExtractAgent
from src.agents.bilingual_term_quality_check import BilingualTermQualityCheckAgent
from src.agents.bilingual_term_normalization import TermNormalizationAgent
from src.agents.bilingual_term_standardization import BilingualTermStandardizationAgent

async def extract_terms():
    # Source text (Chinese)
    source_text = """
    第三条 劳动者享有平等就业和选择职业的权利、取得劳动报酬的权利...
    """
    
    # Target text (English)
    target_text = """
    Article 3: Workers shall have the right to employment on an equal basis...
    """
    
    # Stage 1: Extract terms
    extract_agent = BilingualTermExtractAgent(locale='zh')
    extracted = await extract_agent.run({
        'source_text': source_text,
        'target_text': target_text,
        'src_lang': 'zh',
        'tgt_lang': 'en'
    }, None)
    
    # Stage 2: Quality check
    quality_agent = BilingualTermQualityCheckAgent(locale='zh')
    filtered = await quality_agent.run({
        'terms': [t.__dict__ for t in extracted],
        'source_text': source_text,
        'target_text': target_text,
        'src_lang': 'zh',
        'tgt_lang': 'en'
    }, None)
    
    # Stage 3: Normalize
    norm_agent = TermNormalizationAgent(locale='zh')
    normalized = await norm_agent.run({
        'terms': [t.__dict__ for t in filtered],
        'src_lang': 'zh',
        'tgt_lang': 'en'
    }, None)
    
    # Stage 4: Standardize
    std_agent = BilingualTermStandardizationAgent(locale='zh')
    final_terms = await std_agent.execute({
        'terms': [t.__dict__ for t in normalized]
    }, None)
    
    return final_terms

# Run
asyncio.run(extract_terms())

Run Example

python term_extract.py test_data/sample_zh_en_100.json -o outputs --checkpoint outputs/checkpoint.json

📊 Processing Pipeline

Raw Bilingual Texts
    ↓
[Stage 1] Term Extraction (BilingualTermExtractAgent)
    ├─ AI identifies term pairs
    ├─ Extracts context information
    └─ Confidence scoring
    ↓
[Stage 2] Quality Check (BilingualTermQualityCheckAgent)
    ├─ Semantic consistency validation
    ├─ Term accuracy evaluation
    └─ Filters low-quality results
    ↓
[Stage 3] Term Normalization (TermNormalizationAgent)
    ├─ Format standardization
    │   ├─ Chinese: Traditional/Simplified, "第36条"→"第XX条"
    │   ├─ English: Singular/plural, "Article 36"→"Article XX"
    │   └─ Japanese: Notation, "第36条"→"第XX条"
    ├─ Tense unification (English)
    └─ Abbreviation standardization
    ↓
[Stage 4] Deduplication & Standardization (BilingualTermStandardizationAgent)
    ├─ Deduplicate by normalized forms
    ├─ Merge synonym variants
    └─ Select best translations
    ↓
Final Standardized Terminology

🎯 Normalization Rules

Chinese Normalization

  1. Traditional/Simplified: 協議 → 协议
  2. Abbreviation: 有限公司 → 有限责任公司
  3. Structural Markers:
    • 第36条 → 第XX条
    • 第三十六条 → 第XX条
    • 第40条第一项 → 第XX条第XX项
    • 第二章 → 第XX章
    • (一) → (XX)
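For illustration, the structural-marker rules above could be approximated with regular expressions. This is a hedged, rule-based sketch only; the project itself applies these rules through LLM prompts, so the actual behavior may differ:

```python
import re

# Digits plus common Chinese numerals (including fullwidth forms).
CN_NUM = "零一二三四五六七八九十百千0-9０-９"

def normalize_zh_markers(term: str) -> str:
    """Replace numeric structural references with XX placeholders."""
    term = re.sub(rf"第[{CN_NUM}]+条", "第XX条", term)
    term = re.sub(rf"第[{CN_NUM}]+项", "第XX项", term)
    term = re.sub(rf"第[{CN_NUM}]+章", "第XX章", term)
    term = re.sub(rf"[（(][{CN_NUM}]+[)）]", "(XX)", term)
    return term

print(normalize_zh_markers("第40条第一项"))  # 第XX条第XX项
print(normalize_zh_markers("(一)"))          # (XX)
```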

English Normalization

  1. Singular/Plural: contracts → contract/contracts
  2. Verb Tense: terminated → terminate
  3. Structural Markers:
    • Article 36 → Article XX
    • Section 5 → Section XX
    • Chapter 3 → Chapter XX
    • Paragraph 2 → Paragraph XX
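The English structural markers follow the same pattern and could likewise be sketched with a single regular expression (a simplified illustration; the project performs this step via LLM prompts):

```python
import re

def normalize_en_markers(term: str) -> str:
    """Replace numeric structural references (Article 36 etc.) with XX."""
    return re.sub(r"\b(Article|Section|Chapter|Paragraph)\s+\d+\b",
                  r"\1 XX", term)

print(normalize_en_markers("Article 36"))   # Article XX
print(normalize_en_markers("Paragraph 2"))  # Paragraph XX
```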

Japanese Normalization

  1. Notation: けいやく → 契約
  2. Okurigana: Standardized following the Japanese Cabinet notification (内閣告示) guidelines
  3. Structural Markers:
    • 第36条 → 第XX条
    • 第三十六条 → 第XX条
    • 第2章 → 第XX章

📁 Project Structure

bilingual_term_extractor/
├── src/
│   ├── agents/              # Core Agent modules
│   │   ├── base.py         # Base Agent class
│   │   ├── bilingual_term_extract.py
│   │   ├── bilingual_term_quality_check.py
│   │   ├── bilingual_term_normalization.py
│   │   └── bilingual_term_standardization.py
│   ├── lib/                 # Utility libraries
│   │   └── llm_client.py   # LLM client
│   └── workflows/           # Workflows
│       └── bilingual_term_extract.py
├── examples/                # Example scripts
│   ├── simple_extract.py   # Simple example
│   └── concurrent_bilingual_term_extract_v2.py  # Concurrent processing
├── outputs/                 # Output directory
├── requirements.txt
├── README.md               # English documentation
├── README_zh.md            # Chinese documentation
└── README_ja.md            # Japanese documentation

⚙️ Configuration

LLM Configuration

Supports the following LLM providers:

  • OpenAI (GPT-4, GPT-4-turbo, GPT-3.5)
  • Anthropic (Claude-3.5-sonnet, Claude-3-opus)
  • Other OpenAI-compatible APIs

Term Extraction Configuration

# Quality check batch size
batch_size = 10

# Maximum target terms per source term
max_targets_per_source = 3

# Scoring weights
confidence_weight = 0.4
quality_weight = 0.6
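A plausible reading of these weights, consistent with the sample output's `combined_score`, is a weighted average of the extraction confidence and the quality-check score. This is a sketch of the likely computation, not the project's actual scoring code:

```python
def combined_score(confidence: float, quality_score: float,
                   confidence_weight: float = 0.4,
                   quality_weight: float = 0.6) -> float:
    """Weighted average of extraction confidence and quality-check score."""
    return confidence_weight * confidence + quality_weight * quality_score

# 0.4 * 0.95 + 0.6 * 0.92 ≈ 0.93
print(round(combined_score(0.95, 0.92), 2))  # 0.93
```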

📊 Output Format

Example of standardized term output:

{
  "source_term": "劳动报酬",
  "target_term": "remuneration for work",
  "original_source_term": "劳动报酬",
  "original_target_term": "remuneration for work",
  "category": "Legal Concept",
  "confidence": 0.95,
  "quality_score": 0.92,
  "combined_score": 0.93,
  "law": "Labor Law",
  "domain": "LaborLaw",
  "year": 1995,
  "occurrence_count": 3
}
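Since the output is plain JSON, downstream filtering is straightforward. A minimal sketch using inline sample records (the second record and the 0.8 threshold are illustrative assumptions, not project defaults):

```python
# Keep only high-scoring standardized terms.
records = [
    {"source_term": "劳动报酬", "target_term": "remuneration for work",
     "combined_score": 0.93},
    {"source_term": "合同", "target_term": "contract",
     "combined_score": 0.55},
]
high_quality = [r for r in records if r["combined_score"] >= 0.8]
print(len(high_quality))  # 1
```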

🔧 Advanced Usage

Concurrent Batch Processing

Use concurrent_bilingual_term_extract_v2.py for large-scale document processing:

python examples/concurrent_bilingual_term_extract_v2.py \
    --input data/parallel_texts.json \
    --output outputs/ \
    --max-workers 5
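The `--max-workers` option suggests bounded concurrency; with asyncio, that is commonly done with a semaphore. A hypothetical sketch (the script's actual mechanism may differ, and `process_one` is a placeholder for the real pipeline call):

```python
import asyncio

async def process_all(documents, max_workers=5):
    sem = asyncio.Semaphore(max_workers)  # at most max_workers in flight

    async def process_one(doc):
        async with sem:
            await asyncio.sleep(0)  # placeholder for the extraction pipeline
            return f"processed:{doc}"

    # gather() preserves input order in its results.
    return await asyncio.gather(*(process_one(d) for d in documents))

results = asyncio.run(process_all(["doc1", "doc2", "doc3"]))
print(results)
```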

Custom Normalization Rules

Customize normalization behavior by modifying the prompt templates in the agent classes:

# Add custom rules in TermNormalizationAgent
custom_rules = """
7. **Custom Rules**: Your domain-specific rules
   - Example: Specialized terminology handling
"""

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License; see the LICENSE file for details.

📧 Contact

🙏 Acknowledgments

  • OpenAI for GPT models
  • Anthropic for Claude models
  • All contributors

Note: Using this tool requires valid LLM API keys. Please ensure compliance with relevant terms of service.
