This project implements a Python-based NLP pipeline that automatically extracts structured metadata from Large Language Model (LLM) research papers and populates the ORKG comparison "Generative AI Model Landscape".
The pipeline fetches papers from ArXiv, parses their PDFs, extracts key properties via an LLM (KISSKI Chat AI API), normalises and merges the results, and uploads them to ORKG — all in a single command.
It also supports HPC cluster deployment (GPU-based transformers on GWDG Grete) for large-scale batch extraction without API costs.
- Automated paper fetching from ArXiv (by ID or search query)
- PDF parsing with fallback mechanisms (pdfplumber)
- LLM-based extraction via KISSKI Chat AI API (OpenAI-compatible)
- HPC / GPU extraction using Hugging Face Transformers on GWDG Grete
- Extraction normalisation: canonical date, organisation, and parameter formats
- Model variant merging: groups size variants (e.g. 7B/13B/70B) into one entry
- Primary contribution selection: filters auxiliary artefacts, keeps main models
- Structured mapping to ORKG template R609825
- Automatic ORKG upload with duplicate detection
- Strict evaluation framework: match-based F1 + BERTScore for semantic fields
- Batch processing for HPC environments
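To illustrate the variant-merging feature above, here is a minimal sketch of grouping size variants under one base entry. The function name, regex, and output shape are illustrative assumptions; the real logic lives in src/model_variant_merger.py.

```python
import re
from collections import defaultdict

# Matches a trailing parameter-size suffix such as "7B", "13B", or "70B".
SIZE_RE = re.compile(r"\s*(\d+(?:\.\d+)?)\s*[BM]\b", re.IGNORECASE)

def merge_variants(model_names):
    """Group extracted model names by base name, collecting size variants."""
    groups = defaultdict(list)
    for name in model_names:
        match = SIZE_RE.search(name)
        if match:
            base = SIZE_RE.sub("", name).strip()
            groups[base].append(match.group(0).strip())
        else:
            groups[name.strip()]  # model without an explicit size variant
    return dict(groups)

print(merge_variants(["LLaMA 7B", "LLaMA 13B", "LLaMA 70B", "Alpaca"]))
```

With this sketch, the three LLaMA sizes collapse into a single `LLaMA` entry carrying the variant list, which is what the merging step needs before mapping to one ORKG contribution.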
```
Bachelor-Arbeit-NLP/
├── .github/
│   └── workflows/
│       └── ci.yml                        # GitHub Actions CI
├── grete/                                # HPC cluster deployment
│   ├── extraction/
│   │   ├── grete_extract_paper.py
│   │   ├── grete_extract_from_url.py
│   │   └── grete_extract_paper_distilgpt2.py
│   ├── jobs/
│   │   ├── grete_extract_job.sh          # SLURM single job
│   │   ├── grete_extract_batch.sh        # SLURM batch array
│   │   └── grete_extract_url_job.sh
│   └── README.md
├── src/                                  # Core pipeline
│   ├── __init__.py
│   ├── pipeline.py                       # Main orchestration
│   ├── llm_extractor.py                  # KISSKI API extraction
│   ├── llm_extractor_transformers.py     # GPU-based extraction (Grete)
│   ├── pdf_parser.py                     # PDF text extraction
│   ├── paper_fetcher.py                  # ArXiv paper fetching
│   ├── template_mapper.py                # ORKG template mapping
│   ├── orkg_client.py                    # ORKG API wrapper
│   ├── orkg_manager.py                   # ORKG paper/contribution management
│   ├── extraction_normalizer.py          # Canonical format normalisation
│   ├── model_variant_merger.py           # Size-variant merging
│   ├── model_contribution_selector.py    # Primary model selection
│   ├── baseline_filter.py                # Baseline/comparison model filter
│   └── comparison_updater.py             # Legacy comparison updater
├── scripts/
│   ├── evaluation/                       # Evaluation scripts and README
│   │   ├── evaluate_extraction_strict.py # Strict evaluator (thesis metrics)
│   │   ├── evaluate_extraction.py        # Relaxed evaluator
│   │   ├── convert_gold_standard.py      # ORKG CSV → gold JSON converter
│   │   └── README.md
│   ├── batch_extract_all_papers.py
│   ├── build_results_table.py
│   └── import_extracted_to_model_folders.py
├── finetuning/                           # Optional LoRA fine-tuning workflow
├── tests/                                # Unit tests
│   ├── __init__.py
│   ├── test_pipeline.py
│   ├── test_llm_extractor.py
│   ├── test_orkg_client.py
│   ├── test_orkg_append.py
│   └── test_pdf_parser.py
├── data/
│   ├── papers/                           # Downloaded PDFs (gitignored)
│   ├── extracted/                        # Extraction results (gitignored)
│   ├── gold_standard/                    # Gold-standard evaluation data
│   └── logs/                             # Pipeline logs (gitignored)
├── results/                              # Aggregated evaluation tables (CSV/JSON)
├── config/
│   └── config.yaml                       # Pipeline configuration
├── .env.example                          # Environment variables template
├── .gitignore
├── LICENSE
├── pyproject.toml                        # Python packaging and tool config
├── requirements.txt                      # Runtime dependencies
└── requirements-dev.txt                  # Development dependencies
```
- Python 3.9 or higher
- A KISSKI API key (provided by GWDG for thesis work)
- An ORKG account at sandbox.orkg.org or orkg.org
1. Clone the repository:

   ```bash
   git clone https://github.com/sciknoworg/llm-atlas.git
   cd llm-atlas
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   # Windows
   venv\Scripts\activate
   # Linux / macOS
   source venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Copy the environment template and fill in your credentials:

   ```bash
   # Windows (PowerShell)
   Copy-Item .env.example .env
   # Linux / macOS
   cp .env.example .env
   ```
```
# ORKG credentials
ORKG_EMAIL=your.email@example.com
ORKG_PASSWORD=your_password_here
ORKG_HOST=sandbox  # Options: sandbox, incubating, production

# KISSKI Chat AI API (SAIA platform, OpenAI-compatible)
KISSKI_API_KEY=your_kisski_api_key_here
KISSKI_BASE_URL=https://chat-ai.academiccloud.de/v1
```

For Grete HPC users: `KISSKI_API_KEY` is not required when running GPU-based extraction. See grete/README.md.
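As a minimal sketch of how these variables could be read at startup (stdlib only; `load_settings` is a hypothetical helper, and the actual pipeline may load the `.env` file with python-dotenv instead):

```python
import os

def load_settings(env=os.environ):
    """Read the documented environment variables, with defaults where one exists."""
    return {
        "orkg_email": env.get("ORKG_EMAIL"),
        "orkg_password": env.get("ORKG_PASSWORD"),
        "orkg_host": env.get("ORKG_HOST", "sandbox"),      # safest default
        "kisski_api_key": env.get("KISSKI_API_KEY"),       # optional on Grete
        "kisski_base_url": env.get("KISSKI_BASE_URL",
                                   "https://chat-ai.academiccloud.de/v1"),
    }

settings = load_settings()
if settings["kisski_api_key"] is None:
    print("KISSKI_API_KEY not set; expecting GPU-based extraction (Grete)")
```

Defaulting `orkg_host` to `sandbox` mirrors the template above, so a missing variable cannot accidentally write to the production graph.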
Key settings:
| Section | Key | Description |
|---|---|---|
| `orkg` | `host` | `sandbox` (testing) or `production` |
| `orkg` | `template_id` | ORKG LLM template (R609825) |
| `orkg` | `comparison_id` | Target comparison (R2147679) |
| `kisski` | `model` | LLM model name (e.g. `qwen3-235b-a22b`) |
| `kisski` | `temperature` | `0.0` for deterministic extraction |
| `extraction` | `max_chunk_size` | Max characters per PDF chunk (default: 6000) |
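A hypothetical sketch of validating these settings after config.yaml has been parsed into a dict (e.g. with `yaml.safe_load`); the key names follow the table above, but the specific checks and defaults are assumptions, not the pipeline's actual validation logic:

```python
def validate_config(cfg):
    """Check the parsed config dict and fill in documented defaults."""
    orkg = cfg.get("orkg", {})
    kisski = cfg.get("kisski", {})
    extraction = cfg.setdefault("extraction", {})

    if orkg.get("host") not in {"sandbox", "incubating", "production"}:
        raise ValueError("orkg.host must be sandbox, incubating, or production")
    if not 0.0 <= kisski.get("temperature", 0.0) <= 2.0:
        raise ValueError("kisski.temperature out of range")

    # Documented default chunk size for PDF text passed to the LLM.
    extraction.setdefault("max_chunk_size", 6000)
    return cfg

cfg = validate_config({"orkg": {"host": "sandbox"}, "kisski": {"temperature": 0.0}})
```

Validating once at startup keeps misconfiguration errors (e.g. a typo in the host name) from surfacing mid-extraction.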
```bash
# Extract a single paper by ArXiv ID
python -m src.pipeline --arxiv-id 2302.13971

# Extract from a direct PDF URL
python -m src.pipeline --pdf-url https://example.org/paper.pdf --paper-title "Paper Title"

# Re-process a previously saved extraction JSON
python -m src.pipeline --json-file data/extracted/2302.13971_20260101_120000.json

# Extract without updating the ORKG comparison
python -m src.pipeline --arxiv-id 2302.13971 --no-update

# Extract without running the evaluation
python -m src.pipeline --arxiv-id 2302.13971 --no-evaluate

# Show pipeline status
python -m src.pipeline --status

# Run the pipeline self-test
python -m src.pipeline --test

# Search ArXiv and process matching papers
python -m src.pipeline --search "large language model" --max-results 5

# Extract and evaluate against a gold-standard file
python -m src.pipeline --arxiv-id 2302.13971 --gold data/gold_standard/R1364660.json
```

The pipeline can also be used from Python:

```python
from src.pipeline import ExtractionPipeline

pipeline = ExtractionPipeline()
result = pipeline.process_paper("2302.13971")
print(result["extraction_data"])
```

The evaluation framework measures extraction quality against a gold standard derived from the ORKG comparison R1364660.
```bash
# PowerShell (recommended): one line
python scripts/evaluation/evaluate_extraction_strict.py --gold data/gold_standard/R1364660.json --prediction data/extracted/your_extraction.json

# Linux / macOS (multiline)
python scripts/evaluation/evaluate_extraction_strict.py \
    --gold data/gold_standard/R1364660.json \
    --prediction data/extracted/your_extraction.json
```

To write the report to a file, add `--output`:

```bash
# PowerShell (recommended): one line
python scripts/evaluation/evaluate_extraction_strict.py --gold data/gold_standard/R1364660.json --prediction data/extracted/your_extraction.json --output results/evaluation_report.json

# Linux / macOS (multiline)
python scripts/evaluation/evaluate_extraction_strict.py \
    --gold data/gold_standard/R1364660.json \
    --prediction data/extracted/your_extraction.json \
    --output results/evaluation_report.json
```

Structured fields (model name, parameters, date, organisation, etc.) are evaluated with match-based F1:
| Metric | Description |
|---|---|
| Precision | Of all extracted values, how many are correct? |
| Recall | Of all gold values, how many were found? |
| F1-Score | Harmonic mean of precision and recall |
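The three metrics above can be sketched as follows, treating each field's extracted and gold values as sets. This is a simplified illustration; the actual evaluator applies field-specific matching rules before counting matches.

```python
def match_based_f1(predicted, gold):
    """Precision, recall, and F1 over exact matches between value sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Two of three extracted values are correct; gold lists four values.
print(match_based_f1({"7B", "13B", "3T tokens"}, {"7B", "13B", "70B", "33B"}))
```

Because F1 is the harmonic mean, extracting many spurious values (low precision) or missing many gold values (low recall) both pull the score down sharply.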
Semantic fields (innovation, extension, application, research_problem, pretraining_corpus) are evaluated with BERTScore (token-level contextual similarity using RoBERTa-large embeddings), which captures paraphrases that exact matching would penalise.
Field-specific matching rules (date year tolerance, organisation aliases, parameter set F1, etc.) are documented in scripts/evaluation/README.md.
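As one example of such a rule, date matching on the year could look like this minimal sketch. The function name, regex, and tolerance behaviour are assumptions for illustration; scripts/evaluation/README.md documents the authoritative rules.

```python
import re

def dates_match(pred, gold, year_tolerance=0):
    """Match two release dates on their four-digit year, within a tolerance."""
    years = []
    for value in (pred, gold):
        m = re.search(r"\b(19|20)\d{2}\b", str(value))
        if not m:
            return False  # no recognisable year in this value
        years.append(int(m.group(0)))
    return abs(years[0] - years[1]) <= year_tolerance

print(dates_match("Feb 2023", "2023-02-27"))                   # same year
print(dates_match("2022-12", "2023-01", year_tolerance=1))     # adjacent years
```

Comparing on the year only keeps the evaluation robust to the varied date formats ("Feb 2023", "2023-02-27", "2023") that the LLM and the gold standard may use.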
```bash
# Run all unit tests
pytest tests/ -v

# Run with coverage report
pytest tests/ -v --cov=src --cov-report=term-missing
```

For large-scale extraction without API limits, the pipeline can run on GWDG Grete using local GPU-based transformers.
```bash
# Submit a single extraction job
sbatch grete/jobs/grete_extract_job.sh 2302.13971

# Submit batch array job
sbatch grete/jobs/grete_extract_batch.sh
```

Full setup guide: grete/README.md and KISSKI_SETUP.md.
| Resource | Link |
|---|---|
| ORKG Platform | https://orkg.org/ |
| ORKG Sandbox | https://sandbox.orkg.org/ |
| ORKG Python Client | https://orkg.readthedocs.io/en/latest/ |
| LLM Template (R609825) | https://orkg.org/templates/R609825 |
| Target Comparison (R1364660) | https://orkg.org/comparisons/R1364660 |
| KISSKI Chat AI API | https://chat-ai.academiccloud.de |
| ArXiv API | https://info.arxiv.org/help/api/ |
| GWDG Grete HPC | https://info.gwdg.de/docs/doku.php?id=en:services:application_services:high_performance_computing:grete |
This project is licensed under the MIT License — see the LICENSE file for details.
- ORKG platform (orkg.org) for knowledge graph infrastructure
- GWDG for providing HPC resources (Grete cluster) and KISSKI API access
- SAIA platform for Chat AI API services