Bachelor Thesis: Automated Knowledge Extraction for Large-Scale Generative AI Models Catalog

Overview

This project implements a Python-based NLP pipeline that automatically extracts structured metadata from Large Language Model (LLM) research papers and populates the ORKG comparison Generative AI Model Landscape.

The pipeline fetches papers from ArXiv, parses their PDFs, extracts key properties via an LLM (KISSKI Chat AI API), normalises and merges the results, and uploads them to ORKG — all in a single command.

It also supports HPC cluster deployment (GPU-based transformers on GWDG Grete) for large-scale batch extraction without API costs.

Features

Automated paper fetching from ArXiv (by ID or search query)
PDF parsing with fallback mechanisms (pdfplumber)
LLM-based extraction via KISSKI Chat AI API (OpenAI-compatible)
HPC / GPU extraction using Hugging Face Transformers on GWDG Grete
Extraction normalisation canonical date, organisation, and parameter formats
Model variant merging groups size variants (e.g. 7B/13B/70B) into one entry
Primary contribution selection filters auxiliary artefacts, keeps main models
Structured mapping to ORKG template R609825
Automatic ORKG upload with duplicate detection
Strict evaluation framework match-based F1 + BERTScore for semantic fields
Batch processing for HPC environments

Project Structure

Bachelor-Arbeit-NLP/
├── .github/
│   └── workflows/
│       └── ci.yml                          # GitHub Actions CI
├── grete/                                  # HPC cluster deployment
│   ├── extraction/
│   │   ├── grete_extract_paper.py
│   │   ├── grete_extract_from_url.py
│   │   └── grete_extract_paper_distilgpt2.py
│   ├── jobs/
│   │   ├── grete_extract_job.sh            # SLURM single job
│   │   ├── grete_extract_batch.sh          # SLURM batch array
│   │   └── grete_extract_url_job.sh
│   └── README.md
├── src/                                    # Core pipeline
│   ├── __init__.py
│   ├── pipeline.py                         # Main orchestration
│   ├── llm_extractor.py                    # KISSKI API extraction
│   ├── llm_extractor_transformers.py       # GPU-based extraction (Grete)
│   ├── pdf_parser.py                       # PDF text extraction
│   ├── paper_fetcher.py                    # ArXiv paper fetching
│   ├── template_mapper.py                  # ORKG template mapping
│   ├── orkg_client.py                      # ORKG API wrapper
│   ├── orkg_manager.py                     # ORKG paper/contribution management
│   ├── extraction_normalizer.py            # Canonical format normalisation
│   ├── model_variant_merger.py             # Size-variant merging
│   ├── model_contribution_selector.py      # Primary model selection
│   ├── baseline_filter.py                  # Baseline/comparison model filter
│   └── comparison_updater.py              # Legacy comparison updater
├── scripts/
│   ├── evaluation/                         # Evaluation scripts and README
│   │   ├── evaluate_extraction_strict.py   # Strict evaluator (thesis metrics)
│   │   ├── evaluate_extraction.py          # Relaxed evaluator
│   │   ├── convert_gold_standard.py        # ORKG CSV → gold JSON converter
│   │   └── README.md
│   ├── batch_extract_all_papers.py
│   ├── build_results_table.py
│   └── import_extracted_to_model_folders.py
├── finetuning/                             # Optional LoRA fine-tuning workflow
├── tests/                                  # Unit tests
│   ├── __init__.py
│   ├── test_pipeline.py
│   ├── test_llm_extractor.py
│   ├── test_orkg_client.py
│   ├── test_orkg_append.py
│   └── test_pdf_parser.py
├── data/
│   ├── papers/                             # Downloaded PDFs (gitignored)
│   ├── extracted/                          # Extraction results (gitignored)
│   ├── gold_standard/                      # Gold-standard evaluation data
│   └── logs/                              # Pipeline logs (gitignored)
├── results/                                # Aggregated evaluation tables (CSV/JSON)
├── config/
│   └── config.yaml                         # Pipeline configuration
├── .env.example                            # Environment variables template
├── .gitignore
├── LICENSE
├── pyproject.toml                          # Python packaging and tool config
├── requirements.txt                        # Runtime dependencies
└── requirements-dev.txt                    # Development dependencies

Installation

Prerequisites

Python 3.9 or higher
A KISSKI API key (provided by GWDG for thesis work)
An ORKG account at sandbox.orkg.org or orkg.org

Setup

Clone the repository:

git clone https://github.com/sciknoworg/llm-atlas.git
cd llm-atlas

Create and activate a virtual environment:

python -m venv venv
# Windows
venv\Scripts\activate
# Linux / macOS
source venv/bin/activate

Install dependencies:
```
pip install -r requirements.txt
```

Copy the environment template and fill in your credentials:

# Windows (PowerShell)
Copy-Item .env.example .env
# Linux / macOS
cp .env.example .env

Configuration

Environment Variables (`.env`)

# ORKG credentials
ORKG_EMAIL=your.email@example.com
ORKG_PASSWORD=your_password_here
ORKG_HOST=sandbox  # Options: sandbox, incubating, production

# KISSKI Chat AI API (SAIA platform, OpenAI-compatible)
KISSKI_API_KEY=your_kisski_api_key_here
KISSKI_BASE_URL=https://chat-ai.academiccloud.de/v1

For Grete HPC users: KISSKI_API_KEY is not required when running GPU-based extraction. See grete/README.md.

Pipeline Configuration (`config/config.yaml`)

Key settings:

Section	Key	Description
`orkg`	`host`	`sandbox` (testing) or `production`
`orkg`	`template_id`	ORKG LLM template (`R609825`)
`orkg`	`comparison_id`	Target comparison (`R2147679`)
`kisski`	`model`	LLM model name (e.g. `qwen3-235b-a22b`)
`kisski`	`temperature`	`0.0` for deterministic extraction
`extraction`	`max_chunk_size`	Max characters per PDF chunk (default: 6000)

Usage

Process a Single Paper (ArXiv ID)

python -m src.pipeline --arxiv-id 2302.13971

Process a Paper from PDF URL

python -m src.pipeline --pdf-url https://example.org/paper.pdf --paper-title "Paper Title"

Upload an Existing Extraction JSON to ORKG

python -m src.pipeline --json-file data/extracted/2302.13971_20260101_120000.json

Skip ORKG Upload (Extraction Only)

python -m src.pipeline --arxiv-id 2302.13971 --no-update

Skip Evaluation After Extraction

python -m src.pipeline --arxiv-id 2302.13971 --no-evaluate

Show Pipeline Status

python -m src.pipeline --status

Test Connections

python -m src.pipeline --test

Search ArXiv and Process Results

python -m src.pipeline --search "large language model" --max-results 5

Specify Custom Gold Standard for Auto-Evaluation

python -m src.pipeline --arxiv-id 2302.13971 --gold data/gold_standard/R1364660.json

Python API

from src.pipeline import ExtractionPipeline

pipeline = ExtractionPipeline()
result = pipeline.process_paper("2302.13971")
print(result["extraction_data"])

Evaluation

The evaluation framework measures extraction quality against a gold standard derived from the ORKG comparison R1364660.

Run Strict Evaluation

# PowerShell (recommended): one line
python scripts/evaluation/evaluate_extraction_strict.py --gold data/gold_standard/R1364660.json --prediction data/extracted/your_extraction.json

# Linux / macOS (multiline)
python scripts/evaluation/evaluate_extraction_strict.py \
  --gold data/gold_standard/R1364660.json \
  --prediction data/extracted/your_extraction.json

Save Evaluation Report

# PowerShell (recommended): one line
python scripts/evaluation/evaluate_extraction_strict.py --gold data/gold_standard/R1364660.json --prediction data/extracted/your_extraction.json --output results/evaluation_report.json

# Linux / macOS (multiline)
python scripts/evaluation/evaluate_extraction_strict.py \
  --gold data/gold_standard/R1364660.json \
  --prediction data/extracted/your_extraction.json \
  --output results/evaluation_report.json

Evaluation Metrics

Structured fields (model name, parameters, date, organisation, etc.) are evaluated with match-based F1:

Metric	Description
Precision	Of all extracted values, how many are correct?
Recall	Of all gold values, how many were found?
F1-Score	Harmonic mean of precision and recall

Semantic fields (innovation, extension, application, research_problem, pretraining_corpus) are evaluated with BERTScore (token-level contextual similarity using RoBERTa-large embeddings), which captures paraphrases that exact matching would penalise.

Field-specific matching rules (date year tolerance, organisation aliases, parameter set F1, etc.) are documented in scripts/evaluation/README.md.

Testing

# Run all unit tests
pytest tests/ -v

# Run with coverage report
pytest tests/ -v --cov=src --cov-report=term-missing

HPC Cluster (Grete)

For large-scale extraction without API limits, the pipeline can run on GWDG Grete using local GPU-based transformers.

# Submit a single extraction job
sbatch grete/jobs/grete_extract_job.sh 2302.13971

# Submit batch array job
sbatch grete/jobs/grete_extract_batch.sh

Full setup guide: grete/README.md and KISSKI_SETUP.md.

Resources

Resource	Link
ORKG Platform	https://orkg.org/
ORKG Sandbox	https://sandbox.orkg.org/
ORKG Python Client	https://orkg.readthedocs.io/en/latest/
LLM Template (R609825)	https://orkg.org/templates/R609825
Target Comparison (R1364660)	https://orkg.org/comparisons/R1364660
KISSKI Chat AI API	https://chat-ai.academiccloud.de
ArXiv API	https://info.arxiv.org/help/api/
GWDG Grete HPC	https://info.gwdg.de/docs/doku.php?id=en:services:application_services:high_performance_computing:grete

License

This project is licensed under the MIT License — see the LICENSE file for details.

Acknowledgments

ORKG platform (orkg.org) for knowledge graph infrastructure
GWDG for providing HPC resources (Grete cluster) and KISSKI API access
SAIA platform for Chat AI API services

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
config		config
data		data
finetuning		finetuning
grete		grete
results		results
scripts		scripts
src		src
tests		tests
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
KISSKI_SETUP.md		KISSKI_SETUP.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Bachelor Thesis: Automated Knowledge Extraction for Large-Scale Generative AI Models Catalog

Overview

Features

Project Structure

Installation

Prerequisites

Setup

Configuration

Environment Variables (.env)

Pipeline Configuration (config/config.yaml)

Usage

Process a Single Paper (ArXiv ID)

Process a Paper from PDF URL

Upload an Existing Extraction JSON to ORKG

Skip ORKG Upload (Extraction Only)

Skip Evaluation After Extraction

Show Pipeline Status

Test Connections

Search ArXiv and Process Results

Specify Custom Gold Standard for Auto-Evaluation

Python API

Evaluation

Run Strict Evaluation

Save Evaluation Report

Evaluation Metrics

Testing

HPC Cluster (Grete)

Resources

License

Acknowledgments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Environment Variables (`.env`)

Pipeline Configuration (`config/config.yaml`)

Packages