PDF extraction that checks its own work. #2 reading order accuracy — zero AI, zero GPU, zero cost.
Updated May 22, 2026 - Python
Extract structured data using local or remote LLMs
A schema-driven framework for LLM structured extraction enhanced by multi-stage RL training (SFT→DPO→GRPO), with interpretable reward design and end-to-end reproducibility.
Reproducible diagnostic investigation of a fine-tuned SLM that scored 99.75% on evaluation and failed silently on 10% of production inputs. Full pipeline. Every number verified.
Claude Code Skill for structured information extraction from code/docs/logs. 6-step Python pipeline (source grounding, dedup, confidence scoring, entity resolution, relation inference, KG injection). Zero dependencies, no API keys. Replaces LangExtract.
Auditable LLM extraction for Java: structured output with source citations, PDF bounding boxes, confidence, provenance, and audit JSON.
A simple LLM library
news-summizr extracts structured summaries from headlines, labeling key points such as announcement, products, and region for quick insight.
Collection of purpose-built MCP servers for AI agent workflows.
A new package designed to facilitate structured, reliable extraction of key insights from user-provided texts about cultural topics. It accepts a text input, such as an article or discussion prompt
Turn tutorial videos into structured specs — Pine Script, recipes, code walkthroughs
Automated research paper analysis: PDF → JSON with evidence extraction using LLMs (DeepSeek, Gemma). Extracts methods, results, datasets, and claims with precise evidence grounding.
Automated prompt optimization using mentor-agent architecture. Generate and refine prompts from labeled data.
AI-powered travel agency assistant: a LangGraph stateful agent on Telegram that captures preferences through natural conversation, generates personalized itineraries via Groq/Llama 3.3, auto-manages leads in Excel, and remembers returning users. Built with LangChain, FastAPI, and python-telegram-bot.
Multilingual structured OCR (11+ languages, CJK-tuned) — MCP server with verified per-character bboxes for AI agents
Human-in-the-loop LLM orchestration with structured signal extraction and session persistence. Annotate confusion and curiosity—feedback shapes responses, topology accumulates over time. API-first design, no gamification. FastAPI + Claude + SQLite + D3.
ReAct-based intelligent analysis Agent with 4-layer architecture (Skill-Agent-LLMService-Tool), dual tool-calling modes (Native FC / Prompt-based), triple execution engine (Offline/Fast/Agent), incremental reflection with convergence detection, Skill template system, SSE streaming, Prometheus monitoring, and SFT trajectory export.
Source content for Vstorm blog posts—carefully crafted to provide both depth and clarity, with practical insights readers can apply immediately.
Robust extraction of structured signals from messy unstructured text. Hybrid LLM + tool-use schema + source span linking + eval harness.
Extract structured data from SEC EDGAR 10-K filings using LLMs (Claude/GPT-4o) + Pydantic v2 validation