Data Dictionary Agent is a local-first, CLI-first Python tool that profiles CSV/XLSX/XLSM datasets and produces first-pass data dictionary artifacts. It combines deterministic profiling, deterministic semantic inference, human-provided config context, bounded agent orchestration, and optional LLM wording suggestions without making the LLM the source of truth.
Documenting datasets by hand is repetitive and often inconsistent.
This project gives you a practical first pass:
- profile what is actually in the data,
- infer likely column roles with deterministic rules,
- apply known business context with optional config overrides,
- and produce readable outputs you can review and refine.
It is designed for local runs, clear artifacts, and reviewable evidence.
You can ask an LLM to draft documentation, but that alone has limits:
- it may invent details that are not in your data,
- it may miss distribution-level facts,
- and it may blur observed evidence with guessed interpretation.
This project keeps those boundaries clear:
- deterministic profiling captures observed facts,
- deterministic semantic inference suggests likely roles,
- optional LLM output is only wording help from a safe summary,
- and authoritative outputs stay grounded in deterministic logic, the profile, and config.
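To make "observed facts" concrete, here is a minimal sketch of what deterministic profiling can look like, using only the standard library. The function name `profile_columns` and the chosen statistics are illustrative assumptions, not the project's actual implementation in `profiling.py`.

```python
import csv
import io
from collections import Counter


def profile_columns(csv_text: str) -> dict:
    """Compute simple, repeatable facts for each column of a CSV string.

    Hypothetical helper for illustration; not the project's real API.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    profile = {}
    for name in reader.fieldnames or []:
        # Treat empty or whitespace-only cells as missing values.
        values = [(row.get(name) or "").strip() for row in rows]
        non_empty = [value for value in values if value]
        counts = Counter(non_empty)
        profile[name] = {
            "row_count": len(values),
            "null_count": len(values) - len(non_empty),
            "distinct_count": len(counts),
            "top_values": counts.most_common(3),
        }
    return profile


sample = "contact_id,status\n1,active\n2,active\n3,\n"
result = profile_columns(sample)
print(result["status"])
```

The same input always yields the same profile, which is what lets these numbers serve as evidence rather than interpretation.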
- Local-first tabular intake (`.csv`, `.xlsx`, `.xlsm`)
- Deterministic physical profiling
- Deterministic semantic inference
- Optional YAML config overrides for business context
- Deterministic dictionary outputs (`.md`, `.csv`, `.json`)
- Suggested override workflow (`suggested_overrides.yaml`)
- Bounded agent mode with trace/report artifacts
- Optional LLM description suggestions with safe summaries and fallback generation
In this repo, hybrid agent does not just mean “deterministic plus LLM.”
It means several bounded layers work together:
- deterministic profiling produces observed evidence,
- deterministic semantic inference suggests likely roles,
- config overrides let a human provide known business context,
- bounded agent mode plans, orchestrates, and records decisions/review items,
- optional LLM suggestions help with wording only from safe summaries,
- deterministic/profile/config artifacts remain the source of truth.
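The layering above can be sketched as a small resolution step: a rule-based guess is used unless a human has supplied context, and every answer records where it came from. The rules, role names, and function names below are hypothetical and only illustrate the idea; the real rule set lives in `semantic_inference.py` and is not shown here.

```python
import re

# Hypothetical name-based rules, checked in order. Not the project's real rules.
ROLE_RULES = [
    (re.compile(r"(^|_)id$"), "identifier"),
    (re.compile(r"email"), "email"),
    (re.compile(r"(_at|_date)$"), "timestamp"),
]


def infer_role(column_name: str) -> str:
    """Return the first matching role, or 'unknown' when no rule applies."""
    name = column_name.lower()
    for pattern, role in ROLE_RULES:
        if pattern.search(name):
            return role
    return "unknown"


def resolve_role(column_name: str, overrides: dict) -> dict:
    """Prefer human-provided context; otherwise fall back to the inferred role."""
    if column_name in overrides:
        return {"role": overrides[column_name], "source": "config_override"}
    return {"role": infer_role(column_name), "source": "semantic_inference"}


overrides = {"status": "lifecycle_stage"}  # assumed shape of human-provided context
for column in ["contact_id", "email_address", "status", "notes"]:
    print(column, resolve_role(column, overrides))
```

Recording a `source` for each role keeps inferred suggestions distinguishable from human-provided context when the output is reviewed.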
```bash
python -m venv .venv
source .venv/Scripts/activate
python -m pip install -e ".[dev]"
python -m data_dictionary_agent.cli \
  --input sample_data/crm_contacts/contacts_clean.csv \
  --output-dir outputs/crm_contacts_profile
```

macOS/Linux activation uses `source .venv/bin/activate`.
```bash
python -m pip install -e ".[dev,llm]"
export OPENAI_API_KEY="your-key"
export OPENAI_MODEL="gpt-4o-mini" # optional
```

`--llm-descriptions` still works without an API key. In that case, deterministic fallback suggestions are generated. LLM suggestion files stay separate and do not overwrite dictionary outputs.
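The fallback behavior can be pictured roughly as follows. This is a sketch under assumptions: the function names and the wording of the fallback text are invented for illustration, and only the `OPENAI_API_KEY` variable name comes from this README.

```python
import os
from typing import Mapping, Optional


def choose_suggestion_source(env: Optional[Mapping[str, str]] = None) -> str:
    """Return 'llm' when an API key is set, otherwise 'fallback'."""
    env = os.environ if env is None else env
    return "llm" if env.get("OPENAI_API_KEY") else "fallback"


def fallback_description(column_name: str, inferred_role: str) -> str:
    """Build a deterministic placeholder description from known facts only.

    Hypothetical wording; the project's real fallback text may differ.
    """
    readable = column_name.replace("_", " ")
    return f"Column '{readable}' (inferred role: {inferred_role}). Needs human review."


print(choose_suggestion_source({}))
print(fallback_description("contact_id", "identifier"))
```

Passing the environment in as an argument keeps the check easy to test without touching real credentials.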
```bash
# Profile only
python -m data_dictionary_agent.cli \
  --input sample_data/crm_contacts/contacts_clean.csv \
  --output-dir outputs/crm_contacts_profile

# Profile with config overrides
python -m data_dictionary_agent.cli \
  --input sample_data/crm_contacts/contacts_clean.csv \
  --config config/examples/crm_context.yaml \
  --output-dir outputs/crm_contacts_with_config

# Agent mode with config overrides
python -m data_dictionary_agent.cli \
  --input sample_data/crm_contacts/contacts_clean.csv \
  --config config/examples/crm_context.yaml \
  --mode agent \
  --output-dir outputs/crm_contacts_agent

# Optional LLM description suggestions
python -m data_dictionary_agent.cli \
  --input sample_data/crm_contacts/contacts_clean.csv \
  --llm-descriptions \
  --output-dir outputs/crm_contacts_llm

# Agent mode with LLM description suggestions
python -m data_dictionary_agent.cli \
  --input sample_data/crm_contacts/contacts_clean.csv \
  --config config/examples/crm_context.yaml \
  --mode agent \
  --llm-descriptions \
  --output-dir outputs/crm_contacts_agent_llm
```

- `profiling_trace.json`
- `data_dictionary.md`, `data_dictionary.csv`, `data_dictionary.json`
- `suggested_overrides.yaml`
- `agent_trace.json`, `agent_report.md`
- `llm_safe_summary.json`, `llm_description_suggestions.json`, `llm_description_suggestions.md`
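After a run, it can be handy to check which of these files actually landed in the output directory. The helper below is a small sketch, not part of the project: `list_artifacts` is a hypothetical name, and which files appear depends on the flags used, so a missing file is not necessarily an error.

```python
import tempfile
from pathlib import Path

# File names taken from the list above.
KNOWN_ARTIFACTS = [
    "profiling_trace.json",
    "data_dictionary.md",
    "data_dictionary.csv",
    "data_dictionary.json",
    "suggested_overrides.yaml",
    "agent_trace.json",
    "agent_report.md",
    "llm_safe_summary.json",
    "llm_description_suggestions.json",
    "llm_description_suggestions.md",
]


def list_artifacts(output_dir: str) -> dict:
    """Map each known artifact name to whether it exists in output_dir."""
    root = Path(output_dir)
    return {name: (root / name).is_file() for name in KNOWN_ARTIFACTS}


# Demo against a temporary directory standing in for a real output folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "profiling_trace.json").write_text("{}")
    status = list_artifacts(tmp)

for name, present in status.items():
    print(f"{'found  ' if present else 'missing'} {name}")
```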
- Deterministic profiling trace is authoritative evidence for observed data facts.
- Deterministic semantic inference is a suggestion layer.
- Config overrides are human-provided context.
- Dictionary files are first-pass documentation built from those inputs.
- LLM description suggestions are optional, off by default, written to separate files, and do not overwrite `data_dictionary.md/.csv/.json`.
- Possible sensitive fields are redacted in `llm_safe_summary.json`.
- No full raw rows are sent to the LLM.
- If no API key is available, fallback description suggestions are generated.
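As an illustration of the LLM boundary, the sketch below builds a summary that carries only aggregate statistics and withholds sample values for columns whose names look sensitive. The hint list, field names, and `build_safe_summary` function are assumptions made for this example; the actual redaction logic in `llm_descriptions.py` may work differently.

```python
# Hypothetical name hints for possibly sensitive fields.
SENSITIVE_HINTS = ("email", "phone", "ssn", "address")


def build_safe_summary(profile: dict) -> dict:
    """Keep aggregate stats per column; drop sample values for sensitive-looking names."""
    summary = {}
    for column, stats in profile.items():
        sensitive = any(hint in column.lower() for hint in SENSITIVE_HINTS)
        summary[column] = {
            "null_count": stats["null_count"],
            "distinct_count": stats["distinct_count"],
            # Only a few example values are shared, and none for sensitive columns.
            "sample_values": [] if sensitive else list(stats.get("sample_values", []))[:3],
            "redacted": sensitive,
        }
    return summary


profile = {
    "email_address": {"null_count": 0, "distinct_count": 3, "sample_values": ["a@example.com"]},
    "status": {"null_count": 1, "distinct_count": 2, "sample_values": ["active", "inactive"]},
}
safe = build_safe_summary(profile)
print(safe["email_address"])
print(safe["status"])
```

Because the summary is built per column from statistics, no complete row of the dataset is ever assembled for the LLM.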
- `src/data_dictionary_agent/cli.py`: command-line entrypoint and run orchestration.
- `src/data_dictionary_agent/intake.py`: CSV/XLSX/XLSM loading and input validation.
- `src/data_dictionary_agent/profiling.py`: deterministic physical profiling logic.
- `src/data_dictionary_agent/semantic_inference.py`: deterministic semantic role suggestion rules.
- `src/data_dictionary_agent/config.py`: YAML config override loading/validation.
- `src/data_dictionary_agent/dictionary_builder.py`: first-pass dictionary entry construction.
- `src/data_dictionary_agent/suggested_overrides.py`: suggested review override artifact generation.
- `src/data_dictionary_agent/agent_runner.py`: bounded agent-mode execution and trace capture.
- `src/data_dictionary_agent/agent_reporting.py`: human-readable agent run reporting.
- `src/data_dictionary_agent/llm_descriptions.py`: safe summary creation and optional LLM/fallback suggestions.
- `src/data_dictionary_agent/output_writers.py`: writes dictionary and related output artifacts.
- `tests/`: automated test coverage for pipeline behavior.
- `docs/`: usage guides, boundaries, and release notes for maintainers.
```bash
python -m pytest
```

- Not a formal sensitive-data classification system.
- Not a data catalog publishing platform.
- Not a framework-based orchestration project.
- Not an autonomous, open-ended agent.
- Not a replacement for governed business definitions.
- `ARCHITECTURE.md`
- `PROJECT_SCOPE.md`
- `docs/how_it_works.md`
- `docs/interpreting_outputs.md`
- `docs/hybrid_agent.md`
- `docs/llm_boundary.md`
- `docs/example_commands.md`
- `docs/config_overrides.md`
- `docs/agent_mode.md`
- `docs/release_checklist.md`
- `FUTURE_WORK.md`