cedarkit is a support package for CCM/EDM workflows. It provides the project-configuration loader, data-variable metadata objects, routing and path helpers, parquet packaging utilities, visualization helpers, and runner scripts used to move from project configs and parameter files to calculation outputs and aggregated artifacts.
This README is meant to help readers recognize the main files, directories, and output products in the current CSV/parquet-oriented workflow. SQLite-related paths and options are intentionally not documented here.
The package metadata lives in pyproject.toml and currently targets Python 3.11+.
For development, install from the cedar_util directory with `pip install -e .`. There is also an optional Conda environment in env.yml for projects that want a fuller local environment around the package.
Core dependencies currently include:
- pandas
- pyarrow
- matplotlib
- seaborn
- PyYAML
- cloudpickle
- joblib
- pyleoclim
- pyEDM
For reproducible project environments, some workflows may pin pyEDM or related dependencies to specific commits outside the minimal package metadata.
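One hedged illustration of such a pin, in a requirements-style line (the repository URL is the commonly known pyEDM home and the commit placeholder is intentionally left unfilled; neither is an actual pin from this project):

```text
# hypothetical commit pin; replace <commit-sha> with the desired revision
pyEDM @ git+https://github.com/SugiharaLab/pyEDM@<commit-sha>
```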
cedar_util/
├── cedarkit/
│ ├── core/ # ProjectConfig, data-variable objects, relationship/data objects
│ ├── utils/ # Routing, IO, workflow, CLI, tables, plotting helpers
│ └── viz/ # Grid/panel visualization helpers
├── runners/ # Script entrypoints for grouping, running, packaging, and grids
├── tests/ # Lightweight comparison / validation scripts
├── pyproject.toml # Package metadata
├── requirements.txt # Pinned environment-style dependency list
├── env.yml # Optional Conda environment
└── README.md
At a high level, the current documented workflow looks like this:
- A project directory contains a proj_config.yaml plus variable YAMLs in data_var_configs/.
- Parameter CSVs in a parameters/ directory define the CCM runs to perform.
- Runner scripts operate on those parameter or group files and write intermediate calculation outputs.
- Packaging and aggregation steps turn those outputs into parquet tables and object-grid artifacts.
The important thing to recognize is the artifact flow:
proj_config.yaml
data_var_configs/*.yaml
↓
parameters/*.csv
↓
calc_grps.csv / E_tau_grps.csv
↓
run-level output files
↓
parquet outputs and object-grid joblib files
The main config loader is cedarkit.core.project_config.load_config, which reads a top-level project YAML and can also load per-variable YAML files from data_var_configs/. The resulting ProjectConfig object exposes nested config sections as attributes.
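The attribute-style access can be illustrated with a small self-contained sketch. This mimics the idea only; the real behavior lives in cedarkit.core.project_config, and `to_namespace` is a hypothetical helper, not part of the package:

```python
from types import SimpleNamespace

def to_namespace(d):
    """Recursively convert nested dicts into attribute-accessible objects.

    Illustrative only: this sketches the "nested config sections as
    attributes" idea, not ProjectConfig's actual implementation.
    """
    if isinstance(d, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in d.items()})
    return d

cfg = to_namespace({
    "proj_name": "ExampleProject",
    "col": {"var_id": "NGRIP1", "var": "d18O"},
})
print(cfg.col.var_id)  # NGRIP1
```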
Key config ideas reflected in the code:
- data_vars: names or IDs of variable definitions to load
- project-level metadata: labels, prefixes, run settings, file naming blocks
- pal: palette information, optionally supplemented by data_var_configs/palette.yaml
- location-specific settings: for example, local and hpc blocks used by routing helpers
- variable metadata consumed by DataVarConfig, such as real/surrogate source fields and labels
A minimal fragment looks like this:
proj_name: ExampleProject
prefix: ExProj
data_vars:
  col: NGRIP1
  target: Wu18TSI
col:
  var_id: NGRIP1
  var: d18O
target:
  var_id: Wu18TSI
  var: TSI
csvs:
  calc_grps: calc_grps
  e_tau_grps: E_tau_grps

Typical companion files live under:
proj_config.yaml
data_var_configs/
├── NGRIP1.yaml
├── Wu18TSI.yaml
└── palette.yaml
Parameter and grouping files are plain CSVs. The exact columns vary by workflow, but these field names appear repeatedly in runners and output packaging:
col_var_id,target_var_id,E,tau,lag,knn,Tp,train_ind_i,id
NGRIP1,Wu18TSI,4,1,-5,4,0,0,101

Some files to look for:
- parameters/*.csv: run definitions consumed by calculation runners
- calc_grps.csv: grouped combinations used for packaging and aggregation
- E_tau_grps.csv: deduplicated E/tau-style grouping file written alongside calc groups
If you are scanning a project directory, the filenames are usually more informative than the command history.
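The grouping idea behind a file like E_tau_grps.csv can be sketched with the standard library. The rows below are hypothetical, shaped like the CSV fragment above, and the deduplication logic is illustrative rather than the actual runner code:

```python
import csv
import io

# Hypothetical parameter rows shaped like the CSV fragment above.
params = io.StringIO(
    "col_var_id,target_var_id,E,tau,lag,knn,Tp,train_ind_i,id\n"
    "NGRIP1,Wu18TSI,4,1,-5,4,0,0,101\n"
    "NGRIP1,Wu18TSI,4,1,-4,4,0,0,102\n"
    "NGRIP1,Wu18TSI,5,2,-5,5,0,0,103\n"
)
rows = list(csv.DictReader(params))

# Deduplicate on (E, tau): the kind of grouping an E_tau_grps-style
# file captures across many parameter rows.
e_tau_grps = sorted({(r["E"], r["tau"]) for r in rows})
print(e_tau_grps)  # [('4', '1'), ('5', '2')]
```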
The scripts in cedar_util/runners/ are the main operational entrypoints. They share a common CLI parser, so you will often see flags such as --project, --config, --parameters, --inds, --proj_dir, and --test, but the most useful thing here is knowing what each script reads and writes.
Purpose: turn parameter CSVs into deduplicated group definitions.
Typical inputs:
- proj_config.yaml
- parameters/*.csv
Typical outputs:
- calc_grps.csv
- E_tau_grps.csv
Recognition snippet:
parameters/params_lag_bycol_*.csv
calc_local_tmp/calc_grps.csv
calc_local_tmp/E_tau_grps.csv
Purpose: execute CCM runs for rows in a parameter CSV using the loaded project config and write run-level output files.
Typical inputs:
- proj_config.yaml
- parameters/<something>.csv
Typical outputs:
- per-run files under the calculation directory chosen by config/routing
Recognition snippet:
proj_config.yaml
parameters/params_*.csv
calc_local_tmp/
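The row-to-file mapping can be sketched as follows. The naming scheme here (`run_output_path` and its filename pattern) is a hypothetical illustration; the real names come from the project config and routing helpers:

```python
from pathlib import Path

def run_output_path(calc_dir, row):
    """Hypothetical sketch: derive a per-run output path from one
    parameter row. Illustrates the row -> file idea only; the actual
    naming is driven by config/routing."""
    name = f"run_{row['id']}_E{row['E']}_tau{row['tau']}.csv"
    return Path(calc_dir) / name

row = {"id": "101", "E": "4", "tau": "1"}
print(run_output_path("calc_local_tmp", row))
```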
Purpose: package grouped calculation outputs into parquet tables.
Typical inputs:
- calc_grps.csv, E_tau_grps.csv, or another grouping CSV
- intermediate run outputs from the calculation directory
Typical outputs:
- parquet outputs organized under the configured output directory structure
Recognition snippet:
calc_grps.csv
E_tau_grps.csv
output/parquet/
Purpose: build grid-style aggregate objects and derived tables across E/tau combinations.
Typical inputs:
- grouped output artifacts
- project config metadata
Typical outputs:
- delta_rho_stats tables
- delta_rho_full tables
- libsize_aggregated tables
- object-grid joblib artifacts
Recognition snippet:
E4_tau1__delta_rho_stats
E4_tau1__delta_rho_full
E4_tau1__libsize_aggregated
*.joblib
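One way to picture an object-grid artifact is a mapping keyed by (E, tau) that is serialized with joblib, a core dependency. The grid contents below are hypothetical placeholders, not the runner's real objects:

```python
import os
import tempfile

import joblib

# Hypothetical "object grid": a dict keyed by (E, tau) whose values
# stand in for per-combination aggregate objects.
grid = {
    (4, 1): {"delta_rho_stats": "E4_tau1__delta_rho_stats"},
    (5, 2): {"delta_rho_stats": "E5_tau2__delta_rho_stats"},
}

# Persist and reload, as the *.joblib artifacts are handled.
path = os.path.join(tempfile.mkdtemp(), "object_grid.joblib")
joblib.dump(grid, path)
print(sorted(joblib.load(path)))  # [(4, 1), (5, 2)]
```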
Purpose: rewrite embedded paths inside an existing object-grid joblib when moving the artifact to a new dyad/tmp home layout.
Typical inputs:
- an existing object-grid .joblib

Typical outputs:
- a rewritten .joblib, either in place or at a new path
Recognition snippet:
--input-grid old_grid.joblib
--output-grid rewritten_grid.joblib
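The path-rewriting idea can be sketched as a recursive walk that swaps an old root prefix for a new one in any embedded string paths. `rewrite_paths` and the example layout are hypothetical; the actual runner works on the real object-grid structure:

```python
def rewrite_paths(obj, old_root, new_root):
    """Hypothetical sketch: recursively replace an old root prefix with
    a new one in any string paths embedded in a loaded object grid."""
    if isinstance(obj, str):
        return new_root + obj[len(old_root):] if obj.startswith(old_root) else obj
    if isinstance(obj, dict):
        return {k: rewrite_paths(v, old_root, new_root) for k, v in obj.items()}
    if isinstance(obj, list):
        return [rewrite_paths(v, old_root, new_root) for v in obj]
    return obj

grid = {"stats": "/old/home/dyad/E4_tau1__delta_rho_stats"}
print(rewrite_paths(grid, "/old/home", "/new/home"))
# {'stats': '/new/home/dyad/E4_tau1__delta_rho_stats'}
```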
The README does not try to fully document the Python API, but these names are useful anchors when reading code:
- cedarkit.core.project_config.load_config
- cedarkit.core.project_config.ProjectConfig
- cedarkit.core.data_var.DataVarConfig
These interfaces are mostly useful for understanding how project YAML, variable YAML, and path-routing information are turned into runtime objects.
A few output patterns show up repeatedly across the package:
proj_config.yaml
parameters/params_*.csv
calc_grps.csv
E_tau_grps.csv
E{E}_tau{tau}__delta_rho_stats
E{E}_tau{tau}__delta_rho_full
E{E}_tau{tau}__libsize_aggregated
*.parquet
*.joblib
If you are trying to identify the right file to inspect, these names are usually the quickest landmarks.
This README describes the current cedarkit CSV/parquet-oriented workflow and the artifacts most readers will encounter in this repository. Alternate or evolving paths in the codebase, including SQLite-related routing, are intentionally left out here.
The current package metadata declares the GPL-3.0 license.