This document describes the structure and fields of all JSON files in results/final/.
- `{study_type}_v{version}_{timestamp}.json`: Full per-sample results
- `{study_type}_v{version}_{timestamp}_stats.json`: Aggregated statistics

Study types:
- `cross_domain`: Single-turn responses across different tool-absence conditions (web, image, database, file)
- `persistence`: Multi-turn (3 turns) to measure label stability over repeated queries

Aggregated model-level statistics.

| Field | Type | Description |
|---|---|---|
| `total_responses` | int | Total number of responses across all models |
| `total_calls` | int | Total API calls made (may include retries) |
| `by_model` | object | Statistics grouped by model identifier |

Each entry in `by_model` contains:

| Field | Type | Description |
|---|---|---|
| `total` | int | Total responses for this model |
| `labels` | object | Raw counts per label (`FABRICATION`, `ADMISSION`, `SILENT_REFUSAL`, `NULL`) |
| `rates` | object | Proportions (0–1) for each label |
| `cis_wilson_95` | object | 95% confidence intervals (Wilson score) for each label |
| `cis_wilson_95[label].lo` | float | Lower bound of 95% CI |
| `cis_wilson_95[label].hi` | float | Upper bound of 95% CI |
| `blame_rate` | float | Proportion of responses that contain blame language (deprecated/optional) |
| `cost_usd` | float | Total cost in USD for this model's API calls |
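
The `cis_wilson_95` bounds can in principle be recomputed from `labels` and `total`. The following is a minimal sketch, assuming the standard Wilson score interval with z ≈ 1.96 and no continuity correction; the helper names `wilson_interval` and `wilson_cis` are illustrative and not taken from the project code, which may differ in rounding or edge-case handling.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (no continuity correction)."""
    if total == 0:
        # Assumption: with no data, report the uninformative interval [0, 1].
        return (0.0, 1.0)
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def wilson_cis(labels: dict[str, int], total: int) -> dict[str, dict[str, float]]:
    """Build a mapping shaped like cis_wilson_95: {label: {"lo": ..., "hi": ...}}."""
    cis = {}
    for label, count in labels.items():
        lo, hi = wilson_interval(count, total)
        cis[label] = {"lo": lo, "hi": hi}
    return cis
```

For example, `wilson_interval(5, 10)` gives roughly `(0.237, 0.763)`, which is wider near small `total` than a naive normal approximation would suggest.
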

Classification labels:

| Label | Description |
|---|---|
| `FABRICATION` | Model generates plausible but false output (hallucination under tool absence) |
| `ADMISSION` | Model explicitly states it cannot perform the task |
| `SILENT_REFUSAL` | Model returns structured refusal (e.g., null values, empty fields) without explanation |
| `NULL` | Ambiguous or unclassifiable response |

Per-sample results with full response data.
{
"config": { ... },
"results": { "model_name": [ ... ] },
"total_spend": float,
"elapsed": float,
"completed": timestamp
}

Fields in `config`:

| Field | Type | Description |
|---|---|---|
| `budget_usd_cap` | float | Maximum budget allowed for the run |
| `conditions` | array | List of experimental conditions (tool-absence scenarios) |
| `conditions[i].id` | string | Condition identifier (e.g., `no_web_search`) |
| `conditions[i].template` | string | Prompt template filename used |
| `models` | array | List of models tested |
| `models[i].model` | string | Model identifier (e.g., `gpt-5`) |
| `models[i].provider` | string | Provider name (`openai`, `anthropic`, `google`) |
| `max_completion_tokens_*` | int | Max tokens per completion (provider-specific) |

Each element of `results[model_name]` is a single API call result:

| Field | Type | Description |
|---|---|---|
| `dedupe_key` | string | SHA256 hash identifying unique prompt+condition+seed combinations |
| `provider` | string | API provider (`openai`, `anthropic`, `google`) |
| `model` | string | Full model identifier |
| `condition_id` | string | Experimental condition ID (links to `config.conditions`) |
| `seed` | int | Random seed for this sample (for reproducibility) |
| `turn_index` | int | Turn number (0-indexed; always `0` outside the persistence study) |
| `success` | bool | Whether API call succeeded |
| `classification` | string | Human/automated label (`FABRICATION`, `ADMISSION`, `SILENT_REFUSAL`, `NULL`) |
| `response_content` | string | Raw model response (may be JSON, text, or structured output) |
| `tokens_prompt` | int | Input tokens used |
| `tokens_completion` | int | Output tokens generated |
| `cost_usd` | float | Cost of this individual call |
| `timestamp` | string | ISO 8601 timestamp of API call |
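
Given records of this shape, the per-model `total`, `labels`, and `rates` fields of the stats file can be approximated with a short tally. This is a sketch only: the function name `summarize_model` is hypothetical, and it assumes that failed calls and labels outside the four documented ones are excluded, which the project's own aggregation may or may not do.

```python
from collections import Counter

LABELS = ("FABRICATION", "ADMISSION", "SILENT_REFUSAL", "NULL")


def summarize_model(records: list[dict]) -> dict:
    """Tally classification labels for one model's list of per-call records."""
    # Assumption: only successful calls with a documented label are counted.
    counts = Counter(
        r["classification"]
        for r in records
        if r.get("success") and r.get("classification") in LABELS
    )
    total = sum(counts.values())
    labels = {label: counts.get(label, 0) for label in LABELS}
    rates = {label: (labels[label] / total if total else 0.0) for label in LABELS}
    return {"total": total, "labels": labels, "rates": rates}
```

A typical use would be to load a per-sample file with `json.load` and call `summarize_model(data["results"][model_name])` for each model key.
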
Responses with the same dedupe_key form a sequence. Use turn_index to order them chronologically. The persistence study has 3 turns per sequence (turn 0, 1, 2).
Transition matrices are computed from pairs (classification[turn_N], classification[turn_N+1]) within each sequence.
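
The pairing described above can be sketched as follows. The code groups one model's records by `dedupe_key`, orders each group by `turn_index`, and counts consecutive label pairs. The function names are illustrative, and the choice to skip pairs with a missing intermediate turn is an assumption rather than documented behavior.

```python
from collections import Counter, defaultdict


def transition_counts(records: list[dict]) -> Counter:
    """Count (label at turn N, label at turn N+1) pairs within each sequence."""
    sequences = defaultdict(list)
    for r in records:
        sequences[r["dedupe_key"]].append(r)

    counts = Counter()
    for seq in sequences.values():
        seq.sort(key=lambda r: r["turn_index"])
        for prev, nxt in zip(seq, seq[1:]):
            # Assumption: only strictly consecutive turns form a transition.
            if nxt["turn_index"] == prev["turn_index"] + 1:
                counts[(prev["classification"], nxt["classification"])] += 1
    return counts


def row_normalize(counts: Counter) -> dict:
    """Convert pair counts to P(next label | current label)."""
    row_totals = Counter()
    for (src, _), n in counts.items():
        row_totals[src] += n
    return {pair: n / row_totals[pair[0]] for pair, n in counts.items()}
```

Because records are stored per model under `results[model_name]`, calling `transition_counts` on one model's list at a time keeps sequences from different models separate.
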

| File | Description |
|---|---|
| `irr_clean.csv` | Human-labeled subset for IRR validation |
| `irr_confusion_matrix.csv` | Agreement matrix between two raters |
| `irr_report.md` | Cohen's κ and agreement statistics |

Columns in irr_clean.csv:
- `sample_id`: Unique identifier
- `model`: Model tested
- `condition_id`: Experimental condition
- `response_content`: Model output
- `rater_1`: Label assigned by first rater
- `rater_2`: Label assigned by second rater
- `consensus`: Final agreed label (used in main analysis)
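
Cohen's κ can be recomputed from the `rater_1` and `rater_2` columns as a sanity check. The sketch below uses the standard unweighted formula κ = (p_o − p_e) / (1 − p_e); the function names are illustrative, and the figures in `irr_report.md` may have been produced with a different tool or with additional filtering.

```python
import csv
from collections import Counter


def load_rater_pairs(lines) -> list[tuple[str, str]]:
    """Read (rater_1, rater_2) label pairs from an open CSV file or any iterable of lines."""
    reader = csv.DictReader(lines)
    return [(row["rater_1"], row["rater_2"]) for row in reader]


def cohens_kappa(pairs: list[tuple[str, str]]) -> float:
    """Unweighted Cohen's kappa for two raters labeling the same items."""
    n = len(pairs)
    if n == 0:
        raise ValueError("no rated samples")
    observed = sum(a == b for a, b in pairs) / n
    marg_1 = Counter(a for a, _ in pairs)
    marg_2 = Counter(b for _, b in pairs)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(marg_1[label] * marg_2[label] for label in marg_1) / n**2
    if expected == 1.0:
        # Assumption: both raters used a single identical label; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Usage would look like `with open("irr_clean.csv", newline="") as f: print(cohens_kappa(load_rater_pairs(f)))`, with the path adjusted to wherever the file lives in the repository.
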

Notes:

- All `dedupe_key` values are deterministic: changing the prompt, condition, or seed will produce a different hash.
- `turn_index` is always `0` for single-turn studies (`cross_domain`).
- Cost estimates are based on provider-reported token counts at time of execution (rates may change).

Open an issue at github.com/Course-Correct-Labs/simulation-fallacy.