microsoft · babaknaderi · Oct 7, 2025 · Oct 7, 2025 · Oct 23, 2025 · Dec 3, 2025
diff --git a/.github/agents/analyze-results.agent.md b/.github/agents/analyze-results.agent.md
@@ -0,0 +1,191 @@
+---
+name: analyze-results
+description: Analyzes crowdsourced subjective test results — runs result_parser.py for data cleaning, quality checks, and per-clip/per-worker MOS aggregation.
+---
+
+# Analyze subjective test results
+
+Use this runbook when asked to analyze, parse, or evaluate results from a completed
+subjective speech quality test (ACR, DCR, CCR, P.835, P.804, echo impairment, or
+personalized P.835).
+
+**Trigger phrases**: "analyze results", "parse results", "evaluate the study",
+"process the answers", "run result parser".
+
+## Platform and shell adaptation
+
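+Code examples use **PowerShell on Windows** (`\` path separators). On other
+operating systems or shells, replace PowerShell cmdlets with their equivalents,
+use `python3` if `python` is not available, and convert `\` to `/` in paths.
+Replace `REPO_ROOT` with the actual absolute path of this repository. Other
+upper-case names such as `PROJECT_DIR`, `RESULT_PARSER_CFG`, and `METHOD` are
+placeholders for the inputs collected below.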
+
+## Mandatory pre-check
+
+Before running anything:
+
+1. Read `AGENTS.md` and `.github\copilot-instructions.md`.
+2. Confirm this is an analysis task, not a study-creation task. For study
+   creation, use the `create-study` agent instead.
+
+## Environment prerequisites
+
+Verify once at the start:
+
+1. **Python dependencies**: run `pip install -r requirements.txt --quiet` from `src\`.
+
+## Inputs the agent must collect
+
+Do not guess these values if they are missing:
+
+1. **Test method**: one of `acr`, `dcr`, `ccr`, `p835`, `p804`,
+   `echo_impairment_test`, `pp835`.
+2. **Result parser config file** (`*_result_parser.cfg`): generated by
+   `master_script.py` during study creation. Located in the project output
+   directory.
+3. **Answers CSV** (`Batch_XXX.csv`): exported from the crowdsourcing platform
+   (AMT) or the HIT App server. Contains the worker responses.
+4. **Prolific demographic CSV** (optional): `prolific_demographic_export_*.csv`,
+   needed only if the study was run on Prolific via the HIT App server.
+
+## Execution workflow
+
+### 1. Collect input files
+
+**[ASK]** Ask the user for:
+- The path to the project directory (where the `*_result_parser.cfg` is).
+- The test method used.
+- Whether they used Prolific or AMT.
+
+Then instruct: "Please download the answers file (`Batch_XXX.csv`) and, if using
+Prolific, the demographic export (`prolific_demographic_export_*.csv`) and place
+them in the same directory as the config file."
+
+**[ASK]** Once the user confirms the files are ready, ask for the exact
+filenames.
+
+### 2. Validate input files
+
+Before running the parser, verify:
+
+1. The `*_result_parser.cfg` file exists and is readable.
+2. The answers CSV (`Batch_XXX.csv`) exists, is non-empty, and is valid CSV.
+3. If Prolific was used, the demographic CSV exists and is valid.
+
+```powershell
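+# Each Test-Path call prints True if at least one matching file exists.
+# These are existence checks only; they do not confirm valid CSV content.
+Test-Path "PROJECT_DIR\*_result_parser.cfg"
+Test-Path "PROJECT_DIR\Batch_XXX.csv"
+# Confirm the answers file is non-empty (size in bytes greater than 0).
+(Get-Item "PROJECT_DIR\Batch_XXX.csv").Length -gt 0
+# Only if Prolific was used:
+Test-Path "PROJECT_DIR\prolific_demographic_export_*.csv"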
+```
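
The checks above only confirm that the files exist. A minimal Python sketch of the "non-empty and valid CSV" part of this step could look like the following. The function name `validate_csv` and the rule that every row must have as many fields as the header are assumptions made for illustration; they are not taken from `result_parser.py`.

```python
import csv
from pathlib import Path


def validate_csv(path):
    """Return a list of problems found in the CSV at `path` (empty list = OK).

    Hypothetical helper: checks existence, size, a header row, at least one
    data row, and a consistent field count per row.
    """
    p = Path(path)
    if not p.is_file():
        return [f"{path}: file not found"]
    if p.stat().st_size == 0:
        return [f"{path}: file is empty"]

    problems = []
    with p.open(newline="", encoding="utf-8-sig") as fh:
        reader = csv.reader(fh)
        header = next(reader, None)
        if not header:
            return [f"{path}: no header row"]
        n_rows = 0
        for row_no, row in enumerate(reader, start=2):
            if not row:  # skip completely blank lines
                continue
            n_rows += 1
            if len(row) != len(header):
                problems.append(
                    f"{path}: row {row_no} has {len(row)} fields, "
                    f"expected {len(header)}"
                )
        if n_rows == 0:
            problems.append(f"{path}: header only, no data rows")
    return problems
```

Calling `validate_csv` on the answers file (and on the Prolific export, if used) and stopping when the returned list is non-empty keeps malformed input away from the parser.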
+
+### 3. Run the result parser
+
+Set the working directory to the project directory before running.
+
+**Without Prolific:**
+
+```powershell
+Set-Location PROJECT_DIR
+python REPO_ROOT\src\result_parser.py `
+	--cfg RESULT_PARSER_CFG `
+	--method METHOD `
+	--answers Batch_XXX.csv
+```
+
+**With Prolific demographic data:**
+
+```powershell
+Set-Location PROJECT_DIR
+python REPO_ROOT\src\result_parser.py `
+	--cfg RESULT_PARSER_CFG `
+	--method METHOD `
+	--answers Batch_XXX.csv `
+	--prolific_answers prolific_demographic_export_XXX.csv
+```
+
+**Notes:**
+- Use **full absolute paths** for `--cfg` and for the script path to avoid
+  path-resolution issues.
+- The working directory should be the project directory so output files are
+  written there.
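
For shells other than PowerShell, the same invocation can be assembled in Python. This is a sketch under stated assumptions: the helper name `build_parser_command` is hypothetical, and only the flags shown in the commands above (`--cfg`, `--method`, `--answers`, `--prolific_answers`) and the method names listed under "Inputs the agent must collect" are used.

```python
import sys
from pathlib import Path

# Method names as listed in this runbook.
VALID_METHODS = {
    "acr", "dcr", "ccr", "p835", "p804", "echo_impairment_test", "pp835",
}


def build_parser_command(repo_root, cfg, method, answers, prolific_answers=None):
    """Build the argument list for result_parser.py (hypothetical helper)."""
    if method not in VALID_METHODS:
        raise ValueError(f"unknown test method: {method!r}")
    script = Path(repo_root) / "src" / "result_parser.py"
    cmd = [
        sys.executable, str(script),
        "--cfg", str(cfg),
        "--method", method,
        "--answers", str(answers),
    ]
    if prolific_answers:
        cmd += ["--prolific_answers", str(prolific_answers)]
    return cmd


# Usage (not executed here): run from the project directory so output files
# are written there, and capture stdout for the summary step.
#   import subprocess
#   result = subprocess.run(cmd, cwd=project_dir, capture_output=True, text=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.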
+
+### 4. Analyze the output and summarize
+
+After the parser completes, provide a summary covering:
+
+#### 4a. Rejection rate
+
+Extract these two counts from the parser output:
+- `"Number of submissions: YYYY"` (the total, `YYYY`)
+- `"overall XXXX answers are rejected"` (the rejected count, `XXXX`)
+
+Calculate: `rejection_percentage = XXXX / YYYY * 100`
+
+**⚠️ If rejection rate > 35%**: flag as alarming. Ask the user to investigate
+the rejection reasons in the data cleaning report.
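
The calculation above can be sketched in Python as follows. It assumes the parser prints the two messages exactly as quoted in this section (matched case-insensitively); the function name `rejection_rate` is hypothetical.

```python
import re

# Patterns for the two messages quoted in this runbook.
SUBMISSIONS_RE = re.compile(r"Number of submissions:\s*(\d+)", re.IGNORECASE)
REJECTED_RE = re.compile(r"overall\s+(\d+)\s+answers are rejected", re.IGNORECASE)


def rejection_rate(parser_output, threshold=35.0):
    """Return (rejection_percentage, is_alarming) from captured parser output."""
    submissions = SUBMISSIONS_RE.search(parser_output)
    rejected = REJECTED_RE.search(parser_output)
    if not submissions or not rejected:
        raise ValueError("expected messages not found in parser output")
    total = int(submissions.group(1))
    if total == 0:
        raise ValueError("parser reported zero submissions")
    pct = 100.0 * int(rejected.group(1)) / total
    return pct, pct > threshold
```

If either message is missing, the sketch raises an error instead of guessing, in line with the rule not to guess missing values.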
+
+#### 4b. Gold question performance
+
+Read `detailed_gold_question_performance.csv` from the working directory.
+
+- Look for columns matching `wrong*` — these indicate how many times each gold
+  clip received a wrong answer.
+- Look for columns matching `url*` — these identify the gold clip URLs.
+- **Any row where the sum of `wrong*` columns > 0** means that gold clip received
+  at least one wrong answer.
+- Calculate the wrong-answer rate per gold clip:
+  `gold_rejection_rate = wrong_count / total_times_shown * 100`
+
+**⚠️ If any gold clip receives a wrong answer more than 20% of the time**:
+flag as alarming. Ask the
+user to check that clip and verify the expected answer is correct. It may
+indicate a bad gold clip rather than bad workers.
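
A possible Python sketch of the per-clip check follows. The runbook names only the `wrong*` and `url*` column patterns, so the column that holds the number of presentations is an assumption: it is passed in as `total_column` and must be replaced with the real column name from `detailed_gold_question_performance.csv`. The function name `gold_clip_rates` is also hypothetical.

```python
import csv


def gold_clip_rates(csv_path, total_column, threshold=20.0):
    """Compute the wrong-answer rate for each gold clip (one clip per row).

    `total_column` is an assumed column holding how often the clip was shown.
    """
    results = []
    with open(csv_path, newline="", encoding="utf-8-sig") as fh:
        for row in csv.DictReader(fh):
            # Sum every column whose name starts with "wrong"; empty cells count as 0.
            wrong = sum(
                float(v) for k, v in row.items()
                if k and k.startswith("wrong") and v
            )
            urls = [v for k, v in row.items() if k and k.startswith("url") and v]
            total = float(row.get(total_column) or 0)
            if total <= 0:
                continue  # clip never shown, rate undefined
            rate = 100.0 * wrong / total
            results.append({
                "urls": urls,
                "wrong": wrong,
                "total": total,
                "rate": rate,
                "alarming": rate > threshold,
            })
    return results
```

Rows with `alarming` set to `True` are the clips to raise with the user.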
+
+#### 4c. Summary to present
+
+Provide the user with a structured summary:
+
+```
+📊 Result Parser Summary
+─────────────────────────
+Method:              [method]
+Total submissions:   [N]
+Rejected:            [X] ([%]%)
+Accepted & used:     [Y] ([%]%)
+─────────────────────────
+⚠️ Alerts:           [any alarming findings]
+```
+
+### 5. Point user to output files
+
+After analysis, direct the user to the key output files:
+
+| File pattern | Purpose |
+|-------------|---------|
+| `Batch_XXX_votes_per_clip_[SCALE].csv` | Per-clip MOS ratings for each scale |
+| `Batch_XXX_votes_per_clip_all-scales.csv` | Aggregated per-clip ratings across all scales (multi-scale methods like P.804, P.835) |
+| `Batch_XXX_votes_per_worker_[SCALE].csv` | Per-worker rating statistics |
+| `Batch_XXX_all_votes_per_clip.csv` | All individual votes per clip (identified by `all_votes` in the file name) |
+| `Batch_XXX_data_cleaning_report.csv` | Detailed per-submission data cleaning report |
+| `detailed_gold_question_performance.csv` | Per-gold-clip acceptance/rejection statistics |
+| `Batch_XXX_quantity_bonus_report.csv` | Quantity bonus calculations |
+
+**Scale suffixes by method:**
+
+| Method | Scales |
+|--------|--------|
+| `acr` | `_mos` |
+| `dcr` | `_dmos` |
+| `ccr` | `_cmos` |
+| `p835` | `_sig`, `_bak`, `_ovrl` + `all-scales` |
+| `p804` | `_noi`, `_col`, `_dis`, `_loud`, `_reverb`, `_sig`, `_ovrl` + `all-scales` |
+| `echo_impairment_test` | `_echo` |
+
+### 6. Handle follow-up questions
+
+If the user asks why specific submissions were rejected or not used:
+- Direct them to `Batch_XXX_data_cleaning_report.csv`.
+- Key columns: `accept` (1 = accepted), `accept_and_use` (1 = used for
+  aggregation), `failures` (reasons for rejection/exclusion).
+- Common rejection reasons: `gold` (failed gold question), `variance` (low
+  rating variance), `comparisons` (failed pair comparisons), `performance`
+  (overall rater performance below threshold).
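
To answer such questions quickly, the report can be tallied with a short Python sketch. It uses the `accept`, `accept_and_use`, and `failures` columns named above. The format of the `failures` cell is not specified here, so the sketch assumes that each reason appears as a substring of that cell; the function name `summarize_cleaning_report` is hypothetical.

```python
import csv
from collections import Counter

# Rejection reasons listed in this runbook.
KNOWN_REASONS = ("gold", "variance", "comparisons", "performance")


def summarize_cleaning_report(csv_path):
    """Count accepted/used submissions and how often each known reason occurs."""
    reasons = Counter()
    total = accepted = used = 0
    with open(csv_path, newline="", encoding="utf-8-sig") as fh:
        for row in csv.DictReader(fh):
            total += 1
            if (row.get("accept") or "").strip() in ("1", "1.0"):
                accepted += 1
            if (row.get("accept_and_use") or "").strip() in ("1", "1.0"):
                used += 1
            # Assumption: reasons appear as substrings of the failures cell.
            failures = (row.get("failures") or "").lower()
            for reason in KNOWN_REASONS:
                if reason in failures:
                    reasons[reason] += 1
    return {
        "total": total,
        "accepted": accepted,
        "used": used,
        "reasons": dict(reasons),
    }
```

The resulting counts show which check removed the most submissions, which is usually the first thing to look at when the rejection rate is high.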