fix: deterministic audit metrics via run_summary.json cache and workflow-logs/ exclusion #26148
# ADR-26148: Deterministic Audit Metrics via run_summary.json Cache and workflow-logs/ Exclusion

**Date**: 2026-04-14
**Status**: Draft
**Deciders**: pelikhan, Copilot

---

## Part 1 — Narrative (Human-Friendly)
### Context

The `audit` command reported wildly inconsistent `token_usage` and `turns` on repeated invocations for the same workflow run (observed: 9 turns / 381k tokens on one call, 22 turns / 4.7M tokens on another). Two compounding bugs caused this: (1) `AuditWorkflowRun` unconditionally re-processed all local log files on every call, even when a fully-computed `run_summary.json` was already on disk; and (2) the log-file walk in `extractLogMetrics` did not exclude the `workflow-logs/` directory, which `downloadWorkflowRunLogs` populates with GitHub Actions step output — files that capture the same agent stdout already present in the agent artifact logs, inflating token counts by approximately 12×.
### Decision

We will adopt a **cache-first strategy** for `AuditWorkflowRun`: before performing any API calls or log processing, check whether a valid `run_summary.json` exists on disk (validated by CLI version). If a cache hit is found, reconstruct `ProcessedRun` from the cached summary and return immediately via a shared `renderAuditReport` helper, bypassing all re-download and re-parse logic. We will additionally **exclude the `workflow-logs/` directory** from the `extractLogMetrics` log walk by returning `filepath.SkipDir` whenever the walk visits a directory named `workflow-logs`, preventing GitHub Actions runner captures from being counted as agent artifact data. Together, these two changes ensure that repeated `audit` calls for the same run produce identical metrics.
### Alternatives Considered

#### Alternative 1: Invalidate and Overwrite the Cache on Every Call

Rather than treating the cached `run_summary.json` as authoritative, re-process logs on every call and overwrite the cache. This would keep the cache "fresh" but would perpetuate the inconsistency problem: log re-processing can produce different values depending on which files are present at the time (e.g., if `workflow-logs/` has been populated between calls). This was rejected because consistency of audit metrics across repeated calls is the primary requirement.

#### Alternative 2: Exclude `workflow-logs/` Files by Name Pattern Instead of Directory Skip

Rather than skipping the entire `workflow-logs/` directory with `filepath.SkipDir`, selectively exclude individual files whose names match known GitHub Actions runner-log patterns (e.g., `*_Run log step.txt`). This would be fragile: GitHub Actions file naming conventions can change, and any unrecognized file would silently inflate metrics again. Skipping the entire directory by name is simpler, more robust, and aligns with how `downloadWorkflowRunLogs` places its output.

#### Alternative 3: Store Canonical Metrics in a Separate Lock File

Record only the metrics (token usage, turns) in a dedicated lock file separate from `run_summary.json`, and read that lock file on subsequent calls. This adds file-system complexity without meaningful benefit over reusing the existing `run_summary.json` structure. The current `loadRunSummary` already performs CLI-version validation, providing a clean automatic invalidation mechanism.
### Consequences

#### Positive

- Repeated `audit` calls for the same run are now deterministic and produce identical output.
- The cache-hit path avoids all API calls and file re-parsing, making subsequent audits significantly faster.
- The `renderAuditReport` helper function eliminates the duplicated render + finalization logic that previously existed in both the fresh-download and (now) cache-hit code paths.
- Cache invalidation on CLI upgrade is automatic via the existing `CLIVersion` check in `loadRunSummary`.

#### Negative

- The first successful `audit` call becomes the canonical source of truth. If log files were incomplete on the first run (e.g., partial download), the cached metrics will be wrong until the cache is manually cleared or the CLI is upgraded.
- The `workflow-logs/` exclusion is a directory-name-based heuristic. If `downloadWorkflowRunLogs` ever changes the output directory name, the exclusion silently stops working.
- Adding a new top-level helper (`renderAuditReport`) increases the surface area of the package's internal API.

#### Neutral

- The `run_summary.json` format is unchanged; only the read/write ordering is adjusted (save-before-render in the fresh-download path).
- Existing tests for `loadRunSummary` and `saveRunSummary` remain valid; new regression tests were added for the cache-hit path and the `workflow-logs/` exclusion.
---

## Part 2 — Normative Specification (RFC 2119)

> The key words **MUST**, **MUST NOT**, **REQUIRED**, **SHALL**, **SHALL NOT**, **SHOULD**, **SHOULD NOT**, **RECOMMENDED**, **MAY**, and **OPTIONAL** in this section are to be interpreted as described in [RFC 2119](https://www.rfc-editor.org/rfc/rfc2119).
### Cache-First Audit Strategy

1. Implementations **MUST** check for a valid `run_summary.json` on disk before initiating any API calls or log-file processing in `AuditWorkflowRun`.
2. Implementations **MUST** treat a cache hit (valid `run_summary.json` with matching `CLIVersion`) as the authoritative source of metrics and return immediately without re-processing logs.
3. Implementations **MUST NOT** overwrite an existing `run_summary.json` when serving a cache hit; the cached file **MUST** remain unmodified.
4. Implementations **MUST** persist `run_summary.json` to disk before calling the render step in the fresh-download path, so that a render failure does not prevent future cache hits.
5. Implementations **SHOULD** log a message (at the appropriate verbosity level) indicating that a cached summary is being used, including the original `ProcessedAt` timestamp.
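The validation rule in item 2 can be sketched as follows. This is a minimal illustration, not the gh-aw implementation: the names `runSummary`, `currentCLIVersion`, and `validCacheHit` are hypothetical stand-ins for the version check that the real code performs inside `loadRunSummary` against `GetVersion()`.

```go
package main

import "fmt"

// runSummary mirrors only the fields relevant to cache validation
// (illustrative; the real RunSummary struct carries many more fields).
type runSummary struct {
	CLIVersion string
	TokenUsage int
}

// currentCLIVersion stands in for GetVersion() in the real code.
const currentCLIVersion = "1.4.0"

// validCacheHit reports whether a previously saved summary may be reused.
// A version mismatch forces a full re-process, which is how cache
// invalidation on CLI upgrade happens automatically.
func validCacheHit(s *runSummary) bool {
	return s != nil && s.CLIVersion == currentCLIVersion
}

func main() {
	fresh := &runSummary{CLIVersion: "1.4.0", TokenUsage: 381000}
	stale := &runSummary{CLIVersion: "1.3.9", TokenUsage: 381000}
	fmt.Println(validCacheHit(fresh)) // true: serve cached metrics as-is
	fmt.Println(validCacheHit(stale)) // false: re-process logs, rewrite cache
}
```

Because the version string is written into `run_summary.json` at save time, no explicit "clear cache" command is needed for upgrades; the mismatch alone triggers a fresh processing pass.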
### Log Metric Extraction

1. Implementations **MUST** skip the `workflow-logs/` directory (and its contents) when walking the run output directory in `extractLogMetrics`.
2. Implementations **MUST** use `filepath.SkipDir` (or equivalent) to exclude the entire `workflow-logs/` subtree, not individual files within it.
3. Implementations **MUST NOT** include token-usage data found in `workflow-logs/` in the `LogMetrics.TokenUsage` or `LogMetrics.Turns` totals.
4. Implementations **MAY** log a debug message when skipping the `workflow-logs/` directory to aid in future diagnostics.
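Requirements 1 and 2 above can be sketched with a self-contained walk. The helper name `collectLogFiles` and the directory layout are hypothetical; the `filepath.SkipDir` mechanism is the one the ADR mandates.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// collectLogFiles walks root and returns every file path except those
// under any directory named "workflow-logs". Returning filepath.SkipDir
// for the directory entry prunes the entire subtree in one step, so no
// per-file pattern matching is needed.
func collectLogFiles(root string) ([]string, error) {
	var files []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() && d.Name() == "workflow-logs" {
			return filepath.SkipDir // skip the whole subtree, not file by file
		}
		if !d.IsDir() {
			files = append(files, path)
		}
		return nil
	})
	return files, err
}

func main() {
	// Build a throwaway run directory with one agent log and one
	// GitHub Actions runner capture under workflow-logs/.
	root, _ := os.MkdirTemp("", "run")
	defer os.RemoveAll(root)
	os.MkdirAll(filepath.Join(root, "workflow-logs"), 0o755)
	os.WriteFile(filepath.Join(root, "agent.log"), []byte("x"), 0o644)
	os.WriteFile(filepath.Join(root, "workflow-logs", "1_Run log step.txt"), []byte("x"), 0o644)

	files, _ := collectLogFiles(root)
	fmt.Println(len(files)) // prints 1 — only agent.log is counted
}
```

The same skip applies at any depth, which is what makes the exclusion robust against GitHub changing individual file names inside `workflow-logs/` (the concern raised under Alternative 2).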
### Shared Render Path

1. Implementations **MUST** use a single shared function (currently `renderAuditReport`) to build and emit the audit report, invoked by both the cache-hit path and the fresh-download path.
2. The shared render function **MUST NOT** re-extract metrics from log files; it **MUST** use only the metrics passed to it by the caller.
### Conformance

An implementation is considered conformant with this ADR if it satisfies all **MUST** and **MUST NOT** requirements above. Failure to meet any **MUST** or **MUST NOT** requirement constitutes non-conformance.

---

*This is a DRAFT ADR generated by the [Design Decision Gate](https://github.com/github/gh-aw/actions/runs/24396807146) workflow. The PR author must review, complete, and finalize this document before the PR can merge.*
---

## Implementation (Go diff excerpts)

The second file in the PR applies these changes to `AuditWorkflowRun`. Hunk headers are kept as comments; indentation is reconstructed from context.

```go
// @@ -189,6 +189,41 @@ func AuditWorkflowRun(ctx context.Context, runID int64, owner, repo, hostname st
		return auditJobRun(runID, jobID, stepNumber, owner, repo, hostname, runOutputDir, verbose, jsonOutput)
	}

	// Use cached run summary when available to ensure deterministic metrics across repeated calls.
	// Re-processing the same log files can produce different results (e.g. when GitHub's API
	// returns aggregated data that differs from the locally-stored firewall logs), so we always
	// prefer the first fully-processed summary written to disk. The cache is automatically
	// invalidated whenever the CLI version changes (see loadRunSummary).
	if summary, ok := loadRunSummary(runOutputDir, verbose); ok {
		auditLog.Printf("Using cached run summary for run %d (processed at %s)", runID, summary.ProcessedAt.Format(time.RFC3339))
		if verbose {
			fmt.Fprintln(os.Stderr, console.FormatInfoMessage(fmt.Sprintf("Using cached run summary for run %d (processed at %s)", runID, summary.ProcessedAt.Format(time.RFC3339))))
		}
		processedRun := ProcessedRun{
			Run:                     summary.Run,
			AwContext:               summary.AwContext,
			TaskDomain:              summary.TaskDomain,
			BehaviorFingerprint:     summary.BehaviorFingerprint,
			AgenticAssessments:      summary.AgenticAssessments,
			AccessAnalysis:          summary.AccessAnalysis,
			FirewallAnalysis:        summary.FirewallAnalysis,
			PolicyAnalysis:          summary.PolicyAnalysis,
			RedactedDomainsAnalysis: summary.RedactedDomainsAnalysis,
			MissingTools:            summary.MissingTools,
			MissingData:             summary.MissingData,
			Noops:                   summary.Noops,
			MCPFailures:             summary.MCPFailures,
			TokenUsage:              summary.TokenUsage,
			GitHubRateLimitUsage:    summary.GitHubRateLimitUsage,
			JobDetails:              summary.JobDetails,
		}
		// Override the cached LogsPath with the current runOutputDir so that downstream
		// file reads (created items, aw_info, etc.) resolve correctly even if the run
		// directory has been moved or copied since the summary was first written.
		processedRun.Run.LogsPath = runOutputDir
		return renderAuditReport(processedRun, summary.Metrics, summary.MCPToolUsage, runOutputDir, owner, repo, hostname, verbose, parse, jsonOutput)
	}

	// Check if we have locally cached artifacts first
	hasLocalCache := fileutil.DirExists(runOutputDir) && !fileutil.IsDirEmpty(runOutputDir)

// @@ -416,6 +451,50 @@ func AuditWorkflowRun(ctx context.Context, runID int64, owner, repo, hostname st
	processedRun.BehaviorFingerprint = behaviorFingerprint
	processedRun.AgenticAssessments = agenticAssessments

	// Save run summary for caching future audit runs
	summary := &RunSummary{
		CLIVersion:              GetVersion(),
		RunID:                   run.DatabaseID,
		ProcessedAt:             time.Now(),
		Run:                     run,
		Metrics:                 metrics,
		AwContext:               processedRun.AwContext,
		TaskDomain:              processedRun.TaskDomain,
		BehaviorFingerprint:     processedRun.BehaviorFingerprint,
		AgenticAssessments:      processedRun.AgenticAssessments,
		AccessAnalysis:          accessAnalysis,
		FirewallAnalysis:        firewallAnalysis,
		PolicyAnalysis:          policyAnalysis,
		RedactedDomainsAnalysis: redactedDomainsAnalysis,
		MissingTools:            missingTools,
		MissingData:             missingData,
		Noops:                   noops,
		MCPFailures:             mcpFailures,
		MCPToolUsage:            mcpToolUsage,
		TokenUsage:              tokenUsageSummary,
		GitHubRateLimitUsage:    rateLimitUsage,
		ArtifactsList:           artifacts,
		JobDetails:              jobDetails,
	}

	if err := saveRunSummary(runOutputDir, summary, verbose); err != nil {
		if verbose {
			fmt.Fprintln(os.Stderr, console.FormatWarningMessage(fmt.Sprintf("Failed to save run summary: %v", err)))
		}
	}

	return renderAuditReport(processedRun, metrics, mcpToolUsage, runOutputDir, owner, repo, hostname, verbose, parse, jsonOutput)
}

// renderAuditReport builds and renders the audit report from a fully-populated processedRun.
// It is called both when serving from a cached run summary and after a fresh processing pass,
// ensuring that the two paths produce identical output.
func renderAuditReport(processedRun ProcessedRun, metrics LogMetrics, mcpToolUsage *MCPToolUsageData, runOutputDir string, owner, repo, hostname string, verbose bool, parse bool, jsonOutput bool) error {
	runID := processedRun.Run.DatabaseID

	currentCreatedItems := extractCreatedItemsFromManifest(runOutputDir)
	processedRun.Run.SafeItemsCount = len(currentCreatedItems)

	currentSnapshot := buildAuditComparisonSnapshot(processedRun, currentCreatedItems)
	comparison := buildAuditComparisonForRun(processedRun, currentSnapshot, runOutputDir, owner, repo, hostname, verbose)
	// ... (renderAuditReport continues)
```

A third hunk (`@@ -474,38 +553,6 @@`) deletes the original save-summary block from its old position after the render step; the deleted lines are identical to the save-summary code shown above, and the surrounding context (the `// Display logs location (only for console output)` branch with `absOutputDir, _ := filepath.Abs(runOutputDir)`) is unchanged.
**Review comment** (on lines +192 to +224): On the cache-hit path, `processedRun.Run` is taken verbatim from `summary.Run`, including `LogsPath`. Because `LogsPath` is persisted in `run_summary.json` (often as an absolute path), loading a cached summary from a copied/moved run directory can leave `processedRun.Run.LogsPath` pointing at the original machine/location. That will break downstream file reads in `buildAuditData` (downloaded files, created items, aw_info parsing, etc.). Consider overriding `processedRun.Run.LogsPath` (and/or `summary.Run.LogsPath`) to the current `runOutputDir` before calling `renderAuditReport` so cached summaries remain portable and consistent with the fresh-download path.