Skip to content

Redact client secrets and shell-stdin literal transcripts#1203

Open
peyton-alt wants to merge 4 commits into
mainfrom
redact-2
Open

Redact client secrets and shell-stdin literal transcripts#1203
peyton-alt wants to merge 4 commits into
mainfrom
redact-2

Conversation

@peyton-alt
Copy link
Copy Markdown
Contributor

@peyton-alt peyton-alt commented May 13, 2026

https://entire.io/gh/entireio/cli/trails/372

Summary

Closes structural redaction gaps in transcript content:

  • *_CLIENT_SECRET, *_API_KEY, *_TOKEN env-shape redaction in both raw text and JSONL structured fields (e.g. GITHUB_CLIENT_SECRET=… was previously leaking).
  • Shell-stdin literal pattern: printf 'value' | examplectl secret put EXAMPLE_API_KEY is now redacted. These secrets are often human-ish (correct-horse-client) and below the entropy threshold, so entropy + Betterleaks couldn't catch them — the structural shape is the signal.
  • Identifier env-shape guard: *_id / *_account keys with non-high-entropy values are preserved (so GOOGLE_ADSENSE_ACCOUNT=pub-1234567890123456 no longer collapses to REDACTED by the entropy detector). Values that independently look secret-like still redact.
  • \$VAR references in secret-put commands continue to be treated as placeholders, not values.

The regex tier stays the deterministic, structural layer — important for layered composition with future ML-based redaction (e.g. OpenAI Privacy Filter), which is strong on free-text PII but lists "missed secrets for novel credential formats" and "over-redaction of benign high-entropy strings" as known failure modes.

Test plan

  • go test ./redact -count=1 — full package, all green
  • mise run fmt && mise run lint — 0 issues
  • mise run test:ci — unit + integration + Vogon canary (58/58) + roger-roger canary (4/4), exit 0
  • New tests: github_client_secret_env_var, google_adsense_account_is_preserved_as_identifier, TestJSONLContent_StructuredCredentialFieldsRedacted (extended with GITHUB_CLIENT_SECRET), TestJSONLContent_ShellStdinSecretCommandRedactsPrintfLiteral, TestJSONLContent_ShellStdinSecretCommandOverRedactionGuards

🤖 Generated with Claude Code


Note

Medium Risk
Expands and reorders redaction heuristics with new regexes and guard logic, which could introduce false positives/negatives in transcript redaction. Changes are localized to redaction behavior but affect all logged/serialized content.

Overview
Extends deterministic redaction beyond DB-password keys to also redact secret-shaped keys (e.g. *_CLIENT_SECRET, *_API_KEY, *_TOKEN) in both raw text key=value/key:value forms and structured JSON fields.

Adds a new detector for shell transcripts that pipe a literal into secret-management commands (e.g. printf '...' | ... secrets put ...), redacting only the stdin literal while preserving command context and treating $VAR references as placeholders.

Tightens entropy-based redaction by skipping low-entropy identifier assignments like *_id/*_account unless the value itself looks like a high-entropy secret. Updates/extends tests to cover the new secret shapes, shell-stdin redaction, and the new over-redaction guard.

Reviewed by Cursor Bugbot for commit 4018d6a. Configure here.

Adds three structural redaction rules to close transcript leak gaps:

- CLIENT_SECRET / API_KEY / TOKEN env-shape values, in both raw text
  and JSONL structured fields (e.g. GITHUB_CLIENT_SECRET=...).
- Shell-stdin literals piped into secret-management commands such as
  `printf 'value' | examplectl secret put EXAMPLE_API_KEY`, which are
  often human-ish and below the entropy threshold.
- Identifier env-var shapes (`*_id`, `*_account`) are preserved when
  the value is not independently high-entropy, preventing benign IDs
  like `GOOGLE_ADSENSE_ACCOUNT=pub-...` from being collapsed to
  `REDACTED` by the entropy detector.

Shell variable references like `\$EXAMPLE_API_KEY` continue to be
treated as placeholders rather than secret values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 7daa370e8203
@peyton-alt peyton-alt requested a review from a team as a code owner May 13, 2026 20:56
Copilot AI review requested due to automatic review settings May 13, 2026 20:56
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR strengthens transcript redaction in the redact package to close three structural gaps: env-shape secret keys (*_CLIENT_SECRET, *_API_KEY, *_TOKEN) in raw text and JSONL fields, shell-stdin literals piped into secret-management CLIs (e.g. printf 'value' | examplectl secret put ...), and an over-redaction guard so *_id/*_account identifiers with non-high-entropy values aren't collapsed by the entropy detector. $VAR references continue to be treated as placeholders.

Changes:

  • Add secretValueKeyShape/secretValuePattern/secretJSONKeyRegex and a shellStdinSecretPattern regex; wire them into String/detectCredentialValues/detectShellStdinSecrets and into the JSON key sensitivity check.
  • Add isNonSecretIdentifierAssignment + isHighEntropySecretCandidate to preserve *_id/*_account env assignments with non-entropic values, and isShellVariableReference to treat $VAR as a placeholder.
  • Extend unit tests covering GITHUB_CLIENT_SECRET, GOOGLE_ADSENSE_ACCOUNT, structured JSONL credential fields, and shell-stdin printf | secret put redaction plus over-redaction guards.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
redact/redact.go New regexes and helpers to detect env-shape secrets and shell-stdin literals, plus an identifier-assignment guard against entropy over-redaction.
redact/redact_test.go New tests for client-secret env vars, identifier preservation, structured JSONL credential fields, and shell-stdin `printf

@peyton-alt
Copy link
Copy Markdown
Contributor Author

@BugBot review

Comment thread redact/redact.go Outdated
Per the existing pattern documented on dbPasswordKeyShape — the same
key list should compose both the env-var regex and the JSON-key regex
so the vendor list stays in one place. The hand-written duplicate in
secretJSONKeyRegex would have silently diverged if a new key type were
added to one side but not the other. Behavior is unchanged: the regex
runs on normalized keys, where [_-]+ and _+ match identically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 20953e233268
@peyton-alt
Copy link
Copy Markdown
Contributor Author

@BugBot review

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 4018d6a. Configure here.

peyton-alt and others added 2 commits May 15, 2026 14:31
Following review feedback on this PR, four scoped changes narrow the
new patterns introduced earlier so they stop flat-redacting prose,
debug identifiers, and account fields:

- secretValuePattern now uses `=` only (not `[:=]`). The JSON-aware
  path handles `key: value` shapes via secretJSONKeyRegex, so dropping
  `:` from the raw-text regex prevents prose like "Got the token: foo"
  from over-redacting the next word.

- secretValueKeyShape splits into two alternation branches so bare
  `secret`/`token` keywords require at least one prefix segment (e.g.
  csrf_token, auth_secret). Standalone `{"token":"x"}` /
  `{"secret":"x"}` JSON fields are no longer flat-redacted; entropy
  and Betterleaks still catch real high-entropy values.

- New nonSecretTokenPrefixes allowlist (cancel, continuation, cursor,
  idempotency, next, page, pagination, prev, previous, sync) is
  checked in isCredentialJSONSecretKey so debug/pagination tokens like
  `{"cancel_token":"abc-debug"}` and `{"pagination_token":"..."}` are
  preserved in transcripts. Real credential-shaped values still get
  caught by the per-value String() call.

- shouldSkipJSONLField now also skips keys ending in `account`, so
  identifier fields like `{"google_adsense_account":"pub-..."}` are
  preserved by design rather than by entropy coincidence.

Regression tests cover the three previously-leaking prose cases, the
seven JSON token-key cases, and a high-entropy account value to prove
the protection is structural rather than entropy-dependent. The PR's
original coverage of `printf 'value' | examplectl secret put KEY` and
structured credential field redaction is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 02c170aecfae
Three tests in manifest_test.go hardcoded `started := time.Date(2026, 5, 8, ...)`
as the session.State.StartedAt anchor. session.StateStore.Load auto-deletes
sessions whose StartedAt is older than StaleSessionThreshold (7 days) and
returns nil, so the hardcoded date silently rots: tests pass while the
calendar is inside the 7-day window, then fail forever once it crosses.

CI on PR #964 caught this — same SHA passed yesterday (6 days after the
hardcoded date) and failed today (7 days after). Unrelated to the
streaming/diagnostic work in this PR; the manifest_test.go file isn't
touched by any other commit on this branch.

Switch all three tests to `time.Now().UTC().Add(-time.Hour)` so the
session is always one hour old at test time. Still exercises the
5-second jitter check inside matchReviewSessionState; stays well inside
the staleness window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants