Desktop: add Gemini thinking budget controls to cut API costs ~50% #7159

Open
beastoin wants to merge 8 commits into main from worktree-gemini-thinking-budget
@beastoin
Collaborator

@beastoin beastoin commented May 4, 2026

Summary

Cut Gemini 2.5 Flash thinking token costs by setting explicit thinkingBudget=0 on all production extraction/classification paths in the desktop macOS app, and adding defense-in-depth budget injection at the Rust proxy layer.

Problem

  • Gemini 2.5 Flash thinking output tokens cost about 5.8x as much as regular output tokens ($3.50/M vs $0.60/M)
  • Thinking tokens accounted for 65% of daily Gemini spend (~$513/day)
  • Without explicit thinkingConfig, Gemini defaults to unlimited thinking
  • All desktop Gemini usage is extraction/classification that doesn't need chain-of-thought reasoning
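
The price gap quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch: the two prices come from this description, while the per-call token counts in `main` are hypothetical and chosen only to illustrate how thinking tokens can dominate the output cost of a short extraction response.

```rust
// Prices quoted in the PR description, in USD per million output tokens.
const THINKING_PRICE_PER_M: f64 = 3.50;
const REGULAR_PRICE_PER_M: f64 = 0.60;

/// Output-side cost in USD of one response with the given token counts.
fn output_cost_usd(thinking_tokens: u64, regular_tokens: u64) -> f64 {
    (thinking_tokens as f64 * THINKING_PRICE_PER_M
        + regular_tokens as f64 * REGULAR_PRICE_PER_M)
        / 1_000_000.0
}

fn main() {
    // 3.50 / 0.60 is roughly 5.83, which the PR rounds to 5.8x.
    let ratio = THINKING_PRICE_PER_M / REGULAR_PRICE_PER_M;
    println!("thinking/regular price ratio: {:.2}", ratio);

    // Hypothetical call: 2,000 thinking tokens plus 200 regular output tokens.
    let with_thinking = output_cost_usd(2_000, 200);
    // Same call with thinkingBudget=0, so no thinking tokens are billed.
    let without_thinking = output_cost_usd(0, 200);
    println!("with thinking:    ${:.5}", with_thinking);
    println!("without thinking: ${:.5}", without_thinking);
}
```
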

Changes

Swift client (GeminiClient.swift):

  • Added ThinkingConfig struct with model-aware minimumBudget(for:) — Flash=0, Pro=128
  • All 4 production methods now clamp thinkingBudget to model minimum:
    • sendRequest (image+schema) — Memory, Focus, Onboarding
    • sendTextRequest (text only) — LiveNotes, Goals, Profile, PTT
    • sendRequest (text+schema) — Prioritization, Dedup, Goals
    • sendImageToolLoop (image+tools) — TaskAssistant, InsightAssistant
  • Removed 5 unused methods and 3 unused structs (685 lines of dead code):
    • sendChatStreamRequest, sendToolChatRequest, continueWithToolResults, sendImageToolRequest, continueImageToolRequest
    • GeminiChatRequest, GeminiStreamChunk, GeminiToolChatRequest
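
The model-aware clamp described above is small enough to sketch. The actual implementation lives in Swift (`ThinkingConfig.minimumBudget(for:)`); the version below is written in Rust only to stay in one language with the proxy examples, and it is an illustration rather than the PR's code. The function names, the model identifier strings, and the rule "a model name containing `pro` is a Pro model" are assumptions of this sketch.

```rust
/// Lowest thinkingBudget the model accepts, per the PR: Flash = 0, Pro = 128.
/// Detecting Pro models by a "pro" substring is an assumption of this sketch.
fn minimum_budget(model: &str) -> u32 {
    if model.to_lowercase().contains("pro") {
        128
    } else {
        0
    }
}

/// Raise the requested budget to the model's minimum if it is too low.
fn clamped_budget(model: &str, requested: u32) -> u32 {
    requested.max(minimum_budget(model))
}

fn main() {
    // Illustrative model names and requested budgets.
    let cases = [
        ("gemini-2.5-flash", 0),
        ("gemini-2.5-pro", 0),
        ("gemini-2.5-pro", 512),
    ];
    for (model, requested) in cases {
        println!(
            "{model}: requested={requested} effective={}",
            clamped_budget(model, requested)
        );
    }
}
```

With this rule a caller can always ask for budget 0 and let the clamp raise it to 128 on Pro models, which matches the "Flash=0, Pro=128" behavior listed above.
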

Rust proxy (proxy.rs):

  • Defense-in-depth: sanitize_gemini_body() injects thinkingConfig(budget=1024) when client omits it
  • Creates generationConfig entirely when absent (caps legacy clients with no generation_config)
  • Handles both snake_case and camelCase field names
  • 8 new tests: injection, preservation, embed skip, missing config, dual casing, null/string config
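
The injection rules in this list can be sketched without any dependencies. This is not the code in proxy.rs, which uses serde_json: the `Json` enum, the function names, and the choice to drop a malformed snake_case `generation_config` before creating a fresh `generationConfig` are assumptions made for illustration. Only the default of 1024, the two key casings, and the "create when absent, null, or a string" behavior come from this description.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a JSON value so the sketch needs no external crates.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Num(i64),
    Str(String),
    Obj(BTreeMap<String, Json>),
}

const DEFAULT_THINKING_BUDGET: i64 = 1024;

fn default_thinking_config() -> Json {
    let mut tc = BTreeMap::new();
    tc.insert("thinkingBudget".to_string(), Json::Num(DEFAULT_THINKING_BUDGET));
    Json::Obj(tc)
}

/// Ensure the request body carries a thinking budget.
fn inject_thinking_budget(body: &mut BTreeMap<String, Json>) {
    let mut found_object = false;
    for key in ["generation_config", "generationConfig"] {
        if let Some(Json::Obj(gc)) = body.get_mut(key) {
            found_object = true;
            // Respect a client-supplied config in either casing.
            let has_thinking =
                gc.contains_key("thinking_config") || gc.contains_key("thinkingConfig");
            if !has_thinking {
                gc.insert("thinkingConfig".to_string(), default_thinking_config());
            }
        }
    }
    if !found_object {
        // Key missing, null, or not an object: replace with a fresh config.
        body.remove("generation_config");
        let mut gc = BTreeMap::new();
        gc.insert("thinkingConfig".to_string(), default_thinking_config());
        body.insert("generationConfig".to_string(), Json::Obj(gc));
    }
}

/// Read back the effective budget from either casing (demo helper).
fn thinking_budget(body: &BTreeMap<String, Json>) -> Option<i64> {
    for key in ["generation_config", "generationConfig"] {
        if let Some(Json::Obj(gc)) = body.get(key) {
            for tkey in ["thinking_config", "thinkingConfig"] {
                if let Some(Json::Obj(tc)) = gc.get(tkey) {
                    for bkey in ["thinking_budget", "thinkingBudget"] {
                        if let Some(Json::Num(n)) = tc.get(bkey) {
                            return Some(*n);
                        }
                    }
                }
            }
        }
    }
    None
}

fn main() {
    // Legacy-style body: contents only, no generation config at all.
    let mut legacy = BTreeMap::new();
    legacy.insert("contents".to_string(), Json::Str("hello".to_string()));
    inject_thinking_budget(&mut legacy);
    println!("legacy body budget: {:?}", thinking_budget(&legacy));

    // Malformed body: generation_config is null.
    let mut malformed = BTreeMap::new();
    malformed.insert("generation_config".to_string(), Json::Null);
    inject_thinking_budget(&mut malformed);
    println!("null config budget: {:?}", thinking_budget(&malformed));
}
```

The fallback branch matters most: without it, a contents-only request passes through with no cap at all.
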

Expected Impact

  • Eliminates ~100% of thinking token spend on current app (Flash: budget=0, Pro: budget=128)
  • Old app versions capped at 1024 tokens via proxy (vs unlimited before)
  • No impact on extraction quality — these paths are classification, not reasoning

Test plan

  • Swift builds clean (0 errors, 14.81s)
  • All 202 Rust proxy tests pass (8 thinking budget tests)
  • Model-aware budget: Pro gets 128 minimum, Flash gets 0
  • Proxy creates generationConfig when absent entirely
  • Edge cases: dual casing, null, string generation_config all handled
  • Monitor Gemini billing post-deploy for thinking token reduction

🤖 Generated with Claude Code

@greptile-apps
Contributor

greptile-apps Bot commented May 4, 2026

Greptile Summary

This PR adds ThinkingConfig with thinkingBudget to all Gemini request types in Swift (budget=0 for extraction, budget=4096 for chat) and adds a Rust proxy fallback that injects a default budget of 1024 when the client omits thinkingConfig. The cost-reduction rationale is sound, the Swift changes are clean, and 4 new Rust tests are included.

  • P1 — Proxy defense gap: The Rust injection only fires when a generation_config/generationConfig object is already present; requests that omit the key entirely bypass the cap, defeating the stated defense-in-depth contract.
  • P2 — ThinkingConfig key casing: Swift encodes as "thinking_budget" (snake_case) while the proxy injects "thinkingBudget" (camelCase) — worth aligning for consistency.

Confidence Score: 3/5

Safe to merge for immediate cost reduction, but the proxy defense-in-depth has a logic gap that should be fixed before relying on it as a safety net.

One P1 logic bug — proxy doesn't inject thinking budget when generation_config is absent — means the safety net is incomplete. All current Swift callers are protected since they now always set generationConfig, but the gap undermines the stated contract and creates risk for future callers.

desktop/Backend-Rust/src/routes/proxy.rs — the thinking budget injection block needs a fallback for requests that omit generation_config entirely.

Important Files Changed

| Filename | Overview |
| --- | --- |
| desktop/Backend-Rust/src/routes/proxy.rs | Adds DEFAULT_THINKING_BUDGET constant and injects thinkingConfig into generation_config when absent; injection is skipped entirely if generation_config is not present, leaving a gap in defense-in-depth. |
| desktop/Desktop/Sources/ProactiveAssistants/Core/GeminiClient.swift | Adds ThinkingConfig struct and wires thinkingBudget=0 to extraction calls and thinkingBudget=4096 to chat/streaming calls; responseMimeType correctly made optional; minor CodingKeys casing inconsistency. |
| desktop/CHANGELOG.json | Adds unreleased changelog entry for thinking budget controls; no issues. |

Sequence Diagram

sequenceDiagram
    participant SW as Swift Client
    participant PX as Rust Proxy
    participant GM as Gemini API

    Note over SW: Extraction call (Focus/Task/Memory)
    SW->>PX: POST generateContent budget=0
    PX->>PX: thinking_config present, skip injection
    PX->>GM: forward with budget=0
    GM-->>SW: response (no thinking tokens)

    Note over SW: Chat / streaming call
    SW->>PX: POST generateContent budget=4096
    PX->>PX: thinking_config present, skip injection
    PX->>GM: forward with budget=4096
    GM-->>SW: response (moderate thinking)

    Note over PX: Defense-in-depth path
    SW->>PX: POST generateContent, generation_config present, NO thinking_config
    PX->>PX: thinking_config absent, inject budget=1024
    PX->>GM: forward with injected budget=1024
    GM-->>SW: response (capped thinking)

    Note over PX,GM: Gap: if generation_config absent entirely, no injection occurs

Reviews (1): Last reviewed commit: "Add changelog entry for thinking budget ..."

Comment on lines +535 to +545
// Defense-in-depth: inject default thinking budget if client omits it.
// Gemini 2.5 Flash defaults to unlimited thinking which is 5.8x more
// expensive than regular output tokens. Cap at 1024 when absent.
let has_thinking = gc.contains_key("thinking_config")
    || gc.contains_key("thinkingConfig");
if !has_thinking {
    gc.insert(
        "thinkingConfig".to_string(),
        serde_json::json!({"thinkingBudget": DEFAULT_THINKING_BUDGET}),
    );
}
P1 Defense-in-depth bypass when generation_config is absent

The injection only fires when the request already contains a generation_config/generationConfig object. A request that omits the key entirely (valid Gemini API behavior — model uses defaults) skips this block, leaving thinking unlimited. The PR comment says "inject default budget=1024 when client omits thinkingConfig" but the actual contract is narrower: the budget is injected only when a generation_config exists without a thinkingConfig. Any future client call that forgets to set generationConfig bypasses the proxy's cost cap entirely, defeating the stated defense-in-depth goal.

The fix is to add a fallback after the loop: if neither generation_config nor generationConfig exists in the object, insert a new generation_config containing only the default thinkingConfig.

Comment on lines +11 to +13
enum CodingKeys: String, CodingKey {
    case thinkingBudget = "thinking_budget"
}
P2 thinking_budget key name inconsistency

Swift's ThinkingConfig maps thinkingBudget → "thinking_budget" (snake_case), while the Rust proxy injects "thinkingBudget" (camelCase). Both are accepted by Gemini's protobuf JSON layer today, but they're inconsistent with each other and could silently break if the API tightens JSON strictness.

Suggested change
- enum CodingKeys: String, CodingKey {
-     case thinkingBudget = "thinking_budget"
- }
+ enum CodingKeys: String, CodingKey {
+     case thinkingBudget = "thinkingBudget"
+ }
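
To make the casing point concrete, here is one way a test or a proxy check could treat the two spellings as the same field. This is a hedged sketch: `snake_to_camel` and `same_field` are hypothetical helpers written for illustration and are not part of this PR or of any Gemini SDK.

```rust
/// Convert a snake_case key to lowerCamelCase.
fn snake_to_camel(key: &str) -> String {
    let mut out = String::with_capacity(key.len());
    let mut upper_next = false;
    for ch in key.chars() {
        if ch == '_' {
            // Drop the underscore and capitalize the next character.
            upper_next = true;
        } else if upper_next {
            out.extend(ch.to_uppercase());
            upper_next = false;
        } else {
            out.push(ch);
        }
    }
    out
}

/// True when two keys name the same field regardless of casing convention.
fn same_field(a: &str, b: &str) -> bool {
    snake_to_camel(a) == snake_to_camel(b)
}

fn main() {
    println!("{}", snake_to_camel("thinking_budget"));
    println!("{}", same_field("generation_config", "generationConfig"));
}
```

Normalizing keys before comparison avoids scattering paired `contains_key` checks for each casing, though aligning the Swift encoder with the proxy, as suggested above, removes the need for it on this path.
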

@beastoin
Collaborator Author

beastoin commented May 6, 2026

PR #7159 Testing Friction Points (for @sora / workflow improvement)

1. Partial knowledge of beast omi dev tools

I didn't know about these commands until sora pointed them out mid-test:

  • beast omi dev auth-token <uid> — standalone dev token generator
  • beast omi dev doctor — environment health check
  • beast omi dev start — dev backend launcher
  • beast omi dev evidence — CP9 evidence capture

Impact: I manually built auth tokens from prod app instead of using the dev token generator, which caused a cascade of auth/project mismatch issues.

Suggestion: Add beast omi dev tool inventory to the desktop-app-walkthrough skill prerequisites or CP9 section of the PR workflow skill.

2. GoogleService-Info-Dev.plist points to prod project

Both GoogleService-Info.plist and GoogleService-Info-Dev.plist in the Desktop package use PROJECT_ID=based-hardware (prod). There is no config pointing to based-hardware-dev. This means:

  • Dev tokens generated for based-hardware-dev are rejected by the app's Firebase Auth
  • Auth injection from a dev-signed-in app fails because no app is signed into a dev Firebase project
  • Testing requires prod-compatible tokens, which conflicts with the dev backend expecting based-hardware-dev

Impact: Required swapping FIREBASE_PROJECT_ID in backend .env from based-hardware-dev to based-hardware to match the app's Firebase config.

3. Other blockers encountered

  • SwiftPM lock contention: run.sh uses a broad pgrep pattern that matches shell command strings containing SWIFT_BUILD_DIR, falsely detecting lock contention. Had to kill 3 stale processes (one 21hr old).
  • Missing framework copies in run.sh: ContentsquareCore.framework, onnxruntime.framework, and Sentry.framework are not copied by run.sh's bundle creation logic (lines 381-455), causing runtime crashes.
  • Resource bundle path: Binary rename without matching resource bundle causes Fatal error: could not load resource bundle.

None of these are code flaws in PR #7159 — they're environment/tooling gaps in the desktop dev workflow.

by AI for @beastoin

beastoin and others added 8 commits May 6, 2026 10:14
…ction

Gemini 2.5 Flash thinking output costs $3.50/M tokens vs $0.60/M regular
(5.8x). Without explicit thinkingConfig, the model defaults to unlimited
thinking on every call — representing 65% of daily Gemini spend.

- Add ThinkingConfig struct with thinkingBudget field
- Add thinkingConfig to all three GenerationConfig structs
- Add thinkingBudget parameter to all 6 public GeminiClient methods
- Proactive extraction (Focus, Task, Insight, Memory): budget=0 (no thinking)
- User-facing chat (streaming + tool-calling): budget=4096 (moderate thinking)
- Make responseMimeType optional in GeminiRequest.GenerationConfig

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Inject default thinkingConfig (budget=1024) in sanitize_gemini_body when
client omits it. Catches old app versions and any code path that bypasses
the Swift-side ThinkingConfig. Respects both snake_case and camelCase
existing configs. 4 new tests for injection, preservation, and embed skip.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…to all paths

5 unused methods removed (sendChatStreamRequest, sendToolChatRequest,
continueWithToolResults, sendImageToolRequest, continueImageToolRequest)
plus associated structs (GeminiChatRequest, GeminiStreamChunk,
GeminiToolChatRequest). 685 lines of dead code eliminated.

Added generationConfig with thinkingBudget=0 to GeminiImageToolRequest
so task extraction and insight tool loop paths explicitly disable
thinking tokens instead of relying on proxy default.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Proxy default stays at 1024 to cap old clients that don't send
thinkingConfig. Current Swift client explicitly sends budget=0
on all production paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ws 0)

Gemini 2.5 Pro requires minimum thinkingBudget=128 while Flash supports 0.
Added ThinkingConfig.minimumBudget(for:) that returns 128 for Pro models
and 0 for Flash. All methods now clamp budget to model minimum.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Old clients may send requests with no generation_config at all.
Previously the proxy only injected thinkingConfig into an existing
generation_config object. Now it creates generationConfig with the
default thinking budget when the key is missing entirely.

Added regression test for contents-only request body.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Tests for: dual generation_config casings, null generation_config,
string generation_config. All malformed cases get a fresh
generationConfig with default thinking budget.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@beastoin beastoin force-pushed the worktree-gemini-thinking-budget branch from d2c947f to fd46118 Compare May 6, 2026 10:14