Desktop: add Gemini thinking budget controls to cut API costs ~50%#7159
Desktop: add Gemini thinking budget controls to cut API costs ~50%#7159
Conversation
Greptile SummaryThis PR adds
Confidence Score: 3/5Safe to merge for immediate cost reduction, but the proxy defense-in-depth has a logic gap that should be fixed before relying on it as a safety net. One P1 logic bug — proxy doesn't inject thinking budget when generation_config is absent — means the safety net is incomplete. All current Swift callers are protected since they now always set generationConfig, but the gap undermines the stated contract and creates risk for future callers. desktop/Backend-Rust/src/routes/proxy.rs — the thinking budget injection block needs a fallback for requests that omit generation_config entirely. Important Files Changed
Sequence DiagramsequenceDiagram
participant SW as Swift Client
participant PX as Rust Proxy
participant GM as Gemini API
Note over SW: Extraction call (Focus/Task/Memory)
SW->>PX: POST generateContent budget=0
PX->>PX: thinking_config present, skip injection
PX->>GM: forward with budget=0
GM-->>SW: response (no thinking tokens)
Note over SW: Chat / streaming call
SW->>PX: POST generateContent budget=4096
PX->>PX: thinking_config present, skip injection
PX->>GM: forward with budget=4096
GM-->>SW: response (moderate thinking)
Note over PX: Defense-in-depth path
SW->>PX: POST generateContent, generation_config present, NO thinking_config
PX->>PX: thinking_config absent, inject budget=1024
PX->>GM: forward with injected budget=1024
GM-->>SW: response (capped thinking)
Note over PX,GM: Gap: if generation_config absent entirely, no injection occurs
Reviews (1): Last reviewed commit: "Add changelog entry for thinking budget ..." | Re-trigger Greptile |
| // Defense-in-depth: inject default thinking budget if client omits it. | ||
| // Gemini 2.5 Flash defaults to unlimited thinking which is 5.8x more | ||
| // expensive than regular output tokens. Cap at 1024 when absent. | ||
| let has_thinking = gc.contains_key("thinking_config") | ||
| || gc.contains_key("thinkingConfig"); | ||
| if !has_thinking { | ||
| gc.insert( | ||
| "thinkingConfig".to_string(), | ||
| serde_json::json!({"thinkingBudget": DEFAULT_THINKING_BUDGET}), | ||
| ); | ||
| } |
There was a problem hiding this comment.
Defense-in-depth bypass when
generation_config is absent
The injection only fires when the request already contains a generation_config/generationConfig object. A request that omits the key entirely (valid Gemini API behavior — model uses defaults) skips this block, leaving thinking unlimited. The PR comment says "inject default budget=1024 when client omits thinkingConfig" but the actual contract is narrower: the budget is injected only when a generation_config exists without a thinkingConfig. Any future client call that forgets to set generationConfig bypasses the proxy's cost cap entirely, defeating the stated defense-in-depth goal.
The fix is to add a fallback after the loop: if neither generation_config nor generationConfig exists in the object, insert a new generation_config containing only the default thinkingConfig.
| enum CodingKeys: String, CodingKey { | ||
| case thinkingBudget = "thinking_budget" | ||
| } |
There was a problem hiding this comment.
thinking_budget key name inconsistency
Swift's ThinkingConfig maps thinkingBudget → "thinking_budget" (snake_case), while the Rust proxy injects "thinkingBudget" (camelCase). Both are accepted by Gemini's protobuf JSON layer today, but they're inconsistent with each other and could silently break if the API tightens JSON strictness.
| enum CodingKeys: String, CodingKey { | |
| case thinkingBudget = "thinking_budget" | |
| } | |
| enum CodingKeys: String, CodingKey { | |
| case thinkingBudget = "thinkingBudget" | |
| } |
PR #7159 Testing Friction Points (for @sora / workflow improvement)1. Partial knowledge of
|
…ction Gemini 2.5 Flash thinking output costs $3.50/M tokens vs $0.60/M regular (5.8x). Without explicit thinkingConfig, the model defaults to unlimited thinking on every call — representing 65% of daily Gemini spend. - Add ThinkingConfig struct with thinkingBudget field - Add thinkingConfig to all three GenerationConfig structs - Add thinkingBudget parameter to all 6 public GeminiClient methods - Proactive extraction (Focus, Task, Insight, Memory): budget=0 (no thinking) - User-facing chat (streaming + tool-calling): budget=4096 (moderate thinking) - Make responseMimeType optional in GeminiRequest.GenerationConfig Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Inject default thinkingConfig (budget=1024) in sanitize_gemini_body when client omits it. Catches old app versions and any code path that bypasses the Swift-side ThinkingConfig. Respects both snake_case and camelCase existing configs. 4 new tests for injection, preservation, and embed skip. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…to all paths 5 unused methods removed (sendChatStreamRequest, sendToolChatRequest, continueWithToolResults, sendImageToolRequest, continueImageToolRequest) plus associated structs (GeminiChatRequest, GeminiStreamChunk, GeminiToolChatRequest). 685 lines of dead code eliminated. Added generationConfig with thinkingBudget=0 to GeminiImageToolRequest so task extraction and insight tool loop paths explicitly disable thinking tokens instead of relying on proxy default. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Proxy default stays at 1024 to cap old clients that don't send thinkingConfig. Current Swift client explicitly sends budget=0 on all production paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ws 0) Gemini 2.5 Pro requires minimum thinkingBudget=128 while Flash supports 0. Added ThinkingConfig.minimumBudget(for:) that returns 128 for Pro models and 0 for Flash. All methods now clamp budget to model minimum. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Old clients may send requests with no generation_config at all. Previously the proxy only injected thinkingConfig into an existing generation_config object. Now it creates generationConfig with the default thinking budget when the key is missing entirely. Added regression test for contents-only request body. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests for: dual generation_config casings, null generation_config, string generation_config. All malformed cases get a fresh generationConfig with default thinking budget. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
d2c947f to
fd46118
Compare
Summary
Cut Gemini 2.5 Flash thinking token costs by setting explicit
thinkingBudget=0on all production extraction/classification paths in the desktop macOS app, and adding defense-in-depth budget injection at the Rust proxy layer.Problem
thinkingConfig, Gemini defaults to unlimited thinkingChanges
Swift client (
GeminiClient.swift):ThinkingConfigstruct with model-awareminimumBudget(for:)— Flash=0, Pro=128sendRequest(image+schema) — Memory, Focus, OnboardingsendTextRequest(text only) — LiveNotes, Goals, Profile, PTTsendRequest(text+schema) — Prioritization, Dedup, GoalssendImageToolLoop(image+tools) — TaskAssistant, InsightAssistantsendChatStreamRequest,sendToolChatRequest,continueWithToolResults,sendImageToolRequest,continueImageToolRequestGeminiChatRequest,GeminiStreamChunk,GeminiToolChatRequestRust proxy (
proxy.rs):sanitize_gemini_body()injectsthinkingConfig(budget=1024)when client omits itgenerationConfigentirely when absent (caps legacy clients with no generation_config)Expected Impact
Test plan
🤖 Generated with Claude Code