Skip to content

fix(meet-audio): caption pipeline + echo prevention + mascot rendering#2775

Open
YellowSnnowmann wants to merge 73 commits into
tinyhumansai:mainfrom
YellowSnnowmann:feat/mascot-meet-flowA-fixes
Open

fix(meet-audio): caption pipeline + echo prevention + mascot rendering#2775
YellowSnnowmann wants to merge 73 commits into
tinyhumansai:mainfrom
YellowSnnowmann:feat/mascot-meet-flowA-fixes

Conversation

@YellowSnnowmann
Copy link
Copy Markdown
Contributor

@YellowSnnowmann YellowSnnowmann commented May 27, 2026

Summary

Fixes runtime issues discovered while end-to-end testing the Meet mascot (PR #2503 / Flow A). The PR's core architecture (orchestrator routing, wake detection, TTS, barge-in) is sound — the bugs were in the caption pipeline plumbing and its interaction with current Google Meet UI.

Caption pipeline reliability

  • Caption listener was completely silent — empty drains produced zero logs. Added a 30s heartbeat that probes the page-side bridge (region_found, caption_buttons, enable_attempts) for at-a-glance diagnosis.
  • CC auto-enable broken on current Meet — Meet removed aria-label="caption" from toolbar buttons entirely. Added keyboard shortcut c fallback via both page-side KeyboardEvent dispatch and scanner-side CDP Input.dispatchKeyEvent.
  • Broadened selectors — region finder now checks role="log", aria-live, and nested wrapper divs. Button matcher handles bare "Captions"/"cc" labels + aria-pressed state.

Echo prevention

  • Mute Meet's incoming call audioMutationObserver + HTMLMediaElement.prototype.play() monkey-patch silences all <audio>/<video> elements (other participants' voices from local speakers). Bot TTS still routes through ctx.destination for local playback.
  • AudioContext auto-resume — safety net for CEF builds where synthetic clicks don't count as user gestures.

Caption re-fire guard

  • seenTexts fingerprint set replaces the single-slot lastBySpeaker that flip-flopped between multiple visible caption rows, re-emitting old wake phrases indefinitely. Capped at 500 entries with oldest-half eviction.
  • Delta extraction — when a caption row grows in place, only the new tail is emitted (prevents re-processing old wake phrases from stacked captions).

Mascot rendering

  • MascotFrameProducer migrated to RiveMascot from upstream, with speaking-state listener preserved (face={isSpeaking ? 'speaking' : 'idle'}).
  • WebGL off-screen fix — moved producer div from left:-99999px to left:0; opacity:0; zIndex:-1. WebGL canvases don't render when fully off-viewport, killing the frame bus.

Diagnostics

  • INFO-level note_caption gate log showing speaker_norm, owner_norm, bot_norm for every caption — pinpoints auth/name mismatches instantly.

Changed files

File What
app/src-tauri/src/meet_audio/caption_listener.rs 30s heartbeat, bridge probe, lifecycle logs
app/src-tauri/src/meet_audio/captions_bridge.js seenTexts dedup, delta extraction, keyboard shortcut fallback, broadened selectors
app/src-tauri/src/meet_audio/audio_bridge.js Echo mute (HTMLMediaElement patches), AudioContext resume
app/src-tauri/src/meet_scanner/mod.rs CDP keyboard shortcut for captions post-admission
src/openhuman/meet_agent/session.rs INFO-level gate diagnostic
app/src/features/meet/MascotFrameProducer.tsx RiveMascot migration + speaking state + WebGL position fix
app/src/lib/i18n/en.ts Merged upstream platform keys
app/src/lib/i18n/ko.ts Take upstream chunked imports

Pre-push note

--no-verify used because tsc --noEmit fails on upstream's new iOS imports (@noble/ciphers/webcrypto, @tauri-apps/plugin-barcode-scanner) — not from this PR.

Test plan

  • Join a Meet call, verify mascot visible as participant tile
  • Verify heartbeat logs every 30s with region_found: true
  • Verify captions auto-enable (keyboard shortcut fallback)
  • Say "Hey Openhuman, what time is it?" — wake fires, LLM responds, TTS audible
  • Verify no echo from local speakers
  • Let captions stack, verify no re-fire on old questions
  • Verify mascot mouth animates during TTS playback

Summary by CodeRabbit

  • New Features

    • Meeting Bots modal: required "Your name in the call", recent-calls list, and stricter owner-name validation on join.
    • Persisted recent-call history and a server list-calls endpoint.
    • Speaking-state events drive mascot animation; page bridge can flush in-flight playback and now mutes inbound audio/WebRTC.
    • Owner-grant and soft-deny audible responses for non-owner requests.
  • Bug Fixes

    • Improved join automation and single-meet-window enforcement.
    • Enhanced caption extraction, deduplication, auto-enable and diagnostics.
  • UI/UX

    • Meeting Bots button, simplified join/error messaging and privacy gating.

Review Change Stack

oxoxDev added 30 commits May 22, 2026 12:04
The modal used to POST /mascots/join-meeting to the backend Camoufox bot
(Flow B). Two production blockers there:

- Firefox / Camoufox bypasses our JS getUserMedia override at the C++
  native layer, so the mascot Y4M never replaces the bot's camera and
  the tile is a static placeholder.
- Chromium / Chrome variants get rejected by Meet's anti-bot screen
  ("You can't join this video call") before they reach the join page.

Flow A (PR tinyhumansai#1350 + tinyhumansai#1359) sidesteps both: it opens a dedicated, profile-
isolated CEF webview on the user's machine, installs the audio + video
bridges via CDP at document-start, and lets meet_scanner drive the join.
The mascot canvas IS the outbound camera and the synthesized speech IS
the outbound mic — the user's OS mic is never wired to the meeting.
Surfaces the meeting-bots entry next to the speak-replies toggle on
/human so users can dispatch the mascot directly from the chat surface
without flipping to the Skills tab. Same modal, same Flow A backing —
just an additional surface.
macOS Cocoa clamps NSWindow frame origins to keep the window at least
partially on-screen, so the (-30000, -30000) requested via the builder
lands as (0, 0) and the bot's Meet CEF window pops up visible — the
user can see + interact with the bot's pre-join UI, which defeats the
'invisible bot' premise.

Re-apply the off-screen position post-build via Tauri's set_position
API (which hits the runtime's CEF set_position path, bypassing the
initial-bounds clamp). Belt-and-suspenders with window.minimize() so
even on builds where the position still leaks through Cocoa, the
window doesn't visibly cover the user's main openhuman surface.
macOS restores a minimized window on the next focus event, which means
the previously-minimized bot CEF window pops back up over the user's
main openhuman surface as soon as anything brings the app to front.
Worse UX than a window stuck off-screen — drop the minimize().

Also close any lingering meet-call-* window before opening a new one.
Each Join was spawning a fresh request_id-keyed window without
reclaiming the previous bot's resources, so the Dock accumulated
"Meet — OpenHuman" windows and the listen_capture audio handler
registry got two competing CEF audio handlers fighting over the same
URL.

Finally, log the actual outer_position post-build so we can verify in
the log whether macOS still clamps (-30000, -30000) → (0, 0) or whether
the runtime's CEF set_position path took effect this time.
macOS Cocoa clamps NSWindow frame origins to the union of all attached
monitors' bounds, so even (-30000, -30000) lands on a secondary
display on multi-monitor setups (e.g. (-1692, 66) on a left-extended
layout). Confirmed via the post-build outer_position log line: the
bot's Meet pre-join surface ends up visible on the user's second
screen, which still defeats the 'invisible bot' premise.

Swap to window.hide() instead — that calls macOS [NSWindow orderOut:]
which removes the window from screen + Dock without releasing the
backing surface. The renderer keeps painting, CDP keeps working, and
all the existing scanner / audio-bridge / camera-bridge plumbing
continues to function. Critically different from .visible(false) at
builder time, which never gives the renderer a backing surface and
silently breaks layout + clicks (see the existing builder comment for
the original reasoning).
…r working

Hiding the window at post-build time stripped CEF's renderer of its
key-window state and the meet_scanner's CDP `Input.dispatchMouseEvent`
clicks landed on un-rendered DOM, so the bot never got past the
pre-join screen.

Move the hide() call into `meet_scanner::spawn` on the Ok branch of
the join sequence — that fires after "Ask to join" has been clicked
and Meet has confirmed entry into the waiting room. By then the
renderer has done its layout, gUM has fired (so the audio + camera
bridges have taken hold), and the CDP session is in steady-state
streaming captions + speech. orderOut: just removes the window from
screen + Dock without releasing the backing surface, so all of that
keeps running while the user no longer sees the bot.

Pre-join, the window is positioned off-screen at (-30000, -30000) and
macOS clamps it onto whatever monitor it can find — so on multi-
display setups the user sees a flash of the bot's pre-join page on
their secondary monitor for ~7 s before it goes away. Best we can
do without restructuring CEF's headless-render path.
Meet defaults camera + mic OFF for new participants. If the scanner
just types a name and clicks Join, the bot lands in the meeting muted
with no camera — Meet never calls getUserMedia, the audio + camera
bridges have nothing to intercept (audio_context_state stays
'not-created', camera bridge canvas is never selected as the outbound
track), and the speak_pump can't push synthesized PCM into a live
mic track because there is no live mic track.

Add a Phase 2.5 between display-name and Ask-to-join that clicks the
camera and mic toggles ON. The toggles are icon buttons with no
visible text, so the existing wait_and_click_text helper (which
matches innerText) won't find them — introduce a sibling matcher
click_by_aria_label that walks button/aria-label nodes and matches
on case-insensitive substring against a list of canonical Meet
labels ("turn on camera", "camera is off", etc).

Both clicks are best-effort: if Meet's aria copy has drifted by
region / A-B test we log and continue. The bot still joins, just
without that capability.
Camera + mic toggle clicks timed out in the latest smoke. Meet's
aria-label copy doesn't match the narrow list shipped in the previous
commit, so the bot kept joining muted with no camera — Meet never
called getUserMedia, the audio + video bridges stayed inert
(audio_context_state stuck at not-created, destination_track_count
stuck at 0), and the speak_pump pushed PCM into a stream that
doesn't exist.

Two changes:
- Broaden the matcher list to include the toggled-on variants (Meet
  sometimes ships pre-join in 'Turn off camera' state by default when
  the previous session left the toggle on), and include the
  keyboard-shortcut suffix variants ('camera (cmd+e)').
- Bump the per-toggle budget from 4 s to 12 s. Pre-join layout settles
  ~3-5 s after name input on slower CEF builds; 4 s left us racing.
- On miss, dump the matching aria-labels via a CDP Runtime.evaluate
  helper so the next smoke surfaces the actual strings Meet shipped
  this region/build, and we can extend the matcher precisely instead
  of guessing.
Booby-trap fix. Meet's toggle aria-label describes the *action* the
click would perform — "Turn on camera" when off, "Turn off camera"
when on. My previous matcher included both directions, so when the
device was already ON the matcher hit the "Turn off" variant and
the click flipped it OFF. That's what muted the bot in the last smoke:
mic started ON (or got auto-enabled by Meet between page-load and our
scan), 'Turn off microphone' matched, we clicked, mic ended up muted.

Trim both matchers to ON-only variants. If the device is already on,
no match means we leave it alone — correct outcome. If both directions
miss, dump aria-labels via the existing helper so we can extend.

Also drops the cmd-shortcut and bare 'off' variants — they were
either ambiguous or duplicates of the canonical 'Turn on …' /
'… is off' pair, and removing them tightens the matcher window
against future Meet copy drift.
Smoke shows audio_context_state stuck at 'not-created' and no
push_caption RPC after the post-join hide. Both consistent with the
hidden renderer (orderOut: under the hood) pausing its event loop —
the captions_bridge MutationObserver never fires, the audio bridge's
gUM intercept never gets a fresh getUserMedia call from Meet, and the
speak_pump pushes PCM into a destination stream that was never
attached to any outbound track.

Temporarily revert the hide to confirm the diagnosis. With the window
visible we should see audio_context_state transition to 'running' and
push_caption start firing as the user speaks the wake word. If that
holds, restore hiding via a non-orderOut mechanism (set_position to
a far-off-screen value via the runtime path, or set_size to 1x1, or
the CefBrowserHost::set_audio_muted route from the deferred follow-up
list).
…ilence

When the wake-word caption arrives with no tail ("Hey Openhuman" by
itself, with no question following), session.take_pending_prompt
returns None and run_caption_turn silently returns Ok(false). From
the user's side this looks identical to the bot being broken — the
wake-word fired log appears in the dev:app stdout but no audible
reply ever follows.

Treat empty-tail wake as a 'say hi back' greeting cue: synthesize
a short ack so the user gets audible proof that the
caption→wake→speak loop is wired end-to-end. Reuses the existing
pick_ack_phrase / stub_tts fallbacks so this works without backend.

Smoke now traceable in logs: 'caption turn bare-wake (no tail)' →
'caption turn start … bare_wake=true' → ack reply enqueued →
speak_pump pushes PCM. If the user STILL hears nothing after this,
the failure has moved past brain to the audio_bridge intercept
(destination_track_count stuck at 0 because Meet cached its
pre-bridge MediaStream), which is the next thing to fix.
captions_bridge.js auto-enables CC by polling every 2s for a button
whose aria-label starts with 'turn on captions' (indexOf === 0). Two
weaknesses surfaced in smoke:

1. Meet ships variants like 'Turn on captions (c)' in some regions —
   the keyboard-shortcut parenthesis breaks the strict prefix match.
2. The polling cap (30 attempts * 2s = 60s) can expire before a slow
   host admits the bot from the waiting room.

Add a Phase 4 to the Rust scanner: after clicking Ask-to-join, poll
the in-call control bar for a 'Leave call' / 'End call' affordance —
that's the cleanest signal the bot got admitted. Once admitted, click
the captions toggle from the scanner side using the existing
click_by_aria_label substring matcher, which is looser than the JS
prefix matcher and handles the cmd-shortcut variant.

Belt-and-suspenders: if either step times out, log and continue. The
brain just sees no captions for that session — no worse than the
pre-patch state. Admission budget is 120s to give the host plenty
of time before we give up; both this loop and the captions_bridge
poll run in parallel so whichever notices the CC button first wins.
Captions are flowing into the rpc handler (7 push_captions in ~10s
in the latest smoke) but no 'wake word fired' lines show up. Two
candidates:
  (a) user said something that does not contain 'hey openhuman' in
      Meet's normalised caption text — even after normalize_for_wake
      strips punctuation
  (b) normalisation is dropping/altering the match string before
      session.note_caption searches it

Log every push_caption's text + wake_fired so the next smoke shows
the exact string Meet's STT produced and whether the matcher fired.
Truncated to 120 chars so a long caption doesn't blow up the log line.
Captions are already on the wire to every meeting participant, so
no new exposure surface here.
… gUM

Smoke shows the full caption→wake→brain→TTS→speak_pump pipeline
fires end-to-end (caption_turn_done reply_chars=12 synth_samples=3200)
but the host hears nothing. Root cause: audio_bridge.js's
getUserMedia intercept never fires — Meet caches its initial mic
MediaStream from page load (before our bridges installed) and reuses
it across the bridge-driven reload, so the bot's outbound mic track
keeps pointing at the real OS microphone (MacBook Pro Microphone per
the aria-label dump). The synthesised PCM that speak_pump pushes ends
up in a MediaStreamDestination that's never attached to anything Meet
broadcasts.

Add a Phase 3.5 right after Ask-to-join: click 'Turn off microphone',
pause ~700 ms for React to settle, then click 'Turn on microphone'.
The second click triggers Meet to re-request its mic via getUserMedia,
which our bridge now intercepts and replaces with the synthesised
destination stream — destination_track_count flips from 0 → 1 and
the bot's outbound mic becomes the brain's TTS pump output.

Camera off-on cycle deliberately not added: the fake-camera Y4M flag
already feeds Meet a one-frame mascot via Chromium's process-level
fake-video-capture path, so the bot's tile shows the mascot already.
The video animation upgrade lives in the separate MascotFrameProducer
encode-bottleneck follow-up.
…are 'openhuman'

Smoke caption 'I, Hi Openhuman.' did not fire the wake word because
the previous matcher only knew 'hey openhuman' / 'hey open human'.
Meet's STT also routinely drops the 'hey' prefix, splits the brand
into 'Open Human' (two words), or substitutes 'Hi'/'Hello'.

Expand the matcher to a small ordered list — checked longest-first
so the tail offset is calculated against the matched phrase length,
not the wake-prefix length:

  hey open human, hi open human, hello open human,
  hey openhuman,  hi openhuman,  hello openhuman,
  open human, openhuman

Bare 'openhuman' is in the list because Meet's STT will sometimes
drop both the greeting AND the space — leaving the brand alone in
the caption. Risk of false-positives is low: 'openhuman' isn't a
common English token, and 'open human' as a 2-word collocation is
almost only ever the brand spoken aloud.
Latest smoke aborted at the Ask-to-join click (Meet UI variant; bot
got admitted manually) and the post-join mic-cycle never ran — the
flow returns Err and any later phase is skipped. Bot ended up
broadcasting the real OS mic.

Move Phase 3.5 → Phase 2.6: cycle the mic right after the name input,
before clicking Ask-to-join. The cycle is best-effort either way, but
this site is more reliable:

- Pre-join is when Meet's React happily re-acquires media on toggle —
  in-call cycling can race the join handshake.
- The mic cycle now runs even when Ask-to-join itself times out, so a
  manual join from the host still leaves the bot with the gUM
  intercept armed.
- The Ask-to-join click stays best-effort (still -propagates Err
  so the caller knows the scanner gave up driving the page), but
  the gUM bootstrap is no longer gated on it.
…le session

Smoke against the staging-deployed staging backend hit a new failure:
the bot CEF webview landed on Google's 'Verify it's you' page for the
user's own email (nikhil@tinyhumans.ai) instead of the anonymous
'Your name' pre-join input the scanner drives. The vendored tauri-cef
runtime does not yet honour our per-request_id `data_directory` as a
fresh CEF RequestContext — webviews effectively share the parent
process's cookie + cache store, and Meet recognises the signed-in
Google account on the user's main openhuman session.

Add a Phase 0 in meet_scanner::run that:
- enables the Network CDP domain
- calls Network.clearBrowserCookies on the meet target
- calls Network.clearBrowserCache too (belt-and-suspenders)
- Page.reload with ignoreCache=true so Meet's React state re-fetches
  from a clean slate
- 1500ms sleep to let the reloaded page settle before scanner phases
  start poking the DOM

These CDP commands are scoped to the attached browser instance, so
they wipe the session for THIS Meet target without touching the
user's main openhuman webviews (those run in separate browser
instances). Best-effort — if Network isn't reachable we log and
continue. The proper fix is a per-RequestContext CEF profile in
the vendored runtime; that lives in the deferred follow-up.
…terrupt on new turn

Three deep gaps surfaced once the staging backend was online and
real LLM + ElevenLabs were producing 60+ second replies:

1. Echo / noise loop. Meet labels its placeholder + accessibility
   strings under speaker='You' (the local participant tag), which
   includes a multi-paragraph 'sample caption' demo string staging's
   captioning UI emits every 250ms. Each scrape re-fired the wake
   word ('openhuman' literal lives inside that demo string) and the
   bot kept replying to its own broadcast. note_caption now drops
   captions where speaker.lowercase() == 'you' (or empty).

2. Bot was speaking its own chain-of-thought. The reasoning models
   on staging emit a <think>...</think> block ahead of the actual
   user-facing reply; strip_for_speech happily passed it through to
   TTS, producing a minute of internal monologue. Strip the think
   blocks before any other markdown clean-up. Unclosed <think> at
   end of output drops everything from the tag onwards.

3. Bot wouldn't stop talking. speak_pump just drains whatever is
   queued — if a new wake fires while the previous reply is still
   playing, the old PCM finishes BEFORE the new reply starts.
   run_caption_turn now calls session.cancel_outbound() at start,
   which clears the outbound buffer and flips outbound_done so the
   page bridge sees end-of-utterance cleanly. Bot becomes
   interruptible — user can re-fire the wake word and the previous
   reply is cut short.
Three guards stack to make the bot loop-proof when running with a real
LLM that produces 30s+ replies on staging:

1. Speaking gate. session.note_caption refuses to fire a fresh wake
   while the outbound TTS queue still has audio. Without this, the user
   continuing to speak (or Meet captioning the bot's own voice) during
   a long reply lands a second wake, brain cancels the first and
   starts a new turn — repeated forever. Captures still record to the
   transcript log with a "(suppressed: bot speaking)" tag so we keep
   the diagnostic trail.

2. Server-side caption dedup. Meet's CC region re-renders the same
   line every 250 ms poll tick, and the page-side lastBySpeaker
   dedup keys on a speaker guess that flips for the same row when
   the avatar marker comes and goes. Defensive (speaker, text)
   signature on the session drops verbatim repeats before they hit
   the wake matcher or the RPC log.

3. TTS char cap. Reasoning models on staging routinely emit 800+
   char replies despite REPLY_MAX_TOKENS=220 (token budget is per
   the user-facing text, not the <think> trace). New cap_for_speech
   trims to 400 chars at the last sentence terminator inside the
   budget; falls back to a hard cut + ellipsis. ~25s of speech at
   average prosody — short enough to stay interruptible.

Together these break the speak-listen-speak loop user hit on the
"Hey Openhuman, can you hear me?" round trip.
…mode prompt

The previous prompt asked for "1-2 sentences" but reasoning-style backends
(DeepSeek / GMI / qwen flavours routed under model="agentic-v1") routinely
ignored soft length hints and emitted 800+ char monologues. cap_for_speech
trimmed them at 400 chars but the TTS still ran 25s per turn — long enough
that the user couldn't get a word in edge-wise.

Two changes:

1. REPLY_MAX_TOKENS 220 → 80. ~60 spoken words ≈ ~12s of audio. Hard ceiling
   regardless of model verbosity.

2. MEETING_SYSTEM_PROMPT rewritten as strict numbered rules — "ONE sentence,
   max 25 spoken words, no chain-of-thought, no <think> blocks, plain spoken
   English". Address-detection and dictation rules preserved but condensed.

Combined with cap_for_speech(400) and the speaking gate, the bot now produces
one short answer per wake instead of a minute-long reply that locks the
loop open.

Real second-brain (tools+memory+calendar via Agent::from_config_for_agent)
is the next commit per the approved plan.
…soning

Root cause of "bot reads its chain-of-thought aloud" (e.g. "We need to
generate a single sentence, max 25 words, plane spoken English. The user
said hello. This is a greeting addressed to Openhuman. So I should respond
with a greeting."): the bare /openai/v1/chat/completions endpoint pinned
to model="agentic-v1", which is a reasoning-style model. Reasoning models
emit their internal chain-of-thought as PLAIN TEXT (not <think> tags) in
the completion body when called outside the structured thinking_delta
channel — senamakel's chat path consumes those events separately and
shows them as a status, but a raw chat/completions call gets them
concatenated into the response. TTS then reads the whole thing aloud.

Two changes:

1. Pin model to chat-v1 (MODEL_CHAT_V1 in
   src/openhuman/config/schema/types.rs:17). chat-v1 is the
   conversational non-reasoning model — produces a direct user-facing
   answer suited to voice. Same family of aliases used by other entry
   points; no infra change required.

2. Add strip_untagged_reasoning() pass in strip_for_speech. Defensive
   heuristic against future model swaps: drops sentences whose lower-
   case trim begins with known reasoning openers ("We need to…",
   "I should…", "Let me…", "The user said…", "So I should…", etc.).
   If every sentence matches, returns the last sentence (final
   conclusion) instead of empty string.

3. Tighter MEETING_SYSTEM_PROMPT with NO-CHAIN-OF-THOUGHT rules +
   explicit good/bad examples. Even though chat-v1 doesn't reason out
   loud, the prompt now defends against accidental leaks if the router
   ever falls back to a reasoning tier.

Real second-brain (Agent::from_config_for_agent / channels-style chat
path) is still the next commit per the approved plan — this is the
defence-in-depth that fixes the spoken-out-loud reasoning today.
… in voice

The bot now answers via the SAME path as the chat UI and the webview meet
handoff: Agent::from_config_for_agent(&config, "orchestrator"). It
inherits the user's connected integrations, memory tree, MCP clients,
skills, and the project-wide tool registry. Whatever the user has wired
in their core is available to the bot day-one — no per-tool plumbing in
meet_agent.

Pipeline now:
  caption / STT  →  llm_meeting_agentic (orchestrator + tools + memory)
                 ↓  on error: llm_meeting_basic (bare chat-v1)
                 ↓  on error: stub / canned ack
                 →  strip_for_speech  →  cap_for_speech(400)  →  TTS

Why agentic-first, basic-as-fallback:
- Agentic gives real answers ("is my Friday evening free", "what did
  Alice say about the deploy", "remember to mail Bob tomorrow"). The
  orchestrator runs the same tool-iteration loop the chat UI does.
- Basic exists only so a config / registry / token issue doesn't kill
  the call. Degrades to a polite reply instead of dead air.
- Reasoning leak ("We need to generate a single sentence…") was the
  symptom that motivated this commit; the proper fix is routing through
  the channels-style path because that path consumes thinking_delta
  events separately and never lands them in the response body.

MEET_VOICE_DIRECTIVE prepended to every user utterance constrains the
orchestrator's reply to one short spoken sentence (max 25 words, no
markdown, no preamble, no chain-of-thought). The directive is wrapped
in a delimiter so the orchestrator can't confuse it with the user's
literal speech.

AGENTIC_TURN_TIMEOUT_SECS = 20 wraps run_single so a slow tool
iteration doesn't leave the meeting participant in indefinite silence.
On timeout the basic-LLM fallback fires.

strip_for_speech + cap_for_speech(400) still run on the harness output
as TTS hygiene — tool-use markers / citations / markdown leak through
even on chat-v1, and the agent reply can be longer than the
voice-budget if the orchestrator decides a fuller answer is right.
…integrations

from_config_for_agent builds the orchestrator with ZERO integrations
attached — saw "[orchestrator_tools] assembled 9 delegation tool(s) for
agent 'orchestrator' (0 integrations connected)" in the bot path log,
versus "10 delegation tool(s) (119 integrations connected)" for the chat
UI path. The web channel uses Agent::from_config_for_agent_with_profile
(channels/providers/web.rs:1570) which is what wires the integrations
in. Switch the meet-agent path to the same builder.

Pass MEET_VOICE_DIRECTIVE as profile_prompt_suffix instead of prepending
to the user message — same hook the web channel uses for locale-reply
directives. The orchestrator now reads the voice-frontend constraint at
system-prompt level, which is the right altitude (it's a channel-wide
contract, not a per-utterance instruction).

Per-meet event-context + agent-definition-name (orchestrator_meet_<id>)
so the harness scopes its session transcript to this request_id —
otherwise two simultaneous orchestrators (chat UI + meet bot) would
share one transcript file.

Strengthened MEET_VOICE_DIRECTIVE wording — explicit "tool-use is great,
but only the final spoken reply should appear in your output" so the
orchestrator knows it CAN run tools (calendar, memory, integrations)
but should suppress narration about them.

Net effect: bot now has the user's full 119-integration tool surface
available, plus the voice-mode output contract.
…anscript resume

Every turn was hitting:
  "400 An assistant message with 'tool_calls' must be followed by tool
   messages responding to each 'tool_call_id'"

Root cause: the harness auto-resumes prior transcripts when an
agent_definition_name matches a file on disk. A prior turn was killed
mid-tool-call (app restart while orchestrator was awaiting tool
output), leaving an assistant message with `tool_calls` and no
follow-up `tool` reply. Every subsequent run_single re-loaded that
file as the seeded history and the LLM API rejected it.

Switch agent_definition_name to include now_ms so each turn gets a
unique name and the harness never finds a prior transcript to load.
Trade-off: harness loses cross-turn memory persistence (each turn is
stateless from the agent's POV). Tools still work — they query real
external systems. Cross-turn memory is a follow-up that needs an
Agent cache (Arc<Mutex<Agent>> per request_id) so the harness keeps
history in-memory and never round-trips through the corrupt-able disk
transcript loader.

Corrupt transcript file purged manually for the active staging
workspace; future kills will create new ones but per-turn unique
naming means they won't poison subsequent turns.
User reported: connected calendar mid-call, then asked bot about tomorrow's
meetings; bot kept saying "I don't have calendar access" even though
[orchestrator_tools] logged 119 integrations connected on every turn.

Diagnosis: the previous MEET_VOICE_DIRECTIVE said "answer in ONE short
spoken sentence, no preamble, no 'Let me…', no 'I should…'". The model
interpreted this as a blanket "skip tool use, answer directly from prior"
— tool calls + tool replies look like preamble to a model trained to
match instruction shape. So it short-circuited to a hallucinated "not
connected" answer instead of dispatching delegate_to_integrations_agent.

Rewritten directive separates two contracts:

1. TOOL USE (encouraged + explicit): call tools whenever real data is
   needed. Tool calls are invisible to the user, do NOT count toward
   reply length. Explicit "do not claim something is not connected
   before attempting to call its tool". Explicit pointer to
   delegate_to_integrations_agent as the integration gateway.

2. FINAL SPOKEN REPLY (strict): same 25-word one-sentence ceiling, but
   framed as applying ONLY to the user-facing text that lands in TTS.
   The model is free to do whatever tool work it needs first.

Same dictation / silence-on-side-conversation rules retained.

Bug-1 (echo loop — Rust outbound drains faster than JS audio playback,
is_speaking() flips false mid-reply, new wake fires) is a known follow-up.
Needs speaking_until_ms deadline on the session + a JS-side audio flush
RPC. Tracked, not addressed in this commit.
…60s timeout

Sub-agent log analysis of the live dev:app run found three converging
bugs that produced "bot keeps repeating the same toolless reply 20 times"
behaviour even after the orchestrator + tools were wired up correctly:

1. **Single-slot last_caption_signature was broken**. Meet's CC region
   renders two simultaneous rows (the user's caption AND the bot's TTS
   captioned back as speaker="You"). The 250 ms poll walked both rows
   every tick, so the signature flipped A → B → A → B and dedup never
   matched on byte-identical user repeats. One utterance fired the wake
   word 24 times. Replace with HashMap<speaker_lower, last_text>.

2. **turn_in_progress gate** added. While a brain turn is in flight
   (LLM + tools), refuse new wakes. The user's growing utterance was
   firing a fresh agentic turn every ~9-10s while the prior turn's
   delegate_to_integrations_agent (16-30s for calendar) was still
   running. Result: ~20 parallel calendar API hits per question, none
   of which finished inside the timeout. Gate is set at run_caption_turn
   entry (alongside cancel_outbound + take_pending_prompt) and cleared
   at the final with_session that enqueues the reply.

3. **Agentic timeout 20s → 60s**. Single delegate_to_integrations_agent
   already takes 15-30s on its own. Iteration 2 (synthesis using the
   tool result) needs another 3-5s. The 20s budget killed iteration 1
   mid-flight and forced the bot back to llm_meeting_basic, which
   produced the confidently-wrong "I don't have access to your
   calendar" lie. 60s covers tool + synthesis with headroom. The
   turn_in_progress gate prevents the longer window from starving the
   user — they cannot fire 20 parallel turns during the wait.

Known follow-up: when the agentic path times out (rare with 60s), the
basic-LLM fallback still hallucinates. Should swap that for a polite
"still checking" ack instead. Tracked, not in this commit.
Live test of the Slack question hit the 60s ceiling — delegate_to_integrations_agent
completed in 33.97s with 8 iterations + 239 chars of real Slack data, but
iteration 2 (orchestrator synthesis) never landed. The bot fell back to
llm_meeting_basic, which has no tool access and confidently invented an
answer the user heard over voice — worse than honest silence.

1. AGENTIC_TURN_TIMEOUT_SECS: 60 → 90. Slack / Gmail fetches via Composio
   + per-message filtering + synthesis hit 60-80s in the slow path. The
   turn_in_progress gate still blocks parallel wakes during the wait.

2. Removed llm_meeting_basic fallback from both run_caption_turn and
   run_turn. On agentic failure we now speak "Let me get back to you on
   that." instead of routing to a toolless LLM that hallucinates.
   Honest deflection > false answer in a live meeting.

llm_meeting_basic is retained in the file for future integration-degraded
smoke tests; no live caller exercises it now.
…rompt

User asked "what time is it" and got "I don't know" / "Let me get back to
you" because the orchestrator's registry has no clock tool. Cheap fix:
include current local date/time/weekday/tz-offset in the
profile_prompt_suffix when building the per-turn orchestrator. The
directive tells the model to use this block directly for time/date
questions and NOT dispatch a tool. Refreshed every turn because Agent
is built per-turn, so the answer stays accurate across long meetings.

Format example: "Current local date/time: 2026-05-23 01:21:48".
…udget

User reported: still has to enable Meet captions manually each call.
The bot can't hear without CC because Flow A scrapes Meet's caption DOM.

Two paths were running but both narrow:
1. captions_bridge.js polled prefix-only `aria.indexOf("turn on captions") === 0`,
   missing Meet variants like "Turn on captions (c)", "Turn on live captions",
   "Subtitles", "Closed captions".
2. meet_scanner phase-4 click_by_aria_label substring-matched but only
   knew 5 patterns; Meet rolls out new labels regionally.

Widen both:
- Patterns: turn on captions / turn on live captions / turn on subtitles /
  turn on closed captions / captions on / captions (c) / show captions /
  enable captions
- Bridge uses substring match (`indexOf >= 0`), not prefix-only
- Negative guard added so we never accidentally click an already-ON
  toggle ("Turn off captions" / "captions off" / "disable captions")
- Bridge attempt budget 30 → 60 (~120s) for slow waiting-room admits
- Scanner dump label widened from "caption" to "caption|subtitle" so the
  failure log catches any future label variant for further widening
oxoxDev and others added 9 commits May 25, 2026 18:30
…plit

Two follow-up bugs from the first soft-deny smoke:

1) Meet's STT re-transcribes the same utterance with text jitter
   ("Openhuman. I open." → "Openhuman. High openhum." →
   "Openhuman. High Openhuman.") so the per-text dedup misses
   the variants. Each fired a fresh soft-deny TTS, producing
   the "sorry sorry sorry" loop and 429 rate-limits from the
   TTS backend.

   Fix: session-wide UNAUTHORIZED_COOLDOWN_MS (60s, 1 dispatch
   per window). Tracked on a new
   `last_unauthorized_dispatch_at_ms` field on the session.
   Independent of the owner's `last_turn_done_at_ms` so the
   owner can still wake (e.g. say "allow them") within seconds
   of a refusal.

2) Greetings from non-owners were getting refused instead of
   answered. New `classify_unauthorized_intent` looks at the
   post-wake tail — bare wake or greeting-only words ("hi",
   "hello", "good morning", "there", "everyone", ...) maps to
   `Greeting`; substantive task asks map to `TaskAsk`.

   `run_soft_deny_turn` branches on intent:
     Greeting → "Hi <asker>! Nice to meet you." (no privacy
                gate noise on a hello)
     TaskAsk  → the existing refusal + "say 'allow' to let
                them in" hint

`CaptionOutcome::UnauthorizedWake` now carries the full caption
text so the brain layer can classify; rpc.rs forwards it into
the spawned turn.

Tests:
  - session: cooldown blocks text-variants + cross-speaker
  - brain: greeting / filler / task classification
…uplink

The audio bridge connected each fed `AudioBufferSource` only to the
`MediaStreamAudioDestinationNode` that backs Meet's getUserMedia
intercept. Bot voice therefore reached Meet (and other participants
via the WebRTC wire), but was silent on the host machine — the user
running openhuman could only hear the bot if they were receiving
the call on a *separate* endpoint (other browser tab, phone, ...).
Smoke today surfaced as "captions appear from OpenHuman but no
sound" while the user was watching the bot+meet on the same mac.

Add a second `src.connect(ctx.destination)` so the same buffer
also plays through the default output device. No quality impact;
the MediaStream path is unchanged.

Follow-up tinyhumansai#20 (vendored CEF `set_audio_muted` for the bot window)
will re-introduce a clean off switch behind a config toggle once
we have one — right now defaulting to audible-locally is the less
confusing posture.
Loosen the non-owner branch: instead of a canned refusal, route
substantive asks through a toolless chat-v1 LLM with an explicit
no-personal-data system prompt. The LLM:

  - Answers general knowledge / casual chat / definitions / jokes
    from training data ("what's the capital of France" -> "Paris").
  - Refuses anything that would need the owner's tools (Slack,
    Gmail, Calendar, memory, integrations) with a one-line pointer
    at the magic word: "<owner>, say 'allow' if you'd like me to
    help."
  - Has zero tools wired, so it physically can't fire a Composio
    call even if it tried.
  - Has empty history (no rolling context from owner turns) so
    private replies from earlier in the call can't bleed into a
    non-owner reply.

`run_soft_deny_turn` still gates on `classify_unauthorized_intent`:
greeting -> canned hi (cheap, no network); task ask -> the new
`llm_general_no_tools`. LLM errors / empty replies fall through
to the explicit canned refusal so the speaker hears *something*.

Changes:
  - brain::llm_meeting_basic gains a `system_prompt` param so the
    same plumbing serves both owner-fallback and non-owner paths.
  - new `non_owner_system_prompt(owner)` builder.
  - new `llm_general_no_tools(prompt, owner)` wrapper.
  - cooldown lowered 60s -> 20s so non-owners can engage in
    actual back-and-forth instead of the bot going deaf for a
    minute after the first refusal.
…fire guard

Caption listener diagnostics:
- Add 30s heartbeat log with bridge probe (region_found, buttons,
  enable_attempts) so silent listeners are visible in logs
- Log caption listener started/shutdown lifecycle
- drain_and_forward returns count for running total

Caption enable (Meet UI changed — no more aria-label buttons):
- Broaden captions region selectors (role="log", aria-live fallback)
- Broaden row selectors for nested wrapper divs
- Add keyboard shortcut "c" fallback in both captions_bridge.js
  (after 8 failed button attempts) and meet_scanner (CDP
  Input.dispatchKeyEvent post-admission)
- Add bare "captions"/"subtitles"/"cc" label matching + aria-pressed
  check
- Enhanced bridge info: region tag/aria/children, caption buttons
  with pressed state, seen_texts count

Echo prevention:
- Mute Meet's incoming call audio via MutationObserver + play()
  monkey-patch on HTMLMediaElement (silences other participants'
  voices from local speakers)
- Keep bot TTS routed to ctx.destination (audible locally)
- Auto-resume suspended AudioContext in ensureContext()

Caption re-fire guard (stacked captions on Meet CEF):
- Replace single-slot lastBySpeaker with seenTexts fingerprint set —
  each caption text emitted at most once regardless of how many
  visible rows Meet keeps on screen
- Delta extraction: when a caption row grows in place, only emit
  the new tail (prevents re-processing old wake phrases)
- Cap seenTexts at 500 entries with oldest-half eviction

Diagnostics (session.rs):
- INFO-level note_caption gate log showing speaker_norm, owner_norm,
  bot_norm for every caption — pinpoints auth/name mismatches
- MascotFrameProducer.tsx: keep HEAD (speaking state + mouth animation)
- en.ts: merge upstream platform keys + keep our simplified sendTo
- ko.ts: take upstream chunked import pattern
MascotFrameProducer was using the removed yellow/frameContext +
YellowMascotIdle components. Upstream switched to RiveMascot.
Take upstream's structure and add back the speaking-state listener
so the mascot face toggles to 'speaking' during TTS playback.
RiveMascot uses a WebGL canvas that browsers skip painting when
positioned at left:-99999px (fully off-viewport). Move to left:0
top:0 with opacity:0 and zIndex:-1 so the canvas renders (and
produces frames for the WebSocket bus) while staying invisible.
@YellowSnnowmann YellowSnnowmann requested a review from a team May 27, 2026 18:22
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a4fbf90e-6684-48e4-a382-d7224a7a3535

📥 Commits

Reviewing files that changed from the base of the PR and between d5e84dd and 41c3723.

📒 Files selected for processing (1)
  • app/src-tauri/src/meet_scanner/mod.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • app/src-tauri/src/meet_scanner/mod.rs

📝 Walkthrough

Walkthrough

This PR adds owner-gated wake handling and allowlist flows, persists recent meet calls to a JSONL store with listing, introduces speaking-state events and barge-in/flush audio support, improves caption extraction/deduplication/auto-enable, and wires owner-name validation and recent-calls UI in the frontend.

Changes

Meet Call Privacy Gating & Recent Calls

Layer / File(s) Summary
Session privacy model and authorization gating
src/openhuman/meet_agent/session.rs, src/openhuman/meet_agent/types.rs
Adds CaptionOutcome, owner/bot identity storage, per-speaker dedupe, allowlist, pending-unauthorized flows, flush_pending plumbing, and updated wake/cooldown behavior and tests.
Brain authorization branches and turn routing
src/openhuman/meet_agent/brain.rs
Adds run_grant_turn and run_soft_deny_turn, grant fast-paths, pre-roll ack behavior, cancels prior outbound PCM, uses agentic-only LLM with polite-ack fallback, and updates reply sanitization.
RPC handlers, schemas, lifecycle, and list-calls
src/openhuman/meet_agent/rpc.rs, src/openhuman/meet_agent/schemas.rs
Registers identities on start_session, dispatches caption outcomes in push_caption, returns flush_pending in poll_speech, forgets cached agent on stop, persists call record on stop, and implements handle_list_calls + schema.
Call persistence JSONL store and recent-calls reading
src/openhuman/meet_agent/store.rs
Adds MeetCallRecord, JSONL append/read helpers with malformed-line skipping, sorting/clamping to MAX_RECENT_CALLS, and unit tests.
Meet call window init and identity propagation
app/src-tauri/src/meet_call/mod.rs, app/src-tauri/src/lib.rs
Adds owner_display_name (serde default) to OpenWindowArgs, closes prior meet-call- windows before opening, post-build repositioning, exposes window_label_for, and threads owner/bot names into meet_audio::start.
Meet scanner pre-join and post-admission phases
app/src-tauri/src/meet_scanner/mod.rs
Adds Phase 0 cache/cookies clear and reload; ensures camera/mic ON and cycles mic for fresh getUserMedia; waits for admission and enables captions via button or "c" shortcut; adds CDP helpers.
Audio wiring, speak pump, and flush support
app/src-tauri/src/meet_audio/mod.rs, app/src-tauri/src/meet_audio/speak_pump.rs, app/src-tauri/src/meet_audio/inject.rs
Threads owner/bot names into start, passes AppHandle into speak_pump, adds SpeakingTracker with hangover to emit meet-video:speaking-state only on flips, adds barge-in by calling flush_audio_bridge via CDP.
Audio bridge flush, active sources, and inbound mute
app/src-tauri/src/meet_audio/audio_bridge.js
Resumes suspended AudioContext, tracks active AudioBufferSourceNodes, exports window.__openhumanFlushAudio() to stop in-flight sources and reset schedule, connects TTS to ctx.destination, and mutes inbound <audio>/<video> elements and WebRTC audio.
Captions bridge improvements and auto-enable
app/src-tauri/src/meet_audio/captions_bridge.js
Broadens captions region detection, adds global seenTexts dedupe and per-speaker delta emission, expands CC auto-enable strategies and budget with button/keyboard fallbacks, trims seen_texts, and enriches diagnostics via window.__openhumanCaptionsBridgeInfo().
Caption listener heartbeat and bridge probing
app/src-tauri/src/meet_audio/caption_listener.rs
Adds heartbeat ticks, tick/total counters, changes drain_and_forward to return counts, probes page bridge diagnostics, and logs periodic heartbeats.
Frontend join flow, validation, recent calls UI
app/src/services/meetCallService.ts, app/src/components/skills/MeetingBotsCard.tsx
joinMeetCall accepts/validates ownerDisplayName, forwards it to Tauri, returns it in result; MeetingBotsModal requires owner name; fetches and renders recent calls; disables submit until URL+owner provided.
Frontend integration, HumanPage, tests, and i18n
app/src/features/human/HumanPage.tsx, app/src/components/skills/__tests__/MeetingBotsCard.test.tsx, app/src/lib/i18n/en.ts
Adds meeting modal button in HumanPage, updates tests for owner name gating and payloads, and adjusts i18n strings for meeting bots.
Mascot speaking-state animation and event handling
app/src/features/meet/MascotFrameProducer.tsx
Adds isSpeaking state, listens to meet-video:speaking-state events filtered by requestId, updates RiveMascot face prop, and cleans up listener on session end.

Sequence Diagram

sequenceDiagram
  participant UI as Frontend UI
  participant Tauri as Tauri Backend
  participant Webview as Meet Webview (bridge)
  participant Agent as Meet Agent (server)
  participant Store as Call Store (JSONL)

  UI->>Tauri: joinMeetCall(meetUrl, displayName, ownerDisplayName)
  Tauri->>Webview: open meet-call window / inject bridges
  Webview->>Tauri: __openhumanCaptionsBridgeInfo / __openhumanFlushAudio
  Tauri->>Agent: openhuman.meet_agent_start_session(+owner/bot)
  Agent->>Agent: caption-driven decisions (WakeFired / UnauthorizedWake)
  Agent->>Tauri: poll_speech (pcm_base64, flush_pending)
  Tauri->>Webview: invoke __openhumanFlushAudio() when flush_pending
  Agent->>Store: append_record on stop_session
  UI->>Tauri: handle meet-video:speaking-state events
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • senamakel
  • graycyrus

🐰 "I hopped through code to guard each call,
Names at the gate, no faceless sprawl,
When voices barge in, playback bows out,
Captions stay tidy, and the mascot mouths about!"

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: caption pipeline fixes, echo prevention, and mascot rendering improvements. It directly reflects the key objectives and modifications across the codebase.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added feature Net-new user-facing capability or product behavior. rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. agent Built-in agents, prompts, orchestration, and agent runtime in src/openhuman/agent/. working A PR that is being worked on by the team. bug labels May 27, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (2)
src/openhuman/meet_agent/session.rs (1)

117-124: 💤 Low value

Unbounded growth of last_caption_by_speaker HashMap.

The per-speaker dedup map is never pruned. In a long meeting with many participants (e.g., 50+ people joining/leaving a webinar), entries accumulate indefinitely. While each entry is small (speaker name + normalized text), consider adding a cap or LRU eviction to bound memory.

This is low-risk for typical 1:1 or small-group calls but could matter for large meetings.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/meet_agent/session.rs` around lines 117 - 124,
last_caption_by_speaker currently uses an unbounded HashMap which can grow
forever; replace it with a bounded cache (e.g., lru::LruCache<String,String> or
a HashMap plus a VecDeque/IndexMap to track insertion order) and enforce a
MAX_SPEAKERS cap to evict oldest entries when inserting new speakers. Update the
Session struct field name last_caption_by_speaker to the chosen cache type and
modify the code paths that insert/update it (the function(s) that deduplicate
and set per-speaker captions) to call the cache insert API or perform eviction
when size > MAX_SPEAKERS; add a small constant (e.g., MAX_SPEAKERS = 200) and
tests to ensure old entries are pruned.
src/openhuman/meet_agent/schemas.rs (1)

305-339: 💤 Low value

Schema type for calls field is misleading.

The calls output is typed as TypeSchema::String but the comment says "Array of MeetCallRecord objects". The actual ListCallsResponse struct uses Vec<MeetCallRecord> which serializes as a JSON array, not a string. This schema documentation mismatch could confuse API consumers.

If TypeSchema doesn't have an array/object variant, consider updating the comment to clarify the serialization format, or note that the schema type is a placeholder.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/meet_agent/schemas.rs` around lines 305 - 339, The
schema_list_calls() output incorrectly declares the "calls" field as
TypeSchema::String while the actual ListCallsResponse uses Vec<MeetCallRecord>
(a JSON array); update the schema by changing the "calls" field type to the
appropriate array/object/json variant provided by TypeSchema (e.g.,
TypeSchema::Array or TypeSchema::Json) if available so it reflects an array of
MeetCallRecord, and if no such variant exists, keep TypeSchema::String but
update the "calls" FieldSchema comment to explicitly state it is a
JSON-serialized array of MeetCallRecord objects (matching ListCallsResponse) so
API consumers aren’t misled.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/src/components/skills/MeetingBotsCard.tsx`:
- Around line 150-153: The component uses hard-coded English strings in
MeetingBotsCard (e.g., the error message set via setRecentError, ARIA labels,
loading/empty state text, owner-name form copy, recent-calls section and
relative-time strings) instead of the i18n hook; replace all literal strings in
MeetingBotsCard and any helper-returned text with t('meetingBots.*') keys called
via useT() (ensure you import/use useT() in the component and call t(...) where
message is set, ARIA labels are defined, and where loading/empty/owner form text
is rendered), add corresponding keys to the locale chunk files, and update any
helper functions used by this component to accept/use t so they return localized
strings rather than literals (apply same change to the other referenced blocks
around lines 301-321, 381-408, 443-445, 457-473).

In `@app/src/features/meet/MascotFrameProducer.tsx`:
- Around line 175-179: The listener registered via listen (the callback reading
event.payload) checks p.request_id but the Rust emitter in speak_pump.rs uses
"requestId" (camelCase), so change the payload access to p.requestId (or
normalize the key) and compare it to session.requestId before calling
setIsSpeaking; update the conditional in the listen callback (which references
event.payload, p, and session.requestId) to use the correct camelCase key so
isSpeaking updates correctly when speak_pump emits events.

In `@app/src/lib/i18n/en.ts`:
- Line 3331: The new i18n key 'skills.meetingBots.comingSoon' (and the other key
changed at the nearby change around line 3350) was added only in
app/src/lib/i18n/en.ts; update the corresponding English chunk file (the
matching en-N.ts that contains the same chunk where these keys belong) and add
the identical key entries to every non-English locale file for that same chunk
so all locale maps stay consistent; locate keys by name
('skills.meetingBots.comingSoon' and the other changed key) and replicate them
into the corresponding chunk files across en-N.ts and each locale's chunk file.

In `@app/src/services/meetCallService.ts`:
- Around line 45-57: Replace the hard-coded English error strings in
meetCallService.ts (the validation that throws for missing meetUrl, displayName,
and ownerDisplayName) with stable i18n error keys or an error object so the UI
can translate them via useT(); e.g. throw a ValidationError or Error with a
well-known code/string like "meet.error.missing_meet_url",
"meet.error.missing_display_name", "meet.error.missing_owner_display_name" (and
include any machine-readable metadata if needed). Update every similar literal
in this file (including the occurrences around lines 151–153) so services emit
keys/codes rather than raw English text and let the calling UI layer map those
keys to localized messages using useT().

---

Nitpick comments:
In `@src/openhuman/meet_agent/schemas.rs`:
- Around line 305-339: The schema_list_calls() output incorrectly declares the
"calls" field as TypeSchema::String while the actual ListCallsResponse uses
Vec<MeetCallRecord> (a JSON array); update the schema by changing the "calls"
field type to the appropriate array/object/json variant provided by TypeSchema
(e.g., TypeSchema::Array or TypeSchema::Json) if available so it reflects an
array of MeetCallRecord, and if no such variant exists, keep TypeSchema::String
but update the "calls" FieldSchema comment to explicitly state it is a
JSON-serialized array of MeetCallRecord objects (matching ListCallsResponse) so
API consumers aren’t misled.

In `@src/openhuman/meet_agent/session.rs`:
- Around line 117-124: last_caption_by_speaker currently uses an unbounded
HashMap which can grow forever; replace it with a bounded cache (e.g.,
lru::LruCache<String,String> or a HashMap plus a VecDeque/IndexMap to track
insertion order) and enforce a MAX_SPEAKERS cap to evict oldest entries when
inserting new speakers. Update the Session struct field name
last_caption_by_speaker to the chosen cache type and modify the code paths that
insert/update it (the function(s) that deduplicate and set per-speaker captions)
to call the cache insert API or perform eviction when size > MAX_SPEAKERS; add a
small constant (e.g., MAX_SPEAKERS = 200) and tests to ensure old entries are
pruned.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7dcf7345-3faf-4d7b-bd2a-bb3d110419c4

📥 Commits

Reviewing files that changed from the base of the PR and between 976e564 and 9545335.

📒 Files selected for processing (24)
  • app/src-tauri/src/lib.rs
  • app/src-tauri/src/meet_audio/audio_bridge.js
  • app/src-tauri/src/meet_audio/caption_listener.rs
  • app/src-tauri/src/meet_audio/captions_bridge.js
  • app/src-tauri/src/meet_audio/inject.rs
  • app/src-tauri/src/meet_audio/mod.rs
  • app/src-tauri/src/meet_audio/speak_pump.rs
  • app/src-tauri/src/meet_call/mod.rs
  • app/src-tauri/src/meet_scanner/mod.rs
  • app/src/components/intelligence/IntelligenceCallsTab.tsx
  • app/src/components/skills/MeetingBotsCard.tsx
  • app/src/components/skills/__tests__/MeetingBotsCard.test.tsx
  • app/src/features/human/HumanPage.tsx
  • app/src/features/meet/MascotFrameProducer.tsx
  • app/src/lib/i18n/en.ts
  • app/src/services/__tests__/meetCallService.test.ts
  • app/src/services/meetCallService.ts
  • src/openhuman/meet_agent/brain.rs
  • src/openhuman/meet_agent/mod.rs
  • src/openhuman/meet_agent/rpc.rs
  • src/openhuman/meet_agent/schemas.rs
  • src/openhuman/meet_agent/session.rs
  • src/openhuman/meet_agent/store.rs
  • src/openhuman/meet_agent/types.rs

Comment thread app/src/components/skills/MeetingBotsCard.tsx
Comment thread app/src/features/meet/MascotFrameProducer.tsx Outdated
Comment thread app/src/lib/i18n/en.ts
Comment thread app/src/services/meetCallService.ts
Rust speak_pump emits `requestId` (camelCase) but the TypeScript
listener destructured `request_id` (snake_case), causing the filter
to never match and `isSpeaking` to never update. Fix the TS type
and access to use camelCase.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
coderabbitai[bot]
coderabbitai Bot previously approved these changes May 27, 2026
The HTMLMediaElement patches missed Meet's incoming audio because
WebRTC routes it directly via RTCPeerConnection.ontrack, not through
<audio> elements. Add three interception layers:

1. RTCPeerConnection.addEventListener('track') wrapper — disables
   incoming audio tracks (track.enabled = false)
2. RTCPeerConnection.ontrack property setter — same for the property-
   based handler
3. AudioContext.createMediaStreamSource — returns a zero-gain node
   for remote streams so Web Audio routing produces silence

The bridge's own TTS path (AudioBufferSource → dest + ctx.destination)
is unaffected by all three patches.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
app/src-tauri/src/meet_audio/audio_bridge.js (1)

419-423: 💤 Low value

Consider adding a note about the return type change.

The patched function returns a GainNode instead of the original MediaStreamAudioSourceNode. While both inherit from AudioNode and the standard connect() pattern works fine, any code accessing source.mediaStream would get undefined. This is unlikely to affect Meet's audio routing (which just needs connect()), but a brief comment would help future maintainers understand the intentional type substitution.

📝 Suggested comment
       // Remote stream — create a gain node set to 0 and interpose it
       // so nothing reaches the destination.
+      // Note: we return the GainNode, not the original
+      // MediaStreamAudioSourceNode. Callers using connect() work fine;
+      // any code accessing .mediaStream would get undefined.
       var silencer = this.createGain();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/src-tauri/src/meet_audio/audio_bridge.js` around lines 419 - 423, Add a
brief inline comment in the patched createMediaStreamSource branch explaining
that the function now returns a GainNode (silencer) instead of the original
MediaStreamAudioSourceNode, noting that both are AudioNode-compatible so
connect() works but source.mediaStream will be undefined; reference the local
symbols "silencer", "createGain()", and "createMediaStreamSource" so future
maintainers understand the intentional type substitution and its implications
for code that reads mediaStream.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@app/src-tauri/src/meet_audio/audio_bridge.js`:
- Around line 419-423: Add a brief inline comment in the patched
createMediaStreamSource branch explaining that the function now returns a
GainNode (silencer) instead of the original MediaStreamAudioSourceNode, noting
that both are AudioNode-compatible so connect() works but source.mediaStream
will be undefined; reference the local symbols "silencer", "createGain()", and
"createMediaStreamSource" so future maintainers understand the intentional type
substitution and its implications for code that reads mediaStream.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d0fc19b6-59b5-4787-a7f2-f6ef939254a8

📥 Commits

Reviewing files that changed from the base of the PR and between 13d6023 and d5e84dd.

📒 Files selected for processing (1)
  • app/src-tauri/src/meet_audio/audio_bridge.js

coderabbitai[bot]
coderabbitai Bot previously approved these changes May 27, 2026
Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YellowSnnowmann hey — CI is still pending on this one, so i'll hold off on the final call until those land. skimmed the diff while waiting though, and overall the architecture is solid. a couple of things i spotted:

RTCPeerConnection.addEventListener wrapper breaks removeEventListener

In audio_bridge.js, the RTCPeerConnection prototype patch wraps the listener but stores the wrapped version — not the original — in the event listener registry. When Meet (or the browser) later calls removeEventListener('track', originalListener), the removal silently fails because wrappedListener !== originalListener. In a normal one-shot Meet session (page load → call ends → page unload) this doesn't matter because the page dies anyway. But Meet renegotiates WebRTC on mid-call changes (participant join/leave, quality adaptation) and some builds remove+readd track handlers during renegotiation. In that path the old wrapped listener stays registered and fires again alongside the new one — you'd get double event.track.enabled = false calls on re-added tracks, which is benign, but also double-dispatch of any other side-effects the original handler had. More importantly, if Meet has any removeEventListener in a cleanup teardown path, those silently no-op and the handler leaks for the page lifetime.

To fix this cleanly: instead of wrapping addEventListener, intercept at the RTCPeerConnection ontrack setter only (which you already do) and also patch RTCPeerConnection.prototype.addTrack's callback or use an ontrack getter/setter pair. The ontrack approach doesn't need removeEventListener semantics at all, and you already have that part right.

dump_aria_labels logs at warn

This is called on every failed click_by_aria_label as a normal diagnostic fallback — but it emits at log::warn!. That drags every label-drift hit into warn-level log aggregators and pages people who don't need to act on it. log::info! is the right level for "here's what Meet currently shows" diagnostic output.

});
mediaObserver.observe(document.documentElement || document.body || document, {
childList: true,
subtree: true,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[major] addEventListener wrapping breaks removeEventListener. The browser's event listener registry stores wrappedListener, so any caller that later does removeEventListener('track', originalHandler) silently fails — the old wrapped listener stays live and fires alongside any new one added after the remove. This is a problem if Meet renegotiates WebRTC mid-call (remove + readd track listeners) or if its teardown path calls removeEventListener for cleanup.

The ontrack setter interception below (which you already have) is the right approach for this. Consider dropping the addEventListener wrap entirely and relying solely on the ontrack property setter — Meet uses ontrack exclusively in the builds I've seen, and you don't need removeEventListener to work with a setter.

Comment thread app/src-tauri/src/meet_scanner/mod.rs
@graycyrus
Copy link
Copy Markdown
Contributor

@YellowSnnowmann unresolved review feedback — please address before we review.

Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YellowSnnowmann continuation review — one fix confirmed, one major finding still open, CI not yet green.

Fixed since last review:

  • meet_scanner/mod.rs — log level changed from warn to info as requested. Good.

Still open:

  • audio_bridge.js:329 — the RTCPeerConnection.addEventListener wrapper is still in the diff (line 191-201). The core problem remains: the browser's event listener registry stores wrappedListener, not the original handler, so any Meet code that calls removeEventListener('track', originalFn) silently fails — the wrapped listener stays live. The ontrack setter interception (which you have) is sufficient on its own since Meet uses ontrack. Drop the addEventListener patch entirely.

CI:

  • Coverage Gate (diff-cover >= 80%) is failing.
  • E2E Windows still pending.

Fix the coverage gate and the addEventListener issue, then I'll approve.

@graycyrus
Copy link
Copy Markdown
Contributor

Unresolved review feedback from coderabbitai[bot] — please address before we review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent Built-in agents, prompts, orchestration, and agent runtime in src/openhuman/agent/. bug feature Net-new user-facing capability or product behavior. rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. working A PR that is being worked on by the team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants