Feature/fix 3039 speaker #3817
Conversation
…tion Fixes BasedHardware#3039
- Replace regex-based speaker identification with Qwen2.5-1.5B-Instruct LLM
- Distinguish between addressed vs. mentioned speakers
- Support multiple addressees (returns a list)
- Add GPU acceleration (Metal/CUDA)
- Thread-safe singleton pattern
- Add README_SPEAKER_ID.md with setup instructions

Performance:
- 100% accuracy on test suite
- ~300 ms latency with GPU
- Model: qwen_1.5b_speaker.gguf (1.1 GB, Apache 2.0)
…tection
- Restore detect_speaker_from_text() with the original multi-language regex patterns for self-identification (e.g., 'I am Alice', 'My name is Bob')
- Keep identify_speaker_from_transcript() for LLM-based addressee detection (e.g., 'Hey Alice, help' -> ['Alice'])
- Fix warmup exception handler with proper logging
- Both functions now coexist for different use cases
- Add STRICT EXCLUSION RULE in prompt for verbs like told/said/saw/asked
- 'I told Alice' now correctly returns null (mentioned, not addressed)
- 'Hey Alice, help' still correctly returns ['Alice'] (addressed)
- Fixes false-positive detection of mentioned names
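For illustration, an exclusion rule of this kind could be phrased in the system prompt roughly as below. The PR's exact prompt text is not shown in this thread, so the wording here is a hypothetical sketch built from the commit notes:

```python
# Hypothetical sketch of an addressee-detection prompt with an exclusion
# rule for merely mentioned names; the exact prompt in the PR may differ.
SPEAKER_ID_PROMPT = """You identify who a speaker is talking TO.

STRICT EXCLUSION RULE: if a name appears only as the object of verbs
like "told", "said", "saw", or "asked", it is MENTIONED, not addressed.
Do not include such names.

Return a JSON list of addressed names, or null if nobody is addressed.

Examples:
- "Hey Alice, help" -> ["Alice"]
- "I told Alice about it" -> null
"""
```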
Code Review
This pull request introduces a significant improvement by replacing the previous regex-based speaker detection with a more robust, self-hosted LLM solution. The implementation is well done, incorporating thread-safe model loading, GPU acceleration, and a warmup mechanism to reduce latency. The code is clean, and the inclusion of new documentation and comprehensive tests is excellent. I have one suggestion to improve the performance of the legacy regex function to better align with its stated goal.
Interesting. @uulasb what do you think is the best model we can use for speaker identification and transcript cleaning? @thainguyensunya please remind me once we have the self-hosted Llama so we can continue with this ticket.
ah, one more thing, it would be great if you guys could work together to make it happen. we need to self-host the model so we can test it on our dev environment first, then move to production later. thank you.
@uulasb For your information, we will have a separate/external self-hosted LLM inference service with an OpenAI-compatible API server endpoint. Specifically, we will use vLLM for inference, the Llama-3.1-8B-Instruct model, and 1 x NVIDIA L4 GPU. This approach supports high-throughput inference in production and uses GPU power efficiently. Please let me know your thoughts. (Do you think Llama-3.1-8B-Instruct is overkill for speaker identification and transcript cleaning?)
… transcript cleaning
- Replace llama-cpp-python with strict AsyncOpenAI client
- Add transcript cleaning to system prompt
- Update unit tests with AsyncMock and edge cases (12/12 pass)
- Update documentation for VLLM_ env vars
- Remove local dependencies
@thainguyensunya @beastoin Thanks for the guidance! I completely agree with the move to external vLLM; it makes the backend much lighter and easier to scale. I've just pushed the refactor to match your roadmap.

@thainguyensunya On the 8B model size: if this were just for name detection, I'd agree it's overkill. However, to justify using the L4 GPU, I updated the prompt to also handle transcript cleaning in the same pass. It now identifies the speaker AND scrubs filler words ("um", "uh") / fixes grammar simultaneously. We get a much better user experience for the same inference cost, which makes the 8B model a great fit.

@beastoin As requested, I have removed the local GGUF/llama-cpp code entirely. I've switched the backend to AsyncOpenAI, which clears out the heavy llama-cpp dependencies and keeps the event loop non-blocking. I also made sure to keep the original regex function synchronous, so we don't accidentally break any legacy calls. I verified the logic on Groq (simulating your setup) and it's hitting ~300 ms.

For configuration, I standardized the environment variables for your vLLM deployment as follows: VLLM_API_BASE
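For reference, a minimal sketch of the AsyncOpenAI-against-vLLM call this describes. `VLLM_API_BASE` is from the comment above; `VLLM_API_KEY`, `VLLM_MODEL`, the JSON schema, and `identify_and_clean` are hypothetical illustrations, not the PR's actual code:

```python
import json
import os

from openai import AsyncOpenAI

# Client pointed at the self-hosted vLLM server (OpenAI-compatible API).
client = AsyncOpenAI(
    base_url=os.environ["VLLM_API_BASE"],
    api_key=os.environ.get("VLLM_API_KEY", "EMPTY"),  # vLLM often ignores the key
)

async def identify_and_clean(transcript: str) -> dict:
    """Single pass: detect addressees AND scrub fillers / fix grammar."""
    response = await client.chat.completions.create(
        model=os.environ.get("VLLM_MODEL", "meta-llama/Llama-3.1-8B-Instruct"),
        messages=[
            {"role": "system", "content": "Return JSON with 'addressees' "
             "(list of names or null) and 'cleaned_text' (fillers removed, "
             "grammar fixed)."},
            {"role": "user", "content": transcript},
        ],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```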
- Moved speaker_identification.py -> text_speaker_detection.py to avoid conflict with upstream audio code
- Updated imports in transcribe.py and verify_llama_8b.py
- Renamed and updated tests/test_speaker_identification.py -> tests/test_text_speaker_detection.py
@thainguyensunya @beastoin I noticed main recently introduced a new "speaker_identification.py" for audio embedding logic. To resolve the merge conflict and keep concerns separate, I have renamed my module to "backend/utils/text_speaker_detection.py". This ensures my vLLM/text logic coexists cleanly with your new audio logic without overwriting it.
what are the test results?
@beastoin Test results: 12/12 passed in 0.27s. Legacy regex: 6/6.
@uulasb First priority: I don't think this PR works end-to-end yet; the verification script crashes.
@thainguyensunya What are the current self-hosted model's maximum RPS and latency? Asking because the real-time feature can consume the LLM's resources heavily.
@beastoin Thanks. I deployed the fixes:
- Updated transcribe.py to correctly await the new async LLM path (with the regex fallback preserved).
- Implemented a singleton pattern for AsyncOpenAI with a proper shutdown hook in main.py (resolving the memory-leak risk).
- Fixed the in-function test imports and removed the duplicate requirements.txt.
- Resolved the merge conflict in main.py while ensuring the new shutdown logic is included.
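A minimal sketch of the singleton-plus-shutdown-hook pattern described above, assuming the OpenAI Python client; the function names are hypothetical:

```python
import os

from openai import AsyncOpenAI

# Process-wide singleton so every request reuses one connection pool.
_client: AsyncOpenAI | None = None

def get_llm_client() -> AsyncOpenAI:
    global _client
    if _client is None:
        _client = AsyncOpenAI(
            base_url=os.environ["VLLM_API_BASE"],
            api_key=os.environ.get("VLLM_API_KEY", "EMPTY"),
        )
    return _client

async def close_llm_client() -> None:
    """Called from a shutdown hook (e.g., in main.py) so the connection
    pool is released and nothing leaks across reloads."""
    global _client
    if _client is not None:
        await _client.close()
        _client = None
```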
@thinhx Do we have the average tokens per prompt and the expected concurrency for the estimation?
Just a note @thainguyensunya to help with the estimation: I used a regex-first approach, so self-identifications (e.g., "I am Alice") cost 0 tokens; the LLM is bypassed completely. All other segments (addressee detection & cleaning) go through the LLM, so the average throughput depends on how often users self-identify vs. speak normally.
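In code terms, the regex-first routing looks roughly like the sketch below, reusing the two functions named earlier in the thread; the wrapper itself is a hypothetical illustration:

```python
# Functions from backend/utils/text_speaker_detection.py (renamed module).
from utils.text_speaker_detection import (
    detect_speaker_from_text,          # synchronous legacy regex path
    identify_speaker_from_transcript,  # async LLM path
)

async def route_segment(text: str):
    """Hypothetical wrapper showing the regex-first routing."""
    # Regex first: self-identification ("I am Alice") costs 0 LLM tokens.
    name = detect_speaker_from_text(text)
    if name:
        return [name]
    # All other segments hit the LLM (addressee detection & cleaning).
    return await identify_speaker_from_transcript(text)
```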
@uulasb Thank you for the information.
Hey, friendly nudge — could you add a demo or end-to-end testing evidence to this PR? Screenshots, video, terminal output showing it working, anything that proves it runs correctly. In the AI era, generating code is the easy part — what really matters is showing it works. A solid demo or test run gives reviewers the confidence to merge quickly. Without it, PRs tend to sit in the queue longer than they need to. Thanks for contributing!
Co-authored-by: Cursor <cursoragent@cursor.com>
Hi @beastoin, of course.

Firstly, unit tests (16/16 passed, 0.34s):
Ekran.Kaydi.2026-02-18.01.58.51.mov

Secondly, live integration demo, Llama 3.1 8B via Groq (8/8 passed):
Ekran.Kaydi.2026-02-18.02.05.25.mov

Also, changes in this push include:
Reproducible via:
Co-authored-by: Cursor <cursoragent@cursor.com>
Hi @beastoin, thanks for the ping. I've rebased on the latest main. I'm looking forward to your review.
Fixes #3039
Summary
Replaces the outdated regex-based speaker detection with a robust, self-hosted LLM solution using `Qwen2.5-1.5B-Instruct` and `llama-cpp-python`. This PR significantly improves accuracy by distinguishing between addressed speakers (e.g., "Hey Alice") and mentioned names (e.g., "I told Alice"), while preserving legacy compatibility.

Changes
Core Functionality
- Implemented Addressee Detection: Uses Qwen2.5-1.5B to identify who is being spoken TO.
  - Addressed speakers return a list, e.g., `["Alice", "Bob"]` (multiple addressees supported)
  - Mentioned-only names (e.g., "I told Alice") return `None`
- Restored Legacy Compatibility: The `detect_speaker_from_text` function now correctly uses the original multi-language regex patterns for self-identification (e.g., "I am Alice"), ensuring existing backend logic remains unbroken.

Performance & Reliability
Performance Optimization:
- GPU acceleration (Metal/CUDA) via `n_gpu_layers=-1`

Reliability:
- Thread-safe singleton model loading
- Warmup call to reduce first-request latency
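For illustration, a minimal sketch of the loading pattern these bullets describe, assuming `llama-cpp-python`'s standard `Llama` constructor; the lock, warmup prompt, and function name are illustrative rather than the PR's actual code:

```python
import threading

from llama_cpp import Llama

_model: Llama | None = None
_lock = threading.Lock()

def get_model() -> Llama:
    """Thread-safe singleton: load the GGUF once, offload all layers."""
    global _model
    with _lock:
        if _model is None:
            _model = Llama(
                model_path="backend/utils/qwen_1.5b_speaker.gguf",
                n_gpu_layers=-1,  # offload every layer (Metal/CUDA)
                verbose=False,
            )
            # Warmup call so the first real request avoids cold-start latency.
            _model.create_chat_completion(
                messages=[{"role": "user", "content": "Hey Alice, help"}],
                max_tokens=8,
            )
    return _model
```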
Documentation & Testing
- Added `backend/README_SPEAKER_ID.md` with setup/usage instructions
- Added `backend/tests/test_speaker_identification.py`

Verification Results
Ran the comprehensive test suite in `test_speaker_identification.py`: 100% accuracy, ~300 ms latency with GPU.

Setup
```bash
pip install llama-cpp-python

# Download Model (1.1GB)
curl -L -o backend/utils/qwen_1.5b_speaker.gguf \
  https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/qwen2.5-1.5b-instruct-q4_k_m.gguf
```

AI Disclosure
Tools used: Cursor / Gemini