feat(provider): add Qwen3-ASR-Flash STT provider & fix STT not triggering bug #6789

muchstarlight wants to merge 2 commits into AstrBotDevs:master
Conversation
Add a new speech-to-text provider using DashScope's Qwen3-ASR-Flash model with base64 encoded audio support.
The Record component stores the audio URL in the `file` attribute, not the `url` attribute. This caused STT to never trigger, since `component.url` was always empty when checking for voice messages.
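The fix described above can be sketched as follows. The `Record` class here is a minimal stand-in mirroring the attributes named in this description, and `get_audio_source` is an illustrative helper, not the actual `PreProcessStage` code:

```python
# Minimal sketch of the voice-message check described above.
# `Record` mirrors the message component from the PR description;
# the real class lives in AstrBot's message component module.
from dataclasses import dataclass


@dataclass
class Record:
    file: str = ""  # Record.fromURL() stores the audio URL here
    url: str = ""   # previously (and wrongly) checked by PreProcessStage


def get_audio_source(component: Record) -> str:
    """Return the audio path/URL, preferring `file` over the legacy `url`."""
    return component.file or component.url


# Before the fix, the check used component.url and always saw "".
record = Record(file="https://example.com/voice.amr")
assert get_audio_source(record) == "https://example.com/voice.amr"
```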
Summary of Changes (Gemini Code Assist): This pull request introduces a new Speech-to-Text (STT) provider, Qwen3-ASR-Flash, enhancing the system's capability to transcribe audio. It also addresses a critical bug that prevented the STT functionality from activating, ensuring that audio messages are now correctly processed for transcription.
Hey - I've found 3 issues, and left some high-level feedback:

- The `timeout` value from the provider config is stored but never used when calling `MultiModalConversation.call`; consider passing it through (if supported) or removing the setting to avoid confusion.
- In `_prepare_audio`, temporary files created when downloading from a URL (the initial `path` before any conversion) are never cleaned up, which can leak files in the temp directory; consider tracking and removing the downloaded temp file in `finally`.
- `_get_audio_format` is defined as `async` but only performs synchronous file I/O and is awaited in one place; consider making it a regular synchronous method to simplify usage and avoid unnecessary `await`.
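The cleanup concern in the second point can be addressed with a `try`/`finally` around the download. A minimal, self-contained sketch — here `download_file` is a stub standing in for AstrBot's real downloader, and `transcribe_url` is illustrative, not the provider's actual method:

```python
import asyncio
import os
import tempfile


async def download_file(url: str, path: str) -> None:
    """Stand-in for AstrBot's download helper: just writes placeholder bytes."""
    with open(path, "wb") as f:
        f.write(b"fake audio payload")


async def transcribe_url(audio_url: str) -> str:
    """Download a remote audio file, transcribe it, and always clean up."""
    downloaded_path = None
    try:
        if audio_url.startswith("http"):
            fd, downloaded_path = tempfile.mkstemp(prefix="qwen_asr_")
            os.close(fd)
            await download_file(audio_url, downloaded_path)
            audio_url = downloaded_path
        # ... real code would convert/encode and call the ASR API here ...
        return f"transcribed:{os.path.basename(audio_url)}"
    finally:
        # Remove the downloaded temp file even if transcription raised.
        if downloaded_path and os.path.exists(downloaded_path):
            os.remove(downloaded_path)
```

Because the removal sits in `finally`, the temp file is deleted on both the success and the exception path, which is exactly the leak the review points at.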
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `timeout` value from the provider config is stored but never used when calling `MultiModalConversation.call`; consider passing it through (if supported) or removing the setting to avoid confusion.
- In `_prepare_audio`, temporary files created when downloading from a URL (the initial `path` before any conversion) are never cleaned up, which can leak files in the temp directory; consider tracking and removing the downloaded temp file in `finally`.
- `_get_audio_format` is defined as `async` but only performs synchronous file I/O and is awaited in one place; consider making it a regular synchronous method to simplify usage and avoid unnecessary `await`.
## Individual Comments
### Comment 1
<location path="astrbot/core/provider/sources/qwen_asr_flash_source.py" line_range="59-68" />
<code_context>
+ self.timeout = provider_config.get("timeout", 30)
</code_context>
<issue_to_address>
**issue (bug_risk):** The configured timeout value is stored but not applied to the DashScope API call.
`timeout` is read from `provider_config`, but `MultiModalConversation.call` is not given any timeout-related argument, so this value is never used. If the DashScope client exposes a timeout option (e.g., `request_timeout`), pass `self.timeout` through so the config actually controls request duration and avoids indefinitely hanging calls.
</issue_to_address>
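If the DashScope SDK turns out not to expose a timeout parameter, the configured value can still be enforced at the call site using only the standard library. A sketch, where `slow_api_call` is a stand-in for the blocking `MultiModalConversation.call`:

```python
import asyncio
import time


def slow_api_call() -> str:
    """Stand-in for the blocking MultiModalConversation.call."""
    time.sleep(0.1)
    return "ok"


async def call_with_timeout(timeout: float) -> str:
    # Run the blocking SDK call in a worker thread and bound its duration,
    # so a hung request cannot stall the event loop indefinitely.
    return await asyncio.wait_for(asyncio.to_thread(slow_api_call), timeout)


result = asyncio.run(call_with_timeout(timeout=30))
assert result == "ok"
```

One caveat: on timeout, `asyncio.wait_for` stops awaiting the call but cannot kill the worker thread, so the underlying request may still run to completion in the background.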
### Comment 2
<location path="astrbot/core/provider/sources/qwen_asr_flash_source.py" line_range="110-119" />
<code_context>
+ output_path = None
+
+ # Download from URL if needed
+ if audio_url.startswith("http"):
+ if "multimedia.nt.qq.com.cn" in audio_url:
+ is_tencent = True
+
+ temp_dir = get_astrbot_temp_path()
+ path = os.path.join(
+ temp_dir,
+ f"qwen_asr_{os.urandom(4).hex()}.input",
+ )
+ await download_file(audio_url, path)
+ audio_url = path
+
</code_context>
<issue_to_address>
**suggestion:** Preserving the original file extension when downloading could improve MIME detection for base64 encoding.
Because downloaded files are saved with a `.input` suffix, `_get_mime_type` always falls back to the default `audio/mpeg` for these URLs. If you derive and reuse the original extension from the remote URL (e.g., `.wav`, `.mp3`, `.ogg`) in the temp filename, `_get_mime_type` can infer a more accurate MIME type and better match what the API expects.
Suggested implementation:
```python
import base64
import os
import pathlib
from urllib.parse import urlparse
import dashscope
from dashscope import MultiModalConversation
```
```python
is_tencent = False
output_path = None

# Download from URL if needed
if audio_url.startswith("http"):
    if "multimedia.nt.qq.com.cn" in audio_url:
        is_tencent = True

    temp_dir = get_astrbot_temp_path()
    # Preserve original file extension (if present) to improve MIME detection
    parsed_url = urlparse(audio_url)
    _, ext = os.path.splitext(parsed_url.path)
    if not ext:
        ext = ".input"
    path = os.path.join(
        temp_dir,
        f"qwen_asr_{os.urandom(4).hex()}{ext}",
    )
    await download_file(audio_url, path)
    audio_url = path
```
</issue_to_address>
### Comment 3
<location path="astrbot/core/provider/sources/qwen_asr_flash_source.py" line_range="166" />
<code_context>
+ base64_str = base64.b64encode(file_path_obj.read_bytes()).decode()
+ return f"data:{mime_type};base64,{base64_str}"
+
+ async def get_text(self, audio_url: str) -> str:
+ """Transcribe audio file to text using Qwen3-ASR-Flash API.
+
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting response parsing into a helper, simplifying audio format detection, and optionally splitting out URL downloading to make `get_text` and `_prepare_audio` easier to follow without changing behaviour.
You can reduce complexity in a few focused spots without changing behaviour.
### 1. Flatten response parsing in `get_text`
The current parsing is quite nested and defensive. You can make it easier to follow by:
- Using `getattr(...)` with sane defaults.
- Normalising `content` into a list.
- Using a single dict branch and `.get()` calls.
Example:
```python
def _extract_text_from_response(self, response) -> str:
    if response.status_code != 200:
        error_msg = getattr(response, "message", None) or f"API error: {response.status_code}"
        logger.error(f"Qwen3-ASR-Flash API error: {error_msg}")
        raise Exception(f"Qwen3-ASR-Flash API error: {error_msg}")

    output = getattr(response, "output", None)
    choices = getattr(output, "choices", []) or []
    if not choices:
        return ""

    message = getattr(choices[0], "message", None)
    content = getattr(message, "content", "")

    # normalise to list
    if isinstance(content, str):
        return content.strip()
    if not isinstance(content, list):
        return ""

    parts: list[str] = []
    for item in content:
        if isinstance(item, dict):
            if "text" in item:
                parts.append(item.get("text", ""))
            elif "audio" in item:
                parts.append(item.get("audio", ""))
    return "".join(parts).strip()
```
Then `get_text` becomes:
```python
async def get_text(self, audio_url: str) -> str:
    output_path = None
    try:
        audio_path, output_path = await self._prepare_audio(audio_url)
        data_uri = self._encode_audio_base64(audio_path)

        messages = [{"role": "user", "content": [{"audio": data_uri}]}]
        asr_options = {"enable_itn": self.enable_itn}
        if self.language != "auto":
            asr_options["language"] = self.language

        response = MultiModalConversation.call(
            api_key=self.api_key,
            model=self.model,
            messages=messages,
            result_format="message",
            asr_options=asr_options,
        )
        text = self._extract_text_from_response(response)
        logger.debug(f"Qwen3-ASR-Flash transcription: {text}")
        return text
    except Exception as e:
        logger.error(f"Qwen3-ASR-Flash transcription error: {e}")
        raise
    finally:
        if output_path and os.path.exists(output_path):
            try:
                os.remove(output_path)
            except Exception as e:
                logger.error(f"Failed to remove temp file {output_path}: {e}")
```
This keeps all behaviour but separates orchestration from parsing and removes nested `hasattr` checks.
### 2. Make `_get_audio_format` synchronous
The method does only synchronous file I/O and header checks; making it async forces callers to `await` with no real benefit.
```python
def _get_audio_format(self, file_path: str) -> str | None:
    silk_header = b"SILK"
    amr_header = b"#!AMR"
    try:
        with open(file_path, "rb") as f:
            file_header = f.read(8)
    except FileNotFoundError:
        return None
    if silk_header in file_header:
        return "silk"
    if amr_header in file_header:
        return "amr"
    return None
```
And update `_prepare_audio` accordingly:
```python
elif (
    lower_audio_url.endswith(".amr")
    or lower_audio_url.endswith(".silk")
    or is_tencent
):
    file_format = self._get_audio_format(audio_url)
    if file_format in ["silk", "amr"]:
        temp_dir = get_astrbot_temp_path()
        output_path = os.path.join(temp_dir, f"qwen_asr_{os.urandom(4).hex()}.wav")
        if file_format == "silk":
            logger.info("Converting silk file to wav...")
            await tencent_silk_to_wav(audio_url, output_path)
        elif file_format == "amr":
            logger.info("Converting amr file to wav...")
            await convert_to_pcm_wav(audio_url, output_path)
        audio_url = output_path
```
This removes unnecessary async surface area while preserving functionality.
### 3. Optional: small helpers to clarify `_prepare_audio`
Without changing the logic, you can pull out URL handling to reduce branching inside `_prepare_audio`:
```python
async def _download_if_url(self, audio_url: str) -> tuple[str, bool]:
    if not audio_url.startswith("http"):
        return audio_url, False
    is_tencent = "multimedia.nt.qq.com.cn" in audio_url
    temp_dir = get_astrbot_temp_path()
    path = os.path.join(temp_dir, f"qwen_asr_{os.urandom(4).hex()}.input")
    await download_file(audio_url, path)
    return path, is_tencent
```
Then in `_prepare_audio`:
```python
async def _prepare_audio(self, audio_url: str) -> tuple[str, str | None]:
    audio_url, is_tencent = await self._download_if_url(audio_url)
    output_path = None

    if not os.path.exists(audio_url):
        raise FileNotFoundError(f"File not found: {audio_url}")

    lower_audio_url = audio_url.lower()
    if lower_audio_url.endswith(".opus"):
        ...
    elif lower_audio_url.endswith((".amr", ".silk")) or is_tencent:
        ...
    return audio_url, output_path
```
This keeps behaviour intact but makes the flow (download → conversion decisions) easier to follow.
</issue_to_address>
Code Review
This pull request introduces the Qwen3-ASR-Flash speech-to-text provider and resolves a bug that was preventing STT from triggering. The bug fix appears correct and addresses the issue described. However, the implementation of the new provider has several areas for improvement, primarily concerning synchronous (blocking) operations within an asynchronous context. These blocking calls for file I/O and network requests can significantly degrade application performance by holding up the event loop. Additionally, there's an issue with modifying global state for the API endpoint, which could lead to incorrect behavior if multiple provider instances are used. I've provided specific comments and code suggestions to address these points by using `asyncio.to_thread` for blocking calls and ensuring thread-safe configuration.
```python
self.timeout = provider_config.get("timeout", 30)

# Set the DashScope API base URL
dashscope.base_http_api_url = self.api_base
```

Setting the global `dashscope.base_http_api_url` in the initializer can lead to incorrect behavior when multiple provider instances with different `api_base` URLs are used. The last provider to initialize would set the URL for all of them. This line should be removed, and the `api_base` should be set just-in-time within the `get_text` method before making the API call, to ensure each request uses its correct endpoint.
```python
# Call API
response = MultiModalConversation.call(
    api_key=self.api_key,
    model=self.model,
    messages=messages,
    result_format="message",
    asr_options=asr_options,
)
```

`MultiModalConversation.call` is a synchronous (blocking) network call. Invoking it directly within an `async` method will block the entire event loop, severely impacting application performance and responsiveness. You should use `asyncio.to_thread` to run this blocking operation in a separate thread. Additionally, to fix the issue of using a global `api_base` set in `__init__`, the `api_base` should be configured here, just before making the API call.

```python
# Set API base for this call and execute the blocking call in a thread
def _blocking_call():
    dashscope.base_http_api_url = self.api_base
    return MultiModalConversation.call(
        api_key=self.api_key,
        model=self.model,
        messages=messages,
        result_format="message",
        asr_options=asr_options,
    )

response = await asyncio.to_thread(_blocking_call)
```
```python
with open(file_path, "rb") as f:
    file_header = f.read(8)
```

The `open()` and `read()` calls are synchronous file operations that will block the asyncio event loop. This can degrade performance, especially with slow storage. These operations should be performed asynchronously by running them in a separate thread using `asyncio.to_thread`:

```python
def _read_header(p):
    with open(p, "rb") as f:
        return f.read(8)

file_header = await asyncio.to_thread(_read_header, file_path)
```
Summary

This PR adds a new speech-to-text provider and fixes a bug that prevented STT from being triggered.

Modifications

feat(provider): Add Qwen3-ASR-Flash STT Provider

- New provider source: `astrbot/core/provider/sources/qwen_asr_flash_source.py`

fix(stt): Fix STT not triggering bug

- `PreProcessStage` was checking `Record.url` for the audio path
- `Record.fromURL()` stores the URL in the `Record.file` attribute, not `url`
- This caused `component.url` to always be empty, making STT never trigger
- Fixed by checking `Record.file` instead
- Affected file: `astrbot/core/pipeline/preprocess_stage/stage.py`

Verification Steps

- Enable STT: `provider_stt_settings.enable = true`

Screenshots or Test Results
(Please add screenshots showing the voice message being transcribed successfully)
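For local testing, the provider's constructor reads its settings from `provider_config` (the reviewed source calls `provider_config.get(...)` for `timeout`, and references `api_key`, `model`, `language`, and `enable_itn`). A hypothetical config dict consistent with those reads — the exact AstrBot schema may differ, so treat every key here as illustrative:

```python
# Hypothetical provider_config for the new STT provider; keys mirror the
# provider_config.get(...) calls and attributes visible in the reviewed
# source. The exact schema AstrBot expects may differ.
provider_config = {
    "id": "qwen3-asr-flash",
    "type": "qwen_asr_flash",  # assumed provider type name
    "api_key": "sk-...",       # DashScope API key
    "model": "qwen3-asr-flash",
    "timeout": 30,             # seconds; currently stored but unused, per review
    "language": "auto",        # or e.g. "zh", "en"
    "enable_itn": True,        # inverse text normalization
}

# Mirrors the constructor's read, including its default.
timeout = provider_config.get("timeout", 30)
assert timeout == 30
```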
Summary by Sourcery
Add a new DashScope-based Qwen3-ASR-Flash speech-to-text provider and fix speech-to-text not triggering due to incorrect record path handling.
New Features:

- Add a Qwen3-ASR-Flash speech-to-text provider backed by DashScope, with base64-encoded audio support.

Bug Fixes:

- Read the audio path from `Record.file` instead of `Record.url` in the preprocess stage, so voice messages actually trigger STT.
Enhancements: