Automatic Speech Recognition
1. What Is an Automatic Speech Recognition Task?
Automatic Speech Recognition (ASR) is an important task in large-model inference services. It converts audio input into corresponding text output automatically. With ASR capabilities, users can quickly transcribe spoken conversations, meeting recordings, and media content into structured text, making it widely applicable in use cases such as meeting minutes, real-time subtitles, and voice-driven interactions.
2. Typical Use Cases
- Meeting Transcription: Automatically transcribe meeting recordings into text with speaker labels, making it easy to review and archive discussions.
- Real-Time Subtitles: Provide low-latency subtitle generation for live video streams, online education, and similar scenarios.
- Voice Customer Service: Transcribe user voice input into text for downstream NLP systems to perform intent recognition and generate responses.
- Multilingual Content Processing: Support transcription of multiple languages and dialects, suitable for international content production.
- Accessibility Support: Help hearing-impaired users access spoken content through text output.
3. Key Factors Affecting Inference Quality
Audio Input Quality
- The sample rate, signal-to-noise ratio, and encoding format of the audio directly affect recognition accuracy.
- It is recommended to use 16 kHz mono-channel audio and minimize background noise to achieve the best recognition results.
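Before uploading a file, it can help to confirm that it matches the recommended format. The following is a minimal sketch using only Python's standard `wave` module, so it applies to uncompressed PCM WAV files only; the helper name `check_wav` and the constants are illustrative and not part of the service's API.

```python
import wave

# Recommended input format from the guidance above.
RECOMMENDED_RATE_HZ = 16_000
RECOMMENDED_CHANNELS = 1


def check_wav(path):
    """Return a list of human-readable warnings for a PCM WAV file."""
    warnings = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != RECOMMENDED_RATE_HZ:
        warnings.append(f"sample rate is {rate} Hz, expected {RECOMMENDED_RATE_HZ} Hz")
    if channels != RECOMMENDED_CHANNELS:
        warnings.append(f"audio has {channels} channels, expected mono")
    return warnings
```

An empty list means the file already uses 16 kHz mono audio; otherwise, resample or downmix it with an audio tool of your choice before sending it for transcription.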
Parameter Configuration
The following parameters control the behavior and output format of speech recognition:
Language
- Specifies the language of the audio to improve recognition accuracy.
- Can be set to `auto` to enable automatic language detection, which is useful for multilingual or mixed-language audio.
- Use case: Specify this parameter when the audio language is known to reduce misrecognition.
Response Format
- Controls the format of the transcription output. Common options include:
  - `text`: Plain text; returns only the recognized content.
  - `verbose_json`: Detailed JSON format; includes text, timestamps, speaker labels, and other structured information.
- Use case: Choose based on your needs; use `verbose_json` if timestamps or speaker information are required.
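As an illustration of what the timestamps in a detailed JSON response make possible, the sketch below turns a parsed result into SRT subtitle text. The response layout is an assumption: it presumes a `segments` list whose items carry `start` and `end` times in seconds plus a `text` field, which may not match what your service actually returns. Both function names are hypothetical.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(result):
    """Build SRT subtitle text from a parsed verbose_json-style result.

    Assumes result["segments"] is a list of dicts with "start", "end"
    (in seconds) and "text" keys. This layout is an assumption, not a
    documented schema; adjust the keys to the real response.
    """
    blocks = []
    for index, segment in enumerate(result.get("segments", []), start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)
```

If the response has no `segments` key, the function returns an empty string, which is a convenient signal to inspect the raw JSON and adapt the key names.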
Hotword
- Supplies domain-specific vocabulary or high-frequency terms so that the model favors these words, improving accuracy in specialized domains.
- Use case: Suitable for fields with abundant technical terminology such as healthcare, legal, and finance.
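The three parameters above can be collected into the form fields of a request. The helper below is a rough sketch under stated assumptions: the field name `hotword` and the space-separated encoding of the terms are guesses made for illustration, and the function name `build_transcription_form` is hypothetical. Check your service's API reference for the actual field name and format.

```python
def build_transcription_form(model, language="auto", response_format="text", hotwords=None):
    """Assemble the form fields for a transcription request.

    The "hotword" field name and the space-separated encoding are
    assumptions for illustration only.
    """
    form = {
        "model": model,
        "language": language,
        "response_format": response_format,
    }
    if hotwords:
        # Assumed encoding: terms joined by single spaces.
        form["hotword"] = " ".join(hotwords)
    return form


# Example: request detailed JSON output and bias toward two medical terms.
data = build_transcription_form(
    "sensevoice",
    response_format="verbose_json",
    hotwords=["stent", "angioplasty"],
)
```

The resulting dictionary can be passed as the `data` argument of the request, alongside the uploaded audio file.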
4. Sample Code
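import requests

# Replace the placeholder host with the endpoint of your own deployment.
url = "https://xxxxxxxxxxxx.space.opencsg.com/v1/audio/transcriptions"
headers = {}  # add any headers your endpoint requires

# Keep the file open until the request has been sent.
with open("audio.wav", "rb") as audio_file:
    files = {
        "file": ("audio.wav", audio_file, "audio/wav"),
    }
    data = {
        "model": "sensevoice",
        "language": "auto",
        "response_format": "verbose_json",
    }
    response = requests.post(url=url, headers=headers, files=files, data=data)

if response.status_code == 200:
    result = response.json()
    print(result)
else:
    # Surface the failure instead of exiting silently.
    print(f"Request failed with status {response.status_code}: {response.text}")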