Automatic Speech Recognition

1. What Is an Automatic Speech Recognition Task?

Automatic Speech Recognition (ASR) is a core task in large-model inference services: it automatically converts audio input into the corresponding text. With ASR, users can quickly transcribe spoken conversations, meeting recordings, and media content into structured text, which makes it useful for meeting minutes, real-time subtitles, and voice-driven interactions.

2. Typical Use Cases

  • Meeting Transcription: Automatically transcribe meeting recordings into text with speaker labels, making it easy to review and archive discussions.
  • Real-Time Subtitles: Provide low-latency subtitle generation for live video streams, online education, and similar scenarios.
  • Voice Customer Service: Transcribe user voice input into text for downstream NLP systems to perform intent recognition and generate responses.
  • Multilingual Content Processing: Support transcription of multiple languages and dialects, suitable for international content production.
  • Accessibility Support: Help hearing-impaired users access spoken content through text output.

3. Key Factors Affecting Inference Quality

Audio Input Quality

  • The sample rate, signal-to-noise ratio, and encoding format of the audio directly affect recognition accuracy.
  • For the best recognition results, use 16 kHz mono audio and minimize background noise.
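
As a rough illustration of the recommendation above, the following sketch checks whether a local WAV file is already 16 kHz mono before it is uploaded. It uses only Python's standard wave module, which reads uncompressed PCM WAV files; the function name check_wav_format is made up for this example and is not part of any service SDK.

```python
import wave

# Targets taken from the recommendation above: 16 kHz, single channel.
TARGET_RATE_HZ = 16_000
TARGET_CHANNELS = 1


def check_wav_format(path):
    """Return a list of format problems; the list is empty if the file matches."""
    problems = []
    # wave.open raises wave.Error for files that are not PCM WAV.
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != TARGET_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {TARGET_RATE_HZ} Hz")
    if channels != TARGET_CHANNELS:
        problems.append(f"{channels} channels, expected mono")
    return problems
```

Calling check_wav_format("audio.wav") returns an empty list when the file needs no conversion. A non-empty list suggests resampling or downmixing the audio with a tool of your choice before sending it.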

Parameter Configuration

The following parameters control the behavior and output format of speech recognition:

Language

  • Specifies the language of the audio to improve recognition accuracy.
  • Can be set to auto to enable automatic language detection, which is useful for multilingual or mixed-language audio.
  • Use case: Specify this parameter when the audio language is known to reduce misrecognition.

Response Format

  • Controls the format of the transcription output. Common options include:
    • text: Plain text — returns only the recognized content.
    • verbose_json: Detailed JSON format — includes text, timestamps, speaker labels, and other structured information.
  • Use case: Choose based on your needs; use verbose_json if timestamps or speaker information are required.
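
To show what working with verbose_json output might look like, here is a minimal sketch that turns a response into timestamped lines. The exact response schema is not specified in this document, so the field names used here (segments, start, end, text, speaker) are assumptions modeled on common transcription APIs; check them against a real response from your endpoint before relying on them.

```python
def format_timestamp(seconds):
    """Format a time offset in seconds as HH:MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_lines(result):
    """Turn a verbose_json-style dict into one timestamped line per segment."""
    lines = []
    # ASSUMPTION: segments live under "segments" with "start"/"end"/"text" keys.
    for seg in result.get("segments", []):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        # ASSUMPTION: the speaker label, when present, is under "speaker".
        speaker = seg.get("speaker")
        prefix = f"[{speaker}] " if speaker else ""
        lines.append(f"{start} --> {end}  {prefix}{seg['text'].strip()}")
    return lines


# Hand-written sample data, not real service output.
sample = {
    "text": "Hello everyone. Let's begin.",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Hello everyone.", "speaker": "A"},
        {"start": 1.5, "end": 3.25, "text": " Let's begin."},
    ],
}

for line in segments_to_lines(sample):
    print(line)
```

With the text format there is nothing to parse, since the response body is the transcript itself.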

Hotword

  • Supplies domain-specific vocabulary or high-frequency terms so that the model prioritizes these words, improving accuracy in specialized domains.
  • Use case: Suitable for fields with abundant technical terminology such as healthcare, legal, and finance.
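
The three parameters above can be combined into a single set of form fields. The sketch below is one possible way to assemble them. The helper build_form_data is hypothetical, and so are the form field name hotword and the space-separated encoding of the terms: this document does not state how hotwords are passed, so confirm the actual field name and format for your deployment.

```python
def build_form_data(model, language="auto", response_format="text", hotwords=None):
    """Assemble form fields for a transcription request."""
    data = {
        "model": model,
        "language": language,
        "response_format": response_format,
    }
    if hotwords:
        # ASSUMPTION: the field name "hotword" and the space-separated
        # encoding are placeholders, not a documented interface.
        data["hotword"] = " ".join(hotwords)
    return data


# Example: a medical dictation with known terminology.
form = build_form_data(
    "sensevoice",
    language="auto",
    response_format="verbose_json",
    hotwords=["metoprolol", "tachycardia"],
)
print(form)
```

Leaving hotwords unset omits the field entirely, so the request stays identical to one that never used the feature.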

4. Sample Code

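import requests

# Replace the placeholder host with your own inference endpoint.
url = "https://xxxxxxxxxxxx.space.opencsg.com/v1/audio/transcriptions"
headers = {}

# Keep the file open until the upload has completed.
with open("audio.wav", "rb") as audio_file:
    files = {
        "file": ("audio.wav", audio_file, "audio/wav"),
    }
    data = {
        "model": "sensevoice",
        "language": "auto",
        "response_format": "verbose_json",
    }
    response = requests.post(url=url, headers=headers, files=files, data=data)

if response.status_code == 200:
    result = response.json()
    print(result)
else:
    # Report the failure instead of exiting silently.
    print(f"Request failed: {response.status_code} {response.text}")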