Text-to-Speech

1. What Is a Text-to-Speech Task?

Text-to-Speech (TTS) is a core task in large-model inference services. It converts input text into natural-sounding speech. With TTS, users can quickly turn text into audio for scenarios such as audio broadcasting, conversational feedback, and multimedia content generation.

2. Typical Use Cases

Audio Broadcasting: Automatically convert articles, announcements, and notifications into speech.
Conversational Systems: Provide spoken responses for chatbots and intelligent assistants.
Digital Humans / Virtual Characters: Generate speech with specific voice styles for virtual avatars.
Accessibility Support: Help visually impaired users access textual information through audio.

3. Key Factors Affecting Inference Quality

Input Text

The input text is the source content for speech generation, so its semantics, punctuation, and structure directly affect pauses and rhythm in the output.
Use punctuation marks (such as commas and periods) properly to get more natural speech, and avoid overly long or poorly structured text.
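One way to follow this advice is to break long input into shorter, sentence-aligned pieces before sending each piece to the service. The sketch below is only an illustration: the helper name `split_into_chunks` and the `max_chars` limit of 200 are assumptions, not values defined by the service.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    Hypothetical helper: the 200-character default is an assumption,
    not a limit documented by the TTS service.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # The current chunk is full; start a new one with this sentence.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than `max_chars` is kept whole as its own chunk, since cutting mid-sentence would introduce an unnatural pause.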

Parameter Configuration

The following parameters control the stability, diversity, and voice characteristics of speech generation:

Temperature (Randomness Control, 0.0 – 1.0)

Controls the degree of randomness during speech generation.
Lower values (e.g., 0.1): More stable and consistent output, suitable for formal narration.
Higher values (e.g., 0.8): More expressive variations, suitable for natural or personalized speech.
Use case: Balancing stability and naturalness.
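To see why lower temperatures give more stable output, it helps to look at how temperature is commonly applied in sampling-based generation: the model's scores are divided by the temperature before being turned into probabilities. The sketch below shows that general technique with made-up scores. It does not describe the internals of this particular service, and the function name is hypothetical.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores to probabilities after scaling by temperature.

    Generic illustration of temperature scaling; the service's own
    implementation is not documented here and may differ.
    """
    if temperature <= 0:
        # A temperature of 0 is usually treated as "always pick the top
        # candidate"; this sketch does not handle that special case.
        raise ValueError("temperature must be greater than 0 in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate tokens.
example_logits = [2.0, 1.0, 0.5]
stable = softmax_with_temperature(example_logits, 0.1)   # mass concentrates on the top candidate
varied = softmax_with_temperature(example_logits, 0.8)   # mass is spread more evenly
```

At 0.1 almost all probability goes to the highest-scoring candidate, so repeated runs sound alike. At 0.8 the other candidates keep a meaningful share, which allows more variation.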

Top P (Vocabulary Sampling Range, 0.0 – 1.0)

Influences the diversity of speech expression by controlling the probability range of candidate tokens.
Lower values: More predictable output with a consistent speaking style.
Higher values: More flexible expression with richer speech variations.
Use case: Enhancing speech naturalness while maintaining controllability.
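Top P is commonly implemented as nucleus sampling: candidates are sorted by probability, and only the smallest set whose cumulative probability reaches the threshold is kept. The sketch below illustrates that general technique with made-up probabilities. It is an assumption that the service works this way, and the function name is hypothetical.

```python
def top_p_filter(probs: list[float], top_p: float) -> dict[int, float]:
    """Keep the most likely candidates until their cumulative probability
    reaches top_p, then renormalize the survivors.

    Generic nucleus-sampling illustration, not the service's actual code.
    Returns a mapping from candidate index to renormalized probability.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: list[int] = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


# Made-up probabilities for four candidate tokens.
example_probs = [0.5, 0.3, 0.15, 0.05]
narrow = top_p_filter(example_probs, 0.7)  # keeps only the two most likely candidates
wide = top_p_filter(example_probs, 1.0)    # keeps every candidate
```

A lower threshold discards the unlikely candidates entirely, which is why the speaking style becomes more predictable.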

Voice Type (Reference ID)

Used to select the reference voice or voice template for speech generation.
Different Reference IDs correspond to different voice styles (e.g., gender, tone, or character).
Use case: Choosing a voice that fits the scenario, so the speech sounds more approachable and personalized.
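The three parameters above can be checked against their documented ranges before a request is sent. The sketch below assumes the field names used in the sample request in this document (`format`, `temperature`, `top_p`, `reference_id`). The helper `build_tts_params` itself is hypothetical and not part of any SDK.

```python
def build_tts_params(
    reference_id: str,
    temperature: float = 0.8,
    top_p: float = 0.8,
    audio_format: str = "wav",
) -> dict:
    """Validate generation parameters and return them as a request payload.

    Hypothetical helper; the field names mirror the sample request, and the
    defaults are simply the values used there.
    """
    for name, value in (("temperature", temperature), ("top_p", top_p)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    if not reference_id:
        raise ValueError("reference_id must not be empty")
    return {
        "format": audio_format,
        "temperature": temperature,
        "top_p": top_p,
        "reference_id": reference_id,
    }


# Formal narration: low temperature for stable, consistent output.
narration_params = build_tts_params("musk", temperature=0.1)
```

Rejecting out-of-range values on the client side gives a clearer error than waiting for the service to refuse the request.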

4. Sample Code

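import json
import re

import requests

# Endpoint of the deployed TTS inference service (the host is a placeholder).
url = "https://xxxxxxxxxxxx.space.opencsg-stg.com/v1/tts"
headers = {"Content-Type": "application/json"}
# Generation parameters. The field that carries the input text is not shown
# in this sample; add it as required by your deployment's API.
payload = {"format": "wav", "temperature": 0.8, "top_p": 0.8, "reference_id": "musk"}

response = requests.post(url=url, json=payload, headers=headers, stream=True)
response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses

# Read the streamed response line by line.
for line in response.iter_lines():
    if not line:
        continue  # skip keep-alive blank lines
    try:
        # Strip the "data:" prefix before parsing the JSON body of the event.
        decoded_line = re.sub(r'^data:', '', line.decode('utf-8'))
        chunk = json.loads(decoded_line)
    except json.JSONDecodeError:
        continue  # skip lines that are not valid JSON
    if ("choices" in chunk) and chunk['choices'][0]['delta'] and ("content" in chunk['choices'][0]['delta']):
        print(chunk['choices'][0]['delta'])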
