Qwen3-TTS-Streaming ONNX Inference

Pure ONNX Runtime inference pipeline for Qwen3-TTS-12Hz-0.6B-Base, enabling real-time streaming text-to-speech with no PyTorch or Transformers dependency at runtime.

Updates

  • As of 2026/04/27, multiple rounds of text can be synthesized with continuous streaming, in addition to the streaming processing within each round.
  • As of 2026/05/04, the pipeline is also independent of the transformers library, using a standalone Qwen3TTSTextProcessor implementation that mimics the original.
  • As of 2026/05/06, this system has been integrated into our streaming-speech-translation pipeline. This release also revises the per-round codec reset threshold, which is now slightly longer than the talker's (125 vs. 50, respectively), where previously the codec followed the talker's threshold.

Overview

This repository provides:

  • qwen3_tts_inferencer_onnx.py — Core streaming TTS engine that orchestrates the ONNX models (talker LLM, local talker transformer and its LM head, codec decoder, speaker encoder, talker codec embedding, text embedding projection) using only NumPy and ONNX Runtime.
  • test_qwen3-tts-streaming_onnx.py — End-to-end test script that simulates LLM streaming text and produces a WAV file.
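
The engine builds one ONNX Runtime session per exported model file. A minimal sketch of that setup, using the model filenames from the directory structure below; load_sessions is an illustrative helper, not the repository's API:

import os
import onnxruntime as ort

def load_sessions(onnx_dir: str, num_threads: int = 4):
    # One InferenceSession per exported model (filenames as in qwen3-tts_onnx/).
    opts = ort.SessionOptions()
    opts.intra_op_num_threads = num_threads  # mirrors the --num_threads argument
    providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    names = [
        "talker_model_prefill", "talker_model_step",
        "talker_local_model_prefill", "talker_local_model_step",
        "talker_local_lm_head", "codec_decoder_model",
        "speaker_encoder_model", "talker_codec_embed_model",
        "text_embed_proj_model",
    ]
    return {
        name: ort.InferenceSession(os.path.join(onnx_dir, f"{name}.onnx"),
                                   sess_options=opts, providers=providers)
        for name in names
    }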

Architecture

Reference Audio ──► Speaker Encoder ──► Speaker Embedding Vector (voice clone context)
                                           │
                                           ▼
            Text Deltas ──► Talker LLM (Qwen3-0.6B) ──► [Hidden States, VQ Token]
                                                          │
                                                          ▼
                                                Local Transformer ──► 15-codebook RVQ Tokens
                                                                            │
                                                                            ▼
                                                VQ+RVQ Tokens ──► [4-Frame Chunks] ──► Codec Decoder ──► 24 kHz Waveform Chunks (320 ms)
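
The streaming loop implied by the diagram can be sketched as follows; the three callables are hypothetical stand-ins for the corresponding ONNX session calls, not the engine's actual methods:

import numpy as np

CHUNK_FRAMES = 4  # the codec decodes 4 frames (~320 ms of 24 kHz audio) at a time

def frame_loop(talker_step, local_decode, codec_decode):
    pending = []
    while True:
        # 1. Talker LLM: one autoregressive step over interleaved text+audio
        #    embeddings (growing KV-cache) -> hidden state, VQ token, EOS flag.
        hidden, vq_token, is_eos = talker_step()
        if is_eos:
            break
        # 2. Local transformer: expand the frame into 15 RVQ codebook tokens
        #    (a fresh KV-cache is created and discarded for each frame).
        rvq_tokens = local_decode(hidden, vq_token)
        pending.append(np.concatenate(([vq_token], rvq_tokens)))
        # 3. Codec decoder: every 4 frames, emit one 24 kHz waveform chunk.
        if len(pending) == CHUNK_FRAMES:
            yield codec_decode(np.stack(pending))
            pending.clear()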
| Component | ONNX Model | Description |
|---|---|---|
| Talker LLM | talker_model_*.onnx | Qwen3-based talker LM mapping interleaved text+audio token embeddings to hidden states and a VQ token. Maintains a growing KV-cache across the entire generation. |
| Local Talker | talker_local_model_*.onnx | Depth-wise decoder generating 15 RVQ codebook entries per frame from the talker hidden states and VQ token. Creates and discards a fresh KV-cache per frame. |
| LM Head of Local Talker | talker_local_lm_head.onnx | Projection head for each of the 15 codebook outputs of the local talker transformer. |
| Codec Decoder | codec_decoder_model.onnx | Decodes VQ+RVQ audio codes back to a 24 kHz waveform. Maintains KV-caches and convolutional caches for streaming decode. |
| Speaker Encoder | speaker_encoder_model.onnx | ECAPA-TDNN-based speaker encoder. Produces a 1024-dim speaker embedding vector for voice identity cloning. |
| Talker Codec Embed | talker_codec_embed_model.onnx | VQ embedding table for the talker model, with a 2048-token vocabulary. |
| Text Embed Projection | text_embed_proj_model.onnx | Text embedding and projection for the talker model; the text embedding has a 151,936-token vocabulary. |
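
To verify the I/O signature of any of these models, ONNX Runtime can list input and output names, shapes, and types directly (shown here for the codec decoder):

import onnxruntime as ort

sess = ort.InferenceSession("qwen3-tts_onnx/codec_decoder_model.onnx",
                            providers=["CPUExecutionProvider"])
for node in sess.get_inputs():
    print("input :", node.name, node.shape, node.type)
for node in sess.get_outputs():
    print("output:", node.name, node.shape, node.type)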

Requirements

librosa
numpy
onnxruntime-gpu
python-box
soundfile

Example installation with a conda environment:

conda create --name qwen3-tts-streaming-onnx-1 python=3.12
conda activate qwen3-tts-streaming-onnx-1
pip install -r requirements.txt
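
To confirm that the GPU build of ONNX Runtime was picked up, check the available execution providers:

import onnxruntime as ort

# "CUDAExecutionProvider" should be listed when onnxruntime-gpu is installed correctly.
print(ort.get_available_providers())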

Directory Structure

.
├── test_qwen3-tts-streaming_onnx.py        # End-to-end test script
├── README.md
├── requirements.txt
├── qwen3-tts_onnx/  # FP32
│   ├── talker_model_prefill.onnx
│   ├── talker_model_step.onnx
│   ├── talker_local_model_prefill.onnx
│   ├── talker_local_model_step.onnx
│   ├── talker_local_lm_head.onnx
│   ├── codec_decoder_model.onnx
│   ├── speaker_encoder_model.onnx
│   ├── talker_codec_embed_model.onnx
│   └── text_embed_proj_model.onnx
├── configs/
│   ├── config.json                         # Talker, Local Talker, Speaker Encoder config
│   ├── speech_tokenizer_config.json        # Codec config
│   ├── preprocessor_config.json            # Text Processor configs
│   ├── tokenizer_config.json
│   ├── vocab.json
│   └── merges.txt
├── src/
│   ├── inference/
│   │   └── qwen3_tts_inferencer_onnx.py    # Core ONNX inference engine 
│   └── utils/
│       └── audio_utils.py
├── logs/
│   └── <log_synth>.txt
├── audio_ref/
│   └── <reference_speaker>.[wav|mp3|flac]
└── audio_synth/
    └── <synthesized_example>.wav
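
The JSON configs above can be loaded with python-box (listed in the requirements) for attribute-style access. A minimal sketch; the exact keys depend on the original Qwen3-TTS config layout:

import json
from box import Box

with open("configs/config.json") as f:
    model_config = Box(json.load(f))          # talker / local talker / speaker encoder
with open("configs/speech_tokenizer_config.json") as f:
    codec_config = Box(json.load(f))          # codec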

Usage

Basic streaming TTS usage

python -u test_qwen3-tts-streaming_onnx.py >& logs/log_test-streaming-onnx-1.txt
# audio is automatically saved in audio_synth/ using the default parameters, text, and language.

Usage with parameters

python test_qwen3-tts-streaming_onnx.py \
    --onnx_dir qwen3-tts_onnx/ \
    --model_config_path configs/config.json \
    --codec_config_path configs/speech_tokenizer_config.json \
    --preprocessor_config_dir configs/ \
    --temperature 0.75 \
    --top_p 0.85 \
    --top_k 50 \
    --repetition_penalty 9.5 \
    --repetition_window 75 \
    --num_threads 4 \
    --prompt_wav audio_ref/speaker.[wav|flac|mp3] \
    --out_wav output.wav \
    --text "Text to be synthesized" "Yet another text here" "And another" \
    --language "english"

Available Languages

  • You can use a language name or its code:
"chinese", "zh", "english", "en", "german", "de", "italian", "it", "portuguese", "pt",
"spanish", "es", "japanese", "ja", "korean", "ko", "french", "fr", "russian", "ru"

Programmatic Usage

from src.inference import Qwen3TTSInferencerONNX

# Create inferencer
inferencer = Qwen3TTSInferencerONNX(
    talker_prefill, talker_step, talker_local_prefill, talker_local_step,
    talker_local_lm_head, codec_decoder, codec_decoder_dynamic,
    speaker_encoder, talker_codec_embed, text_embed_proj,
    preprocessor_config_dir, model_config, codec_config,
    audio_ref_path, language,
)
inferencer.reset_turn(reset_cache=True, force_reset_codec_cache=True)

# Stream text and collect audio
for delta in your_llm_stream():
    audio_frames = inferencer.push_text(delta)
    ...
    for audio_tokens in audio_frames:
        ...
        inferencer.push_tokens(audio_tokens)
        for wav in inferencer.audio_chunks():
            ...
            yield wav
# Signal end of text and collect the remaining audio
audio_frames = inferencer.end_text()
for audio_tokens in audio_frames:
    ...
    inferencer.push_tokens(audio_tokens)
    for wav in inferencer.audio_chunks():
        ...
        yield wav
# Drain remaining audio (the pad token is fed as text input)
audio_frames = inferencer.drain()
for audio_tokens in audio_frames:
    ...
    inferencer.push_tokens(audio_tokens)
    for wav in inferencer.audio_chunks():
        ...
        yield wav
# Flush any remaining audio tokens
for wav in inferencer.flush():
    ...
    yield wav
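
For multi-round synthesis (see the 2026/04/27 update), the same engine instance can be reused across rounds. A minimal sketch: consume() is a hypothetical helper wrapping the push/decode loop above, and the reset_turn() flag values between rounds are an assumption; check qwen3_tts_inferencer_onnx.py for the exact semantics.

def consume(inferencer, audio_frames):
    # Hypothetical helper: push generated audio tokens through the codec
    # decoder and yield waveform chunks (same loop as in the example above).
    for audio_tokens in audio_frames:
        inferencer.push_tokens(audio_tokens)
        yield from inferencer.audio_chunks()

def synthesize_round(inferencer, deltas):
    for delta in deltas:
        yield from consume(inferencer, inferencer.push_text(delta))
    yield from consume(inferencer, inferencer.end_text())
    yield from consume(inferencer, inferencer.drain())
    yield from inferencer.flush()

inferencer.reset_turn(reset_cache=True, force_reset_codec_cache=True)
for round_deltas in all_rounds:  # all_rounds: your per-round text-delta streams
    for wav in synthesize_round(inferencer, round_deltas):
        ...  # play or buffer each 24 kHz waveform chunk
    # Assumption: not forcing the codec reset keeps streaming continuous across
    # rounds (per-round reset thresholds: codec 125 vs. talker 50; see Updates).
    inferencer.reset_turn(reset_cache=True, force_reset_codec_cache=False)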

Command-Line Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| --onnx_dir | str | "qwen3-tts_onnx/" | Directory containing all ONNX models |
| --preprocessor_config_dir | str | "configs/" | Directory containing the configuration files for the Qwen3 text tokenizer |
| --model_config_path | str | "configs/config.json" | Path to the original model configuration file for Qwen3-TTS-12Hz-0.6B-Base |
| --codec_config_path | str | "configs/speech_tokenizer_config.json" | Path to the original configuration file for the codec of Qwen3-TTS-12Hz-0.6B-Base |
| --temperature | float | 0.75 | Sampling temperature |
| --top_p | float | 0.85 | Nucleus sampling threshold |
| --top_k | int | 50 | Top-k sampling cutoff |
| --repetition_penalty | float | 9.5 | Repetition penalty coefficient |
| --repetition_window | int | 75 | Window (in tokens) for the repetition penalty |
| --delta_chunk_chars | int | 1 | Characters per simulated LLM delta |
| --delta_delay_s | float | 0.0 | Delay between simulated deltas (seconds) |
| --num_threads | int | 4 | Value used for sess.intra_op_num_threads in the ONNX Runtime session options |
| --prompt_wav | str | "audio_ref/male_stewie.mp3" | Reference speaker audio for voice cloning |
| --out_wav | str | "out_streaming.wav" | Output WAV file path |
| --text | str | (Russian text) | Text(s) to synthesize |
| --language | str | "russian" | Language of the text to synthesize |
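
The four sampling arguments interact in the standard way: the repetition penalty is applied over the recent token window first, then temperature scaling, then top-k and top-p filtering. A minimal NumPy sketch of this standard scheme, for illustration only (not necessarily the engine's exact implementation):

import numpy as np

def sample_token(logits, recent_tokens, temperature=0.75, top_k=50,
                 top_p=0.85, repetition_penalty=9.5, repetition_window=75):
    logits = logits.astype(np.float64)
    # Repetition penalty over the recent window: divide positive logits,
    # multiply negative ones (the usual CTRL-style formulation).
    for t in set(recent_tokens[-repetition_window:]):
        logits[t] = logits[t] / repetition_penalty if logits[t] > 0 else logits[t] * repetition_penalty
    logits /= temperature
    # Top-k: drop everything below the k-th largest logit.
    if 0 < top_k < logits.size:
        kth = np.sort(logits)[-top_k]
        logits[logits < kth] = -np.inf
    # Top-p (nucleus): keep the smallest set of tokens whose cumulative
    # probability reaches top_p; always keep at least the best token.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    probs[order[cum > top_p][1:]] = 0.0
    probs /= probs.sum()
    return int(np.random.choice(probs.size, p=probs))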

Citation

If you use this system in your research, please cite:

@misc{vertoxai2026qwen3ttsstreamingonnxcudagraph,
  title={Qwen3-TTS-Streaming-ONNX — VertoX-AI},
  author={Tobing, P. L. and {VertoX-AI}},
  year={2026},
  publisher={HuggingFace},
}

License

This project is licensed under the Apache-2.0 License, the same license as the original Qwen3-TTS.

Created by: Patrick Lumbantobing, VertoX-AI
Copyright (c) 2026 VertoX-AI. All rights reserved.

This work is licensed under the Apache License, Version 2.0.
To view a copy of this license, visit [LICENSE](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md).
