Qwen3-TTS-Streaming ONNX Inference

Pure ONNX Runtime inference pipeline for Qwen3-TTS-12Hz-0.6B-Base, enabling real-time streaming text-to-speech with no PyTorch or Transformers dependency at runtime.

Updates

  • As of 2026/04/27, multiple rounds of text can be synthesized with continuous streaming, in addition to the streaming processing within each round.
  • As of 2026/05/04, the pipeline is also independent of the transformers library, using a standalone Qwen3TTSTextProcessor implementation that mimics the original.
  • As of 2026/05/06, this system has been integrated into our streaming-speech-translation pipeline. This release also revises the per-round codec reset threshold, which is now slightly longer than the talker's (125 vs. 50, respectively), where previously the codec followed the talker's threshold.

Overview

This repository provides:

  • qwen3_tts_inferencer_onnx.py — Core streaming TTS engine that orchestrates the ONNX models (talker LLM, local talker transformer and its LM head, codec decoder, speaker encoder, talker codec embedding, text embedding projection) using only NumPy and ONNX Runtime.
  • test_qwen3-tts-streaming_onnx.py — End-to-end test script that simulates LLM streaming text and produces a WAV file.
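
The engine builds one ONNX Runtime session per exported model file. A minimal sketch of that setup, using the model filenames from the directory structure below; load_sessions is an illustrative helper, not the repository's API:

import os
import onnxruntime as ort

def load_sessions(onnx_dir: str, num_threads: int = 4):
    # One InferenceSession per exported model (filenames as in qwen3-tts_onnx/).
    opts = ort.SessionOptions()
    opts.intra_op_num_threads = num_threads  # mirrors the --num_threads argument
    providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    names = [
        "talker_model_prefill", "talker_model_step",
        "talker_local_model_prefill", "talker_local_model_step",
        "talker_local_lm_head", "codec_decoder_model",
        "speaker_encoder_model", "talker_codec_embed_model",
        "text_embed_proj_model",
    ]
    return {
        name: ort.InferenceSession(os.path.join(onnx_dir, f"{name}.onnx"),
                                   sess_options=opts, providers=providers)
        for name in names
    }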

Architecture

Reference Audio ──► Speaker Encoder ──► Speaker Embedding Vector (voice clone context)
                                           │
                                           ▼
            Text Deltas ──► Talker LLM (Qwen3-0.6B) ──► [Hidden States, VQ Token]
                                                          │
                                                          ▼
                                                Local Transformer ──► 15-codebook RVQ Tokens
                                                                            │
                                                                            ▼
                                                VQ+RVQ Tokens ──► [4-Frame Chunks] ──► Codec Decoder ──► 24 kHz Waveform Chunks (320 ms)
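
The streaming loop implied by the diagram can be sketched as follows; the three callables are hypothetical stand-ins for the corresponding ONNX session calls, not the engine's actual methods:

import numpy as np

CHUNK_FRAMES = 4  # the codec decodes 4 frames (~320 ms of 24 kHz audio) at a time

def frame_loop(talker_step, local_decode, codec_decode):
    pending = []
    while True:
        # 1. Talker LLM: one autoregressive step over interleaved text+audio
        #    embeddings (growing KV-cache) -> hidden state, VQ token, EOS flag.
        hidden, vq_token, is_eos = talker_step()
        if is_eos:
            break
        # 2. Local transformer: expand the frame into 15 RVQ codebook tokens
        #    (a fresh KV-cache is created and discarded for each frame).
        rvq_tokens = local_decode(hidden, vq_token)
        pending.append(np.concatenate(([vq_token], rvq_tokens)))
        # 3. Codec decoder: every 4 frames, emit one 24 kHz waveform chunk.
        if len(pending) == CHUNK_FRAMES:
            yield codec_decode(np.stack(pending))
            pending.clear()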
| Component | ONNX Model | Description |
|---|---|---|
| Talker LLM | talker_model_*.onnx | Qwen3-based talker LM mapping interleaved text+audio token embeddings to hidden states and a VQ token. Maintains a growing KV-cache across the entire generation. |
| Local Talker | talker_local_model_*.onnx | Depth-wise decoder generating 15 RVQ codebook entries per frame from the talker hidden states and VQ token. Creates and discards a fresh KV-cache per frame. |
| LM Head of Local Talker | talker_local_lm_head.onnx | Projection head for each of the 15 codebook outputs of the local talker transformer. |
| Codec Decoder | codec_decoder_model.onnx | Decodes VQ+RVQ audio codes back to a 24 kHz waveform. Maintains KV-caches and convolutional caches for streaming decode. |
| Speaker Encoder | speaker_encoder_model.onnx | ECAPA-TDNN-based speaker encoder. Produces a 1024-dim speaker embedding vector for voice identity cloning. |
| Talker Codec Embed | talker_codec_embed_model.onnx | VQ embedding table for the talker model, with a 2048-token vocabulary. |
| Text Embed Projection | text_embed_proj_model.onnx | Text embedding and projection for the talker model; the text embedding has a 151,936-token vocabulary. |
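
To verify the I/O signature of any of these models, ONNX Runtime can list input and output names, shapes, and types directly (shown here for the codec decoder):

import onnxruntime as ort

sess = ort.InferenceSession("qwen3-tts_onnx/codec_decoder_model.onnx",
                            providers=["CPUExecutionProvider"])
for node in sess.get_inputs():
    print("input :", node.name, node.shape, node.type)
for node in sess.get_outputs():
    print("output:", node.name, node.shape, node.type)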

Requirements

librosa
numpy
onnxruntime-gpu
python-box
soundfile

Example installation with a conda environment:

conda create --name qwen3-tts-streaming-onnx-1 python=3.12
conda activate qwen3-tts-streaming-onnx-1
pip install -r requirements.txt
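
To confirm that the GPU build of ONNX Runtime was picked up, check the available execution providers:

import onnxruntime as ort

# "CUDAExecutionProvider" should be listed when onnxruntime-gpu is installed correctly.
print(ort.get_available_providers())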

Directory Structure

.
├── test_qwen3-tts-streaming_onnx.py        # End-to-end test script
├── README.md
├── requirements.txt
├── qwen3-tts_onnx/  # FP32
│   ├── talker_model_prefill.onnx
│   ├── talker_model_step.onnx
│   ├── talker_local_model_prefill.onnx
│   ├── talker_local_model_step.onnx
│   ├── talker_local_lm_head.onnx
│   ├── codec_decoder_model.onnx
│   ├── speaker_encoder_model.onnx
│   ├── talker_codec_embed_model.onnx
│   └── text_embed_proj_model.onnx
├── configs/
│   ├── config.json                         # Talker, Local Talker, Speaker Encoder config
│   ├── speech_tokenizer_config.json        # Codec config
│   ├── preprocessor_config.json            # Text Processor configs
│   ├── tokenizer_config.json
│   ├── vocab.json
│   └── merges.txt
├── src/
│   ├── inference/
│   │   └── qwen3_tts_inferencer_onnx.py    # Core ONNX inference engine 
│   └── utils/
│       └── audio_utils.py
├── logs/
│   └── <log_synth>.txt
├── audio_ref/
│   └── <reference_speaker>.[wav|mp3|flac]
└── audio_synth/
    └── <synthesized_example>.wav
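
The JSON configs above can be loaded with python-box (listed in the requirements) for attribute-style access. A minimal sketch; the exact keys depend on the original Qwen3-TTS config layout:

import json
from box import Box

with open("configs/config.json") as f:
    model_config = Box(json.load(f))          # talker / local talker / speaker encoder
with open("configs/speech_tokenizer_config.json") as f:
    codec_config = Box(json.load(f))          # codec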

Usage

Basic streaming TTS usage

python -u test_qwen3-tts-streaming_onnx.py >& logs/log_test-streaming-onnx-1.txt
# audio is automatically saved in audio_synth/ using the default parameters, text, and language.

Usage with parameters

python test_qwen3-tts-streaming_onnx.py \
    --onnx_dir qwen3-tts_onnx/ \
    --model_config_path configs/config.json \
    --codec_config_path configs/speech_tokenizer_config.json \
    --preprocessor_config_dir configs/ \
    --temperature 0.75 \
    --top_p 0.85 \
    --top_k 50 \
    --repetition_penalty 9.5 \
    --repetition_window 75 \
    --num_threads 4 \
    --prompt_wav audio_ref/speaker.[wav|flac|mp3] \
    --out_wav output.wav \
    --text "Text to be synthesized" "Yet another text here" "And another" \
    --language "english"

Available Languages

  • You can use a language name or its code:
"chinese", "zh", "english", "en", "german", "de", "italian", "it", "portuguese", "pt",
"spanish", "es", "japanese", "ja", "korean", "ko", "french", "fr", "russian", "ru"

Programmatic Usage

from src.inference import Qwen3TTSInferencerONNX

# Create inferencer
inferencer = Qwen3TTSInferencerONNX(
    talker_prefill, talker_step, talker_local_prefill, talker_local_step,
    talker_local_lm_head, codec_decoder, codec_decoder_dynamic,
    speaker_encoder, talker_codec_embed, text_embed_proj,
    preprocessor_config_dir, model_config, codec_config,
    audio_ref_path, language,
)
inferencer.reset_turn(reset_cache=True, force_reset_codec_cache=True)

# Stream text and collect audio
for delta in your_llm_stream():
    audio_frames = inferencer.push_text(delta)
    ...
    for audio_tokens in audio_frames:
        ...
        inferencer.push_tokens(audio_tokens)
        for wav in inferencer.audio_chunks():
            ...
            yield wav
# Signal end of text and collect the remaining audio
audio_frames = inferencer.end_text()
for audio_tokens in audio_frames:
    ...
    inferencer.push_tokens(audio_tokens)
    for wav in inferencer.audio_chunks():
        ...
        yield wav
# Drain remaining audio (the pad token is fed as text input)
audio_frames = inferencer.drain()
for audio_tokens in audio_frames:
    ...
    inferencer.push_tokens(audio_tokens)
    for wav in inferencer.audio_chunks():
        ...
        yield wav
# Flush any remaining audio tokens
for wav in inferencer.flush():
    ...
    yield wav
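
For multi-round synthesis (see the 2026/04/27 update), the same engine instance can be reused across rounds. A minimal sketch: consume() is a hypothetical helper wrapping the push/decode loop above, and the reset_turn() flag values between rounds are an assumption; check qwen3_tts_inferencer_onnx.py for the exact semantics.

def consume(inferencer, audio_frames):
    # Hypothetical helper: push generated audio tokens through the codec
    # decoder and yield waveform chunks (same loop as in the example above).
    for audio_tokens in audio_frames:
        inferencer.push_tokens(audio_tokens)
        yield from inferencer.audio_chunks()

def synthesize_round(inferencer, deltas):
    for delta in deltas:
        yield from consume(inferencer, inferencer.push_text(delta))
    yield from consume(inferencer, inferencer.end_text())
    yield from consume(inferencer, inferencer.drain())
    yield from inferencer.flush()

inferencer.reset_turn(reset_cache=True, force_reset_codec_cache=True)
for round_deltas in all_rounds:  # all_rounds: your per-round text-delta streams
    for wav in synthesize_round(inferencer, round_deltas):
        ...  # play or buffer each 24 kHz waveform chunk
    # Assumption: not forcing the codec reset keeps streaming continuous across
    # rounds (per-round reset thresholds: codec 125 vs. talker 50; see Updates).
    inferencer.reset_turn(reset_cache=True, force_reset_codec_cache=False)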

Command-Line Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| --onnx_dir | str | "qwen3-tts_onnx/" | Directory containing all ONNX models |
| --preprocessor_config_dir | str | "configs/" | Directory containing the configuration files for the Qwen3 text tokenizer |
| --model_config_path | str | "configs/config.json" | Path to the original model configuration file for Qwen3-TTS-12Hz-0.6B-Base |
| --codec_config_path | str | "configs/speech_tokenizer_config.json" | Path to the original configuration file for the codec of Qwen3-TTS-12Hz-0.6B-Base |
| --temperature | float | 0.75 | Sampling temperature |
| --top_p | float | 0.85 | Nucleus sampling threshold |
| --top_k | int | 50 | Top-k sampling cutoff |
| --repetition_penalty | float | 9.5 | Repetition penalty coefficient |
| --repetition_window | int | 75 | Window (in tokens) for the repetition penalty |
| --delta_chunk_chars | int | 1 | Characters per simulated LLM delta |
| --delta_delay_s | float | 0.0 | Delay between simulated deltas (seconds) |
| --num_threads | int | 4 | Value used for sess.intra_op_num_threads in the ONNX Runtime session options |
| --prompt_wav | str | "audio_ref/male_stewie.mp3" | Reference speaker audio for voice cloning |
| --out_wav | str | "out_streaming.wav" | Output WAV file path |
| --text | str | (Russian text) | Text(s) to synthesize |
| --language | str | "russian" | Language of the text to synthesize |
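
The four sampling arguments interact in the standard way: the repetition penalty is applied over the recent token window first, then temperature scaling, then top-k and top-p filtering. A minimal NumPy sketch of this standard scheme, for illustration only (not necessarily the engine's exact implementation):

import numpy as np

def sample_token(logits, recent_tokens, temperature=0.75, top_k=50,
                 top_p=0.85, repetition_penalty=9.5, repetition_window=75):
    logits = logits.astype(np.float64)
    # Repetition penalty over the recent window: divide positive logits,
    # multiply negative ones (the usual CTRL-style formulation).
    for t in set(recent_tokens[-repetition_window:]):
        logits[t] = logits[t] / repetition_penalty if logits[t] > 0 else logits[t] * repetition_penalty
    logits /= temperature
    # Top-k: drop everything below the k-th largest logit.
    if 0 < top_k < logits.size:
        kth = np.sort(logits)[-top_k]
        logits[logits < kth] = -np.inf
    # Top-p (nucleus): keep the smallest set of tokens whose cumulative
    # probability reaches top_p; always keep at least the best token.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    probs[order[cum > top_p][1:]] = 0.0
    probs /= probs.sum()
    return int(np.random.choice(probs.size, p=probs))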

Citation

If you use this system in your research, please cite:

@misc{vertoxai2026qwen3ttsstreamingonnxcudagraph,
  title={Qwen3-TTS-Streaming-ONNX — VertoX-AI},
  author={Tobing, P. L. and {VertoX-AI}},
  year={2026},
  publisher={HuggingFace},
}

License

This project is licensed under the Apache-2.0 License, the same license as the original Qwen3-TTS.

Created by: Patrick Lumbantobing, VertoX-AI
Copyright (c) 2026 VertoX-AI. All rights reserved.

This work is licensed under the Apache License, Version 2.0.
To view a copy of this license, visit [LICENSE](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md).
