👉🏻 CosyVoice 👈🏻
Fun-CosyVoice 3.0: Demos; Paper; Modelscope; CV3-Eval
CosyVoice 2.0: Demos; Paper; Modelscope; HuggingFace
CosyVoice 1.0: Demos; Paper; Modelscope
Highlight🔥
Fun-CosyVoice 3.0 is an advanced text-to-speech (TTS) system based on large language models (LLMs). It surpasses its predecessor, CosyVoice 2.0, in content consistency, speaker similarity, and prosody naturalness, and is designed for zero-shot multilingual speech synthesis in the wild.
Key Features
- Language Coverage: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) and 18+ Chinese dialects/accents, and supports both multilingual and cross-lingual zero-shot voice cloning.
- Content Consistency & Naturalness: Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.
- Pronunciation Inpainting: Supports pronunciation inpainting with Chinese Pinyin and English CMU phonemes, which gives finer control over the output and makes the model suitable for production use.
- Text Normalization: Supports reading of numbers, special symbols and various text formats without a traditional frontend module.
- Bi-Streaming: Supports both text-in streaming and audio-out streaming, achieving latency as low as 150 ms while maintaining high-quality audio output.
- Instruct Support: Supports instructions that control language, dialect, emotion, speed, volume, etc.
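
To make the bi-streaming idea concrete, here is a minimal sketch of a pipeline that accepts text piece by piece and emits audio as soon as enough text is buffered. It is an illustration only, not how CosyVoice implements streaming: `stream_tts`, `fake_synthesize`, and the `min_chars` threshold are hypothetical names chosen for this example, and the placeholder synthesizer returns dummy bytes instead of calling a model.

```python
from typing import Callable, Iterable, Iterator


def stream_tts(text_chunks: Iterable[str],
               synthesize: Callable[[str], bytes],
               min_chars: int = 8) -> Iterator[bytes]:
    """Yield audio as soon as enough text has arrived, without waiting for the full input."""
    buffer = ""
    for chunk in text_chunks:          # text-in streaming: text arrives piece by piece
        buffer += chunk
        if len(buffer) >= min_chars:
            yield synthesize(buffer)   # audio-out streaming: emit audio right away
            buffer = ""
    if buffer:                         # flush whatever text is left at the end
        yield synthesize(buffer)


def fake_synthesize(text: str) -> bytes:
    # Hypothetical stand-in for a real model call: one zero byte per input character.
    return bytes(len(text))


for audio in stream_tts(iter(["Hello, ", "this is ", "a streaming ", "test."]), fake_synthesize):
    print(len(audio), "bytes of audio ready")
```

Because both sides are generators, the first audio chunk is available before the last text chunk has been produced, which is the property that keeps first-packet latency low.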
Roadmap
2025/12
- release Fun-CosyVoice3-0.5B-2512 base model and its training/inference script
- release Fun-CosyVoice3-0.5B modelscope gradio space
2025/08
- add Triton TensorRT-LLM runtime support and CosyVoice2 GRPO training support (thanks to the contribution from Yuekai Zhang at NVIDIA)
2025/07
- release CosyVoice 3.0 eval set
2025/05
- add CosyVoice2-0.5B vLLM support
2024/12
- 25hz CosyVoice2-0.5B released
2024/09
- 25hz CosyVoice-300M base model
- 25hz CosyVoice-300M voice conversion function
2024/08
- Repetition Aware Sampling (RAS) inference for LLM stability
- Streaming inference mode support, including KV cache and SDPA for RTF optimization
2024/07
- Flow matching training support
- WeTextProcessing support when ttsfrd is not available
- FastAPI server and client
Evaluation
| Model | Open-Source | Model Size | test-zh CER (%) ↓ | test-zh Speaker Similarity (%) ↑ | test-en WER (%) ↓ | test-en Speaker Similarity (%) ↑ | test-hard CER (%) ↓ | test-hard Speaker Similarity (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
| Spark TTS | ✅ | 0.5B | 1.2 | 66.0 | 1.98 | 57.3 | - | - |
| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
| FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 | - | - |
| Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 | 7.12 | 75.5 |
| VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 | - | - |
| VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 | - | - |
| HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 | - | - |
| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
| GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - | - | - |
| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |
| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
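
For readers unfamiliar with the CER and WER columns, both are conventionally defined as the edit distance between a transcript of the synthesized audio and the reference text, divided by the reference length. The sketch below shows that definition in plain Python. It is a generic illustration: the ASR model, text normalization, and scoring scripts actually used to produce the table are not specified here, and the space-stripping and lowercasing choices are assumptions made for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance divided by reference length, as a percentage."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(reference, hypothesis):
    # Character level: compare character by character, ignoring spaces (assumed convention).
    return error_rate(list(reference.replace(" ", "")), list(hypothesis.replace(" ", "")))


def wer(reference, hypothesis):
    # Word level: compare whitespace-separated, lowercased words (assumed convention).
    return error_rate(reference.lower().split(), hypothesis.lower().split())


print(round(wer("the cat sat on the mat", "the cat sat on mat"), 2))
print(round(cer("今天天气很好", "今天天汽很好"), 2))
```

Lower values are better for both metrics, while speaker similarity is a separate score for which higher is better.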
Install
Clone and install
Clone the repo
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
# If you failed to clone the submodule due to network failures, please run the following command until success
cd CosyVoice
git submodule update --init --recursive
Install Conda: please see https://docs.conda.io/en/latest/miniconda.html
Create Conda env:
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
# If you encounter sox compatibility issues
# ubuntu
sudo apt-get install sox libsox-dev
# centos
sudo yum install sox sox-devel
Model download
We strongly recommend that you download our pretrained Fun-CosyVoice3-0.5B model and CosyVoice-ttsfrd resource.
from huggingface_hub import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
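
If you run the download step repeatedly (for example, from a setup script), it can help to skip repositories that are already on disk. The following is a minimal sketch of such a wrapper. `ensure_model` is a hypothetical helper and not part of CosyVoice; it only relies on `huggingface_hub.snapshot_download` with the same `local_dir` argument used above.

```python
from pathlib import Path


def ensure_model(repo_id, local_dir, download=None):
    """Download repo_id into local_dir unless that directory already contains files.

    Hypothetical helper for illustration. `download` defaults to
    huggingface_hub.snapshot_download and can be swapped for any callable
    that accepts (repo_id, local_dir=...).
    """
    target = Path(local_dir)
    if target.is_dir() and any(target.iterdir()):
        return target  # already populated; skip the network call
    if download is None:
        # Imported lazily so the helper can be used without the package in tests.
        from huggingface_hub import snapshot_download as download
    download(repo_id, local_dir=str(target))
    return target
```

It would be called as `ensure_model('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', 'pretrained_models/Fun-CosyVoice3-0.5B')`. Note that the check only tests whether the directory is non-empty, so an interrupted download would be treated as complete; delete the directory to force a fresh download.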
Optionally, you can unzip the ttsfrd resource and install the ttsfrd package for better text normalization performance.
This step is not required: if the ttsfrd package is not installed, wetext is used by default.
cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d .
pip install ttsfrd_dependency-0.1-py3-none-any.whl
pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
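
To check which text normalization package is available in the current environment, a quick probe like the one below can be used. This is a sketch that mirrors the fallback order described above and is not CosyVoice's own selection logic; it assumes the import names match the package names `ttsfrd` and `wetext`, and `pick_text_frontend` is a hypothetical helper.

```python
import importlib.util


def pick_text_frontend(candidates=("ttsfrd", "wetext")):
    """Return the first importable module name from candidates, or None if none is installed."""
    for name in candidates:
        # find_spec returns None when a top-level module cannot be found.
        if importlib.util.find_spec(name) is not None:
            return name
    return None


print("text normalization frontend:", pick_text_frontend())
```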
Basic Usage
We strongly recommend using Fun-CosyVoice3-0.5B for better performance.
Follow the code in example.py for detailed usage of each model.
python example.py
Disclaimer
The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.