Qwen3-ASR-1.7B → RK3588 Model Conversion

Convert Qwen/Qwen3-ASR-1.7B to RKNN + RKLLM for RK3588.

  • Audio encoder (thinker.audio_tower) → ONNX → RKNN
  • Text LLM (thinker.model + thinker.lm_head) → standard Qwen3ForCausalLM → RKLLM
  • FP16 throughout, no quantization

Layout

convert/
├── audio_encoder/
│   ├── common.py                       shared helpers
│   ├── export_audio_encoder_onnx.py    PyTorch → ONNX
│   ├── export_audio_encoder_rknn.py    ONNX → RKNN
│   └── onnx_run_audio_encoder.py       ONNX parity check (optional)
└── llm/
    ├── extract_qwen3_text_model.py     extract standard Qwen3 text weights
    └── export_rkllm_direct.py          HF → RKLLM

Setup

Place the original model in your working directory and clone the official QwenLM/Qwen3-ASR repo alongside it, in the same directory (audio_encoder/common.py imports modeling_qwen3_asr from that checkout):

huggingface-cli download Qwen/Qwen3-ASR-1.7B --local-dir .
git clone https://github.com/QwenLM/Qwen3-ASR.git

Host dependencies: torch transformers safetensors numpy scipy soundfile onnx onnxruntime, plus rknn-toolkit2 and rkllm-toolkit.
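
Before starting a long conversion it can help to confirm that the host environment has everything listed above. The following is a minimal sketch, not part of this repo. The import names are assumptions: pip package names and module names can differ, and `rknn` / `rkllm` are the module names the two Rockchip toolkits are normally imported under.

```python
import importlib.util

# Import names are assumptions; a pip package name and its module name can differ
# (rknn-toolkit2 is normally imported as `rknn`, rkllm-toolkit as `rkllm`).
HOST_MODULES = [
    "torch", "transformers", "safetensors", "numpy", "scipy",
    "soundfile", "onnx", "onnxruntime", "rknn", "rkllm",
]


def missing_modules(names):
    """Return the module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(HOST_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy packages such as torch.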

1. LLM → RKLLM

First extract clean Qwen3 text weights (if the original model is fed in directly, leftover mrope/vision_config fields cause RKLLM to misidentify it as a vision model):

python convert/llm/extract_qwen3_text_model.py \
  --model-path . \
  --output-dir ./qwen3_text_hf
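
To illustrate what "extracting the text weights" amounts to, here is a rough sketch of the key renaming involved. It is not the code in extract_qwen3_text_model.py. It assumes only what is stated above: the text LLM lives under `thinker.model` and `thinker.lm_head`, and a standard Qwen3ForCausalLM checkpoint expects `model.*` and `lm_head.*`. The function name is hypothetical.

```python
def remap_text_keys(state_dict):
    """Keep only text-LLM tensors and rename them to Qwen3ForCausalLM-style keys.

    Hypothetical helper: drops everything outside thinker.model / thinker.lm_head
    (for example the audio tower) and strips the leading "thinker." prefix.
    """
    keep_prefixes = ("thinker.model.", "thinker.lm_head.")
    remapped = {}
    for name, tensor in state_dict.items():
        if name.startswith(keep_prefixes):
            remapped[name[len("thinker."):]] = tensor
    return remapped
```

A real extraction also has to write a matching config without the mrope/vision_config fields mentioned above; that part is omitted here because the exact field names are not given in this README.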

Then convert to RKLLM:

python convert/llm/export_rkllm_direct.py \
  --model-path ./qwen3_text_hf \
  --target-platform rk3588 --num-npu-core 3 \
  --dtype float16 --max-context 4096 \
  --savepath ./rknn/language_model.rkllm

2. Audio encoder → RKNN

The audio tower is wrapped as a static "100 mel frames / chunk" model (this matches the model's native processing granularity). At runtime, long audio is split into chunks, each chunk is run separately, and the outputs are concatenated.
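
The chunking described above can be sketched as a small planning function that says which mel frames go into each fixed-size chunk. This is an illustration under assumptions, not the runtime code: the function name is hypothetical, and padding the last partial chunk up to 100 frames is an assumed way to satisfy the static input shape.

```python
def plan_chunks(n_frames, chunk_frames=100):
    """Return (start, end, pad) for each fixed-size chunk covering n_frames.

    start/end index the mel frames of the chunk; pad is the number of frames
    that would have to be appended to reach chunk_frames (assumed strategy).
    """
    if n_frames < 0 or chunk_frames <= 0:
        raise ValueError("n_frames must be >= 0 and chunk_frames must be > 0")
    spans = []
    for start in range(0, n_frames, chunk_frames):
        end = min(start + chunk_frames, n_frames)
        spans.append((start, end, chunk_frames - (end - start)))
    return spans


# 250 mel frames: two full chunks plus one half-filled chunk.
print(plan_chunks(250))  # [(0, 100, 0), (100, 200, 0), (200, 250, 50)]
```

Each span would then be sliced out of the mel spectrogram, padded if needed, run through the encoder, and the per-chunk outputs concatenated in order.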

# PyTorch → ONNX
python convert/audio_encoder/export_audio_encoder_onnx.py \
  --model-path . --savepath ./onnx/qwen3_asr_audio_chunk100.onnx

# (optional) parity check, expect max_abs_diff ≈ 1e-7
python convert/audio_encoder/onnx_run_audio_encoder.py \
  --model-path . --onnx-path ./onnx/qwen3_asr_audio_chunk100.onnx \
  --audio-path asr_example_zh.wav --compare-torch
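
The `max_abs_diff` figure quoted for the parity check is the largest element-wise absolute difference between the PyTorch output and the ONNX output. A minimal stdlib sketch of that metric over flattened outputs follows; the function name mirrors the quoted metric, but this is not the code in onnx_run_audio_encoder.py.

```python
def max_abs_diff(reference, candidate):
    """Largest absolute element-wise difference between two flat sequences."""
    ref = list(reference)
    cand = list(candidate)
    if len(ref) != len(cand):
        raise ValueError("outputs must have the same number of elements")
    return max((abs(r - c) for r, c in zip(ref, cand)), default=0.0)
```

A value around 1e-7 is on the order of float32 rounding error, so it indicates that the ONNX export reproduces the PyTorch computation rather than merely approximating it.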

# ONNX → RKNN
python convert/audio_encoder/export_audio_encoder_rknn.py \
  --onnx-path ./onnx/qwen3_asr_audio_chunk100.onnx \
  --target-platform rk3588 --savepath ./rknn/audio_encoder.rknn
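
For orientation, an ONNX → RKNN conversion without quantization typically reduces to a few rknn-toolkit2 calls. The sketch below is an assumption about what such a script does, not a copy of export_audio_encoder_rknn.py. The function name is hypothetical, and any extra config options the real script sets are not shown.

```python
from pathlib import Path


def onnx_to_rknn(onnx_path, rknn_path, target_platform="rk3588"):
    """Convert a static-shape ONNX model to RKNN without quantization (sketch)."""
    onnx_path = Path(onnx_path)
    if not onnx_path.is_file():
        raise FileNotFoundError(f"ONNX model not found: {onnx_path}")

    # Imported lazily so the path check above works without the toolkit installed.
    from rknn.api import RKNN

    rknn = RKNN()
    try:
        rknn.config(target_platform=target_platform)
        if rknn.load_onnx(model=str(onnx_path)) != 0:
            raise RuntimeError("load_onnx failed")
        # do_quantization=False keeps the model in floating point instead of INT8.
        if rknn.build(do_quantization=False) != 0:
            raise RuntimeError("build failed")
        Path(rknn_path).parent.mkdir(parents=True, exist_ok=True)
        if rknn.export_rknn(str(rknn_path)) != 0:
            raise RuntimeError("export_rknn failed")
    finally:
        rknn.release()
```

Called as `onnx_to_rknn("./onnx/qwen3_asr_audio_chunk100.onnx", "./rknn/audio_encoder.rknn")`, this would mirror the command above. The toolkit calls return non-zero codes on failure, which is why each step is checked explicitly.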

3. Artifacts

rknn/
├── audio_encoder.rknn
└── language_model.rkllm

These plug directly into run_qwen3_asr_e2e.py at the repo root.
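
Before copying files to the board, a quick check that both artifacts were actually produced can save a round trip. This is a small sketch with a hypothetical helper name; the file names are the ones listed in the tree above.

```python
from pathlib import Path

# File names taken from the artifact tree above.
EXPECTED_ARTIFACTS = ("audio_encoder.rknn", "language_model.rkllm")


def missing_artifacts(rknn_dir="rknn"):
    """Return the expected artifact file names that are absent from rknn_dir."""
    root = Path(rknn_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]
```

An empty list means both conversion steps wrote their output to the expected location.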