Qwen3-ASR-1.7B → RK3588 Model Conversion

Convert Qwen/Qwen3-ASR-1.7B to RKNN + RKLLM for RK3588.

Audio encoder (thinker.audio_tower) → ONNX → RKNN
Text LLM (thinker.model + thinker.lm_head) → standard Qwen3ForCausalLM → RKLLM
FP16 throughout, no quantization

Layout

convert/
├── audio_encoder/
│   ├── common.py                       shared helpers
│   ├── export_audio_encoder_onnx.py    PyTorch → ONNX
│   ├── export_audio_encoder_rknn.py    ONNX → RKNN
│   └── onnx_run_audio_encoder.py       ONNX parity check (optional)
└── llm/
    ├── extract_qwen3_text_model.py     extract standard Qwen3 text weights
    └── export_rkllm_direct.py          HF → RKLLM

Setup

Download the original model into your working directory and clone the official QwenLM/Qwen3-ASR repo alongside it (audio_encoder/common.py imports modeling_qwen3_asr from that checkout):

huggingface-cli download Qwen/Qwen3-ASR-1.7B --local-dir .
git clone https://github.com/QwenLM/Qwen3-ASR.git

Host dependencies: torch, transformers, safetensors, numpy, scipy, soundfile, onnx, onnxruntime, plus rknn-toolkit2 and rkllm-toolkit.
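
Before starting, it can help to confirm that everything listed above is importable. The following is a minimal sketch, not part of the repo; the module names are assumptions, since pip package names and import names can differ (rknn-toolkit2 is normally imported as `rknn`, and rkllm-toolkit as `rkllm`).

```python
import importlib.util

# Import names are assumptions: pip package names and module names can differ
# (for example rknn-toolkit2 -> `rknn`, rkllm-toolkit -> `rkllm`).
REQUIRED_MODULES = [
    "torch", "transformers", "safetensors", "numpy", "scipy",
    "soundfile", "onnx", "onnxruntime", "rknn", "rkllm",
]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result only shows that the modules can be located; it does not check version compatibility between the two Rockchip toolkits and the rest of the stack.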

1. LLM → RKLLM

First extract clean Qwen3 text weights. Feeding the original checkpoint to RKLLM directly makes it misdetect the model as a vision model, because of leftover mrope/vision_config fields:

python convert/llm/extract_qwen3_text_model.py \
  --model-path . \
  --output-dir ./qwen3_text_hf
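
Conceptually, the extraction keeps only the text-LLM tensors and renames them to the layout a plain Qwen3ForCausalLM expects. The sketch below illustrates that idea on a state-dict-like mapping. The exact prefix mapping is an assumption inferred from the component names above (thinker.model, thinker.lm_head), and `remap_text_keys` is a hypothetical helper, not the script's actual code.

```python
def remap_text_keys(state_dict):
    """Keep only text-LLM tensors and rename them to a plain causal-LM layout.

    Assumed mapping (not confirmed against extract_qwen3_text_model.py):
      thinker.model.*   -> model.*
      thinker.lm_head.* -> lm_head.*
    Everything else (for example thinker.audio_tower.*) is dropped.
    """
    prefixes = {"thinker.model.": "model.", "thinker.lm_head.": "lm_head."}
    remapped = {}
    for key, tensor in state_dict.items():
        for old, new in prefixes.items():
            if key.startswith(old):
                remapped[new + key[len(old):]] = tensor
                break
    return remapped
```

The real script also has to write a config without the mrope/vision_config leftovers; that part is not shown here because the exact field names depend on the checkpoint.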

Then convert to RKLLM:

python convert/llm/export_rkllm_direct.py \
  --model-path ./qwen3_text_hf \
  --target-platform rk3588 --num-npu-core 3 \
  --dtype float16 --max-context 4096 \
  --savepath ./rknn/language_model.rkllm
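
For orientation, a conversion like this typically comes down to three rkllm-toolkit calls: load, build, export. The sketch below is an assumption about what such a script does, not a copy of export_rkllm_direct.py. Keyword names follow the examples shipped with rkllm-toolkit and may differ between toolkit versions (`max_context` in particular is only accepted by newer releases); `rkllm_build_options` and `convert_hf_to_rkllm` are hypothetical helper names.

```python
def rkllm_build_options(target_platform="rk3588", num_npu_core=3, max_context=4096):
    """Build options mirroring the CLI flags above (FP16, no quantization)."""
    return {
        "do_quantization": False,  # keep FP16 weights
        "optimization_level": 1,
        "target_platform": target_platform,
        "num_npu_core": num_npu_core,
        "max_context": max_context,  # assumption: supported by the installed toolkit
    }


def convert_hf_to_rkllm(model_path, save_path, **overrides):
    """Load a Hugging Face checkpoint, build it, and export a .rkllm file."""
    from rkllm.api import RKLLM  # provided by rkllm-toolkit

    llm = RKLLM()
    if llm.load_huggingface(model=model_path) != 0:
        raise RuntimeError("load_huggingface failed")
    if llm.build(**rkllm_build_options(**overrides)) != 0:
        raise RuntimeError("build failed")
    if llm.export_rkllm(save_path) != 0:
        raise RuntimeError("export_rkllm failed")
```

The toolkit import is kept inside the function so the option helper can be inspected on a machine without rkllm-toolkit installed.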

2. Audio encoder → RKNN

The audio tower is wrapped as a static model that takes exactly 100 mel frames per chunk (this matches the model's native processing granularity). At runtime, long audio is split into chunks, each chunk is run through the encoder, and the outputs are concatenated.
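
The chunked runtime flow described above can be sketched as follows. This is an illustration under stated assumptions, not the repo's runtime code: the mel spectrogram is assumed to have shape (n_mels, T), the last chunk is assumed to be zero-padded, and `encode_chunk` is a stand-in for the actual RKNN or ONNX call. Trimming the output frames that correspond to padding is not handled here.

```python
import numpy as np

CHUNK_FRAMES = 100  # mel frames per chunk, as stated above


def split_mel(mel, chunk_frames=CHUNK_FRAMES):
    """Split a (n_mels, T) mel spectrogram into zero-padded fixed-size chunks.

    Returns (chunks, lengths): chunks has shape (n_chunks, n_mels, chunk_frames)
    and lengths[i] is the number of real (unpadded) frames in chunk i.
    """
    n_mels, total = mel.shape
    n_chunks = max(1, -(-total // chunk_frames))  # ceiling division
    padded = np.zeros((n_mels, n_chunks * chunk_frames), dtype=mel.dtype)
    padded[:, :total] = mel
    chunks = padded.reshape(n_mels, n_chunks, chunk_frames).transpose(1, 0, 2)
    lengths = [min(chunk_frames, total - i * chunk_frames) for i in range(n_chunks)]
    return chunks, lengths


def encode_long_audio(mel, encode_chunk):
    """Run `encode_chunk` on every chunk and concatenate along the time axis.

    `encode_chunk` takes one (n_mels, chunk_frames) array and returns a
    (frames_out, dim) array; it is a placeholder for the real encoder call.
    """
    chunks, _ = split_mel(mel)
    return np.concatenate([encode_chunk(c) for c in chunks], axis=0)
```

Returning the per-chunk valid lengths alongside the chunks leaves room for the caller to drop encoder outputs that come from padding, once the encoder's downsampling factor is known.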

# PyTorch → ONNX
python convert/audio_encoder/export_audio_encoder_onnx.py \
  --model-path . --savepath ./onnx/qwen3_asr_audio_chunk100.onnx

# (optional) parity check, expect max_abs_diff ≈ 1e-7
python convert/audio_encoder/onnx_run_audio_encoder.py \
  --model-path . --onnx-path ./onnx/qwen3_asr_audio_chunk100.onnx \
  --audio-path asr_example_zh.wav --compare-torch
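
The metric reported by the parity check is the largest element-wise absolute difference between the PyTorch and ONNX outputs. A minimal sketch of that computation, assuming both outputs are available as arrays of the same shape (`max_abs_diff` here is an illustrative helper, not the script's function):

```python
import numpy as np


def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two same-shaped arrays."""
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    if ref.shape != cand.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {cand.shape}")
    return float(np.max(np.abs(ref - cand)))
```

A shape mismatch is raised as an error rather than broadcast, since differing shapes usually mean the export itself is wrong.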

# ONNX → RKNN
python convert/audio_encoder/export_audio_encoder_rknn.py \
  --onnx-path ./onnx/qwen3_asr_audio_chunk100.onnx \
  --target-platform rk3588 --savepath ./rknn/audio_encoder.rknn
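
An ONNX to RKNN conversion without quantization usually follows the rknn-toolkit2 sequence config, load_onnx, build, export_rknn. The sketch below shows that sequence as an assumption about what export_audio_encoder_rknn.py does; the script may set additional config options, and `convert_onnx_to_rknn` is a hypothetical helper name.

```python
def convert_onnx_to_rknn(onnx_path, rknn_path, target_platform="rk3588"):
    """Convert a static-shape ONNX model to RKNN, keeping FP16 (no quantization)."""
    from rknn.api import RKNN  # provided by rknn-toolkit2

    rknn = RKNN()
    try:
        rknn.config(target_platform=target_platform)
        if rknn.load_onnx(model=onnx_path) != 0:
            raise RuntimeError("load_onnx failed")
        if rknn.build(do_quantization=False) != 0:
            raise RuntimeError("build failed")
        if rknn.export_rknn(rknn_path) != 0:
            raise RuntimeError("export_rknn failed")
    finally:
        rknn.release()  # free toolkit resources even if a step fails
```

Called with the paths used above, this would be `convert_onnx_to_rknn("./onnx/qwen3_asr_audio_chunk100.onnx", "./rknn/audio_encoder.rknn")`.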

3. Artifacts

rknn/
├── audio_encoder.rknn
└── language_model.rkllm

These plug directly into run_qwen3_asr_e2e.py at the repo root.
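
A quick way to confirm that both artifacts were produced before running the end-to-end script is sketched below. `missing_artifacts` is an illustrative helper, and the default directory simply mirrors the --savepath values used above.

```python
from pathlib import Path

EXPECTED_ARTIFACTS = ("audio_encoder.rknn", "language_model.rkllm")


def missing_artifacts(out_dir="./rknn"):
    """Return the expected artifact file names that are absent from out_dir."""
    root = Path(out_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_artifacts()
    print("missing:", ", ".join(missing) if missing else "none")
```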