Model Overview
- Model Architecture: Kimi-K2.7-Code
- Input: Text
- Output: Text
- Supported Hardware Microarchitecture: AMD MI350/MI355
- ROCm: 7.2.3
- PyTorch: 2.10.0
- Transformers: 5.12.1
- Operating System(s): Linux
- Inference Engine: vLLM
- Model Optimizer: AMD-Quark (V0.12)
- Weight quantization: OCP MXFP4, static; self_attn: per-channel FP8 (E4M3), static
- Activation quantization: OCP MXFP4, dynamic; self_attn: per-token FP8 (E4M3), dynamic
- Excluded from quantization: MoE gates, lm_head, vision tower, and multimodal projector
This model was built from the Kimi-K2.7-Code model by applying MXFP4 quantization with AMD-Quark.
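To make the MXFP4 entries above more concrete, here is a minimal sketch of block-wise fake quantization in the style of the OCP Microscaling format: 32 elements share one power-of-two scale, and each element is stored as FP4 (E2M1). The function name and the tie-breaking rule are assumptions made for illustration; this is not AMD-Quark's implementation, which may choose scales and rounding differently.

```python
import math

# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2     # exponent of the top E2M1 binade (4.0 == 2**2)
BLOCK_SIZE = 32   # number of elements that share one scale in MX formats


def quantize_block_mxfp4(block):
    """Fake-quantize one block; return (shared_exponent, dequantized_values).

    Hypothetical helper for illustration only.
    """
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Power-of-two shared scale: the largest element lands in the top E2M1 binade.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    dequantized = []
    for v in block:
        # Saturate at the largest representable magnitude (6.0).
        mag = min(abs(v) / scale, E2M1_VALUES[-1])
        # Nearest representable value; ties go to the smaller magnitude here,
        # which is a simplification of real rounding modes.
        nearest = min(E2M1_VALUES, key=lambda q: abs(q - mag))
        dequantized.append(math.copysign(nearest * scale, v))
    return shared_exp, dequantized
```

For a block whose largest magnitude is 6.0, the shared exponent is 0 and the scale is 1, so an element of 0.3 is stored as 0.5, the nearest E2M1 value.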
Model Quantization
The model was quantized from moonshotai/Kimi-K2.7-Code using AMD-Quark. The MoE/Linear weights and activations are quantized to OCP MXFP4, while the attention projections use FP8 (E4M3). The vision tower and multimodal projector are kept at BF16.
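The FP8 path for the attention projections can be illustrated the same way. The sketch below only computes scales, assuming the common convention that a scale maps the largest magnitude in a row onto the largest finite E4M3 value (448). For weights stored as [out_features, in_features], each row is one output channel and its scale is computed once offline (static); for activations, each row is one token and its scale is computed at run time (dynamic). The helper name is hypothetical and AMD-Quark's exact procedure may differ.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of OCP FP8 E4M3


def per_row_scales(matrix):
    """Return one scale per row so that row / scale fits the E4M3 range.

    Hypothetical helper: rows are output channels for a weight matrix,
    or tokens for an activation matrix.
    """
    scales = []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        # Use a neutral scale for all-zero rows to avoid dividing by zero later.
        scales.append(max_abs / FP8_E4M3_MAX if max_abs > 0.0 else 1.0)
    return scales
```

Dividing a row by its scale brings its largest magnitude to 448 before the cast to FP8; multiplying by the same scale after the matmul restores the original range.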
Quantization script:
cd Quark/examples/torch/language_modeling/llm_ptq/
python3 quantize_quark.py \
--model_dir moonshotai/Kimi-K2.7-Code \
--output_dir Kimi-K2.7-Code-MXFP4 \
--file2file_quantization \
--trust_remote_code \
--quant_scheme mxfp4 \
--layer_quant_scheme '*self_attn*' ptpc_fp8 \
--exclude_layers "*lm_head*" "*mlp.gate" "*mm_projector*" \
"*vision_tower*" "mtp.*" "*shared_expert_gate*" "*router*" \
--model_export hf_format
Deployment
Use with vLLM
This model can be deployed efficiently using the vLLM backend.
Note: this model has 64 KV heads, which is incompatible with the AITER MLA kernel (supports 16 or 128 only). Disable AITER MLA when serving on ROCm:
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MLA=0
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
export VLLM_ROCM_USE_AITER_FP4BMM=0
python3 -m vllm.entrypoints.openai.api_server \
--model amd/Kimi-K2.7-Code-MXFP4 \
--trust-remote-code \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--max-model-len 8192
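Once the server is running, it can be queried through its OpenAI-compatible completions endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server is reachable at localhost on port 8000; the helper names and default sampling values are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: server on the local host, port 8000
MODEL = "amd/Kimi-K2.7-Code-MXFP4"


def build_completion_request(prompt, max_tokens=128, temperature=0.0):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,  # 0.0 requests greedy decoding
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Calling `complete("def fibonacci(n):")` against a running server returns the model's continuation of the prompt as a string.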
Evaluation
The model was evaluated on the GSM8K benchmark.
Accuracy
| Benchmark | Kimi-K2.7-Code | Kimi-K2.7-Code-MXFP4 (this model) | Recovery |
|---|---|---|---|
| GSM8K (strict-match) | 95.07% | 94.80% | 99.7% |
| GSM8K (flexible-extract) | 95.15% | 94.77% | 99.6% |
GSM8K is evaluated 5-shot with greedy decoding. The MXFP4 numbers are the mean of repeated stable runs (range: strict-match 94.39%–95.60%, flexible-extract 94.39%–95.53%).
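The Recovery column is consistent with the ratio of the quantized model's accuracy to the baseline accuracy. The card does not state the formula, so treat this short check as an assumption that happens to match the table values:

```python
def recovery(quantized_acc, baseline_acc):
    """Quantized accuracy as a percentage of baseline accuracy."""
    return 100.0 * quantized_acc / baseline_acc


strict = round(recovery(94.80, 95.07), 1)    # 99.7
flexible = round(recovery(94.77, 95.15), 1)  # 99.6
```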
Reproduction
The GSM8K results were obtained using the lm-evaluation-harness framework
with the vLLM backend (rocm/vllm-dev nightly, vLLM 0.23.1rc1). The model
is served first, then evaluated via the OpenAI-compatible completions API.
Important: serve with automatic prefix caching disabled
(--no-enable-prefix-caching) for deterministic evaluation results.
# 1) Serve
export VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MLA=0 \
VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0 VLLM_ROCM_USE_AITER_FP4BMM=0
python3 -m vllm.entrypoints.openai.api_server \
--model amd/Kimi-K2.7-Code-MXFP4 \
--trust-remote-code --tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 --max-model-len 8192 \
--seed 42 --no-enable-prefix-caching
# 2) Evaluate
lm_eval --model local-completions \
--model_args "model=amd/Kimi-K2.7-Code-MXFP4,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=128,tokenized_requests=False,max_length=8192,add_bos_token=True,seed=42,trust_remote_code=True" \
--tasks gsm8k --num_fewshot 5 --batch_size 1 --seed 42
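If the evaluation is run with lm_eval's `--output_path` option, the scores can be read back from the saved JSON. The sketch below assumes the results file uses metric keys of the form `exact_match,<filter>` under `results.gsm8k`, as recent lm-evaluation-harness versions do; check the keys in your own file, since the layout can change between versions. The function names are illustrative.

```python
import json


def gsm8k_scores(results):
    """Extract both GSM8K exact-match scores from a parsed lm_eval results dict."""
    task = results["results"]["gsm8k"]
    return {
        "strict-match": task["exact_match,strict-match"],
        "flexible-extract": task["exact_match,flexible-extract"],
    }


def load_scores(path):
    """Read an lm_eval results JSON file and return the GSM8K scores."""
    with open(path, encoding="utf-8") as handle:
        return gsm8k_scores(json.load(handle))
```

The returned values are fractions in the range 0 to 1, so multiply by 100 to compare them with the percentages in the accuracy table.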
License
Modifications Copyright (c) 2026 Advanced Micro Devices, Inc. All rights reserved.