Model Overview

  • Model Architecture: Kimi-K2.7-Code
    • Input: Text
    • Output: Text
  • Supported Hardware Microarchitecture: AMD MI350/MI355
  • ROCm: 7.2.3
  • PyTorch: 2.10.0
  • Transformers: 5.12.1
  • Operating System(s): Linux
  • Inference Engine: vLLM
  • Model Optimizer: AMD-Quark (V0.12)
    • Weight quantization: OCP MXFP4, static; self_attn: per-channel FP8 (E4M3), static
    • Activation quantization: OCP MXFP4, dynamic; self_attn: per-token FP8 (E4M3), dynamic
    • Excluded from quantization: MoE gates, lm_head, vision tower and multimodal projector
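
The self_attn entries in the list above differ mainly in where the scale comes from. Weights get one scale per output channel, computed once offline (static). Activations get one scale per token, computed at inference time (dynamic). The following is a minimal sketch of that scale computation only. It assumes the OCP FP8 E4M3 maximum of 448 and the usual [out_channels, in_features] weight layout, and it does not round values to FP8. The function name and sample values are illustrative and are not taken from AMD-Quark.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in OCP FP8 E4M3 (assumed format)

def row_scales(matrix):
    """Return one scale per row so that row / scale fits within +/-FP8_E4M3_MAX."""
    scales = []
    for row in matrix:
        amax = max(abs(v) for v in row)
        # An all-zero row gets a neutral scale to avoid dividing by zero later.
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales

# Static, per-channel: one scale per weight row (output channel), fixed after quantization.
weight = [[0.5, -2.0, 1.0], [4.48, 0.1, -0.2]]
weight_scales = row_scales(weight)

# Dynamic, per-token: one scale per activation row (token), recomputed for every input.
activations = [[10.0, -448.0, 3.0], [0.0, 0.0, 0.0]]
activation_scales = row_scales(activations)
```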

This model was produced by quantizing Kimi-K2.7-Code to MXFP4 with AMD-Quark.

Model Quantization

The model was quantized from moonshotai/Kimi-K2.7-Code using AMD-Quark. The MoE/Linear weights and activations are quantized to OCP MXFP4, while the attention projections use FP8 (E4M3). The vision tower and multimodal projector are kept at BF16.
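
To make the MXFP4 format concrete, here is a minimal pure-Python sketch of quantizing one block as described in the OCP Microscaling (MX) specification: 32 elements share one power-of-two scale (E8M0), and each element is stored as FP4 E2M1. This is an illustration of the number format, not AMD-Quark's implementation. The function name is hypothetical, and ties are resolved toward the smaller magnitude rather than with round-to-nearest-even.

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign is stored separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # number of elements sharing one scale in MXFP4
E2M1_EMAX = 2    # exponent of the largest E2M1 power of two (4.0 == 2**2)

def quantize_block_mxfp4(block):
    """Fake-quantize one block; returns (shared_exponent, dequantized_values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Shared power-of-two exponent, clamped to the E8M0 range.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    shared_exp = max(-127, min(127, shared_exp))
    scale = 2.0 ** shared_exp
    out = []
    for x in block:
        # Scale, saturate at the largest E2M1 magnitude, snap to the nearest grid point.
        mag = min(abs(x) / scale, E2M1_GRID[-1])
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(q, x) * scale)
    return shared_exp, out

# Example: a block whose largest magnitude is 1.6 gets a shared scale of 2**-2.
example = [0.3, -1.6] + [0.0] * (BLOCK_SIZE - 2)
example_exp, example_out = quantize_block_mxfp4(example)
```

In this example, -1.6 saturates to -1.5 because the largest representable magnitude at a scale of 0.25 is 6 × 0.25 = 1.5. This is the kind of per-block rounding error that the accuracy comparison in the Evaluation section measures end to end.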

Quantization script:

cd Quark/examples/torch/language_modeling/llm_ptq/

python3 quantize_quark.py \
    --model_dir moonshotai/Kimi-K2.7-Code \
    --output_dir Kimi-K2.7-Code-MXFP4 \
    --file2file_quantization \
    --trust_remote_code \
    --quant_scheme mxfp4 \
    --layer_quant_scheme '*self_attn*' ptpc_fp8 \
    --exclude_layers "*lm_head*" "*mlp.gate" "*mm_projector*" \
        "*vision_tower*" "mtp.*" "*shared_expert_gate*" "*router*" \
    --model_export hf_format
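
The --exclude_layers arguments are wildcard patterns over layer names. The sketch below shows how such patterns select layers, assuming glob-style matching equivalent to Python's fnmatch. AMD-Quark's actual matching rules may differ, and the layer names are hypothetical examples rather than names read from the checkpoint.

```python
from fnmatch import fnmatchcase

# Patterns copied from the --exclude_layers argument of the command above.
EXCLUDE_PATTERNS = [
    "*lm_head*", "*mlp.gate", "*mm_projector*",
    "*vision_tower*", "mtp.*", "*shared_expert_gate*", "*router*",
]

def is_excluded(layer_name, patterns=EXCLUDE_PATTERNS):
    """True if the layer name matches any exclusion pattern (whole-name match)."""
    return any(fnmatchcase(layer_name, p) for p in patterns)

# Hypothetical layer names, for illustration only.
for name in [
    "language_model.model.layers.3.mlp.gate",
    "language_model.model.layers.3.mlp.experts.0.gate_proj",
    "language_model.lm_head",
    "vision_tower.blocks.0.attn.qkv",
]:
    print(f"{name}: {'excluded' if is_excluded(name) else 'quantized'}")
```

Under this assumption, "*mlp.gate" has no trailing wildcard, so it matches the MoE gate itself but not an expert's gate_proj, which would remain quantized.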

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend.

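Note: This model has 64 KV heads, which is incompatible with the AITER MLA kernel (it supports only 16 or 128). When serving on ROCm, keep AITER enabled but disable its MLA path. The settings below also disable AITER fused shared experts and FP4 BMM: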

export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MLA=0
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
export VLLM_ROCM_USE_AITER_FP4BMM=0

python3 -m vllm.entrypoints.openai.api_server \
    --model amd/Kimi-K2.7-Code-MXFP4 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192
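
Once the server is running, it can be queried through the OpenAI-compatible completions endpoint. The following is a minimal standard-library sketch. It assumes the server listens on localhost port 8000 (vLLM's default when --port is not given), and the helper names, prompt, and token limit are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: default port, same host as the server
MODEL = "amd/Kimi-K2.7-Code-MXFP4"  # must match the --model value used when serving

def build_completion_request(prompt, max_tokens=64):
    """Build a POST request for the /v1/completions endpoint without sending it."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt, max_tokens=64, timeout=120):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]

# Requires the server to be running:
# print(complete("def fibonacci(n):"))
```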

Evaluation

The model was evaluated on the GSM8K benchmark.

Accuracy

| Benchmark | Kimi-K2.7-Code | Kimi-K2.7-Code-MXFP4 (this model) | Recovery |
|---|---|---|---|
| GSM8K (strict-match) | 95.07% | 94.80% | 99.7% |
| GSM8K (flexible-extract) | 95.15% | 94.77% | 99.6% |

GSM8K was run 5-shot with greedy decoding. The MXFP4 numbers are the mean of repeated stable runs (range: strict-match 94.39% to 95.60%, flexible-extract 94.39% to 95.53%).
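
The Recovery column is consistent with the quantized score expressed as a percentage of the baseline score. A short sketch of that arithmetic, using the accuracy figures reported above (the function name is illustrative):

```python
def recovery(quantized, baseline):
    """Quantized accuracy as a percentage of the baseline accuracy."""
    return 100.0 * quantized / baseline

# (baseline, quantized) accuracy in percent, from the accuracy figures above.
scores = {
    "strict-match": (95.07, 94.80),
    "flexible-extract": (95.15, 94.77),
}

for name, (baseline, quantized) in scores.items():
    print(f"GSM8K ({name}): {recovery(quantized, baseline):.1f}%")
# GSM8K (strict-match): 99.7%
# GSM8K (flexible-extract): 99.6%
```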

Reproduction

The GSM8K results were obtained using the lm-evaluation-harness framework with the vLLM backend (rocm/vllm-dev nightly, vLLM 0.23.1rc1). The model is served first, then evaluated via the OpenAI-compatible completions API.

Important: serve with automatic prefix caching disabled (--no-enable-prefix-caching) for deterministic evaluation results.

# 1) Serve
export VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MLA=0 \
       VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0 VLLM_ROCM_USE_AITER_FP4BMM=0
python3 -m vllm.entrypoints.openai.api_server \
    --model amd/Kimi-K2.7-Code-MXFP4 \
    --trust-remote-code --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 --max-model-len 8192 \
    --seed 42 --no-enable-prefix-caching

# 2) Evaluate
lm_eval --model local-completions \
    --model_args "model=amd/Kimi-K2.7-Code-MXFP4,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=128,tokenized_requests=False,max_length=8192,add_bos_token=True,seed=42,trust_remote_code=True" \
    --tasks gsm8k --num_fewshot 5 --batch_size 1 --seed 42
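
Because the model is served first and evaluated second, it helps to wait until the server is ready before starting lm_eval, since loading a model of this size can take a while. The following is a minimal sketch that polls the server's /health endpoint. It assumes localhost port 8000, and the helper names, attempt count, and delay are illustrative choices.

```python
import time
import urllib.error
import urllib.request

def http_probe(url, timeout=5):
    """Return True if a GET on the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False

def wait_until_ready(url="http://localhost:8000/health", probe=http_probe,
                     attempts=120, delay=10.0, sleep=time.sleep):
    """Poll until the probe succeeds; return False if all attempts fail."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False

# Usage (requires the server from step 1 to be starting or running):
# if wait_until_ready():
#     print("server is ready; start lm_eval")
```

The probe and sleep functions are parameters so that the polling logic can be exercised without a running server.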

License

Modifications Copyright(c) 2026 Advanced Micro Devices, Inc. All rights reserved.
