Model Overview

Model Architecture: Kimi-K2.7-Code
- Input: Text
- Output: Text
Supported Hardware Microarchitecture: AMD MI350/MI355
ROCm: 7.2.3
PyTorch: 2.10.0
Transformers: 5.12.1
Operating System(s): Linux
Inference Engine: vLLM
Model Optimizer: AMD-Quark (V0.12)
- Weight quantization: OCP MXFP4, static; self_attn layers: FP8 E4M3, per-channel, static
- Activation quantization: OCP MXFP4, dynamic; self_attn layers: FP8 E4M3, per-token, dynamic
- Excluded from quantization: MoE gates, lm_head, vision tower and multimodal projector
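
To make the difference between the static per-channel and dynamic per-token FP8 settings above concrete, here is a minimal sketch of how the scales could be computed. It is not AMD-Quark's implementation. It assumes the OCP FP8 E4M3 variant, whose largest finite value is 448, and the helper name `per_row_scales` is hypothetical.

```python
FP8_E4M3_MAX = 448.0  # assumption: OCP FP8 E4M3 (largest finite value is 448)


def per_row_scales(matrix):
    """Return one scale per row so that row / scale fits the FP8 E4M3 range."""
    scales = []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        # Guard against an all-zero row, which would otherwise give scale 0.
        scales.append(max(max_abs, 1e-12) / FP8_E4M3_MAX)
    return scales


# Static, per-channel: rows are output channels of a weight matrix, and the
# scales are computed once at quantization time and stored with the model.
weight = [[448.0, -1.0], [0.5, -224.0]]
weight_scales = per_row_scales(weight)

# Dynamic, per-token: rows are tokens of an activation tensor, and the scales
# are recomputed at inference time for every batch.
activations = [[2.0, -4.0], [0.25, 0.125]]
activation_scales = per_row_scales(activations)
```

The same arithmetic serves both cases; what differs is when it runs (offline for weights, at runtime for activations) and which axis a row represents.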

This model was built from Kimi-K2.7-Code by applying MXFP4 quantization with AMD-Quark.

Model Quantization

The model was quantized from moonshotai/Kimi-K2.7-Code using AMD-Quark. The weights and activations of the MoE and linear layers are quantized to OCP MXFP4, while the self-attention projections use FP8 (E4M3). The vision tower and multimodal projector are kept in BF16.
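
For readers unfamiliar with MXFP4, the following sketch simulates the format in pure Python. It follows our reading of the OCP Microscaling specification: values are grouped into blocks of 32, each block shares one power-of-two scale, and each element is stored as a 4-bit E2M1 float. It is an illustration only, not AMD-Quark's kernel; the function names are hypothetical, and tie-breaking and scale-range clamping are simplified.

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign is stored separately).
FP4_E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2     # exponent of the largest E2M1 value (6 = 1.5 * 2**2)
BLOCK_SIZE = 32   # MX block size


def quantize_block_mxfp4(block):
    """Quantize one block; return (shared exponent, dequantized values)."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Shared power-of-two scale chosen so the largest element lands in E2M1's top binade.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    dequantized = []
    for x in block:
        scaled = abs(x) / scale
        # Simplification: nearest representable value, ties go to the smaller one.
        q = min(FP4_E2M1_VALUES, key=lambda v: abs(v - scaled))
        dequantized.append(math.copysign(q * scale, x))
    return shared_exp, dequantized


def quantize_mxfp4(values):
    """Split a flat list into blocks of 32 and quantize each block independently."""
    return [quantize_block_mxfp4(values[i:i + BLOCK_SIZE])
            for i in range(0, len(values), BLOCK_SIZE)]
```

Because the scale is shared per block, a single large outlier coarsens the grid for the other 31 values in its block. This sensitivity is one plausible reason for keeping layers such as the MoE gates and lm_head out of MXFP4.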

Quantization script:

cd Quark/examples/torch/language_modeling/llm_ptq/

python3 quantize_quark.py \
    --model_dir moonshotai/Kimi-K2.7-Code \
    --output_dir Kimi-K2.7-Code-MXFP4 \
    --file2file_quantization \
    --trust_remote_code \
    --quant_scheme mxfp4 \
    --layer_quant_scheme '*self_attn*' ptpc_fp8 \
    --exclude_layers "*lm_head*" "*mlp.gate" "*mm_projector*" \
        "*vision_tower*" "mtp.*" "*shared_expert_gate*" "*router*" \
    --model_export hf_format
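
The glob patterns in the command above decide which layers get which scheme. The sketch below shows one way to preview that mapping before running a long quantization job. It assumes shell-style matching (Python's `fnmatch`) and that exclusions take precedence over `--layer_quant_scheme`; AMD-Quark's actual matching rules may differ, and the layer names used in the example are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns copied from the quantization command.
EXCLUDE_PATTERNS = [
    "*lm_head*", "*mlp.gate", "*mm_projector*", "*vision_tower*",
    "mtp.*", "*shared_expert_gate*", "*router*",
]
LAYER_SCHEMES = [("*self_attn*", "ptpc_fp8")]
DEFAULT_SCHEME = "mxfp4"


def resolve_scheme(layer_name):
    """Return the scheme for a layer, or None if the layer is left unquantized."""
    # Assumption: an exclusion wins over any per-layer scheme.
    if any(fnmatchcase(layer_name, p) for p in EXCLUDE_PATTERNS):
        return None
    for pattern, scheme in LAYER_SCHEMES:
        if fnmatchcase(layer_name, pattern):
            return scheme
    return DEFAULT_SCHEME


# Hypothetical layer names, for illustration only.
for name in [
    "lm_head",
    "model.layers.3.mlp.gate",
    "model.layers.3.mlp.experts.0.gate_proj",
    "model.layers.3.self_attn.q_proj",
]:
    print(f"{name}: {resolve_scheme(name)}")
```

Note that `*mlp.gate` has no trailing wildcard, so under this matching it excludes the MoE gate itself but not expert projections whose names merely start with `gate`.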

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend.

Note: this model has 64 KV heads, a count the AITER MLA kernel does not support (it handles only 16 or 128). Disable AITER MLA when serving on ROCm:

export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MLA=0
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
export VLLM_ROCM_USE_AITER_FP4BMM=0

python3 -m vllm.entrypoints.openai.api_server \
    --model amd/Kimi-K2.7-Code-MXFP4 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192
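
Once the server is running, it can be queried through the OpenAI-compatible completions endpoint. The sketch below uses only the Python standard library. It assumes the server listens on vLLM's default port 8000 on the local machine; `build_completion_request` and `complete` are hypothetical helper names.

```python
import json
import urllib.request

MODEL = "amd/Kimi-K2.7-Code-MXFP4"
BASE_URL = "http://localhost:8000/v1"  # assumption: default host and port


def build_completion_request(prompt, max_tokens=128):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, max_tokens=128):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server up, `print(complete("def fibonacci(n):"))` should return a continuation of the prompt. Keep the prompt length plus `max_tokens` within the 8192-token limit set by `--max-model-len`.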

Evaluation

The model was evaluated on the GSM8K benchmark.

Accuracy

Benchmark	Kimi-K2.7-Code	Kimi-K2.7-Code-MXFP4 (this model)	Recovery
GSM8K (strict-match)	95.07%	94.80%	99.7%
GSM8K (flexible-extract)	95.15%	94.77%	99.6%

GSM8K is evaluated 5-shot with greedy decoding. The MXFP4 numbers are the mean of repeated stable runs (range: strict-match 94.39%–95.60%, flexible-extract 94.39%–95.53%).
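
The Recovery column is consistent with the ratio of quantized to baseline accuracy, expressed as a percentage. A small sketch that recomputes it from the table values, assuming that definition (`recovery` is a hypothetical helper name):

```python
def recovery(baseline_pct, quantized_pct):
    """Quantized accuracy as a percentage of baseline accuracy."""
    return quantized_pct / baseline_pct * 100.0


# (baseline, MXFP4) accuracies in percent, taken from the table above.
results = {
    "GSM8K (strict-match)": (95.07, 94.80),
    "GSM8K (flexible-extract)": (95.15, 94.77),
}

for benchmark, (baseline, quantized) in results.items():
    print(f"{benchmark}: {recovery(baseline, quantized):.1f}%")
```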

Reproduction

The GSM8K results were obtained using the lm-evaluation-harness framework with the vLLM backend (rocm/vllm-dev nightly, vLLM 0.23.1rc1). The model is served first, then evaluated via the OpenAI-compatible completions API.

Important: serve with automatic prefix caching disabled (--no-enable-prefix-caching) for deterministic evaluation results.

# 1) Serve
export VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MLA=0 \
       VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0 VLLM_ROCM_USE_AITER_FP4BMM=0
python3 -m vllm.entrypoints.openai.api_server \
    --model amd/Kimi-K2.7-Code-MXFP4 \
    --trust-remote-code --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 --max-model-len 8192 \
    --seed 42 --no-enable-prefix-caching

# 2) Evaluate
lm_eval --model local-completions \
    --model_args "model=amd/Kimi-K2.7-Code-MXFP4,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=128,tokenized_requests=False,max_length=8192,add_bos_token=True,seed=42,trust_remote_code=True" \
    --tasks gsm8k --num_fewshot 5 --batch_size 1 --seed 42
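
Because the model must be fully loaded before step 2 starts, it can help to poll the server between the two steps. The sketch below waits on vLLM's `/health` endpoint. The host, port, timeout, and polling interval are assumptions, and the helper names are hypothetical.

```python
import time
import urllib.error
import urllib.request


def server_is_healthy(base_url="http://localhost:8000"):
    """Return True if GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: the server is not ready yet.
        return False


def wait_for_server(probe=server_is_healthy, timeout_s=1800.0, interval_s=10.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll `probe` until it succeeds or `timeout_s` seconds have elapsed."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_for_server()` before launching `lm_eval` prevents the evaluation from failing on connection errors while the weights are still loading. The `probe`, `sleep`, and `clock` parameters exist only so the loop can be exercised without a live server.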

License
