uv run python custom_mlx_lm/custom_convert.py --hf-path . --mlx-path MobileLLM-R1-950M-mixed-4bit-mlx --dynamic-quant --target-bpw 4.5 --group-size 64 --report-ppl
Loading model from ....
Loading calibration data...
Token indices sequence length is longer than the specified maximum sequence length for this model (110205 > 32768). Running this sequence through the model will result in indexing errors
Calculating perplexity of original model...
Original PPL: 50.262
Starting advanced mixed-precision quantization...
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
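The fork warning above is harmless here, but it can be silenced by setting the environment variable before `tokenizers` is loaded, as the message itself suggests. A minimal sketch (placed at the top of the conversion script, before any Hugging Face imports):

```python
import os

# Must run before the `tokenizers` library is imported; the variable is
# read once at import time, so setting it later has no effect.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Alternatively, export the variable in the shell before invoking the command.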
Estimating sensitivities: 100%|████████████████████████████████████| 54/54 [02:03<00:00, 2.28s/it]
Calculating perplexity of quantized model...
Quantized PPL: 59.059
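For context on the two PPL figures: perplexity is the exponential of the mean per-token negative log-likelihood over the evaluation text, so the 50.262 → 59.059 jump reflects a roughly 17% increase. A minimal sketch of the computation, using hypothetical per-token NLL values rather than the script's actual data:

```python
import math

def perplexity(nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(nlls) / len(nlls))

# Hypothetical per-token NLLs -- illustrative only, not from the run above.
nlls = [3.9, 4.1, 3.8, 4.0]
print(f"{perplexity(nlls):.3f}")
```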

✅ Model saved to MobileLLM-R1-950M-mixed-4bit-mlx

uv run python custom_mlx_lm/quant_summary.py --model-path MobileLLM-R1-950M-mixed-4bit-mlx --show 8
Method: mixed_precision_dynamic
Group size: 64
Total linear layers: 154
4-bit layers: 153
8-bit layers: 1

Examples (8-bit):
- layers.0.attention.o_proj

Examples (4-bit):
- layers.0.attention.k_proj
- layers.0.attention.q_proj
- layers.0.attention.v_proj
- layers.0.feed_forward.down_proj
- layers.0.feed_forward.gate_proj
- layers.0.feed_forward.up_proj
- layers.1.attention.k_proj
- layers.1.attention.o_proj

weights.npz contains quantized tensors: True
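The summary above can be sanity-checked against the `--target-bpw 4.5` flag with back-of-envelope arithmetic. Assuming each group of 64 weights carries a 16-bit scale and a 16-bit bias (as affine group quantization typically stores), and treating all 154 layers as equal-sized (an approximation; real layers differ in parameter count):

```python
# Rough effective bits-per-weight for the reported mix:
# 153 layers at 4-bit, 1 layer at 8-bit, group size 64.
layers_4bit, layers_8bit = 153, 1
group_size = 64

# Assumed per-group overhead: 16-bit scale + 16-bit bias per 64 weights.
overhead_bpw = (16 + 16) / group_size  # 0.5 bits per weight

avg_weight_bits = (layers_4bit * 4 + layers_8bit * 8) / (layers_4bit + layers_8bit)
effective_bpw = avg_weight_bits + overhead_bpw
print(f"{effective_bpw:.2f}")  # lands close to the --target-bpw of 4.5
```

Under these assumptions the result comes out just above 4.5, consistent with nearly all layers settling at 4 bits once group overhead is counted.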