Request: Tensor alignment (256) for llama.cpp quantization

#165
opened by kusanagi-hf

Hi, thank you for this great model!

Many users run models via llama.cpp, whose K-quants use a super-block size of 256, so the first dimension (row size) of certain weight tensors (e.g., attention key and query projections, token embeddings) must be a multiple of 256.
When a tensor does not meet this requirement, it cannot be quantized at the intended bit-width, and llama.cpp falls back to a wider format for that tensor, defeating the purpose of the quantization.
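To make the constraint concrete, here is a minimal sketch of the alignment check, assuming `QK_K = 256` (the K-quants super-block size in llama.cpp); the example dimensions are hypothetical, not this model's actual shapes:

```python
QK_K = 256  # super-block size used by llama.cpp K-quants

def kquant_compatible(row_size: int) -> bool:
    """A tensor row can be K-quantized only if its length is a multiple of QK_K."""
    return row_size % QK_K == 0

def aligned_row_size(row_size: int) -> int:
    """Smallest multiple of QK_K that fits the row (what a 256-aligned variant would use)."""
    return ((row_size + QK_K - 1) // QK_K) * QK_K

# Hypothetical dimensions for illustration:
for dim in (2880, 4096):
    print(dim, kquant_compatible(dim), aligned_row_size(dim))
```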

For context, I plan to quantize at a lower bit-width than MXFP4, so this alignment matters a great deal for my use case.

Would it be possible to provide a variant where these tensor dimensions are aligned to 256?

Thank you for your consideration!
