Hardware requirements
What GPU VRAM do I need to run this?
I load gemma-4-31b-it-AWQ with the vllm/vllm-openai-cu130 image on my RTX 3090.
My docker command is:
docker run --gpus all \
  --runtime nvidia \
  --ipc=host \
  -v "$MODEL_PATH:/model" \
  -p 8000:8000 \
  vllm/vllm-openai:gemma4-cu130 \
  --model /model \
  --served-model-name gemma-4-31b \
  --dtype bfloat16 \
  --quantization compressed-tensors \
  --max-model-len 1536 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4
Note: max-model-len cannot exceed 1536, otherwise I get OOM.
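For intuition on where that 1536 cap comes from, here's a rough VRAM budget. The layer/head counts and the 4-bit weight size below are placeholder assumptions (not published Gemma 4 31B specs), so treat this as a back-of-envelope sketch, not real numbers:

```python
# Back-of-envelope VRAM budget for a 24 GB card.
# All model numbers below are ASSUMPTIONS for illustration only.
GiB = 1024 ** 3

params = 31e9                  # 31B parameters
awq_bits = 4                   # AWQ 4-bit weights (assumed)
weights = params * awq_bits / 8            # bytes for weights alone

# KV cache per token: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (fp16)
layers, kv_heads, head_dim = 48, 8, 128    # assumed architecture
kv_per_token = 2 * layers * kv_heads * head_dim * 2

ctx = 1536                     # --max-model-len 1536
batch = 4                      # --max-num-seqs 4
kv_total = kv_per_token * ctx * batch

budget = 24 * GiB * 0.95       # --gpu-memory-utilization 0.95
print(f"weights:  {weights / GiB:5.1f} GiB")
print(f"kv cache: {kv_total / GiB:5.2f} GiB")
print(f"left for activations etc.: {(budget - weights - kv_total) / GiB:5.1f} GiB")
```

The point is that the 4-bit weights eat most of the 24GB up front; the rest goes to the KV-cache pool, activations, and CUDA overhead, which is why the context cap ends up so low.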
What GPU VRAM do I need to run this?
The technical answer is ZERO, of course (read up on Turing completeness).
The "practical" answer is probably 24GB, using an IQ3 GGUF quant with llama-server's --fit option. IQ2 quants might run as well, maybe even in 16GB, but tool calling, accuracy, and context length would suffer a lot. i.e., it might barely be agentic at all, or it might make REALLY bad mistakes, like picking - when it's meant to be + on that payment into your accounting software. You can mitigate with tests, of course, but the point is... it's probably pretty bad at IQ2. With limited context length, it's not going to be too practical for agentic dev or even long conversations anyway.
If you're tight on VRAM, consider the 27B A4B variant instead. It requires less VRAM and is much faster, but it's more a convincing parrot of an intelligent beast than an actual intelligent beast.
Also, bear in mind that the other, smaller Gemma 4 models have audio support, which this one lacks. Just trade-offs to consider, especially if you're struggling to fit this variant in.
Even Q4_K_L from bartowski fits into 24GB with at least 16k of unquantized context (maybe a bit more, I didn't try), while also using the same card as the display device. No need to go below 4bpw with 24GB. For more context you can use Q4_K_S/IQ4_XS and/or quantize the KV cache to Q8.
You'll need a bit more if you want to use vision (loading the mmproj file), but IQ4_XS, possibly with the KV cache at Q8, should still allow decent context, I think.
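To see why quantizing the KV cache to Q8 helps, here's a quick size estimate. The layer and head counts are assumptions for illustration (q8_0 also stores small per-block scales, which this ignores):

```python
# KV-cache size at 16k context: fp16 vs ~8-bit cache.
# Layer/head counts are ASSUMED, not the real Gemma 4 31B config.
layers, kv_heads, head_dim = 48, 8, 128
ctx = 16 * 1024

def kv_bytes(bytes_per_elt):
    # K and V tensors, one pair per layer, one entry per token
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elt

f16 = kv_bytes(2)   # default full-precision cache
q8  = kv_bytes(1)   # roughly what q8_0 cache types give you in llama.cpp
print(f"fp16: {f16 / 2**30:.2f} GiB, q8: {q8 / 2**30:.2f} GiB")
```

So halving the cache element size frees a couple of GiB at 16k context, which is exactly the headroom you need for the mmproj or a bigger weight quant.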
16k context is junk for agentic use though; it barely even fits the prompt with a few basic tools. You need something like 80k as a practical starting point; 65k will work, but will churn between making some progress and compacting the context to make room for more progress. With every compaction you lose fidelity, like Chinese whispers.
80k is not going to be good with these small models anyway, especially if you can't use full-precision or at least Q8 weights (and then definitely a full-precision KV cache). For me, 16k is more than enough for almost everything I need. That said, IQ4_XS will likely allow quite a lot of context. If someone really needs more, then Qwen 3.5 27B may be a better choice (it has similar performance to Gemma 4, losing mostly in languages/creative writing), as it is a bit smaller and also uses long context quite efficiently.
I'm very confused by that statement :) If 80k isn't good, then 16k is much worse, surely?
No, it works exactly the opposite way. The smaller the context, the better the model understands it (higher-quality answers). Models may still do simple tasks (needle in a haystack, summarization) over such long contexts, so it's not completely useless, but understanding the meaning of the context (relations, subtleties, etc.) drops quickly with context length, often even at 8k, as more complex long-context benchmarks show (and practical experience too). Some of the top closed models can use large context well, but I am not aware of any local (especially small) model that is good at it (i.e., that doesn't deteriorate compared to short context).
Also, a recent experiment with Gemma 31B-it quants showed that even Q8 quants start to have quite high KL divergence when a longer context is used (I think it was ~16-32k there). Usually quant performance is evaluated on Wikitext with relatively short sample sizes, up to 1024 tokens, where even 4bpw quants show good similarity to full precision, but it seems this may not hold over longer contexts.
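For reference, the KL divergence measured in that kind of experiment compares the full-precision and quantized models' next-token probability distributions, token by token. A minimal sketch with toy distributions (the numbers are made up for illustration):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions: "full precision" vs a slightly perturbed "quant".
full  = [0.70, 0.20, 0.08, 0.02]
quant = [0.60, 0.25, 0.10, 0.05]

print(f"KL(full || quant) = {kl_divergence(full, quant):.4f} nats")  # 0.0271 nats
print(kl_divergence(full, full))  # identical distributions give 0.0
```

A perfect quant would give 0 at every position; the claim above is that for long contexts this number climbs even at Q8, which short Wikitext samples don't reveal.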
Yeah, I realised you meant the haystack thing after I posted that; should have come back and updated :)
Have you actually tried it? Minimax M2.5 UD-IQ2_K_L @ 80k probably outperforms Qwen 3.5 122B @ Q4 for me. MiniMax is slower and more swappy, but Qwen always gives me "parrot" vibes: it does smart-sounding things, but is basically just hallucinating, and sometimes lucking onto the right hallucination. Their reasoning is much improved in 3.5, but generally Qwen has always been poor quality to me, since their QwQ model. So basically only Qwen 2.5, I think it was, just before QwQ, seemed reasonably impressive, as non-reasoning models went. Minimax actually thinks correctly; it makes the occasional typo at Q2, but is solid enough in reasoning to figure it out and correct course.
As for 80k vs. 131k vs. 64k or whatever... I think you're overstating it. Qwen might be really bad there (maybe that's why it seems so bad in general to me, as most of my use cases are coding or detailed conversations), but Minimax isn't bad at all.
Have you actually tried it? Minimax M2.5 UD-IQ2_K_L @ 80k
Minimax, no; it's a bit too large for me at 80k. I am downloading Minimax 2.7 UD_IQ4_XS though, to see how it goes, but again at lower contexts. I rarely need large context, and then only for summarization, which is not even perfect but good enough. Programming I still do by hand (except short utilities), and I don't have that much time or use-case to fool around with agentic things. But even then I would probably try to summarize/compress the context as much as possible, since a smaller high-quality context trumps a large low-quality one (a simple copy-paste of some long data/files).
For what it's worth: 24GB / entirely in VRAM / 28K context / Q4_K_M, with 1GB to spare (I didn't load the mmproj, so you might need to tweak a bit if you need image recognition). The KV cache is REALLY compact with Gemma 4.
You can probably get to around 40K (maybe more) with an IQ4_XL quant.
Edit: And for agentic-only use, I'd give Qwen 3.5 a go. It's really good at complex agentic stuff I wouldn't assume a model that size could handle. That's actually the one single thing Qwen 3.5 is eerily good at. It'll just waste thousands of tokens doing it, though, which can be a problem locally. I don't doubt that Gemma can compete, but their chat template is kinda junk atm, which probably fucks with its performance.
For what it's worth: 24GB / Entirely in VRAM / 28K context / Q4_K_M with 1GB to spare
Ahh, I see, I'm using 4x24GB, so our experiences of the various models/quants/context lengths will be very different.
I gave Qwen 3.5 122B another go today. Sure, it "completed" the assigned project, but it made up a whole lot of stuff I didn't ask for; Minimax fixed it all. I still maintain that Qwen models aren't competent: they're convincing parrots. Maybe at 395B; never really tried that.
Absolutely, I can only compare apples to apples in my size range. :) If I need something bigger, I just use Opus.
So let me reformulate: for models that easily fit in 24GB (anything from 8B to 32B params), Qwen 27B dominates all previously released models in that size range for tool-call/agentic tasks, and on that job alone (it's terrible at plenty of other things). To be fair, "tool-call" is a vast domain and depends heavily on how you implement it, how you document your available tools for the model, and so on. I have yet to properly compare it to Gemma 4 31B, as I'm waiting for the dust to settle, but I don't doubt it's a very strong contender. I hope it's better, because Gemma 4 is not prone to hallucinations or to 10k-token semi-infinite thinking blocks like Qwen is.