Sarvam-30B Runtime-Optimized Inference System
1. Overview
This project presents a runtime-optimized deployment system for Sarvam-30B, a large-scale Mixture-of-Experts (MoE) language model, using vLLM.
The objective is to improve inference efficiency, stability, and output quality without modifying the model weights, making it suitable for real-world deployment scenarios.
This work focuses on system-level optimization rather than model-level compression, demonstrating a practical and reliable approach to handling large LLMs under constrained environments.
2. Base Model
- Model: sarvamai/sarvam-30b
- Architecture: Mixture-of-Experts (MoE)
- Task: Text Generation
- Inference Engine: vLLM
- Hardware: Multi-GPU (Tensor Parallelism)
3. Problem Statement
During experimentation, two critical challenges were identified:
3.1 Reasoning Leakage
The model generates internal reasoning traces such as <think> tokens, which:
- Reduce readability
- Break structured output requirements
- Affect downstream usability
3.2 High Resource Consumption
Due to the MoE architecture:
- High GPU memory utilization (~45GB per GPU baseline)
- Large KV-cache growth with sequence length
- Reduced inference efficiency under default settings
4. Approach
4.1 Inference-Time Optimization (Core Contribution)
Instead of modifying weights (quantization/pruning), this system applies runtime-level optimization:
- gpu-memory-utilization = 0.85
- max-model-len = 1024
- max-num-seqs = 4
- tensor-parallel-size = 4
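For illustration, the sketch below shows how the same settings map onto vLLM's Python engine arguments. This is a minimal offline example, not the deployed setup: the actual system starts an OpenAI-compatible server via run.sh and vllm_config.yaml, and the example assumes a 4-GPU node with the checkpoint available.

```python
# Illustrative sketch only: the same runtime settings expressed as vLLM engine arguments.
from vllm import LLM, SamplingParams

llm = LLM(
    model="sarvamai/sarvam-30b",
    tensor_parallel_size=4,        # shard the MoE weights across 4 GPUs
    gpu_memory_utilization=0.85,   # keep headroom below vLLM's 0.90 default
    max_model_len=1024,            # bound per-sequence KV-cache growth
    max_num_seqs=4,                # cap concurrent sequences per batch
)

params = SamplingParams(temperature=0.2, max_tokens=200)
outputs = llm.generate(["Explain AI system resilience clearly."], params)
print(outputs[0].outputs[0].text)
```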
Impact:
- Reduced KV-cache pressure
- Improved GPU memory utilization
- Stable multi-GPU execution
- Consistent latency performance
4.2 Output Governance Pipeline
A deterministic postprocessing layer (postprocess.py) is introduced to control model outputs.
This module:
- Removes internal reasoning traces (<think>...</think>)
- Extracts final answer segments
- Reformats output into structured bullet points
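The repository's postprocess.py is not reproduced here; the sketch below illustrates the same rule-based idea with an assumed clean() helper that strips <think> blocks via a regular expression and re-emits the remaining text as bullet points.

```python
# Minimal sketch of the output-governance step (illustrative, not the actual postprocess.py).
import re

THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def clean(raw: str) -> str:
    """Strip reasoning traces and reformat the remainder as bullet points."""
    # 1. Remove any <think>...</think> reasoning blocks.
    text = THINK_RE.sub("", raw).strip()
    # 2. Reformat the remaining lines into structured bullet points.
    lines = [ln.strip(" -*\t") for ln in text.splitlines() if ln.strip()]
    return "\n".join(f"- {ln}" for ln in lines)

if __name__ == "__main__":
    # Hypothetical raw output used only to demonstrate the transformation.
    raw = "<think>internal reasoning...</think>Resilience means graceful degradation.\nIt also means fast recovery."
    print(clean(raw))
```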
Impact:
- Clean, production-ready responses
- Improved readability
- Deterministic output format
5. Compression Strategies Evaluated
The following approaches were tested and rejected:
Quantization (AWQ / GPTQ)
- Compatibility issues with MoE architecture
- Output instability and degradation
Pruning
- Severe degradation in generation quality
- Early stopping and incomplete outputs
Distillation
- Not feasible due to dataset and compute constraints
Final Decision
Runtime optimization was selected because it:
- Preserves original model accuracy
- Avoids architectural incompatibility
- Provides stable and reproducible results
6. System Architecture
User Input
↓
vLLM Inference Engine
↓
Raw Model Output
↓
Postprocessing Layer
↓
Clean Structured Output
This forms an Inference Optimization + Output Governance Pipeline.
7. Performance Results
| Metric | Observation |
|---|---|
| Latency | ~0.4 s – 1.5 s |
| GPU Memory | ~8% reduction |
| Stability | Consistent across runs |
| Output Quality | Clean and structured after postprocessing |
8. How to Run
bash run.sh
9. API Example
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "sarvam-30b",
"messages": [{"role": "user", "content": "Explain AI system resilience clearly."}],
"max_tokens": 200,
"temperature": 0.2
}'
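An equivalent client call in Python, with a hook for the output-governance step, might look like the sketch below. The clean() import is hypothetical and assumes postprocess.py exposes a function like the one sketched in section 4.2.

```python
# Illustrative Python equivalent of the curl example above.
import requests

# from postprocess import clean   # hypothetical: apply the output-governance step

API_URL = "http://127.0.0.1:8000/v1/chat/completions"

payload = {
    "model": "sarvam-30b",
    "messages": [{"role": "user", "content": "Explain AI system resilience clearly."}],
    "max_tokens": 200,
    "temperature": 0.2,
}

resp = requests.post(API_URL, json=payload, timeout=120)
resp.raise_for_status()
raw = resp.json()["choices"][0]["message"]["content"]
# clean_text = clean(raw)  # strip reasoning traces and bullet the answer
print(raw)
```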
10. Files Included
- run.sh – server startup script
- vllm_config.yaml – optimized configuration
- postprocess.py – output cleaning pipeline
- examples/ – raw vs cleaned outputs
- models/ – Sarvam-30B weights
11. Practical Impact
This system is designed for real-world AI deployments where:
- Large models must operate under GPU constraints
- Outputs must be clean and user-facing
- Internal reasoning traces are not acceptable
The approach demonstrates:
- Runtime optimization instead of weight modification
- Output governance instead of prompt engineering
- System-level control instead of model-level changes
12. Key Insight
System-level optimization can outperform traditional model compression techniques by:
- Preserving model accuracy
- Improving inference efficiency
- Ensuring stable deployment
13. Conclusion
This work delivers a deployment-ready, reproducible, and efficient inference system for large-scale MoE models.
It demonstrates that combining runtime optimization with output control provides a practical and scalable alternative to conventional model compression approaches.
14. Limitations
- Does not reduce model size (weights remain unchanged)
- Requires multi-GPU setup
- Postprocessing is rule-based (not learned)
15. Future Work
- MoE-aware quantization techniques
- KV-cache compression methods
- Adaptive decoding strategies
- Edge-device compatible distillation
16. Real-World Relevance
This system is designed for deployment scenarios where:
- Large language models must operate under strict GPU constraints
- Outputs must be clean and user-facing
- Internal reasoning traces are not acceptable in production systems
The solution demonstrates a shift from model-centric optimization to system-centric optimization, which is critical for scaling AI systems in real-world environments.