Sarvam-30B Runtime-Optimized Inference System
1. Overview
This project presents a runtime-optimized deployment system for Sarvam-30B, a large-scale Mixture-of-Experts (MoE) language model, using vLLM.
The objective is to improve inference efficiency, stability, and output quality without modifying the model weights, making it suitable for real-world deployment scenarios.
This work focuses on system-level optimization rather than model-level compression, demonstrating a practical and reliable approach to handling large LLMs under constrained environments.
2. Base Model
- Model: sarvamai/sarvam-30b
- Architecture: Mixture-of-Experts (MoE)
- Task: Text Generation
- Inference Engine: vLLM
- Hardware: Multi-GPU (Tensor Parallelism)
3. Problem Statement
During experimentation, two critical challenges were identified:
3.1 Reasoning Leakage
The model generates internal reasoning traces such as <think> tokens, which:
- Reduce readability
- Break structured output requirements
- Affect downstream usability
3.2 High Resource Consumption
Due to the MoE architecture:
- High GPU memory utilization (~45GB per GPU baseline)
- Large KV-cache growth with sequence length
- Reduced inference efficiency under default settings
4. Approach
4.1 Inference-Time Optimization (Core Contribution)
Instead of modifying weights (quantization/pruning), this system applies runtime-level optimization:
- gpu-memory-utilization = 0.85
- max-model-len = 1024
- max-num-seqs = 4
- tensor-parallel-size = 4
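For illustration, the sketch below shows how the same settings map onto vLLM's Python engine arguments. This is a minimal offline example, not the deployed setup: the actual system starts an OpenAI-compatible server via run.sh and vllm_config.yaml, and the example assumes a 4-GPU node with the checkpoint available.

```python
# Illustrative sketch only: the same runtime settings expressed as vLLM engine arguments.
from vllm import LLM, SamplingParams

llm = LLM(
    model="sarvamai/sarvam-30b",
    tensor_parallel_size=4,        # shard the MoE weights across 4 GPUs
    gpu_memory_utilization=0.85,   # keep headroom below vLLM's 0.90 default
    max_model_len=1024,            # bound per-sequence KV-cache growth
    max_num_seqs=4,                # cap concurrent sequences per batch
)

params = SamplingParams(temperature=0.2, max_tokens=200)
outputs = llm.generate(["Explain AI system resilience clearly."], params)
print(outputs[0].outputs[0].text)
```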
Impact:
- Reduced KV-cache pressure
- Improved GPU memory utilization
- Stable multi-GPU execution
- Consistent latency performance
4.2 Output Governance Pipeline
A deterministic postprocessing layer (postprocess.py) is introduced to control model outputs.
This module:
- Removes internal reasoning traces (<think>...</think>)
- Extracts final answer segments
- Reformats output into structured bullet points
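The repository's postprocess.py is not reproduced here; the sketch below illustrates the same rule-based idea with an assumed clean() helper that strips <think> blocks via a regular expression and re-emits the remaining text as bullet points.

```python
# Minimal sketch of the output-governance step (illustrative, not the actual postprocess.py).
import re

THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def clean(raw: str) -> str:
    """Strip reasoning traces and reformat the remainder as bullet points."""
    # 1. Remove any <think>...</think> reasoning blocks.
    text = THINK_RE.sub("", raw).strip()
    # 2. Reformat the remaining lines into structured bullet points.
    lines = [ln.strip(" -*\t") for ln in text.splitlines() if ln.strip()]
    return "\n".join(f"- {ln}" for ln in lines)

if __name__ == "__main__":
    # Hypothetical raw output used only to demonstrate the transformation.
    raw = "<think>internal reasoning...</think>Resilience means graceful degradation.\nIt also means fast recovery."
    print(clean(raw))
```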
Impact:
- Clean, production-ready responses
- Improved readability
- Deterministic output format
5. Compression Strategies Evaluated
The following approaches were tested and rejected:
Quantization (AWQ / GPTQ)
- Compatibility issues with MoE architecture
- Output instability and degradation
Pruning
- Severe degradation in generation quality
- Early stopping and incomplete outputs
Distillation
- Not feasible due to dataset and compute constraints
Final Decision
Runtime optimization was selected because it:
- Preserves original model accuracy
- Avoids architectural incompatibility
- Provides stable and reproducible results
6. System Architecture
User Input
↓
vLLM Inference Engine
↓
Raw Model Output
↓
Postprocessing Layer
↓
Clean Structured Output
This forms an Inference Optimization + Output Governance Pipeline.
7. Performance Results
| Metric | Observation |
|---|---|
| Latency | ~0.4 s – 1.5 s |
| GPU Memory | ~8% reduction |
| Stability | Consistent across runs |
| Output Quality | Clean and structured after postprocessing |
8. How to Run
bash run.sh
9. API Example
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "sarvam-30b",
"messages": [{"role": "user", "content": "Explain AI system resilience clearly."}],
"max_tokens": 200,
"temperature": 0.2
}'
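An equivalent client call in Python, with a hook for the output-governance step, might look like the sketch below. The clean() import is hypothetical and assumes postprocess.py exposes a function like the one sketched in section 4.2.

```python
# Illustrative Python equivalent of the curl example above.
import requests

# from postprocess import clean   # hypothetical: apply the output-governance step

API_URL = "http://127.0.0.1:8000/v1/chat/completions"

payload = {
    "model": "sarvam-30b",
    "messages": [{"role": "user", "content": "Explain AI system resilience clearly."}],
    "max_tokens": 200,
    "temperature": 0.2,
}

resp = requests.post(API_URL, json=payload, timeout=120)
resp.raise_for_status()
raw = resp.json()["choices"][0]["message"]["content"]
# clean_text = clean(raw)  # strip reasoning traces and bullet the answer
print(raw)
```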
10. Files Included
- run.sh – server startup script
- vllm_config.yaml – optimized configuration
- postprocess.py – output cleaning pipeline
- examples/ – raw vs cleaned outputs
- models/ – Sarvam-30B weights
11. Practical Impact
This system is designed for real-world AI deployments where:
- Large models must operate under GPU constraints
- Outputs must be clean and user-facing
- Internal reasoning traces are not acceptable
The approach demonstrates:
- Runtime optimization instead of weight modification
- Output governance instead of prompt engineering
- System-level control instead of model-level changes
12. Key Insight
System-level optimization can outperform traditional model compression techniques by:
- Preserving model accuracy
- Improving inference efficiency
- Ensuring stable deployment
13. Conclusion
This work delivers a deployment-ready, reproducible, and efficient inference system for large-scale MoE models.
It demonstrates that combining runtime optimization with output control provides a practical and scalable alternative to conventional model compression approaches.
14. Limitations
- Does not reduce model size (weights remain unchanged)
- Requires multi-GPU setup
- Postprocessing is rule-based (not learned)
15. Future Work
- MoE-aware quantization techniques
- KV-cache compression methods
- Adaptive decoding strategies
- Edge-device compatible distillation
16. Real-World Relevance
This system is designed for deployment scenarios where:
- Large language models must operate under strict GPU constraints
- Outputs must be clean and user-facing
- Internal reasoning traces are not acceptable in production systems
The solution demonstrates a shift from model-centric optimization to system-centric optimization, which is critical for scaling AI systems in real-world environments.