
Sarvam-30B Runtime-Optimized Inference System

1. Overview

This project presents a runtime-optimized deployment system for Sarvam-30B, a large-scale Mixture-of-Experts (MoE) language model, using vLLM.

The objective is to improve inference efficiency, stability, and output quality without modifying the model weights, making it suitable for real-world deployment scenarios.

This work focuses on system-level optimization rather than model-level compression, demonstrating a practical and reliable approach to handling large LLMs under constrained environments.


2. Base Model

  • Model: sarvamai/sarvam-30b
  • Architecture: Mixture-of-Experts (MoE)
  • Task: Text Generation
  • Inference Engine: vLLM
  • Hardware: Multi-GPU (Tensor Parallelism)

3. Problem Statement

During experimentation, two critical challenges were identified:

3.1 Reasoning Leakage

The model generates internal reasoning traces such as <think> tokens, which:

  • Reduce readability
  • Break structured output requirements
  • Affect downstream usability

3.2 High Resource Consumption

Due to the MoE architecture:

  • High GPU memory utilization (~45GB per GPU baseline)
  • Large KV-cache growth with sequence length
  • Reduced inference efficiency under default settings
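
To make the KV-cache pressure concrete, a rough sizing formula is sketched below. The architecture dimensions are illustrative placeholders, not Sarvam-30B's actual configuration; the point is only that the cache scales linearly with sequence length and batch size, which motivates the limits chosen in section 4.1.

# Back-of-the-envelope KV-cache sizing. The dimensions below are illustrative
# placeholders, NOT Sarvam-30B's actual configuration.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, num_seqs, bytes_per_elem=2):
    # Factor of 2 covers keys and values; 2 bytes per element assumes fp16/bf16.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * num_seqs * bytes_per_elem

# Hypothetical example: 48 layers, 8 KV heads of dim 128, seq_len=1024, 4 sequences.
size = kv_cache_bytes(48, 8, 128, seq_len=1024, num_seqs=4)
print(f"~{size / 2**20:.0f} MiB of KV cache")  # grows linearly with seq_len and num_seqs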

4. Approach

4.1 Inference-Time Optimization (Core Contribution)

Instead of modifying weights (quantization/pruning), this system applies runtime-level optimization:

  • gpu-memory-utilization = 0.85
  • max-model-len = 1024
  • max-num-seqs = 4
  • tensor-parallel-size = 4

Impact:

  • Reduced KV-cache pressure
  • Improved GPU memory utilization
  • Stable multi-GPU execution
  • Consistent latency performance
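
For concreteness, the same knobs can be expressed through vLLM's offline Python API. This is a minimal sketch for illustration only: the actual deployment launches the OpenAI-compatible server via run.sh, and the model path below is an assumption based on the models/ directory listed in section 10.

# Runtime settings from section 4.1, expressed via vLLM's Python API.
# Illustrative sketch only; the deployed system uses the server started by run.sh.
from vllm import LLM, SamplingParams

llm = LLM(
    model="models/sarvam-30b",       # assumed local path to the weights
    tensor_parallel_size=4,          # shard the MoE model across 4 GPUs
    gpu_memory_utilization=0.85,     # reserve headroom below the 0.90 default
    max_model_len=1024,              # bounds per-sequence KV-cache growth
    max_num_seqs=4,                  # bounds concurrent sequences per step
)

params = SamplingParams(temperature=0.2, max_tokens=200)
print(llm.generate(["Explain AI system resilience clearly."], params)[0].outputs[0].text)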

4.2 Output Governance Pipeline

A deterministic postprocessing layer (postprocess.py) is introduced to control model outputs.

This module:

  • Removes internal reasoning traces (<think>...</think>)
  • Extracts final answer segments
  • Reformats output into structured bullet points

Impact:

  • Clean, production-ready responses
  • Improved readability
  • Deterministic output format
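
postprocess.py itself is not reproduced here; the snippet below is a minimal sketch of this kind of rule-based cleaning. The function name and tag handling are illustrative and may differ from the actual implementation.

import re

# Matches complete <think>...</think> reasoning blocks, including newlines.
THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def clean_output(raw: str) -> str:
    """Rule-based cleanup: drop reasoning traces, then bullet the final answer."""
    text = THINK_RE.sub("", raw)                             # remove closed reasoning blocks
    text = re.sub(r"<think>.*", "", text, flags=re.DOTALL)   # drop any unterminated block
    lines = [ln.strip(" \t-•") for ln in text.splitlines() if ln.strip()]
    return "\n".join(f"• {ln}" for ln in lines)              # reformat as structured bullets

if __name__ == "__main__":
    raw = "<think>internal reasoning trace</think>Resilience means graceful degradation.\nIt also means fast recovery."
    print(clean_output(raw))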

5. Compression Strategies Evaluated

The following approaches were tested and rejected:

Quantization (AWQ / GPTQ)

  • Compatibility issues with MoE architecture
  • Output instability and degradation

Pruning

  • Severe degradation in generation quality
  • Early stopping and incomplete outputs

Distillation

  • Not feasible due to dataset and compute constraints

Final Decision

Runtime optimization was selected because:

  • Preserves original model accuracy
  • Avoids architectural incompatibility
  • Provides stable and reproducible results

6. System Architecture

User Input
→ vLLM Inference Engine
→ Raw Model Output
→ Postprocessing Layer
→ Clean Structured Output

This forms an Inference Optimization + Output Governance Pipeline.


7. Performance Results

Metric         | Observation
Latency        | ~0.4s – 1.5s
GPU Memory     | ~8% reduction vs. default settings
Stability      | Consistent across runs
Output Quality | Clean and structured after postprocessing

8. How to Run

bash run.sh

9. API Example

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sarvam-30b",
    "messages": [{"role": "user", "content": "Explain AI system resilience clearly."}],
    "max_tokens": 200,
    "temperature": 0.2
  }'
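
The same request can be issued from Python and passed through the output-governance step, mirroring the pipeline in section 6. This is a minimal sketch using the requests library; the clean_output import name is hypothetical and refers to the kind of helper sketched in section 4.2, not necessarily the real postprocess.py interface.

import requests

from postprocess import clean_output  # hypothetical function name; see the sketch in 4.2

# 1. Inference: same request as the curl example above.
resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "sarvam-30b",
        "messages": [{"role": "user", "content": "Explain AI system resilience clearly."}],
        "max_tokens": 200,
        "temperature": 0.2,
    },
    timeout=120,
)
raw = resp.json()["choices"][0]["message"]["content"]

# 2. Output governance: strip reasoning traces and reformat into bullets.
print(clean_output(raw))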

10. Files Included

  • run.sh → server startup script
  • vllm_config.yaml → optimized configuration
  • postprocess.py → output cleaning pipeline
  • examples/ → raw vs cleaned outputs
  • models/ → Sarvam-30B weights

11. Practical Impact

This system is designed for real-world AI deployments where:

  • Large models must operate under GPU constraints
  • Outputs must be clean and user-facing
  • Internal reasoning traces are not acceptable

The approach demonstrates:

  • Runtime optimization instead of weight modification
  • Output governance instead of prompt engineering
  • System-level control instead of model-level changes

12. Key Insight

System-level optimization can outperform traditional model compression techniques by:

  • Preserving model accuracy and output quality
  • Improving inference efficiency
  • Ensuring stable deployment

13. Conclusion

This work delivers a deployment-ready, reproducible, and efficient inference system for large-scale MoE models.

It demonstrates that combining runtime optimization with output control provides a practical and scalable alternative to conventional model compression approaches.


14. Limitations

  • Does not reduce model size (weights remain unchanged)
  • Requires multi-GPU setup
  • Postprocessing is rule-based (not learned)

15. Future Work

  • MoE-aware quantization techniques
  • KV-cache compression methods
  • Adaptive decoding strategies
  • Edge-device compatible distillation

16. Real-World Relevance

Beyond the deployment scenarios already outlined in section 11, the solution demonstrates a shift from model-centric optimization to system-centric optimization, which is critical for scaling AI systems in real-world environments.

