# Shunya-0.5B-Base
Shunya-0.5B-Base is a 503M-parameter decoder-only language model trained from scratch on a curated mix of web text and synthetic educational data. It is a base (pretrained) model — not instruction-tuned — intended as a foundation for fine-tuning and research.
The model uses a Llama-style architecture with Grouped Query Attention (GQA), SiLU activations, and supports a 32,768-token context window via linear RoPE scaling.
## Model Details
| Field | Value |
|---|---|
| Architecture | LlamaForCausalLM |
| Parameters | ~503M |
| Hidden size | 1,280 |
| Intermediate size | 4,864 |
| Layers | 20 |
| Attention heads | 10 |
| KV heads (GQA) | 2 |
| Head dim | 128 |
| Vocab size | 40,000 |
| Context window | 32,768 tokens |
| Positional encoding | RoPE (θ=10,000, 4× linear scaling) |
| Normalization | RMSNorm (ε=1e-6) |
| Activation | SiLU |
| Dtype | bfloat16 |
| Tied embeddings | Yes |
| Attention bias | Yes |
| Attention dropout | 0.1 |
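For reference, the table maps onto a Hugging Face `LlamaConfig` roughly as follows. This is a sketch rather than the shipped config file: the values are read off the table, and the comment about the native context length is an inference from the 4× linear scaling factor.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Sketch of the architecture implied by the table above (not the official config).
config = LlamaConfig(
    vocab_size=40_000,
    hidden_size=1_280,
    intermediate_size=4_864,
    num_hidden_layers=20,
    num_attention_heads=10,
    num_key_value_heads=2,  # GQA: 5 query heads share each KV head
    head_dim=128,
    max_position_embeddings=32_768,
    rope_theta=10_000.0,
    # Assumption: the 4x linear factor stretches an 8,192-token native window to 32,768.
    rope_scaling={"type": "linear", "factor": 4.0},
    rms_norm_eps=1e-6,
    hidden_act="silu",
    tie_word_embeddings=True,
    attention_bias=True,
    attention_dropout=0.1,
)
model = LlamaForCausalLM(config)
print(f"{model.num_parameters() / 1e6:.0f}M parameters")  # ~503M
```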
## Training
- Training tokens: ~60B tokens drawn from Wikipedia, FineWeb, SlimPajama, Project Gutenberg, and arXiv papers, plus synthetic data from vivekmarakana/synthetic-textbooks for factual accuracy (see the loading sketch below).
- License: MIT
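The synthetic component is public on the Hub and can be inspected directly. A minimal sketch using `datasets`; the split name is an assumption, since it is not documented here:

```python
from datasets import load_dataset

# Stream a sample from the synthetic-textbooks dataset without a full download.
# The "train" split name is an assumption; check the dataset page if it differs.
ds = load_dataset("vivekmarakana/synthetic-textbooks", split="train", streaming=True)
print(next(iter(ds)))
```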
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("vivekmarakana/shunya-0.5b-base")
model = AutoModelForCausalLM.from_pretrained(
    "vivekmarakana/shunya-0.5b-base",
    torch_dtype=torch.bfloat16,  # weights are stored in bfloat16
    device_map="auto",
)

# Base models complete text: provide a prefix and let the model continue it.
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Note: This is a base model. For conversational or instruction-following tasks, use an instruction-tuned variant.
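Because the model is not instruction-tuned, completion-style few-shot prompting usually works better than bare questions. A minimal sketch reusing `tokenizer` and `model` from above; the Q/A prompt format is illustrative, not a required template:

```python
# A few-shot prompt: the base model continues the established pattern.
prompt = (
    "Q: What is the capital of France?\nA: Paris\n"
    "Q: What is the capital of Japan?\nA: Tokyo\n"
    "Q: What is the capital of Italy?\nA:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=5, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```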
## Benchmarks
All evaluations were run with lm-evaluation-harness (a reproduction sketch follows the table below). Results report normalized accuracy (acc_norm) for completion tasks (ARC, PIQA, HellaSwag), exact match for TriviaQA and Natural Questions Open, and accuracy (acc) for the remaining tasks (MMLU, WinoGrande, BoolQ, Social IQA, GPQA, AGIEval).
| Benchmark | Shots | Metric | Shunya-0.5B-Base | Qwen3-0.6B-Base | Gemma3-1B-PT |
|---|---|---|---|---|---|
| ARC-Challenge | 25 | acc_norm | 27.65 | 45.22 | 39.33 |
| ARC-Easy | 0 | acc_norm | 43.14 | 58.08 | 72.18 |
| HellaSwag | 10 | acc_norm | 35.81 | 53.56 | 62.78 |
| MMLU | 5 | acc | 26.46 | 52.51 | 26.41 |
| WinoGrande | 0 | acc | 53.67 | 58.88 | 58.09 |
| BoolQ | 0 | acc | 61.62 | 69.51 | 66.67 |
| PIQA | 0 | acc_norm | 64.09 | 69.86 | 74.81 |
| Social IQA | 0 | acc | 38.74 | 43.14 | 43.09 |
| GPQA Main | 5 | acc | 21.65 | 27.01 | 25.22 |
| GPQA Diamond | 5 | acc | 26.26 | 32.32 | 22.22 |
| AGIEval EN | 5 | acc | 18.72 | 25.91 | 19.31 |
| TriviaQA | 5 | exact_match | 5.9 | – | – |
| Natural Questions Open | 5 | exact_match | 2.0 | – | – |
Qwen3-0.6B-Base and Gemma3-1B-PT are included as reference points — both have more parameters and/or larger training budgets.
Notable results:
- On GPQA Diamond (graduate-level science), Shunya-0.5B-Base (26.26%) outperforms Gemma3-1B-PT (22.22%) despite being roughly half the size.
- On MMLU, Shunya-0.5B-Base (26.46%) matches Gemma3-1B-PT (26.41%) at half the parameter count.
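The exact harness invocation is not given in the card. As a sketch, one row of the table could be reproduced through the harness's Python API; the task name, shot count, and result key follow lm-evaluation-harness v0.4 conventions and should be treated as assumptions:

```python
from lm_eval import simple_evaluate

# Hypothetical reproduction of the 25-shot ARC-Challenge row of the table.
results = simple_evaluate(
    model="hf",
    model_args="pretrained=vivekmarakana/shunya-0.5b-base,dtype=bfloat16",
    tasks=["arc_challenge"],
    num_fewshot=25,
)
print(results["results"]["arc_challenge"]["acc_norm,none"])
```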