# Shunya-0.5B-Base
Shunya-0.5B-Base is a 503M-parameter decoder-only language model trained from scratch on a curated mix of web text and synthetic educational data. It is a base (pretrained) model — not instruction-tuned — intended as a foundation for fine-tuning and research.
The model uses a Llama-style architecture with Grouped Query Attention (GQA), SiLU activations, and supports a 32,768-token context window via linear RoPE scaling.
## Model Details
| Field | Value |
|---|---|
| Architecture | LlamaForCausalLM |
| Parameters | ~503M |
| Hidden size | 1,280 |
| Intermediate size | 4,864 |
| Layers | 20 |
| Attention heads | 10 |
| KV heads (GQA) | 2 |
| Head dim | 128 |
| Vocab size | 40,000 |
| Context window | 32,768 tokens |
| Positional encoding | RoPE (θ=10,000, 4× linear scaling) |
| Normalization | RMSNorm (ε=1e-6) |
| Activation | SiLU |
| Dtype | bfloat16 |
| Tied embeddings | Yes |
| Attention bias | Yes |
| Attention dropout | 0.1 |
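For reference, the table maps onto a Hugging Face `LlamaConfig` roughly as follows. This is a sketch rather than the shipped config file: the values are read off the table, and the comment about the native context length is an inference from the 4× linear scaling factor.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Sketch of the architecture implied by the table above (not the official config).
config = LlamaConfig(
    vocab_size=40_000,
    hidden_size=1_280,
    intermediate_size=4_864,
    num_hidden_layers=20,
    num_attention_heads=10,
    num_key_value_heads=2,  # GQA: 5 query heads share each KV head
    head_dim=128,
    max_position_embeddings=32_768,
    rope_theta=10_000.0,
    # Assumption: the 4x linear factor stretches an 8,192-token native window to 32,768.
    rope_scaling={"type": "linear", "factor": 4.0},
    rms_norm_eps=1e-6,
    hidden_act="silu",
    tie_word_embeddings=True,
    attention_bias=True,
    attention_dropout=0.1,
)
model = LlamaForCausalLM(config)
print(f"{model.num_parameters() / 1e6:.0f}M parameters")  # ~503M
```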
## Training
- Training tokens: ~60B tokens drawn from Wikipedia, FineWeb, SlimPajama, Project Gutenberg, and arXiv papers, plus synthetic data from vivekmarakana/synthetic-textbooks for factual accuracy (see the loading sketch below).
- License: MIT
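The synthetic component is public on the Hub and can be inspected directly. A minimal sketch using `datasets`; the split name is an assumption, since it is not documented here:

```python
from datasets import load_dataset

# Stream a sample from the synthetic-textbooks dataset without a full download.
# The "train" split name is an assumption; check the dataset page if it differs.
ds = load_dataset("vivekmarakana/synthetic-textbooks", split="train", streaming=True)
print(next(iter(ds)))
```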
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("vivekmarakana/shunya-0.5b-base")
model = AutoModelForCausalLM.from_pretrained(
    "vivekmarakana/shunya-0.5b-base",
    torch_dtype=torch.bfloat16,  # weights are stored in bfloat16
    device_map="auto",
)

# Base models complete text: provide a prefix and let the model continue it.
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Note: This is a base model. For conversational or instruction-following tasks, use an instruction-tuned variant.
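Because the model is not instruction-tuned, completion-style few-shot prompting usually works better than bare questions. A minimal sketch reusing `tokenizer` and `model` from above; the Q/A prompt format is illustrative, not a required template:

```python
# A few-shot prompt: the base model continues the established pattern.
prompt = (
    "Q: What is the capital of France?\nA: Paris\n"
    "Q: What is the capital of Japan?\nA: Tokyo\n"
    "Q: What is the capital of Italy?\nA:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=5, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```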
## Benchmarks
All evaluations were run with lm-evaluation-harness (a reproduction sketch follows the table below). Results report normalized accuracy (acc_norm) for completion tasks (ARC, PIQA, HellaSwag), exact match for TriviaQA and Natural Questions Open, and accuracy (acc) for the remaining tasks (MMLU, WinoGrande, BoolQ, Social IQA, GPQA, AGIEval).
| Benchmark | Shots | Metric | Shunya-0.5B-Base | Qwen3-0.6B-Base | Gemma3-1B-PT |
|---|---|---|---|---|---|
| ARC-Challenge | 25 | acc_norm | 27.65 | 45.22 | 39.33 |
| ARC-Easy | 0 | acc_norm | 43.14 | 58.08 | 72.18 |
| HellaSwag | 10 | acc_norm | 35.81 | 53.56 | 62.78 |
| MMLU | 5 | acc | 26.46 | 52.51 | 26.41 |
| WinoGrande | 0 | acc | 53.67 | 58.88 | 58.09 |
| BoolQ | 0 | acc | 61.62 | 69.51 | 66.67 |
| PIQA | 0 | acc_norm | 64.09 | 69.86 | 74.81 |
| Social IQA | 0 | acc | 38.74 | 43.14 | 43.09 |
| GPQA Main | 5 | acc | 21.65 | 27.01 | 25.22 |
| GPQA Diamond | 5 | acc | 26.26 | 32.32 | 22.22 |
| AGIEval EN | 5 | acc | 18.72 | 25.91 | 19.31 |
| TriviaQA | 5 | exact_match | 5.9 | – | – |
| Natural Questions Open | 5 | exact_match | 2.0 | – | – |
Qwen3-0.6B-Base and Gemma3-1B-PT are included as reference points — both have more parameters and/or larger training budgets.
Notable results:
- On GPQA Diamond (graduate-level science), Shunya-0.5B-Base (26.26%) outperforms Gemma3-1B-PT (22.22%) despite being roughly half the size.
- On MMLU, Shunya-0.5B-Base (26.46%) matches Gemma3-1B-PT (26.41%) at half the parameter count.
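The exact harness invocation is not given in the card. As a sketch, one row of the table could be reproduced through the harness's Python API; the task name, shot count, and result key follow lm-evaluation-harness v0.4 conventions and should be treated as assumptions:

```python
from lm_eval import simple_evaluate

# Hypothetical reproduction of the 25-shot ARC-Challenge row of the table.
results = simple_evaluate(
    model="hf",
    model_args="pretrained=vivekmarakana/shunya-0.5b-base,dtype=bfloat16",
    tasks=["arc_challenge"],
    num_fewshot=25,
)
print(results["results"]["arc_challenge"]["acc_norm,none"])
```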