# Shunya-0.5B-Base

Shunya-0.5B-Base is a 503M-parameter decoder-only language model trained from scratch on a curated mix of web text and synthetic educational data. It is a base (pretrained) model — not instruction-tuned — intended as a foundation for fine-tuning and research.

The model uses a Llama-style architecture with Grouped Query Attention (GQA), SiLU activations, and supports a 32,768-token context window via linear RoPE scaling.
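With 4× linear scaling, position indices are divided by 4 before the rotary angles are computed, so a 32,768-token sequence occupies the same angular range as 8,192 unscaled positions. A minimal sketch of the idea (illustrative only, not the model's actual implementation):

```python
def rope_angles(position, head_dim=128, theta=10_000.0, scaling=4.0):
    """Rotary-embedding angles for one position under linear RoPE scaling.

    Linear scaling divides the position index by `scaling`, squeezing
    positions 0..32767 into the 0..8191 range the base frequencies cover.
    """
    pos = position / scaling
    # Standard RoPE inverse frequencies: theta^(-2i/d) for i in 0..d/2-1
    inv_freq = [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]
    return [pos * f for f in inv_freq]

# The scaled angle at position 4p equals the unscaled angle at position p:
assert rope_angles(400, scaling=4.0) == rope_angles(100, scaling=1.0)
```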

## Model Details

| Field | Value |
|---|---|
| Architecture | `LlamaForCausalLM` |
| Parameters | ~503M |
| Hidden size | 1,280 |
| Intermediate size | 4,864 |
| Layers | 20 |
| Attention heads | 10 |
| KV heads (GQA) | 2 |
| Head dim | 128 |
| Vocab size | 40,000 |
| Context window | 32,768 tokens |
| Positional encoding | RoPE (θ=10,000, 4× linear scaling) |
| Normalization | RMSNorm (ε=1e-6) |
| Activation | SiLU |
| Dtype | bfloat16 |
| Tied embeddings | Yes |
| Attention bias | Yes |
| Attention dropout | 0.1 |
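The ~503M figure follows directly from the table. A back-of-the-envelope check, assuming a standard Llama layout with biases on the q/k/v/o projections (as the attention-bias setting implies) and no MLP biases:

```python
# Rough parameter count from the config values above (a sketch, not the model's code).
vocab, hidden, inter, layers = 40_000, 1_280, 4_864, 20
heads, kv_heads, head_dim = 10, 2, 128

q = hidden * heads * head_dim + heads * head_dim                # q_proj weight + bias
kv = 2 * (hidden * kv_heads * head_dim + kv_heads * head_dim)   # k_proj and v_proj
o = heads * head_dim * hidden + hidden                          # o_proj weight + bias
mlp = 3 * hidden * inter                                        # gate, up, down (no bias)
norms = 2 * hidden                                              # two RMSNorms per layer
per_layer = q + kv + o + mlp + norms

embeddings = vocab * hidden   # tied: input and output embeddings share weights
final_norm = hidden
total = layers * per_layer + embeddings + final_norm
print(f"{total:,}")  # 503,512,320 ≈ 503M
```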

## Training

- **Training tokens:** ~60B, drawn from Wikipedia, FineWeb, SlimPajama, Project Gutenberg, and arXiv papers, plus synthetic data from `vivekmarakana/synthetic-textbooks` for factual accuracy.
- **License:** MIT

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("vivekmarakana/shunya-0.5b-base")
model = AutoModelForCausalLM.from_pretrained(
    "vivekmarakana/shunya-0.5b-base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Note:** This is a base model. For conversational or instruction-following tasks, use an instruction-tuned variant.
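Because base models continue text rather than follow instructions, few-shot prompting usually works better than a bare question. A minimal sketch of assembling such a prompt (the Q/A format here is an illustration, not a format the model was specifically trained on):

```python
def build_few_shot_prompt(examples, question):
    """Assemble a few-shot completion prompt for a base model.

    examples: list of (question, answer) pairs shown before the real question.
    The trailing 'A:' cues the model to continue with an answer.
    """
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    [("What is the capital of France?", "Paris"),
     ("What is the capital of Japan?", "Tokyo")],
    "What is the capital of Italy?",
)
print(prompt)
```

The resulting string can be passed to `tokenizer(...)` exactly as in the snippet above; for factual completions, greedy decoding or a low temperature tends to work better than `temperature=0.8`.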

## Benchmarks

All evaluations were run with lm-evaluation-harness. Results report normalized accuracy (acc_norm) for the completion tasks (ARC, HellaSwag, PIQA) and plain accuracy (acc) for the remaining tasks (MMLU, WinoGrande, BoolQ, Social IQA, GPQA, AGIEval).
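The distinction matters for multiple-choice scoring: acc picks the candidate with the highest raw log-likelihood, while acc_norm first divides each log-likelihood by the candidate's byte length, removing the bias toward shorter answers. A toy sketch of the difference (illustrative, not the harness's actual code):

```python
def pick(loglikelihoods, completions, normalize):
    """Choose a completion by raw or byte-length-normalized log-likelihood."""
    def score(ll, text):
        return ll / len(text.encode("utf-8")) if normalize else ll
    scores = [score(ll, c) for ll, c in zip(loglikelihoods, completions)]
    return scores.index(max(scores))

# A long correct answer can lose on raw log-likelihood but win after normalization:
lls = [-4.0, -9.0]
options = ["no", "yes, because the sky scatters blue light"]
assert pick(lls, options, normalize=False) == 0   # acc: the short answer wins
assert pick(lls, options, normalize=True) == 1    # acc_norm: the long answer wins
```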

| Benchmark | Shots | Metric | Shunya-0.5B-Base | Qwen3-0.6B-Base | Gemma3-1B-PT |
|---|---|---|---|---|---|
| ARC-Challenge | 25 | acc_norm | 27.65 | 45.22 | 39.33 |
| ARC-Easy | 0 | acc_norm | 43.14 | 58.08 | 72.18 |
| HellaSwag | 10 | acc_norm | 35.81 | 53.56 | 62.78 |
| MMLU | 5 | acc | 26.46 | 52.51 | 26.41 |
| WinoGrande | 0 | acc | 53.67 | 58.88 | 58.09 |
| BoolQ | 0 | acc | 61.62 | 69.51 | 66.67 |
| PIQA | 0 | acc_norm | 64.09 | 69.86 | 74.81 |
| Social IQA | 0 | acc | 38.74 | 43.14 | 43.09 |
| GPQA Main | 5 | acc | 21.65 | 27.01 | 25.22 |
| GPQA Diamond | 5 | acc | 26.26 | 32.32 | 22.22 |
| AGIEval EN | 5 | acc | 18.72 | 25.91 | 19.31 |

Qwen3-0.6B-Base and Gemma3-1B-PT are included as reference points; both have more parameters and/or larger training budgets.

**Notable results:**

  • On GPQA Diamond (graduate-level science), Shunya-0.5B-Base (26.26%) outperforms Gemma3-1B-PT (22.22%) despite being roughly half the size.
  • On MMLU, Shunya-0.5B-Base (26.46%) matches Gemma3-1B-PT (26.41%) at half the parameter count.

## License

MIT
