nanochat-10k
Model Description
nanochat-10k is an early-stage, undertrained language model built using the nanochat framework created by Andrej Karpathy.
It is intended for educational purposes, showcasing how transformer-based language models are trained and how their behavior evolves during early training.
⚠️ This model is not optimized for performance or real-world use.
Attribution
This model was trained using the nanochat framework developed by Andrej Karpathy.
Credit goes to him for the underlying training methodology, architecture design, and tooling that made this experiment possible.
Model Details
- Architecture: Decoder-only Transformer (GPT-style)
- Framework: nanochat (by Andrej Karpathy)
- Objective: Causal language modeling
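As a minimal illustration of the causal language modeling objective (not nanochat's actual training loop), each position is trained to predict the next token with cross-entropy:

```python
import torch.nn.functional as F

def causal_lm_loss(logits, tokens):
    # logits: (batch, seq_len, vocab_size); tokens: (batch, seq_len)
    # Each position predicts the *next* token, so targets are the inputs shifted left by one.
    shift_logits = logits[:, :-1, :]
    shift_targets = tokens[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
    )
```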
Configuration
| Parameter | Value |
|---|---|
| Layers | 20 |
| Attention Heads | 10 |
| KV Heads | 10 |
| Embedding Size | 1280 |
| Head Dimension | 128 |
| Context Length | 2048 tokens |
| Vocabulary Size | 32,768 |
| Window Pattern | L |
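For orientation, the table above maps onto a GPT-style configuration roughly like the following (field names are illustrative, not nanochat's exact API):

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Values from the configuration table above; field names are illustrative.
    n_layer: int = 20
    n_head: int = 10
    n_kv_head: int = 10      # equal to n_head, i.e. no grouped-query sharing
    n_embd: int = 1280
    head_dim: int = 128      # n_embd // n_head = 1280 // 10
    context_length: int = 2048
    vocab_size: int = 32_768
```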
Training Details
- Total Steps: 10,000
- Resume Step: 8,000
- Device Batch Size: 1
- Total Batch Size: 65,536 tokens
- Effective Batching: Gradient accumulation (see the arithmetic sketch below)
- Hardware: 1× NVIDIA H100 GPU
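With a device batch size of 1 sequence at a 2,048-token context, reaching the 65,536-token batch implies about 32 gradient-accumulation micro-steps per optimizer update (assuming full-length sequences), and the full run covers roughly 0.66B tokens:

```python
context_length = 2048        # tokens per sequence
device_batch_size = 1        # sequences per forward/backward pass
total_batch_tokens = 65_536  # tokens per optimizer step

# Micro-steps accumulated per optimizer update on a single GPU
grad_accum_steps = total_batch_tokens // (device_batch_size * context_length)  # 32

# Tokens processed over the whole run
total_steps = 10_000
tokens_seen = total_steps * total_batch_tokens  # 655,360,000 (~0.66B tokens)
```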
Optimization
| Parameter | Value |
|---|---|
| Optimizer | Likely AdamW (not explicitly recorded) |
| Embedding LR | 0.3 |
| Unembedding LR | 0.008 |
| Matrix LR | 0.02 |
| Scalar LR | 0.5 |
| Weight Decay | 0.28 |
| Warmup Steps | 100 |
| Warmdown Ratio | 0.85 |
| Final LR Fraction | 0.05 |
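These values describe a warmup/warmdown schedule: the learning rate ramps up over the first 100 steps and decays to 5% of its base value by the end of training. A minimal sketch of such a multiplier, assuming the warmdown ratio is the fraction of total steps spent decaying (nanochat's exact implementation may differ):

```python
def lr_multiplier(step, total_steps=10_000, warmup_steps=100,
                  warmdown_ratio=0.85, final_lr_frac=0.05):
    # Linear warmup from 0 to 1 over the first `warmup_steps` steps.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    # Hold at the base rate, then decay linearly over the last
    # `warmdown_ratio` fraction of training down to `final_lr_frac`.
    warmdown_start = total_steps * (1 - warmdown_ratio)
    if step < warmdown_start:
        return 1.0
    progress = (step - warmdown_start) / (total_steps - warmdown_start)
    return 1.0 - (1.0 - final_lr_frac) * progress
```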
Training Strategy
- Target Param/Data Ratio: 12 (see the estimate after this list)
- FP8 Training: Disabled
- Evaluation Interval: Every 4,000 steps
- Checkpoint Save Interval: Every 2,000 steps
- Sampling Interval: Every 2,000 steps
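As a back-of-the-envelope check of the ratio-12 target, here is a rough parameter estimate for the configuration above, assuming a standard GPT block with 4x MLP expansion and untied embeddings (the actual nanochat layout may differ slightly):

```python
n_layer, n_embd, vocab_size = 20, 1280, 32_768

block_params = 12 * n_embd ** 2            # ~4*d^2 attention + ~8*d^2 MLP per layer
embed_params = 2 * vocab_size * n_embd     # embedding + unembedding (assumed untied)
total_params = n_layer * block_params + embed_params   # ~477M parameters

target_tokens = 12 * total_params          # tokens implied by a param/data ratio of 12
tokens_seen = 10_000 * 65_536              # ~0.66B tokens actually processed in this run
```

Under these assumptions the run covers only a small fraction of the ratio-12 token budget, consistent with the model being described as undertrained.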
Dataset
- Source: nanochat default dataset
- Format: Parquet
- Scale: ~6,400 shards
Preprocessing
- Tokenized using nanochat pipeline
- Standard formatting and filtering
Dataloader State
- Shard Index (pq_idx): 67
- Row Group Index (rg_idx): 54
- Epoch: 1
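The recorded state indicates the run stopped partway through the first epoch, at shard 67 and row group 54. As an illustrative sketch only (not nanochat's actual dataloader), resuming iteration over Parquet shards from such a state could look like this with pyarrow; the `text` column name is an assumption:

```python
import pyarrow.parquet as pq

def iter_documents(shard_paths, pq_idx=67, rg_idx=54):
    # Resume from the recorded shard index, then from the recorded row group
    # within that shard; later shards start from their first row group.
    for shard_i in range(pq_idx, len(shard_paths)):
        shard = pq.ParquetFile(shard_paths[shard_i])
        start_rg = rg_idx if shard_i == pq_idx else 0
        for rg in range(start_rg, shard.num_row_groups):
            table = shard.read_row_group(rg)
            for text in table.column("text").to_pylist():  # column name assumed
                yield text
```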
Dataset Limitations
- Dataset composition not documented
- Unknown domain distribution
- Potential inherited biases
Evaluation
No formal benchmarks have been conducted.
- Method: Qualitative/manual inspection only
Metrics
| Metric | Value |
|---|---|
| Validation BPB | 0.8325 |
| Best Validation BPB | 0.7318 |
| Train Loss (smoothed) | 2.6494 |
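For reference, bits per byte is the mean cross-entropy converted from nats per token to bits and normalized by the tokenizer's average bytes per token; a small helper illustrating the relation (the bytes-per-token figure depends on the nanochat tokenizer and the evaluation data):

```python
import math

def bits_per_byte(loss_nats_per_token, avg_bytes_per_token):
    # nats/token -> bits/token -> bits/byte
    return loss_nats_per_token / math.log(2) / avg_bytes_per_token
```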
Example
Prompt:
Explain recursion simply:
Output:
Recursion is when something calls itself but it can be confusing and sometimes it keeps going because it is repeating the same idea again and again.
Training Runtime
- Total Training Time: 27,993 seconds (~7.8 hours)
Intended Uses
- Educational demonstrations
- Learning transformer training pipelines
- Observing early-stage model behavior
Out-of-Scope Uses
This model should not be used for:
- Medical advice
- Legal decisions
- Financial guidance
- Any high-stakes or production applications
Limitations
- Undertrained and unstable outputs
- Repetition and incoherence
- Weak long-context handling
- Prompt sensitivity
- Unreliable factual accuracy
Risks and Biases
- May reflect biases from training data
- High likelihood of hallucinations
- Inconsistent outputs due to early training stage
Versioning
nanochat-10k
- Initial experimental release (10k training steps)
Planned Improvements
- Record full training configuration
- Tune optimizer and learning rate
- Improve dataset transparency
- Add quantitative benchmarks
Usage
```python
from transformers import pipeline

generator = pipeline("text-generation", model="your-username/nanochat-10k")
# max_length counts prompt tokens as well as newly generated ones.
print(generator("Explain recursion simply:", max_length=50))
```
Notes
This model is part of an educational exploration of transformer architectures using a minimal implementation (nanochat) by Andrej Karpathy.
It is shared to document training progress and provide insight into early-stage model behavior.