nanochat-10k
Model Description
nanochat-10k is an early-stage, undertrained language model built using the nanochat framework created by Andrej Karpathy.
It is intended for educational purposes, showcasing how transformer-based language models are trained and how their behavior evolves during early training.
⚠️ This model is not optimized for performance or real-world use.
Attribution
This model was trained using the nanochat framework developed by Andrej Karpathy.
Credit goes to him for the underlying training methodology, architecture design, and tooling that made this experiment possible.
Model Details
- Architecture: Decoder-only Transformer (GPT-style)
- Framework: nanochat (by Andrej Karpathy)
- Objective: Causal language modeling
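As a minimal illustration of the causal language modeling objective (not nanochat's actual training loop), each position is trained to predict the next token with cross-entropy:

```python
import torch.nn.functional as F

def causal_lm_loss(logits, tokens):
    # logits: (batch, seq_len, vocab_size); tokens: (batch, seq_len)
    # Each position predicts the *next* token, so targets are the inputs shifted left by one.
    shift_logits = logits[:, :-1, :]
    shift_targets = tokens[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
    )
```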
Configuration
| Parameter | Value |
|---|---|
| Layers | 20 |
| Attention Heads | 10 |
| KV Heads | 10 |
| Embedding Size | 1280 |
| Head Dimension | 128 |
| Context Length | 2048 tokens |
| Vocabulary Size | 32,768 |
| Window Pattern | L |
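For orientation, the table above maps onto a GPT-style configuration roughly like the following (field names are illustrative, not nanochat's exact API):

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Values from the configuration table above; field names are illustrative.
    n_layer: int = 20
    n_head: int = 10
    n_kv_head: int = 10      # equal to n_head, i.e. no grouped-query sharing
    n_embd: int = 1280
    head_dim: int = 128      # n_embd // n_head = 1280 // 10
    context_length: int = 2048
    vocab_size: int = 32_768
```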
Training Details
- Total Steps: 10,000
- Resume Step: 8,000
- Device Batch Size: 1
- Total Batch Size: 65,536 tokens
- Effective Batching: Gradient accumulation (see the arithmetic sketch below)
- Hardware: 1× NVIDIA H100 GPU
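With a device batch size of 1 sequence at a 2,048-token context, reaching the 65,536-token batch implies about 32 gradient-accumulation micro-steps per optimizer update (assuming full-length sequences), and the full run covers roughly 0.66B tokens:

```python
context_length = 2048        # tokens per sequence
device_batch_size = 1        # sequences per forward/backward pass
total_batch_tokens = 65_536  # tokens per optimizer step

# Micro-steps accumulated per optimizer update on a single GPU
grad_accum_steps = total_batch_tokens // (device_batch_size * context_length)  # 32

# Tokens processed over the whole run
total_steps = 10_000
tokens_seen = total_steps * total_batch_tokens  # 655,360,000 (~0.66B tokens)
```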
Optimization
| Parameter | Value |
|---|---|
| Optimizer | Likely AdamW (not explicitly recorded) |
| Embedding LR | 0.3 |
| Unembedding LR | 0.008 |
| Matrix LR | 0.02 |
| Scalar LR | 0.5 |
| Weight Decay | 0.28 |
| Warmup Steps | 100 |
| Warmdown Ratio | 0.85 |
| Final LR Fraction | 0.05 |
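These values describe a warmup/warmdown schedule: the learning rate ramps up over the first 100 steps and decays to 5% of its base value by the end of training. A minimal sketch of such a multiplier, assuming the warmdown ratio is the fraction of total steps spent decaying (nanochat's exact implementation may differ):

```python
def lr_multiplier(step, total_steps=10_000, warmup_steps=100,
                  warmdown_ratio=0.85, final_lr_frac=0.05):
    # Linear warmup from 0 to 1 over the first `warmup_steps` steps.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    # Hold at the base rate, then decay linearly over the last
    # `warmdown_ratio` fraction of training down to `final_lr_frac`.
    warmdown_start = total_steps * (1 - warmdown_ratio)
    if step < warmdown_start:
        return 1.0
    progress = (step - warmdown_start) / (total_steps - warmdown_start)
    return 1.0 - (1.0 - final_lr_frac) * progress
```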
Training Strategy
- Target Param/Data Ratio: 12 (see the estimate after this list)
- FP8 Training: Disabled
- Evaluation Interval: Every 4,000 steps
- Checkpoint Save Interval: Every 2,000 steps
- Sampling Interval: Every 2,000 steps
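As a back-of-the-envelope check of the ratio-12 target, here is a rough parameter estimate for the configuration above, assuming a standard GPT block with 4x MLP expansion and untied embeddings (the actual nanochat layout may differ slightly):

```python
n_layer, n_embd, vocab_size = 20, 1280, 32_768

block_params = 12 * n_embd ** 2            # ~4*d^2 attention + ~8*d^2 MLP per layer
embed_params = 2 * vocab_size * n_embd     # embedding + unembedding (assumed untied)
total_params = n_layer * block_params + embed_params   # ~477M parameters

target_tokens = 12 * total_params          # tokens implied by a param/data ratio of 12
tokens_seen = 10_000 * 65_536              # ~0.66B tokens actually processed in this run
```

Under these assumptions the run covers only a small fraction of the ratio-12 token budget, consistent with the model being described as undertrained.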
Dataset
- Source: nanochat default dataset
- Format: Parquet
- Scale: ~6,400 shards
Preprocessing
- Tokenized using nanochat pipeline
- Standard formatting and filtering
Dataloader State
- Shard Index (pq_idx): 67
- Row Group Index (rg_idx): 54
- Epoch: 1
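The recorded state indicates the run stopped partway through the first epoch, at shard 67 and row group 54. As an illustrative sketch only (not nanochat's actual dataloader), resuming iteration over Parquet shards from such a state could look like this with pyarrow; the `text` column name is an assumption:

```python
import pyarrow.parquet as pq

def iter_documents(shard_paths, pq_idx=67, rg_idx=54):
    # Resume from the recorded shard index, then from the recorded row group
    # within that shard; later shards start from their first row group.
    for shard_i in range(pq_idx, len(shard_paths)):
        shard = pq.ParquetFile(shard_paths[shard_i])
        start_rg = rg_idx if shard_i == pq_idx else 0
        for rg in range(start_rg, shard.num_row_groups):
            table = shard.read_row_group(rg)
            for text in table.column("text").to_pylist():  # column name assumed
                yield text
```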
Dataset Limitations
- Dataset composition not documented
- Unknown domain distribution
- Potential inherited biases
Evaluation
No formal benchmarks have been conducted.
- Method: Qualitative/manual inspection only
Metrics
| Metric | Value |
|---|---|
| Validation BPB | 0.8325 |
| Best Validation BPB | 0.7318 |
| Train Loss (smoothed) | 2.6494 |
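For reference, bits per byte is the mean cross-entropy converted from nats per token to bits and normalized by the tokenizer's average bytes per token; a small helper illustrating the relation (the bytes-per-token figure depends on the nanochat tokenizer and the evaluation data):

```python
import math

def bits_per_byte(loss_nats_per_token, avg_bytes_per_token):
    # nats/token -> bits/token -> bits/byte
    return loss_nats_per_token / math.log(2) / avg_bytes_per_token
```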
Example
Prompt:
Explain recursion simply:
Output:
Recursion is when something calls itself but it can be confusing and sometimes it keeps going because it is repeating the same idea again and again.
Training Runtime
- Total Training Time: 27,993 seconds (~7.8 hours)
Intended Uses
- Educational demonstrations
- Learning transformer training pipelines
- Observing early-stage model behavior
Out-of-Scope Uses
This model should not be used for:
- Medical advice
- Legal decisions
- Financial guidance
- Any high-stakes or production applications
Limitations
- Undertrained and unstable outputs
- Repetition and incoherence
- Weak long-context handling
- Prompt sensitivity
- Unreliable factual accuracy
Risks and Biases
- May reflect biases from training data
- High likelihood of hallucinations
- Inconsistent outputs due to early training stage
Versioning
nanochat-10k
- Initial experimental release (10k training steps)
Planned Improvements
- Record full training configuration
- Tune optimizer and learning rate
- Improve dataset transparency
- Add quantitative benchmarks
Usage
```python
from transformers import pipeline

generator = pipeline("text-generation", model="your-username/nanochat-10k")
# max_length counts prompt tokens as well as newly generated ones.
print(generator("Explain recursion simply:", max_length=50))
```
Notes
This model is part of an educational exploration of transformer architectures using a minimal implementation (nanochat) by Andrej Karpathy.
It is shared to document training progress and provide insight into early-stage model behavior.