nanochat-10k

Model Description

nanochat-10k is an early-stage, undertrained language model built using the nanochat framework created by Andrej Karpathy.
It is intended for educational purposes, showcasing how transformer-based language models are trained and how their behavior evolves during early training.

⚠️ This model is not optimized for performance or real-world use.


Attribution

This model was trained using the nanochat framework developed by Andrej Karpathy.
Credit goes to him for the underlying training methodology, architecture design, and tooling that made this experiment possible.


Model Details

  • Architecture: Decoder-only Transformer (GPT-style)
  • Framework: nanochat (by Andrej Karpathy)
  • Objective: Causal language modeling
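For reference, causal language modeling is standard next-token prediction: each position is trained to predict the token that follows it. A minimal PyTorch sketch of the objective (illustrative only, not nanochat's actual loss code):

import torch.nn.functional as F

def causal_lm_loss(logits, tokens):
    # logits: (batch, seq, vocab); tokens: (batch, seq)
    # Position t predicts token t+1, so targets are the inputs shifted left by one.
    return F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )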

Configuration

  • Layers: 20
  • Attention Heads: 10
  • KV Heads: 10
  • Embedding Size: 1280
  • Head Dimension: 128
  • Context Length: 2048 tokens
  • Vocabulary Size: 32,768
  • Window Pattern: L
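These dimensions are internally consistent: 10 heads over a 1280-dimensional embedding gives each head 128 dimensions. A small check, using an illustrative dataclass rather than nanochat's actual config object:

from dataclasses import dataclass

@dataclass
class Config:  # illustrative sketch, not nanochat's config class
    n_layer: int = 20
    n_head: int = 10
    n_kv_head: int = 10
    n_embd: int = 1280
    ctx_len: int = 2048
    vocab_size: int = 32_768

cfg = Config()
assert cfg.n_embd // cfg.n_head == 128  # head dimension from the list above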

Training Details

  • Total Steps: 10,000
  • Resume Step: 8,000
  • Device Batch Size: 1
  • Total Batch Size: 65,536 tokens
  • Effective Batching: Gradient accumulation
  • Hardware: 1× NVIDIA H100 GPU
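With a device batch of one 2,048-token sequence, reaching the 65,536-token total batch means accumulating gradients over 32 micro-steps before each optimizer update. The arithmetic (variable names are illustrative):

# One optimizer step = 32 forward/backward micro-steps whose
# gradients accumulate in .grad before optimizer.step() runs.
total_batch_tokens  = 65_536      # total batch size (tokens)
device_batch_tokens = 1 * 2_048   # device batch size x context length
grad_accum_steps = total_batch_tokens // device_batch_tokens
assert grad_accum_steps == 32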

Optimization

  • Optimizer: Likely AdamW (not explicitly recorded)
  • Embedding LR: 0.3
  • Unembedding LR: 0.008
  • Matrix LR: 0.02
  • Scalar LR: 0.5
  • Weight Decay: 0.28
  • Warmup Steps: 100
  • Warmdown Ratio: 0.85
  • Final LR Fraction: 0.05
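These parameters suggest a warmup/constant/warmdown multiplier applied on top of each group's base learning rate. A plausible reconstruction, assuming the warmdown ratio denotes the final fraction of training over which the LR decays linearly to 5% of its base value (nanochat's exact schedule may differ):

def lr_multiplier(step, total_steps=10_000, warmup_steps=100,
                  warmdown_ratio=0.85, final_lr_frac=0.05):
    if step < warmup_steps:
        # Linear warmup from 0 to the base LR over the first 100 steps.
        return (step + 1) / warmup_steps
    warmdown_start = int(total_steps * (1 - warmdown_ratio))  # step 1,500
    if step < warmdown_start:
        return 1.0
    # Assumed: linear decay to final_lr_frac over the warmdown span.
    progress = (step - warmdown_start) / (total_steps - warmdown_start)
    return 1.0 - (1.0 - final_lr_frac) * progress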

Training Strategy

  • Target Param/Data Ratio: 12
  • FP8 Training: Disabled
  • Evaluation Interval: Every 4,000 steps
  • Checkpoint Save Interval: Every 2,000 steps
  • Sampling Interval: Every 2,000 steps
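A back-of-the-envelope check of the target ratio, using a rough parameter estimate from the configuration (the formula ignores norms and biases, and the ratio is assumed to mean training tokens per parameter):

d, layers, vocab = 1280, 20, 32_768
params = 2 * vocab * d + layers * 12 * d * d   # embed + unembed + ~12*d^2 per block
target_tokens = 12 * params                    # param/data ratio of 12
tokens_seen = 10_000 * 65_536                  # steps x total batch size
print(f"{params/1e6:.0f}M params, target ~{target_tokens/1e9:.1f}B tokens, "
      f"seen ~{tokens_seen/1e6:.0f}M")

At roughly 0.66B tokens against a ~5.7B-token target, the run stops well short of the ratio, consistent with the model being described as undertrained.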

Dataset

  • Source: nanochat default dataset
  • Format: Parquet
  • Scale: ~6,400 shards

Preprocessing

  • Tokenized using nanochat pipeline
  • Standard formatting and filtering

Dataloader State

  • Shard Index (pq_idx): 67
  • Row Group Index (rg_idx): 54
  • Epoch: 1
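The saved indices let the dataloader resume mid-dataset instead of restarting from shard 0. A minimal sketch of that resume logic with pyarrow (shard paths and the function are illustrative, not nanochat's actual dataloader):

import pyarrow.parquet as pq

def resume_row_groups(shard_paths, pq_idx=67, rg_idx=54):
    # Skip already-consumed shards, then skip consumed row groups
    # within the current shard; later shards start from row group 0.
    for i in range(pq_idx, len(shard_paths)):
        pf = pq.ParquetFile(shard_paths[i])
        start = rg_idx if i == pq_idx else 0
        for rg in range(start, pf.num_row_groups):
            yield pf.read_row_group(rg)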

Dataset Limitations

  • Dataset composition not documented
  • Unknown domain distribution
  • Potential inherited biases

Evaluation

No formal benchmarks have been conducted.

  • Method: Qualitative/manual inspection only

Metrics

  • Validation BPB: 0.8325
  • Best Val BPB: 0.7318
  • Train Loss (smoothed): 2.6494
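Bits per byte (BPB) normalizes the per-token loss by how many raw bytes each token covers, making scores comparable across tokenizers. A sketch of the conversion (the exact normalization nanochat's eval uses is assumed here, not confirmed):

import math

def bits_per_byte(mean_nll_nats, total_tokens, total_bytes):
    # nats/token -> bits/token (divide by ln 2), then bits/byte
    # (scale by tokens per byte of the underlying text).
    return (mean_nll_nats / math.log(2)) * (total_tokens / total_bytes)

With a typical 4 to 5 bytes per token, the smoothed train loss of 2.6494 nats corresponds to roughly 0.76 to 0.96 BPB, in the same range as the validation figures.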

Example

Prompt:

Explain recursion simply:

Output:

Recursion is when something calls itself but it can be confusing and sometimes it keeps going because it is repeating the same idea again and again.

Training Runtime

  • Total Training Time: 27,993 seconds (7.8 hours)
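Assuming that runtime covers all 10,000 steps, the implied throughput is on the order of 23,000 tokens per second:

tokens_seen = 10_000 * 65_536   # steps x total batch size (tokens)
print(tokens_seen / 27_993)     # ~23,400 tokens/s on one H100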

Intended Uses

  • Educational demonstrations
  • Learning transformer training pipelines
  • Observing early-stage model behavior

Out-of-Scope Uses

This model should not be used for:

  • Medical advice
  • Legal decisions
  • Financial guidance
  • Any high-stakes or production applications

Limitations

  • Undertrained and unstable outputs
  • Repetition and incoherence
  • Weak long-context handling
  • Prompt sensitivity
  • Unreliable factual accuracy

Risks and Biases

  • May reflect biases from training data
  • High likelihood of hallucinations
  • Inconsistent outputs due to early training stage

Versioning

nanochat-10k

  • Initial experimental release (10k training steps)

Planned Improvements

  • Record full training configuration
  • Tune optimizer and learning rate
  • Improve dataset transparency
  • Add quantitative benchmarks

Usage

from transformers import pipeline

# Text-generation pipeline; replace "your-username" with the actual repo id.
generator = pipeline("text-generation", model="your-username/nanochat-10k")

# max_new_tokens bounds the continuation itself (max_length would count the prompt too).
print(generator("Explain recursion simply:", max_new_tokens=50))

Notes

This model is part of an educational exploration of transformer architectures using a minimal implementation (nanochat) by Andrej Karpathy.
It is shared to document training progress and provide insight into early-stage model behavior.
