---
license: mit
---

This is the reproduction of karpathy's nanochat on AMD Hardware.

# nanochat training report

## Environment

### Hardware

- Platform: Linux
- CPUs: 160 cores (160 logical)
- Memory: 1889.8 GB
- GPUs: 8x AMD Instinct MI300X VF
- GPU Memory: 1533.5 GB total
- CUDA Version: unknown
- Hourly Rate: $16.00/hour

### Software

- Python: 3.12.3
- PyTorch: 2.10.0.dev20251028+rocm7.0

---

## Tokenizer training

timestamp: 2025-10-29 03:42:34

- max_chars: 2,000,000,000
- doc_cap: 10,000
- vocab_size: 65,536
- train_time: 76.6262
- num_special_tokens: 9
- token_bytes_min: 1
- token_bytes_max: 32
- token_bytes_mean: 6.9151
- token_bytes_std: 2.8736

## Tokenizer evaluation

timestamp: 2025-10-29 03:42:37

### Comparison with GPT-2

| Text Type | Bytes | GPT-2 Tokens | GPT-2 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|-----------|-------|--------------|-------------|-------------|------------|-----------------|
| news | 1819 | 404 | 4.50 | 375 | 4.85 | +7.2% |
| korean | 893 | 745 | 1.20 | 721 | 1.24 | +3.2% |
| code | 1259 | 576 | 2.19 | 493 | 2.55 | +14.4% |
| math | 1834 | 936 | 1.96 | 966 | 1.90 | -3.2% |
| science | 1112 | 260 | 4.28 | 225 | 4.94 | +13.5% |
| fwe-train | 4208518 | 900364 | 4.67 | 856901 | 4.91 | +4.8% |
| fwe-val | 4908443 | 1059062 | 4.63 | 1010356 | 4.86 | +4.6% |

### Comparison with GPT-4

| Text Type | Bytes | GPT-4 Tokens | GPT-4 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|-----------|-------|--------------|-------------|-------------|------------|-----------------|
| news | 1819 | 387 | 4.70 | 375 | 4.85 | +3.1% |
| korean | 893 | 364 | 2.45 | 721 | 1.24 | -98.1% |
| code | 1259 | 309 | 4.07 | 493 | 2.55 | -59.5% |
| math | 1834 | 832 | 2.20 | 966 | 1.90 | -16.1% |
| science | 1112 | 249 | 4.47 | 225 | 4.94 | +9.6% |
| fwe-train | 4208518 | 874799 | 4.81 | 856901 | 4.91 | +2.0% |
| fwe-val | 4908443 | 1029691 | 4.77 | 1010356 | 4.86 | +1.9% |

## Base model training

timestamp: 2025-10-29 07:47:38

- run: my-llm-training-run-003
- device_type:
- depth: 20
- max_seq_len: 2048
- num_iterations: -1
- target_flops: -1.0000
- target_param_data_ratio: 20
- device_batch_size: 64
- total_batch_size: 1,048,576
- embedding_lr: 0.2000
- unembedding_lr: 0.0040
- weight_decay: 0.0000
- matrix_lr: 0.0200
- grad_clip: 1.0000
- warmup_ratio: 0.0000
- warmdown_ratio: 0.2000
- final_lr_frac: 0.0000
- eval_every: 250
- eval_tokens: 10,485,760
- core_metric_every: 2000
- core_metric_max_per_task: 500
- sample_every: 2000
- model_tag:
- Number of parameters: 560,988,160
- Number of FLOPs per token: 3.491758e+09
- Calculated number of iterations: 10,700
- Number of training tokens: 11,219,763,200
- Tokens : Params ratio: 20.0000
- DDP world size: 8
- warmup_ratio: 0.0000
- warmdown_ratio: 0.2000
- final_lr_frac: 0.0000
- Minimum validation bpb: 0.8119
- Final validation bpb: 0.8119
- CORE metric estimate: 0.2077
- MFU %: 33.97%
- Total training flops: 3.917670e+19
- Total training time: 238.88m
- Peak memory usage: 147486.90MiB

## Base model loss

timestamp: 2025-10-29 07:48:10

- train bpb: 0.8146
- val bpb: 0.8120
- sample 0: <|bos|>The capital of France is Paris. It is located in the south of France. It is the second largest
- sample 1: <|bos|>The chemical symbol of gold is Au. It is a soft, silvery-white metal that is malleable and ductile.
- sample 2: <|bos|>If yesterday was Friday, then tomorrow will be Saturday. The day before tomorrow will be Saturday. The day after tomorrow will be
- sample 3: <|bos|>The opposite of hot is cold. The opposite of cold is hot. The opposite of hot is cold.
- sample 4: <|bos|>The planets of the solar system are: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune,
- sample 5: <|bos|>My favorite color is blue. I love the color blue. I love the color blue. I love
- sample 6: <|bos|>If 5*x + 3 = 13, then x is a positive integer. If 5*x + 3 = 13,

## Base model evaluation

timestamp: 2025-10-29 07:51:33

- Model: base_model (step 10700)
- CORE metric: 0.2017
- hellaswag_zeroshot: 0.2547
- jeopardy: 0.1053
- bigbench_qa_wikidata: 0.5239
- arc_easy: 0.5118
- arc_challenge: 0.1251
- copa: 0.2200
- commonsense_qa: 0.0981
- piqa: 0.3765
- openbook_qa: 0.1093
- lambada_openai: 0.3868
- hellaswag: 0.2586
- winograd: 0.2161
- winogrande: 0.0481
- bigbench_dyck_languages: 0.1270
- agi_eval_lsat_ar: 0.0870
- bigbench_cs_algorithms: 0.3689
- bigbench_operators: 0.1524
- bigbench_repeat_copy_logic: 0.0000
- squad: 0.2560
- coqa: 0.1929
- boolq: -0.1597
- bigbench_language_identification: 0.1793

## Midtraining

timestamp: 2025-10-29 08:02:20

- run: my-llm-training-run-003
- device_type:
- dtype: bfloat16
- num_iterations: -1
- max_seq_len: 2048
- device_batch_size: 64
- unembedding_lr: 0.0040
- embedding_lr: 0.2000
- matrix_lr: 0.0200
- init_lr_frac: 1.0000
- weight_decay: 0.0000
- eval_every: 150
- eval_tokens: 10,485,760
- total_batch_size: 1,048,576
- dry_run: 0
- Number of iterations: 404
- DDP world size: 8
- Minimum validation bpb: 0.3993

## Chat evaluation mid

timestamp: 2025-10-29 08:09:19

- source: mid
- task_name: None
- dtype: bfloat16
- temperature: 0.0000
- max_new_tokens: 512
- num_samples: 1
- top_k: 50
- batch_size: 8
- model_tag: None
- step: None
- max_problems: None
- device_type:
- ARC-Easy: 0.4074
- ARC-Challenge: 0.3157
- MMLU: 0.3236
- GSM8K: 0.0394
- HumanEval: 0.0854
- SpellingBee: 0.9688
- ChatCORE metric: 0.2482

## Chat SFT

timestamp: 2025-10-29 08:31:28

- run: my-llm-training-run-003
- source: mid
- device_type:
- dtype: bfloat16
- device_batch_size: 4
- num_epochs: 1
- num_iterations: -1
- target_examples_per_step: 32
- unembedding_lr: 0.0040
- embedding_lr: 0.2000
- matrix_lr: 0.0200
- weight_decay: 0.0000
- init_lr_frac: 0.0200
- eval_every: 100
- eval_steps: 100
- eval_metrics_every: 200
- eval_metrics_max_problems: 1024
- Training rows: 22,439
- Number of iterations: 701
- Training loss: 0.5337
- Validation loss: 1.0260

## Chat evaluation sft

timestamp: 2025-10-29 08:49:17

- source: sft
- task_name: None
- dtype: bfloat16
- temperature: 0.0000
- max_new_tokens: 512
- num_samples: 1
- top_k: 50
- batch_size: 8
- model_tag: None
- step: None
- max_problems: None
- device_type:
- ARC-Easy: 0.4192
- ARC-Challenge: 0.3148
- MMLU: 0.3192
- GSM8K: 0.0546
- HumanEval: 0.0671
- SpellingBee: 0.9844
- ChatCORE metric: 0.2517

## Summary

- Characters: 395,259
- Lines: 9,643
- Files: 47
- Tokens (approx): 98,814
- Dependencies (uv.lock lines): 1,363

| Metric | BASE | MID | SFT | RL |
|-----------------|----------|----------|----------|----------|
| CORE | 0.2017 | - | - | - |
| ARC-Challenge | - | 0.3157 | 0.3148 | - |
| ARC-Easy | - | 0.4074 | 0.4192 | - |
| GSM8K | - | 0.0394 | 0.0546 | - |
| HumanEval | - | 0.0854 | 0.0671 | - |
| MMLU | - | 0.3236 | 0.3192 | - |
| ChatCORE | - | 0.2482 | 0.2517 | - |

Total wall clock time: 5h8m
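
The base model training figures above appear to be consistent with one another, and the relationships can be checked with a few lines of arithmetic. The sketch below is not taken from the nanochat code: the variable names are illustrative, the formulas are inferred from the reported numbers, and the cost estimate assumes the hourly rate applies to the full wall clock time.

```python
# Values copied from the Environment, Base model training, and Summary sections.
num_params = 560_988_160
target_param_data_ratio = 20
total_batch_size = 1_048_576   # tokens per optimizer step
device_batch_size = 64         # sequences per GPU per micro-step
max_seq_len = 2048
world_size = 8                 # DDP world size
flops_per_token = 3.491758e9
hourly_rate_usd = 16.00
wall_clock_hours = 5 + 8 / 60  # "5h8m"

# Token budget: tokens = params * ratio (assumed relationship).
train_tokens = num_params * target_param_data_ratio

# Optimizer steps needed to consume that budget.
num_iterations = train_tokens // total_batch_size

# Tokens processed by all GPUs in one forward/backward pass.
tokens_per_micro_step = device_batch_size * max_seq_len * world_size

# Gradient accumulation steps implied by the batch settings.
grad_accum_steps = total_batch_size // tokens_per_micro_step

# Total training compute.
total_flops = flops_per_token * train_tokens

# Rough cost, assuming the whole run is billed at the hourly rate.
est_cost_usd = hourly_rate_usd * wall_clock_hours

print(f"training tokens:  {train_tokens:,}")
print(f"iterations:       {num_iterations:,}")
print(f"grad accum steps: {grad_accum_steps}")
print(f"total FLOPs:      {total_flops:.6e}")
print(f"estimated cost:   ${est_cost_usd:.2f}")
```

With these settings a single pass across the 8 GPUs already covers the full 1,048,576-token batch, so the configuration implies no gradient accumulation.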
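
The ratio and relative-difference columns in the tokenizer comparison tables can be reproduced from the byte and token counts. The sketch below assumes the ratio is bytes per token and the relative difference is the token saving measured against the baseline tokenizer; both formulas are inferred from the table values rather than taken from the nanochat source, and the function names are hypothetical.

```python
def compression_ratio(num_bytes: int, num_tokens: int) -> float:
    """Bytes per token; higher means better compression."""
    return num_bytes / num_tokens


def relative_diff_pct(baseline_tokens: int, ours_tokens: int) -> float:
    """Token saving relative to the baseline, in percent.

    Positive means our tokenizer used fewer tokens than the baseline.
    """
    return (baseline_tokens - ours_tokens) / baseline_tokens * 100


# (bytes, GPT-2 tokens, GPT-4 tokens, our tokens), copied from the tables.
rows = {
    "news": (1819, 404, 387, 375),
    "korean": (893, 745, 364, 721),
    "code": (1259, 576, 309, 493),
}

for name, (num_bytes, gpt2, gpt4, ours) in rows.items():
    print(
        f"{name:>7}: ours ratio {compression_ratio(num_bytes, ours):.2f}, "
        f"vs GPT-2 {relative_diff_pct(gpt2, ours):+.1f}%, "
        f"vs GPT-4 {relative_diff_pct(gpt4, ours):+.1f}%"
    )
```

Read this way, the -98.1% entry for Korean against GPT-4 means the tokenizer needs roughly twice as many tokens as GPT-4 on that sample.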