Training Your First Model from Scratch

Track 2: Practical Guides - Post 2.1 of 6
This is the first post in Track 2, focusing on hands-on training.

Introduction

You've read about the technical details—the Muon optimizer, KV caching, modern architectures. Now it's time to get your hands dirty and train your first language model from scratch.

This guide walks through training a small nanochat model step-by-step, from environment setup and tokenizer training through monitoring the run to evaluating the final model.

This is hands-on and practical. You'll train a working language model and learn how to configure, monitor, and troubleshoot training runs. We'll start small (a model you can train in minutes on a single GPU), then scale up.

Prerequisites

Hardware

Minimum: 1× GPU with 24GB VRAM (e.g., RTX 3090, RTX 4090)
Recommended: 1× GPU with 80GB VRAM (e.g., A100, H100)
Optimal: 8× H100 GPUs for the full experience

Software

Python 3.10+
CUDA 11.8+ or 12.x
uv package manager (installed in Step 2 if you don't already have it)

Time

Quick tutorial model: 15-30 minutes
Full d20 model ($100 tier): 4 hours on 8×H100

Part 1: Environment Setup

Step 1: Clone and Enter Repository

git clone https://github.com/karpathy/nanochat.git
cd nanochat

Step 2: Install Dependencies

nanochat uses uv for fast, reliable dependency management:

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh
 
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .

NOTE

What gets installed: PyTorch 2.x with CUDA support, minimal dependencies (numpy, tiktoken, requests), total ~2000 lines in uv.lock (much cleaner than pip!)

Step 3: Download Data

nanochat trains on the FineWeb-Edu-100B dataset. For our first model, we'll download just a few shards:

# Download 10 shards (~550 MB; roughly 0.5B tokens, assuming ~250M chars/shard and 4-5 chars/token)
python -m nanochat.dataset -n 10 -w 4

Output:

Downloading shard_00000.parquet...
Successfully downloaded shard_00000.parquet
Downloading shard_00001.parquet...
...
Done! Downloaded: 10/10 shards to /Users/you/nanochat/base_data

Storage breakdown:

10 shards: ~550 MB (good for quick experiments)
100 shards: ~5.5 GB (good for small models)
1823 shards: ~100 GB (full dataset for large models)

Step 4: Train Tokenizer

Before training the model, we need a tokenizer:

python -m scripts.tok_train

What this does:

Streams through dataset and counts byte-pair frequencies
Trains BPE with vocab size 32,000
Saves to tokenizer/tokenizer.pkl (~1 MB)
Creates token_bytes.pt mapping for bpb evaluation

Output:

Processing sequences from iterator (buffer_size: 8192)
Processed 1,000,000 sequences total, 458,234 unique
Starting BPE training: 31,744 merges to compute
...
Progress: 100% (31744/31744 merges)
Finished training: 31744 merges completed
Saved tokenizer encoding to tokenizer/tokenizer.pkl

Time: ~10 minutes (you only need to do this once!)

Part 2: Your First Training Run

Now for the main event—training a model! We'll start with a tiny model to understand the workflow.

Quick Start: The d8 Model

Let's train the smallest practical model—depth 8 (11M parameters):

python scripts/base_train.py --depth=8 --num_iterations=100

What this command does:

--depth=8: Model with 8 layers (11M params)
--num_iterations=100: Train for 100 optimization steps (~5 minutes)

Expected output:

╔══════════════════════════════════════════════════════════════╗
║                         nanochat                             ║
║           The best ChatGPT that $100 can buy                 ║
╚══════════════════════════════════════════════════════════════╝

Vocab size: 32,000
num_layers: 8
model_dim: 512
num_heads: 4
num_kv_heads: 4
Number of parameters: 11,091,456
Estimated FLOPs per token: 8.837e+07

Tokens / micro-batch / rank: 32 x 2048 = 65,536
Total batch size 524,288 => gradient accumulation steps: 8

Using user-provided number of iterations: 100
Total number of training tokens: 52,428,800
Tokens : Params ratio: 4.73
Total training FLOPs estimate: 4.631e+15

step 00000/00100 (0.00%) | loss: 10.389445 | lrm: 1.00 | dt: 1247.32ms | tok/sec: 525,890 | mfu: 38.52 | total time: 0.00m
step 00001/00100 (1.00%) | loss: 9.847221 | lrm: 1.00 | dt: 1189.45ms | tok/sec: 551,234 | mfu: 40.38 | total time: 0.02m
...
step 00100/00100 (100.00%) | loss: 4.125678 | lrm: 0.00 | dt: 1145.67ms | tok/sec: 573,892 | mfu: 42.05 | total time: 1.92m

Peak memory usage: 8,234.56 MiB
Total training time: 1.92m
Minimum validation bpb: 2.8945

Congratulations! You've trained your first language model. Let's break down what just happened.

Part 3: Understanding the Training Output

The Banner

╔══════════════════════════════════════════════════════════════╗
║                         nanochat                             ║
║           The best ChatGPT that $100 can buy                 ║
╚══════════════════════════════════════════════════════════════╝

Just for fun. :)

Model Configuration

num_layers: 8
model_dim: 512
num_heads: 4
num_kv_heads: 4
Number of parameters: 11,091,456

Key relationships:

model_dim = depth × 64 (aspect ratio)
num_heads = ceil(model_dim / 128) (head dim 128)
num_kv_heads = num_heads (1:1 ratio, i.e. standard multi-head attention with no key/value head sharing)

For depth=8:

model_dim = 8 × 64 = 512
num_heads = ceil(512 / 128) = 4
Parameters ≈ 6 × model_dim² × num_layers (rule of thumb)
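The width and head relationships above can be sketched in a few lines. This is a minimal sketch that assumes the aspect ratio of 64 and head dimension of 128 quoted above; the function name `derive_config` is hypothetical and not part of nanochat.

```python
import math


def derive_config(depth: int, aspect_ratio: int = 64, head_dim: int = 128) -> dict:
    """Derive width and head counts from depth, using the ratios quoted above."""
    model_dim = depth * aspect_ratio
    num_heads = max(1, math.ceil(model_dim / head_dim))
    return {
        "num_layers": depth,
        "model_dim": model_dim,
        "num_heads": num_heads,
        "num_kv_heads": num_heads,  # 1:1 ratio, matching the log output
    }


print(derive_config(8))
# {'num_layers': 8, 'model_dim': 512, 'num_heads': 4, 'num_kv_heads': 4}
```

Because everything hangs off `depth`, changing that one flag is enough to move between model sizes.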

Batch Size Configuration

Tokens / micro-batch / rank: 32 x 2048 = 65,536
Total batch size 524,288 => gradient accumulation steps: 8

What this means:

Each GPU processes 32 sequences of 2048 tokens
Target total batch: 524,288 tokens
Need 8 gradient accumulation steps to reach target
Formula: grad_accum = total_batch / (device_batch × seq_len × world_size)
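The formula is easy to check by hand. Here is a small sketch; the helper name `grad_accum_steps` is hypothetical, and its parameters simply mirror the flag names used in this post.

```python
def grad_accum_steps(total_batch_size: int, device_batch_size: int,
                     max_seq_len: int, world_size: int = 1) -> int:
    """Number of micro-batches each GPU accumulates before one optimizer step."""
    tokens_per_micro_step = device_batch_size * max_seq_len * world_size
    if total_batch_size % tokens_per_micro_step != 0:
        raise ValueError("total_batch_size must be divisible by tokens per micro-step")
    return total_batch_size // tokens_per_micro_step


print(grad_accum_steps(524_288, 32, 2048, world_size=1))  # 8, as in the log above
print(grad_accum_steps(524_288, 32, 2048, world_size=8))  # 1
print(grad_accum_steps(524_288, 16, 2048, world_size=1))  # 16
```

Halving `device_batch_size` doubles the accumulation steps, and adding GPUs reduces them, while the total tokens per optimizer step stay fixed.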

Training Horizon

Using user-provided number of iterations: 100
Total number of training tokens: 52,428,800
Tokens : Params ratio: 4.73

Three ways to specify training length:

Explicit iterations (what we used):
```
--num_iterations=100
```
Target FLOPs (for scaling laws experiments):
```
--target_flops=1e19
```
Data:param ratio (default, Chinchilla-optimal):
```
--target_param_data_ratio=20  # Default
```

For our d8 model:

11M params × 20 = 220M tokens
220M / 524K batch = 420 steps (Chinchilla-optimal)
We used 100 steps for speed
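The arithmetic above, worked through with the exact parameter count from the log. This is a sketch; the function names are hypothetical, and the real script may round differently.

```python
def steps_for_ratio(num_params: int, total_batch_size: int, ratio: float = 20.0) -> int:
    """Optimizer steps needed to see `ratio` tokens per parameter."""
    target_tokens = ratio * num_params
    return int(target_tokens // total_batch_size)


def tokens_per_param(num_iterations: int, total_batch_size: int, num_params: int) -> float:
    """Tokens : Params ratio for an explicit iteration count."""
    return num_iterations * total_batch_size / num_params


params = 11_091_456
print(steps_for_ratio(params, 524_288))                 # 423 (the ~420 quoted above)
print(f"{tokens_per_param(100, 524_288, params):.2f}")  # 4.73, matching the log
```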

Training Step Output

step 00042/00100 (42.00%) | loss: 5.234567 | lrm: 1.00 | dt: 1145.67ms | tok/sec: 573,892 | mfu: 42.05 | total time: 0.80m

Field breakdown:

Field	Meaning	Good Values
`step 00042/00100`	Current step / total	-
`(42.00%)`	Progress percentage	-
`loss: 5.234567`	Training loss (cross-entropy)	Decreasing
`lrm: 1.00`	Learning rate multiplier	1.0 during training, 0.0 at end
`dt: 1145.67ms`	Step duration	Lower is better
`tok/sec: 573,892`	Token throughput	Higher is better
`mfu: 42.05`	Model FLOPs Utilization (%)	35-50% typical
`total time: 0.80m`	Cumulative time	-

MFU (Model FLOPs Utilization):

MFU = actual_flops_per_sec / theoretical_peak_flops

For H100 (bfloat16): theoretical peak = 989 TFLOPs/s

Typical MFU values:

35-45%: Good (small models)
45-55%: Excellent (medium models)
55-60%: Outstanding (large models, good kernels)
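As a sketch of how the MFU number is formed: multiply FLOPs per token by tokens per second, then divide by the hardware peak. The inputs below are illustrative values, not measurements from a real run, and `mfu_percent` is a hypothetical helper name.

```python
H100_BF16_PEAK_FLOPS = 989e12  # bfloat16 peak per GPU, as quoted above


def mfu_percent(flops_per_token: float, tokens_per_sec: float,
                num_gpus: int = 1, peak_flops: float = H100_BF16_PEAK_FLOPS) -> float:
    """Achieved FLOPs/s as a percentage of the theoretical peak across all GPUs."""
    achieved = flops_per_token * tokens_per_sec
    return 100.0 * achieved / (peak_flops * num_gpus)


# Illustrative inputs only: 3.5 GFLOPs per token at 100,000 tokens/sec on one GPU
print(f"{mfu_percent(3.5e9, 1.0e5):.1f}%")  # 35.4%
```

On a different GPU you would swap in that card's peak, otherwise the percentage is not meaningful.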

Validation Evaluation

Every 250 steps (by default), the script evaluates the model on validation data:

Step 00000 | Validation bpb: 4.5678
Step 00250 | Validation bpb: 1.8234
Step 00500 | Validation bpb: 1.6789

Bits per byte (bpb):

4.5: Untrained model (high)
2.0: Learning something
1.5: Decent small model
1.0: Good medium model
0.6: GPT-4 level (estimated)

See Loss Landscape & Scaling Laws for details on bpb.
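For intuition, bits per byte is the summed cross-entropy loss converted from nats to bits and divided by the number of UTF-8 bytes the tokens cover (this is what the token_bytes.pt mapping from Step 4 is for). A minimal sketch with made-up numbers follows; `bits_per_byte` is a hypothetical helper, not nanochat's evaluation code.

```python
import math


def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats) into bits per byte of text."""
    return total_loss_nats / (math.log(2) * total_bytes)


# Illustrative: 1,000 tokens at a mean loss of 3.0 nats/token, covering 4,500 bytes
print(round(bits_per_byte(3.0 * 1000, 4500), 4))
```

Normalizing by bytes rather than tokens is what makes bpb comparable across tokenizers with different vocabulary sizes.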

Final Statistics

Peak memory usage: 8,234.56 MiB
Total training time: 1.92m
Minimum validation bpb: 2.8945

Memory usage:

d8 model: ~8 GB (fits on consumer GPUs!)
d20 model: ~32 GB at device_batch_size=16, ~42 GB at the default of 32 (needs A100/H100)
d26 model: ~76 GB at the default batch size (needs 80GB GPU or reduce batch size)

Part 4: Training a Real Model (d20)

Now let's train the "standard" nanochat model—depth 20 (83M parameters):

Step 1: Download More Data

# d20 with 20× data ratio needs ~1.66B tokens
# At ~250M chars/shard and an assumed ~4.8 chars/token, that is roughly 32 shards;
# downloading 100 leaves generous headroom
python -m nanochat.dataset -n 100 -w 8
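How many shards you need depends on how many characters the tokenizer packs into each token, which this post does not state. The sketch below assumes 4.8 characters per token (an assumption, adjust it for your tokenizer) together with the ~250M characters per shard mentioned above; `shards_needed` is a hypothetical helper.

```python
import math


def shards_needed(num_params: int, ratio: float = 20.0,
                  chars_per_token: float = 4.8,      # assumed, not from this post
                  chars_per_shard: float = 250e6) -> int:
    """Estimate shards required to cover `ratio` tokens per parameter."""
    target_tokens = num_params * ratio
    return math.ceil(target_tokens * chars_per_token / chars_per_shard)


print(shards_needed(83_000_000))  # 32 under these assumptions
```

Treat the result as a lower bound and round up generously; extra shards only cost disk space.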

Step 2: Launch Training

Single GPU:

python scripts/base_train.py --depth=20

Multi-GPU (8× GPUs):

torchrun --standalone --nproc_per_node=8 scripts/base_train.py --depth=20

Training time:

Single GPU: ~32 hours
8× GPUs: ~4 hours

Step 3: Monitor Progress

The script outputs progress regularly. Key things to watch:

1. Loss should decrease smoothly:

step 00000: loss=10.38
step 00100: loss=4.12
step 00500: loss=2.89
step 01000: loss=2.34
step 02000: loss=1.98
step 03000: loss=1.78

2. Validation bpb should improve:

Step 00000 | Validation bpb: 4.4567
Step 00250 | Validation bpb: 2.1234
Step 00500 | Validation bpb: 1.8901
Step 00750 | Validation bpb: 1.7234
Step 01000 | Validation bpb: 1.6123
...
Step 03167 | Validation bpb: 1.4501

3. Model samples (every 2000 steps):

Step 02000 | Sample outputs:
The capital of France is Paris.
The chemical symbol of gold is Au
If yesterday was Friday, then tomorrow will be Sunday

These get better over time!

Step 4: Find Your Checkpoint

After training completes, find your model:

ls -lh base_checkpoints/d20/

Output:

checkpoint_003167.pt     # 335 MB - the final model
meta.json                # Metadata (config, metrics)

Part 5: Configuration Deep-Dive

nanochat uses a "Poor Man's Configurator"—you can override any training parameter via command line or config file.

Command-Line Overrides

python scripts/base_train.py \
    --depth=20 \
    --device_batch_size=16 \
    --max_seq_len=1024 \
    --matrix_lr=0.01 \
    --num_iterations=5000

Common parameters:

Parameter	Default	Description
`depth`	20	Model depth (scales everything else)
`max_seq_len`	2048	Context length
`device_batch_size`	32	Per-GPU batch size
`total_batch_size`	524288	Total tokens per step
`num_iterations`	-1	Explicit step count (-1 = auto)
`target_param_data_ratio`	20	Chinchilla ratio
`matrix_lr`	0.02	Muon learning rate
`embedding_lr`	0.2	AdamW LR for embeddings
`grad_clip`	1.0	Gradient clipping (0 = off)
`eval_every`	250	Validation frequency
`run`	"dummy"	Wandb run name ("dummy" = no logging)

Config Files

Create config/my_experiment.py:

# Smaller model for quick testing
depth = 12
num_iterations = 500
device_batch_size = 16
 
# More aggressive learning rates
matrix_lr = 0.03
embedding_lr = 0.3
 
# Wandb logging
run = "d12_fast_lr"

Run with:

python scripts/base_train.py config/my_experiment.py

You can combine config files with CLI overrides:

python scripts/base_train.py config/my_experiment.py --depth=16
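To make the precedence concrete, here is a rough sketch of how a configurator of this style could work: arguments are processed left to right, so a later `--key=value` wins over an earlier config file. This is an illustration under that assumption, not nanochat's actual configurator, and `apply_overrides` is a hypothetical name.

```python
import ast


def apply_overrides(defaults: dict, args: list[str]) -> dict:
    """Apply config files and --key=value flags, in order, on top of defaults."""
    config = dict(defaults)
    for arg in args:
        if not arg.startswith("--"):
            # Positional argument: treat it as a Python config file and execute it
            file_vars: dict = {}
            with open(arg) as f:
                exec(f.read(), {}, file_vars)
            for key, value in file_vars.items():
                if key in config:
                    config[key] = value
            continue
        key, sep, raw = arg[2:].partition("=")
        if not sep or key not in config:
            raise ValueError(f"Unknown or malformed override: {arg}")
        try:
            value = ast.literal_eval(raw)  # numbers, booleans, quoted strings
        except (ValueError, SyntaxError):
            value = raw                    # bare strings such as run names
        config[key] = value
    return config


defaults = {"depth": 20, "device_batch_size": 32, "matrix_lr": 0.02, "run": "dummy"}
print(apply_overrides(defaults, ["--depth=16", "--run=d16_test"]))
# {'depth': 16, 'device_batch_size': 32, 'matrix_lr': 0.02, 'run': 'd16_test'}
```

Rejecting unknown keys is a deliberate choice in this sketch: a typo in a flag name fails loudly instead of silently training with the default.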

Memory Management

Out of memory? Reduce device_batch_size:

# Original (32 → 65K tokens per GPU)
python scripts/base_train.py --depth=26
 
# Reduced (16 → 32K tokens per GPU)
python scripts/base_train.py --depth=26 --device_batch_size=16
 
# Further reduced (8 → 16K tokens per GPU)
python scripts/base_train.py --depth=26 --device_batch_size=8

The script automatically compensates by increasing gradient accumulation:

Same total batch size (524K tokens)
Same final model quality
Slightly slower (more sequential compute)
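Why is quality unchanged? Averaging the gradients of equal-sized micro-batches gives the same gradient as one large batch. The toy example below shows this for a one-parameter model in pure Python; it is a sketch of the principle, not nanochat's training code.

```python
def grad_mse(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
w = 0.5

full = grad_mse(w, xs, ys)  # one large batch of 8 examples

micro_batch = 2
num_micro = len(xs) // micro_batch
accumulated = 0.0
for i in range(num_micro):
    sl = slice(i * micro_batch, (i + 1) * micro_batch)
    # Scale each micro-batch gradient so the running sum is a full-batch mean
    accumulated += grad_mse(w, xs[sl], ys[sl]) / num_micro

print(full, accumulated)  # -76.5 -76.5
```

The only cost is time: the micro-batches run one after another instead of in parallel.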

Memory usage by model size:

depth	params	VRAM (batch=32)	VRAM (batch=16)	VRAM (batch=8)
8	11M	8 GB	6 GB	5 GB
12	30M	16 GB	12 GB	10 GB
16	54M	28 GB	20 GB	16 GB
20	83M	42 GB	32 GB	26 GB
24	118M	64 GB	48 GB	38 GB
26	140M	76 GB	56 GB	44 GB

Part 6: Monitoring with Wandb

Enable Weights & Biases logging for beautiful dashboards:

# Install wandb
uv pip install wandb
 
# Login (first time only)
wandb login
 
# Train with logging
python scripts/base_train.py --run=my_experiment_name

What gets logged:

Training metrics (every 100 steps):
- train/loss: Cross-entropy loss
- train/lrm: Learning rate multiplier
- train/dt: Step duration
- train/tok_per_sec: Token throughput
- train/mfu: Model FLOPs utilization
Validation metrics (every 250 steps):
- val/bpb: Bits per byte
CORE metrics (every 2000 steps):
- core_metric: Overall score
- Individual task scores
System metrics:
- total_training_flops: Cumulative FLOPs
- total_training_time: Wall-clock time

Wandb dashboard view:

my_experiment_name
├── train/loss        [line chart: decreasing]
├── val/bpb          [line chart: decreasing]
├── train/mfu        [line chart: stable ~40-50%]
├── core_metric      [line chart: increasing]
└── System           [GPU utilization, memory, etc.]

Part 7: Testing Your Model

After training, let's test what the model learned!

Quick CLI Test

python scripts/chat_cli.py

Interactive session:

You: The capital of France is
Assistant: Paris. The city is known for the Eiffel Tower and the Louvre Museum.

You: If 2+2=4, then 3+3=
Assistant: 6

You: Why is the sky blue?
Assistant: The sky appears blue because of Rayleigh scattering...

(Your d8 model won't be this good yet—that's from a larger model!)

Web UI

python scripts/chat_web.py

Then visit http://localhost:8000 (or your server's IP if remote).

Evaluation Benchmarks

Evaluate on the CORE benchmark:

python scripts/base_eval.py

This loads the latest checkpoint and evaluates it on the CORE benchmark (~15 minutes on 8 GPUs).

Output:

Model: base_model (step 3167)
================================================================================
Task                                , Accuracy  , Centered
arc_challenge                       , 0.2134    , 0.2645
arc_easy                            , 0.3456    , 0.3789
...
CORE                                ,           , 0.2219

Part 8: Common Issues and Solutions

Issue 1: Loss is NaN

step 00042: loss=nan

Causes:

Learning rate too high
Numerical instability
Corrupted data

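A defensive guard in the training loop can skip an occasional bad batch instead of crashing the run. It is a safety net, not a cure: if the loss is NaN on every step, the weights have likely already diverged and you need the fixes under Solutions.

# At the top of the script
import logging
import torch

# Inside the training loop, around the forward/backward pass
try:
    with autocast_ctx:
        loss = model(train_inputs, train_targets)

    # A NaN or inf loss indicates training instability
    if not torch.isfinite(loss):
        logging.warning(f"Non-finite loss at step {step}. Skipping batch.")
        optimizer.zero_grad(set_to_none=True)
        continue

    loss.backward()

except RuntimeError as e:
    # CUDA OOM is raised as a RuntimeError whose message contains "out of memory"
    if "out of memory" in str(e):
        logging.error(f"OOM at step {step}. Clearing cache and skipping batch.")
        optimizer.zero_grad(set_to_none=True)  # drop partial gradients first
        torch.cuda.empty_cache()
        continue
    raise  # re-raise anything else with its original traceback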

Solutions:

# Reduce learning rates
python scripts/base_train.py --matrix_lr=0.01 --embedding_lr=0.1
 
# Enable gradient clipping (should be on by default)
python scripts/base_train.py --grad_clip=1.0
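For reference, here is what clipping does, assuming `--grad_clip` is a global-norm threshold in the style of PyTorch's `torch.nn.utils.clip_grad_norm_` (an assumption about the script). The pure-Python sketch below mirrors that computation on plain lists; `clip_by_global_norm` is a hypothetical helper.

```python
import math


def clip_by_global_norm(grads: list[list[float]],
                        max_norm: float) -> tuple[list[list[float]], float]:
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + 1e-6))  # never scale gradients up
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm


grads = [[3.0, 4.0], [12.0]]  # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before)  # 13.0
print(clipped)
```

All gradients are scaled by the same factor, so the update direction is preserved and only its size is limited. That is why clipping tames loss spikes without changing what the model learns on ordinary steps.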

Issue 2: Loss Not Decreasing

step 00000: loss=10.38
step 00100: loss=10.35
step 00200: loss=10.32

Causes:

Learning rate too low
Not enough training steps
Model too small for data

Solutions:

# Increase learning rates
python scripts/base_train.py --matrix_lr=0.03 --embedding_lr=0.3
 
# Train longer
python scripts/base_train.py --num_iterations=1000
 
# Use larger model
python scripts/base_train.py --depth=16

Issue 3: OOM (Out of Memory)

torch.cuda.OutOfMemoryError: CUDA out of memory

Solution:

# Reduce batch size (automatically increases grad accumulation)
python scripts/base_train.py --device_batch_size=16
 
# Or reduce sequence length
python scripts/base_train.py --max_seq_len=1024
 
# Or use smaller model
python scripts/base_train.py --depth=12

Issue 4: Slow Training

tok/sec: 50,000  (expected: 500,000+)

Causes:

CPU bottleneck (data loading)
Inefficient GPU utilization
Small batch size

Solutions:

# Increase device batch size (if memory allows)
python scripts/base_train.py --device_batch_size=64
 
# Check data loading isn't bottleneck
# (should see GPU utilization near 100% in nvidia-smi)

Issue 5: Can't Resume Training

nanochat doesn't save intermediate checkpoints during a run by default (this keeps the code simple), so an interrupted run cannot be resumed. To train longer:

# Option 1: Increase iterations from the start
python scripts/base_train.py --num_iterations=5000
 
# Option 2: Use mid-training script (covered in later posts)
python scripts/mid_train.py

Part 9: Scaling Up

Scaling to Multiple GPUs

2 GPUs:

torchrun --standalone --nproc_per_node=2 scripts/base_train.py

4 GPUs:

torchrun --standalone --nproc_per_node=4 scripts/base_train.py

8 GPUs:

torchrun --standalone --nproc_per_node=8 scripts/base_train.py

What changes:

Training time ≈ single-GPU time ÷ num_GPUs (e.g., ~32 hours → ~4 hours for d20 on 8 GPUs)
Memory per GPU stays the same
Gradient accumulation automatically reduced
Final model quality identical

Scaling to Larger Models

d12 (30M params) - $30 tier:

torchrun --standalone --nproc_per_node=8 scripts/base_train.py \
    --depth=12 \
    --run=d12_experiment

d26 (140M params) - $300 tier:

# Need more data
python -m nanochat.dataset -n 450 -w 8
 
# Train (reduce batch size for memory)
torchrun --standalone --nproc_per_node=8 scripts/base_train.py \
    --depth=26 \
    --device_batch_size=16 \
    --run=d26_experiment

Performance expectations:

Model	Params	Training Time (8×H100)	Final bpb	CORE	Cost
d8	11M	30 min	2.1	0.12	$12
d12	30M	90 min	1.8	0.18	$36
d16	54M	2.5 hours	1.6	0.25	$60
d20	83M	4 hours	1.45	0.28	$96
d26	140M	12 hours	1.3	0.35	$288

(At $24/hr for 8×H100)
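The Cost column is simply training hours multiplied by the hourly rate. This short sketch reproduces it from the table's own numbers, so you can plug in your provider's price; `run_cost` is a hypothetical helper.

```python
def run_cost(hours: float, hourly_rate: float = 24.0) -> float:
    """Cost in dollars for a node billed at `hourly_rate` per hour."""
    return hours * hourly_rate


# Training times taken from the table above (8×H100)
for name, hours in [("d8", 0.5), ("d12", 1.5), ("d16", 2.5), ("d20", 4.0), ("d26", 12.0)]:
    print(f"{name}: ${run_cost(hours):.0f}")
# d8: $12, d12: $36, d16: $60, d20: $96, d26: $288
```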

Part 10: Next Steps

Congratulations! You've trained your first language model from scratch. Here's what to explore next:

1. Fine-tune for chat:

See Fine-tuning for Chat (SFT)
Turn your base model into a conversational assistant

2. Understand the architecture:

Review Modern Transformer Architecture
Learn what's inside the model you just trained

3. Experiment with hyperparameters:

Try different learning rates
Test longer training runs
Compare model sizes

4. Train your own tokenizer:

See Tokenizer Design Choices
Customize vocabulary for your domain

5. Build custom evaluations:

See Building Custom Evaluation Tasks
Test your model on tasks you care about

Conclusion

Training a language model from scratch is no longer magic—it's engineering. With nanochat, you can:

✅ Train models from 11M to 140M+ parameters
✅ Use single GPU or scale to 8× GPUs
✅ Monitor training with comprehensive metrics
✅ Evaluate with standardized benchmarks
✅ Deploy as a chat interface

The key insights:

Start small: Train d8 (11M) first to verify your setup
Monitor metrics: Watch loss, bpb, and sample outputs
Manage memory: Adjust device_batch_size to fit your GPU
Scale gradually: d8 → d12 → d20 → d26
Iterate fast: Use config files for experiments

You now have the foundation to train, evaluate, and deploy language models. The next posts will build on this foundation to create more capable systems.

Previous in series:

Loss Landscape & Scaling Laws - Understand evaluation metrics like bpb and Chinchilla scaling laws

Next in series:

Fine-tuning for Chat (SFT) - Transform your base model into a conversational assistant

Related posts:

Modern Transformer Architecture - Deep dive into RoPE, QK normalization, and architectural innovations
Muon Optimizer Explained - Understanding the optimizer that powers nanochat training

Part of the nanochat Deep-Dive Series • Track 2: Practical Guides

GitHub: nanochat repository
Training Script: scripts/base_train.py

TIP

Experiment notebooks: Due to reader interest, interactive Jupyter notebooks for hands-on experiments are planned. Let us know if you'd like to see them!
