José David Baena

Training Your First Model from Scratch


Track 2: Practical Guides - Post 2.1 of 6

This is the first post in Track 2, focusing on hands-on training.

Introduction

You've read about the technical details—the Muon optimizer, KV caching, modern architectures. Now it's time to get your hands dirty and train your first language model from scratch.

This guide walks through training a small nanochat model step by step: setting up the environment, monitoring training progress, and evaluating the final model.

This is hands-on and practical. You'll train a working language model and learn how to configure, monitor, and troubleshoot training runs. We start small, with a model you can train in minutes on a single GPU, and then scale up.

Prerequisites

Hardware

  • Minimum: 1× GPU with 24GB VRAM (e.g., RTX 3090, RTX 4090)
  • Recommended: 1× GPU with 80GB VRAM (e.g., A100, H100)
  • Optimal: 8× H100 GPUs for the full experience

Software

  • Python 3.10+
  • CUDA 11.8+ or 12.x
  • uv package manager (installed in Step 2 below)

Time

  • Quick tutorial model: 15-30 minutes
  • Full d20 model ($100 tier): 4 hours on 8×H100

Part 1: Environment Setup

Step 1: Clone and Enter Repository

git clone https://github.com/karpathy/nanochat.git
cd nanochat

Step 2: Install Dependencies

nanochat uses uv for fast, reliable dependency management:

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh
 
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .

NOTE

What gets installed: PyTorch 2.x with CUDA support, minimal dependencies (numpy, tiktoken, requests), total ~2000 lines in uv.lock (much cleaner than pip!)

Step 3: Download Data

nanochat trains on the FineWeb-Edu-100B dataset. For our first model, we'll download just a few shards:

# Download 10 shards (~550 MB; roughly 2.5B characters at ~250M chars/shard)
python -m nanochat.dataset -n 10 -w 4

Output:

Downloading shard_00000.parquet...
Successfully downloaded shard_00000.parquet
Downloading shard_00001.parquet...
...
Done! Downloaded: 10/10 shards to /Users/you/nanochat/base_data

Storage breakdown:

  • 10 shards: ~550 MB (good for quick experiments)
  • 100 shards: ~5.5 GB (good for small models)
  • 1823 shards: ~100 GB (full dataset for large models)

Step 4: Train Tokenizer

Before training the model, we need a tokenizer:

python -m scripts.tok_train

What this does:

  1. Streams through dataset and counts byte-pair frequencies
  2. Trains BPE with vocab size 32,000
  3. Saves to tokenizer/tokenizer.pkl (~1 MB)
  4. Creates token_bytes.pt mapping for bpb evaluation

Output:

Processing sequences from iterator (buffer_size: 8192)
Processed 1,000,000 sequences total, 458,234 unique
Starting BPE training: 31,744 merges to compute
...
Progress: 100% (31744/31744 merges)
Finished training: 31744 merges completed
Saved tokenizer encoding to tokenizer/tokenizer.pkl

Time: ~10 minutes (you only need to do this once)
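Steps 1 and 2 above (count adjacent pairs, then merge the most frequent one) can be illustrated with a toy byte-level BPE trainer. This is a minimal sketch of the general algorithm, not nanochat's tokenizer code: the function names are made up for illustration, and it skips the pre-splitting of text into chunks that real tokenizers perform.

```python
from collections import Counter

def num_merges(vocab_size: int) -> int:
    """Byte-level BPE starts with 256 byte tokens; every merge adds one token."""
    return vocab_size - 256

def train_bpe(text: str, vocab_size: int) -> list[tuple[int, int]]:
    """Toy byte-level BPE. Returns the learned merges in the order they were made."""
    ids = list(text.encode("utf-8"))  # initial tokens are raw bytes (ids 0..255)
    merges = []
    next_id = 256
    while next_id < vocab_size:
        pairs = Counter(zip(ids, ids[1:]))  # frequency of each adjacent pair
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so further merges would not compress anything
        # Replace every occurrence of the best pair with the new token id
        merged, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                merged.append(next_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
        merges.append(best)
        next_id += 1
    return merges

print(num_merges(32_000))  # 31744, matching the merge count in the log above
```

The merge count in the training log follows directly from the vocabulary size: 32,000 tokens minus the 256 base bytes leaves 31,744 merges to learn.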

Part 2: Your First Training Run

Now for the main event—training a model! We'll start with a tiny model to understand the workflow.

Quick Start: The d8 Model

Let's train the smallest practical model—depth 8 (11M parameters):

python scripts/base_train.py --depth=8 --num_iterations=100

What this command does:

  • --depth=8: Model with 8 layers (11M params)
  • --num_iterations=100: Train for 100 optimization steps (about 2 minutes in the sample run below; timing depends on your GPU)

Expected output:

╔══════════════════════════════════════════════════════════════╗
║                         nanochat                             ║
║           The best ChatGPT that $100 can buy                 ║
╚══════════════════════════════════════════════════════════════╝

Vocab size: 32,000
num_layers: 8
model_dim: 512
num_heads: 4
num_kv_heads: 4
Number of parameters: 11,091,456
Estimated FLOPs per token: 8.837e+07

Tokens / micro-batch / rank: 32 x 2048 = 65,536
Total batch size 524,288 => gradient accumulation steps: 8

Using user-provided number of iterations: 100
Total number of training tokens: 52,428,800
Tokens : Params ratio: 4.73
Total training FLOPs estimate: 4.631e+15

step 00000/00100 (0.00%) | loss: 10.389445 | lrm: 1.00 | dt: 1247.32ms | tok/sec: 525,890 | mfu: 38.52 | total time: 0.00m
step 00001/00100 (1.00%) | loss: 9.847221 | lrm: 1.00 | dt: 1189.45ms | tok/sec: 551,234 | mfu: 40.38 | total time: 0.02m
...
step 00100/00100 (100.00%) | loss: 4.125678 | lrm: 0.00 | dt: 1145.67ms | tok/sec: 573,892 | mfu: 42.05 | total time: 1.92m

Peak memory usage: 8,234.56 MiB
Total training time: 1.92m
Minimum validation bpb: 2.8945

Congratulations! You've trained your first language model. Let's break down what just happened.

Part 3: Understanding the Training Output

The Banner

╔══════════════════════════════════════════════════════════════╗
║                         nanochat                             ║
║           The best ChatGPT that $100 can buy                 ║
╚══════════════════════════════════════════════════════════════╝

Just for fun. :)

Model Configuration

num_layers: 8
model_dim: 512
num_heads: 4
num_kv_heads: 4
Number of parameters: 11,091,456

Key relationships:

  • model_dim = depth × 64 (aspect ratio)
  • num_heads = ceil(model_dim / 128) (head dim 128)
  • num_kv_heads = num_heads (1:1 ratio, i.e. standard multi-head attention with no key/value head sharing)

For depth=8:

  • model_dim = 8 × 64 = 512
  • num_heads = ceil(512 / 128) = 4
  • Parameters ≈ 6 × model_dim² × num_layers (rule of thumb)
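The relationships above are simple enough to compute by hand, but a small helper makes it easy to preview the shape of a model before launching a run. This is a sketch built only from the formulas listed here; the function name and the `aspect_ratio` / `head_dim` parameter names are illustrative and may not match the training script.

```python
import math

def model_config(depth: int, aspect_ratio: int = 64, head_dim: int = 128) -> dict:
    """Derive the model shape from depth using the relationships listed above."""
    model_dim = depth * aspect_ratio
    num_heads = math.ceil(model_dim / head_dim)
    return {
        "num_layers": depth,
        "model_dim": model_dim,
        "num_heads": num_heads,
        "num_kv_heads": num_heads,  # 1:1 ratio with the query heads
    }

print(model_config(8))
# {'num_layers': 8, 'model_dim': 512, 'num_heads': 4, 'num_kv_heads': 4}
```

For depth=8 this reproduces the values printed in the training log.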

Batch Size Configuration

Tokens / micro-batch / rank: 32 x 2048 = 65,536
Total batch size 524,288 => gradient accumulation steps: 8

What this means:

  • Each GPU processes 32 sequences of 2048 tokens
  • Target total batch: 524,288 tokens
  • Need 8 gradient accumulation steps to reach target
  • Formula: grad_accum = total_batch / (device_batch × seq_len × world_size)
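The formula can be checked with a few lines of Python. This is a sketch of the arithmetic only; the function name is illustrative, and the divisibility check is an assumption about what a reasonable script would enforce.

```python
def grad_accum_steps(total_batch_size: int, device_batch_size: int,
                     max_seq_len: int, world_size: int = 1) -> int:
    """Number of micro-steps needed to reach the target batch size (in tokens)."""
    tokens_per_micro_step = device_batch_size * max_seq_len * world_size
    if total_batch_size % tokens_per_micro_step != 0:
        raise ValueError("total_batch_size must be a multiple of "
                         "device_batch_size * max_seq_len * world_size")
    return total_batch_size // tokens_per_micro_step

print(grad_accum_steps(524_288, 32, 2048))                # 8  (single GPU, as in the log)
print(grad_accum_steps(524_288, 32, 2048, world_size=8))  # 1  (eight GPUs)
print(grad_accum_steps(524_288, 16, 2048))                # 16 (halved device batch)
```

Halving the device batch size doubles the accumulation steps, and adding GPUs reduces them, while the total batch stays at 524,288 tokens.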

Training Horizon

Using user-provided number of iterations: 100
Total number of training tokens: 52,428,800
Tokens : Params ratio: 4.73

Three ways to specify training length:

  1. Explicit iterations (what we used):

    --num_iterations=100
  2. Target FLOPs (for scaling laws experiments):

    --target_flops=1e19
  3. Data:param ratio (default, Chinchilla-optimal):

    --target_param_data_ratio=20  # Default

For our d8 model:

  • 11M params × 20 = 220M tokens
  • 220M / 524K batch = 420 steps (Chinchilla-optimal)
  • We used 100 steps for speed
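The arithmetic behind these numbers is short enough to script. The sketch below assumes the step count is the target token budget divided by the batch size and rounded down; the function names are illustrative.

```python
def iterations_for_ratio(num_params: int, total_batch_size: int = 524_288,
                         ratio: int = 20) -> int:
    """Steps needed to see `ratio` tokens per parameter (rounded down)."""
    target_tokens = ratio * num_params
    return target_tokens // total_batch_size

def tokens_to_params_ratio(num_iterations: int, num_params: int,
                           total_batch_size: int = 524_288) -> float:
    """Tokens seen per parameter for a run of `num_iterations` steps."""
    return num_iterations * total_batch_size / num_params

d8_params = 11_091_456  # exact count printed in the training log
print(iterations_for_ratio(d8_params))                   # 423
print(round(tokens_to_params_ratio(100, d8_params), 2))  # 4.73
```

With the exact parameter count the target comes out at 423 steps, close to the ~420 obtained from the rounded 11M figure, and the 100-step run reproduces the 4.73 ratio shown in the log.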

Training Step Output

step 00042/00100 (42.00%) | loss: 5.234567 | lrm: 1.00 | dt: 1145.67ms | tok/sec: 573,892 | mfu: 42.05 | total time: 0.80m

Field breakdown:

| Field | Meaning | Good values |
| --- | --- | --- |
| step 00042/00100 | Current step / total | - |
| (42.00%) | Progress percentage | - |
| loss: 5.234567 | Training loss (cross-entropy) | Decreasing |
| lrm: 1.00 | Learning rate multiplier | 1.0 during training, 0.0 at end |
| dt: 1145.67ms | Step duration | Lower is better |
| tok/sec: 573,892 | Token throughput | Higher is better |
| mfu: 42.05 | Model FLOPs Utilization (%) | 35-50% typical |
| total time: 0.80m | Cumulative time | - |
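The learning rate multiplier (lrm) scales every base learning rate by the same factor. One schedule consistent with "1.0 during training, 0.0 at end" holds the multiplier at 1.0 and then decays it linearly to zero over the final part of the run. The sketch below is an assumption for illustration: the function name and the `warmdown_ratio` value of 0.2 are hypothetical, and the actual schedule in the training script may differ.

```python
def lr_multiplier(step: int, num_iterations: int, warmdown_ratio: float = 0.2) -> float:
    """Constant 1.0, then a linear decay to 0.0 over the last `warmdown_ratio` of steps."""
    warmdown_steps = round(warmdown_ratio * num_iterations)
    if warmdown_steps == 0 or step <= num_iterations - warmdown_steps:
        return 1.0
    return max(0.0, (num_iterations - step) / warmdown_steps)

for step in (0, 80, 90, 100):
    print(step, lr_multiplier(step, 100))
# 0 1.0
# 80 1.0
# 90 0.5
# 100 0.0
```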

MFU (Model FLOPs Utilization):

MFU = actual_flops_per_sec / theoretical_peak_flops

For H100 (bfloat16): theoretical peak = 989 TFLOPs/s

Typical MFU values:

  • 35-45%: Good (small models)
  • 45-55%: Excellent (medium models)
  • 55-60%: Outstanding (large models, good kernels)
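To make the definition concrete, here is a small sketch that turns throughput into an MFU percentage. The function name is illustrative, and the example inputs are round numbers chosen to make the arithmetic easy to follow, not measurements from a real run.

```python
H100_BF16_PEAK_FLOPS = 989e12  # theoretical peak per GPU, as quoted above

def mfu_percent(flops_per_token: float, tokens_per_sec: float, num_gpus: int = 1,
                peak_flops_per_gpu: float = H100_BF16_PEAK_FLOPS) -> float:
    """MFU = achieved FLOPs per second / theoretical peak FLOPs per second, in percent."""
    achieved_flops_per_sec = flops_per_token * tokens_per_sec
    peak_flops_per_sec = peak_flops_per_gpu * num_gpus
    return 100.0 * achieved_flops_per_sec / peak_flops_per_sec

# Hypothetical example: 989 MFLOPs per token at 500,000 tokens/sec on one H100
print(f"{mfu_percent(989e6, 500_000):.1f}%")  # 50.0%
```

Note that the peak scales with the number of GPUs, so the same aggregate throughput on more GPUs corresponds to a lower MFU.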

Validation Evaluation

Every 250 steps (by default), the training script evaluates the model on validation data:

Step 00000 | Validation bpb: 4.5678
Step 00250 | Validation bpb: 1.8234
Step 00500 | Validation bpb: 1.6789

Bits per byte (bpb):

  • 4.5: Untrained model (high)
  • 2.0: Learning something
  • 1.5: Decent small model
  • 1.0: Good medium model
  • 0.6: GPT-4 level (estimated)

See Loss Landscape & Scaling Laws for details on bpb.
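Bits per byte normalizes the loss by the length of the text in bytes rather than in tokens, which makes models with different tokenizers comparable. A minimal sketch of the conversion, assuming per-token cross-entropy losses in nats and a known byte length for each target token (the function name is illustrative; a real evaluation may also need to skip tokens that have no byte representation):

```python
import math

def bits_per_byte(token_losses_nats: list[float], token_byte_lengths: list[int]) -> float:
    """Convert summed cross-entropy (nats) into bits per byte of underlying text."""
    total_nats = sum(token_losses_nats)
    total_bytes = sum(token_byte_lengths)
    return total_nats / (math.log(2) * total_bytes)  # nats -> bits, then per byte

# Two tokens, each costing 4 bits (4 * ln 2 nats) and each covering 2 bytes of text
losses = [4 * math.log(2), 4 * math.log(2)]
print(bits_per_byte(losses, [2, 2]))  # approximately 2.0
```

This is presumably where the token_bytes.pt mapping created during tokenizer training comes in: it supplies the byte length of every token.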

Final Statistics

Peak memory usage: 8,234.56 MiB
Total training time: 1.92m
Minimum validation bpb: 2.8945

Memory usage:

  • d8 model: ~8 GB (fits on consumer GPUs!)
  • d20 model: ~42 GB at the default device_batch_size of 32 (needs an A100/H100-class GPU)
  • d26 model: ~76 GB at the default device_batch_size of 32 (needs an 80GB GPU, or a reduced batch size)

Part 4: Training a Real Model (d20)

Now let's train the "standard" nanochat model—depth 20 (83M parameters):

Step 1: Download More Data

# d20 with 20× data ratio needs ~1.66B tokens
# At ~250M chars/shard and an assumed ~4.8 chars/token, that is about 32 shards;
# downloading 100 leaves plenty of headroom
python -m nanochat.dataset -n 100 -w 8

Step 2: Launch Training

Single GPU:

python scripts/base_train.py --depth=20

Multi-GPU (8× GPUs):

torchrun --standalone --nproc_per_node=8 scripts/base_train.py --depth=20

Training time:

  • Single GPU: ~32 hours
  • 8× GPUs: ~4 hours

Step 3: Monitor Progress

The script outputs progress regularly. Key things to watch:

1. Loss should decrease smoothly:

step 00000: loss=10.38
step 00100: loss=4.12
step 00500: loss=2.89
step 01000: loss=2.34
step 02000: loss=1.98
step 03000: loss=1.78

2. Validation bpb should improve:

Step 00000 | Validation bpb: 4.4567
Step 00250 | Validation bpb: 2.1234
Step 00500 | Validation bpb: 1.8901
Step 00750 | Validation bpb: 1.7234
Step 01000 | Validation bpb: 1.6123
...
Step 03167 | Validation bpb: 1.4501

3. Model samples (every 2000 steps):

Step 02000 | Sample outputs:
The capital of France is Paris.
The chemical symbol of gold is Au
If yesterday was Friday, then tomorrow will be Sunday

These get better over time!

Step 4: Find Your Checkpoint

After training completes, find your model:

ls -lh base_checkpoints/d20/

Output:

checkpoint_003167.pt     # 335 MB - the final model
meta.json                # Metadata (config, metrics)

Part 5: Configuration Deep-Dive

nanochat uses a "Poor Man's Configurator"—you can override any training parameter via command line or config file.

Command-Line Overrides

python scripts/base_train.py \
    --depth=20 \
    --device_batch_size=16 \
    --max_seq_len=1024 \
    --matrix_lr=0.01 \
    --num_iterations=5000

Common parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| depth | 20 | Model depth (scales everything else) |
| max_seq_len | 2048 | Context length |
| device_batch_size | 32 | Per-GPU batch size |
| total_batch_size | 524288 | Total tokens per step |
| num_iterations | -1 | Explicit step count (-1 = auto) |
| target_param_data_ratio | 20 | Chinchilla ratio |
| matrix_lr | 0.02 | Muon learning rate |
| embedding_lr | 0.2 | AdamW LR for embeddings |
| grad_clip | 1.0 | Gradient clipping (0 = off) |
| eval_every | 250 | Validation frequency |
| run | "dummy" | Wandb run name ("dummy" = no logging) |

Config Files

Create config/my_experiment.py:

# Smaller model for quick testing
depth = 12
num_iterations = 500
device_batch_size = 16
 
# More aggressive learning rates
matrix_lr = 0.03
embedding_lr = 0.3
 
# Wandb logging
run = "d12_fast_lr"

Run with:

python scripts/base_train.py config/my_experiment.py

You can combine config files with CLI overrides:

python scripts/base_train.py config/my_experiment.py --depth=16
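The idea behind this style of configuration is that defaults come first, a config file can replace some of them, and `--key=value` flags win last. The sketch below illustrates the override step on a plain dictionary. It is an approximation of the approach under stated assumptions, not the repository's configurator: the function name, the type check, and the error messages are all hypothetical.

```python
from ast import literal_eval

def apply_overrides(config: dict, args: list[str]) -> dict:
    """Return a copy of `config` with --key=value overrides applied."""
    result = dict(config)
    for arg in args:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError(f"expected --key=value, got {arg!r}")
        key, raw = arg[2:].split("=", 1)
        if key not in result:
            raise ValueError(f"unknown config key: {key}")
        try:
            value = literal_eval(raw)  # parses numbers, booleans, quoted strings
        except (ValueError, SyntaxError):
            value = raw                # anything else is kept as a plain string
        if type(value) is not type(result[key]):
            raise TypeError(f"{key} expects {type(result[key]).__name__}, "
                            f"got {type(value).__name__}")
        result[key] = value
    return result

defaults = {"depth": 20, "matrix_lr": 0.02, "run": "dummy"}
file_values = {"depth": 12, "run": "d12_fast_lr"}  # what a config file might set
merged = apply_overrides({**defaults, **file_values}, ["--depth=16"])
print(merged)  # {'depth': 16, 'matrix_lr': 0.02, 'run': 'd12_fast_lr'}
```

Requiring the override to have the same type as the default catches mistakes such as passing a string where a number is expected before the run starts.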

Memory Management

Out of memory? Reduce device_batch_size:

# Original (32 → 65K tokens per GPU)
python scripts/base_train.py --depth=26
 
# Reduced (16 → 32K tokens per GPU)
python scripts/base_train.py --depth=26 --device_batch_size=16
 
# Further reduced (8 → 16K tokens per GPU)
python scripts/base_train.py --depth=26 --device_batch_size=8

The script automatically compensates by increasing gradient accumulation:

  • Same total batch size (524K tokens)
  • Same final model quality
  • Slightly slower (more sequential compute)
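Why quality stays the same: averaging the gradients of equally sized micro-batches gives the same result as computing the gradient over the full batch at once. The toy example below demonstrates this for a one-parameter model with a mean-squared-error loss. It is a pure-Python illustration with made-up function names, not the training script's code.

```python
def mse_grad(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of mean squared error with respect to w for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w: float, xs: list[float], ys: list[float],
                     micro_batch_size: int) -> float:
    """Accumulate micro-batch gradients, each scaled by 1 / number of micro-steps."""
    assert len(xs) % micro_batch_size == 0
    steps = len(xs) // micro_batch_size
    total = 0.0
    for i in range(steps):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / steps
    return total

xs = [float(i) for i in range(1, 9)]
ys = [2.0 * x for x in xs]
print(mse_grad(0.5, xs, ys))             # -76.5 (full batch of 8)
print(accumulated_grad(0.5, xs, ys, 2))  # -76.5 (4 micro-batches of 2)
```

The only cost is that the micro-batches run one after another instead of in parallel, which is why training is slightly slower.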

Memory usage by model size:

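| depth | params | VRAM (batch=32) | VRAM (batch=16) | VRAM (batch=8) |
| --- | --- | --- | --- | --- |
| 8 | 11M | 8 GB | 6 GB | 5 GB |
| 12 | 30M | 16 GB | 12 GB | 10 GB |
| 16 | 54M | 28 GB | 20 GB | 16 GB |
| 20 | 83M | 42 GB | 32 GB | 26 GB |
| 24 | 118M | 64 GB | 48 GB | 38 GB |
| 26 | 140M | 76 GB | 56 GB | 44 GB |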

Part 6: Monitoring with Wandb

Enable Weights & Biases logging for beautiful dashboards:

# Install wandb
uv pip install wandb
 
# Login (first time only)
wandb login
 
# Train with logging
python scripts/base_train.py --run=my_experiment_name

What gets logged:

  1. Training metrics (every 100 steps):

    • train/loss: Cross-entropy loss
    • train/lrm: Learning rate multiplier
    • train/dt: Step duration
    • train/tok_per_sec: Token throughput
    • train/mfu: Model FLOPs utilization
  2. Validation metrics (every 250 steps):

    • val/bpb: Bits per byte
  3. CORE metrics (every 2000 steps):

    • core_metric: Overall score
    • Individual task scores
  4. System metrics:

    • total_training_flops: Cumulative FLOPs
    • total_training_time: Wall-clock time

Wandb dashboard view:

my_experiment_name
├── train/loss        [line chart: decreasing]
├── val/bpb          [line chart: decreasing]
├── train/mfu        [line chart: stable ~40-50%]
├── core_metric      [line chart: increasing]
└── System           [GPU utilization, memory, etc.]
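A common way to make logging optional is to hand the training loop an object with the same `log()` and `finish()` methods whether or not Wandb is in use. The sketch below shows that pattern under stated assumptions: the `DummyLogger` class, the `make_logger` function, and the project name "nanochat" are hypothetical names for illustration, and the real script may wire this up differently. The only Wandb call used is `wandb.init(project=..., name=...)`.

```python
from typing import Optional

class DummyLogger:
    """No-op stand-in exposing the same log()/finish() surface as a Wandb run."""

    def log(self, metrics: dict, step: Optional[int] = None) -> None:
        pass  # silently drop metrics

    def finish(self) -> None:
        pass

def make_logger(run: str, project: str = "nanochat"):
    """Return a no-op logger for run == "dummy", otherwise start a Wandb run."""
    if run == "dummy":
        return DummyLogger()
    import wandb  # imported lazily so the dummy path works without wandb installed
    return wandb.init(project=project, name=run)

logger = make_logger("dummy")
logger.log({"train/loss": 4.12, "train/mfu": 42.05}, step=100)
logger.finish()
```

Because the training loop only ever calls `log()` and `finish()`, it needs no conditional logic around logging.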

Part 7: Testing Your Model

After training, let's test what the model learned!

Quick CLI Test

python scripts/chat_cli.py

Interactive session:

You: The capital of France is
Assistant: Paris. The city is known for the Eiffel Tower and the Louvre Museum.

You: If 2+2=4, then 3+3=
Assistant: 6

You: Why is the sky blue?
Assistant: The sky appears blue because of Rayleigh scattering...

(Your d8 model won't be this good yet—that's from a larger model!)

Web UI

python scripts/chat_web.py

Then visit http://localhost:8000 (or your server's IP if remote).

Evaluation Benchmarks

Evaluate on the CORE benchmark:

python scripts/base_eval.py

This loads the latest checkpoint and evaluates it on the CORE benchmark (~15 minutes on 8 GPUs).

Output:

Model: base_model (step 3167)
================================================================================
Task                                , Accuracy  , Centered
arc_challenge                       , 0.2134    , 0.2645
arc_easy                            , 0.3456    , 0.3789
...
CORE                                ,           , 0.2219

Part 8: Common Issues and Solutions

Issue 1: Loss is NaN

step 00042: loss=nan

Causes:

  • Learning rate too high
  • Numerical instability
  • Corrupted data

Defensive handling in the training loop lets a run skip a bad batch instead of crashing. The snippet below is a sketch; names such as autocast_ctx, model, optimizer, and step are assumed to come from the surrounding loop:

# Wrap the forward/backward pass inside the training loop
import logging
import torch

try:
    with autocast_ctx:
        loss = model(train_inputs, train_targets)

    # A non-finite loss (NaN or inf) indicates training instability
    if not torch.isfinite(loss):
        logging.warning(f"Non-finite loss at step {step}. Skipping batch.")
        optimizer.zero_grad(set_to_none=True)
        continue

    loss.backward()

except torch.cuda.OutOfMemoryError:
    # Drop partial gradients, release cached memory, and move to the next batch
    logging.error(f"OOM at step {step}. Clearing cache and skipping batch.")
    optimizer.zero_grad(set_to_none=True)
    torch.cuda.empty_cache()
    continue

Solutions:

# Reduce learning rates
python scripts/base_train.py --matrix_lr=0.01 --embedding_lr=0.1
 
# Enable gradient clipping (should be on by default)
python scripts/base_train.py --grad_clip=1.0

Issue 2: Loss Not Decreasing

step 00000: loss=10.38
step 00100: loss=10.35
step 00200: loss=10.32

Causes:

  • Learning rate too low
  • Not enough training steps
  • Model too small for data

Solutions:

# Increase learning rates
python scripts/base_train.py --matrix_lr=0.03 --embedding_lr=0.3
 
# Train longer
python scripts/base_train.py --num_iterations=1000
 
# Use larger model
python scripts/base_train.py --depth=16

Issue 3: OOM (Out of Memory)

torch.cuda.OutOfMemoryError: CUDA out of memory

Solution:

# Reduce batch size (automatically increases grad accumulation)
python scripts/base_train.py --device_batch_size=16
 
# Or reduce sequence length
python scripts/base_train.py --max_seq_len=1024
 
# Or use smaller model
python scripts/base_train.py --depth=12

Issue 4: Slow Training

tok/sec: 50,000  (expected: 500,000+)

Causes:

  • CPU bottleneck (data loading)
  • Inefficient GPU utilization
  • Small batch size

Solutions:

# Increase device batch size (if memory allows)
python scripts/base_train.py --device_batch_size=64
 
# Check data loading isn't bottleneck
# (should see GPU utilization near 100% in nvidia-smi)

Issue 5: Can't Resume Training

By default, nanochat doesn't save intermediate checkpoints that a base training run can resume from (this keeps the code simple). To train longer:

# Option 1: Increase iterations from the start
python scripts/base_train.py --num_iterations=5000
 
# Option 2: Continue with the separate mid-training stage (covered in later posts)
python scripts/mid_train.py

Part 9: Scaling Up

Scaling to Multiple GPUs

2 GPUs:

torchrun --standalone --nproc_per_node=2 scripts/base_train.py

4 GPUs:

torchrun --standalone --nproc_per_node=4 scripts/base_train.py

8 GPUs:

torchrun --standalone --nproc_per_node=8 scripts/base_train.py

What changes:

  • Training time drops roughly in proportion to the number of GPUs (e.g., ~32 hours on one GPU vs. ~4 hours on eight for d20)
  • Memory per GPU stays the same
  • Gradient accumulation automatically reduced
  • Final model quality identical

Scaling to Larger Models

d12 (30M params) - $30 tier:

torchrun --standalone --nproc_per_node=8 scripts/base_train.py \
    --depth=12 \
    --run=d12_experiment

d26 (140M params) - $300 tier:

# Need more data
python -m nanochat.dataset -n 450 -w 8
 
# Train (reduce batch size for memory)
torchrun --standalone --nproc_per_node=8 scripts/base_train.py \
    --depth=26 \
    --device_batch_size=16 \
    --run=d26_experiment

Performance expectations:

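| Model | Params | Training time (8×H100) | Final bpb | CORE | Cost |
| --- | --- | --- | --- | --- | --- |
| d8 | 11M | 30 min | 2.1 | 0.12 | $12 |
| d12 | 30M | 90 min | 1.8 | 0.18 | $36 |
| d16 | 54M | 2.5 hours | 1.6 | 0.25 | $60 |
| d20 | 83M | 4 hours | 1.45 | 0.28 | $96 |
| d26 | 140M | 12 hours | 1.3 | 0.35 | $288 |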

(At $24/hr for 8×H100)
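The cost column is simply training time multiplied by the hourly rate, so it is easy to re-estimate for a different price. A small sketch, where the function name is illustrative and the $24/hr default is the rate quoted above:

```python
def training_cost_usd(hours: float, hourly_rate: float = 24.0) -> float:
    """Cost of a run on a node billed at `hourly_rate` dollars per hour."""
    return hours * hourly_rate

for name, hours in [("d8", 0.5), ("d12", 1.5), ("d16", 2.5), ("d20", 4), ("d26", 12)]:
    print(f"{name}: ${training_cost_usd(hours):.0f}")
# d8: $12
# d12: $36
# d16: $60
# d20: $96
# d26: $288
```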

Part 10: Next Steps

Congratulations! You've trained your first language model from scratch. Here's what to explore next:

1. Fine-tune for chat

2. Understand the architecture

3. Experiment with hyperparameters:

  • Try different learning rates
  • Test longer training runs
  • Compare model sizes

4. Train your own tokenizer

5. Build custom evaluations

Conclusion

Training a language model from scratch is no longer magic—it's engineering. With nanochat, you can:

✅ Train models from 11M to 140M+ parameters
✅ Use single GPU or scale to 8× GPUs
✅ Monitor training with comprehensive metrics
✅ Evaluate with standardized benchmarks
✅ Deploy as a chat interface

The key insights:

  1. Start small: Train d8 (11M) first to verify your setup
  2. Monitor metrics: Watch loss, bpb, and sample outputs
  3. Manage memory: Adjust device_batch_size to fit your GPU
  4. Scale gradually: d8 → d12 → d20 → d26
  5. Iterate fast: Use config files for experiments

You now have the foundation to train, evaluate, and deploy language models. The next posts will build on this foundation to create more capable systems.

Part of the nanochat Deep-Dive Series • Track 2: Practical Guides

GitHub: nanochat repository
Training Script: scripts/base_train.py

TIP

Experiment notebooks: Due to reader interest, interactive Jupyter notebooks for hands-on experiments are planned. Let us know if you'd like to see them!
