José David Baena

Loss Landscape & Scaling Laws: Understanding Training Dynamics


Cross-entropy loss tells you nothing about what your model knows

Early in my LLM training experiments, I spent hours comparing loss numbers between models with different tokenizers. Completely meaningless. Understanding bits-per-byte and Chinchilla scaling laws changed how I approach every training decision.

Your loss number is meaningless for comparison. Bits-per-byte normalizes across tokenizers—and Chinchilla showed us how to allocate compute.

TL;DR: Bits-per-byte normalizes across tokenizers. The CORE benchmark provides centered metrics (0 = random, 1 = perfect). Chinchilla scaling laws show that ~20× data-to-params is roughly compute-optimal. These three insights guide every training decision.

The benchmark that lied: Consider a scenario that happens more often than people admit: celebrating hitting 2.8 cross-entropy loss—lower than a baseline at 3.1—only to find the model is worse when deployed. The problem: a tokenizer change. A new tokenizer with 100K vocabulary (vs 50K in the baseline) produces fewer tokens per sentence. Lower loss, but not from a better model—from an easier prediction task. After converting to bits-per-byte, the reality emerges: the model is 0.15 bpb worse than baseline. Months of "progress" went backwards. Always normalize your metrics.

Training a language model is not just about running gradient descent until convergence. Understanding how the loss evolves, why certain hyperparameters work, and what scaling laws govern model performance can mean the difference between wasted compute and efficient training runs.

nanochat's training infrastructure includes evaluation mechanisms that go beyond simple loss tracking. It implements bits-per-byte (bpb)—a tokenization-agnostic metric—and the CORE benchmark—an 11-task evaluation suite. Together, these tools illuminate training dynamics and enable principled decision-making about model architecture, data requirements, and compute budgets.

This final post of Track 1 dissects the bits-per-byte metric, explains CORE benchmark mechanics, and shows how these tools reveal the fundamental scaling laws governing language model training.

Cross-entropy loss is tokenizer-dependent and incomparable

The standard training metric is cross-entropy loss:

loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))

This seems straightforward, but it has a critical flaw: the loss is not comparable across different tokenizers.

Why Cross-Entropy Loss is Tokenizer-Dependent

Consider two tokenizers:

Tokenizer A (vocab size = 10K):

  • "Hello world" → [1523, 892]
  • Average tokens per sentence: 10
  • Cross-entropy loss: 3.2

Tokenizer B (vocab size = 100K):

  • "Hello world" → [45123]
  • Average tokens per sentence: 5
  • Cross-entropy loss: 2.8

Which is better? You can't tell! Tokenizer B has lower loss, but it's also predicting from a 10× larger vocabulary.

The cross-entropy loss depends on:

  1. Vocabulary size: Larger vocab → higher entropy → higher loss
  2. Token granularity: Byte-level vs word-level tokenization
  3. Special tokens: How are they handled?

This makes it impossible to:

  • Compare models trained with different tokenizers
  • Experiment with vocabulary size
  • Compare against published baselines (which use different tokenizers)

In practice, reviewers will rightfully reject cross-tokenizer loss comparisons. Always convert to bpb before claiming your model beats a baseline.

Bits-per-byte normalizes loss across any tokenizer

nanochat uses bits-per-byte (bpb) - a tokenization-agnostic metric that normalizes loss by the actual byte content being predicted.

The Core Idea

Instead of computing loss per token, compute loss per byte:

bpb = total_nats / (log(2) × total_bytes)

Where:

  • total_nats: Sum of cross-entropy losses (in natural log units)
  • total_bytes: Total number of UTF-8 bytes in all target tokens
  • log(2): Conversion factor from nats to bits

Key insight: The byte content is tokenizer-independent. "Hello world" is always 11 bytes, regardless of how you tokenize it.
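To make that concrete, here is a tiny sketch with made-up per-token losses: two tokenizations of the same 11-byte string give the same bpb whenever the total nats (the total log-probability assigned to the text) match.

import math

total_bytes = len("Hello world".encode("utf-8"))  # 11 bytes, regardless of tokenizer

# Tokenizer A: 2 tokens; Tokenizer B: 1 token (losses are illustrative)
nats_a = 3.1 + 3.3   # total nats under tokenization A
nats_b = 6.4         # total nats under tokenization B

bpb_a = nats_a / (math.log(2) * total_bytes)
bpb_b = nats_b / (math.log(2) * total_bytes)
print(f"bpb A = {bpb_a:.3f}, bpb B = {bpb_b:.3f}")  # both ≈ 0.839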

Implementation

nanochat's implementation from nanochat/loss_eval.py:

nanochat/loss_eval.py
@torch.no_grad()
def evaluate_bpb(model, batches, steps, token_bytes):
    """
    Calculate bits per byte (bpb) - a tokenization-agnostic metric.
    
    Args:
        model: The model to evaluate
        batches: Iterator over (inputs, targets) batches
        steps: Number of evaluation steps
        token_bytes: 1D tensor mapping token_id → num_bytes (0 for special tokens)
    """
    total_nats = torch.tensor(0.0, dtype=torch.float32, device=model.get_device())
    total_bytes = torch.tensor(0, dtype=torch.int64, device=model.get_device())
    
    batch_iter = iter(batches)
    for _ in range(steps):
        x, y = next(batch_iter)
        
        # Get per-token losses (no reduction)
        loss2d = model(x, y, loss_reduction='none')  # (B, T)
        loss2d = loss2d.view(-1)  # flatten
        y = y.view(-1)  # flatten
        
        if (y < 0).any():
            # Handle ignore_index (e.g., -1 for padding)
            valid = y >= 0
            y_safe = torch.where(valid, y, torch.zeros_like(y))
            num_bytes2d = torch.where(
                valid,
                token_bytes[y_safe],
                torch.zeros_like(y, dtype=token_bytes.dtype)
            )
            total_nats += (loss2d * (num_bytes2d > 0)).sum()
            total_bytes += num_bytes2d.sum()
        else:
            # Fast path: no ignored targets
            num_bytes2d = token_bytes[y]
            total_nats += (loss2d * (num_bytes2d > 0)).sum()
            total_bytes += num_bytes2d.sum()
    
    # Sum across all distributed ranks
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    if world_size > 1:
        dist.all_reduce(total_nats, op=dist.ReduceOp.SUM)
        dist.all_reduce(total_bytes, op=dist.ReduceOp.SUM)
    
    # Convert to bits per byte
    total_nats = total_nats.item()
    total_bytes = total_bytes.item()
    bpb = total_nats / (math.log(2) * total_bytes)
    return bpb
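A typical call looks like the following sketch (build_val_loader and token_bytes are set up elsewhere in the training script, as the training-loop excerpt later shows; the exact wiring may differ):

model.eval()
val_loader = build_val_loader()      # yields (inputs, targets) batches
eval_steps = 100                     # number of batches to average over
val_bpb = evaluate_bpb(model, val_loader, eval_steps, token_bytes)
print(f"validation bpb: {val_bpb:.4f}")
model.train()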

Key Design Decisions

1. Per-Token Loss Calculation

loss2d = model(x, y, loss_reduction='none')  # (B, T)

We need per-token losses to weight them by token byte length. Standard reduction='mean' would lose this information.

2. Token Byte Lookup

num_bytes2d = token_bytes[y]

The token_bytes tensor is precomputed during tokenizer training (from scripts/tok_train.py):

scripts/tok_train.py - Token Byte Computation
# Generated by scripts/tok_train.py
token_bytes = torch.zeros(vocab_size, dtype=torch.int64)
for token_id in range(vocab_size):
    token_str = tokenizer.decode([token_id])
    token_bytes[token_id] = len(token_str.encode('utf-8'))
    
# Special tokens get 0 bytes (excluded from metric)
for special_token_id in special_token_ids:
    token_bytes[special_token_id] = 0

Example token bytes:

  • "hello" (5 characters) → 5 bytes
  • "世界" (2 characters) → 6 bytes (UTF-8)
  • "<|bos|>" (special) → 0 bytes (excluded)

3. Byte Weighting

total_nats += (loss2d * (num_bytes2d > 0)).sum()
total_bytes += num_bytes2d.sum()

  • Multiply each loss by whether its token contributes bytes (0 or 1)
  • Sum total bytes separately
  • This excludes special tokens (0 bytes) from both numerator and denominator

4. Distributed Reduction

if world_size > 1:
    dist.all_reduce(total_nats, op=dist.ReduceOp.SUM)
    dist.all_reduce(total_bytes, op=dist.ReduceOp.SUM)

Each rank evaluates a different slice of the data (strided sharding), so we sum nats and bytes across all ranks before computing the final ratio.

Bits Per Byte Calculator

[Interactive widget in the original post: converts between loss (bits/token), perplexity, bits per byte, and compression ratio vs raw bytes.]

Reference bpb values (from the widget's comparison chart):

  • Random guess (32k vocab): 3.33
  • GPT-2 (WebText): 0.93
  • GPT-3 (1-shot): 0.76
  • Human (Shannon game): 0.70
  • Theoretical limit (English): 0.40

Conversion formulas:

bits = nats / ln(2) ≈ nats × 1.443
perplexity = e^(loss_nats)
bpb = bits_per_token / avg_bytes_per_token
compression_ratio = 8 / bpb

bpb Interpretation

What does a bpb value mean?

bpb = 1.5  →  On average, the model uses 1.5 bits per byte of text

Theoretical limits:

| Compression | bpb | Interpretation |
| --- | --- | --- |
| Perfect | 0.0 | Model predicts everything perfectly |
| gzip | ~4.5 | Classical compression algorithm |
| Random | 8.0 | No compression (1 byte = 8 bits) |

Typical model performance:

| Model | bpb | Context |
| --- | --- | --- |
| GPT-2 small | ~1.8 | 124M params, pre-2020 architecture |
| nanochat d20 | ~1.45 | ~561M params, modern architecture |
| GPT-3 175B | ~0.8 | Large-scale model |
| GPT-4 | ~0.6 | State-of-the-art (estimated) |

For your training runs, this means: always report bpb alongside loss. A bpb of 1.45 tells you something meaningful—your model compresses text roughly 3× better than gzip.

For your model comparisons, this means: bpb is the great equalizer. When comparing your model to published results with different tokenizers, bpb is the only number that matters.

bpb vs Cross-Entropy Comparison

Cross-entropy loss:

# Simple average - tokenizer dependent
loss = sum(cross_entropy(pred, target)) / num_tokens
# Example: 2.8 (what does this mean?)

Bits per byte:

# Byte-normalized - tokenizer independent
bpb = sum(cross_entropy(pred, target)) / (log(2) × num_bytes)
# Example: 1.45 bits/byte (comparable across tokenizers!)

Concrete example:

Text: "The quick brown fox jumps"

| Tokenizer | Tokens | Cross-entropy (nats/token) | Bytes | bpb |
| --- | --- | --- | --- | --- |
| GPT-4 (100K) | 5 | 3.36 | 25 | ~0.97 |
| GPT-2 (50K) | 6 | 2.80 | 25 | ~0.97 |
| Character-level (256) | 25 | 0.67 | 25 | ~0.97 |

All three have the same bpb despite different token counts and losses—that's the power of normalization. ✅
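If you want to check the arithmetic, here is a short sketch using the illustrative losses from the table (note that the coarser the tokenization, the higher the per-token loss, because each token carries more bytes):

import math

text_bytes = len("The quick brown fox jumps".encode("utf-8"))  # 25 bytes
rows = {
    "GPT-4 (100K)":          (5, 3.36),   # (num_tokens, mean loss in nats)
    "GPT-2 (50K)":           (6, 2.80),
    "Character-level (256)": (25, 0.67),
}
for name, (num_tokens, mean_loss) in rows.items():
    bpb = (mean_loss * num_tokens) / (math.log(2) * text_bytes)
    print(f"{name}: bpb ≈ {bpb:.2f}")   # all ≈ 0.97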

CORE benchmark provides centered accuracy metrics

While bpb measures how well the model compresses text, it doesn't tell us how well it understands language. Enter the CORE benchmark.

What is CORE?

CORE (Common-sense Reasoning Evaluation) is a comprehensive benchmark from the DCLM paper that evaluates models on 11 diverse tasks:

| Category | Tasks |
| --- | --- |
| Knowledge | MMLU (5-shot), HellaSwag, PIQA |
| Reading Comprehension | SQUAD, BoolQ, SciQ, ARC-Easy, ARC-Challenge |
| Common Sense | OpenBookQA, Winogrande |
| Problem Solving | SQUAD |

Why CORE?

Single-number metric:

CORE = average of centered accuracies across all 11 tasks

Where centered accuracy adjusts for random baseline:

centered = (accuracy - random_baseline) / (1.0 - random_baseline)

Benefits:

  • ✅ Comprehensive coverage of language understanding
  • ✅ Balances multiple skill types
  • ✅ Accounts for task difficulty
  • ✅ Enables comparison with published models

Evaluation Metrics Dashboard

[Interactive widget in the original post: per-benchmark scores for your model, with reference averages such as GPT-2 (124M) at 38.7% and LLaMA-7B at 57.1%.]

Benchmark Details

  • HellaSwag: Commonsense reasoning about physical situations. Tests if models can complete sentences about everyday scenarios.
  • ARC: Grade-school science questions. Tests basic scientific knowledge and reasoning.
  • Winogrande: Pronoun resolution requiring world knowledge. Tests understanding of context and references.
  • PIQA: Physical intuition about how objects interact. Tests common-sense physics understanding.
  • MMLU: Multi-domain multiple choice covering 57 subjects. Tests broad academic knowledge.
  • TruthfulQA: Tests whether models avoid common misconceptions and falsehoods humans believe.

Task Types

The CORE benchmark includes three task types:

1. Multiple Choice

Example from HellaSwag:

Context: "A person is climbing a rock wall."
Choices:
  A) "The person reaches the top and waves."
  B) "The person is eating a sandwich."
  C) "The wall turns into water."
  D) "The rock becomes a bird."
Gold: A

Evaluation method: Choose option with lowest perplexity.

2. Schema (Cloze Completion)

Example from Winogrande:

Context options:
  - "The trophy doesn't fit in the suitcase because it is too large."
  - "The trophy doesn't fit in the suitcase because it is too small."
Continuation: "it" refers to the trophy
Gold: Option 1

Evaluation method: Choose context with lowest perplexity for continuation.

3. Language Modeling

Example from SQUAD:

Context: "The Normans were originally people from..."
Continuation: "northern France"

Evaluation method: Check if model's greedy predictions match the continuation.

Implementation Deep-Dive

nanochat's CORE evaluation from nanochat/core_eval.py:

nanochat/core_eval.py
@torch.no_grad()
def evaluate_example(idx, model, tokenizer, data, device, task_meta):
    """Evaluate a single example."""
    item = data[idx]
    task_type = task_meta['task_type']
    num_fewshot = task_meta['num_fewshot']
    
    # Sample few-shot examples (deterministic based on idx)
    fewshot_examples = []
    if num_fewshot > 0:
        rng = random.Random(1234 + idx)
        available_indices = [i for i in range(len(data)) if i != idx]
        fewshot_indices = rng.sample(available_indices, num_fewshot)
        fewshot_examples = [data[i] for i in fewshot_indices]
    
    # Render prompts based on task type
    if task_type == 'multiple_choice':
        prompts = render_prompts_mc(item, continuation_delimiter, fewshot_examples)
        tokens, start_idxs, end_idxs = batch_sequences_mc(tokenizer, prompts)
    elif task_type == 'schema':
        prompts = render_prompts_schema(item, continuation_delimiter, fewshot_examples)
        tokens, start_idxs, end_idxs = batch_sequences_schema(tokenizer, prompts)
    elif task_type == 'language_modeling':
        prompts = render_prompts_lm(item, continuation_delimiter, fewshot_examples)
        tokens, start_idxs, end_idxs = batch_sequences_lm(tokenizer, prompts)
    
    # Forward model
    input_ids = stack_sequences(tokens, pad_token_id).to(device)
    losses, predictions = forward_model(model, input_ids)
    
    # Evaluate correctness based on task type
    if task_type == 'language_modeling':
        si, ei = start_idxs[0], end_idxs[0]
        predicted_tokens = predictions[0, si-1:ei-1]
        actual_tokens = input_ids[0, si:ei]
        is_correct = torch.all(predicted_tokens == actual_tokens).item()
    elif task_type in ['multiple_choice', 'schema']:
        mean_losses = [losses[i, si-1:ei-1].mean().item()
                       for i, (si, ei) in enumerate(zip(start_idxs, end_idxs))]
        pred_idx = mean_losses.index(min(mean_losses))
        is_correct = pred_idx == item['gold']
    
    return is_correct

Key Techniques

1. Few-Shot Learning

# Deterministic sampling based on example index
rng = random.Random(1234 + idx)
fewshot_indices = rng.sample(available_indices, num_fewshot)

This ensures:

  • Same few-shot examples for the same test example across runs
  • Different few-shot examples for different test examples
  • Reproducible results
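A standalone sketch of this determinism (illustrative dataset size and shot count):

import random

def sample_fewshot(idx, dataset_size, num_fewshot):
    rng = random.Random(1234 + idx)
    available = [i for i in range(dataset_size) if i != idx]
    return rng.sample(available, num_fewshot)

print(sample_fewshot(7, 1000, 5))  # same output on every run
print(sample_fewshot(7, 1000, 5))  # identical to the line above
print(sample_fewshot(8, 1000, 5))  # a different few-shot set for a different example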

2. Prompt Rendering with Jinja2

# Multiple choice template
template_str = """
{%- for example in fewshot_examples -%}
{{ example.query }}{{ continuation_delimiter }}{{ example.choices[example.gold] }}
 
{% endfor -%}
{{ item.query }}{{ continuation_delimiter }}{{ choice }}"""

Example rendered prompt:

Question: What is the capital of France?
Answer: Paris

Question: What is 2+2?
Answer: 4

Question: What is the largest planet?
Answer: Jupiter
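A minimal rendering sketch with Jinja2 and made-up example data (the real harness builds these structures from the task files; only one few-shot example is shown for brevity):

from jinja2 import Template

template_str = (
    "{%- for example in fewshot_examples -%}"
    "{{ example.query }}{{ continuation_delimiter }}{{ example.choices[example.gold] }}\n\n"
    "{% endfor -%}"
    "{{ item.query }}{{ continuation_delimiter }}{{ choice }}"
)

fewshot_examples = [
    {"query": "Question: What is the capital of France?", "choices": ["Paris", "London"], "gold": 0},
]
item = {"query": "Question: What is the largest planet?"}

prompt = Template(template_str).render(
    fewshot_examples=fewshot_examples,
    continuation_delimiter="\nAnswer: ",
    item=item,
    choice="Jupiter",
)
print(prompt)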

3. Common Prefix/Suffix Detection

For multiple choice:

def find_common_length(token_sequences, direction='left'):
    """Find length of common prefix (multiple choice) or suffix (schema)."""
    min_len = min(len(seq) for seq in token_sequences)
    indices = range(min_len) if direction == 'left' else range(-1, -min_len-1, -1)
    
    for i, idx in enumerate(indices):
        token = token_sequences[0][idx]
        if not all(seq[idx] == token for seq in token_sequences):
            return i
    return min_len

Why this matters: For multiple choice, all options share the same context. In the token sequences, options A, B, and C might all share [15, 42, 88, 91] as a common prefix, with only the answer portion (indices 4-7) differing. We only need to evaluate losses for the answer part, not the shared context—reducing computation by 50-80%.
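Using the toy token sequences from that example, the helper above returns the shared prefix length:

option_a = [15, 42, 88, 91, 7, 3]
option_b = [15, 42, 88, 91, 12, 9, 4]
option_c = [15, 42, 88, 91, 5]

prefix_len = find_common_length([option_a, option_b, option_c], direction='left')
print(prefix_len)  # 4 → only tokens from index 4 onward need to be scored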

4. Distributed Evaluation

def evaluate_task(model, tokenizer, data, device, task_meta):
    """Evaluate one task across all examples with distributed dispatch."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    
    correct = torch.zeros(len(data), dtype=torch.float32, device=device)
    
    # Each rank processes every Nth example
    for idx in range(rank, len(data), world_size):
        is_correct = evaluate_example(idx, model, tokenizer, data, device, task_meta)
        correct[idx] = float(is_correct)
    
    # Sync results across ranks
    if world_size > 1:
        dist.barrier()
        dist.all_reduce(correct, op=dist.ReduceOp.SUM)
    
    mean_correct = correct.mean().item()
    return mean_correct

Strided access pattern (same as data loading):

  • Rank 0: Examples 0, 8, 16, 24, ...
  • Rank 1: Examples 1, 9, 17, 25, ...
  • etc.

This provides automatic load balancing and parallelizes evaluation across all GPUs.

For your evaluation infrastructure, this means: reuse the same strided pattern from training. When data loading and evaluation use identical sharding, you eliminate an entire category of distribution bugs.

CORE Metric Calculation

From scripts/base_eval.py:

scripts/base_eval.py - CORE Metric
# Evaluate all tasks
results = {}
centered_results = {}
for task_label, (data, task_meta) in tasks.items():  # excerpt simplified: tasks maps label -> (dataset, metadata)
    accuracy = evaluate_task(model, tokenizer, data, device, task_meta)

    # Center by the task's random baseline (stored as a percentage, e.g. 25.0)
    random_baseline = eval_metadata[task_label]["Random baseline"]
    centered = (accuracy - 0.01 * random_baseline) / (1.0 - 0.01 * random_baseline)

    results[task_label] = accuracy
    centered_results[task_label] = centered
 
# CORE metric = average of centered results
core_metric = sum(centered_results.values()) / len(centered_results)

Why center by random baseline? Different tasks have different random baselines: 4-choice MC has 25% random accuracy, True/False has 50%, and 10-choice MC has 10%. Centering normalizes to the scale where 0.0 = random performance and 1.0 = perfect performance. This ensures all tasks contribute equally to the CORE metric.
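As a worked example (a minimal sketch; the script above stores baselines as percentages, while this helper takes fractions):

def centered(accuracy, random_baseline):
    return (accuracy - random_baseline) / (1.0 - random_baseline)

print(centered(0.40, 0.25))  # 0.20 → a 4-choice task at 40% is 20% of the way from random to perfect
print(centered(0.70, 0.50))  # 0.40 → a true/false task at 70% is 40% of the way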

Evaluation integrates into the training loop in 50 lines

nanochat's training loop integrates both metrics from scripts/base_train.py:

scripts/base_train.py - Evaluation Integration
for step in range(num_iterations + 1):
    
    # Evaluate validation bpb
    if last_step or step % eval_every == 0:
        model.eval()
        val_loader = build_val_loader()
        eval_steps = eval_tokens // (device_batch_size * max_seq_len * ddp_world_size)
        
        with autocast_ctx:
            val_bpb = evaluate_bpb(model, val_loader, eval_steps, token_bytes)
        
        print0(f"Step {step:05d} | Validation bpb: {val_bpb:.4f}")
        wandb_run.log({
            "step": step,
            "val/bpb": val_bpb,
        })
        model.train()
    
    # Evaluate CORE metric
    if last_step or (step > 0 and step % core_metric_every == 0):
        model.eval()
        with autocast_ctx:
            results = evaluate_model(orig_model, tokenizer, device, max_per_task=500)
        
        print0(f"Step {step:05d} | CORE metric: {results['core_metric']:.4f}")
        wandb_run.log({
            "step": step,
            "core_metric": results["core_metric"],
            "centered_results": results["centered_results"],
        })
        model.train()
    
    # Training step
    # ...

Evaluation frequency:

  • eval_every = 250: bpb evaluation every 250 steps
  • core_metric_every = 2000: CORE evaluation every 2000 steps

Why different frequencies?

  • bpb: Fast (~30 seconds), run frequently
  • CORE: Slow (~15 minutes), run sparingly

Chinchilla scaling laws show ~20× data-to-params is compute-optimal

With these evaluation tools, we can now explore scaling laws - the relationships between model size, data, compute, and performance.

Chinchilla Scaling Laws

The Chinchilla paper established that for optimal compute efficiency:

Optimal data-to-parameter ratio ≈ 20:1

nanochat's implementation from scripts/base_train.py:

scripts/base_train.py - Chinchilla Ratio
# Training horizon specification
target_param_data_ratio = 20  # Chinchilla = 20
 
# Calculate training iterations
target_tokens = target_param_data_ratio * num_params
num_iterations = target_tokens // total_batch_size

Example for depth=20 model (actual nanochat):

  • Parameters: ~561M (from speedrun.sh)
  • Target tokens: 20 × 561M = ~11.2B tokens
  • Batch size: 524K tokens
  • Iterations: 11.2B / 524K ≈ 21,400 steps
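A quick back-of-the-envelope check of those numbers:

num_params = 561e6
total_batch_size = 524288                 # tokens per optimization step
target_tokens = 20 * num_params           # ≈ 1.12e10 tokens
num_iterations = int(target_tokens // total_batch_size)
print(f"{target_tokens:.3g} tokens → {num_iterations} steps")  # ~11.2B tokens, ~21,400 steps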

Scaling Laws Calculator

[Interactive widget in the original post: enter a compute budget and get the Chinchilla-optimal allocation, plus a loss-vs-model-size chart at different token ratios. For example, a 6.0e18 FLOP budget works out to roughly a 0.22B-parameter model trained on ~4.5B tokens (~20× params).]

Chinchilla Insight

Most models before Chinchilla were under-trained (too few tokens for their size). The optimal ratio is ~20 tokens per parameter.

Compute Formula

C ≈ 6 × N × D, where C = FLOPs, N = parameters, D = tokens. This estimates total training compute: roughly 6 FLOPs per parameter per token for the forward and backward passes combined.
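A minimal sketch of that allocation rule, assuming the simple C ≈ 6·N·D approximation with D = ratio·N (real Chinchilla fits use a full parametric loss model):

import math

def chinchilla_split(flops, ratio=20):
    n_params = math.sqrt(flops / (6 * ratio))   # C = 6 * N * (ratio * N) = 6 * ratio * N^2
    n_tokens = ratio * n_params
    return n_params, n_tokens

n, d = chinchilla_split(6.0e18)
print(f"{n/1e9:.2f}B params, {d/1e9:.1f}B tokens")  # ≈ 0.22B params, 4.5B tokens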

Practical Guidance

  • For inference: Prefer smaller, well-trained models (lower serving cost)
  • For research: Training beyond optimal can still help; diminishing returns
  • Data-limited? Use a smaller model to avoid overfitting
  • Compute-limited? Follow the optimal ratio closely

Observing Scaling Laws

Experiment setup:

Train models at different scales with Chinchilla-optimal data:

Note: The table below uses illustrative parameter counts to demonstrate scaling principles. Actual nanochat model sizes differ (e.g., d20 = ~561M parameters per Karpathy's walkthrough).

| Depth | Params (illustrative) | Data (20×) | val bpb | CORE |
| --- | --- | --- | --- | --- |
| 12 | 30M | 600M | 1.75 | 0.25 |
| 16 | 54M | 1.08B | 1.58 | 0.32 |
| 20 | 83M | 1.66B | 1.45 | 0.38 |
| 24 | 118M | 2.36B | 1.35 | 0.43 |
| 28 | 158M | 3.16B | 1.28 | 0.47 |

Observations:

  1. Power law in loss: bpb ∝ N^(-α), where N = parameters and α ≈ 0.05
  2. Logarithmic in CORE: CORE ∝ log(N)
  3. Smooth improvements: no sudden jumps, steady scaling

Loss Curves

Typical training curve (shown as a diagram in the original post):

Characteristics:

  • Rapid initial improvement: First 500 steps see largest gains
  • Logarithmic progress: Later improvements come more slowly
  • Smooth convergence: No oscillations or instabilities

Compute-Optimal Frontier

Key insight: For a fixed compute budget C (in FLOPs), what's the optimal allocation?

C = 6 × N × D  (approximate FLOPs for training)
where:
  N = parameters
  D = tokens

Chinchilla finding:

Optimal: N ∝ C^0.5
         D ∝ C^0.5

This means:

  • Double compute → √2× larger model trained on √2× more data
  • NOT: Double compute → 2× larger model on same data
  • NOT: Double compute → same model on 2× more data

nanochat's default (via target_param_data_ratio=20):

  • Trains on slightly more data per parameter than some fits suggest for smaller models (~14:1), i.e. mildly overtrained rather than undertrained relative to the strict compute-optimal point
  • Reasonable for experimentation: being in the right ballpark matters far more than hitting the exact optimum

Loss vs Compute (Scaling Law)

Kaplan et al. (2020) scaling law:

L(C) = (C_0 / C)^α
where:
  L = validation loss (or bpb)
  C = compute (FLOPs)
  C_0 = constant (depends on architecture)
  α ≈ 0.05-0.07 (exponent)

In practice:

| Compute (FLOPs, ≈6·N·D) | Model Size | Data | Expected bpb |
| --- | --- | --- | --- |
| ~1e17 | 30M | 600M | 1.75 |
| ~8e17 | 83M | 1.66B | 1.45 |
| ~6e18 | 230M | 4.6B | 1.20 |
| ~5e19 | 630M | 12.6B | 1.00 |

Key insight: gains come slowly with compute. In the table above, each ~10× increase in compute shaves off only a few tenths of a bpb; with an exponent around 0.05, halving the loss takes several orders of magnitude more compute. This is why training frontier models requires such massive investment.

These evaluation patterns reveal training dynamics

1. Early Stopping vs Overtraining

Observation: For fixed model size, loss continues improving with more data, but with diminishing returns.

# Training beyond Chinchilla ratio
target_param_data_ratio = 20  # Standard
# vs
target_param_data_ratio = 40  # 2× more data

Results (depth=20, using the illustrative ~83M-parameter configuration from the scaling table above):

| Data Ratio | Tokens | val bpb | CORE | Training Time |
| --- | --- | --- | --- | --- |
| 10× | 830M | 1.52 | 0.35 | 30 min |
| 20× | 1.66B | 1.45 | 0.38 | 60 min |
| 40× | 3.32B | 1.42 | 0.39 | 120 min |

Insight: 20× is a good default. Beyond that, returns diminish significantly.

2. bpb vs CORE Correlation

Generally:

  • Lower bpb → Higher CORE
  • But correlation is imperfect

Example anomaly:

Model A: bpb=1.45, CORE=0.38
Model B: bpb=1.44, CORE=0.36

Model B has slightly better compression but worse reasoning. This can happen when:

  • Model memorizes common phrases (lowers bpb)
  • But doesn't learn compositional reasoning (lowers CORE)

Lesson: Use both metrics. bpb tells you about training dynamics and compression efficiency; CORE tells you about actual capabilities and reasoning quality.

3. Validation Set Size

eval_tokens = 20 * 524288  # ~10M tokens

Why 10M tokens?

  • Large enough for stable estimates
  • Small enough to evaluate quickly (~30s)

Stability analysis:

| eval_tokens | bpb mean | bpb std | Eval time |
| --- | --- | --- | --- |
| 1M | 1.447 | 0.02 | 3s |
| 5M | 1.451 | 0.008 | 15s |
| 10M | 1.450 | 0.004 | 30s |
| 50M | 1.450 | 0.002 | 150s |

Conclusion: 10M tokens provides sufficient precision with reasonable compute.

4. Learning Rate Warmdown

nanochat uses warmdown (gradual LR decay) in the final 20% of training:

warmdown_ratio = 0.2
final_lr_frac = 0.0
 
def get_lr_multiplier(it):
    warmdown_iters = round(warmdown_ratio * num_iterations)
    if it <= num_iterations - warmdown_iters:
        return 1.0
    else:
        progress = (num_iterations - it) / warmdown_iters
        return progress * 1.0 + (1 - progress) * final_lr_frac
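As a quick check, here is the multiplier at a few steps of an illustrative 3,167-step run (a standalone sketch; get_lr_multiplier reads the module-level num_iterations):

num_iterations = 3167
for it in [0, 1000, 2500, 2850, 3167]:
    print(it, round(get_lr_multiplier(it), 3))
# 0 1.0 | 1000 1.0 | 2500 1.0 | 2850 0.501 | 3167 0.0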

Effect on validation loss:

Steps 0-2500:    LR = 1.0 × base_lr,  val_bpb slowly decreasing
Steps 2500-3167: LR = 1.0 → 0.0,      val_bpb drops faster

Final val_bpb improvement: ~0.03 bpb (2% relative)

Why it works: Model fine-tunes to a sharper minimum in later training.

Advanced: alternative metrics, benchmarks, and curriculum learning

1. Bits Per Character vs Bits Per Byte

Some papers report "bits per character" (bpc):

# Bits per character
bpc = total_nats / (log(2) * num_characters)

Key difference:

  • bpb: UTF-8 bytes (e.g., "世界" = 6 bytes)
  • bpc: Unicode characters (e.g., "世界" = 2 characters)

For English text: bpc ≈ bpb (most characters = 1 byte)
For multilingual text: bpc < bpb (multi-byte characters)

nanochat uses bpb because:

  • More universal (bytes are always bytes)
  • Easier to compute (no character counting)
  • Standard in modern LLM papers

2. Perplexity

Some papers report perplexity instead of bpb:

perplexity = exp(cross_entropy_loss)

Relationship to bpb:

bits_per_token = loss_nats / ln(2)
bpb = bits_per_token / avg_bytes_per_token
per_byte_perplexity = 2^bpb

Example:

  • bpb = 1.45
  • per-byte perplexity = 2^1.45 ≈ 2.73

Interpretation: On average, the model is about as uncertain as if it were choosing among ~2.7 equally likely next bytes.
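A small conversion helper, assuming you know (or estimate) your tokenizer's average bytes per token:

import math

def to_bpb(loss_nats_per_token, avg_bytes_per_token):
    return loss_nats_per_token / (math.log(2) * avg_bytes_per_token)

loss = 2.8           # nats per token
avg_bytes = 4.0      # illustrative: ~4 UTF-8 bytes per token
bpb = to_bpb(loss, avg_bytes)
print(f"bpb = {bpb:.2f}, per-byte perplexity = {2 ** bpb:.2f}")  # bpb ≈ 1.01, ppl ≈ 2.01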

3. CORE vs Other Benchmarks

Comparison:

BenchmarkTasksCoverageProsCons
CORE11BroadSingle metric, comprehensiveLess common than alternatives
HELM42Very broadIndustry standardSlow, complex
Eleuther Eval Harness200+Extremely broadMost comprehensiveVery slow
MMLU57Knowledge-focusedStandard, well-knownNarrow (only knowledge)

nanochat chooses CORE for:

  • ✅ Fast evaluation (~15 min on 8 GPUs)
  • ✅ Good coverage without redundancy
  • ✅ From recent, well-regarded paper (DCLM)

4. Curriculum Learning

Question: Should we change the data distribution during training?

# Potential curriculum strategy
if step < warmup_steps:
    data_loader = easy_data_loader  # Simpler texts
else:
    data_loader = full_data_loader  # All data

nanochat's choice: No curriculum (uniform distribution throughout)

Rationale:

  • Simpler implementation
  • Data is already shuffled (good enough)
  • No clear evidence curriculum helps for LLMs at this scale

Debugging training runs with evaluation signals

Common Issues

1. Loss diverges (NaN)

Step 100: loss=2.8
Step 101: loss=3.2
Step 102: loss=8.5
Step 103: loss=nan

Likely causes: Learning rate too high, gradient clipping disabled or set too high, or numerical instability in the forward pass.

Solution:

grad_clip = 1.0  # Enable gradient clipping
matrix_lr = 0.01  # Reduce Muon LR
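For reference, a generic PyTorch training-step sketch with gradient clipping (model, optimizer, and the batch are placeholders; this is not nanochat's exact loop):

import torch

loss = model(x, y)                  # forward pass returning a scalar loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap the global grad norm
optimizer.step()
optimizer.zero_grad(set_to_none=True)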

2. Loss plateaus early

Step 0-500: bpb 4.5 → 2.1
Step 500-3000: bpb 2.1 → 2.09

Likely causes:

  • Learning rate too low
  • Model too small for dataset
  • Data quality issues

Solution:

  • Increase learning rate
  • Increase model depth
  • Inspect data distribution

3. Train/val loss diverge

Step 2000: train_loss=1.4, val_bpb=1.8

Likely causes:

  • Overfitting (model too large)
  • Train/val distribution mismatch
  • Bug in validation evaluation

Solution:

  • Check dataset splits
  • Verify eval implementation
  • Reduce model size if overfitting

Evaluation infrastructure isn't an afterthought—it's fundamental

nanochat's evaluation framework provides the tools needed to understand and optimize language model training:

Bits per byte (bpb):

  • ✅ Tokenization-agnostic metric
  • ✅ Enables fair comparison across models
  • ✅ Fast to evaluate (~30 seconds)
  • ✅ Stable training signal

CORE benchmark:

  • ✅ Comprehensive task coverage
  • ✅ Single-number metric for model quality
  • ✅ Normalized for fair comparison
  • ✅ Parallelized for efficiency

Scaling laws:

  • ✅ Power-law relationship between compute and performance
  • ✅ Chinchilla ratio (20:1 data:params) as default
  • ✅ Smooth, predictable improvements

Together, these tools enable:

  • Principled hyperparameter selection
  • Efficient compute allocation
  • Early detection of training issues
  • Reliable model comparison

The evaluation infrastructure is not an afterthought - it's fundamental to understanding what's happening during training. By implementing proper metrics and comprehensive benchmarks, nanochat demonstrates that even small-scale projects can employ the same rigorous evaluation methods used by frontier labs.

Training without evaluation is just generating heat. Measure everything.


Track 1 Complete! 🎉

Coming next: Track 2 (Practical Guides) - Hands-on tutorials for building your own ChatGPT!



Before you interpret your training metrics:

  1. Use bits-per-byte, not raw loss. Tokenizer vocabulary size affects loss magnitude—bpb gives you a vocabulary-agnostic metric.
  2. Compute your Chinchilla ratio. Divide training tokens by parameter count—below 20:1 means you're undertrained; above 40:1 shows diminishing returns.
  3. Validate on 10M+ tokens. Smaller validation sets give noisy bpb estimates—10M tokens gives ±0.004 bpb precision.
  4. Track both bpb and CORE. Lower bpb doesn't guarantee better reasoning—run task benchmarks alongside perplexity.
  5. Log evaluation metrics every 500 steps. Catching training divergence at step 1000 saves hours versus discovering at step 5000.

Your loss number means nothing without context. Bits-per-byte and Chinchilla tell you what it actually means.