José David Baena

Loss Landscape & Scaling Laws: Understanding Training Dynamics


Introduction

Training a language model is not just about running gradient descent until convergence. Understanding how the loss evolves, why certain hyperparameters work, and what scaling laws govern model performance can mean the difference between wasted compute and efficient training runs.

nanochat's training infrastructure includes sophisticated evaluation mechanisms that go beyond simple loss tracking. It implements bits-per-byte (bpb) - a tokenization-agnostic metric, and the CORE benchmark - a comprehensive evaluation suite covering 11 diverse tasks. Together, these tools provide deep insights into training dynamics and enable principled decision-making about model architecture, data requirements, and compute budgets.

This final post of Track 1 explores nanochat's evaluation framework, dissecting the bits-per-byte metric, understanding the CORE benchmark, and examining how these tools illuminate the fundamental scaling laws that govern language model training.

The Problem with Cross-Entropy Loss

The standard training metric is cross-entropy loss:

loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))

WARNING

This seems straightforward, but it has a critical flaw: The loss is not comparable across different tokenizers.

Why Cross-Entropy Loss is Tokenizer-Dependent

Consider two tokenizers:

Tokenizer A (vocab size = 10K):

  • "Hello world" → [1523, 892]
  • Average tokens per sentence: 10
  • Cross-entropy loss: 3.2

Tokenizer B (vocab size = 100K):

  • "Hello world" → [45123]
  • Average tokens per sentence: 5
  • Cross-entropy loss: 2.8

Which is better? You can't tell! Tokenizer B has lower loss, but it's also predicting from a 10× larger vocabulary.

The cross-entropy loss depends on:

  1. Vocabulary size: Larger vocab → higher entropy → higher loss
  2. Token granularity: Byte-level vs word-level tokenization
  3. Special tokens: How are they handled?

This makes it impossible to:

  • Compare models trained with different tokenizers
  • Experiment with vocabulary size
  • Compare against published baselines (which use different tokenizers)
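One quick way to see the vocabulary-size dependence: an untrained model that predicts uniformly over its vocabulary has a cross-entropy of ln(V) nats per token, which already differs between the two hypothetical tokenizers above:

import math

# Cross-entropy of a uniform (untrained) model is ln(vocab_size) nats per token
print(math.log(10_000))   # ≈ 9.2 nats  (Tokenizer A, 10K vocab)
print(math.log(100_000))  # ≈ 11.5 nats (Tokenizer B, 100K vocab)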

Bits Per Byte (bpb): The Solution

nanochat uses bits-per-byte (bpb) - a tokenization-agnostic metric that normalizes loss by the actual byte content being predicted.

The Core Idea

Instead of computing loss per token, compute loss per byte:

bpb = total_nats / (log(2) × total_bytes)

Where:

  • total_nats: Sum of cross-entropy losses (in natural log units)
  • total_bytes: Total number of UTF-8 bytes in all target tokens
  • log(2): Conversion factor from nats to bits

NOTE

Key insight: The byte content is tokenizer-independent. "Hello world" is always 11 bytes, regardless of how you tokenize it.

Implementation

nanochat's implementation from nanochat/loss_eval.py:

nanochat/loss_eval.py
import math
import torch
import torch.distributed as dist

@torch.no_grad()
def evaluate_bpb(model, batches, steps, token_bytes):
    """
    Calculate bits per byte (bpb) - a tokenization-agnostic metric.
    
    Args:
        model: The model to evaluate
        batches: Iterator over (inputs, targets) batches
        steps: Number of evaluation steps
        token_bytes: 1D tensor mapping token_id → num_bytes (0 for special tokens)
    """
    total_nats = torch.tensor(0.0, dtype=torch.float32, device=model.get_device())
    total_bytes = torch.tensor(0, dtype=torch.int64, device=model.get_device())
    
    batch_iter = iter(batches)
    for _ in range(steps):
        x, y = next(batch_iter)
        
        # Get per-token losses (no reduction)
        loss2d = model(x, y, loss_reduction='none')  # (B, T)
        loss2d = loss2d.view(-1)  # flatten
        y = y.view(-1)  # flatten
        
        if (y < 0).any():
            # Handle ignore_index (e.g., -1 for padding)
            valid = y >= 0
            y_safe = torch.where(valid, y, torch.zeros_like(y))
            num_bytes2d = torch.where(
                valid,
                token_bytes[y_safe],
                torch.zeros_like(y, dtype=token_bytes.dtype)
            )
            total_nats += (loss2d * (num_bytes2d > 0)).sum()
            total_bytes += num_bytes2d.sum()
        else:
            # Fast path: no ignored targets
            num_bytes2d = token_bytes[y]
            total_nats += (loss2d * (num_bytes2d > 0)).sum()
            total_bytes += num_bytes2d.sum()
    
    # Sum across all distributed ranks
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    if world_size > 1:
        dist.all_reduce(total_nats, op=dist.ReduceOp.SUM)
        dist.all_reduce(total_bytes, op=dist.ReduceOp.SUM)
    
    # Convert to bits per byte
    total_nats = total_nats.item()
    total_bytes = total_bytes.item()
    bpb = total_nats / (math.log(2) * total_bytes)
    return bpb

Key Design Decisions

1. Per-Token Loss Calculation

loss2d = model(x, y, loss_reduction='none')  # (B, T)

We need per-token losses to weight them by token byte length. Standard reduction='mean' would lose this information.

2. Token Byte Lookup

num_bytes2d = token_bytes[y]

The token_bytes tensor is precomputed during tokenizer training (from scripts/tok_train.py):

scripts/tok_train.py - Token Byte Computation
# Generated by scripts/tok_train.py
token_bytes = torch.zeros(vocab_size, dtype=torch.int64)
for token_id in range(vocab_size):
    token_str = tokenizer.decode([token_id])
    token_bytes[token_id] = len(token_str.encode('utf-8'))
    
# Special tokens get 0 bytes (excluded from metric)
for special_token_id in special_token_ids:
    token_bytes[special_token_id] = 0

Example token bytes:

  • "hello" (5 characters) → 5 bytes
  • "世界" (2 characters) → 6 bytes (UTF-8)
  • "<|bos|>" (special) → 0 bytes (excluded)

3. Byte Weighting

total_nats += (loss2d * (num_bytes2d > 0)).sum()
total_bytes += num_bytes2d.sum()

  • Multiply each loss by whether its token contributes bytes (0 or 1)
  • Sum total bytes separately
  • This excludes special tokens (0 bytes) from both numerator and denominator

4. Distributed Reduction

if world_size > 1:
    dist.all_reduce(total_nats, op=dist.ReduceOp.SUM)
    dist.all_reduce(total_bytes, op=dist.ReduceOp.SUM)

Each rank evaluates different data (strided sharding), so we sum across ranks before computing the final ratio.

bpb Interpretation

What does a bpb value mean?

bpb = 1.5  →  On average, the model uses 1.5 bits per byte of text

Theoretical limits:

| Compression | bpb | Interpretation |
|---|---|---|
| Perfect | 0.0 | Model predicts everything perfectly |
| gzip | ~4.5 | Classical compression algorithm |
| Random | 8.0 | No compression (1 byte = 8 bits) |

Typical model performance:

| Model | bpb | Context |
|---|---|---|
| GPT-2 small | ~1.8 | 124M params, pre-2020 architecture |
| nanochat d20 | ~1.45 | 83M params, modern architecture |
| GPT-3 175B | ~0.8 | Large-scale model |
| GPT-4 | ~0.6 | State-of-the-art (estimated) |

bpb vs Cross-Entropy Comparison

Cross-entropy loss:

# Simple average - tokenizer dependent
loss = sum(cross_entropy(pred, target)) / num_tokens
# Example: 2.8 (what does this mean?)

Bits per byte:

# Byte-normalized - tokenizer independent
bpb = sum(cross_entropy(pred, target)) / (log(2) × num_bytes)
# Example: 1.45 bits/byte (comparable across tokenizers!)

Concrete example:

Text: "The quick brown fox jumps"

| Tokenizer | Tokens | Cross-entropy (nats/token) | Bytes | bpb |
|---|---|---|---|---|
| GPT-4 (100K) | 5 | ~4.9 | 25 | ~1.4 |
| GPT-2 (50K) | 6 | ~4.0 | 25 | ~1.4 |
| Character-level (256) | 25 | ~0.97 | 25 | ~1.4 |

TIP

All three have the same bpb despite different token counts and losses!
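A minimal sketch of that claim, using made-up per-token losses consistent with the table above:

import math

def bits_per_byte(per_token_nats, num_bytes):
    # bpb = total nats / (ln 2 × total bytes)
    return sum(per_token_nats) / (math.log(2) * num_bytes)

num_bytes = len("The quick brown fox jumps".encode("utf-8"))  # 25 bytes

print(bits_per_byte([4.9] * 5, num_bytes))    # ≈ 1.41  (5 coarse tokens)
print(bits_per_byte([4.0] * 6, num_bytes))    # ≈ 1.39  (6 tokens)
print(bits_per_byte([0.97] * 25, num_bytes))  # ≈ 1.40  (character-level)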

The CORE Benchmark

While bpb measures how well the model compresses text, it doesn't tell us how well it understands language. Enter the CORE benchmark.

What is CORE?

CORE (Common-sense Reasoning Evaluation) is a comprehensive benchmark from the DCLM paper that evaluates models on 11 diverse tasks:

| Category | Tasks |
|---|---|
| Knowledge | MMLU (5-shot), HellaSwag, PIQA |
| Reading Comprehension | SQUAD, BoolQ, SciQ, ARC-Easy, ARC-Challenge |
| Common Sense | OpenBookQA, Winogrande |
| Problem Solving | SQUAD |

Why CORE?

Single-number metric:

CORE = average of centered accuracies across all 11 tasks

Where centered accuracy adjusts for random baseline:

centered = (accuracy - random_baseline) / (1.0 - random_baseline)

Benefits:

  • ✅ Comprehensive coverage of language understanding
  • ✅ Balances multiple skill types
  • ✅ Accounts for task difficulty
  • ✅ Enables comparison with published models

Task Types

The CORE benchmark includes three task types:

1. Multiple Choice

Example from HellaSwag:

Context: "A person is climbing a rock wall."
Choices:
  A) "The person reaches the top and waves."
  B) "The person is eating a sandwich."
  C) "The wall turns into water."
  D) "The rock becomes a bird."
Gold: A

Evaluation method: Choose option with lowest perplexity.

2. Schema (Cloze Completion)

Example from Winogrande:

Context options:
  - "The trophy doesn't fit in the suitcase because it is too large."
  - "The trophy doesn't fit in the suitcase because it is too small."
Continuation: "it" refers to the trophy
Gold: Option 1

Evaluation method: Choose context with lowest perplexity for continuation.

3. Language Modeling

Example from SQUAD:

Context: "The Normans were originally people from..."
Continuation: "northern France"

Evaluation method: Check if model's greedy predictions match the continuation.

Implementation Deep-Dive

nanochat's CORE evaluation from nanochat/core_eval.py:

nanochat/core_eval.py
@torch.no_grad()
def evaluate_example(idx, model, tokenizer, data, device, task_meta):
    """Evaluate a single example."""
    item = data[idx]
    task_type = task_meta['task_type']
    num_fewshot = task_meta['num_fewshot']
    
    # Sample few-shot examples (deterministic based on idx)
    fewshot_examples = []
    if num_fewshot > 0:
        rng = random.Random(1234 + idx)
        available_indices = [i for i in range(len(data)) if i != idx]
        fewshot_indices = rng.sample(available_indices, num_fewshot)
        fewshot_examples = [data[i] for i in fewshot_indices]
    
    # Render prompts based on task type
    if task_type == 'multiple_choice':
        prompts = render_prompts_mc(item, continuation_delimiter, fewshot_examples)
        tokens, start_idxs, end_idxs = batch_sequences_mc(tokenizer, prompts)
    elif task_type == 'schema':
        prompts = render_prompts_schema(item, continuation_delimiter, fewshot_examples)
        tokens, start_idxs, end_idxs = batch_sequences_schema(tokenizer, prompts)
    elif task_type == 'language_modeling':
        prompts = render_prompts_lm(item, continuation_delimiter, fewshot_examples)
        tokens, start_idxs, end_idxs = batch_sequences_lm(tokenizer, prompts)
    
    # Forward model
    input_ids = stack_sequences(tokens, pad_token_id).to(device)
    losses, predictions = forward_model(model, input_ids)
    
    # Evaluate correctness based on task type
    if task_type == 'language_modeling':
        si, ei = start_idxs[0], end_idxs[0]
        predicted_tokens = predictions[0, si-1:ei-1]
        actual_tokens = input_ids[0, si:ei]
        is_correct = torch.all(predicted_tokens == actual_tokens).item()
    elif task_type in ['multiple_choice', 'schema']:
        mean_losses = [losses[i, si-1:ei-1].mean().item()
                       for i, (si, ei) in enumerate(zip(start_idxs, end_idxs))]
        pred_idx = mean_losses.index(min(mean_losses))
        is_correct = pred_idx == item['gold']
    
    return is_correct

Key Techniques

1. Few-Shot Learning

# Deterministic sampling based on example index
rng = random.Random(1234 + idx)
fewshot_indices = rng.sample(available_indices, num_fewshot)

This ensures:

  • Same few-shot examples for the same test example across runs
  • Different few-shot examples for different test examples
  • Reproducible results

2. Prompt Rendering with Jinja2

# Multiple choice template
template_str = """
{%- for example in fewshot_examples -%}
{{ example.query }}{{ continuation_delimiter }}{{ example.choices[example.gold] }}
 
{% endfor -%}
{{ item.query }}{{ continuation_delimiter }}{{ choice }}"""

Example rendered prompt:

Question: What is the capital of France?
Answer: Paris

Question: What is 2+2?
Answer: 4

Question: What is the largest planet?
Answer: Jupiter

3. Common Prefix/Suffix Detection

For multiple choice:

def find_common_length(token_sequences, direction='left'):
    """Find length of common prefix (multiple choice) or suffix (schema)."""
    min_len = min(len(seq) for seq in token_sequences)
    indices = range(min_len) if direction == 'left' else range(-1, -min_len-1, -1)
    
    for i, idx in enumerate(indices):
        token = token_sequences[0][idx]
        if not all(seq[idx] == token for seq in token_sequences):
            return i
    return min_len

NOTE

Why this matters:

For multiple choice, all options share the same context:

Tokens:
  Option A: [15, 42, 88, 91, 23, 56, 77]  ← Answer starts at index 4
  Option B: [15, 42, 88, 91, 99, 12, 34]  ← Answer starts at index 4
  Option C: [15, 42, 88, 91, 45, 67, 89]  ← Answer starts at index 4
            └─── common prefix ───┘

We only need to evaluate losses for the answer part (indices 4-6), not the shared context.
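Applied to those toy token IDs, find_common_length reports a shared prefix of length 4, so only the tail of each option needs to be scored:

options = [
    [15, 42, 88, 91, 23, 56, 77],
    [15, 42, 88, 91, 99, 12, 34],
    [15, 42, 88, 91, 45, 67, 89],
]
print(find_common_length(options, direction='left'))  # 4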

4. Distributed Evaluation

def evaluate_task(model, tokenizer, data, device, task_meta):
    """Evaluate one task across all examples with distributed dispatch."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    
    correct = torch.zeros(len(data), dtype=torch.float32, device=device)
    
    # Each rank processes every Nth example
    for idx in range(rank, len(data), world_size):
        is_correct = evaluate_example(idx, model, tokenizer, data, device, task_meta)
        correct[idx] = float(is_correct)
    
    # Sync results across ranks
    if world_size > 1:
        dist.barrier()
        dist.all_reduce(correct, op=dist.ReduceOp.SUM)
    
    mean_correct = correct.mean().item()
    return mean_correct

Strided access pattern (same as data loading):

  • Rank 0: Examples 0, 8, 16, 24, ...
  • Rank 1: Examples 1, 9, 17, 25, ...
  • etc.

This provides automatic load balancing and parallelizes evaluation across all GPUs.

CORE Metric Calculation

From scripts/base_eval.py:

scripts/base_eval.py - CORE Metric
# Evaluate all tasks
results = {}
centered_results = {}
for task_label in tasks:
    # data and task_meta for this task are loaded per task earlier in the script
    accuracy = evaluate_task(model, tokenizer, data, device, task_meta)
    
    # Center by random baseline (stored as a percentage, hence the 0.01 factor)
    random_baseline = eval_metadata[task_label]["Random baseline"]
    centered = (accuracy - 0.01 * random_baseline) / (1.0 - 0.01 * random_baseline)
    
    results[task_label] = accuracy
    centered_results[task_label] = centered
 
# CORE metric = average of centered results
core_metric = sum(centered_results.values()) / len(centered_results)

TIP

Why center by random baseline?

Different tasks have different random baselines:

  • 4-choice MC: 25% random accuracy
  • True/False: 50% random accuracy
  • 10-choice MC: 10% random accuracy

Centering normalizes to the scale:

centered = 0.0  →  Random performance
centered = 1.0  →  Perfect performance

This ensures all tasks contribute equally to the CORE metric.
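As a sketch, two hypothetical tasks with different baselines land on the same scale after centering (baselines here are written as fractions rather than the percentages used in the script above):

def centered(accuracy, random_baseline):
    # Map raw accuracy onto 0 (random) .. 1 (perfect)
    return (accuracy - random_baseline) / (1.0 - random_baseline)

print(centered(0.55, 0.25))  # 4-choice MC:  (0.55 - 0.25) / 0.75 = 0.40
print(centered(0.70, 0.50))  # true/false:   (0.70 - 0.50) / 0.50 = 0.40
# Both sit 40% of the way from random to perfect, despite very different raw accuracies.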

Training Loop Integration

nanochat's training loop integrates both metrics from scripts/base_train.py:

scripts/base_train.py - Evaluation Integration
for step in range(num_iterations + 1):
    
    # Evaluate validation bpb
    if last_step or step % eval_every == 0:
        model.eval()
        val_loader = build_val_loader()
        eval_steps = eval_tokens // (device_batch_size * max_seq_len * ddp_world_size)
        
        with autocast_ctx:
            val_bpb = evaluate_bpb(model, val_loader, eval_steps, token_bytes)
        
        print0(f"Step {step:05d} | Validation bpb: {val_bpb:.4f}")
        wandb_run.log({
            "step": step,
            "val/bpb": val_bpb,
        })
        model.train()
    
    # Evaluate CORE metric
    if last_step or (step > 0 and step % core_metric_every == 0):
        model.eval()
        with autocast_ctx:
            results = evaluate_model(orig_model, tokenizer, device, max_per_task=500)
        
        print0(f"Step {step:05d} | CORE metric: {results['core_metric']:.4f}")
        wandb_run.log({
            "step": step,
            "core_metric": results["core_metric"],
            "centered_results": results["centered_results"],
        })
        model.train()
    
    # Training step
    # ...

Evaluation frequency:

  • eval_every = 250: bpb evaluation every 250 steps
  • core_metric_every = 2000: CORE evaluation every 2000 steps

Why different frequencies?

  • bpb: Fast (~30 seconds), run frequently
  • CORE: Slow (~15 minutes), run sparingly

Scaling Laws: Empirical Observations

With these evaluation tools, we can now explore scaling laws - the relationships between model size, data, compute, and performance.

Chinchilla Scaling Laws

The Chinchilla paper established that for optimal compute efficiency:

Optimal data-to-parameter ratio ≈ 20:1

nanochat's implementation from scripts/base_train.py:

scripts/base_train.py - Chinchilla Ratio
# Training horizon specification
target_param_data_ratio = 20  # Chinchilla = 20
 
# Calculate training iterations
target_tokens = target_param_data_ratio * num_params
num_iterations = target_tokens // total_batch_size

Example for depth=20 model:

  • Parameters: 83M
  • Target tokens: 20 × 83M = 1.66B tokens
  • Batch size: 524K tokens
  • Iterations: 1.66B / 524,288 ≈ 3,166 steps
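The same arithmetic in code (83M parameters is the figure used throughout this post for the depth-20 model):

num_params = 83e6
total_batch_size = 524_288                   # tokens per optimizer step
target_tokens = 20 * num_params              # Chinchilla-style 20:1 ratio
num_iterations = int(target_tokens // total_batch_size)
print(f"{target_tokens:.2e} tokens, {num_iterations} steps")  # 1.66e+09 tokens, 3166 steps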

Observing Scaling Laws

Experiment setup:

Train models at different scales with Chinchilla-optimal data:

| Depth | Params | Data (20×) | val bpb | CORE |
|---|---|---|---|---|
| 12 | 30M | 600M | 1.75 | 0.25 |
| 16 | 54M | 1.08B | 1.58 | 0.32 |
| 20 | 83M | 1.66B | 1.45 | 0.38 |
| 24 | 118M | 2.36B | 1.35 | 0.43 |
| 28 | 158M | 3.16B | 1.28 | 0.47 |

Observations:

  1. Power law in loss:
     bpb ∝ N^(-α), where N = parameters and α ≈ 0.05
  2. Logarithmic in CORE:
     CORE ∝ log(N)
  3. Smooth improvements: No sudden jumps, steady scaling

Loss Curves

Typical training curve:


Characteristics:

  • Rapid initial improvement: First 500 steps see largest gains
  • Logarithmic progress: Later improvements come more slowly
  • Smooth convergence: No oscillations or instabilities

Compute-Optimal Frontier

Key insight: For a fixed compute budget C (in FLOPs), what's the optimal allocation?

C = 6 × N × D  (approximate FLOPs for training)
where:
  N = parameters
  D = tokens

Chinchilla finding:

Optimal: N ∝ C^0.5
         D ∝ C^0.5

This means:

  • Double compute → √2× larger model trained on √2× more data
  • NOT: Double compute → 2× larger model on same data
  • NOT: Double compute → same model on 2× more data
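A sketch of that allocation rule: scaling an (N, D) pair to a k× compute budget multiplies both by √k, since C ≈ 6·N·D:

import math

def scale_allocation(n_params, n_tokens, compute_multiplier):
    # Keep N ∝ C^0.5 and D ∝ C^0.5, so C = 6·N·D grows by compute_multiplier
    s = math.sqrt(compute_multiplier)
    return n_params * s, n_tokens * s

# Doubling compute: ~1.41× larger model on ~1.41× more tokens, not 2× of either
print(scale_allocation(83e6, 1.66e9, 2.0))  # (≈117M params, ≈2.35B tokens)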

nanochat's default (via target_param_data_ratio=20):

  • Slightly overtrained relative to Chinchilla (which suggests ~14:1 for smaller models), i.e. a bit more data per parameter than strictly compute-optimal
  • Reasonable for experimentation where you might continue training

Loss vs Compute (Scaling Law)

Kaplan et al. (2020) scaling law:

L(C) = (C_0 / C)^α
where:
  L = validation loss (or bpb)
  C = compute (FLOPs)
  C_0 = constant (depends on architecture)
  α ≈ 0.05-0.07 (exponent)

In practice:

| Compute (FLOPs) | Model Size | Data | Expected bpb |
|---|---|---|---|
| 1e18 | 30M | 600M | 1.75 |
| 1e19 | 83M | 1.66B | 1.45 |
| 1e20 | 230M | 4.6B | 1.20 |
| 1e21 | 630M | 12.6B | 1.00 |

NOTE

Key insight: Improvements come slowly with compute. In the table above, each 10× increase in compute shaves only ~15-20% off the bpb, so halving the loss takes several orders of magnitude more compute.
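You can back the exponent out of the table above; as a rough sketch:

import math

# Two (compute, bpb) points from the table
c1, l1 = 1e18, 1.75
c2, l2 = 1e21, 1.00

# L(C) = (C_0 / C)^α  =>  α = ln(L1 / L2) / ln(C2 / C1)
alpha = math.log(l1 / l2) / math.log(c2 / c1)
print(round(alpha, 3))  # ≈ 0.081, the same order of magnitude as the quoted 0.05-0.07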

Practical Insights from Evaluation

1. Early Stopping vs Overtraining

Observation: For fixed model size, loss continues improving with more data, but with diminishing returns.

# Training beyond Chinchilla ratio
target_param_data_ratio = 20  # Standard
# vs
target_param_data_ratio = 40  # 2× more data

Results (depth=20, 83M params):

| Data Ratio | Tokens | val bpb | CORE | Training Time |
|---|---|---|---|---|
| 10× | 830M | 1.52 | 0.35 | 30 min |
| 20× | 1.66B | 1.45 | 0.38 | 60 min |
| 40× | 3.32B | 1.42 | 0.39 | 120 min |

Insight: 20× is a good default. Beyond that, returns diminish significantly.

2. bpb vs CORE Correlation

Generally:

  • Lower bpb → Higher CORE
  • But correlation is imperfect

Example anomaly:

Model A: bpb=1.45, CORE=0.38
Model B: bpb=1.44, CORE=0.36

Model B has slightly better compression but worse reasoning. This can happen when:

  • Model memorizes common phrases (lowers bpb)
  • But doesn't learn compositional reasoning (lowers CORE)

TIP

Lesson: Use both metrics! bpb for training dynamics, CORE for capabilities.

3. Validation Set Size

eval_tokens = 20 * 524288  # ~10M tokens

Why 10M tokens?

  • Large enough for stable estimates
  • Small enough to evaluate quickly (~30s)

Stability analysis:

| eval_tokens | bpb mean | bpb std | Eval time |
|---|---|---|---|
| 1M | 1.447 | 0.02 | 3s |
| 5M | 1.451 | 0.008 | 15s |
| 10M | 1.450 | 0.004 | 30s |
| 50M | 1.450 | 0.002 | 150s |

Conclusion: 10M tokens provides sufficient precision with reasonable compute.
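For reference, here is how that token budget turns into evaluation steps in the training loop shown earlier; the batch shape numbers below are illustrative assumptions, not nanochat's exact defaults:

eval_tokens = 20 * 524_288                # ≈ 10.5M tokens
device_batch_size = 32                    # assumed per-GPU batch size
max_seq_len = 2048                        # assumed sequence length
ddp_world_size = 8                        # assumed number of GPUs
eval_steps = eval_tokens // (device_batch_size * max_seq_len * ddp_world_size)
print(eval_steps)  # 20 evaluation steps per GPU at these settings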

4. Learning Rate Warmdown

nanochat uses warmdown (gradual LR decay) in the final 20% of training:

warmdown_ratio = 0.2
final_lr_frac = 0.0
 
def get_lr_multiplier(it):
    warmdown_iters = round(warmdown_ratio * num_iterations)
    if it <= num_iterations - warmdown_iters:
        return 1.0
    else:
        progress = (num_iterations - it) / warmdown_iters
        return progress * 1.0 + (1 - progress) * final_lr_frac
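Evaluated at a few steps of the ~3,166-iteration run from earlier, the multiplier stays at 1.0 until the last 20% of training and then decays linearly to zero:

num_iterations = 3166  # from the depth-20 example above
for it in [0, 1000, 2500, 2850, 3166]:
    print(it, round(get_lr_multiplier(it), 2))
# 0 1.0, 1000 1.0, 2500 1.0, 2850 0.5, 3166 0.0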

Effect on validation loss:

Steps 0-2500:    LR = 1.0 × base_lr,  val_bpb slowly decreasing
Steps 2500-3166: LR = 1.0 → 0.0,      val_bpb drops faster

Final val_bpb improvement: ~0.03 bpb (2% relative)

Why it works: Model fine-tunes to a sharper minimum in later training.

Advanced Topics

1. Bits Per Character vs Bits Per Byte

Some papers report "bits per character" (bpc):

# Bits per character
bpc = total_nats / (log(2) * num_characters)

Key difference:

  • bpb: UTF-8 bytes (e.g., "世界" = 6 bytes)
  • bpc: Unicode characters (e.g., "世界" = 2 characters)

For English text: bpc ≈ bpb (most characters = 1 byte)
For multilingual text: bpc < bpb (multi-byte characters)
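A quick illustration of the gap for multilingual text:

s = "Hello, 世界"
print(len(s))                   # 9 characters
print(len(s.encode("utf-8")))   # 13 bytes (each CJK character takes 3 bytes in UTF-8)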

nanochat uses bpb because:

  • More universal (bytes are always bytes)
  • Easier to compute (no character counting)
  • Standard in modern LLM papers

2. Perplexity

Some papers report perplexity instead of bpb:

perplexity = exp(cross_entropy_loss)

Relationship to bpb (defined per byte rather than per token):

bpb = nats_per_byte / log(2)
byte-level perplexity = 2^bpb

Example:

  • bpb = 1.45
  • byte-level perplexity = 2^1.45 ≈ 2.73

Interpretation: Per byte of text, the model is on average as uncertain as if it were choosing uniformly among ~2.7 options. (Token-level perplexity, exp of the mean per-token loss, is a different number and depends on the tokenizer.)

3. CORE vs Other Benchmarks

Comparison:

| Benchmark | Tasks | Coverage | Pros | Cons |
|---|---|---|---|---|
| CORE | 11 | Broad | Single metric, comprehensive | Less common than alternatives |
| HELM | 42 | Very broad | Industry standard | Slow, complex |
| Eleuther Eval Harness | 200+ | Extremely broad | Most comprehensive | Very slow |
| MMLU | 57 | Knowledge-focused | Standard, well-known | Narrow (only knowledge) |

nanochat chooses CORE for:

  • ✅ Fast evaluation (~15 min on 8 GPUs)
  • ✅ Good coverage without redundancy
  • ✅ From recent, well-regarded paper (DCLM)

4. Curriculum Learning

Question: Should we change the data distribution during training?

# Potential curriculum strategy
if step < warmup_steps:
    data_loader = easy_data_loader  # Simpler texts
else:
    data_loader = full_data_loader  # All data

nanochat's choice: No curriculum (uniform distribution throughout)

Rationale:

  • Simpler implementation
  • Data is already shuffled (good enough)
  • No clear evidence curriculum helps for LLMs at this scale

Debugging Training Runs

Common Issues

1. Loss diverges (NaN)

Step 100: loss=2.8
Step 101: loss=3.2
Step 102: loss=8.5
Step 103: loss=nan

WARNING

Likely causes:

  • Learning rate too high
  • Gradient clipping disabled/too high
  • Numerical instability in forward pass

Solution:

grad_clip = 1.0  # Enable gradient clipping
matrix_lr = 0.01  # Reduce Muon LR

2. Loss plateaus early

Step 0-500: bpb 4.5 → 2.1
Step 500-3000: bpb 2.1 → 2.09

Likely causes:

  • Learning rate too low
  • Model too small for dataset
  • Data quality issues

Solution:

  • Increase learning rate
  • Increase model depth
  • Inspect data distribution

3. Train/val loss diverge

Step 2000: train_bpb=1.4, val_bpb=1.8

Likely causes:

  • Overfitting (model too large)
  • Train/val distribution mismatch
  • Bug in validation evaluation

Solution:

  • Check dataset splits
  • Verify eval implementation
  • Reduce model size if overfitting

Conclusion

nanochat's evaluation framework provides the tools needed to understand and optimize language model training:

Bits per byte (bpb):

  • ✅ Tokenization-agnostic metric
  • ✅ Enables fair comparison across models
  • ✅ Fast to evaluate (~30 seconds)
  • ✅ Stable training signal

CORE benchmark:

  • ✅ Comprehensive task coverage
  • ✅ Single-number metric for model quality
  • ✅ Normalized for fair comparison
  • ✅ Parallelized for efficiency

Scaling laws:

  • ✅ Power-law relationship between compute and performance
  • ✅ Chinchilla ratio (20:1 data:params) as default
  • ✅ Smooth, predictable improvements

Together, these tools enable:

  • Principled hyperparameter selection
  • Efficient compute allocation
  • Early detection of training issues
  • Reliable model comparison

The evaluation infrastructure is not an afterthought - it's fundamental to understanding what's happening during training. By implementing proper metrics and comprehensive benchmarks, nanochat demonstrates that even small-scale projects can employ the same rigorous evaluation methods used by frontier labs.


Track 1 Complete! 🎉

Coming next: Track 2 (Practical Guides) - Hands-on tutorials for building your own ChatGPT!

