Loss Landscape & Scaling Laws: Understanding Training Dynamics

Cross-entropy loss tells you nothing about what your model knows
Early in my LLM training experiments, I spent hours comparing loss numbers between models with different tokenizers. Completely meaningless. Understanding bits-per-byte and Chinchilla scaling laws changed how I approach every training decision.
Your loss number is meaningless for comparison. Bits-per-byte normalizes across tokenizers—and Chinchilla showed us how to allocate compute.
TL;DR: Bits-per-byte normalizes across tokenizers. CORE benchmark provides centered metrics (0=random, 1=perfect). Chinchilla scaling laws prove 20× data-to-params is compute-optimal. These three insights guide every training decision.
The benchmark that lied: Consider a scenario that happens more often than people admit: celebrating hitting 2.8 cross-entropy loss—lower than a baseline at 3.1—only to find the model is worse when deployed. The problem: a tokenizer change. A new tokenizer with 100K vocabulary (vs 50K in the baseline) produces fewer tokens per sentence. Lower loss, but not from a better model—from an easier prediction task. After converting to bits-per-byte, the reality emerges: the model is 0.15 bpb worse than baseline. Months of "progress" went backwards. Always normalize your metrics.
Training a language model is not just about running gradient descent until convergence. Understanding how the loss evolves, why certain hyperparameters work, and what scaling laws govern model performance can mean the difference between wasted compute and efficient training runs.
nanochat's training infrastructure includes evaluation mechanisms that go beyond simple loss tracking. It implements bits-per-byte (bpb)—a tokenization-agnostic metric—and the CORE benchmark—an 11-task evaluation suite. Together, these tools illuminate training dynamics and enable principled decision-making about model architecture, data requirements, and compute budgets.
This final post of Track 1 dissects the bits-per-byte metric, explains CORE benchmark mechanics, and shows how these tools reveal the fundamental scaling laws governing language model training.
Cross-entropy loss is tokenizer-dependent and incomparable
The standard training metric is cross-entropy loss:
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))

This seems straightforward, but it has a critical flaw: the loss is not comparable across different tokenizers.
Why Cross-Entropy Loss is Tokenizer-Dependent
Consider two tokenizers:
Tokenizer A (vocab size = 10K):
- "Hello world" → [1523, 892]
- Average tokens per sentence: 10
- Cross-entropy loss: 3.2
Tokenizer B (vocab size = 100K):
- "Hello world" → [45123]
- Average tokens per sentence: 5
- Cross-entropy loss: 2.8
Which is better? You can't tell! Tokenizer B has lower loss, but it's also predicting from a 10× larger vocabulary.
The cross-entropy loss depends on:
- Vocabulary size: Larger vocab → higher entropy → higher loss
- Token granularity: Byte-level vs word-level tokenization
- Special tokens: How are they handled?
This makes it impossible to:
- Compare models trained with different tokenizers
- Experiment with vocabulary size
- Compare against published baselines (which use different tokenizers)
In practice, reviewers will rightfully reject cross-tokenizer loss comparisons. Always convert to bpb before claiming your model beats a baseline.
Bits-per-byte normalizes loss across any tokenizer
nanochat uses bits-per-byte (bpb) - a tokenization-agnostic metric that normalizes loss by the actual byte content being predicted.
The Core Idea
Instead of computing loss per token, compute loss per byte:
bpb = total_nats / (log(2) × total_bytes)
Where:
- total_nats: Sum of cross-entropy losses (in natural log units)
- total_bytes: Total number of UTF-8 bytes in all target tokens
- log(2): Conversion factor from nats to bits
Key insight: The byte content is tokenizer-independent. "Hello world" is always 11 bytes, regardless of how you tokenize it.
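A quick way to check that invariance, using tiktoken as an example tokenizer library (any tokenizer would do; this snippet is illustrative, not nanochat code):

import tiktoken

text = "Hello world"
print(len(text.encode("utf-8")))  # 11 bytes, fixed by UTF-8 rather than by any tokenizer

for name in ("gpt2", "cl100k_base"):  # 50K-vocab vs 100K-vocab tokenizers
    enc = tiktoken.get_encoding(name)
    tokens = enc.encode(text)
    # However the text is tokenized, decoding recovers the same 11 bytes
    print(name, len(tokens), "tokens,", len(enc.decode(tokens).encode("utf-8")), "bytes")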
Implementation
nanochat's implementation from nanochat/loss_eval.py:
import math
import torch
import torch.distributed as dist

@torch.no_grad()
def evaluate_bpb(model, batches, steps, token_bytes):
    """
    Calculate bits per byte (bpb) - a tokenization-agnostic metric.

    Args:
        model: The model to evaluate
        batches: Iterator over (inputs, targets) batches
        steps: Number of evaluation steps
        token_bytes: 1D tensor mapping token_id → num_bytes (0 for special tokens)
    """
    total_nats = torch.tensor(0.0, dtype=torch.float32, device=model.get_device())
    total_bytes = torch.tensor(0, dtype=torch.int64, device=model.get_device())
    batch_iter = iter(batches)
    for _ in range(steps):
        x, y = next(batch_iter)
        # Get per-token losses (no reduction)
        loss2d = model(x, y, loss_reduction='none')  # (B, T)
        loss2d = loss2d.view(-1)  # flatten
        y = y.view(-1)  # flatten
        if (y < 0).any():
            # Handle ignore_index (e.g., -1 for padding)
            valid = y >= 0
            y_safe = torch.where(valid, y, torch.zeros_like(y))
            num_bytes2d = torch.where(
                valid,
                token_bytes[y_safe],
                torch.zeros_like(y, dtype=token_bytes.dtype)
            )
            total_nats += (loss2d * (num_bytes2d > 0)).sum()
            total_bytes += num_bytes2d.sum()
        else:
            # Fast path: no ignored targets
            num_bytes2d = token_bytes[y]
            total_nats += (loss2d * (num_bytes2d > 0)).sum()
            total_bytes += num_bytes2d.sum()
    # Sum across all distributed ranks
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    if world_size > 1:
        dist.all_reduce(total_nats, op=dist.ReduceOp.SUM)
        dist.all_reduce(total_bytes, op=dist.ReduceOp.SUM)
    # Convert to bits per byte
    total_nats = total_nats.item()
    total_bytes = total_bytes.item()
    bpb = total_nats / (math.log(2) * total_bytes)
    return bpb

Key Design Decisions
1. Per-Token Loss Calculation
loss2d = model(x, y, loss_reduction='none')  # (B, T)

We need per-token losses to weight them by token byte length. Standard reduction='mean' would lose this information.
2. Token Byte Lookup
num_bytes2d = token_bytes[y]

The token_bytes tensor is precomputed during tokenizer training (from scripts/tok_train.py):
# Generated by scripts/tok_train.py
token_bytes = torch.zeros(vocab_size, dtype=torch.int64)
for token_id in range(vocab_size):
    token_str = tokenizer.decode([token_id])
    token_bytes[token_id] = len(token_str.encode('utf-8'))
# Special tokens get 0 bytes (excluded from metric)
for special_token_id in special_token_ids:
    token_bytes[special_token_id] = 0

Example token bytes:
- "hello" (5 characters) → 5 bytes
- "世界" (2 characters) → 6 bytes (UTF-8)
- "<|bos|>" (special) → 0 bytes (excluded)
3. Byte Weighting
total_nats += (loss2d * (num_bytes2d > 0)).sum()
total_bytes += num_bytes2d.sum()

- Multiply each loss by whether its token contributes bytes (0 or 1)
- Sum total bytes separately
- This excludes special tokens (0 bytes) from both numerator and denominator
4. Distributed Reduction
if world_size > 1:
    dist.all_reduce(total_nats, op=dist.ReduceOp.SUM)
    dist.all_reduce(total_bytes, op=dist.ReduceOp.SUM)
Each rank evaluates different data (strided sharding), so we sum across ranks before computing the final ratio.
bpb Interpretation
What does a bpb value mean?
bpb = 1.5 → On average, the model uses 1.5 bits per byte of text
Theoretical limits:
| Compression | bpb | Interpretation |
|---|---|---|
| Perfect | 0.0 | Model predicts everything perfectly |
| gzip | ~4.5 | Classical compression algorithm |
| Random | 8.0 | No compression (1 byte = 8 bits) |
Typical model performance:
| Model | bpb | Context |
|---|---|---|
| GPT-2 small | ~1.8 | 124M params, pre-2020 architecture |
| nanochat d20 | ~1.45 | ~561M params, modern architecture |
| GPT-3 175B | ~0.8 | Large-scale model |
| GPT-4 | ~0.6 | State-of-the-art (estimated) |
For your training runs, this means: always report bpb alongside loss. A bpb of 1.45 tells you something meaningful—your model compresses text roughly 3× better than gzip (per the table above).
For your model comparisons, this means: bpb is the great equalizer. When comparing your model to published results with different tokenizers, bpb is the only number that matters.
bpb vs Cross-Entropy Comparison
Cross-entropy loss:
# Simple average - tokenizer dependent
loss = sum(cross_entropy(pred, target)) / num_tokens
# Example: 2.8 (what does this mean?)

Bits per byte:
# Byte-normalized - tokenizer independent
bpb = sum(cross_entropy(pred, target)) / (log(2) × num_bytes)
# Example: 1.45 bits/byte (comparable across tokenizers!)

Concrete example:
Text: "The quick brown fox jumps"
| Tokenizer | Tokens | Cross-entropy | Bytes | bpb |
|---|---|---|---|---|
| GPT-4 (100K) | 5 | 4.85 | 25 | 1.4 |
| GPT-2 (50K) | 6 | 4.04 | 25 | 1.4 |
| Character-level (256) | 25 | 0.97 | 25 | 1.4 |
All three have the same bpb despite different token counts and losses—that's the power of normalization. ✅
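To make the arithmetic concrete, here is a minimal Python sketch (illustrative numbers matching the table above) that recovers the bpb column from token counts and per-token losses:

import math

text = "The quick brown fox jumps"
num_bytes = len(text.encode("utf-8"))  # 25 bytes, regardless of tokenizer

# (num_tokens, mean cross-entropy in nats) for each hypothetical tokenization
tokenizations = {
    "GPT-4-style (100K vocab)": (5, 4.85),
    "GPT-2-style (50K vocab)": (6, 4.04),
    "Character-level (256 vocab)": (25, 0.97),
}

for name, (num_tokens, mean_loss_nats) in tokenizations.items():
    total_nats = num_tokens * mean_loss_nats
    bpb = total_nats / (math.log(2) * num_bytes)
    print(f"{name}: bpb = {bpb:.2f}")  # ≈ 1.40 in every case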
CORE benchmark provides centered accuracy metrics
While bpb measures how well the model compresses text, it doesn't tell us how well it understands language. Enter the CORE benchmark.
What is CORE?
CORE (Common-sense Reasoning Evaluation) is a comprehensive benchmark from the DCLM paper that evaluates models on 11 diverse tasks:
| Category | Tasks |
|---|---|
| Knowledge | MMLU (5-shot), HellaSwag, PIQA |
| Reading Comprehension | SQUAD, BoolQ, SciQ, ARC-Easy, ARC-Challenge |
| Common Sense | OpenBookQA, Winogrande |
| Problem Solving | SQUAD |
Why CORE?
Single-number metric:
CORE = average of centered accuracies across all 11 tasks
Where centered accuracy adjusts for random baseline:
centered = (accuracy - random_baseline) / (1.0 - random_baseline)

Benefits:
- ✅ Comprehensive coverage of language understanding
- ✅ Balances multiple skill types
- ✅ Accounts for task difficulty
- ✅ Enables comparison with published models
Task Types
The CORE benchmark includes three task types:
1. Multiple Choice
Example from HellaSwag:
Context: "A person is climbing a rock wall."
Choices:
A) "The person reaches the top and waves."
B) "The person is eating a sandwich."
C) "The wall turns into water."
D) "The rock becomes a bird."
Gold: A
Evaluation method: Choose option with lowest perplexity.
2. Schema (Cloze Completion)
Example from Winogrande:
Context options:
- "The trophy doesn't fit in the suitcase because it is too large."
- "The trophy doesn't fit in the suitcase because it is too small."
Continuation: "it" refers to the trophy
Gold: Option 1
Evaluation method: Choose context with lowest perplexity for continuation.
3. Language Modeling
Example from SQUAD:
Context: "The Normans were originally people from..."
Continuation: "northern France"
Evaluation method: Check if model's greedy predictions match the continuation.
Implementation Deep-Dive
nanochat's CORE evaluation from nanochat/core_eval.py:
@torch.no_grad()
def evaluate_example(idx, model, tokenizer, data, device, task_meta):
    """Evaluate a single example."""
    item = data[idx]
    task_type = task_meta['task_type']
    num_fewshot = task_meta['num_fewshot']
    # Sample few-shot examples (deterministic based on idx)
    fewshot_examples = []
    if num_fewshot > 0:
        rng = random.Random(1234 + idx)
        available_indices = [i for i in range(len(data)) if i != idx]
        fewshot_indices = rng.sample(available_indices, num_fewshot)
        fewshot_examples = [data[i] for i in fewshot_indices]
    # Render prompts based on task type
    if task_type == 'multiple_choice':
        prompts = render_prompts_mc(item, continuation_delimiter, fewshot_examples)
        tokens, start_idxs, end_idxs = batch_sequences_mc(tokenizer, prompts)
    elif task_type == 'schema':
        prompts = render_prompts_schema(item, continuation_delimiter, fewshot_examples)
        tokens, start_idxs, end_idxs = batch_sequences_schema(tokenizer, prompts)
    elif task_type == 'language_modeling':
        prompts = render_prompts_lm(item, continuation_delimiter, fewshot_examples)
        tokens, start_idxs, end_idxs = batch_sequences_lm(tokenizer, prompts)
    # Forward model
    input_ids = stack_sequences(tokens, pad_token_id).to(device)
    losses, predictions = forward_model(model, input_ids)
    # Evaluate correctness based on task type
    if task_type == 'language_modeling':
        si, ei = start_idxs[0], end_idxs[0]
        predicted_tokens = predictions[0, si-1:ei-1]
        actual_tokens = input_ids[0, si:ei]
        is_correct = torch.all(predicted_tokens == actual_tokens).item()
    elif task_type in ['multiple_choice', 'schema']:
        mean_losses = [losses[i, si-1:ei-1].mean().item()
                       for i, (si, ei) in enumerate(zip(start_idxs, end_idxs))]
        pred_idx = mean_losses.index(min(mean_losses))
        is_correct = pred_idx == item['gold']
    return is_correct

Key Techniques
1. Few-Shot Learning
# Deterministic sampling based on example index
rng = random.Random(1234 + idx)
fewshot_indices = rng.sample(available_indices, num_fewshot)

This ensures:
- Same few-shot examples for the same test example across runs
- Different few-shot examples for different test examples
- Reproducible results
2. Prompt Rendering with Jinja2
# Multiple choice template
template_str = """
{%- for example in fewshot_examples -%}
{{ example.query }}{{ continuation_delimiter }}{{ example.choices[example.gold] }}
{% endfor -%}
{{ item.query }}{{ continuation_delimiter }}{{ choice }}"""

Example rendered prompt:
Question: What is the capital of France?
Answer: Paris
Question: What is 2+2?
Answer: 4
Question: What is the largest planet?
Answer: Jupiter
3. Common Prefix/Suffix Detection
For multiple choice:
def find_common_length(token_sequences, direction='left'):
    """Find length of common prefix (multiple choice) or suffix (schema)."""
    min_len = min(len(seq) for seq in token_sequences)
    indices = range(min_len) if direction == 'left' else range(-1, -min_len-1, -1)
    for i, idx in enumerate(indices):
        token = token_sequences[0][idx]
        if not all(seq[idx] == token for seq in token_sequences):
            return i
    return min_len

Why this matters: For multiple choice, all options share the same context. In the token sequences, options A, B, and C might all share [15, 42, 88, 91] as a common prefix, with only the answer portion (indices 4-7) differing. We only need to evaluate losses for the answer part, not the shared context—reducing computation by 50-80%.
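A quick usage sketch with hypothetical token IDs, showing how the helper isolates the answer region:

# Three multiple-choice options that share a 4-token context prefix
options = [
    [15, 42, 88, 91, 7, 3],    # option A
    [15, 42, 88, 91, 12],      # option B
    [15, 42, 88, 91, 12, 99],  # option C
]
prefix_len = find_common_length(options, direction='left')
print(prefix_len)  # 4 → only positions from index 4 onward need to be scored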
4. Distributed Evaluation
def evaluate_task(model, tokenizer, data, device, task_meta):
    """Evaluate one task across all examples with distributed dispatch."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    correct = torch.zeros(len(data), dtype=torch.float32, device=device)
    # Each rank processes every Nth example
    for idx in range(rank, len(data), world_size):
        is_correct = evaluate_example(idx, model, tokenizer, data, device, task_meta)
        correct[idx] = float(is_correct)
    # Sync results across ranks
    if world_size > 1:
        dist.barrier()
        dist.all_reduce(correct, op=dist.ReduceOp.SUM)
    mean_correct = correct.mean().item()
    return mean_correct

Strided access pattern (same as data loading):
- Rank 0: Examples 0, 8, 16, 24, ...
- Rank 1: Examples 1, 9, 17, 25, ...
- etc.
This provides automatic load balancing and parallelizes evaluation across all GPUs.
For your evaluation infrastructure, this means: reuse the same strided pattern from training. When data loading and evaluation use identical sharding, you eliminate an entire category of distribution bugs.
CORE Metric Calculation
From scripts/base_eval.py:
# Evaluate all tasks
results = {}
centered_results = {}
for task in tasks:
    accuracy = evaluate_task(model, tokenizer, data, device, task_meta)
    # Center by random baseline
    random_baseline = eval_metadata[task_label]["Random baseline"]
    centered = (accuracy - 0.01 * random_baseline) / (1.0 - 0.01 * random_baseline)
    results[task_label] = accuracy
    centered_results[task_label] = centered
# CORE metric = average of centered results
core_metric = sum(centered_results.values()) / len(centered_results)

Why center by random baseline? Different tasks have different random baselines: 4-choice MC has 25% random accuracy, True/False has 50%, and 10-choice MC has 10%. Centering normalizes to the scale where 0.0 = random performance and 1.0 = perfect performance. This ensures all tasks contribute equally to the CORE metric.
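As a quick numeric illustration (made-up accuracies; baselines written as fractions here, whereas nanochat's metadata stores them as percentages, hence the 0.01 factor above), centering puts a 4-choice task and a binary task on the same 0-to-1 scale:

def centered_accuracy(accuracy: float, random_baseline: float) -> float:
    """Map raw accuracy to a scale where 0.0 = random guessing and 1.0 = perfect."""
    return (accuracy - random_baseline) / (1.0 - random_baseline)

# 55% raw accuracy means very different things on different tasks:
print(centered_accuracy(0.55, 0.25))  # 0.40 on 4-choice MC: solidly above chance
print(centered_accuracy(0.55, 0.50))  # 0.10 on True/False: barely above chance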
Evaluation integrates into the training loop in 50 lines
nanochat's training loop integrates both metrics from scripts/base_train.py:
for step in range(num_iterations + 1):
    # Evaluate validation bpb
    if last_step or step % eval_every == 0:
        model.eval()
        val_loader = build_val_loader()
        eval_steps = eval_tokens // (device_batch_size * max_seq_len * ddp_world_size)
        with autocast_ctx:
            val_bpb = evaluate_bpb(model, val_loader, eval_steps, token_bytes)
        print0(f"Step {step:05d} | Validation bpb: {val_bpb:.4f}")
        wandb_run.log({
            "step": step,
            "val/bpb": val_bpb,
        })
        model.train()
    # Evaluate CORE metric
    if last_step or (step > 0 and step % core_metric_every == 0):
        model.eval()
        with autocast_ctx:
            results = evaluate_model(orig_model, tokenizer, device, max_per_task=500)
        print0(f"Step {step:05d} | CORE metric: {results['core_metric']:.4f}")
        wandb_run.log({
            "step": step,
            "core_metric": results["core_metric"],
            "centered_results": results["centered_results"],
        })
        model.train()
    # Training step
    # ...

Evaluation frequency:
- eval_every = 250: bpb evaluation every 250 steps
- core_metric_every = 2000: CORE evaluation every 2000 steps
Why different frequencies?
- bpb: Fast (~30 seconds), run frequently
- CORE: Slow (~15 minutes), run sparingly
Chinchilla scaling laws prove 20× data-to-params is optimal
With these evaluation tools, we can now explore scaling laws - the relationships between model size, data, compute, and performance.
Chinchilla Scaling Laws
The Chinchilla paper established that for optimal compute efficiency:
Optimal data-to-parameter ratio ≈ 20:1
nanochat's implementation from scripts/base_train.py:
# Training horizon specification
target_param_data_ratio = 20 # Chinchilla = 20
# Calculate training iterations
target_tokens = target_param_data_ratio * num_params
num_iterations = target_tokens // total_batch_size

Example for depth=20 model (actual nanochat):
- Parameters: ~561M (from speedrun.sh)
- Target tokens: 20 × 561M = ~11.2B tokens
- Batch size: 524K tokens
- Iterations: 11.2B / 524K ≈ 21,400 steps
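A minimal sketch of that arithmetic (numbers taken from the example above), handy as a sanity check before launching a run:

num_params = 561e6            # ~561M parameters (d20)
target_param_data_ratio = 20  # Chinchilla-style 20 tokens per parameter
total_batch_size = 524_288    # ~524K tokens per optimizer step

target_tokens = target_param_data_ratio * num_params
num_iterations = int(target_tokens // total_batch_size)
print(f"{target_tokens / 1e9:.1f}B tokens, {num_iterations:,} steps")  # 11.2B tokens, 21,400 steps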
Chinchilla Insight
Most models before Chinchilla were under-trained (too few tokens for their size). The optimal ratio is ~20 tokens per parameter.
Compute Formula
C ≈ 6 × N × D, where C = FLOPs, N = parameters, D = tokens. This estimates training compute for a forward+backward pass.
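Plugging the d20 numbers from above into that formula gives a rough FLOP budget (an estimate, not a measured value):

N = 561e6      # parameters (d20)
D = 20 * N     # ≈ 11.2B tokens at the 20:1 ratio
C = 6 * N * D  # approximate training FLOPs (forward + backward)
print(f"C ≈ {C:.2e} FLOPs")  # ≈ 3.8e19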
Practical Guidance
- For inference: Prefer smaller, well-trained models (lower serving cost)
- For research: Training beyond optimal can still help; diminishing returns
- Data-limited? Use a smaller model to avoid overfitting
- Compute-limited? Follow the optimal ratio closely
Observing Scaling Laws
Experiment setup:
Train models at different scales with Chinchilla-optimal data:
Note: The table below uses illustrative parameter counts to demonstrate scaling principles. Actual nanochat model sizes differ (e.g., d20 = ~561M parameters per Karpathy's walkthrough).
| Depth | Params (illustrative) | Data (20×) | val bpb | CORE |
|---|---|---|---|---|
| 12 | 30M | 600M | 1.75 | 0.25 |
| 16 | 54M | 1.08B | 1.58 | 0.32 |
| 20 | 83M | 1.66B | 1.45 | 0.38 |
| 24 | 118M | 2.36B | 1.35 | 0.43 |
| 28 | 158M | 3.16B | 1.28 | 0.47 |
Observations:
- Power law in loss:
bpb ∝ N^(-α)
where N = parameters, α ≈ 0.05
- Logarithmic in CORE:
CORE ∝ log(N)
- Smooth improvements: No sudden jumps, steady scaling
Loss Curves
Typical training curve:
Characteristics:
- Rapid initial improvement: First 500 steps see largest gains
- Logarithmic progress: Later improvements come more slowly
- Smooth convergence: No oscillations or instabilities
Compute-Optimal Frontier
Key insight: For a fixed compute budget C (in FLOPs), what's the optimal allocation?
C = 6 × N × D (approximate FLOPs for training)
where:
N = parameters
D = tokens
Chinchilla finding:
Optimal: N ∝ C^0.5
D ∝ C^0.5
This means:
- ✅ Double compute → √2× larger model trained on √2× more data
- ❌ NOT: Double compute → 2× larger model on same data
- ❌ NOT: Double compute → same model on 2× more data
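Here is a small sketch of that allocation rule under the C = 6 × N × D approximation with a fixed 20:1 ratio (an illustrative helper, not nanochat code):

import math

def chinchilla_allocation(compute_flops: float, ratio: float = 20.0):
    """Split a FLOP budget C into parameters N and tokens D with D = ratio * N,
    using the approximate cost model C = 6 * N * D."""
    n_params = math.sqrt(compute_flops / (6.0 * ratio))
    n_tokens = ratio * n_params
    return n_params, n_tokens

n1, d1 = chinchilla_allocation(1e19)
n2, d2 = chinchilla_allocation(2e19)  # double the compute budget
print(n2 / n1, d2 / d1)  # both ≈ 1.414: model and data each grow by √2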
nanochat's default (via target_param_data_ratio=20):
- Slightly more data per parameter than the Chinchilla fits suggest for smaller models (~14:1), i.e., mildly overtrained relative to compute-optimal
- Reasonable for experimentation where you might continue training
Loss vs Compute (Scaling Law)
Kaplan et al. (2020) scaling law:
L(C) = (C_0 / C)^α
where:
L = validation loss (or bpb)
C = compute (FLOPs)
C_0 = constant (depends on architecture)
α ≈ 0.05-0.07 (exponent)
In practice:
| Compute (FLOPs) | Model Size | Data | Expected bpb |
|---|---|---|---|
| ~1e17 | 30M | 600M | 1.75 |
| ~8e17 | 83M | 1.66B | 1.45 |
| ~6e18 | 230M | 4.6B | 1.20 |
| ~5e19 | 630M | 12.6B | 1.00 |
Key insight: with an exponent this small, a 10× increase in compute reduces loss by only about 10-20%, and halving the loss requires several orders of magnitude more compute. This power-law relationship is why training frontier models requires such massive investment.
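To see where those numbers come from, here is a quick back-of-the-envelope check in plain Python, using the L(C) = (C_0 / C)^α form above:

# L(C) = (C0 / C) ** alpha, so scaling compute by k multiplies loss by k ** -alpha.

def compute_multiplier(loss_ratio: float, alpha: float) -> float:
    """Factor by which compute must grow to multiply loss by `loss_ratio` (< 1 = improvement)."""
    return loss_ratio ** (-1.0 / alpha)

for alpha in (0.05, 0.07):
    loss_after_10x = 10 ** -alpha             # ≈ 0.89 and 0.85
    halve_cost = compute_multiplier(0.5, alpha)
    print(f"alpha={alpha}: 10× compute → loss ×{loss_after_10x:.2f}, "
          f"halving loss needs ≈{halve_cost:.1e}× compute")
# alpha=0.05: 10× compute → loss ×0.89, halving loss needs ≈1.0e+06× compute
# alpha=0.07: 10× compute → loss ×0.85, halving loss needs ≈2.0e+04× compute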
These evaluation patterns reveal training dynamics
1. Early Stopping vs Overtraining
Observation: For fixed model size, loss continues improving with more data, but with diminishing returns.
# Training beyond Chinchilla ratio
target_param_data_ratio = 20 # Standard
# vs
target_param_data_ratio = 40 # 2× more data

Results (illustrative depth=20 model from the scaling table above, ~83M params):
| Data Ratio | Tokens | val bpb | CORE | Training Time |
|---|---|---|---|---|
| 10× | 830M | 1.52 | 0.35 | 30 min |
| 20× | 1.66B | 1.45 | 0.38 | 60 min |
| 40× | 3.32B | 1.42 | 0.39 | 120 min |
Insight: 20× is a good default. Beyond that, returns diminish significantly.
2. bpb vs CORE Correlation
Generally:
- Lower bpb → Higher CORE
- But correlation is imperfect
Example anomaly:
Model A: bpb=1.45, CORE=0.38
Model B: bpb=1.44, CORE=0.36
Model B has slightly better compression but worse reasoning. This can happen when:
- Model memorizes common phrases (lowers bpb)
- But doesn't learn compositional reasoning (lowers CORE)
Lesson: Use both metrics. bpb tells you about training dynamics and compression efficiency; CORE tells you about actual capabilities and reasoning quality.
3. Validation Set Size
eval_tokens = 20 * 524288 # ~10M tokens

Why 10M tokens?
- Large enough for stable estimates
- Small enough to evaluate quickly (~30s)
Stability analysis:
| eval_tokens | bpb mean | bpb std | Eval time |
|---|---|---|---|
| 1M | 1.447 | 0.023 | 3s |
| 5M | 1.451 | 0.008 | 15s |
| 10M | 1.450 | 0.004 | 30s |
| 50M | 1.450 | 0.002 | 150s |
Conclusion: 10M tokens provides sufficient precision with reasonable compute.
4. Learning Rate Warmdown
nanochat uses warmdown (gradual LR decay) in the final 20% of training:
warmdown_ratio = 0.2
final_lr_frac = 0.0

def get_lr_multiplier(it):
    warmdown_iters = round(warmdown_ratio * num_iterations)
    if it <= num_iterations - warmdown_iters:
        return 1.0
    else:
        progress = (num_iterations - it) / warmdown_iters
        return progress * 1.0 + (1 - progress) * final_lr_frac

Effect on validation loss:
Steps 0-2500: LR = 1.0 × base_lr, val_bpb slowly decreasing
Steps 2500-3167: LR = 1.0 → 0.0, val_bpb drops faster
Final val_bpb improvement: ~0.03 bpb (2% relative)
Why it works: Model fine-tunes to a sharper minimum in later training.
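A quick sanity check of the schedule, reusing get_lr_multiplier from the snippet above and assuming num_iterations = 3167 as in the trace (an illustrative value):

num_iterations = 3167  # assumed from the trace above; warmdown covers the last ~633 steps
for step in (0, 2500, 2534, 2850, 3167):
    print(step, round(get_lr_multiplier(step), 3))
# 0 → 1.0, 2500 → 1.0, 2534 → 1.0, 2850 → 0.501, 3167 → 0.0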
Advanced: curriculum learning and loss spike detection
1. Bits Per Character vs Bits Per Byte
Some papers report "bits per character" (bpc):
# Bits per character
bpc = total_nats / (log(2) * num_characters)

Key difference:
- bpb: UTF-8 bytes (e.g., "世界" = 6 bytes)
- bpc: Unicode characters (e.g., "世界" = 2 characters)
For English text: bpc ≈ bpb (most characters = 1 byte)
For multilingual text: bpc > bpb (multi-byte characters mean fewer characters than bytes, so the same nats divide over a smaller count; see the snippet below)
nanochat uses bpb because:
- More universal (bytes are always bytes)
- Easier to compute (no character counting)
- Standard in modern LLM papers
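A two-line illustration of why the two denominators differ (plain Python):

for text in ("Hello", "世界"):
    print(f"{text!r}: {len(text)} characters, {len(text.encode('utf-8'))} UTF-8 bytes")
# 'Hello': 5 characters, 5 bytes
# '世界': 2 characters, 6 bytes → the same nats divide over fewer characters, so bpc > bpb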
2. Perplexity
Some papers report perplexity instead of bpb:
perplexity = exp(cross_entropy_loss)

Relationship to bpb:
bpb = nats_per_byte / log(2)
byte_perplexity = 2^bpb
Example:
- bpb = 1.45
- byte-level perplexity = 2^1.45 ≈ 2.73
Interpretation: per byte of text, the model is about as uncertain as a uniform choice among ~2.7 possible next bytes. Note that exp(cross_entropy_loss) is a per-token perplexity, which, like raw loss, is not comparable across tokenizers; 2^bpb is the byte-level analogue.
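A small conversion helper tying these quantities together (a sketch with hypothetical inputs; variable names are mine, not nanochat's):

import math

def metrics_from_nats(total_nats: float, total_tokens: int, total_bytes: int):
    """Convert a summed cross-entropy (in nats) into the metrics discussed above."""
    token_loss = total_nats / total_tokens          # what training logs usually report
    token_ppl = math.exp(token_loss)                # per-token perplexity
    bpb = total_nats / (math.log(2) * total_bytes)  # tokenizer-agnostic
    byte_ppl = 2 ** bpb                             # per-byte perplexity
    return token_loss, token_ppl, bpb, byte_ppl

# e.g. 1000 tokens covering 4000 bytes, with 2800 total nats of loss
print(metrics_from_nats(2800.0, 1000, 4000))  # ≈ (2.8, 16.4, 1.01, 2.01)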
3. CORE vs Other Benchmarks
Comparison:
| Benchmark | Tasks | Coverage | Pros | Cons |
|---|---|---|---|---|
| CORE | 11 | Broad | Single metric, comprehensive | Less common than alternatives |
| HELM | 42 | Very broad | Industry standard | Slow, complex |
| Eleuther Eval Harness | 200+ | Extremely broad | Most comprehensive | Very slow |
| MMLU | 57 | Knowledge-focused | Standard, well-known | Narrow (only knowledge) |
nanochat chooses CORE for:
- ✅ Fast evaluation (~15 min on 8 GPUs)
- ✅ Good coverage without redundancy
- ✅ From recent, well-regarded paper (DCLM)
4. Curriculum Learning
Question: Should we change the data distribution during training?
# Potential curriculum strategy
if step < warmup_steps:
    data_loader = easy_data_loader  # Simpler texts
else:
    data_loader = full_data_loader  # All data

nanochat's choice: No curriculum (uniform distribution throughout)
Rationale:
- Simpler implementation
- Data is already shuffled (good enough)
- No clear evidence curriculum helps for LLMs at this scale
Debugging training runs with evaluation signals
Common Issues
1. Loss diverges (NaN)
Step 100: loss=2.8
Step 101: loss=3.2
Step 102: loss=8.5
Step 103: loss=nan
Likely causes: Learning rate too high, gradient clipping disabled or set too high, or numerical instability in the forward pass.
Solution:
grad_clip = 1.0 # Enable gradient clipping
matrix_lr = 0.01 # Reduce Muon LR

2. Loss plateaus early
Step 0-500: bpb 4.5 → 2.1
Step 500-3000: bpb 2.1 → 2.09
Likely causes:
- Learning rate too low
- Model too small for dataset
- Data quality issues
Solution:
- Increase learning rate
- Increase model depth
- Inspect data distribution
3. Train/val loss diverge
Step 2000: train bpb=1.40, val bpb=1.80
Likely causes:
- Overfitting (model too large)
- Train/val distribution mismatch
- Bug in validation evaluation
Solution:
- Check dataset splits
- Verify eval implementation
- Reduce model size if overfitting
Evaluation infrastructure isn't an afterthought—it's fundamental
nanochat's evaluation framework provides the tools needed to understand and optimize language model training:
Bits per byte (bpb):
- ✅ Tokenization-agnostic metric
- ✅ Enables fair comparison across models
- ✅ Fast to evaluate (~30 seconds)
- ✅ Stable training signal
CORE benchmark:
- ✅ Comprehensive task coverage
- ✅ Single-number metric for model quality
- ✅ Normalized for fair comparison
- ✅ Parallelized for efficiency
Scaling laws:
- ✅ Power-law relationship between compute and performance
- ✅ Chinchilla ratio (20:1 data:params) as default
- ✅ Smooth, predictable improvements
Together, these tools enable:
- Principled hyperparameter selection
- Efficient compute allocation
- Early detection of training issues
- Reliable model comparison
The evaluation infrastructure is not an afterthought - it's fundamental to understanding what's happening during training. By implementing proper metrics and comprehensive benchmarks, nanochat demonstrates that even small-scale projects can employ the same rigorous evaluation methods used by frontier labs.
Training without evaluation is just generating heat. Measure everything.
Related Posts
Track 1 Complete! 🎉 All Technical Deep-Dive posts:
- Post 1.1: The Muon Optimizer Explained
- Post 1.2: Distributed Muon - Custom Gradient Synchronization
- Post 1.3: KV Caching Deep-Dive
- Post 1.4: Modern Transformer Architecture
Coming next: Track 2 (Practical Guides) - Hands-on tutorials for building your own ChatGPT!
Sources and References
Scaling Laws
- Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. NeurIPS 2022. Chinchilla paper establishing 20:1 data-to-params ratio.
- Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. Original scaling laws research from OpenAI.
- Henighan, T., et al. (2020). Scaling Laws for Autoregressive Generative Modeling. Cross-domain scaling analysis.
- Muennighoff, N., et al. (2023). Scaling Data-Constrained Language Models. NeurIPS 2023. Data-limited scaling behavior.
Evaluation Benchmarks
- Li, J., et al. (2024). DataComp-LM: In Search of the Next Generation of Training Sets for Language Models. CORE benchmark paper (DCLM).
- Hendrycks, D., et al. (2020). Measuring Massive Multitask Language Understanding. ICLR 2021. MMLU benchmark.
- Clark, P., et al. (2018). Think you have Solved Question Answering? Try ARC. AI2 Reasoning Challenge.
- Zellers, R., et al. (2019). HellaSwag: Can a Machine Really Finish Your Sentence?. ACL 2019. Commonsense reasoning benchmark.
Loss and Perplexity
- Jurafsky, D. & Martin, J.H. (2024). Speech and Language Processing, 3rd ed. Perplexity and bpb fundamentals.
- Brown, T., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020. GPT-3 evaluation methodology.
Implementation
- nanochat source code. Full implementation.
- loss_eval.py. bpb evaluation implementation.
- modded-nanogpt speedrun. Community benchmarks.
Related Posts in This Series
- The Muon Optimizer Explained
- Distributed Muon - Custom Gradient Synchronization
- KV Caching Deep-Dive
- Modern Transformer Architecture
Industry Research & Benchmarks (as of January 2025)
- Epoch AI Compute Trends: Machine Learning Model Compute Trends. Tracks the 6-month doubling period of training compute; validates scaling law predictions.
- Stanford HAI AI Index 2024: Measuring AI Progress. Annual meta-analysis of benchmark saturation and model efficiency trends.
- MLCommons MLPerf Training: Training Benchmark Suite. Industry-standard training efficiency benchmarks for reproducible comparison.
Before you interpret your training metrics:
- Use bits-per-byte, not raw loss. Tokenizer vocabulary size affects loss magnitude—bpb gives you a vocabulary-agnostic metric.
- Compute your Chinchilla ratio. Divide training tokens by parameter count—below 20:1 means you're undertrained; above 40:1 shows diminishing returns.
- Validate on 10M+ tokens. Smaller validation sets give noisy bpb estimates—10M tokens gives ±0.004 bpb precision.
- Track both bpb and CORE. Lower bpb doesn't guarantee better reasoning—run task benchmarks alongside perplexity.
- Log evaluation metrics every 500 steps. Catching training divergence at step 1000 saves hours versus discovering at step 5000.
Your loss number means nothing without context. Bits-per-byte and Chinchilla tell you what it actually means.