Loss Landscape & Scaling Laws: Understanding Training Dynamics

Introduction
Training a language model is not just about running gradient descent until convergence. Understanding how the loss evolves, why certain hyperparameters work, and what scaling laws govern model performance can mean the difference between wasted compute and efficient training runs.
nanochat's training infrastructure includes sophisticated evaluation mechanisms that go beyond simple loss tracking. It implements bits-per-byte (bpb) - a tokenization-agnostic metric, and the CORE benchmark - a comprehensive evaluation suite covering 11 diverse tasks. Together, these tools provide deep insights into training dynamics and enable principled decision-making about model architecture, data requirements, and compute budgets.
This final post of Track 1 explores nanochat's evaluation framework, dissecting the bits-per-byte metric, understanding the CORE benchmark, and examining how these tools illuminate the fundamental scaling laws that govern language model training.
The Problem with Cross-Entropy Loss
The standard training metric is cross-entropy loss:
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
WARNING
This seems straightforward, but it has a critical flaw: The loss is not comparable across different tokenizers.
Why Cross-Entropy Loss is Tokenizer-Dependent
Consider two tokenizers:
Tokenizer A (vocab size = 10K):
- "Hello world" → [1523, 892]
- Average tokens per sentence: 10
- Cross-entropy loss: 3.2
Tokenizer B (vocab size = 100K):
- "Hello world" → [45123]
- Average tokens per sentence: 5
- Cross-entropy loss: 2.8
Which is better? You can't tell! Tokenizer B has lower loss, but it's also predicting from a 10× larger vocabulary.
The cross-entropy loss depends on:
- Vocabulary size: Larger vocab → higher entropy → higher loss
- Token granularity: Byte-level vs word-level tokenization
- Special tokens: How are they handled?
This makes it impossible to:
- Compare models trained with different tokenizers
- Experiment with vocabulary size
- Compare against published baselines (which use different tokenizers)
Bits Per Byte (bpb): The Solution
nanochat uses bits-per-byte (bpb) - a tokenization-agnostic metric that normalizes loss by the actual byte content being predicted.
The Core Idea
Instead of computing loss per token, compute loss per byte:
bpb = total_nats / (log(2) × total_bytes)
Where:
- total_nats: Sum of cross-entropy losses (in natural log units)
- total_bytes: Total number of UTF-8 bytes in all target tokens
- log(2): Conversion factor from nats to bits
NOTE
Key insight: The byte content is tokenizer-independent. "Hello world" is always 11 bytes, regardless of how you tokenize it.
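As a minimal sketch of the formula above (the helper name `bits_per_byte` and all numbers are illustrative assumptions, not taken from nanochat), the same text scored under two hypothetical tokenizations gets the same bpb whenever the summed loss is the same, even though the per-token averages differ:

```python
import math

def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy (in nats) into bits per byte."""
    return total_nats / (math.log(2) * total_bytes)

text = "Hello world"
num_bytes = len(text.encode("utf-8"))  # 11 bytes, whatever the tokenizer

# Hypothetical: both tokenizations assign the text the same total loss,
# but spread it over a different number of tokens.
total_nats = 11.0
per_token_a = total_nats / 2  # tokenizer A splits the text into 2 tokens
per_token_b = total_nats / 1  # tokenizer B emits a single token

bpb = bits_per_byte(total_nats, num_bytes)
print(f"per-token loss A={per_token_a:.2f}, B={per_token_b:.2f}, bpb={bpb:.3f}")
```

The per-token losses differ by a factor of two, while the byte-normalized value is identical for both.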
Implementation
nanochat's implementation from nanochat/loss_eval.py:
import math

import torch
import torch.distributed as dist

@torch.no_grad()
def evaluate_bpb(model, batches, steps, token_bytes):
    """
    Calculate bits per byte (bpb) - a tokenization-agnostic metric.
    Args:
        model: The model to evaluate
        batches: Iterator over (inputs, targets) batches
        steps: Number of evaluation steps
        token_bytes: 1D tensor mapping token_id → num_bytes (0 for special tokens)
    """
    total_nats = torch.tensor(0.0, dtype=torch.float32, device=model.get_device())
    total_bytes = torch.tensor(0, dtype=torch.int64, device=model.get_device())
    batch_iter = iter(batches)
    for _ in range(steps):
        x, y = next(batch_iter)
        # Get per-token losses (no reduction)
        loss2d = model(x, y, loss_reduction='none') # (B, T)
        loss2d = loss2d.view(-1) # flatten
        y = y.view(-1) # flatten
        if (y < 0).any():
            # Handle ignore_index (e.g., -1 for padding)
            valid = y >= 0
            y_safe = torch.where(valid, y, torch.zeros_like(y))
            num_bytes2d = torch.where(
                valid,
                token_bytes[y_safe],
                torch.zeros_like(y, dtype=token_bytes.dtype)
            )
            total_nats += (loss2d * (num_bytes2d > 0)).sum()
            total_bytes += num_bytes2d.sum()
        else:
            # Fast path: no ignored targets
            num_bytes2d = token_bytes[y]
            total_nats += (loss2d * (num_bytes2d > 0)).sum()
            total_bytes += num_bytes2d.sum()
    # Sum across all distributed ranks
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    if world_size > 1:
        dist.all_reduce(total_nats, op=dist.ReduceOp.SUM)
        dist.all_reduce(total_bytes, op=dist.ReduceOp.SUM)
    # Convert to bits per byte
    total_nats = total_nats.item()
    total_bytes = total_bytes.item()
    bpb = total_nats / (math.log(2) * total_bytes)
    return bpb
Key Design Decisions
1. Per-Token Loss Calculation
loss2d = model(x, y, loss_reduction='none') # (B, T)
We need per-token losses so that tokens contributing no bytes can be masked out before summing. A standard reduction='mean' would average that information away.
2. Token Byte Lookup
num_bytes2d = token_bytes[y]
The token_bytes tensor is precomputed during tokenizer training (from scripts/tok_train.py):
# Generated by scripts/tok_train.py
token_bytes = torch.zeros(vocab_size, dtype=torch.int64)
for token_id in range(vocab_size):
    token_str = tokenizer.decode([token_id])
    token_bytes[token_id] = len(token_str.encode('utf-8'))
# Special tokens get 0 bytes (excluded from metric)
for special_token_id in special_token_ids:
    token_bytes[special_token_id] = 0
Example token bytes:
- "hello" (5 characters) → 5 bytes
- "世界" (2 characters) → 6 bytes (UTF-8)
- "<|bos|>" (special) → 0 bytes (excluded)
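The byte counts listed above can be checked directly with Python's built-in UTF-8 encoder. This is only a sketch: `token_num_bytes` and the `special_tokens` set are hypothetical stand-ins for the tokenizer-specific logic.

```python
def token_num_bytes(token_str: str, special_tokens: set) -> int:
    """Byte length of a token's text; special tokens count as 0 bytes."""
    if token_str in special_tokens:
        return 0
    return len(token_str.encode("utf-8"))

special_tokens = {"<|bos|>"}  # hypothetical special-token set
examples = ["hello", "世界", "<|bos|>"]

byte_lengths = {s: token_num_bytes(s, special_tokens) for s in examples}
char_lengths = {s: len(s) for s in examples}

for s in examples:
    print(f"{s!r}: {char_lengths[s]} characters, {byte_lengths[s]} bytes counted")
```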
3. Byte Weighting
total_nats += (loss2d * (num_bytes2d > 0)).sum()
total_bytes += num_bytes2d.sum()
- Multiply each loss by whether its token contributes bytes (0 or 1)
- Sum total bytes separately
- This excludes special tokens (0 bytes) from both numerator and denominator
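The masking step can be illustrated with plain Python lists in place of tensors. The loss values and byte counts below are made up for illustration; the second token stands in for a special token.

```python
import math

losses = [2.0, 1.0, 3.0, 0.5]  # per-token losses in nats (made-up values)
num_bytes = [5, 0, 3, 2]       # 0 bytes marks a special token

# Numerator: only losses of tokens that contribute bytes.
total_nats = sum(loss for loss, nb in zip(losses, num_bytes) if nb > 0)
# Denominator: special tokens add 0, so they drop out here as well.
total_bytes = sum(num_bytes)

bpb = total_nats / (math.log(2) * total_bytes)
print(f"total_nats={total_nats}, total_bytes={total_bytes}, bpb={bpb:.3f}")
```

The loss of 1.0 on the special token never reaches the numerator, and its zero byte count leaves the denominator unchanged.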
4. Distributed Reduction
if world_size > 1:
    dist.all_reduce(total_nats, op=dist.ReduceOp.SUM)
    dist.all_reduce(total_bytes, op=dist.ReduceOp.SUM)
Each rank evaluates different data (strided sharding), so we sum across ranks before computing the final ratio.
bpb Interpretation
What does a bpb value mean?
bpb = 1.5 → On average, the model uses 1.5 bits per byte of text
Reference points:
| Compression | bpb | Interpretation |
|---|---|---|
| Perfect | 0.0 | Model predicts every byte with certainty |
| gzip | ~2.5-3 | Classical compression algorithm on typical English text |
| Random | 8.0 | No compression (1 byte = 8 bits) |
Typical model performance:
| Model | bpb | Context |
|---|---|---|
| GPT-2 small | ~1.8 | 124M params, pre-2020 architecture |
| nanochat d20 | ~1.45 | 83M params, modern architecture |
| GPT-3 175B | ~0.8 | Large-scale model |
| GPT-4 | ~0.6 | State-of-the-art (estimated) |
bpb vs Cross-Entropy Comparison
Cross-entropy loss:
# Simple average - tokenizer dependent
loss = sum(cross_entropy(pred, target)) / num_tokens
# Example: 2.8 (what does this mean?)
Bits per byte:
# Byte-normalized - tokenizer independent
bpb = sum(cross_entropy(pred, target)) / (log(2) * num_bytes)
# Example: 1.45 bits/byte (comparable across tokenizers!)
Concrete example:
Text: "The quick brown fox jumps"
| Tokenizer | Tokens | Cross-entropy (nats/token) | Bytes | bpb |
|---|---|---|---|---|
| GPT-4 (100K) | 5 | 4.85 | 25 | 1.4 |
| GPT-2 (50K) | 6 | 4.04 | 25 | 1.4 |
| Character-level (256) | 25 | 0.97 | 25 | 1.4 |
TIP
All three have the same bpb despite different token counts and losses! ✅
The CORE Benchmark
While bpb measures how well the model compresses text, it doesn't tell us how well it understands language. Enter the CORE benchmark.
What is CORE?
CORE (Common-sense Reasoning Evaluation) is a comprehensive benchmark from the DCLM paper that evaluates models on 11 diverse tasks:
| Category | Tasks |
|---|---|
| Knowledge | MMLU (5-shot), HellaSwag, PIQA |
| Reading Comprehension | SQUAD, BoolQ, SciQ, ARC-Easy, ARC-Challenge |
| Common Sense | OpenBookQA, Winogrande |
| Problem Solving | SQUAD |
Why CORE?
Single-number metric:
CORE = average of centered accuracies across all 11 tasks
Where centered accuracy adjusts for random baseline:
centered = (accuracy - random_baseline) / (1.0 - random_baseline)
Benefits:
- ✅ Comprehensive coverage of language understanding
- ✅ Balances multiple skill types
- ✅ Accounts for task difficulty
- ✅ Enables comparison with published models
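A small sketch of the centering and averaging described above. The task names, accuracies and baselines are invented for illustration and are not actual CORE results.

```python
def centered_accuracy(accuracy: float, random_baseline: float) -> float:
    """Rescale accuracy so that chance level maps to 0.0 and perfect to 1.0."""
    return (accuracy - random_baseline) / (1.0 - random_baseline)

# (accuracy, random baseline) per hypothetical task, both as fractions
task_results = {
    "four_choice_task": (0.55, 0.25),
    "true_false_task": (0.70, 0.50),
    "ten_choice_task": (0.10, 0.10),  # exactly at chance level
}

centered = {
    name: centered_accuracy(acc, base)
    for name, (acc, base) in task_results.items()
}
core_metric = sum(centered.values()) / len(centered)

for name, value in centered.items():
    print(f"{name}: centered={value:.3f}")
print(f"aggregate: {core_metric:.3f}")
```

Raw accuracies of 55% and 70% land on the same centered score once their different chance levels are removed.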
Task Types
The CORE benchmark includes three task types:
1. Multiple Choice
Example from HellaSwag:
Context: "A person is climbing a rock wall."
Choices:
A) "The person reaches the top and waves."
B) "The person is eating a sandwich."
C) "The wall turns into water."
D) "The rock becomes a bird."
Gold: A
Evaluation method: Choose option with lowest perplexity.
2. Schema (Cloze Completion)
Example from Winogrande:
Context options:
- "The trophy doesn't fit in the suitcase because it is too large."
- "The trophy doesn't fit in the suitcase because it is too small."
Continuation: "it" refers to the trophy
Gold: Option 1
Evaluation method: Choose context with lowest perplexity for continuation.
3. Language Modeling
Example from SQUAD:
Context: "The Normans were originally people from..."
Continuation: "northern France"
Evaluation method: Check if model's greedy predictions match the continuation.
Implementation Deep-Dive
nanochat's CORE evaluation from nanochat/core_eval.py:
import random

import torch

@torch.no_grad()
def evaluate_example(idx, model, tokenizer, data, device, task_meta):
    """Evaluate a single example."""
    item = data[idx]
    task_type = task_meta['task_type']
    num_fewshot = task_meta['num_fewshot']
    continuation_delimiter = task_meta['continuation_delimiter']
    # Sample few-shot examples (deterministic based on idx)
    fewshot_examples = []
    if num_fewshot > 0:
        rng = random.Random(1234 + idx)
        available_indices = [i for i in range(len(data)) if i != idx]
        fewshot_indices = rng.sample(available_indices, num_fewshot)
        fewshot_examples = [data[i] for i in fewshot_indices]
    # Render prompts based on task type
    if task_type == 'multiple_choice':
        prompts = render_prompts_mc(item, continuation_delimiter, fewshot_examples)
        tokens, start_idxs, end_idxs = batch_sequences_mc(tokenizer, prompts)
    elif task_type == 'schema':
        prompts = render_prompts_schema(item, continuation_delimiter, fewshot_examples)
        tokens, start_idxs, end_idxs = batch_sequences_schema(tokenizer, prompts)
    elif task_type == 'language_modeling':
        prompts = render_prompts_lm(item, continuation_delimiter, fewshot_examples)
        tokens, start_idxs, end_idxs = batch_sequences_lm(tokenizer, prompts)
    # Forward model (pad_token_id is the padding token id, set outside this excerpt)
    input_ids = stack_sequences(tokens, pad_token_id).to(device)
    losses, predictions = forward_model(model, input_ids)
    # Evaluate correctness based on task type
    if task_type == 'language_modeling':
        # Position t predicts token t+1, hence the shift by one
        si, ei = start_idxs[0], end_idxs[0]
        predicted_tokens = predictions[0, si-1:ei-1]
        actual_tokens = input_ids[0, si:ei]
        is_correct = torch.all(predicted_tokens == actual_tokens).item()
    elif task_type in ['multiple_choice', 'schema']:
        # Pick the option whose continuation has the lowest mean loss
        mean_losses = [losses[i, si-1:ei-1].mean().item()
                       for i, (si, ei) in enumerate(zip(start_idxs, end_idxs))]
        pred_idx = mean_losses.index(min(mean_losses))
        is_correct = pred_idx == item['gold']
    return is_correct
Key Techniques
1. Few-Shot Learning
# Deterministic sampling based on example index
rng = random.Random(1234 + idx)
fewshot_indices = rng.sample(available_indices, num_fewshot)
This ensures:
- Same few-shot examples for the same test example across runs
- Different few-shot examples for different test examples
- Reproducible results
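The reproducibility property can be checked in isolation. The wrapper `sample_fewshot_indices` below is a hypothetical helper that repeats the seeding scheme shown above on a dummy dataset of 100 examples.

```python
import random

def sample_fewshot_indices(idx, num_examples, num_fewshot, seed=1234):
    """Pick few-shot indices for test example `idx`, never including idx itself."""
    rng = random.Random(seed + idx)  # seed depends only on the example index
    available = [i for i in range(num_examples) if i != idx]
    return rng.sample(available, num_fewshot)

first = sample_fewshot_indices(7, num_examples=100, num_fewshot=5)
second = sample_fewshot_indices(7, num_examples=100, num_fewshot=5)
other = sample_fewshot_indices(8, num_examples=100, num_fewshot=5)

print("example 7, run 1:", first)
print("example 7, run 2:", second)  # identical to run 1
print("example 8:       ", other)   # seeded differently
```

Because the seed is derived from the index and not from global random state, the order in which examples are evaluated (or which rank evaluates them) does not change the prompts.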
2. Prompt Rendering with Jinja2
# Multiple choice template
template_str = """
{%- for example in fewshot_examples -%}
{{ example.query }}{{ continuation_delimiter }}{{ example.choices[example.gold] }}
{% endfor -%}
{{ item.query }}{{ continuation_delimiter }}{{ choice }}"""
Example rendered prompt:
Question: What is the capital of France?
Answer: Paris
Question: What is 2+2?
Answer: 4
Question: What is the largest planet?
Answer: Jupiter
3. Common Prefix/Suffix Detection
For multiple choice:
def find_common_length(token_sequences, direction='left'):
    """Find length of common prefix (multiple choice) or suffix (schema)."""
    min_len = min(len(seq) for seq in token_sequences)
    indices = range(min_len) if direction == 'left' else range(-1, -min_len-1, -1)
    for i, idx in enumerate(indices):
        token = token_sequences[0][idx]
        if not all(seq[idx] == token for seq in token_sequences):
            return i
    return min_len
NOTE
Why this matters:
For multiple choice, all options share the same context:
Tokens:
Option A: [15, 42, 88, 91, 23, 56, 77] ← Answer starts at index 4
Option B: [15, 42, 88, 91, 99, 12, 34] ← Answer starts at index 4
Option C: [15, 42, 88, 91, 45, 67, 89] ← Answer starts at index 4
└─── common prefix ───┘
We only need to evaluate losses for the answer part (indices 4-6), not the shared context (indices 0-3).
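Running the prefix detection on the three token sequences from the diagram shows where the answer span begins. The function is repeated here only so that the sketch is self-contained.

```python
def find_common_length(token_sequences, direction="left"):
    """Length of the common prefix ('left') or common suffix ('right')."""
    min_len = min(len(seq) for seq in token_sequences)
    indices = range(min_len) if direction == "left" else range(-1, -min_len - 1, -1)
    for i, idx in enumerate(indices):
        token = token_sequences[0][idx]
        if not all(seq[idx] == token for seq in token_sequences):
            return i
    return min_len

options = [
    [15, 42, 88, 91, 23, 56, 77],  # Option A
    [15, 42, 88, 91, 99, 12, 34],  # Option B
    [15, 42, 88, 91, 45, 67, 89],  # Option C
]

answer_start = find_common_length(options, direction="left")
answer_tokens = [seq[answer_start:] for seq in options]
print("answer starts at index", answer_start)
print("answer tokens:", answer_tokens)
```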
4. Distributed Evaluation
def evaluate_task(model, tokenizer, data, device, task_meta):
    """Evaluate one task across all examples with distributed dispatch."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    correct = torch.zeros(len(data), dtype=torch.float32, device=device)
    # Each rank processes every Nth example
    for idx in range(rank, len(data), world_size):
        is_correct = evaluate_example(idx, model, tokenizer, data, device, task_meta)
        correct[idx] = float(is_correct)
    # Sync results across ranks (each index is written by exactly one rank)
    if world_size > 1:
        dist.barrier()
        dist.all_reduce(correct, op=dist.ReduceOp.SUM)
    mean_correct = correct.mean().item()
    return mean_correct
Strided access pattern (same as data loading), shown here for 8 ranks:
- Rank 0: Examples 0, 8, 16, 24, ...
- Rank 1: Examples 1, 9, 17, 25, ...
- etc.
This provides automatic load balancing and parallelizes evaluation across all GPUs.
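The strided assignment is easy to verify without any GPUs. In this sketch, `examples_for_rank` is a hypothetical helper, and the sizes (8 ranks, 30 examples) are chosen only for illustration.

```python
def examples_for_rank(rank: int, world_size: int, num_examples: int) -> list:
    """Indices a given rank evaluates under strided sharding."""
    return list(range(rank, num_examples, world_size))

world_size = 8
num_examples = 30
assignments = [examples_for_rank(r, world_size, num_examples) for r in range(world_size)]

for rank, idxs in enumerate(assignments):
    print(f"rank {rank}: {idxs}")

# Every example is covered exactly once, and per-rank loads differ by at most one.
covered = sorted(i for idxs in assignments for i in idxs)
loads = [len(idxs) for idxs in assignments]
```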
CORE Metric Calculation
From scripts/base_eval.py:
# Evaluate all tasks
results = {}
centered_results = {}
for task in tasks:
    # task_label, data and task_meta are derived from `task` (loading code omitted)
    accuracy = evaluate_task(model, tokenizer, data, device, task_meta)
    # Center by random baseline (stored as a percentage, hence the 0.01 factor)
    random_baseline = eval_metadata[task_label]["Random baseline"]
    centered = (accuracy - 0.01 * random_baseline) / (1.0 - 0.01 * random_baseline)
    results[task_label] = accuracy
    centered_results[task_label] = centered
# CORE metric = average of centered results
core_metric = sum(centered_results.values()) / len(centered_results)
TIP
Why center by random baseline?
Different tasks have different random baselines:
- 4-choice MC: 25% random accuracy
- True/False: 50% random accuracy
- 10-choice MC: 10% random accuracy
Centering normalizes to the scale:
centered = 0.0 → Random performance
centered = 1.0 → Perfect performance
This ensures all tasks contribute equally to the CORE metric.
Training Loop Integration
nanochat's training loop integrates both metrics from scripts/base_train.py:
for step in range(num_iterations + 1):
    last_step = step == num_iterations
    # Evaluate validation bpb
    if last_step or step % eval_every == 0:
        model.eval()
        val_loader = build_val_loader()
        eval_steps = eval_tokens // (device_batch_size * max_seq_len * ddp_world_size)
        with autocast_ctx:
            val_bpb = evaluate_bpb(model, val_loader, eval_steps, token_bytes)
        print0(f"Step {step:05d} | Validation bpb: {val_bpb:.4f}")
        wandb_run.log({
            "step": step,
            "val/bpb": val_bpb,
        })
        model.train()
    # Evaluate CORE metric
    if last_step or (step > 0 and step % core_metric_every == 0):
        model.eval()
        with autocast_ctx:
            results = evaluate_model(orig_model, tokenizer, device, max_per_task=500)
        print0(f"Step {step:05d} | CORE metric: {results['core_metric']:.4f}")
        wandb_run.log({
            "step": step,
            "core_metric": results["core_metric"],
            "centered_results": results["centered_results"],
        })
        model.train()
    # Training step
    # ...
Evaluation frequency:
- eval_every = 250: bpb evaluation every 250 steps
- core_metric_every = 2000: CORE evaluation every 2000 steps
Why different frequencies?
- bpb: Fast (~30 seconds), run frequently
- CORE: Slow (~15 minutes), run sparingly
Scaling Laws: Empirical Observations
With these evaluation tools, we can now explore scaling laws - the relationships between model size, data, compute, and performance.
Chinchilla Scaling Laws
The Chinchilla paper established that for optimal compute efficiency:
Optimal data-to-parameter ratio ≈ 20:1
nanochat's implementation from scripts/base_train.py:
# Training horizon specification
target_param_data_ratio = 20 # Chinchilla = 20
# Calculate training iterations
target_tokens = target_param_data_ratio * num_params
num_iterations = target_tokens // total_batch_size
Example for depth=20 model:
- Parameters: 83M
- Target tokens: 20 × 83M = 1.66B tokens
- Batch size: 524K tokens
- Iterations: 1.66B / 524K = 3,167 steps
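The arithmetic above can be reproduced in a few lines. This sketch assumes a parameter count of exactly 83,000,000 and a batch of exactly 524,288 tokens (2^19); with floor division the step count lands within a step or two of the rounded figure quoted above.

```python
target_param_data_ratio = 20   # tokens per parameter
num_params = 83_000_000        # assumed exact value for the "83M" model
total_batch_size = 524_288     # assumed exact value for "524K" tokens per step

target_tokens = target_param_data_ratio * num_params
num_iterations = target_tokens // total_batch_size

print(f"target tokens:  {target_tokens:,}")
print(f"num iterations: {num_iterations:,}")
```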
Observing Scaling Laws
Experiment setup:
Train models at different scales with Chinchilla-optimal data:
| Depth | Params | Data (20×) | val bpb | CORE |
|---|---|---|---|---|
| 12 | 30M | 600M | 1.75 | 0.25 |
| 16 | 54M | 1.08B | 1.58 | 0.32 |
| 20 | 83M | 1.66B | 1.45 | 0.38 |
| 24 | 118M | 2.36B | 1.35 | 0.43 |
| 28 | 158M | 3.16B | 1.28 | 0.47 |
Observations:
- Power law in loss:
bpb ∝ N^(-α)
where N = parameters, α ≈ 0.19 (fitted to the table above)
- Logarithmic in CORE:
CORE ∝ log(N)
- Smooth improvements: No sudden jumps, steady scaling
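One way to estimate the power-law exponent is a least-squares line through the table in log-log space. The sketch below uses only the five (params, val bpb) rows from the table and plain Python; it is an illustrative fit over a narrow range of scales, not a claim about how the numbers were produced.

```python
import math

# (parameters in millions, validation bpb) taken from the table above
params_m = [30, 54, 83, 118, 158]
val_bpb = [1.75, 1.58, 1.45, 1.35, 1.28]

xs = [math.log(n) for n in params_m]
ys = [math.log(b) for b in val_bpb]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)

slope = s_xy / s_xx   # d log(bpb) / d log(N)
alpha = -slope        # bpb ∝ N^(-alpha)
print(f"fitted alpha = {alpha:.3f}")
```

The unit of N (millions or raw counts) only shifts the intercept, so it does not affect the fitted exponent.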
Loss Curves
Characteristics of a typical training curve:
- Rapid initial improvement: First 500 steps see largest gains
- Logarithmic progress: Later improvements come more slowly
- Smooth convergence: No oscillations or instabilities
Compute-Optimal Frontier
Key insight: For a fixed compute budget C (in FLOPs), what's the optimal allocation?
C = 6 × N × D (approximate FLOPs for training)
where:
N = parameters
D = tokens
Chinchilla finding:
Optimal: N ∝ C^0.5
D ∝ C^0.5
This means:
- ✅ Double compute → √2× larger model trained on √2× more data
- ❌ NOT: Double compute → 2× larger model on same data
- ❌ NOT: Double compute → same model on 2× more data
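Under the approximation C = 6 × N × D and a fixed tokens-per-parameter ratio, the allocation has a closed form: C = 6 × ratio × N², so N = sqrt(C / (6 × ratio)). The sketch below assumes a ratio of 20; the function name is hypothetical.

```python
import math

def compute_optimal(flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (params, tokens) with D = tokens_per_param * N."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n1, d1 = compute_optimal(1e18)
n2, d2 = compute_optimal(2e18)  # double the budget

print(f"C=1e18: N={n1:.3e} params, D={d1:.3e} tokens")
print(f"C=2e18: N={n2:.3e} params, D={d2:.3e} tokens")
print(f"growth in N: {n2 / n1:.4f}x, growth in D: {d2 / d1:.4f}x")
```

Doubling the budget scales both the model and the data by the same factor of √2, which is the N ∝ C^0.5, D ∝ C^0.5 behaviour described above.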
nanochat's default (via target_param_data_ratio=20):
- Slightly past the compute-optimal point if the optimum for smaller models is closer to ~14:1, since 20:1 means more tokens per parameter, not fewer
- Reasonable for experimentation where you might continue training
Loss vs Compute (Scaling Law)
Kaplan et al. (2020) scaling law:
L(C) = (C_0 / C)^α
where:
L = validation loss (or bpb)
C = compute (FLOPs)
C_0 = constant (depends on architecture)
α ≈ 0.05-0.07 (exponent)
In practice:
| Compute (FLOPs, ≈ 6 × N × D) | Model Size | Data | Expected bpb |
|---|---|---|---|
| ~1.1e17 | 30M | 600M | 1.75 |
| ~8.3e17 | 83M | 1.66B | 1.45 |
| ~6.3e18 | 230M | 4.6B | 1.20 |
| ~4.8e19 | 630M | 12.6B | 1.00 |
NOTE
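Key insight: Loss falls slowly with compute. With α ≈ 0.05-0.07, a 10× increase in compute lowers the loss by only about 11-15%, and halving the loss takes a factor of 2^(1/α), roughly 2×10^4 to 10^6 times more compute.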
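The power law L(C) = (C_0 / C)^α can be inverted to ask how much extra compute a given loss reduction costs. The two helpers below are hypothetical names for that arithmetic; the exponents are the endpoints of the range quoted above.

```python
def compute_factor_to_halve_loss(alpha: float) -> float:
    """Compute multiplier k such that k^(-alpha) = 1/2."""
    return 2.0 ** (1.0 / alpha)

def loss_ratio_for_compute_factor(factor: float, alpha: float) -> float:
    """New loss divided by old loss after multiplying compute by `factor`."""
    return factor ** (-alpha)

for alpha in (0.05, 0.07):
    k = compute_factor_to_halve_loss(alpha)
    r = loss_ratio_for_compute_factor(10.0, alpha)
    print(f"alpha={alpha}: halving needs {k:.3g}x compute; "
          f"10x compute leaves {r:.3f} of the loss")
```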
Practical Insights from Evaluation
1. Early Stopping vs Overtraining
Observation: For fixed model size, loss continues improving with more data, but with diminishing returns.
# Training beyond Chinchilla ratio
target_param_data_ratio = 20 # Standard
# vs
target_param_data_ratio = 40 # 2× more data
Results (depth=20, 83M params):
| Data Ratio | Tokens | val bpb | CORE | Training Time |
|---|---|---|---|---|
| 10× | 830M | 1.52 | 0.35 | 30 min |
| 20× | 1.66B | 1.45 | 0.38 | 60 min |
| 40× | 3.32B | 1.42 | 0.39 | 120 min |
Insight: 20× is a good default. Beyond that, returns diminish significantly.
2. bpb vs CORE Correlation
Generally:
- Lower bpb → Higher CORE
- But correlation is imperfect
Example anomaly:
Model A: bpb=1.45, CORE=0.38
Model B: bpb=1.44, CORE=0.36
Model B has slightly better compression but worse reasoning. This can happen when:
- Model memorizes common phrases (lowers bpb)
- But doesn't learn compositional reasoning (lowers CORE)
TIP
Lesson: Use both metrics! bpb for training dynamics, CORE for capabilities.
3. Validation Set Size
eval_tokens = 20 * 524288 # ~10M tokens
Why 10M tokens?
- Large enough for stable estimates
- Small enough to evaluate quickly (~30s)
Stability analysis:
| eval_tokens | bpb mean | bpb std | Eval time |
|---|---|---|---|
| 1M | 1.447 | 0.023 | 3s |
| 5M | 1.451 | 0.008 | 15s |
| 10M | 1.450 | 0.004 | 30s |
| 50M | 1.450 | 0.002 | 150s |
Conclusion: 10M tokens provides sufficient precision with reasonable compute.
4. Learning Rate Warmdown
nanochat uses warmdown (gradual LR decay) in the final 20% of training:
warmdown_ratio = 0.2
final_lr_frac = 0.0
def get_lr_multiplier(it):
    warmdown_iters = round(warmdown_ratio * num_iterations)
    if it <= num_iterations - warmdown_iters:
        return 1.0
    else:
        progress = (num_iterations - it) / warmdown_iters
        return progress * 1.0 + (1 - progress) * final_lr_frac
Effect on validation loss:
Steps 0-2500: LR = 1.0 × base_lr, val_bpb slowly decreasing
Steps 2500-3167: LR = 1.0 → 0.0, val_bpb drops faster
Final val_bpb improvement: ~0.03 bpb (2% relative)
Why it works: As the step size shrinks, the optimizer stops bouncing around the minimum it has found and settles deeper into it.
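To see the shape of the schedule, the multiplier can be evaluated at a few steps. This sketch wraps the same logic in a hypothetical factory function so that it runs standalone, and assumes the 3,167-step run used in the example above.

```python
def make_lr_multiplier(num_iterations, warmdown_ratio=0.2, final_lr_frac=0.0):
    """Return a function mapping step -> LR multiplier (constant, then linear decay)."""
    warmdown_iters = round(warmdown_ratio * num_iterations)

    def get_lr_multiplier(it):
        if it <= num_iterations - warmdown_iters:
            return 1.0
        progress = (num_iterations - it) / warmdown_iters  # 1.0 -> 0.0
        return progress * 1.0 + (1 - progress) * final_lr_frac

    return get_lr_multiplier

get_lr = make_lr_multiplier(3167)
schedule = {it: get_lr(it) for it in (0, 2534, 2850, 3167)}
for it, mult in schedule.items():
    print(f"step {it:4d}: lr multiplier = {mult:.3f}")
```

With these settings the warmdown covers the last 633 steps, so the multiplier stays at 1.0 through step 2534 and reaches 0.0 at the final step.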
Advanced Topics
1. Bits Per Character vs Bits Per Byte
Some papers report "bits per character" (bpc):
# Bits per character
bpc = total_nats / (log(2) * num_characters)
Key difference:
- bpb: UTF-8 bytes (e.g., "世界" = 6 bytes)
- bpc: Unicode characters (e.g., "世界" = 2 characters)
For English text: bpc ≈ bpb (most characters = 1 byte)
For multilingual text: bpc > bpb (a multi-byte character counts once in bpc but several times in bpb, so the same total bits are divided among fewer units)
nanochat uses bpb because:
- More universal (bytes are always bytes)
- Easier to compute (no character counting)
- Standard in modern LLM papers
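The two normalizations differ only in the denominator, which a short sketch makes concrete. The total loss of 4.0 nats is a made-up value, and `bits_per_unit` is a hypothetical helper.

```python
import math

def bits_per_unit(total_nats: float, num_units: int) -> float:
    """Bits per byte or per character, depending on what num_units counts."""
    return total_nats / (math.log(2) * num_units)

total_nats = 4.0  # made-up summed loss for each text

results = {}
for text in ("hi", "世界"):
    n_chars = len(text)                  # Unicode code points
    n_bytes = len(text.encode("utf-8"))  # UTF-8 bytes
    results[text] = (
        bits_per_unit(total_nats, n_bytes),  # bpb
        bits_per_unit(total_nats, n_chars),  # bpc
    )
    bpb, bpc = results[text]
    print(f"{text!r}: {n_chars} chars, {n_bytes} bytes, bpb={bpb:.3f}, bpc={bpc:.3f}")
```

For the ASCII string the two values coincide; for the two three-byte characters, bpc is three times bpb.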
2. Perplexity
Some papers report perplexity instead of bpb:
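perplexity = exp(cross_entropy_loss)
Relationship to bpb (with cross_entropy_loss measured per byte, in nats):
bpb = cross_entropy_loss / log(2)
perplexity = 2^bpb
Note that this is a per-byte perplexity. The token-level perplexity usually reported in papers is exp of the per-token loss, which equals 2^(bpb × average bytes per token).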
Example:
- bpb = 1.45
- perplexity = 2^1.45 = 2.73
Interpretation: On average, the model is uncertain between ~2.7 likely next bytes.
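The conversion can be worked through numerically. In this sketch the average of 4 bytes per token is a hypothetical figure chosen for illustration; the real value depends on the tokenizer and the text.

```python
import math

bpb = 1.45
per_byte_perplexity = 2 ** bpb

# Hypothetical tokenizer statistic: average UTF-8 bytes per token.
bytes_per_token = 4.0
per_token_nats = bpb * math.log(2) * bytes_per_token  # per-token loss in nats
per_token_perplexity = math.exp(per_token_nats)

print(f"per-byte perplexity:  {per_byte_perplexity:.2f}")
print(f"per-token loss:       {per_token_nats:.3f} nats")
print(f"per-token perplexity: {per_token_perplexity:.1f}")
```

The same model therefore has a much larger token-level perplexity than byte-level perplexity, which is why perplexities are only comparable at the same unit of prediction.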
3. CORE vs Other Benchmarks
Comparison:
| Benchmark | Tasks | Coverage | Pros | Cons |
|---|---|---|---|---|
| CORE | 11 | Broad | Single metric, comprehensive | Less common than alternatives |
| HELM | 42 | Very broad | Industry standard | Slow, complex |
| Eleuther Eval Harness | 200+ | Extremely broad | Most comprehensive | Very slow |
| MMLU | 57 | Knowledge-focused | Standard, well-known | Narrow (only knowledge) |
nanochat chooses CORE for:
- ✅ Fast evaluation (~15 min on 8 GPUs)
- ✅ Good coverage without redundancy
- ✅ From recent, well-regarded paper (DCLM)
4. Curriculum Learning
Question: Should we change the data distribution during training?
# Potential curriculum strategy
if step < warmup_steps:
    data_loader = easy_data_loader # Simpler texts
else:
    data_loader = full_data_loader # All data
nanochat's choice: No curriculum (uniform distribution throughout)
Rationale:
- Simpler implementation
- Data is already shuffled (good enough)
- No clear evidence curriculum helps for LLMs at this scale
Debugging Training Runs
Common Issues
1. Loss diverges (NaN)
Step 100: loss=2.8
Step 101: loss=3.2
Step 102: loss=8.5
Step 103: loss=nan
WARNING
Likely causes:
- Learning rate too high
- Gradient clipping disabled/too high
- Numerical instability in forward pass
Solution:
grad_clip = 1.0 # Enable gradient clipping
matrix_lr = 0.01 # Reduce Muon LR
2. Loss plateaus early
Step 0-500: bpb 4.5 → 2.1
Step 500-3000: bpb 2.1 → 2.09
Likely causes:
- Learning rate too low
- Model too small for dataset
- Data quality issues
Solution:
- Increase learning rate
- Increase model depth
- Inspect data distribution
3. Train/val loss diverge
Step 2000: train_loss=1.4, val_bpb=1.8
Likely causes:
- Overfitting (model too large)
- Train/val distribution mismatch
- Bug in validation evaluation
Solution:
- Check dataset splits
- Verify eval implementation
- Reduce model size if overfitting
Conclusion
nanochat's evaluation framework provides the tools needed to understand and optimize language model training:
Bits per byte (bpb):
- ✅ Tokenization-agnostic metric
- ✅ Enables fair comparison across models
- ✅ Fast to evaluate (~30 seconds)
- ✅ Stable training signal
CORE benchmark:
- ✅ Comprehensive task coverage
- ✅ Single-number metric for model quality
- ✅ Normalized for fair comparison
- ✅ Parallelized for efficiency
Scaling laws:
- ✅ Power-law relationship between compute and performance
- ✅ Chinchilla ratio (20:1 data:params) as default
- ✅ Smooth, predictable improvements
Together, these tools enable:
- Principled hyperparameter selection
- Efficient compute allocation
- Early detection of training issues
- Reliable model comparison
The evaluation infrastructure is not an afterthought - it's fundamental to understanding what's happening during training. By implementing proper metrics and comprehensive benchmarks, nanochat demonstrates that even small-scale projects can employ the same rigorous evaluation methods used by frontier labs.
Related Posts
Track 1 Complete! 🎉 All Technical Deep-Dive posts:
- Post 1.1: The Muon Optimizer Explained
- Post 1.2: Distributed Muon - Custom Gradient Synchronization
- Post 1.3: KV Caching Deep-Dive
- Post 1.4: Modern Transformer Architecture
Coming next: Track 2 (Practical Guides) - Hands-on tutorials for building your own ChatGPT!
Further Reading
- Chinchilla paper: Training Compute-Optimal Large Language Models
- DCLM paper (CORE benchmark): DataComp-LM
- Kaplan scaling laws: Scaling Laws for Neural Language Models
- nanochat source code: GitHub