José David Baena

Loss Landscape & Scaling Laws: Understanding Training Dynamics


Cross-entropy loss tells you nothing about what your model knows

Early in my LLM training experiments, I spent hours comparing loss numbers between models with different tokenizers. Completely meaningless. Understanding bits-per-byte and Chinchilla scaling laws changed how I approach every training decision.

Your loss number is meaningless for comparison. Bits-per-byte normalizes across tokenizers—and Chinchilla showed us how to allocate compute.

TL;DR: Bits-per-byte normalizes across tokenizers. The CORE benchmark provides centered metrics (0 = random, 1 = perfect). Chinchilla scaling laws show that ~20× data-to-params is roughly compute-optimal. These three insights guide every training decision.

The benchmark that lied: Consider a scenario that happens more often than people admit: celebrating hitting 2.8 cross-entropy loss—lower than a baseline at 3.1—only to find the model is worse when deployed. The problem: a tokenizer change. A new tokenizer with 100K vocabulary (vs 50K in the baseline) produces fewer tokens per sentence. Lower loss, but not from a better model—from an easier prediction task. After converting to bits-per-byte, the reality emerges: the model is 0.15 bpb worse than baseline. Months of "progress" went backwards. Always normalize your metrics.

Training a language model is not just about running gradient descent until convergence. Understanding how the loss evolves, why certain hyperparameters work, and what scaling laws govern model performance can mean the difference between wasted compute and efficient training runs.

nanochat's training infrastructure includes evaluation mechanisms that go beyond simple loss tracking. It implements bits-per-byte (bpb)—a tokenization-agnostic metric—and the CORE benchmark—an 11-task evaluation suite. Together, these tools illuminate training dynamics and enable principled decision-making about model architecture, data requirements, and compute budgets.

This final post of Track 1 dissects the bits-per-byte metric, explains CORE benchmark mechanics, and shows how these tools reveal the fundamental scaling laws governing language model training.

Cross-entropy loss is tokenizer-dependent and incomparable

The standard training metric is cross-entropy loss:

loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))

This seems straightforward, but it has a critical flaw: the loss is not comparable across different tokenizers.

Why Cross-Entropy Loss is Tokenizer-Dependent

Consider two tokenizers:

Tokenizer A (vocab size = 10K):

  • "Hello world" → [1523, 892]
  • Average tokens per sentence: 10
  • Cross-entropy loss: 3.2

Tokenizer B (vocab size = 100K):

  • "Hello world" → [45123]
  • Average tokens per sentence: 5
  • Cross-entropy loss: 2.8

Which is better? You can't tell! Tokenizer B has lower loss, but it's also predicting from a 10× larger vocabulary.

The cross-entropy loss depends on:

  1. Vocabulary size: Larger vocab → higher entropy → higher loss
  2. Token granularity: Byte-level vs word-level tokenization
  3. Special tokens: How are they handled?

This makes it impossible to:

  • Compare models trained with different tokenizers
  • Experiment with vocabulary size
  • Compare against published baselines (which use different tokenizers)

In practice, reviewers will rightfully reject cross-tokenizer loss comparisons. Always convert to bpb before claiming your model beats a baseline.

Bits-per-byte normalizes loss across any tokenizer

nanochat uses bits-per-byte (bpb) - a tokenization-agnostic metric that normalizes loss by the actual byte content being predicted.

The Core Idea

Instead of computing loss per token, compute loss per byte:

bpb = total_nats / (log(2) × total_bytes)

Where:

  • total_nats: Sum of cross-entropy losses (in natural log units)
  • total_bytes: Total number of UTF-8 bytes in all target tokens
  • log(2): Conversion factor from nats to bits

Key insight: The byte content is tokenizer-independent. "Hello world" is always 11 bytes, regardless of how you tokenize it.
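To make that concrete, here is a tiny sketch with made-up per-token losses: two tokenizations of the same 11-byte string give the same bpb whenever the total nats (the total log-probability assigned to the text) match.

import math

total_bytes = len("Hello world".encode("utf-8"))  # 11 bytes, regardless of tokenizer

# Tokenizer A: 2 tokens; Tokenizer B: 1 token (losses are illustrative)
nats_a = 3.1 + 3.3   # total nats under tokenization A
nats_b = 6.4         # total nats under tokenization B

bpb_a = nats_a / (math.log(2) * total_bytes)
bpb_b = nats_b / (math.log(2) * total_bytes)
print(f"bpb A = {bpb_a:.3f}, bpb B = {bpb_b:.3f}")  # both ≈ 0.839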

Implementation

nanochat's implementation from nanochat/loss_eval.py:

nanochat/loss_eval.py
@torch.no_grad()
def evaluate_bpb(model, batches, steps, token_bytes):
    """
    Calculate bits per byte (bpb) - a tokenization-agnostic metric.
    
    Args:
        model: The model to evaluate
        batches: Iterator over (inputs, targets) batches
        steps: Number of evaluation steps
        token_bytes: 1D tensor mapping token_id → num_bytes (0 for special tokens)
    """
    total_nats = torch.tensor(0.0, dtype=torch.float32, device=model.get_device())
    total_bytes = torch.tensor(0, dtype=torch.int64, device=model.get_device())
    
    batch_iter = iter(batches)
    for _ in range(steps):
        x, y = next(batch_iter)
        
        # Get per-token losses (no reduction)
        loss2d = model(x, y, loss_reduction='none')  # (B, T)
        loss2d = loss2d.view(-1)  # flatten
        y = y.view(-1)  # flatten
        
        if (y < 0).any():
            # Handle ignore_index (e.g., -1 for padding)
            valid = y >= 0
            y_safe = torch.where(valid, y, torch.zeros_like(y))
            num_bytes2d = torch.where(
                valid,
                token_bytes[y_safe],
                torch.zeros_like(y, dtype=token_bytes.dtype)
            )
            total_nats += (loss2d * (num_bytes2d > 0)).sum()
            total_bytes += num_bytes2d.sum()
        else:
            # Fast path: no ignored targets
            num_bytes2d = token_bytes[y]
            total_nats += (loss2d * (num_bytes2d > 0)).sum()
            total_bytes += num_bytes2d.sum()
    
    # Sum across all distributed ranks
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    if world_size > 1:
        dist.all_reduce(total_nats, op=dist.ReduceOp.SUM)
        dist.all_reduce(total_bytes, op=dist.ReduceOp.SUM)
    
    # Convert to bits per byte
    total_nats = total_nats.item()
    total_bytes = total_bytes.item()
    bpb = total_nats / (math.log(2) * total_bytes)
    return bpb
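A typical call looks like the following sketch (build_val_loader and token_bytes are set up elsewhere in the training script, as the training-loop excerpt later shows; the exact wiring may differ):

model.eval()
val_loader = build_val_loader()      # yields (inputs, targets) batches
eval_steps = 100                     # number of batches to average over
val_bpb = evaluate_bpb(model, val_loader, eval_steps, token_bytes)
print(f"validation bpb: {val_bpb:.4f}")
model.train()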

Key Design Decisions

1. Per-Token Loss Calculation

loss2d = model(x, y, loss_reduction='none')  # (B, T)

We need per-token losses to weight them by token byte length. Standard reduction='mean' would lose this information.

2. Token Byte Lookup

num_bytes2d = token_bytes[y]

The token_bytes tensor is precomputed during tokenizer training (from scripts/tok_train.py):

scripts/tok_train.py - Token Byte Computation
# Generated by scripts/tok_train.py
token_bytes = torch.zeros(vocab_size, dtype=torch.int64)
for token_id in range(vocab_size):
    token_str = tokenizer.decode([token_id])
    token_bytes[token_id] = len(token_str.encode('utf-8'))
    
# Special tokens get 0 bytes (excluded from metric)
for special_token_id in special_token_ids:
    token_bytes[special_token_id] = 0

Example token bytes:

  • "hello" (5 characters) → 5 bytes
  • "世界" (2 characters) → 6 bytes (UTF-8)
  • "<|bos|>" (special) → 0 bytes (excluded)

3. Byte Weighting

total_nats += (loss2d * (num_bytes2d > 0)).sum()
total_bytes += num_bytes2d.sum()

  • Multiply each loss by whether its token contributes bytes (0 or 1)
  • Sum total bytes separately
  • This excludes special tokens (0 bytes) from both numerator and denominator

4. Distributed Reduction

if world_size > 1:
    dist.all_reduce(total_nats, op=dist.ReduceOp.SUM)
    dist.all_reduce(total_bytes, op=dist.ReduceOp.SUM)

Each rank evaluates a different slice of the data (strided sharding), so we sum nats and bytes across all ranks before computing the final ratio.

Bits Per Byte Calculator

[Interactive widget in the original post: converts between loss (bits/token), perplexity, bits per byte, and compression ratio vs raw bytes.]

Reference bpb values (from the widget's comparison chart):

  • Random guess (32k vocab): 3.33
  • GPT-2 (WebText): 0.93
  • GPT-3 (1-shot): 0.76
  • Human (Shannon game): 0.70
  • Theoretical limit (English): 0.40

Conversion formulas:

bits = nats / ln(2) ≈ nats × 1.443
perplexity = e^(loss_nats)
bpb = bits_per_token / avg_bytes_per_token
compression_ratio = 8 / bpb

bpb Interpretation

What does a bpb value mean?

bpb = 1.5  →  On average, the model uses 1.5 bits per byte of text

Theoretical limits:

| Compression | bpb | Interpretation |
| --- | --- | --- |
| Perfect | 0.0 | Model predicts everything perfectly |
| gzip | ~4.5 | Classical compression algorithm |
| Random | 8.0 | No compression (1 byte = 8 bits) |

Typical model performance:

| Model | bpb | Context |
| --- | --- | --- |
| GPT-2 small | ~1.8 | 124M params, pre-2020 architecture |
| nanochat d20 | ~1.45 | ~561M params, modern architecture |
| GPT-3 175B | ~0.8 | Large-scale model |
| GPT-4 | ~0.6 | State-of-the-art (estimated) |

For your training runs, this means: always report bpb alongside loss. A bpb of 1.45 tells you something meaningful—your model compresses text roughly 3× better than gzip.

For your model comparisons, this means: bpb is the great equalizer. When comparing your model to published results with different tokenizers, bpb is the only number that matters.

bpb vs Cross-Entropy Comparison

Cross-entropy loss:

# Simple average - tokenizer dependent
loss = sum(cross_entropy(pred, target)) / num_tokens
# Example: 2.8 (what does this mean?)

Bits per byte:

# Byte-normalized - tokenizer independent
bpb = sum(cross_entropy(pred, target)) / (log(2) × num_bytes)
# Example: 1.45 bits/byte (comparable across tokenizers!)

Concrete example:

Text: "The quick brown fox jumps"

| Tokenizer | Tokens | Cross-entropy (nats/token) | Bytes | bpb |
| --- | --- | --- | --- | --- |
| GPT-4 (100K) | 5 | 3.36 | 25 | ~0.97 |
| GPT-2 (50K) | 6 | 2.80 | 25 | ~0.97 |
| Character-level (256) | 25 | 0.67 | 25 | ~0.97 |

All three have the same bpb despite different token counts and losses—that's the power of normalization. ✅
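If you want to check the arithmetic, here is a short sketch using the illustrative losses from the table (note that the coarser the tokenization, the higher the per-token loss, because each token carries more bytes):

import math

text_bytes = len("The quick brown fox jumps".encode("utf-8"))  # 25 bytes
rows = {
    "GPT-4 (100K)":          (5, 3.36),   # (num_tokens, mean loss in nats)
    "GPT-2 (50K)":           (6, 2.80),
    "Character-level (256)": (25, 0.67),
}
for name, (num_tokens, mean_loss) in rows.items():
    bpb = (mean_loss * num_tokens) / (math.log(2) * text_bytes)
    print(f"{name}: bpb ≈ {bpb:.2f}")   # all ≈ 0.97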

CORE benchmark provides centered accuracy metrics

While bpb measures how well the model compresses text, it doesn't tell us how well it understands language. Enter the CORE benchmark.

What is CORE?

CORE (Common-sense Reasoning Evaluation) is a comprehensive benchmark from the DCLM paper that evaluates models on 11 diverse tasks:

| Category | Tasks |
| --- | --- |
| Knowledge | MMLU (5-shot), HellaSwag, PIQA |
| Reading Comprehension | SQUAD, BoolQ, SciQ, ARC-Easy, ARC-Challenge |
| Common Sense | OpenBookQA, Winogrande |
| Problem Solving | SQUAD |

Why CORE?

Single-number metric:

CORE = average of centered accuracies across all 11 tasks

Where centered accuracy adjusts for random baseline:

centered = (accuracy - random_baseline) / (1.0 - random_baseline)

Benefits:

  • ✅ Comprehensive coverage of language understanding
  • ✅ Balances multiple skill types
  • ✅ Accounts for task difficulty
  • ✅ Enables comparison with published models

Evaluation Metrics Dashboard

[Interactive widget in the original post: per-benchmark scores for your model, with reference averages such as GPT-2 (124M) at 38.7% and LLaMA-7B at 57.1%.]

Benchmark Details

  • HellaSwag: Commonsense reasoning about physical situations. Tests if models can complete sentences about everyday scenarios.
  • ARC: Grade-school science questions. Tests basic scientific knowledge and reasoning.
  • Winogrande: Pronoun resolution requiring world knowledge. Tests understanding of context and references.
  • PIQA: Physical intuition about how objects interact. Tests common-sense physics understanding.
  • MMLU: Multi-domain multiple choice covering 57 subjects. Tests broad academic knowledge.
  • TruthfulQA: Tests whether models avoid common misconceptions and falsehoods humans believe.

Task Types

The CORE benchmark includes three task types:

1. Multiple Choice

Example from HellaSwag:

Context: "A person is climbing a rock wall."
Choices:
  A) "The person reaches the top and waves."
  B) "The person is eating a sandwich."
  C) "The wall turns into water."
  D) "The rock becomes a bird."
Gold: A

Evaluation method: Choose option with lowest perplexity.

2. Schema (Cloze Completion)

Example from Winogrande:

Context options:
  - "The trophy doesn't fit in the suitcase because it is too large."
  - "The trophy doesn't fit in the suitcase because it is too small."
Continuation: "it" refers to the trophy
Gold: Option 1

Evaluation method: Choose context with lowest perplexity for continuation.

3. Language Modeling

Example from SQUAD:

Context: "The Normans were originally people from..."
Continuation: "northern France"

Evaluation method: Check if model's greedy predictions match the continuation.

Implementation Deep-Dive

nanochat's CORE evaluation from nanochat/core_eval.py:

nanochat/core_eval.py
@torch.no_grad()
def evaluate_example(idx, model, tokenizer, data, device, task_meta):
    """Evaluate a single example."""
    item = data[idx]
    task_type = task_meta['task_type']
    num_fewshot = task_meta['num_fewshot']
    
    # Sample few-shot examples (deterministic based on idx)
    fewshot_examples = []
    if num_fewshot > 0:
        rng = random.Random(1234 + idx)
        available_indices = [i for i in range(len(data)) if i != idx]
        fewshot_indices = rng.sample(available_indices, num_fewshot)
        fewshot_examples = [data[i] for i in fewshot_indices]
    
    # Render prompts based on task type
    if task_type == 'multiple_choice':
        prompts = render_prompts_mc(item, continuation_delimiter, fewshot_examples)
        tokens, start_idxs, end_idxs = batch_sequences_mc(tokenizer, prompts)
    elif task_type == 'schema':
        prompts = render_prompts_schema(item, continuation_delimiter, fewshot_examples)
        tokens, start_idxs, end_idxs = batch_sequences_schema(tokenizer, prompts)
    elif task_type == 'language_modeling':
        prompts = render_prompts_lm(item, continuation_delimiter, fewshot_examples)
        tokens, start_idxs, end_idxs = batch_sequences_lm(tokenizer, prompts)
    
    # Forward model
    input_ids = stack_sequences(tokens, pad_token_id).to(device)
    losses, predictions = forward_model(model, input_ids)
    
    # Evaluate correctness based on task type
    if task_type == 'language_modeling':
        si, ei = start_idxs[0], end_idxs[0]
        predicted_tokens = predictions[0, si-1:ei-1]
        actual_tokens = input_ids[0, si:ei]
        is_correct = torch.all(predicted_tokens == actual_tokens).item()
    elif task_type in ['multiple_choice', 'schema']:
        mean_losses = [losses[i, si-1:ei-1].mean().item()
                       for i, (si, ei) in enumerate(zip(start_idxs, end_idxs))]
        pred_idx = mean_losses.index(min(mean_losses))
        is_correct = pred_idx == item['gold']
    
    return is_correct

Key Techniques

1. Few-Shot Learning

# Deterministic sampling based on example index
rng = random.Random(1234 + idx)
fewshot_indices = rng.sample(available_indices, num_fewshot)

This ensures:

  • Same few-shot examples for the same test example across runs
  • Different few-shot examples for different test examples
  • Reproducible results
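A standalone sketch of this determinism (illustrative dataset size and shot count):

import random

def sample_fewshot(idx, dataset_size, num_fewshot):
    rng = random.Random(1234 + idx)
    available = [i for i in range(dataset_size) if i != idx]
    return rng.sample(available, num_fewshot)

print(sample_fewshot(7, 1000, 5))  # same output on every run
print(sample_fewshot(7, 1000, 5))  # identical to the line above
print(sample_fewshot(8, 1000, 5))  # a different few-shot set for a different example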

2. Prompt Rendering with Jinja2

# Multiple choice template
template_str = """
{%- for example in fewshot_examples -%}
{{ example.query }}{{ continuation_delimiter }}{{ example.choices[example.gold] }}
 
{% endfor -%}
{{ item.query }}{{ continuation_delimiter }}{{ choice }}"""

Example rendered prompt:

Question: What is the capital of France?
Answer: Paris

Question: What is 2+2?
Answer: 4

Question: What is the largest planet?
Answer: Jupiter
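A minimal rendering sketch with Jinja2 and made-up example data (the real harness builds these structures from the task files; only one few-shot example is shown for brevity):

from jinja2 import Template

template_str = (
    "{%- for example in fewshot_examples -%}"
    "{{ example.query }}{{ continuation_delimiter }}{{ example.choices[example.gold] }}\n\n"
    "{% endfor -%}"
    "{{ item.query }}{{ continuation_delimiter }}{{ choice }}"
)

fewshot_examples = [
    {"query": "Question: What is the capital of France?", "choices": ["Paris", "London"], "gold": 0},
]
item = {"query": "Question: What is the largest planet?"}

prompt = Template(template_str).render(
    fewshot_examples=fewshot_examples,
    continuation_delimiter="\nAnswer: ",
    item=item,
    choice="Jupiter",
)
print(prompt)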

3. Common Prefix/Suffix Detection

For multiple choice:

def find_common_length(token_sequences, direction='left'):
    """Find length of common prefix (multiple choice) or suffix (schema)."""
    min_len = min(len(seq) for seq in token_sequences)
    indices = range(min_len) if direction == 'left' else range(-1, -min_len-1, -1)
    
    for i, idx in enumerate(indices):
        token = token_sequences[0][idx]
        if not all(seq[idx] == token for seq in token_sequences):
            return i
    return min_len

Why this matters: For multiple choice, all options share the same context. In the token sequences, options A, B, and C might all share [15, 42, 88, 91] as a common prefix, with only the answer portion (indices 4-7) differing. We only need to evaluate losses for the answer part, not the shared context—reducing computation by 50-80%.
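Using the toy token sequences from that example, the helper above returns the shared prefix length:

option_a = [15, 42, 88, 91, 7, 3]
option_b = [15, 42, 88, 91, 12, 9, 4]
option_c = [15, 42, 88, 91, 5]

prefix_len = find_common_length([option_a, option_b, option_c], direction='left')
print(prefix_len)  # 4 → only tokens from index 4 onward need to be scored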

4. Distributed Evaluation

def evaluate_task(model, tokenizer, data, device, task_meta):
    """Evaluate one task across all examples with distributed dispatch."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    
    correct = torch.zeros(len(data), dtype=torch.float32, device=device)
    
    # Each rank processes every Nth example
    for idx in range(rank, len(data), world_size):
        is_correct = evaluate_example(idx, model, tokenizer, data, device, task_meta)
        correct[idx] = float(is_correct)
    
    # Sync results across ranks
    if world_size > 1:
        dist.barrier()
        dist.all_reduce(correct, op=dist.ReduceOp.SUM)
    
    mean_correct = correct.mean().item()
    return mean_correct

Strided access pattern (same as data loading):

  • Rank 0: Examples 0, 8, 16, 24, ...
  • Rank 1: Examples 1, 9, 17, 25, ...
  • etc.

This provides automatic load balancing and parallelizes evaluation across all GPUs.

For your evaluation infrastructure, this means: reuse the same strided pattern from training. When data loading and evaluation use identical sharding, you eliminate an entire category of distribution bugs.

CORE Metric Calculation

From scripts/base_eval.py:

scripts/base_eval.py - CORE Metric
# Evaluate all tasks
results = {}
centered_results = {}
for task_label, (data, task_meta) in tasks.items():  # excerpt simplified: tasks maps label -> (dataset, metadata)
    accuracy = evaluate_task(model, tokenizer, data, device, task_meta)

    # Center by the task's random baseline (stored as a percentage, e.g. 25.0)
    random_baseline = eval_metadata[task_label]["Random baseline"]
    centered = (accuracy - 0.01 * random_baseline) / (1.0 - 0.01 * random_baseline)

    results[task_label] = accuracy
    centered_results[task_label] = centered
 
# CORE metric = average of centered results
core_metric = sum(centered_results.values()) / len(centered_results)

Why center by random baseline? Different tasks have different random baselines: 4-choice MC has 25% random accuracy, True/False has 50%, and 10-choice MC has 10%. Centering normalizes to the scale where 0.0 = random performance and 1.0 = perfect performance. This ensures all tasks contribute equally to the CORE metric.
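As a worked example (a minimal sketch; the script above stores baselines as percentages, while this helper takes fractions):

def centered(accuracy, random_baseline):
    return (accuracy - random_baseline) / (1.0 - random_baseline)

print(centered(0.40, 0.25))  # 0.20 → a 4-choice task at 40% is 20% of the way from random to perfect
print(centered(0.70, 0.50))  # 0.40 → a true/false task at 70% is 40% of the way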

Evaluation integrates into the training loop in 50 lines

nanochat's training loop integrates both metrics from scripts/base_train.py:

scripts/base_train.py - Evaluation Integration
for step in range(num_iterations + 1):
    
    # Evaluate validation bpb
    if last_step or step % eval_every == 0:
        model.eval()
        val_loader = build_val_loader()
        eval_steps = eval_tokens // (device_batch_size * max_seq_len * ddp_world_size)
        
        with autocast_ctx:
            val_bpb = evaluate_bpb(model, val_loader, eval_steps, token_bytes)
        
        print0(f"Step {step:05d} | Validation bpb: {val_bpb:.4f}")
        wandb_run.log({
            "step": step,
            "val/bpb": val_bpb,
        })
        model.train()
    
    # Evaluate CORE metric
    if last_step or (step > 0 and step % core_metric_every == 0):
        model.eval()
        with autocast_ctx:
            results = evaluate_model(orig_model, tokenizer, device, max_per_task=500)
        
        print0(f"Step {step:05d} | CORE metric: {results['core_metric']:.4f}")
        wandb_run.log({
            "step": step,
            "core_metric": results["core_metric"],
            "centered_results": results["centered_results"],
        })
        model.train()
    
    # Training step
    # ...

Evaluation frequency:

  • eval_every = 250: bpb evaluation every 250 steps
  • core_metric_every = 2000: CORE evaluation every 2000 steps

Why different frequencies?

  • bpb: Fast (~30 seconds), run frequently
  • CORE: Slow (~15 minutes), run sparingly

Chinchilla scaling laws show ~20× data-to-params is compute-optimal

With these evaluation tools, we can now explore scaling laws - the relationships between model size, data, compute, and performance.

Chinchilla Scaling Laws

The Chinchilla paper established that for optimal compute efficiency:

Optimal data-to-parameter ratio ≈ 20:1

nanochat's implementation from scripts/base_train.py:

scripts/base_train.py - Chinchilla Ratio
# Training horizon specification
target_param_data_ratio = 20  # Chinchilla = 20
 
# Calculate training iterations
target_tokens = target_param_data_ratio * num_params
num_iterations = target_tokens // total_batch_size

Example for depth=20 model (actual nanochat):

  • Parameters: ~561M (from speedrun.sh)
  • Target tokens: 20 × 561M = ~11.2B tokens
  • Batch size: 524K tokens
  • Iterations: 11.2B / 524K ≈ 21,400 steps
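A quick back-of-the-envelope check of those numbers:

num_params = 561e6
total_batch_size = 524288                 # tokens per optimization step
target_tokens = 20 * num_params           # ≈ 1.12e10 tokens
num_iterations = int(target_tokens // total_batch_size)
print(f"{target_tokens:.3g} tokens → {num_iterations} steps")  # ~11.2B tokens, ~21,400 steps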

Scaling Laws Calculator

[Interactive widget in the original post: enter a compute budget and get the Chinchilla-optimal allocation, plus a loss-vs-model-size chart at different token ratios. For example, a 6.0e18 FLOP budget works out to roughly a 0.22B-parameter model trained on ~4.5B tokens (~20× params).]

Chinchilla Insight

Most models before Chinchilla were under-trained (too few tokens for their size). The optimal ratio is ~20 tokens per parameter.

Compute Formula

C ≈ 6 × N × D, where C = FLOPs, N = parameters, D = tokens. This estimates total training compute: roughly 6 FLOPs per parameter per token for the forward and backward passes combined.
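A minimal sketch of that allocation rule, assuming the simple C ≈ 6·N·D approximation with D = ratio·N (real Chinchilla fits use a full parametric loss model):

import math

def chinchilla_split(flops, ratio=20):
    n_params = math.sqrt(flops / (6 * ratio))   # C = 6 * N * (ratio * N) = 6 * ratio * N^2
    n_tokens = ratio * n_params
    return n_params, n_tokens

n, d = chinchilla_split(6.0e18)
print(f"{n/1e9:.2f}B params, {d/1e9:.1f}B tokens")  # ≈ 0.22B params, 4.5B tokens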

Practical Guidance

  • For inference: Prefer smaller, well-trained models (lower serving cost)
  • For research: Training beyond optimal can still help; diminishing returns
  • Data-limited? Use a smaller model to avoid overfitting
  • Compute-limited? Follow the optimal ratio closely

Observing Scaling Laws

Experiment setup:

Train models at different scales with Chinchilla-optimal data:

Note: The table below uses illustrative parameter counts to demonstrate scaling principles. Actual nanochat model sizes differ (e.g., d20 = ~561M parameters per Karpathy's walkthrough).

| Depth | Params (illustrative) | Data (20×) | val bpb | CORE |
| --- | --- | --- | --- | --- |
| 12 | 30M | 600M | 1.75 | 0.25 |
| 16 | 54M | 1.08B | 1.58 | 0.32 |
| 20 | 83M | 1.66B | 1.45 | 0.38 |
| 24 | 118M | 2.36B | 1.35 | 0.43 |
| 28 | 158M | 3.16B | 1.28 | 0.47 |

Observations:

  1. Power law in loss: bpb ∝ N^(-α), where N = parameters and α ≈ 0.05
  2. Logarithmic in CORE: CORE ∝ log(N)
  3. Smooth improvements: no sudden jumps, steady scaling

Loss Curves

Typical training curve (shown as a diagram in the original post):

Characteristics:

  • Rapid initial improvement: First 500 steps see largest gains
  • Logarithmic progress: Later improvements come more slowly
  • Smooth convergence: No oscillations or instabilities

Compute-Optimal Frontier

Key insight: For a fixed compute budget C (in FLOPs), what's the optimal allocation?

C = 6 × N × D  (approximate FLOPs for training)
where:
  N = parameters
  D = tokens

Chinchilla finding:

Optimal: N ∝ C^0.5
         D ∝ C^0.5

This means:

  • Double compute → √2× larger model trained on √2× more data
  • NOT: Double compute → 2× larger model on same data
  • NOT: Double compute → same model on 2× more data

nanochat's default (via target_param_data_ratio=20):

  • Trains on slightly more data per parameter than some fits suggest for smaller models (~14:1), i.e. mildly overtrained rather than undertrained relative to the strict compute-optimal point
  • Reasonable for experimentation: being in the right ballpark matters far more than hitting the exact optimum

Loss vs Compute (Scaling Law)

Kaplan et al. (2020) scaling law:

L(C) = (C_0 / C)^α
where:
  L = validation loss (or bpb)
  C = compute (FLOPs)
  C_0 = constant (depends on architecture)
  α ≈ 0.05-0.07 (exponent)

In practice:

| Compute (FLOPs, ≈6·N·D) | Model Size | Data | Expected bpb |
| --- | --- | --- | --- |
| ~1e17 | 30M | 600M | 1.75 |
| ~8e17 | 83M | 1.66B | 1.45 |
| ~6e18 | 230M | 4.6B | 1.20 |
| ~5e19 | 630M | 12.6B | 1.00 |

Key insight: gains come slowly with compute. In the table above, each ~10× increase in compute shaves off only a few tenths of a bpb; with an exponent around 0.05, halving the loss takes several orders of magnitude more compute. This is why training frontier models requires such massive investment.

These evaluation patterns reveal training dynamics

1. Early Stopping vs Overtraining

Observation: For fixed model size, loss continues improving with more data, but with diminishing returns.

# Training beyond Chinchilla ratio
target_param_data_ratio = 20  # Standard
# vs
target_param_data_ratio = 40  # 2× more data

Results (depth=20, using the illustrative ~83M-parameter configuration from the scaling table above):

| Data Ratio | Tokens | val bpb | CORE | Training Time |
| --- | --- | --- | --- | --- |
| 10× | 830M | 1.52 | 0.35 | 30 min |
| 20× | 1.66B | 1.45 | 0.38 | 60 min |
| 40× | 3.32B | 1.42 | 0.39 | 120 min |

Insight: 20× is a good default. Beyond that, returns diminish significantly.

2. bpb vs CORE Correlation

Generally:

  • Lower bpb → Higher CORE
  • But correlation is imperfect

Example anomaly:

Model A: bpb=1.45, CORE=0.38
Model B: bpb=1.44, CORE=0.36

Model B has slightly better compression but worse reasoning. This can happen when:

  • Model memorizes common phrases (lowers bpb)
  • But doesn't learn compositional reasoning (lowers CORE)

Lesson: Use both metrics. bpb tells you about training dynamics and compression efficiency; CORE tells you about actual capabilities and reasoning quality.

3. Validation Set Size

eval_tokens = 20 * 524288  # ~10M tokens

Why 10M tokens?

  • Large enough for stable estimates
  • Small enough to evaluate quickly (~30s)

Stability analysis:

| eval_tokens | bpb mean | bpb std | Eval time |
| --- | --- | --- | --- |
| 1M | 1.447 | 0.02 | 3s |
| 5M | 1.451 | 0.008 | 15s |
| 10M | 1.450 | 0.004 | 30s |
| 50M | 1.450 | 0.002 | 150s |

Conclusion: 10M tokens provides sufficient precision with reasonable compute.

4. Learning Rate Warmdown

nanochat uses warmdown (gradual LR decay) in the final 20% of training:

warmdown_ratio = 0.2
final_lr_frac = 0.0
 
def get_lr_multiplier(it):
    warmdown_iters = round(warmdown_ratio * num_iterations)
    if it <= num_iterations - warmdown_iters:
        return 1.0
    else:
        progress = (num_iterations - it) / warmdown_iters
        return progress * 1.0 + (1 - progress) * final_lr_frac
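As a quick check, here is the multiplier at a few steps of an illustrative 3,167-step run (a standalone sketch; get_lr_multiplier reads the module-level num_iterations):

num_iterations = 3167
for it in [0, 1000, 2500, 2850, 3167]:
    print(it, round(get_lr_multiplier(it), 3))
# 0 1.0 | 1000 1.0 | 2500 1.0 | 2850 0.501 | 3167 0.0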

Effect on validation loss:

Steps 0-2500:    LR = 1.0 × base_lr,  val_bpb slowly decreasing
Steps 2500-3167: LR = 1.0 → 0.0,      val_bpb drops faster

Final val_bpb improvement: ~0.03 bpb (2% relative)

Why it works: Model fine-tunes to a sharper minimum in later training.

Advanced: alternative metrics, benchmarks, and curriculum learning

1. Bits Per Character vs Bits Per Byte

Some papers report "bits per character" (bpc):

# Bits per character
bpc = total_nats / (log(2) * num_characters)

Key difference:

  • bpb: UTF-8 bytes (e.g., "世界" = 6 bytes)
  • bpc: Unicode characters (e.g., "世界" = 2 characters)

For English text: bpc ≈ bpb (most characters = 1 byte)
For multilingual text: bpc < bpb (multi-byte characters)

nanochat uses bpb because:

  • More universal (bytes are always bytes)
  • Easier to compute (no character counting)
  • Standard in modern LLM papers

2. Perplexity

Some papers report perplexity instead of bpb:

perplexity = exp(cross_entropy_loss)

Relationship to bpb:

bits_per_token = loss_nats / ln(2)
bpb = bits_per_token / avg_bytes_per_token
per_byte_perplexity = 2^bpb

Example:

  • bpb = 1.45
  • per-byte perplexity = 2^1.45 ≈ 2.73

Interpretation: On average, the model is about as uncertain as if it were choosing among ~2.7 equally likely next bytes.
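A small conversion helper, assuming you know (or estimate) your tokenizer's average bytes per token:

import math

def to_bpb(loss_nats_per_token, avg_bytes_per_token):
    return loss_nats_per_token / (math.log(2) * avg_bytes_per_token)

loss = 2.8           # nats per token
avg_bytes = 4.0      # illustrative: ~4 UTF-8 bytes per token
bpb = to_bpb(loss, avg_bytes)
print(f"bpb = {bpb:.2f}, per-byte perplexity = {2 ** bpb:.2f}")  # bpb ≈ 1.01, ppl ≈ 2.01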

3. CORE vs Other Benchmarks

Comparison:

BenchmarkTasksCoverageProsCons
CORE11BroadSingle metric, comprehensiveLess common than alternatives
HELM42Very broadIndustry standardSlow, complex
Eleuther Eval Harness200+Extremely broadMost comprehensiveVery slow
MMLU57Knowledge-focusedStandard, well-knownNarrow (only knowledge)

nanochat chooses CORE for:

  • ✅ Fast evaluation (~15 min on 8 GPUs)
  • ✅ Good coverage without redundancy
  • ✅ From recent, well-regarded paper (DCLM)

4. Curriculum Learning

Question: Should we change the data distribution during training?

# Potential curriculum strategy
if step < warmup_steps:
    data_loader = easy_data_loader  # Simpler texts
else:
    data_loader = full_data_loader  # All data

nanochat's choice: No curriculum (uniform distribution throughout)

Rationale:

  • Simpler implementation
  • Data is already shuffled (good enough)
  • No clear evidence curriculum helps for LLMs at this scale

Debugging training runs with evaluation signals

Common Issues

1. Loss diverges (NaN)

Step 100: loss=2.8
Step 101: loss=3.2
Step 102: loss=8.5
Step 103: loss=nan

Likely causes: Learning rate too high, gradient clipping disabled or set too high, or numerical instability in the forward pass.

Solution:

grad_clip = 1.0  # Enable gradient clipping
matrix_lr = 0.01  # Reduce Muon LR
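For reference, a generic PyTorch training-step sketch with gradient clipping (model, optimizer, and the batch are placeholders; this is not nanochat's exact loop):

import torch

loss = model(x, y)                  # forward pass returning a scalar loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap the global grad norm
optimizer.step()
optimizer.zero_grad(set_to_none=True)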

2. Loss plateaus early

Step 0-500: bpb 4.5 → 2.1
Step 500-3000: bpb 2.1 → 2.09

Likely causes:

  • Learning rate too low
  • Model too small for dataset
  • Data quality issues

Solution:

  • Increase learning rate
  • Increase model depth
  • Inspect data distribution

3. Train/val loss diverge

Step 2000: train_loss=1.4, val_bpb=1.8

Likely causes:

  • Overfitting (model too large)
  • Train/val distribution mismatch
  • Bug in validation evaluation

Solution:

  • Check dataset splits
  • Verify eval implementation
  • Reduce model size if overfitting

Evaluation infrastructure isn't an afterthought—it's fundamental

nanochat's evaluation framework provides the tools needed to understand and optimize language model training:

Bits per byte (bpb):

  • ✅ Tokenization-agnostic metric
  • ✅ Enables fair comparison across models
  • ✅ Fast to evaluate (~30 seconds)
  • ✅ Stable training signal

CORE benchmark:

  • ✅ Comprehensive task coverage
  • ✅ Single-number metric for model quality
  • ✅ Normalized for fair comparison
  • ✅ Parallelized for efficiency

Scaling laws:

  • ✅ Power-law relationship between compute and performance
  • ✅ Chinchilla ratio (20:1 data:params) as default
  • ✅ Smooth, predictable improvements

Together, these tools enable:

  • Principled hyperparameter selection
  • Efficient compute allocation
  • Early detection of training issues
  • Reliable model comparison

The evaluation infrastructure is not an afterthought - it's fundamental to understanding what's happening during training. By implementing proper metrics and comprehensive benchmarks, nanochat demonstrates that even small-scale projects can employ the same rigorous evaluation methods used by frontier labs.

Training without evaluation is just generating heat. Measure everything.


Track 1 Complete! 🎉

Coming next: Track 2 (Practical Guides) - Hands-on tutorials for building your own ChatGPT!



Before you interpret your training metrics:

  1. Use bits-per-byte, not raw loss. Tokenizer vocabulary size affects loss magnitude—bpb gives you a vocabulary-agnostic metric.
  2. Compute your Chinchilla ratio. Divide training tokens by parameter count—below 20:1 means you're undertrained; above 40:1 shows diminishing returns.
  3. Validate on 10M+ tokens. Smaller validation sets give noisy bpb estimates—10M tokens gives ±0.004 bpb precision.
  4. Track both bpb and CORE. Lower bpb doesn't guarantee better reasoning—run task benchmarks alongside perplexity.
  5. Log evaluation metrics every 500 steps. Catching training divergence at step 1000 saves hours versus discovering at step 5000.

Your loss number means nothing without context. Bits-per-byte and Chinchilla tell you what it actually means.