José David Baena

Building Custom Evaluation Tasks


Track 2: Practical Guides - Post 2.4 of 6

This post builds on Reinforcement Learning from Human Feedback. View all posts in this track →

Standard benchmarks don't measure what you care about

Building domain-specific evaluation tasks is where I've seen the most value from nanochat's framework. The abstraction is simple enough to implement quickly, but flexible enough to handle any domain.

MMLU measures general knowledge. GSM8K measures math. Neither measures your actual use case. That's why you need custom evaluation tasks.

TL;DR: One base class with two modes (categorical and generative). CORE normalizes scores 0–1 with baseline centering. Custom tasks take 50 lines. Sandbox execution makes code evaluation safe.

The benchmark that missed the bug: Consider a common pattern: a legal-tech team trains a contract analysis model and celebrates 78% accuracy on MMLU law questions. In production, it misses 40% of non-compete clause violations—the exact use case they built it for. MMLU tests general legal knowledge; their users need specific clause detection. After building a custom evaluation task with 500 real contract excerpts, they discover the model is guessing on unfamiliar clause structures. The custom benchmark exposes the gap. After targeted fine-tuning, clause detection hits 91%. General benchmarks tell you your model is broadly capable. Custom benchmarks tell you it actually works.

You've trained a model, fine-tuned it for chat, and optimized it with RL. But how do you know if it's actually good at what you care about? Standard benchmarks like MMLU and GSM8K are useful, but they measure general capabilities—not your specific use case.

Custom evaluation tasks let you measure what matters for your application: medical diagnosis accuracy, legal document analysis, code generation for your codebase, or any domain-specific skill.

nanochat's evaluation framework makes this straightforward. Here's what we'll cover:

  • The three evaluation task types: multiple choice, schema, and language modeling
  • How the CORE benchmark framework works
  • Building custom tasks for any domain
  • Best practices for prompt engineering in evaluation
  • Code execution sandboxing for safe evaluation
  • Distributed evaluation for large benchmarks

One Task class handles any evaluation type

Base Task Class

All evaluation tasks in nanochat inherit from Task:

class Task:
    def __init__(self, start=0, stop=None, step=1):
        # Allows lightweight slicing over the dataset
        self.start = start
        self.stop = stop
        self.step = step
    
    @property
    def eval_type(self):
        # one of 'generative' | 'categorical'
        raise NotImplementedError
    
    def num_examples(self):
        raise NotImplementedError
    
    def get_example(self, index):
        # Returns a conversation dict
        raise NotImplementedError
    
    def evaluate(self, conversation, assistant_response):
        # Returns success (bool or float)
        raise NotImplementedError

Key design principles:

  1. Lightweight slicing: Create views over datasets without copying data (see the sketch after this list)
  2. Lazy loading: Examples fetched on-demand via get_example()
  3. Two evaluation modes: categorical (constrained choices) or generative (free-form)
  4. Conversation format: All tasks return standardized conversation dicts
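
To make the slicing and lazy-loading principles concrete, here is a minimal sketch (not nanochat's actual code) of how a caller might walk the view described by start/stop/step, fetching examples one at a time:

# Minimal sketch (not nanochat's actual code): iterate the view described by
# a Task's start/stop/step fields, fetching each example lazily.
def iter_examples(task):
    stop = task.stop if task.stop is not None else task.num_examples()
    for idx in range(task.start, stop, task.step):
        yield task.get_example(idx)

# e.g. evaluate only every 10th example of a hypothetical task:
# for conversation in iter_examples(MyTask(split="test", step=10)): ...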

The Two Evaluation Modes

Categorical Evaluation

When to use: Multiple choice, yes/no, classification tasks

How it works: Model assigns probabilities to predefined options, chooses highest probability

Example from MMLU:

@property
def eval_type(self):
    return 'categorical'
 
def get_example(self, index):
    row = self.ds[index]
    question = row["question"]
    choices = row["choices"]  # ["Choice A", "Choice B", "Choice C", "Choice D"]
    answer = row["answer"]    # 0, 1, 2, or 3
    
    user_message = render_mc(question, ['A', 'B', 'C', 'D'], choices)
    assistant_message = ['A', 'B', 'C', 'D'][answer]
    
    return {
        "messages": [
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": assistant_message}
        ],
        "letters": ['A', 'B', 'C', 'D'],  # For evaluation
    }
 
def evaluate(self, conversation, assistant_response):
    correct_answer = conversation['messages'][-1]['content']
    return assistant_response == correct_answer

Advantages:

  • Fast (no sampling required, batch multiple questions)
  • Deterministic (no temperature variation)
  • Easy to score (exact match)

Generative Evaluation

When to use: Open-ended tasks (code generation, math, creative writing)

How it works: Model generates free-form text, evaluated against success criteria

Example from HumanEval:

@property
def eval_type(self):
    return 'generative'
 
def get_example(self, index):
    row = self.ds[index]
    prompt = row['prompt']           # Function signature
    solution = row['canonical_solution']
    test = row['test']               # Test cases
    
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": f"{prompt}\\n{solution}"},
        ],
        "entry_point": row['entry_point'],
        "test": test,
    }
 
def evaluate(self, conversation, completion):
    # Extract code from completion
    code = extract_program(completion)
    
    # Build executable program
    program = (
        imports +
        "\n\n" +
        code +
        "\n\n" +
        conversation['test'] +
        "\n" +
        f"check({conversation['entry_point']})"
    )
    
    # Execute and check
    result = execute_code(program)
    return result.success

Advantages:

  • Flexible (measures complex capabilities)
  • Realistic (mimics actual use)
  • Rich feedback (can analyze failure modes)

For your custom tasks, this means: use categorical evaluation when you can (faster, deterministic), but don't force it. If your task is naturally open-ended—code generation, summarization, creative writing—generative evaluation captures what categorical misses.

For your compute budget, this means: categorical evaluation is 10-100× cheaper than generative. A single forward pass vs. potentially hundreds of tokens sampled. If you're evaluating on every training step, categorical saves real money.

CORE normalizes scores across diverse task types

What is CORE?

CORE (Compressed Open-Ended Requirements Evaluation) is an 11-task benchmark from the DCLM paper that evaluates base models across diverse capabilities with minimal compute.

From nanochat/core_eval.py:

Tasks:
1. ARC (easy/challenge) - Science reasoning
2. HellaSwag - Commonsense reasoning  
3. MMLU - Multitask knowledge
4. OpenBookQA - Elementary science
5. PIQA - Physical reasoning
6. SIQA - Social reasoning
7. WinoGrande - Coreference resolution
8. BoolQ - Yes/no questions
9. COPA - Causal reasoning
10. StoryCloze - Story completion
11. SQuAD - Reading comprehension

Why CORE matters:

  • Coverage: Tests diverse reasoning types
  • Efficiency: 11 tasks vs. dozens in full benchmarks
  • Correlation: CORE score correlates with broader evaluation suites
  • Centered metric: Accounts for random baselines (0=random, 1=perfect)

Three Task Types in CORE

CORE defines three fundamental task structures:

1. Multiple Choice Tasks

Structure: Same context, different continuations

Question: What is the capital of France?
- Paris=A
- London=B
- Berlin=C
- Madrid=D

Respond only with the letter of the correct answer.

Evaluation method: Compare log probabilities of each continuation, choose lowest loss (highest probability)

From nanochat/core_eval.py:

if task_type == 'multiple_choice':
    prompts = render_prompts_mc(item, continuation_delimiter, fewshot_examples)
    tokens, start_idxs, end_idxs = batch_sequences_mc(tokenizer, prompts)
    # ...
    # Find option with lowest average loss
    mean_losses = [losses[i, si-1:ei-1].mean().item()
                   for i, (si, ei) in enumerate(zip(start_idxs, end_idxs))]
    pred_idx = mean_losses.index(min(mean_losses))
    is_correct = pred_idx == item['gold']

Key insight: We evaluate log probabilities, not generated text. This is much faster and avoids issues with generation formatting.
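
As a rough sketch of what this looks like in practice (assuming a model call that returns next-token logits; this is not nanochat's exact code), the per-option loss is a mean cross-entropy over the continuation span:

import torch
import torch.nn.functional as F

def continuation_loss(model, input_ids, start, end):
    """Mean cross-entropy over tokens [start, end) of one sequence.
    Sketch only: assumes model(input_ids) returns logits of shape (1, T, V)."""
    with torch.no_grad():
        logits = model(input_ids)                  # (1, T, V)
    targets = input_ids[0, start:end]              # continuation tokens
    pred_logits = logits[0, start - 1:end - 1]     # positions that predict them (note the shift)
    return F.cross_entropy(pred_logits, targets).item()

The option whose continuation gets the lowest mean loss is the model's answer, exactly as in the selection code above.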

2. Schema Tasks

Structure: Different contexts, same continuation

Context A: "The dog barked loudly."
Context B: "The dog slept quietly."
Context C: "The dog ran quickly."

Continuation: " It was happy."

Which context is most likely?

Use case: Sentence completion, coreference resolution

Evaluation method: Similar to multiple choice, but context varies instead of continuation

if task_type == 'schema':
    prompts = render_prompts_schema(item, continuation_delimiter, fewshot_examples)
    tokens, start_idxs, end_idxs = batch_sequences_schema(tokenizer, prompts)
    # Find context with lowest loss for the continuation
    mean_losses = [losses[i, si-1:ei-1].mean().item()
                   for i, (si, ei) in enumerate(zip(start_idxs, end_idxs))]
    pred_idx = mean_losses.index(min(mean_losses))
    is_correct = pred_idx == item['gold']

3. Language Modeling Tasks

Structure: Context + continuation, evaluate continuation likelihood

Context: "The capital of France is"
Continuation: " Paris"

Check if model assigns high probability to continuation

Use case: Reading comprehension, factual knowledge

Evaluation method: Check if argmax predictions match actual tokens

if task_type == 'language_modeling':
    prompts = render_prompts_lm(item, continuation_delimiter, fewshot_examples)
    tokens, start_idxs, end_idxs = batch_sequences_lm(tokenizer, prompts)
    # Check if all predicted tokens match actual tokens
    si, ei = start_idxs[0], end_idxs[0]
    predicted_tokens = predictions[0, si-1:ei-1]
    actual_tokens = input_ids[0, si:ei]
    is_correct = torch.all(predicted_tokens == actual_tokens).item()

Centered Metrics

Raw accuracy can be misleading when random guessing achieves high scores:

# Example: 4-choice multiple choice
raw_accuracy = 0.30  # 30% correct
random_baseline = 0.25  # 25% by guessing
 
# Centered accuracy
centered = (raw_accuracy - random_baseline) / (1.0 - random_baseline)
# centered = (0.30 - 0.25) / (1.0 - 0.25) = 0.067 (6.7% above random)

Interpretation:

  • 0.0 = Random guessing
  • 1.0 = Perfect performance
  • Negative = Worse than random (model is broken)

For your evaluation pipeline, this means: always report centered accuracy, not raw accuracy. A "30% accuracy" sounds bad until you realize random is 25%—your model is actually learning something. Centered metrics make progress visible.

For your stakeholder reports, this means: centered metrics translate to business value. "Seven points above random" is a concrete improvement; "32% accuracy" on a 4-choice task sounds terrible but says the same thing.

The CORE metric is the mean of centered accuracies across all 11 tasks.
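
Putting the two pieces together, a small helper (a sketch, not nanochat's code) that turns per-task raw accuracies and baselines into a CORE-style score might look like this:

def centered(raw_accuracy, random_baseline):
    # 0.0 = random guessing, 1.0 = perfect
    return (raw_accuracy - random_baseline) / (1.0 - random_baseline)

def core_score(results):
    # results: list of (raw_accuracy, random_baseline) pairs, one per task
    centered_scores = [centered(acc, base) for acc, base in results]
    return sum(centered_scores) / len(centered_scores)

# e.g. two 4-choice tasks and one binary task (illustrative numbers)
print(core_score([(0.30, 0.25), (0.55, 0.25), (0.70, 0.50)]))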

[Interactive demo: Evaluation Task Builder. Create custom evaluation tasks and compare scoring methods in the browser.]

Scoring methods:

  • Exact Match: Response must match exactly
  • Contains: Response must contain the pattern
  • Regex: Pattern is a regular expression
  • Semantic: Embedding similarity (simulated)

[Interactive demo: Centered Metric Calculator. Compare raw accuracy, centered accuracy ((model - baseline) / (human - baseline)), normalized accuracy (model / human), and the remaining gap to human performance, per task and overall.]

Why Centered Metrics?

This variant centers against a human ceiling rather than perfect accuracy:

  • Raw accuracy can be misleading: 75% on an easy task and 75% on a hard task mean very different things
  • Centered accuracy normalizes scores from 0% (random) to 100% (human-level)
  • A score of 50% means halfway between the random baseline and human performance
  • Scores above 100% indicate superhuman performance (possible with some metrics)

Formula

Centered Accuracy = (model_score - baseline) / (human_ceiling - baseline)

Where:
  baseline = random/majority class performance
  human_ceiling = expert human performance
  
Range: [0, 1] typically (can exceed 1 for superhuman)

Multiple choice tasks: constrained evaluation in ~50 lines

Example: Custom Medical Diagnosis Task

Building a medical diagnosis task from scratch:

from datasets import load_dataset
from tasks.common import Task, render_mc
 
class MedicalDiagnosis(Task):
    """Evaluate medical diagnosis from symptoms"""
    
    def __init__(self, split, **kwargs):
        super().__init__(**kwargs)
        assert split in ["train", "test"], "split must be train|test"
        # Assume you have a dataset with structure:
        # {"symptoms": str, "diagnosis": str, "options": List[str]}
        self.ds = load_dataset("your_org/medical_diagnosis", split=split)
        self.letters = ('A', 'B', 'C', 'D', 'E')  # Up to 5 options
    
    @property
    def eval_type(self):
        return 'categorical'
    
    def num_examples(self):
        return len(self.ds)
    
    def get_example(self, index):
        row = self.ds[index]
        symptoms = row['symptoms']
        diagnosis = row['diagnosis']
        options = row['options']  # List of possible diagnoses
        
        # Find which option is correct
        assert diagnosis in options, f"Correct diagnosis {diagnosis!r} not in options!"
        answer_idx = options.index(diagnosis)
        
        # Render as multiple choice
        question = f"Given the following symptoms:\n{symptoms}\n\nWhat is the most likely diagnosis?"
        user_message = render_mc(question, self.letters[:len(options)], options)
        assistant_message = self.letters[answer_idx]
        
        return {
            "messages": [
                {"role": "user", "content": user_message},
                {"role": "assistant", "content": assistant_message}
            ],
            "letters": self.letters[:len(options)],
            "specialty": row.get('specialty', 'general'),  # For grouping results
        }
    
    def evaluate(self, conversation, assistant_response):
        assert assistant_response in conversation['letters']
        correct_answer = conversation['messages'][-1]['content']
        return assistant_response == correct_answer

Usage:

from tasks.medical_diagnosis import MedicalDiagnosis
 
# Create task
task = MedicalDiagnosis(split="test")
 
# Evaluate (from scripts/chat_eval.py pattern)
from scripts.chat_eval import run_categorical_eval
 
accuracy = run_categorical_eval(
    task_object=task,
    tokenizer=tokenizer,
    model=model,
    batch_size=8,
    max_problems=None  # Evaluate all
)
 
print(f"Medical Diagnosis Accuracy: {accuracy:.2%}")

Generative tasks: free-form evaluation with verifiable answers

Example: Custom Code Generation Task

For domain-specific code generation:

from datasets import load_dataset
from tasks.common import Task
from nanochat.execution import execute_code
 
class CustomCodeGen(Task):
    """Evaluate code generation for your specific domain"""
    
    def __init__(self, split, **kwargs):
        super().__init__(**kwargs)
        # Load your custom code dataset
        # Structure: {"prompt": str, "solution": str, "tests": str, "imports": str}
        self.ds = load_dataset("your_org/custom_code", split=split)
    
    @property
    def eval_type(self):
        return 'generative'
    
    def num_examples(self):
        return len(self.ds)
    
    def get_example(self, index):
        row = self.ds[index]
        prompt = row['prompt']        # Function description or stub
        solution = row['solution']    # Reference solution
        tests = row['tests']          # Test cases
        imports = row['imports']      # Required imports
        
        return {
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": solution},
            ],
            "tests": tests,
            "imports": imports,
        }
    
    def evaluate(self, conversation, completion):
        """Execute generated code and check if tests pass"""
        # Extract code from markdown blocks
        code = self.extract_code(completion)
        
        # Build executable program
        program = (
            conversation['imports'] +
            "\n\n" +
            code +
            "\n\n" +
            conversation['tests']
        )
        
        # Execute safely
        result = execute_code(
            program,
            timeout=5.0,
            maximum_memory_bytes=256 * 1024 * 1024  # 256MB
        )
        
        return result.success
    
    def extract_code(self, completion):
        """Extract code from LLM output"""
        import re
        # Try to find code blocks
        pattern = r'```(?:python)?\s*\n(.*?)\n```'
        matches = re.findall(pattern, completion, re.DOTALL)
        if matches:
            return matches[0].strip()
        # Fall back to whole completion
        return completion.strip()

Advanced: Partial Credit

Instead of binary pass/fail, award partial credit:

def evaluate(self, conversation, completion):
    """Award partial credit based on number of passing tests"""
    code = self.extract_code(completion)
    test_cases = self.parse_tests(conversation['tests'])  # List of individual tests
    
    passing_tests = 0
    for test_case in test_cases:
        program = (
            conversation['imports'] +
            "\n\n" +
            code +
            "\n\n" +
            test_case
        )
        result = execute_code(program, timeout=2.0)
        if result.success:
            passing_tests += 1
    
    # Return fraction of tests passed
    return passing_tests / len(test_cases)

Sandbox execution makes code evaluation safe

The Sandbox

nanochat includes a sandboxed execution environment for running untrusted code from LLMs:

From nanochat/execution.py:

def execute_code(
    code: str,
    timeout: float = 5.0,
    maximum_memory_bytes: Optional[int] = 256 * 1024 * 1024,
) -> ExecutionResult:
    """
    Execute Python code in a sandboxed environment.
    
    Safety features:
    - Runs in separate process (can be killed)
    - Time limit (default 5 seconds)
    - Memory limit (default 256MB)
    - Temporary directory (auto-cleaned)
    - Disabled dangerous functions (os.system, subprocess, etc.)
    """

What's protected:

import builtins, os, resource, shutil, subprocess

def reliability_guard(maximum_memory_bytes):
    # Disable exit/quit
    builtins.exit = None
    builtins.quit = None
    
    # Disable dangerous OS operations
    os.kill = None
    os.system = None
    os.remove = None
    os.fork = None
    
    # Disable subprocess
    subprocess.Popen = None
    
    # Disable filesystem manipulation
    shutil.rmtree = None
    
    # Set memory limits
    resource.setrlimit(resource.RLIMIT_AS, (maximum_memory_bytes, maximum_memory_bytes))

What's NOT protected:

  • Network access (sockets can be opened)
  • Python's dynamic features (ctypes, etc.)
  • No kernel-level isolation

Recommendation: Use this sandbox for evaluation, but for production systems serving untrusted code, use proper containerization (Docker with limited capabilities, gVisor, Firecracker, etc.).
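
For reference, here is a minimal sketch of what heavier isolation can look like: running the generated program inside a locked-down Docker container. The flags are standard docker run options; adjust the image and limits to your environment.

import os
import subprocess
import tempfile

def execute_in_docker(code: str, timeout: float = 10.0) -> bool:
    """Sketch: run untrusted code inside a constrained container, return pass/fail."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "program.py")
        with open(path, "w") as f:
            f.write(code)
        cmd = [
            "docker", "run", "--rm",
            "--network", "none",       # no network access
            "--memory", "256m",        # memory cap
            "--cpus", "1",             # CPU cap
            "--pids-limit", "64",      # bound the number of processes
            "--read-only",             # read-only root filesystem
            "-v", f"{tmp}:/work:ro",   # mount the program read-only
            "python:3.11-slim", "python", "/work/program.py",
        ]
        try:
            result = subprocess.run(cmd, capture_output=True, timeout=timeout)
            return result.returncode == 0
        except subprocess.TimeoutExpired:
            return False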

Execution Result Handling

result = execute_code("print('hello'); 1/0")
 
# Check result
if result.success:
    print(f"Output: {result.stdout}")
else:
    print(f"Error: {result.error}")
    if result.timeout:
        print("Execution timed out")
    if result.memory_exceeded:
        print("Memory limit exceeded")

Example outputs:

# Success
ExecutionResult(success=True, stdout='hello world\n', stderr='')
 
# Timeout
ExecutionResult(success=False, timeout=True, error='Execution timed out')
 
# Runtime error
ExecutionResult(success=False, error='ZeroDivisionError: division by zero')
 
# Memory exceeded
ExecutionResult(success=False, memory_exceeded=True, error='Memory limit exceeded: ...')

Prompt format affects evaluation accuracy more than you expect

Multiple Choice Rendering

The render_mc() function creates standardized multiple choice prompts:

From tasks/common.py:

def render_mc(question, letters, choices):
    """
    Important design decisions:
    1) Letter AFTER choice (better binding for small models)
    2) No whitespace before letter (tokenization consistency)
    """
    query = f"Multiple Choice question: {question}\\n"
    query += "".join([f"- {choice}={letter}\\n" for letter, choice in zip(letters, choices)])
    query += "\\nRespond only with the letter of the correct answer."
    return query

Example output:

Multiple Choice question: What is the capital of France?
- Paris=A
- London=B
- Berlin=C
- Madrid=D

Respond only with the letter of the correct answer.

Why this format?

  1. Letter after choice: Smaller models bind the letter to its choice more reliably when the letter comes after the choice text
  2. No space before letter: Tokenizer treats "=A" consistently (not "= A" or " A")
  3. Explicit instruction: "Respond only with the letter" reduces verbosity

Few-Shot Prompting

CORE uses few-shot examples for consistency:

from jinja2 import Template

def render_prompts_mc(item, continuation_delimiter, fewshot_examples=None):
    template_str = """
{%- for example in fewshot_examples -%}
{{ example.query }}{{ continuation_delimiter }}{{ example.choices[example.gold] }}
 
{% endfor -%}
{{ item.query }}{{ continuation_delimiter }}{{ choice }}""".strip()
    
    template = Template(template_str)
    fewshot_examples = fewshot_examples or []
    prompts = [template.render(choice=choice, item=item, fewshot_examples=fewshot_examples, continuation_delimiter=continuation_delimiter) 
               for choice in item['choices']]
    return prompts

Example with 2-shot:

Q: What is 2+2?
A: 4

Q: What is the capital of Germany?
A: Berlin

Q: What is the chemical symbol for gold?
A: Au

Best practices:

  • Use 0-5 shots (more doesn't always help)
  • Sample fewshot examples randomly but deterministically (seed-based); see the sketch after this list
  • Exclude current example from fewshot pool
  • Match format exactly (same template for fewshot and test)
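
A minimal sketch (not nanochat's code) of seed-based few-shot sampling that excludes the current example:

import random

def sample_fewshot(data, current_index, num_shots, seed=1234):
    # Deterministic: the same seed and index always yield the same few-shot set
    rng = random.Random(seed + current_index)
    pool = [i for i in range(len(data)) if i != current_index]
    picked = rng.sample(pool, min(num_shots, len(pool)))
    return [data[i] for i in picked]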

Distributed evaluation scales to 10K+ examples

Multi-GPU Evaluation

From nanochat/core_eval.py:

import torch
import torch.distributed as dist

def evaluate_task(model, tokenizer, data, device, task_meta):
    """Evaluate task across multiple GPUs"""
    rank = dist.get_rank() if dist.is_initialized() else 0
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    
    correct = torch.zeros(len(data), dtype=torch.float32, device=device)
    
    # Each rank processes different examples
    for idx in range(rank, len(data), world_size):
        is_correct = evaluate_example(idx, model, tokenizer, data, device, task_meta)
        correct[idx] = float(is_correct)
    
    # Synchronize results across ranks
    if world_size > 1:
        dist.barrier()
        dist.all_reduce(correct, op=dist.ReduceOp.SUM)
    
    # Compute mean accuracy
    mean_correct = correct.mean().item()
    return mean_correct

How it works:

  1. Stride distribution: Rank 0 gets examples [0, 8, 16, ...], Rank 1 gets [1, 9, 17, ...], etc.
  2. Local evaluation: Each rank evaluates its assigned examples
  3. Synchronization: all_reduce sums results across ranks
  4. Global mean: Final accuracy computed from aggregated results

Launch evaluation:

# Single GPU
python scripts/base_eval.py
 
# 8 GPUs (8x speedup)
torchrun --nproc_per_node=8 scripts/base_eval.py

Task mixtures combine multiple evaluations into one score

Combining Multiple Tasks

Use TaskMixture to create multi-task datasets:

from tasks.common import TaskMixture
from tasks.arc import ARC
from tasks.gsm8k import GSM8K
from tasks.medical_diagnosis import MedicalDiagnosis
 
train_ds = TaskMixture([
    ARC(subset="ARC-Easy", split="train"),      # 2.3K examples
    GSM8K(subset="main", split="train"),         # 8K examples
    MedicalDiagnosis(split="train"),             # Your custom task
])
 
# Access examples (automatically shuffled)
for i in range(len(train_ds)):
    conversation = train_ds[i]
    # Train on this conversation

TaskMixture features:

  1. Deterministic shuffling: Same seed = same ordering (reproducible)
  2. Uniform sampling: All tasks mixed throughout training (prevents forgetting)
  3. Simple oversampling: Include a task multiple times to oversample it, as shown below

# Oversample medical diagnosis 3x
train_ds = TaskMixture([
    ARC(subset="ARC-Easy", split="train"),
    GSM8K(subset="main", split="train"),
    MedicalDiagnosis(split="train"),
    MedicalDiagnosis(split="train"),  # 2x
    MedicalDiagnosis(split="train"),  # 3x
])

These practices make evaluations reliable and reproducible

1. Data Quality

Good evaluation data:

  • Representative of real use cases
  • Balanced difficulty (not all easy/hard)
  • Diverse examples (covers edge cases)
  • High-quality labels (verified correct)
  • Clear success criteria

Bad evaluation data:

  • Artificial or contrived examples
  • Ambiguous questions
  • Multiple valid answers (but only one labeled)
  • Label noise or errors

2. Prompt Design

Do:

  • Use clear, specific instructions
  • Provide examples (few-shot) when helpful
  • Match training format
  • Test prompts on small sample first

Don't:

  • Change format between train and eval
  • Use ambiguous wording
  • Assume model knows implicit conventions
  • Over-complicate prompts

3. Evaluation Metrics

For classification tasks:

  • Accuracy (overall)
  • Per-class precision/recall
  • Confusion matrix
  • Calibration (confidence vs. correctness)

For generation tasks:

  • Exact match
  • F1 score (token overlap)
  • BLEU/ROUGE (for summarization)
  • Pass@k (for code); see the estimator after this list
  • Human evaluation (gold standard)
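
For pass@k, the standard unbiased estimator from the HumanEval paper is worth keeping on hand: generate n samples per problem, count the c correct ones, and compute pass@k = 1 - C(n-c, k) / C(n, k):

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated, c of them passed the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# e.g. 20 samples per problem, 5 passing:
# pass_at_k(20, 5, 1) ≈ 0.25, pass_at_k(20, 5, 10) ≈ 0.98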

4. Error Analysis

Always analyze failures:

# Collect failures
failures = []
for i in range(task.num_examples()):
    conversation = task.get_example(i)
    response = generate_response(conversation)  # your own sampling/inference helper
    is_correct = task.evaluate(conversation, response)
    
    if not is_correct:
        failures.append({
            "index": i,
            "conversation": conversation,
            "response": response,
            "expected": conversation['messages'][-1]['content'],
        })
 
# Analyze patterns
print(f"Failure rate: {len(failures) / task.num_examples():.2%}")
print("\nSample failures:")
for failure in failures[:5]:
    print(f"\nQuestion: {failure['conversation']['messages'][0]['content']}")
    print(f"Expected: {failure['expected']}")
    print(f"Got: {failure['response']}")

These mistakes invalidate your evaluation results

1. Data Leakage

Problem: Test examples appear in training data

Solution:

  • Use different splits (train/val/test)
  • Check for near-duplicates between train and test (a simple n-gram overlap check is sketched below)
  • Time-based splits for temporal data
  • Hold out test set completely
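
A quick way to catch near-duplicates between train and test is n-gram Jaccard overlap. A rough sketch, assuming train_questions and test_questions are lists of strings, with the 0.8 threshold purely illustrative:

def ngrams(text, n=8):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def near_duplicate(a, b, n=8, threshold=0.8):
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return False
    return len(ga & gb) / len(ga | gb) >= threshold

leaks = [
    (i, j)
    for i, test_q in enumerate(test_questions)
    for j, train_q in enumerate(train_questions)
    if near_duplicate(test_q, train_q)
]
print(f"Potential leaks: {len(leaks)}")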

2. Prompt Sensitivity

Problem: Small prompt changes cause large metric changes

Solution:

  • Test multiple prompt variations
  • Report mean and standard deviation
  • Use few-shot examples
  • Standardize format

3. Metric Gaming

Problem: Model exploits metric without solving task

Example: Model learns to always output "A" on multiple choice if "A" is most common

Solution:

  • Balance datasets (equal distribution of answers); see the quick check below
  • Use multiple complementary metrics
  • Manual inspection of samples
  • Adversarial examples
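
A quick sanity check for answer balance, written against the Task interface above, counts how often each letter is the gold answer:

from collections import Counter

def answer_distribution(task):
    counts = Counter()
    for i in range(task.num_examples()):
        conversation = task.get_example(i)
        counts[conversation["messages"][-1]["content"]] += 1
    total = sum(counts.values())
    return {letter: count / total for letter, count in sorted(counts.items())}

# e.g. {'A': 0.41, 'B': 0.22, 'C': 0.19, 'D': 0.18} means it is time to rebalance
print(answer_distribution(MedicalDiagnosis(split="test")))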

4. Evaluation Bugs

Problem: Bug in evaluation code inflates/deflates scores

Solution:

  • Unit test evaluation logic
  • Verify with reference implementations
  • Check edge cases (empty strings, special characters)
  • Manual verification on small sample

Custom evaluation measures what actually matters

Generic benchmarks miss your domain. Build evaluations that test what you care about. The key principles:

  1. Choose the right eval type: Categorical for constrained choices, generative for open-ended tasks
  2. Design clear prompts: Unambiguous instructions, consistent format
  3. Define success criteria: Exact match, fuzzy match, execution-based, or multi-faceted scoring
  4. Analyze failures: Understand where and why the model fails
  5. Use safe execution: Sandbox untrusted code from LLMs
  6. Distribute evaluation: Speed up with multi-GPU evaluation

The nanochat evaluation framework provides all the building blocks—Task abstraction, CORE benchmark patterns, safe execution, and distributed evaluation. By extending these patterns to your domain, you can build benchmarks that drive model improvements where they matter most.

If you can't measure it, you can't improve it. Now you can measure anything.

Next up: tokenizer design choices. Vocabulary size, regex patterns, and special tokens affect model performance and training efficiency.


Before you build your custom evaluation:

  1. Define success criteria before writing code. Exact match? Fuzzy match? Execution-based? This determines your entire architecture.
  2. Start with 10 hand-verified examples. Run your model on these manually—you'll discover edge cases no automated test catches.
  3. Sandbox untrusted code execution. Never run LLM-generated code without resource limits—one infinite loop crashes your eval pipeline.
  4. Balance your evaluation dataset. Equal distribution of answer choices—if 60% of answers are "A", your model will learn to guess "A".
  5. Test on the actual model first. Run 5 examples end-to-end before launching 10,000—discover prompt format bugs early.

Sources

nanochat Implementation

  • nanochat Repository: karpathy/nanochat. Source code for evaluation task implementations.
  • Tasks Directory: tasks/. Task implementations for CORE benchmark.
  • CORE Eval: nanochat/core_eval.py. Distributed evaluation implementation.


💡 nanochat Tip: Domain-specific evaluation is often more valuable than general benchmarks. Invest in building high-quality evaluation tasks for your use case—they'll guide model development and measure real-world impact.

MMLU measures what the internet knows. Your custom task measures what your users need. Build both.