José David Baena

Building Custom Evaluation Tasks

15 mins read

Track 2: Practical Guides - Post 2.4 of 6

This post builds on Reinforcement Learning from Human Feedback.

Introduction

You've trained a model, fine-tuned it for chat, and optimized it with RL. But how do you know if it's actually good at what you care about? Standard benchmarks like MMLU and GSM8K are useful, but they measure general capabilities—not your specific use case.

Custom evaluation tasks let you measure what matters for your application: medical diagnosis accuracy, legal document analysis, code generation for your codebase, or any domain-specific skill.

This post covers nanochat's evaluation framework and how to design, implement, and run custom benchmarks:

  • The three evaluation task types: multiple choice, schema, and language modeling
  • How the CORE benchmark framework works
  • Building custom tasks for any domain
  • Best practices for prompt engineering in evaluation
  • Code execution sandboxing for safe evaluation
  • Distributed evaluation for large benchmarks

The Task Abstraction

Base Task Class

All evaluation tasks in nanochat inherit from Task:

class Task:
    def __init__(self, start=0, stop=None, step=1):
        # Allows lightweight slicing over the dataset
        self.start = start
        self.stop = stop
        self.step = step
    
    @property
    def eval_type(self):
        # one of 'generative' | 'categorical'
        raise NotImplementedError
    
    def num_examples(self):
        raise NotImplementedError
    
    def get_example(self, index):
        # Returns a conversation dict
        raise NotImplementedError
    
    def evaluate(self, conversation, assistant_response):
        # Returns success (bool or float)
        raise NotImplementedError

Key design principles:

  1. Lightweight slicing: Create views over datasets without copying data
  2. Lazy loading: Examples fetched on-demand via get_example()
  3. Two evaluation modes: categorical (constrained choices) or generative (free-form)
  4. Conversation format: All tasks return standardized conversation dicts
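
To make the slicing principle concrete, here is a minimal sketch of how `start`, `stop`, and `step` could map a logical index onto the underlying dataset. The `__len__` and `__getitem__` methods and the `ToyTask` class are illustrative assumptions, not a copy of nanochat's code.

```python
class SlicedTask:
    """Minimal sketch of a sliceable task view (illustrative only)."""

    def __init__(self, start=0, stop=None, step=1):
        self.start, self.stop, self.step = start, stop, step

    def num_examples(self):
        raise NotImplementedError

    def get_example(self, index):
        raise NotImplementedError

    def __len__(self):
        # Number of examples visible through the slice
        stop = self.num_examples() if self.stop is None else self.stop
        span = max(stop - self.start, 0)
        return (span + self.step - 1) // self.step  # ceiling division

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Translate the logical index into a physical dataset index
        return self.get_example(self.start + index * self.step)


class ToyTask(SlicedTask):
    """Hypothetical task backed by an in-memory list."""
    DATA = [f"example-{i}" for i in range(10)]

    def num_examples(self):
        return len(self.DATA)

    def get_example(self, index):
        return {"messages": [{"role": "user", "content": self.DATA[index]}]}


view = ToyTask(start=2, stop=8, step=2)  # physical indices 2, 4, 6
print(len(view), view[1]["messages"][0]["content"])
```

No data is copied: the view only stores three integers and fetches each example on demand.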

The Two Evaluation Modes

Categorical Evaluation

When to use: Multiple choice, yes/no, classification tasks

How it works: The model assigns a probability to each predefined option, and the highest-probability option is taken as its answer

Example from MMLU:

@property
def eval_type(self):
    return 'categorical'
 
def get_example(self, index):
    row = self.ds[index]
    question = row["question"]
    choices = row["choices"]  # ["Choice A", "Choice B", "Choice C", "Choice D"]
    answer = row["answer"]    # 0, 1, 2, or 3
    
    user_message = render_mc(question, ['A', 'B', 'C', 'D'], choices)
    assistant_message = ['A', 'B', 'C', 'D'][answer]
    
    return {
        "messages": [
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": assistant_message}
        ],
        "letters": ['A', 'B', 'C', 'D'],  # For evaluation
    }
 
def evaluate(self, conversation, assistant_response):
    correct_answer = conversation['messages'][-1]['content']
    return assistant_response == correct_answer

Advantages:

  • Fast (no sampling required, batch multiple questions)
  • Deterministic (no temperature variation)
  • Easy to score (exact match)
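
One way to picture categorical scoring: take the model's next-token logits at the answer position, keep only the logits of the allowed letters, and pick the largest. The sketch below uses a plain dict in place of real model output; `predict_letter` and the logit values are made up for illustration.

```python
def predict_letter(next_token_logits, letters):
    """Return the allowed letter with the highest logit.

    next_token_logits: dict mapping token string -> logit (a stand-in for
    a real logits tensor indexed by token id).
    """
    allowed = {letter: next_token_logits[letter] for letter in letters}
    return max(allowed, key=allowed.get)


# Hypothetical logits: the top token overall is "The", but it is not a
# valid answer, so only A-D compete.
logits = {"The": 5.1, "A": 1.2, "B": 3.4, "C": 0.3, "D": -0.8}
prediction = predict_letter(logits, ["A", "B", "C", "D"])
print(prediction)
```

Because the answer is always one of the allowed letters, scoring never has to parse free-form text.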

Generative Evaluation

When to use: Open-ended tasks (code generation, math, creative writing)

How it works: The model generates free-form text, which is then checked against task-specific success criteria

Example from HumanEval:

@property
def eval_type(self):
    return 'generative'
 
def get_example(self, index):
    row = self.ds[index]
    prompt = row['prompt']           # Function signature
    solution = row['canonical_solution']
    test = row['test']               # Test cases
    
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": f"{prompt}\n{solution}"},
        ],
        "entry_point": row['entry_point'],
        "test": test,
    }
 
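def evaluate(self, conversation, completion):
    # Recover the import lines from the prompt; completions often omit them
    # (extract_imports is assumed to be a helper defined alongside extract_program)
    imports = extract_imports(conversation['messages'][0]['content'])
    
    # Extract code from completion
    code = extract_program(completion)
    
    # Build executable program
    program = (
        imports +
        "\n\n" +
        code +
        "\n\n" +
        conversation['test'] +
        "\n" +
        f"check({conversation['entry_point']})"
    )
    
    # Execute and check
    result = execute_code(program)
    return result.success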

Advantages:

  • Flexible (measures complex capabilities)
  • Realistic (mimics actual use)
  • Rich feedback (can analyze failure modes)

The CORE Benchmark Framework

What is CORE?

CORE is a benchmark metric from the DCLM paper that evaluates base models across diverse capabilities with minimal compute. This post works with the 11 tasks listed below; note that the DCLM paper itself defines CORE over 22 tasks.

From nanochat/core_eval.py:

Tasks:
1. ARC (easy/challenge) - Science reasoning
2. HellaSwag - Commonsense reasoning  
3. MMLU - Multitask knowledge
4. OpenBookQA - Elementary science
5. PIQA - Physical reasoning
6. SIQA - Social reasoning
7. WinoGrande - Coreference resolution
8. BoolQ - Yes/no questions
9. COPA - Causal reasoning
10. StoryCloze - Story completion
11. SQuAD - Reading comprehension

Why CORE matters:

  • Coverage: Tests diverse reasoning types
  • Efficiency: 11 tasks vs. dozens in full benchmarks
  • Correlation: CORE score correlates with broader evaluation suites
  • Centered metric: Accounts for random baselines (0=random, 1=perfect)

Three Task Types in CORE

CORE defines three fundamental task structures:

1. Multiple Choice Tasks

Structure: Same context, different continuations

Context: "Question: What is the capital of France? Answer:"

Continuation A: " Paris"
Continuation B: " London"
Continuation C: " Berlin"
Continuation D: " Madrid"

Which continuation is most likely?

Evaluation method: Compare log probabilities of each continuation, choose lowest loss (highest probability)

From nanochat/core_eval.py:

if task_type == 'multiple_choice':
    prompts = render_prompts_mc(item, continuation_delimiter, fewshot_examples)
    tokens, start_idxs, end_idxs = batch_sequences_mc(tokenizer, prompts)
    # ...
    # Find option with lowest average loss
    mean_losses = [losses[i, si-1:ei-1].mean().item()
                   for i, (si, ei) in enumerate(zip(start_idxs, end_idxs))]
    pred_idx = mean_losses.index(min(mean_losses))
    is_correct = pred_idx == item['gold']

Key insight: We evaluate log probabilities, not generated text. This is much faster and avoids issues with generation formatting.
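
The selection rule can be illustrated without a model. The sketch below assumes we already have the probability the model assigned to each token of every candidate continuation (the numbers are invented) and picks the option with the lowest mean per-token loss. `pick_by_mean_loss` is a hypothetical helper name.

```python
import math


def pick_by_mean_loss(token_probs_per_option):
    """token_probs_per_option[i] holds the model's probability for each
    token of continuation i. Returns (best_index, mean_losses)."""
    mean_losses = []
    for probs in token_probs_per_option:
        losses = [-math.log(p) for p in probs]  # per-token cross-entropy
        mean_losses.append(sum(losses) / len(losses))
    best = mean_losses.index(min(mean_losses))
    return best, mean_losses


# Made-up probabilities for three candidate continuations.
options = [
    [0.9, 0.8],       # short, confident
    [0.5, 0.5, 0.5],  # longer, uncertain
    [0.1, 0.9],       # one very unlikely token
]
best, losses = pick_by_mean_loss(options)
print(best, [round(x, 3) for x in losses])
```

Averaging over tokens, rather than summing, keeps longer continuations from being penalized simply for having more tokens.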

2. Schema Tasks

Structure: Different contexts, same continuation

Context A: "The dog barked loudly."
Context B: "The dog slept quietly."
Context C: "The dog ran quickly."

Continuation: " It was happy."

Which context makes the continuation most likely?

Use case: Sentence completion, coreference resolution

Evaluation method: Similar to multiple choice, but context varies instead of continuation

if task_type == 'schema':
    prompts = render_prompts_schema(item, continuation_delimiter, fewshot_examples)
    tokens, start_idxs, end_idxs = batch_sequences_schema(tokenizer, prompts)
    # Find context with lowest loss for the continuation
    mean_losses = [losses[i, si-1:ei-1].mean().item()
                   for i, (si, ei) in enumerate(zip(start_idxs, end_idxs))]
    pred_idx = mean_losses.index(min(mean_losses))
    is_correct = pred_idx == item['gold']

3. Language Modeling Tasks

Structure: Context + continuation, evaluate continuation likelihood

Context: "The capital of France is"
Continuation: " Paris"

Check whether the model's top (argmax) prediction matches every token of the continuation

Use case: Reading comprehension, factual knowledge

Evaluation method: Check if argmax predictions match actual tokens

if task_type == 'language_modeling':
    prompts = render_prompts_lm(item, continuation_delimiter, fewshot_examples)
    tokens, start_idxs, end_idxs = batch_sequences_lm(tokenizer, prompts)
    # Check if all predicted tokens match actual tokens
    si, ei = start_idxs[0], end_idxs[0]
    predicted_tokens = predictions[0, si-1:ei-1]
    actual_tokens = input_ids[0, si:ei]
    is_correct = torch.all(predicted_tokens == actual_tokens).item()

Centered Metrics

Raw accuracy can be misleading when random guessing achieves high scores:

# Example: 4-choice multiple choice
raw_accuracy = 0.30  # 30% correct
random_baseline = 0.25  # 25% by guessing
 
# Centered accuracy
centered = (raw_accuracy - random_baseline) / (1.0 - random_baseline)
# centered = (0.30 - 0.25) / (1.0 - 0.25) ≈ 0.067 (6.7% of the way from random to perfect)

Interpretation:

  • 0.0 = Random guessing
  • 1.0 = Perfect performance
  • Negative = Worse than random (usually a sign of a bug in the model or the evaluation code)

The CORE metric is the mean of centered accuracies across all 11 tasks.
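
As a small worked sketch of that aggregation, the code below centers each task's accuracy against its random baseline and averages the results. The task names and numbers are invented for illustration, and `core_style_score` is a hypothetical helper, not a function from nanochat.

```python
def centered_accuracy(raw_accuracy, random_baseline):
    """Rescale accuracy so that 0 = random guessing and 1 = perfect."""
    return (raw_accuracy - random_baseline) / (1.0 - random_baseline)


def core_style_score(results):
    """results: dict task_name -> (raw_accuracy, random_baseline).
    Returns (mean centered accuracy, per-task centered accuracies)."""
    centered = {
        name: centered_accuracy(acc, base)
        for name, (acc, base) in results.items()
    }
    return sum(centered.values()) / len(centered), centered


# Made-up numbers for illustration only.
results = {
    "four_choice_task": (0.30, 0.25),
    "yes_no_task": (0.75, 0.50),
    "open_ended_task": (0.20, 0.00),
}
score, per_task = core_style_score(results)
print(f"{score:.3f}", {k: round(v, 3) for k, v in per_task.items()})
```

Note how the yes/no task's 75% raw accuracy and the four-choice task's 30% end up on a comparable scale once their different baselines are removed.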

Building a Custom Multiple Choice Task

Example: Custom Medical Diagnosis Task

Building a medical diagnosis task from scratch:

from datasets import load_dataset
from tasks.common import Task, render_mc
 
class MedicalDiagnosis(Task):
    """Evaluate medical diagnosis from symptoms"""
    
    def __init__(self, split, **kwargs):
        super().__init__(**kwargs)
        assert split in ["train", "test"], "split must be train|test"
        # Assume you have a dataset with structure:
        # {"symptoms": str, "diagnosis": str, "options": List[str]}
        self.ds = load_dataset("your_org/medical_diagnosis", split=split)
        self.letters = ('A', 'B', 'C', 'D', 'E')  # Up to 5 options
    
    @property
    def eval_type(self):
        return 'categorical'
    
    def num_examples(self):
        return len(self.ds)
    
    def get_example(self, index):
        row = self.ds[index]
        symptoms = row['symptoms']
        diagnosis = row['diagnosis']
        options = row['options']  # List of possible diagnoses
        
        # Find which option is correct
        assert len(options) <= len(self.letters), "Too many options for the available letters"
        assert diagnosis in options, "Correct diagnosis not in options!"
        answer_idx = options.index(diagnosis)
        
        # Render as multiple choice
        question = f"Given the following symptoms:\n{symptoms}\n\nWhat is the most likely diagnosis?"
        user_message = render_mc(question, self.letters[:len(options)], options)
        assistant_message = self.letters[answer_idx]
        
        return {
            "messages": [
                {"role": "user", "content": user_message},
                {"role": "assistant", "content": assistant_message}
            ],
            "letters": self.letters[:len(options)],
            "specialty": row.get('specialty', 'general'),  # For grouping results
        }
    
    def evaluate(self, conversation, assistant_response):
        assert assistant_response in conversation['letters']
        correct_answer = conversation['messages'][-1]['content']
        return assistant_response == correct_answer

Usage:

from tasks.medical_diagnosis import MedicalDiagnosis
 
# Create task
task = MedicalDiagnosis(split="test")
 
# Evaluate (from scripts/chat_eval.py pattern)
from scripts.chat_eval import run_categorical_eval
 
accuracy = run_categorical_eval(
    task_object=task,
    tokenizer=tokenizer,
    model=model,
    batch_size=8,
    max_problems=None  # Evaluate all
)
 
print(f"Medical Diagnosis Accuracy: {accuracy:.2%}")

Building a Custom Generative Task

Example: Custom Code Generation Task

For domain-specific code generation:

from datasets import load_dataset
from tasks.common import Task
from nanochat.execution import execute_code
 
class CustomCodeGen(Task):
    """Evaluate code generation for your specific domain"""
    
    def __init__(self, split, **kwargs):
        super().__init__(**kwargs)
        # Load your custom code dataset
        # Structure: {"prompt": str, "solution": str, "tests": str, "imports": str}
        self.ds = load_dataset("your_org/custom_code", split=split)
    
    @property
    def eval_type(self):
        return 'generative'
    
    def num_examples(self):
        return len(self.ds)
    
    def get_example(self, index):
        row = self.ds[index]
        prompt = row['prompt']        # Function description or stub
        solution = row['solution']    # Reference solution
        tests = row['tests']          # Test cases
        imports = row['imports']      # Required imports
        
        return {
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": solution},
            ],
            "tests": tests,
            "imports": imports,
        }
    
    def evaluate(self, conversation, completion):
        """Execute generated code and check if tests pass"""
        # Extract code from markdown blocks
        code = self.extract_code(completion)
        
        # Build executable program
        program = (
            conversation['imports'] +
            "\n\n" +
            code +
            "\n\n" +
            conversation['tests']
        )
        
        # Execute safely
        result = execute_code(
            program,
            timeout=5.0,
            maximum_memory_bytes=256 * 1024 * 1024  # 256MB
        )
        
        return result.success
    
    def extract_code(self, completion):
        """Extract code from LLM output"""
        import re
        # Try to find fenced code blocks (optionally tagged as python)
        pattern = r'```(?:python)?\s*\n(.*?)\n```'
        matches = re.findall(pattern, completion, re.DOTALL)
        if matches:
            return matches[0].strip()
        # Fall back to whole completion
        return completion.strip()

Advanced: Partial Credit

Instead of binary pass/fail, award partial credit:

def evaluate(self, conversation, completion):
    """Award partial credit based on number of passing tests"""
    code = self.extract_code(completion)
    # parse_tests is a helper you supply: it splits the test string into
    # a list of individually runnable test snippets
    test_cases = self.parse_tests(conversation['tests'])
    if not test_cases:
        return 0.0  # Nothing to run; avoid dividing by zero
    
    passing_tests = 0
    for test_case in test_cases:
        program = (
            conversation['imports'] +
            "\n\n" +
            code +
            "\n\n" +
            test_case
        )
        result = execute_code(program, timeout=2.0)
        if result.success:
            passing_tests += 1
    
    # Return fraction of tests passed
    return passing_tests / len(test_cases)
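
The `parse_tests` helper is not shown in this post. One possible sketch, assuming the tests are written as top-level `assert` statements (tests wrapped in a `check(candidate)` function would need different handling), splits the test string with Python's `ast` module and repeats any setup code in front of each assertion:

```python
import ast


def parse_tests(tests):
    """Split a test script into one runnable snippet per top-level assert.

    Non-assert statements (imports, fixtures, helper functions) are treated
    as shared setup and prepended to every snippet.
    """
    tree = ast.parse(tests)
    setup, cases = [], []
    for node in tree.body:
        segment = ast.get_source_segment(tests, node)
        if isinstance(node, ast.Assert):
            cases.append(segment)
        else:
            setup.append(segment)
    prefix = "\n".join(setup)
    return [f"{prefix}\n{case}" if prefix else case for case in cases]


example = "import math\nassert add(1, 2) == 3\nassert add(0, 0) == 0"
for snippet in parse_tests(example):
    print(repr(snippet))
```

Each returned snippet can be appended to the generated code and executed on its own, so one failing assertion does not hide the results of the others.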

Safe Code Execution

The Sandbox

nanochat includes a sandboxed execution environment for running untrusted code from LLMs:

From nanochat/execution.py:

def execute_code(
    code: str,
    timeout: float = 5.0,
    maximum_memory_bytes: Optional[int] = 256 * 1024 * 1024,
) -> ExecutionResult:
    """
    Execute Python code in a sandboxed environment.
    
    Safety features:
    - Runs in separate process (can be killed)
    - Time limit (default 5 seconds)
    - Memory limit (default 256MB)
    - Temporary directory (auto-cleaned)
    - Disabled dangerous functions (os.system, subprocess, etc.)
    """

What's protected:

def reliability_guard(maximum_memory_bytes):
    # Disable exit/quit
    builtins.exit = None
    builtins.quit = None
    
    # Disable dangerous OS operations
    os.kill = None
    os.system = None
    os.remove = None
    os.fork = None
    
    # Disable subprocess
    subprocess.Popen = None
    
    # Disable filesystem manipulation
    shutil.rmtree = None
    
    # Set memory limits (skipped when no limit is requested)
    if maximum_memory_bytes is not None:
        resource.setrlimit(resource.RLIMIT_AS, (maximum_memory_bytes, maximum_memory_bytes))

What's NOT protected:

  • Network access (sockets can be opened)
  • Python's dynamic features (ctypes, etc.)
  • No kernel-level isolation

Recommendation: Use this sandbox for evaluation, but for production systems serving untrusted code, use proper containerization (Docker with limited capabilities, gVisor, Firecracker, etc.).
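
To illustrate just the "separate process plus time limit" part of the design, here is a standalone sketch built on the standard library. It is not nanochat's `execute_code`; `run_untrusted` is a hypothetical name, and like the sandbox above it provides no kernel-level isolation.

```python
import subprocess
import sys
import tempfile


def run_untrusted(code, timeout=5.0):
    """Run `code` in a fresh Python process inside a temporary directory.

    Gives process isolation and a time limit only; this is NOT a
    security boundary.
    """
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: isolated mode
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout,  # child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return {"success": False, "stdout": "", "stderr": "", "timeout": True}
    return {
        "success": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "timeout": False,
    }


print(run_untrusted("print('hello')"))
print(run_untrusted("while True: pass", timeout=1.0)["timeout"])
```

Running each program in its own process means a crash, infinite loop, or `sys.exit()` in generated code cannot take down the evaluation harness itself.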

Execution Result Handling

result = execute_code("print('hello'); 1/0")
 
# Check result
if result.success:
    print(f"Output: {result.stdout}")
else:
    print(f"Error: {result.error}")
    if result.timeout:
        print("Execution timed out")
    if result.memory_exceeded:
        print("Memory limit exceeded")

Example outputs:

# Success
ExecutionResult(success=True, stdout='hello world\n', stderr='')
 
# Timeout
ExecutionResult(success=False, timeout=True, error='Execution timed out')
 
# Runtime error
ExecutionResult(success=False, error='ZeroDivisionError: division by zero')
 
# Memory exceeded
ExecutionResult(success=False, memory_exceeded=True, error='Memory limit exceeded: ...')

Prompt Engineering for Evaluation

Multiple Choice Rendering

The render_mc() function creates standardized multiple choice prompts:

From tasks/common.py:

def render_mc(question, letters, choices):
    """
    Important design decisions:
    1) Letter AFTER choice (better binding for small models)
    2) No whitespace before letter (tokenization consistency)
    """
    query = f"Multiple Choice question: {question}\n"
    query += "".join([f"- {choice}={letter}\n" for letter, choice in zip(letters, choices)])
    query += "\nRespond only with the letter of the correct answer."
    return query

Example output:

Multiple Choice question: What is the capital of France?
- Paris=A
- London=B
- Berlin=C
- Madrid=D

Respond only with the letter of the correct answer.

Why this format?

  1. Letter after choice: Smaller models bind better when the letter comes after
  2. No space before letter: Tokenizer treats "=A" consistently (not "= A" or " A")
  3. Explicit instruction: "Respond only with the letter" reduces verbosity

Few-Shot Prompting

CORE uses few-shot examples for consistency:

from jinja2 import Template

def render_prompts_mc(item, continuation_delimiter, fewshot_examples=None):
    template_str = """
{%- for example in fewshot_examples -%}
{{ example.query }}{{ continuation_delimiter }}{{ example.choices[example.gold] }}
 
{% endfor -%}
{{ item.query }}{{ continuation_delimiter }}{{ choice }}""".strip()
    
    template = Template(template_str)
    fewshot_examples = fewshot_examples or []
    prompts = [template.render(choice=choice, item=item, fewshot_examples=fewshot_examples, continuation_delimiter=continuation_delimiter) 
               for choice in item['choices']]
    return prompts

Example with 2-shot:

Q: What is 2+2?
A: 4

Q: What is the capital of Germany?
A: Berlin

Q: What is the chemical symbol for gold?
A: Au

Best practices:

  • Use 0-5 shots (more doesn't always help)
  • Sample fewshot examples randomly but deterministically (seed-based)
  • Exclude current example from fewshot pool
  • Match format exactly (same template for fewshot and test)
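
The second and third practices can be sketched in a few lines: seed a private random generator from the example's index, and draw from a pool that leaves the current example out. The function name and the base seed value are illustrative assumptions.

```python
import random


def sample_fewshot_indices(num_items, current_idx, num_fewshot, base_seed=1234):
    """Pick few-shot example indices for item `current_idx`.

    Seeding with base_seed + current_idx makes the choice reproducible
    across runs while still varying from item to item.
    """
    rng = random.Random(base_seed + current_idx)
    pool = [i for i in range(num_items) if i != current_idx]
    return rng.sample(pool, min(num_fewshot, len(pool)))


first = sample_fewshot_indices(num_items=100, current_idx=7, num_fewshot=5)
second = sample_fewshot_indices(num_items=100, current_idx=7, num_fewshot=5)
print(first == second, 7 in first)
```

Using a dedicated `random.Random` instance avoids touching the global random state, so few-shot selection stays stable even if other code draws random numbers in between.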

Distributed Evaluation

Multi-GPU Evaluation

From nanochat/core_eval.py:

def evaluate_task(model, tokenizer, data, device, task_meta):
    """Evaluate task across multiple GPUs"""
    rank = dist.get_rank() if dist.is_initialized() else 0
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    
    correct = torch.zeros(len(data), dtype=torch.float32, device=device)
    
    # Each rank processes different examples
    for idx in range(rank, len(data), world_size):
        is_correct = evaluate_example(idx, model, tokenizer, data, device, task_meta)
        correct[idx] = float(is_correct)
    
    # Synchronize results across ranks
    if world_size > 1:
        dist.barrier()
        dist.all_reduce(correct, op=dist.ReduceOp.SUM)
    
    # Compute mean accuracy
    mean_correct = correct.mean().item()
    return mean_correct

How it works:

  1. Stride distribution: With 8 ranks, rank 0 gets examples [0, 8, 16, ...], rank 1 gets [1, 9, 17, ...], etc.
  2. Local evaluation: Each rank evaluates its assigned examples
  3. Synchronization: all_reduce sums results across ranks
  4. Global mean: Final accuracy computed from aggregated results
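
The four steps can be simulated on a single machine with plain lists, which makes it easy to see why the summed buffer is correct: every slot is written by exactly one rank and is zero everywhere else. This is a toy model of the pattern, not code that uses `torch.distributed`.

```python
def simulate_strided_eval(labels, world_size):
    """labels[i] is 1.0 if example i is answered correctly, else 0.0.

    Mimics each rank filling in only its own slots, then summing the
    per-rank buffers (the role all_reduce with SUM plays).
    """
    n = len(labels)
    per_rank = []
    for rank in range(world_size):
        buf = [0.0] * n  # every rank starts from zeros
        for idx in range(rank, n, world_size):
            buf[idx] = labels[idx]  # only this rank's stride
        per_rank.append(buf)
    reduced = [sum(col) for col in zip(*per_rank)]  # element-wise sum
    return sum(reduced) / n


labels = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
print(simulate_strided_eval(labels, 1), simulate_strided_eval(labels, 4))
```

The result is independent of the number of ranks, which is a useful property to assert when testing a distributed evaluation loop.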

Launch evaluation:

# Single GPU
python scripts/base_eval.py
 
# 8 GPUs (up to ~8x speedup)
torchrun --nproc_per_node=8 scripts/base_eval.py

Task Mixtures for Training

Combining Multiple Tasks

Use TaskMixture to create multi-task datasets:

from tasks.common import TaskMixture
from tasks.arc import ARC
from tasks.gsm8k import GSM8K
from tasks.medical_diagnosis import MedicalDiagnosis
 
train_ds = TaskMixture([
    ARC(subset="ARC-Easy", split="train"),      # 2.3K examples
    GSM8K(subset="main", split="train"),         # 8K examples
    MedicalDiagnosis(split="train"),             # Your custom task
])
 
# Access examples (automatically shuffled)
for i in range(len(train_ds)):
    conversation = train_ds[i]
    # Train on this conversation

TaskMixture features:

  1. Deterministic shuffling: Same seed = same ordering (reproducible)
  2. Uniform sampling: All tasks mixed throughout training (prevents forgetting)
  3. Simple oversampling: Include a task multiple times to oversample it

# Oversample medical diagnosis 3x
train_ds = TaskMixture([
    ARC(subset="ARC-Easy", split="train"),
    GSM8K(subset="main", split="train"),
    MedicalDiagnosis(split="train"),
    MedicalDiagnosis(split="train"),  # 2x
    MedicalDiagnosis(split="train"),  # 3x
])
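
A mixture with these properties can be sketched as a shuffled list of (task, example) index pairs. The class below is an illustration under that assumption, with plain lists standing in for Task objects; it is not nanochat's `TaskMixture` implementation, and the seed value is arbitrary.

```python
import random


class MixtureSketch:
    """Illustrative mixture over indexable tasks."""

    def __init__(self, tasks, seed=42):
        self.tasks = tasks
        # One (task_idx, local_idx) entry per example; a task listed twice
        # contributes its entries twice, which is how oversampling falls out.
        self.index_map = [
            (t, i) for t, task in enumerate(tasks) for i in range(len(task))
        ]
        random.Random(seed).shuffle(self.index_map)  # fixed seed, fixed order

    def __len__(self):
        return len(self.index_map)

    def __getitem__(self, index):
        task_idx, local_idx = self.index_map[index]
        return self.tasks[task_idx][local_idx]


# Plain lists stand in for Task objects here.
arc = [f"arc-{i}" for i in range(4)]
med = [f"med-{i}" for i in range(2)]
mix = MixtureSketch([arc, med, med])  # medical examples appear twice
print(len(mix), [mix[i] for i in range(len(mix))])
```

Because only index pairs are shuffled, the underlying datasets are never copied or reordered.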

Best Practices

1. Data Quality

Good evaluation data:

  • Representative of real use cases
  • Balanced difficulty (not all easy/hard)
  • Diverse examples (covers edge cases)
  • High-quality labels (verified correct)
  • Clear success criteria

Bad evaluation data:

  • Artificial or contrived examples
  • Ambiguous questions
  • Multiple valid answers (but only one labeled)
  • Label noise or errors

2. Prompt Design

Do:

  • Use clear, specific instructions
  • Provide examples (few-shot) when helpful
  • Match training format
  • Test prompts on small sample first

Don't:

  • Change format between train and eval
  • Use ambiguous wording
  • Assume model knows implicit conventions
  • Over-complicate prompts

3. Evaluation Metrics

For classification tasks:

  • Accuracy (overall)
  • Per-class precision/recall
  • Confusion matrix
  • Calibration (confidence vs. correctness)

For generation tasks:

  • Exact match
  • F1 score (token overlap)
  • BLEU/ROUGE (for summarization)
  • Pass@k (for code)
  • Human evaluation (gold standard)
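
For Pass@k, a commonly used unbiased estimator draws n samples per problem, counts the c that pass, and computes 1 - C(n-c, k) / C(n, k). A minimal sketch (the function name is illustrative):

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=3, k=1), pass_at_k(n=10, c=3, k=5))
```

Average this value over all problems to get the benchmark-level score. With k=1 it reduces to the plain fraction of passing samples.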

4. Error Analysis

Always analyze failures:

# Collect failures
# generate_response is a placeholder for your own sampling/inference function
failures = []
for i in range(len(task)):
    conversation = task[i]
    response = generate_response(conversation)
    is_correct = task.evaluate(conversation, response)
    
    if not is_correct:
        failures.append({
            "index": i,
            "conversation": conversation,
            "response": response,
            "expected": conversation['messages'][-1]['content'],
        })
 
# Analyze patterns
print(f"Failure rate: {len(failures) / len(task):.2%}")
print("\nSample failures:")
for failure in failures[:5]:
    print(f"\nQuestion: {failure['conversation']['messages'][0]['content']}")
    print(f"Expected: {failure['expected']}")
    print(f"Got: {failure['response']}")

Common Pitfalls

1. Data Leakage

Problem: Test examples appear in training data

Solution:

  • Use different splits (train/val/test)
  • Check for near-duplicates
  • Time-based splits for temporal data
  • Hold out test set completely

2. Prompt Sensitivity

Problem: Small prompt changes cause large metric changes

Solution:

  • Test multiple prompt variations
  • Report mean and standard deviation
  • Use few-shot examples
  • Standardize format

3. Metric Gaming

Problem: Model exploits metric without solving task

Example: Model learns to always output "A" on multiple choice if "A" is most common

Solution:

  • Balance datasets (equal distribution of answers)
  • Use multiple complementary metrics
  • Manual inspection of samples
  • Adversarial examples
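
A quick way to detect this kind of shortcut is to compute what a constant-answer strategy would score on your test set and compare it with the model's accuracy. A minimal sketch with made-up labels:

```python
from collections import Counter


def majority_baseline(gold_answers):
    """Accuracy of always predicting the most common gold answer."""
    counts = Counter(gold_answers)
    label, count = counts.most_common(1)[0]
    return label, count / len(gold_answers)


gold = ["A", "A", "B", "A", "C", "A", "D", "B"]
label, acc = majority_baseline(gold)
print(label, acc)
```

If the model's accuracy is close to this baseline, it may have learned the answer distribution rather than the task.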

4. Evaluation Bugs

Problem: Bug in evaluation code inflates/deflates scores

Solution:

  • Unit test evaluation logic
  • Verify with reference implementations
  • Check edge cases (empty strings, special characters)
  • Manual verification on small sample

Conclusion

Building custom evaluation tasks is essential for measuring what matters in your domain. The key principles:

  1. Choose the right eval type: Categorical for constrained choices, generative for open-ended tasks
  2. Design clear prompts: Unambiguous instructions, consistent format
  3. Define success criteria: Exact match, fuzzy match, execution-based, or multi-faceted scoring
  4. Analyze failures: Understand where and why the model fails
  5. Use safe execution: Sandbox untrusted code from LLMs
  6. Distribute evaluation: Speed up with multi-GPU evaluation

The nanochat evaluation framework provides all the building blocks—Task abstraction, CORE benchmark patterns, safe execution, and distributed evaluation. By extending these patterns to your domain, you can build comprehensive benchmarks that drive model improvements where they matter most.

The next post explores tokenizer design choices: how vocabulary size, regex patterns, and special tokens affect model performance and training efficiency.

Part of the nanochat Deep-Dive Series • Track 2: Practical Guides

GitHub: nanochat repository
Tasks: tasks/
CORE Eval: nanochat/core_eval.py

TIP

Domain-specific evaluation is often more valuable than general benchmarks. Invest in building high-quality evaluation tasks for your use case—they'll guide model development and measure real-world impact.