Building Custom Evaluation Tasks

Track 2: Practical Guides - Post 2.4 of 6
This post builds on Reinforcement Learning from Human Feedback.
Introduction
You've trained a model, fine-tuned it for chat, and optimized it with RL. But how do you know if it's actually good at what you care about? Standard benchmarks like MMLU and GSM8K are useful, but they measure general capabilities—not your specific use case.
Custom evaluation tasks let you measure what matters for your application: medical diagnosis accuracy, legal document analysis, code generation for your codebase, or any domain-specific skill.
This post covers nanochat's evaluation framework and how to design, implement, and run custom benchmarks:
- The three evaluation task types: multiple choice, schema, and language modeling
- How the CORE benchmark framework works
- Building custom tasks for any domain
- Best practices for prompt engineering in evaluation
- Code execution sandboxing for safe evaluation
- Distributed evaluation for large benchmarks
The Task Abstraction
Base Task Class
All evaluation tasks in nanochat inherit from Task:
class Task:
def __init__(self, start=0, stop=None, step=1):
# Allows lightweight slicing over the dataset
self.start = start
self.stop = stop
self.step = step
@property
def eval_type(self):
# one of 'generative' | 'categorical'
raise NotImplementedError
def num_examples(self):
raise NotImplementedError
def get_example(self, index):
# Returns a conversation dict
raise NotImplementedError
def evaluate(self, conversation, assistant_response):
# Returns success (bool or float)
        raise NotImplementedError

Key design principles:
- Lightweight slicing: Create views over datasets without copying data
- Lazy loading: Examples are fetched on demand via get_example()
- Two evaluation modes: categorical (constrained choices) or generative (free-form)
- Conversation format: All tasks return standardized conversation dicts (see the toy example below)
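To make the interface concrete, here is a minimal toy subclass (not part of nanochat, purely illustrative) that wires the four methods together. The "letters" field names the candidate responses a categorical evaluator would score:

```python
from tasks.common import Task  # base class shown above

class ToyYesNo(Task):
    """Tiny two-example yes/no task, for illustration only."""
    _data = [
        {"q": "Is the sky blue on a clear day?", "a": "Yes"},
        {"q": "Do penguins fly?", "a": "No"},
    ]

    @property
    def eval_type(self):
        return 'categorical'

    def num_examples(self):
        return len(self._data)

    def get_example(self, index):
        row = self._data[index]
        return {
            "messages": [
                {"role": "user", "content": f"{row['q']} Answer Yes or No."},
                {"role": "assistant", "content": row['a']},
            ],
            "letters": ("Yes", "No"),  # candidate answers for categorical scoring
        }

    def evaluate(self, conversation, assistant_response):
        return assistant_response == conversation['messages'][-1]['content']
```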
The Two Evaluation Modes
Categorical Evaluation
When to use: Multiple choice, yes/no, classification tasks
How it works: Model assigns probabilities to predefined options, chooses highest probability
Example from MMLU:
@property
def eval_type(self):
return 'categorical'
def get_example(self, index):
row = self.ds[index]
question = row["question"]
choices = row["choices"] # ["Choice A", "Choice B", "Choice C", "Choice D"]
answer = row["answer"] # 0, 1, 2, or 3
user_message = render_mc(question, ['A', 'B', 'C', 'D'], choices)
assistant_message = ['A', 'B', 'C', 'D'][answer]
return {
"messages": [
{"role": "user", "content": user_message},
{"role": "assistant", "content": assistant_message}
],
"letters": ['A', 'B', 'C', 'D'], # For evaluation
}
def evaluate(self, conversation, assistant_response):
correct_answer = conversation['messages'][-1]['content']
    return assistant_response == correct_answer

Advantages:
- Fast (no sampling required, batch multiple questions)
- Deterministic (no temperature variation)
- Easy to score (exact match)
Generative Evaluation
When to use: Open-ended tasks (code generation, math, creative writing)
How it works: Model generates free-form text, evaluated against success criteria
Example from HumanEval:
@property
def eval_type(self):
return 'generative'
def get_example(self, index):
row = self.ds[index]
prompt = row['prompt'] # Function signature
solution = row['canonical_solution']
test = row['test'] # Test cases
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": f"{prompt}\n{solution}"},
],
"entry_point": row['entry_point'],
"test": test,
}
def evaluate(self, conversation, completion):
# Extract code from completion
code = extract_program(completion)
# Build executable program
    program = (
        imports +
        "\n\n" +
        code +
        "\n\n" +
        conversation['test'] +
        "\n" +
        f"check({conversation['entry_point']})"
    )
# Execute and check
result = execute_code(program)
    return result.success

Advantages:
- Flexible (measures complex capabilities)
- Realistic (mimics actual use)
- Rich feedback (can analyze failure modes)
The CORE Benchmark Framework
What is CORE?
CORE is an 11-task benchmark from the DCLM paper that evaluates base models across diverse capabilities with minimal compute.
From nanochat/core_eval.py:
Tasks:
1. ARC (easy/challenge) - Science reasoning
2. HellaSwag - Commonsense reasoning
3. MMLU - Multitask knowledge
4. OpenBookQA - Elementary science
5. PIQA - Physical reasoning
6. SIQA - Social reasoning
7. WinoGrande - Coreference resolution
8. BoolQ - Yes/no questions
9. COPA - Causal reasoning
10. StoryCloze - Story completion
11. SQuAD - Reading comprehension
Why CORE matters:
- Coverage: Tests diverse reasoning types
- Efficiency: 11 tasks vs. dozens in full benchmarks
- Correlation: CORE score correlates with broader evaluation suites
- Centered metric: Accounts for random baselines (0=random, 1=perfect)
Three Task Types in CORE
CORE defines three fundamental task structures:
1. Multiple Choice Tasks
Structure: Same context, different continuations
Question: What is the capital of France?
- Paris=A
- London=B
- Berlin=C
- Madrid=D
Respond only with the letter of the correct answer.
Evaluation method: Compare log probabilities of each continuation, choose lowest loss (highest probability)
From nanochat/core_eval.py:
if task_type == 'multiple_choice':
prompts = render_prompts_mc(item, continuation_delimiter, fewshot_examples)
tokens, start_idxs, end_idxs = batch_sequences_mc(tokenizer, prompts)
# ...
# Find option with lowest average loss
mean_losses = [losses[i, si-1:ei-1].mean().item()
for i, (si, ei) in enumerate(zip(start_idxs, end_idxs))]
pred_idx = mean_losses.index(min(mean_losses))
    is_correct = pred_idx == item['gold']

Key insight: We evaluate log probabilities, not generated text. This is much faster and avoids issues with generation formatting.
2. Schema Tasks
Structure: Different contexts, same continuation
Context A: "The dog barked loudly."
Context B: "The dog slept quietly."
Context C: "The dog ran quickly."
Continuation: " It was happy."
Which context is most likely?
Use case: Sentence completion, coreference resolution
Evaluation method: Similar to multiple choice, but context varies instead of continuation
if task_type == 'schema':
prompts = render_prompts_schema(item, continuation_delimiter, fewshot_examples)
tokens, start_idxs, end_idxs = batch_sequences_schema(tokenizer, prompts)
# Find context with lowest loss for the continuation
mean_losses = [losses[i, si-1:ei-1].mean().item()
for i, (si, ei) in enumerate(zip(start_idxs, end_idxs))]
pred_idx = mean_losses.index(min(mean_losses))
    is_correct = pred_idx == item['gold']

3. Language Modeling Tasks
Structure: Context + continuation, evaluate continuation likelihood
Context: "The capital of France is"
Continuation: " Paris"
Check if model assigns high probability to continuation
Use case: Reading comprehension, factual knowledge
Evaluation method: Check if argmax predictions match actual tokens
if task_type == 'language_modeling':
prompts = render_prompts_lm(item, continuation_delimiter, fewshot_examples)
tokens, start_idxs, end_idxs = batch_sequences_lm(tokenizer, prompts)
# Check if all predicted tokens match actual tokens
si, ei = start_idxs[0], end_idxs[0]
predicted_tokens = predictions[0, si-1:ei-1]
actual_tokens = input_ids[0, si:ei]
    is_correct = torch.all(predicted_tokens == actual_tokens).item()

Centered Metrics
Raw accuracy can be misleading when random guessing achieves high scores:
# Example: 4-choice multiple choice
raw_accuracy = 0.30 # 30% correct
random_baseline = 0.25 # 25% by guessing
# Centered accuracy
centered = (raw_accuracy - random_baseline) / (1.0 - random_baseline)
# centered = (0.30 - 0.25) / (1.0 - 0.25) = 0.067 (6.7% above random)

Interpretation:
- 0.0 = Random guessing
- 1.0 = Perfect performance
- Negative = Worse than random (model is broken)
The CORE metric is the mean of centered accuracies across all 11 tasks.
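As a quick sketch of that aggregation (with made-up numbers; the baseline for each task depends on its number of options):

```python
def centered_accuracy(raw_accuracy, random_baseline):
    """Map raw accuracy to 0 = random guessing, 1 = perfect."""
    return (raw_accuracy - random_baseline) / (1.0 - random_baseline)

# Illustrative numbers only -- a real run records one (accuracy, baseline) pair per CORE task
task_results = {
    "mmlu": (0.30, 0.25),        # 4-way multiple choice
    "hellaswag": (0.42, 0.25),   # 4 candidate endings
    "boolq": (0.65, 0.50),       # yes/no
}

core_score = sum(centered_accuracy(acc, base) for acc, base in task_results.values()) / len(task_results)
print(f"CORE-style score: {core_score:.3f}")
```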
Building a Custom Multiple Choice Task
Example: Custom Medical Diagnosis Task
Building a medical diagnosis task from scratch:
from datasets import load_dataset
from tasks.common import Task, render_mc
class MedicalDiagnosis(Task):
"""Evaluate medical diagnosis from symptoms"""
def __init__(self, split, **kwargs):
super().__init__(**kwargs)
assert split in ["train", "test"], "split must be train|test"
# Assume you have a dataset with structure:
# {"symptoms": str, "diagnosis": str, "options": List[str]}
self.ds = load_dataset("your_org/medical_diagnosis", split=split)
self.letters = ('A', 'B', 'C', 'D', 'E') # Up to 5 options
@property
def eval_type(self):
return 'categorical'
def num_examples(self):
return len(self.ds)
def get_example(self, index):
row = self.ds[index]
symptoms = row['symptoms']
diagnosis = row['diagnosis']
options = row['options'] # List of possible diagnoses
# Find which option is correct
assert diagnosis in options, f"Correct diagnosis not in options!"
answer_idx = options.index(diagnosis)
# Render as multiple choice
        question = f"Given the following symptoms:\n{symptoms}\n\nWhat is the most likely diagnosis?"
user_message = render_mc(question, self.letters[:len(options)], options)
assistant_message = self.letters[answer_idx]
return {
"messages": [
{"role": "user", "content": user_message},
{"role": "assistant", "content": assistant_message}
],
"letters": self.letters[:len(options)],
"specialty": row.get('specialty', 'general'), # For grouping results
}
def evaluate(self, conversation, assistant_response):
assert assistant_response in conversation['letters']
correct_answer = conversation['messages'][-1]['content']
        return assistant_response == correct_answer

Usage:
from tasks.medical_diagnosis import MedicalDiagnosis
# Create task
task = MedicalDiagnosis(split="test")
# Evaluate (from scripts/chat_eval.py pattern)
from scripts.chat_eval import run_categorical_eval
accuracy = run_categorical_eval(
task_object=task,
tokenizer=tokenizer,
model=model,
batch_size=8,
max_problems=None # Evaluate all
)
print(f"Medical Diagnosis Accuracy: {accuracy:.2%}")

Building a Custom Generative Task
Example: Custom Code Generation Task
For domain-specific code generation:
from datasets import load_dataset
from tasks.common import Task
from nanochat.execution import execute_code
class CustomCodeGen(Task):
"""Evaluate code generation for your specific domain"""
def __init__(self, split, **kwargs):
super().__init__(**kwargs)
# Load your custom code dataset
# Structure: {"prompt": str, "solution": str, "tests": str, "imports": str}
self.ds = load_dataset("your_org/custom_code", split=split)
@property
def eval_type(self):
return 'generative'
def num_examples(self):
return len(self.ds)
def get_example(self, index):
row = self.ds[index]
prompt = row['prompt'] # Function description or stub
solution = row['solution'] # Reference solution
tests = row['tests'] # Test cases
imports = row['imports'] # Required imports
return {
"messages": [
{"role": "user", "content": prompt},
{"role": "assistant", "content": solution},
],
"tests": tests,
"imports": imports,
}
def evaluate(self, conversation, completion):
"""Execute generated code and check if tests pass"""
# Extract code from markdown blocks
code = self.extract_code(completion)
# Build executable program
        program = (
            conversation['imports'] +
            "\n\n" +
            code +
            "\n\n" +
            conversation['tests']
        )
# Execute safely
result = execute_code(
program,
timeout=5.0,
maximum_memory_bytes=256 * 1024 * 1024 # 256MB
)
return result.success
def extract_code(self, completion):
"""Extract code from LLM output"""
import re
# Try to find code blocks
        pattern = r'```(?:python)?\s*\n(.*?)\n```'
matches = re.findall(pattern, completion, re.DOTALL)
if matches:
return matches[0].strip()
# Fall back to whole completion
        return completion.strip()

Advanced: Partial Credit
Instead of binary pass/fail, award partial credit:
def evaluate(self, conversation, completion):
"""Award partial credit based on number of passing tests"""
code = self.extract_code(completion)
    test_cases = self.parse_tests(conversation['tests'])  # split into individual test snippets (parse_tests is a helper you'd implement)
passing_tests = 0
for test_case in test_cases:
        program = (
            conversation['imports'] +
            "\n\n" +
            code +
            "\n\n" +
            test_case
        )
result = execute_code(program, timeout=2.0)
if result.success:
passing_tests += 1
# Return fraction of tests passed
    return passing_tests / len(test_cases)

Safe Code Execution
The Sandbox
nanochat includes a sandboxed execution environment for running untrusted code from LLMs:
From nanochat/execution.py:
def execute_code(
code: str,
timeout: float = 5.0,
maximum_memory_bytes: Optional[int] = 256 * 1024 * 1024,
) -> ExecutionResult:
"""
Execute Python code in a sandboxed environment.
Safety features:
- Runs in separate process (can be killed)
- Time limit (default 5 seconds)
- Memory limit (default 256MB)
- Temporary directory (auto-cleaned)
    - Disabled dangerous functions (os.system, subprocess, etc.)
    """

What's protected:
def reliability_guard(maximum_memory_bytes):
# Disable exit/quit
builtins.exit = None
builtins.quit = None
# Disable dangerous OS operations
os.kill = None
os.system = None
os.remove = None
os.fork = None
# Disable subprocess
subprocess.Popen = None
# Disable filesystem manipulation
shutil.rmtree = None
# Set memory limits
    resource.setrlimit(resource.RLIMIT_AS, (maximum_memory_bytes, maximum_memory_bytes))

What's NOT protected:
- Network access (sockets can be opened)
- Python's dynamic features (ctypes, etc.)
- No kernel-level isolation
Recommendation: Use this sandbox for evaluation, but for production systems serving untrusted code, use proper containerization (Docker with limited capabilities, gVisor, Firecracker, etc.).
Execution Result Handling
result = execute_code("print('hello'); 1/0")
# Check result
if result.success:
print(f"Output: {result.stdout}")
else:
print(f"Error: {result.error}")
if result.timeout:
print("Execution timed out")
if result.memory_exceeded:
        print("Memory limit exceeded")

Example outputs:
# Success
ExecutionResult(success=True, stdout='hello world\n', stderr='')
# Timeout
ExecutionResult(success=False, timeout=True, error='Execution timed out')
# Runtime error
ExecutionResult(success=False, error='ZeroDivisionError: division by zero')
# Memory exceeded
ExecutionResult(success=False, memory_exceeded=True, error='Memory limit exceeded: ...')

Prompt Engineering for Evaluation
Multiple Choice Rendering
The render_mc() function creates standardized multiple choice prompts:
From tasks/common.py:
def render_mc(question, letters, choices):
"""
Important design decisions:
1) Letter AFTER choice (better binding for small models)
2) No whitespace before letter (tokenization consistency)
"""
    query = f"Multiple Choice question: {question}\n"
    query += "".join([f"- {choice}={letter}\n" for letter, choice in zip(letters, choices)])
    query += "\nRespond only with the letter of the correct answer."
    return query

Example output:
Multiple Choice question: What is the capital of France?
- Paris=A
- London=B
- Berlin=C
- Madrid=D
Respond only with the letter of the correct answer.
Why this format?
- Letter after choice: Smaller models bind better when the letter comes after
- No space before letter: Tokenizer treats "=A" consistently (not "= A" or " A")
- Explicit instruction: "Respond only with the letter" reduces verbosity
Few-Shot Prompting
CORE uses few-shot examples for consistency:
def render_prompts_mc(item, continuation_delimiter, fewshot_examples=None):
template_str = """
{%- for example in fewshot_examples -%}
{{ example.query }}{{ continuation_delimiter }}{{ example.choices[example.gold] }}
{% endfor -%}
{{ item.query }}{{ continuation_delimiter }}{{ choice }}""".strip()
template = Template(template_str)
fewshot_examples = fewshot_examples or []
prompts = [template.render(choice=choice, item=item, fewshot_examples=fewshot_examples, continuation_delimiter=continuation_delimiter)
for choice in item['choices']]
    return prompts

Example with 2-shot:
Q: What is 2+2?
A: 4
Q: What is the capital of Germany?
A: Berlin
Q: What is the chemical symbol for gold?
A: Au
Best practices:
- Use 0-5 shots (more doesn't always help)
- Sample fewshot examples randomly but deterministically (seed-based; see the sketch after this list)
- Exclude current example from fewshot pool
- Match format exactly (same template for fewshot and test)
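A minimal sketch of seed-based few-shot sampling, assuming the dataset supports len() and integer indexing:

```python
import random

def sample_fewshot(dataset, current_index, num_shots, seed=1234):
    """Deterministically pick few-shot examples, excluding the current item."""
    pool = [i for i in range(len(dataset)) if i != current_index]
    rng = random.Random(seed + current_index)  # same seed -> same shots on every run
    chosen = rng.sample(pool, k=min(num_shots, len(pool)))
    return [dataset[i] for i in chosen]
```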
Distributed Evaluation
Multi-GPU Evaluation
From nanochat/core_eval.py:
def evaluate_task(model, tokenizer, data, device, task_meta):
"""Evaluate task across multiple GPUs"""
rank = dist.get_rank() if dist.is_initialized() else 0
world_size = dist.get_world_size() if dist.is_initialized() else 1
correct = torch.zeros(len(data), dtype=torch.float32, device=device)
# Each rank processes different examples
for idx in range(rank, len(data), world_size):
is_correct = evaluate_example(idx, model, tokenizer, data, device, task_meta)
correct[idx] = float(is_correct)
# Synchronize results across ranks
if world_size > 1:
dist.barrier()
dist.all_reduce(correct, op=dist.ReduceOp.SUM)
# Compute mean accuracy
mean_correct = correct.mean().item()
    return mean_correct

How it works:
- Stride distribution: Rank 0 gets examples [0, 8, 16, ...], Rank 1 gets [1, 9, 17, ...], etc.
- Local evaluation: Each rank evaluates its assigned examples
- Synchronization: all_reduce sums results across ranks
- Global mean: Final accuracy computed from aggregated results
Launch evaluation:
# Single GPU
python scripts/base_eval.py
# 8 GPUs (8x speedup)
torchrun --nproc_per_node=8 scripts/base_eval.py

Task Mixtures for Training
Combining Multiple Tasks
Use TaskMixture to create multi-task datasets:
from tasks.common import TaskMixture
from tasks.arc import ARC
from tasks.gsm8k import GSM8K
from tasks.medical_diagnosis import MedicalDiagnosis
train_ds = TaskMixture([
ARC(subset="ARC-Easy", split="train"), # 2.3K examples
GSM8K(subset="main", split="train"), # 8K examples
MedicalDiagnosis(split="train"), # Your custom task
])
# Access examples (automatically shuffled)
for i in range(len(train_ds)):
conversation = train_ds[i]
    # Train on this conversation

TaskMixture features:
- Deterministic shuffling: Same seed = same ordering (reproducible)
- Uniform sampling: All tasks mixed throughout training (prevents forgetting)
- Simple oversampling: Include a task multiple times to oversample
# Oversample medical diagnosis 3x
train_ds = TaskMixture([
ARC(subset="ARC-Easy", split="train"),
GSM8K(subset="main", split="train"),
MedicalDiagnosis(split="train"),
MedicalDiagnosis(split="train"), # 2x
MedicalDiagnosis(split="train"), # 3x
])

Best Practices
1. Data Quality
Good evaluation data:
- Representative of real use cases
- Balanced difficulty (not all easy/hard)
- Diverse examples (covers edge cases)
- High-quality labels (verified correct)
- Clear success criteria
Bad evaluation data:
- Artificial or contrived examples
- Ambiguous questions
- Multiple valid answers (but only one labeled)
- Label noise or errors
2. Prompt Design
Do:
- Use clear, specific instructions
- Provide examples (few-shot) when helpful
- Match training format
- Test prompts on small sample first
Don't:
- Change format between train and eval
- Use ambiguous wording
- Assume model knows implicit conventions
- Over-complicate prompts
3. Evaluation Metrics
For classification tasks:
- Accuracy (overall)
- Per-class precision/recall (sketched after this list)
- Confusion matrix
- Calibration (confidence vs. correctness)
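A small, dependency-free sketch of per-class precision/recall from confusion counts (the example labels are made up):

```python
from collections import Counter, defaultdict

def per_class_report(y_true, y_pred):
    """Per-class precision/recall from gold vs. predicted labels."""
    confusion = defaultdict(Counter)              # confusion[gold][pred] = count
    for gold, pred in zip(y_true, y_pred):
        confusion[gold][pred] += 1
    labels = sorted(set(y_true) | set(y_pred))
    report = {}
    for label in labels:
        tp = confusion[label][label]
        fp = sum(confusion[g][label] for g in labels if g != label)
        fn = sum(confusion[label][p] for p in labels if p != label)
        report[label] = {
            "precision": tp / (tp + fp) if (tp + fp) else 0.0,
            "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        }
    return report

print(per_class_report(["A", "B", "A", "C"], ["A", "B", "B", "C"]))
```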
For generation tasks:
- Exact match
- F1 score (token overlap)
- BLEU/ROUGE (for summarization)
- Pass@k (for code; estimator sketched after this list)
- Human evaluation (gold standard)
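For pass@k, the commonly used unbiased estimator from the HumanEval paper is easy to compute: n is the number of samples drawn per problem and c the number that passed the tests.

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k: estimated probability that at least one of k samples passes."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# e.g. 200 samples per problem, 23 passed -> estimated pass@10
print(f"pass@10 = {pass_at_k(200, 23, 10):.3f}")
```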
4. Error Analysis
Always analyze failures:
# Collect failures
failures = []
for i in range(len(task)):
conversation = task[i]
    response = generate_response(conversation)  # placeholder for your generation/inference helper
is_correct = task.evaluate(conversation, response)
if not is_correct:
failures.append({
"index": i,
"conversation": conversation,
"response": response,
"expected": conversation['messages'][-1]['content'],
})
# Analyze patterns
print(f"Failure rate: {len(failures) / len(task):.2%}")
print("\nSample failures:")
for failure in failures[:5]:
    print(f"\nQuestion: {failure['conversation']['messages'][0]['content']}")
    print(f"Expected: {failure['expected']}")
    print(f"Got: {failure['response']}")

Common Pitfalls
1. Data Leakage
Problem: Test examples appear in training data
Solution:
- Use different splits (train/val/test)
- Check for near-duplicates (see the n-gram overlap sketch below)
- Time-based splits for temporal data
- Hold out test set completely
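One cheap way to flag near-duplicates is word n-gram overlap between test questions and training text. A rough sketch (the threshold, n-gram size, and data here are illustrative):

```python
import re

def normalize(text):
    """Lowercase and strip punctuation so formatting differences don't hide duplicates."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()

def ngram_overlap(train_text, test_text, n=8):
    """Fraction of the test text's word n-grams that also appear in the training text."""
    grams = lambda toks: {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    train_grams, test_grams = grams(normalize(train_text)), grams(normalize(test_text))
    return len(train_grams & test_grams) / len(test_grams) if test_grams else 0.0

train_texts = ["The mitochondria is the powerhouse of the cell, as every textbook says."]
test_questions = ["What is the powerhouse of the cell?",
                  "The mitochondria is the powerhouse of the cell, as every textbook says."]
suspects = [q for q in test_questions
            if any(ngram_overlap(t, q) > 0.8 for t in train_texts)]
print(suspects)  # flags the verbatim overlap, not the short paraphrased question
```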
2. Prompt Sensitivity
Problem: Small prompt changes cause large metric changes
Solution:
- Test multiple prompt variations
- Report mean and standard deviation
- Use few-shot examples
- Standardize format
3. Metric Gaming
Problem: Model exploits metric without solving task
Example: Model learns to always output "A" on multiple choice if "A" is most common
Solution:
- Balance datasets (equal distribution of answers; see the check sketched below)
- Use multiple complementary metrics
- Manual inspection of samples
- Adversarial examples
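A quick sanity check, using the Task interface from earlier, to see whether the gold answers are skewed toward one letter and what an always-guess-the-majority baseline would score:

```python
from collections import Counter
from tasks.medical_diagnosis import MedicalDiagnosis  # any Task subclass works here

task = MedicalDiagnosis(split="test")
gold_letters = [task.get_example(i)["messages"][-1]["content"]
                for i in range(task.num_examples())]
counts = Counter(gold_letters)
total = sum(counts.values())
for letter, count in counts.most_common():
    print(f"{letter}: {count / total:.1%}")

# What would "always predict the most common letter" score?
majority_accuracy = counts.most_common(1)[0][1] / total
print(f"always-majority baseline: {majority_accuracy:.1%}")
```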
4. Evaluation Bugs
Problem: Bug in evaluation code inflates/deflates scores
Solution:
- Unit test evaluation logic (example tests below)
- Verify with reference implementations
- Check edge cases (empty strings, special characters)
- Manual verification on small sample
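For example, a couple of tiny tests for the extract_code helper from the custom code-generation task above, runnable with pytest. The __new__ trick skips __init__ so the tests don't need to download the dataset:

```python
# CustomCodeGen is the class defined above; import it from wherever you saved it.

def make_task():
    # Construct without running __init__ (avoids loading the dataset in tests)
    return CustomCodeGen.__new__(CustomCodeGen)

def test_extract_code_from_markdown_block():
    fence = "`" * 3  # triple backtick, built here to keep the snippet readable
    completion = f"Here you go:\n{fence}python\ndef add(a, b):\n    return a + b\n{fence}"
    assert make_task().extract_code(completion).startswith("def add")

def test_extract_code_plain_text_fallback():
    completion = "def f():\n    return 1"
    assert make_task().extract_code(completion) == completion
```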
Conclusion
Building custom evaluation tasks is essential for measuring what matters in your domain. The key principles:
- Choose the right eval type: Categorical for constrained choices, generative for open-ended tasks
- Design clear prompts: Unambiguous instructions, consistent format
- Define success criteria: Exact match, fuzzy match, execution-based, or multi-faceted scoring
- Analyze failures: Understand where and why the model fails
- Use safe execution: Sandbox untrusted code from LLMs
- Distribute evaluation: Speed up with multi-GPU evaluation
The nanochat evaluation framework provides all the building blocks—Task abstraction, CORE benchmark patterns, safe execution, and distributed evaluation. By extending these patterns to your domain, you can build comprehensive benchmarks that drive model improvements where they matter most.
The next post explores tokenizer design choices: how vocabulary size, regex patterns, and special tokens affect model performance and training efficiency.
Related Posts
Previous in series:
- Reinforcement Learning from Human Feedback - Optimize models with RL
Next in series:
- Tokenizer Design Choices - Vocabulary, BPE, and special tokens
Related posts:
- Fine-tuning for Chat (SFT) - Training on evaluation tasks
- Loss Landscape & Scaling Laws - Evaluation metrics and frameworks
Part of the nanochat Deep-Dive Series • Track 2: Practical Guides
GitHub: nanochat repository
Tasks: tasks/
CORE Eval: nanochat/core_eval.py
TIP: Domain-specific evaluation is often more valuable than general benchmarks. Invest in building high-quality evaluation tasks for your use case—they'll guide model development and measure real-world impact.