Tokenizer Design Choices: BPE, Vocabulary, and Implementation

Track 2: Practical Guides - Post 2.5 of 6
This post explores tokenization fundamentals: BPE algorithm, vocabulary design, dual Rust/Python implementation, special tokens for chat, and tokenizer-agnostic evaluation.
Introduction
Tokenization is the foundational layer of any language model—it determines how text is converted into discrete tokens that the model can process. Poor tokenization choices can cripple model performance, waste compute on suboptimal token representations, and create artifacts in generation.
nanochat implements Byte Pair Encoding (BPE) tokenization in the style of GPT-4, with two complementary implementations:
- RustBPE + tiktoken: Fast training in Rust, ultra-fast inference with tiktoken
- HuggingFace Tokenizers: Fallback for compatibility that needs no local Rust build (the library itself is Rust-backed with Python bindings)
In this post:
- BPE algorithm fundamentals and why it works
- GPT-4 style design choices: regex patterns, byte fallback, special tokens
- Training pipeline: streaming data, vocabulary size selection
- Inference optimizations: tiktoken integration, parallel encoding
- Conversation formatting: special tokens for chat, supervision masking
- Evaluation metrics: bits-per-byte for tokenizer-agnostic comparison
- Implementation comparison: Rust vs Python trade-offs
Table of Contents
- BPE Algorithm Fundamentals
- GPT-4 Style Tokenization
- Training Pipeline
- The RustBPE Implementation
- Inference with tiktoken
- Special Tokens for Chat
- Conversation Rendering
- Tokenizer Evaluation
- Implementation Trade-offs
- Best Practices & Common Pitfalls
BPE Algorithm Fundamentals
What is Byte Pair Encoding?
BPE is a data compression algorithm adapted for tokenization. It builds a vocabulary by iteratively merging the most frequent pairs of tokens:
# Start with bytes (256 base tokens)
text = "hello hello world"
tokens = [104, 101, 108, 108, 111, 32, ...] # byte values
# Find most frequent pair
pairs = count_pairs(tokens) # (104, 101) appears most
# => merge (h, e) -> token 256
# Repeat for vocab_size - 256 merges
# Final vocabulary: [0..255, "he", "ll", "lo", "hello", ...]

Key insight: BPE learns a vocabulary optimized for the training data distribution. Common words become single tokens, rare words are split into subwords.
Why BPE Works
- No OOV (out-of-vocabulary): Byte fallback ensures any text can be encoded
- Compression: Common patterns use fewer tokens (efficiency)
- Generalization: Rare words share subword units with common words
- Language-agnostic: Works across languages without language-specific rules
BPE Training Algorithm
def train_bpe(text_chunks, vocab_size):
    """Core BPE training loop"""
    # 1. Initialize with byte-level tokens (lists of ints, so merged IDs >= 256 fit)
    words = [list(chunk.encode("utf-8")) for chunk in text_chunks]
    vocab = list(range(256))  # base vocabulary
    merges = {}  # (token_a, token_b) -> merged_token_id
    # 2. Iteratively merge most frequent pairs
    for i in range(vocab_size - 256):
        # Count all adjacent pairs
        pair_counts = count_pairs_in_corpus(words)
        if not pair_counts:
            break  # nothing left to merge
        # Find most frequent pair
        best_pair = max(pair_counts, key=pair_counts.get)
        # Create new token for this pair
        new_token_id = 256 + i
        merges[best_pair] = new_token_id
        vocab.append(best_pair)
        # Apply merge to all words
        words = apply_merge(words, best_pair, new_token_id)
    return vocab, merges

The core challenge is efficiency. This naive loop recounts every pair in the corpus for each of the vocab_size - 256 merges, so its cost grows with corpus size times the number of merges. nanochat's Rust implementation avoids the recount with incremental updates.
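The training loop above leaves `count_pairs_in_corpus` and `apply_merge` undefined. A minimal, deliberately naive sketch of what they might look like follows. The helper bodies and the `train_bpe_naive` wrapper are assumptions for illustration, not nanochat's code, and the full recount on every merge is exactly the cost the optimized version avoids.

```python
from collections import Counter


def count_pairs_in_corpus(words):
    """Count adjacent token pairs across all words (full recount)."""
    counts = Counter()
    for word in words:
        for pair in zip(word, word[1:]):
            counts[pair] += 1
    return counts


def apply_merge(words, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged


def train_bpe_naive(text_chunks, vocab_size):
    """Hypothetical wrapper: returns the learned merges and the merged words."""
    words = [list(chunk.encode("utf-8")) for chunk in text_chunks]
    merges = {}
    for i in range(vocab_size - 256):
        pair_counts = count_pairs_in_corpus(words)
        if not pair_counts:
            break
        # Ties are broken by first occurrence (dict insertion order).
        best_pair = max(pair_counts, key=pair_counts.get)
        merges[best_pair] = 256 + i
        words = apply_merge(words, best_pair, 256 + i)
    return merges, words


# Three pre-split chunks, three merges (vocab_size = 256 + 3).
merges, words = train_bpe_naive(["hello", " hello", " world"], vocab_size=259)
print(merges)
print(words)
```

With this tiny corpus the three merges build up "he", then "hel", then "hell", because tied counts are resolved in favour of the pair seen first.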
GPT-4 Style Tokenization
The Split Pattern
nanochat uses a regex pre-tokenization pattern inspired by GPT-4:
SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,2}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

This pattern makes several design decisions:
Pattern Breakdown
| Component | Matches | Purpose |
|---|---|---|
| `'(?i:[sdmt]\|ll\|ve\|re)` | Contractions like 's, 'm, 'll | Keep common suffixes together |
| `[^\r\n\p{L}\p{N}]?+\p{L}+` | Words with optional leading punctuation | "hello" or ".hello" |
| `\p{N}{1,2}` | Numbers (1-2 digits) | Deviation from GPT-4 |
| ` ?[^\s\p{L}\p{N}]++[\r\n]*` | Punctuation with optional trailing newlines | Symbols like ... |
| `\s*[\r\n]` | Newlines with optional preceding whitespace | Line breaks |
| `\s+(?!\S)` | Trailing whitespace | Spaces at end |
| `\s+` | Other whitespace | Spaces, tabs |
Key Design Decision: \p{N}{1,2} vs GPT-4's \p{N}{1,3}
# nanochat uses \p{N}{1,2} (1-2 digits)
# GPT-4 uses \p{N}{1,3} (1-3 digits)

Why the change?
- Vocabulary efficiency: \p{N}{1,3} allows up to 1,000 distinct three-digit chunks (000-999) to compete for vocabulary slots
- Small model consideration: With vocab_size=65536, we don't want to "waste" tokens on numbers
- Trade-off: Slightly worse compression on numeric data, but better vocabulary utilization
TODO note in code: This hasn't been validated experimentally! It's a hypothesis.
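To make the trade-off concrete, here is a small sketch of how the two quantifiers chunk a digit string and how many distinct digit chunks each one admits. It uses the standard-library `re` module with `\d` and `re.ASCII` as a stand-in, since `re` does not support `\p{N}`; that substitution and the helper names are assumptions for illustration.

```python
import re

# ASCII stand-ins for the number branch of the split pattern.
# Assumption: \d with re.ASCII replaces \p{N}, which stdlib `re` lacks.
NUM_1_2 = re.compile(r"\d{1,2}", re.ASCII)  # nanochat-style
NUM_1_3 = re.compile(r"\d{1,3}", re.ASCII)  # GPT-4-style


def digit_chunks(text, pattern):
    """Split a digit string greedily, left to right, as the regex would."""
    return pattern.findall(text)


def max_distinct_chunks(max_len):
    """Number of distinct ASCII digit strings of length 1..max_len."""
    return sum(10 ** n for n in range(1, max_len + 1))


number = "1234567"
chunks_2 = digit_chunks(number, NUM_1_2)
chunks_3 = digit_chunks(number, NUM_1_3)
print(chunks_2, max_distinct_chunks(2))
print(chunks_3, max_distinct_chunks(3))
```

The shorter quantifier produces more chunks per number but caps the space of possible digit chunks at 110 instead of 1,110.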
Byte Fallback
Critical feature: byte fallback ensures any Unicode text can be encoded:
tokenizer = HFTokenizer(BPE(
    byte_fallback=True,  # Enable byte-level fallback
    unk_token=None,      # No unknown token needed!
    fuse_unk=False,
))

How it works:
- Pre-tokenization splits text into chunks (using regex)
- Each chunk is UTF-8 encoded to bytes
- BPE merges are applied to byte sequences
- Unknown patterns fall back to individual bytes (256 base tokens)
Result: No text is ever "unknown"—worst case, it's encoded as raw bytes.
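The worst case described above can be checked directly: any string, in any script, reduces to UTF-8 bytes, and every byte is one of the 256 base tokens. The two helper names below are hypothetical and exist only for this sketch.

```python
def to_base_tokens(text):
    """Worst-case encoding: one base token (0..255) per UTF-8 byte."""
    return list(text.encode("utf-8"))


def from_base_tokens(ids):
    """Inverse of to_base_tokens: reassemble the bytes and decode."""
    return bytes(ids).decode("utf-8")


sample = "你好 🙂"
ids = to_base_tokens(sample)
print(ids)
print(from_base_tokens(ids) == sample)
```

Learned merges only ever shorten this sequence; they never change what text it can represent.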
Training Pipeline
Streaming Iterator Design
nanochat trains the tokenizer on 10 billion characters from FineWeb-Edu. Loading that much text into memory is impractical, so the training script streams it through an iterator:
def text_iterator():
    """
    Stream text from parquet files:
    1) Flatten batches into single iterator
    2) Crop documents to doc_cap characters
    3) Break after max_chars total
    """
    nchars = 0
    for batch in parquets_iter_batched(split="train"):
        for doc in batch:
            # Crop long documents
            doc_text = doc[:args.doc_cap] if len(doc) > args.doc_cap else doc
            nchars += len(doc_text)
            yield doc_text
            if nchars > args.max_chars:
                return

Design decisions:
- Document capping (doc_cap=10,000): Prevents single huge documents from dominating
- Total character limit (max_chars=10B): Controls training time vs coverage
- Streaming: Never loads full dataset into RAM
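The same capping logic can be exercised without parquet files. The sketch below is a self-contained variant under stated assumptions: `capped_text_iterator` and its parameters are hypothetical names, and plain nested lists stand in for the batched dataset.

```python
def capped_text_iterator(batches, doc_cap, max_chars):
    """Yield documents cropped to doc_cap, stopping once max_chars is exceeded."""
    nchars = 0
    for batch in batches:
        for doc in batch:
            doc_text = doc[:doc_cap]  # slicing is a no-op for short documents
            nchars += len(doc_text)
            yield doc_text
            if nchars > max_chars:
                return


# Toy stand-in for batched parquet rows.
batches = [["a" * 50, "b" * 5], ["c" * 30, "d" * 30]]
docs = list(capped_text_iterator(batches, doc_cap=10, max_chars=24))
print([len(d) for d in docs])
```

Because the check runs after the yield, the character limit is soft: the stream can overshoot `max_chars` by at most one capped document.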
Training Command
python scripts/tok_train.py \
    --vocab_size 65536 \
    --max_chars 10_000_000_000 \
    --doc_cap 10000

Vocabulary size choice: 65536 = 2^16
- Power of 2: Efficient for hardware (alignment, indexing)
- Embedding matrix: vocab_size × d_model is a major memory cost
- Trade-off: Larger vocab = better compression but more parameters
Common choices:
- GPT-2: 50,257 (slightly odd number)
- GPT-4: ~100,000 (higher compression)
- nanochat: 65,536 (power of 2, balanced)
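The parameter cost of each choice is simple arithmetic. In the sketch below, `d_model = 1280` is a hypothetical width chosen only for illustration, and the `tied` flag covers both the case where input and output embeddings share one matrix and the case where they are separate.

```python
def embedding_params(vocab_size, d_model, tied=False):
    """Parameters in the token embedding (and the output head, if untied)."""
    per_matrix = vocab_size * d_model
    return per_matrix if tied else 2 * per_matrix


D_MODEL = 1280  # hypothetical model width, not taken from nanochat

for name, vocab_size in [("GPT-2", 50257), ("nanochat", 65536), ("128k", 131072)]:
    millions = embedding_params(vocab_size, D_MODEL) / 1e6
    print(f"{name:10} vocab={vocab_size:7d} untied embedding params={millions:8.1f}M")
```

For a small model these matrices can be a large share of the total parameter count, which is why vocabulary size is worth treating as a real hyperparameter.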
Token Bytes Cache
After training, nanochat computes token_bytes: how many UTF-8 bytes each token represents.
token_bytes = []
for token_id in range(vocab_size):
    token_str = tokenizer.decode([token_id])
    if token_str in special_tokens:
        token_bytes.append(0)  # Special tokens don't count
    else:
        token_bytes.append(len(token_str.encode("utf-8")))
# Save for bits-per-byte evaluation
torch.save(torch.tensor(token_bytes), "token_bytes.pt")

Why? This enables bits-per-byte (bpb) evaluation, a tokenizer-agnostic metric:
# Traditional loss: nats per token, so it depends on the vocabulary
loss = -log P(token_id)  # varies with vocab_size
# Bits-per-byte: total loss converted to bits, divided by total bytes predicted
bpb = sum(loss) / (ln(2) * sum(token_bytes[token_id]))  # comparable across tokenizers

This metric appears in Post 1.6 (Loss Landscape & Scaling Laws).
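As a worked example of the normalization, the sketch below aggregates per-token losses into bits-per-byte in plain Python. It assumes the losses are natural-log cross-entropies (nats) and that tokens with a byte count of 0 are special tokens to skip. The function name and the toy numbers are hypothetical.

```python
import math


def bits_per_byte(token_nll_nats, target_ids, token_bytes):
    """Total loss in bits divided by total UTF-8 bytes of the predicted tokens."""
    total_nats = 0.0
    total_bytes = 0
    for nll, token_id in zip(token_nll_nats, target_ids):
        nbytes = token_bytes[token_id]
        if nbytes > 0:  # byte count 0 marks a special token
            total_nats += nll
            total_bytes += nbytes
    if total_bytes == 0:
        return float("nan")
    return total_nats / (math.log(2) * total_bytes)


# Toy vocabulary: id 0 is special, id 1 covers 1 byte, id 2 covers 4 bytes.
token_bytes = [0, 1, 4]
losses = [5.0, math.log(2), 3 * math.log(2)]  # nats
targets = [0, 1, 2]
bpb = bits_per_byte(losses, targets, token_bytes)
print(bpb)
```

Here the two counted tokens cost 4 bits in total and cover 5 bytes, so the result is 0.8 bits per byte regardless of how the text was split into tokens.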
The RustBPE Implementation
Why Rust?
BPE training is compute-intensive. Python is too slow. nanochat implements the hot path in Rust:
Performance characteristics:
- 10B characters: ~5-10 minutes on modern CPU
- Python equivalent: Hours or days
- Speedup: 50-100x
Core Algorithm: Incremental BPE with Heap
nanochat's Rust implementation uses an incremental heap-based algorithm:
struct MergeJob {
    pair: (u32, u32),     // token pair
    count: u64,           // frequency
    pos: AHashSet<usize>, // word indices where pair occurs
}

fn train_core_incremental(&mut self, words: Vec<Word>, counts: Vec<i32>, vocab_size: u32) {
    // 1. Count all pairs in parallel
    let (pair_counts, where_to_update) = count_pairs_parallel(&words, &counts);
    // 2. Build max-heap of pairs by frequency
    let mut heap = OctonaryHeap::with_capacity(pair_counts.len());
    for (pair, pos) in where_to_update {
        heap.push(MergeJob { pair, count: pair_counts[pair], pos });
    }
    // 3. Merge loop: record vocab_size - 256 merges
    let num_merges = vocab_size - 256;
    let mut merges_done = 0;
    while merges_done < num_merges {
        // Pop most frequent pair; stop early if no pairs remain
        let Some(top) = heap.pop() else { break };
        // Lazy refresh: a stale entry is re-queued and must not count as a merge
        if top.count != pair_counts[top.pair] {
            heap.push(MergeJob { count: pair_counts[top.pair], ..top });
            continue;
        }
        // Record merge
        let new_id = 256 + merges_done;
        self.merges.insert(top.pair, new_id);
        // Apply merge to affected words
        for &word_idx in &top.pos {
            let deltas = words[word_idx].merge_pair(top.pair, new_id);
            // Update pair counts based on deltas
            for (pair, delta) in deltas {
                pair_counts[pair] += delta * counts[word_idx];
                if delta > 0 {
                    // pos elided in this simplified listing
                    heap.push(MergeJob { pair, count: pair_counts[pair], .. });
                }
            }
        }
        merges_done += 1;
    }
}

Optimizations
1. Octonary Heap
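use dary_heap::OctonaryHeap; // 8-ary heap vs binary

Why 8-ary? An 8-ary heap is shallower than a binary heap, so each push or pop walks fewer levels, and the children of a node sit next to each other in memory. Both effects tend to reduce cache misses during heap operations.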
2. Lazy Evaluation
// Pop from heap
let top = heap.pop();
// Check if count is stale (other merges updated it)
if top.count != pair_counts[top.pair] {
    // Re-push with updated count (lazy refresh)
    heap.push(MergeJob { count: pair_counts[top.pair], ..top });
    continue;
}

Avoids eager heap updates: an entry is refreshed only when it is popped.
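The lazy-refresh idea is independent of Rust. Here is a minimal Python sketch using the standard-library `heapq` (a binary min-heap, used as a stand-in for the octonary max-heap by negating counts). The function name and toy counts are assumptions for illustration.

```python
import heapq


def pop_most_frequent(heap, pair_counts):
    """Pop the pair with the highest current count, refreshing stale entries lazily."""
    while heap:
        neg_count, pair = heapq.heappop(heap)
        current = pair_counts.get(pair, 0)
        if current <= 0:
            continue  # pair no longer occurs anywhere: drop the entry
        if -neg_count != current:
            # Stale: re-queue with the up-to-date count and keep looking.
            heapq.heappush(heap, (-current, pair))
            continue
        return pair, current
    return None


pair_counts = {("a", "b"): 5, ("b", "c"): 4, ("c", "d"): 3}
heap = [(-count, pair) for pair, count in pair_counts.items()]
heapq.heapify(heap)

# Simulate an earlier merge lowering a count without touching the heap.
pair_counts[("a", "b")] = 2

order = []
job = pop_most_frequent(heap, pair_counts)
while job is not None:
    order.append(job)
    job = pop_most_frequent(heap, pair_counts)
print(order)
```

This stays correct as long as stale entries only ever overstate the count; pairs whose counts rise need a fresh entry pushed at the time of the update, as the merge loop does for positive deltas.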
3. Parallel Pair Counting
fn count_pairs_parallel(words: &[Word], counts: &[i32]) -> (HashMap<Pair, i32>, HashMap<Pair, HashSet<usize>>) {
    words.par_iter() // Rayon parallel iterator
        .enumerate()
        .map(|(i, word)| {
            // Count pairs in this word
            let mut local_counts = HashMap::new();
            for (a, b) in word.pairs() {
                *local_counts.entry((a, b)).or_default() += counts[i];
            }
            local_counts
        })
        .reduce(/* merge local counts */)
}

Uses Rayon for data parallelism, which scales to all CPU cores.
4. Incremental Updates
When merging pair (a, b) -> new_id, only affected pairs change:
fn merge_pair(&mut self, pair: (a, b), new_id) -> Vec<(Pair, i32)> {
    let mut deltas = Vec::new();
    // For each occurrence of (a, b):
    // - Remove pairs: (left, a), (a, b), (b, right)
    // - Add pairs: (left, new_id), (new_id, right)
    if let Some(left) = left_neighbor {
        deltas.push(((left, a), -1));      // removed
        deltas.push(((left, new_id), +1)); // added
    }
    deltas.push(((a, b), -1));             // removed
    if let Some(right) = right_neighbor {
        deltas.push(((b, right), -1));      // removed
        deltas.push(((new_id, right), +1)); // added
    }
    deltas
}

Efficiency: each merged occurrence yields at most five pair-count deltas, so no full recount is needed.
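The delta bookkeeping is easy to get subtly wrong, so it helps to check it against a full recount. Below is a Python sketch of the same idea operating on a single word. It is an illustrative reimplementation under the assumption that the left neighbour is read from the already-merged output, not a port of the Rust code.

```python
from collections import Counter


def count_pairs(word):
    """Full recount of adjacent pairs in one word (reference implementation)."""
    return Counter(zip(word, word[1:]))


def merge_pair(word, pair, new_id):
    """Merge `pair` into `new_id`; return the new word and pair-count deltas."""
    a, b = pair
    out, deltas = [], []
    i = 0
    while i < len(word):
        if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
            left = out[-1] if out else None  # neighbour in the merged output
            right = word[i + 2] if i + 2 < len(word) else None
            if left is not None:
                deltas.append(((left, a), -1))
                deltas.append(((left, new_id), +1))
            deltas.append(((a, b), -1))
            if right is not None:
                deltas.append(((b, right), -1))
                deltas.append(((new_id, right), +1))
            out.append(new_id)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out, deltas


word = list(b"hello")
new_word, deltas = merge_pair(word, (108, 108), 256)  # merge "l" + "l"
print(new_word)
print(deltas)
```

Applying the deltas to the old pair counts should reproduce exactly what a recount of the merged word gives, including when occurrences sit back to back.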
Streaming Ingestion
The Rust code releases the GIL for parallel processing:
pub fn train_from_iterator(&mut self, py: Python, iterator: &PyAny, vocab_size: u32, buffer_size: usize) {
    let mut buf: Vec<String> = Vec::with_capacity(buffer_size);
    loop {
        // 1. Refill buffer (under GIL)
        let exhausted = refill(&mut buf, iterator)?;
        // 2. Process buffer (release GIL, parallel)
        let local_counts = py.allow_threads(|| {
            buf.par_iter()
                .map(|text| {
                    // Apply regex, count chunks
                    let mut counts = HashMap::new();
                    for chunk in pattern.find_iter(text) {
                        *counts.entry(chunk).or_default() += 1;
                    }
                    counts
                })
                .reduce(/* merge */)
        });
        // 3. Merge into global counts
        for (chunk, count) in local_counts {
            *global_counts.entry(chunk).or_default() += count;
        }
        if exhausted { break; }
    }
}

Pattern:
- Acquire GIL → read buffer_size strings from Python iterator
- Release GIL → process strings in parallel (Rayon)
- Acquire GIL → merge results, repeat
Maximizes throughput by minimizing GIL contention.
Inference with tiktoken
Training vs Inference Split
nanochat uses two libraries:
| Phase | Library | Why? |
|---|---|---|
| Training | RustBPE | Optimized incremental algorithm |
| Inference | tiktoken | OpenAI's ultra-fast encoder |
RustBPE training produces:
- pattern: The regex split pattern
- mergeable_ranks: Dictionary mapping token bytes → token ID
These are fed into tiktoken for inference:
# After training with RustBPE
pattern = tokenizer.get_pattern()
mergeable_ranks_list = tokenizer.get_mergeable_ranks()
mergeable_ranks = {bytes(k): v for k, v in mergeable_ranks_list}
# Add special tokens
special_tokens = {
    "<|bos|>": 65536,
    "<|user_start|>": 65537,
    "<|user_end|>": 65538,
    # ... 9 special tokens total
}
# Create tiktoken Encoding
enc = tiktoken.Encoding(
    name="rustbpe",
    pat_str=pattern,
    mergeable_ranks=mergeable_ranks,
    special_tokens=special_tokens,
)

tiktoken Performance
tiktoken is extremely fast:
# Single string
tokens = enc.encode_ordinary("Hello, world!")
# Batch encoding (parallel)
texts = ["text1", "text2", ..., "text1000"]
tokens_batch = enc.encode_ordinary_batch(texts, num_threads=8)

Why so fast?
- Rust implementation: Zero-copy string operations
- Parallel batch encoding: Scales to 8+ threads
- Optimized BPE merge: Uses precomputed merge priorities
Typical throughput: 10-50 MB/s per core (varies by text).
Special Token Handling
tiktoken distinguishes ordinary vs special tokens:
# Ordinary encoding: treats special tokens as text
tokens = enc.encode_ordinary("<|bos|>Hello")
# => [60, 124, 98, 111, 115, 124, 62, 9906] (raw bytes)
# Special encoding: recognizes special token
tokens = enc.encode("<|bos|>Hello", allowed_special="all")
# => [65536, 9906] (special token ID + "Hello")
# Single special token
bos_id = enc.encode_single_token("<|bos|>")  # => 65536

nanochat uses encode_ordinary for user text and explicitly injects special tokens.
Special Tokens for Chat
Token Inventory
nanochat defines 9 special tokens:
SPECIAL_TOKENS = [
    "<|bos|>",              # Beginning of sequence (every document)
    "<|user_start|>",       # User message delimiter
    "<|user_end|>",
    "<|assistant_start|>",  # Assistant message delimiter
    "<|assistant_end|>",
    "<|python_start|>",     # Tool use: Python REPL
    "<|python_end|>",
    "<|output_start|>",     # Tool output
    "<|output_end|>",
]

Design Philosophy
- Explicit delimiters: Clear boundaries between messages
- Role-based: Different tokens for user vs assistant
- Tool use: Dedicated tokens for Python code and outputs
- No implicit behavior: All special tokens are explicit in the format
Why Not Use Text Markers?
Some systems use text-based markers:
User: Hello
Assistant: Hi there!
Problems:
- Ambiguous: What if user types "User:" in their message?
- Tokenization artifacts: "User:" might split across tokens
- No structural guarantee: Model must learn format from examples
Special tokens enforce structure at the tokenization level: because user text is encoded with encode_ordinary, nothing a user types can ever produce a delimiter token ID.
Conversation Rendering
The Challenge
Training chat models requires converting conversations into token sequences with supervision masking:
conversation = {
    "messages": [
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "2+2 equals 4."},
    ]
}
# Need to produce:
# tokens: [<bos>, <user_start>, "What", "is", "2", "+", "2", "?", <user_end>,
#          <assistant_start>, "2", "+", "2", "equals", "4", ".", <assistant_end>]
# mask:   [0, 0, 0, 0, 0, 0, 0, 0, 0,
#          0, 1, 1, 1, 1, 1, 1, 1]

Mask semantics:
- 0 = Do not supervise (no gradient)
- 1 = Supervise (compute loss, backprop)
Why? We only want to train on assistant responses, not user messages.
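In terms of the loss, the mask simply selects which per-token losses contribute. A minimal sketch in plain Python, assuming the mask has already been aligned with the target positions (the function name and numbers are hypothetical):

```python
def masked_mean_loss(token_losses, mask):
    """Average the per-token loss over supervised positions only."""
    supervised = [loss for loss, m in zip(token_losses, mask) if m == 1]
    if not supervised:
        return 0.0  # nothing to supervise in this sequence
    return sum(supervised) / len(supervised)


# Four target positions; only the last two belong to the assistant.
token_losses = [2.0, 1.0, 3.0, 5.0]
mask = [0, 0, 1, 1]
loss = masked_mean_loss(token_losses, mask)
print(loss)
```

Positions with mask 0 still appear in the context the model conditions on; they just contribute no gradient.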
Render Implementation
def render_conversation(self, conversation, max_tokens=2048):
    """
    Tokenize a conversation and return (ids, mask).
    - ids: list of token IDs
    - mask: 1 for assistant tokens (supervised), 0 otherwise
    """
    ids, mask = [], []

    def add_tokens(token_ids, mask_val):
        if isinstance(token_ids, int):
            token_ids = [token_ids]
        ids.extend(token_ids)
        mask.extend([mask_val] * len(token_ids))

    # Fetch special token IDs
    bos = self.encode_special("<|bos|>")
    user_start = self.encode_special("<|user_start|>")
    user_end = self.encode_special("<|user_end|>")
    assistant_start = self.encode_special("<|assistant_start|>")
    assistant_end = self.encode_special("<|assistant_end|>")
    # Add BOS (unsupervised)
    add_tokens(bos, 0)
    # Process messages
    for i, message in enumerate(conversation["messages"]):
        role = message["role"]
        content = message["content"]
        if role == "user":
            add_tokens(user_start, 0)
            add_tokens(self.encode(content), 0)
            add_tokens(user_end, 0)
        elif role == "assistant":
            add_tokens(assistant_start, 0)       # Start token not supervised
            add_tokens(self.encode(content), 1)  # Content IS supervised
            add_tokens(assistant_end, 1)         # End token IS supervised
    # Truncate to max_tokens
    ids = ids[:max_tokens]
    mask = mask[:max_tokens]
    return ids, mask

System Message Handling
Some conversations start with a system message:
{
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
        {"role": "assistant", "content": "Hi there!"},
    ]
}

nanochat merges system into the first user message:

if conversation["messages"][0]["role"] == "system":
    conversation = copy.deepcopy(conversation)
    messages = conversation["messages"]
    assert messages[1]["role"] == "user"
    # Prepend system content to user message
    messages[1]["content"] = messages[0]["content"] + "\n\n" + messages[1]["content"]
    messages = messages[1:]  # Remove system message

Why? Simplifies format: no need for separate <|system_start|> tokens.
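The merge step can be pulled out into a standalone function and checked in isolation. The sketch below follows the snippet above; the function name `merge_system_message` and the decision to return a new conversation dict are assumptions for illustration.

```python
import copy


def merge_system_message(conversation):
    """Return a conversation whose leading system message is folded into the first user turn."""
    messages = conversation["messages"]
    if not messages or messages[0]["role"] != "system":
        return conversation  # nothing to merge
    conversation = copy.deepcopy(conversation)  # leave the caller's data untouched
    messages = conversation["messages"]
    assert len(messages) > 1 and messages[1]["role"] == "user", \
        "system message must be followed by a user message"
    messages[1]["content"] = messages[0]["content"] + "\n\n" + messages[1]["content"]
    conversation["messages"] = messages[1:]
    return conversation


original = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
        {"role": "assistant", "content": "Hi there!"},
    ]
}
merged = merge_system_message(original)
print(merged["messages"][0]["content"])
```

The deep copy matters when the same conversation object is reused across epochs; without it the system text would be prepended again on every pass.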
Tool Use: Python REPL
Assistants can invoke Python code:
{
    "role": "assistant",
    "content": [
        {"type": "text", "text": "Let me calculate that:"},
        {"type": "python", "text": "print(2+2)"},
        {"type": "python_output", "text": "4"},
        {"type": "text", "text": "The answer is 4."},
    ]
}

Rendering:

add_tokens(assistant_start, 0)
for part in content:
    value_ids = self.encode(part["text"])
    if part["type"] == "text":
        add_tokens(value_ids, 1)       # Supervised
    elif part["type"] == "python":
        add_tokens(python_start, 1)
        add_tokens(value_ids, 1)       # Supervised: learn to generate code
        add_tokens(python_end, 1)
    elif part["type"] == "python_output":
        add_tokens(output_start, 0)    # NOT supervised
        add_tokens(value_ids, 0)       # Output comes from Python, not model
        add_tokens(output_end, 0)
add_tokens(assistant_end, 1)

Key insight: Python outputs are not supervised, because they are generated by the REPL, not the model.
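To see the resulting mask end to end, the rendering logic can be run against a stand-in tokenizer. In this sketch the byte-level `encode` function and the special-token IDs (1000 and up) are hypothetical placeholders, not nanochat's real tokenizer or IDs.

```python
# Hypothetical special-token IDs, chosen above the byte range for this sketch.
SPECIAL = {
    "<|assistant_start|>": 1000, "<|assistant_end|>": 1001,
    "<|python_start|>": 1002, "<|python_end|>": 1003,
    "<|output_start|>": 1004, "<|output_end|>": 1005,
}


def encode(text):
    """Stand-in tokenizer: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def render_assistant_parts(parts):
    """Render one multi-part assistant message into (ids, mask)."""
    ids, mask = [], []

    def add(token_ids, mask_val):
        if isinstance(token_ids, int):
            token_ids = [token_ids]
        ids.extend(token_ids)
        mask.extend([mask_val] * len(token_ids))

    add(SPECIAL["<|assistant_start|>"], 0)
    for part in parts:
        value_ids = encode(part["text"])
        if part["type"] == "text":
            add(value_ids, 1)
        elif part["type"] == "python":
            add(SPECIAL["<|python_start|>"], 1)
            add(value_ids, 1)
            add(SPECIAL["<|python_end|>"], 1)
        elif part["type"] == "python_output":
            add(SPECIAL["<|output_start|>"], 0)
            add(value_ids, 0)
            add(SPECIAL["<|output_end|>"], 0)
        else:
            raise ValueError(f"unknown part type: {part['type']!r}")
    add(SPECIAL["<|assistant_end|>"], 1)
    return ids, mask


parts = [
    {"type": "text", "text": "Hi"},
    {"type": "python", "text": "1+1"},
    {"type": "python_output", "text": "2"},
]
ids, mask = render_assistant_parts(parts)
print(list(zip(ids, mask)))
```

The only unsupervised positions are the opening delimiter and the three tokens of the tool output span.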
Visualization Helper
Debugging tokenization is crucial. nanochat provides a visualizer:
def visualize_tokenization(self, ids, mask):
    """Color-code tokens by supervision mask"""
    RED = '\033[91m'    # Unsupervised
    GREEN = '\033[92m'  # Supervised
    RESET = '\033[0m'
    tokens = []
    for token_id, mask_val in zip(ids, mask):
        token_str = self.decode([token_id])
        color = GREEN if mask_val == 1 else RED
        tokens.append(f"{color}{token_str}{RESET}")
    return '|'.join(tokens)

Example output:
<|bos|>|<|user_start|>|What|is|2|+|2|?|<|user_end|>|<|assistant_start|>|2|+|2|equals|4|.|<|assistant_end|>
RED RED RED RED RED RED RED RED RED RED GREEN GREEN GREEN GREEN GREEN GREEN GREEN
Green = model is trained on these tokens, Red = ignored in loss. The first ten tokens (through <|assistant_start|>) are red; the seven assistant tokens, including <|assistant_end|>, are green.
Tokenizer Evaluation
Compression Ratio
The primary metric: bytes per token (higher = better compression):
text = "Hello, world!"
encoded = tokenizer.encode(text)
encoded_bytes = text.encode('utf-8')
compression_ratio = len(encoded_bytes) / len(encoded)
# Higher ratio = fewer tokens for same text = better compression

Comparing to GPT-2 and GPT-4
scripts/tok_eval.py compares nanochat's tokenizer against baselines:
# Load tokenizers
gpt2_tok = RustBPETokenizer.from_pretrained("gpt2")
gpt4_tok = RustBPETokenizer.from_pretrained("cl100k_base")
ours_tok = get_tokenizer()
# Test on diverse data
test_data = [
    ("news", news_article),
    ("korean", korean_text),
    ("code", python_code),
    ("math", latex_math),
    ("science", technical_prose),
]
# Encode and compare
for name, text in test_data:
    gpt2_tokens = gpt2_tok.encode(text)
    gpt4_tokens = gpt4_tok.encode(text)
    ours_tokens = ours_tok.encode(text)
    gpt2_ratio = len(text.encode('utf-8')) / len(gpt2_tokens)
    gpt4_ratio = len(text.encode('utf-8')) / len(gpt4_tokens)
    ours_ratio = len(text.encode('utf-8')) / len(ours_tokens)
    print(f"{name:10} GPT-2: {gpt2_ratio:.2f} GPT-4: {gpt4_ratio:.2f} Ours: {ours_ratio:.2f}")

Example output:
Vocab sizes:
GPT-2: 50257
GPT-4: 100256
Ours: 65536
Comparison with GPT-2:
| Text Type | Bytes | GPT-2 Tokens | GPT-2 Ratio | Ours Tokens | Ours Ratio | Relative Diff % | Better |
|---|---|---|---|---|---|---|---|
| news | 1087 | 295 | 3.69 | 275 | 3.95 | +6.8% | Ours |
| korean | 385 | 180 | 2.14 | 145 | 2.66 | +19.4% | Ours |
| code | 876 | 251 | 3.49 | 245 | 3.58 | +2.4% | Ours |
| math | 1547 | 556 | 2.78 | 520 | 2.98 | +6.5% | Ours |
| science | 715 | 194 | 3.69 | 185 | 3.86 | +4.6% | Ours |
| fwe-train | 10247 | 2568 | 3.99 | 2450 | 4.18 | +4.6% | Ours |
Insights:
- Korean text: Large improvement (+19%) due to better multilingual support
- English text: Modest improvement (+5-7%) due to similar training data
- Code/Math: Competitive, slightly better due to domain coverage
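The columns in the comparison can be recomputed from the byte and token counts. In the sketch below, the ratio is bytes per token as defined earlier; the "Relative Diff %" formula (reduction in token count relative to GPT-2) is inferred from the numbers shown and is an assumption, not taken from tok_eval.py.

```python
def compression_ratio(num_bytes, num_tokens):
    """Bytes per token: higher means better compression."""
    return num_bytes / num_tokens


def relative_diff_pct(baseline_tokens, ours_tokens):
    """Assumed formula: percent fewer tokens than the baseline tokenizer."""
    return 100.0 * (baseline_tokens - ours_tokens) / baseline_tokens


# (bytes, GPT-2 tokens, our tokens) for two rows of the example output.
rows = {
    "korean": (385, 180, 145),
    "code": (876, 251, 245),
}

for name, (num_bytes, gpt2_tokens, ours_tokens) in rows.items():
    print(
        f"{name:8}"
        f" GPT-2: {compression_ratio(num_bytes, gpt2_tokens):.2f}"
        f" Ours: {compression_ratio(num_bytes, ours_tokens):.2f}"
        f" Diff: {relative_diff_pct(gpt2_tokens, ours_tokens):+.1f}%"
    )
```

Measuring the difference in token count rather than in ratio keeps the percentage directly interpretable as sequence-length savings.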
Bits-Per-Byte Metric
From Post 1.6, bits-per-byte (bpb) normalizes loss across tokenizers:
import math

# Load token_bytes mapping
token_bytes = get_token_bytes(device="cuda")  # [vocab_size]
# During evaluation: per-token loss in nats, flattened to [batch*seq_len]
losses = F.cross_entropy(
    logits.view(-1, logits.size(-1)), targets.view(-1), reduction='none'
)
# Look up how many bytes each target token represents
token_bytes_flat = token_bytes[targets.view(-1)]  # [batch*seq_len]
valid_mask = token_bytes_flat > 0  # Exclude special tokens
# Bits-per-byte: total nats converted to bits, divided by total bytes
total_nats = losses[valid_mask].sum()
total_bytes = token_bytes_flat[valid_mask].sum()
bpb = total_nats / (math.log(2) * total_bytes)

Why bpb?
| Metric | Formula | Problem |
|---|---|---|
| Loss | -log P(token) | Depends on vocab_size |
| Perplexity | exp(loss) | Still vocab-dependent |
| bpb | sum(loss) / (ln 2 × sum(token_bytes)) | Vocab-agnostic |
A model with vocab_size=50k and one with vocab_size=100k can be fairly compared using bpb.
Implementation Trade-offs
HuggingFace vs RustBPE + tiktoken
nanochat provides two implementations:
HuggingFace Tokenizers
class HuggingFaceTokenizer:
    """Uses the HF tokenizers library (Rust-backed, installed as a prebuilt package)"""

    @classmethod
    def train_from_iterator(cls, text_iterator, vocab_size):
        tokenizer = HFTokenizer(BPE(byte_fallback=True))
        # ... configure pre-tokenizer, decoder
        trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=SPECIAL_TOKENS)
        tokenizer.train_from_iterator(text_iterator, trainer)
        return cls(tokenizer)

Pros:
- ✅ Single library for training + inference
- ✅ No Rust compilation required
- ✅ Widely used, well-documented
Cons:
- ❌ Slower training (~10-20x vs RustBPE)
- ❌ Confusing API (many configuration options)
- ❌ Slower inference (~2-5x vs tiktoken)
RustBPE + tiktoken
class RustBPETokenizer:
    """Rust training, tiktoken inference"""

    @classmethod
    def train_from_iterator(cls, text_iterator, vocab_size):
        # Train with Rust
        tokenizer = rustbpe.Tokenizer()
        tokenizer.train_from_iterator(text_iterator, vocab_size, pattern=SPLIT_PATTERN)
        # Export to tiktoken
        pattern = tokenizer.get_pattern()
        mergeable_ranks = tokenizer.get_mergeable_ranks()
        enc = tiktoken.Encoding(name="rustbpe", pat_str=pattern, mergeable_ranks=mergeable_ranks, ...)
        return cls(enc)

Pros:
- ✅ Fast training (~50-100x speedup)
- ✅ Ultra-fast inference (tiktoken is highly optimized)
- ✅ Parallel batch encoding
Cons:
- ❌ Requires Rust toolchain for compilation
- ❌ Two-library complexity (rustbpe + tiktoken)
- ❌ Custom code to maintain
When to Use Each
| Use Case | Recommendation |
|---|---|
| Production training | RustBPE + tiktoken (speed critical) |
| Research/experimentation | HuggingFace (easier iteration) |
| CPU-only environment | Both work, but RustBPE still faster |
| No Rust compiler | HuggingFace (prebuilt package, no local compilation) |
nanochat default: RustBPE + tiktoken for performance.
Best Practices & Common Pitfalls
Best Practices
1. Vocabulary Size Selection
# Good: Powers of 2 for efficiency
vocab_sizes = [32768, 65536, 131072]
# Avoid: Arbitrary sizes
vocab_sizes = [50000, 75000]  # Worse for hardware alignment

Trade-offs:
- Smaller vocab (32k): Fewer parameters, lower memory, worse compression
- Larger vocab (128k): Better compression, more parameters, higher memory
2. Document Capping
# Good: Cap individual documents
def text_iterator():
    for doc in dataset:
        yield doc[:10000]  # Prevent single huge doc from dominating

# Bad: No capping
def text_iterator():
    for doc in dataset:
        yield doc  # 100MB document consumes 1% of training data alone

3. Special Token Design
# Good: Explicit, unambiguous delimiters
SPECIAL_TOKENS = ["<|bos|>", "<|user_start|>", "<|user_end|>"]
# Bad: Ambiguous markers (could appear in text)
SPECIAL_TOKENS = ["[BOS]", "[USER]", "[/USER]"]

4. Supervision Masking
# Good: Only supervise assistant tokens
if role == "assistant":
    add_tokens(content_ids, mask=1)  # Supervised
else:
    add_tokens(content_ids, mask=0)  # Not supervised
# Bad: Supervise everything (including user messages)
add_tokens(content_ids, mask=1)  # Model learns to generate user messages

Common Pitfalls
Pitfall 1: Forgetting Byte Fallback
# Bad: No byte fallback
tokenizer = BPE(byte_fallback=False, unk_token="<UNK>")
# Problem: Unknown characters → <UNK> token → information loss
# Good: Byte fallback enabled
tokenizer = BPE(byte_fallback=True, unk_token=None)
# Solution: Unknown patterns → individual bytes (no information loss)

Pitfall 2: Inconsistent Regex Patterns
# Bad: Different patterns for training vs inference
train_pattern = r"\w+|\d+|[^\w\s]+"
inference_pattern = r"\w+|\s+|."  # Oops, different!
# Good: Store pattern with tokenizer
tokenizer.pattern = SPLIT_PATTERN  # Use same pattern always

Pitfall 3: Special Token Injection
# Bad: Treating special tokens as ordinary text
text = "<|bos|>Hello"
tokens = tokenizer.encode(text)  # Encodes "<|bos|>" as several ordinary tokens
# Good: Explicit special token injection
tokens = [tokenizer.encode_special("<|bos|>")] + tokenizer.encode("Hello")
# => [65536, 9906] (correct)

Pitfall 4: Ignoring Token Bytes
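# Bad: Using raw loss for evaluation
loss = F.cross_entropy(logits, targets)  # mean nats per token
# Problem: Loss depends on vocab_size (not comparable)
# Good: Normalize total loss by total token bytes, converted to bits
losses = F.cross_entropy(logits, targets, reduction='none')
nbytes = token_bytes[targets]
valid = nbytes > 0  # skip special tokens
bpb = losses[valid].sum() / (math.log(2) * nbytes[valid].sum())
# Solution: Vocabulary-agnostic metric

Pitfall 5: Truncation Without Padding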
# Bad: Truncate but don't track actual lengths
ids = ids[:max_tokens]  # Lost information about original length
# Good: Track lengths or use attention masks
ids = ids[:max_tokens]
actual_length = min(len(original_ids), max_tokens)
# Use actual_length for loss masking

Debugging Tokenization
When things go wrong:
# 1. Visualize tokenization
viz = tokenizer.visualize_tokenization(ids, mask)
print(viz)  # Color-coded output
# 2. Decode individual tokens
for token_id in ids:
    print(f"{token_id}: {tokenizer.decode([token_id])!r}")
# 3. Check special token IDs
special_ids = {name: tokenizer.encode_special(name) for name in SPECIAL_TOKENS}
print(special_ids)
# 4. Verify encode/decode round-trip
text = "Test 你好 123"
tokens = tokenizer.encode(text)
decoded = tokenizer.decode(tokens)
assert decoded == text, f"Round-trip failed: {text!r} != {decoded!r}"

Conclusion
Tokenization is the interface between raw text and neural networks—get it wrong, and no amount of architecture engineering will save you.
Key takeaways:
- BPE with byte fallback ensures no text is ever "unknown"
- GPT-4 style regex splitting creates linguistically meaningful chunks
- RustBPE + tiktoken offers best-in-class training and inference speed
- Special tokens enforce conversation structure at tokenization level
- Supervision masking ensures models learn assistant behavior, not user imitation
- Bits-per-byte enables fair comparison across tokenizers
- Vocabulary size is a critical hyperparameter balancing compression vs parameters
The tokenizer you train sets the foundation for everything downstream—training data efficiency, model capacity utilization, and generation quality. Spend time getting it right.
Related Posts
- Post 1.5: Training Data Pipeline - How tokenized data flows through dataloaders
- Post 1.6: Loss Landscape & Scaling Laws - Bits-per-byte in evaluation
- Post 2.2: Fine-tuning for Chat (SFT) - Using render_conversation for SFT
- tiktoken documentation
- BPE paper: "Neural Machine Translation of Rare Words with Subword Units" (Sennrich et al., 2016)
Exercises
- Experiment with vocabulary size: Train tokenizers with vocab_size=32768 and vocab_size=131072. Compare compression ratios and model performance.
- Analyze token distribution: Plot histogram of token frequencies from trained tokenizer. What percentage of tokens account for 50% of usage?
- Multilingual compression: Evaluate compression ratio on 10+ languages. Which languages compress best/worst? Why?
- Custom regex patterns: Modify SPLIT_PATTERN to handle code better (e.g., keep -> or == together). Measure impact on code compression.
- Special token ablation: Train a model with vs without special tokens (using text markers instead). Compare chat quality.
Next Post: Post 2.6: Memory Optimization Techniques - Gradient accumulation, mixed precision, activation checkpointing, batch size tuning
Part of the nanochat Deep-Dive Series • Code: nanochat on GitHub



