A lightweight, header-only Byte Pair Encoding (BPE) trainer implemented in modern C++17/20.
Train your own tokenizer vocabularies compatible with HuggingFace Transformers or use them with Modern Text Tokenizer for fast, production-ready tokenization in C++.
- Full BPE Algorithm: Train subword vocabularies from scratch
- Header-Only: Single file, zero external dependencies
- High Performance: Optimized C++ implementation
- HuggingFace Compatible: Outputs `vocab.txt` and `merges.txt` files
- Multiple Formats: Supports plain text and JSONL input
- Configurable: Lowercase, punctuation splitting, normalization
- CLI Ready: Complete command-line interface
- UTF-8 Safe: Proper Unicode character handling
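
The "UTF-8 Safe" point above comes down to splitting text into Unicode code points rather than raw bytes before any merging happens. Below is a minimal, standalone sketch of that kind of splitting; it is illustrative only, not the trainer's internal code, and `split_utf8` is a hypothetical helper name.

```cpp
// Minimal sketch: split a UTF-8 string into code-point-sized chunks.
// Illustrative only; not the trainer's internal implementation.
#include <iostream>
#include <string>
#include <vector>

std::vector<std::string> split_utf8(const std::string& text) {
    std::vector<std::string> chars;
    for (size_t i = 0; i < text.size();) {
        unsigned char byte = static_cast<unsigned char>(text[i]);
        size_t len = 1;
        if      ((byte & 0x80) == 0x00) len = 1;  // ASCII
        else if ((byte & 0xE0) == 0xC0) len = 2;  // 2-byte sequence
        else if ((byte & 0xF0) == 0xE0) len = 3;  // 3-byte sequence
        else if ((byte & 0xF8) == 0xF0) len = 4;  // 4-byte sequence
        chars.push_back(text.substr(i, len));
        i += len;
    }
    return chars;
}

int main() {
    for (const auto& c : split_utf8("héllo")) std::cout << c << ' ';
    std::cout << '\n';  // h é l l o
}
```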
- C++17/20 compatible compiler (GCC 7+, Clang 5+, MSVC 2017+)
- No external dependencies - uses only standard library
```cpp
#include "Tiny-BPE-Trainer.hpp"
using namespace MecanikDev;
```

Or build the standalone CLI tool:

```bash
g++ -std=c++17 -O3 -o Tiny-BPE-Trainer Tiny-BPE-Trainer.cpp
# or
clang++ -std=c++17 -O3 -o Tiny-BPE-Trainer Tiny-BPE-Trainer.cpp
```

```cpp
// Initialize trainer
TinyBPETrainer trainer;
trainer
.set_lowercase(true)
.set_split_punctuation(true)
.set_normalize_whitespace(true);
// Train from text file
if (trainer.train_from_file("corpus.txt", 16000, 2)) {
// Save HuggingFace-compatible files
trainer.save_vocab("vocab.txt");
trainer.save_merges("merges.txt");
// Show statistics
trainer.print_stats();
}

// Test the trained tokenizer
auto tokens = trainer.tokenize_test("Hello, world!");
// Result: ["Hello", ",", "world", "!</w>"]
```

```bash
# Quick demo
./Tiny-BPE-Trainer --demo
# Train from text file
./Tiny-BPE-Trainer -i corpus.txt -v 16000 -o my_tokenizer
# Train from JSONL dataset
./Tiny-BPE-Trainer -i dataset.jsonl --jsonl -v 32000
# Test tokenization
./Tiny-BPE-Trainer --test "Hello, world! This is a test."
```

Options:

```text
-i, --input <file> Input text file or JSONL file
-o, --output <prefix> Output file prefix (default: "tokenizer")
-v, --vocab-size <num> Vocabulary size (default: 32000)
-m, --min-freq <num> Minimum frequency for merges (default: 2)
--jsonl Input is JSONL format
--text-field <field> JSONL text field name (default: "text")
--no-lowercase Don't convert to lowercase
--no-punct-split Don't split punctuation
--demo Run demo with sample data
--test <text> Test tokenization on given text
```

```bash
./Tiny-BPE-Trainer -i small_corpus.txt -v 8000 -m 2 -o small_tokenizer
# Expected: ~30 seconds, 8K vocabulary

./Tiny-BPE-Trainer -i medium_corpus.txt -v 32000 -m 5 -o medium_tokenizer
# Expected: ~10 minutes, 32K vocabulary

./Tiny-BPE-Trainer -i large_corpus.txt -v 50000 -m 10 -o large_tokenizer
# Expected: ~1-2 hours, 50K vocabulary

./Tiny-BPE-Trainer -i dataset.jsonl --jsonl --text-field content -v 32000
```

Example plain-text input:

```text
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Natural language processing enables computers to understand human language.
```

Example JSONL input:

```json
{"id": 1, "text": "The quick brown fox jumps over the lazy dog."}
{"id": 2, "text": "Machine learning is a subset of artificial intelligence."}
{"id": 3, "text": "Natural language processing enables computers."}
```

Want to train on real-world text like IMDB reviews, Wikipedia, or news articles?
You can use the Python script download_dataset.py to download datasets from HuggingFace Datasets Hub, and export them into plain .txt or .jsonl format that works directly with Tiny BPE Trainer.
Install the requirements first:
```bash
pip install datasets pandas pyarrow
```

```python
from datasets import load_dataset

# Load dataset (choose from "imdb", "ag_news", "wikitext", etc.)
dataset = load_dataset("imdb", split="train")

with open("corpus.txt", "w", encoding="utf-8") as f:
    for example in dataset:
        text = example.get("text") or example.get("content")
        f.write(text.replace("\n", " ").strip() + "\n")
```

```python
import json
from datasets import load_dataset
# Load dataset (choose from "imdb", "ag_news", "wikitext", etc.)
dataset = load_dataset("imdb", split="train")
with open("corpus.jsonl", "w", encoding="utf-8") as f:
    for i, example in enumerate(dataset):
        f.write(json.dumps({"id": i, "text": example["text"]}) + "\n")
```

```bash
# Using plain text
./Tiny-BPE-Trainer -i corpus.txt -v 16000 -m 2 -o imdb_tokenizer
# Using JSONL
./Tiny-BPE-Trainer -i corpus.jsonl --jsonl -v 16000 -o imdb_tokenizer
```

Example `vocab.txt` output:

```text
<|endoftext|>
<|unk|>
<|pad|>
<|mask|>
!
"
#
...
the
of
and
ing</w>
er</w>
...
```

Example `merges.txt` output:

```text
#version: 0.2
i n
t h
th e
e r
...
```
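
Both files are plain text, so they are easy to consume from other tools. The sketch below shows one way to load them in C++; it is illustrative rather than part of the library, and it assumes that a token's line number in `vocab.txt` serves as its id.

```cpp
// Illustrative loader for the emitted files; not part of Tiny-BPE-Trainer itself.
#include <fstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// vocab.txt: one token per line; the line index is used as the token id (assumption).
std::unordered_map<std::string, int> load_vocab(const std::string& path) {
    std::unordered_map<std::string, int> vocab;
    std::ifstream in(path);
    std::string line;
    int id = 0;
    while (std::getline(in, line)) {
        if (!line.empty()) vocab[line] = id++;
    }
    return vocab;
}

// merges.txt: a "#version" header, then one "left right" pair per line, best merges first.
std::vector<std::pair<std::string, std::string>> load_merges(const std::string& path) {
    std::vector<std::pair<std::string, std::string>> merges;
    std::ifstream in(path);
    std::string header;
    std::getline(in, header);  // skip "#version: 0.2"
    std::string left, right;
    while (in >> left >> right) {
        merges.emplace_back(left, right);
    }
    return merges;
}
```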
```cpp
// Simplified API overview (parameter names shown without full types)
class TinyBPETrainer {
// Configuration
TinyBPETrainer& set_lowercase(bool enable);
TinyBPETrainer& set_split_punctuation(bool enable);
TinyBPETrainer& set_normalize_whitespace(bool enable);
TinyBPETrainer& set_special_tokens(eos, unk, pad, mask);
// Training
bool train_from_file(filepath, vocab_size=32000, min_freq=2);
bool train_from_jsonl(filepath, text_field="text", vocab_size=32000, min_freq=2);
// Output
bool save_vocab(vocab_path);
bool save_merges(merges_path);
void print_stats();
// Testing
std::vector<std::string> tokenize_test(text);
};
```

```cpp
TinyBPETrainer trainer;
trainer
.set_lowercase(true) // Convert to lowercase
.set_split_punctuation(true) // Split on punctuation
.set_normalize_whitespace(true) // Normalize whitespace
.set_special_tokens( // Custom special tokens
"<|endoftext|>",
"<|unk|>",
"<|pad|>",
"<|mask|>"
);
```

```cpp
#include "Modern-Text-Tokenizer.hpp" // Tokenizer
#include "Tiny-BPE-Trainer.hpp" // BPE trainer
using namespace MecanikDev;
// Train BPE vocabulary
TinyBPETrainer trainer;
trainer.train_from_file("corpus.txt", 16000);
trainer.save_vocab("my_vocab.txt");
trainer.save_merges("my_merges.txt");
// Use with tokenizer
TextTokenizer tokenizer;
tokenizer.load_vocab("my_vocab.txt");
auto token_ids = tokenizer.encode("Hello, world!");
```

```python
# Python - load in HuggingFace Tokenizers
from tokenizers import Tokenizer
from tokenizers.models import BPE
# Load our trained BPE
tokenizer = Tokenizer(BPE(
vocab="my_vocab.txt",
merges="my_merges.txt"
))
tokens = tokenizer.encode("Hello, world!")
```

Example training output:

```text
Starting BPE training...
Input: imdb.txt
Format: Plain text
Vocab size: 32000
Min frequency: 2
Output prefix: tokenizer
Reading corpus from: imdb.txt
Processed 33157823 characters, 6952632 words
Unique word forms: 106008
Initial vocabulary size: 240
Starting BPE training...
...
BPE training completed!
Final vocabulary size: 32000
Total merges: 31760
Training time: 1962 seconds
Saved vocabulary (32000 tokens) to: tokenizer_vocab.txt
Saved merges (31760 rules) to: tokenizer_merges.txt
Training completed successfully!
Total time: 1966 seconds
Training Statistics:
Characters processed: 33157823
Words processed: 6952632
Final vocab size: 32000
BPE merges: 31760
Compression ratio: 0.0010
```

Benchmark on AMD Ryzen 9 5900X, compiled with -O3.
- Preprocessing
  - Normalize whitespace
  - Convert to lowercase (optional)
  - Split punctuation (optional)
- Character Initialization
  - `"hello"` → `["h", "e", "l", "l", "o", "</w>"]`
- Iterative Merging (see the sketch below)
  - Most frequent pair: `"l"` + `"l"` → `"ll"`, so `"hello"` → `["h", "e", "ll", "o", "</w>"]`
- Vocabulary Building
  - Characters: `h`, `e`, `l`, `o`, `</w>`
  - Merges: `ll`, `he`, `ell`, `hello`
  - Special tokens: `<|unk|>`, `<|pad|>`, etc.
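
To make the iterative merging step concrete, here is a compact, self-contained sketch of one merge iteration: count adjacent symbol pairs weighted by word frequency, pick the most frequent pair, and fuse it everywhere it occurs. It illustrates the algorithm only and is not Tiny-BPE-Trainer's internal code; in real training this loop repeats until the target vocabulary size is reached (or no pair meets the minimum frequency).

```cpp
// Self-contained sketch of a single BPE merge step (illustrative only,
// not the trainer's actual implementation).
#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Word   = std::vector<std::string>;           // e.g. {"h","e","l","l","o","</w>"}
using Corpus = std::vector<std::pair<Word, int>>;  // (symbol sequence, word frequency)

// Count adjacent symbol pairs across the corpus, weighted by word frequency.
std::map<std::pair<std::string, std::string>, int> count_pairs(const Corpus& corpus) {
    std::map<std::pair<std::string, std::string>, int> counts;
    for (const auto& [word, freq] : corpus)
        for (size_t i = 0; i + 1 < word.size(); ++i)
            counts[{word[i], word[i + 1]}] += freq;
    return counts;
}

// Fuse every occurrence of the chosen pair into a single new symbol.
void apply_merge(Corpus& corpus, const std::pair<std::string, std::string>& best) {
    for (auto& entry : corpus) {
        Word& word = entry.first;
        Word merged;
        for (size_t i = 0; i < word.size(); ++i) {
            if (i + 1 < word.size() && word[i] == best.first && word[i + 1] == best.second) {
                merged.push_back(best.first + best.second);
                ++i;  // skip the second symbol of the merged pair
            } else {
                merged.push_back(word[i]);
            }
        }
        word = std::move(merged);
    }
}

int main() {
    // Toy corpus: "hello" seen 5 times, "full" seen 4 times (already character-split).
    Corpus corpus = {
        {{"h", "e", "l", "l", "o", "</w>"}, 5},
        {{"f", "u", "l", "l", "</w>"}, 4},
    };

    auto counts = count_pairs(corpus);
    auto best = std::max_element(counts.begin(), counts.end(),
        [](const auto& a, const auto& b) { return a.second < b.second; })->first;

    apply_merge(corpus, best);  // most frequent pair here is "l" + "l" -> "ll"

    for (const auto& sym : corpus[0].first) std::cout << sym << ' ';
    std::cout << '\n';  // prints: h e ll o </w>
}
```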
- Subword Units: Handles unknown words through decomposition
- Frequency-Based: Most common patterns get merged first
- Deterministic: Same corpus always produces same vocabulary
- Compression: Reduces vocabulary size vs. word-level tokenization
"Training failed" Error
```bash
# Check file exists and is readable
ls -la corpus.txt
file corpus.txt
# Try smaller vocabulary size
./Tiny-BPE-Trainer -i corpus.txt -v 8000 -m 1
```

Slow Training

```bash
# Increase minimum frequency
./Tiny-BPE-Trainer -i corpus.txt -v 32000 -m 10
# Use smaller corpus for testing
head -n 10000 large_corpus.txt > small_test.txt
```

Memory Issues

```bash
# Monitor memory usage
top -p $(pgrep Tiny-BPE-Trainer)
# Reduce vocabulary size
./Tiny-BPE-Trainer -i corpus.txt -v 16000
```

- Start Small: Test with a small corpus and vocabulary first
- Adjust min_frequency: Higher values = faster training, smaller vocab
- Preprocessing: Clean your corpus for better results
- Incremental: Train smaller models first, then scale up
- Parallel Training: Multi-threaded BPE training
- Streaming Mode: Process huge files without loading into memory
- Advanced Preprocessing: Custom regex patterns, language-specific rules
- Evaluation Metrics: Compression ratio, OOV handling statistics
- Visualization: Plot vocabulary growth, merge frequency distributions
- Export Formats: SentencePiece, custom binary formats
- Tokenizer Integration: Seamless loading of trained BPE models
- HuggingFace Plugin: Direct integration with transformers library
- TensorFlow/PyTorch: C++ ops for training integration
We welcome contributions! Areas of interest:
- Performance: SIMD optimizations, better algorithms
- Features: New preprocessing options, export formats
- Testing: More edge cases, different languages
- Documentation: Tutorials, examples, use cases
MIT License - see LICENSE file for details.
- Inspired by open-source libraries like SentencePiece and HuggingFace Tokenizers
- Format compatibility modeled after HuggingFace's `vocab.txt` and `merges.txt` outputs
- Based on the original Byte Pair Encoding paper by Sennrich et al.
- UTF-8 safety and normalization techniques informed by modern C++ text processing resources
- BPE Paper - Original Byte Pair Encoding paper
- Neural Machine Translation of Rare Words with Subword Units
- SentencePiece - Google's implementation
- HuggingFace Tokenizers - Fast tokenization library
⭐ Star this repo if you find it useful!
Built with ❤️ for the C++ and NLP community