Tiny BPE Trainer – A Fast and Lightweight BPE Trainer in C++

A lightweight, header-only Byte Pair Encoding (BPE) trainer implemented in modern C++17/20.

Train your own tokenizer vocabularies compatible with HuggingFace Transformers or use them with Modern Text Tokenizer for fast, production-ready tokenization in C++.

Features

  • Full BPE Algorithm: Train subword vocabularies from scratch
  • Header-Only: Single file, zero external dependencies
  • High Performance: Optimized C++ implementation
  • HuggingFace Compatible: Outputs vocab.txt and merges.txt files
  • Multiple Formats: Supports plain text and JSONL input
  • Configurable: Lowercase, punctuation splitting, normalization
  • CLI Ready: Complete command-line interface
  • UTF-8 Safe: Proper Unicode character handling

Requirements

  • C++17/20 compatible compiler (GCC 7+, Clang 5+, MSVC 2017+)
  • No external dependencies: uses only the standard library

Quick Start

Include the Header

#include "Tiny-BPE-Trainer.hpp"
using namespace MecanikDev;

Build the CLI

g++ -std=c++17 -O3 -o Tiny-BPE-Trainer Tiny-BPE-Trainer.cpp
# or
clang++ -std=c++17 -O3 -o Tiny-BPE-Trainer Tiny-BPE-Trainer.cpp
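
On Windows, the MSVC equivalent (Visual Studio 2017 or later, per the requirements above) is:

cl /std:c++17 /O2 /EHsc Tiny-BPE-Trainer.cpp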

Basic Training

// Initialize trainer
TinyBPETrainer trainer;
trainer
    .set_lowercase(true)
    .set_split_punctuation(true)
    .set_normalize_whitespace(true);

// Train from text file
if (trainer.train_from_file("corpus.txt", 16000, 2)) {
    // Save HuggingFace-compatible files
    trainer.save_vocab("vocab.txt");
    trainer.save_merges("merges.txt");
    
    // Show statistics
    trainer.print_stats();
}

Test Tokenization

// Test the trained tokenizer
auto tokens = trainer.tokenize_test("Hello, world!");
// Result: ["Hello", ",", "world", "!</w>"]

Command Line Interface

Basic Usage

# Quick demo
./Tiny-BPE-Trainer --demo

# Train from text file
./Tiny-BPE-Trainer -i corpus.txt -v 16000 -o my_tokenizer

# Train from JSONL dataset
./Tiny-BPE-Trainer -i dataset.jsonl --jsonl -v 32000

# Test tokenization
./Tiny-BPE-Trainer --test "Hello, world! This is a test."

All Options

Options:
  -i, --input <file>      Input text file or JSONL file
  -o, --output <prefix>   Output file prefix (default: "tokenizer")
  -v, --vocab-size <num>  Vocabulary size (default: 32000)  
  -m, --min-freq <num>    Minimum frequency for merges (default: 2)
  --jsonl                 Input is JSONL format
  --text-field <field>    JSONL text field name (default: "text")
  --no-lowercase          Don't convert to lowercase
  --no-punct-split        Don't split punctuation
  --demo                  Run demo with sample data
  --test <text>           Test tokenization on given text

Training Examples

Small Dataset (1MB)

./Tiny-BPE-Trainer -i small_corpus.txt -v 8000 -m 2 -o small_tokenizer
# Expected: ~30 seconds, 8K vocabulary

Medium Dataset (100MB)

./Tiny-BPE-Trainer -i medium_corpus.txt -v 32000 -m 5 -o medium_tokenizer  
# Expected: ~10 minutes, 32K vocabulary

Large Dataset (1GB+)

./Tiny-BPE-Trainer -i large_corpus.txt -v 50000 -m 10 -o large_tokenizer
# Expected: ~1-2 hours, 50K vocabulary

JSONL Dataset

./Tiny-BPE-Trainer -i dataset.jsonl --jsonl --text-field content -v 32000

Plain Text Format

The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Natural language processing enables computers to understand human language.

JSONL Format

{"id": 1, "text": "The quick brown fox jumps over the lazy dog."}
{"id": 2, "text": "Machine learning is a subset of artificial intelligence."}
{"id": 3, "text": "Natural language processing enables computers."}

Downloading Corpus with Python (HuggingFace Datasets)

Want to train on real-world text like IMDB reviews, Wikipedia, or news articles?

You can use the Python script download_dataset.py to download datasets from the HuggingFace Datasets Hub and export them to plain .txt or .jsonl files that work directly with Tiny BPE Trainer.

Install the requirements first:

pip install datasets pandas pyarrow

Save as Plain Text (corpus.txt)

from datasets import load_dataset

# Load dataset (choose from "imdb", "ag_news", "wikitext", etc.)
dataset = load_dataset("imdb", split="train")

with open("corpus.txt", "w", encoding="utf-8") as f:
    for example in dataset:
        text = example.get("text") or example.get("content")
        f.write(text.replace("\n", " ").strip() + "\n")

Save as JSONL (corpus.jsonl)

import json
from datasets import load_dataset

# Load dataset (choose from "imdb", "ag_news", "wikitext", etc.)
dataset = load_dataset("imdb", split="train")

with open("corpus.jsonl", "w", encoding="utf-8") as f:
    for i, example in enumerate(dataset):
        f.write(json.dumps({"id": i, "text": example["text"]}) + "\n")

Train with Tiny BPE Trainer

# Using plain text
./Tiny-BPE-Trainer -i corpus.txt -v 16000 -m 2 -o imdb_tokenizer

# Using JSONL
./Tiny-BPE-Trainer -i corpus.jsonl --jsonl -v 16000 -o imdb_tokenizer

Output Files

vocab.txt (HuggingFace Compatible)

<|endoftext|>
<|unk|>
<|pad|>  
<|mask|>
!
"
#
...
the
of
and
ing</w>
er</w>
...

merges.txt (BPE Rules)

#version: 0.2
i n
t h
th e
e r
...
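
Line order in merges.txt encodes priority: pairs learned earlier sit higher in the file and are applied earlier when encoding. A minimal loader sketch (illustrative, not the project's own code) that turns the file into a rank table:

#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>

// Read merges.txt into a rank table: line order == merge priority.
std::map<std::pair<std::string, std::string>, int>
load_merges(const std::string& path) {
    std::map<std::pair<std::string, std::string>, int> ranks;
    std::ifstream in(path);
    std::string line;
    int rank = 0;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;  // skip the "#version" header
        std::istringstream ss(line);
        std::string left, right;
        if (ss >> left >> right)
            ranks[{left, right}] = rank++;             // lower rank = applied earlier
    }
    return ranks;
}

int main() {
    auto ranks = load_merges("merges.txt");
    std::cout << "Loaded " << ranks.size() << " merge rules\n";
}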

API Reference

Core Methods

class TinyBPETrainer {
    // Configuration
    TinyBPETrainer& set_lowercase(bool enable);
    TinyBPETrainer& set_split_punctuation(bool enable);  
    TinyBPETrainer& set_normalize_whitespace(bool enable);
    TinyBPETrainer& set_special_tokens(eos, unk, pad, mask);
    
    // Training
    bool train_from_file(filepath, vocab_size=32000, min_freq=2);
    bool train_from_jsonl(filepath, text_field="text", vocab_size=32000, min_freq=2);
    
    // Output
    bool save_vocab(vocab_path);
    bool save_merges(merges_path);
    void print_stats();
    
    // Testing  
    std::vector<std::string> tokenize_test(text);
};

Configuration Options

TinyBPETrainer trainer;

trainer
    .set_lowercase(true)              // Convert to lowercase
    .set_split_punctuation(true)      // Split on punctuation  
    .set_normalize_whitespace(true)   // Normalize whitespace
    .set_special_tokens(              // Custom special tokens
        "<|endoftext|>", 
        "<|unk|>", 
        "<|pad|>", 
        "<|mask|>"
    );

Integration with Tokenizers

Use with Modern Text Tokenizer

#include "Modern-Text-Tokenizer.hpp" // Tokenizer
#include "Tiny-BPE-Trainer.hpp"    // BPE trainer

using namespace MecanikDev;

// Train BPE vocabulary
TinyBPETrainer trainer;
trainer.train_from_file("corpus.txt", 16000);
trainer.save_vocab("my_vocab.txt");
trainer.save_merges("my_merges.txt");

// Use with tokenizer 
TextTokenizer tokenizer;
tokenizer.load_vocab("my_vocab.txt");
auto token_ids = tokenizer.encode("Hello, world!");

Use with HuggingFace

# Python - load in HuggingFace Tokenizers
from tokenizers import Tokenizer
from tokenizers.models import BPE

# BPE() expects an in-memory vocab dict and merges list, so convert the
# plain-text outputs first (vocab.txt: one token per line; merges.txt:
# one space-separated pair per line after the "#version" header)
with open("my_vocab.txt", encoding="utf-8") as f:
    vocab = {line.rstrip("\n"): i for i, line in enumerate(f)}

with open("my_merges.txt", encoding="utf-8") as f:
    merges = [tuple(line.split()) for line in f
              if line.strip() and not line.startswith("#")]

tokenizer = Tokenizer(BPE(vocab=vocab, merges=merges))
tokens = tokenizer.encode("Hello, world!")

Performance

Starting BPE training...
   Input: imdb.txt
   Format: Plain text
   Vocab size: 32000
   Min frequency: 2
   Output prefix: tokenizer
Reading corpus from: imdb.txt
Processed 33157823 characters, 6952632 words
Unique word forms: 106008
Initial vocabulary size: 240
Starting BPE training...
    ...
BPE training completed!
   Final vocabulary size: 32000
   Total merges: 31760
   Training time: 1962 seconds
Saved vocabulary (32000 tokens) to: tokenizer_vocab.txt
Saved merges (31760 rules) to: tokenizer_merges.txt

Training completed successfully!
   Total time: 1966 seconds

Training Statistics:
   Characters processed: 33157823
   Words processed: 6952632
   Final vocab size: 32000
   BPE merges: 31760
   Compression ratio: 0.0010

Benchmark on AMD Ryzen 9 5900X, compiled with -O3.

Algorithm Details

BPE Training Process

  1. Preprocessing

    • Normalize whitespace
    • Convert to lowercase (optional)
    • Split punctuation (optional)
  2. Character Initialization

    "hello" → ["h", "e", "l", "l", "o", "</w>"]
    
  3. Iterative Merging (sketched in code after this list)

    Most frequent pair: "l" + "l" → "ll"
    "hello" → ["h", "e", "ll", "o", "</w>"]
    
  4. Vocabulary Building

    • Characters: h, e, l, o, </w>
    • Merges: ll, he, ell, hello
    • Special tokens: <|unk|>, <|pad|>, etc.
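
The following sketch makes steps 2 and 3 concrete on a two-word toy corpus: split words into symbols plus </w>, count adjacent pairs weighted by word frequency, merge the winning pair everywhere, and repeat. It splits bytes for brevity, whereas the real trainer splits full UTF-8 characters and applies the minimum-frequency cutoff:

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Toy corpus: word form -> frequency, pre-split into symbols + "</w>"
    std::map<std::vector<std::string>, int> words;
    const std::map<std::string, int> corpus = {{"hello", 5}, {"hell", 2}};
    for (const auto& [w, f] : corpus) {
        std::vector<std::string> syms;
        for (char c : w) syms.emplace_back(1, c);  // byte split; real code splits UTF-8
        syms.push_back("</w>");
        words[syms] += f;
    }

    for (int step = 0; step < 3; ++step) {
        // Count adjacent pairs, weighted by word frequency
        std::map<std::pair<std::string, std::string>, int> pairs;
        for (const auto& [syms, f] : words)
            for (size_t i = 0; i + 1 < syms.size(); ++i)
                pairs[{syms[i], syms[i + 1]}] += f;
        if (pairs.empty()) break;

        // Pick the most frequent pair (ties broken by map order here)
        const auto best = std::max_element(pairs.begin(), pairs.end(),
            [](const auto& a, const auto& b) { return a.second < b.second; });
        const auto [left, right] = best->first;
        std::cout << "merge " << step << ": " << left << " + " << right << "\n";

        // Apply the merge in every word form
        std::map<std::vector<std::string>, int> merged;
        for (const auto& [syms, f] : words) {
            std::vector<std::string> out;
            for (size_t i = 0; i < syms.size(); ++i) {
                if (i + 1 < syms.size() && syms[i] == left && syms[i + 1] == right) {
                    out.push_back(left + right);
                    ++i;
                } else {
                    out.push_back(syms[i]);
                }
            }
            merged[out] += f;
        }
        words = std::move(merged);
    }
}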

Key Features

  • Subword Units: Handles unknown words through decomposition (see the sketch below)
  • Frequency-Based: Most common patterns get merged first
  • Deterministic: Same corpus always produces same vocabulary
  • Compression: Reduces vocabulary size vs. word-level tokenization
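
To see subword decomposition in action, the sketch below greedily applies a small hypothetical merge table (in practice this comes from merges.txt) to the unseen word "lower". This mirrors what an encoder does conceptually; it is not Modern Text Tokenizer's code:

#include <climits>
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Hypothetical ranks; a real table comes from merges.txt (lower = earlier)
    const std::map<std::pair<std::string, std::string>, int> ranks = {
        {{"l", "o"}, 0}, {{"lo", "w"}, 1}, {{"e", "r"}, 2}, {{"er", "</w>"}, 3}};

    // "lower" was never seen as a whole word; start from characters
    std::vector<std::string> syms = {"l", "o", "w", "e", "r", "</w>"};
    while (true) {
        int best_rank = INT_MAX;
        size_t best_i = 0;
        for (size_t i = 0; i + 1 < syms.size(); ++i) {
            auto it = ranks.find({syms[i], syms[i + 1]});
            if (it != ranks.end() && it->second < best_rank) {
                best_rank = it->second;
                best_i = i;
            }
        }
        if (best_rank == INT_MAX) break;        // no learned merge applies
        syms[best_i] += syms[best_i + 1];       // fuse the winning pair
        syms.erase(syms.begin() + best_i + 1);
    }
    for (const auto& s : syms) std::cout << s << " ";  // prints: low er</w>
    std::cout << "\n";
}

Even though "lower" never appeared as a whole token, it comes out as the two learned subwords low and er</w>.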

Troubleshooting

Common Issues

"Training failed" Error

# Check file exists and is readable
ls -la corpus.txt
file corpus.txt

# Try smaller vocabulary size
./Tiny-BPE-Trainer -i corpus.txt -v 8000 -m 1

Slow Training

# Increase minimum frequency
./Tiny-BPE-Trainer -i corpus.txt -v 32000 -m 10

# Use smaller corpus for testing
head -n 10000 large_corpus.txt > small_test.txt

Memory Issues

# Monitor memory usage
top -p $(pgrep -f Tiny-BPE-Trainer)

# Reduce vocabulary size
./Tiny-BPE-Trainer -i corpus.txt -v 16000

Performance Tips

  1. Start Small: Test with small corpus and vocabulary first
  2. Adjust min_frequency: Higher values = faster training, smaller vocab
  3. Preprocessing: Clean your corpus for better results
  4. Incremental: Train smaller models first, then scale up

Roadmap

Planned Features

  • Parallel Training: Multi-threaded BPE training
  • Streaming Mode: Process huge files without loading into memory
  • Advanced Preprocessing: Custom regex patterns, language-specific rules
  • Evaluation Metrics: Compression ratio, OOV handling statistics
  • Visualization: Plot vocabulary growth, merge frequency distributions
  • Export Formats: SentencePiece, custom binary formats

Future Considerations

  • Tokenizer Integration: Seamless loading of trained BPE models
  • HuggingFace Plugin: Direct integration with transformers library
  • TensorFlow/PyTorch: C++ ops for training integration

Contributing

We welcome contributions! Areas of interest:

  1. Performance: SIMD optimizations, better algorithms
  2. Features: New preprocessing options, export formats
  3. Testing: More edge cases, different languages
  4. Documentation: Tutorials, examples, use cases

License

MIT License - see LICENSE file for details.

Acknowledgments

  • Inspired by open-source libraries like SentencePiece and HuggingFace Tokenizers
  • Format compatibility modeled after HuggingFace's vocab.txt and merges.txt outputs
  • Based on the subword Byte Pair Encoding approach of Sennrich et al., "Neural Machine Translation of Rare Words with Subword Units" (2016)
  • UTF-8 safety and normalization techniques informed by modern C++ text processing resources

⭐ Star this repo if you find it useful!

Built with ❤️ for the C++ and NLP community
