Sovereign Tokenizer V2.1: 99.86% Accuracy at 2.5x Compression
We set out to build a tokenizer that could compress Axiom-format text more efficiently than standard BPE. The result exceeded expectations.
Results
| Metric | Value |
|---|---|
| Accuracy | 99.86% |
| False Positive Rate | 0.09% |
| Compression | 2.5x (12 slots vs 30 BPE tokens) |
| Attention Savings | 6.2x |
| Total Training Cost | $8 (GH200) |
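
The two ratios in the table can be checked with a few lines of arithmetic. This sketch assumes the attention savings figure comes from the usual quadratic cost of self-attention in sequence length. The post does not state that, and the function names are illustrative only.

```python
def compression_ratio(bpe_tokens: int, slots: int) -> float:
    """Sequence-length reduction: BPE tokens in, slots out."""
    return bpe_tokens / slots


def attention_savings(bpe_tokens: int, slots: int) -> float:
    """Assumes self-attention cost grows with the square of sequence length."""
    return (bpe_tokens ** 2) / (slots ** 2)


ratio = compression_ratio(30, 12)    # 30 / 12
savings = attention_savings(30, 12)  # 900 / 144

print(f"compression: {ratio}x, attention savings: {savings}x")
```

Under that assumption the savings come out at 6.25x, in line with the roughly 6.2x reported above.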
Architecture
The key insight: compress before you quantise, not after.
The pipeline has five stages: BPE (8k vocab) encodes the text, an encoder maps the tokens into a continuous space, a slot compressor reduces the sequence to 12 named slots, vector quantisation maps each slot to one of 4k discrete codes, and an MLP decoder reconstructs the input.
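
To make the data flow concrete, here is a minimal shape-level sketch of those stages in plain Python. It is not the actual implementation: the weights are random and untrained, the embedding width `DIM` is a made-up toy value, and the slot compressor is assumed to be cross-attention from 12 learned queries, which the post does not specify. The exact sizes 8,192 and 4,096 are also assumed readings of "8k" and "4k".

```python
import math
import random

VOCAB_SIZE = 8_192      # "8k" BPE vocabulary (exact size is an assumption)
NUM_SLOTS = 12          # slots per sequence, from the results table
CODEBOOK_SIZE = 4_096   # "4k" discrete codes (exact size is an assumption)
DIM = 8                 # hypothetical embedding width, tiny for the demo

rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Random stand-ins for trained parameters.
embedding = rand_matrix(VOCAB_SIZE, DIM)
slot_queries = rand_matrix(NUM_SLOTS, DIM)
codebook = rand_matrix(CODEBOOK_SIZE, DIM)
decoder_w1 = rand_matrix(DIM, DIM)
decoder_w2 = rand_matrix(DIM, DIM)


def encode(token_ids):
    """Map BPE token ids to continuous vectors (here: an embedding lookup)."""
    return [embedding[t] for t in token_ids]


def compress_to_slots(vectors):
    """Each slot query attends over all token vectors (assumed mechanism)."""
    slots = []
    for q in slot_queries:
        weights = softmax([dot(q, v) / math.sqrt(DIM) for v in vectors])
        slots.append([sum(w * v[d] for w, v in zip(weights, vectors))
                      for d in range(DIM)])
    return slots


def quantise(slots):
    """Replace each slot by the index of its nearest codebook vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(CODEBOOK_SIZE), key=lambda i: sq_dist(s, codebook[i]))
            for s in slots]


def decode(codes):
    """Two-layer MLP with ReLU applied to each looked-up code vector."""
    out = []
    for c in codes:
        hidden = [max(0.0, dot(row, codebook[c])) for row in decoder_w1]
        out.append([dot(row, hidden) for row in decoder_w2])
    return out


token_ids = [rng.randrange(VOCAB_SIZE) for _ in range(30)]  # stand-in for BPE output
codes = quantise(compress_to_slots(encode(token_ids)))
reconstruction = decode(codes)
print(len(token_ids), "tokens ->", len(codes), "codes")
```

Note the ordering: the sequence is reduced to 12 slots first, and only those 12 vectors are quantised, which is the "compress before you quantise" idea. The toy decoder emits one vector per slot. How the real decoder expands slots back into the original tokens is not described in the post.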
The Peter Principle
The training used curriculum learning: learn the ABCs before spelling. Stage 1 trains reconstruction on clean data, Stage 2 adds noise, and Stage 3 handles edge cases. Each stage starts from the best checkpoint of the previous one.
This easy-to-hard ordering, which mirrors how people learn, is why an $8 training run outperformed brute-force approaches that cost 100x more.
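
The staged schedule can be illustrated with a toy example. The sketch below fits a single weight `w` in `y = 3x` rather than a tokenizer, and every name and hyperparameter in it is hypothetical. What it shares with the training described above is the control flow: three stages of increasing difficulty, each starting from the best checkpoint of the previous one.

```python
import random


def make_data(rng, n, x_range, noise):
    """Toy task: y = 3x, optionally with Gaussian noise on y."""
    xs = [rng.uniform(-x_range, x_range) for _ in range(n)]
    ys = [3.0 * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train_stage(w, x_range, noise, rng, steps=300, lr=0.05):
    """Train from checkpoint `w`; return the best checkpoint by validation loss."""
    val_xs, val_ys = make_data(rng, 64, x_range, noise)
    best_w, best_loss = w, mse(w, val_xs, val_ys)
    for _ in range(steps):
        xs, ys = make_data(rng, 16, x_range, noise)
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
        loss = mse(w, val_xs, val_ys)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss


# (name, input range, label noise): clean data, then noise, then wider inputs.
STAGES = [("clean", 1.0, 0.0), ("noisy", 1.0, 0.1), ("edge cases", 3.0, 0.0)]

rng = random.Random(0)
w = 0.0
history = []
for name, x_range, noise in STAGES:
    # Each stage resumes from the best checkpoint of the previous stage.
    w, val_loss = train_stage(w, x_range, noise, rng)
    history.append((name, w, val_loss))
    print(f"{name}: w={w:.3f}, val_loss={val_loss:.5f}")
```

Selecting the checkpoint on a held-out set, rather than taking the last step, means a noisy final update in one stage is not carried into the next.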
What’s Next
The tokenizer will be integrated into RAH (our custom SSM architecture) and deployed on Jetson Orin for edge inference.