Sovereign Tokenizer V2.1: 99.86% Accuracy at 2.5x Compression

| Peter Bradshaw researchtokenizeraxiom

We set out to build a tokenizer that could compress Axiom-format text more efficiently than standard BPE. The result exceeded expectations.

Results

Metric Value
Accuracy 99.86%
False Positive Rate 0.09%
Compression 2.5x (12 slots vs 30 BPE tokens)
Attention Savings 6.2x
Total Training Cost $8 (GH200)

Architecture

The key insight: compress before you quantise, not after.

Our pipeline: BPE (8k vocab) encodes the text, then an encoder maps to continuous space, a slot compressor reduces to 12 named slots, vector quantisation maps to 4k discrete codes, and an MLP decoder reconstructs.

The Peter Principle

The training used curriculum learning — learn ABCs before spelling. Stage 1 trains reconstruction on clean data. Stage 2 adds noise. Stage 3 handles edge cases. Each stage starts from the best checkpoint of the previous one.

This biological learning order is why an $8 training run outperformed brute-force approaches that cost 100x more.

What’s Next

The tokenizer will be integrated into RAH (our custom SSM architecture) and deployed on Jetson Orin for edge inference.