Attention Is All You Need — What It Actually Changed (And What It Didn't)

Originally posted as a thread on X/Twitter by @SovereignLabsAU, 27 March 2026. Corrected below — see the note.

Correction (2026-06-14): An earlier version of this post claimed the Transformer “flips the paradigm from prediction to understanding” and that these models do “real intelligence,” not prediction. That was overclaiming, and one line (“it sees the full composition before it plays a single bar”) flatly contradicted how the decoder actually generates text — one token at a time, left to right. We build on measured claims, not hype, so we’re fixing our own. The architecture facts below are unchanged; the framing now matches what the paper actually shows.


I took the 2017 paper that started the whole LLM game — Attention Is All You Need — and tried to explain it in plain English: what attention really changed, and, just as important, what it didn’t.

No PhD gatekeeping. No scary robot talk. Just what’s actually true.


What came before

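The old guard of sequence models leaned on recurrent loops (RNNs, LSTMs) or convolutions. Even when they bolted attention on top, the recurrent backbone was still sequential: every step waited on the one before it. That ruled out parallelism within a training example, and on long sequences memory limits capped how many examples fit in a batch. Convolutional models could run in parallel, but the number of operations needed to relate two positions grew with the distance between them.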

What the Transformer did

It threw out recurrence and convolutions entirely and built the network from attention alone. That’s not spin — it’s the paper’s own abstract.

The payoff is real and concrete:

  • Parallel, not sequential. The encoder can attend over the whole input at once instead of crawling token by token.
  • Constant path length between any two tokens. Any position can reach any other in one hop, regardless of distance — the paper’s actual argument (their Table 1), and why long-range dependencies got easier to learn.
  • Faster training, higher quality. It hit state-of-the-art translation (WMT'14 EN-DE and EN-FR) using far less training compute than the recurrent and convolutional models it beat.
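
To make the first two bullets concrete, here is a toy sketch, not the paper's code. The function names and the scalar weights are made up for illustration. It shows why a recurrence has to run in order, and how the hop count between two positions differs.

```python
import math

def recurrent_states(inputs, w_h=0.5, w_x=1.0):
    """Toy scalar RNN (hypothetical weights): state t needs state t-1."""
    h = 0.0
    states = []
    for x in inputs:
        # This line depends on the previous h, so positions must run in order.
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def recurrent_path_length(i, j):
    """Steps a signal travels between positions i and j through the recurrence."""
    return abs(i - j)

def attention_path_length(i, j):
    """With self-attention, any position reads any other directly in one layer."""
    return 0 if i == j else 1

print(recurrent_path_length(0, 99), attention_path_length(0, 99))  # prints "99 1"
```

The hop counts are the intuition behind the path-length column in the paper's Table 1, reduced to a single layer.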

Under the hood it’s an encoder-decoder built from stacks of identical layers. Each encoder layer has just two pieces: multi-head self-attention and a position-wise feed-forward net, each wrapped in a residual connection and layer normalisation. Each decoder layer adds a third piece, an attention block over the encoder’s output. Positional encodings tell it the order. The attention function itself is simple: a query meets a set of key-value pairs and returns a weighted sum of the values (scaled dot-product), with multiple heads so different subspaces can learn different relationships. That’s genuinely most of it.
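
The attention function described above fits in a few lines. This is a minimal single-head sketch in plain Python lists, assuming unbatched inputs and no learned projections. The function names are ours, not the paper's.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """For each query, return a weighted sum of the values.

    Weights are softmax(q . k / sqrt(d_k)) over all keys.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        out = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs

# Two identical keys get equal weight, so the output is the mean of the values.
result = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[2.0], [4.0]],
)
```

Multi-head attention runs several of these in parallel on learned linear projections of the queries, keys and values, then concatenates the results. That step is left out here.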

The part the hype gets wrong

Here’s where the original thread oversold it, so let’s be straight.

Attention changed the architecture: how the model computes its answer. It did not change the objective. A Transformer language model is still trained on next-token prediction: minimise the error on “what word comes next.” Mathematically it’s still estimating p(next token | context). The richer machinery makes deeper internal structure possible; it doesn’t replace prediction with something else.
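
Here is what that objective looks like as arithmetic, in a minimal sketch. It assumes the model has already produced one vector of raw scores (logits) over the vocabulary at each position. The function name and toy inputs are ours.

```python
import math

def next_token_loss(logits_per_position, target_ids):
    """Average negative log-probability assigned to the true next token."""
    total = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        # log-sum-exp gives the log of the softmax normaliser, computed stably.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # -log p(target | context) = log_z - logit of the target token
        total += log_z - logits[target]
    return total / len(target_ids)

# A model with no preference over a 4-token vocabulary scores log(4).
uniform_loss = next_token_loss([[0.0, 0.0, 0.0, 0.0]], [2])
```

Training pushes this number down. Nothing in it mentions attention: the same loss would apply to a recurrent model.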

And the generation step is still left-to-right. The decoder, which is the half GPT-style LLMs are built from (decoder-only, with no encoder to attend to), is masked: at each step it can only see what came before, and it emits one token at a time, autoregressively. So “it sees the whole composition before it plays a bar” is the wrong picture for how an LLM actually writes a sentence. The encoder is bidirectional; the generator is not.
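
The masking and the one-token-at-a-time loop can be sketched like this. It is a toy under stated assumptions: `next_token_fn` is a hypothetical stand-in for a real model, and the mask is applied by dropping disallowed positions from the softmax. That has the same effect as the paper's approach of setting those scores to negative infinity before the softmax.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over allowed entries only; disallowed entries get weight 0."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    m = max(kept)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

def generate(next_token_fn, prompt, max_new_tokens):
    """Autoregressive loop: each new token is chosen from the tokens so far."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token_fn(tokens))
    return tokens

mask = causal_mask(3)
# Position 1 can see positions 0 and 1, but not position 2.
weights = masked_softmax([1.0, 1.0, 1.0], mask[1])
# Stand-in "model" that just counts upward from the last token.
sequence = generate(lambda toks: toks[-1] + 1, prompt=[0], max_new_tokens=3)
```

Each row of the mask is a position's field of view, and none of them includes the future.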

So what’s the honest takeaway?

Attention didn’t end the story of prediction — it gave prediction much better machinery to run on: global, parallel, distance-agnostic relationships instead of a sequential crawl. That machinery is what makes richer internal representations possible.

Whether those representations amount to anything like understanding is the genuinely open question — and the field’s reflex of “just add more parameters and data and pray” is not an answer to it. Measuring what’s actually in there, rather than assuming it, is the real work. That’s the part we’re interested in.

AI isn’t magic and it isn’t a fraud. It’s a prediction engine with remarkable machinery — and the honest, interesting question is how far that machinery can be pushed, and what it’s really doing when it works.


Sovereign Labs AU — Melbourne. We correct our own record.