How Large Language Models Actually Work — No PhD Required

Every time you ask ChatGPT, Claude, or Gemini a question, something remarkable happens inside a machine. Billions of numbers shift and interact, and out comes a sentence that feels almost human. But how does it actually work?

It all starts with tokens

Before a language model reads your words, it breaks them into tokens — fragments of text that might be a word, part of a word, or even a single character. "Unbelievable" might become ["Un","belie","vable"]. Each token gets converted into a long list of numbers called a vector — its mathematical identity in the model's world.
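To make that concrete, here is a minimal sketch of the idea in Python. Everything in it is an assumption made for illustration: the tiny vocabulary, the greedy longest-match rule, and the random 4-number vectors are all hypothetical. Real models learn vocabularies of tens of thousands of pieces with algorithms such as byte-pair encoding, and they learn the vectors during training.

```python
import random

# Hypothetical toy vocabulary; real vocabularies are learned from data.
VOCAB = {"<unk>": 0, "Un": 1, "belie": 2, "vable": 3, "able": 4, "be": 5}


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match split of text into vocabulary pieces."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one is in the vocab.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched: emit the single character as an unknown piece.
            tokens.append(text[i])
            i += 1
    return tokens


def encode(text, vocab=VOCAB):
    """Map each token to its integer ID, using <unk> for unknown pieces."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text, vocab)]


def make_embeddings(vocab, dim=4, seed=0):
    """Give every token ID a random vector (a stand-in for learned embeddings)."""
    rng = random.Random(seed)
    return {tid: [rng.uniform(-1, 1) for _ in range(dim)] for tid in vocab.values()}


EMBEDDINGS = make_embeddings(VOCAB)

tokens = tokenize("Unbelievable")
ids = encode("Unbelievable")
vectors = [EMBEDDINGS[i] for i in ids]

print(tokens)  # ['Un', 'belie', 'vable']
print(ids)     # [1, 2, 3]
```

The model never sees the letters themselves, only the IDs and the vectors they point to.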

The transformer architecture

The secret sauce of nearly every modern LLM is the transformer, introduced by Google researchers in 2017. Its core mechanism is self-attention: the ability for every token in a sequence to "look at" the other tokens and decide how much it should care about each one (in GPT-style models, a token can only look back at the tokens before it). When processing the word "bank," the model checks whether nearby tokens suggest a riverbank or a financial institution.
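Here is a rough sketch of the attention calculation itself, using the standard scaled dot-product form. It is simplified under a few stated assumptions: the three 2-number token vectors are made up, each token's vector is used directly as its query, key, and value (real models first pass it through learned projection matrices), there is only one attention head, and no mask is applied, so every token can see every other token.

```python
import math


def softmax(xs):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(vectors):
    """Single-head scaled dot-product self-attention without learned projections."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # How strongly does this token match each token in the sequence?
        scores = [dot(query, key) / math.sqrt(d) for key in vectors]
        weights = softmax(scores)
        # The new representation is a weighted average of all token vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-number vectors for the tokens "river", "bank", "water".
sequence = [[1.0, 0.0], [0.7, 0.7], [0.9, 0.1]]
outputs, weights = self_attention(sequence)

for name, row in zip(["river", "bank", "water"], weights):
    print(name, [round(w, 2) for w in row])
```

Each printed row shows how one token spreads its attention across the sequence. The output for "bank" is a blend of its neighbours' vectors, which is how context gets mixed into a word's representation.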

Layers upon layers

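A transformer isn't one attention mechanism; it's dozens of layers stacked on top of each other, sometimes more than a hundred, each with its own attention step. GPT-3, for example, has 96 layers (OpenAI has not published GPT-4's architecture). Each layer refines the representation of every token. As a rough picture, earlier layers tend to capture surface features such as grammar, while later layers tend to capture more abstract meaning, context, and intent, though interpretability research suggests the division is far from clean.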
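Stacking layers is also where most of the parameters come from. As a back-of-the-envelope sketch, a common rule of thumb puts a transformer's non-embedding parameter count at about 12 × layers × width². The function name below is made up for illustration, and the formula assumes a standard layer layout with a feed-forward block four times the model width. The inputs are GPT-3's published configuration: 96 layers and a width of 12,288.

```python
def approx_transformer_params(n_layers, d_model):
    """Rough non-embedding parameter count: 12 * n_layers * d_model**2.

    Per layer: 4 * d_model**2 for attention (query, key, value, and output
    projections) plus 8 * d_model**2 for a feed-forward block whose hidden
    size is 4 * d_model. Biases, layer norms, and embeddings are ignored.
    """
    attention = 4 * d_model ** 2
    feed_forward = 2 * d_model * (4 * d_model)
    return n_layers * (attention + feed_forward)


gpt3_estimate = approx_transformer_params(n_layers=96, d_model=12288)
print(f"{gpt3_estimate / 1e9:.0f} billion")  # prints "174 billion"
```

That lands close to the 175 billion parameters reported for GPT-3, which suggests the approximation is reasonable for models of this shape.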

Training: reading the internet

LLMs are trained by predicting the next token across hundreds of billions to trillions of tokens of text drawn from the web, books, and code. The model starts with random numbers, makes predictions, measures how wrong it is, and nudges its billions of parameters in the direction that reduces the error. It repeats this update hundreds of thousands of times or more, over weeks or months on large clusters of accelerators. The nudging is called gradient descent (with backpropagation working out which way to nudge each parameter), and it's how all that number-shuffling eventually produces something that can write poetry or debug code.
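The loop described above can be sketched at toy scale. This example is a deliberately tiny stand-in, not how production training works: the "model" is just a table of scores for which word follows which (a bigram model), the nine-word corpus is made up, and the gradient is written out by hand. Real LLMs run the same predict, measure, nudge cycle through many transformer layers, using backpropagation and optimizers such as Adam.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def train_bigram(token_ids, vocab_size, steps=200, lr=0.5, seed=0):
    """Fit next-token scores with plain full-batch gradient descent."""
    rng = random.Random(seed)
    # Start from small random numbers, as the text describes.
    logits = [[rng.uniform(-0.1, 0.1) for _ in range(vocab_size)]
              for _ in range(vocab_size)]
    pairs = list(zip(token_ids, token_ids[1:]))  # (current token, next token)
    losses = []
    for _ in range(steps):
        grads = [[0.0] * vocab_size for _ in range(vocab_size)]
        loss = 0.0
        for prev, nxt in pairs:
            probs = softmax(logits[prev])
            loss -= math.log(probs[nxt])  # cross-entropy: how wrong were we?
            for j in range(vocab_size):
                # Gradient of cross-entropy w.r.t. logits: probs minus one-hot.
                grads[prev][j] += probs[j] - (1.0 if j == nxt else 0.0)
        for i in range(vocab_size):
            for j in range(vocab_size):
                # The "nudge": step against the averaged gradient.
                logits[i][j] -= lr * grads[i][j] / len(pairs)
        losses.append(loss / len(pairs))
    return logits, losses


text = "the cat sat on the mat the cat sat".split()
vocab = {word: i for i, word in enumerate(sorted(set(text)))}
ids = [vocab[word] for word in text]

logits, losses = train_bigram(ids, len(vocab))
print(f"loss at start: {losses[0]:.3f}, loss at end: {losses[-1]:.3f}")

after_the = softmax(logits[vocab["the"]])
best = max(vocab, key=lambda word: after_the[vocab[word]])
print("most likely word after 'the':", best)
```

In this corpus "the" is followed by "cat" twice and "mat" once, so after training the model favours "cat" while still leaving some probability for "mat".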

Why do they hallucinate?

LLMs don't look facts up in a database. They compress patterns from training data into weights, and generate text that statistically "fits." When they don't have a strong pattern to follow — like a niche historical event — they invent one that sounds plausible. It's not lying; it's the mathematical equivalent of a confident guess.
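A small sketch shows why the guess comes out sounding just as confident either way. The candidate answers and the scores below are entirely hypothetical, and real models choose among tens of thousands of tokens rather than four. The point is only that sampling always returns some token, whether the scores strongly favour one answer or are nearly flat.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Uncertainty of a distribution in bits (0 means fully certain)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def sample(tokens, probs, rng):
    """Pick one token at random, weighted by its probability."""
    return rng.choices(tokens, weights=probs, k=1)[0]


candidates = ["1912", "1915", "1921", "1908"]

# Hypothetical scores: a well-documented fact vs. a niche one.
well_known = softmax([8.0, 1.0, 0.5, 0.2])
obscure = softmax([1.1, 1.0, 0.9, 1.0])

rng = random.Random(0)
for label, probs in [("well known", well_known), ("obscure", obscure)]:
    answer = sample(candidates, probs, rng)
    print(f"{label}: answered {answer}, uncertainty {entropy_bits(probs):.2f} bits")
```

Both cases produce a fluent-looking answer. Nothing in the sampling step turns the high uncertainty of the second case into "I don't know"; that behaviour has to be trained in.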

What's next?

Researchers are actively working on reasoning models (like o1/o3) that think step-by-step before answering, multimodal models that see images and hear audio, and agentic systems that take real-world actions. The transformer that Google invented in 2017 is still at the core of all of it — which is either remarkable or slightly alarming, depending on your perspective.

The role of RLHF in making models helpful

Raw pretraining produces a model that can complete text, but not one that reliably follows instructions or avoids harmful outputs. This is where Reinforcement Learning from Human Feedback (RLHF) enters. Human raters compare or rank thousands of model responses, those preferences are used to train a separate reward model that predicts which responses people prefer, and the language model is then fine-tuned to score well against it. The goal is a model that is helpful, harmless, and honest rather than just statistically fluent.

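This process is a large part of why Claude is more likely to say "I'm not sure" than to invent an answer, and why ChatGPT will usually refuse to write malware. The model has learned, through human signal, that these responses score better than the alternatives. It is not a guarantee: tuned models still hallucinate and can still be talked into things they should refuse.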
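How does a human rating become a training signal? One common formulation, sketched here with made-up scores, trains a reward model on pairs of responses using a pairwise preference loss: the loss is small when the response the human preferred gets the higher score and large when it doesn't. The function name is hypothetical, and the later step, where the language model is optimized against the reward model, is not shown.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).

    Written as log(1 + exp(-diff)), which is the same quantity.
    """
    diff = reward_chosen - reward_rejected
    return math.log1p(math.exp(-diff))


# Hypothetical reward-model scores for two responses to the same prompt.
agrees_with_human = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees_with_human = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
undecided = preference_loss(reward_chosen=1.0, reward_rejected=1.0)

print(f"reward model agrees with the rater:    {agrees_with_human:.3f}")
print(f"reward model disagrees with the rater: {disagrees_with_human:.3f}")
print(f"reward model cannot tell them apart:   {undecided:.3f}")
```

Minimizing this loss over many rated pairs pushes the reward model to score responses the way the raters did.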

Context windows: the model's working memory

One of the most practical limitations of LLMs is the context window — the maximum amount of text a model can "see" at once. Early GPT models had windows of around 2,000 tokens. Today's frontier models have extended this to 128,000 tokens or more, allowing them to process entire books in a single prompt.

The context window determines how much conversation history, document text, and instruction the model can use simultaneously. When you paste a long document and ask questions about it, the entire document lives within this window. When the window fills up, something has to give: most applications drop or summarize the oldest content, and whatever falls outside the window is effectively forgotten because the model cannot reference it.
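One simple way an application might handle a full window is to keep only the most recent messages that fit. The sketch below assumes exactly that strategy. The function name is made up, and counting whitespace-separated words is a crude stand-in for a real tokenizer. Actual products differ, and some summarize old content instead of dropping it.

```python
def fit_to_window(messages, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the most recent messages whose combined token count fits the window."""
    kept = []
    used = 0
    # Walk backwards from the newest message, stopping when the budget runs out.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))


history = [
    "my name is Ada",
    "what is a token",
    "and what is a vector",
]

visible = fit_to_window(history, max_tokens=9)
print(visible)  # ['what is a token', 'and what is a vector']
```

With a budget of nine "tokens", the first message no longer fits, so a model given only `visible` has no way to know the user's name.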

Emergent capabilities: the mystery no one fully understands

One of the strangest phenomena in LLM research is emergence: the seemingly sudden appearance of capabilities that weren't present at smaller scale. A model with 10 billion parameters might be unable to do multi-step arithmetic. Train a model with 100 billion parameters on the same data, and it may handle the same problems far more reliably, even though the training process didn't change.

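Researchers have reported dozens of these emergent skills: translation between languages without any dedicated translation training, basic programming, analogical reasoning, and, more controversially, performance on theory-of-mind tests. Why scale produces these jumps is still not fully understood, and some researchers argue that part of the suddenness comes from how the skills are measured rather than from the models themselves. It remains one of the most active areas of AI safety and interpretability research.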

What you should actually know as a user

Understanding how LLMs work changes how you use them. Knowing that these models work by predicting plausible text, rather than looking up facts, helps you understand why they hallucinate, why specificity in prompts matters, and why asking them to "think step by step" often improves their accuracy on multi-step problems (the model has to generate intermediate tokens, and those tokens then condition the final answer).

The transformer revolution of 2017 gave us the backbone. RLHF gave us safety and helpfulness. Scaling gave us emergence. And the race to understand what's actually happening inside these billions of parameters is still very much underway.