Generative AI Masterclass
Build with LLMs end to end - prompting, RAG, fine-tuning and generation projects.
View course →
Inside the models behind modern AI
Tokens, embeddings, attention and transformers - explained from the inside, then built. This is the page that turns "AI feels like magic" into "I know exactly what is happening, and I can make it do useful work."
The whole model in three moves
A language model never sees words the way you do. It reads a sentence as a list of tokens, lets each token look at the others to decide what matters, and then guesses the single most likely token to come next. Everything else is detail on top of these three moves.
To fill the blank, the model leans on "can" and "write" more than on "The". Those weighted links are attention.
The model assigns a probability to every token it knows, then samples one. Pick "code", append it, and run the whole loop again to produce the next token. A long answer is just this loop, repeated, one token at a time.
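The loop above can be sketched with a toy stand-in for the model. This is an illustration under loud assumptions, not how a real LLM works inside: NEXT_TOKEN_PROBS is a made-up lookup table, generate is a hypothetical helper name, and the table looks only at the last token, whereas a real model conditions on everything in its context.

```python
import random

# Made-up probabilities standing in for a real model (assumption for illustration).
# A real model scores every token in its vocabulary using the whole context.
NEXT_TOKEN_PROBS = {
    "The": {"model": 0.9, "code": 0.1},
    "model": {"can": 0.8, "is": 0.2},
    "can": {"write": 0.7, "read": 0.3},
    "write": {"code": 0.6, "text": 0.4},
    "read": {"code": 0.5, "text": 0.5},
    "is": {"<end>": 1.0},
    "code": {"<end>": 1.0},
    "text": {"<end>": 1.0},
}


def generate(prompt_tokens, max_new_tokens=10, seed=0):
    """Repeat the loop: get probabilities, sample one token, append it."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS[tokens[-1]]
        candidates = list(probs.keys())
        weights = list(probs.values())
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens


print(" ".join(generate(["The"])))
```

Change the seed and the sentence can change, because each step samples from the probabilities rather than always taking the top entry.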
The pipeline, in order
Here is the full path a sentence takes through a model, broken into the parts you can actually reason about and, later in our classes, build.
A model cannot read letters; it reads tokens, which are numbered fragments of text. A tokeniser slices your input into common chunks - whole words like "the", pieces like "token" + "ise", and even single characters for rare strings. Each chunk maps to an integer in a fixed vocabulary. This is why models count tokens, not words, why unusual names cost more tokens, and why your context limit is measured in tokens. Misjudge the token count and your prompt can be silently truncated. Understanding it is the first practical lever you gain.
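Here is a minimal sketch of the slicing step, assuming a tiny hand-written vocabulary. VOCAB and tokenise are hypothetical names, and greedy longest-match is a simplification chosen for readability; real tokenisers typically use learned schemes such as byte-pair encoding with vocabularies of tens of thousands of entries.

```python
# Hypothetical toy vocabulary (assumption); each chunk maps to an integer id.
VOCAB = {"the": 0, "token": 1, "ise": 2, "r": 3, " ": 4}


def tokenise(text, vocab=VOCAB):
    """Greedy longest-match: at each position take the longest chunk in the vocabulary."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest possible chunk first, then shrink it one character at a time.
        for end in range(len(text), i, -1):
            chunk = text[i:end]
            if chunk in vocab:
                ids.append(vocab[chunk])
                i = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return ids


print(tokenise("the tokeniser"))  # [0, 4, 1, 2, 3]
```

Two words became five tokens: "the" is one chunk, but "tokeniser" splits into "token", "ise" and "r". That gap between word count and token count is what your context limit actually measures.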
Each token integer is looked up in a giant table and replaced with a vector - a long list of numbers that the model learned during training. Vectors are how the model stores meaning: tokens used in similar ways end up near each other in this space, so "king" and "queen" sit closer than "king" and "bicycle". Position information is added too, so the model knows token order. From here on, the model is working entirely with numbers, and everything it does that looks like understanding comes from arithmetic on those numbers.
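"Near each other" can be made concrete with cosine similarity, one common way to compare vectors. The sketch below uses made-up three-dimensional vectors purely for illustration; EMBEDDINGS is a hypothetical table, and real embeddings are learned and far longer.

```python
import math

# Made-up vectors (assumption); real embeddings are learned during training
# and have hundreds or thousands of dimensions.
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "bicycle": [0.1, 0.2, 0.95],
}


def cosine_similarity(a, b):
    """1.0 means the vectors point the same way; values near 0 mean unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


king_queen = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["queen"])
king_bicycle = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["bicycle"])
print(f"king vs queen:   {king_queen:.3f}")
print(f"king vs bicycle: {king_bicycle:.3f}")
```

With these toy numbers the first score comes out much higher than the second, which is the pattern a trained model's table shows for words used in similar ways.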
This is the core idea, and it is simpler than the name suggests. For every token, the model asks: which tokens should I pay attention to right now? It scores each pair of tokens, turns those scores into weights that sum to one, and blends the tokens' information in proportion to those weights. In "the trophy did not fit in the case because it was too big", attention is how the model links "it" to "trophy" rather than "case". Attention is the mechanism that lets context change meaning.
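The score, weight, blend steps can be sketched in a few lines. This is a simplified version of scaled dot-product attention under stated assumptions: the token vectors are made up, and the same vectors are reused as queries, keys and values, whereas a real model first passes them through learned projection matrices.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """For each query token, weight every value by how well its key matches."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token against every token (dot product, scaled by sqrt(dim)).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Blend the value vectors in proportion to the weights.
        blended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up 2-dimensional vectors for three tokens (illustrative only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attend(tokens, tokens, tokens)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row is one token's attention weights over all three tokens, and each row sums to one.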
One attention step is useful; stacking dozens of them is powerful. A transformer interleaves attention layers with small feed-forward networks, and repeats the block many times. Early layers tend to capture surface patterns like grammar; deeper layers tend to capture relationships, intent and topic. Many attention "heads" run in parallel, each free to track a different kind of relationship. The transformer is the architecture that made today's models possible, because it processes a whole sequence at once and scales cleanly to enormous size.
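A rough sketch of "attention plus feed-forward, repeated" follows. It is deliberately stripped down: one head, no learned projections, no layer normalisation, and the same made-up weights (W_IN, W_OUT, both hypothetical) reused in every layer, where a real model learns separate weights per layer. Each sub-step's output is added back to its input, as standard transformers do.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(xs):
    """Each token blends all tokens, weighted by scaled dot-product similarity."""
    dim = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in xs]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, xs)) for j in range(dim)])
    return out


def feed_forward(x, w_in, w_out):
    """Small per-token network: linear layer, ReLU, linear layer."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row))) for row in w_in]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w_out]


def block(xs, w_in, w_out):
    """One block: attention, then feed-forward, each added back to its input."""
    attended = self_attention(xs)
    xs = [[a + b for a, b in zip(x, att)] for x, att in zip(xs, attended)]
    return [[a + b for a, b in zip(x, feed_forward(x, w_in, w_out))] for x in xs]


# Made-up weights (assumption): 2 inputs -> 3 hidden units -> 2 outputs.
W_IN = [[0.5, -0.5], [0.25, 0.25], [-0.5, 0.5]]
W_OUT = [[0.1, 0.2, 0.3], [0.3, 0.2, 0.1]]

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for _ in range(3):  # stack the block three times
    xs = block(xs, W_IN, W_OUT)
print(xs)
```

The shape never changes: three token vectors go in and three come out of every block, which is what lets you stack as many blocks as you can afford.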
After all those layers, the model outputs one score per token in its vocabulary and converts the scores to probabilities. Those probabilities answer a single question: what comes next? A language model is, at heart, a very good guesser of the next token. Sampling settings like temperature decide whether it always takes the top guess or occasionally takes a lower one for variety. Knowing this demystifies a lot: the model is not retrieving a stored answer, it is composing one token by token from learned probability.
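To see what temperature does, here is a small sketch of the scores-to-probabilities step (softmax) with a temperature setting. VOCAB and LOGITS are made-up values for a four-token vocabulary, and softmax_with_temperature is a hypothetical helper name.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Divide scores by the temperature, then normalise into probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for a tiny vocabulary (illustrative only).
VOCAB = ["code", "text", "poems", "bicycle"]
LOGITS = [3.0, 2.0, 1.0, -1.0]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(LOGITS, t)
    print(t, [round(p, 3) for p in probs])

# Sample five next tokens at temperature 1.0 (seeded for repeatability).
rng = random.Random(0)
probs = softmax_with_temperature(LOGITS, 1.0)
print(rng.choices(VOCAB, weights=probs, k=5))
```

A low temperature piles almost all the probability onto the top guess; a high one flattens the distribution, so lower-ranked tokens get picked more often.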
Pretraining shows the model oceans of text and asks it, over and over, to predict the next token, adjusting billions of weights each time it is wrong. That alone produces fluency but not helpfulness. Instruction tuning and RLHF then teach it to follow requests and prefer answers humans rate as good - in plain terms, people compare responses, and the model is nudged toward the preferred style. Finally, fine-tuning adapts a base model to your own data and tone, which you will do hands-on in our classes.
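"Each time it is wrong" needs a number the training process can push down. A common choice is cross-entropy loss, the negative log of the probability the model gave to the token that actually came next. The page does not name a loss, so treat that choice as an assumption; the probabilities below are made up and next_token_loss is a hypothetical helper.

```python
import math


def next_token_loss(probabilities, correct_token):
    """Cross-entropy for one prediction: -log of the probability given to the truth."""
    return -math.log(probabilities[correct_token])


# Three made-up predictions for the same position; the true next token is "code".
confident_and_right = {"code": 0.9, "text": 0.1}
unsure = {"code": 0.5, "text": 0.5}
confident_and_wrong = {"code": 0.1, "text": 0.9}

for name, probs in [
    ("confident and right", confident_and_right),
    ("unsure", unsure),
    ("confident and wrong", confident_and_wrong),
]:
    print(f"{name}: {next_token_loss(probs, 'code'):.3f}")
```

The loss is small when the model backed the right token and large when it confidently backed the wrong one. Training adjusts the weights in whatever direction lowers this number across the whole dataset.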
Know the failure modes
Using LLMs well means respecting their limits. These are not bugs to wait out; they follow directly from how the machine works, and you design around them.
The model optimises for the most plausible next token, not the most true one. When it has no grounding, it will still produce fluent, confident text - a citation, a date, an API that does not exist. The fix is not to scold the model; it is to ground it with retrieval, ask for sources you can check, and verify anything that matters.
A model can only attend to a fixed number of tokens at once. Past that window, earlier text is simply gone. This is why long chats lose the thread and why pasting an entire book fails. Designing what to keep, summarise, or retrieve into the window is a real engineering skill, and one we drill directly.
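A minimal sketch of what "simply gone" means, assuming the simplest possible policy of keeping only the most recent tokens. fit_to_window is a hypothetical helper and the integers are stand-in token ids; real applications may summarise or retrieve instead of dropping.

```python
def fit_to_window(token_ids, window_size):
    """Keep only the most recent tokens that fit; older ones are dropped."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return token_ids[-window_size:]


conversation = list(range(1, 13))  # stand-in token ids 1..12
visible = fit_to_window(conversation, window_size=8)
print(visible)  # [5, 6, 7, 8, 9, 10, 11, 12]
print("dropped:", len(conversation) - len(visible))  # dropped: 4
```

Tokens 1 to 4 never reach the model, so nothing in them can influence the answer. Deciding what earns a place in those eight slots is the design problem.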
A base model knows only what was in its training data, frozen at a cutoff. It does not browse, remember you between sessions, or check the clock unless you give it tools that do. Treat it as a brilliant reasoner with no calendar and no internet until you wire those in - which, again, is exactly what you will build.
Theory you can ship
Every technique below sits on top of the mechanics above. We teach the why first so your builds hold up, then you build them live with an instructor. This is the practical core of our real coding classes.
Once you know the model is reading tokens and predicting the next one, prompting stops being guesswork. You learn to set role and constraints, give worked examples, ask for structured output, and break hard tasks into steps the model can follow. You also learn why some "tricks" work and others are folklore, so you can write prompts that behave the same way every time.
RAG is how you give a model facts it was never trained on. You turn your documents into embeddings, store them, retrieve the most relevant chunks for a question, and feed them into the context window so the answer is grounded in real sources. Because you understand embeddings and the context window, you can debug a RAG system instead of just hoping it works.
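The retrieve-then-prompt flow can be sketched end to end. Loud assumptions: embed here is a crude word-count stand-in for a real embedding model, DOCUMENTS holds three sentences taken from this page, and retrieve and build_prompt are hypothetical helper names. The final call to a language model is left out.

```python
import math
from collections import Counter

# A tiny document store (sentences borrowed from this page).
DOCUMENTS = [
    "Every class is live, online and instructor-led in small batches.",
    "A model can only attend to a fixed number of tokens at once.",
    "A tokeniser slices your input into common chunks.",
]


def embed(text):
    """Stand-in embedding (assumption): word counts instead of a learned vector."""
    words = [w.strip(".,?!").lower() for w in text.split()]
    return Counter(w for w in words if w)


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Place the retrieved chunks into the context ahead of the question."""
    context = "\n".join(f"- {chunk}" for chunk in retrieve(question, documents, k))
    return f"Answer using only these sources:\n{context}\n\nQuestion: {question}"


print(build_prompt("How many tokens can a model attend to?", DOCUMENTS))
```

When a RAG system gives a bad answer, these are the places to look: did embed represent the text well, did retrieve pick the right chunk, and did the chunk actually fit in the prompt?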
When prompting and retrieval are not enough, you fine-tune. You will prepare a dataset, train a smaller model to adopt a specific tone or task, and measure whether it actually improved. Knowing what pretraining and RLHF already did tells you when fine-tuning is the right tool and when it is an expensive detour.
An agent is a model given tools and a loop: it decides what to do, calls a tool, reads the result, and continues until the job is done. You build one that can search, run code, or call an API - and because you understand the model's limits, you design guardrails so it fails safely instead of looping forever or making things up.
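The decide, call, read, continue loop fits in a short sketch. Everything named here is hypothetical: fake_model is a scripted stand-in for a real LLM call, TOOLS holds a single toy tool, and the dictionary shapes are invented for this example. The part worth copying is the guardrail, a hard step limit so the loop cannot run forever.

```python
def fake_model(history):
    """Hypothetical stand-in for an LLM: picks the next action from the history."""
    results = [step for step in history if step["type"] == "tool_result"]
    if not results:
        return {"type": "tool_call", "tool": "add", "args": [2, 3]}
    return {"type": "final", "content": f"The answer is {results[-1]['content']}."}


# Toy tool registry (assumption); real agents might search, run code or call an API.
TOOLS = {"add": lambda a, b: a + b}


def run_agent(task, model=fake_model, tools=TOOLS, max_steps=5):
    """Loop: ask the model what to do, run the tool, feed the result back."""
    history = [{"type": "task", "content": task}]
    for _ in range(max_steps):  # guardrail: never loop forever
        action = model(history)
        if action["type"] == "final":
            return action["content"]
        tool = tools.get(action["tool"])
        if tool is None:
            # Guardrail: report unknown tools back instead of crashing.
            result = f"error: unknown tool {action['tool']!r}"
        else:
            result = tool(*action["args"])
        history.append({"type": "tool_result", "content": result})
    return "Stopped: step limit reached without a final answer."


print(run_agent("What is 2 + 3?"))  # The answer is 5.
```

Swap in a model that never produces a final answer and run_agent still returns after max_steps, which is the "fails safely" behaviour described above.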
Attached learning paths
Live, small-batch, instructor-led tracks that take you from the ideas on this page to working AI projects.
Build with LLMs end to end - prompting, RAG, fine-tuning and generation projects.
View course →
The full machine learning foundation underneath modern language models.
View course →
A broad, rigorous tour of AI - from search and learning to deep networks.
View course →
Ship real software with AI coding agents - for adults and professionals.
View course →
The language every AI project is written in, from first line to advanced.
View course →
Keep reading
Standalone deep-dives that sit alongside this one. Each explains a part of modern AI from the inside.
Common questions
No. You need curiosity and a willingness to think carefully. We start from text and tokens and build up the ideas one layer at a time. If you can follow a recipe and reason about steps, you can follow how a language model turns words into numbers, weighs context, and predicts the next token. Our small live batches mean you can ask the instructor to slow down or re-explain any idea until it clicks.
Some, but far less than people fear. To understand and build with LLMs you mainly need comfort with vectors and a rough sense of probability - both of which we teach in context as they come up. You do not need to derive backpropagation by hand to use embeddings, attention and fine-tuning well. If you want the deeper mathematics, our companion page on the maths behind machine learning takes you there at your own pace.
You build. Every concept is paired with code you run yourself. You will tokenise text, inspect embeddings, write prompts that behave reliably, connect a retrieval-augmented generation pipeline to your own documents, fine-tune a small model on a dataset, and assemble a simple agent that uses tools. The theory exists to make your builds robust, not to replace them.
This path is aimed at serious teenagers, college students and working professionals who want real understanding rather than a quick demo. Modern Age Coders teaches ages 6 to 65 across more than 70 live courses, and we place each learner in a track that matches their background, whether that is a curious 15-year-old or an engineer adding AI to their work.
Yes. Every class is live, online and instructor-led in small batches, not a pre-recorded video library. You write code with your instructor watching, ask questions in real time, and get feedback on your own projects. Start with a free demo to meet the teacher and see the format before you commit.
Because understanding tokens, attention and the model's limits is what separates a fragile demo from a real product. When you know that the model reads tokens, attends to context within a fixed window, and predicts the next token by probability, you can prompt well, choose when to fine-tune, design retrieval-augmented generation that grounds answers in facts, recognise and defuse hallucination traps, and control cost. People who only call the API get surprised; people who understand the machine ship.
Start with a conversation
A free, live demo with a real instructor. Bring your questions about tokens, attention or anything on this page.