Courses Course Atlas Contact Book a free demo
How machines learn to see

How computer vision actually works.

Pixels, convolutions, feature maps and detection - explained deeply, then built. This is the page that takes you from a grid of numbers to a labelled bounding box, and shows you exactly how a machine gets from one to the other.

10,000+
Students taught
15+
Countries
70+
Live courses
4.9/5
247+ reviews
What the model sees

From pixels to a labelled box - the whole pipeline, one frame at a time.

A machine never sees a "cat". It sees a grid of numbers. Each stage below transforms that grid: first into edges, then textures, then parts, until a final head decides what is present and where. This is the shape of nearly every modern vision model.

Input pixels
tensor 224x224x3
Edges
early filters
Textures
mid feature maps
Parts
deep feature maps
Object + box
class + coordinates

Left to right: a raw pixel grid becomes edges, then textures, then object parts, then a single confident decision - cat, 0.97, located inside a green box. The picture below explains each step in real depth.

From pixels to understanding

The six ideas that make a machine see.

Each idea builds on the one before it. Read them in order and the whole field stops being a black box - you will be able to point at any stage of a model and say what it is doing and why.

1 - What an image really is

A grid of numbers, stored as a tensor

To a computer there is no picture. A colour image is a three-dimensional array: height by width by three channels - red, green and blue - where every entry is an intensity, usually from 0 to 255. A 224 by 224 photo is just 150,528 numbers arranged in a cube. Once you see an image as a tensor, everything that follows is arithmetic on that tensor: scaling values into a small range so training is stable, stacking many images into a batch, and reading off the shape to know what flows into the next layer.

image → tensor of shape (height, width, channels) = (224, 224, 3)
2 - Convolutions

A small filter slides across the image and reports what it finds

The core operation in vision is the convolution. You take a tiny grid of weights - a kernel, often 3 by 3 - and slide it across the image. At each position you multiply the kernel against the pixels underneath it and add the results into a single number. Slide it everywhere and you get a new grid: a feature map showing where that kernel's pattern appeared.

What a kernel detects depends entirely on its weights. One set fires on vertical edges, another on horizontal edges, another on a patch of a particular colour. The breakthrough of deep learning is that we do not design these kernels by hand - the network learns the weights from data, discovering for itself which patterns are worth looking for.

kernel (3x3) over input → one feature map per filter
3 - Feature maps and hierarchy

Edges become textures, textures become parts, parts become objects

One convolution layer finds simple things - edges and blobs. Stack the output through another convolution layer and it combines those edges into textures and corners. Stack again and it combines textures into parts: an eye, a wheel, a window frame. Stack deeper still and those parts combine into whole objects. This rising ladder of meaning is the single most important idea in the field. Early feature maps are local and generic; deep feature maps are abstract and specific. The same architecture that learns to spot a cat's ear learns to spot a road sign, because the hierarchy of edges-to-parts-to-objects is general.

4 - Convolutional neural networks

Why pooling and depth make the hierarchy actually work

A convolutional neural network, or CNN, is a stack of convolution layers with two helpers. Between layers, a non-linearity - usually a ReLU that simply zeroes out negatives - lets the network represent patterns that are not just straight-line combinations. And pooling layers shrink the feature maps by keeping only the strongest response in each small region, which makes the network cheaper and lets it recognise a feature whether it sits slightly left or slightly right.

Depth is what gives a CNN its power: every layer sees a wider slice of the original image than the one before it, so deep layers reason about large structures while early layers handle fine detail. The whole stack is trained end to end with gradient descent - the same loss-minimising step behind every neural network - so the kernels, the combinations and the final decision are all learned together.

5 - Vision transformers

Treat the image as patches and let attention decide what matters

CNNs build understanding from small neighbourhoods outward. Vision transformers take a different route. They cut the image into a grid of small patches - say 16 by 16 pixels each - turn every patch into a vector, and then let an attention mechanism compare every patch with every other patch. Attention, in plain terms, is the model asking "to understand this patch, which other patches should I look at?" and weighting them accordingly. A patch of fur can pay attention to a patch of ear far across the image, even though they are not neighbours.

That global view is the strength of transformers: they can relate distant parts of an image directly, where a CNN needs many layers to connect them. With enough data they match or beat CNNs, and many modern systems combine both - convolutions for efficient local features, attention for long-range reasoning.

6 - The tasks

Classification, detection and segmentation - and how each is set up

Classification answers "what is in this image?" with one label. The feature maps are flattened and passed to a final layer that outputs a probability for each class - the highest wins. Object detection answers "what is in this image, and where?" It adds heads that predict bounding-box coordinates and a class for each box, plus a confidence score, so a single image can return many labelled boxes. Segmentation goes further and labels every pixel - this pixel is road, this one is car - producing a mask rather than a box. Same backbone of feature maps underneath; what changes is the head on top and the loss you train it with. Understanding that the task is decided by the head, not the eyes, is what lets you reshape a model for a new problem.

Understanding vs running a demo

Why understanding beats running a pretrained model.

Loading a pretrained model and getting a result is the easy 20 percent. The hard, valuable 80 percent is everything that happens when that result meets your real data. That is exactly what understanding unlocks.

Choosing the right model

When you know how backbones, heads and input sizes work, you can pick a model that fits your problem and your hardware instead of grabbing whatever the tutorial used. You can tell a classifier from a detector at a glance and know which one your task needs.

Fine-tuning on your own data

Pretrained weights know generic features, not your factory, your crops or your medical scans. Understanding the hierarchy tells you which layers to freeze and which to retrain, so you adapt a powerful backbone to your classes with a fraction of the data and time.

Debugging when it goes wrong

A model that scores well in a demo can fail badly on your images. Knowing how feature maps, losses and confidence scores behave lets you read a confusion matrix, spot a labelling error, fix a preprocessing mismatch and tell a real model limit from a setup bug.

Trusting it on edge cases

Odd lighting, blur, an unusual angle, a class the model never saw - these are where products break. Understanding how vision models generalise, and where they do not, lets you test honestly, set sensible confidence thresholds and decide when a human needs to stay in the loop.

This is the philosophy of every track we teach. We build the depth first, because depth is what holds up when the demo meets reality. It is the same reason behind our real coding classes - learning to build and reason, not just to copy and run.

What you learn to build

Real systems, with the understanding sitting underneath them.

You do not finish with notes about computer vision. You finish with working systems you wrote, ran and can explain - and you understand every layer they stand on.

project 01

An image classifier

You build a CNN that takes an image and returns a label with a probability. You train it on a real dataset, watch the loss fall and the accuracy rise, and learn to read whether it is genuinely learning or just memorising - then fix it when it is the latter.

project 02

An object detector

You move from "what is in the image" to "what and where". You build a detector that draws bounding boxes with class labels and confidence scores on images it has never seen, and you learn how the box and class heads are trained together.

skill

Working with real image data

Real datasets are messy. You learn to load, resize and normalise images, to augment them so the model generalises, to handle class imbalance, and to split data honestly so your accuracy numbers actually mean something.

skill

Transfer learning, understood

You take a backbone pretrained on millions of images and fine-tune it on your own small dataset. Because you understand the feature hierarchy, you know exactly which layers to freeze and which to retrain - the fastest honest route to a strong model.

Questions, answered

Before you book a demo.

Do I need to be advanced to learn computer vision?

No. You need to be comfortable writing basic Python and willing to think carefully. We start from what an image really is - a grid of numbers - and build up to convolutions, CNNs and detection one honest step at a time. If your Python is shaky, our Python Masterclass gets you ready, and many learners take it first. You do not need any prior machine learning experience to begin.

Do I need maths for this?

You need school-level algebra and a willingness to learn the rest as we use it. Convolution is multiplication and addition over a small window. Training is gradient descent on a loss. We teach the small amount of linear algebra, calculus and probability each idea needs, exactly when it is needed, with pictures and code rather than abstract proofs. You will not be left guessing why a model behaves the way it does.

Will I build real projects?

Yes. You build an image classifier and train it on a real dataset, then build an object detector that draws bounding boxes with confidence scores on images it has never seen. You work with real image data - loading, augmenting and cleaning it - and you use transfer learning to fine-tune a pretrained backbone on your own classes. Every project ends in code you wrote, can run and can explain.

Who is this for and what ages?

This depth-first track suits serious teenagers, college students and working professionals who want to genuinely understand computer vision and build with it. We teach ages 6 to 65 across our courses, but the from-pixels-to-detection path here is aimed at older, motivated learners. If you are a younger beginner, a free demo helps us place you on the right starting course.

Are classes live?

Yes. Every class is live, online and instructor-led in a small batch. You write code during the session, watch feature maps and detections appear on screen, and get your model and your reasoning reviewed in real time. It is not a recorded video you watch alone - you can ask why something failed the moment it fails.

Why learn how computer vision works when pretrained models exist?

Because understanding is what lets you choose, fine-tune, debug and trust a vision model rather than just run a demo you cannot fix. Pretrained models are powerful, but they fail on your own data, your edge cases, your lighting, your classes. The people who know how convolutions, feature maps and detection heads work can read a confusion matrix, fix a labelling problem, fine-tune the right layers and ship a real product. People who only call an API stall the moment the demo stops matching reality.

See how it sees - in a live class.

Sit in one free demo, watch a model turn pixels into a labelled box on a real image, and decide for yourself whether this is the depth you have been looking for.