Neural Networks & Deep Learning
How layers of simple math add up to something that can recognize faces, translate languages, and write poetry.
The photo that broke the internet — and launched a revolution
In 2012, a neural network called AlexNet entered the ImageNet image recognition competition. The task: look at a photo and identify what's in it. A cat? A car? A coffee mug? Previous systems topped out around 74% top-5 accuracy — meaning the correct answer appeared somewhere in the model's top 5 guesses about 74% of the time. AlexNet scored 84.7% (Krizhevsky et al., 2012), a roughly 11-point leap the field had never seen. The gap was so absurd that researchers thought it was a bug.
It wasn't. AlexNet had done something different: instead of humans writing rules like "a cat has pointy ears and whiskers," it stacked layers of simple math operations and let the network figure out its own rules from millions of photos.
That moment kicked off the deep learning revolution. Every AI breakthrough since — voice assistants, self-driving cars, ChatGPT — traces back to that same core idea: stack enough layers of simple operations, feed in enough data, and the system learns patterns no human could program by hand.
A single neuron: the simplest decision-maker
Before we build a network, let's understand one tiny piece: a single artificial neuron. It does three things:
- Takes in numbers (inputs)
- Multiplies each input by a weight (how important is this input?)
- Adds them up and decides (output)
Think of it as a volume knob on a mixing board. You're a music producer with three tracks: vocals, guitar, drums. Each track has a volume knob (the weight). You turn up the vocals, turn down the drums, leave guitar in the middle. The final mix (output) depends on which knobs you turned and how far.
| Mixing board | Neuron |
|---|---|
| Audio tracks (vocals, guitar, drums) | Inputs (pixel brightness, word frequency, etc.) |
| Volume knobs | Weights (how much each input matters) |
| The final mixed sound | Output (the neuron's decision) |
| Adjusting knobs until the song sounds right | Training (adjusting weights until predictions are accurate) |
Here's the math — and it's simpler than you think:
Output = (Input1 x Weight1) + (Input2 x Weight2) + (Input3 x Weight3)
That's it. Multiply and add. A 10-year-old could do it. The magic isn't in any single neuron — it's in what happens when you connect thousands of them together.
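The multiply-and-add above fits in a few lines of code. This is a minimal sketch — the weights and threshold are made-up numbers for illustration, not values from any trained model:

```python
# A complete artificial neuron: multiply each input by its weight,
# sum the results, then apply a threshold to make a decision.

def neuron(inputs, weights, threshold=0.5):
    # Weighted sum: (Input1 x Weight1) + (Input2 x Weight2) + ...
    total = sum(x * w for x, w in zip(inputs, weights))
    # Fire (1) if the sum clears the threshold, otherwise stay quiet (0)
    return 1 if total > threshold else 0

# Three "tracks": vocals loud, guitar medium, drums quiet
inputs = [0.9, 0.5, 0.2]
weights = [0.8, 0.3, 0.1]   # vocals matter most, drums least
print(neuron(inputs, weights))  # weighted sum = 0.89, clears 0.5 -> prints 1
```

Turn the knobs (change the weights) and the same inputs can produce a different decision — that's all training does, millions of times over.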
There Are No Dumb Questions
"If a neuron is just multiply-and-add, how can it do anything smart?"
A single neuron can't. It can draw one straight line to separate two groups — like "if (pixels x weights) > some threshold, it's a cat; otherwise it's not." That's incredibly limited. The power comes from stacking neurons into layers and layers into networks. Each layer builds on the previous one to capture increasingly complex patterns.
"Are artificial neurons anything like real brain neurons?"
The name is inspired by biology, but the similarity is loose. Real neurons communicate with electrical and chemical signals, can grow new connections, and work in ways we still don't fully understand. Artificial neurons are just a math equation. Think of it as "inspired by the brain" the way an airplane is "inspired by birds" — the engineering is completely different.
Be the Neuron
25 XP

Layers: the relay race
One neuron draws one line. Boring. But what if you connected neurons in layers — like a relay race where each runner hands off to the next?
Here's what each layer does in an image recognition network:
| Layer | What it detects | Analogy |
|---|---|---|
| Input layer | Raw pixel values (brightness, color) | Looking at individual dots on a pointillist painting |
| Hidden layer 1 | Edges and lines | Stepping back — you see lines and boundaries |
| Hidden layer 2 | Shapes and textures | Stepping further back — you see circles, fur, feathers |
| Hidden layer 3 | Parts of objects | Even further — you see ears, eyes, beaks |
| Output layer | Final classification (cat, dog, bird) | "Oh! It's a cat." |
This is the relay race. Each layer takes the output of the previous layer and builds something more complex from it. Raw pixels become edges. Edges become shapes. Shapes become object parts. Object parts become a classification. No single layer does the whole job — each runner carries the baton a little further.
The layers between input and output are called hidden layers — not because they're secret, but because you don't directly see their inputs or outputs. They're the internal thinking steps.
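The relay race above can be sketched as code. The layer sizes and random weights here are arbitrary placeholders (a real network learns its weights from data), but the structure — each layer feeding the next — is the real thing:

```python
import numpy as np

# A minimal feedforward pass through two hidden layers.
rng = np.random.default_rng(0)

def layer(x, W):
    # Each layer: multiply-and-add, then a simple nonlinearity (ReLU),
    # which lets stacked layers express more than one straight line.
    return np.maximum(0, x @ W)

x  = rng.random(784)                          # input: a 28x28 image, flattened
W1 = rng.standard_normal((784, 128)) * 0.01   # hidden layer 1: "edges"
W2 = rng.standard_normal((128, 64)) * 0.1     # hidden layer 2: "shapes"
W3 = rng.standard_normal((64, 10)) * 0.1      # output layer: 10 classes

h1 = layer(x, W1)        # baton handoff 1
h2 = layer(h1, W2)       # baton handoff 2
scores = h2 @ W3         # final classification scores, one per class
print(scores.shape)      # prints (10,)
```

Note that each handoff is the same operation — multiply, add, activate. Depth comes from repetition, not from any layer doing something exotic.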
There Are No Dumb Questions
"Who decides what each hidden layer detects?"
Nobody! That's the whole point. During training, the network adjusts all its weights (the volume knobs) to minimise prediction errors. The layers emerge with useful representations on their own. Researchers were amazed when they discovered that early layers naturally learned edge detection — nobody programmed that. The data and the training loop figured it out.
"How many hidden layers do you need?"
It depends on the complexity of the task. A simple task (is this number even or odd?) might need one hidden layer. Image recognition typically uses dozens or hundreds. Large language models use on the order of a hundred (GPT-3 had 96 layers). More layers = can learn more complex patterns, but also needs more data and more computing power.
How networks learn: adjusting all the knobs at once
Prediction question: A neural network with one million weights just made a wrong prediction. How do you think the network figures out which of those million weights to adjust? Write down your intuition before reading on.
You know the training loop from the previous module: predict, check, adjust, repeat. Neural networks do the same thing — but instead of adjusting a few simple settings, they adjust millions of weights simultaneously.
Here's the process, called backpropagation (don't let the fancy name scare you):
- Forward pass — feed an input through the network and get a prediction
- Error calculation — compare the prediction to the correct answer
- Backward pass — trace the error backward through the layers, working out how much each weight contributed to the mistake
- Weight update — nudge every weight a tiny amount in the direction that reduces the error
Think of it like tuning an orchestra. The conductor (backpropagation) listens to the performance (forward pass), identifies which instruments are off-key (error calculation), and tells each musician exactly how much to adjust their tuning (weight update). After thousands of rehearsals, the orchestra sounds incredible — even though no single musician understands the entire symphony.
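Here's the predict-check-adjust cycle in miniature: a single neuron (one knob) learning the rule y = 2x by gradient descent. Backpropagation is this same idea applied to every weight in every layer at once; the data and learning rate below are made up for illustration:

```python
# A toy training loop: one weight learning y = 2*x.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, correct answer)
w = 0.0                                      # start with the knob at zero

for _ in range(100):                  # 100 "rehearsals"
    for x, y in data:
        pred = w * x                  # forward pass: predict
        error = pred - y              # how off-key are we?
        gradient = 2 * error * x      # which way to turn the knob
        w -= 0.05 * gradient          # small adjustment (learning rate 0.05)

print(round(w, 3))                    # prints 2.0 -- it found the rule
```

No one told the loop that the answer was 2; repeatedly shrinking the error got it there. Scale the same loop up to millions of weights and you have deep learning training.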
Hands-On: Train Your Own Network (on Paper)
50 XP

Deep learning: when "deep" just means "many layers"
Deep learning = neural networks with many hidden layers. That's it. The word "deep" refers to the depth of the network — the number of layers stacked between input and output.
| Term | What it means | Typical layer count |
|---|---|---|
| Shallow network | 1-2 hidden layers | Good for simple patterns |
| Deep network | 3+ hidden layers | Can learn complex, abstract patterns |
| Very deep network (modern AI) | 50-100+ layers | Image recognition, language models |
Why does depth matter? Because each layer builds on the previous one. More layers = more levels of abstraction = more complex patterns.
Think of it like giving directions:
- 1 layer deep: "Turn left at the light." (Simple instruction.)
- 3 layers deep: "Go to the shopping district, find the Italian restaurant row, look for the one with the red awning." (More context, more precision.)
- 50 layers deep: You can describe a specific person's face out of 8 billion people on Earth. (Absurdly complex pattern, but composed of many simple steps.)
Deeper networks = higher accuracy (until you hit diminishing returns)
The "deep learning revolution" of the 2010s happened because three things came together at once:
- More data — the internet produced billions of labelled images, text, and videos
- More compute — GPUs made training deep networks feasible
- Better techniques — researchers figured out how to train very deep networks without the signal getting lost between layers
Why GPUs matter: 1,000 students doing math simultaneously
Training a neural network means doing the same simple math operation (multiply and add) on millions of data points. A CPU does this one at a time — like one student solving math problems at a desk.
A GPU (Graphics Processing Unit) was originally designed for video games — it needs to calculate the color of millions of pixels simultaneously, 60 times per second. That's the same "do simple math on tons of data points at once" pattern that neural networks need.
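You can see the "one at a time" versus "all at once" difference without a GPU. Both snippets below compute the identical multiply-and-add; the loop handles one value per step, while the vectorized version hands the whole array to optimized parallel code in a single call — the same pattern GPUs exploit on a massive scale:

```python
import numpy as np

inputs  = np.arange(100_000, dtype=np.float64)
weights = np.full(100_000, 0.5)

# One at a time (CPU-loop style)
total_loop = 0.0
for x, w in zip(inputs, weights):
    total_loop += x * w

# All at once (the parallel-friendly form)
total_vec = inputs @ weights

print(total_loop == total_vec)  # prints True -- same math, same answer
```

The vectorized line is dramatically faster even on a CPU; on a GPU, with thousands of cores each handling a slice of the array, the gap becomes the days-versus-hours difference in the table below.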
| | CPU | GPU |
|---|---|---|
| Analogy | One brilliant math student | 1,000 average math students |
| Strength | Complex, sequential tasks | Simple tasks done in parallel |
| Neural network training | Slow (days/weeks) | Fast (hours/days) |
| Cost | Cheaper per unit | Expensive but worth it at scale |
Think of it this way: you need to add up 1,000 grocery receipts. A CPU is one accountant who's really fast — she processes one receipt at a time and finishes in an hour. A GPU is 1,000 interns who are slower individually but each handles one receipt simultaneously — done in 4 seconds.
This is why NVIDIA, the dominant supplier of AI training chips, became one of the world's most valuable companies by 2024. Every AI company needs GPUs — demand is so high that GPU supply is sometimes tracked like oil futures.
There Are No Dumb Questions
"Do I need a GPU to USE a neural network, or just to TRAIN one?"
Training is where GPUs are essential — that's the expensive, compute-heavy phase. Using a trained model (called "inference") is much cheaper and can often run on CPUs or smaller GPUs. When you chat with ChatGPT, inference runs on powerful servers — but the compute per response is a tiny fraction of what training cost.
"What about TPUs? I keep hearing about those."
TPUs (Tensor Processing Units) are Google's custom chips designed specifically for neural network math. They're even faster than GPUs for certain workloads. Amazon has Trainium, and other companies are building their own chips too. The key point: all of these are specialised hardware for doing simple math on lots of data simultaneously.
Match the Hardware to the Task
25 XP
Types of neural networks: different architectures for different jobs
Not all neural networks look the same. Different problems need different architectures — like how you'd use a screwdriver for screws and a hammer for nails.
| Architecture | Best for | How it works | Real-world example |
|---|---|---|---|
| Feedforward (basic) | Simple classification | Data flows one direction: input → hidden → output | Spam detection, simple prediction |
| Convolutional (CNN) | Images and visual data | Slides a small window across the image, detecting patterns at each position | Face recognition, medical imaging, self-driving cars |
| Recurrent (RNN) | Sequential data (text, time series) | Has a "memory" that carries information from previous steps | (Mostly replaced by transformers now) |
| Transformer | Text, code, and increasingly everything | Uses "attention" to look at all parts of input simultaneously | ChatGPT, Claude, DALL-E, translation |
The transformer architecture — which we'll cover in depth in the next module — is the foundation of modern AI. It replaced RNNs for language tasks because it can process all words in a sentence simultaneously (like reading a whole page at once) instead of one word at a time (like reading letter by letter).
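The "look at all parts of the input simultaneously" idea can be sketched in a few lines. This is a heavily stripped-down illustration of the attention pattern — real transformers learn separate query/key/value projections, and the three 2-number word vectors below are made up:

```python
import numpy as np

def softmax(z):
    # Turn raw scores into weights that sum to 1 along each row
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

words = np.array([[1.0, 0.0],    # toy numeric "meanings" for 3 words
                  [0.0, 1.0],
                  [1.0, 1.0]])

scores  = words @ words.T    # every word scores its relevance to every other word
weights = softmax(scores)    # (3, 3): each row is one word's attention over all 3
output  = weights @ words    # each word's new meaning: a blend of all words at once

print(weights.shape, output.shape)  # prints (3, 3) (3, 2)
```

The key contrast with an RNN: every word computed its blend in one parallel step, rather than waiting for the previous word to finish.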
Pick the Architecture
25 XP

Back to AlexNet
In 2012, nobody could explain exactly why AlexNet worked so well. The researchers had stacked layers of neurons, fed in millions of photos, and let the math sort itself out. The network learned to detect edges in layer 1, curves in layer 2, shapes in layer 3, and object parts in layer 4 — but nobody programmed those steps. The network discovered them.
That gap between "we know how to build it" and "we know why it works" is still one of the most fascinating things about deep learning. You now understand the mechanism: multiply, add, activate, stack, backpropagate. What the layers decide to learn — that's still largely up to the data. That's both what makes deep learning powerful and what makes it hard to fully trust.
The 2012 moment didn't just win a competition. It proved that this approach scales. Every AI system you use today is built on that same idea.
Key takeaways
- A neuron is just multiply-and-add. Each input gets multiplied by a weight (volume knob), the results get summed, and the neuron fires an output. Simple math, remarkable results at scale.
- Layers build abstraction. Like a relay race, each layer takes the previous layer's output and builds something more complex — raw pixels become edges become shapes become "that's a cat."
- Deep learning = many layers. The word "deep" just means the network has many hidden layers, allowing it to learn complex, abstract patterns.
- Training = adjusting millions of knobs. Backpropagation traces errors backward through the network and adjusts every weight to reduce mistakes. Repeat millions of times.
- GPUs made deep learning possible. They do simple math on thousands of data points simultaneously — exactly what neural network training needs.
- Different architectures for different jobs. CNNs for images, transformers for language, basic feedforward for simple classification. Pick the right tool for the job.
Knowledge Check
1. In a neural network, what does a 'weight' represent, and what happens to weights during training?
2. Why did GPUs become essential for training deep neural networks?
3. What does 'deep' mean in deep learning, and why does depth matter?
4. A neural network trained on photos achieves 99% accuracy on training images but 65% on new test images. Its first hidden layer has learned edge detection, and its last hidden layer has learned features specific to individual training photos. What is happening?