Neural Networks & Deep Learning
How layers of simple math add up to something that can recognize faces, translate languages, and write poetry.
The photo that broke the internet — and launched a revolution
In 2012, a neural network called AlexNet entered the ImageNet image recognition competition. The task: look at a photo and identify what's in it. A cat? A car? A coffee mug? Previous systems topped out around 74% top-5 accuracy — meaning the correct answer appeared somewhere in the model's top 5 guesses about 74% of the time. AlexNet scored 84.7% (Krizhevsky et al., 2012), a roughly 11-point leap the field had never seen. The gap was so absurd that researchers thought it was a bug.
It wasn't. AlexNet had done something different: instead of humans writing rules like "a cat has pointy ears and whiskers," it stacked layers of simple math operations and let the network figure out its own rules from millions of photos.
That moment kicked off the deep learning revolution. Every AI breakthrough since — voice assistants, self-driving cars, ChatGPT — traces back to that same core idea: stack enough layers of simple operations, feed in enough data, and the system learns patterns no human could program by hand.
A single neuron: the simplest decision-maker
Before we build a network, let's understand one tiny piece: a single artificial neuron. It does three things:
- Takes in numbers (inputs)
- Multiplies each input by a weight (how important is this input?)
- Adds them up and decides (output)
Think of it as a volume knob on a mixing board. You're a music producer with three tracks: vocals, guitar, drums. Each track has a volume knob (the weight). You turn up the vocals, turn down the drums, leave guitar in the middle. The final mix (output) depends on which knobs you turned and how far.
| Mixing board | Neuron |
|---|---|
| Audio tracks (vocals, guitar, drums) | Inputs (pixel brightness, word frequency, etc.) |
| Volume knobs | Weights (how much each input matters) |
| The final mixed sound | Output (the neuron's decision) |
| Adjusting knobs until the song sounds right | Training (adjusting weights until predictions are accurate) |
Here's the math — and it's simpler than you think:
Output = (Input1 x Weight1) + (Input2 x Weight2) + (Input3 x Weight3)
That's it. Multiply and add. A 10-year-old could do it. The magic isn't in any single neuron — it's in what happens when you connect thousands of them together.
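The multiply-and-add above fits in a few lines of code. This is a minimal sketch — the weights and threshold are made-up numbers for illustration, not values from any trained model:

```python
# A complete artificial neuron: multiply each input by its weight,
# sum the results, then apply a threshold to make a decision.

def neuron(inputs, weights, threshold=0.5):
    # Weighted sum: (Input1 x Weight1) + (Input2 x Weight2) + ...
    total = sum(x * w for x, w in zip(inputs, weights))
    # Fire (1) if the sum clears the threshold, otherwise stay quiet (0)
    return 1 if total > threshold else 0

# Three "tracks": vocals loud, guitar medium, drums quiet
inputs = [0.9, 0.5, 0.2]
weights = [0.8, 0.3, 0.1]   # vocals matter most, drums least
print(neuron(inputs, weights))  # weighted sum = 0.89, clears 0.5 -> prints 1
```

Turn the knobs (change the weights) and the same inputs can produce a different decision — that's all training does, millions of times over.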
There Are No Dumb Questions
"If a neuron is just multiply-and-add, how can it do anything smart?"
A single neuron can't. It can draw one straight line to separate two groups — like "if (pixels x weights) > some threshold, it's a cat; otherwise it's not." That's incredibly limited. The power comes from stacking neurons into layers and layers into networks. Each layer builds on the previous one to capture increasingly complex patterns.
"Are artificial neurons anything like real brain neurons?"
The name is inspired by biology, but the similarity is loose. Real neurons communicate with electrical and chemical signals, can grow new connections, and work in ways we still don't fully understand. Artificial neurons are just a math equation. Think of it as "inspired by the brain" the way an airplane is "inspired by birds" — the engineering is completely different.
Be the Neuron
25 XP

Layers: the relay race
One neuron draws one line. Boring. But what if you connected neurons in layers — like a relay race where each runner hands off to the next?
Here's what each layer does in an image recognition network:
| Layer | What it detects | Analogy |
|---|---|---|
| Input layer | Raw pixel values (brightness, color) | Looking at individual dots on a pointillist painting |
| Hidden layer 1 | Edges and lines | Stepping back — you see lines and boundaries |
| Hidden layer 2 | Shapes and textures | Stepping further back — you see circles, fur, feathers |
| Hidden layer 3 | Parts of objects | Even further — you see ears, eyes, beaks |
| Output layer | Final classification (cat, dog, bird) | "Oh! It's a cat." |
This is the relay race. Each layer takes the output of the previous layer and builds something more complex from it. Raw pixels become edges. Edges become shapes. Shapes become object parts. Object parts become a classification. No single layer does the whole job — each runner carries the baton a little further.
The layers between input and output are called hidden layers — not because they're secret, but because you don't directly see their inputs or outputs. They're the internal thinking steps.
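The relay race above can be sketched as code. The layer sizes and random weights here are arbitrary placeholders (a real network learns its weights from data), but the structure — each layer feeding the next — is the real thing:

```python
import numpy as np

# A minimal feedforward pass through two hidden layers.
rng = np.random.default_rng(0)

def layer(x, W):
    # Each layer: multiply-and-add, then a simple nonlinearity (ReLU),
    # which lets stacked layers express more than one straight line.
    return np.maximum(0, x @ W)

x  = rng.random(784)                          # input: a 28x28 image, flattened
W1 = rng.standard_normal((784, 128)) * 0.01   # hidden layer 1: "edges"
W2 = rng.standard_normal((128, 64)) * 0.1     # hidden layer 2: "shapes"
W3 = rng.standard_normal((64, 10)) * 0.1      # output layer: 10 classes

h1 = layer(x, W1)        # baton handoff 1
h2 = layer(h1, W2)       # baton handoff 2
scores = h2 @ W3         # final classification scores, one per class
print(scores.shape)      # prints (10,)
```

Note that each handoff is the same operation — multiply, add, activate. Depth comes from repetition, not from any layer doing something exotic.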
There Are No Dumb Questions
"Who decides what each hidden layer detects?"
Nobody! That's the whole point. During training, the network adjusts all its weights (the volume knobs) to minimise prediction errors. The layers emerge with useful representations on their own. Researchers were amazed when they discovered that early layers naturally learned edge detection — nobody programmed that. The data and the training loop figured it out.
"How many hidden layers do you need?"
It depends on the complexity of the task. A simple task (is this number even or odd?) might need one hidden layer. Image recognition typically uses dozens or hundreds. Large language models use on the order of a hundred (GPT-3 had 96 layers). More layers = can learn more complex patterns, but also needs more data and more computing power.
How networks learn: adjusting all the knobs at once
Prediction question: A neural network with one million weights just made a wrong prediction. How do you think the network figures out which of those million weights to adjust? Write down your intuition before reading on.
You know the training loop from the previous module: predict, check, adjust, repeat. Neural networks do the same thing — but instead of adjusting a few simple settings, they adjust millions of weights simultaneously.
Here's the process, called backpropagation (don't let the fancy name scare you):
- Forward pass — feed an input through the network and get a prediction
- Error calculation — compare the prediction to the correct answer
- Backward pass — trace the error backward through the layers, working out how much each weight contributed to the mistake
- Weight update — nudge every weight a tiny amount in the direction that reduces the error
Think of it like tuning an orchestra. The conductor (backpropagation) listens to the performance (forward pass), identifies which instruments are off-key (error calculation), and tells each musician exactly how much to adjust their tuning (weight update). After thousands of rehearsals, the orchestra sounds incredible — even though no single musician understands the entire symphony.
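Here's the predict-check-adjust cycle in miniature: a single neuron (one knob) learning the rule y = 2x by gradient descent. Backpropagation is this same idea applied to every weight in every layer at once; the data and learning rate below are made up for illustration:

```python
# A toy training loop: one weight learning y = 2*x.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, correct answer)
w = 0.0                                      # start with the knob at zero

for _ in range(100):                  # 100 "rehearsals"
    for x, y in data:
        pred = w * x                  # forward pass: predict
        error = pred - y              # how off-key are we?
        gradient = 2 * error * x      # which way to turn the knob
        w -= 0.05 * gradient          # small adjustment (learning rate 0.05)

print(round(w, 3))                    # prints 2.0 -- it found the rule
```

No one told the loop that the answer was 2; repeatedly shrinking the error got it there. Scale the same loop up to millions of weights and you have deep learning training.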
Hands-On: Train Your Own Network (on Paper)
50 XP

Deep learning: when "deep" just means "many layers"
Deep learning = neural networks with many hidden layers. That's it. The word "deep" refers to the depth of the network — the number of layers stacked between input and output.
| Term | What it means | Typical layer count |
|---|---|---|
| Shallow network | 1-2 hidden layers | Good for simple patterns |
| Deep network | 3+ hidden layers | Can learn complex, abstract patterns |
| Very deep network (modern AI) | 50-100+ layers | Image recognition, language models |
Why does depth matter? Because each layer builds on the previous one. More layers = more levels of abstraction = more complex patterns.
Think of it like giving directions:
- 1 layer deep: "Turn left at the light." (Simple instruction.)
- 3 layers deep: "Go to the shopping district, find the Italian restaurant row, look for the one with the red awning." (More context, more precision.)
- 50 layers deep: You can describe a specific person's face out of 8 billion people on Earth. (Absurdly complex pattern, but composed of many simple steps.)
Deeper networks = higher accuracy (until you hit diminishing returns)
The "deep learning revolution" of the 2010s happened because three things came together at once:
- More data — the internet produced billions of labelled images, text, and videos
- More compute — GPUs made training deep networks feasible
- Better techniques — researchers figured out how to train very deep networks without the signal getting lost between layers
Why GPUs matter: 1,000 students doing math simultaneously
Training a neural network means doing the same simple math operation (multiply and add) on millions of data points. A CPU does this one at a time — like one student solving math problems at a desk.
A GPU (Graphics Processing Unit) was originally designed for video games — it needs to calculate the color of millions of pixels simultaneously, 60 times per second. That's the same "do simple math on tons of data points at once" pattern that neural networks need.
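You can see the "one at a time" versus "all at once" difference without a GPU. Both snippets below compute the identical multiply-and-add; the loop handles one value per step, while the vectorized version hands the whole array to optimized parallel code in a single call — the same pattern GPUs exploit on a massive scale:

```python
import numpy as np

inputs  = np.arange(100_000, dtype=np.float64)
weights = np.full(100_000, 0.5)

# One at a time (CPU-loop style)
total_loop = 0.0
for x, w in zip(inputs, weights):
    total_loop += x * w

# All at once (the parallel-friendly form)
total_vec = inputs @ weights

print(total_loop == total_vec)  # prints True -- same math, same answer
```

The vectorized line is dramatically faster even on a CPU; on a GPU, with thousands of cores each handling a slice of the array, the gap becomes the days-versus-hours difference in the table below.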
| | CPU | GPU |
|---|---|---|
| Analogy | One brilliant math student | 1,000 average math students |
| Strength | Complex, sequential tasks | Simple tasks done in parallel |
| Neural network training | Slow (days/weeks) | Fast (hours/days) |
| Cost | Cheaper per unit | Expensive but worth it at scale |
Think of it this way: you need to add up 1,000 grocery receipts. A CPU is one accountant who's really fast — she processes one receipt at a time and finishes in an hour. A GPU is 1,000 interns who are slower individually but each handles one receipt simultaneously — done in 4 seconds.
This is why NVIDIA, the dominant supplier of AI training chips, became one of the world's most valuable companies by 2024. Every AI company needs GPUs — demand is so high that GPU supply is sometimes tracked like oil futures.
There Are No Dumb Questions
"Do I need a GPU to USE a neural network, or just to TRAIN one?"
Training is where GPUs are essential — that's the expensive, compute-heavy phase. Using a trained model (called "inference") is much cheaper and can often run on CPUs or smaller GPUs. When you chat with ChatGPT, inference runs on powerful servers — but the compute per response is a tiny fraction of what training cost.
"What about TPUs? I keep hearing about those."
TPUs (Tensor Processing Units) are Google's custom chips designed specifically for neural network math. They're even faster than GPUs for certain workloads. Amazon has Trainium, and other companies are building their own chips too. The key point: all of these are specialised hardware for doing simple math on lots of data simultaneously.
Match the Hardware to the Task
25 XP
Types of neural networks: different architectures for different jobs
Not all neural networks look the same. Different problems need different architectures — like how you'd use a screwdriver for screws and a hammer for nails.
| Architecture | Best for | How it works | Real-world example |
|---|---|---|---|
| Feedforward (basic) | Simple classification | Data flows one direction: input → hidden → output | Spam detection, simple prediction |
| Convolutional (CNN) | Images and visual data | Slides a small window across the image, detecting patterns at each position | Face recognition, medical imaging, self-driving cars |
| Recurrent (RNN) | Sequential data (text, time series) | Has a "memory" that carries information from previous steps | (Mostly replaced by transformers now) |
| Transformer | Text, code, and increasingly everything | Uses "attention" to look at all parts of input simultaneously | ChatGPT, Claude, DALL-E, translation |
The transformer architecture — which we'll cover in depth in the next module — is the foundation of modern AI. It replaced RNNs for language tasks because it can process all words in a sentence simultaneously (like reading a whole page at once) instead of one word at a time (like reading letter by letter).
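The "look at all parts of the input simultaneously" idea can be sketched in a few lines. This is a heavily stripped-down illustration of the attention pattern — real transformers learn separate query/key/value projections, and the three 2-number word vectors below are made up:

```python
import numpy as np

def softmax(z):
    # Turn raw scores into weights that sum to 1 along each row
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

words = np.array([[1.0, 0.0],    # toy numeric "meanings" for 3 words
                  [0.0, 1.0],
                  [1.0, 1.0]])

scores  = words @ words.T    # every word scores its relevance to every other word
weights = softmax(scores)    # (3, 3): each row is one word's attention over all 3
output  = weights @ words    # each word's new meaning: a blend of all words at once

print(weights.shape, output.shape)  # prints (3, 3) (3, 2)
```

The key contrast with an RNN: every word computed its blend in one parallel step, rather than waiting for the previous word to finish.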
Pick the Architecture
25 XP

Back to AlexNet
In 2012, nobody could explain exactly why AlexNet worked so well. The researchers had stacked layers of neurons, fed in millions of photos, and let the math sort itself out. The network learned to detect edges in layer 1, curves in layer 2, shapes in layer 3, and object parts in layer 4 — but nobody programmed those steps. The network discovered them.
That gap between "we know how to build it" and "we know why it works" is still one of the most fascinating things about deep learning. You now understand the mechanism: multiply, add, activate, stack, backpropagate. What the layers decide to learn — that's still largely up to the data. That's both what makes deep learning powerful and what makes it hard to fully trust.
The 2012 moment didn't just win a competition. It proved that this approach scales. Every AI system you use today is built on that same idea.
Key takeaways
- A neuron is just multiply-and-add. Each input gets multiplied by a weight (volume knob), the results get summed, and the neuron fires an output. Simple math, remarkable results at scale.
- Layers build abstraction. Like a relay race, each layer takes the previous layer's output and builds something more complex — raw pixels become edges become shapes become "that's a cat."
- Deep learning = many layers. The word "deep" just means the network has many hidden layers, allowing it to learn complex, abstract patterns.
- Training = adjusting millions of knobs. Backpropagation traces errors backward through the network and adjusts every weight to reduce mistakes. Repeat millions of times.
- GPUs made deep learning possible. They do simple math on thousands of data points simultaneously — exactly what neural network training needs.
- Different architectures for different jobs. CNNs for images, transformers for language, basic feedforward for simple classification. Pick the right tool for the job.
Knowledge Check
1. In a neural network, what does a 'weight' represent, and what happens to weights during training?
2. Why did GPUs become essential for training deep neural networks?
3. What does 'deep' mean in deep learning, and why does depth matter?
4. A neural network trained on photos achieves 99% accuracy on training images but 65% on new test images. Its first hidden layer has learned edge detection, and its last hidden layer has learned features specific to individual training photos. What is happening?