Data: The Fuel for AI — Understanding AI

The hiring algorithm that taught itself to reject women

Amazon scrapped an internal AI recruiting tool after discovering it had taught itself to penalise resumes containing the word "women's" — as in "women's chess club captain" or "women's soccer team." It also downgraded graduates of two all-women's colleges. The bias was identified internally around 2015; the story was publicly reported by Reuters in October 2018.

Nobody programmed that rule. Nobody told the system to discriminate.

The data did. The model was trained on 10 years of Amazon's own hiring decisions — a decade in which the company had overwhelmingly hired men for technical roles. The AI looked at that data and concluded: "historically, successful candidates don't have 'women's' on their resumes." It didn't understand gender. It found a statistical pattern and optimised for it.

This is the single most important lesson about AI: an AI model is only as good as the data you feed it. Feed it biased data, and you get a biased model. Feed it incomplete data, and you get an incomplete model. Feed it garbage, and you get garbage — with a very confident tone of voice.

Structured vs. unstructured data

Not all data is created equal. The first distinction you need to understand is between structured and unstructured data.

Structured data lives in rows and columns — like a spreadsheet. Every piece of data has a label and a specific place to go.

Unstructured data is everything else — emails, photos, videos, social media posts, PDFs, voice recordings. It's messy, inconsistent, and doesn't fit neatly into a table.

	Structured data	Unstructured data
What it looks like	Rows and columns (spreadsheets, databases)	Free-form (text, images, audio, video)
Examples	Customer names, order dates, prices, ZIP codes	Emails, photos, tweets, phone call recordings
Easy to search?	Yes — just query the column	No — you need AI to search it
Easy to analyse?	Yes — sort, filter, calculate	Hard — requires processing first
What % of all data?	~20%	~80%
AI relevance	Great for traditional ML (predictions, classification)	Where modern AI (LLMs, computer vision) shines

Before you read the next sentence: What percentage of all the world's data do you think is unstructured — text, photos, audio, video — vs. the clean rows-and-columns kind? Write down your guess.

Here's the stat that surprises everyone: approximately 80% of the world's data is unstructured. This is a widely cited industry estimate, often attributed to IDC, though the exact figure varies by definition and methodology. Emails. Slack messages. Photos. Call transcripts. PDFs buried in shared drives. That mountain of unstructured data was mostly unusable for analysis until recently. LLMs and modern AI changed that: they can read, summarise, and extract information from unstructured data at scale.

Structured data

Rows and columns (spreadsheets, databases)
Easy to query and aggregate
Every field has a defined type
~20% of enterprise data

Unstructured data

Text, images, audio, video
Requires AI to extract meaning
No predefined schema
~80% of enterprise data, the untapped majority

There Are No Dumb Questions

"Is a PDF structured or unstructured?"

Unstructured — even if the PDF contains a table! The data inside a PDF isn't in labelled rows and columns that a computer can easily query. A computer sees a PDF as a blob of text and images. Extracting structured data from PDFs is actually one of the most common (and frustrating) tasks in the AI industry.

"What about a CSV file?"

That's structured. A CSV (Comma-Separated Values) file is essentially a spreadsheet in text form. Each line is a row, and commas separate the columns. Computers can read and process CSVs very easily.
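To make the contrast concrete, here is a minimal Python sketch. The CSV contents, the names, and the email text are all invented for illustration; the point is only that a labelled column can be queried directly, while the same kind of fact in free text cannot.

```python
import csv
import io

# Hypothetical structured data: a tiny CSV with a header row.
csv_text = """name,department,salary
Ana,Engineering,95000
Ben,Sales,70000
Cara,Engineering,105000
"""

# Every value sits in a labelled column, so a query is a short filter.
rows = list(csv.DictReader(io.StringIO(csv_text)))
engineering_salaries = [
    int(row["salary"]) for row in rows if row["department"] == "Engineering"
]
average_engineering_salary = sum(engineering_salaries) / len(engineering_salaries)
print(average_engineering_salary)

# Hypothetical unstructured data: a similar fact buried in free text.
email_text = "Hi team, Ana from engineering just got bumped to around 95k, congrats!"

# There is no "salary" field to look up here. Extracting the number would
# need pattern matching or a language model, not a column query.
print("salary" in email_text)
```

The structured half needs no knowledge of what the text "means"; the unstructured half has no field names at all, which is why it has to be processed before it can be analysed.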

Structured or Unstructured?

Classify each item below as structured or unstructured:

A spreadsheet of employee names, departments, and salaries

A folder of customer support email threads

A database table of product prices and inventory counts

A collection of YouTube videos

A JSON file with user profiles (name, age, location)

A stack of scanned handwritten medical forms


What is a dataset?

A dataset is simply a collection of data organised for a specific purpose. Think of it as a textbook for AI.

When you go to school, you learn from textbooks. Each textbook covers a specific subject and contains carefully selected examples. A dataset is the same thing — it's the textbook that an AI model studies.

Textbook concept	Dataset concept
The textbook itself	The dataset
One chapter	A subset of the data (e.g., training set vs. test set)
One page with an example	One data point (one row, one image, one text sample)
The subject (math, history)	The domain (medical, financial, language)
The answer key	The labels (what the correct answer is for each example)
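The mapping above can be sketched in a few lines of Python. This is an illustrative example only: the ten emails, the `train_test_split` helper, and the 80/20 split are assumptions made for the sketch, not part of any particular library or of the datasets named in this lesson.

```python
import random

# Hypothetical labelled dataset: each data point is (text, label).
dataset = [
    ("Win a free prize now", "spam"),
    ("Meeting moved to 3pm", "not spam"),
    ("Claim your reward today", "spam"),
    ("Lunch tomorrow?", "not spam"),
    ("Cheap loans, act fast", "spam"),
    ("Quarterly report attached", "not spam"),
    ("You have been selected", "spam"),
    ("Can you review my draft?", "not spam"),
    ("Limited offer, click here", "spam"),
    ("See you at the standup", "not spam"),
]


def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data, then split it into training and test sets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train_set, test_set = train_test_split(dataset)
print(len(train_set), len(test_set))

# One data point is one "page"; its label is the "answer key" entry.
example_text, example_label = train_set[0]
print(example_text, "->", example_label)
```

The training set is the part of the textbook the model studies; the test set is held back so the model can be checked on examples it has not seen.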

Some famous datasets you might hear about:

Dataset	What's in it	What it's used for
ImageNet	14 million labelled images	Training image recognition models
Common Crawl	Billions of web pages	Training language models
MNIST	70,000 handwritten digit images	Teaching beginners machine learning
SQuAD	100,000+ question-answer pairs	Training reading comprehension models

Key insight: the quality and size of the dataset largely determine how good the AI model can be. All else being equal, a model trained on 100 examples will usually be worse than one trained on 10 million. And a model trained on wrong examples will learn wrong things.

More data generally yields better models, though with diminishing returns (roughly logarithmic scaling).

Data quality: garbage in, garbage out

There's an old saying in computing: "Garbage in, garbage out" (GIGO). It means: if you put bad data into a system, you'll get bad results out of it — no matter how sophisticated the system is.

For AI, this isn't just a catchy phrase. It's a law of nature.

Here's what "bad data" looks like:

Data quality problem	What it means	Real-world example
Missing values	Empty cells where data should be	30% of customer records have no email address
Duplicates	Same data recorded multiple times	Same patient entered twice with slightly different names
Inconsistent formatting	Same thing written different ways	"New York," "NY," "new york," "N.Y.C."
Outdated data	Information that's no longer accurate	Training on 2019 product prices for 2025 predictions
Incorrect labels	Data labelled with the wrong answer	A photo of a dog labelled as "cat"
Sampling bias	Data that doesn't represent the real world	Training a facial recognition system on only light-skinned faces

The algorithm can be exactly the same whether the data is clean or flawed. The only difference is the data. That's why data scientists are often reported to spend roughly 80% of their time cleaning and preparing data and only 20% building models (per practitioner surveys including CrowdFlower/Figure Eight, 2017). Figures vary by role and context, but the pattern remains widely cited in more recent surveys. The unsexy work is the important work.

⚠️ Garbage in, garbage out

A model trained on biased data learns biased patterns and then applies them at scale. Amazon built a hiring model trained on 10 years of resumes; since most past hires were male, it learned to penalise resumes that included the word "women's" (e.g. "women's chess club"). They scrapped it. The data was the problem, not the model.

There Are No Dumb Questions

"Can't AI just 'figure out' that the data is messy and correct for it?"

Some techniques can handle minor messiness — like filling in a few missing values. But AI can't fix what it can't see. If your data systematically excludes an entire group of people, the model has no way to know those people exist. It will learn the world as your data presents it — biases and all.

"How do you know if your data is 'good enough'?"

There's no universal threshold, but experienced practitioners check for: completeness (how many values are missing?), consistency (are similar things formatted the same way?), accuracy (do the labels match reality?), and representativeness (does the data reflect the real world?). Tools called "data profilers" can automate some of these checks.
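As a rough idea of what such a profiler does, here is a minimal sketch in plain Python. The records, the `profile` function, and its report keys are hypothetical and written for this example; real profiling tools check far more than this. It covers three of the checks named above: completeness, exact duplicates, and formatting consistency in one column.

```python
from collections import Counter

# Hypothetical customer records; None marks a missing value.
records = [
    {"name": "Ana Ruiz", "city": "New York", "email": "ana@example.com"},
    {"name": "Ben Okafor", "city": "NY", "email": None},
    {"name": "Ana Ruiz", "city": "New York", "email": "ana@example.com"},
    {"name": "Cara Li", "city": "new york", "email": "cara@example.com"},
]


def profile(rows, text_column):
    """Return simple completeness, duplicate, and consistency counts."""
    # Completeness: count missing values per column.
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value == "":
                missing[column] += 1

    # Duplicates: count rows that exactly repeat an earlier row.
    row_counts = Counter(
        tuple(sorted(row.items(), key=lambda item: item[0])) for row in rows
    )
    duplicates = sum(count - 1 for count in row_counts.values())

    # Consistency: count distinct spellings in one text column.
    spellings = {row[text_column] for row in rows if row[text_column]}

    return {
        "missing": dict(missing),
        "duplicates": duplicates,
        "distinct_spellings": len(spellings),
    }


report = profile(records, text_column="city")
print(report)
```

On this sample the report flags one missing email, one exact duplicate row, and three different spellings of the same city. Note what it cannot check: accuracy and representativeness need knowledge of the real world, not just of the table.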

Find the Data Problems

You're reviewing a dataset used to train an AI model that predicts whether a loan applicant will repay their loan. Here's a sample:

| Name | Age | Income | ZIP Code | Loan Amount | Repaid? |
|------|-----|--------|----------|-------------|---------|
| Jane Smith | 34 | $85,000 | 10001 | $20,000 | Yes |
| john doe | 29 | | 90210 | $15,000 | Yes |
| Jane Smith | 34 | $85,000 | 10001 | $20,000 | Yes |
| Maria Garcia | -5 | $45,000 | 99999 | $10,000 | No |
| Wei Chen | 41 | $120,000 | ten-thousand-one | $30,000 | |
| Bob Johnson | 55 | $200K | 60601 | $50,000 | No |

Find at least **5 data quality problems** in this dataset. For each one, explain what type of problem it is (missing value, duplicate, inconsistent formatting, incorrect data, etc.) and how it could affect the model.

*Hint: Go column by column: which cells are empty? Go row by row: does any row look identical to another? Check whether all values in a column follow the same format (numbers, text, currency). Check whether all values are logically possible. And don't forget to look at the "Repaid?" column too.*

Bias in data: the crooked ruler

Imagine you're measuring the height of everyone in a room, but your ruler is bent — it adds an extra inch to every measurement. You'll get numbers. You'll get consistent numbers. You'll even get numbers that look perfectly reasonable. But every single measurement will be wrong in the same direction.

Biased data is a crooked ruler. It measures the world inaccurately, consistently, and invisibly.
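The crooked-ruler idea can be shown with a few numbers. In this small sketch the heights are made up and the one-inch offset comes from the analogy above. It shows why the bias is invisible from inside the data: the spread of the measurements looks completely normal, and only a comparison with the true values reveals the shift.

```python
import math
import statistics

# Hypothetical true heights, in inches.
true_heights = [64, 66, 68, 70, 72]

# The bent ruler from the analogy adds one inch to every measurement.
RULER_OFFSET = 1
measured_heights = [height + RULER_OFFSET for height in true_heights]

# Every measurement is wrong in the same direction...
errors = [m - t for m, t in zip(measured_heights, true_heights)]
mean_error = statistics.mean(errors)

# ...yet the spread of the data is unchanged, so nothing looks off.
true_spread = statistics.pstdev(true_heights)
measured_spread = statistics.pstdev(measured_heights)

print("mean error:", mean_error)
print("same spread:", math.isclose(true_spread, measured_spread))
```

Random noise would show up as extra spread and tend to average out over many measurements. A systematic offset does neither, which is what makes biased data hard to spot without an outside reference.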

Here's how bias sneaks into data:

Type of bias	How it happens	Example
Historical bias	The data reflects real-world inequities	Hiring data shows mostly male engineers because the industry historically excluded women. The AI learns "good engineer = male."
Selection bias	The data doesn't represent everyone	A medical study includes only patients from wealthy hospitals. The AI learns treatment patterns that don't work for underserved communities.
Measurement bias	The way data is collected introduces errors	Customer satisfaction surveys only go to English speakers. Non-English-speaking customers' experiences are invisible.
Label bias	Human labellers bring their own biases	Humans labelling resume quality unconsciously rate "foreign-sounding" names lower.

The Amazon hiring algorithm story from the beginning of this module is a textbook case of historical bias. The training data reflected a decade of gender-skewed hiring decisions — the tech industry had historically hired far more men than women. The algorithm faithfully learned those patterns and reproduced them — at scale, automatically, and with a veneer of objectivity that made the bias harder to question.
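The mechanism can be illustrated with a deliberately tiny example. This is not Amazon's system, whose details were never published in full; it is a toy sketch with invented records and a made-up `hire_rate_by_token` helper. Skills and hobbies are balanced across the records, and the only thing that differs is whether the resume mentions "women's".

```python
from collections import defaultdict

# Hypothetical historical records: (resume tokens, was_hired).
# The skew is deliberate: resumes mentioning "women's" were rarely hired,
# even though skills and hobbies are spread evenly across both groups.
history = [
    ({"python", "chess"}, True),
    ({"python", "soccer"}, False),
    ({"java", "chess"}, True),
    ({"java", "soccer"}, True),
    ({"python", "women's", "chess"}, False),
    ({"java", "women's", "soccer"}, False),
    ({"python", "women's", "soccer"}, True),
    ({"java", "women's", "chess"}, False),
]


def hire_rate_by_token(records):
    """For each token, compute the fraction of resumes containing it that were hired."""
    counts = defaultdict(lambda: [0, 0])  # token -> [hired, total]
    for tokens, hired in records:
        for token in tokens:
            counts[token][1] += 1
            if hired:
                counts[token][0] += 1
    return {token: hired / total for token, (hired, total) in counts.items()}


rates = hire_rate_by_token(history)
for token in sorted(rates):
    print(f"{token}: {rates[token]:.2f}")

# The token a naive scorer would penalise most.
lowest_token = min(rates, key=rates.get)
print("lowest hire rate:", lowest_token)
```

Every skill and hobby token comes out at 0.50, while "women's" comes out at 0.25. A scorer built on these rates would penalise that word without any notion of gender, purely because the historical decisions it was given were skewed.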

Before you look at the diagram: If a biased hiring model rejects certain candidates, and those rejections become part of next year's "who we didn't hire" record, what do you think happens when that new data is used to retrain the model?

That feedback loop at the bottom is the scariest part. If a biased model's predictions influence the real world (e.g., who gets policed, who gets hired, who gets a loan), the real world becomes more biased, which produces more biased data, which trains more biased models. The bias amplifies itself.
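A toy simulation makes the amplification visible. The update rule below is an assumption chosen only to illustrate a self-reinforcing loop (each retraining round favours a group in proportion to the square of its share of past hires); it is not a model of any real hiring system, and the starting share of 0.6 is arbitrary.

```python
def next_share(share):
    """One retraining round under a toy assumption: the model favours each
    group in proportion to the square of its share of past hires."""
    majority_weight = share ** 2
    minority_weight = (1 - share) ** 2
    return majority_weight / (majority_weight + minority_weight)


def simulate(initial_share, rounds):
    """Return the majority group's share of hires after each round."""
    shares = [initial_share]
    for _ in range(rounds):
        shares.append(next_share(shares[-1]))
    return shares


shares = simulate(initial_share=0.6, rounds=4)
for round_number, share in enumerate(shares):
    print(f"round {round_number}: majority share of hires = {share:.3f}")
```

Under this assumed rule, a modest 60/40 imbalance grows every round and is close to total within four retraining cycles. The exact numbers mean nothing; the shape of the curve is the point.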

Spot the Bias

For each scenario, identify what type of bias is present and explain how it could lead to harm:

1. A facial recognition system trained mostly on photos of light-skinned faces struggles to identify people with darker skin tones → Type: ___ Harm: ___
2. A resume screening AI trained on a company's past hiring decisions (which favoured graduates from elite universities) → Type: ___ Harm: ___
3. A sentiment analysis model trained only on English-language tweets is used to analyse customer feedback in a multilingual market → Type: ___ Harm: ___

*Hint: For each scenario, ask two questions: who was left out of the training data, and why? Use the bias types from the table above to classify what you find. Then think through what happens when the model is applied to the groups that were missing.*

Labels and annotations: teaching by example

How do you teach a child the difference between a cat and a dog? You show them pictures. "This is a cat. This is a dog. This is a cat. This is a dog." After enough examples, they figure it out on their own.

AI learns the same way — through labelled data. A label is the answer attached to each example. It's the flashcard answer.

Flashcard concept	AI concept
The front of the card (the question)	The data (image, text, etc.)
The back of the card (the answer)	The label (what the correct answer is)
A stack of flashcards	A labelled dataset
Studying flashcards	Training the model

The process of adding labels to data is called annotation. It's usually done by humans — sometimes thousands of them. And it's more important (and harder) than you might think.

Task	Data	Label
Email spam detection	The text of an email	"spam" or "not spam"
Image classification	A photo	"cat," "dog," "bird"
Sentiment analysis	A customer review	"positive," "negative," "neutral"
Medical diagnosis	An X-ray image	"fracture" or "no fracture"

Here's the catch: if the labels are wrong, the model learns wrong things. And labelling is done by humans, who make mistakes, get tired, disagree with each other, and bring their own biases. One study found that major image datasets had label error rates of roughly 3% to 10% (Northcutt et al., 2021). That means up to 1 in 10 "answers" in the model's textbook are wrong.
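One common way to reduce the effect of individual mistakes is to have several people label each example and compare their answers. The sketch below assumes three annotators per image and uses invented image IDs and labels; the `majority_label` helper is hypothetical. It is a simple majority vote, not the method used in the study cited above.

```python
from collections import Counter

# Hypothetical annotations: three annotators label each image.
annotations = {
    "img_001": ["cat", "cat", "cat"],
    "img_002": ["dog", "dog", "cat"],
    "img_003": ["dog", "dog", "dog"],
    "img_004": ["cat", "dog", "bird"],
}


def majority_label(labels):
    """Return the most common label, or None when there is no strict majority."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


# Resolve each image to a consensus label where one exists.
consensus = {image: majority_label(labels) for image, labels in annotations.items()}

# Share of images on which all annotators agreed.
unanimous = sum(1 for labels in annotations.values() if len(set(labels)) == 1)
unanimous_rate = unanimous / len(annotations)

print(consensus)
print("unanimous rate:", unanimous_rate)
```

Images with no consensus (here `img_004`) are good candidates for a second review. A majority vote does not remove bias that all annotators share, so it helps with tiredness and slips more than with the label bias described earlier.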

Label It Yourself

You're training an AI to classify customer support tickets. For each ticket below, assign a label from this list: **billing, technical, shipping, account, feedback**.

1. "I was charged twice for my subscription last month" → ___
2. "The app crashes every time I try to upload a photo" → ___
3. "My package says delivered but I never received it" → ___
4. "I love the new dashboard design, great work!" → ___
5. "I can't reset my password, the link doesn't work" → ___

Now consider: **What happens if 10% of the training data has wrong labels?** (E.g., billing tickets labelled as "technical") How would this affect the model?

*Hint: Wrong labels = wrong patterns learned. If the model sees billing complaints labelled as "technical," it will start routing real billing complaints to the wrong team. The model will be confidently wrong, which is worse than being obviously wrong.*

Putting it all together

Here's the complete picture of how data flows into an AI model:

Notice something? The model is only one box in this entire flow. Most of the work — and most of the potential for things to go wrong — is in the data boxes.

Back to Amazon's hiring algorithm

The story that opened this module now has a fuller explanation. Amazon's recruiting AI didn't learn to discriminate because someone coded that rule. It learned because the training data — 10 years of Amazon's own hiring records — reflected a world where men were overwhelmingly hired for technical roles. The algorithm found that pattern, treated it as signal, and optimised for it.

That's historical bias in action: the data faithfully recorded real-world inequity, and the model faithfully learned from it. Amazon's engineers weren't careless. They were rigorous. But no amount of rigour fixes a fundamentally flawed dataset. They eventually scrapped the tool entirely.

What you now have — an understanding of data quality, bias types, labels, and annotation — is what gives you the ability to ask the right questions before a tool like that gets deployed.

Key takeaways

AI inherits the biases, gaps, and errors in its training data. Biased data produces biased models. There's no algorithmic fix for fundamentally flawed data.
You can distinguish structured from unstructured data — and roughly 80% of real-world data is unstructured (text, images, audio), which is where modern AI provides the most value.
"Garbage in, garbage out" is the iron law of AI. Missing values, duplicates, inconsistent formatting, and wrong labels all directly degrade model quality.
Labels are the flashcard answers that teach AI models what's correct. Human labelling errors (3-10% in major datasets) are a significant source of model mistakes.
You can identify common types of data bias — historical, selection, measurement, and label bias — and understand how they create harmful feedback loops when deployed in real-world systems.

Knowledge Check

1. An AI hiring tool trained on a company's past 10 years of hiring decisions consistently ranks male candidates higher than equally qualified female candidates. What is the most likely cause?

2. Which of the following is an example of unstructured data?

3. A labelled dataset for training a spam detector contains 10,000 emails. 500 of those emails are incorrectly labelled (spam emails marked as 'not spam' and vice versa). What is the most likely impact?

4. Why do data scientists typically spend about 80% of their time on data cleaning and preparation rather than building models?
