Data: The Fuel for AI
AI is only as good as the data it learns from — and most data is messier than you think.
The hiring algorithm that taught itself to reject women
Amazon scrapped an internal AI recruiting tool after discovering it had taught itself to penalise resumes containing the word "women's" — as in "women's chess club captain" or "women's soccer team." It also downgraded graduates of two all-women's colleges. The bias was identified internally around 2015; the story was publicly reported by Reuters in October 2018.
Nobody programmed that rule. Nobody told the system to discriminate.
The data did. The model was trained on 10 years of Amazon's own hiring decisions — a decade in which the company had overwhelmingly hired men for technical roles. The AI looked at that data and concluded: "historically, successful candidates don't have 'women's' on their resumes." It didn't understand gender. It found a statistical pattern and optimised for it.
This is the single most important lesson about AI: an AI model is only as good as the data you feed it. Feed it biased data, and you get a biased model. Feed it incomplete data, and you get an incomplete model. Feed it garbage, and you get garbage — with a very confident tone of voice.
Structured vs. unstructured data
Not all data is created equal. The first distinction you need to understand is between structured and unstructured data.
Structured data lives in rows and columns — like a spreadsheet. Every piece of data has a label and a specific place to go.
Unstructured data is everything else — emails, photos, videos, social media posts, PDFs, voice recordings. It's messy, inconsistent, and doesn't fit neatly into a table.
| | Structured data | Unstructured data |
|---|---|---|
| What it looks like | Rows and columns (spreadsheets, databases) | Free-form (text, images, audio, video) |
| Examples | Customer names, order dates, prices, ZIP codes | Emails, photos, tweets, phone call recordings |
| Easy to search? | Yes — just query the column | No — you need AI to search it |
| Easy to analyse? | Yes — sort, filter, calculate | Hard — requires processing first |
| What % of all data? | ~20% | ~80% |
| AI relevance | Great for traditional ML (predictions, classification) | Where modern AI (LLMs, computer vision) shines |
Before you read the next sentence: What percentage of all the world's data do you think is unstructured — text, photos, audio, video — vs. the clean rows-and-columns kind? Write down your guess.
Here's the stat that surprises everyone: approximately 80% of the world's data is unstructured. This is a widely cited industry estimate, often attributed to IDC, though the exact figure varies by definition and methodology. Emails. Slack messages. Photos. Call transcripts. PDFs buried in shared drives. That mountain of unstructured data was mostly unusable for analysis until recently. LLMs and modern AI changed that: they can read, summarise, and extract information from unstructured data at scale.
Structured data (usable without AI)
- Rows and columns (spreadsheets, databases)
- Easy to query and aggregate
- Every field has a defined type
- ~20% of enterprise data
Unstructured data (unlocked with AI)
- Text, images, audio, video
- Requires AI to extract meaning
- No predefined schema
- ~80% of enterprise data: the untapped majority
There Are No Dumb Questions
"Is a PDF structured or unstructured?"
Unstructured — even if the PDF contains a table! The data inside a PDF isn't in labelled rows and columns that a computer can easily query. A computer sees a PDF as a blob of text and images. Extracting structured data from PDFs is actually one of the most common (and frustrating) tasks in the AI industry.
"What about a CSV file?"
That's structured. A CSV (Comma-Separated Values) file is essentially a spreadsheet in text form. Each line is a row, and commas separate the columns. Computers can read and process CSVs very easily.
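To make the CSV answer concrete, here is a minimal Python sketch of what "easy to read and process" means. The order data, the column names, and the `total_spent` helper are all invented for illustration.

```python
import csv
import io

# Hypothetical order data, inlined so the example runs without a file on disk.
CSV_TEXT = """customer,order_date,price
Ana,2024-01-05,19.99
Ben,2024-01-06,5.00
Ana,2024-02-10,42.50
"""

def total_spent(csv_text: str, customer: str) -> float:
    """Sum the price column for every row belonging to one customer."""
    reader = csv.DictReader(io.StringIO(csv_text))  # first line becomes the column labels
    return sum(float(row["price"]) for row in reader if row["customer"] == customer)

ana_total = total_spent(CSV_TEXT, "Ana")
print(f"Ana spent {ana_total:.2f}")
```

Because every value sits under a labelled column, the question "how much did Ana spend?" takes one line. Answering the same question from a folder of free-form emails would first require working out where the prices are.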
Structured or Unstructured?
25 XP2. A folder of customer support email threads →
What is a dataset?
A dataset is simply a collection of data organised for a specific purpose. Think of it as a textbook for AI.
When you go to school, you learn from textbooks. Each textbook covers a specific subject and contains carefully selected examples. A dataset is the same thing — it's the textbook that an AI model studies.
| Textbook concept | Dataset concept |
|---|---|
| The textbook itself | The dataset |
| One chapter | A subset of the data (e.g., training set vs. test set) |
| One page with an example | One data point (one row, one image, one text sample) |
| The subject (math, history) | The domain (medical, financial, language) |
| The answer key | The labels (what the correct answer is for each example) |
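The textbook mapping above can be sketched in a few lines of Python. This is only an illustration: the reviews, the labels, and the `train_test_split` helper are made up here, and real projects usually rely on a library routine for the split.

```python
import random

# Each data point pairs an input with its label (the "answer key").
# These reviews and labels are invented for illustration.
dataset = [
    ("Great product, works perfectly", "positive"),
    ("Arrived broken", "negative"),
    ("Does what it says", "positive"),
    ("Stopped working after a week", "negative"),
    ("Love it", "positive"),
    ("Waste of money", "negative"),
    ("Exactly as described", "positive"),
    ("Never arrived", "negative"),
]

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows, then hold out a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

train_set, test_set = train_test_split(dataset)
print(len(train_set), "training examples,", len(test_set), "test examples")
```

The training set is the chapter the model studies. The test set is the exam it has not seen, which is why the two must not overlap.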
Some famous datasets you might hear about:
| Dataset | What's in it | What it's used for |
|---|---|---|
| ImageNet | 14 million labelled images | Training image recognition models |
| Common Crawl | Billions of web pages | Training language models |
| MNIST | 70,000 handwritten digit images | Teaching beginners machine learning |
| SQuAD | 100,000+ question-answer pairs | Training reading comprehension models |
Key insight: the quality and size of the dataset largely determine how good the AI model will be. All else being equal, a model trained on 100 examples will usually perform worse than one trained on 10 million. And a model trained on wrong examples will learn wrong things.
Figure: More data = better models (logarithmic scaling)
Data quality: garbage in, garbage out
There's an old saying in computing: "Garbage in, garbage out" (GIGO). It means: if you put bad data into a system, you'll get bad results out of it — no matter how sophisticated the system is.
For AI, this isn't just a catchy phrase. It's a law of nature.
Here's what "bad data" looks like:
| Data quality problem | What it means | Real-world example |
|---|---|---|
| Missing values | Empty cells where data should be | 30% of customer records have no email address |
| Duplicates | Same data recorded multiple times | Same patient entered twice with slightly different names |
| Inconsistent formatting | Same thing written different ways | "New York," "NY," "new york," "N.Y.C." |
| Outdated data | Information that's no longer accurate | Training on 2019 product prices for 2025 predictions |
| Incorrect labels | Data labelled with the wrong answer | A photo of a dog labelled as "cat" |
| Sampling bias | Data that doesn't represent the real world | Training a facial recognition system on only light-skinned faces |
Give the same algorithm clean data and messy data, and you get two very different models. The only difference is the data. That's why data scientists are often reported to spend roughly 80% of their time cleaning and preparing data and only 20% building models. The figure comes from practitioner surveys such as CrowdFlower/Figure Eight (2017), varies by role and context, and remains widely cited in more recent surveys. The unsexy work is the important work.
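Here is a small Python sketch of what that cleaning work can look like for two problems from the table above, inconsistent formatting and duplicates. The alias table, the sample records, and both helper functions are assumptions made up for this example.

```python
# Hypothetical alias table: every spelling we want to treat as the same city.
CITY_ALIASES = {"ny": "New York", "nyc": "New York", "new york": "New York"}

def normalise_city(raw: str) -> str:
    """Map known spellings to one canonical name; leave unknown cities alone."""
    key = raw.strip().lower().replace(".", "")
    return CITY_ALIASES.get(key, raw.strip())

def clean(records):
    """Normalise names and cities, then drop rows that became exact duplicates."""
    seen = set()
    cleaned = []
    for name, city in records:
        row = (name.strip().title(), normalise_city(city))
        if row not in seen:
            seen.add(row)
            cleaned.append(row)
    return cleaned

# Invented records showing the problems from the table above.
raw_records = [
    ("ana lopez", "New York"),
    ("Ana Lopez", "N.Y.C."),
    ("Ben Okafor", "NY"),
    ("Cy Tan", "Boston "),
]
cleaned_records = clean(raw_records)
print(cleaned_records)
```

This only catches rows that become identical after normalisation. Duplicates with genuinely different spellings, like the patient entered twice under slightly different names, need fuzzier matching and usually a human check.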
There Are No Dumb Questions
"Can't AI just 'figure out' that the data is messy and correct for it?"
Some techniques can handle minor messiness — like filling in a few missing values. But AI can't fix what it can't see. If your data systematically excludes an entire group of people, the model has no way to know those people exist. It will learn the world as your data presents it — biases and all.
"How do you know if your data is 'good enough'?"
There's no universal threshold, but experienced practitioners check for: completeness (how many values are missing?), consistency (are similar things formatted the same way?), accuracy (do the labels match reality?), and representativeness (does the data reflect the real world?). Tools called "data profilers" can automate some of these checks.
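A very small "data profiler" might look like the sketch below. It covers only completeness, a rough consistency signal, and duplicates. The `profile` and `duplicate_count` helpers and the sample rows are hypothetical, not the interface of any real profiling tool.

```python
from collections import Counter

def is_missing(value) -> bool:
    """Treat None and blank strings as missing."""
    return value is None or str(value).strip() == ""

def profile(rows, columns):
    """Per column: share of missing values and number of distinct non-missing values."""
    if not rows:
        return {}
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if not is_missing(v)]
        report[col] = {
            "missing_share": (len(values) - len(present)) / len(values),
            "distinct": len(set(present)),
        }
    return report

def duplicate_count(rows, key_columns):
    """Count rows that repeat an earlier row on the given key columns."""
    counts = Counter(tuple(row.get(c) for c in key_columns) for row in rows)
    return sum(n - 1 for n in counts.values() if n > 1)

# Invented customer records.
rows = [
    {"name": "Ana", "email": "ana@example.com", "city": "New York"},
    {"name": "Ben", "email": "", "city": "NY"},
    {"name": "Cy", "email": "cy@example.com", "city": "new york"},
    {"name": "Ana", "email": "ana@example.com", "city": "New York"},
]
report = profile(rows, ["name", "email", "city"])
print(report)
print("duplicate rows:", duplicate_count(rows, ["name", "email"]))
```

Three distinct values in the city column for what is really one city is a hint of inconsistent formatting. Accuracy and representativeness are harder to automate, because checking them means comparing the data with the real world.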
Bias in data: the crooked ruler
Imagine you're measuring the height of everyone in a room, but your ruler is bent — it adds an extra inch to every measurement. You'll get numbers. You'll get consistent numbers. You'll even get numbers that look perfectly reasonable. But every single measurement will be wrong in the same direction.
Biased data is a crooked ruler. It measures the world inaccurately, consistently, and invisibly.
Here's how bias sneaks into data:
| Type of bias | How it happens | Example |
|---|---|---|
| Historical bias | The data reflects real-world inequities | Hiring data shows mostly male engineers because the industry historically excluded women. The AI learns "good engineer = male." |
| Selection bias | The data doesn't represent everyone | A medical study includes only patients from wealthy hospitals. The AI learns treatment patterns that don't work for underserved communities. |
| Measurement bias | The way data is collected introduces errors | Customer satisfaction surveys only go to English speakers. Non-English-speaking customers' experiences are invisible. |
| Label bias | Human labellers bring their own biases | Humans labelling resume quality unconsciously rate "foreign-sounding" names lower. |
The Amazon hiring algorithm story from the beginning of this module is a textbook case of historical bias. The training data reflected a decade of gender-skewed hiring decisions — the tech industry had historically hired far more men than women. The algorithm faithfully learned those patterns and reproduced them — at scale, automatically, and with a veneer of objectivity that made the bias harder to question.
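One simple first check for this kind of pattern is to compare how often each group was selected in the historical data. The sketch below uses invented numbers, not Amazon's, and the `selection_rates` helper is hypothetical.

```python
from collections import defaultdict

def selection_rates(decisions):
    """decisions: iterable of (group, was_hired) pairs -> {group: share hired}."""
    totals = defaultdict(int)
    hired = defaultdict(int)
    for group, was_hired in decisions:
        totals[group] += 1
        hired[group] += int(was_hired)
    return {group: hired[group] / totals[group] for group in totals}

# Made-up hiring history: 100 male applicants, 20 female applicants.
history = (
    [("men", True)] * 60 + [("men", False)] * 40
    + [("women", True)] * 6 + [("women", False)] * 14
)

rates = selection_rates(history)
ratio = rates["women"] / rates["men"]
print(rates, "ratio:", round(ratio, 2))
```

A large gap between groups does not prove discrimination on its own, since this toy check ignores qualifications. It is a prompt to investigate before the data is used to train anything.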
Before you look at the diagram: If a biased hiring model rejects certain candidates, and those rejections become part of next year's "who we didn't hire" record, what do you think happens when that new data is used to retrain the model?
That feedback loop is the scariest part. If a biased model's predictions influence the real world (e.g., who gets policed, who gets hired, who gets a loan), the real world becomes more biased, which produces more biased data, which trains more biased models. The bias amplifies itself.
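The amplification can be illustrated with a toy simulation. Everything here is an assumption made for the sketch: applicants are split 50/50, the model is retrained each round on its own latest decisions, and the invented `sharpening` parameter stands in for a model that optimises a pattern instead of merely copying it. It does not describe how any real hiring system works.

```python
def simulate_feedback(initial_share, rounds, sharpening=2.0):
    """Track the majority group's share of hires across retraining rounds.

    Toy rule (an assumption): the model favours each group in proportion to
    (that group's share of the training data) ** sharpening.
    With sharpening == 1.0 the model copies the data and the share stays put.
    """
    share = initial_share
    history = [share]
    for _ in range(rounds):
        majority_weight = share ** sharpening
        minority_weight = (1 - share) ** sharpening
        share = majority_weight / (majority_weight + minority_weight)
        history.append(share)  # these decisions become next round's training data
    return history

shares = simulate_feedback(initial_share=0.6, rounds=4)
print([round(s, 3) for s in shares])
```

Under these assumptions, a modest 60/40 skew grows every round until the minority group has almost disappeared from the hires, even though the applicant pool never changed.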
Labels and annotations: teaching by example
How do you teach a child the difference between a cat and a dog? You show them pictures. "This is a cat. This is a dog. This is a cat. This is a dog." After enough examples, they figure it out on their own.
AI learns the same way — through labelled data. A label is the answer attached to each example. It's the flashcard answer.
| Flashcard concept | AI concept |
|---|---|
| The front of the card (the question) | The data (image, text, etc.) |
| The back of the card (the answer) | The label (what the correct answer is) |
| A stack of flashcards | A labelled dataset |
| Studying flashcards | Training the model |
The process of adding labels to data is called annotation. It's usually done by humans — sometimes thousands of them. And it's more important (and harder) than you might think.
| Task | Data | Label |
|---|---|---|
| Email spam detection | The text of an email | "spam" or "not spam" |
| Image classification | A photo | "cat," "dog," "bird" |
| Sentiment analysis | A customer review | "positive," "negative," "neutral" |
| Medical diagnosis | An X-ray image | "fracture" or "no fracture" |
Here's the catch: if the labels are wrong, the model learns wrong things. And labelling is done by humans, who make mistakes, get tired, disagree with each other, and bring their own biases. One study estimated that label error rates across ten widely used benchmark test sets average roughly 3%, with the worst dataset approaching 10% (Northcutt et al., 2021). That means that in some datasets up to 1 in 10 "answers" in the model's textbook are wrong.
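A common way to cope with annotators who disagree is to collect several labels per item and keep the majority vote, flagging items where agreement is low. The sketch below assumes three annotators per item. The annotations, the `majority_label` helper, and the two-thirds threshold are all invented for illustration.

```python
from collections import Counter

def majority_label(votes):
    """Return (most common label, share of annotators who chose it)."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

# Hypothetical annotations: three annotators label each customer review.
annotations = {
    "review_1": ["positive", "positive", "positive"],
    "review_2": ["negative", "neutral", "negative"],
    "review_3": ["neutral", "positive", "negative"],
}

consensus = {item: majority_label(votes) for item, votes in annotations.items()}

# Items where fewer than two of three annotators agree go back for a second look.
needs_review = [item for item, (_, agreement) in consensus.items() if agreement < 2 / 3]

print(consensus)
print("needs review:", needs_review)
```

Majority voting reduces random mistakes, but it does not remove shared bias: if most annotators lean the same way, the vote simply records that lean.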
Putting it all together
Here's the complete picture of how data flows into an AI model:
Notice something? The model is only one box in this entire flow. Most of the work — and most of the potential for things to go wrong — is in the data boxes.
Back to Amazon's hiring algorithm
The story that opened this module now has a fuller explanation. Amazon's recruiting AI didn't learn to discriminate because someone coded that rule. It learned because the training data — 10 years of Amazon's own hiring records — reflected a world where men were overwhelmingly hired for technical roles. The algorithm found that pattern, treated it as signal, and optimised for it.
That's historical bias in action: the data faithfully recorded real-world inequity, and the model faithfully learned from it. Amazon's engineers weren't careless. They were rigorous. But no amount of rigour fixes a fundamentally flawed dataset. They eventually scrapped the tool entirely.
What you now have — an understanding of data quality, bias types, labels, and annotation — is what gives you the ability to ask the right questions before a tool like that gets deployed.
Key takeaways
- AI inherits the biases, gaps, and errors in its training data. Biased data produces biased models. There's no algorithmic fix for fundamentally flawed data.
- You can distinguish structured from unstructured data — and roughly 80% of real-world data is unstructured (text, images, audio), which is where modern AI provides the most value.
- "Garbage in, garbage out" is the iron law of AI. Missing values, duplicates, inconsistent formatting, and wrong labels all directly degrade model quality.
- Labels are the flashcard answers that teach AI models what's correct. Human labelling errors (3-10% in major datasets) are a significant source of model mistakes.
- You can identify common types of data bias — historical, selection, measurement, and label bias — and understand how they create harmful feedback loops when deployed in real-world systems.
Knowledge Check
1. An AI hiring tool trained on a company's past 10 years of hiring decisions consistently ranks male candidates higher than equally qualified female candidates. What is the most likely cause?
2. Which of the following is an example of unstructured data?
3. A labelled dataset for training a spam detector contains 10,000 emails. 500 of those emails are incorrectly labelled (spam emails marked as 'not spam' and vice versa). What is the most likely impact?
4. Why do data scientists typically spend about 80% of their time on data cleaning and preparation rather than building models?