Data Cleaning & Preparation
Real data is messy — missing values, duplicates, typos, and formats that make no sense. Here's how to wrestle it into shape before you analyze a single thing.
The $440 million bug
In 2012, Knight Capital Group — one of the largest market makers on Wall Street — deployed a software update that contained a bug. The system started executing trades at wrong prices, buying high and selling low at machine speed. In 45 minutes, Knight Capital lost $440 million. The company nearly went bankrupt overnight.
The root cause? A configuration error — old test code that should have been removed was triggered by new data that didn't match the expected format. Dirty data met unvalidated inputs and the result was financial catastrophe.
But you don't need Wall Street to see the damage of bad data. In 2020, Public Health England lost nearly 16,000 COVID-19 test results because they were stored in an Excel file that hit the row limit of the old .xls format (65,536 rows). The data simply vanished. Contact tracing was delayed for thousands of people.
Bad data doesn't just produce wrong charts. It produces wrong decisions, lost money, and in healthcare, lost lives. Data cleaning isn't glamorous, but it's the most important step in any analysis.
The five types of dirty data
Every messy dataset suffers from some combination of these problems. Learn to spot them and you'll know what to fix.
| Type | What it looks like | Why it matters |
|---|---|---|
| Missing values | Blank cells, "N/A", "null", "-" | Averages are wrong, counts are off, models break |
| Duplicates | Same record appears 2+ times | Totals are inflated, counts are wrong |
| Inconsistent formatting | "USA", "US", "United States", "us" | GROUP BY produces 4 groups instead of 1 |
| Outliers | One order for $999,999 when the average is $85 | Skews averages and totals; may be real or an error |
| Wrong data types | Numbers stored as text, dates as strings | Formulas fail, sorts produce wrong order, joins break |
Handling missing values
Missing data is the most common problem you'll face. The question isn't whether your data has missing values — it's what to do about them.
The four strategies
Delete the row. Simple but dangerous. If 20% of rows have missing values, you lose 20% of your data. Only use this when the missing percentage is tiny (under 5%) and the rows are randomly distributed.
Fill with a statistic. Replace missing values with the mean (for normally distributed data), median (for skewed data), or mode (for categories). This preserves your row count but can reduce variance.
Forward-fill or backward-fill. For time series data, carry the last known value forward. Stock prices, sensor readings, and daily metrics often use this approach.
Flag and keep. Add a new column: is_revenue_missing = TRUE/FALSE. This lets you analyze the complete cases separately while keeping all your data. Often the best approach for analytical work.
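All four strategies are one-liners in pandas. A minimal sketch, assuming a hypothetical monthly revenue table (the column names and values are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical dataset: monthly revenue with two gaps
df = pd.DataFrame({
    "month": ["2025-01", "2025-02", "2025-03", "2025-04"],
    "revenue": [100.0, np.nan, 300.0, np.nan],
})

# 1. Delete the row: drops every row with a blank revenue
dropped = df.dropna(subset=["revenue"])

# 2. Fill with a statistic: median is the robust choice for skewed data
filled = df.assign(revenue=df["revenue"].fillna(df["revenue"].median()))

# 3. Forward-fill: carry the last known value forward (time series)
ffilled = df.assign(revenue=df["revenue"].ffill())

# 4. Flag and keep: record which rows were missing before any fill
flagged = df.assign(is_revenue_missing=df["revenue"].isna())
```

Note that the flag column must be created before any fill, or the information about which values were originally missing is gone for good.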
There Are No Dumb Questions
"Can I just replace missing values with zero?"
Almost never. Zero is a real value. If a customer's revenue field is blank, replacing it with zero means "this customer spent $0" — which is a statement about their behavior. A missing value means "we don't know what they spent." Those are very different things. Replacing blanks with zero will deflate your averages and distort your analysis.
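You can see the deflation directly. A tiny sketch with made-up numbers:

```python
import pandas as pd
import numpy as np

# Three customers; one revenue value is unknown
revenue = pd.Series([100.0, 200.0, np.nan])

# pandas skips NaN by default: the mean of the values we actually know
mean_skipping_missing = revenue.mean()     # (100 + 200) / 2 = 150.0

# Replacing the blank with zero silently deflates the average
mean_with_zero = revenue.fillna(0).mean()  # (100 + 200 + 0) / 3 = 100.0
```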
"What if the missing data isn't random?"
That's called "systematic missingness" and it's a red flag. Example: if all customers over age 60 have blank email fields, the data isn't missing randomly — older customers may not have provided emails. Deleting those rows would remove an entire demographic from your analysis. You need to investigate WHY the data is missing before deciding how to handle it.
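One quick way to investigate is to compute the missing-value rate per group. A sketch using a hypothetical customer table (column names and the age cutoff are illustrative):

```python
import pandas as pd

# Hypothetical customer table: emails are blank only for older customers
customers = pd.DataFrame({
    "age":   [25, 34, 45, 62, 68, 71],
    "email": ["a@x.com", "b@x.com", "c@x.com", None, None, None],
})

# Missing-email rate per age band: a strong gradient across bands
# suggests the data is NOT missing at random
customers["age_band"] = pd.cut(customers["age"], bins=[0, 60, 120],
                               labels=["under_60", "60_plus"])
missing_rate = (customers.groupby("age_band", observed=True)["email"]
                         .apply(lambda s: s.isna().mean()))
```

Here every customer over 60 has a blank email (rate 1.0) while no one under 60 does (rate 0.0): a textbook case of systematic missingness.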
Handling duplicates
Duplicates inflate your numbers. If a $500 order appears twice, your revenue report is $500 too high. If a customer appears three times, your user count is inflated by two.
How duplicates happen
| Cause | Example |
|---|---|
| System errors | A form submission that fires twice on a slow connection |
| Data imports | Merging two CSV files that overlap in date range |
| Manual entry | An employee entering the same record in two systems |
| Lack of unique IDs | No primary key, so nothing prevents duplication |
How to find and fix them
In SQL:
-- Find duplicate emails
SELECT email, COUNT(*) AS occurrences
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;

-- Keep only the earliest row per email
DELETE FROM customers
WHERE id NOT IN (
  SELECT MIN(id) FROM customers GROUP BY email
);
In Excel/Sheets:
- Select the data range
- Data > Remove Duplicates (Excel) or Data > Data cleanup > Remove duplicates (Sheets)
- Choose which columns define a "duplicate" — usually a unique identifier like email or order ID
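In pandas, the same find-then-dedupe workflow is two calls. A sketch on a made-up orders table, showing how a duplicate inflates the revenue total:

```python
import pandas as pd

# Hypothetical orders table: order 2 was recorded twice
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount":   [500.0, 250.0, 250.0, 75.0],
})

# Find every row whose order_id appears more than once
dupes = orders[orders.duplicated(subset="order_id", keep=False)]

# Keep only the first occurrence of each order_id
deduped = orders.drop_duplicates(subset="order_id", keep="first")

inflated_total = orders["amount"].sum()   # double-counts order 2
true_total = deduped["amount"].sum()
```

Always inspect `dupes` before dropping: two rows with the same ID but different amounts are a data-quality problem, not a simple duplicate.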
Handling outliers
An outlier is a data point that's far from the others. The question: is it an error, or is it real?
✗ Probably an error
- A customer age of 203
- A negative revenue value in a field that should be positive
- An order date of January 1, 1900
- A temperature reading of 500 degrees F from a room sensor
✓ Probably real
- A single $2M order from an enterprise client
- A viral tweet getting 100x normal engagement
- A website outage day showing zero traffic
- Jeff Bezos's net worth in a wealth dataset
The outlier handling process
- Detect: Use box plots, z-scores, or IQR (interquartile range) method. Any point beyond 1.5x IQR from Q1 or Q3 is a statistical outlier.
- Investigate: Look at the record. Does it make sense? Check the source.
- Decide: Remove if it's an error. Cap (winsorize) if it's extreme but real. Keep and flag if it's important.
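The IQR detection step can be sketched in a few lines of pandas (the values are made up; only the order for $999,999 should be flagged):

```python
import pandas as pd

# Hypothetical order values: five typical orders and one suspicious one
values = pd.Series([80, 82, 85, 87, 90, 999_999])

# IQR method: flag anything beyond 1.5x IQR below Q1 or above Q3
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
```

Detection is the easy, mechanical part; the investigate-and-decide steps still require a human looking at the flagged records.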
There Are No Dumb Questions
"Should I always remove outliers?"
No. Removing real outliers is data manipulation. If your dataset includes Elon Musk's net worth, removing it because it "skews the average" is hiding reality. Instead, report the median (which is robust to outliers) alongside the mean. Only remove outliers when they're clearly errors — a typo, a sensor malfunction, a test record that wasn't deleted.
"What's the difference between winsorizing and trimming?"
Trimming removes the extreme values entirely. Winsorizing replaces them with the next-closest non-extreme value. If the 99th percentile of order value is $5,000 and you have orders at $50,000 and $80,000, winsorizing sets both to $5,000. Trimming deletes them. Winsorizing is usually preferred because it preserves sample size.
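The order-value example from the answer, sketched in pandas. The $5,000 cap is assumed to be the chosen 99th-percentile threshold:

```python
import pandas as pd

# Hypothetical order values with two extreme (but real) orders
orders = pd.Series([40.0, 60.0, 85.0, 50_000.0, 80_000.0])
cap = 5_000.0  # assumed 99th-percentile cap

# Winsorize: replace extremes with the cap, keeping every row
winsorized = orders.clip(upper=cap)

# Trim: drop the extremes entirely, shrinking the sample
trimmed = orders[orders <= cap]
```

After winsorizing you still have five orders (two valued at the cap); after trimming you have three, and any per-customer counts built on this series change too.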
Standardizing formats
The same information written three different ways creates three different groups in your analysis. This is the #1 cause of wrong GROUP BY results.
| Field | Messy | Clean |
|---|---|---|
| Country | "USA", "US", "United States", "us", "U.S.A." | "United States" |
| Date | "3/15/25", "15-Mar-2025", "2025-03-15" | "2025-03-15" (ISO 8601) |
| Phone | "(555) 123-4567", "555.123.4567", "5551234567" | "555-123-4567" |
| Currency | "$1,234", "1234.00", "$1234" | 1234.00 (number) |
| Boolean | "Yes", "Y", "TRUE", "1", "yes" | TRUE |
The rule: pick one standard and enforce it. For dates, use ISO 8601 (YYYY-MM-DD). For countries, use a reference table. For booleans, use TRUE/FALSE. Document the standard so everyone on the team follows it.
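A sketch of both fixes in pandas, using a made-up table. The reference mapping covers only the variants shown; a real one would be maintained as its own lookup table:

```python
import pandas as pd

# Hypothetical raw data: one country, five spellings; dates as strings
df = pd.DataFrame({
    "country": ["USA", "US", "United States", "us", "U.S.A."],
    "signup":  ["03/15/2025", "03/16/2025", "03/17/2025",
                "03/18/2025", "03/19/2025"],
})

# Reference table: every known variant maps to one canonical name.
# Normalizing case and whitespace first keeps the map small.
country_map = {"usa": "United States", "us": "United States",
               "united states": "United States", "u.s.a.": "United States"}
df["country"] = df["country"].str.strip().str.lower().map(country_map)

# Parse the dates, then emit ISO 8601 strings
df["signup"] = (pd.to_datetime(df["signup"], format="%m/%d/%Y")
                  .dt.strftime("%Y-%m-%d"))
```

With `.map()`, any variant missing from the reference table becomes NaN instead of slipping through unstandardized, which makes unmapped values easy to spot and add to the table.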
The data cleaning checklist
Use this checklist every time you receive a new dataset. It takes 15 minutes and saves hours.
1. Shape check. How many rows and columns? Does it match what you expected?
2. Column types. Are numbers stored as numbers? Dates as dates? Look for text columns hiding in numeric fields.
3. Missing values. How many blanks per column? Is the missingness random or systematic?
4. Duplicates. Check for exact duplicates on key identifiers (email, ID, order number).
5. Value ranges. Are all values within expected bounds? No negative ages, no future dates, no percentages above 100.
6. Categorical consistency. Do categories have standardized names? No "USA" vs "US" vs "United States."
7. Outlier scan. Sort each numeric column ascending and descending. Do the extremes make sense?
8. Referential integrity. Do IDs in one table match IDs in related tables? Orphan records break JOINs.
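The first four checks are mechanical enough to script. A minimal sketch of a first-pass profiler (the function name and the sample table are illustrative, not a standard API):

```python
import pandas as pd

def quick_profile(df: pd.DataFrame, key: str) -> dict:
    """First-pass checklist items: shape, column types,
    missing values per column, and duplicates on the key column."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_keys": int(df.duplicated(subset=key).sum()),
    }

# Tiny sample with one missing amount and one duplicated order_id
df = pd.DataFrame({
    "order_id": [1, 2, 2],
    "amount":   [10.0, None, 20.0],
})
profile = quick_profile(df, key="order_id")
```

The remaining checks (value ranges, category consistency, outliers, referential integrity) need domain knowledge about what "expected" looks like, so they stay manual.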
Key takeaways
- Bad data costs real money. Knight Capital lost $440 million in 45 minutes; industry estimates put the annual cost of bad data in the trillions. Every wrong decision built on dirty data is a preventable loss.
- Five types of dirty data: missing values, duplicates, inconsistent formatting, outliers, and wrong data types. Learn to spot all five.
- Missing values need strategy, not deletion. Choose between removing, filling, forward-filling, or flagging based on how much data is missing and whether the missingness is random.
- Not all outliers are errors. Investigate before removing. Use median instead of mean when outliers are real but extreme.
- Standardize formats ruthlessly. One country name format. One date format (ISO 8601). One boolean format. Document the standard.
- Use the 8-step checklist on every new dataset: shape, types, missing values, duplicates, ranges, categories, outliers, referential integrity.
Knowledge Check
1. A customer database has blank revenue fields for 35% of records. All missing values belong to customers who signed up through a free trial. What is the best approach?
2. Your GROUP BY query on a 'country' column returns 7 rows for the United States: 'US', 'USA', 'United States', 'us', 'U.S.', 'united states', and 'US '. What happened and how do you fix it?
3. An e-commerce dataset shows an order for $847,000 when the average order value is $85. Should you remove it?
4. What is the most important reason to document every data cleaning step you perform?