Data Cleaning & Preparation
Real data is messy — missing values, duplicates, typos, and formats that make no sense. Here's how to wrestle it into shape before you analyze a single thing.
The $440 million bug
In 2012, Knight Capital Group — one of the largest market makers on Wall Street — deployed a software update that contained a bug. The system started executing trades at wrong prices, buying high and selling low at machine speed. In 45 minutes, Knight Capital lost $440 million. The company nearly went bankrupt overnight.
The root cause? A configuration error — old test code that should have been removed was triggered by new data that didn't match the expected format. Dirty data met unvalidated inputs and the result was financial catastrophe.
But you don't need Wall Street to see the damage of bad data. In 2020, Public Health England lost nearly 16,000 COVID-19 test results because they were stored in an Excel file that hit the row limit of the old .xls format (65,536 rows). The data simply vanished. Contact tracing was delayed for thousands of people.
Bad data doesn't just produce wrong charts. It produces wrong decisions, lost money, and in healthcare, lost lives. Data cleaning isn't glamorous, but it's the most important step in any analysis.
The five types of dirty data
Every messy dataset suffers from some combination of these problems. Learn to spot them and you'll know what to fix.
| Type | What it looks like | Why it matters |
|---|---|---|
| Missing values | Blank cells, "N/A", "null", "-" | Averages are wrong, counts are off, models break |
| Duplicates | Same record appears 2+ times | Totals are inflated, counts are wrong |
| Inconsistent formatting | "USA", "US", "United States", "us" | GROUP BY produces 4 groups instead of 1 |
| Outliers | One order for $999,999 when the average is $85 | Skews averages and totals; may be real or an error |
| Wrong data types | Numbers stored as text, dates as strings | Formulas fail, sorts produce wrong order, joins break |
Handling missing values
Missing data is the most common problem you'll face. The question isn't whether your data has missing values — it's what to do about them.
The four strategies
Delete the row. Simple but dangerous. If 20% of rows have missing values, you lose 20% of your data. Only use this when the missing percentage is tiny (under 5%) and the rows are randomly distributed.
Fill with a statistic. Replace missing values with the mean (for normally distributed data), median (for skewed data), or mode (for categories). This preserves your row count but can reduce variance.
Forward-fill or backward-fill. For time series data, carry the last known value forward. Stock prices, sensor readings, and daily metrics often use this approach.
Flag and keep. Add a new column: is_revenue_missing = TRUE/FALSE. This lets you analyze the complete cases separately while keeping all your data. Often the best approach for analytical work.
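All four strategies are one-liners in pandas. A minimal sketch, assuming a hypothetical monthly revenue table (the column names and values are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical dataset: monthly revenue with two gaps
df = pd.DataFrame({
    "month": ["2025-01", "2025-02", "2025-03", "2025-04"],
    "revenue": [100.0, np.nan, 300.0, np.nan],
})

# 1. Delete the row: drops every row with a blank revenue
dropped = df.dropna(subset=["revenue"])

# 2. Fill with a statistic: median is the robust choice for skewed data
filled = df.assign(revenue=df["revenue"].fillna(df["revenue"].median()))

# 3. Forward-fill: carry the last known value forward (time series)
ffilled = df.assign(revenue=df["revenue"].ffill())

# 4. Flag and keep: record which rows were missing before any fill
flagged = df.assign(is_revenue_missing=df["revenue"].isna())
```

Note that the flag column must be created before any fill, or the information about which values were originally missing is gone for good.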
There Are No Dumb Questions
"Can I just replace missing values with zero?"
Almost never. Zero is a real value. If a customer's revenue field is blank, replacing it with zero means "this customer spent $0" — which is a statement about their behavior. A missing value means "we don't know what they spent." Those are very different things. Replacing blanks with zero will deflate your averages and distort your analysis.
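You can see the deflation directly. A tiny sketch with made-up numbers:

```python
import pandas as pd
import numpy as np

# Three customers; one revenue value is unknown
revenue = pd.Series([100.0, 200.0, np.nan])

# pandas skips NaN by default: the mean of the values we actually know
mean_skipping_missing = revenue.mean()     # (100 + 200) / 2 = 150.0

# Replacing the blank with zero silently deflates the average
mean_with_zero = revenue.fillna(0).mean()  # (100 + 200 + 0) / 3 = 100.0
```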
"What if the missing data isn't random?"
That's called "systematic missingness" and it's a red flag. Example: if all customers over age 60 have blank email fields, the data isn't missing randomly — older customers may not have provided emails. Deleting those rows would remove an entire demographic from your analysis. You need to investigate WHY the data is missing before deciding how to handle it.
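One quick way to investigate is to compute the missing-value rate per group. A sketch using a hypothetical customer table (column names and the age cutoff are illustrative):

```python
import pandas as pd

# Hypothetical customer table: emails are blank only for older customers
customers = pd.DataFrame({
    "age":   [25, 34, 45, 62, 68, 71],
    "email": ["a@x.com", "b@x.com", "c@x.com", None, None, None],
})

# Missing-email rate per age band: a strong gradient across bands
# suggests the data is NOT missing at random
customers["age_band"] = pd.cut(customers["age"], bins=[0, 60, 120],
                               labels=["under_60", "60_plus"])
missing_rate = (customers.groupby("age_band", observed=True)["email"]
                         .apply(lambda s: s.isna().mean()))
```

Here every customer over 60 has a blank email (rate 1.0) while no one under 60 does (rate 0.0): a textbook case of systematic missingness.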
Handling duplicates
Duplicates inflate your numbers. If a $500 order appears twice, your revenue report is $500 too high. If a customer appears three times, your user count is inflated by two.
How duplicates happen
| Cause | Example |
|---|---|
| System errors | A form submission that fires twice on a slow connection |
| Data imports | Merging two CSV files that overlap in date range |
| Manual entry | An employee entering the same record in two systems |
| Lack of unique IDs | No primary key, so nothing prevents duplication |
How to find and fix them
In SQL:
-- Find duplicate emails
SELECT email, COUNT(*) AS occurrences
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;

-- Keep only the earliest row per email
DELETE FROM customers
WHERE id NOT IN (
  SELECT MIN(id) FROM customers GROUP BY email
);
In Excel/Sheets:
- Select the data range
- Data > Remove Duplicates (Excel) or Data > Data cleanup > Remove duplicates (Sheets)
- Choose which columns define a "duplicate" — usually a unique identifier like email or order ID
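In pandas, the same find-then-dedupe workflow is two calls. A sketch on a made-up orders table, showing how a duplicate inflates the revenue total:

```python
import pandas as pd

# Hypothetical orders table: order 2 was recorded twice
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount":   [500.0, 250.0, 250.0, 75.0],
})

# Find every row whose order_id appears more than once
dupes = orders[orders.duplicated(subset="order_id", keep=False)]

# Keep only the first occurrence of each order_id
deduped = orders.drop_duplicates(subset="order_id", keep="first")

inflated_total = orders["amount"].sum()   # double-counts order 2
true_total = deduped["amount"].sum()
```

Always inspect `dupes` before dropping: two rows with the same ID but different amounts are a data-quality problem, not a simple duplicate.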
Handling outliers
An outlier is a data point that's far from the others. The question: is it an error, or is it real?
✗ Probably an error
- A customer age of 203
- A negative revenue value in a field that should be positive
- An order date of January 1, 1900
- A temperature reading of 500 degrees F from a room sensor
✓ Probably real
- A single $2M order from an enterprise client
- A viral tweet getting 100x normal engagement
- A website outage day showing zero traffic
- Jeff Bezos's net worth in a wealth dataset
The outlier handling process
- Detect: Use box plots, z-scores, or IQR (interquartile range) method. Any point beyond 1.5x IQR from Q1 or Q3 is a statistical outlier.
- Investigate: Look at the record. Does it make sense? Check the source.
- Decide: Remove if it's an error. Cap (winsorize) if it's extreme but real. Keep and flag if it's important.
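The IQR detection step can be sketched in a few lines of pandas (the values are made up; only the order for $999,999 should be flagged):

```python
import pandas as pd

# Hypothetical order values: five typical orders and one suspicious one
values = pd.Series([80, 82, 85, 87, 90, 999_999])

# IQR method: flag anything beyond 1.5x IQR below Q1 or above Q3
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
```

Detection is the easy, mechanical part; the investigate-and-decide steps still require a human looking at the flagged records.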
There Are No Dumb Questions
"Should I always remove outliers?"
No. Removing real outliers is data manipulation. If your dataset includes Elon Musk's net worth, removing it because it "skews the average" is hiding reality. Instead, report the median (which is robust to outliers) alongside the mean. Only remove outliers when they're clearly errors — a typo, a sensor malfunction, a test record that wasn't deleted.
"What's the difference between winsorizing and trimming?"
Trimming removes the extreme values entirely. Winsorizing replaces them with the next-closest non-extreme value. If the 99th percentile of order value is $5,000 and you have orders at $50,000 and $80,000, winsorizing sets both to $5,000. Trimming deletes them. Winsorizing is usually preferred because it preserves sample size.
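The order-value example from the answer, sketched in pandas. The $5,000 cap is assumed to be the chosen 99th-percentile threshold:

```python
import pandas as pd

# Hypothetical order values with two extreme (but real) orders
orders = pd.Series([40.0, 60.0, 85.0, 50_000.0, 80_000.0])
cap = 5_000.0  # assumed 99th-percentile cap

# Winsorize: replace extremes with the cap, keeping every row
winsorized = orders.clip(upper=cap)

# Trim: drop the extremes entirely, shrinking the sample
trimmed = orders[orders <= cap]
```

After winsorizing you still have five orders (two valued at the cap); after trimming you have three, and any per-customer counts built on this series change too.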
Standardizing formats
The same information written three different ways creates three different groups in your analysis. This is the #1 cause of wrong GROUP BY results.
| Field | Messy | Clean |
|---|---|---|
| Country | "USA", "US", "United States", "us", "U.S.A." | "United States" |
| Date | "3/15/25", "15-Mar-2025", "2025-03-15" | "2025-03-15" (ISO 8601) |
| Phone | "(555) 123-4567", "555.123.4567", "5551234567" | "555-123-4567" |
| Currency | "$1,234", "1234.00", "$1234" | 1234.00 (number) |
| Boolean | "Yes", "Y", "TRUE", "1", "yes" | TRUE |
The rule: pick one standard and enforce it. For dates, use ISO 8601 (YYYY-MM-DD). For countries, use a reference table. For booleans, use TRUE/FALSE. Document the standard so everyone on the team follows it.
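A sketch of both fixes in pandas, using a made-up table. The reference mapping covers only the variants shown; a real one would be maintained as its own lookup table:

```python
import pandas as pd

# Hypothetical raw data: one country, five spellings; dates as strings
df = pd.DataFrame({
    "country": ["USA", "US", "United States", "us", "U.S.A."],
    "signup":  ["03/15/2025", "03/16/2025", "03/17/2025",
                "03/18/2025", "03/19/2025"],
})

# Reference table: every known variant maps to one canonical name.
# Normalizing case and whitespace first keeps the map small.
country_map = {"usa": "United States", "us": "United States",
               "united states": "United States", "u.s.a.": "United States"}
df["country"] = df["country"].str.strip().str.lower().map(country_map)

# Parse the dates, then emit ISO 8601 strings
df["signup"] = (pd.to_datetime(df["signup"], format="%m/%d/%Y")
                  .dt.strftime("%Y-%m-%d"))
```

With `.map()`, any variant missing from the reference table becomes NaN instead of slipping through unstandardized, which makes unmapped values easy to spot and add to the table.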
The data cleaning checklist
Use this checklist every time you receive a new dataset. It takes 15 minutes and saves hours.
1. Shape check. How many rows and columns? Does it match what you expected?
2. Column types. Are numbers stored as numbers? Dates as dates? Look for text columns hiding in numeric fields.
3. Missing values. How many blanks per column? Is the missingness random or systematic?
4. Duplicates. Check for exact duplicates on key identifiers (email, ID, order number).
5. Value ranges. Are all values within expected bounds? No negative ages, no future dates, no percentages above 100.
6. Categorical consistency. Do categories have standardized names? No "USA" vs "US" vs "United States."
7. Outlier scan. Sort each numeric column ascending and descending. Do the extremes make sense?
8. Referential integrity. Do IDs in one table match IDs in related tables? Orphan records break JOINs.
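The first four checks are mechanical enough to script. A minimal sketch of a first-pass profiler (the function name and the sample table are illustrative, not a standard API):

```python
import pandas as pd

def quick_profile(df: pd.DataFrame, key: str) -> dict:
    """First-pass checklist items: shape, column types,
    missing values per column, and duplicates on the key column."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_keys": int(df.duplicated(subset=key).sum()),
    }

# Tiny sample with one missing amount and one duplicated order_id
df = pd.DataFrame({
    "order_id": [1, 2, 2],
    "amount":   [10.0, None, 20.0],
})
profile = quick_profile(df, key="order_id")
```

The remaining checks (value ranges, category consistency, outliers, referential integrity) need domain knowledge about what "expected" looks like, so they stay manual.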
Key takeaways
- Bad data costs real money. Knight Capital lost $440 million in 45 minutes; industry estimates put the annual cost of bad data in the trillions. Every wrong decision built on dirty data is a preventable loss.
- Five types of dirty data: missing values, duplicates, inconsistent formatting, outliers, and wrong data types. Learn to spot all five.
- Missing values need strategy, not deletion. Choose between removing, filling, forward-filling, or flagging based on how much data is missing and whether the missingness is random.
- Not all outliers are errors. Investigate before removing. Use median instead of mean when outliers are real but extreme.
- Standardize formats ruthlessly. One country name format. One date format (ISO 8601). One boolean format. Document the standard.
- Use the 8-step checklist on every new dataset: shape, types, missing values, duplicates, ranges, categories, outliers, referential integrity.
Knowledge Check
1. A customer database has blank revenue fields for 35% of records. All missing values belong to customers who signed up through a free trial. What is the best approach?
2. Your GROUP BY query on a 'country' column returns 7 rows for the United States: 'US', 'USA', 'United States', 'us', 'U.S.', 'united states', and 'US '. What happened and how do you fix it?
3. An e-commerce dataset shows an order for $847,000 when the average order value is $85. Should you remove it?
4. What is the most important reason to document every data cleaning step you perform?