Data Analysis with Spreadsheets
Sorting, filtering, conditional formatting, and data validation — the tools that turn a wall of numbers into answers you can act on.
In 2020, England lost nearly 16,000 COVID test results because of an Excel row limit
Public Health England was tracking COVID-19 test results in an Excel spreadsheet. They were using the older .xls format, which maxes out at 65,536 rows. When the dataset grew past that limit, new rows simply vanished. Nearly 16,000 positive test results were silently lost — meaning thousands of infected people were never contacted for contact tracing during a critical period of the pandemic.
The problem was not that someone wrote a bad formula. The problem was that no one had set up the data for proper analysis — no structured table, no validation, no overflow checks. Raw data was dumped into a spreadsheet with no guardrails.
This module teaches you how to analyze data in spreadsheets without losing it, breaking it, or misreading it.
Sorting: the simplest analysis you can do
Sorting rearranges your data by one or more columns. It sounds trivial, but sorting is how you answer questions like "Who are our top 10 customers?" or "Which product sold the least last quarter?"
Single-column sort: Click any cell in the column, then Data → Sort A-Z (ascending) or Z-A (descending).
Multi-level sort: Data → Sort → Add Level. Example: Sort by Region first, then by Sales (largest to smallest) within each region.
Custom sort orders: Need to sort months chronologically instead of alphabetically (which puts April before January)? Custom sort lets you define your own order: Jan, Feb, Mar, Apr... instead of Apr, Aug, Dec, Feb.
Filtering: show only what matters
Filters hide rows that do not match your criteria — without deleting them. The data is still there; you just cannot see the irrelevant parts.
Step 1: Click any cell in your data range
Step 2: Go to Data → Filter (or Ctrl+Shift+L)
Step 3: Click the dropdown arrow on any column header
Step 4: Uncheck values you want to hide, or use conditions like "Greater than $500"
Step 5: Multiple columns can be filtered simultaneously — filters stack
Filter types:
| Filter | What it does | Example |
|---|---|---|
| Value filter | Show/hide specific values | Show only "California" and "Texas" |
| Number filter | Conditions on numbers | Greater than $1,000 |
| Date filter | Date-specific conditions | Last 30 days, This quarter |
| Text filter | Pattern matching | Contains "Corp", Begins with "A" |
| Color filter | Filter by cell/font color | Show only red-highlighted cells |
There Are No Dumb Questions
"If I filter data and then use SUM, does it sum only the visible rows?"
No —
SUMadds everything, including hidden rows. Use=SUBTOTAL(109, D2:D100)instead. The109tells Excel to sum only visible cells. This is one of the most common filtering mistakes. Alternatively, use=AGGREGATE(9, 5, D2:D100)for even more control."Can I filter and then copy-paste just the visible rows?"
Yes, but you need to select the visible cells first. After filtering, select your range, then press
Alt+;(Windows) orCmd+Shift+Z(Mac) to select only visible cells. Now copy-paste works as expected.
Filter Challenge
25 XPConditional formatting — make data tell you where to look
Conditional formatting changes a cell's appearance (color, font, icon) based on its value. Instead of scanning 1,000 rows of numbers, your eyes are drawn immediately to what matters.
| Format type | What it looks like | Best for |
|---|---|---|
| Color scales | Gradient from red to green | Spotting high/low values across a range |
| Data bars | Mini bar charts inside cells | Comparing magnitudes at a glance |
| Icon sets | Arrows, traffic lights, flags | Status indicators (up/down/stable) |
| Highlight rules | Cells turn red if < 0 | Flagging exceptions, errors, overdue items |
| Top/Bottom rules | Highlight top 10 values | Finding best/worst performers |
Practical examples:
- Highlight all cells in the "Profit" column that are negative → red fill
- Color scale on monthly revenue → see seasonal patterns instantly
- Icon set on delivery status → green check for delivered, yellow clock for in transit, red X for delayed
✗ Without AI
- ✗Wall of identical-looking numbers
- ✗Must scan every cell manually
- ✗Errors hide in plain sight
- ✗Patterns invisible until you chart the data
✓ With AI
- ✓Negative values pop in red immediately
- ✓Eyes drawn to exceptions and outliers
- ✓Errors flagged with warning icons
- ✓Color gradients reveal trends without charts
Data validation — preventing bad data at the source
Data validation restricts what can be entered into a cell. Instead of cleaning up bad data after the fact, you prevent it from being entered at all.
| Validation type | What it does | Example |
|---|---|---|
| List | Dropdown of allowed values | Department: Sales, Marketing, Engineering, HR |
| Whole number | Only integers in a range | Quantity: between 1 and 1000 |
| Decimal | Numbers with decimal constraints | Discount rate: between 0 and 0.5 |
| Date | Only dates in a range | Start date: after January 1, 2024 |
| Text length | Limits character count | Product code: exactly 8 characters |
| Custom | Any formula-based rule | Must be unique (no duplicates) |
How to set it up: Select cells → Data → Data Validation → Choose criteria → Add an input message (tells users what to enter) → Add an error alert (tells users what went wrong).
The dropdown trick: Create a list validation for "Status" with values: Not Started, In Progress, Complete, On Hold — now users select from a dropdown instead of typing (and mistyping)
The date guard: Set validation to reject dates before today — prevents accidentally entering past dates for future deadlines
The duplicate blocker: Use custom validation with =COUNTIF(A:A, A2)=1 — rejects any value already in the column
There Are No Dumb Questions
"Can people bypass data validation by pasting data?"
Yes — and this is an important limitation. If someone copies data from another source and pastes it into a validated cell, the validation can be bypassed. This is why data validation is a guide, not a lock. For critical data, combine validation with periodic checks (like COUNTIF to find duplicates).
"What happens to existing data when I add validation to a column that already has data?"
Nothing happens to existing data. Validation only applies to new entries. To find existing values that violate your rules, go to Data → Data Validation → Circle Invalid Data. Excel will circle any cells that do not meet the criteria.
Removing duplicates
Duplicate data inflates your counts, skews your averages, and makes reports unreliable. Excel makes removal simple:
- Select your data range
- Data → Remove Duplicates
- Choose which columns to check (check all columns for exact row duplicates, or specific columns for partial matches)
- Excel tells you how many duplicates were found and removed
Clean This Dataset
50 XPStructured tables — the feature that changes everything
A structured table (Insert → Table or Ctrl+T) converts a plain range into an intelligent data object. This is the single most underused feature in Excel.
| Plain range | Structured table |
|---|---|
Formulas reference cells like =SUM(B2:B100) | Formulas reference columns like =SUM(Table1[Revenue]) |
| New rows are not included in formulas | New rows are automatically included |
| No automatic formatting | Alternating row colors, filter buttons |
| Sorting can disconnect columns | Columns always stay connected |
| Charts do not expand with new data | Charts automatically include new rows |
Why tables matter: When you add a new row of data to a structured table, every formula, chart, and pivot table that references the table automatically includes the new row. With plain ranges, you have to manually update every reference. This alone prevents an enormous category of spreadsheet errors.
Key takeaways
- Sorting answers "what are the top/bottom values?" — always sort the full data range, never a single column in isolation
- Filtering hides irrelevant rows without deleting them — use
SUBTOTAL(notSUM) to calculate on visible rows only - Conditional formatting directs your eyes to exceptions and patterns — use it sparingly on what matters most
- Data validation prevents bad data at entry — dropdown lists for categories, number ranges for quantities, custom formulas for uniqueness
- Remove Duplicates cleans data but is destructive — always identify duplicates visually first, then remove
- Structured tables (
Ctrl+T) automatically expand formulas, charts, and pivot tables when new data is added — use them for every dataset
Knowledge Check
1.You filter a column to show only rows where Region = 'West', then use =SUM(D2:D500) to total the Amount column. What happens?
2.What is the primary advantage of using a structured table (Ctrl+T) over a plain data range?
3.In 2020, Public Health England lost nearly 16,000 COVID test results. What was the root cause?
4.You want to prevent users from entering anything other than 'High', 'Medium', or 'Low' in a Priority column. Which feature should you use?