Data Analysis with Spreadsheets
Sorting, filtering, conditional formatting, and data validation — the tools that turn a wall of numbers into answers you can act on.
In 2020, England lost nearly 16,000 COVID test results because of an Excel row limit
Public Health England was tracking COVID-19 test results in an Excel spreadsheet. They were using the older .xls format, which maxes out at 65,536 rows. When the dataset grew past that limit, new rows simply vanished. Nearly 16,000 positive test results were silently lost — meaning thousands of infected people were never contacted for contact tracing during a critical period of the pandemic.
The problem was not that someone wrote a bad formula. The problem was that no one had set up the data for proper analysis — no structured table, no validation, no overflow checks. Raw data was dumped into a spreadsheet with no guardrails.
This module teaches you how to analyze data in spreadsheets without losing it, breaking it, or misreading it.
Sorting: the simplest analysis you can do
Sorting rearranges your data by one or more columns. It sounds trivial, but sorting is how you answer questions like "Who are our top 10 customers?" or "Which product sold the least last quarter?"
Single-column sort: Click any cell in the column, then Data → Sort A-Z (ascending) or Z-A (descending).
Multi-level sort: Data → Sort → Add Level. Example: Sort by Region first, then by Sales (largest to smallest) within each region.
Custom sort orders: Need to sort months chronologically instead of alphabetically (which puts April before January)? Custom sort lets you define your own order: Jan, Feb, Mar, Apr... instead of Apr, Aug, Dec, Feb.
Filtering: show only what matters
Filters hide rows that do not match your criteria — without deleting them. The data is still there; you just cannot see the irrelevant parts.
Step 1: Click any cell in your data range
Step 2: Go to Data → Filter (or Ctrl+Shift+L)
Step 3: Click the dropdown arrow on any column header
Step 4: Uncheck values you want to hide, or use conditions like "Greater than $500"
Step 5: Multiple columns can be filtered simultaneously — filters stack
Filter types:
| Filter | What it does | Example |
|---|---|---|
| Value filter | Show/hide specific values | Show only "California" and "Texas" |
| Number filter | Conditions on numbers | Greater than $1,000 |
| Date filter | Date-specific conditions | Last 30 days, This quarter |
| Text filter | Pattern matching | Contains "Corp", Begins with "A" |
| Color filter | Filter by cell/font color | Show only red-highlighted cells |
There Are No Dumb Questions
"If I filter data and then use SUM, does it sum only the visible rows?"
No —
SUMadds everything, including hidden rows. Use=SUBTOTAL(109, D2:D100)instead. The109tells Excel to sum only visible cells. This is one of the most common filtering mistakes. Alternatively, use=AGGREGATE(9, 5, D2:D100)for even more control."Can I filter and then copy-paste just the visible rows?"
Yes, but you need to select the visible cells first. After filtering, select your range, then press
Alt+;(Windows) orCmd+Shift+Z(Mac) to select only visible cells. Now copy-paste works as expected.
Filter Challenge
25 XPYou have a sales dataset with columns: Date, Sales Rep, Region, Product, Amount. Your manager asks these questions. What filter(s) would you apply? 1. "Show me all sales from Q1 2024 in the West region" → ___ 2. "Who closed deals over $10,000?" → ___ 3. "Show only laptop and tablet sales, sorted by amount (highest first)" → ___ _Hint: For #1, use a date filter on the Date column (between Jan 1 and Mar 31, 2024) AND a value filter on Region for "West". Stack multiple filters._
Sign in to earn XPConditional formatting — make data tell you where to look
Conditional formatting changes a cell's appearance (color, font, icon) based on its value. Instead of scanning 1,000 rows of numbers, your eyes are drawn immediately to what matters.
| Format type | What it looks like | Best for |
|---|---|---|
| Color scales | Gradient from red to green | Spotting high/low values across a range |
| Data bars | Mini bar charts inside cells | Comparing magnitudes at a glance |
| Icon sets | Arrows, traffic lights, flags | Status indicators (up/down/stable) |
| Highlight rules | Cells turn red if < 0 | Flagging exceptions, errors, overdue items |
| Top/Bottom rules | Highlight top 10 values | Finding best/worst performers |
Practical examples:
- Highlight all cells in the "Profit" column that are negative → red fill
- Color scale on monthly revenue → see seasonal patterns instantly
- Icon set on delivery status → green check for delivered, yellow clock for in transit, red X for delayed
✗ Without Conditional Formatting
- ✗Wall of identical-looking numbers
- ✗Must scan every cell manually
- ✗Errors hide in plain sight
- ✗Patterns invisible until you chart the data
✓ With Conditional Formatting
- ✓Negative values pop in red immediately
- ✓Eyes drawn to exceptions and outliers
- ✓Errors flagged with warning icons
- ✓Color gradients reveal trends without charts
Data validation — preventing bad data at the source
Data validation restricts what can be entered into a cell. Instead of cleaning up bad data after the fact, you prevent it from being entered at all.
| Validation type | What it does | Example |
|---|---|---|
| List | Dropdown of allowed values | Department: Sales, Marketing, Engineering, HR |
| Whole number | Only integers in a range | Quantity: between 1 and 1000 |
| Decimal | Numbers with decimal constraints | Discount rate: between 0 and 0.5 |
| Date | Only dates in a range | Start date: after January 1, 2024 |
| Text length | Limits character count | Product code: exactly 8 characters |
| Custom | Any formula-based rule | Must be unique (no duplicates) |
How to set it up: Select cells → Data → Data Validation → Choose criteria → Add an input message (tells users what to enter) → Add an error alert (tells users what went wrong).
The dropdown trick: Create a list validation for "Status" with values: Not Started, In Progress, Complete, On Hold — now users select from a dropdown instead of typing (and mistyping)
The date guard: Set validation to reject dates before today — prevents accidentally entering past dates for future deadlines
The duplicate blocker: Use custom validation with =COUNTIF(A:A, A2)=1 — rejects any value already in the column
There Are No Dumb Questions
"Can people bypass data validation by pasting data?"
Yes — and this is an important limitation. If someone copies data from another source and pastes it into a validated cell, the validation can be bypassed. This is why data validation is a guide, not a lock. For critical data, combine validation with periodic checks (like COUNTIF to find duplicates).
"What happens to existing data when I add validation to a column that already has data?"
Nothing happens to existing data. Validation only applies to new entries. To find existing values that violate your rules, go to Data → Data Validation → Circle Invalid Data. Excel will circle any cells that do not meet the criteria.
Removing duplicates
Duplicate data inflates your counts, skews your averages, and makes reports unreliable. Excel makes removal simple:
- Select your data range
- Data → Remove Duplicates
- Choose which columns to check (check all columns for exact row duplicates, or specific columns for partial matches)
- Excel tells you how many duplicates were found and removed
Clean This Dataset
50 XPYou receive a spreadsheet with 5,000 customer records. Before any analysis, you need to clean it up. Describe the steps you would take for each issue: 1. The "State" column has entries like "CA", "Ca", "california", "CALIFORNIA", and "Calif." — how do you standardize this? → ___ 2. You suspect there are duplicate customer records (same email in multiple rows) — how do you find and handle them? → ___ 3. The "Order Amount" column has some cells with text like "$1,234" (stored as text) instead of the number 1234 — how do you fix this? → ___ 4. Future entries in the "State" column should only allow valid two-letter state abbreviations — how do you enforce this? → ___ _Hint: For #1, consider a helper column with `=UPPER()` or `=PROPER()`. For #2, use conditional formatting to highlight duplicates first, then Review. For #3, try Text to Columns or Find & Replace to remove $ and commas. For #4, data validation with a list of state codes._
Sign in to earn XPStructured tables — the feature that changes everything
A structured table (Insert → Table or Ctrl+T) converts a plain range into an intelligent data object. This is the single most underused feature in Excel.
| Plain range | Structured table |
|---|---|
Formulas reference cells like =SUM(B2:B100) | Formulas reference columns like =SUM(Table1[Revenue]) |
| New rows are not included in formulas | New rows are automatically included |
| No automatic formatting | Alternating row colors, filter buttons |
| Sorting can disconnect columns | Columns always stay connected |
| Charts do not expand with new data | Charts automatically include new rows |
Why tables matter: When you add a new row of data to a structured table, every formula, chart, and pivot table that references the table automatically includes the new row. With plain ranges, you have to manually update every reference. This alone prevents an enormous category of spreadsheet errors.
<classifychallenge xp="25" title="Which Analysis Tool?" items={["Find your top 10 customers by revenue","Show only Q4 transactions from the West region","Highlight all overdue invoices in red automatically","Restrict the Status column to only allow Open, Closed, or Pending","Find and remove rows where the same email appears twice","Make formulas and charts auto-include new rows of data"]} options={["Sorting","Filtering","Conditional formatting","Data validation","Remove Duplicates","Structured tables"]} hint="Each task maps to exactly one tool. Sorting rearranges data to find top/bottom. Filtering hides irrelevant rows. Conditional formatting changes appearance based on values. Data validation restricts input. Remove Duplicates deletes repeated rows. Structured tables auto-expand.">
Back to the COVID test results
Remember the 16,000 lost COVID test results? Public Health England stored critical pandemic data in a plain .xls range with no structured tables, no validation, and no overflow checks. When the data outgrew the 65,536-row limit, rows silently vanished. Nobody was alerted because there was no validation rule checking row counts. Nobody caught the problem because there was no conditional formatting highlighting gaps.
Every tool in this module — sorting, filtering, conditional formatting, data validation, structured tables — exists to prevent exactly that kind of silent data loss. The Formulas & Functions you learned in the previous module calculate answers; the analysis tools in this module ensure those answers are based on complete, clean, trustworthy data.
Next up: Sorting and filtering show you slices of data. But what if you need to summarize thousands of rows into an instant, interactive report — total revenue by region, broken down by product, with quarterly comparisons? In the next module, you will learn pivot tables, the single most powerful feature in Excel.
Key takeaways
- Sorting answers "what are the top/bottom values?" — always sort the full data range, never a single column in isolation
- Filtering hides irrelevant rows without deleting them — use
SUBTOTAL(notSUM) to calculate on visible rows only - Conditional formatting directs your eyes to exceptions and patterns — use it sparingly on what matters most
- Data validation prevents bad data at entry — dropdown lists for categories, number ranges for quantities, custom formulas for uniqueness
- Remove Duplicates cleans data but is destructive — always identify duplicates visually first, then remove
- Structured tables (
Ctrl+T) automatically expand formulas, charts, and pivot tables when new data is added — use them for every dataset
Knowledge Check
1.You filter a column to show only rows where Region = 'West', then use =SUM(D2:D500) to total the Amount column. What happens?
2.What is the primary advantage of using a structured table (Ctrl+T) over a plain data range?
3.In 2020, Public Health England lost nearly 16,000 COVID test results. What was the root cause?
4.You want to prevent users from entering anything other than 'High', 'Medium', or 'Low' in a Priority column. Which feature should you use?