Data Analysis with Spreadsheets — Excel & Spreadsheets

In 2020, England lost nearly 16,000 COVID test results because of an Excel row limit

Public Health England was tracking COVID-19 test results in an Excel spreadsheet. They were using the older .xls format, which maxes out at 65,536 rows. When the dataset grew past that limit, new rows simply vanished. Nearly 16,000 positive test results were silently lost — meaning thousands of infected people were never contacted for contact tracing during a critical period of the pandemic.

The problem was not that someone wrote a bad formula. The problem was that no one had set up the data for proper analysis — no structured table, no validation, no overflow checks. Raw data was dumped into a spreadsheet with no guardrails.

This module teaches you how to analyze data in spreadsheets without losing it, breaking it, or misreading it.

16000COVID test results lost

65536row limit in old .xls format

1048576row limit in modern .xlsx

Sorting: the simplest analysis you can do

Sorting rearranges your data by one or more columns. It sounds trivial, but sorting is how you answer questions like "Who are our top 10 customers?" or "Which product sold the least last quarter?"

Single-column sort: Click any cell in the column, then Data → Sort A-Z (ascending) or Z-A (descending).

Multi-level sort: Data → Sort → Add Level. Example: Sort by Region first, then by Sales (largest to smallest) within each region.

⚠️The partial-sort disaster

If you select only column B and sort, Excel may sort *just that column* — disconnecting it from the rest of your data. Suddenly, customer names are matched with the wrong order totals. Always make sure your entire data range is selected, or better yet, use structured tables (covered below) that prevent this entirely.

Custom sort orders: Need to sort months chronologically instead of alphabetically (which puts April before January)? Custom sort lets you define your own order: Jan, Feb, Mar, Apr... instead of Apr, Aug, Dec, Feb.

Filtering: show only what matters

Filters hide rows that do not match your criteria — without deleting them. The data is still there; you just cannot see the irrelevant parts.

Step 1: Click any cell in your data range

Step 2: Go to Data → Filter (or Ctrl+Shift+L)

Step 3: Click the dropdown arrow on any column header

Step 4: Uncheck values you want to hide, or use conditions like "Greater than $500"

Step 5: Multiple columns can be filtered simultaneously — filters stack

Filter types:

Filter	What it does	Example
Value filter	Show/hide specific values	Show only "California" and "Texas"
Number filter	Conditions on numbers	Greater than $1,000
Date filter	Date-specific conditions	Last 30 days, This quarter
Text filter	Pattern matching	Contains "Corp", Begins with "A"
Color filter	Filter by cell/font color	Show only red-highlighted cells

There Are No Dumb Questions

"If I filter data and then use SUM, does it sum only the visible rows?"

No — SUM adds everything, including hidden rows. Use =SUBTOTAL(109, D2:D100) instead. The 109 tells Excel to sum only visible cells. This is one of the most common filtering mistakes. Alternatively, use =AGGREGATE(9, 5, D2:D100) for even more control.

"Can I filter and then copy-paste just the visible rows?"

Yes, but you need to select the visible cells first. After filtering, select your range, then press Alt+; (Windows) or Cmd+Shift+Z (Mac) to select only visible cells. Now copy-paste works as expected.

⚡

Filter Challenge

25 XP

You have a sales dataset with columns: Date, Sales Rep, Region, Product, Amount. Your manager asks these questions. What filter(s) would you apply? 1. "Show me all sales from Q1 2024 in the West region" → ___ 2. "Who closed deals over $10,000?" → ___ 3. "Show only laptop and tablet sales, sorted by amount (highest first)" → ___ _Hint: For #1, use a date filter on the Date column (between Jan 1 and Mar 31, 2024) AND a value filter on Region for "West". Stack multiple filters._

Conditional formatting — make data tell you where to look

Conditional formatting changes a cell's appearance (color, font, icon) based on its value. Instead of scanning 1,000 rows of numbers, your eyes are drawn immediately to what matters.

Format type	What it looks like	Best for
Color scales	Gradient from red to green	Spotting high/low values across a range
Data bars	Mini bar charts inside cells	Comparing magnitudes at a glance
Icon sets	Arrows, traffic lights, flags	Status indicators (up/down/stable)
Highlight rules	Cells turn red if < 0	Flagging exceptions, errors, overdue items
Top/Bottom rules	Highlight top 10 values	Finding best/worst performers

Practical examples:

Highlight all cells in the "Profit" column that are negative → red fill
Color scale on monthly revenue → see seasonal patterns instantly
Icon set on delivery status → green check for delivered, yellow clock for in transit, red X for delayed

✗ Without AI

✗Wall of identical-looking numbers
✗Must scan every cell manually
✗Errors hide in plain sight
✗Patterns invisible until you chart the data

✓ With AI

✓Negative values pop in red immediately
✓Eyes drawn to exceptions and outliers
✓Errors flagged with warning icons
✓Color gradients reveal trends without charts

🔑Do not over-format

A spreadsheet where every cell is a different color is worse than no formatting at all. Use conditional formatting to highlight *exceptions* — the things that need attention. If everything is highlighted, nothing is highlighted.

Data validation — preventing bad data at the source

Data validation restricts what can be entered into a cell. Instead of cleaning up bad data after the fact, you prevent it from being entered at all.

Validation type	What it does	Example
List	Dropdown of allowed values	Department: Sales, Marketing, Engineering, HR
Whole number	Only integers in a range	Quantity: between 1 and 1000
Decimal	Numbers with decimal constraints	Discount rate: between 0 and 0.5
Date	Only dates in a range	Start date: after January 1, 2024
Text length	Limits character count	Product code: exactly 8 characters
Custom	Any formula-based rule	Must be unique (no duplicates)

How to set it up: Select cells → Data → Data Validation → Choose criteria → Add an input message (tells users what to enter) → Add an error alert (tells users what went wrong).

The dropdown trick: Create a list validation for "Status" with values: Not Started, In Progress, Complete, On Hold — now users select from a dropdown instead of typing (and mistyping)

The date guard: Set validation to reject dates before today — prevents accidentally entering past dates for future deadlines

The duplicate blocker: Use custom validation with =COUNTIF(A:A, A2)=1 — rejects any value already in the column

There Are No Dumb Questions

"Can people bypass data validation by pasting data?"

Yes — and this is an important limitation. If someone copies data from another source and pastes it into a validated cell, the validation can be bypassed. This is why data validation is a guide, not a lock. For critical data, combine validation with periodic checks (like COUNTIF to find duplicates).

"What happens to existing data when I add validation to a column that already has data?"

Nothing happens to existing data. Validation only applies to new entries. To find existing values that violate your rules, go to Data → Data Validation → Circle Invalid Data. Excel will circle any cells that do not meet the criteria.

Removing duplicates

Duplicate data inflates your counts, skews your averages, and makes reports unreliable. Excel makes removal simple:

Select your data range
Data → Remove Duplicates
Choose which columns to check (check all columns for exact row duplicates, or specific columns for partial matches)
Excel tells you how many duplicates were found and removed

⚠️Always make a copy first

Remove Duplicates permanently deletes rows. There is no "show me the duplicates first" option. Before removing, copy your data to another sheet. Or use conditional formatting (Highlight Cell Rules → Duplicate Values) to *see* duplicates first, then decide which to remove manually.

⚡

Clean This Dataset

50 XP

You receive a spreadsheet with 5,000 customer records. Before any analysis, you need to clean it up. Describe the steps you would take for each issue: 1. The "State" column has entries like "CA", "Ca", "california", "CALIFORNIA", and "Calif." — how do you standardize this? → ___ 2. You suspect there are duplicate customer records (same email in multiple rows) — how do you find and handle them? → ___ 3. The "Order Amount" column has some cells with text like "$1,234" (stored as text) instead of the number 1234 — how do you fix this? → ___ 4. Future entries in the "State" column should only allow valid two-letter state abbreviations — how do you enforce this? → ___ _Hint: For #1, consider a helper column with `=UPPER()` or `=PROPER()`. For #2, use conditional formatting to highlight duplicates first, then Review. For #3, try Text to Columns or Find & Replace to remove $ and commas. For #4, data validation with a list of state codes._

Structured tables — the feature that changes everything

A structured table (Insert → Table or Ctrl+T) converts a plain range into an intelligent data object. This is the single most underused feature in Excel.

Plain range	Structured table
Formulas reference cells like `=SUM(B2:B100)`	Formulas reference columns like `=SUM(Table1[Revenue])`
New rows are not included in formulas	New rows are automatically included
No automatic formatting	Alternating row colors, filter buttons
Sorting can disconnect columns	Columns always stay connected
Charts do not expand with new data	Charts automatically include new rows

Why tables matter: When you add a new row of data to a structured table, every formula, chart, and pivot table that references the table automatically includes the new row. With plain ranges, you have to manually update every reference. This alone prevents an enormous category of spreadsheet errors.

🔑Name your tables

When you create a table, give it a meaningful name (click the table, go to Table Design, change "Table1" to something like "Sales2024"). Now your formulas read `=SUM(Sales2024[Revenue])` instead of `=SUM(Table1[Revenue])`. Anyone reading your spreadsheet instantly knows what the formula references.

Key takeaways

Sorting answers "what are the top/bottom values?" — always sort the full data range, never a single column in isolation
Filtering hides irrelevant rows without deleting them — use SUBTOTAL (not SUM) to calculate on visible rows only
Conditional formatting directs your eyes to exceptions and patterns — use it sparingly on what matters most
Data validation prevents bad data at entry — dropdown lists for categories, number ranges for quantities, custom formulas for uniqueness
Remove Duplicates cleans data but is destructive — always identify duplicates visually first, then remove
Structured tables (Ctrl+T) automatically expand formulas, charts, and pivot tables when new data is added — use them for every dataset

Knowledge Check

1.You filter a column to show only rows where Region = 'West', then use =SUM(D2:D500) to total the Amount column. What happens?

2.What is the primary advantage of using a structured table (Ctrl+T) over a plain data range?

3.In 2020, Public Health England lost nearly 16,000 COVID test results. What was the root cause?

4.You want to prevent users from entering anything other than 'High', 'Medium', or 'Low' in a Priority column. Which feature should you use?