O
Octo
O
Octo
CoursesPricingDashboardPrivacyTerms

© 2026 Octo

Excel & Spreadsheets
1Spreadsheet Fundamentals2Formulas & Functions3Data Analysis with Spreadsheets4Pivot Tables5Charts & Visualization6Automation & Macros7Advanced Formulas8Real-World Projects
Module 3

Data Analysis with Spreadsheets

Sorting, filtering, conditional formatting, and data validation — the tools that turn a wall of numbers into answers you can act on.

In 2020, England lost nearly 16,000 COVID test results because of an Excel row limit

Public Health England was tracking COVID-19 test results in an Excel spreadsheet. They were using the older .xls format, which maxes out at 65,536 rows. When the dataset grew past that limit, new rows simply vanished. Nearly 16,000 positive test results were silently lost — meaning thousands of infected people were never contacted for contact tracing during a critical period of the pandemic.

The problem was not that someone wrote a bad formula. The problem was that no one had set up the data for proper analysis — no structured table, no validation, no overflow checks. Raw data was dumped into a spreadsheet with no guardrails.

This module teaches you how to analyze data in spreadsheets without losing it, breaking it, or misreading it.

16000COVID test results lost

65536row limit in old .xls format

1048576row limit in modern .xlsx

Sorting: the simplest analysis you can do

Sorting rearranges your data by one or more columns. It sounds trivial, but sorting is how you answer questions like "Who are our top 10 customers?" or "Which product sold the least last quarter?"

Single-column sort: Click any cell in the column, then Data → Sort A-Z (ascending) or Z-A (descending).

Multi-level sort: Data → Sort → Add Level. Example: Sort by Region first, then by Sales (largest to smallest) within each region.

⚠️The partial-sort disaster
If you select only column B and sort, Excel may sort *just that column* — disconnecting it from the rest of your data. Suddenly, customer names are matched with the wrong order totals. Always make sure your entire data range is selected, or better yet, use structured tables (covered below) that prevent this entirely.

Custom sort orders: Need to sort months chronologically instead of alphabetically (which puts April before January)? Custom sort lets you define your own order: Jan, Feb, Mar, Apr... instead of Apr, Aug, Dec, Feb.

Filtering: show only what matters

Filters hide rows that do not match your criteria — without deleting them. The data is still there; you just cannot see the irrelevant parts.

Step 1: Click any cell in your data range

Step 2: Go to Data → Filter (or Ctrl+Shift+L)

Step 3: Click the dropdown arrow on any column header

Step 4: Uncheck values you want to hide, or use conditions like "Greater than $500"

Step 5: Multiple columns can be filtered simultaneously — filters stack

Filter types:

FilterWhat it doesExample
Value filterShow/hide specific valuesShow only "California" and "Texas"
Number filterConditions on numbersGreater than $1,000
Date filterDate-specific conditionsLast 30 days, This quarter
Text filterPattern matchingContains "Corp", Begins with "A"
Color filterFilter by cell/font colorShow only red-highlighted cells

There Are No Dumb Questions

"If I filter data and then use SUM, does it sum only the visible rows?"

No — SUM adds everything, including hidden rows. Use =SUBTOTAL(109, D2:D100) instead. The 109 tells Excel to sum only visible cells. This is one of the most common filtering mistakes. Alternatively, use =AGGREGATE(9, 5, D2:D100) for even more control.

"Can I filter and then copy-paste just the visible rows?"

Yes, but you need to select the visible cells first. After filtering, select your range, then press Alt+; (Windows) or Cmd+Shift+Z (Mac) to select only visible cells. Now copy-paste works as expected.

⚡

Filter Challenge

25 XP
You have a sales dataset with columns: Date, Sales Rep, Region, Product, Amount. Your manager asks these questions. What filter(s) would you apply? 1. "Show me all sales from Q1 2024 in the West region" → ___ 2. "Who closed deals over $10,000?" → ___ 3. "Show only laptop and tablet sales, sorted by amount (highest first)" → ___ _Hint: For #1, use a date filter on the Date column (between Jan 1 and Mar 31, 2024) AND a value filter on Region for "West". Stack multiple filters._

Conditional formatting — make data tell you where to look

Conditional formatting changes a cell's appearance (color, font, icon) based on its value. Instead of scanning 1,000 rows of numbers, your eyes are drawn immediately to what matters.

Format typeWhat it looks likeBest for
Color scalesGradient from red to greenSpotting high/low values across a range
Data barsMini bar charts inside cellsComparing magnitudes at a glance
Icon setsArrows, traffic lights, flagsStatus indicators (up/down/stable)
Highlight rulesCells turn red if < 0Flagging exceptions, errors, overdue items
Top/Bottom rulesHighlight top 10 valuesFinding best/worst performers

Practical examples:

  • Highlight all cells in the "Profit" column that are negative → red fill
  • Color scale on monthly revenue → see seasonal patterns instantly
  • Icon set on delivery status → green check for delivered, yellow clock for in transit, red X for delayed

✗ Without AI

  • ✗Wall of identical-looking numbers
  • ✗Must scan every cell manually
  • ✗Errors hide in plain sight
  • ✗Patterns invisible until you chart the data

✓ With AI

  • ✓Negative values pop in red immediately
  • ✓Eyes drawn to exceptions and outliers
  • ✓Errors flagged with warning icons
  • ✓Color gradients reveal trends without charts

🔑Do not over-format
A spreadsheet where every cell is a different color is worse than no formatting at all. Use conditional formatting to highlight *exceptions* — the things that need attention. If everything is highlighted, nothing is highlighted.

Data validation — preventing bad data at the source

Data validation restricts what can be entered into a cell. Instead of cleaning up bad data after the fact, you prevent it from being entered at all.

Validation typeWhat it doesExample
ListDropdown of allowed valuesDepartment: Sales, Marketing, Engineering, HR
Whole numberOnly integers in a rangeQuantity: between 1 and 1000
DecimalNumbers with decimal constraintsDiscount rate: between 0 and 0.5
DateOnly dates in a rangeStart date: after January 1, 2024
Text lengthLimits character countProduct code: exactly 8 characters
CustomAny formula-based ruleMust be unique (no duplicates)

How to set it up: Select cells → Data → Data Validation → Choose criteria → Add an input message (tells users what to enter) → Add an error alert (tells users what went wrong).

The dropdown trick: Create a list validation for "Status" with values: Not Started, In Progress, Complete, On Hold — now users select from a dropdown instead of typing (and mistyping)

The date guard: Set validation to reject dates before today — prevents accidentally entering past dates for future deadlines

The duplicate blocker: Use custom validation with =COUNTIF(A:A, A2)=1 — rejects any value already in the column

There Are No Dumb Questions

"Can people bypass data validation by pasting data?"

Yes — and this is an important limitation. If someone copies data from another source and pastes it into a validated cell, the validation can be bypassed. This is why data validation is a guide, not a lock. For critical data, combine validation with periodic checks (like COUNTIF to find duplicates).

"What happens to existing data when I add validation to a column that already has data?"

Nothing happens to existing data. Validation only applies to new entries. To find existing values that violate your rules, go to Data → Data Validation → Circle Invalid Data. Excel will circle any cells that do not meet the criteria.

Removing duplicates

Duplicate data inflates your counts, skews your averages, and makes reports unreliable. Excel makes removal simple:

  1. Select your data range
  2. Data → Remove Duplicates
  3. Choose which columns to check (check all columns for exact row duplicates, or specific columns for partial matches)
  4. Excel tells you how many duplicates were found and removed
⚠️Always make a copy first
Remove Duplicates permanently deletes rows. There is no "show me the duplicates first" option. Before removing, copy your data to another sheet. Or use conditional formatting (Highlight Cell Rules → Duplicate Values) to *see* duplicates first, then decide which to remove manually.

⚡

Clean This Dataset

50 XP
You receive a spreadsheet with 5,000 customer records. Before any analysis, you need to clean it up. Describe the steps you would take for each issue: 1. The "State" column has entries like "CA", "Ca", "california", "CALIFORNIA", and "Calif." — how do you standardize this? → ___ 2. You suspect there are duplicate customer records (same email in multiple rows) — how do you find and handle them? → ___ 3. The "Order Amount" column has some cells with text like "$1,234" (stored as text) instead of the number 1234 — how do you fix this? → ___ 4. Future entries in the "State" column should only allow valid two-letter state abbreviations — how do you enforce this? → ___ _Hint: For #1, consider a helper column with `=UPPER()` or `=PROPER()`. For #2, use conditional formatting to highlight duplicates first, then Review. For #3, try Text to Columns or Find & Replace to remove $ and commas. For #4, data validation with a list of state codes._

Structured tables — the feature that changes everything

A structured table (Insert → Table or Ctrl+T) converts a plain range into an intelligent data object. This is the single most underused feature in Excel.

Plain rangeStructured table
Formulas reference cells like =SUM(B2:B100)Formulas reference columns like =SUM(Table1[Revenue])
New rows are not included in formulasNew rows are automatically included
No automatic formattingAlternating row colors, filter buttons
Sorting can disconnect columnsColumns always stay connected
Charts do not expand with new dataCharts automatically include new rows

Why tables matter: When you add a new row of data to a structured table, every formula, chart, and pivot table that references the table automatically includes the new row. With plain ranges, you have to manually update every reference. This alone prevents an enormous category of spreadsheet errors.

🔑Name your tables
When you create a table, give it a meaningful name (click the table, go to Table Design, change "Table1" to something like "Sales2024"). Now your formulas read `=SUM(Sales2024[Revenue])` instead of `=SUM(Table1[Revenue])`. Anyone reading your spreadsheet instantly knows what the formula references.

Key takeaways

  • Sorting answers "what are the top/bottom values?" — always sort the full data range, never a single column in isolation
  • Filtering hides irrelevant rows without deleting them — use SUBTOTAL (not SUM) to calculate on visible rows only
  • Conditional formatting directs your eyes to exceptions and patterns — use it sparingly on what matters most
  • Data validation prevents bad data at entry — dropdown lists for categories, number ranges for quantities, custom formulas for uniqueness
  • Remove Duplicates cleans data but is destructive — always identify duplicates visually first, then remove
  • Structured tables (Ctrl+T) automatically expand formulas, charts, and pivot tables when new data is added — use them for every dataset

?

Knowledge Check

1.You filter a column to show only rows where Region = 'West', then use =SUM(D2:D500) to total the Amount column. What happens?

2.What is the primary advantage of using a structured table (Ctrl+T) over a plain data range?

3.In 2020, Public Health England lost nearly 16,000 COVID test results. What was the root cause?

4.You want to prevent users from entering anything other than 'High', 'Medium', or 'Low' in a Priority column. Which feature should you use?

Previous

Formulas & Functions

Next

Pivot Tables