Usability Testing
You are not testing the user — you are testing the design. Here's how to run usability tests, write scripts, analyze findings, and turn user confusion into product improvements.
Google's 41 shades of blue
In 2009, Google's design team could not agree on which shade of blue to use for links in Gmail. The lead designer wanted one shade. The product team wanted another. Rather than debating, they tested 41 different shades of blue with real users to measure which one got the most clicks.
The winning shade generated $200 million in additional annual ad revenue.
The designer who led the project, Douglas Bowman, quit shortly after. He wrote a famous blog post saying that a company should not need data to decide between two shades of blue — that is what designers are for.
Both sides had a point. But the story reveals something important: testing with real users produces answers that no amount of internal debate can. Whether you test 41 shades of blue or the placement of a checkout button, putting your design in front of real humans is the fastest way to separate what works from what you think works.
What usability testing actually is
Usability testing is watching real people try to use your product while you observe their behavior, listen to their thoughts, and identify where the design confuses, frustrates, or fails them.
It is not:
- Asking users if they like the design (that is an opinion survey)
- Checking if the code works (that is QA testing)
- A/B testing (that measures behavior at scale; usability testing reveals why)
- Focus groups (that is group discussion; usability testing is one-on-one task observation)
| What usability testing measures | How |
|---|---|
| Task completion rate | Can users actually finish the task? (Yes/no, with/without help) |
| Time on task | How long does it take? Longer = harder |
| Error rate | How many wrong clicks, dead ends, or mistakes? |
| Satisfaction | How does the user feel about the experience? (Post-task rating) |
| Qualitative insights | Why did they struggle? What were they thinking? (Think-aloud protocol) |
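The first four measures in the table are quantitative and easy to compute once each session is logged. Here is a minimal sketch of that arithmetic; the `Session` fields and the sample data are hypothetical, not from any real study.

```python
# Hypothetical session records: one per participant per task.
from dataclasses import dataclass

@dataclass
class Session:
    completed: bool   # did the participant finish the task?
    seconds: float    # time on task
    errors: int       # wrong clicks, dead ends, mistakes
    rating: int       # post-task satisfaction, 1-7

def summarize(sessions):
    """Roll up the four quantitative usability measures."""
    n = len(sessions)
    return {
        "completion_rate": sum(s.completed for s in sessions) / n,
        "avg_time_s": sum(s.seconds for s in sessions) / n,
        "avg_errors": sum(s.errors for s in sessions) / n,
        "avg_rating": sum(s.rating for s in sessions) / n,
    }

sessions = [
    Session(True, 45, 1, 6),
    Session(True, 80, 3, 4),
    Session(False, 180, 7, 2),
    Session(True, 50, 0, 7),
    Session(True, 95, 2, 5),
]
print(summarize(sessions))
```

The qualitative insights in the last row cannot be computed this way; they come from the think-aloud protocol covered below.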
✗ Without usability testing
- ✗ Team argues about which design is better
- ✗ Ship a feature that 40% of users cannot figure out
- ✗ Discover problems from angry customer support tickets
- ✗ Fix issues after they have cost the company money
✓ With usability testing
- ✓ Real users show which design works
- ✓ Catch the 40% failure rate before shipping
- ✓ Discover problems by watching 5 users in a room
- ✓ Fix issues before they reach a single customer
Moderated vs. unmoderated testing
There are two fundamentally different ways to run usability tests:
Moderated testing
A facilitator sits with the participant (in person or via video call), gives tasks, observes behavior, and asks follow-up questions in real time.
Strengths: Deep insights. You can follow up on unexpected behavior ("I noticed you hesitated there — what were you thinking?"). You catch nuance that recordings miss.
Weaknesses: Time-intensive (30-60 minutes per session). Scheduling is hard. The facilitator can accidentally influence the participant.
Unmoderated testing
Participants complete tasks on their own using a tool (Maze, UserTesting, Lookback) that records their screen and voice. No facilitator present.
Strengths: Fast. You can test 20 users in a single day. No scheduling — participants complete tasks on their own time. Good for quantitative metrics (completion rates, time on task).
Weaknesses: No follow-up questions. You cannot probe deeper when something interesting happens. Participants may not think aloud without prompting.
| Dimension | Moderated | Unmoderated |
|---|---|---|
| Participants per study | 5-8 | 15-30 |
| Time per session | 30-60 minutes | 10-20 minutes |
| Depth of insight | Very deep | Moderate |
| Speed of data collection | Days to weeks | Hours to days |
| Best for | Early design exploration, complex flows | Validating specific tasks, competitive benchmarking |
| Tools | Zoom, Lookback, in-person | Maze, UserTesting, UsabilityHub |
There Are No Dumb Questions
"How do I recruit participants?"
For moderated testing: use your existing user base (email a segment), recruit through UserTesting or Respondent.io, or ask friends-of-friends (not your actual friends — they will be too nice). For unmoderated testing: Maze and UserTesting have participant panels you can filter by demographics. Budget: expect to pay $50-100 per participant for a 30-minute session.
"What if I cannot afford to pay participants?"
Hallway testing. Literally grab someone in the hallway (or a coffee shop) and ask: "Can I have 5 minutes of your time? I want to see if this design makes sense." You will not get the deepest insights, but any testing is better than no testing.
Writing a usability test script
A test script is your facilitator guide. It keeps every session consistent so you can compare results across participants.
Script structure
Introduction (2 minutes). Welcome the participant. Explain that you are testing the design, not them. "There are no wrong answers. If something is confusing, that is the design's fault, not yours." Ask permission to record.
Warm-up questions (3 minutes). Build rapport and gather context. "Tell me about the last time you booked a flight online. What tool did you use? What was easy? What was frustrating?"
Task scenarios (20-30 minutes). Give 4-6 realistic tasks. Frame them as scenarios, not instructions. Say: "You are planning a trip to Barcelona next month. Find a round-trip flight for 2 adults." Do NOT say: "Click Flights, then enter Barcelona, then select dates."
Post-task questions (5 minutes). After each task, ask: "On a scale of 1-7, how easy was that?" and "What was going through your mind?" At the end, ask: "What was the most frustrating part? What was the easiest part?"
Wrap-up (2 minutes). Thank the participant. Ask if there is anything else they want to share. End the recording.
Writing great task scenarios
| Bad task | Why it fails | Better task |
|---|---|---|
| "Find the search function" | Too direct — tells them what to do, not why | "You want to find a recipe for chocolate chip cookies. How would you do that?" |
| "Test the checkout flow" | Not a realistic user goal | "You have found a pair of shoes you like. Complete the purchase." |
| "Click on Settings, then Account, then Change Password" | Step-by-step instructions — you are testing memory, not usability | "You want to change your password because you think someone else may have access. Show me how you would do that." |
The think-aloud protocol
The think-aloud protocol is the most important technique in usability testing. You ask participants to verbalize their thoughts as they complete tasks.
What it sounds like: "OK, I see a big blue button that says 'Get Started,' so I assume that is where I sign up... I am clicking it... now it is asking for my email... I am wondering if this is going to spam me... I am going to use my secondary email just in case..."
This narration reveals what no analytics tool can: the user's internal decision-making process. You hear their expectations, their hesitations, their confusion, and their assumptions.
How to facilitate think-aloud
- At the start: "As you work through these tasks, please think out loud. Tell me what you are looking at, what you are thinking, and what you are trying to do. There are no wrong answers."
- When they go silent: "What are you thinking right now?" or "What are you looking for?"
- When they struggle: Do NOT help. Say: "Take your time. What would you do if I were not here?"
- When they ask you a question: "What would you expect to happen?" Turn their question back into data.
There Are No Dumb Questions
"What if the participant cannot complete the task at all?"
After 3-5 minutes of being stuck, give them a gentle prompt: "Where would you expect to find that?" If they are still stuck after another minute, offer a hint: "Try looking in the top navigation." If they still cannot complete it, move to the next task. A task that nobody can complete is an extremely clear finding — the design has fundamentally failed for that flow.
Analyzing findings — from observations to action
After testing 5 users, you will have hours of recordings and pages of notes. Here is how to turn that into actionable insights:
1. Debrief immediately. Right after each session, write down the top 3 observations while they are fresh. What surprised you? Where did the user struggle most?
2. Create an observation grid. Rows = tasks. Columns = participants. Each cell = what happened. This lets you see patterns: if 4 out of 5 users fail the same task, that is a pattern, not an outlier.
3. Categorize issues by severity. Critical (user cannot complete the task), Major (user completes with significant difficulty), Minor (user notices but works around it), Cosmetic (user does not notice, but it violates best practice).
4. Prioritize fixes. Critical issues first, always. A "nice-to-have" animation improvement can wait. A checkout flow that 60% of users cannot complete cannot.
5. Present findings with evidence. Do not say "I think the nav is confusing." Say "4 out of 5 users could not find Account Settings. User 3 said: 'I have been looking for this for 2 minutes and I am about to give up.'" Video clips are the most persuasive evidence.
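Step 2's observation grid can be as simple as a spreadsheet, or a small script. The sketch below (task names, participant IDs, and the 50% threshold are all hypothetical choices, not a standard) tallies failures per task so patterns like "4 out of 5 users failed" surface automatically:

```python
# Observation grid as a dict keyed by (task, participant).
# True = participant completed the task, False = failed.
from collections import defaultdict

grid = {
    ("find settings", "P1"): False,
    ("find settings", "P2"): False,
    ("find settings", "P3"): True,
    ("find settings", "P4"): False,
    ("find settings", "P5"): False,
    ("checkout", "P1"): True,
    ("checkout", "P2"): True,
    ("checkout", "P3"): True,
    ("checkout", "P4"): False,
    ("checkout", "P5"): True,
}

def failure_patterns(grid, threshold=0.5):
    """Return tasks whose failure rate meets the pattern threshold."""
    fails, totals = defaultdict(int), defaultdict(int)
    for (task, _participant), completed in grid.items():
        totals[task] += 1
        if not completed:
            fails[task] += 1
    return {t: fails[t] / totals[t]
            for t in totals if fails[t] / totals[t] >= threshold}

print(failure_patterns(grid))
```

Here "find settings" fails for 4 of 5 participants and gets flagged; "checkout" fails for only 1 of 5, which is an outlier, not a pattern.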
The severity scale
| Severity | Definition | Example | Action |
|---|---|---|---|
| Critical | User cannot complete the task | 0 out of 5 users found the checkout button | Fix before launch — this is a blocker |
| Major | User completes with significant difficulty or frustration | Users took 4 minutes instead of 30 seconds to find settings | Fix in current sprint |
| Minor | User notices but works around it | Error message is unclear but users eventually figure it out | Fix in next sprint |
| Cosmetic | Violates best practice but users do not notice | Inconsistent button padding | Add to backlog |
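The severity scale translates directly into a fix queue: sort findings so Critical issues always come first. A minimal sketch, with hypothetical findings drawn from the examples above:

```python
# Rank order matching the severity table: lower number = fix sooner.
SEVERITY_ORDER = {"Critical": 0, "Major": 1, "Minor": 2, "Cosmetic": 3}

findings = [
    ("Inconsistent button padding", "Cosmetic"),
    ("0 of 5 users found the checkout button", "Critical"),
    ("Error message is unclear", "Minor"),
    ("Settings took 4 minutes instead of 30 seconds to find", "Major"),
]

# Sort findings into a prioritized fix queue.
fix_queue = sorted(findings, key=lambda f: SEVERITY_ORDER[f[1]])
for issue, severity in fix_queue:
    print(f"{severity}: {issue}")
```

In practice you would attach the supporting evidence (counts, quotes, video timestamps) to each finding, but the prioritization logic stays this simple.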
Iterating on feedback — closing the loop
Testing is not the end. It is the middle. The cycle is:
Design → Test → Learn → Redesign → Test again
After fixing the issues found in Round 1, run Round 2 with 5 new users. Never test with the same participants — they have already learned your interface and will not struggle with the same things.
Expect to need 2-3 rounds of testing for major features. Each round catches fewer issues. Round 1 finds the critical and major problems. Round 2 validates the fixes and catches minor issues. Round 3 polishes.
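Why 5 users per round? Nielsen and Landauer's problem-discovery model estimates the share of usability problems found by n users as 1 − (1 − L)^n, where L is the average proportion of problems a single user uncovers (they measured L ≈ 0.31 across projects; treat that as a rough average, not a law). A quick back-of-the-envelope sketch:

```python
def problems_found(n_users, discovery_rate=0.31):
    """Expected share of usability problems found by n users
    (Nielsen & Landauer's model: 1 - (1 - L)^n)."""
    return 1 - (1 - discovery_rate) ** n_users

for n in (1, 3, 5, 10):
    print(f"{n} users: ~{problems_found(n):.0%} of problems found")
```

Five users find roughly 85% of problems, and each additional user adds less. That diminishing return is exactly why three rounds of 5 users beats one round of 15: each round tests a fixed design, and fixes between rounds reset the curve.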
Key takeaways
- Usability testing watches real users attempt real tasks — it reveals problems that no amount of internal debate can surface
- Moderated testing gives deep qualitative insights; unmoderated testing gives faster quantitative data — use both at different stages
- Write task scenarios, not instructions — frame tasks as realistic situations that give users a goal without telling them how to achieve it
- The think-aloud protocol reveals the user's internal decision process — their hesitations, expectations, and confusion become your best data
- Severity classification (Critical, Major, Minor, Cosmetic) ensures you fix the most impactful issues first
- Test in rounds — 5 users per round, 2-3 rounds per major feature, new participants each time
Knowledge Check
1. What is the main advantage of moderated usability testing over unmoderated testing?
2. Why should usability test tasks be framed as scenarios rather than instructions?
3. During a usability test, a participant is visibly stuck and asks you "Where is the settings button?" What should you do?
4. If 4 out of 5 test participants cannot complete a task, what severity would you assign?