Usability Testing
You are not testing the user — you are testing the design. Here's how to run usability tests, write scripts, analyze findings, and turn user confusion into product improvements.
Google's 41 shades of blue
In 2009, Google's design team could not agree on which shade of blue to use for links in Gmail. The lead designer wanted one shade. The product team wanted another. Rather than debating, they tested 41 different shades of blue with real users to measure which one got the most clicks.
The winning shade generated $200 million in additional annual ad revenue.
The designer who led the project, Douglas Bowman, quit shortly after. He wrote a famous blog post saying that a company should not need data to decide between two shades of blue — that is what designers are for.
Both sides had a point. But the story reveals something important: testing with real users produces answers that no amount of internal debate can. Whether you test 41 shades of blue or the placement of a checkout button, putting your design in front of real humans is the fastest way to separate what works from what you think works.
What usability testing actually is
Usability testing is watching real people try to use your product while you observe their behavior, listen to their thoughts, and identify where the design confuses, frustrates, or fails them.
It is not:
- Asking users if they like the design (that is an opinion survey)
- Checking if the code works (that is QA testing)
- A/B testing (that measures behavior at scale; usability testing reveals why)
- Focus groups (that is group discussion; usability testing is one-on-one task observation)
| What usability testing measures | How |
|---|---|
| Task completion rate | Can users actually finish the task? (Yes/no, with/without help) |
| Time on task | How long does it take? Longer = harder |
| Error rate | How many wrong clicks, dead ends, or mistakes? |
| Satisfaction | How does the user feel about the experience? (Post-task rating) |
| Qualitative insights | Why did they struggle? What were they thinking? (Think-aloud protocol) |
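The first four measures in the table are quantitative and easy to compute once each session is logged. Here is a minimal sketch of that arithmetic; the `Session` fields and the sample data are hypothetical, not from any real study.

```python
# Hypothetical session records: one per participant per task.
from dataclasses import dataclass

@dataclass
class Session:
    completed: bool   # did the participant finish the task?
    seconds: float    # time on task
    errors: int       # wrong clicks, dead ends, mistakes
    rating: int       # post-task satisfaction, 1-7

def summarize(sessions):
    """Roll up the four quantitative usability measures."""
    n = len(sessions)
    return {
        "completion_rate": sum(s.completed for s in sessions) / n,
        "avg_time_s": sum(s.seconds for s in sessions) / n,
        "avg_errors": sum(s.errors for s in sessions) / n,
        "avg_rating": sum(s.rating for s in sessions) / n,
    }

sessions = [
    Session(True, 45, 1, 6),
    Session(True, 80, 3, 4),
    Session(False, 180, 7, 2),
    Session(True, 50, 0, 7),
    Session(True, 95, 2, 5),
]
print(summarize(sessions))
```

The qualitative insights in the last row cannot be computed this way; they come from the think-aloud protocol covered below.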
✗ Without usability testing
- ✗ Team argues about which design is better
- ✗ Ship a feature that 40% of users cannot figure out
- ✗ Discover problems from angry customer support tickets
- ✗ Fix issues after they have cost the company money
✓ With usability testing
- ✓ Real users show which design works
- ✓ Catch the 40% failure rate before shipping
- ✓ Discover problems by watching 5 users in a room
- ✓ Fix issues before they reach a single customer
Moderated vs. unmoderated testing
There are two fundamentally different ways to run usability tests:
Moderated testing
A facilitator sits with the participant (in person or via video call), gives tasks, observes behavior, and asks follow-up questions in real time.
Strengths: Deep insights. You can follow up on unexpected behavior ("I noticed you hesitated there — what were you thinking?"). You catch nuance that recordings miss.
Weaknesses: Time-intensive (30-60 minutes per session). Scheduling is hard. The facilitator can accidentally influence the participant.
Unmoderated testing
Participants complete tasks on their own using a tool (Maze, UserTesting, Lookback) that records their screen and voice. No facilitator present.
Strengths: Fast. You can test 20 users in a single day. No scheduling — participants complete tasks on their own time. Good for quantitative metrics (completion rates, time on task).
Weaknesses: No follow-up questions. You cannot probe deeper when something interesting happens. Participants may not think aloud without prompting.
| Dimension | Moderated | Unmoderated |
|---|---|---|
| Participants per study | 5-8 | 15-30 |
| Time per session | 30-60 minutes | 10-20 minutes |
| Depth of insight | Very deep | Moderate |
| Speed of data collection | Days to weeks | Hours to days |
| Best for | Early design exploration, complex flows | Validating specific tasks, competitive benchmarking |
| Tools | Zoom, Lookback, in-person | Maze, UserTesting, UsabilityHub |
There Are No Dumb Questions
"How do I recruit participants?"
For moderated testing: use your existing user base (email a segment), recruit through UserTesting or Respondent.io, or ask friends-of-friends (not your actual friends — they will be too nice). For unmoderated testing: Maze and UserTesting have participant panels you can filter by demographics. Budget: expect to pay $50-100 per participant for a 30-minute session.
"What if I cannot afford to pay participants?"
Hallway testing. Literally grab someone in the hallway (or a coffee shop) and ask: "Can I have 5 minutes of your time? I want to see if this design makes sense." You will not get the deepest insights, but any testing is better than no testing.
Writing a usability test script
A test script is your facilitator guide. It keeps every session consistent so you can compare results across participants.
Script structure
Introduction (2 minutes). Welcome the participant. Explain that you are testing the design, not them. "There are no wrong answers. If something is confusing, that is the design's fault, not yours." Ask permission to record.
Warm-up questions (3 minutes). Build rapport and gather context. "Tell me about the last time you booked a flight online. What tool did you use? What was easy? What was frustrating?"
Task scenarios (20-30 minutes). Give 4-6 realistic tasks. Frame them as scenarios, not instructions. Say: "You are planning a trip to Barcelona next month. Find a round-trip flight for 2 adults." Do NOT say: "Click Flights, then enter Barcelona, then select dates."
Post-task questions (5 minutes). After each task, ask: "On a scale of 1-7, how easy was that?" and "What was going through your mind?" At the end, ask: "What was the most frustrating part? What was the easiest part?"
Wrap-up (2 minutes). Thank the participant. Ask if there is anything else they want to share. End the recording.
Writing great task scenarios
| Bad task | Why it fails | Better task |
|---|---|---|
| "Find the search function" | Too direct — tells them what to do, not why | "You want to find a recipe for chocolate chip cookies. How would you do that?" |
| "Test the checkout flow" | Not a realistic user goal | "You have found a pair of shoes you like. Complete the purchase." |
| "Click on Settings, then Account, then Change Password" | Step-by-step instructions — you are testing memory, not usability | "You want to change your password because you think someone else may have access. Show me how you would do that." |
The think-aloud protocol
The think-aloud protocol is the most important technique in usability testing. You ask participants to verbalize their thoughts as they complete tasks.
What it sounds like: "OK, I see a big blue button that says 'Get Started,' so I assume that is where I sign up... I am clicking it... now it is asking for my email... I am wondering if this is going to spam me... I am going to use my secondary email just in case..."
This narration reveals what no analytics tool can: the user's internal decision-making process. You hear their expectations, their hesitations, their confusion, and their assumptions.
How to facilitate think-aloud
- At the start: "As you work through these tasks, please think out loud. Tell me what you are looking at, what you are thinking, and what you are trying to do. There are no wrong answers."
- When they go silent: "What are you thinking right now?" or "What are you looking for?"
- When they struggle: Do NOT help. Say: "Take your time. What would you do if I were not here?"
- When they ask you a question: "What would you expect to happen?" Turn their question back into data.
There Are No Dumb Questions
"What if the participant cannot complete the task at all?"
After 3-5 minutes of being stuck, give them a gentle prompt: "Where would you expect to find that?" If they are still stuck after another minute, offer a hint: "Try looking in the top navigation." If they still cannot complete it, move to the next task. A task that nobody can complete is an extremely clear finding — the design has fundamentally failed for that flow.
Analyzing findings — from observations to action
After testing 5 users, you will have hours of recordings and pages of notes. Here is how to turn that into actionable insights:
1. Debrief immediately. Right after each session, write down the top 3 observations while they are fresh. What surprised you? Where did the user struggle most?
2. Create an observation grid. Rows = tasks. Columns = participants. Each cell = what happened. This lets you see patterns: if 4 out of 5 users fail the same task, that is a pattern, not an outlier.
3. Categorize issues by severity. Critical (user cannot complete the task), Major (user completes with significant difficulty), Minor (user notices but works around it), Cosmetic (user does not notice, but it violates best practice).
4. Prioritize fixes. Critical issues first, always. A "nice-to-have" animation improvement can wait. A checkout flow that 60% of users cannot complete cannot.
5. Present findings with evidence. Do not say "I think the nav is confusing." Say "4 out of 5 users could not find Account Settings. User 3 said: 'I have been looking for this for 2 minutes and I am about to give up.'" Video clips are the most persuasive evidence.
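Step 2's observation grid can be as simple as a spreadsheet, or a small script. The sketch below (task names, participant IDs, and the 50% threshold are all hypothetical choices, not a standard) tallies failures per task so patterns like "4 out of 5 users failed" surface automatically:

```python
# Observation grid as a dict keyed by (task, participant).
# True = participant completed the task, False = failed.
from collections import defaultdict

grid = {
    ("find settings", "P1"): False,
    ("find settings", "P2"): False,
    ("find settings", "P3"): True,
    ("find settings", "P4"): False,
    ("find settings", "P5"): False,
    ("checkout", "P1"): True,
    ("checkout", "P2"): True,
    ("checkout", "P3"): True,
    ("checkout", "P4"): False,
    ("checkout", "P5"): True,
}

def failure_patterns(grid, threshold=0.5):
    """Return tasks whose failure rate meets the pattern threshold."""
    fails, totals = defaultdict(int), defaultdict(int)
    for (task, _participant), completed in grid.items():
        totals[task] += 1
        if not completed:
            fails[task] += 1
    return {t: fails[t] / totals[t]
            for t in totals if fails[t] / totals[t] >= threshold}

print(failure_patterns(grid))
```

Here "find settings" fails for 4 of 5 participants and gets flagged; "checkout" fails for only 1 of 5, which is an outlier, not a pattern.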
The severity scale
| Severity | Definition | Example | Action |
|---|---|---|---|
| Critical | User cannot complete the task | 0 out of 5 users found the checkout button | Fix before launch — this is a blocker |
| Major | User completes with significant difficulty or frustration | Users took 4 minutes instead of 30 seconds to find settings | Fix in current sprint |
| Minor | User notices but works around it | Error message is unclear but users eventually figure it out | Fix in next sprint |
| Cosmetic | Violates best practice but users do not notice | Inconsistent button padding | Add to backlog |
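The severity scale translates directly into a fix queue: sort findings so Critical issues always come first. A minimal sketch, with hypothetical findings drawn from the examples above:

```python
# Rank order matching the severity table: lower number = fix sooner.
SEVERITY_ORDER = {"Critical": 0, "Major": 1, "Minor": 2, "Cosmetic": 3}

findings = [
    ("Inconsistent button padding", "Cosmetic"),
    ("0 of 5 users found the checkout button", "Critical"),
    ("Error message is unclear", "Minor"),
    ("Settings took 4 minutes instead of 30 seconds to find", "Major"),
]

# Sort findings into a prioritized fix queue.
fix_queue = sorted(findings, key=lambda f: SEVERITY_ORDER[f[1]])
for issue, severity in fix_queue:
    print(f"{severity}: {issue}")
```

In practice you would attach the supporting evidence (counts, quotes, video timestamps) to each finding, but the prioritization logic stays this simple.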
Iterating on feedback — closing the loop
Testing is not the end. It is the middle. The cycle is:
Design → Test → Learn → Redesign → Test again
After fixing the issues found in Round 1, run Round 2 with 5 new users. Never test with the same participants — they have already learned your interface and will not struggle with the same things.
Expect to need 2-3 rounds of testing for major features. Each round catches fewer issues. Round 1 finds the critical and major problems. Round 2 validates the fixes and catches minor issues. Round 3 polishes.
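Why 5 users per round? Nielsen and Landauer's problem-discovery model estimates the share of usability problems found by n users as 1 − (1 − L)^n, where L is the average proportion of problems a single user uncovers (they measured L ≈ 0.31 across projects; treat that as a rough average, not a law). A quick back-of-the-envelope sketch:

```python
def problems_found(n_users, discovery_rate=0.31):
    """Expected share of usability problems found by n users
    (Nielsen & Landauer's model: 1 - (1 - L)^n)."""
    return 1 - (1 - discovery_rate) ** n_users

for n in (1, 3, 5, 10):
    print(f"{n} users: ~{problems_found(n):.0%} of problems found")
```

Five users find roughly 85% of problems, and each additional user adds less. That diminishing return is exactly why three rounds of 5 users beats one round of 15: each round tests a fixed design, and fixes between rounds reset the curve.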
Key takeaways
- Usability testing watches real users attempt real tasks — it reveals problems that no amount of internal debate can surface
- Moderated testing gives deep qualitative insights; unmoderated testing gives faster quantitative data — use both at different stages
- Write task scenarios, not instructions — frame tasks as realistic situations that give users a goal without telling them how to achieve it
- The think-aloud protocol reveals the user's internal decision process — their hesitations, expectations, and confusion become your best data
- Severity classification (Critical, Major, Minor, Cosmetic) ensures you fix the most impactful issues first
- Test in rounds — 5 users per round, 2-3 rounds per major feature, new participants each time
Knowledge Check
1. What is the main advantage of moderated usability testing over unmoderated testing?
2. Why should usability test tasks be framed as scenarios rather than instructions?
3. During a usability test, a participant is visibly stuck and asks you "Where is the settings button?" What should you do?
4. If 4 out of 5 test participants cannot complete a task, what severity would you assign?