Usability Tests

Usability testing puts your design in front of real people and watches what happens. It’s the most direct way to find out whether your interface actually works — before you ship it to everyone.

Designers and developers are too close to their own work. We know where every button is, what every icon means, and how every flow is supposed to work. Users don’t. Usability testing reveals the gap between intention and reality.

No amount of design review, heuristic evaluation, or internal QA catches what real users catch. They click on things that aren’t clickable, miss things that seem obvious, and interpret labels in ways you never anticipated.


| Type | Participants | Moderator | Duration | Best For |
|---|---|---|---|---|
| Moderated Remote | 5-8 | Yes (video call) | 45-60 min | Exploratory research, complex flows |
| Unmoderated Remote | 15-30+ | No | 15-30 min | Quick validation, specific questions |
| In-Person Lab | 5-8 | Yes | 60-90 min | High-fidelity prototypes, accessibility |
| Guerrilla/Hallway | 3-5 | Yes (informal) | 5-15 min | Early concepts, quick feedback |
| A/B Testing | 1000+ | No | Varies | Statistical comparison of variants |
Sample size guidance:

| Study Goal | Minimum | Recommended | Maximum ROI | Notes |
|---|---|---|---|---|
| Problem discovery | 5 | 5-8 | 85% of issues found at 5 | Nielsen/Landauer formula |
| Multiple segments | 3-5 per segment | 5 per segment | Per-segment problem discovery | Effectively separate studies |
| Quantitative metrics | 20 | 40 | Tight confidence intervals | 95% confidence level |
| Card sorting | 15 | 30-50 | Statistical clustering | Per user group |
| Eye-tracking heatmaps | 20 | 39 | Stable heatmaps | Fixed viewport required |
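
The Nielsen/Landauer formula behind the 5-user guideline models the share of problems found as 1 - (1 - L)^n, where L is the chance a single participant hits a given problem (about 31% in their data). A quick sketch:

```python
def problems_found(n_users: int, detection_rate: float = 0.31) -> float:
    """Nielsen/Landauer estimate: share of usability problems uncovered by n users,
    where detection_rate is the chance one user hits a given problem (~31% on average)."""
    return 1 - (1 - detection_rate) ** n_users

for n in (1, 3, 5, 8, 15):
    print(n, f"{problems_found(n):.0%}")   # 5 users -> ~84%; 15 users -> ~100%
```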
Typical study timeline:

| Test Phase | Duration | Notes |
|---|---|---|
| Planning | 3-5 days | Tasks, script, recruitment criteria |
| Recruitment | 3-10 days | Longer for specialized audiences |
| Pilot test | 1 session | Always pilot before full study |
| Individual session | 45-90 min | Including intro and debrief |
| Analysis per session | 1-2 hours | Manual; less with AI tools |
| Report synthesis | 1-3 days | Findings, priorities, recommendations |

```yaml
usability_test_validation:
  rules:
    - id: sample-size-qualitative
      severity: warning
      check: "Qualitative studies have 5-8 participants"
      rationale: "85% of usability problems found with 5 users"
    - id: sample-size-quantitative
      severity: error
      check: "Quantitative studies have minimum 20 participants"
      rationale: "Statistical significance requires larger samples"
    - id: task-scenario-format
      severity: error
      check: "Tasks use scenario format, not step-by-step instructions"
      bad: "Click Add to Cart, then go to checkout"
      good: "You want to buy this shirt as a gift. Complete the purchase."
    - id: pilot-test-required
      severity: warning
      check: "At least one pilot session before main study"
      rationale: "Catches script issues, technical problems"
    - id: no-leading-questions
      severity: error
      check: "Questions don't lead or suggest answers"
      bad: "Did you see the big blue button?"
      good: "How would you proceed from here?"
    - id: real-user-recruitment
      severity: error
      check: "Participants match target user profile"
      rationale: "Convenience samples miss critical issues"
    - id: neutral-facilitation
      severity: error
      check: "Moderator doesn't help, guide, or react to participant actions"
      rationale: "Intervention invalidates natural behavior"
    - id: unmoderated-instructions-clear
      severity: error
      applies_to: "unmoderated"
      check: "All instructions are self-contained and unambiguous"
      rationale: "No moderator to clarify confusion"
    - id: accessibility-testing-included
      severity: warning
      check: "Study includes participants using assistive technologies"
      rationale: "Screen reader users reveal keyboard and ARIA issues"
```

Qualitative testing: Run with 5-8 participants. You’re looking for why people struggle, not for precise measurements. Use a think-aloud protocol to hear their reasoning. Best during design iteration.

When to use:

  • Early-stage design exploration
  • Understanding mental models
  • Identifying usability problems
  • Informing design decisions

Output: Prioritized list of usability issues with severity ratings

Quantitative testing: Run with 20+ participants (40 recommended for tight confidence intervals). You’re measuring success rates, time on task, and error counts. Best for benchmarking or comparing designs.

When to use:

  • Comparing design variants
  • Establishing baselines
  • Validating improvements
  • Competitive benchmarking

Output: Statistical metrics with confidence intervals

For most teams, qualitative testing gives the best return on investment.


Moderated remote testing: Much like an in-person study, except that the facilitator and participant aren’t in the same physical location; sessions run over video conferencing with screen sharing.

Advantages:

  • Access to broader demographics
  • No travel time or costs
  • Participants in natural environment
  • Easier to schedule

Challenges:

  • Can’t see full body language
  • Technical issues (connectivity, audio)
  • Less rapport building
  • Harder to test physical products

Unmoderated remote testing: Participants complete tasks and answer questions at their own pace, on their own time. No moderator is present.

Advantages:

  • Lower cost (no moderator time)
  • Faster results (often within 24-48 hours)
  • Larger sample sizes feasible
  • No observer effect (Hawthorne Effect)
  • Global reach

Challenges:

  • No follow-up questions
  • Can’t clarify confusion
  • Instructions must be bulletproof
  • Less insight into reasoning
  • Potential for distraction

Best for: High-fidelity prototypes in final design stages, specific questions, quantitative validation.

In-person lab testing: The traditional approach, with participant and moderator in the same room, often with an observation room for stakeholders.

Advantages:

  • Full body language observation
  • Better rapport and trust
  • Immediate follow-up possible
  • Test physical products
  • Control over environment

Challenges:

  • Geographic limitations
  • Higher cost
  • Scheduling complexity
  • May feel artificial

Think-aloud is the most common technique for understanding user reasoning during usability tests. Several variants exist.

Concurrent think-aloud: Participants verbalize their thoughts while performing tasks.

Advantages:

  • Real-time insight into reasoning
  • Captures immediate reactions
  • Most complete verbalization

Disadvantages:

  • May alter behavior and performance
  • Can slow task completion
  • Feels unnatural to some participants
  • Negative effect on complex tasks

Retrospective think-aloud: Participants review a video recording of their session and explain their reasoning afterward.

Advantages:

  • No interference with task performance
  • More accurate time measurements
  • Better for complex tasks
  • More cognitive/interpretive comments

Disadvantages:

  • Requires twice the session time
  • Memory decay affects accuracy
  • Participants act as observers, not performers

Co-discovery (constructive interaction): Two participants work together, conversing naturally as they complete tasks.

Advantages:

  • More natural conversation
  • Less awkwardness than solo think-aloud
  • Reveals different perspectives
  • Good for collaborative features

Disadvantages:

  • Dominant personalities can skew results
  • May not represent solo use
  • Harder to analyze

What do you want to learn? Specific questions yield actionable findings.

Good research questions:

  • “Can users find the checkout button?”
  • “Do users understand what happens when they click ‘Save Draft’?”
  • “Where do users expect to find account settings?”

Vague research questions:

  • “Is the design good?”
  • “Do users like it?”
  • “Is it intuitive?”

Give participants scenarios, not instructions.

Good task (scenario-based):

“You want to buy this shirt as a gift for a friend. Their birthday is next week, so you need it delivered within 5 days. Complete the purchase.”

Bad task (step-by-step):

“Click Add to Cart, then go to checkout, then enter shipping information.”

Task writing tips:

  • Include realistic motivation
  • Don’t use UI terminology
  • Make success/failure measurable
  • Include enough context
  • Don’t reveal the answer

Test with users who match your actual audience. 5-8 participants typically reveal ~85% of usability problems.

Recruitment criteria:

  • Demographics matching target audience
  • Relevant experience level
  • Mix of tech proficiency
  • Exclude employees and close friends
  • Include assistive technology users

Over-recruit by 20-30% — remote studies have higher no-show rates.

Introduction (5 minutes):

  • Explain the process
  • Emphasize: “We’re testing the design, not you”
  • Get consent for recording
  • Ask them to think aloud

Tasks (30-45 minutes):

  • Present one task at a time
  • Stay neutral — don’t help, guide, or react
  • Note what they do, say, and where they struggle
  • Ask clarifying questions (not leading)

Debrief (5-10 minutes):

  • Overall impressions
  • Comparison to expectations
  • What was easy/difficult
  • Suggestions for improvement

Look for patterns across participants. Prioritize issues by severity (how bad?) and frequency (how common?).

Severity rating scale:

| Rating | Description | Impact |
|---|---|---|
| 4 - Critical | User cannot complete task | Must fix before launch |
| 3 - Serious | Major difficulty or frustration | Should fix before launch |
| 2 - Minor | Causes hesitation or confusion | Fix if time permits |
| 1 - Cosmetic | Noticed but doesn’t affect success | Low priority |

Prioritization matrix:

| | High Severity | Low Severity |
|---|---|---|
| High Frequency | Fix First (Critical) | Monitor (Annoying) |
| Low Frequency | Fix Soon (Painful) | Backlog (Minor) |
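
Expressed as code, the matrix is a lookup on two judgments. The cut-offs below for “high” severity and frequency are illustrative, not a standard:

```python
def priority(severity: int, frequency: float) -> str:
    """Map an issue onto the prioritization matrix above.
    severity: 1-4 rating; frequency: share of participants affected (0.0-1.0).
    The thresholds (severity >= 3, frequency >= 1/3) are illustrative only."""
    high_severity = severity >= 3
    high_frequency = frequency >= 1 / 3
    if high_severity and high_frequency:
        return "Fix First (Critical)"
    if high_severity:
        return "Fix Soon (Painful)"
    if high_frequency:
        return "Monitor (Annoying)"
    return "Backlog (Minor)"

print(priority(severity=4, frequency=0.6))   # Fix First (Critical)
print(priority(severity=2, frequency=0.5))   # Monitor (Annoying)
```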

```python
from dataclasses import dataclass

@dataclass
class TestPlan:
    moderation: str        # "moderated" or "unmoderated"
    location: str          # "remote" or "in-person"
    sample_size: tuple     # (minimum, recommended) participants
    think_aloud: str       # "concurrent" or "retrospective"

def select_test_type(constraints, goals, prototype, testing, tasks, user_segments) -> TestPlan:
    """Turn study constraints and goals into a test plan, following the guidance above."""
    # Determine moderation approach
    if goals.require_follow_up or goals.exploratory:
        moderation = "moderated"
    elif goals.quantitative and constraints.budget_limited:
        moderation = "unmoderated"
    elif prototype.complexity == "high" or prototype.requires_explanation:
        moderation = "moderated"
    else:
        moderation = "unmoderated"

    # Determine location
    if constraints.geographic_diversity_needed:
        location = "remote"
    elif testing.physical_product or testing.accessibility_focus:
        location = "in-person"
    elif constraints.budget_limited or constraints.time_limited:
        location = "remote"
    else:
        location = "remote"  # default to remote

    # Determine sample size as (minimum, recommended)
    if goals.quantitative:
        sample_size = (40, 40) if goals.statistical_significance == "high" else (20, 20)
    else:  # qualitative
        if user_segments.count > 1:
            per_segment = 5 * user_segments.count
            sample_size = (per_segment, per_segment)
        else:
            sample_size = (5, 8)

    # Determine think-aloud approach
    if tasks.complexity == "high":
        think_aloud = "retrospective"
    elif goals.understand_reasoning:
        think_aloud = "concurrent"
    elif goals.measure_performance:
        think_aloud = "retrospective"
    else:
        think_aloud = "concurrent"  # default when no goal dictates otherwise

    return TestPlan(moderation, location, sample_size, think_aloud)
```
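
For example (a sketch with hypothetical attribute names standing in for however your team records its constraints and goals), a budget-limited quantitative comparison resolves to an unmoderated remote study with 40 participants and retrospective think-aloud:

```python
from types import SimpleNamespace

# Hypothetical inputs for a quantitative comparison study on a tight budget
goals = SimpleNamespace(require_follow_up=False, exploratory=False,
                        quantitative=True, statistical_significance="high",
                        understand_reasoning=False, measure_performance=True)
constraints = SimpleNamespace(budget_limited=True, time_limited=True,
                              geographic_diversity_needed=False)
prototype = SimpleNamespace(complexity="low", requires_explanation=False)
testing = SimpleNamespace(physical_product=False, accessibility_focus=False)
tasks = SimpleNamespace(complexity="low")
user_segments = SimpleNamespace(count=1)

plan = select_test_type(constraints, goals, prototype, testing, tasks, user_segments)
print(plan)  # unmoderated, remote, (40, 40) participants, retrospective think-aloud
```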

According to the 2024 State of User Research report, 56% of UX researchers now use AI to support their work — a 36% increase from 2023.

| Capability | Maturity | Time Savings | Tools |
|---|---|---|---|
| Transcription | High | 80-90% | Looppanel, Dovetail, CoNote |
| Keyword tagging | High | 60-70% | Dovetail, Condens |
| Theme generation | Medium | 40-50% | Maze, Dovetail |
| Sentiment analysis | Medium | 50-60% | Maze, UXtweak |
| Summary generation | Medium | 30-40% | Looppanel, BuildBetter |
| Video clip creation | High | 70-80% | Looppanel, CoNote |
| Pattern detection | Low-Medium | 20-30% | Dovetail |
| Insight generation | Low | Limited | Various |

Current AI tools:

  • Cannot watch usability tests or process video effectively
  • Cannot factor researcher context into analysis
  • Struggle with visual cues and body language
  • Struggle with mixed-method research data
  • May produce vague or biased recommendations
  • Sometimes mis-cite sources

Recommendation: Treat AI as an accelerant, not a replacement. Rely on human synthesis for nuanced, actionable insights.

```
┌─────────────────┐
│  Test Session   │
│  (Recording)    │
└────────┬────────┘
         ▼
┌─────────────────┐
│  AI Transcribe  │◄── 80-90% time savings
│  + Timestamp    │
└────────┬────────┘
         ▼
┌─────────────────┐
│  AI Tagging     │◄── Surface themes
│  + Clustering   │
└────────┬────────┘
         ▼
┌─────────────────┐
│  Human Review   │◄── Verify, contextualize
│  + Synthesis    │
└────────┬────────┘
         ▼
┌─────────────────┐
│  Final Report   │
│+ Recommendations│
└─────────────────┘
```
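
The same workflow sketched in Python; the three callables are placeholders for whatever transcription service, tagging tool, and review process a team actually uses, not a specific vendor's API:

```python
from typing import Callable

def analyze_session(
    recording: str,
    transcribe: Callable[[str], list[str]],
    tag: Callable[[list[str]], dict[str, list[str]]],
    human_review: Callable[[dict[str, list[str]]], list[str]],
) -> dict:
    """Compose the workflow above: AI transcription and tagging feed a human
    synthesis step. Each callable stands in for a real tool or process."""
    transcript = transcribe(recording)     # AI: speech-to-text with timestamps
    themes = tag(transcript)               # AI: keyword tags clustered into themes
    findings = human_review(themes)        # Human: verify, contextualize, synthesize
    return {"transcript": transcript, "themes": themes, "findings": findings}
```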

  • Testing too late — No time to act on findings
  • Vague research questions — Can’t measure success
  • Wrong participants — Convenience samples miss issues
  • No pilot test — Script problems discovered too late
  • Leading questions — “Did you see the big blue button?”
  • Helping participants — Jumping in when they struggle
  • Reacting to errors — Sighing, nodding, or facial expressions
  • Too many tasks — Session fatigue affects results
  • Focusing on preferences — “I like blue” vs. behavioral data
  • Ignoring severity — Treating all issues equally
  • Cherry-picking — Selecting quotes that confirm assumptions
  • No prioritization — Long list without action guidance

Moderated testing platforms:

| Platform | Key Features | Price Range |
|---|---|---|
| Lookback | Real-time collaboration, notes, highlights | $99-299/mo |
| UserTesting | Participant panel, video analysis | Enterprise |
| Zoom/Meet | Simple setup, familiar to participants | Free-$20/mo |
| dscout | Mobile diary studies, video capture | Enterprise |

Unmoderated testing platforms:

| Platform | Key Features | Price Range |
|---|---|---|
| Maze | Rapid testing, 300 responses in 48h, AI analysis | $99-300/mo |
| UserTesting | Panel access, video recordings | Enterprise |
| Lyssna | Click tests, preference tests, prototype testing | $75-175/mo |
| UXtweak | Task analysis, session recordings | $50-150/mo |

Analysis and synthesis tools:

| Platform | Key Features | Price Range |
|---|---|---|
| Dovetail | Centralized insights, AI tagging, pattern detection | $29-79/user/mo |
| Looppanel | Auto-transcription, video clips, AI analysis | $30-80/mo |
| CoNote | Interview analysis, themes, deliverables | $0-195/mo |
| Condens | Auto-tagging, collaborative synthesis | $10-25/user/mo |

Sample moderated session script:

```
## Introduction (5 min)
"Thanks for joining us today. My name is [name] and I'll be walking
you through this session.
Before we begin, I want to emphasize: we're testing the design, not you.
There are no wrong answers. If something is confusing, that's valuable
feedback for us.
I'm going to ask you to share your screen and think aloud as you work —
tell me what you're looking at, what you're thinking, what you expect
to happen.
Do I have your permission to record this session? The recording is only
for our team to review. [Get consent]
Do you have any questions before we start?"
## Warm-up Questions (3 min)
- "Can you tell me a bit about your role?"
- "Have you used [product category] before?"
- "Walk me through a typical day when you might need [task area]"
## Tasks (30-45 min)
[Present one task at a time. Don't read step-by-step instructions.]
"Imagine this scenario: [context]. Starting from this screen,
please [goal]. Remember to think aloud."
[After each task:]
- "How did that go?"
- "Was that what you expected?"
- "On a scale of 1-5, how difficult was that?"
## Debrief (5 min)
- "Overall, how would you describe this experience?"
- "What stood out as easy or difficult?"
- "Is there anything you expected to find but didn't?"
- "Any final thoughts or suggestions?"
"Thank you so much for your time today. Your feedback is incredibly
valuable."

Performance metrics:

| Metric | What It Measures | How to Calculate |
|---|---|---|
| Success rate | Task completion | (Successes / Attempts) × 100 |
| Time on task | Efficiency | Start to completion (seconds) |
| Error count | Error frequency | Mistakes per task |
| Lostness | Navigation efficiency | (Unique pages / Optimal) - 1 |
| Assists | Independence | Moderator interventions |

Self-reported metrics:

| Metric | What It Measures | Industry Benchmark |
|---|---|---|
| SUS (System Usability Scale) | Overall usability | 68 = average |
| SEQ (Single Ease Question) | Task difficulty | 5.5/7 = acceptable |
| Net Promoter Score | Likelihood to recommend | 0 = neutral |
| SUPR-Q | Website UX quality | Percentile ranks |
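
SUS scoring is easy to get wrong: odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5 to reach the 0-100 scale. A reference implementation:

```python
def sus_score(responses: list[int]) -> float:
    """Compute a System Usability Scale score from ten 1-5 responses.
    Odd items (1st, 3rd, ...) contribute response - 1; even items contribute
    5 - response; the total is scaled by 2.5 onto a 0-100 range."""
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses on a 1-5 scale")
    total = sum((r - 1) if i % 2 == 0 else (5 - r) for i, r in enumerate(responses))
    return total * 2.5

print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 2]))  # 80.0 -> above the 68 average
```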

For quantitative metrics with 40 participants at 95% confidence:

  • Task success rate 80%: 95% CI = 65% to 90%
  • Mean time on task 45s: 95% CI depends on variance

At 90% confidence level (acceptable for UX research):

  • Margin of error 15% requires 28 users
  • Margin of error 20% requires 15 users
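
These figures follow from the usual binomial interval math. A small sketch using the adjusted Wald (Agresti-Coull) interval, a common choice for the small samples typical of usability studies:

```python
from math import sqrt

def adjusted_wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Adjusted Wald (Agresti-Coull) confidence interval for a task success rate.
    z = 1.96 gives ~95% confidence; use z = 1.645 for ~90%."""
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

low, high = adjusted_wald_ci(successes=32, n=40)   # 80% observed success rate
print(f"{low:.0%} to {high:.0%}")                  # roughly 65% to 90%
```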

Nielsen Norman Group’s 2024 research on AI in UX found that current AI tools excel at transcription and tagging but cannot yet process video effectively or factor researcher context into analysis. Recommendations lean toward using AI as an accelerant while relying on human synthesis for nuanced insights.

Nielsen surveyed 217 UX practitioners and found the average usability study uses 11 participants — more than twice the recommended 5 for qualitative testing. This suggests either conservative risk management or conflation of qualitative and quantitative goals.

Post-pandemic, remote moderated testing has become the default approach. Video conferencing tools have matured, and AI-powered analysis tools have streamlined transcription and synthesis. The 2024 State of User Research report shows 56% of researchers using AI tools.

Research continues to show trade-offs between concurrent and retrospective think-aloud. Van Den Haak et al. found concurrent think-aloud negatively affects task performance on complex tasks, while retrospective yields more cognitive/interpretive comments but suffers from memory decay.


Foundational Work:

Sample Size Research:

Remote Testing:

AI Tools:

Think-Aloud Protocol:

Official Resources: