
Metrics & Benchmarks

What gets measured gets managed — but only if you’re measuring the right things. Good metrics connect user behavior to real outcomes. Bad metrics create perverse incentives and waste engineering effort.

This page provides industry benchmarks, measurement frameworks, and validation rules for UX metrics that both humans and AI systems can use to assess interface quality.


| Metric | Average | Good | Excellent | Source |
| --- | --- | --- | --- | --- |
| Task Success Rate | 78% | 80-90% | >90% | MeasuringU |
| SUS Score | 68 | 70-80 | >80 | Usability.gov |
| SEQ Score | 5.5/7 | 6.0/7 | 6.5+/7 | MeasuringU |
| NPS (SaaS) | +36 | +40-50 | +50+ | CustomerGauge |
| NPS (B2B) | +38 | +40-60 | +60+ | Retently |
| NPS (eCommerce) | +55 | +55-70 | +70+ | CustomerGauge |
| LCP | | ≤2.5s | ≤1.5s | Google |
| INP | | ≤200ms | ≤100ms | Google |
| CLS | | ≤0.1 | ≤0.05 | Google |

| Category | Measures | Examples |
| --- | --- | --- |
| Behavioral | What users do | Task success, time on task, error rate |
| Attitudinal | How users feel | SUS, NPS, SEQ, CSAT |
| Performance | System quality | LCP, INP, CLS, Time to Interactive |
| Business | Outcomes | Conversion, retention, revenue per user |

ux_metrics_validation:
  rules:
    - id: task-success-threshold
      severity: warning
      check: "Task success rate >= 78% for critical tasks"
      threshold: 78
      action: "Investigate usability issues if below threshold"
    - id: sus-score-minimum
      severity: warning
      check: "SUS score >= 68 (industry average)"
      threshold: 68
      action: "Below average indicates usability problems"
    - id: seq-score-minimum
      severity: warning
      check: "SEQ score >= 5.5/7 for all tasks"
      threshold: 5.5
      action: "Low SEQ indicates task difficulty"
    - id: lcp-good
      severity: error
      check: "LCP <= 2.5 seconds at 75th percentile"
      threshold_ms: 2500
      action: "Optimize loading performance"
    - id: inp-good
      severity: error
      check: "INP <= 200ms at 75th percentile"
      threshold_ms: 200
      action: "Optimize interaction responsiveness"
    - id: cls-good
      severity: error
      check: "CLS <= 0.1 at 75th percentile"
      threshold: 0.1
      action: "Fix layout shifts"
    - id: baseline-before-changes
      severity: error
      check: "Baseline measurement exists before design changes"
      rationale: "Cannot measure improvement without baseline"
    - id: sample-size-sufficient
      severity: warning
      check: "Quantitative metrics have n >= 20 responses"
      rationale: "Statistical significance requires adequate sample"
    - id: guardrail-monitoring
      severity: warning
      check: "Guardrail metrics monitored alongside north star"
      rationale: "Prevent gaming at users' expense"

A north star metric is the one metric that best captures the value you deliver to users. Everything else ladders up to it.

| Company | North Star | Why It Works |
| --- | --- | --- |
| Spotify | Time spent listening | Reflects value delivered |
| Airbnb | Nights booked | Captures marketplace success |
| Slack | Messages sent in channels | Indicates team engagement |
| Duolingo | Daily active learners | Tracks learning progress |
| Notion | Weekly active editors | Reflects collaboration value |

Criteria for a good north star:

  • Meaningful — Reflects real user value
  • Measurable — Quantifiable and trackable
  • Movable — Your team can influence it
  • Leading — Predicts business outcomes

Input metrics are leading indicators you can influence directly. They predict changes in your north star.

| Metric | Predicts | Example Target |
| --- | --- | --- |
| Activation rate | Retention | 40% in first week |
| Feature adoption | Engagement | 30% try new feature |
| Session frequency | Lifetime value | 3+ sessions/week |
| Onboarding completion | Activation | 70% complete flow |
| Time to first value | Conversion | <5 minutes |

Guardrail metrics are constraints that ensure you’re not gaming your north star at users’ expense.

| Guardrail | Protects Against | Alert Threshold |
| --- | --- | --- |
| Error rates | Broken experiences | >1% increase |
| Page load time | Performance degradation | >500ms increase |
| Support volume | Confusion, frustration | >10% increase |
| Accessibility compliance | Exclusion | Any regression |
| Bounce rate | Poor content match | >5% increase |
| Unsubscribe rate | Over-communication | >0.5% per send |

Rule: If your guardrails decline while your north star rises, something’s wrong.
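
As a minimal sketch, that rule can be expressed as a check that compares each guardrail against the alert thresholds in the table above; the `Snapshot` shape, field names, and the point-vs-relative interpretation of the thresholds are illustrative assumptions.

```typescript
// Guardrail deltas between a baseline period and the current period.
// Field names and thresholds follow the table above; the shape itself is an assumption.
interface Snapshot {
  errorRatePct: number;          // % of sessions with errors
  pageLoadMs: number;            // p75 page load time
  supportTicketsPerWeek: number;
  bounceRatePct: number;
  a11yViolations: number;        // automated audit violation count
}

function guardrailAlerts(baseline: Snapshot, current: Snapshot): string[] {
  const alerts: string[] = [];
  if (current.errorRatePct - baseline.errorRatePct > 1) alerts.push("Error rate up more than 1 point");
  if (current.pageLoadMs - baseline.pageLoadMs > 500) alerts.push("Page load time up more than 500ms");
  if (current.supportTicketsPerWeek > baseline.supportTicketsPerWeek * 1.10) alerts.push("Support volume up more than 10%");
  if (current.bounceRatePct - baseline.bounceRatePct > 5) alerts.push("Bounce rate up more than 5 points");
  if (current.a11yViolations > baseline.a11yViolations) alerts.push("Accessibility regression");
  return alerts;
}

// A rising north star only counts as a win when this list comes back empty.
```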


Google’s UX research team (Kerry Rodden, Hilary Hutchinson, Xin Fu) created the HEART framework to measure user satisfaction and product success at scale.

| Dimension | Definition | Example Signals | Example Metrics |
| --- | --- | --- | --- |
| Happiness | User attitudes, satisfaction | Survey responses | NPS, SUS, CSAT |
| Engagement | Level of involvement | Session depth | Sessions/week, features used |
| Adoption | New users gained | Sign-ups, upgrades | % new users, feature adoption |
| Retention | Users returning | Return visits | Churn rate, DAU/MAU ratio |
| Task Success | Efficiency, effectiveness | Task completion | Success rate, time on task |

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    GOALS     │  →  │   SIGNALS    │  →  │   METRICS    │
│ What success │     │  Indicators  │     │    How to    │
│  looks like  │     │ of progress  │     │   measure    │
└──────────────┘     └──────────────┘     └──────────────┘

Example:

Goal: Users find what they need quickly
Signal: Low search refinement rate
Metric: % searches with 0-1 refinements

| Scenario | Primary Dimensions |
| --- | --- |
| New feature launch | Adoption, Task Success |
| Retention problem | Retention, Engagement |
| Conversion optimization | Task Success, Happiness |
| Growth initiative | Adoption, Retention |
| Quality assessment | Happiness, Task Success |

The most fundamental UX metric: what percentage of users complete a given task?

Industry benchmark: 78% average (MeasuringU analysis of 1,100+ tasks)

| Performance | Task Success Rate |
| --- | --- |
| Excellent | >90% |
| Good | 80-90% |
| Acceptable | 70-80% |
| Poor | <70% |

Calculation:

Task Success Rate = (Successful Completions / Attempts) × 100

Types of success:

  • Binary — Completed or not
  • Levels — Complete, partial, failed
  • Assisted — Track moderator interventions
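
A small TypeScript sketch of both the binary and leveled calculations; the half-credit weight for partial completions is a convention chosen here, not a fixed standard.

```typescript
type Outcome = "complete" | "partial" | "failed";

// Binary success: partial completions count as failures.
function binarySuccessRate(outcomes: Outcome[]): number {
  const successes = outcomes.filter(o => o === "complete").length;
  return (successes / outcomes.length) * 100;
}

// Leveled success: partial completions earn half credit.
function leveledSuccessRate(outcomes: Outcome[]): number {
  const credit = outcomes.reduce(
    (sum, o) => sum + (o === "complete" ? 1 : o === "partial" ? 0.5 : 0),
    0
  );
  return (credit / outcomes.length) * 100;
}

const observed: Outcome[] = ["complete", "complete", "partial", "failed", "complete"];
console.log(binarySuccessRate(observed));  // 60
console.log(leveledSuccessRate(observed)); // 70
```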

How long does it take? Context determines whether shorter is better.

| Task Type | Shorter is Better | Longer May Be Good |
| --- | --- | --- |
| Checkout | ✓ | |
| Error recovery | ✓ | |
| Content consumption | | ✓ |
| Learning features | | ✓ |
| Data entry | ✓ | |

Calculation:

Time on Task = Task End Time - Task Start Time
Report: Mean, Median, 90th percentile
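
A short sketch of that reporting, assuming times are collected in seconds; the nearest-rank percentile used here is one common convention.

```typescript
// Nearest-rank percentile on a sorted array (one common convention).
function percentile(sorted: number[], p: number): number {
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

// Mean, median, and 90th percentile of time on task, in seconds.
function summarizeTimesOnTask(seconds: number[]) {
  const sorted = [...seconds].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, t) => sum + t, 0) / sorted.length;
  return { mean, median: percentile(sorted, 50), p90: percentile(sorted, 90) };
}

// One slow outlier (120s) drags the mean well above the median,
// which is why the median and p90 are reported alongside it.
console.log(summarizeTimesOnTask([42, 55, 38, 61, 47, 120, 50, 44, 58, 49]));
```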

How often do users make mistakes? Distinguish between types.

| Error Type | Definition | Example |
| --- | --- | --- |
| Slip | Right intention, wrong action | Clicking the wrong button |
| Mistake | Intended action, based on a wrong understanding | Misinterpreting a label |
| Failure | Cannot complete the task | Giving up |

Calculation:

Error Rate = (Tasks with Errors / Total Tasks) × 100

Measures navigation efficiency — how directly users find what they need.

Calculation:

Lostness = sqrt((N/S - 1)² + (R/N - 1)²)

Where:
N = Number of different pages visited
S = Total pages visited (including revisits)
R = Minimum number of pages needed

Score interpretation:
0.0 = Perfect navigation
0.4 = Some confusion
0.5+ = Lost
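
A TypeScript sketch of the calculation, taking the ordered list of visited page IDs and the optimal path length; the example data is illustrative.

```typescript
// Lostness from an ordered list of visited page IDs and the optimal path length.
// 0 = perfect navigation; scores of roughly 0.4+ indicate the user was getting lost.
function lostness(visited: string[], minimumPagesNeeded: number): number {
  const S = visited.length;        // total pages visited, including revisits
  const N = new Set(visited).size; // different pages visited
  const R = minimumPagesNeeded;    // optimal path length
  return Math.sqrt((N / S - 1) ** 2 + (R / N - 1) ** 2);
}

// The optimal path was 3 pages, but the user made 6 visits across 4 distinct pages.
console.log(lostness(["home", "search", "results", "home", "results", "product"], 3)); // ≈ 0.42
```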

The System Usability Scale (SUS) is a 10-item questionnaire producing a score from 0-100. Quick, reliable, and widely benchmarked.

Scoring interpretation:

| Score | Grade | Percentile | Adjective |
| --- | --- | --- | --- |
| 90+ | A+ | Top 10% | Best imaginable |
| 80-89 | A | Top 20% | Excellent |
| 68-79 | B | Average | Good |
| 50-67 | C | Below average | OK |
| <50 | F | Bottom 25% | Poor |

Key benchmark: 68 is the average SUS score across all products.
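
The standard scoring procedure (subtract 1 from odd-numbered items, subtract even-numbered items from 5, sum, and multiply by 2.5) is easy to express in code. The sketch below scores one respondent; a product's score is the average across respondents.

```typescript
// SUS scoring for one respondent: 10 items answered on a 1-5 scale.
// Odd-numbered items are positively worded, even-numbered items negatively worded.
function susScore(responses: number[]): number {
  if (responses.length !== 10) throw new Error("SUS requires exactly 10 responses");
  const contributions = responses.map((r, i) =>
    i % 2 === 0 ? r - 1 : 5 - r // items 1,3,5,7,9 vs. items 2,4,6,8,10
  );
  return contributions.reduce((a, b) => a + b, 0) * 2.5; // 0-100, not a percentage
}

console.log(susScore([4, 2, 4, 1, 5, 2, 4, 2, 4, 2])); // 80, an "A" in the table above
```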

The Single Ease Question (SEQ) is one question asked immediately after a task: “How easy or difficult was this task?”

  • Scale: 1 (Very Difficult) to 7 (Very Easy)
  • Average: 5.5/7
  • Threshold: Below 5.0 indicates difficulty

Benefits:

  • Minimal burden on participants
  • Task-specific (unlike SUS)
  • Correlates with task success

The Net Promoter Score (NPS) asks one question: “How likely are you to recommend this to a friend?” (0-10 scale)

Calculation:

NPS = % Promoters (9-10) - % Detractors (0-6)
Range: -100 to +100
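
A TypeScript sketch of the calculation from raw 0-10 ratings; the sample data is illustrative.

```typescript
// NPS from raw 0-10 ratings: % promoters (9-10) minus % detractors (0-6).
function nps(ratings: number[]): number {
  const promoters = ratings.filter(r => r >= 9).length;
  const detractors = ratings.filter(r => r <= 6).length;
  return Math.round(((promoters - detractors) / ratings.length) * 100);
}

console.log(nps([10, 9, 9, 8, 7, 7, 6, 5, 10, 9])); // 5 promoters, 2 detractors, 10 responses -> +30
```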

Industry benchmarks (2024):

| Industry | Average NPS | Good | Excellent |
| --- | --- | --- | --- |
| SaaS | +36 | +40 | +50+ |
| B2B Software | +41 | +45 | +55+ |
| eCommerce | +55-62 | +60 | +70+ |
| B2B Overall | +38 | +45 | +60+ |
| B2C Overall | +49 | +55 | +70+ |

Caution: NPS measures sentiment, not usability. A product can have high NPS but poor task success.

The SUPR-Q is an 8-item questionnaire measuring four factors of website experience:

| Factor | What It Measures |
| --- | --- |
| Usability | Ease of use |
| Trust | Credibility, security |
| Appearance | Visual design quality |
| Loyalty | Intent to return/recommend |

Advantage: Provides percentile ranks against database of 100+ websites.

The UMUX-Lite is a short, 2-item usability survey:

  1. “[Product]’s capabilities meet my requirements”
  2. “[Product] is easy to use”

Scale: 7-point Likert
Conversion: Can be converted to a SUS-equivalent score


Core Web Vitals are Google’s metrics for user experience quality, used in search ranking since 2021.

| Metric | Measures | Good | Needs Improvement | Poor |
| --- | --- | --- | --- | --- |
| LCP | Loading | ≤2.5s | 2.5-4.0s | >4.0s |
| INP | Interactivity | ≤200ms | 200-500ms | >500ms |
| CLS | Visual stability | ≤0.1 | 0.1-0.25 | >0.25 |

Note: INP (Interaction to Next Paint) replaced FID (First Input Delay) in March 2024.

LCP (Largest Contentful Paint) measures loading performance — when the largest content element becomes visible.

Common causes of poor LCP:

  • Slow server response (TTFB)
  • Render-blocking JavaScript/CSS
  • Large unoptimized images
  • Client-side rendering delays

INP (Interaction to Next Paint) measures responsiveness — the latency of all interactions throughout the page lifecycle.

Common causes of poor INP:

  • Long JavaScript tasks blocking main thread
  • Large DOM size
  • Excessive event handlers
  • Third-party scripts

CLS (Cumulative Layout Shift) measures visual stability — unexpected movement of visible elements.

Common causes of poor CLS:

  • Images without dimensions
  • Ads/embeds without reserved space
  • Dynamically injected content
  • Web fonts causing FOIT/FOUT

To pass assessment, 75% of page visits must meet “Good” thresholds.

Passing = (Visits with "Good" score / Total visits) >= 0.75, evaluated per metric
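
In the field, these values are typically collected per visit with Google's web-vitals library and aggregated at the 75th percentile. A minimal sketch, assuming web-vitals v3+ (which exports onLCP, onINP, and onCLS; verify against the version you install) and an illustrative /analytics/web-vitals endpoint:

```typescript
import { onCLS, onINP, onLCP, type Metric } from "web-vitals";

// Send each final metric value to an analytics endpoint for p75 aggregation.
function report(metric: Metric): void {
  const body = JSON.stringify({
    name: metric.name,     // "CLS" | "INP" | "LCP"
    value: metric.value,   // ms for LCP/INP, unitless for CLS
    rating: metric.rating, // "good" | "needs-improvement" | "poor"
    page: location.pathname,
  });
  // sendBeacon keeps working while the page unloads; the endpoint path is illustrative.
  navigator.sendBeacon("/analytics/web-vitals", body);
}

onLCP(report);
onINP(report);
onCLS(report);
```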

Measure where you are before making changes. Without a baseline, you can’t prove improvement.

FUNCTION establishBaseline(metric):
    measurements = collectData(metric, days=14-30)
    baseline = {
        mean: average(measurements),
        median: median(measurements),
        p75: percentile(measurements, 75),
        p95: percentile(measurements, 95),
        sample_size: count(measurements),
        date_range: [start, end]
    }
    RETURN baseline

Compare against public benchmarks:

| Source | Benchmarks Available |
| --- | --- |
| MeasuringU | SUS, task success, time on task |
| Google CrUX | Core Web Vitals by industry |
| WebAIM Million | Accessibility errors |
| CustomerGauge | NPS by industry |

Compare against your own history:

Improvement = ((New Score - Baseline) / Baseline) × 100

Track:

  • Week-over-week changes
  • Release-over-release changes
  • Seasonal variations

Set meaningful targets, not arbitrary numbers.

Good targets:

  • “Improve task success from 72% to 80%”
  • “Reduce LCP from 3.2s to 2.5s”
  • “Achieve SUS score of 75 (from 65)”

Bad targets:

  • “Get 100% task success” (unrealistic)
  • “Be the best in industry” (vague)
  • “Improve all metrics” (unfocused)

FUNCTION selectMetrics(product_stage, goals):
    metrics = []

    // Always include task success for usability
    metrics.push("task_success_rate")

    // Select attitudinal metric
    IF goals.includes("overall_usability"):
        metrics.push("SUS")
    IF goals.includes("task_specific_feedback"):
        metrics.push("SEQ")
    IF goals.includes("loyalty_measurement"):
        metrics.push("NPS")

    // Select behavioral metrics based on stage
    IF product_stage == "new_feature":
        metrics.push("adoption_rate")
        metrics.push("feature_usage_frequency")
    IF product_stage == "retention_focus":
        metrics.push("DAU_MAU_ratio")
        metrics.push("churn_rate")
    IF product_stage == "growth":
        metrics.push("activation_rate")
        metrics.push("time_to_first_value")

    // Always include performance guardrails
    metrics.push("LCP", "INP", "CLS")

    // Always include error guardrails
    metrics.push("error_rate")

    RETURN metrics

  • Consistent events across platforms (web, mobile, API)
  • Standardized naming conventions
  • User identity resolution for cross-session tracking
  • Timestamp precision for time-based metrics

User Action → Event Collection → Processing → Storage → Dashboard
                (Real-time)       (Batch)    (Analytics)   (BI)
                     │                                      │
                     └───── Alert if threshold breached ────┘
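
As a sketch of the “consistent events” and “standardized naming” requirements above, one option is a shared event envelope that every platform emits; the field names and the object_action naming convention here are assumptions, not a standard.

```typescript
// Shared analytics event envelope; all names here are illustrative.
interface AnalyticsEvent {
  name: `${string}_${string}`; // e.g. "checkout_completed" (object_action convention)
  userId: string | null;       // resolved identity, null when anonymous
  anonymousId: string;         // device/session id used to stitch sessions together
  platform: "web" | "ios" | "android" | "api";
  timestamp: string;           // ISO 8601 with millisecond precision
  properties: Record<string, string | number | boolean>;
}

const event: AnalyticsEvent = {
  name: "checkout_completed",
  userId: "u_123",
  anonymousId: "anon_456",
  platform: "web",
  timestamp: new Date().toISOString(),
  properties: { cartValueUsd: 89.5, itemCount: 3, durationMs: 48200 },
};
```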

| Cadence | Activity | Metrics Focus |
| --- | --- | --- |
| Daily | Monitor dashboards | Anomalies, errors |
| Weekly | Team review | Trends, progress |
| Sprint | Retrospective | Feature impact |
| Monthly | Stakeholder report | Business outcomes |
| Quarterly | Benchmarking study | Industry comparison |

When metrics change unexpectedly:

  1. Verify data — Is the change real or a tracking bug?
  2. Identify scope — Which segments affected?
  3. Correlate events — What changed (deploy, external)?
  4. Root cause — Why did behavior change?
  5. Action — Fix, revert, or accept?

Vanity metrics are big numbers that don’t connect to outcomes.

| Vanity Metric | Better Alternative |
| --- | --- |
| Page views | Engaged sessions |
| Downloads | Activated users |
| Sign-ups | 7-day retained users |
| Time on site | Task completion |

Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”

Example: Targeting “sessions per user” leads to artificial engagement features that annoy users and hurt long-term retention.

Prevention: Use guardrail metrics alongside targets.

Optimizing for immediate conversion at the cost of retention.

Example: Dark patterns increase sign-ups but increase churn and damage trust.

Prevention: Track 90-day retention alongside conversion.

Tracking everything, acting on nothing.

Prevention: Limit to 3-5 key metrics per initiative. More data ≠ better decisions.


CustomerGauge 2024-2025 data shows median NPS remaining at 42, but 10 of 11 industries saw NPS decline. Only Manufacturing improved. Wholesale, Retail & eCommerce, and Digital Marketplaces took double-digit hits.

Google replaced First Input Delay (FID) with Interaction to Next Paint (INP) as a Core Web Vital in March 2024. INP measures all interactions throughout the page lifecycle, not just the first one.

Tools like Mixpanel, Amplitude, and Heap are incorporating AI for anomaly detection, funnel optimization suggestions, and natural language querying of analytics data. However, human interpretation remains essential for understanding causation.

The WebAIM Million study continues annual tracking, showing 95.9% of home pages have detectable WCAG 2 failures in 2024. Accessibility compliance is increasingly treated as a guardrail metric alongside performance.

