
The Shape of User Experience

  • Writer: Bahareh Jozranjbar
  • Dec 8
  • 12 min read

A Practical Guide to Probability Distributions in UX Research


Most UX research teams now live in a world of metrics.

Conversion rates, task success, time on task, churn, NPS, feature adoption, “rage clicks”, scroll depth. We A/B test them, segment them, and present them in slide decks every week.

But under the surface, most of us still treat all of these metrics as if they came from the same simple shape: a nice symmetric bell curve.

In the textbook world, variables are continuous, symmetric and unbounded. In real products, time on task cannot be negative, counts of errors are whole numbers, success is a yes or no outcome, and many of our “scores” are really 1 to 5 buttons on a screen.

When we force bell curve tools onto non-bell-curve data we are not just being a bit imprecise. We can:

  • Underestimate how often extreme frustration happens

  • Miss real improvements because we looked only at means

  • Dramatically misjudge how many users we need in a study

This post is a practical guide to the shapes that actually show up in UX and HCI. We will not try to turn you into a statistician. Instead, the goal is simple:

When you look at a metric, you should have an instinctive sense of what kind of distribution it probably follows and what that implies for analysis.

Once you know the generative story behind each type, model choice becomes much less mysterious.


1. Start With The Generative Story, Not The Curve

A distribution is not just a curve that happens to fit your histogram. It is a description of a process.

In UX, the underlying processes often fall into a few common families:

  • Binary choices: A user converts or does not convert. Login succeeds or fails. A task is completed or abandoned.

  • Counts: Number of errors, number of help clicks, number of tickets, number of sessions. These are discrete, non-negative and often very noisy because people are different.

  • Latencies and durations: Time on task, reaction time, session length, fixation time. These are continuous, strictly positive and usually skewed to the right.

  • Bounded proportions and percentages: Completion rates, conversion rates, scroll depth, attention ratios, video completion percent. These live between 0 and 1 (or 0 to 100 percent).

  • Ordinal ratings: Likert items, NPS responses, satisfaction or perceived difficulty. These are ordered categories, not real numbers with equal spacing.

When you ask “what distribution should I use”, what you are really asking is “what is the most reasonable story for how this metric is generated”.

For example, when you model time on task you are not just fitting a curve. You are describing a process with a hard lower bound (people cannot respond faster than nerve signals travel) and no real upper bound (people can get distracted, confused, or interrupted). That process tends to create long right tails and makes the “average time” a dangerous summary if you do not respect the shape.


2. Binary Outcomes: Bernoulli And Binomial

The smallest unit of quantitative UX is a yes or no event:

Did the user complete checkout?

Did they find the setting?

Did they click the call to action?

Each single trial is a Bernoulli event. It has one parameter p, the probability of success. The variance is p(1 - p), which already teaches you something important: binary metrics are most variable around 50 percent and much more stable when they are near zero or one. That is why estimating a 0.5 click-through rate precisely is much harder than estimating a 0.99 login success rate with the same n.

When you test n users and count how many succeed, that total follows a Binomial distribution. This is the shape behind questions like “out of 10 users, how many finished the flow”.


2.1 The “Magic Number 5” And Why It Fails For Edge Cases

The famous “test with 5 users, find 85 percent of problems” rule is really a Binomial statement.

If a problem affects proportion p of all users, the probability you see it at least once in a study of size n is:

P(detect) = 1 − (1 − p)^n

For p around 0.31 and n = 5, you get roughly 84 percent. That is where the heuristic comes from.

But plug in smaller p and the story changes:

  • p = 0.10, n = 5 → about 41 percent chance to see the problem

  • p = 0.01, n = 5 → about 5 percent chance

So “5 users is enough” only holds for common issues. For subtle bugs that affect 1 percent of users, a small study almost certainly sees nothing. The math says so.
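
If you want to plug in your own numbers, the formula is a one-liner. Here is a minimal Python sketch; the p and n values are just illustrative:

```python
# Probability of seeing a problem at least once in a study of n users,
# when the problem affects a proportion p of all users.
def p_detect(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

for p in (0.30, 0.10, 0.01):
    print(f"p = {p:.2f}: " + ", ".join(f"n = {n}: {p_detect(p, n):.0%}" for n in (5, 12, 50)))
```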

This is why it helps to separate:

  • Formative testing:

    Small samples. Goal: find big, common issues.

  • Summative or reliability testing:

    Larger samples. Goal: estimate rates, catch rare events.

If an edge case never appears in your 12-person lab study, that might be completely consistent with it hurting thousands of people in production.


2.2 Completion Rates And Confidence Intervals

When you say “80 percent of users completed the task” you are estimating the population Bernoulli probability p from a Binomial sample.

The usual “Wald” interval, p̂ ± z · sqrt(p̂(1 − p̂) / n), relies on a normal approximation that breaks badly when n is small or p̂ is near zero or one. With 0 out of 5 successes, or 10 out of 10, it collapses to a zero-width interval that looks far too confident.

Adjusted methods like the Agresti–Coull interval effectively “add two successes and two failures” before computing the interval. For the small n typical in UX, these adjusted intervals behave much better and give more honest uncertainty bands around task success rates.
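
Here is a minimal sketch of both intervals in Python, assuming the usual z = 1.96 for 95 percent coverage; the 10-out-of-10 example shows how the Wald interval collapses while Agresti–Coull still reports honest uncertainty:

```python
import math

def wald_ci(successes: int, n: int, z: float = 1.96):
    """Classic Wald interval: misbehaves for small n or extreme p-hat."""
    p_hat = successes / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def agresti_coull_ci(successes: int, n: int, z: float = 1.96):
    """Adjusted interval: roughly 'add two successes and two failures' first."""
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)

print(wald_ci(10, 10))           # (1.0, 1.0) -- zero-width, far too confident
print(agresti_coull_ci(10, 10))  # roughly (0.68, 1.0) -- honest uncertainty
```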


3. Count Data: Poisson, Negative Binomial And Excess Zeros

Now leave yes/no land and ask “how many times”:

How many errors did the user make?

How many times did they click “Help”?

How many sessions did they start this month?

These are counts. They live in 0, 1, 2, 3 and so on.


3.1 Poisson As A Starting Point

The Poisson distribution is the basic starting point for counts. It assumes a constant underlying rate λ and independent events. Its key property is equidispersion:

Mean = Variance = λ

That is a strong assumption. It is sometimes reasonable for rare, independent events, like safety-critical failures or random error events in a fairly homogeneous group.

In UX, you might use Poisson as a first pass to model:

  • Number of form errors per submission

  • Number of API errors per minute

  • Number of rage clicks on a single element

But in real products users are not identical.


3.2 Overdispersion And The Negative Binomial

In practice, you will often compute the mean and variance of your count data and find:

Variance >> Mean

This is overdispersion. It usually arises because one or both of these are true:

  • Skill heterogeneity

    Beginners make many more errors than experts.

  • Motivation heterogeneity

    Power users hammer a feature, casual users barely touch it.

If you force a Poisson model onto overdispersed data, your standard errors will be too small. You will think your differences are “significant” when they are just noise.

The Negative Binomial distribution fixes this by introducing a dispersion parameter in addition to the mean. A nice way to think about it is a “Poisson with a random user-level rate”: each user has their own Poisson rate, and those rates themselves follow a distribution.

As the dispersion tends to zero, the Negative Binomial collapses back to Poisson. As dispersion grows, the tails get heavier and the model can handle noisier, more uneven behavior.

A very simple rule of thumb:

  • If mean and variance are about equal, Poisson might be fine

  • If variance is much bigger than the mean, move to Negative Binomial

Ignoring this check is one of the main sources of false positives in count based A/B tests.
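
A minimal sketch of that check, with a statsmodels Negative Binomial fit tacked on; the click counts are made up and the model is intercept-only, just to show the mechanics:

```python
import numpy as np
import statsmodels.api as sm

# Help clicks per user in one week (made-up numbers, deliberately uneven).
clicks = np.array([0, 0, 1, 0, 2, 0, 0, 7, 1, 0, 0, 15, 3, 0, 1])

mean, var = clicks.mean(), clicks.var(ddof=1)
print(f"mean = {mean:.2f}, variance = {var:.2f}, ratio = {var / mean:.1f}")

# Ratio near 1: Poisson may be fine. Ratio well above 1: overdispersion,
# so reach for a Negative Binomial instead (intercept-only fit shown here).
X = np.ones((len(clicks), 1))
nb = sm.NegativeBinomial(clicks, X).fit(disp=0)
print(nb.params)  # exp of the intercept is the mean rate; alpha is the dispersion
```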


3.3 When Zero Is Ambiguous: Zero Inflated And Hurdle Models

Digital products are full of zeros. Most users do not click a specific banner, do not use an advanced feature, do not file a ticket this week.

But “0” can mean at least two different things:

  • Structural zero: this user never had a chance or is not the type

    For example, free plan users cannot use a premium feature.

  • Sampling zero: this user could have done it, but just did not this time

    For example, a sports fan did not open the app on Monday.

Zero inflated models (ZIP, ZINB) combine:

  • A binary model for “is this user in the at risk group or not”

  • A count model (often Poisson or Negative Binomial) for users who are at risk

Hurdle models are similar but assume zeros come only from the binary part. Once someone crosses the hurdle (count greater than zero), the count model takes over on positive values.

This is a powerful way to separate adoption from intensity. For example with a “Save to collection” feature:

  • The binary part tells you: how many users even start using it

  • The count part tells you: how heavily they use it once they do

A simple average of “saves per user” blends those stories into a single blurry number.
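
A hand-rolled, hurdle-style decomposition makes the point without any model fitting; the save counts below are made up:

```python
import numpy as np

# Weekly "Save to collection" counts per user (made-up numbers).
saves = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 9, 0, 0, 0, 4, 0, 1])

adopters = saves > 0
adoption_rate = adopters.mean()     # binary part: who uses the feature at all
intensity = saves[adopters].mean()  # count part: how heavily, given any use
blended = saves.mean()              # the single blurry number

print(f"adoption: {adoption_rate:.0%}, intensity: {intensity:.1f} saves/adopter, "
      f"blended average: {blended:.2f} saves/user")
```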


4. Time And Latency: Lognormal, Ex-Gaussian, Gamma

Time is one of the most important UX currencies. The problem is that time data almost never behave like textbook normals.

They are strictly positive and usually skewed. Many people are “pretty fast”, a few people are very slow.

4.1 Lognormal: The Default For Task Times

A variable is lognormal when its logarithm is normal. If T is task time, then log(T) is often fairly bell shaped.

A useful story here is that task time is a product of many small multiplicative factors:

Perceive element

Understand it

Decide what to do

Move hand or cursor

Recover from any confusion

Small multiplicative delays accumulate and create a long tail.

For practitioners, this suggests a simple workflow:

  1. Take the log of time on task

  2. Analyze log times with t tests, ANOVA or regression

  3. Transform summaries back to the original scale

This effectively works with geometric means instead of arithmetic means. The geometric mean is a much more stable “typical” time when a few users occasionally take ten times longer than everyone else.
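
That workflow is only a few lines with scipy; the task times below are made up, with one slow outlier in design A:

```python
import numpy as np
from scipy import stats

# Task times in seconds for two hypothetical designs (made-up numbers).
design_a = np.array([12, 14, 15, 18, 21, 24, 30, 95])
design_b = np.array([11, 12, 13, 14, 16, 18, 22, 25])

log_a, log_b = np.log(design_a), np.log(design_b)
result = stats.ttest_ind(log_a, log_b)          # step 2: test on the log scale

geo_a = np.exp(log_a.mean())                    # step 3: back-transform to
geo_b = np.exp(log_b.mean())                    # geometric means in seconds
print(f"geometric means: A = {geo_a:.1f}s, B = {geo_b:.1f}s, p = {result.pvalue:.3f}")
```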


4.2 Ex-Gaussian: Separating Motor Speed From Cognitive Slowness

In some UX and HCI work, especially reaction time studies or micro-interactions, the shape of the latency distribution is not just a nuisance, it is the signal.

Ex-Gaussian distributions combine:

  • A normal part (with mean mu and standard deviation sigma)

  • An exponential tail (with parameter tau)

This lets you distinguish:

  • Baseline sensory-motor speed (mu, sigma)

  • Occasional long pauses, lapses or heavy cognitive load (tau)

You might find that one design has a slightly slower base speed but fewer extreme delays, while another is usually fast but sometimes induces very long stalls. A simple mean comparison will say “no difference”. Ex-Gaussian parameters will say “it is faster on average, but only by burning attention and increasing rare but severe slowdowns”.
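
If you want to try this, scipy's exponnorm is one way to fit an Ex-Gaussian; its shape parameter K corresponds to tau divided by sigma. The latencies here are simulated, not real data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated latencies: a normal "base speed" plus an exponential tail of long stalls.
latencies = rng.normal(400, 50, 500) + rng.exponential(150, 500)  # milliseconds

K, loc, scale = stats.exponnorm.fit(latencies)
mu, sigma, tau = loc, scale, K * scale  # scipy's K equals tau / sigma
print(f"mu = {mu:.0f} ms, sigma = {sigma:.0f} ms, tau = {tau:.0f} ms")
```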


4.3 Gamma: Waiting Times And Evidence Accumulation

Gamma distributions are another flexible family for positive only data. They often show up when you think of a decision as “waiting until enough evidence accumulates”.

Examples:

  • How long until a user feels they have enough information to buy

  • How long until someone decides to abandon a search

Gamma and lognormal can look very similar. Tools like AIC can help pick between them. Conceptually, Gamma is more natural when you think in terms of “accumulation to a threshold”, lognormal when you think in terms of multiplicative stages.
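
A minimal sketch of that comparison with scipy, fitting both families to the same (simulated) decision times and computing AIC by hand:

```python
import numpy as np
from scipy import stats

def aic(loglik: float, k: int) -> float:
    return 2 * k - 2 * loglik

# Simulated "time to decide" data in seconds (not real measurements).
times = np.random.default_rng(1).gamma(shape=3.0, scale=8.0, size=400)

g_params = stats.gamma.fit(times, floc=0)     # shape, loc (fixed at 0), scale
ln_params = stats.lognorm.fit(times, floc=0)  # sigma, loc (fixed at 0), scale

aic_gamma = aic(stats.gamma.logpdf(times, *g_params).sum(), k=2)
aic_lognorm = aic(stats.lognorm.logpdf(times, *ln_params).sum(), k=2)
print(f"AIC gamma = {aic_gamma:.1f}, AIC lognormal = {aic_lognorm:.1f}")
# Lower AIC is preferred, but near-identical fits should not be over-interpreted.
```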


5. Proportions And Percentages: Beta And Beta Regression

Many core UX metrics are ratios:

  • Task completion rate

  • Conversion rate

  • Scroll depth ratio

  • Share of time spent on key content

They live between 0 and 1. Linear models on percentages have two big problems:

  • They can happily predict impossible values like 120 percent

  • They assume constant variance, while real variance is highest in the middle and squeezed near 0 and 1

The natural home for these metrics is the Beta distribution, which lives on (0, 1) and can take on many shapes: bell shaped, U shaped, J shaped, heavily skewed and so on.

Beta regression lets you model a proportion as a function of predictors while respecting the bounds and the changing variance.

In an A/B test on attention ratio (time on content / time on page), you might use Beta regression with layout, device and content type as predictors. You can then choose link functions that match the shape:

  • Logit link for more symmetric data

  • Log log or complementary log log for highly skewed data

If your data include many exact zeros and ones, you can either nudge them slightly inward with a simple transformation or use zero-one inflated Beta models that explicitly treat 0 and 1 as special masses, similar to zero-inflated count models.
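
As a small illustration, here is a scipy sketch that applies the usual inward-nudge transformation and fits a single Beta distribution to attention ratios (made-up numbers); for full Beta regression with predictors you would reach for a dedicated implementation such as R's betareg or statsmodels' BetaModel:

```python
import numpy as np
from scipy import stats

# Attention ratios per session, including exact 0 and 1 (made-up numbers).
ratios = np.array([0.0, 0.12, 0.35, 0.41, 0.52, 0.58, 0.63, 0.77, 0.85, 1.0])

# Nudge exact 0s and 1s inward with the common (y * (n - 1) + 0.5) / n squeeze.
n = len(ratios)
squeezed = (ratios * (n - 1) + 0.5) / n

a, b, loc, scale = stats.beta.fit(squeezed, floc=0, fscale=1)
print(f"alpha = {a:.2f}, beta = {b:.2f}, implied mean = {a / (a + b):.2f}")
```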


6. Ordinal Ratings: Likert, NPS And SUS

Survey data is everywhere in UX, and most of it is ordinal.

“Strongly disagree” to “strongly agree” are ordered categories. The distance between 1 and 2 is not guaranteed to be the same as between 4 and 5.

Two practical levels of truth:

  • For single items, treating 1 to 5 as a real number is risky. A mean of 3.5 could mean a cluster around the middle or a polarized split between 1 and 5.

  • For composite scores that sum several items, like SUS, the total often behaves close enough to normal for standard methods when sample size is not tiny.

The more principled way to analyze single item ratings is ordinal logistic regression, often called cumulative link models.

These models assume there is an underlying continuous satisfaction variable and a set of thresholds that cut it into categories. They let you estimate how a design shift moves the entire latent distribution without pretending that the step from 2 to 3 is identical to the step from 3 to 4.
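
For example, with statsmodels (version 0.13 or later ships OrderedModel), a cumulative link model on a hypothetical single-item rating looks like this:

```python
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Hypothetical single-item ease-of-use ratings (1-5) under an old and a new design.
df = pd.DataFrame({
    "rating": pd.Categorical([2, 3, 3, 4, 5, 1, 2, 2, 3, 4, 5, 5, 4, 3, 2, 4],
                             categories=[1, 2, 3, 4, 5], ordered=True),
    "new_design": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
})

model = OrderedModel(df["rating"], df[["new_design"]], distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())  # one shift coefficient for the design, plus the thresholds
```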

For SUS, it also helps to remember that the distribution is typically skewed high and that the average in industry is around 68, not 50. Many teams now map scores to percentiles or letter grades (“C”, “A minus”) or adjectives (“OK”, “Good”, “Excellent”) to give stakeholders an intuitive feel for where a product sits in the broader landscape.


7. Heavy Tails: Power Users, Communities And “Whales”

Engagement data in communities, creator platforms, or games often follows heavy tailed distributions.

The classic 90–9–1 pattern:

  • 90 percent of users mostly lurk

  • 9 percent contribute occasionally

  • 1 percent generate most of the content or revenue

In such cases, the “average user” is a misleading concept. A single person can produce as many posts as hundreds of casual users combined. In extreme cases, the variance or even the mean can be unstable in the mathematical sense.

Here, “outliers” are not noise to be trimmed. They are the business.

Metrics that work better in this world include:

  • Medians

  • 90th and 99th percentiles

  • Inequality measures like the Gini coefficient

These help you reason about participation health and concentration without pretending your distribution is nicely centered and symmetric.
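
All three are easy to compute directly; the post counts below are made up to mimic a 90–9–1 community:

```python
import numpy as np

# Posts per user in a hypothetical community (made-up, heavy-tailed numbers).
posts = np.array([0] * 900 + [3] * 90 + [250] * 10)

def gini(x: np.ndarray) -> float:
    """Gini coefficient: 0 = perfectly even, values near 1 = highly concentrated."""
    x = np.sort(x.astype(float))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

print(f"mean = {posts.mean():.2f}, median = {np.median(posts):.0f}, "
      f"p90 = {np.percentile(posts, 90):.0f}, p99 = {np.percentile(posts, 99):.0f}, "
      f"gini = {gini(posts):.2f}")
```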

In practice, you often see truncated power laws or lognormal tails rather than perfect power laws, simply because there are physical limits to how much a person can do in a day. That truncation itself can help you detect suspicious activity. If a small group of accounts exceeds the plausible human limit, you might be looking at bots rather than super-fans.


8. Retention And Churn: Survival Analysis And Weibull

In subscription products, the key question is often not “how many users do we have” but “how long do they stay”.

Standard regression on “months retained” has a big problem: censoring. At the moment of analysis, many users are still active. You know they have stayed at least this long, but you do not yet know when they will leave.

If you drop all active users from your analysis, you bias toward churners. If you pretend their current tenure is their final tenure, you underestimate true retention.

Survival analysis is built to handle this. It treats “still active at the time of analysis” as real information rather than as a missing value.

  • Kaplan–Meier curves give you a nonparametric estimate of the survival function over time and let you compare cohorts visually.

  • Parametric models like Weibull let you estimate hazard functions, which tell you how churn risk changes over time.

Weibull is flexible enough to capture different churn stories:

  • Decreasing hazard: the longer someone stays, the safer they are, which often happens with complex tools that have a learning curve.

  • Constant hazard: churn risk is roughly constant over time.

  • Increasing hazard: churn risk grows as novelty fades or as life circumstances change, which is common for products with a natural life cycle.

These insights feed directly into lifetime value estimates, onboarding design and re-engagement strategy.
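
As a sketch of the mechanics, here is what the Kaplan–Meier and Weibull fits look like with the lifelines package, on made-up tenure data:

```python
import numpy as np
from lifelines import KaplanMeierFitter, WeibullFitter

# Tenure in months; 1 means the user churned, 0 means still active (right-censored).
durations = np.array([1, 2, 2, 3, 5, 6, 6, 8, 12, 12, 14, 18, 24, 24, 30])
churned   = np.array([1, 1, 1, 1, 1, 0, 1, 1,  0,  1,  0,  1,  0,  0,  0])

kmf = KaplanMeierFitter().fit(durations, event_observed=churned)
print(kmf.median_survival_time_)  # nonparametric "half-life" of the cohort

wf = WeibullFitter().fit(durations, event_observed=churned)
print(wf.rho_)  # rho_ < 1: hazard falls over time; rho_ > 1: hazard rises
```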


9. Choosing And Checking A Distribution

You do not have to guess the right distribution in the dark.

The usual workflow looks something like this:

  1. Start from the generative story

  2. Fit a reasonable candidate distribution or model

  3. Check diagnostics: residual plots, Q–Q plots, rootograms for counts, worm plots for more complex cases

  4. Adjust if you consistently see skew or heavy tails that the model is not capturing

For example, a hanging rootogram compares observed count frequencies to what your Poisson or Negative Binomial model would expect. If the bar at zero is way off, you might need a zero inflated model. If mid-range counts are systematically under or over predicted, you might need a different mean structure or distribution.
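
A bare-bones version of that check is just observed versus expected frequencies; a real rootogram draws the same comparison on a square-root scale. The counts here are made up:

```python
import numpy as np
from scipy import stats

# Made-up error counts per session, with more zeros than Poisson would expect.
counts = np.array([0] * 120 + [1] * 40 + [2] * 15 + [3] * 10 + [4] * 8 + [5] * 7)

lam = counts.mean()
observed = np.bincount(counts, minlength=6)[:6]
expected = stats.poisson.pmf(np.arange(6), lam) * len(counts)

for k, (o, e) in enumerate(zip(observed, expected)):
    print(f"count {k}: observed {o:4d}, Poisson expects {e:6.1f}")
# A large gap at zero is the classic signature of zero inflation.
```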

The goal is not to chase a perfect fit, but to avoid obviously wrong ones.


10. A Simple Decision Tree For UX Metrics

You can summarize much of this into a quick mental checklist when you see a new metric.

  1. Is the data discrete or continuous?

    • Discrete, small integers

      Are they yes or no outcomes? → Think Bernoulli or Binomial

      Are they counts? → Compare mean and variance

      Roughly equal → Poisson could work

      Variance much bigger → Negative Binomial, possibly with zero inflation

    • Continuous

      Is it time or latency? → Lognormal as a default, Ex-Gaussian or Gamma if you care about components

      Is it bounded 0 to 1? → Beta or Beta regression, maybe zero-one inflation

      Is it time until churn or conversion? → Survival analysis, often Weibull or related models

  2. Is the outcome an ordered rating?

    • Single item → treat as ordinal, cumulative link models

    • Composite scores → sometimes acceptable to treat as approximately normal

  3. Does the distribution look heavy-tailed, with a few extreme users?

    • Avoid relying on means alone

    • Use quantiles and inequality measures

Once you align the shape of your model with the shape of your data, you move from “throwing stats at a dashboard” to actually modeling user behavior.

The reward is not just nicer plots. You get:

  • More honest uncertainty around your metrics

  • Fewer false positives in A/B tests

  • Better understanding of where and how designs fail

  • Clearer stories to tell stakeholders about what is really happening in the product

The normal distribution still has its place. It is just not the default shape of user experience.


At PUX Lab, this distribution-first approach is how we actually run our studies in practice, and it’s a workflow we also apply directly for partner teams. We begin every project by identifying the generative process behind each UX metric before selecting any statistical model, whether the data are binary, count-based, bounded, ordinal, time-based, or heavy-tailed. From there, we design both the study and the analysis accordingly, using methods such as Negative Binomial and zero-inflated models for heterogeneous engagement data, log-time or Ex-Gaussian models for latency metrics, and survival analysis for churn and retention. If your team wants to move beyond default averages and one-size-fits-all testing, we can run these analyses for you and translate the results into clear, actionable insights that are statistically rigorous and defensible.

 
 
 
