A/B Testing Calculator Guide: Significance, Sample Size, Bayes

Published April 11, 2026 · 11 min read

Most A/B tests do not fail because the variant was wrong. They fail because the test was stopped too early, the sample size was never calculated, or someone peeked at the numbers on day three and called a winner. The math of A/B testing is not new — statistical inference has been around for a century — but the specific pitfalls of product experimentation get rediscovered every quarter by teams running their first tests.

This guide walks through what an A/B test calculator is actually computing, how to size a test before running it, how to interpret results without fooling yourself, and the frequentist versus Bayesian choice that determines what your "winner" sign even means. By the end you should be able to plan a test, run it, and read the output with calibrated confidence.

What an A/B test is actually measuring

An A/B test randomly splits users into groups and shows each group a different version. You measure a conversion rate (or any other metric) per group and check whether the difference is larger than you would expect from random noise. That is literally it. Everything else — multi-armed bandits, stratification, variance reduction, holdout groups — is an elaboration.

The "difference larger than noise" question is a statistical inference problem. Given the sample sizes, the conversion rates, and an assumption about the underlying distribution, you compute the probability of seeing a difference at least as large as the observed one if the variants were truly identical. Small probability → the data would be surprising under pure chance → evidence for a real difference.

Sample size planning

The most common A/B testing mistake is running a test without knowing how big a sample it needs. Test too few people and you cannot detect the effect even if it exists. Test too many and you waste time and traffic. The calculation is standard and every team should run it before touching a test tool.

Four inputs go into a sample size calculation:

Baseline conversion rate. What is the control currently converting at?
Minimum detectable effect (MDE). How small a lift do you want to be able to detect reliably? Detecting a 0.1% lift needs vastly more traffic than detecting a 5% lift.
Statistical power. The probability you will detect a real effect. 0.8 is standard (80% chance of catching a true difference of the MDE size).
Alpha. Your false positive tolerance. 0.05 is standard (5% chance of calling a winner when there is no real effect).

Feed these into Sample Size Calculator and you get the required sample per variant. For a 5% baseline with a 10% relative lift (5.0% to 5.5%), 80% power, and 5% alpha in a two-sided test, you typically need around 31,000 users per variant, a number that often shocks teams who were planning to run tests on 500 people.
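To see what a calculator of this kind is doing with the four inputs, here is a minimal sketch using the standard normal-approximation formula for a two-sided two-proportion test. The function name and parameters are illustrative, not taken from any particular tool, and real calculators may use slightly different variance terms or corrections, so expect small differences in the output.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per variant (two-sided two-proportion test)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)             # rate under the lift you want to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# 5% baseline, 10% relative lift, 80% power, 5% alpha
required = sample_size_per_variant(0.05, 0.10)
print(required)
```

Changing `relative_lift` or `baseline` and re-running is a quick way to get a feel for how sensitive the requirement is to each input.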

The economic implications are worth sitting with. If your baseline is 2% and you want to detect a 1 percentage point absolute lift (2% to 3%), you need around 3,800 per variant at 80% power and 5% alpha. If you want to detect a 0.2 percentage point absolute lift (2% to 2.2%), you need around 81,000 per variant. Smaller effects cost traffic quadratically: halving the lift you want to detect roughly quadruples the required sample. This is why "let's test everything" as a strategy fails: low-volume products cannot statistically detect small improvements, and small improvements are most of what you find.
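The quadratic cost falls out of the usual normal-approximation formula for a two-sided two-proportion test. As a sketch, with p_1 the baseline rate, p_2 the rate under the lift you want to detect, alpha the false positive tolerance, and 1 - beta the power (this notation is introduced here for illustration):

```latex
n \approx \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\left[\,p_1(1-p_1) + p_2(1-p_2)\,\right]}{(p_2 - p_1)^{2}}
```

The lift p_2 - p_1 appears squared in the denominator, so cutting the detectable lift in half multiplies n by roughly four, with a small adjustment because the variance term in the numerator also shifts.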

Statistical significance, in plain language

Statistical significance is shorthand for "the observed difference is unlikely to have happened by chance alone if the variants were actually equivalent." The p-value is a number between 0 and 1: the probability of seeing a difference at least as extreme as the one you observed, assuming the variants really are equivalent. If p < 0.05, conventional practice calls the result "significant."

What significance does not mean:

It does not mean the variant is better by the observed amount. The true effect could be smaller or larger than what you measured; the observed lift is a point estimate with uncertainty around it.
It does not mean the effect is practically meaningful. A significant 0.01% lift is not worth shipping.
It does not mean the test should have stopped the moment p dropped below 0.05. Peeking changes the math, which is the next section.
It does not mean p > 0.05 proves the variants are equivalent. It means you did not have enough evidence to distinguish them.

Compute significance for a standard two-proportion test with A/B Test Calculator, which takes visitor and conversion counts per variant and returns a p-value, confidence interval, and lift estimate. For the broader lift calculation ("variant converted at 5.2% vs control at 4.8% = 8.3% relative lift"), Conversion Rate Calculator handles the conversions and the percentage math. Evan Miller's classic posts on how not to run an A/B test are required reading on the statistical pitfalls.
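For readers who want to see the arithmetic behind a two-proportion calculator, the sketch below computes a p-value, a confidence interval for the absolute difference, and the relative lift. It assumes a pooled z-test for the p-value and an unpooled (Wald) interval for the difference, which is one common choice; the function name and the visitor counts are made up for illustration and chosen to match the 4.8% versus 5.2% example.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(visitors_a, conversions_a, visitors_b, conversions_b,
                        confidence=0.95):
    """Two-sided z-test for a difference in conversion rates (B minus A)."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    diff = rate_b - rate_a

    # Pooled standard error for the test: under the null the two rates are equal.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_pooled)))

    # Unpooled standard error for the interval on the absolute difference.
    se_unpooled = sqrt(rate_a * (1 - rate_a) / visitors_a
                       + rate_b * (1 - rate_b) / visitors_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

    return {"p_value": p_value, "ci": ci, "relative_lift": diff / rate_a}


# Hypothetical counts: 4.8% control vs 5.2% variant on 10,000 visitors each.
result = two_proportion_test(10_000, 480, 10_000, 520)
print(result)
```

With these hypothetical counts the relative lift is about 8.3%, yet the p-value comes out well above 0.05 and the interval includes zero: a lift that sounds large can still be indistinguishable from noise at this sample size.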

The peeking problem

Here is the counterintuitive one. If you peek at a frequentist A/B test repeatedly and stop as soon as you see "significant," your false positive rate is much higher than the nominal 5%. The reason: random noise wanders. Over many check-ins, the test will randomly cross the significance threshold at some point even if there is no real effect, and if you stop there, you have fooled yourself.
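A small simulation makes the inflation concrete. The sketch below runs A/A tests, where both arms share the same true conversion rate, and compares two stopping rules: calling a winner at the first peek that looks significant versus reading the result once at the end. All parameters (number of tests, peeks, users per peek, the 10% rate, the seed) are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided threshold for alpha = 0.05


def is_significant(n, conv_a, conv_b):
    """Pooled two-proportion z-test with n users in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled <= 0 or pooled >= 1:
        return False
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    return abs(conv_b - conv_a) / n / se > Z_CRIT


def simulate_aa_tests(n_tests=400, peeks=10, users_per_peek=100, rate=0.10, seed=1):
    """Return (false positive rate with peeking, false positive rate at final look)."""
    rng = random.Random(seed)
    flagged_with_peeking = 0
    flagged_at_end = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        crossed = False
        for _ in range(peeks):
            # Both arms draw from the same rate, so any "winner" is a false positive.
            conv_a += sum(rng.random() < rate for _ in range(users_per_peek))
            conv_b += sum(rng.random() < rate for _ in range(users_per_peek))
            n += users_per_peek
            significant_now = is_significant(n, conv_a, conv_b)
            crossed = crossed or significant_now
        flagged_with_peeking += crossed
        flagged_at_end += significant_now
    return flagged_with_peeking / n_tests, flagged_at_end / n_tests


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"stop at first significant peek: {peeking_rate:.1%}")
print(f"single look at the end:         {fixed_rate:.1%}")
```

The single-look rate should land near the nominal 5%, while the stop-at-first-crossing rate is noticeably higher, and it keeps growing as you add more peeks.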

The fix is one of the following:

Fixed-horizon testing. Decide the sample size in advance, run to that size, then read the result exactly once. This is the classical approach and it gives you the nominal false positive rate you planned for.
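Sequential testing. Use methods designed for peeking, such as the sequential probability ratio test (SPRT) or always-valid inference. These adjust the decision thresholds for the fact that you will check multiple times.
Bayesian approach. Bayesian methods tolerate peeking in a specific sense: the posterior probability remains a valid summary of the evidence, given the data and the prior, however often you look. Stopping the moment a threshold is crossed can still produce more wrong calls than the threshold alone suggests, so the stopping rule deserves thought here too. More on this next.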

The Adobe Target documentation and VWO's research publications both cover peeking in practical terms for commercial tools. The short version: if your testing platform lets you peek, assume the platform either uses a sequential or Bayesian method under the hood or is misleading you, and check which one.

Bayesian A/B testing

Bayesian A/B testing asks a different question. Instead of "what is the probability of seeing data this extreme if the variants were equal?" (frequentist), it asks "what is the probability that variant B is better than variant A, given the data?" (Bayesian). The second question is what business stakeholders actually want, which is why Bayesian methods have become popular in commercial A/B testing tools.

Bayesian methods require a prior: your belief about the effect before collecting data. In practice, most tools use a "weak prior" that expresses "I don't know anything, let the data drive," which gives results similar to frequentist methods for large samples. When samples are small or you have prior knowledge about the domain, Bayesian methods let you incorporate what you know and get tighter answers.

The headline feature of Bayesian A/B testing is that peeking is allowed. You can monitor a test continuously and stop when the posterior probability of B beating A crosses some threshold (say, 95%). This is not a free lunch — you are still trading off time for confidence — but the math handles continuous monitoring gracefully where frequentist testing does not.
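As a rough sketch of where "probability that B beats A" comes from, the code below assumes a Beta-Binomial model with a uniform Beta(1, 1) prior on each arm and estimates the probability by Monte Carlo sampling from the two posteriors. Commercial tools may use different priors or models; the function name, the prior, and the counts are assumptions for illustration.

```python
import random


def prob_b_beats_a(visitors_a, conversions_a, visitors_b, conversions_b,
                   prior_alpha=1.0, prior_beta=1.0, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A | data) under Beta priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Each arm's posterior is Beta(prior_alpha + conversions,
        #                              prior_beta + non-conversions).
        sample_a = rng.betavariate(prior_alpha + conversions_a,
                                   prior_beta + visitors_a - conversions_a)
        sample_b = rng.betavariate(prior_alpha + conversions_b,
                                   prior_beta + visitors_b - conversions_b)
        wins += sample_b > sample_a
    return wins / draws


# Hypothetical counts: 4.8% control vs 5.2% variant on 10,000 visitors each.
probability = prob_b_beats_a(10_000, 480, 10_000, 520)
print(f"P(B beats A) is about {probability:.2f}")
```

With these made-up counts the estimate should come out near 0.9: suggestive, but short of a 95% threshold. Raising `prior_alpha` and `prior_beta` in proportion to what you already believe about the conversion rate is how an informed prior would enter this sketch.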

Pitfalls that shipped to production

The novelty effect

Users react to change, then habituate. A variant can look like a winner for the first three days because users are exploring the new thing, then converge with the control. Run tests for at least a full weekly cycle, ideally two.

Multiple metrics, no correction

If you test 20 metrics at alpha = 0.05, you should expect about one false positive on average by pure chance, even if the variant changes nothing. Pick one primary metric, declare it in advance, and treat everything else as exploratory.
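The arithmetic is short enough to check directly. The sketch below assumes the metrics are independent (real metrics are usually correlated, which changes the exact figure) and also shows a Bonferroni-adjusted threshold, one simple correction that this guide does not otherwise cover, for teams that do need to make claims on several metrics.

```python
def familywise_error_rate(n_metrics, alpha=0.05):
    """Chance of at least one false positive across independent null metrics."""
    return 1 - (1 - alpha) ** n_metrics


def bonferroni_alpha(n_metrics, alpha=0.05):
    """Per-metric threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_metrics


fwer = familywise_error_rate(20)
per_metric = bonferroni_alpha(20)
print(f"chance of at least one false positive: {fwer:.1%}")
print(f"Bonferroni per-metric alpha: {per_metric}")
```

So with 20 independent metrics the chance of at least one spurious "win" is close to two in three, which is why a single pre-declared primary metric is the simpler discipline.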

Selection bias in who sees the test

If you run the test only on logged-in users and extrapolate to all users, your result does not generalize. Be explicit about the population and randomization unit.

Segment slicing after the fact

"The test was not significant overall, but it was significant on mobile." Slicing by segment after the fact is a flavor of p-hacking. If you expected the effect to differ by segment, pre-register the segments.

Ignoring implementation bugs

Before trusting a test, verify that assignment is actually 50/50, that the variants are being shown correctly, and that the metric is being tracked consistently for both. "The test ran for two weeks before we noticed event tracking was broken in the variant" is a real post-mortem that happens more than it should.
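The 50/50 check can be automated. One common approach, often called a sample ratio mismatch check, is a chi-square goodness-of-fit test on the visitor counts per arm. The sketch below is a minimal version under that assumption; the function name and the example counts are hypothetical.

```python
from math import erfc, sqrt


def sample_ratio_p_value(visitors_a, visitors_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the traffic split."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi_sq = ((visitors_a - expected_a) ** 2 / expected_a
              + (visitors_b - expected_b) ** 2 / expected_b)
    # With 1 degree of freedom the chi-square tail probability is erfc(sqrt(x / 2)).
    return erfc(sqrt(chi_sq / 2))


ok_split = sample_ratio_p_value(10_050, 9_950)     # small imbalance
bad_split = sample_ratio_p_value(10_000, 10_600)   # 3% more traffic in one arm
print(ok_split, bad_split)
```

A very small p-value here says the split itself is unlikely under correct randomization, which usually points to an assignment or tracking bug. Teams often use a stricter threshold than 0.05 for this check (0.001 is a typical choice) because it runs on every test.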

Effect size vs significance

A statistically significant 0.1% lift on a feature that cost two engineer-weeks is a net loss. Always evaluate significance, effect size, and opportunity cost together.

Adjacent tools worth bookmarking

Related statistics and marketing calculators: Statistics Calculator for descriptive stats on raw conversion data, Z-Score Calculator for standardizing individual observations, Confidence Interval Calculator for reporting effect size uncertainty instead of p-values alone, Percentage Calculator for communicating lifts in non-technical terms, Chi-Square Calculator when your test involves three or more variants with categorical outcomes, and UTM Builder for consistently tagging variant landing pages.

Related pillar guide

This cluster post is part of the SEO and marketing track. For the broader foundation on auditing and improving your site's performance, see SEO Audit Toolkit: A Masterclass.

FAQ

How long should an A/B test run?

Until you have reached the sample size you pre-computed, and for at least one full business cycle (typically one to two weeks). Stop on sample size, not on time, unless calendar effects would bias the result.
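Turning a sample size into a calendar estimate is simple division plus rounding. The sketch below assumes traffic is split evenly across variants and rounds up to whole weeks so every weekday is represented equally; the function name, the one-week minimum, and the traffic figures are illustrative assumptions.

```python
from math import ceil


def days_to_run(sample_per_variant, daily_eligible_visitors, variants=2, min_days=7):
    """Days needed to reach the planned sample, rounded up to whole weeks."""
    days_for_sample = ceil(sample_per_variant * variants / daily_eligible_visitors)
    weeks = ceil(max(days_for_sample, min_days) / 7)
    return weeks * 7


# Hypothetical: 31,000 users per variant, 4,000 eligible visitors per day.
planned_days = days_to_run(31_000, 4_000)
print(planned_days)  # 21
```

If the answer comes back as several months, that is a signal to revisit the minimum detectable effect rather than to run the test anyway.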

Can I run multiple A/B tests at the same time?

Yes, usually. If the tests are in different parts of the product and do not interact, they can run concurrently without inflating each other's error rates. Be cautious when two tests touch the same conversion funnel.

What if my sample is too small to reach significance?

Options: run longer, test a bigger change (larger expected effect = smaller sample needed), use Bayesian methods with an informed prior, or accept that you do not have the traffic to detect the effect you want and skip the test.

What is a minimum detectable effect?

The smallest true effect size your test has adequate power to detect. It is not a guarantee; it is the threshold below which a real effect is likely to appear non-significant due to insufficient sample.
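The flip side of the MDE is power at a given sample size. As a rough sketch using the same normal approximation as a standard two-proportion sample size formula, the function below estimates the chance of detecting a given relative lift with a fixed number of users per variant. The function name and parameters are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(baseline, relative_lift, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Ignores the chance of a significant result in the wrong direction,
    # which is small whenever power is reasonable.
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)


# 5% baseline, 10% relative lift, only 500 users per variant
power_small = approximate_power(0.05, 0.10, 500)
print(f"{power_small:.1%}")
```

For the 5% baseline and 10% relative lift used earlier, 500 users per variant gives power only slightly above alpha itself, which means a true effect of that size would almost always show up as "not significant."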

Should I use one-tailed or two-tailed tests?

Two-tailed, almost always. A one-tailed test only looks for an effect in one direction, so it implicitly assumes the variant could not possibly lose. That is rarely true, and being surprised by a losing variant is valuable information.

Closing thought

A/B testing is mostly discipline and a little math. The discipline is pre-registering metrics and sample sizes, running to completion, and resisting the urge to slice after the fact. The math is the twenty minutes with a calculator that saves you from the two weeks of running a test that was never going to have enough power.