A/B Test Sample Size and Duration Guide

A practical guide to estimating A/B test sample size and duration using baseline rate, traffic, MDE, and sound planning assumptions.

Planning an experiment without estimating sample size and duration usually leads to the same problems: tests stop too early, run too long, or produce results nobody trusts. This guide gives you a practical framework for estimating what an A/B test needs before launch, using repeatable inputs such as baseline conversion rate, minimum detectable effect, traffic, and desired confidence. It is designed as a reference you can return to whenever your traffic mix, conversion rate, or business stakes change.

Overview

A/B testing works best when the decision rules are set before the first visitor enters the experiment. The two questions most teams ask are simple: how many users do we need, and how long will the test take? The answers are connected. Sample size tells you how much data each variant should collect. Duration translates that target into calendar time based on your traffic and allocation.

For most website owners, marketers, and CRO teams, the purpose of an A/B test sample size estimate is not mathematical perfection. It is planning discipline. A good estimate helps you decide whether a test is realistic, whether the expected lift is large enough to matter, and whether the experiment should focus on a high-volume micro-conversion or a lower-volume macro-conversion.

At a practical level, you can think of A/B test planning as a trade-off among five variables:

Baseline conversion rate: your current performance.
Minimum detectable effect: the smallest lift or decline worth detecting.
Confidence level: how much false-positive risk you accept.
Statistical power: how much false-negative risk you accept.
Traffic volume: the number of eligible users or sessions reaching the test.

If one of these changes, your A/B test duration changes too. That is why the estimate is worth revisiting regularly. A redesign, new traffic campaign, consent changes, seasonality, or tracking updates can all affect it.

One more point matters before any calculator enters the picture: your tracking has to be dependable. If purchases, leads, or add-to-cart events are misfiring, the sample size math will look precise while the underlying data remains unreliable. If you need to validate the setup first, see "How to Debug Broken Conversion Tracking Across GA4, GTM, and Ad Platforms" and "GTM Container Audit Checklist: Tags, Triggers, Variables, and Governance".

How to estimate

Use this section as your repeatable decision framework. The goal is not to memorize formulas. The goal is to define the inputs in a consistent way so a calculator, spreadsheet, or internal planning template produces a useful answer.

Step 1: Choose one primary metric

Every experiment should have one main success metric. That might be purchase rate, lead form completion rate, checkout completion rate, or subscription rate. Secondary metrics are useful for diagnosis, but sample size should be based on the primary outcome only.

If you switch your main metric mid-test, your original estimate no longer describes the experiment you are running.

Step 2: Find a stable baseline

Your baseline conversion rate should come from recent, comparable traffic. Avoid using a one-week spike or a blended annual average if the test applies to a specific page type, device group, or traffic source. If you are testing a product detail page, use the conversion rate for that page group or funnel step, not the sitewide average.

GA4 reporting can help here, especially if you segment by landing page, device, channel, and returning versus new users. For reporting hygiene, "GA4 for SEO Reporting: Metrics, Segments, and Dashboards Worth Watching" is a useful companion for structuring segmented views.

Step 3: Define the minimum detectable effect

The minimum detectable effect, often shortened to MDE, is the smallest relative or absolute change you care enough to act on. This is one of the most important experiment planning choices.

If the business would only ship a variation for a 10% relative lift, use that as the target. If even a 2% lift would justify implementation on a very high-revenue page, set a smaller MDE. Smaller effects require larger samples. Larger effects require fewer users but may overlook worthwhile gains.

A simple rule: do not choose an MDE because it makes the test shorter. Choose it because it reflects a meaningful business outcome.
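Calculators differ on whether they expect the MDE as a relative or an absolute value, so it helps to see both forms side by side. The following is a minimal Python sketch; the function names and the 8% baseline are illustrative assumptions, not part of any particular tool.

```python
def target_rate(baseline: float, relative_mde: float) -> float:
    """Conversion rate the variant must reach for a given relative lift."""
    return baseline * (1 + relative_mde)


def absolute_mde(baseline: float, relative_mde: float) -> float:
    """The same effect expressed as an absolute change in the rate."""
    return baseline * relative_mde


# Illustrative baseline only; substitute your own segment-level rate.
baseline = 0.08
for relative in (0.02, 0.10):
    print(
        f"relative MDE {relative:.0%}: "
        f"target rate {target_rate(baseline, relative):.2%}, "
        f"absolute change {absolute_mde(baseline, relative) * 100:.2f} points"
    )
```

On an 8% baseline, a 10% relative lift is a 0.8 percentage point absolute change. Entering "10" into a calculator that expects absolute points would describe a very different experiment.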

Step 4: Set confidence and power

Most teams use conventional defaults for a statistical significance test, such as a 95% confidence level and 80% power. You do not need to overcomplicate this unless you run a mature experimentation program with strict risk controls. The key is consistency. If every team uses different thresholds, test results become hard to compare.

Higher confidence and higher power both increase required sample size. In other words, greater certainty costs more traffic and more time.
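To get a feel for how much certainty costs, the sketch below compares threshold choices against the 95% / 80% defaults. It assumes a two-sided test and the common normal approximation, under which required sample size scales with the square of the summed critical values. The function names are hypothetical, and your testing platform may use a different method.

```python
from statistics import NormalDist


def z_scores(confidence: float = 0.95, power: float = 0.80) -> tuple[float, float]:
    """Critical values for a two-sided test at the given confidence and power."""
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return z_alpha, z_beta


def relative_sample_cost(confidence: float, power: float) -> float:
    """Sample size multiplier versus the 95% / 80% defaults (normal approximation)."""
    z_alpha, z_beta = z_scores(confidence, power)
    ref_alpha, ref_beta = z_scores(0.95, 0.80)
    return ((z_alpha + z_beta) / (ref_alpha + ref_beta)) ** 2


for confidence, power in [(0.90, 0.80), (0.95, 0.80), (0.95, 0.90), (0.99, 0.80)]:
    cost = relative_sample_cost(confidence, power)
    print(f"{confidence:.0%} confidence / {power:.0%} power: x{cost:.2f} sample")
```

Under this approximation, moving from 80% to 90% power at 95% confidence costs roughly a third more sample, with everything else held equal.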

Step 5: Estimate required users per variant

At this stage, a CRO calculator or spreadsheet will usually ask for:

Baseline conversion rate
Expected uplift or MDE
Confidence level
Power
Number of variants

The output is commonly shown as required sample size per variant. If you are testing A versus B, and the calculator returns 20,000 users per variant, the test needs about 40,000 eligible users in total.
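If you want to sanity-check a calculator, the sketch below uses one common approach: the normal-approximation formula for comparing two proportions with a two-sided test and equal allocation. This is an assumption about method, since calculators vary (some pool the variance or apply corrections), so expect small differences from any given tool. The function names are illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(
    baseline: float,
    relative_mde: float,
    confidence: float = 0.95,
    power: float = 0.80,
) -> int:
    """Approximate users needed in each variant of a two-variant test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


def total_sample(per_variant: int, variants: int = 2) -> int:
    """Total eligible users the experiment needs across all variants."""
    return per_variant * variants


per_variant = sample_size_per_variant(baseline=0.08, relative_mde=0.10)
print(f"per variant: {per_variant:,}, total: {total_sample(per_variant):,}")
```

The 8% baseline and 10% relative MDE are placeholder inputs. Halving the MDE roughly quadruples the result, which is why the MDE choice in Step 3 dominates the estimate.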

Step 6: Convert sample size into duration

Once you know the sample target, duration is a traffic problem. Use this planning logic:

Estimated test duration = required total sample / daily eligible traffic entering the experiment

Then adjust for traffic allocation. If only 50% of eligible traffic is sent into the test, the daily traffic entering the experiment halves, so the duration roughly doubles compared with full allocation.

Also account for business cycles. Most experiments should run through full weekly patterns. If your site behaves differently on weekends, paydays, or campaign launch days, the calendar estimate should be long enough to capture that rhythm.
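The duration logic above can be sketched in a few lines of Python. The function names and the `allocation` parameter are illustrative assumptions. The example figures reuse the 40,000 total from Step 5 and a hypothetical 2,000 eligible visitors per day.

```python
import math


def estimate_duration_days(
    total_sample: int,
    daily_eligible_traffic: float,
    allocation: float = 1.0,
) -> int:
    """Days needed to collect the total sample at a given traffic allocation."""
    if not 0 < allocation <= 1:
        raise ValueError("allocation must be in the range (0, 1]")
    daily_in_test = daily_eligible_traffic * allocation
    return math.ceil(total_sample / daily_in_test)


def round_up_to_full_weeks(days: int) -> int:
    """Extend a duration so it covers complete weekly cycles."""
    return math.ceil(days / 7) * 7


full = estimate_duration_days(40_000, 2_000)
half = estimate_duration_days(40_000, 2_000, allocation=0.5)
print(f"100% allocation: {full} days, planned as {round_up_to_full_weeks(full)} days")
print(f" 50% allocation: {half} days, planned as {round_up_to_full_weeks(half)} days")
```

Rounding up to whole weeks is a planning convention, not a statistical requirement. It keeps weekday and weekend behavior equally represented in both variants.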

Step 7: Pressure-test feasibility before launch

Before approving the test, ask three questions:

Can we realistically collect the sample in an acceptable timeframe?
Is the MDE large enough to matter but small enough to be plausible?
Is tracking stable enough that we trust the conversion data?

If the answer to any of these is no, change the test design. Common fixes include testing a higher-volume page, using a stronger variation, simplifying segmentation, or choosing an earlier funnel metric as the primary measure.
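One way to pressure-test feasibility is to turn the question around: given the traffic you can realistically collect, what is the smallest lift the test could detect? The sketch below does this with a simple bisection search over the normal-approximation formula for two proportions. Both the method and the function names are assumptions for illustration, and the traffic budget is hypothetical.

```python
from statistics import NormalDist


def required_per_variant(baseline, relative_mde, confidence=0.95, power=0.80):
    """Unrounded per-variant sample size (two-sided, equal allocation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2) + NormalDist().inv_cdf(power)
    return z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2


def smallest_detectable_lift(baseline, users_per_variant, confidence=0.95, power=0.80):
    """Smallest relative lift detectable with the available users per variant."""
    low = 1e-4
    high = 0.999 * (1 - baseline) / baseline  # keeps the target rate below 100%
    if required_per_variant(baseline, low, confidence, power) <= users_per_variant:
        return low
    # Required sample shrinks as the lift grows, so bisection converges.
    for _ in range(60):
        mid = (low + high) / 2
        if required_per_variant(baseline, mid, confidence, power) > users_per_variant:
            low = mid
        else:
            high = mid
    return high


# Hypothetical budget: 2,000 eligible visitors per day, 14 days, two variants.
users_per_variant = 2_000 * 14 // 2
lift = smallest_detectable_lift(0.08, users_per_variant)
print(f"Smallest detectable relative lift: {lift:.1%}")
```

If the detectable lift is larger than anything the variation could plausibly deliver, that is a signal to change the design rather than to launch and hope.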

Inputs and assumptions

This is where most A/B test planning goes wrong. The math can be fine while the assumptions are weak. Treat these inputs as editorial choices, not just calculator fields.

Baseline conversion rate

Baseline should reflect the exact audience eligible for the experiment. If mobile traffic converts materially differently from desktop traffic, and the test is mobile-only, use the mobile baseline. Blended rates can understate or overstate the needed sample.

Minimum detectable effect

MDE is a business decision disguised as a statistical input. Smaller MDEs mean more precision and longer run times. If your site has limited traffic, picking a very small MDE can create tests that never finish. On the other hand, an unrealistically large MDE can make weak ideas seem testable.

A practical approach is to ask: what is the smallest improvement that would change a roadmap decision?

User versus session counting

Some teams estimate using users, others sessions. The important thing is consistency between your baseline data, your calculator, and your experiment platform. If your conversion metric is user-based but the estimate is session-based, duration can be misread.

Traffic eligibility

Not all site traffic enters a test. Exclude pages, geographies, logged-in states, consent states, app traffic, or campaign traffic if they are outside scope. Duration estimates built on total site sessions are often too optimistic.

Variant allocation

A 50/50 split is common in a two-variant test. But many teams hold out some traffic for risk control, limit traffic during launch, or run uneven allocations. Every allocation decision affects duration.
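To see why allocation matters, the sketch below estimates group sizes for an uneven two-variant split. It assumes the same normal-approximation formula for two proportions, generalized so that each group contributes its own variance term. The function name and the `control_share` parameter are hypothetical.

```python
import math
from statistics import NormalDist


def sample_sizes_for_split(
    baseline: float,
    relative_mde: float,
    control_share: float = 0.5,
    confidence: float = 0.95,
    power: float = 0.80,
) -> tuple[int, int]:
    """Return (control_n, variant_n) for a two-variant test with a given split."""
    if not 0 < control_share < 1:
        raise ValueError("control_share must be strictly between 0 and 1")
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2) + NormalDist().inv_cdf(power)
    # Variance of the difference is p1(1-p1)/n1 + p2(1-p2)/n2,
    # with n1 = control_share * N and n2 = (1 - control_share) * N.
    total = z ** 2 * (
        p1 * (1 - p1) / control_share + p2 * (1 - p2) / (1 - control_share)
    ) / (p2 - p1) ** 2
    return math.ceil(total * control_share), math.ceil(total * (1 - control_share))


for share in (0.5, 0.8):
    control_n, variant_n = sample_sizes_for_split(0.08, 0.10, control_share=share)
    print(f"{share:.0%} control: {control_n:,} + {variant_n:,} = {control_n + variant_n:,}")
```

With these placeholder inputs, an 80/20 split needs around 1.6 times the total traffic of a 50/50 split, because the smaller group becomes the bottleneck.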

Conversion lag

If your users often convert days after first exposure, do not read early results as final. Lead generation, B2B funnels, and subscription products often have lag between visit and conversion. Your test may reach the planned sample before the primary conversions are fully observed.
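A simple way to plan for lag is to separate the last day of exposure from the earliest date you read results. The sketch below assumes a fixed lag window in days, which is a simplification, and all names and dates in it are hypothetical.

```python
from datetime import date, timedelta


def earliest_readout(start: date, run_days: int, conversion_lag_days: int) -> date:
    """First date on which users exposed on the final day have had the full lag window."""
    last_exposure = start + timedelta(days=run_days - 1)
    return last_exposure + timedelta(days=conversion_lag_days)


# Hypothetical plan: three-week run with a seven-day conversion window.
readout = earliest_readout(date(2024, 1, 1), run_days=21, conversion_lag_days=7)
print(f"Stop exposing users after day 21, read results on or after {readout}")
```

Choosing the lag window is itself an assumption. A reasonable starting point is the time within which most of your historical conversions occur after the first visit.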

Seasonality and campaign distortion

If a test runs mostly during a promotion, holiday period, or channel spike, or if the allocation between variants shifts partway through one, the result may reflect an unusual traffic mix instead of typical page performance. This is why a calendar-aware experiment planning process matters as much as sample size estimation itself.

Tracking quality

Even a well-designed test can fail if events are duplicated, blocked, or mapped inconsistently across platforms. If your experiment ties back to ad platform optimization, also review related implementation guides such as "Google Ads Conversion Tracking Checklist for Websites and Lead Forms" and "Meta Pixel Setup and Event Match Quality Audit Guide".

Privacy-aware measurement can reduce observable conversion volume depending on your setup and consent behavior. That does not make testing impossible, but it does mean historical baselines and current traffic may not be directly comparable. If your site recently changed consent behavior or moved to server-side collection, recalculate before launching a new experiment. For implementation context, see "Server-Side Tracking Setup Guide: When It Helps, What It Breaks, and How to Validate It".

Worked examples

The numbers below are illustrative. They are not universal benchmarks. The value is in the planning logic.

Example 1: Moderate-traffic lead generation page

Suppose a lead form landing page converts at 8%. You want to test a simpler form layout and would only implement it if it improved conversion by at least 10% relative. That means your target lift is from 8% to 8.8%.

You enter these assumptions into your CRO calculator:

Baseline conversion rate: 8%
Minimum detectable effect: 10% relative lift
Confidence: 95%
Power: 80%
Variants: 2

The calculator returns a required sample per variant. To estimate duration, divide the total required sample by the number of eligible daily visitors reaching the page. If the page gets 2,000 eligible visitors per day and traffic is split evenly, you can estimate the run length in days. Then pressure-test that estimate against weekly cycles and conversion lag.
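For a rough sense of scale, the example inputs can be worked through by hand. The derivation below assumes the common normal-approximation formula for comparing two proportions with a two-sided test and equal allocation, where p_1 is the baseline rate and p_2 is the target rate. Your calculator may use a slightly different method and return a somewhat different figure.

```latex
\[
\begin{aligned}
n &\approx \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\bigl[\,p_1(1-p_1) + p_2(1-p_2)\,\bigr]}{(p_2 - p_1)^{2}} \\[4pt]
  &\approx \frac{(1.960 + 0.842)^{2}\,(0.0736 + 0.0803)}{(0.088 - 0.080)^{2}}
   \approx 18{,}900 \text{ users per variant} \\[4pt]
\text{total} &\approx 2 \times 18{,}900 = 37{,}800 \\[4pt]
\text{duration} &\approx \frac{37{,}800}{2{,}000 \text{ per day}} \approx 19 \text{ days}
\end{aligned}
\]
```

Rounded up to complete weekly cycles, that is a three-week test under these assumptions.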

If the resulting duration feels too long, do not immediately lower the confidence threshold. First consider whether the expected effect is too small for this page’s traffic, or whether the test idea should be stronger.

Example 2: Ecommerce checkout step with low conversion volume

An ecommerce team wants to test a checkout trust message. The baseline purchase completion rate from checkout entry is 25%, but only a few hundred users enter checkout per day. Because checkout volume is limited, the team may find that detecting a small improvement would take too long.

This is a common decision point. Instead of forcing a long test, the team can ask:

Should we test a bolder variation?
Should we use a higher-volume upstream metric such as add-to-cart as a diagnostic secondary metric?
Should we wait until traffic rises during a known peak period?

If the purchase metric remains the primary success measure, the sample estimate still governs the decision. Secondary metrics can help explain behavior, but they should not replace the primary outcome after the fact. If you are validating ecommerce tracking before testing, "GA4 Ecommerce Tracking Checklist: Product Views, Add to Cart, Checkout, and Purchase" is a useful pre-launch reference.
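To put numbers on the trade-off in this example, the sketch below estimates run length for several effect sizes at a 25% baseline. The figure of 300 checkout entries per day is a hypothetical stand-in for "a few hundred", and the sample size formula is the common normal approximation for two proportions, which may differ from your calculator.

```python
import math
from statistics import NormalDist

DAILY_CHECKOUT_ENTRIES = 300  # hypothetical; replace with your own eligible traffic


def sample_size_per_variant(baseline, relative_mde, confidence=0.95, power=0.80):
    """Approximate users per variant (two-sided test, equal allocation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2) + NormalDist().inv_cdf(power)
    return math.ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)


def days_needed(relative_mde, baseline=0.25, daily=DAILY_CHECKOUT_ENTRIES):
    """Days to collect both variants' samples at full allocation."""
    total = 2 * sample_size_per_variant(baseline, relative_mde)
    return math.ceil(total / daily)


for mde in (0.02, 0.05, 0.10, 0.20):
    print(f"{mde:.0%} relative lift: about {days_needed(mde)} days")
```

With these assumed inputs, detecting a 2% relative lift would take well over a year, while a 20% lift would take under two weeks. That gap is the quantitative case for testing a bolder variation when volume is limited.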

Example 3: Content publisher testing subscription prompts

A publisher tests two article-end subscription prompts. The baseline subscription rate is low, and traffic varies sharply by topic and day of week. In this case, duration should not be estimated from average daily pageviews alone. Eligible traffic should reflect article pages that actually show the prompt and should be smoothed over full publishing cycles.

Publishers often benefit from choosing a hierarchy of metrics:

Primary: subscription start rate
Secondary: click-through to subscription form
Guardrail: engagement depth or ad revenue per session

This structure helps teams avoid chasing a lift in prompt clicks that harms the business outcome that matters. For related measurement context, see "Content Engagement Metrics Guide: What Publishers Should Track Beyond Pageviews".

Example 4: Why short tests often mislead

Imagine a page receives strong weekday traffic and weak weekend traffic. A test starts on Monday and by Thursday one variant appears to be ahead. If the planned sample size has not been reached, and the experiment has not covered the site’s weekly cycle, stopping early can turn normal variance into a false win.

That is one reason the output of an A/B test duration calculator should be treated as a floor, not a finish line. You still need enough calendar coverage to represent normal behavior.

When to recalculate

You should revisit sample size and duration estimates whenever the underlying inputs move in a meaningful way. This is the section to bookmark and use operationally.

Recalculate before launch if any of these changed

Your baseline conversion rate shifted after a redesign or campaign change.
Your traffic mix changed by device, channel, geography, or landing page type.
You changed the test audience or eligibility rules.
You switched the primary metric.
You adopted new consent logic, server-side tracking, or platform measurement changes.
You now require a smaller or larger business impact to ship the winner.

Recalculate during the planning stage if the test looks unrealistic

If the estimate says the experiment will take too long, work through these options in order:

Check the baseline. Make sure it matches the actual eligible audience.
Review the MDE. Is the target effect meaningful and plausible?
Simplify the design. Fewer variants mean faster learning.
Increase traffic eligibility. Include more relevant pages or a broader audience if that still fits the hypothesis.
Strengthen the treatment. A larger expected effect reduces sample requirements.
Choose a different test. Some ideas are not practical at current traffic levels.

Recalculate after notable business shifts

The inputs do not stay fixed. Return to your estimate when baseline rates or traffic levels move, especially after:

Seasonal peaks or off-seasons
Major channel mix changes
Pricing or offer changes
Site migrations or template changes
Tracking audits and event definition updates

A simple planning template to keep

For each experiment, capture the following in one document or spreadsheet:

Primary metric
Baseline conversion rate and date range used
MDE and business rationale
Confidence and power assumptions
Estimated sample per variant
Eligible daily traffic
Estimated duration in days and weeks
Known risks: seasonality, conversion lag, tracking issues
Decision rule for stopping and shipping

This small discipline makes experiment reviews far cleaner. It also helps align analytics, product, and marketing teams before a test goes live.
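If your team prefers code to spreadsheets, the template can be captured as a small record type. The sketch below is one possible shape, not a standard. The class and field names are hypothetical, the `variants` field is an added assumption needed to derive duration, and the example values are placeholders.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ExperimentPlan:
    primary_metric: str
    baseline_rate: float
    baseline_date_range: str
    relative_mde: float
    mde_rationale: str
    confidence: float
    power: float
    sample_per_variant: int
    variants: int
    eligible_daily_traffic: int
    known_risks: list[str] = field(default_factory=list)
    decision_rule: str = ""

    @property
    def duration_days(self) -> int:
        """Days needed to collect the full sample at current eligible traffic."""
        total = self.sample_per_variant * self.variants
        return math.ceil(total / self.eligible_daily_traffic)

    @property
    def duration_weeks(self) -> int:
        """Duration rounded up to complete weekly cycles."""
        return math.ceil(self.duration_days / 7)


plan = ExperimentPlan(
    primary_metric="lead form completion rate",
    baseline_rate=0.08,
    baseline_date_range="last 8 full weeks",
    relative_mde=0.10,
    mde_rationale="smallest lift that would change the roadmap",
    confidence=0.95,
    power=0.80,
    sample_per_variant=20_000,
    variants=2,
    eligible_daily_traffic=2_000,
    known_risks=["weekend dip", "conversion lag"],
    decision_rule="ship only if the planned sample and full weeks are reached",
)
print(f"{plan.duration_days} days, planned as {plan.duration_weeks} full weeks")
```

Keeping duration as a derived property means it updates automatically whenever the baseline, sample, or traffic inputs are revised.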

Finally, remember that sample size estimation is a planning tool, not a promise. Real-world experiments are influenced by traffic quality, implementation quality, and business context. Good planning will not remove uncertainty, but it will keep you from making avoidable mistakes. If your team treats every test as a repeatable measurement exercise rather than a one-off gamble, your experimentation program will become more trustworthy over time.

For adjacent planning work, it can also help to standardize campaign inputs and attribution interpretation. Related references include "UTM Naming Convention Guide: A Maintainable Framework for Teams" and "Marketing Attribution Models Explained: When to Use Each and What to Watch For".