# Experimentation & Statistics

Run experiments you can trust. Definitions and calculators for the statistics that decide whether a result is real — statistical significance, sample size, confidence intervals, control groups, and false positives — plus incrementality and A/B-test methodology for marketing teams that refuse to ship on noise.

**Tagline:** The statistics behind trustworthy tests — significance, sample size, power, and lift.

## Overview

Many "winning" tests are not real wins. The fastest ways to ship false winners are to read results before a test has gathered enough data, to treat any observed difference as a result, or to confuse a metric that moved in-platform with revenue that actually grew. Trustworthy experimentation is the discipline of separating signal from noise, and it rests on three dials you set before the test starts, not after.

The first dial is statistical significance, usually set at a 95% confidence level (a threshold of p < 0.05). It answers a narrow question: if there were truly no difference between A and B, how often would random chance alone produce a gap at least this large? It does not tell you the variant is better, by how much, or whether the difference matters to the business. The second dial is statistical power, conventionally 80%: the chance your test detects a real effect of the size you planned for. Underpowered tests are the silent killer: they quietly miss real wins and leave teams concluding "no difference" when there was one.
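As a rough illustration of the question significance answers, the sketch below computes a two-sided p-value with a pooled two-proportion z-test. The function name and the visitor counts are made up for the example, and the z-test is one common choice, not necessarily the method behind the calculators listed on this page.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both variants share one conversion rate."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under H0 the best estimate of the shared rate pools both variants.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    # Probability of a gap at least this large, in either direction, under H0.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical data: 5.0% vs 5.6% conversion on 10,000 visitors each.
p = two_proportion_p_value(500, 10_000, 560, 10_000)
print(f"p-value: {p:.3f}")
```

With these illustrative counts the p-value comes out just above 0.05, so a 12% relative gap that looks convincing on a dashboard would still not clear the conventional threshold.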

The third dial is sample size, and it is driven by your baseline conversion rate and the minimum detectable effect (MDE) you care about. Lower baselines and smaller effects both demand dramatically more traffic. As concrete anchors at 95% confidence and 80% power, a 5% baseline rate needs roughly 31,000 visitors per variant to detect a +10% relative lift, or about 8,100 to detect a +20% lift; a 1% baseline needs roughly 163,000 per variant for a +10% lift. Fix the sample size in advance, then resist peeking — checking repeatedly and stopping the moment you see significance inflates the false-positive rate far above the 5% you think you set.
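The anchors above can be approximated with the standard normal-approximation formula for comparing two proportions. This is a minimal sketch under that assumption: the function name is hypothetical, and other calculators may use slightly different formulas and therefore return slightly different numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-sided test on conversion rates."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)  # relative MDE turned into a target rate
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 at 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 at 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


for baseline, lift in [(0.05, 0.10), (0.05, 0.20), (0.01, 0.10)]:
    n = sample_size_per_variant(baseline, lift)
    print(f"baseline {baseline:.0%}, lift +{lift:.0%}: {n:,} per variant")
```

The three printed values land close to the figures quoted above, which suggests those anchors come from a formula of this family.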

Finally, a significant in-platform lift is not the same as incremental revenue. Platform attribution credits conversions that often would have happened anyway; the most reliable way to know what your marketing truly caused is a holdout or geo experiment that compares an exposed group to an unexposed control. Use the calculators below to plan sample size and read significance, the glossary to ground the concepts, and incrementality testing to confirm that a statistically real lift is also a real business one.
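To make the holdout comparison concrete, here is a small sketch of how an exposed group and an unexposed control might be turned into incrementality figures. The function name, the returned keys, and all counts are hypothetical.

```python
def incremental_lift(exposed_conv, exposed_n, holdout_conv, holdout_n):
    """Compare an exposed group against an unexposed holdout."""
    exposed_rate = exposed_conv / exposed_n
    holdout_rate = holdout_conv / holdout_n  # what happens with no marketing
    # Conversions in the exposed group beyond what the holdout rate predicts.
    incremental = (exposed_rate - holdout_rate) * exposed_n
    return {
        "incremental_conversions": incremental,
        "relative_lift": (exposed_rate - holdout_rate) / holdout_rate,
        "incremental_share": incremental / exposed_conv,
    }


# Hypothetical test: 90,000 exposed users (3.0% convert),
# 10,000 held-out users (2.5% convert).
result = incremental_lift(2_700, 90_000, 250, 10_000)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case the campaign shows a 20% relative lift, yet only about one in six of the exposed group's conversions is incremental. The rest would have happened without the ads, which is exactly the gap that platform attribution tends to hide.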

## Curated resources

### Glossary terms

- [ab-testing](https://www.adsights.ai/resources/glossary/creative/ab-testing)
- [statistical-significance](https://www.adsights.ai/resources/glossary/metrics/statistical-significance)
- [sample-size](https://www.adsights.ai/resources/glossary/metrics/sample-size)
- [confidence-interval](https://www.adsights.ai/resources/glossary/metrics/confidence-interval)
- [margin-of-error](https://www.adsights.ai/resources/glossary/metrics/margin-of-error)
- [control-group](https://www.adsights.ai/resources/glossary/metrics/control-group)
- [false-positive](https://www.adsights.ai/resources/glossary/metrics/false-positive)
- [false-negative](https://www.adsights.ai/resources/glossary/metrics/false-negative)
- [incrementality](https://www.adsights.ai/resources/glossary/general/incrementality)

### Tools

- [ab-test-statistical-significance-calculator](https://www.adsights.ai/resources/tools/calculators/ab-test-statistical-significance-calculator)
- [creative-testing-calculator](https://www.adsights.ai/resources/tools/calculators/creative-testing-calculator)
- [marketing-incrementality-calculator](https://www.adsights.ai/resources/tools/calculators/marketing-incrementality-calculator)

### Guides

- [creative-testing-framework-guide](https://www.adsights.ai/resources/guides/creative-testing-framework-guide)

### Templates

- [ab-test-tracker-template](https://www.adsights.ai/resources/templates/ab-test-tracker-template)

### Featured blog posts

- [statistical-noise-unmasking-the-illusion-of-insights-in-modern-marketing](https://www.adsights.ai/blog/topics/data-science/statistical-noise-unmasking-the-illusion-of-insights-in-modern-marketing)

## Related topics

- [creative-testing](https://www.adsights.ai/resources/topics/creative-testing)
- [attribution-measurement](https://www.adsights.ai/resources/topics/attribution-measurement)
- [data-visualization](https://www.adsights.ai/resources/topics/data-visualization)

## Frequently asked questions

### What does statistical significance actually tell me?

A significant result at the 95% confidence level (p < 0.05) tells you that if there were genuinely no difference between your variants, a gap at least this large would show up by random chance less than 5% of the time. That is all. It does not tell you the variant is better, how large the effect is, or whether it matters commercially: a tiny, business-irrelevant difference can be "statistically significant" with enough traffic. Always pair significance with the effect size and a confidence interval.
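One way to pair significance with effect size is to report a confidence interval for the difference in conversion rates. The sketch below uses a simple Wald interval, which is an assumption made for brevity; the function name and the traffic numbers are invented for illustration.

```python
import math
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for the absolute difference in rates (B minus A)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    se = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = rate_b - rate_a
    return diff, diff - z * se, diff + z * se


# Hypothetical high-traffic test: 5.00% vs 5.05% on 2,000,000 visitors each.
diff, low, high = diff_confidence_interval(100_000, 2_000_000, 101_000, 2_000_000)
print(f"difference: {diff:.4%} (95% CI {low:.4%} to {high:.4%})")
```

Here the interval excludes zero, so the result is statistically significant, but the whole interval sits below a tenth of a percentage point. The interval makes it obvious that the effect is real and also too small to matter for most businesses.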

### How much traffic do I need for an A/B test?

It depends on your baseline conversion rate and the minimum detectable effect (MDE). At 95% confidence and 80% power: a 5% baseline needs roughly 31,000 visitors per variant to detect a +10% relative lift, or about 8,100 for a +20% lift; a 1% baseline needs roughly 163,000 per variant for +10%. Lower baselines and smaller target effects both require far more traffic. If the required sample exceeds what you can gather in 2–4 weeks, test a bigger change or a higher-converting funnel step — don't compensate by stopping early.
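A quick feasibility check follows directly from the 2–4 week guideline. This sketch assumes traffic is split evenly across variants and stays roughly constant week to week; the function names and the weekly traffic figure are hypothetical.

```python
import math


def weeks_needed(n_per_variant, weekly_visitors, variants=2):
    """Whole weeks required to reach the planned sample across all variants."""
    return math.ceil(n_per_variant * variants / weekly_visitors)


def fits_window(n_per_variant, weekly_visitors, max_weeks=4):
    """True if the test can finish inside the planning window."""
    return weeks_needed(n_per_variant, weekly_visitors) <= max_weeks


# Hypothetical site with 20,000 eligible visitors per week.
print(weeks_needed(31_000, 20_000))    # 5% baseline, +10% lift
print(weeks_needed(163_000, 20_000))   # 1% baseline, +10% lift
print(fits_window(163_000, 20_000))
```

When the check fails, the options are the ones described above: test a bolder change, which raises the MDE, or move the test to a funnel step with a higher baseline rate.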

### What is statistical power, and why does 80% matter?

Power is the probability your test detects a real effect of the size you planned for (your MDE) when one truly exists. The 80% convention means you accept a 20% chance of missing a real winner (a false negative). Underpowered tests, those with too little traffic for the effect you care about, are a common and hard-to-spot testing mistake: they return "no significant difference" on changes that actually worked, so teams stop iterating on ideas that were quietly winning.
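To see how quickly power erodes with too little traffic, the sketch below estimates the power of a two-sided test for a given sample size using the normal approximation. The function name is hypothetical, and the calculation ignores the negligible chance of a significant result in the wrong direction.

```python
import math
from statistics import NormalDist


def achieved_power(baseline, relative_lift, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided test on two conversion rates."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    se = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Chance the observed gap clears the significance bar when the lift is real.
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)


# 5% baseline, true +10% relative lift, at two traffic levels.
for n in (31_231, 5_000):
    print(f"{n:,} per variant: power {achieved_power(0.05, 0.10, n):.0%}")
```

At roughly 31,000 visitors per variant the power is about 80%, while at 5,000 per variant it falls to around 20%. In that second case a genuinely better variant would be reported as "no significant difference" about four times out of five.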

### Can I stop a test early once it hits significance?

No — "peeking" and stopping the moment you see p < 0.05 is the fastest way to ship false positives. Significance fluctuates as data accumulates, so if you check repeatedly and stop at the first significant reading, your true false-positive rate climbs well above 5%. Decide the sample size and run length before launching, and read the result once at the end. If you need to monitor continuously, use a sequential testing method designed for it rather than naive peeking.
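The inflation from peeking is easy to demonstrate with a simulated A/A test, where both arms share the same true rate and every "win" is by definition a false positive. This is an illustrative Monte Carlo sketch with arbitrary parameters (10 looks, 200 visitors per arm between looks, a 10% conversion rate), not a model of any particular platform.

```python
import math
import random
from statistics import NormalDist


def z_stat(conv_a, conv_b, n):
    """Pooled two-proportion z statistic for two arms of n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0, 1):
        return 0.0
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return (conv_b - conv_a) / n / se


def false_positive_rates(sims=1000, looks=10, batch=200, rate=0.10,
                         alpha=0.05, seed=7):
    """Return (rate when stopping at any significant look, rate at final look)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = final_hits = 0
    for _sim in range(sims):
        conv_a = conv_b = n = 0
        crossed = False
        for _look in range(looks):
            # Both arms draw from the same true rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            z = z_stat(conv_a, conv_b, n)
            if abs(z) > z_crit:
                crossed = True  # a peeker would have stopped and shipped here
        if crossed:
            peeking_hits += 1
        if abs(z) > z_crit:
            final_hits += 1
    return peeking_hits / sims, final_hits / sims


peeking, final = false_positive_rates()
print(f"stop at first significant look: {peeking:.1%}")
print(f"single read at the end:         {final:.1%}")
```

The single end-of-test read should stay near the nominal 5%, while stopping at the first significant look typically produces a false-positive rate several times higher. The more looks you add, the wider that gap becomes.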

### What is the difference between statistical significance and incrementality?

Significance tells you a measured difference would be unlikely if only chance were at work. Incrementality tells you the conversions were actually caused by your marketing rather than ones that would have happened anyway. A campaign can show a significant in-platform lift that is largely non-incremental, for example when retargeting reaches users who were already going to buy. The most reliable way to measure true incrementality is a controlled holdout or geo test comparing an exposed group to an unexposed control; a typical "good" incremental lift for performance campaigns is roughly 10–30%.

### What is a minimum detectable effect (MDE)?

The MDE is the smallest improvement you want your test to be able to catch — for example, a +10% relative lift on a 5% conversion rate (i.e., moving it to 5.5%, not 15%). It is an input you choose, and it has an enormous effect on required sample size: halving the MDE roughly quadruples the traffic you need. Set it to the smallest change that would actually be worth shipping, not the smallest change imaginable.
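The quadrupling rule can be read off the standard normal-approximation formula for comparing two proportions. That formula is an assumption here, since this page does not state which one its calculators use. The symbols are introduced for illustration: $p_1$ is the baseline rate, $m$ the relative MDE, and $n$ the visitors needed per variant.

```latex
n \approx \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2
          \left[\,p_1(1-p_1) + p_2(1-p_2)\,\right]}{(p_2 - p_1)^2},
\qquad p_2 = p_1(1+m)
\quad\Longrightarrow\quad
(p_2 - p_1)^2 = p_1^2\, m^2
```

Because the denominator scales with $m^2$ while the numerator barely changes for small lifts, halving $m$ multiplies $n$ by roughly four. The anchors quoted on this page follow the same pattern: about 8,100 visitors per variant for a +20% lift versus about 31,000 for +10% at a 5% baseline.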

Landing page: https://www.adsights.ai/resources/topics/experimentation