# Statistical Significance

**Category:** metrics  
**Short Description:** Measure of whether results are likely due to chance or a real difference.  
**Last Updated:** 2026-05-27T00:00:00Z

## Definition

Statistical significance indicates whether an observed difference between variants in an experiment is likely to be due to random chance or represents a genuine effect. In advertising, it helps determine if differences in key metrics like CTR, conversion rate, or ROAS between ad variants or campaigns represent real performance differences rather than random fluctuations. This is crucial for making data-driven optimization decisions and avoiding false conclusions based on temporary variations.

## Formula

**Formula:** `Significant when p-value < α (typically 0.05)`
**Result Unit:** ratio

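A result is statistically significant when its p-value falls below α. The p-value is the probability of seeing a difference at least as large as the observed one if there were no real difference between variants; α = 0.05 corresponds to the standard 95% confidence level.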

## Calculation

**Formula:** `p-value < Significance Level`

**Explanation:** The p-value comes from a statistical test such as a t-test, chi-square test, or two-proportion z-test, which compares the observed difference against what the null hypothesis (no real difference) would produce. For advertising, this usually means comparing conversion rates, CTRs, or other KPIs between variants while accounting for sample size and variance.

### Components

- **Sample Size**: The number of observations in each group
- **Test Statistic**: The calculated test value (t-value, z-score, etc.)
- **Significance Level**: The threshold for statistical significance (typically 0.05)
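
As a rough illustration of how these components combine, here is a minimal sketch of a two-sided, pooled two-proportion z-test using only the Python standard library. The function name and the impression counts are hypothetical, chosen for illustration; real tools may use a different test or apply corrections.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided pooled z-test for a difference between two rates."""
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both variants share one rate.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err  # the test statistic
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 1.8% vs 2.1% CTR on 20,000 impressions per variant.
z, p = two_proportion_z_test(360, 20_000, 420, 20_000)
alpha = 0.05  # significance level, fixed before the test
print(f"z = {z:.2f}, p-value = {p:.4f}, significant = {p < alpha}")
```

With the same two rates on far fewer impressions, the same function returns a much larger p-value, which is why sample size is listed as a component in its own right.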

## Industry Benchmarks

| Segment | Typical Range | Median | Notes |
| --- | --- | --- | --- |
| Industry-standard alpha (α) | α = 0.05 (95% confidence) | 0.05 | The default in virtually every paid-social A/B testing playbook. α = 0.10 acceptable for early-stage directional reads; α = 0.01 for finance-impact decisions. |
| Conversion-rate A/B test, MDE 10% lift | ~50,000 – 100,000 visitors per variant | ~75,000 | Assuming baseline CVR=3%, α=0.05, power=0.80, MDE=10% relative lift. Most paid-social conversion tests are under-powered to detect <15% lifts at typical DTC spend. |
| Conversion-rate A/B test, MDE 20% lift | ~12,000 – 25,000 visitors per variant | ~18,000 | Assuming baseline CVR=3–5%, α=0.05, power=0.80. The realistic detectability floor for a single-creative test in a 2-week window at typical DTC ad spend. |
| CTR A/B test (Meta), MDE 15% lift | ~35,000 – 70,000 impressions per variant | ~47,000 | Assuming baseline CTR=1–2%, α=0.05, power=0.80, via n ≈ 16·p·(1−p)/Δ². CTR tests usually finish sooner than CVR tests because impressions accumulate far faster than site visitors, not because fewer observations are needed. |
| ROAS test (DTC, 30+ purchases per variant) | ~30 – 50 purchases per variant minimum | ~40 | Assuming purchase-value coefficient-of-variation ~1.0 (typical DTC), α=0.05, power=0.80, MDE=25% lift. Below 30 purchases per variant, ROAS variance is too high to detect a 25% lift. Most accounts can't reach significance on weekly cadence. |
| Common alpha levels by decision type | 0.10 (directional) – 0.05 (standard) – 0.01 (high-stakes) | 0.05 | Pick the alpha before the test starts, not after. Adjusting alpha post-hoc is the most common abuse of significance testing. |
| Sequential testing (peeking-safe methods) | +30% to +50% sample size vs fixed-horizon test | +40% | If you're going to peek at results before the planned end, switch to a sequential method (mSPRT or Bayesian) — fixed-horizon p-values are invalid under peeking. |

**Sources:**

- Agresti, 'Statistical Methods for the Social Sciences' (5th ed., Pearson 2017)
- Kohavi, Tang & Xu, 'Trustworthy Online Controlled Experiments' (Cambridge 2020)
- Derived from standard two-proportion z-test sample-size formula (n ≈ 16·p·(1−p)/Δ²) at α=0.05, power=0.80, baseline CVR=3%, MDE=10% relative lift; see Kohavi et al. 2020, ch. 17
- Two-proportion z-test sample-size formula at α=0.05, power=0.80, baseline CVR=3–5%, MDE=20% relative lift; Kohavi et al. 2020, ch. 17
- Two-proportion z-test formula at α=0.05, power=0.80, baseline CTR=1–2%, MDE=15% relative lift; Meta A/B Test Help Center methodology notes
- Common Thread Collective DTC Index 2024; sample-size guidance via Welch's t-test for ratios (Casella & Berger, 'Statistical Inference' 2nd ed.)
- FBSC / Princeton GEO methodology 2026
- Evan Miller 'How Not To Run an A/B Test' 2014
- Optimizely Stats Engine docs
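
The rule-of-thumb formula cited in the sources, n ≈ 16·p·(1−p)/Δ², can be sketched in a few lines. This is an approximation for α=0.05 and power=0.80 only; the function name is hypothetical, and Δ is taken to be the absolute difference (baseline rate times relative lift).

```python
import math


def sample_size_per_variant(baseline_rate, relative_lift):
    """Approximate per-variant n via n ≈ 16·p·(1−p)/Δ² (α=0.05, power=0.80)."""
    delta = baseline_rate * relative_lift  # absolute difference to detect
    return math.ceil(16 * baseline_rate * (1 - baseline_rate) / delta ** 2)


# Baseline CVR of 3%, as assumed in the conversion-rate rows of the table.
n_10pct = sample_size_per_variant(0.03, 0.10)
n_20pct = sample_size_per_variant(0.03, 0.20)
print(f"10% relative lift: {n_10pct:,} visitors per variant")
print(f"20% relative lift: {n_20pct:,} visitors per variant")
```

At a 3% baseline this gives roughly 52,000 visitors per variant for a 10% lift and roughly 13,000 for a 20% lift. Halving the detectable lift quadruples the required sample because Δ enters the formula squared.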

## Examples

- 95% confidence that variant B's 2.1% CTR vs variant A's 1.8% CTR represents a real improvement
- Determining if a new automated bidding strategy's 15% higher ROAS is statistically significant
- Validating that a targeting expansion's lower CPAs aren't just due to random chance

## How AdSights Helps

**Tracking Statistical Significance:** The most common reason paid-social A/B tests fail isn't bad statistics — it's running tests that can never reach significance at the volume the account actually has. AdSights pre-calculates the sample size required to detect the lift you actually need (10%, 20%, etc.) at your historical CVR and current spend, so teams know upfront whether a creative test is worth running or whether to consolidate variants until they can be tested at adequate power. During tests, AdSights surfaces the running p-value, the elapsed sample, and an explicit 'don't peek yet' indicator until the pre-registered end-of-test condition is met — preventing the silent abuse of fixed-horizon tests by mid-flight peeking that inflates false-positive rates by 2–5×. The result: fewer 'this variant won' calls that don't replicate, faster identification of real winners, and an honest read on whether the lift you're chasing is detectable in your account at all.

## FAQs

### What's a good p-value for a paid-social A/B test?

The standard threshold is p < 0.05 (α = 0.05), meaning that if there were no real difference between variants, a result at least this extreme would show up less than 5% of the time. That's the bar for confidently calling a winner. For early-stage directional reads where you're going to follow up with a larger test, p < 0.10 is acceptable. For high-stakes decisions (committing 6-figure spend to a new strategy, eliminating a creative concept) use p < 0.01. The most important rule: pick the alpha BEFORE the test starts. Adjusting alpha post-hoc to call a borderline result significant is the most common (and most damaging) abuse of significance testing in performance marketing.

### Why do my creative tests never reach significance?

Almost always one of three reasons. First, the lift you're hoping for is smaller than your account can detect at current spend: by the n ≈ 16·p·(1−p)/Δ² rule of thumb, a 10% CTR lift at a 1–2% baseline CTR requires roughly 80,000–160,000 impressions per variant, and a 10% CVR lift requires ~75,000 visitors per variant. If you only have 5,000 of either, no statistical method can rescue the test. Second, you're running too many variants at once: 8 ads in an ad set means each variant gets 1/8 of the spend, so no single variant accumulates enough sample to reach adequate power. Three or four variants is the practical maximum for most account sizes. Third, you're peeking: checking results daily and stopping early when a variant looks ahead inflates the false-positive rate by 2–5× and means your 'significant' winners often don't replicate. Switch to a sequential testing method if peeking is unavoidable.
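
To see why peeking matters, here is a small simulation sketch of A/A tests, where there is no real difference by construction. It is a simplified model, not a replay of real campaign data: each look is assumed to add an equally sized batch whose standardized effect is a standard normal draw, and all names and parameters are hypothetical.

```python
import math
import random


def false_positive_rates(n_tests=2000, n_looks=10, z_crit=1.96, seed=0):
    """Simulate A/A tests and compare peeking with a single final look."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_tests):
        cumulative = 0.0
        crossed_at_any_look = False
        for look in range(1, n_looks + 1):
            # One batch of data under the null: a standard normal increment.
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / math.sqrt(look)
            if abs(z) > z_crit:
                crossed_at_any_look = True
        # Peeking: the test is "won" if any interim look crossed the threshold.
        peeking_hits += crossed_at_any_look
        # Fixed horizon: only the z-score at the final look counts.
        final_hits += abs(z) > z_crit
    return peeking_hits / n_tests, final_hits / n_tests


peeking_rate, fixed_rate = false_positive_rates()
print(f"stop at first significant look: {peeking_rate:.1%}")
print(f"single look at planned end:     {fixed_rate:.1%}")
```

Under these assumptions the single-look rate stays close to the nominal 5%, while checking at every one of ten looks produces a false-positive rate several times higher.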

### How long should an A/B test run?

Long enough to hit the pre-calculated sample size for your minimum-detectable effect, AND long enough to span at least one full business cycle (typically 7 days for ecommerce, 14 days for B2B). The seven-day floor is non-negotiable for DTC — Mondays and Fridays convert at different rates than Tuesdays, and any test shorter than a full week is biased by day-of-week composition. For a typical DTC account doing $20K–$50K weekly Meta spend, that's usually 7–14 days to detect a 20% lift at α = 0.05. Below 7 days, you're rolling a die; above 21 days, you've passed the point where creative fatigue and audience drift make the test invalid anyway.

### Statistical significance vs practical significance — what's the difference?

Statistical significance only answers 'is this difference real?' — it doesn't answer 'is it worth caring about?' A test with 500,000 visitors per variant can detect a 0.3% CVR lift at p < 0.001, but if that lift translates to $80/month in additional revenue and the variant takes 20 hours to produce, it's statistically significant but practically irrelevant. Always pair the p-value with the effect size (the absolute lift, the confidence interval, the projected revenue impact). The marketer's question is always 'is this big enough to act on,' not 'is this a real effect.'
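
One way to pair the p-value with an effect size is to report the absolute lift alongside a confidence interval. The sketch below uses an approximate 95% Wald interval. It assumes the 0.3% lift means 0.3 percentage points on a hypothetical 3.0% baseline, which the text does not specify.

```python
import math


def lift_with_confidence_interval(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Absolute lift (B minus A) with an approximate 95% Wald interval."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    lift = rate_b - rate_a
    # Unpooled standard error, appropriate for an interval around the difference.
    std_err = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    return lift, lift - z_crit * std_err, lift + z_crit * std_err


# Hypothetical counts: 3.0% vs 3.3% CVR on 500,000 visitors per variant.
lift, low, high = lift_with_confidence_interval(15_000, 500_000, 16_500, 500_000)
print(f"lift = {lift:.3%}, 95% CI = [{low:.3%}, {high:.3%}]")
```

The interval excludes zero, so the effect is statistically real, but its whole range is a fraction of a percentage point. Whether that justifies the production cost is a business question the p-value cannot answer.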

### Should I use Bayesian or frequentist methods?

For most paid-social testing, frequentist methods (the classic p-value < 0.05) are fine and align with how most tools report significance. Bayesian methods (which give probability statements like 'variant B has an 83% chance of being better') are more intuitive and less fragile under peeking, but they require a prior assumption about likely lift magnitudes that most operators can't articulate honestly. VWO SmartStats is Bayesian, as was Google Optimize (RIP); Optimizely Stats Engine uses sequential frequentist testing, and Meta Experiments has used frequentist methods. The framework matters less than disciplined sample-size planning and avoiding mid-test peeking; both methods fail when those break down.
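
For readers curious where a statement like 'variant B has an 83% chance of being better' comes from, here is a minimal Beta-Binomial sketch. The flat Beta(1, 1) prior is an assumption chosen for simplicity, not a recommendation. The function name and counts are hypothetical, and commercial tools may use different priors and models.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each rate is Beta(1 + conversions, 1 + non-conversions).
        sample_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        sample_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += sample_b > sample_a
    return wins / draws


# Hypothetical counts: 90 of 3,000 vs 105 of 3,000 visitors converting.
prob = prob_b_beats_a(90, 3_000, 105, 3_000)
print(f"P(B better than A) = {prob:.1%}")
```

A more opinionated prior would shift this probability, which is the honesty problem with priors noted above.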

### What does '95% confidence' actually mean?

It means: if you ran the exact same test 100 times under the null hypothesis (no real difference between variants), you'd expect about 5 of those tests to show a 'significant' result purely by chance. So the 5% is the false-alarm rate of the testing procedure when nothing is really different; it is not the chance that a particular winning variant is a false alarm, which also depends on how often the variants you test are truly better. It also does NOT mean 'there's a 95% chance variant B is better'; that's the Bayesian framing, which requires different math. The frequentist 95% claim is about long-run repeatability of the test procedure, not about the specific variant's truth. Most marketers and most tools treat the distinction as academic, but knowing the literal meaning prevents over-stating what a 'significant' result implies.

## Related Terms

### Similar Terms

- **[A/B Testing](/resources/glossary/creative/ab-testing)**: Statistical significance ensures A/B test decisions are based on reliable data
- **[Creative Testing](/resources/glossary/creative/creative-testing)**: Statistical significance prevents premature creative optimization decisions
- **[Incrementality](/resources/glossary/general/incrementality)**: Significance testing is the validation mechanism inside an incrementality study

### Component Terms

- **[Confidence Interval](/resources/glossary/metrics/confidence-interval)**: Quantifies uncertainty range around observed performance differences
- **[Standard Deviation](/resources/glossary/metrics/standard-deviation)**: Helps determine if variations are statistically meaningful

## Related Resources

- [A/B Test Statistical Significance Calculator](/resources/tools/calculators/ab-test-statistical-significance-calculator) - Compute the p-value and required sample size for an A/B test on CTR, CVR, or ROAS.

## Featured in topic hubs

- [Creative Testing](/resources/topics/creative-testing)
- [Experimentation & Statistics](/resources/topics/experimentation)