# A/B Testing with Feature Flags: Ship Experiments Without the Complexity

## Why A/B Test with Feature Flags?
Most teams think A/B testing requires a dedicated experimentation platform — Optimizely, LaunchDarkly Experimentation, or Google Optimize (RIP). These tools cost thousands per month, add SDK bloat, and introduce yet another vendor into your stack.
Here's the thing: if you already have feature flags, you already have 80% of what you need for A/B testing.
A feature flag with percentage-based rollout is fundamentally an A/B test. The only missing pieces are:
- Consistent assignment — same user always sees the same variant
- Variant tracking — recording which variant each user saw
- Metric collection — measuring outcomes per variant
- Statistical analysis — determining if the difference is real
Let's build this step by step.
## The Basics: Flags as Experiments
A traditional feature flag splits users into two groups: flag ON vs flag OFF. An A/B test does the same thing, but with intent — you're measuring which group performs better on a specific metric.
```javascript
// This is a feature flag
const showNewPricing = rollgate.isEnabled('new-pricing-page', { userId });

// This is also an A/B test (same code, different intent)
if (showNewPricing) {
  renderNewPricingPage(); // Variant B
  track('pricing_page_view', { variant: 'new' });
} else {
  renderCurrentPricingPage(); // Variant A (control)
  track('pricing_page_view', { variant: 'control' });
}
```
The code is identical. The difference is operational: you're tracking outcomes and making a data-driven decision.
## Consistent User Assignment
The most important requirement for A/B testing is consistency — a user must always see the same variant for the duration of the experiment. If user #42 sees the new pricing page on Monday, they must see it on Tuesday too.
Feature flags handle this through sticky assignment. When you pass a userId (or any stable identifier) to the flag evaluation, the system hashes it deterministically:
```javascript
// Rollgate SDK handles this automatically
const variant = rollgate.isEnabled('new-pricing-page', {
  userId: user.id, // Stable identifier
});

// Same userId always gets the same result
// No database lookup, no cookie required
```
This works because good feature flag systems apply a consistent hash (MurmurHash is a common choice) to the user ID. The hash maps to a number between 0 and 100; if your rollout is at 50%, users whose hash falls below 50 get the variant and everyone else gets the control.
Why this matters: Cookie-based assignment breaks across devices. Server-side hashing on user ID works everywhere — web, mobile, API, email.
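To make the mechanics concrete, here is a minimal sketch of deterministic bucketing. It uses a simple FNV-1a hash as a stand-in for MurmurHash, and the `bucket` and `isEnabled` functions are illustrative, not Rollgate's actual implementation:

```javascript
// FNV-1a: a tiny, deterministic 32-bit string hash (stand-in for MurmurHash)
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // unsigned 32-bit multiply
  }
  return hash >>> 0;
}

// Map a flag + user to a stable bucket in [0, 100)
// Hashing the flag name too keeps experiments independent of each other
function bucket(flagName, userId) {
  return fnv1a(`${flagName}:${userId}`) % 100;
}

function isEnabled(flagName, userId, rolloutPercent) {
  return bucket(flagName, userId) < rolloutPercent;
}
```

Because the bucket is a pure function of the flag name and user ID, no assignment needs to be stored anywhere: any server, device, or batch job that evaluates the flag reaches the same answer.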
## Designing Your Experiment

Before writing code, define three things:

### 1. Hypothesis
Bad: "Let's see if the new pricing page is better."
Good: "Changing the pricing page CTA from 'Start Free Trial' to 'Get Started Free' will increase trial signups by at least 10%."
A clear hypothesis tells you what to measure and when to stop.
### 2. Primary Metric
Pick one metric that determines success. Secondary metrics are fine for context, but having multiple primary metrics inflates your false positive rate.
| Experiment | Bad Primary Metric | Good Primary Metric |
|---|---|---|
| New checkout flow | Page views | Completed purchases |
| Pricing page redesign | Time on page | Trial signups |
| Search algorithm | Searches performed | Click-through on first result |
### 3. Sample Size
This is where most teams mess up. You need enough data to detect a meaningful difference. Running an experiment for 2 days with 100 users proves nothing.
Rule of thumb for a standard A/B test:
- 5% conversion rate baseline → need ~1,500 users per variant to detect a 50% relative change (at 80% power and 5% significance)
- 2% conversion rate baseline → need ~4,000 users per variant for the same relative change
- High-traffic pages → you might have enough data in days
- Low-traffic pages → it might take weeks
Don't stop the experiment early because one variant "looks better." That's the statistical equivalent of flipping a coin 5 times, getting 4 heads, and concluding the coin is biased.
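Those rules of thumb fall out of the standard two-proportion power calculation. A sketch, assuming the conventional two-sided 5% significance level and 80% power (the `sampleSizePerVariant` name is ours for illustration, not from any library):

```javascript
// Approximate sample size per variant for a two-proportion test.
// Assumes two-sided alpha = 0.05 and 80% power.
function sampleSizePerVariant(baselineRate, relativeLift) {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeLift);
  const zAlpha = 1.96;  // z for two-sided 5% significance
  const zBeta = 0.8416; // z for 80% power
  const pBar = (p1 + p2) / 2;

  const numerator =
    zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
    zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));

  return Math.ceil((numerator ** 2) / ((p2 - p1) ** 2));
}

// 5% baseline, 50% relative lift -> roughly 1,500 users per variant
// 2% baseline, 50% relative lift -> roughly 3,800 users per variant
```

Plug in your own baseline and the smallest lift you'd care about, and you get a concrete number to wait for before reading the results.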
## Implementation Pattern
Here's a practical implementation using feature flags:
### Step 1: Create the Flag

Set up a boolean feature flag with percentage-based rollout:

- Flag name: `experiment-pricing-cta`
- Rollout: 50% (half see variant B, half see control)
- Targeting: all logged-in users
### Step 2: Instrument Your Code
```javascript
// Server-side (Node.js example)
app.get('/pricing', async (req, res) => {
  const showNewCTA = rollgate.isEnabled('experiment-pricing-cta', {
    userId: req.user.id,
  });

  // Track exposure
  analytics.track('experiment_exposure', {
    experiment: 'pricing-cta',
    variant: showNewCTA ? 'new-cta' : 'control',
    userId: req.user.id,
  });

  res.render('pricing', { showNewCTA });
});
```
```jsx
// Client-side (React example)
function PricingPage() {
  const showNewCTA = useFlag('experiment-pricing-cta');

  useEffect(() => {
    analytics.track('experiment_exposure', {
      experiment: 'pricing-cta',
      variant: showNewCTA ? 'new-cta' : 'control',
    });
  }, [showNewCTA]);

  return (
    <div>
      <h1>Choose your plan</h1>
      <Button onClick={handleSignup}>
        {showNewCTA ? 'Get Started Free' : 'Start Free Trial'}
      </Button>
    </div>
  );
}
```
### Step 3: Track Conversions
```javascript
// When the target action happens
function handleSignup(plan) {
  // Re-evaluating the flag here is safe: assignment is deterministic,
  // so the same userId always resolves to the same variant
  analytics.track('trial_signup', {
    experiment: 'pricing-cta',
    variant: rollgate.isEnabled('experiment-pricing-cta', { userId })
      ? 'new-cta'
      : 'control',
    plan: plan,
  });
}
```
### Step 4: Analyze Results
After reaching your target sample size, pull the data:
```sql
-- Conversion rate per variant
SELECT
  variant,
  COUNT(DISTINCT user_id) AS users,
  COUNT(DISTINCT CASE WHEN converted THEN user_id END) AS conversions,
  ROUND(
    COUNT(DISTINCT CASE WHEN converted THEN user_id END)::numeric /
    COUNT(DISTINCT user_id) * 100, 2
  ) AS conversion_rate
FROM experiment_events
WHERE experiment = 'pricing-cta'
GROUP BY variant;
```
| variant | users | conversions | conversion_rate |
|---|---|---|---|
| control | 12,847 | 411 | 3.20% |
| new-cta | 12,903 | 658 | 5.10% |
A 59% relative improvement looks great, but is it statistically significant? Run a chi-squared test (or a two-proportion z-test) or use an online calculator. At these numbers, p < 0.001, so this is a real effect, not noise.
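You can check this without a calculator. Here is a minimal two-proportion z-test on the numbers above (equivalent to the 2x2 chi-squared test, since chi-squared equals z squared):

```javascript
// Two-proportion z-test: is the difference in conversion rates
// larger than sampling noise would explain?
function twoProportionZ(conv1, n1, conv2, n2) {
  const p1 = conv1 / n1;
  const p2 = conv2 / n2;
  const pooled = (conv1 + conv2) / (n1 + n2);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));
  return (p2 - p1) / se;
}

// control: 411 / 12,847 vs new-cta: 658 / 12,903
const z = twoProportionZ(411, 12847, 658, 12903);
// z is about 7.6, far beyond 3.29 (the two-sided threshold for p < 0.001)
```

Any |z| above roughly 1.96 clears the conventional p < 0.05 bar; a z this large leaves essentially no room for the result being chance.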
## Multi-Variant Tests (A/B/n)
Sometimes you want to test more than two variants. Feature flags support this through string variants instead of boolean on/off:
```javascript
// Flag returns a string variant instead of a boolean
const ctaVariant = rollgate.getVariant('experiment-pricing-cta', {
  userId: user.id,
});

// ctaVariant could be: 'control', 'free-trial', 'get-started', 'try-now'
const ctaText = {
  'control': 'Start Free Trial',
  'free-trial': 'Try Free for 14 Days',
  'get-started': 'Get Started Free',
  'try-now': 'Try Now — No Card Required',
}[ctaVariant];
```
Warning: more variants means more traffic. With 4 variants, each arm receives only a quarter of your users, so reaching the same per-variant sample size takes roughly twice the total traffic of a simple A/B test, and correcting for the extra pairwise comparisons pushes the requirement higher still.
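String variants can reuse the same hashing trick: slice the 0-100 bucket range into weighted segments. A sketch with an illustrative FNV-1a hash, not Rollgate's actual internals:

```javascript
// FNV-1a: deterministic 32-bit string hash (stand-in for MurmurHash)
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Assign a variant by walking cumulative weights over the hash bucket.
// variants: [{ name, weight }] with weights summing to 100.
function getVariant(flagName, userId, variants) {
  const bucket = fnv1a(`${flagName}:${userId}`) % 100;
  let cumulative = 0;
  for (const v of variants) {
    cumulative += v.weight;
    if (bucket < cumulative) return v.name;
  }
  return variants[variants.length - 1].name; // guard against rounding gaps
}

const variants = [
  { name: 'control', weight: 25 },
  { name: 'free-trial', weight: 25 },
  { name: 'get-started', weight: 25 },
  { name: 'try-now', weight: 25 },
];
// Same user always lands in the same segment, for any number of variants
```

The weights don't have to be equal; a cautious rollout might give a risky variant 10% while control keeps 90%.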
## Common Pitfalls

### 1. Peeking at Results Too Early
Looking at experiment data daily and stopping when you see significance is called optional stopping, and it dramatically inflates false positives. Set your sample size upfront and wait.
If you absolutely must peek, use sequential testing methods that account for multiple looks at the data.
### 2. Running Too Many Experiments Simultaneously
If experiments overlap (same user is in multiple experiments), interactions between them can pollute your results. Keep concurrent experiments on different parts of the product, or use mutually exclusive experiment groups.
### 3. Not Accounting for Novelty Effect
Users often engage more with anything new simply because it's new. A new UI might show higher engagement in week 1 that drops off by week 3. Run experiments long enough to capture steady-state behavior — typically 2-4 weeks.
### 4. Survivorship Bias
If your experiment only measures users who reach a certain point (e.g., checkout), you're missing the users who dropped off earlier. Always measure from the point of exposure, not from the point of conversion.
### 5. Ignoring Segments
An experiment might show no overall effect but have a strong positive effect on mobile users and a negative effect on desktop users. Check segment-level results for:
- Device type (mobile vs desktop)
- New vs returning users
- Geography
- Plan tier
## When NOT to A/B Test
Not everything needs an experiment:
- Bug fixes — just fix them
- Legal/compliance changes — not optional
- Performance improvements — measure, don't A/B test
- Changes with < 1,000 weekly users exposed — you won't reach significance in a reasonable time
- Obvious improvements — if the old version is clearly broken, skip the test
A/B testing is for decisions where the answer isn't obvious and the stakes justify the effort.
## Feature Flags vs Dedicated A/B Testing Platforms
| Capability | Feature Flags | Dedicated Platform |
|---|---|---|
| Traffic splitting | Yes | Yes |
| Consistent assignment | Yes | Yes |
| Multi-variant support | Yes | Yes |
| Built-in analytics | Basic | Advanced |
| Statistical engine | DIY or basic | Built-in |
| Bayesian analysis | No (usually) | Yes |
| Visual editor | No | Yes |
| Price | $0-99/mo | $1,000-10,000+/mo |
Use feature flags when you run < 10 experiments per month, have engineers who can write code for variants, and want to keep your stack simple.
Use a dedicated platform when experimentation is a core competency, you run dozens of concurrent experiments, and you need non-engineers to create tests.
For most teams under 50 engineers, feature flags are more than enough.
## Getting Started
Here's a practical checklist for your first A/B test with feature flags:
- Pick a high-traffic page where you can reach sample size in 1-2 weeks
- Define your hypothesis and primary metric before writing code
- Calculate sample size using an online calculator
- Create a flag with 50% rollout and user-based targeting
- Instrument exposure and conversion events in your analytics
- Wait for full sample size — don't peek and stop early
- Analyze results and document the decision
- Clean up — roll the winner to 100% and remove the flag
The best part? If you're already using feature flags for releases, you already have the infrastructure. You just need to add tracking and discipline.
Rollgate supports percentage-based rollouts with consistent user assignment out of the box — everything you need to run A/B tests without adding another vendor to your stack. Get started free.