# A/B Testing with Feature Flags: Ship Experiments Without the Complexity

## Why A/B Test with Feature Flags?
Most teams think A/B testing requires a dedicated experimentation platform — Optimizely, LaunchDarkly Experimentation, or Google Optimize (RIP). These tools cost thousands per month, add SDK bloat, and introduce yet another vendor into your stack.
Here's the thing: if you already have feature flags, you already have 80% of what you need for A/B testing.
A feature flag with percentage-based rollout is fundamentally an A/B test. The only missing pieces are:
- Consistent assignment — same user always sees the same variant
- Variant tracking — recording which variant each user saw
- Metric collection — measuring outcomes per variant
- Statistical analysis — determining if the difference is real
Let's build this step by step.
## The Basics: Flags as Experiments
A traditional feature flag splits users into two groups: flag ON vs flag OFF. An A/B test does the same thing, but with intent — you're measuring which group performs better on a specific metric.
```javascript
// This is a feature flag
const showNewPricing = rollgate.isEnabled('new-pricing-page', { userId });

// This is also an A/B test (same code, different intent)
if (showNewPricing) {
  renderNewPricingPage(); // Variant B
  track('pricing_page_view', { variant: 'new' });
} else {
  renderCurrentPricingPage(); // Variant A (control)
  track('pricing_page_view', { variant: 'control' });
}
```
The code is identical. The difference is operational: you're tracking outcomes and making a data-driven decision.
## Consistent User Assignment
The most important requirement for A/B testing is consistency — a user must always see the same variant for the duration of the experiment. If user #42 sees the new pricing page on Monday, they must see it on Tuesday too.
Feature flags handle this through sticky assignment. When you pass a userId (or any stable identifier) to the flag evaluation, the system hashes it deterministically:
```javascript
// Rollgate SDK handles this automatically
const variant = rollgate.isEnabled('new-pricing-page', {
  userId: user.id, // Stable identifier
});

// Same userId always gets the same result
// No database lookup, no cookie required
```
This works because good feature flag systems apply a consistent hash (MurmurHash is a common choice) to the user ID. The hash maps to a number between 0 and 100; if your rollout is at 50%, users whose hash falls below 50 get the variant and everyone else gets the control.
Why this matters: Cookie-based assignment breaks across devices. Server-side hashing on user ID works everywhere — web, mobile, API, email.
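To make the mechanics concrete, here is a minimal sketch of deterministic bucketing. It uses a simple FNV-1a hash as a stand-in for MurmurHash, and the `bucket` and `isEnabled` functions are illustrative, not Rollgate's actual implementation:

```javascript
// FNV-1a: a tiny, deterministic 32-bit string hash (stand-in for MurmurHash)
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // unsigned 32-bit multiply
  }
  return hash >>> 0;
}

// Map a flag + user to a stable bucket in [0, 100)
// Hashing the flag name too keeps experiments independent of each other
function bucket(flagName, userId) {
  return fnv1a(`${flagName}:${userId}`) % 100;
}

function isEnabled(flagName, userId, rolloutPercent) {
  return bucket(flagName, userId) < rolloutPercent;
}
```

Because the bucket is a pure function of the flag name and user ID, no assignment needs to be stored anywhere: any server, device, or batch job that evaluates the flag reaches the same answer.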
## Designing Your Experiment

Before writing code, define three things:

### 1. Hypothesis
Bad: "Let's see if the new pricing page is better."
Good: "Changing the pricing page CTA from 'Start Free Trial' to 'Get Started Free' will increase trial signups by at least 10%."
A clear hypothesis tells you what to measure and when to stop.
### 2. Primary Metric
Pick one metric that determines success. Secondary metrics are fine for context, but having multiple primary metrics inflates your false positive rate.
| Experiment | Bad Primary Metric | Good Primary Metric |
|---|---|---|
| New checkout flow | Page views | Completed purchases |
| Pricing page redesign | Time on page | Trial signups |
| Search algorithm | Searches performed | Click-through on first result |
### 3. Sample Size
This is where most teams mess up. You need enough data to detect a meaningful difference. Running an experiment for 2 days with 100 users proves nothing.
Rule of thumb for a standard A/B test:
- 5% conversion rate baseline → need ~1,500 users per variant to detect a 50% relative change (at 80% power and 5% significance)
- 2% conversion rate baseline → need ~4,000 users per variant for the same relative change
- High-traffic pages → you might have enough data in days
- Low-traffic pages → it might take weeks
Don't stop the experiment early because one variant "looks better." That's the statistical equivalent of flipping a coin 5 times, getting 4 heads, and concluding the coin is biased.
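Those rules of thumb fall out of the standard two-proportion power calculation. A sketch, assuming the conventional two-sided 5% significance level and 80% power (the `sampleSizePerVariant` name is ours for illustration, not from any library):

```javascript
// Approximate sample size per variant for a two-proportion test.
// Assumes two-sided alpha = 0.05 and 80% power.
function sampleSizePerVariant(baselineRate, relativeLift) {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeLift);
  const zAlpha = 1.96;  // z for two-sided 5% significance
  const zBeta = 0.8416; // z for 80% power
  const pBar = (p1 + p2) / 2;

  const numerator =
    zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
    zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));

  return Math.ceil((numerator ** 2) / ((p2 - p1) ** 2));
}

// 5% baseline, 50% relative lift -> roughly 1,500 users per variant
// 2% baseline, 50% relative lift -> roughly 3,800 users per variant
```

Plug in your own baseline and the smallest lift you'd care about, and you get a concrete number to wait for before reading the results.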
## Implementation Pattern
Here's a practical implementation using feature flags:
### Step 1: Create the Flag

Set up a boolean feature flag with percentage-based rollout:

- Flag name: `experiment-pricing-cta`
- Rollout: 50% (half see variant B, half see control)
- Targeting: all logged-in users
### Step 2: Instrument Your Code
```javascript
// Server-side (Node.js example)
app.get('/pricing', async (req, res) => {
  const showNewCTA = rollgate.isEnabled('experiment-pricing-cta', {
    userId: req.user.id,
  });

  // Track exposure
  analytics.track('experiment_exposure', {
    experiment: 'pricing-cta',
    variant: showNewCTA ? 'new-cta' : 'control',
    userId: req.user.id,
  });

  res.render('pricing', { showNewCTA });
});
```
```jsx
// Client-side (React example)
function PricingPage() {
  const showNewCTA = useFlag('experiment-pricing-cta');

  useEffect(() => {
    analytics.track('experiment_exposure', {
      experiment: 'pricing-cta',
      variant: showNewCTA ? 'new-cta' : 'control',
    });
  }, [showNewCTA]);

  return (
    <div>
      <h1>Choose your plan</h1>
      <Button onClick={handleSignup}>
        {showNewCTA ? 'Get Started Free' : 'Start Free Trial'}
      </Button>
    </div>
  );
}
```
### Step 3: Track Conversions
```javascript
// When the target action happens
function handleSignup(plan) {
  // Re-evaluating the flag here is safe: assignment is deterministic,
  // so the same userId always resolves to the same variant
  analytics.track('trial_signup', {
    experiment: 'pricing-cta',
    variant: rollgate.isEnabled('experiment-pricing-cta', { userId })
      ? 'new-cta'
      : 'control',
    plan: plan,
  });
}
```
### Step 4: Analyze Results
After reaching your target sample size, pull the data:
```sql
-- Conversion rate per variant
SELECT
  variant,
  COUNT(DISTINCT user_id) AS users,
  COUNT(DISTINCT CASE WHEN converted THEN user_id END) AS conversions,
  ROUND(
    COUNT(DISTINCT CASE WHEN converted THEN user_id END)::numeric /
    COUNT(DISTINCT user_id) * 100, 2
  ) AS conversion_rate
FROM experiment_events
WHERE experiment = 'pricing-cta'
GROUP BY variant;
```
| variant | users | conversions | conversion_rate |
|---|---|---|---|
| control | 12,847 | 411 | 3.20% |
| new-cta | 12,903 | 658 | 5.10% |
A 59% relative improvement looks great, but is it statistically significant? Run a chi-squared test (or a two-proportion z-test) or use an online calculator. At these numbers, p < 0.001, so this is a real effect, not noise.
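You can check this without a calculator. Here is a minimal two-proportion z-test on the numbers above (equivalent to the 2x2 chi-squared test, since chi-squared equals z squared):

```javascript
// Two-proportion z-test: is the difference in conversion rates
// larger than sampling noise would explain?
function twoProportionZ(conv1, n1, conv2, n2) {
  const p1 = conv1 / n1;
  const p2 = conv2 / n2;
  const pooled = (conv1 + conv2) / (n1 + n2);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));
  return (p2 - p1) / se;
}

// control: 411 / 12,847 vs new-cta: 658 / 12,903
const z = twoProportionZ(411, 12847, 658, 12903);
// z is about 7.6, far beyond 3.29 (the two-sided threshold for p < 0.001)
```

Any |z| above roughly 1.96 clears the conventional p < 0.05 bar; a z this large leaves essentially no room for the result being chance.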
## Multi-Variant Tests (A/B/n)
Sometimes you want to test more than two variants. Feature flags support this through string variants instead of boolean on/off:
```javascript
// Flag returns a string variant instead of a boolean
const ctaVariant = rollgate.getVariant('experiment-pricing-cta', {
  userId: user.id,
});

// ctaVariant could be: 'control', 'free-trial', 'get-started', 'try-now'
const ctaText = {
  'control': 'Start Free Trial',
  'free-trial': 'Try Free for 14 Days',
  'get-started': 'Get Started Free',
  'try-now': 'Try Now — No Card Required',
}[ctaVariant];
```
Warning: more variants means more traffic. With 4 variants, each arm receives only a quarter of your users, so reaching the same per-variant sample size takes roughly twice the total traffic of a simple A/B test, and correcting for the extra pairwise comparisons pushes the requirement higher still.
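String variants can reuse the same hashing trick: slice the 0-100 bucket range into weighted segments. A sketch with an illustrative FNV-1a hash, not Rollgate's actual internals:

```javascript
// FNV-1a: deterministic 32-bit string hash (stand-in for MurmurHash)
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Assign a variant by walking cumulative weights over the hash bucket.
// variants: [{ name, weight }] with weights summing to 100.
function getVariant(flagName, userId, variants) {
  const bucket = fnv1a(`${flagName}:${userId}`) % 100;
  let cumulative = 0;
  for (const v of variants) {
    cumulative += v.weight;
    if (bucket < cumulative) return v.name;
  }
  return variants[variants.length - 1].name; // guard against rounding gaps
}

const variants = [
  { name: 'control', weight: 25 },
  { name: 'free-trial', weight: 25 },
  { name: 'get-started', weight: 25 },
  { name: 'try-now', weight: 25 },
];
// Same user always lands in the same segment, for any number of variants
```

The weights don't have to be equal; a cautious rollout might give a risky variant 10% while control keeps 90%.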
## Common Pitfalls

### 1. Peeking at Results Too Early
Looking at experiment data daily and stopping when you see significance is called optional stopping, and it dramatically inflates false positives. Set your sample size upfront and wait.
If you absolutely must peek, use sequential testing methods that account for multiple looks at the data.
### 2. Running Too Many Experiments Simultaneously
If experiments overlap (same user is in multiple experiments), interactions between them can pollute your results. Keep concurrent experiments on different parts of the product, or use mutually exclusive experiment groups.
### 3. Not Accounting for Novelty Effect
Users often engage more with anything new simply because it's new. A new UI might show higher engagement in week 1 that drops off by week 3. Run experiments long enough to capture steady-state behavior — typically 2-4 weeks.
### 4. Survivorship Bias
If your experiment only measures users who reach a certain point (e.g., checkout), you're missing the users who dropped off earlier. Always measure from the point of exposure, not from the point of conversion.
### 5. Ignoring Segments
An experiment might show no overall effect but have a strong positive effect on mobile users and a negative effect on desktop users. Check segment-level results for:
- Device type (mobile vs desktop)
- New vs returning users
- Geography
- Plan tier
## When NOT to A/B Test
Not everything needs an experiment:
- Bug fixes — just fix them
- Legal/compliance changes — not optional
- Performance improvements — measure, don't A/B test
- Changes with < 1,000 weekly users exposed — you won't reach significance in a reasonable time
- Obvious improvements — if the old version is clearly broken, skip the test
A/B testing is for decisions where the answer isn't obvious and the stakes justify the effort.
## Feature Flags vs Dedicated A/B Testing Platforms
| Capability | Feature Flags | Dedicated Platform |
|---|---|---|
| Traffic splitting | Yes | Yes |
| Consistent assignment | Yes | Yes |
| Multi-variant support | Yes | Yes |
| Built-in analytics | Basic | Advanced |
| Statistical engine | DIY or basic | Built-in |
| Bayesian analysis | No (usually) | Yes |
| Visual editor | No | Yes |
| Price | $0-99/mo | $1,000-10,000+/mo |
Use feature flags when you run < 10 experiments per month, have engineers who can write code for variants, and want to keep your stack simple.
Use a dedicated platform when experimentation is a core competency, you run dozens of concurrent experiments, and you need non-engineers to create tests.
For most teams under 50 engineers, feature flags are more than enough.
## Getting Started
Here's a practical checklist for your first A/B test with feature flags:
- Pick a high-traffic page where you can reach sample size in 1-2 weeks
- Define your hypothesis and primary metric before writing code
- Calculate sample size using an online calculator
- Create a flag with 50% rollout and user-based targeting
- Instrument exposure and conversion events in your analytics
- Wait for full sample size — don't peek and stop early
- Analyze results and document the decision
- Clean up — roll the winner to 100% and remove the flag
The best part? If you're already using feature flags for releases, you already have the infrastructure. You just need to add tracking and discipline.
Rollgate supports percentage-based rollouts with consistent user assignment out of the box — everything you need to run A/B tests without adding another vendor to your stack. Get started free.