How We Test a Feature Flag Platform: 1,192 Tests Across 13 SDKs
The Testing Problem No One Talks About
Building a feature flag platform means building something that sits in the critical path of every request your customers serve. If a flag evaluation returns the wrong result, your customer's checkout page shows the wrong UI. If the SDK crashes, your customer's app crashes. If the SSE connection drops, flags go stale.
The testing surface is enormous:
- 13 SDKs across 7 languages, including React, Vue, Angular, Svelte, Node.js, Go, Python, Java, .NET, Flutter, React Native, and Browser
- A backend API handling flag evaluation in ~200μs P99
- Real-time streaming via Server-Sent Events to thousands of concurrent connections
- Consistent hashing that must produce identical results across every SDK
I built Rollgate solo over the past year. Here's how I test all of it with 1,192 test cases and what I learned along the way.
The Test Pyramid
```
            /\
           /  \    E2E (6)
          /    \   Playwright, real browser
         /------\
        /        \    Contract (124)
       /          \   Cross-SDK behavioral consistency
      /------------\
     /              \    Integration (58)
    /                \   SDK ↔ API server, resilience
   /------------------\
  /                    \    Unit (1,004)
 /______________________\   Go API, React, SDKs, evaluation engine
```
| Layer | Count | Run Time | What It Guarantees |
|---|---|---|---|
| Unit | 1,004 | under 1 min | Logic correctness |
| Integration | 58 | ~2 min | Component interaction |
| Contract | 124 | ~5 min | All 13 SDKs behave identically |
| E2E | 6 | ~2 min | Full user flows work |
| Total | 1,192 | ~10 min | — |
Plus 5 load/stress harness scripts. They aren't test cases, but they verified that the system handles 28,000 concurrent SSE connections on a single node.
Layer 1: Unit Tests (1,004)
Unit tests cover the core logic without external dependencies. The most important area is the evaluation engine — the function that takes a flag configuration and a user context and returns a value.
Testing Consistent Hashing
The heart of percentage rollouts is consistent hashing. When you set a flag to 10% rollout, the system uses `SHA-256(flagKey + userId) % 10000` to deterministically assign each user to a bucket. The same user must always get the same result.
This is easy to get wrong. The most common mistake is using `Math.random()` (or `rand()` in Go), which makes users flicker between variants across page loads.
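As a concrete sketch of this scheme (assuming the flag key and user ID are simply concatenated before hashing and the first 4 bytes of the digest are taken big-endian; the real implementation may differ in such details):

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// isInRollout sketches the bucketing described above: hash flagKey+userId
// with SHA-256, take the first 4 bytes as a big-endian uint32, and map the
// result into 10,000 buckets. A rollout of 10% admits buckets 0..999;
// raising it to 20% admits 0..1999, so users are only ever added.
func isInRollout(flagKey, userID string, rolloutPercent int) bool {
	sum := sha256.Sum256([]byte(flagKey + userID))
	bucket := binary.BigEndian.Uint32(sum[:4]) % 10000
	return bucket < uint32(rolloutPercent*100)
}

func main() {
	// Deterministic: the same inputs always land in the same bucket.
	fmt.Println(isInRollout("flag-1", "user-42", 10) == isInRollout("flag-1", "user-42", 10)) // true
	// A 100% rollout includes everyone; 0% includes no one.
	fmt.Println(isInRollout("flag-1", "user-42", 100)) // true
	fmt.Println(isInRollout("flag-1", "user-42", 0))   // false
}
```

Because the hash depends only on the flag key and user ID, the result survives page loads, restarts, and different SDKs, provided every implementation agrees on the byte order.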
Our test verifies three properties:
- Consistency: The same user + flag always returns the same boolean, 1,000 times in a row
- Distribution: Over 10,000 users at 10% rollout, roughly 10% see `true` (within statistical tolerance)
- Monotonicity: Increasing rollout from 10% to 20% never removes a user who was already in the 10%
```go
func TestIsInRollout_Consistency(t *testing.T) {
	// Same user + same flag = same result, always
	results := make(map[bool]int)
	for i := 0; i < 1000; i++ {
		results[isInRollout("flag-1", "user-42", 10)]++
	}
	// Should have exactly 1 unique result
	if len(results) != 1 {
		t.Error("Inconsistent rollout result for same user")
	}
}
```
```go
func TestIsInRollout_Distribution(t *testing.T) {
	enabled := 0
	total := 10000
	for i := 0; i < total; i++ {
		if isInRollout("test-flag", fmt.Sprintf("user-%d", i), 10) {
			enabled++
		}
	}
	// Should be roughly 10% (allow 8-12% for statistical variance)
	ratio := float64(enabled) / float64(total) * 100
	if ratio < 8 || ratio > 12 {
		t.Errorf("Expected ~10%%, got %.1f%%", ratio)
	}
}
```
Property 3 (monotonicity) is what makes gradual rollouts actually "gradual" — when you increase from 10% to 20%, the 10% who already saw the feature continue to see it. New users are added, never removed.
Testing Targeting Rules
The targeting rule engine supports 18 operators: `equals`, `notEquals`, `contains`, `notContains`, `startsWith`, `endsWith`, `in`, `notIn`, `greaterThan`, `greaterEqual`, `lessThan`, `lessEqual`, `regex`, `isSet`, `isNotSet`, `semverGt`, `semverLt`, `semverEq`.
Each operator has its own test cases, including edge cases:
- What happens when the attribute doesn't exist on the user?
- What if the attribute value is a number but the rule expects a string?
- What if the regex is invalid?
- What if the semver string doesn't have a `v` prefix?
We have ~50 tests for the evaluation engine alone. These are the most important tests in the entire codebase because a bug here means every customer gets wrong flag values.
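The missing-attribute edge case deserves a concrete illustration. A minimal sketch of a rule evaluator (the `Rule` type, field names, and operator subset here are illustrative, not Rollgate's actual types) shows the convention that a missing attribute fails every value operator except `isNotSet`:

```go
package main

import (
	"fmt"
	"strings"
)

// Rule is a hypothetical shape for a targeting rule.
type Rule struct {
	Attribute string
	Operator  string
	Value     string
}

// evaluateRule sketches the edge-case handling described above:
// presence operators are checked first, and a missing attribute
// never matches a value operator.
func evaluateRule(rule Rule, user map[string]string) bool {
	attr, ok := user[rule.Attribute]
	switch rule.Operator {
	case "isSet":
		return ok
	case "isNotSet":
		return !ok
	}
	if !ok {
		return false // missing attribute: value operators cannot match
	}
	switch rule.Operator {
	case "equals":
		return attr == rule.Value
	case "contains":
		return strings.Contains(attr, rule.Value)
	case "startsWith":
		return strings.HasPrefix(attr, rule.Value)
	default:
		return false // unknown operator: fail closed
	}
}

func main() {
	user := map[string]string{"country": "US"}
	fmt.Println(evaluateRule(Rule{"country", "equals", "US"}, user)) // true
	fmt.Println(evaluateRule(Rule{"plan", "equals", "pro"}, user))   // false: attribute missing
	fmt.Println(evaluateRule(Rule{"plan", "isNotSet", ""}, user))    // true
}
```

Failing closed on unknown operators and missing attributes is the conservative choice: a mis-specified rule disables a feature rather than enabling it for everyone.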
Testing the Web Frontend
The dashboard has 178 tests covering React components, hooks, and the API client. Most use React Testing Library:
```tsx
describe('ConfirmDialog', () => {
  it('renders destructive variant with correct styling', () => {
    render(
      <ConfirmDialog
        open={true}
        title="Delete Flag"
        description="This cannot be undone."
        variant="destructive"
        onConfirm={jest.fn()}
        onCancel={jest.fn()}
      />
    );
    expect(screen.getByText('Delete Flag')).toBeInTheDocument();
    expect(screen.getByRole('button', { name: /confirm/i }))
      .toHaveClass('bg-red');
  });
});
```
The API client tests (34 cases) are particularly important because they verify error handling, authentication flow, and how the frontend reacts to various HTTP status codes from the backend.
Layer 2: Integration Tests (58)
Integration tests verify that components work together correctly. Each SDK has integration tests that exercise the full cycle: initialize → fetch flags → evaluate → receive updates.
The most interesting integration tests are the resilience tests in the Node SDK (25 tests). These simulate failure scenarios:
- API server goes down → SDK falls back to cached values
- API returns malformed JSON → SDK doesn't crash
- SSE connection drops → SDK reconnects automatically
- Circuit breaker opens after failures → SDK stops making requests temporarily
- Circuit breaker recovers → SDK resumes normal operation
These tests use a mock HTTP server that can be programmed to fail in specific ways:
```typescript
it('falls back to cache when API returns 500', async () => {
  // First request succeeds and populates cache
  mockServer.respondWith(200, { flags: { 'my-flag': true } });
  const client = createClient({ apiKey: 'test' });
  await client.initialize();

  // Second request fails
  mockServer.respondWith(500, 'Internal Server Error');
  await client.refresh();

  // Should still return cached value
  expect(client.isEnabled('my-flag')).toBe(true);
});
```
This is where most feature flag SDKs fail in practice. The SDK itself becomes a single point of failure — if it crashes on initialization or doesn't handle network errors gracefully, it takes down the customer's application. The circuit breaker pattern is critical: after N consecutive failures, the SDK stops trying and serves cached values until the API recovers.
Layer 3: Contract Tests (124) — The Most Important Layer
This is the layer that most feature flag platforms skip, and it's the one that matters most.
The Problem: SDK Drift
When you have 13 SDKs across 7 languages, behavioral drift is inevitable. A flag might evaluate to true in the Node SDK but false in Go for the same user — because the consistent hashing implementation used a different byte order, or the targeting rule engine parsed a numeric attribute differently.
The only way to prevent this is contract testing: run the exact same assertions against every SDK and verify they all produce identical results.
How It Works
The contract test harness is a Go program that:
- Starts a real Rollgate API server with PostgreSQL and Redis
- Seeds test flags with known configurations (boolean flags, string variants, percentage rollouts, targeting rules with every operator)
- Starts each SDK's test service — a small HTTP server that wraps the SDK and exposes a uniform REST API:
  - `POST /evaluate` → evaluates a flag for a given user
  - `POST /identify` → changes the user context
  - `GET /health` → verifies the SDK is connected
- Runs 124 identical test cases against each SDK service
- Compares results — every SDK must return the same value for the same flag + user combination
```go
// This test runs against EVERY SDK adapter
func TestBooleanFlag_Enabled(t *testing.T) {
	flag := seedFlag(t, "test-boolean", FlagConfig{
		Type:    "boolean",
		Enabled: true,
		Rollout: 100,
	})

	for _, sdk := range sdkAdapters {
		t.Run(sdk.Name, func(t *testing.T) {
			result := sdk.Evaluate(flag.Key, testUser)
			assert.True(t, result.Enabled)
			assert.Equal(t, true, result.Value)
			assert.Equal(t, "FALLTHROUGH", result.Reason)
		})
	}
}
```
What Contract Tests Catch
In practice, contract tests have caught:
- Hashing differences: Go's `crypto/sha256` and Node's `crypto.createHash('sha256')` produce the same output, but the way we extracted 4 bytes from the hash differed between implementations. One SDK used big-endian, another used little-endian. Same user, different rollout bucket.
- Operator inconsistencies: The `contains` operator was case-sensitive in the Go SDK but case-insensitive in the Node SDK. Without contract tests, a targeting rule like "country contains US" would match "us" in Node but not in Go.
- Default value handling: When a flag is disabled, some SDKs returned `null` while others returned the configured default value. Both are "correct" depending on your perspective, but they must be consistent.
- Evaluation reason strings: The Node SDK reported `"RULE_MATCH"` while the Go SDK reported `"rule_match"`. The casing difference broke clients that parsed the reason string.
None of these bugs would have been caught by unit tests. They only surface when you compare two SDKs against the same input.
Why Not Just Test Each SDK Independently?
Because independent tests test the implementation, not the contract. Each SDK's test suite can pass while the SDKs disagree with each other. Contract tests are the only way to verify cross-SDK consistency.
The analogy: unit tests verify that each musician plays their part correctly. Contract tests verify that the orchestra is in tune.
Layer 4: E2E Tests (6)
End-to-end tests use Playwright to verify complete user flows in a real browser against production or staging:
- OAuth login callback → authenticated redirect
- Session cookie security flags (HttpOnly, SameSite, Secure)
- SSE flag propagation (create account → create flag → toggle → verify real-time update)
We have only 6 E2E tests because they're slow, flaky by nature, and expensive to maintain. The contract tests handle most of what E2E would typically cover — verifying that the system works correctly from the SDK's perspective.
The Stress Test: 28,000 Concurrent SSE Connections
The most dramatic test isn't a test case at all — it's a stress harness written in Go:
```go
// sse-stress.go - simplified
func main() {
	var connected int64
	for i := 0; i < 30000; i++ {
		go func() {
			resp, err := http.Get(sseURL)
			if err != nil {
				return // connection refused: we've hit a server or kernel limit
			}
			defer resp.Body.Close()
			atomic.AddInt64(&connected, 1)
			// Hold the connection open by draining the event stream
			io.Copy(io.Discard, resp.Body)
		}()
	}
	// Monitor connected count
	for {
		fmt.Printf("Connected: %d\n", atomic.LoadInt64(&connected))
		time.Sleep(time.Second)
	}
}
```
On a CX33 VPS (4 vCPU, 8GB RAM, €13/month), the system sustained 28,000 concurrent SSE connections before hitting kernel limits. At that point:
- CPU: ~60%
- Memory: ~4GB (mostly TCP buffers)
- Flag evaluation latency: unchanged at ~200μs P99
- Flag change propagation: under 500ms to all connected clients
This test informed our pricing tiers: the Growth plan allows 500 SSE connections per environment, which is far below what a single node can handle. We have headroom.
Lessons Learned
1. Contract Tests Are Non-Negotiable for Multi-SDK Products
If you build SDKs for multiple languages, contract tests are the single highest-value investment you can make. They catch bugs that no other test layer can find, and they give you confidence to ship SDK updates without breaking cross-SDK consistency.
2. Test the Failure Modes, Not Just the Happy Path
The resilience integration tests (circuit breaker, cache fallback, reconnection) have caught more production issues than any other test suite. When your SDK sits in someone else's critical path, graceful degradation is more important than feature completeness.
3. The Test Pyramid Is Real — Respect It
We have 1,004 unit tests, 58 integration tests, 124 contract tests, and 6 E2E tests. The ratio matters. Unit tests are fast and reliable. E2E tests are slow and flaky. Every time I've been tempted to add an E2E test, I've found a way to cover the same behavior at a lower layer.
4. Stress Tests Aren't Tests — They're Experiments
The SSE stress test doesn't have a pass/fail assertion. It's an experiment that reveals the system's limits. The result (28K connections on a €13/month VPS) became a selling point — but only because we actually measured it instead of guessing.
5. Count Your Tests Accurately
For months I told people we had "about 850 tests." When I actually counted, we had 1,192. The discrepancy came from not counting contract tests (124) and not accounting for the Go API suite's growth (294 → 452). If you're going to cite a number, make sure it's real.
The Full Breakdown
| Area | Tests | Type |
|---|---|---|
| Go API (handlers, evaluation, middleware, models) | 452 | Unit |
| Web Frontend (components, hooks, API client) | 178 | Unit |
| SDK Core (cache, circuit breaker, retry, dedup) | 53 | Unit |
| SDK Node | 166 | Unit + Integration |
| SDK Go | 100 | Unit |
| SDK React | 28 | Unit + Integration |
| SDK Svelte | 32 | Unit + Integration |
| SDK Vue | 22 | Unit + Integration |
| SDK Angular | 20 | Unit |
| SDK Browser | 11 | Unit |
| Contract test harness | 124 | Contract |
| E2E (Playwright) | 6 | E2E |
| Total | 1,192 | — |
| Load/stress scripts | 5 | Harness |
Every test runs on every PR via CI. The entire suite completes in under 10 minutes.
Rollgate is a feature flag platform with 13 SDKs and a free tier (500K requests/month). If you're interested in the architecture, see How I Built Rollgate. For gradual rollout strategies and how we compare to LaunchDarkly on pricing, check our pricing comparison. The SDKs are open source.