How We Test a Feature Flag Platform: 1,192 Tests Across 13 SDKs
The Testing Problem No One Talks About
Building a feature flag platform means building something that sits in the critical path of every request your customers serve. If a flag evaluation returns the wrong result, your customer's checkout page shows the wrong UI. If the SDK crashes, your customer's app crashes. If the SSE connection drops, flags go stale.
The testing surface is enormous:
- 13 SDKs across 7 languages, including React, Vue, Angular, Svelte, Node.js, Go, Python, Java, .NET, Flutter, React Native, and Browser
- A backend API handling flag evaluation in ~200μs P99
- Real-time streaming via Server-Sent Events to thousands of concurrent connections
- Consistent hashing that must produce identical results across every SDK
I built Rollgate solo over the past year. Here's how I test all of it with 1,192 test cases and what I learned along the way.
The Test Pyramid
```
            /\
           /  \    E2E (6)
          /    \   Playwright, real browser
         /------\
        /        \    Contract (124)
       /          \   Cross-SDK behavioral consistency
      /------------\
     /              \    Integration (58)
    /                \   SDK ↔ API server, resilience
   /------------------\
  /                    \    Unit (1,004)
 /______________________\   Go API, React, SDKs, evaluation engine
```
| Layer | Count | Run Time | What It Guarantees |
|---|---|---|---|
| Unit | 1,004 | under 1 min | Logic correctness |
| Integration | 58 | ~2 min | Component interaction |
| Contract | 124 | ~5 min | All 13 SDKs behave identically |
| E2E | 6 | ~2 min | Full user flows work |
| Total | 1,192 | ~10 min | — |
Plus 5 load/stress harness scripts. They aren't test cases, but they verified that the system handles 28,000 concurrent SSE connections on a single node.
Layer 1: Unit Tests (1,004)
Unit tests cover the core logic without external dependencies. The most important area is the evaluation engine — the function that takes a flag configuration and a user context and returns a value.
Testing Consistent Hashing
The heart of percentage rollouts is consistent hashing. When you set a flag to 10% rollout, the system uses `SHA-256(flagKey + userId) % 10000` to deterministically assign each user to a bucket. The same user must always get the same result.
This is easy to get wrong. The most common mistake is using `Math.random()` (or `rand()` in Go), which makes users flicker between variants across page loads.
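As a concrete sketch of this scheme (assuming the flag key and user ID are simply concatenated before hashing and the first 4 bytes of the digest are taken big-endian; the real implementation may differ in such details):

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// isInRollout sketches the bucketing described above: hash flagKey+userId
// with SHA-256, take the first 4 bytes as a big-endian uint32, and map the
// result into 10,000 buckets. A rollout of 10% admits buckets 0..999;
// raising it to 20% admits 0..1999, so users are only ever added.
func isInRollout(flagKey, userID string, rolloutPercent int) bool {
	sum := sha256.Sum256([]byte(flagKey + userID))
	bucket := binary.BigEndian.Uint32(sum[:4]) % 10000
	return bucket < uint32(rolloutPercent*100)
}

func main() {
	// Deterministic: the same inputs always land in the same bucket.
	fmt.Println(isInRollout("flag-1", "user-42", 10) == isInRollout("flag-1", "user-42", 10)) // true
	// A 100% rollout includes everyone; 0% includes no one.
	fmt.Println(isInRollout("flag-1", "user-42", 100)) // true
	fmt.Println(isInRollout("flag-1", "user-42", 0))   // false
}
```

Because the hash depends only on the flag key and user ID, the result survives page loads, restarts, and different SDKs, provided every implementation agrees on the byte order.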
Our test verifies three properties:
- Consistency: The same user + flag always returns the same boolean, 1,000 times in a row
- Distribution: Over 10,000 users at 10% rollout, roughly 10% see `true` (within statistical tolerance)
- Monotonicity: Increasing rollout from 10% to 20% never removes a user who was already in the 10%
```go
func TestIsInRollout_Consistency(t *testing.T) {
	// Same user + same flag = same result, always
	results := make(map[bool]int)
	for i := 0; i < 1000; i++ {
		results[isInRollout("flag-1", "user-42", 10)]++
	}
	// Should have exactly 1 unique result
	if len(results) != 1 {
		t.Error("Inconsistent rollout result for same user")
	}
}
```
```go
func TestIsInRollout_Distribution(t *testing.T) {
	enabled := 0
	total := 10000
	for i := 0; i < total; i++ {
		if isInRollout("test-flag", fmt.Sprintf("user-%d", i), 10) {
			enabled++
		}
	}
	// Should be roughly 10% (allow 8-12% for statistical variance)
	ratio := float64(enabled) / float64(total) * 100
	if ratio < 8 || ratio > 12 {
		t.Errorf("Expected ~10%%, got %.1f%%", ratio)
	}
}
```
Property 3 (monotonicity) is what makes gradual rollouts actually "gradual" — when you increase from 10% to 20%, the 10% who already saw the feature continue to see it. New users are added, never removed.
Testing Targeting Rules
The targeting rule engine supports 18 operators: `equals`, `notEquals`, `contains`, `notContains`, `startsWith`, `endsWith`, `in`, `notIn`, `greaterThan`, `greaterEqual`, `lessThan`, `lessEqual`, `regex`, `isSet`, `isNotSet`, `semverGt`, `semverLt`, `semverEq`.
Each operator has its own test cases, including edge cases:
- What happens when the attribute doesn't exist on the user?
- What if the attribute value is a number but the rule expects a string?
- What if the regex is invalid?
- What if the semver string doesn't have a `v` prefix?
We have ~50 tests for the evaluation engine alone. These are the most important tests in the entire codebase because a bug here means every customer gets wrong flag values.
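The missing-attribute edge case deserves a concrete illustration. A minimal sketch of a rule evaluator (the `Rule` type, field names, and operator subset here are illustrative, not Rollgate's actual types) shows the convention that a missing attribute fails every value operator except `isNotSet`:

```go
package main

import (
	"fmt"
	"strings"
)

// Rule is a hypothetical shape for a targeting rule.
type Rule struct {
	Attribute string
	Operator  string
	Value     string
}

// evaluateRule sketches the edge-case handling described above:
// presence operators are checked first, and a missing attribute
// never matches a value operator.
func evaluateRule(rule Rule, user map[string]string) bool {
	attr, ok := user[rule.Attribute]
	switch rule.Operator {
	case "isSet":
		return ok
	case "isNotSet":
		return !ok
	}
	if !ok {
		return false // missing attribute: value operators cannot match
	}
	switch rule.Operator {
	case "equals":
		return attr == rule.Value
	case "contains":
		return strings.Contains(attr, rule.Value)
	case "startsWith":
		return strings.HasPrefix(attr, rule.Value)
	default:
		return false // unknown operator: fail closed
	}
}

func main() {
	user := map[string]string{"country": "US"}
	fmt.Println(evaluateRule(Rule{"country", "equals", "US"}, user)) // true
	fmt.Println(evaluateRule(Rule{"plan", "equals", "pro"}, user))   // false: attribute missing
	fmt.Println(evaluateRule(Rule{"plan", "isNotSet", ""}, user))    // true
}
```

Failing closed on unknown operators and missing attributes is the conservative choice: a mis-specified rule disables a feature rather than enabling it for everyone.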
Testing the Web Frontend
The dashboard has 178 tests covering React components, hooks, and the API client. Most use React Testing Library:
```tsx
describe('ConfirmDialog', () => {
  it('renders destructive variant with correct styling', () => {
    render(
      <ConfirmDialog
        open={true}
        title="Delete Flag"
        description="This cannot be undone."
        variant="destructive"
        onConfirm={jest.fn()}
        onCancel={jest.fn()}
      />
    );
    expect(screen.getByText('Delete Flag')).toBeInTheDocument();
    expect(screen.getByRole('button', { name: /confirm/i }))
      .toHaveClass('bg-red');
  });
});
```
The API client tests (34 cases) are particularly important because they verify error handling, authentication flow, and how the frontend reacts to various HTTP status codes from the backend.
Layer 2: Integration Tests (58)
Integration tests verify that components work together correctly. Each SDK has integration tests that exercise the full cycle: initialize → fetch flags → evaluate → receive updates.
The most interesting integration tests are the resilience tests in the Node SDK (25 tests). These simulate failure scenarios:
- API server goes down → SDK falls back to cached values
- API returns malformed JSON → SDK doesn't crash
- SSE connection drops → SDK reconnects automatically
- Circuit breaker opens after failures → SDK stops making requests temporarily
- Circuit breaker recovers → SDK resumes normal operation
These tests use a mock HTTP server that can be programmed to fail in specific ways:
```typescript
it('falls back to cache when API returns 500', async () => {
  // First request succeeds and populates cache
  mockServer.respondWith(200, { flags: { 'my-flag': true } });
  const client = createClient({ apiKey: 'test' });
  await client.initialize();

  // Second request fails
  mockServer.respondWith(500, 'Internal Server Error');
  await client.refresh();

  // Should still return cached value
  expect(client.isEnabled('my-flag')).toBe(true);
});
```
This is where most feature flag SDKs fail in practice. The SDK itself becomes a single point of failure — if it crashes on initialization or doesn't handle network errors gracefully, it takes down the customer's application. The circuit breaker pattern is critical: after N consecutive failures, the SDK stops trying and serves cached values until the API recovers.
Layer 3: Contract Tests (124) — The Most Important Layer
This is the layer that most feature flag platforms skip, and it's the one that matters most.
The Problem: SDK Drift
When you have 13 SDKs across 7 languages, behavioral drift is inevitable. A flag might evaluate to true in the Node SDK but false in Go for the same user — because the consistent hashing implementation used a different byte order, or the targeting rule engine parsed a numeric attribute differently.
The only way to prevent this is contract testing: run the exact same assertions against every SDK and verify they all produce identical results.
How It Works
The contract test harness is a Go program that:
- Starts a real Rollgate API server with PostgreSQL and Redis
- Seeds test flags with known configurations (boolean flags, string variants, percentage rollouts, targeting rules with every operator)
- Starts each SDK's test service — a small HTTP server that wraps the SDK and exposes a uniform REST API:
  - `POST /evaluate` → evaluates a flag for a given user
  - `POST /identify` → changes the user context
  - `GET /health` → verifies the SDK is connected
- Runs 124 identical test cases against each SDK service
- Compares results — every SDK must return the same value for the same flag + user combination
```go
// This test runs against EVERY SDK adapter
func TestBooleanFlag_Enabled(t *testing.T) {
	flag := seedFlag(t, "test-boolean", FlagConfig{
		Type:    "boolean",
		Enabled: true,
		Rollout: 100,
	})

	for _, sdk := range sdkAdapters {
		t.Run(sdk.Name, func(t *testing.T) {
			result := sdk.Evaluate(flag.Key, testUser)
			assert.True(t, result.Enabled)
			assert.Equal(t, true, result.Value)
			assert.Equal(t, "FALLTHROUGH", result.Reason)
		})
	}
}
```
What Contract Tests Catch
In practice, contract tests have caught:
- Hashing differences: Go's `crypto/sha256` and Node's `crypto.createHash('sha256')` produce the same output, but the way we extracted 4 bytes from the hash differed between implementations. One SDK used big-endian, another used little-endian. Same user, different rollout bucket.
- Operator inconsistencies: The `contains` operator was case-sensitive in the Go SDK but case-insensitive in the Node SDK. Without contract tests, a targeting rule like "country contains US" would match "us" in Node but not in Go.
- Default value handling: When a flag is disabled, some SDKs returned `null` while others returned the configured default value. Both are "correct" depending on your perspective, but they must be consistent.
- Evaluation reason strings: The Node SDK reported `"RULE_MATCH"` while the Go SDK reported `"rule_match"`. The casing difference broke clients that parsed the reason string.
None of these bugs would have been caught by unit tests. They only surface when you compare two SDKs against the same input.
Why Not Just Test Each SDK Independently?
Because independent tests test the implementation, not the contract. Each SDK's test suite can pass while the SDKs disagree with each other. Contract tests are the only way to verify cross-SDK consistency.
The analogy: unit tests verify that each musician plays their part correctly. Contract tests verify that the orchestra is in tune.
Layer 4: E2E Tests (6)
End-to-end tests use Playwright to verify complete user flows in a real browser against production or staging:
- OAuth login callback → authenticated redirect
- Session cookie security flags (HttpOnly, SameSite, Secure)
- SSE flag propagation (create account → create flag → toggle → verify real-time update)
We have only 6 E2E tests because they're slow, flaky by nature, and expensive to maintain. The contract tests handle most of what E2E would typically cover — verifying that the system works correctly from the SDK's perspective.
The Stress Test: 28,000 Concurrent SSE Connections
The most dramatic test isn't a test case at all — it's a stress harness written in Go:
```go
// sse-stress.go - simplified
func main() {
	var connected int64
	for i := 0; i < 30000; i++ {
		go func() {
			resp, err := http.Get(sseURL)
			if err != nil {
				return // connection refused: we've hit a server or kernel limit
			}
			defer resp.Body.Close()
			atomic.AddInt64(&connected, 1)
			// Hold the connection open by draining the event stream
			io.Copy(io.Discard, resp.Body)
		}()
	}
	// Monitor connected count
	for {
		fmt.Printf("Connected: %d\n", atomic.LoadInt64(&connected))
		time.Sleep(time.Second)
	}
}
```
On a CX33 VPS (4 vCPU, 8GB RAM, €13/month), the system sustained 28,000 concurrent SSE connections before hitting kernel limits. At that point:
- CPU: ~60%
- Memory: ~4GB (mostly TCP buffers)
- Flag evaluation latency: unchanged at ~200μs P99
- Flag change propagation: under 500ms to all connected clients
This test informed our pricing tiers: the Growth plan allows 500 SSE connections per environment, which is far below what a single node can handle. We have headroom.
Lessons Learned
1. Contract Tests Are Non-Negotiable for Multi-SDK Products
If you build SDKs for multiple languages, contract tests are the single highest-value investment you can make. They catch bugs that no other test layer can find, and they give you confidence to ship SDK updates without breaking cross-SDK consistency.
2. Test the Failure Modes, Not Just the Happy Path
The resilience integration tests (circuit breaker, cache fallback, reconnection) have caught more production issues than any other test suite. When your SDK sits in someone else's critical path, graceful degradation is more important than feature completeness.
3. The Test Pyramid Is Real — Respect It
We have 1,004 unit tests, 58 integration tests, 124 contract tests, and 6 E2E tests. The ratio matters. Unit tests are fast and reliable. E2E tests are slow and flaky. Every time I've been tempted to add an E2E test, I've found a way to cover the same behavior at a lower layer.
4. Stress Tests Aren't Tests — They're Experiments
The SSE stress test doesn't have a pass/fail assertion. It's an experiment that reveals the system's limits. The result (28K connections on a €13/month VPS) became a selling point — but only because we actually measured it instead of guessing.
5. Count Your Tests Accurately
For months I told people we had "about 850 tests." When I actually counted, we had 1,192. The discrepancy came from not counting contract tests (124) and not accounting for the Go API suite's growth (294 → 452). If you're going to cite a number, make sure it's real.
The Full Breakdown
| Area | Tests | Type |
|---|---|---|
| Go API (handlers, evaluation, middleware, models) | 452 | Unit |
| Web Frontend (components, hooks, API client) | 178 | Unit |
| SDK Core (cache, circuit breaker, retry, dedup) | 53 | Unit |
| SDK Node | 166 | Unit + Integration |
| SDK Go | 100 | Unit |
| SDK React | 28 | Unit + Integration |
| SDK Svelte | 32 | Unit + Integration |
| SDK Vue | 22 | Unit + Integration |
| SDK Angular | 20 | Unit |
| SDK Browser | 11 | Unit |
| Contract test harness | 124 | Contract |
| E2E (Playwright) | 6 | E2E |
| Total | 1,192 | — |
| Load/stress scripts | 5 | Harness |
Every test runs on every PR via CI. The entire suite completes in under 10 minutes.
Rollgate is a feature flag platform with 13 SDKs and a free tier (500K requests/month). If you're interested in the architecture, see How I Built Rollgate. For gradual rollout strategies and how we compare to LaunchDarkly on pricing, check our pricing comparison. The SDKs are open source.