# Critical Feedback: Why Current Chaos Injection is Insufficient for Production APIs

**To:** Apophis Engineering Team
**From:** Arbiter Platform Engineering
**Date:** 2026-04-27
**Context:** Production SaaS platform with 500+ endpoints, Stripe integration, complex middleware chains

---

## The Core Problem

Current chaos injection operates exclusively at the **HTTP transport layer** (`executeHttp()` wrapper). This tests:

- ✅ Response schemas under forced errors
- ✅ Timeout contracts with artificial delays
- ✅ Response validation with corrupted bodies

But **production APIs fail at the dependency layer**, not the transport layer:

- Stripe API returns 429 rate limit
- Database connection pool exhausted
- Redis cache timeout
- Third-party webhook delivery fails
- Message queue backlog

**Current chaos cannot simulate these.** It can force a 503 response, but it cannot simulate "Stripe returned 429, so we need to propagate retry-after header" because the handler never sees the Stripe error.

---

## Specific Pain Points

### 1. Error Injection is Backwards

**Current behavior:**

```
Handler runs → creates side effects → response overridden to 503
```

**What we need:**

```
Handler runs → Stripe call fails with 429 → handler catches error → returns 503 with retry-after
```

The current approach tests "what does our 503 response look like" but not "does our handler correctly handle Stripe errors." These are different:

- Current: Tests schema compliance for hardcoded error responses
- Needed: Tests business logic for dependency failures

**Impact:** We have 503 contracts that pass, but our handler might not actually set the retry-after header when Stripe fails. The contract gives false confidence.

### 2. Chaos Events Are Invisible

When chaos injects, the test result shows:

```
POST /billing/plans (#1): FAIL
Error: Contract violation: if status:503 then response_body(this).data.error != null else true
```

But there's no indication that:

- Chaos was the cause (not a real bug)
- What type of chaos was injected (error? corruption? delay?)
- What the original response was before override

**Impact:** Debugging chaos failures is impossible. We can't tell if our contract is wrong or if chaos mutated the response unexpectedly.

### 3. Resilience Verification is Dangerous for Stateful APIs

When `resilience: { enabled: true }`, Apophis retries the same request up to `maxRetries` times. For `POST /billing/plans`:

- Attempt 1: Creates plan A → gets 503 → retries
- Attempt 2: Creates plan B → gets 503 → retries
- Attempt 3: Creates plan C → gets 503 → retries
- Attempt 4: Creates plan D → succeeds

**Result: 4 plans created, 1 expected.** This pollutes state and makes follow-up tests (GET, PATCH, DELETE) behave unpredictably.

**Impact:** Can't use resilience testing on stateful routes without idempotency. Most real APIs are stateful.

### 4. Dropout Returns Status Code 0

Network failures in production don't return status code 0. They:

- Time out (status undefined, error "ETIMEDOUT")
- Reset connection (error "ECONNRESET")
- Return 503 from load balancer

Status 0 is a browser-specific artifact. Node.js HTTP clients don't produce status 0.

**Impact:** Contracts can't match status 0.
We have to either:

- Add `status:0` to all contracts (meaningless)
- Or ignore dropout failures (makes dropout useless)

---

## What Would Make Chaos Useful for Arbiter

### Option A: Outbound Request Contracts (Preferred)

Apophis intercepts outbound HTTP requests from the handler:

```javascript
// In Apophis config
chaos: {
  outbound: {
    'api.stripe.com': {
      delay: { probability: 0.1, minMs: 1000, maxMs: 5000 },
      error: {
        probability: 0.05,
        responses: [
          { statusCode: 429, headers: { 'retry-after': '60' } },
          { statusCode: 503, body: { error: 'stripe_unavailable' } }
        ]
      }
    }
  }
}
```

**Benefits:**

- Handler sees real dependency failures
- Tests actual error handling logic
- Side effects only occur when handler succeeds
- No state pollution from retries

### Option B: Service Method Wrapping

Apophis wraps methods on decorated services:

```javascript
// Fastify decorator
app.decorate('stripe', new StripeService());

// Apophis wraps it
apophis.chaos.wrap(app.stripe, {
  'paymentIntents.create': {
    delay: { probability: 0.1, ms: 5000 },
    error: { probability: 0.05, throws: new StripeTimeoutError() }
  }
});
```

**Benefits:**

- Works with any service pattern (HTTP, DB, queue)
- Tests business logic directly
- Minimal changes to existing code

### Option C: Event-Driven Chaos

For async architectures:

```javascript
chaos: {
  events: {
    'webhook.received': {
      drop: { probability: 0.1 },              // Simulate webhook loss
      delay: { probability: 0.2, ms: 30000 }   // Simulate queue delay
    }
  }
}
```

---

## Recommended Priority Order

### P0 (Critical): Fix Event Reporting

Every chaos injection should be visible:

```javascript
// In test results
test.diagnostics.chaos = {
  injected: true,
  type: 'error',
  details: {
    statusCode: 503,
    originalStatusCode: 201,
    strategy: 'override'
  }
}
```

Without this, chaos failures are indistinguishable from real bugs.

### P1 (High): Add Dependency-Aware Chaos

Implement outbound request interception or service wrapping. Current HTTP-layer chaos is too superficial for production APIs.

### P2 (Medium): Fix Dropout Semantics

Return proper status codes:

- `504 Gateway Timeout` for timeouts
- `503 Service Unavailable` for network failures
- Or make it configurable: `dropout: { statusCode: 503 }`

### P3 (Low): Stateful Retry Safety

Either:

- Make retries use unique IDs (prevent duplicate creation)
- Or document that resilience requires idempotent handlers
- Or skip resilience for non-idempotent routes

---

## What We're Doing Instead

Since current chaos doesn't serve our needs, we're writing application-layer failure tests:

```javascript
// Assumes `assert` (node:assert), `test`, `app`, and `payInvoice` are in scope.
test('Stripe rate limit handling', async () => {
  // Mock Stripe to return 429, keeping the real method so it can be restored
  const originalCreate = app.stripe.paymentIntents.create;
  app.stripe.paymentIntents.create = async () => {
    const err = new Error('Rate limit exceeded');
    err.statusCode = 429;
    err.headers = { 'retry-after': '60' };
    throw err;
  };

  try {
    const res = await payInvoice({ invoiceId: 'test' });
    assert.strictEqual(res.statusCode, 429);
    assert.strictEqual(res.json().data.error, 'stripe_rate_limit');
    assert.strictEqual(res.headers['retry-after'], '60');
  } finally {
    // Restore the real method so the mock does not leak into later tests
    app.stripe.paymentIntents.create = originalCreate;
  }
});
```

This tests what we actually need: **handler behavior when dependencies fail.**

---

## Conclusion

Apophis chaos is a good start for HTTP-layer resilience testing, but it's insufficient for production APIs with external dependencies. The framework needs to evolve from "HTTP response mutator" to "dependency failure simulator" to be truly valuable.

We want Apophis to succeed. The schema-driven contract approach is innovative and valuable. But chaos testing needs to be dependency-aware to be useful for real-world APIs.

**Happy to collaborate** on designing the outbound interception API or service wrapping approach.
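
As a starting point for that collaboration, here is a minimal sketch of what the service-wrapping idea (Option B) could look like, with injection reporting (P0) folded in. Everything here is an assumption made for illustration: `wrapWithChaos`, its `random` and `sleep` options, the `create` error factory, and `createFakeStripe` are hypothetical names, not Apophis or Stripe APIs. The error rule takes a factory rather than an error instance so each call gets a fresh error object.

```javascript
// Hypothetical helper, NOT an Apophis API: wraps methods on a service object
// so each call can be delayed or made to throw with a configured probability.
function wrapWithChaos(service, rules, options = {}) {
  // Injectable so tests can make the injection deterministic.
  const random = options.random ?? Math.random;
  const sleep =
    options.sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  const events = []; // one entry per injection, so failures are attributable

  for (const [path, rule] of Object.entries(rules)) {
    const keys = path.split('.');
    const methodName = keys.pop();
    // Walk to the object that owns the method, e.g. service.paymentIntents.
    const owner = keys.reduce((obj, key) => (obj == null ? obj : obj[key]), service);
    if (owner == null || typeof owner[methodName] !== 'function') {
      throw new TypeError(`No method found at "${path}"`);
    }
    const original = owner[methodName].bind(owner);

    owner[methodName] = async (...args) => {
      if (rule.delay && random() < rule.delay.probability) {
        events.push({ path, type: 'delay', ms: rule.delay.ms });
        await sleep(rule.delay.ms);
      }
      if (rule.error && random() < rule.error.probability) {
        events.push({ path, type: 'error' });
        // The real method is never called, so no side effect is created.
        throw rule.error.create();
      }
      return original(...args);
    };
  }
  return { events };
}

// Stand-in for a real Stripe client (assumption, for illustration only).
function createFakeStripe() {
  return {
    paymentIntents: {
      async create(params) {
        return { id: 'pi_test', amount: params.amount };
      },
    },
  };
}
```

Under these assumptions, a test would call `wrapWithChaos(app.stripe, { 'paymentIntents.create': { error: { probability: 0.05, create: () => new StripeTimeoutError() } } })`, run the handler, and then inspect the returned `events` array to tell an injected failure apart from a real bug. Because the failure is raised before the wrapped method runs, the handler sees a dependency error without a plan or payment having been created.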