# Critical Feedback: Why Current Chaos Injection is Insufficient for Production APIs

**To:** Apophis Engineering Team
**From:** Arbiter Platform Engineering
**Date:** 2026-04-27
**Context:** Production SaaS platform with 500+ endpoints, Stripe integration, complex middleware chains

---

## The Core Problem

Current chaos injection operates exclusively at the **HTTP transport layer** (the `executeHttp()` wrapper). This tests:
- ✅ Response schemas under forced errors
- ✅ Timeout contracts with artificial delays
- ✅ Response validation with corrupted bodies

But **production APIs fail at the dependency layer**, not the transport layer:
- Stripe API returns a 429 rate-limit response
- Database connection pool is exhausted
- Redis cache times out
- Third-party webhook delivery fails
- Message queue backs up

**Current chaos cannot simulate these.** It can force a 503 response, but it cannot simulate "Stripe returned 429, so we need to propagate the `retry-after` header" because the handler never sees the Stripe error.

---

## Specific Pain Points

### 1. Error Injection is Backwards

**Current behavior:**
```
Handler runs → creates side effects → response overridden to 503
```

**What we need:**
```
Handler runs → Stripe call fails with 429 → handler catches error → returns 503 with retry-after
```

The current approach tests "what does our 503 response look like" but not "does our handler correctly handle Stripe errors." These are different:
- Current: Tests schema compliance for hardcoded error responses
- Needed: Tests business logic for dependency failures

**Impact:** We have 503 contracts that pass, but our handler might not actually set the `retry-after` header when Stripe fails. The contract gives false confidence.
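
To make the difference concrete, here is a minimal sketch of the two flows. Everything in it is hypothetical and only illustrates the point: `createPlanHandler`, `withResponseOverride`, and the two Stripe stubs are stand-ins we made up, not Apophis or Stripe APIs.

```javascript
// Hypothetical handler: maps a Stripe 429 to a 503 that carries retry-after.
async function createPlanHandler(stripe, store) {
  try {
    const plan = await stripe.createPlan();
    store.push(plan); // side effect happens only after Stripe succeeds
    return { statusCode: 201, headers: {}, body: { data: plan } };
  } catch (err) {
    if (err.statusCode === 429) {
      return {
        statusCode: 503,
        headers: { 'retry-after': (err.headers || {})['retry-after'] },
        body: { data: { error: 'stripe_rate_limit' } },
      };
    }
    throw err;
  }
}

// Transport-layer chaos: the handler runs normally, then its response is replaced.
async function withResponseOverride(handler, stripe, store) {
  await handler(stripe, store); // side effects have already happened here
  return { statusCode: 503, headers: {}, body: { data: { error: 'chaos' } } };
}

// Dependency-layer chaos: the dependency itself fails, so the catch path runs.
const rateLimitedStripe = {
  async createPlan() {
    const err = new Error('Rate limit exceeded');
    err.statusCode = 429;
    err.headers = { 'retry-after': '60' };
    throw err;
  },
};

// A dependency that behaves normally.
const healthyStripe = {
  async createPlan() {
    return { id: 'plan_A' };
  },
};
```

With the override, the store ends up with a plan and the 503 has no `retry-after` header, regardless of what the handler would do on a real Stripe failure. With the failing stub, the store stays empty and the header is only present if the handler's error path actually sets it.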

### 2. Chaos Events Are Invisible

When chaos injects a fault, the test result shows:
```
POST /billing/plans (#1): FAIL
Error: Contract violation: if status:503 then response_body(this).data.error != null else true
```

But there's no indication of:
- Whether chaos was the cause (rather than a real bug)
- What type of chaos was injected (error? corruption? delay?)
- What the original response was before the override

**Impact:** Debugging chaos failures is guesswork. We can't tell whether our contract is wrong or whether chaos mutated the response unexpectedly.

### 3. Resilience Verification is Dangerous for Stateful APIs

When `resilience: { enabled: true }` is set, Apophis retries the same request up to `maxRetries` times. Because chaos overrides the response only after the handler has run, every attempt still executes its side effects.

For `POST /billing/plans`:
- Attempt 1: Creates plan A → gets 503 → retries
- Attempt 2: Creates plan B → gets 503 → retries
- Attempt 3: Creates plan C → gets 503 → retries
- Attempt 4: Creates plan D → succeeds

**Result: 4 plans created, 1 expected.** This pollutes state and makes follow-up tests (GET, PATCH, DELETE) behave unpredictably.

**Impact:** We can't use resilience testing on stateful routes unless they are idempotent. Most real APIs are stateful.
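
For comparison, this is roughly what a route needs in order to survive that retry loop. It is a sketch under our own assumptions: `makeCreatePlan` and the in-memory key map are hypothetical, and a real service would persist keys rather than hold them in memory.

```javascript
// Hypothetical in-memory route: replays the stored result when the same
// idempotency key is seen again, instead of creating another plan.
function makeCreatePlan() {
  const plans = [];
  const seen = new Map(); // idempotency key -> previously created plan

  function createPlan(idempotencyKey, attrs) {
    if (seen.has(idempotencyKey)) {
      return seen.get(idempotencyKey); // retry: no new side effect
    }
    const plan = { id: `plan_${plans.length + 1}`, ...attrs };
    plans.push(plan);
    seen.set(idempotencyKey, plan);
    return plan;
  }

  return { createPlan, plans };
}

// Simulate the four attempts listed above for one logical request.
const route = makeCreatePlan();
const key = 'req-123'; // one key per logical request, reused on every retry
for (let attempt = 1; attempt <= 4; attempt++) {
  route.createPlan(key, { name: 'pro' });
}
console.log(route.plans.length); // 1
```

Most of our stateful routes do not work this way today, which is why the retry behavior pollutes state for us.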

### 4. Dropout Returns Status Code 0

Network failures in production don't return status code 0. They:
- Time out (status undefined, error `ETIMEDOUT`)
- Reset the connection (error `ECONNRESET`)
- Return 503 from a load balancer

Status 0 is a browser-specific artifact (it is what `XMLHttpRequest` reports when a request fails at the network level). Node.js HTTP clients don't produce status 0; they reject or emit an error instead.

**Impact:** Contracts can't meaningfully match status 0. We have to either:
- Add `status:0` to all contracts (meaningless)
- Or ignore dropout failures (makes dropout useless)
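
One shape that would be easier for us to write contracts against is a result with no status code and an error code instead. The sketch below is only an illustration: `simulateDropout` and `isTransportFailure` are hypothetical names, and the result shape is our assumption, not something Apophis produces today.

```javascript
// Hypothetical result for a dropped request: no status code at all, only an
// error code like the ones Node.js attaches to socket errors.
function simulateDropout(kind) {
  const codes = { timeout: 'ETIMEDOUT', reset: 'ECONNRESET' };
  if (!(kind in codes)) {
    throw new Error(`unknown dropout kind: ${kind}`);
  }
  const err = new Error(`simulated network failure: ${kind}`);
  err.code = codes[kind];
  return { statusCode: undefined, error: err };
}

// A contract runner could branch on "transport failure" instead of status 0.
function isTransportFailure(result) {
  return result.statusCode === undefined && result.error !== undefined;
}
```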

---

## What Would Make Chaos Useful for Arbiter

### Option A: Outbound Request Contracts (Preferred)

Apophis intercepts outbound HTTP requests from the handler:

```javascript
// In Apophis config
chaos: {
  outbound: {
    'api.stripe.com': {
      delay: { probability: 0.1, minMs: 1000, maxMs: 5000 },
      error: {
        probability: 0.05,
        responses: [
          { statusCode: 429, headers: { 'retry-after': '60' } },
          { statusCode: 503, body: { error: 'stripe_unavailable' } }
        ]
      }
    }
  }
}
```

**Benefits:**
- Handler sees real dependency failures
- Tests actual error handling logic
- Side effects only occur when handler succeeds
- No state pollution from retries
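
As a starting point for discussion, the interception could be as small as a wrapper around whatever fetch-like function the handler uses. This is a sketch, not a proposed implementation: `withOutboundChaos` is a hypothetical name, the injected response is a plain object rather than a real HTTP response, and only the `error` rule from the config above is handled (no `delay`).

```javascript
// Wraps a fetch-like function. `rules` is keyed by hostname; `random` is
// injectable so tests can force or suppress injection deterministically.
function withOutboundChaos(fetchImpl, rules, random = Math.random) {
  return async function chaoticFetch(url, options) {
    const rule = rules[new URL(url).hostname];
    const error = rule && rule.error;
    if (error && random() < error.probability) {
      // Pick one of the configured canned responses.
      const index = Math.floor(random() * error.responses.length);
      const canned = error.responses[Math.min(index, error.responses.length - 1)];
      return {
        statusCode: canned.statusCode,
        headers: canned.headers || {},
        body: canned.body,
        chaos: true, // marks the response as injected
      };
    }
    return fetchImpl(url, options); // no rule for this host, or no injection this time
  };
}
```

Passing `random` explicitly (for example `() => 0` to always inject) would also let a test pin down exactly which failure the handler saw.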

### Option B: Service Method Wrapping

Apophis wraps methods on decorated services:

```javascript
// Fastify decorator
app.decorate('stripe', new StripeService());

// Apophis wraps it
apophis.chaos.wrap(app.stripe, {
  'paymentIntents.create': {
    delay: { probability: 0.1, ms: 5000 },
    error: { probability: 0.05, throws: new StripeTimeoutError() }
  }
});
```

**Benefits:**
- Works with any service pattern (HTTP, DB, queue)
- Tests business logic directly
- Minimal changes to existing code
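
A rough idea of what such a wrapper might do internally is sketched below. `wrapWithChaos` is a hypothetical helper of ours, not an existing Apophis function; it resolves the dotted path, handles only the `error` rule, and leaves `delay` out for brevity.

```javascript
// Replaces service[path] (dotted path, e.g. 'paymentIntents.create') with a
// wrapper that sometimes throws the configured error before calling through.
function wrapWithChaos(service, rules, random = Math.random) {
  for (const [path, rule] of Object.entries(rules)) {
    const keys = path.split('.');
    const methodName = keys.pop();
    const owner = keys.reduce((obj, key) => obj[key], service);
    const original = owner[methodName].bind(owner);

    owner[methodName] = async (...args) => {
      if (rule.error && random() < rule.error.probability) {
        throw rule.error.throws; // the handler sees this as a dependency failure
      }
      return original(...args);
    };
  }
  return service;
}
```

Because the error is thrown from inside the service call, the handler's own `catch` logic runs, which is the behavior we want to test.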

### Option C: Event-Driven Chaos

For async architectures:

```javascript
chaos: {
  events: {
    'webhook.received': {
      drop: { probability: 0.1 },             // Simulate webhook loss
      delay: { probability: 0.2, ms: 30000 }  // Simulate queue delay
    }
  }
}
```
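
For services built on Node's `EventEmitter`, the `drop` rule could be approximated by patching `emit`. This is a sketch under that assumption: `withEventChaos` is a hypothetical helper, and it covers `drop` only, not `delay`.

```javascript
const { EventEmitter } = require('events');

// Patches emitter.emit so that configured events are sometimes dropped.
// `random` is injectable so tests can make the outcome deterministic.
function withEventChaos(emitter, rules, random = Math.random) {
  const originalEmit = emitter.emit.bind(emitter);
  emitter.emit = (eventName, ...args) => {
    const rule = rules[eventName];
    if (rule && rule.drop && random() < rule.drop.probability) {
      return false; // dropped: listeners never run
    }
    return originalEmit(eventName, ...args);
  };
  return emitter;
}
```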

---

## Recommended Priority Order

### P0 (Critical): Fix Event Reporting

Every chaos injection should be visible:

```javascript
// In test results
test.diagnostics.chaos = {
  injected: true,
  type: 'error',
  details: {
    statusCode: 503,
    originalStatusCode: 201,
    strategy: 'override'
  }
};
```

Without this, chaos failures are indistinguishable from real bugs.
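
With that shape available, a reporter could fold the chaos details into the failure line. The helper below is a sketch: `describeFailure` and the `name`/`error` fields on the result are our assumptions about what a test result might carry.

```javascript
// Hypothetical reporter helper: prefixes a failure message with chaos details
// when the result carries the diagnostics shape proposed above.
function describeFailure(result) {
  const chaos = result.diagnostics && result.diagnostics.chaos;
  if (!chaos || !chaos.injected) {
    return `${result.name}: FAIL (no chaos injected)\n${result.error}`;
  }
  const { statusCode, originalStatusCode, strategy } = chaos.details;
  return (
    `${result.name}: FAIL (chaos: ${chaos.type}, ${strategy} ` +
    `${originalStatusCode} -> ${statusCode})\n${result.error}`
  );
}
```

Even this one extra line in the output would tell us whether to look at the contract or at the chaos configuration first.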

### P1 (High): Add Dependency-Aware Chaos

Implement outbound request interception or service wrapping. Current HTTP-layer chaos is too superficial for production APIs.

### P2 (Medium): Fix Dropout Semantics

Return proper status codes:
- `504 Gateway Timeout` for timeouts
- `503 Service Unavailable` for network failures
- Or make it configurable: `dropout: { statusCode: 503 }`

### P3 (Low): Stateful Retry Safety

Either:
- Attach a unique request ID that stays the same across retries, so the server can deduplicate (prevents duplicate creation)
- Or document that resilience requires idempotent handlers
- Or skip resilience for non-idempotent routes
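
The third option could be a simple pre-check before the retry loop. The sketch below assumes a hypothetical `canRetrySafely` helper and an `idempotency-key` request header as the opt-in signal; neither exists in Apophis as far as we know.

```javascript
// Methods defined as idempotent by HTTP semantics (RFC 9110, section 9.2.2).
const IDEMPOTENT_METHODS = new Set(['GET', 'HEAD', 'OPTIONS', 'TRACE', 'PUT', 'DELETE']);

// Hypothetical policy: retry only when the method is idempotent or the
// request carries an idempotency key the server can deduplicate on.
function canRetrySafely(method, headers = {}) {
  if (IDEMPOTENT_METHODS.has(method.toUpperCase())) return true;
  return Boolean(headers['idempotency-key']);
}
```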

---

## What We're Doing Instead

Since current chaos doesn't serve our needs, we're writing application-layer failure tests:

```javascript
// Assumes Node's built-in test runner; `app` and `payInvoice` come from the test setup.
const test = require('node:test');
const assert = require('node:assert');

test('Stripe rate limit handling', async () => {
  // Mock Stripe to return 429
  app.stripe.paymentIntents.create = async () => {
    const err = new Error('Rate limit exceeded');
    err.statusCode = 429;
    err.headers = { 'retry-after': '60' };
    throw err;
  };

  const res = await payInvoice({ invoiceId: 'test' });

  // The handler should surface the rate limit and propagate retry-after
  assert.strictEqual(res.statusCode, 429);
  assert.strictEqual(res.json().data.error, 'stripe_rate_limit');
  assert.strictEqual(res.headers['retry-after'], '60');
});
```

This tests what we actually need: **handler behavior when dependencies fail.**

---

## Conclusion

Apophis chaos is a good start for HTTP-layer resilience testing, but it's insufficient for production APIs with external dependencies. The framework needs to evolve from "HTTP response mutator" to "dependency failure simulator" to be truly valuable.

We want Apophis to succeed. The schema-driven contract approach is innovative and valuable. But chaos testing needs to be dependency-aware to be useful for real-world APIs.

**Happy to collaborate** on designing the outbound interception API or service wrapping approach.