chore: crush git history - reborn from consolidation on 2026-03-10
# Critical Feedback: Why Current Chaos Injection is Insufficient for Production APIs

**To:** Apophis Engineering Team
**From:** Arbiter Platform Engineering
**Date:** 2026-04-27
**Context:** Production SaaS platform with 500+ endpoints, Stripe integration, complex middleware chains

---

## The Core Problem

Current chaos injection operates exclusively at the **HTTP transport layer** (`executeHttp()` wrapper). This tests:
- ✅ Response schemas under forced errors
- ✅ Timeout contracts with artificial delays
- ✅ Response validation with corrupted bodies

But **production APIs fail at the dependency layer**, not the transport layer:
- Stripe API returns a 429 rate limit
- Database connection pool is exhausted
- Redis cache times out
- Third-party webhook delivery fails
- Message queue backs up

**Current chaos cannot simulate these.** It can force a 503 response, but it cannot simulate "Stripe returned 429, so we need to propagate the `retry-after` header" because the handler never sees the Stripe error.

---

## Specific Pain Points

### 1. Error Injection is Backwards

**Current behavior:**
```
Handler runs → creates side effects → response overridden to 503
```

**What we need:**
```
Handler runs → Stripe call fails with 429 → handler catches error → returns 503 with retry-after
```

The current approach tests "what does our 503 response look like" but not "does our handler correctly handle Stripe errors." These are different:
- Current: Tests schema compliance for hardcoded error responses
- Needed: Tests business logic for dependency failures

**Impact:** We have 503 contracts that pass, but our handler might not actually set the retry-after header when Stripe fails. The contract gives false confidence.

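To make the distinction concrete, here is a minimal sketch of the kind of handler logic that a transport-layer override never exercises. `mapDependencyError` and the error codes `upstream_rate_limited` and `upstream_unavailable` are hypothetical names, and the error shape (`statusCode`, `headers`) is an assumption for illustration, not the real Stripe SDK error type.

```javascript
// Hypothetical helper: translate a failed dependency call into the HTTP
// response our handler should return. Names and error codes are illustrative.
function mapDependencyError(err) {
  const headers = {};
  if (err.statusCode === 429) {
    // Propagate the upstream back-off hint when one was provided.
    const retryAfter = err.headers && err.headers['retry-after'];
    if (retryAfter) headers['retry-after'] = retryAfter;
    return { statusCode: 503, headers, body: { data: { error: 'upstream_rate_limited' } } };
  }
  return { statusCode: 503, headers, body: { data: { error: 'upstream_unavailable' } } };
}

// Usage: the 429 branch only runs if the handler actually sees the dependency error.
const rateLimited = Object.assign(new Error('Rate limit exceeded'), {
  statusCode: 429,
  headers: { 'retry-after': '60' },
});
const mapped = mapDependencyError(rateLimited);
```

Overriding the final response to 503 would satisfy the same schema without ever reaching the `retry-after` branch, which is exactly the gap described above.
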
### 2. Chaos Events Are Invisible

When chaos injects a fault, the test result shows:
```
POST /billing/plans (#1): FAIL
Error: Contract violation: if status:503 then response_body(this).data.error != null else true
```

But there's no indication of:
- Whether chaos was the cause (rather than a real bug)
- What type of chaos was injected (error? corruption? delay?)
- What the original response was before the override

**Impact:** Debugging chaos failures is guesswork. We can't tell if our contract is wrong or if chaos mutated the response unexpectedly.

### 3. Resilience Verification is Dangerous for Stateful APIs

When `resilience: { enabled: true }` is set, Apophis retries the same request up to `maxRetries` times.

For `POST /billing/plans`, where chaos overrides the response only after the handler has already run:
- Attempt 1: Creates plan A → gets 503 → retries
- Attempt 2: Creates plan B → gets 503 → retries
- Attempt 3: Creates plan C → gets 503 → retries
- Attempt 4: Creates plan D → succeeds

**Result: 4 plans created, 1 expected.** This pollutes state and makes follow-up tests (GET, PATCH, DELETE) behave unpredictably.

**Impact:** Can't use resilience testing on stateful routes without idempotency. Most real APIs are stateful.

### 4. Dropout Returns Status Code 0

Network failures in production don't return status code 0. They:
- Time out (status undefined, error "ETIMEDOUT")
- Reset connection (error "ECONNRESET")
- Return 503 from load balancer

Status 0 is a browser-specific artifact. Node.js HTTP clients don't produce status 0.

**Impact:** Contracts can't match status 0. We have to either:
- Add `status:0` to all contracts (meaningless)
- Or ignore dropout failures (makes dropout useless)

---

## What Would Make Chaos Useful for Arbiter

### Option A: Outbound Request Contracts (Preferred)

Apophis intercepts outbound HTTP requests from the handler:

```javascript
// In Apophis config
chaos: {
  outbound: {
    'api.stripe.com': {
      delay: { probability: 0.1, minMs: 1000, maxMs: 5000 },
      error: {
        probability: 0.05,
        responses: [
          { statusCode: 429, headers: { 'retry-after': '60' } },
          { statusCode: 503, body: { error: 'stripe_unavailable' } }
        ]
      }
    }
  }
}
```

**Benefits:**
- Handler sees real dependency failures
- Tests actual error handling logic
- Side effects only occur when handler succeeds
- No state pollution from retries

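As a starting point for discussion, the interception could be sketched as a wrapper around a `fetch`-compatible function. This is only an illustration under assumptions: `withOutboundChaos` is a hypothetical name, not an Apophis API; it relies on the `URL`, `Request`, and `Response` globals available in Node.js 18+; and it implements only the `error` rule from the config above, leaving `delay` out.

```javascript
// Hypothetical sketch, not an Apophis API: wrap a fetch-compatible function so
// that calls to configured hosts can receive synthetic failure responses.
// `random` is injectable so tests can make the outcome deterministic.
function withOutboundChaos(fetchImpl, rules, random = Math.random) {
  return async function chaoticFetch(input, init) {
    const url = new URL(input instanceof Request ? input.url : String(input));
    const errorRule = rules[url.hostname] && rules[url.hostname].error;

    if (errorRule && random() < errorRule.probability) {
      // Pick one of the configured failure responses.
      const index = Math.floor(random() * errorRule.responses.length);
      const { statusCode, headers, body } = errorRule.responses[index];
      return new Response(body === undefined ? null : JSON.stringify(body), {
        status: statusCode,
        headers,
      });
    }
    // No chaos for this call: forward it untouched.
    return fetchImpl(input, init);
  };
}

// Usage: force the 429 branch for Stripe while other hosts pass through.
const chaoticFetch = withOutboundChaos(
  async () => new Response('ok'), // stand-in for the real fetch
  {
    'api.stripe.com': {
      error: { probability: 1, responses: [{ statusCode: 429, headers: { 'retry-after': '60' } }] },
    },
  },
  () => 0 // deterministic "random" source
);
```

Because the synthetic response is returned to the handler's own HTTP client, the handler's real error path runs, and no side effect is created unless the handler chooses to continue.
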
### Option B: Service Method Wrapping

Apophis wraps methods on decorated services:

```javascript
// Fastify decorator
app.decorate('stripe', new StripeService());

// Apophis wraps it
apophis.chaos.wrap(app.stripe, {
  'paymentIntents.create': {
    delay: { probability: 0.1, ms: 5000 },
    error: { probability: 0.05, throws: new StripeTimeoutError() }
  }
});
```

**Benefits:**
- Works with any service pattern (HTTP, DB, queue)
- Tests business logic directly
- Minimal changes to existing code

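A rough sketch of what such a wrapper might do internally follows. `wrapWithChaos` is a hypothetical stand-in for the proposed `apophis.chaos.wrap`, the `fakeStripe` object replaces the real `StripeService`, and the rule shape (`delay.ms`, `error.throws`) simply follows the proposal above.

```javascript
// Hypothetical sketch of `chaos.wrap`: replace selected methods (addressed by
// dotted path) with versions that may delay or throw before delegating.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

function wrapWithChaos(service, rules, random = Math.random) {
  for (const [path, rule] of Object.entries(rules)) {
    const keys = path.split('.');
    const methodName = keys.pop();
    // Walk to the object that owns the method, e.g. service.paymentIntents.
    const owner = keys.reduce((obj, key) => obj[key], service);
    const original = owner[methodName];

    owner[methodName] = async function chaosWrapped(...args) {
      if (rule.delay && random() < rule.delay.probability) await sleep(rule.delay.ms);
      if (rule.error && random() < rule.error.probability) throw rule.error.throws;
      return original.apply(owner, args);
    };
  }
  return service;
}

// Usage with a stand-in service; the real StripeService is not shown here.
const fakeStripe = {
  paymentIntents: { create: async (params) => ({ id: 'pi_test', ...params }) },
};
const timeoutError = Object.assign(new Error('Stripe timed out'), { code: 'timeout' });
wrapWithChaos(
  fakeStripe,
  { 'paymentIntents.create': { error: { probability: 1, throws: timeoutError } } },
  () => 0 // deterministic "random" source
);
```

The wrapped method rejects before the original is called, so the caller's `catch` logic is exercised without the dependency being touched.
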
### Option C: Event-Driven Chaos

For async architectures:

```javascript
chaos: {
  events: {
    'webhook.received': {
      drop: { probability: 0.1 },             // Simulate webhook loss
      delay: { probability: 0.2, ms: 30000 }  // Simulate queue delay
    }
  }
}
```

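For an in-process event bus, one possible shape is to intercept `emit`. The sketch below assumes the bus exposes `emit` and `listenerCount` the way a Node.js `EventEmitter` does; `applyEventChaos` is a hypothetical name, and a real queue or webhook pipeline would need its own hook point.

```javascript
// Hypothetical sketch: intercept `emit` on an EventEmitter-style object so
// that configured events can be dropped or delivered late.
function applyEventChaos(emitter, rules, random = Math.random) {
  const originalEmit = emitter.emit.bind(emitter);

  emitter.emit = (eventName, ...args) => {
    const rule = rules[eventName];
    if (rule && rule.drop && random() < rule.drop.probability) {
      return false; // dropped: listeners never run
    }
    if (rule && rule.delay && random() < rule.delay.probability) {
      // Deliver later, and report whether anyone is listening right now.
      setTimeout(() => originalEmit(eventName, ...args), rule.delay.ms);
      return emitter.listenerCount(eventName) > 0;
    }
    return originalEmit(eventName, ...args);
  };
  return emitter;
}
```

With `drop.probability` set to 1 for `'webhook.received'`, every such event is swallowed while other event names are emitted normally, which is enough to test whether a consumer notices a missing webhook.
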
---

## Recommended Priority Order

### P0 (Critical): Fix Event Reporting

Every chaos injection should be visible:

```javascript
// In test results
test.diagnostics.chaos = {
  injected: true,
  type: 'error',
  details: {
    statusCode: 503,
    originalStatusCode: 201,
    strategy: 'override'
  }
}
```

Without this, chaos failures are indistinguishable from real bugs.

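As an illustration of how a reporter could surface this, the sketch below renders a failure with a chaos annotation. The `result` shape (`name`, `error`, `diagnostics.chaos`) follows the proposal above and is an assumption, not an existing Apophis structure; `formatFailure` is a hypothetical name.

```javascript
// Hypothetical reporter helper: append a chaos annotation to a failure when
// the proposed `diagnostics.chaos` block says a fault was injected.
function formatFailure(result) {
  const lines = [`${result.name}: FAIL`, `Error: ${result.error}`];
  const chaos = result.diagnostics && result.diagnostics.chaos;
  if (chaos && chaos.injected) {
    const { statusCode, originalStatusCode, strategy } = chaos.details;
    lines.push(
      `Chaos: ${chaos.type} injected (${strategy}), status ${originalStatusCode} → ${statusCode}`
    );
  }
  return lines.join('\n');
}

const report = formatFailure({
  name: 'POST /billing/plans (#1)',
  error: 'Contract violation: if status:503 then response_body(this).data.error != null else true',
  diagnostics: {
    chaos: {
      injected: true,
      type: 'error',
      details: { statusCode: 503, originalStatusCode: 201, strategy: 'override' },
    },
  },
});
```

One extra line in the report is enough to tell a chaos-induced failure apart from a genuine contract bug.
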
### P1 (High): Add Dependency-Aware Chaos

Implement outbound request interception or service wrapping. Current HTTP-layer chaos is too superficial for production APIs.

### P2 (Medium): Fix Dropout Semantics

Return proper status codes:
- `504 Gateway Timeout` for timeouts
- `503 Service Unavailable` for network failures
- Or make it configurable: `dropout: { statusCode: 503 }`

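The three bullets above combine into a small decision rule, sketched here under assumptions: `dropoutStatus` and the `kind` values `'timeout'` and `'network'` are hypothetical names, and `config.statusCode` mirrors the proposed `dropout: { statusCode: 503 }` option.

```javascript
// Hypothetical sketch: choose the status a simulated dropout should surface as.
function dropoutStatus(kind, config = {}) {
  // An explicitly configured status code wins over the defaults.
  if (Number.isInteger(config.statusCode)) return config.statusCode;
  // Otherwise: timeouts map to 504, any other network failure to 503.
  return kind === 'timeout' ? 504 : 503;
}

const timeoutStatus = dropoutStatus('timeout');
const networkStatus = dropoutStatus('network');
const configuredStatus = dropoutStatus('timeout', { statusCode: 503 });
```
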
### P3 (Low): Stateful Retry Safety

Either:
- Make retries carry a unique ID per logical request (so repeated attempts can't create duplicates)
- Or document that resilience requires idempotent handlers
- Or skip resilience for non-idempotent routes

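The first option can be sketched with an idempotency key that every attempt of one logical request reuses. Everything here is illustrative: `createPlan`, `retryWithKey`, and the in-memory `plansByKey` store are hypothetical, and `shouldFail` stands in for chaos overriding the response to 503.

```javascript
// Hypothetical sketch: all attempts of one logical request share a key, and
// the handler replays the stored result when it has seen the key before.
const plansByKey = new Map(); // stand-in for a persistent idempotency store

function createPlan(idempotencyKey, payload) {
  if (plansByKey.has(idempotencyKey)) return plansByKey.get(idempotencyKey);
  const plan = { id: `plan_${plansByKey.size + 1}`, ...payload };
  plansByKey.set(idempotencyKey, plan);
  return plan;
}

function retryWithKey(idempotencyKey, payload, maxRetries, shouldFail) {
  for (let attempt = 0; attempt <= maxRetries; attempt += 1) {
    const plan = createPlan(idempotencyKey, payload); // side effect happens at most once
    if (!shouldFail(attempt)) return plan;            // simulated chaos: response overridden to 503
  }
  throw new Error(`still failing after ${maxRetries} retries`);
}

// Same scenario as the four attempts described earlier: three failures, then success.
const plan = retryWithKey('req-1', { name: 'Pro' }, 3, (attempt) => attempt < 3);
console.log(plansByKey.size); // 1
```

Four attempts still reach the handler, but only one plan exists afterwards, so follow-up GET, PATCH, and DELETE tests see predictable state.
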
---

## What We're Doing Instead

Since current chaos doesn't serve our needs, we're writing application-layer failure tests:

```javascript
// Assumes Node's built-in test runner; swap in your runner's equivalents.
const test = require('node:test');
const assert = require('node:assert');

// `app` (the Fastify instance) and `payInvoice` (a request helper) come from
// our test setup and are not shown here.
test('Stripe rate limit handling', async () => {
  // Mock Stripe to return 429
  app.stripe.paymentIntents.create = async () => {
    const err = new Error('Rate limit exceeded');
    err.statusCode = 429;
    err.headers = { 'retry-after': '60' };
    throw err;
  };

  const res = await payInvoice({ invoiceId: 'test' });

  assert.strictEqual(res.statusCode, 429);
  assert.strictEqual(res.json().data.error, 'stripe_rate_limit');
  assert.strictEqual(res.headers['retry-after'], '60');
});
```

This tests what we actually need: **handler behavior when dependencies fail.**

---

## Conclusion

Apophis chaos is a good start for HTTP-layer resilience testing, but it's insufficient for production APIs with external dependencies. The framework needs to evolve from "HTTP response mutator" to "dependency failure simulator" to be truly valuable.

We want Apophis to succeed. The schema-driven contract approach is innovative and valuable. But chaos testing needs to be dependency-aware to be useful for real-world APIs.

**Happy to collaborate** on designing the outbound interception API or service wrapping approach.