# Critical Feedback: Why Current Chaos Injection is Insufficient for Production APIs
**To:** Apophis Engineering Team
**From:** Arbiter Platform Engineering
**Date:** 2026-04-27
**Context:** Production SaaS platform with 500+ endpoints, Stripe integration, complex middleware chains

---
## The Core Problem
Current chaos injection operates exclusively at the **HTTP transport layer** (`executeHttp()` wrapper). This tests:

- ✅ Response schemas under forced errors
- ✅ Timeout contracts with artificial delays
- ✅ Response validation with corrupted bodies

But **production APIs fail at the dependency layer**, not the transport layer:

- Stripe API returns 429 rate limit
- Database connection pool exhausted
- Redis cache timeout
- Third-party webhook delivery fails
- Message queue backlog

**Current chaos cannot simulate these.** It can force a 503 response, but it cannot simulate "Stripe returned 429, so we need to propagate retry-after header" because the handler never sees the Stripe error.

---
## Specific Pain Points
### 1. Error Injection is Backwards
**Current behavior:**
```
Handler runs → creates side effects → response overridden to 503
```
**What we need:**
```
Handler runs → Stripe call fails with 429 → handler catches error → returns 503 with retry-after
```
The current approach tests "what does our 503 response look like" but not "does our handler correctly handle Stripe errors." These are different:
- Current: Tests schema compliance for hardcoded error responses
- Needed: Tests business logic for dependency failures
**Impact:** We have 503 contracts that pass, but our handler might not actually set the retry-after header when Stripe fails. The contract gives false confidence.
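
To make the gap concrete, here is a minimal sketch of the kind of handler logic that a transport-layer override never executes. The helper name `toErrorResponse` and the `internal_error` fallback are hypothetical; the error shape (`statusCode`, `headers`) mirrors the mocked Stripe error used later in this document, and the body shape follows the `data.error` contract shown in the next section.

```javascript
// Hypothetical helper: maps an error thrown by a dependency call to an HTTP response.
// The error shape (statusCode, headers) is an assumption taken from the mock in this document.
function toErrorResponse(err) {
  if (err.statusCode === 429) {
    const retryAfter = err.headers && err.headers['retry-after'];
    return {
      statusCode: 503,
      // Propagate retry-after only when the dependency supplied one.
      headers: retryAfter ? { 'retry-after': retryAfter } : {},
      body: { data: { error: 'stripe_rate_limit' } },
    };
  }
  // Anything else is treated as an unexpected failure in this sketch.
  return { statusCode: 500, headers: {}, body: { data: { error: 'internal_error' } } };
}

// Example: a rate-limit error that carries a retry-after header.
const rateLimited = Object.assign(new Error('Rate limit exceeded'), {
  statusCode: 429,
  headers: { 'retry-after': '60' },
});
console.log(toErrorResponse(rateLimited).headers['retry-after']); // 60
```

If chaos replaces the response after the handler returns, this branch is never reached, so a passing 503 contract says nothing about whether it works.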
### 2. Chaos Events Are Invisible
When chaos injects, the test result shows:
```
POST /billing/plans (#1): FAIL
Error: Contract violation: if status:503 then response_body(this).data.error != null else true
```
But there's no indication that:
- Chaos was the cause (not a real bug)
- What type of chaos was injected (error? corruption? delay?)
- What the original response was before override
**Impact:** Debugging chaos failures is impossible. We can't tell if our contract is wrong or if chaos mutated the response unexpectedly.
### 3. Resilience Verification is Dangerous for Stateful APIs
When `resilience: { enabled: true }` is set, Apophis retries the same request up to `maxRetries` times.
For `POST /billing/plans`, a run in which the first three attempts receive a chaos-injected 503 looks like this:
- Attempt 1: Creates plan A → gets 503 → retries
- Attempt 2: Creates plan B → gets 503 → retries
- Attempt 3: Creates plan C → gets 503 → retries
- Attempt 4: Creates plan D → succeeds
**Result: 4 plans created, 1 expected.** This pollutes state and makes follow-up tests (GET, PATCH, DELETE) behave unpredictably.
**Impact:** Can't use resilience testing on stateful routes without idempotency. Most real APIs are stateful.
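
The duplication can be reproduced with a small simulation. This is a sketch under stated assumptions, not Apophis code: `requestWithRetries`, `createPlan`, and the `forcedFailures` parameter are hypothetical names, and chaos is assumed to override the first three responses after the handler has already run.

```javascript
// In-memory stand-in for the plans table.
const plans = [];

// Hypothetical non-idempotent handler: every call creates a new plan.
function createPlan() {
  plans.push({ id: `plan_${plans.length + 1}` });
  return { statusCode: 201 };
}

// Hypothetical retry loop: the handler runs first, then chaos overrides
// the response to 503 for the first `forcedFailures` attempts.
function requestWithRetries(handler, forcedFailures, maxRetries) {
  let response;
  for (let attempt = 1; attempt <= maxRetries + 1; attempt += 1) {
    const real = handler(); // the side effect happens regardless of the override
    response = attempt <= forcedFailures ? { statusCode: 503 } : real;
    if (response.statusCode !== 503) break;
  }
  return response;
}

const finalResponse = requestWithRetries(createPlan, 3, 3);
console.log(finalResponse.statusCode, plans.length); // 201 4
```

The client sees one successful request, while the store holds four plans.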
### 4. Dropout Returns Status Code 0
Network failures in production don't return status code 0. They:
- Time out (status undefined, error "ETIMEDOUT")
- Reset connection (error "ECONNRESET")
- Return 503 from load balancer
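Status 0 is a browser-specific artifact: `XMLHttpRequest` reports `status` 0 when a request fails at the network level. Node.js HTTP clients don't produce status 0; they surface an error instead.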
**Impact:** Contracts can't match status 0. We have to either:
- Add `status:0` to all contracts (meaningless)
- Or ignore dropout failures (makes dropout useless)
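
For comparison, a dropout could be modeled the way Node.js reports socket-level failures: as an `Error` carrying a `code` property and no status at all. The sketch below is illustrative only; `makeDropoutError` and `classifyOutcome` are hypothetical names, and the error message text is made up.

```javascript
// Error codes that Node.js uses for the two failure modes listed above.
const SUPPORTED_DROPOUT_CODES = new Set(['ETIMEDOUT', 'ECONNRESET']);

// Hypothetical factory: builds an error shaped like a Node.js system error.
function makeDropoutError(code) {
  if (!SUPPORTED_DROPOUT_CODES.has(code)) {
    throw new RangeError(`unsupported dropout code: ${code}`);
  }
  const err = new Error(`simulated network failure (${code})`);
  err.code = code;
  return err;
}

// Hypothetical classifier: a contract could then match on either a real
// status code or a transport error code, never on status 0.
function classifyOutcome(outcome) {
  if (outcome instanceof Error) {
    return { kind: 'transport-error', code: outcome.code };
  }
  return { kind: 'response', status: outcome.statusCode };
}
```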
---
## What Would Make Chaos Useful for Arbiter
### Option A: Outbound Request Contracts (Preferred)
Apophis intercepts outbound HTTP requests from the handler:
```javascript
// In Apophis config
chaos: {
  outbound: {
    'api.stripe.com': {
      delay: { probability: 0.1, minMs: 1000, maxMs: 5000 },
      error: {
        probability: 0.05,
        responses: [
          { statusCode: 429, headers: { 'retry-after': '60' } },
          { statusCode: 503, body: { error: 'stripe_unavailable' } }
        ]
      }
    }
  }
}
```
**Benefits:**
- Handler sees real dependency failures
- Tests actual error handling logic
- Side effects only occur when handler succeeds
- No state pollution from retries
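
As a rough illustration of how such a config might be evaluated, the sketch below covers only the decision step for the `error` rule: given an outbound URL, it returns either a canned response or `null` (let the request through). The function name `pickOutboundFault` and the injectable `random` parameter are assumptions made for this example; delay handling and the actual request interception are left out.

```javascript
// Rules in the shape proposed above (error part only).
const outboundRules = {
  'api.stripe.com': {
    error: {
      probability: 0.05,
      responses: [
        { statusCode: 429, headers: { 'retry-after': '60' } },
        { statusCode: 503, body: { error: 'stripe_unavailable' } },
      ],
    },
  },
};

// Hypothetical decision function. `random` is injectable so tests are deterministic.
function pickOutboundFault(url, rules, random = Math.random) {
  const rule = rules[new URL(url).hostname];
  if (!rule || !rule.error) return null; // host not targeted
  if (random() >= rule.error.probability) return null; // no fault this time
  const { responses } = rule.error;
  // Pick one of the configured canned responses uniformly.
  return responses[Math.floor(random() * responses.length)];
}
```

An interceptor would call this before forwarding the request and short-circuit with the returned response when it is not `null`, so the handler receives the failure from its own Stripe call.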
### Option B: Service Method Wrapping
Apophis wraps methods on decorated services:
```javascript
// Fastify decorator
app.decorate('stripe', new StripeService());

// Apophis wraps it
apophis.chaos.wrap(app.stripe, {
  'paymentIntents.create': {
    delay: { probability: 0.1, ms: 5000 },
    error: { probability: 0.05, throws: new StripeTimeoutError() }
  }
});
```
**Benefits:**
- Works with any service pattern (HTTP, DB, queue)
- Tests business logic directly
- Minimal changes to existing code
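
One possible shape for the wrapping itself is sketched below. It is not an existing Apophis API: `wrapWithChaos` is a hypothetical stand-in for the proposed `apophis.chaos.wrap`, it handles only the `error` rule (no delay), and it throws synchronously, which an `await` inside an async handler still observes as a rejection.

```javascript
// Hypothetical wrapper: replaces methods addressed by dotted paths
// (e.g. 'paymentIntents.create') with versions that sometimes throw.
function wrapWithChaos(service, rules, random = Math.random) {
  for (const [path, rule] of Object.entries(rules)) {
    const keys = path.split('.');
    const methodName = keys.pop();
    // Walk to the object that owns the method, e.g. service.paymentIntents.
    const owner = keys.reduce((obj, key) => obj[key], service);
    const original = owner[methodName];
    owner[methodName] = function chaosWrapped(...args) {
      if (rule.error && random() < rule.error.probability) {
        throw rule.error.throws; // the handler sees a real dependency failure
      }
      return original.apply(this, args); // otherwise behave exactly as before
    };
  }
  return service;
}
```

Because the error is thrown before the original method runs, no side effect occurs on the injected-failure path.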
### Option C: Event-Driven Chaos
For async architectures:
```javascript
chaos: {
  events: {
    'webhook.received': {
      drop: { probability: 0.1 }, // Simulate webhook loss
      delay: { probability: 0.2, ms: 30000 } // Simulate queue delay
    }
  }
}
```
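
For an in-process event bus built on Node's `EventEmitter`, the `drop` and `delay` rules could be approximated by intercepting `emit`. This is a sketch under that assumption; `applyEventChaos` is a hypothetical name, and a real message queue would need interception at the broker or consumer instead.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical helper: overrides emit() on one emitter instance so that
// configured events are sometimes dropped or delivered late.
function applyEventChaos(emitter, rules, random = Math.random) {
  const originalEmit = emitter.emit.bind(emitter);
  emitter.emit = (eventName, ...args) => {
    const rule = rules[eventName];
    if (!rule) return originalEmit(eventName, ...args);
    if (rule.drop && random() < rule.drop.probability) {
      return false; // event lost: listeners are never called
    }
    if (rule.delay && random() < rule.delay.probability) {
      // Deliver later. Returning true here is an approximation, since
      // emit() normally reports whether any listeners were registered.
      setTimeout(() => originalEmit(eventName, ...args), rule.delay.ms);
      return true;
    }
    return originalEmit(eventName, ...args);
  };
  return emitter;
}
```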
---
## Recommended Priority Order
### P0 (Critical): Fix Event Reporting
Every chaos injection should be visible:
```javascript
// In test results
test.diagnostics.chaos = {
  injected: true,
  type: 'error',
  details: {
    statusCode: 503,
    originalStatusCode: 201,
    strategy: 'override'
  }
};
```
Without this, chaos failures are indistinguishable from real bugs.
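
With diagnostics in that shape, a reporter could annotate failures so the cause is obvious at a glance. The sketch below assumes a result object with hypothetical `route` and `error` fields alongside the proposed `diagnostics.chaos`; `describeFailure` is an illustrative name, not part of Apophis.

```javascript
// Hypothetical reporter helper: renders a failure and, when present,
// the chaos injection that preceded it.
function describeFailure(result) {
  const lines = [`${result.route}: FAIL`, `Error: ${result.error}`];
  const chaos = result.diagnostics && result.diagnostics.chaos;
  if (chaos && chaos.injected) {
    lines.push(`Chaos: ${chaos.type} injected (strategy: ${chaos.details.strategy})`);
    lines.push(
      `Status: ${chaos.details.originalStatusCode} overridden to ${chaos.details.statusCode}`
    );
  }
  return lines.join('\n');
}

const sampleResult = {
  route: 'POST /billing/plans (#1)',
  error: 'Contract violation',
  diagnostics: {
    chaos: {
      injected: true,
      type: 'error',
      details: { statusCode: 503, originalStatusCode: 201, strategy: 'override' },
    },
  },
};
console.log(describeFailure(sampleResult));
```

A failure without the `Chaos:` lines would then be a candidate for a real bug.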
### P1 (High): Add Dependency-Aware Chaos
Implement outbound request interception or service wrapping. Current HTTP-layer chaos is too superficial for production APIs.
### P2 (Medium): Fix Dropout Semantics
Return proper status codes:
- `504 Gateway Timeout` for timeouts
- `503 Service Unavailable` for network failures
- Or make it configurable: `dropout: { statusCode: 503 }`
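
The configurable variant could resolve the status as sketched below. `resolveDropoutStatus` and the `kind` values (`'timeout'`, `'network'`) are hypothetical; only the `dropout: { statusCode }` option and the 504/503 defaults come from the list above.

```javascript
// Defaults proposed above: 504 for timeouts, 503 for network failures.
const DEFAULT_DROPOUT_STATUS = { timeout: 504, network: 503 };

// Hypothetical resolver: an explicit dropout.statusCode wins over the defaults.
function resolveDropoutStatus(kind, dropoutConfig = {}) {
  if (Number.isInteger(dropoutConfig.statusCode)) {
    return dropoutConfig.statusCode;
  }
  if (!Object.prototype.hasOwnProperty.call(DEFAULT_DROPOUT_STATUS, kind)) {
    throw new RangeError(`unknown dropout kind: ${kind}`);
  }
  return DEFAULT_DROPOUT_STATUS[kind];
}
```

Either way, contracts would match on a status that a load balancer could plausibly return.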
### P3 (Low): Stateful Retry Safety
Either:
- Make retries reuse one unique ID per logical request, such as an idempotency key (so the server can prevent duplicate creation)
- Or document that resilience requires idempotent handlers
- Or skip resilience for non-idempotent routes
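
The third option could be as small as a method check. The sketch below uses the HTTP methods that the HTTP specification defines as idempotent; `shouldVerifyResilience` and the per-route `idempotent` opt-in flag are hypothetical names for illustration.

```javascript
// Methods defined as idempotent by the HTTP semantics specification (RFC 9110).
const IDEMPOTENT_METHODS = new Set(['GET', 'HEAD', 'PUT', 'DELETE', 'OPTIONS', 'TRACE']);

// Hypothetical predicate: retry only where a repeated request is safe,
// or where the route has explicitly declared itself idempotent.
function shouldVerifyResilience(route) {
  if (IDEMPOTENT_METHODS.has(route.method.toUpperCase())) return true;
  return route.idempotent === true;
}

console.log(shouldVerifyResilience({ method: 'POST', url: '/billing/plans' })); // false
```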
---
## What We're Doing Instead
Since current chaos doesn't serve our needs, we're writing application-layer failure tests:
```javascript
test('Stripe rate limit handling', async () => {
  // Mock Stripe to return 429
  app.stripe.paymentIntents.create = async () => {
    const err = new Error('Rate limit exceeded');
    err.statusCode = 429;
    err.headers = { 'retry-after': '60' };
    throw err;
  };

  const res = await payInvoice({ invoiceId: 'test' });

  assert.strictEqual(res.statusCode, 429);
  assert.strictEqual(res.json().data.error, 'stripe_rate_limit');
  assert.strictEqual(res.headers['retry-after'], '60');
});
```
This tests what we actually need: **handler behavior when dependencies fail.**

---
## Conclusion
Apophis chaos is a good start for HTTP-layer resilience testing, but it's insufficient for production APIs with external dependencies. The framework needs to evolve from "HTTP response mutator" to "dependency failure simulator" to be truly valuable.
We want Apophis to succeed. The schema-driven contract approach is innovative and valuable. But chaos testing needs to be dependency-aware to be useful for real-world APIs.
**Happy to collaborate** on designing the outbound interception API or service wrapping approach.