# Dependency-Aware Chaos Testing

## Overview

Dependency-aware chaos testing has two layers:

1. **Outbound Layer**: intercepts outbound requests to dependencies (Stripe, APIs, DBs)
2. **Body Corruption Layer**: corrupts HTTP response bodies (truncation, malformed data)

This addresses a critical limitation of HTTP-layer chaos (v1), which tested only response schemas and not the handlers' error-handling logic.

## Two-Layer Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                        OUTBOUND LAYER                        │
│ Tests: Handler error handling, retry logic, circuit breakers │
│                                                              │
│ • Outbound HTTP interception (Stripe, APIs)                  │
│ • Dependency failure simulation                              │
└──────────────────────────────────────────────────────────────┘
                                │
┌──────────────────────────────────────────────────────────────┐
│                    BODY CORRUPTION LAYER                     │
│ Tests: Response parsing, validation, streaming resilience    │
│                                                              │
│ • Truncation (partial responses)                             │
│ • Malformed data (invalid JSON, corrupted structure)         │
│ • Partial chunks (missing NDJSON lines)                      │
└──────────────────────────────────────────────────────────────┘
```

## Outbound Layer Chaos

### Outbound HTTP Interception

Intercept requests from handlers to external APIs:

```javascript
await fastify.apophis.contract({
  depth: 'quick',
  chaos: {
    probability: 0.1,
    outbound: [
      {
        target: 'api.stripe.com',
        delay: { probability: 0.1, minMs: 1000, maxMs: 5000 },
        error: {
          probability: 0.05,
          responses: [
            { statusCode: 429, headers: { 'retry-after': '60' } },
            { statusCode: 503, body: { error: 'stripe_unavailable' } }
          ]
        }
      }
    ]
  }
})
```

**What it tests:**
- Does the handler catch a Stripe 429 and return a `retry-after` header?
- Does the handler handle a Stripe 503 and return a meaningful error?
- Does the handler implement exponential backoff?

**What it does NOT test:**
- Response schema compliance (that is covered by the body corruption layer)

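
To make the backoff question concrete, here is a minimal sketch of a delay calculation a handler might use when a dependency returns 429 or 503. `retryDelayMs` and its default values are hypothetical and not part of Apophis or any Stripe SDK; the sketch only assumes that a numeric `retry-after` value is expressed in seconds.

```javascript
// Hypothetical helper, not part of Apophis: how long to wait before
// retry number `attempt` (0-based), in milliseconds.
function retryDelayMs(attempt, retryAfterHeader, baseMs = 200, capMs = 60_000) {
  // Prefer the dependency's own hint when it sends a numeric retry-after (seconds).
  if (typeof retryAfterHeader === 'string' && retryAfterHeader.trim() !== '') {
    const seconds = Number(retryAfterHeader)
    if (Number.isFinite(seconds) && seconds >= 0) {
      return Math.min(seconds * 1000, capMs)
    }
  }
  // Otherwise back off exponentially: base, 2x base, 4x base, ... up to the cap.
  return Math.min(baseMs * 2 ** attempt, capMs)
}

// Delay schedule for the first four retries without a retry-after hint.
const schedule = [0, 1, 2, 3].map((n) => retryDelayMs(n))
// schedule is [200, 400, 800, 1600]
```

A `retry-after` value given as an HTTP date is not parsed here and simply falls back to the exponential schedule.
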
### wrapFetch

Wrap a `fetch` implementation so outbound requests are intercepted:

```javascript
import { wrapFetch, createOutboundInterceptor } from 'apophis-fastify'

const interceptor = createOutboundInterceptor([
  {
    target: 'api.stripe.com',
    delay: { probability: 0.1, minMs: 1000, maxMs: 5000 },
    error: {
      probability: 0.05,
      responses: [
        { statusCode: 429, headers: { 'retry-after': '60' } }
      ]
    }
  }
], 42)

const interceptedFetch = wrapFetch(globalThis.fetch, interceptor)
const res = await interceptedFetch('https://api.stripe.com/v1/charges')
```

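
For intuition only, the following sketch shows one way a fetch wrapper of this kind could work. It is not the `apophis-fastify` implementation: `seededRandom`, `pickInjectedError`, and `wrapFetchSketch` are hypothetical names, targets are matched by exact hostname only, and the use of a seeded generator is an assumption about how reproducible runs might be achieved. It relies on the global `URL` and `Response` classes available in Node 18 and later.

```javascript
// Conceptual sketch only: NOT the apophis-fastify implementation.

// Small seeded PRNG (mulberry32 variant) so a given seed replays the same decisions.
function seededRandom(seed) {
  let state = seed >>> 0
  return () => {
    state = (state + 0x6d2b79f5) >>> 0
    let t = state
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Decide whether a request to `url` gets a synthetic error. Returns the chosen
// response descriptor, or null to let the real request through.
function pickInjectedError(url, rules, random) {
  const { hostname } = new URL(url)
  const rule = rules.find((r) => r.target === hostname)
  if (!rule || !rule.error || random() >= rule.error.probability) return null
  const { responses } = rule.error
  return responses[Math.floor(random() * responses.length)]
}

// Wrap a fetch function so matching requests may receive an injected response.
function wrapFetchSketch(baseFetch, rules, random) {
  return async (url, init) => {
    const injected = pickInjectedError(String(url), rules, random)
    if (!injected) return baseFetch(url, init)
    return new Response(JSON.stringify(injected.body ?? null), {
      status: injected.statusCode,
      headers: injected.headers
    })
  }
}
```

Keeping the decision (`pickInjectedError`) separate from the I/O wrapper makes the injection logic testable without a network.
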
## Body Corruption Layer

### Response Truncation

Simulate partial responses:

```javascript
await fastify.apophis.contract({
  depth: 'quick',
  chaos: {
    probability: 0.1,
    corruption: { probability: 0.1 }
  }
})
```

**What it tests:**
- Does the client handle partial JSON gracefully?
- Does the streaming parser recover from truncated chunks?
- Does validation fail gracefully with incomplete data?

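
As an illustration of handling partial JSON gracefully, a client might wrap parsing and validation so that a truncated body produces a structured failure instead of an uncaught exception. `parseJsonBody` and `readUsers` are hypothetical helpers, and the `users` field is borrowed from the `GET /users` example in this document.

```javascript
// Hypothetical client-side helper: parse a body that may have been truncated.
function parseJsonBody(text) {
  try {
    return { ok: true, value: JSON.parse(text) }
  } catch (err) {
    // Truncated or malformed JSON lands here instead of crashing the caller.
    return { ok: false, error: err instanceof Error ? err.message : String(err) }
  }
}

// Validate the parsed value before using it, so a nulled field is also caught.
function readUsers(text) {
  const parsed = parseJsonBody(text)
  if (!parsed.ok) return { ok: false, error: 'unparseable body' }
  if (parsed.value === null || !Array.isArray(parsed.value.users)) {
    return { ok: false, error: 'missing users array' }
  }
  return { ok: true, users: parsed.value.users }
}
```
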
### Malformed Data

Corruption is content-type aware. Built-in strategies:

| Content Type | Strategy | Kind |
|-------------|----------|------|
| `application/json` | Truncates objects/arrays or nulls random fields | `body-truncate` / `body-malformed` |
| `application/x-ndjson` | Corrupts a random chunk | `body-malformed` |
| `text/event-stream` | Corrupts SSE event format | `body-malformed` |
| `multipart/form-data` | Corrupts a multipart field | `body-malformed` |
| `text/plain` | Truncates text response | `body-truncate` |
| `text/html` | Truncates HTML response | `body-truncate` |

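
For the `application/x-ndjson` case, a consumer that tolerates a corrupted chunk could parse line by line and record the lines it had to skip. This is a minimal sketch with a hypothetical `parseNdjson` helper; it is not how Apophis itself produces or checks corruption.

```javascript
// Hypothetical consumer-side helper: parse NDJSON, skipping lines that fail to parse.
function parseNdjson(text) {
  const records = []
  const badLines = []
  text.split('\n').forEach((line, index) => {
    if (line.trim() === '') return // ignore blank lines, including the trailing one
    try {
      records.push(JSON.parse(line))
    } catch {
      badLines.push(index + 1) // 1-based line number of the corrupted chunk
    }
  })
  return { records, badLines }
}
```

Reporting `badLines` rather than silently dropping them lets the caller decide whether a partial result is acceptable.
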
## Chaos Event Reporting

Every chaos injection is visible in test diagnostics:

```javascript
// Outbound layer chaos
{
  ok: false,
  name: 'POST /billing/plans (#1)',
  diagnostics: {
    error: 'Contract violation: status:200',
    chaos: {
      injected: true,
      type: 'outbound-error',
      details: {
        statusCode: 429,
        dependencyUrl: 'https://api.stripe.com/v1/payment_intents',
        reason: 'Outbound error: 429 from https://api.stripe.com/v1/payment_intents',
        errorResponse: { error: 'rate_limit' }
      }
    }
  }
}

// Body corruption layer
{
  ok: false,
  name: 'GET /users (#2)',
  diagnostics: {
    error: 'Contract violation: response_body(this).users != null',
    chaos: {
      injected: true,
      type: 'corruption',
      details: {
        reason: 'Body corruption: Truncates JSON response or nulls a random field',
        strategy: 'json-truncate'
      }
    }
  }
}
```

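
When triaging a run, it can help to separate failures that coincide with an injection from failures that do not. The sketch below assumes the results are available as an array of objects shaped like the diagnostics above; how that array is obtained is not covered here, and `summarizeChaos` is a hypothetical helper.

```javascript
// Hypothetical triage helper over result objects shaped like the diagnostics above.
function summarizeChaos(results) {
  const injectedByType = {}
  const failuresWithoutChaos = []
  for (const result of results) {
    const chaos = result.diagnostics?.chaos
    if (chaos?.injected) {
      // Count every injection by its reported type, e.g. 'outbound-error'.
      injectedByType[chaos.type] = (injectedByType[chaos.type] ?? 0) + 1
    } else if (!result.ok) {
      // A failure with no injection points at a plain contract violation.
      failuresWithoutChaos.push(result.name)
    }
  }
  return { injectedByType, failuresWithoutChaos }
}
```
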
## Dropout Semantics

Dropout simulations are reported as HTTP-style failure statuses:
- **504 Gateway Timeout** for timeouts (default)
- **503 Service Unavailable** for network failures
- Configurable: `dropout: { probability: 0.1, statusCode: 503 }`

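
As a rough illustration, the dropout options might be written out as a plain configuration object like the one below. Only field names that appear in this document are used; the variable name `dropoutChaos` is hypothetical, and how a per-dependency `dropout` interacts with the top-level one is not stated here, so combining them this way is an assumption.

```javascript
// Plain object for illustration; pass an object like this as `chaos` to the contract call.
const dropoutChaos = {
  probability: 0.1,
  // Top-level dropout: omitting statusCode leaves the documented default (504).
  dropout: { probability: 0.1 },
  outbound: [
    {
      target: 'api.stripe.com',
      // Report dropouts for this dependency as a network failure instead.
      dropout: { probability: 0.1, statusCode: 503 }
    }
  ]
}
```
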
## Blast Radius Cap

Limit total chaos injections per test suite:

```javascript
await fastify.apophis.contract({
  depth: 'quick',
  chaos: {
    probability: 0.5,
    delay: { probability: 1.0, minMs: 10, maxMs: 50 },
    maxInjectionsPerSuite: 10
  }
})
```

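
The effect of a cap can be pictured as a shared budget that each injection draws from. The sketch below is a simplified model, not the Apophis implementation: `createInjectionBudget` and `simulateSuite` are hypothetical, and it uses a single probability rather than modelling how the top-level and per-type probabilities combine.

```javascript
// Simplified model of a per-suite injection cap (not the Apophis implementation).
function createInjectionBudget(maxInjectionsPerSuite) {
  let used = 0
  return {
    // Returns true and consumes one slot if an injection is still allowed.
    tryConsume() {
      if (used >= maxInjectionsPerSuite) return false
      used += 1
      return true
    },
    get used() {
      return used
    }
  }
}

// Count how many of `requestCount` requests would receive an injection.
function simulateSuite(requestCount, probability, budget, random = Math.random) {
  let injected = 0
  for (let i = 0; i < requestCount; i++) {
    if (random() < probability && budget.tryConsume()) injected += 1
  }
  return injected
}
```

With a probability of 1.0 and a cap of 10, a 100-request suite in this model receives exactly 10 injections; the remaining requests run without chaos.
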
## Stateful Retry Safety

Resilience verification automatically skips non-idempotent routes:

```javascript
await fastify.apophis.contract({
  depth: 'quick',
  chaos: {
    probability: 0.1,
    resilience: {
      enabled: true,
      maxRetries: 3
    },
    // Skip retries for routes that create side effects
    skipResilienceFor: ['constructor', 'mutator']
  }
})
```

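
The skip rule can be read as a simple category check. The following sketch assumes each route carries a category string such as the `x-category` value used elsewhere in this document; `shouldVerifyResilience` is a hypothetical name and the real decision logic may consider more than this.

```javascript
// Hypothetical decision helper: should retries be exercised for this route?
function shouldVerifyResilience(routeCategory, chaosConfig) {
  // Nothing to verify unless resilience checking is switched on.
  if (!chaosConfig.resilience?.enabled) return false
  // Routes in a skipped category are never retried, to avoid repeated side effects.
  const skipped = chaosConfig.skipResilienceFor ?? []
  return !skipped.includes(routeCategory)
}
```
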
## Best Practices

### 1. Use Outbound Layer for Business Logic

Test handler behavior when dependencies fail:

```javascript
// Good: Tests that handler catches Stripe 429
chaos: {
  outbound: [{
    target: 'api.stripe.com',
    error: { probability: 0.1, responses: [{ statusCode: 429 }] }
  }]
}

// Bad: Only tests response schema
chaos: {
  error: { probability: 0.1, statusCode: 429 }
}
```

### 2. Use Body Corruption for Parsing Resilience

Test response parsing and validation:

```javascript
// Good: Tests JSON parser resilience
chaos: {
  corruption: { probability: 0.1 }
}
```

### 3. Combine Both Layers

```javascript
await fastify.apophis.contract({
  depth: 'quick',
  chaos: {
    probability: 0.1,
    // Outbound layer: dependency failures
    outbound: [{
      target: 'api.stripe.com',
      error: { probability: 0.05, responses: [{ statusCode: 429 }] }
    }],
    // Body corruption: response corruption
    corruption: { probability: 0.05 },
    // Safety: skip retries for stateful routes
    skipResilienceFor: ['constructor', 'mutator']
  }
})
```

### 4. Write Contracts for Error Handling

```javascript
fastify.get('/billing/plans', {
  schema: {
    'x-category': 'observer',
    'x-ensures': [
      'if status:429 then response_headers(this)["retry-after"] != null else true',
      'if status:503 then response_body(this).error == "stripe_unavailable" else true',
      'if status:200 then response_body(this).plans != null else true'
    ]
  }
}, async () => { /* ... */ })
```

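
A handler that satisfies these three contracts needs to translate dependency failures into the documented replies. The sketch below shows one possible mapping as a pure function. `toPlansReply` is hypothetical, the `upstream` shape (`status`, `headers`, `body`) is an assumption for illustration, and reading plans from `upstream.body.data` assumes a list-style response from the dependency.

```javascript
// Hypothetical mapping from a dependency response to the route's reply.
function toPlansReply(upstream) {
  if (upstream.status === 429) {
    // Contract 1: a 429 reply must carry a retry-after header.
    const retryAfter = upstream.headers?.['retry-after'] ?? '1'
    return { status: 429, headers: { 'retry-after': retryAfter }, body: { error: 'rate_limited' } }
  }
  if (upstream.status < 200 || upstream.status >= 300) {
    // Contract 2: any other dependency failure becomes a 503 with a fixed error code.
    return { status: 503, headers: {}, body: { error: 'stripe_unavailable' } }
  }
  // Contract 3: a 200 reply must always include a non-null `plans` value.
  return { status: 200, headers: {}, body: { plans: upstream.body?.data ?? [] } }
}
```

Inside the route, the handler would call the dependency, pass the result through a function like this, and send the returned status, headers, and body.
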
## Migration from v1

The old HTTP-layer chaos is still supported but should be used for transport testing only:

```javascript
// v1 (legacy: use for transport testing only)
chaos: {
  probability: 0.1,
  error: { probability: 0.1, statusCode: 503 }
}

// v2.3 (recommended)
chaos: {
  probability: 0.1,
  // Outbound layer
  outbound: [{
    target: 'api.stripe.com',
    error: { probability: 0.1, responses: [{ statusCode: 429 }] }
  }],
  // Body corruption layer
  corruption: { probability: 0.05 }
}
```

## API Reference

### OutboundChaosConfig

| Field | Type | Description |
|-------|------|-------------|
| `target` | `string` | Hostname or URL pattern to intercept |
| `delay` | `{ probability, minMs, maxMs }` | Delay outbound requests |
| `error` | `{ probability, responses }` | Return error responses |
| `dropout` | `{ probability, statusCode? }` | Simulate network failures |

### Body Corruption Types

| Type | Description |
|------|-------------|
| `body-truncate` | Partial response |
| `body-malformed` | Invalid data |

### ChaosConfig

| Field | Type | Description |
|-------|------|-------------|
| `probability` | `number` | Probability of injecting any chaos event (0.0 - 1.0) |
| `delay` | `{ probability, minMs, maxMs }` | Delay injection |
| `error` | `{ probability, statusCode, body? }` | Error injection |
| `dropout` | `{ probability, statusCode? }` | Dropout injection |
| `corruption` | `{ probability }` | Body corruption injection |
| `outbound` | `OutboundChaosConfig[]` | Outbound HTTP interception |
| `routes` | `Record<string, Partial<ChaosConfig>>` | Per-route overrides |
| `include` | `string[]` | Include only these routes |
| `exclude` | `string[]` | Exclude these routes |
| `resilience` | `{ enabled, maxRetries?, backoffMs? }` | Resilience verification |
| `skipResilienceFor` | `string[]` | Skip resilience for categories |
| `dropoutStatusCode` | `number` | Status code for dropout (default: 504) |
| `maxInjectionsPerSuite` | `number` | Maximum injections per suite |