# Arbiter → Apophis Feedback Report

**Date:** 2026-04-27
**Reporter:** Arbiter Engineering Team
**Context:** Integration of Apophis v2.2 into Arbiter Platform for behavioral contract testing

---

## Executive Summary

Apophis provides genuinely valuable capabilities for behavioral contract testing that go beyond traditional unit/integration tests. The schema-to-contract inference, cross-operation verification, and chaos testing infrastructure are compelling. However, we encountered 3 bugs in core infrastructure and several design friction points that should be addressed for wider adoption.

**Overall Assessment:** Strong value proposition for teams willing to invest in schema-driven testing. Needs polish on edge cases and configurability.

---

## Part 1: How Chaos Injection Would Help Arbiter

### Current State

Arbiter is a multi-tenant SaaS platform with:
- 500+ API endpoints across 15 route families
- Billing, graph storage, auth, sessions, webhooks, etc.
- Mock Stripe integration for payment processing
- In-memory and persistent storage backends
- Complex middleware chain: auth → tenant boundary → permissions → preflight → handler

### Where Chaos Testing Adds Value

**1. Middleware Resilience Verification**

Our middleware chain has implicit dependencies:

```
Transport → AuthN → Scope → AuthZ → Challenge → Preflight → Handler
```

Chaos testing would verify:
- What happens when `preflight()` times out? Does the handler still execute?
- If auth middleware fails with 503, do we get proper retry headers?
- Does a slow tenant boundary check cascade to response timeouts?

**Concrete scenario:** If the billing preflight gate (budget check) is slow, does the subscription creation handler wait or fail? Our contracts say `response_time < 2000ms`; chaos would tell us whether that is actually enforced.

**2. Mock Service Degradation**

We use `MockStripeService` for payment processing. In production, Stripe can:
- Return 429 (rate limit)
- Time out on `paymentIntents.create`
- Return network errors

Chaos testing would inject:

```
if chaos:stripe-timeout then response_code == 503
if chaos:stripe-rate-limit then retry-after header != null
```

This validates our fallback logic, which is currently untested because the mocks always succeed.

**3. Resource Leak Detection**

Our `BillingApplicationService` uses in-memory Maps. Chaos scenarios:
- Create 1000 plans, delete 500, verify GET on deleted returns 404
- Cancel subscriptions mid-renewal cycle
- Concurrent PATCH operations on same plan

Cross-operation contracts catch this for single requests; chaos testing would additionally exercise concurrent state corruption.

**4. Entitlement Boundary Testing**

We have credit-based preflight gates. Chaos could:
- Exhaust credits mid-test
- Verify 402 (Payment Required) is returned
- Ensure no partial mutations occur when budget is depleted

This is business-critical: we cannot bill customers for operations that fail.

**5. Auth Token Expiry**

JWT tokens expire. Chaos could:
- Expire tokens between POST and follow-up GET
- Verify 401 with proper `WWW-Authenticate` header
- Test refresh token flow under load

### Proposed Chaos Scenarios for Arbiter

```yaml
billing_chaos:
  - name: stripe-timeout
    target: POST /billing/invoices/:id/pay
    inject: { stripe_delay_ms: 5000 }
    expected: { status: 503, retry_after: "> 0" }

  - name: storage-corruption
    target: DELETE /billing/plans/:id
    inject: { skip_deletion: true }
    expected: { status: 200, follow_up_get: 404 }

  - name: rate-limit
    target: POST /billing/plans
    inject: { rate_limit: 10 }
    expected: { status: 429, x_retry_after: "> 0" }

  - name: auth-expiry
    target: PATCH /billing/plans/:id
    inject: { expire_token_after_ms: 100 }
    expected: { status: 401, www_authenticate: "Bearer" }
```

---

## Part 2: Bugs Found

### Bug 1: Scope Registry Ignores Configured Default Scope

**Severity:** High (breaks auth in cross-operation tests)
**File:** `dist/infrastructure/scope-registry.js`
**Lines:** 60, 76-77

**Problem:**
```javascript
const scope = scopeName !== null ? this.scopes.get(scopeName) : undefined;
const base = scope ?? this.defaultScope; // Always uses empty DEFAULT_SCOPE
```

When `getHeaders(null)` is called, it uses `this.defaultScope`, which is initialized to `{ headers: {}, metadata: {} }` on line 60, and ignores any "default" scope passed to the constructor.

**Impact:** Cross-operation requests (e.g., `response_code(GET /users/{id})`) don't inherit auth headers from the configured scope, causing 401 failures on protected routes.

**Fix:**
```javascript
const base = scope ?? this.scopes.get('default') ?? this.defaultScope;
```

**Reproduction:**
```javascript
await app.register(apophis, {
  scopes: {
    default: { headers: { 'authorization': 'Bearer token' } }
  }
});
// Cross-operation GET /users/123 gets 401 because auth header is not passed
```

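
A regression test for this fix could look like the minimal sketch below. `ScopeRegistrySketch` is a hypothetical stand-in we wrote for illustration, not the actual Apophis class; it only mirrors the fallback order from the fix above (named scope, then the configured `default` scope, then the empty scope).

```javascript
// Hypothetical stand-in for the scope registry; NOT the actual Apophis class.
const EMPTY_SCOPE = { headers: {}, metadata: {} };

class ScopeRegistrySketch {
  constructor(scopes = {}) {
    this.scopes = new Map(Object.entries(scopes));
    this.defaultScope = EMPTY_SCOPE;
  }

  // Resolution order: named scope, then configured "default", then empty scope.
  getHeaders(scopeName) {
    const scope = scopeName !== null ? this.scopes.get(scopeName) : undefined;
    const base = scope ?? this.scopes.get('default') ?? this.defaultScope;
    return { ...base.headers };
  }
}

const registry = new ScopeRegistrySketch({
  default: { headers: { authorization: 'Bearer token' } },
  admin: { headers: { authorization: 'Bearer admin-token' } },
});

console.log(registry.getHeaders(null));      // falls back to the "default" scope
console.log(registry.getHeaders('admin'));   // a named scope wins
console.log(registry.getHeaders('missing')); // an unknown name also falls back
```

With the original one-line lookup, the first call would return an empty header object, which is the 401 we observed.
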
### Bug 2: Contract Builder Drops Routes Option

**Severity:** High (route filtering doesn't work)
**File:** `dist/plugin/contract-builder.js`
**Lines:** 8-15

**Problem:**
```javascript
const config = {
  depth: opts.depth ?? 'standard',
  scope: opts.scope,
  seed: opts.seed,
  timeout: opts.timeout,
  chaos: opts.chaos,
  // Missing: routes: opts.routes
};
```

The `routes` option is documented but never passed to `runPetitTests`, so all routes are tested regardless of the `routes` filter.

**Impact:** Tests run against all 500+ routes instead of the 4 specified, which makes debugging impractical and inflates CI times.

**Fix:**
```javascript
const config = {
  depth: opts.depth ?? 'standard',
  scope: opts.scope,
  seed: opts.seed,
  timeout: opts.timeout,
  chaos: opts.chaos,
  routes: opts.routes, // Add this
};
```

**Reproduction:**
```javascript
await app.apophis.contract({
  routes: ['POST /billing/plans'] // Tests ALL routes instead
});
```

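
For reference, this is the filtering behavior we expected once `routes` reaches the runner. It is a sketch under assumptions: `filterRoutes` and the `{ method, url }` route shape are hypothetical names of ours, and the `"METHOD /path"` key format is taken from the reproduction above.

```javascript
// Normalizes "post  /billing/plans" to "POST /billing/plans".
const normalize = (key) => {
  const [method, path] = key.trim().split(/\s+/);
  return `${method.toUpperCase()} ${path}`;
};

// Hypothetical helper: keeps only the routes named in the `routes` option.
function filterRoutes(allRoutes, routesOption) {
  if (!Array.isArray(routesOption) || routesOption.length === 0) {
    return allRoutes; // no filter configured: test everything
  }
  const wanted = new Set(routesOption.map(normalize));
  return allRoutes.filter((route) => wanted.has(normalize(`${route.method} ${route.url}`)));
}

const registered = [
  { method: 'POST', url: '/billing/plans' },
  { method: 'GET', url: '/billing/plans/:id' },
  { method: 'DELETE', url: '/billing/plans/:id' },
];

const selected = filterRoutes(registered, ['POST /billing/plans']);
console.log(selected); // only the POST route remains
```
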
### Bug 3: Invariant Checking Not Configurable

**Severity:** Medium (false failures for non-hierarchical APIs)
**File:** `dist/test/petit-runner.js`
**Lines:** 386-398

**Problem:** Built-in invariants (`no-orphaned-resources`, `parent-reference-integrity`, `resource-integrity`) run unconditionally for all routes. These assume parent-child resource hierarchies (e.g., `/workspaces/:id/projects/:id`).

**Impact:** For flat resource models (like our billing plans), routes with `x-category: 'constructor'` trigger invariant failures because resources don't have `parentType`/`parentId`.

**Workaround:** We set `x-category: 'observer'` to avoid resource tracking, but this loses the semantic meaning of the route.

**Suggested Fix:**
```javascript
// In config
invariants: ['resource-integrity'] // Opt-in per test
// Or
invariants: false // Disable all
// Or per-route
schema: {
  'x-invariants': ['custom-only']
}
```

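
One possible way to resolve these settings is sketched below. The precedence (a route-level `x-invariants` overrides the test-level `invariants`, and an unset option keeps today's run-everything behavior) is our assumption, and `resolveInvariants` is a hypothetical name, not an Apophis function.

```javascript
// Built-in invariant names as listed in this report.
const BUILT_IN_INVARIANTS = [
  'no-orphaned-resources',
  'parent-reference-integrity',
  'resource-integrity',
];

// Hypothetical resolution step; route-level setting wins over test-level setting.
function resolveInvariants(testConfig = {}, routeSchema = {}) {
  const setting = routeSchema['x-invariants'] ?? testConfig.invariants;
  if (setting === false) return [];                  // disable all
  if (Array.isArray(setting)) return [...setting];   // explicit opt-in list
  return [...BUILT_IN_INVARIANTS];                   // unset: current behavior
}

console.log(resolveInvariants());                                        // all built-ins
console.log(resolveInvariants({ invariants: false }));                   // none
console.log(resolveInvariants({ invariants: false }, { 'x-invariants': ['custom-only'] }));
```
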
---

## Part 3: Design Feedback

### 1. Schema Inference is Too Aggressive

**Issue:** `const` values in JSON Schema generate unconditional contracts.

Example:
```json
{
  "response": {
    "200": {
      "properties": {
        "fragment_type": { "const": "Action" }
      }
    }
  }
}
```

Generates: `response_body(this).fragment_type == "Action"` (checked for ALL responses)

This fails when the route returns 404 with `fragment_type: "Error"`.

**Suggestion:** Infer conditional contracts based on status code:
```
if status:200 then response_body(this).fragment_type == "Action" else true
```

Or add an option to disable inference: `inferContracts: false`.

### 2. Cross-Operation Headers Not Documented

The `scope.headers` behavior for cross-operation requests is not documented. We had to read source code to discover that:
- `createOperationResolver(fastify, request.headers)` passes request headers
- But `request.headers` comes from `scope.getHeaders(null)`
- Which is affected by Bug 1 above

**Suggestion:** Document that cross-operation requests inherit the scope headers of the original request.

### 3. Missing 400 Response Handling

When Fastify schema validation fails (e.g., enum mismatch), it returns 400 with a validation error object. Apophis treats this as a contract failure unless:
- The schema has a 400 response documented
- The contract explicitly accepts 400

Most developers won't document 400 responses. Apophis should either:
- Auto-generate 400 contracts from validation rules
- Or provide a global 400 handler pattern

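
A global 400 policy could be as small as the sketch below. Everything here is an assumption for illustration: `classifyResponse`, the `tolerateValidationErrors` option, and the three outcome labels are hypothetical and do not exist in Apophis today.

```javascript
// Hypothetical policy: decide what the runner does with a response status.
// `documentedStatuses` are the status codes present in the route's response schema.
function classifyResponse(statusCode, documentedStatuses, options = {}) {
  const { tolerateValidationErrors = true } = options;
  if (documentedStatuses.includes(statusCode)) {
    return 'check-contracts'; // documented status: evaluate contracts as usual
  }
  if (statusCode === 400 && tolerateValidationErrors) {
    return 'skip'; // generated input was rejected by schema validation
  }
  return 'violation'; // anything else undocumented is a real finding
}

console.log(classifyResponse(201, [201])); // check-contracts
console.log(classifyResponse(400, [201])); // skip
console.log(classifyResponse(500, [201])); // violation
```

Treating an undocumented 400 as a skipped case rather than a failure keeps generated invalid inputs from drowning out real violations, while a documented 400 still gets its contracts checked.
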
### 4. HEAD Routes Cause Noise

Fastify auto-generates HEAD routes for every GET. These have no response body, causing `response_body(this).id != null` failures.

**Suggestion:** Auto-skip HEAD routes in contract tests, or provide `skipMethods: ['HEAD']` option.

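
The suggested `skipMethods` option amounts to a method filter over the registered routes. The sketch below assumes a `{ method, url }` route shape and a helper name (`applySkipMethods`) of our own choosing.

```javascript
// Hypothetical helper: drops routes whose HTTP method is in `skipMethods`.
function applySkipMethods(routes, skipMethods = ['HEAD']) {
  const skipped = new Set(skipMethods.map((m) => m.toUpperCase()));
  return routes.filter((route) => !skipped.has(route.method.toUpperCase()));
}

const routes = [
  { method: 'GET', url: '/billing/plans/:id' },
  { method: 'HEAD', url: '/billing/plans/:id' }, // auto-generated sibling of the GET
  { method: 'POST', url: '/billing/plans' },
];

console.log(applySkipMethods(routes).map((r) => `${r.method} ${r.url}`));
```

On the Fastify side, the `exposeHeadRoutes` server option controls whether these HEAD routes are generated at all, but turning it off changes the server's behavior rather than just the test run, so a runner-level filter seems the safer default.
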
### 5. Error Suggestions Need Context

When a contract fails, the error is:
```
Field 'fragment_type' does not match expected value 'Error'.
```

But it doesn't say:
- What the actual status code was
- What the actual response body was
- Which route generated the request

**Suggestion:** Include actual vs expected in violation objects.

---

## Part 4: What We Love

### 1. Cross-Operation Contracts

```
if status:201 then response_code(GET /billing/plans/{response_body(this).data.plan_id}) == 200 else true
```

This is genuinely hard to test manually. Apophis makes it declarative and automatic.

### 2. Property-Based Generation

Fast-check found edge cases we missed:
- Empty string `name` (schema allowed it, service rejected it)
- Invalid `billing_interval` values
- Missing required fields

### 3. Schema as Single Source of Truth

Once schemas are correct, contracts are free. The `x-ensures` array supplements rather than replaces schema validation.

### 4. Fast Feedback Loop

Contract tests run in ~1.5s for 4 routes. Much faster than spinning up a full test environment.

---

## Part 5: Feature Requests

### 1. Hypermedia Contract Support

Arbiter returns LDF (Linked Data Fragment) responses with `controls` and `actions`. We'd love to verify:

```
if status:200 then response_body(this).controls.self == request_url(this) else true
if status:200 then response_body(this).actions.create.method == "POST" else true
if status:200 then response_body(this).actions.update.target == "/billing/plans/{response_body(this).data.id}" else true
```

Currently we have to write these manually. Could Apophis infer hypermedia controls from route registration?

### 2. Conditional Schema Contracts

Instead of removing `const` from schemas, allow:

```json
{
  "response": {
    "200": {
      "properties": {
        "fragment_type": { "const": "Action", "x-apophis-conditional": "status:200" }
      }
    }
  }
}
```

This preserves schema expressiveness while generating correct contracts.

### 3. Middleware Contract Verification

Our middleware chain is critical. We'd like to verify:

```
if request_headers(this).authorization == null then status:401 else true
if request_headers(this).x-tenant-id == null then status:400 else true
```

Apophis already supports `request_headers`, so making this a first-class feature (e.g., `x-requires`) would be powerful.

### 4. State Cleanup Hooks

After destructive tests (DELETE), we need to clean up:

```javascript
await app.apophis.contract({
  routes: ['DELETE /billing/plans/:id'],
  cleanup: async (state) => {
    // Remove created plans from database
    await db.plans.deleteMany({ id: { $in: state.createdPlans } });
  }
});
```

This would enable stateful testing without polluting the test environment.

### 5. Contract Coverage Report

After running tests, we'd like a report that helps identify gaps in contract coverage:

```
Contract Coverage:
  POST /billing/plans:
    - 201 response: ✓ tested (42 cases)
    - 400 response: ✓ tested (8 cases)
    - 503 response: ✗ not tested
    - Cross-op GET: ✓ tested (42 cases)
```

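
A report like this can be derived by tallying observed status codes against the documented ones. The sketch below assumes a hypothetical per-case result shape (`{ route, statusCode }`) and a hypothetical `summarizeCoverage` helper; cross-operation rows are left out for brevity.

```javascript
// Hypothetical tally: `documented` maps route -> documented status codes,
// `results` holds one record per executed test case.
function summarizeCoverage(documented, results) {
  const counts = new Map();
  for (const { route, statusCode } of results) {
    const key = `${route} ${statusCode}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }

  const lines = ['Contract Coverage:'];
  for (const [route, statuses] of Object.entries(documented)) {
    lines.push(`  ${route}:`);
    for (const status of statuses) {
      const n = counts.get(`${route} ${status}`) ?? 0;
      lines.push(
        n > 0
          ? `    - ${status} response: ✓ tested (${n} case${n === 1 ? '' : 's'})`
          : `    - ${status} response: ✗ not tested`,
      );
    }
  }
  return lines.join('\n');
}

const report = summarizeCoverage(
  { 'POST /billing/plans': [201, 400, 503] },
  [
    { route: 'POST /billing/plans', statusCode: 201 },
    { route: 'POST /billing/plans', statusCode: 201 },
    { route: 'POST /billing/plans', statusCode: 400 },
  ],
);
console.log(report);
```
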
---

## Conclusion

Apophis is a powerful tool that fills a gap in API testing: behavioral contracts and chaos testing. The core concepts are solid, but the implementation needs hardening for production use:

**Must-fix:** Bugs #1 and #2 (scope registry, route filtering)
**Should-fix:** Bug #3 (configurable invariants), inference aggressiveness
**Nice-to-have:** Hypermedia support, middleware contracts, coverage reports

We're committed to using Apophis for Arbiter's contract testing and will contribute fixes upstream. The value of cross-operation verification alone justifies the investment.

---

**Contact:** Arbiter Engineering Team
**Repository:** https://github.com/anomalyco/apophis (we'll open issues for each bug)

# Critical Feedback: Why Current Chaos Injection is Insufficient for Production APIs

**To:** Apophis Engineering Team
**From:** Arbiter Platform Engineering
**Date:** 2026-04-27
**Context:** Production SaaS platform with 500+ endpoints, Stripe integration, complex middleware chains

---

## The Core Problem

Current chaos injection operates exclusively at the **HTTP transport layer** (`executeHttp()` wrapper). This tests:
- ✅ Response schemas under forced errors
- ✅ Timeout contracts with artificial delays
- ✅ Response validation with corrupted bodies

But **production APIs fail at the dependency layer**, not the transport layer:
- Stripe API returns 429 rate limit
- Database connection pool exhausted
- Redis cache timeout
- Third-party webhook delivery fails
- Message queue backlog

**Current chaos cannot simulate these.** It can force a 503 response, but it cannot simulate "Stripe returned 429, so we need to propagate retry-after header" because the handler never sees the Stripe error.

---

## Specific Pain Points

### 1. Error Injection is Backwards

**Current behavior:**
```
Handler runs → creates side effects → response overridden to 503
```

**What we need:**
```
Handler runs → Stripe call fails with 429 → handler catches error → returns 503 with retry-after
```

The current approach tests "what does our 503 response look like" but not "does our handler correctly handle Stripe errors." These are different:
- Current: Tests schema compliance for hardcoded error responses
- Needed: Tests business logic for dependency failures

**Impact:** We have 503 contracts that pass, but our handler might not actually set the retry-after header when Stripe fails. The contract gives false confidence.

### 2. Chaos Events Are Invisible

When chaos injects, the test result shows:
```
POST /billing/plans (#1): FAIL
Error: Contract violation: if status:503 then response_body(this).data.error != null else true
```

But there's no indication of:
- Whether chaos was the cause (rather than a real bug)
- What type of chaos was injected (error? corruption? delay?)
- What the original response was before the override

**Impact:** Debugging chaos failures is impractical. We can't tell whether our contract is wrong or chaos mutated the response unexpectedly.

### 3. Resilience Verification is Dangerous for Stateful APIs

When `resilience: { enabled: true }` is set, Apophis retries the same request up to `maxRetries` times.

For `POST /billing/plans`:
- Attempt 1: Creates plan A → gets 503 → retries
- Attempt 2: Creates plan B → gets 503 → retries
- Attempt 3: Creates plan C → gets 503 → retries
- Attempt 4: Creates plan D → succeeds

**Result: 4 plans created, 1 expected.** This pollutes state and makes follow-up tests (GET, PATCH, DELETE) behave unpredictably.

**Impact:** We can't use resilience testing on stateful routes without idempotency, and most real APIs are stateful.

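
To make the idempotency requirement concrete, here is a minimal sketch of a create operation that replays its stored result when the same key is retried. It assumes the runner reuses one key across all attempts of a request; `createPlanStore` and the in-memory Maps are hypothetical and stand in for a real handler and database.

```javascript
// Hypothetical in-memory store: a repeated idempotency key replays the first result.
function createPlanStore() {
  const plans = new Map();       // plan id -> plan
  const planIdByKey = new Map(); // idempotency key -> plan id
  let nextId = 1;

  return {
    createPlan(idempotencyKey, body) {
      const existingId = planIdByKey.get(idempotencyKey);
      if (existingId !== undefined) {
        // Retry of an earlier attempt: return the stored plan, create nothing.
        return { statusCode: 201, plan: plans.get(existingId), replayed: true };
      }
      const plan = { id: `plan_${nextId++}`, ...body };
      plans.set(plan.id, plan);
      planIdByKey.set(idempotencyKey, plan.id);
      return { statusCode: 201, plan, replayed: false };
    },
    count() {
      return plans.size;
    },
  };
}

const store = createPlanStore();
const key = 'retry-group-1'; // same key for every attempt of one logical request
const attempts = [1, 2, 3, 4].map(() => store.createPlan(key, { name: 'Pro' }));
console.log(store.count(), attempts.map((a) => a.replayed));
```

With a shared key, the four attempts from the scenario above leave one plan behind instead of four.
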
### 4. Dropout Returns Status Code 0

Network failures in production don't return status code 0. They:
- Time out (status undefined, error "ETIMEDOUT")
- Reset connection (error "ECONNRESET")
- Return 503 from load balancer

Status 0 is a browser-specific artifact. Node.js HTTP clients don't produce status 0.

**Impact:** Contracts can't match status 0. We have to either:
- Add `status:0` to all contracts (meaningless)
- Or ignore dropout failures (makes dropout useless)

---

## What Would Make Chaos Useful for Arbiter

### Option A: Outbound Request Contracts (Preferred)

Apophis intercepts outbound HTTP requests from the handler:

```javascript
// In Apophis config
chaos: {
  outbound: {
    'api.stripe.com': {
      delay: { probability: 0.1, minMs: 1000, maxMs: 5000 },
      error: {
        probability: 0.05,
        responses: [
          { statusCode: 429, headers: { 'retry-after': '60' } },
          { statusCode: 503, body: { error: 'stripe_unavailable' } }
        ]
      }
    }
  }
}
```

**Benefits:**
- Handler sees real dependency failures
- Tests actual error handling logic
- Side effects only occur when handler succeeds
- No state pollution from retries

### Option B: Service Method Wrapping

Apophis wraps methods on decorated services:

```javascript
// Fastify decorator
app.decorate('stripe', new StripeService());

// Apophis wraps it
apophis.chaos.wrap(app.stripe, {
  'paymentIntents.create': {
    delay: { probability: 0.1, ms: 5000 },
    error: { probability: 0.05, throws: new StripeTimeoutError() }
  }
});
```

**Benefits:**
- Works with any service pattern (HTTP, DB, queue)
- Tests business logic directly
- Minimal changes to existing code

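
To show that this is a small amount of machinery, here is a rough sketch of what such a wrapper might do internally. `wrapWithChaos` is a hypothetical function of ours, not an existing Apophis API; it takes the same config shape as above plus an injectable `random` source so runs can be made deterministic.

```javascript
// Hypothetical helper: wraps methods addressed by dotted path, e.g. 'paymentIntents.create'.
function wrapWithChaos(service, config, random = Math.random) {
  for (const [path, rule] of Object.entries(config)) {
    const keys = path.split('.');
    const methodName = keys.pop();
    const owner = keys.reduce((obj, key) => obj[key], service);
    const original = owner[methodName];

    owner[methodName] = async function chaosWrapped(...args) {
      if (rule.error && random() < rule.error.probability) {
        throw rule.error.throws; // injected failure: the real method never runs
      }
      if (rule.delay && random() < rule.delay.probability) {
        await new Promise((resolve) => setTimeout(resolve, rule.delay.ms));
      }
      return original.apply(owner, args);
    };
  }
  return service;
}

// Stand-in service; probability 1 forces the injected error on every call.
const stripe = {
  paymentIntents: { create: async (params) => ({ id: 'pi_test', ...params }) },
};
wrapWithChaos(stripe, {
  'paymentIntents.create': { error: { probability: 1, throws: new Error('stripe_timeout') } },
});

stripe.paymentIntents.create({ amount: 100 })
  .then((intent) => console.log('created', intent.id))
  .catch((err) => console.log('handler saw:', err.message));
```

Because the error is thrown before the wrapped method runs, no side effect happens on an injected failure, which is the property the HTTP-layer override lacks.
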
### Option C: Event-Driven Chaos

For async architectures:

```javascript
chaos: {
  events: {
    'webhook.received': {
      drop: { probability: 0.1 }, // Simulate webhook loss
      delay: { probability: 0.2, ms: 30000 } // Simulate queue delay
    }
  }
}
```

---

## Recommended Priority Order

### P0 (Critical): Fix Event Reporting

Every chaos injection should be visible:

```javascript
// In test results
test.diagnostics.chaos = {
  injected: true,
  type: 'error',
  details: {
    statusCode: 503,
    originalStatusCode: 201,
    strategy: 'override'
  }
}
```

Without this, chaos failures are indistinguishable from real bugs.

### P1 (High): Add Dependency-Aware Chaos

Implement outbound request interception or service wrapping. Current HTTP-layer chaos is too superficial for production APIs.

### P2 (Medium): Fix Dropout Semantics

Return proper status codes:
- `504 Gateway Timeout` for timeouts
- `503 Service Unavailable` for network failures
- Or make it configurable: `dropout: { statusCode: 503 }`

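
The mapping we have in mind is sketched below. `ETIMEDOUT`, `ECONNRESET`, and `ECONNREFUSED` are standard Node.js system error codes; the `dropoutResponse` helper, its config shape, and the default body are assumptions for illustration.

```javascript
// Hypothetical mapping from Node.js network error codes to the statuses proposed above.
const DROPOUT_STATUS_BY_CODE = {
  ETIMEDOUT: 504,    // timeout -> 504 Gateway Timeout
  ECONNRESET: 503,   // connection reset -> 503 Service Unavailable
  ECONNREFUSED: 503, // connection refused -> 503 Service Unavailable
};

// An explicit `statusCode` in the dropout config overrides the mapping.
function dropoutResponse(errorCode, config = {}) {
  const statusCode = config.statusCode ?? DROPOUT_STATUS_BY_CODE[errorCode] ?? 503;
  return {
    statusCode,
    body: config.body ?? { error: 'network_failure', code: errorCode },
  };
}

console.log(dropoutResponse('ETIMEDOUT'));
console.log(dropoutResponse('ECONNRESET'));
console.log(dropoutResponse('ETIMEDOUT', { statusCode: 503 }));
```

Either way, contracts can then match a real status code instead of special-casing `status:0`.
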
### P3 (Low): Stateful Retry Safety

Either:
- Make retries use unique IDs (prevent duplicate creation)
- Or document that resilience requires idempotent handlers
- Or skip resilience for non-idempotent routes

---

## What We're Doing Instead

Since current chaos doesn't serve our needs, we're writing application-layer failure tests:

```javascript
test('Stripe rate limit handling', async () => {
  // Mock Stripe to return 429
  app.stripe.paymentIntents.create = async () => {
    const err = new Error('Rate limit exceeded');
    err.statusCode = 429;
    err.headers = { 'retry-after': '60' };
    throw err;
  };

  const res = await payInvoice({ invoiceId: 'test' });

  assert.strictEqual(res.statusCode, 429);
  assert.strictEqual(res.json().data.error, 'stripe_rate_limit');
  assert.strictEqual(res.headers['retry-after'], '60');
});
```

This tests what we actually need: **handler behavior when dependencies fail.**

---

## Conclusion

Apophis chaos is a good start for HTTP-layer resilience testing, but it's insufficient for production APIs with external dependencies. The framework needs to evolve from "HTTP response mutator" to "dependency failure simulator" to be truly valuable.

We want Apophis to succeed. The schema-driven contract approach is innovative and valuable. But chaos testing needs to be dependency-aware to be useful for real-world APIs.

**Happy to collaborate** on designing the outbound interception API or service wrapping approach.

---

# Appendix: Concrete Proposals for Apophis Improvements

## Proposal 1: Conditional Schema Inference

Instead of removing `const` from schemas, generate conditional contracts:

```typescript
// Current behavior (WRONG):
// Schema: { properties: { fragment_type: { const: "Action" } } }
// Generates: response_body(this).fragment_type == "Action"  // Applies to ALL responses

// Proposed behavior:
// Generates: if status:200 then response_body(this).fragment_type == "Action" else true
```

Implementation:
```typescript
function inferContractsFromResponseSchema(responseSchema, statusCode) {
  const formulas = [];
  // ... existing inference logic ...

  // Wrap in conditional if status code is 2xx
  if (statusCode >= 200 && statusCode < 300) {
    return formulas.map(f => `if status:${statusCode} then ${f} else true`);
  }
  return formulas;
}
```

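
For a runnable illustration of the idea, the sketch below infers conditional contracts for top-level `const` properties only. It is deliberately simplified and differs from the snippet above in one respect: it conditions every documented status code, not just 2xx. `inferConstContracts` is a hypothetical name, and the input mirrors the status-keyed `response` schema shown earlier in this report.

```javascript
// Hypothetical, simplified inference: handles only top-level `const` properties.
function inferConstContracts(responseSchemas) {
  const formulas = [];
  for (const [status, schema] of Object.entries(responseSchemas)) {
    for (const [field, rule] of Object.entries(schema.properties ?? {})) {
      if (!('const' in rule)) continue; // only `const` constraints are inferred here
      const check = `response_body(this).${field} == ${JSON.stringify(rule.const)}`;
      formulas.push(`if status:${status} then ${check} else true`);
    }
  }
  return formulas;
}

const formulas = inferConstContracts({
  200: { properties: { fragment_type: { const: 'Action' }, id: { type: 'string' } } },
  404: { properties: { fragment_type: { const: 'Error' } } },
});
console.log(formulas.join('\n'));
```

Under this scheme the 404 response with `fragment_type: "Error"` satisfies its own contract and no longer trips the one inferred from the 200 schema.
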
## Proposal 2: Configurable Invariants

```typescript
// In test config
const result = await app.apophis.contract({
  invariants: ['resource-integrity'], // Opt in to specific invariants
  // Or disable all:
  // invariants: false,
});

// Or per-route in schema
schema: {
  'x-invariants': ['resource-integrity'],
  'x-invariants-exclude': ['no-orphaned-resources']
}
```

## Proposal 3: Outbound Request Interception

```typescript
// Apophis provides fetch/http client wrapper
const stripeClient = apophis.createChaosAwareClient({
  name: 'stripe',
  baseURL: 'https://api.stripe.com',
  defaults: {
    headers: { 'Authorization': `Bearer ${process.env.STRIPE_KEY}` }
  }
});

// In chaos config
chaos: {
  outbound: {
    'stripe': {
      delay: { probability: 0.1, minMs: 1000, maxMs: 5000 },
      error: {
        probability: 0.05,
        responses: [
          { statusCode: 429, headers: { 'retry-after': '60' } },
          { statusCode: 503, body: { error: 'stripe_unavailable' } }
        ]
      }
    }
  }
}
```

Implementation approach:
- Monkey-patch `fetch` or `http.request` at module level
- Track outbound requests by hostname
- Match against chaos config
- Inject delays/errors before request reaches network

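
The matching step in that list could look roughly like the sketch below, which covers only the decision logic and leaves the actual `fetch`/`http.request` patching aside. It keys the config by hostname (as in Option A) rather than by client name; `pickOutboundChaos` and the returned action shape are hypothetical.

```javascript
// Hypothetical decision step: given a URL and the outbound chaos config,
// choose what (if anything) to inject before the request reaches the network.
function pickOutboundChaos(url, outboundConfig, random = Math.random) {
  const { hostname } = new URL(url);
  const rule = outboundConfig[hostname];
  if (!rule) return { type: 'none' }; // host not targeted by chaos

  if (rule.error && random() < rule.error.probability) {
    const { responses } = rule.error;
    const response = responses[Math.floor(random() * responses.length)];
    return { type: 'error', response };
  }
  if (rule.delay && random() < rule.delay.probability) {
    const { minMs, maxMs } = rule.delay;
    return { type: 'delay', ms: Math.round(minMs + random() * (maxMs - minMs)) };
  }
  return { type: 'none' };
}

const outbound = {
  'api.stripe.com': {
    delay: { probability: 0.1, minMs: 1000, maxMs: 5000 },
    error: {
      probability: 0.05,
      responses: [
        { statusCode: 429, headers: { 'retry-after': '60' } },
        { statusCode: 503, body: { error: 'stripe_unavailable' } },
      ],
    },
  },
};

console.log(pickOutboundChaos('https://api.stripe.com/v1/payment_intents', outbound));
console.log(pickOutboundChaos('https://example.com/health', outbound));
```

Passing a seeded `random` function in place of `Math.random` would let the runner reproduce a chaos run from the same `seed` it already uses for generation.
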
## Proposal 4: Service Method Wrapping

```typescript
// After Fastify ready
app.addHook('onReady', () => {
  apophis.chaos.wrap(app.billingService, {
    'createPricingPlan': {
      delay: { probability: 0.1, ms: 100 },
      error: {
        probability: 0.05,
        throws: new ServiceUnavailableError('stripe_timeout')
      }
    }
  });
});
```

## Proposal 5: Chaos Event Reporting

```typescript
// In petit-runner, after chaos execution
const chaosEvents = result.events || [];
for (const event of chaosEvents) {
  results.push({
    ok: true, // Chaos events are informational, not failures
    name: `${route.method} ${route.path} (chaos: ${event.type})`,
    diagnostics: {
      chaos: {
        injected: true,
        type: event.type,
        details: event.details
      }
    }
  });
}
```

## Proposal 6: Dropout Semantics

```typescript
// Configurable dropout behavior
chaos: {
  dropout: {
    probability: 0.1,
    statusCode: 503, // Default: 503 instead of 0
    body: { error: 'network_failure' }
  }
}
```

## Proposal 7: Hypermedia Contract Support

```typescript
// New APOSTL operation headers
response_body(this).controls.self == request_url(this)
response_body(this).actions.update.method == "PATCH"
response_body(this).actions.update.target == "/billing/plans/{response_body(this).data.id}"
```

Or schema annotation:
```json
{
  "x-apophis-hypermedia": {
    "controls": ["self", "next", "prev"],
    "actions": ["create", "update", "delete"]
  }
}
```