# Arbiter → Apophis Feedback Report
**Date:** 2026-04-27
**Reporter:** Arbiter Engineering Team
**Context:** Integration of Apophis v2.2 into Arbiter Platform for behavioral contract testing
---
## Executive Summary
Apophis provides genuinely valuable capabilities for behavioral contract testing that go beyond traditional unit/integration tests. The schema-to-contract inference, cross-operation verification, and chaos testing infrastructure are compelling. However, we encountered three bugs in core infrastructure and several design friction points that should be addressed for wider adoption.
**Overall Assessment:** Strong value proposition for teams willing to invest in schema-driven testing. Needs polish on edge cases and configurability.
---
## Part 1: How Chaos Injection Would Help Arbiter
### Current State
Arbiter is a multi-tenant SaaS platform with:
- 500+ API endpoints across 15 route families
- Billing, graph storage, auth, sessions, webhooks, etc.
- Mock Stripe integration for payment processing
- In-memory and persistent storage backends
- Complex middleware chain: auth → tenant boundary → permissions → preflight → handler
### Where Chaos Testing Adds Value
**1. Middleware Resilience Verification**
Our middleware chain has implicit dependencies:
```
Transport → AuthN → Scope → AuthZ → Challenge → Preflight → Handler
```
Chaos testing would verify:
- What happens when `preflight()` times out? Does the handler still execute?
- If auth middleware fails with 503, do we get proper retry headers?
- Does a slow tenant boundary check cascade to response timeouts?
**Concrete scenario:** If the billing preflight gate (budget check) is slow, does the subscription creation handler wait or fail? Our contracts say `response_time < 2000ms` — chaos would tell us if that's actually enforced.
**2. Mock Service Degradation**
We use `MockStripeService` for payment processing. In production, Stripe can:
- Return 429 (rate limit)
- Time out on `paymentIntents.create`
- Return network errors
Chaos testing would inject:
```
if chaos:stripe-timeout then response_code == 503
if chaos:stripe-rate-limit then retry-after header != null
```
This validates our fallback logic — currently untested because mocks always succeed.
**3. Resource Leak Detection**
Our `BillingApplicationService` uses in-memory Maps. Chaos scenarios:
- Create 1000 plans, delete 500, verify GET on deleted returns 404
- Cancel subscriptions mid-renewal cycle
- Concurrent PATCH operations on same plan
Cross-operation contracts catch these failures when requests run one at a time; chaos testing would additionally expose state corruption under concurrent access.
**4. Entitlement Boundary Testing**
We have credit-based preflight gates. Chaos could:
- Exhaust credits mid-test
- Verify 402 (Payment Required) is returned
- Ensure no partial mutations occur when budget is depleted
This is business-critical: we cannot bill customers for operations that fail.
**5. Auth Token Expiry**
JWT tokens expire. Chaos could:
- Expire tokens between POST and follow-up GET
- Verify 401 with proper `WWW-Authenticate` header
- Test refresh token flow under load
### Proposed Chaos Scenarios for Arbiter
```yaml
billing_chaos:
  - name: stripe-timeout
    target: POST /billing/invoices/:id/pay
    inject: { stripe_delay_ms: 5000 }
    expected: { status: 503, retry_after: "> 0" }
  - name: storage-corruption
    target: DELETE /billing/plans/:id
    inject: { skip_deletion: true }
    expected: { status: 200, follow_up_get: 404 }
  - name: rate-limit
    target: POST /billing/plans
    inject: { rate_limit: 10 }
    expected: { status: 429, retry_after: "> 0" }
  - name: auth-expiry
    target: PATCH /billing/plans/:id
    inject: { expire_token_after_ms: 100 }
    expected: { status: 401, www_authenticate: "Bearer" }
```
---
## Part 2: Bugs Found
### Bug 1: Scope Registry Ignores Configured Default Scope
**Severity:** High (breaks auth in cross-operation tests)
**File:** `dist/infrastructure/scope-registry.js`
**Line:** 60, 76-77
**Problem:**
```javascript
const scope = scopeName !== null ? this.scopes.get(scopeName) : undefined;
const base = scope ?? this.defaultScope; // Always uses empty DEFAULT_SCOPE
```
When `getHeaders(null)` is called, it uses `this.defaultScope` which is initialized to `{ headers: {}, metadata: {} }` on line 60, ignoring any "default" scope passed in the constructor.
**Impact:** Cross-operation requests (e.g., `response_code(GET /users/{id})`) don't inherit auth headers from the configured scope, causing 401 failures on protected routes.
**Fix:**
```javascript
const base = scope ?? this.scopes.get('default') ?? this.defaultScope;
```
**Reproduction:**
```javascript
await app.register(apophis, {
  scopes: {
    default: { headers: { 'authorization': 'Bearer token' } }
  }
});
// Cross-operation GET /users/123 gets 401 because auth header is not passed
```
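The fallback order can be checked in isolation. The sketch below is a minimal stand-in, not Apophis's actual class: `MiniScopeRegistry`, its constructor shape, and the method names are assumptions made for illustration. Only the `scope ?? this.scopes.get('default') ?? this.defaultScope` chain is taken from the proposed fix.

```javascript
// Hypothetical stand-in for the scope registry; names are illustrative only.
const EMPTY_SCOPE = { headers: {}, metadata: {} };

class MiniScopeRegistry {
  constructor(scopes = {}) {
    this.scopes = new Map(Object.entries(scopes));
    this.defaultScope = EMPTY_SCOPE;
  }

  // Buggy lookup: a null scope name always falls through to the empty scope.
  getHeadersBuggy(scopeName) {
    const scope = scopeName !== null ? this.scopes.get(scopeName) : undefined;
    const base = scope ?? this.defaultScope;
    return { ...base.headers };
  }

  // Fixed lookup: a configured "default" scope is consulted before the empty one.
  getHeaders(scopeName) {
    const scope = scopeName !== null ? this.scopes.get(scopeName) : undefined;
    const base = scope ?? this.scopes.get('default') ?? this.defaultScope;
    return { ...base.headers };
  }
}

const registry = new MiniScopeRegistry({
  default: { headers: { authorization: 'Bearer token' } },
});

console.log(registry.getHeadersBuggy(null)); // {}
console.log(registry.getHeaders(null));      // { authorization: 'Bearer token' }
```

A regression test along these lines would have caught the 401s before they surfaced in cross-operation runs.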
### Bug 2: Contract Builder Drops Routes Option
**Severity:** High (route filtering doesn't work)
**File:** `dist/plugin/contract-builder.js`
**Line:** 8-15
**Problem:**
```javascript
const config = {
  depth: opts.depth ?? 'standard',
  scope: opts.scope,
  seed: opts.seed,
  timeout: opts.timeout,
  chaos: opts.chaos,
  // Missing: routes: opts.routes
};
```
The `routes` option is documented but never passed to `runPetitTests`, causing all routes to be tested regardless of the `routes` filter.
**Impact:** Tests run against all 500+ routes instead of the 4 specified, which makes failures hard to isolate and inflates CI time.
**Fix:**
```javascript
const config = {
  depth: opts.depth ?? 'standard',
  scope: opts.scope,
  seed: opts.seed,
  timeout: opts.timeout,
  chaos: opts.chaos,
  routes: opts.routes, // Add this
};
```
**Reproduction:**
```javascript
await app.apophis.contract({
  routes: ['POST /billing/plans'] // Tests ALL routes instead
});
```
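For reference, this is the filtering behavior we expected once `routes` reaches the runner. It is a rough sketch under our own assumptions: `selectRoutes`, `normalizeRouteKey`, and the `{ method, url }` route shape are hypothetical and not taken from the Apophis source.

```javascript
// Hypothetical helpers; the "METHOD /path" form matches the reproduction above.
function normalizeRouteKey(method, path) {
  return `${method.toUpperCase()} ${path}`;
}

function selectRoutes(allRoutes, filter) {
  // No filter (undefined or empty) means "test everything".
  if (!Array.isArray(filter) || filter.length === 0) return allRoutes;
  const wanted = new Set(
    filter.map((entry) => {
      const [method, path] = entry.trim().split(/\s+/, 2);
      return normalizeRouteKey(method, path);
    })
  );
  return allRoutes.filter((r) => wanted.has(normalizeRouteKey(r.method, r.url)));
}

const discovered = [
  { method: 'POST', url: '/billing/plans' },
  { method: 'GET', url: '/billing/plans/:id' },
  { method: 'DELETE', url: '/billing/plans/:id' },
];

const selected = selectRoutes(discovered, ['post /billing/plans']);
console.log(selected.map((r) => `${r.method} ${r.url}`)); // [ 'POST /billing/plans' ]
```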
### Bug 3: Invariant Checking Not Configurable
**Severity:** Medium (false failures for non-hierarchical APIs)
**File:** `dist/test/petit-runner.js`
**Line:** 386-398
**Problem:** Built-in invariants (`no-orphaned-resources`, `parent-reference-integrity`, `resource-integrity`) run unconditionally for all routes. These assume parent-child resource hierarchies (e.g., `/workspaces/:id/projects/:id`).
**Impact:** For flat resource models (like our billing plans), routes with `x-category: 'constructor'` trigger invariant failures because resources don't have `parentType`/`parentId`.
**Workaround:** We set `x-category: 'observer'` to avoid resource tracking, but this loses the semantic meaning of the route.
**Suggested Fix:**
```javascript
// In config
invariants: ['resource-integrity'] // Opt-in per test
// Or
invariants: false // Disable all
// Or per-route
schema: {
  'x-invariants': ['custom-only']
}
```
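One possible way to resolve these three forms into a single list per route is sketched below. The function name `resolveInvariants` and the precedence (a per-route annotation overrides the test-level option) are our assumptions, not existing Apophis behavior.

```javascript
// Built-in invariant names as listed in this report.
const BUILT_IN_INVARIANTS = [
  'no-orphaned-resources',
  'parent-reference-integrity',
  'resource-integrity',
];

// Hypothetical resolver. Assumed precedence: a per-route `x-invariants`
// annotation wins over the test-level `invariants` option.
function resolveInvariants(configInvariants, routeSchema = {}) {
  const routeLevel = routeSchema['x-invariants'];
  const requested = routeLevel !== undefined ? routeLevel : configInvariants;

  if (requested === false) return [];                            // disable all
  if (requested === undefined) return [...BUILT_IN_INVARIANTS];  // today's behavior
  // Opt-in list: names that are not built-ins select no built-in checks.
  return requested.filter((name) => BUILT_IN_INVARIANTS.includes(name));
}

console.log(resolveInvariants(undefined).length);                              // 3
console.log(resolveInvariants(false));                                         // []
console.log(resolveInvariants(['resource-integrity']));                        // [ 'resource-integrity' ]
console.log(resolveInvariants(undefined, { 'x-invariants': ['custom-only'] })); // []
```

Keeping "undefined means all built-ins" preserves current behavior for existing users while letting flat resource models opt out.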
---
## Part 3: Design Feedback
### 1. Schema Inference is Too Aggressive
**Issue:** `const` values in JSON Schema generate unconditional contracts.
Example:
```json
{
  "response": {
    "200": {
      "properties": {
        "fragment_type": { "const": "Action" }
      }
    }
  }
}
```
Generates: `response_body(this).fragment_type == "Action"` (checked for ALL responses)
This fails when the route returns 404 with `fragment_type: "Error"`.
**Suggestion:** Infer conditional contracts based on status code:
```
if status:200 then response_body(this).fragment_type == "Action" else true
```
Or add an option to disable inference: `inferContracts: false`.
### 2. Cross-Operation Headers Not Documented
The `scope.headers` behavior for cross-operation requests is not documented. We had to read source code to discover that:
- `createOperationResolver(fastify, request.headers)` passes request headers
- But `request.headers` comes from `scope.getHeaders(null)`
- Which had bug #1 above
**Suggestion:** Document that cross-operation requests inherit the scope headers of the original request.
### 3. Missing 400 Response Handling
When Fastify schema validation fails (e.g., enum mismatch), it returns 400 with a validation error object. Apophis treats this as a contract failure unless:
- The schema has a 400 response documented
- The contract explicitly accepts 400
Most developers won't document 400 responses. Apophis should either:
- Auto-generate 400 contracts from validation rules
- Or provide a global 400 handler pattern
### 4. HEAD Routes Cause Noise
Fastify auto-generates HEAD routes for every GET. These have no response body, causing `response_body(this).id != null` failures.
**Suggestion:** Auto-skip HEAD routes in contract tests, or provide `skipMethods: ['HEAD']` option.
### 5. Error Suggestions Need Context
When a contract fails, the error is:
```
Field 'fragment_type' does not match expected value 'Error'.
```
But it doesn't say:
- What the actual status code was
- What the actual response body was
- Which route generated the request
**Suggestion:** Include actual vs expected in violation objects.
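As an illustration of the context we would find useful, here is a possible violation shape. Every field name in `buildViolation` is hypothetical; the values reuse the `fragment_type` example from the schema inference section.

```javascript
// Hypothetical violation shape; field names are illustrative, not an Apophis API.
function buildViolation({ route, formula, field, expected, actual, statusCode, body }) {
  return {
    route,
    formula,
    message: `Field '${field}' expected ${JSON.stringify(expected)} but got ${JSON.stringify(actual)}`,
    expected,
    actual,
    response: { statusCode, body },
  };
}

const violation = buildViolation({
  route: 'GET /billing/plans/:id',
  formula: 'response_body(this).fragment_type == "Action"',
  field: 'fragment_type',
  expected: 'Action',
  actual: 'Error',
  statusCode: 404,
  body: { fragment_type: 'Error' },
});

console.log(violation.message);
// Field 'fragment_type' expected "Action" but got "Error"
```

With the status code and body attached, the 404 case above is immediately recognizable as an inference problem instead of a handler bug.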
---
## Part 4: What We Love
### 1. Cross-Operation Contracts
```
if status:201 then response_code(GET /billing/plans/{response_body(this).data.plan_id}) == 200 else true
```
This is genuinely hard to test manually. Apophis makes it declarative and automatic.
### 2. Property-Based Generation
Fast-check found edge cases we missed:
- Empty string `name` (schema allowed it, service rejected it)
- Invalid `billing_interval` values
- Missing required fields
### 3. Schema as Single Source of Truth
Once schemas are correct, contracts are free. The `x-ensures` array supplements rather than replaces schema validation.
### 4. Fast Feedback Loop
Contract tests run in ~1.5s for 4 routes. Much faster than spinning up a full test environment.
---
## Part 5: Feature Requests
### 1. Hypermedia Contract Support
Arbiter returns LDF (Linked Data Fragment) responses with `controls` and `actions`. We'd love to verify:
```
if status:200 then response_body(this).controls.self == request_url(this) else true
if status:200 then response_body(this).actions.create.method == "POST" else true
if status:200 then response_body(this).actions.update.target == "/billing/plans/{response_body(this).data.id}" else true
```
Currently we have to write these manually. Could Apophis infer hypermedia controls from route registration?
### 2. Conditional Schema Contracts
Instead of removing `const` from schemas, allow:
```json
{
  "response": {
    "200": {
      "properties": {
        "fragment_type": { "const": "Action", "x-apophis-conditional": "status:200" }
      }
    }
  }
}
```
This preserves schema expressiveness while generating correct contracts.
### 3. Middleware Contract Verification
Our middleware chain is critical. We'd like to verify:
```
if request_headers(this).authorization == null then status:401 else true
if request_headers(this).x-tenant-id == null then status:400 else true
```
Apophis already supports `request_headers` — making this a first-class feature (e.g., `x-requires`) would be powerful.
### 4. State Cleanup Hooks
After destructive tests (DELETE), we need to clean up:
```javascript
await app.apophis.contract({
  routes: ['DELETE /billing/plans/:id'],
  cleanup: async (state) => {
    // Remove created plans from database
    await db.plans.deleteMany({ id: { $in: state.createdPlans } });
  }
});
```
This would enable stateful testing without polluting the test environment.
### 5. Contract Coverage Report
After running tests, we'd like:
```
Contract Coverage:
  POST /billing/plans:
    - 201 response: ✓ tested (42 cases)
    - 400 response: ✓ tested (8 cases)
    - 503 response: ✗ not tested
    - Cross-op GET: ✓ tested (42 cases)
This helps identify gaps in contract coverage.
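A report like this could be derived from per-case results plus the status codes documented in the schema. The sketch below assumes a hypothetical result record of the form `{ statusCode }` and a hypothetical `summarizeCoverage` helper; neither is an existing Apophis structure.

```javascript
// Hypothetical aggregation: count executed cases per documented status code.
function summarizeCoverage(documentedStatuses, results) {
  const counts = new Map(documentedStatuses.map((status) => [status, 0]));
  for (const { statusCode } of results) {
    if (counts.has(statusCode)) counts.set(statusCode, counts.get(statusCode) + 1);
  }
  return [...counts].map(([status, cases]) =>
    cases > 0
      ? `- ${status} response: ✓ tested (${cases} cases)`
      : `- ${status} response: ✗ not tested`
  );
}

// Simulated run: 42 successful creations and 8 validation failures.
const results = [
  ...Array.from({ length: 42 }, () => ({ statusCode: 201 })),
  ...Array.from({ length: 8 }, () => ({ statusCode: 400 })),
];

console.log(summarizeCoverage([201, 400, 503], results).join('\n'));
```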
---
## Conclusion
Apophis is a powerful tool that fills a gap in API testing — behavioral contracts and chaos testing. The core concepts are solid, but the implementation needs hardening for production use:
- **Must-fix:** Bugs #1 and #2 (scope registry, route filtering)
- **Should-fix:** Bug #3 (configurable invariants), inference aggressiveness
- **Nice-to-have:** Hypermedia support, middleware contracts, coverage reports
We're committed to using Apophis for Arbiter's contract testing and will contribute fixes upstream. The value of cross-operation verification alone justifies the investment.
---
**Contact:** Arbiter Engineering Team
**Repository:** https://github.com/anomalyco/apophis (we'll open issues for each bug)
# Critical Feedback: Why Current Chaos Injection is Insufficient for Production APIs
**To:** Apophis Engineering Team
**From:** Arbiter Platform Engineering
**Date:** 2026-04-27
**Context:** Production SaaS platform with 500+ endpoints, Stripe integration, complex middleware chains
---
## The Core Problem
Current chaos injection operates exclusively at the **HTTP transport layer** (`executeHttp()` wrapper). This tests:
- ✅ Response schemas under forced errors
- ✅ Timeout contracts with artificial delays
- ✅ Response validation with corrupted bodies
But **production APIs fail at the dependency layer**, not the transport layer:
- Stripe API returns 429 rate limit
- Database connection pool exhausted
- Redis cache timeout
- Third-party webhook delivery fails
- Message queue backlog
**Current chaos cannot simulate these.** It can force a 503 response, but it cannot simulate "Stripe returned 429, so we need to propagate retry-after header" because the handler never sees the Stripe error.
---
## Specific Pain Points
### 1. Error Injection is Backwards
**Current behavior:**
```
Handler runs → creates side effects → response overridden to 503
```
**What we need:**
```
Handler runs → Stripe call fails with 429 → handler catches error → returns 503 with retry-after
```
The current approach tests "what does our 503 response look like" but not "does our handler correctly handle Stripe errors." These are different:
- Current: Tests schema compliance for hardcoded error responses
- Needed: Tests business logic for dependency failures
**Impact:** We have 503 contracts that pass, but our handler might not actually set the retry-after header when Stripe fails. The contract gives false confidence.
### 2. Chaos Events Are Invisible
When chaos injects, the test result shows:
```
POST /billing/plans (#1): FAIL
Error: Contract violation: if status:503 then response_body(this).data.error != null else true
```
But there's no indication that:
- Chaos was the cause (not a real bug)
- What type of chaos was injected (error? corruption? delay?)
- What the original response was before override
**Impact:** Debugging chaos failures is impossible. We can't tell if our contract is wrong or if chaos mutated the response unexpectedly.
### 3. Resilience Verification is Dangerous for Stateful APIs
When `resilience: { enabled: true }` is set, Apophis retries the same request up to `maxRetries` times.
For `POST /billing/plans`:
- Attempt 1: Creates plan A → gets 503 → retries
- Attempt 2: Creates plan B → gets 503 → retries
- Attempt 3: Creates plan C → gets 503 → retries
- Attempt 4: Creates plan D → succeeds
**Result: 4 plans created, 1 expected.** This pollutes state and makes follow-up tests (GET, PATCH, DELETE) behave unpredictably.
**Impact:** Can't use resilience testing on stateful routes without idempotency. Most real APIs are stateful.
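The duplication can be reproduced with a small simulation. Everything here is a stand-in written for this report: `createPlanService` imitates a handler whose side effect happens before the response is overridden, `withRetries` imitates the retry loop, and the idempotency key is a hypothetical mitigation, not something Apophis sends today.

```javascript
// Simulated handler: the first `failures` calls report 503 *after* the side
// effect has happened, mirroring the override behavior described above.
function createPlanService(failures) {
  const plans = new Map();
  const seenKeys = new Map();
  let calls = 0;
  let nextId = 1;

  function createPlan(body, idempotencyKey) {
    calls += 1;
    let id = idempotencyKey !== undefined ? seenKeys.get(idempotencyKey) : undefined;
    if (id === undefined) {
      id = `plan_${nextId++}`;
      plans.set(id, body);
      if (idempotencyKey !== undefined) seenKeys.set(idempotencyKey, id);
    }
    return calls <= failures ? { statusCode: 503 } : { statusCode: 201, id };
  }

  return { createPlan, count: () => plans.size };
}

// Retry loop: resend while the response is 503 and retries remain.
function withRetries(send, maxRetries) {
  let res = send();
  for (let i = 0; i < maxRetries && res.statusCode === 503; i += 1) res = send();
  return res;
}

const naive = createPlanService(3);
withRetries(() => naive.createPlan({ name: 'pro' }), 3);
console.log(naive.count()); // 4

const keyed = createPlanService(3);
withRetries(() => keyed.createPlan({ name: 'pro' }, 'req-1'), 3);
console.log(keyed.count()); // 1
```

Reusing one key across all retries of a logical request is what keeps the second run at a single plan.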
### 4. Dropout Returns Status Code 0
Network failures in production don't return status code 0. They:
- Time out (status undefined, error "ETIMEDOUT")
- Reset connection (error "ECONNRESET")
- Return 503 from load balancer
Status 0 is a browser-specific artifact. Node.js HTTP clients don't produce status 0.
**Impact:** No meaningful contract can match status 0. We have to either:
- Add `status:0` to all contracts (meaningless)
- Or ignore dropout failures (makes dropout useless)
---
## What Would Make Chaos Useful for Arbiter
### Option A: Outbound Request Contracts (Preferred)
Apophis intercepts outbound HTTP requests from the handler:
```javascript
// In Apophis config
chaos: {
  outbound: {
    'api.stripe.com': {
      delay: { probability: 0.1, minMs: 1000, maxMs: 5000 },
      error: {
        probability: 0.05,
        responses: [
          { statusCode: 429, headers: { 'retry-after': '60' } },
          { statusCode: 503, body: { error: 'stripe_unavailable' } }
        ]
      }
    }
  }
}
```
**Benefits:**
- Handler sees real dependency failures
- Tests actual error handling logic
- Side effects only occur when handler succeeds
- No state pollution from retries
### Option B: Service Method Wrapping
Apophis wraps methods on decorated services:
```javascript
// Fastify decorator
app.decorate('stripe', new StripeService());
// Apophis wraps it
apophis.chaos.wrap(app.stripe, {
  'paymentIntents.create': {
    delay: { probability: 0.1, ms: 5000 },
    error: { probability: 0.05, throws: new StripeTimeoutError() }
  }
});
```
**Benefits:**
- Works with any service pattern (HTTP, DB, queue)
- Tests business logic directly
- Minimal changes to existing code
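To make the wrapping idea concrete, here is a rough sketch of how such a wrapper might work. `wrapWithChaos` is our own hypothetical helper, not the proposed `apophis.chaos.wrap`; the rule shape follows the config above, and the injectable `rng` parameter is an assumption added so runs can be made deterministic.

```javascript
// Hypothetical wrapper: replaces methods named by dotted paths with versions
// that may reject or delay before delegating to the real method.
function wrapWithChaos(service, rules, rng = Math.random) {
  for (const [path, rule] of Object.entries(rules)) {
    const keys = path.split('.');
    const methodName = keys.pop();
    // Walk to the object that owns the method, e.g. service.paymentIntents.
    const owner = keys.reduce((obj, key) => obj[key], service);
    const original = owner[methodName].bind(owner);

    owner[methodName] = (...args) => {
      if (rule.error && rng() < rule.error.probability) {
        // Fail before the real method runs, so no side effect occurs.
        return Promise.reject(rule.error.throws);
      }
      if (rule.delay && rng() < rule.delay.probability) {
        return new Promise((resolve) => setTimeout(resolve, rule.delay.ms))
          .then(() => original(...args));
      }
      return original(...args);
    };
  }
  return service;
}

let realCalls = 0;
const stripe = {
  paymentIntents: {
    create: async () => { realCalls += 1; return { id: 'pi_1' }; },
  },
};

// rng fixed at 0 so the error rule always fires in this demo.
wrapWithChaos(
  stripe,
  { 'paymentIntents.create': { error: { probability: 1, throws: new Error('stripe_timeout') } } },
  () => 0
);

stripe.paymentIntents.create()
  .catch((err) => console.log(err.message, realCalls)); // stripe_timeout 0
```

Because the failure is raised inside the service call, the handler's own `catch` logic runs, which is the behavior HTTP-layer overrides cannot exercise.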
### Option C: Event-Driven Chaos
For async architectures:
```javascript
chaos: {
  events: {
    'webhook.received': {
      drop: { probability: 0.1 }, // Simulate webhook loss
      delay: { probability: 0.2, ms: 30000 } // Simulate queue delay
    }
  }
}
```
---
## Recommended Priority Order
### P0 (Critical): Fix Event Reporting
Every chaos injection should be visible:
```javascript
// In test results
test.diagnostics.chaos = {
  injected: true,
  type: 'error',
  details: {
    statusCode: 503,
    originalStatusCode: 201,
    strategy: 'override'
  }
}
```
Without this, chaos failures are indistinguishable from real bugs.
### P1 (High): Add Dependency-Aware Chaos
Implement outbound request interception or service wrapping. Current HTTP-layer chaos is too superficial for production APIs.
### P2 (Medium): Fix Dropout Semantics
Return proper status codes:
- `504 Gateway Timeout` for timeouts
- `503 Service Unavailable` for network failures
- Or make it configurable: `dropout: { statusCode: 503 }`
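A possible mapping for these semantics is sketched below. `dropoutResponse` and `DROPOUT_DEFAULTS` are hypothetical names; `ETIMEDOUT` and `ECONNRESET` are the Node.js error codes mentioned earlier, and the default body shape is an assumption.

```javascript
// Hypothetical mapping from a simulated network failure to the response a
// contract would see. A `statusCode` in the config overrides the defaults.
const DROPOUT_DEFAULTS = { ETIMEDOUT: 504, ECONNRESET: 503 };

function dropoutResponse(errorCode, config = {}) {
  const statusCode = config.statusCode ?? DROPOUT_DEFAULTS[errorCode] ?? 503;
  return {
    statusCode,
    body: config.body ?? { error: 'network_failure', code: errorCode },
  };
}

console.log(dropoutResponse('ETIMEDOUT').statusCode);                      // 504
console.log(dropoutResponse('ECONNRESET').statusCode);                     // 503
console.log(dropoutResponse('ETIMEDOUT', { statusCode: 503 }).statusCode); // 503
```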
### P3 (Low): Stateful Retry Safety
Either:
- Make retries use unique IDs (prevent duplicate creation)
- Or document that resilience requires idempotent handlers
- Or skip resilience for non-idempotent routes
---
## What We're Doing Instead
Since current chaos doesn't serve our needs, we're writing application-layer failure tests:
```javascript
// `assert` is node:assert; `test`, `app`, and `payInvoice` come from our test setup (not shown).
test('Stripe rate limit handling', async () => {
  // Mock Stripe to return 429
  app.stripe.paymentIntents.create = async () => {
    const err = new Error('Rate limit exceeded');
    err.statusCode = 429;
    err.headers = { 'retry-after': '60' };
    throw err;
  };
  const res = await payInvoice({ invoiceId: 'test' });
  assert.strictEqual(res.statusCode, 429);
  assert.strictEqual(res.json().data.error, 'stripe_rate_limit');
  assert.strictEqual(res.headers['retry-after'], '60');
});
```
This tests what we actually need: **handler behavior when dependencies fail.**
---
## Conclusion
Apophis chaos is a good start for HTTP-layer resilience testing, but it's insufficient for production APIs with external dependencies. The framework needs to evolve from "HTTP response mutator" to "dependency failure simulator" to be truly valuable.
We want Apophis to succeed. The schema-driven contract approach is innovative and valuable. But chaos testing needs to be dependency-aware to be useful for real-world APIs.
**Happy to collaborate** on designing the outbound interception API or service wrapping approach.
---
# Appendix: Concrete Proposals for Apophis Improvements
## Proposal 1: Conditional Schema Inference
Instead of removing `const` from schemas, generate conditional contracts:
```typescript
// Current behavior (WRONG):
// Schema: { properties: { fragment_type: { const: "Action" } } }
// Generates: response_body(this).fragment_type == "Action" // Applies to ALL responses
// Proposed behavior:
// Generates: if status:200 then response_body(this).fragment_type == "Action" else true
```
Implementation:
```typescript
function inferContractsFromResponseSchema(responseSchema, statusCode) {
  const formulas = [];
  // ... existing inference logic ...
  // Wrap in conditional if status code is 2xx
  if (statusCode >= 200 && statusCode < 300) {
    return formulas.map(f => `if status:${statusCode} then ${f} else true`);
  }
  return formulas;
}
```
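A runnable variant of this idea, restricted to top-level `const` properties, might look like the following. `inferConstContracts` is a hypothetical name, and unlike the snippet above this sketch guards every documented status code (not only 2xx), which is our assumption about the desired behavior.

```javascript
// Hypothetical, simplified inference: only top-level `const` properties are handled.
function inferConstContracts(responseSchemas) {
  const formulas = [];
  for (const [statusCode, schema] of Object.entries(responseSchemas)) {
    for (const [field, spec] of Object.entries(schema.properties ?? {})) {
      if (!('const' in spec)) continue;
      const check = `response_body(this).${field} == ${JSON.stringify(spec.const)}`;
      // Guard each formula with the status code whose schema produced it.
      formulas.push(`if status:${statusCode} then ${check} else true`);
    }
  }
  return formulas;
}

const formulas = inferConstContracts({
  200: { properties: { fragment_type: { const: 'Action' } } },
  404: { properties: { fragment_type: { const: 'Error' } } },
});

console.log(formulas.join('\n'));
// if status:200 then response_body(this).fragment_type == "Action" else true
// if status:404 then response_body(this).fragment_type == "Error" else true
```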
## Proposal 2: Configurable Invariants
```typescript
// In test config
const result = await app.apophis.contract({
  invariants: ['resource-integrity'], // Opt-in specific invariants
  // Or: invariants: false,           // Disable all
});
// Or per-route in schema
schema: {
  'x-invariants': ['resource-integrity'],
  'x-invariants-exclude': ['no-orphaned-resources']
}
```
## Proposal 3: Outbound Request Interception
```typescript
// Apophis provides fetch/http client wrapper
const stripeClient = apophis.createChaosAwareClient({
  name: 'stripe',
  baseURL: 'https://api.stripe.com',
  defaults: {
    headers: { 'Authorization': `Bearer ${process.env.STRIPE_KEY}` }
  }
});
// In chaos config
chaos: {
  outbound: {
    'stripe': {
      delay: { probability: 0.1, minMs: 1000, maxMs: 5000 },
      error: {
        probability: 0.05,
        responses: [
          { statusCode: 429, headers: { 'retry-after': '60' } },
          { statusCode: 503, body: { error: 'stripe_unavailable' } }
        ]
      }
    }
  }
}
```
Implementation approach:
- Monkey-patch `fetch` or `http.request` at module level
- Track outbound requests by hostname
- Match against chaos config
- Inject delays/errors before request reaches network
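The matching and injection steps above can be sketched as follows. This is an illustration under our own assumptions: `decideOutboundChaos` and `createChaosFetch` are hypothetical names, rules are keyed by hostname, and the sketch wraps a fetch-compatible function passed in by the caller in place of patching a global. It relies on the `Response` global available in Node 18+.

```javascript
// Hypothetical decision step: given an outbound URL, decide what to inject.
// `rng` is injectable so a seeded generator can make runs reproducible.
function decideOutboundChaos(url, outboundConfig, rng = Math.random) {
  const rule = outboundConfig[new URL(url).hostname];
  if (!rule) return { kind: 'pass' };

  if (rule.error && rng() < rule.error.probability) {
    const { responses } = rule.error;
    return { kind: 'error', response: responses[Math.floor(rng() * responses.length)] };
  }
  if (rule.delay && rng() < rule.delay.probability) {
    const { minMs, maxMs } = rule.delay;
    return { kind: 'delay', ms: Math.round(minMs + rng() * (maxMs - minMs)) };
  }
  return { kind: 'pass' };
}

// Hypothetical injection step: wrap a fetch-compatible function.
function createChaosFetch(baseFetch, outboundConfig, rng = Math.random) {
  return async (url, init) => {
    const decision = decideOutboundChaos(String(url), outboundConfig, rng);
    if (decision.kind === 'error') {
      // Answer locally; the request never reaches the network.
      const { statusCode, headers = {}, body = null } = decision.response;
      return new Response(body === null ? null : JSON.stringify(body), { status: statusCode, headers });
    }
    if (decision.kind === 'delay') {
      await new Promise((resolve) => setTimeout(resolve, decision.ms));
    }
    return baseFetch(url, init);
  };
}

const config = {
  'api.stripe.com': {
    error: { probability: 0.05, responses: [{ statusCode: 429, headers: { 'retry-after': '60' } }] },
  },
};

const hit = decideOutboundChaos('https://api.stripe.com/v1/payment_intents', config, () => 0);
console.log(hit.kind, hit.response.statusCode); // error 429
console.log(decideOutboundChaos('https://example.com/', config, () => 0).kind); // pass
```

Separating the decision from the injection keeps the probabilistic logic testable without any network or timers.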
## Proposal 4: Service Method Wrapping
```typescript
// After Fastify ready
app.addHook('onReady', () => {
  apophis.chaos.wrap(app.billingService, {
    'createPricingPlan': {
      delay: { probability: 0.1, ms: 100 },
      error: {
        probability: 0.05,
        throws: new ServiceUnavailableError('stripe_timeout')
      }
    }
  });
});
```
## Proposal 5: Chaos Event Reporting
```typescript
// In petit-runner, after chaos execution
const chaosEvents = result.events || [];
for (const event of chaosEvents) {
  results.push({
    ok: true, // Chaos events are informational, not failures
    name: `${route.method} ${route.path} (chaos: ${event.type})`,
    diagnostics: {
      chaos: {
        injected: true,
        type: event.type,
        details: event.details
      }
    }
  });
}
```
## Proposal 6: Dropout Semantics
```typescript
// Configurable dropout behavior
chaos: {
  dropout: {
    probability: 0.1,
    statusCode: 503, // Default: 503 instead of 0
    body: { error: 'network_failure' }
  }
}
```
## Proposal 7: Hypermedia Contract Support
```typescript
// New APOSTL operation headers
response_body(this).controls.self == request_url(this)
response_body(this).actions.update.method == "PATCH"
response_body(this).actions.update.target == "/billing/plans/{response_body(this).data.id}"
```
Or schema annotation:
```json
{
  "x-apophis-hypermedia": {
    "controls": ["self", "next", "prev"],
    "actions": ["create", "update", "delete"]
  }
}
```