Architecture Overview¶
Errorworks is a composable chaos-testing service framework. Each server type (LLM, Web) is built from shared engine components rather than inheriting from a base class. This document explains the design rationale, key components, and extension points.
Composition Over Inheritance¶
The central design principle is that HTTP concerns stay out of domain logic. Each chaos plugin (ChaosLLM, ChaosWeb) creates instances of shared engine utilities and delegates specific responsibilities to them:
- InjectionEngine handles burst state and error selection algorithms
- MetricsStore handles SQLite persistence and timeseries aggregation
- LatencySimulator handles delay calculation
- ConfigLoader handles YAML loading and config precedence
The server classes (ChaosLLMServer, ChaosWebServer) own the HTTP routing, request parsing, and response formatting. They compose engine components but never extend them. This means a new server type (e.g., email, gRPC) can reuse the same engine components without inheriting HTTP-specific behavior it does not need.
Package Structure¶
src/errorworks/
├── engine/                      # Shared core utilities
│   ├── types.py                 # ServerConfig, MetricsConfig, LatencyConfig,
│   │                            # BurstConfig, ErrorSpec, SelectionMode,
│   │                            # MetricsSchema, ColumnDef
│   ├── injection_engine.py      # Burst state machine + selection algorithms
│   ├── metrics_store.py         # Thread-safe SQLite with schema-driven DDL
│   ├── latency.py               # Latency simulation (base +/- jitter)
│   ├── config_loader.py         # YAML preset loading + deep merge
│   ├── admin.py                 # Shared admin endpoint handlers
│   ├── validators.py            # Shared Pydantic validators (range parsing)
│   └── cli.py                   # Unified chaosengine CLI
│
├── llm/                         # ChaosLLM: Fake OpenAI-compatible server
│   ├── config.py                # ChaosLLMConfig, ErrorInjectionConfig, ResponseConfig
│   ├── server.py                # ChaosLLMServer (Starlette ASGI app)
│   ├── error_injector.py        # LLM-specific error decision logic
│   ├── response_generator.py    # OpenAI-format response generation
│   ├── metrics.py               # LLM-specific MetricsRecorder wrapper
│   ├── cli.py                   # chaosllm CLI
│   └── presets/                 # YAML preset files
│
├── web/                         # ChaosWeb: Fake web server for scraping tests
│   ├── config.py                # ChaosWebConfig, WebErrorInjectionConfig, WebContentConfig
│   ├── server.py                # ChaosWebServer (Starlette ASGI app)
│   ├── error_injector.py        # Web-specific error decision logic
│   ├── content_generator.py     # HTML content generation + corruption functions
│   ├── metrics.py               # Web-specific MetricsRecorder wrapper
│   ├── cli.py                   # chaosweb CLI
│   └── presets/                 # YAML preset files
│
├── llm_mcp/                     # MCP server for ChaosLLM metrics analysis
│   └── server.py                # Claude-optimized metrics tools via MCP protocol
│
└── testing/                     # Pytest fixture support
    └── ...                      # In-process test fixtures using Starlette TestClient
Key Engine Components¶
InjectionEngine¶
File: engine/injection_engine.py
The InjectionEngine is the decision-making core for error injection. It handles two concerns:
- Burst state machine -- Periodic burst windows where error rates are elevated. Bursts occur every interval_sec seconds and last for duration_sec seconds. The state is computed from elapsed time using modular arithmetic (elapsed % interval < duration), making it stateless beyond the start timestamp.
- Error selection -- Two algorithms:
  - Priority mode: Specs are evaluated in order. The first one that triggers (based on a random roll against its weight) wins. This gives deterministic precedence to high-priority errors.
  - Weighted mode: A single error is selected proportionally from all active specs. Success probability is implicitly max(0, 100 - total_weight).
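The burst check and the two selection modes can be sketched in a few lines. This is a minimal illustration rather than the actual engine code: the ErrorSpec class here is a local stand-in with the (tag, weight) shape described in this section, the function names are hypothetical, and weights are assumed to be percentages in the range 0 to 100.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ErrorSpec:
    """Local stand-in: an opaque tag plus a weight (assumed to be a percentage)."""
    tag: str
    weight: float


def in_burst(elapsed: float, interval_sec: float, duration_sec: float) -> bool:
    """True while elapsed time falls inside a periodic burst window."""
    return (elapsed % interval_sec) < duration_sec


def select_priority(specs: Sequence[ErrorSpec], rng: random.Random) -> Optional[str]:
    """Evaluate specs in order; the first one whose roll succeeds wins."""
    for spec in specs:
        if rng.random() * 100.0 < spec.weight:
            return spec.tag
    return None  # no spec triggered, so the request succeeds


def select_weighted(specs: Sequence[ErrorSpec], rng: random.Random) -> Optional[str]:
    """Pick at most one spec, proportionally to its weight."""
    total = sum(spec.weight for spec in specs)
    # The leftover share max(0, 100 - total) is the implicit success case.
    roll = rng.random() * max(100.0, total)
    cumulative = 0.0
    for spec in specs:
        cumulative += spec.weight
        if roll < cumulative:
            return spec.tag
    return None  # the roll landed in the success share
```

Passing a seeded random.Random makes both selectors reproducible, which mirrors the injectable rng the engine exposes for tests.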
The engine is deliberately domain-agnostic. It works with ErrorSpec(tag, weight) objects where tag is an opaque string. The calling plugin builds the spec list (with domain-specific tags like "rate_limit" or "ssrf_redirect") and interprets the selected tag to produce a response.
Thread safety: The burst start time is protected by a lock. The RNG is not thread-safe, but this is handled by the config snapshot pattern (each request snapshots the engine reference, so concurrent requests use different engine instances after a config update).
Testability: Both time_func and rng are injectable. Tests pass time.monotonic replacements and seeded random.Random instances for deterministic behavior.
MetricsStore¶
File: engine/metrics_store.py
Thread-safe SQLite storage with several notable design choices:
- Thread-local connections: Each thread gets its own sqlite3.Connection via threading.local(). This avoids SQLite's thread-safety limitations while allowing concurrent access from uvicorn workers.
- WAL mode for file databases: File-backed databases use Write-Ahead Logging (PRAGMA journal_mode=WAL) with PRAGMA synchronous=NORMAL for better concurrent read/write performance. In-memory databases use PRAGMA journal_mode=MEMORY with PRAGMA synchronous=OFF for maximum speed.
- Schema-driven DDL: Table structures are defined declaratively via MetricsSchema dataclasses containing ColumnDef tuples. The store generates CREATE TABLE IF NOT EXISTS statements from the schema at initialization. This means each plugin defines its own schema (LLM requests have model and deployment columns; Web requests have path and redirect_hops columns) without modifying the store.
- Timeseries UPSERT: The update_timeseries() method uses SQLite's INSERT ... ON CONFLICT(bucket_utc) DO UPDATE SET to atomically increment counters per time bucket. Latency statistics (avg, p99) are computed via SQL aggregation rather than loading all values into Python.
- Stale connection cleanup: When a new connection is created, connections from dead threads are detected and closed. Thread ID reuse is an acknowledged edge case that is acceptable for a testing tool.
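The following sketch combines three of these ideas (thread-local connections, schema-driven DDL, and the timeseries UPSERT) in a deliberately small class. It is an assumption-laden illustration, not the real MetricsStore: the class name, the plain (name, type) column tuples, and the requests_total column are hypothetical, and stale connection cleanup is omitted.

```python
import sqlite3
import threading
from typing import Sequence, Tuple


class TinyMetricsStore:
    """Hypothetical, trimmed-down stand-in for the store described above."""

    def __init__(self, database: str, columns: Sequence[Tuple[str, str]]) -> None:
        self._database = database
        self._local = threading.local()
        # Schema-driven DDL: the column list is supplied by the caller.
        cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
        conn = self._conn()
        conn.execute(f"CREATE TABLE IF NOT EXISTS timeseries ({cols})")
        conn.commit()

    def _conn(self) -> sqlite3.Connection:
        """Return this thread's connection, creating it on first use."""
        conn = getattr(self._local, "conn", None)
        if conn is None:
            conn = sqlite3.connect(self._database)
            # Settings the text describes for file-backed databases.
            conn.execute("PRAGMA journal_mode=WAL")
            conn.execute("PRAGMA synchronous=NORMAL")
            self._local.conn = conn
        return conn

    def update_timeseries(self, bucket_utc: str) -> None:
        """Insert the bucket or increment its counter in one statement."""
        conn = self._conn()
        conn.execute(
            "INSERT INTO timeseries (bucket_utc, requests_total) VALUES (?, 1) "
            "ON CONFLICT(bucket_utc) DO UPDATE SET "
            "requests_total = requests_total + 1",
            (bucket_utc,),
        )
        conn.commit()
```

The UPSERT relies on bucket_utc carrying a PRIMARY KEY or UNIQUE constraint, so the caller's schema has to declare one.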
LatencySimulator¶
File: engine/latency.py
Adds artificial delays to simulate real service latency. Each delay is a base value plus or minus random jitter (base +/- jitter), and the result is clamped to 0 so that it is never negative. The simulator also provides simulate_slow_response(min_sec, max_sec) for slow-response error injection, where delays are specified as ranges in seconds.
Like the InjectionEngine, the RNG is injectable for deterministic testing.
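A minimal sketch of the base-plus-jitter calculation follows. The function name, the millisecond units for the inputs, and the uniform jitter distribution are all assumptions made for illustration; only the clamping to zero and the injectable RNG come from the description above.

```python
import random


def simulated_delay_sec(base_ms: float, jitter_ms: float, rng: random.Random) -> float:
    """Base delay plus or minus uniform jitter, clamped at zero, in seconds."""
    # Assumption: jitter is drawn uniformly from [-jitter_ms, +jitter_ms].
    delay_ms = base_ms + rng.uniform(-jitter_ms, jitter_ms)
    return max(0.0, delay_ms) / 1000.0
```

The clamp matters when the jitter exceeds the base: with a 10 ms base and 50 ms of jitter, roughly four draws in ten would otherwise be negative.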
ConfigLoader¶
File: engine/config_loader.py
Handles configuration loading with a four-layer precedence model:
1. CLI flags (highest) -- Only explicitly provided values; None values are excluded so they do not override lower layers.
2. Config file -- YAML file specified by --config.
3. Preset -- Named YAML file from the plugin's presets/ directory.
4. Built-in defaults (lowest) -- Pydantic field defaults.
The deep_merge(base, override) function recursively merges dicts so that nested updates (e.g., changing only burst.enabled within error_injection) preserve sibling fields rather than resetting them to defaults. The function returns a new dict and never mutates its inputs.
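A recursive merge with these properties can be sketched as follows. This is one plausible implementation under stated assumptions, not necessarily the one in config_loader.py; the use of copy.deepcopy to keep inputs untouched and the layer_configs helper are illustrative choices.

```python
import copy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where override wins, merging nested dicts key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def layer_configs(preset: Dict[str, Any], config_file: Dict[str, Any],
                  cli_flags: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: apply the layers from lowest to highest precedence."""
    explicit_flags = {k: v for k, v in cli_flags.items() if v is not None}
    return deep_merge(deep_merge(preset, config_file), explicit_flags)
```

Built-in defaults do not appear as a dict layer here because, per the precedence list, they come from the model's field defaults when the merged dict is validated.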
Preset safety: Preset names are validated against ^[a-zA-Z0-9][a-zA-Z0-9_-]*$ to prevent path traversal attacks.
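The check itself is small. The sketch below uses the pattern quoted above; the function name is hypothetical, and fullmatch is chosen here (as an assumption about good practice, not a statement about the actual code) because a bare `$` with re.match would also accept a name ending in a newline.

```python
import re

# Pattern taken from the text: must start with an alphanumeric character,
# followed by alphanumerics, underscores, or hyphens only.
PRESET_NAME = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9_-]*$")


def is_safe_preset_name(name: str) -> bool:
    """Reject anything that could escape the presets/ directory."""
    return PRESET_NAME.fullmatch(name) is not None
```

Because neither dots nor slashes are allowed, names such as ../secrets or a/b cannot resolve to a path outside presets/.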
Config Snapshot Pattern¶
Request handlers in both ChaosLLMServer and ChaosWebServer snapshot component references at the start of each request:
    with self._config_lock:
        error_injector = self._error_injector
        response_generator = self._response_generator
        latency_simulator = self._latency_simulator
This snapshot is taken under _config_lock and produces local references that the handler uses for the remainder of the request. If a concurrent update_config() call swaps in new components while the request is in progress, the request continues using the old components, guaranteeing a consistent configuration view throughout its lifetime.
This pattern is critical because the alternative -- reading self._error_injector at error check time and self._response_generator later at response time -- could produce a half-updated view where the error rates come from the new config but the response settings come from the old one.
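The effect of the snapshot can be demonstrated with a toy server whose components are plain strings. Everything here is hypothetical scaffolding (the class, the swap method, and the mid_request hook that simulates an update arriving while a request is in flight); only the lock-then-copy shape comes from the handlers described above.

```python
import threading
from typing import Callable, Optional, Tuple


class SnapshotServer:
    """Toy server: components are strings so the versions are easy to see."""

    def __init__(self, error_injector: str, response_generator: str) -> None:
        self._config_lock = threading.Lock()
        self._error_injector = error_injector
        self._response_generator = response_generator

    def swap(self, error_injector: str, response_generator: str) -> None:
        """Replace both components together under the lock."""
        with self._config_lock:
            self._error_injector = error_injector
            self._response_generator = response_generator

    def handle(self, mid_request: Optional[Callable[[], None]] = None) -> Tuple[str, str]:
        # Snapshot both references while holding the lock.
        with self._config_lock:
            error_injector = self._error_injector
            response_generator = self._response_generator
        if mid_request is not None:
            mid_request()  # simulates a config update landing mid-request
        # The handler keeps using the references it captured at the start.
        return error_injector, response_generator
```

A request that starts before the swap reports the old pair, and the next request reports the new pair; no request ever sees one component from each version.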
Immutable Config Update Flow¶
All Pydantic config models use frozen=True and extra="forbid". This means fields cannot be mutated after construction and unknown fields cause validation errors.
Runtime configuration updates through POST /admin/config follow this sequence:
1. Receive the partial update dict from the HTTP request body.
2. Deep-merge the update with the current config (preserving unspecified nested fields).
3. Construct new Pydantic model instances from the merged dict (validation happens here).
4. Create new component instances (e.g., new ErrorInjector, new ResponseGenerator) from the new config. This happens outside the lock because construction and validation may be expensive.
5. Swap the new components atomically under _config_lock.
If validation fails at step 3, no changes are applied and a 422 error is returned. If construction succeeds, the swap in step 5 is an atomic pointer replacement -- there is no intermediate state where some components are updated and others are not.
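The validate-then-swap sequence can be sketched with the standard library alone. To keep the example dependency-free, a frozen dataclass stands in for the frozen Pydantic models: its constructor rejects unknown fields with TypeError, which approximates extra="forbid". The class names and the error_pct field are hypothetical, and a flat dict merge replaces the deep merge for brevity.

```python
import threading
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class DemoConfig:
    """Stand-in for a frozen config model; error_pct is a hypothetical field."""
    error_pct: float = 0.0

    def __post_init__(self) -> None:
        if not 0.0 <= self.error_pct <= 100.0:
            raise ValueError("error_pct must be within [0, 100]")


class DemoInjector:
    def __init__(self, config: DemoConfig) -> None:
        self.config = config


class DemoServer:
    def __init__(self) -> None:
        self._config_lock = threading.Lock()
        self._config = DemoConfig()
        self._error_injector = DemoInjector(self._config)

    def update_config(self, updates: Dict[str, Any]) -> None:
        with self._config_lock:
            current = asdict(self._config)
        merged = {**current, **updates}          # step 2 (flat merge here)
        new_config = DemoConfig(**merged)        # step 3: raises if invalid
        new_injector = DemoInjector(new_config)  # step 4: built outside the lock
        with self._config_lock:                  # step 5: swap together
            self._config = new_config
            self._error_injector = new_injector
```

If the constructor raises, the method exits before the second lock block, so the server keeps its previous config and components; an HTTP layer would translate that exception into the 422 response.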
Thread Safety Model¶
Errorworks is designed for multi-worker uvicorn deployments. The thread safety strategy has several layers:
- _config_lock (per-server instance): Protects reads and writes of component references (_error_injector, _response_generator, etc.). The lock is held briefly for pointer reads (snapshot) and pointer swaps (update), never for request processing.
- InjectionEngine lock: Protects the burst start timestamp. Held only for the time calculation.
- MetricsStore thread-local connections: Each thread gets its own SQLite connection, avoiding cross-thread connection sharing entirely.
- Immutable config models: Frozen Pydantic models cannot be accidentally mutated by concurrent readers.
- Best-effort metrics recording: Metrics writes that fail (SQLite errors) are logged but never propagated to the caller. A metrics side-effect must not replace an intended chaos response with an unintended real 500 error.
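The best-effort rule in the last item amounts to a narrow try/except around the write. The sketch below is illustrative only: the helper name, the logger name, and the boolean return value are assumptions, not part of the documented recorder API.

```python
import logging
import sqlite3
from typing import Any, Callable

logger = logging.getLogger("chaos.metrics")  # hypothetical logger name


def record_best_effort(record: Callable[..., None], **fields: Any) -> bool:
    """Call a metrics writer; log SQLite failures instead of raising them."""
    try:
        record(**fields)
    except sqlite3.Error:
        # The chaos response already chosen for this request must still be sent.
        logger.exception("metrics write failed; continuing with the response")
        return False
    return True
```

Only sqlite3.Error is caught, so programming errors such as a wrong keyword argument still surface during development.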
Adding a New Server Type¶
To add a new chaos server type (e.g., email, gRPC, GraphQL), follow this pattern:
1. Create a new package under src/errorworks/ (e.g., src/errorworks/email/).
2. Define config models in config.py:
   - Create an error injection config with domain-specific _pct fields
   - Create a content/response config appropriate to the protocol
   - Create a top-level config composing ServerConfig, MetricsConfig, LatencyConfig, and your domain configs
   - Wire up load_config() using the shared config_loader.load_config() generic function
3. Define a metrics schema using MetricsSchema and ColumnDef with domain-specific columns for the requests and timeseries tables.
4. Create an error injector that:
   - Composes an InjectionEngine instance
   - Builds ErrorSpec lists from your config (with burst-aware adjustments)
   - Calls engine.select(specs) and maps the selected tag to a domain-specific decision dataclass
5. Create a server class that:
   - Composes all components (error injector, content generator, latency simulator, metrics recorder)
   - Implements the ChaosServer protocol from engine/admin.py (get_admin_token, get_current_config, update_config, reset, export_metrics, get_stats)
   - Uses the config snapshot pattern in request handlers
   - Uses the immutable config update flow in update_config()
6. Register routes including /health, /admin/* (delegating to engine.admin handlers), and your domain-specific endpoints.
7. Add a CLI using Typer, with a serve command and a presets command. Register it as a console script in pyproject.toml and add it as a subcommand to chaosengine.
8. Add presets as YAML files in a presets/ directory within your package.
The shared engine layer handles all the infrastructure: burst timing, selection algorithms, SQLite management, config loading, admin authentication, and deep merge. Your plugin only needs to define what errors look like in your domain and how to render responses.
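As a rough picture of the admin surface a new server class exposes, the sketch below declares a structural protocol with the six method names listed in the steps above. The method names come from this document; the parameter and return types, the protocol name ChaosServerLike, and the ChaosEmailServer class with its bounce_pct field are all assumptions for illustration and may differ from engine/admin.py.

```python
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class ChaosServerLike(Protocol):
    """Method names from the steps above; every signature is assumed."""

    def get_admin_token(self) -> str: ...
    def get_current_config(self) -> Dict[str, Any]: ...
    def update_config(self, updates: Dict[str, Any]) -> None: ...
    def reset(self) -> None: ...
    def export_metrics(self) -> Dict[str, Any]: ...
    def get_stats(self) -> Dict[str, Any]: ...


class ChaosEmailServer:
    """Hypothetical new server type; only the admin surface is shown."""

    def __init__(self, admin_token: str) -> None:
        self._admin_token = admin_token
        self._config: Dict[str, Any] = {"bounce_pct": 0.0}  # hypothetical field
        self._requests = 0

    def get_admin_token(self) -> str:
        return self._admin_token

    def get_current_config(self) -> Dict[str, Any]:
        return dict(self._config)

    def update_config(self, updates: Dict[str, Any]) -> None:
        # A real server would validate and swap components here.
        self._config = {**self._config, **updates}

    def reset(self) -> None:
        self._requests = 0

    def export_metrics(self) -> Dict[str, Any]:
        return {"requests": self._requests}

    def get_stats(self) -> Dict[str, Any]:
        return {"requests_total": self._requests}
```

Because the protocol is structural, the server class satisfies it without inheriting from anything, which is consistent with the composition-over-inheritance approach used throughout the package.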