ChaosWeb Guide

ChaosWeb is a fake web server that injects configurable faults into HTTP responses for testing web scraping pipeline resilience. It serves HTML pages on any URL path and randomly injects errors -- anti-scraping blocks, broken encoding, SSRF redirects, and more.

Point your scraper at ChaosWeb to verify that it handles these real-world failure modes before it scrapes production sites.

Quick Start

# Start with a realistic error profile
uv run chaosweb serve --preset=realistic

# Your scraper fetches from localhost:8200 instead of the real site
curl http://127.0.0.1:8200/articles/some-page

Endpoints

Content Serving

Endpoint	Method	Description
`/{any-path}`	GET	Catch-all route -- serves HTML with error injection
`/redirect`	GET	Redirect loop handler (hop counter management)

Any GET request to any path returns either a successful HTML page or an injected error. The path is available to templates and echo mode for generating path-specific content.

Health and Admin

Endpoint	Method	Auth	Description
`/health`	GET	None	Server health check (includes `run_id`, `started_utc`, `in_burst`)
`/admin/config`	GET	Bearer token	View current configuration
`/admin/config`	POST	Bearer token	Update configuration at runtime
`/admin/stats`	GET	Bearer token	Request statistics summary
`/admin/export`	GET	Bearer token	Export raw metrics data
`/admin/reset`	POST	Bearer token	Reset metrics and start new run

Admin endpoints require an `Authorization: Bearer <token>` header.
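
As a minimal sketch, an admin request could be built with the Python standard library as shown here. The base URL reuses port 8200 from the Quick Start, and `my-admin-token` is a placeholder. How the token is configured on the server is not covered in this guide.

```python
import urllib.request

BASE_URL = "http://127.0.0.1:8200"  # port taken from the Quick Start


def admin_request(path: str, token: str, method: str = "GET") -> urllib.request.Request:
    """Build (but do not send) a request that carries the bearer token."""
    return urllib.request.Request(
        BASE_URL + path,
        method=method,
        headers={"Authorization": f"Bearer {token}"},
    )


# "my-admin-token" is a placeholder value, not a real default.
req = admin_request("/admin/stats", token="my-admin-token")
# urllib.request.urlopen(req) would send it to a running server.
```

The same helper covers the POST endpoints, for example `admin_request("/admin/reset", token, method="POST")`.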

Error Injection

ChaosWeb injects four categories of errors (HTTP errors, connection failures, content malformations, and redirects), each controlled by percentage fields (0-100).

HTTP Errors

Standard HTTP error responses with HTML error pages:

Error Type	Status Code	Config Field	Description
Rate Limit	429	`rate_limit_pct`	Anti-scraping throttle, includes `Retry-After`
Forbidden	403	`forbidden_pct`	Bot detection block
Not Found	404	`not_found_pct`	Deleted or missing page
Gone	410	`gone_pct`	Permanently removed resource
Payment Required	402	`payment_required_pct`	Paywall / quota exceeded
Unavailable for Legal Reasons	451	`unavailable_for_legal_pct`	Geo-blocking
Service Unavailable	503	`service_unavailable_pct`	Maintenance page
Bad Gateway	502	`bad_gateway_pct`	Upstream failure
Gateway Timeout	504	`gateway_timeout_pct`	Upstream timeout
Internal Server Error	500	`internal_error_pct`	Server-side failure
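
One way a scraper might react to the status codes in this table is sketched below. The split into retryable and permanent codes is one possible policy, not something ChaosWeb prescribes, and `retry_delay` only understands the delta-seconds form of `Retry-After` (an HTTP-date value falls back to exponential backoff).

```python
from typing import Optional

RETRYABLE = {429, 500, 502, 503, 504}  # transient: back off and try again
PERMANENT = {402, 403, 404, 410, 451}  # repeating the same request will not help


def classify(status: int) -> str:
    """Map a status code to a coarse scraper action."""
    if 200 <= status < 300:
        return "ok"
    if status in RETRYABLE:
        return "retry"
    if status in PERMANENT:
        return "skip"
    return "unknown"


def retry_delay(retry_after: Optional[str], attempt: int, cap: float = 60.0) -> float:
    """Seconds to wait before the next attempt (attempt counts from 0)."""
    if retry_after is not None and retry_after.strip().isdigit():
        return min(cap, float(retry_after))  # honor the server's hint
    return min(cap, 2.0 ** attempt)          # capped exponential backoff
```

For a 429 response, pass the `Retry-After` header value if present; for the 5xx codes, pass `None` and rely on the backoff.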

Connection Failures

Network-level problems your scraper must handle:

Error Type	Config Field	Behavior
Timeout	`timeout_pct`	Hangs for a duration within the `timeout_sec` range, then returns 504
Connection Reset	`connection_reset_pct`	Sends headers then drops the connection
Connection Stall	`connection_stall_pct`	Delays, stalls, then disconnects
Slow Response	`slow_response_pct`	Delays for `slow_response_sec`, then returns successful HTML
Incomplete Response	`incomplete_response_pct`	Sends partial HTML body then disconnects
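
On the client side, these failures usually surface as exceptions rather than status codes. The sketch below maps standard-library exceptions to coarse categories. The category names are illustrative, and which exception a given injected fault produces depends on your HTTP client and timeout settings.

```python
import http.client
import socket
import urllib.error


def classify_failure(exc: BaseException) -> str:
    """Map a network exception to a coarse failure category."""
    # urllib wraps low-level errors in URLError; unwrap to inspect the cause.
    if isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, BaseException):
        exc = exc.reason
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "timeout"
    if isinstance(exc, http.client.IncompleteRead):
        return "incomplete"
    if isinstance(exc, (ConnectionResetError, http.client.RemoteDisconnected)):
        return "reset"
    return "other"
```

A scraper would typically call this from an `except` block around the fetch and retry on `timeout` and `reset`, while treating `incomplete` as a signal to discard the partial body.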

Warning

Incomplete responses are particularly tricky -- your scraper receives a 200 status code and partial HTML, then the connection drops. Always validate that your parsed HTML is structurally complete.
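
A rough completeness check can be sketched with the standard-library HTML parser. This is a heuristic under the assumption that the pages you scrape always close `<html>` and `<body>` explicitly. HTML permits omitting those end tags, so adjust `required` to elements your target pages reliably contain.

```python
from html.parser import HTMLParser


class _TagTracker(HTMLParser):
    """Records which tag names were opened and which were closed."""

    def __init__(self) -> None:
        super().__init__()
        self.opened = set()
        self.closed = set()

    def handle_starttag(self, tag, attrs):
        self.opened.add(tag)

    def handle_endtag(self, tag):
        self.closed.add(tag)


def looks_complete(html: str, required=("html", "body")) -> bool:
    """True if every required element has both a start tag and an end tag."""
    tracker = _TagTracker()
    tracker.feed(html)
    tracker.close()
    return all(t in tracker.opened and t in tracker.closed for t in required)
```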

Content Malformations

HTTP 200 responses with corrupted content -- the subtlest failures:

Error Type	Config Field	What Goes Wrong
Wrong Content-Type	`wrong_content_type_pct`	Declares `application/pdf` or similar instead of `text/html`
Encoding Mismatch	`encoding_mismatch_pct`	Header says UTF-8, body is ISO-8859-1
Truncated HTML	`truncated_html_pct`	HTML cut off mid-tag
Invalid Encoding	`invalid_encoding_pct`	Non-decodable bytes in the declared encoding
Charset Confusion	`charset_confusion_pct`	HTTP header says one charset, `<meta>` tag says another
Malformed Meta	`malformed_meta_pct`	Invalid `<meta http-equiv="refresh">` directives
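
Several of these malformations can be detected by comparing what the response declares with what the body contains. The helpers below are a sketch: the regular expression is a simplification of real charset sniffing, and the function names are illustrative.

```python
import re
from typing import Optional

# Matches both <meta charset="..."> and the http-equiv content="...; charset=..." form.
_META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def header_charset(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    for part in content_type.split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key.lower() == "charset":
            return value.strip('"').lower()
    return None


def meta_charset(body: bytes) -> Optional[str]:
    """Find the charset declared in a <meta> tag near the top of the body."""
    match = _META_CHARSET.search(body[:2048])
    return match.group(1).decode("ascii").lower() if match else None


def decodes_cleanly(body: bytes, charset: str) -> bool:
    """True if the body is valid in the given charset."""
    try:
        body.decode(charset)
        return True
    except (UnicodeDecodeError, LookupError):
        return False
```

If `header_charset` and `meta_charset` disagree, or `decodes_cleanly` fails for the declared charset, the scraper can flag the page instead of silently storing mojibake.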

Redirect Injection

Tests your scraper's redirect handling:

Error Type	Config Field	Behavior
Redirect Loop	`redirect_loop_pct`	Chain of 301 redirects up to `max_redirect_loop_hops` (default 10)
SSRF Redirect	`ssrf_redirect_pct`	301 redirect to private IPs (169.254.169.254, 10.x.x.x, etc.)
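
A scraper that follows redirects manually can guard against loops with a hop limit and a set of visited URLs. In this sketch, `fetch` is a stand-in for your HTTP client with automatic redirects disabled, and the default `max_hops` of 5 is an arbitrary client-side choice.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

# fetch(url) -> (status, Location header or None); a stand-in for your HTTP client
Fetch = Callable[[str], Tuple[int, Optional[str]]]


class TooManyRedirects(Exception):
    """Raised when a redirect chain loops or exceeds the hop limit."""


def resolve(url: str, fetch: Fetch, max_hops: int = 5) -> str:
    """Follow redirects manually, stopping on loops or after max_hops."""
    seen = {url}
    for _ in range(max_hops):
        status, location = fetch(url)
        if status not in (301, 302, 303, 307, 308) or not location:
            return url  # final destination reached
        url = urljoin(url, location)  # Location may be relative
        if url in seen:
            raise TooManyRedirects(f"loop detected at {url}")
        seen.add(url)
    raise TooManyRedirects(f"gave up after {max_hops} hops")
```

Setting the client limit below `max_redirect_loop_hops` ensures the scraper gives up before the server's chain ends.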

Warning

SSRF redirect testing verifies that your scraper blocks redirects to private/internal addresses. Real scrapers should never follow redirects to cloud metadata endpoints like http://169.254.169.254/.
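
A basic guard can be sketched with the standard `ipaddress` module. It only inspects literal IP addresses in the redirect target. Hostnames that resolve to internal addresses require a DNS lookup before the check, which is not covered here.

```python
import ipaddress
from urllib.parse import urlparse


def is_internal_target(url: str) -> bool:
    """True if the URL's host is a literal IP in a private or otherwise internal range."""
    host = urlparse(url).hostname
    if host is None:
        return True  # unparseable target: refuse rather than guess
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not a literal IP. Only the obvious name is caught; resolve others first.
        return host.lower() == "localhost"
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved
```

Because ChaosWeb itself usually runs on a loopback address during tests, apply the check to redirect targets only, or allowlist the ChaosWeb host explicitly.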

Burst Patterns

Bursts simulate coordinated anti-scraping escalation -- a site suddenly blocks most requests, then backs off:

error_injection:
  burst:
    enabled: true
    interval_sec: 60    # Burst every 60 seconds
    duration_sec: 8     # Lasts 8 seconds
    rate_limit_pct: 40  # During burst: 40% rate limits
    forbidden_pct: 30   # During burst: 30% forbidden
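
To reason about how much of a test run falls inside a burst, the schedule can be modeled as below. This assumes each interval begins with a burst that lasts `duration_sec`. The server's actual phase alignment may differ, and the `in_burst` field of `/health` reports the real state.

```python
def in_burst(elapsed_sec: float, interval_sec: float = 60, duration_sec: float = 8) -> bool:
    """Assumed schedule: every interval starts with a burst of duration_sec."""
    return (elapsed_sec % interval_sec) < duration_sec


def burst_fraction(interval_sec: float = 60, duration_sec: float = 8) -> float:
    """Share of wall-clock time spent in a burst."""
    return min(duration_sec, interval_sec) / interval_sec
```

With the values from the config above, about 13% of the run (8 of every 60 seconds) is spent in a burst.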

Selection Mode

priority (default): Errors are evaluated in category order (connection > redirect > HTTP > malformed). The first match wins.
weighted: All percentages are treated as proportional weights, so each error type is selected in proportion to its configured rate.
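
The difference between the two modes can be illustrated with a small simulation. This is a sketch of the general technique, not ChaosWeb's implementation. In particular, treating the remainder up to 100 as the "no error" weight in weighted mode is an assumption.

```python
import random
from typing import List, Optional, Tuple

Rates = List[Tuple[str, float]]  # (error name, percentage), already in priority order


def pick_priority(rates: Rates, rng: random.Random) -> Optional[str]:
    """Roll independently for each error in order; the first hit wins."""
    for name, pct in rates:
        if rng.random() * 100 < pct:
            return name
    return None  # no error selected: serve normal content


def pick_weighted(rates: Rates, rng: random.Random) -> Optional[str]:
    """Make a single draw in which each error's share is proportional to its percentage."""
    total = sum(pct for _, pct in rates)
    success_weight = max(0.0, 100.0 - total)  # assumption: the remainder means "no error"
    names: List[Optional[str]] = [name for name, _ in rates] + [None]
    weights = [pct for _, pct in rates] + [success_weight]
    return rng.choices(names, weights=weights, k=1)[0]
```

In priority mode, errors late in the order are reached only when every earlier roll misses, so they occur slightly less often than their nominal percentage. A single weighted draw avoids that skew.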

Content Modes

When a request is not selected for error injection, ChaosWeb generates HTML content using one of four modes:

Random (default)

Generates syntactically valid HTML pages with random content:

content:
  mode: random
  random:
    min_words: 100
    max_words: 500
    vocabulary: english  # or "lorem"

Template

Renders HTML through a Jinja2 SandboxedEnvironment:

content:
  mode: template
  template:
    body: >
      <html><head><title>{{ path }}</title></head>
      <body><h1>{{ path }}</h1>
      <p>{{ random_words(100, 300) }}</p></body></html>

Template helpers include `random_words`, `random_choice`, `random_float`, `timestamp`, and more. The `path` variable contains the requested URL path.

Echo

Reflects request information as HTML. Reflected content is HTML-escaped to prevent XSS if the page is rendered in a browser:

content:
  mode: echo
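
The escaping technique itself can be shown with the standard `html` module. The page layout and the `echo_page` name are hypothetical, and ChaosWeb's actual echo output may look different.

```python
import html


def echo_page(method: str, path: str, headers: dict) -> str:
    """Render request details as HTML, escaping every reflected value."""
    rows = "".join(
        f"<tr><td>{html.escape(name)}</td><td>{html.escape(value)}</td></tr>"
        for name, value in headers.items()
    )
    return (
        f"<html><body><h1>{html.escape(method)} {html.escape(path)}</h1>"
        f"<table>{rows}</table></body></html>"
    )
```

A path such as `/<script>alert(1)</script>` is rendered as inert text because `<` and `>` become `&lt;` and `&gt;`.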

Preset

Loads HTML page snapshots from a JSONL file:

content:
  mode: preset
  preset:
    file: ./pages.jsonl
    selection: random  # or "sequential"
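
The two selection strategies can be sketched as follows. The record schema of the JSONL file is not specified in this guide, so the code treats each line as an opaque JSON object. Wrapping around at the end of the file in sequential mode is also an assumption.

```python
import itertools
import json
import random
from typing import Iterable, Iterator, List


def load_pages(lines: Iterable[str]) -> List[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def page_picker(pages: List[dict], selection: str = "random") -> Iterator[dict]:
    """Yield pages forever, in file order or at random."""
    if selection == "sequential":
        return itertools.cycle(pages)  # assumed: restart from the top at the end
    return (random.choice(pages) for _ in itertools.count())
```

With a real file, `load_pages(open("pages.jsonl", encoding="utf-8"))` would read the snapshots, since a file object iterates line by line.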

Per-Request Overrides

When `allow_header_overrides` is true (the default), the `X-Fake-Content-Mode` header selects the content mode for a single request:

curl -H "X-Fake-Content-Mode: echo" http://localhost:8200/articles/test
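
The same override can be sent from Python. This sketch uses the standard library and validates the mode against the four content modes described above. The helper name `with_content_mode` is illustrative.

```python
import urllib.request

CONTENT_MODES = {"random", "template", "echo", "preset"}  # modes listed in this guide


def with_content_mode(url: str, mode: str) -> urllib.request.Request:
    """Build a GET request that asks ChaosWeb for a specific content mode."""
    if mode not in CONTENT_MODES:
        raise ValueError(f"unknown content mode: {mode!r}")
    return urllib.request.Request(url, headers={"X-Fake-Content-Mode": mode})


req = with_content_mode("http://localhost:8200/articles/test", "echo")
# urllib.request.urlopen(req) would send it to a running server.
```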

Available Presets

ChaosWeb ships with five presets. Select one with `--preset=<name>`:

Preset	Error Rate	Latency	Burst (interval/duration)	Best For
`silent`	0%	200ms +/- 100ms	Off	Baseline throughput measurement
`gentle`	~2%	100ms +/- 50ms	Off	Basic scraping functionality testing
`realistic`	~19%	300ms +/- 150ms	60s/8s	Production-like scraping conditions
`stress_scraping`	~57%	500ms +/- 200ms	60s/10s	Heavy anti-scraping resilience testing
`stress_extreme`	~98%	800ms +/- 400ms	30s/8s	Breaking-point stress testing

Preset Details

silent -- Zero errors. Every request returns HTML. Use this to establish baseline scraping throughput.

gentle -- Minimal error injection: 1% rate limits and 1% not-found errors. No connection failures, malformations, or bursts. Verifies your scraper handles basic error paths.

realistic -- Mimics typical web scraping conditions. Moderate rate limiting (5%), bot detection (3% forbidden), occasional slow responses (5%), and rare encoding issues. Bursts every 60 seconds simulate coordinated anti-scraping responses.

stress_scraping -- Heavy anti-scraping simulation. 15% rate limits, 10% forbidden, connection failures (5% timeout, 3% reset), content malformations (3% wrong content-type, 2% encoding mismatch), and SSRF redirect testing (1%). Aggressive burst escalation with 80% rate limiting.

stress_extreme -- Every error type is active at high rates. 25% rate limits, 15% forbidden, 10% timeout, 5% connection reset, heavy content malformations, redirect loops (3%), and SSRF redirects (2%). Very aggressive 30-second burst cycles. Use for finding failure modes and verifying graceful degradation.

Usage Examples

CLI

# Start with a preset
uv run chaosweb serve --preset=realistic

# Start with a custom config file
uv run chaosweb serve --config=my-config.yaml

# Via the unified CLI
uv run chaosengine web serve --preset=realistic

Python

from errorworks.web.config import ChaosWebConfig, load_config
from errorworks.web.server import ChaosWebServer, create_app

# Quick start
config = load_config(preset="realistic")
app = create_app(config)

# With full control
server = ChaosWebServer(config)
server.update_config({"error_injection": {"rate_limit_pct": 25.0}})
stats = server.get_stats()


See also:

Presets -- Full preset comparison and customization
Configuration -- YAML config file structure and precedence rules
Metrics -- Querying request statistics
Testing Fixtures -- In-process testing with pytest