
ADR 0014: Edge rate limiting via Traefik

Status: Accepted
Date: 2026-05-01

Context

Anything internet-facing without rate limits invites abuse. The two realistic threat models for SCALR at this stage are:

  1. Credential stuffing. Attackers iterate breach-list email + password pairs against the auth flow. Authentik has internal detection, but it kicks in only after the request hits its app server; we can shed bad traffic earlier, at the edge.
  2. Scrapers / bots. Anyone hammering an app's API for data exfiltration, accidental loops, or DoS-by-curiosity. Default app handlers don't push back on volume, so a misbehaving client can wedge a worker thread or fill a queue.

Both are unlikely at our current scale but cheap to defend against.

Decision

Define two rate-limit middlewares on the Traefik gateway and attach them to every prod router via Docker labels:

  • ratelimit-default: 100 req/sec average, 200 burst, per source IP. Attached to every app's prod (-redirect, -tls) routers. Generous enough that real users browsing apps never trigger it; tight enough that scrapers get 429s after a brief burst.
  • ratelimit-auth: 10 req/sec, 30 burst, per source IP. Attached only to the auth subdomain. Authentik's own login flow takes roughly 4-6 requests per attempt, so legitimate SSO traffic stays well under the cap. This cuts the credential-stuffing rate from "as fast as your TLS handshake" to ~10 req/sec per IP; meaningful even against distributed attackers, because each source IP independently hits the cap.

Local dev (*.localhost) routers deliberately don't carry rate limits — fast iteration matters more than abuse defence on a host that isn't internet-reachable.
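For concreteness, a sketch of what this could look like as compose labels. The middleware names, thresholds, and the sourcecriterion key are from this ADR; the exact service layout and the myapp router names are illustrative assumptions:

    # services/gateway/docker-compose.yml (excerpt)
    services:
      gateway:
        labels:
          # Token-bucket middlewares, keyed per client IP.
          - "traefik.http.middlewares.ratelimit-default.ratelimit.average=100"
          - "traefik.http.middlewares.ratelimit-default.ratelimit.burst=200"
          - "traefik.http.middlewares.ratelimit-default.ratelimit.sourcecriterion.ipstrategy.depth=1"
          - "traefik.http.middlewares.ratelimit-auth.ratelimit.average=10"
          - "traefik.http.middlewares.ratelimit-auth.ratelimit.burst=30"
          - "traefik.http.middlewares.ratelimit-auth.ratelimit.sourcecriterion.ipstrategy.depth=1"

    # an app's docker-compose.yml (excerpt): the prod routers chain the middleware
    services:
      myapp:
        labels:
          - "traefik.http.routers.myapp-tls.middlewares=ratelimit-default@docker"
          - "traefik.http.routers.myapp-redirect.middlewares=ratelimit-default@docker"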

Why these thresholds

The numbers come from these observations:

  • A single SCALR app, real-user browsing, peaks at maybe 20 req/sec per browser session (frontend asset loads + a few API calls). 100/sec gives 5x headroom per IP.
  • Authentik's login flow: identification stage GET, password POST, authorize POST, redirect; call it 4-6 requests for one login. 10/sec lets a single IP attempt 1-2 logins per second sustained, with 30 burst absorbing a few back-to-back attempts. Slow enough that brute-force at scale is painful.
  • Traefik's ratelimit middleware is a token bucket: average is the steady refill rate, burst is the bucket size. Once the bucket is drained, the router returns 429 with Retry-After.
  • Per-source-IP via sourcecriterion.ipstrategy.depth=1: Traefik takes the rightmost X-Forwarded-For entry, i.e. the client IP as appended by the immediate hop. Since the only thing in front of Traefik is the cloud provider's load balancer (or nothing, in dev), this gives us the real client IP.
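As a worked example of the bucket arithmetic for the auth limiter (period is Traefik's own option, shown here with its one-second default made explicit; the numbers are from above):

    # rate = average / period; bucket capacity = burst.
    # With average=10, period=1s, burst=30: an idle IP can fire 30 requests
    # back-to-back, then refills at 10 tokens/sec -- roughly 1-2 logins/sec
    # at 4-6 requests per login. Anything beyond the bucket gets a 429.
    - "traefik.http.middlewares.ratelimit-auth.ratelimit.average=10"
    - "traefik.http.middlewares.ratelimit-auth.ratelimit.period=1s"
    - "traefik.http.middlewares.ratelimit-auth.ratelimit.burst=30"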

These are defaults; override per-app via additional middlewares if a specific endpoint needs tighter or looser bounds.
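A per-app override might look like the following, in that app's own compose labels; the ratelimit-heavy name, its numbers, and the reports-tls router are hypothetical:

    # Define a looser bucket and attach it in place of the default.
    - "traefik.http.middlewares.ratelimit-heavy.ratelimit.average=500"
    - "traefik.http.middlewares.ratelimit-heavy.ratelimit.burst=1000"
    - "traefik.http.routers.reports-tls.middlewares=ratelimit-heavy@docker"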

Why edge, not in-app

Rate-limiting in each app would require:

  • A shared Redis-or-similar so app instances share the bucket.
  • Middleware in every framework (FastAPI dependency, Vite plugin, whatever).
  • Per-app configuration that drifts.

Doing it at Traefik gives us:

  • One config surface (gateway compose labels).
  • Bad traffic doesn't reach the app at all — saves the worker thread.
  • Free for *.localhost to be unrestricted (dev productivity).
  • Provider-agnostic — when we eventually swap Traefik for something else, we re-implement the rate-limit decision once, not in N apps.

Why not a WAF (Cloudflare, AWS WAF) instead

Cloudflare in front of the gateway would give us this for free, plus DDoS absorption, plus geographic blocks, plus rule-based filtering. We don't run that today because:

  • Putting Cloudflare in front means TLS terminates at Cloudflare, with a second connection re-originated to our origin; worth the trade once we're at Cloudflare-volume traffic, not at solo scale.
  • The Cloudflare orange-cloud setup is its own ADR's worth of work (proxy mode, IP-whitelisting Traefik to accept only CF IPs, the cert reissue cycle).
  • Traefik's built-in middleware covers the realistic 99% case for our scale at zero added complexity.

When Cloudflare lands, this middleware gets retired or relegated to a defence-in-depth role. Trivial to remove — just delete the labels.

Consequences

Positive:

  • Closes the most embarrassing "no rate limit on /token endpoint" finding any pen-tester would flag.
  • Zero per-app configuration; new apps inherit ratelimit-default via the template-app's compose file.
  • Works in dev unchanged (localhost routers don't carry the middleware).
  • Trivial to tune: change one line in services/gateway/docker-compose.yml, recreate the gateway, and all apps pick up the new threshold.

Negative:

  • A genuinely high-rate legitimate workload (e.g. a CSV upload of 100k rows) will now see 429s. Workaround: that endpoint should be enqueued to the background-jobs queue (ADR 0012) anyway, not beaten as a synchronous loop.
  • Per-source-IP isn't perfect: NATs share IPs (corporate networks, mobile carriers). At our scale this is academic; revisit when an enterprise customer's whole office complains they're being limited.
  • Not visible in app logs, since Traefik returns 429 before the request hits the app. Surfaces in Traefik logs and (once Loki is in place) in the centralised logs panel.

Future work

  • Per-endpoint limits. If /api/auth/login ever takes more abuse than the rest of an app, attach a third middleware just to that path's router rule (see the sketch after this list).
  • Sliding-window, not token-bucket. Traefik's middleware is token-bucket; if we ever need stricter "5 attempts per 5 minutes" semantics, that's an in-app rate-limit on top of the edge layer.
  • Cloudflare proxy. When traffic warrants. The ADR for that switchover supersedes this one for the public-facing layer.
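For the per-endpoint bullet, a minimal sketch: a dedicated router whose rule matches only the login path carries its own tighter middleware. The myapp names, the host, and the ratelimit-login numbers are all hypothetical:

    # Higher-priority router that matches only the login path.
    - "traefik.http.routers.myapp-login.rule=Host(`myapp.example.com`) && Path(`/api/auth/login`)"
    - "traefik.http.routers.myapp-login.priority=100"  # outrank the app's catch-all router
    - "traefik.http.routers.myapp-login.middlewares=ratelimit-login@docker"
    - "traefik.http.middlewares.ratelimit-login.ratelimit.average=2"
    - "traefik.http.middlewares.ratelimit-login.ratelimit.burst=6"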