ADR 0007: Traefik as the API Gateway / Edge Router

Status: Accepted
Date: 2026-04-30

Context

SCALR runs many apps that should be reachable via clean per-app URLs (portal.<public-domain>, invoices.<public-domain>, docs.<public-domain>) rather than host:port combinations. We need a component at the edge that:

  1. Terminates TLS and serves Let's Encrypt certificates.
  2. Routes <subdomain>.${SCALR_PUBLIC_DOMAIN} to the right container based on each app's app.manifest.yml.
  3. Auto-discovers new apps as they're added (we don't want to edit a central nginx.conf every time make new-app runs).
  4. Stays out of the way during local dev — adding the gateway shouldn't make the developer experience worse.

The realistic candidates were:

  • Traefik — Docker-native; reads container labels and rebuilds routes on the fly; built-in ACME/Let's Encrypt; small footprint; active maintenance.
  • Caddy — automatic HTTPS with simple Caddyfile; route changes require config reloads or the admin API.
  • Nginx — battle-tested, predictable; routes via per-app server blocks under /etc/nginx/conf.d/. Adding apps means generating config files (e.g. via the scaffold script) and reloading.

Decision

Use Traefik v3 as the SCALR edge router, deployed as services/gateway/.

Each app's docker-compose.yml carries Traefik labels — three routers per service so dev stays HTTP-only on *.localhost and prod gets HTTPS via Let's Encrypt:

labels:
  - "traefik.enable=true"
  - "traefik.docker.network=scalr-edge"

  # Local dev — HTTP on <slug>.localhost. No TLS, no redirect.
  - "traefik.http.routers.<slug>-frontend.rule=Host(`<slug>.localhost`)"
  - "traefik.http.routers.<slug>-frontend.entrypoints=web"
  - "traefik.http.routers.<slug>-frontend.service=<slug>-frontend"

  # Production HTTP — redirects to HTTPS via the gateway middleware.
  - "traefik.http.routers.<slug>-frontend-redirect.rule=Host(`<slug>.${SCALR_PUBLIC_DOMAIN:-scalr.com}`)"
  - "traefik.http.routers.<slug>-frontend-redirect.entrypoints=web"
  - "traefik.http.routers.<slug>-frontend-redirect.service=<slug>-frontend"
  - "traefik.http.routers.<slug>-frontend-redirect.middlewares=https-redirect@docker"

  # Production HTTPS — Let's Encrypt cert via the `le` resolver.
  - "traefik.http.routers.<slug>-frontend-tls.rule=Host(`<slug>.${SCALR_PUBLIC_DOMAIN:-scalr.com}`)"
  - "traefik.http.routers.<slug>-frontend-tls.entrypoints=websecure"
  - "traefik.http.routers.<slug>-frontend-tls.service=<slug>-frontend"
  - "traefik.http.routers.<slug>-frontend-tls.tls=true"
  - "traefik.http.routers.<slug>-frontend-tls.tls.certresolver=le"
  - "traefik.http.services.<slug>-frontend.loadbalancer.server.port=5173"
  # (same trio for the backend service, with && PathPrefix(`/api`) appended to each rule)
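
For illustration, the production HTTPS router of that backend trio might look like the sketch below. The `<slug>-backend` naming and port 8000 are assumptions, chosen to mirror the frontend labels; they are not taken from the scaffold script.

```yaml
# Sketch only: the "-backend" names and port 8000 are assumed.
labels:
  - "traefik.enable=true"
  - "traefik.docker.network=scalr-edge"

  # Production HTTPS: same host as the frontend, narrowed to /api.
  - "traefik.http.routers.<slug>-backend-tls.rule=Host(`<slug>.${SCALR_PUBLIC_DOMAIN:-scalr.com}`) && PathPrefix(`/api`)"
  - "traefik.http.routers.<slug>-backend-tls.entrypoints=websecure"
  - "traefik.http.routers.<slug>-backend-tls.service=<slug>-backend"
  - "traefik.http.routers.<slug>-backend-tls.tls=true"
  - "traefik.http.routers.<slug>-backend-tls.tls.certresolver=le"
  - "traefik.http.services.<slug>-backend.loadbalancer.server.port=8000"
```

By default Traefik prioritizes routers by rule length, so the longer Host-plus-PathPrefix rule should win over the frontend's host-only rule for /api requests without an explicit priority label.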

SCALR_PUBLIC_DOMAIN lives in <repo-root>/.env.shared and is set platform-wide via make set-public-domain DOMAIN=<your-domain>. The scaffold script writes these labels into each new app with <slug> substituted from app.manifest.yml's slug (which equals routing.subdomain by convention).
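
As a rough illustration of that substitution, a manifest fragment for a hypothetical app might look like this. Only slug and routing.subdomain are named in this ADR; the nesting shown and the value invoices are assumptions, and all other manifest fields are left out.

```yaml
# app.manifest.yml (fragment; other fields omitted, layout assumed)
slug: invoices            # substituted for <slug> in the compose labels
routing:
  subdomain: invoices     # equals slug by convention
```

With SCALR_PUBLIC_DOMAIN set to example.com, the label templates above would render the production rule as Host(`invoices.example.com`) and the dev rule as Host(`invoices.localhost`).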

The shared https-redirect middleware is defined once on the gateway container itself, via redirectscheme.scheme=https labels.
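
A minimal sketch of those gateway labels follows. The service name gateway and the permanent flag are assumptions; only the middleware name and redirectscheme.scheme=https come from this ADR.

```yaml
# Sketch of the gateway's own compose labels; service name is assumed.
services:
  gateway:
    labels:
      # Needed so the Docker provider reads labels on this container
      # when containers are not exposed by default.
      - "traefik.enable=true"
      # Referenced by app routers as https-redirect@docker.
      - "traefik.http.middlewares.https-redirect.redirectscheme.scheme=https"
      # Optional assumption: answer with a permanent redirect.
      - "traefik.http.middlewares.https-redirect.redirectscheme.permanent=true"
```

Because the middleware is declared through the Docker provider, app routers refer to it with the @docker suffix, as the redirect router labels above do.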

Traefik joins a shared Docker network (scalr-edge) along with each app's gateway-facing services. The gateway uses Docker as its provider and discovers services on container start.
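
The network and provider wiring could be sketched as below. The image tag, the exposedbydefault setting, the published ports, and the external network declaration are assumptions. The provider endpoint flag is left out here; this ADR's Consequences section explains why the real gateway sets one.

```yaml
# Sketch of services/gateway/docker-compose.yml; values are assumed.
services:
  gateway:
    image: traefik:v3.1            # assumed tag
    command:
      - "--providers.docker=true"
      # Only containers that set traefik.enable=true are routed.
      - "--providers.docker.exposedbydefault=false"
      # Default network Traefik uses to reach discovered containers.
      - "--providers.docker.network=scalr-edge"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
    ports:
      - "80:80"
      - "443:443"
    networks:
      - scalr-edge

networks:
  scalr-edge:
    external: true                 # assumed to be created outside this file
```

Each app's gateway-facing services would join the same scalr-edge network in their own compose files.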

Why Traefik beat the alternatives

  1. Self-describing apps fit Traefik's model exactly. SCALR's whole pattern (ADR 0005) is "the manifest is the registry; no central config." Traefik labels are the same idea — each container declares its own routes. Adding an app is one make new-app away with no central edit.
  2. Live reconfiguration. No reload step, no signal handling. A container starts; Traefik picks it up. A container stops; the route disappears.
  3. TLS automation. Built-in Let's Encrypt resolver via the certificatesresolvers block. We turn it on with one stanza and Traefik manages renewal.
  4. One container, one binary. Operationally similar to Authentik's footprint. No external state besides a small ACME storage file.
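
The Let's Encrypt stanza from point 3 might look like the following when expressed as gateway command flags. The resolver name le matches the app labels; the contact email, storage path, volume name, and the choice of the HTTP-01 challenge are assumptions.

```yaml
# Sketch of the ACME flags on the gateway service; values are assumed.
services:
  gateway:
    command:
      - "--certificatesresolvers.le.acme.email=ops@example.com"
      - "--certificatesresolvers.le.acme.storage=/letsencrypt/acme.json"
      # HTTP-01 challenge answered on the plain-HTTP entrypoint.
      - "--certificatesresolvers.le.acme.httpchallenge.entrypoint=web"
    volumes:
      # Persists the ACME storage file across container restarts.
      - "letsencrypt:/letsencrypt"

volumes:
  letsencrypt:
```

The storage file is the "small ACME storage file" mentioned in point 4; keeping it on a volume avoids requesting fresh certificates on every restart.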

Why not the others

  • Caddy — also great, but its Docker integration (caddy-docker-proxy) is community-maintained and less feature-rich than Traefik's first-party Docker provider. We'd lose live label-driven routes on each container.
  • Nginx — every new app would mean a generated config file under infra/nginx/conf.d/<slug>.conf plus a reload step in the scaffold script. Two more failure modes than Traefik's labels.

Consequences

Positive:

  • Adding an app to the platform = scaffold runs + container starts. Traefik does the rest.
  • Labels live in the same docker-compose.yml as the service they describe. No drift between routing intent and reality.
  • Same image and config in dev and prod; only the entrypoint definitions (with vs. without TLS) and the certificate resolver differ.

Negative:

  • Labels are stringly-typed and verbose. The scaffold script has to write them correctly per app, and humans tweaking them locally need to know the syntax. Mitigated by template inheritance and the scaffold writing them automatically.
  • Traefik does not touch the Docker socket directly. A small nginx sidecar (scalr-gateway-socket-proxy) bind-mounts /var/run/docker.sock and exposes a TCP endpoint on a private network, which Traefik reaches via --providers.docker.endpoint=tcp://socket-proxy:2375. This was forced by a real bug: Traefik v3.x's vendored Docker client is hard-pinned to API v1.24 and ignores DOCKER_API_VERSION, while Docker Desktop 4.63+ / Engine 29 enforces MinAPIVersion=1.44 and replies with an empty HTTP 400. The proxy rewrites the version prefix of each request path from /vX.Y/... to /v1.44/... on the way through. It's also the right prod posture: the gateway never has container-management power, and the proxy is a small, easily audited attack surface. If we ever federate or run in a multi-tenant context we'll move to a non-socket discovery model (Kubernetes IngressRoute, Nomad Connect, etc.).
  • Local-dev URL pattern shifts from localhost:5184 to portal.localhost. macOS resolves *.localhost to 127.0.0.1 by default; Linux developers will need an /etc/hosts entry or dnsmasq. Direct-port access (localhost:5184) keeps working in standalone mode for developers who don't want to run the gateway.
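
The socket-proxy arrangement described in the negative consequences could be wired roughly as follows. The endpoint flag, the container name, and the socket path come from this ADR; the service names, the nginx image tag, and the socket-private network are assumptions. The nginx configuration that performs the API version rewrite is deliberately omitted.

```yaml
# Sketch of the gateway / socket-proxy pairing; names are assumed.
services:
  gateway:
    command:
      - "--providers.docker.endpoint=tcp://socket-proxy:2375"
    networks:
      - scalr-edge
      - socket-private        # reaches the proxy, never the socket itself

  socket-proxy:
    image: nginx:stable       # assumed tag
    container_name: scalr-gateway-socket-proxy
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
      # The nginx config doing the version rewrite is not shown here.
    networks:
      - socket-private

networks:
  scalr-edge:
    external: true
  socket-private:
    internal: true            # no route to the outside world
```

Keeping the proxy on an internal-only network means the TCP endpoint is reachable by the gateway but not published on the host or on scalr-edge.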

Migration path

If Traefik ever has to be replaced, the only thing the apps need to change is their compose labels. The manifest schema doesn't reference Traefik, and routing.subdomain is gateway-agnostic. Routing rules can be regenerated for whatever replaces Traefik from each manifest.