ADR 0013: Centralised logs via Loki + Promtail + Grafana¶
Status: Accepted
Date: 2026-05-01
Context¶
Up to now, debugging meant running `docker logs <container>` against one
container at a time, and once a container was recreated its history
was gone. As the platform grows past a single app, this stops being
viable:
- A bug spanning portal → app → auth meant tailing three terminals and manually correlating timestamps.
- A 2 AM incident meant SSHing in cold and grepping live output — zero history once the offending container restarted.
- "Did this happen before?" was unanswerable without external forensic tooling.
We need somewhere all logs land, with searchable history, available even after containers churn.
Decision¶
Adopt Grafana Loki as the log store, Promtail as the
collector, and Grafana as the UI, deployed together as
services/logs/. Self-hosted, free, single-host. Retention 14 days.
Promtail auto-discovers every running container via the Docker socket and ships logs to Loki tagged with container/image/compose-project metadata. Grafana queries via LogQL with a pre-provisioned datasource and a starter "SCALR Overview" dashboard.
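As an illustration, a minimal Promtail scrape config along these lines would do the auto-discovery described above. This is a sketch, not the deployed config — the label names and file paths are assumptions:

```yaml
# promtail.yml — illustrative sketch, not the actual services/logs/ config
server:
  http_listen_port: 9080

clients:
  - url: http://loki:3100/loki/api/v1/push   # internal scalr-logs network only

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock    # requires the socket bind-mounted in
        refresh_interval: 5s
    relabel_configs:
      # Promote Docker metadata to Loki labels: container, image, compose project
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: container
      - source_labels: ['__meta_docker_container_label_com_docker_compose_project']
        target_label: compose_project
```

New containers matching the Docker socket are picked up on the next refresh, which is what makes per-app config unnecessary.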
Grafana is reachable behind the gateway at logs.${SCALR_PUBLIC_DOMAIN};
Loki and Promtail are internal-only on the scalr-logs network.
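A compose layout matching this decision might look roughly like the following. Service, volume, and network names are illustrative; the real files live in services/logs/:

```yaml
# services/logs/docker-compose.yml — sketch, assuming names used elsewhere in this ADR
services:
  loki:
    image: grafana/loki:latest
    networks: [scalr-logs]               # internal-only, no Traefik labels
    volumes:
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    networks: [scalr-logs]
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro         # container discovery
      - /var/lib/docker/containers:/var/lib/docker/containers:ro

  grafana:
    image: grafana/grafana:latest
    networks: [scalr-logs, scalr-edge]   # only Grafana joins the edge network

volumes:
  loki-data:

networks:
  scalr-logs:
    internal: true
  scalr-edge:
    external: true
```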
Why Loki over the alternatives¶
We considered four families:
- Loki (chosen). Same vendor as Grafana, log-specific, indexes only labels (not the log body) so storage and ingest are cheap. Single binary, single config file, single host happy path. LogQL is simple — Prometheus-shaped queries that compose with grep-style filters.
- ELK (Elasticsearch + Logstash + Kibana). The default before Loki existed. Disqualifying for solo: 4–8 GB RAM minimum just for Elasticsearch idling, plus Logstash overhead, plus Kibana. Massive storage footprint because everything is indexed.
- Grafana Cloud (managed). Free tier 50 GB/month, 14-day retention. Tempting, but pushes log data to a third party (some customer logs may include PII — privacy concern), and creates an outbound dependency for debugging. Self-hosting is the right answer at this scale; revisit if we outgrow the free tier or want multi-region redundancy.
- Datadog / New Relic / Splunk. SaaS with great UX, but per-GB pricing that becomes ruinous as logs grow. Wrong shape for a solo operator.
Why we don't need Prometheus / metrics yet¶
This stack is logs-only. Metrics (container CPU/RAM, request rates, queue depth) need Prometheus + cAdvisor + node_exporter. We're not building that yet because:
- At single-app scale, logs answer 90% of "why is this slow?"
- Adding metrics doubles the operational surface (Prometheus storage, retention, scrape configs, exporters per service).
- Most Grafana panels we'd build for metrics we can already build
  from LogQL aggregations (e.g. `rate({container="..."}[1m])`).
When the platform grows past a few real users, metrics become worth it. ADR follow-up at that point.
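For reference, the kinds of queries the logs-only stack already supports — the label names (`container`, `compose_project`) assume the Promtail relabelling described in the Decision section:

```logql
# All error lines from the auth container, across 14 days of history
{container="auth"} |= "error"

# Cross-service correlation without tailing three terminals
{compose_project="scalr"} |= "request_id=abc123"

# Metrics-ish panel from logs alone: per-container error rate
sum by (container) (rate({compose_project="scalr"} |= "error" [1m]))
```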
Why not put logs through the gateway¶
Promtail and Loki communicate over the internal scalr-logs Docker
network only — no external endpoints. This is deliberate:
- Promtail's source is the local Docker socket and bind-mounted log files; nothing it needs is on the public internet.
- Loki's only client is Promtail. No reason to expose its push API externally.
- Grafana is the only piece users touch, so only Grafana joins
`scalr-edge` and gets Traefik labels.
Smaller surface, fewer auth concerns, simpler ops.
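Concretely, only the Grafana service carries edge wiring; Loki and Promtail get no Traefik labels at all. A sketch of what that looks like in compose (the router name and Grafana's default port 3000 are assumptions):

```yaml
grafana:
  networks: [scalr-logs, scalr-edge]
  labels:
    - traefik.enable=true
    - traefik.http.routers.grafana.rule=Host(`logs.${SCALR_PUBLIC_DOMAIN}`)
    - traefik.http.services.grafana.loadbalancer.server.port=3000
```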
Authentication¶
Grafana ships with its built-in admin/admin login (and forces a password change on first use). Acceptable for v1.
The cleaner production posture is Authentik forward-auth in front of
Grafana — only members of a scalr-admins group can reach the URL,
and they sign in via SSO instead of a separate Grafana password.
Documented in services/logs/README.md as deferred work.
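When we pick this up, the shape of the change is a Traefik forward-auth middleware pointing at Authentik's outpost. A sketch, assuming the embedded-outpost path and a Grafana router named `grafana` — not the final config:

```yaml
labels:
  - traefik.http.routers.grafana.middlewares=authentik@docker
  - traefik.http.middlewares.authentik.forwardauth.address=http://authentik:9000/outpost.goauthentik.io/auth/traefik
  - traefik.http.middlewares.authentik.forwardauth.trustForwardHeader=true
  - traefik.http.middlewares.authentik.forwardauth.authResponseHeaders=X-authentik-username,X-authentik-groups
```

Group restriction (e.g. a scalr-admins group) would be enforced on the Authentik side, in the application's policy bindings.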
Consequences¶
Positive:
- One URL for all platform logs. No more multi-tab docker logs.
- 14-day searchable history survives container churn.
- LogQL queries reusable across apps; no per-app log format
prerequisite.
- Promtail's Docker auto-discovery means new apps and new containers
are picked up automatically — zero config per app.
Negative:
- One more compose stack to maintain. Mitigated by health checks and
Uptime Kuma monitoring.
- Disk usage grows with log volume. Default 14d retention bounds it;
monitor loki-data volume size monthly.
- Grafana's built-in auth is a separate password to manage. Future
ADR will move it behind Authentik.
- LogQL has a learning curve for ops who knew grep. Mitigated by the
starter dashboard which covers the common queries out of the box.
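The 14-day bound comes from Loki's retention settings — roughly the following knobs, using compactor-based retention (values illustrative, paths are assumptions):

```yaml
# loki.yml fragment — sketch of where the 14-day cap lives
limits_config:
  retention_period: 336h        # 14 days

compactor:
  retention_enabled: true
  delete_request_store: filesystem
  working_directory: /loki/compactor
```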
Future work¶
- Authentik forward-auth for Grafana. Half-day. Replaces built-in admin login with SSO.
- Metrics. Add Prometheus + cAdvisor when the platform's performance picture matters more than its event picture.
- Alerting. Loki ruler can fire to Alertmanager. Wire when there's something specific to alert on (e.g. "auth error rate > N/min").
- Log retention to S3. Loki supports tiered storage — cheap-and-deep beyond the 14-day local window. Set up if we ever need year-long forensic history.