Architecture

Why We Host Every Drupal Tenant on Fly.io (Instead of Kubernetes)

Jay Callicott · 8 min read

Decoupled.io runs on Fly.io. Every Drupal space a customer creates is its own Fly app, with its own machine, its own regional volume, and its own domain. No shared host. No noisy neighbors. No custom orchestration layer we have to maintain.

We chose Fly as the production substrate for a specific reason: a hosted headless CMS needs real isolation between tenants, and we wanted the platform to give us that by default. Kubernetes — EKS, GKE, self-hosted, k3s — can deliver the same guarantees, but the team time to design, operate, and secure a cluster wasn't where we wanted to spend energy early on. Fly got us to the same place faster.

This isn't "Kubernetes bad." Kubernetes is genuinely excellent at what it's designed for, and plenty of hosted-CMS companies run happily on it. The rest of this post is an honest look at why Fly fits our particular shape better, and what we gave up by choosing it.

What "per-tenant" means for us

Every Drupal space we provision gets its own Fly app. That app owns:

  • one machine (a small VM)
  • one regional volume holding Drupal's database and user files
  • one hostname (tenant-<id>.fly.dev, plus any custom domain the customer adds)
  • its own deploy history, logs, metrics, and secrets

Under Kubernetes, you can model the same isolation with a namespace per tenant, resource limits, and PersistentVolumeClaims. It works. It just requires more moving parts: a cluster, an ingress controller, cert-manager, a StorageClass provisioner, Helm charts, RBAC for the tenant operator. Fly collapses those into "create an app."

What Fly gives us on day one

Isolation by default

Every tenant being its own Fly app means:

  • Independent resource limits. CPU, RAM, and disk are scoped to the app. A runaway Drupal cron on one tenant cannot steal headroom from another.
  • Independent domains and TLS. Each app has its own hostname and its own Let's Encrypt cert. No shared ingress to configure.
  • Independent deploy cycles. We can ship a new Drupal image to one tenant without touching any other tenant — or deliberately keep some tenants on an older version during a rollout.

On Kubernetes we'd get equivalent isolation with strict resource limits + NetworkPolicies + per-tenant namespaces. The guarantees are similar; the configuration surface is larger.

Auto-stop, auto-start for the long tail

Fly machines can be configured to stop on idle and start on the first incoming request. For Decoupled.io, that maps cleanly to our Free and Starter tiers:

  • Free and Starter tenants run with auto_stop_machines = true. They spin down after a few minutes of inactivity. A first request cold-starts them in 5–8 seconds.
  • Pro tenants run a different fly.toml with always-on machines. Customers pay more; they get hot performance with no cold start.
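The two tiers differ only in a few `[http_service]` flags. A minimal sketch of what the Free/Starter `fly.toml` might look like (the app name and port are illustrative, not our actual config):

```toml
# Free/Starter tier — sketch, values illustrative
app = "tenant-abc123"          # hypothetical tenant app name

[http_service]
  internal_port = 80
  auto_stop_machines = true    # stop the machine after a few idle minutes
  auto_start_machines = true   # cold-start it on the first incoming request
  min_machines_running = 0     # allow scale-to-zero
```

The Pro variant flips the same flags the other way: `auto_stop_machines = false` and `min_machines_running = 1`, so the machine stays hot and customers never see a cold start.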

Kubernetes has equivalents — KEDA scale-to-zero, or Knative — they just live one or two layers deeper in the stack. Fly ships the feature as a flag in the app config.

The decoupled architecture also makes cold-start much less user-visible than it sounds. End users see the CDN's cached copy; Drupal's cold start only shows up on cache-miss revalidation requests, a content editor's first admin load of the day, and webhook-driven rebuilds. None of those are a "the site is slow" moment for real traffic.

Volume snapshots, built-in

Every Fly volume gets a daily automatic snapshot with 5-day retention, free. That's our baseline disaster-recovery story: if a tenant's volume corrupts, we can create a new volume from yesterday's snapshot in minutes.

On Kubernetes, volume snapshots exist (the VolumeSnapshot CSI API) but aren't universally enabled out of the box — you need a CSI driver that supports them, a snapshot controller, and a schedule. It's doable, just another thing to run. We layer our paid Backups add-on on top of Fly's snapshots — nightly DB + user files backups to Cloudflare R2 with 30-day retention, downloadable from the dashboard.
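The recovery path is short enough to sketch in a few flyctl calls. App and snapshot IDs below are hypothetical, and `run()` echoes each command instead of executing it, so this reads as a dry run of the procedure rather than our actual runbook:

```shell
#!/usr/bin/env bash
# Sketch: restore a tenant volume from a Fly snapshot (dry run).
set -euo pipefail

APP="tenant-abc123"        # hypothetical tenant app
SNAPSHOT="vs_0123456789"   # snapshot id picked from the list below
REGION="iad"

run() { echo "+ $*"; }     # dry-run: print each command instead of running it

# 1. List snapshots for the damaged volume to pick a restore point.
run fly volumes snapshots list vol_broken -a "$APP"

# 2. Create a fresh volume from yesterday's snapshot.
run fly volumes create data --snapshot-id "$SNAPSHOT" --region "$REGION" -a "$APP"

# 3. Redeploy so the tenant's machine mounts the restored volume.
run fly deploy -a "$APP"
```

Swapping `run()` for direct execution turns the dry run into the real restore; the point is that the whole procedure is three platform commands, not a CSI driver plus a snapshot controller.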

Anycast edge routing

Fly terminates TLS at an Anycast edge network. A request from London hits a Fly edge in London, gets TLS-terminated there, and is then routed to the tenant's machine (which might live in IAD or wherever the volume is). If one edge region has a network blip, other regions keep serving.

On Kubernetes, the equivalent is an ingress + a CDN/edge like Cloudflare or Fastly in front. It's more components to configure but well-trodden territory.

Per-tenant observability

flyctl logs -a tenant-abc123 shows exactly one customer's output. Same for metrics, deploy history, and secrets. On a shared Kubernetes cluster, kubectl logs interleaves output from every pod in a namespace by default; you end up filtering by labels or shipping logs to a third-party tool (Loki, Datadog, etc.) to get per-tenant views. Both work; Fly's just needs less setup.

The tradeoffs we accept (and why decoupled architecture softens them)

Fly isn't magic. Three things are genuinely weaker than a well-run Kubernetes platform — but it's worth noting up front that most of these tradeoffs hurt less for a headless CMS than they would for a monolithic one. In a decoupled setup, the frontend (Next.js on Vercel, Astro on Cloudflare, whatever) lives on a global CDN. Drupal is the backend. Cached pages, SSG output, and ISR mean end users almost never hit the origin directly; Drupal only gets traffic on cache misses, webhook rebuilds, and admin edits.

That framing matters for every bullet below:

  • Volumes are mount-exclusive and region-pinned. Only one machine can mount a given volume at a time, and the volume can't hop regions. True multi-region HA with a hot standby is non-trivial on Fly; you'd need app-level replication like LiteFS. Kubernetes with a cloud-native StorageClass (EBS multi-attach, or a distributed filesystem like Ceph) has more flexibility for workloads where the database sits in the hot path. Ours doesn't: every end-user page view goes through the frontend CDN, so region pinning adds latency only to cache misses, webhook rebuilds, and admin edits, not to the cached page views real users actually see.
  • The platform is younger than the Kubernetes ecosystem. Fly has had public multi-hour incidents in past years, while managed Kubernetes on AWS and GCP (EKS, GKE) has a much longer track record. The decoupled architecture helps here too: if Fly has a bad day, the frontend CDN keeps serving cached content, so end users often don't notice. What they would notice is admin edits not saving and newly published content not propagating until the origin is back. We also keep off-platform backups in Cloudflare R2, so worst-case data recovery depends on a separate provider.
  • Less escape-hatch flexibility. Kubernetes is a universal substrate — once you're there, you can run almost any workload. Fly is more opinionated. If we ever need to run Kafka, Elasticsearch, or something with complex stateful behavior, we might reach for a cloud-native platform. For a stateless-ish HTTP app + one DB — which is exactly what a headless Drupal backend looks like — Fly fits perfectly.

When Kubernetes is the better call

Honestly, it often is:

  • You already have a Kubernetes platform team. If cluster expertise is in-house, most of what we listed as "Fly wins" disappears — you already have the infrastructure to run multi-tenant hosting well.
  • You need to run complex stateful systems (Kafka, Cassandra, Postgres clusters with streaming replication) alongside your apps. Kubernetes operators for those are mature.
  • Your tenants are large and few. Kubernetes excels when a handful of customers each want serious compute; Fly is optimized for many-small rather than few-big.
  • Compliance frameworks mandate specific cloud substrates. EKS on AWS GovCloud is a known quantity for FedRAMP / HIPAA customers; Fly is a smaller target list.

Our bet is that we don't have any of those constraints today — and if we ever do, the Fly architecture is straightforward to port (each tenant is a Docker image + a volume, portable to any Kubernetes cluster).

How our Fly setup is wired

The shape of Decoupled.io on Fly, end to end:

  1. A "source" Fly app (decoupled-drupal-frankenphp) holds the canonical Drupal + FrankenPHP image. Each new tenant is created by cloning configuration from this app.
  2. A provision workflow (a GitHub Action) creates the tenant's Fly app, attaches a regional volume, starts the machine, and runs drush site:install dc_core. End to end: about two minutes from signup to a booting Drupal site.
  3. A tier-upgrade workflow handles Starter → Pro transitions — it extends the volume to 20 GB and redeploys the tenant with a Pro-specific fly.toml that keeps the machine always-on (no cold start).
  4. A nightly backup workflow dumps every Backups-add-on tenant's database and user files to Cloudflare R2 with 30-day retention, so customers have off-platform recovery in addition to Fly's built-in volume snapshots.
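Step 2 above can be sketched as a handful of flyctl calls. The tenant ID, region, volume size, and image tag below are hypothetical, and `run()` echoes each command rather than executing it, so this is a dry-run outline of the flow, not the production workflow itself:

```shell
#!/usr/bin/env bash
# Sketch: provision one tenant as its own Fly app (dry run).
set -euo pipefail

TENANT_ID="abc123"               # hypothetical signup id
APP="tenant-${TENANT_ID}"
REGION="iad"

run() { echo "+ $*"; }           # dry-run: print each command instead of running it

# 1. Create the tenant's own Fly app.
run fly apps create "$APP"

# 2. Attach a small regional volume for the DB and user files.
run fly volumes create data --region "$REGION" --size 3 -a "$APP"

# 3. Deploy the canonical Drupal + FrankenPHP image to the new app.
run fly deploy -a "$APP" --image registry.fly.io/decoupled-drupal-frankenphp:latest

# 4. Install Drupal with our install profile inside the machine.
run fly ssh console -a "$APP" -C "drush site:install dc_core -y"
```

The real workflow adds secrets, DNS, and dashboard bookkeeping around these calls, but the core is exactly this: four platform commands per tenant.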

The whole stack is a few hundred lines of shell scripts and GitHub Actions. Not because we think Kubernetes is overkill for everyone — plenty of companies our size run beautiful clusters — but because Fly's primitives fit this product shape with less code.

The practical result

A customer who signs up on Decoupled.io today gets:

  • Their own Fly app, created in about 2 minutes
  • Their own domain (tenant-xxx.fly.dev by default, custom domain optional)
  • Independent CPU and RAM limits, so their imports can't slow anyone else's site
  • Automatic daily volume snapshots for disaster recovery
  • Optional nightly off-platform backups to Cloudflare R2 (Backups add-on)
  • Per-tenant logs, metrics, and deploy history they can see in their dashboard

If you're evaluating where to run a multi-tenant headless CMS — whether that's Kubernetes, Fly, or something else — the question worth asking isn't "which platform is better?" It's "which platform fits the product shape with the least code I have to write and maintain?" For us, that answer was Fly. For other teams with different constraints, it might very well be Kubernetes.

Either way, the isolation story is what matters most. Whichever substrate gets your product to real per-tenant isolation with the least ops burden is the right one.