
Multi-Cloud Architecture: Patterns for Vendor Independence and Resilience

TopicTrick Team

Multi-cloud architecture runs workloads across two or more cloud providers simultaneously. It is not the same as using one cloud provider for different services (that is just cloud adoption). True multi-cloud means your critical workloads can operate on AWS, Azure, or Google Cloud — with automation to shift traffic between them.

The motivations are real: vendor lock-in risk, regulatory data residency requirements, best-of-breed service selection, and genuine resilience against a single provider's outage. But multi-cloud carries a substantial complexity tax. This guide covers the architecture patterns that make multi-cloud work, the data replication challenge, the tooling layer that makes providers interchangeable, and the honest cost-benefit analysis for deciding if it is right for your organisation.


When Multi-Cloud Is and Is Not Appropriate

Before choosing patterns, be clear about the motivation:

Motivation                                 | Multi-cloud?    | Alternative
-------------------------------------------|-----------------|---------------------------------------------------
Resilience against provider outage         | Yes             | Multi-region on one provider (cheaper)
Regulatory data residency                  | Sometimes       | Region selection on one provider often suffices
Vendor negotiation leverage                | Yes             | Often achievable with a documented exit strategy
Best-of-breed services (AWS ML + GCP data) | Yes (polycloud) | Accept trade-offs of one provider
Cost optimisation                          | Sometimes       | Reserved instances + savings plans on one provider
Team of < 20 engineers                     | No              | Complexity will consume the team
Early-stage startup                        | No              | Focus on product, not infrastructure sovereignty

Multi-cloud is appropriate for: large enterprises with regulatory requirements, organisations with genuine risk management mandates, and teams with specific performance requirements that no single provider satisfies. It is not appropriate for most startups and small-to-mid engineering teams — the operational overhead is not worth it.


Pattern 1: Active-Active Multi-Cloud

Both clouds run production workloads simultaneously. A global load balancer distributes traffic based on latency, health, or geography.

text
Users (global)
      │
      ▼
┌─────────────────────────┐
│  Global Load Balancer   │  (Cloudflare, AWS Route 53 Latency, Fastly)
│  Health check polling   │
└────────────┬────────────┘
             │
     ┌───────┴───────┐
     ▼               ▼
┌──────────┐   ┌──────────┐
│   AWS    │   │  Google  │
│  us-east │   │  europe  │
│ (50%)    │   │ (50%)    │
└──────────┘   └──────────┘

Traffic routing logic:

  • Normal: distribute by geography (US users → AWS, EU users → GCP)
  • Provider incident: detect via health checks, shift 100% to healthy provider within 30-60 seconds
  • Maintenance: drain one cloud, perform maintenance, restore
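The routing logic above can be sketched as a weight calculator. This is an illustrative sketch, not tied to any real load balancer API; the `Origin` shape and threshold are assumptions:

```typescript
// Sketch of the failover decision. `consecutiveFailures` would come
// from health-check polling against each origin's /health endpoint.

interface Origin {
  name: string;
  weight: number;              // normal-operation traffic share
  consecutiveFailures: number; // from health-check polling
}

const FAILURE_THRESHOLD = 2;   // failures before an origin is drained

function computeWeights(origins: Origin[]): Map<string, number> {
  const weights = new Map<string, number>();
  const healthy = origins.filter(o => o.consecutiveFailures < FAILURE_THRESHOLD);
  if (healthy.length === 0) {
    // All origins failing: keep normal weights rather than drop all traffic
    origins.forEach(o => weights.set(o.name, o.weight));
    return weights;
  }
  const total = healthy.reduce((sum, o) => sum + o.weight, 0);
  origins.forEach(o => {
    const isHealthy = o.consecutiveFailures < FAILURE_THRESHOLD;
    // Healthy origins absorb 100% of traffic, proportional to normal weights
    weights.set(o.name, isHealthy ? Math.round((o.weight / total) * 100) : 0);
  });
  return weights;
}
```

With both origins healthy this returns the normal 50/50 split; when one origin crosses the failure threshold, the other absorbs 100% of traffic, matching the 30-60 second shift described above.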

Cloudflare Load Balancing configuration

javascript
// Cloudflare Load Balancer rule (illustrative: field names follow the
// Cloudflare dashboard concepts rather than the exact API schema)
{
  "origins": [
    {
      "name": "aws-us-east",
      "address": "api.us-east.example.com",
      "weight": 50,
      "health_threshold": 2
    },
    {
      "name": "gcp-europe",
      "address": "api.europe.example.com",
      "weight": 50,
      "health_threshold": 2
    }
  ],
  "health_checks": {
    "path": "/health",
    "interval": 30,
    "timeout": 5,
    "expected_codes": "200"
  },
  "steering_policy": "geo"
}

Pattern 2: Active-Passive Multi-Cloud

One cloud handles all production traffic. The second cloud is a warm standby that takes over if the primary fails.

text
Normal:
Users → AWS (100% traffic) → Application
                              GCP (warm standby, synced, no traffic)

Failover:
Users → AWS (health check fails) → GCP (traffic redirected within minutes)

Active-passive is significantly cheaper than active-active — the standby cloud runs minimal infrastructure (enough to receive traffic, not full production capacity). It is also simpler operationally.

Trade-off: failover takes time (2-10 minutes for DNS TTL to propagate plus application warm-up) and may involve some data loss (RPO = time since the last replication). For most disaster-recovery scenarios this is acceptable.
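Those recovery numbers can be made concrete. A minimal sketch, assuming a replication interval, DNS TTL, and warm-up time you would substitute with your own measured values:

```typescript
// Worst-case recovery estimates for active-passive failover.
// All three inputs are assumptions to be replaced with measured values.

interface FailoverParams {
  replicationIntervalSec: number; // how often the standby is synced
  dnsTtlSec: number;              // TTL on the failover DNS record
  warmUpSec: number;              // application warm-up on the standby
}

// RPO: at worst, everything written since the last replication run is lost
function worstCaseRpoSec(p: FailoverParams): number {
  return p.replicationIntervalSec;
}

// RTO: clients keep resolving to the dead primary until the TTL expires,
// then wait for the standby to warm up
function worstCaseRtoSec(p: FailoverParams): number {
  return p.dnsTtlSec + p.warmUpSec;
}
```

With 60-second replication, a 120-second TTL, and a 60-second warm-up, the worst-case RPO is 60 s and the worst-case RTO is 180 s, inside the 2-10 minute window quoted above.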


Pattern 3: Cloud Specialisation (Polycloud)

Different workloads run on different clouds based on best-of-breed service strengths, not redundancy. This is the most common "multi-cloud" pattern in practice.

text
Workload distribution by strength:
  ML training jobs       → Google Cloud (TPUs, Vertex AI)
  Core application       → AWS (broadest service ecosystem)
  Microsoft 365 users    → Azure (Active Directory integration)
  Edge/CDN               → Cloudflare (Workers at the edge)
  Static assets          → Cloudflare R2 (cheapest egress)

Architecture challenge: workloads must communicate across clouds. Options:

  1. Public internet: simplest, but adds latency and egress costs
  2. Cloud interconnects: AWS Direct Connect + Google Cloud Interconnect for private, low-latency cross-cloud networking (expensive, justified for high-volume data transfer)
  3. Data plane in the middle: use Cloudflare or a transit VPC as the hub

Pattern 4: Cloud-Agnostic Abstraction Layer

To avoid rewriting application code for each cloud's APIs, build an abstraction layer that hides provider-specific implementations.

Storage abstraction

typescript
// SDK imports used by the implementations below
import {
  S3Client, PutObjectCommand, GetObjectCommand, DeleteObjectCommand
} from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';
import { Storage } from '@google-cloud/storage';

// Abstract interface — application uses this
interface ObjectStorage {
  upload(key: string, data: Buffer, contentType: string): Promise<void>;
  download(key: string): Promise<Buffer>;
  delete(key: string): Promise<void>;
  signedUrl(key: string, expirySeconds: number): Promise<string>;
}

// AWS S3 implementation
class S3Storage implements ObjectStorage {
  private client = new S3Client({ region: process.env.AWS_REGION });

  async upload(key: string, data: Buffer, contentType: string) {
    await this.client.send(new PutObjectCommand({
      Bucket: process.env.S3_BUCKET,
      Key: key,
      Body: data,
      ContentType: contentType
    }));
  }

  async download(key: string) {
    const response = await this.client.send(new GetObjectCommand({
      Bucket: process.env.S3_BUCKET,
      Key: key
    }));
    return Buffer.from(await response.Body!.transformToByteArray());
  }

  async signedUrl(key: string, expirySeconds: number) {
    return getSignedUrl(this.client, new GetObjectCommand({
      Bucket: process.env.S3_BUCKET,
      Key: key
    }), { expiresIn: expirySeconds });
  }

  async delete(key: string) {
    await this.client.send(new DeleteObjectCommand({
      Bucket: process.env.S3_BUCKET,
      Key: key
    }));
  }
}

// Google Cloud Storage implementation
class GCSStorage implements ObjectStorage {
  private bucket = new Storage().bucket(process.env.GCS_BUCKET!);

  async upload(key: string, data: Buffer, contentType: string) {
    await this.bucket.file(key).save(data, { contentType });
  }

  async download(key: string) {
    const [contents] = await this.bucket.file(key).download();
    return contents;
  }

  async signedUrl(key: string, expirySeconds: number) {
    const [url] = await this.bucket.file(key).getSignedUrl({
      action: 'read',
      expires: Date.now() + expirySeconds * 1000
    });
    return url;
  }

  async delete(key: string) {
    await this.bucket.file(key).delete();
  }
}

// Factory — swap providers via environment variable
export function createStorage(): ObjectStorage {
  switch (process.env.CLOUD_PROVIDER) {
    case 'aws': return new S3Storage();
    case 'gcp': return new GCSStorage();
    default: throw new Error(`Unknown provider: ${process.env.CLOUD_PROVIDER}`);
  }
}

Apply the same pattern to message queues (SQS vs Pub/Sub), secrets managers (AWS Secrets Manager vs Google Secret Manager), and container registries (ECR vs Artifact Registry).
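A queue abstraction following the same pattern might look like the sketch below. The interface and the in-memory implementation are illustrative; SQS and Pub/Sub implementations would sit behind the same interface, selected by the same factory approach:

```typescript
// Queue abstraction mirroring the ObjectStorage pattern above.
// Cloud-specific implementations (SQS, Pub/Sub) would implement this
// interface; the in-memory version is useful for local dev and tests.

interface MessageQueue {
  publish(topic: string, payload: object): Promise<void>;
  subscribe(topic: string, handler: (payload: object) => Promise<void>): void;
}

class InMemoryQueue implements MessageQueue {
  private handlers = new Map<string, Array<(p: object) => Promise<void>>>();

  async publish(topic: string, payload: object) {
    // Deliver to every subscriber registered for this topic
    for (const handler of this.handlers.get(topic) ?? []) {
      await handler(payload);
    }
  }

  subscribe(topic: string, handler: (payload: object) => Promise<void>) {
    const list = this.handlers.get(topic) ?? [];
    list.push(handler);
    this.handlers.set(topic, list);
  }
}
```

Application code depends only on `MessageQueue`, so switching providers is a factory change, not a code change.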


The Data Replication Challenge

Moving compute across clouds is relatively straightforward. Moving data is the hard problem:

  • Cloud-native managed databases (RDS Aurora, Cloud Spanner) do not replicate natively across providers
  • Data egress costs can reach $0.09/GB — significant at scale
  • Consistency guarantees differ between providers
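The egress figure is worth quantifying before committing to a replication design. A back-of-the-envelope sketch using the rate quoted above (actual rates vary by provider, volume tier, and destination):

```typescript
// Monthly cross-cloud egress cost at a given per-GB rate.
// $0.09/GB is the upper figure cited in the text, used here as a default.

function monthlyEgressCostUsd(gbPerDay: number, ratePerGb = 0.09): number {
  return gbPerDay * 30 * ratePerGb;
}

// Replicating 500 GB/day between clouds costs ~$1,350/month at $0.09/GB
```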

Cloud-agnostic distributed databases

These databases are designed to run across multiple clouds with native replication:

Database      | Type                                                | Multi-cloud replication
--------------|-----------------------------------------------------|---------------------------------
CockroachDB   | Distributed SQL (PostgreSQL compatible)             | Native cross-region, cross-cloud
YugabyteDB    | Distributed SQL (PostgreSQL + Cassandra compatible) | Native cross-cloud
TiDB          | Distributed SQL (MySQL compatible)                  | Via TiDB Cloud
Cassandra     | Wide-column NoSQL                                   | Multi-datacenter, cloud-agnostic
MongoDB Atlas | Document (managed)                                  | Multi-cloud clusters

CockroachDB multi-cloud configuration

shell
# CockroachDB: start nodes across clouds with locality flags
# (illustrative; a real cluster also needs --store and certificate flags)

# Node startup (AWS node)
cockroach start \
  --locality=cloud=aws,region=us-east-1,zone=us-east-1a \
  --join=aws-node:26257,gcp-node:26257

# Node startup (GCP node)
cockroach start \
  --locality=cloud=gcp,region=europe-west1,zone=europe-west1-b \
  --join=aws-node:26257,gcp-node:26257

sql
-- Configure replication to keep data on both clouds (run via cockroach sql)
ALTER TABLE orders CONFIGURE ZONE USING
  num_replicas = 5,
  constraints = '{"+cloud=aws": 2, "+cloud=gcp": 2}',
  lease_preferences = '[[+cloud=aws]]';  -- Prefer leaseholders (fast reads) on AWS

Kubernetes as the Cloud-Agnostic Compute Layer

Kubernetes abstracts the underlying cloud from your application. The same Deployment YAML runs on AWS EKS, Google GKE, and Azure AKS without modification.

yaml
# deployment.yaml — identical on any cloud
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server    # must match the selector above
    spec:
      containers:
        - name: api
          image: my-registry.example.com/api:v2.4.1   # Private registry (cloud-agnostic)
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "2Gi"

The cloud-specific parts live in the infrastructure layer (Terraform, Pulumi), not in the application manifests:

hcl
# terraform/aws/main.tf
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"
  cluster_name = "production"
  cluster_version = "1.29"
  vpc_id = module.vpc.vpc_id
}

# terraform/gcp/main.tf  (same application, different infrastructure)
resource "google_container_cluster" "production" {
  name     = "production"
  location = "europe-west1"
  initial_node_count = 3
}

Multi-Cloud Cost Reality

Multi-cloud has significant cost overhead beyond just running workloads on two providers:

Cost category                                                      | Typical impact
-------------------------------------------------------------------|--------------------------------------
Duplicate infrastructure overhead                                  | +30-50% compute costs
Data egress between clouds                                         | $0.02-$0.09/GB transferred
Network connectivity (Direct Connect / Interconnect)               | $1,000-$5,000/month per connection
Doubled operational tooling (two monitoring stacks, two pipelines) | Engineering time
Cloud-agnostic database licensing (CockroachDB Enterprise)         | Significant for large data volumes
Security/compliance review for each provider                       | Ongoing audit cost

The cost must be weighed against the benefit: genuine resilience value, regulatory compliance value, or negotiation leverage with providers.


Frequently Asked Questions

Q: Is multi-cloud the same as using multiple cloud services from different providers?

No. Using AWS for hosting and Cloudflare for CDN is not multi-cloud in the architectural sense — it is normal practice. Multi-cloud architecture means your core application and data can run on multiple full cloud platforms (AWS, Azure, GCP) with the ability to shift workloads between them. The defining characteristic is portability of the entire workload stack, not just using services from different vendors.

Q: What is the difference between multi-cloud and hybrid cloud?

Hybrid cloud combines on-premises infrastructure with public cloud. Your data centre is one "cloud" and AWS is another. Multi-cloud uses two or more public cloud providers (AWS + GCP, AWS + Azure) without on-premises infrastructure. Many large enterprises operate both: an on-premises data centre plus two public clouds — this is sometimes called "hybrid multi-cloud."

Q: Can small teams benefit from multi-cloud?

Rarely. The operational overhead of managing two cloud environments, two CI/CD pipelines, two monitoring stacks, and cross-cloud networking typically consumes more engineering capacity than the resilience benefit justifies. For small teams, a multi-region deployment on a single cloud provider (AWS us-east + us-west) provides 95% of the resilience benefit at 20% of the operational cost. Only consider multi-cloud when a specific regulatory, contractual, or risk management requirement demands it.

Q: How do I implement a cloud exit strategy without running multi-cloud?

A cloud exit strategy documents how you would migrate to another provider if required. It does not require actively running on multiple clouds. Key elements: maintain infrastructure-as-code (Terraform) for all resources; use cloud-agnostic technologies for critical data (PostgreSQL instead of Aurora Serverless v2, Kafka instead of Kinesis); document the migration runbook including data export/import procedures; and test the runbook periodically. This satisfies most regulatory cloud exit requirements without the ongoing cost of true multi-cloud operation.


Key Takeaway

Multi-cloud architecture provides genuine resilience, regulatory compliance, and vendor independence — at significant operational cost. The core enabling patterns are: a global load balancer for traffic routing, cloud-agnostic databases (CockroachDB, YugabyteDB) for data portability, Kubernetes as a portable compute layer, and abstraction interfaces that hide provider-specific APIs from application code. For most organisations, cloud specialisation (polycloud) — using the best service from each provider for specific workloads — delivers more value than full active-active redundancy. Only pursue full multi-cloud when the business requires genuine resilience against a complete cloud provider failure.

Read next: Peer-to-Peer Architecture: Blockchain and Decentralization →


Part of the Software Architecture Hub — engineering the sovereignty.