Continuous Deployment: Blue-Green and Canary Strategies

TopicTrick Team

Continuous deployment pipelines that push directly to production carry risk: a bad deployment can immediately impact all users. Deployment strategies like Blue-Green and Canary releases reduce this risk by controlling when and how new code reaches production — enabling instant rollback and incremental exposure to catch problems before they affect everyone.

This guide covers Blue-Green deployment with instantaneous rollback, Canary releases with automated promotion gates, Rolling deployments in Kubernetes, Feature flags for decoupling deploy from release, and the GitHub Actions CD workflow that automates the entire process.


Blue-Green Deployment

Blue-Green maintains two identical production environments. At any time, one is live (serving traffic) and one is idle (holding the previous version or the new version being tested).

text
Normal state:
  Users → Load Balancer → Blue (v1.4.2, live)
                          Green (idle)

Deployment:
  1. Deploy v1.4.3 to Green
  2. Test Green privately (smoke tests, health checks)
  3. If Green is healthy: flip load balancer
  4. Users now → Load Balancer → Green (v1.4.3, live)
  5. Blue still running v1.4.2 (instant rollback available)

Rollback:
  Problem detected → flip load balancer back to Blue → 0 seconds downtime

Blue-Green with AWS ECS

yaml
# .github/workflows/blue-green-deploy.yml
name: Blue-Green Deploy

on:
  workflow_dispatch:
    inputs:
      image_tag:
        description: 'Docker image tag to deploy'
        required: true

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production
    permissions:
      id-token: write
      contents: read

    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.AWS_ROLE_ARN }}
          aws-region: us-east-1

      - name: Determine current live environment
        id: current
        run: |
          # The live target group is whichever one the production listener rule
          # currently forwards to. In practice, tag your environments explicitly.
          LIVE=$(aws elbv2 describe-rules \
            --listener-arn ${{ vars.LISTENER_ARN }} \
            --query 'Rules[?Priority==`1`].Actions[0].TargetGroupArn' \
            --output text)
          echo "Current live target group: $LIVE"
          echo "live-tg=$LIVE" >> "$GITHUB_OUTPUT"

      - name: Deploy to inactive environment
        id: deploy
        run: |
          # Register new task definition with the updated image
          NEW_TASK_DEF=$(aws ecs register-task-definition \
            --cli-input-json "$(jq '.containerDefinitions[0].image = "${{ vars.ECR_REGISTRY }}/my-app:${{ inputs.image_tag }}"' \
              task-definition.json)" \
            --query 'taskDefinition.taskDefinitionArn' \
            --output text)

          # Update the inactive target group's ECS service
          aws ecs update-service \
            --cluster production \
            --service api-green \
            --task-definition "$NEW_TASK_DEF" \
            --force-new-deployment

          # Wait for deployment to stabilise
          aws ecs wait services-stable \
            --cluster production \
            --services api-green

          echo "new-task-def=$NEW_TASK_DEF" >> "$GITHUB_OUTPUT"

      - name: Run smoke tests against Green
        run: |
          GREEN_URL="https://green.internal.example.com"
          # Run health check
          HEALTH=$(curl -s -o /dev/null -w "%{http_code}" "$GREEN_URL/health")
          if [ "$HEALTH" != "200" ]; then
            echo "Health check failed: $HEALTH"
            exit 1
          fi

          # Run critical path smoke tests (tests read the target URL from BASE_URL)
          BASE_URL="$GREEN_URL" npx jest tests/smoke/

      - name: Switch traffic to Green
        run: |
          # Modify ALB listener rule to point to Green target group
          aws elbv2 modify-rule \
            --rule-arn ${{ vars.PRODUCTION_RULE_ARN }} \
            --actions Type=forward,TargetGroupArn=${{ vars.GREEN_TG_ARN }}

          echo "Traffic switched to Green (v${{ inputs.image_tag }})"

      - name: Verify production health
        run: |
          sleep 30  # Give the listener rule change time to take effect
          HEALTH=$(curl -s -o /dev/null -w "%{http_code}" "https://api.example.com/health")
          if [ "$HEALTH" != "200" ]; then
            echo "Production health check failed — initiating rollback"
            aws elbv2 modify-rule \
              --rule-arn ${{ vars.PRODUCTION_RULE_ARN }} \
              --actions Type=forward,TargetGroupArn=${{ vars.BLUE_TG_ARN }}
            exit 1
          fi
          echo "Deployment successful. Blue environment retained for rollback."

Canary Releases

A canary release sends a small percentage of real traffic to the new version. If error rates and latency remain acceptable, the percentage increases automatically. If a problem is detected, the canary is rolled back automatically before most users are affected.

text
Phase 1 (1% canary):
  100 requests → 99 to v1.4.2, 1 to v1.4.3
  Monitor: error rate, p99 latency, business metrics

Phase 2 (10% if Phase 1 healthy):
  100 requests → 90 to v1.4.2, 10 to v1.4.3
  Monitor: same metrics

Phase 3 (50%):
  ...

Phase 4 (100% if all phases healthy):
  All traffic → v1.4.3
  v1.4.2 decommissioned
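
The promotion decision at each phase is just a metrics comparison. Argo Rollouts (next section) automates it, but the same gate can be scripted against the Prometheus HTTP API. A minimal sketch, reusing the error-rate query and 1% threshold from the analysis template below; the PROMETHEUS_URL environment variable and file name are assumptions:

typescript
// canary-gate.ts - scripted promotion gate (env var and file name are assumptions)
const PROMETHEUS_URL = process.env.PROMETHEUS_URL ?? 'http://prometheus.monitoring:9090';

async function queryPrometheus(query: string): Promise<number> {
  const res = await fetch(`${PROMETHEUS_URL}/api/v1/query?query=${encodeURIComponent(query)}`);
  if (!res.ok) throw new Error(`Prometheus query failed: ${res.status}`);
  const body = await res.json();
  // Instant vector: result[0].value is [timestamp, "stringified number"]
  return parseFloat(body.data.result[0]?.value?.[1] ?? '0');
}

export async function canaryIsHealthy(): Promise<boolean> {
  const errorRate = await queryPrometheus(
    'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
  );
  // Promote only while the 5xx rate stays below 1% (same threshold as the analysis template)
  return errorRate < 0.01;
}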

Canary with Kubernetes and Argo Rollouts

yaml
# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-server
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5         # 5% canary
        - pause:
            duration: 5m       # Watch metrics for 5 minutes
        - setWeight: 20
        - pause:
            duration: 10m
        - setWeight: 50
        - pause:
            duration: 10m
        - setWeight: 100       # Full rollout

      # Automatic rollback if analysis fails
      analysis:
        templates:
          - templateName: error-rate-check
        startingStep: 1

  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server        # Must match spec.selector.matchLabels
    spec:
      containers:
        - name: api
          image: my-app:v1.4.3
yaml
# analysis-template.yaml — automatic rollback trigger
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.01    # < 1% error rate required
      failureLimit: 3                        # 3 failed measurements abort the rollout
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))

    - name: p99-latency
      interval: 1m
      successCondition: result[0] < 500     # < 500ms p99
      failureLimit: 3                        # same rollback threshold as error-rate
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_ms_bucket[5m])) by (le)
            )

If either metric fails three measurements, Argo Rollouts automatically shifts all traffic back to the stable version and marks the rollout as failed.


Rolling Deployments (Kubernetes Default)

Rolling deployments replace instances incrementally without requiring a duplicate environment:

yaml
# deployment.yaml
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # Never reduce capacity during rollout
      maxSurge: 2         # At most 2 extra pods during rollout

# Rollout sequence:
# Start: 6 × v1.4.2
# Surge:  6 × v1.4.2 + 2 × v1.4.3
# Replace: 4 × v1.4.2 + 2 × v1.4.3 (wait for health checks)
# Continue until: 0 × v1.4.2 + 6 × v1.4.3

Rolling deployments are simpler but offer less control than Blue-Green or Canary — there is no single point where you can "hold" and test before proceeding.
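
A rolling update is also only as safe as the readiness checks it waits on. A minimal sketch of the health endpoint those probes would hit, using Express; the dependency check is a hypothetical placeholder:

typescript
// health.ts - illustrative readiness endpoint for the probes a rolling update waits on
import express from 'express';

const app = express();

// Hypothetical dependency check; a real implementation would ping the database or queue
async function checkDependencies(): Promise<void> {
  // e.g. await db.raw('select 1');
}

app.get('/health', async (_req, res) => {
  try {
    await checkDependencies();
    res.status(200).json({ status: 'ok' });
  } catch {
    // A non-200 response keeps this pod out of rotation and stalls the rollout
    res.status(503).json({ status: 'unavailable' });
  }
});

app.listen(3000);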


Feature Flags: Decoupling Deploy from Release

Feature flags separate code deployment from feature activation. Code ships to production hidden behind a flag; a product manager or engineer can enable it for a subset of users without a deployment.

typescript
// Using a feature flag service (LaunchDarkly, Flagsmith, or a simple config)
import { getLDClient } from './featureFlags';

export async function handleCheckout(userId: string, cart: Cart) {
  const ldClient = getLDClient();
  const user = { key: userId };

  // Check flag — only 10% of users see the new checkout flow
  const useNewCheckout = await ldClient.variation('new-checkout-flow', user, false);

  if (useNewCheckout) {
    return newCheckoutFlow(cart);
  }
  return legacyCheckoutFlow(cart);
}
typescript
// Self-hosted feature flags with a simple JSON config
import { createHash } from 'crypto';

// flagsStore is your flag storage (database, Redis, or a JSON config file)
interface FeatureFlag {
  enabled: boolean;
  rollout: number;        // 0-100 percentage
  allowedUserIds?: string[];
}

async function isFeatureEnabled(flagName: string, userId: string): Promise<boolean> {
  const flag = await flagsStore.get<FeatureFlag>(flagName);
  if (!flag.enabled) return false;

  // Specific users (internal beta testers)
  if (flag.allowedUserIds?.includes(userId)) return true;

  // Percentage rollout (deterministic: same user always gets same result)
  const hash = parseInt(createHash('sha256').update(userId + flagName).digest('hex').slice(0, 8), 16);
  return (hash % 100) < flag.rollout;
}
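
With this scheme, ramping a rollout or hitting the kill switch is a data change rather than a deployment. A sketch using the hypothetical flagsStore above, assuming it also exposes a set method:

typescript
// Ramp the rollout from 10% to 50% of users without a deployment
await flagsStore.set<FeatureFlag>('new-checkout-flow', { enabled: true, rollout: 50 });

// Kill switch: disable the feature instantly if problems appear
await flagsStore.set<FeatureFlag>('new-checkout-flow', { enabled: false, rollout: 0 });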

Benefits of feature flags:

  • Deploy code without activating it — separated concerns
  • Dark launch: ship and test in production with internal users before public release
  • Kill switch: instantly disable a feature if problems are detected, no deployment needed
  • A/B testing: show different variations to different user segments
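
A/B testing relies on the same deterministic bucketing; a sketch extending the hashing approach above, reusing its createHash import (variant names are illustrative):

typescript
// Deterministic A/B bucketing: same user always lands in the same bucket per experiment
type Variant = 'control' | 'treatment';

function getVariant(experimentName: string, userId: string): Variant {
  const hash = parseInt(
    createHash('sha256').update(userId + experimentName).digest('hex').slice(0, 8),
    16
  );
  return hash % 100 < 50 ? 'treatment' : 'control';
}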

Database Migrations with Zero Downtime

The hardest part of zero-downtime deployments is database schema changes. The application must work with both the old and new schema simultaneously during the rollout period.

text
❌ Dangerous migration pattern (breaks Blue-Green):
  1. Deploy v2 code (expects new column 'preferences')
  2. Run migration to add 'preferences' column
  Problem: v2 fails until the migration runs, and if the column is NOT NULL,
  v1 code (still running alongside) breaks on every insert

✓ Safe expand-contract migration pattern:
  Phase 1 (expand, deploy v1.5):
    - Migration: add 'preferences' column (nullable, no default)
    - v1.5 code writes to BOTH old and new columns
    - v1.4 code (still running) ignores the new column

  Phase 2 (after v1.4 is fully replaced):
    - Backfill: copy data from the old column into the new column
    - Deploy v1.6: reads from the new column, still dual-writes for rollback safety

  Phase 3 (contract, after v1.6 is stable):
    - Deploy v1.7: writes to the new column only (dual-write removed)
    - Migration: drop the old column
typescript
// Dual-write pattern during migration
async function updateUserPreferences(userId: string, prefs: UserPreferences) {
  await db.transaction(async (trx) => {
    // Write to new column
    await trx('users')
      .where({ id: userId })
      .update({ preferences: JSON.stringify(prefs) });

    // Also write to old column (for rollback compatibility during migration period)
    await trx('users')
      .where({ id: userId })
      .update({ settings: legacyFormat(prefs) });  // Old format
  });
}
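
The read path needs the same care during Phase 2: while the backfill runs, reads should prefer the new column and fall back to the old one. A sketch under the same assumptions as the dual-write code above; fromLegacyFormat is a hypothetical inverse of the legacyFormat helper:

typescript
// Read path during the migration window: prefer the new column, fall back to the legacy one
async function getUserPreferences(userId: string): Promise<UserPreferences> {
  const row = await db('users')
    .where({ id: userId })
    .first('preferences', 'settings');

  if (!row) throw new Error(`User ${userId} not found`);

  if (row.preferences != null) {
    return JSON.parse(row.preferences);        // New column (dual-written or backfilled)
  }
  return fromLegacyFormat(row.settings);       // Old column, for rows not yet backfilled
}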

GitHub Actions CD Workflow

yaml
# .github/workflows/cd.yml
name: Continuous Deployment

on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ steps.build.outputs.tag }}
    steps:
      - uses: actions/checkout@v4

      - name: Build and push Docker image
        id: build
        run: |
          # Assumes the runner is already authenticated to the registry
          # (e.g. via aws-actions/amazon-ecr-login in a preceding step)
          TAG="${{ github.sha }}"
          docker build -t ${{ vars.ECR_REGISTRY }}/my-app:$TAG .
          docker push ${{ vars.ECR_REGISTRY }}/my-app:$TAG
          echo "tag=$TAG" >> "$GITHUB_OUTPUT"

  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    environment: staging
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.STAGING_ROLE_ARN }}
          aws-region: us-east-1

      - name: Start canary rollout in staging
        run: |
          # Assumes kubeconfig for the staging cluster is already configured
          # (e.g. via aws eks update-kubeconfig in a preceding step)
          kubectl argo rollouts set image api-server \
            api=${{ vars.ECR_REGISTRY }}/my-app:${{ needs.build.outputs.image-tag }}

      - name: Wait for staging rollout
        run: kubectl argo rollouts status api-server --timeout=10m

      - uses: actions/checkout@v4   # smoke tests live in the repository

      - name: Run smoke tests
        run: BASE_URL=https://staging.example.com npx jest tests/smoke/

  deploy-production:
    needs: [build, deploy-staging]
    runs-on: ubuntu-latest
    environment: production    # Requires manual approval
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.PROD_ROLE_ARN }}
          aws-region: us-east-1

      - name: Start canary rollout (5% → auto-promote)
        run: |
          # Assumes kubeconfig for the production cluster is already configured
          kubectl argo rollouts set image api-server \
            api=${{ vars.ECR_REGISTRY }}/my-app:${{ needs.build.outputs.image-tag }} \
            --namespace production

      - name: Monitor rollout
        run: |
          kubectl argo rollouts status api-server \
            --namespace production \
            --timeout=30m

Frequently Asked Questions

Q: Which strategy should I use — Blue-Green or Canary?

Blue-Green is better for smaller applications or when you need instant, guaranteed rollback with no partial-failure state. Canary is better for large-scale applications where you want to validate new code under real production load before full exposure, and when automated metrics can drive promotion decisions. In practice, many organisations use Blue-Green for infrastructure changes (database migrations, config changes) and Canary for application code changes.

Q: How do I handle database migrations with Blue-Green deployments?

The expand-contract pattern is the standard answer: (1) add new schema elements in a backward-compatible way, (2) deploy code that works with both old and new schema, (3) complete the migration, (4) remove old schema elements in a later release. Never add NOT NULL columns without defaults to tables that existing code is actively writing to.

Q: What metrics should trigger an automatic canary rollback?

Standard rollback triggers: HTTP 5xx error rate > 1%, p99 latency increase > 50% over baseline, business metric decline (order completion rate, signup conversion) > 10%, memory leak (increasing RSS over 30 minutes), and CPU saturation. Choose thresholds that are sensitive enough to catch real problems before affecting many users, but not so sensitive that normal traffic variance triggers false rollbacks.

Q: Can I use feature flags instead of deployment strategies?

Feature flags complement deployment strategies — they do not replace them. A deployment strategy controls how code versions reach production (Blue-Green, Canary). A feature flag controls whether a feature within the deployed code is active for specific users. Use both: deploy with a Canary strategy, activate the feature with a flag for internal users first, then roll it out to all users via the flag without another deployment.


Key Takeaway

Blue-Green deployment gives you instantaneous rollback by maintaining two production environments and flipping traffic at the load balancer. Canary deployment reduces blast radius by routing a small percentage of real traffic to the new version, monitoring error rates and latency, and promoting automatically only when metrics confirm the new version is healthy. Feature flags decouple deployment from release, enabling dark launches and kill switches without redeployment. All three strategies require discipline around database migrations — the expand-contract pattern ensures schema changes are backward-compatible with both versions running simultaneously.

Read next: GitHub Actions: Secrets and Environments →


Part of the GitHub Mastery Course — engineering the release.