Continuous Deployment: Blue-Green and Canary

Continuous deployment pipelines that push directly to production carry risk: a bad deployment can immediately impact all users. Deployment strategies like Blue-Green and Canary releases reduce this risk by controlling when and how new code reaches production — enabling instant rollback and incremental exposure to catch problems before they affect everyone.
This guide covers Blue-Green deployment with instantaneous rollback, Canary releases with automated promotion gates, Rolling deployments in Kubernetes, Feature flags for decoupling deploy from release, and the GitHub Actions CD workflow that automates the entire process.
Blue-Green Deployment
Blue-Green maintains two identical production environments. At any time, one is live (serving traffic) and one is idle (holding the previous version or the new version being tested).
Normal state:
Users → Load Balancer → Blue  (v1.4.2, live)
                        Green (idle)
Deployment:
1. Deploy v1.4.3 to Green
2. Test Green privately (smoke tests, health checks)
3. If Green is healthy: flip load balancer
4. Users now → Load Balancer → Green (v1.4.3, live)
5. Blue still running v1.4.2 (instant rollback available)
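The flip in steps 3–4 is just a pointer swap at the load balancer. As a toy model of that state (illustrative TypeScript only, not tied to any real load balancer API):

```typescript
type Env = "blue" | "green";

interface LoadBalancer {
  live: Env; // which environment currently receives traffic
}

// The idle environment is whichever one is not live.
function idleOf(lb: LoadBalancer): Env {
  return lb.live === "blue" ? "green" : "blue";
}

// Flipping traffic (deploy step 3) and rolling back are the same
// operation: swap the pointer. No instances start or stop.
function flip(lb: LoadBalancer): LoadBalancer {
  return { live: idleOf(lb) };
}

const before: LoadBalancer = { live: "blue" };
const after = flip(before);     // → { live: "green" }
const rolledBack = flip(after); // → { live: "blue" }
```

Rollback is instant because it is the same pointer swap in reverse; the previous environment keeps running until you decommission it.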
Rollback:
Problem detected → flip load balancer back to Blue → 0 seconds downtime
Blue-Green with AWS ECS
# .github/workflows/blue-green-deploy.yml
name: Blue-Green Deploy
on:
  workflow_dispatch:
    inputs:
      image_tag:
        description: 'Docker image tag to deploy'
        required: true

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.AWS_ROLE_ARN }}
          aws-region: us-east-1

      - name: Determine current live environment
        id: current
        run: |
          CURRENT=$(aws ecs describe-services \
            --cluster production \
            --services api \
            --query 'services[0].deployments[0].taskDefinition' \
            --output text)
          echo "Currently deployed task definition: $CURRENT"
          # In practice, tag your environments explicitly
          LIVE=$(aws elbv2 describe-rules \
            --listener-arn ${{ vars.LISTENER_ARN }} \
            --query 'Rules[?Priority==`1`].Actions[0].TargetGroupArn' \
            --output text)
          echo "live-tg=$LIVE" >> "$GITHUB_OUTPUT"

      - name: Deploy to inactive environment
        id: deploy
        run: |
          # Register new task definition with updated image
          NEW_TASK_DEF=$(aws ecs register-task-definition \
            --cli-input-json "$(jq '.containerDefinitions[0].image = "my-app:${{ inputs.image_tag }}"' task-definition.json)" \
            --query 'taskDefinition.taskDefinitionArn' \
            --output text)
          # Update the inactive target group's ECS service
          aws ecs update-service \
            --cluster production \
            --service api-green \
            --task-definition "$NEW_TASK_DEF" \
            --force-new-deployment
          # Wait for deployment to stabilise
          aws ecs wait services-stable \
            --cluster production \
            --services api-green
          echo "new-task-def=$NEW_TASK_DEF" >> "$GITHUB_OUTPUT"

      - name: Run smoke tests against Green
        run: |
          GREEN_URL="https://green.internal.example.com"
          # Health check must return 200 before running the full suite
          HEALTH=$(curl -s -o /dev/null -w "%{http_code}" "$GREEN_URL/health")
          if [ "$HEALTH" != "200" ]; then
            echo "Health check failed: $HEALTH"
            exit 1
          fi
          # Run critical path smoke tests against the Green URL
          SMOKE_BASE_URL="$GREEN_URL" npx jest tests/smoke/

      - name: Switch traffic to Green
        run: |
          # Modify ALB listener rule to point to the Green target group
          aws elbv2 modify-rule \
            --rule-arn ${{ vars.PRODUCTION_RULE_ARN }} \
            --actions Type=forward,TargetGroupArn=${{ vars.GREEN_TG_ARN }}
          echo "Traffic switched to Green (${{ inputs.image_tag }})"

      - name: Verify production health
        run: |
          sleep 30  # Allow the traffic switch to settle
          HEALTH=$(curl -s -o /dev/null -w "%{http_code}" "https://api.example.com/health")
          if [ "$HEALTH" != "200" ]; then
            echo "Production health check failed — initiating rollback"
            aws elbv2 modify-rule \
              --rule-arn ${{ vars.PRODUCTION_RULE_ARN }} \
              --actions Type=forward,TargetGroupArn=${{ vars.BLUE_TG_ARN }}
            exit 1
          fi
          echo "Deployment successful. Blue environment retained for rollback."
Canary Releases
A canary release sends a small percentage of real traffic to the new version. If error rates and latency remain acceptable, the percentage increases automatically. If a problem is detected, the canary is rolled back automatically before most users are affected.
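The weighted split itself is simple. As a sketch of per-request routing (illustrative only; real canary splits happen at the load balancer or service mesh, and the random source is injected here for testability):

```typescript
type Version = "stable" | "canary";

// Route one request: `canaryPercent` of traffic goes to the canary.
// `rand` returns a number in [0, 1) and defaults to Math.random.
function routeRequest(
  canaryPercent: number,
  rand: () => number = Math.random
): Version {
  return rand() * 100 < canaryPercent ? "canary" : "stable";
}

// Phase 1: with canaryPercent = 1, roughly 1 request in 100
// hits the new version.
const v = routeRequest(1);
```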
Phase 1 (1% canary):
100 requests → 99 to v1.4.2, 1 to v1.4.3
Monitor: error rate, p99 latency, business metrics
Phase 2 (10% if Phase 1 healthy):
100 requests → 90 to v1.4.2, 10 to v1.4.3
Monitor: same metrics
Phase 3 (50%):
...
Phase 4 (100% if all phases healthy):
All traffic → v1.4.3
v1.4.2 decommissioned
Canary with Kubernetes and Argo Rollouts
# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-server
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5      # 5% canary
        - pause:
            duration: 5m    # Watch metrics for 5 minutes
        - setWeight: 20
        - pause:
            duration: 10m
        - setWeight: 50
        - pause:
            duration: 10m
        - setWeight: 100    # Full rollout
      # Automatic rollback if analysis fails
      analysis:
        templates:
          - templateName: error-rate-check
        startingStep: 1
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: my-app:v1.4.3

# analysis-template.yaml — automatic rollback trigger
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.01   # < 1% error rate required
      failureLimit: 3                      # 3 consecutive failures = rollback
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
    - name: p99-latency
      interval: 1m
      successCondition: result[0] < 500    # < 500ms p99
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_ms_bucket[5m])) by (le)
            )

If either metric fails 3 consecutive checks, Argo Rollouts automatically shifts all traffic back to the stable version and marks the rollout as failed.
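Outside Argo, the same gate can be implemented by hand: query Prometheus's instant-query HTTP API (`GET /api/v1/query`) and compare the result to a threshold. A hedged sketch; the Prometheus address mirrors the manifest above, and the surrounding rollback machinery is left out:

```typescript
// Evaluate one promotion gate: is the observed value under the threshold?
// Mirrors the successCondition `result[0] < 0.01` in the template above.
function gatePasses(observed: number, threshold: number): boolean {
  return observed < threshold;
}

// Fetch a single instant-query value from Prometheus (Node 18+ fetch).
async function queryPrometheus(query: string): Promise<number> {
  const PROM_URL = "http://prometheus.monitoring:9090"; // assumed address
  const res = await fetch(
    `${PROM_URL}/api/v1/query?query=${encodeURIComponent(query)}`
  );
  const body = await res.json();
  // Instant queries return { data: { result: [{ value: [ts, "0.004"] }] } }
  return parseFloat(body.data.result[0]?.value[1] ?? "0");
}

// Example: a 0.4% error rate passes the 1% gate; 1.2% fails it.
```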
Rolling Deployments (Kubernetes Default)
Rolling deployments replace instances incrementally without requiring a duplicate environment:
# deployment.yaml
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # Never reduce capacity during rollout
      maxSurge: 2         # At most 2 extra pods during rollout

# Rollout sequence:
# Start:   6 × v1.4.2
# Surge:   6 × v1.4.2 + 2 × v1.4.3
# Replace: 4 × v1.4.2 + 2 × v1.4.3 (wait for health checks)
# Continue until: 0 × v1.4.2 + 6 × v1.4.3

Rolling deployments are simpler but offer less control than Blue-Green or Canary — there is no single point where you can "hold" and test before proceeding.
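The surge arithmetic can be simulated to see why capacity never drops (a toy model; the real scheduler also accounts for pod readiness and termination grace periods):

```typescript
// Simulate a rolling update with maxUnavailable: 0.
// Each round surges up to `maxSurge` new pods, waits for them to pass
// health checks, then retires the same number of old pods.
function rollingSteps(replicas: number, maxSurge: number): string[] {
  const steps: string[] = [];
  let oldPods = replicas;
  let newPods = 0;
  while (oldPods > 0) {
    const surge = Math.min(maxSurge, oldPods);
    newPods += surge;
    steps.push(`${oldPods} × old + ${newPods} × new (surging)`);
    oldPods -= surge; // old pods retire only after new ones are healthy
    steps.push(`${oldPods} × old + ${newPods} × new`);
  }
  return steps;
}

// rollingSteps(6, 2) walks 6 → 4 → 2 → 0 old pods while total
// ready pods never dips below the desired 6.
```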
Feature Flags: Decoupling Deploy from Release
Feature flags separate code deployment from feature activation. Code ships to production hidden behind a flag; a product manager or engineer can enable it for a subset of users without a deployment.
// Using a feature flag service (LaunchDarkly, Flagsmith, or a simple config)
import { getLDClient } from './featureFlags';

export async function handleCheckout(userId: string, cart: Cart) {
  const ldClient = getLDClient();
  const user = { key: userId };
  // Check flag — only 10% of users see the new checkout flow
  const useNewCheckout = await ldClient.variation('new-checkout-flow', user, false);
  if (useNewCheckout) {
    return newCheckoutFlow(cart);
  }
  return legacyCheckoutFlow(cart);
}

// Self-hosted feature flags with a simple JSON config
import { createHash } from 'crypto';

interface FeatureFlag {
  enabled: boolean;
  rollout: number; // 0-100 percentage
  allowedUserIds?: string[];
}

async function isFeatureEnabled(flagName: string, userId: string): Promise<boolean> {
  const flag = await flagsStore.get<FeatureFlag>(flagName);
  if (!flag.enabled) return false;
  // Specific users (internal beta testers)
  if (flag.allowedUserIds?.includes(userId)) return true;
  // Percentage rollout (deterministic: same user always gets same result)
  const hash = parseInt(createHash('sha256').update(userId + flagName).digest('hex').slice(0, 8), 16);
  return (hash % 100) < flag.rollout;
}

Benefits of feature flags:
- Deploy code without activating it — separated concerns
- Dark launch: ship and test in production with internal users before public release
- Kill switch: instantly disable a feature if problems are detected, no deployment needed
- A/B testing: show different variations to different user segments
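The percentage rollout in the self-hosted example relies on deterministic bucketing: the same user always hashes into the same bucket, so a flag never flickers between requests. Isolated for illustration:

```typescript
import { createHash } from "node:crypto";

// Hash user + flag name into a stable bucket in [0, 100).
function bucketFor(userId: string, flagName: string): number {
  const hex = createHash("sha256").update(userId + flagName).digest("hex");
  return parseInt(hex.slice(0, 8), 16) % 100;
}

// A user is inside a 10% rollout iff their bucket is < 10.
// Including the flag name means different flags bucket independently,
// so rollouts do not all hit the same slice of users.
const bucket = bucketFor("user-42", "new-checkout-flow");
const same = bucketFor("user-42", "new-checkout-flow");
// bucket === same, on every call
```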
Database Migrations with Zero Downtime
The hardest part of zero-downtime deployments is database schema changes. The application must work with both the old and new schema simultaneously during the rollout period.
✗ Dangerous migration pattern (breaks Blue-Green):
1. Deploy v2 code (expects new column 'preferences')
2. Run migration to add 'preferences' column
Problem: v1 code is still running during migration, fails if new column appears
✓ Safe expand-contract migration pattern:
Phase 1 (deploy v1.5):
- Add migration: add 'preferences' column (nullable, no default)
- v1.5 code writes to BOTH old and new column
- v1.4 code (still running) ignores new column
Phase 2 (after v1.4 is fully replaced):
- Backfill: copy data from old column to new column
- Deploy v1.6: reads from new column, writes to new column only
Phase 3 (after v1.6 is stable):
- Remove migration: drop old column
- Remove dual-write code from v1.7

// Dual-write pattern during migration
async function updateUserPreferences(userId: string, prefs: UserPreferences) {
  await db.transaction(async (trx) => {
    // Write to new column
    await trx('users')
      .where({ id: userId })
      .update({ preferences: JSON.stringify(prefs) });
    // Also write to old column (for rollback compatibility during migration period)
    await trx('users')
      .where({ id: userId })
      .update({ settings: legacyFormat(prefs) }); // Old format
  });
}

GitHub Actions CD Workflow
# .github/workflows/cd.yml
name: Continuous Deployment
on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ steps.build.outputs.tag }}
    steps:
      - uses: actions/checkout@v4
      - name: Build and push Docker image
        id: build
        run: |
          TAG="${{ github.sha }}"
          docker build -t ${{ vars.ECR_REGISTRY }}/my-app:$TAG .
          docker push ${{ vars.ECR_REGISTRY }}/my-app:$TAG
          echo "tag=$TAG" >> "$GITHUB_OUTPUT"

  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    environment: staging
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.STAGING_ROLE_ARN }}
          aws-region: us-east-1
      - name: Deploy canary to staging (10%)
        run: |
          kubectl argo rollouts set image api-server \
            api=${{ vars.ECR_REGISTRY }}/my-app:${{ needs.build.outputs.image-tag }}
      - name: Wait for staging rollout
        run: kubectl argo rollouts status api-server --timeout=10m
      - name: Run smoke tests
        run: SMOKE_BASE_URL=https://staging.example.com npx jest tests/smoke/

  deploy-production:
    needs: [build, deploy-staging]
    runs-on: ubuntu-latest
    environment: production # Requires manual approval
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.PROD_ROLE_ARN }}
          aws-region: us-east-1
      - name: Start canary rollout (5% → auto-promote)
        run: |
          kubectl argo rollouts set image api-server \
            api=${{ vars.ECR_REGISTRY }}/my-app:${{ needs.build.outputs.image-tag }} \
            --namespace production
      - name: Monitor rollout
        run: |
          kubectl argo rollouts status api-server \
            --namespace production \
            --timeout=30m

Frequently Asked Questions
Q: Which strategy should I use — Blue-Green or Canary?
Blue-Green is better for smaller applications or when you need instant, guaranteed rollback with no partial-failure state. Canary is better for large-scale applications where you want to validate new code under real production load before full exposure, and when automated metrics can drive promotion decisions. In practice, many organisations use Blue-Green for infrastructure changes (database migrations, config changes) and Canary for application code changes.
Q: How do I handle database migrations with Blue-Green deployments?
The expand-contract pattern is the standard answer: (1) add new schema elements in a backward-compatible way, (2) deploy code that works with both old and new schema, (3) complete the migration, (4) remove old schema elements in a later release. Never add NOT NULL columns without defaults to tables that existing code is actively writing to.
Q: What metrics should trigger an automatic canary rollback?
Standard rollback triggers: HTTP 5xx error rate > 1%, p99 latency increase > 50% over baseline, business metric decline (order completion rate, signup conversion) > 10%, memory leak (increasing RSS over 30 minutes), and CPU saturation. Choose thresholds that are sensitive enough to catch real problems before affecting many users, but not so sensitive that normal traffic variance triggers false rollbacks.
Q: Can I use feature flags instead of deployment strategies?
Feature flags complement deployment strategies — they do not replace them. A deployment strategy controls how code versions reach production (Blue-Green, Canary). A feature flag controls whether a feature within the deployed code is active for specific users. Use both: deploy with a Canary strategy, activate the feature with a flag for internal users first, then roll it out to all users via the flag without another deployment.
Key Takeaway
Blue-Green deployment gives you instantaneous rollback by maintaining two production environments and flipping traffic at the load balancer. Canary deployment reduces blast radius by routing a small percentage of real traffic to the new version, monitoring error rates and latency, and promoting automatically only when metrics confirm the new version is healthy. Feature flags decouple deployment from release, enabling dark launches and kill switches without redeployment. All three strategies require discipline around database migrations — the expand-contract pattern ensures schema changes are backward-compatible with both versions running simultaneously.
Read next: GitHub Actions: Secrets and Environments →
Part of the GitHub Mastery Course — engineering the release.
