Online Inter College

Zero-Downtime Deployments: The Complete Playbook

Girish Sharma
March 8, 2025 · 17 min read


Every Second Your App Is Down, Someone Is Leaving and Not Coming Back.

At 2:47am on a Tuesday, a fintech startup pushed a database migration to production. By 2:51am their app was returning 500 errors to every user. By 3:15am their CEO was awake. By 9am their support inbox had 340 unread tickets.

The deployment itself was correct. The code worked perfectly in staging. The migration ran cleanly. But nobody had planned for the four minutes between the old code shutting down and the new code finishing its startup sequence.

Four minutes. Roughly $180,000 in transaction volume for that company. Gone.

Zero-downtime deployment is not an advanced topic reserved for FAANG engineers. It is a fundamental discipline that every team shipping software to real users needs to understand and implement. This playbook covers everything — the strategies, the patterns, the database migrations, the gotchas, and the checklist you run before every deploy.

💡 Who this is for: Engineering teams who have experienced downtime during deployments and want to prevent it permanently, and teams who have not experienced it yet and want to keep it that way.


What Zero-Downtime Deployment Actually Means

Zero-downtime deployment means releasing new versions of your software without any period where your application is unavailable or returning errors to users.

This sounds simple. In practice it requires solving several problems simultaneously:

  • New and old versions of your code running at the same time during the transition

  • Database schemas that work with both old and new application code

  • In-flight requests completing successfully during the transition

  • State and sessions surviving across the deployment boundary

  • Rollback being possible without additional downtime if something goes wrong

🔑 The core insight: Downtime during deployment is almost never caused by the deployment itself failing. It is caused by the gap between the old version stopping and the new version being ready — and by database or state changes that are not backward compatible with the version that is still running.


The Four Deployment Strategies

Every zero-downtime deployment approach is a variation of one of these four core strategies. Understanding all four gives you the vocabulary and mental model to pick the right tool for each situation.


Strategy 1: Rolling Deployment

A rolling deployment replaces instances of your application one at a time or in small batches. At any point during the deployment, some instances are running the old version and some are running the new version.

How it works:

  1. Take one instance out of the load balancer rotation

  2. Deploy the new version to that instance

  3. Health check the new instance until it passes

  4. Return the instance to the load balancer rotation

  5. Repeat for the next instance until all instances are updated

Visual representation:

Start:    [v1] [v1] [v1] [v1]
Step 1:   [v2] [v1] [v1] [v1]
Step 2:   [v2] [v2] [v1] [v1]
Step 3:   [v2] [v2] [v2] [v1]
Complete: [v2] [v2] [v2] [v2]
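The sequence above can be sketched in a few lines of TypeScript (function and variable names are illustrative, not from any real tool). Note that every intermediate fleet state mixes versions; that is exactly why backward compatibility matters for rolling deployments.

```typescript
// Sketch of a rolling update: replace instances one at a time and
// record the fleet state after each step. `fleet` holds the version
// each instance is currently running.
function rollingDeploy(fleet: string[], newVersion: string): string[][] {
  const states: string[][] = [fleet.slice()];
  for (let i = 0; i < fleet.length; i++) {
    // Steps 1-4 from the list above: drain instance i, deploy,
    // health check, return it to rotation.
    fleet[i] = newVersion;
    states.push(fleet.slice());
  }
  return states;
}

for (const state of rollingDeploy(["v1", "v1", "v1", "v1"], "v2")) {
  console.log(state.join(" "));
}
// v1 v1 v1 v1
// v2 v1 v1 v1
// v2 v2 v1 v1
// v2 v2 v2 v1
// v2 v2 v2 v2
```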

Best for: Applications where running mixed versions simultaneously is safe. Services with stateless request handling. Teams with limited infrastructure budget who cannot run double capacity.

Watch out for: Database migrations that are not backward compatible. Sessions tied to specific instances. API changes that break v1 clients talking to v2 servers or vice versa.


Strategy 2: Blue-Green Deployment

Blue-green deployment maintains two identical production environments. At any moment, one environment is live and receiving traffic. The other is idle and available for the next deployment.

How it works:

  1. Your blue environment is currently live

  2. Deploy the new version to the idle green environment

  3. Run smoke tests and validation against green

  4. Switch the load balancer to send all traffic to green

  5. Blue is now idle — keep it running for immediate rollback

  6. On the next deployment, green is live and blue receives the update

Visual representation:

Before deploy:
  Load Balancer → [Blue: v1 LIVE] [Green: idle]

After deploy:
  Load Balancer → [Blue: v1 standby] [Green: v2 LIVE]

If rollback needed:
  Load Balancer → [Blue: v1 LIVE] [Green: v2 standby]

Best for: Applications where instant rollback capability is critical. Deployments involving significant database or infrastructure changes. Teams who can absorb the cost of running double infrastructure.

Watch out for: Database migrations that affect both environments. Stateful data that lives in the application layer. The cost of maintaining double infrastructure permanently.
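At its core, blue-green cutover is a single pointer swap, which is what makes rollback so fast. A minimal sketch in TypeScript (the class and its names are hypothetical, standing in for whatever your load balancer configuration actually does):

```typescript
// Minimal blue-green router: one environment is live, deploys only
// ever touch the idle one, and cutover/rollback is a pointer flip.
type Env = "blue" | "green";

class BlueGreenRouter {
  private versions: Record<Env, string>;
  private live: Env = "blue";

  constructor(initialVersion: string) {
    this.versions = { blue: initialVersion, green: initialVersion };
  }

  idle(): Env {
    return this.live === "blue" ? "green" : "blue";
  }

  // Deploying never touches the live environment.
  deployToIdle(version: string): void {
    this.versions[this.idle()] = version;
  }

  // Cutover and rollback are the same cheap operation.
  switchTraffic(): void {
    this.live = this.idle();
  }

  liveVersion(): string {
    return this.versions[this.live];
  }
}

const router = new BlueGreenRouter("v1");
router.deployToIdle("v2");       // green now holds v2; blue still serves v1
router.switchTraffic();          // cutover: green is live
console.log(router.liveVersion()); // "v2"
router.switchTraffic();          // instant rollback: blue (v1) is live again
console.log(router.liveVersion()); // "v1"
```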


Strategy 3: Canary Deployment

A canary deployment routes a small percentage of real production traffic to the new version while the majority continues hitting the old version. You monitor the canary closely, then gradually increase its traffic share until it is handling everything.

How it works:

  1. Deploy the new version to a small subset of instances

  2. Route 1 to 5 percent of traffic to the new version

  3. Monitor error rates, latency, and business metrics carefully

  4. If metrics look healthy, increase the traffic percentage

  5. Continue gradually until the new version handles 100 percent

  6. Decommission the old version instances

Traffic progression example:

Phase 1:  95% [v1]  —   5% [v2]  — monitor 30 minutes
Phase 2:  75% [v1]  —  25% [v2]  — monitor 30 minutes
Phase 3:  50% [v1]  —  50% [v2]  — monitor 30 minutes
Phase 4:  25% [v1]  —  75% [v2]  — monitor 30 minutes
Phase 5:   0% [v1]  — 100% [v2]  — deployment complete
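One detail the percentages hide: routing should be deterministic per user, so a given user sees the same version for the whole phase rather than flapping between v1 and v2 on every request. A common approach is to hash the user ID into a stable bucket. A rough sketch (the hash and function names are illustrative):

```typescript
// Deterministic canary routing: hash the user ID into a stable
// bucket 0-99 and compare against the current canary percentage.
// The same user always lands in the same bucket, so their version
// is stable for the duration of a phase.
function bucket(userId: string): number {
  let h = 0;
  for (const ch of userId) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0; // simple rolling hash
  }
  return h % 100;
}

function routeToCanary(userId: string, canaryPercent: number): boolean {
  return bucket(userId) < canaryPercent;
}

// At 0% nobody hits the canary; at 100% everybody does.
console.log(routeToCanary("user-42", 0));   // false
console.log(routeToCanary("user-42", 100)); // true
```

Raising the phase percentage only ever moves users from v1 to v2, never back, which keeps the rollout monotonic.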

Best for: High-traffic applications where catching issues with real traffic before full rollout is worth the complexity. Feature changes where user behavior impact is uncertain. Teams with strong observability and alerting infrastructure.

Watch out for: The monitoring overhead required to make canary safe. Database migrations being run before the canary phase completes. Inconsistent user experiences when some users hit v1 and others hit v2.


Strategy 4: Feature Flags

Feature flags decouple code deployment from feature release. You deploy all code to all instances immediately, but new functionality is hidden behind a flag. You turn the flag on for a percentage of users, a specific cohort, or all users at a moment of your choosing.

How it works:

  1. Wrap new feature code behind a flag check

  2. Deploy to production with the flag off — no user impact

  3. Enable the flag for internal users or a small beta group

  4. Monitor and iterate

  5. Gradually roll out to wider audiences

  6. Remove the flag and the old code path in a future deployment

Code example:

async function getRecommendations(userId: string) {
  const useNewAlgorithm = await featureFlags.isEnabled(
    "new-recommendation-engine",
    { userId }
  );

  if (useNewAlgorithm) {
    return newRecommendationEngine.getForUser(userId);
  }

  return legacyRecommendationEngine.getForUser(userId);
}

Best for: Separating deployment risk from release risk. A/B testing new features. Giving product teams control over rollout timing without requiring engineering deployments.

Watch out for: Flag debt accumulating when old flags are never cleaned up. Technical complexity of maintaining parallel code paths long-term. Flags that have database or schema dependencies that make them hard to toggle safely.


Choosing the Right Strategy

Factor                     Rolling   Blue-Green   Canary   Feature Flags
-------------------------  --------  -----------  -------  -------------
Infrastructure cost        Low       High         Medium   Low
Rollback speed             Minutes   Seconds      Minutes  Seconds
Mixed version risk         High      None         Medium   None
Real traffic testing       No        No           Yes      Yes
Complexity                 Low       Medium       High     High
Database migration safety  Medium    High         Medium   High

💡 Most mature teams combine strategies. Blue-green for infrastructure deployments. Canary for significant application changes. Feature flags for product feature releases. Rolling for low-risk routine updates.


The Database Migration Problem

This is where most zero-downtime deployment strategies fall apart. Code deployments are relatively easy to make zero-downtime. Database migrations are hard.

The fundamental tension is this: you cannot deploy a database migration and a code change at the exact same moment. There will always be a period where either the old code is running against the new schema, or the new code is running against the old schema.

The solution is the expand-contract pattern — also called the parallel change pattern.


The Expand-Contract Pattern

Every breaking database change is split into three separate deployments instead of one.

Phase 1: Expand

Add the new structure without removing anything. The database now supports both old and new application behavior simultaneously.

Phase 2: Migrate

Deploy the application code that uses the new structure. The old structure still exists for backward compatibility.

Phase 3: Contract

Remove the old structure that is no longer needed.


Real example — renaming a column safely:

The dangerous way — one deployment, guaranteed downtime:

-- Migration — runs while old code is still deployed
ALTER TABLE users RENAME COLUMN full_name TO display_name;
-- Old code immediately breaks — it still queries full_name

The safe way — three deployments, zero downtime:

Deployment 1 (Expand):

-- Add the new column, keep the old one
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);

-- Copy existing data to new column
UPDATE users SET display_name = full_name;

-- Keep both columns in sync with a trigger. Sync must run in both
-- directions: old code still writes full_name, new code will write
-- display_name, and neither may clobber the other mid-transition.
CREATE OR REPLACE FUNCTION sync_display_name()
RETURNS TRIGGER AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    NEW.display_name := COALESCE(NEW.display_name, NEW.full_name);
    NEW.full_name    := COALESCE(NEW.full_name, NEW.display_name);
  ELSIF NEW.display_name IS DISTINCT FROM OLD.display_name THEN
    NEW.full_name := NEW.display_name;
  ELSIF NEW.full_name IS DISTINCT FROM OLD.full_name THEN
    NEW.display_name := NEW.full_name;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_sync_display_name
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_display_name();

Deployment 2 (Migrate):

-- Deploy application code that reads and writes display_name
-- Old code still works because full_name still exists
-- New code works because display_name exists and is populated

Deployment 3 (Contract):

-- Only after all instances run the new code
DROP TRIGGER users_sync_display_name ON users;
DROP FUNCTION sync_display_name();
ALTER TABLE users DROP COLUMN full_name;

⚠️ The most common mistake: Skipping the expand phase and deploying the migration and the code change together. This creates a window — however brief — where the running application code does not match the database schema.


Additive Migrations — Always Safe

These database operations are safe to run at any time without impacting running application code:

  • Adding a new table

  • Adding a nullable column to an existing table

  • Adding a column with a default value

  • Adding a new index

  • Adding a new foreign key to a new column

  • Creating a new stored procedure or function

Destructive Migrations — Always Require Expand-Contract

These operations will break running application code if deployed naively:

  • Renaming a column

  • Removing a column

  • Renaming a table

  • Changing a column type

  • Adding a non-nullable column without a default

  • Removing a table
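The additive/destructive split above lends itself to a pre-deploy lint that flags dangerous DDL before it reaches production. A rough sketch in TypeScript; regex matching on SQL text is inherently approximate, and a real tool would parse the statements properly, so treat the patterns as illustrative:

```typescript
// Rough pre-deploy lint: flag obviously destructive DDL so it can be
// routed through expand-contract. Regexes are an approximation; a
// production linter would use a real SQL parser.
const DESTRUCTIVE_PATTERNS: RegExp[] = [
  /\bDROP\s+(TABLE|COLUMN)\b/i,                            // removing tables/columns
  /\bRENAME\s+(TO|COLUMN)\b/i,                             // renaming tables/columns
  /\bALTER\s+COLUMN\b[^;]*\bTYPE\b/i,                      // changing a column type
  /\bADD\s+COLUMN\b(?![^;]*\bDEFAULT\b)[^;]*\bNOT\s+NULL\b/i, // NOT NULL without default
];

function isDestructive(sql: string): boolean {
  return DESTRUCTIVE_PATTERNS.some(p => p.test(sql));
}

console.log(isDestructive("ALTER TABLE users ADD COLUMN nickname VARCHAR(255)"));            // false
console.log(isDestructive("ALTER TABLE users RENAME COLUMN full_name TO display_name"));     // true
console.log(isDestructive("ALTER TABLE users ADD COLUMN age INT NOT NULL"));                 // true
console.log(isDestructive("ALTER TABLE users ADD COLUMN age INT NOT NULL DEFAULT 0"));       // false
```

Wiring a check like this into CI turns "remember to use expand-contract" from a convention into a gate.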


Health Checks — The Foundation of Safe Deployments

Every zero-downtime deployment strategy depends on health checks to know when a new instance is ready to receive traffic. Getting health checks right is not optional.

The two types of health check you need:

Liveness check — is the process alive?

GET /health/live

Response: 200 OK
{
  "status": "alive"
}

Returns 200 as long as the process is running; times out or returns non-200 if the process is deadlocked or otherwise unresponsive. Used by orchestrators like Kubernetes to decide whether to restart a container.

Readiness check — is the instance ready for traffic?

GET /health/ready

Response: 200 OK
{
  "status": "ready",
  "checks": {
    "database": "connected",
    "cache": "connected",
    "migrations": "current"
  }
}

Returns 200 only when the instance has completed startup, connected to all dependencies, and is genuinely ready to handle requests. Used by load balancers to decide whether to route traffic to an instance.

🔑 The critical distinction: A readiness check that passes too early is the cause of most deployment-related downtime. Your readiness check must verify that your application is actually ready — database connection pooled, caches warmed, migrations run — not just that the HTTP server is listening.

A production-grade readiness check:

// Assumes an Express app plus db, redis, and checkMigrationsAreCurrent defined elsewhere
app.get("/health/ready", async (req, res) => {
  const checks = await Promise.allSettled([
    db.query("SELECT 1"),
    redis.ping(),
    checkMigrationsAreCurrent(),
  ]);

  const results = {
    database: checks[0].status === "fulfilled" ? "ok" : "error",
    cache: checks[1].status === "fulfilled" ? "ok" : "error",
    migrations: checks[2].status === "fulfilled" ? "ok" : "error",
  };

  const allHealthy = Object.values(results).every(v => v === "ok");

  res.status(allHealthy ? 200 : 503).json({
    status: allHealthy ? "ready" : "not_ready",
    checks: results,
  });
});

Graceful Shutdown — Let In-Flight Requests Finish

When your load balancer stops routing new traffic to an instance, that instance may still be processing requests that arrived before the cutover. Killing the process immediately drops those requests. Graceful shutdown lets them complete.

Node.js graceful shutdown implementation:

// Assumes an Express app plus db and redis clients defined elsewhere
const server = app.listen(PORT);

let isShuttingDown = false;

process.on("SIGTERM", async () => {
  console.log("SIGTERM received — beginning graceful shutdown");
  isShuttingDown = true;

  server.close(async () => {
    console.log("HTTP server closed — no new connections accepted");

    await db.pool.end();
    console.log("Database connections closed");

    await redis.quit();
    console.log("Redis connection closed");

    console.log("Graceful shutdown complete");
    process.exit(0);
  });

  setTimeout(() => {
    console.error("Forced shutdown after timeout");
    process.exit(1);
  }, 30000);
});

✅ The readiness check and graceful shutdown work together. When a deployment starts, the load balancer stops routing new traffic to the old instance by checking readiness. The old instance finishes its in-flight requests via graceful shutdown. The new instance passes its readiness check and starts receiving traffic. No requests are dropped.


Session and State Management During Deployments

Stateful sessions stored in application memory are destroyed when an instance is restarted. If a user's session lives only in memory on instance A and instance A is replaced during a deployment, that user is logged out.

The solutions:

Move sessions to an external store:

import session from "express-session";
import RedisStore from "connect-redis";
import { redis } from "./redis";

app.use(
  session({
    store: new RedisStore({ client: redis }),
    secret: process.env.SESSION_SECRET,
    resave: false,
    saveUninitialized: false,
    cookie: {
      secure: true,
      httpOnly: true,
      maxAge: 24 * 60 * 60 * 1000,
    },
  })
);

Use stateless JWT tokens:

import jwt from "jsonwebtoken";

function createToken(userId: string): string {
  return jwt.sign(
    { userId, iat: Math.floor(Date.now() / 1000) },
    process.env.JWT_SECRET,
    { expiresIn: "24h" }
  );
}

function verifyToken(token: string): { userId: string } {
  return jwt.verify(token, process.env.JWT_SECRET) as { userId: string };
}

💡 JWTs are deployment-transparent by design. They are signed and verified with a secret, contain all necessary user information, and require no server-side state. Any instance can verify any token at any time — making them naturally zero-downtime-safe.


Rollback Strategy — When Things Go Wrong

Zero-downtime deployment requires zero-downtime rollback. A deployment strategy with fast forward but slow reverse is not complete.

Rollback decision tree:

Deploy new version
        |
        v
Monitor error rate for 10 minutes
        |
   Spike detected?
   /           \
  Yes           No
   |             |
   v             v
Auto-rollback  Continue
triggered      monitoring
   |
   v
Previous version restored
   |
   v
Alert on-call engineer
   |
   v
Investigate root cause
before next attempt

Automated rollback trigger example:

// Assumes helpers: sleep, metrics.getErrorRate, triggerRollback, alertOnCall
async function monitorDeployment(
  deploymentId: string,
  errorRateThreshold: number = 0.01
) {
  const monitoringWindow = 10 * 60 * 1000;
  const checkInterval = 30 * 1000;
  const startTime = Date.now();

  while (Date.now() - startTime < monitoringWindow) {
    await sleep(checkInterval);

    const errorRate = await metrics.getErrorRate({
      window: "5m",
      deploymentId,
    });

    if (errorRate > errorRateThreshold) {
      console.error(`Error rate ${errorRate} exceeds threshold — rolling back`);
      await triggerRollback(deploymentId);
      await alertOnCall(`Automatic rollback triggered for ${deploymentId}`);
      return { success: false, rolledBack: true };
    }
  }

  return { success: true, rolledBack: false };
}

⚠️ Database rollback is the hard part. Application code rollback is straightforward. Rolling back a database migration that has already run requires a down migration that is the exact inverse of the up migration. Always write and test your down migrations before you need them.


Kubernetes Zero-Downtime Configuration

For teams running on Kubernetes, here is the complete deployment configuration that implements the patterns above:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: myapp:v2.0.0
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "15"]

🔑 The preStop sleep is not optional. Without it, Kubernetes removes the pod from service discovery and sends SIGTERM simultaneously. Requests routed to the pod during the propagation delay are dropped. The sleep gives load balancers time to stop routing traffic before the shutdown signal arrives.
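The numbers in the manifest above form a timing budget: the preStop sleep plus the application's own graceful-shutdown timeout (30 seconds in the Node.js example earlier) must fit inside terminationGracePeriodSeconds, because Kubernetes sends SIGKILL when the grace period expires. A tiny sanity check, as a sketch:

```typescript
// Kubernetes SIGKILLs the container after terminationGracePeriodSeconds,
// so the preStop sleep plus the app's own shutdown timeout must fit
// inside it, or in-flight work is killed mid-shutdown anyway.
function shutdownBudgetOk(
  terminationGracePeriodSeconds: number,
  preStopSleepSeconds: number,
  appShutdownTimeoutSeconds: number
): boolean {
  return preStopSleepSeconds + appShutdownTimeoutSeconds <= terminationGracePeriodSeconds;
}

// 15s preStop + 30s app timeout fits inside the 60s grace period.
console.log(shutdownBudgetOk(60, 15, 30)); // true
// A 60s app timeout would be SIGKILLed before it finishes.
console.log(shutdownBudgetOk(60, 15, 60)); // false
```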


The Pre-Deployment Checklist

Run this before every production deployment:

Database migrations:

  • All migrations are additive or follow expand-contract pattern

  • Migrations have been tested against a production data volume snapshot

  • Down migrations are written and tested

  • Migration runtime on production data size is known and acceptable

Application readiness:

  • Readiness check endpoint verifies all critical dependencies

  • Graceful shutdown is implemented and tested

  • Session state is externalized — not in application memory

  • New version has been smoke tested in staging

Deployment configuration:

  • Health check thresholds are set appropriately

  • Rolling update or deployment strategy is configured correctly

  • Rollback procedure is documented and the team knows how to execute it

  • Monitoring and alerting are active and will fire on error rate increase

Post-deployment monitoring:

  • Error rate baseline is known before deployment begins

  • Latency baseline is known before deployment begins

  • On-call engineer is available for the monitoring window

  • Automated rollback trigger is configured or manual rollback is ready


Common Zero-Downtime Deployment Mistakes

Mistake 1: Treating the deployment as complete when instances are updated

The deployment is complete when your new version has been running under production load for your defined monitoring window without triggering rollback thresholds. Updating instances is the beginning of a deployment, not the end.

Mistake 2: Running database migrations as part of application startup

If migrations run at startup, every starting instance attempts the migration. In a rolling deployment with four instances, four processes race to apply it. When your migration tool takes a lock and records what has already been applied, the extra runs are harmless no-ops. For migrations that move data or drop columns without those guards, a concurrent run is catastrophic. Run migrations as an explicit step before the rollout begins, not as a side effect of startup.

Mistake 3: Health checks that pass before the application is ready

A health check that returns 200 before database connections are established or before connection pools are warmed up will cause your load balancer to route traffic to an instance that is not actually ready. The result looks identical to downtime.

Mistake 4: Forgetting about background jobs and workers

Your API instances are not the only things that need zero-downtime treatment. Queue workers, scheduled jobs, and background processors all need graceful shutdown, and job processing needs to be idempotent to handle the case where a job is interrupted mid-execution.
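A minimal sketch of idempotent job handling: record processed job IDs so a job redelivered after an interrupted worker shutdown is not applied twice. The in-memory Set here is for illustration only; a real system would use a durable store, such as a unique-keyed table, shared across workers.

```typescript
// Idempotent job processing: dedupe on job ID so a job redelivered
// after a worker restart is a no-op instead of a double execution.
type Job = { id: string; run: () => void };

class IdempotentWorker {
  private processed = new Set<string>();

  handle(job: Job): "processed" | "skipped" {
    if (this.processed.has(job.id)) return "skipped";
    job.run();
    this.processed.add(job.id); // mark only after a successful run
    return "processed";
  }
}

let counter = 0;
const worker = new IdempotentWorker();
const job: Job = { id: "job-1", run: () => { counter++; } };

console.log(worker.handle(job)); // "processed"
console.log(worker.handle(job)); // "skipped" -- redelivery changes nothing
console.log(counter);            // 1
```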

Mistake 5: No post-deployment monitoring window

Deploying and immediately closing the laptop is how production incidents happen on Friday evenings. Define a monitoring window — typically 10 to 30 minutes depending on traffic volume — and watch your error rate and latency actively before considering the deployment complete.


Observability — You Cannot Have Zero-Downtime Without It

Zero-downtime deployment requires knowing immediately when something is wrong. That requires three things working before you deploy.

The three pillars for deployment observability:

Metrics — are the numbers normal?

  • Request rate, error rate, and latency for every service

  • Database query time and connection pool saturation

  • Infrastructure metrics including CPU, memory, and network

  • Business metrics including transaction success rate and conversion rate

Logs — what is actually happening?

  • Structured JSON logging with deployment version tagged on every log line

  • Log aggregation that lets you filter by deployment version

  • Error logs with full stack traces and request context

Alerts — who gets woken up and when?

  • Error rate alert firing within 2 minutes of exceeding threshold

  • Latency alert firing within 5 minutes of p99 exceeding SLA

  • Alert routing that reaches the engineer who deployed the change

✅ Tag every log line and metric with the deployment version. When you are investigating whether a spike was caused by a deployment, being able to filter metrics and logs by version instantly is the difference between a 5-minute investigation and a 45-minute one.
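A sketch of what version tagging looks like in a structured logger. The DEPLOY_VERSION environment variable name is illustrative; use whatever your build pipeline already injects.

```typescript
// Structured JSON logging with the deployment version stamped on
// every line, so logs can be filtered by version during triage.
// DEPLOY_VERSION is a hypothetical env var set by the build pipeline.
const DEPLOY_VERSION = process.env.DEPLOY_VERSION ?? "unknown";

function logLine(
  level: "info" | "error",
  message: string,
  fields: Record<string, unknown> = {}
): string {
  return JSON.stringify({
    ts: new Date().toISOString(),
    level,
    version: DEPLOY_VERSION, // the filter key during incident triage
    message,
    ...fields,
  });
}

console.log(logLine("info", "request completed", { path: "/api/users", status: 200 }));
```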


Zero-Downtime Deployment by Infrastructure Type

Infrastructure         Recommended Strategy                  Key Consideration
---------------------  ------------------------------------  -------------------------------------------
Kubernetes             Rolling with readiness probes         preStop sleep for load balancer propagation
AWS ECS                Rolling or Blue-Green via CodeDeploy  ALB target group switching
AWS Elastic Beanstalk  Rolling with additional batch         Immutable deployments for critical changes
Heroku                 Rolling via Preboot feature           Enable Preboot in app settings
Traditional VMs        Manual rolling with load balancer     Script instance drain and rotation
Serverless functions   Alias-based traffic shifting          Lambda weighted aliases for canary
Docker Compose         Blue-green with nginx reload          Nginx upstream switching without restart


Conclusion

Zero-downtime deployment is not a single technique. It is a discipline that touches your deployment strategy, your database migration approach, your health check implementation, your session architecture, your observability stack, and your team culture around releases.

The teams who get this right share a few common traits. They treat every deployment as a process to be designed, not an event to be hoped. They separate schema changes from code changes. They build rollback capability before they need it. They monitor actively after every deploy rather than assuming success.

Downtime during deployment is not inevitable. It is a choice: the consequence of skipping the steps that prevent it.

The playbook is here. The patterns are proven. The only thing left is execution.

