Zero-Downtime Deployments: The Complete Playbook
Every Second Your App Is Down, Someone Is Leaving and Not Coming Back.
At 2:47am on a Tuesday, a fintech startup pushed a database migration to production. By 2:51am their app was returning 500 errors to every user. By 3:15am their CEO was awake. By 9am their support inbox had 340 unread tickets.
The deployment itself was correct. The code worked perfectly in staging. The migration ran cleanly. But nobody had planned for the four minutes between the old code shutting down and the new code finishing its startup sequence.
Four minutes. Roughly $180,000 in transaction volume for that company. Gone.
Zero-downtime deployment is not an advanced topic reserved for FAANG engineers. It is a fundamental discipline that every team shipping software to real users needs to understand and implement. This playbook covers everything — the strategies, the patterns, the database migrations, the gotchas, and the checklist you run before every deploy.
💡 Who this is for: Engineering teams who have experienced downtime during deployments and want to prevent it permanently, and teams who have not experienced it yet and want to keep it that way.
What Zero-Downtime Deployment Actually Means
Zero-downtime deployment means releasing new versions of your software without any period where your application is unavailable or returning errors to users.
This sounds simple. In practice it requires solving several problems simultaneously:
New and old versions of your code running at the same time during the transition
Database schemas that work with both old and new application code
In-flight requests completing successfully during the transition
State and sessions surviving across the deployment boundary
Rollback being possible without additional downtime if something goes wrong
🔑 The core insight: Downtime during deployment is almost never caused by the deployment itself failing. It is caused by the gap between the old version stopping and the new version being ready — and by database or state changes that are not backward compatible with the version that is still running.
The Four Deployment Strategies
Every zero-downtime deployment approach is a variation of one of these four core strategies. Understanding all four gives you the vocabulary and mental model to pick the right tool for each situation.
Strategy 1: Rolling Deployment
A rolling deployment replaces instances of your application one at a time or in small batches. At any point during the deployment, some instances are running the old version and some are running the new version.
How it works:
Take one instance out of the load balancer rotation
Deploy the new version to that instance
Health check the new instance until it passes
Return the instance to the load balancer rotation
Repeat for the next instance until all instances are updated
Visual representation:
Start: [v1] [v1] [v1] [v1]
Step 1: [v2] [v1] [v1] [v1]
Step 2: [v2] [v2] [v1] [v1]
Step 3: [v2] [v2] [v2] [v1]
Complete: [v2] [v2] [v2] [v2]
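The loop above can be sketched as a small orchestration function. The `drain`, `deploy`, `healthCheck`, and `restore` hooks are hypothetical stand-ins for your load balancer and deploy tooling:

```typescript
// Sketch of the rolling-deploy loop. The four hooks are hypothetical
// stand-ins for your load balancer and deploy tooling.
type Hooks = {
  drain: (instance: string) => Promise<void>;    // remove from LB rotation
  deploy: (instance: string) => Promise<void>;   // push the new version
  healthCheck: (instance: string) => Promise<boolean>;
  restore: (instance: string) => Promise<void>;  // return to LB rotation
};

async function rollingDeploy(instances: string[], hooks: Hooks): Promise<void> {
  for (const instance of instances) {
    await hooks.drain(instance);
    await hooks.deploy(instance);
    // Poll until the new version reports healthy before taking traffic again
    while (!(await hooks.healthCheck(instance))) {
      await new Promise((resolve) => setTimeout(resolve, 1000));
    }
    await hooks.restore(instance);
  }
}
```

Batched variants drain several instances per iteration, but the invariant is the same: an instance only returns to rotation after its health check passes.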
Best for: Applications where running mixed versions simultaneously is safe. Services with stateless request handling. Teams with limited infrastructure budget who cannot run double capacity.
Watch out for: Database migrations that are not backward compatible. Sessions tied to specific instances. API changes that break v1 clients talking to v2 servers or vice versa.
Strategy 2: Blue-Green Deployment
Blue-green deployment maintains two identical production environments. At any moment, one environment is live and receiving traffic. The other is idle and available for the next deployment.
How it works:
Your blue environment is currently live
Deploy the new version to the idle green environment
Run smoke tests and validation against green
Switch the load balancer to send all traffic to green
Blue is now idle — keep it running for immediate rollback
On the next deployment, green is live and blue receives the update
Visual representation:
Before deploy:
Load Balancer → [Blue: v1 LIVE] [Green: idle]
After deploy:
Load Balancer → [Blue: v1 standby] [Green: v2 LIVE]
If rollback needed:
Load Balancer → [Blue: v1 LIVE] [Green: v2 standby]
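The cutover logic reduces to a small state machine. This is an illustrative sketch; in practice the "switch" is a load balancer or DNS change:

```typescript
// Illustrative state machine for the blue-green cutover. The names here are
// assumptions; the real switch is a load balancer or DNS change.
type Env = "blue" | "green";
type BlueGreenState = { live: Env; versions: Record<Env, string> };

// Deploy a new version to the idle environment, then cut traffic over to it
function deployAndSwitch(state: BlueGreenState, newVersion: string): BlueGreenState {
  const idle: Env = state.live === "blue" ? "green" : "blue";
  return {
    live: idle, // the freshly deployed environment now takes traffic
    versions: { ...state.versions, [idle]: newVersion },
  };
}

// Rollback is just pointing traffic back at the previous environment
function rollback(state: BlueGreenState): BlueGreenState {
  return { ...state, live: state.live === "blue" ? "green" : "blue" };
}
```

Because the previous version keeps running in the standby environment, rollback is a routing change rather than a redeploy.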
Best for: Applications where instant rollback capability is critical. Deployments involving significant database or infrastructure changes. Teams who can absorb the cost of running double infrastructure.
Watch out for: Database migrations that affect both environments. Stateful data that lives in the application layer. The cost of maintaining double infrastructure permanently.
Strategy 3: Canary Deployment
A canary deployment routes a small percentage of real production traffic to the new version while the majority continues hitting the old version. You monitor the canary closely, then gradually increase its traffic share until it is handling everything.
How it works:
Deploy the new version to a small subset of instances
Route 1 to 5 percent of traffic to the new version
Monitor error rates, latency, and business metrics carefully
If metrics look healthy, increase the traffic percentage
Continue gradually until the new version handles 100 percent
Decommission the old version instances
Traffic progression example:
Phase 1: 95% [v1] — 5% [v2] — monitor 30 minutes
Phase 2: 75% [v1] — 25% [v2] — monitor 30 minutes
Phase 3: 50% [v1] — 50% [v2] — monitor 30 minutes
Phase 4: 25% [v1] — 75% [v2] — monitor 30 minutes
Phase 5: 0% [v1] — 100% [v2] — deployment complete
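The progression can be driven by a simple step function: given the current canary share and a health verdict, compute the next share. The phase ladder here mirrors the example above; tune it to your own risk tolerance:

```typescript
// Step function for the phased rollout above. Returns the next canary traffic
// percentage, or 0 to route everything back to v1 when metrics look unhealthy.
const PHASES = [5, 25, 50, 75, 100];

function nextCanaryShare(current: number, healthy: boolean): number {
  if (!healthy) return 0; // unhealthy canary: send all traffic back to v1
  const next = PHASES.find((p) => p > current);
  return next ?? 100; // already at full rollout
}
```

Each step would be followed by the monitoring window before calling the function again.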
Best for: High-traffic applications where catching issues with real traffic before full rollout is worth the complexity. Feature changes where user behavior impact is uncertain. Teams with strong observability and alerting infrastructure.
Watch out for: The monitoring overhead required to make canary safe. Database migrations being run before the canary phase completes. Inconsistent user experiences when some users hit v1 and others hit v2.
Strategy 4: Feature Flags
Feature flags decouple code deployment from feature release. You deploy all code to all instances immediately, but new functionality is hidden behind a flag. You turn the flag on for a percentage of users, a specific cohort, or all users at a moment of your choosing.
How it works:
Wrap new feature code behind a flag check
Deploy to production with the flag off — no user impact
Enable the flag for internal users or a small beta group
Monitor and iterate
Gradually roll out to wider audiences
Remove the flag and the old code path in a future deployment
Code example:
async function getRecommendations(userId: string) {
  const useNewAlgorithm = await featureFlags.isEnabled(
    "new-recommendation-engine",
    { userId }
  );

  if (useNewAlgorithm) {
    return newRecommendationEngine.getForUser(userId);
  }
  return legacyRecommendationEngine.getForUser(userId);
}
Best for: Separating deployment risk from release risk. A/B testing new features. Giving product teams control over rollout timing without requiring engineering deployments.
Watch out for: Flag debt accumulating when old flags are never cleaned up. Technical complexity of maintaining parallel code paths long-term. Flags that have database or schema dependencies that make them hard to toggle safely.
Choosing the Right Strategy
| Factor | Rolling | Blue-Green | Canary | Feature Flags |
|---|---|---|---|---|
| Infrastructure cost | Low | High | Medium | Low |
| Rollback speed | Minutes | Seconds | Minutes | Seconds |
| Mixed version risk | High | None | Medium | None |
| Real traffic testing | No | No | Yes | Yes |
| Complexity | Low | Medium | High | High |
| Database migration safety | Medium | High | Medium | High |
💡 Most mature teams combine strategies. Blue-green for infrastructure deployments. Canary for significant application changes. Feature flags for product feature releases. Rolling for low-risk routine updates.
The Database Migration Problem
This is where most zero-downtime deployment strategies fall apart. Code deployments are relatively easy to make zero-downtime. Database migrations are hard.
The fundamental tension is this: you cannot deploy a database migration and a code change at the exact same moment. There will always be a period where either the old code is running against the new schema, or the new code is running against the old schema.
The solution is the expand-contract pattern — also called the parallel change pattern.
The Expand-Contract Pattern
Every breaking database change is broken into three separate deployments instead of one.
Phase 1: Expand
Add the new structure without removing anything. The database now supports both old and new application behavior simultaneously.
Phase 2: Migrate
Deploy the application code that uses the new structure. The old structure still exists for backward compatibility.
Phase 3: Contract
Remove the old structure that is no longer needed.
Real example — renaming a column safely:
The dangerous way — one deployment, guaranteed downtime:
-- Migration — runs while old code is still deployed
ALTER TABLE users RENAME COLUMN full_name TO display_name;
-- Old code immediately breaks — it still queries full_name
The safe way — three deployments, zero downtime:
Deployment 1 (Expand):
-- Add the new column, keep the old one
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);
-- Copy existing data to new column
UPDATE users SET display_name = full_name;
-- Keep both columns in sync in both directions with a trigger.
-- A one-way copy (display_name := full_name) would clobber writes from
-- new code that only sets display_name, so sync whichever side changed.
CREATE OR REPLACE FUNCTION sync_display_name()
RETURNS TRIGGER AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    -- On insert, fill whichever column the writer left empty
    NEW.display_name := COALESCE(NEW.display_name, NEW.full_name);
    NEW.full_name    := COALESCE(NEW.full_name, NEW.display_name);
  ELSIF NEW.display_name IS DISTINCT FROM OLD.display_name THEN
    NEW.full_name := NEW.display_name;  -- new code updated display_name
  ELSE
    NEW.display_name := NEW.full_name;  -- old code updated full_name
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_sync_display_name
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_display_name();
Deployment 2 (Migrate):
-- Deploy application code that reads and writes display_name
-- Old code still works because full_name still exists
-- New code works because display_name exists and is populated
Deployment 3 (Contract):
-- Only after all instances run the new code
DROP TRIGGER users_sync_display_name ON users;
DROP FUNCTION sync_display_name();
ALTER TABLE users DROP COLUMN full_name;
⚠️ The most common mistake: Skipping the expand phase and deploying the migration and the code change together. This creates a window — however brief — where the running application code does not match the database schema.
Additive Migrations — Always Safe
These database operations are safe to run at any time without impacting running application code:
Adding a new table
Adding a nullable column to an existing table
Adding a column with a default value — metadata-only on modern engines, though PostgreSQL before version 11 rewrote the whole table
Adding a new index — use CREATE INDEX CONCURRENTLY in PostgreSQL so the build does not block writes
Adding a new foreign key to a new column
Creating a new stored procedure or function
Destructive Migrations — Always Require Expand-Contract
These operations will break running application code if deployed naively:
Renaming a column
Removing a column
Renaming a table
Changing a column type
Adding a non-nullable column without a default
Removing a table
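One way to enforce this list in CI is a rough lint pass over migration files. A regex scan is no substitute for parsing the SQL, but it catches the obvious destructive statements; the patterns below are illustrative, not exhaustive:

```typescript
// Rough lint pass flagging the destructive operations listed above.
// Illustrative only: regexes cannot fully parse SQL, but they catch the
// obvious cases before a migration reaches production.
const DESTRUCTIVE_PATTERNS: [RegExp, string][] = [
  [/\bRENAME\s+COLUMN\b/i, "renames a column"],
  [/\bDROP\s+COLUMN\b/i, "removes a column"],
  [/\bRENAME\s+TO\b/i, "renames a table"],
  [/\bALTER\s+COLUMN\s+\w+\s+(SET\s+DATA\s+)?TYPE\b/i, "changes a column type"],
  [/\bDROP\s+TABLE\b/i, "removes a table"],
];

function findDestructiveOps(sql: string): string[] {
  return DESTRUCTIVE_PATTERNS
    .filter(([pattern]) => pattern.test(sql))
    .map(([, reason]) => reason);
}
```

A CI step that fails the build when `findDestructiveOps` returns anything forces the expand-contract conversation before the merge, not after the outage.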
Health Checks — The Foundation of Safe Deployments
Every zero-downtime deployment strategy depends on health checks to know when a new instance is ready to receive traffic. Getting health checks right is not optional.
The two types of health check you need:
Liveness check — is the process alive?
GET /health/live

Response: 200 OK
{
  "status": "alive"
}
Returns 200 as long as the process is running. Returns non-200 if the process is deadlocked or otherwise unresponsive. Used by orchestrators like Kubernetes to decide whether to restart a container.
Readiness check — is the instance ready for traffic?
GET /health/ready

Response: 200 OK
{
  "status": "ready",
  "checks": {
    "database": "connected",
    "cache": "connected",
    "migrations": "current"
  }
}
Returns 200 only when the instance has completed startup, connected to all dependencies, and is genuinely ready to handle requests. Used by load balancers to decide whether to route traffic to an instance.
🔑 The critical distinction: A readiness check that passes too early is the cause of most deployment-related downtime. Your readiness check must verify that your application is actually ready — database connection pooled, caches warmed, migrations run — not just that the HTTP server is listening.
A production-grade readiness check:
app.get("/health/ready", async (req, res) => {
  const checks = await Promise.allSettled([
    db.query("SELECT 1"),
    redis.ping(),
    checkMigrationsAreCurrent(),
  ]);

  const results = {
    database: checks[0].status === "fulfilled" ? "ok" : "error",
    cache: checks[1].status === "fulfilled" ? "ok" : "error",
    migrations: checks[2].status === "fulfilled" ? "ok" : "error",
  };

  const allHealthy = Object.values(results).every(v => v === "ok");

  res.status(allHealthy ? 200 : 503).json({
    status: allHealthy ? "ready" : "not_ready",
    checks: results,
  });
});
Graceful Shutdown — Let In-Flight Requests Finish
When your load balancer stops routing new traffic to an instance, that instance may still be processing requests that arrived before the cutover. Killing the process immediately drops those requests. Graceful shutdown lets them complete.
Node.js graceful shutdown implementation:
const server = app.listen(PORT);

let isShuttingDown = false;

process.on("SIGTERM", async () => {
  console.log("SIGTERM received — beginning graceful shutdown");
  // Readiness checks should return 503 once this flag is set,
  // so the load balancer drains the instance
  isShuttingDown = true;

  server.close(async () => {
    console.log("HTTP server closed — no new connections accepted");
    await db.pool.end();
    console.log("Database connections closed");
    await redis.quit();
    console.log("Redis connection closed");
    console.log("Graceful shutdown complete");
    process.exit(0);
  });

  // Safety net: force exit if in-flight requests take too long to finish
  setTimeout(() => {
    console.error("Forced shutdown after timeout");
    process.exit(1);
  }, 30000);
});
✅ The readiness check and graceful shutdown work together. When a deployment starts, the load balancer stops routing new traffic to the old instance by checking readiness. The old instance finishes its in-flight requests via graceful shutdown. The new instance passes its readiness check and starts receiving traffic. No requests are dropped.
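One wiring detail worth making explicit: the readiness endpoint should start failing the moment shutdown begins, so the load balancer drains the instance before the grace period runs out. A minimal sketch — the flag is redeclared here so the sketch is self-contained:

```typescript
// Sketch: fail readiness the moment shutdown begins so the load balancer
// drains this instance while in-flight requests finish.
function readinessStatus(shuttingDown: boolean): { code: number; body: { status: string } } {
  return shuttingDown
    ? { code: 503, body: { status: "draining" } } // LB stops routing here
    : { code: 200, body: { status: "ready" } };
}

let isShuttingDown = false;
process.on("SIGTERM", () => {
  isShuttingDown = true; // readiness now returns 503; in-flight requests finish
});

// In the Express handler:
// app.get("/health/ready", (_req, res) => {
//   const { code, body } = readinessStatus(isShuttingDown);
//   res.status(code).json(body);
// });
```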
Session and State Management During Deployments
Stateful sessions stored in application memory are destroyed when an instance is restarted. If a user's session lives only in memory on instance A and instance A is replaced during a deployment, that user is logged out.
The solutions:
Move sessions to an external store:
import session from "express-session";
import RedisStore from "connect-redis";
import { redis } from "./redis";

app.use(
  session({
    store: new RedisStore({ client: redis }),
    secret: process.env.SESSION_SECRET,
    resave: false,
    saveUninitialized: false,
    cookie: {
      secure: true,
      httpOnly: true,
      maxAge: 24 * 60 * 60 * 1000, // 24 hours
    },
  })
);
Use stateless JWT tokens:
import jwt from "jsonwebtoken";

function createToken(userId: string): string {
  return jwt.sign(
    { userId, iat: Math.floor(Date.now() / 1000) },
    process.env.JWT_SECRET,
    { expiresIn: "24h" }
  );
}

function verifyToken(token: string): { userId: string } {
  return jwt.verify(token, process.env.JWT_SECRET) as { userId: string };
}
💡 JWTs are deployment-transparent by design. They are signed and verified with a secret, contain all necessary user information, and require no server-side state. Any instance can verify any token at any time — making them naturally zero-downtime-safe.
Rollback Strategy — When Things Go Wrong
Zero-downtime deployment requires zero-downtime rollback. A deployment strategy with fast forward but slow reverse is not complete.
Rollback decision tree:
        Deploy new version
              |
              v
  Monitor error rate for 10 minutes
              |
        Spike detected?
           /      \
         Yes       No
          |         |
          v         v
  Auto-rollback   Continue
    triggered     monitoring
          |
          v
  Previous version restored
          |
          v
  Alert on-call engineer
          |
          v
  Investigate root cause
   before next attempt
Automated rollback trigger example:
// Assumes sleep, metrics, triggerRollback, and alertOnCall helpers
// from your own tooling
async function monitorDeployment(
  deploymentId: string,
  errorRateThreshold: number = 0.01
) {
  const monitoringWindow = 10 * 60 * 1000; // 10 minutes
  const checkInterval = 30 * 1000; // 30 seconds
  const startTime = Date.now();

  while (Date.now() - startTime < monitoringWindow) {
    await sleep(checkInterval);

    const errorRate = await metrics.getErrorRate({
      window: "5m",
      deploymentId,
    });

    if (errorRate > errorRateThreshold) {
      console.error(`Error rate ${errorRate} exceeds threshold — rolling back`);
      await triggerRollback(deploymentId);
      await alertOnCall(`Automatic rollback triggered for ${deploymentId}`);
      return { success: false, rolledBack: true };
    }
  }

  return { success: true, rolledBack: false };
}
⚠️ Database rollback is the hard part. Application code rollback is straightforward. Rolling back a database migration that has already run requires a down migration that is the exact inverse of the up migration. Always write and test your down migrations before you need them.
Kubernetes Zero-Downtime Configuration
For teams running on Kubernetes, here is the complete deployment configuration that implements the patterns above:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 4
  selector:
    matchLabels:
      app: api-server
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: api-server
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: myapp:v2.0.0
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "15"]
🔑 The preStop sleep is not optional. Without it, Kubernetes removes the pod from service discovery and sends SIGTERM simultaneously. Requests routed to the pod during the propagation delay are dropped. The sleep gives load balancers time to stop routing traffic before the shutdown signal arrives.
The Pre-Deployment Checklist
Run this before every production deployment:
Database migrations:
All migrations are additive or follow expand-contract pattern
Migrations have been tested against a production data volume snapshot
Down migrations are written and tested
Migration runtime on production data size is known and acceptable
Application readiness:
Readiness check endpoint verifies all critical dependencies
Graceful shutdown is implemented and tested
Session state is externalized — not in application memory
New version has been smoke tested in staging
Deployment configuration:
Health check thresholds are set appropriately
Rolling update or deployment strategy is configured correctly
Rollback procedure is documented and the team knows how to execute it
Monitoring and alerting are active and will fire on error rate increase
Post-deployment monitoring:
Error rate baseline is known before deployment begins
Latency baseline is known before deployment begins
On-call engineer is available for the monitoring window
Automated rollback trigger is configured or manual rollback is ready
Common Zero-Downtime Deployment Mistakes
Mistake 1: Treating the deployment as complete when instances are updated
The deployment is complete when your new version has been running under production load for your defined monitoring window without triggering rollback thresholds. Updating instances is the beginning of a deployment, not the end.
Mistake 2: Running database migrations as part of application startup
If migrations run at startup, every instance that starts up runs the migration. In a rolling deployment with four instances, your migration runs four times — and during the rollout, instances can race to apply the same change concurrently. When the migration tracker makes a repeat run a no-op, this is merely wasteful. For migrations that move data or drop columns, it is catastrophic.
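If migrations must run near startup anyway, one common safeguard is to serialize them behind a database advisory lock so only one instance actually executes at a time. A sketch assuming PostgreSQL, with `query` standing in for your database client's query function (e.g. node-postgres); the lock key 42 is an arbitrary app-wide constant:

```typescript
// Hypothetical sketch: serialize migrations across N starting instances with
// a PostgreSQL advisory lock. `query` is your client's query function; the
// lock key 42 is an arbitrary app-wide constant.
type Query = (sql: string, params?: unknown[]) => Promise<unknown>;

async function migrateWithLock(query: Query, runMigrations: () => Promise<void>) {
  // Blocks until this instance holds the lock; other instances wait their turn
  await query("SELECT pg_advisory_lock($1)", [42]);
  try {
    // A waiter that acquires the lock later sees migrations already applied,
    // so the runner must itself be a no-op when the schema is current
    await runMigrations();
  } finally {
    await query("SELECT pg_advisory_unlock($1)", [42]);
  }
}
```

Even with the lock, the safer default is still a separate, explicit migration step that runs once before the deployment begins.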
Mistake 3: Health checks that pass before the application is ready
A health check that returns 200 before database connections are established or before connection pools are warmed up will cause your load balancer to route traffic to an instance that is not actually ready. The result looks identical to downtime.
Mistake 4: Forgetting about background jobs and workers
Your API instances are not the only things that need zero-downtime treatment. Queue workers, scheduled jobs, and background processors all need graceful shutdown, and job processing needs to be idempotent to handle the case where a job is interrupted mid-execution.
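Idempotency can be as simple as recording completed job IDs and skipping redeliveries. The in-memory Set below is for illustration only; production would use Redis or a database table with a unique constraint:

```typescript
// Sketch of idempotent job handling: record completed job IDs so a job
// interrupted mid-deploy and redelivered is not applied twice. The in-memory
// Set is illustrative; production needs a shared store (Redis, a DB table
// with a unique constraint).
const completedJobs = new Set<string>();

async function processOnce(
  jobId: string,
  handler: () => Promise<void>
): Promise<"processed" | "skipped"> {
  if (completedJobs.has(jobId)) return "skipped"; // already done: redelivery is safe
  await handler();
  completedJobs.add(jobId); // mark complete only after the handler succeeds
  return "processed";
}
```

Because the ID is recorded only after the handler succeeds, a job killed mid-execution is retried on redelivery rather than silently lost.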
Mistake 5: No post-deployment monitoring window
Deploying and immediately closing the laptop is how production incidents happen on Friday evenings. Define a monitoring window — typically 10 to 30 minutes depending on traffic volume — and watch your error rate and latency actively before considering the deployment complete.
Observability — You Cannot Have Zero-Downtime Without It
Zero-downtime deployment requires knowing immediately when something is wrong. That requires three things working before you deploy.
The three pillars for deployment observability:
Metrics — are the numbers normal?
Request rate, error rate, and latency for every service
Database query time and connection pool saturation
Infrastructure metrics including CPU, memory, and network
Business metrics including transaction success rate and conversion rate
Logs — what is actually happening?
Structured JSON logging with deployment version tagged on every log line
Log aggregation that lets you filter by deployment version
Error logs with full stack traces and request context
Alerts — who gets woken up and when?
Error rate alert firing within 2 minutes of exceeding threshold
Latency alert firing within 5 minutes of p99 exceeding SLA
Alert routing that reaches the engineer who deployed the change
✅ Tag every log line and metric with the deployment version. When you are investigating whether a spike was caused by a deployment, being able to filter metrics and logs by version instantly is the difference between a 5-minute investigation and a 45-minute one.
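A minimal version-tagged structured logger might look like this, assuming the deployment version is injected through an environment variable at build time (the variable name is illustrative):

```typescript
// Sketch: stamp every structured log line with the deployment version.
// DEPLOY_VERSION is an assumed environment variable set at build time.
const DEPLOY_VERSION = process.env.DEPLOY_VERSION ?? "unknown";

function logLine(
  level: string,
  message: string,
  fields: Record<string, unknown> = {}
): string {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    version: DEPLOY_VERSION, // filter by this field when triaging a deploy
    ...fields,
  });
}
```

With the version on every line, "show me only errors from v2.0.0" becomes a single filter in your log aggregator.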
Zero-Downtime Deployment by Infrastructure Type
| Infrastructure | Recommended Strategy | Key Consideration |
|---|---|---|
| Kubernetes | Rolling with readiness probes | preStop sleep for load balancer propagation |
| AWS ECS | Rolling or Blue-Green via CodeDeploy | ALB target group switching |
| AWS Elastic Beanstalk | Rolling with additional batch | Immutable deployments for critical changes |
| Heroku | Rolling via Preboot feature | Enable Preboot in app settings |
| Traditional VMs | Manual rolling with load balancer | Script instance drain and rotation |
| Serverless functions | Alias-based traffic shifting | Lambda weighted aliases for canary |
| Docker Compose | Blue-green with nginx reload | Nginx upstream switching without restart |
Conclusion
Zero-downtime deployment is not a single technique. It is a discipline that touches your deployment strategy, your database migration approach, your health check implementation, your session architecture, your observability stack, and your team culture around releases.
The teams who get this right share a few common traits. They treat every deployment as a process to be designed, not an event to be hoped. They separate schema changes from code changes. They build rollback capability before they need it. They monitor actively after every deploy rather than assuming success.
Downtime during deployment is not inevitable. It is a choice — a consequence of skipping the steps that prevent it.
The playbook is here. The patterns are proven. The only thing left is execution.
