Business

Designing for Scale: Lessons from Operating at 10M+ Users

CodeWithGarry
March 20, 2024 · 24 min read · 5,103 views · 0 comments

Everything Works at 100 Users. Almost Nothing Works the Same Way at 10 Million.

The first version of your product is a proof of concept wearing a production costume. The database queries that return in 12 milliseconds with 500 rows take 4 seconds with 50 million. The caching strategy that felt clever with one server becomes a consistency nightmare with forty. The monolith that your team of four ships every Friday afternoon becomes the thing that wakes your on-call engineer at 3am once you have a team of forty.

Scale does not just make your existing problems bigger. It reveals problems that did not exist at smaller sizes, invalidates solutions that worked perfectly before, and creates entirely new categories of failure that no staging environment ever warned you about.

This article is built from the hard-won lessons of engineering teams who crossed the 10 million user threshold and survived to document what broke, what held, what they wish they had done earlier, and what they are glad they did not do too soon.

💡 The central premise: Scaling is not a technical problem with a technical solution. It is a series of trade-offs between consistency and availability, between simplicity and resilience, between moving fast and staying stable — and the teams that navigate it well are the ones who understand the trade-offs clearly before they are forced to make them under pressure.


The Scale Inflection Points That Change Everything

Scale does not degrade linearly. There are specific thresholds where systems that were working fine suddenly stop working, and understanding where those thresholds sit helps you prepare before you hit them rather than after.

The thresholds that matter most:

At roughly 10,000 daily active users, a single well-configured server handles most workloads comfortably. Database queries are fast because the working set fits in memory. Deploys are simple. Debugging is straightforward because one server has all the logs.

At roughly 100,000 daily active users, you start seeing the first real scaling pressures. Some database queries begin to slow down under concurrent load. Your single server is no longer comfortable during peak traffic. You start thinking seriously about caching and read replicas.

At roughly 1 million daily active users, the architecture decisions you made in year one are either holding or breaking. Unindexed queries that were acceptable at 100K rows are unacceptable at 100M rows. Services that were coupled for simplicity are now coupled in ways that cause cascading failures. Your deployment process, your monitoring strategy, and your database schema are all being stress-tested at a scale that reveals their true load-bearing capacity.

At 10 million daily active users, you are operating a distributed system whether you designed one or not. The questions are no longer about whether to distribute but about how to manage the consistency, reliability, and operational complexity that distribution introduces.

🔑 The lesson that costs teams the most: Optimizing for the scale you do not have yet is as dangerous as ignoring scale entirely. The teams that build microservices and Kafka pipelines for their first thousand users spend all their time on infrastructure instead of product. The teams that never think about scale until they hit the wall spend all their time on emergency surgery instead of planned improvement. The skill is knowing which scale pressures to prepare for and which to defer.


Lesson 1: Your Database Will Be the First Thing That Breaks

At every scale threshold, the database is the first bottleneck. Not because databases are fragile — PostgreSQL and MySQL are extraordinarily capable pieces of software — but because they are the stateful center of most architectures, which means every performance problem and every scaling strategy ultimately has to account for them.

The database problems that appear at scale, in the order they typically appear:

Missing indexes on high-cardinality columns that are used in WHERE clauses.

This one appears first and hits hardest. A query that filters users by email address is trivially fast with an index and catastrophically slow without one at 10 million rows. The fix is always straightforward. The cost of not having it is a table scan that locks rows, consumes CPU, and makes every concurrent query slower.

-- This is fine at 10,000 users
SELECT * FROM users WHERE email = 'user@example.com';

-- At 10,000,000 users without an index, this is a full table scan
-- Add this before you need it
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);

N+1 query problems hidden inside ORM abstractions.

An ORM makes it easy to write code that generates one query to fetch a list and then one additional query per item in the list to fetch related data. At 10 rows this is invisible. At 10,000 rows this brings your database to its knees. At scale, every ORM query needs to be examined for the SQL it actually generates.

// This looks innocent in application code
const posts = await Post.findAll();
const postsWithAuthors = await Promise.all(
  posts.map(post => post.getAuthor())
);
// Generates 1 + N queries where N is the number of posts

// This generates 1 query with a JOIN
const posts = await Post.findAll({
  include: [{ model: User, as: 'author' }]
});

Write bottlenecks on a single primary database.

Read replicas solve read scaling effectively. Write scaling is harder. When your write volume exceeds what a single primary can handle, you are looking at sharding, partitioning, or rearchitecting the writes that are most expensive.
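The read side of this split can be sketched as a small router that pins writes to the primary and round-robins reads across replicas. The `Pool` shape and names here are illustrative stand-ins, not a real driver API; and because replicas can lag the primary, read-your-own-write flows should stay pinned to the primary:

```typescript
// Sketch of read/write splitting. `Pool` is a stand-in for a real
// connection pool handle; replica reads can lag the primary, so
// read-your-own-write flows should be routed to the primary.
type Pool = { name: string; query: (sql: string) => string };

function makeRouter(primary: Pool, replicas: Pool[]) {
  let next = 0;
  return {
    // Writes always go to the primary
    write: (sql: string) => primary.query(sql),
    // Reads round-robin across replicas, falling back to the primary
    read: (sql: string) => {
      if (replicas.length === 0) return primary.query(sql);
      const replica = replicas[next];
      next = (next + 1) % replicas.length;
      return replica.query(sql);
    },
  };
}
```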

Long-running transactions that hold locks.

A transaction that takes 30 seconds to run acquires and holds locks for 30 seconds. At low concurrency this is fine. At high concurrency it creates a queue of blocked transactions, cascading timeouts, and the kind of production incident that ends careers.
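One defensive measure is to make the database refuse to wait indefinitely, so a blocked transaction fails fast instead of growing the queue. A sketch in PostgreSQL terms; the values are illustrative and should be tuned per workload:

```sql
-- Fail fast instead of queueing behind a long-held lock
SET statement_timeout = '5s';                     -- abort queries that run too long
SET lock_timeout = '2s';                          -- abort lock waits instead of piling up
SET idle_in_transaction_session_timeout = '30s';  -- kill forgotten open transactions
```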

The database scaling playbook in order of application:

  1. Index every column used in WHERE, JOIN, and ORDER BY clauses

  2. Eliminate N+1 queries by auditing ORM-generated SQL

  3. Add read replicas and route read traffic to them

  4. Implement connection pooling between application servers and database

  5. Identify and optimize the ten slowest queries by looking at pg_stat_statements

  6. Partition large tables by date or tenant when they exceed 100 million rows

  7. Consider sharding only when partitioning and read replicas are insufficient
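Step 6 can look like the following with PostgreSQL's declarative range partitioning; the table and column names are illustrative:

```sql
-- Partition a large append-heavy table by month (PostgreSQL;
-- names are illustrative)
CREATE TABLE events (
  id         bigserial,
  tenant_id  bigint NOT NULL,
  created_at timestamptz NOT NULL,
  payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2026_01 PARTITION OF events
  FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
```

Queries that filter on `created_at` then scan only the relevant partitions, and old partitions can be detached or dropped cheaply instead of running bulk deletes.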

✅ The query you should run every week in production:

SELECT query,
       calls,
       total_exec_time / calls AS avg_ms,
       rows / calls AS avg_rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

Lesson 2: Caching Is Not a Performance Strategy. It Is an Architecture Decision.

Most teams start caching reactively. A query is slow. They add a cache. The query is fast. They move on. This approach works until it does not — and at scale, it stops working in ways that are subtle and maddening.

Caching at scale is not a collection of individual optimizations. It is an architectural layer with its own consistency model, its own failure modes, and its own operational complexity. Teams that treat it as the former spend months debugging cache-related data inconsistency bugs. Teams that treat it as the latter build it correctly the first time.

The four caching patterns and when to use each:

Cache aside — application manages the cache explicitly:

async function getUserById(userId: string): Promise<User | null> {
  const cacheKey = `user:${userId}`;

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  const user = await db.users.findById(userId);
  // Only cache real users; caching a null result would serve stale misses
  if (user) await redis.setex(cacheKey, 3600, JSON.stringify(user));

  return user;
}

Best for read-heavy data that changes infrequently. The application controls when cache is populated and when it is invalidated. Simple to understand and reason about.

Write through — cache is always updated when data is written:

async function updateUser(userId: string, data: Partial<User>): Promise<User> {
  const user = await db.users.update(userId, data);

  const cacheKey = `user:${userId}`;
  await redis.setex(cacheKey, 3600, JSON.stringify(user));

  return user;
}

Best for data where cache staleness is unacceptable. Slightly slower writes but guarantees the cache is always current after every update.

Read through — cache layer handles population transparently:

The cache layer itself fetches from the database on a miss and populates itself. The application always talks to the cache, never directly to the database for reads. Simplifies application code but requires a caching layer that understands your data model.
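A minimal in-memory sketch of the read-through shape, with a `loader` callback standing in for the database fetch; the class and names are illustrative, not a real caching library's API:

```typescript
// Read-through sketch: callers only talk to the cache; the cache
// loads from the backing store on a miss. `loader` stands in for
// the database fetch.
class ReadThroughCache<V> {
  private store = new Map<string, V>();
  public loads = 0; // counts backing-store hits, for illustration

  constructor(private loader: (key: string) => V) {}

  get(key: string): V {
    const hit = this.store.get(key);
    if (hit !== undefined) return hit;
    this.loads++;
    const value = this.loader(key);
    this.store.set(key, value);
    return value;
  }
}
```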

Write behind — writes go to cache first, database asynchronously:

Writes are acknowledged after hitting the cache and flushed to the database asynchronously. Dramatically faster perceived write performance. Dangerous if the cache node fails before the flush — you can lose data. Only appropriate for data where some loss is acceptable.
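The pattern and its central risk can be sketched with in-memory stand-ins for the cache and database: if the process dies before `flush` runs, everything still sitting in the `dirty` queue is simply gone.

```typescript
// Write-behind sketch. The Map `db` stands in for the database;
// writes are acknowledged once the cache is updated and persisted
// later by flush(). Anything in `dirty` is lost if the node dies first.
class WriteBehindCache {
  private cache = new Map<string, string>();
  private dirty = new Map<string, string>();

  constructor(private db: Map<string, string>) {}

  // Acknowledge the write as soon as the cache is updated
  write(key: string, value: string): void {
    this.cache.set(key, value);
    this.dirty.set(key, value);
  }

  read(key: string): string | undefined {
    return this.cache.get(key) ?? this.db.get(key);
  }

  // Flush queued writes to the database (in production, on a timer)
  flush(): void {
    for (const [key, value] of this.dirty) this.db.set(key, value);
    this.dirty.clear();
  }
}
```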

The cache invalidation problem that bites every team at scale:

Cache invalidation is one of the two genuinely hard problems in computer science. At scale, the common failure mode is not cache misses — it is stale cache entries that serve outdated data to users long after the underlying data has changed.

// User updates their profile photo
// Application updates the database correctly
// But the cache still has the old photo URL
// For the next 3600 seconds, everyone sees the old photo

// The fix: invalidate on write, not just on TTL expiry
async function updateUserProfile(userId: string, data: Partial<User>) {
  await db.users.update(userId, data);
  await redis.del(`user:${userId}`);
  await redis.del(`user:profile:${userId}`);
  await redis.del(`feed:followers:${userId}`);
  // Every cache key that contains this user's data must be invalidated
}

⚠️ The cache consistency warning: Every cache entry is a promise that the cached data accurately represents the current state of your system. At 10 million users, broken promises at the cache layer manifest as customer support tickets, trust erosion, and the specific category of bug that is hardest to reproduce because it only occurs in the window between a write and a cache invalidation.


Lesson 3: The Monolith vs Microservices Decision Is Not About Technology

Every engineering team that reaches significant scale eventually has the monolith versus microservices conversation. It is almost always framed as a technical question — what architecture is better for scale? It is actually an organizational question — what architecture matches the structure and maturity of your engineering organization?

What the monolith does well:

A well-structured monolith is simpler to develop, simpler to test, simpler to deploy, and simpler to debug than an equivalent microservices architecture. Transactions are straightforward because everything lives in one database. Refactoring is straightforward because you can trace calls across the entire codebase. Onboarding is faster because there is one system to understand.

What the monolith struggles with at scale:

A monolith scales as a single unit. If your image processing workload requires 16 CPUs and your API workload requires 2 CPUs, you scale the whole monolith to 16 CPUs for every instance. Deployment frequency is limited by the risk of any single change affecting every part of the system. Different teams stepping on each other in a shared codebase creates coordination overhead that slows everyone down.

What microservices do well:

Independent scaling of components with different resource profiles. Independent deployment of services by different teams. Technology heterogeneity — using the right tool for each job. Fault isolation — a failure in one service does not necessarily take down others.

What microservices do poorly that nobody warns you about:

Distributed transactions are genuinely hard. When an operation needs to write to three services atomically, you either accept eventual consistency, implement the saga pattern, or use a distributed transaction coordinator — all of which are significantly more complex than a local database transaction.

Network calls between services fail in ways that local function calls never do. Every inter-service call needs retry logic, timeout handling, circuit breakers, and graceful degradation. The operational surface area is larger by an order of magnitude.
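The timeout piece can be sketched as a small wrapper around any promise-returning call. This shows only the core idea; a production client would also want cancellation (for example via AbortController) so the underlying request stops consuming resources when the deadline passes:

```typescript
// Reject a promise that does not settle within `ms` milliseconds.
// Sketch only: the underlying operation keeps running after the
// timeout fires unless it is separately cancelled.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms}ms`)),
      ms
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); }
    );
  });
}
```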

The lesson from teams who have done both:

Start with a monolith that has clean internal boundaries.
Extract services only when you have a specific, measurable reason:

Reasons to extract a service:
- This component has significantly different scaling requirements
- This component needs to be deployed independently by a separate team
- This component has a fundamentally different technology requirement
- This component needs different security or compliance boundaries

Not reasons to extract a service:
- Microservices are more modern
- We might need to scale this someday
- It would be cleaner as a separate service
- Other companies do it this way

🔑 The rule that the best-scaled teams follow: Conway's Law states that organizations design systems that mirror their communication structures. If you have two teams who need to work independently, a service boundary between their domains is appropriate. If you have one team working on everything, a service boundary creates coordination overhead with no organizational benefit.


Lesson 4: Observability Is Not Monitoring. It Is How You Know What Is Happening.

At 100 users, you know what is happening in your system because you can watch it. At 10 million users, you know what is happening only as well as your observability tools tell you. If your observability is incomplete, you are operating a system you cannot see — and at that scale, the things you cannot see are where the problems live.

The three pillars and what each actually tells you:

Metrics — is the system healthy right now?

Metrics are aggregated numerical measurements over time. They tell you whether your error rate is normal, whether your latency is within SLA, whether your database connection pool is saturated, whether your CPU has headroom.

The metrics that matter most at 10 million users are not the ones you started tracking in year one. They are the business metrics correlated with technical health — transaction success rate, not just HTTP 200 rate. Checkout completion rate, not just API availability. Search result relevance degradation, not just search endpoint latency.

Logs — what happened and why?

Logs tell you what the system did. At scale, log volume makes individual log lines nearly impossible to examine manually. The teams that extract value from logs at scale are the ones who structure their logs as machine-parseable data from the beginning, who tag every log line with correlation IDs that connect distributed operations, and who build queries over log data rather than scrolling through it.

// Unstructured log — useless at scale
console.log("User login failed for user@example.com");

// Structured log — queryable at scale
logger.warn({
  event: "auth.login.failed",
  userId: null,
  email: "user@example.com",
  reason: "invalid_password",
  attempt: 3,
  ip: "203.0.113.42",
  requestId: "req_8f3k2m",
  traceId: "trace_9x2p1q",
  timestamp: new Date().toISOString(),
});

Traces — how did this specific request travel through the system?

Distributed tracing follows a single request as it moves through multiple services, databases, and caches. It answers the question that metrics and logs cannot: for this specific user, making this specific request, at this specific time, which service was slow and why?

At 10 million users, a latency spike that affects 0.1 percent of requests still affects 10,000 requests. Without distributed tracing, finding the service responsible for that spike requires correlating logs across multiple systems manually. With distributed tracing, it takes seconds.
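The propagation that makes this possible can be sketched in a few lines: each service reuses the incoming trace ID (or starts one), mints a fresh span ID for its own work, and forwards both to downstream calls so logs from different services can be joined on `traceId`. The header shape below loosely follows the W3C Trace Context `traceparent` format; real systems should use OpenTelemetry rather than hand-rolling this:

```typescript
// Minimal trace-context propagation sketch (not a real tracing SDK).
interface TraceContext { traceId: string; spanId: string; }

function randomHex(bytes: number): string {
  let out = "";
  for (let i = 0; i < bytes * 2; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// Reuse the caller's trace ID if present; always mint a new span ID
function extractOrStartTrace(headers: Record<string, string>): TraceContext {
  // traceparent format: version-traceId-parentSpanId-flags
  const parent = headers["traceparent"];
  const traceId = parent ? parent.split("-")[1] : randomHex(16);
  return { traceId, spanId: randomHex(8) };
}

// Attach the context to outgoing requests so downstream services join it
function injectTrace(ctx: TraceContext): Record<string, string> {
  return { traceparent: `00-${ctx.traceId}-${ctx.spanId}-01` };
}
```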

The SLO framework — what healthy actually means:

Service Level Objectives define what healthy looks like in measurable terms. Without them, every production incident is a subjective debate about severity. With them, the monitoring system can tell you automatically whether you are meeting your commitments and alert when you are about to breach them.

User-facing API SLOs:
  Availability: 99.9% of requests return non-5xx responses
  Latency p50: under 100ms
  Latency p95: under 500ms
  Latency p99: under 2000ms

Error budget: 0.1% of requests per month may fail
  At 10M users with 50 requests per day each:
  Roughly 15 billion requests per month
  Error budget: 15 million requests may fail per month
  At current error rate of 0.03%: 4.5 million failures per month
  Error budget remaining: 10.5 million requests

✅ The observability investment principle: The cost of good observability is measured in engineering hours and infrastructure spend. The cost of poor observability is measured in hours spent debugging production incidents without the information needed to resolve them. At 10 million users, the second cost is always larger than the first.


Lesson 5: Resilience Engineering — Designing for Failure, Not Against It

At 10 million users, the question is not whether something will fail. It is which thing will fail next, how badly, and how prepared you are to contain the damage. Resilience engineering is the practice of designing systems that degrade gracefully rather than collapsing catastrophically.

The failure modes that scale introduces:

Cascade failures occur when a failure in one service causes failures in dependent services, which cause failures in their dependencies, until the entire system is down. The trigger is often minor — a slow database query, a memory leak in one service, a misconfigured timeout.

Thundering herd occurs when a large number of clients simultaneously attempt to recover from a failure by retrying requests all at once. The retry storm generates more load than the original request volume, which prevents the system from recovering, which generates more retries.

The circuit breaker pattern — stop calling what is not responding:

class CircuitBreaker {
  private failureCount = 0;
  private lastFailureTime = 0;
  private state: "closed" | "open" | "half-open" = "closed";

  constructor(
    private threshold: number = 5,
    private timeout: number = 60000
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.lastFailureTime > this.timeout) {
        this.state = "half-open";
      } else {
        throw new Error("Circuit breaker open — service unavailable");
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failureCount = 0;
    this.state = "closed";
  }

  private onFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    if (this.failureCount >= this.threshold) {
      this.state = "open";
    }
  }
}

Exponential backoff with jitter — prevent thundering herd:

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts: number = 5
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxAttempts) throw error;

      const baseDelay = Math.pow(2, attempt) * 100;
      const jitter = Math.random() * baseDelay;
      const delay = baseDelay + jitter;

      await sleep(delay);
    }
  }
  throw new Error("Max attempts exceeded");
}

Bulkhead pattern — isolate failure domains:

Just as a ship's bulkhead contains flooding to one compartment, the bulkhead pattern isolates failure to one part of the system. Separate thread pools or connection pools for different classes of operations ensure that a slow downstream service does not consume all available connections and starve faster operations.

const criticalOperationsPool = new ConnectionPool({
  maxConnections: 50,
  timeout: 5000
});

const backgroundOperationsPool = new ConnectionPool({
  maxConnections: 20,
  timeout: 30000
});

⚠️ The resilience principle: Every external dependency is a potential failure point. Every external call needs a timeout. Every timeout needs a fallback. Every fallback needs to be tested. At 10 million users, the dependencies you have not planned fallbacks for are the ones that will cause your most memorable production incidents.


Lesson 6: The Content Delivery and Edge Caching Layer

At 10 million users distributed across geographies, the physical distance between your servers and your users is a meaningful source of latency. A server in Virginia serving a user in Singapore introduces 200 to 300 milliseconds of round-trip time regardless of how fast your application is. No amount of query optimization eliminates the speed of light.

What belongs at the edge:

Static assets including JavaScript bundles, CSS, images, and fonts should never be served from your origin at scale. A CDN serves them from a point of presence close to each user and caches them aggressively.

API responses that are the same for all users or for large groups of users can be cached at the edge. Public product catalog pages, documentation, marketing content, and public-facing feeds are all candidates.

Authentication and bot detection logic can run at the edge, rejecting malicious traffic before it ever reaches your origin servers.

What does not belong at the edge:

Personalized content that is unique to each user cannot be cached at the edge without being keyed to the user's identity — and caching at that granularity provides limited benefit. User-specific data fetching belongs at the origin.

Write operations must reach the origin. A CDN that accepted writes would need to be the authoritative source for that data, which introduces consistency problems that the CDN infrastructure is not designed to solve.

The CDN cache hit rate as a performance indicator:

CDN cache hit rate below 80%: investigate what is being requested that is not cacheable
CDN cache hit rate 80 to 90%: acceptable for most applications with mixed content
CDN cache hit rate above 90%: excellent — significant origin traffic reduction achieved
CDN cache hit rate above 95%: optimal for static-heavy or public-content-heavy applications

💡 The edge architecture insight: Teams that think about the edge as a pure caching layer miss half its value. At 10 million users, the edge is a programmable compute layer that can run authentication, personalization logic, A/B test assignment, and feature flag evaluation with single-digit millisecond latency — before a single packet reaches your origin infrastructure.


Lesson 7: Multi-Tenancy and Tenant Isolation at Scale

For B2B products serving multiple organizations, multi-tenancy introduces a scaling challenge that is distinct from raw traffic scaling: how do you ensure that one tenant's behavior does not degrade the experience for others?

The three multi-tenancy models and their trade-offs:

Shared everything — all tenants share the same database tables, application instances, and infrastructure. Cheapest to operate. Hardest to isolate. One tenant with a pathological query pattern can degrade performance for every other tenant.

Shared infrastructure, isolated data — all tenants share application infrastructure but have separate database schemas or databases. Good balance of cost and isolation. Database-level isolation prevents cross-tenant data leakage. Application-level isolation requires disciplined query patterns.

Fully isolated — each tenant gets dedicated infrastructure. Maximum isolation and customization. Operationally expensive at scale. Typically reserved for enterprise customers with specific compliance requirements.
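In the shared models, the disciplined query patterns can be enforced in code rather than trusted to review. A sketch of a query helper that fails loudly when the tenant filter is missing; the names are illustrative, and real code should use bound parameters or database-level row security rather than string interpolation:

```typescript
// Tenant-scoping guard: every query must carry a tenant id, so a
// forgotten filter throws instead of scanning (or leaking) every
// tenant's rows. Sketch only: use bound parameters in production.
function scopedQuery(
  table: string,
  tenantId: string | null,
  where = ""
): string {
  if (!tenantId) throw new Error(`query on ${table} without a tenant id`);
  const extra = where ? ` AND ${where}` : "";
  return `SELECT * FROM ${table} WHERE tenant_id = '${tenantId}'${extra}`;
}
```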

The noisy neighbor problem and how to solve it:

// Without tenant rate limiting, one tenant can consume all database capacity
// A single enterprise customer running a bulk export at 9am
// can cause 200ms latency spikes for every other tenant

// Rate limit at the application layer by tenant
const tenantRateLimiter = new RateLimiter({
  strategy: "sliding-window",
  limits: {
    starter: { requests: 100, window: "1m" },
    growth: { requests: 500, window: "1m" },
    enterprise: { requests: 2000, window: "1m" },
  }
});

// Rate limit expensive operations specifically
const bulkExportLimiter = new RateLimiter({
  limits: {
    all_tiers: { requests: 1, window: "1h" }
  }
});

🔑 The multi-tenancy lesson: Tenant isolation is not primarily a security feature — it is a reliability feature. The security case for tenant isolation is obvious. The reliability case is often missed until a large tenant's batch job degrades response times for every other tenant during business hours.


Lesson 8: Security at Scale Is Different From Security at Small Scale

Security concerns do not simply scale with user count. They change in kind. The attack surface at 10 million users is categorically different from the attack surface at 10,000 users, and the security practices that were adequate at small scale are often inadequate at large scale.

The security changes that scale introduces:

Credential stuffing becomes a serious threat. With 10 million user accounts, your platform is a target for automated login attempts using credentials leaked from other breaches. Rate limiting login attempts is mandatory. Anomaly detection for login patterns is necessary. Breach notification monitoring services become worth their cost.
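The minimum viable defense is a per-account throttle on login attempts. A fixed-window sketch with an injected clock for testability; the class and names are illustrative, and production versions typically keep the counters in Redis so the limit holds across application servers:

```typescript
// Fixed-window login throttle keyed by account. `now` is passed in
// (epoch milliseconds) so the window logic is deterministic in tests.
class LoginThrottle {
  private attempts = new Map<string, { count: number; windowStart: number }>();

  constructor(private maxAttempts: number, private windowMs: number) {}

  // Returns true if this attempt is allowed, false if throttled
  allow(account: string, now: number): boolean {
    const entry = this.attempts.get(account);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      // First attempt, or the previous window has expired: reset
      this.attempts.set(account, { count: 1, windowStart: now });
      return true;
    }
    entry.count++;
    return entry.count <= this.maxAttempts;
  }
}
```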

API abuse moves from theoretical to practical. At small scale, API abuse is expensive for the attacker because it is unlikely to yield results. At large scale, scraping, bulk account creation, and data harvesting become economically viable for attackers because the yield justifies the effort.

Insider threat probability increases with team size. At 10 engineers, everyone knows everyone and trust is high. At 100 engineers, you have people with production database access who have been on the team for two weeks. Access controls, audit logs, and the principle of least privilege become operational necessities rather than theoretical best practices.

The security practices that matter most at scale:

  • Web application firewall with rules tuned to your specific traffic patterns

  • Rate limiting at multiple layers — CDN, API gateway, and application

  • Anomaly detection that distinguishes normal user behavior from automated abuse

  • Audit logs for every privileged action with immutable storage

  • Secrets management that rotates credentials automatically and never stores them in code

  • Dependency scanning in CI that fails builds when vulnerable packages are introduced


The Architecture Decision Framework

Every significant architecture decision at scale involves trade-offs between competing values. The teams that make good decisions consistently are not the ones with the most technically sophisticated engineers — they are the ones with the clearest framework for evaluating trade-offs explicitly.

| Decision | Consistency | Availability | Simplicity | Choose When |
|---|---|---|---|---|
| Single primary database | High | Medium | High | Under 1M writes per day |
| Read replicas | Medium | High | Medium | Read-to-write ratio above 10:1 |
| Database sharding | Medium | High | Low | Single database cannot handle write volume |
| Synchronous service calls | High | Low | High | Operation requires guaranteed consistency |
| Asynchronous messaging | Low | High | Medium | Operation can tolerate eventual consistency |
| Monolith | High | Medium | High | Team under 20 engineers |
| Microservices | Medium | High | Low | Multiple teams with independent deploy needs |
| Shared database multi-tenancy | Low | High | High | Under 100 tenants with similar usage patterns |
| Isolated database multi-tenancy | High | High | Low | Enterprise customers with compliance requirements |


The Scaling Mindset — What 10M Users Actually Teaches You

The technical lessons are important. The mindset lessons are more important.

Everything fails. Design for failure explicitly.

At 10 million users, the question is never whether something will fail. It is whether the failure is contained, whether the system degrades gracefully, and whether recovery is fast. The teams who have internalized this build differently from the teams who still believe that with enough care, failures can be prevented.

Complexity is a cost that compounds.

Every clever optimization, every additional service, every non-standard configuration is complexity that every engineer on your team pays for forever. The teams that scale well are relentlessly focused on eliminating accidental complexity — the complexity that does not make the product better but makes the system harder to operate.

Measure before you optimize.

The bottleneck you assume exists is rarely the bottleneck that actually exists. At scale, guessing about performance is expensive. Measuring is cheap. Every optimization should be preceded by a measurement that proves the optimization is necessary and followed by a measurement that proves it worked.

The humans are the hardest part to scale.

Technical systems scale predictably given sufficient resources. Human coordination does not. The on-call rotation that was manageable with five engineers becomes a burnout machine with twenty engineers on it. The deployment process that worked with one team breaks down when five teams are all trying to ship on Friday afternoon. The organizations that scale their engineering culture as deliberately as their infrastructure are the ones that reach 10 million users without destroying the people who got them there.

💡 The final lesson: The best time to think about scale is before you need it — not to over-engineer, but to make decisions now that will not close off options later. Use an ORM but log every query it generates. Start with a monolith but define clean module boundaries. Cache aggressively but build the invalidation logic from the start. The decisions that are cheapest to make correctly are the ones you make before the scale pressure arrives.


Conclusion

Reaching 10 million users is not a destination — it is a vantage point from which you can see clearly how much you did not know at a million, at a hundred thousand, at ten thousand. Every scale threshold is an education in the assumptions that were baked into the previous architecture.

The patterns in this article are not a checklist to implement before you need them. They are a vocabulary for the decisions you will face and a framework for making them with clear eyes about the trade-offs involved.

The principles that hold at every scale:

  • Your database will be the first bottleneck — understand it deeply

  • Caching is an architecture decision with consistency implications, not a performance trick

  • Observability is how you know what is happening — invest in it before you need it

  • Resilience means designing for failure, not hoping to prevent it

  • Organizational structure and system architecture mirror each other — align them deliberately

Build for the scale you have. Build with clear eyes about the scale you are heading toward, make the decisions that keep your options open, and scale the humans as deliberately as the infrastructure.

The architecture that gets you to 10 million users is not the one you designed on day one. It is the one that emerged from a series of deliberate decisions made by a team that understood the trade-offs clearly enough to choose the right one each time.

Tags: #JavaScript #TypeScript #Microservices #DevOps #SoftwareEngineering #BackendDevelopment #SRE #SystemDesign #DistributedSystems #ScalableArchitecture #DatabaseOptimization #EngineeringLeadership