Distributed Systems Engineering — Part 5: Observability at Scale
Traces, metrics, logs — the three pillars and the fourth nobody talks about: profiling. How to instrument distributed systems so you can debug them when they fail at 3am.