Availability & Fault Tolerance: Keeping the Lights On

Imagine it’s the day of your board exam results. You, along with millions of other students, log in to the result portal at 10:00 AM. The site crashes. That is an Availability failure.

In previous posts, we discussed Throughput and Latency. Today, we tackle the most critical aspect of any production system: staying online.

Core Concepts

1. Availability

Availability is the percentage of time a system is operational and accessible to users. It’s often measured in “nines”:

99%: Down for 3.65 days/year.
99.99% (“Four Nines”): Down for 52 minutes/year.
99.999% (“Five Nines”): Down for 5 minutes/year.

2. Fault Tolerance

Fault Tolerance is the ability of a system to continue operating properly in the event of the failure of some of its components.

Example: If one engine of an airplane fails, the plane can still fly. The plane is Fault Tolerant.

3. Single Point of Failure (SPOF)

A part of a system that, if it fails, stops the entire system from working.

Monoliths vs. Distributed Systems

The architectural choice you make dictates your system’s survival strategy.

Feature	Monolithic Architecture	Distributed System
Structure	All-in-one bundle	Modular, spread across nodes
Failure Mode	SPOF: If the server crashes, everything dies.	Resilient: If one node dies, others take over.
Availability	Low (Requires downtime for updates/crashes)	High (Zero-downtime deployments)
Recovery	Reboot the entire beast	Automatic failover

  graph TD
    subgraph Monolith [Monolithic System - SPOF]
        User1[User] --> Server[Single Server]
        Server -.-> Crash[X Crash]
        Crash -.-> Down[System Offline]
        style Server fill:#ff9999
    end

    subgraph Distributed [Distributed System - Fault Tolerant]
        User2[User] --> LB[Load Balancer]
        LB --> NodeA[Node A]
        LB --> NodeB[Node B]
        NodeA -.-> Fail[X Fail]
        LB ==Failover==> NodeB
        style NodeA fill:#ff9999
        style NodeB fill:#99ff99
    end

The Secret Sauce: Replication

How do distributed systems achieve high availability? Redundancy.

We don’t just rely on one server. We Replicate everything.

Application Replication: Run the same code on 10 different servers. If 3 crash, 7 are still running.
Data Replication: Store your user data on a primary database and sync it to a standby replica. If the primary burns down, the standby takes over.
Geographic Replication: Don’t put all servers in one data center. If the entire underlying power grid of a region fails, your app typically keeps running from a different region.

The ACID Trade-off

In databases, replication introduces complexity. If you write data to Node A, it takes time to copy to Node B. This touches on the CAP Theorem (Consistency vs. Availability), which we will cover in depth later.

Real-Life Examples

1. Amazon Shopping (Prime Day)

On Prime Day, traffic spikes 100x. Amazon uses thousands of microservices distributed across the globe. If the “Reviews” service crashes, you can still buy items. The system degrades gracefully rather than failing completely.

2. Google Search

Google indexes the web across thousands of machines. If the specific server holding the index for “SpaceX” fails, a replica immediately answers your query. You, the user, never know a failure occurred.

Conclusion

Monoliths put all eggs in one basket. If that basket drops, you have a mess. Distributed Systems accept that failure is inevitable. Hard drives die, networks cut out, and power fails. By designing with Fault Tolerance and Replication in mind, we build systems that can survive the chaos of the real world.