After understanding the basics of fault domains and fault tolerance, I naturally got curious about how they are implemented irl and did a little googling…
Here are a few strategies in action and some fun examples I came across:
Replication
Replication, aka creating copies of data or services, is a fundamental strategy in computing and data management. Replicating across different fault domains (servers, data centers, geographic regions…) provides better availability: if one replica becomes unavailable, another can seamlessly take over, minimizing downtime and data loss. (P.S. replication also has other benefits, such as load balancing, that I won't get into here.)
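To make that concrete, here is a minimal, purely illustrative sketch (the ReplicatedStore class and the region names are invented for this post, not any real system's API) of writing to replicas in several fault domains and reading from whichever one is still healthy:

```python
# A purely illustrative sketch: every write goes to a replica in each
# fault domain, and reads fall back to any healthy replica.
class ReplicatedStore:
    def __init__(self, fault_domains):
        # one in-memory "replica" per fault domain, e.g. per region
        self.replicas = {fd: {} for fd in fault_domains}
        self.down = set()  # fault domains currently unavailable

    def put(self, key, value):
        # synchronous replication: write to every reachable replica
        for fd, store in self.replicas.items():
            if fd not in self.down:
                store[key] = value

    def get(self, key):
        # read from any healthy replica; losing one fault domain is fine
        for fd, store in self.replicas.items():
            if fd not in self.down and key in store:
                return store[key]
        raise KeyError(key)


store = ReplicatedStore(["us-east", "us-west", "eu-central"])
store.put("user:42", {"name": "Ada"})
store.down.add("us-east")    # simulate one fault domain going offline
print(store.get("user:42"))  # still served from another replica
```

Real systems obviously add consistency protocols, failure detection, and re-replication on top, but the core idea is the same: no single fault domain holds the only copy.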
Sharding
Sharding is a database partitioning technique that splits a larger dataset into smaller, faster, more easily managed parts called shards. Each shard can be placed in a different fault domain, reducing the impact of failures and improving performance by distributing the load. Horizontal partitioning divides data across shards based on a key, such as a user ID or region, which can help localize and isolate faults. While sharding enhances resilience and scalability, one thing to keep in mind is that it also increases the complexity of managing transactions and data consistency across shards.
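As a toy illustration of key-based (horizontal) partitioning, here's how a user ID might be hashed to a shard that lives in its own fault domain (the shard names and the shard_for helper are invented for this post):

```python
import hashlib

# Toy sketch: the shard for a record is derived from its key (here a user
# ID), and each shard lives in its own fault domain (e.g. its own rack).
SHARDS = ["shard-0 (rack A)", "shard-1 (rack B)", "shard-2 (rack C)"]

def shard_for(user_id: str) -> str:
    # a stable hash keeps a given user on the same shard across calls
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user-123"))  # always maps to the same shard
print(shard_for("user-456"))  # if rack B fails, only its shard is affected
```

If rack B goes down, only the users mapped to its shard are affected; the rest of the dataset stays available. Cross-shard queries and transactions are where the extra complexity creeps in.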
Circuit Breakers
The circuit breaker pattern is a software design pattern used to prevent a system from making calls to a part of the system that is likely to fail. It works as follows (there's a small code sketch after the list):
Failure Detection: When the system detects a predetermined threshold of failures for a given service (e.g., timeouts, errors), it "trips" the circuit breaker, temporarily halting all calls to the failing service.
Fallback Mechanisms: Once the circuit is open, the system can redirect requests to a fallback mechanism, such as a cache or a default service response, ensuring that users still receive a response, even if it's a degraded one.
Recovery and Reset: After a cooldown period, the circuit breaker allows a limited number of test requests to the failing service. If these succeed, it assumes the service is back to normal and closes the circuit, resuming normal operations.
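Here's a minimal sketch of those three steps in Python. The class name, the failure threshold, and the cooldown are made-up illustrations, not taken from any real library:

```python
import time

# Minimal circuit breaker sketch following the three steps above.
# The threshold and cooldown values are arbitrary, illustrative numbers.
class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_seconds=30):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None
        self.state = "CLOSED"

    def call(self, func, fallback):
        if self.state == "OPEN":
            # recovery and reset: after the cooldown, let a test request through
            if time.time() - self.opened_at >= self.cooldown_seconds:
                self.state = "HALF_OPEN"
            else:
                return fallback()
        try:
            result = func()
        except Exception:
            # failure detection: count failures and trip the breaker at the threshold
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.time()
            # fallback mechanism: the caller still gets a (degraded) response
            return fallback()
        # success: reset counters and close the circuit
        self.failures = 0
        self.state = "CLOSED"
        return result
```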
Example 1: Cloud
So how do major cloud providers like AWS, Azure, and GCP implement fault domains to ensure the availability and reliability of their services?
I randomly picked Azure (the others should be similar), and here are some of the ways I found, with a tiny conceptual sketch after the list:
MSFT Azure’s Implementation of Fault Domains
Availability Sets: Azure encourages the use of availability sets—a logical grouping that distributes VMs across multiple fault domains. This setup is designed to protect applications from failures of physical hardware, power outages, or network issues, by ensuring that not all VMs in an application will be affected simultaneously.
Zone-Redundant Services: Beyond individual data centers, Azure employs availability zones, which are unique physical locations within a region. Each availability zone is an isolated fault domain, ensuring that applications can remain operational even if one zone experiences a disruption.
Managed Disks: Azure Managed Disks are designed to work seamlessly with availability sets and VM scale sets. To quote their documentation: "Managed disks achieve this by providing you with three replicas of your data, allowing for high durability."
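For intuition only (this is not the Azure SDK or CLI, just a conceptual sketch with made-up names), here is what spreading the VMs of an availability set round-robin across fault domains looks like:

```python
from itertools import cycle

# Conceptual sketch only, not the Azure SDK: an availability set with three
# fault domains spreads VMs round-robin so that a single hardware or power
# failure cannot take out every instance of the application at once.
FAULT_DOMAINS = ["FD-0", "FD-1", "FD-2"]

def place_vms(vm_names):
    return {vm: fd for vm, fd in zip(vm_names, cycle(FAULT_DOMAINS))}

print(place_vms(["web-1", "web-2", "web-3", "web-4"]))
# {'web-1': 'FD-0', 'web-2': 'FD-1', 'web-3': 'FD-2', 'web-4': 'FD-0'}
```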
Another fun read if you want to know how fault domain (FD) awareness works in Apache Pinot.
Example 2: Netflix’s Chaos Engineering
When my colleague first mentioned Chaos Monkey over text, I thought it was a typo… But no! It was actually something really cool!
Netflix deliberately introduces failures into its system to test resilience. A notable example is their use of the Simian Army, a suite of tools designed to simulate failures. One of these tools, Chaos Monkey, randomly terminates virtual machine instances and containers to ensure that Netflix's system can tolerate instance failures without affecting the user experience. This approach ensures that when parts of their system fail, other instances can seamlessly take over. Thanks to this, I am able to enjoy Breaking Bad without interruption.
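Just for fun, here's a tiny Chaos-Monkey-inspired drill (vastly simpler than Netflix's real tooling; every name below is invented) that terminates a random instance and checks that the toy "service" still has capacity:

```python
import random

# Toy chaos drill: randomly "terminate" one running instance, then assert
# that the service survives, i.e. at least one instance is still running.
instances = {"vm-a": "running", "vm-b": "running", "vm-c": "running"}

def chaos_step(fleet):
    victim = random.choice([n for n, state in fleet.items() if state == "running"])
    fleet[victim] = "terminated"
    print(f"Chaos: terminated {victim}")

def healthy(fleet):
    return any(state == "running" for state in fleet.values())

chaos_step(instances)
assert healthy(instances), "one instance failure should not take the service down"
```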
Example 3: Shopify’s Resilient Checkout System
Shopify, an e-commerce platform, uses a circuit breaker pattern to enhance the reliability of its checkout system. During high-traffic events like Black Friday, Shopify’s system must handle a surge in transactions without faltering.
By implementing circuit breakers in their payment processing service, Shopify can quickly detect when a payment gateway becomes unreliable (due to slow responses or errors) and reroute transactions through another gateway or temporarily hold transactions until the issue is resolved. This prevents the failure of a single component from disrupting the checkout process for millions of shoppers.
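To tie this back to the circuit breaker sketch from earlier, here's a hypothetical usage (charge_primary and charge_backup are invented stand-ins, not Shopify's actual code) in which failures of the primary gateway fall back to a backup gateway, and repeated failures would trip the breaker:

```python
# Hypothetical example reusing the CircuitBreaker sketch from the circuit
# breaker section; the gateways below are simulated, not real integrations.
def charge_primary(order_id, amount):
    raise TimeoutError("primary gateway is timing out")  # simulate an unreliable gateway

def charge_backup(order_id, amount):
    return {"order": order_id, "amount": amount, "gateway": "backup", "status": "paid"}

payments = CircuitBreaker(failure_threshold=3, cooldown_seconds=30)

receipt = payments.call(
    func=lambda: charge_primary("order-1001", 59.99),
    fallback=lambda: charge_backup("order-1001", 59.99),
)
print(receipt)  # the shopper still checks out, via the backup gateway
```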
Real-life Incidents :(
However, things can still go wrong even with robust systems.
Here are some scary incidents that I somehow don't remember happening and got chills reading about:
Google Cloud Outage (2019): Google Cloud suffered a significant outage affecting GCP services, including Google Compute Engine and Google Kubernetes Engine, primarily in the eastern US.
GitHub DDoS Attack (2018): GitHub experienced a massive distributed denial-of-service (DDoS) attack, which flooded its systems with traffic, attempting to overwhelm the service.
Azure Storage Outage (2014): Microsoft Azure experienced a global outage of its storage services due to a bug triggered during a performance update. The bug affected multiple regions simultaneously.
A wise Chinese proverb goes "兵来将挡,水来土掩" which translates into "When soldiers come, block them with a general; when water comes, cover it with earth". This saying emphasizes adaptability and resourcefulness in the face of challenges, suggesting that for every problem, there is an appropriate solution or defense. It's about meeting adversity head-on with strategic thinking and appropriate action.
I believe the same philosophy applies to system design as well. It is always good to look around and see what strategies are at our disposal when building something new, and how to improve after an unexpected incident (that may have been caused by some man(monkey?)made chaos).
By now, I hope we all understand the importance of well-designed fault domains and a robust, fault-tolerant system! I also hope you had fun reading :)