Availability Patterns

Let's talk about availability patterns in system design.

Availability patterns are established architectural approaches used to ensure a system remains operational and accessible to users, even in the face of failures or unexpected events. These patterns focus on minimizing downtime and maintaining a consistent level of service by incorporating redundancy, fault tolerance, and recovery mechanisms into the system's design. They provide a structured way to address potential points of failure and ensure business continuity.

Availability in Numbers

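Availability is usually quoted as the percentage of time a service is up. The figures below are the acceptable downtime at 99.9% availability ("three nines").

Duration              Acceptable downtime
Downtime per year     8h 45min 57s
Downtime per month    43m 49.7s
Downtime per week     10m 4.8s
Downtime per day      1m 26.4s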
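
A downtime budget like the one above follows from multiplying the length of the period by the fraction of time the service is allowed to be unavailable. The following is a minimal sketch of that arithmetic, assuming a 99.9% target and an average year of 365.2425 days; the function name allowed_downtime and the PERIODS table are illustrative, and the yearly and monthly results shift slightly if a different year or month length is assumed.

```python
from datetime import timedelta

# Assumption: an average Gregorian year of 365.2425 days, split into 12 equal months.
PERIODS = {
    "year": timedelta(days=365.2425),
    "month": timedelta(days=365.2425 / 12),
    "week": timedelta(weeks=1),
    "day": timedelta(days=1),
}


def allowed_downtime(availability_percent: float, period: str) -> timedelta:
    """Return the downtime budget for one period at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return PERIODS[period] * unavailable_fraction


if __name__ == "__main__":
    # Print the budget for each period at an assumed 99.9% target.
    for name in PERIODS:
        print(f"Downtime per {name}: {allowed_downtime(99.9, name)}")
```

Changing the first argument (for example to 99.99) shows how quickly the budget shrinks with each additional nine.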

Availability in Parallel vs in Sequence

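Overall availability decreases when two components with availability < 100% are in sequence, because a request succeeds only if both components are up:

Availability (Total) = Availability (Foo) * Availability (Bar)

Example: If both Foo and Bar each had 99.9% availability, their total availability in sequence would be about 99.8%.

Overall availability increases when two components with availability < 100% are in parallel, because the system is down only when both components are down at the same time:

Availability (Total) = 1 - (1 - Availability (Foo)) * (1 - Availability (Bar))

Example: If both Foo and Bar each had 99.9% availability, their total availability in parallel would be 99.9999%.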
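
The arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes component failures are independent: components in sequence must all be up, so their availabilities multiply, while components in parallel fail together only when every one of them is down. The function names are illustrative.

```python
from math import prod


def availability_in_sequence(*components: float) -> float:
    """All components must be up, so the availabilities multiply."""
    return prod(components)


def availability_in_parallel(*components: float) -> float:
    """The system is down only if every component is down at once."""
    return 1 - prod(1 - a for a in components)


# Foo and Bar from the example, each at 99.9% availability.
foo = bar = 0.999
print(f"in sequence: {availability_in_sequence(foo, bar):.4%}")
print(f"in parallel: {availability_in_parallel(foo, bar):.4%}")
```

With two components at 99.9% each, the sequence case comes out just under 99.9% (about 99.8%), while the parallel case rises to roughly 99.9999%.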

Availability Patterns

Replication

Replication is an availability pattern that involves having multiple copies of the same data stored in different locations. In the event of a failure, the data can be retrieved from a different location. There are two main types of replication: Master-Master replication and Master-Slave replication.

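Master-Master replication: Multiple servers are configured as "masters," each accepting both read and write operations. This provides high availability, because any master can keep serving traffic if another fails, but concurrent writes to different masters can conflict, so the system needs a conflict resolution strategy.

Master-Slave replication: One server is the "master" and handles writes, while one or more "slaves" replicate its data and handle reads. If the master fails, a slave is promoted to take its place. This setup is simpler to maintain because all writes go through a single node.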
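
The Master-Slave behaviour described above (writes to the master, reads from slaves, promotion on failure) can be sketched with a small in-memory model. This is an illustrative toy, not how any particular database implements replication: the Node and MasterSlaveCluster classes are hypothetical, replication is done synchronously for simplicity, and failure is simulated by flipping a healthy flag.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    data: dict = field(default_factory=dict)
    healthy: bool = True


class MasterSlaveCluster:
    """Hypothetical model: one master takes writes, slaves serve reads."""

    def __init__(self, master: Node, slaves: list[Node]):
        self.master = master
        self.slaves = slaves

    def write(self, key, value):
        # Fail over first if the current master is down.
        if not self.master.healthy:
            self.promote_slave()
        self.master.data[key] = value
        # Copy the write to every healthy slave (synchronous, for simplicity).
        for slave in self.slaves:
            if slave.healthy:
                slave.data[key] = value

    def read(self, key):
        # Prefer a healthy slave, fall back to the master if none is available.
        for slave in self.slaves:
            if slave.healthy:
                return slave.data.get(key)
        if self.master.healthy:
            return self.master.data.get(key)
        raise RuntimeError("no healthy node available for reads")

    def promote_slave(self):
        # Promote the first healthy slave to be the new master.
        for i, slave in enumerate(self.slaves):
            if slave.healthy:
                self.master = self.slaves.pop(i)
                return
        raise RuntimeError("no healthy slave available to promote")


cluster = MasterSlaveCluster(Node("db-1"), [Node("db-2"), Node("db-3")])
cluster.write("user:1", "Ada")
cluster.master.healthy = False      # simulate a master failure
cluster.write("user:2", "Grace")    # triggers promotion of a slave
print(cluster.master.name, cluster.read("user:1"), cluster.read("user:2"))
```

Because the promoted slave already holds a copy of the data, earlier writes remain readable after the failover. Real systems often replicate asynchronously, in which case a promoted slave may be missing the most recent writes.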

SLI, SLO & SLA

Understanding Availability Terms

SLI, SLO, and SLA are fundamental concepts for measuring and managing service availability. They define, respectively, the indicators that are measured, the objectives set for those indicators, and the contractual agreements built on them.

Service Level Indicator (SLI)

An SLI is a quantitative metric that measures one aspect of a service's performance. It indicates how well the service is doing in terms such as availability, latency, or error rate.
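
As an illustration, SLIs are commonly expressed as a ratio of "good" events to all events in a measurement window. The sketch below assumes request counts and latencies have already been collected; the function names and the 300 ms threshold are hypothetical choices, not part of any standard.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded in the measurement window."""
    if total_requests == 0:
        raise ValueError("no requests in the measurement window")
    return successful_requests / total_requests


def error_rate_sli(failed_requests: int, total_requests: int) -> float:
    """Fraction of requests that failed in the measurement window."""
    if total_requests == 0:
        raise ValueError("no requests in the measurement window")
    return failed_requests / total_requests


def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests served within the latency threshold."""
    if not latencies_ms:
        raise ValueError("no requests in the measurement window")
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return fast / len(latencies_ms)


# Example window: 10,000 requests, 10 of which failed.
print(f"availability: {availability_sli(9_990, 10_000):.3%}")
print(f"error rate:   {error_rate_sli(10, 10_000):.3%}")
print(f"fast enough:  {latency_sli([120, 250, 310, 90], threshold_ms=300):.0%}")
```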