Building Resilient Software Systems


What is a resilient software system?

A resilient software system is a distributed software system that continues to function correctly despite partial or complete failures in some of its parts. You can't really avoid failures in your application architecture; they are bound to happen. That's why it's important to design software systems that anticipate failures and make provisions to handle them.

I found a good representation of a distributed web application architecture; check it out below.


Why do distributed software systems fail?

A few commonly observed causes of failure in distributed systems include:

- Traffic surges (huge request load), e.g. the Coinbase outage in Feb 2022
- Natural or catastrophic events such as earthquakes, floods, and power outages that lead to server breakdowns, e.g. Microsoft Azure and M365
- Planned hacker attacks
- A severe bug rolled out to production, e.g. the Slack outage in 2022
- Changes in network configuration, e.g. the Cloudflare outage on June 21, 2022
- External vendor feature failures, e.g. the Spotify failure in Mar 2022

Google lists a variety of motivations (drivers), from different perspectives, that can help you improve resiliency in your application. You can find this list here.


How to improve resiliency in software systems?

Resiliency must be established at all levels of a distributed software system. This includes the software clients (web app, mobile apps), the backend infrastructure (application servers), the storage layer (cache, SQL/NoSQL backend DBs), and finally your system processes. In the following sections, I am going to share specific recommendations for implementing resiliency in each of these layers. Let's go!

Software Clients

Graceful degradation

Software clients often depend on multiple services or APIs to orchestrate a use case. Let's take the example of a movie ticket booking use case. The client has to know the list of available movies, the show timings, the currently available seats, and so on, all of which are served through different APIs.

Software clients must classify all of their dependencies as either hard or soft. Whenever a soft dependency fails, the application should be able to selectively and gracefully degrade the customer experience. Avoid complete workflow failures, especially for long wizard-like workflows.

Here's a detailed article that I highly recommend for understanding how to build resilient UIs that support graceful degradation.
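
To make the hard vs soft distinction concrete, here is a minimal Python sketch. The function name `load_booking_page`, the dependency names, and the `degraded` key are all hypothetical, chosen only for illustration; a real client would call actual APIs instead of these stand-in functions.

```python
from typing import Any, Callable, Dict


def load_booking_page(
    hard: Dict[str, Callable[[], Any]],
    soft: Dict[str, Callable[[], Any]],
) -> Dict[str, Any]:
    """Fetch every dependency; fail only when a hard dependency fails."""
    page: Dict[str, Any] = {"degraded": []}
    for name, fetch in hard.items():
        # Hard dependency: let the error propagate, the page is useless without it.
        page[name] = fetch()
    for name, fetch in soft.items():
        try:
            page[name] = fetch()
        except Exception:
            # Soft dependency: render without it and record what is missing.
            page[name] = None
            page["degraded"].append(name)
    return page


def failing_reviews() -> list:
    raise ConnectionError("reviews service unavailable")


page = load_booking_page(
    hard={"showtimes": lambda: ["18:00", "21:00"]},
    soft={"reviews": failing_reviews},
)
print(page)
```

In this sketch the booking page still renders with show timings even though the (assumed soft) reviews dependency is down; the UI can use the `degraded` list to hide or grey out the missing section.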

Retries

Retries are especially useful for failures of mandatory (hard) dependencies. Wherever applicable, applications should implement retry mechanisms in key workflows. One good example is the payment use case. On e-commerce web sites, if the payment attempt fails during checkout, retrying automatically without user intervention can give the customer a delightful experience. Retry mechanisms are good for scenarios where a subsequent call has a reasonable probability of success.

One drawback of implementing retries in a large distributed software system is that they can increase the load on the backend servers if not managed properly.
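
A bounded retry loop might look like the following sketch. The helper name `retry`, its parameters, and the `flaky_payment` stand-in are assumptions made for illustration; the injectable `sleep` argument exists only so the example runs instantly.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    max_attempts: int = 3,
    delay_seconds: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying a bounded number of times on transient errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(delay_seconds)
    raise AssertionError("unreachable")


attempts = []


def flaky_payment() -> str:
    # Hypothetical payment call that fails twice, then succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("payment gateway unreachable")
    return "payment accepted"


result = retry(flaky_payment, sleep=lambda seconds: None)
print(result, "after", len(attempts), "attempts")
```

Capping `max_attempts` and retrying only on errors that are likely to be transient is one way to keep retries from piling extra load onto a struggling backend. For something like a payment, retrying is only safe when the backend call is idempotent.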

Timeouts

Timeouts are a good way to handle processing-heavy backend API calls. When an API call takes longer than a certain amount of time to return a response, the application should time out and show a degraded experience rather than wait endlessly.
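
As a rough sketch of the idea, the snippet below bounds how long a caller waits and falls back to a degraded value. The helper `call_with_timeout` and the stand-in functions are hypothetical; in a real client you would more likely set the timeout on the HTTP library itself.

```python
import concurrent.futures
import time
from typing import Any, Callable


def call_with_timeout(
    operation: Callable[[], Any], timeout_seconds: float, fallback: Any
) -> Any:
    """Return operation() if it finishes in time, otherwise the fallback value."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(operation)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # Stop waiting and show a degraded experience instead.
        return fallback
    finally:
        # Do not block on the slow call; the worker thread finishes on its own.
        executor.shutdown(wait=False)


def slow_seat_map() -> str:
    time.sleep(0.3)  # simulates a processing-heavy backend call
    return "live seat map"


slow_result = call_with_timeout(slow_seat_map, 0.05, "seat map unavailable")
fast_result = call_with_timeout(lambda: "live seat map", 1.0, "seat map unavailable")
print(slow_result, "|", fast_result)
```

Note that this sketch only stops waiting; it does not cancel the slow call, which keeps running in its worker thread.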

Circuit breakers

Retries are a reactive pattern, whereas the circuit breaker is a proactive one. Circuit breakers enable apps to proactively restrict user workflows that are likely to fail completely because of backend failures observed in other parts of the system. If you would like to do a deep dive on this pattern, refer to this article by Microsoft.
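
The sketch below shows one common shape of a circuit breaker: closed while calls succeed, open after repeated failures, and "half-open" (one trial call allowed) once a cool-down has passed. The class name, thresholds, and injectable `clock` are assumptions for illustration, not taken from any particular library.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], Any]) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the backend.
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A client would wrap each backend call in `breaker.call(...)` and treat `CircuitOpenError` as a signal to disable or degrade the affected workflow until the backend recovers.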

Exponential backoff

Exponential backoff is a pattern for controlling server overload caused by excessive retries. When implemented by the apps, exponential backoff gradually reduces the frequency of retries by increasing the wait between successive attempts.
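
A small sketch of how the wait times could be computed is shown below. The function name and default values are hypothetical. The optional random jitter is an addition that the text above does not mention; it is often combined with backoff so that many clients do not retry at the same instant.

```python
import random
from typing import Callable, List


def backoff_delays(
    base: float = 0.5,
    factor: float = 2.0,
    cap: float = 30.0,
    attempts: int = 5,
    jitter: bool = False,
    rng: Callable[[], float] = random.random,
) -> List[float]:
    """Return the wait (in seconds) before each retry attempt."""
    delays = []
    for attempt in range(attempts):
        # Grow the delay geometrically, but never beyond the cap.
        delay = min(cap, base * factor ** attempt)
        if jitter:
            # Pick a random wait between 0 and the computed delay.
            delay = rng() * delay
        delays.append(delay)
    return delays


delays = backoff_delays(base=1, factor=2, cap=10, attempts=5)
print(delays)
```

With a base of 1 second, a factor of 2, and a cap of 10 seconds, the waits grow as 1, 2, 4, 8 and then stay at 10 seconds.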

Infrastructure

Building resilient infrastructure starts with a few key patterns such as redundancy, load balancing, immutable infrastructure, and infrastructure as code.

Redundancy

Creating component redundancy in your architecture gives your application higher availability. One of the first candidates for replication is your application server. If you are using AWS cloud services to host your application server, redundancy means deploying your app servers across multiple Availability Zones. Once your application is deployed across multiple AZs, a load balancer can help you distribute traffic evenly across these zones.
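
To illustrate how redundancy and load balancing work together, here is a toy round-robin balancer that skips servers marked unhealthy. The class and the server names are hypothetical, and a managed load balancer does far more (health checks, connection draining, and so on); this is only a sketch of the core idea.

```python
from itertools import count
from typing import Iterable, List, Set


class RoundRobinBalancer:
    """Rotate requests across the servers that are currently healthy."""

    def __init__(self, servers: Iterable[str]) -> None:
        self.servers: List[str] = list(servers)
        self.healthy: Set[str] = set(self.servers)
        self._counter = count()

    def mark_down(self, server: str) -> None:
        self.healthy.discard(server)

    def mark_up(self, server: str) -> None:
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self) -> str:
        candidates = [s for s in self.servers if s in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy servers available")
        return candidates[next(self._counter) % len(candidates)]


balancer = RoundRobinBalancer(["app-az-a", "app-az-b", "app-az-c"])
first_four = [balancer.next_server() for _ in range(4)]
balancer.mark_down("app-az-b")  # simulate losing one zone
after_failure = {balancer.next_server() for _ in range(4)}
print(first_four, after_failure)
```

Because the application runs in three zones in this example, losing one of them only shrinks the pool; requests keep flowing to the remaining two.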

Auto Scaling

The ability to dynamically allocate and deallocate infrastructure resources according to the application's traffic pattern is called auto scaling. Auto scaling enables your infrastructure to scale up in times of high demand and scale back down when demand returns to normal. This helps you optimize infrastructure costs.
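
One simple way to express a scaling decision is a proportional rule that tries to keep average CPU utilization near a target, clamped to a minimum and maximum fleet size. The function below is a hypothetical sketch of that rule, not the algorithm of any specific cloud provider, and the default numbers are arbitrary.

```python
import math


def desired_instances(
    current: int,
    cpu_percent: int,
    target_percent: int = 60,
    min_instances: int = 2,
    max_instances: int = 10,
) -> int:
    """Scale the fleet so that average CPU moves toward the target."""
    wanted = math.ceil(current * cpu_percent / target_percent)
    # Never drop below the redundancy floor or exceed the cost ceiling.
    return max(min_instances, min(max_instances, wanted))


print(desired_instances(4, 90))   # busy: scale up
print(desired_instances(4, 15))   # quiet: scale down to the floor
print(desired_instances(8, 120))  # overloaded: capped at the ceiling
```

The minimum keeps some redundancy even when traffic is low, while the maximum puts an upper bound on cost.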

Infrastructure as Code

IaC is the practice of configuring and maintaining your complete infrastructure through code, often kept as configuration files. This includes setting up your servers, storage, security profiles, networks, load balancers, connection topologies, and other components of your infrastructure. IaC replaces manual configuration, which is tedious and difficult to maintain.

IaC typically takes a declarative approach: you define the target state of your infrastructure up front, and the tooling works out how to reach it. It also ensures idempotency, meaning that applying the same configuration repeatedly produces the same result.
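
The toy Python model below illustrates what "declarative" and "idempotent" mean in practice. It is not how any real IaC tool is implemented; the `plan` and `apply` functions and the resource names are hypothetical. The desired state is plain data, `plan` computes the difference from the actual state, and applying the same desired state a second time results in no further changes.

```python
from typing import Dict, List, Tuple

Resources = Dict[str, dict]


def plan(desired: Resources, actual: Resources) -> List[Tuple[str, str]]:
    """List the actions needed to move the actual state to the desired state."""
    actions = []
    for name, config in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != config:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply(desired: Resources, actual: Resources) -> Resources:
    """Return the new state after carrying out the plan."""
    state = {name: dict(config) for name, config in actual.items()}
    for action, name in plan(desired, actual):
        if action == "delete":
            del state[name]
        else:
            state[name] = dict(desired[name])
    return state


desired = {"web": {"size": "small", "count": 2}, "lb": {"type": "http"}}
actual = {"web": {"size": "small", "count": 1}, "old-db": {}}

first_plan = plan(desired, actual)
state = apply(desired, actual)
second_plan = plan(desired, state)  # nothing left to do
print(first_plan, second_plan)
```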

A few benefits of IaC are:
- Cost reduction
- Faster deployments
- Fewer errors
- Improved infrastructure consistency
- Elimination of configuration drift

Storage

Data Replication

Data is often replicated across DBs to improve its availability: if one copy becomes unavailable, another can serve the request.
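
As a minimal sketch of how a reader can benefit from replication, the function below tries each copy in order and returns the first successful answer. The function name and the stand-in `primary` and `replica` callables are hypothetical; real database drivers usually provide their own failover settings.

```python
from typing import Any, Callable, Optional, Sequence


def read_with_failover(key: str, replicas: Sequence[Callable[[str], Any]]) -> Any:
    """Try each replica in order and return the first successful read."""
    last_error: Optional[Exception] = None
    for read in replicas:
        try:
            return read(key)
        except ConnectionError as error:
            last_error = error  # this copy is unavailable; try the next one
    raise ConnectionError("all replicas failed") from last_error


def primary(key: str) -> str:
    raise ConnectionError("primary is down")


def replica(key: str) -> str:
    return {"user:1": "Asha"}[key]


value = read_with_failover("user:1", [primary, replica])
print(value)
```
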

Data Sharding

Splitting the data of a single entity across multiple DB instances is called data sharding. For example, if you decide to store all user records with userId 1-10 on instance 1 and those from 11 onwards on instance 2, that is sharding.
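
The range-based split in this example can be sketched as a small routing function. It assumes user IDs 1 to 10 live on the first instance and anything higher on the second; the names `SHARDS`, `SHARD_UPPER_BOUNDS`, and `shard_for` are hypothetical.

```python
import bisect

# Inclusive upper bound of the user IDs stored on every shard except the last.
SHARD_UPPER_BOUNDS = [10]
SHARDS = ["instance-1", "instance-2"]


def shard_for(user_id: int) -> str:
    """Return the DB instance that holds the record for this user ID."""
    if user_id < 1:
        raise ValueError("user_id must be a positive integer")
    # bisect_left finds the first range whose upper bound is >= user_id.
    return SHARDS[bisect.bisect_left(SHARD_UPPER_BOUNDS, user_id)]


print(shard_for(1), shard_for(10), shard_for(11))
```

Adding a third instance in this sketch would only require another upper bound and another entry in `SHARDS`.
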

Caching

Caching is primarily used to get quick access to frequently used data in the system. It saves you from making repeated calls to fetch the same data, and it helps reduce the read load on the servers. When a backend fails, cached data can keep the application running instead of failing outright. Caching can also help you reduce overall infra cost.
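
The sketch below combines both roles of a cache described here: it serves fresh entries without calling the backend, and it falls back to a stale entry when the backend call fails. The class name, the TTL value, and the injectable `clock` are assumptions for illustration only.

```python
import time
from typing import Any, Callable, Dict, Tuple


class FallbackCache:
    """Cache with a time-to-live that serves stale data if the backend fails."""

    def __init__(
        self, ttl_seconds: float = 60.0, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str, load: Callable[[str], Any]) -> Any:
        entry = self._entries.get(key)
        now = self.clock()
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]  # fresh hit: no backend call needed
        try:
            value = load(key)
        except Exception:
            if entry is not None:
                return entry[0]  # backend failed: stale data beats an error
            raise
        self._entries[key] = (value, now)
        return value
```

A caller passes the key and a loader function, for example `cache.get("movie:1", fetch_movie)`, where `fetch_movie` stands for whatever backend call the application would otherwise make on every request.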