The term "data systems" can be confusing because there are now many technologies for storing data. These tools are designed for different purposes and don't fit into traditional categories. For example, Redis can be used for both data storage and message queuing.
Because modern software applications have demanding processing and storage requirements, developers use different tools for different tasks. Each task (often captured as a user story) may be handled by a different tool, and these tools are stitched together with application code or deployment infrastructure.
For example, an app might use Redis for caching or Elasticsearch for search. These tools are separate from the main database, which might be hosted on Amazon DynamoDB. The application code or business logic is responsible for keeping these tools in sync with the main database.
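To make that glue code concrete, here is a rough cache-aside sketch in Python. It is only an illustration of the pattern, not anyone's production setup; the redis client is a real library, but db.fetch_user and db.update_user are hypothetical stand-ins for calls to your main database.

```python
import json

import redis  # third-party Redis client, assumed to be installed

cache = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 300  # short TTL so stale entries expire on their own


def get_user(user_id, db):
    """Cache-aside read: try Redis first, fall back to the main database."""
    cached = cache.get(f"user:{user_id}")
    if cached is not None:
        return json.loads(cached)

    user = db.fetch_user(user_id)  # hypothetical call to the primary database
    if user is not None:
        cache.set(f"user:{user_id}", json.dumps(user), ex=CACHE_TTL_SECONDS)
    return user


def update_user(user_id, fields, db):
    """Write path: update the database first, then invalidate the cached copy."""
    db.update_user(user_id, fields)  # hypothetical write to the primary database
    cache.delete(f"user:{user_id}")  # the next read repopulates the cache
```

The key point is that the application code, not the database, decides when the cache is read, populated, and invalidated.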
As a software developer, creating a data system or service can be challenging. You need to ensure the data is always accurate and complete, even if there are problems within the system. You also need to ensure consistent performance for clients, even when parts of the system are not working properly. Lastly, you need to handle an increase in traffic.
A data system's design can be influenced by a variety of factors. These include the team's skills and experience level, existing legacy system dependencies, the organization's approach to system design, regulatory constraints, and more. In this issue, we will focus on three key factors: reliability, scalability, and maintainability.
Reliability
A system is understood to be reliable when it continues to work correctly even in the face of adversity—hardware or software faults, or even human faults.
For software to be reliable, typical expectations include:
The system meets the users' expectations; for example, if you are building a blog app, users should be able to share articles. If someone uploads an article and it never appears in other users' feeds, that leads to a bad user experience.
If users make mistakes, the system should be able to tolerate them. If a user of the blog application uploads an article longer than the allowed length, the system should not break but handle it appropriately, for instance by showing an error message saying that the operation is not allowed (see the sketch after this list).
The system should perform well enough. It doesn't have to be hosted on multiple nodes or servers; it just needs to handle the expected load and data volumes for its use case. If a user opens a blog article and has to wait 5 minutes before being able to comment on a post, they won't be very happy.
If a malicious user tries to abuse your system, the system should withstand it.
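Here is what tolerating that kind of user mistake might look like in practice: a minimal Python sketch, where the length limit and the save_article helper are made up purely for illustration.

```python
MAX_ARTICLE_LENGTH = 20_000  # assumed limit, purely for illustration


class ArticleTooLongError(Exception):
    """Raised when an article exceeds the allowed length."""


def save_article(title, body):
    # Stand-in for real persistence logic.
    return {"status": 201, "title": title, "length": len(body)}


def submit_article(title, body):
    # Validate input and fail with a clear error instead of breaking the system.
    if len(body) > MAX_ARTICLE_LENGTH:
        raise ArticleTooLongError(
            f"Articles are limited to {MAX_ARTICLE_LENGTH} characters; got {len(body)}."
        )
    return save_article(title, body)


def handle_submit(request):
    try:
        return submit_article(request["title"], request["body"])
    except ArticleTooLongError as err:
        # Show a friendly error message rather than crashing the request.
        return {"status": 400, "error": str(err)}
```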
When a system is unable to meet these criteria, we say it has faults, and when a system anticipates faults and can cope with them, it is said to be fault-tolerant or resilient. The idea is to design fault-tolerant software that prevents faults from causing a failure, which is when the system as a whole stops providing the required service to its users.
One way to increase the fault tolerance of your application is to deliberately trigger some of the things that could cause a fault in your system. For example, you can randomly kill your application's connection to your database without warning, or purposely feed it a data type completely different from what it expects.
More often than not, you'll find that many of your most critical bugs surface this way and get solved through proper error handling. An example is Netflix's Chaos Monkey, a tool developed by Netflix engineers to assess how effectively their services running on Amazon Web Services (AWS) can withstand unforeseen failures and recover from them.
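This is obviously not Chaos Monkey itself, but a toy Python sketch of the same idea: randomly injecting a lost-connection error into a small fraction of calls so you can check that the calling code handles it gracefully. The load_profile function and the 5% fault rate are illustrative assumptions.

```python
import functools
import random

FAULT_RATE = 0.05  # inject a failure in roughly 5% of calls (test environments only!)


def inject_connection_faults(func):
    """Randomly raise a connection error to exercise the caller's error handling."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if random.random() < FAULT_RATE:
            raise ConnectionError("Injected fault: simulated lost database connection")
        return func(*args, **kwargs)
    return wrapper


@inject_connection_faults
def load_profile(user_id):
    # Stand-in for a real database query.
    return {"id": user_id, "name": "example"}


# The calling code must cope gracefully, e.g. by retrying or returning a fallback.
try:
    profile = load_profile(42)
except ConnectionError:
    profile = None  # fallback path that keeps the request from failing outright
```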
What causes a distributed system to become unreliable? Let's discuss:
Hardware Faults
Hardware faults are a very common cause of system failure. Though they are less visible in the modern era of cloud platforms like Amazon Web Services (AWS), one kind remains prominent: hard disk crashes.
Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
Designing Data-Intensive Applications, M Kleppmann
We are less vulnerable to hardware errors like a power grid blackout or someone mistakenly unplugging the wrong network cable, but we are still exposed to the common situation where a virtual machine instance becomes unavailable without warning.
In the end, cloud platforms make it easier to mitigate hardware failures. Application developers can build hardware redundancy into their plans so that the loss of an entire machine does not take the system down. This allows a system to tolerate machine failure without downtime for the system as a whole.
Software Faults
A common cause of software faults is bugs. A bug in your server code can be critical or mild. Sometimes it will randomly cause the server program to crash and restart, for example when you try to send a transactional email after an action on your platform and the flow breaks because an environment variable is missing or has been changed for some other reason.
Sometimes the fault lies in a third-party library you are using: perhaps a misplaced condition causes side effects that ripple into your system. These are referred to as systematic faults, and there is no quick solution to them.
But we can take a set of small actions to mitigate them, including thorough testing, process isolation, monitoring, and analyzing system behavior in production. While these may not completely prevent software faults, adopting these practices will definitely reduce them.
Human Errors
There is a reason developers joke that you should never push to production on a Friday. It comes down to the higher chance of a system failure, an endless battle with git conflicts, or outright broken or inefficient code being pushed to production by mistake.
While some of this can be avoided with proper development-to-production infrastructure like CI/CD, you may be surprised at how often humans cause systems to fail, even though humans design and build these software systems.
For example, one study of large internet services found that configuration errors by operators were the leading causes of outages, whereas hardware faults (servers or networks) played a role in only 10–25% of outages.
Why Do Internet Services Fail, and What Can Be Done About It? - David Oppenheimer, Archana Ganapathi, and David A. Patterson
The question is: how do we make our systems reliable in spite of unreliable humans?
Firstly, we design systems in a way that minimizes opportunities for error, by integrating well-designed abstractions and secure APIs. We also reduce system vulnerability by designing an intuitive admin interface. Instead of having admins interact directly with database tables, we can require them to use an admin dashboard, so that we only touch the database directly when it is absolutely necessary.
Secondly, provide a fully featured non-production environment, like a sandbox, where people can explore and experiment safely without affecting real production users. That way, your developers or customers can test the system's limits without risking the production data.
Lastly, it is important to conduct thorough testing at all levels, from unit tests to whole-system integration tests and manual tests. Automating some of these tests in the code pipeline is especially valuable for catching rare cases that human inspection may have missed.
Scalability
Scalability is a system's ability to cope with increased load. A scalable system that can handle 100 concurrent users can grow to handle 100,000. This is usually the result of continual effort put into the system rather than a one-time achievement.
It's not about labeling an application as scalable; rather, when you build a system and can answer questions like the following, you can comfortably reason about how scalable it is:
How does an increase in the user base affect our system?
If the system grows in a particular direction, what are the different options to cope with the growth?
On a SaaS platform, where is the load coming from: the business or the customers? And which of the application's features attracts the most attention, thereby causing the most load?
Answering these questions well starts with understanding the current load on the system and knowing how to measure it. Let's discuss a few measurement techniques.
Requests per second (RPS)
One common method of load measurement is calculating the requests per second (RPS), also known as queries per second (QPS).
This basically means how many queries or requests a node needs to handle each second: QPS = Q / (T × 3600), where Q is the number of queries and T is the time period in hours.
For example, say a server receives 5 million requests per day:
QPS = 5,000,000 / (24 × 3600) ≈ 57.87
This is roughly 57.9 requests per second, and it is the value we consider the QPS of the server.
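If it helps, here is the same calculation as a couple of lines of Python:

```python
def queries_per_second(total_requests, period_hours):
    """QPS = Q / (T * 3600), with T expressed in hours."""
    return total_requests / (period_hours * 3600)


print(round(queries_per_second(5_000_000, 24), 2))  # 57.87
```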
In a distributed system, you may never see uniform values across your servers throughout the day. Different variables lead to different results: customers on an e-commerce platform may be more active during the day and late at night making purchases, while a restaurant app may see traffic spike with reservations in the evening.
However, the bottom line is that, using QPS and user behavior on your platform, you can measure the load of a system. With this insight, it’s now easier to decide when to scale the system.
Read-to-write ratio (r/w)
Additionally, another common way of measuring load in a system is to check the read-to-write ratio (r/w) at the database level.
Looking at this ratio for your application, you can determine whether the system is read-heavy or write-heavy. What is the difference?
A read-heavy system has a higher ratio of reads to writes (r/w). This means that it is likely to serve more read requests than write requests. By adding one or more read replicas to the database cluster, it can handle a higher volume of read requests.
A read-replica is like a copy of a database. It helps balance the load between different nodes so that read requests don't slow down the node that handles writing data.
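As a rough sketch of what routing reads to replicas can look like in application code (real deployments usually lean on a database driver, proxy, or ORM feature instead), here is a minimal Python example; the primary and replicas connection objects and their execute method are hypothetical.

```python
import itertools


class RoutingDatabase:
    """Send writes to the primary and spread reads across replicas (round-robin)."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Cycle through replicas so read load is balanced between them.
        self._replica_cycle = itertools.cycle(replicas) if replicas else None

    def execute_write(self, query, params=()):
        # All writes go to the primary, the single source of truth.
        return self.primary.execute(query, params)

    def execute_read(self, query, params=()):
        # Reads go to the next replica, falling back to the primary if there are none.
        target = next(self._replica_cycle) if self._replica_cycle else self.primary
        return target.execute(query, params)
```

Writes always go to the primary so there is a single source of truth, while reads are spread across the replicas.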
A write-heavy system, on the other hand, is indicated by a low r/w value. As the load increases, the system experiences about as many writes as reads, or sometimes even more writes than reads.
Scaling a system that writes a lot of data is more complex than scaling a system that mostly reads data. The method of scaling usually depends on the type of database used. For relational databases with strong ACID properties, you can scale vertically by adding more processing power, memory, or faster storage to a single node; scaling writes horizontally across multiple nodes is considerably more involved. We will discuss the different ways to scale in depth in a much later issue.
Measuring your load and acting on it, for example by increasing system resources such as CPU, memory, and network bandwidth, will affect the performance of the system. So in due time, you will eventually have to measure the performance of your system as well.
Let's discuss some ways to measure performance.
Request-Response Time
To measure your system's performance, you can track its request-response time. Response time refers to the amount of time it takes for the client to receive a response after sending a request.
For instance, if it takes 300 milliseconds (300 ms) for the client to get a response, that means 300 ms elapsed between the client sending the request and receiving the reply. The response time covers the time it takes for data to move across the network in both directions, not just the server's processing time.
The easiest way to measure performance using request-response is to look at the average response time, which is just the mean of all the response times. Keep in mind, though, that an average can hide slow outliers, which is why percentiles, discussed next, are often more informative.
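A quick way to collect response times from the client's side is to time each request yourself. A minimal Python sketch, assuming the requests library and a placeholder URL:

```python
import statistics
import time

import requests  # third-party HTTP client, assumed to be installed

URL = "https://example.com/api/articles"  # placeholder endpoint
response_times_ms = []

for _ in range(100):
    start = time.perf_counter()
    requests.get(URL, timeout=5)
    response_times_ms.append((time.perf_counter() - start) * 1000)

print(f"average response time: {statistics.mean(response_times_ms):.1f} ms")
```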
Percentiles
Suppose we have a large dataset of 100,000 requests to our server and are interested in analyzing the response times. To find out how our server is performing, we can list the response times of all 100,000 requests and sort them in ascending order.
By doing this, we can determine the median response time of our server, which is essentially the response time of the middle request in the sorted list.
For instance, let's say that the median response time is 2 seconds. This implies that half of the requests to our system are served in less than 2 seconds. This kind of metric is known as a percentile: it indicates the response time within which a given percentage of requests are served.
In this case, the 50th percentile (also referred to as p50) of our system is 2s, meaning that 50% of the requests are served in less than 2s. By measuring different percentiles, we can gain a better understanding of how our server is performing and identify any potential bottlenecks or issues that need to be addressed.
For a system to be considered performant, its response times should stay low even at the high percentiles. For example, if the 99.9th percentile of a system is 100 ms, only 1 in 1,000 requests is slower than 100 ms. This is a good indicator that the system is functioning properly and meeting the expectations of most businesses.
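Given a list of measured response times, percentiles are easy to compute. Here is a small Python sketch using the nearest-rank method on a made-up sample of ten response times:

```python
import math


def percentile(sorted_times, p):
    """Nearest-rank percentile: the value below which roughly p% of samples fall."""
    rank = math.ceil(p / 100 * len(sorted_times))  # 1-based rank
    return sorted_times[min(rank, len(sorted_times)) - 1]


# Ten made-up response times in milliseconds, sorted in ascending order.
response_times_ms = sorted([120, 150, 180, 200, 230, 260, 300, 450, 900, 2400])

print("p50:", percentile(response_times_ms, 50))  # 230 ms (roughly the median)
print("p90:", percentile(response_times_ms, 90))  # 900 ms
print("p99:", percentile(response_times_ms, 99))  # 2400 ms, the slowest tail request
```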
It's important to note that having a fast response time is crucial for any system, as it can impact the overall user experience. In addition, a system with a fast response time can also improve the efficiency and productivity of a business by reducing wait times and increasing the number of requests that can be processed within a given time frame.
Maintainability
Let's be honest: many of us don't like working on legacy code. We prefer the rush of taking a project from inception to production. Unfortunately, maintenance is the third component required to build systems that scale and stand the test of time.
While building a distributed system, we should design it in such a way that it will minimize pain during maintenance.
There are generally three design principles for a software system:
Operability: Make it easy for operations teams to keep the system running smoothly.
Simplicity: Make it easy for new engineers to understand the system by removing as much complexity as possible from it.
Evolvability: Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change.
Let’s discuss these principles in detail:
Operability: Making Life Easy For Operations
There is a saying, “Good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run efficiently or reliably with bad operations.”
In practice, there are some common habits you should keep in mind, including:
Making your system easier to monitor and track: Add tools to your system that can help you keep track of what's happening and any issues that come up. Make sure these tools are easy to see and use, both for you and other people on your team.
Simplifying build and automation support: The critical parts of your system should be easy to build and start up quickly, so that when there's a disaster, you can respond quickly.
Creating documentation: New team members should benefit from the documentation you provide. Good documentation helps build and scale a skilled team. In case of failure, proper documentation allows developers to conduct recovery procedures.
Self-healing mechanism: If there are any issues, the system should try to fix them automatically. This is important for building a strong system. Developing a self-healing mechanism may be challenging, but the end result is worthwhile. However, this mechanism should also allow developers to take control if needed.
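Self-healing can take many forms, from health checks that restart a process to automatic failover. One simple building block is retrying a failed operation with exponential backoff before escalating to a human. A minimal Python sketch, where call_downstream_service is a hypothetical placeholder:

```python
import logging
import time


def with_retries(operation, max_attempts=5, base_delay_seconds=1.0):
    """Retry a failing operation with exponential backoff, then re-raise for a human."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as err:
            if attempt == max_attempts:
                logging.error("Giving up after %d attempts: %s", attempt, err)
                raise
            delay = base_delay_seconds * 2 ** (attempt - 1)
            logging.warning("Attempt %d failed (%s); retrying in %.1fs", attempt, err, delay)
            time.sleep(delay)


# Usage: wrap a flaky call so transient faults heal without manual intervention.
# result = with_retries(lambda: call_downstream_service())  # hypothetical service call
```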
Simplicity: Managing Complexity
A maintainable distributed system has to be simple to understand, without unnecessary complexity. As a software developer, it is not my style to write code that is too complicated for the next developer to understand. If a new developer joins tomorrow, they should be able to understand the system with reasonable effort.
Here are a few industry standard practices:
Simplify dependencies: Try to keep your system's dependencies on other systems or third-party services easy to understand. Avoid making them too complicated or tangled up.
Avoid assuming things: Complex systems often result from someone making an assumption without a good reason. While intuition can be helpful, it's not always reliable. Developers should therefore avoid changes built on unwarranted assumptions, which can be hard to identify when things go wrong.
Evolvability: Easy To Change or Extend
Over time, as businesses grow and develop new use cases, it becomes necessary to add new features to the system in order to improve and enhance its current state. This process of evolution is essential to ensuring that the system is able to keep up with changing demands and remain relevant in an ever-changing market.
If the system cannot be easily expanded, it will become harder to keep up with as it gets bigger. To make a system that can be expanded, you can follow a few guidelines:
Develop good coding practices. This will help keep your codebase consistent and make it easier to add new features over time, especially when multiple developers are working on the system with different styles.
Make it a habit to update the code when necessary. If you come across code that can be improved, do it. Sometimes, developers put off updating code, and it eventually piles up to a point where it becomes too overwhelming to tackle.
Create an easy-to-follow process for deploying software. Developers should be able to move from developing software to deploying it without difficulty. To make this happen, use automation and create the necessary documentation.
Creating a useful application involves meeting many requirements, some functional (like storing data and returning search results) and others non-functional (like security, compliance, scalability, and reliability).
In this issue, we talked about three qualities your system needs: reliability, scalability, and maintainability. If you build them in, your customers will have a much better experience using your app, and as a result, your business will make more money.
I hope you like this issue. We received feedback that some issues are too lengthy, so we're shortening them and breaking them into parts if needed. Please like and comment if you found this issue helpful.
Cheers
Happy Learning!