Designing Data-Intensive Applications - Chapter 1: Reliable, Scalable, and Maintainable Applications
Charles
Charles
《Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems》https://amzn.to/2WYphy6
Reliability
- Continuing to work correctly, even when things go wrong.
- aka. fault-tolerant or resilient.
- Fault vs. Failure (fault means one component failed), design system that able tolerance the fault to prevent failure;
- Hardware Faults: Redundancy is the key;
- Software Errors: could be more troublesome;
- Developers need to make assumptions and interactions carefully.
- Human Errors: Design to minimize chances of errors; decouple; Test thoroughly; Quick recover ability; monitoring(performance/error rate), aka Telemetry; Good management practice/training;
Scalability:
- A system’s ability to cope with increased load.
- Describing Load: defined by “load parameters”, it depend on architecture of your system, e.g.
- the requests per second to a web server,
- the ratio of reads to writes in a database,
- the number of simultaneously active users in a chatroom,
- the hit rate on a cache, etc.
- Twitter hybrid approach: fanned-out vs. fetched
- the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads → Fanned-Out approach
- On average, a tweet is delivered to about 75 followers, so 4.6k tweets per second become 345k writes per second to the home timeline caches.
- Moved to Hybrid approach to handle Justin-Biber effects.
- Describing Performance: When you increase a load parameter, keep the same system resources, what will happen ? How much do you need to increase resources to keep up the performance ?
- E.g. “response time”— the time between a client sending a request and receiving a response.
- Latency vs. response time: Latency is the duration that a request is waiting to be handled—during which it is latent, awaiting service.
- Better use “percentiles” to measure response time, e.g. median(p50)
- High percentiles of response times, also known as tail latencies, are important because they directly affect users’ experience of the service.
- Amazon describes response time requirements for internal services in terms of the 99.9th percentile (affects 1 in 1,000 requests).
- On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed too expensive and to not yield enough benefit for Amazon’s purposes.
- Amazon has also observed that a 100 ms increase in response time reduces sales by 1% [20], and others report that a 1-second slowdown reduces a customer satisfaction metric by 16%
- It is important to measure response times on the client side.
- Approaches for Coping with Load:
- Good architectures usually involve a pragmatic mixture of approaches: Scaling up(Vertical) & Scaling out(horizontal)
- Scaling out: Stateless service is fairly straightforward compared to Stateful service.
- It is conceivable that distributed data systems will become the default in the future.
- The architecture of systems that operate at large scale is usually highly specific to the application.
- An scalable architecture usually built from general-purpose building blocks arranged in familiar patterns.
- Good architectures usually involve a pragmatic mixture of approaches: Scaling up(Vertical) & Scaling out(horizontal)
Maintainability:
- Operability: Make it easy for operations teams to keep the system running smoothly.
- Simplicity: Make it easy for new engineers to understand the system,
- By removing as much complexity as possible from the system. (Note: this is not the same as simplicity of the user interface.)
- Making a system simpler DOES NOT necessarily mean reducing its functionality;
- Complexity: as accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.
- One of the best tools we have for removing accidental complexity is abstraction. (However,finding good abstractions is very hard.)
- Evolvability:Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. Also known as extensibility, modifiability, or plasticity.
- Agile working patterns provide a framework for adapting to change
- simple and easy-to-understand systems are usually easier to modify than complex ones (Linked to the previous point - Maintainability)
Summary
- functional requirements: (what it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways),
- nonfunctional requirements: (general properties like security, reliability, compliance, scalability, compatibility, and maintainability)
- Reliability means making systems work correctly, even when faults occur.
- Scalability means having strategies for keeping performance good, even when load increases.
- Maintainability has many facets, but in essence it’s about making life better for the engineering and operations teams who need to work with the system.
《Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems》https://amzn.to/2WYphy6