Hello friends 👋🏾
The purpose of sharing this is to keep myself accountable (making sure I read), share what I learn, make sure what I learn sticks, and serve as a reference for future me.
I don’t have a lot of experience building applications at scale and therefore my notes may contain a lot of content. I hope this summary will be of value to you too. 🙂
Software is deemed reliable depending on its ability to tolerate faults, function as expected, prevent unauthorized access, and perform well enough for its tasks.
Not all faults and failures can be captured.
Types of faults
- Hardware faults - e.g., hard drive corruption.
- Software errors - e.g., unhandled exceptions.
- Systematic errors - errors within a system that are correlated across nodes, e.g., the leap second glitch ↗ that caused an outage of popular web services.
- Human error - e.g., having faulty configuration during deployment.
Deliberately inducing faults helps surface weaknesses early, improving the reliability of an application.
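To make the idea of inducing faults concrete, here is a minimal sketch. All the names in it (`with_faults`, `call_with_retry`, `FaultInjected`, `fetch_profile`) are hypothetical and made up for illustration; it is not how any particular fault-injection tool works. It wraps a function so that a chosen fraction of calls fail on purpose, which lets you check that the calling code copes.

```python
import random


class FaultInjected(Exception):
    """Raised by the wrapper to simulate a failure (hypothetical name)."""


def with_faults(func, failure_rate, rng=None):
    """Wrap func so that roughly failure_rate of calls fail on purpose."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # rng.random() is in [0.0, 1.0), so 0.0 never fails and 1.0 always fails.
        if rng.random() < failure_rate:
            raise FaultInjected(f"injected fault in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def call_with_retry(func, attempts=3):
    """Call func, retrying whenever an injected fault is raised."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except FaultInjected as err:
            last_error = err
    # Every attempt failed: surface the last injected fault to the caller.
    raise last_error


def fetch_profile():
    """Stand-in for a real dependency call (assumption for the example)."""
    return {"name": "demo"}


# Example: a dependency that fails about 30% of the time.
flaky_fetch = with_faults(fetch_profile, failure_rate=0.3)
```

Calling `call_with_retry(flaky_fetch)` in a test run shows whether the retry logic tolerates the injected failures or lets them reach the user.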
Making systems more reliable
- System design - this reduces opportunities for error and can include well-designed abstractions, APIs, and admin interfaces that make it easy to do the right thing.
- Sandbox environments - these are environments for battle testing applications before moving to production.
- Testing - using unit, integration, and end-to-end tests to push software to the extreme and capture “edge cases” users may encounter.
- Telemetry - using monitoring software to track usage, exceptions, and application performance.
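As a rough illustration of the telemetry point, the sketch below records call counts, exceptions, and time spent per function using only the standard library. The `metrics` dictionary and the `tracked` decorator are hypothetical names for this example; real monitoring software would ship these numbers to a dashboard rather than keep them in memory.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory store for the example: function name -> counters.
metrics = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def tracked(func):
    """Record usage, exceptions, and duration for each call to func."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        entry = metrics[func.__name__]
        entry["calls"] += 1
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            # Count the failure, then let the exception propagate unchanged.
            entry["errors"] += 1
            raise
        finally:
            entry["total_seconds"] += time.perf_counter() - start

    return wrapper


@tracked
def divide(a, b):
    """Example function to monitor (assumption for the example)."""
    return a / b
```

After a few calls, `metrics["divide"]` holds how often the function ran, how often it raised, and how long it took in total.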
Scalability is a system’s ability to cope with an increased load within a given set of parameters.
TIL: Tail latency amplification: when serving one request depends on several other calls (e.g., to other services or APIs), a single slow call is enough to make the whole request slow.
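A quick back-of-the-envelope calculation shows why this matters. The sketch below assumes the backend calls are independent and each has the same probability of being slow, which is a simplification; the function name is made up for the example.

```python
def chance_of_slow_request(slow_call_probability, backend_calls):
    """Probability that at least one of the backend calls is slow.

    Assumes calls are independent and equally likely to be slow.
    """
    all_calls_fast = (1 - slow_call_probability) ** backend_calls
    return 1 - all_calls_fast


# If 1% of backend calls are slow:
one_call = chance_of_slow_request(0.01, 1)         # about 1% of requests are slow
hundred_calls = chance_of_slow_request(0.01, 100)  # about 63% of requests are slow
```

So even when only a small share of individual calls are slow, a request that fans out to many of them is slow most of the time.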
Monitoring the performance of an application can be done in two ways:
- Option A: Keeping a log of requests/responses every 10 minutes, detailing the response times.
- Option B: Collecting all requests/responses and sorting through them every 10 minutes. This is the naïve approach; more efficient algorithms exist for summarizing the response times.
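The naïve approach in Option B can be sketched as follows: keep the response times seen in a window, sort them, and read off percentiles. The nearest-rank method and the names `percentile` and `summarize` are my own choices for the example, not something the notes above prescribe.

```python
import math


def percentile(response_times_ms, p):
    """Return the p-th percentile (0-100) using the nearest-rank method."""
    if not response_times_ms:
        raise ValueError("no response times recorded in this window")
    ordered = sorted(response_times_ms)
    # Nearest rank: the smallest value with at least p% of samples at or below it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(window):
    """Summarize one window (e.g., 10 minutes) of response times."""
    return {
        "p50": percentile(window, 50),
        "p95": percentile(window, 95),
        "p99": percentile(window, 99),
    }


# Response times in milliseconds collected during one window (made-up data).
window = [30, 10, 20, 40, 50, 1000, 60, 70, 80, 90]
summary = summarize(window)  # {'p50': 50, 'p95': 1000, 'p99': 1000}
```

Sorting the full list each time gets expensive as traffic grows, which is why this counts as the naïve option.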
Approaches for coping with increased load:
- Rethinking system architecture - taking a different approach to how systems can be scaled. There are two scaling “models”: scaling up (vertical scaling) and scaling out (horizontal scaling), also known as shared-nothing architecture.
- Elastic systems - having systems that can be scaled, either manually or automatically, as the load increases.
There is no magic scaling sauce or a one-size-fits-all architecture for designing scalable systems.
Maintainability refers to designing software so that it is less painful to maintain in the long run. It helps developers avoid building legacy software. No one likes legacy systems. 🤷🏾‍♂️
While writing this summary, the following tweet ↗ resonated:
To ensure maintainability, development teams have to follow a few design principles when building systems:
- Operability - refers to making a system easy to run (making life easy for the operations team).
- Simplicity - refers to the ability to manage complexity when building a platform, e.g., creating abstractions.
- Evolvability - refers to adapting software for future needs such as new features.
Cheers, and stay tuned for more. 📣