This week, in our dev meeting, Chris talked about Release It! (second edition) by Michael Nygard. He’d read the original years ago, before “dev-ops” existed, and it was the first time he’d really thought about putting applications into production. It was an eye-opener 10 years ago, so he was excited by the second edition.
The book covers stability patterns and antipatterns, then things to consider in production, then deployment, and finally the evolution of applications. So far, Chris has only read the first two parts of the book.
It discusses “Designing for Production” rather than “Designing for QA”, starting with a case study about how a small problem escalated into a huge one. Some of the issues it covers are:
- “Testing for Longevity” – identifying things that will fail after X continuous days running, which might not be picked up by standard CI approaches
- chain reactions between servers, where one failure causes a domino effect (often through blocked threads building up), and how to limit them
- self-denial attacks like sending out mass emails that trigger everyone to look at your site simultaneously
- dog-piling, which is a bit like the thundering herd problem, e.g. if every server is updated on a schedule exactly on the hour, it causes peaks in load
- automated tools going out of control, so applying limits to them is good
Positive approaches to avoid these are:
- timeouts
- circuit breakers – isolate your system from the remote system when the remote system is down, providing an immediate “fail” response rather than blocking and running out of threads
- bulkheads – e.g. reserving a few threads for the admin tool to fix the problems, or the operating system saving some disk space so the root user can fix out-of-space problems
- steady state – e.g. log files shouldn’t become infinite in size, they should be tidied up; temporary files should be deleted
- fail fast – do validation as soon as possible, if there are expensive operations that require data, then get that data upfront
- for websites, distinguish between 4xx errors and 5xx errors – a user entering bad data shouldn’t trigger a circuit breaker
- let it crash – if something goes wrong, let it crash and then restart it, but this requires other architectural choices like supervisors
- handshaking – typical REST services don’t have any handshake involved; sometimes it would be useful if they could return an HTTP 503 to their caller to tell it to slow down
- async approaches – a pull system can choose the rate at which it receives content
- shedding load – dropping load if it can’t take any more
- back pressure – telling your caller “back off, I can’t take any more work”
- governor – applying limits to the automation tool, so it can’t fire up 10,000 boxes immediately – if it’s outside the standard range of what it does, then it will do it more slowly
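To make the circuit breaker idea more concrete, here is a minimal sketch in Python. It is not taken from the book or from any framework: the names `CircuitBreaker`, `CircuitOpenError` and `ClientError`, and the parameters `max_failures` and `reset_after`, are all hypothetical and chosen for illustration. It assumes that repeated failures from the remote system should open the circuit, that client-side errors (the 4xx case) should not count, and that after a cool-down period a single trial call is allowed through.

```python
import time


class CircuitOpenError(Exception):
    """Raised immediately while the circuit is open (fail fast)."""


class ClientError(Exception):
    """Hypothetical 4xx-style error: the caller sent bad data."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # failures before the circuit opens
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Remote system is presumed down: reject without calling it.
                raise CircuitOpenError("remote system marked as down")
            # Cool-down has elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except ClientError:
            # Bad input says nothing about the remote system's health,
            # so it does not count towards opening the circuit.
            raise
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_prices, url)`, where `fetch_prices` is a placeholder for whatever function talks to the remote system. The breaker only helps if the wrapped call has its own timeout, otherwise a hung remote system still blocks the calling thread before any failure is recorded.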
When building a system using an Agile approach, you need to decide when to start applying these techniques, since they may not be expressed directly in user stories, and they may require significant architectural choices that are hard to refactor into an existing application. We discussed that some of these patterns are directly supported in newer frameworks, such as Akka HTTP, which has direct support for back pressure, circuit breakers and async methods. This makes the decision about when to implement them easier.