Systems are engineered for use in Production from the start – they are scalable, observable and tolerant of failure.
Building more, smaller systems and running them in cloud environments means that understanding and adapting to change, and handling failure, are more important than ever. Ideally, a system is self-healing or produces only relevant, actionable alerts.
- The system should have sufficient monitoring, logging and alerting to be able to understand and report on its health, and the health of its dependencies, in a consistent way. Anyone in JLP should be able to access these telemetry tools.
- It is important that alerting and diagnostic tools are targeted and relevant in the information that they provide. Overloading dashboards with information impairs usability, and excessive alerting leads to fatigue and important notifications being missed. Analytics data is often best surfaced through separate tools from those used for operational purposes.
- Systems should gracefully degrade on disruption – for example, by using circuit breakers, bulkheads, caching or partial responses. Likewise, designing calls to be idempotent can help avoid issues with transient failures or unforeseen behaviours.
- Teams and business owners will need to work together to identify relevant business and technical KPIs and SLOs for their product.
- Teams should understand and improve the resilience of their service through failure mode and effects analysis (FMEA), and by experimenting with intentional faults.
- Dashboards need to be implemented for the most important SLOs.
- Not all risks to Production readiness can be analysed in advance, so exploratory testing, Production observability, fault injection, post-incident reviews and operational monitoring should be used to expose new information about software behaviour.
- Teams should aim to release new services into Production as soon as possible, and software should be kept deployable throughout its lifecycle. As well as the benefits arising from faster feedback through Continuous Delivery, this helps expose teams to the release processes and tools in use in Production earlier, so that they are well-understood in advance of launch.
- *Release It!* by Michael Nygard is an excellent starting point for understanding what defines Production Ready.
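
As an illustration of the graceful degradation point above, a circuit breaker can be sketched in a few lines. This is a minimal sketch rather than a recommended implementation: the `CircuitBreaker` class, its parameter names and its default values are hypothetical, and a real service would more likely rely on an established resilience library. After a run of consecutive failures the breaker stops calling the dependency and serves a fallback (such as a cached or partial response) until a timeout has passed, at which point it lets a trial call through.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Illustrative circuit breaker (hypothetical, not a library API)."""

    def __init__(
        self,
        failure_threshold: int = 3,      # assumed default
        reset_timeout: float = 30.0,     # seconds, assumed default
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        """True while calls should be short-circuited to the fallback."""
        if self._opened_at is None:
            return False
        # Once the timeout has elapsed, allow a trial call (half-open).
        return self._clock() - self._opened_at < self.reset_timeout

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                # Open (or re-open) the circuit from now.
                self._opened_at = self._clock()
            return fallback()
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def fetch_recommendations() -> list:
    # Stand-in for a call to a failing downstream dependency.
    raise ConnectionError("recommendations service unavailable")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0)
for _ in range(3):
    # The caller always gets a usable (degraded) response.
    items = breaker.call(fetch_recommendations, fallback=lambda: [])
```

In this sketch the third call never reaches the failing dependency, because the breaker is already open. That protects the dependency from extra load while it recovers and keeps the caller's latency predictable.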
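
One way to make an SLO concrete enough to show on a dashboard is to express it as an error budget. The sketch below assumes that common framing, which the principles above do not themselves prescribe; the function name, the availability-style SLO and the example figures are all hypothetical.

```python
def error_budget_remaining(
    slo_target: float, total_requests: int, failed_requests: int
) -> float:
    """Return the unspent fraction of the error budget for a period.

    slo_target is the required success ratio, e.g. 0.999 for 99.9%.
    The result is 1.0 when nothing has been spent, and negative when
    the SLO has been breached.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_requests <= 0:
        return 1.0  # no traffic yet, so no budget has been spent
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Hypothetical figures: a 99.9% SLO over 1,000,000 requests allows
# roughly 1,000 failures, so 250 failures leaves about three quarters
# of the budget unspent.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
```

A single number like this is easier to alert on and to discuss with business owners than a raw failure count, since it already encodes the agreed target.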