A better approach to DevOps

Site Reliability Engineering

Google’s answer to reliability

Matthew Grey
Published in Cognizant Servian · 10 min read · Jan 7, 2019



In late 2018, I took part in a workshop run by Google to teach their solution to reliability in large, distributed systems: Site Reliability Engineering. This structure provided a different perspective on how a product should be planned, developed, and operated. By defining clear internal benchmarks for reliability (SLOs) and tracking a product's performance against them (SLIs), problems can be intercepted before an outage occurs. Their vision went further than performance metrics, though. Google would have the traditional project team structure completely revolutionised, with greater transparency and communication between the product, development, and operations teams. I left the workshop a firm believer in this organisational structure, with a desire to implement what I learned in my own team.

SLOs vs SLAs

Consider a service level agreement (SLA). This provides the minimum acceptable operation as part of a contract with the customer. It will define the minimum agreed behaviour for a system: transfer speed, availability, latency, and so on. If the system responds slower than x milliseconds, then the contract is breached and the customer will be compensated. A service level objective (SLO) is the minimum operation that the customer is happy with. An SLA may state that the system responds within 1 second, but if the customer is frustrated by waiting more than 300ms, then an SLO should be defined for delivering requests within 300ms, as this is the slowest response the customer will be happy with. To give some leeway to the SLO (so that we don't alert people for a single response slower than 300ms), we assign a target percentage to the SLO, such that x% of the total requests respond within 300ms, averaged over y minutes.

If the happiness of the customer is key, then you only need to develop to meet their expectations, not exceed them — this will in turn keep them happy without over-engineering the product.

How high that percentage should be is up for discussion between the three parties that form the project: the product team, the development team, and the SRE team (site reliability engineering, traditionally the 'operations' team). This open discussion should serve to create a realistic SLO definition instead of one team controlling the conversation and delivering a useless SLO. To complete the definition of an SLO, we state where the metric is to be measured. For our latency example, we could measure the latency at the load balancer, so that we are measuring the latency of the entire system as it deals with the request.
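To make this concrete, here is a minimal sketch of how the latency SLO above might be written down once all three parties have agreed on it. The field names and values are hypothetical illustrations, not a schema from Google or the workshop.

```python
from dataclasses import dataclass

@dataclass
class ServiceLevelObjective:
    """A hypothetical, minimal record of an agreed SLO."""
    name: str
    sli_metric: str        # what is measured (the SLI)
    measured_at: str       # where in the system the metric is collected
    threshold_ms: int      # requests must respond within this time
    target_percent: float  # x% of requests must meet the threshold
    window_minutes: int    # y minutes of averaging

latency_slo = ServiceLevelObjective(
    name="checkout-latency",
    sli_metric="request latency",
    measured_at="load balancer",
    threshold_ms=300,      # the customer is unhappy beyond 300ms
    target_percent=99.0,   # agreed between product, development, and SRE
    window_minutes=5,      # absorbs the occasional one-off slow response
)
```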

SLIs and the error budget

SLOs and SLAs both rely upon measurements of the system. What is measured is key to a good SLO. If reliability is of high importance, how would you measure reliability? If you measure the number of responses served, what would happen in periods of low activity? If you hit a health check URL to make sure the service is available, what happens if only part of the system is down, or the entire system is operational but degraded? Care must be taken to be as specific as possible in what is being measured, and it should be something directly linked to the customer's happiness.

For an availability SLO, you could use the percentage of total requests in a given time window that returned a response code outside the 4xx and 5xx ranges (which indicate a bad user request and a server fault respectively, the latter being what we are trying to measure). This handles periods of inactivity and bad user requests, and it can be scaled to incorporate the entire system (all user requests) or only a certain part of it (tracking only certain API calls).
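As a rough illustration (not code from the workshop), the sketch below computes such an availability SLI over a window of HTTP status codes, treating anything below 400 as a good response and returning no value for an empty window so that quiet periods are not counted as failures.

```python
def availability_sli(status_codes):
    """Percentage of responses in the window that were served successfully.

    Following the definition above, both 4xx (bad user request) and 5xx
    (server fault) responses are excluded from the 'good' count.
    Returns None for an empty window so low-activity periods don't
    register as outages.
    """
    if not status_codes:
        return None
    good = sum(1 for code in status_codes if code < 400)
    return 100.0 * good / len(status_codes)

# Example window: eight requests, one server fault and one bad user request.
window = [200, 200, 301, 200, 500, 404, 200, 200]
print(f"Availability SLI: {availability_sli(window):.1f}%")  # 75.0%
```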

Major breaches of an SLO mean that the customer will be unhappy with the product, and this should trigger consequences for developers. Consequences — such as having to review deployment builds with their SRE until the SLO is met and stable — incentivise developers to maintain the reliability of a product.

What happens when an SLO is breached? In comes the error budget. The error budget is the amount of time that a system is allowed to be in breach of an SLO. It is measured in minutes and computed from the fraction of failure the SLO permits (one minus the SLO's target), multiplied by the length of the budget window. Take the following SLO as an example:

99.9% of the total number of requests for the service must return a 2xx or 3xx HTTP code, aggregated over a 2 minute period, as measured at the service's load balancer.

Given that the service is to meet its objective 99.9% of the time, the error budget would be equal to 100% minus 99.9%, or 0.1%. The error budget will be set for a given period, commonly a 28 day rolling window. This means that every day, the error budget recovers 1/28 of the total error budget.

A diagram of an error budget being exhausted and then recovering once the reliability has been stabilised

With a 28 day period and an error budget of 0.1% of that period, the error budget works out to around 40 minutes. If the service exhausts the error budget, then the system is in a very unreliable state, so clearly something is wrong that needs to be rectified. As a consequence, the development team is limited in their operations such that they can only deploy solutions to the reliability problems. This is enforced at Google by disabling automatic deployment from CI/CD — all deployment happens manually by an SRE once the deployment has been approved as a reliability solution. These consequences serve two purposes: to incentivise developers to maintain the reliability of the system — they don't want to have to get every build approved by someone else — and to ensure that reliability is continuously improved and maintained.
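A rough sketch of the arithmetic: for a 99.9% SLO over a 28 day rolling window, the budget is 0.1% of 28 days, or roughly 40 minutes. The snippet below (hypothetical names, greatly simplified accounting) shows that calculation and the kind of gate that would stop ordinary feature deployments once the budget is spent.

```python
MINUTES_PER_DAY = 24 * 60

def error_budget_minutes(slo_target_percent, window_days=28):
    """Total minutes of SLO violation allowed in the rolling window."""
    allowed_fraction = 1.0 - slo_target_percent / 100.0
    return allowed_fraction * window_days * MINUTES_PER_DAY

budget = error_budget_minutes(99.9)   # ~40.3 minutes over 28 days
daily_recovery = budget / 28          # each day restores 1/28 of the budget

def can_auto_deploy(bad_minutes_in_window):
    """Hypothetical CI/CD gate: automatic deploys stop once the budget is gone.

    Once exhausted, only reliability fixes approved by an SRE go out,
    and they are deployed manually.
    """
    return bad_minutes_in_window < budget

print(f"Error budget: {budget:.1f} min, recovering {daily_recovery:.2f} min/day")
print("Auto deploy allowed:", can_auto_deploy(bad_minutes_in_window=55.0))  # False
```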

SREs

What's very interesting about the SRE structure is how the operations team is organised. Instead of having a team whose duty it is to keep production systems going, Google have embedded site reliability engineers (SREs) directly into development teams. The SREs will design and ultimately control the automated deployment pipeline for the product, as well as create the production, development, and testing environments. The SREs exist as a separate branch of the business, and as such do not answer to the product owner or the development team. Instead they have their own, SRE-specific chain of command. This is very important, as the development team and product owner are both motivated to push out features, often relegating reliability changes to the 'nice to have' backlog.

SREs are there to enable the rapid development of reliable products. They spend their time creating and maintaining the production environment for a product, creating tools for developers, and enforcing reliability goals.

When a product's error budget has been exhausted and development want to release a new build, they interface with their SRE. The SRE is then tasked with inspecting the code and making sure that the build fixes the reliability issue and nothing else. Development have exhausted their error budget, so they must face the consequence of this: a slower deployment process and questions from the SRE.

My opinion — pros

In the traditional development company organisation, power flows down from the product team. Developers accept requirements and build features. Operations support the features with funding acquired from the product team. This has some major downsides, as the product team may not necessarily understand the technical feasibility of the product they are defining. Also, the system is built and features are continually added by the development team, but it is up to the operations team to quote operational costs and fight for funding from the product team. This structure lends itself to an operations team on a thin budget, leading to cut corners (and outages).

In the classic structure of product teams, power flows from the top down. The SRE structure distributes that power more evenly.

This differs from an SRE-organised structure quite dramatically. In the SRE structure, all three teams come together to define requirements, which means that the operations team can join the discussion over feature reliability requirements. The development team is ultimately responsible for the reliability of the system, as the operations team (or SREs) will develop the production environment to the specifications agreed upon by all three parties. Additionally, funding can be negotiated at an earlier point in the product life-cycle; a product that requires 99.99% availability will cost ten times more than one with 99.9% availability, so if that level of availability is not needed then the cost of the system can be reduced. This open discussion and agreement between the three teams leads to greater clarity in what the end product should look like, more accurate plans, and less wastage.

My opinion — cons

There are, however, some drawbacks to this organisational structure. It takes full commitment from all parties for the system to work:

Product must give up some of their power over system requirements as this will now be decided in tandem with reliability through negotiation with development and SREs.

Development, who are feature driven, must relinquish their power to deploy code. SREs hold the keys to the car now; developers can deploy code as they please only until the error budget is exhausted. An exhausted error budget indicates an unreliable system, so at that point developers are forced to convince their SRE that their code push will improve reliability.

Developers will not be happy when they have to have every build reviewed by an SRE.

SREs must be able to stand their ground against the pressures of product, who want features, and developers, who want to release features. If an SRE relents and allows a feature to be released in spite of reliability issues, then all of the effort spent in defining SLOs, SLIs, etc is for naught.

The investment in the SRE system does not stop here; it must follow the managerial structure all the way to the CEO. If conflict breaks out between an SRE and the development team (it will) and the issue is escalated, then product management must contend with an SRE management counterpart. If product management escalates to a product executive, then that executive must contend with an SRE executive.

Again, if a feature is blocked by the error budget and a product owner, product manager, or a product executive overrides the SREs then the SRE system is useless beyond providing a pretty dashboard.

The SRE structure's implementation in a company is tested every time a developer or product owner escalates a build blocked by an SRE. The SRE's decision must be backed up by the SRE chain of command so that it is not immediately overridden by a product owner who is pushing for features.

This shift in power is a significant change from the traditional structure of product development — product give requirements to development who give systems to ops to run. It is my opinion that this could be too large an ask for most companies who are being dragged, kicking and screaming, into agile work practices. Though negotiating SLOs may be difficult and take a long time, the cultural change required to adopt the SRE system is easily the biggest challenge that I foresee.

Difficulty of defining SLOs and SLIs for a system

We found great difficulty in the workshop when trying to define SLOs and SLIs for our own products. The process devolved into a series of negotiations and searches for clarification. It really went to show that without very clear definitions, arguments will break out. Sure enough, these definitions will be challenged when a fault happens, and arguments will erupt over the severity of the outage, the root cause, and even whether the product is broken at all! It was suggested that a great amount of time be allocated to declaring SLOs and SLIs for a product — several days, in fact. This gives the process enough time to properly define the SLOs and SLIs, reducing the need for such arguments later. SLOs should be reviewed every six months to a year, as the product will change and the customer's expectations will also change. This may seem like a daunting amount of work, but in my eyes the benefits are really worth it.

Conclusion

The SRE environment described at a recent workshop held by Google serves to solve many of the problems inherent in the traditional cultural structure of product development. It aims to increase transparency of the reliability expectations placed on a product through a shift in how a product is planned, and how it is operated. Unfortunately, the SRE structure does require a complete cultural shift of the product, development, and operations teams — including changes to who holds the power in a project. This challenge will prove overwhelming to many organisations, but those that manage to overcome it will be soundly rewarded with more reliable systems, better disaster recovery, more robust products, and less budget wastage.

Key takeaways:

  • Service level objectives (SLO) bring together product, development, and operations to agree on a realistic goal for the reliability of a system.
  • Service level indicators (SLI) track a specific performance metric in order to measure the state of a system against an SLO.
  • Site reliability engineers (SREs) handle the deployment and operation of a product, but also hold the keys to the system — developers must answer to them when a system is failing.
  • Error budgets track the reliability of a system in a rolling window — every minute of degraded performance eats into the budget.
  • The SRE system makes developers accountable for reliability — development halts if an error budget runs dry.
  • The SRE system ties reliability expectations to reality — product owners cannot demand 100% reliability.

About the author

Matthew Grey is a principal technology engineering consultant at Servian specialising in Google Cloud. Servian is a technology consulting company specialising in big data, analytics, AI, cybersecurity, cloud infrastructure, and application development.

You can reach me on LinkedIn or check out my other posts here on Medium.
