Offering streaming content to 86 million viewers worldwide without hiccups or stutters – that is the challenge Netflix faces every day of the week.
This of course puts enormous demands on the organization, and the company is often invited to DevOps conferences to talk about its work. But in fact, hardly anyone at Netflix uses the term DevOps. Its organization is built on the concept of Site Reliability Engineers (SREs) – a way of working that has much in common with DevOps and was originally developed by Google. We were given the opportunity to talk to Katharina Probst, Engineering Manager for API and Mantis at Netflix.
FROM WIKIPEDIA:
Site Reliability Engineering
Site Reliability Engineering was created at Google around 2003 when Ben Treynor was hired to lead a team of seven software engineers to run a production environment. The team was tasked to make Google’s sites run smoothly, efficiently and more reliably.
A site reliability engineer (SRE) will ideally spend up to 50% of their time doing "ops"-related work such as issues, on-call, and manual intervention. Since the software system that an SRE oversees is expected to be highly automated and self-healing, the SRE should spend the other 50% of their time on development tasks such as new features, scaling or automation. The ideal SRE candidate is a coder who also has operational and systems knowledge and likes to whittle down complex tasks.
DevOps vs SRE
DevOps encompasses automation of manual tasks, continuous integration and continuous delivery. It applies to a wide range of companies, whereas SRE might be considered a subset of DevOps that requires additional skill sets.
Katharina Probst is Engineering Manager at Netflix, where she has worked since 2015. Before that, she worked in software engineering at Google, both as an engineer and as a manager.
What is your background and how did you get involved with Netflix?
I come from an academic background but joined the high tech scene a few years after my Ph.D. At Google, I worked as an engineer and manager for several years, where I contributed to various projects, ranging from developer tools to Gmail and Google Compute Engine. A little over a year ago, I joined Netflix, where I lead the API team and an operational insights team. What initially attracted me to Netflix (aside from the people) was the opportunity to work on a very high-scale system that is extremely critical to the business.
Is DevOps a term you use at Netflix, and how does your SRE model compare to Google's?
DevOps is not a term that I've heard used frequently at Google or at Netflix. The Google model of SREs differs from the Netflix model in some significant ways. Essentially, all server teams at Netflix are responsible for their own operations, including 24/7 on-call. However, several centralized teams of engineers build fantastic tooling to make this model feasible. For instance, we have a CI/CD tool called Spinnaker. Spinnaker is also powerful at runtime, not only at deploy time, e.g., it gives easy access to logs and allows for easy rollback of code.
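As a rough sketch of the deploy-check-rollback pattern this kind of tooling automates (the names and logic below are hypothetical, not Spinnaker's actual API), consider:

```java
// Illustrative sketch only: NOT Spinnaker's API, just the deploy-check-rollback
// pattern that such CI/CD tooling makes a one-click (or fully automated) operation.
import java.util.ArrayDeque;
import java.util.Deque;

public class DeployWithRollback {

    // Each deploy pushes a new immutable "server group"; rollback reverts to the previous one.
    static final Deque<String> serverGroups = new ArrayDeque<>();

    static void deploy(String version) {
        serverGroups.push("api-" + version);
        System.out.println("Deployed " + serverGroups.peek());
    }

    static boolean healthy(String serverGroup) {
        // A real pipeline would look at metrics, logs and canary analysis;
        // here we simply pretend the newest version is unhealthy.
        return !serverGroup.endsWith("v2");
    }

    static void rollback() {
        String bad = serverGroups.pop();
        System.out.println("Rolled back " + bad + " -> " + serverGroups.peek());
    }

    public static void main(String[] args) {
        deploy("v1");
        deploy("v2");
        if (!healthy(serverGroups.peek())) {
            rollback();
        }
    }
}
```

In a real pipeline, the health decision would be driven by production metrics rather than a hard-coded check, which is what makes rollbacks safe enough for every team to own.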
Could you give a summary of the challenges that Netflix has to deal with when providing its services to the world?
Netflix now has more than 86 million users worldwide and runs on more than 1,000 device types. In addition, Netflix runs dozens of A/B tests. All of these dimensions are continuously increasing. The API provides a platform that allows UI teams to write device-specific server-side logic.
The idea is that server-side logic that deals differently with different form factors (e.g., an iPhone vs. a 40-inch TV screen) or different interaction models (e.g., mobile vs. laptop) has material benefits: developer velocity will be higher because teams can move quickly and independently, and the customer experience will be better because the Netflix experience will be tailored to each device and interaction model. It should not be overlooked, however, that this model leads to increased complexity. The API has seen the number of scripts supplied by device-specific teams grow to more than 1,000.
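A minimal sketch of what device-specific server-side logic can look like, assuming a hypothetical adapter interface and data model rather than Netflix's actual script API:

```java
// Hypothetical illustration of device-specific server-side adapters.
// The interface, record and field names are invented for the example.
import java.util.List;
import java.util.Map;

public class DeviceAdapters {

    // One canonical catalog entry returned by the mid-tier services.
    record Video(String id, String title, String boxArtUrl, String synopsis) {}

    // Each UI team supplies its own adapter, shaping the data for its device.
    interface DeviceScript {
        Map<String, Object> render(List<Video> row);
    }

    static final DeviceScript tvScript = row -> Map.<String, Object>of(
        "layout", "billboard",                      // big screen: rich artwork and synopses
        "items", row.stream().map(v ->
            Map.of("title", v.title(), "art", v.boxArtUrl(), "synopsis", v.synopsis())).toList());

    static final DeviceScript mobileScript = row -> Map.<String, Object>of(
        "layout", "compactList",                    // phone: titles only, smaller payload
        "items", row.stream().map(Video::title).toList());

    public static void main(String[] args) {
        List<Video> row = List.of(new Video("1", "Example Show", "art1.jpg", "A short synopsis"));
        System.out.println(tvScript.render(row));
        System.out.println(mobileScript.render(row));
    }
}
```

The point of the model is that the TV and mobile teams can each reshape the same canonical data independently, without waiting on the API team, which is exactly where both the velocity gains and the added complexity come from.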
Meanwhile, Netflix has very high expectations in terms of uptime and reliability. The API in particular is a critical component of the Netflix ecosystem of microservices: if the API is down, nobody can log into Netflix, sign up, search, discover, or start playback. In other words, the Netflix experience is broken.
DevOps principles are, as we understand it, implemented at Netflix through both a strong culture and a distinctive way of organizing teams – can you describe these?
Netflix is built upon a culture of freedom and responsibility. This means that every Netflix employee has ample freedom, but with freedom comes responsibility. As a result, we embrace a philosophy of “operate what you build.” We only have a few operations-focused engineers, a core team of SREs. In addition, each engineering team operates their own services, handles their own deployments (and rollbacks, if necessary), and is on call 24/7 to deal with any production issues that come up. This does not mean people work non-stop. It does mean that we take turns being available for a call that might come outside of working hours. Each engineer will take a week’s “shift” when their turn comes. Tying this back to freedom and responsibility is straightforward: you have the freedom to own your service’s deployments, but you have the responsibility to make sure that your service is operating properly.
The role of the core SRE team is to have a good understanding of how our ecosystem of microservices works together globally. Generally, the SRE team does not get involved for every alert that is fired (e.g., if latencies between my service and another service go up). They do get involved for big, especially customer-visible, issues. In such cases, they are on the front lines along with the engineers responsible for the individual services, and together they address the issue. SREs have a more global picture and will understand better what successful mitigation strategies for the issue at hand might be (e.g., rollback, which other teams should be involved, traffic shifting). Each individual team knows their service(s) best and drives such things as root-causing and rollbacks for their own service(s).
From what I understand, Netflix is built upon a microservice-based architecture. Could you tell us a little about it?
Netflix runs hundreds of services. Some are small, and some cannot legitimately be called microservices. Many teams own more than one service, but each service is operated by an engineering team. We have dozens, if not hundreds, of services that are truly microservices: they solve a specific problem, are well-scoped and well-isolated, and publish a clear API. This works well with our model of “operate what you build.” Teams understand their own services; each service has clearly defined boundaries and runs and scales independently. Unless there are (rare) backward-incompatible changes to a microservice’s API, each microservice owner can evolve the service independently of other teams, thus leading to great developer velocity for independent teams.
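As an illustration of what "publish a clear API" and backward-compatible evolution can mean in practice (the service, interface and method names here are invented for the example):

```java
// Hypothetical contract of a small microservice; other teams code only against
// the interface, so the owning team can change the implementation freely.
import java.util.Optional;

public class RatingsContract {

    interface RatingsClient {
        double averageRating(String videoId);

        // Backward-compatible evolution: a new capability is added with a default,
        // so existing callers keep working and the owning team can ship independently.
        default Optional<Integer> ratingCount(String videoId) {
            return Optional.empty();
        }
    }

    // The owner's implementation detail, invisible to callers of the contract.
    static class InMemoryRatings implements RatingsClient {
        @Override public double averageRating(String videoId) { return 4.2; }
        @Override public Optional<Integer> ratingCount(String videoId) { return Optional.of(1234); }
    }

    public static void main(String[] args) {
        RatingsClient ratings = new InMemoryRatings();
        System.out.println(ratings.averageRating("abc")
            + " from " + ratings.ratingCount("abc").orElse(0) + " ratings");
    }
}
```

A breaking change to such a contract, by contrast, forces coordination across teams, which is why those changes are kept rare.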
The API, at present, is more complex than a typical microservice. It consists of a service layer (code written by the API team), but it also integrates with many other services. In particular, the API has dozens of downstream dependent services to which it sends traffic. In addition, the API loads and runs the server-side scripts mentioned above. Because the API has grown to this complexity level over the years, we recognize the need to break it down into smaller pieces, for the same reason that any company breaks down a large complex system into microservices. One aspect of this is a current effort to move the device-specific server-side scripts out of the API and turn them into their own microservices.
What do you think about the future evolution of DevOps in the world and at Netflix?
Many organizations these days are moving to the cloud, which naturally changes how ops is done. For a company in the cloud, there’s no longer a need for a specialized hardware ops team that sets up new servers, configures hardware load balancers, etc. But we still need people who are great at operations, and the more complex your application, the more expertise is required. Think about it: if you have a highly complex service with 60 downstream services, all of which could experience problems, say, talking to their persistence layer or bringing up their instances, all of that can affect your own service. If you run a thin application server with few downstream dependencies and low traffic, the problem is much less complex.
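A minimal sketch, assuming a hypothetical recommendations dependency, of how a service can guard a single downstream call with a timeout and a fallback so that one struggling dependency degrades the response instead of breaking it:

```java
// Hypothetical example of bounding the blast radius of one slow downstream call.
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DownstreamGuard {

    static final ExecutorService pool = Executors.newFixedThreadPool(4);

    // Never wait longer than the latency budget; on timeout or error,
    // return a static fallback row instead of failing the whole request.
    static List<String> recommendationsOrFallback(Callable<List<String>> downstreamCall) {
        Future<List<String>> future = pool.submit(downstreamCall);
        try {
            return future.get(200, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException | InterruptedException e) {
            future.cancel(true);
            return List.of("popular-1", "popular-2");
        }
    }

    public static void main(String[] args) {
        // Simulate a downstream service stuck talking to its persistence layer.
        List<String> row = recommendationsOrFallback(() -> {
            Thread.sleep(1_000);
            return List.of("personalized-1", "personalized-2");
        });
        System.out.println(row);
        pool.shutdownNow();
    }
}
```

With 60 such dependencies, every call needs a budget and a fallback along these lines, which is exactly the kind of operational expertise the cloud does not make obsolete.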
As more companies embrace microservices, I can imagine a world where the DevOps model of “you operate what you build” becomes more prevalent. At the same time, I believe that for very high scale, very complex systems, deep expertise in operations will always be required. Whether that implies building up a team of dedicated specialists (like SREs) or building this expertise in a team of engineers is a different question, and one that each organization, or sub-organization, will need to figure out for themselves.