Surprises happen during operations! Sometimes, when those surprises have a significant impact on the business, we label them “incidents” or “outages”. We might even spend time investigating some of our bigger outages to better understand what happened.
It turns out that what can be learned from investigating an outage is not proportional to how big the impact was! In fact, it can be easier to learn from incidents with less impact because there’s less pressure from the organization to get closure and move on.
In this talk, I will present the OOPS project, an effort inside Netflix to encourage engineers to report and write up operational surprises they were involved in, even if there was no customer or business impact.
I’ll talk about what we hope to learn as an organization from OOPS writeups, what sorts of questions an investigator should ask in order to maximize learning, and how to write up the results of an OOPS investigation as a story to make it easier for a reader to absorb the lessons.
Lorin Hochstein is a Sr. Software Engineer on the CORE (Cloud Operations & Reliability Engineering) Team at Netflix, where he works on ensuring that Netflix remains available. He was previously Sr. Software Engineer at SendGrid Labs, Lead Architect for Cloud Services at Nimbis Services, Computer Scientist at the University of Southern California’s Information Sciences Institute, and Assistant Professor in the Department of Computer Science and Engineering at the University of Nebraska–Lincoln.
Lorin has a B.Eng. in Computer Engineering from McGill University, an M.S. in Electrical Engineering from Boston University, and a Ph.D. in Computer Science from the University of Maryland.
Meeting Recording (link coming soon)
Meeting Presentation (link coming soon)