97 Things Every SRE Should Know: Collective Wisdom from the Experts
Site reliability engineering (SRE) is more relevant than ever. Knowing how to keep systems reliable has become a critical skill. With this practical book, newcomers and old hats alike will explore a broad range of conversations happening in SRE. You'll get actionable advice on several topics, including how to adopt SRE, why SLOs matter, when you need to upgrade your incident response, and how monitoring and observability differ.
Editors Jaime Woo and Emil Stolarsky, co-founders of Incident Labs, have collected 97 concise and useful tips from across the industry, including trusted best practices and new approaches to knotty problems. You'll grow and refine your SRE skills through sound advice and thought-provoking questions that drive the direction of the field.
Some of the 97 things you should know:
- "Test Your Disaster Plan"--Tanya Reilly
- "Integrating Empathy into SRE Tools"--Daniella Niyonkuru
- "The Best Advice I Can Give to Teams"--Nicole Forsgren
- "Where to SRE"--Fatema Boxwala
- "Facing That First Page"--Andrew Louis
- "I Have an Error Budget, Now What?"--Alex Hidalgo
- "Get Your Work Recognized: Write a Brag Document"--Julia Evans and Karla Burnett
Emil Stolarsky is a site reliability engineer who previously worked on caching, performance, and disaster recovery at Shopify and on the internal Kubernetes platform at DigitalOcean. He is the program co-chair for SREcon EMEA 2019 and SREcon Americas West 2020, and contributed a chapter to the O'Reilly book "Seeking SRE."
Jaime Woo is an award-nominated writer and a frequent speaker at SREcon EMEA, Americas West, and Americas East. He spent three years as a molecular biologist before working at DigitalOcean, Riot, and Shopify, where he launched the engineering communications function.