
073: How to Make Eng Oncall (and Your Life) Suck Less

🖼 Recover (Scotch & Bean) + 👉 What else is true?

Jason Shen

This is the 73rd edition of Cultivating Resilience, a weekly newsletter on how we build, adapt, and lead in times of change, brought to you by Jason Shen, a 1st gen immigrant, retired gymnast, and 3x startup founder turned Facebook PM.

Welcome to Q4! It’s my last day in Seattle on this mini cross-country tour I’m doing with my wife. We have one last stop and one more mural, which is Albuquerque, New Mexico. I thought it would be generally warm, but it turns out desert temps drop fast this time of year, getting as low as 27 degrees Fahrenheit. Brr.

Jason

PS - here’s a look at the Seattle mural

🧠 Applying Lessons from Engineering Oncall

Photo credit Fotis Fotopoulos

I want to share with you some ideas for fostering more resilience in yourself and in the teams you lead, but you have to give me a second to play out an analogy.

For those who have not worked directly with software engineering teams before, there is a best practice called oncall. Active software teams are supposed to continually improve and build upon the products and services they are responsible for. This usually looks like "adding new features" or "making it faster".

However, bugs and issues inevitably crop up. Most software is, generally speaking, too complex to be 100% bug free right out of the gate. Even with extensive testing, new bugs can appear, maybe because code written by a different team in a different part of the company causes an error. And sometimes that error needs to be fixed right away.

Oncall is a way for engineers to rotate the responsibility of being the "first responder" in the case of an urgent issue, not dissimilar to what medical professionals often deal with. The idea is that the team that writes the code for a product also needs to take responsibility when their code stops working. Engineers are typically "oncall" for a week at a time.

So an engineer might be oncall from Tuesday at 10am all the way through the following Tuesday morning. They try to keep their laptop with them, and maybe their work phone, so they can watch for issues and start fixing them right away. And while that might be annoying, at least on a team of 8 engineers rotating weekly, it'll be about two months before you have to do it again.
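For the curious, here is a rough sketch of how a weekly rotation like this could be laid out in Python. The engineer names, the start date, and the oncall_schedule helper are all made up for illustration; real teams usually lean on a scheduling tool rather than a script like this.

```python
from datetime import datetime, timedelta


def oncall_schedule(engineers, start, shifts):
    """Return (engineer, shift_start, shift_end) tuples for weekly shifts.

    Hypothetical helper: engineers take turns in list order, one week each.
    """
    week = timedelta(weeks=1)
    return [
        (engineers[i % len(engineers)], start + i * week, start + (i + 1) * week)
        for i in range(shifts)
    ]


# Placeholder names for a team of 8 (assumption, not a real roster).
team = [f"engineer_{n}" for n in range(1, 9)]

# An assumed start date: Tuesday, October 5, 2021 at 10am.
first_shift = datetime(2021, 10, 5, 10, 0)

# Nine shifts is enough to see the first engineer come around again.
schedule = oncall_schedule(team, first_shift, shifts=9)

# Days between the first engineer's first and second shift starts.
gap = schedule[8][1] - schedule[0][1]
print(schedule[8][0], gap.days)
```

With 8 people, the gap between one engineer's shift starts works out to 8 weeks (56 days), which is where the "about two months" comes from.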

What if you don't have any bugs?

Great question. Different teams handle it differently. Some teams will have the engineer continue building new capabilities if no major bugs come up. Some will encourage them to work on past bugs that were deprioritized but are still lingering. And still others dedicate that oncall time to investing in security, reliability, and observability to prevent more bugs from happening in the future.

Charity Majors, a Facebook alum, is the cofounder and CTO of Honeycomb, which makes code observability software. She writes frequently about engineering management, software reliability, and other tech topics.

In her piece On Call Shouldn't Suck, a Guide for Managers, Majors writes about her strategies for making on call effective. I thought a lot of the ideas were applicable to non-engineers and non-engineering situations.

Consider:

It is easier to keep yourself from falling into an operational pit of doom than it is to claw your way out of one. Make good operational hygiene a priority from the start.

Translates to: don't run yourself so ragged that you struggle to recover. Make a proactive effort to clean up your desk, get sleep, see friends, and inject breathers into your work cadence.

Construct your feedback loops thoughtfully. Try to alert the person who made the broken change directly. Never send an alert to someone who isn’t fully equipped and empowered to fix it.

Translates to: try to get the people who caused the problem to fix the problem. Don't get caught always fixing problems on other people's behalf - let them learn how to do it themselves.

When an engineer is on call, they are not responsible for normal project work — period. That time is sacred and devoted to fixing things, building tooling, and creating guard-rails to protect people from themselves. If nothing is on fire, the engineer can take the opportunity to fix whatever has been annoying them. Allow for plenty of agency and following one’s curiosity, wherever it may lead, and it will be a special treat.

Translates to: take some time, maybe 30 minutes a day and a day a week, where you really try to take care of yourself and do something that will benefit future you. Maybe that means errands, a nap, a gaming session, or a workout. Do not allow other obligations (work, family) to interrupt you during these times, unless it's truly an emergency. Develop this practice into a habit.
