073: How to Make Eng Oncall (and Your Life) Suck Less



🖼 Recover (Scotch & Bean) + 👉 What else is true?


This is the 73rd edition of Cultivating Resilience, a weekly newsletter on how we build, adapt, and lead in times of change, brought to you by Jason Shen, a 1st gen immigrant, retired gymnast, and 3x startup founder turned Facebook PM.

Welcome to Q4! It’s my last day in Seattle on this mini cross-country tour I’m doing with my wife. We have one last stop and one more mural, which is Albuquerque, New Mexico. I thought it was going to be generally warm, but it turns out desert temps get cold fast this time of year, getting as low as 27 degrees Fahrenheit. Brr.

Jason

PS - here’s a look at the Seattle mural

🧠 Applying Lessons from Engineering Oncall

Photo credit Fotis Fotopoulos

I want to share with you some ideas for fostering more resilience in yourself and in the teams you lead, but you have to give me a second to play out an analogy.

For those who have not worked directly with software engineering teams before, there is a best practice called oncall. Active software teams are supposed to continually improve and build upon the products and services they are responsible for. This usually looks like "adding new features" or "making it faster".

However, bugs and issues inevitably crop up. Most software is, generally speaking, too complex to be 100% bug free right out of the gate. Even with extensive testing, new bugs can appear, perhaps because code written by a different team in a different part of the company causes an error. And sometimes that error needs to be fixed right away.

Oncall is a way for engineers to rotate the responsibility of being the "first responder" in the case of an urgent issue, not dissimilar to what medical professionals often deal with. The idea is that the team that writes the code for a product also needs to take responsibility when their code stops working. Engineers are typically "oncall" for a week at a time.

So an engineer might be oncall from Tuesday at 10am all the way through the following Tuesday morning. They try to keep their laptop with them, and maybe their work phone, so they can watch for issues and get right on fixing them. And while that might be annoying, at least when the rotation is shared across a team of 8 engineers, it'll be about two months before you have to do it again.
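For the engineers reading, here is a minimal sketch of that rotation math in Python. It is only an illustration, not how any particular team or paging tool does it: the `build_rotation` helper, the placeholder engineer names, and the start date are all assumptions I made up for the example.

```python
from datetime import datetime, timedelta


def build_rotation(engineers, first_handoff, weeks):
    """Return a list of (engineer, shift_start, shift_end) weekly shifts.

    Hypothetical helper: engineers take turns in list order, and each
    shift runs exactly one week from handoff to handoff.
    """
    schedule = []
    for week in range(weeks):
        start = first_handoff + timedelta(weeks=week)
        end = start + timedelta(weeks=1)
        engineer = engineers[week % len(engineers)]  # wrap around the team
        schedule.append((engineer, start, end))
    return schedule


# Assumed team of 8 with placeholder names.
team = [f"engineer_{i}" for i in range(1, 9)]

# Assumed first handoff: a Tuesday at 10am.
first_handoff = datetime(2021, 10, 5, 10, 0)

schedule = build_rotation(team, first_handoff, weeks=9)

# Weeks between the first engineer's first shift and their next one.
first_shift, repeat_shift = schedule[0], schedule[8]
weeks_between = (repeat_shift[1] - first_shift[1]).days // 7

for engineer, start, end in schedule:
    print(f"{engineer}: {start:%a %b %d %H:%M} -> {end:%a %b %d %H:%M}")
```

With 8 people and week-long shifts, `weeks_between` comes out to 8, which is where the "about two months" figure comes from.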

What if you don't have any bugs?

Great question. Different teams handle it differently. Some teams will have the engineer continue building new capabilities if no major bugs come up. Some will encourage them to work on past bugs that were deprioritized but are still lingering. And still others try to dedicate that oncall time to investing in security, reliability, and observability to prevent more bugs from happening in the future.

Charity Majors is the cofounder and CTO of Honeycomb, which makes code observability software. She is also a Facebook alum who writes frequently about engineering management, software reliability, and other tech topics.

In her piece "On Call Shouldn't Suck, a Guide for Managers," Majors writes about her strategies for making on call effective. I thought a lot of the ideas were applicable to non-engineers and non-engineering situations.

Consider:

It is easier to keep yourself from falling into an operational pit of doom than it is to claw your way out of one. Make good operational hygiene a priority from the start.

Translates to: don't run yourself so ragged that you struggle to recover. Make a proactive effort to clean up your desk, get sleep, see friends, and inject breathers into your work cadence.

Construct your feedback loops thoughtfully. Try to alert the person who made the broken change directly. Never send an alert to someone who isn’t fully equipped and empowered to fix it.

Translates to: try to get the people who caused the problem to fix the problem. Don't get caught always fixing problems on other people's behalf - let them learn how to do it themselves.

When an engineer is on call, they are not responsible for normal project work — period. That time is sacred and devoted to fixing things, building tooling, and creating guard-rails to protect people from themselves. If nothing is on fire, the engineer can take the opportunity to fix whatever has been annoying them. Allow for plenty of agency and following one’s curiosity, wherever it may lead, and it will be a special treat.

Take some time, maybe 30 minutes a day and a day a week, where you really try to take care of yourself and do something that will benefit future you. Maybe that means errands, a nap, a gaming session, a workout. Do not allow other obligations (work, family) to interrupt you during these times, unless it's truly an emergency. Develop this practice into a habit.

Closely track how often your team gets alerted. Take ANY out-of-hours-alert seriously, and prioritize the work to fix it. Night time pages are heart attacks, not diabetes.

For me, working overtime, on nights or weekends, is a red flag. I do it less than 10 times a year (in fact I had to do it this week) and based on this, I should really unpack why I had to do that and how I'll try to prevent it in the future.

Above all: ✨RAISE YOUR STANDARDS✨ for what you expect from yourselves. Your greatest enemy is how easily you accept the status quo, and then make up excuses for why it is necessarily this way. You can do better. I know you can.

Oof, this is a doozy. Do you feel like shit right now? Maybe just a vague sense of unease and anxiety. Some of that is understandable given the world and the dukkha of life. But some of it can be improved! As the saying goes, we don't always achieve our aspirations, but we rarely let ourselves fall below our standards.


🖼 Recover (Scotch & Bean)

This may or may not be based on a true story.


👉 “What else is true?”

Sometimes we get stuck in a negative or disempowering story

  • “I’m not cut out for this job”
  • “I never get asked out”
  • “People always underestimate me”

And it can be hard to reason with the part of our brain that stubbornly repeats this fact as the one and only truth. But there’s a way you can trick it. You can “Yes, and” it, straight outta improv.

Try asking yourself “what else is true?”

It doesn’t explicitly contradict the initial statement. It accepts it, and continues. Maybe you’ll see that you’ve done this job well in the past, that you haven’t put yourself out there much lately, or that being underestimated can sometimes work to your advantage.

That additional thought, which expands and contextualizes your negative one, might be what you need to move forward.

Like this edition of Cultivating Resilience? Help me reach more people who could use these ideas by sharing it!