Agile Operations?


Hey Folks,

I have recently been asked to work with site reliability and operations teams, as well as some automation engineering teams. As a program manager and coach, I’d love to understand what being agile looks like in a site reliability and operations context so that I can help coach these teams to be more agile.


There are too many unknowns for me to give a ton of actionable advice, but I would say what I usually say to myself on any new engagement: “Try to get a process map of how the work is done now, before any changes. Learn how value is created, and where waste lives. Review the process with several people on the team as an integrity check. From there: experiment, measure, improve.”

I would also focus on what problems you are meant to help solve, what the existing KPIs are, and creating safety on the team.

Hope there is something that helps in there. :slight_smile:


Agree with @ryan.

Additionally, I’ve found it very useful to work with Ops teams on writing user stories. This seems alien at first, but it reminds them who their users are and what those users are actually trying to achieve. It also ensures they understand how to break down their work. In my experience with Ops teams, there is often a habit of going for the huge win (several months away) with a major engineering, automation or other improvement rather than delivering iteratively.


To +1 @ryan and @paul.cutting: this is where outcome-based coaching and program management come into play. If leadership can put quantifiable outcomes of a more reliable system on the board, you can then write story cards that achieve these outcomes. Patton talks about this in his User Story Mapping book, and the approach will help prioritize the work accordingly if the right measures are in place.


@chrismurman @ryan @paul.cutting

Currently I’m working with a really complex set of teams. Some have been doing Scrum, some have done Kanban, some have multiple Kanban boards (3 queues for a 5-person team), and all of this is under an executive I’ve never worked with.

In some cases these teams are doing strictly Ops work (change management/predictable changes to production infrastructure) and in other cases full-on iterative development (creating a CI/build pipeline or automated, repeatable provisioning and orchestration of network infra). For some teams, repeatable means having a standard operating procedure (a wiki with written instructions), and others are doing full-on Chef, Terraform, DCOS/Docker automation.

Do you have any experiences or stories of successful program management with multiple teams coming from all across the agile spectrum? What did your story look like? What worked? What didn’t work?


Are you attending any conferences this year? Because that’s a several-drink conversation LOL.


Hahaha, theoretically I’ll be at Agile Coach Camp in NYC. Assuming that my new boss approves my expense request.


I’ll be there…let’s chat then!


I’ve been on one site reliability team and worked with a few others. I think one of the things to watch out for is that site reliability has a lot of interrupts. You can work on improving things so that you don’t see the same issues twice. Yet, a successful business tends to see greater scale, where things will constantly fail in new and spectacular ways.

We used Scrum on the ops team I was on, but most of our sprints failed due to these interrupts. Backlog grooming was extremely important because it helped prepare us for mid-sprint re-prioritization when one of these interrupts exposed an urgent change that needed to be made.


@beekey what you’re describing is fairly common on teams of this nature. Many teams I have worked with struggled with the question of what a “successful iteration” looks like. Scrum and ops can work, although I’ve had more success with Kanban on teams like that. You still stop, inspect and adapt, but there’s less need to define iteration goals and more focus on keeping the cards similarly sized and moving.

Two thoughts on what you said:
• First, don’t let the team get bogged down with “failed sprints”. The role of this team is different and therefore needs a different context.
• Second, if you do want to stick to Scrum-style sprints, budget a portion of the backlog for production issues or these interrupts you mentioned. Experiment with different amounts of slack in the points budget until you get it right.
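
To make the slack idea a bit more concrete, here is a rough sketch of how a team might derive a starting interrupt budget from its own past sprints. The function names, the per-sprint (planned, interrupt) point pairs and the example numbers are all made up for illustration; treat it as one possible starting point to experiment from, not a prescribed method.

```python
def suggested_interrupt_share(history):
    """Share of completed points that went to interrupts in past sprints.

    history: list of (planned_points_done, interrupt_points_done) tuples,
    one per past sprint (hypothetical data the team would record itself).
    """
    total = sum(planned + interrupts for planned, interrupts in history)
    interrupt_total = sum(interrupts for _, interrupts in history)
    return interrupt_total / total if total else 0.0


def plannable_points(velocity, interrupt_share):
    """Points left for planned work after reserving slack for interrupts."""
    if not 0 <= interrupt_share <= 1:
        raise ValueError("interrupt_share must be between 0 and 1")
    return round(velocity * (1 - interrupt_share))


# Illustrative numbers only: three past sprints of (planned, interrupt) points.
past_sprints = [(20, 10), (18, 12), (22, 8)]
share = suggested_interrupt_share(past_sprints)
print(f"Reserve about {share:.0%} of the sprint for interrupts")
print(f"Plan roughly {plannable_points(30, share)} of 30 points")
```

Re-running this after each sprint with the updated history is one way to do the "experiment until you get it right" part.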

Hope that makes sense.


One bit I’d add: be really transparent about the interrupt-related work. Visualizing the amount and types of interrupt work for everyone to see (the team, sponsors, execs, etc.) will make it a lot easier for the organization to invest in solving those problems at their root (e.g. if we’re constantly spending a ton of time fixing broken builds for another team, maybe it would be cheaper to teach that team to fix them themselves, or maybe loan them a specialist for a little while).

I’ve gone as far as visually indicating the ratio of planned vs. interrupt work by way of a velocity chart… it may have been the only time a velocity chart has proved helpful to one of my teams :slight_smile:


Amazing points, @mattdominici. I once coached a team to end stand-ups with the mantra “planned work before unplanned work”. It helped them understand not to chase fires.

Because you are dealing with ops, I’m not sure just the visualization of work is enough. Numbers speak to leadership more than anything. Track the amount of time the teams spend on interruptions, and even classify the interruptions if need be. When leadership sees the percentage of a team’s hours that goes to interruptions, that translates into money on many fronts.
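
As a rough sketch of what that tracking could look like, assuming the team logs each interruption as a (category, hours) pair and knows its total working hours for the period (the categories and numbers below are invented for illustration):

```python
from collections import defaultdict


def interrupt_report(entries, total_hours):
    """Summarize interrupt work as a percentage of the team's total hours.

    entries: iterable of (category, hours) pairs for interrupt work only.
    total_hours: all hours the team worked in the same period.
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")

    hours_by_category = defaultdict(float)
    for category, hours in entries:
        hours_by_category[category] += hours

    # Largest categories first, so the biggest time sinks stand out.
    ranked = sorted(hours_by_category.items(), key=lambda kv: kv[1], reverse=True)
    interrupt_hours = sum(hours_by_category.values())
    return {
        "interrupt_pct": round(100 * interrupt_hours / total_hours, 1),
        "by_category_pct": {c: round(100 * h / total_hours, 1) for c, h in ranked},
    }


# Illustrative log for one two-week period of a hypothetical team.
log = [("broken builds", 30), ("prod incident", 20), ("broken builds", 10)]
print(interrupt_report(log, total_hours=200))
```

Multiplying those percentages by a loaded hourly cost would turn them into a money figure, but that rate is something each organization would have to supply.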

Leadership cares about money lol!


Great answers, and I will join you guys for that drink at Agile Coach Camp in NYC. One point that I did not see addressed was your question about teams coming from all across the agile spectrum. I can’t say I’ve ever had to deal with this. Typically I am dealing with teams new to agile adoption. But if I had this problem, I would probably start by investing in knowledge sharing and learning, to get the teams that are still using practices from the stone ages up to speed with the practices of 2017. Will this slow things down? Yes! But this is an investment! The time spent teaching and mentoring these traditionally minded teams will pay for itself fairly quickly. Sometimes we need to slow down now to go really fast in the future. If you don’t make this investment, you run the risk of drowning in technical debt…