Narrator: You're listening to the Humans of DevOps podcast, a
podcast focused on advancing the humans of DevOps through skills,
knowledge, ideas and learning, or the SKIL framework.
Jason Baum: Hey everyone, it's Jason Baum, Director of Member
Experience at DevOps Institute, and this is the Humans of DevOps
podcast. Welcome back. Hope you had another great week this
week. I always hope you have a great week. So I hope this one
was even greater than the last one. Today we're gonna be
talking about incidents, mistakes, do-overs. We've talked
about blameless culture. It's one of the core principles of
DevOps. But in reality, is it as easy as just saying we have a
blameless culture? In preparing for today's episode, I'm
reminded of a quote by Phoebe Waller-Bridge, the creator of
Fleabag and showrunner of Killing Eve: "That's the very
reason they put rubbers on the ends of pencils, because people
make mistakes." I love that quote. If you don't know who she
is, Waller-Bridge has made a career of bringing to
life unconventional women who make a lot of mistakes. But what
her series have in common is that her flawed characters get a
chance at redemption, moving past mistakes, and offering them
an opportunity to prove something to themselves and come
out stronger and more confident on the other side. I feel like
this quote, the metaphor she made is a perfect setup for our
conversation today. Incidents are a great opportunity to
gather both context and skill. They take people out of their
day to day roles and force teams to solve unexpected and
challenging problems together. Joining me today to discuss this
topic is Lisa Karlin Curtis. Lisa is a product engineer at
incident.io. In fact, Lisa was employee number two at incident.io.
She started out as a consultant working at Accenture before
accidentally becoming a developer. I'd love to hear how
that happened. Lisa loves building stuff, but is also
interested in how people interact with each other in a
work environment, particularly in software engineering. Outside
of work, Lisa loves cooking and pretty much any competitive
sport. We definitely have that in common, but I guess she likes
the British ones. She says I really don't know that much
about them. So Lisa, welcome to the podcast. Thank you for
joining me.
Lisa Karlin Curtis: Hey, lovely to be here. I'm really looking
forward to it.
Jason Baum: Awesome. Are you ready to get human? Alright, let's
do it. So we're talking about incidents. How can incidents be
considered an opportunity to gather context and skill?
Lisa Karlin Curtis: So I kind of started thinking about this
when I was reflecting on how, yeah, I became a software
engineer accidentally, and I've accelerated quite quickly; I've
been very fortunate. And part of that is because a lot of the
stuff I did before I was an engineer was actually quite
useful. But also part of it was, I realized, I started doing this
thing where I was basically running towards the fire. So
stuff would go wrong, and I'd be like, oh, that looks kind of
interesting. And all of the times where I learned most, the
sort of step changes in my understanding or my context,
were around incidents. So it was like, something would go
wrong, and either I would learn about new stuff while we were
fixing that problem, or straight
afterwards, when I was like reflecting on it and talking to
people about it, I'd learned a bunch of stuff. And so I kind of
started thinking about this and talking to people about it. And
it turns out other people had had the same experience, what a
surprise. And so I think that there is something very
unusual about an incident, which is why it is an incident, right?
Like something, something happens that is unexpected that
you didn't know was going to happen. And then you have to
react to it. And that pushes people outside their comfort
zone. And it pushes you to do things and see things that you
wouldn't otherwise see. So I guess I can think of
three key areas where it's really useful. One is
about broadening your horizons, because you see
stuff that you wouldn't see in your day to day. One is about
teaching you how to build stuff that fails gracefully, and
in an observable way. I think that one of the
key differentiators between good software engineering and great
software engineering is what happens when the thing that
you didn't think could happen happens. So, like, step one, make
it work. Step two, make it work really fast. Step three, make it
so that when you get a negative number that you
weren't expecting, you explode really, really loudly, as opposed
to just taking the negative number and paying somebody a
negative amount of money, or whatever it might be. And
then sorry, just to finish off, the third is about building your
network. So you have a whole bunch of connections with
different people in your organization, you work with,
like your team. But in an incident, often you have
to work with lots and lots of people from across the
organization. And that builds bonds that are really important.
And I think really valuable both to you as an individual and to
the company.
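As a rough illustration of what Lisa means by exploding really loudly instead of quietly paying out a negative amount, here is a minimal Go sketch. The billing types and field names are invented for the example, not incident.io's actual code.

```go
package billing

import "fmt"

// Charge represents an amount (in pence) we intend to bill a customer.
type Charge struct {
	CustomerID  string
	AmountPence int64
}

// Validate refuses to proceed when the charge looks impossible, so the
// failure is loud and visible and a human gets pulled in, instead of money
// quietly moving in the wrong direction.
func (c Charge) Validate() error {
	if c.CustomerID == "" {
		return fmt.Errorf("charge has no customer ID")
	}
	if c.AmountPence < 0 {
		return fmt.Errorf("refusing to bill customer %s a negative amount (%d pence): this should never happen", c.CustomerID, c.AmountPence)
	}
	return nil
}
```

The point is not the specific check; it is that the case you didn't expect is rejected noisily rather than processed silently.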
Jason Baum: Yeah, absolutely. If you listen to this podcast,
you've heard me use parenting often as examples. Because I
think parenting is so applicable to what goes on in day to day
life outside of your home. And one of the things that I was
told when I was a new parent was plan for the unexpected, or plan
for the unplannable. And I think that's applicable here. With
incidents and mistakes, it's almost like you need to expect it,
it's going to happen; if you're building a program, it's going
to have a bug. It's what happens after it happens that really
matters, right? That's where everything happens.
Lisa Karlin Curtis: Yeah, I think that's the differentiator. In
terms of, if you're building a system, you will predict a
certain number of the possible things that are going to happen,
and you will make your system behave well in them. And that's all
great. And then you read a book that's like, oh, you should make
your system observable, and you go, okay, I will add some log
lines and I will add some metrics. And, like, it's really easy to
do that in a way that doesn't really add any value, and we've all
seen examples of that. We've all seen dashboards that don't
really mean anything. And the way to get from having read it in a
book to being actually able to do it valuably, I think, is just
to see it. It's very difficult to learn that in the abstract, but
as soon as you see someone trying to debug a problem, and you
see, you know, what are the breadcrumbs that they are following
in order to get from 'our API is slow' to 'this is the root
cause, I can now fix this problem', and if you see somebody
do that enough times, you start to be able to lay your own
breadcrumbs. Because you can kind of imagine, you can
empathize, you can put yourself into that person's shoes who's
trying to debug it and be like, Oh, maybe it'll be useful for me
to like, have a metric here. Because if this specific bit
starts to go weird, we want to know about it. And I think that
that's something, as you say, where it's all about
preparing for the unexpected. And a lot of that, actually,
counterintuitively perhaps, is not being able to handle
every case; it's being able to either be sure that what you're
doing is right, or get a human to help you out, and that
in code is, like, throwing an exception or panicking or
whatever you want to call it. And that's the most important
thing, particularly if you're building, you know, billing
software, or cars and planes and software that we trust
with our lives; then you need to be sure that if the software
sees anything that it doesn't expect, it defers to the human.
And that's the same in FinTech, which is my background,
and it's the same in lots of bits of software. And that's
something that you're not really taught, and if you
read it in a book, it doesn't really land. But once you see
it, you can really start to engage with it and sort of do it
yourself.
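A tiny sketch of that breadcrumbs idea: a structured log line and a counter at the point where behaviour can start to go weird, so whoever later debugs "our API is slow" has something concrete to follow. The function names, thresholds, and metric stand-in below are made up for illustration.

```go
package payments

import (
	"log/slog"
	"time"
)

var slowLookups int64 // stand-in for a real metrics counter

// lookupBalance wraps the datastore call and leaves a breadcrumb whenever it
// is unexpectedly slow, instead of failing (or crawling) silently.
func lookupBalance(customerID string) (int64, error) {
	start := time.Now()
	balance, err := queryBalance(customerID)
	elapsed := time.Since(start)

	if elapsed > 500*time.Millisecond {
		slowLookups++
		slog.Warn("balance lookup slow",
			"customer_id", customerID,
			"elapsed_ms", elapsed.Milliseconds())
	}
	return balance, err
}

// queryBalance stands in for the real datastore call.
func queryBalance(customerID string) (int64, error) {
	time.Sleep(10 * time.Millisecond) // pretend to do some work
	return 0, nil
}
```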
Jason Baum: So I mentioned blameless culture, and that's
really important in DevOps with incidents, or really with
blameless culture anywhere. But I feel like it is something that
is said, I've heard it said so much, that it almost becomes a
buzzword, and you really question the authenticity when someone
says, oh, we have a blameless culture. How do you actually have a
blameless culture? How do incidents lead to learning without the
person who made the mistake feeling like they're getting in the
way or, you know, feeling like they made a mistake and let people
down? I think that's inherent nature for all of us, right?
Lisa Karlin Curtis: Yeah, absolutely. I think a lot of people
have a lot of shame around making mistakes at work. And that is a
very human thing; humans are incredibly susceptible to shame. And
what that means is
that there is so much psychological pressure when you
make a mistake to try and cover it up. And that is like the
worst possible thing that you can do in a software engineering
environment. And we know that, and yet all of us still have
that, right. All of us still have that moment when we find a
mistake was made. And we're like, maybe I'll just fix it,
and it will be fine. And no one will ever know. And I think that
is such a, it's so hardwired into our brains, that you have
to work very, very hard to combat it. And so some of the
obvious things that you can do as an individual: try and be
very open about your mistakes, particularly if you're in a
leadership role, or if you have quite a lot of social capital
because you've been in that organization for a long time. That
means that people will kind of monkey see, monkey do, right? If
you do it, other people will copy you and will follow your
example. And then there's another part of it,
which is, I think, what you'd talk about as failing together. The
way that most technology fails is more complicated than one
person made one error; it's normally lots of people made lots of
decisions that have all coalesced into a bad thing.
There was a famous one at a company I used to work at, where a
junior engineer had written some code that was supposed to send
an email telling people who weren't paying that they should pay,
basically trying to prompt them and increase conversion. And the
logic was a bit wrong, and it was actually targeting all the
people who were paying. And they ran it in staging, and customer
support got inundated with requests. And the junior engineer is
sitting there being like, I put it in staging, I'm really
confused, this is very stressful. The team jumps in, they go to
support, support is then told about the mistake: don't worry,
your billing's fine. And they start to look back, and it turns
out that somebody had seeded staging with production data to
run some load tests, and they didn't anonymize the emails. And
so basically, they've run their code in production, but they
didn't know. And in something like
that, the junior engineer is really mortified, because they've
done this thing; they clicked a button and a
bunch of emails went out. And that's really bad. But I think
it's important to look at that as a group and be like, Well,
how could you possibly have known that, right? Why did we
put production data in staging? Why did we not anonymize it? Why
is staging set up so that it can send unlimited emails to
unlimited numbers of people? And there are a whole load of other
questions, right, and you can start to look at it as a
systemic problem. Or you can look at it with, like, the
Swiss cheese analogy: all the holes have to line up.
And I think that if you talk about things like that a lot,
then people get it, and people buy into it. And at that point,
it's much more comfortable to admit your mistake, because you
know that your team is going to gather around you and you know
that your team is going to take accountability. And so if you
succeed together and if you fail together, you can build
this blameless culture. But if you hang people out to dry, if
you mock people, if you're mean, that's just going to reinforce
the shame that that person is already worried about.
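The missing safeguard in that staging story is easy to sketch: if you ever seed staging from production data, scrub anything that could reach a real customer before it lands. A minimal, hypothetical Go example, assuming a simple User row:

```go
package seeding

import (
	"crypto/sha256"
	"fmt"
)

// User is a hypothetical row copied from production into staging.
type User struct {
	ID    string
	Email string
}

// AnonymizeForStaging replaces the email with a stable, clearly fake address,
// so load tests still see realistic-looking rows but can never email a real
// person, whatever buttons get pressed in staging.
func AnonymizeForStaging(u User) User {
	sum := sha256.Sum256([]byte(u.Email))
	u.Email = fmt.Sprintf("user-%x@example.invalid", sum[:6])
	return u
}
```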
Jason Baum: So if I'm hearing you, it's that proactive honesty. A
mistake is made, and you call it out for what it is: this
happened. How do we address it? What do we do? Coming together,
getting everybody to rally around it, without pointing the
finger?
Lisa Karlin Curtis: Yeah, I think that's exactly it. And then you
need to combine that with the idea that incidents shouldn't be a
big scary monster. I think there's a blog post on our blog about
how incidents are not a bad thing and you should be declaring
more incidents. There are lots of organizations who sort of
measure their success on the number of incidents, which I think
is a really perverse incentive. But I think that if incidents
become the norm, then mistakes become the norm. And if you're all
talking about incidents, and if the information about those
incidents is really accessible to people in your organization,
then you've made a mistake just like everybody else on the team,
who has made hundreds of mistakes. Whereas if that is all kept
hush hush within the team, and it's not broadcast, then all of a
sudden that's the first mistake anyone in the company has ever
made, as far as you're concerned. And that's a really terrifying
place to be.
Jason Baum: Yeah, and we all know that's not true. But yet we
feel it; it's just inherent human nature, right? To think that
your mistake is the worst mistake ever made, and oh my god,
they're gonna fire me, or they're gonna blacklist me, or
something bad is gonna happen, I'm never gonna work again, it's
the end of my life. And we just have this habit, I think, from
when we're kids through adulthood. I always hoped that feeling
would go away, and it never has. Why?
Lisa Karlin Curtis: I think it's partly the imposter syndrome
thing of, like, the more you progress, the more responsibility
you have, the more you can see all the things that you don't know
how to do. But I think there's also another part of it, which is
that I think we inherently strive for perfection. We want
ourselves to be perfect, because it's quite inconvenient that
we're not, because every single decision you make, you're like,
I think I'm right. As a software engineer, you spend your entire
day going, yeah, I think this is right, let's do it. And if
you've got a little voice in your head all the time going, what
if you're not, what if you're not, that becomes really stressful
and really tiring. And so what we do is we go, yeah, I'm right
most of the time, this is kind of fine. And then when something
goes wrong, that punctures that sense of confidence, and that's
really destructive. And then we get really stressed, because we
think that the only reason anybody has hired us is because we're
right all the time, and it isn't. But because that makes our day
to day easier, I think it's very easy to fall into the trap of
almost believing your own rubbish, that you are in fact perfect
and flawless.
Jason Baum: It's so funny, because it's probably, if not the most
common, one of the most common questions that comes up in an
interview process: name an incident that happened, for example,
where you made a mistake. How did you handle it? How did you
overcome it? Right? That's like one of the most common questions,
I think. If you haven't had it, I don't know, maybe you haven't
ever applied for a job before in your life, because I think it
must be the most common one. And yet, when you're hired to do a
job, it's almost like, yeah, we do strive for perfection, and
that question went out the window. It's like, now I can't make a
mistake.
Lisa Karlin Curtis: Yeah, it feels like one of those things that
there should be a better answer to as well, but I don't think
there is; I don't think there's a silver bullet. I think you talk
about it, you lead from the front, you try and catch it when it
does happen, and you hope that slowly but surely that feeling of
wanting to hide it, that feeling of shame, just becomes less and
less strong, and you develop this muscle of overriding it. But
you know, I'm talking about this and evangelizing it, and I still
absolutely have that instinct. The only thing that I've learned
is I have a muscle now where I can recognize it, and I can look
it in the face and be like, we're not doing that today, because
that's not useful. But it's definitely an active thing; it's not
a sort of default.
Jason Baum: Yeah, yeah. And based on what you're saying, we have
to look at each incident as isolated: it's an incident, we take
care of it, learn from it, and just move past it, if I'm
gathering that right.
Lisa Karlin Curtis: I think emotionally, absolutely. I think
there's another side to that, right, which is that as an
organization, I think incidents are a really important source of
data to understand where you should be putting your chips. So if
you have good reporting, if you can look at your incidents in
aggregate, if you have a good way of recording them and
categorizing them, then you can start to use that data and go,
oh, this bit of tech seems to cause us a lot of problems, or, you
know, this process seems to be really risky for us. And that will
help us decide where we're going to invest next. So I think from
that point of view, you don't want to kind of leave them behind,
though. And that's in direct conflict with what you want people
to do emotionally, which is to, you know, be there in the moment,
solve the problem, close it, and not worry about it. And I think
that's quite difficult to manage, because you're simultaneously
telling people to leave it behind, and also telling them to
constantly be thinking about them and, you know, sitting in a
quarterly review being like, what went wrong this quarter? What
do we want to invest in to help make our platform more reliable,
right?
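A toy sketch of what looking at incidents in aggregate can mean in practice: count them by category so the bit of tech or process that keeps hurting you becomes visible. The Incident type and category names here are invented; a real incident tool would export richer data.

```go
package insights

import (
	"fmt"
	"sort"
)

// Incident is a minimal stand-in for an exported incident record.
type Incident struct {
	Category string // e.g. "billing", "deploys", "third-party"
	Severity int    // 1 = minor ... 3 = major
}

// reportByCategory prints categories from most to least frequent, which is
// often enough to start a "where do we invest next" conversation.
func reportByCategory(incidents []Incident) {
	counts := map[string]int{}
	for _, inc := range incidents {
		counts[inc.Category]++
	}
	cats := make([]string, 0, len(counts))
	for c := range counts {
		cats = append(cats, c)
	}
	sort.Slice(cats, func(i, j int) bool { return counts[cats[i]] > counts[cats[j]] })
	for _, c := range cats {
		fmt.Printf("%-20s %d incidents\n", c, counts[c])
	}
}
```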
Jason Baum: You know what it makes me think of, and apologies
ahead of time: American football. We have a quarterback, and the
quarterback throws interceptions. It's a given thing. Everyone
knows, no matter who it is, Tom Brady, I mean, they're going to
throw interceptions. And yet they strive for perfection, because
what sport, what athlete doesn't, right? And then when they
happen, the one thing that I would say about that culture is they
get on the phone, or they go next to the offensive coordinator or
coach, and they look at what happened in that play. Here's what
happened, here's why he didn't see it, this is what happens. And
then they are supposed to forget it, forget it ever happened and
move on. Because how do you move on with the rest of the game if
all you're thinking about is the one big mistake you made? And
for me, when you're talking about kind of forgetting it, that
instantly popped into my head. So many things could be applied to
that, I think.
Lisa Karlin Curtis: Yeah, I think also when we talk about
incidents, there's nothing that's specific to engineering
about them, really. I think engineers talk about them a lot;
we have a language that we discuss them in. But there are
loads of examples of incidents that are not engineering. And I
think almost all the stuff that we discuss around incidents
being a chance to build your network with other people being
a chance to touch things that you don't normally interact
with, you know, being a chance to watch what your system does
when it fails. Like that feels very engineering. But actually,
if you're a customer success team, you know what happens when
your processes fall over? What happens when the person who's
doing all of the glue work has gone on holiday and all of a
sudden something bad happens? Like, you're still stress
testing, you're still finding the edges. It's just a slightly
different environment.
Ad: Are you looking to get DevOps certified? Demonstrate
your DevOps knowledge and advance your career with a
certification from DevOps Institute. Get certified in
DevOps Leader, SRE, or DevSecOps, just to name a few. Learn
anywhere, anytime. The choice is yours. Choose to get certified
through our vast partner network, self-study programs, or our new
SKILup eLearning videos. The exams are developed in
collaboration with industry thought leaders and subject
matter experts in the DevOps space. Learn more at
devopsinstitute.com/certifications.
Lisa Karlin Curtis: I think what I find interesting about that is
that you start off like, oh, when things go wrong, it's bad. And
we've had a number of incidents where I would say the net impact
on our company has been positive. Because somebody reports it, we
see it, we've got some really quite good observability, so often
we can find it pretty quickly, fix it, and turn it around in,
say, half an hour. And the customer ends that interaction
actually feeling better about us than when they started, which is
probably quite counterintuitive, because really, if we just
hadn't broken it in the first place, maybe that would have been
better for them. But we've joked internally about maybe we should
deliberately come up with bugs, because of how much great
feedback we get from people when we fix things.
Jason Baum: Right? I feel like that's the evolution of any good
product, right, the feedback. So, you know, in thinking about
letting it go and thinking about these improvements, I feel like
there must be obstacles, though, besides ourselves and our own
internal turmoil that we put ourselves through when we make a
mistake or when an incident happens. There are deadlines, and you
need to hit them, you need to meet them, and when you miss them,
that's a big deal. So how does that play in when incidents
happen? How does that, I guess, impact that feeling that we're
already feeling, the shame that you talked about, and then we
have this deadline looming over our heads?
Lisa Karlin Curtis: I think that's really interesting. I think
it's very, very difficult, because you have a trade-off,
generally. So normally there's a triangle of, like, speed, and
quality, and cost on the other corner of the triangle, the idea
being that you have to trade off something, number of people,
maybe. Maybe we should just scrap all of that, I'm gonna start
again, that's fine. I think there's a trade-off here. So when
something goes wrong, the first thing is: I need to fix the thing
that broke. And that takes you however long it takes you, and
basically nothing else matters. Generally, there is a sort of
type of failure mode where you're just trying to bring your
system back up, or resolve the bug, or stop anything getting any
worse. And that's a really easy decision, because it's there,
it's on fire, we've got to fix it. And then you get to a sort of
second
stage of an incident, which maybe is like follow ups. Or
maybe you're still kind of in the incident mode where nothing
is on fire anymore. But there's a lot of things that you could
do that would make it less likely to happen or resolve it
in a neater way. And that's where you need your strong
engineers to come in and make those trade offs. And it's like,
what is the value of this piece of work? How long is it going to
take us? How much in the wrong direction is it from what we
thought we were doing? And can we afford to punt it? Can we
afford to delay it? And you get to this point where you've got a
deadline and a set of problems, and you need to make a decision
about which is basically more important. And that is a straight
shoot-out trade-off, often, because it's like, we have four
people in our team, we have two weeks, what shall we do? And the
answer to that is not to work 60-hour weeks, because in all
likelihood you won't get anything more done, in my experience.
And so instead it has to be, right, which of these
which of these is more of a risk to us? What happens if we miss
the deadline? And that's a decision that needs to get escalated
to someone who has the authority to make that call and the
information. That means that you have to make that information
really available to them in terms of, you know, what is the work
that we could do to mitigate it? What would it be mitigating?
Versus how far behind are we on the deadline? What does it mean
if we don't hit the deadline? And that is one of those things
where I think lots of people have this, oh, we'll find a creative
solution. And sometimes there is a creative solution, and
somebody's overlooked something, and actually it's all gonna be
fine. And sometimes there isn't.
There isn't enough time, and you have to pick something. And I
think identifying that is really important and being really
honest, from a kind of motivation and human point of
view. I think the times where I've seen that go badly is when
people either kind of try and have their cake and eat it and sort
of say, oh, I know you said you can't do these two things, but
what if I told you you could. And then there's another problem,
where there is a trade-off and nobody makes the decision, and you
just end up in a situation where the team is all kind of looking
at each other being like, do we make the decision now, because we
don't think it's our choice, but I guess no one's telling us what
to do. And then somebody shouts at them afterwards because they
made the wrong decision.
Jason Baum: I think that leads us right to my next question,
which is: what can leaders do to make this mindset part of the
culture of the organization and encourage it across all teams?
Lisa Karlin Curtis: So I think the first thing is to look out
for anybody playing the hero. In all organizations that I've ever
seen, there is a group of people who take on more than their fair
share of the burden of dealing with incidents, and we could call
them heroes. And that is really
good until it's really bad. And it's good, because they're
probably very good at dealing with incidents, because they've
had a lot of practice. And they're often the people who've
been at the company for a long time. But it's bad because it
means that nobody else is learning how to do it. And so
all of those benefits that we talked about right at the start, no
one else is getting them. So they're kind of gatekeeping the
skill needed to debug these issues. And that's very problematic,
because it stunts other people's growth. And when that person
burns out, or when that person goes on holiday, or when that
person leaves the company, you're suddenly in a really bad
situation. So you end up with these really bad key-man
dependencies. And as a leader, I think it's really important to
identify those patterns. If you're using tooling, you can look at
who's answering, who's getting paged, who's taking your on-call
load; you can look at your incidents, who's leading them. You
know, have you got somebody who's leading 50% of your incidents?
That's probably not a good sign. And you can use that data to
find those people and then chat to them, like, why are you doing
that? And probably the answer is, well, because I think it's
useful. And that's, like, great, but now we're gonna have a
conversation about why we need to spread this load out across the
team. And it's a combination of protecting you and your mental
health, frankly, but also it's about spreading the knowledge and
spreading the experience. So I think that kind of pattern of
having those
superheroes is really damaging. And it restricts your ability to
scale your incident response. And then, as a leader, the other
thing you can do is encourage people to show their working. So if
you want people to learn from incidents, you need to make that
information available to them, and then you need to make it
accessible. By available, I mean, like, have your conversations
in a public Slack channel. Ideally, use some incident tooling so
that you can curate those conversations and build a timeline that
somebody can interact with; write a post-mortem, share the
post-mortem. And then by accessible, I mean try and make it
really easy for people to get that information. So have it in a
knowledge base that people can look at to find something that
they're interested in. It's like, step one, write the thing; but
if you just write a post-mortem that goes into a drawer, no one's
really won at that point. So push it out to people, and make it
clear to people that reading those materials is part of their job
and considered a very good use of their time. And that's a
difficult balance, because sometimes you need people to ship
stuff. But I think it's important to talk about this explicitly,
and talk about the fact that if you look at what other people did
in their incidents, you can build better software: you're gonna
have fewer incidents, or your incidents are going to be less
severe and easier to debug. And that then means that you'll be
able to sort of teach the next generation, who teach the next
generation, and you get this great positive feedback loop, if
everybody's talking about it and learning from each other, as
opposed to the negative feedback loop, where people are keeping
it very secret, where people are gatekeeping it, and where not
everyone is getting involved.
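One way to make that hero pattern visible, as a rough sketch: compute each person's share of incident leads from whatever your tooling exports, and flag anyone over a threshold. The record shape and threshold are assumptions for illustration, not any particular product's API.

```go
package insights

import "fmt"

// IncidentRecord is a minimal stand-in for an exported incident summary.
type IncidentRecord struct {
	ID   string
	Lead string // who led the response
}

// flagHeroes prints anyone who has led at least `threshold` (e.g. 0.5) of all
// incidents, as a prompt to spread the load and the learning around.
func flagHeroes(incidents []IncidentRecord, threshold float64) {
	if len(incidents) == 0 {
		return
	}
	counts := map[string]int{}
	for _, inc := range incidents {
		counts[inc.Lead]++
	}
	for lead, n := range counts {
		share := float64(n) / float64(len(incidents))
		if share >= threshold {
			fmt.Printf("%s has led %.0f%% of incidents; time to share that work\n", lead, share*100)
		}
	}
}
```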
Jason Baum: I feel like the word of the day is transparency. I
feel like that's pretty much what you're saying. Not to put a
word in your mouth, because I don't think you've said it
specifically, but what I'm hearing is transparency, transparency,
and transparency. As someone who works with the engineering team,
or, you know, I'm just relaying information from people to people
on a leadership team, for example, half the time with a deadline
it's because the leadership doesn't necessarily understand it.
You know, they don't necessarily know what the specific issue is,
what the specific reason is why a deadline isn't being hit. And
that's where I feel like the culture part can sometimes go awry,
right? We allow it to happen when there isn't transparency.
Lisa Karlin Curtis: Yeah, I think that when you lack
transparency, it gives humans a lot more remit to try and get
what they want through bad ways, basically. If you don't have
transparency, you can lie, and you can put forward an argument
that suits whatever you think your goal is. In an ideal world,
everybody in your organization has exactly the same goal and
they're all pulling in the same direction. But in reality, people
often view their goals as being slightly in conflict with other
people's goals, because they're trying to get more resources for
their team, because they think their problem is the most
important thing. And if you don't have transparency, it's very
difficult to hold people accountable to those things. Whereas if
you do, and if people are very honest and open, then the
organization can make the right choice for the organization. And
if you think about a sort of individual-first versus
organization-first type culture, ideally, as an organization, you
should be putting your chips on the thing that is most important,
not on the thing that has the best argument. And the way that you
don't make that mistake is to use transparency, and to be open
and honest, and to generate that culture, the culture of it being
okay to make mistakes, and also of it being okay to say, I don't
think this is so important, and knowing that that's not gonna
impact your career. And I think that's one of the reasons why
people don't do that, because there is this view that to get
promoted, or to get that job that you really want, to get
influence, you need to be the most important person, you need to
be doing the most important thing. And none of us are always
doing the most important thing. This week at work, I'm not doing
the most important thing, that is very clear, and that's kind of
fun. But it also means that if somebody else needs an extra pair
of hands, I'm going to jump on their thing; I'm not going to just
carry on with mine, because I want to look good. And that's the
sort of team-first thinking that I think you need to try and get
into your cultural DNA as an organization.
Jason Baum: I love that. I would love to hear, in a team update
with the company, this week I'm not working on the most important
thing. You know, I don't think we ever hear that, because
everyone does want to be the most important, I think. So, you
know, you said accountability, and I hadn't planned on asking
this question, but it does now trigger something in me. Incidents
are okay, and they are learning experiences; we just spent the
past 30 minutes talking about it. But how does accountability
play into this? Not that incidents are not okay, but when do we
need to hold accountability? Because I think there needs to be an
element of accountability. How does that play in, especially in a
blameless culture?
Lisa Karlin Curtis: I think it's a really difficult balance, and
I think I'd come back to the stuff I was saying about failing
together. So I think that, as a team, you take accountability for
what happens in your team. There are obviously occasions where
somebody goes rogue and does something that the team thinks was a
terrible idea; that's sort of an HR issue, frankly, and I think
it's very separate. But generally, you as a team have made some
choices, and you're now looking at the consequences of those
choices. I think that the way to hold teams accountable around
incidents is the same way that you hold teams accountable for any
kind of delivery. An incident is normally that something that
team maintains has broken in some way. And so that team has a
kind of agreement or a contract with the rest of the org that
they will have this service and it will do this stuff, and
sometimes it won't, because incidents happen and mistakes happen.
I think you make people accountable by making them transparent,
and by making them expose the trade-offs that they're making. So
as an example, if you're in a team which is under loads and loads
of pressure on time, and they're like, you've got to ship this
thing as quickly as you can, as quickly as you can, if you as the
senior engineer or
the tech leader are looking at them saying, cool, we'll do
that, but there's risk, and these are the things that might go
wrong if we do that; are we comfortable with that risk? Then
you're accountable. Because if something goes wrong, it's either,
yeah, I said these things could go wrong, and we talked about it,
and we decided we were okay with the risk; or something
completely different has gone wrong that is maybe significantly
more severe than the things that we thought might, and, right,
let's talk about why we thought this was safe, and why that
wasn't in our risk assessment. And so you're holding people
accountable for the trade-offs that they're making. You're not
holding people accountable for an individual thing that went
wrong. It's not, why did this happen? It's, why did you think
that we should take this risk, what pressure was put on you? And
let's look at it at a system level, as opposed to some person
pressed a button on a backfill and it made the database really
sad, and now we're going to run around screaming saying that they
should be fired. And I think the
other thing about accountability is it's about time. So I think
incidents are generally a lagging indicator, as opposed to
a leading indicator, if that's terminology people are familiar
with. A leading indicator basically means you find out pretty
quickly whether a choice you're making is good or bad, and a
lagging indicator is something where a choice that you make has
impact sometime in the future. And because incidents are a
lagging indicator, often the people handling the incidents are
not the people who made those trade-offs. And it's really
important to recognize when that is true, and to understand what
the root causes are, to have that kind of discussion, whether you
go down the five whys route or some other route, but to have a
really meaningful discussion about what were the choices we could
have made to avoid this, and why didn't we make them. Was it
because we had loads of pressure on delivery? Was it because no
one was thinking about it, and we just didn't think it was a
risk? And then that's the problem that you have to solve, and
those are the things that you can hold people accountable for, I
think.
Jason Baum: Awesome. Thank you for answering that one. I don't
know, it just came when you said accountability, it just popped
into my head, because I feel like all we've ever heard for years
was accountability, accountability, who's accountable for this,
who's accountable for that, all the accountability stuff that
people say. And yeah, it's hard to be blameless when that's what
the buzzword was before blameless. So now we're at the point of
the podcast where I like to ask sort of a fun question of you
personally, because this is the Humans of DevOps, and we're all
about the humans. So if there was one thing that you could be
remembered for, what would that be?
Lisa Karlin Curtis: I think I would like to be remembered as
someone who made systems work better for people.
Jason Baum: Great. I think that's, that's certainly
applicable for today's conversation. Well, thank you,
Lisa, so much for joining me today. It was an absolute
pleasure.
Lisa Karlin Curtis: Thanks so much. I really enjoyed it.
Jason Baum: And thank you for listening to this episode of the
Humans of DevOps podcast. I'm going to end this episode the same
way I always do, encouraging you to become a member of DevOps
Institute to get access to even more great resources just like
this one. Until next time, stay safe, stay healthy, and most of
all, stay human, live long and prosper.
Narrator: Thanks for listening to this episode of the Humans of
DevOps podcast. Don't forget to join our global community to get
access to even more great resources like this. Until next time,
remember, you are part of something bigger than yourself. You
belong.