Slide 1

How often are we repeating the same programming mistakes?

How often have you dealt with people who said “ok, the problem went away, so it’s solved, I can go back to whatever else I was doing”?

Slide 4

Slide 6

Yes, you have other things that need you, but maybe they can wait.

1) Honest ones that would have been hard to plan for

Have you ever watched the old US TV show ER?

Every time you question whether it should :)

I was young and conceited, and also good most of the time

Slide 12

But seriously, if you only remember the last 2 slides, you’ll be so far ahead of most people.

Slide 14

Fingers are somewhat expendable (well, the first one or two), but you only have two eyes. You can treat the second one as a backup or a necessary one for stereo vision :)

Batteries, especially lithium ones, are mostly a slower-burning fire that is normally (but not always) redirected through wires

The cell envelope is critical to preventing fires

Slide 19

The battery was upgraded last minute before shipping

Slide 21

Let’s be honest, this is a complex, multi-level failure

Every circuit has a fuse

You will make mistakes

Your baby is never ugly, at least not to you.

Slide 27

Slide 28

Slide 29

Revert first, discuss/point fingers later

No matter how good we are, we all make mistakes eventually

For (Sys)Ops, sending a plan of what you’re going to change and why works similarly.

When time matters and it’s “We really need to submit this, and I’m putting my job on the line for it”, we have TBRs (“to be reviewed”: submit now, get the review afterwards).

By now, most serious companies and projects use unit tests.

Catch regressions early to make them easier to find. I’ve seen too many problems that took a long time to track down because a lack of tests let them into the tree. Have tests and CI/CQ as soon as you can.

Hardware tests are often more expensive to scale, so tests are targeted accordingly

Anything that rolls out a text file (like /etc/aliases) should refuse to proceed if more than x% of it is being changed (this catches accidental partial deletes).
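As a rough illustration of that guard, here is a minimal Python sketch. The function names (`changed_fraction`, `safe_to_roll_out`) and the 10% default threshold are made up for this example and are not taken from any real rollout tool. It uses the standard library’s `difflib` to count how many lines survive unchanged between the old and new versions.

```python
import difflib


def changed_fraction(old_lines, new_lines):
    """Fraction of lines that differ, relative to the longer of the two files."""
    total = max(len(old_lines), len(new_lines))
    if total == 0:
        return 0.0  # both files empty: nothing changed
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    # Sum the sizes of all blocks of lines that appear unchanged in both files.
    unchanged = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - unchanged / total


def safe_to_roll_out(old_text, new_text, max_changed=0.10):
    """Return False when more than max_changed of the file would change.

    max_changed is a hypothetical default; pick a value that fits the file.
    """
    fraction = changed_fraction(old_text.splitlines(), new_text.splitlines())
    return fraction <= max_changed
```

With a 20-line file, editing a single line passes the check, while a version accidentally truncated to 10 lines is refused. Note that under this definition a brand new file (empty old version) counts as 100% changed, so a first rollout would need an explicit override.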

https://landing.google.com/sre/sre-book/chapters/postmortem-culture/

We have a week per year when internal systems get to practice live failure and recovery scenarios (within reason)

mkdir -p -m 755 /usr/local/foo is safe, is it not?

https://landing.google.com/sre/sre-book/chapters/automation-at-google/#xref_automation_diskerase-sidebar

Google uses percent rollouts within datacenters and within the global network; they’ve always worked fine
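For readers who haven’t seen one, a percent rollout can be sketched roughly as below. This is not Google’s implementation; the function name `in_rollout`, the salt string, and the machine names are all hypothetical. Each target is hashed into a stable bucket, so raising the percentage only ever adds targets and never reshuffles the ones already included.

```python
import hashlib


def in_rollout(target_id: str, percent: float, salt: str = "example-rollout") -> bool:
    """Return True if target_id falls inside the first `percent` of buckets."""
    digest = hashlib.sha256(f"{salt}:{target_id}".encode("utf-8")).digest()
    # Map the hash to one of 10,000 buckets (0..9999), i.e. 0.01% granularity.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


# Hypothetical fleet: the same machines stay selected as the percentage grows.
machines = [f"machine-{i}" for i in range(1000)]
at_1 = {m for m in machines if in_rollout(m, 1)}
at_10 = {m for m in machines if in_rollout(m, 10)}
```

Because the bucket depends only on the salt and the target name, every machine picked at 1% is still picked at 10%, which makes it possible to watch a small group for problems before widening the change.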

If you have to write it down, it becomes harder to wave off a mistake and do it again later

From aviation and diving: have a plan before you need it, because once you’re in the middle of problems, the bigger the problem, the more IQ you lose on average

Generally, a given person/company should not make the same mistake twice, or you’re really doing it wrong

Slide 46

Slide 47

Slide 48

I’m not a great pilot, I’m just an average one

Slide 50

If the fear of a bad review or writing a postmortem doesn’t do it

Slide 52

Slide 53

Tesla vs Waymo

I’m biased both ways. I work at Google, but I own a Tesla with AP3 since I can’t buy a Waymo car

Slide 56

3 pilots, only one is seasoned, the other two are pretty junior

Pilot pulls nose up so much that the plane enters a deep stall

One of too many Rolls-Royce engine failures for that plane due to shoddy manufacturing.

Slide 60

Slide 61

Mayday Air Crash Investigation is an addictive series https://en.wikipedia.org/wiki/List_of_Mayday_episodes

Older Boeings didn’t limit the pilot in any way

Engines had to be mounted higher up on the wing so as not to scrape the ground. Note that said engines were not designed to fit the 737

To save money, MCAS was rushed; deadlines “had to be met”

During certification, MCAS was eventually given so much control authority that pilots could not overpower it (trim moves the entire horizontal tail, while the elevator is only a tab on it; at higher speeds, the elevator cannot overpower the tail)

For the sake of profits, they decided to fit an engine that didn’t really fit the airplane

How much would you pay those crucial engineers writing that crucial piece of software essential to the lives of everyone onboard?

Or you could just outsource this to India...

Pushed to fit an engine that wasn’t meant to fit the plane

They forced everyone to pretend the plane was unchanged

Slide 73

Slide 74

Slide 75

Boeing washed their hands of potential MCAS problems with “the pilots can handle it” (and they have to, within seconds)

Slide 77

Slide 78

The FAA washed their hands of all this complex computer stuff and allowed Boeing to “self certify”. Doesn’t it sound like “mmmh, I can’t understand your code, how about you do your own code review?”

https://www.wsj.com/articles/the-four-second-catastrophe-how-boeing-doomed-the-737-max-11565966629

How the Boeing 737 Max Disaster Looks to a Software Developer “Design shortcuts meant to make a new plane seem like an old, familiar one are to blame”

If you are interested in aviation, a few videos you can watch at home:

Those aviation talks are valuable pre-training for pilots, in case of an unexpected emergency in the future

Undo whatever was done last

Learn from other people’s experiences/mistakes.

Slide 86