Slide 1
How often are we repeating the same programming mistakes?
How often have you dealt with people who said “OK, the problem went away, so it’s solved, I can go back to whatever else I was doing”?
Slide 4
Slide 6
Yes, you have other things that need you, but maybe they can wait.
1) Honest ones that would have been hard to plan for
Have you ever watched the old US TV show ER?
Every time you question whether it should :)
I was young and conceited, and also good most of the time
Slide 12
But seriously, if you only remember the last 2 slides, you’ll be so far ahead of most people.
Slide 14
Fingers are somewhat expendable (well, the first one or two), but you only have two eyes. You can treat the second one as a backup or as necessary for stereo vision :)
Batteries, especially lithium ones, are mostly a slower-burning fire that is normally (but not always) redirected through wires
The cell envelope is critical to preventing fires
Slide 19
The battery was upgraded last minute before shipping
Slide 21
Let’s be honest, this is a complex multi-level failure
Every circuit has a fuse
You will make mistakes
Your baby is never ugly, at least not to you.
Slide 27
Slide 28
Slide 29
Revert first, discuss/point fingers later
No matter how good we are, we all make mistakes eventually
For (Sys)Ops, sending a plan of what you’re going to change and why works similarly.
When time matters (“We really need to submit this, and I’m putting my job on the line for it”), we have TBRs.
By now, most serious companies and projects use unit tests.
Catch regressions early to make them easier to find. I’ve seen too many problems that took so long to find after they were allowed in the tree due to lack of tests. Have tests and CI/CQ as soon as you can.
Hardware tests are often more expensive to scale, so tests are targeted accordingly
Anything that rolls out a text file (like /etc/aliases) should refuse to proceed if more than x% of it is being changed (this catches accidental partial deletes)
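A minimal sketch of such a guard, under stated assumptions: the 10% default threshold, the `check_rollout` and `changed_fraction` names, and the `RolloutRefused` exception are all hypothetical, chosen for illustration only. It measures how many of the old file's lines would be removed or altered and refuses when that fraction is too high.

```python
import difflib


class RolloutRefused(Exception):
    """Raised when a new file differs too much from the deployed one."""


def changed_fraction(old_lines, new_lines):
    """Return the fraction of old lines that are removed or altered."""
    if not old_lines:
        # Nothing deployed yet, so nothing can be accidentally deleted.
        return 0.0
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return (len(old_lines) - kept) / len(old_lines)


def check_rollout(old_text, new_text, max_fraction=0.10):
    """Refuse the rollout if more than max_fraction of the old file changes.

    max_fraction is an assumed default; the right value depends on the file.
    """
    fraction = changed_fraction(old_text.splitlines(), new_text.splitlines())
    if fraction > max_fraction:
        raise RolloutRefused(
            f"{fraction:.0%} of the file would change (limit {max_fraction:.0%})"
        )
    return fraction
```

A rollout script would call `check_rollout(current_contents, proposed_contents)` before writing anything, and require an explicit human override when it raises.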
https://landing.google.com/sre/sre-book/chapters/postmortem-culture/
We have a week per year when internal systems get to practice live failure and recovery scenarios (within reason)
mkdir -p -m 755 /usr/local/foo is safe, is it not?
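One way this can go wrong, and it is an assumption that this is the pitfall the slide has in mind: with `mkdir -p`, the `-m` mode applies only to the final directory, while any missing parents get permissions derived from the umask. The sketch below reproduces the parent-directory effect with Python's `os.makedirs` as a stand-in for the shell command (not identical: Python also masks the leaf's mode), then shows a hypothetical helper, `makedirs_exact`, that sets the mode on every directory it creates. It assumes a POSIX system.

```python
import os
import tempfile


def mode_of(path):
    """Return the permission bits (rwx for user/group/other) of path."""
    return os.stat(path).st_mode & 0o777


def makedirs_exact(path, mode):
    """Create path and missing parents, chmod-ing each created dir to mode."""
    missing = []
    current = os.path.abspath(path)
    while not os.path.isdir(current):
        missing.append(current)
        current = os.path.dirname(current)
    for directory in reversed(missing):
        os.mkdir(directory)
        os.chmod(directory, mode)  # chmod is not affected by the umask


def demo():
    """Compare parent-directory modes under a restrictive umask (077)."""
    old_umask = os.umask(0o077)
    try:
        with tempfile.TemporaryDirectory() as root:
            naive = os.path.join(root, "naive", "local", "foo")
            os.makedirs(naive, mode=0o755)
            naive_parent = mode_of(os.path.dirname(naive))

            explicit = os.path.join(root, "explicit", "local", "foo")
            makedirs_exact(explicit, 0o755)
            explicit_parent = mode_of(os.path.dirname(explicit))
            return naive_parent, explicit_parent
    finally:
        os.umask(old_umask)
```

Under umask 077 the naively created parent ends up as 0o700, so other users cannot traverse it, while the explicit version yields 0o755 on every level it created.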
https://landing.google.com/sre/sre-book/chapters/automation-at-google/#xref_automation_diskerase-sidebar
Google uses percent rollouts within datacenters and within the global network. They’ve always worked fine
Having to write it up makes it harder to wave off a mistake and repeat it later
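For readers unfamiliar with the idea, here is a rough sketch of a percent rollout. It is not Google's implementation: the stage percentages, the hash-based bucketing, and the `apply_change` and `is_healthy` callbacks are all hypothetical. Each host is assigned a stable bucket from 0 to 99, each stage widens the set of buckets that receive the change, and the rollout halts as soon as any touched host looks unhealthy.

```python
import hashlib


def bucket(host):
    """Map a host name to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def hosts_in_stage(hosts, percent):
    """Return the hosts whose bucket falls within the given percentage."""
    return [host for host in hosts if bucket(host) < percent]


def staged_rollout(hosts, stages, apply_change, is_healthy):
    """Apply a change stage by stage; stop at the first unhealthy stage.

    Returns (completed, touched_hosts). apply_change and is_healthy are
    caller-supplied callables taking a host name.
    """
    touched = set()
    for percent in stages:
        for host in hosts_in_stage(hosts, percent):
            if host not in touched:
                apply_change(host)
                touched.add(host)
        if any(not is_healthy(host) for host in touched):
            # Halt here; the caller should revert the touched hosts.
            return False, sorted(touched)
    return True, sorted(touched)
```

Because buckets are stable, a host included at 1% is also included at 10% and 50%, so each stage only adds hosts. A staged rollout limits the blast radius only if the health check can actually detect the failure in question.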
From aviation and diving: have a plan before you need it, because once you’re in the middle of problems, the bigger the problem, the more IQ you lose on average
Generally, a given person/company should not make the same mistake twice, or you’re really doing it wrong
Slide 46
Slide 47
Slide 48
I’m not a great pilot, I’m just an average one
Slide 50
If the fear of a bad review or writing a postmortem doesn’t do it
Slide 52
Slide 53
Tesla vs Waymo
I’m biased both ways. I work at Google, but I own a Tesla with AP3 since I can’t buy a Waymo car
Slide 56
3 pilots, only one is seasoned, the other two are pretty junior
Pilot pulls nose up so much that the plane enters a deep stall
One of too many Rolls-Royce engine failures for that plane due to shoddy manufacturing.
Slide 60
Slide 61
Mayday Air Crash Investigation is an addictive series https://en.wikipedia.org/wiki/List_of_Mayday_episodes
Older Boeings didn’t limit the pilot in any way
Engines had to be mounted higher up on the wing so as not to scrape the ground. Note that said engines were not designed to fit the 737
To save money, MCAS was rushed, deadlines “had to be met”
During certification, MCAS was eventually given so much control authority that pilots could not overpower it (trim moves the entire horizontal tail, while the elevator is only a tab on it; at higher speeds, the elevator cannot overpower the tail)
For profit, decided to fit an engine that didn’t really fit the airplane
How much would you pay those crucial engineers writing that crucial piece of software essential to the lives of everyone onboard?
Or you could just outsource this to India...
Pushed to fit an engine that wasn’t meant to fit the plane
They forced everyone to pretend the plane was unchanged
Slide 73
Slide 74
Slide 75
Boeing washed their hands of potential MCAS problems with “the pilots can handle it” (and they have to, within seconds)
Slide 77
Slide 78
The FAA washed their hands of all this complex computer stuff and allowed Boeing to “self-certify”. Doesn’t it sound like “mmmh, I can’t understand your code, how about you do your own code review?”
https://www.wsj.com/articles/the-four-second-catastrophe-how-boeing-doomed-the-737-max-11565966629
How the Boeing 737 Max Disaster Looks to a Software Developer: “Design shortcuts meant to make a new plane seem like an old, familiar one are to blame”
If you are interested in aviation, a few videos you can watch at home:
Those aviation talks amount to pilot pre-training in case of an unexpected emergency in the future
Undo whatever was done last
Learn from other people’s experiences/mistakes.
Slide 86