Mistakes were made

Ali Zaidi
Senior Software Engineer

I really enjoyed the first season of Westworld. It touched on so many themes, with complex storylines masterfully woven into a cohesive narrative. One quote that stuck with me, though, was this:

“Evolution forged the entirety of sentient life on this planet using only one tool: the mistake.”

Let me put some context around this. When we make a mistake, whether from diverted attention, inexperience or a lapse in judgement, our first instinct is a negative one. That is because, despite the stark truth of the quote above, we as a people are conditioned to see success as good and failure as bad. No one wants to make a mistake or, at an intrinsic level, to admit to one. And that can be disastrous.

To illustrate, I’ll share a story that has been discussed in my circles. A friend was employed as an infrastructure engineer at a company whose infrastructure was hosted entirely on AWS. One fateful day, as he worked on some housekeeping automation, he noticed a billing alarm come through on the queue. Then another. Three more. And they kept coming. He quickly triaged the alerts and found that hundreds of instances had been launched on their public cloud account. In every available region of the globe.

Horrified, he immediately made his manager aware of this, looped in his teammates and began damage control. What they found was that a key belonging to one of their IAM users had been compromised. This user had administrator privileges on EC2, and the attacker was able to programmatically launch hundreds of Spot Instances in every availability zone of every available region. All instances were of the highest performance tier, so every hour they ran was additional cost for his company. This is without even factoring in what was being done on the instances, as that could potentially have been damaging on a much larger scale.

They deactivated the IAM user’s credentials, locked its permissions down to the bare minimum and simultaneously took the instances down. All in all, it was a several-thousand-dollar hit. Once the situation was stabilized and no more instances were being spun up, root cause analysis revealed how the IAM user had been compromised.
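
The containment step, switching off the compromised user’s keys, could be scripted roughly as follows. This is a minimal sketch and not what the team actually ran: the helper name `deactivate_user_keys` is hypothetical, and `iam_client` is assumed to be a boto3 IAM client (`boto3.client("iam")`), whose `list_access_keys` and `update_access_key` calls it relies on.

```python
def deactivate_user_keys(iam_client, user_name):
    """Mark every active access key of an IAM user as Inactive.

    iam_client is assumed to be a boto3 IAM client. It is passed in rather
    than created here so the function can be exercised with a stub.
    Pagination is ignored for brevity in this sketch.
    """
    deactivated = []
    response = iam_client.list_access_keys(UserName=user_name)
    for key in response["AccessKeyMetadata"]:
        if key["Status"] == "Active":
            # Setting the status to Inactive blocks further API calls
            # signed with this key without deleting the key record.
            iam_client.update_access_key(
                UserName=user_name,
                AccessKeyId=key["AccessKeyId"],
                Status="Inactive",
            )
            deactivated.append(key["AccessKeyId"])
    return deactivated
```

Deactivating rather than deleting is a design choice in this sketch: the key ID stays on record, which may help when tracing what it was used for afterwards.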

One of the developers had mistakenly committed and pushed code containing that user’s key to his personal public repository instead of the company’s private one. He had no idea that he was pushing to the wrong repository.

Was he at fault for (accidentally) exposing to the world the company’s intellectual property and a key that could wreak havoc on their infrastructure? Sure – but there’s a glaring problem in all of this. The key should not have been in the code in the first place.
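
One common way to keep a key out of the code is to read it from the environment at run time. The sketch below is illustrative only: the function and exception names are hypothetical, while `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the standard environment variable names that AWS tooling recognizes.

```python
import os


class MissingCredentialsError(RuntimeError):
    """Raised when the expected environment variables are not set."""


def load_aws_credentials(env=None):
    """Read AWS credentials from the environment instead of source code.

    env defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    names = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    # Treat unset and empty values alike: both mean "no credential".
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise MissingCredentialsError("missing: " + ", ".join(missing))
    return {name: env[name] for name in names}
```

In practice the AWS SDKs read these variables on their own, so application code often needs no credential handling at all. The point of the sketch is simply that nothing secret lives in the repository, so pushing to the wrong remote exposes code but not keys.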

Of course, once the dust settled, lessons were learned: his company came away with a much stronger code review process, a closer working relationship between the development and infrastructure organizations, higher resolution and more intelligent alerting for operations, strict adherence to separation of duties and secure credential management. To top it off, AWS was gracious enough to waive the amount incurred during this episode from their bill as an act of good faith. All’s well that ends well, as the saying goes.
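
A stronger review process can be backed by an automated check for strings that look like credentials before code leaves a developer’s machine. The following is a rough sketch, not the process his company adopted: the function name is hypothetical, and the pattern is a heuristic based on the widely observed format of long-term AWS access key IDs ("AKIA" followed by 16 uppercase letters or digits).

```python
import re

# Heuristic only: matches the common shape of a long-term access key ID.
# It will miss other secret formats and may flag harmless look-alikes.
ACCESS_KEY_ID_PATTERN = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_suspected_keys(text):
    """Return (line_number, match) pairs for strings that look like key IDs."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in ACCESS_KEY_ID_PATTERN.finditer(line):
            findings.append((line_number, match.group(0)))
    return findings


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
    sample = 'region = "us-east-1"\naws_key = "AKIAIOSFODNN7EXAMPLE"\n'
    for line_number, value in find_suspected_keys(sample):
        print(f"line {line_number}: possible access key ID {value}")
```

A check like this could run from a pre-commit hook or a CI job against changed files. Dedicated secret-scanning tools cover far more formats, so treat this as an illustration of the idea rather than a replacement for them.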

Ultimately, the most important part of a mistake is the aftermath and how it is dealt with: how we rectify it, how we design redundancies and fail-safes around it, how we train ourselves not to repeat it and, perhaps most importantly, how we refuse to let it demoralize us, seeing and using it as a tool, not a curse.

Do you have stories of things not going according to plan in your technical organization and lessons learned? We’d love to hear from you in the comments section below.

Comments

  • Martin McGreal

    I had a boss who used to tell us if we never made mistakes, we just weren’t challenging ourselves enough. Like you said, it’s all about how you recover and learn from those mistakes. Great post!
