The most expensive bug in history...
In 2012, the software development team at Knight Capital—the largest trader of U.S. stocks—got a call from their CEO, Thomas Joyce.
The New York Stock Exchange (NYSE) had just been granted approval to open its dark pool, the Retail Liquidity Program (RLP). A dark pool is a private trading venue that operates outside the public stock market. Some of the largest financial institutions had been increasingly conducting their business through dark pools, away from the public eye.
The Knight developers had 30 days to integrate with the new market, Joyce informed them. It would go live in exactly one month.
Joyce had been a vocal critic of the RLP, but now that it was moving forward, it was essential that Knight moved with it—as a market maker (a company that makes money on the difference between the buy and sell prices of stocks), allowing the flow of orders to move outside of Knight's purview would threaten their ability to generate profits.
With the clock counting down to 9:30 AM EDT on August 1, 2012 (when the RLP would accept its first orders), the developers got to work.
The integration would require making changes to Knight's trade execution system, including SMARS, its high-speed order router. SMARS had been around for ages—adapting it to send orders to the RLP would be an opportunity to remove some dead legacy code that hadn’t been touched since the early 2000s.
Work proceeded at a feverish pace; 30 days was not a lot of time to write, test, and deploy a complex integration with a new market. As the August 1 deadline loomed, the developers rushed to put finishing touches on the new SMARS router code.
The new system passed all its reviews, and was tested to confirm that it would be ready to process orders to the RLP when it opened for the first time. The plan was to deploy the new system behind a feature flag the week before the deadline; when the market opened on August 1, they'd simply turn it on.
At 9:30 AM EDT on August 1, the Knight developers did just that: they enabled the feature flag, and SMARS began to route orders through to the RLP—they were live!
But something was wrong. Their charts showed anomalous spikes in trading activity on the open markets. At 9:34 AM, the NYSE called. Knight was executing a lot of trades—so many, in fact, that trading volumes for the entire market were double their normal level.
To make matters worse, the trades they were making didn't make sense. SMARS appeared to be buying high and selling low. At the current rate, they were losing thousands of dollars per second.
Alerted to the problem, Knight's Chief Information Officer called the top operations engineers together to try to identify the root cause. The rogue orders seemed to be originating from the new RLP router code, but no one could pinpoint the bug.
20 minutes had screamed by since the market opened, and the unauthorized trades executed by SMARS already totaled well into the billions of dollars. It was time to roll back, and ask questions later. Re-deploying the last release would mean that RLP orders would have to wait for the day, but at least it would stop the bleeding.
With a shaky sense of relief, the operations team scrambled to check out the last known stable version of SMARS and deploy it to their 8 production servers.
To their horror, as soon as the router restarted, trading volumes on the NYSE spiked again: they were now executing even more trades than before.
At 9:58 AM, the Knight developers shut down SMARS entirely. It had been 8 minutes since rolling back the RLP code, and 28 minutes since the market opened.
They'd just lost their company $460 million.
The initial shock wave caused the company’s stock to immediately drop 33%. By the next day, 75% of Knight's equity value had been erased. Not long after, they were acquired for a fraction of their original value, and Knight Capital was no more.
But what actually happened?
When Knight's developers replaced the unused legacy code in SMARS, they repurposed a feature flag that had previously been used to activate it.
The deployment succeeded on 7 of their 8 servers, but the deploy to the 8th server failed silently, leaving that server running the old build with the legacy code intact. When they enabled the feature flag, 7 servers operated as expected; the 8th executed the legacy code, which should never have run in production.
Instead of re-deploying the new code to the 8th server, they decided to roll back to the last known good state. Unfortunately, they didn't know that the problem was the feature flag, and it didn’t cross their minds to turn it off. When the old system was re-deployed, every server began to run the legacy code, dramatically compounding their losses.
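The hazard can be sketched in a few lines of Python. None of these names come from Knight's actual codebase—this is just an illustration of how a repurposed flag means two different things depending on which build a server happens to be running:

```python
# Hypothetical sketch: the same feature flag, interpreted by two builds.
# All names (reuse_flag, route destinations) are invented for illustration.

def route_order_new(order, flags):
    """New build (7 of 8 servers): the repurposed flag routes to the RLP."""
    if flags.get("reuse_flag"):
        return "RLP"
    return "normal"

def route_order_old(order, flags):
    """Old build (the 8th server): the same flag activates retired legacy code."""
    if flags.get("reuse_flag"):
        return "legacy"  # dead code path, never meant to run in production
    return "normal"

# Launch day: the flag is flipped on fleet-wide.
flags = {"reuse_flag": True}
print([route_order_new("order-1", flags), route_order_old("order-1", flags)])
```

Seven servers do the right thing; one resurrects the dead path—and nothing about the flag itself reveals which behavior you'll get.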
Where did they go wrong?
I mean, the Knight Capital developers weren't stupid—they worked for one of the largest traders on the New York Stock Exchange, on a system that processed hundreds of millions of dollars in trading every day. Up until that fateful day, they'd done this successfully since 1995.
Most of us do not work on systems that could take down the U.S. financial system if we make a mistake. And yet, we’re not so different. It all comes down to software, and developing software is the same, regardless of whether it runs on AWS or a major stock exchange.
The Knight developers should never have allowed dead code to remain in their system for so long. Had they been more proactive, they could have easily avoided catastrophe. Reusing a feature flag was a dumb mistake that just shouldn't have been made. The developers weren't entirely to blame, though—if there's one certainty in life, it's that we will make mistakes.
In the end, it came down to process. Knight didn't have a good code review process, which could have caught the design issues with the RLP changes before they were approved. When they deployed SMARS, they didn't have an automated deployment pipeline, instead relying on their engineers to manually deploy the new code; as a result, they missed an important step on the fated 8th server. When the first mistake led to a crisis situation, their monitoring was inadequate, and they didn't have documented incident response procedures which could have prevented them from making an even worse mistake under pressure.
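One thing an automated pipeline buys you is a post-deploy consistency check: after pushing a release, verify that every server is actually running it before declaring success. A minimal sketch in Python (the hostnames and version strings here are invented, and a real pipeline would probe each host over SSH or an API rather than a dict):

```python
# Hypothetical post-deploy check: find servers NOT running the expected
# release, so a silently failed deploy blocks the rollout instead of
# surfacing in production.

def check_fleet_consistency(servers, get_deployed_version, expected):
    """Return the list of servers whose deployed version != expected."""
    return [host for host in servers
            if get_deployed_version(host) != expected]

# Simulated fleet: one server silently kept the old release.
deployed = {f"smars-{n}": "v2.0" for n in range(1, 8)}
deployed["smars-8"] = "v1.9"  # the deploy to this host failed silently

stale = check_fleet_consistency(deployed, deployed.get, "v2.0")
print(stale)  # -> ['smars-8']: hold the release until this list is empty
```

A check this simple, run automatically after every deploy, would have flagged Knight's 8th server before the market opened.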
As full stack developers, it's our job to care about software delivery from start to finish, and that includes deployment, monitoring, and fixing it in production when it breaks (and it will break). That's what DevOps is all about.
Co-founder, ⚡ Honeybadger.io