
I am the author, and everyone I knew who was at Knight at the time had no power to fix anything. After all the trades happened, the quants and tech people couldn't really do anything to unwind them. That was up to the "business" side of the org, and even they had little real power, because they were essentially left looking for a deal to rescue a bankrupt company. Those people knew this was a company-destroying event; apparently it was a late night at work for them, and they were also quietly talking to recruiters.

At least one manager actually gave his whole team the afternoon off while senior management worked this out, anticipating that there would be no Knight by the end of the week.



In the movie WarGames, someone suggested "just turn the damn power off", and got a good answer on why that was "a bad idea"™.

I'm kinda wondering the same thing here, and can't think of a reason why not.


The power switches were in locked cages inside colos miles from the nearest Knight employee. The sysadmins could probably power off the boxes remotely, but the devs probably didn't have that access.

Yes, a "big red button" kill switch was the right answer, but they didn't have that, and it took time to work through several layers of corporate bureaucracy while losing $150,000 per second, time they didn't have.

Anyone with experience in the industry knows that, especially on the day of or following a software release, "if in doubt, kill and roll back". Being out of the market for even a few hours is fairly easy to come back from, even in places where you are legally required to be in the market for a given percentage of the time in order to qualify for transaction-tax breaks on your market-making trades.
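
As a rough illustration, a minimal kill switch has two jobs: atomically stop accepting new orders, and sweep-cancel whatever is still working at the exchange. The sketch below uses invented names and is not Knight's actual system:

    # Minimal sketch of a trading kill switch (hypothetical names): one
    # gate that every order-entry path checks, plus a sweep that cancels
    # anything still working at the exchange.
    import threading

    class OrderGateway:
        def __init__(self):
            self._lock = threading.Lock()
            self._trading_enabled = True
            self._open_orders = {}  # order_id -> order details

        def send_order(self, order_id, order):
            with self._lock:
                if not self._trading_enabled:
                    raise RuntimeError("rejected: kill switch engaged")
                self._open_orders[order_id] = order
                # ... transmit to the exchange here ...

        def kill(self):
            """The big red button: block new orders, cancel working ones."""
            with self._lock:
                self._trading_enabled = False  # reject everything from now on
                for order_id in list(self._open_orders):
                    self._cancel(order_id)

        def _cancel(self, order_id):
            # Stub: a real system sends a cancel to the exchange and waits
            # for the acknowledgement before dropping the order.
            self._open_orders.pop(order_id, None)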

That's why on that day my manager called me after hours to come in early the next morning and test our "big red button" kill switch before the market open.


Many years back, we had just moved our new AS/400 to a freshly built server room as one of those home shopping networks expanded into the rest of the building. It had a 'big red button', and one of the veeps touring the pre-power-up event really wanted to push the button. He did. What we quickly discovered was that the power line was connected to the entire call center's terminals too. Hit the button... and then all of our pagers started going off shortly thereafter.

The next discovery was that the fire sprinkler head above the room was not installed right when they turned on the water. Brand-new machine soaked in the brackish water.


Thanks, good answer.

I've been in other types of very stressful situations. Simple things like getting hold of the right person or the right tool can be astonishingly time-consuming in a tight spot.


> Being out of the market for even a few hours is fairly easily to come back from

Not if you’re dynamically hedging an options book.


Fair enough. At Goldman, I don't recall our options and equities trading systems simultaneously having outages, so I think we could always shed risk by reducing our options exposure if the auto-hedger was unable to delta-hedge in the equities market. I'm not actually sure if the relative independence of the options and equity execution systems was intentional.

I did some work connecting the options auto-hedger to the equity execution system, and certainly failures on the delta-one side prevented increasing exposure on the options side. "How long can we be out of the equities market and still be certain of meeting our options market-making obligations with the Hong Kong Exchange?" did come up a couple of times.

Depending on exactly where the outage was, there was potentially also the option of manually hedging the options book like in the bad old days. (Execution engines down, but the order management system and exchange connectivity still intact, would be one such scenario.)
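
To make the dependency concrete: the hedge for an options book is a share quantity derived from the options' deltas, recomputed as the spot moves, so losing equity execution means the book's delta drifts unmanaged. A toy Black-Scholes calculation (all numbers illustrative):

    # Toy delta-hedge calculation (illustrative numbers, Black-Scholes).
    from math import erf, log, sqrt

    def norm_cdf(x: float) -> float:
        return 0.5 * (1.0 + erf(x / sqrt(2.0)))

    def call_delta(spot, strike, rate, vol, t) -> float:
        """Black-Scholes delta of a European call."""
        d1 = (log(spot / strike) + (rate + 0.5 * vol ** 2) * t) / (vol * sqrt(t))
        return norm_cdf(d1)

    # Long 1,000 calls with a 100-share multiplier at delta ~0.53: staying
    # flat means being short ~53,000 shares, and every move in the spot
    # changes that number. No equity execution, no way to adjust the hedge.
    delta = call_delta(spot=100.0, strike=100.0, rate=0.01, vol=0.2, t=0.25)
    shares_to_short = round(1_000 * 100 * delta)
    print(f"per-option delta {delta:.2f} -> hedge: short {shares_to_short} shares")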


That's what always struck me about this story. I like to think that if I were in the room, we would've turned the system off at 9:31, when we knew it was working abnormally but didn't know why. Instead they let it run for over an hour while they tried to QA it. I've heard that Knight had no kill switch to stop all systems from trading, which seems like by far the biggest mistake they made.


As far as I know, Knight did not have a "clean" kill switch, but they did have the ability to shut everything down. If they killed trading, that meant killing some processes that would screw up their accounting, which would mean losing the rest of the trading day.

Firms I worked at after the fact tried to make sure the kill switch was non-disruptive, and actually did push it once or twice.
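
A sketch of the clean-vs-disruptive distinction, with invented process names: terminate only the order-generating processes and leave the accounting side running, so the day's books stay intact:

    # Hypothetical clean-kill script: stop the strategies, keep the books.
    # All process names here are invented for illustration.
    import subprocess

    STRATEGY_PROCS = ["smart_router", "market_maker"]    # generate orders
    ACCOUNTING_PROCS = ["position_server", "drop_copy"]  # deliberately untouched

    def clean_kill():
        """Stop order flow without corrupting the day's accounting."""
        for name in STRATEGY_PROCS:
            # SIGTERM lets each strategy flush state and exit cleanly;
            # check=False because a process may already be down.
            subprocess.run(["pkill", "-TERM", "-x", name], check=False)
        # ACCOUNTING_PROCS keep consuming exchange drop-copies, so positions
        # and P&L stay reconcilable and the firm can resume later in the day.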


Thanks for the write up. Always interesting to read different takes on it.

This is the first time I’ve read that the PowerPeg flag issue was specifically related to Protobuf definition changes. Other reports were vague on how that flag was expressed (I guessed it might have been C++ enum bit field reuse).
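
For what it's worth, the account in the SEC's order is that a flag formerly used to activate the retired PowerPeg code was repurposed for the new RLP routing, and the one server that missed the deploy still decoded it the old way. A toy sketch of that hazard, with an invented encoding:

    # Toy illustration of the flag-reuse hazard (invented encoding, not
    # Knight's actual protocol): the same wire value means different
    # things to servers running different releases.

    # Release N-1: flag value 1 activated the long-retired PowerPeg algo.
    OLD_DECODING = {0: "normal", 1: "PowerPeg"}

    # Release N: value 1 is repurposed for the new RLP order type.
    NEW_DECODING = {0: "normal", 1: "RLP"}

    def route(flag: int, decoding: dict) -> str:
        """Dispatch an incoming order based on its flag value."""
        return decoding[flag]

    # Seven servers run release N; one missed the deploy and runs N-1.
    # The same message is routed two different ways:
    assert route(1, NEW_DECODING) == "RLP"        # updated servers
    assert route(1, OLD_DECODING) == "PowerPeg"   # the straggler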



