
I am the author, and everyone I knew who was at Knight at the time had no power to fix anything. After all the trades happened, the quants and tech people couldn't really do anything to unwind them. That was up to the "business" side of the org, and even they had little real power, because they were essentially left looking for a deal to rescue a bankrupt company. Those people knew this was a company-destroying event; apparently it was a late night at work for them, and they were also quietly talking to recruiters.

At least one manager actually gave his whole team the afternoon off while senior management worked this out, anticipating that there would be no Knight by the end of the week.



In the movie WarGames, someone suggested "just turn the damn power off", and got a good answer on why that was "a bad idea"™.

I'm kinda wondering the same thing here, and can't think of a reason why not.


The power switches were in locked cages inside colos miles from the nearest Knight employee. The sysadmins could probably power off the boxes remotely, but the devs probably didn't have that access.

Yes, a "big red button" kill switch was the right answer, but they didn't have that, and it took time to work through several layers of corporate bureaucracy while losing $150,000 per second, time they didn't have.

Anyone with experience in the industry knows that, especially on the day of or following a software release, "if in doubt, kill and roll back". Being out of the market for even a few hours is fairly easy to come back from, even in places where you are legally required to be in the market for a given percentage of the time in order to qualify for transaction-tax breaks on your market-making trades.
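
As a rough illustration, a minimal kill switch has two jobs: atomically stop accepting new orders, and sweep-cancel whatever is still working at the exchange. The sketch below uses invented names and is not Knight's actual system:

    # Minimal sketch of a trading kill switch (hypothetical names): one
    # gate that every order-entry path checks, plus a sweep that cancels
    # anything still working at the exchange.
    import threading

    class OrderGateway:
        def __init__(self):
            self._lock = threading.Lock()
            self._trading_enabled = True
            self._open_orders = {}  # order_id -> order details

        def send_order(self, order_id, order):
            with self._lock:
                if not self._trading_enabled:
                    raise RuntimeError("rejected: kill switch engaged")
                self._open_orders[order_id] = order
                # ... transmit to the exchange here ...

        def kill(self):
            """The big red button: block new orders, cancel working ones."""
            with self._lock:
                self._trading_enabled = False  # reject everything from now on
                for order_id in list(self._open_orders):
                    self._cancel(order_id)

        def _cancel(self, order_id):
            # Stub: a real system sends a cancel to the exchange and waits
            # for the acknowledgement before dropping the order.
            self._open_orders.pop(order_id, None)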

That's why on that day my manager called me after hours to come in early the next morning and test our "big red button" kill switch before the market open.


Many years back, we had just moved our new AS/400 to a freshly built server room as one of those home shopping networks expanded into the rest of the building. It had a 'big red button', and one of the veeps touring the pre-power-up event really wanted to push the button. He did. What we quickly discovered was that the power line was connected to the entire call center's terminals too. Hit the button... and then all of our pagers started going off shortly thereafter.

The next discovery was that the fire sprinkler head above the room was not installed right when they turned on the water. Brand-new machine soaked in the brackish water.


Thanks, good answer.

I've been in other types of very stressful situations. Simple things like getting hold of the right person or the right tool can be astonishingly time-consuming in a tight spot.


> Being out of the market for even a few hours is fairly easily to come back from

Not if you’re dynamically hedging an options book.


Fair enough. At Goldman, I don't recall our options and equities trading systems simultaneously having outages, so I think we could always shed risk by reducing our options exposure if the auto-hedger was unable to delta-hedge in the equities market. I'm not actually sure if the relative independence of the options and equity execution systems was intentional.

I did some work connecting the options auto-hedger to the equity execution system, and certainly failures on the delta-one side prevented increasing exposure on the options side. "How long can we be out of the equities market and still be certain of meeting our options market-making obligations with the Hong Kong Exchange?" did come up a couple of times.

Depending on exactly where the outage was, there was potentially also the option of manually hedging the options book like in the bad old days. (Execution engines down, but the order management system and exchange connectivity still intact, would be one such scenario.)
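
To make the dependency concrete: the hedge for an options book is a share quantity derived from the options' deltas, recomputed as the spot moves, so losing equity execution means the book's delta drifts unmanaged. A toy Black-Scholes calculation (all numbers illustrative):

    # Toy delta-hedge calculation (illustrative numbers, Black-Scholes).
    from math import erf, log, sqrt

    def norm_cdf(x: float) -> float:
        return 0.5 * (1.0 + erf(x / sqrt(2.0)))

    def call_delta(spot, strike, rate, vol, t) -> float:
        """Black-Scholes delta of a European call."""
        d1 = (log(spot / strike) + (rate + 0.5 * vol ** 2) * t) / (vol * sqrt(t))
        return norm_cdf(d1)

    # Long 1,000 calls with a 100-share multiplier at delta ~0.53: staying
    # flat means being short ~53,000 shares, and every move in the spot
    # changes that number. No equity execution, no way to adjust the hedge.
    delta = call_delta(spot=100.0, strike=100.0, rate=0.01, vol=0.2, t=0.25)
    shares_to_short = round(1_000 * 100 * delta)
    print(f"per-option delta {delta:.2f} -> hedge: short {shares_to_short} shares")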


That's what always struck me about this story. I like to think that if I were in the room, we would've turned the system off at 9:31, when we knew it was working abnormally but didn't know why. Instead they let it run for over an hour while they tried to QA it. I've heard that Knight had no kill switch to stop all systems from trading, which seems like by far the biggest mistake they made.


As far as I know, Knight did not have a "clean" kill switch, but they did have the ability to shut everything down. If they killed trading, that meant killing some processes that would screw up their accounting, which would mean losing the rest of the trading day.

Firms I worked at after the fact tried to make sure the kill switch was non-disruptive, and actually did push it once or twice.
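
A sketch of the clean-vs-disruptive distinction, with invented process names: terminate only the order-generating processes and leave the accounting side running, so the day's books stay intact:

    # Hypothetical clean-kill script: stop the strategies, keep the books.
    # All process names here are invented for illustration.
    import subprocess

    STRATEGY_PROCS = ["smart_router", "market_maker"]    # generate orders
    ACCOUNTING_PROCS = ["position_server", "drop_copy"]  # deliberately untouched

    def clean_kill():
        """Stop order flow without corrupting the day's accounting."""
        for name in STRATEGY_PROCS:
            # SIGTERM lets each strategy flush state and exit cleanly;
            # check=False because a process may already be down.
            subprocess.run(["pkill", "-TERM", "-x", name], check=False)
        # ACCOUNTING_PROCS keep consuming exchange drop-copies, so positions
        # and P&L stay reconcilable and the firm can resume later in the day.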


Thanks for the write up. Always interesting to read different takes on it.

This is the first time I’ve read that the PowerPeg flag issue was specifically related to Protobuf definition changes. Other reports were vague on how that flag was expressed (I guessed it might have been C++ enum bit field reuse).
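
For what it's worth, the account in the SEC's order is that a flag formerly used to activate the retired PowerPeg code was repurposed for the new RLP routing, and the one server that missed the deploy still decoded it the old way. A toy sketch of that hazard, with an invented encoding:

    # Toy illustration of the flag-reuse hazard (invented encoding, not
    # Knight's actual protocol): the same wire value means different
    # things to servers running different releases.

    # Release N-1: flag value 1 activated the long-retired PowerPeg algo.
    OLD_DECODING = {0: "normal", 1: "PowerPeg"}

    # Release N: value 1 is repurposed for the new RLP order type.
    NEW_DECODING = {0: "normal", 1: "RLP"}

    def route(flag: int, decoding: dict) -> str:
        """Dispatch an incoming order based on its flag value."""
        return decoding[flag]

    # Seven servers run release N; one missed the deploy and runs N-1.
    # The same message is routed two different ways:
    assert route(1, NEW_DECODING) == "RLP"        # updated servers
    assert route(1, OLD_DECODING) == "PowerPeg"   # the straggler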



