Reminds me of this blunder: a trader accidentally swaps price and quantity, and instead of selling 1 contract at ¥610,000 manages to send an order to sell 610,000 contracts at ¥1. The order passes through the GUI, the limit checker, and several dozen systems like a knife through cheese and is placed on the exchange. The exchange happily accepts the order and mayhem ensues.
If it had all been executed, the loss would have been more than $3B (billions!), heck, almost $4 billion. Eventually the company settled for about $300M (millions).
So Knight Capital isn't alone in this "hall of fame" :)
I implemented a slippage warning system in a trading GUI I was in charge of after exactly this scenario happened once: a trader switching price and quantity and temporarily cratering a market. It would show a second order confirmation screen if your order was going to fill with high slippage, and it made you type the words "SHOOT ME" into a text field to send the order. After we had built it, it seemed so obvious to have that kind of sanity checking.
It makes even more sense for the matching engine to disallow this on the back end, though.
I worked on a compliance add-on that blocked institutional traders based on rules they set for themselves. All day long we fielded their urgent support requests complaining that the rule was wrong. Massive amounts of time were spent finding the market data or computing their account value at the specific time the trade was blocked, 90% of the time arriving at the conclusion that the product worked as intended. I complained that I wasn't getting to code enough, and was told to "code on the train [while riding to work]".
One time, a client wanted a rule to block a trade if the price exceeded the daily high/low. Well, when you define the limit this way, you pretty much can't trade right at the opening, because many/most price ticks are the highest/lowest of the day SO FAR, the day being only a few seconds old! The customer had a meltdown; we traced the market data back and realized that, yeah, it worked exactly as designed. Sigh.
> As the senior trader at Æxecor, Brad made it very clear that no one — “not even His Holiness, the Pope” — shall question his trades. After all, Brad makes complex trading decisions that no one else could possibly comprehend.
According to what I have heard at my former employer that supplied the trading platform to Mizuho, there was indeed a warning, and the user dismissed it. This was relevant when Mizuho tried to recover some of its losses from us.
It shouldn't require price history. You just need the order book and you can simulate the execution of any order, and figure out its average price.
If the average price is X basis points worse than the current top of book, that's slippage. So, e.g., if the highest bid in the book is $100 and you are entering a sell order that would eat so much of the book that it would fill at an average price of $70, that's 30% slippage and probably not what you meant to do.
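For what it's worth, a minimal sketch of that kind of check: walk the book, compute the average fill price, and compare it to the top of book. The function name, threshold, and book data below are purely illustrative.

    def estimated_slippage_bps(levels, order_qty):
        """levels: list of (price, qty) on the side we'd hit, best price first."""
        filled = 0
        cost = 0.0
        for price, qty in levels:
            take = min(qty, order_qty - filled)
            filled += take
            cost += take * price
            if filled == order_qty:
                break
        if filled == 0:
            return None  # empty book: nothing to compare against
        avg_price = cost / filled
        top_of_book = levels[0][0]
        return abs(avg_price - top_of_book) / top_of_book * 1e4  # basis points

    # The example above: best bid $100, but a large sell order would sweep
    # the book and fill at an average of $70, i.e. 3000 bps (30%) slippage.
    bids = [(100.0, 10), (90.0, 20), (60.0, 70)]
    if estimated_slippage_bps(bids, 100) > 500:  # e.g. warn above 5%
        print('High slippage: type "SHOOT ME" to confirm')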
You need to have a book in the first place, though. If the instrument is highly illiquid, the spread might be huge and the prices may have little real-world relevance.
For liquid books, yes, definitely, I would expect this sort of check (typically against historical prices) to be in place.
I see the "nanoseconds counted so they didn't use protobufs" note, but in case you do you protobufs and want to make sure this never happens to you, I heartily recommend using the "reserved" keyword in your protos whenever you remove a field. Reserving a number is a note to the proto compiler that says "I will not use this number again, and please generate an error if I foolishly later try."
There are several protobuf linters out there, but what I haven't seen is a protobuf linter that integrates with Git to look back through history and verify that you haven't accidentally changed protobuf numbers, reused fields, etc. Would be handy.
I once made a similar mistake. It ended up being reported in the Economist.
We had developed the central trading system for a major country, which conducted its primary bill and bond issuance through our platform. Approximately once a week, in the mornings, banks would place orders for government bills, specifying a price and the volume they wished to purchase at that price. They submitted multiple orders, indicating a willingness to buy more at higher yields.
Once all orders were submitted, the instrument would be suspended, and a scaled order report generated. The central bank would then retreat to a room to analyze the data, determining how much they could borrow and at what rate. For example, they might be able to borrow a billion at 10% interest, but only 500 million at 8%.
After settling on a price, they would issue a large sell order to match the existing orders, leaving any unmatched orders in the book. We would then restart the instrument in a continuous trading mode, allowing banks to continue buying and selling the new series.
I was tasked with enhancing the instrument suspension feature. The enhancement involved adding a flag and a button to the GUI, allowing for the optional withdrawal of orders when suspending an instrument. The seemingly simple protocol change required adding a flag byte, either 0 or 1, at the end of the message. After incorporating this feature into both the system and the UI, and following extensive testing, we deployed it.
However, during the next auction, an issue arose. When the operators suspended the market, all orders were mistakenly withdrawn. To avoid the embarrassment of asking participants to resubmit their orders, they announced the auction's cancellation due to poor participation and unfavorable pricing.
The root of the problem was discovered later. Operations had been using a command-line tool to suspend the market, a tool I wasn't even aware of. Because the suspension was issued from the command line just before the auction price report was generated, and the command-line tool's message effectively left a random value in the withdrawal flag's place, all the orders were withdrawn.
So, the system had been accepting malformed messages that were too long? Did the cli tool get fixed and the system get better sanity checks on incoming messages?
The command-line tool generated a message using the old format; it was one byte short (missing the 'withdraw on suspend' flag). By the time the message got to the matching engine, the engine looked at the byte at the end where the withdraw flag should be, and it randomly happened to be true.
The fix, at the time, was to add an 'N' to the end of the message the command-line tool generated.
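A minimal sketch of the kind of length check the decoder could have done instead of trusting whatever byte happened to follow; the field layout and sizes here are made up for illustration:

    SUSPEND_BASE_LEN = 9  # hypothetical: 1-byte msg type + 8-byte instrument id

    def decode_suspend(buf: bytes):
        if len(buf) < SUSPEND_BASE_LEN:
            raise ValueError("truncated suspend message")
        instrument = buf[1:9]
        if len(buf) == SUSPEND_BASE_LEN:
            # Old-format sender (like the CLI tool): no flag byte present,
            # so default to the old behaviour instead of reading past the end.
            withdraw_orders = False
        elif len(buf) == SUSPEND_BASE_LEN + 1:
            withdraw_orders = buf[SUSPEND_BASE_LEN] != 0
        else:
            raise ValueError("unexpected suspend message length %d" % len(buf))
        return instrument, withdraw_orders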
Later I rewrote everything to use a compiler-generated message encoder and decoder, SNACC ASN.1 (something like protobufs, as invented again by Google, which was also invented the first time by Sun as XDR/RPC, which was also invented for the first time as CORBA. I think MS invented it as well, for COM).
Interesting analysis. You can easily imagine each bit of sloppiness and oops happening in many shops. Even many instances of sloppiness combining into an emergent, worse effect isn't unusual. Much less common is for the company to be wiped out by it.
> At 10:15, the kill switch was flipped, stopping the company’s trading operations for the day. By early afternoon, many of Knight Capital’s employees had already sent out resumes,
Was the patient obviously dead in those first few hours, or had people written it off prematurely when they might've been called to help perform CPR?
Maybe there are two problems here: an insufficient culture of diligence, and an insufficient culture of loyalty.
I am the author, and everyone I knew who was at Knight at the time had no power to fix anything. After all the trades happened, the quants and tech people couldn't really do anything to unwind them. That was up to the "business" side of the org, who still had no power because they were essentially left looking for a deal to rescue a bankrupt company. Those people actually knew that this was a company-destroying event, and apparently it was a late night for them at work and they were also quietly talking to recruiters.
At least one manager actually gave his whole team the afternoon off while senior management worked this out, anticipating that there would be no Knight by the end of the week.
The power switches were in locked cages inside colos miles from the nearest Knight employee. The sysadmins could probably power off the boxes remotely, but the devs probably didn't have that access.
Yes, a "big red button" kill switch was the right answer, but they didn't have that, and it took time to work through several layers of corporate bureaucracy while losing $150,000 per second, time they didn't have.
Anyone with experience in the industry knows that, especially the day of/following a software release, "if in doubt, kill and roll back". Being out of the market for even a few hours is fairly easy to come back from, even in places where you are legally required to be in the market for a given percentage of the time in order to qualify for transaction-tax breaks on your market-making trades.
That's why, on that day, my manager called me after hours to come in early the next morning and test our "big red button" kill switch before the market open.
Many years back, we had just moved our new AS400 to a freshly built server room as one of those home shopping networks expanded into the rest of the building. The room had a 'big red button', and one of the veeps touring the pre-power-up event really wanted to push it. He did. What we quickly discovered was that the power line was also connected to the entire call center's terminals. Hit the button... and all of our pagers started going off shortly thereafter.
The next discovery, when they turned on the water, was that the fire extinguisher head above the room was not installed right. Brand-new machine, soaked in brackish water.
I've been in other types of very stressful situations. Simple things like getting hold of the right person or the right tool can be astonishingly time consuming in a tight spot.
Fair enough. At Goldman, I don't recall our options and equities trading systems simultaneously having outages, so I think we could always shed risk by reducing our options exposure if the auto-hedger was unable to delta-hedge in the equities market. I'm not actually sure if the relative independence of the options and equity execution systems was intentional.
I did some work with connecting the options auto-hedger to the equity execution system, and certainly failures on the delta-one side prevented increasing exposure on the options side. "How long can we be out of the equities market and still be certain of meeting our options market-making obligations with the Hong Kong Exchange?" did come up a couple of times.
Depending on exactly where the outage was, there was potentially also the option of manually hedging the options book like the bad-old days. (Execution engines failed, but order management system and exchange connectivity still intact would be one such scenario.)
That's what always struck me about this story. I like to think that if I was in the room we would've turned the system off at 9:31 when we knew it was working abnormally but didn't know why. Instead they let it run for over an hour while they tried to QA it. I've heard that Knight had no kill switch to stop all systems from trading, which seems like by far the biggest mistake they made.
As far as I know, Knight did not have a "clean" kill switch, but they did have the ability to shut everything down. If they killed trading, that meant killing some processes that would screw up their accounting, which would mean losing the rest of the trading day.
Firms I worked at after the fact tried to make sure the kill switch was non-disruptive, and actually did push it once or twice.
Thanks for the write up. Always interesting to read different takes on it.
This is the first time I’ve read that the PowerPeg flag issue was specifically related to Protobuf definition changes. Other reports were vague on how that flag was expressed (I guessed it might have been C++ enum bit field reuse).
Loyalty had nothing to do with this. The company lost $400 million in about 45 minutes. That's roughly $9 million a minute, or $150,000 a second.
I work as a quant and remember that morning vividly. It was as if trucks of free money were falling from the sky to the point that many of us were skeptical that these were actual trades; we were convinced that this was some kind of bug at the exchange and these trades would get broken.
Mistakes like this do happen, in fact it's not thaaaat uncommon. What made this situation so unusual is that it just kept going and going.
Quants and HFT developers are among the best-compensated people on Earth. I don't think that someone who's being paid over a million dollars a year to sit in a cubicle and write code should use the same rhetoric as 19th-century factory workers.
A lot of quants are objectively paid very well but not that well (at least pre layoffs) compared to some people who do comparatively little work in tech.
Then you don't understand the relationship between capital, labour and being an employee vs employer. Yes, they are well-compensated, but they can still have their lives turned upside down by bosses. Granted, their life getting turned upside down probably means less front-row Knicks tickets and not choosing between paying a water bill and a power bill.
For example: a company that needs loyalty, but then does things like the layoffs we've seen recently (or earlier behavior consistent with that thinking), is creating insufficient culture of loyalty.
> Maybe there are two problems here: an insufficient culture of diligence
I know nothing about Knight, but I will say that traditional finance often has no ability to think about systems particularly well, so the local optimum is usually a rickety shack that has a lot of smart people working on it, and some less smart people making sure problems are being seen to be fixed, but the overall architecture can be extremely brittle or nonexistent. It's not like (good) tech companies.
If you want to see what it looked like at the tick scale, take a look here: http://www.nanex.net/aqck2/3522.html
PS: Anyone know of any other sites/places that do a comparable level of research that's open to the public?
In my opinion the root cause is pretty clear: they had a network protocol update that was not backwards compatible and didn’t verify the runtime versions or have versioning.
Everything else isn’t really core to the issue [even that the update script silently failed].
I was working on trading systems at Goldman in NYC at the time. After hours on that day, I got a call from my manager to come in early the next day to ensure our kill switches worked properly, that our release and review processes were sufficient, and that our monitoring systems were sufficient.
A few years later, I was working on trading systems at Goldman in Hong Kong. I sent a change out for review, went out for dinner and drinks with a colleague visiting from Tokyo, and swung by the office on my way home. My change had been approved by my NYC colleagues, so I merged it and went to bed. The next morning, I woke up to news that Goldman had taken a 100 million USD trading loss due to a software bug.
Edit: This was in Goldman's Slang language, where source code is loaded from a globally-distributed eventually-consistent NoSQL DB. Most applications execute from read-only DB snapshots after extensive release testing. However, as soon as you merge your change, it's potentially instantly running in production somewhere in the world by some team you might not even know exists. It was possible my merge, maybe 45 minutes before the NYC market open, had gotten picked up by the errant system.
I spent a while convincing myself that there was no way my change was the cause, and realized my phone would be ringing off the hook had my change been the cause.
The guy who made the software change (let's call him Zaphod Beeblebrox since that's clearly not his name), and the guy who approved it, were both put on leave before I woke up. I found out who made the change only because I had an open chat window with Zaphod, and through several rounds of "fifth quartile" annual layoffs, knew how the chat and email systems responded when accounts got locked out. The chat system showed Zaphod's location as unknown, and a test email to him came back with the "mailbox full" message for a locked account. I walked over to the desk of one of the senior Equity Options Flow Strats in Hong Kong, and whispered "So... Zaphod Beeblebrox", and the Strat's face lit up and he whispered back "How did you know?", to which I responded "I didn't until I saw your reaction".
The guy who made the 100 million mistake was actually very, very good at his job. He caught quite a few subtle bugs in other people's code that he wasn't even asked to review, but was reviewing out of curiosity. But he was working late under time pressure, didn't test his change properly, and you only have to slip up once.
As I remember, many of the trades were broken by the exchange, and the total loss came out to about 28 million USD.
On the one hand, the guy didn't deserve to get fired, and I'd totally hire him for my team. On the other hand, if someone cuts corners and that results in tens of millions of USD in losses and doesn't get fired, that's very demotivating for everyone else at the firm. They did a very good job about not naming and shaming.
After waking up to being momentarily scared I had made a 100 million USD error, I don't merge changes after-hours any more, and certainly never after having consumed any alcohol. If a guy like Zaphod can lose 28 million USD from a tired merge, so can I.
Zaphod, if you're reading this and ever looking for a job, give me a ring.
Great story. The part that sticks out to me is 72% (the discount that GS got due to busted trades) and 0% (the discount Knight got due to busted trades.) If you're going to eff up, first make sure you're a big player!
In Goldman's case, Goldman was literally sending out options orders with an ask price of $0. I don't recall if it was the exchange or regulators that decided "If you bought below $x, you knew you were trading against a broken algorithm and shouldn't have expected the trade to last".
I'm not sure if any of the orders Knight was sending out were clearly so erroneous. It's also possible that only after the Knight incident is when it was made clear to market participants that they should expect clearly erroneous trades to be broken.
In any case, Knight was a major liquidity provider, and it wasn't in the market's best interest for them to go bankrupt, but it also sets a very bad precedent if plausible orders get broken.
As far as I know, Knight was unique in that its orders were obviously stupid, but not obviously mispriced or mis-sized. It's not a case of a clear fat-finger error that would be visible to other market participants.
I know that in many markets, the exchange will reverse your trades if the counterparty made an obvious, visible error (eg $0 ask price on a limit order).
> the flag word was out of new bits for flags, so an engineer reused a bit from a deprecated flag
Plus some silent update failures meant that new-feature orders sent to out-of-date servers transformed into old-feature orders. Boom!
> Adding risk checks to the last stage of an order’s life became universal in the industry
I wonder what these "fast-twitch" sanity checks / circuit breakers look like. Whenever I try to model risk, things get complicated quickly -- but presumably simple heuristics must exist if they became universal in the industry.
What became universal in the industry is an item on the checklist saying that you have safety checks to prevent this. How seriously that item is taken, I suspect, varies quite a bit.
In my experience, you aim for multiple extremely simple checks, with minimal logic and minimal calculation, so you can have confidence they won't have surprising behaviour in an unusual situation like this.
The classic example is an order count limit - initialise a counter to some value at startup, and every time the machine sends an order, try to decrement it. When it hits zero, it can't send orders any more. Just throw an exception or return early or something. You display the value of the counter to human operators, and give them a button which resets it to the initial value. In normal operation, you are sending orders at a steady trickle, and humans will have to press the button every now and then. If something goes insanely wrong, as here, the counter will run down quickly, and then the humans hopefully won't push the button, because something is obviously wrong. It's a very crude safety, but it is a simple one.
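A minimal sketch of that order-counter safety, leaving out thread safety, persistence, and the operator UI; the names and the initial budget are illustrative:

    class OrderBudget:
        def __init__(self, initial=500):
            self.initial = initial
            self.remaining = initial

        def try_send(self, send_fn, order):
            if self.remaining <= 0:
                raise RuntimeError("order budget exhausted; needs a human reset")
            self.remaining -= 1
            send_fn(order)

        def operator_reset(self):
            # Wired to the button the humans press every now and then.
            self.remaining = self.initial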
Another is a limit on message rate. You could use a token bucket filter. Does not affect normal operation, but stops a machine which is spraying out excessive orders. You could have it so that if the bucket runs out, it turns off until a human explicitly turns it back on.
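A sketch of such a token-bucket limiter that latches off when the bucket empties, with illustrative parameters:

    import time

    class MessageRateLimit:
        def __init__(self, rate_per_sec=50, burst=200):
            self.rate = rate_per_sec
            self.capacity = burst
            self.tokens = float(burst)
            self.last = time.monotonic()
            self.tripped = False  # stays off until a human re-arms it

        def allow(self):
            if self.tripped:
                return False
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens < 1:
                self.tripped = True  # latch off; something is spraying orders
                return False
            self.tokens -= 1
            return True

        def rearm(self):
            self.tokens = float(self.capacity)
            self.tripped = False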
You have limits on net position too, to stop you running up huge positions in anything, but those are higher-level, and not quite the same kind of last-ditch safety check.
I don't really know that either of these would have helped in Knight Capital's situation, because the precise mechanics of the "power peg" aren't clear. It sounds like a kind of explicitly-managed iceberg order, which these safeties would have caught. But another writeup [1] says it was a testing tool, not intended to be used on a real exchange at all, in which case who knows.
(Author) As far as I know, Power Peg was indeed intended to essentially be a manual iceberg order from the time before that was an order type on the exchange (with slightly different semantics).
Rereading the source you quoted, it definitely wasn't a "buy high sell low" system, even if it was never used in prod.
> I wonder what these "fast-twitch" sanity checks / circuit breakers look like. Whenever I try to model risk, things get complicated quickly -- but presumably simple heuristics must exist if they became universal in the industry.
You're right, they are very simple. Think things like orders per second, quantity of order, price of order, notional ordered over time, etc. You basically want to ensure things aren't "too big" or "too fast" as simply as possible.
Other types of risk (portfolio risk, greek risk, etc.) are handled in different ways, upstream of these final checks.
- Do we have accounting for what trading strategy generated this order?
- Will this order immediately lose us money? (e.g. are we buying out-of-the-money options)
- Did we accidentally set the PhysicallyDeliver flag?
- Have we hit our organization's margin limits?
Any situation that no reasonable trading strategy would put you in, or that would otherwise be outright illegal, is a good thing to put in risk checks for.
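A minimal sketch of a few such checks, with hard limits and no modelling; the thresholds, field names, and exception are illustrative assumptions rather than anyone's real values:

    class OrderRejected(Exception):
        pass

    MAX_QTY = 10_000
    MAX_NOTIONAL = 1_000_000.0
    PRICE_BAND = 0.10  # reject prices more than 10% from a reference price

    def pre_send_checks(order, reference_price):
        if order.qty > MAX_QTY:
            raise OrderRejected("quantity too large")
        if order.qty * order.price > MAX_NOTIONAL:
            raise OrderRejected("notional too large")
        if abs(order.price - reference_price) > PRICE_BAND * reference_price:
            raise OrderRejected("price too far from reference")
        if order.strategy_id is None:
            raise OrderRejected("no strategy accountable for this order")
        if order.physically_deliver:
            raise OrderRejected("physical delivery flag set unexpectedly")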
I was a Core Strat for Goldman's Algorithmic Trading Platform (ATP) at the time of the Knight collapse. ATP had an "upstream compliance layer" and a "downstream compliance layer" even prior to this incident.
> I wonder what these "fast-twitch" sanity checks / circuit breakers look like.
Leaving out any company secrets, things were structured basically how you'd expect from any software-architecture design class assignment. Specific components perform sanity checks at specific levels. Some components keep track of order state. Other components make trading decisions. Still other components are tasked with getting data from place to place and translating message formats.
The "compliance" rules were actually a mixture of exchange regulations, extra constraints from the Compliance department, and other sanity checks. Basically, most of the sanity checks were called "compliance rules", and from an engineering point of view it made sense to treat all of the sanity rules the same, regardless of which entity came up with the rule.
ATP is a framework for execution algos: some other algorithm or person makes the big-picture decisions for big orders, and ATP takes those big-picture "parent" orders and breaks them up into smaller "child" orders at various price points at various times based on various parameters/hints annotated on the parent orders.
The other key components of the trading system are the market data feeds, the exchange/venue connectors, the Smart Order Router (SOR) and the Order Management System (OMS). The OMS keeps track of the parent order state and the relationship between parent and child orders. The OMS communicates the child orders to the SOR, which (unless the child order is annotated explicitly with an exchange) distributes the orders across the various trading venues. The exchange/venue connectors allow the SOR to communicate with the exchange.
Some places combine several of these components into single processes, but it's a pretty typical execution architecture. I've heard it's pretty common to either have the OMS and execution algo engine in the same process, or else have them communicate via shared memory. If I were designing a system from scratch, I'd probably have a TWAP-only algo engine using shared memory to communicate with the OMS, and then build all other execution algos (TWAP, Arrival, etc.) from TWAP orders.
For instance a parent order might be for "Buy 10,000 of 1299.HK (the RIC for AIA LTD.), limit price 72.00, get done as close to the current market price as possible, finish by 15:00:00 but never trade more than 10% of the market volume in 1299.HK". An ATP engine subscribes to notifications for state changes for all orders tagged in the OMS with its particular EngineID. Leaving out the details of how the order gets to the OMS and how the parent orders get tagged for a particular engine, the ATP engine sees new parent orders assigned to it.
The Upstream Compliance Layer then performs sanity checks (including some fat-finger checks for limit prices too far from the previous day's adjusted closing price) and either accepts responsibility for executing the parent order and tells the OMS to change Status from Pending to Accepted, or else tells the OMS to change status from Pending to Rejected (along with a short text description of the rejection reason).
Assuming the parent order is accepted, the OMS sends a message to the client (via a series of upstream systems handling client connectivity) informing them that the order is accepted. In the case of targeting arrival price, it has a partial differential equation model of price impact, and it solves this PDE to minimize total estimated price, and decides to split out the first child order(s) to the exchange. Even though the parent order is BUY 10,000 1299.HK @ 72.00, maybe the model indicates it's optimal to split out BUY 300 1299.HK @ 68.70 and BUY 100 1299.HK @ 70.10 and wait for the rest.
Within ATP, the Downstream Compliance Layer performs sanity checks on the proposed child order(s). One of the basic checks is that the executed quantity plus the total quantity across all outstanding child orders won't exceed the quantity of the parent order. In the case of Hong Kong, this includes checks that the child orders are within the exchange's circuit breaker limits for up/down percentage from the previous day's close, checks that the execution algo doesn't have excessive numbers of child orders in the market, rate-limiting child order creation, etc. (In Hong Kong, exchange connectivity is priced by the transaction-per-second, so tons of one-lot child orders eat up tons of transactions from your quota if you're re-pricing or cancelling them. One of the large multinational banks had an algo go crazy with tons of small mispriced child orders in Hong Kong sometime around 2010-2012, and due to the number of transactions-per-second they had purchased, it took them over half an hour to get everything cancelled, bleeding money the whole time.) If the proposed child orders pass the Downstream Compliance Layer, then ATP sends messages to the OMS to create the proposed child orders.
The OMS then performs its own sanity checks, including again checking that the total quantity across the parent's child orders plus the already executed quantity doesn't exceed the parent order quantity. If the OMS's sanity checks pass, then the child orders are created and the SOR is notified.
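A minimal sketch of that parent/child quantity invariant (checked in the Downstream Compliance Layer and again in the OMS); field names are illustrative:

    def validate_new_child(parent_qty, executed_qty, open_child_qtys, new_child_qty):
        if executed_qty + sum(open_child_qtys) + new_child_qty > parent_qty:
            raise ValueError("child orders would over-fill the parent order")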
ATP then waits to see at least 1,000 shares of 1299.HK trade before splitting out another 100-share child order (in order to keep to the 10% max participation rate); its plan, based on the partial differential equation solution, also restricts when and at what prices it places orders.
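A minimal sketch of that participation pacing; the 10% figure comes from the example above, everything else is illustrative:

    def allowed_additional_qty(market_volume, my_executed_qty, max_participation=0.10):
        # e.g. 1,000 shares traded in the market and none executed yet
        # allows up to 100 more shares to be split out.
        return max(0, int(market_volume * max_participation) - my_executed_qty)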
As I remember, in most markets, it's sufficient to perform a short-sell locate in the Upstream Compliance Layer, getting a stock loan for the quantity of the parent order. However, I seem to remember Japanese regulators requiring locate checks in the Downstream Compliance Layer to check every outgoing child order, resulting in some pain to run the Japanese compliance checks quickly. I guess this results in more fair distribution of locates among clients in the case of hard-to-short stocks, but it's a pain for efficient implementation.
I suspect the main effect of "modern practices" is to make it much easier to make mistakes at scale. Maybe not the exact same mistake as here, but some other one.
By the way, have we heard from Google about how they managed to roll back 6 months of customer data yesterday?
They say that, in aviation, safety regulations are written in blood. While no one died from this event, it's clear that the metaphorical corporate blood spilled probably did wonders to help a lot of other groups.
That's a refreshing reversal from what you usually hear about. You can bet this engineer will never do anything like that again and will likely lead the way toward implementing comprehensive and effective safety mechanisms.
That sentence isn't as impactful as it sounds. There aren't very many engineers out there whose reporting chain remains intact over the better part of a decade, disaster or no disaster.
The best part of these stories is that nothing of value was ever created or lost.
Just a shell game in a casino. Our previous generations built rockets with their best and brightest; now we build ad exchanges and high-frequency trading bots.
Here's the story: https://www.cbsnews.com/news/stock-trade-typo-costs-firm-225...