I had a similar experience recently while helping my 5-year-old daughter vibe code a sandcastle-themed tower defense game (https://sandcastles.finley.lol).
I ended up thinking it might be easier to generate rigged models, animate them, and capture from an iso perspective, then do some kind of pixel art style transfer on the masked sprite sheet. Eventually I realized my kid didn't really care too much about the visuals so I didn't get too far with it.
That's a cute-looking game! I have considered using 3D mesh models, but generating a highly detailed, textured 3D mesh still costs quite a bit, especially when you need to do it at scale.
Surely they are testing their optimizations against common benchmarks internally? I bet the "real world task" degradation is larger by some multiple than it appears when measured through a benchmark that is part of the target.
I've noticed this and thought about it as well. I have a few suspicions:
Theory 1: An increasingly large share of inference compute is moving over to serving the new model for internal users (or partners that are trialing the next models). This leaves less compute for the previous model while demand for it keeps growing. Providers may respond by using quantizations or distillations, compressing the KV cache, tweaking parameters, and/or changing system prompts to try to use fewer tokens.
Theory 2: Internal evals are obviously done using full-strength models with internally optimized system prompts. When models are shipped into production, the system prompt will inherently need changes. Each time a problematic issue rises to the attention of the team, there is a solid chance it results in a new sentence or two added to the system prompt. These grow over time as bad shit happens with the model in the real world. But it doesn't even need to be a harmful case or buggy behavior of the model. Even newer models with enhanced capabilities (e.g. mythos) may get guarded against in prompts used in agent harnesses (CC) or in system prompts, resulting in a more and more complex system prompt. This imposes something like a "cognitive burden" on the model, and the production setup diverges further and further from the eval.
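To make the quantization part of Theory 1 a bit more concrete, here is a minimal sketch of symmetric int8 quantization in plain Python. The weight values and the function names (`quantize_int8`, `dequantize`) are made up for illustration, and I have no idea what any provider actually runs in production; this only shows why storing weights in fewer bits loses some precision.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One shared scale for the whole list; fall back to 1.0 if everything is zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the stored integers back to approximate floats."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only to illustrate the round trip.
weights = [0.82, -0.13, 0.0045, -1.27, 0.33]

quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Worst-case difference between an original weight and its restored value.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", quantized)
print("restored: ", restored)
print("max error:", max_error)
```

With these numbers the smallest weight gets rounded to zero, and the per-weight error is bounded by roughly half the scale. Whether small errors like that add up to a visible drop on real-world tasks is exactly the open question.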
I can see a market for virtual copies of incredibly unpopular CEOs, but I don't think Mark would like how people would likely choose to use these digital effigies.
For both of these scenarios, it seems to happen when the context limit is getting full and the context is summarized. I've found it usually works to respond with the right file, i.e. "great, let's apply those changes in @path/to/file", but it may also be a good time to return to an earlier conversation point by editing one of your previous messages. You might edit the message that got you the response with changes not linked to a specific file; including the file path in that prompt will usually get you back on track.