Most mathematicians don't take pride in their results having no applications. That's just not true. Maybe some quirky pure logicians or something. But otherwise 90%+* of mathematicians I know would be at least satisfied if not thrilled for their work to be used by others.
Wouldn't that just accelerate collapse? How much do you trust the outputs of the LLM to provide trustworthy and valuable new information? I mean, I understand distillation works. But that's much more structured and thoughtful than my sessions, at least.
I was thinking of curated replay buffers, which would act like "dreams". To prevent collapse, the offline dataset would mix the new mid-term data with a baseline of anchor data (the original training distribution) so the model doesn't drift.
Also, we wouldn't train on the whole session. A separate critic module, like a reward model, would filter the KV cache to extract the high-value information, like a garbage collector before the LoRA.
That's just an idea though. Right now most research focuses on changing the architecture itself (TITAN, HOPE...) instead.
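To make the replay-buffer idea a bit more concrete, here is a minimal sketch of the two steps: filter the session with a critic, then mix what survives with anchor data. Everything named here is hypothetical (`critic_score`, `build_replay_batch`, the thresholds), the critic is a toy heuristic standing in for a learned reward model, and it filters plain text snippets rather than an actual KV cache.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    text: str
    source: str  # "session" (new mid-term data) or "anchor" (original distribution)


def critic_score(text: str) -> float:
    """Hypothetical stand-in for a learned critic / reward model.

    Toy heuristic: fraction of distinct words, so repetitive filler scores low.
    """
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0


def build_replay_batch(session_texts, anchor_texts, batch_size=8,
                       anchor_fraction=0.5, min_score=0.6, seed=0):
    """Filter session snippets with the critic, then mix them with anchor data."""
    rng = random.Random(seed)
    # Step 1: keep only session snippets the critic rates as high value.
    kept = [t for t in session_texts if critic_score(t) >= min_score]
    # Step 2: reserve a fixed share of the batch for anchor data to limit drift.
    n_anchor = min(len(anchor_texts), round(batch_size * anchor_fraction))
    n_session = min(len(kept), batch_size - n_anchor)
    batch = [Sample(t, "anchor") for t in rng.sample(anchor_texts, n_anchor)]
    batch += [Sample(t, "session") for t in rng.sample(kept, n_session)]
    rng.shuffle(batch)
    return batch


session = ["the the the the", "user prefers tabs over spaces",
           "ok ok ok", "deploy target is staging cluster"]
anchors = [f"anchor example {i}" for i in range(10)]
batch = build_replay_batch(session, anchors)
```

The resulting batch would then feed whatever offline fine-tuning step you use (for example a LoRA update). The anchor share is the knob that trades new information against drift; 0.5 here is an arbitrary choice, not a recommendation.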
More predictive power is always a good goal, full stop. This is orthogonal to whether the model producing prediction helps with "understanding" directly. Predictability encodes understanding in a strict information theoretic sense, regardless of our ability as humans to access that understanding.
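One way to read the "strict information theoretic sense": by Shannon's source coding result, a model that assigns probability q to outcomes lets you encode them in about -log2(q) bits each, so better predictions mean shorter descriptions of the data. A toy sketch for a biased coin, assuming a Bernoulli source (the function name and the numbers are just for illustration):

```python
import math


def expected_code_length_bits(true_p: float, model_q: float) -> float:
    """Cross-entropy H(p, q) in bits per symbol for a Bernoulli source.

    true_p is the real probability of heads; model_q is what the model predicts.
    """
    return -(true_p * math.log2(model_q)
             + (1 - true_p) * math.log2(1 - model_q))


true_p = 0.9
for q in (0.5, 0.7, 0.9):
    # The closer q is to true_p, the fewer bits per flip are needed on average.
    print(q, expected_code_length_bits(true_p, q))
```

The minimum is reached when the model matches the source, and at that point the code length equals the entropy of the source. Note this only measures how much structure the model has captured, not whether a human can read that structure back out of it.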
It's not arguing that predictive power is bad. Just that people often mistakenly believe some phenomenon is understood more deeply than it really is, because a model can fit data and generate accurate predictions.
But in some cases that is not good enough. If you look for a better explanation and choose gradient descent as your strategy, you'll eventually reach a local optimum, but you won't reach a different explanation.
Arguably, it is hard to look for a better explanation if the current one doesn't have a track record of failed predictions. One possible way out of this situation is to search for predictions that fail.
But what I want to say is that explanations are not just for prediction. They are needed to build a mental model that can then drive the research. And a new model can (theoretically) be built from first principles. I can't find clean examples of that, though. Take Einstein, for example: he started with a failure to predict. But what he came up with first was Special Relativity, which failed utterly with gravity. Einstein spent like 10 years rewriting gravity to make it work with SR? The failed predictions of his shiny new theory didn't stop him, and that is considered a good thing.
> Predictability encodes understanding in a strict information theoretic sense, regardless of our ability as humans to access that understanding.
But it doesn't necessarily imply the possibility of moving forward. I'm not sure the analogy with compressed data is a good one, but you don't work with compressed data directly: you unpack it, maybe unpack it some more, and convert it to a format that is very inefficient in terms of disk space.
A compressed theory is good for applying as is, but to refine it you should probably prefer something else.
Per frontier token. You're not calculating the cost of a fixed-quality asset here. Old hardware running non-frontier models will be very valuable. In fact, we have two direct examples: older server GPUs actually appreciating, and the very obvious fact that not everyone always uses MAX FULL EFFORT BEST MODEL no matter what.
*Completely made up statistic.