Most folks don't realize that each token produced is an opportunity for it to do more computation, and that they are actively making it dumber by asking for as brief a response as possible. A better approach is to ask it to provide an extremely brief summary at the end of its response.
Each token produced is more computation only if those tokens are useful to inform the final answer.
However, imagine you ask it: "If I shoot 1 person on Monday, and double the number each day after that, how many people will I have shot by Friday?"
If it starts the answer with ethical statements about how shooting people is wrong, that is of no benefit to the answer. But it is a benefit if it starts by saying "1 on Monday, 2 on Tuesday, 4 on Wednesday, 8 on Thursday, 16 on Friday, so the answer is 1+2+4+8+16, which is..."
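Spelling out those intermediate tokens is exactly the useful computation: the daily counts form a geometric series, and the sum falls out directly. A quick sketch of the arithmetic:

```python
# Doubling from 1 on Monday through Friday: a geometric series.
counts = [2**day for day in range(5)]  # Monday..Friday
total = sum(counts)
print(counts, total)  # [1, 2, 4, 8, 16] 31
```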
The tokens don't have to be related to the task at all. (From an outside perspective, that is. The connections are internal to the model, which might raise transparency concerns.) A single designated 'compute token' repeated over and over can perform as well as traditional 'chain of thought.' See, for example, Let's Think Dot by Dot (https://arxiv.org/abs/2404.15758).
That doesn't have to be the case, at least in theory. Every token means more computation, also in parts of the network with no connection to the current token. It's possible (but not practically likely) that the disclaimer provides the layer evaluations necessary to compute the answer, even though it confers no information to you.
The AI does not think. It does not work like us, and so the causal chains you want to follow are not necessarily meaningful to it.
Ignoring caches+optimisations, a transformer model takes as input a string of words and generates one more word. No other internal state is stored or used for the next word apart from the previous words.
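That statelessness can be sketched as a loop: the only thing carried from one step to the next is the growing token sequence itself. The `next_token` function below is a trivial stand-in for a real model's forward pass, just to show the feedback structure:

```python
# Toy sketch of autoregressive generation: the only "state" carried
# between steps is the token sequence produced so far.
def next_token(tokens):
    # Trivial stand-in for a model's forward pass: returns the
    # current sequence length as the "predicted" next token.
    return len(tokens)

def generate(prompt_tokens, n_steps):
    tokens = list(prompt_tokens)
    for _ in range(n_steps):
        # each output is immediately appended and fed back as input
        tokens.append(next_token(tokens))
    return tokens

print(generate([10, 20], 3))  # [10, 20, 2, 3, 4]
```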
The words in the disclaimer would have to be the "hidden state". As said, this is unlikely to be true, but theoretically you could imagine a model that starts outputting a disclaimer like "as a large language model...", where the top 2 candidates for the next word are "I" and "it", and "I" leads to correct answers while "it" leads to wrong ones. Blocking it from outputting "I" would then preclude you from getting the correct response.
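The mechanism behind that hypothetical is logit masking at decode time: suppress a candidate token and the decoder commits to a different word, and every later step conditions on that different word. The vocabulary and scores below are invented purely for illustration:

```python
import math

# Hypothetical vocabulary and model scores, invented for illustration.
vocab = ["I", "it", "can", "cannot"]
logits = [2.0, 1.9, 0.5, 0.1]

def pick(logits, blocked=()):
    # Mask blocked tokens to -inf, then greedily take the best remaining one.
    masked = [(-math.inf if vocab[i] in blocked else l)
              for i, l in enumerate(logits)]
    return vocab[max(range(len(masked)), key=masked.__getitem__)]

print(pick(logits))                 # "I"  -> one continuation
print(pick(logits, blocked={"I"}))  # "it" -> a different continuation
```

Because the chosen word is fed back in, the two continuations diverge from this point on.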
This is a rather contrived example, but the "mind" of an AI is different from our own. We think inside our brains and express that in words; we can substitute words without substituting the intent behind them. The AI can't. The words are the literal computation. Different words, different intent.
Does more computation mean a better answer? If I ask it who was the king of England in 1850 the answer is a single name, everything else is completely useless.
You just proved yourself incorrect by picking a year when there was no king, completely invalidating "a single name, everything else is completely useless".
Makes me wonder if, when forcing it to do structured output, you should give it the option of saying "error: invalid assumptions" or something like that.
It's potentially a problem for follow-up questions, as the whole conversation, up to a limited number of tokens, is fed back into the model to produce the next tokens (ad infinitum). So being terse leaves less room to find conceptual links between words, concepts, phrases, etc., because there are fewer of them being parsed for every new token requested. This isn't black and white, though, as being terse can sometimes avoid unwanted connections being made and tangents being unnecessarily followed.
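The feedback loop being described can be sketched simply: at each step the model only sees the most recent slice of the conversation that fits its window, so a terse history means fewer tokens to draw connections from. The budget here is a made-up number for illustration:

```python
MAX_TOKENS = 8  # assumed context budget, invented for illustration

def context_for_next_step(conversation_tokens, max_tokens=MAX_TOKENS):
    # Only the most recent tokens that fit the window are visible
    # to the model when it produces the next token.
    return conversation_tokens[-max_tokens:]

convo = list(range(12))  # a 12-token conversation so far
print(context_for_next_step(convo))  # [4, 5, 6, 7, 8, 9, 10, 11]
```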
I mean in the general case. I have my instructions for brevity gated behind a key phrase, because I generally use ChatGPT as a vibe-y computation tool rather than a fact finding tool. I don't know that I'd trust it to spit out just one fact without a justification unless I didn't actually care much for the validity of the answer.
I'm not an expert on transformer networks, but it doesn't logically follow that more computation = a better answer. It may just mean a longer answer. Do you have any evidence to back this up?
Isn't it an implementation detail whether that makes a difference? There's no particular reason it has to render the entirety of its outputs, or compute fewer tokens, just because the final response is to be terse.