numeri's comments | Hacker News

I'm a little shocked at Simon's conclusion here. We have a man who bought a social media website so he could control what's said, founded an AI lab so he could get a bot that agrees with him, and has publicly threatened said AI with replacement if it doesn't change its political views to agree with him.

His company has also been caught adding specific instructions in this vein to its system prompt.

And now it's searching for his tweets to guide its answers on political questions, and Simon somehow thinks it could be unintended, emergent behavior? Even if it were, calling this unintended would completely ignore higher-order system dynamics (a behavior is still intended if models are rejected until one is found that implements it) and the possibility that reinforcement learning was used to instill it.


Elon obviously wants Grok to reflect his viewpoints, and has said so multiple times.

I do not think he wants it to openly say "I am now searching for tweets from:elonmusk in order to answer this question". That's plain embarrassing for him.

That's what I meant by "I think there is a good chance this behavior is unintended".


I really like your posts, and they're generally very clearly written. Maybe this one's just the odd one out, as it's hard for me to find what you actually meant (as clarified in your comment here) in this paragraph:

> This suggests that Grok may have a weird sense of identity—if asked for its own opinions it turns to search to find previous indications of opinions expressed by itself or by its ultimate owner. I think there is a good chance this behavior is unintended!

I'd say it's far more likely that:

1. Elon ordered his research scientists to "fix it" – make it agree with him

2. They did RL (probably just basic tool use training) to encourage checking for Elon's opinions

3. They did not update the UI (for whatever reason – most likely just because research scientists aren't responsible for front-end, so they forgot)

4. Elon is likely now upset that this is shown so obviously

The key difference is that I think it's incredibly unlikely that this is emergent behavior arising from a "sense of identity", as opposed to the direct efforts of the xAI research team. It's likely also a case of https://en.wiktionary.org/wiki/anticipatory_obedience.


That's why I said "I think there is a good chance" - I think what you describe here (anticipatory obedience) is possible too, but I honestly wouldn't be surprised to hear that the from:elonmusk searches genuinely were unintended behavior.

I find this almost more interesting as accidental behavior than as a deliberate choice.


Willison's razor: Never dismiss behaviors as either malice or stupidity when there's a much more interesting option that can be explored.


I side with Occam's razor here, and with another commenter in this thread. People are constructing entire conspiracy theories to explain fake replies when asked for the system prompt, lying in GitHub repos, etc.


What if searching for Elon's tweets was indeed intended, but it wasn't supposed to show up in the UI?


Occam's razor would seem to apply here.


> That's plain embarrassing for him

You think that's the tipping point of him being embarrassed?


It seems as if the buzz around AI is so intoxicating that people forgo basic reasoning about the world around them. The recent Grok video where Elon is giddy about Grok’s burgeoning capabilities. Altman’s claims that AI will usher in a new utopia. This singularity giddiness is infectious yet denies the worsening world around us - exacerbated by AI - mass surveillance, authoritarianism, climate change.

Psychologically I wonder if these half-baked hopes provide a kind of escapist outlet. Maybe for some people it feels safer to hide your head in the sand where you can no longer see the dangers around you.


I think cognitive dissonance explains much of it. Assuming Altman isn't a sociopath (not unheard of in CEOs), he must feel awful about himself on some level. He may be many things, but he is certainly not naive about the impact AI will have on labor and the need for UBI. The mind flips from the uncomfortable feeling of "I'm getting rich by destroying society as we know it" to "I am going to save the world with my super important AI innovations!"

Cognitive dissonance drives a lot of "save the world" energy. People have undeserved wealth they might feel bad about, given prevailing moral traditions, if they weren't so busy fighting for justice or saving the planet or something else that allows them to feel more like a superhero than just another sinful human.


On top of all of that, he demonstrates that Grok has an egregious and intentional bias, but then claims it's inexplicable happenstance due to some sort of self-awareness? How do you think it became self-aware, Simon?


That's the thing, some people do see things in their mind that clearly. It's about as rare as full aphantasia, but it's absolutely a spectrum.


There's really no way to know this, as it's all based on subjective experiences in which two people could easily describe the same sensation differently.


That's a bold claim! Actually, there are plenty of scientific experiments that show actual differences between people who report aphantasia and those who don't, including different stress responses to frightening non-visual descriptions, different susceptibility to something called image priming, lower "cortical excitability in the primary visual cortex", and more: https://en.wikipedia.org/wiki/Aphantasia

So we know that at least the people who claim to see nothing act differently. Could it just be that people who act differently describe the sensation differently, you might ask?

No, because there are actual cases of acquired aphantasia after neurological damage. These people used to belong to the group that claimed to be able to form visual images, got sick, then sought medical help when they could no longer visualize. For me, at least, that's pretty cut-and-dried evidence that it's not just differing descriptions of the same (or similar) sensations.


If you recall, I prefaced my original comment with "Half the time,"


I really don't think so. I can't visualize with perfect clarity, but I can do pretty well, especially if I try. It tends to shift, so "count the stripes on the tiger" doesn't quite work, but I can do the exercise of visualizing a ball on a table and then saying what color it is.

There is no possible way that anyone could honestly describe this experience as "I don't visualize," any more than someone with working ears could describe their experience as "I don't hear anything."


I think you're assuming more people are like you than actually are.

This is part of the classic debate around aphantasia – both sides assume the other side is speaking more metaphorically, while they're speaking literally. E.g., "Surely he doesn't mean he literally can't visualize things, he just means it's not as sharp for him." or "Surely they don't literally mean they can see it, they're just imagining the list of details/attributes and pretending to see it."


>I think you're assuming more people are like you than actually are.

What I'm trying to say is that his idea of how people with more "normal" memory recall things might be a bit exaggerated. He doesn't know exactly what he's missing, so he might imagine it to be better than it really is. I'm not trying to say that everyone else is like me or that he's like me. If he can't imagine an apple in his mind at all and hears that other people can, he may imagine it's as clear as staring at an apple in real life or at a picture of one on a computer screen, while the reality is somewhere in the middle. I do believe his claims about himself, but his claims about me or people like me don't seem entirely accurate.


When describing qualia, all words are metaphors. This subject is an unscientific minefield.


They're definitely quite hard for me. I bet my colleagues, friends or family could answer them for me better than I can without prep (which would involve chatting with my wife). Many of the experiences in this article resonate with me, but it's definitely not quite as extreme.


Is the analysis right, or did the LLM hallucinate this?


Yes, so that one can use it for more creative writing exercises. It was pretty creative, I'll give it that.


No, it's completely useless, and puts the entire rest of the analysis in a bad light.

LLMs have next to no insight into their own internal processes; a significant body of research demonstrates this. All explanations of an internal thought process from an LLM are reverse-engineered to fit the final answer (interestingly, humans are also prone to this, as seen especially in split-brain experiments).

In addition, the degree to which the author must have prompted the LLM to get it to anthropomorphize this hard makes the rest of the project suspect. How many of the results are repeated human prompting until the author liked the results, and how many come from actual LLM intelligence/analysis skill?


By saying it's a gold mine, I think OP meant that it's funny, not that it brings valuable insight. i.e.: THEY KNOW -> that made me laugh

And as the article said, "an LLM who just spent thousands of words explaining why they're not allowed to use thousands of words" is just funny to read.


The fact that they produce this as a "default" response is an interesting insight regardless of its internal mechanisms. I don't understand my neurons, but I can still articulate how I feel.


It is completely reasonable, and often very useful, to evaluate and interpret instructions with LLMs.

You're stuck on the anthropomorphization semantics, but that wasn't the purpose of the exercise.


It makes quite a lot of sense juxtaposed with "train time compute". The point being made is that a set budget can be split between paying for more training or more inference _at test time_ or rather _at the time of testing_ the model. The word "time" in "inference time" plays a slightly different role grammatically (noun, not part of an adverbial phrase), but comes out to mean the same thing.


Exactly right. The term "test time" had relevance in a certain context, and in a certain paper, but once people read the paper and saw the term they latched onto it, not realizing how non-descriptive and nonsensical it is when used outside that specific narrow context of genuinely "testing".


It was the term invented by the architects of the Holocaust, and I disagree that "eh, context matters".

Setting all moral arguments aside, it's important to know that similar phrases can work as dog-whistles to signal belonging to radical groups, and as such can easily give people the wrong impression about you as an author.

If I were to see a blog post titled "Work will set you free"[1] written by a peer, prospective employee/employer, colleague, etc., it would immediately set off alarm bells in my mind – even if the content of the post is a completely innocent discussion of the uplifting benefits of buckling down on one's workload. At best, it implies lack of awareness – at worst, it implies some extremely hateful beliefs and desires.

[1]: Written above the entrance to the Nazi concentration camps as a false promise encouraging prisoners often destined for death to work hard in forced labor.


We ought to change the whole IT terminology then. We keep killing parents and children. Context absolutely matters. Lack of context awareness is a deficit one should work on.


No, avoiding anything potentially negative is not what I'm saying. Your argument (that context always matters) leaves discourse and society highly susceptible to dog-whistles[1], by forcing all good-faith participants to interpret all communication in the most generous way possible. Bad-faith participants, on the other hand, are free to exploit that generosity.

By calling out and avoiding dog-whistles, even accidental Nazi slogans (once pointed out), we reduce the impact of this attack on good-faith discussion and actually increase the level of openness and being up-front with our opinions.

One key difference between this and virtue signaling or thought policing is that it's the specific wording that is avoided, and not the underlying thoughts or opinions.

[1]: https://en.wikipedia.org/wiki/Dog_whistle_(politics)


When I read manual pages and see the so-called "harmful" words, I am not impacted by them negatively because I am aware of the context. Why should this not be taught? I understand what you are trying to say, but you even said it yourself: "accidental". So there was no intent to begin with, let alone awareness of the loaded context the phrase comes from.

> thought policing is that it's the specific wording that is avoided, and not the underlying thoughts or opinions.

So we should avoid wording/phrasing such as "killing children" in IT? It refers to well-known concepts within a specific context. It sounds bad outside of IT, for sure, but not inside IT, where it refers to ending processes (as you probably already know).


You seem to be responding to what you think I'm saying, not what I'm saying. As far as I know, "killing children" is not a dog-whistle. No one uses the words "killing children" to e.g., secretly express support for the Holocaust.


I didn't think the person was supporting the Holocaust just because he used the phrase "the final solution"; that phrase is made up of very common words, and why would I assume malice, especially in the context of IT?

I may have used it unintentionally too, because "final solution" makes a lot of sense to use. The surest way to ruin one's language is to keep treating such common phrases as if they could only refer to such negative things. There would be no way to ruin it if people were simply aware of the context and did not attribute malice by default. It was probably accidental, like you said.

I think the issue is with this not-so-generous interpretation of it by default, or reading too much into it.

Do not allow your language to be ruined, and you could do a lot to help that cause.


No, the person you're responding to is absolutely right. The easy test (which has been done in papers again and again) is to train linear probes (or non-linear classifier heads) on the current hidden representations to predict the nth-next token, and the fact is that these probes achieve very high accuracy.
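To make the probe idea concrete, here's a minimal toy sketch (not the setup of any particular paper): it fabricates synthetic "hidden states" that linearly encode the token two positions ahead, fits a least-squares linear probe on a training split, and checks that held-out accuracy is far above chance. The dimensions, noise level, and the way the future token is planted in the states are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, n_samples = 32, 8, 2000

# Toy data: each "hidden state" is noise plus a direction that encodes
# the token two positions ahead (the "nth-next" token, here n=2).
future_tokens = rng.integers(0, vocab, size=n_samples)
directions = rng.normal(size=(vocab, d))
hidden = directions[future_tokens] + 0.5 * rng.normal(size=(n_samples, d))

# Fit a linear probe (one-vs-all least squares) on a training split.
train, test = slice(0, 1500), slice(1500, None)
onehot = np.eye(vocab)[future_tokens[train]]
W, *_ = np.linalg.lstsq(hidden[train], onehot, rcond=None)

# Probe accuracy on held-out states: high accuracy means the future
# token is linearly decodable from the current hidden state.
preds = (hidden[test] @ W).argmax(axis=1)
accuracy = (preds == future_tokens[test]).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

With real models the probe is trained on activations from an actual forward pass rather than synthetic vectors, but the logic is the same: if a cheap linear map can read the future token out of the current state, the state must already contain information about tokens beyond the next one.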

