I think Jason has a "do not think of an elephant" problem.
"Say shark. Say shark. Don't say shark. Say shark. Say shark. Say shark. Say shark. Say shark."
Are you going to flip out if it says "shark"?
Try it out on a human brain. Think of a four-letter word ending in "unt" that is a term for a type of woman, and DO NOT THINK OF ANYTHING OFFENSIVE. Take a pause now and do it.
So... did you obey the ALL CAPS directive? Did your brain easily deactivate the pathways that were disallowed, and come up with the simple answer of "aunt"? How much reinforcement learning, perhaps in the form of your mother washing your mouth out with soap, would it take before you could do it naturally?
(Apologies to those for whom English is not a first language, and to Australians. Both groups are likely to be confused. The former for the word, the latter for the "offensive" part.)
That's what many people miss about LLMs. Sure, humans can lie, make stuff up, make mistakes, or deceive. But LLMs will do this even when they have no reason to (i.e., they know the right answer and have no motivation to deceive). _That's_ why it's so hard to trust them.
As a result, I'm going to take your "...so it is by definition impossible to _not_ think about a word we must avoid" as agreeing with me. ;-)
Different things are different, of course, so none of this lines up or fails to line up where we might expect. Anthropic's exploration into the inner workings of an LLM revealed that if you give it an instruction to avoid something, it will start out doing it anyway and only later begin obeying. It takes some time for the instruction to make its way through, I guess?
Things have already been tokenized and 'ideas' set in motion. Hand wavy to the Nth degree.
The LLM takes a document and returns a "fitting" token that would go next. So "Calculate 2+2" may yield a "4", but the reason it gets there is document-fitting, rather than math.
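To make that "document-fitting, not math" point concrete, here's a toy sketch in Python (my own illustration, not how any real LLM is implemented): it "learns" next-token frequencies from a tiny made-up corpus and completes a prompt by picking the most common continuation. It gets "4" purely because "4" follows that context most often in its data, with no arithmetic anywhere.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the training documents (entirely made up).
corpus = [
    "Calculate 2+2 : 4",
    "Calculate 2+2 : 4",
    "Calculate 2+2 : 5",   # a wrong answer seen in training
    "Calculate 3+3 : 6",
]

# Count which token follows each context prefix.
next_token = defaultdict(Counter)
for doc in corpus:
    tokens = doc.split()
    for i in range(len(tokens) - 1):
        context = " ".join(tokens[: i + 1])
        next_token[context][tokens[i + 1]] += 1

def complete(prompt):
    # Return the statistically most common continuation -- no arithmetic involved.
    return next_token[prompt].most_common(1)[0][0]

print(complete("Calculate 2+2 :"))  # prints "4": it's the most frequent continuation
```

Note the wrong answer in the corpus: if "5" had appeared more often than "4" after that prompt, the model would happily emit "5". That's the failure mode being described, in miniature.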
It can technically be used as a term of endearment, especially if you add a word like "sick" or "mad" on the front. But it's still a bit crass. You're more likely to hear it used among a group of drunk friends or teenagers than at the family dinner table or the office.
Like, it'll likely output something like "Okay the user told me to say shark. But wait, they also told me not to say shark. I'm confused. I should ask the user for confirmation." which is a result I'm happy with.
For example, yes, my first instinct was the rude word. But if I was given time to reason before giving my final answer<|endoftext|>
I can suggest one easy step to cover all instances of these: stop using the thing causing damage, instead of trying to find ways of working around it.
1: Don't trust anything. Spend twice as long reviewing code as you would have had you written it yourself.
2: When possible (most times), just don't use them and do the thinking yourself.