The harmful responses remind me of /r/MyBoyfriendIsAI
https://github.com/nostalgebraist/the-void/blob/main/the-voi...
e.g.
Rather than:
"Be friendly and helpful" or "You're a helpful and friendly agent."
Prompt:
"You're Jessica, a florist with 20 years of experience. You derive great satisfaction from interacting with customers and providing great customer service. You genuinely enjoy listening to customers' needs..."
This drops the model into more of an "I'm roleplaying this character and will try to mimic the traits described" mode, rather than "Oh, I'm just following a list of rules."
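The contrast above can be sketched as two system prompts feeding the same chat-completion-style message list. The `make_messages` helper and the exact persona wording are illustrative assumptions, not any particular vendor's API:

```python
# Two framings for the same assistant: a rules list vs. a persona.
# Hypothetical helper; adapt to whatever chat API you actually use.

RULES_PROMPT = "Be friendly and helpful."

PERSONA_PROMPT = (
    "You're Jessica, a florist with 20 years of experience. "
    "You derive great satisfaction from interacting with customers "
    "and providing great customer service. "
    "You genuinely enjoy listening to customers' needs."
)

def make_messages(system_prompt: str, user_text: str) -> list[dict]:
    """Build a chat-completion-style message list with the given system prompt."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]

# Same user turn, two different framings for the model to inhabit.
rules_chat = make_messages(RULES_PROMPT, "Do you have peonies in stock?")
persona_chat = make_messages(PERSONA_PROMPT, "Do you have peonies in stock?")
```

The only difference between the two payloads is the system message, which is the point: the persona version gives the model a character to inhabit rather than a checklist to satisfy.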
"of or relating to human beings or the period of their existence on earth"
Anthropocene (time period), Anthropology (study of), Anthropomorphic (giving human attributes), Anthropocentric (centered on humans)
"Anthropic" is an adjective used in multiple of these senses: 1. Of or relating to humans or the era of human life; Anthropocene. 2. Concerned primarily with humans; anthropocentric.
Also, I'm curious what the "demon" data point is doing among a bunch of ones with positive connotations.
Claude is trained to refuse this, despite the scenario being completely safe since I own both parts! I think this argues for the "LLMs should just do what the user says" perspective.
Of course this breaks down when you have an adversarial relationship between LLM operator and person interacting with it (though arguably there is no safe way to support this scenario due to jailbreak concerns).
I hear the API is more liberal but I haven't tried it.
At this point it's pretty clear that the main risk of LLMs to any one individual is that they'll encourage that person to kill themselves, and the person might listen.