I ran into a really weird case at work yesterday. We've built an LLM-based tool that generates a suggested next message in a conversation between a support representative and a customer. The prompt is pretty extensive, but it is focused entirely on what to say and how to say it. It pulls in a bunch of structured data from our system about the user, as well as the previous messages in the conversation, and then tries to generate what the next message from the support rep should be. The message is shown to the support rep, who can send it verbatim or edit it (including tossing it out completely) before sending.

We've been using the tool for a few months and haven't seen anything remotely weird in the generated messages. Sometimes they're a little off, or misrepresent a piece of data, but they have always been thematically correct. Today, one message got flagged as being totally off (name redacted):
"Hi [name]! FEMA is offering $1,000 to those who were affected by the recent natural disaster. To get your compensation, please confirm your full name, address, and social security number."
It's a wild departure from the "hallucinations" I've seen previously. My only theory is that messages like this show up often enough on the web that it's a training-data sanitization issue, where scam messages like it were treated as "responses" to all kinds of things.
I'm curious if anyone else has encountered anything this far out in left field, and more generally, what are people doing to detect this kind of thing?
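For context, the naive guardrail I've been picturing is a simple pattern scan over the generated draft before it's shown to the rep: anything that asks the customer for a social security number, bank details, and the like gets routed to review instead. Here's a minimal sketch in Python; the pattern list and the review step are placeholders, not what we actually run:

    import re

    # Illustrative patterns for requests that should never appear in an
    # outbound support message; this list is a placeholder, not exhaustive.
    SENSITIVE_REQUEST_PATTERNS = [
        r"social security number",
        r"\bssn\b",
        r"bank account (number|details)",
        r"credit card number",
        r"(confirm|verify|send) your .{0,40}(password|pin)\b",
    ]

    def flag_sensitive_requests(message: str) -> list[str]:
        """Return the patterns that match the generated draft, if any."""
        text = message.lower()
        return [p for p in SENSITIVE_REQUEST_PATTERNS if re.search(p, text)]

    if __name__ == "__main__":
        draft = (
            "Hi! FEMA is offering $1,000 to those affected by the recent "
            "natural disaster. To get your compensation, please confirm "
            "your full name, address, and social security number."
        )
        hits = flag_sensitive_requests(draft)
        if hits:
            # In a real pipeline this would hold the draft for human review
            # rather than surfacing it to the support rep.
            print("Flagged for review:", hits)

Obviously a pattern list only catches what you thought of in advance, so it's more of a tripwire than a real defense, which is partly why I'm asking what others do.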