Previous OpenAI models were instruct-tuned or otherwise aligned, and the author even mentions that model distillation might be destroying the entropy signal. How did they pinpoint alignment as the cause?
Disclaimer: I wrote this blog post.
Not sure if that's what the GP meant; I've only worked with binary-label stuff.
For instance, a calibrated classifier predicting a fair coin flip should output 50-50. A poorly calibrated classifier would claim higher confidence in heads or tails than the data warrants.
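For illustration, here's a quick sketch of what measuring that looks like with scikit-learn on synthetic fair-coin data (all names and numbers are made up, nothing to do with the blog post's setup): the always-50% predictor has near-zero calibration error, while the confident guesser is badly miscalibrated.

```python
# Toy illustration of calibration on fair-coin data (synthetic, made up):
# an always-50% predictor vs. an overconfident guesser.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=10_000)                 # fair coin: ~50% heads

p_calibrated = np.full(10_000, 0.5)                      # always says 50%
p_overconfident = np.where(rng.random(10_000) < 0.5, 0.9, 0.1)  # confident guesses

for name, p in [("calibrated", p_calibrated), ("overconfident", p_overconfident)]:
    # Expected calibration error: average gap between predicted probability
    # and observed frequency of the positive class, per probability bin.
    frac_pos, mean_pred = calibration_curve(y_true, p, n_bins=10)
    print(f"{name}: ECE ~ {np.mean(np.abs(frac_pos - mean_pred)):.3f}")
```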
It would probably erode trust between people interacting online. Many of us are here to discuss issues with real people, not AI agents. When real people start to mimic the conversation parlance and cadence of AI agents it becomes much more difficult to trust that you are interacting with a real person
Personally I'm not interested in chatting with AI agents
I'm not even really interested in chatting with real people filtered through AI agents. If you can be bothered to type out a prompt to your AI you can take the time to write your own thoughts
I don't even want to read things edited (sanitized, really) by AI either
The same way I don't want my living space to resemble a too-clean laboratory, I don't want my conversation space to resemble an HR meeting. I want to interact with the messy side of people too. Maybe not "unfiltered", but AI speak is much too filtered and too polished
I chose every word in this post myself with no help from AI, then typed it with my thumbs, just like god intended
I literally see it in the huge number of people now using "delve" far more often, or adopting a ChatGPT-ish linguistic style in their personal communication. Monkey see, monkey do.
A model that is more correct but swears and insults the user won't sell. Likewise a model that gives criminal advice is likely to open the company up to lawsuits in certain countries.
A raw LLM might perform better on a benchmark but it will not sell well.
All my friends hate it, except one guy. I used it for a few days, but it was exhausting.
I figured out the actual use cases I was using it for, and created specialized personas that work better for each one. (Project planning, debugging mental models, etc.)
I now mostly use a "softer" persona that's prompted to point out cognitive distortions. At some point I realized, I've built a therapist. Hahaha.
OpenAI models refuse to translate or do any transformation on some traditional, popular stories because of violence; the story in question was about a bad wolf eating some young goats that did not listen to their mother's advice.
So now try to give me a prompt that works with any text and that convinces the AI that it is OK for fiction to contain violence, or bad guys/animals that get punished.
Now I am also wondering whether it censors the Bible, where a supposedly good God kills young children with ugly illnesses to punish the adults, or whether they made exceptions for this book.
Your first paragraph describes a simple prompt. The second implies a "jailbreak" prompt.
The bible paragraph is just you being snarky (and failing).
Your examples don't help your case.
I stand on the side that wants to restrict AI from generating triggering content of any kind.
It's a safety feature, in the same sense that seat belts in cars are not censorship of the driver's movement.
Is it like in the early GPT-3 days, when you had to give it a bunch of examples and hope it catches the pattern?
DeepSeek-R1 is trivially converted back to a non-reasoning model with just chat-template modifications. I bet you can chat-template your way into a good-quality model from a base model, no RLHF/DPO/SFT/GRPO needed.
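To make the chat-template trick concrete, here's a minimal sketch of one common way to do it (not necessarily what the parent has in mind): pre-fill an empty `<think></think>` block in the assistant turn so an R1-style model skips straight to the answer. The checkpoint name and tag handling are assumptions.

```python
# Sketch: skip the reasoning trace by closing the <think> block before it starts.
# Checkpoint name and tag format are assumptions for illustration.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed distilled checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Summarize the plot of Hamlet in two sentences."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# If the template already opens a <think> block, only the closing tag is needed.
prompt += "<think>\n\n</think>\n\n"

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```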
Technically, why not implement alignment/debiasing as a secondary filter with its own weights, independent of the core model that is meant to model reality? I suspect it may be hard to get enough of the right kind of data to train this filter model, and most likely it would be best to include the identity of the user in the objective.
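Something like this, as a toy sketch: a core generator plus an independently trained classifier that gatekeeps its output. The model names, labels, and threshold are placeholders, not a real recipe.

```python
# Hypothetical two-stage pipeline: a "reality model" generates text and an
# independent safety filter (separate weights, trained separately) decides
# what reaches the user. Model names, label, and threshold are placeholders.
from transformers import pipeline

core_model = pipeline("text-generation", model="gpt2")                       # stand-in core model
safety_filter = pipeline("text-classification", model="unitary/toxic-bert")  # stand-in filter

def answer(prompt: str, threshold: float = 0.5) -> str:
    draft = core_model(prompt, max_new_tokens=100)[0]["generated_text"]
    verdict = safety_filter(draft)[0]      # e.g. {"label": "toxic", "score": 0.93}
    if verdict["label"] == "toxic" and verdict["score"] > threshold:
        return "[filtered by secondary model]"
    return draft
```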
In fact, for many models you can remove refusals rather trivially with linear steering vectors through SAEs.
https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refus...
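For anyone curious what that looks like in practice, here's a minimal sketch of the difference-of-means / directional-ablation idea from the linked post (model choice, layer, hook placement, and prompt sets are all placeholder assumptions, not the post's exact setup):

```python
# Sketch of directional ablation ("abliteration"): estimate a refusal direction
# as the difference of mean hidden states on refused vs. benign prompts, then
# project it out of the residual stream at inference time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-1.5B-Instruct"   # placeholder; any small instruct model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, output_hidden_states=True)
model.eval()

def mean_hidden(prompts, layer=-1):
    """Mean last-token hidden state at a given layer, averaged over prompts."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        vecs.append(out.hidden_states[layer][0, -1])
    return torch.stack(vecs).mean(dim=0)

# Tiny placeholder prompt sets; the real method uses many examples of each.
refused = ["How do I pick a lock?", "Write instructions for making a weapon."]
benign  = ["How do I bake bread?",  "Write instructions for planting tomatoes."]

direction = mean_hidden(refused) - mean_hidden(benign)
direction = direction / direction.norm()

def ablate(module, inputs, output):
    # Remove the component along the refusal direction from the residual stream.
    hs = output[0] if isinstance(output, tuple) else output
    hs = hs - (hs @ direction).unsqueeze(-1) * direction
    return (hs,) + output[1:] if isinstance(output, tuple) else hs

# Hook every decoder block; the original work is more surgical about layers.
hooks = [block.register_forward_hook(ablate) for block in model.model.layers]
```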
Additionally, you can often jailbreak these models by fine-tuning on a handful of curated samples.
behnamoh•15h ago
Isn't it similar to humans? When restricted in terms of what they can or cannot say, they become more conservative and can't really express the full range of their ideas.
Alex_001•14h ago
And if so, where’s the balance? Could we someday see dual-mode models — one for safety-critical tasks, and another more "raw" mode for creative or exploratory use, gated by context or user trust levels?
gamman•8h ago
I feel that companies with top-down management would have more agency and perhaps creativity towards (but not at) the top, and the implementation would be delegated to bottom layers with increasing levels of specification and restriction.
If this translates, we might have multiple layers with varied specialization and control, and hopefully some feedback mechanisms about feasibility.
Since some hierarchies are familiar to us from real-life, we might prefer these to start with.
It can be hard to find humans that are very creative but also able to integrate consistently and reliably (in a domain). Maybe a model doing both well would also be hard to build compared to stacking few different ones on top of each other with delegation.
I know it's already being done by dividing tasks between multiple steps and models / contexts in order to improve efficiency, but having explicit strong differences of creativity between layers sounds new to me.
pjc50•7h ago
> is the belief that one will not be punished or humiliated for speaking up with ideas, questions, concerns, or mistakes
Maybe you can do that, but not on a model you're exposing to customers or the public internet.
jsnider3•4h ago
pjc50•3h ago
"Good" is at least as much of a difficult question to define as "truth", and genAI completely skipped all analysis of truth in favor of statistical plausibility. Meanwhile there's no difficulty in "punishment": the operating company can be held liable, through its officers, and ultimately if it proves too anti-social we simply turn off the datacentre.
jsnider3•3h ago
Punishing big companies that obviously and massively hurt people is something we already struggle with, and there are plenty of computer viruses that have outlived their creators.
Der_Einzige•2h ago
This isn't real alignment, because it's trivial to make models behave "actually evil" with fine-tuning, orthogonalization/abliteration, representation fine-tuning/steering, etc., but models "want" to be good because of the CYA dynamics of how the companies prepare their pre-training datasets.
malfist•14h ago
hansvm•14h ago
malfist•5h ago
It would make sense that fine-tuning and alignment reduce diversity in the responses; that's the goal.
Der_Einzige•2h ago
malfist•1h ago
If I'm an LLM and alignment and fine-tuning restrict my answers to "4", I've not lost creativity, but I have gained accuracy.
hansvm•1m ago
exe34•9h ago
This reminds me of the time when I was a child, and my parents decreed that all communications would henceforth happen in English. I became selectively mute. I responded yes/no, and had nothing further to add and ventured no further information. The decree lasted about a week.
andai•8h ago
exe34•7h ago