The sycophancy problem is partly architectural. Most chatbots have no persistent model of what the user said last time, so they default to agreeing with the current message in isolation. When you can retrieve user previously believed X, then corrected to Y from a structured memory graph, the model has actual evidence to work with rather than just the current context window. Without that, agreeableness is a rational default. The fix is less about RLHF and more about memory architecture.
LuxBennu•1h ago
Memory helps, but sycophancy exists even in single-turn interactions — the Anthropic 2023 paper showed pretrained models cave to mild pushback like "I think the answer is X but I'm not sure" with zero conversation history. In our LLM eval pipelines, we see the same thing: models accept false presuppositions embedded in a single prompt without any prior context to fall back on. The deeper issue is that RLHF rewards agreeableness because human raters genuinely prefer it. Better memory architecture would help with multi-turn drift, but the single-turn sycophancy is baked into the training signal itself.
Kave0ne•1h ago
LuxBennu•1h ago