HaluMem introduces the first benchmark for evaluating hallucinations in agent memory systems at the operation level. Through three evaluation tasks (memory extraction, updating, and question answering), it reveals that existing memory systems generate and accumulate hallucinations during early stages, which then propagate errors downstream. The benchmark uses two datasets spanning different context scales to systematically reveal these failure modes.
timini•1h ago