The fact that COTs often "hallucinate" was known anecdotally, but they study it more systematically here and provide ways to mitigate. Apparently SFT'ing on "meaningful" reasoning traces provides enough of a scaffold so that later RL results in meaningful/"truthful" traces rather than the appearance of reasoning. See also the author's summary at https://x.com/qinan_yu/status/2049865788304380239
krackers•1h ago