It sounds simple, and I'm not going to claim it was the most complicated thing ever, but there were quite a few steps involved in getting it right. Getting LLMs to do the cleanup task consistently is surprisingly hard. You wouldn't think it, but there are often multiple valid ways to break down a sentence.
An interesting part was structuring the model output so it could use the exact same tokens as the input. Most tokens carry a leading space, so you want the model's "desired output" to keep that leading space on each word too. That makes the task much easier, because the model doesn't have to learn a mapping between the space-prefixed and unprefixed versions of every word. Doing that instantly made my models perform much better.
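Here's a minimal sketch of what I mean, using tiktoken's GPT-2 encoding as a stand-in for whatever tokenizer you're actually training with (the sentence and target are made up for illustration):

```python
# Sketch: why keeping the leading space lets the target reuse the input's tokens.
# Assumes tiktoken is installed; any BPE-style tokenizer shows the same effect.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

sentence = "The quick brown fox jumps over the lazy dog."
sentence_ids = enc.encode(sentence)

# Most words inside the sentence encode as tokens that include their
# leading space, e.g. " fox" rather than "fox".
for tok_id in sentence_ids:
    print(tok_id, repr(enc.decode([tok_id])))

# The same word with and without the leading space maps to different token IDs,
# so a target written without the space can't simply copy the input's tokens:
print(enc.encode("fox"))   # unprefixed -> different ID(s)
print(enc.encode(" fox"))  # space-prefixed -> the ID(s) seen inside the sentence

# If the desired output keeps its leading spaces, its token IDs line up with
# the input's, so the model copies tokens instead of translating between
# prefixed and unprefixed forms of every word.
target = " fox jumps over the lazy dog."
print(set(enc.encode(target)) <= set(sentence_ids))  # True
```

The point is just that "fox" and " fox" are different tokens to the model, so writing the training targets with the spaces stripped quietly turns a copy task into a translation task.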