This is not SOTA, but it was only trained for 3 days on a single A40 (~$30). It was also vibe coded in a week, which was very much possible thanks to combining existing ideas in the space and Opus.
I had a custom dataset of ~500 songs, built by finding official instrumentals and vibe-code aligning them with the full songs, plus some vibe-coded synthetic snippets that came from prompts like "please get some vocal / voice and instrument textures/datasets and piece them together", "please generate edge cases like vocaloid filters, or really quiet instrumentals over very loud voices", etc.
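For a rough idea of what a "really quiet instrumental over a very loud voice" snippet could look like in code, here is a minimal sketch. It is not the actual generation script: the function names (`rms`, `mix_at_ratio`), the dB parameter, and the peak normalization are all assumptions made for illustration, and the audio is assumed to be mono float arrays at the same sample rate.

```python
import numpy as np


def rms(x: np.ndarray) -> float:
    """Root-mean-square level of a mono signal."""
    return float(np.sqrt(np.mean(np.square(x))))


def mix_at_ratio(vocal: np.ndarray, instrumental: np.ndarray, vocal_to_inst_db: float):
    """Mix so the vocal sits `vocal_to_inst_db` dB above the instrumental.

    Hypothetical helper: returns (mixture, vocal, instrumental), all trimmed
    to the same length, so the stems can serve as training targets.
    """
    # Trim both stems to the shorter one so they can be summed sample by sample.
    n = min(len(vocal), len(instrumental))
    vocal, instrumental = vocal[:n], instrumental[:n]

    # Rescale the instrumental to hit the requested level difference.
    target_inst_rms = rms(vocal) / (10 ** (vocal_to_inst_db / 20))
    instrumental = instrumental * (target_inst_rms / max(rms(instrumental), 1e-12))

    mixture = vocal + instrumental

    # If the sum clips, scale everything by the same factor so that
    # mixture == vocal + instrumental still holds.
    peak = float(np.max(np.abs(mixture)))
    if peak > 1.0:
        scale = 1.0 / peak
        mixture, vocal, instrumental = mixture * scale, vocal * scale, instrumental * scale

    return mixture, vocal, instrumental
```

Calling `mix_at_ratio(vocal, instrumental, 20.0)` would give a mix where the voice is 20 dB louder than the backing; a negative value flips it to a loud instrumental over a quiet voice.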
I have one GPU running all conversions right now, so new imports might be slow, but once they're done, they should be good forever! (And there's already an existing pool of songs.)
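The "slow once, good forever" behavior is basically a convert-once cache. A minimal sketch of that idea, assuming results are keyed by a hash of the input audio and stored on disk; `cache_key`, `get_or_convert`, the `.bin` file layout, and the `convert` callback are all hypothetical names, not the real service's code:

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_key(audio_bytes: bytes) -> str:
    """Content hash, so the same upload always maps to the same cache entry."""
    return hashlib.sha256(audio_bytes).hexdigest()


def get_or_convert(
    audio_bytes: bytes,
    cache_dir: str,
    convert: Callable[[bytes], bytes],
) -> bytes:
    """Return the cached conversion if present, else run `convert` and store it.

    `convert` stands in for the slow GPU job (an assumption for this sketch).
    """
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)
    out_path = cache_path / f"{cache_key(audio_bytes)}.bin"

    # Cache hit: no GPU work needed.
    if out_path.exists():
        return out_path.read_bytes()

    # Cache miss: do the slow work once.
    result = convert(audio_bytes)

    # Write to a temp file and rename, so a crash mid-write never leaves
    # a half-written entry under the final name.
    tmp_path = out_path.with_suffix(".tmp")
    tmp_path.write_bytes(result)
    tmp_path.replace(out_path)
    return result
```

With something like this, only the first import of a song pays the GPU cost; every later request for the same audio is just a file read.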