I followed ARM's instructions for converting the stable-audio-open-small model to TensorFlow Lite. The instructions are Android-oriented, but they apply to iOS as well.
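I won't duplicate ARM's steps here, but as a rough sketch of what a PyTorch-to-TFLite conversion can look like with Google's ai-edge-torch (one common route; ARM's guide may use a different path, and the module and shapes below are stand-ins, not the real submodels):

    import torch
    import torch.nn as nn
    import ai_edge_torch

    # Stand-in for one of the pipeline's submodels; the real conversion
    # traces the actual pretrained module with its real input shapes.
    submodel = nn.Conv1d(64, 64, kernel_size=3, padding=1).eval()
    sample_input = (torch.randn(1, 64, 1024),)

    edge_model = ai_edge_torch.convert(submodel, sample_input)
    edge_model.export("submodel.tflite")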
The first issue I encountered was that TensorFlow Lite's Metal/GPU delegates don't seem able to run the three submodels, because many of the models' operations are unsupported. That left the regular CPU delegate with XNNPack as the only route to reasonable performance. Even with XNNPack enabled, though, the autoencoder, the final stage in the pipeline, was both slow and memory-hungry; its transient memory usage alone precluded running the app on older devices with less RAM.
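Rather than discovering this on-device, you can ask TFLite up front which ops the GPU delegate rejects, using its model analyzer. A minimal sketch; the file name is a placeholder for one of the converted submodels:

    import tensorflow as tf

    # Lists every op in the converted graph and flags the ones the
    # GPU delegate cannot run. "dit.tflite" is a placeholder name.
    tf.lite.experimental.Analyzer.analyze(
        model_path="dit.tflite",
        gpu_compatibility=True,
    )

That doesn't change the outcome here, though: the CPU path was the only viable one, and the autoencoder was still the bottleneck.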
To work around this, I used Apple's coremltools to convert just the original PyTorch autoencoder to a Core ML model. I was happy to find that this worked: performance and memory usage improved significantly, enabling the app to run on older devices. The bundle size also shrank, though it remains on the large side.
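The conversion itself is only a few lines with coremltools. Here is a rough sketch assuming a traceable decoder module; the module, tensor names, and shapes below are illustrative stand-ins, not the real stable-audio-open-small ones:

    import torch
    import torch.nn as nn
    import coremltools as ct

    # Stand-in for the pretrained autoencoder decoder; the real conversion
    # loads and traces the actual module with its true latent shapes.
    decoder = nn.ConvTranspose1d(64, 2, kernel_size=4, stride=2, padding=1).eval()

    example_latents = torch.randn(1, 64, 1024)          # (batch, channels, frames)
    traced = torch.jit.trace(decoder, example_latents)  # TorchScript for coremltools

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="latents", shape=example_latents.shape)],
        minimum_deployment_target=ct.target.iOS16,      # emits an ML Program
        compute_units=ct.ComputeUnit.ALL,               # CPU, GPU, or Neural Engine
    )
    mlmodel.save("Autoencoder.mlpackage")

Leaving compute_units at ALL lets Core ML schedule work on the GPU or Neural Engine, which is presumably where much of the speed and memory win over the CPU-only TFLite path comes from.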
The model can seem weak where high fidelity is expected, but some of its outputs are quite unusual and unexpected, which might be valuable for creative use cases. Being able to share the audio clips as ringtones is a terrific, long overdue iOS 26 feature.
My app is called Diffuzion and is available on the App Store globally. Exporting audio is a premium feature gated by an in-app purchase, but you can use the promo code DIFFUZION4HN to unlock it for free; the code is good until November 14th. I would appreciate feedback on possible improvements and features. I have my own ideas, but you may have more compelling ones!