It turns out, that with a modern compiler, you do not need much complexity to make a really fast implementation. This implementation is for powers of 2, and optimized for arrays that do not fit in cache. I do think it would be better to use a higher-level language to implement other cases (e.g. n = 2^a * 3^b * 5^c, multiple small FFTs, higher-dimensional), so I am currently working on getting the SaC-compiler to generate this code.