Joins are often the most painful part of a query, and sometimes the better approach is to build multi-dimensional indices into the data itself. So in my spare time I built LitenDB, an open-source project that extends Spark: it keeps data and indices in Delta Lake and uses Arrow to reshape the data into fast, distributed tensors:
https://github.com/hkverma/litendb
It speeds up join-heavy and analytic queries, simplifies query plans, and can deliver 10–100× speedups. To see how it works, try the Colab notebook:
https://github.com/hkverma/litendb/blob/main/py/notebooks/LitenTpchQ5Q6.ipynb
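For anyone unfamiliar with the idea, here is a minimal sketch in plain Python of how an index stored with the data can stand in for a join. It is only an illustration under my own assumptions and does not use LitenDB's API; the table names, column names, and values are made up. The first function joins at query time by building and probing a hash table, while the second follows a row-position index that was computed once at load time.

```python
# Illustrative only: plain Python, not LitenDB's API.
# All table names, column names, and values here are hypothetical.

# Dimension table, stored column-wise.
nation_key = [10, 20, 30]
nation_name = ["FRANCE", "INDIA", "JAPAN"]

# Fact table, stored column-wise; order_nation is a foreign key into nation_key.
order_nation = [20, 10, 20, 30, 10]
order_price = [5.0, 7.5, 2.5, 4.0, 1.0]


def revenue_by_hash_join():
    """Join at query time: build a hash table on the key, then probe it per row."""
    name_of = dict(zip(nation_key, nation_name))
    totals = {}
    for key, price in zip(order_nation, order_price):
        name = name_of[key]
        totals[name] = totals.get(name, 0.0) + price
    return totals


def build_row_index():
    """Done once at load time: map each foreign key to its dimension row position."""
    position = {key: row for row, key in enumerate(nation_key)}
    return [position[key] for key in order_nation]


def revenue_by_index(order_to_nation_row):
    """Query time: no key lookup, only positional access along the stored index."""
    sums = [0.0] * len(nation_name)
    for row, price in zip(order_to_nation_row, order_price):
        sums[row] += price
    return dict(zip(nation_name, sums))


order_to_nation_row = build_row_index()
# Both approaches give the same answer; only the query-time work differs.
assert revenue_by_hash_join() == revenue_by_index(order_to_nation_row)
```

With the index in place, the aggregation is a positional gather over two arrays, which is the kind of access pattern that maps naturally onto columnar and tensor layouts.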
I'd love to hear feedback from the community and explore collaborations.
Thanks,
HK