I built this because I was tired of the compute markup that products like AWS EMR and Databricks charge for the convenience of running Apache Spark on their platforms. One can argue that Databricks is a superior product with a lot of additional value in its offering, but in my personal experience I don't see that with Apache Spark on AWS EMR at all.
My motivation was to let you create your own Apache Spark cluster without needing any understanding of the underlying data infrastructure engineering, and to get you quickly to the point of writing Spark pipelines, whether as Python applications or Jupyter notebooks, all with no markup on compute, because I don't think that markup is justified.
It took me almost a year to build alongside a day job. I used AI for the frontend design and video narration, but the infrastructure engineering behind it comes from quite a bit of industry experience. The backend that orchestrates the cluster is built with the following:
- Django and DRF for API
- Temporal for async workers
- Pulumi, run via Temporal workers, to provision the cluster (see the sketch after this list)
- Karpenter for node auto-scaling based on Spark executor workloads and requests
- LibreChat for Spark History Server access and MCP-based debugging of Spark pipeline runs
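
To give a flavor of how the Pulumi-via-Temporal piece fits together, here is a minimal sketch, not my actual code: a Temporal activity that drives Pulumi's Automation API to stand up a stack. The names (`provision_spark_cluster`, `spark_cluster_program`) and the placeholder S3 bucket are hypothetical; a real cluster program would declare EKS, node groups, Karpenter, and so on.

```python
# Minimal sketch: a Temporal activity that runs a Pulumi program.
# Assumes temporalio, pulumi, and pulumi-aws are installed and AWS
# credentials are configured. All names here are illustrative.
import pulumi
from pulumi import automation as auto
from temporalio import activity


def spark_cluster_program() -> None:
    """Pulumi program: declares the cloud resources for one cluster."""
    import pulumi_aws as aws

    # Placeholder resource; a real program would declare the EKS
    # cluster, node groups, Karpenter, Spark operator, etc.
    bucket = aws.s3.Bucket("spark-event-logs")
    pulumi.export("event_log_bucket", bucket.id)


@activity.defn
def provision_spark_cluster(cluster_name: str) -> str:
    """Temporal activity: runs `pulumi up` for the given cluster stack.

    Defined as a sync activity because stack.up() blocks; the worker
    runs it in its activity executor. Real code would also heartbeat.
    """
    stack = auto.create_or_select_stack(
        stack_name=cluster_name,
        project_name="spark-clusters",
        program=spark_cluster_program,
    )
    result = stack.up(on_output=activity.logger.info)
    return result.outputs["event_log_bucket"].value
```

The appeal of this shape is that Temporal gives you retries, timeouts, and durable state for what is otherwise a long-running, failure-prone provisioning step, while Pulumi's Automation API lets the infrastructure code live inside an ordinary Python function instead of a separate CLI invocation.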
There are currently no caps on CPU, so you can try this out today in your own personal AWS account for free.
I'm also looking for feedback here on HN.