In my experience, the main reasons are:
- Creating DataFrame fixtures (data and schemas) takes too much time
- Debugging across multiple tables is complicated
- Boilerplate code is verbose and repetitive
To address these pain points, I built PyBujia, a framework that:
- Lets you define table fixtures in Markdown, which makes DataFrames easier to create, debug, and read
- Generalizes the boilerplate, saving setup time
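
To give a rough idea of what a Markdown table fixture can look like, here is a minimal sketch in plain Python. It is not PyBujia's actual API: the `parse_markdown_table` helper, the `ORDERS` table, and its column names are all made up for illustration, and every cell is kept as a string.

```python
# Illustrative only: this is NOT PyBujia's API, just a sketch of the idea.

def parse_markdown_table(text):
    """Parse a Markdown table into (columns, rows); every cell stays a string."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    # Split each line on pipes, dropping the empty edges left by the outer "|".
    cells = [[c.strip() for c in line.strip("|").split("|")] for line in lines]
    columns = cells[0]
    # cells[1] is the |---|---| separator row, so the data starts at index 2.
    rows = [dict(zip(columns, row)) for row in cells[2:]]
    return columns, rows


# Hypothetical fixture: the table is readable as-is in a test file or a doc.
ORDERS = """
| order_id | customer | amount |
|----------|----------|--------|
| 1        | alice    | 10.50  |
| 2        | bob      | 3.00   |
"""

columns, rows = parse_markdown_table(ORDERS)
```

In a real Spark test, rows like these would still need to be cast to the right types and passed to `spark.createDataFrame` along with a schema. That step is the kind of boilerplate a framework can take care of.
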
PyBujia has made testing Spark jobs much easier for me, to the point that I now do TDD, and I hope it helps other data engineers as well.
Feedback is very welcome!