I’ve spent a lot of time debugging large Parquet datasets on S3 where “something is wrong”, but figuring out what usually means either inspecting files one at a time or spinning up Spark just to read their metadata.
In practice, it’s often things like:
- schema drift across partitions
- columns silently disappearing
- timestamp precision changes
- files written by different pipeline versions
- row groups with bad stats or empty data
By the time you notice, the dataset is already messy and hard to reason about.
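For concreteness, all of these show up in the Parquet footer alone, without touching the data pages. Here is a minimal standalone sketch of a footer-level drift check between two files using the Rust parquet crate (simplified, not pqry's actual code; the file names are placeholders):

```rust
// Sketch: diff two files' schemas using only their footers.
// File names are hypothetical examples.
use std::collections::HashMap;
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

/// Map each leaf column's dotted path to its physical type, read from the footer.
fn column_types(path: &str) -> parquet::errors::Result<HashMap<String, String>> {
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let schema = reader.metadata().file_metadata().schema_descr_ptr();
    Ok(schema
        .columns()
        .iter()
        .map(|c| (c.path().string(), format!("{:?}", c.physical_type())))
        .collect())
}

fn main() -> parquet::errors::Result<()> {
    let old = column_types("events-2023-01.parquet")?;
    let new = column_types("events-2024-01.parquet")?;

    for (col, ty) in &old {
        match new.get(col) {
            None => println!("column dropped: {col}"),
            Some(new_ty) if new_ty != ty => println!("type changed: {col}: {ty} -> {new_ty}"),
            _ => {}
        }
    }
    for col in new.keys().filter(|c| !old.contains_key(c.as_str())) {
        println!("column added: {col}");
    }
    Ok(())
}
```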
So I built pqry, a Rust-based CLI tool that scans Parquet metadata at the dataset/prefix level and surfaces schema drift, unstable columns, partition hotspots, and row-group health problems.
It works entirely from metadata, so you can point it at tens of thousands of files and get results fast.
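To give a sense of what "entirely from metadata" means: the footer already carries the writer's created_by string, per-row-group row counts and byte sizes, and per-column statistics, so none of the data pages need to be read. A rough sketch with the Rust parquet crate (again, not pqry's internals; the file name is a placeholder):

```rust
// Sketch: print writer and row-group info using only the footer.
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> parquet::errors::Result<()> {
    let reader = SerializedFileReader::new(File::open("part-00000.parquet")?)?;
    let meta = reader.metadata();

    // Different created_by strings across a prefix usually mean the files
    // were produced by different writer/pipeline versions.
    println!("created_by: {:?}", meta.file_metadata().created_by());

    for (i, rg) in meta.row_groups().iter().enumerate() {
        println!(
            "row group {i}: {} rows, {} bytes",
            rg.num_rows(),
            rg.total_byte_size()
        );
        for col in rg.columns() {
            // Columns without statistics defeat min/max pruning downstream.
            println!(
                "  {}: stats present = {}",
                col.column_path().string(),
                col.statistics().is_some()
            );
        }
    }
    Ok(())
}
```

On object storage, the footer can be fetched with a couple of ranged GETs (the last few bytes encode the footer length), which is why a metadata-only scan over tens of thousands of files stays cheap.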
Examples:
- pqry drift s3://bucket/events/
- pqry columns s3://bucket/events/
- pqry quality s3://bucket/events/
Repo: https://github.com/symblic/pqry
I originally built this for debugging production pipelines where writers and schemas evolved over time and problems only showed up weeks later.
Would love feedback from anyone working with large Parquet datasets in production.