Hi everyone. I've been spending a lot of time looking at UK real estate data and realized that the actual valuable stuff (like the specific reasons why a council rejects a planning application) is buried in unstructured PDFs.
I decided to build an extraction pipeline to pull the policy breaches, officer notes and timelines, etc. out of those PDFs and into a clean CSV. I also had to write a quick script to strip out all the exact addresses and names down to the postcode level to avoid GDPR issues.
I just put a 50-row sample of the schema up on Kaggle. Before I burn money on compute to scale this to 10,000+ rows across London, I'd really appreciate a sanity check from anyone who works with spatial or proptech data. Are there any obvious columns or data points I'm completely missing here?
david_s_data•2h ago
I decided to build an extraction pipeline to pull the policy breaches, officer notes and timelines, etc. out of those PDFs and into a clean CSV. I also had to write a quick script to strip out all the exact addresses and names down to the postcode level to avoid GDPR issues.
I just put a 50-row sample of the schema up on Kaggle. Before I burn money on compute to scale this to 10,000+ rows across London, I'd really appreciate a sanity check from anyone who works with spatial or proptech data. Are there any obvious columns or data points I'm completely missing here?