I’ve been working on job application pipelines and kept hitting a massive data friction point: reliably filtering out UK companies that legally cannot sponsor international workers.
The UK Home Office publishes a constantly updated CSV of licensed sponsors. The problem is that the data is practically useless for standard database joins. A job board might list a role at "Acme", but the government registry lists "Acme Technologies Holdings Limited".
If you run an exact-string match or a basic ILIKE against a scrape of 10,000 Indeed or LinkedIn jobs, your false-negative rate is massive.
I wrote a TypeScript-based matching engine to solve this. Here is the pipeline:
Dynamic Ingestion: It bypasses the Gov.uk dynamic routing to pull the raw, multi-megabyte CSV directly into memory. No stale database records.
Text Normalization: I built a custom parser to strip out standard corporate suffixes ("ltd", "plc", "llp", "t/a", etc.) and handle the weird punctuation and localized encodings that break standard scrapers (see the sketch just after this list).
Fuzzy Scoring: It runs an optimized Levenshtein distance algorithm over the in-memory array to output a 0-100 confidence score for the match.
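For the curious, the normalization step is conceptually something like the sketch below. The suffix list and regexes are illustrative placeholders, not the project's actual parser, and the real register has plenty of punctuation and encoding edge cases a short sketch won't cover.

```typescript
// Illustrative sketch of corporate-suffix normalization (not the project's actual parser).
// e.g. "Acme Technologies Ltd." -> "acme technologies"
const CORPORATE_SUFFIXES = ["limited", "ltd", "plc", "llp", "llc"];

function normalizeCompanyName(raw: string): string {
  let name = raw
    .toLowerCase()
    .normalize("NFKD")                 // fold accented characters
    .replace(/[\u0300-\u036f]/g, "")   // drop the combining diacritics left by NFKD
    .split(/\bt\/a\b/)[0]              // keep the legal name, drop the "trading as" tail
    .replace(/[.,'"()&]/g, " ")        // flatten common punctuation to spaces
    .replace(/\s+/g, " ")
    .trim();

  // Repeatedly strip trailing corporate suffixes until none remain.
  let stripped = true;
  while (stripped) {
    stripped = false;
    for (const suffix of CORPORATE_SUFFIXES) {
      if (name.endsWith(" " + suffix)) {
        name = name.slice(0, -(suffix.length + 1)).trim();
        stripped = true;
      }
    }
  }
  return name;
}
```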
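And the scoring step, roughly: a textbook two-row dynamic-programming Levenshtein, with the edit distance rescaled to a 0-100 score against the longer string. The optimized version in the project is presumably smarter than this, so treat it as a reference baseline rather than the real thing.

```typescript
// Textbook Levenshtein distance with two rolling rows (O(|a|*|b|) time, O(|b|) memory).
function levenshtein(a: string, b: string): number {
  const cols = b.length + 1;
  let prev = Array.from({ length: cols }, (_, j) => j);
  let curr = new Array<number>(cols).fill(0);

  for (let i = 1; i <= a.length; i++) {
    curr[0] = i;
    for (let j = 1; j < cols; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost, // substitution
      );
    }
    [prev, curr] = [curr, prev];
  }
  return prev[cols - 1];
}

// Rescale edit distance to a 0-100 confidence score relative to the longer string.
function matchScore(a: string, b: string): number {
  const maxLen = Math.max(a.length, b.length);
  if (maxLen === 0) return 100; // both empty
  return Math.round((1 - levenshtein(a, b) / maxLen) * 100);
}

// matchScore("acme technologies", "acme technology") -> 82
```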
Initially, I built this for an open-source job ops project as a local cron job that persisted the register to disk with node:fs. But to make it scalable for batch processing, I ripped out the local caching and deployed it as an ephemeral Docker container. It spins up, processes an array of thousands of scraped companies entirely in-memory in a few seconds, pushes a clean JSON dataset of verified sponsors, and dies.
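For context, the one-shot container boils down to something like the flow below, reusing the normalizeCompanyName / matchScore sketches above. The register URL, column position, and input/output shape are placeholders here, not the actual actor contract.

```typescript
// Hypothetical one-shot batch entrypoint (Node 18+ for global fetch):
// fetch the register, match the batch in memory, emit JSON, exit.
const REGISTER_CSV_URL = "https://example.org/register-of-licensed-sponsors.csv"; // placeholder

interface SponsorMatch {
  input: string;
  bestMatch: string | null;
  score: number;
}

async function run(companies: string[]): Promise<SponsorMatch[]> {
  // 1. Ingest: pull the raw CSV straight into memory, no local cache.
  const csv = await (await fetch(REGISTER_CSV_URL)).text();

  // Naive CSV handling for the sketch; the real register needs a proper parser
  // because organisation names can contain quoted commas.
  const sponsors = csv
    .split(/\r?\n/)
    .slice(1)                                   // skip the header row
    .map((line) => line.split(",")[0].trim())   // assume the name is the first column
    .filter(Boolean);

  // 2. Normalize the register once, then score every scraped company against it.
  const register = sponsors.map((s) => ({
    original: s,
    normalized: normalizeCompanyName(s),
  }));

  return companies.map((company) => {
    const needle = normalizeCompanyName(company);
    let best: SponsorMatch = { input: company, bestMatch: null, score: 0 };
    for (const sponsor of register) {
      const score = matchScore(needle, sponsor.normalized);
      if (score > best.score) {
        best = { input: company, bestMatch: sponsor.original, score };
      }
    }
    return best;
  });
}

// 3. Emit clean JSON and exit; the container's job is done after one pass.
run(["Acme", "Globex"]).then((results) => {
  console.log(JSON.stringify(results, null, 2));
});
```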
If you are building a job aggregator, an ATS, or a lead-gen pipeline and don't want to waste a weekend writing your own corporate-suffix normalization logic, I hosted the serverless endpoint here: https://apify.com/dakheera47/uk-visa-sponsor-verifier
I'd love any feedback on the text normalization approach, or to hear if anyone knows of specific edge cases in the Home Office data formatting that I might have missed.