fp.

And then you have to pick a threshold: if the similarity of two strings is above that threshold, it's a match; otherwise, it's not. The threshold should be high to prevent false positives. The LLM will take care of the non-matches.
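A minimal sketch of that thresholding step, assuming Python's standard-library `difflib.SequenceMatcher` as the similarity measure and 0.9 as the cutoff. Both are illustrative choices, not something stated in the thread, and `similarity` and `split_by_threshold` are hypothetical helper names:

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] after trivial normalization."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def split_by_threshold(pairs, threshold=0.9):
    """Split candidate pairs into confident matches and leftovers.

    The threshold is an assumed value; it is kept high so that false
    positives are rare. Leftovers would go to a slower step such as an LLM.
    """
    matched, leftovers = [], []
    for a, b in pairs:
        if similarity(a, b) >= threshold:
            matched.append((a, b))
        else:
            leftovers.append((a, b))
    return matched, leftovers


candidates = [
    ("Acme Ltd", "ACME LTD"),
    ("PAYPAL *ACME", "Invoice 1234 - Acme Ltd"),
]
matched, leftovers = split_by_threshold(candidates)
```

Here the first pair is matched directly, while the second falls below the cutoff and is left for the LLM.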

jackfranklyn•2w ago

Been working on this exact problem in the financial/accounting space - matching bank statement rows to accounting records. Real-world messiness makes it interesting:

The fuzzy threshold question is tricky because false positives are worse than false negatives. A user seeing a wrong match erodes trust fast. We ended up with a tiered approach: high-confidence matches go through automatically, medium-confidence gets surfaced for human review, low-confidence stays unmatched rather than guessing.
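Roughly, the tiering looks like the sketch below. The cutoffs (0.92 and 0.70) and the names `Route` and `route` are made up for illustration; real values would need tuning against labelled data.

```python
from enum import Enum


class Route(Enum):
    AUTO_MATCH = "auto_match"      # high confidence: accept automatically
    HUMAN_REVIEW = "human_review"  # medium confidence: show to a person
    UNMATCHED = "unmatched"        # low confidence: leave alone, don't guess


def route(score: float, auto_at: float = 0.92, review_at: float = 0.70) -> Route:
    """Map a confidence score in [0, 1] to one of three tiers.

    Both cutoffs are assumed values for illustration only.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= auto_at:
        return Route.AUTO_MATCH
    if score >= review_at:
        return Route.HUMAN_REVIEW
    return Route.UNMATCHED
```

Keeping `auto_at` high reflects the asymmetry above: a wrong automatic match costs more trust than an item left unmatched.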

One thing we found: the hardest cases aren't the ones where strings are slightly different - they're the ones where the same transaction appears with completely different descriptions on each side. "PAYPAL *ACME" vs "Invoice 1234 - Acme Ltd". No amount of fuzzy matching helps there. That's where learning from historical patterns (how did the user match these before?) beats trying to infer semantic similarity from scratch every time.
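A rough sketch of the "learn from history" idea, assuming confirmed matches are stored keyed by a normalized bank description. `MatchHistory`, `normalize`, and the `min_count` of 2 are hypothetical and only meant to show the shape of the lookup:

```python
import re
from collections import Counter, defaultdict
from typing import Optional


def normalize(description: str) -> str:
    """Lowercase and drop digits/punctuation so variants share one key."""
    text = re.sub(r"[^a-z ]+", " ", description.lower())
    return " ".join(text.split())


class MatchHistory:
    """Remembers which record the user matched each description to."""

    def __init__(self) -> None:
        self._seen = defaultdict(Counter)

    def record(self, bank_description: str, account: str) -> None:
        """Store one user-confirmed match."""
        self._seen[normalize(bank_description)][account] += 1

    def suggest(self, bank_description: str, min_count: int = 2) -> Optional[str]:
        """Return the most frequent past match, if seen at least min_count times."""
        counts = self._seen.get(normalize(bank_description))
        if not counts:
            return None
        account, n = counts.most_common(1)[0]
        return account if n >= min_count else None


history = MatchHistory()
history.record("PAYPAL *ACME", "Acme Ltd")
history.record("PAYPAL *ACME 0042", "Acme Ltd")
suggestion = history.suggest("paypal *acme")
```

After two confirmations, `suggestion` is "Acme Ltd" even though the two strings share almost no characters with the invoice text.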

ddp26•2w ago

Yep! We have lots of examples like that, where the two descriptions of the same vendor or customer have nothing in common as strings. With LLMs and LLM web agents, you can also associate things that are not the same entity.

One example we have is merging a table of companies with a table of company websites. You get things like "Acme Corp" matching "my-logicistics.com" that no LLM has memorized, so you have to look them up on the web. ReAct web agents work really well here, but they can be very expensive, so it's all about doing this cost-efficiently.