Data Activation Thoughts

https://galsapir.github.io/sparse-thoughts/2026/01/17/data_activation/

21•galsapir•3w ago

i've been working with healthcare/biobank data and keep thinking about what "data moats" mean now that llms can ingest anything. some a16z piece from 2019 said moats were eroding — now the question seems to be whether you can actually make your data useful to these systems, not just have it. there's some recent work (tables2traces, ehr-r1) showing you can convert structured medical data into reasoning traces that improve llm performance, but the approaches are still rough and synthetic traces don't fully hold up to scrutiny (writing this to think through it, not because i have answers)

Comments

sgt101•2w ago

How do you know whether to fine-tune/pretrain or to RL/reasoning-train, given some dataset?

galsapir•2w ago

i honestly don't think there's a simple y/n answer there - i think the main considerations are things like 'how costly is it to do', 'how often do you think you'll need it', and so on. traces are not as "ephemeral" as FT models, since you can use them to guide agent behaviour when a newer model is released (but still, not as evergreen as other assets - traces generated using say GPT4 would seem pale and outdated compared to ones created on the same dataset using Opus4.5 i reckon)

armcat•2w ago

I've been working in the legaltech space and can definitely echo the sentiments there. There are some major legaltech/legal AI companies, but after speaking to dozens of law firms, none of them are finding these tools very valuable. But they have signed contracts with many seats, they are busy people, and tech is not intrinsic to them, so they are not in the business of just changing tools and building things in-house (a handful of them are). And the problem is that, despite massive amounts of internal data, all the solutions fail on the relevance and precision scale. When I sit down with actual legal associates, I can see how immensely complex these workflows are, and to fully utilize this data moat you need: (1) multi-step agentic retrieval, (2) a set of rules/heuristics to ground and steer everything per transaction/case "type", (3) adaptation/fine-tuning towards the "house language/style", (4) integration with many different data sources and tools; and you need to wrap all this with real-world evals (where LLM-as-a-judge techniques often fail).
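As a rough illustration of point (2), per-"type" rules could be kept as plain data and turned into a steering preamble for whatever model or agent does the drafting. This is only a sketch under stated assumptions: `TransactionRules`, `RULES`, `build_steering_preamble`, the `"m&a_nordics_buyer"` key and every rule value are hypothetical names and examples, not taken from any real legal AI product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransactionRules:
    """Hypothetical grounding rules for one transaction/case type."""
    market: str                 # market practice to follow, e.g. "Nordics"
    represented_party: str      # side the firm acts for, e.g. "buyer"
    must_cover: tuple = ()      # topics a draft should address
    style_notes: tuple = ()     # house language/style reminders


# Illustrative values only; a real firm would maintain its own rules.
RULES = {
    "m&a_nordics_buyer": TransactionRules(
        market="Nordics",
        represented_party="buyer",
        must_cover=("pricing mechanics", "warranties"),
        style_notes=("Follow the firm's house style.",),
    ),
}


def build_steering_preamble(transaction_type: str) -> str:
    """Render the rules for one transaction type as plain-text instructions."""
    rules = RULES[transaction_type]  # raises KeyError for unknown types
    lines = [
        f"Market practice: {rules.market}.",
        f"You act for the {rules.represented_party}.",
    ]
    if rules.must_cover:
        lines.append("Cover: " + ", ".join(rules.must_cover) + ".")
    lines.extend(rules.style_notes)
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_steering_preamble("m&a_nordics_buyer"))
```

Keeping the rules as data rather than burying them in prompts means they can be reviewed by lawyers and reused when the underlying model changes.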

dennisy•2w ago

Could you please expand on “none of them find the tools very useful”?

I would love to know how big your sample is, in what way the tools fail, what features are missing etc.

armcat•2w ago

Sure! So to qualify - I've been working in contractual law, and more specifically contract drafting. There are a tonne of other tools in the areas of document management, research, regulatory, timekeeping, etc, so I cannot speak on behalf of those.

Sample size: around 150 law firms across UK, Nordics and DACH (and a smattering across the US). Some were actual month-long pilots so there were deeper interactions with some, whilst others were "just conversations". Let's say in each law firm it's 3-4 associates and 1-2 partners, so it's >600 lawyers.

Typically the legal AI solutions in contract drafting involve the lawyer uploading "their database", aka dragging and dropping a folder or a zip file containing potentially 100s-1000s of contracts from previous transactions.

What's missing:

- Relevance: For the current transaction the lawyer is working on, the recommendations from AI tools suggest irrelevant information. For example, if it's an M&A transaction in one market (e.g. Nordics), it suggests pricing mechanics from a different market practice (e.g. US) that are irrelevant or not desirable. The texts are the closest by cosine (or whatever) distance, but the market characteristics are orthogonal.

- Representation: as a lawyer you are always representing a specific party (e.g. a "buyer" purchasing another company or an asset from a "seller"). You want your side to be best represented - however the tools often fail to "understand" what/who you are representing, and tend to recommend the opposite of what you want for your client.

- Diversity: The same handful of documents keep being referenced all the time, even though there are other "better" documents that should be used to ground the responses and recommendations.

- Precision: Sometimes you want precise information, such as specific leverage ratios or very specific warranty clauses for a transaction of a particular size within a particular industry.

- Language/tonality: Lawyers talk to other lawyers and there is a specific tonality and language used - precision, eloquence, professionalism. Each law firm also has their "house style" in terms of how they put the words together. AI tools come across as "odd" in terms of how they respond (even when they are correct). It trips the lawyers up a bit and they lose the trust somewhat.

Etc. (there are many others)
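To make the relevance, representation and diversity points concrete, here is a minimal sketch of one possible mitigation: hard-filter precedents on metadata (market, represented party) before any similarity ranking, then re-rank with maximal marginal relevance (MMR) so near-duplicate documents do not crowd out the rest. This is an assumption about how such a retriever could look, not a description of any existing tool; `Clause`, `retrieve`, the field names and the `lam` trade-off parameter are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Clause:
    """Hypothetical precedent clause with metadata and a precomputed embedding."""
    text: str
    market: str        # e.g. "Nordics" or "US"
    drafted_for: str   # party the precedent favoured, e.g. "buyer"
    embedding: list    # vector from whatever embedding model is in use


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_emb, clauses, market, party, k=3, lam=0.5):
    """Filter on metadata first, then pick up to k clauses greedily by MMR.

    lam close to 1 favours relevance to the query; lam close to 0 favours
    dissimilarity to clauses that were already selected.
    """
    # Step 1: text similarity never overrides market or represented party.
    pool = [c for c in clauses if c.market == market and c.drafted_for == party]
    selected = []
    # Step 2: greedy MMR selection over the remaining candidates.
    while pool and len(selected) < k:
        def mmr_score(c):
            relevance = cosine(query_emb, c.embedding)
            redundancy = max(
                (cosine(c.embedding, s.embedding) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected
```

The metadata filter is deliberately a hard constraint rather than a score boost: a US pricing mechanic or a seller-friendly clause never reaches the ranking step for a Nordic buy-side deal, however close its embedding is. It does assume the precedents have been tagged with market and party in the first place, which is itself non-trivial work.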
