
Honda: 2 years of ML vs. 1 month of prompting - here's what we learned

https://www.levs.fyi/blog/2-years-of-ml-vs-1-month-of-prompting/
82•Ostatnigrosh•4d ago

Comments

yahoozoo•1h ago
I wonder if text embeddings and semantic similarity would be effective here?
davidsainez•1h ago
> We tried multiple vectorization and classification approaches. Our data was heavily imbalanced and skewed towards negative cases. We found that TF-IDF with 1-gram features paired with XGBoost consistently emerged as the winner.
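For concreteness, a minimal sketch of the pipeline that quote describes (assuming scikit-learn and xgboost; the claim texts, labels, and weighting are illustrative, not Honda's actual setup):

    # TF-IDF 1-gram features feeding an XGBoost classifier on imbalanced data
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from xgboost import XGBClassifier

    texts = ["coolant leak under engine", "routine oil change, no issues"]
    labels = [1, 0]  # 1 = case of interest, 0 = negative case

    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 1)),  # 1-gram features only
        XGBClassifier(scale_pos_weight=10),   # crude handling of class imbalance
    )
    clf.fit(texts, labels)
    print(clf.predict(["fluid leaking near the radiator"]))
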
killerstorm•1h ago
Well, "vectorization" can be anything. BERT is in same capability class as GPT, very different from LSA people did in 1980s...
andai•6m ago
Anthropic found a similar result for retrieval: embeddings + BM25 keyword search (a variant of TF-IDF) produced significantly better results.

https://www.anthropic.com/engineering/contextual-retrieval

They also found improvements from augmenting the chunks with Haiku by having it add a summary based on extra context.

That seems to benefit both the keyword search and the embeddings by acting as keyword expansion. (Though it's unclear to me if they tried actual keyword expansion and how that would fare.)

Anyway what stands out to me most here is what a Rube Goldberg machine it is. Embeddings, keywords, fusion, contextual augmentation, reranking... each adding marginal gains.

But then the whole thing somehow works really well together (<1% fail rate on most benchmarks. Worse for code retrieval.)

I have to wonder how this would look if it weren't a bunch of existing solutions taped together, but actually a fully integrated system.
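As a rough illustration of the fusion step, here is a minimal reciprocal rank fusion sketch (the document IDs are illustrative, and this is not necessarily the exact scheme Anthropic used):

    # Reciprocal rank fusion (RRF) of a keyword (BM25) ranking and an
    # embedding-similarity ranking; higher fused score = earlier in the output.
    def rrf(rankings, k=60):
        scores = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    bm25_ranking = ["doc3", "doc1", "doc7"]        # from keyword search
    embedding_ranking = ["doc1", "doc3", "doc9"]   # from vector search
    print(rrf([bm25_ranking, embedding_ranking]))  # fused ordering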

andy99•1h ago
Yeah I’m curious if they tried training a BERT or similar classifier… intuitively this seems better than TF-IDF, which is throwing away a ton of information.
suriya-ganesh•25m ago
I have. For a similar-ish task.

LLMs still beat a classifier, because they're able to extract more signal than a text embedding.

It's very difficult to beat an LLM + prompt in terms of semantic extraction.

stego-tech•1h ago
And this is where the strengths of LLMs really lie: making performant ML available to a wider audience, without requiring PhDs in Computer Science or Mathematics to build. It’s consistently where I spend my time tinkering with these, albeit in a local-only environment.

If all the bullshit hype and marketing would evaporate already (“LLMs will replace all jobs!”), stuff like this would float to the top more and companies with large data sets would almost certainly be clamoring for drop-in analysis solutions based on prompt construction. They’d likely be far happier with the results, too, instead of fielding complaints from workers about it (AI) being rammed down their throats at every turn.

Veliladon•1h ago
^ This. I'm waiting for an LLM where I can just point it to a repo, slurp it up, and let me ask questions about it.
cpursley•1h ago
GitHub Copilot somewhat does this.
ryandvm•49m ago
Copilot is too stingy with context. In my experience Claude Code is much better at seeing the big picture.
etothet•1h ago
This is exactly what Devin (https://devin.ai) is designed to do. Their deepwiki feature is free. I’ve personally had decent success with it, but YMMV.
bildung•8m ago
Apparently it's also shit. There was a discussion about it a few days ago with multiple project maintainers pointing out that DeepWiki didn't get their repos at all: https://news.ycombinator.com/item?id=45884169
nmfisher•1h ago
$ git clone repo && cd repo
$ claude

Ask away. Best method I’ve found so far for this.

frikk•12m ago
This technique is surprisingly powerful. Yesterday I built an experimental cellular automata classifier system based on some research papers I found and was curious about. Aside from the sheer magic of the entire build process with Cursor + GPT5-Codex, one big breakthrough was simply cloning the original repo's source code and copy/pasting the paper into a .txt file.

Now when I ask questions about design decisions, the LLM refers to the original paper and cites the decisions without googling or hallucinating.

With just these two things in my local repo, the LLM created test scripts to compare our results against the paper's and fixed bugs automatically, helped me make decisions based on the paper's findings, helped me tune parameters based on the empirical outcomes, and even discovered a critical bug in our code that was caused by our training data being randomly generated whereas the paper's training data was a permutation over the whole solution space.

All of this work was done in one evening and I'm still blown away by it. We even ported our code to golang, parallelized it, and saw a 10x speedup in the processing. Right before heading to bed, I had the LLM spin up a novel simulator using a quirky set of tests that I invented using hypothetical sensors and data that have not yet been implemented, and it nailed it first try - using smart abstractions and not touching the original engine implementation at all. This tech is getting freaky.

HotHotLava•56m ago
Basically every AI agent released in the last 6 months can do this pretty well out of the box? What feature exactly are you missing from these?
pjc50•1h ago
Crucially, this is:

    - text classification, not text generation
    - operating on existing unstructured input
    - the existing solution was extremely limited (string matching)
    - comparing the LLM to similar but older methods of using neural networks for matching
    - seemingly no negative consequences of misclassification for the warranty customers themselves (the data is used to improve the process, not to make decisions)
pards•1h ago
> Over multiple years, we built a supervised pipeline that worked. In 6 rounds of prompting, we matched it. That’s the headline, but it’s not the point. The real shift is that classification is no longer gated by data availability, annotation cycles, or pipeline engineering.
stogot•1h ago
This was fun to read

“ Fun fact: Translating French and Spanish claims into German first improved technical accuracy—an unexpected perk of Germany’s automotive dominance.”

happimess•1h ago
I wonder how they came up with that. Was it a human idea, or did the AI stumble upon it?

Given that it was inside a 9-step text preprocessing pipeline, it would be surprising if the AI had that much autonomy.

embedding-shape•1h ago
I think it's fairly well known among "LLM practitioners" (or whatever we should call them) that some languages are better for solving specific tasks. Generally, if you find yourself in a domain dominated by research in language X, shifting your prompts to that language will give you better results.
Upvoter33•1h ago
Did the author actually define "Nova Lite" anywhere in there?
xfalcox•1h ago
It's Amazon's own model. I'm baffled someone would pick it, and even more that someone would test Llama 4 for a task in an age when Sonnet 4.5 is already out, i.e. within the last 45 days.

Looks like they were limited to the options available on AWS Bedrock.

killerstorm•1h ago
Hmm, why was their starting point not something like BERT:

  * already known as SotA for text classification and similarity back in 2023
  * natively multi-lingual
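For reference, a minimal sketch of that kind of BERT baseline (assuming the Hugging Face transformers and datasets libraries; the checkpoint, example claims, and labels are illustrative, not what the post used):

    # Fine-tune a multilingual BERT checkpoint as a binary claim classifier
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=2)

    ds = Dataset.from_dict({
        "text": ["Kühlmittel tritt am Motor aus", "routine oil change, no issues"],
        "label": [1, 0],
    }).map(lambda b: tok(b["text"], truncation=True, padding="max_length",
                         max_length=128), batched=True)

    Trainer(model=model,
            args=TrainingArguments(output_dir="bert-claims", num_train_epochs=1),
            train_dataset=ds).train()
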
embedding-shape•1h ago
People generally fall asleep when you start talking about fine-tuned BERT and CLIP, although they do a fairly decent job as long as you have good data and know what you're doing.

But no, they want to pay $0.10 per request to recognize whether a photo has a person in it by asking a multimodal LLM deployed across 8x GPUs, for some reason, instead of just spending some hours with CLIP and running it effectively even on CPU.
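A minimal sketch of the CLIP alternative being described, runnable on CPU (assuming the transformers library and the openai/clip-vit-base-patch32 checkpoint; the image path and label phrasing are illustrative):

    # Zero-shot "person or no person" image classification with CLIP
    from transformers import pipeline

    classify = pipeline("zero-shot-image-classification",
                        model="openai/clip-vit-base-patch32")
    result = classify("example.jpg",
                      candidate_labels=["a photo with a person",
                                        "a photo without any person"])
    print(result[0]["label"], result[0]["score"])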

efavdb•36m ago
Are you suggesting using the CLIP embedding of the text as a feature to train a standard ML model on?
PaulHoule•20m ago
I think he is. I do things like that plenty.
daemonologist•8m ago
I think they're suggesting doing that with BERT for text and CLIP for images. Which in my experience is indeed quite effective (and easy/fast).

There have been some recent developments in the image-of-text / other-than-photograph area, though. From Meta (although they seem unsure of what exactly their AI division is called): https://arxiv.org/abs/2510.05014 and from Qihoo360: https://arxiv.org/abs/2510.27350, for instance.

datax2•58m ago
Warranty data is a great example of where LLMs can cut down on bureaucratic data overhead. What most people do not know is that, because of the US federal TREAD Act, automotive companies (if they want to land and look at warranty data) need to review all warranty claims, document and detect any safety-related issues, and issue recalls, all with a strong auditability requirement. This generates huge data and operations overhead: companies need to either hire tens if not hundreds of individuals to inspect claims or come up with automation to make the process easier.

Over the past couple of years people have made attempts with NLP (let's say standard ML workflows), but NLP and word temperature scores are hard to integrate into a reliable data pipeline, much less an operational review workflow.

Enter LLMs: the world is a data guru's oyster for building a detection system on warranty claims. Passing data to prompted LLMs means capturing and classifying records becomes significantly easier, and these data applications can flow into more normal analytic work streams.
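As a sketch of what "passing data to prompted LLMs" can look like on Bedrock (assuming boto3 and the Converse API; the model ID, prompt, and claim text are illustrative, not the pipeline from the post):

    # Prompt-based classification of a single warranty claim via Bedrock
    import boto3

    bedrock = boto3.client("bedrock-runtime")
    claim = "Customer reports coolant smell, no visible leak found."
    prompt = ("Classify this warranty claim as SAFETY_RELATED or NOT_SAFETY_RELATED. "
              "Answer with the label only.\n\nClaim: " + claim)

    response = bedrock.converse(
        modelId="amazon.nova-lite-v1:0",  # illustrative model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 10, "temperature": 0},
    )
    print(response["output"]["message"]["content"][0]["text"])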

jwong_•42m ago
Wish there were a bit more technical detail on what the prompt iterations looked like.

> We didn’t just replace a model. We replaced a process.

That line sticks out so much now, and I can't unsee it.

prasoonds•38m ago
Right? This one is also very clearly ChatGPTese:

> That’s not a marginal improvement; it’s a different way of building classifiers.

They've replaced an em-dash with a semi-colon.

klabb3•26m ago
They are really getting to the heart of the problem!
ieie3366•4m ago
HN readers: claim to hate ai-generated text

Also HN readers: upvote the most obvious chatgpt slop to the frontpage

PaulHoule•16m ago
I'll note that they had a large annotated data set already that they were using to train and evaluate their own models. Once they decided to start testing LLMs it was straightforward for them to say "LLM 1 outperforms LLM 2" or "Prompt 3 outperforms Prompt 4".

I'm afraid that people will draw the wrong conclusion from "We didn’t just replace a model. We replaced a process." and see it as an endorsement of the zero-shot-uber-alles "Prompt and Pray" approach that is dominant in the industry right now and the reason why an overwhelming fraction of AI projects fail.

If you can get good enough performance out of zero shot then yeah, zero shot is fine. Thing is that to know it is good enough you still have to collect and annotate more data than most people and organizations want to do.

mcdonje•10m ago
I get that SQL text searches are miserable to write, but it would have flagged it properly in the example.

The text says, "...no leaks..." The case statement says, "...AND LOWER(claim_text) NOT LIKE '%no leak%...'"

It would've properly been marked as a "0".
