Reasoning models are risky. Anyone else experiencing this?

4•lebonnnn•7mo ago

I'm building interviuu (a job application tool) and have been testing pretty much every LLM out there for different parts of the product. One thing that's been driving me crazy: reasoning models seem particularly dangerous for business applications that need to go from A to B in a somewhat rigid way.

I wouldn't call it "deterministic output" because that's not really what LLMs do, but there are definitely use cases where you need a certain level of consistency and predictability, you know?

Here's what I keep running into with reasoning models:

During the reasoning process (and I know Anthropic has shown that what we read isn't the "real" reasoning happening), the LLM tends to ignore guardrails and specific instructions I've put in the prompt. The output becomes way more unpredictable than I need it to be.

Sure, I can define the format with JSON schemas (or objects) and that works fine. But the actual content? It's all over the place. Sometimes it follows my business rules perfectly, other times it just doesn't. And there's no clear pattern I can identify.
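A schema only constrains the shape of the output, so one option is a separate content check that runs after schema validation. The following is a minimal sketch, not interviuu's actual rules: the field names `match_score` and `skills` and both rules are assumptions made up for illustration.

```python
def check_rules(output: dict, source_text: str) -> list[str]:
    """Return a list of rule violations; an empty list means the output passes."""
    violations = []

    # Hypothetical rule 1: the score must be a real integer in a fixed range.
    score = output.get("match_score")
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 100:
        violations.append("match_score must be an integer from 0 to 100")

    # Hypothetical rule 2: every extracted skill must literally appear in the
    # source text, which catches skills the model invented instead of extracted.
    lowered = source_text.lower()
    for skill in output.get("skills", []):
        if skill.lower() not in lowered:
            violations.append(f"skill not found in source: {skill}")

    return violations
```

A non-empty result could trigger a retry or send the item to manual review, so rule violations become visible instead of silently reaching the user.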

For example, I need the model to extract specific information from resumes and job posts, then match them according to pretty clear criteria. With regular models, I get consistent behavior most of the time. With reasoning models, it's like they get "creative" during their internal reasoning and decide my rules are more like suggestions.
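One way to limit this kind of drift, sketched here under assumptions, is to let the model do only the extraction and keep the matching criteria in plain code, where they cannot be reinterpreted. The `Resume` and `JobPost` types, their fields, and the two criteria below are hypothetical, not the product's real schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resume:
    # Fields the LLM would be asked to extract (hypothetical schema).
    years_experience: float
    skills: frozenset[str]


@dataclass(frozen=True)
class JobPost:
    # Requirements extracted from the job post (hypothetical schema).
    min_years: float
    required_skills: frozenset[str]


def match(resume: Resume, job: JobPost) -> dict:
    """Apply the matching criteria deterministically, outside the model."""
    missing = sorted(job.required_skills - resume.skills)
    meets_experience = resume.years_experience >= job.min_years
    return {
        "meets_experience": meets_experience,
        "missing_skills": missing,
        "qualified": meets_experience and not missing,
    }


resume = Resume(years_experience=3.0, skills=frozenset({"python", "sql"}))
job = JobPost(min_years=2.0, required_skills=frozenset({"python", "sql", "docker"}))
result = match(resume, job)
# result["qualified"] is False because "docker" is missing
```

With this split, the same extracted fields always produce the same match result, so any remaining inconsistency is confined to the extraction step.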

I've tested almost all of them (from Gemini to DeepSeek) and honestly, none have convinced me for this type of structured business logic. They're incredible for complex problem-solving, but for "follow these specific steps and don't deviate" tasks? Not so much.

Anyone else dealing with this? Am I missing something in my prompting approach, or is this just the trade-off we make with reasoning models? I'm curious if others have found ways to make them more reliable for business applications.

What's been your experience with reasoning models in production?

Comments

techpineapple•7mo ago

Don't have an answer to your exact question, but waxing philosophical for a moment: it's interesting that if you were talking to a person, they would use their experience to redirect you, subtly or overtly, if they thought your intentions were wrong. It's probably really hard to get all the biases right in an LLM in any casual way. When should it listen to you directly, and when should it say, "No, you're wrong"? There are probably a ton of use cases where you want the "wrong" answer; almost anything innovative is going to go against the grain of what a system would think is "right" by definition. "Be 77% confident in your answer" is hard to tune and bad UX.

PaulHoule•7mo ago

The difference is that in one case you care if the result is right and in another case you don't. If your "complex problem solving" was being used to "design a manufacturing process" or "diagnose a disease and come up with a treatment plan" or something else that has consequences if it is wrong, you're going to be unhappy.

On the other hand there is a lot of high social-status work that has no consequences like "write a college commencement address" and people will be impressed no matter what comes out.

Perhaps the likes of Sam Altman and Satya Nadella are so excited for AI because it can do their job but it can't fold towels or be the final assembly programmer for software that users actually use.

deepsiml•7mo ago

How simply can you break down the process?

I did a dream analysis agent setup and had to break every piece down: one agent focused on symbols, one on memory retrieval, and one on emotions.

It meant I needed far more agents, but I believe the result was much better.
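The decomposition described above might look roughly like this. It is a sketch, not the commenter's actual setup: `call_model` is a hypothetical stand-in for whatever LLM client is used, and the prompt wording is illustrative.

```python
from typing import Callable

# Hypothetical stand-in for any LLM client: takes a prompt, returns text.
ModelFn = Callable[[str], str]

# One narrow instruction per agent (wording is illustrative).
STEP_PROMPTS = {
    "symbols": "List only the symbols in this text:\n{text}",
    "memories": "List only the memories referenced in this text:\n{text}",
    "emotions": "List only the emotions expressed in this text:\n{text}",
}


def run_steps(text: str, call_model: ModelFn) -> dict[str, str]:
    """Run each focused step independently and collect the results by name."""
    return {
        name: call_model(template.format(text=text))
        for name, template in STEP_PROMPTS.items()
    }
```

Because each call sees a single instruction, there is less room for the model to trade one rule off against another, at the cost of more calls per input.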
