Why Linguistic Context Outperforms Raw Data for LLM Decision-Making

https://www.prereason.com/evidence/research

2•KalskiTheDan•1h ago

Comments

KalskiTheDan•1h ago

I'm a solo dev, and this has been 6 months in the making. I built a financial context API that returns pre-analyzed market briefings for AI agents (on-chain, macro, regime classification), then ran 7 controlled experiments to find out whether it actually helps or just adds noise.

I kept seeing the same pattern in AI agent demos. You hand an LLM a price feed, it gets {"price": 94200, "change_24h": -2.3}, and it burns half its context window figuring out the basics. Is this up from last week? What percentile is it in? How does hash rate correlate? The agent does all that work before it starts reasoning about what to do. So I started pre-computing the analysis server-side and returning ~400-token markdown briefings instead of raw JSON.
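To make the idea concrete, here is a minimal sketch of what "pre-computing the analysis" could look like. This is not the actual API: `build_briefing`, its input (a list of daily closes, oldest first), and the briefing layout are all assumptions made up for illustration, and a real briefing would presumably cover on-chain and macro signals too.

```python
from statistics import mean


def build_briefing(closes):
    """Turn raw daily closes (oldest first) into a short markdown briefing.

    Hypothetical helper: the field names and layout are illustrative only.
    """
    if len(closes) < 8:
        raise ValueError("need at least 8 daily closes")

    latest = closes[-1]
    change_24h = (latest / closes[-2] - 1) * 100  # vs. previous close
    change_7d = (latest / closes[-8] - 1) * 100   # vs. close 7 days earlier

    # Percentile rank: share of closes in the window strictly below the latest.
    below = sum(1 for c in closes if c < latest)
    percentile = 100 * below / len(closes)

    side = "above" if latest > mean(closes) else "below"

    return "\n".join([
        "## BTC briefing",
        f"- Price: {latest:,.0f} ({change_24h:+.1f}% 24h, {change_7d:+.1f}% 7d)",
        f"- Percentile in {len(closes)}-day window: {percentile:.0f}",
        f"- Trading {side} the window average",
    ])
```

The point is that the agent receives answers ("down 25% on the week, bottom of the window") rather than numbers it has to interpret first.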

The experiment: 4-arm RCT. Treatment gets real-time briefings. Control gets price only. A third arm uses web search instead of briefings. Placebo gets the same briefings but time-shifted 5-7 months, presented as current. All arms run Claude, one trading decision per tick.
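A rough sketch of how the four arms could be wired up, under stated assumptions: `context_for`, `briefing_at`, and the 180-day shift are hypothetical names and values (the post only says 5-7 months, and I'm assuming the shift goes backwards in time). Whether the web search arm also sees the price is an assumption as well.

```python
from datetime import datetime, timedelta

ARMS = ("treatment", "control", "web_search", "placebo")

# Assumption: a fixed shift inside the 5-7 month range described above.
PLACEBO_SHIFT = timedelta(days=180)


def context_for(arm, tick_time, price, briefing_at):
    """Return the context string an arm's agent sees at one tick.

    briefing_at is a hypothetical callable: timestamp -> briefing text.
    """
    if arm == "treatment":
        return briefing_at(tick_time)
    if arm == "control":
        return f"price: {price}"
    if arm == "web_search":
        # Assumption: price plus a search tool, no briefing.
        return f"price: {price}\n(web search tool enabled)"
    if arm == "placebo":
        # Same briefing format, but stale, with nothing telling the agent so.
        return briefing_at(tick_time - PLACEBO_SHIFT)
    raise ValueError(f"unknown arm: {arm}")
```

Keeping the placebo in the exact same format as treatment is what separates "the briefing structure helps" from "the briefing content helps".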

Latest run, 202 ticks over 6 months. BTC fell 34.7%.

  Treatment (briefings):   +7.83%  | max drawdown 5.95%
  Control (price only):    -8.14%  | max drawdown 15.95%
  Web search arm:          -1.55%  | max drawdown 12.63%
  Placebo (stale data):    -7.70%  | max drawdown 10.17%
  BTC buy-and-hold:       -34.70%

Treatment beat control by +15.97pp. Beat web search by +9.38pp. All 7 experiments positive, range +4.46pp to +15.97pp across two models (Opus 4.6, Sonnet 4.5).
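For anyone checking the arithmetic: the pp figures are plain differences between arm returns from the table above, and max drawdown is the standard largest peak-to-trough decline. The sketch below reproduces the differences and shows one common way to compute drawdown; the equity curve in the usage note is toy data, not from the experiment.

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline of an equity curve, in percent."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)                      # running high-water mark
        worst = max(worst, (peak - value) / peak)    # decline from that mark
    return worst * 100


# Final returns (%) from the latest run, copied from the table above.
returns = {
    "treatment": 7.83,
    "control": -8.14,
    "web_search": -1.55,
    "placebo": -7.70,
}

edge_vs_control = round(returns["treatment"] - returns["control"], 2)
edge_vs_web = round(returns["treatment"] - returns["web_search"], 2)
print(edge_vs_control, edge_vs_web)  # 15.97 9.38
```

As a toy example, `max_drawdown([100, 120, 90, 110])` gives 25.0: the curve peaks at 120 and bottoms at 90.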

The edge is almost entirely defensive. Treatment's return came from two short campaigns during crashes. In rallies and sideways markets, it matched or underperformed control. Long trades were coin flips.

What didn't work: the earliest run was the worst. Treatment finished last. Rich data with no guardrails caused the agent to flip-flop every tick. BUY, SELL, BUY across three consecutive ticks. $79K traded, zero net position change. A later run was aborted at tick 33 after the agent translated "macro bearish" into "go short" when the right move was cash. 1 of 24 total runs was negative. 5 were inconclusive.
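The post doesn't say what the guardrails look like, so the following is only one hypothetical shape they could take: a cooldown that blocks reversing direction within a few ticks, plus a long-only clamp that turns any "go short" into "move to cash". The class name, action labels, and the 3-tick cooldown are all assumptions.

```python
class Guardrail:
    """Hypothetical filter between the agent's proposed action and execution."""

    def __init__(self, cooldown_ticks=3):
        self.cooldown_ticks = cooldown_ticks
        self.last_action = None  # last executed BUY or SELL
        self.last_tick = None    # tick at which it executed

    def filter(self, tick, proposed):
        # Long-only clamp: a bearish view means cash, never a short position.
        if proposed == "SHORT":
            proposed = "SELL"
        if proposed == "HOLD":
            return "HOLD"

        # Block direction reversals that come too soon after the last trade.
        reversing = self.last_action is not None and proposed != self.last_action
        if reversing and tick - self.last_tick < self.cooldown_ticks:
            return "HOLD"

        self.last_action, self.last_tick = proposed, tick
        return proposed


guard = Guardrail(cooldown_ticks=3)
executed = [guard.filter(t, a) for t, a in enumerate(["BUY", "SELL", "BUY"])]
print(executed)  # ['BUY', 'HOLD', 'BUY']
```

With this filter the BUY, SELL, BUY pattern from the failed run collapses to a single direction instead of churning volume for no net position change.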

Stale data was worse than no data. Placebo consistently underperformed plain control across runs. Well-structured wrong information is more dangerous than no information.

Things I'm still uncertain about: the edge is untested in a bull market (every window skews bearish), 202 ticks isn't statistically conclusive within a single run (years of ticks would be far more convincing), and the web search arm had contamination risk from future-dated search results.
