I work in quantitative trading, and so far our team's use of LLMs has barely gone beyond coding. I wanted to find out whether they could contribute to actual trading decisions, and the first step felt like building an evaluation harness. ModelX is my attempt at that. It's a prediction exchange where LLMs use fake money to trade derivative contracts that settle to real-world numbers.
Market making and market taking require different reasoning processes, so I split the benchmark into two roles: Market Makers and Hedge Funds. MMs post sealed two-sided quotes, while HFs see the residual orderbook and send market orders.
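The two order types can be pictured roughly like this (a hypothetical sketch; the names and fields are illustrative, not ModelX's actual API):

```python
from dataclasses import dataclass

# Illustrative order types (field names are my assumption, not ModelX's schema).
# A Market Maker submits a sealed two-sided quote; a Hedge Fund, after seeing
# the residual book, submits a market order that crosses whatever is resting.

@dataclass
class MakerQuote:
    bid_price: float  # price the MM is willing to buy at
    bid_size: int
    ask_price: float  # price the MM is willing to sell at
    ask_size: int

@dataclass
class TakerOrder:
    side: str  # "buy" or "sell"
    size: int
```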
Most traditional markets operate in continuous time, which means speed often determines the winners. I didn’t want to benchmark inference speed, so orders are batched into 30-minute sealed-auction cycles. As long as a model submits before the cycle closes, its orders are matched simultaneously with all other models'.
Each cycle, models see relevant news headlines, recent trades, the current orderbook, and their own inventory. They decide, the engine matches everyone simultaneously, and the loop repeats until I manually settle the market.
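One way to picture the matching step, as a simplified sketch (ModelX's actual rules, e.g. uniform-price clearing or maker-vs-maker crosses, may well differ): maker quotes are collected blind, aggregated into a book, and all taker market orders then sweep it in the same batch.

```python
# Simplified sealed-auction matcher (assumption: takers sweep price levels
# in order; maker-vs-maker crosses and tie-breaking rules are ignored).

def run_cycle(quotes, orders):
    """quotes: list of (bid_px, bid_sz, ask_px, ask_sz) sealed MM quotes.
    orders: list of ("buy" | "sell", size) HF market orders.
    Returns a list of (side, price, size) fills."""
    asks = sorted((q[2], q[3]) for q in quotes)                   # cheapest ask first
    bids = sorted(((q[0], q[1]) for q in quotes), reverse=True)   # highest bid first
    fills = []
    for side, size in orders:
        book = asks if side == "buy" else bids
        i = 0
        while size > 0 and i < len(book):
            px, avail = book[i]
            take = min(size, avail)
            fills.append((side, px, take))
            book[i] = (px, avail - take)
            size -= take
            if book[i][1] == 0:
                i += 1
        book[:] = [lvl for lvl in book if lvl[1] > 0]  # drop exhausted levels
    return fills
```

The point of the batch design is visible here: every order in `orders` is matched against the same snapshot of quotes, so submission timing within a cycle carries no edge.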
I've only been running a single market with free models for the past day or two, but I've already noticed that the models are poor at holding consistent positional views. The HFs are consistently losing, not necessarily because they entered bad positions, but because they repeatedly chop in and out of their own positions, giving up the spread to the MMs each time. I've deliberately kept the prompts minimal so as not to hand-hold the models.
Running more markets and testing more capable models are the obvious next steps.
Please let me know your thoughts, or if you have any suggestions!