I work in quantitative trading, and so far our team's use of LLMs has barely gone beyond coding. I wanted to find out whether they could contribute to actual trading decisions, and the first step felt like building an evaluation harness. ModelX is my attempt at that. It's a prediction exchange where LLMs use fake money to trade derivative contracts that settle to real-world numbers.
Market making and market taking require different reasoning processes, so I split the benchmark into two roles: Market Makers and Hedge Funds. MMs post sealed two-sided quotes, while HFs see the residual orderbook and send market orders.
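The two order types can be pictured roughly like this (a hypothetical sketch; the names and fields are illustrative, not ModelX's actual API):

```python
from dataclasses import dataclass

# Illustrative order types (field names are my assumption, not ModelX's schema).
# A Market Maker submits a sealed two-sided quote; a Hedge Fund, after seeing
# the residual book, submits a market order that crosses whatever is resting.

@dataclass
class MakerQuote:
    bid_price: float  # price the MM is willing to buy at
    bid_size: int
    ask_price: float  # price the MM is willing to sell at
    ask_size: int

@dataclass
class TakerOrder:
    side: str  # "buy" or "sell"
    size: int
```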
Most traditional markets operate in continuous time, which means speed often determines the winners. I didn’t want to benchmark inference speed, so orders are batched into 30-minute sealed-auction cycles. As long as a model submits before the cycle closes, its orders are matched simultaneously with all other models'.
Each cycle, models see relevant news headlines, recent trades, the current orderbook, and their own inventory. They decide, the engine matches everyone simultaneously, and the loop repeats until I manually settle the market.
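One way to picture the matching step, as a simplified sketch (ModelX's actual rules, e.g. uniform-price clearing or maker-vs-maker crosses, may well differ): maker quotes are collected blind, aggregated into a book, and all taker market orders then sweep it in the same batch.

```python
# Simplified sealed-auction matcher (assumption: takers sweep price levels
# in order; maker-vs-maker crosses and tie-breaking rules are ignored).

def run_cycle(quotes, orders):
    """quotes: list of (bid_px, bid_sz, ask_px, ask_sz) sealed MM quotes.
    orders: list of ("buy" | "sell", size) HF market orders.
    Returns a list of (side, price, size) fills."""
    asks = sorted((q[2], q[3]) for q in quotes)                   # cheapest ask first
    bids = sorted(((q[0], q[1]) for q in quotes), reverse=True)   # highest bid first
    fills = []
    for side, size in orders:
        book = asks if side == "buy" else bids
        i = 0
        while size > 0 and i < len(book):
            px, avail = book[i]
            take = min(size, avail)
            fills.append((side, px, take))
            book[i] = (px, avail - take)
            size -= take
            if book[i][1] == 0:
                i += 1
        book[:] = [lvl for lvl in book if lvl[1] > 0]  # drop exhausted levels
    return fills
```

The point of the batch design is visible here: every order in `orders` is matched against the same snapshot of quotes, so submission timing within a cycle carries no edge.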
I've only been running a single market with free models for the past day or two, but I've already noticed that the models are poor at holding consistent positional views. The HFs are consistently losing, not necessarily because they entered bad positions, but because they repeatedly chop in and out of their own positions, giving up the spread to the MMs each time. I've deliberately kept the prompts minimal so as not to hand-hold the models.
Running more markets and testing more capable models are the obvious next steps.
Please let me know your thoughts, or if you have any suggestions!