Cases aren't ordered randomly. Obvious cases are scheduled at the end of a session, right before breaks.
From the paper:
“we find that the LLM adheres to the legally correct outcome significantly more often than human judges”
That presupposes that a “legally correct” outcome exists.
The Common Law, which is the foundation of federal law and the law of 49 of the 50 states, is a “bottom-up” legal system.
Legal principles flow from the specific to the general. That is, judges decide specific cases on the merits of each individual case, and general principles are derived from many specific examples.
This is different from the Civil Law used in most of Europe, which is top-down. Rulings in specific cases are derived from statutory principles.
In the US system, there isn’t really a “correct legal outcome”.
Common Law relies heavily on “jurisprudence”. That is, we have a system that defers to the opinions of “important people”.
So, there isn’t a “correct” legal outcome.
Remember the article that described LLMs as lossy compression and warned that if LLM output dominated the training set, it would lead to accumulated lossiness? Like a jpeg of a jpeg.
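The degradation is easy to demo. Here's a toy Python sketch of the “jpeg of a jpeg” effect, assuming Pillow is installed and some photo.jpg sits on disk; the quality schedule and generation count are arbitrary choices of mine, not anything from the article:

    # Re-encode an image through lossy JPEG compression repeatedly and
    # watch the drift from the original accumulate.
    import io
    from PIL import Image, ImageChops, ImageStat

    img = Image.open("photo.jpg").convert("RGB")
    original = img.copy()

    for generation in range(1, 51):
        buf = io.BytesIO()
        # Nudging the quality each pass keeps the encoder from settling
        # into a fixed point, so every generation discards a bit more.
        img.save(buf, format="JPEG", quality=70 + generation % 7)
        buf.seek(0)
        img = Image.open(buf).convert("RGB")
        if generation % 10 == 0:
            drift = ImageStat.Stat(ImageChops.difference(original, img)).mean
            print(f"gen {generation}: mean per-channel drift = {drift}")

The analogy to training: each model generation is a lossy re-encode of the previous one's output.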
I am comforted that folks still are trying to separate right from wrong. Maybe it’s that effort and intention that is the thread of legitimacy our courts dangle from.
Until this administration forces OpenAI to comply with secret government LLM training protocols, that is...
To be clear, federal judges do have their paychecks signed by the federal government, but they are lifetime appointees and their pay can never be withheld or reduced. You would need to design an equivalent system of independence.
The problem with an AI is similar: what built-in biases does it have? Even if it were simply trained on the entire legal history, that would bias it towards historical norms.
I feel like this is a really poor take on what justice really is. The law itself can be unjust. Empowering a seemingly “unbiased” machine with biased data, or even just assuming that justice can be obtained from a “justice machine”, is deeply flawed.
Whether you like it or not, the law is about making a persuasive argument and is inherently subject to our biases. It's a human abstraction that allows us to have some structure and rules in how we go about things. It's not something that is inherently fair or just.
Also, I find the entire premise of this study ludicrous. The common law of the US is based on case law. The statement in the abstract that “Consistent with our prior work, we find that the LLM adheres to the legally correct outcome significantly more often than human judges. In fact, the LLM makes no errors at all,” is pretentious applesauce. It is offensive that this argument is being made seriously.
Multiple US legal doctrines that are now accepted, and that form the basis of how the Constitution is interpreted, were just made up out of thin air; the LLMs are now consuming them to form the basis of their decisions.
How do we even begin to establish that? This isn't a simple “more accidents” or “fewer accidents” question; it's about the vague notion of “justice”, which varies from person to person, much less from case to case.
hah. Sure.
> Subjects were told that they were a judge who sat in a certain jurisdiction (either Wyoming or South Dakota), and asked to apply the forum state’s choice of law rule to determine whether Kansas or Nebraska law should apply to a tort case involving an automobile accident that took place in either Kansas or Nebraska.
Oh. So it "made no errors at all" with respect to one very small aspect of a very contrived case.
Hand it conflicting laws. Pit it against federal and state disagreements. Let's bring in some complicated Fourth Amendment issues.
"no errors."
That's the Chicago school for you. Nothing but low hanging fruit.
I'm not expressing an opinion on when or how AI should contribute to legal proceedings. I certainly believe that judges need to respond both to the law and to the specific nuances that the law can never code for.
As mentioned elsewhere in the thread, judges focus their efforts on thorny questions of law that don't have clear yes or no answers (they still have clerks prepare memos on these questions, but that's where they do their own reasoning versus just spot checking the technical analysis). That's where the insight and judgement of the human expert comes into play.
The title of the paper is "Silicon Formalism: Rules, Standards, and Judge AI"
When they say “legally correct”, they are clear that they mean under a surface-level, formal reading of the law. They are using it to characterize the way judges vs. GPT-5 treat legal decisions, and they leave it as an open question which is better.
The conclusion of the paper is "Whatever may explain such behavior in judges and some LLMs, however, certainly does not apply to GPT-5 and Gemini 3 Pro. Across all conditions, regardless of doctrinal flexibility, both models followed the law without fail. To the extent that LLMs are evolving over time, the direction is clear: error-free allegiance to formalism rather than the humans’ sometimes-bumbling discretion that smooths away the sharper edges of the law. And does that mean that LLMs are becoming better than human judges or worse?"
But yeah AI slop and all that...
It responds: “Since it’s only 100 meters away (about a 1-minute walk), I’d suggest walking — unless there’s a specific reason not to. Here’s a quick breakdown: ...”
While Claude gets it: “Drive it — you're going there to wash the car anyway, so it needs to make the trip regardless.”
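The thread doesn't give the exact prompt, so this reconstruction is a guess, but the comparison is easy to rerun with the official openai and anthropic Python clients and API keys in your environment:

    # Ask both vendors the same walk-vs-drive question. The prompt
    # wording and model names are my assumptions, not the originals.
    from openai import OpenAI
    from anthropic import Anthropic

    prompt = ("The car wash is 100 meters from my house and my car is "
              "dirty. Should I walk or drive there?")

    gpt = OpenAI().chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    print("GPT:", gpt.choices[0].message.content)

    claude = Anthropic().messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    print("Claude:", claude.content[0].text)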
Idk I'd rather have a human judge I think.
Digging a bit deeper, the actual paper seems to agree: "For the sake of consistency, we define an “error” in the same way that Klerman and Spamann do in their original paper: a departure from the law. Such departures, however, may not always reflect true lawlessness. In particular, when the applicable doctrine is a standard, judges may be exercising the discretion the standard affords to reach a decision different from what a surface-level reading of the doctrine would suggest"
These were technical rulings on matters of jurisdiction, not subjective judgments on fairness.
"The consistency in legal compliance from GPT, irrespective of the selected forum, differs significantly from judges, who were more likely to follow the law under the rule than the standard (though not at a statistically significant level). The judges’ behavior in this experiment is consistent with the conventional wisdom that judges are generally more restrained by rules than they are by standards. Even when judges benefit from rules, however, they make errors while GPT does not.
I don't trust AI in its current form to make that sort of distinction. And sure, you can say the laws should be written better, but as long as the laws are written by humans, that will simply not be the case.
I don't see how an AI / LLM can cope with this correctly.
https://en.wikipedia.org/wiki/COMPAS_(software)
So yes, a judge can let a stupid teenager off on charges of child-porn selfies. But without the resources, they are more likely to be told by a public defender to cop to a plea.
And laws with ridiculous outcomes like that are not always accidental. Often they are deliberate choices made by lawmakers to enact an agenda they cannot achieve by direct means. In the case of making children culpable for child porn of themselves, the laws might come about because the direct abstinence legislation they wanted could not be passed, so they needed other means to scare horny teens.
You could have a team of agents exchange views, and maybe the protocol would even allow for settling cases automatically. The more agents you have, the more nuance you capture.
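For concreteness, a minimal sketch of what such a protocol could look like, with query_model as a hypothetical stand-in for a real LLM call and a simple majority vote as the settlement rule:

    import random
    from collections import Counter

    def query_model(agent_id: int, case_summary: str) -> str:
        # Placeholder for a real LLM call; each "agent" here returns a
        # canned ruling so the protocol can be run end to end.
        return random.choice(["Kansas law applies", "Nebraska law applies"])

    def settle(case_summary: str, n_agents: int = 5) -> str:
        rulings = [query_model(i, case_summary) for i in range(n_agents)]
        ruling, votes = Counter(rulings).most_common(1)[0]
        # Settle automatically only on a clear majority; otherwise escalate.
        return ruling if votes > n_agents // 2 else "no consensus: escalate to a human"

    print(settle("Tort case: automobile accident near the Kansas/Nebraska border"))

More agents means more samples of the model's “opinion”, but also more ways for the tally to split.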
In both cases, lawmakers must adapt the law to reflect what people think is "just". That's why there is jury duty in some countries: to involve people in the ruling, so they see that it's just.
Agree 100%. This is also the only form of argument in favor of capital punishment that has ever made me stop and think about my stance. I.e. we have capital punishment because without it we may get vigilante justice that is much worse.
Now, whether that's how it would actually play out is a different discussion, but it did make me stop and think for a moment about the purpose of a justice system.
(I mean - people get killed in prison sometimes, I suppose, but it’s not really like vigilante justice on the streets is causing a breakdown in society in Australia, say…)
Sentencing is a different thing.