I have the feeling that LLMs are effectively running on dream logic, and everything we've done to make them reason properly is insufficient to bring them up to human level.
What they lack is multi turn long walk goal functions — which is being solved to some degree by agents.
This definitely matches my experience of talking to AI agents and chatbots. They can be extremely knowledgeable on arcane matters yet need to have obvious (to humans) assumptions pointed out to them, since they only have book smarts and not street smarts.
But people want an AI that is objective and right. HN is where people who know the distinction hang out, but it’s not what the layperson things they are getting when they use this miraculous super hyped tool that everybody is raving about?
YMMV.
-Michael Crichton
that doesn’t mean the future won’t herald a way of using what a transformer is good at - interfacing with humans - to translate to and interact with something that can be a lot more sound and objective.
Sure, he could have submitted a ill-considered 3800 line PR five years ago, but it would have taken him at least a week and there probably would have been opportunities to submit smaller chunks along the way or discuss the approach.
I think we’re going to see a lot of the systems we depend on fail a lot more often. You’d often see an ATM or flight staus screen have a BSOD - I think we’re going to see that kind of thing everywhere soon.
It seems to me that it's all a matter of company culture, as it has always been, not AI. Those that tolerate bad code will continue to tolerate it, at their peril.
I see current expectations that technical debt doesn't matter. The current tools embrace superficial understand. These tools to paper over the debt. There is no need for deeper understanding of the problem or solution. The tools take care of it behind the scenes.
Claude: No, but if you hum a few bars I can fake it!
Except "faking it" turns out to be good enough, especially if you can fake it at speed and get feedback as to whether it works. You can then just hillclimb your way to an acceptable solution.
The fact that it generates these "thinking" steps does not mean it is using them for reasoning. It's most useful effect is making it seem to a human that there is a reasoning process.
I've asked it to do why harder things so I thought it'd easily one-shot this but for some reason it absolutely ate it on this task. I tried to re-prompt it several times but it kept digging a hole for itself, adding more and more in-line javascript and backend code (and not even cleaning up the old code).
It's hard to appreciate how unintuitive the failure modes are. It can do things probably only a handful of specialists can do but it can also critical fail on what is a straightforward junior programming task.
treetalker•59m ago
The catch is that many judges lack the time, energy, or willingness to not only read the documents in detail, but also roll up their sleeves and dig into the arguments and cited authorities. (Some lack the skills, but those are extreme cases.) So the plausible argument (improperly and unfortunately) carries the day.
LLM use in litigation drafting is thus akin to insurgent/guerilla warfare: it take little time, energy, or thinking to create, yet orders of magnitude more to analyze and refute. (It's a species of Brandolini's Law / The Bullshit Asymmetry Principle.) Thus justice suffers.
I imagine that this is analogous to the cognitive, technical, and "sub-optimal code" debt that LLM-produced code is generating and foisting upon future developers who will have to unravel it.
FpUser•30m ago
Possible. It also suffers when majority simply can not afford proper representation
deaux•30m ago
Correct, and this of course extends past just laws, into the whole scope of rules and regulations described in human languages. It will by its nature imply things that aren't explicitly stated nor can be derived with certainty, just because they're very plausible. And those implications can be wrong.
Now I've had decent success with having LLMs then review these LLM-generated texts to flag such occurences where things aren't directly supported by the source material. But human review is still necessary.
The cases I've been dealing with are also based on relatively small sets of regulations compared the scope of the law involved with many legal cases. So I imagine that in the domain you're working on, much more needs flagging.
masfuerte•28m ago
roarcher•1m ago
The same goes for coding, in my experience. I have coworkers that use it to generate entire PRs. They can crank out two thousand lines of code with tests "proving" that it works, but may or may not actually be nonsense, in minutes. And then some poor bastard like me has to spend half a day reviewing it.
When code is written by a human that I know and trust, I can assume that they at least made reasonable, if not always correct, decisions. I can't assume that with AI, so I have to scrutinize every single line. And when it inevitably turns out that the AI has come up with some ass-backwards architecture, the burden is on me to understand it and explain why it's wrong and how to fix to the "developer" who hasn't bothered to even read his own PR.
I'm seriously considering suggesting that if you use AI to generate a PR at my company, the story points get credited to the reviewer.