If you can’t zero shot your way to success the LLM simply doesn’t have enough training for your problem and you need a human touch or slightly different trigger words. There have been times where I’ve gotten a solution with such a minimal prompt it practically feels like the LLM read my mind, that’s the vibe.
It affects people too. Something I learned halfway through a theoretical physics PhD in the 1990s was that a 50-page paper with a complex calculation almost certainly had a serious mistake in it that you'd find if you went over it line-by-line.
I thought I could counter that by building a set of unit tests and integration tests around the calculation and on one level that worked, but in the end my calculation never got published outside my thesis because our formulation of the problem turned a topological circle into a helix and we had no idea how to compute the associated topological factor.
Interesting, and I used to think that math and sciences were invented by humans to model the world in a manner to avoid errors due to chains of fuzzy thinking. Also, formal languages allowed large buildings to be constructued on strong foundations.
From your anecdote it appears that the calculations in the paper were numerical ? but I suppose a similar argument applies to symbolic calculations.
https://inspirehep.net/files/20b84db59eace6a7f90fc38516f530e...
using integration over phase space instead of position or momentum space. Most people think you need an orthogonal basis set to do quantum mechanical calculation but it turns that "resolution of unity is all you need", that is, if you integrate |x><x| over all x you get 1. If you believe resolution of unity applies in quantum gravity, then Hawking was wrong about black hole information. In my case we were hoping we could apply the trace formula and make similar derivations to systems with unusual coordinates, such as spin systems.
There are quite a few calculations in physics that involve perturbation theory, for instance, people used to try to calculate the motion of the moon by expanding out thousands of terms that look like (112345/552) sin(32 θ-75 ϕ) and still not getting terribly good results. It turns out classic perturbation theory is pathological around popular cases such as the harmonic oscillator (frequency doesn't vary with amplitude) and celestial mechanics (the frequency to go around the sun, to get closer or further from sun, or to go above or below the plane of the plane of the ecliptic are all the same.) In quantum mechanic these are not pathological, notably perturbation theory works great for an electron going around an atom which is basically the same problem as the Earth going around the Sun.
I have a lot of skepticism about things like
https://en.wikipedia.org/wiki/Anomalous_magnetic_dipole_mome...
in high energy physics because frequently they're comparing a difficult experiment to an expansion of thousands of Feynman diagrams and between computational errors and the fact that perturbation theory often doesn't converge very well I don't get excited when they don't agree.
----
Note that I used numerical calculations for "unit and integration testing", so if I derived an identity I could test that the identity was true for different inputs. As for formal systems, they only go so far. See
https://en.wikipedia.org/wiki/Principia_Mathematica#Consiste...
Sexual reproduction is context-clearing and starting over from ROM.
- Removing problematic tests altogether
- Making up libs
- Providing a stub and asking you to fill in the code
This is a perennial issue in chatbot-style apps, but I've never had it happen in Claude Code.
mikeocool•4h ago
I don’t think I’ve encountered a case where I’ve just let the LLM churn for more than a few minutes and gotten a good result. If it doesn’t solve an issue on the first or second pass, it seems to rapidly start making things up, make totally unrelated changes claiming they’ll fix the issue, or trying the same thing over and over.
the__alchemist•4h ago
fcatalan•4h ago
qazxcvbnmlp•4h ago
1) switch to a more expensive llm and ask it to debug: add debugging statements, reason about what's going on, try small tasks, etc 2) find issue 3) ask it to summarize what was wrong and what to do differently next time 4) copy and paste that recommendation to a small text document 5) revert to the original state and ask the llm to make the change with the recommendation as context
rurp•4h ago
I've had the same experience as parent where LLMs are great for simple tasks but still fall down surprisingly quickly on anything complex and sometimes make simple problems complex. Just a few days ago I asked Claude how to do something with a library and rather than give me the simple answer it suggested I rewrite a large chunk of that library instead, in a way that I highly doubt was bug-free. Fortunately I figured there would be a much simpler answer but mistakes like that could easily slip through.
ziml77•2h ago
nico•3h ago
You might not even need to switch
A lot of times, just asking the model to debug an issue, instead of fixing it, helps to get the model unstuck (and also helps providing better context)
enraged_camel•4h ago
To get the best results, I make sure to give detailed specs of both the current situation (background context, what I've tried so far, etc.) and also what criteria the solution needs to satisfy. So long as I do that, there's a high chance that the answer is at least satisfying if not a perfect solution. If I don't, the AI takes a lot of liberties (such as switching to completely different approaches, or rewriting entire modules, etc.) to try to reach what it thinks is the solution.
prmph•4h ago
enraged_camel•2h ago
It's not often that I have to do this. As I mentioned in my post above, if I start the interaction with thorough instructions/specs, then the conversation concludes before the drift starts to happen.
skerit•4h ago
Is this with something like Aider or CLine?
I've been using Claude-Code (with a Max plan, so I don't have to worry about it wasting tokens), and I've had it successfully handle tasks that take over an hour. But getting there isn't super easy, that's true. The instructions/CLAUDE.md file need to be perfect.
nico•3h ago
What kind of tasks take over an hour?
aprilthird2021•1h ago
onlyrealcuzzo•4h ago
Sounds like a lot of employees I know.
Changing out the entire library is quite amusing, though.
Just imagine: I couldn't fix this build error, so I migrated our entire database from Postgres to MongoDB...
butterknife•4h ago
mathattack•4h ago
didgeoridoo•3h ago
PaulHoule•4h ago
Workaccount2•4h ago
They really need to figure out a way to delete or "forget" prior context, so the user or even the model can go back and prune poisonous tokens.
Right now I work around it by regularly making summaries of instances, and then spinning up a new instance with fresh context and feed in the summary of the previous instance.
codeflo•3h ago
People have also been reporting that ChatGPT's new "memory" feature is poisoning their context. But context is also useful. I think AI companies will have to put a lot of engineering effort into keeping those LLMs on the happy path even with larger and larger contexts.
kossae•3h ago
kazinator•1h ago
HeWhoLurksLate•3h ago
rvnx•3h ago
vunderba•2h ago
nojs•2h ago
OtherShrezzing•2h ago
This is possible in tools like LM Studio when running LLMs locally. It's a choice by the implementer to grant this ability to end users. You pass the entire context to the model in each turn of the conversation, so there's no technical reason stopping this feature existing, besides maybe some cost benefits to the inference vendor from cache.
steveklabnik•2h ago
In Claude Code you can use /clear to clear context, or /compact <optional message> to compact it down, with the message guiding what stays and what goes. It's helpful.
libraryofbabel•2h ago
Claude has some amazing features like this that aren’t very well documented. Yesterday I just learned it writes sessions to disk and you can resume them where you left off with -continue or - resume if you accidentally close or something.
autobodie•13m ago
darepublic•9m ago
dingnuts•4m ago
peacebeard•4h ago
heyitsguay•3h ago
(Context: Working in applied AI R&D for 10 years, daily user of Claude for boilerplate coding stuff and as an HTML coding assistant)
Lots of "with some tweaks i got it to work" or "we're using an agent at my company", rarely details about what's working or why, or what these production-grade agents are doing.
alganet•3h ago
In programming, I already have a very good tool to follow specific steps: _the programming language_. It is designed to run algorithms. If I need to be specific, that's the tool to use. It does exactly what I ask it to do. When it fails, it's my fault.
Some humans require algorithmic-like instructions too. Like cooking a recipe. However, those instructions can be very vague and a lot of humans can still follow it.
LLMs stand on this weird place where we don't have a clue in which occasions we can be vague or not. Sometimes you can be vague, sometimes you can't. Sometimes high level steps are enough, sometimes you need fine-grained instructions. It's basically trial and error.
Can you really blame someone for not being specific enough in a system that only provides you with a text box that offers anthropomorphic conversation? I'd say no, you can't.
If you want to talk about how specific you need to prompt an LLM, there must be a well-defined treshold. The other option is "whatever you can expect from a human".
Most discussions seem to juggle between those two. LLMs are praised when they accept vague instructions, but the user is blamed when they fail. Very convenient.
peacebeard•25m ago
accrual•3h ago
For example, the other day I was converting models but was running out of disk space. The agent decided to change the quantization to save space when I'd prefer it ask "hey, I need some more disk space". I just paused it, cleared some space, then asked the agent to try the original command again.
vadansky•3h ago
When I came back all the tests were passing!
But as I ran it live a lot of cases were still failing.
Turns out the LLM hardcoded the test values as “if (‘test value’) return ‘correct value’;”!
bluefirebrand•3h ago
ffsm8•3h ago
https://github.com/auchenberg/volkswagen
EGreg•2h ago
vunderba•2h ago
matsemann•19m ago
mikeocool•1h ago
It then deleted the entire implementation and made the function raise a “not implemented” exception, updated the tests to expect that, and told me this was a solid base for the next developer to start working on.
Veen•52m ago
nico•3h ago
I’ve had a similar experience, where instead of trying to fix the error, it added a try/catch around it with a log message, just so execution could continue
mtalantikite•2h ago
I generally treat all my sessions with it as a pairing session, and like in any pairing session, sometimes we have to stop going down whatever failing path we're on, step all the way back to the beginning, and start again.
nojs•2h ago
At least that’s easy to catch. It’s often more insidious like “if len(custom_objects) > 10:” or “if object_name == ‘abc’” buried deep in the function, for the sole purpose of making one stubborn test pass.
akomtu•1h ago
dylan604•2h ago
does this mean that even AI gets stuck in dependency hell?
hhh•1h ago
reactordev•1h ago
EnPissant•1h ago
Wowfunhappy•57m ago
I absolutely have, for what it's worth. Particularly when the LLM has some sort of test to validate against, such as a test suite or simply fixing compilation errors until a project builds successfully. It will just keep chugging away until it gets it, often with good overall results in the end.
I'll add that until the AI succeeds, its errors can be excessively dumb, to the point where it can be frustrating to watch.
civilian•52m ago