Skip to the section headed "The Ultimate Test" for the resolution of the "most amazing thing..." clickbait. (According to him, the model correctly interpreted a line in an 18th-century merchant ledger using maths and logic.)
"users have reported some truly wild things" "the results were shocking" "the most amazing thing I have seen an LLM do" "exciting and frightening all at once" "the most astounding result I have ever seen" "made the hair stand up on the back of my neck"
Some time ago I was working on a framework (not the only project I've talked to Claude about) that involved a series of servers that had to pass messages around in a particular fashion. The conversations were mostly technical implementation details and occasional questions about architecture.
Fast forward a ways, and on a lark I decided to ask in the abstract about the best way to structure such an interaction. Note that this was not in the same chat or project and didn't have any identifying information about the original, save for the structure of the abstraction (in this case, a message bus server and some translation and processing services, all accessed via a client).
so:
- we were far enough removed that the whole conversation pertaining to the original was for sure not in the context window
- we only referred to the abstraction (with an A=>B=>C=>B=>A kind of notation and a very brief question)
- most of the work on the original was in claude code
and it knew. In the answer it gave, it mentioned the project by name. I can think of only two ways this could have happened:
- they are doing some real fancy tricks to cram your entire corpus of chat history into the current context somehow
- the model has access to some kind of fact database where it was keeping an effective enough abstraction to make the connection
I find either one mindblowing for different reasons.
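(If it helps to picture the shape, here is a toy sketch of the abstraction I asked about; every name in it is an illustrative stand-in, not anything from the real project:)

```python
import queue

# Toy version of the A=>B=>C=>B=>A shape: a client (A) talks only to a
# message bus, a translation service (B) converts messages on both legs,
# and a processing service (C) does the actual work.
bus: queue.Queue = queue.Queue()

def translate_in(msg: dict) -> dict:   # B, inbound leg
    return {**msg, "text": msg["text"].strip().lower()}

def process(msg: dict) -> dict:        # C
    return {**msg, "result": f"processed: {msg['text']}"}

def translate_out(msg: dict) -> dict:  # B, outbound leg
    return {**msg, "result": msg["result"].upper()}

def client_call(text: str) -> str:     # A
    bus.put({"text": text})
    reply = translate_out(process(translate_in(bus.get())))
    return reply["result"]

print(client_call("  Hello Bus  "))    # PROCESSED: HELLO BUS
```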
Of course it’s very possible my use case wasn’t interesting enough to reveal model differences, or that I landed in a different A/B test.
I will say that other frontier models are starting to surprise me with their reasoning and understanding; I really have a hard time making (or believing) the argument that they are just predicting the next word.
I’ve been using Claude Code heavily since April; Sonnet 4.5 frequently surprises me.
Two days ago I told the AI to read all the documentation from my 5 projects related to a tool I’m building and create a wiki organized by audience and task.
I'm hand-reviewing the 50 wiki pages it created, but overall it did a great job.
I got frustrated about one issue: I have a GitHub issue to build integration with issue trackers (like Jira), but it's still TODO, and the AI claimed on the home page that we had issue tracker integration. It even created a page for it; I figured it was hallucinating.
I went to edit the page and replace it with placeholder text, and was shocked to find that the LLM had (unprompted) figured out how to use existing features to integrate with issue trackers, and had written sample code for GitHub, Jira, and Slack (notifications). That truly surprised me.
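To give a flavor of what it generated (this is my paraphrase, not the model's actual output; the endpoints are just the public GitHub, Jira, and Slack ones):

```python
import requests

def create_github_issue(owner: str, repo: str, token: str,
                        title: str, body: str) -> int:
    # GitHub REST API: POST /repos/{owner}/{repo}/issues
    resp = requests.post(
        f"https://api.github.com/repos/{owner}/{repo}/issues",
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
        json={"title": title, "body": body},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["number"]

def create_jira_issue(base_url: str, auth: tuple, project_key: str,
                      summary: str, description: str) -> str:
    # Jira REST API: POST /rest/api/2/issue
    resp = requests.post(
        f"{base_url}/rest/api/2/issue",
        auth=auth,  # (email, api_token) on Jira Cloud
        json={"fields": {"project": {"key": project_key},
                         "summary": summary,
                         "description": description,
                         "issuetype": {"name": "Task"}}},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["key"]

def notify_slack(webhook_url: str, text: str) -> None:
    # Slack incoming webhooks take a minimal JSON payload
    requests.post(webhook_url, json={"text": text}, timeout=10).raise_for_status()
```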
If they could get this to occur naturally, with no supporting prompts and only one-shot prompting or one-shot reasoning, then it could extend to complex composition generally, which would be cool.
throwup238•1h ago
> Whatever it is, users have reported some truly wild things: it codes fully functioning Windows and Apple OS clones, 3D design software, Nintendo emulators, and productivity suites from single prompts.
This I’m a lot more skeptical of. The linked Twitter post just looks like something it would replicate via HTML/CSS/JS. What’s the kernel look like?
WhyOhWhyQ•1h ago
Wow I'm doing it way wrong. How do I get the good stuff?
zer00eyz•55m ago
I want you to go into the kitchen and bake a cake. Please replace all the flour with baking soda. If it comes out looking limp and lifeless just decorate it up with extra layers of frosting.
You can make something that looks like a cake but would not be good to eat.
The cake, sometimes, is a lie. And in this case, so are most of these results, most likely... or they are the actual source code of some other project, just regurgitated.
hinkley•33m ago
We weren’t even testing for that.
hinkley•4m ago
I’m still amazed that game started as someone’s school project. Long live the Orange Box!
nestorD•56m ago
I can vouch for the fact that LLMs are great at searching in the original language, summarizing key points to let you know whether a document might be of interest, then providing you with a translation where you need one.
The fun part has been building tools to turn Claude Code and Codex CLI into capable research assistants for that type of project.
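The triage flow is roughly the following (ask_llm here is a hypothetical stand-in for whatever model call you wire up, not a real API):

```python
def ask_llm(prompt: str) -> str:
    # Hypothetical stand-in for a call to your model of choice.
    raise NotImplementedError

def triage_document(doc_text: str, topic: str) -> dict:
    # 1. Summarize key points in English, whatever the source language.
    summary = ask_llm(f"Summarize the key points of this document in English:\n{doc_text}")
    # 2. Decide whether the document is worth a closer look.
    verdict = ask_llm(f"Based on this summary, is the document relevant to '{topic}'? Answer yes or no.\n{summary}")
    # 3. Only produce a full translation when it is.
    translation = None
    if verdict.strip().lower().startswith("yes"):
        translation = ask_llm(f"Translate this document into English:\n{doc_text}")
    return {"summary": summary, "relevant": verdict, "translation": translation}
```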
throwup238•2m ago
What does that look like? How well does it work?
I ended up writing a research TUI with my own higher-level orchestration (basically, have the thing keep working in a loop until a budget has been reached) and document extraction.
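The orchestration itself is conceptually just a bounded loop. A stripped-down sketch, with run_step simulating the real agent call and its cost report:

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    text: str
    cost_usd: float
    done: bool

def run_step(task: str, history: list[str]) -> StepResult:
    # Simulated stand-in for one agent call; a real version would shell
    # out to Claude Code / Codex CLI and parse the reported cost.
    return StepResult(text=f"step {len(history) + 1} on: {task}",
                      cost_usd=0.05,
                      done=len(history) >= 3)

def research_loop(task: str, budget_usd: float) -> list[str]:
    history: list[str] = []
    spent = 0.0
    while spent < budget_usd:
        step = run_step(task, history)
        history.append(step.text)
        spent += step.cost_usd
        if step.done:  # the agent declared the task finished
            break
    return history

print(research_loop("survey 18th-century ledgers", budget_usd=0.25))
```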