Better Models: Worse Tools

https://lucumr.pocoo.org/2026/7/4/better-models-worse-tools/
41•leemoore•2h ago

Comments

dofm•1h ago
As critical as I am of articles endlessly concerned with the weaknesses of closed-source cloud LLMs, this one is pretty great, not just because it concerns interactions with Pi, which looks to me like it will end up as a sort of quasi-reference implementation of an open source harness, but also because it has so much useful technical detail.

But:

"Now I’m somewhat worried about the track we’re on here. Alternative tool schemas might not just be unfamiliar. They might be implicitly punished by post-training that optimizes for one particular, forgiving tool ecology."

Only implicitly?

--

Many decades ago when I was working on research related to using MOOs as a learning environment, you would add "tool calls" into the stream of text that a MOO object might generate, so your rich client would e.g. show a picture, load a web page in a frame, move you on a map, trigger a change in an on-screen representation of an object.

Everyone who tried this in MUD/MUSH/MOO clients ran into more or less the same problems that LLM clients do: any attempt to shoehorn control sequences into in-band content was riddled with security risks, objects accidentally triggering the wrong interface etc.; you could never truly communicate out-of-band.

The more I read about how agentic harnesses work, the less embarrassed I feel about the code twenty-something-year-old me wrote in a MOO client.

mappu•1h ago
In my harness I implemented apply_patch as just taking unified diffs for patch -p1. I was shocked to see how bad models are at generating them. I started logging diff failures to analyse them:

- All models are terrible at generating line numbers for a proper diff; give up on them

- Some models (Owl-alpha) must have been post-trained on Codex transcripts, because they occasionally push its V4A patch format into any diff tool available

- Codex puts a lot of info in its system prompt about the desired patch style, making larger hunks instead of granular ones, etc.
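
One way to act on the "give up on line numbers" point is to ignore the numbers in each `@@` header and locate hunks by their context and removed lines instead. The following is only a minimal sketch of that idea, not the harness described above: the names `parse_hunks`, `find_block` and `apply_patch` are hypothetical, it assumes a single-file diff, and it takes the first match at or after the previous hunk rather than checking for ambiguity.

```python
def parse_hunks(diff_text):
    """Return a list of (old_lines, new_lines) pairs, one per hunk.

    The numbers in each "@@ -a,b +c,d @@" header are deliberately ignored.
    Assumption: the diff touches a single file.
    """
    hunks = []
    current = None
    for line in diff_text.splitlines():
        if line.startswith("@@"):
            current = ([], [])
            hunks.append(current)
        elif current is None or line.startswith("\\"):
            continue  # file headers before the first hunk, or "\ No newline ..."
        elif line.startswith("-"):
            current[0].append(line[1:])
        elif line.startswith("+"):
            current[1].append(line[1:])
        else:
            # Context line; tolerate a missing leading space.
            text = line[1:] if line.startswith(" ") else line
            current[0].append(text)
            current[1].append(text)
    return hunks


def find_block(lines, block, start):
    """Index of the first occurrence of block in lines at or after start, else None."""
    for i in range(start, len(lines) - len(block) + 1):
        if lines[i:i + len(block)] == block:
            return i
    return None


def apply_patch(source, diff_text):
    """Apply every hunk by matching its old lines, never by line number."""
    lines = source.splitlines()
    cursor = 0
    for old, new in parse_hunks(diff_text):
        if not old:
            raise ValueError("hunk has no context or removed lines to anchor on")
        pos = find_block(lines, old, cursor)
        if pos is None:
            raise ValueError("hunk context not found in source")
        lines[pos:pos + len(old)] = new
        cursor = pos + len(new)
    trailing = "\n" if source.endswith("\n") else ""
    return "\n".join(lines) + trailing
```

With this approach a header like `@@ -99,3 +99,3 @@` applies cleanly to a three-line file, because only the hunk body is consulted.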

fractorial•41m ago
In my harness, I implemented tool_edit as a subset of Rob Pike’s Sam editor syntax [0].

Only need ~650 tokens of system prompt for it to work. It’s pretty stellar.

[0] https://9p.io/sys/doc/sam/sam.html
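
For readers who have not seen Sam's command language, here is a rough sketch of what a very small subset could look like as an edit tool. The comment does not say which subset was implemented, so everything here is an assumption: the function name `tool_edit` is borrowed from the comment, only the two forms `,x/REGEX/ c/TEXT/` and `,x/REGEX/ d` are handled, and Python's `re` syntax stands in for Sam's regular expressions.

```python
import re

# Hypothetical subset of Sam: whole-file address "," followed by an
# x/REGEX/ loop whose body is either c/TEXT/ (change) or d (delete).
_COMMAND = re.compile(
    r"^,x/((?:[^/\\]|\\.)*)/\s*(?:c/((?:[^/\\]|\\.)*)/|(d))$"
)


def tool_edit(text, command):
    """Apply one command from the subset above to text and return the result."""
    match = _COMMAND.match(command.strip())
    if match is None:
        raise ValueError(f"unsupported command: {command!r}")
    pattern, replacement, delete = match.groups()
    # "\/" inside a field stands for a literal slash.
    pattern = pattern.replace("\\/", "/")
    new_text = "" if delete else replacement.replace("\\/", "/")
    # A function replacement keeps new_text literal (no backreference expansion).
    return re.sub(pattern, lambda _m: new_text, text)
```

For example, `tool_edit(src, ",x/foo/ c/bar/")` replaces every match of `foo` with `bar`, and anything outside the subset is rejected instead of guessed at.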

cyanydeez•1h ago
Building deterministic tools on non-determinism is hard enough; try adding another layer where your cloud provider decides to massage the context, realigns its permitted output, arbitrarily downgrades context to cheaper models, or hires an MBA who determines your plan value can be tied to a degraded model under a new shrinkflated plan.

It's amazing that anyone who watched the last two decades of tech's enshittification wants to hitch their wagon to this shitshow.

lukasco•1h ago
It sounds like harnesses might have to start having model-by-model system prompts, though retrying works, I guess. It reminds me of the ancient times when browsers all read HTML and CSS differently, and differently on different devices. In that sense, this is nothing new. I was going to say that at least we don't have different device types, but then the model still has to output the right variant of `grep` as well.
dofm•1h ago
The flip side of this is training models to better understand harness interaction, I suppose, which (if I understand it properly and I am in no way sure I do) appears to be what the Qwen AgentWorld model is doing?
the_mitsuhiko•57m ago
The problem with hyper-targeting harnesses to models is that you end up locking yourself quite quickly into special behaviors of models, and you make your sessions non-transferable. That can be an acceptable trade-off and I know people who do that.
ares623•1h ago
Open source developer surprised and concerned by the trajectory their favorite proprietary software is taking.
wseqyrku•57m ago
> You can ask the model to produce valid JSON

Doesn't always work; for better performance you can kneel and start begging.
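
Short of begging, the usual fallback is to validate the reply and ask again. This is a minimal sketch of that retry loop, assuming a hypothetical `generate(prompt)` callable that returns the model's raw text; no real provider API is implied.

```python
import json


def request_json(generate, prompt, max_attempts=3):
    """Call generate(prompt) until the reply parses as JSON, or give up.

    generate is a hypothetical callable: str -> str.
    """
    current_prompt = prompt
    last_error = None
    for _ in range(max_attempts):
        raw = generate(current_prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            # Feed the parse error back so the next attempt can correct it.
            current_prompt = (
                f"{prompt}\n\nThe previous reply was not valid JSON "
                f"({exc.msg}). Reply with JSON only."
            )
    raise ValueError(f"no valid JSON after {max_attempts} attempts") from last_error
```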

_doctor_love•44m ago
This makes sense to me, much as I don't like it. IMHO the strategy taken by StrongDM's attractor coding agent seems like a path of least resistance. Directly target the LLM providers' APIs and directly target their default tools.
sestep•42m ago
> In case you are curious about Fable: I intentionally did not test it because I was not sure if the classifiers they are running might downgrade me to Opus silently.

Is this still a thing? I thought Anthropic walked back the silent downgrades so now all the different domains downgrade non-silently.

resonious•32m ago
Claude Code downgrades loudly, but I'm not sure what happens over the API or with other harnesses, OpenRouter, etc.
socketcluster•22m ago
When building agent integration for my serverless backend https://saasufy.com/, I decided to not use MCP but to put curl commands inside skill markdown files instead: https://github.com/Saasufy/skills

The curl command is extremely popular so models seem to be really good at using it.

Also I like that curl uses a bash syntax and my platform requires JSON payloads; it makes the separation clear to the agent. I find it to be very reliable.
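
To illustrate the shell/JSON separation mentioned above, here is a small sketch that renders a curl command with a JSON body, quoted so the shell passes it through unchanged. It is not taken from the linked skills repository: the helper name `curl_command` and the example URL are assumptions, and only standard curl flags (`-sS`, `-X`, `-H`, `-d`) are used.

```python
import json
import shlex


def curl_command(url, payload):
    """Render a POST curl command whose body is payload encoded as JSON."""
    body = json.dumps(payload, separators=(",", ":"))
    args = [
        "curl", "-sS",
        "-X", "POST",
        "-H", "Content-Type: application/json",
        "-d", body,
        url,
    ]
    # shlex.join quotes each argument so the JSON survives shell parsing.
    return shlex.join(args)


# Hypothetical endpoint, for illustration only.
print(curl_command("https://example.com/api/items", {"name": "it's a test"}))
```

The JSON is produced by `json.dumps` and the shell quoting by `shlex.join`, so neither layer has to know about the other's escaping rules.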
