Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview

https://github.com/dirac-run/dirac
74•GodelNumbering•1h ago
Scored 65.2% vs. Google's official 47.8% and 64.3% for Junie CLI, the existing top closed-source entry.

Since there have been a lot of reports of deliberate cheating on TerminalBench 2.0 lately (https://debugml.github.io/cheating-agents/), I would also like to clarify a few things:

1. Absolutely no {agents/skills}.md files were inserted at any point. No cheating mechanisms whatsoever

2. The CLI agent was run in a leaderboard-compliant way (no modification of resources or timeouts)

3. The full terminal bench run was done using the fully open source version of the agent, no difference between what is on github and what was run.

I was originally going to wait for it to land on the leaderboard, but it has been 8 days and the maintainers have unfortunately not responded (there is a large backlog of pull requests on their HF), so I decided to post anyway.

HF PR: https://huggingface.co/datasets/harborframework/terminal-ben...

It is astounding how much the harness matters, based on this and other experiments I have done.

Comments

aetherspawn•1h ago
Sorry I couldn’t really figure out if this was a harness, a fine tuned model, or both. Can we use Qwen with this for example? Is the performance expected to be better in that case?
GodelNumbering•1h ago
The model was the default gemini-3-flash-preview.

Harness was https://www.npmjs.com/package/dirac-cli

Since Dirac is a heavily modified fork of Cline, it supports all the models Cline supports, including Qwen and all popular open/closed models.

As a matter of fact, I am trying to run Terminal Bench 2.0 using some OSS models at the moment, but the slow inference speeds are causing tasks to time out.

GodelNumbering•1h ago
Interesting things Dirac does:

1. Uses an optimized version of Hash-Anchored edits for file editing (https://dirac.run/posts/hash-anchors-myers-diff-single-token)

2. Uses the language's AST to decide what to fetch into context, and entirely avoids reading large code files

3. Batches all operations. Does a large number of reads/edits simultaneously (you can see a video demo for deepseek-v4-flash here https://www.reddit.com/r/LocalLLaMA/comments/1suhdki/tested_...)

4. Allows the model to execute code to analyze things on the fly, so the model can simply write a bash/python/perl script to accomplish things where appropriate

5. A lot of context curation and opportunistic context updates, i.e. putting into context anything that you are certain the model would ask for next
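
To make item 1 concrete, here is a minimal Python sketch of the general idea behind hash-anchored edits: each line shown to the model carries a short hash of its content, and an edit is applied only if the hash it cites still matches the file. This is an illustration, not Dirac's implementation (the linked post describes an optimized variant). The function names, the `lineno:hash|` display format and the 4-character hash length are all assumptions made for this example.

```python
import hashlib


def line_hash(text: str) -> str:
    """Short content hash used as an anchor (4 hex chars is an arbitrary choice)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:4]


def render_with_anchors(lines: list[str]) -> list[str]:
    """What the model would be shown: each line prefixed with 'lineno:hash|'."""
    return [f"{i}:{line_hash(line)}|{line}" for i, line in enumerate(lines, start=1)]


def apply_edit(lines: list[str], lineno: int, anchor: str, new_text: str) -> list[str]:
    """Replace one line, but only if its current content still matches the anchor."""
    if not 1 <= lineno <= len(lines):
        raise ValueError(f"line {lineno} is out of range")
    if line_hash(lines[lineno - 1]) != anchor:
        # The file changed since the model last read it, so refuse the edit.
        raise ValueError(f"stale anchor for line {lineno}; re-read the file")
    updated = list(lines)
    updated[lineno - 1] = new_text
    return updated


source = ["def area(r):", "    return 3.14 * r * r"]
anchor = line_hash(source[1])  # the model copies this from the rendered view
edited = apply_edit(source, 2, anchor, "    return math.pi * r * r")
```

The point of the anchor is that the model never has to reproduce the old text verbatim to identify an edit location, and an edit aimed at a line that has since changed fails loudly instead of silently landing in the wrong place.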

deskamess•57m ago
I always wondered why ASTs were not more of a part of both editing and scoping of changes/parsing code. I thought I read an article where they said 'grep' was just as effective. It kinda made sense for the case they were talking about.
GodelNumbering•49m ago
Grep is effective for the most part, except for situations like when you have huge codebases and the thing you're looking for is used in too many places both as symbol and non-symbol.

Another annoying thing about plain grep is that LLMs often end up pulling in bundled packages, where a single line can be large enough to ruin the context window.
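
One common way to guard against the huge-line problem is to cap how much of any matching line is returned. The Python sketch below shows that idea under stated assumptions: `capped_grep`, the in-memory `files` mapping and the 300-character default are hypothetical names and values chosen for illustration, and nothing here is taken from Dirac's code.

```python
import re

MAX_LINE_CHARS = 300  # hypothetical cutoff; tune to the context budget


def capped_grep(files: dict[str, str], pattern: str, max_chars: int = MAX_LINE_CHARS):
    """Search file contents for a regex, truncating any oversized matching line.

    `files` maps a path to its text. Returns (path, line number, line) tuples.
    """
    regex = re.compile(pattern)
    hits = []
    for path in sorted(files):
        for lineno, line in enumerate(files[path].splitlines(), start=1):
            if not regex.search(line):
                continue
            if len(line) > max_chars:
                dropped = len(line) - max_chars
                line = f"{line[:max_chars]} ... [{dropped} chars truncated]"
            hits.append((path, lineno, line))
    return hits


example_files = {
    "src/app.py": "import os\nvalue = compute()\n",
    "dist/bundle.js": "x" * 1000 + "compute()",  # stands in for a minified bundle
}
example_hits = capped_grep(example_files, r"compute", max_chars=50)
```

With the cap in place the minified bundle still shows up as a hit, but it costs a bounded number of characters rather than the whole line.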

embedding-shape•37m ago
> Grep is effective for the most part

It's very effective in well-written and well-designed code bases, where concepts tend to be distinctly named rather than named the same as everything else, so grepping for symbols gives you good search results.

Projects where the god-object or core concepts have generic names like "Tree", "Node" or other things that are used everywhere tend to be nothing short of impossible to search with grep and friends.
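
The difference between a textual match and a symbol match can be shown in a few lines. The sketch below uses Python's standard `ast` module purely as a stand-in (Dirac itself is described later in the thread as using tree-sitter), and `text_hits` and `symbol_hits` are hypothetical helper names. It only handles class definitions and bare name references, so attribute access, imports and the like are deliberately left out.

```python
import ast

SOURCE = '''
class Node:
    """A Node in the tree."""

def build():
    label = "Node"   # string literal, not a symbol
    return Node()    # real reference to the class
'''


def text_hits(source: str, name: str) -> list[int]:
    """Line numbers where the name appears anywhere in the text (grep-style)."""
    return [i for i, line in enumerate(source.splitlines(), start=1) if name in line]


def symbol_hits(source: str, name: str) -> list[int]:
    """Line numbers where the name is a class definition or a name reference."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and node.name == name:
            hits.append(node.lineno)
        elif isinstance(node, ast.Name) and node.id == name:
            hits.append(node.lineno)
    return sorted(hits)


grep_like = text_hits(SOURCE, "Node")     # also matches the docstring and the string
symbols_only = symbol_hits(SOURCE, "Node")  # only the definition and the call
```

On a generic name like "Node" the text search also returns the docstring and the string literal, while the syntax-aware search returns only the definition and the actual use.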

tripleee•44m ago
"Hey everyone, you know that tech that so many of you mentioned has made your work miserable and you're worried might put you out of a job? I think I made it even better! And I didn't even get paid for it! Hah!"

Anyone working on this is anti-developer.

nthypes•1h ago
No CLI? Only VSCode extension?
GodelNumbering•1h ago
CLI too (you can't run tbench without a CLI, as it runs in an isolated Docker env): `npm install -g dirac-cli`
nthypes•1h ago
Can't OpenCode achieve the same by just developing this as a feature or plug-in? Like anchored edits?
mdasen•4m ago
Sure. Dirac is just a fork of the Cline harness and obviously OpenCode could take the same techniques and implement them. I don't know how difficult it would be to implement them in OpenCode, but given that Dirac and OpenCode are both open source, a future version of OpenCode could always be a re-branded Dirac (I'm sure there are ways to implement Dirac's techniques without having to completely replace OpenCode's underlying code base, but this illustrates that at the extreme, they could clearly just take Dirac in its entirety to get the same results).
martinald•1h ago
Very interesting! I've often thought static analysis could really help agents (I wrote this last summer: https://martinalderson.com/posts/claude-code-static-analysis...), but despite being hyped for LSPs in Claude Code it turned out to be very underwhelming (for many of the reasons that they can be annoying in a "real" IDE, ie static analysis starts firing mid edit and complaining and cached analysis getting stuck).

Curious to know if this has been an issue with your AST approach on larger projects?

The hash line based numbering is very interesting too (though I see on Opus 4.5+ far far fewer editing errors).

I've often thought that even if model progress stopped today, we'd still have _years_ of improvements thru harness iteration.

GodelNumbering•57m ago
Wrt LSP, it uses the default LSP mechanism of the IDE provider.

For AST, it uses tree-sitter WASMs (ships them with the package), and maintains queries (https://github.com/dirac-run/dirac/tree/master/src/services/...)

To keep performance fast, it stores the symbols DB (using sqlite) in the workspace's directory and incrementally updates it based on timestamps. Then it uses this DB to resolve symbol queries
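
A timestamp-driven symbols DB of the kind described above can be sketched as follows. This is an assumption-laden illustration rather than Dirac's schema or code: the table layout and the `open_index`, `refresh` and `lookup` names are invented for this example, and Python's `ast` module stands in for the tree-sitter queries so the sketch stays self-contained.

```python
import ast
import os
import sqlite3


def open_index(db_path: str) -> sqlite3.Connection:
    """Create (or open) the index; the two-table schema is a hypothetical one."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, mtime REAL)")
    db.execute("CREATE TABLE IF NOT EXISTS symbols (name TEXT, path TEXT, line INTEGER)")
    return db


def refresh(db: sqlite3.Connection, paths: list[str]) -> list[str]:
    """Re-index only files whose mtime differs from the stored one."""
    reindexed = []
    for path in paths:
        mtime = os.path.getmtime(path)
        row = db.execute("SELECT mtime FROM files WHERE path = ?", (path,)).fetchone()
        if row is not None and row[0] == mtime:
            continue  # unchanged since the last pass, keep the cached symbols
        with open(path, encoding="utf-8") as fh:
            tree = ast.parse(fh.read())  # raises SyntaxError on unparsable files
        db.execute("DELETE FROM symbols WHERE path = ?", (path,))
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                db.execute(
                    "INSERT INTO symbols VALUES (?, ?, ?)", (node.name, path, node.lineno)
                )
        db.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (path, mtime))
        reindexed.append(path)
    db.commit()
    return reindexed


def lookup(db: sqlite3.Connection, name: str) -> list[tuple[str, int]]:
    """Resolve a symbol name to (path, line) pairs from the index."""
    query = "SELECT path, line FROM symbols WHERE name = ? ORDER BY path, line"
    return db.execute(query, (name,)).fetchall()
```

Calling `refresh` a second time with no file changes does no parsing at all, which is what keeps lookups cheap on large workspaces. The trade-off is that the index is only as fresh as the last `refresh` call.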

martinald•28m ago
Yes I understand, but do you not have issues that it drifts out of date and confuses the agents (especially on longer running tasks)?

Like even "full" Visual Studio and Resharper have issues with this. Eg, you start editing file x, 'intellisense' runs, says there are loads of errors... because you haven't finished editing yet.

Mashimo•58m ago
Interesting. Would love a comparison to pi.dev (Not Ohmypi)

How does this perform in day to day coding tasks, outside of benchmarks?

GodelNumbering•53m ago
https://github.com/dirac-run/dirac#-evals

README has an eval of 8 tasks over 7 agents (including both pi and omp). Pi-mono has the second-lowest cost across the 8 tasks (after Dirac) but occasionally produces incomplete changes.

Interestingly, the 2 tasks where pi missed some changes were both tasks that benefited from AST symbol understanding (e.g. find all instances of things that refer to this symbol and change those things). Since pi relies on bash-type tooling, it missed some occurrences.

howdareme•28m ago
Going to assume you didn't capture the data, but could you add time taken to completion for each if you have it?
bryanhogan•44m ago
If I understand correctly, this is a heavily improved Cline fork? Does that mean features such as plan and act mode are also still there?
GodelNumbering•32m ago
Yes, plan+act mode is one thing I loved about Cline!
blueTiger33•27m ago
Starred it. Will try it later. One question though, to make it simpler for me: in what tasks does this model shine, and how do you improve the score? I already use some skills to cut down CC costs, like caveman, rtk cli and a few others. Just want to understand.
GodelNumbering•8m ago
I did limited testing using Sonnet on CC vs Sonnet on Dirac. I could not confirm the costs however
snqb•26m ago
how well does it do on frontier models like Opus 4.6?
GodelNumbering•7m ago
I have only done functionality testing, no benchmark testing on Opus (decided to pay my rent instead)
redrove•16m ago
I keep trying to use dirac-cli with codex and it won't work: Error: Codex API error: Codex API request failed: 400.

Any ideas?

GodelNumbering•9m ago
Assuming you logged in with OAuth, I am guessing you are trying to use gpt-5.5?

In my tests, it worked using gpt-5.4 for me and I assumed gpt-5.5 is not available to me because I am on the free plan

Do you have the subscription that allows 5.5? If so, I can look into what changed in the API. Sorry, I rarely use OpenAI, so it is a bit of an untrodden path.

adyavanapalli•14m ago
I haven't tried it, but I'm curious why you decided to implement a whole new harness over just writing extensions in pi. From whatever I've done with pi so far, the extension api is quite extensive. Hash anchored edits, for example, can definitely be implemented in pi. Anyhow, thank you for showing us your project and will be checking it out later. Cheers!
