Show HN: jj-benchmark – Evaluating AI agents on Jujutsu version control

https://tabbyml.github.io/jj-benchmark/
4•wsxiaoys•1h ago
Hi HN, Meng from TabbyML here.

We decided to build this simply because we find Jujutsu (jj) really interesting, and many folks on our team have started trying it out recently. Since it introduces a very different workflow compared to traditional Git, we thought it would be a fun challenge to see how well current AI coding agents can actually use it.

To build this, we created a semi-automated pipeline. We used AI to research the official Jujutsu documentation and websites, which then helped us bootstrap a dataset of 63 distinct evaluation tasks. Each task includes instructions, bootstrap scripts, and tests. We then ran the evaluations using the Harbor framework and our Pochi agent.
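For readers who want a concrete picture of "instructions, bootstrap scripts, and tests", here is a minimal sketch of what one task record and its evaluation loop could look like. This is not the actual jj-benchmark or Harbor format: the `Task` fields, the `run_task` helper, and the convention that a test script passes when it exits with status 0 are all assumptions made for illustration.

```python
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative only."""
    name: str
    instruction: str   # natural-language prompt handed to the agent
    bootstrap: str     # shell script that prepares the working directory
    test: str          # shell script; exit status 0 is treated as a pass


def run_task(task: Task, agent: Callable[[str, Path], None]) -> bool:
    """Bootstrap a scratch directory, let the agent act, then run the test."""
    with tempfile.TemporaryDirectory() as workdir:
        # Set up the starting state (in a real task this might create a jj repo).
        subprocess.run(["sh", "-c", task.bootstrap], cwd=workdir, check=True)
        # The agent is any callable that receives the instruction and the directory.
        agent(task.instruction, Path(workdir))
        # The task counts as passed only if the test script exits with 0.
        result = subprocess.run(["sh", "-c", task.test], cwd=workdir)
        return result.returncode == 0


# A toy task that needs no jj installation, just to exercise the loop.
demo = Task(
    name="create-notes-file",
    instruction="Create a file named notes.txt in the working directory.",
    bootstrap="echo seed > README",
    test="test -f notes.txt",
)


def toy_agent(instruction: str, workdir: Path) -> None:
    # Stand-in for a real coding agent: it simply does what the demo task asks.
    (workdir / "notes.txt").write_text("done\n")


def idle_agent(instruction: str, workdir: Path) -> None:
    # An agent that does nothing, so the test should fail.
    return None
```

With this shape, a benchmark run is just `run_task` applied to every task for every agent, and the success rate is the fraction of tasks that return `True`.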

Some interesting insights from our initial leaderboard:

- Claude 4.6 Sonnet is the clear winner: It achieved a 92% success rate (passing 58/63 tasks), beating out Opus and OpenAI's top models. It seems exceptionally good at parsing the novel CLI rules of jj.

- The Speed vs. Accuracy Trade-off: While GPT-5.4 sits at #5 with an 81% success rate, it is incredibly fast, averaging just 77.6s per task. In contrast, Gemini-3.1-pro achieved 84% but took over 3x as long (267.6s average).

- Open Weights / Regional Models are competitive: Models like Kimi-k2.5 (79%) put up a very respectable fight on a relatively niche tool.

The benchmark isn't completely solved yet, but the fact that top models can successfully navigate a relatively new version control system by reasoning through the tasks is pretty exciting.
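As a quick sanity check on the figures quoted above, the arithmetic can be reproduced in a few lines. This is only a sketch using the numbers stated in this post (63 tasks, 58 passes, 77.6s and 267.6s average times); the function and variable names are made up for illustration and are not part of the benchmark code.

```python
TOTAL_TASKS = 63  # dataset size stated in the post


def success_rate(passed: int, total: int = TOTAL_TASKS) -> float:
    """Return the pass rate as a percentage."""
    return 100 * passed / total


# 58 of 63 tasks passed, which rounds to the quoted 92%.
top_rate = success_rate(58)

# Average seconds per task quoted for the two models being compared.
fast_avg_s = 77.6
slow_avg_s = 267.6

# How many times longer the slower model takes per task.
slowdown = slow_avg_s / fast_avg_s

print(f"success rate: {top_rate:.1f}%")   # prints "success rate: 92.1%"
print(f"slowdown: {slowdown:.2f}x")       # prints "slowdown: 3.45x"
```

So "over 3x as long" works out to roughly 3.4x on the averages given.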

If there are specific jj edge cases you think we should add to the dataset, feel free to open up a PR!

Killing SaaS. Anatomy of a murder. How I replaced Wisprflow.ai with vibe coding

https://gpt3experiments.substack.com/p/killing-saas-the-anatomy-of-a-murder
1•nutanc•21s ago•0 comments

Ask HN: AI evaluation for an EV charger without additional installation?

1•chrisgd•51s ago•0 comments

Bubble Sorted Amen Break

https://parametricavocado.itch.io/amen-sorting
1•eieio•2m ago•1 comments

Show HN: A test harness that blocks unsafe AI actions before execution

1•celestinestudio•2m ago•0 comments

Grammarly Is Facing a Class Action Lawsuit over Its AI 'Expert Review' Feature

https://www.wired.com/story/grammarly-is-facing-a-class-action-lawsuit-over-its-ai-expert-review-...
1•laurex•3m ago•0 comments

If a web server runs websites then a corporation server? (2025)

https://interconnected.org/home/2025/03/13/homeostasis
2•alcazar•4m ago•0 comments

Linux Page Faults, MMAP, and userfaultfd for fast sandbox boot times

https://www.shayon.dev/post/2026/65/linux-page-faults-mmap-and-userfaultfd/
1•shayonj•5m ago•0 comments

Show HN: Cloud to Desktop in the Fastest Way

https://nativedesktop.com/
1•lasgawe•5m ago•0 comments

Software Maturity Wall

https://www.apolloacademy.com/software-maturity-wall/
1•akyuu•5m ago•0 comments

Fast and free coding agent written with Go

https://github.com/cheikh2shift/godex
1•cheikhshift•6m ago•0 comments

Show HN: PipeStep – Step-through debugger for GitHub Actions workflows

https://github.com/Photobombastic/pipestep
3•photobombastic•7m ago•0 comments

Apple's MacBook Neo makes repairs easier and cheaper than other MacBooks

https://arstechnica.com/gadgets/2026/03/more-modular-design-makes-macbook-neo-easier-to-fix-than-...
4•GeekyBear•9m ago•0 comments

An agentic workflow, March 2026 edition

https://twolongos.com/3/12/an-agentic-workflow-march-2026-edition/
2•suzzer99•9m ago•1 comments

Is your vet owned by private equity?

https://privateequityvet.org/vet-list/
2•hampelm•9m ago•0 comments

Show HN: LogClaw – Open-source AI SRE that auto-creates tickets from logs

https://logclaw.ai
3•Robelkidin•9m ago•0 comments

WikiCity – Where every building is a Wikipedia article

https://wikicity.app/
2•leononame•10m ago•1 comments

Harness Engineering

https://openai.com/index/harness-engineering/
4•jlas•10m ago•0 comments

A Day in the Life of an Enshittificator [video]

https://www.youtube.com/watch?v=T4Upf_B9RLQ
2•KindAndFriendly•11m ago•0 comments

Show HN: Understudy – Teach a desktop agent by demonstrating a task once

https://github.com/understudy-ai/understudy
3•bayes-song•11m ago•0 comments

Inboxscan – find every subscription hiding in your email (runs locally)

https://github.com/LakshmiSravyaVedantham/inboxscan
2•sravyavedantham•11m ago•1 comments

Ask HN: In 2026, how do you share a list of URLs to the public (or friends)?

2•wenbin•14m ago•1 comments

Work_mem: It's a Trap

https://mydbanotebook.org/posts/work_mem-its-a-trap/
2•giulianopz•15m ago•0 comments

Show HN: Fixing Agent / LLM Context Decay in VS Code with Git Worktrees

https://www.appsoftware.com/blog/fixing-agent-llm-context-decay-in-vs-code-with-git-worktrees
4•gbro3n•16m ago•0 comments

Design Tip: Enforcing Constraints Leads to Simpler, More Powerful Systems

https://www.rodriguez.today/articles/emergent-event-driven-workflows
1•birdculture•17m ago•0 comments

Show HN: I lost billable hours forgetting timers. I turned my calendar into a DB

https://www.timescanner.io/
2•sergentrif•17m ago•2 comments

Anthropic's Claude AI can respond with charts, diagrams, and other visuals now

https://www.theverge.com/ai-artificial-intelligence/893625/anthropic-claude-ai-charts-diagrams
1•newusertoday•17m ago•0 comments

Show HN: Verge Browser a self-hosted isolated browser sandbox for AI agents

https://github.com/zzzgydi/verge-browser
2•zzzgydi•18m ago•0 comments

Ask HN: How are you using personal AI assistants with local coding agents?

2•everfly•18m ago•0 comments

The Thinking Field

https://www.robpanico.com/articles/display/?entry_short=the-thinking-field
2•retrocog•19m ago•1 comments

Claude Bought Me a Car

https://www.nahtnam.com/blog/claude-bought-me-a-car
3•nahtnam•21m ago•2 comments