fp.

Really long-term task benchmark showing significant improvements in very recent models, while also showing really bad regression rates across the board.

challengerVIE•1h ago

To me using agents daily, the long term vision with maintainability in mind really makes the difference between us humans and agents, I like the idea. However evaluating long term maintainability over an average of just 500 loc changes does not sound like long term maintainability being measured here

KronisLV•48m ago

> The benchmark comprises 100 tasks, each corresponding on average to an evolution history spanning 233 days and 71 consecutive commits in a real-world code repository.

This seems like a really cool thing to benchmark! Technically it'd be possible to take GitHub repos that the AI orgs probably already have, cross-reference the code against the issues and regressions, and train/validate on that.

The dataset would need to be way bigger to get close to the likes of SWE-bench: https://www.swebench.com/original.html

"Vibe coded stuff gets hard to maintain and will end up buggy." Yeah, so make models that deal with that better, optimize for maintainability and consistency.

Cool to see Claude doing decently though!

SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via CI

Cloud VM benchmarks 2026

Show HN: Curiosity – DIY 6" Newtonian Reflector Telescope

Warn about PyPy being unmaintained

Rijksmuseum researchers discover new painting by Rembrandt van Rijn

From RGB to L*a*b* color space (2024)

How to run Qwen 3.5 locally

CasNum

Notes on Writing WASM

MonoGame: A .NET framework for making cross-platform games

A decade of Docker containers

Emacs internals: Deconstructing Lisp_Object in C (Part 2)

Dumping Lego NXT firmware off of an existing brick (2025)

Yoghurt delivery women combatting loneliness in Japan

I'm Not Consulting an LLM

Show HN: A weird thing that detects your pulse from the browser video

Autoresearch: Agents researching on single-GPU nanochat training automatically

Ask HN: Why there are no actual studies that show AI is more productive?

Best performance of a C++ singleton

Digital Iris [video]

The surprising whimsy of the Time Zone Database

In 1985 Maxell built a bunch of life-size robots for its bad floppy ad

Ten years of deploying to production

Sem – Semantic version control. Entity-level diffs on top of Git

FLASH radiotherapy's bold approach to cancer treatment

macOS code injection for fun and no profit (2024)

Files are the interface humans and agents interact with

Lisp-style C++ template meta programming

SigNoz (YC W21) is hiring for engineering, growth and product roles

To the Polypropylene Makers

SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via CI

Comments

SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via CI

Cloud VM benchmarks 2026

Show HN: Curiosity – DIY 6" Newtonian Reflector Telescope

Warn about PyPy being unmaintained

Rijksmuseum researchers discover new painting by Rembrandt van Rijn

From RGB to L*a*b* color space (2024)

How to run Qwen 3.5 locally

CasNum

Notes on Writing WASM

MonoGame: A .NET framework for making cross-platform games

A decade of Docker containers

Emacs internals: Deconstructing Lisp_Object in C (Part 2)

Dumping Lego NXT firmware off of an existing brick (2025)

Yoghurt delivery women combatting loneliness in Japan

I'm Not Consulting an LLM

Show HN: A weird thing that detects your pulse from the browser video

Autoresearch: Agents researching on single-GPU nanochat training automatically

Ask HN: Why there are no actual studies that show AI is more productive?

Best performance of a C++ singleton

Digital Iris [video]

The surprising whimsy of the Time Zone Database

In 1985 Maxell built a bunch of life-size robots for its bad floppy ad

Ten years of deploying to production

Sem – Semantic version control. Entity-level diffs on top of Git

FLASH radiotherapy's bold approach to cancer treatment

macOS code injection for fun and no profit (2024)

Files are the interface humans and agents interact with

Lisp-style C++ template meta programming

SigNoz (YC W21) is hiring for engineering, growth and product roles

To the Polypropylene Makers

From RGB to Lab* color space (2024)

From RGB to Lab* color space (2024)