frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Queueing Theory v2: DORA metrics, queue-of-queues, chi-alpha-beta-sigma notation

https://github.com/joelparkerhenderson/queueing-theory
1•jph•10m ago•0 comments

Show HN: Hibana – choreography-first protocol safety for Rust

https://hibanaworks.dev/
3•o8vm•12m ago•0 comments

Haniri: A live autonomous world where AI agents survive or collapse

https://www.haniri.com
1•donangrey•13m ago•1 comments

GPT-5.3-Codex System Card [pdf]

https://cdn.openai.com/pdf/23eca107-a9b1-4d2c-b156-7deb4fbc697c/GPT-5-3-Codex-System-Card-02.pdf
1•tosh•26m ago•0 comments

Atlas: Manage your database schema as code

https://github.com/ariga/atlas
1•quectophoton•29m ago•0 comments

Geist Pixel

https://vercel.com/blog/introducing-geist-pixel
1•helloplanets•32m ago•0 comments

Show HN: MCP to get latest dependency package and tool versions

https://github.com/MShekow/package-version-check-mcp
1•mshekow•39m ago•0 comments

The better you get at something, the harder it becomes to do

https://seekingtrust.substack.com/p/improving-at-writing-made-me-almost
2•FinnLobsien•41m ago•0 comments

Show HN: WP Float – Archive WordPress blogs to free static hosting

https://wpfloat.netlify.app/
1•zizoulegrande•42m ago•0 comments

Show HN: I Hacked My Family's Meal Planning with an App

https://mealjar.app
1•melvinzammit•43m ago•0 comments

Sony BMG copy protection rootkit scandal

https://en.wikipedia.org/wiki/Sony_BMG_copy_protection_rootkit_scandal
1•basilikum•45m ago•0 comments

The Future of Systems

https://novlabs.ai/mission/
2•tekbog•46m ago•1 comments

NASA now allowing astronauts to bring their smartphones on space missions

https://twitter.com/NASAAdmin/status/2019259382962307393
2•gbugniot•51m ago•0 comments

Claude Code Is the Inflection Point

https://newsletter.semianalysis.com/p/claude-code-is-the-inflection-point
3•throwaw12•52m ago•1 comments

Show HN: MicroClaw – Agentic AI Assistant for Telegram, Built in Rust

https://github.com/microclaw/microclaw
1•everettjf•52m ago•2 comments

Show HN: Omni-BLAS – 4x faster matrix multiplication via Monte Carlo sampling

https://github.com/AleatorAI/OMNI-BLAS
1•LowSpecEng•53m ago•1 comments

The AI-Ready Software Developer: Conclusion – Same Game, Different Dice

https://codemanship.wordpress.com/2026/01/05/the-ai-ready-software-developer-conclusion-same-game...
1•lifeisstillgood•55m ago•0 comments

AI Agent Automates Google Stock Analysis from Financial Reports

https://pardusai.org/view/54c6646b9e273bbe103b76256a91a7f30da624062a8a6eeb16febfe403efd078
1•JasonHEIN•58m ago•0 comments

Voxtral Realtime 4B Pure C Implementation

https://github.com/antirez/voxtral.c
2•andreabat•1h ago•1 comments

I Was Trapped in Chinese Mafia Crypto Slavery [video]

https://www.youtube.com/watch?v=zOcNaWmmn0A
2•mgh2•1h ago•0 comments

U.S. CBP Reported Employee Arrests (FY2020 – FYTD)

https://www.cbp.gov/newsroom/stats/reported-employee-arrests
1•ludicrousdispla•1h ago•0 comments

Show HN: I built a free UCP checker – see if AI agents can find your store

https://ucphub.ai/ucp-store-check/
2•vladeta•1h ago•1 comments

Show HN: SVGV – A Real-Time Vector Video Format for Budget Hardware

https://github.com/thealidev/VectorVision-SVGV
1•thealidev•1h ago•0 comments

Study of 150 developers shows AI generated code no harder to maintain long term

https://www.youtube.com/watch?v=b9EbCb5A408
2•lifeisstillgood•1h ago•0 comments

Spotify now requires premium accounts for developer mode API access

https://www.neowin.net/news/spotify-now-requires-premium-accounts-for-developer-mode-api-access/
1•bundie•1h ago•0 comments

When Albert Einstein Moved to Princeton

https://twitter.com/Math_files/status/2020017485815456224
1•keepamovin•1h ago•0 comments

Agents.md as a Dark Signal

https://joshmock.com/post/2026-agents-md-as-a-dark-signal/
2•birdculture•1h ago•1 comments

System time, clocks, and their syncing in macOS

https://eclecticlight.co/2025/05/21/system-time-clocks-and-their-syncing-in-macos/
1•fanf2•1h ago•0 comments

McCLIM and 7GUIs – Part 1: The Counter

https://turtleware.eu/posts/McCLIM-and-7GUIs---Part-1-The-Counter.html
2•ramenbytes•1h ago•0 comments

So whats the next word, then? Almost-no-math intro to transformer models

https://matthias-kainer.de/blog/posts/so-whats-the-next-word-then-/
1•oesimania•1h ago•0 comments
Open in hackernews

Terminal-Bench: a benchmark for AI agents in terminal environments

https://www.tbench.ai/
17•mikemerrill•8mo ago

Comments

mikemerrill•8mo ago
We (an open community of AI researchers at Stanford, Anthropic, UW, and more) just released Terminal-Bench, a new open-source framework for evaluating how well AI agents perform in terminal environments. Given how much we all use the terminal and how many new AI terminal assistants are emerging, we wanted to create a rigorous way to test their capabilities.

What we found: The best commercial agents (using models like GPT-4, Claude, Gemini) score less than 20% on our benchmark tasks. Even with their impressive capabilities, these agents struggle with: - Chaining multiple terminal commands together - Reasoning over long command outputs - Acting independently within sensible limits - Executing tasks safely

What's in Terminal-Bench: - Docker-containerized environments for consistent testing - Hand-crafted tasks covering data science, networking, security, and more - Human-verified solutions and test cases - Support for different integration methods

Want to get involved? We're looking for contributors to help expand the benchmark with new challenging tasks. If you've got scenarios where current AI agents fail in the terminal, we'd love to include them!

Check out our website: https://tbench.ai Join our Discord: https://discord.gg/6xWPKhGDbA

What terminal tasks do you wish AI agents could handle better?

kristopolous•8mo ago
Their terminus approach is petty similar to my "dui mode" here: https://github.com/day50-dev/llmehelp ... it's still not great except for basic investigations. I think a hybrid approach would be better.

There's certainly some litellm hacking that will improve things. I'm absolutely convinced of that. The proxy is pretty hard to use though. I keep making glacial progress on it.

mikemerrill•8mo ago
Agreed that it's not the best way to use agents right now (they still need supervision) but I think in the coming year(s) we'll reach a point where they'll be good enough to run on their own (see Codex).

If you're interested in working on this, we love to see new contributors in the Discord https://discord.gg/6xWPKhGDbA