frontpage.

Postgres Message Queue (PGMQ)

https://github.com/pgmq/pgmq
1•Lwrless•1m ago•0 comments

Show HN: Django-rclone: Database and media backups for Django, powered by rclone

https://github.com/kjnez/django-rclone
1•cui•4m ago•1 comments

NY lawmakers proposed statewide data center moratorium

https://www.niagara-gazette.com/news/local_news/ny-lawmakers-proposed-statewide-data-center-morat...
1•geox•5m ago•0 comments

OpenClaw AI chatbots are running amok – these scientists are listening in

https://www.nature.com/articles/d41586-026-00370-w
2•EA-3167•6m ago•0 comments

Show HN: AI agent forgets user preferences every session. This fixes it

https://www.pref0.com/
4•fliellerjulian•8m ago•0 comments

Introduce the Vouch/Denouncement Contribution Model

https://github.com/ghostty-org/ghostty/pull/10559
2•DustinEchoes•10m ago•0 comments

Show HN: SSHcode – Always-On Claude Code/OpenCode over Tailscale and Hetzner

https://github.com/sultanvaliyev/sshcode
1•sultanvaliyev•10m ago•0 comments

Microsoft appointed a quality czar. He has no direct reports and no budget

https://jpcaparas.medium.com/microsoft-appointed-a-quality-czar-he-has-no-direct-reports-and-no-b...
1•RickJWagner•12m ago•0 comments

Multi-agent coordination on Claude Code: 8 production pain points and patterns

https://gist.github.com/sigalovskinick/6cc1cef061f76b7edd198e0ebc863397
1•nikolasi•12m ago•0 comments

Washington Post CEO Will Lewis Steps Down After Stormy Tenure

https://www.nytimes.com/2026/02/07/technology/washington-post-will-lewis.html
4•jbegley•13m ago•0 comments

DevXT – Building the Future with AI That Acts

https://devxt.com
2•superpecmuscles•14m ago•4 comments

A Minimal OpenClaw Built with the OpenCode SDK

https://github.com/CefBoud/MonClaw
1•cefboud•14m ago•0 comments

The silent death of Good Code

https://amit.prasad.me/blog/rip-good-code
3•amitprasad•14m ago•0 comments

The Internal Negotiation You Have When Your Heart Rate Gets Uncomfortable

https://www.vo2maxpro.com/blog/internal-negotiation-heart-rate
1•GoodluckH•16m ago•0 comments

Show HN: Glance – Fast CSV inspection for the terminal (SIMD-accelerated)

https://github.com/AveryClapp/glance
2•AveryClapp•17m ago•0 comments

Busy for the Next Fifty to Sixty Bud

https://pestlemortar.substack.com/p/busy-for-the-next-fifty-to-sixty-had-all-my-money-in-bitcoin-...
1•mithradiumn•18m ago•0 comments

Imperative

https://pestlemortar.substack.com/p/imperative
1•mithradiumn•19m ago•0 comments

Show HN: I decomposed 87 tasks to find where AI agents structurally collapse

https://github.com/XxCotHGxX/Instruction_Entropy
1•XxCotHGxX•22m ago•1 comments

I went back to Linux and it was a mistake

https://www.theverge.com/report/875077/linux-was-a-mistake
3•timpera•24m ago•1 comments

Octrafic – open-source AI-assisted API testing from the CLI

https://github.com/Octrafic/octrafic-cli
1•mbadyl•25m ago•1 comments

US Accuses China of Secret Nuclear Testing

https://www.reuters.com/world/china/trump-has-been-clear-wanting-new-nuclear-arms-control-treaty-...
2•jandrewrogers•26m ago•1 comments

Peacock. A New Programming Language

2•hashhooshy•30m ago•1 comments

A postcard arrived: 'If you're reading this I'm dead, and I really liked you'

https://www.washingtonpost.com/lifestyle/2026/02/07/postcard-death-teacher-glickman/
3•bookofjoe•32m ago•1 comments

What to know about the software selloff

https://www.morningstar.com/markets/what-know-about-software-stock-selloff
2•RickJWagner•35m ago•0 comments

Show HN: Syntux – generative UI for websites, not agents

https://www.getsyntux.com/
3•Goose78•36m ago•0 comments

Microsoft appointed a quality czar. He has no direct reports and no budget

https://jpcaparas.medium.com/ab75cef97954
2•birdculture•36m ago•0 comments

AI overlay that reads anything on your screen (invisible to screen capture)

https://lowlighter.app/
1•andylytic•38m ago•1 comments

Show HN: Seafloor, be up and running with OpenClaw in 20 seconds

https://seafloor.bot/
1•k0mplex•38m ago•0 comments

Tesla turbine-inspired structure generates electricity using compressed air

https://techxplore.com/news/2026-01-tesla-turbine-generates-electricity-compressed.html
2•PaulHoule•39m ago•0 comments

State Department deleting 17 years of tweets (2009-2025); preservation needed

https://www.npr.org/2026/02/07/nx-s1-5704785/state-department-trump-posts-x
5•sleazylice•39m ago•2 comments

AI Coding: A Sober Review

https://www.ubicloud.com/blog/ai-coding-a-sober-review
18•furkansahin•4mo ago

Comments

CuriouslyC•4mo ago
A vibe article on vibe coding.
softwaredoug•4mo ago
This space is filled with personal anecdotes and studies from providers. It's hard to get objective perspectives from independent labs.
troupo•4mo ago
It's hard to go beyond anecdotes because it's impossible to measure outcomes objectively.
CuriouslyC•4mo ago
Is it? Tests turning green seems pretty objective, as do time/tokens to green, code delta size, patch performance, etc. Not sure why people have such a hard time with agent evals.

Just remember to keep a holdout test set for validation.
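
Roughly the kind of per-run record I mean, as a sketch (the names here are hypothetical, not any particular harness's API):

    from dataclasses import dataclass

    @dataclass
    class AgentRunResult:
        # One trace: starting prompt -> agent stops or opens a PR
        passed: bool                     # did the held-out tests go green?
        seconds_to_green: float | None   # wall-clock time; None if it never passed
        tokens_used: int                 # total tokens spent by the agent
        lines_changed: int               # size of the code delta

    def pass_rate(runs: list[AgentRunResult]) -> float:
        # Fraction of runs where the held-out test suite passed
        return sum(r.passed for r in runs) / len(runs)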

troupo•4mo ago
> Is it?

Yes. You are "testing" a non-deterministic black box, and we usually know nothing about the code base, the prompts, the tasks etc.

Which is further complicated by whatever wrapper you're using (cursor/amp/windsurf/opencode/whatever).

Which is further complicated by the "oops we nerfed the model, but it was bug trust us".

> Tests turning green seems pretty objective, as do time/tokens to green, code delta size, patch performance, etc. Not sure why people have such a hard time with agent evals.

What is the distribution of results when you run the same test on the same model with the same prompt, and how does that distribution shift over time?

I've already had several instances when the same model with the same prompt on the same code would produce completely different results.

CuriouslyC•4mo ago
You can construct or curate code bases (parametric construction is cheaper and gives you 100% knowledge).

You are testing a series of traces from starting prompt -> agent stops or creates a PR. Your signal is %pass + time to green + code metrics as I said.

You can control for the model and for drift by doing bootstraps on individual repo evals to get a distribution; any model nerf will show up in statistical tests.

Capturing a distribution is the whole point. I run my agent evals 20x on a given problem for this exact reason. This way you can tune prompts: not only do you get your average improvement in pass rate and time to green, but you can also see the shape of the distribution and optionally tune for things like maximum error magnitude, which point statistics won't show you.
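
As a rough illustration with made-up numbers (not my actual harness), the bootstrap step looks something like this:

    import random
    import statistics

    # Time-to-green (seconds) from 20 runs of the same prompt on the same
    # problem -- made-up numbers, just to show the shape of the analysis.
    times_to_green = [41, 38, 95, 40, 39, 44, 37, 120, 42, 43,
                      39, 41, 36, 40, 88, 42, 38, 41, 39, 45]

    # Bootstrap the mean so model drift shows up as a shifted distribution
    # rather than being hidden behind a single point estimate.
    boot_means = [
        statistics.mean(random.choices(times_to_green, k=len(times_to_green)))
        for _ in range(10_000)
    ]
    ci = statistics.quantiles(boot_means, n=40)   # cut points at 2.5%, 5%, ..., 97.5%
    print(f"mean {statistics.mean(times_to_green):.1f}s, "
          f"~95% CI {ci[0]:.1f}-{ci[-1]:.1f}s, worst run {max(times_to_green)}s")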

If you want to talk about how to eval in more depth, share your specific case and I'll help you set it up.

troupo•4mo ago
You have either too much time, or too much money, or both, to curate code bases, run 20x agent evals on those curated codebases, and spend time micro-optimising your agents... for those curated codebases. The moment you step outside of those curated codebases and run the agents against non-curated ones?

Well, no one knows. They may or may not work because the actual codebase may be similar to, or may be completely different from the curated one.

And how do I know that it may not work? Well, let's turn to our friends at Anthropic: https://www.anthropic.com/engineering/a-postmortem-of-three-...

--- start quote ---

When Claude generates text, it calculates probabilities for each possible next word, then randomly chooses a sample from this probability distribution. We use "top-p sampling" to avoid nonsensical outputs—only considering words whose cumulative probability reaches a threshold (typically 0.99 or 0.999). On TPUs, our models run across multiple chips, with probability calculations happening in different locations. To sort these probabilities, we need to coordinate data between chips, which is complex

--- end quote ---

So it's a probabilistic next word (which is quite likely to be different for a non-curated codebase), and there's top-p sampling, and then the complex sorting of probabilities, and on top of that are all the changes, bugs, limits and input/output transforms that Anthropic introduces.
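
For reference, that top-p step boils down to roughly this (a toy sketch, obviously nothing like Anthropic's multi-chip TPU implementation):

    import random

    def sample_top_p(probs: dict[str, float], p: float = 0.99) -> str:
        # Rank candidate next words by probability, highest first
        ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
        # Keep the smallest prefix whose cumulative probability reaches p
        kept, total = [], 0.0
        for word, prob in ranked:
            kept.append((word, prob))
            total += prob
            if total >= p:
                break
        # Sample randomly from the surviving candidates, weighted by probability
        words, weights = zip(*kept)
        return random.choices(words, weights=weights, k=1)[0]

    print(sample_top_p({"the": 0.5, "a": 0.3, "this": 0.15, "banana": 0.05}))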

> share your specific case and I'll help you set it up.

I have several side projects in Elixir. And at work we're developing a product that runs across 14 different (similar, but different enough) platforms using the company's proprietary services.

It's especially funny to see the claims of "oh just one more fine-tuning, bro, and everything will be a gazillion times better" when I have already used and found issues with every "diligently researched", "guaranteed eval'ed" hype tool under the sun. This is just one of the results: https://x.com/dmitriid/status/1967306828418818217

Yours are unlikely to be any different.

shikharbhardwaj•4mo ago
Hi! Author of the blog post here.

I completely agree: getting an objective measure of the developer experience with these various tools is not easy. On one hand, you have a series of benchmarks from LLM providers; while they reflect some degree of fitness for specific tasks, they often fail to translate to real-world usage. On the other hand, you have the tool providers with their differing features and product claims, plus user anecdotes for very different use-cases.

The attempt with this post was to summarize my experience across some of these tools and highlight the specific features that worked better for me than others. Given how quickly things are changing in this space, the primary conclusion is that the best approach right now is to use a tool day-to-day, discover its strengths and deficiencies, and work to eliminate the deficiencies you hit most often.

ozgune•4mo ago
(Disclaimer: Ozgun from Ubicloud)

I agree with you. I feel the challenge is that using AI coding tools is still an art, and not a science. That's why we see many qualitative studies that sometimes conflict with each other.

In this case, we found the following interesting. That's why we nudged Shikhar to blog about his experience and put a disclaimer at the top.

* Our codebase is in Ruby and follows a design pattern that's uncommon in the industry
* We don't have a horse in this game
* I haven't seen an evaluation that evaluates coding tools along the (a) coding, (b) testing, and (c) debugging dimensions

ExxKA•4mo ago
I am none the wiser. How do I get my 5 minutes back?
GardenLetter27•4mo ago
This reads like an advert for Continue.dev
willahmad•4mo ago
Here's my experience with these tools:

Good: I can prototype things very quickly thanks to these tools

Bad: After a couple of vibe coding iterations, I don't have a mental model of the project.

Good: When I open my past projects where I have very good mental models, I can come up with a nice prompt and build anything quickly again.

Bad: After a couple of iterations I become lazy, and eventually my mental models break.

There's definitely a use for these tools. But be careful: an engineer's job is not only writing code but also training their memory to build solutions and bridge real-world problems with software solutions. If you lose this skill of thinking, you will become obsolete quickly.

accrual•4mo ago
This matches my experience as well. When I'm working on a codebase that I started and know well, it feels like magic to chat with an AI and watch patches appear on the screen to accept or deny. I accept only about 50% of the AI patches without tweaks, because it's my project and I care about keeping it on the track I laid out.

When I'm vibe coding something from scratch I don't have the mental model, I don't always review everything closely, and eventually it becomes an "AI project" that I'm just making requests against to hopefully achieve my goal.

softwaredoug•4mo ago
And when you lose your mental model it’s harder to prompt the LLM for good code.