frontpage.

CLI for Common Playwright Actions

https://github.com/microsoft/playwright-cli
1•saikatsg•58s ago•0 comments

Would you use an e-commerce platform that shares transaction fees with users?

https://moondala.one/
1•HamoodBahzar•2m ago•1 comment

Show HN: SafeClaw – a way to manage multiple Claude Code instances in containers

https://github.com/ykdojo/safeclaw
2•ykdojo•5m ago•0 comments

The Future of the Global Open-Source AI Ecosystem: From DeepSeek to AI+

https://huggingface.co/blog/huggingface/one-year-since-the-deepseek-moment-blog-3
3•gmays•6m ago•0 comments

The Evolution of the Interface

https://www.asktog.com/columns/038MacUITrends.html
2•dhruv3006•7m ago•0 comments

Azure: Virtual network routing appliance overview

https://learn.microsoft.com/en-us/azure/virtual-network/virtual-network-routing-appliance-overview
2•mariuz•7m ago•0 comments

Seedance2 – multi-shot AI video generation

https://www.genstory.app/story-template/seedance2-ai-story-generator
2•RyanMu•11m ago•1 comment

Πfs – The Data-Free Filesystem

https://github.com/philipl/pifs
2•ravenical•14m ago•0 comments

Go-busybox: A sandboxable port of busybox for AI agents

https://github.com/rcarmo/go-busybox
3•rcarmo•15m ago•0 comments

Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery [pdf]

https://research.nvidia.com/labs/nemotron/files/NVFP4-QAD-Report.pdf
2•gmays•16m ago•0 comments

xAI Merger Poses Bigger Threat to OpenAI, Anthropic

https://www.bloomberg.com/news/newsletters/2026-02-03/musk-s-xai-merger-poses-bigger-threat-to-op...
2•andsoitis•16m ago•0 comments

Atlas Airborne (Boston Dynamics and RAI Institute) [video]

https://www.youtube.com/watch?v=UNorxwlZlFk
2•lysace•17m ago•0 comments

Zen Tools

http://postmake.io/zen-list
2•Malfunction92•19m ago•0 comments

Is the Detachment in the Room? – Agents, Cruelty, and Empathy

https://hailey.at/posts/3mear2n7v3k2r
2•carnevalem•20m ago•0 comments

The purpose of Continuous Integration is to fail

https://blog.nix-ci.com/post/2026-02-05_the-purpose-of-ci-is-to-fail
1•zdw•22m ago•0 comments

Apfelstrudel: Live coding music environment with AI agent chat

https://github.com/rcarmo/apfelstrudel
2•rcarmo•23m ago•0 comments

What Is Stoicism?

https://stoacentral.com/guides/what-is-stoicism
3•0xmattf•23m ago•0 comments

What happens when a neighborhood is built around a farm

https://grist.org/cities/what-happens-when-a-neighborhood-is-built-around-a-farm/
1•Brajeshwar•24m ago•0 comments

Every major galaxy is speeding away from the Milky Way, except one

https://www.livescience.com/space/cosmology/every-major-galaxy-is-speeding-away-from-the-milky-wa...
2•Brajeshwar•24m ago•0 comments

Extreme Inequality Presages the Revolt Against It

https://www.noemamag.com/extreme-inequality-presages-the-revolt-against-it/
2•Brajeshwar•24m ago•0 comments

There's no such thing as "tech" (Ten years later)

1•dtjb•25m ago•0 comments

What Really Killed Flash Player: A Six-Year Campaign of Deliberate Platform Work

https://medium.com/@aglaforge/what-really-killed-flash-player-a-six-year-campaign-of-deliberate-p...
1•jbegley•25m ago•0 comments

Ask HN: Anyone orchestrating multiple AI coding agents in parallel?

1•buildingwdavid•27m ago•0 comments

Show HN: Knowledge-Bank

https://github.com/gabrywu-public/knowledge-bank
1•gabrywu•32m ago•0 comments

Show HN: The Codeverse Hub Linux

https://github.com/TheCodeVerseHub/CodeVerseLinuxDistro
3•sinisterMage•33m ago•2 comments

Take a trip to Japan's Dododo Land, the most irritating place on Earth

https://soranews24.com/2026/02/07/take-a-trip-to-japans-dododo-land-the-most-irritating-place-on-...
2•zdw•33m ago•0 comments

British drivers over 70 to face eye tests every three years

https://www.bbc.com/news/articles/c205nxy0p31o
48•bookofjoe•34m ago•19 comments

BookTalk: A Reading Companion That Captures Your Voice

https://github.com/bramses/BookTalk
1•_bramses•35m ago•0 comments

Is AI "good" yet? – tracking HN's sentiment on AI coding

https://www.is-ai-good-yet.com/#home
3•ilyaizen•35m ago•1 comment

Show HN: Amdb – Tree-sitter based memory for AI agents (Rust)

https://github.com/BETAER-08/amdb
1•try_betaer•36m ago•0 comments

Show HN: Pi Co-pilot – Evaluation of AI apps made easy

https://withpi.ai/
34•achintms•8mo ago
Hey HN — 2 months ago we shared our first product with the HN community (https://news.ycombinator.com/item?id=43362535). Despite receiving lots of traffic from HN, we didn’t see any traction or retention. One of our major takeaways was that our product was too complicated. So we’ve spent the last 2 months iterating towards a much more focused product that tries to do just one thing really well. Today, we’d like to share our second launch with HN.

Our original idea was to help software engineers build high-quality LLM applications by integrating their domain knowledge into a scoring system, which could then drive everything from prompt tuning to fine-tuning, RL, and data filtering. But what we quickly learned (with the help of HN – thank you!) is that most people aren’t optimizing as their first, second, or even third step — they’re just trying to ship something reasonable using system prompts and off-the-shelf models.

In looking to build a product that’s useful to a wider audience, we found one piece of the original product that most people _did_ notice and want: the ability to check that the outputs of their AI apps look good. Whether you’re tweaking a prompt, switching models, or just testing a feature, you still need a way to catch regressions and evaluate your changes. Beyond basic correctness, developers also wanted to measure more subtle qualities — like whether a response feels friendly.

So we rebuilt the product around this single use case: helping developers define and apply subjective, nuanced evals to their LLM outputs. We call it Pi Co-pilot.

You can start with any/all of the below:

- a few good/bad examples

- a system prompt, or app description

- an old eval prompt you wrote

The co-pilot helps you turn that into a scoring spec — a set of ~10–20 concrete questions that probe the output against dimensions of quality you care about (e.g. “is it verbose?”, “does it have a professional tone?”, etc). For each question, it selects either:

- a fast encoder-based model (trained for scoring) – Pi scorer. See our original post (linked in the first paragraph) for more details on why this is a good fit for scoring compared to the “LLM as a judge” pattern.

- or generates Python functions when that makes more sense (word count, regex etc.)

You iterate over examples, tweak questions, adjust scoring behavior, and quickly reach a spec that reflects your actual taste — not some generic benchmark or off-the-shelf metrics. Then you can plug the scoring system into your own workflow: Python, TypeScript, Promptfoo, Langfuse, Spreadsheets, whatever. We provide easy integrations with these systems.
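
To make the “Python function” questions concrete, here is a rough sketch of the kind of programmatic checks a spec can contain and how they could run over a couple of candidate outputs in a plain Python workflow. This is illustrative only: the function names, thresholds, and sample outputs are made up for the example and it does not use the Pi SDK.

    # Illustrative only: hand-written programmatic checks in the spirit of the
    # spec's "Python function" questions. Not generated by Pi, not the Pi SDK.
    import re

    def score_verbosity(output: str, max_words: int = 120) -> float:
        """Return 1.0 while the response stays under a word budget, scaling down as it grows."""
        words = len(output.split())
        return 1.0 if words <= max_words else max(0.0, 1.0 - (words - max_words) / max_words)

    def score_no_marketing_fluff(output: str) -> float:
        """Penalize boilerplate marketing phrases via a simple regex check."""
        fluff = re.compile(r"\b(revolutionary|game-changing|cutting-edge)\b", re.IGNORECASE)
        return 0.0 if fluff.search(output) else 1.0

    # A tiny regression check over candidate outputs, e.g. before and after a prompt tweak.
    outputs = {
        "old_prompt": "Our revolutionary tool does everything...",
        "new_prompt": "Pi Co-pilot helps you define evals from a few examples.",
    }
    for name, text in outputs.items():
        scores = {
            "concise": score_verbosity(text),
            "no_fluff": score_no_marketing_fluff(text),
        }
        print(name, scores)

The encoder-based Pi scorer handles the subjective questions (tone, helpfulness, etc.); checks like the above cover the cases where a deterministic function is simply the better tool.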

We took inspiration from tools like v0 and Bolt: natural language on the left, structured artifacts on the right. That pattern felt intuitive — explore conversationally, and let the underlying system crystallize it into things you can inspect and use (scoring spec, examples and code). Here is a loom demo of this: https://www.loom.com/share/82c2e7b511854a818e8a1f4eabb1a8c2

We’d appreciate feedback from the community on whether this second iteration of our product feels more useful. We are offering $10 of free credits (about 25M input tokens), so you can try out the Pi co-pilot for your use-cases. No sign-in required to start exploring: https://withpi.ai

Overall stack: Co-pilot is Next.js and Vercel on GCP. Models: 4o on Azure, fine-tuned Llama & ModernBERT on GCP. Training: RunPod and SFCompute.

– Achint (co-founder, Pi Labs)

Comments

jmoore15•8mo ago
Hey HN: I'm John, an engineer from Pi Labs. Despite lurking on HN daily for over 10 years, I've never posted, and now feels like a good time to change that :)

I joined Pi 3 months ago after a decade at Google. It was partly the HN community that inspired me to make the switch to a smaller company where I could have more direct impact. Working at a start-up has been quite an adjustment: while the work is extremely rewarding and fun, the pre-product/market-fit phase is challenging in ways I've never experienced before in my career.

That's why I asked the team to post here, and I'm excited to show off this launch to see whether it meets a need that developers have (or learn why if not!).

Just for fun, I spent the last 5 minutes making an evaluation system for Show HN posts, so you can look at a real example if you'd rather not make your own [1]. If you sign in, you can fork and modify it, but you can also go directly to the homepage to try your own hand at it without any sign-in.

[1] https://withpi.ai/project/Xxyhrg2UR8kZHeNmbdV3

jmoore15•8mo ago
There was a thoughtful reply here with some feedback on the scores. It seems to have been deleted while I was writing a reply. In the interest of substantive discussion, I'm posting my reply below, since I think it's still valuable information.

--

Yes, I noticed there were a few things off about this example (e.g., certain questions feel inapplicable for certain examples, or certain scores feel too optimistic for the bad example), and I intentionally left them there so as not to window dress things too much.

To add to some of your observations, I'll note that:

1. Automatic question generation runs before I've given any examples in this particular chat. This can be a positive (in that you can get started without even providing an example of your own data), but it also means we sometimes add questions that don't make sense for the data you actually have. The co-pilot is meant to be iterative for that reason. (As an example, towards the end of the chat, I do ask it to remove some questions that don't feel applicable.)

2. The model still has to output a score for all questions, even if they don't apply to a particular input. We're working on a new system that will understand which questions actually apply, and can turn certain questions off if they're impossible to answer given the inputs provided.

3. We do get feedback from users that the scores feel off sometimes. In some cases, they're too high; in others, they're too low. We're working on an interface for calibrating the scores to your own preferences, e.g. with a small amount of thumbs up/thumbs down data. There's a tradeoff here, though, because we're also trying to make the evaluation process a lot easier than today's "prompt an LLM as a judge" paradigm where writing a prompt with a rubric can take a substantial amount of time, and any kind of calibration adds friction for users.

Overall, we release new models every other week and track ourselves on internal benchmarks to see improvements to both question generation and question scoring. If you play around with the system more and find other ways it's not working as you'd expect or like, please feel free to email me your examples and we'd be happy to prioritize looking into them.
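
For readers curious what calibrating scores against a small amount of thumbs-up/thumbs-down data could look like, here is a minimal sketch of one common technique, Platt scaling (a logistic fit mapping raw scores to approval probability). This illustrates the general idea only; it is not Pi's implementation, and the feedback data below is made up.

    # Minimal illustration of calibrating raw scores with thumbs-up/down feedback
    # via Platt scaling (a logistic fit). Not Pi's implementation; data is made up.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Raw scores from a scorer (0-1) and user feedback (1 = thumbs up, 0 = thumbs down).
    raw_scores = np.array([[0.92], [0.85], [0.80], [0.65], [0.60], [0.40], [0.30], [0.10]])
    feedback   = np.array([  1,      1,      0,      1,      0,      0,      0,      0  ])

    # Fit a logistic curve mapping raw score -> probability the user would approve.
    calibrator = LogisticRegression().fit(raw_scores, feedback)

    # Calibrated approval probabilities for new outputs.
    new_raw = np.array([[0.75], [0.35]])
    print(calibrator.predict_proba(new_raw)[:, 1])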

eterm•8mo ago
I don't normally leave feedback on these posts, so I apologise if I get the tone wrong and this is overly harsh.

1. If you're going to record a demo, invest $100 into a real microphone. The sound quality of the loom demo is really off-putting. It might also be over-compression, but it gives me a headache to listen to this kind of sound.

2. The demo has left me more confused. Rather than going step by step, you take the Blue Peter approach of "Here's one I made earlier" and suddenly you're switching tabs to something different. Show me the product in action.

I guess I'm not in the market for this, but it feels UI-heavy for something that's evaluating agents / infrastructure-as-code. I'd have thought that if I was going to not just automate something, but also automate the evaluation of that automation, then I'd want a pipeline / process for that, and not actually scan down the criteria trying to work out which blog posts are which and how the scores relate.

achint_withpi•8mo ago
Thanks for your feedback on the video. Great point about going step by step, instead of switching mid-stream to a pre-built session :). We have another, even simpler version that goes slower, step by step, which might have been better for this post. The challenge has been balancing showcasing the wide feature set against keeping the duration down.

We have a spreadsheet integration (which I might post as a comment) for the use case you mentioned. The scorer is quite lightweight, so it's easy to integrate into your existing pipelines instead of building yet another pipeline/framework. The co-pilot is specifically for triangulating the right set of metrics (which are subjective, based on your taste), and that does require looking at examples a few at a time and making judgement calls. But I agree that once you're done with that, you want to quickly transition to either code or other frameworks like sheets, Promptfoo, etc.

accidc•8mo ago
I think naming things co-pilot is getting excessive. It's just diluting your potential brand with MSFT's, and the term itself seems to be meaningless now.

While MSFT may not own co-pilot, they definitely control the mindshare.

See also: https://news.ycombinator.com/item?id=42781316

dockercompost•8mo ago
Plus there's already Inflection's Pi, adding to the confusion.

bound008•8mo ago
> From the team that brought you the magic of Google Search

This feels really disingenuous. Larry and Sergey are the team that brought us Google Search.

"Magic of" is trying to hide the fact that you didn't bring Google Search to fruition. The last 5 years of Google Search do not feel magical at all.

Instead, claim credit for something that you did do with Google Search.

From looking at your LinkedIn:

CTO > Joined Google via the acquisition of ITA Matrix and worked on schema.org, amongst other very impressive things. Before that, founding team of Bing @ MSFT.

CEO > Worked on search at Google from 2018 to 2024 (6 years).

These are impressive credentials, so find a better way to showcase them.

donfotto•8mo ago
I like that you go beyond just prompt engineering and "LLM as a judge" and use fine-tuned (?) ModernBERT and Llama models.

In your previous post you mentioned that you "score 20+ dimensions". Are these generic dimensions for all use cases / users, or do you finetune individually for each user?