Show HN: Auto LLM Ranker – Describe a task in English and get ranked models

https://github.com/gauravvij/llm-evaluator
3•gauravvij137•2h ago
I got tired of picking LLMs based on vibes and leaderboards that don't reflect real workloads, so I built this.

You describe a task in plain English. The tool generates a test suite for that specific task, discovers candidate models via OpenRouter, benchmarks them in parallel, and uses a Judge LLM to score every response across 5 dimensions: accuracy, hallucination, grounding, tool-calling, and clarity.

Output is a ranked top 3 with average latency per model and a task-specific system prompt optimized for the winner.
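
To make the ranking step concrete, here is a minimal sketch of how per-response judge scores and latencies could be rolled up into a top 3. It is not taken from the repo: the `JudgedResponse` type, the `rank_models` function, the equal weighting of the five dimensions, and the latency tie-break are all assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean

# The five judge dimensions named in the post (identifier spelling is assumed).
DIMENSIONS = ("accuracy", "hallucination", "grounding", "tool_calling", "clarity")


@dataclass
class JudgedResponse:
    """One model response after judging (hypothetical record shape)."""
    model: str
    scores: dict[str, float]  # one judge score per dimension
    latency_s: float          # wall-clock time for this response


def rank_models(responses: list[JudgedResponse], top_n: int = 3):
    """Return (model, avg_score, avg_latency_s) rows, best score first."""
    by_model: dict[str, list[JudgedResponse]] = {}
    for r in responses:
        by_model.setdefault(r.model, []).append(r)

    summary = []
    for model, rs in by_model.items():
        # Assumption: every dimension carries equal weight.
        avg_score = mean(mean(r.scores[d] for d in DIMENSIONS) for r in rs)
        avg_latency = mean(r.latency_s for r in rs)
        summary.append((model, avg_score, avg_latency))

    # Highest score wins; lower latency breaks ties (an assumed policy).
    summary.sort(key=lambda row: (-row[1], row[2]))
    return summary[:top_n]
```

Sorting on score first and latency second keeps the two numbers separate in the output, which matters if they rarely correlate for your task.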

A few things I learned while building it:

- Score and latency rarely correlate. The best model for accuracy on coding tasks was almost never the fastest. This tradeoff is completely task-dependent and impossible to see from benchmarks that don't reflect your workload.
- The Judge LLM approach is surprisingly consistent but introduces positional and familiarity bias. Using one model to score others isn't perfect, but it's far more reproducible than manual eval. Open to ideas on how to reduce judge bias without blowing up the cost.
- Model discovery matters more than I expected. The top performers on generic benchmarks often weren't the top performers on narrow tasks.

Stack: Python, OpenRouter for model access, MIT licensed.

https://github.com/gauravvij/llm-evaluator

Happy to answer questions on the design decisions.

Show HN: Jotio – Temporary notes that archive themselves

https://apps.apple.com/us/app/jotio-idea-notes/id6757312138
1•heikowitte•36s ago•0 comments

Polsia claims a $3M ARR and streams all user chats via public URL

https://sitebloom.ch/writing/polsia-claims-3m-arr-and-exposes-user-chats/
1•nickgreg•1m ago•0 comments

Minimal NixOS systemd-nspawn containers

https://bou.ke/blog/nixos-containers/
1•bouk•1m ago•0 comments

No, your six-year-old cannot paint this

https://beller-it.de/blog/20260309-abstract-art-and-llms.html
1•mads_quist•2m ago•1 comments

AutoSkill: Experience-Driven Lifelong Learning via Skill Self-Evolution

https://arxiv.org/abs/2603.01145
1•granoIacowboy•2m ago•1 comments

Show HN: AriaType – Privacy-first voice keyboard with AI polish (Beta, macOS)

https://ariatype.com/en/
1•zachari•3m ago•0 comments

Removing recursion via explicit callstack simulation

https://jnkr.tech/blog/removing-recursion
1•todsacerdoti•4m ago•0 comments

AMD Expands Ryzen AI Embedded P100 Family with 8 to 12 Core Parts – ServeTheHome

https://www.servethehome.com/amd-expands-ryzen-ai-embedded-p100-family-with-8-to-12-core-parts/
2•rbanffy•4m ago•0 comments

I infiltrated phishing panels targeting European banks

https://inti.io/p/how-i-infiltrated-phishing-panels
1•kalev•4m ago•0 comments

Nvidia 595 Linux Driver Running Well in Early Benchmarks

https://www.phoronix.com/review/nvidia-595-linux
1•rbanffy•5m ago•0 comments

I packed an Interactive 3D Encyclopedia into 1MB

1•Aureon_de_Veyra•5m ago•0 comments

3D printer that can mine Bitcoin uses excess heat for temperature control

https://www.tomshardware.com/tech-industry/cryptomining/3d-printer-that-can-mine-bitcoin-uses-exc...
1•rbanffy•5m ago•0 comments

ChatGPT driving rise in reports of 'satanic' organised and ritual abuse

https://www.theguardian.com/technology/2026/mar/08/chatgpt-driving-rise-in-reports-of-satanic-org...
1•YeGoblynQueenne•6m ago•0 comments

Show HN: A step debugger for AI agents

https://github.com/SEsquieu/HiveOS-Trace
1•sesquieu•6m ago•0 comments

SaaS Changelogs, 5 Months: What I Found About How B2B Companies Ship

https://spylert.com/blog/saas-changelog-analysis-how-companies-ship/
1•cnu•6m ago•0 comments

Show HN: Skilo – Share agent skills with a link, no repo required

https://skilo.xyz
2•plawlost•8m ago•0 comments

A New Spy Radio Signal Has Appeared. It's Broadcasting in Farsi

https://theiceman.substack.com/p/a-new-spy-radio-signal-has-appeared
3•Gustomaximus•8m ago•1 comments

Revealed: UK's multibillion AI drive is built on 'phantom investments'

https://www.theguardian.com/technology/2026/mar/09/revealed-uks-multibillion-ai-drive-is-built-on...
2•tablets•8m ago•0 comments

We ran 21 MCP database tasks on Claude Sonnet 4.6

https://insforge.dev/blog/mcpmark-benchmark-results-v2
2•Arindam1729•8m ago•0 comments

Show HN: Argus – Self-hosted Ethereum security monitor

https://github.com/tokamak-network/Argus
1•cd4761•9m ago•1 comments

Show HN: Screen-watching AI needs a kill switch

https://github.com/deusXmachina-dev/memorylane
1•fidorka•9m ago•1 comments

Corpus Christi careens toward water catastrophe

https://www.texastribune.org/2026/03/08/texas-corpus-christi-water-crisis/
2•speckx•10m ago•0 comments

India offered sanctuary to Iranian ship three days before US sank it

https://www.bbc.com/news/articles/c2e4yxj0pd3o
8•tartoran•10m ago•0 comments

Do AI-enabled companies need fewer people?

https://seldo.com/posts/do-ai-enabled-companies-need-fewer-people/
1•arberavdullahu•10m ago•0 comments

Show HN: AI workflows for SoC analysts (phishing analysis, log triage)

https://soc-workflows-ai-cyb-tstb.bolt.host
1•gauravkundu•11m ago•0 comments

How to Host Your Own Email Server

https://blog.miguelgrinberg.com/post/how-to-host-your-own-email-server
2•ibobev•13m ago•0 comments

Post-Quantum Cryptography Beyond TLS: Remain Quantum Safe

https://www.akamai.com/blog/security/post-quantum-cryptography-beyond-tls
2•todsacerdoti•14m ago•0 comments

My Experiment with GitHub Sponsors

https://chuniversiteit.nl/personal/donations-on-github
1•ibobev•14m ago•0 comments

Do AI coding agents improve velocity and quality?

https://chuniversiteit.nl/papers/impact-of-ai-coding-agents
2•ibobev•14m ago•0 comments

The first AI agent worm is months away, if that

https://dustycloud.org/blog/the-first-ai-agent-worm-is-months-away-if-that/
2•birdculture•14m ago•0 comments