I built AptSelect to stop writing throwaway scripts every time I needed to test how different LLMs handle specific instructions and prompt edge cases.

What it does:

Parallel Execution: Send a single prompt to OpenAI, Anthropic, Mistral, and Gemini simultaneously. Compare the outputs, latency, and exact token usage side-by-side.

Batch Evaluations: Upload a CSV dataset to run bulk tests across multiple models at once.

Manual Diagnostics: Grade outputs manually (Pass/Fail) and assign diagnostic tags (e.g., Hallucination, Format Error) to build a human-verified performance leaderboard.

Local-first: API keys are encrypted via your OS keyring, history is stored in a local SQLite DB, and there is no telemetry.
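For anyone curious what the parallel fan-out looks like in principle, here is a minimal Python sketch. It is not AptSelect's actual code: `Result`, `timed_call`, and `fan_out` are hypothetical names, and each provider is assumed to be a plain callable that takes a prompt and returns text (in practice that would wrap a vendor SDK call). It only measures wall-clock latency per provider; token usage would have to come from each provider's own response.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Result:
    provider: str
    output: str
    latency_s: float
    error: Optional[str] = None


def timed_call(name: str, fn: Callable[[str], str], prompt: str) -> Result:
    """Call one provider, recording wall-clock latency and any error."""
    start = time.perf_counter()
    try:
        output, error = fn(prompt), None
    except Exception as exc:  # one failing provider should not sink the batch
        output, error = "", str(exc)
    return Result(name, output, time.perf_counter() - start, error)


def fan_out(prompt: str, providers: Dict[str, Callable[[str], str]]) -> List[Result]:
    """Send the same prompt to every provider concurrently.

    Results come back in the same order as the `providers` dict.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(providers))) as pool:
        futures = [
            pool.submit(timed_call, name, fn, prompt)
            for name, fn in providers.items()
        ]
        return [future.result() for future in futures]


if __name__ == "__main__":
    # Stand-in providers (assumptions); real ones would call each vendor's API.
    demo = {
        "echo": lambda p: p,
        "shout": lambda p: p.upper(),
    }
    for r in fan_out("Reply with one word.", demo):
        print(f"{r.provider}: {r.output!r} in {r.latency_s:.4f}s (error={r.error})")
```

Threads are a reasonable fit here because the work is network-bound; an asyncio version would look similar if the provider clients are async.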

I’m looking for technical feedback. What do you think current LLM testing/evaluation tools get most wrong?

Show HN: AptSelect – A local LLM client for parallel testing and evaluation