Hey HN,

I built yardstiq because I got tired of the copy-paste workflow for comparing LLM responses when developing apps. Every time I wanted to see how Claude vs GPT vs Gemini handled the same prompt, I'd open three tabs, paste the same thing, and try to eyeball the differences. It's 2026 and we have 40+ models worth considering — that doesn't scale.

yardstiq is a CLI tool that sends one prompt to multiple models simultaneously and streams the responses side-by-side in your terminal. It also tracks performance metrics (time to first token, tokens/sec, cost) and optionally runs an AI judge to score the outputs.

```
npx yardstiq "Explain quicksort in 3 sentences" -m claude-sonnet -m gpt-4o
```

What it does:

- Streams responses from multiple models in parallel, rendered in columns
- Shows TTFT, throughput (tok/s), token counts, and cost per request
- AI judge mode: have a model evaluate and score the responses
- Export to JSON, Markdown, or self-contained HTML reports
- Run YAML-defined benchmark suites across models with aggregate scoring
- Works with Ollama for local model comparisons (zero API cost)
- Supports 40+ models via direct provider keys or Vercel AI Gateway
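
For anyone curious how metrics like TTFT and tok/s can be derived from parallel streams, here is a rough sketch. It is not yardstiq's actual code: `StreamMetrics`, `computeMetrics`, `measureStream`, and `compareModels` are hypothetical names, each streamed chunk is treated as one token (an approximation), throughput is defined as output tokens over total wall time (one of several reasonable definitions), and the per-million-token price is a placeholder you would supply yourself.

```typescript
// Hypothetical shapes for illustration only; not yardstiq's real types.
interface StreamMetrics {
  model: string;
  ttftMs: number | null; // time to first token; null if nothing arrived
  totalMs: number;       // request start to last token
  outputTokens: number;  // assumption: one streamed chunk ~ one token
  tokensPerSec: number;  // output tokens divided by total wall time
  costUsd: number;       // output tokens only, from an assumed price
}

// Pure helper: derive metrics from token arrival timestamps (in ms).
function computeMetrics(
  model: string,
  startMs: number,
  tokenTimesMs: number[],
  usdPerMillionOutputTokens: number,
): StreamMetrics {
  const outputTokens = tokenTimesMs.length;
  if (outputTokens === 0) {
    return { model, ttftMs: null, totalMs: 0, outputTokens: 0, tokensPerSec: 0, costUsd: 0 };
  }
  const ttftMs = tokenTimesMs[0] - startMs;
  const totalMs = tokenTimesMs[outputTokens - 1] - startMs;
  const tokensPerSec = totalMs > 0 ? outputTokens / (totalMs / 1000) : 0;
  const costUsd = (outputTokens * usdPerMillionOutputTokens) / 1_000_000;
  return { model, ttftMs, totalMs, outputTokens, tokensPerSec, costUsd };
}

// Consume one stream, recording when each non-empty chunk arrives.
// `open` is a factory so the clock starts before the request does.
async function measureStream(
  model: string,
  open: () => AsyncIterable<string>,
  usdPerMillionOutputTokens: number,
  now: () => number = () => Date.now(),
): Promise<StreamMetrics> {
  const startMs = now();
  const tokenTimesMs: number[] = [];
  for await (const chunk of open()) {
    if (chunk.length > 0) tokenTimesMs.push(now());
  }
  return computeMetrics(model, startMs, tokenTimesMs, usdPerMillionOutputTokens);
}

// Start every model's stream at once and wait for all of them to finish.
async function compareModels(
  requests: Record<string, () => AsyncIterable<string>>,
  prices: Record<string, number>,
): Promise<StreamMetrics[]> {
  return Promise.all(
    Object.entries(requests).map(([model, open]) =>
      measureStream(model, open, prices[model] ?? 0),
    ),
  );
}
```

Keeping the arithmetic in a pure function separate from the async plumbing makes the numbers easy to unit test with fixed timestamps. `Promise.all` rejects as soon as any one stream fails, so a real tool would probably want `Promise.allSettled` or per-model error handling instead.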

I built this mostly for my own workflow — picking models for different tasks, testing prompt variations, and running quick benchmarks without setting up a whole evaluation framework. It's not trying to replace serious eval platforms, just make the "which model is better for X?" question answerable in 10 seconds.

MIT licensed, written in TypeScript: https://github.com/stanleycyang/yardstiq

Happy to answer questions about the architecture or benchmarking approach.

Show HN: Yardstiq – Compare LLM outputs side-by-side in your terminal