1. Few-shot can cause collapse: Gemini 3 Flash scored 93% at zero-shot on route optimization, then crashed to 30% at 8-shot. A model from the same family (Gemma 3 27B, run locally) stayed stable at 90%.
2. Most models benefit from few-shot: On classification, all models scored 0-20% at zero-shot. At 8-shot, scores spread from 27% to 80%. Zero-shot benchmarks would have led to the wrong model choice.
3. Task mismatch ≠ collapse: Reasoning-specialized models scored low on summarization regardless of shot count. They're not "collapsing" — they're just not suited for the task.
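All three findings fall out of the same kind of measurement: scoring each model at several shot counts instead of zero-shot only. A minimal sketch of that sweep, assuming a simple Q/A prompt template and a `model` callable (both hypothetical stand-ins, not the repo's actual code):

```python
# Hypothetical sketch of a shot-count sweep. `model` is any callable
# that maps a prompt string to an answer string; the prompt template
# is illustrative, not the article's exact format.

def build_prompt(task, examples, k, query):
    # Prepend the first k worked examples to the query (k=0 -> zero-shot).
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples[:k])
    return f"{task}\n{shots}\nQ: {query}\nA:"

def sweep(model, task, examples, evalset, shot_counts=(0, 1, 2, 4, 8)):
    # Accuracy at each shot count; a drop as k grows is the "collapse"
    # pattern, a rise is the usual few-shot benefit.
    results = {}
    for k in shot_counts:
        correct = sum(
            model(build_prompt(task, examples, k, q)) == gold
            for q, gold in evalset
        )
        results[k] = correct / len(evalset)
    return results
```

A flat-but-low curve across all shot counts is the "task mismatch" signature from finding 3, distinct from a curve that starts high and decays.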
A 27B local model (Gemma 3) matched Claude Haiku's adaptation efficiency (AUC 0.814 vs 0.815). The 12-model results are included as default demo data — explore the patterns without API keys.
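One plausible way to read the "adaptation efficiency AUC" is area under the accuracy-vs-shot-count curve, normalized so a model at 100% everywhere scores 1.0. A sketch under that assumption (the exact metric definition lives in the repo; the trapezoidal rule and the sample curves here are illustrative):

```python
# Assumed definition: normalized area under the accuracy-vs-shots curve.
# Shot counts and scores below are made up to show the two patterns,
# not the article's measured data.

def adaptation_auc(shots, scores):
    """Trapezoidal area under the curve, divided by the shot-count span
    so a constant accuracy of 1.0 yields AUC = 1.0."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(zip(shots, scores), zip(shots[1:], scores[1:])):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / (shots[-1] - shots[0])

shots = [0, 1, 2, 4, 8]
stable = adaptation_auc(shots, [0.90] * 5)                  # flat curve
collapse = adaptation_auc(shots, [0.93, 0.80, 0.60, 0.45, 0.30])
print(round(stable, 3), round(collapse, 3))
```

Under this definition the stable model keeps its zero-shot score as its AUC, while a collapsing model is penalized even though its zero-shot number looked better.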
Article: https://dev.to/shuntarookuma/i-tested-12-llms-with-few-shot-...
GitHub (MIT): https://github.com/ShuntaroOkuma/adapt-gauge-core