
Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model

https://github.com/cactus-compute/needle

4•HenryNdubuaku•59m ago

Hey HN, Henry here from Cactus. We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.
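To put those throughput numbers in perspective, here is a rough back-of-the-envelope latency estimate. It assumes prefill and decode times simply add up and ignores model load time and tokenization overhead. The request size (300 prompt tokens, 40 output tokens) is a hypothetical example, not a figure from the benchmarks.

```python
# Throughput figures quoted in the post (tokens per second).
PREFILL_TOK_PER_S = 6000
DECODE_TOK_PER_S = 1200


def estimate_latency_s(prompt_tokens: int, output_tokens: int) -> float:
    """Rough latency: prefill time for the prompt plus decode time for the output."""
    return prompt_tokens / PREFILL_TOK_PER_S + output_tokens / DECODE_TOK_PER_S


# Hypothetical request: 300 prompt tokens (query plus tool schemas), 40 output tokens.
latency = estimate_latency_s(300, 40)
print(f"{latency * 1000:.0f} ms")  # prints "83 ms"
```

Under these assumptions a single tool call would complete in well under a tenth of a second; real numbers will vary by device.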

We have long been frustrated by how little effort goes into building agentic models that run on budget phones, so we ran investigations that led to one observation: agentic experiences are built on tool calling, and massive models are overkill for it. Tool calling is fundamentally retrieval-and-assembly (match the query to a tool name, extract argument values, emit JSON), not reasoning. Cross-attention is the right primitive for this, and FFN parameters are wasted at this scale.
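To make the "retrieval-and-assembly" framing concrete, here is a toy, rule-based sketch of the three steps (match, extract, emit). It is not how Needle works internally; it only illustrates the shape of the task. The tool names, keywords, and regex patterns are all invented for illustration.

```python
import json
import re

# Hypothetical tool schemas; names, keywords, and patterns are invented for illustration.
TOOLS = [
    {"name": "set_timer", "keywords": {"timer", "countdown"},
     "pattern": r"(?P<minutes>\d+)\s*min"},
    {"name": "send_message", "keywords": {"message", "text"},
     "pattern": r"to (?P<recipient>\w+)"},
]


def call_tool(query: str) -> str:
    """Match the query to a tool, extract arguments, and emit a JSON call."""
    words = set(re.findall(r"\w+", query.lower()))
    # Step 1: retrieval. Pick the tool whose keywords overlap the query most.
    tool = max(TOOLS, key=lambda t: len(t["keywords"] & words))
    # Step 2: assembly. Copy argument values out of the query text.
    match = re.search(tool["pattern"], query, flags=re.IGNORECASE)
    arguments = match.groupdict() if match else {}
    # Step 3: emit the structured call as JSON.
    return json.dumps({"name": tool["name"], "arguments": arguments})


print(call_tool("Set a timer for 10 minutes"))
# prints {"name": "set_timer", "arguments": {"minutes": "10"}}
```

Nothing in the output is invented by the function: the tool name comes from the schema and the argument value is copied from the query. That is the sense in which the task is closer to lookup than to reasoning.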

Simple Attention Networks: the entire model is just attention and gating, no MLPs anywhere. Needle is an experimental run for single-shot function calling for consumer devices (phones, watches, glasses...).
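For readers who want a feel for what "attention and gating, no MLPs" could look like, here is a minimal pure-Python sketch of one gated cross-attention step. It is an assumption-laden illustration, not the Needle architecture. The single head, the sigmoid gate computed from the query input, and the residual connection are all guesses made for the example; the linked writeup is the authoritative description.

```python
import math


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_cross_attention(x, memory, wq, wk, wv, wg):
    """One query vector x attends over memory vectors; the output is gated."""
    q = matvec(wq, x)
    keys = [matvec(wk, m) for m in memory]
    values = [matvec(wv, m) for m in memory]
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
    dim = len(values[0])
    attended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    # Assumed gating form: elementwise sigmoid gate computed from the query input.
    gate = [1.0 / (1.0 + math.exp(-g)) for g in matvec(wg, x)]
    # Residual connection; the block ends here, with no feed-forward layer after it.
    return [xi + gi * ai for xi, gi, ai in zip(x, gate, attended)]


# Tiny example with identity projections (hypothetical weights).
identity = [[1.0, 0.0], [0.0, 1.0]]
out = gated_cross_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
                            identity, identity, identity, identity)
print(out)
```

In a conventional transformer block, a feed-forward sublayer would follow the attention output; here the gate is the only nonlinearity applied after attention.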

Training:
- Pretrained on 200B tokens across 16 TPU v6e (27 hours)
- Post-trained on 2B tokens of synthesized function-calling data (45 minutes)
- Dataset synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)
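The training figures above imply the following approximate throughput. This is plain arithmetic on the quoted numbers. The per-chip figure is computed only for pretraining, since the post does not state which hardware was used for post-training.

```python
def tokens_per_second(tokens: float, hours: float) -> float:
    """Average training throughput implied by a token count and wall-clock hours."""
    return tokens / (hours * 3600)


# Figures quoted in the post.
pretrain = tokens_per_second(200e9, 27)    # 200B tokens in 27 hours
posttrain = tokens_per_second(2e9, 0.75)   # 2B tokens in 45 minutes
pretrain_per_chip = pretrain / 16          # spread across 16 TPU v6e chips

print(f"pretraining:  {pretrain:,.0f} tok/s total, {pretrain_per_chip:,.0f} tok/s per chip")
print(f"post-training: {posttrain:,.0f} tok/s")
```

That works out to roughly 2.06M tokens per second in aggregate during pretraining (about 129K per chip) and roughly 741K tokens per second during post-training.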

You can test it right now and finetune on your Mac/PC: https://github.com/cactus-compute/needle

The full writeup on the architecture is here: https://github.com/cactus-compute/needle/blob/main/docs/simp...

We found that the "no FFN" finding generalizes beyond function calling to any task where the model has access to external structured knowledge (retrieval-augmented generation, tool use). The model doesn't need to memorize facts in FFN weights if the facts are provided in the input. Experimental results are still to be published.

While Needle beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling, those models have more scope and capacity and excel in conversational settings. We encourage you to test on your own tools via the playground and finetune accordingly.

This is part of our broader work on Cactus (https://github.com/cactus-compute/cactus), an inference engine built from scratch for mobile, wearables and custom hardware. We wrote about Cactus here previously: https://news.ycombinator.com/item?id=44524544

Everything is MIT licensed. Weights: https://huggingface.co/Cactus-Compute/needle GitHub: https://github.com/cactus-compute/needle
