Show HN: ClawMem – Open-source agent memory with SOTA local GPU retrieval

https://github.com/yoloshii/ClawMem

3•yoloshii•2h ago

So I've been building ClawMem, an open-source context engine that gives AI coding agents persistent memory across sessions. It works with Claude Code (hooks + MCP) and OpenClaw (ContextEngine plugin + REST API), and both can share the same SQLite vault, so your CLI agent and your voice/chat agent build on the same memory without syncing anything.

The retrieval architecture is a Frankenstein, which is pretty much always my process. I pulled the best parts from recent projects and research and stitched them together: [QMD](https://github.com/tobi/qmd) for the multi-signal retrieval pipeline (BM25 + vector + RRF + query expansion + cross-encoder reranking), [SAME](https://github.com/sgx-labs/statelessagent) for composite scoring with content-type half-lives and co-activation reinforcement, [MAGMA](https://arxiv.org/abs/2501.13956) for intent classification with multi-graph traversal (semantic, temporal, and causal beam search), [A-MEM](https://arxiv.org/abs/2510.02178) for self-evolving memory notes, and [Engram](https://github.com/Gentleman-Programming/engram) for deduplication patterns and temporal navigation. None of these were designed to work together. Making them coherent was most of the work.
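To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion (RRF), the technique named above for merging BM25 and vector results. It is not taken from ClawMem or QMD: the function name `rrf_fuse`, the note ids, and the constant `k=60` (a commonly used default) are assumptions for illustration, and it is written in Python for brevity rather than the project's own Bun stack.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion.

    Each document gets 1 / (k + rank) from every list it appears in;
    documents are returned sorted by the summed score, best first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Tie-break on the id so the output order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a keyword search and a vector search.
bm25_hits = ["note-a", "note-b", "note-c"]
vector_hits = ["note-b", "note-d", "note-a"]

fused = rrf_fuse([bm25_hits, vector_hits])
print(fused)
```

Because RRF only looks at ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales. A reranker would then typically be applied to the top of the fused list.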

On the inference side, QMD's original stack uses a 300MB embedding model, a 1.1GB query expansion LLM, and a 600MB reranker. These run via llama-server on a GPU or in-process through node-llama-cpp (Metal, Vulkan, or CPU). But the more interesting path is the SOTA upgrade: ZeroEntropy's distillation-paired zembed-1 + zerank-2. These are currently the top-ranked embedding and reranking models on MTEB, and they're designed to work together. The reranker was distilled from the same teacher as the embedder, so they share a semantic space. You need ~12GB VRAM to run both, but retrieval quality is noticeably better than the default stack. There's also a cloud embedding option if you're tight on VRAM or would rather offload embedding to a cloud model.

For Claude Code specifically, it hooks into lifecycle events. Context-surfacing fires on every prompt to inject relevant memory, decision-extractor and handoff-generator capture session state, and a feedback loop reinforces notes that actually get referenced. That handles about 90% of retrieval automatically. The other 10% is 28 MCP tools for explicit queries. For OpenClaw, it registers as a ContextEngine plugin with the same hook-to-lifecycle mapping, plus 5 REST API tools for the agent to call directly.
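As a rough illustration of how a reference-based feedback loop could combine with the content-type half-lives mentioned earlier, here is a small scoring sketch. Everything in it is an assumption made for the example: the half-life values, the content-type names, the logarithmic boost, and the function names are hypothetical and are not ClawMem's or SAME's actual formula.

```python
import math

# Hypothetical half-lives in days per content type (illustrative values only).
HALF_LIFE_DAYS = {"decision": 180.0, "handoff": 14.0, "note": 60.0}
DEFAULT_HALF_LIFE_DAYS = 60.0


def recency_weight(age_days, content_type):
    """Exponential decay: the weight halves every `half_life` days."""
    half_life = HALF_LIFE_DAYS.get(content_type, DEFAULT_HALF_LIFE_DAYS)
    return 0.5 ** (age_days / half_life)


def composite_score(relevance, age_days, content_type, reference_count=0):
    """Combine retrieval relevance, recency decay, and a reinforcement boost.

    The boost grows slowly (logarithmically) with how often the note was
    actually referenced, so heavily used notes cannot dominate outright.
    """
    boost = 1.0 + 0.1 * math.log1p(reference_count)
    return relevance * recency_weight(age_days, content_type) * boost


fresh_handoff = composite_score(0.8, age_days=1, content_type="handoff")
stale_handoff = composite_score(0.8, age_days=28, content_type="handoff")
used_decision = composite_score(0.8, age_days=28, content_type="decision", reference_count=5)
print(fresh_handoff, stale_handoff, used_decision)
```

The idea is that a handoff note loses value within weeks while a decision record stays relevant for months, and notes the agent keeps citing drift upward in the ranking.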

It runs on Bun with a single SQLite vault (WAL mode, FTS5 + vec0). Everything is on-device; no cloud dependency unless you opt into cloud embedding. The whole system is self-contained.
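For readers unfamiliar with the storage pieces, the sketch below shows a single-file SQLite database in WAL mode with an FTS5 full-text index, using Python's standard `sqlite3` module. It assumes the SQLite build bundled with your Python includes FTS5 (most do). The table name `notes_fts`, its columns, and the sample rows are hypothetical and not ClawMem's schema, and the vec0 vector table is left out because it needs the separate sqlite-vec extension.

```python
import os
import sqlite3
import tempfile


def open_vault(path):
    """Open (or create) a single-file vault in WAL mode with an FTS5 index."""
    conn = sqlite3.connect(path)
    # WAL lets readers keep working while one writer appends to the log.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS notes_fts USING fts5(title, body)"
    )
    return conn


def search(conn, query, limit=5):
    """Return note titles matching the query, best BM25 match first."""
    rows = conn.execute(
        "SELECT title FROM notes_fts WHERE notes_fts MATCH ? "
        "ORDER BY bm25(notes_fts) LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in rows]


with tempfile.TemporaryDirectory() as tmp:
    conn = open_vault(os.path.join(tmp, "vault.sqlite"))
    conn.executemany(
        "INSERT INTO notes_fts (title, body) VALUES (?, ?)",
        [
            ("Retrieval decision", "Use a cross-encoder reranker after fusion"),
            ("Storage decision", "Keep everything in one SQLite file"),
        ],
    )
    conn.commit()
    mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    hits = search(conn, "reranker")
    conn.close()

print(mode, hits)
```

WAL mode is what makes it practical for two processes, such as a CLI agent and a chat agent, to read the same file while one of them writes.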

This is a polished WIP, not a finished product. I'm a solo dev. The codebase is around 19K lines and the main store module is a 4K-line god object that probably needs splitting. And of course, the system is only as good as what you index. A vault with three memory files gives deservedly thin results. One with your project docs, research notes, and decision records gives something actually useful.

Two questions I'd genuinely like input on: (1) Has anyone else tried running SOTA embedding + reranking models locally for agent memory, and is the quality difference worth the VRAM? (2) For those running multiple agent interfaces (CLI + voice/chat), how are you handling shared memory today?

Show HN: Termcraft – terminal-first 2D sandbox survival in Rust

Show HN: Atomic – Self-hosted, semantically-connected personal knowledge base

Show HN: Joonote – A note-taking app on your lock screen and notification panel

Show HN: I built a pricing tool for home bakers that reads recipe photos

Show HN: ClawMem – Open-source agent memory with SOTA local GPU retrieval

Show HN: AI SDLC Scaffold, repo template for AI-assisted software development

Show HN: We built a terminal-only Bluesky / AT Proto client written in Fortran

Show HN: An event loop for asyncio written in Rust

Show HN: Simple Terminal Voice Recorder

Show HN: The Two by Two Truth Diagram

Show HN: Sonar – A tiny CLI to see and kill whatever's running on localhost

Show HN: Travel Hacking Toolkit – Points search and trip planning with AI

Show HN: GoldenMatch – Entity resolution with LLM scoring, 97% F1, no Spark

Show HN: vLLM Studio – A macOS app for using vLLM models

Show HN: Red Grid Link – peer-to-peer team tracking over Bluetooth, no servers

Show HN: I ran a language model on a PS2

Show HN: Zen-Hunt – A bare-metal forensic scanner in Rust (SIMD, 7GB/s on NVMe)

Show HN: A KEXP native macOS app

Show HN: Vessel Browser – An open-source browser built for AI agents, not humans

Show HN: Three new Kitten TTS models – smallest less than 25MB

Show HN: Baltic shadow fleet tracker – live AIS, cable proximity alerts

Show HN: EchoLive – Read-it-later app that reads to you with 600 AI voices

Show HN: Duplicate 3 layers in a 24B LLM, logical deduction .22→.76. No training

Show HN: Batear – I built a $15 edge-only acoustic drone warning system

Show HN: I saw Norton Commander on X and nostalgia made me build it for the web

Show HN: Deterministic security solution for AI agents – OpenClaw and 2 more

Show HN: FPGA soft-core of the Saab Viggen's 1963 airborne computer

Show HN: RSS reader that scores articles 0–10 with LLM before you open them

Show HN: I ran Qwen3.5 35B on my iPhone at 5.6 tok/SEC

Show HN: I made an email app inspired by Arc browser
