Show HN: ClawMem – Open-source agent memory with SOTA local GPU retrieval

https://github.com/yoloshii/ClawMem

3•yoloshii•1h ago

So I've been building ClawMem, an open-source context engine that gives AI coding agents persistent memory across sessions. It works with Claude Code (hooks + MCP) and OpenClaw (ContextEngine plugin + REST API), and both can share the same SQLite vault, so your CLI agent and your voice/chat agent build on the same memory without syncing anything.

The retrieval architecture is a Frankenstein, which is pretty much always my process. I pulled the best parts from recent projects and research and stitched them together: [QMD](https://github.com/tobi/qmd) for the multi-signal retrieval pipeline (BM25 + vector + RRF + query expansion + cross-encoder reranking), [SAME](https://github.com/sgx-labs/statelessagent) for composite scoring with content-type half-lives and co-activation reinforcement, [MAGMA](https://arxiv.org/abs/2501.13956) for intent classification with multi-graph traversal (semantic, temporal, and causal beam search), [A-MEM](https://arxiv.org/abs/2510.02178) for self-evolving memory notes, and [Engram](https://github.com/Gentleman-Programming/engram) for deduplication patterns and temporal navigation. None of these were designed to work together. Making them coherent was most of the work.
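To make two of the pieces above concrete, here is a minimal sketch of reciprocal rank fusion (RRF) over a BM25 list and a vector list, plus an exponential half-life decay of the kind a content-type half-life implies. This is not ClawMem's or QMD's actual code: the function names, the document ids, and the smoothing constant `k=60` (a common default for RRF) are assumptions for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc ids with reciprocal rank fusion.

    Each document earns 1 / (k + rank) from every list it appears in,
    so items ranked well by more than one signal rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def half_life_decay(age_days, half_life_days):
    """Multiplier that halves every `half_life_days` (hypothetical helper)."""
    return 0.5 ** (age_days / half_life_days)


# Hypothetical result lists from the two retrievers.
bm25_hits = ["auth-decision", "db-schema", "handoff-0312"]
vector_hits = ["db-schema", "perf-notes", "auth-decision"]

fused = rrf_fuse([bm25_hits, vector_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

Here `db-schema` comes out first because both retrievers rank it near the top, while a note that only one retriever found lands lower. A composite score could then multiply each fused score by `half_life_decay(...)` using a half-life chosen per content type.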

On the inference side, QMD's original stack uses a 300MB embedding model, a 1.1GB query expansion LLM, and a 600MB reranker. These run via llama-server on a GPU or in-process through node-llama-cpp (Metal, Vulkan, or CPU). But the more interesting path is the SOTA upgrade: ZeroEntropy's distillation-paired zembed-1 + zerank-2. These are currently the top-ranked embedding and reranking models on MTEB, and they're designed to work together. The reranker was distilled from the same teacher as the embedder, so they share a semantic space. You need ~12GB VRAM to run both, but retrieval quality is noticeably better than the default stack. There's also a cloud embedding option if you're tight on VRAM or would rather offload embedding to a hosted model.

For Claude Code specifically, it hooks into lifecycle events. Context-surfacing fires on every prompt to inject relevant memory, decision-extractor and handoff-generator capture session state, and a feedback loop reinforces notes that actually get referenced. That handles about 90% of retrieval automatically. The other 10% goes through 28 MCP tools for explicit queries. For OpenClaw, it registers as a ContextEngine plugin with the same hook-to-lifecycle mapping, plus 5 REST API tools for the agent to call directly.

It runs on Bun with a single SQLite vault (WAL mode, FTS5 + vec0). Everything is on-device; no cloud dependency unless you opt into cloud embedding. The whole system is self-contained.
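For anyone unfamiliar with the storage side, this is roughly what a single-file SQLite vault with WAL mode and an FTS5 index looks like. It is a sketch using Python's standard `sqlite3` module rather than Bun, and it assumes the SQLite build has FTS5 compiled in. The table name, columns, and sample notes are made up and are not ClawMem's schema. The vec0 vector table is left out because it needs the sqlite-vec extension loaded.

```python
import os
import sqlite3
import tempfile

# WAL needs a real file, so use a throwaway path instead of :memory:.
db_path = os.path.join(tempfile.mkdtemp(), "vault.db")
conn = sqlite3.connect(db_path)

# Switch the journal to write-ahead logging; the pragma returns the active mode.
journal_mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

# Hypothetical full-text table; FTS5 gives BM25 ranking out of the box.
conn.execute("CREATE VIRTUAL TABLE notes USING fts5(title, body)")
conn.executemany(
    "INSERT INTO notes(title, body) VALUES (?, ?)",
    [
        ("Vault storage decision", "Use a single SQLite file in WAL mode"),
        ("Auth flow", "Tokens are refreshed at the start of every session"),
    ],
)
conn.commit()

# bm25() returns lower values for better matches, so ascending order is best-first.
rows = conn.execute(
    "SELECT title FROM notes WHERE notes MATCH ? ORDER BY bm25(notes)",
    ("sqlite",),
).fetchall()
conn.close()
```

WAL lets readers keep reading while a writer appends, which is what makes one vault file usable from more than one agent process at once.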

This is a polished WIP, not a finished product. I'm a solo dev. The codebase is around 19K lines and the main store module is a 4K-line god object that probably needs splitting. And of course, the system is only as good as what you index. A vault with three memory files gives deservedly thin results. One with your project docs, research notes, and decision records gives something actually useful.

Two questions I'd genuinely like input on: (1) Has anyone else tried running SOTA embedding + reranking models locally for agent memory, and is the quality difference worth the VRAM? (2) For those running multiple agent interfaces (CLI + voice/chat), how are you handling shared memory today?
