The frustration that started it: every time I use a coding agent (Cursor, OpenCode, Aider, Claude Code, etc.), it eventually loses context — forgets the SSH address, re-asks for the DB password, tries to redeploy to localhost when the server is remote. The "proper" answer is "set up 10 specialized agents with short context windows." I'm too lazy for that.
The conventional architecture is the actual problem. Every turn re-sends the full conversation, the model recomputes attention from scratch, and cost compounds with conversation length. Long-running agents are economically infeasible by design.
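To make "compounds" concrete, here's a back-of-envelope sketch (the per-turn numbers are assumptions for illustration, not measurements): when the full history is re-sent every turn, billed prompt tokens grow quadratically with turn count.

```python
# Assumed workload (illustrative, not measured): 100 turns, ~1,000 new tokens
# per turn, and the conventional stateless pattern of re-sending everything.
per_turn, turns = 1_000, 100
billed = sum(per_turn * t for t in range(1, turns + 1))   # cumulative re-sends
content = per_turn * turns                                # actual conversation
print(f"{billed:,} billed vs {content:,} real -> {billed / content:.0f}x overhead")
# 5,050,000 billed vs 100,000 real -> 50x overhead
```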
What I built: NLS captures the model's own computed K/V states (and recurrent states for hybrid models like Qwen3.5-MoE) after each turn, persists them to disk, and re-injects them into the cache on the next turn — at the right positions, with proper alignment. The model behaves as if it had the full conversation in context, but the conversation is never re-sent.
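For intuition, here's a minimal sketch of the general mechanism using the HuggingFace transformers cache API. This is not the NLS implementation (that's proprietary, and handles position alignment and hybrid recurrent state, which this toy skips); it assumes a recent transformers version where the forward pass returns a DynamicCache, and the model id is a stand-in.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative stand-in, not the deployed model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# Turn 1: run the history once; persist the computed K/V states plus token ids.
history = "User: the server is ssh admin@10.0.0.7 and deploys are remote, not localhost.\n"
hist_ids = tok(history, return_tensors="pt").input_ids
with torch.no_grad():
    cache = model(hist_ids, use_cache=True).past_key_values
torch.save({"ids": hist_ids, "kv": cache.to_legacy_cache()}, "session.pt")

# Turn 2, possibly a fresh process: reload and re-inject instead of recomputing.
saved = torch.load("session.pt")
past = DynamicCache.from_legacy_cache(saved["kv"])
q_ids = tok("User: what's the SSH address?\nAssistant:", return_tensors="pt").input_ids
# generate() is given the full token sequence for positions/masking, but the
# cached prefix is never recomputed; only the new tokens pay for attention.
full_ids = torch.cat([saved["ids"], q_ids], dim=-1)
out = model.generate(full_ids, past_key_values=past, max_new_tokens=16,
                     attention_mask=torch.ones_like(full_ids))
print(tok.decode(out[0, full_ids.shape[-1]:], skip_special_tokens=True))
```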
Validated across three settings, in increasing order of stringency:
(1) Standard conversational recall: 5/5 on a 5-fact production test. Baseline check.
(2) LongMemEval (published cross-session benchmark, ~19K sessions). On the 18-question "fully answerable" subset:
Condition                                      Qwen 3.5   Qwen 3.6
Memories provided as TEXT in the prompt          8/18       9/18
Same memories delivered as KV-state via NLS      8/18       9/18
Text and KV produce identical scores. Both fail the same 9-10 questions for the same reasons (multi-hop temporal reasoning that exceeds model capacity). When the architecture's inputs are equivalent, the outputs are equivalent.
(3) Real agentic loop with OpenCode (a TUI coding agent, using NLS as its OpenAI-compatible backend). It scaffolded a multi-phase coding project ("ICF Coaching Evaluation Tool"). Then, in a separate session after a full TUI restart with no chat history, I asked "what's the project about?" — it returned a rich, specific description naming the project, the stack, and the architectural decisions. 124 user-typed tokens delivered 18,751 tokens of stored prior-session context, a 99.3% prompt-token saving on the recall path (arithmetic below). 4/4 recall across the test scenarios.

Honest caveats:
- The plugin source is proprietary (patent pending). The repo has docs, benchmarks, journey — not the implementation.
- Single-GPU validation only; multi-GPU not tested yet.
- Solo, no team yet.
- Provisional patent only — non-provisional and PCT filings in the next 12 months.
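The 99.3% number above is plain arithmetic (assuming savings is measured as the share of stored context that never had to be re-sent):

```python
typed, stored = 124, 18_751  # user-typed tokens vs. stored prior-session context delivered
print(f"{1 - typed / stored:.1%} of the recall-path context never re-sent")  # -> 99.3%
```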
What I want from this thread: tell me where you'd stress-test it. What workload breaks it? Anyone here from an inference provider — does this overlap with what your stack already does, or is this new territory?
Demo (conversational): https://punkrecords.live
Demo (agentic, OpenAI-compatible): https://api.punkrecords.live/v1
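If you want to poke at the agentic endpoint from code, any OpenAI-compatible client should work. A minimal sketch (the model id and auth value are placeholders, not confirmed details of the endpoint; if it implements the standard listing, GET /v1/models returns the real ids):

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.punkrecords.live/v1",
                api_key="sk-placeholder")  # auth requirements: assumption
resp = client.chat.completions.create(
    model="nls-default",  # hypothetical id; query the endpoint's model list
    messages=[{"role": "user", "content": "what's the project about?"}],
)
print(resp.choices[0].message.content)
```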