We need better ways to evaluate how AI memory systems perform

https://www.cognee.ai/blog/deep-dives/ai-memory-evals-0825
1•vasa_•2h ago

Comments

vasa_•2h ago
The usual benchmarks for language models—Exact Match, F1, and even multi-hop QA datasets—weren’t designed to measure what matters most about persistent AI memory: connecting concepts across time, documents, and contexts.

We just completed our most extensive internal evaluation of cognee to date, using HotPotQA as a baseline. While the results showed strong gains, they also reinforced a growing realization: we need better ways to evaluate how AI memory systems actually perform.

We ran Cognee through 45 evaluation cycles on 24 questions from HotPotQA, using ChatGPT 4o for the analysis. Every stage of the evaluation (cognification, answer generation, and answer evaluation) is affected by the inherent variance in GPT’s output. The variance across metrics was especially large on small runs, which is why we chose the repeated, end-to-end approach.
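To make the point about small runs concrete, here is a minimal sketch of how per-run scores could be aggregated. It is not the actual evaluation code: `summarize_runs` is a hypothetical helper, and the scores are invented for illustration only, not results from the evaluation.

```python
import math
import statistics


def summarize_runs(scores: list[float]) -> dict[str, float]:
    """Mean, sample standard deviation, and standard error across runs."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    stderr = stdev / math.sqrt(len(scores))
    return {"runs": len(scores), "mean": mean, "stdev": stdev, "stderr": stderr}


# Hypothetical per-run correctness scores, invented for illustration only.
few_runs = [0.70, 0.85, 0.60]
# Repeating the same values 15 times gives 45 runs with a similar spread,
# so the only thing that really changes is the number of runs.
many_runs = few_runs * 15

print(summarize_runs(few_runs))
print(summarize_runs(many_runs))
```

The mean is the same in both cases, but the standard error shrinks as the run count grows, which is the usual argument for repeating the whole pipeline rather than trusting a single pass.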

We compared results, using the same questions and setup, against Mem0, LightRAG, and Graphiti.

While they are standard in QA, EM and F1 scores reward surface-level overlap and miss the core value proposition of AI memory systems. For example, a syntactically perfect answer can be factually wrong, and a fuzzy-but-correct response can be penalized for missing the reference phrasing.
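A small sketch of that failure mode, assuming SQuAD-style answer normalization (lowercasing, stripping punctuation and articles). The function names and the example answers are made up for illustration and are not taken from the evaluation.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical answers, invented for illustration.
reference = "Barack Obama"
correct_but_verbose = "The 44th president of the United States, Obama"
wrong_but_similar = "Michelle Obama"

for answer in (correct_but_verbose, wrong_but_similar):
    print(answer, exact_match(answer, reference), token_f1(answer, reference))
```

Both answers get an Exact Match of 0, and the wrong answer gets the higher F1 because it shares a larger fraction of its tokens with the reference. Neither metric sees that one answer names the right person and the other does not.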

Another issue is that LLM outputs are inconsistent from run to run.

Even HotPotQA assumes all relevant information sits neatly in two paragraphs. That’s not how memory works. Real-world AI memory systems need to link information across documents, conversations, and knowledge domains, and traditional QA benchmarks just can’t capture that.

Consider the difference:

Traditional QA:

“What year was the company that acquired X founded?”

Memory Challenge:

“How do the concerns raised in last month’s security review relate to the authentication changes discussed in the architecture meeting three weeks ago?”

Only one of these tests long-term knowledge, reasoning across sources, and organizational memory—care to guess which one?

We are working on a new dataset and benchmarks to measure memory, and would love feedback!
