frontpage.

Includetheprompt.com

https://includetheprompt.com
1•sunaurus•4m ago•0 comments

Making Sense of Memory in AI Agents

https://www.leoniemonigatti.com/blog/memory-in-ai-agents.html
1•sebg•5m ago•0 comments

Google insider profited $1M in a single day betting on the Google search markets

https://twitter.com/JeongHaeju/status/1996462116094480464
1•mirzap•6m ago•0 comments

NextJS Security Vulnerability

https://nextjs.org/blog/CVE-2025-66478
1•connor11528•7m ago•0 comments

Wikipedia seeks more AI licensing deals similar to Google tie-up, Wales says

https://www.reuters.com/business/media-telecom/wikipedia-seeks-more-ai-licensing-deals-similar-go...
1•thm•12m ago•0 comments

AI Image Generation – Kirkify.live

https://kirkify.live
2•Nancy1230•13m ago•0 comments

Best way to collect LinkedIn post URLs by keyword and comment later

2•stephanemillet•14m ago•0 comments

Amazon Nova

https://aws.amazon.com/nova/
2•jonbaer•16m ago•0 comments

LanguageTool requires premium subscription for browser extension

https://languagetool.org/webextension/premium-announcement
1•unixfox•17m ago•0 comments

Building optimistic UI in Rails (and learn custom elements)

https://railsdesigner.com/custom-elements/
1•amalinovic•18m ago•0 comments

Crashing an AI Promo Event: What to Ask Before Buying into an AI Agent Platform

https://ossa-ma.github.io/blog/crashing-ai-promo
1•ossa-ma•19m ago•0 comments

All About Diffraction Gratings

https://www.edmundoptics.com/knowledge-center/application-notes/optics/all-about-diffraction-grat...
1•thunderbong•19m ago•0 comments

An Interview with Atlassian CEO Mike Cannon-Brookes About Atlassian and AI

https://stratechery.com/2025/an-interview-with-atlassian-ceo-mike-cannon-brookes-about-atlassian-...
1•feross•19m ago•0 comments

Training LLMs for Honesty via Confessions [pdf]

https://cdn.openai.com/pdf/6216f8bc-187b-4bbb-8932-ba7c40c5553d/confessions_paper.pdf
1•goplayoutside•20m ago•0 comments

How confessions can keep language models honest

https://openai.com/index/how-confessions-can-keep-language-models-honest/
2•goplayoutside•22m ago•0 comments

Google's Year in Search: 2025

https://trends.withgoogle.com/year-in-search/2025/
1•ravenical•23m ago•0 comments

OpenAI to Acquire Neptune

https://openai.com/index/openai-to-acquire-neptune/
3•resiros•26m ago•0 comments

PGlite – Embeddable Postgres

https://pglite.dev/
29•dsego•29m ago•1 comments

Haskell Weekly – Issue 501

https://haskellweekly.news/issue/501.html
1•unripe_syntax•30m ago•0 comments

WordPress Playground: 2025 Year in Review

https://make.wordpress.org/playground/2025/12/03/wordpress-playground-2025-year-in-review/
2•program•32m ago•0 comments

Flock cameras are also computers – and perfectly hackable

https://neuburger.substack.com/p/flock-camera-vulnerability-its-worse
2•ThomasNeu•34m ago•0 comments

Porn company fined £1M over inadequate age checks (UK)

https://www.bbc.co.uk/news/articles/c93nll07z3go
9•ndsipa_pomu•34m ago•5 comments

How to Think Like a World-Class Marketer – Rory Sutherland

https://fs.blog/knowledge-project-podcast/rory-sutherland-2/
1•feross•35m ago•0 comments

We created API-Bench to test how well LLMs execute against APIs

https://superglue.ai/benchmark_v2
2•adinagoerres•36m ago•1 comments

Khwand AI – personalized AI tutor (launch)

https://khwand.webflow.io
1•FahadHafeezOff•37m ago•1 comments

The Eternal Canvas – 10yr observation and 2yr full-time documentation (85 docs)

https://publish.obsidian.md/thecanvas
1•DVoidCreationz•37m ago•1 comments

NRC Completes Safety Review of TerraPower Natrium [pdf]

https://www.nrc.gov/sites/default/files/cdn/doc-collection-news/2025/25-063.pdf
3•mpweiher•40m ago•0 comments

Production Ready Terraform with Testing, Validation and CI/CD

https://fatihkoc.net/posts/production-ready-terraform/
1•fatihkocnet•40m ago•0 comments

LED Streetlights Are Disrupting Ecosystems – A Systems Failure

2•emmasuntech•40m ago•0 comments

Tony Tetro

https://en.wikipedia.org/wiki/Tony_Tetro
1•herol3oy•42m ago•0 comments

We created API-Bench to test how well LLMs execute against APIs

https://superglue.ai/benchmark_v2
2•adinagoerres•36m ago

Comments

adinagoerres•36m ago
How well can agents work with APIs they’ve never seen before? We tested 41 APIs across 8 different LLMs to find out.

API execution is a good benchmarking target because it tests core qualities and limitations of LLMs: the depth of the data they were trained on, their stateless architecture, context dependency, and reasoning.

Today we're releasing v2 of API-Bench, a benchmark that tests how well LLMs can execute against APIs. Here are the results: https://superglue.ai/benchmark_v2

Tl;dr: LLMs fail at integrations because they lack ground truth, state, debugging ability, and access to real system context - everything API integrations fundamentally require.

Here’s what we found:

1. LLMs are only as good as the data they’re trained on: when docs change, APIs evolve, or systems are niche/long-tail, they use outdated patterns, guess missing pieces, and hallucinate endpoints and parameters.

2. LLMs are stateless, but integrations are stateful: auth handshakes, pagination, retries, and multi-step flows all need memory, but LLMs can’t persist intermediate values or reason across steps.

3. LLMs produce code that “looks right” but fails at runtime: they can’t isolate the failing step or make sense of real error messages, so they can’t fix what’s broken or retry with new hypotheses.

4. LLMs can’t reliably interpret imperfect API design: where humans can infer the intended function, LLMs hallucinate whatever looks reasonable.
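
To make point 2 concrete, here is a minimal Python sketch of the state an integration has to carry across steps. It is illustrative only: `FakeApi`, `login`, `list_items`, and `fetch_all` are hypothetical names invented for this example, not part of API-Bench or superglue, and the in-memory API stands in for a real service.

```python
from dataclasses import dataclass


class TransientError(Exception):
    """Stands in for a retryable failure such as a timeout."""


@dataclass
class FakeApi:
    """Hypothetical in-memory API: token auth, cursor pagination, one flaky call."""
    items: list
    page_size: int = 2
    fail_once_at: int = 2  # the Nth list_items call fails once (assumption)
    calls: int = 0

    def login(self) -> str:
        return "token-123"

    def list_items(self, token: str, cursor: int = 0) -> dict:
        if token != "token-123":
            raise PermissionError("invalid token")
        self.calls += 1
        if self.calls == self.fail_once_at:
            raise TransientError("temporary failure")
        page = self.items[cursor:cursor + self.page_size]
        next_cursor = cursor + self.page_size
        has_more = next_cursor < len(self.items)
        return {"data": page, "next": next_cursor if has_more else None}


def fetch_all(api: FakeApi, max_retries: int = 3) -> list:
    token = api.login()  # state 1: credential from the auth step
    cursor = 0           # state 2: position in the result set
    results = []         # state 3: values accumulated so far
    while cursor is not None:
        for attempt in range(max_retries):
            try:
                # Retry the same cursor, so no page is skipped or duplicated.
                page = api.list_items(token, cursor)
                break
            except TransientError:
                if attempt == max_retries - 1:
                    raise
        results.extend(page["data"])
        cursor = page["next"]
    return results


api = FakeApi(items=[1, 2, 3, 4, 5])
all_items = fetch_all(api)
```

The token, the cursor, and the partial results all have to survive the failed second call. That is the kind of intermediate state a single stateless generation has no place to keep.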

We open sourced the benchmark so you can test your own APIs or contribute new ones: https://github.com/superglue-ai/superglue/tree/main/eval/llm...

Curious to hear your experience, and of course always happy to share more learnings.