
Dapper, a Large-Scale Distributed Systems Tracing Infrastructure (2010)

https://research.google/pubs/dapper-a-large-scale-distributed-systems-tracing-infrastructure/
1•tosh•1m ago•0 comments

Amazon Sponsors AI Energy Summit Featuring Climate Deniers

https://www.desmog.com/2025/12/18/amazon-sponsors-ai-energy-summit-featuring-climate-deniers/
1•robtherobber•1m ago•0 comments

Show HN: Surprise Guardian – privacy for couples on shared laptops

https://chromewebstore.google.com/detail/surprise-guardian-privacy/fmefpopmejbhbkafilpboiiielffibbd
1•NabilChiheb•4m ago•0 comments

Emergency UX Audit: When Body Failure Meets the System's Infinite Loop

https://suggger.substack.com/p/emergency-ux-audit-when-body-failure
1•Suggger•5m ago•0 comments

Library Liberation: Competitive Performance Through Compiler-Composed Nanokernels

https://arxiv.org/abs/2511.13764
1•matt_d•5m ago•0 comments

The iOS Weekly Brief – Issue #42

https://vladkhambir.substack.com/p/the-ios-weekly-brief-issue-42
2•khambir•6m ago•0 comments

Claude Code Daily Degradation Tracker

https://marginlab.ai/trackers/claude-code/
3•qwesr123•6m ago•0 comments

Official Earthbound 64 Cancellation Interview (2013)

https://yomuka.wordpress.com/2013/08/18/earthbound-64-cancellation-interview-itoi-miyamoto-iwata/
1•realslimjd•7m ago•0 comments

Modeling uncertainty: A blueprint for the next 24 years of iconographic research

https://resonism.substack.com/p/uncertainty-is-invaluable
1•jkoester•9m ago•0 comments

To Keep Water Liquid, the Red Planet Needed to Freeze

https://www.universetoday.com/articles/to-keep-water-liquid-the-red-planet-needed-to-freeze
1•rbanffy•9m ago•0 comments

America's new dietary guidelines ignore decades of scientific research

https://www.technologyreview.com/2026/01/08/1130905/americas-diet-guidelines-ignore-scientific-re...
3•rbanffy•9m ago•1 comment

Apple-TSMC: The Partnership That Built Modern Semiconductors

https://newsletter.semianalysis.com/p/apple-tsmc-the-partnership-that-built
1•rbanffy•10m ago•0 comments

Finland's electricity consumption hits all-time high (15.6 GW)

https://yle.fi/a/74-20203123
1•iljah•11m ago•1 comment

Recommended RSS Readers

https://www.coryd.dev/posts/2025/recommended-rss-readers
1•cdrnsf•12m ago•0 comments

Show HN: PromptStash – Save and Reuse AI Prompts Across ChatGPT, Claude, Gemini

https://chromewebstore.google.com/detail/promptstash-ai-prompt-man/ocgkponbnolpgobllplcamfobolbjbcj
1•ktg0215•13m ago•1 comment

Cisco switches hit by reboot loops due to DNS client bug

https://www.bleepingcomputer.com/news/security/cisco-switches-hit-by-reboot-loops-due-to-dns-clie...
3•TechTechTech•13m ago•0 comments

How Dangerous Is It to Work for ICE?

https://www.motherjones.com/politics/2025/10/ice-deaths-assaults-administration-masks-covid19-sho...
2•mooreds•14m ago•0 comments

Show HN: Magrittr-like pipe syntax for Python

https://github.com/smacke/pipescript
1•smacke•14m ago•0 comments

Ask HN: "Too many people in HN work in Google or Apple–that itself is immoral."

3•bookofjoe•14m ago•0 comments

A Review of CrowdStrike Acquiring SGNL

https://radar.thecyberhut.com/p/a-review-of-crowdstrike-acquiring
1•mooreds•14m ago•0 comments

"They Saw a Protest": Cognitive Illiberalism and the Speech-Conduct Distinction [pdf]

https://www.stanfordlawreview.org/wp-content/uploads/sites/3/2012/05/Kahan-64-Stan-L-Rev-851.pdf
1•pcaharrier•14m ago•0 comments

Joel David Hamkins declares AI Models useless for solving math. Here's why

https://economictimes.indiatimes.com/news/new-updates/basically-zero-garbage-renowned-mathematici...
1•madihaa•14m ago•1 comment

Debugging CSS Values

https://docs.google.com/document/u/0/d/1zyKdPREtKT8OU4WtlHV_Wxet3SvyUtAXrTdFLPmYmdU/mobilebasic
1•erhuve•15m ago•0 comments

Anti-government protests in Tehran and other Iranian cities, videos show

https://www.bbc.com/news/articles/cg7y0579lp8o
1•mooreds•16m ago•0 comments

AI Agents Are Revolutionizing Open Source Software

https://oneuptime.com/blog/post/2026-01-09-how-ai-helps-open-source-succeed/view
2•ndhandala•17m ago•0 comments

Awesome: Logical Programming Language

https://github.com/matan-h/Awesome-lang
2•matan-h•17m ago•0 comments

What Happens When Governments Can't Tax Productivity Anymore?

https://m4ttl4w.substack.com/p/the-second-bounce-of-the-ball-part
2•mattyboomboom•18m ago•1 comment

The Dead Salmons of AI Interpretability

https://arxiv.org/abs/2512.18792
2•Anon84•18m ago•0 comments

Illinois man charged with hacking Snapchat accounts to steal nude photos

https://www.bleepingcomputer.com/news/security/illinois-man-charged-with-hacking-snapchat-account...
2•fleahunter•19m ago•0 comments

How do you manage quality when AI writes code faster than humans can review it?

2•lostsoul8282•20m ago•1 comment

Task-free intelligence testing of LLMs

https://www.marble.onl/posts/tapping/index.html
64•amarble•18h ago

Comments

vitaelabitur•17h ago
Aren't LLMs just super-powerful pattern matchers? And isn't guessing "taps" a pattern-recognition task? I am struggling to understand how your experiment relates to intelligence in any way.

Also, commercial LLMs generally have system instructions baked on top of the core models, which intrinsically prompt them to look for purpose even in random user prompts.

crooked-v•16h ago
There's definitely more than "just" pattern matching in there - for example, current SOTA models 'plan ahead' to simultaneously process both rough outlines of an answer and specific subject details to then combine internally for the final result (https://www.anthropic.com/research/tracing-thoughts-language...).
wood_spirit•15h ago
Eh that is still encompassed by the term “pattern matching” in this context. Sure it’s complicated, but it’s still just a glorified spell checker.
globnomulous•11h ago
I'm an LLM naysayer, and even I have no trouble seeing, or accepting, that they're much more than glorified spell checkers.
nomel•7h ago
And we're just glorified oxidation. At some point the concept of "emergent systems" comes into play.
lubujackson•13h ago
LLMs are pattern matchers, but every model is given specific instructions and response designs that influence what to do given unclear prompts. This is hugely valuable to understand since you may ask an LLM an invalid question and it is important to know if it is likely to guess at your intent, reject the prompt or respond randomly.

Understanding how LLMs fail differently is becoming more valuable than knowing that they all got 100% on some reasoning test with perfect context.

sdenton4•16h ago
I like the high-level idea! (How do we test intelligence in a non-functional way?)

In effect, the different response types measure how the models respond to a context-free, novel environment. I imagine humans would also respond in a variety of ways to this test, none of which are necessarily incorrect from the perspective of intelligence testing.

Many tests of human behavior (e.g., in behavioral economics) create some pretense context to avoid biasing the response that is actually being measured. For example, we may invite a participant to a study of color preference, but actually measure how fast they complete the task when the scientist has/hasn't bathed in a week (or whatever).

Likewise, for LLM intelligence testing, you could create pretext tasks and context, and perhaps measure what the model considered along the way instead of the actual task outcome.
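
One way to make that concrete in code: hand the model a pretext prompt and record a side-channel quantity instead of grading the answer. A toy sketch; the pretext question, the ask_model callable, and the choice of reply length as the measured quantity are all invented for illustration:

    # Toy pretext-task harness: the model answers an innocuous cover
    # question, but what we record is a side channel (reply length),
    # not whether the answer is "correct".
    PRETEXT = "Which of these colors do you prefer, and why: red, blue, or green?"

    def probe(ask_model, n=20):
        """ask_model: callable taking a prompt string and returning a reply string."""
        lengths = [len(ask_model(PRETEXT).split()) for _ in range(n)]
        return sum(lengths) / n  # the measured quantity, not the task outcome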

nestorD•15h ago
On alternative ways to measure LLM intelligence, we had good success with this: https://arxiv.org/abs/2509.23510

In short: start with a dataset of question-and-answer pairs, where each question has been answered by two different LLMs. Ask the model you want to evaluate to choose the better answer for each pair, then measure how consistently it selects winners. Does it reliably favor some models over others across the questions, or does it behave close to randomly? This consistency is a strong proxy for the model's intelligence.

It is not subject to dataset leaks, lets you measure intelligence in many fields where you might not have golden answers, and converges pretty fast, making it really cheap to measure.
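
A minimal sketch of that setup, assuming a hypothetical judge(question, first_answer, second_answer) callable that wraps the model under evaluation and returns "a" or "b"; the names and the statistic are illustrative, not the paper's exact formulation:

    import random

    def preference_rate(pairs, judge):
        """pairs: list of (question, answer_from_model_A, answer_from_model_B).
        Returns how often the judge prefers model A's answer. A rate far from
        0.5 means the judge reliably separates the two source models; a rate
        near 0.5 is indistinguishable from random choice."""
        wins_a = 0
        for question, ans_a, ans_b in pairs:
            # Randomize presentation order to cancel position bias.
            if random.random() < 0.5:
                wins_a += judge(question, ans_a, ans_b) == "a"
            else:
                wins_a += judge(question, ans_b, ans_a) == "b"
        return wins_a / len(pairs)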

esafak•10h ago
Doesn't that presume that one model dominates the other?
vintermann•1h ago
Interesting, but couldn't a model "cheat" in this task by being very good at telling model outputs apart? How far do you get with a classifier simply trained to distinguish models by their output?

It seems to me many models - maybe by design - have a recognizable style which would be much easier to detect than evaluating the factual quality of answers.
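
That baseline is cheap to try; a minimal sketch with scikit-learn, assuming you have already collected a list of (completion_text, model_name) samples (the variable names are illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    # samples: pre-collected list of (completion_text, model_name) pairs.
    texts, labels = zip(*samples)
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=0)

    # Character n-grams pick up stylistic tics (punctuation habits, emoji,
    # boilerplate phrasing) rather than factual content.
    clf = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(X_train, y_train)
    print("style-classifier accuracy:", clf.score(X_test, y_test))

High held-out accuracy here would suggest a pairwise judge could be leaning on style rather than on the factual quality of the answers.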

CuriouslyC•14h ago
Game playing is the next frontier. Model economically valuable tasks as games and have the agents play/compete. Alphabench and Vendingbench show the potential of this approach.
ossa-ma•11h ago
A decade of reinforcement and agentic learning was spent playing games (Google DeepMind's AlphaGo and AlphaStar, OpenAI Five), including against each other. So what makes it a new frontier?
CuriouslyC•10h ago
Its application to LLMs to push capabilities. We're going to tap out expert feedback, and objective/competitive arenas are going to be the only way to progress at a reasonable speed.

The difference is going to be instead of starting from pre-existing games and hoping that "generalizes" to intelligence, this time people are going to build gamified simulators of economically valuable stuff. This is feasible now because we can use LLMs to help generate these games much faster than we would have been able to previously.

optimalsolver•11h ago
Typo:

"The behvior summary"

8note•9h ago
What's the assistant prompt being used for these? I don't think I've ever gotten these joking responses back to anything.
rdos•3h ago
This is very interesting, especially the last part, where gpt-5.2 and gpt-oss share the strikingly similar and distinctive outcome of being 90%+ Serious.

I tested this locally and got the same result with gpt-oss 120b, but only at the default 'medium' reasoning effort. When I used 'low' I kept getting more playful responses with emojis, and when I used 'high' I kept getting more guessing responses.

I had a lot of fun with this and it provided me with more insight than I would have thought.
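
For anyone wanting to repeat that sweep, a rough sketch assuming an OpenAI-compatible local server for gpt-oss 120b that honors a reasoning_effort parameter; the base_url, model name, and prompt are placeholders, not the article's actual setup:

    from openai import OpenAI

    # Assumes a local OpenAI-compatible endpoint; whether reasoning_effort
    # is honored depends on your serving stack.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    TAP_PROMPT = "tap"  # stand-in for the article's task-free prompt

    for effort in ("low", "medium", "high"):
        resp = client.chat.completions.create(
            model="gpt-oss-120b",
            reasoning_effort=effort,
            messages=[{"role": "user", "content": TAP_PROMPT}],
        )
        print(effort, "->", resp.choices[0].message.content[:120])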

vintermann•1h ago
These aren't task-free. They just have an implicit task: "figure out what you're expected to do." This sort of riddle task is 100% dependent on who does the expecting.

This is not a new idea. Traditional IQ tests pivoted to them (they weren't originally like that), and no doubt they have great "discriminative power": being able to figure out what's expected of you, and not getting intimidated by cryptic and obtuse tasks put before you, are certainly extremely valuable skills in, e.g., business and politics.

But I always respected real tasks more. A question on a math test is honest; if it doesn't precisely define what's expected of you, the taskmaster has done a bad job, not you. It can still be extremely demanding.

An implicit task, by comparison, smells more of riddles and gnosticism. Do you know the way? Do you know the genre? (Once you know the genre of implicit tasks typical of IQ tests, you can easily increase your performance by a lot.)

For that matter, this idea isn't new to machine learning either. Francois Chollet did it already, and he was IMO just as wrongheaded in thinking implicit tasks are somehow more indicative of "true intelligence" than explicit ones.