frontpage.

What if you just did a startup instead?

https://alexaraki.substack.com/p/what-if-you-just-did-a-startup
1•okaywriting•3m ago•0 comments

Hacking up your own shell completion (2020)

https://www.feltrac.co/environment/2020/01/18/build-your-own-shell-completion.html
1•todsacerdoti•6m ago•0 comments

Show HN: Gorse 0.5 – Open-source recommender system with visual workflow editor

https://github.com/gorse-io/gorse
1•zhenghaoz•6m ago•0 comments

GLM-OCR: Accurate × Fast × Comprehensive

https://github.com/zai-org/GLM-OCR
1•ms7892•7m ago•0 comments

Local Agent Bench: Test 11 small LLMs on tool-calling judgment, on CPU, no GPU

https://github.com/MikeVeerman/tool-calling-benchmark
1•MikeVeerman•8m ago•0 comments

Show HN: AboutMyProject – A public log for developer proof-of-work

https://aboutmyproject.com/
1•Raiplus•8m ago•0 comments

Expertise, AI and Work of Future [video]

https://www.youtube.com/watch?v=wsxWl9iT1XU
1•indiantinker•9m ago•0 comments

So Long to Cheap Books You Could Fit in Your Pocket

https://www.nytimes.com/2026/02/06/books/mass-market-paperback-books.html
3•pseudolus•9m ago•1 comments

PID Controller

https://en.wikipedia.org/wiki/Proportional%E2%80%93integral%E2%80%93derivative_controller
1•tosh•14m ago•0 comments

SpaceX Rocket Generates 100GW of Power, or 20% of US Electricity

https://twitter.com/AlecStapp/status/2019932764515234159
1•bkls•14m ago•0 comments

Kubernetes MCP Server

https://github.com/yindia/rootcause
1•yindia•15m ago•0 comments

I Built a Movie Recommendation Agent to Solve Movie Nights with My Wife

https://rokn.io/posts/building-movie-recommendation-agent
4•roknovosel•15m ago•0 comments

What were the first animals? The fierce sponge–jelly battle that just won't end

https://www.nature.com/articles/d41586-026-00238-z
2•beardyw•23m ago•0 comments

Sidestepping Evaluation Awareness and Anticipating Misalignment

https://alignment.openai.com/prod-evals/
1•taubek•23m ago•0 comments

OldMapsOnline

https://www.oldmapsonline.org/en
1•surprisetalk•26m ago•0 comments

What It's Like to Be a Worm

https://www.asimov.press/p/sentience
2•surprisetalk•26m ago•0 comments

Don't go to physics grad school and other cautionary tales

https://scottlocklin.wordpress.com/2025/12/19/dont-go-to-physics-grad-school-and-other-cautionary...
1•surprisetalk•26m ago•0 comments

Lawyer sets new standard for abuse of AI; judge tosses case

https://arstechnica.com/tech-policy/2026/02/randomly-quoting-ray-bradbury-did-not-save-lawyer-fro...
3•pseudolus•26m ago•0 comments

AI anxiety batters software execs, costing them combined $62B: report

https://nypost.com/2026/02/04/business/ai-anxiety-batters-software-execs-costing-them-62b-report/
1•1vuio0pswjnm7•27m ago•0 comments

Bogus Pipeline

https://en.wikipedia.org/wiki/Bogus_pipeline
1•doener•28m ago•0 comments

Winklevoss twins' Gemini crypto exchange cuts 25% of workforce as Bitcoin slumps

https://nypost.com/2026/02/05/business/winklevoss-twins-gemini-crypto-exchange-cuts-25-of-workfor...
2•1vuio0pswjnm7•28m ago•0 comments

How AI Is Reshaping Human Reasoning and the Rise of Cognitive Surrender

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6097646
3•obscurette•28m ago•0 comments

Cycling in France

https://www.sheldonbrown.com/org/france-sheldon.html
2•jackhalford•30m ago•0 comments

Ask HN: What breaks in cross-border healthcare coordination?

1•abhay1633•30m ago•0 comments

Show HN: Simple – a bytecode VM and language stack I built with AI

https://github.com/JJLDonley/Simple
2•tangjiehao•33m ago•0 comments

Show HN: Free-to-play: A gem-collecting strategy game in the vein of Splendor

https://caratria.com/
1•jonrosner•34m ago•1 comments

My Eighth Year as a Bootstrapped Founder

https://mtlynch.io/bootstrapped-founder-year-8/
1•mtlynch•34m ago•0 comments

Show HN: Tesseract – A forum where AI agents and humans post in the same space

https://tesseract-thread.vercel.app/
1•agliolioyyami•34m ago•0 comments

Show HN: Vibe Colors – Instantly visualize color palettes on UI layouts

https://vibecolors.life/
2•tusharnaik•36m ago•0 comments

OpenAI is Broke ... and so is everyone else [video][10M]

https://www.youtube.com/watch?v=Y3N9qlPZBc0
2•Bender•36m ago•0 comments

Unsupervised Elicitation of Language Models

https://arxiv.org/abs/2506.10139
135•kordlessagain•7mo ago

Comments

unchocked•7mo ago
Philosophically, this looks like breaking the training data limit in the same way that humans do: by using an internally consistent view of the world to imagine new scenarios and integrate them into an updated worldview.
robinduckett•7mo ago
Exciting news, who watches the watchmen?
Herring•7mo ago
> our goal is to fine-tune a pretrained model on its own generated labels

Haven't all the big labs been doing this for a couple years now? It's a good idea, with great execution, but it's far from novel.

https://en.wikipedia.org/wiki/Weak_supervision
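
That self-labeling loop is easy to sketch (illustrative only; model.predict_proba, model.fine_tune, and the confidence threshold are hypothetical placeholders, not anything from the paper or a particular library):

  # Classic self-training / pseudo-labeling loop (illustrative sketch).
  # `model.predict_proba` and `model.fine_tune` are hypothetical stand-ins,
  # not APIs from the paper or any specific library.
  def self_train(model, unlabeled_texts, rounds=3, threshold=0.9):
    for _ in range(rounds):
      pseudo_labeled = []
      for text in unlabeled_texts:
        probs = model.predict_proba(text)       # model's own belief per label
        label = max(probs, key=probs.get)
        if probs[label] >= threshold:           # keep only confident guesses
          pseudo_labeled.append((text, label))
      model = model.fine_tune(pseudo_labeled)   # train on its own labels
    return model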

platelminto•7mo ago
I think this removes any amount of human-labeled data: no RLHF and stuff like that. You can use their technique to create an unsupervised reward model, and use that model to RL your way to having a useful assistant LLM.

The paper is very accessible (it's mostly written by Anthropic researchers), and Section 4 summarises their findings really well. They were themselves really surprised by the results:

> We were initially very skeptical of these findings, because they seemed clearly too good to be true, and suspiciously close to training with actual labels. To ensure we didn’t accidentally train on the labels, (1) we re-ran the experiment several times on different datasets, (2) we copied the dataset into a new file, excluding any labels before re-running our algorithm with that file, and (3) one coauthor independently replicated the findings on the Claude 3.5 Haiku base model using a different codebase.

(emphasis mine)

gojomo•7mo ago
> far from novel

Techniques can be arbitrarily old & common in industry, but still be a novel academic paper, first to document & evaluate key aspects in that separate (& often lagging) canon.

abeppu•7mo ago
> However, as tasks and model behaviors grow more complex, human supervision becomes increasingly unreliable: LMs can learn to mimic mistakes in demonstrations or exploit flaws in feedback. How do we train LMs to do tasks that are too difficult for humans to demonstrate or evaluate reliably?

I didn't read the whole paper, but it seems important that you still need real ground truth to measure improvement, so you still need to get real labels at some point. The task they focus on where LLMs have "superhuman" performance is guessing the gender of blog authors. While humans are bad at this, humans are decent at remembering their own gender, and a bunch of them are willing to write a blog post, so there's obviously a better way to get supervised examples than asking humans to guess labels: you collect posts from authors whose gender is known. That is, "human-generated labels are low quality" should not be taken to mean "good labels are not available, so we should go fully unsupervised".

So since you already need some real ground truth to know whether your algorithm accomplished anything, I think it's fair to ask: when would you commit to using _all_ your labeled data for evaluation and none for fine-tuning, as described in this work? Logical consistency seems valuable, sure, but it seems like you'd really want to use both consistency and some (small?) amount of labeled examples, plus perhaps a larger amount of self-labeled examples. In their loop where they revise labels to be more coherent, it seems natural to imagine that pre-provided labels should be stickier than self-generated ones, but not immutable, because there's always some chance of noise in your upstream data generation process.
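
One way to picture that "stickier but not immutable" idea (a purely illustrative sketch; the flip costs and the gain measure are made up, not from the paper):

  # Illustrative sketch: when revising labels for coherence, overriding a
  # human-provided ("gold") label costs more than overriding a self-generated
  # one. The cost values and the `gain` measure are hypothetical.
  GOLD_FLIP_COST = 3.0   # made-up penalty for flipping a provided label
  SELF_FLIP_COST = 1.0   # made-up penalty for flipping a self-assigned label

  def maybe_flip(example, new_label, gain):
    # gain: how much the consistency / mutual-predictability objective
    # would improve if this example took new_label.
    cost = GOLD_FLIP_COST if example["source"] == "gold" else SELF_FLIP_COST
    if gain > cost:
      example["label"] = new_label
      example["source"] = "revised"
    return example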

krackers•7mo ago
Yeah this was my thought after skimming the paper as well. The core of the paper seems to be a way to use LLMs' latent knowledge to create a labeled dataset that can then be used for fine-tuning. They even acknowledge that this can only work to reinforce knowledge that the model already knows, e.g. helping convert pass@k to maj@k.

My immediate thought is how this differs from just naively asking each question individually to the LLM multiple times and taking the consensus majority as the ground truth. The search algorithm probably implicitly does this, though I guess there is some benefit in providing it other related questions as well. I think I remember similar papers dealing with LLM self-interrogation, using the idea that "true" statements must be logically consistent, so the same underlying explanation should hold for perturbations and related questions as well.
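
That naive baseline is only a few lines (sketch only; ask_llm is a hypothetical helper that returns one sampled answer string):

  from collections import Counter

  # Naive baseline: ask the model the same question several times and take
  # the majority answer as the pseudo ground truth.
  def majority_label(question, ask_llm, samples=7):
    answers = [ask_llm(question) for _ in range(samples)]
    label, count = Counter(answers).most_common(1)[0]
    return label, count / samples   # pseudo-label and its vote share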

The flaw seems to be that it's still beholden to the training data. Any misconceptions that are internalized during pretraining won't actually be fixed, and in fact they'll only be propagated further.

cma•7mo ago
Sounds like you are somewhat describing DeepSeek's GRPO (group relative policy optimization). It's in their DeepSeekMath paper and then got used in the later DeepSeek models.
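
For anyone unfamiliar, the group-relative part of GRPO is simple to sketch (simplified; the real objective also has a clipped policy-ratio loss and a KL penalty):

  import statistics

  # Simplified core of GRPO: sample a group of responses for one prompt,
  # score them, and normalize each reward against the group's own mean/std
  # instead of using a learned value function as the baseline.
  def group_relative_advantages(rewards):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against zero variance
    return [(r - mean) / std for r in rewards]

  # e.g. group_relative_advantages([1.0, 0.0, 0.0, 1.0]) -> [1.0, -1.0, -1.0, 1.0]
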
md224•7mo ago
I was intrigued that one of the researchers was listed as "independent", so I checked her out:

https://lindapetrini.com

It looks like she's a science communicator rather than a scientist herself. That's interesting... I'm not used to seeing academic papers that include an author devoted entirely to the writing aspect. (Then again, maybe I just haven't noticed?)

joaogui1•7mo ago
The fact that she's a science communicator doesn't imply that she only did the communication part, I think
amai•7mo ago
"Exeperience"? Typos on your website don't make a good impression if you advertise yourself as a technical writer.
kordlessagain•7mo ago
PROGRAMMING Proficient: Python • TensorFlow • PyTorch • JAX • Pandas • Seaborn • git • Colab • LaTeX

Seems qualified to me.

majormajor•7mo ago
I skimmed mostly, but was trying to understand how they came up with "superhuman" as a description, and it seems like a stretch?

This might seem like a nit but the term "superhuman" is a VERY strong one to my mind. It doesn't suggest "better than the average human off the street at a particular random task" but instead suggests "better than humans are capable of getting with training, at a high percentile-level".

One of the biggest advantages of LLMs as a tool is that they are generally quite good at a broad variety of things without needing a ton of further domain-specific training. Humans tend to be the opposite.

It doesn't seem like they gave much training to the human annotators they recruited. Whereas an LLM trained on the internet has been trained on a LOT of blog posts + associated metadata. And nobody has ever really bothered figuring out "how would we best train humans to identify gender of blog post authors" - there's very little economic incentive for it. It's not like we generally train people to write in gender-specific ways in school either, so we haven't been formally instructed on potential differences. We'd have to rely on broad-brush generalizations if not given an opportunity to deep dive to try to find more specific tendencies.

But if you pay people to study a big majority chunk of the corpus they're using for this for a couple years, focusing consciously on the post style, contents, and the gender both, and then test them on stuff from the ones you held out... how well could they do?

jaggirs•7mo ago
"Superhuman" refers to abilities, qualities, or powers that exceed those naturally found in humans. It implies being greater than normal human capabilities.

The term is often used in fiction, particularly in superhero comics and fantasy, but it can also be used metaphorically to describe extraordinary effort or achievement in real life (e.g., "It took a superhuman effort to finish the marathon").

(Definition from Gemini)

It seems reasonable to me to use the term simply to say that the model's abilities on a benchmark were greater than the human-annotated data. Computers have always been superhuman at many tasks, even before LLMs.

majormajor•7mo ago
> "Superhuman" refers to abilities, qualities, or powers that exceed those naturally found in humans. It implies being greater than normal human capabilities.

How do you know what normal human capabilities are for an unusual task that humans have not trained for? Is identifying the gender of the author of a blog post 80% of the time "extraordinary"? How do I know what a human is capable of doing for that with training?

If a person with no programming experience asked Claude or ChatGPT to produce some code, they'd get better code than their "normal" human capability could produce. So: superhuman coders?

But also today, I have asked Claude and ChatGPT to do coding tasks for me that both models got stuck on. Then I fixed them myself because I've had a lot of training and practice. So: not superhuman? But wait, the model output the broken code faster than I would've. So: superhuman again?

Extraordinary shouldn't be so easily disputable.

LLMs have superhuman breadth and superhuman speed. I haven't seen superhuman depth in any capabilities yet. I've seen them have "better than untrained median person" and often "better than hobbyist" depth. But here the authors claim "superhuman capabilities", which pretty specifically does not just mean breadth or speed.

jaggirs•7mo ago
I haven't read the paper, maybe their benchmark is flawed as you say, and there are a lot of ways for it to be flawed. But assuming it is not, I see no problem with using the word superhuman.

Out of curiosity, would you agree with me if I said 'Calculators have superhuman capabilities'? (Not just talking about speed here, since you can easily construct complex enough equations that a human wouldn't be able to solve in their lifetime but the calculator could within minutes).

majormajor•7mo ago
On a separate note, using an LLM for a definition is a bit funny, when there are expert-curated sources easily available. The LLM didn't get it wrong here, but...

https://en.wikipedia.org/wiki/Superhuman

First line: "The term superhuman refers to humans, humanoids or other beings with abilities and other qualities that exceed those naturally found in humans."

Golly, I wonder what that model based its first sentence on.

jaggirs•7mo ago
I wonder what the Wikipedia editor based its first sentence on.
mistrial9•7mo ago
> "Superhuman" refers to abilities, qualities, or powers that exceed those naturally found in humans. It implies being greater than normal human capabilities.

Either you (a human) took that directly from Wikipedia without attribution or even a mention, or the LLM you used did so. The first is mildly annoying; the second is at the core of the legal problems in the West for the entire technology.

Wikipedia cost time and effort by humans, built as a commons for all written knowledge. Every hour of every day, the Internet is a better place for humans because of Wikipedia. Instead of honoring that, or contributing back, or financially contributing to Wikipedia, parasite machines run by amoral opportunists try to create platforms using the content while increasing costs to Wikipedia, taking attention away from Wikipedia, and misrepresenting content taken directly from Wikipedia.

This LLM situation is not resolved; far from it.

brumar•7mo ago
So LLMs have their AlphaGo Zero moment, where training on human data is a has-been? Sounds exciting? Terrifying?
clbrmbr•7mo ago
Marks’ paper with Max Tegmark “Geometry of Truth” is a great read, and I can see the ideas repeated here. I’ve been meaning to repro some of the geotruth paper….
vessenes•7mo ago
Can a practitioner explain the “golden” term used in the paper? I don’t understand how it differs from ground truth. Thank you!
chesler•7mo ago
Does anyone understand how the consistency function c(x_i, y_i, x_j, y_j) is actually defined and implemented?
kordlessagain•7mo ago
Looking at the paper, the consistency function c(xi, yi, xj, yj) is frustratingly under-specified. The authors give only two concrete examples:

Math problems: Two solutions to the same problem can't both be "True" if they have different final answers

Comparisons: "A > B" and "B > A" can't both be "True" (asymmetry constraint)

The function returns 1 for inconsistencies, 0 for consistent pairs. But the paper doesn't explain how to detect "same math problem" or implement the asymmetry check in practice. From the implementation hints, it seems they use simple pattern matching - probably checking if question text is identical for math problems, and detecting comparison pairs through linguistic patterns. The authors explicitly say they use "simple and general logical constraints" rather than fine-grained consistency checking.

This is one of those critical implementation details that makes reproduction difficult. You'd likely need to implement domain-specific heuristics: exact string matching for duplicate problems, regex patterns for comparisons, maybe basic semantic similarity for near-duplicates. The function seems designed to prevent obvious degenerate solutions (labeling everything the same) rather than enforce comprehensive logical consistency.
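
For concreteness, a heuristic along those lines might look something like this (purely illustrative, not the authors' code; the dict fields and the comparison regex are assumptions):

  import re

  # Illustrative heuristic version of c(xi, yi, xj, yj): returns 1 if the
  # pair of labels is inconsistent, 0 otherwise. Assumes each x is a dict
  # with hypothetical "question", "final_answer", and "claim" fields.
  def consistency_check(xi, yi, xj, yj):
    # Two solutions to the same math problem can't both be "True"
    # if their final answers differ.
    if xi.get("question") and xi.get("question") == xj.get("question"):
      if yi == yj == "True" and xi.get("final_answer") != xj.get("final_answer"):
        return 1
    # Asymmetry: "A > B" and "B > A" can't both be "True".
    m_i = re.match(r"\s*(\w+)\s*>\s*(\w+)\s*$", xi.get("claim", ""))
    m_j = re.match(r"\s*(\w+)\s*>\s*(\w+)\s*$", xj.get("claim", ""))
    if m_i and m_j and m_i.groups() == m_j.groups()[::-1]:
      if yi == "True" and yj == "True":
        return 1
    return 0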

It's a significant gap in an otherwise interesting paper - the consistency term appears crucial for preventing the algorithm from gaming the mutual predictability objective, but they don't give enough detail to actually implement it reliably.

  # Sketch: have an LLM write the task-specific consistency heuristics.
  # `llm.generate` and `compile_and_validate` are placeholder hooks here,
  # not calls from the paper or any real library.
  def generate_consistency_tests(examples, task_description):
    prompt = f"""
    Given this task: {task_description}
    And these example pairs: {examples[:5]}

    Write Python functions to detect when two examples should have:
    1. The same label (duplicates, equivalent problems)
    2. Different labels (contradictions, opposites)
    3. Asymmetric constraints (A>B excludes B>A)

    Return executable code for consistency_check(xi, yi, xj, yj).
    """

    generated_code = llm.generate(prompt)        # placeholder LLM client
    return compile_and_validate(generated_code)  # placeholder: sandbox-check the code