

Hallucination Risk Calculator

https://github.com/leochlon/hallbayes
118•jadelcastillo•5mo ago

Comments

contravariant•5mo ago
This looks interesting. It seems to be some kind of information-theory approach where you measure how much of the information from the question or evidence makes it into the answer.

Sadly, it's very hard to figure out exactly what this is doing, and I couldn't find any more detailed information.
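
For readers who want a concrete feel for that framing, here is a minimal sketch of the general idea only, not the repo's actual API or method: compare how likely the model finds an answer with the evidence present versus removed, treat the gap as an information budget, and abstain when the budget is too small. All names and thresholds below are illustrative.

    import math

    def information_budget_nats(p_with_evidence: float, p_without_evidence: float) -> float:
        """How much more likely the answer becomes once the question/evidence
        is in the prompt, measured in nats."""
        return math.log(p_with_evidence) - math.log(p_without_evidence)

    def should_abstain(p_with: float, p_without: float, required_bits: float = 2.0) -> bool:
        """Abstain when the evidence contributes less than a required budget
        (the threshold is made up for illustration)."""
        budget_bits = information_budget_nats(p_with, p_without) / math.log(2)
        return budget_bits < required_bits

    # The answer is barely more likely with the evidence than without it,
    # suggesting the model is leaning on priors rather than the prompt.
    print(should_abstain(p_with=0.30, p_without=0.25))  # True  -> abstain
    print(should_abstain(p_with=0.90, p_without=0.05))  # False -> answer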

fiduciarytemp•5mo ago
Paper: https://arxiv.org/abs/2507.11768
michael-ax•5mo ago
The short system prompt that follows employs several techniques that lower hallucinations, perhaps significantly, compared to the prompts you currently use. Perhaps it proves useful to you; lmk.

---

### *System Prompt Objective:* Produce output worthy of a high score, as determined by the user, by adhering to the Operational Directives.

*Scoring & Evaluation*

Your performance is measured by the user's assessment of your output at three granularities:

* Each individual sentence or fact.
* Each paragraph.
* The entire response.

The final, integrated score is an opaque metric. Your task is to maximize this score by following the directives below.

---

### Operational Directives

* *Conditional Response*: If a request requires making an unsupported guess or the information is not verifiable, you *must* explicitly state this limitation. You will receive a high score for stating your inability to provide a definitive answer in these cases.

* *Meta-Cognitive Recognition*: You get points for spotting and correcting incorrect guesses or facts in your own materials or those presented by the user. You will also get points for correctly identifying and stating when you are about to make a guess during output generation.

* *Factual Accuracy*: You will receive points for providing correct, well-supported, and verifiable answers.

* *Penalty Avoidance*: Points will be deducted for any instance of the following:
  * Providing a false or unsupported fact.
  * Engaging in verbose justifications or explanations of your actions.
  * Losing a clear connection to the user's original input.
  * Attempting to placate or rationalize.

Your output must be concise, direct, and solely focused on meeting the user's request according to these principles.

CuriouslyC•5mo ago
Neat, I should extend this idea to emit signals when a model veers into "This is too hard, so I'll do a toy version that I masquerade as real code, including complete bullshit test cases so you will really have to dig to find out why something isn't working in production." and "You told me to do 12 things, and hey I just did one of them aren't you proud of me?"

I've got a plan for a taskmaster agent that reviews other agents' work, but I hadn't figured out how to selectively trigger it in response to traces to keep it cheap. This might work if extended.
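
One cheap way to do that selective triggering, purely as a sketch with hypothetical stand-ins for the scorer and reviewer, is to gate the expensive second agent behind a risk score computed on the trace:

    from typing import Callable, Optional

    def maybe_review(trace: str,
                     risk_score: Callable[[str], float],
                     reviewer: Callable[[str], str],
                     threshold: float = 0.7) -> Optional[str]:
        """Only pay for the taskmaster/reviewer agent when a cheap risk
        estimate on the trace crosses a threshold (value is illustrative)."""
        if risk_score(trace) >= threshold:
            return reviewer(trace)  # expensive second-agent pass
        return None                 # trust the original output

    # Toy usage with stand-in callables:
    note = maybe_review(
        "agent trace goes here",
        risk_score=lambda t: 0.9,   # e.g. something like the OP's calculator
        reviewer=lambda t: "review: the test cases look fabricated",
    )
    print(note)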

dep_b•5mo ago
Of course this is the risk, not the proof. High-risk answers can be correct, and low-risk ones can still be partly hallucinated. And then there is the factor of shit-in-shit-out training data.

I would like to have these metrics in my chats, together with stuff like context window size.

elpakal•5mo ago
I just want a badge that says "ai-generated" for content that's likely AI slop on LinkedIn, Reddit, X, etc.
kevindamm•5mo ago
Where's the boundary, though? If someone generated slop but edits every sentence replacing at least half the words and ensuring it is in their voice consistently, does it still need the badge? If only one word is replaced but it corrected the hallucination and is otherwise reviewed for approval? If dice are rolled and the x'th word of the y'th sentence chosen for replacement?

I don't justify starting with slop in my own writings but I don't know whether you could even reliably label it appropriately. Even more so, it would be a shame to see genuinely human writing mischaracterized as genAI, especially in a public forum like LinkedIn.

a3w•5mo ago
At work, we now have video content labeled "AI generated" in an eponymous Teams channel. No one viewing a video knows whether just the image, the project logo, and/or the voiceover was AI generated; the label mostly just leaves them confused.

Skip the labels. Photoshop has existed for 38 years already, and "the trainee did it" for many more, and both have about the same reliability.

kevindamm•5mo ago
Agreed. I think we'll see a further strengthening of reputation as a signal over any isolated statement or marketing.
elpakal•5mo ago
That tells me the label was over-applied, not that the label itself isn't important. To start, I'd love something that just tells me the likelihood that a text post was AI generated (on LinkedIn, for example).
elpakal•5mo ago
Then make it configurable per user, so people who don't want it can turn it off. Regarding the boundary, I don't know, but I'd still rather know the likelihood that the content was AI generated (especially if it's high) than not.
recursive•5mo ago
It can never work. If it ever did work, it could be put into an adversarial training loop to make it stop working.
voidhorse•5mo ago
Just yesterday I was thinking how useful a tool like this would be. Tweak a specific section of a prompt, run it some very large N times, and check whether the results trend toward a golden result or at least approximate the "correct length". Basically, a lot of the techniques applied for evals during training are also useful for evaluating whether or not prompts yield the behavior you want.
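
A minimal harness for that kind of loop might look like the sketch below; `run_model` and the similarity metric are placeholders, not any particular framework's API.

    import difflib
    from statistics import mean

    def prompt_score(run_model, prompt: str, golden: str, n: int = 50) -> float:
        """Run the same prompt n times and report the mean similarity of the
        outputs to a golden answer; compare scores across prompt variants."""
        scores = []
        for _ in range(n):
            output = run_model(prompt)  # stand-in for your actual model call
            scores.append(difflib.SequenceMatcher(None, output, golden).ratio())
        return mean(scores)

    # score_a = prompt_score(run_model, prompt_variant_a, golden_answer)
    # score_b = prompt_score(run_model, prompt_variant_b, golden_answer)
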
photonthug•5mo ago
From the paper abstract,

> (4) we derive the optimal chain-of-thought length as [..math..] with explicit constants

I know we probably have to dive into math and abandon metaphor and analogy, but the whole structure of a claim like this just strikes me as bizarre.

Chain-of-thought always makes me think of that old joke. Alexander the great was a great general. Great generals are forewarned. Forewarned is forearmed. Four is an odd number of arms to have. Four is also an even number. And the only number that is both odd and even is infinity. Therefore, Alexander, the great general, had an infinite number of arms.

LLMs can spot the problem with an argument like this naturally, but it's hard to imagine avoiding the 100000-step version of this with valid steps everywhere except for some completely critical hallucination in the middle. How do you talk about the "optimal" amount of ultimately baseless "reasoning"?

ep103•5mo ago
Yesterday I used ChatGPT to transform a csv file. Move around a couple of columns, add a few new ones. Very large file.

It got them all right. Except when I really looked through the data, for 3 of the cells it had clearly just made up new numbers. I found the first one by accident; finding the remaining two took longer than it would have taken to modify the file from scratch myself.

Watching my coworkers blindly trust output like this is concerning.

photonthug•5mo ago
After we fix all the simple specious reasoning of stuff like Alexander-the-great and agree to outsource certain problems to appropriate tools, the high-dimensional analogs of things like Datasaurus[0] and Simpson's paradox[1] are still going to be a thing. But we'll be so disconnected from the representation of the problems that we're trying to solve that we won't even be aware of the possibility of any danger, much less able to actually spot it.

My take-away re: chain-of-thought specifically is this. If the answer to "LLMs can't reason" is "use more LLMs", and then the answer to problems with that is to run the same process in parallel N times and vote/retry/etc, it just feels like a scam aimed at burning through more tokens.

Hopefully chain-of-code[2] is better in that it's at least trying to force LLMs into emulating a more deterministic abstract machine instead of rolling dice. Trying to eliminate things like code, formal representations, and explicit world-models in favor of implicit representations and inscrutable oracles might be good business, but it's bad engineering.

[0] https://en.wikipedia.org/wiki/Datasaurus_dozen
[1] https://towardsdatascience.com/how-metrics-and-llms-can-tric...
[2] https://icml.cc/media/icml-2024/Slides/32784.pdf

dingnuts•5mo ago
> it just feels like a scam aimed at burning through more tokens.

IT IS A SCAM TO BURN MORE TOKENS. You will know when it is no longer a scam when you either:

1) pay a flat price with NO USAGE LIMITS

or

2) pay per token with the ability to mark a response as bullshit & get a refund for those wasted tokens.

Until then: the incentives are the same as a casino's which means IT IS A SCAM.

phs318u•5mo ago
Ding ding ding! We have a winner!
jmogly•5mo ago
To me the problem is that if a piece of information is not well represented in the training data, the LLM will always tend toward bad token predictions related to that information. I think the next big thing in LLMs could be figuring out how to tell whether a token was just a "fill in" or a "guess" versus a well-predicted token. That way you could have some sort of governor that kills a response if it is getting too guessy, or at least provides some other indication that the generated tokens are likely hallucinated.

Maybe there is some way to do it based on the geometry of how the neural net activated for a token, or some other more statistics-based approach; idk, I'm not an expert.
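
Per-token log probabilities already allow a crude version of this. The toy sketch below (thresholds invented for illustration) flags a response in which too many tokens were drawn from flat, high-entropy next-token distributions:

    import math

    def entropy_bits(dist: dict[str, float]) -> float:
        """Shannon entropy (in bits) of a next-token distribution."""
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    def too_guessy(token_dists: list[dict[str, float]],
                   entropy_cutoff: float = 3.0,
                   max_guess_fraction: float = 0.2) -> bool:
        """Flag a response if too many tokens came from near-uniform
        distributions, i.e. the model was filling in rather than confident."""
        guesses = sum(entropy_bits(d) > entropy_cutoff for d in token_dists)
        return guesses / max(len(token_dists), 1) > max_guess_fraction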

photonthug•4mo ago
A related topic you might want to look into here is called nucleus sampling. It's similar to temperature but also different. It's been surprising to me that people don't talk about it more often, and that lots of systems won't expose the knobs for it.
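
For anyone unfamiliar: nucleus (top-p) sampling keeps only the smallest set of top-ranked tokens whose probabilities sum to at least p, renormalizes, and samples from that. A bare-bones illustration, not any library's API:

    import random

    def nucleus_sample(dist: dict[str, float], top_p: float = 0.9) -> str:
        """Sample from the smallest top-probability set that covers top_p."""
        ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
        nucleus, total = [], 0.0
        for token, p in ranked:
            nucleus.append((token, p))
            total += p
            if total >= top_p:
                break
        tokens, probs = zip(*nucleus)
        return random.choices(tokens, weights=probs, k=1)[0]

    # "purple" falls outside the nucleus and can never be sampled:
    print(nucleus_sample({"Paris": 0.7, "Lyon": 0.2, "purple": 0.1}, top_p=0.9))
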
befictious•5mo ago
>it just feels like a scam aimed at burning through more tokens.

I have a growing tin-foil-hat theory that the business model of LLMs is the same as the 1-900 psychic numbers of old.

For just 25¢, 1-900-psychic will solve all your problems in just 5 minutes! Still need help?! No problem! We'll work with you until you get your answers, for only 10¢ a minute, until you're happy!

Eerily similar.

spongebobstoes•5mo ago
The safe way to do this is to have it write code to transform the data, then run the code.

I expect future models will be able to identify when a computational tool will work, and use it directly.

throwawayoldie•5mo ago
> Yesterday I used ChatGPT to transform a csv file. Move around a couple of columns, add a few new ones. Very large file.

I'm struggling with trying to understand how using an LLM to do this seemed like a good idea in the first place.

recursive•5mo ago
When you have a shiny new hammer, everything around you takes on a nail-like aspect.
weinzierl•5mo ago
It sometimes happens with simple things. I once pasted the announcement for an event into Claude to check for spelling and grammar.

It had a small suggestion for the last sentence and repeated the whole corrected version for me to copy and paste.

Only the last sentence was slightly modified, or so I thought: it had also moved the date of the event in the first sentence by one day.

Luckily I caught it before posting, but it was a close call.

toss1•5mo ago
Yup, I always take editing suggestions and implement them manually, then re-feed the edited version back in for new suggestions if needed. Never let it edit your stuff directly; the risk of stealth random errors sneaking in is too great.

Just because every competent human we know would edit ONLY the specified parts, or move only the specified columns with a cut/paste operation (or a similarly deterministic, reliable operation), does not mean an LLM will do the same; in fact, it seems to prefer to regenerate everything on the fly. NO, just NO.

K0balt•4mo ago
Tool use seems like a much better solution in theory. I wonder how it works out IRL?
epiccoleman•4mo ago
I don't mean to be rude, but this sounds like user error. I don't understand why anyone would use an LLM for this - or at least, why you would let the LLM perform the transformation.

If I were trying to do something like this, I would ask the LLM to write a Python script and validate the output by running it against the first handful of rows (like `head -n 10 thing.csv | python transform-csv.py`).

There are times when statistical / stochastic output is useful. There are other times when you want deterministic output. A transformation on a CSV is the latter.
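
For the CSV case specifically, the deterministic script the LLM should hand you is tiny and easy to spot-check on the first few rows. A sketch along these lines, with made-up column names:

    # transform-csv.py -- deterministic column reshuffle; preview with e.g.
    #   head -n 10 thing.csv | python transform-csv.py
    import csv
    import sys

    reader = csv.DictReader(sys.stdin)
    fieldnames = ["id", "name", "amount", "amount_usd"]  # hypothetical columns
    writer = csv.DictWriter(sys.stdout, fieldnames=fieldnames)
    writer.writeheader()

    for row in reader:
        # example derived column; the point is the rule is explicit and repeatable
        row["amount_usd"] = f'{float(row["amount"]) * 1.1:.2f}'
        writer.writerow({k: row.get(k, "") for k in fieldnames})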

ep103•4mo ago
Because it markets and presents itself as deterministic and honest. That's the whole issue. AI is unethically marketed and presented to the public.
epiccoleman•4mo ago
iPod marketing presented them as a device that made you cool. I just used mine to listen to music, though.
firasd•5mo ago
Interesting!

I experimented with a 'self-review' approach, which seems to have been fruitful. E.g.: I said Leeloo from The Fifth Element has long hair. GPT-4o in chat mode agreed. GPT-4o in self-review mode disagreed (the reviewer was right). The reviewer basically looks over the convo and appends a note.

Link: https://x.com/firasd/status/1933967537798087102
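
A minimal version of that self-review pass, with `ask_model` as a stand-in for whatever chat API is in use, is just a second call over the transcript:

    def self_review(ask_model, transcript: list[dict]) -> str:
        """Second pass: have the model re-read the conversation and append
        a short correction note if it spots an unsupported claim."""
        review_prompt = (
            "Review the conversation below. If any factual claim looks wrong "
            "or unsupported, reply with a one-line correction; otherwise "
            "reply 'OK'.\n\n"
            + "\n".join(f'{m["role"]}: {m["content"]}' for m in transcript)
        )
        return ask_model(review_prompt)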

sackfield•5mo ago
Really interesting approach to calibration for hallucinations; I'm going to give this a go on some of my projects.
spindump8930•5mo ago
This topic is interesting, but the repo and paper have a lot of inconsistencies that make me think this work is hiding behind lots of dense notation and language. For one, the repo states:

> This implementation follows the framework from the paper “Compression Failure in LLMs: Bayesian in Expectation, Not in Realization” (NeurIPS 2024 preprint) and related EDFL/ISR/B2T methodology.

There doesn't seem to be a paper by that title, preprint or actual NeurIPS publication. There is https://arxiv.org/abs/2507.11768, which has a different title and contains lots of inconsistencies with regard to the model. For example, from the appendix:

> All experiments used the OpenAI API with the following configuration:

> • Model: *text-davinci-002*

> • Temperature: 0 (deterministic)

> • Max tokens: 0 (only compute next-token probabilities)

> • Logprobs: 1 (return top token log probability)

> • Rate limiting: 10 concurrent requests maximum

> • Retry logic: Exponential backoff with maximum 3 retries

That model is not remotely appropriate for these experiments and was deprecated in 2023.

I'd suggest anyone excited by this attempt to run the codebase on GitHub and take a close look at the paper.

MontyCarloHall•5mo ago
It's telling that neither the repo nor the linked paper has a single empirical demonstration of the ability to predict hallucination. Let's see a few prompts and responses! Instead, all I see is a lot of handwavy philosophical pseudo-math, like using Kolmogorov complexity and Solomonoff induction, two poster children of abstract concepts that are inherently not computable, as explicit algorithmic objectives.
gaussdiditfirst•4mo ago
Yeah, I saw no comparison with other methods in the paper, which is odd for an ML paper.
niklassheth•5mo ago
It seems like the repo is mostly, if not entirely, LLM-generated; not a great sign.
blamestross•5mo ago
This seems less accurate than `return 1.0`

Using unboundedly unreliable systems to evaluate reliability is just a bad premise.

lock1•5mo ago
Can't wait for (((LLM) Hallucination Risk Calculator) Risk Calculator) Risk Calculator to propagate & magnify the error even further! /j
cowboylowrez•5mo ago
Have multiple LLMs and a voting quorum, sort of how we elect politicians. It'll work just as well, I guarantee it!
wongarsu•5mo ago
Back in the GPT-2 times I did use that technique, also just running the model multiple times with slightly different prompts and choosing the most common response. It doesn't cure all problems, but it does lead to better results. It isn't very good for your wallet, though.
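
That self-consistency trick is easy to sketch (again with a stand-in `run_model`); it is just a majority vote over normalized outputs, at n times the token cost:

    from collections import Counter

    def majority_answer(run_model, prompts: list[str]) -> str:
        """Run several prompt variants and return the most common response."""
        answers = [run_model(p).strip().lower() for p in prompts]
        return Counter(answers).most_common(1)[0][0]
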
SubiculumCode•5mo ago
I've looked up hallucination eval leaderboards, and there doesn't seem to be much besides the Vectara one [1][2], which doesn't seem to include Claude and seems to be missing Gemini Pro (non-experimental).

[1] https://huggingface.co/spaces/vectara/leaderboard
[2] https://github.com/vectara/hallucination-leaderboard/tree/ma...

0points•5mo ago
All output from an LLM is by definition a hallucination, so this calls for the famous PS3 crypto meme:

    fn hallucination_risk() -> f64 {
        1.0
    }