Skribidi – Nimble bidirectional text stack for UIs

https://github.com/memononen/Skribidi
1•todsacerdoti•3m ago•0 comments

Lisp in your Excel sheet via lambda

https://spreadsheet.institute/lisp/
1•macmac•3m ago•0 comments

Kindle Store eInk manga bad formatting examples

https://github.com/ciromattia/kcc/wiki/Kindle-Store-bad-formatting
1•seam_carver•4m ago•0 comments

JSDev.Space - Modern JavaScript Development Hub

https://github.com/anliberant/jsdev-astro
1•javatuts•7m ago•0 comments

A closer look inside AI Mode

https://blog.google/products/search/ai-mode-development/
1•kristianp•9m ago•0 comments

Vibe Coding: Fad, Future or Folly?

https://thenewstack.io/vibe-coding-fad-future-or-folly/
1•mooreds•10m ago•0 comments

Show HN: MCP-Cloud – One-click hosting for MCP servers (50 templates)

https://mcp-cloud.ai/
2•vstoi_can•11m ago•1 comments

Freedom Is a Function of Wallet Balance

https://github.com/dmf-archive/dmf-archive.github.io
1•NetRunnerSu•11m ago•0 comments

Ask HN: Best way to make the most of WWDC?

1•jspann•12m ago•0 comments

SpaceX will begin decommissioning Dragon spacecraft

https://twitter.com/elonmusk/status/1930718684819112251
8•chirau•12m ago•0 comments

Stablecoin issuer Circle soars 168% in IPO debut, above expected range

https://www.cnbc.com/2025/06/05/stablecoin-issuer-circle-soars-in-nyse-debut-after-pricing-ipo-above-expected-range.html
1•donsupreme•13m ago•0 comments

Brazil's dWallet program will let citizens cash in on their data

https://restofworld.org/2025/brazil-dwallet-user-data-pilot/
1•ohjeez•13m ago•0 comments

Ask HN: How do traditional software systems fit into an agentic future?

1•calflegal•14m ago•0 comments

Defending Adverbs Exuberantly If Conditionally

https://countercraft.substack.com/p/defending-adverbs-exuberantly-if
1•benbreen•15m ago•0 comments

The benefits and dangers of anthropomorphic conversational agents

https://www.pnas.org/doi/10.1073/pnas.2415898122
1•hackernj•17m ago•0 comments

Microsoft stock just hit a record high as it cashes in on the AI boom

https://qz.com/microsoft-intraday-stock-high-ai-june-2025-1851783771
2•mikece•17m ago•0 comments

SpaceX to build its own advanced chip packaging factory in Texas

https://www.tomshardware.com/tech-industry/manufacturing/elon-musks-spacex-to-build-its-own-advanced-chip-packaging-factory-in-texas-700mm-x-700mm-substrate-size-purported-to-be-the-largest-in-the-industry
2•ironyman•21m ago•0 comments

Projects on Claude now support 10x more content

https://twitter.com/AnthropicAI/status/1930671235647594690
2•handfuloflight•24m ago•2 comments

Elon Musk has publicly broken with Trump over budget reconciliation bill

https://www.theregister.com/2025/06/05/trump_musk_face_off_budget_bill/
9•hnthrowaway0315•30m ago•0 comments

Angelcore: Building an Artificial Angel – Recursive Symbolic AI and Bio Memory

https://github.com/Mattbusel/ANGELCORE
1•Shmungus•32m ago•9 comments

East River Source Control

https://ersc.io
3•eterps•34m ago•1 comments

Musk says Trump is named in Epstein files

https://thehill.com/policy/technology/5335453-elon-musk-donald-trump-jeffrey-epstein-files/
30•wslh•38m ago•10 comments

Tell me how do you vibe-code, or vibe-marketing

1•xantin•39m ago•0 comments

A Brief History of (Alleged) Military Laser Weapon Kills

https://www.laserwars.net/p/military-laser-weapon-kills-history
1•davidcarraway•42m ago•1 comments

docker-kasm: on-demand, disposable Docker containers accessible via browser

https://github.com/linuxserver/docker-kasm
2•indigodaddy•42m ago•0 comments

Chatterbox

https://github.com/resemble-ai/chatterbox
2•handfuloflight•44m ago•0 comments

Dutch Government Collapses over Migration Dispute

https://www.nytimes.com/2025/06/03/world/europe/geert-wilders-netherlands-coalition.html
3•sharpshadow•45m ago•0 comments

A Philosophical Inquiry into AI-Generated and Human-Generated Code

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5265751
1•camilochs•47m ago•0 comments

Faced with drought, fertilizer helps grasslands grow strong, global study shows

https://phys.org/news/2025-05-drought-fertilizer-grasslands-strong-global.html
1•PaulHoule•48m ago•0 comments

Tesla shares tank 16% as Musk-Trump spat escalates

https://www.cnbc.com/2025/06/05/tesla-shares-musk-trump.html
22•ivape•49m ago•1 comments

AI LLMs can't count lines in a file

19•sha-69•1d ago
Was starting to mess around with the latest LLM models and found that they're not great at counting lines in files.

I gave Gemini 2.5 flash a python script and asked it to tell me what was at line 27 and it consistently got it wrong. I tried repeatedly to prompt it the right way, but had no luck.

https://g.co/gemini/share/0276a6c7ef20

Is this something that LLM bots are still not good at? I thought they had gotten past the "strawberry" counting problems.

Here's the raw file: https://pastebin.com/FBxhZi6G
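(For comparison, the lookup itself is a few lines of deterministic Python; `src` below is a stand-in for the pastebin file, not its actual contents.)

```python
def line_at(text: str, n: int) -> str:
    """Return the 1-based n-th line of text, the way an editor numbers it."""
    lines = text.splitlines()
    if not 1 <= n <= len(lines):
        raise IndexError(f"only {len(lines)} lines")
    return lines[n - 1]

# Stand-in content; the real file is the pastebin link above.
src = "\n".join(f"line {i}" for i in range(1, 31))
print(line_at(src, 27))  # -> line 27
```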

Comments

sha-69•1d ago
I gave this file to ChatGPT and Grok and had similar issues there too
tuga2099•1d ago
Been there too
fasthands9•1d ago
The thing is I imagine the LLM would be able to write code that counts the lines and outputs what's on line 27. It seems inevitable (in a way that scares me) that a good model in the near future would know enough to write that script and execute it on its own.

My understanding is early LLMs were bad at math (for similar reasons) but then got better once the model was hooked up to a calculator behind the scenes.

paulddraper•1d ago
Claude 4 added Code Execution.

E.g. ask it to find the 100th prime, it will write a Python script and then run that.

avalys•1d ago
How good do you think a human brain is at doing this if you simply provided the contents of the file as a string of characters (i.e. not in a text editor with line breaks rendered, etc.)?
digianarchist•1d ago
I think computers are good at counting lines delimited by newline characters.
t-3•1d ago
Why are you comparing LLMs to a human brain? Software should integrate software when solving problems. It's completely reasonable to expect an LLM given a "count lines" problem to just pipe the text through wc -l.
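A sketch of that delegation in Python (assumes a POSIX system with `wc` on the PATH):

```python
import subprocess

text = "first\nsecond\nthird\n"
# wc -l counts newline characters, which matches the line count
# for a conventionally newline-terminated file.
result = subprocess.run(["wc", "-l"], input=text.encode(),
                        capture_output=True, check=True)
print(int(result.stdout.split()[0]))  # -> 3
```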
mhh__•1d ago
Which they will do, I'd imagine, after being told they have access to a shell.
selcuka•1d ago
Most LLMs have access to such tools. Well, maybe not a Unix shell, but something similar. This is from GPT 4.5's system prompt [1]:

    python

    When you send a message containing Python code to python, it
    will be executed in a stateful Jupyter notebook environment.
    python will respond with the output of the execution or time
    out after 60.0 seconds. The drive at '/mnt/data' can be used
    to save and persist user files. Internet access for this
    session is disabled. Do not make external web requests or API
    calls as they will fail.

[1] https://github.com/0xeb/TheBigPromptLibrary/blob/main/System...
tylersmith•1d ago
An LLM itself can't use wc. Coding agents like Claude Code or Cursor will call out to command line tools for this kind of problem when the LLM detects it.
selcuka•1d ago
Well, maybe not wc directly, but they have access to sandboxed Python environments. It must be trivial for an LLM to write the Python code that calculates this.

I don't understand why Gemini insists that it can count the lines itself, instead of falling back to its Python tool [1].

[1] https://github.com/elder-plinius/CL4R1T4S/blob/main/GOOGLE/G...

xanth•1d ago
Why don't we put the file characteristics into the token set that represents the file: file name, extension, size, lines, last edited, character encoding, and so on?
scarface_74•1d ago
ChatGPT o4-mini got it right

https://chatgpt.com/share/683f9f73-42d8-8010-9cbc-27ad396a55...

ChatGPT 4o (the product not the LLM) got it right with a little additional prompting

https://chatgpt.com/share/683f9fd4-e61c-8010-99be-81d25264ba...

thisisauserid•1d ago
Gemini Pro 2.5 says 207.

But use Flash so you can get a wrong answer sooner?

Kranar•1d ago
This is true. In general, LLMs are not great at counting because they don't see individual characters; they see tokens.

Imagine you spoke perfect English, but you learned how to write English using Mandarin characters, basically using the closest sounding Mandarin characters to write in English. Then someone asks you how many letter o's are in the sentence "Hello how are you?". Well, you don't read using English characters, you read using Mandarin characters, so you read it as "哈咯，好阿优？" because using Mandarin characters that's the closest sounding way to spell "Hello how are you?"

So now if someone asks you how many letter o's are in "哈咯,好阿优?", you don't really know... you are familiar conceptually that the letter o exists, you know that if you spelled the sentence in English it would contain the letter o, and you can maybe make an educated guess about how many letter o's there are, but you can't actually count out how many letter o's there are because you've never seen actual English letters before.

The same thing goes for an LLM, they don't see characters, they only see tokens. They are aware that characters do exist, and they can reason about their existence, but they can't see them so they can't really count them out either.
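A toy Python version of the analogy (the vocabulary and token IDs are made up purely for illustration): once the sentence becomes a list of IDs, the letter o's are simply not there to count.

```python
# Made-up vocabulary: each "piece" the model reads is an opaque integer ID.
vocab = {"Hello": 101, " how": 102, " are": 103, " you": 104, "?": 105}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenization over the toy vocabulary."""
    ids = []
    while text:
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece):
                ids.append(vocab[piece])
                text = text[len(piece):]
                break
        else:
            raise ValueError("untokenizable text")
    return ids

sentence = "Hello how are you?"
print(tokenize(sentence))   # -> [101, 102, 103, 104, 105]
print(sentence.count("o"))  # -> 3 (invisible at the token level)
```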

krackers•1d ago
I see this claim repeated over and over, and while it seems plausible, this should be an easily testable hypothesis right? You don't even need a _large_ model for this because the hypothesis you are testing is whether transformer models [possibly with chain of thought] can count to some "reasonable" limit (maybe it can be modeled in TCS sense as something to do with circuit complexity) and you can easily train on synthetic strings. Is there any paper that shows proof/disproof that transformer networks using single-character tokenization successfully count?
Kranar•1d ago
Forget single character tokens, you can just go on OpenAI's own tokenizer website [1] and construct tokens and ask ChatGPT to count how many tokens there are in a given string. For example hello is a single token and if I ask ChatGPT to count how many times "hello" appears in "hellohellohellohellohellohellohellohellohellohellohellohellohellohellohellohellohellohellohellohellohello" or variations thereof it gets it right.

Be careful to structure your query so that each "hello" falls in its own token, because otherwise the first or last hello can inadvertently get chunked together with the text just before or just after it.

[1] https://platform.openai.com/tokenizer

krackers•1d ago
Neat finding, does it generalize to larger samples? Someone should randomly generate a few thousand such strings, feed it to 4o or o3, and get some accuracy results. Then compare the accuracy in cases of counting individual letters in random strings.

I find there's a lot of low-hanging fruit and claims about LLMs that are easily testable, but for which no benchmarks exist. E.g. the common claim about LLMs being "unable" to multiply isn't fully accurate: someone did a proper benchmark and found that there's a gradual decline in accuracy as digit length increases past 10 digits by 10 digits. I can't find the specific paper, but I also remember there was a way of training a model on increasingly hard problems at the "frontier" (GRPO-esque?) that fixed this issue, giving very high accuracy up to 20 digits by 20 digits.

Kranar•1d ago
Oh that's fair. I am not actually an LLM expert so I could have some misunderstanding about this. I remember hearing this explanation given for why previous ChatGPT models failed to answer "How many "r"s are in strawberry?", but perhaps this was an over simplification.
krackers•1d ago
Right, that's the explanation I've heard too (and I think Karpathy even said it, so it's not some fringe theory). I wasn't dismissing the hypothesis but asking out of genuine curiosity, since this feels like something that can easily be tested on "small" large language models. There are lots of little experiments like this that can be done with small-ish models trained on purely synthetic data (the stuff about digit multiplication was done on a GPT-2-scale model IIRC). Can models learn to count? Can they learn to add? Can they learn to copy text verbatim accurately? Can they learn to recognize regular grammars, or even context-free grammars (this one has already been done, and the answer is yes)? And if the answer to one of these turns out to be no, then we'd better find out sooner rather than later, since it means we probably need to rethink the architecture a bit.

I know there's a lot of theoretical CS work on deriving upper-bounds on these models from a circuit-complexity point of view, but as architectures are revised all the time it's hard to tell how much is still relevant. Nothing beats having a concrete, working example of a model that correctly parses CFGs as rebuttal to the claim that models just repeat their training data.

dmd•1d ago
I tried it in gpt-4.1, o4-mini, and claude-4-sonnet, and all got the right answer.
apothegm•1d ago
LLMs don’t reason or count. They predict and output next tokens. “Reasoning” models mostly just have another layer of validating actual output against predictions. Newer models, if provided with programming tools (as they are in the ChatGPT interface), will predict the tokens that make up short scripts and then call those scripts to achieve numeric results for things like counting lines or letters.
zihotki•21h ago
LLMs are Turing complete ( https://arxiv.org/abs/2411.01992 ). Or what is your definition of 'count'?
yencabulator•2h ago
That's not what the paper says. It says it is possible to construct weights that, when run through inference, will perform Turing-machine-equivalent operations on prompts that are specifically made for that purpose.

That does not mean weights derived from a pile of books will do such a thing.

throwdbaaway•1d ago
This prompt works fine with Qwen2.5-Coder-32B-Instruct-Q4_K_M:

    Add a line number prefix to each line, stopping at line 27. What's on line 27 of this program?
simonw•1d ago
Yeah, they're still bad at counting.

Tools like Claude Code work around this by feeding code into the LLMs with explicit line numbers - demo of that here: https://static.simonwillison.net/static/2025/log-2025-06-02-... - expand out some of the "tool result" panels until you see it, more notes on where I got that trace from here: https://simonwillison.net/2025/Jun/2/claude-trace/
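A minimal sketch of that workaround (the exact prefix format here is a guess, not Claude Code's actual format):

```python
def number_lines(src: str) -> str:
    """Prefix each line with its 1-based number before handing the file to a model."""
    return "\n".join(f"{i:4}| {line}"
                     for i, line in enumerate(src.splitlines(), 1))

print(number_lines("import sys\nprint(sys.argv)"))
```

With explicit numbers in the context, "what's on line 27" becomes a lookup the model can read off rather than a count it has to perform.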

jazzyjackson•1d ago
They're very bad at geometry. GPT-4 (the original) tried to convince me a line intersects a sphere 3 times. Completely clueless at comparing volumes of various polyhedra.
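For the record, the geometry falls out of a quadratic: a line can meet a sphere at most twice (a quick sketch):

```python
def line_sphere_hits(p, d, center, r):
    """Count intersections of the line p + t*d with a sphere: 0, 1, or 2."""
    # Substitute the line into |x - center|^2 = r^2 and solve the quadratic in t.
    f = [pi - ci for pi, ci in zip(p, center)]
    a = sum(di * di for di in d)
    b = 2 * sum(di * fi for di, fi in zip(d, f))
    c = sum(fi * fi for fi in f) - r * r
    disc = b * b - 4 * a * c
    if disc < 0:
        return 0                   # line misses the sphere
    return 1 if disc == 0 else 2   # tangent, or two crossing points

print(line_sphere_hits((0, 0, -5), (0, 0, 1), (0, 0, 0), 1))  # through center -> 2
print(line_sphere_hits((1, 0, -5), (0, 0, 1), (0, 0, 0), 1))  # tangent -> 1
print(line_sphere_hits((2, 0, -5), (0, 0, 1), (0, 0, 0), 1))  # miss -> 0
```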
selcuka•1d ago
Both GPT-4o and o4-mini got it right for me. They both wrote and executed a small Python program:

    # Let's read the file and get line 27
    file_path = '/mnt/data/code.py'
    line_27 = None
    try:
        with open(file_path, 'r') as f:
            lines = f.readlines()
            if len(lines) >= 27:
                line_27 = lines[26].rstrip('\n')
            else:
                line_27 = None
    except FileNotFoundError:
        line_27 = None
insin•1d ago
Why would you expect them to be able to given how they work? [1]

Someone where I work was trying to get an LLM to evaluate responses to an internal multiple-choice quiz (A, B or C), putting people into different buckets based on a combination of the total number of correct responses and having answered specific questions correctly. They spent a week "prompt engineering" it back and forth, with subtle changes to their instructions on how the scoring should work, with no appreciable effect on accuracy or consistency.

That's another scenario where I felt someone was asking for something with no mechanical sympathy for how it was supposed to happen. Maybe a "thinking" model (why do "AI" companies always abuse terms like this? (rhetorical)) would have been able to get enough into the context to reach a better outcome. But I took their prompt, asked for code instead, and got it translated into some overly-commented but simple-enough code that would do the job perfectly every time, including a comment that the instructions they'd provided had a gap: people answering with a certain combination of answers wouldn't fall into any bucket.

[1] https://www.youtube.com/watch?v=7xTGNNLPyMI

bigbuppo•1d ago
See, your mistake was thinking they would be useful today. You shouldn't concern yourself with that, but rather how fast they are getting slightly more useful. And with your one time investment of just $12,000,000,000 we can add that to our roadmap for future planning consideration.
__fst__•1d ago
I also noticed that they struggle with reversing strings. Ask one to "generate a list of the 30 biggest countries together with their names in reverse". Most of the results will be correct, but you'll likely find some weird spelling mistakes.

It's not something they can regurgitate from previously seen text. Models like Claude with background code execution might get around that.
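For reference, reversal is a character-level operation that's trivial in code; in Python it's a slice (a few sample names, just for illustration):

```python
# Correct reversals for a few country names; a model predicting tokens
# has to reconstruct these character by character, which is where typos creep in.
countries = ["Brazil", "Canada", "Indonesia"]
for name in countries:
    print(name, "->", name[::-1])  # e.g. Brazil -> lizarB
```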

scarface_74•1d ago
https://chatgpt.com/share/683fd42a-c850-8010-9f0e-6c634f1b01...
__fst__•1d ago
yup, exactly what I meant, e.g

5 Brazil liziarB

scarface_74•1d ago
Telling it to use Python with 4o

https://chatgpt.com/share/6840a944-3bac-8010-9694-2a8b0a9c35...

Even o4-mini-high got it wrong though (Indonesia)

https://chatgpt.com/share/6840a9aa-1260-8010-ba3f-bd99fff721...