frontpage.

Tree Borrows

https://plf.inf.ethz.ch/research/pldi25-tree-borrows.html
108•zdw•1h ago•9 comments

Why LLMs Can't Write Q/Kdb+: Writing Code Right-to-Left

https://medium.com/@gabiteodoru/why-llms-cant-write-q-kdb-writing-code-right-to-left-ea6df68af443
106•gabiteodoru•1d ago•59 comments

A fast 3D collision detection algorithm

https://cairno.substack.com/p/improvements-to-the-separating-axis
54•OlympicMarmoto•2h ago•2 comments

Ruby 3.4 frozen string literals: What Rails developers need to know

https://www.prateekcodes.dev/ruby-34-frozen-string-literals-rails-upgrade-guide/
137•thomas_witt•3d ago•62 comments

Is the doc bot docs, or not?

https://www.robinsloan.com/lab/what-are-we-even-doing-here/
139•tobr•8h ago•70 comments

Helm local code execution via a malicious chart

https://github.com/helm/helm/security/advisories/GHSA-557j-xg8c-q2mm
135•irke882•10h ago•65 comments

Most RESTful APIs aren't really RESTful

https://florian-kraemer.net//software-architecture/2025/07/07/Most-RESTful-APIs-are-not-really-RESTful.html
183•BerislavLopac•9h ago•284 comments

X Chief Says She Is Leaving the Social Media Platform

https://www.nytimes.com/2025/07/09/technology/linda-yaccarino-x-steps-down.html
128•donohoe•1h ago•122 comments

US Court nullifies FTC requirement for click-to-cancel

https://arstechnica.com/tech-policy/2025/07/us-court-cancels-ftc-rule-that-would-have-made-canceling-subscriptions-easier/
352•gausswho•17h ago•329 comments

Bootstrapping a side project into a profitable seven-figure business

https://projectionlab.com/blog/we-reached-1m-arr-with-zero-funding
668•jonkuipers•1d ago•168 comments

Galilean-invariant cosmological hydrodynamical simulations on a moving mesh

https://wwwmpa.mpa-garching.mpg.de/~volker/arepo/
5•gone35•2d ago•1 comment

Phrase origin: Why do we "call" functions?

https://quuxplusone.github.io/blog/2025/04/04/etymology-of-call/
158•todsacerdoti•12h ago•107 comments

7-Zip for Windows can now use more than 64 CPU threads for compression

https://www.7-zip.org/history.txt
171•doener•2d ago•119 comments

IKEA ditches Zigbee for Thread, going all in on Matter smart homes

https://www.theverge.com/smart-home/701697/ikea-matter-thread-new-products-new-smart-home-strategy
233•thunderbong•6h ago•125 comments

RapidRAW: A non-destructive and GPU-accelerated RAW image editor

https://github.com/CyberTimon/RapidRAW
209•l8rlump•13h ago•91 comments

Breaking Git with a carriage return and cloning RCE

https://dgl.cx/2025/07/git-clone-submodule-cve-2025-48384
349•dgl•22h ago•140 comments

Astro is a return to the fundamentals of the web

https://websmith.studio/blog/astro-is-a-developers-dream/
202•pumbaa•6h ago•172 comments

Using MPC for Anonymous and Private DNA Analysis

https://vishakh.blog/2025/07/08/using-mpc-for-anonymous-and-private-dna-analysis/
18•vishakh82•4h ago•7 comments

An Emoji Reverse Polish Notation Calculator Written in COBOL

https://github.com/ghuntley/cobol-emoji-rpn-calculator
9•ghuntley•3d ago•0 comments

Florida is letting companies make it harder for highly paid workers to swap jobs

https://www.businessinsider.com/florida-made-it-harder-highly-paid-workers-to-swap-jobs-2025-7
75•pseudolus•1h ago•79 comments

Where can I see Hokusai's Great Wave today?

https://greatwavetoday.com/
106•colinprince•12h ago•81 comments

eSIM Security

https://security-explorations.com/esim-security.html
87•todsacerdoti•7h ago•41 comments

Frame of preference: A history of Mac settings, 1984–2004

https://aresluna.org/frame-of-preference/
157•K7PJP•16h ago•21 comments

I'm Building an LLM for Satellite Data: EarthGPT.app

https://www.earthgpt.app/
87•sabman•2d ago•11 comments

Supabase MCP can leak your entire SQL database

https://www.generalanalysis.com/blog/supabase-mcp-blog
784•rexpository•22h ago•421 comments

Smollm3: Smol, multilingual, long-context reasoner LLM

https://huggingface.co/blog/smollm3
346•kashifr•1d ago•70 comments

Nvidia Becomes First Company to Reach $4T Market Cap

https://www.cnbc.com/2025/07/09/nvidia-4-trillion.html
20•mfiguiere•2h ago•6 comments

Hugging Face just launched a $299 robot that could disrupt the robotics industry

https://venturebeat.com/ai/hugging-face-just-launched-a-299-robot-that-could-disrupt-the-entire-robotics-industry/
101•fdaudens•2h ago•88 comments

I Ported SAP to a 1976 CPU. It Wasn't That Slow

https://github.com/oisee/zvdb-z80/blob/master/ZVDB-Z80-ABAP.md
64•weinzierl•2d ago•40 comments

Serving a half billion requests per day with Rust and CGI

https://jacob.gold/posts/serving-half-billion-requests-with-rust-cgi/
95•feep•1d ago•75 comments

Is the doc bot docs, or not?

https://www.robinsloan.com/lab/what-are-we-even-doing-here/
139•tobr•8h ago

Comments

BossingAround•7h ago
It's probably docs... If it can hallucinate an answer, it's docs with probably the most infuriating UX one can imagine.

I remember being taught that no docs is better (i.e. less frustrating to the user) than bad/incorrect docs.

pmg101•6h ago
"Documentation - or, as I like to call it, lies."

After a certain number of years you learn that source code comments so often fall out of synch with the code itself that they're more of a liability than an asset.

taneq•5h ago
“There’s lies, damn lies, and datasheets.”

Although, “All datasheets are wrong. Some datasheets are useful.”

walthamstow•5h ago
At my last place the docs were in the repo with the code, and if you didn't update the docs in the same PR as the code it wouldn't get approved.

My current place? It's in Confluence, miles away from code and with no review mechanism.

domk•7h ago
Working with Shopify is an example of something where a good mental model of how it works under the hood is often required. This type of mistake (not realising that the tag is added by an app after an order is created, and so won't be available when sending the confirmation email) is an easy one to make, for both a human and an LLM just reading the docs. This is where an AI that just reads the available docs is going to struggle, and won't replace actual experience with the platform.
bravesoul2•7h ago
Need a real CC to test? That right there makes me lose respect for Shopify, if true. Even Stripe lets you test :)
Bewelge•6h ago
Not sure if I'm missing something, but the way I'd always test orders is to generate some 100% discount. You don't need any payment info then. I only ever needed a CC when I wanted to actually test something relating to payment. And on test stores you can mock a CC.
bravesoul2•4h ago
That's a good way too for most cases. Unless you need there to be an amount
PeterStuer•7h ago
Confused. I just tried it in the Shopify Assistant and got:

There is no built-in Liquid property to directly detect Shopify Collective fulfillment in email notifications.

You can use the Admin GraphQL API to programmatically detect fulfillment source.

In Liquid, you must rely on tags, metafields, or custom properties that you set up yourself to mark Collective items.

If you want to automate this, consider tagging products or orders associated with Shopify Collective, or using an app to set a metafield, and then check for that in your Liquid templates.

What you can do in Liquid (email notifications):

If Shopify exposes a tag, property, or metafield on the order or line item that marks it as a Shopify Collective item, you could check for that in Liquid. For example, if you tag orders or products with "Collective", you could use:

  {% if order.tags contains "Collective" %}
    <!-- Show Collective-specific content -->
  {% endif %}
or for line items:

  {% for line_item in line_items %}
    {% if line_item.product.tags contains "Collective" %}
      <!-- Show something for Collective items -->
    {% endif %}
  {% endfor %}
In the author's 'wrong' vs. 'seems to work' answer, the only difference is the tag being on the line items vs. on the order. The flow (template? he refers to it as 'some other cryptic Shopify process') he uses in his tests does seem to add the 'Shopify Collective' tag to the line items, and potentially also to the order if the whole order is fulfilled through Shopify Collective, but without further info we can only guess at his setup.

While using AI can always lead to non-perfect results, I feel the evidence presented here does not support the conclusion.

P.S. Given the reference to 'cryptic Shopify processes', I wonder how far the author would get with 'just the docs'.

redhale•6h ago
I think you're making the author's point, though. If two users ask the bot the same question and get different answers, is the bot valuable? A dice roll that might be (or is even _probably_) correct is not what I want when going directly to the official docs.
PeterStuer•4h ago
Not sure the author is giving the full account though, as his answer snippet was probably just a part of the same answer I got, framed and interpreted differently (the AIs are never so terse as to just whip out a few lines of code).

Besides, it is not even incorrect in the way he states it is. It is fully dependent on how he added the tags in his flow, as the complete answer correctly stated. He speculates on some timing issue in some 'cryptic Shopify process' adding the tag at a later stage, but this is clearly wrong as his "working answer" (which is also in the Assistant reply) does rely on the tag having been added at the same point in the process.

My purely speculative and deliberately exaggerated take: he just blindly copied some flow template, then copy/pasted the first Liquid code box from the (same as I got?) Assistant answer, tested it on one order and found it not doing what he wanted, which suited his confirmation bias regarding AI; later he tried pasting the second Liquid code box (or the same answer you would get from Gemini through Google Search), found 'it worked' on his one test order, and still blamed the Assistant for being 'wrong'.

hennell•4h ago
So because you got a good response the conclusion is invalid? How does the user know if they got a good response or a bad one? Due to the parameters passed most LLMs are functionally non-deterministic, rarely giving the same answer twice even with the same question.

I just asked chatgpt "whats the best database structure for a users table where you have users and admins?" in two different browser sessions. One gave me sql with varchars and a role column using:

    role VARCHAR(20) NOT NULL CHECK (role IN ('user', 'admin')),
the other session used text columns and defined an enum to use first:

    CREATE TYPE user_role AS ENUM ('user', 'admin', 'superadmin');
    -- other sql snipped
    role user_role NOT NULL DEFAULT 'user',
An AI assistant should be better tuned, but often isn't. That variance makes it feel wildly unhelpful to me for 'documentation', as two people end up with quite different solutions.
PeterStuer•4h ago
So by extrapolation all of the IT books of the past were "wildly unhelpful" as no two of them presented the exact same solution to a problem, even all those pretending to be 'best practice'?

Your question is vague (a technical observation, not meant to be derogatory). In which DBMS? By what metric of 'best'? For what size of database? Does it need to support internationalization? Will the roles be updated or extended in the future? Etc.

You could argue an AI assistant should ask for this clarification when the question is vague, rather than make a guess. But taken to the extreme that is not workable in practice: if every minute factor has to be answered by the user before a result is produced, only the rare expert would ever get to the stage of actually receiving an answer.

This is not just an AI problem, but a problem (human) business and technical analysts face every day in their work. When do you switch to proposing a solution rather than asking further details? It is BTW also why all those BPM or RPA platforms that promise to eliminate 'programming' and let the business analyst 'draw' a solution often fail miserably. They either have too narrow defaults or keep needing to be fed detail long past the BA's comfort zone.

dworks•3h ago
it's non-deterministic. it potentially gives different answers each time you ask, and small differences in your prompt yield different completions. it doesn't actually understand your prompt, you know.
deepdarkforest•6h ago
I mean, that's the dirty secret of any RAG chatbot. The concept of "grounding" is arbitrary. It doesn't matter whether you use embeddings or a tool that runs your usual search and takes the top items, like most web search tools or Google's. It still relies on the model not hallucinating given this info, which is very hard: too much info and the model gets confused, too little info and the model assumes the answer might not be there, so it's useless. The fine balance depends on the user's query, and approaches like score cutoffs for embeddings etc. just don't generalize.

This is the same exact problem in coding assistants when they hallucinate functions or cannot find the needed dependencies etc.

There are better and more complex approaches that use multiple agents to summarize different smaller queries and then iteratively build up, etc. Internally we, and a lot of companies, have them, but for external customer queries they're way too expensive. You can't spend 30 cents on every query.
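
A minimal sketch of the embedding-score-cutoff retrieval described above (embed() is a stand-in for whatever embedding model you use, and the 0.75 cutoff is exactly the kind of arbitrary knob that doesn't generalize):

    import numpy as np

    def retrieve(query, chunks, embed, cutoff=0.75, top_k=5):
        """Return the doc chunks whose similarity to the query clears a fixed cutoff."""
        q = embed(query)
        scored = []
        for chunk in chunks:
            d = embed(chunk)
            sim = float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
            scored.append((sim, chunk))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Too strict a cutoff starves the model of context; too loose and it gets confused.
        return [chunk for sim, chunk in scored[:top_k] if sim >= cutoff]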

Bewelge•6h ago
To be fair, for me at least, that weird chat bot only appears on https://help.shopify.com/ while the technical documentation is on shopify.dev/.

Every time I land on help.shopify.com I get the feeling it's one of those "doc pages for salespeople": it's meant to show "we have great documentation and you can do all these things" but never actually explains how to do anything.

I tried that bot a couple of months ago and it was utterly useless:

question: When using discountRedeemCodeBulkAdd there's a limit to add 100 codes to a discount. Is this a limit on the API or on the discount? So can I add 100 codes to the same discount multiple times?

answer: I wasn't able to find any results for that. Can you tell me a little bit more about what you're looking for?

Telling it more did not help. To me that seemed like the bot didn't even have access to the technical documentation. I find it hard to believe that any search engine could miss a word like discountRedeemCodeBulkAdd if it actually is in the dataset: https://shopify.dev/docs/api/admin-graphql/latest/mutations/...

So it's a bit like asking sales people technical questions.

edit: Okay, I should have tried that before commenting. They seem to have updated it. When I ask the same question now it answers correctly (weirdly in German) :

The limit of 100 codes when using discountRedeemCodeBulkAdd refers to the number of codes you can add in a single API call, not to the total number of codes that can be associated with a discount. A discount can contain up to 20,000,000 unique discount codes. You can therefore add 100 codes to the same discount multiple times until you reach the limit of 20,000,000 codes. Note that third-party apps or custom solutions cannot bypass or increase this limit.

~= It's a limit on the API endpoint, you can add up to 20M to a single discount.

delusional•6h ago
> So it's a bit like asking sales people technical questions.

Maybe that's the best anthropomorphic analogy for LLMs: like good salespeople, completely disconnected from reality, but finely tuned to give you just the answer you want.

WJW•5h ago
Well no, the problem was that the bot didn't give them the answer they wanted. It's more like "finely tuned to waffle around pretending to be knowledgeable, but lacking technical substance".

Kind of like a bad salesperson; the best salespeople I've had the pleasure of knowing were not afraid to learn the technical background of their products.

barrell•5h ago
The best anthropomorphic analogy for LLMs is no anthropomorphic analogy :)
debugnik•5h ago
> weirdly in German

I keep seeing bots wrongly prompted with both the browser language and the text "reply in the user's language". So I write to a bot in English and I get a Spanish answer.

dworks•3h ago
to be fair?
anentropic•5h ago
I would guess these narrow docs bots probably perform worse than ChatGPT et al in 'search' mode
emil_sorensen•5h ago
Docs bots like these are deceptively hard to get right in production. Retrieval is super sensitive to how you chunk/parse documentation and how you end up structuring documentation in the first place (see frontpage post from a few weeks ago: https://news.ycombinator.com/item?id=44311217).

You want grounded RAG systems like Shopify's here to rely strongly on the underlying documents, but also still sprinkle in a bit of the magic of the latent LLM knowledge. The only way to get that balance right is evals. Lots of them. It gets even harder when you are dealing with a GraphQL schema like Shopify's, since most models struggle with that syntax more than with REST APIs.

FYI I'm biased: founder of kapa.ai here (we build docs AI assistants for 200+ companies incl. Sentry, Grafana, Docker, the largest Apache projects etc).

skrebbel•5h ago
Why RAG at all?

We concatenated all our docs and tutorials into a text file, piped it all into the AI right along with the question, and the answers are pretty great. Cost was, last I checked, roughly 50c per question. Probably scales linearly with how much docs you have. This feels expensive but compared to a human writing an answer it's peanuts. Plus (assuming the customer can choose to use the AI or a human), it's great customer experience because the answer is there that much faster.

I feel like this is a no-brainer. Tbh with the context windows we have these days, I don't completely understand why RAG is a thing anymore for support tools.
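
A rough sketch of that whole-docs-in-the-prompt approach, assuming an OpenAI-style client (the model name and the docs/ layout are placeholders, not anything Shopify- or vendor-specific):

    from pathlib import Path
    from openai import OpenAI

    client = OpenAI()
    # Concatenate every docs page into one big blob, loaded once at startup.
    ALL_DOCS = "\n\n".join(p.read_text() for p in Path("docs").rglob("*.md"))

    def answer(question: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4.1",
            messages=[
                {"role": "system",
                 "content": "Answer using only the documentation below. "
                            "If the answer isn't in it, say you don't know.\n\n" + ALL_DOCS},
                {"role": "user", "content": question},
            ],
        )
        return response.choices[0].message.content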

Rygian•5h ago
What you describe sounds like poor man's RAG. Or lazy man's. You're just doing the augmentation at each prompt.
cluckindan•5h ago
With RAG the cost per question would be low single-digit pennies.
IceDane•4h ago
Because LLMs still suck at actually using all that context at once. And surely you can see yourself that your solution doesn't scale. It's great that it works for your specific case, but I'm sure you can come up with a scenario where it's just not feasible.
cube2222•4h ago
This works as long as your docs are below the max context size (and even then, as you approach larger context sizes, quality degrades).

Re cost though, you can usually reduce the cost significantly with context caching here.

However, in general, I’ve been positively surprised with how effective Claude Code is at grep’ing through huge codebases.

Thus, I think just putting a Claude Code-like agent in a loop, with a grep tool on your docs, and a system prompt that contains just a brief overview of your product and brief summaries of all the docs pages, would likely be my go to.
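
A sketch of the grep tool such an agent loop might call (the paths and the tool-registration wiring are assumptions; most agent frameworks let you expose a plain function like this):

    import re
    from pathlib import Path

    def grep_docs(pattern: str, docs_dir: str = "docs", context: int = 2) -> str:
        """Search the docs tree for a regex and return matching lines with surrounding context."""
        hits = []
        for path in sorted(Path(docs_dir).rglob("*.md")):
            lines = path.read_text(errors="ignore").splitlines()
            for i, line in enumerate(lines):
                if re.search(pattern, line, re.IGNORECASE):
                    start, end = max(0, i - context), min(len(lines), i + context + 1)
                    hits.append(f"{path}:{i + 1}\n" + "\n".join(lines[start:end]))
        return "\n\n".join(hits[:20]) or "no matches"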

bee_rider•1h ago
Oh man, maybe this would cause people to write docs that are easy to grep through. Let’s start up that feedback loop immediately, please.
emil_sorensen•3h ago
Accuracy drops hard with context length still. Especially in more technical domains. Plus latency and cost.
llm_nerd•2h ago
What you described is RAG. Inefficient RAG, but still RAG.

And it's inefficient in two ways:

- you're using extra tokens for every query, which adds up.

- you're making the LLM less precise by overloading it with potentially irrelevant extra info, making it harder for it to pick the specific relevant answer out of the haystack.

Filtering (e.g. embedding similarity and BM25) and re-ranking/pruning what you provide to the model is an optimization. It optimizes the tokens and the processing time, and in an ideal world optimizes the answer too. Most LLMs are far more effective if your RAG is limited to what is relevant to the question.
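
A minimal sketch of the BM25 filtering step mentioned above (uses the rank_bm25 package; the whitespace tokenization and the chunking are deliberately naive):

    from rank_bm25 import BM25Okapi

    def top_chunks(question: str, chunks: list[str], k: int = 5) -> list[str]:
        """Keep only the k chunks most lexically relevant to the question before prompting the LLM."""
        bm25 = BM25Okapi([c.lower().split() for c in chunks])
        scores = bm25.get_scores(question.lower().split())
        ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
        return [chunk for _, chunk in ranked[:k]]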

chrismorgan•5h ago
Why do you say “deceptively hard” instead of “fundamentally impossible”? You can increase the probability it’ll give good answers, but you can never guarantee it. It’s then a question of what degree of wrongness is acceptable, and how you signal that. In this specific case, what it said sounds to me (as a Shopify non-user) entirely reasonable, it’s just wrong in a subtle but rather crucial way, which is also mildly tricky to test.
whatsgonewrongg•4h ago
A human answering every question is also not guaranteed to give good answers; anyone who has communicated with customer service knows that. So calling it impossible may be correct, but it's not useful.

(We tend to have far fewer evals for such humans though.)

girvo•4h ago
A human will tell you “I am not sure, and will have to ask engineering and get back to you in a few days”. None of these LLMs do that yet, they’re biased towards giving some answer, any answer.
whatsgonewrongg•3h ago
You’re right that some humans will, and most LLMs won’t. But humans can be just as confidently wrong. And we incentivize them to make decisions quickly, in a way that costs the company less money.
dcre•3h ago
This is not really true. If you give a decent model docs in the prompt and tell it to answer based on the docs and to say “I don’t know” if the answer isn’t there, it does so (most of the time).
SecretDreams•3h ago
> most of the time

This is doing some heavy lifting

QuadmasterXLII•49m ago
I have never seen this in the wild. Have you?
dingnuts•42m ago
I need a big ol' citation for this claim, bud, because it's an extraordinary one. LLMs have no concept of truth or theory of mind so any time one tells you "I don't know" all it tells you is that the source document had similar questions with the answer "I don't know" already in the training data.

If the training data is full of certain statements you'll get certain sounding statements coming out of the model, too, even for things that are only similar, and for answers that are total bullshit

simonw•9m ago
Do you use LLMs often?

I get "I don't know" answers from Claude and ChatGPT all the time, especially now that they have thrown "reasoning" into the mix.

Saying that LLMs can't say "I don't know" feels like a 2023-2024 era complaint to me.

unshavedyak•2h ago
I agree with you, but man, I can't help but feel humans are the same, depending on the company. My wife was recently fighting with several layers of Comcast support over cap changes they've recently made. Seemingly it's a data issue, since it's something new that theoretically hasn't propagated through their entire support chain yet, but she encountered a half dozen confidently incorrect people who lacked the information/training to know that they were wrong. It was a very frustrating couple of hours.

Generally I don't trust most low-paid (through no fault of their own) customer service centers any more than I do random LLMs. Historically their advice for most things is either very biased, incredibly wrong, or often both.

axus•2h ago
Won't that be cool, when LLM-based AIs ask you for help instead of the other way around
intended•3h ago
This is moving the goalposts / raising a different issue. We can engage with the new point, but it concedes that doc bots are not docs.
bee_rider•1h ago
Documentation is the thing we created because humans are forgetful and misunderstand things. If the doc bot is to be held to a standard more like some random discord channel or community forum, it should be called something without “doc” in the name (which, fwiw, might just be a name the author of the post came up with, I dunno what Shopify calls it).
PeterStuer•3h ago
Indeed. Dabbling in 'RAG' (which for better or worse has become a tag for any kind of context retrieval) for more complex documentation and more intricate questions, you very quickly realize that you need to go far beyond simple 'chunking', and end up with a subsystem that constructs more than one intricate knowledge graph to support the different kinds of questions users might ask. For example: a simple question such as "What exactly is an 'Essential Entity'?" is better handled by Knowledge Representation A, as opposed to "Can you provide a gap and risk analysis on my 2025 draft compliance statement (uploaded) in light of the current GDPR, NIS-2 and the AI Act?"

(My domain is regulatory compliance, so maybe this goes beyond pure documentation but I'm guessing pushed far enough the same complexities arise)

dingnuts•39m ago
This is sort of hilarious: to use an LLM as a good search interface, first build... a search engine.

I guess this is why Kagi Quick Answer has consistently been one of the best AI tools I use. The search is good, so their agent is getting the best context for the summaries. Makes sense.

schnable•5h ago
Reminds me of when I asked Gemini how to do some stuff in Google Docs with Apps Script, and it just hallucinated the capability and the code to make it work. Turns out what I wanted to do isn't supported at all.

I feel like we aren't properly using AI in products yet.

hnlmorg•4h ago
I’ve found LLMs (or at least every one I’ve tried this on) will always assume the customer is correct, and thus even if they’re flat-out wrong, the LLM will make up some bullshit to confirm the customer is still correct.

It’s great when you’re looking to do creative stuff, but terrible when you’re looking to confirm the correctness of an approach, or asking for support on something you didn’t even know doesn’t exist.

dworks•3h ago
that's because its "answers" are actually "completions". can't escape that fact - LLMs will always "hallucinate".
aDyslecticCrow•4h ago
I asked about a niche JSON library for C. It apparently wasn't in the training data, so it just invented how it imagined a JSON library would work.

I've also had a lot of issues with CMake, where it just invents syntax and functions. Every new question has to be asked in a new chat context to clear the context poisoning.

It's the things that lack good docs that I want to ask about. But that's where it's most likely to fail.

dingnuts•30m ago
I think users should get a refund on the tokens when this happens
braebo•3h ago
Yet Google raised my Workspace subscription cost by 25% last night because our current agreement is suddenly unworthy of all the new “AI value” they’ve added… value I didn’t even know existed until I started paying for it. I don’t even want to know what it is supposed to be referencing… I just want to dump it ASAP.
dsmmcken•2h ago
The tool we use for our docs AI answers lets you mine that data for feature requests. It generates a report of what it didn't have answers for and summarizes them as potential feature gaps. (Or at least what it is aware it didn't have answers for).

People seem more willing to ask an AI about certain things than to be judged for asking the same question of a human, so in that regard it does seem to surface slightly different feature requests than we hear when talking to customers directly.

We use inkeep.com (not affiliated, just a customer).

rapind•1h ago
> We use inkeep.com (not affiliated, just a customer).

And what do you pay? It's crazy that none of these AI CSRs have public pricing. There should just be monthly subscription tiers, which include some number of queries, and a cost per query beyond that.

xyst•2h ago
> I feel like we aren't properly using AI in products yet.

Very similar sentiment at the height of the crypto/digital currency mania

simonw•4h ago
This is a great example of the kind of question I'd love to be able to ask these documentation bots but that I don't trust them to be able to get right (yet):

> What’s the syntax, in Liquid, to detect whether an order in an email notification contains items that will be fulfilled through Shopify Collective?

I suspect the best possible implementation of a documentation bot with respect to questions like this one would be an "agent" style bot that has the ability to spin up its own environment and actually test the code it's offering in the answer before confidently stating that it works.

That's really hard to do - Robin in this case could only test the result by placing and then refunding an order! - but the effort involved in providing a simulated environment for the bot to try things out in might make the difference in terms of producing more reliable results.

dworks•3h ago
get a second agent to validate the return from the first agent. but it might get it wrong because reasons, so you need a third agent just to make sure. and then a fourth. and so on. this is obviously not a working direction.
simonw•3h ago
That's why you give them the ability to actually execute the code in a sandbox. Then it's not AI checking AI, you're mixing something deterministic into the loop.
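
A toy sketch of that deterministic check, assuming the generated answer is a standalone Python script (a real setup would need proper isolation, not just a subprocess with a timeout):

    import subprocess
    import tempfile

    def check_generated_code(code: str) -> tuple[bool, str]:
        """Run the model's code in a separate process and report whether it actually executes."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run(["python", path], capture_output=True, text=True, timeout=30)
        except subprocess.TimeoutExpired:
            return False, "timed out"
        return result.returncode == 0, result.stdout + result.stderr
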
dworks•3h ago
the return may still not reflect the sandbox reality.
kmoser•2h ago
That may certainly increase the agent's ability to get it right, but there will always be cases where the code it generates mimics the correct response, i.e. produces the output asked for, without actually working as intended, as LLMs tend to want to please as much as be correct.
simonw•1h ago
Not much harm done. The end user sees the response and either spots that it's broken or finds out it's broken when they try to run it.

They take a screenshot and make fun of the rubbish bot on social media.

If that happens rarely it's still a worthwhile improvement over today. If it happens frequently then the documentation bot is junk and should be retired.

dworks•59m ago
you're hand-waving away all the other million use cases where returning false information isn't OK.
dworks•3h ago
We're going to see increasingly more of these, and at some point it's going to cause a big scandal that pops the current AI bubble. It's really obvious that you can't use non-deterministic systems this way, but companies are hellbent on doing it anyway. This is why I won't take a role to implement "AI" in an existing product.
crystal_revenge•1m ago
I don’t understand why people seem to be attacking the “non-determinism” of LLMs. First, I think most people are confusing “probabilistic” with “non-deterministic” which have very distinct meanings in CS/ML. Non-deterministic typically entails following multiple paths at once. Consider regex matching with NFAs or even the particular view of a list as a monad. The only case where LLMs are “non-deterministic” is when using sampling algorithms like beam search where multiple paths are considered simultaneously. But most LLM usage being discussed doesn’t involve beam search.

But even if one assumes people mean “probabilistic”, that’s also an odd critique given how probabilistic software has pretty much eaten the world. Most of my career has been spent building reliable products using probabilistic models.

Finally, there’s nothing inherently probabilistic or non-deterministic about LLM generation; these are properties of the sampler applied. I did quite a lot of LLM benchmarking in recent years and almost always used greedy sampling, both for performance (doing things like GSM8K strongly benefits from choosing the maximum-likelihood path) and for reproducibility. You can absolutely set up LLM tools that have perfectly reproducible results. LLMs have many issues, but their probabilistic nature is not one of them.

trjordan•2h ago
The core argument here is: LLM docbots are wrong sometimes. Docs are not. That's not acceptable.

But that's not true! Docs are sometimes wrong, and even more so if you count errors of omission. From a user's perspective, dense / poorly structured docs are wrong, because they lead users to think the docs don't have the answer. If they're confusing enough, they may even mislead users.

There's always an error rate. DocBots are almost certainly wrong more frequently, but they're also almost certainly much much faster than reading the docs. Given that the standard recommendation is to test your code before jamming it in production, that seems like a reasonable tradeoff.

YMMV!

(One level down: the feedback loop for getting docbots corrected is _far_ worse. You can complain to support that the docs are wrong, and most orgs will at least try to fix it. We, as an industry, are not fully confident in how to fix a wrong LLM response reliably in the same way.)

mananaysiempre•2h ago
Docs are reliably fixable, so with enough effort they will converge to correctness. Doc bots are not and will not.
ngriffiths•2h ago
The doc bot goes in the same category as asking a human who has read the docs. In order of helpfulness you could get:

- "Oh yeah just write this," except the person is not an expert and it's either wrong or not idiomatic

- An answer that is reliably correct enough of the time

- An answer in the form "read this page" or quotes the docs

The last one is so much better because it directly solves the problem, which is fundamentally a search problem. And it places the responsibility for accuracy where it belongs (on the written docs).

bee_rider•1h ago
I think the name, doc-bot, is just bad (actually I don’t know what Shopify even calls their thing, so maybe the confusion is on the part of the author of the post, and not some misleading thing from Shopify). A bot like that could fulfill the role of the community forum, which certainly isn’t nothing! But of course it isn’t the documentation.
schaum•46m ago
There is also https://gurubase.io/, which is sometimes used as a kind of "talk with the documentation" tool; it claims to validate the response somehow.
shlomo_z•19m ago
> so I did my customary dance of order-refund, order-refund, order-refund. My credit card is going to get locked one of these days.

I don't know the first thing about Shopify, but perhaps you can create a free "test" item so you don't actually need to make a credit card transaction.