It may not be objective, but at least it's consistent, and it reflects something about the default human position.
For example, there are no good ways of measuring the amount of technical debt in a codebase. It's such a fuzzy question that only subjective measures work. But what if we show the AI one file at a time, ask "Rate, 1-10, the comprehensibility, complexity, and malleability of this code," and then average across the codebase? Then we get a measure of tech debt, which we can compare over time to see whether it's rising or falling. The AI makes subjective measurements consistent.
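A minimal sketch of that loop, assuming an askModel(prompt:) wrapper around whatever model API you use (that function, the .swift filter, and the prompt wording are all placeholders, not a real SDK):

    import Foundation

    // Walk the repo, ask the model to rate each source file, and average the ratings.
    func techDebtScore(repoPath: String) -> Double {
        var ratings: [Double] = []
        let files = FileManager.default.enumerator(atPath: repoPath)
        while let file = files?.nextObject() as? String {
            guard file.hasSuffix(".swift"),
                  let source = try? String(contentsOfFile: repoPath + "/" + file, encoding: .utf8)
            else { continue }
            let prompt = """
            Rate, 1-10, the comprehensibility, complexity, and malleability of this code. \
            Reply with a single number.

            \(source)
            """
            if let reply = askModel(prompt: prompt),
               let rating = Double(reply.trimmingCharacters(in: .whitespacesAndNewlines)) {
                ratings.append(rating)
            }
        }
        // The absolute number means little on its own; track it over time on the same project.
        return ratings.isEmpty ? 0 : ratings.reduce(0, +) / Double(ratings.count)
    }

    // Placeholder: swap in a real call to your model provider of choice.
    func askModel(prompt: String) -> String? { nil }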
This essay gives such a cool new idea, while only scratching the surface.
No it doesn't. Nothing that comes out of an LLM reflects anything except the corpus it was trained on and the sampling method used. That's definitionally true, since those are the very things it is a product of.
You get NO subjective or objective insight from asking the AI about "technical debt"; you only get an opaque statistical metric that you can't explain.
With the proposed way of measuring code quality, it’s also unclear how comparable the resulting numbers would be between different projects. If one project has more essential complexity than another project, it’s bound to yield a worse score, even if the code quality is on par.
Cyclomatic complexity is a terrible metric to obsess over, yet in a project I was on it was undeniably true that the newer code written by more experienced devs was both subjectively nicer and also had lower cyclomatic complexity than the older code worked on by a bunch of juniors (some of those juniors had since become some of the experienced devs who wrote the newer code).
Yes. But it means that it doesn’t let you assess code quality, only (at best) changes in code quality. And it’s difficult as soon as you add or remove functionality, because then it isn’t strictly speaking the same project anymore, as you may have increased or decreased the essential complexity. What you can assess is whether a pure refactor improves or worsens a project’s amenability to AI coding.
I'd take a wager.
I didn't, just an AI tool in general.
The first time I tried without the deeper output, it "solved" it by writing a load of code that failed in loads of other ways, and ended up not even being related to the actual issue.
Like you can be certain it'll give you some nice looking metrics and measurements - but how do you know if they're accurate?
I'm not necessarily convinced that the current generation of LLMs are overly amazing at this, but they definitely are very good at measuring inefficiency of tooling and problematic APIs. That's not all the issues, but it can at least be useful to evaluate some classes of problems.
The issue is that it can happily go down the completely wrong path and report exactly the same as though it's solved the problem.
Just outright "if test-is-running { return success; }" level stuff.
Not kidding. 3 or 4 times in the past week.
Thinking of cancelling my subscription, but I also find it kind of... entertaining?
So… it did.
It made the tests pass.
“Job done boss!”
You have to be super careful and review everything, because if you don't, you can find your code littered with a strange mix of seeming brilliance (which makes you complacent) and total junior-SWE behaviour, or just outright negligence.
That, or recently, it's just started declaring victory and claiming to have fixed things, even when the test continues to fail. Totally trying to gaslight me.
I swear I wasn't seeing this kind of thing two weeks ago, which makes me wonder if Anthropic has been turning some dials...
My team refers to this as a "VW Bugfix".
It feels like it’s become grabbier and less able to stay in its lane: ask for a narrow thing, and next thing you know it’s running hog wild across the codebase shoehorning in half-cocked major architectural changes you never asked for. [Ed.: wow, how’s that for mixing metaphors?]
Then it smugly announces success, even when it runs the tests and sees them fail. “Let me test our fix” / [tests fail] / [accurately summarizes the way the tests are failing] / “Great! The change is working now!”
After leaving a trail of mess all over.
Wat?
Someone is changing some weights and measures over at Anthropic and it's not appreciated.
What's ironic about this is that the very things that TFA points out are needed for success (test coverage, debuggability, a way to run locally etc) are exactly the things that typical LLMs themselves lack.
I know what seems natural to me, but that's because I'm extremely familiar with the internal workings of the project. LLMs seem to be very good at coming up with names that are just descriptive enough but not too long, and, most importantly, that follow "general conventions" from similar projects I may not be aware of. I can't count the number of times an LLM has given me a name for a function and I've thought, oh of course, that's a much clearer name than what I was using. And I thought I was already pretty good at naming things...
ToucanLoucan•6h ago
> In fact, we as engineers are quite willing to subject each other to completely inadequate tooling, bad or missing documentation and ridiculous API footguns all the time. “User error” is what we used to call this, nowadays it's a “skill issue”. It puts the blame on the user and absolves the creator, at least momentarily. For APIs it can be random crashes if you use a function wrong
I recently implemented Microsoft's MSAL authentication on iOS, which includes, as you might expect, a function that retrieves the authenticated accounts. Oh sorry, I said function, but there are two actually: one that retrieves one account, and one that retrieves multiple accounts, which is odd but harmless enough, right?
Wrong, because whoever designed this had an absolutely galaxy-brained moment and decided that if you try to retrieve one account when multiple accounts are signed in, instead of, oh I dunno, just returning an error message, or perhaps returning the most recently used account, no no no, what we should do in that case is throw an exception and crash the fucking app.
I just. Why. Why would you design anything this way!? I can't fathom any situation where you would use the one-account function when the multi-account one does the exact same fucking thing, notably WITHOUT the potential to cause a CRASH, and just returns a set of one. And further, if you were REALLY INTENT ON making one available that only returned a single account, why wouldn't it itself just call the other function and return Accounts.first?
</rant>
layer8•5h ago
Most functions can fail, and any user-facing app has to be prepared for it so that it behaves gracefully towards the user. In that sense I agree that the error reporting mechanism doesn’t matter. It’s unclear though what the difference was for the GP.
ToucanLoucan•5h ago
More importantly: why is having more than one account an "exception" at all? That's not an error or fail condition, at least in my mind. I wouldn't call our use of the framework an edge case by any means: it opens a web form in which one puts authentication details, passes through the flow, and then we are given authentication tokens and the user data we need. It's not unheard of for more than one account to be returned (especially on our test devices, which have many), and I get the one-account function not being suitable for handling that; my question is... why even have it then, when the multi-account one performs the exact same function, better, without an extra error condition that might arise?
TOGoS•4h ago
It is if the caller is expecting there to be exactly one account.
This is why I generally like to return a set of things from any function that might possibly return zero or more than one thing. Fewer special cases that way.
But if the API of the function is to return one, then you either give one at random, which is probably not right, or throw an exception. And with the latter, the person programming the caller will be nudged towards using the other API, which is probably what they should have done anyway, and then, as you say, the returns-one-account function should probably just not exist at all.
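To make the trade-off concrete, the two shapes being debated might look something like this (names invented for illustration, not MSAL's actual API): the collection-returning call has no special cases, while the "exactly one" convenience throws rather than silently picking an account at random, which is the nudge towards the collection API when several accounts can be signed in.

    struct Account { let username: String }

    enum AccountError: Error {
        case noAccount
        case multipleAccounts
    }

    // No special cases: zero, one, or many accounts all come back as an array.
    func allAccounts() -> [Account] {
        // ... look up whatever is signed in on the device ...
        return [Account(username: "a@example.com"), Account(username: "b@example.com")]
    }

    // Convenience for callers that assume exactly one signed-in account.
    // Throwing here (rather than returning an arbitrary account) is the nudge
    // towards allAccounts() described above.
    func currentAccount() throws -> Account {
        let accounts = allAccounts()
        guard let only = accounts.first else { throw AccountError.noAccount }
        guard accounts.count == 1 else { throw AccountError.multipleAccounts }
        return only
    }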
lazide•4h ago
Then later on, it was figured out that multiple accounts per credential set (?!?) needed to be supported, but the original clients still had to keep working.
And either no one could agree on a sane convention for when that happens (like returning the first from the list), or someone was told to ‘just do it’.
So they made the new call, migrated themselves, and put an uncaught exception in the old place (can’t put any other type there without breaking the API), and blam - ticket closed.
Not that I’ve ever seen that happen before, of course.
Oh, and since the multi-account functionality is obviously new and probably quite rare at first, it could be years before anyone tracks down whoever is responsible, if ever.
lazide•4h ago
But there is a way that closes your ticket fast and will compile!
ToucanLoucan•4h ago
Yes there is! Just get rid of it. It's useless. Re-implementing from one to the other took barely a few moments of work, and even if you want to say "well, that's a breaking change" I mean, yeah? Then break it. I would be far less annoyed if a function were just removed and Xcode went "hey, this is pointed at nothing, gotta sort that" rather than letting it run in a way that turns the use of authentication functionality into a landmine.
lazide•1h ago
You might be bound to support these calls for many, many years.
kfajdsl•4h ago
Seems like you should have a generic error handler that will at a minimum catch unexpected, unhandled exceptions with a 'Something went wrong' toast or similar?
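Something like the sketch below, for instance, assuming the failure surfaces as a Swift error (fetchSingleAccount is a made-up stand-in for the throwing single-account lookup, not MSAL's real call; an Objective-C NSException wouldn't be catchable this way):

    import UIKit

    // Hypothetical stand-in for the throwing single-account lookup discussed above.
    func fetchSingleAccount() throws -> String { "user@example.com" }

    func loadAccount(presentingFrom viewController: UIViewController) {
        do {
            let account = try fetchSingleAccount()
            print("Signed in as \(account)") // ... continue the auth flow ...
        } catch {
            // Catch anything unexpected instead of letting it take the whole app down.
            let alert = UIAlertController(title: "Something went wrong",
                                          message: error.localizedDescription,
                                          preferredStyle: .alert)
            alert.addAction(UIAlertAction(title: "OK", style: .default))
            viewController.present(alert, animated: true)
        }
    }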
zahlman•3h ago
Not if you handle the exception properly.
> why is having more than one account an "exception" at all? That's not an error or fail condition, at least in my mind.
Because you explicitly asked for "the" account, and your request is based on a false premise.
> why even have it then, when the multi-account one performs the exact same function, better, without an extra error condition that might arise?
Because other users of the library explicitly want that to be an error condition, and would rather not write the logic for it themselves.
Performance could factor into it, too, depending on implementation details that obviously I know nothing about.
Or for legacy reasons as described in https://news.ycombinator.com/item?id=44321644 .
Jabrov•3h ago
raincole•2h ago
> throw an exception and crash the fucking app
Yes, if your app crashes when a third-party API throws an exception, it's a "skill issue" on your part. This comment is an example of why blaming the user's lack of skill is sometimes valid.
jiggawatts•1h ago
Server-side APIs, and especially authentication APIs, tend towards the “fail fast” approach. When APIs are accidentally misused, this is treated either as a compiler error or as a deliberate crash to let the developer know. Silent failures are verboten for entire categories of circumstances.
There’s a gradient of: silent success, silent failure, error codes you can ignore, exceptions you can’t, runtime panic, and compilation error.
That you can’t even tell the qualitative difference between the last half of that list is why I’m thinking you’re primarily a JavaScript programmer where only the first two in the list exist for the most part.