It may not be objective, but at least it's consistent, and it reflects something about the default human position.
For example, there are no good ways of measuring the amount of technical debt in a codebase. It's such a fuzzy question that only subjective measures work. But what if we show the AI one file at a time, ask "Rate, 1-10, the comprehensibility, complexity, and malleability of this code," and then average across the codebase? Then we get a measure of tech debt, which we can compare over time to see whether it's rising or falling. The AI makes subjective measurements consistent.
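A minimal sketch of that loop, assuming an askModel(prompt:) wrapper around whatever model API you use (that function, the .swift filter, and the prompt wording are all placeholders, not a real SDK):

    import Foundation

    // Walk the repo, ask the model to rate each source file, and average the ratings.
    func techDebtScore(repoPath: String) -> Double {
        var ratings: [Double] = []
        let files = FileManager.default.enumerator(atPath: repoPath)
        while let file = files?.nextObject() as? String {
            guard file.hasSuffix(".swift"),
                  let source = try? String(contentsOfFile: repoPath + "/" + file, encoding: .utf8)
            else { continue }
            let prompt = """
            Rate, 1-10, the comprehensibility, complexity, and malleability of this code. \
            Reply with a single number.

            \(source)
            """
            if let reply = askModel(prompt: prompt),
               let rating = Double(reply.trimmingCharacters(in: .whitespacesAndNewlines)) {
                ratings.append(rating)
            }
        }
        // The absolute number means little on its own; track it over time on the same project.
        return ratings.isEmpty ? 0 : ratings.reduce(0, +) / Double(ratings.count)
    }

    // Placeholder: swap in a real call to your model provider of choice.
    func askModel(prompt: String) -> String? { nil }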
This essay gives such a cool new idea, while only scratching the surface.
No it doesn't. Nothing that comes out of an LLM reflects anything except the corpus it was trained on and the sampling method used. That's definitionally true, since those are the very things it is a product of.
You get NO subjective or objective insight from asking the AI about "technical debt"; you only get an opaque statistical metric that you can't explain.
With the proposed way of measuring code quality, it’s also unclear how comparable the resulting numbers would be between different projects. If one project has more essential complexity than another project, it’s bound to yield a worse score, even if the code quality is on par.
Cyclomatic complexity is a terrible metric to obsess over, yet in a project I was on it was undeniably true that the newer code written by more experienced devs was both subjectively nicer and also had lower cyclomatic complexity than the older code worked on by a bunch of juniors (some of those juniors had since become some of the experienced devs who wrote the newer code).
Yes. But it means that it doesn’t let you assess code quality, only (at best) changes in code quality. And it’s difficult as soon as you add or remove functionality, because then it isn’t strictly speaking the same project anymore, as you may have increased or decreased the essential complexity. What you can assess is whether a pure refactor improves or worsens a project’s amenability to AI coding.
I'd take a wager.
I didn't, just an AI tool in general.
The first time I tried without the deeper output, it "solved" it by writing a load of code that failed in loads of other ways, and ended up not even being related to the actual issue.
Like you can be certain it'll give you some nice looking metrics and measurements - but how do you know if they're accurate?
I'm not necessarily convinced that the current generation of LLMs are overly amazing at this, but they definitely are very good at measuring inefficiency of tooling and problematic APIs. That's not all the issues, but it can at least be useful to evaluate some classes of problems.
The issue is that it can happily go down the completely wrong path and report exactly the same as though it's solved the problem.
Just outright "if test-is-running { return success; }" level stuff.
Not kidding. 3 or 4 times in the past week.
Thinking of cancelling my subscription, but I also find it kind of... entertaining?
So… it did.
It made the tests pass.
“Job done boss!”
You have to be super careful and review everything, because if you don't, you can find your code littered with a strange mix of seeming brilliance (which makes you complacent) and total junior-SWE behaviour, or just outright negligence.
That, or recently, it's just started declaring victory and claiming to have fixed things, even when the test continues to fail. Totally trying to gaslight me.
I swear I wasn't seeing this kind of thing two weeks ago, which makes me wonder if Anthropic has been turning some dials...
My team refers to this as a "VW Bugfix".
It feels like it’s become grabbier and less able to stay in its lane: ask for a narrow thing, and next thing you know it’s running hog wild across the codebase shoehorning in half-cocked major architectural changes you never asked for. [Ed.: wow, how’s that for mixing metaphors?]
Then it smugly announces success, even when it runs the tests and sees them fail. “Let me test our fix” / [tests fail] / [accurately summarizes the way the tests are failing] / “Great! The change is working now!”
After leaving a trail of mess all over.
Wat?
Someone is changing some weights and measures over at Anthropic and it's not appreciated.
What's ironic about this is that the very things that TFA points out are needed for success (test coverage, debuggability, a way to run locally etc) are exactly the things that typical LLMs themselves lack.
I know what seems natural to me, but that's because I'm extremely familiar with the internal workings of the project. LLMs seem to be very good at coming up with names that are just descriptive enough but not too long, and, most importantly, that follow "general conventions" from similar projects I may not be aware of. I can't count the number of times an LLM has given me a name for a function and I've thought, oh of course, that's a much clearer name than what I was using. And I thought I was already pretty good at naming things...
ToucanLoucan•6h ago
> In fact, we as engineers are quite willing to subject each other to completely inadequate tooling, bad or missing documentation and ridiculous API footguns all the time. “User error” is what we used to call this, nowadays it's a “skill issue”. It puts the blame on the user and absolves the creator, at least momentarily. For APIs it can be random crashes if you use a function wrong
I recently implemented Microsoft's MSAL authentication on iOS, which includes, as you might expect, a function that retrieves the authenticated accounts. Oh sorry, I said function, but there are two actually: one that retrieves one account, and one that retrieves multiple accounts, which is odd but harmless enough, right?
Wrong, because whoever designed this had an absolutely galaxy-brained moment and decided that if you try to retrieve one account when multiple accounts are signed in, instead of, oh I dunno, just returning an error message, or perhaps returning the most recently used account, no no no, what we should do in that case is throw an exception and crash the fucking app.
I just. Why. Why would you design anything this way!? I can't fathom any situation where you would use the one-account function when the multi-account one does the exact same fucking thing, notably WITHOUT the potential to cause a CRASH, and just returns a set of one. And further, if you were REALLY INTENT ON making one available that only returned a single account, why wouldn't it itself just call the other function and return Accounts.first?
</rant>
layer8•5h ago
Most functions can fail, and any user-facing app has to be prepared for it so that it behaves gracefully towards the user. In that sense I agree that the error reporting mechanism doesn’t matter. It’s unclear though what the difference was for the GP.
ToucanLoucan•5h ago
More importantly: why is having more than one account an "exception" at all? That's not an error or fail condition, at least in my mind. I wouldn't call our use of the framework an edge case by any means: it opens a web form in which one puts authentication details, passes through the flow, and then we are given authentication tokens and the user data we need. It's not unheard of for more than one account to be returned (especially on our test devices, which have many), and I get the one-account function not being suitable for handling that; my question is... why even have it then, when the multi-account one performs the exact same function, better, without an extra error condition that might arise?
TOGoS•4h ago
It is if the caller is expecting there to be exactly one account.
This is why I generally like to return a set of things from any function that might possibly return zero or more than one thing. Fewer special cases that way.
But if the API of the function is to return one, then you either give one at random, which is probably not right, or throw an exception. And with the latter, the person programming the caller will be nudged towards using the other API, which is probably what they should have done anyway, and then, as you say, the returns-one-account function should probably just not exist at all.
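To make the trade-off concrete, the two shapes being debated might look something like this (names invented for illustration, not MSAL's actual API): the collection-returning call has no special cases, while the "exactly one" convenience throws rather than silently picking an account at random, which is the nudge towards the collection API when several accounts can be signed in.

    struct Account { let username: String }

    enum AccountError: Error {
        case noAccount
        case multipleAccounts
    }

    // No special cases: zero, one, or many accounts all come back as an array.
    func allAccounts() -> [Account] {
        // ... look up whatever is signed in on the device ...
        return [Account(username: "a@example.com"), Account(username: "b@example.com")]
    }

    // Convenience for callers that assume exactly one signed-in account.
    // Throwing here (rather than returning an arbitrary account) is the nudge
    // towards allAccounts() described above.
    func currentAccount() throws -> Account {
        let accounts = allAccounts()
        guard let only = accounts.first else { throw AccountError.noAccount }
        guard accounts.count == 1 else { throw AccountError.multipleAccounts }
        return only
    }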
lazide•4h ago
Then later on, it was figured out that multiple accounts per credential set (?!?) needed to be supported, but the original clients still had to keep working.
And either no one could agree on a sane convention for when that happens (like returning the first from the list), or someone was told to ‘just do it’.
So they made the new call, migrated themselves, and put an uncaught exception in the old place (can’t put any other type there without breaking the API), and blam - ticket closed.
Not that I’ve ever seen that happen before, of course.
Oh, and since the multi-account functionality is obviously new and probably quite rare at first, it could be years before anyone tracks down whoever is responsible, if ever.
lazide•4h ago
But there is a way that closes your ticket fast and will compile!
ToucanLoucan•4h ago
Yes there is! Just get rid of it. It's useless. Re-implementing from one to the other took barely a few moments of work, and even if you want to say "well, that's a breaking change" I mean, yeah? Then break it. I would be far less annoyed if a function were just removed and Xcode went "hey, this is pointed at nothing, gotta sort that" rather than letting it run in a way that turns the use of authentication functionality into a landmine.
lazide•1h ago
You might be bound to support these calls for many, many years.
kfajdsl•4h ago
Seems like you should have a generic error handler that will at a minimum catch unexpected, unhandled exceptions with a 'Something went wrong' toast or similar?
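Something like the sketch below, for instance, assuming the failure surfaces as a Swift error (fetchSingleAccount is a made-up stand-in for the throwing single-account lookup, not MSAL's real call; an Objective-C NSException wouldn't be catchable this way):

    import UIKit

    // Hypothetical stand-in for the throwing single-account lookup discussed above.
    func fetchSingleAccount() throws -> String { "user@example.com" }

    func loadAccount(presentingFrom viewController: UIViewController) {
        do {
            let account = try fetchSingleAccount()
            print("Signed in as \(account)") // ... continue the auth flow ...
        } catch {
            // Catch anything unexpected instead of letting it take the whole app down.
            let alert = UIAlertController(title: "Something went wrong",
                                          message: error.localizedDescription,
                                          preferredStyle: .alert)
            alert.addAction(UIAlertAction(title: "OK", style: .default))
            viewController.present(alert, animated: true)
        }
    }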
zahlman•3h ago
Not if you handle the exception properly.
> why is having more than one account an "exception" at all? That's not an error or fail condition, at least in my mind.
Because you explicitly asked for "the" account, and your request is based on a false premise.
> why even have it then, when the multi-account one performs the exact same function, better, without an extra error condition that might arise?
Because other users of the library explicitly want that to be an error condition, and would rather not write the logic for it themselves.
Performance could factor into it, too, depending on implementation details that obviously I know nothing about.
Or for legacy reasons as described in https://news.ycombinator.com/item?id=44321644 .
Jabrov•3h ago
raincole•2h ago
> throw an exception and crash the fucking app
Yes, if your app crashes when a third-party API throws an exception, it's a "skill issue" on your part. This comment is an example of why blaming the user's lack of skill is sometimes valid.
jiggawatts•1h ago
Server-side APIs, and especially authentication APIs, tend towards the “fail fast” approach. When APIs are accidentally misused, this is treated either as a compiler error or as a deliberate crash to let the developer know. Silent failures are verboten for entire categories of circumstances.
There’s a gradient of: silent success, silent failure, error codes you can ignore, exceptions you can’t, runtime panic, and compilation error.
That you can’t even tell the qualitative difference between the last half of that list is why I’m thinking you’re primarily a JavaScript programmer where only the first two in the list exist for the most part.