Test, don't just verify

https://alperenkeles.com/posts/test-dont-verify/

200•alpaylan•1mo ago

Comments

esafak•1mo ago

Alperen,

Thanks for the article. Perhaps you could write a follow-up article or tutorial on your favored approach, Verification-Guided Development? This is new to most people, including myself, and you only briefly touch on it after spending most of the article on what you don't like.

Good luck with your degree!

P.S. Some links in your Research page are placeholders or broken.

alpaylan•1mo ago

I'll add some links for the original VGD paper and related articles, that should help in short term. Thank you! I'll look into writing something on VGD itself in the next few weeks.

getregistered•1mo ago

> AI-assisted programming pushes the limits of programming from what you can implement to what you can specify and what you can verify.

This really resonates. We can write code a lot faster than we can safely deploy it at the moment.

marcosdumay•1mo ago

> We can write code a lot faster than we can safely deploy it at the moment.

We always could. That has been true since the days we programmed computers by plugging jumper wires into a panel.

lelanthran•1mo ago

> We always could. That has been true since the days we programmed computers by plugging jumper wires into a panel.

That's news to me, and I'm an ancient greybeard in development.

If you have a team of 1x f/time developer and 1x f/time tester, the tester would be spending about half their day doing nothing.

Right now, a single developer with Claude code can very easily overwhelm even a couple of testers with new code to test.

scubbo•1mo ago

> If you have a team of 1x f/time developer and 1x f/time tester,

Y'all have dedicated testers!? In 14 years of development, across FAANG and startup, this has never been true for me. The closest I've come is a brief period when a group of ~7 teams were able to call on the services of two testers. As you can imagine, with that ratio, the testers were not spending much time doing nothing.

lelanthran•1mo ago

I've had it at least 4 times; it's a byproduct of working in a highly regulated industry that requires the software (or goods sold) to meet a specific certification (military/munitions/EMV/etc).

In the FAANG and startup world that I worked in, there was no QA department, so I assume that FAANGs and startups don't have a dedicated and autonomous/independent QA department.

That's not the point I was making, though. The point is that we could never emit code faster than it was to deploy. Deployment (including QA) was always 2x to 4x as fast. Sometimes as much as 10x as fast.

=================

EDIT: Of course, I've been working for about twice the number of years as you, and back in those days it was pretty common for large companies to have dedicated QA. Even Microsoft had those :-)

marcosdumay•1mo ago

> the tester would be spending about half their day doing nothing

That's because the developer would be spending 2/3 of their day fixing the problems the tester already found.

And the time spent writing new code has always been a rounding error from 0.

Jensson•1mo ago

> Right now, a single developer with Claude code can very easily overwhelm even a couple of testers with new code to test.

Because there are endless errors and problems that never gets fixed with AI coding. The reason testers ran out of things to test before was that developers tested themselves before sending it over, if you take a bunch of cowboy coders coding thousands of lines a day with no testing whatsoever before throwing it over to the testers you would say you don't have enough testers even if you had thousands.

lelanthran•1mo ago

> Because there are endless errors and problems that never gets fixed with AI coding.

But, that's my point :-)

> The reason testers ran out of things to test before was that developers tested themselves before sending it over, if you take a bunch of cowboy coders coding thousands of lines a day with no testing whatsoever before throwing it over to the testers you would say you don't have enough testers even if you had thousands.

Right. But even if the devs are doing unit-tests (which is all devs are supposed to do), they can still overwhelm a QA department.

We could never do this before. GGP claimed that we always could. I am disagreeing with that specific claim - "We could always overwhelm the testers".

acedTrex•1mo ago

> We can write code a lot faster than we can safely deploy it at the moment.

This has always been the case?

badgersnake•1mo ago

> AI is making formal verification go mainstream.

This nonsense again. No. No it isn’t.

I’m sure the people selling it wish it was, but that doesn’t make it true.

AnimalMuppet•1mo ago

LLM-style AI isn't great for formal verification, not so far as I understand. And the recent advances in AI didn't do much for the kind of AI that is useful for formal verification.

otterley•1mo ago

You don’t use AI to perform formal verification. You give the agent access to verification tools whose output can then be fed back to the model.

It’s the same design as giving LLMs the current time, since they can’t tell time themselves, either.

baq•1mo ago

You haven't been paying attention.

The fact that we're reading about it here today and have read about it in the past weeks is one piece of evidence. Another is that we hadn't been reading about it in the past months before November. Opus 4.5 and GPT 5.2 have crossed an usefulness frontier.

Anecdotally, I've been having some success (guiding LLMs) writing Alloy models in the past month and ensuring conformance with code. Making these would've been unjustifiable from ROI perspective fairy tales just this summer. The landscape has changed qualitatively.

Izkata•1mo ago

> The fact that we're reading about it here today and have read about it in the past weeks is one piece of evidence. Another is that we hadn't been reading about it in the past months before November.

Except I do remember reading about it on here many times in past years.

Ericson2314•1mo ago

Formal verification is going mainstream as watercooler weakend project fodder. As someone that has been well-versed in functional programming and depedent types for over a decade, this is a vast improvement.

The hobby project to day job methodology pipeline is real.

andai•1mo ago

Related discussion from last week:

AI will make formal verification go mainstream

https://news.ycombinator.com/item?id=46294574

zipy124•1mo ago

I think this misses a lot of reasons why learning verification is important. For instance learning the concept of invariants and their types such as loop invariants. They make reasoning about code in general easier, even if you never formally do any verification, it makes it easier to write tests or asserts(). A substantial amount of bugs are due to the program having a different state to that assumed by the programmer, and there are other tools that help with this. For example a statically typed language is a type of verification since it verifies a variable has a specific type and thus operations that can be performed on it, and limits the valid input and output range of any function. Languages like Rust are also verification in terms of memory correctness, and are also extremely useful tools.

CuriouslyC•1mo ago

Formal verification is a nice idea but it's a big hill to climb from where we're at. Most people can't even get agents to robustly E2E QA code, which is a much smaller hill to climb for (probably) larger benefits. I'm sure this area will improve over time though, since it is an eventual unlock for fully autonomous engineering.

__MatrixMan__•1mo ago

I think for most complex systems, robust E2E QA is a waste of money. A small handful of E2E smoke tests and thoughtful application of smaller tests is usually enough. Though to be fair, agent aren't good at that either.

smccabe0•1mo ago

I think the section on AI from Zero to QED (a proofs in Lean/lang guide) gives a sober path forward from the perspective of market-makers and trading:

"Imagine market infrastructure where agents must prove, before executing, that their actions satisfy regulatory constraints, risk limits, fairness properties, and eventually machine-checkable proofs of Pareto efficiency of market mechanisms. This is a big, hairy, ambitious goal. Not “we reviewed the code” but “the system verified the proof.” The agent that cannot demonstrate compliance cannot act."

https://sdiehl.github.io/zero-to-qed/20_artificial_intellige...

TeodorDyakov•1mo ago

I dream of a future where before any software is released we can predict 100 years into the future what effect it will have on every living thing and not release it if unhappiness delta for some living thing falls below a certain threshold.

layer8•1mo ago

That would require making the behavior of the living things part of the formal specification, which isn’t really possible.

ecocentrik•1mo ago

Doesn't this run into the same bottleneck as developing AI first languages? AI need tons of training material for how to write good formal verification code or code in new AI first languages that doesn't exist. The only solution is large scale synthetic generation which is hard to do if humans, on some level, can't verify that the synthetic data is any good.

omgJustTest•1mo ago

my user should get upvotes for this :)

omgJustTest•1mo ago

Lol worth the -3 pts

andrewmutz•1mo ago

I agree completely with the author that AI assisted coding pushes the bottleneck to verification of the code.

But you don't really need complete formal verification to get these benefits. TDD gets you a lot of them as well. Perhaps your verification is less certain, but it's much easier to get high automated test coverage than it is to get a formally verifiable codebase.

I think AI assisted coding is going to cause a resurgence of interest in XP (https://en.wikipedia.org/wiki/Extreme_programming) since AI is a great fit for two big parts of XP. AI makes it easy to write well-tested code. The "pairing" method of writing code is also a great model for interacting with an AI assistant (much better than the vibe-coding model).

9rx•1mo ago

Trouble is that TDD, and formal proofs to much the same extent, assume a model of "double entry accounting". Meaning that you write both the test/proof and the implementation, and then make sure they agree. Like in accounting, the assumption is that the probability of you making the same mistake twice is fairly low, giving high confidence to accuracy when they agree. When there is a discrepancy, then you can then unpack if the problem is in the test/proof or the implementation. The fallible human can easily screw either.

But if you only fill out one side of the ledger, so to speak, an LLM will happily invent something that ensures that it is balanced, even where your side of the entry is completely wrong. So while this type of development is an improvement over blindly trusting an arbitrary prompt without any checks and balances, it doesn't really get us to truly verifying the code to the same degree we were able to achieve before. This remains an unsolved problem.

altmanaltman•1mo ago

I don't fully understand what you mean by accounting expects the probability of making the same mistake twice is fairly low? Double-entry bookkeeping can only tell you if the books are balanced or not. We absolutely cannot assume that the books reflect reality just because they're balanced. You don't need to mess up twice to mess up the books in terms of truthness.

Also tests and code are independent while you always affect both sides in double-entry always. Audits exist for a reason.

9rx•1mo ago

What about the concept of "high confidence" is not understandable?

altmanaltman•1mo ago

You can use a little less snark and "high confidence" is pretty easy to understand but your metaphor makes no sense. Balanced books =/= accurate books and it is not at all a sign that the bookkeeping is accurate. The entries are also not independent like code and tests.

9rx•1mo ago

> Balanced books =/= accurate books

Naturally. Hence "high confidence" and not "full confidence". But let's not travel too far into the weeds here. Getting us back on track, what about the concept of "high confidence" is not understandable?

layer8•1mo ago

With double-entry bookkeeping, the only way an error can slip through is if you make the same error on both sides, or else they wouldn’t be balanced. A similar thing is true for testing: If you make both an error in your test and in your implementation, they can cancel out and appear to be error-free.

I don’t quite agree with that reasoning, however, because a test that fails to test the property it should test for is a very different kind of error than having an error in the implementation of that property. You don’t have to make the “same” error on both sides for an error to remain unnoticed. Compared to bookkeeping, a single random error in either the tests or the implementation is more likely to remain unnoticed.

altmanaltman•1mo ago

> With double-entry bookkeeping, the only way an error can slip through is if you make the same error on both sides, or else they wouldn’t be balanced. A similar thing is true for testing: If you make both an error in your test and in your implementation, they can cancel out and appear to be error-free.

Yeah but it's very different from tests vrs code though, right? Every entry has two sides at least and you do it together, they are not independent like test and code.

You can easily make a mistake if you write a wrong entry and it will still balance. Balanced books =/= accurate books is my point. And there is no difference between "code" and "tests" in double entry, it's all just "code".

So it seems like the person who made the metaphor doesn't really know how double-entry works or took maybe one accounting class.

zahlman•1mo ago

> Yeah but it's very different from tests vrs code though, right? Every entry has two sides at least and you do it together, they are not independent like test and code.

The point of the current thread is that the use of AI coding agents threatens to disrupt that. For example, they could observe a true positive test failure and opt to modify the test to ensure a pass instead.

andrewmutz•1mo ago

That sounds right in theory, but in practice my code is far, far higher quality when I do TDD than when I don't. This applies whether or not I'm using an Ai coding assistant

MetaWhirledPeas•1mo ago

I don't think GP disagrees. They are (I think) stating that AI-assisted TDD is not as reliable as human TDD, because AI will invent a pointless test just to achieve a passing outcome.

antupis•1mo ago

Mainstream has been slowly adapting xp like last 20 years, devops first and now pair programming with agents and tdd.

senderista•1mo ago

There's a pretty fundamental difference: TDD is about avoiding up-front design, while formal specification literally is up-front design.

baq•1mo ago

We won't be formally verifying millions of LOC anytime soon, don't get your hopes that high up.

...but we will be modelling those 5-10kLOC modules across multiple services doing critical business logic or distributed transactions. This has been unthinkable a couple months ago and today is a read-only-Friday experiment away (try it with a frontier model and you'll be surprised).

tgtweak•1mo ago

I think more salient here (at term certainly) is setting up adversarial agents for testing/verification - that has been a big win for me in multi-agent workflows - when claude first released "computer use" that was a very big step in closing this loop and avoiding the manual babysitting involved in larger projects. PSA that it's not a silver bullet as the "analyzer" can still get tripped up and falsely declare something as broken (or functional), but it greatly reduces the "Hey I've done the task" when the task is not done or the output is broken.

esafak•1mo ago

So it is not reliable enough to run automatically?

throw-12-16•1mo ago

people cant even bother to write code and you expect them to test it?

anon-3988•1mo ago

Before we start writing Lean. Perhaps we can start with something "dumber" like Rust or any typed program. If you want to write something correct, or you care about correctness, you should not be using dynamic languages. The most useful and used type of test is type checking.

Type errors, especially once you have designed your types to be correct by construction, is extremely, extremely useful for LLMs. Once you have the foundation correct, they just have to wiggle through that narrow gap until it figures out something that fits.

But from what I understood and read so far, I am not convinced of OP's "formal verification". A simple litmus test is to take any of your recent day job task and try to describe a formal specification of it. Is it even doable? Reasonable? Is it even there? For me the most useful kind of verification is the verification of the lower level tools i.e. data structures, language, compilers etc

For example, the type signature of Vec::operator[usize] in Rust returns T. This cannot be true because it cannot guarantee to return a T given ANY usize. To me, panic is the most laziest and worst ways to put in a specification. It means that every single line of Rust code is now able to enter this termination state.

packetlost•1mo ago

I think it's possible to write correct systems with dynamic languages, just not the ones we commonly use like Python and JavaScript. I find Clojure, for example to be one example of a dynamic language that is pretty easy to manage and I attribute that to the immutable nature and data-centric ethos. I'm sure there are other dynamic languages that would work as well.

Now, I wouldn't necessarily use Clojure on a huge multi-organization codebase (maybe it's fine, this is outside of my experience with it), but it can be the right tool for some jobs.

brabel•1mo ago

Common Lisp as well. I can’t explain why, but type errors are just not something I struggle with in Common Lisp! But it is in JS and Python for sure. Maybe someone knows why it feels different?

packetlost•1mo ago

I haven't done much with CL so I can only speculate, but I think stricter FP principles in general work to minimize the downsides of dynamic typing. CL, to my understanding, isn't the most "pure" when it comes to FP, but does a good job at giving the programmer a lot of power to constrain and explore systems.

chamomeal•1mo ago

I think it’s cause there’s less imperative code and side effects to track data transformations through.

Like any random JS/php app is probably a huge pile of loops and if statements. To track what happens to the data, you need to run the whole program in your head. “And now it adds that property to the object in the outer scope, and now that object gets sorted, now it hits the database… ok…”. Whereas in clojure most functions are either a single atomic transformation to a set of data, or batch of side effects. You still have to run it through your head, but you can do it more piece-by-piece instead of having to understand a 1,000 method with class states being auto loaded and mutated all over the place. Also you have a REPL to try stuff out as you go.

Dont get me wrong, I LOVE static types. Statically typed clojure would be the best fckin language ever. But there is definitely a wide gulf between a dynamic language like JS, and one like clojure!

brabel•1mo ago

> Statically typed clojure

Well, if you also like Common Lisp, there's Coalton, which is Common Lisp with a Haskell-like type system: https://coalton-lang.github.io/

zahlman•1mo ago

> Like any random JS/php app is probably a huge pile of loops and if statements. To track what happens to the data, you need to run the whole program in your head. “And now it adds that property to the object in the outer scope, and now that object gets sorted, now it hits the database… ok…”. Whereas in clojure most functions are either a single atomic transformation to a set of data, or batch of side effects. You still have to run it through your head, but you can do it more piece-by-piece instead of having to understand a 1,000 method with class states being auto loaded and mutated all over the place. Also you have a REPL to try stuff out as you go.

Nothing really forces you to write imperative code in a large fraction of cases, and typically the state-change operations can be quite localized within the code. And of course JavaScript and Python both also have REPLs.

chamomeal•1mo ago

But nothing forces you to write functional code either. I’ve seen a whooole lot of php and JS, and most of it has been pretty terrible lol. Of course you can write terrible code in any language, but I think the ease of starting with JS/php combined with the lack of built-in opinions makes it easy to build huge piles of spaghetti.

Though these days fresh typescript codebases are usually pretty decent. I love typescript and it’s really nice to work with a well-typed, modern project with proper schema validation and such. Def miss that in clojure.

Also I wouldn’t really compare JS or pythons REPL to clojure’s. Python’s is useful, but I pretty much live inside the clojure repl

phaedrus•1mo ago

I once attended a talk by someone who is or was big in the node.js world. He opened with the premise, "a static type check is just a stand-in for a unit test."

I wanted to throw a shoe at him. A static type check doesn't stand in for "a" unit test; static typing stands in for an unbounded number of unit tests.

Put another way, this common misconception by users of languages like Javascript and Python that unit testing is just as good as type checking (plus more flexible) is a confusion between the "exists" and "for all" logical operators.

kccqzy•1mo ago

Plus, it is simply more enjoyable to design the types in your program than to write unit tests. The fun factor comes from operating on a higher level of abstraction and engages more of your brain’s puzzle-solving mode than just writing unit tests. Making yourself think about “for all x” rather than a concrete x forces your brain to consider deeply the properties of x being used.

zahlman•1mo ago

> it is simply more enjoyable to design the types in your program than to write unit tests.

I have tried both and I have no idea what you're talking about.

> Making yourself think about “for all x” rather than a concrete x forces your brain to consider deeply the properties of x being used.

The entire point of dynamic typing is that you can think about interfaces rather than concrete types, which entails deep consideration of the properties of the object (semantics of the provided interface).

array_key_first•1mo ago

That's not the entire point of dynamic typing, because all the interface stuff comes from statically typed languages. Some* dynamic languages borrowed it, but most use "implicit" interfaces - where the interface is whatever kind of works, I guess.

zahlman•1mo ago

> because all the interface stuff comes from statically typed languages.

No, it doesn't. It comes from theory that came after the languages.

> Some* dynamic languages borrowed it, but most use "implicit" interfaces

An implicit interface is an interface, and is exactly the sort of thing I'm talking about in GP. The point is that you think about the object in terms of its capabilities, rather than some proven-up-front categorization that it fits into. What it does, not what it is.

maleldil•1mo ago

You can achieve this with structural subtyping, such as Go interfaces and Python protocols. Whether that is desirable is a different question.

lesuorac•1mo ago

If you care about the type of a parameter you can just add an assertion in the method /s

seg_lol•1mo ago

Hundreds of unit tests replace a type.

Start using properties and it is in the thousands.

Most code should be typed. Python is great for prototypes, but once the prototype gels, you need types.

baq•1mo ago

Good that Python supports types then

lelanthran•1mo ago

> Good that Python supports types then

"Optional typing" is not the same as "Static typing".

Great, my program will crash, because I forgot to opt-in to typing :-/

paulddraper•1mo ago

C has void pointers.

lelanthran•1mo ago

> C has void pointers.

And? Void pointers are not the default type :-/

With Python I have to do extra work to get type errors.

With C I have to do extra work to hide the type errors.

I am battling to understand the point you are making.

baq•1mo ago

Hard to argue with an argument which isn’t internally coherent.

array_key_first•1mo ago

He's probably conflating static and strong typing.

C is statically typed, but weakly typed - you need to throw away types to do a bunch of run of the mill things. Python is dynamically typed, but strongly typed, where it will just fail if typed don't resolve.

C# and C++ are both statically typed and strongly typed, although C# more than C++ in practice.

paulddraper•1mo ago

I believe that is in fact the exact return type of malloc.

You have to do extra work to type it.

I’m not aware of any other statically typed language that does that.

Hopefully that helps you in your battle.

lelanthran•1mo ago

> I believe that is in fact the exact return type of malloc.

> You have to do extra work to type it.

Nope. Casting the return from `malloc` is a code-smell in C. You have to do extra work if you use `malloc` in C++.

> Hopefully that helps you in your battle.

Not sure what you mean by this - weren't you the one who dragged C into a conversation about Python?

SR2Z•1mo ago

Poorly, though, and with lots of edge cases and foot guns to make you miserable once your code calls out to numpy or gets some JSON.

Groxx•1mo ago

Python has formalized magic comments.

You can run a third party linter on those comments, but you must hope that they're correct. There are usually some checks for that, but they're only reliable in trivial cases.

This is not static typing any more than "you can use emscripten to transpile JavaScript to C" means that JavaScript is a low level language with native assembly support. It's a huge step forward from "no system at all" and I'm thrilled it exists, but it's hardly the same thing.

kccqzy•1mo ago

They are originally magic comments, unformalized. The feature is originally known as annotations and you can basically put anything there.

    def f(a: "something like an int", b: "another int-like thing"): pass

    def g(a: sys.stdin, b: sys.stderr): pass

tasuki•1mo ago

Tell me more please: how does one use types in Python? Unfortunately I write Python professionally these days (it is the language that has all the libraries) and hate it with a passion.

baq•1mo ago

mypy is the simplest/oldest option, look into pyright and ty, too.

install uv and then

    uvx mypy

should do the rest.

seg_lol•1mo ago

You will never get good if you continue to hate it. You don't have to like it, but hating it creates a mental block.

andrewl-hn•1mo ago

It's actually remarkable how with the success of TypeScript so many other dynamic languages switched to gradual typing.

Erlang and Clojure were the early ones, TypeScript followed, and now Python, Ruby, and even Perl have ways to specify types and type check your programs.

lelanthran•1mo ago

I've always hated Python. Could never enjoy it at all. Pretty much the same poor DX as PHP, Javascript, Ruby, etc.

Finally set up neovim with pyright; use types on every single fucking thing, and now I love Python[1].

Can't wait to see TC39 become a reality (learned about it just this past week on HN, actually). Maybe I'll enjoy Javascript too.

--------------------

[1] Within reason; the packaging experience is still extremely poor!

tasuki•1mo ago

I've never seen types used correctly in Python. A `tuple` is not even a type! It's a type constructor!

lelanthran•1mo ago

Most of the types are "not even types". It's why you can pass a type as a value (not a string) to a function.

If the type was a type, you'd not be able to use it as a value.

zahlman•1mo ago

The distinction you are trying to make is nonsensical in Python's object model. Types are inherently callable, and calling them constructs (i.e. instantiates) the type (normally; this can be overridden, by design). There is also no type->kind->category hierarchy; `type` itself is an object, which is its own type.

When you're at a level of theory where terms like "type constructor" are natural, it's unreasonable to expect any of it to be applicable to Python. This is why the Haskell people speak of dynamically-typed languages in the Python mold as "untyped" regardless of their attitude towards implicit casts.

And I love it, and have been using it for decades, and write beautiful things where the annotations hardly ever seem worth the effort — perhaps for documentation, but not for a static checker. Then I look at other, newer Pythonistas trying to figure out how to write complex generic type expressions (and sacrificing backwards compatibility as they keep up with the churn of Python figuring out how to offer useful annotation syntax) and deal with covariance vs contravariance etc. and I just smile.

baq•1mo ago

I’ve been doing Python and Typescript professionally, Python for almost two decades, Typescript for last 5 years and I can very confidently say that it doesn’t matter.

Besides, you see to be confusing Python run-time with Python typecheck-time, theoretically unfortunate, but again practically irrelevant distinction. (Unfortunate since Python typecheck is basically a different language than Python execution; irrelevant, because the right subsets of both align well.)

lenkite•1mo ago

Didn't get the bit about TC39 - it is just a group. It has nothing to do with types in JS.

lelanthran•1mo ago

Sorry, I should have been clearer about which proposal I was talking about:

https://github.com/tc39/proposal-type-annotations

shermantanktop•1mo ago

A unit test is a functional assertion. A type is a semantic construct that can provide that, but it provides a lot more.

As a trivial example, if I create a type alias from “string” to “foobarId,” I now (assuming a compliant language) can prevent code that consumes foobarIds from accidentally consuming a string.

pmontra•1mo ago

Good luck using static typing to model many real world unit tests for the programming languages people use most. I start with an easy example: those records should be sorted by date of birth. We can move on to more complicated scenarios.

Jaxan•1mo ago

The comment didn’t claim that types are a stand in for tests either! IMO, they are orthogonal.

zahlman•1mo ago

The comment explicitly set out to refute the idea "...that unit testing is just as good as type checking" by describing the former as simply inferior.

Veserv•1mo ago

No. They refuted the claim that "a static type check is just a stand-in for a unit test". That is a claim that you can just remove your type checks and replace them with unit tests at no loss. The comment stated that removing a type check just so you can replace it with a unit test is inferior. The prior state was already pre-supposed to have a type check/type checkable condition that you could replace.

That is the literal converse of the claim in the response to that comment arguing that the comment stated that all unit tests can be replaced with type checks. Those are not at all the same claim.

To make it even more clear the comment said: I saw a talk that said Type Check -> Unit Test. I said that is silly.

Response said: Unit Test -> Type Check is not reasonable. So clearly your claim that Type Check -> Unit Test is silly is wrong.

paulddraper•1mo ago

No one claims that types are a stand in for all unit tests.

They stand in for the banal unit tests.

jodrellblank•1mo ago

> "records should be sorted by date of birth."

What's wrong with C#'s:

    System.Collections.Generic.SortedList<DoBDateTime, PersonRecord>

?

tasuki•1mo ago

> I wanted to throw a shoe at him.

You should have!

imiric•1mo ago

Agreed.

Besides, it's not like types don't matter in dynamically typed languages. The (competent) programmer still needs to keep types in their head while programming. "Can this function work with a float, or must I pass an int?" "This function expects an iterable, but what happens if I pass a string?" Etc.

I started my career with JavaScript and Python, but over the years I've come to the conclusion that a language that hides types from programmers and does implicit conversion magic in the background does not deliver a better DX. It might make the language more approachable initially, and the idea of faster prototyping might be appealing, but it very quickly leads to maintenance problems and bugs. Before type hinting tools for Python became popular, I worked on many projects where `TypeError` was the #1 exception in Sentry by a large margin.

Gradual and optional typing is better than nothing, but IME if the language doesn't require it, most programmers are lazy and will do the bare minimum to properly add type declarations. Especially with things like TypeScript, which makes many declarations difficult to read, write, and understand.

I think that type inference is a solid middle ground. Types are still statically declared, but the compiler is smart enough to not bother the developer when the type is obvious.

zahlman•1mo ago

> Before type hinting tools for Python became popular, I worked on many projects where `TypeError` was the #1 exception in Sentry by a large margin.

My experience is radically different. `ValueError` is far more common in my un-annotated Python, and the most common cause of `TypeError` anyway is the wrong order or number of arguments after a refactoring.

imiric•1mo ago

Hhmm I could be misremembering if it was `ValueError` or `TypeError`. This was a few years ago. I know that typing issues were always the most frequent in any Python project I have worked on.

How is your experience different?

zahlman•1mo ago

> A static type check doesn't stand in for "a" unit test; static typing stands in for an unbounded number of unit tests.

You have conflated "a static type check" with "static typing". Unit tests stand in, in the same way, for an unbounded number of states of real-world input. They're simply being subjected to a trial verification system rather than a proof system. It turns out that writing proofs is not very many people's idea of a good time, even in the programming world. And the concept of "type" that's normally grokked is anemic anyway.

> Put another way...

Rhetoric like this is unconvincing and frankly insulting. You pass off your taste and opinion as fact, while failing to understand opposed arguments.

andrewl-hn•1mo ago

> "a static type check is just a stand-in for a unit test."

This is not an original argument. Rich Hickey made a similar argument in his "Simple made easy" talk in 2011, though his focus was on a fact that every bug that easiest in a software system has passed unnoticed through both a type checker and a test suit. And even before that similar ideas of test suits being a suitable replacement for a type checker have percolated through Python and Ruby communities, too.

I distinctly remember that the "tests makes static type checks unnecessary" was in fact so prevalent in JavaScript community that TypeScript had really hard time getting adoption in its first 3-4 years, and only the introduction of VSCode in 2015 and subsequent growth of its marketshare over Atom and SublimeText got more people exposed to TypeScript and the benefits of a type checker. Overall it took almost 10 years for Typescript to become the "default" language for web projects.

loglog•1mo ago

The author is in the comfortable position of working on a system that does have a formal specification and a formally verified reference implementation. The post is not about how they wish things would work, but how their existing system (Cedar) works.

Regarding your point on Rust, the vast majority of software has nowhere near the amount of static guarantees provided by Rust. If you need more, use static memory allocation, that's what people do for safety critical systems. By the way, it seems that Rust aborts on OOM errors, not panics: https://github.com/rust-lang/rust/issues/43596

gaogao•1mo ago

Yeah, Rust has been pretty good for formal verification so far. Hoare spec contracts I think are the way forward, especially since they fairly naturally flow from unittests. I've been using Hax to pretty good effect so far. I'm generally suspect that advances in Lean proof solving by dedicated models are that useful for program verification, compared to generalist models, though it could help lower costs a good bit.

didip•1mo ago

I agree. And learning a typed language is significantly easier now that AI can explain everything. The types also help AI to write a correct code. A very positive feedback loop.

Ericson2314•1mo ago

> To me, panic is the most laziest and worst ways to put in a specification.

This why the "existing programs don't have specs!" Hand-ringing is entirely premature. Just about every code base today has error modes the authors think won't happen.

All you have to do is start proving they won't happen. And if you do this, you will begin a long journey that ends up with a formal spec for, at least, a good part of your program.

Proving the panics are dead code is a Socratic method, between you and the proof assistant / type checker, for figuring out what your program is and what you want it to be :).

anon-3988•1mo ago

Orrrr just use an API that returns Option<T>, then bubble it up or manually panic if that's what you want?

pron•1mo ago

> Perhaps we can start with something "dumber" like Rust or any typed program. If you want to write something correct, or you care about correctness, you should not be using dynamic languages. The most useful and used type of test is type checking.

Lean or TLA+ are to Rust/Java/Haskell's type systems what algebraic topology and non-linear PDEs are to "one potato, two potatoes". The level of "correctness" achievable with such simple type systems is so negligible in comparison to the things you can express and prove in rich formal mathematics languages that they barely leave an impression (they do make some grunt work easier, but if we're talking about a world where a machine can do the more complicated things, a little more grunt work doesn't matter).

anon-3988•1mo ago

I mean..sure, but I just want the first 80%. We don't have that. Instead, we are building kernels and infrastructure using bash scripts that who knows does what. We need a tool that is solid and rigid that LLMs can use to go through all of that.

It should be something that is familiar (so imperative style like C), easier to read (perhaps with type inference) and have strong modern type system (just give me sum type is enough for gods sake). Perhaps Python with (real) types.

pron•1mo ago

But if LLMs get to the point they're smart enough to deal with the tricker aspects of programming, what makes you think they need help with the easier parts? Conversely, if they're not smart enough to deal with the trickier parts, why would a little help move the needle much? Despite trying, research has not been able to find a significant general[1] effect of language design on correctness or productivity for human programmers (at least among more-or-less high level languages; I'm not talking Java vs Assembly). We all have our preferences, and we tend to think they're universal, but it's precisely because of this bias that empirical study is needed, and it's not been conclusive.

If there's no big impact on humans, why assume there would be one for LLMs? I'm not saying that LLMs think like humans, but the default hypothesis should be that something doesn't make a big difference if there's no example in which it does. In other words, if something does not have a known effect, we shouldn't assume that it will in this case (I mean, it could, but we'll need to first find good empirical evidence for that).

[1]: Research did find some differences between TypeScript and JavaScript specifically, but that result hasn't generalised.

i_am_a_peasant•1mo ago

a rust to js transpiler would be pretty sweet. idk if someone has turned rust into a frontend language yet like typescript.

MetaWhirledPeas•1mo ago

I lack the level of education and eloquence of the author, but I have my own notion that I think agrees with them: Specification is difficult and slow, and bugs do not care whether they are part of the official specification or not.

Some software needs formal verification, but all software needs testing.

On another subject...

> Tests are great at finding bugs ... but they cannot prove the absence of bugs.

I wish more people understood this.

enceladus76•1mo ago

Unless people therefore decide that testing unnecessary... Which has happened a lot in academia. One of the reasons testing is not being taught that well on some universities...

visarga•1mo ago

At the end of the day, you either validate every line of code manually, or you have the agent write tests. Automate your review.

vzaliva•1mo ago

There are many arguable points in this blog post, but I want to highlight just one: the need for formal specification. It is indeed a big issue. However, one must distinguish between a full specification, which is sufficient to prove functional correctness, and a specification of certain security or safety properties, which only allows us to verify those properties. For example, we can easily specify the property that "the program shall never read uninitialised memory" and prove it. That wouldn't guarantee that the program is functionally correct, but it would at least rule out a whole class of potential errors.

xp84•1mo ago

For the verification experts: (and forgive me because I have almost zero of the math understanding of this stuff)

> This makes formal verification a prime target for AI-assisted programming. Given that we have a formal specification, we can just let the machine wander around for hours, days, even weeks.

Is this sentiment completely discounting that there can be many possible ways to write program that satisfies certain requirements that all have correct outputs? Won’t many of these be terrible in terms of performance, time complexity, etc? I know that in the most trivial case, AI doesn’t jump straight to O(n)^3 solutions or anything, but also there’s no guarantee it won’t have bugs that degrade performance as long as they don’t interfere with technical correctness.

Also, are we also pretending that having Claude spin for “even weeks” is free?

hun3•1mo ago

Verified software should satisfy the liveness property; otherwise, an infinite loop that never returns would pass as "correct."

Verifying realtime software goes even further and enforces an upper bound on the maximum number of ticks it takes to complete the algorithm in all cases.

gaogao•1mo ago

Yup, I've already spent like $20k using Claude to verify things, so like there's probably some room for cost cutting.

hacker_homie•1mo ago

Test, just don’t verify.

How I learned to deploy faster.

nileshtrivedi•1mo ago

> proof assistants, traditionally, don't use our classic two's complement integers packed into words in our memory, they use Peano numbers

Why can't we just prove theorems about the standard two's complement integers, instead of Nat?

AlotOfReading•1mo ago

You can. Naturals are the default because the kinds of people who write and use proof assistants think usually in terms of natural numbers first rather than the 2-adic numbers mod 2^N we use in programming, or even the plain 2-adics used for algorithmics. It's like mathematicians and 1-based indexing. It violates programming conventions, but the people using it find it easier.

aidenn0•1mo ago

See also Regehr's example[1] where a formally verified C compiler generates incorrect output because of an inconsistent value in <limits.h> (TL;DR: The compiler can pick whether "char" is signed or unsigned. Compcert picked one, but the linux system header used the other for CHAR_MIN and CHAR_MAX).

1: https://blog.regehr.org/archives/482 there were many issues here, not just with compcert

MarkMarine•1mo ago

This is an aside because I agree with the author’s core point, but spelling, grammatical errors, and typos actually imply something authored by a human now. This sentence:

“It affects point number 1 because AI-assisted programming is a very natural fit fot specification-driven development.”

made me smile. Reading something hand made that hadn’t been through the filters and presses of modern internet writing.

blauditore•1mo ago

To my surprise, I recently found a typo in AI-generated code. It's rare though

zahlman•1mo ago

There are typos all over the training data, and people offering RLHF feedback can overlook them.

hebelehubele•1mo ago

He's not a native English speaker, for starters. But I like that he didn't fill that gap with an LLM.

jesse__•1mo ago

> It is also painfully slow, the computational complexity of a + b, an operation so fast in CPU that it's literally an instant, is O(a + b), addition is linear in time to the added values instead of a constant operation.

To me, this reads as an insurmountably high hurdle for the application domain. We're talking about trying to verify systems which are produced very quickly by AIs. If the verification step is glacially slow (which, by any measure, a million cycles to add two integers is), I don't see how this could be considered a tractable solution.

Ericson2314•1mo ago

This blog post is out of its depth

- Lean will optimize peano arithmetic with binary bignums underneath the hood

- Property based checking and proof search already exist on a continuum, because counterexamples are a valid (dis)proof technique. This should surprise no writer of tactics.

- the lack of formal specs for existing software should become less a problem for greenfield software after these techniques go mainstream. People will be incentivized to actually figure out what they want, and successfully doing so vastly improves project management.

Finally, and most importantly, people thinking that there is a "big specification" and then "big implementation" are totally missing the mark. Remember tools like lean are just More Types. When we program with types, do we have a single big type and a single untyped term, paired together? Absolutely not.

As always, the key to productive software development is more and more libraries. Fancier types will allow writing more interesting libraries that tackle the "reusable core" of many tasks.

For example, do you want to write a "polymorphic web app" that can be instantiated with a arbitrary SQL Schema? Ideas like that become describable.

jesse__•1mo ago

> the key to productive software development is more and more libraries

You had me until this statement. The idea that "more and more libraries" is going to solve the (rather large) quality problems we have in the software industry is .. misguided.

see:

https://www.folklore.org/Negative_2000_Lines_Of_Code.html

https://caseymuratori.com/blog_0031

teaearlgraycold•1mo ago

Don’t use a library unless you really need it. Someone recently recommended I add Zod to a project where I am only validating two different JSON objects in the entire project. I like Zod, but I already wrote the functions to progressively prove out the type in vanilla JS.

Less is more, including other people’s libraries.

jesse__•1mo ago

Agreed.

swid•1mo ago

Go Proverb:

A little copying is better than a big dependency.

teaearlgraycold•1mo ago

100% agree. This actually makes AI-aided development a big improvement (as long as you’re careful). You can have an LLM write you a little function, or extract the correct one from a big library, and inline it into your module.

Ericson2314•1mo ago

I'm talking great libraries in great languages. Like how the kmettverse revolutionized writing Haskell. Libraries that make you completely reconsider what it is you're trying to do.

Most people use shit libraries in shit languages. NPM slopfests have no bearing on what I'm talking about.

jesse__•1mo ago

> great languages like [...] Haskell

We have very different opinions on what makes a great language

Ericson2314•1mo ago

Then I'm surprised you agreed with any of what I wrote!

jesse__•1mo ago

I mean, I agree with the notion that a great type system is part of a great language, but the type system is only a small part of the overall whole.

Haskell has a great type system, but it fails miserably at many trivial tasks I care about :

1) sum 1 million integers == stackoverflow (big lul)

2) massive memory cliffs for using the wrong type [1], which is not uncommon

3) poor metaprogramming facilities

4) difficult documentation

[1] https://stackoverflow.com/questions/76470000/is-there-a-reas...

dalmo3•1mo ago

> People will be incentivized to actually figure out what they want

That's the AGI I want to see.

xiphias2•1mo ago

This article is underselling of how much was achieved in proof formalizing for math in the last few years and how close it is to being solved.

If we disregard programming and just look at formalizing math (Christian Szegedy has been doing it for a long time now), the length of proofs that are being formalized are exponentially growing and there's a good chance that in 2026 close to 100% of human written big/important proofs will be translated to and verified by Lean.

Just as an example for programming / modelling cache lines and cycle counts: we have quite good models for lots of architectures (even quite good reverse engineered model for NVIDIA GPUs in some papers). The problem is that calculating exact numbers for cache reads / writes is boring with lots of constants in them, and whenever we change the model a little bit the calculations have to be remade.

It's a lot of boring constraints to solve, and the main bottleneck for me when I was trying to do it by hand was that I couldn't just trust the output of LLMs.

GuB-42•1mo ago

"Beware of bugs in the above code; I have only proved it correct, not tried it." - Donald Knuth

Not that relevant in context as the code in question is used to conclude a formal proof, not the other way around. Buy hey, it is a common quote when talking about proving software and someone has to do it...

Context: https://staff.fnwi.uva.nl/p.vanemdeboas/knuthnote.pdf

ozim•1mo ago

I smell vaporware. Formal verification is easy on easy stuff like simple functions - complex functions it might be impossible. Then you most likely will get bunch of snake oil salesmen promising that you can verify full system…

NlightNFotis•1mo ago

What is a complex function to you, and why do you think it’s impossible to verify properties of them?

pron•1mo ago

I find this discourse about AI and formal verification of software very confusing. It's like someone saying, let's assume I can somehow get a crane that would lift that vintage car and place it in my 15th floor apartment living room, but what will I do with my suitcases?

All the problems mentioned in the article are serious. They're also easier than the problem of getting an AI to automatically prove at least hundreds of correctness properties on programs that are hundreds of thousand, if not millions of lines long. Bringing higher mathematics into the discussion is also unhelpful. Proofs of interesting mathematical theorems require ingenuity and creativity that isn't needed in proving software correct, but they also require orders of magnitude fewer lemmas and inference steps. We're talking 100-1000 lines of proof per line of program code.

I don't know when AI will be able to do all that, but I see no reason to believe that a computer that can do that wouldn't also be able to reconcile the formal statements of correctness properties with informal requirements, and even match the requirements themselves to market needs.

pron•1mo ago

What I'm trying to say is that a machine that can reliably write a complex, large piece of software and prove its correctness - somethng that, BTW, humans are not currently capable of doing - is also likely a machine that can do that from the prompt: Write a piece of software that will be popular among women aged 35-65, let alone "write a spreadsheet that's as powerful as excel but easier to use". Of course, once that happens, the market value of any such software will drop to zero, because anyone could give such a prompt. In fact, there would be no need for software as we know it because the AI could just do what the software is supposed to do (although perhaps it would choose to create an executable as an implementation detail).

What I see is people spending a lot of time imagining how we would work with an AI that could solve some huge problems and at the same time fail to solve easier problems. I don't understand the point of the exercise.

odie5533•1mo ago

I wonder if Design by Contract or schema-first design might take off as a way of structuring AI output and allowing it to rapidly iterate toward goals. I'm starting to try these methods out for myself with AI to see where they lead. Looking into https://deal.readthedocs.io/

senderista•1mo ago

I think the article still sells formal specification a bit short in a few areas:

- A formal spec can be used to derive randomly generated tests attempting to find counterexamples to spec invariants (which the article hinted at but didn't describe explicitly)

- A formal spec can be used as input to a model checker, which will try to find counterexamples to spec invariants via bounded exploration of the model's state space

- A formal spec can be used to find spec invariant violations by analyzing traces from production systems

What all these examples have in common is that they attempt to falsify, not verify, that the spec accurately describes the desired design properties or the actual implementation. That tends to be much more feasible than an actual proof.

drejt•1mo ago

Really a great read

Show HN: CryptoClaw – open-source AI agent with built-in wallet and DeFi skills

ShowHN: Make OpenClaw Respond in Scarlett Johansson’s AI Voice from the Film Her

CReact Version 0.3.0 Released

Show HN: CReact – AI Powered AWS Website Generator

The rocky 1960s origins of online dating (2025)

Show HN: Agent-fetch – Sandboxed HTTP client with SSRF protection for AI agents

Why there is no official statement from Substack about the data leak

Effects of Zepbound on Stool Quality

Show HN: Seedance 2.0 – The Most Powerful AI Video Generator

Ask HN: Do we need "metadata in source code" syntax that LLMs will never delete?

Pentagon cutting ties w/ "woke" Harvard, ending military training & fellowships

Can Quantum-Mechanical Description of Physical Reality Be Considered Complete? [pdf]

Kessler Syndrome Has Started [video]

Complex Heterodynes Explained

EVs Are a Failed Experiment

MemAlign: Building Better LLM Judges from Human Feedback with Scalable Memory

CCC (Claude's C Compiler) on Compiler Explorer

Homeland Security Spying on Reddit Users

Actors with Tokio (2021)

Can graph neural networks for biology realistically run on edge devices?

Deeper into the shareing of one air conditioner for 2 rooms

Weatherman introduces fruit-based authentication system to combat deep fakes

Why Embedded Models Must Hallucinate: A Boundary Theory (RCC)

A Curated List of ML System Design Case Studies

Pony Alpha: New free 200K context model for coding, reasoning and roleplay

Show HN: Tunbot – Discord bot for temporary Cloudflare tunnels behind CGNAT

Open Problems in Mechanistic Interpretability

Bye Bye Humanity: The Potential AMOC Collapse

Dexter: Claude-Code-Style Agent for Financial Statements and Valuation

Digital Iris [video]

Show HN: CryptoClaw – open-source AI agent with built-in wallet and DeFi skills

ShowHN: Make OpenClaw Respond in Scarlett Johansson’s AI Voice from the Film Her

CReact Version 0.3.0 Released

Show HN: CReact – AI Powered AWS Website Generator

The rocky 1960s origins of online dating (2025)

Show HN: Agent-fetch – Sandboxed HTTP client with SSRF protection for AI agents

Why there is no official statement from Substack about the data leak

Effects of Zepbound on Stool Quality

Show HN: Seedance 2.0 – The Most Powerful AI Video Generator

Ask HN: Do we need "metadata in source code" syntax that LLMs will never delete?

Pentagon cutting ties w/ "woke" Harvard, ending military training & fellowships

Can Quantum-Mechanical Description of Physical Reality Be Considered Complete? [pdf]

Kessler Syndrome Has Started [video]

Complex Heterodynes Explained

EVs Are a Failed Experiment

MemAlign: Building Better LLM Judges from Human Feedback with Scalable Memory

CCC (Claude's C Compiler) on Compiler Explorer

Homeland Security Spying on Reddit Users

Actors with Tokio (2021)

Can graph neural networks for biology realistically run on edge devices?

Deeper into the shareing of one air conditioner for 2 rooms

Weatherman introduces fruit-based authentication system to combat deep fakes

Why Embedded Models Must Hallucinate: A Boundary Theory (RCC)

A Curated List of ML System Design Case Studies

Pony Alpha: New free 200K context model for coding, reasoning and roleplay

Show HN: Tunbot – Discord bot for temporary Cloudflare tunnels behind CGNAT

Open Problems in Mechanistic Interpretability

Bye Bye Humanity: The Potential AMOC Collapse

Dexter: Claude-Code-Style Agent for Financial Statements and Valuation

Digital Iris [video]

Test, don't just verify

Comments