Type-constrained code generation with language models

257•tough•1y ago

Comments

jiggawatts•1y ago

This was an obvious next step. Most current products can only restrict the token prediction to valid JSON or a specific JSON schema at best. There's no reason that this should be the only grammar available for constrained output mode.

The real challenge will be to make this detect and switch languages automatically. For example, a snippet of code could include a LaTeX formula in a comment and SQL in a string literal. There are many more examples, such as regex inside a shell script, and so on.

The obvious next step after that is back-tracking. It's possible to emit a token that is valid, but then allows no further completions that are valid. In other words, the model can paint itself into a corner. To my knowledge, no current online LLM service uses any kind of backtracking, they run in append ("forwards") mode only.
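
A minimal TypeScript sketch of the "paint itself into a corner" problem and of backtracking. Everything here is a hypothetical toy: the "model" is a fixed preference order over two tokens, and the constraint is "balanced parentheses of exactly `maxLen` tokens". It is not how any real LLM service decodes.

```typescript
// Toy setup (all names hypothetical): the "model" always prefers "(".
type Token = "(" | ")";
const ranked: Token[] = ["(", ")"];

// Local check: the prefix never closes more parentheses than it opened.
function locallyValid(prefix: Token[]): boolean {
  let depth = 0;
  for (const t of prefix) {
    depth += t === "(" ? 1 : -1;
    if (depth < 0) return false;
  }
  return true;
}

// Final check: exactly maxLen tokens and fully balanced.
function isComplete(prefix: Token[], maxLen: number): boolean {
  const depth = prefix.reduce((d, t) => d + (t === "(" ? 1 : -1), 0);
  return prefix.length === maxLen && depth === 0;
}

// Append-only decoding: take the preferred locally valid token, never undo.
function greedy(maxLen: number): Token[] | null {
  const out: Token[] = [];
  while (out.length < maxLen) {
    const next = ranked.find((t) => locallyValid([...out, t]));
    if (next === undefined) return null;
    out.push(next);
  }
  return isComplete(out, maxLen) ? out : null; // null means: cornered
}

// Backtracking decoding: depth-first search in preference order,
// undoing choices that lead to a dead end.
function backtrack(maxLen: number, prefix: Token[] = []): Token[] | null {
  if (prefix.length === maxLen) return isComplete(prefix, maxLen) ? prefix : null;
  for (const t of ranked) {
    const candidate = [...prefix, t];
    if (!locallyValid(candidate)) continue;
    const result = backtrack(maxLen, candidate);
    if (result !== null) return result;
  }
  return null;
}
```

With `maxLen = 4`, `greedy` emits four opening parentheses, each locally valid, and ends with no valid completion, while `backtrack` undoes the later choices and returns `(())`.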

helltone•1y ago

The backtracking idea is interesting; could diffusion maybe help? At some point it turns into SAT solving.

grafmax•1y ago

SAT solving, I guess, because types encode proofs?

foota•1y ago

I believe Microsoft introduced a framework that did this sort of backtracking that you're suggesting. I'm not sure how much traction it got.

tough•1y ago

SRLCG: Self-Rectified Large-Scale Code Generation with Multidimensional Chain-of-Thought and Dynamic Backtracking

https://arxiv.org/abs/2504.00532

IterGen: Iterative Semantic-aware Structured LLM Generation with Backtracking

https://arxiv.org/abs/2410.07295

ROCODE: Integrating Backtracking Mechanism and Program Analysis in Large Language Models for Code Generation

https://arxiv.org/abs/2411.07112v1

pizza•1y ago

Another one: SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking https://arxiv.org/abs/2306.05426

There was also an hn thread: https://news.ycombinator.com/item?id=36425375

nielstron•1y ago

re detecting and switching language: you could run several constraint systems in parallel and switch as soon as one of them rejects the input and another accepts it
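
A rough sketch of the "several constraint systems in parallel" idea, assuming each system can answer whether a prefix is still extendable to something valid. The three regex-based checkers are hypothetical stand-ins for real incremental parsers or type checkers.

```typescript
// Each checker answers: "can this prefix still be extended to a valid string?"
// The checkers below are toy placeholders, not real language front ends.
type PrefixChecker = { name: string; accepts: (prefix: string) => boolean };

const checkers: PrefixChecker[] = [
  { name: "decimal", accepts: (p) => /^[0-9]*$/.test(p) },
  { name: "hex", accepts: (p) => /^[0-9a-f]*$/.test(p) },
  { name: "identifier", accepts: (p) => /^([a-z_][a-z0-9_]*)?$/.test(p) },
];

// Feed the input one character at a time and record which checkers
// are still alive after each step. A checker that rejects is dropped for good.
function track(input: string): string[][] {
  let active = checkers;
  const history: string[][] = [];
  for (let i = 1; i <= input.length; i++) {
    const prefix = input.slice(0, i);
    active = active.filter((c) => c.accepts(prefix));
    history.push(active.map((c) => c.name));
  }
  return history;
}
```

For the input `12a`, the identifier checker drops out at the first character and the decimal checker at the third, leaving only the hex checker; that is the point at which generation would "switch" to the surviving system.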

re backtracking: a core part of this paper is ensuring a prefix property. that is there is always a legitimate completion and the model can not "corner" itself!

research needs to be done for what kind of languages and language features this prefix property can be ensured
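
To make the prefix property concrete, here is a toy sketch: a checker that only accepts a prefix when some valid completion still exists, so append-only decoding can never get stuck. The language (balanced parentheses of exactly `maxLen` tokens) and all names are assumptions for illustration; the paper's actual checker works on typed programs and is far more involved.

```typescript
type Paren = "(" | ")";

// True only if `prefix` can still be extended to a balanced string
// of exactly maxLen tokens.
function completable(prefix: Paren[], maxLen: number): boolean {
  let depth = 0;
  for (const t of prefix) {
    depth += t === "(" ? 1 : -1;
    if (depth < 0) return false;
  }
  const remaining = maxLen - prefix.length;
  // Every open parenthesis needs one of the remaining slots,
  // and the leftover slots must come in pairs.
  return remaining >= depth && (remaining - depth) % 2 === 0;
}

// Append-only decoding: pick the most preferred token that keeps the
// prefix completable. No backtracking is needed as long as the empty
// prefix is completable (here: maxLen is even).
function decodeForwardOnly(preference: Paren[], maxLen: number): Paren[] {
  const out: Paren[] = [];
  while (out.length < maxLen) {
    const next = preference.find((t) => completable([...out, t], maxLen));
    if (next === undefined) throw new Error("no completable continuation");
    out.push(next);
  }
  return out;
}
```

Even with a "model" that always prefers `(`, `decodeForwardOnly(["(", ")"], 4)` ends at `(())` because the third `(` is rejected up front instead of being discovered as a dead end later.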

ArcaneMoose•1y ago

I think TypeScript is uniquely positioned to be the optimal language for LLMs. Tons of training data (benefiting from all the JS examples as well) plus the structure of types for LLMs to follow and tools to enforce.

yoyohello13•1y ago

God help us…

marviel•1y ago

what do you dislike about it?

treyd•1y ago

TypeScript is arguably one of the weaker statically typed languages, with how it allows `any` to quietly violate the type checked assumptions. It makes it harder to do a lot of the basic typing mistakes in JS, but it doesn't prevent them by any means, especially if you have to interface with (typeless) JS code.

So for these reasons alone I would be against using TS as a lingua franca for LLM codegen (as is GP I assume). As another commenter mentioned, LLMs have a tendency to throw their hands^Hlogits up when presented with complex TS type errors and just resort to using `any` to get it to compile (and probably hiding bugs).

And that doesn't even touch the issues with the JS/TS ecosystem and runtimes more broadly.
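
A small sketch of the `any` problem described above. The `User` interface and the misspelled key are made up for illustration; the point is only that the call type-checks and still fails at runtime, while `unknown` plus a guard forces the check.

```typescript
interface User {
  name: string;
}

function shout(user: User): string {
  return user.name.toUpperCase();
}

// `any` is assignable to `User`, so this compiles even in strict mode.
function tryShout(value: any): string {
  try {
    return shout(value);
  } catch {
    return "runtime TypeError";
  }
}

// Data as it might arrive from untyped JS or JSON.parse, with a typo'd key.
const fromUntypedJs: any = { nmae: "Ada" };

// With `unknown`, the compiler insists on a runtime check before use.
function isUser(value: unknown): value is User {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { name?: unknown }).name === "string"
  );
}
```

Here `tryShout(fromUntypedJs)` hits the `catch` branch because `user.name` is `undefined`, whereas `isUser(fromUntypedJs)` reports the mismatch before any `User` code runs.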

agos•1y ago

tsc can be configured to avoid implicit use of any ("noImplicitAny": true) and ESLint can be set up to avoid explicit use of any. Typeless JS code is also a thing of the past.

But the devil is in the details - some libraries are typed quite crappily, some have unnecessarily complex types, and the code that the LLMs were trained on is probably not the best in the world

tpm•1y ago

Can be configured, but then you get to work at a real codebase halfheartedly half-converted from javascript with half the files beginning with ts-ignore.

However crappy your Java codebase is going to be, it will still use types. And as just today Gemini hallucinated an API call that never existed (in a widely available and used library even), it's just better to have the ability to check that right away.

agos•1y ago

the codebases I've worked at in the last ten years are not as half arsed as that - and of course from my point of view are "real" enough.

If a codebase is so unkempt the issue is not TypeScript - and forgive me for writing such a platitude, but you can write awful code in Java, too.

tpm•1y ago

Yes but it will be typed awful code and the typing provides a grounding of sorts. However awful code in untyped/dynamically typed langs can be unspeakably bad. I have many years of Perl experience...

OutOfHere•1y ago

There are languages that constrain types a lot more tightly than TypeScript, e.g. Kotlin, Rust, and Haskell. The more constrained the types, the more correct the program could be.

mindwok•1y ago

Yep, and Rust famously goes beyond this by modelling memory ownership at compile time.

In fact, the more behaviour we can model at compile time the better when it comes to LLMs - there's some cool ideas here like transpiling Rust into languages for formal verification. See https://github.com/formal-land/coq-of-rust as an example.

Formal verification was one of those things that was previously so annoying to do that it rarely made it past academic use cases or extremely important libraries, but I think LLMs take the tedium out of it. Perhaps formal verification will have a "test driven development" type of moment in the sun thanks to this.

koakuma-chan•1y ago

Can LLMs properly code in Rust yet? There is way more TypeScript code out there compared to Rust, and I doubt structured output can alleviate this.

steveklabnik•1y ago

They can, yes.

vessenes•1y ago

my experience - yes! but. It's more of an edit - compile - fix loop than a write (seemingly) correct code on the first try. This might be a feature.

There is a little occasional difficulty on syntax with rust, but there are often the same sort of logic errors / getting lost an llm would have on another codebase -- the compiler helps catch many of these.

NitpickLawyer•1y ago

> This might be a feature.

I think so as well. The rust errors are some of the most "helpful" and easy to understand (once you grok the core concepts around rust) and it seems that the loop of - generate (maybe constrained) - check - fix benefits from this. In my testing it is better than python (long ass traces that you have to manually trim for the LLM).

OutOfHere•1y ago

Are you saying that the non-looping generators are grossly insufficient for Rust? This matters because the non-looping generators have a fixed monthly subscription cost, whereas the looping ones could cost per call.

vessenes•1y ago

I’m not sure what you mean here. All of my coding is done in a loop; something writes, something compiles, something fixes, once compiling, something tests and writes more, repeat until you need bed.

If you mean “can GitHub Copilot author long Rust code that is syntactically correct, type-correct, and memory-safe in one shot?” then the answer is “not right now”.

oivey•1y ago

The program won’t be “more” correct. What would that even mean? Writing correct programs might be easier (or not) with more “constrained” (ill defined) typing.

kreetx•1y ago

With Haskell, you can be more precise in expressing what you want.

oivey•1y ago

That wouldn’t make a program written in it “more” correct.

OutOfHere•1y ago

It is more correct in a statistical sense over many programs.

Think back to Javascript and untyped Python (without type annotations). It is a lot easier to have bugs in these languages without types. Types help eliminate classes of bugs.

kreetx•1y ago

We might be coming from different backgrounds and hence not immediately understand each other, but, at least in Haskell if the types are precise in the sense that you can't construct invalid values, then using such types will make your programs more correct.

For example, if you want a program from a number to HTML, then if the HTML type is a syntax tree type of all valid HTML rather than wrapper around string, then filtering LLM output by that type will make it more correct than a string wrapper kind of type (as with the latter, any program generated by the LLM which returns a string and wraps it into HTML will do).

The actual use cases might not go as extreme as the above, but the idea is that the "tighter" your type is, the better it is on pruning LLM outputs from invalid programs.
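
The same contrast can be sketched in TypeScript rather than Haskell. The `Html` type below is a tiny hypothetical subset, only meant to show why a tree type prunes more candidate programs than a string wrapper does.

```typescript
// Loose type: any string can be wrapped, including malformed markup.
type HtmlString = { html: string };

// Tight type: only well-formed trees can be constructed.
type Html =
  | { tag: "text"; value: string }
  | { tag: "p" | "strong" | "em"; children: Html[] };

function escapeText(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Rendering a tree always yields balanced tags.
function render(node: Html): string {
  if (node.tag === "text") return escapeText(node.value);
  return `<${node.tag}>${node.children.map(render).join("")}</${node.tag}>`;
}

// "A program from a number to HTML" against each type:
// this one type-checks even though <strong> is never closed.
const loose = (n: number): HtmlString => ({ html: `<p><strong>${n}</p>` });

// This one cannot express an unclosed tag at all.
const tight = (n: number): Html => ({
  tag: "p",
  children: [{ tag: "strong", children: [{ tag: "text", value: String(n) }] }],
});
```

A type filter over generated programs would accept both `loose` and `tight` for their respective signatures, but only the second signature rules out the malformed output.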

oivey•1y ago

Suppose you write a program that computes the sum of two input numbers. Suppose the two inputs are 1 and “a”. If it returns “1a” or 12 or whatever, the program is incorrect. Its correctness does not hinge on type safety. An untyped program could detect that one of the inputs is unexpected and correctly raise an error. Typing may make it easier to detect this error (or not). Fundamentally, adding type information does not make the above program “more” correct. It’s either correct, or it’s not.

You can write a sorting algorithm in assembly, and it can be correct. Rewriting in Haskell won’t make it “more” correct.

There’s an undercurrent of people espousing strictly typed languages (not accusing you) who believe that somehow programs written in them are better. They’re not. They either serve their purpose, or they don’t. Strict typing is a tool. Sometimes it’s enabling. Sometimes (example: horrible polymorphism in most strictly typed languages like C++/Java/copy cats) it’s a hindrance. Strictly typed languages aren’t strictly better than non-strictly typed ones.
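
For reference, the sum example from this comment might look like the sketch below. `unknown` and `any` are used to model "no static information about the inputs"; the function names are made up.

```typescript
// No checks at all: JavaScript's `+` silently concatenates.
const naiveSum = (a: any, b: any) => a + b;

// No static types on the inputs either, but the unexpected input is
// detected at runtime and reported as an error.
function checkedSum(a: unknown, b: unknown): number {
  if (typeof a !== "number" || typeof b !== "number") {
    throw new TypeError("sum expects two numbers");
  }
  return a + b;
}
```

`naiveSum(1, "a")` evaluates to the string `"1a"`, which is the incorrect program from the comment, while `checkedSum(1, "a")` throws. Both versions are equally unconstrained at compile time; only one of them is correct.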

IsTom•1y ago

I wonder if at some point an LLM would "give up" when given difficult-to-satisfy types and insert nonterminating code / bottoms instead.

pram•1y ago

LLMs work well with any static analysis tool. I frequently instruct Claude to use stuff like “go vet” and “deadcode” when it goes on a tear and writes a bunch of broken trash and declares mission accomplished.

koakuma-chan•1y ago

> LLMs work well with any static analysis tool.

tsc error messages are so bad that every time my LLM sees one of those "SomeType is not assignable to SomeLongAssTypeDontEvenTryToUnderstandWhatsGoingOnHere<<<<>>>>>>>>>>>>>>>>>>>>" it just gives up and casts to any. goes for python too.

floydnoel•1y ago

ha, that's always been my biggest gripe with ts

AaronAPU•1y ago

I can’t be the only one who hopes this was a joke.

AnthonBerg•1y ago

I believe that the rutabaga is the perfect material to make sausages out of as it has proven as excellent swine fodder with widespread adoption!

(Please forgive me the extreme disrespect put forth in the above statement! It is not the intention to show disrespect; I… am quite the rutabaga enjoyer in all respects, you know? I certainly include myself within the absurdity and it is with love.)

babyent•1y ago

It’s better sure but as a power TS user it still sucks at generating better code, and consistently fucks up with generics (or doesn’t use them) or simple types sometimes.

primitivesuave•1y ago

Completely agree. Even with the basic LLMs in the $20/month Cursor plan, I can work 10x faster on TypeScript codebases than I could otherwise, while for Python that multiple feels more like 2-3x. The autocompletions are especially impressive when there is a well-organized type system.

Also in response to adjacent commenters - many mission-critical TS codebases will disable the use of an explicit "any" with eslint - https://typescript-eslint.io/rules/no-explicit-any/.

johnmw•1y ago

Those who agree might be interested in "Introducing TypeChat" by Anders Hejlsberg + others (2023) [1]

[1]: https://microsoft.github.io/TypeChat/blog/introducing-typech...

dcsan•1y ago

Wish this project had more traction. Typechat with type checking could generate lots of synthetic data for model training too

miki123211•1y ago

And unlike many other languages, Typescript types are extremely expressive.

For example, you can write a function that takes an object received from an API that uses snake_cased keys, and returns that same object, but with camelCased keys instead. This is not some "special case" in the Typescript compiler, the ability to do this emerges naturally from Typescript's features. I don't know any other language that can do this.

Most people don't know enough TS to use these things effectively, but I think one could train an LLM to be very good at them. The combination of LLMs placing such advanced constraints on themselves, and then generating code based on those constraints, seems extremely powerful.
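
One way the snake_case to camelCase example can be written, using template literal types and key remapping (both available since TypeScript 4.1). The helper names are made up, and this sketch only converts top-level keys.

```typescript
// Type-level conversion: "created_at_utc" becomes "createdAtUtc".
type CamelCase<S extends string> = S extends `${infer Head}_${infer Tail}`
  ? `${Head}${Capitalize<CamelCase<Tail>>}`
  : S;

// Remap every string key of T through CamelCase, keeping value types.
type CamelKeys<T> = {
  [K in keyof T as K extends string ? CamelCase<K> : K]: T[K];
};

// Runtime counterpart of CamelCase for a single key.
function toCamel(s: string): string {
  return s.replace(/_([a-z])/g, (_match, c: string) => c.toUpperCase());
}

// Shallow conversion only; nested objects are left untouched in this sketch.
function camelizeKeys<T extends Record<string, unknown>>(obj: T): CamelKeys<T> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(obj)) {
    out[toCamel(key)] = value;
  }
  return out as unknown as CamelKeys<T>;
}

const apiUser = { user_id: 7, first_name: "Ada", created_at_utc: "2024-01-01" };
const camelUser = camelizeKeys(apiUser);

// The compiler knows these keys and their types; `camelUser.user_id` is an error.
const userId: number = camelUser.userId;
const createdAt: string = camelUser.createdAtUtc;
```

The cast inside `camelizeKeys` is the one place where the compiler has to trust the implementation; callers get fully checked camelCased keys.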

rfoo•1y ago

> Tons of training data (benefiting from all the JS examples as well)

More != better.

threeseed•1y ago

Scala would be the best given that its type system is formally modelled:

https://infoscience.epfl.ch/entities/publication/6c6bb09d-a4...

homebrewer•1y ago

Hejlsberg mentioned the ability to quickly provide accurate type information to LLMs as one of the reasons for rewriting tsc into Go:

https://youtu.be/10qowKUW82U?t=3186

tough•1y ago

But isn't TypeScript already a typed language to begin with?

habitue•1y ago

This is about the speed with which the compiler can advise an LLM that a particular thing checks or doesn't check. Typescript is much slower than Go

tough•1y ago

okay so basically the faster compiling means a tighter feedback loop for the LLM to -know- if the code compiles or not etc? interesting

is go faster than rust?

yoyohello13•1y ago

Go’s compiler is WAY faster than Rust’s. As far as speed of the actual program, Rust will generally be faster.

notnullorvoid•1y ago

Go or Rust compiler speeds won't have any effect here. The program in this context is the TypeScript compiler.

koakuma-chan•1y ago

cargo check is WAY faster than go build

Thaxll•1y ago

Working with both I can say that this is a big no, go mod is as fast if not faster, usually Go deps are much faster because Go does not import as many dependencies as Rust.

koakuma-chan•1y ago

In Rust you only need to compile your dependencies once. After that it's just your app because dependencies don't change.

binary132•1y ago

that is also the case in Go…?

koakuma-chan•1y ago

Sure, but the point is: don't be scared of dependencies in Rust.

binary132•1y ago

Well the context was a comparison of pros and cons and you started with “in Rust” so perhaps you can see why it sounded like you were presenting it as a pro.

nulld3v•1y ago

This may be true, but in my experience Rust is still slower to compile because monomorphization must be done for deps every time you compile, even for deps that are already compiled. And monomorphization ends up taking a long time because it is done on every type/function that uses generics, and Rust code tends to use generics very liberally.

notnullorvoid•1y ago

> is go faster than rust?

Depends on how you write the Go or Rust code. The most optimal Rust re-write of the TypeScript compiler would very likely be faster than the most optimal version in Go. However they didn't want to do a re-write, they wanted to port the existing compiler codebase written in TS. Go like TS (ultimately the JS runtime) also has GC which makes a 1-to-1 port much easier.

PartiallyTyped•1y ago

No. Ignore the other comments.

The reason for this decision is that they wanted a near 1:1 port of the typescript code to go, keeping design and structure almost identical.

You can’t do that in rust as easily because of all the cyclical references and indirection involved.

A rust port would be a rewrite. This is merely a migration.

raincole•1y ago

> is go faster than rust

No.

They rewrote in go because go is similar enough to typescript, while being faster than typescript.

Source: https://github.com/microsoft/typescript-go/discussions/411

int_19h•1y ago

Go has a very simple type system that is easy to typecheck on every token.

TypeScript has a type system that is complex enough, you can literally implement wasm inside it (and then use that to run e.g. Doom: https://socket.dev/blog/typescript-types-running-doom)

notnullorvoid•1y ago

The general idea seems very promising, I had been hoping someone would do something like this since seeing JSON schema structured outputs for LLMs.

Need to dig in a bit more on the implementation, but I was surprised that the paper didn't mention hooking into an existing language service/server. There's more than types that an LLM could leverage from existing language tooling. Auto imports is a good example, it is handy for the human developer to keep a linear writing flow, something an LLM needs even more.

nielstron•1y ago

the problem with LSPs is that they don't guarantee generating a type annotation that we can use for constraints, i.e. we can not ensure the prefix property using LSPs. so we had to roll our own :)

Pulling in more features to help the system is definitely worth looking into!

tough•1y ago

The code can be found here: https://github.com/eth-sri/type-constrained-code-generation

compacct27•1y ago

Honestly it's already working great in Cursor. Even adapting one type structure to another is quickly handled.

slt2021•1y ago

we really need LLMs trained on ASTs instead of tokens. is there any research on this?

tough•1y ago

ASTrust: Towards More Trustworthy and Interpretable LLMs for Code through Syntax-Grounded Explanations

https://arxiv.org/abs/2407.08983

AST-T5: Structure-Aware Pretraining for Code Generation and Understanding

https://arxiv.org/abs/2401.03003

CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation

https://arxiv.org/abs/2405.02355

nielstron•1y ago

The downside is that you need to properly preprocess code, have less non-code training data, and can not adapt easily to new programming languages

koakuma-chan•1y ago

The vibe code society would benefit way more if libraries hosted their docs in a way that's easy to copy and paste into an LLM.

tough•1y ago

many docs now include llms.txt https://llmstxt.org/

koakuma-chan•1y ago

I saw that but it doesn't work for me. I use Gemini 2.5 Pro Preview right now, and it cannot fetch content from links. What I am looking for is a large text file with public API class, function, etc. signatures, plain text documentation and code examples.

tough•1y ago

https://ai-sdk.dev/llms.txt

koakuma-chan•1y ago

Depends on the library I guess, I spent ~12 hours today vibe coding with LiveKit and their /llms.txt is https://docs.livekit.io/llms.txt

davidz•1y ago

thanks for the feedback. let us see how we can organize this better for compat with diff LLMs.

tough•1y ago

what i do if no good llms txt is to download the whole docs from gh or website and keep the MD files available to my agent via a small mcp server

vessenes•1y ago

Karpathy nerdbaited me on this last week! I'm almost done with aidocs and aidd.

The aidocs server: keeps track of generated llm-friendly docs for any github repo.

The aidocs daemon (aidd) is resident, and can watch a repo, find imports in a number of languages, request the docs from aidocs, serve them up in mcp, and/or put them into a directory in your repo. Planning on generating docs for a codebase and incremental docs creation later.

I could use a couple beta testers -- lmk if you're interested. macos for now, although the daemon is written in go and should be portable.

koakuma-chan•1y ago

Hey, I'm definitely interested, but I'm not on MacOS. Please let me know though if there is a GitHub repo I can star or perhaps even send a PR (e.g. no reason for it to only work on MacOS, right?).

vessenes•1y ago

What OS do you want? Linux is an obvious next target and “should be” “easy”. :) I’d be happy to have a Linux tester. Windows would be last I think.

koakuma-chan•1y ago

I use WSL so yeah, primarily Linux, but I would also be willing to help with Windows if necessary.

muglug•1y ago

Really cool results!

That this research comes out of universities, and not large AI labs, makes me think those labs believe that larger models are still the way to go.

aibrother•1y ago

+1 this seems like healthy development

nielstron•1y ago

thank you!

bmc7505•1y ago

The correct way to do this is with finite model theory but we're not there yet.

_jayhack_•1y ago

Also worth checking out MultiLSPy, effectively a python wrapper around multiple LSPs: https://github.com/microsoft/multilspy

Used in multiple similar publications, including "Guiding Language Models of Code with Global Context using Monitors" (https://arxiv.org/abs/2306.10763), which uses static analysis beyond the type system to filter out e.g. invalid variable names, invalid control flow etc.

nielstron•1y ago

Yes this work is super cool too! Note that LSPs can not guarantee resolving the necessary types that we use to ensure the prefix property, which we leverage to avoid backtracking and generation loops.

nikolayasdf123•1y ago

nice. the speed of AI development is accelerating so fast

cpfiffer•1y ago

We (.txt, the outlines people) had a brief thread about this paper on twitter if you're interested: https://x.com/dottxtai/status/1922322194379551128

hongbo_zhang•1y ago

We published a similar paper for MoonBit: Explore the Design of an AI-Friendly Programming Language https://conf.researchr.org/details/icse-2024/llm4code-2024-p...

int19h•1y ago

Been using Devin for a few months now, for Typescript and Python.

I've never seen it check in uncompilable code, but watching the Devin console I can see it building and using the code to ensure commits are not complete garbage. When it has checked in compilable and almost right but slightly wrong code, automatically running lint and tests (it doesn't always run them before checking in) from CI triggers it to push a fix on its own.

Feedback loops are nice, but they can be expensive, and time consuming (oh look at me complain that it takes Devin a whopping 15 minutes to complete a task) so I can definitely see the value in type constraints.

android521•1y ago

is Devin worth the money? Would it be a big jump in productivity migrating from cursor to Devin?

int19h•1y ago

it has been worth it for me, ymmv of course.

also they have a pay-as-you-go tier now as well.

I pay the full $500 though. This month I'm going to blow past the base allowance and tap into 'gift credits'

speaking of which if anyone wants a referral code (gift creds for me, and for you) hmu

tough•1y ago

how to hit you up tho

int19h•12mo ago

not sure if this is kosher with HN rules, but anyway... https://app.devin.ai/invite/DWfktbZQoevKNlNj

informal007•1y ago

Would it be better if we moved the feedback loops into the RL stage of LLM training?

Is there any related work?

nielstron•1y ago

we were thinking about doing exactly this, the closest current work is probably the amazing "Learning Formal Mathematics from Intrinsic Motivation" by Poesia et al (they use constraints to increase the likelihood of generating correct theorems/proofs during RL)

https://arxiv.org/abs/2407.00695

informal007•1y ago

Yes, RL works well in fields where answers can be verified to some degree. That's why AlphaGo succeeded; it should also work in code generation and math.

imtringued•1y ago

Your reward function can simply be the distance between the constrained output and the unconstrained output, that way you won't even need synthetic data, just a dataset of prompts to RL against.
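
One possible reading of this, sketched in TypeScript: sample the same prompt once with and once without constraints, and use the negative token-level edit distance as the reward. The choice of Levenshtein distance over tokens is an assumption made here for illustration; the comment does not say which distance is meant.

```typescript
// Levenshtein distance between two token sequences (insert, delete, substitute).
function editDistance(a: string[], b: string[]): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical reward: the closer the unconstrained sample already is to
// the constrained one, the higher (less negative) the reward.
function reward(unconstrained: string[], constrained: string[]): number {
  return -editDistance(unconstrained, constrained);
}

const freeSample = ["let", "x", ":", "number", "=", '"1"', ";"];
const constrainedSample = ["let", "x", ":", "number", "=", "1", ";"];
```

Here the two samples differ in a single token, so `reward(freeSample, constrainedSample)` is `-1`; a model whose unconstrained output already type-checks would get the maximum reward of zero.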

informal007•1y ago

How do you get the "unconstrained output" and evaluate the distance between the two?

An evaluation method that can measure the distance between two sentences is hard to find; the best option is a closed-source LLM API, even though it's not ideal. As a result, we must also use current LLMs to improve our models.

weiwenhao•1y ago

Do LLMs really understand the code at this stage?

Corey_•1y ago

I completely agree that TypeScript is ideal for LLMs. The type system and the extensive training data make it the best choice. But as someone who's been working with TypeScript for a while, I still see LLMs struggling with complex generics or even simple types. It’s better than before, but still far from perfect.

Also, TypeScript error messages can be a pain. When LLMs encounter something like "SomeType is not assignable," instead of handling it properly, they often just cast it to any. This happens way too often.

energy123•1y ago

This is what I'd consider doing if I was a small AI lab. Don't try to build a frontier LLM that beats all benchmarks. Try to make the world's best LLM at one programming language. Create your RL pipeline that puts all your resources into making the LLM the best at that language. Even better if there's a dearth of human-created training data on Github, since all your competitors will be bad at it.

Google somewhat did this with javascript in their latest Gemini-2.5 Pro release. But what about doing it for a smaller language? Google isn't going to do that, but there is still a lot of demand.

Drakim•1y ago

It makes sense to specialize it on one programming language to dedicate all of the LLM's intellectual space to that one domain, but on the flip side I wonder how much the LLM's sharpness and reasoning capabilities are increased by having more data to train on even if it's the wrong programming language.

As a developer I certainly think my programming skills in a specific language were improved by knowing other languages so I can contrast and compare.

tough•1y ago

You could just have specialized fine-tunes for each programming language that are only called when writing code, a more general bigger model could pass the plan/pseudo code to it

eigenspace•1y ago

I'm not saying this is a bad idea, but it does sound like a rather risky prospect. You're basically proposing a bet against the ability of LLMs to generalize across programming languages, and to embed concepts at a deeper level than the syntax.

Many people do think this, but I'm not sure many of them are running AI labs.

tough•1y ago

it feels to me most of the real usage of AI is in coding right now, so a small lab that decided to go all in into just code-gen would have at least the differentiator of a narrower field to beat the bigger incumbents doing it all?

I dunno tho.

Big AI labs also have their own agendas and would rather keep scaling and growing than serving a rather smaller real market ?

Once you're into real usage territory, you can no longer use made-up numbers to justify future growth.

eigenspace•1y ago

Again though, my point was just that it's not actually clear that you can do better than these big models by taking a narrower focus. I'm saying that the things these big LLMs are learning about other languages probably do have utility when applied even to quite niche languages.

If you take some niche language and build an LLM from scratch that's hyperspecialized on that language, will that LLM actually outperform some big LLM that's trained on all the programming resources out there, and all the blogs, forum conversations, stack overflow posts on all those languages, and then learns to generalize that information and apply it to your niche language?

One of the things that LLMs seem to excel at is taking information from one context, transforming it and applying it to another context.

tough•1y ago

So how I envision this would be like a dual system, you let the frontier bigger LLM come up with the overall function signature, structure, and reasoning/planning around the specific code, but then have it ask the hyperspecialized fine-tuned model which can only output valid code, to create it.

You then get the best of both worlds at the expense of a double round trip (2x), which for something like coding seems fine; people are OK paying $200 for ChatGPT Pro

This also would solve the problem of context windows getting full and the model starting to generate nonsense: if you have the bigger model using the bigger context window to orchestrate and organize the task, calling smaller specialized sub-modules, that seems like it should yield better final code outputs than just one big ass LLM

but we're moving the goalposts from 1 model to a multi-agentic system i guess so nevermind

and i agree it seems all the big corps are betting for bigger more data for now

harperlee•1y ago

From my experience around less-used languages (with clojure on one hand and code aster's python on the other), LLMs may be able to generalize syntax but availability of APIs, functions, etc. is something that you can't solve by generalizing. Or more precisely, you can generalize but that means hallucinating non existing tools.

unshavedyak•1y ago

Would non-generalizing solve this issue for libraries though? Ie a lot of models produce reasonable code for me, but i almost always care about usage of libraries. That's where they get the wrong version, or hallucinate, or etc for me.

JohnMakin•1y ago

General purpose LLMs fail really hard at this in domains like terraform. There may be drastic differences in syntax and semantics between the massive matrix of terraform version + provider version(s) and they've shown to be absolutely terrible at navigating that, even if you specify versions specifically. Even worse, and probably what exacerbates it, this version matrix changes at a much faster pace than most programming languages typically introduce large changes.

_w1tm•1y ago

> There may be drastic differences in syntax and semantics between the massive matrix of terraform version + provider version(s) and they've shown to be absolutely terrible at navigating that, even if you specify versions specifically.

To be fair, humans have trouble with that as well.

JohnMakin•1y ago

That is always the counterargument, but the promise of these tools is not to be as bad as humans are. Humans can also work their way through it. With terraform specifically, these tools get into context ruts that they can never dig out of; they're pretty much worthless for it, at least as many times as I've tried. You will waste far more time trying to figure out where it messed up on something dead simple than if you just looked over the shoulder of a colleague, who probably wouldn't make the same kinds of basic mistakes, like making up fields.

robrenaud•1y ago

Meta synthetically generated lots of PHP from Python for Llama 3 for training purposes. Meta writes a crazy amount of PHP internally. Translation tends to be way easier than unconstrained generation for LLMs. But if you can translate and filter a large amount of code, you can learn to generate. If you also translate and run the unittests, you get another layer of error checking.

https://arxiv.org/abs/2407.21783

See figure 8.

nurettin•1y ago

Using the language itself isn't the challenge for LLMs, they do that with a very high success rate. I haven't seen an LLM make syntax errors for several months. Calling the right functions with correct parameters is the challenge your hypothetical AI lab will have to solve (or half ass it and show great benchmark results).

kreetx•1y ago

They should extend this to Haskell and make use of the Curry-Howard isomorphism: define the program you want by a type signature and have the LLM find the implementation.
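
A small illustration of why a type signature can act as a specification, written in TypeScript instead of Haskell. It assumes the implementation stays total and avoids escape hatches such as `any`, casts, or `throw`; TypeScript does not enforce that, so this is only an analogy.

```typescript
// Fully generic signatures leave essentially one reasonable implementation:
// the only way to obtain a C is to run g on the B that f produces.
function compose<A, B, C>(f: (a: A) => B, g: (b: B) => C): (a: A) => C {
  return (a) => g(f(a));
}

// Likewise, the only values of type B and A available are the pair's parts.
function swap<A, B>(pair: [A, B]): [B, A] {
  return [pair[1], pair[0]];
}

// With concrete types the signature constrains far less: both functions
// below have the type (n: number) => number.
const double = (n: number): number => n * 2;
const zero = (_n: number): number => 0;
```

This is also why the search works best for polymorphic signatures: for `compose` and `swap` the type nearly determines the code, while `(n: number) => number` admits endlessly many programs.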

chriswarbo•1y ago

The classic approach to this is Djinn https://hackage.haskell.org/package/djinn It's not very good with concrete types (Text, Int, MyCustomThing, etc.) but is good for polymorphic functions whose parts only fit together in a few ways.

More recent work is better at using concrete types, and choosing functions from Hackage, like Hoogle+ https://github.com/TyGuS/hoogle_plus and Hectare https://dl.acm.org/doi/10.1145/3547622

There's also "inductive programming" (producing a function from input/output examples), with Haskell implementations like Magic Haskeller http://nautilus.cs.miyazaki-u.ac.jp/~skata/MagicHaskeller.ht...

seeknotfind•1y ago

This is anticipated from work on constrained output from LLMs, and it's good to see it being developed. One nitpick though: this paper mentions the complexities of implementing type checking for program prefixes in languages that are not context free. It's true this is extremely difficult for languages which are context sensitive, especially because types may be defined after they are used. However, it does not mention that it is impossible to implement such a checker for languages with Turing-complete type systems, such as C++. I would never miss such an opportunity to criticize C++ and highlight the need for better language design. I love you C++.

nielstron•1y ago

noted. we'll make sure to criticize turing complete type systems more thoroughly next time :))

_alternator_•1y ago

Until a couple weeks ago, I considered this a promising approach. What changed? Agents, and Claude Code in particular.

My prior experience was that LLMs were not much better than reading the docs, and certainly you wouldn’t get far vibe-coding in Rust. But Claude Code behaves like I would, writing code that doesn't compile (typical LLM behavior), then reading the errors, correcting the code, and iterating until it compiles.

Its first attempt at a graph-based scheduler in Rust took about $3 and 10 minutes to work correctly. It was ~500 loc, so definitely faster than what I can write in Rust. (To be fair, I spent a decent amount of time drafting a description of what I wanted in a markdown file to get Claude started.)

gdiamos•1y ago

If you have two methods that both improve accuracy, why not stack them?

_alternator_•1y ago

This reasoning is at odds with recent history. A lot of effort was put into making machines that could only output grammatically correct sentences. Yet LLMs output grammatically correct sentences when needed, and a whole lot more as context demands, without ever hardcoding it. It would be foolish to restrict LLMs to only grammatical sentences at this point, but it might be a good tool to run on generated docs because sometimes words are misspelled or sentences are confusing.

Similarly, we have a tool that makes sure the type and syntax are correct, namely, a compiler. Building an LLM that can only output syntactically correct code is one approach of stacking them, but in fact it will significantly worsen the ability of the LLM to reason and construct code. The winning choice seems to be to emulate the workflow of humans: code, compile, read errors, repeat.
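
The "code, compile, read errors, repeat" workflow can be sketched as a small loop. This is an illustration under stated assumptions, not how any particular agent is implemented: Python's built-in `compile()` stands in for a real compiler (it checks syntax only, not types), and `stub_model` is a hypothetical stand-in for an LLM call.

```python
from typing import Callable, Optional, Tuple

def check(source: str) -> Optional[str]:
    """Return None if `source` compiles, else a one-line error message."""
    try:
        compile(source, "<generated>", "exec")
        return None
    except SyntaxError as err:
        return f"line {err.lineno}: {err.msg}"

def repair_loop(generate: Callable[[str, Optional[str]], str], task: str,
                max_rounds: int) -> Tuple[Optional[str], int]:
    """Ask for code, feed compiler errors back, and stop once it compiles."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        source = generate(task, feedback)
        feedback = check(source)
        if feedback is None:
            return source, round_no
    return None, max_rounds  # gave up after max_rounds attempts

# Hypothetical model: the first attempt omits a colon, the retry fixes it.
def stub_model(task: str, feedback: Optional[str]) -> str:
    if feedback is None:
        return "def double(x)\n    return 2 * x\n"
    return "def double(x):\n    return 2 * x\n"

source, rounds = repair_loop(stub_model, "write double(x)", max_rounds=3)
```

Here the loop succeeds on the second round. Nothing in it rules out also constraining the `generate` step, which is the stacking question raised upthread.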

gitroom•1y ago

[flagged]

LostBenjamin•1y ago

As an author of this paper, I am very excited to see the great discussion here!

Several people mentioned the generation - compilation - fixing loop. Just want to remind you that our approach works not only for the generation step but also for the fixing step. This is because fixing is essentially asking LLMs to generate a new version of the code. The paper actually has a "repair" experiment to demonstrate this, and our approach achieves a significant gain in this experiment, i.e., a 37% relative improvement in functional correctness.

tough•1y ago

Thank you for your research, really impressive work!

yewW0tm8•1y ago

37% gain relative to what? What percent of generated functions were incorrect?

LostBenjamin•1y ago

compared to vanilla LLM decoding.

thesz•1y ago

  > To address this challenge, we introduce a type-constrained decoding approach that leverages type systems to guide code generation.

This should not work with type inference, even at the level of C++ "auto x = ": "auto" does not constrain "x" at all, and what is to the right of the equals sign is not constrained either.

In Haskell, the gap is even wider. A long "where" clause may have dependencies that constrain things in different directions.

But the important thing I see here is the continued reinvention of Cyc, from a different starting point. ;)

Definitely, "every big LLM has in its support code an ad-hoc, bug-ridden, inefficient implementation of half of Cyc." Cyc was written in Lisp, most LLM support code is C/C++, thus it is just a corollary of Greenspun's Tenth Rule.

drumnerd•1y ago

This was obviously coming, and it should be tuned to Haskell and Agda.

