Differential testing of compilers via random generation of correct programs goes back well before this paper. For C, it's made harder by the need to avoid undefined behavior in the generated code (which C is loaded with), so much of the work here was figuring out how to generate sufficiently diverse programs that still have fully defined behavior.
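The driver loop itself is simple enough to sketch in a comment. This is only a sketch: the compiler choices, flags, and csmith include path are my assumptions, not the paper's exact setup, and a real harness also handles timeouts and crashes.

    #!/usr/bin/env python3
    # Minimal differential-testing loop: generate a random program with csmith,
    # build it with two compilers, run both binaries, and flag any divergence.
    # Illustrative sketch: compilers, flags, and the csmith include path are
    # assumptions; a real harness also handles timeouts, crashes, reporting.
    import subprocess

    COMPILERS = [["gcc", "-O2"], ["clang", "-O2"]]   # assumed to be on PATH
    CSMITH_INC = "-I/usr/include/csmith"             # adjust for your install

    def run(cmd):
        return subprocess.run(cmd, capture_output=True, text=True, timeout=60)

    for i in range(1000):
        src = f"test{i}.c"
        with open(src, "w") as f:
            f.write(run(["csmith"]).stdout)          # csmith emits a closed, UB-free program

        outputs = []
        for n, cc in enumerate(COMPILERS):
            exe = f"./test{i}_{n}"
            if run(cc + [CSMITH_INC, src, "-o", exe]).returncode != 0:
                break                                # a compile error is itself worth reporting
            outputs.append(run([exe]).stdout)        # the program prints a checksum of its globals

        if len(outputs) == len(COMPILERS) and len(set(outputs)) > 1:
            print(f"possible miscompilation: {src} -> {outputs}")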
A follow-on paper (from 2012) described creduce, the failing-test-case reducer that can be used with csmith. It's also useful for other purposes, and for languages other than C. Reducing the inputs requires preserving the condition that they are still valid C without undefined behavior.
https://users.cs.utah.edu/~regehr/papers/pldi12-preprint.pdf
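In practice you hand creduce an "interestingness" script that exits 0 as long as the bug (and only the bug) is still present, and preserving "no undefined behavior" usually means running sanitizers or some checking tool inside that script. A rough sketch, where the specific checks (UBSan at -O0, gcc vs clang at -O2) and file names are my assumptions rather than anything from the paper:

    #!/usr/bin/env python3
    # Interestingness test for creduce: exit 0 iff the shrunken test.c still
    # triggers the compiler disagreement AND still looks free of undefined
    # behavior.  The specific checks here are assumptions for the sketch,
    # not the paper's exact recipe.
    import subprocess, sys

    def run(cmd):
        try:
            return subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        except subprocess.TimeoutExpired:
            sys.exit(1)                              # a hang means "not interesting" here

    def compiles(cc_cmd, exe):
        return run(cc_cmd + ["test.c", "-o", exe]).returncode == 0

    # Screen out candidates with detectable undefined behavior.
    if not compiles(["clang", "-O0", "-fsanitize=undefined",
                     "-fno-sanitize-recover=all"], "ub_check"):
        sys.exit(1)
    if run(["./ub_check"]).returncode != 0:
        sys.exit(1)

    # Interesting only if the two compilers still disagree on the program's output.
    if not (compiles(["gcc", "-O2"], "a") and compiles(["clang", "-O2"], "b")):
        sys.exit(1)
    sys.exit(0 if run(["./a"]).stdout != run(["./b"]).stdout else 1)

Then running creduce ./interesting.py test.c repeatedly mutates test.c and keeps the smallest version for which the script still exits 0.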
Back before this was realized, compilers were a buggy mess and people just lived with it. Now they're still buggy, but less so. Random testing with billions of inputs is practical now and tends to rapidly "mine out" significant chunks of the potential bug space. One could view this as an example of the Bitter Lesson, where dumb search coupled with massive amounts of computing power solves a problem that had seemed to require more knowledge.
Anyone making a production-quality compiler these days needs to be using high-volume random testing.
How is this done these days, do people just set up CI workflows for this or have dedicated machines running tests 24/7?