
The AGI Final Frontier: The CLJ-AGI Benchmark

https://raspasov.posthaven.com/the-agi-final-frontier-the-clj-agi-benchmark
19•raspasov•6mo ago

Comments

malux85•6mo ago
Perhaps this is a really great AGI test - not in the sense that the AGI can complete the given task correctly, but whether the AGI can interpret incredibly hand-wavy requirements like “do XXX (as much as possible)” and implement them: A, B, C, etc.
delegate•6mo ago
Doesn't Clojure already support all of those features?

Eg.

> transducer-first design, laziness either eliminated or opt-in

You can write your code using transducers or opt-in for laziness in Clojure now. So it's a matter of choice of tools, rather than a feature of the language.

> protocols everywhere as much as practically possible (performance)

Again, it's a choice made by the programmer, the language already allows you to have protocols everywhere. It's also how Clojure is implemented under the hood.
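For context, a minimal sketch of what "protocols everywhere" looks like in Clojure today (the `Stack` protocol and its method names are illustrative, not from the thread):

```clojure
;; Define an abstraction as a protocol rather than a Java interface.
(defprotocol Stack
  (push-item [s x] "Push x onto the stack.")
  (pop-item  [s]   "Return the stack without its top item."))

;; Extend the protocol to an existing type: a plain persistent vector.
(extend-type clojure.lang.IPersistentVector
  Stack
  (push-item [s x] (conj s x))
  (pop-item  [s]   (pop s)))

(push-item [1 2] 3) ;; => [1 2 3]
(pop-item  [1 2 3]) ;; => [1 2]
```

The open question raised below is not whether this is possible, but what Clojure would look like if its own core abstractions had been built this way from the start.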

> first-class data structures/types are also CRDT data types, where practical (correctness and performance)

Most of the programs I worked on did not require CRDTs. I'm inclined to choose a library for this.

> first-class maps, vectors, arrays, sets, counters, and more

Isn't this already the case? If Clojure's native data structures are not enough, there's an ocean of Java options.

Which leads to a very interesting question:

How should the 'real' AGI respond to your request?

raspasov•6mo ago
> first-class maps, vectors, arrays, sets, counters, and more

That's my mistake; this line was intended to be a sub-bullet point of the previous line regarding CRDTs.

> the language already allows you to have protocols everywhere

The core data structures, for example, are not based on protocols; they are implemented in pure Java. One reason is that the 1.0 version of the language lacked protocols. All that being said, it remains an open question what the full implications of the protocol-first idea are.

> You can write your code using transducers or opt in for laziness in Clojure now. So it's a matter of choice of tools, rather than a feature of the language.

You 100% can. Unfortunately, many people don't. The first thing people learn is (map inc [1 2 3]), which produces a lazy sequence. Clojure would never change this behavior, as the authors value backward compatibility almost above everything else, and rightly so. A transducer-first approach would be a world where (map inc [1 2 3]) produces the vector [2 3 4] by default, for example.
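The contrast can be sketched in a few lines of plain Clojure (nothing here beyond the example data is hypothetical):

```clojure
;; Today's default: map returns a lazy sequence.
(map inc [1 2 3])
;; => (2 3 4), a lazy seq realized on demand

;; Transducer-first, eager style: same logic, no lazy seq in between.
(into [] (map inc) [1 2 3])
;; => [2 3 4]

;; Transducers also compose without intermediate collections:
(into [] (comp (map inc) (filter even?)) [1 2 3 4])
;; => [2 4]
```

A transducer-first language would make the second form the default spelling, which is exactly the backward-compatibility break Clojure won't make.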

This was mentioned by Rich Hickey himself in his "A History of Clojure" paper:

https://clojure.org/about/history https://dl.acm.org/doi/pdf/10.1145/3386321

(from paper) > "Clojure is an exercise in tool building and nothing more. I do wish I had thought of some things in a different order, especially transducers. I also wish I had thought of protocols sooner, so that more of Clojure’s abstractions could have been built atop them rather than Java interfaces."

mdemare•6mo ago
More AGI Final Frontiers:

"Reimplement Sid Meier's Alpha Centauri, but with modern graphics, smart AIs that role-play their personalities, all bugs fixed, a much better endgame, AI-generated unexpected events, and a dev console where you can mod the game via natural language instructions."

"Reimplement all linux command line utilities in Rust, make their names, arguments and options consistent, and fork all software and scripts on the internet to use the new versions."

raspasov•6mo ago
"Reimplement Linux in Rust" would be a good one!
glimshe•6mo ago
Let's say we had a ChatGPT-2000 capable of all of this. What would digital life look like? What would people do with their computers?
Lerc•6mo ago
Even if we were not past a hard takeoff point where AIs could decide for themselves what to work on, the things that would be created in all areas would be incredible.

Consider every time you played a game and thought it would be better if it had x, y, or z. Or you wished an application had this one simple new feature.

All those things would be possible to make. A lot of people will discover why their idea was a bad idea. Some will discover their idea was great, some will erroneously think their bad idea is great.

We will be inundated with the creation of those good and bad ideas. Some people will have ideas on how to manage that flood of new creations and will create tools to help out; some of those tools will be good and some will be bad. There will be a period of churn where finding the good and ignoring the bad is difficult; a badly made curator might make bad ideas linger.

That's just in the domain of games and applications. If AI could manage that level of complexity, you can ask it to develop and test just about any software idea you have.

I barely go a day without thinking of something that I could spend months of development time on.

Some idle thoughts that such a model could develop and test.

Can you make a transformer that, instead of linear-space V modifiers, used geodesics? Is it better? Would it better support scalable V values?

Can you train a model to identify which layer is the likely next layer purely based upon the input given to that layer? If it only occasionally gets it wrong, does the model perform better if you give the input to the layer that the predictor thought was the next layer? Can you induce looping/skipping layers this way?

If you train a model with the layers in a round-robin ordering on every input, do the layers regress to a mean generic layer form, or do they develop into a general information improver that works purely by the context of the input?

What if you did every layer on a round robin twice, so that every layer was guaranteed to be followed by any of the other layers at least once?

Given that you can quadruple the parameters of a model without changing its behaviour using the Wn + Randomn, Wn - Randomn trick, can you distill a model to 0.25x size and then quadruple it, making a model that retains the original size but takes further learning better, broadening parameter use?

Can any of these ideas be combined with the ones above?

Imagine instead of having these idle ideas, you could direct an AI to implement them and report back to you the results.

Even if 99.99% of the ideas are failures, there could be massive advances from the fraction that remains.

SequoiaHope•6mo ago
That’s still just code! How about “design a metal 3D printing machine which can be built for $2000 and can make titanium, steel, aluminum, and copper parts with 100 micron precision, then design a simple factory for that machine. Write the manufacturing programs for all of the CNC machines, and work instructions for every step of the process. Order the material and hire qualified individuals to operate the machines. Identify funding opportunities and raise funds.”

I could go on. One of the challenges here is that many things like this cannot be designed by simply thinking, unless you have extremely superhuman performance, because complex subassemblies have to be built, prototyped, and debugged. And right now there are no good datasets for machine design, PCB design, machine tool programming, hiring, VC fund raising, negotiating building leases, etc.

We will never have real AGI unless it can learn how to improve without extensive datasets.

kloud•6mo ago
My "pelican test" for coding LLMs now is building a proof-of-concept UI (a hello-world app) using Jetpack Compose in Clojure. Since Compose is implemented as Kotlin compiler extensions and does not provide Java APIs, it cannot be used from Clojure via interop.

I outlined a plan letting it analyze Compose code, suggesting it could first reverse engineer the bytecode of a Kotlin demo app and then either emit bytecode from Clojure or implement it in Clojure directly based on the analysis. Claude Code with Sonnet 4 was confident it could implement it directly and failed spectacularly.

Then as a follow-up I had it compile the Kotlin demo app and tried to bundle those classes using Clojure tooling, to at least make sure it got the dependencies right as a starting point. It resorted to cheating by shelling out to gradlew from Clojure :) I am going to wait for the next round of SOTA models to burn some tokens again.

Grimblewald•6mo ago
Mine is seeing if they can implement Brown et al.'s (2007) image stitching algorithm. It's old, plenty of code examples exist in training data, and the math at this stage is quite well developed, but funnily enough, no decent real open source examples of this exist, especially anything that gets close to Microsoft Research's demo tool, the Image Composite Editor (ICE). Even if you heavily constrain the requirements, i.e. planar motion only, only multi-band blending and gain correction, not a single model currently manages to pull this off. Few even have something working at the start.

Many other things they excel at, even looking downright competent, but in all those cases it simply turns out a decent open source example of the implementation exists on GitHub, usually a touch better than the LLM version. I have yet to see an LLM produce good code for something even moderately complex that I couldn't then find a copy of online.
upghost•6mo ago
This is a good one. Forget AGI, I'd settle for an LLM that when doing Clojure doesn't spew hot trash. Balancing parens on tab complete would be a nice start. Or writing sensible ClojureScript that isn't reskinned JavaScript with parens would be pretty stellar.
raspasov•6mo ago
Haha, the higher-end LLMs are not absolutely terrible. In my experience, LLMs in their current form are better at explaining code than creating it. Not perfect by any stretch in either task.

Balancing parens is still a challenge.

Lerc•6mo ago
The notion of when a language is created is open to interpretation.

It is not stated whether you want such a language described, specified, or implemented.

raspasov•6mo ago
I think "created" is generally considered to be implemented :).

I also discuss performance, so I think implementation is definitely strongly implied.

kelseyfrog•6mo ago
I get it now. Benchmarks, in the end, are prompts for AI researchers.

If you want a problem solved, translate it into an AGI benchmark.

With enough patience, it becomes something AI researchers report on, optimize for, and ultimately saturate. Months later, the solution arrives; all you had to do was wait. AI researchers are an informal, lossy form of distributed computation - they mass-produce solutions and tools that, almost inevitably, solve the messy problem you started with.

rs186•6mo ago
What comes to mind is whether AGI can gracefully solve the Go error handling problem, once and for all.

https://go.googlesource.com/proposal/+/master/design/go2draf...