I've been experimenting with autonomous coding agents for ML workflows. ML-Ralph uses claude-code to run a continuous experiment loop - it forms hypotheses, writes training code, evaluates results, and iterates on what it learns. Added Weights & Biases integration for observability on long runs.
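For anyone who wants a concrete picture of the loop, here is a minimal sketch of its shape. It is not ML-Ralph's actual code: `run_loop`, `Experiment`, `LoopState`, and the toy `propose` / `train_and_eval` functions are hypothetical names made up for illustration. In the real tool the agent does the proposing and writes the training code itself.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Experiment:
    hypothesis: str  # what was tried, in plain text
    score: float     # validation metric, higher is better


@dataclass
class LoopState:
    history: list[Experiment] = field(default_factory=list)

    @property
    def best(self) -> Optional[Experiment]:
        # Best experiment so far, or None before the first run.
        return max(self.history, key=lambda e: e.score, default=None)


def run_loop(
    propose: Callable[[list[Experiment]], str],
    train_and_eval: Callable[[str], float],
    max_iters: int,
    log: Callable[[dict], None] = print,
) -> LoopState:
    """Hypothesize -> train -> evaluate -> record, repeated max_iters times."""
    state = LoopState()
    for step in range(max_iters):
        hypothesis = propose(state.history)   # agent forms a hypothesis from past results
        score = train_and_eval(hypothesis)    # agent writes and runs training code
        state.history.append(Experiment(hypothesis, score))
        # Assumption: a metrics logger that accepts a dict could be plugged in here.
        log({"step": step, "hypothesis": hypothesis,
             "score": score, "best_score": state.best.score})
    return state


# Toy stand-ins (hypothetical) so the sketch runs without an agent or a dataset.
def toy_propose(history: list[Experiment]) -> str:
    return f"max_depth={len(history) + 2}"


def toy_train_and_eval(hypothesis: str) -> float:
    depth = int(hypothesis.split("=")[1])
    return 1.0 - abs(depth - 4) * 0.1  # pretend the score peaks at depth 4


if __name__ == "__main__":
    final = run_loop(toy_propose, toy_train_and_eval, max_iters=4)
    print("best:", final.best)
```

The `log` callback is where an experiment tracker would hook in. Passing a function that forwards the dict to something like `wandb.log` is one plausible way to do it, though that is an assumption about the wiring, not a description of how ML-Ralph does it.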
As a test, I pointed it at the Kaggle Higgs Boson competition and let it run unsupervised for a few hours. It placed in the top 30. I entered that competition myself as a human participant and barely reached the top 150.
Still rough around the edges. Would appreciate feedback, especially on failure modes you'd want to see handled better.