I had no doubts claude/codex/gemini could handle this, but I was surprised that even an open-source model (glm4.7) did it 1 shot (just c/p the prompt from Install.md and came back 10 min later):
============================= 123 passed in 0.18s ==============================
Test Results:
All 123 tests from tests.yaml pass successfully:
- 35 timeago tests
- 26 duration tests
- 28 parse_duration tests
- 19 human_date tests
- 9 date_range tests
Files Created:
1. /workspaces/glmcode/whenwords/src/__init__.py - Library implementation
2. /workspaces/glmcode/whenwords/test_whenwords.py - Test file generated from tests.yaml
3. /workspaces/glmcode/whenwords/usage.md - Usage documentation
Crunched for 11m 38s
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Showing detailed transcript · ctrl+o to toggleI mean, it's a "toy" library, but the concept is so cool! And the fact that an open, locally hostable model can do it 1shot is insane.
If an LLM can do this with English and arbitrary test cases why wouldn't you pick a programming language and specific test cases? This would give you significantly more repeatability and consistency all while probably having less overall lines of code.
The key idea is to have one test suite/specification that multiple implantations in different languages can share.
If the tests are in YAML it doesn't need to convert them at all. It can write a new test harness in the new language and run against those existing, deterministic tests.
The advantage I can think of is it would might be more human readable but Python is damn close to pseudocode. It’ll likely always be a bit annoying to write because it has to be a formal language.
The YAML describes the tests - like this file here: https://github.com/dbreunig/whenwords/blob/main/tests.yaml
Snippet:
- name: "5 hours ago"
input: { timestamp: 1704049200, reference: 1704067200 }
output: "5 hours ago"
- name: "21 hours ago"
input: { timestamp: 1703991600, reference: 1704067200 }
output: "21 hours ago"
When told "use red/green TDD to write code for this in Ruby", a coding agent like Claude Code will write a test harness in Ruby that loops through all of those YAML tests, run it and watch it fail, then write just enough Ruby that the tests pass.But to me clearly that YAML snippet you provided is a specification which needs to be translated to Ruby as much as Python would. If the equivalent Python is:
def test_timeago_5_hours_ago(self):
self.assertEqual(timeago(1704049200, 1704067200)), "5 hours ago")
def test_timeago_21_hours_ago(self): self.assertEqual(timeago(1703991600, 1704067200)), "21 hours ago")
The YAML is no more clear than the Python, nor closer to Ruby. Honestly I think it's less clear as a human reading it because it's hard to tell which function is being tested in context of a specific test case. I guess it's possible Claude is better at working with the YAML than the Python but that would be a coincidence I think.
simonw•11h ago
I've now got a JavaScript interpreter and a WebAssembly runtime written in Python, built by Claude Code for web run from my phone.