- general LLM-style chat: math Q&A
- tool use: hashing with tools
- image understanding
- web browsing: tic-tac-toe
- code generation & execution
- memory across sessions
The idea isn’t to benchmark, but to give builders practical drills that show where agents succeed or fail. It should be useful for education, workshops, or self-study.
Would love feedback on what other exercises would make it more useful.
Try it here: https://ape.llm.phd/