Out of that frustration I built KeelTest - a VS Code extension that generates pytest tests and executes them. I got hooked and decided to push this project forward... When tests fail, it tries to figure out why:
- Generation error: attempts to fix it automatically, then tries again
- Bug in your source code: flags it and explains what's wrong
How it works (rough sketch of the loop after the list):
- Static analysis to map dependencies, patterns, and services to mock
- Generate a plan for each function, including which edge cases to cover
- Generate those tests
- Execute them in a "sandbox"
- Self-heal failures or flag source bugs
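Roughly, that loop looks like this. This is a minimal sketch, not the actual extension code: the `generate` and `classify` callables stand in for the LLM calls, and all names here are illustrative.

```python
# Sketch of the generate -> execute -> self-heal loop described above.
# The LLM-facing helpers are passed in as callables; names are illustrative only.
import subprocess
import tempfile
from pathlib import Path
from typing import Callable


def run_pytest(test_file: Path) -> subprocess.CompletedProcess:
    """Run the generated tests in a throwaway directory (the 'sandbox')."""
    return subprocess.run(
        ["pytest", str(test_file), "-q", "--tb=short"],
        capture_output=True, text=True, cwd=test_file.parent,
    )


def generate_and_heal(
    source_file: Path,
    generate: Callable[[str, str], str],       # (source, last_failure) -> pytest code
    classify: Callable[[str, str, str], str],  # (source, test, failure) -> category
    max_retries: int = 2,
) -> str:
    source = source_file.read_text()
    failure_log = ""
    for _ in range(max_retries + 1):
        test_code = generate(source, failure_log)
        test_file = Path(tempfile.mkdtemp()) / f"test_{source_file.stem}.py"
        test_file.write_text(test_code)
        result = run_pytest(test_file)
        if result.returncode == 0:
            return "passed"
        failure_log = result.stdout + result.stderr
        if classify(source, test_code, failure_log) == "source_bug":
            return "source_bug"  # flag it to the user instead of retrying
        # otherwise treat it as a generation problem and regenerate
    return "gave_up"
```

The real pipeline does more than this (dependency mapping, mock planning, per-function test plans), but this is the core shape of the loop.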
Python + pytest only for now. Alpha stage - not all codebases work reliably. But testing on personal projects and a few production apps at work, it's been consistently decent. It works best on simpler applications and sometimes glitches on monorepo setups. Supports Poetry/UV/plain pip setups.
Install from VS Code marketplace: https://marketplace.visualstudio.com/items?itemName=KeelCode...
More detailed writeup how it works: https://keelcode.dev/blog/introducing-keeltest
Free tier is 7 test files/month (current limit is <=300 source LOC). To make it easier to try without signing up, I'm giving away a few API keys (they share a combined quota of ~30 generated test files):
KEY-1: tgai_jHOEgOfpMJ_mrtNgSQ6iKKKXFm1RQ7FJOkI0a7LJiWg
KEY-2: tgai_NlSZN-4yRYZ15g5SAbDb0V0DRMfVw-bcEIOuzbycip0
KEY-3: tgai_kiiSIikrBZothZYqQ76V6zNbb2Qv-o6qiZjYZjeaczc
KEY-4: tgai_JBfSV_4w-87bZHpJYX0zLQ8kJfFrzas4dzj0vu31K5E
Would love your honest feedback on where this could go next, and on which setups it failed and how it failed - it has quite verbose debug output at this stage!
ericyd•1d ago
bulba4aur•1d ago
So from my experience with LLMs, if you ask them directly "is this a bug or a feature?" they might start hallucinating and assume stuff that isn't there.
I found in a few research/blog posts that if you ask the LLM to categorize (basically label) the issue and provide a score for the category it belongs to, it performs very well.
So that's exactly what this tool does: when it sees a failing test, it formulates the prompt in the following way:
## SOURCE CODE UNDER TEST:
## FAILED TEST CODE:
## PYTEST FAILURE FOR THIS TEST:
## PARSED FAILURE INFO:
## YOUR TASK: Perform a deep "Step-by-Step" analysis to determine if this failure is:
1. *hallucination*: The test expects behavior, parameters, or side effects that do NOT exist in the source code.
2. *source_bug*: The test is logically correct based on the requirements/signature, but the source code has a bug (e.g., missing await, wrong logic, typo).
3. *mock_issue*: The test is correct but the technical implementation of mocks (especially AsyncMock) is problematic.
4. *test_design_issue*: The test is too brittle, over-mocked, or has poor assertions.
Then it also assigns a "confidence" score to its answer. Based on that, it either fully regenerates the tests, comments on the bug, fixes the mocks, or redesigns the test (if it's too brittle).
While this is not 100% bulletproof, I found it to be quite an effective approach - basically using the LLM for categorization.
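To make the dispatch concrete, here's a rough sketch of the "label + confidence, then act" idea. The category names match the prompt above; the threshold, handler names, and data shape are illustrative assumptions, not the extension's actual internals.

```python
# Illustrative sketch: dispatch on the model's category label plus its confidence.
from dataclasses import dataclass


@dataclass
class FailureVerdict:
    category: str      # "hallucination" | "source_bug" | "mock_issue" | "test_design_issue"
    confidence: float  # 0.0 .. 1.0, as reported by the model
    reasoning: str


def handle_verdict(verdict: FailureVerdict, min_confidence: float = 0.7) -> str:
    """Map the label to an action; fall back to regeneration when confidence is low."""
    if verdict.confidence < min_confidence:
        return "regenerate_tests"  # don't trust a low-confidence label
    return {
        "hallucination": "regenerate_tests",    # test invented behavior -> rewrite it
        "source_bug": "flag_source_bug",        # test is right -> surface the bug
        "mock_issue": "fix_mocks",              # repair AsyncMock / patch targets
        "test_design_issue": "redesign_test",   # too brittle or over-mocked
    }.get(verdict.category, "regenerate_tests")


# Example: one parsed model response for a failing test
verdict = FailureVerdict(
    category="source_bug",
    confidence=0.86,
    reasoning="Function is declared async but its result is never awaited.",
)
print(handle_verdict(verdict))  # -> "flag_source_bug"
```

Keeping the categories closed and asking for a score is what makes the follow-up step mechanical instead of another open-ended LLM judgement.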
Hope that answers your question!