
Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
242•isitcontent•16h ago•27 comments

Show HN: MCP App to play backgammon with your LLM

https://github.com/sam-mfb/backgammon-mcp
2•sam256•41m ago•1 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
344•vecti•18h ago•153 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
310•eljojo•19h ago•192 comments

Show HN: I'm 75, building an OSS Virtual Protest Protocol for digital activism

https://github.com/voice-of-japan/Virtual-Protest-Protocol/blob/main/README.md
5•sakanakana00•1h ago•1 comments

Show HN: I built Divvy to split restaurant bills from a photo

https://divvyai.app/
3•pieterdy•1h ago•0 comments

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

https://github.com/phreda4/r3
77•phreda4•16h ago•14 comments

Show HN: Smooth CLI – Token-efficient browser for AI agents

https://docs.smooth.sh/cli/overview
93•antves•1d ago•70 comments

Show HN: ARM64 Android Dev Kit

https://github.com/denuoweb/ARM64-ADK
17•denuoweb•2d ago•2 comments

Show HN: BioTradingArena – Benchmark for LLMs to predict biotech stock movements

https://www.biotradingarena.com/hn
26•dchu17•21h ago•12 comments

Show HN: Slack CLI for Agents

https://github.com/stablyai/agent-slack
49•nwparker•1d ago•11 comments

Show HN: I Hacked My Family's Meal Planning with an App

https://mealjar.app
2•melvinzammit•3h ago•0 comments

Show HN: Artifact Keeper – Open-Source Artifactory/Nexus Alternative in Rust

https://github.com/artifact-keeper
152•bsgeraci•1d ago•64 comments

Show HN: I built a free UCP checker – see if AI agents can find your store

https://ucphub.ai/ucp-store-check/
2•vladeta•4h ago•2 comments

Show HN: Gigacode – Use OpenCode's UI with Claude Code/Codex/Amp

https://github.com/rivet-dev/sandbox-agent/tree/main/gigacode
18•NathanFlurry•1d ago•9 comments

Show HN: Compile-Time Vibe Coding

https://github.com/Michael-JB/vibecode
10•michaelchicory•5h ago•1 comments

Show HN: Slop News – HN front page now, but it's all slop

https://dosaygo-studio.github.io/hn-front-page-2035/slop-news
15•keepamovin•6h ago•5 comments

Show HN: Daily-updated database of malicious browser extensions

https://github.com/toborrm9/malicious_extension_sentry
14•toborrm9•21h ago•7 comments

Show HN: Horizons – OSS agent execution engine

https://github.com/synth-laboratories/Horizons
23•JoshPurtell•1d ago•5 comments

Show HN: Micropolis/SimCity Clone in Emacs Lisp

https://github.com/vkazanov/elcity
172•vkazanov•2d ago•49 comments

Show HN: Falcon's Eye (isometric NetHack) running in the browser via WebAssembly

https://rahuljaguste.github.io/Nethack_Falcons_Eye/
5•rahuljaguste•15h ago•1 comments

Show HN: Fitspire – a simple 5-minute workout app for busy people (iOS)

https://apps.apple.com/us/app/fitspire-5-minute-workout/id6758784938
2•devavinoth12•9h ago•0 comments

Show HN: I built a RAG engine to search Singaporean laws

https://github.com/adityaprasad-sudo/Explore-Singapore
4•ambitious_potat•10h ago•4 comments

Show HN: Local task classifier and dispatcher on RTX 3080

https://github.com/resilientworkflowsentinel/resilient-workflow-sentinel
25•Shubham_Amb•1d ago•2 comments

Show HN: Sem – Semantic diffs and patches for Git

https://ataraxy-labs.github.io/sem/
2•rs545837•11h ago•1 comments

Show HN: A password system with no database, no sync, and nothing to breach

https://bastion-enclave.vercel.app
12•KevinChasse•21h ago•16 comments

Show HN: FastLog: 1.4 GB/s text file analyzer with AVX2 SIMD

https://github.com/AGDNoob/FastLog
5•AGDNoob•12h ago•1 comments

Show HN: GitClaw – An AI assistant that runs in GitHub Actions

https://github.com/SawyerHood/gitclaw
9•sawyerjhood•22h ago•0 comments

Show HN: Gohpts tproxy with arp spoofing and sniffing got a new update

https://github.com/shadowy-pycoder/go-http-proxy-to-socks
2•shadowy-pycoder•13h ago•0 comments

Show HN: I built a directory of $1M+ in free credits for startups

https://startupperks.directory
4•osmansiddique•13h ago•0 comments

Show HN: KeelTest – AI-driven VS Code unit test generator with bug discovery

https://keelcode.dev/keeltest
30•bulba4aur•1mo ago
I built this because Cursor, Claude Code and other agentic AI tools kept giving me tests that looked fine but failed when I ran them. Or worse - I'd ask the agent to run them and it would start looping: fix tests, those fail, then it starts "fixing" my code so tests pass, or just deletes assertions so they "pass".

Out of that frustration I built KeelTest - a VS Code extension that generates pytest tests and executes them. I got hooked and decided to push the project forward. When tests fail, it tries to figure out why:

- Generation error: Attempts to fix it automatically, then tries again

- Bug in your source code: Flags it and explains what's wrong

How it works:

- Static analysis to map dependencies, patterns, services to mock.

- Generate a plan for each function and what edge cases to cover

- Generate those tests

- Execute in "sandbox"

- Self-heal failures or flag source bugs
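
Roughly, that loop could look like the sketch below (illustrative Python only; plan_tests, generate_tests, classify_failure and repair_tests are hypothetical stand-ins for the LLM-backed steps, not the actual KeelTest internals):

    import subprocess
    import tempfile
    from pathlib import Path

    def run_pipeline(source_file: Path, max_heal_attempts: int = 2) -> dict:
        plan = plan_tests(source_file)                 # hypothetical: static analysis + planning
        test_code = generate_tests(source_file, plan)  # hypothetical: LLM generation
        for _ in range(max_heal_attempts + 1):
            # Execute in an isolated temp directory ("sandbox").
            with tempfile.TemporaryDirectory() as tmp:
                test_path = Path(tmp) / f"test_{source_file.stem}.py"
                test_path.write_text(test_code)
                result = subprocess.run(
                    ["pytest", str(test_path), "--tb=short"],
                    capture_output=True, text=True, timeout=120,
                )
            if result.returncode == 0:
                return {"status": "passed", "tests": test_code}
            # Decide whether the failure is a generation problem or a real source bug.
            verdict = classify_failure(source_file, test_code, result.stdout)  # hypothetical
            if verdict["category"] == "source_bug":
                return {"status": "source_bug", "details": verdict}
            test_code = repair_tests(test_code, verdict)  # hypothetical: self-heal and retry
        return {"status": "gave_up", "last_output": result.stdout}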

Python + pytest only for now. Alpha stage - not all codebases work reliably. But testing it on personal projects and a few production apps at work, it's been consistently decent. It works best on simpler applications and sometimes glitches on monorepo setups. Supports Poetry/UV/plain pip setups.

Install from VS Code marketplace: https://marketplace.visualstudio.com/items?itemName=KeelCode...

More detailed writeup on how it works: https://keelcode.dev/blog/introducing-keeltest

Free tier is 7 test files/month (current limit is <=300 source LOC). To make it easier to try without signing up, I'm giving away a few API keys (they share a quota of ~30 generated test files):

KEY-1: tgai_jHOEgOfpMJ_mrtNgSQ6iKKKXFm1RQ7FJOkI0a7LJiWg

KEY-2: tgai_NlSZN-4yRYZ15g5SAbDb0V0DRMfVw-bcEIOuzbycip0

KEY-3: tgai_kiiSIikrBZothZYqQ76V6zNbb2Qv-o6qiZjYZjeaczc

KEY-4: tgai_JBfSV_4w-87bZHpJYX0zLQ8kJfFrzas4dzj0vu31K5E

Would love your honest feedback on where this could go next, and on which setups it failed and how it failed; it has quite verbose debug output at this stage!

Comments

ericyd•1mo ago
I'd be curious to hear more about how it determines when a failure is a source code bug. In my experience it's very hard to encapsulate the "why" of a particular behavior in a way the agents will understand. How does this tool know that the test it wrote indicates an issue in the source vs an issue in the test?
bulba4aur•1mo ago
Hey, thanks for the question.

So from my experience with LLMs, if you ask them directly "is this a bug or a feature" they might start hallucinating and assume stuff that isn't there.

I found in a few research/blog posts that if you ask the LLM to categorize (basically label) the issue and provide a score for which category it belongs to, it performs very well.

So that's exactly what this tool does: when it sees a failing test, it formulates the prompt in the following way:

## SOURCE CODE UNDER TEST:
## FAILED TEST CODE:
## PYTEST FAILURE FOR THIS TEST:
## PARSED FAILURE INFO:
## YOUR TASK: Perform a deep "Step-by-Step" analysis to determine if this failure is:

1. *hallucination*: The test expects behavior, parameters, or side effects that do NOT exist in the source code.

2. *source_bug*: The test is logically correct based on the requirements/signature, but the source code has a bug (e.g., missing await, wrong logic, typo).

3. *mock_issue*: The test is correct but the technical implementation of mocks (especially AsyncMock) is problematic.

4. *test_design_issue*: The test is too brittle, over-mocked, or has poor assertions.

Then it also assigns a "confidence" score to its answer. Based on that, it proceeds with either full regeneration of the tests, commenting on the bug in the test, fixing the mocks, or a full test redesign (if the test is too brittle).

While this is not 100% bulletproof, I found it to be quite an effective approach - basically using the LLM for categorization.
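
For the curious, the categorization call could be sketched roughly like this (illustrative only; ask_llm() is a hypothetical stand-in for the model call, and the 0.6 threshold is made up):

    import json

    CATEGORIES = ["hallucination", "source_bug", "mock_issue", "test_design_issue"]

    def classify_failure(source_code: str, test_code: str, pytest_output: str) -> dict:
        # Build the structured prompt with labeled sections, as described above.
        prompt = (
            "## SOURCE CODE UNDER TEST:\n" + source_code + "\n\n"
            "## FAILED TEST CODE:\n" + test_code + "\n\n"
            "## PYTEST FAILURE FOR THIS TEST:\n" + pytest_output + "\n\n"
            "## YOUR TASK: Analyze step by step and reply as JSON:\n"
            '{"category": <one of ' + ", ".join(CATEGORIES) + '>, "confidence": <0-1>, "reason": "..."}'
        )
        verdict = json.loads(ask_llm(prompt))  # ask_llm() is hypothetical
        # Treat low-confidence or unknown labels as "regenerate from scratch".
        if verdict.get("category") not in CATEGORIES or verdict.get("confidence", 0) < 0.6:
            verdict = {"category": "hallucination", "confidence": 0.0, "reason": "low confidence"}
        return verdict

    # Branching on the verdict:
    #   hallucination     -> regenerate the tests
    #   source_bug        -> keep the test, flag and explain the bug
    #   mock_issue        -> fix the mocks only
    #   test_design_issue -> redesign the test (too brittle / over-mocked)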

Hope that answers your question!

bulba4aur•1mo ago
To clarify, each failing test triggers a "review" agent to determine "why" the test fails. Again, this could probably be improved with better heuristics and more in-depth static analysis of the source code, but that is how it works in the current version.
arthurstarlake•1mo ago
I wonder if always having a design doc of some substance discussing the intended behavior of the whole app would help reduce instances of hallucination. The human developer should create it and let it be accessed by the AI.
bulba4aur•1mo ago
100% agree with that
ericyd•4w ago
Conceptually a good idea but this feels extremely non-scalable
hrimfaxi•1mo ago
How exactly do credits work? Your pricing mentions files and functions but doesn't appear to give a true unit of measure.
bulba4aur•1mo ago
Hey, thanks for the feedback. I will make sure to make it more visible/less confusing. The model is actually quite simple.

1 credit - 1 file with up to 15 functions. <-- Only this tier is available in alpha due to current limitations in the implementation: I tried generating on bigger files and it took quite a long time, so I am working on solving this before enabling larger-file support.

2 credits - 1 file with up to 30 functions.

3 credits - 1 file with 30-35 functions.

P.S. If generated tests have a <70% pass rate (at which point something probably went horribly wrong), your credits are refunded.

Hope this answer clears things up!

joshuaisaact•1mo ago
I notice one of the things you don't really talk about in the blog post (or if you did, I missed it) is unnecessary tests, which is one of the key problems LLMs have with test writing.

In my experience, if you just ask an LLM to write tests, it'll write you a ton of boilerplate happy path tests that aren't wrong, per se, they're just pointless (one fun one in react is 'the component renders').

How do you plan to handle this?

bulba4aur•1mo ago
I've actually thought about it multiple times at this point.

You're right, this deserves more attention and is a real problem going forward with this app. I had this problem when I first started building: it either generated XSS tests for any user input validation method (even if it used other validators) or just a single test case.

For now I attempt to strictly limit the amount of tests for LLM to generate.

This is achieved with a "Planner" that plans the tests for each function before any generation happens. That agent is instructed to generate a plan that follows these criteria:

- testCases.category MUST be one of "happy_path" | "edge_case" | "error_handling" | "boundary".

It is also asked to generate 2-3 tests per category. While this may still produce some unnecessary tests, it at least tries to limit their number.
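
For illustration, a plan for a single function might look something like this (hypothetical shape and names, not the exact internal schema):

    # Hypothetical per-function test plan; categories are constrained to the
    # four allowed labels so the generator can't sprawl into pointless tests.
    plan = {
        "function": "parse_invoice",
        "testCases": [
            {"category": "happy_path",     "name": "parses a well-formed invoice"},
            {"category": "edge_case",      "name": "handles an invoice with zero line items"},
            {"category": "error_handling", "name": "raises ValueError on malformed totals"},
            {"category": "boundary",       "name": "accepts the maximum allowed line count"},
        ],
    }

    ALLOWED = {"happy_path", "edge_case", "error_handling", "boundary"}
    assert all(tc["category"] in ALLOWED for tc in plan["testCases"])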

Going forward I believe the best approach would be to tune and tweak the requirements based on the language/framework it detects.

observationist•1mo ago
Do a structured code review, with a few passes by Claude or Codex. Have it provide an annotated justification for each test, and flag tests with redundant, low, or no utility within the context of the rest of the tests. Anything that looks questionable to you, call it out on the next pass, and if it's not justified by the time you fully understand the tests, nuke it.

You could automate this, but you'll end up getting rid of useful tests and keeping weird useless ones until the AI gets better at nuance and large codebases.
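
Sketching what one such review pass could look like (illustrative; ask_llm() again stands in for the Claude/Codex call):

    REVIEW_PROMPT = """For each test below, give:
      - a one-line justification of what behavior it protects,
      - a utility rating: high | low | redundant | none,
    judged against the OTHER tests in the suite, not in isolation.

    {test_suite}
    """

    def review_tests(test_suite: str, passes: int = 2) -> str:
        notes = test_suite
        for _ in range(passes):
            # Each pass re-reads the annotated suite so earlier flags inform later ones.
            notes = ask_llm(REVIEW_PROMPT.format(test_suite=notes))  # ask_llm() is hypothetical
        return notes  # a human then decides what to keep and what to nuke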

OptionOfT•1mo ago
What I see a lot is a generated test for something I prompt, and the test passes. Then I manually break the test and it fails for a different reason, not what I wanted to verify.

Guess I need to make it generate negative tests?

aleksiy123•1mo ago
The automated version of this is mutation testing.

Which is actually probably a solid idea for this exact use case.
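
For anyone unfamiliar, the idea in miniature (tools like mutmut automate this for Python; the snippet below is just a hand-rolled illustration):

    # Mutation testing in miniature: deliberately "break" the code under test
    # and check that at least one test fails (i.e. the mutant gets killed).
    def add(a, b):
        return a + b           # original

    def add_mutant(a, b):
        return a - b           # mutant: operator flipped

    def checks_the_real_behavior(fn):
        return fn(2, 3) == 5   # a meaningful assertion

    assert checks_the_real_behavior(add) is True          # passes on the original
    assert checks_the_real_behavior(add_mutant) is False  # kills the mutant

    # A test that only asserted "fn(2, 3) is not None" would pass for both
    # versions; the surviving mutant flags that test as weak or pointless.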

rcarmo•1mo ago
Weird. Copilot knows what tests are and only "fixes" them after we've refactored the relevant code.

I really wonder if Claude Code and other agents keep track of these dependencies at all (I know that VS Code exposes its internal testing tools to agents, and I use Anthropic and OpenAI tools with them).

bulba4aur•1mo ago
Indeed, the Microsoft Copilot ecosystem might be a bit more sophisticated these days.

It just so happens that people around me, including myself, don't use Copilot; we "left" for the next big thing when Cursor was released and Copilot was still a glorified auto-complete.

From your feedback it seems like it has become quite good?