We know there's a lot of noise about different browser agents. If you've tried any of them, you know they're slow, expensive, and inconsistent. That's why we built an agent specifically for running test cases and optimized it just for that:
- Pure vision instead of an error-prone "set-of-marks" system (the colorful boxes you see in browser-use, for example)
- A tiny VLM (Moondream) instead of OpenAI/Anthropic computer use, for dramatically faster and cheaper execution (rough sketch of this vision step below)
- Two agents: one that plans and adapts test cases, and one that executes them quickly and consistently
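For a sense of what "pure vision" means in practice: screenshot the page, ask the model to point at the element you describe in plain language, and click the coordinates it returns. Here's a simplified Python sketch, not our actual code; the moondream client calls (md.vl, model.point) and the normalized coordinates are assumptions based on its docs:

```python
# Rough sketch of the pure-vision step: no DOM parsing, no set-of-marks overlay.
# The moondream client usage (md.vl / model.point, normalized output) is assumed
# from its docs, not taken from our implementation.
from io import BytesIO

import moondream as md
from PIL import Image
from playwright.sync_api import sync_playwright

model = md.vl(api_key="YOUR_MOONDREAM_KEY")  # placeholder key

with sync_playwright() as p:
    page = p.chromium.launch(headless=True).new_page(
        viewport={"width": 1280, "height": 800}
    )
    page.goto("https://example.com/login")  # illustrative URL

    # 1. Screenshot the current page.
    shot = Image.open(BytesIO(page.screenshot()))

    # 2. Ask the tiny VLM to point at an element described in plain English.
    point = model.point(shot, "the blue 'Sign in' button")["points"][0]

    # 3. Click the returned (assumed normalized) coordinates; that's the whole action.
    page.mouse.click(point["x"] * 1280, point["y"] * 800)
```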
The idea is that the planner builds up a general plan, which the executor runs. We can save this plan and re-run it with only the executor for quick, cheap, and consistent runs. When something goes wrong, the executor kicks back out to the planner agent, which re-adjusts the test.
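In pseudocode, the loop looks roughly like the toy sketch below; the class and function names are made up for illustration, not our actual API.

```python
# Toy sketch of the planner/executor split; everything here is illustrative.
from dataclasses import dataclass, field

@dataclass
class Plan:
    steps: list = field(default_factory=list)

class Planner:
    """Stands in for the big model: slow, but can (re)write the plan."""
    def build_plan(self, test_case: str) -> Plan:
        return Plan(steps=[f"step for: {test_case}"])
    def adjust_plan(self, test_case: str, state: str) -> Plan:
        return Plan(steps=[f"recovery step from: {state}"])

class Executor:
    """Stands in for the Moondream-only executor: fast, cheap, no heavy reasoning."""
    def run_step(self, step: str) -> bool:
        print("executing:", step)
        return True  # pretend the step succeeded
    def current_state(self) -> str:
        return "screenshot-of-current-page"

def run_test(test_case: str, planner: Planner, executor: Executor, saved_plan=None) -> Plan:
    # Slow path: only call the big planner model when there is no cached plan.
    plan = saved_plan or planner.build_plan(test_case)
    for step in plan.steps:
        # Fast path: every step runs with the tiny executor model.
        if not executor.run_step(step):
            # Something changed on the page: kick back out to the planner
            # and re-run with a freshly adjusted plan.
            fresh = planner.adjust_plan(test_case, executor.current_state())
            return run_test(test_case, planner, executor, saved_plan=fresh)
    return plan  # cache this so repeat runs skip the planner entirely

plan = run_test("user can log in", Planner(), Executor())
plan = run_test("user can log in", Planner(), Executor(), saved_plan=plan)  # replay: executor only
```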
It’s completely open source. Would love to have more people try it out and tell us how we can make it great.
anerli•11h ago
Where it gets interesting is that we can save the execution plan that the big model comes up with and run it with ONLY Moondream if the plan is specific enough, then switch back out to the big model if some action path requires adjustment. This means we can run repeated tests much more efficiently and consistently.
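To give a sense of what "specific enough" means, a saved plan is basically a list of steps whose targets are described concretely enough that Moondream can point at them directly. Something like this (illustrative shape, not the real format):

```python
# Illustrative shape of a saved plan, not the actual on-disk format.
saved_plan = {
    "test": "user can log in",
    "steps": [
        {"action": "click", "target": "email field in the login form"},
        {"action": "type",  "target": "email field", "text": "test@example.com"},
        {"action": "click", "target": "blue 'Sign in' button below the form"},
        {"action": "check", "target": "dashboard header that greets the user"},
    ],
}
# Each target is a plain-language description Moondream can point at directly,
# so replays never need the big planner model unless a step stops matching.
```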
tough•11h ago
There's also https://github.com/lm-sys/RouteLLM and other similar projects.
I guess your system isn't as oriented toward open-ended tasks, so you can just build workflows that decide which model to use at each step; these routing mechanisms are more useful for open-ended tasks that don't fit into a workflow so well (maybe?)
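To make that contrast concrete, here's a rough sketch (all names and the difficulty heuristic are made up): a workflow can hard-code the model per step type, while a RouteLLM-style router has to score each open-ended prompt and pick a model per request.

```python
# Workflow-style: the step type decides the model up front,
# which works because a test run has a fixed structure.
STEP_MODEL = {
    "plan":    "big-planner-model",  # slow/expensive, only when (re)planning
    "execute": "moondream",          # tiny VLM for every routine step
}

def model_for_step(step_type: str) -> str:
    return STEP_MODEL[step_type]

# Router-style (RouteLLM-like): open-ended requests have no fixed step type,
# so a learned scorer picks strong vs. weak model per prompt.
# The "scorer" below is a placeholder heuristic, purely for illustration.
def route_prompt(prompt: str, threshold: float = 0.5) -> str:
    difficulty = min(len(prompt) / 500, 1.0)  # stand-in score in [0, 1]
    return "strong-model" if difficulty > threshold else "weak-model"

print(model_for_step("execute"))  # moondream
print(route_prompt("Summarize this page and plan a multi-step refund flow"))
```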