Launch HN: Webhound (YC S23) – Research agent that builds datasets from the web

25•mfkhalil•1h ago
We're the team behind Webhound (https://webhound.ai), an AI agent that builds datasets from the web based on natural language prompts. You describe what you're trying to find. The agent figures out how to structure the data and where to look, then searches, extracts the results, and outputs everything in a CSV you can export.

We've set up a special no-signup version for the HN community at https://hn.webhound.ai - just click "Continue as Guest" to try it without signing up.

Here's a demo: https://youtu.be/fGaRfPdK1Sk

We started building it after getting tired of doing this kind of research manually. Open 50 tabs, copy everything into a spreadsheet, realize it's inconsistent, start over. It felt like something an LLM should be able to handle.

Some examples of how people have used it in the past month:

Competitor analysis: "Create a comparison table of internal tooling platforms (Retool, Appsmith, Superblocks, UI Bakery, BudiBase, etc) with their free plan limits, pricing tiers, onboarding experience, integrations, and how they position themselves on their landing pages." (https://www.webhound.ai/dataset/c67c96a6-9d17-4c91-b9a0-ff69...)

Lead generation: "Find Shopify stores launched recently that sell skincare products. I want the store URLs, founder names, emails, Instagram handles, and product categories." (https://www.webhound.ai/dataset/b63d148a-8895-4aab-ac34-455e...)

Pricing tracking: "Track how the free and paid plans of note-taking apps have changed over the past 6 months using official sites and changelogs. List each app with a timeline of changes and the source for each." (https://www.webhound.ai/dataset/c17e6033-5d00-4e54-baf6-8dea...)

Investor mapping: "Find VCs who led or participated in pre-seed or seed rounds for browser-based devtools startups in the past year. Include the VC name, relevant partners, contact info, and portfolio links for context." (https://www.webhound.ai/dataset/1480c053-d86b-40ce-a620-37fd...)

Research collection: "Get a list of recent arXiv papers on weak supervision in NLP. For each, include the abstract, citation count, publication date, and a GitHub repo if available." (https://www.webhound.ai/dataset/e274ca26-0513-4296-85a5-2b7b...)

Hypothesis testing: "Check if user complaints about Figma's performance on large files have increased in the last 3 months. Search forums like Hacker News, Reddit, and Figma's community site and show the most relevant posts with timestamps and engagement metrics." (https://www.webhound.ai/dataset/42b2de49-acbf-4851-bbb7-080b...)

The first version of Webhound was a single agent running on Claude 4 Sonnet. It worked, but sessions routinely cost over $1100 and it would often get lost in infinite loops. We knew that wasn't sustainable, so we started building around smaller models.

That meant adding more structure. We introduced a multi-agent system to keep it reliable and accurate. There's a main agent, a set of search agents that run subtasks in parallel, a critic agent that keeps things on track, and a validator that double-checks extracted data before saving it. We also gave it a notepad for long-term memory, which helps avoid duplicates and keeps track of what it's already seen.
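The division of labor described above can be sketched in a few lines. This is a minimal illustration, not Webhound's actual code: the names (`Notepad`, `validate`, `run_search_agent`) and the row/schema shapes are assumptions made up for the example; the key ideas it shows are the shared notepad memory preventing duplicate work and the validator gating rows before they are saved.

```python
from dataclasses import dataclass, field

@dataclass
class Notepad:
    """Long-term memory shared across agents: tracks what has
    already been seen so duplicate rows are never saved twice."""
    seen_urls: set = field(default_factory=set)
    rows: list = field(default_factory=list)

def validate(row, schema):
    """Validator stand-in: accept a row only if every schema field is filled."""
    return all(row.get(name) for name in schema)

def run_search_agent(subtask, notepad, schema):
    """Search-agent stand-in: in the real system this would browse and
    extract; here `subtask` just yields candidate rows."""
    for row in subtask():
        url = row.get("source_url")
        if url in notepad.seen_urls:
            continue               # notepad memory avoids duplicates
        if validate(row, schema):  # validator double-checks before saving
            notepad.seen_urls.add(url)
            notepad.rows.append(row)
```

In this shape, the main agent would fan subtasks out to several `run_search_agent` calls in parallel, all writing through the same notepad.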

After switching to Gemini 2.5 Flash and layering in the agent system, we were able to cut costs by more than 30x while also improving speed and output quality.

The system runs in two phases. First is planning, where it decides the schema, how to search, what sources to use, and how to know when it's done. Then comes extraction, where it executes the plan and gathers the data.
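The two phases can be summarized as: planning produces an artifact (schema, queries, stop condition), and extraction executes it. The sketch below is illustrative only; in Webhound the plan is produced by an LLM, whereas here it is hard-coded, and the field names and `fetch` callback are assumptions for the example.

```python
def plan(prompt):
    # Phase 1 (planning): decide the schema, how to search,
    # and how to know when the run is done.
    return {
        "schema": ["app", "plan", "price"],
        "queries": [f"{prompt} pricing page", f"{prompt} changelog"],
        "done_when": lambda rows: len(rows) >= 10,
    }

def extract(run_plan, fetch):
    # Phase 2 (extraction): execute the plan, projecting each
    # fetched row onto the schema, until the stop rule fires.
    rows = []
    for query in run_plan["queries"]:
        for row in fetch(query):
            rows.append({k: row.get(k) for k in run_plan["schema"]})
            if run_plan["done_when"](rows):
                return rows
    return rows
```

Separating the phases means the stop condition and schema are fixed before any browsing starts, which keeps the extraction loop from wandering.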

It uses a text-based browser we built that renders pages as markdown and extracts content directly. We tried full browser use but it was slower and less reliable. Plain text still works better for this kind of task.
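To make the idea concrete, here is a toy version of that kind of text-mode rendering using only the standard library's `html.parser`: strip the DOM down to headings, paragraphs, and link targets in markdown-ish form, so a model sees page structure as plain text. This is a sketch of the general technique, not Webhound's browser.

```python
from html.parser import HTMLParser

class TextBrowser(HTMLParser):
    """Render HTML as markdown-flavored plain text:
    headings become #-prefixes, links keep their targets."""
    def __init__(self):
        super().__init__()
        self.out = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.out.append("#" * int(tag[1]) + " ")
        elif tag == "a":
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.out.append(f" ({self._href})")
            self._href = None
        elif tag in ("p", "h1", "h2", "h3"):
            self.out.append("\n")

    def handle_data(self, data):
        self.out.append(data)

def render(html):
    browser = TextBrowser()
    browser.feed(html)
    return "".join(browser.out).strip()
```

The appeal is exactly what the post describes: the output is small, linear, and cheap to feed to a model, with none of the latency or flakiness of driving a real browser.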

We also built scheduled refreshes to keep datasets up to date and an API so you can integrate the data directly into your workflows.
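A client for that kind of API might look like the sketch below. To be clear, the host, endpoint path, and response shape here are invented for illustration and are not Webhound's real API; the `opener` parameter is injectable so the function can be exercised without a network.

```python
import json
from urllib.request import Request, urlopen

# Placeholder host for the sketch, not the real endpoint.
BASE = "https://api.webhound.example"

def fetch_dataset(dataset_id, api_key, opener=urlopen):
    """GET a dataset's rows as JSON; pass a fake `opener` in tests."""
    req = Request(
        f"{BASE}/datasets/{dataset_id}/rows",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with opener(req) as resp:
        return json.loads(resp.read())["rows"]
```

A scheduled refresh would then just be this call run on a cron-like timer, diffed against the previously stored rows.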

Right now, everything stays in the agent's context during a run. It starts to break down around 1000-5000 rows depending on the number of attributes. We're working on a better architecture for scaling past that.

We'd love feedback, especially from anyone who's tried solving this problem or built similar tools. Happy to answer anything in the thread.

Thanks! Moe

Comments

Oras•1h ago
When I read your description, I thought, "This is just like the Exa dataset, no?" but then I gave it a try, and I am genuinely impressed.

Great decision to make it without a login so people can test.

Here is what I liked:

- The agent told me exactly what's happening, which sources it is checking, and the schema.

- The agent correctly identified where to look and how to obtain the data.

- Managing expectations: "Webhound is extracting data. Extraction can take multiple hours. We'll send you an email when it's complete."

Minor point:

- There is no pricing on the main domain, just the HN one https://hn.webhound.ai/pricing

Good luck!

mfkhalil•45m ago
Thanks, glad to hear you had a good experience.

We were heavily inspired by tools like Cursor - basically tried to prioritize user control and visibility above everything else.

What we discovered during iteration was that our users are usually domain experts who know exactly what they want. The more we showed them what was happening under the hood and gave them control over the process, the better their results got.

giancarlostoro•33m ago
This is how I use Perplexity. I'll have to give this a try; I'm always on the lookout for newer tools.