I know the authors of Skyvern are around here sometimes. How do you think about code generation alongside vision-based approaches to agentic browser use, like OpenAI's Operator, Claude Computer Use, and Magnitude?
From my POV, the vision-based approaches are superior, but they are less amenable to codegen.
We can ask the vision-based models to explain why they are doing what they are doing, and fall back to code-based approaches for subsequent runs.
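Roughly the pattern I have in mind, as a sketch: the vision model drives the first run and records a rationale plus a replayable selector for each step; later runs skip the model entirely. vision_step() is a hypothetical stand-in for whatever VLM call you'd use, and the cache format is made up:

    import json, pathlib
    from playwright.sync_api import sync_playwright

    CACHE = pathlib.Path("actions_cache.json")

    def vision_step(screenshot_png, goal):
        # Hypothetical wrapper around whatever vision model you use.
        # It should return the next action, its rationale, and a CSS
        # selector that a code-based replay can use on later runs.
        raise NotImplementedError

    def run(url, goal):
        with sync_playwright() as p:
            page = p.chromium.launch().new_page()
            page.goto(url)
            if CACHE.exists():
                # Subsequent runs: cheap, deterministic code-based replay.
                for step in json.loads(CACHE.read_text()):
                    page.click(step["selector"])
            else:
                # First run: let the vision model drive, and record what
                # it did and why, so the next run can skip the model.
                step = vision_step(page.screenshot(), goal)
                page.click(step["selector"])
                CACHE.write_text(json.dumps([step]))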
Is AI capable of saying, "This website sucks, and doesn't work - file a complaint with the webmaster?"
I once had similar problems with the CIA's World Factbook. I shudder to think what an AI would do there.
Not fully equivalent to what Skyvern is doing, but still an interesting approach.
[1] https://www.reddit.com/r/LocalLLaMA/comments/1o8m0ti/we_buil...
showerst•56m ago
If a website isn't using Cloudflare or a JS-only design, it's generally better to skip Playwright. All the major AIs understand BeautifulSoup pretty well, and they're likely to write you a faster, less brittle scraper.
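For a plain-HTML site, that often needs nothing more than this (URL and selectors are placeholders for illustration):

    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example.com/listings", timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # Pull each table row's cell text; no browser, no JS runtime.
    for row in soup.select("table.results tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:
            print(cells)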
showerst•29m ago
At scale, dropping the heavier dependencies and network traffic of a browser is meaningful.
suchintan•25m ago
They aren't enough for anything that's login-protected, or that requires interacting with wizards (e.g. JS-driven flows, downloading files, etc.).
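Those are the cases where you do want the browser. A hedged sketch of a login wall plus a JS-triggered download, which a plain HTTP scraper can't reach (URL, selectors, and credentials are placeholders):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        # Log in through the form; there's no cookie-free endpoint to hit.
        page.goto("https://example.com/login")
        page.fill("#username", "me@example.com")
        page.fill("#password", "hunter2")
        page.click("button[type=submit]")
        page.wait_for_url("**/dashboard")

        # The export is generated client-side, so there's no href to fetch.
        with page.expect_download() as dl:
            page.click("text=Export CSV")
        dl.value.save_as("export.csv")

        browser.close()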