Launching Crawlee for Python v1.0 to simplify building web scrapers and crawlers

2•jancurn•1h ago

Comments

jancurn•1h ago

Hey HN,

This is Jan, the founder of Apify (https://apify.com/) — a full-stack web scraping platform.

With the help of the Python community and feedback from early adopters, after a year of building Crawlee for Python in beta, we are launching Crawlee for Python v1.0.0.

The main features are:

- Unified storage client system: less duplication, better extensibility, and a cleaner developer experience. It also opens the door for the community to build and share their own storage client implementations.

- Adaptive Playwright crawler: makes your crawls faster and cheaper, while still allowing you to reliably handle complex, dynamic websites. In practice, you get the best of both worlds: speed on simple pages and robustness on modern, JavaScript-heavy sites.

- New default HTTP client `ImpitHttpClient` (https://crawlee.dev/python/api/class/ImpitHttpClient), powered by the Impit library (https://github.com/apify/impit): fewer false positives, more resilient crawls, and less need for complicated workarounds. Impit is also developed as an open-source project by Apify, so you can dive into the internals or contribute improvements yourself. You can also create your own instance, configure it to your needs (e.g., enable HTTP/3 or choose a specific browser profile), and pass it into your crawler.

- Sitemap request loader: makes it easier to start large-scale crawls where sitemaps already provide full coverage of the site.

- Robots exclusion standard: not only helps you build ethical crawlers, but can also save time and bandwidth by skipping disallowed or irrelevant pages.

- Fingerprinting: each crawler run looks like a real browser on a real device. Using fingerprinting in Crawlee is straightforward: create a fingerprint generator with your desired options and pass it to the crawler.

- OpenTelemetry: monitor real-time dashboards or analyze traces to understand crawler performance. It also makes it easier to integrate Crawlee into existing monitoring pipelines.
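
To make the sitemap and robots.txt items above a bit more concrete, here is a minimal sketch of the underlying idea using only the Python standard library. It is not Crawlee's API: the function names `sitemap_urls` and `allowed_urls`, the `my-crawler` user agent, and the inline sample data are all hypothetical and exist only for illustration. It seeds URLs from a sitemap and drops the ones that robots.txt disallows.

```python
import urllib.robotparser
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for a site's real robots.txt and sitemap.xml.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/products/1</loc></url>
  <url><loc>https://example.com/admin/login</loc></url>
</urlset>
"""

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document, in order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def allowed_urls(urls: list[str], robots_txt: str, user_agent: str) -> list[str]:
    """Keep only the URLs that robots.txt permits for the given user agent."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    seeds = sitemap_urls(SITEMAP_XML)
    for url in allowed_urls(seeds, ROBOTS_TXT, "my-crawler"):
        print(url)
```

In a real crawl the two documents would be fetched from the target site rather than hard-coded, and Crawlee's own sitemap loader and robots.txt handling would take the place of these helpers.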

For details, you can read the announcement blog post: https://crawlee.dev/blog/crawlee-for-python-v1

My team and I will be happy to answer any questions you have here.
