This is Jan, the founder of Apify (https://apify.com/) — a full-stack web scraping platform.
With the help of Python community and the early adopters feedback, after an year of building Crawlee for Python in beta mode, we are launching Crawlee for Python v1.0.0.
The main features are:
- Unified storage client system: less duplication, better extensibility, and a cleaner developer experience. It also opens the door for the community to build and share their own storage client implementations.
- Adaptive Playwright crawler: makes your crawls faster and cheaper, while still allowing you to reliably handle complex, dynamic websites. In practice, you get the best of both worlds: speed on simple pages and robustness on modern, JavaScript-heavy sites.
- New default HTTP client `ImpitHttpClient` (https://crawlee.dev/python/api/class/ImpitHttpClient), powered by the Impit (https://github.com/apify/impit) library): fewer false positives, more resilient crawls, and less need for complicated workarounds. Impit is also developed as an open-source project by Apify, so you can dive into the internals or contribute improvements yourself: you can also create your own instance, configure it to your needs (e.g., enable HTTP/3 or choose a specific browser profile), and pass it into your crawler.
- Sitemap request loader: easier to start large-scale crawls where sitemaps already provide full coverage of the site
- Robots exclusion standard: not only helps you build ethical crawlers, but can also save time and bandwidth by skipping disallowed or irrelevant pages
- Fingerprinting: each crawler run looks like a real browser on a real device. Using fingerprinting in Crawlee is straightforward: create a fingerprint generator with your desired options and pass it to the crawler.
- Open telemetry: monitor real-time dashboards or analyze traces to understand crawler performance. easier to integrate Crawlee into existing monitoring pipelines
jancurn•1h ago
This is Jan, the founder of Apify (https://apify.com/) — a full-stack web scraping platform.
With the help of Python community and the early adopters feedback, after an year of building Crawlee for Python in beta mode, we are launching Crawlee for Python v1.0.0.
The main features are:
- Unified storage client system: less duplication, better extensibility, and a cleaner developer experience. It also opens the door for the community to build and share their own storage client implementations.
- Adaptive Playwright crawler: makes your crawls faster and cheaper, while still allowing you to reliably handle complex, dynamic websites. In practice, you get the best of both worlds: speed on simple pages and robustness on modern, JavaScript-heavy sites.
- New default HTTP client `ImpitHttpClient` (https://crawlee.dev/python/api/class/ImpitHttpClient), powered by the Impit (https://github.com/apify/impit) library): fewer false positives, more resilient crawls, and less need for complicated workarounds. Impit is also developed as an open-source project by Apify, so you can dive into the internals or contribute improvements yourself: you can also create your own instance, configure it to your needs (e.g., enable HTTP/3 or choose a specific browser profile), and pass it into your crawler.
- Sitemap request loader: easier to start large-scale crawls where sitemaps already provide full coverage of the site
- Robots exclusion standard: not only helps you build ethical crawlers, but can also save time and bandwidth by skipping disallowed or irrelevant pages
- Fingerprinting: each crawler run looks like a real browser on a real device. Using fingerprinting in Crawlee is straightforward: create a fingerprint generator with your desired options and pass it to the crawler.
- Open telemetry: monitor real-time dashboards or analyze traces to understand crawler performance. easier to integrate Crawlee into existing monitoring pipelines
For details, you can read the announcement blog post: https://crawlee.dev/blog/crawlee-for-python-v1
Our team and I will be happy to answer here any questions you might have.