I posted a few weeks ago about my custom scraping language. It definitely got some traction, which was very exciting to see.
GitHub repo: https://github.com/tadpolehq/tadpole

Docs: https://tadpolehq.com/
For the past two weeks, I've focused on introducing specific stealth actions, more complex control-flow actions, and a variety of evaluators for cleaning data.
Here is an example that scrapes `books.toscrape.com`:
```
main {
  new_page {
    goto "https://books.toscrape.com/"
    loop {
      do {
        $$ article.product_pod {
          extract "books[]" {
            title { $ "h3 a"; attr title }
            rating {
              $ ".star-rating";
              attr "class";
              extract "star-rating (One|Two|Three|Four|Five)" caseInsensitive=#true;
              func "(v) => ({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}[v.toLowerCase()] || null)"
            }
            price { $ "p.price_color"; text; as_float }
            in_stock { $ "p.availability"; text; matches "In stock" caseInsensitive=#true }
          }
        }
      }
      while { $ "li.next" }
      next {
        $ "li.next a" { click }
        wait_until
      }
    }
  }
}
```
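For anyone curious what the `rating` pipeline is doing: the inline `func` step is plain JavaScript, so the whole attr → regex extract → map chain can be sketched in vanilla JS. The regex and lookup table below are taken directly from the script; the `parseRating` name is just illustrative, not part of the language:

```javascript
// Mirrors the rating pipeline above: take the element's class attribute,
// extract the star word with the same regex, then map it to a number
// with the same lookup used in the inline `func`.
function parseRating(classAttr) {
  const match = classAttr.match(/star-rating (One|Two|Three|Four|Five)/i);
  if (!match) return null;
  return { one: 1, two: 2, three: 3, four: 4, five: 5 }[match[1].toLowerCase()] || null;
}

// parseRating("star-rating Three") -> 3
// parseRating("product_pod")       -> null
```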
I've introduced actions like `apply_identity` to override the User-Agent header and User-Agent metadata. Here is an example module that selectively creates different identities:

```
module stealth {
  // Apple M2 Pro
  action apply_apple_m2 {
    apply_identity mac
    set_webgl_vendor "Apple Inc." "Apple M2"
    set_device_memory 16
    set_hardware_concurrency 8
    set_viewport 1440 900 deviceScaleFactor=2
  }

  // Windows Desktop
  action apply_windows_16_8 {
    apply_identity windows
    set_webgl_vendor "Google Inc. (Intel)" "ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0)"
    set_device_memory 16
    set_hardware_concurrency 8
    set_viewport 1920 1080
  }

  // Windows Budget Laptop
  action apply_windows_8_4 {
    apply_identity windows
    set_webgl_vendor "Google Inc. (Intel)" "ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0)"
    set_device_memory 8
    set_hardware_concurrency 4
    set_viewport 1366 768
  }
}
```
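For context, fingerprint overrides like `set_device_memory` and `set_hardware_concurrency` generally boil down to redefining read-only getters on `navigator` from an injected script. Here's a rough JavaScript sketch of that general pattern, shown against a mock object rather than a real browser `navigator` (the function and variable names are mine, and this is not necessarily how Tadpole implements it internally):

```javascript
// Sketch of the property-override technique behind stealth actions:
// redefine read-only navigator properties as accessors. Demonstrated on
// a mock object; in a browser this would run as an injected script
// against window.navigator.
function overrideFingerprint(nav, { deviceMemory, hardwareConcurrency }) {
  Object.defineProperty(nav, 'deviceMemory', {
    get: () => deviceMemory,
    configurable: true,
  });
  Object.defineProperty(nav, 'hardwareConcurrency', {
    get: () => hardwareConcurrency,
    configurable: true,
  });
  return nav;
}

const navigatorMock = { deviceMemory: 4, hardwareConcurrency: 2 };
overrideFingerprint(navigatorMock, { deviceMemory: 16, hardwareConcurrency: 8 });
// navigatorMock.deviceMemory -> 16, navigatorMock.hardwareConcurrency -> 8
```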
The full release changelog is available here: https://github.com/tadpolehq/tadpole/releases/

My goals for the next 0.3.0 release are to focus heavily on plugins, distributed execution through message queues, Redis support for crawling, and static parsing as an alternative to running everything over CDP/Chrome.
I'll keep aiming for a release cadence of every two weeks!