Instead of writing imperative crawl loops, you describe what to follow and what to extract in a single expression:
import wxpath
# Crawl, extract fields, build a Wikipedia knowledge graph
path_expr = """
url('https://en.wikipedia.org/wiki/Expression_language')
  ///url(//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))])
  /map{
    'title': (//span[contains(@class, "mw-page-title-main")]/text())[1] ! string(.),
    'url': string(base-uri(.)),
    'short_description': //div[contains(@class, 'shortdescription')]/text() ! string(.),
    'forward_links': //div[@id="mw-content-text"]//a/@href ! string(.)
  }
"""
for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
    print(item)
The key addition is a `url(...)` operator that fetches a page and returns its HTML for further XPath processing, and `///url(...)` for deep (or paginated) traversal. Everything else is standard XPath 3.1 (maps/arrays/functions).
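To make the shallow/deep distinction concrete, here's a toy sketch based on my reading of the example above (the URLs and selectors are placeholders, and the exact recursion semantics may differ from the real implementation):

  (: url(...) fetches one page; the following steps run against that single document :)
  url('https://example.com/index')//a/@href ! string(.)

  (: ///url(...) re-applies the fetch to every matching link, recursively,
     bounded by the max_depth passed in from Python :)
  url('https://example.com/index')
    ///url(//a[starts-with(@href, '/docs/')]/@href)
    //h1/text() ! string(.)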
Features:
- Async/concurrent crawling with streaming results
- Scrapy-inspired auto-throttle and polite crawling
- Hook system for custom processing
- CLI for quick experiments
Another example, paginating through HN comment pages (via the "follow=" argument) and extracting data:
url('https://news.ycombinator.com',
    follow=//a[text()='comments']/@href | //a[@class='morelink']/@href)
  //tr[@class='athing']
  /map {
    'text': .//div[@class='comment']//text(),
    'user': .//a[@class='hnuser']/@href,
    'parent_post': .//span[@class='onstory']/a/@href
  }
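As with the Wikipedia example, this runs through the same blocking iterator; the snippet below just wraps the expression above in a string (max_depth=2 is an arbitrary choice for illustration):

  import wxpath

  hn_expr = """
  url('https://news.ycombinator.com',
      follow=//a[text()='comments']/@href | //a[@class='morelink']/@href)
    //tr[@class='athing']
    /map {
      'text': .//div[@class='comment']//text(),
      'user': .//a[@class='hnuser']/@href,
      'parent_post': .//span[@class='onstory']/a/@href
    }
  """

  # Same driver call as the first example; max_depth caps how many
  # follow= hops are taken (the value here is illustrative).
  for item in wxpath.wxpath_async_blocking_iter(hn_expr, max_depth=2):
      print(item)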
Limitations: HTTP-only (no JS rendering yet), no crawl persistence.
Both are on the roadmap if there's interest.
GitHub: https://github.com/rodricios/wxpath
PyPI: pip install wxpath
I'd love feedback on the expression syntax and any use cases this might unlock.
Thanks!