Show HN: wxpath – Declarative web crawling in XPath

https://github.com/rodricios/wxpath
29•rodricios•6d ago
wxpath is a declarative web crawler where web crawling and scraping are expressed directly in XPath.

Instead of writing imperative crawl loops, you describe what to follow and what to extract in a single expression:

    import wxpath

    # Crawl, extract fields, build a Wikipedia knowledge graph
    path_expr = """
    url('https://en.wikipedia.org/wiki/Expression_language')
         ///url(//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))])
             /map{
                'title': (//span[contains(@class, "mw-page-title-main")]/text())[1] ! string(.),
                'url': string(base-uri(.)),
                'short_description': //div[contains(@class, 'shortdescription')]/text() ! string(.),
                'forward_links': //div[@id="mw-content-text"]//a/@href ! string(.)
             }
    """

    for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
        print(item)

The key addition is a `url(...)` operator that fetches a page and returns its HTML for further XPath processing; `///url(...)` does the same for deep (or paginated) traversal. Everything else is standard XPath 3.1 (maps/arrays/functions).
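
For comparison, here is a rough sketch of the kind of imperative, depth-limited crawl loop that an expression like the one above replaces. It is not wxpath's implementation, and wxpath's exact `max_depth` semantics may differ. `crawl`, `LinkParser`, the injected `fetch` callable, and the in-memory `PAGES` are all hypothetical names used only for illustration; link extraction uses the standard library's `html.parser` as a stand-in for `//a/@href`.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects href values from <a> tags (a stand-in for `//a/@href`)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_depth=1):
    """Breadth-first crawl yielding (url, html) for each page reached.

    `fetch` is a hypothetical callable mapping a URL to an HTML string.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        yield url, html
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))


# In-memory pages stand in for real HTTP responses.
PAGES = {
    "https://example.org/a": '<a href="/b">b</a> <a href="/c">c</a>',
    "https://example.org/b": '<a href="/a">back</a> <a href="/d">d</a>',
    "https://example.org/c": "<p>leaf</p>",
    "https://example.org/d": "<p>too deep</p>",
}

visited = [url for url, _ in crawl("https://example.org/a", PAGES.__getitem__, max_depth=1)]
print(visited)
```

With `max_depth=1` the loop visits the start page and its direct links, but not `/d`, which is two hops away. The queue, the seen-set, and the depth bookkeeping are exactly the parts the declarative form leaves implicit.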

Features:

- Async/concurrent crawling with streaming results

- Scrapy-inspired auto-throttle and polite crawling

- Hook system for custom processing

- CLI for quick experiments
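
To make "async/concurrent crawling with streaming results" concrete, here is a minimal sketch using only `asyncio`. It is an assumption about the general pattern, not wxpath's internals: `fake_fetch` and `crawl_stream` are hypothetical names, and the network is simulated with `asyncio.sleep`. A semaphore caps concurrency (a crude form of politeness), and results are yielded as soon as each request finishes rather than after the whole batch.

```python
import asyncio


async def fake_fetch(url, delay):
    """Stand-in for an HTTP request; sleeps instead of hitting the network."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def crawl_stream(jobs, max_concurrency=2):
    """Yield (url, body) pairs as soon as each fetch completes."""
    limit = asyncio.Semaphore(max_concurrency)

    async def bounded(url, delay):
        async with limit:  # at most `max_concurrency` fetches in flight
            return url, await fake_fetch(url, delay)

    tasks = [asyncio.create_task(bounded(url, delay)) for url, delay in jobs]
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def main():
    jobs = [("https://example.org/slow", 0.05), ("https://example.org/fast", 0.01)]
    return [url async for url, _ in crawl_stream(jobs)]


order = asyncio.run(main())
print(order)
```

The faster request is yielded first even though it was submitted second, which is what lets a consumer start processing items while the crawl is still running.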

Another example: paginating through HN comment pages (via the `follow=` argument) and extracting data:

    url('https://news.ycombinator.com',
        follow=//a[text()='comments']/@href | //a[@class='morelink']/@href)
        //tr[@class='athing']
          /map {
            'text': .//div[@class='comment']//text(),
            'user': .//a[@class='hnuser']/@href,
            'parent_post': .//span[@class='onstory']/a/@href
          }

Limitations: HTTP-only (no JS rendering yet) and no crawl persistence. Both are on the roadmap if there's interest.

GitHub: https://github.com/rodricios/wxpath

PyPI: pip install wxpath

I'd love feedback on the expression syntax and any use cases this might unlock.

Thanks!

Comments

css_apologist•37m ago
xpath is so fucking cool

i can understand why it failed for general use, but shit like this revives my excitement

q: i'm not an expert, this looks like it extends xpath syntax? haven't seen stuff like the /map is this referring to the html map element? or a fp-style map?

rodricios•27m ago
I think xpath is cool too!

If wxpath can help revive some of that excitement, then I consider my project a success.

As for your question: while wxpath does extend the XPath syntax, `/map` is not one of its additions, nor is it an HTML map element.

XPath 3.1 introduced first-class maps (and arrays) (https://www.w3.org/TR/xpath-31/#id-maps), and `map{...}` is the constructor syntax for that structure (the leading `/` is just the path step). It's an awesome feature that's especially useful for quickly delivering JSON-like objects.

css_apologist•6m ago
sick, ty

rhdunn•26m ago
Maps were added in XPath 3.1 -- https://www.w3.org/TR/xpath-31/#id-maps.

There's currently work on XPath 4.0 -- https://qt4cg.org/specifications/xquery-40/xpath-40.html.
