Show HN: wxpath – Declarative web crawling in XPath

https://github.com/rodricios/wxpath
29•rodricios•6d ago
wxpath is a declarative web crawler where web crawling and scraping are expressed directly in XPath.

Instead of writing imperative crawl loops, you describe what to follow and what to extract in a single expression:

    import wxpath

    # Crawl, extract fields, build a Wikipedia knowledge graph
    path_expr = """
    url('https://en.wikipedia.org/wiki/Expression_language')
         ///url(//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))])
             /map{
                'title': (//span[contains(@class, "mw-page-title-main")]/text())[1] ! string(.),
                'url': string(base-uri(.)),
                'short_description': //div[contains(@class, 'shortdescription')]/text() ! string(.),
                'forward_links': //div[@id="mw-content-text"]//a/@href ! string(.)
             }
    """

    for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
        print(item)

The key addition is a `url(...)` operator that fetches and returns HTML for further XPath processing, and `///url(...)` for deep (or paginated) traversal. Everything else is standard XPath 3.1 (maps/arrays/functions).
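
For contrast, here is a rough sketch of the kind of imperative crawl loop that a `///url(...)` step stands in for. It is not wxpath's implementation: the `crawl` function, the `LinkParser` helper, the injected `fetch` callable, and the in-memory `SITE` are all hypothetical names used only for illustration, and whether wxpath also yields the seed page is not something this sketch tries to mirror.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(start_url, fetch, max_depth=1):
    """Breadth-first crawl yielding (url, depth, html), visiting each URL once.

    `fetch` is a hypothetical callable mapping a URL to an HTML string, so the
    sketch can run against an in-memory site instead of the network.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        yield url, depth, html
        if depth >= max_depth:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Mirrors the href predicate from the post; the post additionally
            # restricts to links under <main>, which is omitted here.
            if href.startswith("/wiki/") and ":" not in href:
                link = urljoin(url, href)
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))


# Tiny in-memory "site" standing in for real HTTP responses.
SITE = {
    "https://example.org/wiki/A": '<a href="/wiki/B">B</a> <a href="/wiki/File:x.png">img</a>',
    "https://example.org/wiki/B": '<a href="/wiki/A">A</a> <a href="/wiki/C">C</a>',
    "https://example.org/wiki/C": "",
}

visited = [
    (url, depth)
    for url, depth, _ in crawl("https://example.org/wiki/A", SITE.__getitem__, max_depth=1)
]
print(visited)
```

The queue, the `seen` set, the depth check, and the link filter are the bookkeeping that the single expression above folds into `///url(...)` plus `max_depth`.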

Features:

- Async/concurrent crawling with streaming results

- Scrapy-inspired auto-throttle and polite crawling

- Hook system for custom processing

- CLI for quick experiments
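
To give a sense of what "Scrapy-inspired auto-throttle" can mean, here is a minimal sketch of the delay-adjustment rule described in Scrapy's AutoThrottle documentation. wxpath's own throttling may well differ; the function name `next_delay` and its parameters are assumptions made for this example.

```python
def next_delay(prev_delay, latency, status_ok,
               target_concurrency=1.0, min_delay=0.0, max_delay=60.0):
    """Return the delay to wait before the next request to the same host."""
    # Aim for roughly `target_concurrency` requests in flight per host.
    target = latency / target_concurrency
    # Move halfway toward the target to smooth out noisy latencies.
    candidate = (prev_delay + target) / 2.0
    # Keep the delay within the configured bounds.
    candidate = min(max(candidate, min_delay), max_delay)
    # A failed response (often a fast error page) must not speed the crawler up.
    if not status_ok and candidate < prev_delay:
        return prev_delay
    return candidate


delay = 1.0
history = []
for latency, status_ok in [(0.5, True), (3.0, True), (0.25, False)]:
    delay = next_delay(delay, latency, status_ok)
    history.append(delay)
print(history)
```

A slow server pushes the delay up, a fast one lets it drift back down, and error responses are never allowed to lower it.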

Another example: paginating through HN comment pages (via the `follow=` argument) and extracting data:

    url('https://news.ycombinator.com',
        follow=//a[text()='comments']/@href | //a[@class='morelink']/@href)
        //tr[@class='athing']
          /map {
            'text': .//div[@class='comment']//text(),
            'user': .//a[@class='hnuser']/@href,
            'parent_post': .//span[@class='onstory']/a/@href
          }
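
To make the `//tr[@class='athing']/map { ... }` step concrete: it builds one map per matched row. Below is a rough standard-library equivalent of that step alone. The markup in `PAGE` is hypothetical and simplified (real HN pages are not well-formed XML and would need a proper HTML parser), the helper names are made up for this sketch, and joining the text nodes into one string is a simplification of what `//text()` returns.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for an HN comments page.
PAGE = """
<table>
  <tr class="athing">
    <td>
      <a class="hnuser" href="user?id=alice">alice</a>
      <span class="onstory"><a href="item?id=1">story</a></span>
      <div class="comment">First <i>nested</i> comment</div>
    </td>
  </tr>
  <tr class="athing">
    <td>
      <a class="hnuser" href="user?id=bob">bob</a>
      <div class="comment">Second comment</div>
    </td>
  </tr>
</table>
"""


def first_href(row, path):
    """Return the href of the first element matching `path`, or None."""
    node = row.find(path)
    return node.get("href") if node is not None else None


def rows_to_maps(root):
    """Yield one dict per matching row, like `/map { ... }` does per node."""
    for row in root.findall(".//tr[@class='athing']"):
        comment = row.find(".//div[@class='comment']")
        yield {
            "text": "".join(comment.itertext()) if comment is not None else "",
            "user": first_href(row, ".//a[@class='hnuser']"),
            "parent_post": first_href(row, ".//span[@class='onstory']/a"),
        }


items = list(rows_to_maps(ET.fromstring(PAGE.strip())))
for item in items:
    print(item)
```

The relative `.//` paths are evaluated against each row in turn, which is why the XPath version also starts every field with `.//`.
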

Limitations: HTTP-only (no JS rendering yet), no crawl persistence. Both are on the roadmap if there's interest.

GitHub: https://github.com/rodricios/wxpath

PyPI: pip install wxpath

I'd love feedback on the expression syntax and any use cases this might unlock.

Thanks!

Comments

css_apologist•30m ago
xpath is so fucking cool

i can understand why it failed for general use, but shit like this revives my excitement

q: i'm not an expert, this looks like it extends xpath syntax? haven't seen stuff like the /map is this referring to the html map element? or a fp-style map?

rodricios•21m ago
I think xpath is cool too!

If wxpath can help revive some of that excitement, then I consider my project a success.

As for your question: wxpath does extend the XPath syntax, but `map` is not one of its additions, nor is it the HTML map element.

XPath 3.1 introduced first-class maps (and arrays) (https://www.w3.org/TR/xpath-31/#id-maps), and `map { ... }` is the constructor syntax for them; the leading `/` is the ordinary path operator, so `/map { ... }` builds one map per node on its left. It's an awesome feature that's especially useful for quickly delivering JSON-like objects.

rhdunn•20m ago
Maps were added in XPath 3.1 -- https://www.w3.org/TR/xpath-31/#id-maps.

There's currently work on XPath 4.0 -- https://qt4cg.org/specifications/xquery-40/xpath-40.html.
