Hikugen's main features:
- Automatically generates, runs, regenerates, and caches LLM-written extraction code (a rough sketch of this loop follows the list).
- It uses SQLite to persist the current working code for each page, so it can be reused across executions.
- It uses OpenRouter (https://openrouter.ai/) to call the LLM.
- It can fetch the page for you (it can even reuse Netscape-formatted cookie files), but you can also feed it raw HTML and still use the rest of the pipeline.
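To make the first bullet concrete, the core loop looks roughly like this. This is a simplified sketch of the idea, not Hikugen's actual code: call_llm is a placeholder for the OpenRouter request, and the table layout is illustrative.

import sqlite3

def extract_with_retries(db, url, html, schema, call_llm, max_attempts=3):
    """Reuse cached extractor code for this URL; otherwise generate, run,
    validate against the schema, and cache whatever worked."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS extractors (url TEXT PRIMARY KEY, code TEXT)"
    )
    row = db.execute(
        "SELECT code FROM extractors WHERE url = ?", (url,)
    ).fetchone()
    code, error = (row[0] if row else None), None
    for _ in range(max_attempts):
        if code is None:
            # Ask the LLM for fresh extraction code, feeding back the last error.
            code = call_llm(html, schema, error)
        try:
            namespace = {}
            exec(code, namespace)                  # generated code defines extract(html)
            data = namespace["extract"](html)
            result = schema.model_validate(data)   # Pydantic is the correctness gate
        except Exception as exc:
            code, error = None, str(exc)           # discard and regenerate next pass
            continue
        db.execute(
            "INSERT OR REPLACE INTO extractors (url, code) VALUES (?, ?)",
            (url, code),
        )
        db.commit()
        return result
    raise RuntimeError(f"no working extractor for {url} after {max_attempts} attempts")

The point of the SQLite cache is that the LLM only gets called again when a page's markup drifts enough to break the previously working code.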
Here's a snippet of what it looks like:
from hikugen import HikuExtractor
from pydantic import BaseModel
from typing import List

class Article(BaseModel):
    title: str
    author: str
    published_date: str
    content: str

class ArticlePage(BaseModel):
    articles: List[Article]

extractor = HikuExtractor(api_key="your-openrouter-api-key")

result = extractor.extract(
    url="https://example.com/articles",
    schema=ArticlePage,
)

for a in result.articles:
    print(a.title, a.author)
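If you'd rather fetch the page yourself, or already have the HTML, you can skip the built-in fetcher. A sketch, reusing ArticlePage from above: the stdlib's MozillaCookieJar reads the same Netscape cookie format mentioned earlier, and the html= keyword here is illustrative rather than the exact signature.

from http.cookiejar import MozillaCookieJar
import requests
from hikugen import HikuExtractor

# Netscape-format cookie files (what curl and browser export extensions
# write) can be parsed with the stdlib's MozillaCookieJar.
jar = MozillaCookieJar("cookies.txt")
jar.load(ignore_discard=True, ignore_expires=True)

html = requests.get("https://example.com/articles", cookies=jar).text

extractor = HikuExtractor(api_key="your-openrouter-api-key")
# Illustrative: pass pre-fetched HTML instead of a URL; check the docs
# for the exact parameter name.
result = extractor.extract(html=html, schema=ArticlePage)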
Hikugen is intentionally minimal: it doesn't attempt website navigation, login flows, headless browsers, or large-scale crawling. Just "given this HTML, extract this structured data".

A good chunk of this was built with Claude Code (shoutout to Harper's blog: https://harper.blog/2025/02/16/my-llm-codegen-workflow-atm/).
Would love feedback or ideas, especially from others playing with codegen for scraping tasks.