Shameless plug: I run yakpdf.com, a hosted Puppeteer-based service if you want to avoid self-hosting. https://rapidapi.com/yakpdf-yakpdf/api/yakpdf
I documented the process here[0] if anyone needs examples of the CSS and loading web fonts. Apologies for the article being long-winded – it was the first one I published.
[0] https://johnh.co/blog/creating-pdfs-from-html-using-csharp
You can also easily generate screenshots if that's more suitable than PDFs.
You can also easily use this to do stuff like jam a set of images into an HTML table and PDF or screenshot them in that format.
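As a sketch of that images-into-a-table trick (the helper name and column width are made up, not from any particular library), you'd generate a throwaway page and feed it to the same headless browser:

```python
def image_grid_html(paths, cols=3):
    """Wrap image paths in a bare-bones HTML table, ready to hand to
    a headless browser for PDF or screenshot capture."""
    rows = [paths[i:i + cols] for i in range(0, len(paths), cols)]
    body = "".join(
        "<tr>" + "".join(f'<td><img src="{p}" width="300"></td>' for p in row) + "</tr>"
        for row in rows
    )
    return f"<html><body><table>{body}</table></body></html>"
```

Write the result to a temp file and point the browser at it with a file:// URL.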
For a proprietary solution, try Prince XML:
I previously used wkhtmltopdf, but it hasn't been supported for years and doesn't support the latest CSS, etc. It does support JS if you need it, but I'd probably look at headless Chromium or another solution if JS is needed.
Edit: Previous post with some good discussion: https://news.ycombinator.com/item?id=26578826
Yes it costs money. So does developer time.
But I upvote WeasyPrint for that instead.
Can't they just render the screen content to a PDF? It seems easy for other programs to do.
I tried yesterday. With compliments to the moms of the SWEs who coded the functionality in Firefox. Apparently putting the screen on a PDF page is an insurmountable task in 2025 (20 years ago it was still doable). I had to make a screenshot and process the picture to print it.
The maintainers are also very responsive, and helpful.
Amazing project
One way is to install poppler-utils and use pdfunite. There are many other open-source packages you can use as well.
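For instance, a thin wrapper around pdfunite might look like this (the helper names are made up; the in-argument-order merge behavior is poppler's):

```python
import subprocess

def pdfunite_command(inputs, output):
    """Build a pdfunite invocation; poppler's pdfunite concatenates
    the input PDFs in argument order into the output path."""
    return ["pdfunite", *inputs, output]

def merge_pdfs(inputs, output):
    # check=True raises CalledProcessError if pdfunite exits non-zero,
    # e.g. on a corrupt input file
    subprocess.run(pdfunite_command(inputs, output), check=True)
```

Equivalent to running `pdfunite a.pdf b.pdf merged.pdf` by hand.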
It's very likely to be a massive batch operation over a ton of HTML files that might not even be from their own site.
ended up using headless chrome specifically to make sure javascript things rendered properly
Edit: it appears so: https://news.ycombinator.com/item?id=15131840
with an elaborate script that relies on xdotool
/path/to/firefox --window-size 1700 --headless -screenshot myfile.png file://myfile.html
Easy, right?
Used this for many years... but beware:
- caveat 1: this is (or was) a more or less undocumented feature, and a few years ago it simply disappeared, only to come back in a later release.
- caveat 2: even though you can convert local files it does require internet access as any references to icons, style sheets, fonts and tracker pixels cause Firefox to attempt to retrieve them without any (sensible) timeout. So, running this on a server without internet access will make the process hang forever.
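Given caveat 2, it's worth putting a hard timeout around the invocation rather than letting the process hang; a sketch (flag spelling as in the command above, helper names made up):

```python
import subprocess

def firefox_screenshot_cmd(url, out_png, width=1700):
    """Build the headless-screenshot invocation shown above."""
    return ["firefox", "--headless", "--window-size", str(width),
            "-screenshot", out_png, url]

def firefox_screenshot(url, out_png, width=1700, timeout_s=60):
    # timeout= raises subprocess.TimeoutExpired instead of hanging forever
    # when offline assets (fonts, tracker pixels) never load
    subprocess.run(firefox_screenshot_cmd(url, out_png, width),
                   check=True, timeout=timeout_s)
```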
So I used Firefox for multi-page documents and Chromium for single-page invoices.
I spent a lot of time with different versions of both browsers, and numerous quirks made a very unpleasant experience.
Eventually I settled on Chromium (Ungoogled), which I use nowadays for invoices.
I am currently writing a WASM-ready PDF toolkit that can handle both HTML to PDF and then rendering PDF pages to SVG. However, it's not yet production-ready.
The underlying HTML engine is currently a severe "work in progress", but it gives me the low-level access that I need: https://azul.rs/reftest
At low volumes, my preferred approach is to select and extract text (copy/paste, perhaps using the poppler library for larger-scale work), dump that to plain-text and convert that (manually / scripted) to Markdown. From there you can get to PDF or pretty much any other format through tools such as pandoc.
pandoc --self-contained input.html -o output.html
pandoc input.html -o output.pdf --pdf-engine=<your engine>
It's much more versatile than PDF and, if the algorithm decides the user is allowed to read the document, then the user gets to make use of all of the document's options, like a better search function (PDF can't find words that are broken across lines because the information about what's a word is gone, transformed into coordinates of which characters need to go where). It's also much more readable on different screen sizes, as the user can resize the window to whatever is comfortable on a 27" screen, or fits on their pocket e-reader. You can even draw it on a canvas if you want to prevent people from extracting the decrypted strings (though it's evil, you have that option). There's only benefits!
PDF is the lazy way to half-ass a read-only document while screwing, ahem, making anyone using a mobile phone zoom, pan, and squint. Thankfully, phones are falling out of fash— wait, scratch that, I just heard text reflow is more relevant than ever as phone use continues to soar
I use PDF's so I can send them to my iPad to read offline, highlight them, annotate them, and then send them back to my filesystem with highlights and annotations intact.
I sure can't do that with any "nice formats" like HTML or TXT or EPUB or MOBI.
There is no mechanism for annotations in HTML or the other formats I listed. An editor would just be editing the original content in its own non-standardized, non-portable way, which is not desirable for a number of reasons.
So when you say:
> What you are describing are features of an editor, not a file format.
That is incorrect. It is an intentionally designed and standardized feature of the file format.
https://superuser.com/questions/333378/where-does-okular-sto...
And it's not a Chrome thing. I don't think any browsers support it, do they? It's not really clear there's a need for it, when collaborative editors already handle document annotations in their own ways.
> That's different. Those are a data structure defining annotations that are meant to be stored externally.
The protocol is a separate standard.
The format is JSON-LD. Putting JSON-LD into HTML isn't a question mark. (There's info at W3C.org about how to do that, too. Not that it's necessary. You can guess what it says.)
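For reference, a Web Annotation embedded in a page looks roughly like this (a sketch following the W3C examples; all the values are placeholders):

```html
<script type="application/ld+json">
{
  "@context": "http://www.w3.org/ns/anno.jsonld",
  "type": "Annotation",
  "body": { "type": "TextualBody", "value": "A reader's comment" },
  "target": {
    "source": "https://example.com/page.html",
    "selector": { "type": "TextQuoteSelector", "exact": "the annotated passage" }
  }
}
</script>
```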
But these aren't meant for direct user annotations in a general way.
The web standard doesn't define any standardized mechanism for one user to add highlights and comments, and another user to see them and edit them further.
The annotations are tools that software can use for its own purposes. They're not a user-facing feature like they are in PDF.
They're both called "annotations" but they're completely different. Completely different technologies for completely different use cases.
> [It] doesn't define any standardized mechanism for one user to add highlights and comments, and another user to see them and edit them further.
It does.
Sometimes you want one, sometimes, the other.
This is the part that the top commenter missed. Instead they decided that one format is "nice" and the other, by implication, isn't. I find PDFs a lot easier to keep organized en masse, I like that I can use them on any of my devices and it's easy for me to use them when I'm doing in-depth reading such as an ebook. Doubly so because my ereader also does text to speech and syncs across devices so I can read on my tablet while I'm on the exercise bike and then switch to listening to the same book on my phone with minimal seams and without losing my place. It is, in a word, nice.
- A text-to-speech engine should work better with the original HTML structure, where it sees bold tags, headings, and full sentences rather than broken-off ones
- Keeping PDFs organised: how would that differ from keeping any other filetype organised? I don't understand what difference you, "by implication", attribute to a file ending in .html or .pdf when it comes to handling them en masse. If anything, searching across them will be vastly easier for software (self-written or third-party) and more reliable, because it's all plain text
- Text and audio rendering syncing: I have no experience with that, but it doesn't sound like something that ought to fundamentally work for a display format and not for the source text format. Of course, the software has to support the format (and otherwise it's trivial to pdfify an HTML file, whereas the reverse is nearly impossible)
Theory doesn't matter here, tooling and standards do. And PDF doesn't just have the tooling for highlighting and annotations, it has the standards for them so that tools support them in an interoperable way. A highlight made with one tool can be removed with another, without altering the underlying content.
Video speed would sync to the exercise bike speed, giving a feeling of reality.
The core problem is that sweating inside a Quest isn't a good idea ...
Maybe instead of a quest you just display video to a screen? When I was using a hotel fitness center they had a peloton and that seems to be something you can do with those. It was a couple years back and I recall the video being loosely if at all tied to the speed you pedal at, but it was more fun than just looking at a wall while I pretend to go somewhere.
I'd love to see a text to speech engine that pronounces formatting but I think it might be more complicated than learning to pronounce something boldly. Am I yelling? Am I keeping my voice low but adding intensity? Can you automate answering that question in a way that's mostly correct most of the time? If something is in italics am I whispering, stage whispering, emphasizing or merely saying the title of an existing work out loud? It's a fundamental abuse of a text formatting engine to try to use it for speech formatting, you either have to use the existing tags for things they were never intended for or you have to start adding tags like <slywhisper> and <scream emotion="angry"> vs <scream emotion="excited">. That being said, an html-independent form of emotional text annotation might actually be a good idea as the inevitability of synthesized human voices being a part of our daily lives takes hold.
I find PDFs easier to organize than HTML because HTML is any number of files referencing each other across a directory structure that can have any size or shape, and a PDF is a single file. If I'm searching my library for Bob Wilson, I want his books to show up and I want them to have his picture in them if that's how the book was published but I don't want Bob_Wilson.jpeg to show up as a result. I could automate print to PDF from html or use the tool someone else posted in order to condense my saved HTMLs to single files but that's more processing time and effort in order to get what I already have from a PDF
Syncing position across HTML files may be doable, but syncing position across PDFs is done. You're absolutely right that that has nothing to do with the format but the (implied) question I was answering when I brought it up was why I would sometimes want one and other times want the other. That's why.
Finally, and probably the only one that really matters, inasmuch as all the other reasons can be coded around but this one can't: the places I get documents from distribute them in PDF, MOBI, and EPUB, but almost never in HTML.
This and trying to read the header/footer are the most annoying parts of pdf to audio apps. At least some apps will let you set a margin outside of which text is ignored, so every page doesn't start with the book title, author's name and chapter title and end with the page number.
Each one has things the other can't do. Neither is universally more flexible.
That's what the "self contained" option does: turn it into one nice file. Makes no difference if you copy example.pdf or example.html when both contain all images and styles (except one of them also contains the original semantic text)
Converting HTML to PDF shouldn't result in an image wrapped in a PDF. Text will be preserved as text in the final PDF. (Unless the converter is garbage, of course.)
I didn't mean literally an image, hence saying image-like. You get similar limitations to when using OCR, which seems very image-like to me
--embed-resources --standalone.
https://github.com/rstudio/rmarkdown/issues/2382
https://pandoc.org/MANUAL.html#:~:text=Deprecated%20synonym%...
- It duplicated the headline, one in the correct place top-center but then a 2nd copy of the headline left-aligned below that.
- It shrunk the width of the content of the page (in fact, it seems to have completely discarded the css for the #content selector)
- It discarded the CSS for my code blocks, so now they are unreadable.
- My images are no longer center-aligned
- It added CSS that was not in the original document. For some reason, it added hyphens: auto, overflow-wrap: break-word, text-rendering: optimizeLegibility, and font-kerning: normal. None of those rules existed anywhere in the original document. Now my text is breaking mid-word with hyphens inserted.
- It pointlessly HTML-escaped some characters (like every quotation mark in every paragraph). This didn't break anything, but just... why?
Implementing the same functionality is less than 100 lines of Python, so I'm just going to go that route. I've implemented it once before, but it was for a previous company, so I no longer have access to that code; it's about an afternoon of scripting and doesn't randomly destroy your documents. I don't know how pandoc got this so wrong.

For context: the document I am attempting to process has no JavaScript. It is a simple Emacs Org document (similar to Markdown) rendered to HTML and then processed with pandoc. The only external content was a couple of images.
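A sketch of that kind of script, inlining local images as data URIs, assuming a regex over src attributes is good enough for simple generated HTML (it isn't for arbitrary pages):

```python
import base64
import mimetypes
import re
from pathlib import Path

def inline_images(html, base_dir):
    """Replace local src="..." references with base64 data: URIs,
    producing a single self-contained HTML file."""
    def replace(match):
        src = match.group(1)
        if src.startswith(("http:", "https:", "data:")):
            return match.group(0)  # leave remote and already-inline sources alone
        mime = mimetypes.guess_type(src)[0] or "application/octet-stream"
        payload = base64.b64encode((Path(base_dir) / src).read_bytes()).decode("ascii")
        return match.group(0).replace(src, f"data:{mime};base64,{payload}")
    return re.sub(r'src="([^"]+)"', replace, html)
```

Crucially, this only touches src attributes, so it can't duplicate headings or rewrite your CSS.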
That's probably it. Most of my issues were related to CSS, which markdown does not have[0]. The duplicated headline wasn't CSS though, that is by far the oddest issue. I'll probably file a bug report to pandoc after I replace all the text in this document with lorem ipsum (it is a document for work, so I can't share it publicly in its current form).
[0] unless you embedded your own CSS since markdown technically allows arbitrary HTML
If you only have HTML to work with, you can also use Flying Saucer[3], which is what KeenWrite uses to preview Markdown documents when rendered as HTML. Flying Saucer uses an open-source version of iText[4] to produce PDF documents (from HTML source docs).
Another possibility is to use pandoc and LaTeX.
[2]: https://keenwrite.com/docs/user-manual.pdf
This would be far more efficient than spinning up an entire browser and printing PDFs to disk.
Depending on your requirements for both the PDF input and the HTML output, there is often no way to do this that is both easy and general. At its core, PDF is not designed to be universally reflowable.
Ghostscript works with PostScript natively and will likely manage the idiosyncrasies of web content better. It's got a decent ecosystem and a command line, and you can find GUIs if that's your thing (no judgement, your lifestyle is none of my business).
Many other good tools are mentioned here as well, but if you're asking because you need more, or fine-grained (near-infinite) control over the PDF composition, there's nothing OSS I can think of that approaches its capabilities.
It only handles like 5% of HTML, but it's the 5% I was using.
I've also had success producing PDFs with Ghostscript from a PostScript file. PostScript is really easy to write, almost like SVG.
[0]: https://carbone.io/
My main use for that is printing appointment information, tickets, and product listings. The product listings are useful when trying to find in a store something that's supposedly available and in stock. Usually, only the first page is useful. There will be additional useless pages of irrelevant items, deals, and ads.
```
pandoc input.html -t typst -o output.typ
typst compile output.typ output.pdf
```
Unfortunately, that server and software stack is still around and still in production.
that means you did a good job.
createPDF(configuration:completionHandler:)
https://developer.apple.com/documentation/webkit/wkwebview/c...:)
Print it? Archive it? Send it via email? Read it on another device (which one)?
Depending on that, there are different solutions and trade-offs. For example on how to deal with pagination.
Not related to the thread, but if anyone is looking to hire a developer or knows of opportunities, I was recently let go and am actively searching. Any leads or feedback would be greatly appreciated.
Sample PDF: https://drive.google.com/file/d/1n7M1TKOptSsYiibrbvV_Yojx53T...
Once you've got an appropriate canonical version in any of these options, you have an embarrassment of riches to convert to any given document format (what I call endpoints) you'd care for: PDF, HTML, RTF, DOCX, or many, many others. I generally reach for Pandoc first, which itself, yes, of course, often relies on additional tools/libraries to parse or generate endpoints, but is quite versatile.
You can simplify the intake of HTML by stripping out cruft. Readability, Beautiful Soup, or other HTML filtering tools can target the core content and metadata you most likely want.
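In the same spirit, even the stdlib gets you a crude version of that filtering (a toy sketch, nowhere near what Readability actually does):

```python
from html.parser import HTMLParser

class CruftStripper(HTMLParser):
    """Collect visible text, skipping script/style/nav/footer subtrees."""
    SKIP = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def strip_cruft(html):
    parser = CruftStripper()
    parser.feed(html)
    return "\n".join(parser.parts)
```

For real pages you'd reach for Readability or Beautiful Soup instead; this just shows the shape of the problem.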
Otherwise, think through what you're doing and why to more narrowly define your goals and tools. E.g., if you want a faithful printed representation of a mainstream-browser-rendered page (that is, Google Chrome), you'd probably do best to use its print-to-PDF options (mentioned several times here). If you want to extract core text, filtering out much of today's WWW cruft will be a high priority.
https://gotenberg.dev/docs/routes
We use it in production for pdf exports and reports generation.
Just spin up a docker container and use a client library or REST API to send html data.
brudgers•4mo ago
I mean everything has dependencies (some of the solutions elsewhere require Chrome and other common solutions require the JVM). At least Pandoc is GPL.
brudgers•4mo ago
Because lots of things work this way. For example, compilers built on LLVM use an intermediate representation, and Python uses bytecode.
I suspect some HTML-to-PDF tools go through PostScript.
brudgers•4mo ago
Of course LaTeX gives you fine control to hand-tune the engine… but that doesn't seem like what the OP is looking for.
brudgers•4mo ago
Theoretically you can drive nails with a 22 caliber blank cartridge without making the “call” through a nail gun. But you won’t finish laying shingles as quickly and easily…
Or to put it another way, there’s a reason assemblers are almost always better than machine code and compilers are almost always better than assemblers for the ends people care about.
I mean, why use LaTeX at all when you could write your own typesetting language? Maybe because you are not Knuth.
w10-1•4mo ago
Go through the revision and bug history to see a sample of issues you're avoiding by using a highly-trafficked, well-supported solution.
The only reason not to use it is when they say they don't support a given feature that you need; and the nice thing there is that they'll usually say it, and have a good reason why.
The other reason to use pandoc is that while you might currently want PDF as your outbound format, you might end up preferring some other format (structured logically instead of by layout); with pandoc that change would be easy.
Finally, pandoc is extensible. If you do find that you want different output in some respect, you can easily write a plugin (in Python or Haskell or ...) to make exactly the tweak you need.
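For example, the tweak itself is usually just a small function over pandoc's JSON AST; a sketch (the walking/IO glue would come from a filter package such as pandocfilters, which is an assumption here, as is the function name):

```python
def demote_headers(key, value, fmt, meta):
    """Pandoc-filter action: push every header down one level,
    e.g. when embedding a converted document under an existing H1."""
    if key == "Header":
        level, attr, inlines = value
        # cap at 6, the deepest header level pandoc emits
        return {"t": "Header", "c": [min(level + 1, 6), attr, inlines]}
    return None  # None means "leave the element unchanged"
```

Hooked up via something like `pandocfilters.toJSONFilter(demote_headers)` and run as `pandoc input.html --filter ./demote.py -o output.pdf` (invocation assumed).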