Shameless plug: I run yakpdf.com, a hosted Puppeteer-based service if you want to avoid self-hosting. https://rapidapi.com/yakpdf-yakpdf/api/yakpdf
I documented the process here[0] if anyone needs examples of the CSS and loading web fonts. Apologies for the article being long-winded – it was the first one I published.
[0] https://johnh.co/blog/creating-pdfs-from-html-using-csharp
You can also easily generate screenshots if that's more suitable than PDFs.
You can also easily use this to do stuff like jam a set of images into an HTML table and PDF or screenshot them in that format.
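As a sketch of that images-into-a-table trick (the helper name and column width are made up, not from any particular library), you'd generate a throwaway page and feed it to the same headless browser:

```python
def image_grid_html(paths, cols=3):
    """Wrap image paths in a bare-bones HTML table, ready to hand to
    a headless browser for PDF or screenshot capture."""
    rows = [paths[i:i + cols] for i in range(0, len(paths), cols)]
    body = "".join(
        "<tr>" + "".join(f'<td><img src="{p}" width="300"></td>' for p in row) + "</tr>"
        for row in rows
    )
    return f"<html><body><table>{body}</table></body></html>"
```

Write the result to a temp file and point the browser at it with a file:// URL.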
For a proprietary solution, try Prince XML:
I previously used wkhtmltopdf, but it hasn't been supported for years and doesn't support the latest CSS, etc. It does support JS if you need it, but I'd probably look at headless Chromium or another solution if JS is needed.
Edit: Previous post with some good discussion: https://news.ycombinator.com/item?id=26578826
Yes it costs money. So does developer time.
But I upvote WeasyPrint for that instead.
Can't they just render the screen content to a PDF? It seems easy for other programs to do.
I tried yesterday. With compliments to the moms of the SWEs who coded the functionality in Firefox. Apparently putting the screen on a PDF page is an insurmountable task in 2025 (20 years ago it was still doable). I had to make a screenshot and process the picture to print it.
The maintainers are also very responsive, and helpful.
Amazing project
One way is to install poppler-utils and use pdfunite. There are many other open-source packages you can use as well.
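For instance, a thin wrapper around pdfunite might look like this (the helper names are made up; the in-argument-order merge behavior is poppler's):

```python
import subprocess

def pdfunite_command(inputs, output):
    """Build a pdfunite invocation; poppler's pdfunite concatenates
    the input PDFs in argument order into the output path."""
    return ["pdfunite", *inputs, output]

def merge_pdfs(inputs, output):
    # check=True raises CalledProcessError if pdfunite exits non-zero,
    # e.g. on a corrupt input file
    subprocess.run(pdfunite_command(inputs, output), check=True)
```

Equivalent to running `pdfunite a.pdf b.pdf merged.pdf` by hand.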
It's very likely to be a massive batch operation over a ton of HTML files that might not even be from their own site.
ended up using headless chrome specifically to make sure javascript things rendered properly
Edit: it appears so: https://news.ycombinator.com/item?id=15131840
with an elaborate script that relies on xdotool
/path/to/firefox --window-size 1700 --headless -screenshot myfile.png file://myfile.html
Easy, right?
Used this for many years... but beware:
- caveat 1: this is (or was) a more or less undocumented feature, and a few years ago it simply disappeared, only to come back in a later release.
- caveat 2: even though you can convert local files it does require internet access as any references to icons, style sheets, fonts and tracker pixels cause Firefox to attempt to retrieve them without any (sensible) timeout. So, running this on a server without internet access will make the process hang forever.
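Given caveat 2, it's worth putting a hard timeout around the invocation rather than letting the process hang; a sketch (flag spelling as in the command above, helper names made up):

```python
import subprocess

def firefox_screenshot_cmd(url, out_png, width=1700):
    """Build the headless-screenshot invocation shown above."""
    return ["firefox", "--headless", "--window-size", str(width),
            "-screenshot", out_png, url]

def firefox_screenshot(url, out_png, width=1700, timeout_s=60):
    # timeout= raises subprocess.TimeoutExpired instead of hanging forever
    # when offline assets (fonts, tracker pixels) never load
    subprocess.run(firefox_screenshot_cmd(url, out_png, width),
                   check=True, timeout=timeout_s)
```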
So I used Firefox for multi-page documents and Chromium for single-page invoices.
I spent a lot of time with different versions of both browsers, and numerous quirks made a very unpleasant experience.
Eventually I settled on Chromium (Ungoogled), which I use nowadays for invoices.
I am currently writing a WASM-ready PDF toolkit that can handle both HTML to PDF and then rendering PDF pages to SVG. However, it's not yet production-ready.
The underlying HTML engine is currently a severe "work in progress", but it gives me the low-level access that I need: https://azul.rs/reftest
At low volumes, my preferred approach is to select and extract text (copy/paste, perhaps using the poppler library for larger-scale work), dump that to plain-text and convert that (manually / scripted) to Markdown. From there you can get to PDF or pretty much any other format through tools such as pandoc.
pandoc --self-contained input.html -o output.html
pandoc input.html -o output.pdf --pdf-engine=<your engine>
It's much more versatile than PDF and, if the algorithm decides the user is allowed to read the document, then the user gets to make use of all of the document's options, like a better search function (PDF can't find words that are broken across lines because the information about what's a word is gone, transformed into coordinates of which characters need to go where). It's also much more readable on different screen sizes, as the user can resize the window to whatever is comfortable on a 27" screen, or fits on their pocket e-reader. You can even draw it on a canvas if you want to prevent people from extracting the decrypted strings (though it's evil, you have that option). There's only benefits!
PDF is the lazy way to half-ass a read-only document while screwing, ahem, making anyone using a mobile phone zoom, pan, and squint. Thankfully, phones are falling out of fash— wait, scratch that, I just heard text reflow is more relevant than ever as phone use continues to soar
I use PDF's so I can send them to my iPad to read offline, highlight them, annotate them, and then send them back to my filesystem with highlights and annotations intact.
I sure can't do that with any "nice formats" like HTML or TXT or EPUB or MOBI.
There is no mechanism for annotations in HTML or the other formats I listed. An editor would just be editing the original content in its own non-standardized, non-portable way, which is not desirable for a number of reasons.
So when you say:
> What you are describing are features of an editor, not a file format.
That is incorrect. It is an intentionally designed and standardized feature of the file format.
https://superuser.com/questions/333378/where-does-okular-sto...
And it's not a Chrome thing. I don't think any browsers support it, do they? It's not really clear there's a need for it, when collaborative editors already handle document annotations in their own ways.
> That's different. Those are a data structure defining annotations that are meant to be stored externally.
The protocol is a separate standard.
The format is JSON-LD. Putting JSON-LD into HTML isn't a question mark. (There's info at W3C.org about how to do that, too. Not that it's necessary. You can guess what it says.)
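For reference, a Web Annotation embedded in a page looks roughly like this (a sketch following the W3C examples; all the values are placeholders):

```html
<script type="application/ld+json">
{
  "@context": "http://www.w3.org/ns/anno.jsonld",
  "type": "Annotation",
  "body": { "type": "TextualBody", "value": "A reader's comment" },
  "target": {
    "source": "https://example.com/page.html",
    "selector": { "type": "TextQuoteSelector", "exact": "the annotated passage" }
  }
}
</script>
```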
But these aren't meant for direct user annotations in a general way.
The web standard doesn't define any standardized mechanism for one user to add highlights and comments, and another user to see them and edit them further.
The annotations are tools that software can use for its own purposes. They're not a user-facing feature like they are in PDF.
They're both called "annotations" but they're completely different. Completely different technologies for completely different use cases.
> [It] doesn't define any standardized mechanism for one user to add highlights and comments, and another user to see them and edit them further.
It does.
Sometimes you want one, sometimes, the other.
This is the part that the top commenter missed. Instead they decided that one format is "nice" and the other, by implication, isn't. I find PDFs a lot easier to keep organized en masse, I like that I can use them on any of my devices and it's easy for me to use them when I'm doing in-depth reading such as an ebook. Doubly so because my ereader also does text to speech and syncs across devices so I can read on my tablet while I'm on the exercise bike and then switch to listening to the same book on my phone with minimal seams and without losing my place. It is, in a word, nice.
- A text-to-speech engine should work better with the original HTML structure, where it sees bold tags, headings, and full sentences rather than broken-off ones
- Keeping PDFs organised: how would that differ from keeping any other filetype organised? I don't understand what difference you, "by implication", attribute to a file ending in .html or .pdf when it comes to handling them en masse. If anything, searching across them will be vastly easier for software (self-written or third-party) and more reliable, because it's all plain text
- Text and audio rendering syncing: I have no experience with that, but it doesn't sound like something that ought to fundamentally work for a display format and not for the source text format. Of course, the software has to support the format (and otherwise it's trivial to pdfify an HTML file, whereas the reverse is nearly impossible)
Theory doesn't matter here, tooling and standards do. And PDF doesn't just have the tooling for highlighting and annotations, it has the standards for them so that tools support them in an interoperable way. A highlight made with one tool can be removed with another, without altering the underlying content.
Video speed would sync to the exercise bike speed, giving a feeling of reality.
The core problem is that sweating inside a Quest isn't a good idea ...
Maybe instead of a quest you just display video to a screen? When I was using a hotel fitness center they had a peloton and that seems to be something you can do with those. It was a couple years back and I recall the video being loosely if at all tied to the speed you pedal at, but it was more fun than just looking at a wall while I pretend to go somewhere.
I'd love to see a text to speech engine that pronounces formatting but I think it might be more complicated than learning to pronounce something boldly. Am I yelling? Am I keeping my voice low but adding intensity? Can you automate answering that question in a way that's mostly correct most of the time? If something is in italics am I whispering, stage whispering, emphasizing or merely saying the title of an existing work out loud? It's a fundamental abuse of a text formatting engine to try to use it for speech formatting, you either have to use the existing tags for things they were never intended for or you have to start adding tags like <slywhisper> and <scream emotion="angry"> vs <scream emotion="excited">. That being said, an html-independent form of emotional text annotation might actually be a good idea as the inevitability of synthesized human voices being a part of our daily lives takes hold.
I find PDFs easier to organize than HTML because HTML is any number of files referencing each other across a directory structure that can have any size or shape, and a PDF is a single file. If I'm searching my library for Bob Wilson, I want his books to show up and I want them to have his picture in them if that's how the book was published but I don't want Bob_Wilson.jpeg to show up as a result. I could automate print to PDF from html or use the tool someone else posted in order to condense my saved HTMLs to single files but that's more processing time and effort in order to get what I already have from a PDF
Syncing position across HTML files may be doable, but syncing position across PDFs is done. You're absolutely right that that has nothing to do with the format but the (implied) question I was answering when I brought it up was why I would sometimes want one and other times want the other. That's why.
Finally, and probably the only one that really matters, inasmuch as all the other reasons can be coded around but this one can't: the places I get documents from distribute them in PDF, MOBI, and EPUB, but almost never in HTML.
This and trying to read the header/footer are the most annoying parts of pdf to audio apps. At least some apps will let you set a margin outside of which text is ignored, so every page doesn't start with the book title, author's name and chapter title and end with the page number.
Each one has things the other can't do. Neither is universally more flexible.
That's what the "self contained" option does: turn it into one nice file. Makes no difference if you copy example.pdf or example.html when both contain all images and styles (except one of them also contains the original semantic text)
Converting HTML to PDF shouldn't result in an image wrapped in a PDF. Text will be preserved as text in the final PDF. (Unless the converter is garbage, of course.)
I didn't mean literally an image, hence saying image-like. You get similar limitations to when using OCR, which seems very image-like to me
--embed-resources --standalone.
https://github.com/rstudio/rmarkdown/issues/2382
https://pandoc.org/MANUAL.html#:~:text=Deprecated%20synonym%...
- It duplicated the headline, one in the correct place top-center but then a 2nd copy of the headline left-aligned below that.
- It shrunk the width of the content of the page (in fact, it seems to have completely discarded the css for the #content selector)
- It discarded the CSS for my code blocks, so now they are unreadable.
- My images are no longer center-aligned
- It added CSS that was not in the original document. For some reason, it added hyphens: auto, overflow-wrap: break-word, text-rendering: optimizeLegibility, and font-kerning: normal. None of those rules existed anywhere in the original document. Now my text is breaking mid-word with hyphens inserted.
- It pointlessly HTML-escaped some characters (like every quotation mark in every paragraph). This didn't break anything, but just... why?
Implementing the same functionality is less than 100 lines of Python, so I'm just going to go that route. I've implemented it once before, but it was for a previous company, so I no longer have access to that code; it's about an afternoon of scripting and doesn't randomly destroy your documents. I don't know how pandoc got this so wrong.

For context: the document I am attempting to process has no JavaScript. It is a simple Emacs Org document (similar to Markdown) rendered to HTML and then processed with pandoc. The only external content was a couple of images.
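A sketch of that kind of script, inlining local images as data URIs, assuming a regex over src attributes is good enough for simple generated HTML (it isn't for arbitrary pages):

```python
import base64
import mimetypes
import re
from pathlib import Path

def inline_images(html, base_dir):
    """Replace local src="..." references with base64 data: URIs,
    producing a single self-contained HTML file."""
    def replace(match):
        src = match.group(1)
        if src.startswith(("http:", "https:", "data:")):
            return match.group(0)  # leave remote and already-inline sources alone
        mime = mimetypes.guess_type(src)[0] or "application/octet-stream"
        payload = base64.b64encode((Path(base_dir) / src).read_bytes()).decode("ascii")
        return match.group(0).replace(src, f"data:{mime};base64,{payload}")
    return re.sub(r'src="([^"]+)"', replace, html)
```

Crucially, this only touches src attributes, so it can't duplicate headings or rewrite your CSS.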
That's probably it. Most of my issues were related to CSS, which markdown does not have[0]. The duplicated headline wasn't CSS though, that is by far the oddest issue. I'll probably file a bug report to pandoc after I replace all the text in this document with lorem ipsum (it is a document for work, so I can't share it publicly in its current form).
[0] unless you embedded your own CSS since markdown technically allows arbitrary HTML
If you only have HTML to work with, you can also use Flying Saucer[3], which is what KeenWrite uses to preview Markdown documents when rendered as HTML. Flying Saucer uses an open-source version of iText[4] to produce PDF documents (from HTML source docs).
Another possibility is to use pandoc and LaTeX.
[2]: https://keenwrite.com/docs/user-manual.pdf
This would be far more efficient than spinning up an entire browser and printing PDFs to disk.
Depending on your requirements for both the PDF input and the HTML output, there is often no way to do this that is both easy and general. At its core, PDF is not designed to be universally reflowable.
Ghostscript works with PostScript natively and will likely manage the idiosyncrasies of web content better. It's got a decent ecosystem and a command line, and you can find GUIs if that's your thing (no judgement, your lifestyle is none of my business).
Many other good tools are mentioned here as well, but if you're asking because you need more, or fine-grained (near-infinite) control over the PDF composition, there's nothing OSS I can think of that approaches its capabilities.
It only handles like 5% of HTML, but it's the 5% I was using.
I've also had success producing PDFs with Ghostscript from a PostScript file. PostScript is really easy to write, almost like SVG.
[0]: https://carbone.io/
My main use for that is printing appointment information, tickets, and product listings. The product listings are useful when trying to find in a store something that's supposedly available and in stock. Usually, only the first page is useful. There will be additional useless pages of irrelevant items, deals, and ads.
```
pandoc input.html -t typst -o output.typ
typst compile output.typ output.pdf
```
Unfortunately, that server and software stack is still around and still in production.
that means you did a good job.
createPDF(configuration:completionHandler:)
https://developer.apple.com/documentation/webkit/wkwebview/c...:)
Print it? Archive it? Send it via email? Read it on another device (which one)?
Depending on that, there are different solutions and trade-offs. For example on how to deal with pagination.
Not related to the thread, but if anyone is looking to hire a developer or knows of opportunities, I was recently let go and am actively searching. Any leads or feedback would be greatly appreciated.
Sample PDF: https://drive.google.com/file/d/1n7M1TKOptSsYiibrbvV_Yojx53T...
Once you've got an appropriate canonical version in any of these options, you have an embarrassment of riches to convert to any given document format (what I call endpoints) you'd care for: PDF, HTML, RTF, DOCX, or many, many others. I generally reach for Pandoc first, which itself, yes, of course, often relies on additional tools/libraries to parse or generate endpoints, but is quite versatile.
You can simplify the intake of HTML by stripping out cruft. Readability, Beautiful Soup, or other HTML filtering tools can target the core content and metadata you most likely want.
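In the same spirit, even the stdlib gets you a crude version of that filtering (a toy sketch, nowhere near what Readability actually does):

```python
from html.parser import HTMLParser

class CruftStripper(HTMLParser):
    """Collect visible text, skipping script/style/nav/footer subtrees."""
    SKIP = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def strip_cruft(html):
    parser = CruftStripper()
    parser.feed(html)
    return "\n".join(parser.parts)
```

For real pages you'd reach for Readability or Beautiful Soup instead; this just shows the shape of the problem.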
Otherwise, think through what you're doing and why to more narrowly define your goals and tools. E.g., if you want a faithful printed representation of a mainstream-browser-rendered page (that is, Google Chrome), you'd probably do best to use its print-to-PDF options (mentioned several times here). If you want to extract core text, filtering out much of today's WWW cruft will be a high priority.
https://gotenberg.dev/docs/routes
We use it in production for pdf exports and reports generation.
Just spin up a docker container and use a client library or REST API to send html data.
brudgers•4mo ago
I mean everything has dependencies (some of the solutions elsewhere require Chrome and other common solutions require the JVM). At least Pandoc is GPL.
brudgers•4mo ago
Because lots of things work this way. For example, compilers built on LLVM use an intermediate representation, and Python uses bytecode.
I suspect some HTML-to-PDF tools go through PostScript.
brudgers•4mo ago
Of course LaTeX gives you fine control to hand-tune the engine… but that doesn't seem like what the OP is looking for.
brudgers•4mo ago
Theoretically you can drive nails with a 22 caliber blank cartridge without making the “call” through a nail gun. But you won’t finish laying shingles as quickly and easily…
Or to put it another way, there’s a reason assemblers are almost always better than machine code and compilers are almost always better than assemblers for the ends people care about.
I mean, why use LaTeX at all when you could write your own typesetting language? Maybe because you are not Knuth.
w10-1•4mo ago
Go through the revision and bug history to see a sample of issues you're avoiding by using a highly-trafficked, well-supported solution.
The only reason not to use it is when they say they don't support a given feature that you need; and the nice thing there is that they'll usually say it, and have a good reason why.
The other reason to use pandoc is that while you might currently want PDF as your outbound format, you might end up preferring some other format (structured logically instead of by layout); with pandoc that change would be easy.
Finally, pandoc is extensible. If you do find that you want different output in some respect, you can easily write a plugin (in Python or Haskell or ...) to make exactly the tweak you need.
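For example, the tweak itself is usually just a small function over pandoc's JSON AST; a sketch (the walking/IO glue would come from a filter package such as pandocfilters, which is an assumption here, as is the function name):

```python
def demote_headers(key, value, fmt, meta):
    """Pandoc-filter action: push every header down one level,
    e.g. when embedding a converted document under an existing H1."""
    if key == "Header":
        level, attr, inlines = value
        # cap at 6, the deepest header level pandoc emits
        return {"t": "Header", "c": [min(level + 1, 6), attr, inlines]}
    return None  # None means "leave the element unchanged"
```

Hooked up via something like `pandocfilters.toJSONFilter(demote_headers)` and run as `pandoc input.html --filter ./demote.py -o output.pdf` (invocation assumed).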