The Dangers of SSL Certificates

https://surfingcomplexity.blog/2025/12/27/the-dangers-of-ssl-certificates/

96•azhenley•1mo ago

Comments

loloquwowndueo•1mo ago

There are plenty of other technologies whose failure mode is a total outage, it’s not exclusive to a failed certificate renewal.

A certificate renewal process has several points at which failure can be detected and action taken, and it sounds like this team was relying only on a “failed to renew” alert/monitor.

A broken alerting system is mentioned “didn’t alert for whatever reason”.

If this certificate is so critical, they should also have something that alerts if you’re still serving a certificate with less than 2 weeks validity - by that time you should have already obtained and rotated in a new certificate. This gives plenty of time for someone to manually inspect and fix.
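A minimal sketch of that kind of check, using only Python's standard library. The function names and the 14-day default are illustrative assumptions, not anything the team in the article used. It connects the way an ordinary client would and looks at the certificate that is actually being served:

```python
import socket
import ssl
import time


def seconds_remaining(not_after: str, now: float) -> float:
    """Seconds until notAfter, given in the format that getpeercert()
    returns, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    return ssl.cert_time_to_seconds(not_after) - now


def served_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect like a normal client and return the notAfter of the leaf
    certificate the endpoint actually serves."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def needs_attention(host: str, min_days: float = 14.0) -> bool:
    """True if the served cert has less than min_days left, or if it
    already fails verification (expired, wrong name, untrusted)."""
    try:
        not_after = served_not_after(host)
    except ssl.SSLCertVerificationError:
        return True
    return seconds_remaining(not_after, time.time()) < min_days * 86400
```

Run daily from somewhere outside the serving infrastructure, `needs_attention("example.com")` returning True would be the signal to page someone. Which host to check and how to deliver the alert are left open here.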

Sounds like a case of “nothing in this automated process can fail, so we only need this one trivial monitor which also can’t fail so meh” attitude.

yearolinuxdsktp•1mo ago

Additionally, warnings can be built into the clients themselves. If you connect to a host whose cert expires in less than 2 weeks, print a warning in your client. That will be a further incentive not to let certs go unrenewed.

SoftTalker•1mo ago

Wait until they start expiring 47 days from issue (coming soon). Though maybe this will actually help, because it will happen often enough that you (a) won't completely forget how to deal with it and (b) have more motivation to be proactive.

tetha•1mo ago

> If this certificate is so critical, they should also have something that alerts if you’re still serving a certificate with less than 2 weeks validity - by that time you should have already obtained and rotated in a new certificate. This gives plenty of time for someone to manually inspect and fix.

This is also why you want a mix of alerts from the service user's point of view, as well as internal troubleshooting alerts. The user's point-of-view alerts usually give more value and can be surprisingly simple at times.

"Remaining validity of the certificates offered by the service" is a classic check from the user's point of view. It may not tell you why this is going wrong, but it tells you something is going wrong. This captures a multitude of different possible errors - certs not reloading, the wrong certs being loaded, certs not being issued, DNS going to the wrong instance, new, shorter cert lifecycles, outages at the CA, and so on.

And then you can add further checks into the machinery to speed up the process of finding out why: Checks if the cert creation jobs run properly, checks if the certs on disk / in secret store are loaded or not, ...

Good alerting solutions might also allow relationships between these alerts to simplify troubleshooting as well: Don't alert for the cert expiry, if there is a failed cert renew cron job, alert for that instead.
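One way to picture that relationship between alerts, as a rough sketch. The alert names and the `caused_by` mapping are made up for illustration; real alerting tools express this in their own configuration (inhibition or dependency rules) rather than in code like this:

```python
def alerts_to_page(firing: set[str], caused_by: dict[str, str]) -> set[str]:
    """Drop any firing alert whose declared cause is also firing, so the
    on-call person sees the root cause instead of the symptom."""
    return {alert for alert in firing if caused_by.get(alert) not in firing}


# Hypothetical dependency: expiry is a symptom of a failed renewal job.
CAUSED_BY = {"cert_expiring_soon": "cert_renew_job_failed"}
```

If both alerts fire, only `cert_renew_job_failed` is paged. If the renewal job looks healthy but the served cert is still old (a reload problem, say), `cert_expiring_soon` still gets through.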

flowerlad•1mo ago

We need a way to set multiple SSL certificates with overlapping duration. So if one certificate expires the backup certificate will become active. If the overlap is a couple of months then you have plenty of time to detect and fix the issue.

Having only one SSL certificate is a single point of failure, and we have eliminated single points of failure almost everywhere else.

woodruffw•1mo ago

You can do this pretty easily with Let’s Encrypt, to my knowledge. You can request reissuance every 30 days, for example, which would give you a ladder of three 90 day certificates.

Edit: but to be clear, I don’t understand why you’d want this. If you’re worried about your CA going offline, you should shorten your renewal period instead.

flowerlad•1mo ago

Do services such as K8S ingress and Azure web apps allow you to specify multiple certificates?

Update: looks like the answer is yes. So then the issue is people not taking advantage of this technique.

woodruffw•1mo ago

I don’t think there’s a ton of benefit to the technique. If you’re worried about getting too close to your certificate expiry via automation, the solution is to renew earlier rather than complicate things with a ladder of valid certs.

kees99•1mo ago

Exactly. It's not like backup certificates have validity starting at a future date.

flowerlad•1mo ago

Yes the backup certificate can have validity starting at a future date. You just need to wait till that future date to create it.

bawolff•1mo ago

There are reasons to do this, just not because of expiry.

The main reason to have multiple certs is so if your host (and cert prov key) is compromised, you can quickly switch to a backup, without first having to sort out getting a new cert issued.

miladyincontrol•1mo ago

If getting a new cert issued is some sort of thing you need to sort out, as in a process that takes time, you've already missed the target.

bawolff•1mo ago

If you want a backup system, it's best if it's self-contained. When your site is down, it's easier to just run a single command to copy over a single file in your control instead of depending on an external service.

throw0101c•1mo ago

> We need a way to set multiple SSL certificates with overlapping duration.

Both Apache (SSLCertificateFile) and nginx (ssl_certificate) allow for multiple files, though they cannot be of the same algorithm: you can have one RSA, one ECC, etc, but not (say) an ECC and another ECC. (This may be a limitation of OpenSSL.)

So if the RSA expires on Feb 1, you can have the ECC expire on Feb 14 or Mar 1.

deIeted•1mo ago

That's a lot of words coming from people who were against this very idea not that long ago. Before Let's Encrypt existed, 90% of you were violently against the idea. "No, that's not how it's supposed to work." That's how it was.

superkuh•1mo ago

For corporations, institutions, and for-profits this matters and there's no real good solution.

But for human persons and personal websites HTTP+HTTPS fixes this easily and completely. You get the best of both worlds. Fragile short lifetime pseudo-privacy if you want it (HTTPS) and long term stable access no matter what via HTTP. HTTPS-only does more harm than good. HTTP+HTTPS is far better than either alone.

deIeted•1mo ago

I think your only defense would be to pretend to be a bot at this point, because what you just said was completely ridiculous and embarrassing. You realize it's not a requirement that you have to post a comment when you have no idea what to say?

yesindeed4•1mo ago

I'm also leaning in this direction.

dvratil•1mo ago

Happened on the first day of my first on-call rotation - a cert for one of the key services expired. Autorenew failed, because one of the subdomains on the cert no longer resolved.

The main lesson we took from this was: you absolutely need monitoring for cert expiration, with an alert when (valid_to - now) becomes less than the typical refresh window.

It's easy to forget this, especially when it's not strictly part of your app, but essential nonetheless.

0x073•1mo ago

And it gets worse, as they are reducing the max validity to 47 days in 2029.

JoshTriplett•1mo ago

On the other hand, as the time gets shorter, it'll become less likely that something will go undetected for a long time.

dextercd•1mo ago

You need external monitoring of certificate validity. Your ACME client might not be sending failure notifications properly (like happened to Bazel here). The client could also think everything is OK because it acquired a new cert, meanwhile the certificate isn't installed properly (e.g., not reloading a service so it keeps using the old cert).

I have a simple Python script that runs every day and checks the certificates of multiple sites.

One time this script signaled that a cert was close to expiring even though I saw a newer cert in my browser. It turned out that I had accidentally launched another reverse proxy instance which was stuck on the old cert. Requests were randomly passed to either instance. The script helped me correct this mistake before it caused issues.
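This is not that script, just a sketch of how such a mixed-instance situation could be caught, assuming requests are spread across instances per connection. The helper names are hypothetical. The idea is to open several fresh connections and count how many distinct certificates come back:

```python
import hashlib
import socket
import ssl
from collections import Counter
from typing import Callable


def served_fingerprint(host: str, port: int = 443) -> str:
    """SHA-256 fingerprint of the leaf certificate from one fresh connection."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    return hashlib.sha256(der).hexdigest()


def sample_fingerprints(fetch: Callable[[], str], attempts: int = 20) -> Counter:
    """Call fetch() repeatedly and count how often each fingerprint appears.
    More than one key in the result means instances disagree."""
    return Counter(fetch() for _ in range(attempts))
```

For example, `sample_fingerprints(lambda: served_fingerprint("example.com"))` returning two keys would point at a stray instance still holding the old cert. Sampling is probabilistic, so a rarely-hit instance can still be missed.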

firesteelrain•1mo ago

There is a Prometheus plugin called ssl_exporter that will provide the ability for Grafana to display a dashboard of all of your certs and their expirations. But, the trick is that you need to know where all your certs are located. We were using Venafi to do auto discovery but a simple script to basically nmap your network provides the same functionality.

machinationu•1mo ago

relevant certificates could be located by scanning the certificate transparency logs

tialaramex•1mo ago

What you're monitoring is "Did my system request a renewed cert?" but what most people's customers care about is instead, "Did our HTTPS endpoint use an in-date certificate?"

For example say you've got an internal test endpoint, two US endpoints and a rest-of-world endpoint, physically located in four places. Maybe your renewal process works with a month left - but the code to replace working certificates in a running instance is bugged. So, maybe Monday that renewal happens, your "CT log monitor" approach is green, but nobody gets new certs.

On Wednesday engineers ship a new test release to the test endpoint, restarting and thus grabbing the renewed cert, for them everything seems great. Then on Friday afternoon a weird glitch happens for some US customers, restarting both US servers seems to fix the glitch and now US customers also see a renewed cert. But a month later the Asian customers complain everything is broken - because their endpoint is still using the old certificate.

xorcist•1mo ago

> Did our HTTPS endpoint use an in-date certificate?

For any non-trivial organization, you want to know when client certificates expire too.

In my experience, the easiest way is to export anything that remotely looks like a certificate to the monitoring system, and let people exclude the false positives. Of course, that requires you to have a monitoring system in the first place. That is no longer a given.

tialaramex•1mo ago

So, I've worked for both startups and large entities, including both an international corporation and a major university, and in all that time I've worked with exactly one system that used client TLS certificates. They mostly weren't from the Web PKI (and so none of these technologies are relevant, Let's Encrypt for example has announced and maybe even implemented choices to explicitly not issue client certs) and they were handled by a handful of people who I'd say were... not experts.

It's true that you could use client certs with say, Entra ID, and one day I will work somewhere that does that. Or maybe I won't, I'm an old man and "We should use client certs" is an ambition I've heard from management several times but never seen enacted, so the renaming of Azure AD to Entra ID doesn't seem likely to change that.

Once you're not using the Web PKI, cert expiry lifetimes are much more purpose specific. It might well make sense for your Entra ID apps to have 10 year certs because eh, if you need to kill a cert you can explicitly do that, it's not a vast global system where only expiry is realistically useful. If you're minting your own ten year certs, now expiry alerting is a very small part of your risk profile.

lucidnonsense•1mo ago

Client certificates aren't as esoteric as you think. They're not always used for web authentication, but many enterprises use them for WiFi/LAN authentication (EAP-TLS) and securing confidential APIs. Shops that run Kubernetes use mTLS for securing pod to pod traffic, etc. I've also seen them used for VPN authentication.

tialaramex•1mo ago

Huh. I have worked with Kubernetes so I guess it's possible that's a second place with client certs and I never noticed.

The big employers didn't use EAP-TLS with client certs. The University of course has Eduroam (for WiFi), and I guess in principle you could use client certs with Eduroam but that sounds like extra work with few benefits and I've never seen it from either the implementation side or the user side even though I've worked on or observed numerous Eduroam installs.

I checked install advice for my language (it might differ in other languages) and there's no sign that Eduroam thinks client certificates would be a good idea. Server certs are necessary to make this system work, and there's plenty of guidance on how to best obtain and renew these certificates e.g. does the Web PKI make sense for Eduroam or should you just busk it? But nothing about client certificates that I could see.

lucidnonsense•1mo ago

I can't comment on Eduroam as I have no experience working in the Edu space, but in general, EAP-TLS is considered to be the gold standard for WiFi/LAN authentication, as alternatives like EAP-TTLS and PEAP-MSCHAPv2 are all flawed in one way or another and rely on username/password auth, which is a weaker form of authentication than relying on asymmetric cryptography (mTLS). Passwords can be shared and phished, if you're not properly enforcing server cert validation, you will be susceptible to evil twin attacks, etc.

Of course, implementing EAP-TLS usually requires a robust way for distributing client certificates to the clients. If all your devices are managed, this is often done using the SCEP protocol. The CA can be either AD CS, your NAC solution, or a cloud PKI solution like SecureW2.

tialaramex•1mo ago

Yeah, I don't think EAP-TLS with client certs would work out well for Eduroam applications. You have a very large number of end users, they're only barely under your authority (students, not staff) and they have a wide variety of devices, also not under your control.

But even in Enterprise corporate settings I did not ever see this though I'm sure some people do it. It sounds like potentially a good idea, of course it can have excellent security properties, however one of the major downsides IMHO is that people wind up with the weakest link being a poorly secured SCEP endpoint. Bad guys could never hope to break the encryption needed to forge credentials, but they could trivially tail-gate a call center worker and get real credentials which work fine, so, who cares.

Maybe that's actually enough. Threat models where adversaries are willing to physically travel to your location (or activate a local asset) might be out of your league anyway. But it feels to me as if that's the wrong way to look at it.

machinationu•1mo ago

sure, I was just giving parent another way of finding all the certificates besides scanning the network

firesteelrain•1mo ago

I am airgapped and the certs are usually wildcard with multiple SANs. You would think that the SANs alone would tell you which host has a cert. But, it can be difficult to find all the hosts or even internal hosts that use TLS.

stackskipton•1mo ago

Blackbox exporter will do the same thing while testing HTTP and others.

compumike•1mo ago

100%, I've run into this too. I wrote some minimal scripts in Bash, Python, Ruby, Node.js (JavaScript), Go, and Powershell to send a request and alert if the expiration is less than 14 days from now: https://heyoncall.com/blog/barebone-scripts-to-check-ssl-cer... because anyone who's operating a TLS-secured website (which is... basically anyone with a website) should have at least that level of automated sanity check. We're talking about ~10 lines of Python!

weddpros•1mo ago

The scalable way (up to thousands of certificates) is https://sslboard.com. Give it one apex domain, it will find all your in-use certificates, then set alerts (email or webhook). Fully external monitoring and inventory.

jcgl•1mo ago

Looks like it relies on certificate transparency logs. That means that it won’t be able to monitor endpoints using wildcard certs. The best thing it could do would be to alert when a wildcard cert is expiring without a renewed cert having been issued.

lousken•1mo ago

Is that enough though? You may have wildcards on domains that are not even on a public DNS and you may forget to replace it "somewhere". For that reason it is better to either dump a list of domains from your local DNS or have e.g. zabbix or another agent on every host machine checking that file for you.

jcgl•1mo ago

That's exactly my point: while this service sounds quite useful for many common cases, it's going to fail in cases where there's not a 1-to-1 certificate-to-server mapping. Even outside of wildcards, you have to account for cases where the cert might be installed on N load balancers.

weddpros•1mo ago

If you're using a cert on multiple IPs, or IPv4+v6, SSLBoard will monitor all IPs. It's not foolproof, but it covers most common practices. btw wildcard certs don't have a good reputation (blast radius)...

jcgl•1mo ago

I'd say that load balancers (one-address-to-N-servers) count as a common practice, but I otherwise agree in that regard.

Regarding wildcard certs, eh. I wouldn't say they have a bad reputation. Sure, greater blast radius. But sometimes it can certainly simplify things to use one. Your ACME client configuration is easier and your TLS terminator configuration often becomes easier when the terminator would otherwise need to switch based on SNI.

weddpros•1mo ago

one-address-to-N-servers is perfect if the N servers don't all terminate TLS. If not, it becomes impossible to actually test what certificates are actually served. I've seen this fail before (TLS tests flip/flop between good/bad between checks).

As for wildcard certs, I agree there are use cases where we really need them like dynamic subdomains {customer}.status.com

Can you share how they make ACME client configuration easier?

jcgl•1mo ago

> Can you share how they make ACME client configuration easier?

It's not a profound difference, but you don't need to add each name to your config. Depending on the team's tooling and processes, that may be inconsequential. But in a setting where config management isn't handled super well, where the TLS terminator is a resource shared by multiple, distinct teams, this is a simplification that can make a difference at the margin.

Think less Cloudflare-scale, and more SMB scale (especially in a Windows shop or recovering Windows shop with a different kind of technical culture than what we might all be implicitly imagining).

weddpros•1mo ago

I'm working on something that could help: linking sslboard with software that's making issuance and distribution of certs easier, ie. a proper CLM. It's not cloud based for security reasons. In that context, we know your wildcard certs because we issue them, and we could know where they are if we distribute them... Please get in touch with me (chris@sslboard.com) if you're interested in early access and having a word in the development of the product!

jcgl•1mo ago

I didn't realize you were behind SSLBoard. I think you should've disclaimed that involvement at the beginning. I see now that it's in your bio, but disclaiming is still on you.

weddpros•1mo ago

Indeed, SSLBoard is scanning CT logs. You can add/import host names though, to allow monitoring of wildcard certs. Same if you're using ports that are not 443, you have to add these to the list of hostnames that are checked.

It's not as convenient, but it's the best SSLBoard can do...

KronisLV•1mo ago

> You need external monitoring of certificate validity.

Plug for Uptime Kuma, they support notifications ahead of expiry: https://github.com/louislam/uptime-kuma

Kind of cool to have an uptime monitoring tool that also had an option like that, two birds one stone and all that. Not affiliated with them, FOSS project.

throw20251220•1mo ago

TLS certificates… SSL is some old Java anachronism.

> There’s no natural signal back to the operators that the SSL certificate is getting close to expiry.

There is. The `notAfter` date is right there in the certificate itself. Just look at it with `openssl x509 -text` and set yourself up some alerts… it’s so frustrating having to refute such random bs every time when talking to clients because some guy on the internet has no idea but blogs about their own inefficiencies.

Furthermore, their autorenew should have been failing loud and clear, everyone should know from metrics or logs… but nobody noticed anything.

tomas789•1mo ago

I don’t think this is as simple as it seems. For example, we have our own CA and issue several mTLS certificates, with hundreds of them currently in use across our machines. We need to check every single one (which we don’t do yet) because there is an additional distribution step that might fail selectively. And that’s not even touching on expiring CAs, which is a total nightmare.

viraptor•1mo ago

If you have your own CA, you log every certificate with the expiry details. It's easier compared to an external CA because you automatically get the full asset list as long as you care to preserve it.

SoftTalker•1mo ago

When I ran my own CA I issued certificates with 99-year expiry dates, and I never worried about them again.

throw20251220•1mo ago

Why would it be difficult? You have a single CA, so a single place where certs are issued. That means there’s a single place with the knowledge of what certs are issued for which identity, how long are those valid for, and has there been a new cert issued for that identity prior to previous cert expiration. Could not be simpler, in fact.
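A sketch of what that single place could answer, assuming the CA keeps an issuance log of (identity, expiry) records. The `Issued` record and the 14-day window are illustrative assumptions, not any particular CA's schema:

```python
from datetime import datetime, timedelta
from typing import Iterable, NamedTuple


class Issued(NamedTuple):
    identity: str        # e.g. a service or host name
    not_after: datetime  # expiry of that certificate


def identities_at_risk(
    log: Iterable[Issued],
    now: datetime,
    window: timedelta = timedelta(days=14),
) -> list[str]:
    """Identities whose newest issued cert expires within the window,
    i.e. no replacement has been issued in time."""
    latest: dict[str, datetime] = {}
    for cert in log:
        current = latest.get(cert.identity)
        if current is None or cert.not_after > current:
            latest[cert.identity] = cert.not_after
    return sorted(name for name, expiry in latest.items() if expiry - now < window)
```

This only covers issuance. It says nothing about whether the new cert actually reached the machine, which is the distribution step mentioned upthread.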

ronsor•1mo ago

> TLS certificates… SSL is some old Java anachronism.

OpenSSL is still called OpenSSL. Despite "SSL" not being the proper name anymore, people are still going to use it.

By the way, TLS 1.3 is actually SSL v3.4 :)

toast0•1mo ago

If we're being picky, they're x.509 certificates, not TLS or SSL.

throw20251220•1mo ago

Thanks for the correction.

tialaramex•1mo ago

In this context the specific thing they are is certificates from the Web PKI. A PKI (Public Key Infrastructure) is an arrangement with Relying Parties (in this case, basically everybody), CAs (Certificate Authorities - in this case a mix of companies, not-for-profits, government and so on entities around the world) and Subscribers. The Subscriber says to a CA "I want you to certify that I'm some.website.example" and the CA issues them an X.509 certificate, which the Relying Parties trust to prove that this really is some.website.example. The Relying Parties (indirectly as we'll see shortly) ensure they trust only CAs who will do this name certifying job well. This uses Public Key encryption, which is a mathematical technology where you pick two related huge numbers, one public key (revealed to anyone who wants it) and one private (known only to you) and then you can prove you know the private key by performing arithmetic that anyone with the public key can verify is correct, and yet they could not perform that arithmetic without your private key.
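To make that last sentence concrete, here is a toy illustration using textbook RSA with tiny numbers. It is purely a sketch: real keys are thousands of bits long, and real signature schemes sign a padded hash rather than a raw integer.

```python
# Toy parameters only, chosen so the arithmetic is easy to follow.
p, q = 61, 53
n = p * q                           # public modulus (3233)
e = 17                              # public exponent
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent (needs Python 3.8+)


def sign(message: int) -> int:
    """Arithmetic only the holder of the private exponent d can perform."""
    return pow(message, d, n)


def verify(message: int, signature: int) -> bool:
    """Anyone who knows the public pair (n, e) can check the result."""
    return pow(signature, e, n) == message
```

Here `verify(65, sign(65))` holds, while a signature made for a different message does not verify.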

It is called the Web PKI because although this secures most of the Internet, the billions of Relying Parties are represented in practice almost solely by a handful of Trust Stores who mostly make Web Browsers. Specifically, Mozilla, Google, Microsoft and Apple.

The Web PKI requires that the certificates are not only X.509 but specifically they obey PKIX, RFC 5280 which explains how X.509 (a standard from the X.500 directory system, a directory which in reality never ended up existing) can be used for the Internet (which very much did end up existing) via "Alternative Names". When your modern certificates have a "Subject Alternative Name" the word Alternative there means alternative to the X.500 naming scheme, which is irrelevant to us, specifically the Internet's two alternatives, an ipAddress (4 bytes or 16 bytes forming either an IPv4 or IPv6 address) or a dnsName (a subset of ASCII characters, punctuated with but never ending in a dot)
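As a small sketch of those two alternatives in Python's standard library: the function names are made up, and the dnsName check is deliberately rough (it does not enforce the full character subset or wildcard rules).

```python
import ipaddress
from typing import Union


def decode_san_ip(raw: bytes) -> Union[ipaddress.IPv4Address, ipaddress.IPv6Address]:
    """Decode the raw bytes of an ipAddress entry: 4 bytes mean IPv4,
    16 bytes mean IPv6, anything else is malformed."""
    if len(raw) not in (4, 16):
        raise ValueError(f"ipAddress must be 4 or 16 bytes, got {len(raw)}")
    return ipaddress.ip_address(raw)


def looks_like_dns_name(name: str) -> bool:
    """Rough shape check: ASCII, dot-separated, no empty labels,
    and therefore no trailing dot."""
    return name.isascii() and all(name.split("."))
```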

Edited: Correct s/Server/Subject/ in expansion of SAN acronym

riffic•1mo ago

X.509 certificates

themafia•1mo ago

They specified a lot of stuff that ultimately didn't get used but ITU is still my favorite standards organization.

tialaramex•1mo ago

To the extent that it can be considered an "organization" the IETF is definitely a better Standards Development Organization than the ITU. Most importantly because the IETF is for people, and I'm a person, whereas as a UN Specialized Agency the ITU is for UN Member States and I am not and will never be a UN Member State.

themafia•1mo ago

They both publish their standards for free.

The ITU has a slight edge in that their published standards are generally of a higher quality and easier to implement.

I care about utility not about nominal designations.

tialaramex•1mo ago

Yeah, I don't see it. I looked at a few cases where there's close overlap in the work and mostly what I see is that the ITU doesn't want me looking at the draft stages of documents, which I guess isn't crucial for your implementation work. I couldn't see any particular higher quality though I suppose if you like PDFs and hate text that might go in the ITU's favour.

gmuslera•1mo ago

If you think SSL certificates are dangerous, try seeing the dangers of NOT using them, especially for a service that is a central repository of artifacts meant to be automatically deployed.

It is not about encryption (for that, a self-signed certificate lasting till 2035 would suffice), but verification: who am I talking to? Reaching the right server can be messed up by DNS or routing, among other things. Yes, that adds complexity, but we are talking more about trust than technology.

And once you recognize that it is essential to have a trusted service, give it the proper instrumentation to ensure that it works properly, including monitoring and expiration alerts, and documentation about it, rather than just saying "it works" and dismissing it.

May we retitle the post as "The dangers of not understanding SSL Certificates"?

duufuvkfmc•1mo ago

Debian’s apt does not use SSL as far as I know, and I am not aware of any serious security disaster. Their packages are signed and the content is not considered confidential.

direwolf20•1mo ago

The selection of packages installed on a server should be treated as confidential, but you could probably infer it from file sizes.

crote•1mo ago

If I'm not mistaken, apt repositories have very similar failure modes - just using PGP certs instead of SSL certs. The repository signing key can still expire or get revoked, and you'll have an even harder time getting every client to install a new one...

tuetuopay•1mo ago

Debian 13 uses https://deb.debian.org by default. Even the upgrade docs from 12 to 13 mention the https variant. They were quite hostile for a while to https, but now it seems they bit the bullet.

gmuslera•1mo ago

Debian has multiple mirrors, and some distributions even encourage running local mirrors. The model is different: as you say, the packages are signed, so you know who made them, wherever you got them from.

As I said above, SSL is about more than encryption; it is also about knowing that you are connecting to the right party. Maybe for a repository with multiple mirrors, DNS aliases and a layer of "knowing whom this comes from", it is not that essential, but for most of the rest, even if the information is public, knowing that it comes from the authoritative source, or really from who you think it comes from, is important.

firesteelrain•1mo ago

Operationally, the issue is rooted in simple monitoring and accurate inventory. The article is apt: “With SSL certificates, you usually don’t have the opportunity to build up operational experience working with them, unless something goes wrong”

You can update your cert ahead of time by appending ---NEW CERT--- to the same file as ---OLD CERT---.

But you also need to know where all your certificates are located. We were using Venafi for the auto discovery and email notifications. Prometheus ssl_exporter with Grafana integration and email alerts works the same. The problem is knowing where all hosts, containers and systems that have certs are located. Simple nmap style scan of all endpoints can help. But, you might also have containers with certs or you might have certs baked into VM images. Sure, there are all sorts of things like storing the cert in a CICD global variable, bind mounting secrets, Vault Secret Injector, etc.

But it’s all rooted in maintaining a valid, up to date TLS inventory. And that’s hard. As the article states: “There’s no natural signal back to the operators that the SSL certificate is getting close to expiry. To make things worse, there’s no staging of the change that triggers the expiration, because the change is time, and time marches on for everyone. You can’t set the SSL certificate expiration so it kicks in at different times for different cohorts of users.”

Every time this happens you play whack-a-mole with a change. You get better at it but not before you lose some credibility.

renewiltord•1mo ago

Can do with any weighted LB, right? E.g. route53 or Cloudflare LB. But even manually you just need k IPs (perhaps even 2) and have host k1 and host k2 report different (overlappingly valid) certs. Then (1/k) users will see a bad cert. Your usual hosts will see near zero failures but the canary will have 100% failures.

I’ve always used the calendar event before expiry and then manual renew option but I wonder why I didn’t do this. It’s trivial to roll out. With Route53 just make one canary LB and balance 1% traffic to it. Can be entirely automated.

firesteelrain•1mo ago

That would work. In my case, which I am living right now, I am dealing with multiple environments where we didn’t set up the environment and we get burned by an expiring cert here and there leading to an outage. Users have zero appetite for any outage whatsoever and our inventory is bad.

thecosmicfrog•1mo ago

> the failure mode is the opposite of graceful degradation. It’s not like there’s an increasing percentage of requests that fail as you get closer to the deadline. Instead, in one minute, everything’s working just fine, and in the next minute, every http request fails.

This has given me some interesting food for thought. I wonder how feasible it would be to create a toy webserver that did exactly this (failing an increasing percentage of requests as the deadline approaches)? My thought would be to start failing some requests as the deadline approaches a point where most would consider it "far too late" (e.g. 4 hours before `notAfter`). At this point, start responding to some percentage of requests with a custom HTTP status code (599 for the sake of example).
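A sketch of the core decision such a toy server would make, assuming a linear ramp over the last 4 hours. The function names are made up, and the ramp shape is just one possible choice:

```python
import random
from typing import Callable


def failure_fraction(now: float, not_after: float, ramp_seconds: float = 4 * 3600) -> float:
    """Fraction of requests to fail: 0.0 until the ramp starts, rising
    linearly to 1.0 at notAfter. Times are Unix timestamps."""
    remaining = not_after - now
    if remaining >= ramp_seconds:
        return 0.0
    if remaining <= 0:
        return 1.0
    return 1.0 - remaining / ramp_seconds


def should_fail(
    now: float,
    not_after: float,
    ramp_seconds: float = 4 * 3600,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide per request whether to answer with the custom status (599)."""
    return rng() < failure_fraction(now, not_after, ramp_seconds)
```

Two hours before expiry this fails about half of all requests. The `rng` parameter exists only so the decision can be tested deterministically.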

Probably a lot less useful than just monitoring each webserver endpoint's TLS cert using synthetics, but it's given me an idea for a fun project if nothing else.

johannes1234321•1mo ago

For a fun project it certainly is a fun idea.

In real life, I guess there are people who don't monitor at all. For them failing requests would go unnoticed ... for the others monitoring must be easy.

But I think the core thing might be to make monitoring SSL lifetime the "obvious" default: All the grafana dashboards etc should have such an entry.

Then as soon as I set up a monitoring stack I get that reminder as well.

loloquwowndueo•1mo ago

Your idea shifts monitoring to end users, which doesn’t sound awesome.

Just check expiration of the active certificate; if it’s under a threshold (say 1 week, assuming you auto-renew it when it’s 3 weeks to expiry; still serving a cert when it’s 1 week to expiration is enough signal that something went wrong) then you alert.

Then you just need to test that your alerting system is reliable. No need to use your users as canaries.

thecosmicfrog•1mo ago

Oh absolutely, I wouldn't use this for any production system. It would be a toy hobby project. I just find the notion of turning a no-degradation failure mode into a gradual-degradation one fascinating for some reason.

firesteelrain•1mo ago

This canary is a good thought. The problem the article highlights is that people don’t practice updates enough and assume someone else or something is handling it. You only get better at it the more often it happens which is partly why long expirations are not ideal. This is what the article is highlighting as the main issue.

loloquwowndueo•1mo ago

It’s not a good thought. Run a single client (uptime kuma) and ask it to alert you on expiration proximity. I.e. implement proper monitoring and alerting. No need to randomly degrade your users’ experience and hope they’ll notify you instead of shrugging and going to a site that doesn’t throw made-up http errors at them randomly.

firesteelrain•1mo ago

If a “canary” is degrading users, it’s misdesigned.

The canary narrows the blast radius and time-to-detection.

loloquwowndueo•1mo ago

Agreed. That’s exactly what the proposed canary is - misdesigned.

1970-01-01•1mo ago

I agree with this. Certs are designed to function as a digital cliff. They will either be accepted or they won't, with no safe middle ground. Therefore all certs in a chain can only be as reliable as the least understood cert in your certificate management.

deIeted•1mo ago

Nobody to blame but yourselves.

How long did it take for us to get to a "letsencrypt" setup? And exactly 100ms before that existed, you (meaning 90% of you) mocked and derided that very idea.

Spivak•1mo ago

Infra person here: you will need external monitoring at some point because checking that your site is up all over the world isn't something you want to do in house. Not because you couldn't but because their outages are likely to be uncorrelated with yours—AWS notwithstanding.

You'll have one of these things anyway, and I haven't seen one yet that doesn't let you monitor your cert and send you expiration notices in advance.

nrhrjrjrjtntbt•1mo ago

As always, you need a test that runs and notifies SRE or oncall. Ideally 14 or maybe 28 days before expiry.

0xbadcafebee•1mo ago

> With SSL certificates, you usually don’t have the opportunity to build up operational experience working with them, unless something goes wrong. And things don’t go wrong that often with certificates

Don't worry. With 2 or 3 industry players dictating how all TLS certs work, now your certs will expire in weeks rather than years, so you will all be subject to these failures more frequently. But as a back-stop to process failures like this, use automated immutable runbooks in CI/CD. It works like this:

1) Does it need a runbook? Ask yourself, if everything was deleted tomorrow, do you (and all the other people) remember every step needed to get everything running again? If not, it needs a runbook.

2) What's a runbook? It's a document that gives step by step instructions to do a thing. The steps can be text, video recordings, code/shell snippets, etc as long as it does not assume anything and gives all necessary instructions (or links to them) so a random braindead engineer at 3am can just do what it says and it'll result in a working thing.

3) Automate the runbook over time. Put more and more of the steps into some kind of script the user can just run. Put the script into a Docker container so that everyone's laptop environment doesn't have to be identical for the steps to work.

4) Run the containerized script from CI/CD. This ensures all credentials, environment vars, networking, etc are the same when it runs which better ensures success, and that leads to:

5) Running it frequently/on a schedule. Most CI/CD systems support scheduled jobs. Run your runbooks frequently to identify unexpected failures and fix bugs. Most of you get notifications for failed builds, so you'll see failed runbooks. If you use a cron job on a random server, the server could go down, the job could get deleted, or the reports of failure could go to /dev/null; but nobody's missing their CI/CD build failures.

Running runbooks from CI/CD is a game changer. Most devs will never update a document. Some will update code they run on their laptop. But if it runs from CI/CD, now anyone can run it, and anyone can update it, so people actually do keep it up to date.
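
As a sketch of steps 3 to 5: a runbook script mostly needs to run its steps in order and exit non-zero on the first failure, so that the scheduled CI/CD job shows up as a failed build. The step names and commands here are placeholders, not a real renewal procedure.

```python
import subprocess
import sys


def run_runbook(steps) -> int:
    """Run (name, argv) steps in order; stop and return the first non-zero exit code."""
    for name, argv in steps:
        print(f"runbook step: {name}", file=sys.stderr)
        result = subprocess.run(argv)
        if result.returncode != 0:
            print(f"runbook step failed: {name}", file=sys.stderr)
            return result.returncode
    return 0


if __name__ == "__main__":
    # Placeholder steps; a real runbook would call its own renewal and check scripts.
    STEPS = [
        ("check python is available", [sys.executable, "--version"]),
    ]
    sys.exit(run_runbook(STEPS))
```

The non-zero exit code is the whole point: it is what turns a broken runbook into a build failure that someone actually sees.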

whirlwin•1mo ago

TLS certificates are not the only technology for which the default mode is failure. What about disks, databases or syntax errors in configuration files in general?

In technology, there are known problems and unknown problems. Expiring TLS certificates is a known problem which has an established solution.

Imagine if only some of the requests failed because a certificate is about to expire. That would be a debugging nightmare.

philippta•1mo ago

When I connect to my server over SSH, I don't have to rotate anything, yet my connection is always secure.

I manually approve the authenticity of the server on the first connection.

From then on, the only time I'd be prompted again would be if the server changed or if there's a risk of MITM.

Why can't we have this for the web?

ILearnAsIGo•1mo ago

Would the issue not be that you would need to trust that first connection?

01HNNWZ0MV43FF•1mo ago

Yep https://en.wikipedia.org/wiki/Trust_on_first_use
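
To make the trade-off concrete, here is a toy sketch of trust on first use: an in-memory pin store keyed by host, loosely modelled on what a known_hosts file does. The class name and return strings are made up for illustration.

```python
import hashlib


class PinStore:
    """Toy known_hosts-style store mapping host -> SHA-256 fingerprint."""

    def __init__(self):
        self._pins = {}

    def check(self, host: str, cert_der: bytes) -> str:
        fingerprint = hashlib.sha256(cert_der).hexdigest()
        pinned = self._pins.get(host)
        if pinned is None:
            # First contact: nothing to compare against, so accept and remember.
            self._pins[host] = fingerprint
            return "trusted-on-first-use"
        return "match" if pinned == fingerprint else "MISMATCH"
```

The DER bytes could come from `SSLSocket.getpeercert(binary_form=True)`. Note that the first call accepts whatever it is shown, which is exactly the concern raised above, and that a legitimate certificate replacement is indistinguishable from an attack: both show up as a mismatch.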

trvz•1mo ago

Cookie banners aren’t annoying enough for you?

philippta•1mo ago

For the handful of regularly visited websites, I wouldn't mind.

jsiepkes•1mo ago

> Why can't we have this for the web?

How do you propose to scale trust on first use? SSH basically says the trusting of a key is "out of scope" for them and makes it your problem. As in: you can put it on a piece of paper, tell it over the phone, whatever, but SSH isn't going to solve it for you. How is some user landing on an HTTPS site going to determine the key used is actually trustworthy?

There have actually been attempts at solving this with something like DANE [1]. For a brief period Chrome had DANE support but it was removed due to being too complicated and being in (security) critical components. Besides, since DNSSEC has some cracks in it (your local resolver probably doesn't check it) you can have a discussion about how secure DANE is.

[1] https://en.wikipedia.org/wiki/DNS-based_Authentication_of_Na...

DANmode•1mo ago

So DNS-adjacent protocols are supposed to be handling this TOFU directory,

but industry behemoths are too busy pushing other self-serving standards to execute together on this?

Am I…close?

tialaramex•1mo ago

What "TOFU directory" ? The whole point of TOFU is that you're just going to accept that anybody's first claim of who they are is correct. This is going to often work pretty well, after all it's how a lot of our social relationships work. I was introduced to a woman as Nodis, so, I called her Nodis, everyone else I know calls her Nodis, her boyfriend calls her Nodis. But it turns out her employer and the government do not call her that because their paperwork has a legal name which she does not like - like many humans probably her legal name was chosen by her parents not by her.

Now, what if she'd insisted her name is Princess Charlotte. I mean, sure, OK, she's Princess Charlotte? But wait, my country has a Princess Charlotte, who is a little girl with some chance of becoming Queen one day (if her elder brother died or refused to be King). So if I just trusted that Nodis is Princess Charlotte because she said so, is there a problem?

jeroenhd•1mo ago

SSH has its own certificate authority system to validate users and servers. This is because trust-on-first-use is not scalable unless you just ignore the risk (at which point you may as well not do encryption at all), so host keys are signed.

There is quite literally nothing that prevents you from using a self-signed server certificate. Your browser will even ask you to trust and store the certificate, like your SSH client does on the screen that shows the fingerprint.

Good luck getting everyone else to trust your fingerprint, though.

tialaramex•1mo ago

The monitoring is the wrong way up, which is the case almost everywhere I've ever worked.

You want an upside down pyramid, in which every checked subsystem contributes an OK or some failure, and failure of these checks is the most serious failure, so the output from the bottom of your pyramid is in theory a single green OK. In practice, systems have always failed or are operating in some degraded state.

In this design the alternatives are: 1. Monitor says the Geese are Transmogrified correctly or 2. Monitoring detected a Goose Transmogrifier problem, or 3. Goose Transmogrifier Monitor failed. The absence of any overall result is a sign that the bottom of the pyramid failed, there is a major disaster, we need to urgently get monitoring working.

What I tend to see is instead a pyramid where the alternatives 1 and 2 work but 3 is silent, and in a summarisation layer, that can fail silently too, and in subsequent layers the same. In this system you always have an unknown amount of silently failed systems. You are flying blind.
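
A small sketch of the roll-up I mean, with three explicit outcomes per check. The key move is that a report which never arrives is counted as the worst failure, not as silence. The names are illustrative.

```python
from enum import Enum


class Status(Enum):
    OK = "ok"
    PROBLEM = "problem"                # the check ran and found a fault
    MONITOR_FAILED = "monitor_failed"  # the check itself did not report


def summarize(expected, reports):
    """Roll up child results; an expected check with no report is MONITOR_FAILED."""
    results = {name: reports.get(name, Status.MONITOR_FAILED) for name in expected}
    if any(s is Status.MONITOR_FAILED for s in results.values()):
        overall = Status.MONITOR_FAILED  # most serious: we are flying blind
    elif any(s is Status.PROBLEM for s in results.values()):
        overall = Status.PROBLEM
    else:
        overall = Status.OK
    return overall, results
```

Each layer of the pyramid would apply the same rule to the layer below it, so the single result at the bottom is only OK when every expected check actually reported.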

xorcist•1mo ago

Closely related to the ever more popular "We don't need monitoring, we have metrics."

jsiepkes•1mo ago

I wonder what the point of this blog is. It's kinda easy to rip on certificates without giving at least one possible way of fixing this, even if it's an unrealistic one.

Sure, the low-level nitty gritty of managing keys and certificates for TLS is hard if you don't have the expertise. You don't know about the hundreds of ways you can get bitten. But all the pieces for a better solution are there. Someone just needs to fold it into a neater higher level solution. But apparently by the time someone has gained the expertise to manage this complexity they also lose interest in making a simple solution (I know I have).

> You can’t set the SSL certificate expiration so it kicks in at different times for different cohorts of users.

Of course you can, if you really want to. You could get different certificates with different expiry times for your reverse (ingress) proxies.

A more straightforward solution is to have monitoring which retrieves the certificate on your HTTPS endpoints and alerts when the expiry time is sooner than it ever should be (i.e. when it should already have been renewed). For example by using Prometheus and ssl_exporter [1].

> and the renewal failures didn’t send notifications for whatever reason.

That's why you need to have deadman switch [2] type of monitoring in your alerting. That's not specific to TLS BTW. Heck even your entire Prometheus infra can go down. A service like healthchecks.io [3] can help with "monitoring the monitors".

[1] https://github.com/ribbybibby/ssl_exporter [2] https://en.wikipedia.org/wiki/Dead_man%27s_switch [3] https://healthchecks.io/
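
The dead man's switch idea in miniature, as a sketch only (this is not how healthchecks.io is implemented, and the class name is made up): the monitored job pings on every successful run, and a separate watcher raises the alarm when the pings stop.

```python
from datetime import datetime, timedelta


class DeadMansSwitch:
    """Trips when no ping has arrived within `max_silence`."""

    def __init__(self, max_silence: timedelta):
        self.max_silence = max_silence
        self.last_ping = None

    def ping(self, now: datetime) -> None:
        """Called by the monitored job each time it completes successfully."""
        self.last_ping = now

    def is_tripped(self, now: datetime) -> bool:
        """Called by the watcher; a job that never pinged counts as tripped."""
        if self.last_ping is None:
            return True
        return now - self.last_ping > self.max_silence
```

The watcher has to run somewhere that does not share a failure domain with the job it watches, which is why an external service is a common choice for this.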

JackSlateur•1mo ago

But certificates work as intended

Of course, if your certificate is expired, then "the failure mode is the opposite of graceful degradation"

Just like when your password is wrong: you cannot login, the failure mode is the opposite of graceful degradation

aljgz•1mo ago

No criticism of SSL-Certs in particular.

Essentially the flip side of any critical but low maintenance part of your system: it's so reliable that you can forget to have external monitors, it's reliable enough that it can work for years without any manual labor, and it's so critical that it can break everything.

Competent infra teams are really good at going over these. But once in a while one of them slips through. It's not a failure of the reliable but critical subsystem, it's a failure mode of humans.

One of the main ways described in "How Complex Systems Fail".

donatj•1mo ago

Once a year for a number of years we would have a small total outage as our Ops team forgot to renew our wildcard certificate. Like clockwork.

It's been a couple of years now so they must have set better reminders for themselves.

I have tried several times to convince them of the joys of ACME, but they're insistent that a Let's Encrypt certificate "looks unprofessional". More professional than a down application in my opinion at least. It's not the early 2000s anymore, no one's looking at your certificate.

dwood_dev•1mo ago

I use ACME with Google Public CA for this reason. No one bats an eye at GPCA. Also, their limits are dramatically higher than LE.

Good news for your manual renewal friends, renewals drop to 197 days in February, halving again the year after, halving again until it reaches 47. So they will soon adopt automation, or suffer endless renewal pain.

teunispeters•1mo ago

One of the interesting things in the ISO 15118-2 (and ISO 15118-20) protocols for EV charging, is that they include a check for "is your contract certificate expiring soon?".

So yeah, certificate timelines can be monitored, complete with warnings ahead of time.

Corollary: the service checking the certificates should have a reasonably accurate time.

throwawayqqq11•1mo ago

This could be a part of your CI/CD. Warn when cert lifetime is below threshold.

jeffrallen•1mo ago

The blackbox exporter from Prometheus publishes the "number of seconds until expiration" as part of the metrics of every HTTPS fetch. Set an alert with 30 days warning, and then don't ignore the alerts.

Problem solved.

PS: It would be nice if it could check whois for the expiration of your domain too, but I haven't seen that yet.

OhMeadhbh•1mo ago

Meh. Seems like the author just doesn't want to have to remember to renew his certs. But I guess "standard tooling makes it harder than it should be for people focused on things other than renewing certs to easily figure out what they're supposed to do" is a valid critique. Suggestions for how to make things better would have been nice.

Saris•1mo ago

If places aren't setting up renewals for SSL it makes me worry about what else they're not paying attention to, like security updates.

navigate8310•1mo ago

Pointless blog post. You need external monitoring of the cert that automatically raises a ticket when the renewal hasn't happened within a certain remaining time period, or monitoring of certbot itself for any errors thrown.
