This is why people should really use XHTML, the strict XML dialect of HTML, in order to avoid these nasty parsing surprises. It has the predictable behavior that you want.
In XHTML, the code does exactly what it says it does. If you write <table><a></a></table> like the example on the mXSS page, then you get a table element and an anchor child. As another example, if you write <table><td>xyz</td></table>, that's exactly what you get, and there are no implicit <tbody> or <tr> inserted inside.
It's just wild as I continue to watch the world double down for decades on HTML and all its wild behavior in parsing. Furthermore, HTML's syntax is a unique snowflake, whereas XML is a standardized language that just so happens to be used in SVG, MathML, Atom, and other standards - no need to relearn syntax every single time.
With `const clean = DOMPurify.sanitize(input); context.innerHTML = clean;` your linter suddenly needs to do complex code analysis and keep track if each variable passed to `context.innerHTML` is clean or tainted.
brainbag•1h ago
> The Sanitizer API is a proposed new browser API to bring a safe and easy-to-use capability to sanitize HTML into the web platform [and] is currently being incubated in the Sanitizer API WICG, with the goal of bringing this to the WHATWG.
Which would replace the need for sanitizing user-entered content with libraries like DOMPurify by having it built into the browser's API.
The proposed specification has additional information: https://github.com/WICG/sanitizer-api/
mubou2•46m ago
> HTML parsing is not stable and a line of HTML being parsed and serialized and parsed again may turn into something rather different
Are there any examples where the first approach (sanitize to string and set inner html) is actually dangerous? Because it's pretty much the only thing you can do when sanitizing server-side, which we do a lot.
Edit: I also wonder how one would add for example rel="nofollow noreferrer" to links using this. Some sanitizers have a "post process node" visitor function for this purpose (it already has to traverse the dom tree anyway).
tobr•35m ago
This is a common and rather tiresome critique of all kinds of blog posts. I think it is fair to assume the reader has a bit of contextual awareness when you publish on your personal blog. Yes, you were linked to it from a place without that context, but it’s readily available on the page, not a secret.
mubou2•25m ago
It's not hard to add one line of context so readers aren't lost. Here, take this for example, combining a couple parts of the GitHub readme:
> For those who are unfamiliar, the Sanitizer API is a proposed new browser API being incubated in the Sanitizer API WICG, with the goal of bringing this to the WHATWG.
Easy. Can fit that in right after "this blog post will explain why", and now everyone is on the same page.
swiftcoder•16m ago
Do we have data to back that up? Anecdotally the blogs I have operated over the years tend to mostly sustain on repeat traffic from followers (with occasional bursts of external traffic if something trends on social media)
LegionMammal978•29m ago
Personally, my recommendation in most cases would be "maintain a strict list of common elements/attributes to allow in the serialized form, and don't put anything weird in that list: if a serialize-parse roundtrip has the remote possibility of breaking something, then you're allowing too much". Also, "if you want to mutate something, then do it in the object tree, not in the serialized version".
[0] https://www.sonarsource.com/blog/mxss-the-vulnerability-hidi...
mubou2•20m ago
crote•23m ago
The article links to [0], which has some examples of instances in which HTML parsing is context-sensitive. The exact same string being put into a <div> might be totally fine, while putting it inside a <style> results in XSS.
[0]: https://www.sonarsource.com/blog/mxss-the-vulnerability-hidi...
crote•36m ago
A big part of designing a security-related API is making it really easy and obvious to do the secure thing, and hide the insecure stuff behind a giant "here be dragons" sign. You want people to accidentally do the right thing, so you call your secure and insecure functions "setHTML" and "setUnsafeHTML" instead of "setSanitizedHTML" and "setHTML".