Lo and behold, his input method automatically collapsed two consecutive dashes into an en-dash (`–-f`), and the "option" was instead treated as a regular positional argument.
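A cheap defensive trick (a sketch, not from the original story) is to scan `argv` for typographic dashes before handing it to the parser, since an autocorrecting input method will have substituted an en- or em-dash for the leading ASCII hyphens:

```python
import sys

# Dash look-alikes a "smart" input method may substitute for "--":
# en-dash U+2013, em-dash U+2014, minus sign U+2212.
SMART_DASHES = ("\u2013", "\u2014", "\u2212")

def undo_smart_dashes(argv):
    """Conservatively map a leading typographic dash back to '--'."""
    fixed = []
    for arg in argv:
        if arg and arg[0] in SMART_DASHES:
            arg = "--" + arg[1:]
        fixed.append(arg)
    return fixed

if __name__ == "__main__":
    print(undo_smart_dashes(sys.argv[1:]))
```

Whether `–f` should really map to `--f` (rather than `-f`) depends on what the input method collapsed, so treating it as a hard error with a helpful message is arguably safer than silently rewriting.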
Apparently Word has a habit of inserting these into fields, whether the context needs them or not, when language packs supporting any right-to-left language are installed. Once added they are silently maintained, and depending on exactly what you select they may get included when you copy the text out to paste elsewhere, or when you use some form of automation to read the field value directly from the document or from Word itself.
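The bidi marks Word inserts (e.g. RIGHT-TO-LEFT MARK, U+200F) are Unicode *format* characters, category `Cf`, so one way to scrub text pulled out of a document is to drop that whole category. A minimal sketch using only the standard library:

```python
import unicodedata

def strip_format_chars(text):
    """Remove Unicode format characters (category Cf): LRM/RLM bidi
    marks, zero-width joiners, soft hyphens, and similar invisibles."""
    return "".join(ch for ch in text
                   if unicodedata.category(ch) != "Cf")
```

Note this is deliberately blunt: it also strips ZWJs that some scripts and emoji sequences legitimately need, so it fits field values like IDs better than free text.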
--------
[1] I noticed it while digging into some output to analyse a related issue: the file had been mashed together from content with different codepages in a way that meant it included invalid code points.
This was the 2000s, so it was all scripts (SQL scripts and VBScripts, I seem to remember). As part of it, we ended up cleaning the customer data of a myriad of bugs: inconsistent capitalization, leading and trailing spaces, and this. Weird characters you didn't even know existed.
Over time more and more of these hidden characters were added to the script, because back then it wasn't a case of googling it or asking on SO.
I have a friend who works as a data analyst for a local council. He hates school-report season, as the data from the schools comes in with all sorts of weird consistency problems.
But the first half of the post really is an interesting problem -- what to do about invisible Unicode characters that wind up in a username login field, turning it into an invalid username, because the value was copy-pasted from a source that inserted them. The post lists potential sources as:
> Copy-paste from PDFs or Word docs: Rich-text formats often inject hidden control characters.
> Email clients and chat apps: Some insert soft hyphens, directionality markers, or non-breaking spaces.
> Keyboards and IMEs: Certain language input systems add combining marks or zero-width joiners.
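All three sources boil down to characters in a handful of Unicode categories, so a login field can defend itself with one sanitizing pass. A sketch (the function name and exact policy are my own, not from the post), covering format characters, exotic spaces, and accent composition:

```python
import unicodedata

def clean_username(raw):
    """Sanitize a pasted username: drop invisibles, tame spaces,
    normalize accents. One possible policy, not a standard."""
    # Drop format characters (category Cf): soft hyphens,
    # directionality markers, zero-width joiners.
    s = "".join(ch for ch in raw
                if unicodedata.category(ch) != "Cf")
    # Map non-breaking and other space separators (Zs) to ASCII space.
    s = "".join(" " if unicodedata.category(ch) == "Zs" else ch
                for ch in s)
    # NFC so precomposed and decomposed accents compare equal.
    return unicodedata.normalize("NFC", s).strip()
```

Rejecting the input outright when invisibles are found is often friendlier than silently rewriting it, since the user can then fix the source of the paste.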
But of course it's part of a broader Unicode problem, like the fact that there are two ways of representing common accented characters (precomposed vs decomposed) that are also not equivalent, or that multiple accents can be in a different order. Normalization handles those cases, but it doesn't do anything about nonprinting characters.
Is there not any common method for Unicode we should be using to check for, essentially, "grapheme comparison" that doesn't just normalize but ignores non-printing codepoints?
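I'm not aware of a single blessed algorithm for this, but a common home-grown approximation is to build a comparison key: normalize, then discard the non-printing categories before comparing. A sketch of that idea (Unicode's own machinery, like the confusable "skeletons" in UTS #39 or NFKC casefolding, goes considerably further):

```python
import unicodedata

def loose_equal(a, b):
    """Compare two strings after NFC normalization, ignoring format
    (Cf) and control (Cc) codepoints. A heuristic, not a standard."""
    def key(s):
        s = unicodedata.normalize("NFC", s)
        return "".join(ch for ch in s
                       if unicodedata.category(ch) not in ("Cf", "Cc"))
    return key(a) == key(b)
```

This still treats visually identical but distinct letters (Latin "a" vs. Cyrillic "а") as different, which is exactly the gap the confusable-detection data is meant to close.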
It was back in the 1.8.7 days, just before proper Unicode support in 1.9, but I don’t remember if that was relevant to this story.
He was deleting code until the bug disappeared, and then we zeroed in and found the character.
It was in the Textmate days, and it didn’t highlight such characters.