Smallest Possible Files

85•yread•8mo ago

Comments

RandallBrown•8mo ago

There must be some interesting code golf stuff hidden in here, but it seems like it's mostly empty files.

eru•8mo ago

For the C examples, for instance, it depends a lot on which compiler you are using (and so, implicitly, on which standard).

JimDabell•8mo ago

The linked blog post about the smallest possible valid (X)HTML documents is noteworthy, if only for the fact that a surprising amount of people adamantly refuse to believe that they are valid. Even when you think you have gotten through to them with specifications and validators, a lot of people will still think “yeah, but it’s relying on error handling though”. I’m not sure why “HTML explicitly permits this” will not be tolerated as a thought and somehow transforms into “HTML doesn’t permit this but browsers are lenient”. It’s a remarkably unshakeable position. And even the people who are eventually convinced that it’s valid still think that it is technically incorrect in some unspecified way.

currysausage•8mo ago

This is especially ironic, considering the same people will gladly use XML syntax and serve it as text/html. Historically, this has only worked because no relevant browser has ever implemented SGML (and NET [1], in particular), as required by HTML standards up to version 4 [2].

[1] https://en.wikipedia.org/wiki/Standard_Generalized_Markup_La...

[2] https://www.w3.org/TR/html401/conform.html#h-4.2

JimDabell•8mo ago

That’s not quite the whole story. Appendix C of the XHTML 1.0 specification provides HTML compatibility guidelines:

> This appendix summarizes design guidelines for authors who wish their XHTML documents to render on existing HTML user agents.

— https://www.w3.org/TR/xhtml1/#guidelines

And RFC 2854, which defines the text/html media type, explicitly states this is permissible to label as text/html:

> The text/html media type is now defined by W3C Recommendations; the latest published version is [HTML401]. In addition, [XHTML1] defines a profile of use of XHTML which is compatible with HTML 4.01 and which may also be labeled as text/html.

— https://datatracker.ietf.org/doc/html/rfc2854#section-2

However even browsers that support XHTML rendering use their HTML parser for XHTML 1.0 documents served as text/html, even though they should really be parsing them as XHTML 1.0.

But yes, that extra slash means something entirely different to the SGML formulation of HTML (HTML 2.0 to HTML 4.01). HTML5 ditched SGML though, so SHORTTAG NET is no longer a thing.

currysausage•8mo ago

I believe the sentence from the RFC:

[XHTML1] defines a profile of use of XHTML which is compatible with HTML 4.01

is technically incorrect. While the XHTML 1 compatibility profile was compatible with HTML 4 as implemented by major browsers, that wasn't actually HTML 4. HTML 4 is based on SGML, while what was implemented was a combination of HTML 4 semantics with the tagsoup parsing rules that browsers organically developed. These rules were only later formalized as part of HTML 5.

The compatibility guidelines do recommend a space between <br and />, but (at least according to https://validator.w3.org/ in HTML 4 mode) this doesn't change anything about <br /> being a NET-enabling start-tag <br /, followed by a greater-than sign.

Enter this:

  <h1>Hello<br />world</h1>

and select "Validate HTML fragment", "HTML 4.01", and "Show Outline". This is the result:

  [H1] Hello>world

(Obviously nitpicking, but that's my point: the nitpickers can be out-nitpicked.)
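To make that outline result concrete, here is a toy sketch in Python. It is not how the W3C validator or any SGML parser is implemented, and the function name is made up for illustration. It assumes only one rule: with NET enabled, a start tag ends at the first `/` as well as at `>`. It ignores quoted attribute values and the null end tag that would close non-empty elements.

```python
def text_with_net(fragment: str) -> str:
    """Return the character data a NET-aware scanner would see (toy model)."""
    out = []
    i = 0
    n = len(fragment)
    while i < n:
        ch = fragment[i]
        if ch != "<":
            # Ordinary character data.
            out.append(ch)
            i += 1
            continue
        if fragment.startswith("</", i):
            # Ordinary end tag: skip through the closing '>'.
            end = fragment.find(">", i)
            i = n if end == -1 else end + 1
            continue
        # Start tag: it ends at the first '>' or, with NET enabled, the first '/'.
        j = i + 1
        while j < n and fragment[j] not in "/>":
            j += 1
        i = j + 1  # consume the delimiter; whatever follows is data again
    return "".join(out)


print(text_with_net("<h1>Hello<br />world</h1>"))
```

Under that single assumption the `>` after the slash is left over as text, which is the same stray character the validator's outline shows.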

JimDabell•8mo ago

Haha yes. Appendix C gave compatibility guidelines, but you are right that doesn’t actually result in documents that could be parsed by a parser that implemented SHORTTAG NET.

Elsewhere in the thread, I posted an example of SHORTTAG NET being removed from a browser to enable parsing of XHTML documents:

https://github.com/emacsmirror/w3/commit/68af7c107dcbe194e30...

Nevertheless, the text/html RFC explicitly condones Appendix C, so despite it not being fully reflective of reality, it’s still permissible to use text/html to label XHTML 1.0 documents that follow Appendix C :D

myfonj•8mo ago

> Historically, […] no relevant browser has ever implemented SGML […] NET

I can probably confirm the "relevant" part of this claim for the period from the first decade of the 2000s onward, but I am still (somewhat desperately) looking for information on whether ANY application that consumed "HTML", however niche or obscure, treated NET as specified back then. I am quite certain the W3C Validator did (Mathias' article proves that, after all) and that Amaya might have done so, since it was a reference implementation from the same standards body, IIRC, but I cannot swear to that.

Does anybody here have a clearer recollection of those times, or even some evidence?

I still find it strange that such a feature had such a prominent place in the specs back then, but practically nowhere else.

JimDabell•8mo ago

EMACS/W3 originally supported SHORTTAG NET but was “fixed” to remove support. In practical terms, mainstream browsers couldn’t afford to parse SHORTTAG NET properly because it was very common to leave attribute values unquoted. You can leave some values unquoted, but not ones containing slashes. So the very common error <a href=http://…> would not get parsed as the author expected if SHORTTAG NET was enabled.

This is the earliest reference I could locate easily, from the www-html mailing list:

https://lists.w3.org/Archives/Public/www-html/2002Nov/0057.h...

You’ll be able to find more if you go trawling through USENET archives of places like comp.infosystems.www.authoring.html from 25–30 years ago, but it was a fairly niche subject even back then.

I think there were a couple of other niche tools that supported it, but I don’t remember the details after all this time.

JimDabell•8mo ago

I believe this is the exact change where support for SHORTTAG NET was removed from EMACS/W3 in order to support XHTML better:

https://github.com/emacsmirror/w3/commit/68af7c107dcbe194e30...

myfonj•8mo ago

Thanks! That's actually a really valuable insight and seems to be a promising start for an interesting investigation.

I'd even say that, at a glance, EMACS (the "W3" browser in it) seems like a possibly hugely relevant application, actually. Will look into it.

JimDabell•8mo ago

If you really want to, you could check out Evolt’s browser archive:

https://browsers.evolt.org

It’s got over a hundred ancient web browsers. I suspect none of them support SHORTTAG NET though.

myfonj•8mo ago

Good idea. I remember I have done some research about this in the past when I tried to trace historical arguments for the infamous "should there be a space before slash in void tags for the best compatibility"

    <br/> vs <br /> (vs <br>)

discussion, but didn't get very far then (https://stackoverflow.com/a/30880386/540955).

jerf•8mo ago

"if only for the fact that a surprising amount of people adamantly refuse to believe that they are valid... And even the people who are eventually convinced that it’s valid still think that it is technically incorrect in some unspecified way."

Speaking from my personal experience, if your idea of "valid HTML" was created in the late 1990s or early 2000s, it's worth a spin through the current HTML standard. HTML has always de facto been permissive, but de jure it had certain requirements. However, HTML 5 essentially works by reifying a very, very well-specified algorithm for how to handle HTML "loosely" (even though it is very strictly specified), and then refactors away effectively every requirement it possibly can and defers them to that algorithm instead.

Technically speaking, as long as you put down the correct doctype, you can elide almost anything nowadays and get a functional document; for instance, "<!DOCTYPE html><title>Hello</title>" is fully standards compliant now (push it through [1]). The only thing the validator gives is a warning that you might like to declare the document's language with a lang attribute on the html element. It isn't just "browsers will pretty much do the 'right thing'" with that, which has been true for a long time... that's actually standards-compliant HTML now.

What a lot of old hands don't understand is that HTML 5 was a seismic shift in how HTML is specified. Instead of specifying a rigid language and then pretending the world is complying and it's super naughty of them not to, it defines a standard for extracting a DOM tree from effectively any soup of characters you can throw at it, compliance is loosened as much as is practical, and even when things don't comply there's a specification on exactly how to pick up the pieces. HTML 5 has a completely different philosophy than HTML 4 and before.

(Relatedly, the answer to the frequently-asked question "What is the BeautifulSoup equivalent for $LANGUAGE", at least as far as parsing, is effectively now "Find an HTML 5-compliant parser", which they all have now. Beautiful Soup's parsing philosophy was enshrined into the standard.)

[1]: https://validator.w3.org/nu/#textarea

JimDabell•8mo ago

It’s fair to point out the big difference in parsing philosophy between HTML 2–4 and HTML 5, but what I’m talking about happened before HTML5 as well. Some people can’t handle the fact that HTML intentionally has implied elements.

> "<!DOCTYPE html><title>Hello</title>" is fully standards compliant now

Sure, but switch the doctype and put a <p> on the end, and it’s fully standards compliant HTML 4.01 Strict too. And yet so many people are adamant that it can’t be. That it’s invalid (even though a validator says it’s valid). That it’s relying on error handling (even though the spec. says otherwise). That some browsers parse it wrong (but they can never name one). That the DOM ends up broken (when browser dev tools show a normal DOM). That you need <html> and <body> elements (even though it already has both). That there’s something wrong with it at a technical level (even though they cannot describe what).

The concept “This is correct HTML that works everywhere with no error handling” is very difficult for some people to grasp, to a genuinely surprising degree.

arexxbifs•8mo ago

The Python, Perl, Lua, etc. files are arguably valid quines.

ayaros•8mo ago

Reminds me of https://github.com/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee/eeeeeeee...

rollcat•8mo ago

Kinda. Empty files for so many languages, it would be interesting to see at least an exit(0) or so.

DaSHacka•8mo ago

I love how even though the entire repo is essentially a shitpost, it still uses a CoC.

You know, to ensure cordiality in any of the various riveting PRs and discussions.

vitorfrois•8mo ago

yes what about the biggest possible files

jerf•8mo ago

Many of them are infinite, so you'd have to provide them as functions rather than files. There's obvious ones like plain text, but some less obvious ones, like, PNGs are defined as a series of chunks, but there's no chunk count in the header, so you can keep appending chunks forever: https://www.libpng.org/pub/png/spec/1.2/PNG-Structure.html
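As a rough illustration of the PNG point, the Python sketch below builds a 1x1 grayscale PNG and pads it with as many extra tEXt chunks as you ask for. The helper names (`make_chunk`, `iter_chunks`, `tiny_png`) are invented for this example. The reader has to walk chunk by chunk until it reaches IEND, because nothing up front says how many chunks to expect.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes = b"") -> bytes:
    """Serialize one chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png: bytes):
    """Yield (type, data) pairs until IEND; no chunk count is stored anywhere."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG")
    pos = 8
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        (crc,) = struct.unpack(">I", png[pos + 8 + length:pos + 12 + length])
        if crc != zlib.crc32(chunk_type + data) & 0xFFFFFFFF:
            raise ValueError("bad CRC")
        yield chunk_type, data
        if chunk_type == b"IEND":
            return
        pos += 12 + length


def tiny_png(extra_text_chunks: int = 0) -> bytes:
    """Build a 1x1 PNG, optionally padded with redundant tEXt chunks."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)  # 1x1, 8-bit grayscale
    idat = zlib.compress(b"\x00\x00")  # filter byte + one black pixel
    padding = b"".join(
        make_chunk(b"tEXt", b"Comment\x00padding") for _ in range(extra_text_chunks)
    )
    return (
        PNG_SIGNATURE
        + make_chunk(b"IHDR", ihdr)
        + padding
        + make_chunk(b"IDAT", idat)
        + make_chunk(b"IEND")
    )


print([chunk_type for chunk_type, _ in iter_chunks(tiny_png(3))])
print(len(tiny_png(0)), len(tiny_png(1000)))
```

Replacing the fixed count with a generator that never stops would give the "file as a function" version; a scanner that buffers everything before IEND has no bound to rely on.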

This sort of thing is not just a funny question, it's something you think about when you're writing scanners. For instance, another "biggest possible file" is the zip file that decompresses to itself[1], which is in some sense also an infinite file. Many a scanner has been written that will fill the disk then crash if presented with that file, which is actually more pathological behavior than would be experienced if the scanner weren't there.

[1]: https://research.swtch.com/zip

adzm•8mo ago

I really appreciate the .gitignore file there https://github.com/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee/eeeeeeee...

arexxbifs•8mo ago

The 42 byte transparent GIF saw ample use in web development a quarter century ago, when it was used to create pixel perfect <table> layouts. Some things have changed for the better.

https://x42.com/test/gifdot.shtml?abcdef

JimDabell•8mo ago

The smallest GIF is still useful because it is the smallest possible valid favicon. This means you can stuff it into a data: URI to prevent useless requests showing up when you are working on something:

    <link rel="icon" href="data:image/gif;base64,R0lGODlhAQABAAAAADs=">
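For anyone curious what is inside that data: URI, here is a small Python sketch that decodes the base64 payload and unpacks the fixed-size GIF fields. The function name and the dictionary keys are made up for the example. The payload turns out to be just a header, a logical screen descriptor, and the trailer byte, with no image data at all.

```python
import base64
import struct

PAYLOAD = "R0lGODlhAQABAAAAADs="  # copied from the data: URI above


def describe_gif(payload_b64: str) -> dict:
    """Decode the payload and unpack the GIF header + logical screen descriptor."""
    raw = base64.b64decode(payload_b64)
    # 6-byte signature, then little-endian width/height and three single-byte fields.
    signature, width, height, flags, background, aspect = struct.unpack(
        "<6sHHBBB", raw[:13]
    )
    return {
        "size": len(raw),
        "signature": signature,
        "width": width,
        "height": height,
        "has_global_color_table": bool(flags & 0x80),
        "ends_with_trailer": raw[-1:] == b";",  # 0x3B marks the end of a GIF
    }


print(describe_gif(PAYLOAD))
```

The whole thing is 14 bytes: a 1x1 logical screen with no color table, followed immediately by the trailer.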

vbezhenar•8mo ago

You can also make an actually useful and readable SVG favicon this way:

    <link
      rel="shortcut icon"
      href='data:image/svg+xml,%3csvg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">%3ccircle cx="25" cy="50" r="20"/>%3ccircle cx="75" cy="50" r="20"/>%3c/svg>'
    />

JimDabell•8mo ago

Good to know! My goal is simply to stop a 404 popping up during development in the simplest way possible, so the smallest amount of code is best for me.

zamadatix•8mo ago

If you're just wanting to shut the request up and aren't actually trying to display a certain favicon you can do:

  <link rel=icon href=data:>

As a bonus, you've probably already memorized how to reconstruct this on demand just by reading this comment. It is "invalid" data, but so is your example in Safari and Firefox (as opposed to Chromium-based browsers). That doesn't matter much, because the problem is local and silent in the logs, unlike the request.

JimDabell•8mo ago

Thanks! I’m pretty sure I tried this ages ago and it didn’t work at the time, but I tried this again now and it does the job.

zamadatix•8mo ago

The key is to keep everything up through "data:", since anything shorter (even just dropping the ":") gets treated as a relative link instead.

gudzpoz•8mo ago

A use case: https://news.ycombinator.com/s.gif (43 bytes) (used for comment indentation)

rollcat•8mo ago

It's kinda cool that HN looks OK even in simple browsers like Dillo:

<https://imgur.com/a/Seu8rYT>

However it's pretty bad on narrow screens. I wish there was some progressive enhancement via modern CSS, or at least just dark mode.

user32489318•8mo ago

Reminded me of a major “data”/“AI” platform that stripped all empty files when deploying the code. Because of “security” you were not allowed to list files on the deployed instance, nor review the deployment pipeline code or logs (“it just works/batteries included”).

The most brilliant way to screw all Python developers I’ve ever seen.

Later learnt that the Docker container ran the code as root, so basically you could destroy the platform from within. Good times.

RainyDayTmrw•8mo ago

For context, this is because Python uses __init__.py files to indicate which directories are modules. They can contain contents, but quite often are empty placeholders with meaning. Removing those would make the corresponding Python modules invalid and invisible to the Python module loader.
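A small sketch of the effect, using only the standard library. The directory names and the `discover_packages` helper are made up for the example. One caveat: since Python 3.3 a directory without `__init__.py` can sometimes still be imported as a namespace package, so the breakage shows up most reliably in discovery-style tooling that looks for the marker file, which is what `pkgutil.iter_modules` does here.

```python
import pkgutil
import tempfile
from pathlib import Path


def discover_packages(root: Path) -> list:
    """List top-level packages under root the way pkgutil sees them."""
    return sorted(m.name for m in pkgutil.iter_modules([str(root)]) if m.ispkg)


def demo():
    """Return the discovered packages before and after the empty file is stripped."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("with_init", "without_init"):
            (root / name).mkdir()
            (root / name / "mod.py").write_text("VALUE = 1\n")
        marker = root / "with_init" / "__init__.py"
        marker.touch()  # empty placeholder, zero bytes
        before = discover_packages(root)
        marker.unlink()  # what a "strip empty files" deploy step would do
        after = discover_packages(root)
        return before, after


print(demo())
```

The only difference between the two runs is a zero-byte file, which is why stripping "empty" files is not a harmless cleanup for Python code.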

Wowfunhappy•8mo ago

...I feel like completely empty files shouldn't be allowed. Like, I realize the Python interpreter won't error if you feed it an empty file, but how can you really say that empty file represents a Python script if there is no script there?

However, I can't put my finger on what the correct rule would be.

ks2048•8mo ago

I guess if you can run `python myfile.py` and it finishes without error (return code 0), you could consider it valid.

By that measure, there are also 1 byte valid Python programs (e.g. "1").
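That criterion is easy to check mechanically. A minimal sketch, assuming the interpreter running the check (`sys.executable`) is the one you care about; the helper name is made up:

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def exit_code_for(source: str) -> int:
    """Write source to a temporary myfile.py, run it, and return the exit code."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "myfile.py"
        script.write_text(source)
        # capture_output keeps tracebacks from cluttering the terminal
        result = subprocess.run([sys.executable, str(script)], capture_output=True)
        return result.returncode


for source in ("", "1", "("):
    print(repr(source), exit_code_for(source))
```

The empty file and the one-byte file both exit with 0, while the unbalanced parenthesis fails with a syntax error and a nonzero code.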

Wowfunhappy•8mo ago

But (at least for Python) that test also works on empty (0 byte) files, which is presumably why the repository says an empty file is the smallest possible Python program, but which feels wrong to me somehow.

ks2048•8mo ago

Yes, that was my point. And thus “also” for 1 byte programs

nivertech•8mo ago

File size of -∞ is the smallest

jotux•8mo ago

Not if the file size is -∞ - 1.

LegionMammal978•8mo ago

Some of these files are very much nonstandard, even when the standard leaves no leeway (unlike HTML). E.g., every PDF standard requires an %%EOF, startxref offset, and an xref table (or an xref stream in the later versions), but this PDF file is missing those, among other oddities, like the page object missing a /Type and /MediaBox. Too bad the author doesn't specify which implementation these are supposed to work in.

ks2048•8mo ago

Pretty cool. But as everyone is pointing out, empty files aren't that interesting, and 31 of the 137 files are empty:

    $ find . -name ".git" -prune -o -name "README.md" -prune -o -type f -print | wc -l
    137
    $ find . -name ".git" -prune -o -name "README.md" -prune -o -type f -empty -print | wc -l
    31

I suppose if you wanted minimal, non-empty examples, you'd end up with a "hello, world" collection, of which there are many, but nice that this handles file formats as well as programming languages.

aidenn0•8mo ago

The traditional minimal bourne-like shell script has a single ":" in it. This is because, when looking at an executable[1], bourne-alikes may try to detect if the file is binary to prevent executing a binary file. I don't know for a fact that some sh implementations will refuse to execute an empty file, but it seems likely.

1: If you try to run a program binary from a bourne-like shell and execl() signals ENOEXEC, then (if it believes it to be a text file) it will try to run it as a shell script; this makes shebangs optional for programs executed only from a shell. You can try it yourself (tested on bash, dash, ksh, fish, zsh, and osh):

  $ echo 'echo hi' > foo.sh
  $ chmod +x foo.sh
  $ ./foo.sh
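The same fallback can be sketched from Python, which makes the ENOEXEC step visible. This is only an approximation of what a shell does: it assumes a POSIX system with `/bin/sh` and a temp directory that allows execution, it skips the "does this look like text?" check mentioned above, and the function names are invented for the example.

```python
import errno
import os
import subprocess
import tempfile


def run_like_a_shell(path: str) -> str:
    """Try to exec the file directly; on ENOEXEC, hand it to /bin/sh instead."""
    try:
        result = subprocess.run([path], capture_output=True, text=True)
    except OSError as exc:
        if exc.errno != errno.ENOEXEC:
            raise
        # The kernel refused to exec it (no shebang, not a binary): fall back.
        result = subprocess.run(["/bin/sh", path], capture_output=True, text=True)
    return result.stdout


def demo() -> str:
    """Create an executable script with no shebang and run it."""
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "foo.sh")
        with open(script, "w") as f:
            f.write("echo hi\n")
        os.chmod(script, 0o755)
        return run_like_a_shell(script)


print(demo(), end="")
```

Without the `except` branch, the direct exec attempt fails with "Exec format error", which is the situation a shell quietly papers over.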

chasing•8mo ago

Okay, but what about the largest possible files?

dmd•8mo ago

For people who enjoy this sort of thing, vaguely related is this puzzle: https://dmd.3e.org/a-shell-puzzle/

xelxebar•8mo ago

Oh, you're the author! I didn't notice and sent you an email, but will repost here:

    $ for i in 3 4 5; do f=puzzle.$i; echo $f: $(head -1 $f | wc -c); tail -$((i-1)) $f; ./$f; done
    puzzle.3: 1
    futz
    futz
    ./puzzle.3: line 3: futz: command not found
    puzzle.4: 1
    futz
    futz
    futz
    ./puzzle.4: line 4: futz: command not found
    puzzle.5: 1
    futz
    futz
    futz
    futz
    ./puzzle.5: line 5: futz: command not found

Does this count?

dmd•8mo ago

That's certainly the expected output. Emailed you back. You landed in spam, btw.

dmd•8mo ago

I just wanna note here, for future readers, that this guy cheated in a deliciously naughty way, by registering a custom binfmt handler specifically for puzzle.[345], and deserves to get hit on the nose with a rolled up newspaper.
