Why does C have the best file API

63•maurycyz•4h ago

Comments

FrankWilhoit•2h ago

A file API is not the same thing as a filesystem API. The holy grail is still a universal but high(-enough)-level filesystem API.

seba_dos1•2h ago

mmap is not a C feature, but a POSIX one. There are C platforms that don't provide mmap, and on those that do you can use mmap from other languages (there's an mmap module in Python's standard library, for example).
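For illustration, a minimal sketch of what that looks like with Python's standard mmap module. The helper name `uppercase_in_place` and the temporary demo file are made up for this example; only `mmap.mmap`, slicing and `flush` come from the module itself.

```python
import mmap
import os
import tempfile

def uppercase_in_place(path):
    """Map the whole file read/write and uppercase its bytes in place."""
    with open(path, "r+b") as f:
        # Length 0 maps the entire file (the file must not be empty).
        with mmap.mmap(f.fileno(), 0) as m:
            # Slice assignment must keep the length unchanged; upper() does.
            m[:] = m[:].upper()
            m.flush()

# Usage on a throwaway file (hypothetical demo data).
fd, demo_path = tempfile.mkstemp()
os.write(fd, b"hello mmap\n")
os.close(fd)
uppercase_in_place(demo_path)
with open(demo_path, "rb") as f:
    result = f.read()
os.remove(demo_path)
```

The file is modified through ordinary slice assignment, with no explicit read() or write() calls on the data itself.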

ajross•1h ago

I think this is sort of missing the point, though. Yes, mmap() is in POSIX[1] in the sense of "where is it specified".

But mmap() was implemented in C because C is the natural language for exposing Unix system calls and mmap() is a syscall provided by the OS. And this is true up and down the stack. Best language for integrating with low level kernel networking (sockopts, routing, etc...)? C. Best language for async I/O primitives? C. Best language for SIMD integration? C. And it goes on and on.

Obviously you can do this stuff (including mmap()) in all sorts of runtimes. But it always appears first in C and gets ported elsewhere. Because no matter how much you think your language is better, if you have to go into the kernel to plumb out hooks for your new feature, you're going to integrate and test it using a C rig before you get the other ports.

[1] Given that the pedantry bottle was opened already, it's worth pointing out that you'd have gotten more points by noting that it appeared in 4.2BSD.

nickelpro•1h ago

If we're going to be pedantic, mmap is a syscall. It happens that the C version is standardized by POSIX.

The underlying syscall doesn't use the C ABI, you need to wrap it to use it from C in the same way you need to wrap it to use it from any language, which is exactly what glibc and friends do.

Moral of the story is mmap belongs to the platform, not the language.

a-dub•1h ago

it also appears in operating systems that aren't written in c. i see it as an operating system feature, categorically.

projektfu•1h ago

Why does Ada have the best file API?

https://github.com/AdaCore/florist/blob/master/libsrc/posix-...

mmastrac•1h ago

"best file API" and the man page for the O_ flags disagree.

srean•1h ago

> However, in other most languages, you have to read() in tiny chunks, parse, process, serialize and finally write() back to the disk. This works, but is verbose and needlessly limited

C has those too, and I am glad that it does. This is what allows one to do other things while the buffer gets filled, without the need for multithreading.

Yes, easier standardized portable async interfaces would have been nice; I'm not sure how well supported they are.

general_reveal•1h ago

Wouldn’t we need to implement all of that extra stuff if we really wanted to work with text from files? I have a use case where I do need extra fast text input/output from files. If anyone has thoughts on this, I’d love it.

srean•1h ago

The standard way is to use libraries like libevent or libuv that wrap system calls such as epoll and kqueue.
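As a rough sketch of the readiness-based style those libraries provide, here is the same idea using Python's standard selectors module, which picks epoll, kqueue or a fallback depending on the platform. The function name `echo_once` and the socket pair are invented for the demo; this is not libevent's or libuv's API.

```python
import selectors
import socket

def echo_once():
    """Wait for readiness via a selector instead of blocking in recv()."""
    sel = selectors.DefaultSelector()   # epoll, kqueue, ... chosen per platform
    a, b = socket.socketpair()
    a.setblocking(False)
    b.setblocking(False)
    sel.register(b, selectors.EVENT_READ)
    a.send(b"ping")
    received = b""
    # select() returns only the sockets that are ready, so recv() will not block.
    for key, _events in sel.select(timeout=1.0):
        received += key.fileobj.recv(1024)
    sel.unregister(b)
    a.close()
    b.close()
    sel.close()
    return received
```

A real event loop would call `select()` repeatedly and dispatch to per-socket callbacks; this shows a single iteration only.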

The other palatable way is to register consumer coroutines on a system-provided event loop. In C one does so with macro magic, or by using stack switching with the help of a tiny bit of inline assembly.

Take a look at Simon Tatham's page on coroutines in C.

To get really fast you may need to bypass the kernel. Or have more control on the event loop / scheduler. Database implementations would be the place to look.

Const-me•1h ago

I think the C# standard library is better. You can write the same unsafe code as in C: call the SafeBuffer.AcquirePointer method, then access the memory directly. Or you can take the safer and slightly slower route by calling the Read or Write methods of MemoryMappedViewAccessor.

All these methods are in the standard library, i.e. they work on all platforms. The C code is specific to POSIX; Windows supports memory mapped files too but the APIs are quite different.

SvenL•57m ago

I think you don’t need to be unsafe; they have a normal API for it.

https://learn.microsoft.com/en-us/dotnet/standard/io/memory-...

koakuma-chan•1h ago

How do you handle read/write errors with mmap?

Falcondor•1h ago

With mmap, file I/O errors manifest as signals (for example SIGBUS or SIGSEGV).

So if you wanted to handle file read/write errors you would need to implement signal handlers.

https://stackoverflow.com/questions/6791415/how-do-memory-ma...

koakuma-chan•1h ago

... which is not great for an API.

icedchai•26m ago

In my experience, having worked with a large system that used almost exclusively mmap for I/O, you don’t. The process segfaults and is restarted. In practice it almost never happened.

castral•1h ago

I think OP and I have very divergent opinions on what makes a file API "best". This may have been the best 30 years ago. The world has moved on.

Dwedit•1h ago

Using mmap means that you need to be able to handle memory access exceptions when a disk read or write fails. Examples of disk access that can fail include reading from a file on a Wi-Fi network drive, a USB device that suddenly loses its connection when the cable is jiggled, or even a removable USB drive where all disk reads fail after it sees one bad sector. If you're not prepared to handle a memory access exception when you access the mapped file, don't use mmap.

justmedep•1h ago

You can even mmap a socket on some systems (iOS and macOS via GCD). But doing that is super fragile. Socket errors are swallowed.

My interpretation always was that mmap should only be used for immutable and local files. You may still run into issues with those types of files, but it’s very unlikely.

phoronixrly•42m ago

Ah, reminds me of 'Are You Sure You Want to Use MMAP in Your Database Management System? (2022)' https://db.cs.cmu.edu/mmap-cidr2022/

zahlman•1h ago

Aside from what https://news.ycombinator.com/item?id=47210893 said, mmap() is a low-level design that makes it easier to work with files that don't fit in memory and fundamentally represent a single homogeneous array of some structure. But it turns out that files commonly do fit in memory (nowadays you commonly have on the order of ~100x as much disk as memory, but millions of files); and you very often want to read them in order, because that's the easiest way to make sense of them (and tape is not at all the only storage medium historically that had a much easier time with linear access than random access); and you need to parse them because they don't represent any such array.

When I was first taught C formally, they definitely walked us through all the standard FILE* manipulators and didn't mention mmap() at all. And when I first heard about mmap() I couldn't imagine personally having a reason to use it.

nickelpro•1h ago

mmap is also relatively slow (compared to modern solutions, io_uring and friends), and immensely painful for error handling.

It's simple, I'll give it that.

nickelpro•1h ago

> Why does C have the best file API

> Look inside

> Platform APIs

Ok.

I agree platform APIs are better than most generic language APIs at least. I disagree on mmap being the "best".

charcircuit•1h ago

C's API does not include mmap, nor does it contain any API to deal with file paths, nor does it contain any support for opening up a file picker. This, paired with C's bad string support, results in it being one of the worst file APIs.

Also, using mmap is not as simple as the article lays out. For example, what happens when another process modifies the file and now your process's mapped memory consists of parts of two different versions of the file at the same time? You also need to build a way to grow the mapping if you run out of room. You also want to be able to handle failures to read or write. This means you will pretty much need to reimplement fread and fwrite, going back to the approach the author didn't like: "This works, but is verbose and needlessly limited to sequential access." So it turns out "It ends up being just a nicer way to call read() and write()" is only true if you ignore the edge cases.

userbinator•1h ago

> It still works if the file doesn't fit in RAM

No it doesn't. If you have a file that's 2^36 bytes and your address space is only 2^32, it won't work.

On a related digression, I've seen so many cases of programs that could've handled infinitely long input in constant space instead implemented as some form of "read the whole input into memory", which unnecessarily puts a limit on the input length.
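To make the constant-space point concrete, a small sketch in Python: the function below (the name `count_lines` and the chunk size are arbitrary choices for this example) processes input of any length while holding only one fixed-size chunk in memory.

```python
import io

def count_lines(stream, chunk_size=64 * 1024):
    """Count newline bytes using a fixed-size buffer, whatever the input length."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:          # empty bytes means end of input
            break
        total += chunk.count(b"\n")
    return total

# Usage with a deliberately tiny chunk size to exercise the loop.
line_count = count_lines(io.BytesIO(b"a\nb\nc\n"), chunk_size=2)
```

The same loop works unchanged on a pipe or a binary file object, where "read everything first" would impose an input-size limit.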

actionfromafar•1h ago

You can mmap with an offset for that case. Just FYI, in case anyone thought it was a hard limit.
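A sketch of the offset approach using Python's mmap module, assuming the requested range lies inside the file. The helper `read_window` is hypothetical; the real constraint it works around is that the offset passed to `mmap.mmap` must be a multiple of `mmap.ALLOCATIONGRANULARITY`.

```python
import mmap
import os
import tempfile

def read_window(path, start, length):
    """Return `length` bytes at `start` by mapping only a small window of the file."""
    gran = mmap.ALLOCATIONGRANULARITY
    aligned = (start // gran) * gran    # round the offset down to the granularity
    delta = start - aligned             # position of `start` inside the window
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), delta + length,
                       access=mmap.ACCESS_READ, offset=aligned) as m:
            return m[delta:delta + length]

# Usage: a demo file with a marker placed just past two granularity units.
gran = mmap.ALLOCATIONGRANULARITY
fd, demo_path = tempfile.mkstemp()
os.write(fd, b"\0" * (2 * gran + 5) + b"needle" + b"\0" * 10)
os.close(fd)
window = read_window(demo_path, 2 * gran + 5, 6)
os.remove(demo_path)
```

Only `delta + length` bytes of address space are used per call, so the file size is not bounded by the address space, just by how many windows you are willing to map.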

dekhn•46m ago

Address space size and RAM are two different things.

jiggawatts•14m ago

All memory map APIs support moveable “windows” or views into files that are much larger than either physical memory or the virtual address space.

I’ve seen otherwise competent developers use compile time flags to bypass memmap on 32-bit systems even though this always worked! I dealt with database engines in the 1990s that used memmap for files tens of gigabytes in size.

alanfranz•1h ago

Well...

I'm not sure what the author really wants to say. mmap is available in many languages (e.g. Python) on Linux (and many other *nix I suppose). C provides you with raw memory access, so using mmap is sort-of-convenient for this use case.

But if you use Python then, yes, you'll need a bytearray, because Python doesn't give you raw access to such memory - and I'm not sure you'd want to mmap a PyObject anyway?

Then, writing and reading this kind of raw memory can be kind of dangerous and non-portable - I'm not really sure that the pickle analogy even makes sense. I very much suppose (I've never tried) that if you mmap-read malicious data in C, a vulnerability would be _quite_ easy to exploit.

bvrmn•1h ago

Actually, in Python you can recast a bytearray (zero-copy) as another primitive C type, or even as an arbitrary structure, using the ctypes module.
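Something like the following sketch, where the `Header` layout and the sample bytes are invented for the example; `from_buffer` is the ctypes call that shares memory with the bytearray instead of copying it.

```python
import ctypes

class Header(ctypes.LittleEndianStructure):
    # Hypothetical 8-byte record layout, naturally aligned.
    _fields_ = [("magic", ctypes.c_uint32),
                ("count", ctypes.c_uint16),
                ("flags", ctypes.c_uint16)]

buf = bytearray(b"\x78\x56\x34\x12\x03\x00\x00\x00")
hdr = Header.from_buffer(buf)   # a view over buf, no copy is made
hdr.count = 7                   # writes through to the underlying bytearray
```

While `hdr` is alive the bytearray cannot be resized, since its buffer is exported.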

okanat•45m ago

Memory-mapped files have been a very common OS feature since the 90s. Many high-level languages expose them in an OS-agnostic way, POSIX or not.

chuckadams•1h ago

It has the best API for the author, that's for sure. One size does not fit all: believe it or not, different files have different uses. One does not mmap a pipe or /dev/urandom.

ibejoeb•1h ago

The article only touches on `open` and `close` and doesn't deal with any of the realities of file access. Not a particularly compelling write-up.

andersmurphy•59m ago

mmap is nice. But, I find sqlite is a better filesystem API [1]. If you are going to use mmap why not take it further and use LMDB? Both have bindings for most languages.

[1] - https://sqlite.org/fasterthanfs.html
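A minimal sketch of the idea with Python's built-in sqlite3 module, assuming a single hypothetical table named `files`; the helper names `open_store`, `put` and `get` are made up for this example.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a tiny blob store; ':memory:' keeps the demo off disk."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, data BLOB)")
    return db

def put(db, name, data):
    with db:  # commits on success, rolls back on error
        db.execute("INSERT OR REPLACE INTO files (name, data) VALUES (?, ?)",
                   (name, data))

def get(db, name):
    row = db.execute("SELECT data FROM files WHERE name = ?", (name,)).fetchone()
    return None if row is None else row[0]

# Usage
store = open_store()
put(store, "a.txt", b"hello")
```

Each write is a transaction, which is the main thing you give up with a raw mapping.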

okanat•47m ago

I guess the author didn't use that many other programming languages or OSes. You can do the same even in garbage collected languages like Java and C# and on Windows too.

https://docs.oracle.com/javase/8/docs/api/java/nio/MappedByt...

https://learn.microsoft.com/en-us/dotnet/api/system.io.memor...

https://learn.microsoft.com/en-us/windows/win32/memory/creat...

Memory mapping is very common.

dekhn•45m ago

mmap is a built-in module in Python! Also true for Perl.

jccx70•43m ago

This is like: I discovered the wheel and want to let you know!

teunispeters•21m ago

technically yes, because there's a failure path for every single failure that an OS knows about. And most others aren't so resilient. However, mmap bypasses a lot of that....

nice_byte•20m ago

mmap is not a language feature. it is also full of its own pitfalls that you need to be aware of. recommended reading: https://db.cs.cmu.edu/mmap-cidr2022/
