Zig's New Writer

https://www.openmymind.net/Zigs-New-Writer/

108•Bogdanp•6mo ago

Comments

mishafb•6mo ago

I agree on the last point of the lack of composition here.

While it's true that writers need to be aware of buffering to make use of fancy syscalls, implementing that should be an option, but not a requirement.

Naively this would mean implementing one of two APIs in an interface, which ruins the direct performance. So I see why the choice was made, but I still hope for something better.

It's probably not possible with Zig's current capabilities, but I would ideally like to see a solution that:

- Allows implementations to know at comptime what the interface actually implements and optimize for that (is buffering supported? Can you get access to the buffer in place for zero copy?).

- For the generic version (which is in the vtable), choose one of the methods and wrap it (at comptime).

There's so many directions to take Zig into (more types? more metaprogramming? closer to metal?) so it's always interesting to see new developments!

biggerben•6mo ago

I wonder if making this change will improve design of buffering across IO implementers because buffering needs consideration upfront, rather than treatment as some feature bolted on the side?

It’s a good sacrifice if the redesign, whilst being more complicated, avoids an oversimplified abstraction that ends up restricting optimisation opportunities.

messe•6mo ago

> While it's true that writers need to be aware of buffering to make use of fancy syscalls, implementing that should be an option, but not a requirement.

Buffering is implemented and handled in the vtable struct itself. The writers (implementations of the interface) themselves don't actually have to know or care about it, other than passing through the user-provided buffer when initializing the vtable.

If you don't want buffering, you can pass a zero-length buffer upon creation, and it'll get optimized out. This optimization doesn't require devirtualization because the buffering happens before any virtual function calls.

amluto•6mo ago

From my personal experience, buffered and unbuffered writers are different enough that I think it’s a bit of a mistake to make them indistinguishable to the type system. An unbuffered writer sends the data out of process immediately. A buffered writer usually doesn’t, so sleeping after a write (or just doing something else and not writing more for a while) will delay the write indefinitely. An unbuffered write does not do this.

This means that plenty of algorithms are correct with unbuffered writers and are incorrect with buffered writers. I’ve been bitten by this and diagnosed bugs caused by this multiple times.

Meanwhile an unbuffered writer has abysmal performance if you write a byte at a time.

I’d rather see an interface (trait, abstract class, whatever the language calls it) for a generic writer, with appropriate warnings that you probably don’t want to use it unless you take specific action to address its shortcomings, and subtypes for buffered and unbuffered writers.

And there could be a conditional buffering wrapper that temporarily adds buffering to a generic writer and is zero-cost if applied to an already buffered writer. A language with enforced borrowing semantics like Rust could make this very hard to misuse. But even Python could do it decently well, e.g.:

    w: MaybeBufferedByteWriter
    with io.LocalBuffer(w) as bufwriter:
        do stuff with bufwriter
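A minimal runnable sketch of that idea, for illustration only. `local_buffer` and `CountingRaw` are hypothetical names and not part of Python's `io` module; the sketch assumes "already buffered" can be detected with an `isinstance` check against `io.BufferedWriter`.

```python
import contextlib
import io


class CountingRaw(io.RawIOBase):
    """Unbuffered sink that records each write call (stand-in for a socket or fd)."""

    def __init__(self):
        super().__init__()
        self.calls = []

    def writable(self):
        return True

    def write(self, b):
        self.calls.append(bytes(b))
        return len(b)


@contextlib.contextmanager
def local_buffer(w, size=4096):
    """Temporarily add buffering to w; a no-op if w is already buffered."""
    if isinstance(w, io.BufferedWriter):
        yield w  # already buffered: hand it back unchanged
        return
    buffered = io.BufferedWriter(w, buffer_size=size)
    try:
        yield buffered
    finally:
        buffered.flush()   # nothing may linger once the block ends
        buffered.detach()  # release w without closing it
```

With this, a hundred one-byte writes inside the `with` block reach the raw writer as a single write when the block exits, and the caller never holds a buffered handle past that point.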

donatj•6mo ago

I absolutely agree, and would like to add I feel like the ergonomics of the new interface are just very awkward and almost leaky.

Buffered and unbuffered IO should just be entirely separate things, and separate interfaces. Then, as you mention, the standard library can provide an adapter in at least one direction, maybe both.

This seems like a blunder to me.

josephg•6mo ago

> This means that plenty of algorithms are correct with unbuffered writers and are incorrect with buffered writers. I’ve been bitten by this and diagnosed bugs caused by this multiple times.

But write() on POSIX is also a buffered API. Until your program calls fsync / fdatasync, Linux isn't required to actually flush anything to the underlying storage medium. And even then, many consumer storage devices will lie and return from fsync immediately, before data has actually been flushed.

All the OSes that I know of will eagerly write data instead of waiting for fsync, but there's no guarantee the data will be persisted by the time your write() call returns. It usually isn't. If you're relying on write() to durably flush data to disk, you've probably got correctness / data corruption bugs lurking in your code that will show up if power goes out at the wrong time.
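As a rough sketch of what "durable" takes in practice, here is the same sequence in Python. `durable_append` is a hypothetical helper name; the calls it uses (`flush`, `os.fsync`) are standard. It only narrows the window: as noted above, the device itself may still acknowledge early.

```python
import os


def durable_append(path, data):
    """Append data and ask the OS to push it to the storage device."""
    with open(path, "ab") as f:
        f.write(data)         # lands in the process-level buffer
        f.flush()             # process buffer -> kernel (page cache)
        os.fsync(f.fileno())  # kernel -> device, as far as the OS can tell
```

Skipping `flush` before `fsync` is a common mistake: `fsync` only covers what the kernel has already been given.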

InfiniteRand•6mo ago

This has bit me a few times when a Linux system crashes so there’s no final call to fsync implicit or otherwise

o11c•6mo ago

I wouldn't call that "buffered", since `write` is guaranteed to appear immediately from the view of other processes that can see the same mount. It's only the disk that needs to be informed to really pick up (could we say "read"?) the changes.

josephg•6mo ago

I still call that buffering because the OS buffers the writes to the physical device. Maybe from the POV of other processes the writes don’t appear to be buffered. But I’m not typically reading a database from multiple processes. I do, however, care about my database surviving a system crash.

On that note though - What guarantees does the OS provide to reads from other processes? Will other processes see all writes immediately?

amluto•6mo ago

I’m not talking about data loss if the host crashes. I’m talking about a much broader sense of correctness.

Imagine an RPC server that uses persistent connections. If you reply using a buffered writer and forget to flush, then your tail latency blows up, possibly to infinity. It’s very easy to imagine situations involving multiple threads or processes that simply don’t work if buffers aren’t flushed on time.

Imagine a program that is intended to crash on error but writes a log message first. If it buffers and doesn’t flush, then the log message will almost always get lost, whereas if the writer is unbuffered or is flushed, then the message will usually not get lost.
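The log-then-crash case is easy to reproduce. A small sketch, assuming a child process that dies via `os._exit` (which skips normal interpreter cleanup); `CHILD` and `crash_and_read_log` are names made up for this illustration.

```python
import os
import subprocess
import sys
import tempfile

# Child program: writes one log line, optionally flushes, then exits abruptly.
CHILD = """
import os, sys
log = open(sys.argv[1], "w")      # default: buffered text file
log.write("fatal: giving up\\n")
if sys.argv[2] == "flush":
    log.flush()                   # hand the bytes to the OS before dying
os._exit(1)                       # abrupt exit: user-space buffers are dropped
"""


def crash_and_read_log(mode):
    """Run the child with mode 'flush' or 'noflush' and return the log contents."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "app.log")
        subprocess.run([sys.executable, "-c", CHILD, path, mode])
        with open(path) as f:
            return f.read()
```

Without the flush the log file comes back empty; with it, the line survives the exit.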

josephg•6mo ago

> I’m not talking about data loss if the host crashes. I’m talking about a much broader sense of correctness

Then why aren’t you talking about data loss if the host crashes? I consider that in the same bucket as your other examples. If you write a log and then the system crashes before it’s been written out to disk, the log message is lost. If a database issues two writes then the host crashes, what has been written to disk? It could be one write, both, neither, or a skewed write. If the database isn’t extremely careful, that crash could totally corrupt the database files.

Correctness matters. Writes are sent to disk (or sent over the network) some time between when you call write and when your call to flush returns. Using a buffered writer API doesn’t (shouldn’t) change that.

0xcafefood•6mo ago

Great point. It's like the earlier days where remote procedure calls were intended to happen "transparently" but the fact that networking is involved in some procedure calls and not others makes them very different in key ways that should not be hidden.

mmastrac•6mo ago

It's an interesting choice, but every writer now needs to handle:

1) vectored i/o (array of arrays, lots of fun for cache lines)

2) buffering

3) a splat optimization for compression? (skipped over in this post, but mentioned in an earlier one)
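To make points 1 and 3 concrete, here is a sketch of the byte stream a `drain(data, splat)` call describes, assuming the semantics where the last slice in `data` is repeated `splat` times (possibly zero). `logical_bytes` is a hypothetical helper for illustration, not anything from Zig's standard library; a real implementation would avoid building the joined copy.

```python
def logical_bytes(data, splat):
    """Flatten a list of byte slices, repeating the final slice splat times."""
    if not data:
        return b""
    *head, pattern = data
    return b"".join(head) + pattern * splat
```

For example, `logical_bytes([b"ab", b"-"], 3)` describes `ab` followed by three dashes, without the caller having to materialize the repeated run.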

I'm skeptical here, but I guess we will see if adding this overhead on all I/O is a win. Devirtualization helps _sometimes_ but when you've got larger systems it's entirely possible you've got sync and async I/O in the same optimization space and lose out on optimization opportunities.

In practice, I/O stacks tend to consist of a lot of composition, and in many cases, leak a lot of abstractions. Buffering is one part, corking/backpressure is another (neither of which is handled here, but I might be mistaken). In some cases, you've got meaningful framing on streams that needs to be maintained (or decorated with metadata).

If it works out, I suppose this will be a new I/O paradigm. In fairness, nobody has _really_ solved I/O yet, so maybe a brave new swing is what we need.

wasmperson•6mo ago

I'm not usually in the "defending C++" camp, but when I see this:

  pub const File = struct {
  
    pub fn writer(self: *File, buffer: []u8) Writer {
      return .{
        .file = self,
        .interface = std.Io.Writer{
          .buffer = buffer,
          // vtable is held by pointer, so take the address of the literal
          .vtable = &.{ .drain = Writer.drain },
        },
      };
    }
  
    pub const Writer = struct {
      file: *File,
      interface: std.Io.Writer,
      // this has a bunch of other fields
  
      // io_w points at the `interface` field, not at File.Writer itself
      fn drain(io_w: *std.Io.Writer, data: []const []const u8, splat: usize) !usize {
        // recover the enclosing File.Writer from the pointer to its field
        const self: *Writer = @fieldParentPtr("interface", io_w);
        // ....
      }
    };
  };

...I can't help but think of this:

  struct FileWriter: public Writer {
      File *file;
      // this has a bunch of other fields
  
      FileWriter(File *self, span<char> buffer)
          : Writer(buffer), file(self) {}
  
      size_t drain(span<span<char const> const> data, size_t splat) override {
        // ....
      }
  };

Writing code to build a vtable and having it implicitly run at compile time is pretty neat, though!

varyous•6mo ago

The Zig std library often builds vtables for structs in an effort to minimize runtime cost for the typical non-virtual cases. I feel it leads to a lot of boilerplate just to effectively have virtual functions. Worse, you have to study the Zig code on a case-by-case basis to determine how to even use this ad-hoc virtual function scheme. Surely Zig can introduce virtual function support in a more ergonomic way than this, as it's so widely used in real-life code and extensively in Zig's own std library.

bvrmn•6mo ago

I consider built-in buffering as a huge win.

* For example python gives you buffered by default streams. It's an amazing DX.

* In case of Zig you as a developer should be explicit about buffer sizes.

* You could opt-out to unbuffered any time.

* Allows for optimization without leaky "composable" io stack.
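For reference, the Python behaviour mentioned in the first bullet looks like this: binary files are buffered unless you opt out, and the buffer size can be set explicitly. The file name is arbitrary.

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "out.bin")
    with open(path, "wb") as buffered:                  # default: buffered
        default_type = type(buffered)
    with open(path, "wb", buffering=0) as unbuffered:   # opt out: raw file
        raw_type = type(unbuffered)
    with open(path, "wb", buffering=64 * 1024) as big:  # explicit buffer size
        big_type = type(big)
```

Here `default_type` and `big_type` are `io.BufferedWriter`, while `raw_type` is `io.FileIO`, so in Python the two cases are distinguishable by type.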
