frontpage.

Future AI bills of $100k/yr per dev

https://blog.kilocode.ai/p/future-ai-spend-100k-per-dev
108•twapi•48m ago•45 comments

Wikimedia Foundation Challenges UK Online Safety Act Regulations

https://wikimediafoundation.org/news/2025/08/11/wikimedia-foundation-challenges-uk-online-safety-act-regulations/
532•danso•6h ago•182 comments

I tried every todo app and ended up with a .txt file

https://www.al3rez.com/todo-txt-journey
452•al3rez•4h ago•318 comments

GitHub is no longer independent at Microsoft after CEO resignation

https://www.theverge.com/news/757461/microsoft-github-thomas-dohmke-resignation-coreai-team-transition
440•Handy-Man•3h ago•267 comments

The Associated Press tells its book critics that it's ending weekly reviews

https://dankennedy.net/2025/08/08/the-associated-press-tells-its-book-critics-that-its-ending-weekly-reviews/
32•thm•1h ago•3 comments

OpenSSH Post-Quantum Cryptography

https://www.openssh.com/pq.html
248•throw0101d•6h ago•75 comments

The Demographic Future of Humanity: Facts and Consequences [pdf]

https://www.sas.upenn.edu/~jesusfv/Slides_London.pdf
25•akyuu•1h ago•28 comments

Claude Is the Drug, Cursor Is the Dealer

https://middlelayer.substack.com/p/i-claude-is-the-drug-cursor-is-the
37•logan1085•2h ago•13 comments

The Value of Institutional Memory

https://timharford.com/2025/05/the-value-of-institutional-memory/
34•leoc•1h ago•10 comments

Trellis (YC W24) Is Hiring: Automate Prior Auth in Healthcare

https://www.ycombinator.com/companies/trellis/jobs/Cv3ZwXh-forward-deployed-engineers-all-levels-august-2025
1•jackylin•1h ago

The Joy of Mixing Custom Elements, Web Components, and Markdown

https://deanebarker.net/tech/blog/custom-elements-markdown/
29•deanebarker•2h ago•12 comments

Byte Buddy is a code generation and manipulation library for Java

https://bytebuddy.net/
16•mooreds•3d ago•4 comments

Claude Code is all you need

https://dwyer.co.za/static/claude-code-is-all-you-need.html
294•sixhobbits•4h ago•182 comments

Pricing Pages – A Curated Gallery of Pricing Page Designs

https://pricingpages.design/
122•finniansturdy•6h ago•36 comments

UI vs. API vs. UAI

https://www.joshbeckman.org/blog/practicing/ui-vs-api-vs-uai
18•bckmn•2h ago•11 comments

Neki – sharded Postgres by the team behind Vitess

https://planetscale.com/blog/announcing-neki
15•thdxr•44m ago•0 comments

The Chrome VRP Panel has decided to award $250k for this report

https://issues.chromium.org/issues/412578726
431•alexcos•12h ago•231 comments

Launch HN: Halluminate (YC S25) – Simulating the internet to train computer use

22•wujerry2000•3h ago•21 comments

36B solar mass black hole at centre of the Cosmic Horseshoe gravitational lens

https://academic.oup.com/mnras/article/541/4/2853/8213862?login=false
66•bookofjoe•4h ago•45 comments

Learn, Reflect, Apply, Prepare: The Four Daily Practices That Changed How I Live

https://opuslabs.substack.com/p/learn-reflect-apply-prepare
15•opuslabs•2h ago•0 comments

Porting to OS/2 – GitPius

https://gitpi.us/article-archive/porting-to-os2/
21•rbanffy•3d ago•0 comments

Designing Software in the Large

https://dafoster.net/articles/2025/07/22/designing-software-in-the-large/
41•davidfstr•4h ago•13 comments

Wikipedia loses challenge against Online Safety Act verification rules

https://www.bbc.com/news/articles/cjr11qqvvwlo
73•phlummox•2h ago•33 comments

How Boom uses software to accelerate hardware development

https://bscholl.substack.com/p/move-fast-and-dont-break-safety-critical
10•flabber•23h ago•3 comments

Faster substring search with SIMD in Zig

https://aarol.dev/posts/zig-simd-substr/
149•todsacerdoti•9h ago•44 comments

Washington, DC police put under federal control, National Guard deployed

https://www.cnbc.com/2025/08/11/trump-washington-crime-fed-national-guard-homeless.html
25•pwim•34m ago•4 comments

Mistral Integration Improved in Llama.cpp

https://github.com/ggml-org/llama.cpp/pull/14737
48•decide1000•8h ago•3 comments

Apache Iceberg V3 Spec new features for more efficient and flexible data lakes

https://opensource.googleblog.com/2025/08/whats-new-in-iceberg-v3.html
30•talatuyarer•1h ago•2 comments

A Global Look at Teletext

https://text-mode.org/?p=23643
52•aqua_worm_hole•7h ago•15 comments

A simple pixel physics simulator in Rust using Macroquad

https://github.com/gale93/sbixel
30•sbirulo•4d ago•1 comments

Going faster than memcpy

https://squadrick.dev/journal/going-faster-than-memcpy
127•snihalani•13h ago

Comments

waschl•13h ago
Thought about zero-copy IPC recently. To avoid memcpy along the complete chain, I guess it would be best if the sender allocates its payload directly in shared memory when it's created. Is this a standard thing in such optimized IPC, and which libraries offer it?
throwaway81523•13h ago
This is one of mmap's designed-for use cases. Look at DPDK maybe.
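
A minimal sketch of that allocate-in-place pattern with plain POSIX shared memory (names, sizes, and the Payload layout are made up; error handling omitted), where the sender constructs the payload directly in the mapping so no later memcpy is needed:

    // sender.cpp -- construct the payload in the shared mapping itself
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstring>
    #include <new>

    struct Payload { std::uint32_t len; char data[4096]; };

    int main() {
        int fd = shm_open("/chan", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, sizeof(Payload));
        void* mem = mmap(nullptr, sizeof(Payload),
                         PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        auto* p = new (mem) Payload{};   // placement-new: payload lives in shm
        std::strcpy(p->data, "hello");   // written once, read in place by receiver
        p->len = 5;
        // receiver: shm_open("/chan", O_RDONLY, 0) + mmap, then parse in place
    }
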
dataflow•13h ago
> I guess it would be best if the sender allocates its payload directly on the shared memory when it’s created.

On an SMP system yes. On a NUMA system it depends on your access patterns etc.

6keZbCECT2uB•13h ago
I've been meaning to look at Iceoryx as a way to wrap this.

PyTorch multiprocessing queues work this way, but it is hard for the sender to ensure the data is already in shared memory, so there is often a copy. It is also common for buffers not to be reused, so that can end up a bottleneck, but it can, in principle, be limited by the rate of sending fds.

yokaze•12h ago
Boost.Interprocess:

https://www.boost.org/doc/libs/1_46_0/doc/html/interprocess/...

comex•12h ago
IPC libraries often specifically avoid zero-copy for security reasons. If a malicious message sender can modify the message while the receiver is in the middle of parsing it, you have to be very careful not to enable time-of-check-time-of-use attacks. (To be fair, not all use cases need to be robust against a malicious sender.)
o11c•12h ago
On Linux, that's exactly what `memfd` seals are for.

That said, even without seals, it's often possible to guarantee that you only read the memory once; in this case, even if the memory is technically mutating after you start, it doesn't matter since you never see any inconsistent state.
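
The seal dance looks roughly like this (Linux-specific sketch; needs _GNU_SOURCE, error handling omitted):

    #define _GNU_SOURCE
    #include <sys/mman.h>   // memfd_create (glibc >= 2.27)
    #include <fcntl.h>      // F_ADD_SEALS, F_SEAL_*
    #include <unistd.h>
    #include <cstring>

    int main() {
        int fd = memfd_create("msg", MFD_ALLOW_SEALING);
        ftruncate(fd, 4096);
        char* p = (char*)mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
        std::strcpy(p, "hello");   // sender fills in the message
        munmap(p, 4096);           // F_SEAL_WRITE fails while writable maps exist
        fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE);
        // Pass fd over a unix socket; the receiver mmaps it read-only and can
        // parse in place, knowing the bytes can no longer change under it.
    }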

kragen•10h ago
Thanks for the reference! I had been wondering if there was a way to do this on Linux for years. https://lwn.net/Articles/591108/ seems to be the relevant note?
murderfs•8h ago
It is very easy for zero-copy IPC using sealed memfds to be massively slower than just copying, because of the cost of the TLB shootdown on munmap. To see a benefit over just writing into a pipe, you'd likely need to be sending gigantic blobs, mapping them in both the reader and the writer into an address space that isn't shared with any other threads doing anything, and deferring and batching the munmaps (and Linux doesn't really provide an actual way to do this, aside from mapping them all in consecutive pages with MAP_FIXED and unmapping multiple mappings with a single call).

Any realistic high-performance zero copy IPC mechanism needs to avoid changing the page tables like the plague, which means things like memfd seals aren't really useful.

duped•12h ago
What's the threat model where a malicious message sender has write access to shared memory?
hmry•11h ago
How would someone send a message over shared memory without write access to that memory?
IshKebab•10h ago
I think he meant what's the scenario where you're using IPC via shared memory and don't trust both processes. Basically it only applies if the processes are running as two different users. (I think Android does that a lot?)
kragen•10h ago
When you are using the shared memory to communicate with an untrusted sender. Examples might include:

- browser main processes that don't trust renderer processes

- window system compositors that don't trust all windowed applications, and vice versa

- database servers that don't trust database clients, and vice versa

- message queue brokers that don't trust publishers and subscribers, and vice versa

- userspace filesystems that don't trust normal user processes

a_t48•11h ago
I've looked into this a bit - the big blocker isn't the transport/IPC library but the serializer itself, assuming you _also_ want to support serializing messages to disk or over the network. It's a bit of a pickle - at least in C++, tying an allocator to a structure and its children is an ugly mess. And what happens if you do something like resize a string? Does it mean a whole new allocation? I've (partially) solved it before for single-process IPC by having a concept of a sharable structure and its serialization type; you could do the same for shared memory. One could also use a serializer that offers promises around allocations - FlatBuffers might fit the bill. There's also https://github.com/Verdant-Robotics/cbuf but I'm not sure how well maintained it is right now, publicly.

As for allocation - it looks like Zenoh might offer the allocation pattern necessary. https://zenoh-cpp.readthedocs.io/en/1.0.0.5/shm.html TBH most of the big wins come from not copying big blocks of memory around from sensor data and the like. A thin header and reference to a block of shared memory containing an image or point cloud coming in over UDS is likely more than performant enough for most use cases. Again, big wins from not having to serialize/deserialize the sensor data.
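
(The fd-passing half of that UDS pattern is just SCM_RIGHTS; a rough sketch, not any particular library's API:)

    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <cstring>

    // Send the shm/memfd fd to the receiver; the thin header rides as payload.
    void send_fd(int sock, int fd) {
        char byte = 0;                            // need >= 1 byte of real data
        struct iovec iov = { &byte, 1 };
        char ctrl[CMSG_SPACE(sizeof(int))] = {};
        struct msghdr msg = {};
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = ctrl;
        msg.msg_controllen = sizeof(ctrl);
        cmsghdr* c = CMSG_FIRSTHDR(&msg);
        c->cmsg_level = SOL_SOCKET;
        c->cmsg_type = SCM_RIGHTS;                // kernel dups the fd across
        c->cmsg_len = CMSG_LEN(sizeof(int));
        std::memcpy(CMSG_DATA(c), &fd, sizeof(int));
        sendmsg(sock, &msg, 0);
    }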

Another pattern which I haven't really seen anywhere is handling multiple transports - at one point I had the concept of setting up one transport as an allocator (to put into shared memory or the like) - serialize once to shared memory, hand that serialized buffer to your network transport(s) or your disk writer. It's not quite zero copy but in practice most zero copy is actually at least one copy on each end.

(Sorry, this post is a little scatterbrained, hopefully some of my points come across)

userbinator•13h ago
It's not clear from a skim of this article, but a common problem I've seen with memory-copying benchmarks is failing to serialise and then access the copied data at its destination, to ensure the copy actually completed before the timing concludes. A simple REP MOVS should be at or near the top, especially on CPUs with ERMSB.
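
A rough sketch of that baseline and the destination-touch (x86-64, GCC/Clang inline asm; mine, not the article's code):

    #include <cstddef>

    // Plain REP MOVSB baseline; on ERMSB-era CPUs this is hard to beat.
    static void rep_movsb(void* dst, const void* src, std::size_t n) {
        asm volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    }

    // In the benchmark, touch the destination after the timed copy so the
    // measurement can't conclude before the copy's effects are visible, e.g.:
    //   asm volatile("" : : "r"(((unsigned char*)dst)[n - 1]) : "memory");
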
kachapopopow•13h ago
Yeah, these benchmarks are irrelevant since the CPU executes instructions out of order. The majority of the time, the CPU will continue executing instructions while a copy operation is ongoing.
viraptor•12h ago
The full reorder buffer is still going to be only 200-500 instructions. The actual benchmark is not linked, but it would take only a hundred or so messages to largely ignore the reordering. On the other hand, when you use the library, the write needs to actually finish in the shared memory before you notify the other process. So unless the benchmark was tiny for some reason, why would this be irrelevant?
brucehoult•13h ago
Conclusion

Stick to `std::memcpy`. It delivers great performance while also adapting to the hardware architecture, and makes no assumptions about the memory alignment.

----

So that's five minutes I'll never get back.

I'd make an exception for RISC-V machines with "RVV" vectors, where vectorised `memcpy` hasn't yet made it into the standard library and a simple ...

    0000000000000000 <memcpy>:
       0:   86aa                    mv      a3,a0
    
    0000000000000002 <.L1^B1>:
       2:   00267757                vsetvli a4,a2,e8,m4,tu,mu
       6:   02058007                vle8.v  v0,(a1)
       a:   95ba                    add     a1,a1,a4
       c:   8e19                    sub     a2,a2,a4
       e:   02068027                vse8.v  v0,(a3)
      12:   96ba                    add     a3,a3,a4
      14:   f67d                    bnez    a2,2 <.L1^B1>
      16:   8082                    ret
... often beats `memcpy` by a factor of 2 or 3 on copies that fit into L1 cache.

https://hoult.org/d1_memcpy.txt
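
The same loop via the RVV 1.0 C intrinsics, for anyone who wants it in C (a sketch):

    #include <riscv_vector.h>
    #include <cstdint>
    #include <cstddef>

    void* rvv_memcpy(void* dst, const void* src, std::size_t n) {
        uint8_t* d = (uint8_t*)dst;
        const uint8_t* s = (const uint8_t*)src;
        while (n > 0) {
            std::size_t vl = __riscv_vsetvl_e8m4(n);     // vsetvli
            vuint8m4_t v = __riscv_vle8_v_u8m4(s, vl);   // vle8.v
            __riscv_vse8_v_u8m4(d, v, vl);               // vse8.v
            s += vl; d += vl; n -= vl;
        }
        return dst;
    }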

viraptor•12h ago
> So that's five minutes I'll never get back.

Confirming the null hypothesis with good supporting data is still interesting. It could save you from doing this yourself.

makach•8h ago
You pre-stole my comment; I was about to make the exact same post :-D

Although the blog post is about going faster and shows alternative algorithms, the conclusion remains to stick with the safe default, which makes perfect sense. Still, he did show us a few useful strategies. The five minutes I spent will never be returned to me, but at least I learned something interesting...

snihalani•3h ago
You could read the article and end up disagreeing with it. The value is in grokking the details, not in whether the insight changes your decisions. It can just make your decisions more grounded in data.
dataflow•13h ago
I thought this was going to be about https://github.com/Blosc/c-blosc
Arech•13h ago
It's not clear how the author controlled for HW caching. Without this, the results are, unfortunately, meaningless, even though some good work has been done.
jesse__•13h ago
Would have loved to see performance comparisons along the way, instead of just the small squashed graph at the end. Nice article otherwise :)
wolfi1•13h ago
the "dumb of perf": some Freudian Slip?
_ZeD_•12h ago
soo... time to send a patch to glibc?
bawolff•11h ago
Given their conclusion that glibc was the best option for most use cases, i would say no.
adwn•12h ago
> The operation of copying data is super easy to parallelize across multiple threads. […] This will make the copy super-fast especially if the CPU has a large core count.

I seriously doubt that. Unless you have a NUMA system, a single core in a desktop CPU can easily saturate the bandwidth of the system RAM controller. If you can avoid going through main memory – e.g., when copying between the L2 caches of different cores – multi-threading can speed things up. But then you need precise knowledge of your program's memory access behavior, and this is outside the scope of a general-purpose memcpy.
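
For reference, the multi-threaded copy being debated looks roughly like this (a sketch, not a recommendation; names are mine):

    #include <cstring>
    #include <thread>
    #include <vector>

    // Split one big copy across N threads. Only a win if a single core
    // can't saturate the memory controller (NUMA / many-channel servers).
    void parallel_memcpy(void* dst, const void* src, std::size_t n, unsigned nt) {
        std::vector<std::thread> workers;
        std::size_t chunk = n / nt;
        for (unsigned i = 0; i < nt; ++i) {
            std::size_t off = i * chunk;
            std::size_t len = (i == nt - 1) ? n - off : chunk;
            workers.emplace_back([=] {
                std::memcpy((char*)dst + off, (const char*)src + off, len);
            });
        }
        for (auto& t : workers) t.join();
    }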

hugh-avherald•12h ago
I've experienced modest but significant improvements in speed using very basic pragma omp section style parallelizing of this sort of thing.
adwn•11h ago
Do you remember any specifics? For example, the size of the copy, whether it was a NUMA system, or the total bandwidth of your system RAM?
bob1029•10h ago
> a single core in a desktop CPU can easily saturate the bandwidth of the system RAM controller.

Modern x86 machines offer far more memory bandwidth than what a single core can consume. The entire architecture is designed on purpose to ensure this.

The interesting thing to note is that this has not always been the case. The 2010s is when the transition occurred.

zozbot234•8h ago
Some modern non-x86 machines (and maybe even some very recent x86 ones) can't even saturate their system memory bandwidth with all of their CPU cores running at full tilt; they'd need to combine both CPU and non-CPU access for the absolute best performance.
Orangeair•12h ago
[2020]
Sesse__•12h ago
There's an error here: “NT instructions are used when there is an overlap between destination and source since destination may be in cache when source is loaded.”

Non-temporal instructions don't have anything to do with correctness. They are for cache management; a non-temporal write is a hint to the cache system that you don't expect to read this data (well, address) back soon, so it shouldn't push out other things in the cache. They may skip the cache entirely, or (more likely) go into just some special small subsection of it reserved for non-temporal writes only.

orlp•9h ago
> Non-temporal instructions don't have anything to do with correctness. They are for cache management; a non-temporal write is a hint to the cache system that you don't expect to read this data (well, address) back soon

I disagree with this statement (taken at face value, I don't necessarily agree with the wording in the OP either). Non-temporal instructions are unordered with respect to normal memory operations, so without a _mm_sfence() after doing your non-temporal writes you're going to get nasty hardware UB.
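
Concretely, the pattern is NT stores followed by a fence before any "data ready" flag is published; a sketch with SSE2 intrinsics, assuming 16-byte alignment and a multiple-of-16 size:

    #include <emmintrin.h>
    #include <cstddef>

    void nt_copy(void* dst, const void* src, std::size_t n) {
        auto* d = (__m128i*)dst;
        auto* s = (const __m128i*)src;
        for (std::size_t i = 0; i < n / 16; ++i)
            _mm_stream_si128(d + i, _mm_load_si128(s + i));   // NT store
        _mm_sfence();   // order the NT stores before any subsequent flag write
    }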

m0th87•9h ago
I had interpreted GP to mean that you don’t slap on NTs for correctness reasons, rather you do it for performance reasons.
orlp•9h ago
That is something I can agree with, but I can't in good faith just let "it's just a hint, they don't have anything to do with correctness" stand unchallenged.
Sesse__•9h ago
You mean if you access it from a different core? I believe that within the same core, you still have the normal ordering, but indeed, non-temporal writes don't have an implicit write fence after them like x86 stores normally do.

In any case, if so they are potentially _less_ correct; they never help you.

m0th87•8h ago
There are no guarantees even if everything operates on the same core. Rust docs have some details: https://doc.rust-lang.org/stable/core/arch/x86_64/fn._mm_sfe...
Sesse__•8h ago
Do you have any Intel references for it? I mean, Rust has its own memory model and it will not always give the same guarantees as when writing assembler.
m0th87•7h ago
https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

Intel's docs are unfortunately spartan, but the guarantees around program order are a hint that this is what it does.

Sesse__•6h ago
That doc is about visibility _outside the core_ (“globally visible”), so it's not what I'm looking for.

Similarly, if I look up MOVNTDQ in the Intel manuals (https://www.intel.com/content/dam/www/public/us/en/documents...), they say:

“Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with VMOVNTDQ instructions if multiple processors might use different memory types to read/write the destination memory locations”

Note _if multiple processors_.

m0th87•9h ago
I work on optimizations like this at work, and yes this is largely correct. But do you have a source on this?

> or (more likely) go into just some special small subsection of it reserved for non-temporal writes only.

I hadn’t heard of this before. It looks like older x86 CPUs may have had a dedicated cache.

Sesse__•9h ago
A source on what? The Intel optimization manuals explain what MOVNTQ is for. I don't think they explain in detail how it is implemented behind-the-scenes.

See e.g. https://cdrdv2.intel.com/v1/dl/getContent/671200 chapter 13.5.5:

“The non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD) allow data to be moved from the processor’s registers directly into system memory without being also written into the L1, L2, and/or L3 caches. These instructions can be used to prevent cache pollution when operating on data that is going to be modified only once before being stored back into system memory. These instructions operate on data in the general-purpose, MMX, and XMM registers.”

I believe that non-temporal moves basically work similar to memory marked as write-combining; which is explained in 13.1.1: “Writes to the WC memory type are not cached in the typical sense of the word cached. They are retained in an internal write combining buffer (WC buffer) that is separate from the internal L1, L2, and L3 caches and the store buffer. The WC buffer is not snooped and thus does not provide data coherency. Buffering of writes to WC memory is done to allow software a small window of time to supply more modified data to the WC buffer while remaining as non-intrusive to software as possible. The buffering of writes to WC memory also causes data to be collapsed; that is, multiple writes to the same memory location will leave the last data written in the location and the other writes will be lost.”

In the old days (Pentium Pro and the likes), I think there was basically a 4- or 8-way associative cache, and non-temporal loads/stores would go to only one of the ways, so you could waste at worst 1/4 (or 1/8) of your cache on them.

m0th87•8h ago
I see, thanks. I had assumed incorrectly that NT writes operated the same as NT accesses, where there is no dedicated cache.
Tuna-Fish•9h ago
IIRC they used the write-combining buffer, which was also a cache.

A common trick is to cache it but put it directly in the last or second-to-last bin in your pseudo-LRU order, so it's in cache like normal but gets evicted quickly when you need to cache a new line in the same set. Other solutions can lead to complicated situations when the user was wrong and the line gets immediately reused by normal instructions; this way it's just in cache like normal and gets promoted to most-recently-used if that happens.

davrosthedalek•9h ago
> Since the loop copies data pointer by pointer, it can handle the case of overlapping data.

I don't think this loop does the right thing if destination points somewhere into source. It will start overwriting the non-copied parts of source.
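
The usual fix is what memmove does: pick the copy direction based on the overlap. A minimal sketch:

    #include <cstddef>

    void* my_memmove(void* dst, const void* src, std::size_t n) {
        auto* d = (unsigned char*)dst;
        auto* s = (const unsigned char*)src;
        if (d > s && d < s + n)
            while (n--) d[n] = s[n];   // dst inside src: copy backwards
        else
            for (std::size_t i = 0; i < n; ++i) d[i] = s[i];
        return dst;
    }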

kvemkon•8h ago
BTW, if we can copy data between a device and RAM efficiently using DMA, without spending CPU cycles, why can't we use DMA to copy RAM-to-RAM?
shakna•8h ago
You can copy that way.

It's faster if you use the CPU, but you absolutely can just use DMA - and some embedded systems do.

kvemkon•8h ago
> It's faster if you use the CPU

But not for AMD? E.g. 8 Zen 5 cores in the CCD have only 64 GB/s read and 32 GB/s write bandwidth, while the dual-channel memory controller in the IOD has up to 87 GB/s bandwidth.

whizzter•5h ago
The issue is that a DMA setup:

A: requires the DMA system to know about each user process's memory mappings (i.e. hardware support for understanding CPU page tables)

B: spends time going from user mode to kernel mode and back (we invented io_uring and other mechanisms entirely to avoid that).

To some extent I guess the IOMMUs available to modern graphics cards solve it partially, but I'm not sure it's a free lunch (i.e. there may still be driver/OS-level work to manage the mappings for this).

toast0•2h ago
DMA works for devices, because the device does the memory access. RAM to RAM DMA would need something to do the accesses.

The other reason DMA works for devices is because it is asynchronous. You give a device a command and some memory to do it with, it does the thing and lets you know. Most devices can't complete commands instantaneously, so we know we have to queue things and then go do something else. Often when doing memcpy, we want to use the copied memory immediately... if it were a DMA, you'd need to submit the request and wait for it to complete before you continued... If your general purpose DMA engine is a typical device, you're probably doing a syscall to the kernel, which would submit the command (possibly through a queue), suspend your process, schedule something else and there may be delay before getting scheduled again when the DMA is complete.

If async memcpy was what was wanted, it could make sense, but that feels pretty hard to use.

zozbot234•1h ago
> DMA works for devices, because the device does the memory access. RAM to RAM DMA would need something to do the accesses.

Isn't a blitter exactly that sort of device? Assuming that it can access the relevant RAM, why couldn't that be used for general-purpose memory copying operations?

toast0•17m ago
Yes, but PCs have only rarely had general-purpose blitters. They were integrated into some video cards, but that's more or less like DMA. Intel had one for a while recently [1]; FreeBSD loads a driver for it on my Xeon L5640 hosted server, but I don't see any evidence that anything actually uses it, and I'm not sure offloading copies enabled enough of a performance improvement, so Intel stopped including them. Linux marked its driver as broken because it caused issues with copy-on-write [2]

[1] https://lwn.net/Articles/162966/ [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

coxley•5h ago
Ha, I love the project name "Shadesmar". Journey before destination, friend. :crossed-wrists:
PaulHoule•5h ago
If I understand the chart at the end, the better performance is only for small buffer sizes that fit in the cache (4K); for big buffers, the stdlib copy performs about the same as the optimized copy that he writes.
mojo-ponderer•5h ago
The graph at the end seems pretty dubious. For example, for the AvxUnrollCopier, why does the data transfer speed jump to >120 GB/s for 4 KB, then drop to ~50 GB/s for 32 KB, then to <20 GB/s for 16 MB? It just doesn't make sense.
Sesse__•4h ago
The L1 cache is faster than the L3 cache. Does it need to be anything more complicated than that?
EGreg•4h ago
Wait, I thought memcpy would have launched some sort of built-in mechanism (parallelized or whatever) to copy in RAM.

Just indicate the start and length. Why would the CPU need to keep issuing copy instructions?

Sesse__•4h ago
The poster has a Zen 2, where this is only optimal for large copies. For newer Intel, glibc might indeed choose to use REP MOVSB more often.
ack_complete•3h ago
The problem is that the built-in mechanism is often microcode, which is still slower than plain machine code in some cases.

There are some interesting writings from a former architect of the Pentium Pro on the reasons for this. One is apparently that the microcode engine often lacked branch prediction, so handling special cases in the microcode was slower than compare/branch in direct code. REP MOVS has a bunch of such cases due to the need to handle overlapping copies, interrupts, and determining when it should switch to cache line sized non-temporal accesses.

More recent Intel CPUs have enhanced REP MOVS support with faster microcode and a flag indicating that memcpy() should rely on it more often. But people have still found cases where if the relative alignment between source and destination is just right, a manual copy loop is still noticeably faster than REP MOVS.

CyberDildonics•2h ago
I thought memcpy would have launched some sort of built-in mechanism

Where did you get this impression?

EGreg•1h ago
From my college days, which were quite long ago. And working with Win32 "BitBlt" requests to the OS, etc.

And also, it would just make sense. If copying entire blocks or memory pages, such as with "BitBlt", is one command, why would I need CPU cycles to actually do it? It would seem like the lowest-hanging fruit to automate in SDRAM.

It just seems like the easiest example of SIMD

CyberDildonics•31m ago
These are contradictory things. SIMD instructions are still regular instructions, not some concurrent system for copying. When you say command, maybe you mean a Windows OS function that was similar to memcpy. An OS function and individual CPU instructions are two different things. There is something called DMA, but I don't know how much it is used for memory-to-memory copies.
commandlinefan•1h ago
I've gotten a lot of gains in this area in the past by just - not memcpy'ing. A good percentage of the time, somebody assumes that they need to copy something somewhere when in fact, the original never gets referenced. I can often get away with reading a buffer off the wire, inserting null terminators to turn bits of the buffer into proper C-style strings and just using them in-place.
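
For example (a sketch; the buffer contents are made up):

    #include <cstdio>

    int main() {
        // "Wire" buffer; fields become proper C strings in place, zero copies.
        char buf[] = "alice,bob,carol";
        char* name = buf;
        for (char* p = buf; ; ++p) {
            if (*p == ',' || *p == '\0') {
                bool last = (*p == '\0');
                *p = '\0';                 // insert terminator instead of copying
                std::printf("%s\n", name); // use the field where it lies
                if (last) break;
                name = p + 1;
            }
        }
    }
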
t00•1h ago
That is really good advice; copying data everywhere only makes sense if the data will be mutated. I only wonder why C-style strings were invented with 0-termination instead of a varint length prefix - knowing the string length upfront would have saved so much copying and so many bugs.
AlotOfReading•1h ago
Pascal strings have the issue that you need to agree on an int size to cross an ABI boundary (unless you want to limit all strings to 255 characters), and what the prefix means is ambiguous if you have variable-length characters (e.g. Unicode). These issues were severe enough that Pascal derivatives all added null-terminated strings.

It took a while for languages to develop the distinction between string length in characters and in bytes that lets us make it work today. In that time, C derivatives took over the world.
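
(For illustration, the length-prefixed layout in question - a sketch, with the prefix width being exactly the ABI question:)

    #include <cstdint>

    // "Pascal-style" string: both sides must agree the prefix is 32-bit,
    // and must agree whether len counts bytes or characters.
    struct PString {
        std::uint32_t len;
        char data[1];   // payload follows; real code uses a flexible array
    };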

commandlinefan•57s ago
That reminds me of one of my favorite vulnerabilities. A security researcher named Moxie Marlinspike managed to register an SSL cert for .com by submitting a certificate request for the domain .com\0mygooddomain.com. The CA looked at the (length-prefixed) ASN.1 subject name, saw that it contained a legitimate domain, and accepted it, but most implementations treated the subject name as a null-terminated C string and stopped parsing at the null terminator.