Stick to `std::memcpy`. It delivers great performance while also adapting to the hardware architecture, and makes no assumptions about the memory alignment.
----
So that's five minutes I'll never get back.
I'd make an exception for RISC-V machines with "RVV" vectors, where vectorised `memcpy` hasn't yet made it into the standard library and a simple ...
    0000000000000000 <memcpy>:
       0:   86aa       mv       a3,a0
    0000000000000002 <.L1^B1>:
       2:   00267757   vsetvli  a4,a2,e8,m4,tu,mu
       6:   02058007   vle8.v   v0,(a1)
       a:   95ba       add      a1,a1,a4
       c:   8e19       sub      a2,a2,a4
       e:   02068027   vse8.v   v0,(a3)
      12:   96ba       add      a3,a3,a4
      14:   f67d       bnez     a2,2 <.L1^B1>
      16:   8082       ret
... often beats `memcpy` by a factor of 2 or 3 on copies that fit into L1 cache.
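For anyone who would rather not hand-write the assembly, roughly the same strip-mined loop can be written in C with the RVV intrinsics. A minimal sketch, assuming a toolchain that ships <riscv_vector.h> with the v1.0 intrinsics naming; the function name is illustrative, and like the loop above it does not handle overlapping buffers:

    #include <riscv_vector.h>
    #include <stddef.h>
    #include <stdint.h>

    void *rvv_memcpy(void *dst, const void *src, size_t n)
    {
        uint8_t *d = dst;
        const uint8_t *s = src;
        while (n > 0) {
            size_t vl = __riscv_vsetvl_e8m4(n);        /* vsetvli: how many elements this pass */
            vuint8m4_t v = __riscv_vle8_v_u8m4(s, vl); /* vle8.v: vector load  */
            __riscv_vse8_v_u8m4(d, v, vl);             /* vse8.v: vector store */
            s += vl;
            d += vl;
            n -= vl;
        }
        return dst;
    }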
Confirming the null hypothesis with good supporting data is still interesting. It could save you from doing this yourself.
Although the blog post is about going faster and shows alternative algorithms, the conclusion still comes down on the side of safety, which makes perfect sense. He did show us a few useful strategies, though. The five minutes I spent will never be returned to me, but at least I learned something interesting...
I seriously doubt that. Unless you have a NUMA system, a single core in a desktop CPU can easily saturate the bandwidth of the system RAM controller. If you can avoid going through main memory – e.g., when copying between the L2 caches of different cores – multi-threading can speed things up. But then you need precise knowledge of your program's memory access behavior, and this is outside the scope of a general-purpose memcpy.
Modern x86 machines offer far more memory bandwidth than what a single core can consume. The entire architecture is designed on purpose to ensure this.
The interesting thing to note is that this has not always been the case. The 2010s is when the transition occurred.
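If you want to sanity-check this on your own machine, one rough way is to split a single large copy across several threads and see whether the aggregate throughput scales. A minimal, unscientific sketch in C with pthreads (buffer size, thread count, and names are all arbitrary choices; no NUMA pinning, no averaging over runs):

    /* Rough copy-bandwidth probe: does splitting one big memcpy across
       threads go faster than one thread?  Build: cc -O2 -pthread probe.c */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define NTHREADS 4
    #define TOTAL    (1UL << 30)              /* 1 GiB copied in total */

    static char *src, *dst;

    static void *worker(void *arg)
    {
        size_t i = (size_t)arg, chunk = TOTAL / NTHREADS;
        memcpy(dst + i * chunk, src + i * chunk, chunk);
        return NULL;
    }

    int main(void)
    {
        src = malloc(TOTAL);                  /* error handling omitted */
        dst = malloc(TOTAL);
        memset(src, 1, TOTAL);                /* fault pages in before timing */
        memset(dst, 1, TOTAL);

        struct timespec t0, t1;
        pthread_t tid[NTHREADS];
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (size_t i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d thread(s): %.1f GB/s\n", NTHREADS, TOTAL / secs / 1e9);
        return 0;
    }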
Non-temporal instructions don't have anything to do with correctness. They are for cache management; a non-temporal write is a hint to the cache system that you don't expect to read this data (well, address) back soon, so it shouldn't push out other things in the cache. They may skip the cache entirely, or (more likely) go into just some special small subsection of it reserved for non-temporal writes only.
I disagree with this statement (taken at face value, I don't necessarily agree with the wording in the OP either). Non-temporal instructions are unordered with respect to normal memory operations, so without a _mm_sfence() after doing your non-temporal writes you're going to get nasty hardware UB.
In any case, if anything they are potentially _less_ correct; they never help you.
Intel's docs are unfortunately spartan, but the guarantees around program order are a hint that this is what it does.
Similarly, if I look up MOVNTDQ in the Intel manuals (https://www.intel.com/content/dam/www/public/us/en/documents...), they say:
“Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with VMOVNTDQ instructions if multiple processors might use different memory types to read/write the destination memory locations”
Note _if multiple processors_.
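To make the fencing question concrete, this is roughly what a streaming (non-temporal) copy looks like with SSE2 intrinsics. A minimal sketch, not from the article: it assumes a 16-byte-aligned destination and a length that is a multiple of 16 (the tail is omitted), and the final `_mm_sfence()` is exactly the part being debated above:

    #include <emmintrin.h>   /* SSE2: _mm_loadu_si128, _mm_stream_si128 */
    #include <stddef.h>

    static void nt_copy(void *dst, const void *src, size_t n)
    {
        char *d = (char *)dst;
        const char *s = (const char *)src;
        for (size_t i = 0; i + 16 <= n; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
            _mm_stream_si128((__m128i *)(d + i), v);  /* non-temporal (write-combining) store */
        }
        _mm_sfence();  /* order the streaming stores before any later normal stores/flags */
    }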
> or (more likely) go into just some special small subsection of it reserved for non-temporal writes only.
I hadn’t heard of this before. It looks like older x86 CPUs may have had a dedicated cache.
See e.g. https://cdrdv2.intel.com/v1/dl/getContent/671200 chapter 13.5.5:
“The non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD) allow data to be moved from the processor’s registers directly into system memory without being also written into the L1, L2, and/or L3 caches. These instructions can be used to prevent cache pollution when operating on data that is going to be modified only once before being stored back into system memory. These instructions operate on data in the general-purpose, MMX, and XMM registers.”
I believe that non-temporal moves basically work similarly to memory marked as write-combining, which is explained in 13.1.1: “Writes to the WC memory type are not cached in the typical sense of the word cached. They are retained in an internal write combining buffer (WC buffer) that is separate from the internal L1, L2, and L3 caches and the store buffer. The WC buffer is not snooped and thus does not provide data coherency. Buffering of writes to WC memory is done to allow software a small window of time to supply more modified data to the WC buffer while remaining as non-intrusive to software as possible. The buffering of writes to WC memory also causes data to be collapsed; that is, multiple writes to the same memory location will leave the last data written in the location and the other writes will be lost.”
In the old days (Pentium Pro and the like), I think there was basically a 4- or 8-way set-associative cache, and non-temporal loads/stores would go into only one of the ways, so you could waste at most 1/4 (or 1/8) of your cache on them.
A common trick is to cache it but insert it directly at the last or second-to-last position in the pseudo-LRU order, so it's in the cache like normal but gets evicted quickly when a new line needs to be cached in the same set. Other schemes can lead to complicated situations when the user was wrong and the line gets immediately reused by normal instructions; this way it's just a normal cache line and gets promoted in the LRU order if that happens.
I don't think this loop does the right thing if destination points somewhere into source. It will start overwriting the non-copied parts of source.
It's faster if you use the CPU, but you absolutely can just use DMA - and some embedded systems do.
But not for AMD? E.g. 8 Zen 5 cores in the CCD have only 64 GB/s read and 32 GB/s write bandwidth, while the dual-channel memory controller in the IOD has up to 87 GB/s bandwidth.
A: it requires the DMA engine to know about each user process's memory mappings (i.e. hardware support for understanding CPU page tables)
B: you spend time going from user mode to kernel mode and back (we invented io_uring and other mechanisms precisely to avoid that).
To some extent I guess the IOMMUs available to modern graphics cards solve this partially, but I'm not sure it's a free lunch (i.e. it might still require driver/OS-level work to manage the mappings).
The other reason DMA works for devices is that it is asynchronous. You give a device a command and some memory to work with; it does the thing and lets you know. Most devices can't complete commands instantaneously, so we know we have to queue things and then go do something else. But when doing memcpy, we often want to use the copied memory immediately... if it were a DMA, you'd need to submit the request and wait for it to complete before you continued. And if your general-purpose DMA engine is a typical device, you're probably doing a syscall to the kernel, which would submit the command (possibly through a queue), suspend your process, and schedule something else, with some delay before you get scheduled again once the DMA is complete.
If an async memcpy were what was wanted, it could make sense, but that feels pretty hard to use.
Isn't a blitter exactly that sort of device? Assuming that it can access the relevant RAM, why couldn't that be used for general-purpose memory copying operations?
Just indicate the start and length. Why would the CPU need to keep issuing copy instructions?
There are some interesting writings from a former architect of the Pentium Pro on the reasons for this. One is apparently that the microcode engine often lacked branch prediction, so handling special cases in the microcode was slower than compare/branch in direct code. REP MOVS has a bunch of such cases due to the need to handle overlapping copies, interrupts, and determining when it should switch to cache line sized non-temporal accesses.
More recent Intel CPUs have enhanced REP MOVS support, with faster microcode and a CPUID feature flag (ERMSB) indicating that memcpy() should rely on it more often. But people have still found cases where, if the relative alignment between source and destination is just right, a manual copy loop is still noticeably faster than REP MOVS.
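For reference, "relying on REP MOVS" looks roughly like this. A minimal sketch using GCC/Clang extended asm on x86-64; the function name is illustrative, and a real memcpy() would choose between this and a vector loop based on size and the ERMSB/FSRM CPUID bits:

    #include <stddef.h>

    static void *repmovsb_copy(void *dst, const void *src, size_t n)
    {
        void *ret = dst;
        /* REP MOVSB copies RCX bytes from [RSI] to [RDI]; microcode picks the strategy. */
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
        return ret;
    }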
Where did you get this impression?
And also, it would just make sense: if copying entire blocks or memory pages, as in "BitBlt", is a single command, why would I need CPU cycles to actually do it? It would seem like the lowest-hanging fruit to automate in the SDRAM itself.
It just seems like the easiest example of SIMD.
It took a while for languages to develop the distinction between string length in characters and string length in bytes that allows us to make this work today. In that time, C derivatives took over the world.
dataflow•13h ago
On an SMP system yes. On a NUMA system it depends on your access patterns etc.
6keZbCECT2uB•13h ago
PyTorch multiprocessing queues work this way, but it is hard for the sender to ensure the data is already in shared memory, so it often involves a copy. It is also common for buffers not to be reused, so that can end up a bottleneck, but it can, in principle, be limited only by the rate of sending fds.
yokaze•12h ago
https://www.boost.org/doc/libs/1_46_0/doc/html/interprocess/...
o11c•12h ago
That said, even without seals, it's often possible to guarantee that you only read the memory once; in this case, even if the memory is technically mutating after you start, it doesn't matter since you never see any inconsistent state.
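"Seals" here refers to Linux's memfd sealing. A minimal sketch of the producer side, assuming Linux with memfd_create() and the glibc wrappers; the function name is just illustrative:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Create a shared-memory buffer, fill it once, then seal it so nobody
       (including us) can resize or write it afterwards. */
    int make_sealed_buffer(const void *data, size_t len)
    {
        int fd = memfd_create("ipc-buf", MFD_CLOEXEC | MFD_ALLOW_SEALING);
        if (fd < 0) return -1;
        if (ftruncate(fd, (off_t)len) < 0) { close(fd); return -1; }

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { close(fd); return -1; }
        memcpy(p, data, len);           /* fill the buffer exactly once        */
        munmap(p, len);                 /* F_SEAL_WRITE needs no writable maps */

        /* After this, neither side can shrink, grow, or write the buffer. */
        if (fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) < 0) {
            close(fd);
            return -1;
        }
        return fd;   /* pass to the consumer over a UNIX-domain socket */
    }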
murderfs•8h ago
Any realistic high-performance zero copy IPC mechanism needs to avoid changing the page tables like the plague, which means things like memfd seals aren't really useful.
kragen•10h ago
- browser main processes that don't trust renderer processes
- window system compositors that don't trust all windowed applications, and vice versa
- database servers that don't trust database clients, and vice versa
- message queue brokers that don't trust publishers and subscribers, and vice versa
- userspace filesystems that don't trust normal user processes
a_t48•11h ago
As for allocation - it looks like Zenoh might offer the allocation pattern necessary (https://zenoh-cpp.readthedocs.io/en/1.0.0.5/shm.html). TBH, most of the big wins come from not copying big blocks of memory around - sensor data and the like. A thin header plus a reference to a block of shared memory containing an image or point cloud, coming in over UDS, is likely more than performant enough for most use cases. Again, the big win is not having to serialize/deserialize the sensor data.
Another pattern I haven't really seen anywhere is handling multiple transports - at one point I had the concept of setting up one transport as an allocator (to place data into shared memory or the like): serialize once into shared memory, then hand that serialized buffer to your network transport(s) or your disk writer. It's not quite zero-copy, but in practice most "zero copy" is actually at least one copy on each end.
(Sorry, this post is a little scatterbrained, hopefully some of my points come across)