One of the maintainers (Catalin Marinas) made [0] a much more important point: Apple makes no promises about how their "TSO" bits work now or will work in the future. This mode was designed for Rosetta2, not the general public. It is not documented formally. Someone saying "it is TSO" is not documentation. A formal definition of a memory model is usually a very long document describing a lot of corner cases. For example, [1] is a SUMMARY of the ARMv8 memory model, and it is 31 pages long. It is a summary! The full spec makes up chapters D7 and D8 in [2], totaling 243 pages. Even then there are corners it does not touch on and that people get wrong. Without such a spec for Apple's TSO mode, how can anyone rely on how it might or might not work?
Additionally, you might find silicon bugs if you do something in this mode that Rosetta2 doesn't or didn't. Consider that the only first-party user of this mode was Rosetta2: anything you do that it never did might hit a bug nobody has ever seen.
The stated Linux kernel policy of "do not break user space" is impossible to deliver on if the feature is built on undocumented hardware behavior that might change at any time and was never fully publicly specified. The maintainers are right to reject this.
[0] https://lwn.net/ml/linux-kernel/ZiKyWGKTw6Aqntod@arm.com/
[1] https://developer.arm.com/-/media/Arm%20Developer%20Communit...
[2] https://documentation-service.arm.com/static/6943ef0c7982093...
> Some NVIDIA and Fujitsu CPUs run with TSO at all times; Apple's CPUs provide it as an optional feature that can be enabled at run time.
I thought I had been quite clear. I guess I'll try again even more clearly.
"TSO" is three letters. It is not a spec. "We all do TSO" is as meaningful as "we all want world peace". Everyone has their own meaning for those words, and the meanings may differ significantly. Each is a memory model, and each can be called "TSO". But just as not every "John Smith" is the same person, not everything called "TSO" is the same. Does NVIDIA's TSO order ALL reads with respect to ALL writes? Does Apple's? What does x86 do in that case? What does a Fujitsu CPU do? "TSO" does not mean the same thing to everyone, just like "world peace" does not.

If, for example, NVIDIA came out and said "our TSO mode complies 100% with the x86 memory model and always will", then Fujitsu did the same, and then (LOL) Apple also publicly promised that, then and only then would your comment make sense. As it stands, four entities use the same acronym to each mean their own thing, and you are assuming absolute equality because the three letters match.
Fun story: I know FOR A FACT the answer to my above question about ordering of all reads vs all writes is not the same for x86, Apple's TSO, NVIDIA's TSO, and Fujitsu's TSO. Do you? Do you know how? Do you know how the answers might change with time and hardware revisions, given that at least Apple made no promises as to how their undocumented TSO mode works today or will work tomorrow? Exactly...
One cannot build a stable f{ea,u}ture on undocumented un[der]specified hardware features.
> Someone saying "it is TSO" is not documentation.
Trying to re-implement what IBM's BIOS did was not documentation either.
The original sets the standard, whether a given implementation is perfectly equivalent or not.
Oh, and to answer your question, yes, quite aware, actually. I've done quite a bit of low level work over the decades, including, curiously, working in the Apple platform kernel team at the time when this TSO bit appeared.
Well of course they differ. TSO says that some reorderings are banned and some are optional, and there's a million factors that go into deciding when those options are taken.
> "TSO" is three letters. It is not a spec.
It's a few rules that you can depend on. Are those rules not enough to build a program on top of? The simpler you make your rules, the less spec you need. On the other end of the spectrum, a dozen specialized memory barriers need a ton of explanation.
> It's a few rules that you can depend on.
Until properly specified they are not "rules" but "hopes". Apple made no promises and provided no specs for their TSO mode. What makes you sure that that TSO bit on AppleM4pro acts the same as on AppleM1? That same "TSO" bit might mean yet a third thing on AppleM7megaMaxProEliteG2 in 2031. How do you know that an OS update that also updated iBoot on your Mac did not change some internal chip config MSR and now even on your AppleM4pro CPU whose TSO you understood, it acts differently due to this config bit change?
"If you know you have world peace"
Sure, now define "total". Which accesses does that affect and which ones does it not? Is device memory included? PCIe memory? Are there ordering guarantees between mappings with different permissions?
Then, define "store ordering". Does it affect loads in any way? Or simply just stores?
At a basic level TSO is a model for how cores interact and devices are weird, so I'd say those get to be unspecified.
Ideally you also want a line saying whether the instruction cache needs to be flushed for self-modifying code, since leaving that unspecified is kind of a violation, though a forgivable one.
> Then, define "store ordering".
Sure, though I'm not promising my wording is perfect: In TSO, when stores complete they become visible to all other cores and all cores agree on the exact same list of completed stores.
> Does it affect loads in any way? Or simply just stores?
Depends on what you mean by "affects". Loads in one core might not see stores from another core that have not yet reached the global/total list.
Not that they agree on what completed, but on the order in which things completed. That is the "O" in TSO. You inadvertently proved my point.
> so I'd say those get to be unspecified.
* CRASH *
You left something unspecified that mattered. Ordering of accesses to mappings with differing permissions matters, and whether or not other cores see those accesses in program order can break x86 emulators (the main use case for TSO).
That's the point here :) This is the usual "I am sure we can all agree what X means" argument; it does not work when it comes to precise things like memory models.
A list is ordered. You're trying too hard to nitpick. (Also I gave a disclaimer that my wording wasn't perfect, and it only took a couple words for you to "fix" it. If it can be fixed that easily then that doesn't actually counteract my point.)
> You left something unspecified that mattered. Ordering of accesses to mappings with differing permissions matters, and whether they are seen in-program-order or not by other cores will break x86 emulators (main use cases for TSO).
How many x86 emulators have the emulated code talking directly to hardware, to the same piece of hardware, from multiple cores at the same time?
I don't think this is a "main use case".
Plus there's going to be a baseline for how talking to the hardware works. Only TSO-mode-specific details of the hardware access are left unsaid in this basic model, and many access patterns fitting the above description still won't notice anything one way or the other.
It affects the visible ordering of remote stores to normal memory, so loads are necessarily affected (it wouldn't make sense to guarantee a store order that is unobservable).
Really, TSO is defined independently of x86 and in fact it took a while to actually prove that x86 was TSO. Concretely, how do architectures that claim (optional) TSO differ from each other at least for access to normal memory?
I bring this up only because I recently made it my mission to get IBM's "PowerVM Lx86" dynamic translation system running on a POWER9 system under modern Debian Linux.
This required a lot of hackery I won't go into here, but it revealed two things to me:
1. The lx86 system depended on an undocumented (??) pseudo-syscall, "syscall 0x1EBE", to switch between big- and little-endian modes. It wasn't a real syscall: it was a conditional branch in the assembly entry point of the PowerPC process exception handler that switched endianness and returned, quicker than a "real" syscall, and tools like strace do NOT capture or log it. It is also long gone from the Linux kernel, replaced with the real syscall switch_endian (hex 0x16B). Adding it back in wasn't too hard, but it'd sure as heck never make it upstream again ;)
2. A lot of other Linux interfaces have had bits added that break "rigidly coded" old applications. For example, netlink has had extra data stuffed into various structures, which code like lx86's translator panics on. Getting networking running, even with an antique x86 userland, required removing a bunch of stuff from the kernel: not disabling it via configuration, but deleting code.
All this to say, there is a precedent for breaking the Linux user facing ABI for hardware features like this. I'm not saying that's a good thing, but it is a thing.