https://beej.us/guide/bgc/html/split-wide/structs-ii-more-fu...
...goes into offsets, padding bytes, bit-field, unions, etc etc.
> Well, dear reader, this padding is added because the CPU needs memory to be aligned in sets of 4 bytes because it’s optimized in that fashion.
> ...
> Remember: since structs are aligned to 4 bytes, any padding is therefore unnecessary if the size of the struct is a multiple of 4 without the padding.
Individual data types have their own alignment (e.g., `bool`/`char` may be 1, `short` may be 2, `int` may be 4, `long` may be 8, etc.), and the alignment of a compound type (like a struct) defaults to the maximum alignment of its constituent types.
In this article, `struct Monster` has an alignment of 4 because `int` and `float` have an alignment of 4 for the author's configuration. Expanding one of the `int`s to a `long` could increase the alignment to 8 on some CPUs, and removing the `int` and `float` fields would decrease the alignment to 1 for most CPUs.
(Technically there is still an advantage to items being aligned to their size: such an item can never straddle adjacent cache lines.)
And there's also still tons of different alignment requirements when working with GPU data - and interestingly those alignment requirements may differ from C's alignment rules, so you may need to explicitly use packed structs (which are still not a standard C feature!) with manual padding.
https://lemire.me/blog/2012/05/31/data-alignment-for-speed-m...
TL;DR: 10% difference on what in 2012 was a low-end CPU, no difference on "new in 2012" CPUs. So my guess is that by now it really doesn't matter anymore :)
I happen to have a copy of AAPCS64 in front of me, and you can find the specification of bit packing in section 10.1.8. The sysv ABI for x86/x86_64 has its own wording (but I'm pretty sure is compatible). MSVC does something else I believe, etc...
> An implementation may allocate any addressable storage unit large enough to hold a bit-field. If enough space remains, a bit-field that immediately follows another bit-field in a structure shall be packed into adjacent bits of the same unit. If insufficient space remains, whether a bit-field that does not fit is put into the next unit or overlaps adjacent units is implementation-defined. The order of allocation of bit-fields within a unit (high-order to low-order or low-order to high-order) is implementation-defined. The alignment of the addressable storage unit is unspecified.
In practice, I've been using this feature for ages (along with __attribute__((packed))). There comes a point where you can start depending on compiler specific features instead of relying solely on the standard.
The original (?) DNS library used bit-fields to read and write the packet header status and flags fields, and that original code is still used today: https://github.com/openbsd/src/blob/master/include/arpa/name... It does rely on implementation-defined behavior--it swaps the order of fields based on whether the machine is little-, big-, or middle-endian--but even that behavior has remained remarkably consistent, at least on Unix ABIs.
I think the wisdom comes from dealing with niche compilers, especially from years ago, where bit-fields were one of the areas where standard adherence was less consistent; and from embedded and kernel contexts, where the implementation-defined and (implicitly) undefined behavior, such as with respect to atomicity or the intersection of bit-fields and (extension) packed pragmas, counseled against any use of bit-fields.
https://github.com/PeterCDMcLean/BitLib
You can inherit from bit_array and create a pseudo bitfield:
    class mybits_t : public bit::bit_array<17> {
    public:
        using bit::bit_array<17>::bit_array;  // inherit constructors

        // slice a 1-bit field; returns a proxy reference to the single bit
        auto field1() {
            return (*this)(0, 1);
        }
        // slice a 3-bit field; returns a proxy reference to the 3 bits
        auto field2() {
            return (*this)(1, 4);
        }
    };

    mybits_t mybits(7);
    assert(mybits.field1() == 1);
    mybits.field1() = 0;
    assert(static_cast<int>(mybits) == 6); // no implicit cast to integers, but explicit is supported
There is minimal overhead depending on how aggressively the compiler inlines or optimizes. bit_array can be compile-time storage (a la bitset) or dynamic at construction time. There is also bit_vector, which is entirely dynamic.
Critically though, is that the type/library is built on a proxy iterator/pointer that lets you do certain things with STL or the bit algorithms that you cannot otherwise do with bitset.
But back to the bitfields idea. The bit array backing storage is configurable through templates so you will be able to know for certain exactly how much stack or heap your type will take up
Bitfields in C leave way too much behavior to the compiler or undefined. It's really intolerable.
Arrow's struct is called StructArray. Fields of StructArray have a StructType.
StructType has .bit_width and .byte_width attrs in Python and probably the other implementations too: https://arrow.apache.org/docs/python/generated/pyarrow.Struc...
Arrow supports bitfields with BooleanArray, and enums with categoricals but
"BUG: Categorical columns using the PyArrow backend requires 4x more memory" (open) https://github.com/pandas-dev/pandas/issues/58062 :
> On disk Parquet appears to store the category data as logical type String which is compressed with snappy and encoded
Arrow Flight RPC handles nested structs with enums over the wire somehow too FWIU
I'm not sure portability is the reason though. I've seen it in places that have no problem spamming #define _GNU_SOURCE
It's absolutely wonderful for IO.
Big-endian systems sort-of still exist (I do indeed have an Amiga, and an Atari STe, and an Xbox 360, and they all work, so they must count) - but they're only slightly less weird from a practical C programming perspective than stuff like the DSPs with CHAR_BIT==24, and Harvard architecture microcontrollers with 2-byte ints. And so you can ignore them. People that still actually use these systems will be used to it.
S->buffer_free = 1;
This is a very serious limitation if the structure/class is being updated in a multi-threaded environment.
>”I’m by far an expert and I could be wrong in some specifications”
You’re missing a “not” on that 2nd sentence ;)
In all seriousness, you don’t need this disclaimer, you’re selling yourself short.
There’s nothing wrong with your post to make the disclaimer.
And if there is something wrong with the post, let the comments flood in and tell you that rather than yourself.
The next thing to really explore is how the struct is being used. Smaller is always nice, but sometimes you need it to be fast. To do that, you'll want to group together fields that are commonly read. For example, x, y, and speed are likely always read together. So, it'd make sense to group those fields together to optimize for cache reads.
The next thing to question is if your Monster is being interacted with as an individual thing or as a group of things. If it's as a group of things then it might make sense to create a "Struct of arrays" of monsters. Imagine, for example, you have a system which removes all the dead monsters. It'd be faster to iterate through an `isAlive` array rather than an array of `Monster` structs.
And now you are on your way to reinventing an ECS :)
Many a spacecraft telemetry stack works like that.
You can eliminate the cost of a wrapping class with no members by taking advantage of the empty base class optimization (or rather: by avoiding members in a base class you will be taking advantage of EBCO).
You can use an untagged union in C++ (taking care to avoid undefined behavior) to implement small object optimization. In its general form, the idea is that you store some smaller struct in a union (as a char[], since reading an inactive union member of another type is undefined behavior in C++) alongside a larger structure, and use a sentinel in the larger structure to determine which is active. E.g., if the large struct has some field that must never be zero, make it zero in the smaller struct and you can distinguish between the two without a tag (as would be required with std::variant).
I'm also a huge fan in general of "make invalid states unrepresentable" and following this can often reduce redundancy and eliminate extra fields.
In C, alignment for basic types is linked to size. A type may have an alignment requirement that is as much as its size: e.g. an 8 byte data type might be aligned as strictly as addresses divisible by 8.
The alignment of an array aggregate is that of its element type.
The alignment of a struct is that of its most strictly aligned member.
In portable coding, we don't exactly know, but we can make the worst-case assumption that every basic type has the strictest possible alignment---i.e. its size. We also don't know the sizes of all the types, but we know that integers of higher ranks are at least as large as lower ranks; i.e. sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long), etc.
Things are tricky when you have a mixture of floating-point, pointers and integers. E.g. pointer could be the same size as a float, or bigger.
Here is a trick you use: combine together clumps of like types, and try to make combinations that are powers of two. For instance, if your structure has two or four pointers to things, try to make them adjacent. E.g.:
    struct s {
        void *p1, *p2, *p3, *p4;
        short s1, s2, s3, s4;
        float f1, f2, f3, f4;
    };
Even though float is almost certainly larger than short and may have stricter alignment, it's moot because I have four of everything. Even if this is on a tiny system where pointers are 16 bits wide, four of them will make 8 bytes. So when the f1 member is being placed into the struct, the offset is at least 8-byte aligned: we don't expect padding. If you aggregate enough members of the same type in a power-of-two sized clump, you can easily create alignment which meets or exceeds the strictest alignment in the system.
So those are the main principles: group together members of the same type. Then keeping those groups together, sort by descending alignment strictness.
But it does mean that we've always had those programmers who have had characteristic holes in their education as a result.
Personally I enjoy having a mix of self-taught and formally-taught around; I think there are quite a few things the self-taught are often better at than the formally educated as a group. I respect both sides. I'm just saying that there has always been a market for this sort of thing on HN because the self-taught have always been around.
Would you mind expanding on this a bit?
(various vector types)
int128_t
long double
long long, int64_t, int_fast64_t, double
off_t
time_t (can be 64 bits on 32-bit platforms only if off_t also is)
register_t, int_fast16_t, int_fast32_t (64-bit on e.g. x32)
void *, void(*)(), intptr_t
ptrdiff_t (theoretically wrong if fully aligned and bigger than pointer)
size_t, ssize_t
long (32-bit on win64; generally has 16-bit alignment if bigger than pointer)
int32_t, float
int (can be 16-bit)
short, int16_t
char, bool, int8_t, int_fast8_t
(a trailing array member must come last; note when calculating combined size you should use offsetof instead of sizeof to not waste padding)
Besides this, `int_leastN_t` goes with `intN_t`, and all unsigned types go with their signed types. Note that types like `id_t` are hard to place, however. Although there do exist many platforms where this is not in order of size, the bigger types generally have smaller alignment on such platforms. Still, I have occasionally split out rows even without knowing of any such platform.
Why are you making your structs smaller? Probably for "performance". If your goal is literally just reducing memory usage, then this is all fine. Maybe you're running on a small microcontroller and you just straight up don't have much RAM.
But these days, in most areas, memory is cheap and plentiful. What's expensive is CPU cache. If you're optimizing memory use for that, it's really about runtime speed and less about total memory usage. You're trying to make things small so you have fewer cache misses and the CPU doesn't get stuck waiting as much.
In that case, the advice here is only part of the story. Using bit-fields and smaller integers saves memory. But in order to access those fields and do anything with them, they need to get loaded into registers. I'm not an expert here, but my understanding is that loading a bit-field or small int into a word-sized register can carry a bit of overhead (extra shift and mask instructions), so you have to pay attention to the trade-off.
If I was optimizing for overall runtime speed and considering packing structs to do that, I'd really want to have good profiling and benchmarks in place first. Otherwise it's easy to think you're making progress when you're actually going backwards.
In general, keeping commonly accessed things together on the same cache line is a good idea, and the work needed to remove excessive padding overlaps with the work needed to do this.
and that's why I am contemplating freaking 128G of ram for my next machine, I have 64 right now and OOM regularly, 32 is unuseable
The complication is that cache is a global resource. Code that has larger data structures, even if it runs faster, can contribute to a larger working set and a slightly higher cache miss rate in other code. This can lead to a death-by-thousand cuts scenario and it's hard to get solid data on this when profiling.
You're right, though, that there are a number of ways that smaller structs can be slower, either directly by execution time or indirectly by larger code causing more misses in the instruction cache. Some other factors include whether the compiler can coalesce accesses to multiple fields grouped together, or whether the struct hits a magic power of two size for faster array indexing or vectorization. Which means you can end up speeding up code occasionally by making the struct bigger.
It's also a reasonably common case for parsing binary data formats (e.g. a file's header has fields at exact byte offsets, and it's much faster and easier to just grab all the bytes of that header and interpret those bytes as a packed struct).
CPU cache is memory. Also memory isn't cheap, it is relatively expensive to access. CPU cache is way cheaper. You have it backwards.
Personally, I only pay attention to the cache line and layout when the structure is a concurrent data structure and needs to sync across multiple cores.
And there is a related talk by Andrew Kelley implementing these ideas on Zig toolchain: https://www.youtube.com/watch?v=IroPQ150F6c
Example framework:
https://github.com/skypjack/entt?tab=readme-ov-file
It has benefits besides memory/cache optimizations. It organizes logic in a much more composable way that is friendly for reuse.
I like the examples though; it's very clear what the changes are and the specific impact. Always a good reminder to review for these things after a session hacking in new things!