"Benford's law, also known as the Newcomb–Benford law, the law of anomalous numbers, or the first-digit law, is an observation that in many real-life sets of numerical data, the leading digit is likely to be small."
Even for seemingly trivial scenarios like searching an array, the contents and length of the array make a massive difference in results and how to optimize (as shown in the last section of this write-up where I tried to benchmark search algorithms correctly https://www.jasonthorsness.com/23).
I've not seen a perfect solution to this that isn't just "thinking carefully about the test setup" (profile-guided optimization/production profiles replayed for benchmarks seem like they could be an alternative, but I haven't seen that used much).
I can vouch for this. I've been doing web performance for nearly 15 years and finding clear representative problems is one of the hardest parts of my job. Once I understood this and worked on getting better at finding representative issues, it became the single thing that boosted my productivity the most.
I'm not sure if it has gotten easier. Async is harder to profile than classic threaded code using blocking IO. And for database bound code, I'm not sure which databases output detailed enough performance information.
And there's also https://docs.datadoghq.com/profiler/
Someone out there might have an application in which they are serializing large numbers of large integers.
The case for abandoning it is not as strong, since you don't know who is doing what.
It's a guessing game in programming language run-times and compilers: "will anyone need this improvement?" You have room to invent "yes" vibes. :)
Something like grabbing .01% of the inputs to that function for a day (maybe a lower fraction over a week or month).
Is the cost of grabbing this that high?
definitely made me go :-O - how big a company do you need to be for that to be a realistic thing to maintain in house?
I've maintained in house forks of major code bases with a handful of minor code changes while we've tried to convince upstream to accept them.
There are 17 listed on SDKMAN's website: https://sdkman.io/jdks
> OpenJDK
As far as I know, all other distributions of the JDK are forks of OpenJDK. Would love to know if there's any that isn't.
https://www.ptc.com/en/products/developer-tools/perc
https://www.aicas.com/products-services/jamaicavm/
JikesRVM (https://www.jikesrvm.org) no longer active, but it was its own thing.
Then you have Azul Zing and OpenJ9, which use parts of OpenJDK for the standard library part, but the foundations are their own thing.
https://www.azul.com/products/prime/
https://developer.ibm.com/components/open-j9/
Then you have that mobile OS, which isn't really a JVM-compliant implementation, but it is close enough for many workloads and libraries to work on top of it.
An example of why I'd prefer to avoid Java for something like that. Java has a reputation for being slow because of JIT compilation and GC, but I also think a lot of it comes down to all non-primitive types being boxed Objects (boxed = in a separate allocation). This means it works the garbage collector so much harder, there's poorer locality, and it fills RAM/cache with extra per-Object stuff like mutexes that no one ever uses. (Apparently the object header is now down to 8 bytes after a lot of effort to reduce it. https://openjdk.org/jeps/450)
The problem is that String really isn't a language primitive.
Different people and programs always have a different notion of what a String should do (Should it be mutable? Should it always be valid UTF-8? Which operations should be O(1), O(n) or O(n log n)? etc.)
If they do, they're wrong, as the two languages are quite different here. In Java, String requires two allocations: your variable is implicitly a pointer to the String allocation, which in turn has a pointer to a char[] allocation. In C++, the std::string itself is a value type. The actual bytes might be inline (short string optimization) or behind a single allocation accessible from the std::string.
Rust's std::string::String is somewhere between: it's a value type but does not have a short string optimization (unless you count the empty string returned by String::new).
> Different people and programs always have a different notion of what a String should do (Should it be mutable? Should it always be valid UTF-8? Which operations should be O(1), O(n) or O(n log n)? etc.)
Sure, there can be call for writing your own String type. But what's unique about Java as compared to say C, C++, Go, Rust, even to some extent C# is that you can't have a class or struct that bundles up the parts of your data structure (in the case of a mutable string, two fields: data pointer/capacity + the used length) without boxing. There's a heavy cost to any non-primitive data type.
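A rough sketch of what I mean (a hypothetical class, not from any real codebase): even a tiny mutable-string holder like this is its own heap object on top of the byte[] it wraps, whereas in C, C++, Rust, Go or C# the two fields could live inline in a containing struct or on the stack.

    // Two allocations per instance in Java: the holder object itself,
    // plus the byte[] it points to.
    final class MutableAsciiString {
        byte[] data;   // separate heap allocation; capacity is data.length
        int length;    // used length

        MutableAsciiString(int capacity) {
            this.data = new byte[capacity];
        }
    }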
That’s true, but thanks to GC an allocation also is just a pointer bump in Java, and the two allocations are likely to be close to each other. For short-lived strings, the GC cost is furthermore zero, because only the longer-lived objects need to be tracked and copied with generational GC. So, “heavy cost” is relative.
Also, String.intern() can be quite slow compared to something like ConcurrentHashMap.putIfAbsent().
You also can’t make a first-class string type in most of those languages because “hi” is defined to be of a system-specified class. You can make a different type to store strings, but it’ll never be as ergonomic to use.
It’s even worse in JavaScript, where the standard library is written in a different language (usually C++). It’s impossible to write JavaScript code that matches the performance of the built in string primitive. At least outside of specific niche use cases.
Rust has lots of alternate string like libraries in cargo that are optimized better for different use cases - like smallstring or ropey. They’re fast, convenient and ergonomic. I imagine C++ is the same.
This is less problematic nowadays, because for many folks there are only three compilers that still matter, or even a single one.
While they were used as inspiration for their design, the AOT approach, automatic memory management, and programming language features for low-level coding really took a while to come back from them; even the NGEN/unsafe from C# 1.0, while welcome, wasn't that great.
I always think that in an alternate reality, had those features already been there in version 1.0, C and C++ would have lost even more mindshare at the turn of the century, and something like C++11 would not have been as impactful.
They're working on it: https://openjdk.org/projects/valhalla/
But it has a much smaller impact than in, say, Rust, where it's really an allocation (asking the kernel for more RAM). Java's Object "allocation" happens in its own heap, which is a big chunk of RAM already pre-allocated from the kernel. So in the JVM boxing is really cheap, often just a pointer increment. Also, oftentimes the wrapper object and its value are located near each other in the same memory page, so we're not adding a RAM access, more like an L2 access.
PS: In some cases it can be even faster than Rust or C++, because it pre-allocates larger pages and drops them in chunks within generational GC (e.g. all values "allocated" to process one HTTP request can be GCed immediately after), while C++ is eager to destruct each object immediately. Also, a GC sweep can happen in a separate thread, not bothering the main thread the user is waiting for. One can do the same in Rust and C++ using arenas, of course.
What? No. Rust doesn't ask the kernel for each allocation individually. That'd be insane; besides the crushing system call overhead, the kernel only does allocations in whole pages (4 KiB on x86-64) so it'd be incredibly wasteful of RAM.
Rust does the same thing as virtually every non-GCed language. It uses a memory allocator [1] that does bookkeeping in userspace and asks the kernel for big chunks via sbrk and/or mmap/munmap. Probably not the whole heap as a single chunk of virtual memory as in Java, but much closer to that than to the other extreme of a separate kernel call for each allocation.
[1] by default, just libc's malloc and free, although you can override this, and many people choose jemalloc or mimalloc instead. My high-level description applies equally well to any of the three.
Of course you can optimize better with Rust/C++/etc., but it is not trivial and you definitely don't get it out of the box for free. My point is, how much overhead Java has is a bit overblown.
I meant that Java usually already has that memory allocated; the JVM is a memory allocator of its own. It operates within its own heap. One can do that within Rust of course (or more easily in Zig, as it welcomes passing an allocator around); it's just built into the JVM already. The drawback is that it's more complex to release that memory back from the JVM process, so Java apps (not AOT-compiled) usually consume more RAM.
I recently encountered an example of someone writing a Rust version of a popular Java library by just taking the Java code, commenting it out, and writing the Rust equivalent almost line for line. The approach works great (no need to reinvent the wheel and you can point to the existing documentation and code samples) but in terms of allocations, you're not going to find many improvements.
There's a type of Java code that looks more like C code than anything else that runs blazing fast with minimal overhead. It's not the type of Java code you'll probably encounter when writing Java applications, but if you use Java as a kind of cross-platform C target, you can get pretty close to Rust (and for some use cases even beat it). Java has a LOT of tricks up its sleeve (pointer compression, dynamic realignment) that Rust can't automatically take advantage of.
Your AbstractFunctorClassFactoryProducer isn't going to be very allocation efficient, but once you start seeing volatile ints all over the place, things quickly become a lot faster.
Nonsense. Sure, the act of creating a new allocation is cheap, but that's not where the expense lies at all. And of course, to make allocations cheap, you needed to give up something else.
Each allocation needs to be at some point handled by the GC, so the more allocations, the more GC pressure. That then forces you to make your GC generational, which constrains your GC design. You end up with a slower GC no matter what you do.
Moreover, every access to that allocation goes through a pointer, which is basically a guaranteed cache miss.
The classic case of this is iterating over an Array<Integer>. Not only did you have to make n allocations that the GC now has to keep track of, it's not possible to efficiently fetch items of this array from memory at all. You need to first fetch the pointers, then request the pointed to memory. Even in the best possible case of the pointer living right next to the data it points to, you're still paying the overhead of extra memory taking up space in your L1 cache.
Compare with an array of simple integers. Zero GC overhead, zero pointers, zero memory overhead, trivial to prefetch.
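To make that concrete, a small sketch (illustration only):

    final class SumExample {
        // Summing a boxed array chases one pointer per element; summing a
        // primitive array reads a flat, prefetch-friendly block of memory.
        static long sumBoxed(Integer[] xs) {
            long total = 0;
            for (Integer x : xs) {
                total += x;      // unbox: load the reference, then the int behind it
            }
            return total;
        }

        static long sumPrimitive(int[] xs) {
            long total = 0;
            for (int x : xs) {
                total += x;      // contiguous reads, no indirection
            }
            return total;
        }
    }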
---
This is a case of horrid approach to design. Making allocating cheap must come at a cost of slowing down some other part of the GC and just encourages programmers to allocate more, which puts more and more pressure on the GC, making it even slower. It's a self-defeating approach that is therefore completely bone headed.
Allocations should be expensive!
With expensive allocations you get breathing room in your GC to make it otherwise faster. Moreover, since programmers are now encouraged not to allocate, the GC pressure is lower, making it even faster. But it gets even better. Optimizing becomes very clear and straightforward - just reduce allocations. This is great because it allows for targeted optimizations - if a particular piece of code is slow, just reduce its allocation rate and you'll get a nice speedup. Very easy to reason about.
That's why code written C/C++/Go/Rust tends to be so conscious of allocations and any indirections, but Java is full of linked lists, arrays of pointers, allocations everywhere and thousands of layers of indirections.
Cleaning up heaps of garbage can never be faster than not creating the garbage in the first place.
I'm always looking for / wondering about valid data representing the problem so that I can measure success.
A little of last and much of this week I've spent addressing a performance problem. At first I was told it was always happening in some specific situation, it very much is not. Then I was told it frequently happens ... nope, maybe 2 to 4 times a day at most.
I think I've managed to track the issue down. I think so, pretty sure? On the other hand this client has a couple of performance-related issues ongoing, and listening to them talk on a conference call... I'm not really sure they differentiate between them. It's very possible my neighbor a few seats away from me is actually addressing "the problem" and I'm chasing ghosts. But the client is too busy to provide good data. Meanwhile this issue is "too important" for us to do nothing and wait; granted... we might be doing nothing anyway.
This literally made the GA encoding exactly the same as the solution and also very, very obviously favoured techniques that would MAKE ALL THE BITS 1!
https://link.springer.com/article/10.1007/s10710-021-09425-5
They do the convergence analysis on the linear system Ax = 0, which means any iteration matrix (including a zero matrix) that produces shrinking numbers will converge to the obvious solution x = 0 and the genetic program just happens to produce iteration matrices with lots of zeros.
It WAS still a great compiler, and way faster than the competition at the time.
That's only half-snark. It was "wasted" in the sense that it wasn't as useful as the OP thought it would be. But diving deep into a technical problem? That left an indelible pattern on the OP's psyche. If they ever have to do something like that again, it will certainly be easier. And since AI is going to be doing all the simple coding tasks, humans will have to dive into the deep, hard tasks.
Adding the fast path to the hard-won assembly is just optimising the slow path, and there is little point to optimising it. Complex optimisations are also a form of technical debt, so they are probably better off not integrating it.
I am wondering more about the bit-by-bit compatibility part from "This was verified in a benchmark that ran both versions on billions of random numbers, confirming both the huge speedup and the bit-for-bit compatibility of the result.".
How does one test for 18.4 quintillion numbers? This isn't possible.
In Canada, your org can get government money for this under the SR&ED program:
https://www.canada.ca/en/revenue-agency/services/scientific-...
He met Ken Thompson and saw beautiful C code for the first time because he had encountered a performance problem in a service. The service had to choose a policy to enforce (or something) based on the incoming request. It was taking too long to match the criteria of each potential policy against the request.
Ken wrote a finite automata based pattern matcher that would simultaneously advance through all of the policies' predicates. It was perfect, and it was much faster than the existing code.
Then somebody noticed that 99.9% of requests matched a particular policy, so they changed the existing code to just check that predicate first, and the code sped up a zillion times, much more than with Ken's solution.
Somehow relatedly, I still remember the first time I heard about profile-guided optimization which is essentially the same but for all of your code at once (well, same idea, not sure it's aggressive enough to reach the same result as the anecdote you shared).
Exactly: profile-guided optimization is pretty awesome if you have the right infrastructure. You can get maybe 15% speedup on your program without going deep into all the code. But it has limits. IIUC, it gathers information about which code is hot, including which branches are taken, and enables certain compiler optimizations based on that. [1] But it's not going to make huge algorithm changes, it's not going to change data structure definitions [2], and it's not going to prepare good microbenchmark input for you to compare your new algorithm against the current one.
Also, this article is about Java, and profile-guided optimization is something Java just has baked into its JIT already.
[1] I don't know exactly which, and they likely vary by compiler version. But I think a variety of stuff to better use cache: breaking code apart into likely-"hot" vs likely-"cold" pages, deciding which loops are worth aligning, and where inlining is worthwhile. Also where branches should be optimized for likely vs unlikely vs unpredictable (conditional instructions), and when it's worth autovectorizing. But it's not as good at SIMD as a human who puts two weeks into it, and SIMD usage is constrained by the inability to change data structures.
[2] in theory, in Rust's default #[repr(Rust)] it could reorder a struct's members, but it's not going to say change from array-of-struct to struct-of-array format or hashmap to btree.
Lines of thinking like that are part of the reason most modern software is so sloooow :)
    normal(x) == optimized(x)

Then I write the optimized code and use a test like yours to compare the results between the simple code and the optimized code.
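Something like this sketch, where the two operators are just placeholders for the simple and optimized implementations:

    import java.util.SplittableRandom;
    import java.util.function.LongUnaryOperator;

    final class EquivalenceCheck {
        // Run both versions on many random inputs; fail on the first mismatch.
        static void check(LongUnaryOperator simple, LongUnaryOperator fast,
                          long seed, long iterations) {
            SplittableRandom rng = new SplittableRandom(seed);
            for (long i = 0; i < iterations; i++) {
                long x = rng.nextLong();
                long expected = simple.applyAsLong(x);
                long actual = fast.applyAsLong(x);
                if (expected != actual) {
                    throw new AssertionError("mismatch for input " + x
                            + ": " + expected + " vs " + actual);
                }
            }
        }
    }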
One time I needed to write a function to search for a specific pattern in a binary file and change it. So I wrote the brute force code as a first step, the same code that anyone would probably write as a simple solution. It worked the first time, and a couple of people reviewed the code and said "yep, even if it's slow, it is correct."
But this code took more than a second to run!
Of course I thought about optimizing it with Boyer-Moore or the like. Then I went, "Hold on to your horses. This isn't something like a web page load where one second matters. It's part of a build process that only runs a few times a day and already takes several minutes to run. One extra second is nothing!"
In the wise words of Kenny Rogers in The Gambler:
You got to know when to hold 'em,
know when to fold 'em
Know when to walk away
and know when to run

Every coder knows
the secret to optimizin’
is knowing what to throw away
and knowing what to keep

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
A more robust strategy would be at least be to check if the rule was the same as the previous one (or a small hash table) so that the system is self-healing.
Ken’s solution is at least robust, and by that property I would prefer it, since it’s just as fast but doesn’t have any weird tail latencies: requests outside your cache distribution are as fast as the ones in it.
Ken's solution was stated to have been slower than the alternative optimization.
Also, it's trivial to keep Ken's implementation as the slow path. If request patterns change, dig up the new fast path and put the old one in Ken's slow path code. Most of the performance will still come from the initial `if`.
    if value in most_common_range:
        take_fast_path()
    else:
        use_kens_solution()

Learning this is worth writing the otherwise useless code.
Now? I just dump the problem into an LLM and it spits out a script that solves it, and then I throw away the script because it's horribly bugged for any other case than my particular problem. But hey, I just solved my problem.
That's why these days I just start writing code, get it to 'it works!' level. To then throw it all away and start again. Takes a little more effort, but the payoff is far less problematic code and tech debt.
¹: Write Everything Twice
That said, you can still have generic optimizations which will benefit nearly all (for instance quick sort).
In PGO you profile your application running in the real world (via automatic instrumentation) and the compiler uses the output to make optimization decisions.
I'm not sure what exactly they use it for, but I imagine it could be used for loop unrolling, switch case ordering, branch prediction optimization, memory ordering and more.
There's absolutely no reason to think that the specific workload of a system generally makes a random/arbitrary/heuristic choice the best way to pick the most likely branch for a given branching instruction.
For example, when scanning data you might check if the current value is equal to a test value, and branch on that to either code that handles the match, or else code that moves to the next entry to test that. This may be extremely likely to match (95%) as in a hash table, or very likely to not match (string search). A compiler can't tell which is true, and your workload literally has no impact on this unless it is pathological.
This is what I meant by pathological data. If you had weird data that somehow, over and over, got hash tables right up to their fill limit and just stayed there then you would have slightly more collisions than an average use case. However you would have to construct weird data to do this. It wouldn't be "more ingestion". The thing PGO optimized would be individual branches in the hash table code, so for example the branching instruction to see if the key in a bucket is == to the key you are looking for.
In read-only workload, searching through the key-value space there can't be a collision, no?
https://en.wikipedia.org/wiki/Hash_table#Collision_resolutio...
What I pictured in my head is that ingestion workloads, in contrast to read-only workloads, are different in that collisions in the former can trigger hash-map resizing/rehashing, which is why I say that these are two distinct workloads: moving a bunch of data to and from the buckets.
So, my hypothesis is the following: if I PGO the implementation using the dataset (volume, data_distribution) that it triggers a handful of such resizing/rehashing operations, and we observe the positive improvements in wall-time runtime performance, what are the chances that using the same binary but now over a completely different dataset will preserve the same runtime performance improvements? What you're saying, if I understand correctly, is that the improvements will be the same regardless of the dataset (and even type of workload) and I am saying I am not so sure about that.
So every time you add something to the hashtable, it will check the current size against the threshold at which it needs to resize. 90% of the time, then 99% of the time, then 99.9% of the time and so on, the answer will be 'no, there is space left'. This is just because the size goes up geometrically (in general), so a very low percentage of insert operations trigger a resize. The 'high probability branch' for "if current_size == max_size" should almost always be false. This is independent of workload. Without profile data, the compiler would not know whether that comparison is 'almost always true' or 'almost always false', which might lead the CPU's branch predictor to guess wrong, which will lead to pipeline stalls.
Read up on how pipelined, superscalar CPUs work to understand why this is very important.
https://en.wikipedia.org/wiki/Pipeline_stall
https://en.wikipedia.org/wiki/Predication_(computer_architec...
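As a toy illustration of that resize branch (not real HashMap code, just the shape of it):

    final class TinyTable {
        private Object[] slots = new Object[16];
        private int size;

        void add(Object o) {
            if (size == slots.length) {   // almost always false, regardless of workload
                // rare geometric resize
                slots = java.util.Arrays.copyOf(slots, slots.length * 2);
            }
            slots[size++] = o;
        }
    }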
I'd say that an org has a limited number of pennies to spend on non-trivial code. Let's spend them wisely.
Meanwhile, one of my favorite coworkers smelled a caching opportunity in one of our libraries that we use absolutely everywhere, and I discovered a whole can of worms around it. The call was too expensive which was most of the reason a cache seemed desirable. My team had been practicing Libraries Over Frameworks, and we had been chipping away for years at a NIH framework a few of them and a lot of people who left had created in-house. At some point we lost sight of how we had sped up the “meat“ of workloads at the expense of higher setup time initializing these library functions over, and over, and over again. Example: the logging code depended on the feature toggle code, which depended on a smaller module that depended on a smaller one still. Most modules depended on the logging and the feature toggle modules. Those leaf node modules were being initialized more than fifty times per request.
Three months later I had legitimately cut page load time by 20% by running down bad code patterns. 1/3 of that gain was two dumb mistakes. In one we were reinitializing a third party library once per instantiation instead of once per process.
So on the day they shipped, they claimed a 20% response-time reduction (which was less than the proposal expected to achieve), but a lot of their upside came from avoiding work I’d already made cheaper. They missed that the last of my changes went out in the same release, so their real contribution was 11%, but because of the extra requests it only resulted in a 3% reduction in cluster size, after three man-years of work. Versus 20% and 7% respectively for my three months of work spread over five.
Always pay attention to the raw numbers and not just the percentages in benchmarks, kids.
>so they changed the existing code to just check that predicate first, and the code sped up a zillion times, much more than with Ken's solution.
but you've now spent all this time developing some beautiful solution for .01% of the actual data flow. Not the best use of dev time, essentially.
1. One branch for the one-byte case, always inlined, just store the byte
2. Calculate the size of the unencoded zero prefix with a branch-free construction: (((uint32_t)clz + 7) * 9) >> 6
3. Hand roll a jump table taking advantage of arm's fixed instruction width to calculate the jump target with a shift.
https://github.com/protocolbuffers/protobuf/commit/b039dfe26...
This results in one branch for 1 byte varints, plus one additional branch for any larger size, and the branch predictor does not have to track a varying trip count through a loop. This approach resulted in a 2.64% speedup for overall encoding (which includes a lot more than just varints) on mid sized arm cores.
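If I'm reading that step-2 formula right, it computes how many of the (at most 10) output bytes would be all zero, so the encoded length falls out without a branch. Transliterated to Java for illustration, assuming clz is taken over the full 64-bit value:

    final class VarintLength {
        static int length(long value) {
            int clz = Long.numberOfLeadingZeros(value | 1); // | 1 keeps clz <= 63 for value == 0
            int zeroPrefixBytes = ((clz + 7) * 9) >>> 6;    // output bytes that would stay zero
            return 10 - zeroPrefixBytes;                    // encoded length in bytes (1..10)
        }
    }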
I think it's very difficult to beat a single comparison and branch for the 1 byte case for actual encoding forwards or backwards, unless you know there's going to be long runs of one-byte values.
I would expect C or asm to be a better choice at ultra high levels of optimization. I know Java can get somewhat similar performance if you put the work in, but I’d assume you want to do manual memory management at minimum.
I’m just saying that this company seems to be throwing a lot of money at the problem, and at those levels you can squeeze more performance out of them and get them reasonably right.
After all, many, many pieces of reliable software are written in C. If you wanted a “safer” modern language, you could pick Rust.
I would just assume those would be easier picks rather than optimizing around GC pauses, which can be notoriously difficult when nanoseconds matter.
Java is unique in that it often ran a dual stack of bytecode and platform-specific binary objects.
My advice often is to use high-level languages if you can get away with the convenience, and a low-level language with proper design patterns if you must.
Many programmers I've met shouldn't write C/C++, but mostly because they can't be trusted to write memory- or thread-safe code.
Cheers, =3
C doesn’t inherently have to be bad (and can often be better than C++) if you are strict about adhering to standards.
I like to point to OpenBSD as a successful and well-done C project. As a bonus, if you develop on OpenBSD, your program is more likely to have correct memory handling operations (OpenBSD has a lot of very aggressive memory protections. Many issues in ported applications like Firefox were found and fixed by OpenBSD users, but would happily keep running on Linux).
"Why we chose Java for our High-Frequency Trading application"
https://medium.com/@jadsarmo/why-we-chose-java-for-our-high-...
I still just feel like a non-GC language (maybe even Rust) is a better choice when nanoseconds matter.
Ada is not something that these companies would bother using, even though there are about 7 vendors still in business.
Likewise with either FreePascal or Modula-2.
Most likely a few of these exchanges are writing new code in Rust for such deployment scenarios.
Data condition can definitely affect runtime performance. All the way down to the micro architecture level. Random, sorted, reverse-sorted, uniform data sets often lead to dramatic differences in perf. Then you get into “Ooops, my dataset had a surprisingly large number of NaNs or denormalized floats!”
Obviously, it becomes hard in such an aggressively optimized system, but if that's the case, trying and failing is an acceptable approach (I've seen 2 weeks wasted in many worse ways, to be honest).
Another anecdote: I profiled our Java-based server and concluded that most of the time was being spent on creating temporary DI containers for request handling. I spent several days trying to avoid copying the "app" scope components to the "request" scope container (we forked and heavily modified our DI container implementation) so that creating request containers was cheaper. I wrote a JMH benchmark that tried to emulate the real thing based on the real data (with something like 40% of components in the "app" scope, 60% in the "request" scope). My optimisation made the code noticeably faster; I can't remember by how much exactly, but probably something like 25% faster.
When I ran the "real" performance test that simulates real-world usage of the server (not just the containers), the performance was slightly worse!! Ended up throwing the code away, as the OP did. I had already spent too much time on it so I couldn't dig much further into why that was. But from other experiments, I can say that the JVM, thanks to JIT compilation of bytecode, will heavily optimize the hot path anyway based on the actual usage patterns, so much that the Java code you wrote may resemble very little of the machine code that is actually executing in the end. That's why I said it may also be "trivial" to optimize JVM code: just let the JIT do its job.
Of course you may still obtain big gains when you optimize code that was doing stupid things, but because of JIT "magic", even your stupid code might run really fast given enough JIT passes.
In a sense, it's a shame this encoding wasn't structured like UTF-8, or even the other way around, a shame UTF-8 wasn't structured in this more generic way.
* it's not "self-synchronizing". If you jump into the middle of a ULEB128 stream and see a byte starting with 0 you can't tell whether it is a single-byte code point or the last byte of a multi-byte code-point. You have to back up to determine the meaning of the byte, which is often not viable. In fact you can't even be sure of the first 3 bytes!
* Relatedly, some code-points are subsets of other code points (0AAAAAAA appears inside of 1BBBBBBB 0AAAAAAA), so you can't search for a code point in a string without checking backwards to make sure you are comparing against a whole code point.
* You can't tell the difference between a string cut at the end of a 2 byte code point and one cut part way through a 3 byte code point.
But for issue #2, that seems to not be too bad since you only need to look one byte backwards.
At the same time, for #3, in the middle of a UTF-8 bytestream you need to look backwards as well for anything but the ASCII (7-bit) codepoints.
That left only varint used for tags. For encoding we already know the size at compile time. We also don’t have to decode them since we can match on the raw bytes (again, known at compile time).
A final optimization is to make sure the message has a fixed size and then just mutate the bytes of a template directly before shipping it off. Hard to beat.
In a newer project we just use SBE, but it lacks some of the ergonomics of protobuf.
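Roughly what that fixed-size-template trick looks like (a sketch; the field and offset are made up):

    // Pre-serialize the message once, then patch only the fixed-width fields
    // that change before shipping the buffer.
    final class TemplateWriter {
        private final byte[] template;   // pre-encoded message bytes
        private final int priceOffset;   // byte offset of a fixed-width field, assumed known

        TemplateWriter(byte[] template, int priceOffset) {
            this.template = template.clone();
            this.priceOffset = priceOffset;
        }

        byte[] write(long price) {
            java.nio.ByteBuffer.wrap(template)
                    .order(java.nio.ByteOrder.LITTLE_ENDIAN)
                    .putLong(priceOffset, price);
            return template;             // same buffer every time, mutated in place
        }
    }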
Better to do janky testing with a cobbled together script on your own data & usage scenarios than standard tools that measure standard metrics
That must be awful to work with. Why couldn't they just pick a more adequate language if they wanted performance? Or write bindings for performance critical parts?
The good variable integer encoding is the one used in MKV format: https://www.rfc-editor.org/rfc/rfc8794.html#name-variable-si... The compression win i.e. encoded size is equal to LEB128, however both encoding and decoding are way more efficient.
The encoder is a few fast instructions for all numbers no matter how large or small: bit scan reverse (or leading zero count) to find the count of bits required to store the input number, division by 7 rounding up to compute the count of bytes (modern compilers reliably optimize this into integer multiplication and bit shifts), bitwise shift + OR to set the VINT_MARKER bit, then byte swap to store in big-endian format.
The decoder is equally efficient: compare first byte with 0 to detect numbers longer than 8 bytes (very rarely happens), bit scan reverse of either first or second byte to find encoded length, byte swap to load the big-endian format, then reset the most significant bit in the integer to get rid of the VINT_MARKER.
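A rough Java sketch of that encoder, for values that fit in 8 encoded octets (up to 56 data bits) and ignoring the reserved all-ones patterns:

    final class Vint {
        // n-1 leading zero bits, a VINT_MARKER bit, then the value,
        // stored big-endian in n octets where n = ceil(significant bits / 7).
        static byte[] encode(long value) {
            int bits = 64 - Long.numberOfLeadingZeros(value | 1); // significant bits, >= 1
            int n = (bits + 6) / 7;                               // encoded length in octets
            long withMarker = value | (1L << (7 * n));            // set the marker bit
            byte[] out = new byte[n];
            for (int i = n - 1; i >= 0; i--) {                    // big-endian byte order
                out[i] = (byte) withMarker;
                withMarker >>>= 8;
            }
            return out;
        }
    }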
Actually there is - you can exploit the data parallelism through SIMD to replace the logarithmic with near-constant complexity. Classic approach indeed is very slow and unfriendly to the CPU.
That doesn’t help much when you’re streaming the bytes, like many parsers or de-serializers do. You have to read bytes from the source stream one by one because each next byte might be the last one of the integer being decoded. You could work around this with a custom buffering adapter. That's hard to do correctly, costs performance, and in some cases is even impossible: when the encoded integers are coming in real time from the network, trying to read the next few bytes into a RAM buffer may block.
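Concretely, a byte-at-a-time LEB128 decode over a stream looks roughly like this, which is why you can't easily read ahead:

    import java.io.EOFException;
    import java.io.IOException;
    import java.io.InputStream;

    final class Leb128 {
        // Only the continuation (high) bit of each byte says whether another
        // byte follows, so a streaming decoder must read one byte at a time.
        static long readUnsigned(InputStream in) throws IOException {
            long result = 0;
            for (int shift = 0; shift < 64; shift += 7) {
                int b = in.read();
                if (b < 0) throw new EOFException("truncated varint");
                result |= (long) (b & 0x7F) << shift;   // low 7 bits carry payload
                if ((b & 0x80) == 0) return result;     // high bit clear: last byte
            }
            throw new IOException("varint too long");
        }
    }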
With MKV variable integers, you typically know the encoded length from just the first byte. Or in very rare cases of huge uint64 numbers, you need 2 bytes for that. Once you know the encoded length, you can read the rest of the encoded bytes with a single read function.
Another thing, unless you are running on a modern AMD64 CPU which supports PDEP/PEXT instructions (parts of BMI2 ISA extension), it’s expensive to split/merge the input bits to/from these groups of 7 bits in every encoded byte. MKV variable integers don’t do that, they only need bit scan and byte swap, both instructions are available on all modern mainstream processors including ARM and are very cheap.
I sometimes do for the following two use cases.
1. When the protocol delivers real-time data, I serialize byte by byte over the network to minimize latency.
2. When the messages in question are large, like 1GB or more, I don’t want to waste that much memory on both ends of the socket to buffer complete messages like Protobuf does. On the sending side, I want to produce the message, encode it at the same time with a streaming serializer, and stream bytes directly into the socket. Similarly, on the receiving side I want to de-serialize bytes as they come with a streaming de-serializer. Without buffering the complete message on the sending side, the receiving side can’t know the length of the message in advance before receiving the complete message, because the sender only knows the length after the complete message is serialized.
With streamed serialization, the receiver doesn’t know when the message will end. This generally means you can’t optimize LEB128 decoding by testing high bits of 10 bytes at once.
For example, let’s say the message is a long sequence of strings. Serialized into a sequence of pairs [ length, payload ] length is var.int, payload is an array of UTF8 bytes of that length, and the message is terminated with a zero-length string.
You can’t implement data parallel LEB128 decoder for the length field in that message by testing multiple bytes at once because it may consume bytes past the end of the message. A decoder for MKV variable integers only needs 1-2 read calls to decode even large numbers, because just the first byte contains the encoded length of the var.int.
> Never have I ever seen such a highly optimized Java codebase. Not before, not since.
This makes me wonder where it was. I know in the fintech world there's at least one hedge fund (not HFT, not market maker) where all their latency-sensitive trading code is written in Java, and when I asked them about GC pauses they said that their hot path produces no garbage so GC doesn't kick in.
From what I saw poking at the development trading boxes they over-provisioned both cpu ghz and ram to a ridiculous level. The machines looked nearly idle at their busiest.
Reminded of that missile defense system that was designed to be software-reset once a day, except the design assumptions changed (a war started) and it was ordered to be left running nonstop as an emergency measure; after being left on for a week, it failed, causing a large number of deaths. That one had some kind of integrator that accumulated floating-point roundoff over time.
I'm sure you've simplified the story, but it seems like a bit of process failure for a missile defense system to assume peacetime. There's writing software that implements the requirements, and then there's making sure the requirements are right both up front and when you really rely on them in a critical way.
[0] https://www.cnet.com/culture/windows-may-crash-after-49-7-da...
https://devblogs.microsoft.com/oldnewthing/20180228-00/?p=98...
Later they’d go back to using C++ for training deep neural nets.
"This is because Zing uses a unique collector called C4 (Continuously Concurrent Compacting Collector) that allows pauseless garbage collection regardless of the Java heap size (up to 8 Terabytes)."
Azul's GC runs concurrently with the app, so those GC durations occur without a pause. If you think about it, there's no reason why GC requires a pause in order to work; with the right conditions, GC should be able to analyze and remove dead allocations without stopping an app.