"Benford's law, also known as the Newcomb–Benford law, the law of anomalous numbers, or the first-digit law, is an observation that in many real-life sets of numerical data, the leading digit is likely to be small."
Even for seemingly trivial scenarios like searching an array, the contents and length of the array make a massive difference in results and how to optimize (as shown in the last section of this write-up where I tried to benchmark search algorithms correctly https://www.jasonthorsness.com/23).
I've not seen a perfect solution to this that isn't just "thinking carefully about the test setup" (profile-guided optimization, or production profiles replayed for benchmarks, seems like it could be an alternative, but I haven't seen that used much).
I can vouch for this. I've been doing web performance for nearly 15 years, and finding clear, representative problems is one of the hardest parts of my job. Once I understood this and worked on getting better at finding representative issues, it became the single thing that boosted my productivity the most.
I'm not sure if it has gotten easier. Async is harder to profile than classic threaded code using blocking IO. And for database bound code, I'm not sure which databases output detailed enough performance information.
And there's also https://docs.datadoghq.com/profiler/
Someone out there might have an application in which they are serializing large numbers of large integers.
The case for abandoning it is not as strong, since you don't know who is doing what.
It's a guessing game in programming language run-times and compilers: "will anyone need this improvement?" You have room to invent "yes" vibes. :)
Something like grabbing 0.01% of the inputs to that function for a day (maybe a lower fraction over a week or month).
Is the cost of grabbing this that high?
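As a rough sketch of what that could look like in Java (the rate and the plumbing are made up; a real setup would drain the samples to durable storage):

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ThreadLocalRandom;

    // Record roughly 0.01% of the inputs seen by a hot function so they can be
    // replayed later as a representative benchmark corpus.
    final class InputSampler {
        private static final double SAMPLE_RATE = 0.0001; // 0.01% of calls
        private static final Queue<Long> SAMPLES = new ConcurrentLinkedQueue<>();

        // Call at the entry of the function being studied.
        static void maybeRecord(long input) {
            if (ThreadLocalRandom.current().nextDouble() < SAMPLE_RATE) {
                SAMPLES.add(input); // a background task would periodically flush this to disk
            }
        }
    }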
definitely made me go :-O - how big a company do you need to be for that to be a realistic thing to maintain in house?
I've maintained in house forks of major code bases with a handful of minor code changes while we've tried to convince upstream to accept them.
There are 17 listed on SDKMAN's website: https://sdkman.io/jdks
> OpenJDK
As far as I know, all other distributions of the JDK are forks of OpenJDK. Would love to know if there's any that isn't.
https://www.ptc.com/en/products/developer-tools/perc
https://www.aicas.com/products-services/jamaicavm/
JikesRVM (https://www.jikesrvm.org) no longer active, but it was its own thing.
Then you have Azul Zing and OpenJ9, which use parts of OpenJDK for the standard library part, but the foundations are their own thing.
https://www.azul.com/products/prime/
https://developer.ibm.com/components/open-j9/
Then you have that mobile OS, which isn't really a JVM-compliant implementation, but it is close enough for many workloads and libraries to work on top of it.
An example of why I'd prefer to avoid Java for something like that. Java has a reputation for being slow because of JIT compilation and GC, but I also think a lot of it comes down to all non-primitive types being boxed Objects (boxed = in a separate allocation). This works the garbage collector that much harder, there's poorer locality, and it fills RAM/cache with extra Object stuff like mutexes that no one ever uses. (Apparently the header is now down to 8 bytes after a lot of effort to reduce it. https://openjdk.org/jeps/450)
The problem is that String really isn't a language primitive.
Different people and programs always have a different notion of what a String should do (Should it be mutable? Should it always be valid UTF-8? Which operations should be O(1), O(n) or O(n log n)? etc.)
If they do, they're wrong, as the two languages are quite different here. In Java, String requires two allocations: your variable is implicitly a pointer to String allocation, which in turn has a pointer to a char[] allocation. In C++, the std::string itself is a value type. The actual bytes might be inline (short string optimization) or behind a single allocation accessible from the std::string.
Rust's std::string::String is somewhere in between: it's a value type, but it does not have a short string optimization (unless you count the empty string returned by String::new).
> Different people and programs always have a different notion of what a String should do (Should it be mutable? Should it always be valid UTF-8? Which operations should be O(1), O(n) or O(n log n)? etc.)
Sure, there can be call for writing your own String type. But what's unique about Java as compared to say C, C++, Go, Rust, even to some extent C# is that you can't have a class or struct that bundles up the parts of your data structure (in the case of a mutable string, two fields: data pointer/capacity + the used length) without boxing. There's a heavy cost to any non-primitive data type.
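To make that concrete, here is what the minimal mutable-string bundle described above looks like in Java; this is an illustrative class, not anyone's actual library:

    // Hypothetical minimal mutable string: in Java this wrapper is always a
    // separate heap allocation (object header plus fields) pointing at a second
    // allocation for the actual bytes; there is no way today (pre-Valhalla) to
    // inline it into a containing object or a stack frame.
    final class MyString {
        byte[] data;   // capacity is data.length; this is the second allocation
        int length;    // number of bytes actually used
    }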
That’s true, but thanks to GC an allocation also is just a pointer bump in Java, and the two allocations are likely to be close to each other. For short-lived strings, the GC cost is furthermore zero, because only the longer-lived objects need to be tracked and copied with generational GC. So, “heavy cost” is relative.
Also, String.intern() can be quite slow compared to something like ConcurrentHashMap.putIfAbsent().
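For reference, a ConcurrentHashMap-based interner along those lines is only a few lines (a sketch, not tuned):

    import java.util.concurrent.ConcurrentHashMap;

    // Interning via ConcurrentHashMap.putIfAbsent() instead of String.intern().
    final class MapInterner {
        private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

        String intern(String s) {
            String existing = pool.putIfAbsent(s, s);
            return existing != null ? existing : s; // return the canonical instance
        }
    }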
You also can't make a first-class string type in most of those languages, because "hi" is defined to be of a system-specified class. You can make a different type to store strings, but it'll never be as ergonomic to use.
It’s even worse in JavaScript, where the standard library is written in a different language (usually C++). It’s impossible to write JavaScript code that matches the performance of the built in string primitive. At least outside of specific niche use cases.
Rust has lots of alternative string-like libraries in Cargo that are better optimized for different use cases, like smallstring or ropey. They're fast, convenient, and ergonomic. I imagine C++ is the same.
This is less problematic nowadays, because for many folks there are only three compilers that still matter, or even a single one.
While they were used as inspiration in its design, the AOT approach with automatic memory management and programming-language features for low-level coding took a while to come back from them; even NGEN/unsafe in C# 1.0, while welcome, wasn't that great.
I always think that in an alternate reality, had those features already been present in version 1.0, C and C++ would have lost even more mindshare at the turn of the century, and something like C++11 would not have been as impactful.
They're working on it: https://openjdk.org/projects/valhalla/
But it has a much smaller impact than in, say, Rust, where it's really an allocation (asking the kernel for more RAM). Java's Object "allocation" happens in its own heap, which is a big chunk of RAM already pre-allocated from the kernel. So in the JVM, boxing is really cheap, often just a pointer increment. Also, oftentimes the wrapper object and its value are located near each other in the same memory page, so we're not adding a RAM access, more like an L2 access.
PS: In some cases it can even be faster than Rust or C++, because the JVM pre-allocates larger pages and drops them in chunks with generational GC (e.g. all values "allocated" to process one HTTP request can be GCed immediately after), while C++ is eager to destruct each object immediately. Also, a GC sweep can happen in a separate thread, not bothering the main thread the user is waiting for. One can do the same in Rust and C++ using arenas, of course.
What? No. Rust doesn't ask the kernel for each allocation individually. That'd be insane; besides the crushing system call overhead, the kernel only does allocations in whole pages (4 KiB on x86-64) so it'd be incredibly wasteful of RAM.
Rust does the same thing as virtually every non-GCed language. It uses a memory allocator [1] that does bookkeeping in userspace and asks the kernel for big chunks via sbrk and/or mmap/munmap. Probably not the whole heap as a single chunk of virtual memory as in Java, but much closer to that than to the other extreme of a separate kernel call for each allocation.
[1] by default, just libc's malloc and free, although you can override this, and many people choose jemalloc or mimalloc instead. My high-level description applies equally well to any of the three.
Of course you can optimize better with Rust/C++/etc., but it is not trivial, and you definitely don't get it out of the box for free. My point is, how much overhead Java has is a bit overblown.
I meant that Java usually already has that memory allocated; the JVM is a memory allocator of its own, operating within its own heap. One can do that in Rust too, of course (or more easily in Zig, as it welcomes passing an allocator around); it's just built into the JVM already. The drawback is that it's more complex to release that memory back from the JVM process, so Java apps (not AOT-compiled) usually consume more RAM.
I recently encountered an example of someone writing a Rust version of a popular Java library by just taking the Java code, commenting it out, and writing the Rust equivalent almost line for line. The approach works great (no need to reinvent the wheel and you can point to the existing documentation and code samples) but in terms of allocations, you're not going to find many improvements.
There's a type of Java code that looks more like C code than anything else and runs blazing fast with minimal overhead. It's not the type of Java code you'll probably encounter when writing Java applications, but if you use Java as a kind of cross-platform C target, you can get pretty close to Rust (and for some use cases even beat it). Java has a LOT of tricks up its sleeve (pointer compression, dynamic realignment) that Rust can't automatically take advantage of.
Your AbstractFunctorClassFactoryProducer isn't going to be very allocation efficient, but once you start seeing volatile ints all over the place, things quickly become a lot faster.
I'm always looking for / wondering about valid data representing the problem so that I can measure success.
A little of last week and much of this week I've spent addressing a performance problem. At first I was told it was always happening in some specific situation; it very much is not. Then I was told it happens frequently ... nope, maybe 2 to 4 times a day at most.
I think I've managed to track the issue down. I think so, pretty sure? On the other hand, this client has a couple of performance-related issues ongoing, and listening to them talk on a conference call... I'm not really sure they differentiate between them. It's very possible my neighbor a few seats away from me is actually addressing "the problem" and I'm chasing ghosts. But the client is too busy to provide good data. Meanwhile this issue is "too important" for us to do nothing and wait; granted ... we might be doing nothing anyway.
This literally made the GA encoding exactly the same as the solution and also very, very obviously favoured techniques that would MAKE ALL THE BITS 1!
https://link.springer.com/article/10.1007/s10710-021-09425-5
They do the convergence analysis on the linear system Ax = 0, which means any iteration matrix (including a zero matrix) that produces shrinking numbers will converge to the obvious solution x = 0 and the genetic program just happens to produce iteration matrices with lots of zeros.
That's only half-snark. It was "wasted" in the sense that it wasn't as useful as the OP thought it would be. But diving deep into a technical problem? That left an indelible pattern on the OP's psyche. If they ever have to do something like that again, it will certainly be easier. And since AI is going to be doing all the simple coding tasks, humans will have to dive into the deep, hard tasks.
Adding the fast path to the hard-won assembly is just optimising the slow path, and there is little point to optimising it. Complex optimisations are also a form of technical debt, so they are probably better off not integrating it.
I am wondering more about the bit-by-bit compatibility part from "This was verified in a benchmark that ran both versions on billions of random numbers, confirming both the huge speedup and the bit-for-bit compatibility of the result.".
How does one test for 18.4 quintillion numbers? This isn't possible.
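It isn't exhaustive, but a common pattern is to hit every encoded-length boundary deterministically and then hammer the rest with random draws. A rough Java sketch, with referenceEncode/optimizedEncode as hypothetical stand-ins for the two implementations under test:

    import java.util.Arrays;
    import java.util.concurrent.ThreadLocalRandom;

    // Not a proof over all 2^64 inputs: checks every power-of-two boundary
    // (where the encoded length changes) plus a large random sample.
    final class EquivalenceCheck {
        public static void main(String[] args) {
            for (int shift = 0; shift < 64; shift++) {
                long p = 1L << shift;
                check(p - 1); check(p); check(p + 1); check(-p);
            }
            ThreadLocalRandom rnd = ThreadLocalRandom.current();
            for (long i = 0; i < 1_000_000_000L; i++) {
                check(rnd.nextLong());
            }
        }

        static void check(long v) {
            if (!Arrays.equals(referenceEncode(v), optimizedEncode(v))) {
                throw new AssertionError("mismatch for " + v);
            }
        }

        // Placeholders for the old and new implementations being compared.
        static byte[] referenceEncode(long v) { return new byte[0]; }
        static byte[] optimizedEncode(long v) { return new byte[0]; }
    }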
In Canada, your org can get government money for this under the SR&ED program:
https://www.canada.ca/en/revenue-agency/services/scientific-...
He met Ken Thompson and saw beautiful C code for the first time because he had encountered a performance problem in a service. The service had to choose a policy to enforce (or something) based on the incoming request. It was taking too long to match the criteria of each potential policy against the request.
Ken wrote a finite automata based pattern matcher that would simultaneously advance through all of the policies' predicates. It was perfect, and it was much faster than the existing code.
Then somebody noticed that 99.9% of requests matched a particular policy, so they changed the existing code to just check that predicate first, and the code sped up a zillion times, much more than with Ken's solution.
Somewhat relatedly, I still remember the first time I heard about profile-guided optimization, which is essentially the same but for all of your code at once (well, same idea; I'm not sure it's aggressive enough to reach the same result as the anecdote you shared).
Exactly: profile-guided optimization is pretty awesome if you have the right infrastructure. You can get maybe 15% speedup on your program without going deep into all the code. But it has limits. IIUC, it gathers information about which code is hot, including which branches are taken, and enables certain compiler optimizations based on that. [1] But it's not going to make huge algorithm changes, it's not going to change data structure definitions [2], and it's not going to prepare good microbenchmark input for you to compare your new algorithm against the current one.
Also, this article is about Java, and profile-guided optimization is something Java just has baked into its JIT already.
[1] I don't know exactly which, and they likely vary by compiler version. But I think a variety of stuff to better use cache: breaking code apart into likely-"hot" vs likely-"cold" pages, deciding which loops are worth aligning, and where inlining is worthwhile. Also where branches should be optimized for likely vs unlikely vs unpredictable (conditional instructions), and when it's worth autovectorizing. But it's not as good at SIMD as a human who puts two weeks into it, and SIMD usage is constrained by the inability to change data structures.
[2] in theory, with Rust's default #[repr(Rust)] it could reorder a struct's members, but it's not going to, say, change from array-of-structs to struct-of-arrays format or from a hashmap to a btree.
Lines of thinking like that are part of the reason most modern software is so sloooow :)
normal(x) == optimized(x)
Then I write the optimized code and use a test like yours to compare the results between the simple code and the optimized code.
One time I needed to write a function to search for a specific pattern in a binary file and change it. So I wrote the brute force code as a first step, the same code that anyone would probably write as a simple solution. It worked the first time, and a couple of people reviewed the code and said "yep, even if it's slow, it is correct."
But this code took more than a second to run!
Of course I thought about optimizing it with Boyer-Moore or the like. Then I went, "Hold on to your horses. This isn't something like a web page load where one second matters. It's part of a build process that only runs a few times a day and already takes several minutes to run. One extra second is nothing!"
In the wise words of Kenny Rogers in The Gambler:
You got to know when to hold 'em,
know when to fold 'em
Know when to walk away
and know when to run
Every coder knows
the secret to optimizin’
is knowing what to throw away
and knowing what to keep
"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
A more robust strategy would at least be to check if the rule was the same as the previous one (or use a small hash table) so that the system is self-healing.
Ken's solution is at least robust, and for that property I would prefer it: it's just as fast, but it doesn't have any weird tail latencies, since requests outside your cached distribution are as fast as the ones in it.
Ken's solution was stated to have been slower than the alternative optimization.
Also, it's trivial to keep Ken's implementation as the slow path. If request patterns change, dig up the new fast path and put the old one in Ken's slow path code. Most of the performance will still come from the initial `if`.
if value in most_common_range:
take_fast_path()
else:
use_kens_solution()
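A hedged Java sketch of that shape, folding in the "remember the last matching rule" idea from upthread (Policy is a made-up interface, and requests are simplified to strings):

    import java.util.List;

    // Fast path: try the rule that matched last time; slow path: full scan
    // (this is where Ken's automaton would live). If traffic shifts, the cache
    // re-learns the new common rule on the next miss, so it is self-healing.
    final class PolicyMatcher {
        interface Policy { boolean matches(String request); }

        private final List<Policy> allPolicies;
        private volatile Policy lastMatched;

        PolicyMatcher(List<Policy> allPolicies) { this.allPolicies = allPolicies; }

        Policy match(String request) {
            Policy cached = lastMatched;
            if (cached != null && cached.matches(request)) {
                return cached;                    // common case: one predicate check
            }
            for (Policy p : allPolicies) {        // slow path: full evaluation
                if (p.matches(request)) {
                    lastMatched = p;              // adapt to the new distribution
                    return p;
                }
            }
            return null;
        }
    }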
Learning this is worth writing the otherwise useless code.
Now? I just dump the problem into an LLM and it spits out a script that solves it, and then I throw away the script because it's horribly bugged for any case other than my particular problem. But hey, I just solved my problem.
That's why these days I just start writing code and get it to the "it works!" level, then throw it all away and start again. Takes a little more effort, but the payoff is far less problematic code and tech debt.
¹: Write Everything Twice
That said, you can still have generic optimizations that benefit nearly all workloads (for instance, quicksort).
In PGO you profile your application running in the real world (via automatic instrumentation) and the compiler uses the output to make optimization decisions.
I'm not sure what exactly they use it for, but I imagine it could be used for loop unrolling, switch case ordering, branch prediction optimization, memory ordering and more.
There's absolutely no reason to think that, for the specific workload of a system, a random/arbitrary/heuristic guess is generally the best way to pick the most likely branch for a given branching instruction.
For example, when scanning data you might check if the current value is equal to a test value, and branch on that to either code that handles the match, or else code that moves to the next entry to test that. This may be extremely likely to match (95%) as in a hash table, or very likely to not match (string search). A compiler can't tell which is true, and your workload literally has no impact on this unless it is pathological.
Meanwhile, one of my favorite coworkers smelled a caching opportunity in one of our libraries that we use absolutely everywhere, and I discovered a whole can of worms around it. The call was too expensive which was most of the reason a cache seemed desirable. My team had been practicing Libraries Over Frameworks, and we had been chipping away for years at a NIH framework a few of them and a lot of people who left had created in-house. At some point we lost sight of how we had sped up the “meat“ of workloads at the expense of higher setup time initializing these library functions over, and over, and over again. Example: the logging code depended on the feature toggle code, which depended on a smaller module that depended on a smaller one still. Most modules depended on the logging and the feature toggle modules. Those leaf node modules were being initialized more than fifty times per request.
Three months later I had legitimately cut page load time by 20% by running down bad code patterns. 1/3 of that gain was two dumb mistakes. In one we were reinitializing a third party library once per instantiation instead of once per process.
So on the day they shipped, they claimed a 20% response-time reduction (which was less than the proposal expected to achieve), but a lot of their upside came from avoiding work I'd already made cheaper. They missed that the last of my changes went out in the same release, so their real contribution was 11%, but because of the extra requests it only resulted in a 3% reduction in cluster size, after three man-years of work. Versus 20% and 7% respectively for my three months of work spread over five.
Always pay attention to the raw numbers and not just the percentages in benchmarks, kids.
>so they changed the existing code to just check that predicate first, and the code sped up a zillion times, much more than with Ken's solution.
but you've now spent all this time developing some beautiful solution for 0.01% of the actual data flow. Not the best use of dev time, essentially.
1. One branch for the one-byte case, always inlined, just store the byte
2. Calculate the size of the unencoded zero prefix with a branch-free construction: (((uint32_t)clz + 7) * 9) >> 6
3. Hand-roll a jump table, taking advantage of ARM's fixed instruction width to calculate the jump target with a shift.
https://github.com/protocolbuffers/protobuf/commit/b039dfe26...
This results in one branch for 1-byte varints, plus one additional branch for any larger size, and the branch predictor does not have to track a varying trip count through a loop. This approach resulted in a 2.64% speedup for overall encoding (which includes a lot more than just varints) on mid-sized ARM cores.
I think it's very difficult to beat a single comparison and branch for the 1 byte case for actual encoding forwards or backwards, unless you know there's going to be long runs of one-byte values.
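For comparison, here is what the plain portable version of that fast path looks like in Java (a sketch of standard LEB128, not the branch-free ARM code from the linked commit):

    // Little-endian base-128 varint with an explicit one-byte fast path:
    // a single compare-and-branch handles values < 128, the loop handles the rest.
    final class Varints {
        static int write(byte[] buf, int pos, long value) {
            if ((value & ~0x7FL) == 0) {          // fits in one byte
                buf[pos] = (byte) value;
                return pos + 1;
            }
            while ((value & ~0x7FL) != 0) {       // 7 payload bits per byte, MSB = "more"
                buf[pos++] = (byte) ((value & 0x7F) | 0x80);
                value >>>= 7;
            }
            buf[pos++] = (byte) value;
            return pos;
        }
    }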
I would expect C or asm to be a better choice at ultra high levels of optimization. I know Java can get somewhat similar performance if you put the work in, but I’d assume you want to do manual memory management at minimum.
I’m just saying that this company seems to be throwing a lot of money at the problem, and at those levels you can squeeze more performance out of them and get them reasonably right.
After all, many, many pieces of reliable software are written in C. If you wanted a “safer” modern language, you could pick Rust.
I would just assume those would be easier picks rather than optimizing around GC pauses, which can be notoriously difficult when nanoseconds matter.
Java is unique in that it often ran a dual stack: bytecode alongside platform-specific binary objects.
My advice often is to use high-level languages if you can get away with the convenience, and a low-level language with proper design patterns if you must.
Many programmers I've met shouldn't write C/C++, mostly because they can't be trusted to write memory- or thread-safe code.
Cheers, =3
C doesn’t inherently have to be bad (and can often be better than C++) if you are strict about adhering to standards.
I like to point to OpenBSD as a successful and well-done C project. As a bonus, if you develop on OpenBSD, your program is more likely to have correct memory handling operations (OpenBSD has a lot of very aggressive memory protections. Many issues in ported applications like Firefox were found and fixed by OpenBSD users, but would happily keep running on Linux).
"Why we chose Java for our High-Frequency Trading application"
https://medium.com/@jadsarmo/why-we-chose-java-for-our-high-...
I still just feel like a non-GC language (maybe even Rust) is a better choice when nanoseconds matter.
Ada is not something that these companies would bother using, even though there are about 7 vendors still in business.
Likewise with either FreePascal or Modula-2.
Most likely a few of these exchanges are writing new code in Rust for such deployment scenarios.
Data conditions can definitely affect runtime performance, all the way down to the microarchitecture level. Random, sorted, reverse-sorted, or uniform data sets often lead to dramatic differences in perf. Then you get into "Oops, my dataset had a surprisingly large number of NaNs or denormalized floats!"
Obviously, it becomes hard in such an aggressively optimized system, but if that's the case, trying and failing is an acceptable approach (I've seen 2 weeks wasted in many worse ways, to be honest).
Another anecdote: I profiled our Java-based server and concluded that most of the time was being spent on creating temporary DI containers for request handling. I spent several days trying to avoid copying the "app" scope components to the "request" scope container (we forked and heavily modified our DI container implementation) so that creating request containers was cheaper. I wrote a JMH benchmark that tried to emulate the real thing based on the real data (with something like 40% of components in the "app" scope, 60% in the "request" scope). My optimisation made the code noticeably faster; I can't remember by how much exactly, but probably something like 25% faster.
When I ran the "real" performance test that simulates real-world usage of the server (not just the containers) the performance was slightly worse!! Ended up throwing the code away, as the OP. I had already spent too much time on it so I couldn't dig much further into why that was. But from other experiments, I can say that the JVM, thanks to JIT compilation of bytecode, will heavily optimize the hot path anyway based on the actual usage patterns, so much that the Java code you wrote may resemble very little of what machine code is actually executing in the end. That's why I said it may also be "trivial" to optimize JVM code: just let the JIT do its job.
Of course you may still obtain big gains when you optimize code that was doing stupid things, but because of JIT "magic", even your stupid code might run really fast given enough JIT passes.
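For anyone curious what such a JMH harness roughly looks like, here's a generic sketch with made-up scope sizes (it is not the benchmark from the story above, just the shape of it):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    // Compares "copy app-scope components into every request container" against
    // "keep an empty per-request layer and fall back to the app scope on misses".
    @State(Scope.Benchmark)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public class RequestScopeBench {
        Map<String, Object> appScope;

        @Setup
        public void setup() {
            appScope = new HashMap<>();
            for (int i = 0; i < 400; i++) {       // pretend ~40% of components are app-scoped
                appScope.put("component-" + i, new Object());
            }
        }

        @Benchmark
        public Map<String, Object> copyIntoRequestScope() {
            return new HashMap<>(appScope);       // pays the copy on every request
        }

        @Benchmark
        public Object layeredLookup() {
            Map<String, Object> requestScope = new HashMap<>(); // empty per-request layer
            Object o = requestScope.get("component-123");
            return o != null ? o : appScope.get("component-123");
        }
    }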
In a sense, a shame this encoding wasn't structured like UTF-8, or even the other way around, a shame UTF-8 wasn't structured in this, more generic way.
That left only varint used for tags. For encoding we already know the size at compile time. We also don’t have to decode them since we can match on the raw bytes (again, known at compile time).
A final optimization is to make sure the message has a fixed size and then just mutate the bytes of a template directly before shipping it off. Hard to beat.
In a newer project we just use SBE, but it lacks some of the ergonomics of protobuf.
Better to do janky testing with a cobbled-together script on your own data and usage scenarios than to use standard tools that measure standard metrics.
That must be awful to work with. Why couldn't they just pick a more adequate language if they wanted performance? Or write bindings for performance critical parts?
The good variable-integer encoding is the one used in the MKV format: https://www.rfc-editor.org/rfc/rfc8794.html#name-variable-si... The compression win, i.e. the encoded size, is equal to LEB128's, but both encoding and decoding are much more efficient.
The encoder is a few fast instructions for all numbers, no matter how large or small: bit scan reverse (or leading zero count) to find the count of bits required to store the input number, division by 7 rounding up to compute the count of bytes (modern compilers reliably optimize this into integer multiplication and bit shifts), bitwise shift + OR to set the VINT_MARKER bit, then a byte swap to store in big-endian format.
The decoder is equally efficient: compare the first byte with 0 to detect numbers longer than 8 bytes (which very rarely happens), bit scan reverse of either the first or second byte to find the encoded length, byte swap to load the big-endian format, then reset the most significant bit in the integer to get rid of the VINT_MARKER.
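A Java sketch of that scheme, assuming lengths of 1 to 8 bytes and ignoring the reserved all-ones values (so it round-trips values but is not a full EBML implementation):

    import java.nio.ByteBuffer;

    // MKV/EBML-style VINT: the leading-zero count gives the bit width, dividing
    // by 7 (rounded up) gives the byte count, a marker bit is OR'ed in, and the
    // result is written big-endian. Decoding reverses the steps.
    final class Vint {
        // Writes 1..8 bytes; caller must ensure the value fits in 56 bits.
        static int encode(ByteBuffer buf, long value) {
            int bitsNeeded = 64 - Long.numberOfLeadingZeros(value | 1);
            int length = (bitsNeeded + 6) / 7;             // ceil(bits / 7)
            long encoded = value | (1L << (7 * length));   // set the VINT_MARKER bit
            for (int i = length - 1; i >= 0; i--) {        // store big-endian
                buf.put((byte) (encoded >>> (8 * i)));
            }
            return length;
        }

        // Reads a VINT of 1..8 bytes from the buffer's current position.
        static long decode(ByteBuffer buf) {
            int first = buf.get(buf.position()) & 0xFF;    // peek to find the length
            int length = Integer.numberOfLeadingZeros(first) - 23;
            long value = 0;
            for (int i = 0; i < length; i++) {
                value = (value << 8) | (buf.get() & 0xFF);
            }
            return value & ~(1L << (7 * length));          // clear the VINT_MARKER
        }
    }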
> Never have I ever seen such a highly optimized Java codebase. Not before, not since.
This makes me wonder where it was. I know in the fintech world there's at least one hedge fund (not HFT, not market maker) where all their latency-sensitive trading code is written in Java, and when I asked them about GC pauses they said that their hot path produces no garbage so GC doesn't kick in.
From what I saw poking at the development trading boxes they over-provisioned both cpu ghz and ram to a ridiculous level. The machines looked nearly idle at their busiest.
Reminded of that missile defense system that was designed to be software-reset once a day, except the design assumptions changed (a war started) and it was ordered to be left running nonstop as an emergency measure; after being left on for a week, it failed, causing a large number of deaths. That one had some kind of integrator that accumulated floating-point roundoff over time.
I'm sure you've simplified the story, but it seems like a bit of a process failure for a missile defense system to assume peacetime. There's writing software that implements the requirements, and then there's making sure the requirements are right, both up front and when you really rely on them in a critical way.
[0] https://www.cnet.com/culture/windows-may-crash-after-49-7-da...
https://devblogs.microsoft.com/oldnewthing/20180228-00/?p=98...
Later they’d go back to using C++ for training deep neural nets.