"Benford's law, also known as the Newcomb–Benford law, the law of anomalous numbers, or the first-digit law, is an observation that in many real-life sets of numerical data, the leading digit is likely to be small."
Even for seemingly trivial scenarios like searching an array, the contents and length of the array make a massive difference in results and how to optimize (as shown in the last section of this write-up where I tried to benchmark search algorithms correctly https://www.jasonthorsness.com/23).
I've not seen a perfect solution to this that isn't just "thinking carefully about the test setup" (profile-guided optimization/production profiles replayed for benchmarks seem like they could be an alternative, but I haven't seen that used much).
I can vouch for this. I've been doing web performance for nearly 15 years and finding clear representative problems is one of the hardest parts of my job. Once I understood this and worked on getting better at finding representative issues, it became the single thing that boosted my productivity the most.
I'm not sure if it has gotten easier. Async is harder to profile than classic threaded code using blocking IO. And for database bound code, I'm not sure which databases output detailed enough performance information.
And there's also https://docs.datadoghq.com/profiler/
Someone out there might have an application in which they are serializing large numbers of large integers.
The case for abandoning it is not as strong, since you don't know who is doing what.
It's a guessing game in programming language run-times and compilers: "will anyone need this improvement?" You have room to invent "yes" vibes. :)
Something like grabbing .01% of the inputs to that function for a day (maybe a lower fraction over a week or month).
Is the cost of grabbing this that high?
definitely made me go :-O - how big a company do you need to be for that to be a realistic thing to maintain in house?
I've maintained in house forks of major code bases with a handful of minor code changes while we've tried to convince upstream to accept them.
There are 17 listed on SDKMAN's website: https://sdkman.io/jdks
> OpenJDK
As far as I know, all other distributions of the JDK are forks of OpenJDK. Would love to know if there's any that isn't.
https://www.ptc.com/en/products/developer-tools/perc
https://www.aicas.com/products-services/jamaicavm/
JikesRVM (https://www.jikesrvm.org) no longer active, but it was its own thing.
Then you have Azul Zing and OpenJ9, which use parts of OpenJDK for the standard library part, but the foundations are their own thing.
https://www.azul.com/products/prime/
https://developer.ibm.com/components/open-j9/
Then you have that mobile OS, which isn't really a JVM-compliant implementation, but it is close enough for many workloads and libraries to work on top of it.
An example of why I'd prefer to avoid Java for something like that. Java has a reputation for being slow because of JIT compilation and GC, but I also think a lot of it comes down to all non-primitive types being boxed Objects (boxed = in a separate allocation). This means it works the garbage collector so much harder, there's poorer locality, and it fills RAM/cache with extra per-Object stuff like mutexes that no one ever uses. (Apparently the object header is now down to 8 bytes after a lot of effort to reduce it. https://openjdk.org/jeps/450)
The problem is that String really isn't a language primitive.
Different people and programs always have a different notion of what a String should do (Should it be mutable? Should it always be valid UTF-8? Which operations should be O(1), O(n) or O(n log n)? etc.)
If they do, they're wrong, as the two languages are quite different here. In Java, String requires two allocations: your variable is implicitly a pointer to the String allocation, which in turn has a pointer to a char[] allocation. In C++, the std::string itself is a value type. The actual bytes might be inline (short string optimization) or behind a single allocation accessible from the std::string.
Rust's std::string::String is somewhere between: it's a value type but does not have a short string optimization (unless you count the empty string returned by String::new).
> Different people and programs always have a different notion of what a String should do (Should it be mutable? Should it always be valid UTF-8? Which operations should be O(1), O(n) or O(n log n)? etc.)
Sure, there can be call for writing your own String type. But what's unique about Java as compared to say C, C++, Go, Rust, even to some extent C# is that you can't have a class or struct that bundles up the parts of your data structure (in the case of a mutable string, two fields: data pointer/capacity + the used length) without boxing. There's a heavy cost to any non-primitive data type.
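A rough sketch of what I mean (a hypothetical class, not from any real codebase): even a tiny mutable-string holder like this is its own heap object on top of the byte[] it wraps, whereas in C, C++, Rust, Go or C# the two fields could live inline in a containing struct or on the stack.

    // Two allocations per instance in Java: the holder object itself,
    // plus the byte[] it points to.
    final class MutableAsciiString {
        byte[] data;   // separate heap allocation; capacity is data.length
        int length;    // used length

        MutableAsciiString(int capacity) {
            this.data = new byte[capacity];
        }
    }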
That’s true, but thanks to GC an allocation also is just a pointer bump in Java, and the two allocations are likely to be close to each other. For short-lived strings, the GC cost is furthermore zero, because only the longer-lived objects need to be tracked and copied with generational GC. So, “heavy cost” is relative.
Also, String.intern() can be quite slow compared to something like ConcurrentHashMap.putIfAbsent().
You also can’t make a first-class string type in most of those languages because “hi” is defined to be of a system-specified class. You can make a different type to store strings, but it’ll never be as ergonomic to use.
It’s even worse in JavaScript, where the standard library is written in a different language (usually C++). It’s impossible to write JavaScript code that matches the performance of the built in string primitive. At least outside of specific niche use cases.
Rust has lots of alternate string like libraries in cargo that are optimized better for different use cases - like smallstring or ropey. They’re fast, convenient and ergonomic. I imagine C++ is the same.
This is less problematic nowadays, because for many folks there are only three compilers that still matter, or even a single one.
While they were used as inspiration for their design, the AOT approach, automatic memory management, and programming language features for low-level coding really took a while to come back from them; even the NGEN/unsafe from C# 1.0, while welcome, wasn't that great.
I always think that in an alternate reality, had those features already been there in version 1.0, C and C++ would have lost even more mindshare at the turn of the century, and something like C++11 would not have been as impactful.
They're working on it: https://openjdk.org/projects/valhalla/
But it has a much smaller impact than in, say, Rust, where it's really an allocation (asking the kernel for more RAM). Java's Object "allocation" happens in its own heap, which is a big chunk of RAM already pre-allocated from the kernel. So in the JVM boxing is really cheap, often just a pointer increment. Also, oftentimes the wrapper object and its value are located near each other in the same memory page, so we're not adding a RAM access, more like an L2 access.
PS: In some cases it can be even faster than Rust or C++, because it pre-allocates larger pages and drops them in chunks within generational GC (e.g. all values "allocated" to process one HTTP request can be GCed immediately after), while C++ is eager to destruct each object immediately. Also, a GC sweep can happen in a separate thread, not bothering the main thread the user is waiting for. One can do the same in Rust and C++ using arenas, of course.
What? No. Rust doesn't ask the kernel for each allocation individually. That'd be insane; besides the crushing system call overhead, the kernel only does allocations in whole pages (4 KiB on x86-64) so it'd be incredibly wasteful of RAM.
Rust does the same thing as virtually every non-GCed language. It uses a memory allocator [1] that does bookkeeping in userspace and asks the kernel for big chunks via sbrk and/or mmap/munmap. Probably not the whole heap as a single chunk of virtual memory as in Java, but much closer to that than to the other extreme of a separate kernel call for each allocation.
[1] by default, just libc's malloc and free, although you can override this, and many people choose jemalloc or mimalloc instead. My high-level description applies equally well to any of the three.
Of course you can optimize better with Rust/C++/etc., but it is not trivial and you definitely don't get it out of the box for free. My point is, how much overhead Java has is a bit overblown.
I meant that Java usually already has that memory allocated; the JVM is a memory allocator of its own. It operates within its own heap. One can do that within Rust of course (or more easily in Zig, as it welcomes passing an allocator around); it's just built into the JVM already. The drawback is that it's more complex to release that memory back from the JVM process, so Java apps (not AOT-compiled) usually consume more RAM.
I recently encountered an example of someone writing a Rust version of a popular Java library by just taking the Java code, commenting it out, and writing the Rust equivalent almost line for line. The approach works great (no need to reinvent the wheel and you can point to the existing documentation and code samples) but in terms of allocations, you're not going to find many improvements.
There's a type of Java code that looks more like C code than anything else that runs blazing fast with minimal overhead. It's not the type of Java code you'll probably encounter when writing Java applications, but if you use Java as a kind of cross-platform C target, you can get pretty close to Rust (and for some use cases even beat it). Java has a LOT of tricks up its sleeve (pointer compression, dynamic realignment) that Rust can't automatically take advantage of.
Your AbstractFunctorClassFactoryProducer isn't going to be very allocation efficient, but once you start seeing volatile ints all over the place, things quickly become a lot faster.
Nonsense. Sure, the act of creating a new allocation is cheap, but that's not where the expense lies at all. And of course, to make allocations cheap, you needed to give up something else.
Each allocation needs to be at some point handled by the GC, so the more allocations, the more GC pressure. That then forces you to make your GC generational, which constrains your GC design. You end up with a slower GC no matter what you do.
Moreover, every access to that allocation goes through a pointer, which is basically a guaranteed cache miss.
The classic case of this is iterating over an Array<Integer>. Not only did you have to make n allocations that the GC now has to keep track of, it's not possible to efficiently fetch items of this array from memory at all. You need to first fetch the pointers, then request the pointed to memory. Even in the best possible case of the pointer living right next to the data it points to, you're still paying the overhead of extra memory taking up space in your L1 cache.
Compare with an array of simple integers. Zero GC overhead, zero pointers, zero memory overhead, trivial to prefetch.
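To make that concrete, a small sketch (illustration only):

    final class SumExample {
        // Summing a boxed array chases one pointer per element; summing a
        // primitive array reads a flat, prefetch-friendly block of memory.
        static long sumBoxed(Integer[] xs) {
            long total = 0;
            for (Integer x : xs) {
                total += x;      // unbox: load the reference, then the int behind it
            }
            return total;
        }

        static long sumPrimitive(int[] xs) {
            long total = 0;
            for (int x : xs) {
                total += x;      // contiguous reads, no indirection
            }
            return total;
        }
    }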
---
This is a case of horrid approach to design. Making allocating cheap must come at a cost of slowing down some other part of the GC and just encourages programmers to allocate more, which puts more and more pressure on the GC, making it even slower. It's a self-defeating approach that is therefore completely bone headed.
Allocations should be expensive!
With expensive allocations you get breathing room in your GC to make it otherwise faster. Moreover, since programmers are now encouraged not to allocate, the GC pressure is lower, making it even faster. But it gets even better. Optimizing becomes very clear and straightforward - just reduce allocations. This is great because it allows for targeted optimizations - if a particular piece of code is slow, just reduce its allocation rate and you'll get a nice speedup. Very easy to reason about.
That's why code written C/C++/Go/Rust tends to be so conscious of allocations and any indirections, but Java is full of linked lists, arrays of pointers, allocations everywhere and thousands of layers of indirections.
Cleaning up heaps of garbage can never be faster than not creating the garbage in the first place.
I'm always looking for / wondering about valid data representing the problem so that I can measure success.
A little of last and much of this week I've spent addressing a performance problem. At first I was told it was always happening in some specific situation, it very much is not. Then I was told it frequently happens ... nope, maybe 2 to 4 times a day at most.
I think I've managed to track the issue down. I think so, pretty sure? On the other hand this client has a couple of performance-related issues ongoing, and listening to them talk on a conference call... I'm not really sure they differentiate between them. It's very possible my neighbor a few seats away from me is actually addressing "the problem" and I'm chasing ghosts. But the client is too busy to provide good data. Meanwhile this issue is "too important" for us to do nothing and wait; granted... we might be doing nothing anyway.
This literally made the GA encoding exactly the same as the solution and also very, very obviously favoured techniques that would MAKE ALL THE BITS 1!
https://link.springer.com/article/10.1007/s10710-021-09425-5
They do the convergence analysis on the linear system Ax = 0, which means any iteration matrix (including a zero matrix) that produces shrinking numbers will converge to the obvious solution x = 0 and the genetic program just happens to produce iteration matrices with lots of zeros.
It WAS still a great compiler, and way faster than the competition at the time.
That's only half-snark. It was "wasted" in the sense that it wasn't as useful as the OP thought it would be. But diving deep into a technical problem? That left an indelible pattern on the OP's psyche. If they ever have to do something like that again, it will certainly be easier. And since AI is going to be doing all the simple coding tasks, humans will have to dive into the deep, hard tasks.
Adding the fast path to the hard-won assembly is just optimising the slow path, and there is little point to optimising it. Complex optimisations are also a form of technical debt, so they are probably better off not integrating it.
I am wondering more about the bit-by-bit compatibility part from "This was verified in a benchmark that ran both versions on billions of random numbers, confirming both the huge speedup and the bit-for-bit compatibility of the result.".
How does one test for 18.4 quintillion numbers? This isn't possible.
In Canada, your org can get government money for this under the SR&ED program:
https://www.canada.ca/en/revenue-agency/services/scientific-...
He met Ken Thompson and saw beautiful C code for the first time because he had encountered a performance problem in a service. The service had to choose a policy to enforce (or something) based on the incoming request. It was taking too long to match the criteria of each potential policy against the request.
Ken wrote a finite automata based pattern matcher that would simultaneously advance through all of the policies' predicates. It was perfect, and it was much faster than the existing code.
Then somebody noticed that 99.9% of requests matched a particular policy, so they changed the existing code to just check that predicate first, and the code sped up a zillion times, much more than with Ken's solution.
Somehow relatedly, I still remember the first time I heard about profile-guided optimization which is essentially the same but for all of your code at once (well, same idea, not sure it's aggressive enough to reach the same result as the anecdote you shared).
Exactly: profile-guided optimization is pretty awesome if you have the right infrastructure. You can get maybe 15% speedup on your program without going deep into all the code. But it has limits. IIUC, it gathers information about which code is hot, including which branches are taken, and enables certain compiler optimizations based on that. [1] But it's not going to make huge algorithm changes, it's not going to change data structure definitions [2], and it's not going to prepare good microbenchmark input for you to compare your new algorithm against the current one.
Also, this article is about Java, and profile-guided optimization is something Java just has baked into its JIT already.
[1] I don't know exactly which, and they likely vary by compiler version. But I think a variety of stuff to better use cache: breaking code apart into likely-"hot" vs likely-"cold" pages, deciding which loops are worth aligning, and where inlining is worthwhile. Also where branches should be optimized for likely vs unlikely vs unpredictable (conditional instructions), and when it's worth autovectorizing. But it's not as good at SIMD as a human who puts two weeks into it, and SIMD usage is constrained by the inability to change data structures.
[2] in theory, in Rust's default #[repr(Rust)] it could reorder a struct's members, but it's not going to say change from array-of-struct to struct-of-array format or hashmap to btree.
Lines of thinking like that are part of the reason most modern software is so sloooow :)
    normal(x) == optimized(x)

Then I write the optimized code and use a test like yours to compare the results between the simple code and the optimized code.
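Something like this sketch, where the two operators are just placeholders for the simple and optimized implementations:

    import java.util.SplittableRandom;
    import java.util.function.LongUnaryOperator;

    final class EquivalenceCheck {
        // Run both versions on many random inputs; fail on the first mismatch.
        static void check(LongUnaryOperator simple, LongUnaryOperator fast,
                          long seed, long iterations) {
            SplittableRandom rng = new SplittableRandom(seed);
            for (long i = 0; i < iterations; i++) {
                long x = rng.nextLong();
                long expected = simple.applyAsLong(x);
                long actual = fast.applyAsLong(x);
                if (expected != actual) {
                    throw new AssertionError("mismatch for input " + x
                            + ": " + expected + " vs " + actual);
                }
            }
        }
    }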
One time I needed to write a function to search for a specific pattern in a binary file and change it. So I wrote the brute force code as a first step, the same code that anyone would probably write as a simple solution. It worked the first time, and a couple of people reviewed the code and said "yep, even if it's slow, it is correct."
But this code took more than a second to run!
Of course I thought about optimizing it with Boyer-Moore or the like. Then I went, "Hold on to your horses. This isn't something like a web page load where one second matters. It's part of a build process that only runs a few times a day and already takes several minutes to run. One extra second is nothing!"
In the wise words of Kenny Rogers in The Gambler:
You got to know when to hold 'em,
know when to fold 'em
Know when to walk away
and know when to run

Every coder knows
the secret to optimizin’
is knowing what to throw away
and knowing what to keep

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
A more robust strategy would be at least be to check if the rule was the same as the previous one (or a small hash table) so that the system is self-healing.
Ken’s solution is at least robust, and by that property I would prefer it, since it’s just as fast but doesn’t have any weird tail latencies: requests outside your cache distribution are as fast as the ones in it.
Ken's solution was stated to have been slower than the alternative optimization.
Also, it's trivial to keep Ken's implementation as the slow path. If request patterns change, dig up the new fast path and put the old one in Ken's slow path code. Most of the performance will still come from the initial `if`.
    if value in most_common_range:
        take_fast_path()
    else:
        use_kens_solution()

Learning this is worth writing the otherwise useless code.
Now? I just dump the problem into an LLM and it spits out a script that solves it, and then I throw away the script because it's horribly bugged for any other case than my particular problem. But hey, I just solved my problem.
That's why these days I just start writing code, get it to 'it works!' level. To then throw it all away and start again. Takes a little more effort, but the payoff is far less problematic code and tech debt.
¹: Write Everything Twice
That said, you can still have generic optimizations which will benefit nearly all (for instance quick sort).
In PGO you profile your application running in the real world (via automatic instrumentation) and the compiler uses the output to make optimization decisions.
I'm not sure what exactly they use it for, but I imagine it could be used for loop unrolling, switch case ordering, branch prediction optimization, memory ordering and more.
There's absolutely no reason to think that the specific workload of a system generally makes a random/arbitrary/heuristic choice the best way to pick the most likely branch for a given branching instruction.
For example, when scanning data you might check if the current value is equal to a test value, and branch on that to either code that handles the match, or else code that moves to the next entry to test that. This may be extremely likely to match (95%) as in a hash table, or very likely to not match (string search). A compiler can't tell which is true, and your workload literally has no impact on this unless it is pathological.
This is what I meant by pathological data. If you had weird data that somehow, over and over, got hash tables right up to their fill limit and just stayed there then you would have slightly more collisions than an average use case. However you would have to construct weird data to do this. It wouldn't be "more ingestion". The thing PGO optimized would be individual branches in the hash table code, so for example the branching instruction to see if the key in a bucket is == to the key you are looking for.
In read-only workload, searching through the key-value space there can't be a collision, no?
https://en.wikipedia.org/wiki/Hash_table#Collision_resolutio...
What I pictured in my head is that ingestion workloads, in contrast to read-only workloads, are different in that collisions in the former can trigger hash-map resizing/rehashing, which is why I say that these are two distinct workloads: moving a bunch of data to and from the buckets.
So, my hypothesis is the following: if I PGO the implementation using the dataset (volume, data_distribution) that it triggers a handful of such resizing/rehashing operations, and we observe the positive improvements in wall-time runtime performance, what are the chances that using the same binary but now over a completely different dataset will preserve the same runtime performance improvements? What you're saying, if I understand correctly, is that the improvements will be the same regardless of the dataset (and even type of workload) and I am saying I am not so sure about that.
So every time you add something to the hashtable, it will check the current size against the threshold at which it needs to resize. 90% of the time, then 99% of the time, then 99.9% of the time and so on, the answer will be 'no, there is space left'. This is just because the size goes up geometrically (in general), so a very low percentage of insert operations trigger a resize. The 'high probability branch' for "if current_size == max_size" should almost always be false. This is independent of workload. Without profile data, the compiler would not know whether that comparison is 'almost always true' or 'almost always false', which might lead the CPU's branch predictor to guess wrong, which will lead to pipeline stalls.
Read up on how pipelined, superscalar CPUs work to understand why this is very important.
https://en.wikipedia.org/wiki/Pipeline_stall
https://en.wikipedia.org/wiki/Predication_(computer_architec...
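As a toy illustration of that resize branch (not real HashMap code, just the shape of it):

    final class TinyTable {
        private Object[] slots = new Object[16];
        private int size;

        void add(Object o) {
            if (size == slots.length) {   // almost always false, regardless of workload
                // rare geometric resize
                slots = java.util.Arrays.copyOf(slots, slots.length * 2);
            }
            slots[size++] = o;
        }
    }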
I'd say that an org has a limited number of pennies to spend on non-trivial code. Let's spend them wisely.
Meanwhile, one of my favorite coworkers smelled a caching opportunity in one of our libraries that we use absolutely everywhere, and I discovered a whole can of worms around it. The call was too expensive which was most of the reason a cache seemed desirable. My team had been practicing Libraries Over Frameworks, and we had been chipping away for years at a NIH framework a few of them and a lot of people who left had created in-house. At some point we lost sight of how we had sped up the “meat“ of workloads at the expense of higher setup time initializing these library functions over, and over, and over again. Example: the logging code depended on the feature toggle code, which depended on a smaller module that depended on a smaller one still. Most modules depended on the logging and the feature toggle modules. Those leaf node modules were being initialized more than fifty times per request.
Three months later I had legitimately cut page load time by 20% by running down bad code patterns. 1/3 of that gain was two dumb mistakes. In one we were reinitializing a third party library once per instantiation instead of once per process.
So on the day they shipped, they claimed a 20% response-time reduction (which was less than the proposal expected to achieve), but a lot of their upside came from avoiding work I’d already made cheaper. They missed that the last of my changes went out in the same release, so their real contribution was 11%, but because of the extra requests it only resulted in a 3% reduction in cluster size, after three man-years of work. Versus 20% and 7% respectively for my three months of work spread over five.
Always pay attention to the raw numbers and not just the percentages in benchmarks, kids.
>so they changed the existing code to just check that predicate first, and the code sped up a zillion times, much more than with Ken's solution.
but you've now spent all this time developing some beautiful solution for .01% of the actual data flow. Not the best use of dev time, essentially.
1. One branch for the one-byte case, always inlined, just store the byte
2. Calculate the size of the unencoded zero prefix with a branch-free construction: (((uint32_t)clz + 7) * 9) >> 6
3. Hand roll a jump table taking advantage of arm's fixed instruction width to calculate the jump target with a shift.
https://github.com/protocolbuffers/protobuf/commit/b039dfe26...
This results in one branch for 1 byte varints, plus one additional branch for any larger size, and the branch predictor does not have to track a varying trip count through a loop. This approach resulted in a 2.64% speedup for overall encoding (which includes a lot more than just varints) on mid sized arm cores.
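If I'm reading that step-2 formula right, it computes how many of the (at most 10) output bytes would be all zero, so the encoded length falls out without a branch. Transliterated to Java for illustration, assuming clz is taken over the full 64-bit value:

    final class VarintLength {
        static int length(long value) {
            int clz = Long.numberOfLeadingZeros(value | 1); // | 1 keeps clz <= 63 for value == 0
            int zeroPrefixBytes = ((clz + 7) * 9) >>> 6;    // output bytes that would stay zero
            return 10 - zeroPrefixBytes;                    // encoded length in bytes (1..10)
        }
    }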
I think it's very difficult to beat a single comparison and branch for the 1 byte case for actual encoding forwards or backwards, unless you know there's going to be long runs of one-byte values.
I would expect C or asm to be a better choice at ultra high levels of optimization. I know Java can get somewhat similar performance if you put the work in, but I’d assume you want to do manual memory management at minimum.
I’m just saying that this company seems to be throwing a lot of money at the problem, and at those levels you can squeeze more performance out of them and get them reasonably right.
After all, many, many pieces of reliable software are written in C. If you wanted a “safer” modern language, you could pick Rust.
I would just assume those would be easier picks rather than optimizing around GC pauses, which can be notoriously difficult when nanoseconds matter.
Java is unique in that it often ran a dual stack of bytecode and platform-specific binary objects.
My advice often is to use high-level languages if you can get away with the convenience, and a low-level language with proper design patterns if you must.
Many programmers I've met shouldn't write C/C++, but mostly because they can't be trusted to write memory- or thread-safe code.
Cheers, =3
C doesn’t inherently have to be bad (and can often be better than C++) if you are strict about adhering to standards.
I like to point to OpenBSD as a successful and well-done C project. As a bonus, if you develop on OpenBSD, your program is more likely to have correct memory handling operations (OpenBSD has a lot of very aggressive memory protections. Many issues in ported applications like Firefox were found and fixed by OpenBSD users, but would happily keep running on Linux).
"Why we chose Java for our High-Frequency Trading application"
https://medium.com/@jadsarmo/why-we-chose-java-for-our-high-...
I still just feel like a non-GC language (maybe even Rust) is a better choice when nanoseconds matter.
Ada is not something that these companies would bother using, even though there are about 7 vendors still in business.
Likewise with either FreePascal or Modula-2.
Most likely a few of these exchanges are writing new code in Rust for such deployment scenarios.
Data condition can definitely affect runtime performance. All the way down to the micro architecture level. Random, sorted, reverse-sorted, uniform data sets often lead to dramatic differences in perf. Then you get into “Ooops, my dataset had a surprisingly large number of NaNs or denormalized floats!”
Obviously, it becomes hard in such an aggressively optimized system, but if that's the case, trying and failing is an acceptable approach (I've seen 2 weeks wasted in many worse ways, to be honest).
Another anecdote: I profiled our Java-based server and concluded that most of the time was being spent on creating temporary DI containers for request handling. I spent several days trying to avoid copying the "app" scope components to the "request" scope container (we forked and heavily modified our DI container implementation) so that creating request containers was cheaper. I wrote a JMH benchmark that tried to emulate the real thing based on the real data (with something like 40% of components in the "app" scope, 60% in the "request" scope). My optimisation made the code noticeably faster; I can't remember by how much exactly, but probably something like 25% faster.
When I ran the "real" performance test that simulates real-world usage of the server (not just the containers), the performance was slightly worse!! Ended up throwing the code away, as the OP did. I had already spent too much time on it so I couldn't dig much further into why that was. But from other experiments, I can say that the JVM, thanks to JIT compilation of bytecode, will heavily optimize the hot path anyway based on the actual usage patterns, so much that the Java code you wrote may resemble very little of the machine code that is actually executing in the end. That's why I said it may also be "trivial" to optimize JVM code: just let the JIT do its job.
Of course you may still obtain big gains when you optimize code that was doing stupid things, but because of JIT "magic", even your stupid code might run really fast given enough JIT passes.
In a sense, it's a shame this encoding wasn't structured like UTF-8, or even the other way around, a shame UTF-8 wasn't structured in this more generic way.
* it's not "self-synchronizing". If you jump into the middle of a ULEB128 stream and see a byte starting with 0 you can't tell whether it is a single-byte code point or the last byte of a multi-byte code-point. You have to back up to determine the meaning of the byte, which is often not viable. In fact you can't even be sure of the first 3 bytes!
* Relatedly, some code-points are subsets of other code points (0AAAAAAA appears inside of 1BBBBBBB 0AAAAAAA), so you can't search for a code point in a string without checking backwards to make sure you are comparing against a whole code point.
* You can't tell the difference between a string cut at the end of a 2 byte code point and one cut part way through a 3 byte code point.
But for issue #2, that seems to not be too bad since you only need to look one byte backwards.
At the same time, for #3, in the middle of a UTF-8 bytestream you need to look backwards as well for anything but the ASCII (7-bit) codepoints.
That left only varint used for tags. For encoding we already know the size at compile time. We also don’t have to decode them since we can match on the raw bytes (again, known at compile time).
A final optimization is to make sure the message has a fixed size and then just mutate the bytes of a template directly before shipping it off. Hard to beat.
In a newer project we just use SBE, but it lacks some of the ergonomics of protobuf.
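Roughly what that fixed-size-template trick looks like (a sketch; the field and offset are made up):

    // Pre-serialize the message once, then patch only the fixed-width fields
    // that change before shipping the buffer.
    final class TemplateWriter {
        private final byte[] template;   // pre-encoded message bytes
        private final int priceOffset;   // byte offset of a fixed-width field, assumed known

        TemplateWriter(byte[] template, int priceOffset) {
            this.template = template.clone();
            this.priceOffset = priceOffset;
        }

        byte[] write(long price) {
            java.nio.ByteBuffer.wrap(template)
                    .order(java.nio.ByteOrder.LITTLE_ENDIAN)
                    .putLong(priceOffset, price);
            return template;             // same buffer every time, mutated in place
        }
    }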
Better to do janky testing with a cobbled together script on your own data & usage scenarios than standard tools that measure standard metrics
That must be awful to work with. Why couldn't they just pick a more adequate language if they wanted performance? Or write bindings for performance critical parts?
The good variable integer encoding is the one used in MKV format: https://www.rfc-editor.org/rfc/rfc8794.html#name-variable-si... The compression win i.e. encoded size is equal to LEB128, however both encoding and decoding are way more efficient.
The encoder is a few fast instructions for all numbers no matter how large or small: bit scan reverse (or leading zero count) to find the count of bits required to store the input number, division by 7 rounding up to compute the count of bytes (modern compilers reliably optimize this into integer multiplication and bit shifts), bitwise shift + OR to set the VINT_MARKER bit, then byte swap to store in big-endian format.
The decoder is equally efficient: compare first byte with 0 to detect numbers longer than 8 bytes (very rarely happens), bit scan reverse of either first or second byte to find encoded length, byte swap to load the big-endian format, then reset the most significant bit in the integer to get rid of the VINT_MARKER.
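A rough Java sketch of that encoder, for values that fit in 8 encoded octets (up to 56 data bits) and ignoring the reserved all-ones patterns:

    final class Vint {
        // n-1 leading zero bits, a VINT_MARKER bit, then the value,
        // stored big-endian in n octets where n = ceil(significant bits / 7).
        static byte[] encode(long value) {
            int bits = 64 - Long.numberOfLeadingZeros(value | 1); // significant bits, >= 1
            int n = (bits + 6) / 7;                               // encoded length in octets
            long withMarker = value | (1L << (7 * n));            // set the marker bit
            byte[] out = new byte[n];
            for (int i = n - 1; i >= 0; i--) {                    // big-endian byte order
                out[i] = (byte) withMarker;
                withMarker >>>= 8;
            }
            return out;
        }
    }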
Actually there is - you can exploit the data parallelism through SIMD to replace the logarithmic with near-constant complexity. Classic approach indeed is very slow and unfriendly to the CPU.
That doesn’t help much when you’re streaming the bytes, like many parsers or de-serializers do. You have to read bytes from the source stream one by one because each next byte might be the last one of the integer being decoded. You could work around this with a custom buffering adapter. That's hard to do correctly, costs performance, and in some cases is even impossible: when the encoded integers are coming in real time from the network, trying to read the next few bytes into a RAM buffer may block.
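Concretely, a byte-at-a-time LEB128 decode over a stream looks roughly like this, which is why you can't easily read ahead:

    import java.io.EOFException;
    import java.io.IOException;
    import java.io.InputStream;

    final class Leb128 {
        // Only the continuation (high) bit of each byte says whether another
        // byte follows, so a streaming decoder must read one byte at a time.
        static long readUnsigned(InputStream in) throws IOException {
            long result = 0;
            for (int shift = 0; shift < 64; shift += 7) {
                int b = in.read();
                if (b < 0) throw new EOFException("truncated varint");
                result |= (long) (b & 0x7F) << shift;   // low 7 bits carry payload
                if ((b & 0x80) == 0) return result;     // high bit clear: last byte
            }
            throw new IOException("varint too long");
        }
    }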
With MKV variable integers, you typically know the encoded length from just the first byte. Or in very rare cases of huge uint64 numbers, you need 2 bytes for that. Once you know the encoded length, you can read the rest of the encoded bytes with a single read function.
Another thing, unless you are running on a modern AMD64 CPU which supports PDEP/PEXT instructions (parts of BMI2 ISA extension), it’s expensive to split/merge the input bits to/from these groups of 7 bits in every encoded byte. MKV variable integers don’t do that, they only need bit scan and byte swap, both instructions are available on all modern mainstream processors including ARM and are very cheap.
I sometimes do for the following two use cases.
1. When the protocol delivers real-time data, I serialize byte by byte over the network to minimize latency.
2. When the messages in question are large, like 1GB or more, I don’t want to waste that much memory on both ends of the socket to buffer complete messages like Protobuf does. On the sending side, I want to produce the message, encode it at the same time with a streaming serializer, and stream bytes directly into the socket. Similarly, on the receiving side I want to de-serialize bytes as they come with a streaming de-serializer. Without buffering the complete message on the sending side, the receiving side can’t know the length of the message in advance before receiving the complete message, because the sender only knows the length after the complete message is serialized.
With streamed serialization, the receiver doesn’t know when the message will end. This generally means you can’t optimize LEB128 decoding by testing high bits of 10 bytes at once.
For example, let’s say the message is a long sequence of strings. Serialized into a sequence of pairs [ length, payload ] length is var.int, payload is an array of UTF8 bytes of that length, and the message is terminated with a zero-length string.
You can’t implement data parallel LEB128 decoder for the length field in that message by testing multiple bytes at once because it may consume bytes past the end of the message. A decoder for MKV variable integers only needs 1-2 read calls to decode even large numbers, because just the first byte contains the encoded length of the var.int.
> Never have I ever seen such a highly optimized Java codebase. Not before, not since.
This makes me wonder where it was. I know in the fintech world there's at least one hedge fund (not HFT, not market maker) where all their latency-sensitive trading code is written in Java, and when I asked them about GC pauses they said that their hot path produces no garbage so GC doesn't kick in.
From what I saw poking at the development trading boxes they over-provisioned both cpu ghz and ram to a ridiculous level. The machines looked nearly idle at their busiest.
Reminded of that missile defense system that was designed to be software-reset once a day, except the design assumptions changed (a war started) and it was ordered to be left running nonstop as an emergency measure; after being left on for a week, it failed, causing a large number of deaths. That one had some kind of integrator that accumulated floating-point roundoff over time.
I'm sure you've simplified the story, but it seems like a bit of process failure for a missile defense system to assume peacetime. There's writing software that implements the requirements, and then there's making sure the requirements are right both up front and when you really rely on them in a critical way.
[0] https://www.cnet.com/culture/windows-may-crash-after-49-7-da...
https://devblogs.microsoft.com/oldnewthing/20180228-00/?p=98...
Later they’d go back to using C++ for training deep neural nets.
"This is because Zing uses a unique collector called C4 (Continuously Concurrent Compacting Collector) that allows pauseless garbage collection regardless of the Java heap size (up to 8 Terabytes)."
Azul's GC runs concurrently with the app, so those GC durations occur without a pause. If you think about it, there's no reason why GC requires a pause in order to work; with the right conditions, GC should be able to analyze and remove dead allocations without stopping an app.