Here it gets the task struct: https://elixir.bootlin.com/linux/v6.18.5/source/kernel/time/... and from here https://elixir.bootlin.com/linux/v6.18.5/source/kernel/time/... to here, where it actually pulls the value out: https://elixir.bootlin.com/linux/v6.18.5/source/kernel/sched...
And here is the vdso clock-pick logic https://elixir.bootlin.com/linux/v6.18.5/source/lib/vdso/get... and here is the fallback to the syscall if it's not a vdso clock https://elixir.bootlin.com/linux/v6.18.5/source/lib/vdso/get...
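For the caller's side of that split, a minimal sketch (assuming glibc; which clocks actually take the vdso fast path is an arch/config detail, and the per-thread CPU clock generally does not):

    #include <stdio.h>
    #include <time.h>

    int main(void) {
        struct timespec wall, cpu;

        /* Typically served from the vdso: no kernel entry at all. */
        clock_gettime(CLOCK_MONOTONIC, &wall);

        /* Per-thread CPU time is generally not a vdso clock, so this
         * call goes through the syscall fallback path linked above. */
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu);

        printf("monotonic:  %ld.%09ld\n", (long)wall.tv_sec, wall.tv_nsec);
        printf("thread cpu: %ld.%09ld\n", (long)cpu.tv_sec, cpu.tv_nsec);
        return 0;
    }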
This is not well documented unfortunately, and I'm not aware of open-source implementations of this.
EDIT: Or maybe not, I'm not sure if PERF_COUNT_SW_TASK_CLOCK allows selecting only user time. The kernel can definitely do it, but I don't know if the wiring is there. However, this definitely works for overall thread CPU time.
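If anyone wants to poke at it, here's a rough sketch of counting a thread's on-CPU time with perf_event_open and PERF_COUNT_SW_TASK_CLOCK; pid = 0, cpu = -1 means "the calling thread on any CPU", and whether exclude_kernel is honoured for this counter is exactly the uncertainty above:

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size   = sizeof(attr);
        attr.type   = PERF_TYPE_SOFTWARE;
        attr.config = PERF_COUNT_SW_TASK_CLOCK;   /* on-CPU time of the task, in ns */
        /* attr.exclude_kernel = 1;   <- the bit you'd try for user-only time;
         * unclear whether it is honoured for this software counter. */

        int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        volatile double sink = 0;                 /* some work to measure */
        for (int i = 0; i < 10 * 1000 * 1000; i++) sink += i;

        uint64_t ns = 0;
        if (read(fd, &ns, sizeof(ns)) == (ssize_t)sizeof(ns))
            printf("task clock: %llu ns\n", (unsigned long long)ns);
        close(fd);
        return 0;
    }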
Presumably you mean you just double check the page value after the rdtsc to make sure it hasn't changed and retry if it has?
Tbh I thought clock_gettime was a vdso based virtual syscall anyway
Yes, that's exactly what a seqlock (reader) is.
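Roughly this, with made-up field names rather than the real vdso data layout:

    #include <stdint.h>

    struct clock_page {                 /* illustrative, not the kernel's struct */
        volatile uint32_t seq;          /* odd while the writer is mid-update */
        volatile uint64_t base_ns, tsc_base;
        volatile uint32_t mult, shift;
    };

    static inline uint64_t rdtsc(void) {
        uint32_t lo, hi;
        __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    uint64_t read_clock(const struct clock_page *p) {
        for (;;) {
            uint32_t seq = p->seq;
            if (seq & 1)
                continue;                          /* writer in progress, retry */
            __asm__ volatile("" ::: "memory");     /* read data only after seq */
            uint64_t delta = rdtsc() - p->tsc_base;
            uint64_t ns = p->base_ns + ((delta * p->mult) >> p->shift);
            __asm__ volatile("" ::: "memory");     /* re-check seq only after data */
            if (seq == p->seq)
                return ns;                         /* unchanged: snapshot was consistent */
        }
    }

The real vdso readers use proper memory barriers; the compiler barriers here just keep the sketch short.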
Love the people and their software.
Great blog Jaromir!
> Click to zoom, open in a new tab for interactivity
I admit I did not expect "Open Image in New Tab" to do what it said on the tin. I guess I was aware that it was possible with SVG but I don't think I've ever seen it done and was really not expecting it.
Normally, I use the generator included in async-profiler. It produces interactive HTML. But for this post, I used Brendan’s tool specifically to have a single, interactive SVG.
Also `samply` and the Firefox profiler are pretty fancy too.
There's really no reason to use the original flamegraph scripts.
Me: looks at my code. "sure, ok, looks alright."
Me: looks at the resulting flamegraph. "what the hell is this?!?!?"
I've found all kinds of crazy stuff in codebases this way. Static initializers that aren't static, one-line logger calls that trigger expensive serialization, heavy string-parsing calls that don't memoize patterns, etc. Unfortunately some of those are my fault.
Regular flamegraphs are good too, icicle graphs are just another tool in the toolbox.
If you're referring to "one-line logger calls that trigger expensive serialization", it's also common in Java.
Or another great one: new instances of ObjectMapper created inside a method for a single call and then thrown away.
Unless you're inheriting code written by Bill Atkinson or something.
https://github.com/brendangregg/FlameGraph
Useful site if you're into perf/eBPF/performance things, with many examples and descriptions, even for other uses such as memory usage and disk usage (I prefer heatmaps there, but flame graphs are nice if you want to send someone an interactive view of their directory tree ...).
This is what flame graphs are super helpful for, to see whether it’s really a problem or not.
Also, remember that every extra moment running instructions is a lost opportunity to put the CPU to sleep, so this has energy efficiency impact as well.
Even called once per frame at 60 fps, it's only 0.2% of the frame time budget.
It's not a huge amount of time in absolute terms; only if it's relatively "hot."
Very interesting read.
Thanks for the write-up Jaromir :) For those interested, I explored memory overhead when reading /proc—including eBPF profiling and the history behind the poorly documented user-space ABI.
Full details in my write-up: https://norlinder.nu/posts/User-CPU-Time-JVM/
edit: I just read your blog in full and I have to say I like it more than mine. You put a lot more rigor into it. I’m just peeking into things.
edit2: I linked your article from my post.
So, why do you reckon they did that?
The method in question (Java 1.5) was released in September 2004. While the POSIX standard existed, it only provided a way to get total CPU time, not the specific user time that Java needed. You can read more about it in the history section here: https://norlinder.nu/posts/User-CPU-Time-JVM/#a-walk-through....
But it's worth noting that while this specific case can be "fixed" with a function call, parsing /proc is still the standard way to get data in Linux.
Even today, a vast amount of kernel telemetry is only exposed via the filesystem. If you look at the source code for tools like htop, they are still busy parsing text files from /proc to get memory stats (/proc/meminfo), network I/O, or per-process limits. See here https://github.com/hishamhm/htop/blob/master/linux/LinuxProc....
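To make the "still parsing text files" point concrete, here's a rough sketch (not htop's actual code) that digs utime/stime out of /proc/self/stat; field positions are per proc(5), and the comm field has to be skipped explicitly because it can contain spaces:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        char buf[4096];
        FILE *f = fopen("/proc/self/stat", "r");
        if (!f) { perror("fopen"); return 1; }
        size_t n = fread(buf, 1, sizeof(buf) - 1, f);
        fclose(f);
        buf[n] = '\0';

        char *p = strrchr(buf, ')');          /* end of comm, which may contain spaces */
        if (!p) return 1;
        p += 2;                               /* now at field 3 (state) */

        unsigned long utime, stime;           /* fields 14 and 15, in clock ticks */
        if (sscanf(p, "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
                   &utime, &stime) != 2) return 1;

        long hz = sysconf(_SC_CLK_TCK);
        printf("user %.3f s, system %.3f s\n", (double)utime / hz, (double)stime / hz);
        return 0;
    }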
I knew about using proc for all that other information. I just wouldn't have imagined using it on a critical performance path. Unless, that is, that's the way you have to get the information.
While I don't particularly like the IO overhead and churn consequences of real files for performance metrics, I get the 9p-like appeal of treating the virtual fs as a DBMS/API/ABI.
https://github.com/facebook/folly/blob/main/folly/docs/Bench...
The mean does get proper statistical treatment (t-distribution confidence interval), but you're right that JMH doesn't compute confidence intervals for percentiles. Reporting p0.00 with three significant figures is ... optimistic.
That said, I think the core finding survives this critique. The improvement shows up consistently across ~11 million samples at every percentile from p0.50 through p0.999.
That said, at your CPU base frequency, 80 ns is ~344 cycles and 70 ns is ~300 cycles, so the difference is ~40 cycles. That's on the order of ~2 CPU pipeline flushes due to branch mispredictions. Another example is RDTSCP which, at least on Intel CPUs, forces all prior instructions to retire before executing and prevents speculative execution of the following instructions until its result is available. This can also impose a 10-30 cycle penalty. Both of these can interfere with measurements at the scale you're looking at, so there is a possibility that you're measuring these effects instead of the optimization you thought you implemented.
I am not saying that this is the case, just that it's possible. Since the test is simple enough, I would eliminate other similar CPU-level gotchas that can screw up your hypothesis testing. In more complex scenarios I would have to consider them as well.
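For illustration, this is roughly what a serialized timing bracket looks like (a sketch following the usual Intel benchmarking advice, not your harness; do_work is a placeholder). RDTSCP waits for prior instructions to retire, and the fences keep the measured region from leaking past the timestamps; that ordering is where the extra cycles come from:

    #include <stdint.h>
    #include <x86intrin.h>

    static inline uint64_t cycles_begin(void) {
        _mm_lfence();                 /* don't start timing before earlier work finishes */
        return __rdtsc();
    }

    static inline uint64_t cycles_end(void) {
        unsigned aux;
        uint64_t t = __rdtscp(&aux);  /* waits for prior instructions to retire */
        _mm_lfence();                 /* keep later instructions from starting early */
        return t;
    }

    uint64_t measure(void (*do_work)(void)) {
        uint64_t start = cycles_begin();
        do_work();
        return cycles_end() - start;  /* includes the cost of the fencing itself */
    }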
The only reliable way I've found to be sure of what is really happening is to read the codegen. And I do that _before_ each test run, or more precisely after each recompile, because compilers do crazy transformations to our code, even when you just move an innocent-looking function a few lines up or add some naive boolean flag. If I don't do that, I could again end up measuring, observing, and finally concluding that I implemented a speedup, without realizing that in that last case the compiler decided to eliminate half of the code because of that innocuous boolean flag. Just an example.
The radix tree lookup looks interesting, and it would be interesting to see exactly which instruction it idles on. I had a case where a function would sit idle, reproducibly, but when you looked into the function there was nothing obvious to optimize. It turned out that the CPU pipeline was so saturated that there were no CPU ports left for the instruction this function was waiting on. The fix was to rewrite code elsewhere, but in the vicinity of this function. This is something flame graphs can never show you, which is partly why I've never been a huge fan of them.
edit: I had an afterthought about this, because it ended up being a low-quality comment.
Bringing up such a TL;DR gives a lot of value to reading the content, especially on HN, as it provides way more momentum and lets you focus. Reading this short form felt like that cool friend who gives you a heads up.