A 40-line fix eliminated a 400x performance gap

(questdb.com)

370 points | by bluestreak 24 days ago

17 comments

ot 24 days ago
You can do even faster, about 8ns (almost an additional 10x improvement) by using software perf events: PERF_COUNT_SW_TASK_CLOCK is thread CPU time, it can be read through a shared page (so no syscall, see perf_event_mmap_page), and then you add the delta since the last context switch with a single rdtsc call within a seqlock.
This is not well documented unfortunately, and I'm not aware of open-source implementations of this.
EDIT: Or maybe not, I'm not sure if PERF_COUNT_SW_TASK_CLOCK allows selecting only user time. The kernel can definitely do it, but I don't know if the wiring is there. However this definitely works for overall thread CPU time.
- jerrinot 24 days ago
  That's a brilliant trick. The setup overhead and permission requirements for perf_event might be heavy for arbitrary threads, but for long-lived threads it looks pretty awesome! Thanks for sharing!
  - ot 24 days ago
    Yes you need some lazy setup in thread-local state to use this. And short-lived threads should be avoided anyway :)
    - catlifeonmars 24 days ago
      I guess if you need the concurrency/throughput you should use a userspace green thread implementation. I’m guessing most implementations of green threads multiplex onto long running os threads anyway
      - jerrinot 24 days ago
        In a system with green threads, you typically want the CPU time of the fiber or tasklet rather than the carrier thread. In that case, you have to ask the scheduler, not the kernel.
- nly 24 days ago
  Why do you need a seqlock? To make sure you're not context switched out between the read of the page value and the rdtsc?
  Presumably you mean you just double check the page value after the rdtsc to make sure it hasn't changed and retry if it has?
  Tbh I thought clock_gettime was a vdso based virtual syscall anyway
  - ot 23 days ago
    > Presumably you mean you just double check the page value after the rdtsc to make sure it hasn't changed and retry if it has?
    Yes, that's exactly what a seqlock (reader) is.
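A minimal sketch of the seqlock reader pattern described above, as a toy in pure Python. The class name `SeqlockCell` and the `extra` callback (standing in for the rdtsc delta) are hypothetical. A real reader of `perf_event_mmap_page` would be written in C against the page's `lock` field with proper memory barriers, which this simulation does not model.

```python
class SeqlockCell:
    """Toy single-writer seqlock: the sequence is odd while a write is in progress."""

    def __init__(self, value=0):
        self.seq = 0
        self.value = value

    def write(self, value):
        self.seq += 1        # odd: write in progress
        self.value = value
        self.seq += 1        # even: write complete

    def read(self, extra=lambda: 0):
        """Return value + extra(), retrying if a write raced with the read."""
        while True:
            start = self.seq
            if start % 2:            # writer is mid-update, try again
                continue
            result = self.value + extra()
            if self.seq == start:    # nothing changed while we were reading
                return result


cell = SeqlockCell(100)
cell.write(250)
print(cell.read(lambda: 5))
```

The reader never blocks the writer: it only re-checks the sequence number after computing its result and retries when the number moved.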
- mgaunard 24 days ago
  clock_gettime is not doing a syscall, it's using vdso.
  - jerrinot 24 days ago
    clock_gettime() goes through the vDSO shim, but whether it avoids a syscall depends on the clock ID and (in some cases) the clock source. For thread-specific CPU user time, the vDSO shim cannot resolve the request in user space and must transit into the kernel. In this specific case, there is absolutely a syscall.
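For readers who want to poke at per-thread CPU clocks themselves, here is a hedged sketch in Python on Linux. It reads total thread CPU time with `clock_gettime(CLOCK_THREAD_CPUTIME_ID)` and, as a swapped-in technique, user-only time with `getrusage(RUSAGE_THREAD)`. This is not the JVM's code path, and both calls enter the kernel. The helper names `thread_cpu_times` and `burn_cpu` are made up for the example.

```python
import resource
import time


def thread_cpu_times():
    """Return (total_cpu_seconds, user_cpu_seconds) for the calling thread."""
    # Total (user + system) CPU time of this thread via clock_gettime().
    total = time.clock_gettime(time.CLOCK_THREAD_CPUTIME_ID)
    # User-only CPU time of this thread via getrusage(RUSAGE_THREAD), Linux-specific.
    user = resource.getrusage(resource.RUSAGE_THREAD).ru_utime
    return total, user


def burn_cpu(iterations=200_000):
    """Spin in user space so the counters visibly advance."""
    acc = 0
    for i in range(iterations):
        acc += i * i
    return acc


before_total, before_user = thread_cpu_times()
burn_cpu()
after_total, after_user = thread_cpu_times()
print(f"total CPU: {after_total - before_total:.6f}s, "
      f"user CPU: {after_user - before_user:.6f}s")
```

The user-time figure can be coarser than the total, so small deltas may read as zero.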
shermantanktop 24 days ago
Flamegraphs are wonderful.
Me: looks at my code. "sure, ok, looks alright."
Me: looks at the resulting flamegraph. "what the hell is this?!?!?"
I've found all kinds of crazy stuff in codebases this way. Static initializers that aren't static, one-line logger calls that trigger expensive serialization, heavy string-parsing calls that don't memoize patterns, etc. Unfortunately some of those are my fault.
- wging 24 days ago
  I also like icicle graphs for this. They're flamegraphs, but aggregated in the reverse order. (I.e. if you have calls A->B->C and D->E->C, then both calls to C are aggregated together, rather than being stacked on top of B and E respectively. It can make it easier to see what's wrong when you have a bunch of distinct codepaths that all invoke a common library where you're spending too much time.)
  Regular flamegraphs are good too, icicle graphs are just another tool in the toolbox.
  - pests 24 days ago
    So someone else linked the original flamegraph site [0] and it describes icicle graphs as "inverting the y axis" but that's not the only thing happening, right? You bucket the stack top-down as opposed to bottom-up, correct?
    [0] https://www.brendangregg.com/flamegraphs.html
    - wging 24 days ago
      It's certainly possible that what I encountered, labeled as an 'icicle graph', is a nonstandard usage of the term. But if so, that's a shame. I don't think inverting the y-axis is useful by itself, the different bucketing is what makes for an actually useful change.
    - yxhuvud 24 days ago
      Right, what is needed is something trie-like, with the root being the most fine-grained call.
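The difference in bucketing discussed in this subthread can be sketched with the A->B->C and D->E->C example from above. The `fold` function is a hypothetical helper, not part of any flamegraph tool: it counts samples per stack prefix, either root-first (regular flame graph) or leaf-first (the reversed aggregation).

```python
from collections import Counter


def fold(stacks, leaf_first=False):
    """Count samples per stack prefix; leaf_first reverses each stack first."""
    counts = Counter()
    for stack in stacks:
        frames = tuple(reversed(stack)) if leaf_first else tuple(stack)
        # Every prefix of the (possibly reversed) stack is one node in the trie.
        for depth in range(1, len(frames) + 1):
            counts[frames[:depth]] += 1
    return counts


samples = [("A", "B", "C"), ("D", "E", "C")]
root_first = fold(samples)
leaf_first = fold(samples, leaf_first=True)

print(root_first[("A", "B", "C")], root_first[("D", "E", "C")])  # C is split across two towers
print(leaf_first[("C",)])                                        # both C samples merged at the root
```

With leaf-first folding the common callee becomes a single root node, and its callers fan out beneath it.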
- tempaccsoz5 24 days ago
  Also cool that when you open it in a new tab, the svg [0] is interactive! You can zoom in by clicking on sections, and there's a button to reset the zoom level.
  [0]: https://questdb.com/images/blog/2026-01-13/before.svg
  - sllabres 24 days ago
    Yes, they are made with: http://www.brendangregg.com/flamegraphs.html and
    https://github.com/brendangregg/FlameGraph
    It's a useful site if you are into perf/eBPF/performance work, with many examples and descriptions, even for other uses such as memory usage or disk usage (I prefer heatmaps there, but flame graphs are nice if you want to send someone an interactive view of their directory tree ...).
- arethuza 24 days ago
  I always found profiling performance critical code and experimenting with optimisations to be one of the most enjoyable parts of development - probably because of the number of surprises that I encountered ("Why on Earth is that so slow?").
- jabwd 24 days ago
  I might be very wrong in every way, but string parsing and/or manipulation and memoization... sound like a super strange combo? For the first you know you're already doing expensive allocations, but the 2nd is also not a pattern I really see apart from in JS codebases. Could you provide more context on how this actually bit you in the behind? Memoizing strings seems like complicated and error-prone "welp it feels better now" territory in my mind, so I'm genuinely curious.
  - shermantanktop 24 days ago
    In Java it can be a bad toString() implementation hiding behind a + used for string assembly.
    Or another great one: new instances of ObjectMapper created inside a method for a single call and then thrown away.
    - shermantanktop 24 days ago
      To be clear this is often sloppy code that shouldn’t have been written. But in a legacy codebase this stuff can easily happen.
      - MengerSponge 23 days ago
        A huge chunk of a "legacy codebase" is "sloppy code that shouldn’t have been written"
        Unless you're inheriting code written by Bill Atkinson or something.
  - tyingq 24 days ago
    > but the 2nd is also not a pattern I really see apart from in JS codebases.
    If you're referring to "one-line logger calls that trigger expensive serialization", it's also common in java.
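The "one-line logger call that triggers expensive serialization" problem is described here for Java. The sketch below shows the same shape in Python's standard `logging` module, as an analogy rather than the Java case itself. The `Expensive` class is a made-up stand-in for an object with a costly string conversion.

```python
import logging


class Expensive:
    """Stand-in for an object whose string conversion is costly."""

    def __init__(self):
        self.str_calls = 0

    def __str__(self):
        self.str_calls += 1
        return "serialized state"


logger = logging.getLogger("lazy_logging_demo")
logger.setLevel(logging.INFO)  # DEBUG messages are disabled

eager = Expensive()
logger.debug(f"state: {eager}")   # the f-string runs __str__ before debug() is called

lazy = Expensive()
logger.debug("state: %s", lazy)   # formatting is deferred, then skipped since DEBUG is off

print(eager.str_calls, lazy.str_calls)  # → 1 0
```

The eager call pays for serialization even though the message is never emitted, which is exactly the kind of cost that tends to surface only in a profile.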
- sroerick 24 days ago
  I've never used flamegraphs but would like to know about them. Can you explain more? Or where should I start?
  - atdt 24 days ago
    Flame graphs have an official web site, maintained by Brendan Gregg, who invented them: https://www.brendangregg.com/flamegraphs.html. It's a useful starting point.
  - dummydummy1234 24 days ago
    I would also try hotspot, it is an interactive viewer for perf graphs.
  - jsymolon 24 days ago
    I use them all the time on Perl code.
    https://metacpan.org/pod/Devel::NYTProf
jerrinot 24 days ago
Author here. After my last post about kernel bugs, I spent some time looking at how the JVM reports its own thread activity. It turns out that "What is the CPU time of this thread?" is/was a much more expensive question than it should be.
- jacquesm 24 days ago
  I don't think it is possible to talk about fractions of nanoseconds without having an extremely good idea of the stability and accuracy of your clock. At best I think you could claim there is some kind of reduction but it is super hard to make such claims in the absolute without doing a massive amount of prep work to ensure that the measured times themselves are indeed accurate. You could be off by a large fraction and never know the difference. So unless there is a hidden atomic clock involved somewhere in these measurements I think they should be qualified somehow.
  - rcxdude 24 days ago
    Stability and accuracy, when applied to clocks, are generally about dynamic range, i.e. how good is the scale with which you are measuring time. So if you're talking about nanoseconds across a long time period, seconds or longer, then yeah, you probably should care about your clock. But when you're measuring nanoseconds out of a millisecond or microsecond, it really doesn't matter that much and you're going to be OK with the average crystal oscillator in a PC. (and if you're measuring a 10% difference like in the article, you're going to be fine with a mechanical clock as your reference if you can do the operation a billion times in a row).
    - jacquesm 24 days ago
      This setup is a user space program on a machine that is not exclusively dedicated to the test, with all kinds of interrupts (and other tasks) running left, right and center through the software under test.
      - loeg 24 days ago
        For something like this, you can just take several trials and look at the minimum observed time, which is when there will have been ~no interruptions.
        https://github.com/facebook/folly/blob/main/folly/docs/Bench...
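The "take several trials and look at the minimum" approach can be sketched as follows. This is a simplified illustration, not how folly or JMH measure. The function name `min_of_trials` and its default parameters are assumptions chosen for the example.

```python
import time


def min_of_trials(fn, trials=20, calls_per_trial=1000):
    """Return the lowest observed mean nanoseconds per call across trials."""
    best = float("inf")
    for _ in range(trials):
        start = time.perf_counter_ns()
        for _ in range(calls_per_trial):
            fn()
        elapsed = time.perf_counter_ns() - start
        # Keep the trial least disturbed by interrupts and other tasks.
        best = min(best, elapsed / calls_per_trial)
    return best


per_call_ns = min_of_trials(
    lambda: time.clock_gettime(time.CLOCK_THREAD_CPUTIME_ID)
)
print(f"best observed: {per_call_ns:.1f} ns per call (includes Python call overhead)")
```

The result is an upper bound on the undisturbed cost, and in Python it is dominated by interpreter overhead rather than by the clock read itself.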
        jacquesm 24 days ago
        You don't actually know that for sure. You have only placed a new upper bound.
        loeg 24 days ago
        This seems like more of a philosophical argument than a practical one.
        jacquesm 24 days ago
        No, it is a very practical one and I'm actually surprised that you don't see it that way. Benchmarking is hard, and if you don't understand the basics then you can easily measure nonsense.
        jerrinot 24 days ago
        You raise a fair point about the percentiles. Those are reported as point estimates without confidence intervals and the implied precision overstates what the system clock can deliver.
        The mean does get proper statistical treatment (t-distribution confidence interval), but you're right that JMH doesn't compute confidence intervals for percentiles. Reporting p0.00 with three significant figures is ... optimistic.
        That said I think the core finding survives this critique. The improvement shows up consistently across ~11 million samples at every percentile from p0.50 through p0.999.
        jacquesm 24 days ago
        Yes, I would expect the 'order of magnitude' value to be relatively close but the absolute values to be very imprecise.
        menaerus 24 days ago
        You can compute the confidence intervals all you want, but if you can't be sure, one way or another, that what you're observing (measuring) in your experiment is what you actually wanted to measure (signal), not even a confidence interval would help you distinguish between the signal and the noise.
        That said, at your CPU base frequency, 80ns is ~344 cycles, 70ns is ~300 cycles. That's ~40 cycles of difference. That's on the order of ~2x CPU pipeline flushes due to branch mispredictions. Or another example is RDTSCP which, at least on Intel CPUs, forces all prior instructions to retire before executing, and it prevents speculative execution of following instructions until their results are available. This can also impose a 10-30 cycle penalty. Both of these can interfere with measurements at the scale you have, so there is a possibility that you're measuring these effects instead of the optimization you thought you implemented.
        I am not saying that this is the case, I am just saying it's possible. Since the test is simple enough I would eliminate other similar CPU level gotchas that can screw your hypothesis testing up. In more complex scenarios I would have to consider them as well.
        The only reliable way I found to be sure what is really happening is to read the codegen. And I do that _before_ each test run, or to be more precise after each recompile, because compilers do crazy transformations with our code, even when just moving a naive-looking function a few lines above or adding some naive boolean flag. If I don't do that, I could again end up measuring, observing, and finally drawing the conclusion that I implemented a speedup without realizing that the compiler in that last case decided to eliminate half of the code because of that innocuous boolean flag. Just an example.
        The radix tree lookup looks interesting and it would be interesting to see exactly which instruction it idles on. I had a case where the function would be sitting idle, reproducibly, but when you look into the function there is nothing obvious you can optimize. It turned out that the CPU pipeline was so saturated that there were no more available CPU ports for the instruction this function was idling on. The fix was to rewrite code elsewhere but in the vicinity of this function. This is something flamegraphs can never show you, which is partly the reason I have never been a huge fan of them.
- Neywiny 24 days ago
  Did you look into the large spread on your distributions? Some of these span multiple orders of magnitude which is interesting
  - jerrinot 24 days ago
    Fair point. These were run on a standard dev workstation under load, which may account for the noise. I haven't done a deep dive into the outliers yet, but the distribution definitely warrants a more isolated look.
- 6r17 24 days ago
  Very thankful for the 1liner tldr
  edit : I had an afterthought about this because it ended up being a low quality comment ;
  Bringing up a TLDR like this gives a lot of value to reading content, especially on HN, as it provides way more inertia and lets you focus -
  reading this short form felt like that cool friend who gave you a heads up.
  - jerrinot 24 days ago
    I was unsure whether to post it or not so I am glad you found it useful!
    - 6r17 24 days ago
      I have that 10-30s time window to fill when claude might be loading some stuff; the 1-liner is exactly what fits in that window - it makes me wonder about the original idea of twitter now that I think of it - but since it's not the same kind of content I don't bother with it. It really feels like "here is the stuff, here's more about it if you want to" - really really appreciate that form and will definitely do the same format myself
- abicklefitch 24 days ago
  [dead]
jonasn 24 days ago
Author of the OpenJDK patch here.
Thanks for the write-up Jaromir :) For those interested, I explored memory overhead when reading /proc—including eBPF profiling and the history behind the poorly documented user-space ABI.
Full details in my write-up: https://norlinder.nu/posts/User-CPU-Time-JVM/
- jerrinot 24 days ago
  Hi Jonas, thanks for the work on OpenJDK and the post! I swear I hadn't seen your blog :) I finished my draft around Christmas and it’s been in the queue since. Great minds think alike, I guess.
  edit: I just read your blog in full and I have to say I like it more than mine. You put a lot more rigor into it. I’m just peeking into things.
  edit2: I linked your article from my post.
  - jonasn 24 days ago
    Thanks for the kind words and the link :).
- kstrauser 24 days ago
  Why do you suppose it was originally written the way it was? To my eyes, that seems like a horrible approach. Doing file IO and parsing strings in every call? What?! And yet I assume the original author was a smart person who had a reason why this made sense to them, and my inability to guess why is my own limitation and not theirs.
  So, why do you reckon they did that?
  - jonasn 24 days ago
    You are spot on that the original author had a valid reason: at the time, it was literally the only way to do it.
    The method in question (Java 1.5) was released in September 2004. While the POSIX standard existed, it only provided a way to get total CPU time, not the specific user time that Java needed. You can read about it more in the history section here: https://norlinder.nu/posts/User-CPU-Time-JVM/#a-walk-through....
    But it's worth noting that while this specific case can be "fixed" with a function call, parsing /proc is still the standard way to get data in Linux.
    Even today, a vast amount of kernel telemetry is only exposed via the filesystem. If you look at the source code for tools like htop, they are still busy parsing text files from /proc to get memory stats (/proc/meminfo), network I/O, or per-process limits. See here https://github.com/hishamhm/htop/blob/master/linux/LinuxProc....
    - kstrauser 24 days ago
      That sounds like a pretty good reason!
      I knew about using proc for all that other information. I just wouldn’t have imagined using it for a critical performance path. Unless, that is, that’s the way you have to get the information.
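To make "file IO and parsing strings in every call" concrete, here is a hedged Python sketch of reading per-thread CPU time from procfs. It illustrates the general approach only and is not the JVM's code. The function names are hypothetical, and it assumes the documented `/proc/<pid>/stat` layout where `utime` and `stime` are fields 14 and 15, counted in clock ticks.

```python
import os
import threading


def parse_stat_cpu_ticks(stat_line):
    """Extract (utime, stime) in clock ticks from one /proc stat line."""
    # Field 2 (comm) is wrapped in parentheses and may contain spaces or ')',
    # so split after the LAST ')' and count the remaining fields from field 3.
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[11]), int(rest[12])  # fields 14 and 15


def thread_cpu_seconds_from_proc():
    """Open, read and parse this thread's stat file on every call."""
    path = f"/proc/self/task/{threading.get_native_id()}/stat"
    with open(path) as f:
        utime, stime = parse_stat_cpu_ticks(f.read())
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    return utime / ticks_per_second, stime / ticks_per_second


# Synthetic line with an awkward thread name to exercise the parser.
sample = "1234 (my (odd) name) S 1 1234 1234 0 -1 4194304 100 0 0 0 250 75 0 0 20 0 1 0"
print(parse_stat_cpu_ticks(sample))  # → (250, 75)

if os.path.exists("/proc/self/task"):
    print(thread_cpu_seconds_from_proc())
```

Each call costs an open, a read, a close and a string parse, which is the overhead a single clock read avoids.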
furyofantares 24 days ago
> Flame graph image
> Click to zoom, open in a new tab for interactivity
I admit I did not expect "Open Image in New Tab" to do what it said on the tin. I guess I was aware that it was possible with SVG but I don't think I've ever seen it done and was really not expecting it.
- jerrinot 24 days ago
  Courtesy of Brendan Gregg and his flamegraph.pl scripts: https://github.com/brendangregg/FlameGraph
  Normally, I use the generator included in async-profiler. It produces interactive HTML. But for this post, I used Brendan’s tool specifically to have a single, interactive SVG.
  - IshKebab 23 days ago
    Note that pprof produces much fancier interactive flame graphs. I'm not sure they're a single SVG though.
    Also `samply` and the Firefox profiler are pretty fancy too.
    There's really no reason to use the original flamegraph scripts.
pjmlp 24 days ago
Which goes to show writing C, C++ or whatever systems language isn't automatically blazing fast, depending on what is being done.
Very interesting read.
higherhalf 24 days ago
clock_gettime() goes through vDSO, avoiding a context switch. It shows up on the flamegraph as well.
- jerrinot 24 days ago
  Only for some clocks (CLOCK_MONOTONIC, etc) and some clock sources. For VIRT/SCHED, the vDSO shim still has to invoke the actual syscall. You can't avoid the kernel transition when you need per-thread accounting.
  - touisteur 24 days ago
    Oh for some time after its introduction, CLOCK_MONOTONIC_RAW wasn't vDSO'd and it took some time and syscall profiling ('huh, why do I see these as syscalls in perf record -e syscalls' ...) to understand what was going on.
  - higherhalf 24 days ago
    Thanks, I really should've looked deeper than that.
    - jerrinot 24 days ago
      no problem at all, I was confused too when I saw the profile for the first time.
- ot 24 days ago
  If you look below the vDSO frame, there is still a syscall. I think that the vDSO implementation is missing a fast path for this particular clock id (it could be implemented though).
  - jerrinot 24 days ago
    Exactly this.
- a-dub 24 days ago
  edit: agh, no. CLOCK_THREAD_CPUTIME_ID falls through the vdso to the kernel which makes sense as it would likely need to look at the task struct.
  here it gets the task struct: https://elixir.bootlin.com/linux/v6.18.5/source/kernel/time/... and here https://elixir.bootlin.com/linux/v6.18.5/source/kernel/time/... to here where it actually pulls the value out: https://elixir.bootlin.com/linux/v6.18.5/source/kernel/sched...
  where here is the vdso clock pick logic https://elixir.bootlin.com/linux/v6.18.5/source/lib/vdso/get... and here is the fallback to the syscall if it's not a vdso clock https://elixir.bootlin.com/linux/v6.18.5/source/lib/vdso/get...
goodroot 24 days ago
The QuestDB team are among the best doing it.
Love the people and their software.
Great blog Jaromir!
burnt-resistor 24 days ago
I really wished™ there was an API/ABI for userland- and kernelland-defined individual virtual files at arbitrary locations, backed by processes and kernel modules respectively. I've tried pipes, overlays, and FUSE to no avail. It would greatly simplify configuration management implementations while maintaining compatibility with the convention of plain text files, and there's often no need to have an actual file on any media or the expense of IOPS.
While I don't particularly like the IO overhead and churn consequences of real files for performance metrics, I get the 9p-like appeal of treating the virtual fs as a DBMS/API/ABI.
otterley 24 days ago
It took seven years to address this concern following the initial bug report (2018). That seems like a lot, considering how instrumenting CPU time can be in the hot path for profiled code.
- loeg 24 days ago
  400x slower than 70ns is still only 28us. How often is the JVM calling this function?
  - otterley 24 days ago
    It depends. If you’re doing continuous profiling, it’d make a call to get the current time at every method entry and exit, each of which could then add a context switch. In an absolute sense it appears to be small, but it could really add up.
    This is what flame graphs are super helpful for, to see whether it’s really a problem or not.
    Also, remember that every extra moment running instructions is a lost opportunity to put the CPU to sleep, so this has energy efficiency impact as well.
    - singron 24 days ago
      If you are doing continuous profiling, you are probably using a low overhead stack sampling profiler rather than recording every method entry and exit.
      - otterley 24 days ago
        That's a fair point. It really depends. For example, if you're recording method run times via an observability SDK at full fidelity, this could be an issue.
    - loeg 24 days ago
      If it's calling it twice per function, that's enormously expensive and this is a major win.
  - u8080 24 days ago
    28us is still a solid amount of time
    - loeg 24 days ago
      If it's called once an hour, who cares?
      Even called every frame 60 times per second, it's only 0.2% of a 60 fps time budget.
      It's not a huge amount of time in absolute terms; only if it's relatively "hot."
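The back-of-the-envelope numbers in this subthread can be checked in a few lines. The 70ns and 400x figures come from the comments above. The "two calls per method" rate is the hypothetical worst case raised earlier, not a measured workload.

```python
fast_ns = 70        # per-call cost quoted above
slowdown = 400      # quoted performance gap

slow_us = fast_ns * slowdown / 1000        # 28.0 microseconds per call
frame_budget_us = 1_000_000 / 60           # one frame at 60 fps, about 16,667 us
frame_share = slow_us / frame_budget_us    # fraction of a frame for one slow call

# If every method entry and exit made one slow call, this many method
# invocations per second would consume an entire core on timing alone.
methods_per_core_second = 1_000_000 / (2 * slow_us)

print(f"{slow_us} us per call, {frame_share:.2%} of a frame, "
      f"{methods_per_core_second:.0f} methods/s saturates a core")
```

Both positions hold: one call per frame is negligible, while a call on every method boundary is ruinous.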
Ono-Sendai 24 days ago
"look, I'm sorry, but the rule is simple: if you made something 2x faster, you might have done something smart; if you made something 100x faster, you definitely just stopped doing something stupid"
https://x.com/rygorous/status/1271296834439282690
ee99ee 24 days ago
This is such a great writeup
squirrellous 24 days ago
Does anyone knowledgeable know whether it’s possible to drastically reduce the overhead of reading from procfs? IIUC everything in it is in-memory, so there’s no real reason reading some data should take on the order of 10us.
mgaunard 24 days ago
Obviously a vdso read is going to be significantly faster than a syscall switching to the kernel, writing serialized data to a buffer, switching back to userland, and parsing that data.
xthe 24 days ago
This is a great example of how a small change in the right place can outweigh years of incremental tuning.
- nomel 24 days ago
  I don't think I've ever seen less than 10x speedup after putting some effort into improving performance of "organic"/legacy code. It's always shocking how slow code can be before anyone complains.
amelius 24 days ago
It's kinda crazy the amount of plumbing required to get a few bits across the CPU.
tomiezhang 24 days ago
cool