If you had no idea what a restorable sequence is the takeaway is about halfway down the OP:
“This is why Linux now provides rseq() which is a much more enlightened solution. With restartable sequences, you actually can get rid of both the mutex and atomics, while the OS continues to fully abstract scheduling. The way it works is you advise the kernel whenever your program enters a critical section of code that you don't want interrupted. It's probably going to be maybe 10 assembly instructions tops. The first assembly opcode should be a move instruction that sets the rseq_cs field. The last instruction needs to be the thing that makes the modification to your global data structure. Think of it sort of like a really tiny database transaction. What makes it go fast, is that the bidirectional communication with the kernel happens via shared memory.”
That doesn't really explain it though, IMO. IIUC, it's a sequence of instructions that either runs to completion atomically or doesn't. If it is interrupted by anything the kernel jumps you to the abort/retry vector you set with a guarantee that the last instruction in the sequence was not executed.
(Based on my reading of the LWN article rwmj posted).
> it's a sequence of instructions that either runs to completion atomically or doesn't
The way I read it, it either runs to completion in one go, or gets restarted from the beginning. This means the sequence as a whole isn't executed atomically, as the already-executed instructions during an interrupt aren't rolled back.
It can be used to build atomic actions, but it is up to the developer to create a sequence of instructions where the very last instruction "commits" the entire operation, with the side-effects of partial execution being harmless.
Yes, it's either atomic or the last instruction is guaranteed not to have run. I made this a little harder to read by inserting another clause in the sentence.
Yes, the API contract isn't "don't interrupt me during this critical section" it's "if you have to interrupt me during this critical section, go to this recovery/restart code".
There is a time-slice extension feature in the works that's roughly "please let me finish this critical section before you interrupt me". But a hard guarantee that userspace code won't be interrupted is probably untenable in a preemptive multitasking system.
That’s clever — am I right to think it’s the intermediate solution between locks and full STM, implemented at the kernel level, and with zero abstraction cost?
It's in some sense a light form of STM. The key insight behind rseq(2) is that if the data is local to a given CPU the only way to get a race is if the kernel deschedules your program from that CPU at an inopportune time. If your operation can be aborted and restarted and the kernel has a mechanism to notify you when that needs to happen you can dispense with the overhead of "real" synchronization and just use a couple mov instructions to enter and exit the critical section.
Maybe I'm just getting old but the "if you don't spend $20,000 on a workstation you're going to be left behind like a dinosaur" at the top of this article is a huge turn off to reading any further. And I say that as someone who owns a workstation with more cores than the author's.
The author was also asking for money to buy a house in SF and travel on private planes like a few days ago..the donation must have really showed up if they are using 20k machines at home.
She bought the workstation at a discount (see bottom of TFA).
Also, it was pre-2025 (before she got her job at Gradient Canopy), so long before asking for donations.
Finally, if you read the actual donation request, you can see she is trying to make a living doing open source, and is being honest about what the money is going to. Why is that an issue?
There's quite a bit of a spectrum between "trying to make a living doing open source" and "asking for people to pay for a house in one of the most expensive cities in the country - plus a private jet. It's also quite grating to see it written like we should be grateful that we are even allowed to donate to her.
And if she's even half the genius she's claiming to be, why aren't the big tech companies in a bidding war over who get to pay her a million-dollar salary?
From what I've read of her in the past she seems to be a pretty damn good developer. But in the open source world those are a dime a dozen. If you want to make a living off of it you've got to market yourself, and this... isn't how you do that.
> From what I've read of her in the past she seems to be a pretty damn good developer. But in the open source world those are a dime a dozen
Not exactly. Very few people in recent decades have achieved anything comparable to αcτµαlly pδrταblε εxεcµταblε and Cosmopolitan libc - they're in the category of "that should not even be possible". Of course, Tunney's work doesn't touch Fabrice Bellard in terms of sheer breadth and impact, but they're arguably in the same category.
I was wondering if the author is joking, but after reading a bit more about the attribution drama, it seems rather a lack of reality check and reflection. If you plagiarize work, get called out on it, and then call this "harrassment", I don't know...
Your bad faith reading of that article says otherwise. It's bleedingly obvious that it's satire. Do you think people seriously ask for donations to fund a private airplane?
$20K of gear isn't that much if you're an independent developer, and if you're working for others as such in the US, and you're not a financial basket case, it's doable. She even says "It put me in the poor house for a few months" so she made sacrifices to get there. You can too, if you want to. Why the envy?
To the contrary, so you really seem to have a problem with her.
Go to the end. Those machines were provided by the companies discounted so she could continue her llm research. She's not saying "anyone who's anyone" has a 1024-core workstation these days.
If we're being overly generous, they're saying you need at least a raspberry pi? You can see a 3x improvement there, which shows the pattern works, and that's good enough for a dinosaur (this interpretation is easier to justify if you just skim the article... Which I did the first time)
But agreeing with you, I've done big optimization stuff for multicore servers (not as many cores, but same kind of work) and my workstation was something small with not even the same os. I don't need the big machine on my desk to understand the concepts. I just need the big machine to check my work. For me, that's always been a production machine, sometimes a production machine taken out of rotation for pre-validation before running on production load. I guess I should mention, I work on applications specifically, and libraries and kernels as it relates to making whatever my application is work better. I also don't have a problem with pinning threads to cpus... but my applications are usually one big program that fills the system. Someone writing a general purpose library has a harder time.
Of course, if you want to do this kind of work and you don't have your own production load, you're going to have to borrow, rent, or buy a big machine. It doesn't need to be your workstation though. I hate working with cloud nonsense, but if your tests are short, and you do the upfront work to make your images start fast, you can probably save a lot of money by renting spot instances when testing ... I don't know if you can do spot instances of bare metal though, so you're probably stuck with vm overhead.
Yeah, you can rent an equivalent workstation from AWS for under $10/hour (and that's the on demand price) so I don't think cost is a huge barrier to doing this sort of work. The language and listing the prices of the workstations down to the penny just strikes me as a rather unprofessional way to communicate.
She's clearly making a point about taking advantage of the optimization/algorithm she's pitching, and doesn't seem very serious. Alternatively, someone reading that as a serious claim is rather naive.
Restartable windows, or more generically introspection windows, are a really useful technique you can apply in any situation where you understand or control the sources of preemption. The earliest uses of this technique in operating systems that I am aware of are ~25 years old.
The key insight is that the preempter can introspect the program counter of the code being preempted (which is now stable since it was preempted) and act accordingly. The simplest mechanism is to reset their program counter if in a critical section. The more generic mechanism is to jump them to a supplied address. This allows you to do things like hard abort and more.
You can further remove the need for the preempter to understand the preempted code by having the preempted code create a self-introspection code snippet and supplying that with the program counter at preemption. So the preempter just vectors them to their own code which knows how to interpret its own state at any preemption point.
Yep, it is a fairly old technique with a lot of of general applicability beyond just allowing mutex elision for usage of per-core data structures amidst potential core migration. But apparently using your own expert knowledge and actually explaining things and describing generalizations is worthy of flagging these days.
The author is referring to false sharing (https://en.wikipedia.org/wiki/False_sharing). CPU caches operate at cache line granularity (typically 64 bytes) so writes to one part of the cache line can require synchronization with writes to non-overlapping parts of the same cache line. This can dramatically reduce performance when there are a large number of cores operating on the same cache line.
If you remove the 64 byte alignment (which forces each counter variable onto a separate cache line) from hitcounter-shard.c you ought to be able to see the performance difference for yourself.
The "CPU mutex" is just the cache coherency mechanism. If you shard your data to avoid triggering it as suggested, then yes, it's much faster.
EDIT: or maybe you're asking if introducing an explicit userspace mutex is better than a lockless algorithm with false sharing issues. The answer is that it's workload dependent but it definitely can be.
OP > The issue is this will likely go just as slow if not slower. The mere act of sharing the same 64-byte region of memory (a.k.a. cacheline) between multiple cores, causes the CPU internally to basically use a mutex, and chances are the CPU's internal mutexes aren't as good as the ones you've implemented in userspace.
The claim by OP is that "chances are" that userspace mutexes are better than CPU's internal mutexes. So either h/w guys are (for a first) lagging s/w folks and using outdated approaches to creating a mutex in hardware, OR, we somehow must use an inferior approach when implementing a mutex in a CPU, OR, ..
How is it possible that a hardware implementation of an algorithm could be slower than its software variant, and that in "userspace" and not even the kernel.
If you have two threads on different cores that write to the same cacheline, the CPU has to enforce write ordering. The way it does this is for one of the cores to acquire a write lock on the cacheline. The actual implementation of this is via some variant of the MESI protocol that results in the cacheline on one core going to an Exclusive state, while copies of the cacheline on every other core becomes Invalid.
MESI takes a non-trivial amount of time to run. It's usually mediated at the L3 cache, which is frequently clocked slower than the main core, so it's a non-trivial number of cycles for a MESI state transition to happen (think at least ~40 cycles).
Compare that with a cacheline that's already Exclusive on a core: Writing to this cacheline is a pure L1 cache write at ~1 cycle, no MESI involved, no L3 cache involved.
Note that the typical userspace space mutex (some memory location that's modified by compare-exchange) is implicitly relying on MESI whenever two cores race for the lock, or whenever a lock is release on one core and then taken on another, etc etc.
So if your userspace coherency can be handled via some sharding that avoids needing MESI to run, it can go much faster than relying on the CPU's "internal mutex" aka MESI.
To directly address "is it possible that a hardware implementation of an algorithm could be slower than its software variant": The point is that with RSEQ, you can write a _different_ algorithm that's faster than an implementation relying on hardware mediated MESI. Instead of having memory that's accessed for a bunch of different cores, there's a chunk of memory per core, and each core almost always writes only to it's allocated chunk of memory.
> The point is that with RSEQ, you can write a _different_ algorithm that's faster than an implementation relying on hardware mediated MESI.
Even without rseq(2) you're looking at a "different algorithm". For one, a proper userspace mutex allows the excluded threads to be descheduled and the core can be repurposed to run other tasks, rather than hammering away at the contended cache line.
I'm not a systems guy, but you're probably right. A mutex requires an atomic variable which triggers the same cache coherency mechanism under contention.
What Justine probably meant was an apples to oranges comparison: in userspace, you can sometimes add constraints that allow for a faster algorithm whereas the cpu generally has to assume the worst.
That being said, if you look through this thread you'll find similar issues with OP's phrasing. A fine engineer / hacker, but too much 4chan troll in my book.
“This is why Linux now provides rseq() which is a much more enlightened solution. With restartable sequences, you actually can get rid of both the mutex and atomics, while the OS continues to fully abstract scheduling. The way it works is you advise the kernel whenever your program enters a critical section of code that you don't want interrupted. It's probably going to be maybe 10 assembly instructions tops. The first assembly opcode should be a move instruction that sets the rseq_cs field. The last instruction needs to be the thing that makes the modification to your global data structure. Think of it sort of like a really tiny database transaction. What makes it go fast, is that the bidirectional communication with the kernel happens via shared memory.”
(Based on my reading of the LWN article rwmj posted).
The way I read it, it either runs to completion in one go, or gets restarted from the beginning. This means the sequence as a whole isn't executed atomically, as the already-executed instructions during an interrupt aren't rolled back.
It can be used to build atomic actions, but it is up to the developer to create a sequence of instructions where the very last instruction "commits" the entire operation, with the side-effects of partial execution being harmless.
There is a time-slice extension feature in the works that's roughly "please let me finish this critical section before you interrupt me". But a hard guarantee that userspace code won't be interrupted is probably untenable in a preemptive multitasking system.
https://github.com/compudj/librseq
This has helpers for common use cases like counters and linked lists. You shouldn't need to write assembly at all to use rseq in most applications.
Also, it was pre-2025 (before she got her job at Gradient Canopy), so long before asking for donations.
Finally, if you read the actual donation request, you can see she is trying to make a living doing open source, and is being honest about what the money is going to. Why is that an issue?
And if she's even half the genius she's claiming to be, why aren't the big tech companies in a bidding war over who get to pay her a million-dollar salary?
From what I've read of her in the past she seems to be a pretty damn good developer. But in the open source world those are a dime a dozen. If you want to make a living off of it you've got to market yourself, and this... isn't how you do that.
Not exactly. Very few people in recent decades have achieved anything comparable to αcτµαlly pδrταblε εxεcµταblε and Cosmopolitan libc - they're in the category of "that should not even be possible". Of course, Tunney's work doesn't touch Fabrice Bellard in terms of sheer breadth and impact, but they're arguably in the same category.
I understand why the author deleted it !
Your bad faith reading of that article says otherwise. It's bleedingly obvious that it's satire. Do you think people seriously ask for donations to fund a private airplane?
$20K of gear isn't that much if you're an independent developer, and if you're working for others as such in the US, and you're not a financial basket case, it's doable. She even says "It put me in the poor house for a few months" so she made sacrifices to get there. You can too, if you want to. Why the envy?
To the contrary, so you really seem to have a problem with her.
[references https://web.archive.org/web/20260529122658/https://justine.l... ]
Perhaps in the USA if you have a well-paid job at a MAMAA company ...
But agreeing with you, I've done big optimization stuff for multicore servers (not as many cores, but same kind of work) and my workstation was something small with not even the same os. I don't need the big machine on my desk to understand the concepts. I just need the big machine to check my work. For me, that's always been a production machine, sometimes a production machine taken out of rotation for pre-validation before running on production load. I guess I should mention, I work on applications specifically, and libraries and kernels as it relates to making whatever my application is work better. I also don't have a problem with pinning threads to cpus... but my applications are usually one big program that fills the system. Someone writing a general purpose library has a harder time.
Of course, if you want to do this kind of work and you don't have your own production load, you're going to have to borrow, rent, or buy a big machine. It doesn't need to be your workstation though. I hate working with cloud nonsense, but if your tests are short, and you do the upfront work to make your images start fast, you can probably save a lot of money by renting spot instances when testing ... I don't know if you can do spot instances of bare metal though, so you're probably stuck with vm overhead.
The key insight is that the preempter can introspect the program counter of the code being preempted (which is now stable since it was preempted) and act accordingly. The simplest mechanism is to reset their program counter if in a critical section. The more generic mechanism is to jump them to a supplied address. This allows you to do things like hard abort and more.
You can further remove the need for the preempter to understand the preempted code by having the preempted code create a self-introspection code snippet and supplying that with the program counter at preemption. So the preempter just vectors them to their own code which knows how to interpret its own state at any preemption point.
https://dl.acm.org/doi/abs/10.1145/512429.512451
Anyone with an informed opinion on this statement? It's seems counter intuitive (npi).
If you remove the 64 byte alignment (which forces each counter variable onto a separate cache line) from hitcounter-shard.c you ought to be able to see the performance difference for yourself.
The Q is: is it true the CPU mutexes are actually slower than those implemented in userspace?
EDIT: or maybe you're asking if introducing an explicit userspace mutex is better than a lockless algorithm with false sharing issues. The answer is that it's workload dependent but it definitely can be.
OP > The issue is this will likely go just as slow if not slower. The mere act of sharing the same 64-byte region of memory (a.k.a. cacheline) between multiple cores, causes the CPU internally to basically use a mutex, and chances are the CPU's internal mutexes aren't as good as the ones you've implemented in userspace.
The claim by OP is that "chances are" that userspace mutexes are better than CPU's internal mutexes. So either h/w guys are (for a first) lagging s/w folks and using outdated approaches to creating a mutex in hardware, OR, we somehow must use an inferior approach when implementing a mutex in a CPU, OR, ..
How is it possible that a hardware implementation of an algorithm could be slower than its software variant, and that in "userspace" and not even the kernel.
MESI takes a non-trivial amount of time to run. It's usually mediated at the L3 cache, which is frequently clocked slower than the main core, so it's a non-trivial number of cycles for a MESI state transition to happen (think at least ~40 cycles).
Compare that with a cacheline that's already Exclusive on a core: Writing to this cacheline is a pure L1 cache write at ~1 cycle, no MESI involved, no L3 cache involved.
Note that the typical userspace space mutex (some memory location that's modified by compare-exchange) is implicitly relying on MESI whenever two cores race for the lock, or whenever a lock is release on one core and then taken on another, etc etc.
So if your userspace coherency can be handled via some sharding that avoids needing MESI to run, it can go much faster than relying on the CPU's "internal mutex" aka MESI.
To directly address "is it possible that a hardware implementation of an algorithm could be slower than its software variant": The point is that with RSEQ, you can write a _different_ algorithm that's faster than an implementation relying on hardware mediated MESI. Instead of having memory that's accessed for a bunch of different cores, there's a chunk of memory per core, and each core almost always writes only to it's allocated chunk of memory.
Even without rseq(2) you're looking at a "different algorithm". For one, a proper userspace mutex allows the excluded threads to be descheduled and the core can be repurposed to run other tasks, rather than hammering away at the contended cache line.
What Justine probably meant was an apples to oranges comparison: in userspace, you can sometimes add constraints that allow for a faster algorithm whereas the cpu generally has to assume the worst.
That being said, if you look through this thread you'll find similar issues with OP's phrasing. A fine engineer / hacker, but too much 4chan troll in my book.