I have written a lot of SIMD for both x86 and ARM over many years and many microarchitectures. Every abstraction, including autovectorization, is universally pretty poor outside of narrow cases because they don’t (and mostly can’t) capture what is possible with intrinsics and their rather extreme variation across microarchitectures. If I want good results, I have to write intrinsics. No library can optimally generate non-trivial SIMD code. Neither can the compiler. Portability just amplifies this gap.
I think a legitimate criticism is that it is unclear who std::simd is for. People that don’t use SIMD today are unlikely to use std::simd tomorrow. At the same time, this does nothing for people that use SIMD for serious work. Who is expected to use this?
The intrinsics are not difficult but you do have to learn how the hardware works. This is true even if you are using a library. A good software engineer should have a rough understanding of this regardless.
For me the main issue is that if you're serious about SIMD, you need to use a state-of-the-art library and can't rely on some standard library whose quality is variable, unreliable, and which is by design always behind.
For some algorithms you have to compromise the data layout for compatibility across the widest number of microarchitectures by nerfing the performance on advanced SIMD microarchitectures working on the same data structures. There really isn’t a way to square that circle. You can make it portable or you can make it optimal, and the performance gap across those two implementations can be vast.
In the 15-20 years I’ve been doing it, I’ve seen zero evidence that there is a solution to this tradeoff. And people that are using SIMD are people that care about state-of-the-art performance, so portability takes a distant back seat.
> I think a legitimate criticism is that it is unclear who std::simd is for.
I think it's for people like me, who recognize that for some datasets a lot of performance is left on the table when you don't take advantage of SIMD, but who are not interested in becoming experts on intrinsics for a multitude of processor combinations.
Having a way to be able to say "flag bytes in this buffer matching one of these five characters, choose the appropriate stride for the actual CPU" and then "OR those flags together and do a popcount" (as I needed to do writing my own wc(1) as an exercise), and have that at least come close to optimal performance with intrinsics would be great.
Just like I'd rather use a ranged-for than hand-count an index against a size.
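To make the "flag, OR, popcount" idea concrete, here is a rough sketch of how it might look with the Parallelism TS v2 types in `<experimental/simd>` (the predecessor of std::simd, which spells some names differently). This is an illustration under assumptions, not the code from the wc(1) exercise: `count_flagged`, `is_flagged`, `kFlagged`, and the choice of five characters are all made up for the example, and it falls back to a plain scalar loop where that header is not available.

```cpp
// Sketch only. count_flagged, is_flagged and kFlagged are hypothetical names,
// and the five characters are an assumed whitespace set.
#include <cstddef>
#include <cstdint>

// Use the Parallelism TS v2 types where libstdc++ ships them (GCC 11+, C++17);
// otherwise fall back to the scalar loop so the sketch still builds.
#if defined(__GLIBCXX__) && !defined(__clang__) && __cplusplus >= 201703L
#  if __has_include(<experimental/simd>)
#    include <experimental/simd>
#    define SKETCH_HAVE_SIMD 1
#  endif
#endif

static const std::uint8_t kFlagged[5] = {' ', '\n', '\t', '\r', '\f'};

// Scalar reference: is this byte one of the five flagged characters?
static bool is_flagged(std::uint8_t b) {
    for (std::uint8_t c : kFlagged) {
        if (b == c) return true;
    }
    return false;
}

std::size_t count_flagged(const std::uint8_t* data, std::size_t n) {
    std::size_t count = 0;
    std::size_t i = 0;
#ifdef SKETCH_HAVE_SIMD
    namespace stdx = std::experimental;
    using V = stdx::native_simd<std::uint8_t>;  // width fixed by the compile target
    for (; i + V::size() <= n; i += V::size()) {
        V chunk;
        chunk.copy_from(data + i, stdx::element_aligned);
        V::mask_type hit(false);
        for (std::uint8_t c : kFlagged) {
            hit = hit || (chunk == V(c));  // OR the per-character flags together
        }
        count += static_cast<std::size_t>(stdx::popcount(hit));  // flagged lanes
    }
#endif
    for (; i < n; ++i) {  // scalar tail, or the whole buffer without SIMD
        if (is_flagged(data[i])) ++count;
    }
    return count;
}
```

Note that `native_simd` picks its width from the compile target, not from the CPU the binary ends up running on, so "choose the appropriate stride for the actual CPU" would still need some form of runtime dispatch on top of this.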
> People that don’t use SIMD today are unlikely to use std::simd tomorrow.
I mean, why not? That's exactly my use case. I don't use SIMD today as it's a PITA to do properly, despite advancements in glibc and binutils that make it easier to load CPU-specific code. And it's a PITA to differentiate the utility of hundreds of different vpaddcfoolol instructions. But it is legitimately important for improving performance for many workloads, so I don't want to miss it where it will help.
And even gaining 60-70% of the "optimal" SIMD still puts you much closer to the highest performance than the alternative.
In the end I did have to write some direct SIMD intrinsics. I forget what issue I ran into starting off with std::simd, but std::simd was what made that problem seem approachable for the first time.
> I think a legitimate criticism is that it is unclear who std::simd is for
It's for people that don't use SIMD today.
SIMD is hard, or at least nuanced and platform-dependent. To say that std::simd doesn't lower the learning curve is intellectually dishonest.
---
Despite the title, the article's primary criticism is that compilers' auto-vectorizers have improved more than the currently shipped stdlib version.
I made the first proposal to the C++ standard committee to introduce SIMD in 2011, before Matthias Kretz got involved with his own version (which is what became std::simd). This was based on what eventually became Eve (mentioned in the article).
Back then it was rejected, for the same reasons people are giving today, such as not mapping well to SVE, having a separate way to express control flow, etc.
There was a real alternative being considered at the time: integrating ISPC-like semantics natively in the language. Then that died out (I'm not sure why), and SIMD became trendy, so the committee was more open to doing something to show that they were keeping up with the times.
That certainly convinced me. When I was doing my taxes recently and had to watch those forced loading animations, I kept asking myself "why can't my compiler do this?" Thanks to std::simd, now it can!
[1]: https://github.com/NoNaeAbC/std_simd