Programming language speed comparison using Leibniz formula for π

(niklas-heer.github.io)

41 points | by PKop 54 days ago

22 comments

nmaludy 50 days ago
If I'm understanding the repository correctly, each language reads from a file, prints something to the console, computes the value, prints some more, and exits.
In my opinion, the comparisons could be better if the file I/O and console printing were removed.
[-]
- nheer 48 days ago
  Author here — yep, you understood the repository correctly.
  My Python version is a good example of the structure: read rounds.txt, run the loop, print the result, exit. I’m timing the whole program with hyperfine.
  I agree that for a “pure compute” microbenchmark you could remove file I/O and console output. I kept them mainly because:
  - It gives every language the same simple interface (same input, same output) and acts as a basic correctness/sanity check.
  - The benchmark runs 1 billion iterations. The file read and a single print happen once per run, so that overhead is tiny compared to the loop, and the results stay comparable in practice.
  That said, I’m not against a compute-only / quiet mode. Since hyperfine already handles timing externally, the real work is implementing and maintaining a consistent --quiet / --no-io variant across 50+ languages.
  If someone wants to contribute that (even starting with a subset), I’m happy to review PRs.
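  A minimal sketch of the structure described above (read rounds.txt, run the loop, print the result). It is not the repository's actual Python source; the names `leibniz_pi` and `main` are assumptions made for illustration:

```python
from pathlib import Path


def leibniz_pi(rounds: int) -> float:
    """Approximate pi with the Leibniz series 1 - 1/3 + 1/5 - ..."""
    pi = 1.0
    x = 1.0
    for i in range(2, rounds + 2):
        x = -x                   # alternate the sign of each term
        pi += x / (2 * i - 1)    # add +/- 1/(2i - 1)
    return pi * 4.0


def main(path: str = "rounds.txt") -> None:
    # The file holds a single integer, so the loop bound is only known at run time.
    rounds = int(Path(path).read_text().strip())
    print(leibniz_pi(rounds))
```

  Calling `main()` in a directory that contains rounds.txt performs the one file read and the one print mentioned above; everything else is the loop.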
- Twirrim 50 days ago
  I'm not sure why the contents of rounds.txt isn't just provided as some kind of argument instead of read in from a file. Given all the other boilerplate involved, I would have expected it to be trivial to add relevant templating.
  [-]
  - cgh 50 days ago
    Zig could include the file at compile-time with @embedFile.
    [-]
    - Twirrim 50 days ago
      So could Go (https://pkg.go.dev/embed), Rust (https://doc.rust-lang.org/std/macro.include_str.html) and a number of other languages in the set.
    - AndyKelley 50 days ago
      this would provide the optimizer the unfair chance to replace the entire application with a compile time constant
    - gus_massa 50 days ago
      The idea of using a file is to force the program to use that number as an unknown value at compile time.
      [-]
      - igouy 50 days ago
        command line arguments would be unknown at compile time.
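        A sketch of that alternative, assuming a hypothetical command-line interface where the round count is the first argument (the repository itself reads rounds.txt). The helper name `parse_rounds` is made up for this example:

```python
def parse_rounds(argv: list[str]) -> int:
    """Return the round count from argv[1]; the value is only known at run time."""
    if len(argv) != 2:
        raise SystemExit(f"usage: {argv[0]} ROUNDS")
    rounds = int(argv[1])
    if rounds < 0:
        raise SystemExit("ROUNDS must be non-negative")
    return rounds
```

        In a script this would be called as `parse_rounds(sys.argv)` after `import sys`.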
- gavinray 50 days ago
  I'm fairly sure I can speed the JVM implementations up a significant amount by MMAP'ing the file into memory and ensuring it's aligned.
  [-]
  - pjscott 50 days ago
    I'm not too familiar with the JVM so perhaps I'm missing something here: how would that help? The file is tiny, just a few bytes, so I'd expect the main slowdown to come from system call overhead. With non-mmap file I/O you've got the open/read/close trio, and only one read(2) should be needed, so that's three expensive trips into kernel space. With mmap, you've got open/stat/mmap/munmap/close.
    Memory-mapped I/O can be great in some circumstances, but a one-time read of a small file is one of the canonical examples for when it isn't worth the hassle and setup/teardown overhead.
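    A small sketch of the two approaches using Python's standard library rather than the JVM, only to make the extra setup and teardown visible. The function names are assumptions for illustration:

```python
import mmap


def read_plain(path: str) -> int:
    """open / read / close: one read is enough for a few bytes."""
    with open(path, "rb") as f:
        return int(f.read().decode().strip())


def read_mmap(path: str) -> int:
    """open / mmap / unmap / close: more steps for the same few bytes."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the file must not be empty.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return int(m[:].decode().strip())
```

    Both functions return the same integer; the mapped version just takes more kernel round trips to get there.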
  - neonsunset 49 days ago
    [dead]
forgotpwd16 54 days ago
Some observations:
- C++ unsurpassable king.
- There's a stark jump of times going from ~200ms to ~900ms. (Rust v1.92.0 being an in-between outlier.)
- C# gets massive boost (990->225ms) when using SIMD.
- But C++ somehow gets slower when using SIMD.
- Zig very fast*!
- Rust got big boost (630ms->230ms) upgrading v1.92.0->1.94.0.
- Nim (that compiles to C then native via GCC) somehow faster than GCC-compiled C.
- Julia keeps proving high-level languages can be fast too**.
- Swift gets faster when using SIMD but loses much accuracy.
- Go fastest language with own compiler (ie not dependent to GCC/LLVM).
- V (also compiles to C): I expected it, being similar, to be close to Nim.
- Odin (LLVM) & Ada (GCC) surprisingly slow. (Was expecting them to be close to Zig/Fortran.)
- Crystal slowest LLVM-based language.
- Pure CPython unsurpassable turtle.
Curious how D's reference compiler (DMD) compares to the LLVM/GCC front-ends, how LFortran to gfortran, and QBE to GCC/LLVM. Also would like to see Scala Native (Scala currently being inside the 900~1000ms bunch).
* Note that it uses `@setFloatMode(.Optimized)`, which according to the docs is equivalent to `-ffast-math`, but only D/Fortran use this flag (C/C++ do not).
** Uses `@fastmath` AND `@simd`. The comparison supposedly is for performance on idiomatic code and for Julia SIMD is a simple annotation applied to the loop (and Julia may even auto do it) but should still be noted because (as seen in C# example) it can be big.
[-]
- nheer 48 days ago
  Author here! Thanks for the detailed breakdown. Let me address a few points:
  - C++ SIMD being slower: The standard C++ uses i & 0x1 which lets the compiler auto-vectorize. With -O3 -ffast-math -march=native, gcc/clang do this really well. The explicit AVX2 version has overhead from manual vector setup and horizontal sum at the end. Modern compilers often beat hand-written SIMD for simple loops like this.
  - Zig fast-math: Correct. Line 5 has @setFloatMode(.optimized) with a comment saying "like C -ffast-math".
  - Julia: Also correct. Uses @fastmath @simd for - both annotations together.
  - Crystal/Odin/Ada being slow: All three use x = -x which creates a loop-carried dependency that blocks auto-vectorization. The fast implementations use the branchless i & 0x1 trick instead.
  - C# SIMD: Uses Vector512 doing 8 doubles per iteration. That explains the ~4x speedup.
  - Nim vs C: Both compile via gcc with similar flags. Probably just measurement variance.
  - Fortran: Interestingly does NOT use -ffast-math. Uses manual loop unrolling instead (processes 4 terms per iteration).
  - Go: You're right that it's the fastest with its own compiler. No LLVM/GCC backend, just Go's own SSA-based compiler.
  For suggestions - DMD, LFortran, and Scala Native would be great additions. PRs welcome!
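  A sketch of the two loop formulations mentioned above, written in Python purely to show that they compute the same terms. Python will not vectorize either one; the vectorization effect applies to the compiled implementations. The function names are assumptions for illustration:

```python
def leibniz_sign_flip(rounds: int) -> float:
    """Sign carried from one iteration to the next (x = -x)."""
    pi, x = 1.0, 1.0
    for i in range(2, rounds + 2):
        x = -x                       # depends on the previous iteration's x
        pi += x / (2 * i - 1)
    return pi * 4.0


def leibniz_parity(rounds: int) -> float:
    """Sign derived from the loop index alone (i & 0x1)."""
    pi = 1.0
    for i in range(2, rounds + 2):
        x = -1.0 + 2.0 * (i & 0x1)   # -1.0 for even i, +1.0 for odd i
        pi += x / (2 * i - 1)
    return pi * 4.0
```

  In the second version each term depends only on `i`, so a compiler is free to compute several terms at once; the only dependency left between iterations is the running sum.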
- vips7L 46 days ago
  > curious how D's reference compiler (DMD) compares to the LLVM/GCC front-ends
  On my machine (an old i7-8700), dmd performs rather poorly. 3.5 seconds.
```
    dmd -O leibniz.d
    Measure-Command { ./leibniz.exe } | select -expand TotalMilliseconds
    3535.823
```
  Comparatively ldc runs in 943 milliseconds:
```
    ldmd2 -O leibniz.d
    Measure-Command { ./leibniz.exe } | select -expand Totalmilliseconds
    943.21
```
  I'm sure there is compiler switch magic that I don't know about that could improve them.
- mrsmrtss 54 days ago
  Looking closer at the benchmarks, it seems that the C# benchmark is not using AOT, so Go and even Java GraalVM get an unfair advantage here (when looking at the non-SIMD versions). There is a non-trivial startup time for JIT.
  [-]
  - mrsmrtss 54 days ago
    Sorry, I can't seem to edit my answer anymore, but I was mistaken: the C# version is using AOT. But there are other significant differences here:
```
  var rounds = int.Parse(File.ReadAllText("rounds.txt"));

  var pi = 1.0D;
  var x = 1.0D;

  for (var i = 2; i < rounds + 2; i++) {
      x = -x;
      pi += x / (2 * i - 1);
  }

  pi *= 4;
  Console.WriteLine(pi);
```
    For example, if we change the type of 'rounds' variable here from int to double (like it is also in Go version), the code runs significantly faster on my machine.
    [-]
    - neonsunset 54 days ago
      Try that on ARM64 and the result will be the opposite :)
      On M4 Max, Go takes 0.982s to run while C# (non-SIMD) and F# are ~0.51s. Changing it to be closer to Go makes the performance worse in a similar manner.
- Aurornis 50 days ago
  Reading the repo, the benchmark includes the entire program execution from startup to reading the file.
  For the sub-second compiled languages, it's basically a benchmark of startup times, not performance in the hot loop.
  [-]
  - igouy 50 days ago
    How much difference would it make for these tiny programs?
    https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
  - nheer 48 days ago
    Author here. I just tested this with Zig by timing the segments inside the program:
```
  1 billion iterations:
  - Startup + file read: 0.01ms
  - Computation: ~200ms
  - Overhead: 0.01%
```
    Even at 1 million iterations (0.24ms total), startup is only 4% overhead. At 1 billion it's essentially zero.
    The benchmark is definitely measuring the hot loop, not startup time.
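    A sketch of what timing the segments inside the program can look like, in Python rather than Zig. The function name `timed_run` is an assumption, and this only sees time spent after the interpreter has started, so process startup itself is not captured:

```python
import time


def timed_run(path: str) -> tuple[float, float, float]:
    """Return (pi, setup_seconds, compute_seconds) for one run."""
    t0 = time.perf_counter()
    with open(path) as f:
        rounds = int(f.read().strip())
    t1 = time.perf_counter()         # end of file read

    pi, x = 1.0, 1.0
    for i in range(2, rounds + 2):
        x = -x
        pi += x / (2 * i - 1)
    t2 = time.perf_counter()         # end of hot loop
    return pi * 4.0, t1 - t0, t2 - t1
```

    The overhead figure is then `setup_seconds / (setup_seconds + compute_seconds)`.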
- neonsunset 54 days ago
  > Go fastest language with own compiler (ie not dependent to GCC/LLVM).
  C# is using CoreCLR/NativeAOT, which does not use GCC or LLVM either. Its compiler is more capable than that of Go.
Hizonner 50 days ago
This sort of thing is pretty meaningless unless the code is all written by people who know how to get performance out of their languages (and they're allowed to do so). Did you use the right representation of the data? Did you use the right library? Did you use the right optimization options? Did you choose the fast compiler or the slow one? Did you know there was a faster or slower one? If you're using fancy stuff, did you use it right?
I did the same sort of thing with the Sieve of Eratosthenes once, on a smaller scale. My Haskell and Python implementations varied by almost a factor of 4 (although you could argue that I changed the algorithm too much on the fastest Python one). OK, yes, all the Haskell ones were faster than the fastest Python one, and the C one was another 4 times faster than the fastest Haskell one... but they were still all over the place.
[-]
- ajross 50 days ago
  It's an extremely simple algorithm, just one loop with an iterated expression inside it. You can check the source code at: https://github.com/niklas-heer/speed-comparison/tree/master/...
  It's true this is a microbenchmark and not super informative about "Big Problems" (because nothing is). But it absolutely shows up code generation and interpretation performance in an interesting way.
  Note in particular the huge delta between rust 1.92 and nightly. I'm gonna guess that's down to the autovectorizer having a hole that the implementation slipped through, and they fixed it.
  [-]
  - pjscott 50 days ago
    The delta there is because the Rust 1.92 version uses the straightforward iterative code and the 1.94-nightly version explicitly uses std::simd vectorization. Compare the source code:
    https://github.com/niklas-heer/speed-comparison/blob/master/...
    https://github.com/niklas-heer/speed-comparison/blob/master/...
  - Hizonner 48 days ago
    Take their Haskell implementation. If I compile it unoptimized (with a slight change to hardwire the iteration count at one billion instead of reading it from a file), I get bored waiting for it to finish after several minutes of clock time. If I compile it with "-O3", it runs in 4.75 seconds (it's an old machine).
    Suppose I remove the strictness annotations (3 exclamation points, in places that aren't obvious to a naive programmer coming from almost any other language). If I then compile it unoptimized, it gets up to over 30GB of resident memory before I get bored (it's an old machine with a lot of memory). It would probably die with an out of memory error if I tried to run it to completion. However, if I compile that same modified code optimized, the compiler infers the strictness and the program runs in exactly the same time as it does with the annotations there. BUT it's far from obvious to the casual observer when the compiler can make those inferences and when it can't.
    I had ChatGPT rewrite the Haskell code to use unboxed numeric types. It ran in 1.5 seconds (the C version takes 1.27). The rewrite mostly consists of sprinkling "#" liberally throughout the code, but also requires using a few specialized functions. I had ChatGPT do it because I have never used unboxed types, and you could argue that they're not common idiom. However, anybody who actually wrote that kind of numerical code in Haskell on a regular basis would use unboxed types as a matter of course.
    So which one is the right time?
    [-]
    - ajross 48 days ago
      > So which one is the right time?
      You're acting like this is a gotcha, but the answer is obviously "all of them" and that indeed, this tells you interesting things about the behavior of your compiler. There are lots of variant scores in the linked article that reflect different ways of expressing the problem.
      But also, it tells you something about the limitations of your language too. For example, the biggest single reason that C/C++ (and languages like Fortran/Zig/D and sometimes C# and Rust whose code generation is isomorphic to them) sit at the top of the list is that they autovectorize. SIMD code isn't a common idiom either, but the compiler figures it out anyway.
      And apparently Haskell isn't capable of doing enough static analysis to fall back to an unboxed implementation (though given the simplicity of this code, that should be possible I think). That's not a "flaw" and it doesn't mean Haskell "loses", but it absolutely is an important thing to note. And charts like this show us where those holes lie.
      [-]
      - Hizonner 48 days ago
        > You're acting like this is a gotcha, but the answer is obviously "all of them" and that indeed, this tells you interesting things about the behavior of your compiler.
        They tell me interesting things if I know enough about the language to know the difference. It tells me things if I'm getting into the weeds with Haskell specifically. That doesn't make the big comparison chart useful in any way.
        I still don't know anything that lets me compare anything with any other language unless I actually know that language nearly as well. And I definitely don't get much out of a long list of languages, most of which I know not at all or at most at a "hello world" level, with only a couple of the entries tagged with even minimal information about compilers or their configurations at all. Especially when, on top of that, I don't know how much the person writing the test code knew.
        At most I get "this language does a pretty good/poor job on this type of task when given code that may or may not be what a 'native expert' would write.".
        And that's not news. Nobody (with any sophistication) would write that code for real in Python, or probably in Haskell either, because most seasoned programmers know that if you want speed on a task like that, you write it in a more traditional compiled procedural language. It's also not a kind of code that most people write to begin with. If you want an arctangent (which is really what it's doing), you use the library function, and the underlying implementation of that is either handcrafted C, or, more likely, a single, crafted CPU instruction with some call handling code wrapped around it.
        So what is the overall chart giving me that I can use?
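        For reference, the series being summed is the Taylor series of arctan evaluated at 1, which is why the result converges to π/4. A minimal sketch comparing a partial sum with the library function (the name `leibniz_partial_sum` is an assumption for illustration):

```python
import math


def leibniz_partial_sum(terms: int) -> float:
    """Sum of the first `terms` terms of 1 - 1/3 + 1/5 - ... (series for atan(1))."""
    return sum((-1.0) ** k / (2 * k + 1) for k in range(terms))


if __name__ == "__main__":
    print(leibniz_partial_sum(100_000))  # slow route: many terms, modest accuracy
    print(math.atan(1.0))                # library route: one call
```

        The truncation error of the partial sum shrinks only in proportion to 1/terms, which is why the benchmark needs a billion iterations to get about nine correct digits.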
        [-]
        - ajross 48 days ago
          > So what is the overall chart giving me that I can use?
          "If you write in C or an analog, your math will autovectorize nicely"
          "If you use a runtime with a managed heap, you're likely to take a penalty even on math stuff that doesn't look heap limited"
          "rust 1.92 is, surprisingly, well behind clang on autovectorizable code"
          I mean, I think that stuff is interesting.
          > If you want an arctangent (which is really what it's doing), you use the library function
          If you just want to call library functions, you're 100% at the mercy of whatever platform you picked[1]. So sure, don't look at benchmarks, they can only prove you wrong.
          [1] Picked without, I'll note, having looked carefully at cross-platform benchmarks before having made the choice! Because you inexplicably don't think they tell you anything useful.
  - Aurornis 50 days ago
    > Note in particular the huge delta between rust 1.92 and nightly. I'm gonna guess that's down to the autovectorizer having a hole that the implementation slipped through, and they fixed it.
    The benchmark also includes startup time, file I/O, and console printing. There could have been a one-time startup cost somewhere that got removed.
    The benchmark is not really testing the Leibniz loop performance for the very fast languages, it's testing startup, I/O, console printing, etc.
    [-]
    - igouy 50 days ago
      Let's compare some independent measurements with in-process measurements of tiny tiny nbody programs:
      https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
      How significant are the differences for such tiny tiny programs?

How I love pypy for certain tasks. On my laptop:

  ᐅ time uv run -p cpython-3.14 leibniz.py
  3.1415926525880504
  
  ________________________________________________________
  Executed in   38.24 secs    fish           external
     usr time   37.91 secs  158.00 micros   37.91 secs
     sys time    0.16 secs  724.00 micros    0.16 secs
  
  ᐅ time uv run -p pypy leibniz.py
  3.1415926525880504
  
  ________________________________________________________
  Executed in    1.52 secs    fish           external
     usr time    1.16 secs    0.25 millis    1.16 secs
     sys time    0.02 secs    1.29 millis    0.02 secs

It was a free 25x speedup.

[-]

eigenspace 46 days ago
Sure, but at the end of the day, you're still just polishing a turd, and you give up a LOT of ecosystem just to get a still deeply unimpressive benchmark time.

viktorcode 50 days ago
After seeing Swift's result I had to look into the source to confirm that yes, it was not written by someone who knows the language.
But these are good benchmark results that demonstrate what performance level you can expect from every language when someone not versed in it does the code porting. Fair play.
[-]
- nheer 48 days ago
  Author here. There are actually 3 Swift versions in the benchmark:
```
  - Swift (standard): 893ms
  - Swift (relaxed): 903ms (uses fast-math equivalent)
  - Swift (SIMD): 509ms (explicit SIMD4)
```
  The standard version uses x *= -1.0 which creates a loop-carried dependency that blocks auto-vectorization - same issue as Crystal, Odin, Ada. The SIMD version uses the branchless i & 0x1 trick and is ~1.75x faster.
  Fair point that someone versed in Swift would probably use the better pattern in the standard version too. PRs welcome to improve it! The goal was idiomatic-ish code, but I'm not an expert in all 40+ languages.
- igouy 50 days ago
  Even in verbose Java it's barely 20 lines.
  Makes the benchmarks game 100 lines seem like major apps.
  https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
  https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
- pizlonator 50 days ago
  What do you think they could have done better in the Swift code?
  [-]
  - viktorcode 50 days ago
    Using overflow operators instead of the ones that check for that each iteration.
    [-]
    - Someone 50 days ago
      Reading https://github.com/niklas-heer/speed-comparison/blob/master/..., I think the only overflow checks could be in
      for i in 2...rounds+2
      and I would hope/expect the compiler to be smart enough to know that it only has to check “rounds+2” once there. Swift isn’t exactly new anymore, and it’s supported by a large company.
      What do I overlook?
      [-]
      - pizlonator 49 days ago
        There’s no way overflow checks are responsible for that enormous speed difference from C
drob518 50 days ago
Startup time doesn’t seem to be factored in correctly, so any language that uses a bytecode (e.g. Java) or is compiling from source (e.g. Ruby, Python, etc.) will look poor on this. If the kinds of applications that you write are ones that exit after a fraction of a second, then sure, this will tell you something. But if you’re writing server apps that run for days/weeks/months, then this is useless.
[-]
- vhdd 50 days ago
  Python took 86 seconds, if I'm reading it correctly. I can see your point holding for a language like Java, but most of Python's time spent cannot have been startup time, but actual execution time.
  [-]
  - drob518 50 days ago
    Yea, for Python it’s also just slow (not the language, but CPython; you can do much better with PyPy).
Aurornis 50 days ago
Reading the fine print, the benchmark is not just the Leibniz formula like it says in the chart title. It also includes file I/O, startup time, and console printing:
> Why do you also count reading a file and printing the output?
> Because I think this is a more realistic scenario to compare speeds.
Which is fine, but should be noted more prominently. The startup time and console printing obviously aren't relevant for something like the Python run, but at the top of the chart where runs are a fraction of a second it probably accounts for a lot of the differences.
Running the inner loop 100 times over would have made the other effects negligible. As written, trying to measure millisecond differences between entire programs isn't really useful unless someone has a highly specific use case where they're re-running a program for fractions of a second instead of using a long-running process.
arohner 50 days ago
The Clojure version is not AOT'd, so it's measuring startup + compiler time. When properly compiled it should be comparable to the Java implementation.
nheer 48 days ago
Hi everyone, author here!
I’m genuinely blown away by all the interest in what started as a silly little experiment. The project grew way beyond its original scope. My initial curiosity was simply: how could you set up a pipeline to do automatic speed comparisons? I was less interested in the results as a definitive measure and more in the infrastructure challenge itself.
But then the interest kept growing. I tried to modernize things, but one thing became quite notable: the difference between a language that gets actively optimized by its community (like Julia) versus one that just sits there unoptimized is striking.
Honestly, I got overwhelmed. Managing all those implementations, keeping versions up to date, reviewing contributions—it was a lot. I basically tapped out for about a year.
Now I’m back, and with AI assistance, maintaining this has become much more realistic—updating versions, helping optimize implementations, etc. That said, I’m always happy to accept contributions from folks who know their languages better than I do.
Thank you all for your interest and the thoughtful discussion!
[-]
- igouy 48 days ago
  > how could you set up a pipeline to do automatic speed comparisons?
  This I like!
  > … actively optimized … versus one that just sits there unoptimized is striking.
  See N=50,000,000 nbody #1 #2 #3 #4 #5 jdk-23
  https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
  > a silly little experiment
  This I don't like.
  [-]
  - nheer 48 days ago
    Thank you for your comment.
    > > a silly little experiment
    > This I don't like.
    What do you mean by that?
    To explain what I meant: I knew that in the end this was a microbenchmark. It can certainly give you some clue about a language, but it doesn't give you the whole picture. In the end it tells you how good a language is (or can be) at loops and floating point math. That's what I meant. I hope that makes it clearer.
    [-]
    - igouy 48 days ago
      femto benchmark !
      Take kostya or hanabi1224 or attractivechaos or … as your starting point and do better.
      https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
      [-]
      - nheer 48 days ago
        Thanks for the links - the benchmarks game is a great resource and has been around for a long time.
        That said, I'm not sure what "do better" means here. This is an open source project I maintain in my spare time. If you see room for improvement, PRs are always welcome. That's how open source works - if something is useful to you and you want it improved, contribute.
        The project exists because some people find it useful. If it's not for you, that's fine too.
        [-]
        - igouy 47 days ago
          And if not useful perhaps entertaining.
andrepd 50 days ago
This is meaningless. The benchmarks are (1) run in github actions, (2) include file and console IO, and (3) are compiled with different compiler flags...
[-]
- tliltocatl 50 days ago
  It is meaningful as an indication of a realistic developer setup rather than a fine-tuned setup you'll only see in a HPC context.
  [-]
  - andrepd 50 days ago
    It's certainly not meaningful of anything in the first place, since it only tests codegen for a sequence of floating point calculations, which is hardly a representative workload for anything. Even if it were, using an unreliable virtualised environment only introduces noise, especially in the presence of syscalls... So it's worthless even if the entire purpose was to evaluate how fast floating point sums and divisions run in a CI server x)
  - pizlonator 50 days ago
    Exactly.
    Also, winners don’t make excuses.
    (Not even being snarky. You have to spiritually accept that as a fact if you are in the PL perf game.)
sph 50 days ago
C# wins hands down in the performance / lines of code metric.
There is very little superfluous or that cannot be inferred by the compiler here: https://github.com/niklas-heer/speed-comparison/blob/master/...
[-]
- forgotpwd16 50 days ago
  C# places quite low though (at ~1s). C# (SIMD) that you see towards the top is more complicated: https://github.com/niklas-heer/speed-comparison/blob/master/.... In your metric winner is Nim: https://github.com/niklas-heer/speed-comparison/blob/master/... (followed by Julia and D).
  [-]
  - mrsmrtss 50 days ago
    C# non-SIMD (the naive, non-optimized version) is in the same ballpark as other similar GC languages. Nim version is not some naive version either: it seems rather specially crafted so it can be vectorized, and it still loses to C# SIMD.
    [-]
    - forgotpwd16 50 days ago
      Loses? My comparison is regarding GP's metric perf/lines_of_code. Let m := perf/lines_of_code = 1/(t × lines_of_code) [highest is better], or to make comparison simpler*, m' := 1/m = t × lines_of_code [lowest is better]. Then**:
      Nim 1672 Julia 3012 D 3479 C# (SIMD) 5853 C# 8919
      >Nim version is not some naive version
      It's a direct translation of the formula, using `mod` rather than `x = -x`.
      *Rather than comparing numbers << 1. **No blank/comment lines, as cloc and similar tools count.
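      The metric above can be sketched as follows. The function names and the sample inputs in the usage note are placeholders, not measurements from the benchmark:

```python
def inverse_metric(time_ms: float, lines_of_code: int) -> float:
    """m' = t * lines_of_code; lower is better."""
    return time_ms * lines_of_code


def rank(entries: dict[str, tuple[float, int]]) -> list[tuple[str, float]]:
    """Sort languages by m', best (lowest) first."""
    scores = {name: inverse_metric(t, loc) for name, (t, loc) in entries.items()}
    return sorted(scores.items(), key=lambda item: item[1])
```

      For example, `rank({"A": (200.0, 10), "B": (100.0, 30)})` puts the hypothetical language A first: it is slower, but short enough to come out ahead on this metric.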
      [-]
      - neonsunset 49 days ago
        Nim "cheats" in a similar way C and C++ submissions do: -fno-signed-zeros -fno-trapping-math
        Although arguably these flags are more reasonable than allowing the use of -march=native.
        Also consider the inherent advantage popular languages have: you don't need to break out to a completely niche language, while achieving high performance. Saying this, this microbenchmark is naive and does not showcase realistic bottlenecks applications would face like how well-optimized standard library and popular frameworks are, whether the compiler deals with complexity and abstractions well, whether there are issues with multi-threaded scaling, etc etc. You can tell this by performance of dynamically typed languages - since all data is defined in scope of a single function, the compiler needs to do very little work and can hide the true cost of using something like Lua (LuaJIT).
        [-]
        archargelod 49 days ago
        > Nim "cheats" in a similar way C and C++ submissions do: -fno-signed-zeros -fno-trapping-math
        I don't see these flags in Nim compilation config. The only extra option used is "-march=native"[0].
        [0] https://github.com/niklas-heer/speed-comparison/blob/9681e8e...
        [-]
        neonsunset 49 days ago
        https://github.com/niklas-heer/speed-comparison/blob/9681e8e...
        forgotpwd16 49 days ago
        Per the rules[0]: "Use idiomatic code for the language. Compiler optimizations flags are fine."
        Agree with the rest of your comment.
        [0]: https://github.com/niklas-heer/speed-comparison#rules

gus_massa 50 days ago

It would be nice to have a different color for languages that give the exact IEEE 754 result. Other languages can be using some kind of fast math. Sorted by accuracy:

   #   Language          Accuracy
  14   Swift (SIMD)          8.69
   9   Fortran 90            9.44
   2   C# (SIMD)             9.49
   .   [All the others]      9.50

[-]

pklausler 50 days ago

Is "accuracy" defined anywhere on that page?

[-]

gus_massa 49 days ago

From https://github.com/niklas-heer/speed-comparison/blob/master/...

  # [I removed the error checking]
  import math

  def pi_accuracy(value_str):
      """Calculate accuracy of computed pi value.
      Returns the number of correct decimal places (higher is better).
      """
      value = float(value_str)
      # math.pi is available in MicroPython
      # relative error of the computed value with respect to math.pi
      accuracy = 1 - (value / math.pi)
      return -math.log10(abs(accuracy))
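
As a worked example, here is the same calculation applied to the value printed by the CPython and PyPy runs elsewhere in this thread (3.1415926525880504). This is a self-contained sketch that assumes those runs used the benchmark's usual round count:

```python
import math


def pi_accuracy(value_str: str) -> float:
    """Number of correct decimal places, following the function quoted above."""
    value = float(value_str)
    return -math.log10(abs(1 - value / math.pi))


# Output of the plain (non-fast-math) Python implementation from this thread.
print(round(pi_accuracy("3.1415926525880504"), 2))
```

The result comes out at roughly 9.50, in line with the "All the others" row of the table.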

henning 50 days ago
Some implementations seem vectorization-friendly like the C one that uses a bit-twiddling trick to avoid the `x = -x` line that the Odin implementation and others have.
When you put these programs into Godbolt to see what's going on with them, so much of the code is just the I/O part that it's annoying to analyze

igouy 50 days ago

iirc Last time I looked at this:

    public static void main(String[] args) throws FileNotFoundException {
        Scanner s = new Scanner(new File("rounds.txt"));
        long rounds = s.nextLong();
        s.close();

        double sum = 0.0;
        double flip = -1.0;
        for (long i = 1; i <= rounds; i++) {
            flip *= -1.0;
            sum += flip / (2 * i - 1);
        }

        System.out.println(sum * 4.0);
    }

The measurements changed dramatically if the order was switched to something like:

            sum += flip / (2 * i - 1); 
            flip *= -1.0;

YMMV

[-]

vips7L 46 days ago
IIRC Scanner is also terribly slow.

kiriberty 50 days ago
And the winner is (Drumroll)... Python - the most popular language in the AI world
[-]
- empiricus 50 days ago
  Well, python for AI is just the syntactic sugar to call pytorch cuda code on the gpu.
klaff 50 days ago
I think I get why C++ thru C are all similar (all compile to similar assembly?), but I don't get why Go thru maybe Racket are all in what looks like a pretty narrow clump. Is there a common element there?
[-]
- pjscott 50 days ago
  The common element is that they're written with the most obvious version of the code, while the ones in the faster bucket are either explicitly vectorized or written in non-obvious ways to help the compiler auto-vectorize. For example, consider the Objective C version of the loop in leibniz.m:
```
  for (long i = 2; i <= rounds + 2; i++) {
      x *= -1.0;
      pi += x / (2.0 * i - 1.0);
  }
```
  With my older version of Clang, the resulting assembly at -O3 isn't vectorized. Now look at the C version in leibniz.c:
```
  rounds += 2u; // do this outside the loop
  for (unsigned i=2u; i < rounds; ++i) // use ++i instead of i++
  {
      double x = -1.0 + 2.0 * (i & 0x1); // allows vectorization
      pi += (x / (2u * i - 1u)); // double / unsigned = double
  }
```
  This produces vectorized code when I compile it. When I replace the Objective C loop with that code, the compiler also produces vectorized code.
  You see something similar in the other kings-of-speed languages. Zig? It's the C code ported directly to a different syntax. D? Exact same. Fortran 90? Slightly different, but still obviously written with compiler vectorization in mind.
  (For what it's worth, the trunk version of Clang is able to auto-vectorize either version of the loop without help.)
- ajross 50 days ago
  I think it's SIMD generation. Managed runtimes have a much harder time autovectorizing, because you can't do any static analysis about things like array sizes. Note that the true low-level tools are all clustered around 200-300ms, and that the next level up are all the "managed" runtimes around 1-2s.
  The one exception is sort of an exception that proves the rule: it's marked "C# (SIMD)", and looks like a native compiler and not a managed one.
- Someone 50 days ago
  They’re measuring program execution time, including program startup and teardown. Languages with a more complex runtime take longer to start up, and they all seem to have optimized that about equally well.
- f1shy 50 days ago
  Some features some of those languages have:
  - run bytecode
  - very high level
  - GC memory
  But not all have these traits. Not sure.
amelius 50 days ago
That first big jump in the graph, I thought that it must be the divide between auto-gc'd and non auto-gc'd languages. But then I noticed that Rust is on the wrong side of the divide.
Qem 50 days ago
It appears the Raku runtime has improved a lot. It used to finish last in comparisons like this, by a large margin, and now it is surpassing Perl and CPython.
theanonymousone 54 days ago
Is there an explanation for why C is slower than C++?
[-]
- tliltocatl 50 days ago
  The code looks 100% identical except for the namespace prefixes. Must be something particular about the GitHub setup, because on mine (gcc15.2.1/clang20.1.8/Ryzen5600X) the run time is indistinguishably close. Interestingly, with default flags but -O3 clang is 30% slower, with flags from the script (-s -static -flto $MARCH_FLAG -mtune=native -fomit-frame-pointer -fno-signed-zeros -fno-trapping-math -fassociative-math) clang is a bit faster.
  A nitpick is that benchmarking C/C++ with $MARCH_FLAG -mtune=native and math magic is kinda unfair for Zig/Julia (Nim seems to support those) - unless you are running Gentoo it's unlikely to be used for real applications.
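  For anyone wondering what the "math magic" flags buy: as far as I understand the GCC docs, -fassociative-math only takes effect together with -fno-signed-zeros and -fno-trapping-math, and what it permits is reordering a floating-point sum. A vectorized reduction needs exactly that, because each SIMD lane keeps its own partial sum. Below is a rough hand-written sketch of the idea, with four accumulators standing in for four lanes. The function names and the 0-based term indexing are mine, not from the repository, and this is not what the compiler literally emits.
```c
  /* Serial sum: one accumulator, terms added strictly in order. */
  double leibniz_serial(unsigned n_terms) {
      double sum = 0.0;
      for (unsigned k = 0u; k < n_terms; ++k) {
          double sign = 1.0 - 2.0 * (k & 1u); /* +1 for even k, -1 for odd k */
          sum += sign / (2.0 * k + 1.0);
      }
      return 4.0 * sum;
  }

  /* Reassociated sum: four independent accumulators combined at the end,
     mimicking by hand the shape of a 4-lane vectorized reduction. */
  double leibniz_four_lanes(unsigned n_terms) {
      double acc[4] = {0.0, 0.0, 0.0, 0.0};
      unsigned k = 0u;
      for (; k + 4u <= n_terms; k += 4u) {
          for (unsigned lane = 0u; lane < 4u; ++lane) {
              unsigned j = k + lane;
              double sign = 1.0 - 2.0 * (j & 1u);
              acc[lane] += sign / (2.0 * j + 1.0);
          }
      }
      double sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
      for (; k < n_terms; ++k) { /* leftover terms when n_terms % 4 != 0 */
          double sign = 1.0 - 2.0 * (k & 1u);
          sum += sign / (2.0 * k + 1.0);
      }
      return 4.0 * sum;
  }
```
  The two results agree only up to rounding, not necessarily bit for bit, which is why the compiler won't make this transformation on its own under strict IEEE semantics.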
  [-]
  - AlotOfReading 50 days ago
    The actual assembly generated for the hot loop is identical in both C and C++ on Clang, as you'd expect. It's also identical at the IR level.
- AlotOfReading 50 days ago
  It's probably down to the measurement noise of benchmarking on GitHub Actions.
  [-]
  - drob518 50 days ago
    I suspect this is it. Any benchmark that takes less than a second to run should have its iteration count increased such that it takes at least a second, and preferably 5+ seconds, to run. Otherwise CPU scheduling, network processing, etc. are perturbing everything.
    [-]
    - igouy 50 days ago
      What if instead we measured with …
      BenchExec "uses the cgroups feature of the Linux kernel to correctly handle groups of processes and uses Linux user namespaces to create a container that restricts interference of [each program] with the benchmarking host."
      https://github.com/sosy-lab/benchexec
      [-]
      - drob518 50 days ago
        Certainly better, but you’re always going to be better off maximizing the runtime to a level where it just swamps any of the other effects. Then do multiple runs and take an average.
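        A minimal in-process sketch of "do multiple runs and take an average", assuming a C11 toolchain with timespec_get. The names time_once, mean and mean_runtime are made up for illustration, and the 64-sample cap is arbitrary. hyperfine already does the repeat-and-average part externally, so this is only the idea spelled out, not a replacement.
```c
        #include <stddef.h>
        #include <time.h>

        /* Wall-clock seconds for one call of fn, via C11 timespec_get.
           TIME_UTC is not a monotonic clock, so treat this as a sketch. */
        double time_once(void (*fn)(void)) {
            struct timespec t0, t1;
            timespec_get(&t0, TIME_UTC);
            fn();
            timespec_get(&t1, TIME_UTC);
            return (double)(t1.tv_sec - t0.tv_sec)
                 + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
        }

        /* Arithmetic mean of n samples; returns 0.0 for an empty set. */
        double mean(const double *samples, size_t n) {
            if (n == 0) return 0.0;
            double total = 0.0;
            for (size_t i = 0; i < n; ++i) total += samples[i];
            return total / (double)n;
        }

        /* Time fn `runs` times (capped at 64) and return the mean in seconds. */
        double mean_runtime(void (*fn)(void), size_t runs) {
            double samples[64];
            if (runs > 64) runs = 64;
            for (size_t i = 0; i < runs; ++i) samples[i] = time_once(fn);
            return mean(samples, runs);
        }
```
        The longer each individual run, the less a stray scheduler hiccup moves the mean, which is the point about swamping the other effects.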
- mutkach 50 days ago
  Probably LLVM runs different sets of optimization passes for C and C++. Need to look at the IR, or assembly to know exactly what happens.
  [-]
  - pizlonator 50 days ago
    It doesn’t as far as I know.
    (I have spent a good amount of time hacking the llvm pass pipeline for my personal project so if there was a significant difference I probably would have seen it by now)
    [-]
    - mutkach 50 days ago
      You are correct, that was an uneducated guess on my part.
      I just glanced at the IR which was different for some attributes (nounwind vs mustprogress norecurse), but the resulting assembly is 100% identical for every optimization level.
dvh 50 days ago
Python: how much is pi?
Swift: 3.7
Python: that's incorrect!
Swift: yeah, but it's fast!
xnacly 50 days ago
The Rust example is so far off from being useful, and file I/O seems completely dumb in this context.
[-]
- pizlonator 50 days ago
  Real programs have to do IO and the C and C++ code runs faster while also doing IO.
  What do you think they could have done better assuming that the IO is a necessary part of the benchmark?
  Also good job to the Rust devs for making the benchmark so much faster in nightly. I wonder what they did.
  [-]
  - igouy 50 days ago
    > Assuming that the IO is a necessary part of the benchmark …
    So make it significant: move gigabytes, like the benchmarks game's reverse-complement & fasta & …
    https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
  - Aurornis 50 days ago
    The file I/O is probably irrelevant, but the startup time is not.
    The differences among the really fast languages are probably in different startup times if I had to guess.
    [-]
    - igouy 49 days ago
      What if we try to measure rather than guess :-)
      https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
    - pizlonator 50 days ago
      > differences among the really fast languages are probably in different startup times
      Startup times matter a great deal.
systems 50 days ago
Why is OCaml so low? Didn't expect this.
[-]
- pjscott 50 days ago
  As with all the ahead-of-time compiled languages that I checked, the answer is that it generates non-SIMD code for the hot loop. The assembly code I see in godbolt.org isn't bad at all; the compiler just didn't do anything super clever.