100M-Row Challenge with PHP

(github.com)

76 points | by brentroose 4 hours ago

6 comments

  • brentroose 4 hours ago
    A month ago, I went on a performance quest trying to optimize a PHP script that took 5 days to run. With the help of many talented developers, I eventually got it to run in under 30 seconds. This optimization process was so much fun, and so many people pitched in with their ideas, that I eventually decided I wanted to do something more.

    That's why I built a performance challenge for the PHP community.

    The goal of this challenge is to parse 100 million rows of data with PHP, as efficiently as possible. The challenge will run for about two weeks, and at the end there are some prizes for the best entries (among the prizes is the very sought-after PhpStorm Elephpant, of which we only have a handful left).
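
    The general shape of an entry might look something like the streaming aggregation below. This is only an illustrative sketch — the actual row format and required output are defined by the challenge repo; the `station;value` format here is an assumption borrowed from the original 1BRC-style challenges:

    ```php
    <?php
    // Sketch: stream a large file line by line instead of loading it
    // into memory, aggregating min/max/sum/count per station.
    // The "station;value" row format is assumed, not confirmed.

    function aggregate(string $path): array
    {
        $stats = [];
        $handle = fopen($path, 'rb');

        while (($line = fgets($handle)) !== false) {
            $pos = strrpos($line, ';');
            if ($pos === false) {
                continue; // skip malformed lines
            }

            $station = substr($line, 0, $pos);
            $value = (float) substr($line, $pos + 1);

            if (! isset($stats[$station])) {
                $stats[$station] = ['min' => $value, 'max' => $value, 'sum' => $value, 'count' => 1];
            } else {
                $s = &$stats[$station];
                $s['min'] = min($s['min'], $value);
                $s['max'] = max($s['max'], $value);
                $s['sum'] += $value;
                $s['count']++;
            }
        }

        fclose($handle);

        return $stats;
    }
    ```

    At 100M rows, the real contest is presumably in avoiding per-line allocation and function-call overhead, but the streaming structure stays the same.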

    I hope people will have fun with it :)

    • Tade0 1 hour ago
      Pitch this to whoever is in charge of performance at Wordpress.

      A Wordpress instance will happily take over 20 seconds to fully load if you disable cache.

      • embedding-shape 1 hour ago
        Microbenchmarks are very different from optimizing performance in real applications in wide use, though. Someone could do great on this specific benchmark and still have no clue how to actually make something large like Wordpress perform OK out of the box.
      • monkey_monkey 15 minutes ago
        That's often a skill issue.
    • user3939382 1 hour ago
      exec('c program that does the parsing');

      Where do I get my prize? ;)

      • brentroose 1 hour ago
        The FAQ states that solutions like FFI are not allowed because the goal is to solve it with PHP :)
        • kpcyrd 31 minutes ago
          What about using the filesystem as an optimized dict implementation?
    • gib444 1 hour ago
      > A month ago, I went on a performance quest trying to optimize a PHP script that took 5 days to run. Together with the help of many talented developers, I eventually got it to run in under 30 seconds

      That's a huge improvement! How much was low hanging fruit unrelated to the PHP interpreter itself, out of curiosity? (E.g. parallelism, faster SQL queries etc)

      • brentroose 1 hour ago
        Almost all, actually. I wrote about it here: https://stitcher.io/blog/11-million-rows-in-seconds

        A couple of things I did:

        - Cursor-based pagination
        - Combining insert statements
        - Using database transactions to prevent fsync calls
        - Moving calculations from the database to PHP
        - Avoiding serialization where possible
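
        Two of those items — combining insert statements and wrapping the work in a transaction — can be sketched with PDO. This uses an in-memory SQLite database purely for illustration (the original project used a different database and batch size; both are assumptions here), but the pattern is the same for MySQL or Postgres, where a single transaction avoids an fsync per row:

        ```php
        <?php
        // Sketch: batched multi-row INSERT inside one transaction,
        // instead of one INSERT (and one implicit commit) per row.

        $pdo = new PDO('sqlite::memory:');
        $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
        $pdo->exec('CREATE TABLE measurements (station TEXT, value REAL)');

        $rows = [['Oslo', 3.5], ['Lima', 20.0], ['Oslo', -1.5]];
        $batchSize = 500; // hypothetical batch size

        $pdo->beginTransaction();

        foreach (array_chunk($rows, $batchSize) as $chunk) {
            // One statement with many value tuples: (?, ?), (?, ?), ...
            $placeholders = implode(', ', array_fill(0, count($chunk), '(?, ?)'));
            $stmt = $pdo->prepare("INSERT INTO measurements (station, value) VALUES $placeholders");
            $stmt->execute(array_merge(...$chunk)); // flatten the chunk into positional params
        }

        $pdo->commit();
        ```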

        • tiffanyh 1 hour ago
          Aren’t these optimizations less about PHP, and more about optimizing how you’re using the database?
          • hu3 42 minutes ago
            It's still valid as an example to the language community of how to apply these optimizations.
          • swasheck 39 minutes ago
            in all my years doing database tuning/admin/reliability/etc, performance issues have overwhelmingly been in the bad-query/bad-data-pattern categories. the data platform is rarely the issue
  • pxtail 2 hours ago
    Side note - I wasn't aware that there's an active collectors' scene for Elephpants, awesome!

    https://elephpant.me/

    • t1234s 1 hour ago
      Elephpants should be for second and third place. First place should be the double-clawed hammer.
    • thih9 59 minutes ago
      Excellent project. My favorites: the joker, php storm, phplashy, Molly.
  • tveita 1 hour ago
    > Also, the generator will use a seeded randomizer so that, for local development, you work on the same dataset as others

    Except that the generator script generates dates relative to time() ?

  • Retr0id 1 hour ago
    How large is a sample 100M row file in bytes? (I tried to run the generator locally but my php is not bleeding-edge enough)
  • spiderfarmer 2 hours ago
    Awesome. I’ll be following this. I’ll probably learn a ton.
  • wangzhongwang 1 hour ago
    [dead]