Notes on Sorted Data

(amit.prasad.me)

49 points | by surprisetalk 6 days ago

3 comments

  • kristianp 1 hour ago
    I find this approach strange. Surely if you're sorting data, don't do it naively by raw bytes. Use your knowledge of the data type and sort using the appropriate comparison operations of the data type. E.g. sorting by 64 bit ints little endian, do the comparison based on that. If using packed numbers, they must be unpacked first.
    • amitprasad 42 minutes ago
      Generic data stores often don’t have this luxury — if you’re designing a system in which data is relatively opaque, you’re often forced to work with bytes. (e.g. rocksdb, etc)
  • amitprasad 2 hours ago
    Unexpected seeing this posted here.

    I wrote this post mostly out of interest for a personal project and thus it's not actually a very holistic exploration of the topic. May revisit and update it in the future :)

  • Rakshath_1 5 hours ago
    This is a really solid deep-dive. I like how you move from this seems obviouscases (ints, strings) into the subtle edge cases where ordering quietly breaks and then show practical encodings that actually work in byte-lex order. The examples make the pitfalls very concrete, especially the varint and tuple sections. Nice balance between theory and systems-level pragmatism
    • Sesse__ 3 hours ago
      In contrast, I found it rather lacking. No discussion of the most common way to sort floats as bytes (shift the sign bit down and XOR the other 31 bits with the resulting masks), nor NaNs and +/-0 for that matter. Varint sorting introduces its own homegrown serialization but doesn't discuss the issue of overlong encodings. Nothing about string collation or Unicode issues in general. Composite data suggests adding NULs, but what if there are NULs in the actual data? (It is briefly mentioned, but only as in “you can't”.)
      • amitprasad 2 hours ago
        author here -- agreed on all fronts. Mentioned this in the other comment but I approached the topic from a relatively narrow perspective (I was working on a specific project at the time)

        I think it's worth including these things in a future update to the post, but I didn't have the time / need to explore it back then.

        In the meantime, I'd point to the following post on Unicode that remains very nice to read >20 years later: https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

        • Sesse__ 2 hours ago
          Sure, not all posts want to be the end-all of the topic. :-)

          And yes, the Joel post is a good introduction, if a bit old by now. Notably, of course, it doesn't say anything about _processing_ Unicode text. (E.g., don't sort by code point, don't break in the middle of a grapheme cluster, etc. etc.) But I believe that this is outside the scope of his intention.