Notes on Sorted Data

(amit.prasad.me)

54 points | by surprisetalk 58 days ago

3 comments

kristianp 52 days ago
I find this approach strange. Surely if you're sorting data, don't do it naively by raw bytes. Use your knowledge of the data type and sort using the appropriate comparison operations of the data type. E.g. sorting by 64 bit ints little endian, do the comparison based on that. If using packed numbers, they must be unpacked first.
[-]
- amitprasad 52 days ago
  Generic data stores often don’t have this luxury — if you’re designing a system in which data is relatively opaque, you’re often forced to work with bytes. (e.g. rocksdb, etc)
amitprasad 52 days ago
Unexpected seeing this posted here.
I wrote this post mostly out of interest for a personal project and thus it's not actually a very holistic exploration of the topic. May revisit and update it in the future :)
Rakshath_1 52 days ago
[flagged]
[-]
- Sesse__ 52 days ago
  In contrast, I found it rather lacking. No discussion of the most common way to sort floats as bytes (shift the sign bit down and XOR the other 31 bits with the resulting masks), nor NaNs and +/-0 for that matter. Varint sorting introduces its own homegrown serialization but doesn't discuss the issue of overlong encodings. Nothing about string collation or Unicode issues in general. Composite data suggests adding NULs, but what if there are NULs in the actual data? (It is briefly mentioned, but only as in “you can't”.)
  [-]
  - amitprasad 52 days ago
    author here -- agreed on all fronts. Mentioned this in the other comment but I approached the topic from a relatively narrow perspective (I was working on a specific project at the time)
    I think it's worth including these things in a future update to the post, but I didn't have the time / need to explore it back then.
    In the meantime, I'd point to the following post on Unicode that remains very nice to read >20 years later: https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...
    [-]
    - Sesse__ 52 days ago
      Sure, not all posts want to be the end-all of the topic. :-)
      And yes, the Joel post is a good introduction, if a bit old by now. Notably, of course, it doesn't say anything about _processing_ Unicode text. (E.g., don't sort by code point, don't break in the middle of a grapheme cluster, etc. etc.) But I believe that this is outside the scope of his intention.