TPSV, an Alternative to TSV (and CSV)

(chtenb.dev)

29 points | by ctenb 8 hours ago

11 comments

  • montroser 4 hours ago
    This is pretty under-specified...

    > A cell starts with | and ends with one or more tabs.

        |one\t|two|three
    
    How many cells is this? Seems like just one, with garbage at the end, since there are no closing tabs after the first cell? Should this line count as a valid row?

    > A line that starts with a cell is a row. Any other lines are ignored.

    Well, I guess it counts. Either way, how should one encode a value containing a tab followed by a pipe?

    • jasonthorsness 3 hours ago
      The spec says the last cell does not need to end in a tab, so this would be two cells IMO
    • bvrmn 3 hours ago
      I think the spec tries and fails to translate the code implementation into human language. In the code, the cell separator is `\t|`.
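
A minimal sketch of the reading discussed in this subthread, assuming (as bvrmn suggests) that a cell boundary is one or more tabs immediately followed by `|`, and that the last cell needs no closing tab. The name `parse_row` is hypothetical and this is not the spec's reference implementation.

```python
import re

# Assumption: a cell boundary is one or more tabs immediately followed by "|".
CELL_BOUNDARY = re.compile(r"\t+\|")

def parse_row(line):
    """Return the list of cells, or None if the line is not a row."""
    line = line.rstrip("\n")
    if not line.startswith("|"):
        return None  # per the quoted rule, any other line is ignored
    # Drop the leading "|" and split at every boundary.
    cells = CELL_BOUNDARY.split(line[1:])
    # The last cell may or may not carry closing tabs; trim them.
    cells[-1] = cells[-1].rstrip("\t")
    return cells

print(parse_row("|one\t|two|three"))  # ['one', 'two|three']
print(parse_row("# just a comment"))  # None
```

Under this reading the example line has two cells, and a value containing a tab followed by a pipe cannot be written without being split.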
  • Hashex129542 4 hours ago
    We need binary formats. In this era we are capable of it. Throw away the text formats.
    • zzo38computer 1 hour ago
      Which formats are helpful can depend on the use. I think DER (which is a binary format) is not so bad (although I added a few additional types (such as key/value list, BCD string, and TRON string), but not all uses are required to use them). I had also made up Multi-DER, which is simply any number of DER concatenated together (there are formats of JSON like that too). (I had also made up TER which is a text format and a program to convert TER to DER. I also wrote a program to convert JSON to DER. It would also be possible to convert CSV, etc.)

      It was also my idea of an operating system design, it will have a binary format used for most stuff, similar to DER but different in some ways (including different types are available), which is intended to be interoperable among most of the programs on the system.

    • account-5 3 hours ago
      Text is universally accessible and widely supported. Binary has its benefits, but for anything human-facing, it has to be text.
    • voidfunc 3 hours ago
      Yep, and stuff it into a SQLite db too and you have a query interface all built.
      • culi 1 hour ago
        What's a good program that non-technical people can use to write SQLite db data? I think it's a great idea in theory, but support is lacking.
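
A small sketch of the "stuff it into a SQLite db" idea, using Python's standard `sqlite3` module. The table name `birds`, its columns, and the sample rows are made up for illustration.

```python
import sqlite3

# Hypothetical tabular data; None becomes SQL NULL.
rows = [
    ("Anhinga", "wetland", "fish"),
    ("Black Vulture", "forest", None),
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
conn.execute("CREATE TABLE birds (name TEXT, habitat TEXT, food TEXT)")
conn.executemany("INSERT INTO birds VALUES (?, ?, ?)", rows)

# The query interface comes for free once the data is loaded.
wetland = conn.execute(
    "SELECT name FROM birds WHERE habitat = ?", ("wetland",)
).fetchall()
print(wetland)  # [('Anhinga',)]
```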
    • smallerize 1 hour ago
      Ok but how do you type them? How do you search them? How do you copy-and-paste between documents?
    • rr808 1 hour ago
      parquet?
    • bsder 3 hours ago
      Data will always outlive the program that originally produced it.

      This is why you should almost always use text formats.

  • karmakaze 2 hours ago
    Is there a text format like TSV/CSV that can represent nested/repeating sub-structures?

    We have YAML but it's too complex. JSON is rather verbose with all the repeated keys and quoting, XML even more so. I'd also like to see a 'schema tree' corresponding to a header row in TSV/CSV. I'd even be fine with a binary format with standard decoding to see the plain-text contents. Something for XML like what MessagePack does for JSON would work, since we already have schema specifications.

    • culi 1 hour ago
      Well there's JSONL which is used heavily in scientific programs (especially in biology)

      But CSV represented as JSON is usually accomplished like so:

        {
          "headers": ["name", "habitat", "food"],
          "data": [
            ["Acorn Woodpecker", "forest", "grain"],
            ["American Goldfinch", "grassland", "grain"],
            ["Anhinga", "wetland", "fish"],
            ["Australian Reed Warbler", "wetland", "grub"],
            ["Black Vulture", "forest", null]
          ]
        }
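
For comparison, a sketch of turning the headers/data layout above into JSONL, one object per row. The helper names `to_records` and `to_jsonl` are hypothetical, and the table is shortened to two rows.

```python
import json

table = {
    "headers": ["name", "habitat", "food"],
    "data": [
        ["Acorn Woodpecker", "forest", "grain"],
        ["Black Vulture", "forest", None],
    ],
}

def to_records(table):
    """Pair every row with the header names."""
    return [dict(zip(table["headers"], row)) for row in table["data"]]

def to_jsonl(table):
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(record) for record in to_records(table))

print(to_jsonl(table))
```

The JSONL form repeats the keys on every line, which is the verbosity the parent comment complains about; the headers/data form states them once.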
  • Hackbraten 8 hours ago
    Good on you to leverage EditorConfig settings. Almost every modern IDE or editor supports it either out of the box or with a plug-in.
  • CJefferson 2 hours ago
    Honestly at this point my favorite format is JSONLines (one JSON object per line).

    It instinctively feels horrible, but it’s easy to create and parse in basically every language, easy to fully specify, recovers well from one broken line in large datasets, chops up and concatenates easily.
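
A sketch of the "recovers well from one broken line" property, using only the standard `json` module. `read_jsonl` is a hypothetical helper; it keeps the good records and reports the line numbers it could not parse.

```python
import json

def read_jsonl(lines):
    """Parse JSON Lines, skipping lines that fail instead of aborting."""
    records, bad_lines = [], []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad_lines.append(number)  # remember where the damage is
    return records, bad_lines

sample = ['{"id": 1}', '{"id": 2', '{"id": 3}']  # second line is truncated
records, bad_lines = read_jsonl(sample)
print(records)    # [{'id': 1}, {'id': 3}]
print(bad_lines)  # [2]
```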

  • stevage 3 hours ago
    I hate this kind of format. It's trying to be both a data format for computers and a display format for humans. Much better off just using a tool that can edit CSV files as tables.

    Also it doesn't seem to say anything about the header row?

  • DrillShopper 2 hours ago
    Or we could use the actual characters for this purpose - the FS (file separator), GS (group separator), RS (record separator), and US (unit separator).

    ASCII (and through it, Unicode) has these values specifically for this purpose.

    • EvanAnderson 1 hour ago
      I did an ETL project for an ERP system that used these separators years ago. It was ridiculously easy because I didn't have to worry about escaping. Parsing was an easy state machine.

      Notepad++ handles the display and entry of these characters fairly easily. I think they're nowhere near as unergonomic as people say they are.

    • addoo 1 hour ago
      I’m pretty sure part of the intent is that it should be easy to write (type) in this format. Separator characters are not that. Depending on the editor, they’re not especially readable either.
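
A sketch of the separator approach from this subthread, using the ASCII unit separator (0x1F) between cells and the record separator (0x1E) between rows. It assumes values never contain those two characters; `encode` and `decode` are hypothetical names.

```python
US = "\x1f"  # ASCII unit separator, placed between cells
RS = "\x1e"  # ASCII record separator, placed between rows

def encode(rows):
    """Join cells and rows with the control characters; no escaping."""
    return RS.join(US.join(cells) for cells in rows)

def decode(text):
    """Split the text back into rows of cells."""
    return [record.split(US) for record in text.split(RS)]

# Tabs, pipes, commas, quotes and newlines survive untouched.
rows = [["a\t|b", 'say "hi", ok'], ["line1\nline2", ""]]
assert decode(encode(rows)) == rows
```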
  • bvrmn 3 hours ago
    According to the spec, it's nearly impossible to correctly edit files in this format by hand.
    • mkl 3 hours ago
      How so? All you need is a text editor that preserves tabs.
      • bvrmn 3 hours ago
        1. It's quite easy to miss a tab and use only `|`.

        2. Generated TPSV would look like an unreadable, hard-to-edit mess. I doubt any tool would calculate max column length to adjust tab count for all cells. It basically kills any streaming.

        • mkl 53 minutes ago
          You have a very strange definition of "nearly impossible".

          > 1. It's quite easy to miss a tab and use only `|`.

          Any format is hard to edit manually if you don't follow the requirements of the format (which are very simple in this case).

          > 2. Generated TPSV would look like an unreadable hard to edit mess.

          CSVs are much less readable than this, but still entirely possible to edit.
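
To make the streaming point concrete, here is a sketch of a writer that aligns columns by varying the tab count. It assumes tab stops every 8 columns and that each character is one column wide; `format_aligned` is a hypothetical name. The first loop has to see every row before the second can emit anything.

```python
def format_aligned(rows, tab_width=8):
    """Return TPSV-style lines whose columns line up at tab stops."""
    cells = [["|" + value for value in row] for row in rows]
    n_cols = max(len(row) for row in cells)

    # Pass 1: how many tab stops must each column span to fit its widest cell?
    stops = [0] * n_cols
    for row in cells:
        for i, cell in enumerate(row):
            stops[i] = max(stops[i], len(cell) // tab_width + 1)

    # Pass 2: pad every cell with tabs up to its column's width.
    lines = []
    for row in cells:
        parts = []
        for i, cell in enumerate(row):
            is_last = i == len(row) - 1
            tabs = 0 if is_last else stops[i] - len(cell) // tab_width
            parts.append(cell + "\t" * tabs)
        lines.append("".join(parts))
    return lines

print(format_aligned([["a", "bb"], ["longer value", "c"]]))
# ['|a\t\t|bb', '|longer value\t|c']
```

Since the quoted rule only asks for one or more tabs after a cell, a writer that gives up alignment can emit a single tab and still stream.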

  • helix278 7 hours ago
    I like that there is plenty of room for comments, and the multiline extension is also cool. The backslash almost looks like what I would write on paper if I wanted to sneak something into the previous line :)
  • AstroJetson 3 hours ago
    > A row with too many cells has the superfluous cells ignored.

    Ummm, how do you figure out what row has too many cells? Can all the rows before this one have too few cells?

    • benjaminl 3 hours ago
      The spec says that the first row specifies the number of columns.
    • mkl 3 hours ago
      No:

      > 3. The first row defines the number of columns
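
A sketch of the rule quoted here: the first row fixes the column count and superfluous cells in later rows are dropped. Padding short rows with empty cells is an assumption of this sketch, not something quoted in this thread; `normalize` is a hypothetical name.

```python
def normalize(rows):
    """Make every row as wide as the first one."""
    if not rows:
        return []
    width = len(rows[0])  # the first row defines the number of columns
    normalized = []
    for row in rows:
        kept = row[:width]                       # ignore superfluous cells
        padding = [""] * (width - len(kept))     # assumption: pad short rows
        normalized.append(kept + padding)
    return normalized

print(normalize([["a", "b"], ["1", "2", "3"], ["x"]]))
# [['a', 'b'], ['1', '2'], ['x', '']]
```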
