> A cell starts with | and ends with one or more tabs.
|one\t|two|three
How many cells is this? Seems like just one, with garbage at the end, since there are no closing tabs after the first cell? Should this line count as a valid row?
> A line that starts with a cell is a row. Any other lines are ignored.
Well, I guess it counts. Either way, how should one encode a value containing a tab followed by a pipe?
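To make the ambiguity concrete, here is a minimal sketch of two possible readings of the quoted rules. The regexes and the `parse_row` helper are my own assumptions, not part of the spec: the strict pattern follows "ends with one or more tabs" literally, while the lenient one also lets end-of-line close a cell.

```python
import re

# Literal reading: a cell is "|", then text, then one or more tabs.
STRICT_CELL = re.compile(r"\|([^\t]*)\t+")
# Lenient reading (an assumption): end of line may also close a cell.
LENIENT_CELL = re.compile(r"\|([^\t]*)(?:\t+|$)")

def parse_row(line, pattern):
    """Return the list of cells, or None if the line does not start with a cell."""
    if pattern.match(line) is None:
        return None  # "Any other lines are ignored."
    return pattern.findall(line)

line = "|one\t|two|three"
strict = parse_row(line, STRICT_CELL)    # ['one']
lenient = parse_row(line, LENIENT_CELL)  # ['one', 'two|three']
print(strict)
print(lenient)
```

Under both readings a value like `a<tab>|b` is split into two cells, so without some escaping rule there seems to be no way to encode it.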
Which formats are helpful depends on the use. I think DER (a binary format) is not so bad. I added a few extra types (such as key/value list, BCD string, and TRON string), though no use is required to adopt them. I also made up Multi-DER, which is simply any number of DER values concatenated together (there are JSON variants like that too). I also made up TER, a text format, along with a program to convert TER to DER, and I wrote a program to convert JSON to DER. It would also be possible to convert CSV, etc.
It is also part of my idea for an operating system design: it would use a binary format for most things, similar to DER but different in some ways (including which types are available), intended to be interoperable among most programs on the system.
Is there a text format like TSV/CSV that can represent nested/repeating sub-structures?
We have YAML, but it's too complex. JSON is rather verbose with all the repeated keys and quoting, XML even more so. I'd also like to see a 'schema tree' corresponding to a header row in TSV/CSV. I'd even be fine with a binary format with a standard decoding to see the plain-text contents. Something for XML like what MessagePack does for JSON would work, since we already have schema specifications.
Honestly at this point my favorite format is JSONLines (one JSON object per line).
It instinctively feels horrible, but it’s easy to create and parse in basically every language, easy to fully specify, recovers well from one broken line in large datasets, chops up and concatenates easily.
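A minimal sketch of the "recovers well from one broken line" point, using only the standard `json` module. The helper names `dump_jsonl` and `load_jsonl` are made up for illustration.

```python
import json

def dump_jsonl(records):
    """Serialize each record as one JSON document per line."""
    # json.dumps escapes newlines inside strings, so each record stays on one line.
    return "".join(json.dumps(record) + "\n" for record in records)

def load_jsonl(text):
    """Parse line by line; collect bad line numbers instead of failing outright."""
    good, bad_lines = [], []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # skip blank lines, including the one after the final newline
        try:
            good.append(json.loads(line))
        except json.JSONDecodeError:
            bad_lines.append(number)
    return good, bad_lines

# Concatenating two dumps is just string concatenation.
text = dump_jsonl([{"a": 1}, {"b": [1, 2]}]) + "{broken\n" + dump_jsonl([{"c": None}])
good, bad_lines = load_jsonl(text)
print(good, bad_lines)
```

The damaged third line is reported and skipped, and every other record still loads.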
I hate this kind of format. It's trying to be both a data format for computers and a display format for humans. You're much better off just using a tool that can edit CSV files as tables.
Also it doesn't seem to say anything about the header row?
Or we could use the actual characters for this purpose - the FS (file separator), GS (group separator), RS (record separator), and US (unit separator).
ASCII (and through it, Unicode) has these values specifically for this purpose.
I did an ETL project for an ERP system that used these separators years ago. It was ridiculously easy because I didn't have to worry about escaping. Parsing was an easy state machine.
Notepad++ handles the display and entry of these characters fairly easily. I think they're nowhere near as unergonomic as people say they are.
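For reference, a minimal sketch of what that looks like with just US (0x1F) between fields and RS (0x1E) between records. The `encode` and `decode` names are hypothetical, and it assumes field values never contain the separator characters themselves.

```python
US = "\x1f"  # unit separator: between fields
RS = "\x1e"  # record separator: between records

def encode(rows):
    """Join fields with US and records with RS; no quoting or escaping."""
    return RS.join(US.join(fields) for fields in rows)

def decode(data):
    """Split on RS, then on US."""
    if not data:
        return []
    return [record.split(US) for record in data.split(RS)]

rows = [["id", "note"], ["1", "tabs\tand, commas\nand newlines"]]
assert decode(encode(rows)) == rows
```

Tabs, commas, quotes, and newlines inside values pass through untouched, which is where the "no escaping" convenience comes from.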
I’m pretty sure part of the intent is that it should be easy to write (type) in this format. Separator characters are not that. Depending on the editor, they’re not especially readable either.
1. It's quite easy to miss a tab and use only `|`.
2. Generated TPSV would look like an unreadable, hard-to-edit mess. I doubt any tool would calculate the max column length to adjust the tab count for all cells. It basically kills any streaming.
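A rough sketch of why aligned output needs two passes. It assumes 8-column tab stops and a hypothetical `align_rows` helper; nothing here comes from the TPSV spec itself.

```python
TAB_WIDTH = 8  # assumed tab stop width; editors may differ

def align_rows(rows):
    """Render rows as '|cell<tabs>' with enough tabs to line columns up."""
    if not rows:
        return []
    # Pass 1: every row must be seen before any column width is known.
    n_cols = max(len(row) for row in rows)
    widths = []
    for col in range(n_cols):
        longest = max(len(row[col]) + 1 for row in rows if col < len(row))
        # Round up to the next tab stop strictly past the longest cell.
        widths.append((longest // TAB_WIDTH + 1) * TAB_WIDTH)
    # Pass 2: emit each cell followed by the tabs needed to reach its column width.
    lines = []
    for row in rows:
        parts = []
        for col, text in enumerate(row):
            cell = "|" + text
            tabs = -(-(widths[col] - len(cell)) // TAB_WIDTH)  # ceiling division
            parts.append(cell + "\t" * tabs)
        lines.append("".join(parts))
    return lines

aligned = align_rows([["a", "bb"], ["longer text", "c"]])
print("\n".join(aligned))
```

Since pass 1 has to consume the whole input, a writer that wants aligned columns cannot emit the first row until it has seen the last one.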
I like that there is plenty of room for comments, and the multiline extension is also cool. The backslash almost looks like what I would write on paper if I wanted to sneak something into the previous line :)
This is why you should almost always use text formats.
But CSV represented as JSON is usually accomplished like so:
> 1. It's quite easy to miss a tab and use only `|`.
Any format is hard to edit manually if you don't follow the requirements of the format (which are very simple in this case).
> 2. Generated TPSV would look like an unreadable hard to edit mess.
CSVs are much less readable than this, but still entirely possible to edit.
Ummm, how do you figure out what row has too many cells? Can all the rows before this one have too few cells?
> 3. The first row defines the number of columns