Use string views instead of passing std::wstring by const&

(giodicanio.com)

30 points | by Orochikaku 2 days ago

4 comments

Panzerschrek 4 minutes ago
System APIs that require a null-terminated string are also painful to use from other languages, where strings are not null-terminated by default. Callers basically have to copy the string and append a null terminator before making the call.
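A minimal sketch of that copy, in C++ for illustration. `call_c_api` is a hypothetical name, and `std::strlen` stands in for whatever system call expects a terminator:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical wrapper: std::strlen stands in for any C API that
// expects a NUL-terminated string.
std::size_t call_c_api(std::string_view text) {
    // A C API cannot represent an embedded NUL, so treat it as a precondition.
    assert(text.find('\0') == std::string_view::npos);
    // A view carries no terminator guarantee, so copy into an owning
    // std::string, whose c_str() is always NUL-terminated.
    const std::string terminated(text);
    return std::strlen(terminated.c_str());
}
```

Passing `whole.substr(0, 5)` for a longer `whole` shows why the copy is needed: the view's last character is followed by more text, not by a terminator.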
delta_p_delta_x 2 hours ago
The zero-terminated string is by far C's worst design decision. It is single-handedly the cause for most performance, correctness, and security bugs, including many high-profile CVEs. I really do wish Pascal strings had caught on earlier and platform/kernel APIs used it, instead of an unqualified pointer-to-char that then hides an O(n) string traversal (by the platform) to find the null byte.
There are then questions about the length prefix, with a simple solution: make this a platform-specific detail and use the machine word. 16-bit platforms get strings of length ~2^16, 32-bit platforms get 2^32 (a 4 GB string, more than 1000× as long as the entire Lord of the Rings trilogy), and 64-bit platforms get 2^64 (~10^19).
Edit: I think a lot of commenters are focusing on the 'Pascalness' of Pascal strings, which I was using as an umbrella terminology for length-prefixed strings.
- david2ndaccount 1 hour ago
  Pascal strings might be the only string design worse than C strings. C Strings at least let you take a zero copy substring of the tail. Pascal strings require a copy for any substring! Strings should be two machine words - length + pointer (aka what is commonly called a string view). This is no different than any other array view. Strings are not a special case.
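  As a rough sketch of the length + pointer idea, assuming C++17's `std::string_view` as the two-word pair: any substring, not just a tail, is a view into the same buffer. `middle_word` is a hypothetical helper.

  ```cpp
  #include <cassert>
  #include <string_view>

  // Returns the text between the first and last space as a view;
  // no characters are copied. Hypothetical helper for illustration.
  std::string_view middle_word(std::string_view s) {
      const auto first = s.find(' ');
      const auto last = s.rfind(' ');
      if (first == std::string_view::npos || first == last) {
          return {};  // fewer than two spaces: no middle part
      }
      std::string_view mid = s.substr(first + 1, last - first - 1);
      // The view points into the original buffer.
      assert(mid.data() == s.data() + first + 1);
      return mid;
  }
  ```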
  - Joker_vD 1 hour ago
    Yeah, I too feel that storing the array's length glued to the array's data is not that good of an idea, it should be stored next to the pointer to the array aka in the array view. But the thrall of having to pass around only a single pointer is quite a strong one.
    - kstrauser 21 minutes ago
      Is there a reason for the string not to be a struct, so that you're still just passing around a pointer to that struct (or even just passing it by value)?
      - tczMUFlmoNk 4 minutes ago
        I might guess that GP is referring not to interface ergonomics (for which a struct is a perfectly satisfactory solution, as you describe), but to implementation efficiency. A pointer is one word. A slice / string view is two words: a length and a pointer. A pointer to a slice is one word, but requires an additional indirection. I personally agree that slices are probably the best all-around choice, but taking double the memory (and incurring double the register pressure, etc.) is a trade-off that's fair to mention.
  - delta_p_delta_x 1 hour ago
    > C Strings at least let you take a zero copy substring of the tail
    This is a special-case optimisation that I'm happy to lose in favour of the massive performance and security benefits otherwise.
    Isn't length + pointer... Basically a Pascal string? Unless I am mistaken.
    I think what was unsaid in your second point is that we really need to type-differentiate constant strings, dynamic strings, and string 'views', which Rust does in-language, and C++ does with the standard library. I prefer Rust's approach.
    - vlovich123 54 minutes ago
      If I recall correctly, a Pascal string has the length before the string, i.e. to get the length you dereference the pointer and look backwards N bytes. A Pascal string is still a single pointer.
      You cannot cheaply take an arbitrary view of the interior string - you can only truncate cheaply (and oob checks are easier to automate). That’s why pointer + length is important because it’s a generic view. For arrays it’s more complicated because you can have a stride which is important for multidimensional arrays.
    - masklinn 31 minutes ago
      > Isn't length + pointer... Basically a Pascal string? Unless I am mistaken.
      Length + pointer is a record string; a Pascal string has the length at the head of the buffer, behind the pointer.
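      A small sketch of the two layouts, assuming the classic 1-byte length prefix stored as the first byte of the buffer (other length-prefixed designs put the prefix just before the pointed-to characters). `RecordString`, `make_pascal`, and `pascal_length` are hypothetical names.

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <cstring>
      #include <vector>

      // Record string / view: the length travels next to the pointer.
      struct RecordString {
          const char* data;
          std::size_t length;
      };

      // Pascal-style buffer: [length byte][characters...] in one allocation.
      std::vector<unsigned char> make_pascal(const char* text) {
          const std::size_t n = std::strlen(text);
          assert(n <= 255);  // a 1-byte prefix caps the length
          std::vector<unsigned char> buf;
          buf.push_back(static_cast<unsigned char>(n));
          buf.insert(buf.end(), text, text + n);
          return buf;
      }

      // Reading the length means dereferencing the buffer itself.
      std::size_t pascal_length(const unsigned char* p) { return p[0]; }
      ```

      An interior substring of the Pascal-style buffer needs a new allocation with its own prefix, whereas a `RecordString` only needs a different `data` and `length`.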
  - gizmo686 1 hour ago
    C strings also allow you to do a zero-copy split by replacing all instances of the delimiter with null (although you need to keep track of the end of the list separately).
    - masklinn 28 minutes ago
      You also need to own the buffer otherwise you’re corrupting someone else’s data, or straight up segfaulting.
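      A hedged sketch of that in-place split, in C++17. `split_in_place` is a hypothetical helper; taking a non-const `std::string&` is one way to make the ownership requirement explicit, since a string literal or someone else's const buffer will not bind to it.

      ```cpp
      #include <cassert>
      #include <string>
      #include <vector>

      // Splits an owned, mutable buffer in place by overwriting each
      // delimiter with '\0'. Returns pointers to the resulting C strings.
      // The pointers are only valid while `buffer` is alive and unmodified.
      std::vector<const char*> split_in_place(std::string& buffer, char delim) {
          std::vector<const char*> fields;
          char* start = buffer.data();   // non-const data() requires C++17
          for (char& c : buffer) {
              if (c == delim) {
                  c = '\0';              // terminate the current field
                  fields.push_back(start);
                  start = &c + 1;
              }
          }
          fields.push_back(start);       // last field ends at the string's own NUL
          assert(!fields.empty());
          return fields;
      }
      ```

      The returned vector is the separately tracked end-of-list mentioned above; the original delimiters are gone from the buffer afterwards.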
  - theamk 1 hour ago
    x86 had 6 general-purpose working registers total. Using length + pointers would have caused a lot of extra spills.
    - masklinn 27 minutes ago
      “Sure your software crashes and your machines get owned, but at least they’re not-working very fast!”
- jmyeet 47 minutes ago
  The C string and C++'s backwards compatibility supporting it is why I think both C and C++ are irredeemable. Beyond the bounds overflow issue, there's no concept of ownership. Like if you pass a string to a C function, who is responsible for freeing it? You? The function you called? What if freeing it is conditional somehow? How would you know? What if an error prevents that free?
  C++ strings had no choice but to copy the underlying string because of this unknown ownership, and then added more ownership issues by letting you call the naked pointer within to pass it to C functions. In fact, that's an issue with pretty much every C++ container, including the smart pointers: you can just call get() and break out of the lifecycle management in unpredictable ways.
  string_view came much later onto the scene and doesn't have ownership so you avoid a sometimes unnecessary copy but honestly it just makes things more complex.
  I honestly think that as long as we continue to use C/C++ for crucial software and operating systems, we'll be dealing with buffer overflow CVEs until the end of time.
- theamk 1 hour ago
  First common 32 bit system was Win 95, which required 4MB of RAM (not GB!). The 4-byte prefix would be considered extremely wasteful in those times - maybe not for a single string, but anytime when there is a list of strings involved, such as constants list. (As a point of reference, Turbo Pascal's default strings still had 1-byte length field).
  Plus, C-style strings allow a lot of optimizations - if you have a mutable buffer with data, you can make a string out of them with zero copy and zero allocations. strtok(3) is an example of such approach, but I've implemented plenty of similar parsers back in the day. INI, CSV, JSON, XML - query file size, allocate buffer once, read it into the buffer, drop some NULL's into strategic positions, maybe shuffle some bytes around for that rare escape case, and you have a whole bunch of C strings, ready to use, and with no length limits.
  Compared to this, Pascal strings would be incredibly painful to use... So you query file size, allocate, read it, and then what? 1-byte length is too short, and for 2+ byte length, you need a secondary buffer to copy string to. And how big should this buffer be? Are you going to be dynamically resizing it or wasting some space?
  And sure, _today_ I no longer write code like that, I don't mind dropping std::string into my code, it's just a meg or so of libraries and 3x overhead for short strings - but that's nothing these days. But back when those conventions were established, it was really really important.
  - kstrauser 19 minutes ago
    > First common 32 bit system was Win 95
    We're just going to ignore Amigas, and any Unix workstations?
  - delta_p_delta_x 55 minutes ago
    > zero copy and zero allocations
    This is a red herring, because when you actually read the strings out, you still need to iterate through the length for each string—zero copy, zero allocation, but linear complexity.
    > query file size, allocate buffer once, read it into the buffer, drop some NULL's into strategic positions, maybe shuffle some bytes around for that rare escape case, and you have a whole bunch of C strings, ready to use, and with no length limits.
    I write parsers in a very different way—I keep the file buffer around as read-only until the end of the pipeline, prepare string views into the buffer, and pipe those along to the next step.
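    A minimal sketch of that style, assuming C++17 and a single-character delimiter. `split_views` is a hypothetical helper, not taken from any real parser.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <string_view>
    #include <vector>

    // Splits a read-only buffer into views; the buffer is never modified
    // and no characters are copied. Views are valid while the buffer lives.
    std::vector<std::string_view> split_views(std::string_view buffer, char delim) {
        std::vector<std::string_view> fields;
        std::size_t start = 0;
        while (true) {
            const std::size_t end = buffer.find(delim, start);
            if (end == std::string_view::npos) {
                fields.push_back(buffer.substr(start));  // final field
                break;
            }
            fields.push_back(buffer.substr(start, end - start));
            start = end + 1;
        }
        return fields;
    }
    ```

    Each field already knows its length, so later stages do not rescan for a terminator, and the same buffer can be split again with a different delimiter.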
    - dh2022 30 minutes ago
      I think the concern was conserving memory ( which was scarce back then) and not iterating through each substring.
      - delta_p_delta_x 17 minutes ago
        I am very sceptical about that. Much safer and cleaner languages like ML and Lisp were contemporary to C, and were equally developed on memory-scarce hardware.
  - amluto 1 hour ago
    > query file size, allocate buffer once, read it into the buffer, drop some NULL's into strategic positions, maybe shuffle some bytes around for that rare escape case, and you have a whole bunch of C strings, ready to use, and with no length limits.
    I have also done this, but I would argue that, even at the time, the design was very poor. A much, much better solution would have been wide pointers: pass around the length of the string separately from the pointer, much like string_view or Rust’s &str. Then you could skip the NULL-writing part.
    Maybe C strings made sense on even older machines which had severely limited registers: if you have an accumulator and one register usable as a pointer, you want to minimize the number of variables involved in a computation.
quotemstr 19 minutes ago
It's usually the case that the more strident someone is in a blog post decrying innovation, the more wrong he is. The current article is no exception.
It's possible to define your own string_view workalike that has a c_str() and binds to whatever is string-like and has a c_str(). It's a few hundred lines of code. You don't have to live with the double indirection.
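A heavily reduced sketch of such a workalike, assuming C++17. The name `zstring_view` and its members are hypothetical, and it binds only to `const char*` and `std::string` rather than to every type with a `c_str()`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical minimal "zstring_view": a non-owning view that can only
// be built from sources known to be NUL-terminated, so c_str() is safe.
class zstring_view {
public:
    zstring_view(const char* s) : data_(s), size_(std::strlen(s)) {}
    zstring_view(const std::string& s) : data_(s.c_str()), size_(s.size()) {}

    const char* c_str() const { return data_; }   // terminator guaranteed
    std::size_t size() const { return size_; }
    operator std::string_view() const { return {data_, size_}; }

    // Dropping a prefix keeps the terminator, so the result is still a
    // zstring_view. Dropping a suffix would not, so it is not offered.
    zstring_view suffix_from(std::size_t pos) const {
        assert(pos <= size_);
        return zstring_view(data_ + pos, size_ - pos);
    }

private:
    zstring_view(const char* s, std::size_t n) : data_(s), size_(n) {}
    const char* data_;
    std::size_t size_;
};
```

A function taking `zstring_view` by value accepts both literals and `std::string` objects without a copy and can still hand `c_str()` to a C API.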
breuwi 1 hour ago
[Deleted, misread]
- Matheus28 1 hour ago
  Since C++11, data() is also required to be null terminated. Per your own source and cppreference.
  - breuwi 1 hour ago
    LOL, I need to learn to click on the more modern tabs. Will delete comment.
- beached_whale 1 hour ago
  std::string since C++ 11 guarantees the buffer is zero terminated. The reasoning being thread safety of const members. https://eel.is/c++draft/basic.string#general-3
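  A small check of that guarantee, assuming a C++11 or later standard library. `is_terminated` is a hypothetical helper.

  ```cpp
  #include <cassert>
  #include <string>

  // Since C++11, data() and c_str() return the same pointer, and the
  // character at index size() is '\0'.
  bool is_terminated(const std::string& s) {
      return s.data() == s.c_str() && s.data()[s.size()] == '\0';
  }
  ```

  `std::string_view` makes no such promise, which is why a view cannot be handed directly to an API that expects a terminator.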