hot take: a programming language's string type and its array type should be entirely separate, and the preferred way to iterate over the characters in your string should be an iterator object, not `for x in 0 .. str.len()` or similar

the existence of more than 2^16 characters means that even if you're using UTF-16 you'll have characters that span multiple code units, which means that 'get the nth character' can't be an O(1) operation (and will be O(n) unless you store some bookkeeping info)

unless you use utf-32 as your backing store, in which case, *why*
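A minimal sketch of the UTF-16 point above, in Python. The helper names (`utf16_units`, `iter_code_points`, `nth_code_point`) are made up for illustration, and the decoder assumes well-formed input apart from passing lone surrogates through unchanged. It shows that indexing by code unit can land in the middle of a surrogate pair, and that finding the nth code point means walking from the start.

```python
def utf16_units(text):
    """Encode text as a list of 16-bit UTF-16 code units."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def iter_code_points(units):
    """Yield code points, joining high/low surrogate pairs as we go."""
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u  # BMP code point (or a lone surrogate, passed through)
            i += 1


def nth_code_point(units, n):
    """O(n): without extra bookkeeping we have to scan from the start."""
    for k, cp in enumerate(iter_code_points(units)):
        if k == n:
            return cp
    raise IndexError(n)


units = utf16_units("a\U0001F600b")   # 'a', U+1F600, 'b'
print(len(units))                     # 4 code units for 3 code points
print(hex(units[1]))                  # 0xd83d, a high surrogate, not a character
print([hex(cp) for cp in iter_code_points(units)])
```

The iterator is the natural interface here: the index-based loop over `units` exposes the encoding, while `iter_code_points` hides it.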

@hierarchon and it gets even better, since even if you store one codepoint per index it's still not entirely good because of combining characters and normalization forms
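A quick illustration of the combining-character point, using Python's standard `unicodedata` module. The variable names are just for the example. The same visible "é" can be one code point or two, and the two spellings only compare equal after normalizing.

```python
import unicodedata

composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(len(composed), len(decomposed))   # 1 2
print(composed == decomposed)           # False

# Normalizing both to the same form makes them comparable.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)   # True True
```

So even with one code point per index, `s[0]` on the decomposed form gives a bare "e" rather than what a reader would call the first character.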

@hierarchon yeah

and some languages are just using [u8] which means you have to do UTF-8 validation manually...

ugh
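For a sense of what "manual validation" has to catch, here is a small Python sketch. `is_valid_utf8` is a hypothetical helper that leans on the standard strict decoder rather than reimplementing the checks by hand.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8("héllo".encode("utf-8")))  # True
print(is_valid_utf8(b"\xc3"))                  # False: truncated 2-byte sequence
print(is_valid_utf8(b"\xc0\xaf"))              # False: overlong encoding of '/'
print(is_valid_utf8(b"\xed\xa0\x80"))          # False: encoded surrogate U+D800
```

A raw byte array accepts all four inputs without complaint, which is the problem being described.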

@hierarchon because you can fit a predictable number of code points in a register or cache line or linear span of memory, provided you don't mind trading vast swathes of memory for random access

apparently Python's Unicode string scheme as of 3.3 is more clever and uses either Latin-1, UCS-2, or UCS-4 (1, 2, or 4 bytes per code point) on a per-string basis, *depending on the largest code point in the string* python.org/dev/peps/pep-0393/
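One rough way to see this from the outside, assuming CPython 3.3 or later (this is an implementation detail, not a language guarantee). `bytes_per_code_point` is a made-up helper that measures how much `sys.getsizeof` grows per extra character, which cancels out the fixed object header.

```python
import sys


def bytes_per_code_point(ch, n=1000):
    """Storage growth per repeated character, ignoring the fixed header."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


for ch in ["a", "\u00e9", "\u03a9", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", bytes_per_code_point(ch))
```

On CPython this should report 1 byte for the first two characters, 2 for U+03A9, and 4 for U+1F600, so a single astral character widens the whole string.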

@hierarchon variable-length UTF encodings can use hardware acceleration for some ops, if you have the right hardware and no morals, for example, intel.com/content/dam/www/publ "Unicode Processing and PCMPxSTRy" section

@hierarchon but even UTF-32 won't save you from the problems of iterating across decomposed characters and other such sequences that have more than one code point in them (like most new emoji), which is why Swift has so many flavors of view into Strings developer.apple.com/documentat
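To make that concrete in Python: the transgender flag emoji is one thing on screen but five code points underneath, so fixed-width code points still don't give you one slot per user-visible character. The variable name `flag` is just for the example.

```python
# WAVING WHITE FLAG, VS-16, ZERO WIDTH JOINER, U+26A7, VS-16
flag = "\U0001F3F3\uFE0F\u200D\u26A7\uFE0F"

print(len(flag))                        # 5 code points
print(len(flag.encode("utf-32-le")))    # 20 bytes in UTF-32
print([f"U+{ord(c):04X}" for c in flag])
# ['U+1F3F3', 'U+FE0F', 'U+200D', 'U+26A7', 'U+FE0F']
```

Python's standard library has no grapheme-cluster iterator either, so `for c in flag` walks these five code points one at a time.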

@hierarchon tl;dr: some modern language designers agree with you, but it kinda sucks Rust doesn't have the ability to iterate over grapheme clusters in std

@vyr on the other hand, 'iterating over grapheme clusters' is an operation that changes with the unicode version (is 🏳️ zwj ⚧️ one grapheme cluster or two?), and it's easier to update a library than it is to release a new rustc version

@hierarchon Python, Java, Go, Swift, Haskell, and many other languages update Unicode databases when they update the standard library, but i guess it depends on your release cadence
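In Python's case you can check which Unicode database a given interpreter shipped with. `unicode_db_version` is a hypothetical wrapper around the standard `unicodedata.unidata_version` attribute.

```python
import sys
import unicodedata


def unicode_db_version():
    """Pair the interpreter version with its bundled Unicode database version."""
    return sys.version_info[:2], unicodedata.unidata_version


py_version, ucd_version = unicode_db_version()
print(py_version, ucd_version)
```

The second value only changes when you move to an interpreter release that bundles a newer database, which is the release-cadence tradeoff being discussed.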

@vyr sure, i think this is just in line with rust's 'very small stdlib' philosophy

@vyr yeah if you wanna be hyperclever with it you can get away with it, especially in a high-level language like python

@hierarchon I think a string should present as an array of characters, regardless of how it’s actually stored. Might be a good idea to preprocess it a bit to chunk the string into ranges addressed by grapheme clusters,* and then toss those chunks into a B-tree to get good-enough lookup times, even if character scanning within each chunk is in linear time.

Nonetheless, I personally see index-based looping as a pure incarnation of evil, so right there with you on using a high-level iterator.

(*: bearing in mind that grapheme clusters can be arbitrarily long — zalgo text for instance — but also are what most average people consider to be the atomic units in a string)
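A very rough Python sketch of the chunk-and-index idea above, with two loud simplifications: it treats "a base code point plus any following combining marks" as a cluster, which is not the full Unicode segmentation algorithm (it will split ZWJ emoji sequences, for instance), and it keeps cluster offsets in a flat list rather than a B-tree. `cluster_starts` and `ClusterIndexed` are made-up names.

```python
import unicodedata


def cluster_starts(text):
    """Offsets where a simplified cluster begins (NOT full grapheme segmentation)."""
    starts = []
    for i, ch in enumerate(text):
        # A code point with combining class 0 starts a new cluster.
        if i == 0 or unicodedata.combining(ch) == 0:
            starts.append(i)
    return starts


class ClusterIndexed:
    """Presents a string as an array of clusters via a precomputed offset table."""

    def __init__(self, text):
        self.text = text
        self.starts = cluster_starts(text)   # one O(n) pass up front

    def __len__(self):
        return len(self.starts)

    def __getitem__(self, n):
        if not 0 <= n < len(self.starts):
            raise IndexError(n)
        start = self.starts[n]
        end = self.starts[n + 1] if n + 1 < len(self.starts) else len(self.text)
        return self.text[start:end]          # clusters can be arbitrarily long


s = ClusterIndexed("e\u0301a\u0300\u0323z")
print(len(s))                  # 3 clusters from 6 code points
print([len(c) for c in s])     # [2, 3, 1]
```

The offset table is the "bookkeeping info" from earlier in the thread: lookup by cluster becomes cheap, at the cost of a preprocessing pass and extra memory.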

@hierarchon but don’t take me too seriously — I’m quite tipsy :ms_robot_thinkgrin:

@hierarchon hot take: arrays should only be used if there's a really good reason to. give me std::vector over x[] any day.

@gemsys i mean i'm using 'array' to mean 'any structure that's effectively a contiguous region of memory and no extra bookkeeping information', whether that's C++'s vector or C's [] or Rust's Vec or whatever

@hierarchon but, like, in Rust a string *is* a Vec, if I remember correctly. What is an example of a string type which isn't implemented as a vector of characters?

@gemsys obviously it's fine to *implement* it that way but it's not like Rust just did `type String = Vec<u8>`

@hierarchon I mean they literally did lol. from string.rs:

pub struct String {
    vec: Vec<u8>,
}

@gemsys but string doesn't expose the same *API*, that's the point
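The same separation can be sketched in Python, purely as an illustration and not a description of how Rust implements it. `Utf8Text` is a hypothetical class that stores raw bytes like an array would, but validates them on construction and only exposes iteration, with no indexing.

```python
class Utf8Text:
    """Hypothetical wrapper: a byte array inside, a text-only API outside."""

    def __init__(self, raw: bytes):
        raw.decode("utf-8")        # raises UnicodeDecodeError on invalid input
        self._raw = bytes(raw)     # private backing store

    def chars(self):
        """Iterate over code points; the only way to read the contents."""
        return iter(self._raw.decode("utf-8"))

    def byte_len(self):
        """Size of the backing store, named so it isn't mistaken for a character count."""
        return len(self._raw)


t = Utf8Text("héllo".encode("utf-8"))
print(list(t.chars()))                    # ['h', 'é', 'l', 'l', 'o']
print(t.byte_len())                       # 6
print(hasattr(Utf8Text, "__getitem__"))   # False: t[0] is a TypeError
```

The storage is identical to a plain byte array; what differs is which operations the type lets you perform.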

@hierarchon ah, I see; then yes, I'd agree. I read "string type and array type as entirely separate" as the underlying data structure being completely different, not the programmer-facing API being different

@gemsys ah, yeah. i'm contrasting this with something like Haskell, where the String type is literally just a list of Unicode codepoints
