hot take: a programming language's string type and its array type should be entirely separate, and the preferred way to iterate over the characters in your string should be an iterator object, not `for x in 0 .. str.len()` or similar


the existence of more than 2^16 characters means that even if you're using UTF-16 you'll have characters that span multiple code units (surrogate pairs), which means that 'get the nth character' can't be an O(1) operation (and will be O(n) unless you store some bookkeeping info)

unless you use utf-32 as your backing store, in which case, *why*
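A minimal sketch of the point above, in Python. The helper names `code_units` and `nth_code_point_utf8` are made up for illustration, and the scan assumes its input is already valid UTF-8. It shows that one character can take several code units, and that finding the nth code point in a variable-width encoding means walking from the start.

```python
def code_units(s: str, encoding: str) -> int:
    """Count the code units s occupies in the given encoding."""
    width = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}[encoding]
    return len(s.encode(encoding)) // width


def nth_code_point_utf8(data: bytes, n: int) -> str:
    """Return the n-th code point of valid UTF-8 bytes by scanning from the start, O(n)."""
    seen = -1
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:  # not a continuation byte, so a code point starts here
            seen += 1
            if seen == n:
                # the lead byte encodes how many bytes the sequence has
                length = 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
                return data[i:i + length].decode("utf-8")
    raise IndexError(n)


s = "a😀b"                          # three code points
print(code_units(s, "utf-8"))       # 6: the emoji takes four bytes
print(code_units(s, "utf-16-le"))   # 4: the emoji is a surrogate pair
print(code_units(s, "utf-32-le"))   # 3: one unit per code point
print(nth_code_point_utf8(s.encode("utf-8"), 2))  # b
```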

@hierarchon and it gets even better, since even if you store one codepoint per index it's still not entirely good because of combining characters and normalization forms
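For a concrete case of the normalization problem, here is a small Python sketch using the standard `unicodedata` module. The helper `same_text` is a hypothetical name, and comparing under NFC is just one reasonable choice.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(len(composed), len(decomposed))   # 1 2
print(composed == decomposed)           # False: different code point sequences
print(same_text(composed, decomposed))  # True: same text once normalized
```

Both strings render as the same "é", so one code point per index still doesn't line up with what a reader would call one character.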

@hierarchon yeah

and some languages are just using [u8] which means you have to do utf validation manually...

ugh
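For comparison, a sketch of what the validation step looks like when the language does it for you. Python's strict `bytes.decode("utf-8")` rejects malformed input; `validate_utf8` is a hypothetical wrapper name.

```python
from typing import Optional


def validate_utf8(data: bytes) -> Optional[str]:
    """Return the decoded text, or None if data is not well-formed UTF-8."""
    try:
        return data.decode("utf-8")  # strict by default
    except UnicodeDecodeError:
        return None


print(validate_utf8(b"caf\xc3\xa9"))  # café
print(validate_utf8(b"caf\xc3"))      # None: truncated two-byte sequence
print(validate_utf8(b"\xc0\x80"))     # None: overlong encoding
```

With a bare byte array, every one of those rejections is something the caller has to remember to check.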

@hierarchon because you can fit a predictable number of code points in a register or cache line or linear span of memory, provided you don't mind trading vast swathes of memory for random access

apparently Python's Unicode string scheme as of 3.3 is more clever and uses either Latin-1, UCS-2, or UCS-4 (1, 2, or 4 bytes per code point) on a per-string basis, *depending on the largest code point in the string* python.org/dev/peps/pep-0393/
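A rough way to observe this from Python itself, assuming CPython 3.3 or later (other implementations store strings differently). `bytes_per_code_point` is a hypothetical helper that measures how much `sys.getsizeof` grows when 1000 more copies of the same character are added.

```python
import sys


def bytes_per_code_point(ch: str) -> int:
    """Estimate CPython's storage width for a string made only of ch."""
    short, long = ch * 1, ch * 1001
    # the fixed header cancels out; what remains is 1000 code points of payload
    return (sys.getsizeof(long) - sys.getsizeof(short)) // 1000


for ch in ("a", "é", "€", "😀"):
    print(f"U+{ord(ch):04X}", bytes_per_code_point(ch))
```

On CPython this should report 1 for "a" and "é", 2 for "€", and 4 for "😀". Every string stays fixed-width internally, so indexing by code point remains O(1).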

@hierarchon variable-length UTF encodings can use hardware acceleration for some ops, if you have the right hardware and no morals, for example, intel.com/content/dam/www/publ "Unicode Processing and PCMPxSTRy" section

@hierarchon but even UTF-32 won't save you from the problems of iterating across decomposed characters and other such sequences that have more than one code point in them (like most new emoji), which is why Swift has so many flavors of view into Strings developer.apple.com/documentat
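To make that concrete, here is a deliberately simplified Python sketch of iterating by cluster rather than by code point. It is an approximation, not the real Unicode segmentation rules (UAX #29): it only glues combining marks and zero-width-joiner sequences onto the preceding code point. `approx_clusters` is a made-up name.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def approx_clusters(s: str):
    """Yield rough 'user-perceived characters'. NOT a full UAX #29 implementation."""
    cluster = ""
    for ch in s:
        extends = (
            unicodedata.category(ch) in ("Mn", "Me", "Mc")  # combining marks, variation selectors
            or ch == ZWJ                                    # joiner attaches to what came before
            or cluster.endswith(ZWJ)                        # and pulls in what comes after
        )
        if cluster and not extends:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster


flag = "\U0001F3F3\uFE0F\u200D\u26A7\uFE0F"  # flag, VS16, ZWJ, transgender symbol, VS16
print(len(flag))                         # 5 code points
print(len(list(approx_clusters(flag))))  # 1 cluster under this approximation
print(list(approx_clusters("e\u0301x"))) # two clusters: "é" (decomposed) and "x"
```

Real code should use a maintained segmentation library, since the rules and the underlying character data change between Unicode versions.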

@hierarchon tl;dr: some modern language designers agree with you, but it kinda sucks Rust doesn't have the ability to iterate over grapheme clusters in std

@vyr on the other hand, 'iterating over grapheme clusters' is an operation that changes with the unicode version (is 🏳️ zwj ⚧️ one grapheme cluster or two?), and it's easier to update a library than it is to release a new rustc version

@hierarchon Python, Java, Go, Swift, Haskell, and many other languages update Unicode databases when they update the standard library, but i guess it depends on your release cadence

@vyr sure, i think this is just in line with rust's 'very small stdlib' philosophy

@vyr yeah if you wanna be hyperclever with it you can get away with it, especially in a high-level language like python

@hierarchon I think a string should present as an array of characters, regardless of how it’s actually stored. Might be a good idea to preprocess it a bit to chunk the string into ranges addressed by grapheme clusters,* and then toss those chunks into a B-tree to get good-enough lookup times, even if character scanning within each chunk is in linear time.

Nonetheless, I personally see index-based looping as a pure incarnation of evil, so right there with you on using a high-level iterator.

(*: bearing in mind that grapheme clusters can be arbitrarily long — zalgo text for instance — but also are what most average people consider to be the atomic units in a string)
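A much-reduced sketch of the chunking idea, with two loudly stated simplifications: it addresses code points rather than grapheme clusters, and it uses a flat list of checkpoints instead of a B-tree (which is enough for an immutable string). `CheckpointedText` and `stride` are hypothetical names.

```python
class CheckpointedText:
    """UTF-8 bytes plus a byte offset recorded every `stride` code points."""

    def __init__(self, text: str, stride: int = 4):
        self.data = text.encode("utf-8")
        self.stride = stride
        self.checkpoints = []  # checkpoints[k] = byte offset of code point k * stride
        count = 0
        for offset, byte in enumerate(self.data):
            if byte & 0xC0 != 0x80:  # lead byte: a code point starts here
                if count % stride == 0:
                    self.checkpoints.append(offset)
                count += 1
        self.length = count

    def _next(self, offset: int) -> int:
        """Return the byte offset of the code point following the one at offset."""
        offset += 1
        while offset < len(self.data) and self.data[offset] & 0xC0 == 0x80:
            offset += 1
        return offset

    def __getitem__(self, n: int) -> str:
        if not 0 <= n < self.length:
            raise IndexError(n)
        offset = self.checkpoints[n // self.stride]  # jump to the nearest checkpoint
        for _ in range(n % self.stride):             # then scan at most stride - 1 code points
            offset = self._next(offset)
        return self.data[offset:self._next(offset)].decode("utf-8")


text = "naïve 😀 café, 日本語"
indexed = CheckpointedText(text, stride=4)
print(indexed[6])  # 😀
```

Lookup cost is one list access plus a linear scan bounded by `stride`, which is the "good-enough lookup, linear within a chunk" trade described above. A cluster-addressed version would need a real segmentation pass to decide where the checkpoints go.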

@hierarchon but don’t take me too seriously — I’m quite tipsy :ms_robot_thinkgrin:
