In UTF-8, a single byte (uint8_t) may not represent a whole "code point". A code point is an individually meaningful element of the text, like a space, an 'e', or a modifier code point such as an accent or a ZWJ. Most UTF-8 libraries will let you address individual code points, but you can still garble the text if you split between an 'e' and a '`'. To prevent this, splitting should be done between graphemes (sequences of code points that render as a single unit*). And even graphemes have their problems.
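A rough Python sketch of the three levels (bytes, code points, graphemes), using only the standard library. The helper names `split_bytes_ok` and `approx_clusters` are made up for illustration, and the grouping only attaches combining marks to the preceding character. It is an approximation, not real grapheme segmentation: ZWJ sequences, emoji, Hangul and so on are not handled.

```python
import unicodedata

# 'e' + U+0300 COMBINING GRAVE ACCENT (renders as one "è"), then "€".
text = "e\u0300\u20ac"
data = text.encode("utf-8")  # 1 + 2 + 3 = 6 bytes

def split_bytes_ok(raw: bytes, at: int) -> bool:
    """Hypothetical helper: True if both halves are still valid UTF-8."""
    try:
        raw[:at].decode("utf-8")
        raw[at:].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def approx_clusters(s: str) -> list[str]:
    """Hypothetical helper: glue combining marks onto the previous character.

    Assumption: this is a simplification, not the full Unicode
    grapheme cluster rules.
    """
    clusters: list[str] = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# Byte level: offset 4 lands inside the 3-byte "€".
byte_split_inside_euro = split_bytes_ok(data, 4)

# Code point level: offset 1 is valid UTF-8 on both sides,
# but it separates the 'e' from its accent.
byte_split_before_accent = split_bytes_ok(data, 1)

code_point_count = len(text)
clusters = approx_clusters(text)
```

Splitting at byte 1 passes the UTF-8 validity check and still breaks the "è", which is the case that only grapheme-aware splitting catches.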
Yes, I understand a little about Unicode in this kind of problem, but a code point is an individual logical item even if it is composed of multiple bytes, being a kind of 'string' in itself. I should have asked more carefully: what would be a better system in your view?
Thanks for the link, will check it out after Christmas.
I personally believe that Swift's strings, where graphemes are the smallest indexable unit, are the gold standard for writing logic that might truncate multilingual text. They're still not perfect, though: they add overhead, and updates to Unicode might change behaviour. But they should handle most cases gracefully.
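To make the truncation case concrete, here is a minimal sketch in Python rather than Swift. It cuts text to a byte budget without splitting a base character from its combining marks. `truncate_utf8` is a hypothetical name, and the clustering is a simplified stand-in for what a grapheme-aware string type does; it ignores ZWJ sequences, emoji modifiers and the rest of the real segmentation rules.

```python
import unicodedata

def truncate_utf8(text: str, max_bytes: int) -> str:
    """Hypothetical helper: keep whole clusters that fit in max_bytes of UTF-8.

    Assumption: a "cluster" here is a character plus any following
    combining marks, which only approximates a grapheme.
    """
    # Group each base character with the combining marks that follow it.
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)

    # Keep clusters while they still fit in the byte budget.
    kept: list[str] = []
    used = 0
    for cluster in clusters:
        size = len(cluster.encode("utf-8"))
        if used + size > max_bytes:
            break
        kept.append(cluster)
        used += size
    return "".join(kept)

sample = "ne\u0300e"  # 'n', then 'e' + combining grave accent, then 'e'

# Naive byte truncation keeps the bare 'e' and drops its accent.
naive = sample.encode("utf-8")[:2].decode("utf-8", errors="ignore")

# Cluster-aware truncation drops the whole accented letter instead.
safe = truncate_utf8(sample, 2)
```

With a budget of 2 bytes, the naive version changes the letter, while the cluster-aware one returns a shorter string that is still correct.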
Very interesting blog post about graphemes that parallels my experience writing a terminal text editor: https://mitchellh.com/writing/grapheme-clusters-in-terminals
* Even that is not a proper description of a grapheme
And don't even get me started on regexes