
In UTF-8, a single byte (uint8_t) may not represent a whole "code point". A code point is an individually meaningful element, like a space, an 'e', or a modifier code point like an accent or a ZWJ. Most UTF-8 libraries will let you address individual code points, but you can still garble the text if you split between an 'e' and its combining '`'. To prevent this, split between graphemes (sequences of code points that render like a single unit*). And even graphemes have their problems.
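A rough Python sketch of the two failure modes above. The sample string is just an illustration: "très" written with a plain 'e' followed by U+0300 COMBINING GRAVE ACCENT.

```python
# "très" spelled as 'e' + U+0300 COMBINING GRAVE ACCENT (5 code points).
text = "tre\u0300s"
data = text.encode("utf-8")  # t, r, e are 1 byte each; U+0300 is 2 bytes; s is 1 byte

# Cutting on a byte index can land inside the 2-byte sequence for U+0300.
cut_bytes = data[:4]
decoded = cut_bytes.decode("utf-8", errors="replace")  # dangling byte becomes U+FFFD

# Cutting on a code point index is valid UTF-8 but drops the accent from the 'e'.
cut_code_points = text[:3]

print(len(data))              # 6
print(repr(decoded))          # 'tre\ufffd'
print(repr(cut_code_points))  # 'tre'
```

Neither cut raises an error by default in most libraries, which is why this kind of bug tends to show up only once real multilingual text comes through.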

Very interesting blog post about graphemes that parallels my experience writing a terminal text editor: https://mitchellh.com/writing/grapheme-clusters-in-terminals

* Even that is not a proper description of a grapheme

And don't even get me started on regexes



Yes, I understand a little about Unicode in this kind of problem, but a code point is an individual logical item even if it is composed of multiple bytes, being a kind of 'string' in itself. I should have asked more carefully: what would be a better system in your view?

Thanks for the link, will check it out after Christmas.


I personally believe that Swift's strings, where graphemes are the smallest indexable unit, are the gold standard for writing logic that might truncate multilingual text. They're still not perfect: they add overhead, and updates to Unicode might change behaviour. But they should handle most cases gracefully.
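To give a feel for grapheme-level truncation outside Swift, here is a minimal Python sketch. It is only an approximation, not the real Unicode segmentation rules (UAX #29): it glues combining marks and ZWJ sequences onto the previous code point and ignores things like variation selectors, emoji modifiers, and Hangul jamo. The names `approx_clusters` and `truncate` are made up for this example.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

def approx_clusters(text):
    """Group code points into rough grapheme clusters (simplified, not UAX #29)."""
    clusters = []
    for ch in text:
        joins_previous = bool(clusters) and (
            unicodedata.combining(ch) != 0   # combining mark, e.g. U+0300
            or ch == ZWJ                     # joiner attaches to what came before
            or clusters[-1].endswith(ZWJ)    # code point right after a joiner
        )
        if joins_previous:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def truncate(text, max_clusters):
    """Keep at most max_clusters rough clusters, never cutting inside one."""
    return "".join(approx_clusters(text)[:max_clusters])

# 'e' + combining grave accent stays together when truncating.
print(repr(truncate("tre\u0300s", 3)))  # 'tre\u0300'

# man + ZWJ + woman + ZWJ + girl is treated as a single cluster.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"
print(len(approx_clusters(family)))     # 1
```

For anything beyond a demo, a library that implements the full segmentation rules and tracks the Unicode version is the safer choice, since the rules do change between releases.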



