Let's stop ascribing meaning to code points (2017)

  • 101 points
  • 12 days ago

  • @tosh
  • Created a post


@upofadown 10 days

Replying to @tosh 🎙

Python 3 is a good example of what happens when you represent strings as an indexable sequence of code points. I always found that a bit ironic, since one of the major justifications for Python 3 was to improve Unicode support. In the end it failed to do that because of this.
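
To make the code point model concrete, here is a minimal sketch of what Python 3's `len()` and indexing report for text that a reader sees as one character. The sample strings are illustrative assumptions, not something from the thread.

```python
# "e" followed by U+0301 COMBINING ACUTE ACCENT renders as a single "é".
decomposed = "e\u0301"
# The precomposed form U+00E9 renders the same way.
precomposed = "\u00e9"

# len() counts code points, so visually identical strings differ in length.
print(len(decomposed))   # 2
print(len(precomposed))  # 1

# Indexing also works per code point, so it can split off the accent.
base_only = decomposed[0]
print(base_only == "e")  # True
```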


@tialaramex 10 days

Replying to @tosh 🎙

It's also easier to get away with Rust's decision to say no, strings aren't sequences of code points (which they aren't) if you do all the work to support ASCII on bytes anyway.

Rust defines things like is_ascii_hexdigit() on both char (a Unicode scalar value) and u8 (a byte), so if you're writing low-level code that cares only about bytes, you aren't expected to turn them into a string just to find out whether the byte you're looking at is an ASCII digit, or to improvise something.

This sort of thing means the programmer who is moving bytes and is very angry about the notion of character encoding needn't touch Rust's Unicode stuff, while the programmer who is responsible for text rendering isn't given a bunch of undifferentiated bytes and told "Good luck". Somebody needs to figure out what the encoding is, but likely that programmer actually cares about the difference between ISO-8859-1 and Windows code page 1252 or at least is aware that somebody else might care.


@torstenvl 10 days

Replying to @tosh 🎙

> Folks start implying that code points mean something, and that O(1) indexing or slicing at code point boundaries is a useful operation.

It isn't possible to determine where to slice a string at grapheme cluster boundaries without indexing into the code points to find their combining classes.

The author's point here is self-contradictory, because if string slicing requires anything more than slicing at code point boundaries, you're going to need fast indexing to code points.
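
A rough sketch of why boundary finding has to look at code points: the hypothetical helper `rough_clusters` below attaches every code point with a nonzero canonical combining class to the cluster before it. This is a deliberate simplification and not the full segmentation algorithm, which has many more rules.

```python
import unicodedata

def rough_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Simplification: a code point with a nonzero canonical combining class
    is attached to the previous cluster. Real segmentation has many more
    rules, for example for emoji sequences and Hangul.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

sample = "e\u0301a"  # decomposed "é" followed by "a"
clusters = rough_clusters(sample)
print(len(sample), len(clusters))  # 3 code points, 2 clusters
```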

What the author appears to want to argue is that the trade-offs are worth it. However, it's possible to make that argument without the bad faith claim that there's no value to the other perspective.


@kazinator 10 days

Replying to @tosh 🎙

> This is because code points have no intrinsic meaning. They are not “characters”.

This is simply false. The true statement is: "Not all code points are characters".

- Many code points are characters. For instance, everything in the ASCII range.

- Codepoints in the ASCII range are often used for delimiters: quotes, commas, various brackets ... they have semantics.

- Generic text manipulating routines don't require code points to have semantics, but they require indexing.

We can make an analogy here to UTF-8. Let's pretend that "character" means "valid multi-byte UTF-8 code" and "code point" means "byte".

Code point (i.e. byte) access to a UTF-8 string is extremely useful.

UTF-8 strings can be processed by code that doesn't understand UTF-8 at all; for instance you can split a UTF-8 string on commas or spaces using some function written in 1980. That function will use pointers or indices or some combination thereof into the string, using subroutines that just blindly copy ranges of bytes without caring what they mean. The UTF-8 won't be torn apart because the delimiters don't occur in the middle of a UTF-8 sequence.
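
The same idea can be sketched in Python, with `bytes.split` standing in for the encoding-unaware routine. The sample data is made up for illustration.

```python
# UTF-8 encoded input containing multi-byte sequences.
raw = "naïve,日本語,café".encode("utf-8")

# bytes.split knows nothing about UTF-8; it only looks for the byte 0x2C.
fields = raw.split(b",")

# Every byte of a multi-byte UTF-8 sequence is >= 0x80, so an ASCII
# comma can never appear inside one and each field decodes cleanly.
decoded = [field.decode("utf-8") for field in fields]
print(decoded)
```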


@wereHamster 10 days

Replying to @tosh 🎙

> UTF-16 is mostly a “worst of both worlds” compromise at this point, and the main programming language I can think of that uses it (and exposes it in this form) is Javascript, and that too in a broken way.

Unicode predates JavaScript. JS uses UCS-2!

From https://mathiasbynens.be/notes/javascript-encoding

    JavaScript engines are free to use UCS-2 or UTF-16 internally. Most engines that I know of use UTF-16, but whatever choice they made, it’s just an implementation detail that won’t affect the language’s characteristics.
    
    The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.
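
To see what exposing 16-bit code units means in practice, here is a small Python sketch. The helper name `utf16_length` is an assumption for illustration; it counts UTF-16 code units, which is the quantity a JavaScript string's `length` reports.

```python
def utf16_length(text):
    """Number of UTF-16 code units in text."""
    return len(text.encode("utf-16-le")) // 2

bmp = "\u00e9"         # é, inside the Basic Multilingual Plane
astral = "\U0001F600"  # 😀, outside it, so it needs a surrogate pair

# One code point each, but a different number of 16-bit units.
print(len(bmp), utf16_length(bmp))        # 1 1
print(len(astral), utf16_length(astral))  # 1 2
```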


@zajio1am 10 days

Replying to @tosh 🎙

The fundamental problem with grapheme clusters as a basic programming language concept is that they are not uniformly defined: they depend on the Grapheme_Extend property of Unicode characters, which is known only for already-defined code points. That makes such splitting hard to do in a forward-compatible manner. It is also costly and in many cases unnecessary, which makes it better suited to a library than to a core language concept.

A secondary issue is that with code points (or bytes) as the basic concept, one can compose grapheme clusters in the same manner as general text composition, whereas if grapheme clusters are the basic concept, one needs a specialized operation for composing them from code points.

OTOH, I pretty much agree that O(1) access to code points is not important and that keeping the internal representation in UTF-8 (or in the input encoding) is OK for most purposes.


@josephg 10 days

Replying to @tosh 🎙

As someone who's written a lot of unicode-aware code in multiple languages, I don't agree at all. The problem I think a lot about is collaborative text editing. I need to express an insert into a text document - eg, "Insert 'a' at position X". The obvious question that comes up is - how should we define "X"? There's a mess of options:

- UTF-8 byte offset (from the start of the string). (Or UCS2 offset)

- Extended grapheme cluster offset

- Code point offset

You can also use a line/column position, but the column position ends up being one of the 3 above.

I want collaborative editing protocols to work across multiple programs, running on multiple systems. (Eg, a user is in a web browser on a phone, talking to a server written in rust.)

This post says that codepoint offsets are Bad, but in my mind they're the only sane answer to this problem.

Using byte offsets has two problems:

1) You have to pick an encoding, and that's problematic for cross-language compatibility. UTF-8 byte offsets are meaningless in javascript - they're slow & expensive to convert. UCS2 offsets are meaningless in rust.

2) They make it possible to express invalid data. Inserting in the middle of a codepoint is an invalid operation. I don't want to worry about different systems handling that case differently. Using codepoint offsets makes it impossible to even represent the idea of inserting in the middle of a character.

Using grapheme cluster offsets is problematic because the grapheme clustering rules change all the time. I want editing operations to be readable from any programming language, and across time. Saying "Insert 'a' at position 20" (measured in grapheme clusters) is ambiguous because "position 20" will drift as unicode's grapheme cluster rules change. The result is that old android phones and new iphones can't reliably edit a document together. And old editing traces can't be reliably replayed.

Measuring codepoints is better than the other options because it's stable, cross platform and well defined. If you aren't aware of those benefits, you aren't understanding my use case.
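
A minimal sketch of the three kinds of offset for one insert position. The function names and the sample document are hypothetical; a real protocol would define its own wire format.

```python
def codepoint_to_utf8_offset(text, cp_offset):
    """UTF-8 byte offset corresponding to a code point offset."""
    return len(text[:cp_offset].encode("utf-8"))

def codepoint_to_utf16_offset(text, cp_offset):
    """UTF-16 code unit offset corresponding to a code point offset."""
    return len(text[:cp_offset].encode("utf-16-le")) // 2

def insert_at(text, cp_offset, inserted):
    """Apply "insert at position X" with X measured in code points."""
    return text[:cp_offset] + inserted + text[cp_offset:]

doc = "h\u00e9\U0001F600!"  # "hé😀!"
pos = 3                     # code point offset just before "!"
print(codepoint_to_utf8_offset(doc, pos))   # 7
print(codepoint_to_utf16_offset(doc, pos))  # 4
print(insert_at(doc, pos, "a"))
```

The same position is 3 in code points, 7 in UTF-8 bytes and 4 in UTF-16 code units, so each endpoint has to convert from whichever unit the protocol picks.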


@ThrowawayTestr 10 days

Replying to @tosh 🎙

Why isn't UTF-32 used everywhere for everything? Is an extra 3 bytes per character really that much?


@avgcorrection 10 days

Replying to @tosh 🎙

Yes, it’s definitely annoying that you can’t get the grapheme clusters of a string without going outside the standard library. That’s a very basic need when you are dealing with user input and you care about the individual “characters” that have been sent in.


@cryptonector 10 days

Replying to @tosh 🎙

> However, you don’t need code point indexing here, byte indexing works fine! UTF8 is designed so that you can check if you’re on a code point boundary even if you just byte-index directly.

Yes, UTF-8 is self-resynchronizing in either direction. I.e., if you pick a random byte index into a UTF-8 string, you can check if that byte's value is the start of a codepoint, and if not then you can scan backwards (or forward) to find the start of the current (or next) codepoint. Do be careful not to overrun the bounds of the string, if you're writing C code anyways.
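
The backward scan can be sketched as follows. The helper names are hypothetical, and the sketch assumes the input is valid UTF-8.

```python
def is_continuation(byte):
    """UTF-8 continuation bytes have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80

def codepoint_start(data, index):
    """Scan backwards from index to the first byte of its code point.

    Assumes data is valid UTF-8 and 0 <= index < len(data). The
    index > 0 check keeps the scan inside the bounds of the string.
    """
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index

data = "a\u20acb".encode("utf-8")  # "a€b": 1 + 3 + 1 bytes
print([codepoint_start(data, i) for i in range(len(data))])  # [0, 1, 1, 1, 4]
```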

Do note that start-of-codepoint is not the same thing as start-of-grapheme-cluster or start-of-character.

Also, TFA doesn't touch at all on forms and normalization. And I think TFA is confused about "meaning". Unicode codepoints very much have meaning, and that had better not change. TFA seems to be mostly about indexing into strings, and that content is fine.

> One very common misconception I’ve seen is that code points have cross-language intrinsic meaning.

Well, maybe that depends on what the meaning of "meaning" is.

Unicode codepoints have normative meanings -- that is, meanings assigned by the Unicode Consortium. Those "meanings" are embodied in a) their names and descriptions, b) their normative glyphs. Some codepoints are combining codepoints, and their meaning lies in the changes to the base codepoint's glyph when applied -- this is certainly a kind of "meaning".

But of course people can use codepoints (really, characters) in ways which do not comport to the meanings assigned by the UC. That's fine, of course. The real meanings of the words we write and how we use them, and the glyphs they are composed of, vary over time because human language evolves.

But TFA writes "cross-language intrinsic meaning". That's a more specific claim that is more likely to be true.

It's certainly true that Indic language glyphs have no meaning to me when mixed with Latin scripts, since I know nothing about them, though I can look up their meaning, and I can learn about them. And it's also true that confusable codepoint assignments (e.g., some Greek characters look just like Latin characters, and vice-versa) may not, for the reader, have the UC's intended meaning, since to the reader the only meaning will be the rendered glyph's rather than the codepoint's! After all, being confusable, the reader isn't likely to notice that some glyph is not Latin but Greek (or whatever).
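
As a small illustration of assigned names and confusables, Python's `unicodedata` module can show how two look-alike glyphs differ at the code point level. The choice of characters is an assumption for the example.

```python
import unicodedata

latin_a = "\u0041"      # A
greek_alpha = "\u0391"  # Α, usually rendered identically to Latin A
acute = "\u0301"        # a combining mark, meaningful only on a base

# The assigned names differ even where the glyphs look the same.
print(unicodedata.name(latin_a))      # LATIN CAPITAL LETTER A
print(unicodedata.name(greek_alpha))  # GREEK CAPITAL LETTER ALPHA
print(unicodedata.name(acute))        # COMBINING ACUTE ACCENT

# String comparison sees the code points, not the rendered glyphs.
print(latin_a == greek_alpha)         # False
```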

But if some text does not mix confusable scripts with the intent of creating confusion, and if it is clear to the reader what is intended, then a reader familiar with the scripts (and languages) being used in the text will be able to discern intended meaning, and the UC-assigned meanings of the codepoints used will be... meaningful even though the text is mixed-script text.

So, yes, with caveats, Unicode codepoints do have "cross-language intrinsic meaning".

IMO, it would have been better for TFA not to mix the two things, codepoint meaning and string indexing issues.

Also, string indexing is just not needed. In general you have to parse text from first code unit to last. You might tokenize text, and then indexing into a sequence of tokens might have meaning / be useful, but indexing into UTF-8/16/32 by code unit is not really useful.


@dahfizz 10 days

Replying to @tosh 🎙

Unpopular opinion: written languages need to modernize. The printing press and typewriter forced the Romance languages from a format optimized for handwriting (cursive) into a format optimized for the modern world (print).

I think other languages would benefit from having a "print" format.


@Groxx 10 days

Replying to @tosh 🎙

Completely agreed, and this is a pretty nice overview of the problems with thinking of strings in terms of "characters" or "code points".

Graphemes are almost always the closest to what people mean when they say "character".

But you also shouldn't do logic based on grapheme, unless you're contributing to harfbuzz and know enough to know exactly why this advice is wrong. Don't split or concatenate strings for any reason if you're doing internationalized stuff. E.g. ask your translators to give you a separate string for a drop-cap, and to remove it from the string that follows, do not just pluck out the first grapheme because it could look like nonsense.

---

You should literally never need or want to interact with Unicode directly, unless you're building the foundational layers of other systems (rendering, Unicode normalization, etc). If you find that you have to, you're probably losing encoding information somewhere - that loss is the bug to fix, don't try to patch it somehow, you'll just cause weirder errors elsewhere - or doing something fundamentally irrational, like splitting a string somehow. The never-ending pain you encounter while doing this stuff is a sign you're Doing It Wrong™ and should step back and question the basics, not that it just needs one more fix to work correctly.

If you're doing single-language logging for developers or whatever? Yeah, go wild. Though watch out for irrationally chopped user input, ya gotta make sure your log analyzer won't choke on bad UTF-8.
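
A minimal sketch of the log-analyzer concern, assuming a message that was truncated at a fixed byte length. The sample message is made up.

```python
# A user message cut off after 4 bytes, splitting "é" in half.
message = "caf\u00e9".encode("utf-8")  # 5 bytes: c a f 0xC3 0xA9
chopped = message[:4]

# Strict decoding raises on the dangling lead byte.
try:
    chopped.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="replace" substitutes U+FFFD so the log pipeline keeps going.
safe = chopped.decode("utf-8", errors="replace")
print(strict_ok, safe)
```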


@CoastalCoder 10 days

Replying to @tosh 🎙

When I first read the headline, I thought "code point" referred to a specific location in a program's text. (E.g., file/line#/col#, or memory address.)

It was fun watching my brain try to make sense of that meaning.


@hgs3 10 days

Replying to @tosh 🎙

The author must have missed the part of the Unicode spec where it ascribes dozens of properties to code points: it assigns them a general category (e.g. control character, lower case letter, punctuation, and more), a case mapping (if applicable), a numeric value (if applicable) and the kind of number it is (ordinal or otherwise), the script they are typically written in, and many more.
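
Several of these properties can be queried from Python's standard library, as a quick sketch. The `unicodedata` module covers only some of them; the script property, for example, is not exposed there.

```python
import unicodedata

# General category: two-letter codes defined by Unicode.
print(unicodedata.category("a"))   # Ll, lowercase letter
print(unicodedata.category("\n"))  # Cc, control character
print(unicodedata.category(","))   # Po, other punctuation

# Numeric value, where applicable.
half = unicodedata.numeric("\u00bd")  # ½
print(half)                           # 0.5

# Case mapping, where applicable.
upper_e_acute = "\u00e9".upper()      # é -> É
print(upper_e_acute == "\u00c9")      # True
```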

I think the confusion with Unicode is that programmers apply their non-technical preconception of what a "character" is rather than understanding the Unicode definition. Unicode defines a character as a representation of something - maybe it represents a line break (control character), a letter, an ideograph, a combining mark, or something else. A "code point" is just a number that a character is assigned for numerical representation in memory. A "grapheme cluster" is what a user perceives as a character - it's how non-programmers see text. What needs to happen is programmers need to hammer into their heads the Unicode definition of "character" just like they relearned to count from zero.

