String Length in JS: UTF-16, Surrogates, and Emoji

Marcus Thorne • Frontend Architecture Lead • FindDevTools Security Lab

Ask JavaScript for the length of the string `"A"` and it returns `1`. Ask for the length of the `"💩"` emoji and it returns `2`. Ask for the length of the `"👨‍👩‍👧‍👦"` family emoji and it returns `11`. This behavior routinely breaks text truncation, validation limits, and database column sizing. The root cause is how JavaScript encodes strings: `length` counts UTF-16 code units, not the characters a user sees.

The UTF-16 Legacy

When JavaScript was designed in 1995, it used UCS-2, a fixed-width 16-bit encoding with room for 65,536 code points. This seemed like more than enough for the world's alphabets. `String.prototype.length` was defined to count 16-bit code units.

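However, Unicode soon outgrew 65,536 code points (the extra space went first to additional CJK ideographs and historic scripts, and only much later to emoji). To represent the new characters without breaking existing systems, UTF-16 was adopted. A character above U+FFFF is stored as a "surrogate pair": a high surrogate in the range U+D800..U+DBFF followed by a low surrogate in the range U+DC00..U+DFFF, two 16-bit code units that together encode one code point.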
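The surrogate-pair arithmetic can be sketched in a few lines. This is a minimal illustration, not library code; `toSurrogatePair` is a hypothetical helper name introduced only for this example.

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into its
// UTF-16 surrogate pair. Hypothetical helper, for illustration only.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("expected a code point in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;     // 20 significant bits remain
  const high = 0xd800 + (offset >> 10);   // top 10 bits -> high surrogate
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits -> low surrogate
  return [high, low];
}

const pile = "\u{1F4A9}"; // the pile-of-poo emoji, U+1F4A9
const [high, low] = toSurrogatePair(pile.codePointAt(0));

console.log(high.toString(16), low.toString(16)); // d83d dca9
console.log(pile.charCodeAt(0) === high);         // true
console.log(pile.charCodeAt(1) === low);          // true
```

The two values match what `charCodeAt` reports for each half of the string, which is why `length` sees two units for one emoji.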

The Impact on Developers

Because `String.prototype.length` still counts 16-bit code units, any character stored as a surrogate pair (like most emoji) reports a length of 2. Composite emoji, like the family emoji, go further: several emoji are joined with invisible Zero-Width Joiner (ZWJ, U+200D) characters. The family emoji is four emoji of two code units each plus three ZWJs of one code unit each, so a single visual icon occupies 4 × 2 + 3 = 11 code units in memory.
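To make the count concrete, the following sketch lists every code unit in the family emoji and shows what a naive truncation does to a surrogate pair. `listCodeUnits` is a hypothetical helper written for this example.

```javascript
// The family emoji written with escapes so the invisible joiners show up
// in source: MAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

// Collect each 16-bit code unit that .length counts, as 4-digit hex.
function listCodeUnits(text) {
  const units = [];
  for (let i = 0; i < text.length; i++) {
    units.push(text.charCodeAt(i).toString(16).padStart(4, "0"));
  }
  return units;
}

console.log(family.length); // 11
console.log(listCodeUnits(family).join(" "));
// d83d dc68 200d d83d dc69 200d d83d dc67 200d d83d dc66

// Truncating by code units can cut a surrogate pair in half, leaving a
// lone high surrogate that is not a valid character on its own.
const cut = "\u{1F4A9}".slice(0, 1);
console.log(cut.length, listCodeUnits(cut).join(" ")); // 1 d83d
```

This is the failure mode behind broken truncation: `slice`, `substring`, and index access all work in code units, so a length limit can land in the middle of a character.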

Properly Counting Graphemes

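`Array.from(string).length` and the ES6 spread syntax `[...string].length` are a common first fix, but they count code points, not visual characters. The string iterator defined by the ECMAScript specification (in every engine, not only V8) steps over each surrogate pair as a single unit, so `"💩"` counts as 1, yet the family emoji still counts as 7: four emoji plus three ZWJs. To count what a user perceives as one character (a "grapheme cluster"), use `Intl.Segmenter` with `granularity: "grapheme"`, which applies the Unicode text segmentation rules and reports the family emoji as 1. We rely on this kind of Unicode-aware iteration when building the FindDevTools character and casing converters.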
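The three ways of counting can be compared side by side. This is a minimal sketch that assumes a runtime with `Intl.Segmenter` (recent browsers and Node.js releases); note that spreading a string yields code points, which is not the same thing as grapheme clusters. `countCodePoints` and `countGraphemes` are hypothetical helper names.

```javascript
// Count Unicode code points: the string iterator keeps surrogate pairs together.
function countCodePoints(text) {
  return [...text].length;
}

// Count user-perceived characters (grapheme clusters) with Intl.Segmenter.
// The locale argument is an assumption; grapheme rules are largely
// locale-independent.
function countGraphemes(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(text)) {
    count += 1;
  }
  return count;
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
const accented = "e\u0301"; // "e" followed by a combining acute accent

console.log(family.length, countCodePoints(family), countGraphemes(family));
// 11 7 1
console.log(accented.length, countCodePoints(accented), countGraphemes(accented));
// 2 2 1
```

For input validation, pick the unit that matches the constraint: code units or bytes for storage limits, grapheme clusters for limits a user is expected to understand.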

Technical Deep Dive & Specification Reference

document are to be interpreted as described in [RFC2119]. UCS characters are designated by the U+HHHH notation, where HHHH is a string of from 4 to 6 hexadecimal digits representing the character number in ISO/IEC 10646.

Yergeau Standards Track [Page 3] RFC 3629 UTF-8 November 2003

3. UTF-8 definition

UTF-8 is defined by the Unicode Standard [UNICODE]. Descriptions and formulae can also be found in Annex D of ISO/IEC 10646-1 [ISO.10646].

In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets.

The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character number. In a sequence of n octets, n>1, the initial octet has the n higher-order bits set to 1, followed by a bit set to 0. The remaining bit(s) of that octet contain bits from the number of the character to be encoded. The following octet(s) all have the higher-order bit set to 1 and the following bit set to 0, leaving 6 bits in each to contain bits from the character to be encoded. The table below summarizes the format of these different octet types.

The letter x indicates bits available for encoding bits of the character number.

Char. number range  |        UTF-8 octet sequence
   (hexadecimal)    |              (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Encoding a character to UTF-8 proceeds as follows:

1. Determine the number of octets required from the character number and the first column of the table above. It is important to note that the rows of the table are mutually exclusive, i.e., there is only one valid way to encode a given character.

2. Prepare the high-order bits of the octets as per the second column of the table.

3. Fill in the bits marked x from the bits of the character number, expressed in binary. Start by putting the lowest-order bit of the character number in the lowest-order position of the last octet of the sequence, then put the next higher-order bit of the character number in the next higher-order position of that octet, etc. When the x bits of the last octet are filled in, move on to the next to last octet, then to the preceding one, etc. until all x bits are filled in.

The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters. When encoding in UTF-8 from UTF-16 data, it is necessary to first decode the UTF-16 data to obtain character numbers, which are then encoded in UTF-8 as described above. This contrasts with CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for use on the Internet. CESU-8 operates similarly to UTF-8 but encodes the UTF-16 code values (16-bit quantities) instead of the character number (code point). This leads to different results for character numbers above 0xFFFF; the CESU-8 encoding of those characters is NOT valid UTF-8.

Decoding a UTF-8 character proceeds as follows:

1. Initialize a binary number with all bits set to 0. Up to 21 bits may be needed.

2. Determine which bits encode the character number from the number of octets in the sequence and the second column of the table above (the bits marked x).

3. Distribute the bits from the sequence to the binary number, first the lower-order bits from the last octet of the sequence and proceeding to the left until no x bits are left. The binary number is now equal to the character number.
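The encoding and decoding procedures above can be sketched in JavaScript as follows. This is an illustrative, unoptimized sketch rather than a production codec: `encodeCodePoint`, `decodeSequence`, and `hex` are hypothetical names, and the decoder deliberately performs only the bit distribution described in the steps, with no validity checks.

```javascript
// Encode one code point into UTF-8 octets, following the table row by row.
function encodeCodePoint(cp) {
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff) {
    throw new RangeError("code point out of range");
  }
  if (cp >= 0xd800 && cp <= 0xdfff) {
    throw new RangeError("surrogates U+D800..U+DFFF must not be encoded");
  }
  if (cp <= 0x7f) return [cp];
  if (cp <= 0x7ff) return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];
  if (cp <= 0xffff) {
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

// Decode one sequence of 1 to 4 octets by collecting the bits marked x.
// Naive on purpose: it does not reject overlong or otherwise invalid input.
function decodeSequence(octets) {
  const n = octets.length;
  if (n < 1 || n > 4) throw new RangeError("expected 1 to 4 octets");
  const leadMasks = [0x7f, 0x1f, 0x0f, 0x07]; // x bits in the first octet
  let codePoint = octets[0] & leadMasks[n - 1];
  for (let i = 1; i < n; i++) {
    codePoint = (codePoint << 6) | (octets[i] & 0x3f); // 6 x bits per tail octet
  }
  return codePoint;
}

const hex = (octets) =>
  octets.map((o) => o.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(hex(encodeCodePoint(0x41)));    // 41
console.log(hex(encodeCodePoint(0x2262)));  // E2 89 A2
console.log(hex(encodeCodePoint(0x233b4))); // F0 A3 8E B4
console.log(decodeSequence([0xf0, 0xa3, 0x8e, 0xb4]).toString(16)); // 233b4

// The naive decoder accepts the overlong form C0 80 and returns U+0000.
console.log(decodeSequence([0xc0, 0x80])); // 0
```

In application code the built-in `TextEncoder` and `TextDecoder` are the better choice; the sketch only makes the bit layout visible.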

Implementations of the decoding algorithm above MUST protect against decoding invalid sequences. For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding invalid sequences may have security consequences or cause other problems. See Security Considerations (Section 10) below.

4. Syntax of UTF-8 Byte Sequences

For the convenience of implementors using ABNF, a definition of UTF-8 in ABNF syntax is given here. A UTF-8 string is a sequence of octets representing a sequence of UCS characters. An octet sequence is valid UTF-8 only if it matches the following syntax, which is derived from the rules for encoding UTF-8 and is expressed in the ABNF of [RFC2234].

UTF8-octets = *( UTF8-char )
UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1      = %x00-7F
UTF8-2      = %xC2-DF UTF8-tail
UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
              %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
              %xF4 %x80-8F 2( UTF8-tail )
UTF8-tail   = %x80-BF

NOTE -- The authoritative definition of UTF-8 is in [UNICODE]. This grammar is believed to describe the same thing Unicode describes, but does not claim to be authoritative.
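In JavaScript, one way to get this strictness without hand-writing the grammar is the platform's `TextDecoder` in fatal mode, which throws on malformed input instead of substituting U+FFFD. The sketch below assumes a runtime that provides `TextDecoder` globally; `isValidUtf8` is a hypothetical wrapper name.

```javascript
// Return true when the octets form well-formed UTF-8.
// With { fatal: true } the decoder throws on any invalid sequence.
function isValidUtf8(octets) {
  const decoder = new TextDecoder("utf-8", { fatal: true });
  try {
    decoder.decode(Uint8Array.from(octets));
    return true;
  } catch (error) {
    return false;
  }
}

console.log(isValidUtf8([0xf0, 0xa3, 0x8e, 0xb4]));             // true  (U+233B4)
console.log(isValidUtf8([0xc0, 0x80]));                         // false (overlong U+0000)
console.log(isValidUtf8([0xed, 0xa1, 0x8c, 0xed, 0xbe, 0xb4])); // false (encoded surrogates)
console.log(isValidUtf8([0xf4, 0x90, 0x80, 0x80]));             // false (above U+10FFFF)
```

Without the `fatal` option the same decoder silently replaces bad input with U+FFFD, which is safer than accepting it but can hide data corruption.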

Implementors are urged to rely on the authoritative source, rather than on this ABNF.

5. Versions of the standards

ISO/IEC 10646 is updated from time to time by publication of amendments and additional parts; similarly, new versions of the Unicode standard are published over time. Each new version obsoletes and replaces the previous one, but implementations, and more significantly data, are not updated instantly. In general, the changes amount to adding new characters, which does not pose particular problems with old data.

In 1996, Amendment 5 to the 1993 edition of ISO/IEC 10646 and Unicode 2.0 moved and expanded the Korean Hangul block, thereby making any previous data containing Hangul characters invalid under the new version. Unicode 2.0 has the same difference from Unicode 1.1. The justification for allowing such an incompatible change was that there were no major implementations and no significant amounts of data containing Hangul. The incident has been dubbed the "Korean mess", and the relevant committees have pledged to never, ever again make such an incompatible change (see Unicode Consortium Policies [1]). New versions, and in particular any incompatible changes, have consequences regarding MIME charset labels, to be discussed in MIME registration (Section 8).

6. Byte order mark (BOM)

The UCS character U+FEFF "ZERO WIDTH NO-BREAK SPACE" is also known informally as "BYTE ORDER MARK" (abbreviated "BOM"). This character can be used as a genuine "ZERO WIDTH NO-BREAK SPACE" within text, but the BOM name hints at a second possible usage of the character: to prepend a U+FEFF character to a stream of UCS characters as a "signature". A receiver of such a serialized stream may then use the initial character as a hint that the stream consists of UCS characters and also to recognize which UCS encoding is involved and, with encodings having a multi-octet encoding unit, as a way to recognize the serialization order of the octets. UTF-8 having a single-octet encoding unit, this last function is useless and the BOM will always appear as the octet sequence EF BB BF.

It is important to understand that the character U+FEFF appearing at any position other than the beginning of a stream MUST be interpreted with the semantics for the zero-width non-breaking space, and MUST NOT be interpreted as a signature. When interpreted as a signature, the Unicode standard suggests that an initial U+FEFF character may be stripped before processing the text. Such stripping is necessary in some cases (e.g., when concatenating two strings, because otherwise the resulting string may contain an unintended "ZERO WIDTH NO-BREAK SPACE" at the connection point), but might affect an external process at a different layer (such as a digital signature or a count of the characters) that is relying on the presence of all characters in the stream. It is therefore RECOMMENDED to avoid stripping an initial U+FEFF interpreted as a signature without a good reason, to ignore it instead of stripping it when appropriate (such as for display) and to strip it only when really necessary. U+FEFF in the first position of a stream MAY be interpreted as a zero-width non-breaking space, and is not always a signature.
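The effect of stripping on a character count is easy to observe in JavaScript. This sketch assumes the platform `TextDecoder`, which removes a leading BOM by default and keeps it when the `ignoreBOM` option is `true`; the sample octets are a UTF-8 BOM followed by the encoding of U+233B4.

```javascript
// EF BB BF (BOM) followed by F0 A3 8E B4 (U+233B4).
const withBom = Uint8Array.from([0xef, 0xbb, 0xbf, 0xf0, 0xa3, 0x8e, 0xb4]);

// Default: an initial U+FEFF is treated as a signature and removed.
const stripped = new TextDecoder("utf-8").decode(withBom);

// ignoreBOM: true keeps the initial U+FEFF in the decoded string.
const kept = new TextDecoder("utf-8", { ignoreBOM: true }).decode(withBom);

console.log(stripped.length);                 // 2 (one surrogate pair)
console.log(kept.length);                     // 3 (U+FEFF plus the pair)
console.log(kept.charCodeAt(0).toString(16)); // feff

// U+FEFF anywhere other than the start is never removed.
const middle = new TextDecoder("utf-8").decode(
  Uint8Array.from([0x41, 0xef, 0xbb, 0xbf, 0x42])
);
console.log(middle.length);                   // 3 ("A", U+FEFF, "B")
```

A length check or a signature computed over the decoded text will therefore differ depending on which decoder option produced the string.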

In an attempt at diminishing this uncertainty, Unicode 3.2 adds a new character, U+2060 "WORD JOINER", with exactly the same semantics and usage as U+FEFF except for the signature function, and strongly recommends its exclusive use for expressing word-joining semantics. Eventually, following this recommendation will make it all but certain that any initial U+FEFF is a signature, not an intended "ZERO WIDTH NO-BREAK SPACE".

In the meantime, the uncertainty unfortunately remains and may affect Internet protocols. Protocol specifications MAY restrict usage of U+FEFF as a signature in order to reduce or eliminate the potential ill effects of this uncertainty. In the interest of striking a balance between the advantages (reduction of uncertainty) and drawbacks (loss of the signature function) of such restrictions, it is useful to distinguish a few cases:

o A protocol SHOULD forbid use of U+FEFF as a signature for those textual protocol elements that the protocol mandates to be always UTF-8, the signature function being totally useless in those cases.

o A protocol SHOULD also forbid use of U+FEFF as a signature for those textual protocol elements for which the protocol provides character encoding identification mechanisms, when it is expected that implementations of the protocol will be in a position to always use the mechanisms properly. This will be the case when the protocol elements are maintained tightly under the control of the implementation from the time of their creation to the time of their (properly labeled) transmission.

o A protocol SHOULD NOT forbid use of U+FEFF as a signature for those textual protocol elements for which the protocol does not provide character encoding identification mechanisms, when a ban would be unenforceable, or when it is expected that implementations of the protocol will not be in a position to always use the mechanisms properly. The latter two cases are likely to occur with larger protocol elements such as MIME entities, especially when implementations of the protocol will obtain such entities from file systems, from protocols that do not have encoding identification mechanisms for payloads (such as FTP) or from other protocols that do not guarantee proper identification of character encoding (such as HTTP).

When a protocol forbids use of U+FEFF as a signature for a certain protocol element, then any initial U+FEFF in that protocol element MUST be interpreted as a "ZERO WIDTH NO-BREAK SPACE".

When a protocol does NOT forbid use of U+FEFF as a signature for a certain protocol element, then implementations SHOULD be prepared to handle a signature in that element and react appropriately: using the signature to identify the character encoding as necessary and stripping or ignoring the signature as appropriate.

7. Examples

The character sequence U+0041 U+2262 U+0391 U+002E "A<NOT IDENTICAL TO><ALPHA>." is encoded in UTF-8 as follows:

--+--------+-----+--
41 E2 89 A2 CE 91 2E
--+--------+-----+--

The character sequence U+D55C U+AD6D U+C5B4 (Korean "hangugeo", meaning "the Korean language") is encoded in UTF-8 as follows:

--------+--------+--------
ED 95 9C EA B5 AD EC 96 B4
--------+--------+--------

The character sequence U+65E5 U+672C U+8A9E (Japanese "nihongo", meaning "the Japanese language") is encoded in UTF-8 as follows:

--------+--------+--------
E6 97 A5 E6 9C AC E8 AA 9E
--------+--------+--------

The character U+233B4 (a Chinese character meaning 'stump of tree'), prepended with a UTF-8 BOM, is encoded in UTF-8 as follows:

--------+-----------
EF BB BF F0 A3 8E B4
--------+-----------

8. MIME registration

This memo serves as the basis for registration of the MIME charset parameter for UTF-8, according to [RFC2978]. The charset parameter value is "UTF-8".

This string labels media types containing text consisting of characters from.