String Length in JS: UTF-16, Surrogates, and Emoji

Marcus Thorne • Frontend Architecture Lead • FindDevTools Security Lab

If you ask JavaScript the length of the string `"A"`, it returns `1`. If you ask it the length of the `"💩"` emoji, it returns `2`. If you ask it the length of the `"👨‍👩‍👧‍👦"` family emoji, it returns `11`. This surprising behavior breaks text truncation, validation limits, and database schemas every single day. The root of the problem lies in how JavaScript encodes strings.

The UTF-16 Legacy

When JavaScript was designed in 1995, it used the UCS-2 encoding, a fixed-width 16-bit scheme with room for 65,536 code points. That seemed like more than enough for the world's alphabets. `String.prototype.length` was defined to count 16-bit code units, not characters.

However, the Unicode standard grew well beyond 65,536 code points (most emoji live in this upper range). To represent the additional characters without breaking existing systems, UTF-16 was adopted. Any character above U+FFFF is encoded as a "surrogate pair": two 16-bit code units, a high surrogate (0xD800 to 0xDBFF) followed by a low surrogate (0xDC00 to 0xDFFF).
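The surrogate pair arithmetic is short enough to sketch by hand. The helper name `toSurrogatePair` below is hypothetical and exists only for illustration; in practice `String.fromCodePoint` does this for you.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its
// UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point must be in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;     // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F4A9); // U+1F4A9, the "💩" emoji
console.log(high.toString(16), low.toString(16)); // d83d dca9
console.log(String.fromCharCode(high, low) === "\u{1F4A9}"); // true
```

Each surrogate carries 10 bits of the offset, which is why two 16-bit units are enough to reach every code point up to U+10FFFF.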

The Impact on Developers

Because `.length` still counts 16-bit code units, any character encoded as a surrogate pair (like most emoji) reports a length of 2. Composite emoji, like the family emoji, join several such characters with invisible Zero-Width Joiner (ZWJ) characters (U+200D). The family emoji is four people at 2 code units each plus three joiners at 1 code unit each, so a single visual icon occupies 11 code units.
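To see where those numbers come from, the following sketch dumps the individual code units. The emoji are written as escape sequences so the example survives any editor encoding, and `codeUnits` is a hypothetical helper, not a built-in.

```javascript
const poo = "\u{1F4A9}"; // 💩
// man + ZWJ + woman + ZWJ + girl + ZWJ + boy
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

// Hypothetical helper: list every 16-bit code unit of a string as hex.
function codeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16).padStart(4, "0"));
  }
  return units;
}

console.log("A".length);    // 1
console.log(poo.length);    // 2
console.log(family.length); // 11
console.log(codeUnits(family).join(" "));
// d83d dc68 200d d83d dc69 200d d83d dc67 200d d83d dc66

// Naive truncation by code units can cut a pair in half,
// leaving a lone high surrogate that renders as a replacement glyph.
console.log(poo.slice(0, 1) === "\uD83D"); // true
```

The last line is the truncation bug in miniature: any `slice` or `substring` based on `.length` can leave half a character behind.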

Properly Counting Graphemes

`Array.from(string).length` or the ES6 spread syntax `[...string].length` is a first step. The string iterator defined by the language walks code points rather than code units, so it treats each surrogate pair as a single item and reports `1` for `"💩"`. It does not merge ZWJ sequences or combining marks, however, so the family emoji still counts as `7`. To count what a user actually perceives as one character (a "grapheme cluster"), use `Intl.Segmenter` with `granularity: "grapheme"`. We rely on this iterator-based expansion when building the FindDevTools character and casing converters.
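The three ways of measuring a string can be compared side by side. This is a minimal sketch that assumes a runtime with `Intl.Segmenter` (current browsers and recent Node.js releases); `countGraphemes` is a hypothetical helper name, and the `"en"` locale is an arbitrary default.

```javascript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
function countGraphemes(str, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(str)) {
    count++;
  }
  return count;
}

// man + ZWJ + woman + ZWJ + girl + ZWJ + boy
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

console.log(family.length);          // 11 (UTF-16 code units)
console.log([...family].length);     // 7  (code points: 4 people + 3 ZWJ)
console.log(countGraphemes(family)); // 1  (grapheme cluster)
```

Which count is right depends on the question being asked: storage limits usually care about code units or bytes, while a visible character counter should use grapheme clusters.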




