folly: unicode: support 4 byte unicode code points
Summary: Support 4-byte UTF-8 strings. https://en.wikipedia.org/wiki/UTF-8 | Number of bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | 1 | 7 | U+0000 | U+007F | 0xxxxxxx | 2 | 11 | U+0080 | U+07FF | 110xxxxx | 10xxxxxx | 3 | 16 | U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | 4 | 21 | U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx `utf8ToCodePoint` now correctly returns the code point for 4-byte UTF-8 strings. The JSON standard (http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf) says: > Any code point may be represented as a hexadecimal escape sequence. The meaning of such a hexadecimal number is determined by ISO/IEC 10646. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point. Hexadecimal digits can be This was the behavior implemented for 2-3-byte UTF-8 strings which all fit in the U+0000 through U+FFFF range (see above diagram). > To escape a code point that is not in the Basic Multilingual Plane, the character may be represented as a twelve-character sequence, encoding the UTF-16 surrogate pair corresponding to the code point. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". However, whether a processor of JSON texts interprets such a surrogate pair as a single code point or as an explicit surrogate pair is a semantic decision that is determined by the specific processor. When `encode_non_ascii == true` we now also support encoding 4-byte UTF-8 strings as 2 UTF-16 surrogate pairs. Reviewed By: ot Differential Revision: D9357479 fbshipit-source-id: 5c47bee4e71d5888b8264d8d3524d2084769adb0
Showing
folly/test/UnicodeTest.cpp
0 → 100644
Please register or sign in to comment