folly: utf8ToCodePoint: enforce max valid code point is U+10FFFF - return U+FFFD / throw for well-formed UTF-8 encoded values that are larger than the max code point

Summary:
UTF-8 can encode large numbers, but Unicode code points are only defined up to `U+10FFFF`.

For example:
- the 4B UTF-8 encoding `"\xF6\x8D\x9B\xBC"` (bits: `11110110 10001101 10011011 10111100`) is a well-formed UTF-8 encoding
- but the encoded value is `U+18D6FC`, which is larger than `U+10FFFF`

With `opts.skip_invalid_utf8 = true;`, `json::serialize` should have returned `"\ufffd"` since the input is invalid, but due to a bug in `utf8ToCodePoint` it returned the incorrect `"\"\xF6\x8D\x9B\xBC\""`.

Update `utf8ToCodePoint` to also reject 4 byte UTF-8 encoded values larger than the max Unicode code point (`U+10FFFF`).

Reviewed By: luciang

Differential Revision: D25275722

fbshipit-source-id: e7daeea834a0c5323923a5451a2565ceff5e4734
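
The range check described in the summary can be illustrated with a minimal standalone sketch. This is not folly's `utf8ToCodePoint`: the helper `decodeFourByte` and the constants `kReplacement` and `kMaxCodePoint` are hypothetical names introduced here. The sketch handles only 4-byte sequences and assumes a C++14 compiler.

```cpp
#include <cstdint>

// Hypothetical illustration only; this is NOT folly's utf8ToCodePoint.
constexpr char32_t kReplacement = 0xFFFD;    // U+FFFD REPLACEMENT CHARACTER
constexpr char32_t kMaxCodePoint = 0x10FFFF; // largest Unicode code point

// A continuation byte has the form 10xxxxxx.
constexpr bool isContinuation(std::uint8_t b) {
  return (b & 0xC0) == 0x80;
}

// Decodes one 4-byte UTF-8 sequence. Returns U+FFFD when the bytes do not
// match the 4-byte pattern or the decoded value is out of range.
constexpr char32_t decodeFourByte(
    std::uint8_t b0, std::uint8_t b1, std::uint8_t b2, std::uint8_t b3) {
  // The lead byte must look like 11110xxx, followed by three continuations.
  if ((b0 & 0xF8) != 0xF0 || !isContinuation(b1) || !isContinuation(b2) ||
      !isContinuation(b3)) {
    return kReplacement;
  }
  // Assemble the 3 + 6 + 6 + 6 payload bits.
  char32_t cp = (char32_t(b0 & 0x07) << 18) | (char32_t(b1 & 0x3F) << 12) |
      (char32_t(b2 & 0x3F) << 6) | char32_t(b3 & 0x3F);
  // Reject overlong forms (below U+10000) and values above U+10FFFF.
  if (cp < 0x10000 || cp > kMaxCodePoint) {
    return kReplacement;
  }
  return cp;
}
```

With this sketch, `decodeFourByte(0xF6, 0x8D, 0x9B, 0xBC)` decodes the payload to `0x18D6FC`, fails the upper-bound check, and yields `U+FFFD`, while `decodeFourByte(0xF4, 0x8F, 0xBF, 0xBF)` yields the maximum code point `U+10FFFF`.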