thrift: varint: BMI2 (pdep) based varint encoding: branchless 2-5x faster than loop unrolled
Summary:
BMI2 (`pdep`) based varint encoding that is mostly branchless and 2-5x faster than the current loop-unrolled version. Being mostly branchless, it also shows less variability in micro-benchmark runtime than the loop-unrolled version:
- The loop-unrolled versions are slowest when encoding random numbers spanning the entire 64-bit range (many of them large), where branch prediction fails most often.

Kept the fast path for values < 128 (encoded in 1 byte), which are likely to be frequent; no fully branchless version I tried performed better anyway.

TLDR:
- `u8`: unroll the two possible cases (1-byte and 2-byte encoding). Faster in micro-benchmarks than the branchless versions I tried, which needed more instructions to produce the same value without branches.
- `u16` & `u32`:
  - u16 encodes in up to 3 bytes, u32 in up to 5 bytes.
  - Use `pdep` to encode into a u64 (8 bytes). Write all 8 bytes to `QueueAppender`, but advance its size only by the bytes that actually had to be written. This is faster than appending a buffer of bytes using &u64 and a size.
  - u16 could instead be encoded with `_pdep_u32` (its 3 bytes max fit in a u32) and smaller 16-byte lookup tables. In micro-benchmarks that is not faster than reusing the same code as the u32 path based on `_pdep_u64`, and in prod the shared path should perform better because it shares lookup tables with the u32 and u64 versions (less d-cache pressure).
- `u64`: needs up to 10 bytes. Use `pdep` to encode the first 8 bytes and unconditionally write the last 2 bytes as well (while still tracking the `QueueAppender` size properly).

Reviewed By: vitaut

Differential Revision: D29250074

fbshipit-source-id: 1f6a266f45248fcbea30a62ed347564589cb3348
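The `pdep`-based scheme for u32 described above can be sketched in portable C++. This is an illustrative sketch, not the actual fbthrift code: it uses a software emulation of `pdep` so it runs without BMI2 (the real encoder uses the `_pdep_u64` intrinsic and lookup tables for the masks and sizes), it relies on the GCC/Clang `__builtin_clzll` builtin, and all names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Software stand-in for the BMI2 pdep instruction: scatters the low
// bits of `src` into the positions of the set bits of `mask`.
static uint64_t pdep_soft(uint64_t src, uint64_t mask) {
  uint64_t result = 0;
  for (uint64_t bit = 1; mask != 0; bit <<= 1) {
    uint64_t lowest = mask & -mask;  // lowest set bit of mask
    if (src & bit) result |= lowest;
    mask &= mask - 1;  // clear that bit
  }
  return result;
}

// Varint-encode a u32 into up to 5 bytes. All 8 bytes of the pdep
// result are written unconditionally, mirroring the "write 8 bytes,
// advance by fewer" trick with QueueAppender; the returned size is
// the number of bytes that actually matter.
static size_t encode_varint_u32(uint32_t value, uint8_t out[8]) {
  // Bit length of the value (| 1 avoids clz(0), which is undefined).
  int nbits = 64 - __builtin_clzll(static_cast<uint64_t>(value) | 1);
  size_t nbytes = (nbits + 6) / 7;  // 1..5 output bytes
  // Spread 7-bit groups of the value into the low 7 bits of each byte.
  uint64_t spread = pdep_soft(value, 0x7f7f7f7f7fULL);
  // Set the continuation bit (0x80) on every byte except the last.
  uint64_t cont = 0x8080808080ULL & ((1ULL << (8 * (nbytes - 1))) - 1);
  uint64_t word = spread | cont;
  for (size_t i = 0; i < 8; ++i) {
    out[i] = static_cast<uint8_t>(word >> (8 * i));
  }
  return nbytes;
}
```

A caller would write the full 8-byte word into the output buffer but only advance the append cursor by the returned `nbytes`, which is what makes the store unconditional and branch-free.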