thrift: varint: BMI2 (pdep) based varint encoding: branchless 2-5x faster than loop unrolled
Summary:
BMI2 (`pdep`) based varint encoding that is mostly branchless and 2-5x faster than the current loop-unrolled version. Being mostly branchless, it also shows less variability in micro-benchmark runtime than the loop-unrolled version:
- The loop-unrolled versions are slowest when encoding random numbers spanning the entire 64-bit range (many of them large), where branch prediction fails most often.

Kept the fast path for values < 128 (encoded in 1 byte), which are likely to be frequent; no fully branchless version I tried performed better anyway.

TLDR:
- `u8`: unroll the two possible cases (1-byte and 2-byte encoding). Faster in micro-benchmarks than the branchless versions I tried, which needed more instructions to produce the same value without branches.
- `u16` & `u32`:
  - u16 encodes in up to 3 bytes, u32 in up to 5 bytes.
  - Use `pdep` to encode into a u64 (8 bytes). Write all 8 bytes to `QueueAppender`, but advance its size only by the bytes that actually had to be written. This is faster than appending a buffer of bytes using &u64 and a size.
  - u16 could instead be encoded with `_pdep_u32` (its 3 bytes max fit in a u32) and smaller 16-byte lookup tables. In micro-benchmarks that is not faster than reusing the same code as the u32 path based on `_pdep_u64`, and in prod the shared path should perform better because it shares lookup tables with the u32 and u64 versions (less d-cache pressure).
- `u64`: needs up to 10 bytes. Use `pdep` to encode the first 8 bytes and unconditionally write the last 2 bytes as well (while still tracking the `QueueAppender` size properly).

Reviewed By: vitaut

Differential Revision: D29250074

fbshipit-source-id: 1f6a266f45248fcbea30a62ed347564589cb3348
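The `pdep`-based scheme for u32 described above can be sketched in portable C++. This is an illustrative sketch, not the actual fbthrift code: it uses a software emulation of `pdep` so it runs without BMI2 (the real encoder uses the `_pdep_u64` intrinsic and lookup tables for the masks and sizes), it relies on the GCC/Clang `__builtin_clzll` builtin, and all names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Software stand-in for the BMI2 pdep instruction: scatters the low
// bits of `src` into the positions of the set bits of `mask`.
static uint64_t pdep_soft(uint64_t src, uint64_t mask) {
  uint64_t result = 0;
  for (uint64_t bit = 1; mask != 0; bit <<= 1) {
    uint64_t lowest = mask & -mask;  // lowest set bit of mask
    if (src & bit) result |= lowest;
    mask &= mask - 1;  // clear that bit
  }
  return result;
}

// Varint-encode a u32 into up to 5 bytes. All 8 bytes of the pdep
// result are written unconditionally, mirroring the "write 8 bytes,
// advance by fewer" trick with QueueAppender; the returned size is
// the number of bytes that actually matter.
static size_t encode_varint_u32(uint32_t value, uint8_t out[8]) {
  // Bit length of the value (| 1 avoids clz(0), which is undefined).
  int nbits = 64 - __builtin_clzll(static_cast<uint64_t>(value) | 1);
  size_t nbytes = (nbits + 6) / 7;  // 1..5 output bytes
  // Spread 7-bit groups of the value into the low 7 bits of each byte.
  uint64_t spread = pdep_soft(value, 0x7f7f7f7f7fULL);
  // Set the continuation bit (0x80) on every byte except the last.
  uint64_t cont = 0x8080808080ULL & ((1ULL << (8 * (nbytes - 1))) - 1);
  uint64_t word = spread | cont;
  for (size_t i = 0; i < 8; ++i) {
    out[i] = static_cast<uint8_t>(word >> (8 * i));
  }
  return nbytes;
}
```

A caller would write the full 8-byte word into the output buffer but only advance the append cursor by the returned `nbytes`, which is what makes the store unconditional and branch-free.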