integer division with controlled rounding in Math.h

Summary: C++'s integer division performs truncation, but it is fairly common that people actually want to round up. There are actually four rounding modes that make some sense: toward negative infinity (floor), toward positive infinity (ceil), toward zero (truncation), and away from zero (?). It is pretty common that code that wants ceil actually writes (a + b - 1) / b, which doesn't work at all for negative values (and has a potential overflow issue). This diff adds 4 templated functions for performing integer division: divFloor, divCeil, divTrunc, and divRoundAway. They are not subject to unnecessary internal underflow or overflow, and they work correctly across their entire input domain. I did a bit of benchmarking across x86_64, arm64, and 32-bit ARM. Only 32-bit ARM was different. That's not surprising since it doesn't have an integer division instruction, and the function that implements integer division doesn't produce the remainder for free. On 32-bit ARM a branchful version that doesn't need the modulus is used. Reviewed By: yfeldblum Differential Revision: D3806743 fbshipit-source-id: c14c56717e96f135321920e64acbfe9dcb1fe039

integer division with controlled rounding in Math.h
Summary: C++'s integer division performs truncation, but it is fairly common that people actually want to round up. There are actually four rounding modes that make some sense: toward negative infinity (floor), toward positive infinity (ceil), toward zero (truncation), and away from zero (?). It is pretty common that code that wants ceil actually writes (a + b - 1) / b, which doesn't work at all for negative values (and has a potential overflow issue). This diff adds 4 templated functions for performing integer division: divFloor, divCeil, divTrunc, and divRoundAway. They are not subject to unnecessary internal underflow or overflow, and they work correctly across their entire input domain. I did a bit of benchmarking across x86_64, arm64, and 32-bit ARM. Only 32-bit ARM was different. That's not surprising since it doesn't have an integer division instruction, and the function that implements integer division doesn't produce the remainder for free. On 32-bit ARM a branchful version that doesn't need the modulus is used. Reviewed By: yfeldblum Differential Revision: D3806743 fbshipit-source-id: c14c56717e96f135321920e64acbfe9dcb1fe039
1d9d5cbe · Nathan Bronson · Facebook Github Bot 4 · f6b5f78b · 1d9d5cbe · 1d9d5cbe
Commit 1d9d5cbe authored Sep 07, 2016 by Nathan Bronson Committed by Facebook Github Bot 4 Sep 07, 2016
5 changed files
--- a/folly/Makefile.am
+++ b/folly/Makefile.am
@@ -254,6 +254,7 @@ nobase_follyinclude_HEADERS = \
 	MallctlHelper.h \
 	Malloc.h \
 	MapUtil.h \
+	Math.h \
 	Memory.h \
 	MemoryMapping.h \
 	MicroSpinLock.h \

--- a/folly/Math.h
+++ b/folly/Math.h
+/*
+ * Copyright 2016 Facebook, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * Some arithmetic functions that seem to pop up or get hand-rolled a lot.
+ * So far they are all focused on integer division.
+ */
+
+#pragma once
+
+#include <stdint.h>
+
+#include <limits>
+#include <type_traits>
+
+namespace folly {
+
+namespace detail {
+
+template <typename T>
+inline constexpr T divFloorBranchless(T num, T denom) {
+  // floor != trunc when the answer isn't exact and truncation went the
+  // wrong way (truncation went toward positive infinity).  That happens
+  // when the true answer is negative, which happens when num and denom
+  // have different signs.  The following code compiles branch-free on
+  // many platforms.
+  return (num / denom) +
+      ((num % denom) != 0 ? 1 : 0) *
+      (std::is_signed<T>::value && (num ^ denom) < 0 ? -1 : 0);
+}
+
+template <typename T>
+inline constexpr T divFloorBranchful(T num, T denom) {
+  // First case handles negative result by preconditioning numerator.
+  // Preconditioning decreases the magnitude of the numerator, which is
+  // itself sign-dependent.  Second case handles zero or positive rational
+  // result, where trunc and floor are the same.
+  return std::is_signed<T>::value && (num ^ denom) < 0 && num != 0
+      ? (num + (num > 0 ? -1 : 1)) / denom - 1
+      : num / denom;
+}
+
+template <typename T>
+inline constexpr T divCeilBranchless(T num, T denom) {
+  // ceil != trunc when the answer isn't exact (truncation occurred)
+  // and truncation went away from positive infinity.  That happens when
+  // the true answer is positive, which happens when num and denom have
+  // the same sign.
+  return (num / denom) +
+      ((num % denom) != 0 ? 1 : 0) *
+      (std::is_signed<T>::value && (num ^ denom) < 0 ? 0 : 1);
+}
+
+template <typename T>
+inline constexpr T divCeilBranchful(T num, T denom) {
+  // First case handles negative or zero rational result, where trunc and ceil
+  // are the same.
+  // Second case handles positive result by preconditioning numerator.
+  // Preconditioning decreases the magnitude of the numerator, which is
+  // itself sign-dependent.
+  return (std::is_signed<T>::value && (num ^ denom) < 0) || num == 0
+      ? num / denom
+      : (num + (num > 0 ? -1 : 1)) / denom + 1;
+}
+
+template <typename T>
+inline constexpr T divRoundAwayBranchless(T num, T denom) {
+  // away != trunc whenever truncation actually occurred, which is when
+  // there is a non-zero remainder.  If the unrounded result is negative
+  // then fixup moves it toward negative infinity.  If the unrounded
+  // result is positive then adjustment makes it larger.
+  return (num / denom) +
+      ((num % denom) != 0 ? 1 : 0) *
+      (std::is_signed<T>::value && (num ^ denom) < 0 ? -1 : 1);
+}
+
+template <typename T>
+inline constexpr T divRoundAwayBranchful(T num, T denom) {
+  // First case of second ternary operator handles negative rational
+  // result, which is the same as divFloor.  Second case of second ternary
+  // operator handles positive result, which is the same as divCeil.
+  // Zero case is separated for simplicity.
+  return num == 0 ? 0
+                  : (num + (num > 0 ? -1 : 1)) / denom +
+          (std::is_signed<T>::value && (num ^ denom) < 0 ? -1 : 1);
+}
+
+template <typename N, typename D>
+using IdivResultType = typename std::enable_if<
+    std::is_integral<N>::value && std::is_integral<D>::value &&
+        !std::is_same<N, bool>::value &&
+        !std::is_same<D, bool>::value,
+    decltype(N{1} / D{1})>::type;
+}
+
+#if defined(__arm__) && !FOLLY_A64
+constexpr auto kIntegerDivisionGivesRemainder = false;
+#else
+constexpr auto kIntegerDivisionGivesRemainder = true;
+#endif
+
+/**
+ * Returns num/denom, rounded toward negative infinity.  Put another way,
+ * returns the largest integral value that is less than or equal to the
+ * exact (not rounded) fraction num/denom.
+ *
+ * The matching remainder (num - divFloor(num, denom) * denom) can be
+ * negative only if denom is negative, unlike in truncating division.
+ * Note that for unsigned types this is the same as the normal integer
+ * division operator.  divFloor is equivalent to python's integral division
+ * operator //.
+ *
+ * This function undergoes the same integer promotion rules as a
+ * built-in operator, except that we don't allow bool -> int promotion.
+ * This function is undefined if denom == 0.  It is also undefined if the
+ * result type T is a signed type, num is std::numeric_limits<T>::min(),
+ * and denom is equal to -1 after conversion to the result type.
+ */
+template <typename N, typename D>
+inline constexpr detail::IdivResultType<N, D> divFloor(N num, D denom) {
+  using R = decltype(num / denom);
+  return kIntegerDivisionGivesRemainder && std::is_signed<R>::value
+      ? detail::divFloorBranchless<R>(num, denom)
+      : detail::divFloorBranchful<R>(num, denom);
+}
+
+/**
+ * Returns num/denom, rounded toward positive infinity.  Put another way,
+ * returns the smallest integral value that is greater than or equal to
+ * the exact (not rounded) fraction num/denom.
+ *
+ * This function undergoes the same integer promotion rules as a
+ * built-in operator, except that we don't allow bool -> int promotion.
+ * This function is undefined if denom == 0.  It is also undefined if the
+ * result type T is a signed type, num is std::numeric_limits<T>::min(),
+ * and denom is equal to -1 after conversion to the result type.
+ */
+template <typename N, typename D>
+inline constexpr detail::IdivResultType<N, D> divCeil(N num, D denom) {
+  using R = decltype(num / denom);
+  return kIntegerDivisionGivesRemainder && std::is_signed<R>::value
+      ? detail::divCeilBranchless<R>(num, denom)
+      : detail::divCeilBranchful<R>(num, denom);
+}
+
+/**
+ * Returns num/denom, rounded toward zero.  If num and denom are non-zero
+ * and have different signs (so the unrounded fraction num/denom is
+ * negative), returns divCeil, otherwise returns divFloor.  If T is an
+ * unsigned type then this is always equal to divFloor.
+ *
+ * Note that this is the same as the normal integer division operator,
+ * at least since C99 (before then the rounding for negative results was
+ * implementation defined).  This function is here for completeness and
+ * as a place to hang this comment.
+ *
+ * This function undergoes the same integer promotion rules as a
+ * built-in operator, except that we don't allow bool -> int promotion.
+ * This function is undefined if denom == 0.  It is also undefined if the
+ * result type T is a signed type, num is std::numeric_limits<T>::min(),
+ * and denom is equal to -1 after conversion to the result type.
+ */
+template <typename N, typename D>
+inline constexpr detail::IdivResultType<N, D> divTrunc(N num, D denom) {
+  return num / denom;
+}
+
+/**
+ * Returns num/denom, rounded away from zero.  If num and denom are
+ * non-zero and have different signs (so the unrounded fraction num/denom
+ * is negative), returns divFloor, otherwise returns divCeil.  If T is
+ * an unsigned type then this is always equal to divCeil.
+ *
+ * This function undergoes the same integer promotion rules as a
+ * built-in operator, except that we don't allow bool -> int promotion.
+ * This function is undefined if denom == 0.  It is also undefined if the
+ * result type T is a signed type, num is std::numeric_limits<T>::min(),
+ * and denom is equal to -1 after conversion to the result type.
+ */
+template <typename N, typename D>
+inline constexpr detail::IdivResultType<N, D> divRoundAway(N num, D denom) {
+  using R = decltype(num / denom);
+  return kIntegerDivisionGivesRemainder && std::is_signed<R>::value
+      ? detail::divRoundAwayBranchless<R>(num, denom)
+      : detail::divRoundAwayBranchful<R>(num, denom);
+}
+
+} // namespace folly
--- a/folly/test/Makefile.am
+++ b/folly/test/Makefile.am
@@ -12,6 +12,7 @@ TESTS= \
 	conv_test \
 	expected_test \
 	range_test \
+	math_test \
 	bits_test \
 	bit_iterator_test

@@ -135,6 +136,9 @@ expected_test_LDADD = libfollytestmain.la $(top_builddir)/libfollybenchmark.la
 range_test_SOURCES = RangeTest.cpp
 range_test_LDADD = libfollytestmain.la

+math_test_SOURCES = MathTest.cpp
+math_test_LDADD = libfollytestmain.la $(top_builddir)/libfollybenchmark.la
+
 bits_test_SOURCES = BitsTest.cpp
 bits_test_LDADD = libfollytestmain.la $(top_builddir)/libfollybenchmark.la


--- a/folly/test/MathBenchmark.cpp
+++ b/folly/test/MathBenchmark.cpp
+/*
+ * Copyright 2016 Facebook, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <folly/Math.h>
+
+#include <algorithm>
+
+#include <folly/Benchmark.h>
+
+namespace {
+template <typename T>
+T brokenButWidespreadDivCeil(T num, T denom) {
+  return (num + denom - 1) / denom;
+}
+
+template <typename T>
+T viaFloatDivCeil(T num, T denom) {
+  return static_cast<T>(ceilf(static_cast<float>(num) / denom));
+}
+
+template <typename T>
+T viaDoubleDivCeil(T num, T denom) {
+  return static_cast<T>(ceil(static_cast<double>(num) / denom));
+}
+
+template <typename T>
+T viaLongDoubleDivCeil(T num, T denom) {
+  return static_cast<T>(ceill(static_cast<long double>(num) / denom));
+}
+
+template <typename T>
+std::vector<T> divValues() {
+  std::vector<T> rv;
+  for (T i = 1; i < std::numeric_limits<T>::max() && i <= 1000; ++i) {
+    rv.push_back(i);
+    rv.push_back(-i);
+    rv.push_back(std::numeric_limits<T>::max() / i);
+    auto x = std::numeric_limits<T>::min() / i;
+    if (x != 0) {
+      rv.push_back(x);
+    }
+  }
+  return rv;
+}
+
+template <typename T, typename F>
+void runDivTests(const F& func, size_t iters) {
+  std::vector<T> denoms;
+  std::vector<T> numers;
+  BENCHMARK_SUSPEND {
+    denoms = divValues<T>();
+    numers = denoms;
+    numers.push_back(0);
+    std::mt19937 rnd(1234);
+    std::shuffle(denoms.begin(), denoms.end(), rnd);
+    std::shuffle(numers.begin(), numers.end(), rnd);
+  }
+  T dep = 0;
+  while (true) {
+    for (T d : denoms) {
+      for (T n : numers) {
+        n ^= dep;
+        if (std::is_signed<T>::value && n == std::numeric_limits<T>::min() &&
+            d == -1) {
+          // min / -1 overflows in two's complement
+          d = -2;
+        }
+        dep = func(n, d);
+
+        if (--iters == 0) {
+          folly::doNotOptimizeAway(dep);
+          return;
+        }
+      }
+    }
+  }
+}
+}
+
+BENCHMARK_DRAW_LINE();
+BENCHMARK(divTruncInt8, iters) {
+  runDivTests<int8_t>(&folly::divTrunc<int8_t, int8_t>, iters);
+}
+BENCHMARK(divFloorInt8, iters) {
+  runDivTests<int8_t>(&folly::divFloor<int8_t, int8_t>, iters);
+}
+BENCHMARK(divCeilInt8, iters) {
+  runDivTests<int8_t>(&folly::divCeil<int8_t, int8_t>, iters);
+}
+BENCHMARK_RELATIVE(branchlessDivCeilInt8, iters) {
+  runDivTests<int8_t>(&folly::detail::divCeilBranchless<int8_t>, iters);
+}
+BENCHMARK_RELATIVE(branchfulDivCeilInt8, iters) {
+  runDivTests<int8_t>(&folly::detail::divCeilBranchful<int8_t>, iters);
+}
+BENCHMARK_RELATIVE(brokenButWidespreadDivCeilInt8, iters) {
+  runDivTests<int8_t>(&brokenButWidespreadDivCeil<int8_t>, iters);
+}
+BENCHMARK_RELATIVE(viaFloatDivCeilInt8, iters) {
+  runDivTests<int8_t>(&viaFloatDivCeil<int8_t>, iters);
+}
+BENCHMARK_RELATIVE(viaDoubleDivCeilInt8, iters) {
+  runDivTests<int8_t>(&viaDoubleDivCeil<int8_t>, iters);
+}
+BENCHMARK_RELATIVE(viaLongDoubleDivCeilInt8, iters) {
+  runDivTests<int8_t>(&viaLongDoubleDivCeil<int8_t>, iters);
+}
+BENCHMARK(divRoundAwayInt8, iters) {
+  runDivTests<int8_t>(&folly::divRoundAway<int8_t, int8_t>, iters);
+}
+
+BENCHMARK_DRAW_LINE();
+BENCHMARK(divTruncInt16, iters) {
+  runDivTests<int16_t>(&folly::divTrunc<int16_t, int16_t>, iters);
+}
+BENCHMARK(divFloorInt16, iters) {
+  runDivTests<int16_t>(&folly::divFloor<int16_t, int16_t>, iters);
+}
+BENCHMARK(divCeilInt16, iters) {
+  runDivTests<int16_t>(&folly::divCeil<int16_t, int16_t>, iters);
+}
+BENCHMARK_RELATIVE(branchlessDivCeilInt16, iters) {
+  runDivTests<int16_t>(&folly::detail::divCeilBranchless<int16_t>, iters);
+}
+BENCHMARK_RELATIVE(branchfulDivCeilInt16, iters) {
+  runDivTests<int16_t>(&folly::detail::divCeilBranchful<int16_t>, iters);
+}
+BENCHMARK_RELATIVE(brokenButWidespreadDivCeilInt16, iters) {
+  runDivTests<int16_t>(&brokenButWidespreadDivCeil<int16_t>, iters);
+}
+BENCHMARK_RELATIVE(viaFloatDivCeilInt16, iters) {
+  runDivTests<int16_t>(&viaFloatDivCeil<int16_t>, iters);
+}
+BENCHMARK_RELATIVE(viaDoubleDivCeilInt16, iters) {
+  runDivTests<int16_t>(&viaDoubleDivCeil<int16_t>, iters);
+}
+BENCHMARK_RELATIVE(viaLongDoubleDivCeilInt16, iters) {
+  runDivTests<int16_t>(&viaLongDoubleDivCeil<int16_t>, iters);
+}
+BENCHMARK(divRoundAwayInt16, iters) {
+  runDivTests<int16_t>(&folly::divRoundAway<int16_t, int16_t>, iters);
+}
+
+BENCHMARK_DRAW_LINE();
+BENCHMARK(divTruncInt32, iters) {
+  runDivTests<int32_t>(&folly::divTrunc<int32_t, int32_t>, iters);
+}
+BENCHMARK(divFloorInt32, iters) {
+  runDivTests<int32_t>(&folly::divFloor<int32_t, int32_t>, iters);
+}
+BENCHMARK(divCeilInt32, iters) {
+  runDivTests<int32_t>(&folly::divCeil<int32_t, int32_t>, iters);
+}
+BENCHMARK_RELATIVE(branchlessDivCeilInt32, iters) {
+  runDivTests<int32_t>(&folly::detail::divCeilBranchless<int32_t>, iters);
+}
+BENCHMARK_RELATIVE(branchfulDivCeilInt32, iters) {
+  runDivTests<int32_t>(&folly::detail::divCeilBranchful<int32_t>, iters);
+}
+BENCHMARK_RELATIVE(brokenButWidespreadDivCeilInt32, iters) {
+  runDivTests<int32_t>(&brokenButWidespreadDivCeil<int32_t>, iters);
+}
+BENCHMARK_RELATIVE(approxViaFloatDivCeilInt32, iters) {
+  runDivTests<int32_t>(&viaFloatDivCeil<int32_t>, iters);
+}
+BENCHMARK_RELATIVE(viaDoubleDivCeilInt32, iters) {
+  runDivTests<int32_t>(&viaDoubleDivCeil<int32_t>, iters);
+}
+BENCHMARK_RELATIVE(viaLongDoubleDivCeilInt32, iters) {
+  runDivTests<int32_t>(&viaLongDoubleDivCeil<int32_t>, iters);
+}
+BENCHMARK(divRoundAwayInt32, iters) {
+  runDivTests<int32_t>(&folly::divRoundAway<int32_t, int32_t>, iters);
+}
+
+BENCHMARK_DRAW_LINE();
+BENCHMARK(divTruncInt64, iters) {
+  runDivTests<int64_t>(&folly::divTrunc<int64_t, int64_t>, iters);
+}
+BENCHMARK(divFloorInt64, iters) {
+  runDivTests<int64_t>(&folly::divFloor<int64_t, int64_t>, iters);
+}
+BENCHMARK(divCeilInt64, iters) {
+  runDivTests<int64_t>(&folly::divCeil<int64_t, int64_t>, iters);
+}
+BENCHMARK_RELATIVE(branchlessDivCeilInt64, iters) {
+  runDivTests<int64_t>(&folly::detail::divCeilBranchless<int64_t>, iters);
+}
+BENCHMARK_RELATIVE(branchfulDivCeilInt64, iters) {
+  runDivTests<int64_t>(&folly::detail::divCeilBranchful<int64_t>, iters);
+}
+BENCHMARK_RELATIVE(brokenButWidespreadDivCeilInt64, iters) {
+  runDivTests<int64_t>(&brokenButWidespreadDivCeil<int64_t>, iters);
+}
+BENCHMARK_RELATIVE(approxViaFloatDivCeilInt64, iters) {
+  runDivTests<int64_t>(&viaFloatDivCeil<int64_t>, iters);
+}
+BENCHMARK_RELATIVE(approxViaDoubleDivCeilInt64, iters) {
+  runDivTests<int64_t>(&viaDoubleDivCeil<int64_t>, iters);
+}
+BENCHMARK_RELATIVE(viaLongDoubleDivCeilInt64, iters) {
+  runDivTests<int64_t>(&viaLongDoubleDivCeil<int64_t>, iters);
+}
+BENCHMARK(divRoundAwayInt64, iters) {
+  runDivTests<int64_t>(&folly::divRoundAway<int64_t, int64_t>, iters);
+}
+
+BENCHMARK_DRAW_LINE();
+BENCHMARK(divTruncUint8, iters) {
+  runDivTests<uint8_t>(&folly::divTrunc<uint8_t, uint8_t>, iters);
+}
+BENCHMARK(divFloorUint8, iters) {
+  runDivTests<uint8_t>(&folly::divFloor<uint8_t, uint8_t>, iters);
+}
+BENCHMARK(divCeilUint8, iters) {
+  runDivTests<uint8_t>(&folly::divCeil<uint8_t, uint8_t>, iters);
+}
+BENCHMARK_RELATIVE(branchlessDivCeilUint8, iters) {
+  runDivTests<uint8_t>(&folly::detail::divCeilBranchless<uint8_t>, iters);
+}
+BENCHMARK_RELATIVE(branchfulDivCeilUint8, iters) {
+  runDivTests<uint8_t>(&folly::detail::divCeilBranchful<uint8_t>, iters);
+}
+BENCHMARK_RELATIVE(brokenButWidespreadDivCeilUint8, iters) {
+  runDivTests<uint8_t>(&brokenButWidespreadDivCeil<uint8_t>, iters);
+}
+BENCHMARK_RELATIVE(viaFloatDivCeilUint8, iters) {
+  runDivTests<uint8_t>(&viaFloatDivCeil<uint8_t>, iters);
+}
+BENCHMARK_RELATIVE(viaDoubleDivCeilUint8, iters) {
+  runDivTests<uint8_t>(&viaDoubleDivCeil<uint8_t>, iters);
+}
+BENCHMARK_RELATIVE(viaLongDoubleDivCeilUint8, iters) {
+  runDivTests<uint8_t>(&viaLongDoubleDivCeil<uint8_t>, iters);
+}
+BENCHMARK(divRoundAwayUint8, iters) {
+  runDivTests<uint8_t>(&folly::divRoundAway<uint8_t, uint8_t>, iters);
+}
+
+BENCHMARK_DRAW_LINE();
+BENCHMARK(divTruncUint16, iters) {
+  runDivTests<uint16_t>(&folly::divTrunc<uint16_t, uint16_t>, iters);
+}
+BENCHMARK(divFloorUint16, iters) {
+  runDivTests<uint16_t>(&folly::divFloor<uint16_t, uint16_t>, iters);
+}
+BENCHMARK(divCeilUint16, iters) {
+  runDivTests<uint16_t>(&folly::divCeil<uint16_t, uint16_t>, iters);
+}
+BENCHMARK_RELATIVE(branchlessDivCeilUint16, iters) {
+  runDivTests<uint16_t>(&folly::detail::divCeilBranchless<uint16_t>, iters);
+}
+BENCHMARK_RELATIVE(branchfulDivCeilUint16, iters) {
+  runDivTests<uint16_t>(&folly::detail::divCeilBranchful<uint16_t>, iters);
+}
+BENCHMARK_RELATIVE(brokenButWidespreadDivCeilUint16, iters) {
+  runDivTests<uint16_t>(&brokenButWidespreadDivCeil<uint16_t>, iters);
+}
+BENCHMARK_RELATIVE(viaFloatDivCeilUint16, iters) {
+  runDivTests<uint16_t>(&viaFloatDivCeil<uint16_t>, iters);
+}
+BENCHMARK_RELATIVE(viaDoubleDivCeilUint16, iters) {
+  runDivTests<uint16_t>(&viaDoubleDivCeil<uint16_t>, iters);
+}
+BENCHMARK_RELATIVE(viaLongDoubleDivCeilUint16, iters) {
+  runDivTests<uint16_t>(&viaLongDoubleDivCeil<uint16_t>, iters);
+}
+BENCHMARK(divRoundAwayUint16, iters) {
+  runDivTests<uint16_t>(&folly::divRoundAway<uint16_t, uint16_t>, iters);
+}
+
+BENCHMARK_DRAW_LINE();
+BENCHMARK(divTruncUint32, iters) {
+  runDivTests<uint32_t>(&folly::divTrunc<uint32_t, uint32_t>, iters);
+}
+BENCHMARK(divFloorUint32, iters) {
+  runDivTests<uint32_t>(&folly::divFloor<uint32_t, uint32_t>, iters);
+}
+BENCHMARK(divCeilUint32, iters) {
+  runDivTests<uint32_t>(&folly::divCeil<uint32_t, uint32_t>, iters);
+}
+BENCHMARK_RELATIVE(branchlessDivCeilUint32, iters) {
+  runDivTests<uint32_t>(&folly::detail::divCeilBranchless<uint32_t>, iters);
+}
+BENCHMARK_RELATIVE(branchfulDivCeilUint32, iters) {
+  runDivTests<uint32_t>(&folly::detail::divCeilBranchful<uint32_t>, iters);
+}
+BENCHMARK_RELATIVE(brokenButWidespreadDivCeilUint32, iters) {
+  runDivTests<uint32_t>(&brokenButWidespreadDivCeil<uint32_t>, iters);
+}
+BENCHMARK_RELATIVE(approxViaFloatDivCeilUint32, iters) {
+  runDivTests<uint32_t>(&viaFloatDivCeil<uint32_t>, iters);
+}
+BENCHMARK_RELATIVE(viaDoubleDivCeilUint32, iters) {
+  runDivTests<uint32_t>(&viaDoubleDivCeil<uint32_t>, iters);
+}
+BENCHMARK_RELATIVE(viaLongDoubleDivCeilUint32, iters) {
+  runDivTests<uint32_t>(&viaLongDoubleDivCeil<uint32_t>, iters);
+}
+BENCHMARK(divRoundAwayUint32, iters) {
+  runDivTests<uint32_t>(&folly::divRoundAway<uint32_t, uint32_t>, iters);
+}
+
+BENCHMARK_DRAW_LINE();
+BENCHMARK(divTruncUint64, iters) {
+  runDivTests<uint64_t>(&folly::divTrunc<uint64_t, uint64_t>, iters);
+}
+BENCHMARK(divFloorUint64, iters) {
+  runDivTests<uint64_t>(&folly::divFloor<uint64_t, uint64_t>, iters);
+}
+BENCHMARK(divCeilUint64, iters) {
+  runDivTests<uint64_t>(&folly::divCeil<uint64_t, uint64_t>, iters);
+}
+BENCHMARK_RELATIVE(branchlessDivCeilUint64, iters) {
+  runDivTests<uint64_t>(&folly::detail::divCeilBranchless<uint64_t>, iters);
+}
+BENCHMARK_RELATIVE(branchfulDivCeilUint64, iters) {
+  runDivTests<uint64_t>(&folly::detail::divCeilBranchful<uint64_t>, iters);
+}
+BENCHMARK_RELATIVE(brokenButWidespreadDivCeilUint64, iters) {
+  runDivTests<uint64_t>(&brokenButWidespreadDivCeil<uint64_t>, iters);
+}
+BENCHMARK_RELATIVE(approxViaFloatDivCeilUint64, iters) {
+  runDivTests<uint64_t>(&viaFloatDivCeil<uint64_t>, iters);
+}
+BENCHMARK_RELATIVE(approxViaDoubleDivCeilUint64, iters) {
+  runDivTests<uint64_t>(&viaDoubleDivCeil<uint64_t>, iters);
+}
+BENCHMARK_RELATIVE(viaLongDoubleDivCeilUint64, iters) {
+  runDivTests<uint64_t>(&viaLongDoubleDivCeil<uint64_t>, iters);
+}
+BENCHMARK(divRoundAwayUint64, iters) {
+  runDivTests<uint64_t>(&folly::divRoundAway<uint64_t, uint64_t>, iters);
+}
+
+int main(int argc, char** argv) {
+  gflags::ParseCommandLineFlags(&argc, &argv, true);
+  folly::runBenchmarks();
+  return 0;
+}
+
+/*
+Benchmarks run single-threaded on a dual Xeon E5-2660 @ 2.2 Ghz with
+hyperthreading (16 physical cores, 20 MB cache per socket, 256 GB RAM)
+
+Benchmarks used --bm_min_iters=10000000.
+
+divTrunc is just a native integral division.  viaDoubleViaCeil doesn't
+have full accuracy for Int64 or Uint64.  There is a loop-carried
+dependency for all of the div* tests, but there is a bit of extra slack
+(a predictable call, a load that should be from the L1, and a predictable
+not-taken branch in addition to the loop's branch) in the driving loop,
+so the benchmark driver's attempt to subtract the overhead of the loop
+might mean that the latency numbers here are slightly too low or too high.
+
+The branchful implementation's branch is very predictable in this
+microbenchmark for unsigned types, since it only needs to predict a
+zero numerator.  That's likely to be true in real life as well, so we
+make this the default.
+
+I was surprised at the speed of float and double division, but
+the only case where it actually wins by much and is correct is for
+int16_t.  (float + ceil is faster for the 32-bit case, but is only
+an approximation.)  I ran a similar benchmark setup for ARM and ARM64.
+On ARM the conditional versions win by quite a bit.  32-bit ARM doesn't
+have a native integer divide, so getting the remainder after a division
+(to see if truncation occurred) is more work than preconditioning the
+numerator to make truncation go in the correct direction.  64-bit ARM
+had the same winners and losers as x86_64, at least on the two physical
+instances I tested.
+
+============================================================================
+folly/test/MathBenchmark.cpp                    relative  time/iter  iters/s
+============================================================================
+----------------------------------------------------------------------------
+divTruncInt8                                                 8.89ns  112.44M
+divFloorInt8                                                10.99ns   91.00M
+divCeilInt8                                                 10.95ns   91.33M
+branchlessDivCeilInt8                            100.40%    10.91ns   91.69M
+branchfulDivCeilInt8                              88.87%    12.32ns   81.16M
+brokenButWidespreadDivCeilInt8                   109.20%    10.03ns   99.73M
+viaFloatDivCeilInt8                              109.68%     9.98ns  100.17M
+viaDoubleDivCeilInt8                              95.47%    11.47ns   87.19M
+viaLongDoubleDivCeilInt8                          31.65%    34.59ns   28.91M
+divRoundAwayInt8                                            10.42ns   95.97M
+----------------------------------------------------------------------------
+divTruncInt16                                                8.68ns  115.17M
+divFloorInt16                                               10.94ns   91.38M
+divCeilInt16                                                10.91ns   91.70M
+branchlessDivCeilInt16                            99.44%    10.97ns   91.18M
+branchfulDivCeilInt16                             81.68%    13.35ns   74.90M
+brokenButWidespreadDivCeilInt16                  109.50%     9.96ns  100.40M
+viaFloatDivCeilInt16                             108.04%    10.09ns   99.07M
+viaDoubleDivCeilInt16                             85.38%    12.77ns   78.29M
+viaLongDoubleDivCeilInt16                         29.99%    36.36ns   27.50M
+divRoundAwayInt16                                           10.59ns   94.46M
+----------------------------------------------------------------------------
+divTruncInt32                                                8.38ns  119.29M
+divFloorInt32                                               11.01ns   90.84M
+divCeilInt32                                                11.12ns   89.91M
+branchlessDivCeilInt32                           101.94%    10.91ns   91.66M
+branchfulDivCeilInt32                             84.67%    13.14ns   76.12M
+brokenButWidespreadDivCeilInt32                  117.61%     9.46ns  105.75M
+approxViaFloatDivCeilInt32                       115.98%     9.59ns  104.28M
+viaDoubleDivCeilInt32                             89.86%    12.38ns   80.79M
+viaLongDoubleDivCeilInt32                         30.84%    36.06ns   27.73M
+divRoundAwayInt32                                           11.30ns   88.50M
+----------------------------------------------------------------------------
+divTruncInt64                                               16.07ns   62.21M
+divFloorInt64                                               18.37ns   54.45M
+divCeilInt64                                                18.61ns   53.74M
+branchlessDivCeilInt64                           100.43%    18.53ns   53.97M
+branchfulDivCeilInt64                             84.65%    21.98ns   45.49M
+brokenButWidespreadDivCeilInt64                  108.47%    17.16ns   58.29M
+approxViaFloatDivCeilInt64                       190.99%     9.74ns  102.64M
+approxViaDoubleDivCeilInt64                      148.64%    12.52ns   79.88M
+viaLongDoubleDivCeilInt64                         52.01%    35.77ns   27.95M
+divRoundAwayInt64                                           18.79ns   53.21M
+----------------------------------------------------------------------------
+divTruncUint8                                                7.76ns  128.89M
+divFloorUint8                                                8.29ns  120.61M
+divCeilUint8                                                 9.61ns  104.09M
+branchlessDivCeilUint8                           112.00%     8.58ns  116.58M
+branchfulDivCeilUint8                            114.01%     8.43ns  118.67M
+brokenButWidespreadDivCeilUint8                  100.48%     9.56ns  104.58M
+viaFloatDivCeilUint8                             103.53%     9.28ns  107.76M
+viaDoubleDivCeilUint8                             85.75%    11.20ns   89.26M
+viaLongDoubleDivCeilUint8                         27.72%    34.65ns   28.86M
+divRoundAwayUint8                                            9.60ns  104.11M
+----------------------------------------------------------------------------
+divTruncUint16                                               8.39ns  119.19M
+divFloorUint16                                               8.28ns  120.82M
+divCeilUint16                                                9.90ns  100.96M
+branchlessDivCeilUint16                          100.23%     9.88ns  101.19M
+branchfulDivCeilUint16                           107.83%     9.19ns  108.87M
+brokenButWidespreadDivCeilUint16                  99.89%     9.92ns  100.85M
+viaFloatDivCeilUint16                            100.54%     9.85ns  101.50M
+viaDoubleDivCeilUint16                            77.38%    12.80ns   78.13M
+viaLongDoubleDivCeilUint16                        27.30%    36.28ns   27.56M
+divRoundAwayUint16                                           9.82ns  101.85M
+----------------------------------------------------------------------------
+divTruncUint32                                               8.12ns  123.20M
+divFloorUint32                                               8.09ns  123.58M
+divCeilUint32                                                8.44ns  118.55M
+branchlessDivCeilUint32                           88.27%     9.56ns  104.64M
+branchfulDivCeilUint32                            98.91%     8.53ns  117.25M
+brokenButWidespreadDivCeilUint32                  93.48%     9.02ns  110.82M
+approxViaFloatDivCeilUint32                       86.29%     9.78ns  102.30M
+viaDoubleDivCeilUint32                            66.76%    12.63ns   79.15M
+viaLongDoubleDivCeilUint32                        23.35%    36.13ns   27.68M
+divRoundAwayUint32                                           8.47ns  118.03M
+----------------------------------------------------------------------------
+divTruncUint64                                              12.38ns   80.79M
+divFloorUint64                                              12.27ns   81.47M
+divCeilUint64                                               12.66ns   78.99M
+branchlessDivCeilUint64                           93.46%    13.55ns   73.83M
+branchfulDivCeilUint64                           100.30%    12.62ns   79.23M
+brokenButWidespreadDivCeilUint64                  99.41%    12.73ns   78.53M
+approxViaFloatDivCeilUint64                      106.59%    11.88ns   84.19M
+approxViaDoubleDivCeilUint64                      92.14%    13.74ns   72.78M
+viaLongDoubleDivCeilUint64                        33.51%    37.78ns   26.47M
+divRoundAwayUint64                                          12.34ns   81.02M
+============================================================================
+*/
--- a/folly/test/MathTest.cpp
+++ b/folly/test/MathTest.cpp
+/*
+ * Copyright 2016 Facebook, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <folly/Math.h>
+
+#include <algorithm>
+#include <type_traits>
+#include <utility>
+#include <vector>
+
+#include <glog/logging.h>
+#include <gtest/gtest.h>
+
+#include <folly/Portability.h>
+
+using namespace folly;
+using namespace folly::detail;
+
+namespace {
+
+// Workaround for https://llvm.org/bugs/show_bug.cgi?id=16404,
+// issues with __int128 multiplication and UBSAN
+template <typename T>
+T mul(T lhs, T rhs) {
+  if (rhs < 0) {
+    rhs = -rhs;
+    lhs = -lhs;
+  }
+  T accum = 0;
+  while (rhs != 0) {
+    if ((rhs & 1) != 0) {
+      accum += lhs;
+    }
+    lhs += lhs;
+    rhs >>= 1;
+  }
+  return accum;
+}
+
+template <typename T, typename B>
+T referenceDivFloor(T numer, T denom) {
+  // rv = largest integral value <= numer / denom
+  B n = numer;
+  B d = denom;
+  if (d < 0) {
+    d = -d;
+    n = -n;
+  }
+  B r = n / d;
+  while (mul(r, d) > n) {
+    --r;
+  }
+  while (mul(r + 1, d) <= n) {
+    ++r;
+  }
+  T rv = static_cast<T>(r);
+  assert(static_cast<B>(rv) == r);
+  return rv;
+}
+
+template <typename T, typename B>
+T referenceDivCeil(T numer, T denom) {
+  // rv = smallest integral value >= numer / denom
+  B n = numer;
+  B d = denom;
+  if (d < 0) {
+    d = -d;
+    n = -n;
+  }
+  B r = n / d;
+  while (mul(r, d) < n) {
+    ++r;
+  }
+  while (mul(r - 1, d) >= n) {
+    --r;
+  }
+  T rv = static_cast<T>(r);
+  assert(static_cast<B>(rv) == r);
+  return rv;
+}
+
+template <typename T, typename B>
+T referenceDivRoundAway(T numer, T denom) {
+  if ((numer < 0) != (denom < 0)) {
+    return referenceDivFloor<T, B>(numer, denom);
+  } else {
+    return referenceDivCeil<T, B>(numer, denom);
+  }
+}
+
+template <typename T>
+std::vector<T> cornerValues() {
+  std::vector<T> rv;
+  for (T i = 1; i < 24; ++i) {
+    rv.push_back(i);
+    rv.push_back(std::numeric_limits<T>::max() / i);
+    rv.push_back(std::numeric_limits<T>::max() - i);
+    rv.push_back(std::numeric_limits<T>::max() / 2 - i);
+    if (std::is_signed<T>::value) {
+      rv.push_back(-i);
+      rv.push_back(std::numeric_limits<T>::min() / i);
+      rv.push_back(std::numeric_limits<T>::min() + i);
+      rv.push_back(std::numeric_limits<T>::min() / 2 + i);
+    }
+  }
+  return rv;
+}
+
+template <typename A, typename B, typename C>
+void runDivTests() {
+  using T = decltype(static_cast<A>(1) / static_cast<B>(1));
+  auto numers = cornerValues<A>();
+  numers.push_back(0);
+  auto denoms = cornerValues<B>();
+  for (A n : numers) {
+    for (B d : denoms) {
+      if (std::is_signed<T>::value && n == std::numeric_limits<T>::min() &&
+          d == static_cast<T>(-1)) {
+        // n / d overflows in two's complement
+        continue;
+      }
+      EXPECT_EQ(divCeil(n, d), (referenceDivCeil<T, C>(n, d))) << n << "/" << d;
+      EXPECT_EQ(divFloor(n, d), (referenceDivFloor<T, C>(n, d))) << n << "/"
+                                                                 << d;
+      EXPECT_EQ(divTrunc(n, d), n / d) << n << "/" << d;
+      EXPECT_EQ(divRoundAway(n, d), (referenceDivRoundAway<T, C>(n, d)))
+          << n << "/" << d;
+      T nn = n;
+      T dd = d;
+      EXPECT_EQ(divCeilBranchless(nn, dd), divCeilBranchful(nn, dd));
+      EXPECT_EQ(divFloorBranchless(nn, dd), divFloorBranchful(nn, dd));
+      EXPECT_EQ(divRoundAwayBranchless(nn, dd), divRoundAwayBranchful(nn, dd));
+    }
+  }
+}
+}
+
+TEST(Bits, divTestInt8) {
+  runDivTests<int8_t, int8_t, int64_t>();
+  runDivTests<int8_t, uint8_t, int64_t>();
+  runDivTests<int8_t, int16_t, int64_t>();
+  runDivTests<int8_t, uint16_t, int64_t>();
+  runDivTests<int8_t, int32_t, int64_t>();
+  runDivTests<int8_t, uint32_t, int64_t>();
+#ifdef FOLLY_HAVE_INT128_T
+  runDivTests<int8_t, int64_t, __int128>();
+  runDivTests<int8_t, uint64_t, __int128>();
+#endif
+}
+TEST(Bits, divTestInt16) {
+  runDivTests<int16_t, int8_t, int64_t>();
+  runDivTests<int16_t, uint8_t, int64_t>();
+  runDivTests<int16_t, int16_t, int64_t>();
+  runDivTests<int16_t, uint16_t, int64_t>();
+  runDivTests<int16_t, int32_t, int64_t>();
+  runDivTests<int16_t, uint32_t, int64_t>();
+#ifdef FOLLY_HAVE_INT128_T
+  runDivTests<int16_t, int64_t, __int128>();
+  runDivTests<int16_t, uint64_t, __int128>();
+#endif
+}
+TEST(Bits, divTestInt32) {
+  runDivTests<int32_t, int8_t, int64_t>();
+  runDivTests<int32_t, uint8_t, int64_t>();
+  runDivTests<int32_t, int16_t, int64_t>();
+  runDivTests<int32_t, uint16_t, int64_t>();
+  runDivTests<int32_t, int32_t, int64_t>();
+  runDivTests<int32_t, uint32_t, int64_t>();
+#ifdef FOLLY_HAVE_INT128_T
+  runDivTests<int32_t, int64_t, __int128>();
+  runDivTests<int32_t, uint64_t, __int128>();
+#endif
+}
+#ifdef FOLLY_HAVE_INT128_T
+TEST(Bits, divTestInt64) {
+  runDivTests<int64_t, int8_t, __int128>();
+  runDivTests<int64_t, uint8_t, __int128>();
+  runDivTests<int64_t, int16_t, __int128>();
+  runDivTests<int64_t, uint16_t, __int128>();
+  runDivTests<int64_t, int32_t, __int128>();
+  runDivTests<int64_t, uint32_t, __int128>();
+  runDivTests<int64_t, int64_t, __int128>();
+  runDivTests<int64_t, uint64_t, __int128>();
+}
+#endif
+TEST(Bits, divTestUint8) {
+  runDivTests<uint8_t, int8_t, int64_t>();
+  runDivTests<uint8_t, uint8_t, int64_t>();
+  runDivTests<uint8_t, int16_t, int64_t>();
+  runDivTests<uint8_t, uint16_t, int64_t>();
+  runDivTests<uint8_t, int32_t, int64_t>();
+  runDivTests<uint8_t, uint32_t, int64_t>();
+#ifdef FOLLY_HAVE_INT128_T
+  runDivTests<uint8_t, int64_t, __int128>();
+  runDivTests<uint8_t, uint64_t, __int128>();
+#endif
+}
+TEST(Bits, divTestUint16) {
+  runDivTests<uint16_t, int8_t, int64_t>();
+  runDivTests<uint16_t, uint8_t, int64_t>();
+  runDivTests<uint16_t, int16_t, int64_t>();
+  runDivTests<uint16_t, uint16_t, int64_t>();
+  runDivTests<uint16_t, int32_t, int64_t>();
+  runDivTests<uint16_t, uint32_t, int64_t>();
+#ifdef FOLLY_HAVE_INT128_T
+  runDivTests<uint16_t, int64_t, __int128>();
+  runDivTests<uint16_t, uint64_t, __int128>();
+#endif
+}
+TEST(Bits, divTestUint32) {
+  runDivTests<uint32_t, int8_t, int64_t>();
+  runDivTests<uint32_t, uint8_t, int64_t>();
+  runDivTests<uint32_t, int16_t, int64_t>();
+  runDivTests<uint32_t, uint16_t, int64_t>();
+  runDivTests<uint32_t, int32_t, int64_t>();
+  runDivTests<uint32_t, uint32_t, int64_t>();
+#ifdef FOLLY_HAVE_INT128_T
+  runDivTests<uint32_t, int64_t, __int128>();
+  runDivTests<uint32_t, uint64_t, __int128>();
+#endif
+}
+#ifdef FOLLY_HAVE_INT128_T
+TEST(Bits, divTestUint64) {
+  runDivTests<uint64_t, int8_t, __int128>();
+  runDivTests<uint64_t, uint8_t, __int128>();
+  runDivTests<uint64_t, int16_t, __int128>();
+  runDivTests<uint64_t, uint16_t, __int128>();
+  runDivTests<uint64_t, int32_t, __int128>();
+  runDivTests<uint64_t, uint32_t, __int128>();
+  runDivTests<uint64_t, int64_t, __int128>();
+  runDivTests<uint64_t, uint64_t, __int128>();
+}
+#endif