Optimizations, support for exceptions and big return values for lock_combine

Summary:
Some optimizations and changes to make mutex migrations easier:

- Add exception handling support. This allows using lock_combine pretty much anywhere a unique_lock would be used, and makes transitioning between lock methods easier and more efficient, as users aren't required to maintain their own unions anywhere (e.g. with folly::Try).
- Add support for big return values so people can return anything from the critical section. Without this, users would have to use code of the following form, which is prone to false sharing with metadata for the waiting thread:
  ```
  auto value = ReturnValue{};
  mutex.lock_combine([&]() {
    value = critical_section();
  });
  ```
- Add some optimizations, like inlining the combine codepath and an optimistic load to elide a branch. This gets us a ~8% throughput improvement over the previous version. More importantly, this prevents compilers from generating code that dereferences the waiter node whenever they feel like.
- Defer time publishing for combinable threads until a preemption. This gets us to the same level of efficiency as std::atomic even on Broadwell, takes us to 7x of the baseline (std::mutex) on NUMA-less machines, and is almost perfectly parallel at moderate concurrency levels. I suspect we can do better with NUMA-awareness, but that's for another diff.

Reviewed By: yfeldblum
Differential Revision: D15522658
fbshipit-source-id: 420f4202503305d57b6bd59a9a4ecb67d4dd3c2e
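
The deferred time publishing described in the summary can be sketched in isolation. The following is a minimal illustration, not the folly implementation: the mode constants, `pack`, `modeOf`, `nanosOf`, `shouldWrite` and `countWrites` are hypothetical names, and the bit layout (mode in the bottom 8 bits, timestamp in the remaining 56) is assumed from the comments in the diff. It shows that a combine waiter writes to the shared word on its first spin and then only reads until a preemption is noticed, while a regular waiter writes on every spin.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical wait-mode tags for this sketch only; the commit defines its
// own constants (kWaiting, kCombineWaiting, ...).
constexpr std::uint64_t kModeWaiting = 1;
constexpr std::uint64_t kModeCombineWaiting = 2;
constexpr std::uint64_t kModeMask = 0xff;

// Pack a nanosecond timestamp and a wait mode into one 64-bit word.  The
// mode lives in the bottom 8 bits, the truncated timestamp in the other 56.
constexpr std::uint64_t pack(std::uint64_t nanos, std::uint64_t mode) {
  return (nanos << 8) | (mode & kModeMask);
}
constexpr std::uint64_t modeOf(std::uint64_t word) {
  return word & kModeMask;
}
constexpr std::uint64_t nanosOf(std::uint64_t word) {
  return word >> 8;
}

// Decide whether a spin iteration should write the shared word (an
// exchange, which needs the cacheline in exclusive state) or only read it.
constexpr bool shouldWrite(
    bool preemptionSeen,
    std::uint64_t spins,
    std::uint64_t mode) {
  return preemptionSeen || spins == 0 || mode != kModeCombineWaiting;
}

// Count how many of `iterations` spins would write the shared word when a
// preemption is first noticed at spin `preemptedAt` (-1 means never).
inline int countWrites(std::uint64_t mode, int iterations, int preemptedAt) {
  assert(iterations >= 0);
  auto writes = 0;
  auto preemptionSeen = false;
  for (auto spins = 0; spins < iterations; ++spins) {
    if (spins == preemptedAt) {
      preemptionSeen = true;
    }
    if (shouldWrite(preemptionSeen, static_cast<std::uint64_t>(spins), mode)) {
      ++writes;
    }
  }
  return writes;
}
```

In this sketch a combine waiter that is never preempted touches the line for writing once, which is the effect the last bullet of the summary is after.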

Commit 886a70d9 authored Jun 12, 2019 by Aaryaman Sagar Committed by Facebook Github Bot Jun 12, 2019
4 changed files
--- a/folly/synchronization/DistributedMutex-inl.h
+++ b/folly/synchronization/DistributedMutex-inl.h
@@ -129,6 +129,18 @@ constexpr auto kCombineUninitialized = std::uint32_t{0b1000};
 // lock holder that the thread has set its next_ pointer in the contention
 // chain
 constexpr auto kCombineWaiting = std::uint32_t{0b1001};
+// kExceptionOccurred is set on the waiter futex when the remote task throws
+// an exception.  It is the caller's responsibility to retrieve the exception
+// and rethrow it in their own context.  Note that when the caller uses a
+// noexcept function as their critical section, they can avoid checking for
+// this value
+//
+// This allows us to avoid all cost of exceptions in the memory layout of the
+// fast path (no errors) as exceptions are stored as an std::exception_ptr in
+// the same union that stores the return value of the critical section.  We
+// also avoid all CPU overhead because the combiner uses a try-catch block
+// without any additional branching to handle exceptions
+constexpr auto kExceptionOccurred = std::uint32_t{0b1010};

 // The number of spins that we are allowed to do before we resort to marking a
 // thread as having slept
@@ -244,7 +256,29 @@ class Waiter {
      kUninitialized};
  // The successor of this node.  This will be the thread that had its address
  // on the mutex previously
-  std::uintptr_t next_{0};
+  //
+  // We can do without making this atomic since the remote thread synchronizes
+  // on the futex variable above.  If this were not atomic, the remote thread
+  // would only be allowed to read from it after the waiter has moved into the
+  // waiting state to avoid risk of a load racing with a write.  However, it
+  // helps to make this atomic because we can use an unconditional load and make
+  // full use of the load buffer to coalesce both reads into a single clock
+  // cycle after the line arrives in the combiner core.  This is a heavily
+  // contended line, so an RFO from the enqueueing thread is highly likely and
+  // has the potential to cause an immediate invalidation; blocking the combiner
+  // thread from making progress until the line is pulled back to read this
+  // value
+  //
+  // Further, making this atomic prevents the compiler from making an incorrect
+  // optimization where it does not load the value as written in the code, but
+  // rather dereferences it through a pointer whenever needed (since the value
+  // of the pointer to this is readily available on the stack).  Doing this
+  // causes multiple invalidation requests from the enqueueing thread, blocking
+  // remote progress
+  //
+  // Note that we use relaxed loads and stores, so this should not have any
+  // additional overhead compared to a regular load on most architectures
+  std::atomic<std::uintptr_t> next_{0};
  // We use an anonymous union for the combined critical section request and
  // the metadata that will be filled in from the leader's end.  Only one is
  // active at a time - if a leader decides to combine the requested critical
@@ -410,9 +444,11 @@ using Request = std::conditional_t<

 /**
 * A template that helps us to transform a callable returning a value to one
- * that returns void so it can be type erased and passed on to the waker.  The
- * return value gets coalesced into the wait struct when it is small enough
- * for optimal data transfer
+ * that returns void so it can be type erased and passed on to the waker.  If
+ * the return value is small enough, it gets coalesced into the wait struct
+ * for optimal data transfer.  When it's not small enough to fit in the waiter
+ * storage buffer, we place it on its own cacheline with isolation to prevent
+ * false-sharing with the on-stack metadata of the waiter thread
 *
 * This helps a combined critical section feel more normal in the case where
 * the user wants to return a value, for example
@@ -437,6 +473,7 @@ template <typename Func, typename Waiter>
 class TaskWithCoalesce {
 public:
  using ReturnType = folly::invoke_result_t<const Func&>;
+  using StorageType = folly::Unit;
  explicit TaskWithCoalesce(Func func, Waiter& waiter)
      : func_{std::move(func)}, waiter_{waiter} {}

@@ -449,6 +486,7 @@ class TaskWithCoalesce {
  Func func_;
  Waiter& waiter_;

+  static_assert(!std::is_void<ReturnType>{}, "");
  static_assert(alignof(decltype(waiter_.storage_)) >= alignof(ReturnType), "");
  static_assert(sizeof(decltype(waiter_.storage_)) >= sizeof(ReturnType), "");
 };
@@ -457,6 +495,7 @@ template <typename Func, typename Waiter>
 class TaskWithoutCoalesce {
 public:
  using ReturnType = void;
+  using StorageType = folly::Unit;
  explicit TaskWithoutCoalesce(Func func, Waiter&) : func_{std::move(func)} {}

  void operator()() const {
@@ -467,6 +506,52 @@ class TaskWithoutCoalesce {
  Func func_;
 };

+template <typename Func, typename Waiter>
+class TaskWithBigReturnValue {
+ public:
+  // Using storage that is aligned on the cacheline boundary helps us avoid a
+  // situation where the data ends up being allocated on two separate
+  // cachelines.  This would require the remote thread to pull in both lines
+  // to issue a write.
+  //
+  // We also isolate the storage by appending some padding to the end to
+  // ensure we avoid false-sharing with the metadata used while the waiter
+  // waits
+  using ReturnType = folly::invoke_result_t<const Func&>;
+  static const auto kReturnValueAlignment = std::max(
+      alignof(ReturnType),
+      folly::hardware_destructive_interference_size);
+  using StorageType = std::aligned_storage_t<
+      sizeof(std::aligned_storage_t<sizeof(ReturnType), kReturnValueAlignment>),
+      kReturnValueAlignment>;
+
+  explicit TaskWithBigReturnValue(Func func, Waiter&)
+      : func_{std::move(func)} {}
+
+  void operator()() const {
+    DCHECK(storage_);
+    auto value = func_();
+    new (storage_) ReturnType{std::move(value)};
+  }
+
+  void attach(StorageType* storage) {
+    DCHECK(!storage_);
+    storage_ = storage;
+  }
+
+ private:
+  Func func_;
+  StorageType* storage_{nullptr};
+
+  static_assert(!std::is_void<ReturnType>{}, "");
+  static_assert(sizeof(Waiter::storage_) < sizeof(ReturnType), "");
+};
+
+template <typename T, typename = std::enable_if_t<true>>
+constexpr const auto Sizeof = sizeof(T);
+template <typename T>
+constexpr const auto Sizeof<T, std::enable_if_t<std::is_void<T>{}>> = 0;
+
 // we need to use std::integral_constant::value here as opposed to
 // std::integral_constant::operator T() because MSVC errors out with the
 // implicit conversion
@@ -474,7 +559,10 @@ template <typename Func, typename Waiter>
 using CoalescedTask = std::conditional_t<
    std::is_void<folly::invoke_result_t<const Func&>>::value,
    TaskWithoutCoalesce<Func, Waiter>,
-    TaskWithCoalesce<Func, Waiter>>;
+    std::conditional_t<
+        Sizeof<folly::invoke_result_t<const Func&>> <= sizeof(Waiter::storage_),
+        TaskWithCoalesce<Func, Waiter>,
+        TaskWithBigReturnValue<Func, Waiter>>>;

 /**
 * Given a request and a wait node, coalesce them into a CoalescedTask that
@@ -497,26 +585,120 @@ CoalescedTask<Func, Waiter> coalesce(Request& request, Waiter& waiter) {
  return CoalescedTask<Func, Waiter>{request.func_, waiter};
 }

+/**
+ * Given a task, create storage for the return value.  When we get a type
+ * of CoalescedTask, this returns an instance of CoalescedTask::StorageType.
+ * std::nullptr_t otherwise
+ */
+inline std::nullptr_t makeReturnValueStorageFor(std::nullptr_t&) {
+  return {};
+}
+
+template <
+    typename CoalescedTask,
+    typename StorageType = typename CoalescedTask::StorageType>
+StorageType makeReturnValueStorageFor(CoalescedTask&) {
+  return {};
+}
+
+/**
+ * Given a task and storage, attach them together if needed.  This only helps
+ * when we have a task that returns a value bigger than can be coalesced.  In
+ * that case, we need to attach the storage with the task so the return value
+ * can be transferred to this thread from the remote thread
+ */
+template <typename Task, typename Storage>
+void attach(Task&, Storage&) {
+  static_assert(
+      std::is_same<Storage, std::nullptr_t>{} ||
+          std::is_same<Storage, folly::Unit>{},
+      "");
+}
+
+template <
+    typename R,
+    typename W,
+    typename StorageType = typename TaskWithBigReturnValue<R, W>::StorageType>
+void attach(TaskWithBigReturnValue<R, W>& task, StorageType& storage) {
+  task.attach(&storage);
+}
+
+template <typename Request, typename Waiter>
+void throwIfExceptionOccurred(Request&, Waiter& waiter, bool exception) {
+  using Storage = decltype(waiter.storage_);
+  using F = typename Request::F;
+  static_assert(sizeof(Storage) >= sizeof(std::exception_ptr), "");
+  static_assert(alignof(Storage) >= alignof(std::exception_ptr), "");
+
+  // we only need to check for an exception in the waiter struct if the passed
+  // callable is not noexcept
+  //
+  // we need to make another instance of the exception with automatic storage
+  // duration and destroy the exception held in the storage *before throwing* to
+  // avoid leaks.  If we don't destroy the exception_ptr in storage, the
+  // refcount for the internal exception will never hit zero, thereby leaking
+  // memory
+  if (UNLIKELY(!folly::is_nothrow_invocable<const F&>{} && exception)) {
+    auto storage = &waiter.storage_;
+    auto exc = folly::launder(reinterpret_cast<std::exception_ptr*>(storage));
+    auto copy = std::move(*exc);
+    exc->std::exception_ptr::~exception_ptr();
+    std::rethrow_exception(std::move(copy));
+  }
+}
+
 /**
 * Given a CoalescedTask, a wait node and a request.  Detach the return value
 * into the request from the wait node and task.
 */
 template <typename Waiter>
-void detach(std::nullptr_t&, Waiter&) {}
+void detach(std::nullptr_t&, Waiter&, bool exception, std::nullptr_t&) {
+  DCHECK(!exception);
+}

 template <typename Waiter, typename F>
-void detach(RequestWithoutReturn<F>&, Waiter&) {}
+void detach(
+    RequestWithoutReturn<F>& request,
+    Waiter& waiter,
+    bool exception,
+    folly::Unit&) {
+  throwIfExceptionOccurred(request, waiter, exception);
+}

 template <typename Waiter, typename F>
-void detach(RequestWithReturn<F>& request, Waiter& waiter) {
+void detach(
+    RequestWithReturn<F>& request,
+    Waiter& waiter,
+    bool exception,
+    folly::Unit&) {
+  throwIfExceptionOccurred(request, waiter, exception);
+
  using ReturnType = typename RequestWithReturn<F>::ReturnType;
  static_assert(!std::is_same<ReturnType, void>{}, "");
+  static_assert(sizeof(waiter.storage_) >= sizeof(ReturnType), "");

  auto& val = *folly::launder(reinterpret_cast<ReturnType*>(&waiter.storage_));
  new (&request.value_) ReturnType{std::move(val)};
  val.~ReturnType();
 }

+template <typename Waiter, typename F, typename Storage>
+void detach(
+    RequestWithReturn<F>& request,
+    Waiter& waiter,
+    bool exception,
+    Storage& storage) {
+  throwIfExceptionOccurred(request, waiter, exception);
+
+  using ReturnType = typename RequestWithReturn<F>::ReturnType;
+  static_assert(!std::is_same<ReturnType, void>{}, "");
+  static_assert(sizeof(storage) >= sizeof(ReturnType), "");
+
+  auto& val = *folly::launder(reinterpret_cast<ReturnType*>(&storage));
+  new (&request.value_) ReturnType{std::move(val)};
+  val.~ReturnType();
+}
+
 /**
 * Get the time since epoch in nanoseconds
 *
@@ -663,26 +845,72 @@ DistributedMutex<Atomic, TimePublishing>::DistributedMutex()
    : state_{kUnlocked} {}

 template <typename Waiter>
-bool spin(Waiter& waiter, std::uint32_t& sig, std::uint32_t mode) {
-  auto spins = 0;
-  auto waitMode = (mode == kCombineUninitialized) ? kCombineWaiting : kWaiting;
-  while (true) {
-    // publish our current time in the futex as a part of the spin waiting
-    // process
+std::uint64_t publish(
+    std::uint64_t spins,
+    bool& shouldPublish,
+    std::chrono::nanoseconds& previous,
+    Waiter& waiter,
+    std::uint32_t waitMode) {
+  // time publishing has some overhead because it executes an atomic exchange on
+  // the futex word.  If this line is in a remote thread (eg.  the combiner),
+  // then each time we publish a timestamp, this thread has to submit an RFO to
+  // the remote core for the cacheline, blocking progress for both threads.
+  //
+  // the remote core uses a store in the fast path - why then does an RFO make a
+  // difference?  The only educated guess we have here is that the added
+  // roundtrip delays draining of the store buffer, which essentially exerts
+  // backpressure on future stores, preventing parallelization
+  //
+  // if we have requested a combine, time publishing is less important as it
+  // only comes into play when the combiner has exhausted their max combine
+  // passes.  So we defer time publishing to the point when the current thread
+  // gets preempted
+  auto current = time();
+  if ((current - previous) >= kScheduledAwaySpinThreshold) {
+    shouldPublish = true;
+  }
+  previous = current;
+
+  // if we have requested a combine, and this is the first iteration of the
+  // wait-loop, we publish a max timestamp to optimistically convey that we have
+  // not yet been preempted (the remote knows the meaning of max timestamps)
  //
-    // if we are under the maximum number of spins allowed before sleeping, we
-    // publish the exact timestamp, otherwise we publish the minimum possible
+  // then if we are under the maximum number of spins allowed before sleeping,
+  // we publish the exact timestamp, otherwise we publish the minimum possible
  // timestamp to force the waking thread to skip us
-    ++spins;
-    auto now = (spins < kMaxSpins) ? time() : decltype(time())::zero();
+  auto now = ((waitMode == kCombineWaiting) && !spins)
+      ? decltype(time())::max()
+      : (spins < kMaxSpins) ? previous : decltype(time())::zero();
+
+  // the wait mode information is published in the bottom 8 bits of the futex
+  // word, the rest contains time information as computed above.  Overflows are
+  // not really a correctness concern because time publishing is only a
+  // heuristic.  This leaves us 56 bits of nanoseconds (2 years) before we hit
+  // two consecutive wraparounds, so the lack of bits to represent time is
+  // neither a performance nor correctness concern
  auto data = strip(now) | waitMode;
-    auto signal = waiter.futex_.exchange(data, std::memory_order_acq_rel);
-    signal &= std::numeric_limits<std::uint8_t>::max();
+  auto signal = (shouldPublish || !spins || (waitMode != kCombineWaiting))
+      ? waiter.futex_.exchange(data, std::memory_order_acq_rel)
+      : waiter.futex_.load(std::memory_order_acquire);
+  return signal & std::numeric_limits<std::uint8_t>::max();
+}
+
+template <typename Waiter>
+bool spin(Waiter& waiter, std::uint32_t& sig, std::uint32_t mode) {
+  auto spins = std::uint64_t{0};
+  auto waitMode = (mode == kCombineUninitialized) ? kCombineWaiting : kWaiting;
+  auto previous = time();
+  auto shouldPublish = false;
+  while (true) {
+    auto signal = publish(spins++, shouldPublish, previous, waiter, waitMode);

    // if we got skipped, make a note of it and return if we got a skipped
    // signal or a signal to wake up
    auto skipped = (signal == kSkipped);
-    if (skipped || (signal == kWake) || (signal == kCombined)) {
+    auto combined = (signal == kCombined);
+    auto exceptionOccurred = (signal == kExceptionOccurred);
+    auto woken = (signal == kWake);
+    if (skipped || woken || combined || exceptionOccurred) {
      sig = static_cast<std::uint32_t>(signal);
      return !skipped;
    }
@@ -781,7 +1009,7 @@ bool doFutexWait(Waiter* waiter, Waiter*& next) {
  // when coming out of a futex, we might have some other sleeping threads
  // that we were supposed to wake up, assign that to the next pointer
  DCHECK(next == nullptr);
-  next = extractPtr<Waiter>(waiter->next_);
+  next = extractPtr<Waiter>(waiter->next_.load(std::memory_order_relaxed));
  return false;
 }

@@ -819,7 +1047,7 @@ void wakeTimedWaiters(Atomic* state, bool timedWaiters) {

 template <template <typename> class Atomic, bool TimePublishing>
 template <typename Func>
-auto DistributedMutex<Atomic, TimePublishing>::lock_combine(Func func) noexcept
+auto DistributedMutex<Atomic, TimePublishing>::lock_combine(Func func)
    -> folly::invoke_result_t<const Func&> {
  // invoke the lock implementation function and check whether we came out of
  // it with our task executed as a combined critical section.  This usually
@@ -867,7 +1095,7 @@ template <typename Rep, typename Period, typename Func, typename ReturnType>
 folly::Optional<ReturnType>
 DistributedMutex<Atomic, TimePublishing>::try_lock_combine_for(
    const std::chrono::duration<Rep, Period>& duration,
-    Func func) noexcept {
+    Func func) {
  auto state = try_lock_for(duration);
  if (state) {
    SCOPE_EXIT {
@@ -884,7 +1112,7 @@ template <typename Clock, typename Duration, typename Func, typename ReturnType>
 folly::Optional<ReturnType>
 DistributedMutex<Atomic, TimePublishing>::try_lock_combine_until(
    const std::chrono::time_point<Clock, Duration>& deadline,
-    Func func) noexcept {
+    Func func) {
  auto state = try_lock_until(deadline);
  if (state) {
    SCOPE_EXIT {
@@ -967,7 +1195,9 @@ lockImplementation(
    // constructor
    Waiter<Atomic> state{};
    auto&& task = coalesce(request, state);
+    auto&& storage = makeReturnValueStorageFor(task);
    auto&& address = folly::bit_cast<std::uintptr_t>(&state);
+    attach(task, storage);
    state.initialize(waitMode, std::move(task));
    DCHECK(!(address & 0b1));

@@ -986,7 +1216,7 @@ lockImplementation(
    // was unsuccessful
    previous = atomic.exchange(address, std::memory_order_acq_rel);
    recordTimedWaiterAndClearTimedBit(timedWaiter, previous);
-    state.next_ = previous;
+    state.next_.store(previous, std::memory_order_relaxed);
    if (previous == kUnlocked) {
      return {/* next */ nullptr,
              /* expected */ address,
@@ -1034,8 +1264,10 @@ lockImplementation(
    // if we were given a combine signal, detach the return value from the
    // wait struct into the request, so the current thread can access it
    // outside this function
-    if (signal == kCombined) {
-      detach(request, state);
+    auto combined = (signal == kCombined);
+    auto exceptionOccurred = (signal == kExceptionOccurred);
+    if (combined || exceptionOccurred) {
+      detach(request, state, exceptionOccurred, storage);
    }

    // if we are just coming out of a futex call, then it means that the next
@@ -1045,7 +1277,7 @@ lockImplementation(
    return {/* next */ extractPtr<Waiter<Atomic>>(next),
            /* expected */ expected,
            /* timedWaiter */ timedWaiter,
-            /* combined */ combineRequested && (signal == kCombined),
+            /* combined */ combineRequested && (combined || exceptionOccurred),
            /* waker */ state.metadata_.waker_,
            /* waiters */ extractPtr<Waiter<Atomic>>(state.metadata_.waiters_),
            /* ready */ nextSleeper};
@@ -1055,7 +1287,9 @@ lockImplementation(
 inline bool preempted(std::uint64_t value, std::chrono::nanoseconds now) {
  auto currentTime = recover(strip(now));
  auto nodeTime = recover(value);
-  auto preempted = currentTime > nodeTime + kScheduledAwaySpinThreshold.count();
+  auto preempted =
+      (currentTime > nodeTime + kScheduledAwaySpinThreshold.count()) &&
+      (nodeTime != recover(strip(std::chrono::nanoseconds::max())));

  // we say that the thread has been preempted if its timestamp says so, and
  // also if it is neither uninitialized nor skipped
@@ -1093,46 +1327,21 @@ CombineFunction loadTask(Waiter* current, std::uintptr_t value) {
  return nullptr;
 }

+template <typename Waiter>
+FOLLY_COLD void transferCurrentException(Waiter* waiter) {
+  DCHECK(std::current_exception());
+  new (&waiter->storage_) std::exception_ptr{std::current_exception()};
+  waiter->futex_.store(kExceptionOccurred, std::memory_order_release);
+}
+
 template <template <typename> class Atomic>
-std::uintptr_t tryCombine(
-    std::uintptr_t value,
+FOLLY_ALWAYS_INLINE std::uintptr_t tryCombine(
    Waiter<Atomic>* waiter,
+    std::uintptr_t value,
+    std::uintptr_t next,
    std::uint64_t iteration,
    std::chrono::nanoseconds now,
    CombineFunction task) {
-  // it is important to load the value of next_ before checking the value of
-  // function_ in the next if condition.  This is because of two things, the
-  // first being cache locality - it is helpful to read the value of the
-  // variable that is closer to futex_, since we just loaded from that before
-  // entering this function.  The second is cache coherence, the wait struct
-  // is shared between two threads, one thread is spinning on the futex
-  // waiting for a signal while the other is possibly combining the requested
-  // critical section into its own.  This means that there is a high chance
-  // we would cause the cachelines to bounce between the threads in the next
-  // if block.
-  //
-  // This leads to a degenerate case where the FunctionRef object ends up in a
-  // different cacheline thereby making it seem like benchmarks avoid this
-  // problem.  When compiled differently (eg.  with link time optimization)
-  // the wait struct ends up on the stack in a manner that causes the
-  // FunctionRef object to be in the same cacheline as the other data, thereby
-  // forcing the current thread to bounce on the cacheline twice (first to
-  // load the data from the other thread, that presumably owns the cacheline
-  // due to timestamp publishing) and then to signal the thread
-  //
-  // To avoid this sort of non-deterministic behavior based on compilation and
-  // stack layout, we load the value before executing the other thread's
-  // critical section
-  //
-  // Note that the waiting thread writes the value to the wait struct after
-  // enqueuing, but never writes to it after the value in the futex_ is
-  // initialised (showing that the thread is in the spin loop), this makes it
-  // safe for us to read next_ without synchronization
-  auto next = std::uintptr_t{0};
-  if (isInitialized(value)) {
-    next = waiter->next_;
-  }
-
  // if the waiter has asked for a combine operation, we should combine its
  // critical section and move on to the next waiter
  //
@@ -1147,14 +1356,18 @@ std::uintptr_t tryCombine(
  //     leading to further delays in critical section completion
  //
  // if all the above are satisfied, then we can combine the critical section.
-  // Note that it is only safe to read from the waiter struct if the value is
-  // not uninitialized.  If the state is uninitialized, we synchronize with
-  // the write to the next_ member in the lock function.  If the value is not
-  // uninitialized, there is a race in reading the next_ value
+  // Note that if the waiter is in a combinable state, that means that it had
+  // finished its writes to both the task and the next_ value.  And observing
+  // a waiting state also means that we have acquired the writes to the other
+  // members of the waiter struct, so it's fine to use those values here
  if (isWaitingCombiner(value) &&
      (iteration <= kMaxCombineIterations || preempted(value, now))) {
+    try {
      task();
      waiter->futex_.store(kCombined, std::memory_order_release);
+    } catch (...) {
+      transferCurrentException(waiter);
+    }
    return next;
  }

@@ -1162,10 +1375,11 @@ std::uintptr_t tryCombine(
 }

 template <typename Waiter>
-std::uintptr_t tryWake(
+FOLLY_ALWAYS_INLINE std::uintptr_t tryWake(
    bool publishing,
    Waiter* waiter,
    std::uintptr_t value,
+    std::uintptr_t next,
    std::uintptr_t waker,
    Waiter*& sleepers,
    std::uint64_t iteration,
@@ -1174,7 +1388,7 @@ std::uintptr_t tryWake(
  // we have successfully executed their critical section and can move on to
  // the rest of the chain
  auto now = time();
-  if (auto next = tryCombine(value, waiter, iteration, now, task)) {
+  if (tryCombine(waiter, value, next, iteration, now, task)) {
    return next;
  }

@@ -1217,7 +1431,7 @@ std::uintptr_t tryWake(
    // Can we relax this?
    DCHECK(preempted(value, now));
    DCHECK(!isCombiner(value));
-    auto next = waiter->next_;
+    next = waiter->next_.load(std::memory_order_relaxed);
    waiter->futex_.store(kSkipped, std::memory_order_release);
    return next;
  }
@@ -1260,8 +1474,9 @@ std::uintptr_t tryWake(
  //
  // we also need to collect this sleeper in the list of sleepers being built
  // up
-  auto next = waiter->next_;
-  waiter->next_ = folly::bit_cast<std::uintptr_t>(sleepers);
+  next = waiter->next_.load(std::memory_order_relaxed);
+  auto head = folly::bit_cast<std::uintptr_t>(sleepers);
+  waiter->next_.store(head, std::memory_order_relaxed);
  sleepers = waiter;
  return next;
 }
@@ -1278,13 +1493,23 @@ bool wake(
  // the last published timestamp of the node)
  auto current = &waiter;
  while (current) {
-    // it is important that we load the value of function after the initial
-    // acquire load.  This is required because we need to synchronize with the
-    // construction of the waiter struct before reading from it
+    // it is important that we load the value of function and next_ after the
+    // initial acquire load.  This is required because we need to synchronize
+    // with the construction of the waiter struct before reading from it
+    //
+    // the load from the next_ variable is an optimistic load that assumes
+    // that the waiting thread has probably gone to the waiting state.  If the
+    // waiting thread is in the waiting state (as revealed by the acquire load
+    // from the futex word), we will see a well formed next_ value because it
+    // happens-before the release store to the futex word.  The atomic load from
+    // next_ is an optimization to avoid branching before loading and prevent
+    // the compiler from eliding the load altogether (and using a pointer
+    // dereference when needed)
    auto value = current->futex_.load(std::memory_order_acquire);
+    auto next = current->next_.load(std::memory_order_relaxed);
    auto task = loadTask(current, value);
-    auto next =
-        tryWake(publishing, current, value, waker, sleepers, iter, task);
+    next =
+        tryWake(publishing, current, value, next, waker, sleepers, iter, task);

    // if there is no next node, we have managed to wake someone up and have
    // successfully migrated the lock to another thread

--- a/folly/synchronization/DistributedMutex.h
+++ b/folly/synchronization/DistributedMutex.h
@@ -276,6 +276,10 @@ class DistributedMutex {
   * Here, because we used a combined critical section, we have introduced a
   * dependency from one -> three that might not be obvious to the reader
   *
+   * This function is exception-safe.  If the passed task throws an exception,
+   * it will be propagated to the caller, even if the task is running on
+   * another thread
+   *
   * There are three notable cases where this method causes undefined
   * behavior:
   *
@@ -291,7 +295,7 @@ class DistributedMutex {
   *    at compile time or runtime, so we have no checks against this
   */
  template <typename Task>
-  auto lock_combine(Task task) noexcept -> folly::invoke_result_t<const Task&>;
+  auto lock_combine(Task task) -> folly::invoke_result_t<const Task&>;

  /**
   * Try to combine a task as a combined critical section until the given time
@@ -311,7 +315,7 @@ class DistributedMutex {
      typename ReturnType = decltype(std::declval<Task&>()())>
  folly::Optional<ReturnType> try_lock_combine_for(
      const std::chrono::duration<Rep, Period>& duration,
-      Task task) noexcept;
+      Task task);

  /**
   * Try to combine a task as a combined critical section until the given time
@@ -326,7 +330,7 @@ class DistributedMutex {
      typename ReturnType = decltype(std::declval<Task&>()())>
  folly::Optional<ReturnType> try_lock_combine_until(
      const std::chrono::time_point<Clock, Duration>& deadline,
-      Task task) noexcept;
+      Task task);

 private:
  Atomic<std::uintptr_t> state_{0};
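
The exception propagation documented for lock_combine can be sketched with the standard library alone. This is an illustration of the general handoff pattern, not the DistributedMutex code: `Slot`, `runForWaiter`, `awaitResult`, `messageFromRemoteFailure` and the three signal constants are hypothetical. The thread running the task wraps it in a try-catch, stores `std::current_exception()` in the waiter's storage and publishes a failure signal; the waiter moves the `std::exception_ptr` out, destroys the stored copy, and rethrows in its own context.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <exception>
#include <new>
#include <stdexcept>
#include <string>
#include <thread>
#include <utility>

// Hypothetical signal values for this sketch only.
constexpr std::uint32_t kPending = 0;
constexpr std::uint32_t kCompleted = 1;
constexpr std::uint32_t kThrew = 2;

// Shared slot: a signal word plus storage that holds an exception_ptr only
// on the failure path (nothing is constructed on the success path).
struct Slot {
  Slot() {}
  ~Slot() {}
  std::atomic<std::uint32_t> signal{kPending};
  union {
    std::exception_ptr error;
  };
};

// Runs on the thread that executes the task on the waiter's behalf.
template <typename Task>
void runForWaiter(Slot& slot, const Task& task) {
  try {
    task();
    slot.signal.store(kCompleted, std::memory_order_release);
  } catch (...) {
    new (&slot.error) std::exception_ptr{std::current_exception()};
    slot.signal.store(kThrew, std::memory_order_release);
  }
}

// Runs on the waiting thread; rethrows whatever the task threw.
inline void awaitResult(Slot& slot) {
  auto signal = slot.signal.load(std::memory_order_acquire);
  while (signal == kPending) {
    std::this_thread::yield();
    signal = slot.signal.load(std::memory_order_acquire);
  }
  assert(signal == kCompleted || signal == kThrew);
  if (signal == kThrew) {
    // move the exception out and destroy the stored copy before throwing,
    // otherwise the stored exception_ptr would never be destroyed
    using Error = std::exception_ptr;
    Error copy = std::move(slot.error);
    slot.error.~Error();
    std::rethrow_exception(std::move(copy));
  }
}

// Throws on a second thread and reports the message seen by the caller.
inline std::string messageFromRemoteFailure() {
  Slot slot;
  std::thread remote{[&slot] {
    runForWaiter(slot, [] { throw std::runtime_error{"boom"}; });
  }};
  std::string message = "no exception";
  try {
    awaitResult(slot);
  } catch (const std::runtime_error& error) {
    message = error.what();
  }
  remote.join();
  return message;
}
```

The release store on the signal paired with the acquire load in the waiter is what makes the stored `std::exception_ptr` visible before it is read.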

--- a/folly/synchronization/test/DistributedMutexTest.cpp
+++ b/folly/synchronization/test/DistributedMutexTest.cpp
@@ -187,8 +187,8 @@ void atomic_notify_one(const ManualAtomic<std::uintptr_t>*) {
 } // namespace test

 namespace {
-DEFINE_int32(stress_factor, 1000, "The stress test factor for tests");
-DEFINE_int32(stress_test_seconds, 2, "Duration for stress tests");
+constexpr auto kStressFactor = 1000;
+constexpr auto kStressTestSeconds = 2;
 constexpr auto kForever = 100h;

 using DSched = test::DeterministicSchedule;
@@ -198,7 +198,7 @@ int sum(int n) {
 }

 template <template <typename> class Atom = std::atomic>
-void basicNThreads(int numThreads, int iterations = FLAGS_stress_factor) {
+void basicNThreads(int numThreads, int iterations = kStressFactor) {
  auto&& mutex = detail::distributed_mutex::DistributedMutex<Atom>{};
  auto&& barrier = std::atomic<int>{0};
  auto&& threads = std::vector<std::thread>{};
@@ -307,8 +307,12 @@ void combineNThreads(int numThreads, std::chrono::seconds duration) {
        auto current = mutex.lock_combine([&]() {
          result.fetch_add(1);
          EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
+          EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 1);
          std::this_thread::yield();
+          SCOPE_EXIT {
            EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
+          };
+          EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 2);
          return local.fetch_add(1);
        });
        EXPECT_EQ(current, expected - 1);
@@ -355,8 +359,12 @@ void combineWithLockNThreads(int numThreads, std::chrono::seconds duration) {
      auto current = mutex.lock_combine([&]() {
        auto iteration = total.fetch_add(1);
        EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
+        EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 1);
        std::this_thread::yield();
+        SCOPE_EXIT {
          EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
+        };
+        EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 2);
        return iteration;
      });

@@ -406,8 +414,12 @@ void combineWithTryLockNThreads(int numThreads, std::chrono::seconds duration) {
      auto current = mutex.lock_combine([&]() {
        auto iteration = total.fetch_add(1);
        EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
+        EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 1);
        std::this_thread::yield();
+        SCOPE_EXIT {
          EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
+        };
+        EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 2);
        return iteration;
      });

@@ -474,8 +486,12 @@ void combineWithLockTryAndTimedNThreads(
      auto current = mutex.lock_combine([&]() {
        auto iteration = total.fetch_add(1);
        EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
+        EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 1);
        std::this_thread::yield();
+        SCOPE_EXIT {
          EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
+        };
+        EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 2);

        // return a non-trivially-copyable object that occupies all the
        // storage we use to coalesce returns to test that codepath
@@ -753,162 +769,150 @@ TEST(DistributedMutex, StressHardwareConcurrencyThreads) {
 }

 TEST(DistributedMutex, StressThreeThreadsLockTryAndTimed) {
-  lockWithTryAndTimedNThreads(
-      3, std::chrono::seconds{FLAGS_stress_test_seconds});
+  lockWithTryAndTimedNThreads(3, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressSixThreadsLockTryAndTimed) {
-  lockWithTryAndTimedNThreads(
-      6, std::chrono::seconds{FLAGS_stress_test_seconds});
+  lockWithTryAndTimedNThreads(6, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressTwelveThreadsLockTryAndTimed) {
-  lockWithTryAndTimedNThreads(
-      12, std::chrono::seconds{FLAGS_stress_test_seconds});
+  lockWithTryAndTimedNThreads(12, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressTwentyFourThreadsLockTryAndTimed) {
-  lockWithTryAndTimedNThreads(
-      24, std::chrono::seconds{FLAGS_stress_test_seconds});
+  lockWithTryAndTimedNThreads(24, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressFourtyEightThreadsLockTryAndTimed) {
-  lockWithTryAndTimedNThreads(
-      48, std::chrono::seconds{FLAGS_stress_test_seconds});
+  lockWithTryAndTimedNThreads(48, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressSixtyFourThreadsLockTryAndTimed) {
-  lockWithTryAndTimedNThreads(
-      64, std::chrono::seconds{FLAGS_stress_test_seconds});
+  lockWithTryAndTimedNThreads(64, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressHwConcThreadsLockTryAndTimed) {
  lockWithTryAndTimedNThreads(
      std::thread::hardware_concurrency(),
-      std::chrono::seconds{FLAGS_stress_test_seconds});
+      std::chrono::seconds{kStressTestSeconds});
 }

 TEST(DistributedMutex, StressTwoThreadsCombine) {
-  combineNThreads(2, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineNThreads(2, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressThreeThreadsCombine) {
-  combineNThreads(3, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineNThreads(3, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressFourThreadsCombine) {
-  combineNThreads(4, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineNThreads(4, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressFiveThreadsCombine) {
-  combineNThreads(5, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineNThreads(5, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressSixThreadsCombine) {
-  combineNThreads(6, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineNThreads(6, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressSevenThreadsCombine) {
-  combineNThreads(7, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineNThreads(7, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressEightThreadsCombine) {
-  combineNThreads(8, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineNThreads(8, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressSixteenThreadsCombine) {
-  combineNThreads(16, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineNThreads(16, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressThirtyTwoThreadsCombine) {
-  combineNThreads(32, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineNThreads(32, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressSixtyFourThreadsCombine) {
-  combineNThreads(64, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineNThreads(64, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressHundredThreadsCombine) {
-  combineNThreads(100, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineNThreads(100, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressHardwareConcurrencyThreadsCombine) {
  combineNThreads(
      std::thread::hardware_concurrency(),
-      std::chrono::seconds{FLAGS_stress_test_seconds});
+      std::chrono::seconds{kStressTestSeconds});
 }

 TEST(DistributedMutex, StressTwoThreadsCombineAndLock) {
-  combineWithLockNThreads(2, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineWithLockNThreads(2, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressFourThreadsCombineAndLock) {
-  combineWithLockNThreads(4, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineWithLockNThreads(4, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressEightThreadsCombineAndLock) {
-  combineWithLockNThreads(8, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineWithLockNThreads(8, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressSixteenThreadsCombineAndLock) {
-  combineWithLockNThreads(16, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineWithLockNThreads(16, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressThirtyTwoThreadsCombineAndLock) {
-  combineWithLockNThreads(32, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineWithLockNThreads(32, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressSixtyFourThreadsCombineAndLock) {
-  combineWithLockNThreads(64, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineWithLockNThreads(64, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressHardwareConcurrencyThreadsCombineAndLock) {
  combineWithLockNThreads(
      std::thread::hardware_concurrency(),
-      std::chrono::seconds{FLAGS_stress_test_seconds});
+      std::chrono::seconds{kStressTestSeconds});
 }

 TEST(DistributedMutex, StressThreeThreadsCombineTryLockAndLock) {
-  combineWithTryLockNThreads(
-      3, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineWithTryLockNThreads(3, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressSixThreadsCombineTryLockAndLock) {
-  combineWithTryLockNThreads(
-      6, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineWithTryLockNThreads(6, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressTwelveThreadsCombineTryLockAndLock) {
-  combineWithTryLockNThreads(
-      12, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineWithTryLockNThreads(12, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressTwentyFourThreadsCombineTryLockAndLock) {
-  combineWithTryLockNThreads(
-      24, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineWithTryLockNThreads(24, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressFourtyEightThreadsCombineTryLockAndLock) {
-  combineWithTryLockNThreads(
-      48, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineWithTryLockNThreads(48, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressSixtyFourThreadsCombineTryLockAndLock) {
-  combineWithTryLockNThreads(
-      64, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineWithTryLockNThreads(64, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressHardwareConcurrencyThreadsCombineTryLockAndLock) {
  combineWithTryLockNThreads(
      std::thread::hardware_concurrency(),
-      std::chrono::seconds{FLAGS_stress_test_seconds});
+      std::chrono::seconds{kStressTestSeconds});
 }

 TEST(DistributedMutex, StressThreeThreadsCombineTryLockLockAndTimed) {
  combineWithLockTryAndTimedNThreads(
-      3, std::chrono::seconds{FLAGS_stress_test_seconds});
+      3, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressSixThreadsCombineTryLockLockAndTimed) {
  combineWithLockTryAndTimedNThreads(
-      6, std::chrono::seconds{FLAGS_stress_test_seconds});
+      6, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressTwelveThreadsCombineTryLockLockAndTimed) {
  combineWithLockTryAndTimedNThreads(
-      12, std::chrono::seconds{FLAGS_stress_test_seconds});
+      12, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressTwentyFourThreadsCombineTryLockLockAndTimed) {
  combineWithLockTryAndTimedNThreads(
-      24, std::chrono::seconds{FLAGS_stress_test_seconds});
+      24, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressFourtyEightThreadsCombineTryLockLockAndTimed) {
  combineWithLockTryAndTimedNThreads(
-      48, std::chrono::seconds{FLAGS_stress_test_seconds});
+      48, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressSixtyFourThreadsCombineTryLockLockAndTimed) {
  combineWithLockTryAndTimedNThreads(
-      64, std::chrono::seconds{FLAGS_stress_test_seconds});
+      64, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressHwConcurrencyThreadsCombineTryLockLockAndTimed) {
  combineWithLockTryAndTimedNThreads(
      std::thread::hardware_concurrency(),
-      std::chrono::seconds{FLAGS_stress_test_seconds});
+      std::chrono::seconds{kStressTestSeconds});
 }

 TEST(DistributedMutex, StressTryLock) {
  auto&& mutex = DistributedMutex{};

-  for (auto i = 0; i < FLAGS_stress_factor; ++i) {
+  for (auto i = 0; i < kStressFactor; ++i) {
    while (true) {
      auto state = mutex.try_lock();
      if (state) {
@@ -1022,152 +1026,146 @@ TEST(DistributedMutex, DeterministicStressThirtyTwoThreads) {

 TEST(DistributedMutex, DeterministicStressThreeThreadsLockTryAndTimed) {
  lockWithTryAndTimedNThreadsDeterministic(
-      3, std::chrono::seconds{FLAGS_stress_test_seconds});
+      3, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, DeterministicStressSixThreadsLockTryAndTimed) {
  lockWithTryAndTimedNThreadsDeterministic(
-      6, std::chrono::seconds{FLAGS_stress_test_seconds});
+      6, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, DeterministicStressTwelveThreadsLockTryAndTimed) {
  lockWithTryAndTimedNThreadsDeterministic(
-      12, std::chrono::seconds{FLAGS_stress_test_seconds});
+      12, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, DeterministicStressTwentyFourThreadsLockTryAndTimed) {
  lockWithTryAndTimedNThreadsDeterministic(
-      24, std::chrono::seconds{FLAGS_stress_test_seconds});
+      24, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, DeterministicStressFourtyEightThreadsLockTryAndTimed) {
  lockWithTryAndTimedNThreadsDeterministic(
-      48, std::chrono::seconds{FLAGS_stress_test_seconds});
+      48, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, DeterministicStressSixtyFourThreadsLockTryAndTimed) {
  lockWithTryAndTimedNThreadsDeterministic(
-      64, std::chrono::seconds{FLAGS_stress_test_seconds});
+      64, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, DeterministicStressHwConcThreadsLockTryAndTimed) {
  lockWithTryAndTimedNThreadsDeterministic(
      std::thread::hardware_concurrency(),
-      std::chrono::seconds{FLAGS_stress_test_seconds});
+      std::chrono::seconds{kStressTestSeconds});
 }

 TEST(DistributedMutex, CombineDeterministicStressTwoThreads) {
-  combineNThreadsDeterministic(
-      2, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineNThreadsDeterministic(2, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineDeterministicStressFourThreads) {
-  combineNThreadsDeterministic(
-      4, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineNThreadsDeterministic(4, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineDeterministicStressEightThreads) {
-  combineNThreadsDeterministic(
-      8, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineNThreadsDeterministic(8, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineDeterministicStressSixteenThreads) {
-  combineNThreadsDeterministic(
-      16, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineNThreadsDeterministic(16, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineDeterministicStressThirtyTwoThreads) {
-  combineNThreadsDeterministic(
-      32, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineNThreadsDeterministic(32, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineDeterministicStressSixtyFourThreads) {
-  combineNThreadsDeterministic(
-      64, std::chrono::seconds{FLAGS_stress_test_seconds});
+  combineNThreadsDeterministic(64, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineDeterministicStressHardwareConcurrencyThreads) {
  combineNThreadsDeterministic(
      std::thread::hardware_concurrency(),
-      std::chrono::seconds{FLAGS_stress_test_seconds});
+      std::chrono::seconds{kStressTestSeconds});
 }

 TEST(DistributedMutex, CombineAndLockDeterministicStressTwoThreads) {
  combineAndLockNThreadsDeterministic(
-      2, std::chrono::seconds{FLAGS_stress_test_seconds});
+      2, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineAndLockDeterministicStressFourThreads) {
  combineAndLockNThreadsDeterministic(
-      4, std::chrono::seconds{FLAGS_stress_test_seconds});
+      4, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineAndLockDeterministicStressEightThreads) {
  combineAndLockNThreadsDeterministic(
-      8, std::chrono::seconds{FLAGS_stress_test_seconds});
+      8, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineAndLockDeterministicStressSixteenThreads) {
  combineAndLockNThreadsDeterministic(
-      16, std::chrono::seconds{FLAGS_stress_test_seconds});
+      16, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineAndLockDeterministicStressThirtyTwoThreads) {
  combineAndLockNThreadsDeterministic(
-      32, std::chrono::seconds{FLAGS_stress_test_seconds});
+      32, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineAndLockDeterministicStressSixtyFourThreads) {
  combineAndLockNThreadsDeterministic(
-      64, std::chrono::seconds{FLAGS_stress_test_seconds});
+      64, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineAndLockDeterministicStressHWConcurrencyThreads) {
  combineAndLockNThreadsDeterministic(
      std::thread::hardware_concurrency(),
-      std::chrono::seconds{FLAGS_stress_test_seconds});
+      std::chrono::seconds{kStressTestSeconds});
 }

 TEST(DistributedMutex, CombineTryLockAndLockDeterministicStressThreeThreads) {
  combineTryLockAndLockNThreadsDeterministic(
-      3, std::chrono::seconds{FLAGS_stress_test_seconds});
+      3, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineTryLockAndLockDeterministicStressSixThreads) {
  combineTryLockAndLockNThreadsDeterministic(
-      6, std::chrono::seconds{FLAGS_stress_test_seconds});
+      6, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineTryLockAndLockDeterministicStressTwelveThreads) {
  combineTryLockAndLockNThreadsDeterministic(
-      12, std::chrono::seconds{FLAGS_stress_test_seconds});
+      12, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineTryLockAndLockDeterministicStressTwentyThreads) {
  combineTryLockAndLockNThreadsDeterministic(
-      24, std::chrono::seconds{FLAGS_stress_test_seconds});
+      24, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineTryLockAndLockDeterministicStressFortyThreads) {
  combineTryLockAndLockNThreadsDeterministic(
-      48, std::chrono::seconds{FLAGS_stress_test_seconds});
+      48, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineTryLockAndLockDeterministicStressSixtyThreads) {
  combineTryLockAndLockNThreadsDeterministic(
-      64, std::chrono::seconds{FLAGS_stress_test_seconds});
+      64, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineTryLockAndLockDeterministicStressHWConcThreads) {
  combineTryLockAndLockNThreadsDeterministic(
      std::thread::hardware_concurrency(),
-      std::chrono::seconds{FLAGS_stress_test_seconds});
+      std::chrono::seconds{kStressTestSeconds});
 }

 TEST(DistributedMutex, CombineTryLockAndTimedDeterministicStressThreeThreads) {
  combineWithTryLockAndTimedNThreadsDeterministic(
-      3, std::chrono::seconds{FLAGS_stress_test_seconds});
+      3, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineTryLockAndTimedDeterministicStressSixThreads) {
  combineWithTryLockAndTimedNThreadsDeterministic(
-      6, std::chrono::seconds{FLAGS_stress_test_seconds});
+      6, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineTryLockAndTimedDeterministicStressTwelveThreads) {
  combineWithTryLockAndTimedNThreadsDeterministic(
-      12, std::chrono::seconds{FLAGS_stress_test_seconds});
+      12, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineTryLockAndTimedDeterministicStressTwentyThreads) {
  combineWithTryLockAndTimedNThreadsDeterministic(
-      24, std::chrono::seconds{FLAGS_stress_test_seconds});
+      24, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineTryLockAndTimedDeterministicStressFortyThreads) {
  combineWithTryLockAndTimedNThreadsDeterministic(
-      48, std::chrono::seconds{FLAGS_stress_test_seconds});
+      48, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineTryLockAndTimedDeterministicStressSixtyThreads) {
  combineWithTryLockAndTimedNThreadsDeterministic(
-      64, std::chrono::seconds{FLAGS_stress_test_seconds});
+      64, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, CombineTryLockAndTimedDeterministicStressHWConcThreads) {
  combineWithTryLockAndTimedNThreadsDeterministic(
      std::thread::hardware_concurrency(),
-      std::chrono::seconds{FLAGS_stress_test_seconds});
+      std::chrono::seconds{kStressTestSeconds});
 }

 TEST(DistributedMutex, TimedLockTimeout) {
@@ -1308,7 +1306,7 @@ namespace {
 template <template <typename> class Atom = std::atomic>
 void stressTryLockWithConcurrentLocks(
    int numThreads,
-    int iterations = FLAGS_stress_factor) {
+    int iterations = kStressFactor) {
  auto&& threads = std::vector<std::thread>{};
  auto&& mutex = detail::distributed_mutex::DistributedMutex<Atom>{};
  auto&& atomic = std::atomic<std::uint64_t>{0};
@@ -1419,7 +1417,7 @@ TEST(DistributedMutex, DeterministicTryLockWithLocksSixtyFourThreads) {

 namespace {
 template <template <typename> class Atom = std::atomic>
-void concurrentTryLocks(int numThreads, int iterations = FLAGS_stress_factor) {
+void concurrentTryLocks(int numThreads, int iterations = kStressFactor) {
  auto&& threads = std::vector<std::thread>{};
  auto&& mutex = detail::distributed_mutex::DistributedMutex<Atom>{};
  auto&& atomic = std::atomic<std::uint64_t>{0};
@@ -1643,7 +1641,7 @@ TEST(DistributedMutex, TestAppropriateDestructionAndConstructionWithCombine) {
  }};

  /* sleep override */
-  std::this_thread::sleep_for(std::chrono::seconds{FLAGS_stress_test_seconds});
+  std::this_thread::sleep_for(std::chrono::seconds{kStressTestSeconds});
  stop.store(true);
  thread.join();
 }
@@ -1670,8 +1668,12 @@ void concurrentLocksManyMutexes(int numThreads, std::chrono::seconds duration) {
        ++expected;
        auto result = mutex.lock_combine([&]() {
          EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
+          EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 1);
          std::this_thread::yield();
+          SCOPE_EXIT {
            EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
+          };
+          EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 2);
          return total.fetch_add(1, std::memory_order_relaxed);
        });
        EXPECT_EQ(result, expected - 1);
@@ -1691,28 +1693,22 @@ void concurrentLocksManyMutexes(int numThreads, std::chrono::seconds duration) {
 } // namespace

 TEST(DistributedMutex, StressWithManyMutexesAlternatingTwoThreads) {
-  concurrentLocksManyMutexes(
-      2, std::chrono::seconds{FLAGS_stress_test_seconds});
+  concurrentLocksManyMutexes(2, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressWithManyMutexesAlternatingFourThreads) {
-  concurrentLocksManyMutexes(
-      4, std::chrono::seconds{FLAGS_stress_test_seconds});
+  concurrentLocksManyMutexes(4, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressWithManyMutexesAlternatingEightThreads) {
-  concurrentLocksManyMutexes(
-      8, std::chrono::seconds{FLAGS_stress_test_seconds});
+  concurrentLocksManyMutexes(8, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressWithManyMutexesAlternatingSixteenThreads) {
-  concurrentLocksManyMutexes(
-      16, std::chrono::seconds{FLAGS_stress_test_seconds});
+  concurrentLocksManyMutexes(16, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressWithManyMutexesAlternatingThirtyTwoThreads) {
-  concurrentLocksManyMutexes(
-      32, std::chrono::seconds{FLAGS_stress_test_seconds});
+  concurrentLocksManyMutexes(32, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, StressWithManyMutexesAlternatingSixtyFourThreads) {
-  concurrentLocksManyMutexes(
-      64, std::chrono::seconds{FLAGS_stress_test_seconds});
+  concurrentLocksManyMutexes(64, std::chrono::seconds{kStressTestSeconds});
 }

 namespace {
@@ -1733,26 +1729,289 @@ void concurrentLocksManyMutexesDeterministic(

 TEST(DistributedMutex, DeterministicWithManyMutexesAlternatingTwoThreads) {
  concurrentLocksManyMutexesDeterministic(
-      2, std::chrono::seconds{FLAGS_stress_test_seconds});
+      2, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, DeterministicWithManyMutexesAlternatingFourThreads) {
  concurrentLocksManyMutexesDeterministic(
-      4, std::chrono::seconds{FLAGS_stress_test_seconds});
+      4, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, DeterministicWithManyMutexesAlternatingEightThreads) {
  concurrentLocksManyMutexesDeterministic(
-      8, std::chrono::seconds{FLAGS_stress_test_seconds});
+      8, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, DeterministicWithManyMutexesAlternatingSixteenThreads) {
  concurrentLocksManyMutexesDeterministic(
-      16, std::chrono::seconds{FLAGS_stress_test_seconds});
+      16, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, DeterministicWithManyMtxAlternatingThirtyTwoThreads) {
  concurrentLocksManyMutexesDeterministic(
-      32, std::chrono::seconds{FLAGS_stress_test_seconds});
+      32, std::chrono::seconds{kStressTestSeconds});
 }
 TEST(DistributedMutex, DeterministicWithManyMtxAlternatingSixtyFourThreads) {
  concurrentLocksManyMutexesDeterministic(
-      64, std::chrono::seconds{FLAGS_stress_test_seconds});
+      64, std::chrono::seconds{kStressTestSeconds});
+}
+
+namespace {
+class ExceptionWithConstructionTrack : public std::exception {
+ public:
+  explicit ExceptionWithConstructionTrack(int id)
+      : id_{folly::to<std::string>(id)}, constructionTrack_{id} {}
+
+  const char* what() const noexcept override {
+    return id_.c_str();
+  }
+
+ private:
+  std::string id_;
+  TestConstruction constructionTrack_;
+};
+} // namespace
+
+TEST(DistributedMutex, TestExceptionPropagationUncontended) {
+  TestConstruction::reset();
+  auto&& mutex = folly::DistributedMutex{};
+
+  auto&& thread = std::thread{[&]() {
+    try {
+      mutex.lock_combine([&]() { throw ExceptionWithConstructionTrack{46}; });
+    } catch (std::exception& exc) {
+      auto integer = folly::to<std::uint64_t>(exc.what());
+      EXPECT_EQ(integer, 46);
+      EXPECT_GT(TestConstruction::defaultConstructs(), 0);
+    }
+
+    EXPECT_EQ(
+        TestConstruction::defaultConstructs(), TestConstruction::destructs());
+  }};
+
+  thread.join();
+}
+
+namespace {
+template <template <typename> class Atom = std::atomic>
+void concurrentExceptionPropagationStress(
+    int numThreads,
+    std::chrono::milliseconds t) {
+  TestConstruction::reset();
+  auto&& mutex = detail::distributed_mutex::DistributedMutex<Atom>{};
+  auto&& threads = std::vector<std::thread>{};
+  auto&& stop = std::atomic<bool>{false};
+  auto&& barrier = std::atomic<std::uint64_t>{0};
+
+  for (auto i = 0; i < numThreads; ++i) {
+    threads.push_back(DSched::thread([&]() {
+      for (auto j = 0; !stop.load(); ++j) {
+        auto value = int{0};
+        try {
+          value = mutex.lock_combine([&]() {
+            EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
+            EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 1);
+            std::this_thread::yield();
+            SCOPE_EXIT {
+              EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
+            };
+            EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 2);
+
+            // throw on every third iteration, i.e. whenever j % 3 == 0
+            if (!(j % 3)) {
+              throw ExceptionWithConstructionTrack{j};
+            }
+
+            return j;
+          });
+        } catch (std::exception& exc) {
+          value = folly::to<int>(exc.what());
+        }
+
+        EXPECT_EQ(value, j);
+      }
+    }));
+  }
+
+  /* sleep override */
+  std::this_thread::sleep_for(t);
+  stop.store(true);
+  for (auto& thread : threads) {
+    DSched::join(thread);
+  }
+}
+} // namespace
+
+TEST(DistributedMutex, TestExceptionPropagationStressTwoThreads) {
+  concurrentExceptionPropagationStress(
+      2, std::chrono::seconds{kStressTestSeconds});
+}
+TEST(DistributedMutex, TestExceptionPropagationStressFourThreads) {
+  concurrentExceptionPropagationStress(
+      4, std::chrono::seconds{kStressTestSeconds});
+}
+TEST(DistributedMutex, TestExceptionPropagationStressEightThreads) {
+  concurrentExceptionPropagationStress(
+      8, std::chrono::seconds{kStressTestSeconds});
+}
+TEST(DistributedMutex, TestExceptionPropagationStressSixteenThreads) {
+  concurrentExceptionPropagationStress(
+      16, std::chrono::seconds{kStressTestSeconds});
+}
+TEST(DistributedMutex, TestExceptionPropagationStressThirtyTwoThreads) {
+  concurrentExceptionPropagationStress(
+      32, std::chrono::seconds{kStressTestSeconds});
+}
+TEST(DistributedMutex, TestExceptionPropagationStressSixtyFourThreads) {
+  concurrentExceptionPropagationStress(
+      64, std::chrono::seconds{kStressTestSeconds});
+}
+
+namespace {
+void concurrentExceptionPropagationDeterministic(
+    int threads,
+    std::chrono::seconds t) {
+  const auto kNumPasses = 3.0;
+  const auto seconds = std::ceil(static_cast<double>(t.count()) / kNumPasses);
+  const auto time = std::chrono::seconds{static_cast<std::uint64_t>(seconds)};
+
+  for (auto pass = 0; pass < kNumPasses; ++pass) {
+    auto&& schedule = DSched{DSched::uniform(pass)};
+    concurrentExceptionPropagationStress<test::DeterministicAtomic>(
+        threads, time);
+    static_cast<void>(schedule);
+  }
+}
+} // namespace
+
+TEST(DistributedMutex, TestExceptionPropagationDeterministicTwoThreads) {
+  concurrentExceptionPropagationDeterministic(
+      2, std::chrono::seconds{kStressTestSeconds});
+}
+TEST(DistributedMutex, TestExceptionPropagationDeterministicFourThreads) {
+  concurrentExceptionPropagationDeterministic(
+      4, std::chrono::seconds{kStressTestSeconds});
+}
+TEST(DistributedMutex, TestExceptionPropagationDeterministicEightThreads) {
+  concurrentExceptionPropagationDeterministic(
+      8, std::chrono::seconds{kStressTestSeconds});
+}
+TEST(DistributedMutex, TestExceptionPropagationDeterministicSixteenThreads) {
+  concurrentExceptionPropagationDeterministic(
+      16, std::chrono::seconds{kStressTestSeconds});
+}
+TEST(DistributedMutex, TestExceptionPropagationDeterministicThirtyTwoThreads) {
+  concurrentExceptionPropagationDeterministic(
+      32, std::chrono::seconds{kStressTestSeconds});
+}
+TEST(DistributedMutex, TestExceptionPropagationDeterministicSixtyFourThreads) {
+  concurrentExceptionPropagationDeterministic(
+      64, std::chrono::seconds{kStressTestSeconds});
+}
+
+namespace {
+std::array<std::uint64_t, 8> makeMonotonicArray(int start) {
+  auto array = std::array<std::uint64_t, 8>{};
+  folly::for_each(array, [&](auto& element) { element = start++; });
+  return array;
+}
+
+template <template <typename> class Atom = std::atomic>
+void concurrentBigValueReturnStress(
+    int numThreads,
+    std::chrono::milliseconds t) {
+  auto&& mutex = detail::distributed_mutex::DistributedMutex<Atom>{};
+  auto&& threads = std::vector<std::thread>{};
+  auto&& stop = std::atomic<bool>{false};
+  auto&& barrier = std::atomic<std::uint64_t>{0};
+
+  for (auto i = 0; i < numThreads; ++i) {
+    threads.push_back(DSched::thread([&]() {
+      auto&& value = std::atomic<std::uint64_t>{0};
+
+      for (auto j = 0; !stop.load(); ++j) {
+        auto returned = mutex.lock_combine([&]() {
+          EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
+          EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 1);
+          std::this_thread::yield();
+          // return 64 bytes (typically one cacheline) worth of data
+          auto current = value.fetch_add(1, std::memory_order_relaxed);
+          SCOPE_EXIT {
+            EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
+          };
+          EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 2);
+          return makeMonotonicArray(current);
+        });
+
+        auto expected = value.load() - 1;
+        folly::for_each(
+            returned, [&](auto& element) { EXPECT_EQ(element, expected++); });
+      }
+    }));
+  }
+
+  /* sleep override */
+  std::this_thread::sleep_for(t);
+  stop.store(true);
+  for (auto& thread : threads) {
+    DSched::join(thread);
+  }
+}
+} // namespace
+
+TEST(DistributedMutex, StressBigValueReturnTwoThreads) {
+  concurrentBigValueReturnStress(2, std::chrono::seconds{kStressTestSeconds});
+}
+TEST(DistributedMutex, StressBigValueReturnFourThreads) {
+  concurrentBigValueReturnStress(4, std::chrono::seconds{kStressTestSeconds});
+}
+TEST(DistributedMutex, StressBigValueReturnEightThreads) {
+  concurrentBigValueReturnStress(8, std::chrono::seconds{kStressTestSeconds});
+}
+TEST(DistributedMutex, StressBigValueReturnSixteenThreads) {
+  concurrentBigValueReturnStress(16, std::chrono::seconds{kStressTestSeconds});
+}
+TEST(DistributedMutex, StressBigValueReturnThirtyTwoThreads) {
+  concurrentBigValueReturnStress(32, std::chrono::seconds{kStressTestSeconds});
+}
+TEST(DistributedMutex, StressBigValueReturnSixtyFourThreads) {
+  concurrentBigValueReturnStress(64, std::chrono::seconds{kStressTestSeconds});
+}
+
+namespace {
+void concurrentBigValueReturnDeterministic(
+    int threads,
+    std::chrono::seconds t) {
+  const auto kNumPasses = 3.0;
+  const auto seconds = std::ceil(static_cast<double>(t.count()) / kNumPasses);
+  const auto time = std::chrono::seconds{static_cast<std::uint64_t>(seconds)};
+
+  for (auto pass = 0; pass < kNumPasses; ++pass) {
+    auto&& schedule = DSched{DSched::uniform(pass)};
+    concurrentBigValueReturnStress<test::DeterministicAtomic>(threads, time);
+    static_cast<void>(schedule);
+  }
+}
+} // namespace
+
+TEST(DistributedMutex, DeterministicBigValueReturnTwoThreads) {
+  concurrentBigValueReturnDeterministic(
+      2, std::chrono::seconds{kStressTestSeconds});
+}
+TEST(DistributedMutex, DeterministicBigValueReturnFourThreads) {
+  concurrentBigValueReturnDeterministic(
+      4, std::chrono::seconds{kStressTestSeconds});
+}
+TEST(DistributedMutex, DeterministicBigValueReturnEightThreads) {
+  concurrentBigValueReturnDeterministic(
+      8, std::chrono::seconds{kStressTestSeconds});
+}
+TEST(DistributedMutex, DeterministicBigValueReturnSixteenThreads) {
+  concurrentBigValueReturnDeterministic(
+      16, std::chrono::seconds{kStressTestSeconds});
+}
+TEST(DistributedMutex, DeterministicBigValueReturnThirtyTwoThreads) {
+  concurrentBigValueReturnDeterministic(
+      32, std::chrono::seconds{kStressTestSeconds});
+}
+TEST(DistributedMutex, DeterministicBigValueReturnSixtyFourThreads) {
+  concurrentBigValueReturnDeterministic(
+      64, std::chrono::seconds{kStressTestSeconds});
 }
 } // namespace folly
--- a/folly/synchronization/test/SmallLocksBenchmark.cpp
+++ b/folly/synchronization/test/SmallLocksBenchmark.cpp
@@ -167,13 +167,6 @@ auto lock_and(FlatCombiningMutexCaching& mutex, std::size_t i, F func) {
  return mutex.lock_combine(func, i);
 }

-template <typename Mutex>
-std::unique_lock<Mutex> lock(Mutex& mutex) {
-  return std::unique_lock<Mutex>{mutex};
-}
-template <typename Mutex, typename Other>
-void unlock(Mutex&, Other) {}
-
 /**
 * Functions to initialize, write and read from data
 *
@@ -285,7 +278,8 @@ runContended(size_t numOps, size_t numThreads, size_t work = FLAGS_work) {
      lockstruct* mutex = &locks[t % threadgroups];
      runbarrier.wait();
      for (size_t op = 0; op < numOps; op += 1) {
-        auto val = lock_and(mutex->mutex, t, [& value = mutex->value, work] {
+        auto val = lock_and(
+            mutex->mutex, t, [& value = mutex->value, work ]() noexcept {
              burn(work);
              return write(value);
            });
@@ -344,7 +338,10 @@ static void runFairness(std::size_t numThreads) {
      while (!stop) {
        std::chrono::steady_clock::time_point prelock =
            std::chrono::steady_clock::now();
-        auto state = lock(mutex->lock);
+        lock_and(mutex->lock, t, [&]() {
+          burn(FLAGS_work);
+          value++;
+        });
        std::chrono::steady_clock::time_point postlock =
            std::chrono::steady_clock::now();
        auto diff = std::chrono::duration_cast<std::chrono::microseconds>(
@@ -354,10 +351,6 @@ static void runFairness(std::size_t numThreads) {
        if (diff > max) {
          max = diff;
        }
-        burn(FLAGS_work);
-        value++;
-        unlock(mutex->lock, std::move(state));
-        burn(FLAGS_unlocked_work);
      }
      {
        std::lock_guard<std::mutex> g(rlock);
@@ -410,9 +403,8 @@ void runUncontended(std::size_t iters) {
  auto&& mutex = Mutex{};
  for (auto i = std::size_t{0}; i < iters; ++i) {
    folly::makeUnpredictable(mutex);
-    auto state = lock(mutex);
+    auto lck = std::unique_lock<Mutex>{mutex};
    folly::makeUnpredictable(mutex);
-    unlock(mutex, std::move(state));
  }
 }

@@ -777,6 +769,8 @@ int main(int argc, char** argv) {
      fairnessTest<folly::SharedMutex>("folly::SharedMutex", numThreads);
      fairnessTest<folly::DistributedMutex>(
          "folly::DistributedMutex", numThreads);
+      fairnessTest<DistributedMutexFlatCombining>(
+          "folly::DistributedMutex (Combining)", numThreads);

      std::cout << std::string(76, '=') << std::endl;
    }
@@ -792,349 +786,367 @@ int main(int argc, char** argv) {
 Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz

 ------- std::mutex 2 threads
-Sum: 107741003 Mean: 1923946 stddev: 99873
-Lock time stats in us: mean 1 stddev 40 max 53562
+Sum: 361854376 Mean: 6461685 stddev: 770837
+Lock time stats in us: mean 0 stddev 1 max 63002
 ------- GoogleSpinLock 2 threads
-Sum: 129434359 Mean: 2311327 stddev: 74053
-Lock time stats in us: mean 0 stddev 4 max 53102
+Sum: 463530598 Mean: 8277332 stddev: 759139
+Lock time stats in us: mean 0 stddev 9 max 44995
 ------- folly::MicroSpinLock 2 threads
-Sum: 225366606 Mean: 4024403 stddev: 1884122
-Lock time stats in us: mean 0 stddev 19 max 2278444
+Sum: 454928254 Mean: 8123718 stddev: 1568978
+Lock time stats in us: mean 0 stddev 9 max 118006
 ------- folly::PicoSpinLock<std::uint16_t> 2 threads
-Sum: 150216526 Mean: 2682437 stddev: 216045
-Lock time stats in us: mean 0 stddev 28 max 36826
+Sum: 376990850 Mean: 6731979 stddev: 1295859
+Lock time stats in us: mean 0 stddev 1 max 83007
 ------- folly::MicroLock 2 threads
-Sum: 132299209 Mean: 2362485 stddev: 496423
-Lock time stats in us: mean 0 stddev 32 max 68123
+Sum: 316081944 Mean: 5644320 stddev: 1249068
+Lock time stats in us: mean 0 stddev 13 max 53930
 ------- folly::SharedMutex 2 threads
-Sum: 132465497 Mean: 2365455 stddev: 556997
-Lock time stats in us: mean 0 stddev 32 max 24447
+Sum: 389298695 Mean: 6951762 stddev: 3031794
+Lock time stats in us: mean 0 stddev 2 max 55004
 ------- folly::DistributedMutex 2 threads
-Sum: 166667563 Mean: 2976206 stddev: 183292
-Lock time stats in us: mean 0 stddev 3 max 2834
+Sum: 512343772 Mean: 9148995 stddev: 1168346
+Lock time stats in us: mean 0 stddev 8 max 50830
+------- folly::DistributedMutex (Combining) 2 threads
+Sum: 475079423 Mean: 8483561 stddev: 899288
+Lock time stats in us: mean 0 stddev 1 max 26006
 ============================================================================
 ------- std::mutex 4 threads
-Sum: 56176633 Mean: 1003154 stddev: 20354
-Lock time stats in us: mean 2 stddev 76 max 10151
+Sum: 164126417 Mean: 2930828 stddev: 208327
+Lock time stats in us: mean 0 stddev 2 max 11759
 ------- GoogleSpinLock 4 threads
-Sum: 65060684 Mean: 1161797 stddev: 95631
-Lock time stats in us: mean 1 stddev 66 max 9624
+Sum: 200210044 Mean: 3575179 stddev: 472142
+Lock time stats in us: mean 0 stddev 21 max 16715
 ------- folly::MicroSpinLock 4 threads
-Sum: 124794912 Mean: 2228480 stddev: 752355
-Lock time stats in us: mean 1 stddev 2 max 1973546
+Sum: 168795789 Mean: 3014210 stddev: 825455
+Lock time stats in us: mean 0 stddev 3 max 152163
 ------- folly::PicoSpinLock<std::uint16_t> 4 threads
-Sum: 86858717 Mean: 1551048 stddev: 417050
-Lock time stats in us: mean 1 stddev 2 max 87873
+Sum: 125788231 Mean: 2246218 stddev: 755074
+Lock time stats in us: mean 1 stddev 3 max 151004
 ------- folly::MicroLock 4 threads
-Sum: 64529361 Mean: 1152310 stddev: 363331
-Lock time stats in us: mean 2 stddev 66 max 34196
+Sum: 109091138 Mean: 1948056 stddev: 465388
+Lock time stats in us: mean 1 stddev 39 max 60029
 ------- folly::SharedMutex 4 threads
-Sum: 64509031 Mean: 1151946 stddev: 551973
-Lock time stats in us: mean 2 stddev 5 max 58400
+Sum: 107870343 Mean: 1926256 stddev: 1039541
+Lock time stats in us: mean 1 stddev 2 max 57002
 ------- folly::DistributedMutex 4 threads
-Sum: 76778688 Mean: 1371048 stddev: 89767
-Lock time stats in us: mean 2 stddev 56 max 4038
+Sum: 207229191 Mean: 3700521 stddev: 182811
+Lock time stats in us: mean 0 stddev 21 max 16231
+------- folly::DistributedMutex (Combining) 4 threads
+Sum: 204144735 Mean: 3645441 stddev: 619224
+Lock time stats in us: mean 0 stddev 0 max 27008
 ============================================================================
 ------- std::mutex 8 threads
-Sum: 27905504 Mean: 498312 stddev: 12266
-Lock time stats in us: mean 4 stddev 154 max 10915
+Sum: 82709846 Mean: 1476961 stddev: 173483
+Lock time stats in us: mean 2 stddev 52 max 9404
 ------- GoogleSpinLock 8 threads
-Sum: 34900763 Mean: 623227 stddev: 34990
-Lock time stats in us: mean 3 stddev 4 max 11047
+Sum: 98373671 Mean: 1756672 stddev: 65326
+Lock time stats in us: mean 1 stddev 43 max 20805
 ------- folly::MicroSpinLock 8 threads
-Sum: 65703639 Mean: 1173279 stddev: 367466
-Lock time stats in us: mean 2 stddev 65 max 1985454
+Sum: 94805197 Mean: 1692949 stddev: 633249
+Lock time stats in us: mean 1 stddev 3 max 104517
 ------- folly::PicoSpinLock<std::uint16_t> 8 threads
-Sum: 46642042 Mean: 832893 stddev: 258465
-Lock time stats in us: mean 3 stddev 5 max 90012
+Sum: 41587796 Mean: 742639 stddev: 191868
+Lock time stats in us: mean 4 stddev 103 max 317025
 ------- folly::MicroLock 8 threads
-Sum: 28727093 Mean: 512983 stddev: 105746
-Lock time stats in us: mean 6 stddev 149 max 24648
+Sum: 42414128 Mean: 757395 stddev: 234934
+Lock time stats in us: mean 4 stddev 101 max 39660
 ------- folly::SharedMutex 8 threads
-Sum: 35789774 Mean: 639103 stddev: 420746
-Lock time stats in us: mean 5 stddev 120 max 95030
+Sum: 58861445 Mean: 1051097 stddev: 491231
+Lock time stats in us: mean 3 stddev 73 max 34007
 ------- folly::DistributedMutex 8 threads
-Sum: 33288752 Mean: 594442 stddev: 20581
-Lock time stats in us: mean 5 stddev 129 max 7018
+Sum: 93377108 Mean: 1667448 stddev: 113502
+Lock time stats in us: mean 1 stddev 46 max 11075
+------- folly::DistributedMutex (Combining) 8 threads
+Sum: 131093487 Mean: 2340955 stddev: 187841
+Lock time stats in us: mean 1 stddev 3 max 25004
 ============================================================================
 ------- std::mutex 16 threads
-Sum: 10886472 Mean: 194401 stddev: 9357
-Lock time stats in us: mean 12 stddev 394 max 13293
+Sum: 36606221 Mean: 653682 stddev: 65154
+Lock time stats in us: mean 5 stddev 117 max 13603
 ------- GoogleSpinLock 16 threads
-Sum: 13436731 Mean: 239941 stddev: 25068
-Lock time stats in us: mean 10 stddev 319 max 10127
+Sum: 29830088 Mean: 532680 stddev: 19614
+Lock time stats in us: mean 7 stddev 2 max 10338
 ------- folly::MicroSpinLock 16 threads
-Sum: 28766414 Mean: 513685 stddev: 109667
-Lock time stats in us: mean 7 stddev 149 max 453504
+Sum: 27935153 Mean: 498842 stddev: 197304
+Lock time stats in us: mean 7 stddev 3 max 257433
 ------- folly::PicoSpinLock<std::uint16_t> 16 threads
-Sum: 19795815 Mean: 353496 stddev: 110097
-Lock time stats in us: mean 10 stddev 217 max 164821
+Sum: 12265416 Mean: 219025 stddev: 146399
+Lock time stats in us: mean 17 stddev 350 max 471793
 ------- folly::MicroLock 16 threads
-Sum: 11380567 Mean: 203224 stddev: 25356
-Lock time stats in us: mean 15 stddev 377 max 13342
+Sum: 18180611 Mean: 324653 stddev: 32123
+Lock time stats in us: mean 11 stddev 236 max 40166
 ------- folly::SharedMutex 16 threads
-Sum: 13734684 Mean: 245262 stddev: 132500
-Lock time stats in us: mean 15 stddev 312 max 75465
+Sum: 21734734 Mean: 388120 stddev: 190252
+Lock time stats in us: mean 9 stddev 197 max 107045
 ------- folly::DistributedMutex 16 threads
-Sum: 13463633 Mean: 240422 stddev: 8070
-Lock time stats in us: mean 15 stddev 319 max 17020
+Sum: 42823745 Mean: 764709 stddev: 64251
+Lock time stats in us: mean 4 stddev 100 max 19986
+------- folly::DistributedMutex (Combining) 16 threads
+Sum: 63515255 Mean: 1134200 stddev: 37905
+Lock time stats in us: mean 2 stddev 3 max 32005
 ============================================================================
 ------- std::mutex 32 threads
-Sum: 3584545 Mean: 64009 stddev: 1099
-Lock time stats in us: mean 39 stddev 1197 max 12949
+Sum: 10307832 Mean: 184068 stddev: 2431
+Lock time stats in us: mean 21 stddev 416 max 18397
 ------- GoogleSpinLock 32 threads
-Sum: 4537642 Mean: 81029 stddev: 7258
-Lock time stats in us: mean 28 stddev 946 max 10736
+Sum: 10911809 Mean: 194853 stddev: 2968
+Lock time stats in us: mean 20 stddev 393 max 10765
 ------- folly::MicroSpinLock 32 threads
-Sum: 9493894 Mean: 169533 stddev: 42004
-Lock time stats in us: mean 23 stddev 452 max 934519
+Sum: 7318139 Mean: 130681 stddev: 24742
+Lock time stats in us: mean 29 stddev 586 max 230672
 ------- folly::PicoSpinLock<std::uint16_t> 32 threads
-Sum: 7159818 Mean: 127853 stddev: 20791
-Lock time stats in us: mean 30 stddev 599 max 116982
+Sum: 6424015 Mean: 114714 stddev: 138460
+Lock time stats in us: mean 34 stddev 668 max 879632
 ------- folly::MicroLock 32 threads
-Sum: 4052635 Mean: 72368 stddev: 10196
-Lock time stats in us: mean 38 stddev 1059 max 13123
+Sum: 4893744 Mean: 87388 stddev: 6935
+Lock time stats in us: mean 45 stddev 876 max 14902
 ------- folly::SharedMutex 32 threads
-Sum: 4207373 Mean: 75131 stddev: 36441
-Lock time stats in us: mean 51 stddev 1019 max 89781
+Sum: 6393363 Mean: 114167 stddev: 80211
+Lock time stats in us: mean 34 stddev 671 max 75777
 ------- folly::DistributedMutex 32 threads
-Sum: 4499483 Mean: 80347 stddev: 1684
-Lock time stats in us: mean 48 stddev 954 max 18793
+Sum: 14394775 Mean: 257049 stddev: 36723
+Lock time stats in us: mean 15 stddev 298 max 54654
+------- folly::DistributedMutex (Combining) 32 threads
+Sum: 24232845 Mean: 432729 stddev: 11398
+Lock time stats in us: mean 8 stddev 177 max 35008
 ============================================================================
 ------- std::mutex 64 threads
-Sum: 3584393 Mean: 56006 stddev: 989
-Lock time stats in us: mean 48 stddev 1197 max 12681
+Sum: 10656640 Mean: 166510 stddev: 3340
+Lock time stats in us: mean 23 stddev 402 max 10797
 ------- GoogleSpinLock 64 threads
-Sum: 4541415 Mean: 70959 stddev: 2042
-Lock time stats in us: mean 34 stddev 945 max 12997
+Sum: 11263029 Mean: 175984 stddev: 4669
+Lock time stats in us: mean 22 stddev 381 max 26844
 ------- folly::MicroSpinLock 64 threads
-Sum: 9464010 Mean: 147875 stddev: 43363
-Lock time stats in us: mean 26 stddev 453 max 464213
+Sum: 23284721 Mean: 363823 stddev: 62670
+Lock time stats in us: mean 10 stddev 184 max 168470
 ------- folly::PicoSpinLock<std::uint16_t> 64 threads
-Sum: 6915111 Mean: 108048 stddev: 15833
-Lock time stats in us: mean 36 stddev 620 max 162031
+Sum: 2322545 Mean: 36289 stddev: 6272
+Lock time stats in us: mean 109 stddev 1846 max 1157157
 ------- folly::MicroLock 64 threads
-Sum: 4008803 Mean: 62637 stddev: 6055
-Lock time stats in us: mean 46 stddev 1070 max 25289
+Sum: 4835136 Mean: 75549 stddev: 3484
+Lock time stats in us: mean 52 stddev 887 max 23895
 ------- folly::SharedMutex 64 threads
-Sum: 3580719 Mean: 55948 stddev: 23224
-Lock time stats in us: mean 68 stddev 1198 max 63328
+Sum: 7047147 Mean: 110111 stddev: 53207
+Lock time stats in us: mean 35 stddev 608 max 85181
 ------- folly::DistributedMutex 64 threads
-Sum: 4464065 Mean: 69751 stddev: 2299
-Lock time stats in us: mean 56 stddev 960 max 32873
+Sum: 14491662 Mean: 226432 stddev: 27098
+Lock time stats in us: mean 17 stddev 296 max 55078
+------- folly::DistributedMutex (Combining) 64 threads
+Sum: 23885026 Mean: 373203 stddev: 14431
+Lock time stats in us: mean 10 stddev 179 max 62008
 ============================================================================
 ============================================================================
 folly/synchronization/test/SmallLocksBenchmark.cpprelative  time/iter  iters/s
 ============================================================================
-StdMutexUncontendedBenchmark                                16.40ns   60.98M
-GoogleSpinUncontendedBenchmark                              11.23ns   89.02M
-MicroSpinLockUncontendedBenchmark                           10.94ns   91.45M
-PicoSpinLockUncontendedBenchmark                            20.37ns   49.08M
-MicroLockUncontendedBenchmark                               29.21ns   34.24M
-SharedMutexUncontendedBenchmark                             19.44ns   51.45M
-DistributedMutexUncontendedBenchmark                        29.49ns   33.91M
-AtomicFetchAddUncontendedBenchmark                           5.45ns  183.56M
+StdMutexUncontendedBenchmark                                16.42ns   60.90M
+GoogleSpinUncontendedBenchmark                              11.25ns   88.86M
+MicroSpinLockUncontendedBenchmark                           10.95ns   91.33M
+PicoSpinLockUncontendedBenchmark                            20.38ns   49.07M
+MicroLockUncontendedBenchmark                               28.92ns   34.58M
+SharedMutexUncontendedBenchmark                             19.47ns   51.36M
+DistributedMutexUncontendedBenchmark                        28.89ns   34.62M
+AtomicFetchAddUncontendedBenchmark                           5.47ns  182.91M
 ----------------------------------------------------------------------------
 ----------------------------------------------------------------------------
-std_mutex(1thread)                                         706.81ns    1.41M
-google_spin(1thread)                             103.09%   685.63ns    1.46M
-folly_microspin(1thread)                         117.03%   603.96ns    1.66M
-folly_picospin(1thread)                          102.72%   688.12ns    1.45M
-folly_microlock(1thread)                         103.40%   683.59ns    1.46M
-folly_sharedmutex(1thread)                       103.64%   682.01ns    1.47M
-folly_distributedmutex(1thread)                  101.07%   699.32ns    1.43M
-folly_distributedmutex_combining(1thread)        102.75%   687.89ns    1.45M
-folly_flatcombining_no_caching(1thread)           94.78%   745.77ns    1.34M
-folly_flatcombining_caching(1thread)             100.95%   700.15ns    1.43M
+std_mutex(1thread)                                         900.28ns    1.11M
+google_spin(1thread)                              94.91%   948.60ns    1.05M
+folly_microspin(1thread)                         109.53%   821.97ns    1.22M
+folly_picospin(1thread)                          101.86%   883.88ns    1.13M
+folly_microlock(1thread)                         102.54%   878.02ns    1.14M
+folly_sharedmutex(1thread)                       132.03%   681.86ns    1.47M
+folly_distributedmutex(1thread)                  129.50%   695.23ns    1.44M
+folly_distributedmutex_combining(1thread)        130.73%   688.68ns    1.45M
+folly_flatcombining_no_caching(1thread)          106.73%   843.49ns    1.19M
+folly_flatcombining_caching(1thread)             125.22%   718.96ns    1.39M
 ----------------------------------------------------------------------------
-std_mutex(2thread)                                           1.28us  779.95K
-google_spin(2thread)                             137.96%   929.38ns    1.08M
-folly_microspin(2thread)                         151.64%   845.52ns    1.18M
-folly_picospin(2thread)                          140.81%   910.52ns    1.10M
-folly_microlock(2thread)                         131.62%   974.11ns    1.03M
-folly_sharedmutex(2thread)                       143.97%   890.53ns    1.12M
-folly_distributedmutex(2thread)                  129.20%   992.39ns    1.01M
-folly_distributedmutex_combining(2thread)        131.27%   976.71ns    1.02M
-folly_flatcombining_no_caching(2thread)           93.85%     1.37us  732.01K
-folly_flatcombining_caching(2thread)              97.05%     1.32us  756.98K
+std_mutex(2thread)                                           1.27us  784.90K
+google_spin(2thread)                             126.84%     1.00us  995.55K
+folly_microspin(2thread)                         147.93%   861.24ns    1.16M
+folly_picospin(2thread)                          146.10%   872.06ns    1.15M
+folly_microlock(2thread)                         131.35%   970.00ns    1.03M
+folly_sharedmutex(2thread)                       135.07%   943.23ns    1.06M
+folly_distributedmutex(2thread)                  135.88%   937.63ns    1.07M
+folly_distributedmutex_combining(2thread)        130.37%   977.27ns    1.02M
+folly_flatcombining_no_caching(2thread)           85.64%     1.49us  672.22K
+folly_flatcombining_caching(2thread)              91.98%     1.39us  721.93K
 ----------------------------------------------------------------------------
-std_mutex(4thread)                                           2.65us  376.96K
-google_spin(4thread)                             125.03%     2.12us  471.33K
-folly_microspin(4thread)                         118.43%     2.24us  446.44K
-folly_picospin(4thread)                          122.04%     2.17us  460.05K
-folly_microlock(4thread)                         102.38%     2.59us  385.94K
-folly_sharedmutex(4thread)                       101.76%     2.61us  383.60K
-folly_distributedmutex(4thread)                  137.07%     1.94us  516.71K
-folly_distributedmutex_combining(4thread)        191.98%     1.38us  723.71K
-folly_flatcombining_no_caching(4thread)          106.91%     2.48us  403.02K
-folly_flatcombining_caching(4thread)             111.66%     2.38us  420.91K
+std_mutex(4thread)                                           2.40us  417.44K
+google_spin(4thread)                             111.99%     2.14us  467.49K
+folly_microspin(4thread)                         101.55%     2.36us  423.92K
+folly_picospin(4thread)                           97.89%     2.45us  408.64K
+folly_microlock(4thread)                          79.64%     3.01us  332.45K
+folly_sharedmutex(4thread)                        75.10%     3.19us  313.49K
+folly_distributedmutex(4thread)                  126.16%     1.90us  526.63K
+folly_distributedmutex_combining(4thread)        166.56%     1.44us  695.28K
+folly_flatcombining_no_caching(4thread)           91.79%     2.61us  383.17K
+folly_flatcombining_caching(4thread)             103.95%     2.30us  433.95K
 ----------------------------------------------------------------------------
-std_mutex(8thread)                                           5.21us  191.97K
-google_spin(8thread)                             102.12%     5.10us  196.05K
-folly_microspin(8thread)                          97.02%     5.37us  186.26K
-folly_picospin(8thread)                           83.62%     6.23us  160.53K
-folly_microlock(8thread)                          69.32%     7.51us  133.08K
-folly_sharedmutex(8thread)                        64.22%     8.11us  123.29K
-folly_distributedmutex(8thread)                  175.50%     2.97us  336.91K
-folly_distributedmutex_combining(8thread)        258.13%     2.02us  495.55K
-folly_flatcombining_no_caching(8thread)          137.21%     3.80us  263.41K
-folly_flatcombining_caching(8thread)             174.75%     2.98us  335.48K
+std_mutex(8thread)                                           4.85us  206.37K
+google_spin(8thread)                             111.05%     4.36us  229.18K
+folly_microspin(8thread)                         105.28%     4.60us  217.28K
+folly_picospin(8thread)                           89.06%     5.44us  183.80K
+folly_microlock(8thread)                          73.95%     6.55us  152.62K
+folly_sharedmutex(8thread)                        67.17%     7.21us  138.62K
+folly_distributedmutex(8thread)                  162.16%     2.99us  334.66K
+folly_distributedmutex_combining(8thread)        251.93%     1.92us  519.92K
+folly_flatcombining_no_caching(8thread)          141.99%     3.41us  293.02K
+folly_flatcombining_caching(8thread)             166.26%     2.91us  343.12K
 ----------------------------------------------------------------------------
-std_mutex(16thread)                                         10.06us   99.37K
-google_spin(16thread)                             97.24%    10.35us   96.63K
-folly_microspin(16thread)                         91.23%    11.03us   90.65K
-folly_picospin(16thread)                          58.31%    17.26us   57.94K
-folly_microlock(16thread)                         51.59%    19.51us   51.26K
-folly_sharedmutex(16thread)                       49.87%    20.18us   49.56K
-folly_distributedmutex(16thread)                 155.47%     6.47us  154.49K
-folly_distributedmutex_combining(16thread)       316.70%     3.18us  314.70K
-folly_flatcombining_no_caching(16thread)         198.94%     5.06us  197.68K
-folly_flatcombining_caching(16thread)            184.72%     5.45us  183.55K
+std_mutex(16thread)                                         11.36us   88.01K
+google_spin(16thread)                             99.95%    11.37us   87.96K
+folly_microspin(16thread)                        102.73%    11.06us   90.42K
+folly_picospin(16thread)                          44.00%    25.83us   38.72K
+folly_microlock(16thread)                         52.42%    21.67us   46.14K
+folly_sharedmutex(16thread)                       53.46%    21.26us   47.05K
+folly_distributedmutex(16thread)                 166.17%     6.84us  146.24K
+folly_distributedmutex_combining(16thread)       352.82%     3.22us  310.52K
+folly_flatcombining_no_caching(16thread)         218.07%     5.21us  191.92K
+folly_flatcombining_caching(16thread)            217.69%     5.22us  191.58K
 ----------------------------------------------------------------------------
-std_mutex(32thread)                                         33.80us   29.59K
-google_spin(32thread)                            109.19%    30.95us   32.31K
-folly_microspin(32thread)                        110.23%    30.66us   32.62K
-folly_picospin(32thread)                          39.94%    84.62us   11.82K
-folly_microlock(32thread)                         56.56%    59.75us   16.74K
-folly_sharedmutex(32thread)                       73.92%    45.72us   21.87K
-folly_distributedmutex(32thread)                 192.60%    17.55us   56.99K
-folly_distributedmutex_combining(32thread)       402.79%     8.39us  119.19K
-folly_flatcombining_no_caching(32thread)         235.30%    14.36us   69.63K
-folly_flatcombining_caching(32thread)            259.02%    13.05us   76.64K
+std_mutex(32thread)                                         32.12us   31.13K
+google_spin(32thread)                            115.23%    27.88us   35.87K
+folly_microspin(32thread)                        104.52%    30.74us   32.54K
+folly_picospin(32thread)                          32.81%    97.91us   10.21K
+folly_microlock(32thread)                         57.40%    55.96us   17.87K
+folly_sharedmutex(32thread)                       63.68%    50.45us   19.82K
+folly_distributedmutex(32thread)                 180.17%    17.83us   56.09K
+folly_distributedmutex_combining(32thread)       394.34%     8.15us  122.76K
+folly_flatcombining_no_caching(32thread)         216.41%    14.84us   67.37K
+folly_flatcombining_caching(32thread)            261.99%    12.26us   81.56K
 ----------------------------------------------------------------------------
-std_mutex(64thread)                                         38.86us   25.73K
-google_spin(64thread)                            109.06%    35.63us   28.06K
-folly_microspin(64thread)                        109.92%    35.36us   28.28K
-folly_picospin(64thread)                          37.02%   104.99us    9.53K
-folly_microlock(64thread)                         56.33%    68.99us   14.49K
-folly_sharedmutex(64thread)                       69.39%    56.00us   17.86K
-folly_distributedmutex(64thread)                 194.31%    20.00us   50.00K
-folly_distributedmutex_combining(64thread)       397.54%     9.78us  102.29K
-folly_flatcombining_no_caching(64thread)         230.64%    16.85us   59.35K
-folly_flatcombining_caching(64thread)            254.03%    15.30us   65.37K
+std_mutex(64thread)                                         36.76us   27.20K
+google_spin(64thread)                            115.38%    31.86us   31.39K
+folly_microspin(64thread)                        112.14%    32.78us   30.51K
+folly_picospin(64thread)                          32.34%   113.65us    8.80K
+folly_microlock(64thread)                         57.21%    64.26us   15.56K
+folly_sharedmutex(64thread)                       60.93%    60.33us   16.57K
+folly_distributedmutex(64thread)                 179.79%    20.45us   48.91K
+folly_distributedmutex_combining(64thread)       392.64%     9.36us  106.81K
+folly_flatcombining_no_caching(64thread)         211.85%    17.35us   57.63K
+folly_flatcombining_caching(64thread)            241.45%    15.22us   65.68K
 ----------------------------------------------------------------------------
-std_mutex(128thread)                                        76.62us   13.05K
-google_spin(128thread)                           109.31%    70.09us   14.27K
-folly_microspin(128thread)                       102.86%    74.49us   13.43K
-folly_picospin(128thread)                         42.23%   181.42us    5.51K
-folly_microlock(128thread)                        55.01%   139.29us    7.18K
-folly_sharedmutex(128thread)                      63.50%   120.65us    8.29K
-folly_distributedmutex(128thread)                183.63%    41.72us   23.97K
-folly_distributedmutex_combining(128thread)      388.41%    19.73us   50.69K
-folly_flatcombining_no_caching(128thread)        183.56%    41.74us   23.96K
-folly_flatcombining_caching(128thread)           198.02%    38.69us   25.84K
+std_mutex(128thread)                                        73.05us   13.69K
+google_spin(128thread)                           116.19%    62.87us   15.91K
+folly_microspin(128thread)                        97.45%    74.96us   13.34K
+folly_picospin(128thread)                         31.46%   232.19us    4.31K
+folly_microlock(128thread)                        56.50%   129.29us    7.73K
+folly_sharedmutex(128thread)                      59.54%   122.69us    8.15K
+folly_distributedmutex(128thread)                166.59%    43.85us   22.80K
+folly_distributedmutex_combining(128thread)      379.86%    19.23us   52.00K
+folly_flatcombining_no_caching(128thread)        179.10%    40.79us   24.52K
+folly_flatcombining_caching(128thread)           189.64%    38.52us   25.96K
 ----------------------------------------------------------------------------
-std_mutex_simple(1thread)                                  634.77ns    1.58M
-google_spin_simple(1thread)                      104.06%   610.01ns    1.64M
-folly_microspin_simple(1thread)                  104.59%   606.89ns    1.65M
-folly_picospin_simple(1thread)                    99.37%   638.81ns    1.57M
-folly_microlock_simple(1thread)                  104.08%   609.86ns    1.64M
-folly_sharedmutex_simple(1thread)                 91.77%   691.73ns    1.45M
-folly_distributedmutex_simple(1thread)            98.10%   647.04ns    1.55M
-folly_distributedmutex_combining_simple(1thread  101.90%   622.93ns    1.61M
-folly_flatcombining_no_caching_simple(1thread)    93.71%   677.40ns    1.48M
-folly_flatcombining_caching_simple(1thread)      101.81%   623.46ns    1.60M
-atomics_fetch_add(1thread)                       102.23%   620.90ns    1.61M
-atomic_fetch_xor(1thread)                        104.67%   606.43ns    1.65M
-atomic_cas(1thread)                               84.68%   749.58ns    1.33M
+std_mutex_simple(1thread)                                  666.33ns    1.50M
+google_spin_simple(1thread)                      110.03%   605.58ns    1.65M
+folly_microspin_simple(1thread)                  109.80%   606.87ns    1.65M
+folly_picospin_simple(1thread)                   108.89%   611.94ns    1.63M
+folly_microlock_simple(1thread)                  108.42%   614.59ns    1.63M
+folly_sharedmutex_simple(1thread)                 93.00%   716.47ns    1.40M
+folly_distributedmutex_simple(1thread)            90.08%   739.68ns    1.35M
+folly_distributedmutex_combining_simple(1thread   90.20%   738.73ns    1.35M
+folly_flatcombining_no_caching_simple(1thread)    98.04%   679.68ns    1.47M
+folly_flatcombining_caching_simple(1thread)      105.59%   631.04ns    1.58M
+atomics_fetch_add(1thread)                       108.30%   615.29ns    1.63M
+atomic_fetch_xor(1thread)                        110.52%   602.90ns    1.66M
+atomic_cas(1thread)                              109.86%   606.52ns    1.65M
 ----------------------------------------------------------------------------
-std_mutex_simple(2thread)                                    1.24us  803.81K
-google_spin_simple(2thread)                      123.09%     1.01us  989.38K
-folly_microspin_simple(2thread)                  138.46%   898.48ns    1.11M
-folly_picospin_simple(2thread)                   121.05%     1.03us  973.01K
-folly_microlock_simple(2thread)                  112.54%     1.11us  904.60K
-folly_sharedmutex_simple(2thread)                112.16%     1.11us  901.60K
-folly_distributedmutex_simple(2thread)           119.86%     1.04us  963.47K
-folly_distributedmutex_combining_simple(2thread  130.78%   951.25ns    1.05M
-folly_flatcombining_no_caching_simple(2thread)    93.25%     1.33us  749.54K
-folly_flatcombining_caching_simple(2thread)      102.34%     1.22us  822.65K
-atomics_fetch_add(2thread)                       113.81%     1.09us  914.83K
-atomic_fetch_xor(2thread)                        161.97%   768.09ns    1.30M
-atomic_cas(2thread)                              150.00%   829.41ns    1.21M
+std_mutex_simple(2thread)                                    1.19us  841.25K
+google_spin_simple(2thread)                      107.33%     1.11us  902.89K
+folly_microspin_simple(2thread)                  130.73%   909.27ns    1.10M
+folly_picospin_simple(2thread)                   112.39%     1.06us  945.48K
+folly_microlock_simple(2thread)                  113.89%     1.04us  958.14K
+folly_sharedmutex_simple(2thread)                119.48%   994.86ns    1.01M
+folly_distributedmutex_simple(2thread)           112.44%     1.06us  945.91K
+folly_distributedmutex_combining_simple(2thread  123.12%   965.48ns    1.04M
+folly_flatcombining_no_caching_simple(2thread)    90.56%     1.31us  761.82K
+folly_flatcombining_caching_simple(2thread)      100.66%     1.18us  846.83K
+atomics_fetch_add(2thread)                       119.15%   997.67ns    1.00M
+atomic_fetch_xor(2thread)                        179.85%   660.93ns    1.51M
+atomic_cas(2thread)                              179.40%   662.58ns    1.51M
 ----------------------------------------------------------------------------
-std_mutex_simple(4thread)                                    2.39us  418.75K
-google_spin_simple(4thread)                      109.55%     2.18us  458.74K
-folly_microspin_simple(4thread)                  110.15%     2.17us  461.26K
-folly_picospin_simple(4thread)                   115.62%     2.07us  484.17K
-folly_microlock_simple(4thread)                   88.54%     2.70us  370.77K
-folly_sharedmutex_simple(4thread)                100.50%     2.38us  420.86K
-folly_distributedmutex_simple(4thread)           114.93%     2.08us  481.26K
-folly_distributedmutex_combining_simple(4thread  161.11%     1.48us  674.64K
-folly_flatcombining_no_caching_simple(4thread)   106.27%     2.25us  445.02K
-folly_flatcombining_caching_simple(4thread)      113.01%     2.11us  473.23K
-atomics_fetch_add(4thread)                       156.29%     1.53us  654.48K
-atomic_fetch_xor(4thread)                        285.69%   835.89ns    1.20M
-atomic_cas(4thread)                              270.31%   883.45ns    1.13M
+std_mutex_simple(4thread)                                    2.37us  422.81K
+google_spin_simple(4thread)                      106.35%     2.22us  449.64K
+folly_microspin_simple(4thread)                  110.42%     2.14us  466.89K
+folly_picospin_simple(4thread)                   111.77%     2.12us  472.58K
+folly_microlock_simple(4thread)                   82.17%     2.88us  347.44K
+folly_sharedmutex_simple(4thread)                 93.40%     2.53us  394.89K
+folly_distributedmutex_simple(4thread)           121.00%     1.95us  511.58K
+folly_distributedmutex_combining_simple(4thread  187.65%     1.26us  793.42K
+folly_flatcombining_no_caching_simple(4thread)   104.81%     2.26us  443.13K
+folly_flatcombining_caching_simple(4thread)      112.90%     2.09us  477.34K
+atomics_fetch_add(4thread)                       178.61%     1.32us  755.20K
+atomic_fetch_xor(4thread)                        323.62%   730.84ns    1.37M
+atomic_cas(4thread)                              300.43%   787.23ns    1.27M
 ----------------------------------------------------------------------------
-std_mutex_simple(8thread)                                    4.83us  207.09K
-google_spin_simple(8thread)                      117.15%     4.12us  242.60K
-folly_microspin_simple(8thread)                  106.41%     4.54us  220.37K
-folly_picospin_simple(8thread)                    88.31%     5.47us  182.88K
-folly_microlock_simple(8thread)                   77.90%     6.20us  161.33K
-folly_sharedmutex_simple(8thread)                 72.21%     6.69us  149.55K
-folly_distributedmutex_simple(8thread)           138.98%     3.47us  287.83K
-folly_distributedmutex_combining_simple(8thread  289.79%     1.67us  600.12K
-folly_flatcombining_no_caching_simple(8thread)   134.25%     3.60us  278.03K
-folly_flatcombining_caching_simple(8thread)      149.74%     3.22us  310.10K
-atomics_fetch_add(8thread)                       318.11%     1.52us  658.78K
-atomic_fetch_xor(8thread)                        373.98%     1.29us  774.47K
-atomic_cas(8thread)                              241.00%     2.00us  499.09K
+std_mutex_simple(8thread)                                    5.02us  199.09K
+google_spin_simple(8thread)                      108.93%     4.61us  216.88K
+folly_microspin_simple(8thread)                  116.44%     4.31us  231.82K
+folly_picospin_simple(8thread)                    80.84%     6.21us  160.94K
+folly_microlock_simple(8thread)                   77.18%     6.51us  153.66K
+folly_sharedmutex_simple(8thread)                 76.09%     6.60us  151.48K
+folly_distributedmutex_simple(8thread)           145.27%     3.46us  289.21K
+folly_distributedmutex_combining_simple(8thread  310.65%     1.62us  618.48K
+folly_flatcombining_no_caching_simple(8thread)   139.83%     3.59us  278.39K
+folly_flatcombining_caching_simple(8thread)      163.72%     3.07us  325.95K
+atomics_fetch_add(8thread)                       337.67%     1.49us  672.28K
+atomic_fetch_xor(8thread)                        380.66%     1.32us  757.87K
+atomic_cas(8thread)                              238.04%     2.11us  473.93K
 ----------------------------------------------------------------------------
-std_mutex_simple(16thread)                                  12.03us   83.13K
-google_spin_simple(16thread)                      98.34%    12.23us   81.75K
-folly_microspin_simple(16thread)                 115.19%    10.44us   95.76K
-folly_picospin_simple(16thread)                   54.50%    22.07us   45.31K
-folly_microlock_simple(16thread)                  58.38%    20.61us   48.53K
-folly_sharedmutex_simple(16thread)                69.90%    17.21us   58.11K
-folly_distributedmutex_simple(16thread)          155.15%     7.75us  128.97K
-folly_distributedmutex_combining_simple(16threa  463.66%     2.59us  385.43K
-folly_flatcombining_no_caching_simple(16thread)  279.15%     4.31us  232.05K
-folly_flatcombining_caching_simple(16thread)     207.72%     5.79us  172.67K
-atomics_fetch_add(16thread)                      538.64%     2.23us  447.76K
-atomic_fetch_xor(16thread)                       570.85%     2.11us  474.53K
-atomic_cas(16thread)                             334.73%     3.59us  278.25K
+std_mutex_simple(16thread)                                  12.26us   81.59K
+google_spin_simple(16thread)                     104.37%    11.74us   85.16K
+folly_microspin_simple(16thread)                 116.32%    10.54us   94.91K
+folly_picospin_simple(16thread)                   53.67%    22.83us   43.79K
+folly_microlock_simple(16thread)                  66.39%    18.46us   54.17K
+folly_sharedmutex_simple(16thread)                65.00%    18.85us   53.04K
+folly_distributedmutex_simple(16thread)          171.32%     7.15us  139.79K
+folly_distributedmutex_combining_simple(16threa  445.11%     2.75us  363.17K
+folly_flatcombining_no_caching_simple(16thread)  206.11%     5.95us  168.17K
+folly_flatcombining_caching_simple(16thread)     245.09%     5.00us  199.97K
+atomics_fetch_add(16thread)                      494.82%     2.48us  403.73K
+atomic_fetch_xor(16thread)                       489.90%     2.50us  399.72K
+atomic_cas(16thread)                             232.76%     5.27us  189.91K
 ----------------------------------------------------------------------------
-std_mutex_simple(32thread)                                  30.92us   32.34K
-google_spin_simple(32thread)                     107.22%    28.84us   34.68K
-folly_microspin_simple(32thread)                 106.48%    29.04us   34.44K
-folly_picospin_simple(32thread)                   32.90%    93.97us   10.64K
-folly_microlock_simple(32thread)                  55.77%    55.44us   18.04K
-folly_sharedmutex_simple(32thread)                63.85%    48.42us   20.65K
-folly_distributedmutex_simple(32thread)          170.50%    18.13us   55.14K
-folly_distributedmutex_combining_simple(32threa  562.55%     5.50us  181.94K
-folly_flatcombining_no_caching_simple(32thread)  296.57%    10.43us   95.92K
-folly_flatcombining_caching_simple(32thread)     295.25%    10.47us   95.49K
-atomics_fetch_add(32thread)                      952.20%     3.25us  307.96K
-atomic_fetch_xor(32thread)                       818.15%     3.78us  264.61K
-atomic_cas(32thread)                             634.91%     4.87us  205.34K
+std_mutex_simple(32thread)                                  30.28us   33.03K
+google_spin_simple(32thread)                     106.34%    28.47us   35.12K
+folly_microspin_simple(32thread)                 102.20%    29.62us   33.76K
+folly_picospin_simple(32thread)                   31.56%    95.92us   10.43K
+folly_microlock_simple(32thread)                  53.99%    56.07us   17.83K
+folly_sharedmutex_simple(32thread)                67.49%    44.86us   22.29K
+folly_distributedmutex_simple(32thread)          161.63%    18.73us   53.38K
+folly_distributedmutex_combining_simple(32threa  605.26%     5.00us  199.92K
+folly_flatcombining_no_caching_simple(32thread)  234.62%    12.90us   77.49K
+folly_flatcombining_caching_simple(32thread)     332.21%     9.11us  109.73K
+atomics_fetch_add(32thread)                      909.18%     3.33us  300.30K
+atomic_fetch_xor(32thread)                       779.56%     3.88us  257.49K
+atomic_cas(32thread)                             622.19%     4.87us  205.51K
 ----------------------------------------------------------------------------
-std_mutex_simple(64thread)                                  35.29us   28.33K
-google_spin_simple(64thread)                     107.33%    32.88us   30.41K
-folly_microspin_simple(64thread)                 106.02%    33.29us   30.04K
-folly_picospin_simple(64thread)                   32.93%   107.17us    9.33K
-folly_microlock_simple(64thread)                  54.76%    64.45us   15.52K
-folly_sharedmutex_simple(64thread)                63.74%    55.37us   18.06K
-folly_distributedmutex_simple(64thread)          170.45%    20.71us   48.30K
-folly_distributedmutex_combining_simple(64threa  558.99%     6.31us  158.38K
-folly_flatcombining_no_caching_simple(64thread)  311.86%    11.32us   88.36K
-folly_flatcombining_caching_simple(64thread)     327.64%    10.77us   92.83K
-atomics_fetch_add(64thread)                      858.61%     4.11us  243.28K
-atomic_fetch_xor(64thread)                       738.35%     4.78us  209.20K
-atomic_cas(64thread)                             623.72%     5.66us  176.72K
+std_mutex_simple(64thread)                                  34.33us   29.13K
+google_spin_simple(64thread)                     106.28%    32.30us   30.96K
+folly_microspin_simple(64thread)                  99.86%    34.37us   29.09K
+folly_picospin_simple(64thread)                   31.37%   109.42us    9.14K
+folly_microlock_simple(64thread)                  53.46%    64.21us   15.57K
+folly_sharedmutex_simple(64thread)                62.94%    54.54us   18.33K
+folly_distributedmutex_simple(64thread)          161.26%    21.29us   46.98K
+folly_distributedmutex_combining_simple(64threa  603.87%     5.68us  175.91K
+folly_flatcombining_no_caching_simple(64thread)  247.00%    13.90us   71.95K
+folly_flatcombining_caching_simple(64thread)     310.66%    11.05us   90.50K
+atomics_fetch_add(64thread)                      839.49%     4.09us  244.55K
+atomic_fetch_xor(64thread)                       756.48%     4.54us  220.37K
+atomic_cas(64thread)                             606.85%     5.66us  176.78K
 ----------------------------------------------------------------------------
-std_mutex_simple(128thread)                                 69.21us   14.45K
-google_spin_simple(128thread)                    107.42%    64.43us   15.52K
-folly_microspin_simple(128thread)                 96.36%    71.82us   13.92K
-folly_picospin_simple(128thread)                  31.07%   222.75us    4.49K
-folly_microlock_simple(128thread)                 53.97%   128.25us    7.80K
-folly_sharedmutex_simple(128thread)               60.56%   114.29us    8.75K
-folly_distributedmutex_simple(128thread)         165.16%    41.91us   23.86K
-folly_distributedmutex_combining_simple(128thre  542.63%    12.75us   78.40K
-folly_flatcombining_no_caching_simple(128thread  246.16%    28.12us   35.57K
-folly_flatcombining_caching_simple(128thread)    232.56%    29.76us   33.60K
-atomics_fetch_add(128thread)                     839.43%     8.24us  121.29K
-atomic_fetch_xor(128thread)                      761.39%     9.09us  110.01K
-atomic_cas(128thread)                            598.53%    11.56us   86.48K
+std_mutex_simple(128thread)                                 67.35us   14.85K
+google_spin_simple(128thread)                    106.30%    63.36us   15.78K
+folly_microspin_simple(128thread)                 92.58%    72.75us   13.75K
+folly_picospin_simple(128thread)                  29.87%   225.47us    4.44K
+folly_microlock_simple(128thread)                 52.52%   128.25us    7.80K
+folly_sharedmutex_simple(128thread)               59.79%   112.64us    8.88K
+folly_distributedmutex_simple(128thread)         151.27%    44.52us   22.46K
+folly_distributedmutex_combining_simple(128thre  580.11%    11.61us   86.13K
+folly_flatcombining_no_caching_simple(128thread  219.20%    30.73us   32.55K
+folly_flatcombining_caching_simple(128thread)    225.39%    29.88us   33.46K
+atomics_fetch_add(128thread)                     813.36%     8.28us  120.76K
+atomic_fetch_xor(128thread)                      740.02%     9.10us  109.88K
+atomic_cas(128thread)                            586.66%    11.48us   87.11K
 ============================================================================

 ./small_locks_benchmark --bm_min_iters=100000
@@ -1153,204 +1165,204 @@ DistributedMutexUncontendedBenchmark                        51.47ns   19.43M
 AtomicFetchAddUncontendedBenchmark                          10.67ns   93.73M
 ----------------------------------------------------------------------------
 ----------------------------------------------------------------------------
-std_mutex(1thread)                                           1.36us  737.48K
-google_spin(1thread)                              94.81%     1.43us  699.17K
-folly_microspin(1thread)                         100.17%     1.35us  738.74K
-folly_picospin(1thread)                          100.40%     1.35us  740.41K
-folly_microlock(1thread)                          82.90%     1.64us  611.34K
-folly_sharedmutex(1thread)                       101.07%     1.34us  745.36K
-folly_distributedmutex(1thread)                  101.50%     1.34us  748.54K
-folly_distributedmutex_combining(1thread)         99.09%     1.37us  730.79K
-folly_flatcombining_no_caching(1thread)           91.37%     1.48us  673.80K
-folly_flatcombining_caching(1thread)              99.19%     1.37us  731.48K
+std_mutex(1thread)                                           1.37us  730.43K
+google_spin(1thread)                             104.25%     1.31us  761.46K
+folly_microspin(1thread)                         102.06%     1.34us  745.45K
+folly_picospin(1thread)                          100.68%     1.36us  735.43K
+folly_microlock(1thread)                         104.27%     1.31us  761.64K
+folly_sharedmutex(1thread)                       101.95%     1.34us  744.65K
+folly_distributedmutex(1thread)                   98.63%     1.39us  720.41K
+folly_distributedmutex_combining(1thread)        103.78%     1.32us  758.05K
+folly_flatcombining_no_caching(1thread)           95.44%     1.43us  697.15K
+folly_flatcombining_caching(1thread)              99.11%     1.38us  723.94K
 ----------------------------------------------------------------------------
-std_mutex(2thread)                                           1.65us  605.33K
-google_spin(2thread)                             113.28%     1.46us  685.74K
-folly_microspin(2thread)                         117.23%     1.41us  709.63K
-folly_picospin(2thread)                          113.56%     1.45us  687.40K
-folly_microlock(2thread)                         106.92%     1.55us  647.22K
-folly_sharedmutex(2thread)                       107.24%     1.54us  649.15K
-folly_distributedmutex(2thread)                  114.89%     1.44us  695.47K
-folly_distributedmutex_combining(2thread)         83.44%     1.98us  505.10K
-folly_flatcombining_no_caching(2thread)           75.89%     2.18us  459.42K
-folly_flatcombining_caching(2thread)              76.96%     2.15us  465.86K
+std_mutex(2thread)                                           1.65us  605.36K
+google_spin(2thread)                             111.27%     1.48us  673.61K
+folly_microspin(2thread)                         119.82%     1.38us  725.35K
+folly_picospin(2thread)                          112.46%     1.47us  680.81K
+folly_microlock(2thread)                         106.47%     1.55us  644.54K
+folly_sharedmutex(2thread)                       107.12%     1.54us  648.45K
+folly_distributedmutex(2thread)                  110.80%     1.49us  670.76K
+folly_distributedmutex_combining(2thread)         97.09%     1.70us  587.77K
+folly_flatcombining_no_caching(2thread)           83.37%     1.98us  504.68K
+folly_flatcombining_caching(2thread)             108.62%     1.52us  657.54K
 ----------------------------------------------------------------------------
-std_mutex(4thread)                                           2.88us  347.43K
-google_spin(4thread)                             132.08%     2.18us  458.88K
-folly_microspin(4thread)                         160.15%     1.80us  556.43K
-folly_picospin(4thread)                          189.27%     1.52us  657.60K
-folly_microlock(4thread)                         155.13%     1.86us  538.97K
-folly_sharedmutex(4thread)                       148.96%     1.93us  517.55K
-folly_distributedmutex(4thread)                  106.64%     2.70us  370.51K
-folly_distributedmutex_combining(4thread)        138.83%     2.07us  482.33K
-folly_flatcombining_no_caching(4thread)           87.67%     3.28us  304.59K
-folly_flatcombining_caching(4thread)              93.32%     3.08us  324.23K
+std_mutex(4thread)                                           2.92us  341.96K
+google_spin(4thread)                             137.67%     2.12us  470.78K
+folly_microspin(4thread)                         165.47%     1.77us  565.85K
+folly_picospin(4thread)                          181.92%     1.61us  622.09K
+folly_microlock(4thread)                         149.83%     1.95us  512.35K
+folly_sharedmutex(4thread)                       158.69%     1.84us  542.66K
+folly_distributedmutex(4thread)                  107.42%     2.72us  367.35K
+folly_distributedmutex_combining(4thread)        144.34%     2.03us  493.59K
+folly_flatcombining_no_caching(4thread)           88.43%     3.31us  302.40K
+folly_flatcombining_caching(4thread)              94.20%     3.10us  322.11K
 ----------------------------------------------------------------------------
-std_mutex(8thread)                                           7.01us  142.65K
-google_spin(8thread)                             127.58%     5.49us  182.00K
-folly_microspin(8thread)                         137.50%     5.10us  196.14K
-folly_picospin(8thread)                          114.66%     6.11us  163.56K
-folly_microlock(8thread)                         107.90%     6.50us  153.92K
-folly_sharedmutex(8thread)                       114.21%     6.14us  162.93K
-folly_distributedmutex(8thread)                  129.43%     5.42us  184.63K
-folly_distributedmutex_combining(8thread)        271.46%     2.58us  387.23K
-folly_flatcombining_no_caching(8thread)          148.27%     4.73us  211.50K
-folly_flatcombining_caching(8thread)             170.26%     4.12us  242.88K
+std_mutex(8thread)                                           7.04us  142.02K
+google_spin(8thread)                             128.50%     5.48us  182.49K
+folly_microspin(8thread)                         134.72%     5.23us  191.32K
+folly_picospin(8thread)                          112.37%     6.27us  159.58K
+folly_microlock(8thread)                         109.65%     6.42us  155.71K
+folly_sharedmutex(8thread)                       105.92%     6.65us  150.42K
+folly_distributedmutex(8thread)                  127.22%     5.53us  180.67K
+folly_distributedmutex_combining(8thread)        275.50%     2.56us  391.26K
+folly_flatcombining_no_caching(8thread)          144.99%     4.86us  205.92K
+folly_flatcombining_caching(8thread)             156.31%     4.50us  221.99K
 ----------------------------------------------------------------------------
-std_mutex(16thread)                                         13.11us   76.30K
-google_spin(16thread)                            122.81%    10.67us   93.71K
-folly_microspin(16thread)                         91.61%    14.31us   69.90K
-folly_picospin(16thread)                          62.60%    20.94us   47.76K
-folly_microlock(16thread)                         73.44%    17.85us   56.04K
-folly_sharedmutex(16thread)                       74.68%    17.55us   56.98K
-folly_distributedmutex(16thread)                 142.42%     9.20us  108.67K
-folly_distributedmutex_combining(16thread)       332.10%     3.95us  253.39K
-folly_flatcombining_no_caching(16thread)         177.20%     7.40us  135.21K
-folly_flatcombining_caching(16thread)            186.60%     7.02us  142.37K
+std_mutex(16thread)                                         13.08us   76.44K
+google_spin(16thread)                            121.76%    10.74us   93.07K
+folly_microspin(16thread)                         91.47%    14.30us   69.92K
+folly_picospin(16thread)                          67.95%    19.25us   51.94K
+folly_microlock(16thread)                         73.57%    17.78us   56.24K
+folly_sharedmutex(16thread)                       70.59%    18.53us   53.96K
+folly_distributedmutex(16thread)                 139.74%     9.36us  106.82K
+folly_distributedmutex_combining(16thread)       338.38%     3.87us  258.67K
+folly_flatcombining_no_caching(16thread)         194.08%     6.74us  148.36K
+folly_flatcombining_caching(16thread)            195.03%     6.71us  149.09K
 ----------------------------------------------------------------------------
-std_mutex(32thread)                                         25.45us   39.30K
-google_spin(32thread)                            122.57%    20.76us   48.17K
-folly_microspin(32thread)                         73.58%    34.58us   28.92K
-folly_picospin(32thread)                          50.29%    50.60us   19.76K
-folly_microlock(32thread)                         58.33%    43.63us   22.92K
-folly_sharedmutex(32thread)                       55.89%    45.53us   21.96K
-folly_distributedmutex(32thread)                 142.80%    17.82us   56.12K
-folly_distributedmutex_combining(32thread)       352.23%     7.22us  138.42K
-folly_flatcombining_no_caching(32thread)         237.42%    10.72us   93.30K
-folly_flatcombining_caching(32thread)            251.05%    10.14us   98.66K
+std_mutex(32thread)                                         25.35us   39.45K
+google_spin(32thread)                            122.73%    20.66us   48.41K
+folly_microspin(32thread)                         73.81%    34.35us   29.11K
+folly_picospin(32thread)                          50.66%    50.04us   19.98K
+folly_microlock(32thread)                         58.40%    43.41us   23.03K
+folly_sharedmutex(32thread)                       55.14%    45.98us   21.75K
+folly_distributedmutex(32thread)                 141.36%    17.93us   55.76K
+folly_distributedmutex_combining(32thread)       358.52%     7.07us  141.42K
+folly_flatcombining_no_caching(32thread)         257.78%     9.83us  101.68K
+folly_flatcombining_caching(32thread)            285.82%     8.87us  112.74K
 ----------------------------------------------------------------------------
-std_mutex(64thread)                                         43.02us   23.25K
-google_spin(64thread)                            120.68%    35.65us   28.05K
-folly_microspin(64thread)                         70.09%    61.38us   16.29K
-folly_picospin(64thread)                          42.05%   102.31us    9.77K
-folly_microlock(64thread)                         54.50%    78.94us   12.67K
-folly_sharedmutex(64thread)                       50.37%    85.40us   11.71K
-folly_distributedmutex(64thread)                 135.17%    31.83us   31.42K
-folly_distributedmutex_combining(64thread)       319.01%    13.49us   74.15K
-folly_flatcombining_no_caching(64thread)         218.18%    19.72us   50.72K
-folly_flatcombining_caching(64thread)            211.05%    20.38us   49.06K
+std_mutex(64thread)                                         45.03us   22.21K
+google_spin(64thread)                            124.58%    36.15us   27.66K
+folly_microspin(64thread)                         75.05%    60.00us   16.67K
+folly_picospin(64thread)                          44.98%   100.12us    9.99K
+folly_microlock(64thread)                         56.99%    79.01us   12.66K
+folly_sharedmutex(64thread)                       52.67%    85.49us   11.70K
+folly_distributedmutex(64thread)                 139.71%    32.23us   31.02K
+folly_distributedmutex_combining(64thread)       343.76%    13.10us   76.34K
+folly_flatcombining_no_caching(64thread)         211.67%    21.27us   47.01K
+folly_flatcombining_caching(64thread)            222.51%    20.24us   49.41K
 ----------------------------------------------------------------------------
-std_mutex(128thread)                                        84.62us   11.82K
-google_spin(128thread)                           120.25%    70.37us   14.21K
-folly_microspin(128thread)                        66.54%   127.16us    7.86K
-folly_picospin(128thread)                         33.40%   253.38us    3.95K
-folly_microlock(128thread)                        51.91%   163.03us    6.13K
-folly_sharedmutex(128thread)                      49.51%   170.90us    5.85K
-folly_distributedmutex(128thread)                131.90%    64.15us   15.59K
-folly_distributedmutex_combining(128thread)      273.55%    30.93us   32.33K
-folly_flatcombining_no_caching(128thread)        183.86%    46.02us   21.73K
-folly_flatcombining_caching(128thread)           180.95%    46.76us   21.38K
+std_mutex(128thread)                                        88.78us   11.26K
+google_spin(128thread)                           125.10%    70.96us   14.09K
+folly_microspin(128thread)                        71.00%   125.03us    8.00K
+folly_picospin(128thread)                         30.97%   286.63us    3.49K
+folly_microlock(128thread)                        54.37%   163.28us    6.12K
+folly_sharedmutex(128thread)                      51.69%   171.76us    5.82K
+folly_distributedmutex(128thread)                137.37%    64.63us   15.47K
+folly_distributedmutex_combining(128thread)      281.23%    31.57us   31.68K
+folly_flatcombining_no_caching(128thread)        136.61%    64.99us   15.39K
+folly_flatcombining_caching(128thread)           152.32%    58.29us   17.16K
 ----------------------------------------------------------------------------
-std_mutex_simple(1thread)                                    1.19us  839.60K
-google_spin_simple(1thread)                      100.96%     1.18us  847.68K
-folly_microspin_simple(1thread)                  101.35%     1.18us  850.96K
-folly_picospin_simple(1thread)                   101.04%     1.18us  848.31K
-folly_microlock_simple(1thread)                  100.58%     1.18us  844.50K
-folly_sharedmutex_simple(1thread)                100.75%     1.18us  845.88K
-folly_distributedmutex_simple(1thread)            98.62%     1.21us  828.05K
-folly_distributedmutex_combining_simple(1thread   99.58%     1.20us  836.07K
-folly_flatcombining_no_caching_simple(1thread)    95.63%     1.25us  802.87K
-folly_flatcombining_caching_simple(1thread)       99.37%     1.20us  834.27K
-atomics_fetch_add(1thread)                       101.98%     1.17us  856.25K
-atomic_fetch_xor(1thread)                        101.29%     1.18us  850.43K
-atomic_cas(1thread)                              101.73%     1.17us  854.11K
+std_mutex_simple(1thread)                                    1.63us  611.75K
+google_spin_simple(1thread)                      105.70%     1.55us  646.61K
+folly_microspin_simple(1thread)                  103.24%     1.58us  631.57K
+folly_picospin_simple(1thread)                   109.17%     1.50us  667.87K
+folly_microlock_simple(1thread)                  111.22%     1.47us  680.41K
+folly_sharedmutex_simple(1thread)                136.79%     1.19us  836.83K
+folly_distributedmutex_simple(1thread)           107.21%     1.52us  655.88K
+folly_distributedmutex_combining_simple(1thread  134.79%     1.21us  824.61K
+folly_flatcombining_no_caching_simple(1thread)   127.99%     1.28us  782.99K
+folly_flatcombining_caching_simple(1thread)      133.87%     1.22us  818.93K
+atomics_fetch_add(1thread)                       138.24%     1.18us  845.70K
+atomic_fetch_xor(1thread)                        106.94%     1.53us  654.23K
+atomic_cas(1thread)                              124.81%     1.31us  763.52K
 ----------------------------------------------------------------------------
-std_mutex_simple(2thread)                                    1.60us  623.66K
-google_spin_simple(2thread)                      113.06%     1.42us  705.12K
-folly_microspin_simple(2thread)                  114.38%     1.40us  713.32K
-folly_picospin_simple(2thread)                   112.84%     1.42us  703.74K
-folly_microlock_simple(2thread)                   97.27%     1.65us  606.66K
-folly_sharedmutex_simple(2thread)                111.31%     1.44us  694.20K
-folly_distributedmutex_simple(2thread)           109.21%     1.47us  681.11K
-folly_distributedmutex_combining_simple(2thread  107.91%     1.49us  672.98K
-folly_flatcombining_no_caching_simple(2thread)    89.48%     1.79us  558.04K
-folly_flatcombining_caching_simple(2thread)       98.95%     1.62us  617.14K
-atomics_fetch_add(2thread)                       106.88%     1.50us  666.58K
-atomic_fetch_xor(2thread)                        126.82%     1.26us  790.91K
-atomic_cas(2thread)                              130.34%     1.23us  812.86K
+std_mutex_simple(2thread)                                    1.60us  626.60K
+google_spin_simple(2thread)                       96.04%     1.66us  601.80K
+folly_microspin_simple(2thread)                  111.88%     1.43us  701.02K
+folly_picospin_simple(2thread)                   106.11%     1.50us  664.91K
+folly_microlock_simple(2thread)                   88.90%     1.80us  557.04K
+folly_sharedmutex_simple(2thread)                 90.93%     1.76us  569.79K
+folly_distributedmutex_simple(2thread)            93.93%     1.70us  588.57K
+folly_distributedmutex_combining_simple(2thread  106.86%     1.49us  669.61K
+folly_flatcombining_no_caching_simple(2thread)    85.92%     1.86us  538.37K
+folly_flatcombining_caching_simple(2thread)       98.82%     1.61us  619.24K
+atomics_fetch_add(2thread)                       104.61%     1.53us  655.46K
+atomic_fetch_xor(2thread)                        126.46%     1.26us  792.40K
+atomic_cas(2thread)                              125.92%     1.27us  788.99K
 ----------------------------------------------------------------------------
-std_mutex_simple(4thread)                                    2.74us  364.72K
-google_spin_simple(4thread)                      123.43%     2.22us  450.16K
-folly_microspin_simple(4thread)                  153.56%     1.79us  560.07K
-folly_picospin_simple(4thread)                   146.03%     1.88us  532.59K
-folly_microlock_simple(4thread)                  116.28%     2.36us  424.10K
-folly_sharedmutex_simple(4thread)                142.39%     1.93us  519.33K
-folly_distributedmutex_simple(4thread)           111.84%     2.45us  407.89K
-folly_distributedmutex_combining_simple(4thread  140.61%     1.95us  512.83K
-folly_flatcombining_no_caching_simple(4thread)   101.22%     2.71us  369.17K
-folly_flatcombining_caching_simple(4thread)      105.38%     2.60us  384.35K
-atomics_fetch_add(4thread)                       150.95%     1.82us  550.52K
-atomic_fetch_xor(4thread)                        223.43%     1.23us  814.87K
-atomic_cas(4thread)                              217.57%     1.26us  793.52K
+std_mutex_simple(4thread)                                    2.71us  368.45K
+google_spin_simple(4thread)                      124.52%     2.18us  458.79K
+folly_microspin_simple(4thread)                  146.48%     1.85us  539.69K
+folly_picospin_simple(4thread)                   163.54%     1.66us  602.57K
+folly_microlock_simple(4thread)                  113.17%     2.40us  416.99K
+folly_sharedmutex_simple(4thread)                142.36%     1.91us  524.52K
+folly_distributedmutex_simple(4thread)           108.22%     2.51us  398.74K
+folly_distributedmutex_combining_simple(4thread  141.49%     1.92us  521.30K
+folly_flatcombining_no_caching_simple(4thread)    97.27%     2.79us  358.38K
+folly_flatcombining_caching_simple(4thread)      106.12%     2.56us  390.99K
+atomics_fetch_add(4thread)                       151.10%     1.80us  556.73K
+atomic_fetch_xor(4thread)                        213.14%     1.27us  785.32K
+atomic_cas(4thread)                              218.93%     1.24us  806.65K
 ----------------------------------------------------------------------------
-std_mutex_simple(8thread)                                    6.99us  142.98K
-google_spin_simple(8thread)                      128.58%     5.44us  183.84K
-folly_microspin_simple(8thread)                  131.98%     5.30us  188.69K
-folly_picospin_simple(8thread)                   121.81%     5.74us  174.16K
-folly_microlock_simple(8thread)                  100.06%     6.99us  143.06K
-folly_sharedmutex_simple(8thread)                115.88%     6.04us  165.69K
-folly_distributedmutex_simple(8thread)           123.11%     5.68us  176.02K
-folly_distributedmutex_combining_simple(8thread  307.74%     2.27us  439.99K
-folly_flatcombining_no_caching_simple(8thread)   136.00%     5.14us  194.45K
-folly_flatcombining_caching_simple(8thread)      148.43%     4.71us  212.22K
-atomics_fetch_add(8thread)                       358.67%     1.95us  512.81K
-atomic_fetch_xor(8thread)                        466.73%     1.50us  667.32K
-atomic_cas(8thread)                              371.61%     1.88us  531.31K
+std_mutex_simple(8thread)                                    7.02us  142.50K
+google_spin_simple(8thread)                      127.47%     5.51us  181.64K
+folly_microspin_simple(8thread)                  137.77%     5.09us  196.33K
+folly_picospin_simple(8thread)                   119.78%     5.86us  170.69K
+folly_microlock_simple(8thread)                  108.08%     6.49us  154.02K
+folly_sharedmutex_simple(8thread)                114.77%     6.11us  163.55K
+folly_distributedmutex_simple(8thread)           120.24%     5.84us  171.35K
+folly_distributedmutex_combining_simple(8thread  316.54%     2.22us  451.07K
+folly_flatcombining_no_caching_simple(8thread)   136.43%     5.14us  194.42K
+folly_flatcombining_caching_simple(8thread)      145.04%     4.84us  206.68K
+atomics_fetch_add(8thread)                       358.98%     1.95us  511.55K
+atomic_fetch_xor(8thread)                        505.27%     1.39us  720.02K
+atomic_cas(8thread)                              389.32%     1.80us  554.79K
 ----------------------------------------------------------------------------
-std_mutex_simple(16thread)                                  12.83us   77.96K
-google_spin_simple(16thread)                     122.19%    10.50us   95.26K
-folly_microspin_simple(16thread)                  99.14%    12.94us   77.30K
-folly_picospin_simple(16thread)                   62.74%    20.44us   48.91K
-folly_microlock_simple(16thread)                  75.01%    17.10us   58.48K
-folly_sharedmutex_simple(16thread)                79.92%    16.05us   62.31K
-folly_distributedmutex_simple(16thread)          118.18%    10.85us   92.14K
-folly_distributedmutex_combining_simple(16threa  482.27%     2.66us  376.00K
-folly_flatcombining_no_caching_simple(16thread)  191.45%     6.70us  149.26K
-folly_flatcombining_caching_simple(16thread)     227.12%     5.65us  177.07K
-atomics_fetch_add(16thread)                      612.80%     2.09us  477.77K
-atomic_fetch_xor(16thread)                       551.00%     2.33us  429.58K
-atomic_cas(16thread)                             282.79%     4.54us  220.47K
+std_mutex_simple(16thread)                                  12.78us   78.24K
+google_spin_simple(16thread)                     122.66%    10.42us   95.96K
+folly_microspin_simple(16thread)                  98.10%    13.03us   76.75K
+folly_picospin_simple(16thread)                   72.52%    17.62us   56.74K
+folly_microlock_simple(16thread)                  70.12%    18.23us   54.86K
+folly_sharedmutex_simple(16thread)                76.81%    16.64us   60.09K
+folly_distributedmutex_simple(16thread)          113.84%    11.23us   89.06K
+folly_distributedmutex_combining_simple(16threa  498.99%     2.56us  390.39K
+folly_flatcombining_no_caching_simple(16thread)  193.05%     6.62us  151.04K
+folly_flatcombining_caching_simple(16thread)     220.47%     5.80us  172.49K
+atomics_fetch_add(16thread)                      611.70%     2.09us  478.58K
+atomic_fetch_xor(16thread)                       515.51%     2.48us  403.32K
+atomic_cas(16thread)                             239.86%     5.33us  187.66K
 ----------------------------------------------------------------------------
-std_mutex_simple(32thread)                                  23.09us   43.30K
-google_spin_simple(32thread)                     125.07%    18.46us   54.16K
-folly_microspin_simple(32thread)                  76.39%    30.23us   33.08K
-folly_picospin_simple(32thread)                   46.54%    49.62us   20.16K
-folly_microlock_simple(32thread)                  52.84%    43.71us   22.88K
-folly_sharedmutex_simple(32thread)                53.06%    43.52us   22.98K
-folly_distributedmutex_simple(32thread)          107.10%    21.56us   46.38K
-folly_distributedmutex_combining_simple(32threa  596.57%     3.87us  258.33K
-folly_flatcombining_no_caching_simple(32thread)  274.44%     8.41us  118.84K
-folly_flatcombining_caching_simple(32thread)     312.83%     7.38us  135.46K
-atomics_fetch_add(32thread)                     1082.13%     2.13us  468.59K
-atomic_fetch_xor(32thread)                       552.82%     4.18us  239.39K
-atomic_cas(32thread)                             203.03%    11.37us   87.92K
+std_mutex_simple(32thread)                                  23.80us   42.02K
+google_spin_simple(32thread)                     125.41%    18.98us   52.69K
+folly_microspin_simple(32thread)                  76.32%    31.18us   32.07K
+folly_picospin_simple(32thread)                   48.82%    48.75us   20.51K
+folly_microlock_simple(32thread)                  52.99%    44.92us   22.26K
+folly_sharedmutex_simple(32thread)                54.03%    44.05us   22.70K
+folly_distributedmutex_simple(32thread)          108.28%    21.98us   45.49K
+folly_distributedmutex_combining_simple(32threa  697.71%     3.41us  293.15K
+folly_flatcombining_no_caching_simple(32thread)  291.70%     8.16us  122.56K
+folly_flatcombining_caching_simple(32thread)     412.51%     5.77us  173.32K
+atomics_fetch_add(32thread)                     1074.64%     2.21us  451.52K
+atomic_fetch_xor(32thread)                       577.90%     4.12us  242.81K
+atomic_cas(32thread)                             193.87%    12.28us   81.46K
 ----------------------------------------------------------------------------
-std_mutex_simple(64thread)                                  39.95us   25.03K
-google_spin_simple(64thread)                     124.75%    32.02us   31.23K
-folly_microspin_simple(64thread)                  73.49%    54.36us   18.40K
-folly_picospin_simple(64thread)                   39.80%   100.37us    9.96K
-folly_microlock_simple(64thread)                  50.07%    79.78us   12.53K
-folly_sharedmutex_simple(64thread)                49.52%    80.66us   12.40K
-folly_distributedmutex_simple(64thread)          104.56%    38.20us   26.18K
-folly_distributedmutex_combining_simple(64threa  532.34%     7.50us  133.26K
-folly_flatcombining_no_caching_simple(64thread)  279.23%    14.31us   69.90K
-folly_flatcombining_caching_simple(64thread)     325.10%    12.29us   81.39K
-atomics_fetch_add(64thread)                     1031.51%     3.87us  258.23K
-atomic_fetch_xor(64thread)                       525.68%     7.60us  131.60K
-atomic_cas(64thread)                             187.67%    21.28us   46.98K
+std_mutex_simple(64thread)                                  41.40us   24.16K
+google_spin_simple(64thread)                     125.42%    33.01us   30.30K
+folly_microspin_simple(64thread)                  75.30%    54.98us   18.19K
+folly_picospin_simple(64thread)                   42.87%    96.57us   10.35K
+folly_microlock_simple(64thread)                  50.88%    81.37us   12.29K
+folly_sharedmutex_simple(64thread)                50.08%    82.67us   12.10K
+folly_distributedmutex_simple(64thread)          105.81%    39.12us   25.56K
+folly_distributedmutex_combining_simple(64threa  604.86%     6.84us  146.11K
+folly_flatcombining_no_caching_simple(64thread)  269.82%    15.34us   65.18K
+folly_flatcombining_caching_simple(64thread)     334.78%    12.37us   80.87K
+atomics_fetch_add(64thread)                     1061.21%     3.90us  256.34K
+atomic_fetch_xor(64thread)                       551.00%     7.51us  133.10K
+atomic_cas(64thread)                             183.75%    22.53us   44.39K
 ----------------------------------------------------------------------------
-std_mutex_simple(128thread)                                 78.65us   12.71K
-google_spin_simple(128thread)                    124.05%    63.40us   15.77K
-folly_microspin_simple(128thread)                 70.00%   112.36us    8.90K
-folly_picospin_simple(128thread)                  29.72%   264.60us    3.78K
-folly_microlock_simple(128thread)                 47.74%   164.73us    6.07K
-folly_sharedmutex_simple(128thread)               48.87%   160.93us    6.21K
-folly_distributedmutex_simple(128thread)         104.04%    75.59us   13.23K
-folly_distributedmutex_combining_simple(128thre  426.02%    18.46us   54.17K
-folly_flatcombining_no_caching_simple(128thread  210.85%    37.30us   26.81K
-folly_flatcombining_caching_simple(128thread)    241.48%    32.57us   30.70K
-atomics_fetch_add(128thread)                     992.30%     7.93us  126.17K
-atomic_fetch_xor(128thread)                      525.32%    14.97us   66.79K
-atomic_cas(128thread)                            181.89%    43.24us   23.13K
+std_mutex_simple(128thread)                                 80.97us   12.35K
+google_spin_simple(128thread)                    124.75%    64.90us   15.41K
+folly_microspin_simple(128thread)                 70.93%   114.16us    8.76K
+folly_picospin_simple(128thread)                  32.81%   246.78us    4.05K
+folly_microlock_simple(128thread)                 48.00%   168.69us    5.93K
+folly_sharedmutex_simple(128thread)               49.03%   165.15us    6.06K
+folly_distributedmutex_simple(128thread)         103.96%    77.88us   12.84K
+folly_distributedmutex_combining_simple(128thre  460.68%    17.58us   56.90K
+folly_flatcombining_no_caching_simple(128thread  211.10%    38.35us   26.07K
+folly_flatcombining_caching_simple(128thread)    220.02%    36.80us   27.17K
+atomics_fetch_add(128thread)                    1031.88%     7.85us  127.45K
+atomic_fetch_xor(128thread)                      543.67%    14.89us   67.15K
+atomic_cas(128thread)                            179.37%    45.14us   22.15K
 ============================================================================
 */