Commit f11ea4c6 authored by Aaryaman Sagar, committed by Facebook Github Bot

Add flat combining to DistributedMutex

Summary:
Add combined critical sections to DistributedMutex.  The implementation uses
the framework within DistributedMutex as the point of reference for contention
and resolves contention by either combining the lock requests of peers or
migrating the lock based on usage and internal state.  This boosts the
performance of DistributedMutex even further: up to 4x relative to the old
benchmark on dual-socket Broadwell and up to 5x on single-socket Skylake
machines.  The win might be bigger when the cost of mutex migration is higher,
e.g. when the data being protected is wider than a single L1 cache line.
Small critical sections, when used in combinable mode, can now go more than
10x faster than the small locks, about 6x faster than std::mutex, up to 2-3x
faster than the implementations of flat combining we benchmarked against, and
about as fast as a CAS instruction/loop (faster on some NUMA-less and more
parallel architectures like Skylake).  This also allows flat combining to be
used in situations where fine-grained locking would be beneficial, with
virtually no overhead; DistributedMutex retains its original size of 8 bytes.
DistributedMutex resolves contention through flat combining up to a constant
factor of 2 contention chains to prevent issues with fairness and latency
outliers, so we retain the fairness benefits of the original implementation
with no noticeable regression when switching between the lock methods.

The implementation of combined critical sections here is different from the
original flat combining paper.  It uses the same stack-based LIFO contention
chains from DistributedMutex to allow the leader to resolve lock requests from
peers.  Combine records are located on the stack along with the wait node as
an InlineFunctionRef instance to avoid memory allocation overhead or expensive
copying.  Using InlineFunctionRef also means that function calls are resolved
without having to go through the double lookup of a vtable-based
implementation; InlineFunctionRef can flatten the virtual table and callable
object in situ, so we have just one indirection.  Additionally, we use
preemption as a signal to speed up lock requests in the case where the latency
of acquisition would otherwise have gone beyond our control.  As a side
benefit, this also results in much simpler code.
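
To make the single-indirection point concrete, here is a minimal sketch of an
InlineFunctionRef-style wrapper.  This is an illustrative assumption, not
folly's actual detail::InlineFunctionRef: the callable is copied into inline
storage next to a single invoke-function pointer, so a call costs one indirect
call and never touches the heap.
```
// Sketch only: an inline, type-erased callable reference (hypothetical
// InlineFnRefSketch, not folly's detail::InlineFunctionRef)
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

template <std::size_t kStorage>
class InlineFnRefSketch {
 public:
  template <typename F>
  explicit InlineFnRefSketch(F f) {
    static_assert(sizeof(F) <= kStorage, "callable too large for inline storage");
    static_assert(
        std::is_trivially_copyable<F>::value,
        "this sketch only handles trivially copyable callables");
    // copy the callable into the inline buffer, no allocation
    ::new (&storage_) F(std::move(f));
    // a single function pointer that knows how to invoke the erased callable
    call_ = [](void* storage) { (*static_cast<F*>(storage))(); };
  }

  // one indirect call, no vtable double lookup
  void operator()() { call_(&storage_); }

 private:
  void (*call_)(void*) = nullptr;
  std::aligned_storage_t<kStorage, alignof(std::max_align_t)> storage_;
};
```
A lambda capturing a handful of pointers fits in the inline buffer, which is
why the combine record can live in the waiter's stack frame with no
allocation.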

The API looks like the following:
```
auto integer = std::uint64_t{};
auto mutex = folly::DistributedMutex{};

// ...

mutex.lock_combine([&]() {
  foo();
  integer++;
});
```

This adds three new methods for symmetry with the old lock functions:
- folly::invoke_result_t<const Func&> lock_combine(Func) noexcept;
- folly::Optional<> try_lock_combine_for(duration, Func) noexcept;
- folly::Optional<> try_lock_combine_until(time_point, Func) noexcept;
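
As a usage sketch (assuming the signatures above and the usual includes;
folly::Optional is empty when the timed acquisition fails), a combined
critical section can also return a value to the calling thread:
```
auto map = std::map<int, int>{};
auto mutex = folly::DistributedMutex{};

// the return value of the lambda is transferred back to the caller
auto size = mutex.lock_combine([&]() {
  map[1] = 2;
  return map.size();
});

// the timed variants return folly::none if the mutex could not be acquired
auto result = mutex.try_lock_combine_for(
    std::chrono::milliseconds{10}, [&]() { return map.size(); });
if (result) {
  // the critical section ran; *result holds the value it returned
}
```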

Benchmarks on Broadwell
```
std_mutex_simple(1thread)                                  617.28ns    1.62M
google_spin_simple(1thread)                      101.97%   605.33ns    1.65M
folly_microspin_simple(1thread)                   99.40%   621.01ns    1.61M
folly_picospin_simple(1thread)                   100.15%   616.36ns    1.62M
folly_microlock_simple(1thread)                   98.86%   624.37ns    1.60M
folly_sharedmutex_simple(1thread)                 86.14%   716.59ns    1.40M
folly_distributedmutex_simple(1thread)            97.95%   630.21ns    1.59M
folly_distributedmutex_flatcombining_simple(1th   98.04%   629.60ns    1.59M
folly_flatcombining_no_caching_simple(1thread)    89.85%   687.01ns    1.46M
folly_flatcombining_caching_simple(1thread)       78.36%   787.75ns    1.27M
atomics_fetch_add(1thread)                        97.88%   630.67ns    1.59M
atomic_cas(1thread)                              102.31%   603.33ns    1.66M
----------------------------------------------------------------------------
std_mutex_simple(2thread)                                    1.14us  875.72K
google_spin_simple(2thread)                      125.08%   912.95ns    1.10M
folly_microspin_simple(2thread)                  116.03%   984.14ns    1.02M
folly_picospin_simple(2thread)                   117.35%   973.04ns    1.03M
folly_microlock_simple(2thread)                  102.54%     1.11us  897.95K
folly_sharedmutex_simple(2thread)                121.04%   943.42ns    1.06M
folly_distributedmutex_simple(2thread)           128.24%   890.48ns    1.12M
folly_distributedmutex_flatcombining_simple(2th  107.99%     1.06us  945.66K
folly_flatcombining_no_caching_simple(2thread)    83.40%     1.37us  730.33K
folly_flatcombining_caching_simple(2thread)       87.47%     1.31us  766.00K
atomics_fetch_add(2thread)                       115.71%   986.85ns    1.01M
atomic_cas(2thread)                              171.35%   666.42ns    1.50M
----------------------------------------------------------------------------
std_mutex_simple(4thread)                                    1.98us  504.43K
google_spin_simple(4thread)                      103.24%     1.92us  520.76K
folly_microspin_simple(4thread)                   92.05%     2.15us  464.33K
folly_picospin_simple(4thread)                    89.16%     2.22us  449.75K
folly_microlock_simple(4thread)                   66.62%     2.98us  336.06K
folly_sharedmutex_simple(4thread)                 82.61%     2.40us  416.69K
folly_distributedmutex_simple(4thread)           108.83%     1.82us  548.98K
folly_distributedmutex_flatcombining_simple(4th  145.24%     1.36us  732.63K
folly_flatcombining_no_caching_simple(4thread)    84.77%     2.34us  427.62K
folly_flatcombining_caching_simple(4thread)       91.01%     2.18us  459.09K
atomics_fetch_add(4thread)                       142.86%     1.39us  720.62K
atomic_cas(4thread)                              223.50%   887.02ns    1.13M
----------------------------------------------------------------------------
std_mutex_simple(8thread)                                    3.70us  270.40K
google_spin_simple(8thread)                      110.24%     3.35us  298.09K
folly_microspin_simple(8thread)                   81.59%     4.53us  220.63K
folly_picospin_simple(8thread)                    57.61%     6.42us  155.77K
folly_microlock_simple(8thread)                   54.18%     6.83us  146.49K
folly_sharedmutex_simple(8thread)                 55.44%     6.67us  149.92K
folly_distributedmutex_simple(8thread)           109.86%     3.37us  297.05K
folly_distributedmutex_flatcombining_simple(8th  225.14%     1.64us  608.76K
folly_flatcombining_no_caching_simple(8thread)    96.25%     3.84us  260.26K
folly_flatcombining_caching_simple(8thread)      108.13%     3.42us  292.39K
atomics_fetch_add(8thread)                       255.40%     1.45us  690.60K
atomic_cas(8thread)                              183.68%     2.01us  496.66K
----------------------------------------------------------------------------
std_mutex_simple(16thread)                                   8.70us  114.89K
google_spin_simple(16thread)                     124.47%     6.99us  143.01K
folly_microspin_simple(16thread)                  86.46%    10.07us   99.34K
folly_picospin_simple(16thread)                   40.76%    21.36us   46.83K
folly_microlock_simple(16thread)                  54.78%    15.89us   62.94K
folly_sharedmutex_simple(16thread)                58.14%    14.97us   66.80K
folly_distributedmutex_simple(16thread)          124.53%     6.99us  143.08K
folly_distributedmutex_flatcombining_simple(16t  324.08%     2.69us  372.34K
folly_flatcombining_no_caching_simple(16thread)  134.73%     6.46us  154.79K
folly_flatcombining_caching_simple(16thread)     188.24%     4.62us  216.28K
atomics_fetch_add(16thread)                      340.07%     2.56us  390.72K
atomic_cas(16thread)                             220.15%     3.95us  252.93K
----------------------------------------------------------------------------
std_mutex_simple(32thread)                                  25.62us   39.03K
google_spin_simple(32thread)                     105.21%    24.35us   41.07K
folly_microspin_simple(32thread)                  79.64%    32.17us   31.08K
folly_picospin_simple(32thread)                   19.61%   130.67us    7.65K
folly_microlock_simple(32thread)                  42.97%    59.62us   16.77K
folly_sharedmutex_simple(32thread)                52.41%    48.88us   20.46K
folly_distributedmutex_simple(32thread)          144.48%    17.73us   56.39K
folly_distributedmutex_flatcombining_simple(32t  461.73%     5.55us  180.22K
folly_flatcombining_no_caching_simple(32thread)  207.55%    12.34us   81.01K
folly_flatcombining_caching_simple(32thread)     237.34%    10.80us   92.64K
atomics_fetch_add(32thread)                      561.68%     4.56us  219.23K
atomic_cas(32thread)                             484.13%     5.29us  188.96K
----------------------------------------------------------------------------
std_mutex_simple(64thread)                                  31.26us   31.99K
google_spin_simple(64thread)                      99.95%    31.28us   31.97K
folly_microspin_simple(64thread)                  83.63%    37.38us   26.75K
folly_picospin_simple(64thread)                   20.88%   149.68us    6.68K
folly_microlock_simple(64thread)                  45.46%    68.77us   14.54K
folly_sharedmutex_simple(64thread)                52.65%    59.38us   16.84K
folly_distributedmutex_simple(64thread)          154.90%    20.18us   49.55K
folly_distributedmutex_flatcombining_simple(64t  475.05%     6.58us  151.96K
folly_flatcombining_no_caching_simple(64thread)  195.63%    15.98us   62.58K
folly_flatcombining_caching_simple(64thread)     199.29%    15.69us   63.75K
atomics_fetch_add(64thread)                      580.23%     5.39us  185.61K
atomic_cas(64thread)                             510.76%     6.12us  163.39K
----------------------------------------------------------------------------
std_mutex_simple(128thread)                                 70.53us   14.18K
google_spin_simple(128thread)                     99.20%    71.09us   14.07K
folly_microspin_simple(128thread)                 88.73%    79.49us   12.58K
folly_picospin_simple(128thread)                  22.24%   317.06us    3.15K
folly_microlock_simple(128thread)                 50.17%   140.57us    7.11K
folly_sharedmutex_simple(128thread)               59.53%   118.47us    8.44K
folly_distributedmutex_simple(128thread)         172.74%    40.83us   24.49K
folly_distributedmutex_flatcombining_simple(128  538.22%    13.10us   76.31K
folly_flatcombining_no_caching_simple(128thread  165.11%    42.72us   23.41K
folly_flatcombining_caching_simple(128thread)    161.46%    43.68us   22.89K
atomics_fetch_add(128thread)                     606.51%    11.63us   85.99K
atomic_cas(128thread)                            578.52%    12.19us   82.03K
```

Reviewed By: djwatson

Differential Revision: D13799447

fbshipit-source-id: 923cc35e5060ef79b349690821d8545459248347
parent 11566445
@@ -15,23 +15,24 @@
*/
#include <folly/synchronization/DistributedMutex.h>
#include <folly/CachelinePadded.h>
#include <folly/Likely.h>
#include <folly/Portability.h>
#include <folly/ScopeGuard.h>
#include <folly/Utility.h>
#include <folly/chrono/Hardware.h>
#include <folly/detail/Futex.h>
#include <folly/functional/Invoke.h>
#include <folly/lang/Align.h>
#include <folly/lang/Bits.h>
#include <folly/portability/Asm.h>
#include <folly/synchronization/AtomicNotification.h>
#include <folly/synchronization/AtomicUtil.h>
#include <folly/synchronization/WaitOptions.h>
#include <folly/synchronization/detail/InlineFunctionRef.h>
#include <folly/synchronization/detail/Sleeper.h>
#include <folly/synchronization/detail/Spin.h>
#include <glog/logging.h>
#include <array>
#include <atomic>
#include <cstdint>
#include <limits>
@@ -75,6 +76,13 @@ constexpr auto kTimedWaiter = std::uintptr_t{0b10};
// this becomes significant for threads that are trying to wake up the
// uninitialized thread, if they see that the thread is not yet initialized,
// they can do nothing but spin, and wait for the thread to get initialized
//
// This also plays a role in the functioning of flat combining as implemented
// in DistributedMutex. When a thread owning the lock goes through the
// contention chain to either unlock the mutex or combine critical sections
// from the other end. The presence of kUninitialized means that the
// combining thread is not able to make progress after this point. So we
// transfer the lock.
constexpr auto kUninitialized = std::uint32_t{0b0};
// kWaiting will be set in the waiter's futex structs while they are spinning
// while waiting for the mutex
@@ -107,6 +115,20 @@ constexpr auto kAboutToWait = std::uint32_t{0b100};
// had not yet entered futex(). This interleaving causes the thread calling
// futex() to return spuriously, as the futex word is not what it should be
constexpr auto kSleeping = std::uint32_t{0b101};
// kCombined is set by the lock holder to let the waiter thread know that its
// combine request was successfully completed by the lock holder. A
// successful combine means that the thread requesting the combine operation
// does not need to unlock the mutex; in fact, doing so would be an error.
constexpr auto kCombined = std::uint32_t{0b111};
// kCombineUninitialized is like kUninitialized but is set by a thread when it
// enqueues in hopes of getting its critical section combined with the lock
// holder
constexpr auto kCombineUninitialized = std::uint32_t{0b1000};
// kCombineWaiting is set by a thread when it is ready to have its combine
// record fulfilled by the lock holder. In particular, this signals to the
// lock holder that the thread has set its next_ pointer in the contention
// chain
constexpr auto kCombineWaiting = std::uint32_t{0b1001};
// The number of spins that we are allowed to do before we resort to marking a
// thread as having slept
@@ -116,15 +138,18 @@ constexpr auto kScheduledAwaySpinThreshold = std::chrono::nanoseconds{200};
// The maximum number of spins before a thread starts yielding its processor
// in hopes of getting skipped
constexpr auto kMaxSpins = 4000;
// The maximum number of contention chains we can resolve with flat combining.
// After this number of contention chains, the mutex falls back to regular
// two-phased mutual exclusion to ensure that we don't starve the combiner
// thread
constexpr auto kMaxCombineIterations = 2;
/**
* Write only data that is available to the thread that is waking up another.
* Only the waking thread is allowed to write to this, the thread to be woken
* is allowed to read from this after a wakeup has been issued
*
* Because of the write only semantics of the data here, acquire-release (or
* stronger) memory ordering is needed to write to this
*/
template <template <typename> class Atomic>
class WakerMetadata {
public:
// This is the thread that initiated wakeups for the contention chain.
@@ -133,7 +158,7 @@ class WakerMetadata {
// woke up sees this as the next thread to wake up, it knows that it is the
// terminal node in the contention chain. This means that it was the one
// that took off the thread that had acquired the mutex off the centralized
// state. Therefore, the current thread is the last in it's contention
// state. Therefore, the current thread is the last in its contention
// chain. It will fall back to centralized storage to pick up the next
// waiter or release the mutex
//
@@ -144,41 +169,354 @@ class WakerMetadata {
// prohitively large threshold to avoid heap allocations, this strategy
// however, might cause increased cache misses on wakeup signalling
std::uintptr_t waker_{0};
// the list of threads that the waker had previously seen to be sleeping on
// a futex(),
//
// this is given to the current thread as a means to pass on
// information. When the current thread goes to unlock the mutex and does
// not see contention, it should go and wake up the head of this list. If
// the current thread sees a contention chain on the mutex, it should pass
// on this list to the next thread that gets woken up
std::uintptr_t waiters_{0};
// The futex that this waiter will sleep on
//
// how can we reuse futex_ from above for futex management?
Futex<Atomic> sleeper_{kUninitialized};
};
/**
* Type of the type-erased callable that is used for combining from the lock
* holder's end. This has 48 bytes of inline storage that can be used to
* minimize cache misses when combining
*/
using CombineFunction = detail::InlineFunctionRef<void(), 48>;
/**
* Waiter encapsulates the state required for waiting on the mutex, this
* contains potentially heavy state and is intended to be allocated on the
* stack as part of a lock() function call
*
* To ensure that synchronization does not cause unintended side effects on
* the rest of the thread stack (eg. metadata in lockImplementation(), or any
* other data in the user's thread), we aggresively pad this struct and use
* custom alignment internally to ensure that the relevant data fits within a
* single cacheline. The added alignment here also gives us some room to
* wiggle in the bottom few bits of the mutex, where we store extra metadata
*/
template <template <typename> class Atomic>
class Waiter {
public:
explicit Waiter(std::uint64_t futex) : futex_{futex} {}
Waiter() = default;
Waiter(Waiter&&) = delete;
Waiter(const Waiter&) = delete;
Waiter& operator=(Waiter&&) = delete;
Waiter& operator=(const Waiter&) = delete;
void initialize(std::uint64_t futex, CombineFunction task) {
// we only initialize the function if we were actually given a non-null
// task, otherwise
if (task) {
DCHECK_EQ(futex, kCombineUninitialized);
new (&function_) CombineFunction{task};
} else {
DCHECK((futex == kUninitialized) || (futex == kAboutToWait));
new (&metadata_) WakerMetadata<Atomic>{};
}
// this pedantic store is needed to ensure that the waking thread
// synchronizes with the state in the waiter struct when it loads the
// value of the futex word
//
// on x86, this gets optimized away to just a regular store, it might be
// needed on platforms where explicit acquire-release barriers are
// required for synchronization
//
// note that we release here at the end of the constructor because
// construction is complete here, any thread that acquires this release
// will see a well constructed wait node
futex_.store(futex, std::memory_order_release);
}
std::array<std::uint8_t, hardware_destructive_interference_size> padding1;
// the atomic that this thread will spin on while waiting for the mutex to
// be unlocked
Atomic<std::uint64_t> futex_{kUninitialized};
alignas(hardware_destructive_interference_size) Atomic<std::uint64_t> futex_{
kUninitialized};
// metadata for the waker
WakerMetadata wakerMetadata_{};
// The successor of this node. This will be the thread that had its address
// on the mutex previously
std::uintptr_t next_{0};
// the list of threads that the waker had previously seen to be sleeping on
// a futex(),
//
// this is given to the current thread as a means to pass on
// information. When the current thread goes to unlock the mutex and does
// not see contention, it should go and wake up the head of this list. If
// the current thread sees a contention chain on the mutex, it should pass
// on this list to the next thread that gets woken up
std::uintptr_t waiters_{0};
// The futex that this waiter will sleep on
//
// how can we reuse futex_ from above for futex management?
Futex<Atomic> sleeper_{kUninitialized};
// We use an anonymous union for the combined critical section request and
// the metadata that will be filled in from the leader's end. Only one is
// active at a time - if a leader decides to combine the requested critical
// section into its execution, it will not touch the metadata field. If a
// leader decides to migrate the lock to the waiter, it will not touch the
// function
//
// this allows us to transfer more state when combining a critical section
// and reduce the cache misses originating from executing an arbitrary
// lambda
//
// note that this is an anonymous union, not an unnamed union, the members
// leak into the surrounding scope
union {
// metadata for the waker
WakerMetadata<Atomic> metadata_;
// The critical section that can potentially be combined into the critical
// section of the locking thread
//
// This is kept as a FunctionRef because the original function is preserved
// until the lock_combine() function returns. A consequence of using
// FunctionRef here is that we don't need to do any allocations and can
// allow users to capture unbounded state into the critical section. Flat
// combining means that the user does not have access to the thread
// executing the critical section, so assumptions about thread local
// references can be invalidated. Being able to capture arbitrary state
// allows the user to do thread local accesses right before the critical
// section and pass them as state to the callable being referenced here
CombineFunction function_;
// The user is allowed to use a combined critical section that returns a
// value. This buffer is used to implement the value transfer to the
// waiting thread. We reuse the same union because this helps us combine
// one synchronization operation with a material value transfer.
//
// The waker thread needs to synchronize on this cacheline to issue a
// wakeup to the waiter, meaning that the entire line needs to be pulled
// into the remote core in exclusive mode. So we reuse the coherence
// operation to transfer the return value in addition to the
// synchronization signal. In the case that the user's data item is
// small, the data is transferred all inline as part of the same line,
// which pretty much arrives into the CPU cache in the same clock cycle or
// two after a read-for-ownership request. This gives us a high chance of
// coalescing the entire transitive store buffer together into one cache
// coherence operation from the waker's end. This allows us to make use
// of the CPU bus bandwidth which would have otherwise gone to waste.
// Benchmarks prove this theory under a wide range of contention, value
// sizes, NUMA interactions and processor models
//
// The current version of the Intel optimization manual confirms this
// theory somewhat as well in section 2.3.5.1 (Load and Store Operation
// Overview)
//
// When an instruction writes data to a memory location [...], the
// processor ensures that it has the line containing this memory location
// is in its L1d cache [...]. If the cache line is not there, it fetches
// from the next levels using a RFO request [...] RFO and storing the
// data happens after instruction retirement. Therefore, the store
// latency usually does not affect the store instruction itself
//
// This gives the user the ability to input up to 48 bytes into the
// combined critical section through an InlineFunctionRef and output 48
// bytes from it basically without any cost. The type of the entity
// stored in the buffer has to be matched by the type erased callable that
// the caller has used. At this point, the caller is still in the
// template instantiation leading to the combine request, so it has
// knowledge of the return type and can apply the appropriate
// reinterpret_cast and launder operation to safely retrieve the data from
// this buffer
std::aligned_storage_t<48, 8> storage_;
};
std::array<std::uint8_t, hardware_destructive_interference_size> padding2;
};
/**
* A template that helps us differentiate between the different ways to return
* a value from a combined critical section. A return value of type void
* cannot be stored anywhere, so we use specializations and pick the right one
* switched through std::conditional_t
*
* This is then used by CoalescedTask and its family of functions to implement
* efficient return value transfers to the waiting threads
*/
template <typename Func>
class RequestWithReturn {
public:
using F = Func;
using ReturnType = folly::invoke_result_t<const Func&>;
explicit RequestWithReturn(Func func) : func_{std::move(func)} {}
/**
* We need to define the destructor here because C++ requires (with good
* reason) that a union with non-default destructor be explicitly destroyed
* from the surrounding class, as neither the runtime nor compiler have the
* knowledge of what to do with a union at the time of destruction
*
* Each request that has a valid return value set will have the value
* retrieved from the get() method, where the value is destroyed. So we
* don't need to destroy it here
*/
~RequestWithReturn() {}
/**
* This method can be used to return a value from the request. This returns
* the underlying value because return type of the function we were
* instantiated with is not void
*/
ReturnType get() && {
// when the return value has been processed, we destroy the value
// contained in this request. Using a scope_exit means that we don't have
// to worry about storing the value somewhere and causing potentially an
// extra move
//
// note that the invariant here is that this function is only called if the
// requesting thread had it's critical section combined, and the value_
// member constructed through detach()
SCOPE_EXIT {
value_.~ReturnType();
};
return std::move(value_);
}
// this contains a copy of the function the waiter had requested to be
// executed as a combined critical section
Func func_;
// this stores the return value used in the request, we use a union here to
// avoid laundering and allow return types that are not default
// constructible to be propagated through the execution of the critical
// section
//
// note that this is an anonymous union, the member leaks into the
// surrounding scope as a member variable
union {
ReturnType value_;
};
};
template <typename Func>
class RequestWithoutReturn {
public:
using F = Func;
using ReturnType = void;
explicit RequestWithoutReturn(Func func) : func_{std::move(func)} {}
/**
* In this version of the request class, get() returns nothing as there is
* no stored value
*/
void get() && {}
// this contains a copy of the function the waiter had requested to be
// executed as a combined critical section
Func func_;
};
// we need to use std::integral_constant::value here as opposed to
// std::integral_constant::operator T() because MSVC errors out with the
// implicit conversion
template <typename Func>
using Request = std::conditional_t<
std::is_same<folly::invoke_result_t<const Func&>, void>::value,
RequestWithoutReturn<Func>,
RequestWithReturn<Func>>;
/**
* A template that helps us to transform a callable returning a value to one
* that returns void so it can be type erased and passed on to the waker. The
* return value gets coalesced into the wait struct when it is small enough
* for optimal data transfer
*
* This helps a combined critical section feel more normal in the case where
* the user wants to return a value, for example
*
* auto value = mutex_.lock_combine([&]() {
* return data_.value();
* });
*
* Without this, the user would typically create a dummy object that they
* would then assign to from within the lambda. With return value chaining,
* this pattern feels more natural
*
* Note that it is important to copy the entire callble into this class.
* Storing something like a reference instead is not desirable because it does
* not allow InlineFunctionRef to use inline storage to represent the user's
* callable without extra indirections
*
* We use std::conditional_t and switch to the right type of task with the
* CoalescedTask type alias
*/
template <typename Func, typename Waiter>
class TaskWithCoalesce {
public:
using ReturnType = folly::invoke_result_t<const Func&>;
explicit TaskWithCoalesce(Func func, Waiter& waiter)
: func_{std::move(func)}, waiter_{waiter} {}
void operator()() const {
auto value = func_();
new (&waiter_.storage_) ReturnType{std::move(value)};
}
private:
Func func_;
Waiter& waiter_;
static_assert(alignof(decltype(waiter_.storage_)) >= alignof(ReturnType), "");
static_assert(sizeof(decltype(waiter_.storage_)) >= sizeof(ReturnType), "");
};
template <typename Func, typename Waiter>
class TaskWithoutCoalesce {
public:
using ReturnType = void;
explicit TaskWithoutCoalesce(Func func, Waiter&) : func_{std::move(func)} {}
void operator()() const {
func_();
}
private:
Func func_;
};
// we need to use std::integral_constant::value here as opposed to
// std::integral_constant::operator T() because MSVC errors out with the
// implicit conversion
template <typename Func, typename Waiter>
using CoalescedTask = std::conditional_t<
std::is_void<folly::invoke_result_t<const Func&>>::value,
TaskWithoutCoalesce<Func, Waiter>,
TaskWithCoalesce<Func, Waiter>>;
/**
* Given a request and a wait node, coalesce them into a CoalescedTask that
* coalesces the return value into the wait node when invoked from a remote
* thread
*
* When given a null request through nullptr_t, coalesce() returns null as well
*/
template <typename Waiter>
std::nullptr_t coalesce(std::nullptr_t&, Waiter&) {
return nullptr;
}
template <
typename Request,
typename Waiter,
typename Func = typename Request::F>
CoalescedTask<Func, Waiter> coalesce(Request& request, Waiter& waiter) {
static_assert(!std::is_same<Request, std::nullptr_t>{}, "");
return CoalescedTask<Func, Waiter>{request.func_, waiter};
}
/**
* Given a CoalescedTask, a wait node and a request. Detach the return value
* into the request from the wait node and task.
*/
template <typename Waiter>
void detach(std::nullptr_t&, Waiter&) {}
template <typename Waiter, typename F>
void detach(RequestWithoutReturn<F>&, Waiter&) {}
template <typename Waiter, typename F>
void detach(RequestWithReturn<F>& request, Waiter& waiter) {
using ReturnType = typename RequestWithReturn<F>::ReturnType;
static_assert(!std::is_same<ReturnType, void>{}, "");
auto& val = *folly::launder(reinterpret_cast<ReturnType*>(&waiter.storage_));
new (&request.value_) ReturnType{std::move(val)};
val.~ReturnType();
}
/**
* Get the time since epoch in nanoseconds
*
@@ -198,14 +536,14 @@ inline std::chrono::nanoseconds time() {
* address from a uintptr_t
*/
template <typename Type>
Type* extractAddress(std::uintptr_t from) {
Type* extractPtr(std::uintptr_t from) {
// shift one bit off the end, to get all 1s followed by a single 0
auto mask = std::numeric_limits<std::uintptr_t>::max();
mask >>= 1;
mask <<= 1;
CHECK(!(mask & 0b1));
return reinterpret_cast<Type*>(from & mask);
return folly::bit_cast<Type*>(from & mask);
}
/**
@@ -241,7 +579,9 @@ class DistributedMutex<Atomic, TimePublishing>::DistributedMutexStateProxy {
next_ = std::exchange(other.next_, nullptr);
expected_ = std::exchange(other.expected_, 0);
wakerMetadata_ = std::exchange(other.wakerMetadata_, {});
timedWaiters_ = std::exchange(other.timedWaiters_, false);
combined_ = std::exchange(other.combined_, false);
waker_ = std::exchange(other.waker_, 0);
waiters_ = std::exchange(other.waiters_, nullptr);
ready_ = std::exchange(other.ready_, nullptr);
@@ -260,23 +600,25 @@ class DistributedMutex<Atomic, TimePublishing>::DistributedMutexStateProxy {
friend class DistributedMutex<Atomic, TimePublishing>;
DistributedMutexStateProxy(
CachelinePadded<Waiter<Atomic>>* next,
Waiter<Atomic>* next,
std::uintptr_t expected,
bool timedWaiter = false,
WakerMetadata wakerMetadata = {},
bool combined = false,
CachelinePadded<Waiter<Atomic>>* waiters = nullptr,
std::uintptr_t waker = 0,
CachelinePadded<Waiter<Atomic>>* ready = nullptr)
Waiter<Atomic>* waiters = nullptr,
Waiter<Atomic>* ready = nullptr)
: next_{next},
expected_{expected},
timedWaiters_{timedWaiter},
wakerMetadata_{wakerMetadata},
combined_{combined},
waker_{waker},
waiters_{waiters},
ready_{ready} {}
// the next thread that is to be woken up, this being null at the time of
// unlock() shows that the current thread acquired the mutex without
// contention or it was the terminal thread in the queue of threads waking up
CachelinePadded<Waiter<Atomic>>* next_{nullptr};
Waiter<Atomic>* next_{nullptr};
// this is the value that the current thread should expect to find on
// unlock, and if this value is not there on unlock, the current thread
// should assume that other threads are enqueued waiting for the mutex
@@ -298,18 +640,22 @@ class DistributedMutex<Atomic, TimePublishing>::DistributedMutexStateProxy {
// done so we can avoid having to issue a atomic_notify_all() call (and
// subsequently a thundering herd) when waking up timed-wait threads
bool timedWaiters_{false};
// a boolean that contains true if the state proxy is not meant to be passed
// to the unlock() function. This is set only when there is contention and
// a thread had asked for its critical section to be combined
bool combined_{false};
// metadata passed along from the thread that woke this thread up
WakerMetadata wakerMetadata_{};
std::uintptr_t waker_{0};
// the list of threads that are waiting on a futex
//
// the current threads is meant to wake up this list of waiters if it is
// able to commit an unlock() on the mutex without seeing a contention chain
CachelinePadded<Waiter<Atomic>>* waiters_{nullptr};
Waiter<Atomic>* waiters_{nullptr};
// after a thread has woken up from a futex() call, it will have the rest of
// the threads that it were waiting behind it in this list, a thread that
// unlocks has to wake up threads from this list if it has any, before it
// goes to sleep to prevent pathological unfairness
CachelinePadded<Waiter<Atomic>>* ready_{nullptr};
Waiter<Atomic>* ready_{nullptr};
};
template <template <typename> class Atomic, bool TimePublishing>
@@ -317,8 +663,9 @@ DistributedMutex<Atomic, TimePublishing>::DistributedMutex()
: state_{kUnlocked} {}
template <typename Waiter>
bool spin(Waiter& waiter) {
bool spin(Waiter& waiter, std::uint32_t& sig, std::uint32_t mode) {
auto spins = 0;
auto waitMode = (mode == kCombineUninitialized) ? kCombineWaiting : kWaiting;
while (true) {
// publish our current time in the futex as a part of the spin waiting
// process
@@ -328,14 +675,15 @@ bool spin(Waiter& waiter) {
// timestamp to force the waking thread to skip us
++spins;
auto now = (spins < kMaxSpins) ? time() : decltype(time())::zero();
auto data = strip(now) | kWaiting;
auto data = strip(now) | waitMode;
auto signal = waiter.futex_.exchange(data, std::memory_order_acq_rel);
signal &= std::numeric_limits<std::uint8_t>::max();
// if we got skipped, make a note of it and return if we got a skipped
// signal or a signal to wake up
auto skipped = signal == kSkipped;
auto skipped = (signal == kSkipped);
if (skipped || (signal == kWake)) {
if (skipped || (signal == kWake) || (signal == kCombined)) {
sig = signal;
return !skipped;
}
@@ -379,7 +727,7 @@ void doFutexWake(Waiter* waiter) {
//
// this dangilng pointer possibility is why we use a pointer to the futex
// word, and avoid dereferencing after the store() operation
auto sleeper = &(*waiter)->sleeper_;
auto sleeper = &waiter->metadata_.sleeper_;
sleeper->store(kWake, std::memory_order_release);
futexWake(sleeper, 1);
}
@@ -389,7 +737,7 @@ template <typename Waiter>
bool doFutexWait(Waiter* waiter, Waiter*& next) {
// first we get ready to sleep by calling exchange() on the futex with a
// kSleeping value
DCHECK((*waiter)->futex_.load(std::memory_order_relaxed) == kAboutToWait);
DCHECK(waiter->futex_.load(std::memory_order_relaxed) == kAboutToWait);
// note the semantics of using a futex here, when we exchange the sleeper_
// with kSleeping, we are getting ready to sleep, but before sleeping we get
@@ -397,7 +745,8 @@ bool doFutexWait(Waiter* waiter, Waiter*& next) {
// sleeper_ might have changed. We can also wake up because of a spurious
// wakeup, so we always check against the value in sleeper_ after returning
// from futexWait(), if the value is not kWake, then we continue
auto pre = (*waiter)->sleeper_.exchange(kSleeping, std::memory_order_acq_rel);
auto pre =
waiter->metadata_.sleeper_.exchange(kSleeping, std::memory_order_acq_rel);
// Seeing a kSleeping on a futex word before we set it ourselves means only
// one thing - an unlocking thread caught us before we went to futex(), and
@@ -424,25 +773,25 @@ bool doFutexWait(Waiter* waiter, Waiter*& next) {
// Because the corresponding futexWake() above does not synchronize
// wakeups around the futex word. Because doing so would become
// inefficient
futexWait(&(*waiter)->sleeper_, kSleeping);
futexWait(&waiter->metadata_.sleeper_, kSleeping);
pre = (*waiter)->sleeper_.load(std::memory_order_acquire);
pre = waiter->metadata_.sleeper_.load(std::memory_order_acquire);
DCHECK((pre == kSleeping) || (pre == kWake));
}
// when coming out of a futex, we might have some other sleeping threads
// that we were supposed to wake up, assign that to the next pointer
DCHECK(next == nullptr);
next = extractAddress<Waiter>((*waiter)->next_);
next = extractPtr<Waiter>(waiter->next_);
return false;
}
template <typename Waiter>
bool wait(Waiter* waiter, bool shouldSleep, Waiter*& next) {
bool wait(Waiter* waiter, std::uint32_t mode, Waiter*& next, uint32_t& signal) {
if (shouldSleep) {
if (mode == kAboutToWait) {
return doFutexWait(waiter, next);
}
return spin(**waiter);
return spin(*waiter, signal, mode);
}
inline void recordTimedWaiterAndClearTimedBit(
@@ -461,26 +810,131 @@ inline void recordTimedWaiterAndClearTimedBit(
}
}
template <typename Atomic>
void wakeTimedWaiters(Atomic* state, bool timedWaiters) {
if (UNLIKELY(timedWaiters)) {
atomic_notify_one(state);
}
}
template <template <typename> class Atomic, bool TimePublishing>
template <typename Func>
auto DistributedMutex<Atomic, TimePublishing>::lock_combine(Func func) noexcept
-> folly::invoke_result_t<const Func&> {
// invoke the lock implementation function and check whether we came out of
// it with our task executed as a combined critical section. This usually
// happens when the mutex is contended.
//
// In the absence of contention, we just return from the try_lock() function
// with the lock acquired. So we need to invoke the task and unlock
// the mutex before returning
auto&& task = Request<Func>{func};
auto&& state = lockImplementation(*this, state_, task);
if (!state.combined_) {
// to avoid having to play a return-value dance when the combinable
// returns void, we use a scope exit to perform the unlock after the
// function return has been processed
SCOPE_EXIT {
unlock(std::move(state));
};
return func();
}
// if we are here, that means we were able to get our request combined, we
// can return the value that was transferred to us
//
// each thread that enqueues as a part of a contention chain takes up the
// responsibility of any timed waiter that had come immediately before it,
// so we wake up timed waiters before exiting the lock function. Another
// strategy might be to add the timed waiter information to the metadata and
// let a single leader wake up a timed waiter for better concurrency. But
// this has proven not to be useful in benchmarks beyond a small 5% delta,
// so we avoid taking the complexity hit and branch to wake up timed waiters
// from each thread
wakeTimedWaiters(&state_, state.timedWaiters_);
return std::move(task).get();
}
template <template <typename> class Atomic, bool TimePublishing>
typename DistributedMutex<Atomic, TimePublishing>::DistributedMutexStateProxy
DistributedMutex<Atomic, TimePublishing>::lock() {
auto null = nullptr;
return lockImplementation(*this, state_, null);
}
template <template <typename> class Atomic, bool TimePublishing>
template <typename Rep, typename Period, typename Func, typename ReturnType>
folly::Optional<ReturnType>
DistributedMutex<Atomic, TimePublishing>::try_lock_combine_for(
const std::chrono::duration<Rep, Period>& duration,
Func func) noexcept {
auto state = try_lock_for(duration);
if (state) {
SCOPE_EXIT {
unlock(std::move(state));
};
return func();
}
return folly::none;
}
template <template <typename> class Atomic, bool TimePublishing>
template <typename Clock, typename Duration, typename Func, typename ReturnType>
folly::Optional<ReturnType>
DistributedMutex<Atomic, TimePublishing>::try_lock_combine_until(
const std::chrono::time_point<Clock, Duration>& deadline,
Func func) noexcept {
auto state = try_lock_until(deadline);
if (state) {
SCOPE_EXIT {
unlock(std::move(state));
};
return func();
}
return folly::none;
}
template <
template <typename> class Atomic,
bool TimePublishing,
typename State,
typename Request>
typename DistributedMutex<Atomic, TimePublishing>::DistributedMutexStateProxy
lockImplementation(
DistributedMutex<Atomic, TimePublishing>& mutex,
State& atomic,
Request& request) {
// first try and acquire the lock as a fast path, the underlying
// implementation is slightly faster than using std::atomic::exchange() as
// is used in this function. So we get a small perf boost in the
// uncontended case
if (auto state = try_lock()) {
//
// We only go through this fast path for the lock/unlock usage and avoid this
// for combined critical sections. This check adds unnecessary overhead in
// that case as it causes an extra cacheline bounce
constexpr auto combineRequested = !std::is_same<Request, std::nullptr_t>{};
if (!combineRequested) {
if (auto state = mutex.try_lock()) {
return state;
}
}
auto previous = std::uintptr_t{0};
auto waitMode = kUninitialized;
auto waitMode = combineRequested ? kCombineUninitialized : kUninitialized;
auto nextWaitMode = kAboutToWait;
auto timedWaiter = false;
CachelinePadded<Waiter<Atomic>>* nextSleeper = nullptr;
Waiter<Atomic>* nextSleeper = nullptr;
while (true) {
// construct the state needed to wait
auto&& state = CachelinePadded<Waiter<Atomic>>{waitMode};
auto&& address = reinterpret_cast<std::uintptr_t>(&state);
//
// We can't use auto here because MSVC errors out due to a missing copy
// constructor
Waiter<Atomic> state{};
auto&& task = coalesce(request, state);
auto&& address = folly::bit_cast<std::uintptr_t>(&state);
state.initialize(waitMode, std::move(task));
DCHECK(!(address & 0b1));
// set the locked bit in the address we will be persisting in the mutex
@@ -496,17 +950,24 @@ DistributedMutex<Atomic, TimePublishing>::lock() {
// other threads that read the address of this value should see the full
// well-initialized node we are going to wait on if the mutex acquisition
// was unsuccessful
previous = state_.exchange(address, std::memory_order_acq_rel);
previous = atomic.exchange(address, std::memory_order_acq_rel);
recordTimedWaiterAndClearTimedBit(timedWaiter, previous);
state->next_ = previous;
state.next_ = previous;
if (previous == kUnlocked) {
return {nullptr, address, timedWaiter, {}, nullptr, nextSleeper};
return {/* next */ nullptr,
/* expected */ address,
/* timedWaiter */ timedWaiter,
/* combined */ false,
/* waker */ 0,
/* waiters */ nullptr,
/* ready */ nextSleeper};
}
DCHECK(previous & kLocked);
// wait until we get a signal from another thread, if this returns false,
// we got skipped and had probably been scheduled out, so try again
if (!wait(&state, (waitMode == kAboutToWait), nextSleeper)) {
auto signal = kUninitialized;
if (!wait(&state, waitMode, nextSleeper, signal)) {
std::swap(waitMode, nextWaitMode);
continue;
}
@@ -531,52 +992,172 @@ DistributedMutex<Atomic, TimePublishing>::lock() {
// relationship until broken
auto next = previous;
auto expected = address;
if (previous == state->wakerMetadata_.waker_) {
if (previous == state.metadata_.waker_) {
next = 0;
expected = kLocked;
}
// if we were given a combine signal, detach the return value from the
// wait struct into the request, so the current thread can access it
// outside this function
if (signal == kCombined) {
detach(request, state);
}
// if we are just coming out of a futex call, then it means that the next
// waiter we are responsible for is also a waiter waiting on a futex, so
// we return that list in the list of ready threads. We wlil be waking up
// the ready threads on unlock no matter what
return {extractAddress<CachelinePadded<Waiter<Atomic>>>(next),
expected,
timedWaiter,
state->wakerMetadata_,
extractAddress<CachelinePadded<Waiter<Atomic>>>(state->waiters_),
nextSleeper};
return {/* next */ extractPtr<Waiter<Atomic>>(next),
/* expected */ expected,
/* timedWaiter */ timedWaiter,
/* combined */ combineRequested && (signal == kCombined),
/* waker */ state.metadata_.waker_,
/* waiters */ extractPtr<Waiter<Atomic>>(state.metadata_.waiters_),
/* ready */ nextSleeper};
}
}
inline bool preempted(std::uint64_t value) {
inline bool preempted(std::uint64_t value, std::chrono::nanoseconds now) {
auto currentTime = recover(strip(time()));
auto currentTime = recover(strip(now));
auto nodeTime = recover(value);
auto preempted = currentTime > nodeTime + kScheduledAwaySpinThreshold.count();
// we say that the thread has been preempted if its timestamp says so, and
// also if it is neither uninitialized nor skipped
DCHECK(value != kSkipped);
return (preempted) && (value != kUninitialized);
return (preempted) && (value != kUninitialized) &&
(value != kCombineUninitialized);
}
inline bool isSleeper(std::uintptr_t value) {
return (value == kAboutToWait);
}
inline bool isInitialized(std::uintptr_t value) {
return (value != kUninitialized) && (value != kCombineUninitialized);
}
inline bool isCombiner(std::uintptr_t value) {
auto mode = (value & 0xff);
return (mode == kCombineWaiting) || (mode == kCombineUninitialized);
}
inline bool isWaitingCombiner(std::uintptr_t value) {
return (value & 0xff) == kCombineWaiting;
}
template <typename Waiter>
CombineFunction loadTask(Waiter* current, std::uintptr_t value) {
// if we know that the waiter is a combiner of some sort, it is safe to read
// and copy the value of the function in the waiter struct, since we know
// that a waiter would have set it before enqueueing
if (isCombiner(value)) {
return current->function_;
}
return nullptr;
}
template <template <typename> class Atomic>
std::uintptr_t tryCombine(
std::uintptr_t value,
Waiter<Atomic>* waiter,
std::uint64_t iteration,
std::chrono::nanoseconds now,
CombineFunction task) {
// it is important to load the value of next_ before checking the value of
// function_ in the next if condition. This is because of two things, the
// first being cache locality - it is helpful to read the value of the
// variable that is closer to futex_, since we just loaded from that before
// entering this function. The second is cache coherence, the wait struct
// is shared between two threads, one thread is spinning on the futex
// waiting for a signal while the other is possibly combining the requested
// critical section into its own. This means that there is a high chance
// we would cause the cachelines to bounce between the threads in the next
// if block.
//
// This leads to a degenerate case where the FunctionRef object ends up in a
// different cacheline thereby making it seem like benchmarks avoid this
// problem. When compiled differently (eg. with link time optimization)
// the wait struct ends up on the stack in a manner that causes the
// FunctionRef object to be in the same cacheline as the other data, thereby
// forcing the current thread to bounce on the cacheline twice (first to
// load the data from the other thread, that presumably owns the cacheline
// due to timestamp publishing) and then to signal the thread
//
// To avoid this sort of non-deterministic behavior based on compilation and
// stack layout, we load the value before executing the other thread's
// critical section
//
// Note that the waiting thread writes the value to the wait struct after
// enqueuing, but never writes to it after the value in the futex_ is
// initialized (showing that the thread is in the spin loop); this makes it
// safe for us to read next_ without synchronization
auto next = std::uintptr_t{0};
if (isInitialized(value)) {
next = waiter->next_;
}
// if the waiter has asked for a combine operation, we should combine its
// critical section and move on to the next waiter
//
// the waiter is combinable if the following conditions are satisfied
//
// 1) the state in the futex word is not uninitialized (kUninitialized)
// 2) it has a valid combine function
// 3) we are not past the limit of the number of combines we can perform
// or the waiter thread has been preempted. If the waiter gets preempted,
// it is better to just execute its critical section before moving on,
// as it will have to re-queue itself after preemption anyway,
// leading to further delays in critical section completion
//
// if all the above are satisfied, then we can combine the critical section.
// Note that it is only safe to read from the waiter struct if the value is
// not uninitialized. If the state is not uninitialized, we have synchronized
// with the write to the next_ member in the lock function. If the value is
// uninitialized, reading the next_ value would be a race
if (isWaitingCombiner(value) &&
(iteration <= kMaxCombineIterations || preempted(value, now))) {
task();
waiter->futex_.store(kCombined, std::memory_order_release);
return next;
}
return 0;
}
template <typename Waiter>
std::uintptr_t tryWake(
bool publishing,
Waiter* waiter,
std::uintptr_t value,
std::uintptr_t waker,
Waiter*& sleepers,
std::uint64_t iteration,
CombineFunction task) {
// try and combine the waiter's request first, if that succeeds that means
// we have successfully executed their critical section and can move on to
// the rest of the chain
auto now = time();
if (auto next = tryCombine(value, waiter, iteration, now, task)) {
return next;
}
// first we see if we can wake the current thread that is spinning
if ((!publishing || !preempted(value, now)) && !isSleeper(value)) {
// the Metadata class should be trivially destructible as we use placement
// new to set the relevant metadata without calling any destructor. We
// need to use placement new because the class contains a futex, which is
// non-movable and non-copyable
using Metadata = std::decay_t<decltype(waiter->metadata_)>;
static_assert(std::is_trivially_destructible<Metadata>{}, "");
// we need release here because of the write to waker_ and also because we
// are unlocking the mutex, the thread we do the handoff to here should
// see the modified data
new (&waiter->metadata_) Metadata{waker, bit_cast<uintptr_t>(sleepers)};
waiter->futex_.store(kWake, std::memory_order_release);
return 0;
}
...@@ -600,9 +1181,10 @@ std::uintptr_t tryWake(
// still sees the locked bit, and never gets woken up
//
// Can we relax this?
DCHECK(preempted(value, now));
DCHECK(!isCombiner(value));
auto next = waiter->next_;
waiter->futex_.store(kSkipped, std::memory_order_release);
return next;
}
...@@ -623,15 +1205,16 @@ std::uintptr_t tryWake(
// that the thread was already sleeping, we have synchronized with the write
// to next_ in the context of the sleeping thread
//
// Also we need to set the value of waiters_ and waker_ in the thread before
// doing the exchange because we need to pass on the list of sleepers in the
// event that we were able to catch the thread before it went to futex().
// If we were unable to catch the thread before it slept, these fields will
// be ignored when the thread wakes up anyway
DCHECK(isSleeper(value));
waiter->metadata_.waker_ = waker;
waiter->metadata_.waiters_ = folly::bit_cast<std::uintptr_t>(sleepers);
auto pre =
waiter->metadata_.sleeper_.exchange(kSleeping, std::memory_order_acq_rel);
// we were able to catch the thread before it went to sleep, return true
if (pre != kSleeping) {
...@@ -643,8 +1226,8 @@ std::uintptr_t tryWake(
//
// we also need to collect this sleeper in the list of sleepers being built
// up
auto next = waiter->next_;
waiter->next_ = folly::bit_cast<std::uintptr_t>(sleepers);
sleepers = waiter;
return next;
}
...@@ -653,15 +1236,24 @@ template <typename Waiter>
bool wake(
bool publishing,
Waiter& waiter,
std::uintptr_t waker,
Waiter*& sleepers,
std::uint64_t iter) {
// loop till we find a node that is either at the end of the list (as
// specified by waker) or we find a node that is active (as specified by
// the last published timestamp of the node)
auto current = &waiter;
while (current) {
// it is important that we load the value of function after the initial
// acquire load. This is required because we need to synchronize with the
// construction of the waiter struct before reading from it
auto value = current->futex_.load(std::memory_order_acquire);
auto task = loadTask(current, value);
auto next =
tryWake(publishing, current, value, waker, sleepers, iter, task);
// if there is no next node, we have managed to wake someone up and have
// successfully migrated the lock to another thread
if (!next) {
return true;
}
...@@ -670,20 +1262,12 @@ bool wake(
// it, this is because after we skip it the node might wake up and enqueue
// itself, and thereby gain a new next node
CHECK(publishing);
current = (next == waker) ? nullptr : extractPtr<Waiter>(next);
}
return false;
}
template <typename Atomic>
void wakeTimedWaiters(Atomic* state, bool timedWaiters) {
if (UNLIKELY(timedWaiters)) {
atomic_notify_one(state);
}
}
template <typename Atomic, typename Proxy, typename Sleepers>
bool tryUnlockClean(Atomic& state, Proxy& proxy, Sleepers sleepers) {
auto expected = proxy.expected_;
...@@ -717,6 +1301,7 @@ void DistributedMutex<Atomic, Publish>::unlock(
DistributedMutex::DistributedMutexStateProxy proxy) {
// we always wake up ready threads and timed waiters if we saw either
DCHECK(proxy) << "Invalid proxy passed to DistributedMutex::unlock()";
DCHECK(!proxy.combined_) << "Cannot unlock mutex after a successful combine";
SCOPE_EXIT {
doFutexWake(proxy.ready_);
wakeTimedWaiters(&state_, proxy.timedWaiters_);
...@@ -726,7 +1311,7 @@ void DistributedMutex<Atomic, Publish>::unlock(
// don't bother with the mutex state
auto sleepers = proxy.waiters_;
if (proxy.next_) {
if (wake(Publish, *proxy.next_, proxy.waker_, sleepers, 0)) {
return;
}
...@@ -741,7 +1326,7 @@ void DistributedMutex<Atomic, Publish>::unlock(
proxy.expected_ = kLocked;
}
for (std::uint64_t i = 0; true; ++i) {
// otherwise, since we don't have anyone we need to wake up, we try and
// release the mutex just as is
//
...@@ -764,13 +1349,10 @@ void DistributedMutex<Atomic, Publish>::unlock(
// terminal node of the new chain will see kLocked in the central storage
auto head = state_.exchange(kLocked, std::memory_order_acq_rel);
recordTimedWaiterAndClearTimedBit(proxy.timedWaiters_, head);
auto next = extractPtr<Waiter<Atomic>>(head);
auto expected = std::exchange(proxy.expected_, kLocked);
DCHECK((head & kLocked) && (head != kLocked)) << "incorrect state " << head;
if (wake(Publish, *next, expected, sleepers, i)) {
break;
}
}
...
/*
* Copyright 2004-present Facebook, Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include <folly/synchronization/DistributedMutex.h>
namespace folly {
namespace detail {
namespace distributed_mutex {
template class DistributedMutex<std::atomic, true>;
} // namespace distributed_mutex
} // namespace detail
} // namespace folly
...@@ -15,6 +15,9 @@
*/
#pragma once
#include <folly/Optional.h>
#include <folly/functional/Invoke.h>
#include <atomic>
#include <chrono>
#include <cstdint>
...@@ -26,32 +29,39 @@ namespace distributed_mutex {
/**
* DistributedMutex is a small, exclusive-only mutex that distributes the
* bookkeeping required for mutual exclusion in the stacks of threads that are
* contending for it. It has a mode that can combine critical sections when
* the mutex experiences contention; this allows the implementation to elide
* several expensive coherence and synchronization operations to boost
* throughput, surpassing even some atomic CAS instructions in some cases. It
* has no dependencies on heap allocation and tries to come at a lower space
* cost than std::mutex while still trying to maintain the fairness benefits
* that come from using std::mutex. DistributedMutex provides the entire API
* included in std::mutex, and more, with slight modifications. It is the
* same width as a single pointer (8 bytes on most platforms), whereas
* std::mutex and pthread_mutex_t are both 40 bytes. It is larger than some
* of the other smaller locks, but the wide majority of cases using the small
* locks are wasting the difference in alignment padding anyway
*
* Benchmark results are good - at the time of writing, in the common
* uncontended case, it is a few cycles faster than folly::MicroLock but a bit
* slower than std::mutex. In the contended case, for lock/unlock based
* critical sections, it is about 4-5x faster than some of the smaller locks
* and about ~2x faster than std::mutex. When used in combinable mode, it can
* go more than 10x faster than the small locks, about 6x faster than
* std::mutex and up to 2-3x faster than the implementations of flat combining
* we benchmarked against. DistributedMutex is also resistant to tail latency
* pathologies, unlike many of the other mutexes in use, which sleep for large
* time quantums to reduce spin churn; this causes elevated latencies for
* threads that enter the sleep cycle. The tail latency of lock acquisition
* can go up to 10x lower because of a more deterministic scheduling algorithm
* that is managed almost entirely in userspace
*
* DistributedMutex reduces cache line contention in userspace and in the
* kernel by making each thread wait on a thread local spinlock and futex.
* This allows threads to keep working only on their own cache lines without
* requiring cache coherence operations when a mutex sees heavy contention.
* This strategy does not require sequential ordering on the centralized
* atomic storage for wakeup operations, as each thread is assigned its own
* wait state
*
* Non-timed mutex acquisitions are scheduled through intrusive LIFO
* contention chains. Each thread starts by spinning for a short quantum and
...@@ -88,6 +98,23 @@ namespace distributed_mutex {
* own, thinking a mutex is functionally identical to a binary semaphore,
* which, unlike a mutex, is a suitable primitive for that usage
*
* Combined critical sections allow the implementation to elide several
* expensive operations during the lifetime of a critical section that cause
* slowdowns with regular lock/unlock based usage. DistributedMutex resolves
* contention through combining up to a constant factor of 2 contention chains
* to prevent issues with fairness and latency outliers, so we retain the
* fairness benefits of the lock/unlock implementation with no noticeable
* regression when switching between the lock methods. Despite the efficiency
* benefits, combined critical sections can only be used when the critical
* section does not depend on thread local state and does not introduce new
* dependencies between threads when the critical section gets combined. For
* example, locking or unlocking an unrelated mutex in a combined critical
* section might lead to unexpected results or even undefined behavior. This
* can happen if, for example, a different thread unlocks a mutex locked by
* the calling thread, leading to undefined behavior as the mutex might not
* allow locking and unlocking from unrelated threads (the posix and C++
* standard disallow this usage for their mutexes)
*
* Timed locking through DistributedMutex is implemented through a centralized
* algorithm - all waiters wait on the central mutex state, by setting and
* resetting bits within the pointer-length word. Since pointer length atomic
...@@ -121,8 +148,15 @@ class DistributedMutex {
*
* The proxy has no public API and is intended to be for internal usage only
*
* There are three notable cases where undefined behavior might come up:
* - This is not a recursive mutex. Trying to acquire the mutex twice from
* the same thread without unlocking it results in undefined behavior
* - Thread, coroutine or fiber migrations are disallowed. This is because
* the implementation requires owning the stack frame through the
* execution of the critical section for both lock/unlock or combined
* critical sections. This means that you cannot allow another thread,
* fiber or coroutine to unlock the mutex
* - This mutex cannot be used in a program compiled with segmented stacks
*/
DistributedMutexStateProxy lock();
...@@ -132,6 +166,9 @@ class DistributedMutex {
* The proxy returned by lock must be passed to unlock as an rvalue. No
* other option is possible here, since the proxy is only movable and not
* copyable
*
* It is undefined behavior to unlock from a thread that did not lock the
* mutex
*/
void unlock(DistributedMutexStateProxy);
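A minimal usage sketch of the proxy based lock()/unlock() pair (illustrative only; `mutex` is assumed to be a DistributedMutex instance, and the proxy must stay on the thread that called lock()):
```
auto proxy = mutex.lock();
// ... critical section runs on the same thread that called lock() ...
mutex.unlock(std::move(proxy));
```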
...@@ -173,6 +210,102 @@ class DistributedMutex {
DistributedMutexStateProxy try_lock_until(
const std::chrono::time_point<Clock, Duration>& deadline);
/**
* Execute a task as a combined critical section
*
* Unlike traditional lock and unlock methods, lock_combine() enqueues the
* passed task for execution on any arbitrary thread. This allows the
* implementation to prevent cache line invalidations originating from
* expensive synchronization operations. The thread holding the lock is
* allowed to execute the task before unlocking, thereby forming a "combined
* critical section".
*
* This idea is inspired by Flat Combining. Flat Combining was introduced
* in the SPAA 2010 paper titled "Flat Combining and the
* Synchronization-Parallelism Tradeoff", by Danny Hendler, Itai Incze, Nir
* Shavit, and Moran Tzafrir -
* https://www.cs.bgu.ac.il/~hendlerd/papers/flat-combining.pdf. The
* implementation used here is significantly different from that described
* in the paper. The high-level goal of reducing the overhead of
* synchronization, however, is the same.
*
* Combined critical sections work best when kept simple. Since the
* critical section might be executed on any arbitrary thread, relying on
* things like thread local state or mutex locking and unlocking might cause
* incorrectness. Associativity is important. For example
*
* auto one = std::unique_lock{one_};
* two_.lock_combine([&]() {
* if (bar()) {
* one.unlock();
* }
* });
*
* This has the potential to cause undefined behavior because mutexes are
* only meant to be acquired and released from the owning thread. Similar
* errors can arise from a combined critical section introducing implicit
* dependencies based on the state of the combining thread. For example
*
* // thread 1
* auto one = std::unique_lock{one_};
* auto two = std::unique_lock{two_};
*
* // thread 2
* two_.lock_combine([&]() {
* auto three = std::unique_lock{three_};
* });
*
* Here, because we used a combined critical section, we have introduced a
* dependency from one -> three that might not be obvious to the reader
*
* There are three notable cases where undefined behavior might come up:
* - This is not a recursive mutex. Trying to acquire the mutex twice from
* the same thread without unlocking it results in undefined behavior
* - Thread, coroutine or fiber migrations are disallowed. This is because
* the implementation requires the locking entity to own the stack frame
* through the execution of the critical section for both lock/unlock or
* combined critical sections. This means that you cannot allow another
* thread, fiber or coroutine to unlock the mutex
* - This mutex cannot be used in a program compiled with segmented stacks
*/
template <typename Task>
auto lock_combine(Task task) noexcept -> folly::invoke_result_t<const Task&>;
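A small sketch of the value-returning form described above (illustrative only; the map and mutex here are assumptions, not part of the header):
```
auto guarded = std::map<int, int>{};
auto mutex = folly::DistributedMutex{};

// the value returned by the critical section is forwarded out of lock_combine
auto value = mutex.lock_combine([&]() {
  auto it = guarded.find(3);
  return (it == guarded.end()) ? 0 : it->second;
});
```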
/**
* Try to combine a task as a combined critical section until the given time
*
* Like the other try_lock() methods, this is allowed to fail spuriously,
* and is not guaranteed to return true even when the mutex is currently
* unlocked.
*
* Note that this does not necessarily have the same performance
* characteristics as the non-timed version of the combine method. If
* performance is critical, use that one instead
*/
template <
typename Rep,
typename Period,
typename Task,
typename ReturnType = decltype(std::declval<Task&>()())>
folly::Optional<ReturnType> try_lock_combine_for(
const std::chrono::duration<Rep, Period>& duration,
Task task) noexcept;
/**
* Try to combine a task as a combined critical section until the given time
*
* Other than the difference in the meaning of the second argument, the
* semantics of this function are identical to try_lock_combine_for()
*/
template <
typename Clock,
typename Duration,
typename Task,
typename ReturnType = decltype(std::declval<Task&>()())>
folly::Optional<ReturnType> try_lock_combine_until(
const std::chrono::time_point<Clock, Duration>& deadline,
Task task) noexcept;
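A minimal usage sketch for the timed variants (illustrative only; the 10ms deadline and the counter are assumptions). The returned folly::Optional is empty when the acquisition times out and otherwise holds the critical section's return value:
```
auto counter = std::uint64_t{0};
auto mutex = folly::DistributedMutex{};

auto result = mutex.try_lock_combine_for(std::chrono::milliseconds{10}, [&]() {
  return ++counter;
});
if (result) {
  // the critical section ran; *result holds the incremented value
}
```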
private:
Atomic<std::uintptr_t> state_{0};
};
...@@ -184,6 +317,7 @@ class DistributedMutex {
* Bring the default instantiation of DistributedMutex into the folly
* namespace without requiring any template arguments for public usage
*/
extern template class detail::distributed_mutex::DistributedMutex<>;
using DistributedMutex = detail::distributed_mutex::DistributedMutex<>;
} // namespace folly
...
...@@ -16,12 +16,14 @@
#include <folly/synchronization/DistributedMutex.h>
#include <folly/MapUtil.h>
#include <folly/Synchronized.h>
#include <folly/container/Array.h>
#include <folly/container/Foreach.h>
#include <folly/portability/GTest.h>
#include <folly/synchronization/Baton.h>
#include <folly/test/DeterministicSchedule.h>
#include <chrono>
#include <cmath>
#include <thread>
using namespace std::literals;
...@@ -186,6 +188,7 @@ void atomic_notify_one(const ManualAtomic<std::uintptr_t>*) {
namespace {
DEFINE_int32(stress_factor, 1000, "The stress test factor for tests");
DEFINE_int32(stress_test_seconds, 2, "Duration for stress tests");
constexpr auto kForever = 100h;
using DSched = test::DeterministicSchedule;
...@@ -206,6 +209,7 @@ void basicNThreads(int numThreads, int iterations = FLAGS_stress_factor) {
for (auto j = 0; j < iterations; ++j) {
auto lck = std::unique_lock<std::decay_t<decltype(mutex)>>{mutex};
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
result.push_back(id);
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
}
...@@ -225,14 +229,328 @@ void basicNThreads(int numThreads, int iterations = FLAGS_stress_factor) {
}
EXPECT_EQ(total, sum(numThreads) * iterations);
}
template <template <typename> class Atom = std::atomic>
void lockWithTryAndTimedNThreads(
int numThreads,
std::chrono::seconds duration) {
auto&& mutex = detail::distributed_mutex::DistributedMutex<Atom>{};
auto&& barrier = std::atomic<int>{0};
auto&& threads = std::vector<std::thread>{};
auto&& stop = std::atomic<bool>{false};
auto&& lockUnlockFunction = [&]() {
while (!stop.load()) {
auto lck = std::unique_lock<std::decay_t<decltype(mutex)>>{mutex};
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
}
};
auto tryLockFunction = [&]() {
while (!stop.load()) {
using Mutex = std::decay_t<decltype(mutex)>;
auto lck = std::unique_lock<Mutex>{mutex, std::defer_lock};
if (lck.try_lock()) {
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
}
}
};
auto timedLockFunction = [&]() {
while (!stop.load()) {
using Mutex = std::decay_t<decltype(mutex)>;
auto lck = std::unique_lock<Mutex>{mutex, std::defer_lock};
if (lck.try_lock_for(kForever)) {
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
}
}
};
for (auto i = 0; i < (numThreads / 3); ++i) {
threads.push_back(DSched::thread(lockUnlockFunction));
}
for (auto i = 0; i < (numThreads / 3); ++i) {
threads.push_back(DSched::thread(tryLockFunction));
}
for (auto i = 0; i < (numThreads / 3); ++i) {
threads.push_back(DSched::thread(timedLockFunction));
}
/* sleep override */
std::this_thread::sleep_for(duration);
stop.store(true);
for (auto& thread : threads) {
DSched::join(thread);
}
}
template <template <typename> class Atom = std::atomic>
void combineNThreads(int numThreads, std::chrono::seconds duration) {
auto&& mutex = detail::distributed_mutex::DistributedMutex<Atom>{};
auto&& barrier = std::atomic<int>{0};
auto&& threads = std::vector<std::thread>{};
auto&& stop = std::atomic<bool>{false};
auto&& function = [&]() {
return [&] {
auto&& expected = std::uint64_t{0};
auto&& local = std::atomic<std::uint64_t>{0};
auto&& result = std::atomic<std::uint64_t>{0};
while (!stop.load()) {
++expected;
auto current = mutex.lock_combine([&]() {
result.fetch_add(1);
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
return local.fetch_add(1);
});
EXPECT_EQ(current, expected - 1);
}
EXPECT_EQ(expected, result.load());
};
};
for (auto i = 1; i <= numThreads; ++i) {
threads.push_back(DSched::thread(function()));
}
/* sleep override */
std::this_thread::sleep_for(duration);
stop.store(true);
for (auto& thread : threads) {
DSched::join(thread);
}
}
template <template <typename> class Atom = std::atomic>
void combineWithLockNThreads(int numThreads, std::chrono::seconds duration) {
auto&& mutex = detail::distributed_mutex::DistributedMutex<Atom>{};
auto&& barrier = std::atomic<int>{0};
auto&& threads = std::vector<std::thread>{};
auto&& stop = std::atomic<bool>{false};
auto&& lockUnlockFunction = [&]() {
while (!stop.load()) {
auto lck = std::unique_lock<std::decay_t<decltype(mutex)>>{mutex};
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
}
};
auto&& combineFunction = [&]() {
auto&& expected = std::uint64_t{0};
auto&& total = std::atomic<std::uint64_t>{0};
while (!stop.load()) {
++expected;
auto current = mutex.lock_combine([&]() {
auto iteration = total.fetch_add(1);
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
return iteration;
});
EXPECT_EQ(expected, current + 1);
}
EXPECT_EQ(expected, total.load());
};
for (auto i = 1; i < (numThreads / 2); ++i) {
threads.push_back(DSched::thread(combineFunction));
}
for (auto i = 0; i < (numThreads / 2); ++i) {
threads.push_back(DSched::thread(lockUnlockFunction));
}
/* sleep override */
std::this_thread::sleep_for(duration);
stop.store(true);
for (auto& thread : threads) {
DSched::join(thread);
}
}
template <template <typename> class Atom = std::atomic>
void combineWithTryLockNThreads(int numThreads, std::chrono::seconds duration) {
auto&& mutex = detail::distributed_mutex::DistributedMutex<Atom>{};
auto&& barrier = std::atomic<int>{0};
auto&& threads = std::vector<std::thread>{};
auto&& stop = std::atomic<bool>{false};
auto&& lockUnlockFunction = [&]() {
while (!stop.load()) {
auto lck = std::unique_lock<std::decay_t<decltype(mutex)>>{mutex};
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
}
};
auto&& combineFunction = [&]() {
auto&& expected = std::uint64_t{0};
auto&& total = std::atomic<std::uint64_t>{0};
while (!stop.load()) {
++expected;
auto current = mutex.lock_combine([&]() {
auto iteration = total.fetch_add(1);
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
return iteration;
});
EXPECT_EQ(expected, current + 1);
}
EXPECT_EQ(expected, total.load());
};
auto tryLockFunction = [&]() {
while (!stop.load()) {
using Mutex = std::decay_t<decltype(mutex)>;
auto lck = std::unique_lock<Mutex>{mutex, std::defer_lock};
if (lck.try_lock()) {
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
}
}
};
for (auto i = 0; i < (numThreads / 3); ++i) {
threads.push_back(DSched::thread(lockUnlockFunction));
}
for (auto i = 0; i < (numThreads / 3); ++i) {
threads.push_back(DSched::thread(combineFunction));
}
for (auto i = 0; i < (numThreads / 3); ++i) {
threads.push_back(DSched::thread(tryLockFunction));
}
/* sleep override */
std::this_thread::sleep_for(duration);
stop.store(true);
for (auto& thread : threads) {
DSched::join(thread);
}
}
template <template <typename> class Atom = std::atomic>
void combineWithLockTryAndTimedNThreads(
int numThreads,
std::chrono::seconds duration) {
auto&& mutex = detail::distributed_mutex::DistributedMutex<Atom>{};
auto&& barrier = std::atomic<int>{0};
auto&& threads = std::vector<std::thread>{};
auto&& stop = std::atomic<bool>{false};
auto&& lockUnlockFunction = [&]() {
while (!stop.load()) {
auto lck = std::unique_lock<std::decay_t<decltype(mutex)>>{mutex};
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
}
};
auto&& combineFunction = [&]() {
auto&& expected = std::uint64_t{0};
auto&& total = std::atomic<std::uint64_t>{0};
while (!stop.load()) {
++expected;
auto current = mutex.lock_combine([&]() {
auto iteration = total.fetch_add(1);
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
// return a non-trivially-copyable object that occupies all the
// storage we use to coalesce returns to test that codepath
return folly::make_array(
iteration,
iteration + 1,
iteration + 2,
iteration + 3,
iteration + 4,
iteration + 5);
});
EXPECT_EQ(expected, current[0] + 1);
EXPECT_EQ(expected, current[1]);
EXPECT_EQ(expected, current[2] - 1);
EXPECT_EQ(expected, current[3] - 2);
EXPECT_EQ(expected, current[4] - 3);
EXPECT_EQ(expected, current[5] - 4);
}
EXPECT_EQ(expected, total.load());
};
auto tryLockFunction = [&]() {
while (!stop.load()) {
using Mutex = std::decay_t<decltype(mutex)>;
auto lck = std::unique_lock<Mutex>{mutex, std::defer_lock};
if (lck.try_lock()) {
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
}
}
};
auto timedLockFunction = [&]() {
while (!stop.load()) {
using Mutex = std::decay_t<decltype(mutex)>;
auto lck = std::unique_lock<Mutex>{mutex, std::defer_lock};
if (lck.try_lock_for(kForever)) {
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
}
}
};
for (auto i = 0; i < (numThreads / 4); ++i) {
threads.push_back(DSched::thread(lockUnlockFunction));
}
for (auto i = 0; i < (numThreads / 4); ++i) {
threads.push_back(DSched::thread(combineFunction));
}
for (auto i = 0; i < (numThreads / 4); ++i) {
threads.push_back(DSched::thread(tryLockFunction));
}
for (auto i = 0; i < (numThreads / 4); ++i) {
threads.push_back(DSched::thread(timedLockFunction));
}
/* sleep override */
std::this_thread::sleep_for(duration);
stop.store(true);
for (auto& thread : threads) {
DSched::join(thread);
}
}
} // namespace
TEST(DistributedMutex, InternalDetailTestOne) {
auto value = 0;
auto ptr = reinterpret_cast<std::uintptr_t>(&value);
EXPECT_EQ(detail::distributed_mutex::extractPtr<int>(ptr), &value);
ptr = ptr | 0b1;
EXPECT_EQ(detail::distributed_mutex::extractPtr<int>(ptr), &value);
}
TEST(DistributedMutex, Basic) {
...@@ -434,6 +752,159 @@ TEST(DistributedMutex, StressHardwareConcurrencyThreads) {
basicNThreads(std::thread::hardware_concurrency());
}
TEST(DistributedMutex, StressThreeThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreads(
3, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreads(
6, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressTwelveThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreads(
12, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressTwentyFourThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreads(
24, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressFourtyEightThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreads(
48, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixtyFourThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreads(
64, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressHwConcThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreads(
std::thread::hardware_concurrency(),
std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressTwoThreadsCombine) {
combineNThreads(2, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressThreeThreadsCombine) {
combineNThreads(3, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressFourThreadsCombine) {
combineNThreads(4, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressFiveThreadsCombine) {
combineNThreads(5, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixThreadsCombine) {
combineNThreads(6, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSevenThreadsCombine) {
combineNThreads(7, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressEightThreadsCombine) {
combineNThreads(8, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixteenThreadsCombine) {
combineNThreads(16, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressThirtyTwoThreadsCombine) {
combineNThreads(32, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixtyFourThreadsCombine) {
combineNThreads(64, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressHundredThreadsCombine) {
combineNThreads(100, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressHardwareConcurrencyThreadsCombine) {
combineNThreads(
std::thread::hardware_concurrency(),
std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressTwoThreadsCombineAndLock) {
combineWithLockNThreads(2, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressFourThreadsCombineAndLock) {
combineWithLockNThreads(4, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressEightThreadsCombineAndLock) {
combineWithLockNThreads(8, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixteenThreadsCombineAndLock) {
combineWithLockNThreads(16, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressThirtyTwoThreadsCombineAndLock) {
combineWithLockNThreads(32, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixtyFourThreadsCombineAndLock) {
combineWithLockNThreads(64, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressHardwareConcurrencyThreadsCombineAndLock) {
combineWithLockNThreads(
std::thread::hardware_concurrency(),
std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressThreeThreadsCombineTryLockAndLock) {
combineWithTryLockNThreads(
3, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixThreadsCombineTryLockAndLock) {
combineWithTryLockNThreads(
6, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressTwelveThreadsCombineTryLockAndLock) {
combineWithTryLockNThreads(
12, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressTwentyFourThreadsCombineTryLockAndLock) {
combineWithTryLockNThreads(
24, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressFourtyEightThreadsCombineTryLockAndLock) {
combineWithTryLockNThreads(
48, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixtyFourThreadsCombineTryLockAndLock) {
combineWithTryLockNThreads(
64, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressHardwareConcurrencyThreadsCombineTryLockAndLock) {
combineWithTryLockNThreads(
std::thread::hardware_concurrency(),
std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressThreeThreadsCombineTryLockLockAndTimed) {
combineWithLockTryAndTimedNThreads(
3, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixThreadsCombineTryLockLockAndTimed) {
combineWithLockTryAndTimedNThreads(
6, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressTwelveThreadsCombineTryLockLockAndTimed) {
combineWithLockTryAndTimedNThreads(
12, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressTwentyFourThreadsCombineTryLockLockAndTimed) {
combineWithLockTryAndTimedNThreads(
24, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressFourtyEightThreadsCombineTryLockLockAndTimed) {
combineWithLockTryAndTimedNThreads(
48, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixtyFourThreadsCombineTryLockLockAndTimed) {
combineWithLockTryAndTimedNThreads(
64, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressHwConcurrencyThreadsCombineTryLockLockAndTimed) {
combineWithLockTryAndTimedNThreads(
std::thread::hardware_concurrency(),
std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressTryLock) {
auto&& mutex = DistributedMutex{};
...@@ -464,6 +935,73 @@ void runBasicNThreadsDeterministic(int threads, int iterations) {
static_cast<void>(schedule);
}
}
void combineNThreadsDeterministic(int threads, std::chrono::seconds t) {
const auto kNumPasses = 3.0;
const auto seconds = std::ceil(static_cast<double>(t.count()) / kNumPasses);
const auto time = std::chrono::seconds{static_cast<std::uint64_t>(seconds)};
for (auto pass = 0; pass < kNumPasses; ++pass) {
auto&& schedule = DSched{DSched::uniform(pass)};
combineNThreads<test::DeterministicAtomic>(threads, time);
static_cast<void>(schedule);
}
}
void combineAndLockNThreadsDeterministic(int threads, std::chrono::seconds t) {
const auto kNumPasses = 3.0;
const auto seconds = std::ceil(static_cast<double>(t.count()) / kNumPasses);
const auto time = std::chrono::seconds{static_cast<std::uint64_t>(seconds)};
for (auto pass = 0; pass < kNumPasses; ++pass) {
auto&& schedule = DSched{DSched::uniform(pass)};
combineWithLockNThreads<test::DeterministicAtomic>(threads, time);
static_cast<void>(schedule);
}
}
void combineTryLockAndLockNThreadsDeterministic(
int threads,
std::chrono::seconds t) {
const auto kNumPasses = 3.0;
const auto seconds = std::ceil(static_cast<double>(t.count()) / kNumPasses);
const auto time = std::chrono::seconds{static_cast<std::uint64_t>(seconds)};
for (auto pass = 0; pass < kNumPasses; ++pass) {
auto&& schedule = DSched{DSched::uniform(pass)};
combineWithTryLockNThreads<test::DeterministicAtomic>(threads, time);
static_cast<void>(schedule);
}
}
void lockWithTryAndTimedNThreadsDeterministic(
int threads,
std::chrono::seconds t) {
const auto kNumPasses = 3.0;
const auto seconds = std::ceil(static_cast<double>(t.count()) / kNumPasses);
const auto time = std::chrono::seconds{static_cast<std::uint64_t>(seconds)};
for (auto pass = 0; pass < kNumPasses; ++pass) {
auto&& schedule = DSched{DSched::uniform(pass)};
lockWithTryAndTimedNThreads<test::DeterministicAtomic>(threads, time);
static_cast<void>(schedule);
}
}
void combineWithTryLockAndTimedNThreadsDeterministic(
int threads,
std::chrono::seconds t) {
const auto kNumPasses = 3.0;
const auto seconds = std::ceil(static_cast<double>(t.count()) / kNumPasses);
const auto time = std::chrono::seconds{static_cast<std::uint64_t>(seconds)};
for (auto pass = 0; pass < kNumPasses; ++pass) {
auto&& schedule = DSched{DSched::uniform(pass)};
combineWithLockTryAndTimedNThreads<test::DeterministicAtomic>(
threads, time);
static_cast<void>(schedule);
}
}
} // namespace
TEST(DistributedMutex, DeterministicStressTwoThreads) {
...@@ -482,6 +1020,156 @@ TEST(DistributedMutex, DeterministicStressThirtyTwoThreads) {
runBasicNThreadsDeterministic(32, numIterationsDeterministicTest(32));
}
TEST(DistributedMutex, DeterministicStressThreeThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreadsDeterministic(
3, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, DeterministicStressSixThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreadsDeterministic(
6, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, DeterministicStressTwelveThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreadsDeterministic(
12, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, DeterministicStressTwentyFourThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreadsDeterministic(
24, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, DeterministicStressFourtyEightThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreadsDeterministic(
48, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, DeterministicStressSixtyFourThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreadsDeterministic(
64, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, DeterministicStressHwConcThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreadsDeterministic(
std::thread::hardware_concurrency(),
std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineDeterministicStressTwoThreads) {
combineNThreadsDeterministic(
2, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineDeterministicStressFourThreads) {
combineNThreadsDeterministic(
4, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineDeterministicStressEightThreads) {
combineNThreadsDeterministic(
8, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineDeterministicStressSixteenThreads) {
combineNThreadsDeterministic(
16, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineDeterministicStressThirtyTwoThreads) {
combineNThreadsDeterministic(
32, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineDeterministicStressSixtyFourThreads) {
combineNThreadsDeterministic(
64, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineDeterministicStressHardwareConcurrencyThreads) {
combineNThreadsDeterministic(
std::thread::hardware_concurrency(),
std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineAndLockDeterministicStressTwoThreads) {
combineAndLockNThreadsDeterministic(
2, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineAndLockDeterministicStressFourThreads) {
combineAndLockNThreadsDeterministic(
4, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineAndLockDeterministicStressEightThreads) {
combineAndLockNThreadsDeterministic(
8, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineAndLockDeterministicStressSixteenThreads) {
combineAndLockNThreadsDeterministic(
16, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineAndLockDeterministicStressThirtyTwoThreads) {
combineAndLockNThreadsDeterministic(
32, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineAndLockDeterministicStressSixtyFourThreads) {
combineAndLockNThreadsDeterministic(
64, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineAndLockDeterministicStressHWConcurrencyThreads) {
combineAndLockNThreadsDeterministic(
std::thread::hardware_concurrency(),
std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndLockDeterministicStressThreeThreads) {
combineTryLockAndLockNThreadsDeterministic(
3, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndLockDeterministicStressSixThreads) {
combineTryLockAndLockNThreadsDeterministic(
6, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndLockDeterministicStressTwelveThreads) {
combineTryLockAndLockNThreadsDeterministic(
12, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndLockDeterministicStressTwentyThreads) {
combineTryLockAndLockNThreadsDeterministic(
24, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndLockDeterministicStressFortyThreads) {
combineTryLockAndLockNThreadsDeterministic(
48, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndLockDeterministicStressSixtyThreads) {
combineTryLockAndLockNThreadsDeterministic(
64, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndLockDeterministicStressHWConcThreads) {
combineTryLockAndLockNThreadsDeterministic(
std::thread::hardware_concurrency(),
std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndTimedDeterministicStressThreeThreads) {
combineWithTryLockAndTimedNThreadsDeterministic(
3, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndTimedDeterministicStressSixThreads) {
combineWithTryLockAndTimedNThreadsDeterministic(
6, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndTimedDeterministicStressTwelveThreads) {
combineWithTryLockAndTimedNThreadsDeterministic(
12, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndTimedDeterministicStressTwentyThreads) {
combineWithTryLockAndTimedNThreadsDeterministic(
24, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndTimedDeterministicStressFortyThreads) {
combineWithTryLockAndTimedNThreadsDeterministic(
48, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndTimedDeterministicStressSixtyThreads) {
combineWithTryLockAndTimedNThreadsDeterministic(
64, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndTimedDeterministicStressHWConcThreads) {
combineWithTryLockAndTimedNThreadsDeterministic(
std::thread::hardware_concurrency(),
std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, TimedLockTimeout) {
auto&& mutex = DistributedMutex{};
auto&& start = folly::Baton<>{};
...@@ -833,4 +1521,131 @@ TEST(DistributedMutex, DeterministicTryLockSixtyFourThreads) {
}
}
namespace {
class TestConstruction {
public:
TestConstruction() = delete;
explicit TestConstruction(int) {
defaultConstructs().fetch_add(1, std::memory_order_relaxed);
}
TestConstruction(TestConstruction&&) noexcept {
moveConstructs().fetch_add(1, std::memory_order_relaxed);
}
TestConstruction(const TestConstruction&) {
copyConstructs().fetch_add(1, std::memory_order_relaxed);
}
TestConstruction& operator=(const TestConstruction&) {
copyAssigns().fetch_add(1, std::memory_order_relaxed);
return *this;
}
TestConstruction& operator=(TestConstruction&&) {
moveAssigns().fetch_add(1, std::memory_order_relaxed);
return *this;
}
~TestConstruction() {
destructs().fetch_add(1, std::memory_order_relaxed);
}
static std::atomic<std::uint64_t>& defaultConstructs() {
static auto&& atomic = std::atomic<std::uint64_t>{0};
return atomic;
}
static std::atomic<std::uint64_t>& moveConstructs() {
static auto&& atomic = std::atomic<std::uint64_t>{0};
return atomic;
}
static std::atomic<std::uint64_t>& copyConstructs() {
static auto&& atomic = std::atomic<std::uint64_t>{0};
return atomic;
}
static std::atomic<std::uint64_t>& moveAssigns() {
static auto&& atomic = std::atomic<std::uint64_t>{0};
return atomic;
}
static std::atomic<std::uint64_t>& copyAssigns() {
static auto&& atomic = std::atomic<std::uint64_t>{0};
return atomic;
}
static std::atomic<std::uint64_t>& destructs() {
static auto&& atomic = std::atomic<std::uint64_t>{0};
return atomic;
}
static void reset() {
defaultConstructs().store(0);
moveConstructs().store(0);
copyConstructs().store(0);
copyAssigns().store(0);
destructs().store(0);
}
};
} // namespace
TEST(DistributedMutex, TestAppropriateDestructionAndConstructionWithCombine) {
auto&& mutex = folly::DistributedMutex{};
auto&& stop = std::atomic<bool>{false};
// test the simple return path to make sure that in the absence of
// contention, we get the right number of constructs and destructs
mutex.lock_combine([]() { return TestConstruction{1}; });
auto moves = TestConstruction::moveConstructs().load();
auto defaults = TestConstruction::defaultConstructs().load();
EXPECT_EQ(TestConstruction::defaultConstructs().load(), 1);
EXPECT_TRUE(moves == 0 || moves == 1);
EXPECT_EQ(TestConstruction::destructs().load(), moves + defaults);
// loop and make sure we were able to test the path where the critical
// section of the thread gets combined, and assert that we see the expected
// number of constructions and destructions
//
// this implements a timed backoff to test the combined path, so we use the
// smallest possible delay in tests
auto thread = std::thread{[&]() {
auto&& duration = std::chrono::milliseconds{10};
while (!stop.load()) {
TestConstruction::reset();
auto&& ready = folly::Baton<>{};
auto&& release = folly::Baton<>{};
// make one thread start its critical section, then signal and wait for
// another thread to enqueue, to test the combined path
auto innerThread = std::thread{[&]() {
mutex.lock_combine([&]() {
ready.post();
release.wait();
/* sleep override */
std::this_thread::sleep_for(duration);
});
}};
// wait for the thread to get in its critical section, then tell it to go
ready.wait();
release.post();
mutex.lock_combine([&]() { return TestConstruction{1}; });
innerThread.join();
// at this point we should have only one default construct, either 3
// or 4 move constructs, and the same number of destructions as
// constructions
auto innerDefaults = TestConstruction::defaultConstructs().load();
auto innerMoves = TestConstruction::moveConstructs().load();
auto destructs = TestConstruction::destructs().load();
EXPECT_EQ(innerDefaults, 1);
EXPECT_TRUE(innerMoves == 3 || innerMoves == 4 || innerMoves == 1);
EXPECT_EQ(destructs, innerMoves + innerDefaults);
EXPECT_EQ(TestConstruction::moveAssigns().load(), 0);
EXPECT_EQ(TestConstruction::copyAssigns().load(), 0);
// increase duration by 100ms each iteration
duration = duration + 100ms;
}
}};
/* sleep override */
std::this_thread::sleep_for(std::chrono::seconds{FLAGS_stress_test_seconds});
stop.store(true);
thread.join();
}
} // namespace folly
...@@ -15,6 +15,7 @@
*/
#include <algorithm>
#include <array>
#include <cmath>
#include <condition_variable>
#include <iostream>
...@@ -25,9 +26,12 @@
#include <google/base/spinlock.h>
#include <folly/Benchmark.h>
#include <folly/CachelinePadded.h>
#include <folly/SharedMutex.h>
#include <folly/experimental/flat_combining/FlatCombining.h>
#include <folly/synchronization/DistributedMutex.h>
#include <folly/synchronization/SmallLocks.h>
#include <folly/synchronization/Utility.h>
/* "Work cycle" is just an additional nop loop iteration. /* "Work cycle" is just an additional nop loop iteration.
* A smaller number of work cyles will result in more contention, * A smaller number of work cyles will result in more contention,
...@@ -50,13 +54,6 @@ static void burn(size_t n) { ...@@ -50,13 +54,6 @@ static void burn(size_t n) {
} }
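// A rough sketch of what burn() above does, inferred from the comment: spin
// through n "work cycle" iterations of essentially no-op work (the real body
// is collapsed in this diff, so treat this as an approximation only):
//
//   static void burn(size_t n) {
//     for (size_t i = 0; i < n; ++i) {
//       folly::doNotOptimizeAway(i);
//     }
//   }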
namespace {
struct SimpleBarrier {
explicit SimpleBarrier(int count) : count_(count) {}
void wait() {
...@@ -105,8 +102,135 @@ class GoogleSpinLockAdapter {
SpinLock lock_;
};
class DistributedMutexFlatCombining {
public:
folly::DistributedMutex mutex_;
};
class NoLock {
public:
void lock() {}
void unlock() {}
};
class FlatCombiningMutexNoCaching
: public folly::FlatCombining<FlatCombiningMutexNoCaching> {
public:
using Super = folly::FlatCombining<FlatCombiningMutexNoCaching>;
template <typename CriticalSection>
auto lock_combine(CriticalSection func, std::size_t) {
auto record = this->allocRec();
auto value = folly::invoke_result_t<CriticalSection&>{};
this->requestFC([&]() { value = func(); }, record);
this->freeRec(record);
return value;
}
};
class FlatCombiningMutexCaching
: public folly::FlatCombining<FlatCombiningMutexCaching> {
public:
using Super = folly::FlatCombining<FlatCombiningMutexCaching>;
FlatCombiningMutexCaching() {
for (auto i = 0; i < 256; ++i) {
this->records_.push_back(this->allocRec());
}
}
template <typename CriticalSection>
auto lock_combine(CriticalSection func, std::size_t index) {
auto value = folly::invoke_result_t<CriticalSection&>{};
this->requestFC([&]() { value = func(); }, records_.at(index));
return value;
}
std::vector<Super::Rec*> records_;
};
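// The two FlatCombining adapters above differ only in how they obtain
// combining records: FlatCombiningMutexNoCaching calls allocRec()/freeRec()
// around every operation, while FlatCombiningMutexCaching hands each thread
// index a pre-allocated record, keeping record management out of the
// measured critical path.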
template <typename Mutex, typename CriticalSection>
auto lock_and(Mutex& mutex, std::size_t, CriticalSection func) {
auto lck = folly::make_unique_lock(mutex);
return func();
}
template <typename F>
auto lock_and(DistributedMutexFlatCombining& mutex, std::size_t, F func) {
return mutex.mutex_.lock_combine(std::move(func));
}
template <typename F>
auto lock_and(FlatCombiningMutexNoCaching& mutex, std::size_t i, F func) {
return mutex.lock_combine(func, i);
}
template <typename F>
auto lock_and(FlatCombiningMutexCaching& mutex, std::size_t i, F func) {
return mutex.lock_combine(func, i);
}
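// A small illustrative sketch (the local names below are hypothetical, not
// part of this change): the lock_and() overloads above give every mutex
// flavor the same call shape, so the contended benchmark loop stays
// identical whether the critical section is combined or run under a
// conventional lock:
//
//   auto counter = std::uint64_t{0};
//   auto plain = std::mutex{};
//   auto combining = DistributedMutexFlatCombining{};
//   lock_and(plain, 0, [&] { return ++counter; });      // unique_lock path
//   lock_and(combining, 0, [&] { return ++counter; });  // lock_combine path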
template <typename Mutex>
std::unique_lock<Mutex> lock(Mutex& mutex) {
return std::unique_lock<Mutex>{mutex};
}
template <typename Mutex, typename Other>
void unlock(Mutex&, Other) {}
/**
 * Functions to initialize, write to and read from the protected data
 *
 * These are overloaded on the data type so the contended benchmark can run a
 * different critical section depending on what it protects; a sketch of how
 * another payload type could be added follows these overloads
 */
std::uint64_t write(std::uint64_t& value) {
return ++value;
}
void read(std::uint64_t value) {
folly::doNotOptimizeAway(value);
}
void initialize(std::uint64_t& value) {
value = 1;
}
class alignas(folly::hardware_destructive_interference_size) Ints {
public:
std::array<folly::CachelinePadded<std::uint64_t>, 5> ints_;
};
std::uint64_t write(Ints& vec) {
auto sum = std::uint64_t{0};
for (auto& integer : vec.ints_) {
sum += (*integer += 1);
}
return sum;
}
void initialize(Ints&) {}
class alignas(folly::hardware_destructive_interference_size) AtomicsAdd {
public:
std::array<folly::CachelinePadded<std::atomic<std::uint64_t>>, 5> ints_;
};
std::uint64_t write(AtomicsAdd& atomics) {
auto sum = std::uint64_t{0};
for (auto& integer : atomics.ints_) {
sum += integer->fetch_add(1);
}
return sum;
}
void initialize(AtomicsAdd&) {}
class alignas(folly::hardware_destructive_interference_size) AtomicCas {
public:
std::atomic<std::uint64_t> integer_{0};
};
std::uint64_t write(AtomicCas& atomic) {
auto value = atomic.integer_.load();
while (!atomic.integer_.compare_exchange_strong(value, value + 1)) {
}
return value;
}
void initialize(AtomicCas&) {}
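// A minimal sketch of the extension point described above (the TwoCounters
// type is hypothetical and not part of this change): a new payload only needs
// its own initialize()/write() overloads, since read() already accepts the
// std::uint64_t that write() returns.
//
//   class alignas(folly::hardware_destructive_interference_size) TwoCounters {
//    public:
//     std::array<folly::CachelinePadded<std::uint64_t>, 2> counters_;
//   };
//
//   std::uint64_t write(TwoCounters& data) {
//     // touch both cache lines under the lock, like the Ints payload does
//     return (*data.counters_[0] += 1) + (*data.counters_[1] += 1);
//   }
//   void initialize(TwoCounters&) {}
//
// which could then be benchmarked with, for example,
//   runContended<folly::DistributedMutex, TwoCounters>(numOps, numThreads);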
template <typename Lock, typename Data = std::uint64_t>
static void
runContended(size_t numOps, size_t numThreads, size_t work = FLAGS_work) {
folly::BenchmarkSuspender braces;
size_t totalthreads = std::thread::hardware_concurrency();
if (totalthreads < numThreads) {
...@@ -117,11 +241,14 @@ static void runContended(size_t numOps, size_t numThreads) {
char padding1[128];
Lock mutex;
char padding2[128];
Data value;
};
auto locks = std::vector<lockstruct>(threadgroups);
for (auto& data : locks) {
initialize(data.value);
}
folly::makeUnpredictable(locks);
char padding3[128];
(void)padding3;
...@@ -134,10 +261,11 @@ static void runContended(size_t numOps, size_t numThreads) {
lockstruct* mutex = &locks[t % threadgroups];
runbarrier.wait();
for (size_t op = 0; op < numOps; op += 1) {
auto val = lock_and(mutex->mutex, t, [& value = mutex->value, work] {
burn(work);
return write(value);
});
read(val);
burn(FLAGS_unlocked_work);
}
});
...@@ -257,7 +385,9 @@ template <typename Mutex>
void runUncontended(std::size_t iters) {
auto&& mutex = Mutex{};
for (auto i = std::size_t{0}; i < iters; ++i) {
folly::makeUnpredictable(mutex);
auto state = lock(mutex);
folly::makeUnpredictable(mutex);
unlock(mutex, std::move(state));
}
}
...@@ -348,6 +478,53 @@ static void folly_sharedmutex(size_t numOps, size_t numThreads) {
static void folly_distributedmutex(size_t numOps, size_t numThreads) {
runContended<folly::DistributedMutex>(numOps, numThreads);
}
static void folly_distributedmutex_combining(size_t ops, size_t threads) {
runContended<DistributedMutexFlatCombining>(ops, threads);
}
static void folly_flatcombining_no_caching(size_t numOps, size_t numThreads) {
runContended<FlatCombiningMutexNoCaching>(numOps, numThreads);
}
static void folly_flatcombining_caching(size_t numOps, size_t numThreads) {
runContended<FlatCombiningMutexCaching>(numOps, numThreads);
}
static void std_mutex_simple(size_t numOps, size_t numThreads) {
runContended<std::mutex, Ints>(numOps, numThreads, 0);
}
static void google_spin_simple(size_t numOps, size_t numThreads) {
runContended<GoogleSpinLockAdapter, Ints>(numOps, numThreads, 0);
}
static void folly_microspin_simple(size_t numOps, size_t numThreads) {
runContended<InitLock<folly::MicroSpinLock>, Ints>(numOps, numThreads, 0);
}
static void folly_picospin_simple(size_t numOps, size_t numThreads) {
runContended<InitLock<folly::PicoSpinLock<uint16_t>>, Ints>(
numOps, numThreads, 0);
}
static void folly_microlock_simple(size_t numOps, size_t numThreads) {
runContended<folly::MicroLock, Ints>(numOps, numThreads, 0);
}
static void folly_sharedmutex_simple(size_t numOps, size_t numThreads) {
runContended<folly::SharedMutex, Ints>(numOps, numThreads, 0);
}
static void folly_distributedmutex_simple(size_t numOps, size_t numThreads) {
runContended<folly::DistributedMutex, Ints>(numOps, numThreads, 0);
}
static void folly_distributedmutex_combining_simple(size_t o, size_t t) {
runContended<DistributedMutexFlatCombining, Ints>(o, t, 0);
}
static void atomics_fetch_add(size_t numOps, size_t numThreads) {
runContended<NoLock, AtomicsAdd>(numOps, numThreads, 0);
}
static void atomic_cas(size_t numOps, size_t numThreads) {
runContended<NoLock, AtomicCas>(numOps, numThreads, 0);
}
static void folly_flatcombining_no_caching_simple(size_t ops, size_t threads) {
runContended<FlatCombiningMutexNoCaching>(ops, threads, 0);
}
static void folly_flatcombining_caching_simple(size_t ops, size_t threads) {
runContended<FlatCombiningMutexCaching>(ops, threads, 0);
}
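// The *_simple variants above pass zero work cycles to model very small
// critical sections; most of them protect the wider Ints payload, and
// atomics_fetch_add/atomic_cas use NoLock so they measure the raw atomic
// operations as a lock-free baseline.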
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex, 1thread, 1)
...@@ -357,6 +534,9 @@ BENCH_REL(folly_picospin, 1thread, 1)
BENCH_REL(folly_microlock, 1thread, 1)
BENCH_REL(folly_sharedmutex, 1thread, 1)
BENCH_REL(folly_distributedmutex, 1thread, 1)
BENCH_REL(folly_distributedmutex_combining, 1thread, 1)
BENCH_REL(folly_flatcombining_no_caching, 1thread, 1)
BENCH_REL(folly_flatcombining_caching, 1thread, 1)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex, 2thread, 2)
BENCH_REL(google_spin, 2thread, 2)
...@@ -365,6 +545,9 @@ BENCH_REL(folly_picospin, 2thread, 2)
BENCH_REL(folly_microlock, 2thread, 2)
BENCH_REL(folly_sharedmutex, 2thread, 2)
BENCH_REL(folly_distributedmutex, 2thread, 2)
BENCH_REL(folly_distributedmutex_combining, 2thread, 2)
BENCH_REL(folly_flatcombining_no_caching, 2thread, 2)
BENCH_REL(folly_flatcombining_caching, 2thread, 2)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex, 4thread, 4)
BENCH_REL(google_spin, 4thread, 4)
...@@ -373,6 +556,9 @@ BENCH_REL(folly_picospin, 4thread, 4)
BENCH_REL(folly_microlock, 4thread, 4)
BENCH_REL(folly_sharedmutex, 4thread, 4)
BENCH_REL(folly_distributedmutex, 4thread, 4)
BENCH_REL(folly_distributedmutex_combining, 4thread, 4)
BENCH_REL(folly_flatcombining_no_caching, 4thread, 4)
BENCH_REL(folly_flatcombining_caching, 4thread, 4)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex, 8thread, 8)
BENCH_REL(google_spin, 8thread, 8)
...@@ -381,6 +567,9 @@ BENCH_REL(folly_picospin, 8thread, 8)
BENCH_REL(folly_microlock, 8thread, 8)
BENCH_REL(folly_sharedmutex, 8thread, 8)
BENCH_REL(folly_distributedmutex, 8thread, 8)
BENCH_REL(folly_distributedmutex_combining, 8thread, 8)
BENCH_REL(folly_flatcombining_no_caching, 8thread, 8)
BENCH_REL(folly_flatcombining_caching, 8thread, 8)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex, 16thread, 16)
BENCH_REL(google_spin, 16thread, 16)
...@@ -389,6 +578,9 @@ BENCH_REL(folly_picospin, 16thread, 16)
BENCH_REL(folly_microlock, 16thread, 16)
BENCH_REL(folly_sharedmutex, 16thread, 16)
BENCH_REL(folly_distributedmutex, 16thread, 16)
BENCH_REL(folly_distributedmutex_combining, 16thread, 16)
BENCH_REL(folly_flatcombining_no_caching, 16thread, 16)
BENCH_REL(folly_flatcombining_caching, 16thread, 16)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex, 32thread, 32)
BENCH_REL(google_spin, 32thread, 32)
...@@ -397,6 +589,9 @@ BENCH_REL(folly_picospin, 32thread, 32)
BENCH_REL(folly_microlock, 32thread, 32)
BENCH_REL(folly_sharedmutex, 32thread, 32)
BENCH_REL(folly_distributedmutex, 32thread, 32)
BENCH_REL(folly_distributedmutex_combining, 32thread, 32)
BENCH_REL(folly_flatcombining_no_caching, 32thread, 32)
BENCH_REL(folly_flatcombining_caching, 32thread, 32)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex, 64thread, 64)
BENCH_REL(google_spin, 64thread, 64)
...@@ -405,6 +600,9 @@ BENCH_REL(folly_picospin, 64thread, 64)
BENCH_REL(folly_microlock, 64thread, 64)
BENCH_REL(folly_sharedmutex, 64thread, 64)
BENCH_REL(folly_distributedmutex, 64thread, 64)
BENCH_REL(folly_distributedmutex_combining, 64thread, 64)
BENCH_REL(folly_flatcombining_no_caching, 64thread, 64)
BENCH_REL(folly_flatcombining_caching, 64thread, 64)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex, 128thread, 128)
BENCH_REL(google_spin, 128thread, 128)
...@@ -413,6 +611,114 @@ BENCH_REL(folly_picospin, 128thread, 128)
BENCH_REL(folly_microlock, 128thread, 128)
BENCH_REL(folly_sharedmutex, 128thread, 128)
BENCH_REL(folly_distributedmutex, 128thread, 128)
BENCH_REL(folly_distributedmutex_combining, 128thread, 128)
BENCH_REL(folly_flatcombining_no_caching, 128thread, 128)
BENCH_REL(folly_flatcombining_caching, 128thread, 128)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex_simple, 1thread, 1)
BENCH_REL(google_spin_simple, 1thread, 1)
BENCH_REL(folly_microspin_simple, 1thread, 1)
BENCH_REL(folly_picospin_simple, 1thread, 1)
BENCH_REL(folly_microlock_simple, 1thread, 1)
BENCH_REL(folly_sharedmutex_simple, 1thread, 1)
BENCH_REL(folly_distributedmutex_simple, 1thread, 1)
BENCH_REL(folly_distributedmutex_combining_simple, 1thread, 1)
BENCH_REL(folly_flatcombining_no_caching_simple, 1thread, 1)
BENCH_REL(folly_flatcombining_caching_simple, 1thread, 1)
BENCH_REL(atomics_fetch_add, 1thread, 1)
BENCH_REL(atomic_cas, 1thread, 1)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex_simple, 2thread, 2)
BENCH_REL(google_spin_simple, 2thread, 2)
BENCH_REL(folly_microspin_simple, 2thread, 2)
BENCH_REL(folly_picospin_simple, 2thread, 2)
BENCH_REL(folly_microlock_simple, 2thread, 2)
BENCH_REL(folly_sharedmutex_simple, 2thread, 2)
BENCH_REL(folly_distributedmutex_simple, 2thread, 2)
BENCH_REL(folly_distributedmutex_combining_simple, 2thread, 2)
BENCH_REL(folly_flatcombining_no_caching_simple, 2thread, 2)
BENCH_REL(folly_flatcombining_caching_simple, 2thread, 2)
BENCH_REL(atomics_fetch_add, 2thread, 2)
BENCH_REL(atomic_cas, 2thread, 2)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex_simple, 4thread, 4)
BENCH_REL(google_spin_simple, 4thread, 4)
BENCH_REL(folly_microspin_simple, 4thread, 4)
BENCH_REL(folly_picospin_simple, 4thread, 4)
BENCH_REL(folly_microlock_simple, 4thread, 4)
BENCH_REL(folly_sharedmutex_simple, 4thread, 4)
BENCH_REL(folly_distributedmutex_simple, 4thread, 4)
BENCH_REL(folly_distributedmutex_combining_simple, 4thread, 4)
BENCH_REL(folly_flatcombining_no_caching_simple, 4thread, 4)
BENCH_REL(folly_flatcombining_caching_simple, 4thread, 4)
BENCH_REL(atomics_fetch_add, 4thread, 4)
BENCH_REL(atomic_cas, 4thread, 4)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex_simple, 8thread, 8)
BENCH_REL(google_spin_simple, 8thread, 8)
BENCH_REL(folly_microspin_simple, 8thread, 8)
BENCH_REL(folly_picospin_simple, 8thread, 8)
BENCH_REL(folly_microlock_simple, 8thread, 8)
BENCH_REL(folly_sharedmutex_simple, 8thread, 8)
BENCH_REL(folly_distributedmutex_simple, 8thread, 8)
BENCH_REL(folly_distributedmutex_combining_simple, 8thread, 8)
BENCH_REL(folly_flatcombining_no_caching_simple, 8thread, 8)
BENCH_REL(folly_flatcombining_caching_simple, 8thread, 8)
BENCH_REL(atomics_fetch_add, 8thread, 8)
BENCH_REL(atomic_cas, 8thread, 8)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex_simple, 16thread, 16)
BENCH_REL(google_spin_simple, 16thread, 16)
BENCH_REL(folly_microspin_simple, 16thread, 16)
BENCH_REL(folly_picospin_simple, 16thread, 16)
BENCH_REL(folly_microlock_simple, 16thread, 16)
BENCH_REL(folly_sharedmutex_simple, 16thread, 16)
BENCH_REL(folly_distributedmutex_simple, 16thread, 16)
BENCH_REL(folly_distributedmutex_combining_simple, 16thread, 16)
BENCH_REL(folly_flatcombining_no_caching_simple, 16thread, 16)
BENCH_REL(folly_flatcombining_caching_simple, 16thread, 16)
BENCH_REL(atomics_fetch_add, 16thread, 16)
BENCH_REL(atomic_cas, 16thread, 16)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex_simple, 32thread, 32)
BENCH_REL(google_spin_simple, 32thread, 32)
BENCH_REL(folly_microspin_simple, 32thread, 32)
BENCH_REL(folly_picospin_simple, 32thread, 32)
BENCH_REL(folly_microlock_simple, 32thread, 32)
BENCH_REL(folly_sharedmutex_simple, 32thread, 32)
BENCH_REL(folly_distributedmutex_simple, 32thread, 32)
BENCH_REL(folly_distributedmutex_combining_simple, 32thread, 32)
BENCH_REL(folly_flatcombining_no_caching_simple, 32thread, 32)
BENCH_REL(folly_flatcombining_caching_simple, 32thread, 32)
BENCH_REL(atomics_fetch_add, 32thread, 32)
BENCH_REL(atomic_cas, 32thread, 32)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex_simple, 64thread, 64)
BENCH_REL(google_spin_simple, 64thread, 64)
BENCH_REL(folly_microspin_simple, 64thread, 64)
BENCH_REL(folly_picospin_simple, 64thread, 64)
BENCH_REL(folly_microlock_simple, 64thread, 64)
BENCH_REL(folly_sharedmutex_simple, 64thread, 64)
BENCH_REL(folly_distributedmutex_simple, 64thread, 64)
BENCH_REL(folly_distributedmutex_combining_simple, 64thread, 64)
BENCH_REL(folly_flatcombining_no_caching_simple, 64thread, 64)
BENCH_REL(folly_flatcombining_caching_simple, 64thread, 64)
BENCH_REL(atomics_fetch_add, 64thread, 64)
BENCH_REL(atomic_cas, 64thread, 64)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex_simple, 128thread, 128)
BENCH_REL(google_spin_simple, 128thread, 128)
BENCH_REL(folly_microspin_simple, 128thread, 128)
BENCH_REL(folly_picospin_simple, 128thread, 128)
BENCH_REL(folly_microlock_simple, 128thread, 128)
BENCH_REL(folly_sharedmutex_simple, 128thread, 128)
BENCH_REL(folly_distributedmutex_simple, 128thread, 128)
BENCH_REL(folly_distributedmutex_combining_simple, 128thread, 128)
BENCH_REL(folly_flatcombining_no_caching_simple, 128thread, 128)
BENCH_REL(folly_flatcombining_caching_simple, 128thread, 128)
BENCH_REL(atomics_fetch_add, 128thread, 128)
BENCH_REL(atomic_cas, 128thread, 128)
template <typename Mutex>
void fairnessTest(std::string type, std::size_t numThreads) {
...@@ -585,78 +891,415 @@ Lock time stats in us: mean 56 stddev 960 max 32873
============================================================================
folly/synchronization/test/SmallLocksBenchmark.cpprelative time/iter iters/s
============================================================================
StdMutexUncontendedBenchmark 18.85ns 53.04M
GoogleSpinUncontendedBenchmark 11.25ns 88.87M
MicroSpinLockUncontendedBenchmark 10.95ns 91.34M
PicoSpinLockUncontendedBenchmark 20.38ns 49.06M
MicroLockUncontendedBenchmark 28.60ns 34.96M
SharedMutexUncontendedBenchmark 19.51ns 51.25M
DistributedMutexUncontendedBenchmark 25.27ns 39.58M
AtomicFetchAddUncontendedBenchmark 5.47ns 182.91M
----------------------------------------------------------------------------
----------------------------------------------------------------------------
std_mutex(1thread) 797.34ns 1.25M
google_spin(1thread) 101.28% 787.29ns 1.27M
folly_microspin(1thread) 118.32% 673.90ns 1.48M
folly_picospin(1thread) 118.36% 673.66ns 1.48M
folly_microlock(1thread) 117.98% 675.84ns 1.48M
folly_sharedmutex(1thread) 118.40% 673.41ns 1.48M
folly_distributedmutex(1thread) 116.11% 686.74ns 1.46M
folly_distributedmutex_combining(1thread) 115.05% 693.05ns 1.44M
folly_flatcombining_no_caching(1thread) 90.40% 882.05ns 1.13M
folly_flatcombining_caching(1thread) 107.30% 743.08ns 1.35M
----------------------------------------------------------------------------
std_mutex(2thread) 1.14us 874.72K
google_spin(2thread) 120.79% 946.42ns 1.06M
folly_microspin(2thread) 136.28% 838.90ns 1.19M
folly_picospin(2thread) 133.80% 854.45ns 1.17M
folly_microlock(2thread) 111.09% 1.03us 971.76K
folly_sharedmutex(2thread) 109.19% 1.05us 955.10K
folly_distributedmutex(2thread) 106.62% 1.07us 932.65K
folly_distributedmutex_combining(2thread) 105.45% 1.08us 922.42K
folly_flatcombining_no_caching(2thread) 74.73% 1.53us 653.70K
folly_flatcombining_caching(2thread) 82.78% 1.38us 724.05K
----------------------------------------------------------------------------
std_mutex(4thread) 2.39us 418.41K
google_spin(4thread) 128.49% 1.86us 537.63K
folly_microspin(4thread) 102.60% 2.33us 429.28K
folly_picospin(4thread) 111.94% 2.14us 468.37K
folly_microlock(4thread) 78.19% 3.06us 327.16K
folly_sharedmutex(4thread) 86.30% 2.77us 361.11K
folly_distributedmutex(4thread) 138.25% 1.73us 578.44K
folly_distributedmutex_combining(4thread) 146.96% 1.63us 614.90K
folly_flatcombining_no_caching(4thread) 87.93% 2.72us 367.90K
folly_flatcombining_caching(4thread) 96.09% 2.49us 402.04K
----------------------------------------------------------------------------
std_mutex(8thread) 3.84us 260.54K
google_spin(8thread) 98.58% 3.89us 256.83K
folly_microspin(8thread) 64.01% 6.00us 166.77K
folly_picospin(8thread) 64.76% 5.93us 168.72K
folly_microlock(8thread) 44.31% 8.66us 115.45K
folly_sharedmutex(8thread) 50.20% 7.65us 130.78K
folly_distributedmutex(8thread) 120.38% 3.19us 313.64K
folly_distributedmutex_combining(8thread) 190.44% 2.02us 496.18K
folly_flatcombining_no_caching(8thread) 102.17% 3.76us 266.19K
folly_flatcombining_caching(8thread) 129.25% 2.97us 336.76K
----------------------------------------------------------------------------
std_mutex(16thread) 9.09us 110.05K
google_spin(16thread) 110.38% 8.23us 121.47K
folly_microspin(16thread) 79.81% 11.39us 87.83K
folly_picospin(16thread) 33.62% 27.03us 37.00K
folly_microlock(16thread) 49.93% 18.20us 54.95K
folly_sharedmutex(16thread) 46.15% 19.69us 50.79K
folly_distributedmutex(16thread) 145.48% 6.25us 160.10K
folly_distributedmutex_combining(16thread) 275.84% 3.29us 303.56K
folly_flatcombining_no_caching(16thread) 151.81% 5.99us 167.06K
folly_flatcombining_caching(16thread) 153.44% 5.92us 168.86K
----------------------------------------------------------------------------
std_mutex(32thread) 26.15us 38.24K
google_spin(32thread) 111.41% 23.47us 42.60K
folly_microspin(32thread) 84.76% 30.85us 32.41K
folly_picospin(32thread) 27.30% 95.80us 10.44K
folly_microlock(32thread) 48.93% 53.45us 18.71K
folly_sharedmutex(32thread) 54.64% 47.86us 20.89K
folly_distributedmutex(32thread) 158.31% 16.52us 60.53K
folly_distributedmutex_combining(32thread) 314.13% 8.33us 120.12K
folly_flatcombining_no_caching(32thread) 175.18% 14.93us 66.99K
folly_flatcombining_caching(32thread) 206.73% 12.65us 79.05K
----------------------------------------------------------------------------
std_mutex(64thread) 30.72us 32.55K
google_spin(64thread) 113.69% 27.02us 37.00K
folly_microspin(64thread) 87.23% 35.22us 28.39K
folly_picospin(64thread) 27.66% 111.06us 9.00K
folly_microlock(64thread) 49.93% 61.53us 16.25K
folly_sharedmutex(64thread) 54.00% 56.89us 17.58K
folly_distributedmutex(64thread) 162.10% 18.95us 52.77K
folly_distributedmutex_combining(64thread) 317.85% 9.67us 103.46K
folly_flatcombining_no_caching(64thread) 160.43% 19.15us 52.22K
folly_flatcombining_caching(64thread) 185.57% 16.56us 60.40K
----------------------------------------------------------------------------
std_mutex(128thread) 72.86us 13.72K
google_spin(128thread) 114.50% 63.64us 15.71K
folly_microspin(128thread) 99.89% 72.95us 13.71K
folly_picospin(128thread) 31.49% 231.40us 4.32K
folly_microlock(128thread) 57.76% 126.14us 7.93K
folly_sharedmutex(128thread) 61.49% 118.50us 8.44K
folly_distributedmutex(128thread) 188.86% 38.58us 25.92K
folly_distributedmutex_combining(128thread) 372.60% 19.56us 51.14K
folly_flatcombining_no_caching(128thread) 149.17% 48.85us 20.47K
folly_flatcombining_caching(128thread) 165.93% 43.91us 22.77K
----------------------------------------------------------------------------
std_mutex_simple(1thread) 623.35ns 1.60M
google_spin_simple(1thread) 103.37% 603.04ns 1.66M
folly_microspin_simple(1thread) 103.18% 604.15ns 1.66M
folly_picospin_simple(1thread) 103.27% 603.63ns 1.66M
folly_microlock_simple(1thread) 102.75% 606.68ns 1.65M
folly_sharedmutex_simple(1thread) 99.03% 629.43ns 1.59M
folly_distributedmutex_simple(1thread) 100.62% 619.52ns 1.61M
folly_distributedmutex_combining_simple(1thread 99.43% 626.92ns 1.60M
folly_flatcombining_no_caching_simple(1thread) 81.20% 767.71ns 1.30M
folly_flatcombining_caching_simple(1thread) 79.80% 781.15ns 1.28M
atomics_fetch_add(1thread) 100.67% 619.22ns 1.61M
atomic_cas(1thread) 104.04% 599.13ns 1.67M
----------------------------------------------------------------------------
std_mutex_simple(2thread) 1.13us 884.14K
google_spin_simple(2thread) 119.42% 947.08ns 1.06M
folly_microspin_simple(2thread) 118.54% 954.12ns 1.05M
folly_picospin_simple(2thread) 117.00% 966.67ns 1.03M
folly_microlock_simple(2thread) 114.90% 984.36ns 1.02M
folly_sharedmutex_simple(2thread) 110.79% 1.02us 979.53K
folly_distributedmutex_simple(2thread) 110.43% 1.02us 976.34K
folly_distributedmutex_combining_simple(2thread 105.80% 1.07us 935.43K
folly_flatcombining_no_caching_simple(2thread) 82.28% 1.37us 727.43K
folly_flatcombining_caching_simple(2thread) 89.85% 1.26us 794.41K
atomics_fetch_add(2thread) 107.37% 1.05us 949.27K
atomic_cas(2thread) 173.23% 652.92ns 1.53M
----------------------------------------------------------------------------
std_mutex_simple(4thread) 2.12us 471.59K
google_spin_simple(4thread) 101.25% 2.09us 477.50K
folly_microspin_simple(4thread) 97.79% 2.17us 461.17K
folly_picospin_simple(4thread) 98.80% 2.15us 465.92K
folly_microlock_simple(4thread) 79.65% 2.66us 375.61K
folly_sharedmutex_simple(4thread) 82.35% 2.57us 388.35K
folly_distributedmutex_simple(4thread) 113.43% 1.87us 534.91K
folly_distributedmutex_combining_simple(4thread 158.22% 1.34us 746.17K
folly_flatcombining_no_caching_simple(4thread) 89.95% 2.36us 424.22K
folly_flatcombining_caching_simple(4thread) 98.86% 2.14us 466.24K
atomics_fetch_add(4thread) 160.21% 1.32us 755.54K
atomic_cas(4thread) 283.73% 747.35ns 1.34M
----------------------------------------------------------------------------
std_mutex_simple(8thread) 3.81us 262.49K
google_spin_simple(8thread) 118.19% 3.22us 310.23K
folly_microspin_simple(8thread) 87.11% 4.37us 228.66K
folly_picospin_simple(8thread) 66.31% 5.75us 174.05K
folly_microlock_simple(8thread) 61.18% 6.23us 160.59K
folly_sharedmutex_simple(8thread) 61.65% 6.18us 161.82K
folly_distributedmutex_simple(8thread) 116.66% 3.27us 306.22K
folly_distributedmutex_combining_simple(8thread 222.30% 1.71us 583.53K
folly_flatcombining_no_caching_simple(8thread) 105.97% 3.59us 278.17K
folly_flatcombining_caching_simple(8thread) 119.21% 3.20us 312.92K
atomics_fetch_add(8thread) 248.65% 1.53us 652.70K
atomic_cas(8thread) 171.55% 2.22us 450.30K
----------------------------------------------------------------------------
std_mutex_simple(16thread) 9.02us 110.93K
google_spin_simple(16thread) 115.67% 7.79us 128.31K
folly_microspin_simple(16thread) 85.45% 10.55us 94.79K
folly_picospin_simple(16thread) 46.06% 19.57us 51.09K
folly_microlock_simple(16thread) 53.34% 16.90us 59.17K
folly_sharedmutex_simple(16thread) 47.16% 19.12us 52.31K
folly_distributedmutex_simple(16thread) 131.65% 6.85us 146.03K
folly_distributedmutex_combining_simple(16threa 353.51% 2.55us 392.13K
folly_flatcombining_no_caching_simple(16thread) 175.03% 5.15us 194.16K
folly_flatcombining_caching_simple(16thread) 169.24% 5.33us 187.73K
atomics_fetch_add(16thread) 428.31% 2.10us 475.10K
atomic_cas(16thread) 194.29% 4.64us 215.52K
----------------------------------------------------------------------------
std_mutex_simple(32thread) 22.66us 44.12K
google_spin_simple(32thread) 114.91% 19.72us 50.70K
folly_microspin_simple(32thread) 70.53% 32.13us 31.12K
folly_picospin_simple(32thread) 17.21% 131.71us 7.59K
folly_microlock_simple(32thread) 39.17% 57.86us 17.28K
folly_sharedmutex_simple(32thread) 46.84% 48.39us 20.67K
folly_distributedmutex_simple(32thread) 128.80% 17.60us 56.83K
folly_distributedmutex_combining_simple(32threa 397.59% 5.70us 175.43K
folly_flatcombining_no_caching_simple(32thread) 205.08% 11.05us 90.49K
folly_flatcombining_caching_simple(32thread) 247.48% 9.16us 109.20K
atomics_fetch_add(32thread) 466.03% 4.86us 205.63K
atomic_cas(32thread) 439.89% 5.15us 194.10K
----------------------------------------------------------------------------
std_mutex_simple(64thread) 30.55us 32.73K
google_spin_simple(64thread) 105.69% 28.91us 34.59K
folly_microspin_simple(64thread) 83.06% 36.79us 27.18K
folly_picospin_simple(64thread) 20.28% 150.63us 6.64K
folly_microlock_simple(64thread) 45.10% 67.75us 14.76K
folly_sharedmutex_simple(64thread) 54.07% 56.50us 17.70K
folly_distributedmutex_simple(64thread) 151.84% 20.12us 49.70K
folly_distributedmutex_combining_simple(64threa 465.77% 6.56us 152.45K
folly_flatcombining_no_caching_simple(64thread) 186.46% 16.39us 61.03K
folly_flatcombining_caching_simple(64thread) 250.81% 12.18us 82.09K
atomics_fetch_add(64thread) 530.59% 5.76us 173.67K
atomic_cas(64thread) 510.57% 5.98us 167.12K
----------------------------------------------------------------------------
std_mutex_simple(128thread) 69.85us 14.32K
google_spin_simple(128thread) 97.54% 71.61us 13.97K
folly_microspin_simple(128thread) 88.01% 79.36us 12.60K
folly_picospin_simple(128thread) 22.31% 313.13us 3.19K
folly_microlock_simple(128thread) 50.49% 138.34us 7.23K
folly_sharedmutex_simple(128thread) 59.30% 117.78us 8.49K
folly_distributedmutex_simple(128thread) 174.90% 39.94us 25.04K
folly_distributedmutex_combining_simple(128thre 531.75% 13.14us 76.13K
folly_flatcombining_no_caching_simple(128thread 212.56% 32.86us 30.43K
folly_flatcombining_caching_simple(128thread) 183.68% 38.03us 26.30K
atomics_fetch_add(128thread) 629.64% 11.09us 90.15K
atomic_cas(128thread) 562.01% 12.43us 80.46K
============================================================================
./small_locks_benchmark --bm_min_iters=100000
Intel(R) Xeon(R) D-2191 CPU @ 1.60GHz
============================================================================
folly/synchronization/test/SmallLocksBenchmark.cpprelative time/iter iters/s
============================================================================
StdMutexUncontendedBenchmark 37.65ns 26.56M
GoogleSpinUncontendedBenchmark 21.97ns 45.52M
MicroSpinLockUncontendedBenchmark 21.97ns 45.53M
PicoSpinLockUncontendedBenchmark 40.80ns 24.51M
MicroLockUncontendedBenchmark 57.76ns 17.31M
SharedMutexUncontendedBenchmark 39.55ns 25.29M
DistributedMutexUncontendedBenchmark 51.47ns 19.43M
AtomicFetchAddUncontendedBenchmark 10.67ns 93.73M
----------------------------------------------------------------------------
----------------------------------------------------------------------------
std_mutex(1thread) 1.36us 737.48K
google_spin(1thread) 94.81% 1.43us 699.17K
folly_microspin(1thread) 100.17% 1.35us 738.74K
folly_picospin(1thread) 100.40% 1.35us 740.41K
folly_microlock(1thread) 82.90% 1.64us 611.34K
folly_sharedmutex(1thread) 101.07% 1.34us 745.36K
folly_distributedmutex(1thread) 101.50% 1.34us 748.54K
folly_distributedmutex_combining(1thread) 99.09% 1.37us 730.79K
folly_flatcombining_no_caching(1thread) 91.37% 1.48us 673.80K
folly_flatcombining_caching(1thread) 99.19% 1.37us 731.48K
----------------------------------------------------------------------------
std_mutex(2thread) 1.65us 605.33K
google_spin(2thread) 113.28% 1.46us 685.74K
folly_microspin(2thread) 117.23% 1.41us 709.63K
folly_picospin(2thread) 113.56% 1.45us 687.40K
folly_microlock(2thread) 106.92% 1.55us 647.22K
folly_sharedmutex(2thread) 107.24% 1.54us 649.15K
folly_distributedmutex(2thread) 114.89% 1.44us 695.47K
folly_distributedmutex_combining(2thread) 83.44% 1.98us 505.10K
folly_flatcombining_no_caching(2thread) 75.89% 2.18us 459.42K
folly_flatcombining_caching(2thread) 76.96% 2.15us 465.86K
----------------------------------------------------------------------------
std_mutex(4thread) 2.88us 347.43K
google_spin(4thread) 132.08% 2.18us 458.88K
folly_microspin(4thread) 160.15% 1.80us 556.43K
folly_picospin(4thread) 189.27% 1.52us 657.60K
folly_microlock(4thread) 155.13% 1.86us 538.97K
folly_sharedmutex(4thread) 148.96% 1.93us 517.55K
folly_distributedmutex(4thread) 106.64% 2.70us 370.51K
folly_distributedmutex_combining(4thread) 138.83% 2.07us 482.33K
folly_flatcombining_no_caching(4thread) 87.67% 3.28us 304.59K
folly_flatcombining_caching(4thread) 93.32% 3.08us 324.23K
----------------------------------------------------------------------------
std_mutex(8thread) 7.01us 142.65K
google_spin(8thread) 127.58% 5.49us 182.00K
folly_microspin(8thread) 137.50% 5.10us 196.14K
folly_picospin(8thread) 114.66% 6.11us 163.56K
folly_microlock(8thread) 107.90% 6.50us 153.92K
folly_sharedmutex(8thread) 114.21% 6.14us 162.93K
folly_distributedmutex(8thread) 129.43% 5.42us 184.63K
folly_distributedmutex_combining(8thread) 271.46% 2.58us 387.23K
folly_flatcombining_no_caching(8thread) 148.27% 4.73us 211.50K
folly_flatcombining_caching(8thread) 170.26% 4.12us 242.88K
----------------------------------------------------------------------------
std_mutex(16thread) 13.11us 76.30K
google_spin(16thread) 122.81% 10.67us 93.71K
folly_microspin(16thread) 91.61% 14.31us 69.90K
folly_picospin(16thread) 62.60% 20.94us 47.76K
folly_microlock(16thread) 73.44% 17.85us 56.04K
folly_sharedmutex(16thread) 74.68% 17.55us 56.98K
folly_distributedmutex(16thread) 142.42% 9.20us 108.67K
folly_distributedmutex_combining(16thread) 332.10% 3.95us 253.39K
folly_flatcombining_no_caching(16thread) 177.20% 7.40us 135.21K
folly_flatcombining_caching(16thread) 186.60% 7.02us 142.37K
----------------------------------------------------------------------------
std_mutex(32thread) 25.45us 39.30K
google_spin(32thread) 122.57% 20.76us 48.17K
folly_microspin(32thread) 73.58% 34.58us 28.92K
folly_picospin(32thread) 50.29% 50.60us 19.76K
folly_microlock(32thread) 58.33% 43.63us 22.92K
folly_sharedmutex(32thread) 55.89% 45.53us 21.96K
folly_distributedmutex(32thread) 142.80% 17.82us 56.12K
folly_distributedmutex_combining(32thread) 352.23% 7.22us 138.42K
folly_flatcombining_no_caching(32thread) 237.42% 10.72us 93.30K
folly_flatcombining_caching(32thread) 251.05% 10.14us 98.66K
----------------------------------------------------------------------------
std_mutex(64thread) 43.02us 23.25K
google_spin(64thread) 120.68% 35.65us 28.05K
folly_microspin(64thread) 70.09% 61.38us 16.29K
folly_picospin(64thread) 42.05% 102.31us 9.77K
folly_microlock(64thread) 54.50% 78.94us 12.67K
folly_sharedmutex(64thread) 50.37% 85.40us 11.71K
folly_distributedmutex(64thread) 135.17% 31.83us 31.42K
folly_distributedmutex_combining(64thread) 319.01% 13.49us 74.15K
folly_flatcombining_no_caching(64thread) 218.18% 19.72us 50.72K
folly_flatcombining_caching(64thread) 211.05% 20.38us 49.06K
----------------------------------------------------------------------------
std_mutex(128thread) 84.62us 11.82K
google_spin(128thread) 120.25% 70.37us 14.21K
folly_microspin(128thread) 66.54% 127.16us 7.86K
folly_picospin(128thread) 33.40% 253.38us 3.95K
folly_microlock(128thread) 51.91% 163.03us 6.13K
folly_sharedmutex(128thread) 49.51% 170.90us 5.85K
folly_distributedmutex(128thread) 131.90% 64.15us 15.59K
folly_distributedmutex_combining(128thread) 273.55% 30.93us 32.33K
folly_flatcombining_no_caching(128thread) 183.86% 46.02us 21.73K
folly_flatcombining_caching(128thread) 180.95% 46.76us 21.38K
----------------------------------------------------------------------------
std_mutex_simple(1thread) 1.20us 833.55K
google_spin_simple(1thread) 105.03% 1.14us 875.52K
folly_microspin_simple(1thread) 102.64% 1.17us 855.57K
folly_picospin_simple(1thread) 101.94% 1.18us 849.74K
folly_microlock_simple(1thread) 101.01% 1.19us 841.96K
folly_sharedmutex_simple(1thread) 100.82% 1.19us 840.37K
folly_distributedmutex_simple(1thread) 100.15% 1.20us 834.83K
folly_distributedmutex_combining_simple(1thread 102.37% 1.17us 853.32K
folly_flatcombining_no_caching_simple(1thread) 93.19% 1.29us 776.81K
folly_flatcombining_caching_simple(1thread) 100.03% 1.20us 833.80K
atomic_fetch_add(1thread) 98.13% 1.22us 817.99K
atomic_cas(1thread) 101.95% 1.18us 849.82K
----------------------------------------------------------------------------
std_mutex_simple(2thread) 1.56us 641.79K
google_spin_simple(2thread) 110.31% 1.41us 707.98K
folly_microspin_simple(2thread) 115.05% 1.35us 738.35K
folly_picospin_simple(2thread) 110.28% 1.41us 707.78K
folly_microlock_simple(2thread) 107.14% 1.45us 687.60K
folly_sharedmutex_simple(2thread) 113.16% 1.38us 726.22K
folly_distributedmutex_simple(2thread) 108.31% 1.44us 695.14K
folly_distributedmutex_combining_simple(2thread 104.39% 1.49us 669.95K
folly_flatcombining_no_caching_simple(2thread) 87.04% 1.79us 558.63K
folly_flatcombining_caching_simple(2thread) 97.59% 1.60us 626.30K
atomic_fetch_add(2thread) 103.06% 1.51us 661.42K
atomic_cas(2thread) 123.77% 1.26us 794.32K
----------------------------------------------------------------------------
std_mutex_simple(4thread) 2.72us 368.29K
google_spin_simple(4thread) 122.17% 2.22us 449.96K
folly_microspin_simple(4thread) 142.12% 1.91us 523.43K
folly_picospin_simple(4thread) 160.27% 1.69us 590.27K
folly_microlock_simple(4thread) 143.16% 1.90us 527.24K
folly_sharedmutex_simple(4thread) 139.18% 1.95us 512.61K
folly_distributedmutex_simple(4thread) 111.52% 2.43us 410.71K
folly_distributedmutex_combining_simple(4thread 138.74% 1.96us 510.96K
folly_flatcombining_no_caching_simple(4thread) 96.48% 2.81us 355.34K
folly_flatcombining_caching_simple(4thread) 105.15% 2.58us 387.28K
atomic_fetch_add(4thread) 148.73% 1.83us 547.75K
atomic_cas(4thread) 213.49% 1.27us 786.28K
----------------------------------------------------------------------------
std_mutex_simple(8thread) 7.04us 142.04K
google_spin_simple(8thread) 127.59% 5.52us 181.23K
folly_microspin_simple(8thread) 135.94% 5.18us 193.09K
folly_picospin_simple(8thread) 113.86% 6.18us 161.72K
folly_microlock_simple(8thread) 112.07% 6.28us 159.18K
folly_sharedmutex_simple(8thread) 113.25% 6.22us 160.86K
folly_distributedmutex_simple(8thread) 124.12% 5.67us 176.30K
folly_distributedmutex_combining_simple(8thread 309.01% 2.28us 438.91K
folly_flatcombining_no_caching_simple(8thread) 134.62% 5.23us 191.21K
folly_flatcombining_caching_simple(8thread) 147.13% 4.79us 208.99K
atomic_fetch_add(8thread) 347.94% 2.02us 494.21K
atomic_cas(8thread) 412.06% 1.71us 585.28K
----------------------------------------------------------------------------
std_mutex_simple(16thread) 12.87us 77.73K
google_spin_simple(16thread) 122.44% 10.51us 95.17K
folly_microspin_simple(16thread) 99.49% 12.93us 77.33K
folly_picospin_simple(16thread) 72.60% 17.72us 56.43K
folly_microlock_simple(16thread) 80.39% 16.00us 62.48K
folly_sharedmutex_simple(16thread) 78.76% 16.34us 61.22K
folly_distributedmutex_simple(16thread) 118.58% 10.85us 92.17K
folly_distributedmutex_combining_simple(16threa 483.44% 2.66us 375.76K
folly_flatcombining_no_caching_simple(16thread) 194.22% 6.62us 150.96K
folly_flatcombining_caching_simple(16thread) 229.03% 5.62us 178.02K
atomic_fetch_add(16thread) 617.57% 2.08us 480.01K
atomic_cas(16thread) 258.86% 4.97us 201.20K
----------------------------------------------------------------------------
std_mutex_simple(32thread) 22.85us 43.77K
google_spin_simple(32thread) 123.96% 18.43us 54.25K
folly_microspin_simple(32thread) 73.35% 31.15us 32.11K
folly_picospin_simple(32thread) 46.43% 49.21us 20.32K
folly_microlock_simple(32thread) 55.62% 41.08us 24.34K
folly_sharedmutex_simple(32thread) 52.67% 43.38us 23.05K
folly_distributedmutex_simple(32thread) 106.87% 21.38us 46.78K
folly_distributedmutex_combining_simple(32threa 581.80% 3.93us 254.64K
folly_flatcombining_no_caching_simple(32thread) 280.19% 8.15us 122.63K
folly_flatcombining_caching_simple(32thread) 350.87% 6.51us 153.57K
atomic_fetch_add(32thread) 1031.35% 2.22us 451.41K
atomic_cas(32thread) 209.10% 10.93us 91.52K
----------------------------------------------------------------------------
std_mutex_simple(64thread) 39.55us 25.28K
google_spin_simple(64thread) 124.15% 31.86us 31.39K
folly_microspin_simple(64thread) 72.27% 54.73us 18.27K
folly_picospin_simple(64thread) 39.96% 98.98us 10.10K
folly_microlock_simple(64thread) 53.10% 74.48us 13.43K
folly_sharedmutex_simple(64thread) 48.83% 81.00us 12.35K
folly_distributedmutex_simple(64thread) 103.91% 38.06us 26.27K
folly_distributedmutex_combining_simple(64threa 520.61% 7.60us 131.63K
folly_flatcombining_no_caching_simple(64thread) 288.46% 13.71us 72.93K
folly_flatcombining_caching_simple(64thread) 306.57% 12.90us 77.51K
atomic_fetch_add(64thread) 982.24% 4.03us 248.34K
atomic_cas(64thread) 191.87% 20.61us 48.51K
----------------------------------------------------------------------------
std_mutex_simple(128thread) 77.79us 12.85K
google_spin_simple(128thread) 123.39% 63.05us 15.86K
folly_microspin_simple(128thread) 69.13% 112.53us 8.89K
folly_picospin_simple(128thread) 30.32% 256.57us 3.90K
folly_microlock_simple(128thread) 50.78% 153.20us 6.53K
folly_sharedmutex_simple(128thread) 48.00% 162.07us 6.17K
folly_distributedmutex_simple(128thread) 102.79% 75.68us 13.21K
folly_distributedmutex_combining_simple(128thre 433.00% 17.97us 55.66K
folly_flatcombining_no_caching_simple(128thread 186.46% 41.72us 23.97K
folly_flatcombining_caching_simple(128thread) 204.22% 38.09us 26.25K
atomic_fetch_add(128thread) 965.10% 8.06us 124.06K
atomic_cas(128thread) 184.01% 42.28us 23.65K
============================================================================
*/