Commit f11ea4c6 authored by Aaryaman Sagar, committed by Facebook Github Bot

Add flat combining to DistributedMutex

Summary:
Add combined critical sections to DistributedMutex.  The implementation uses
the framework within DistributedMutex as the point of reference for contention
and resolves contention by either combining the lock requests of peers or
migrating the lock based on usage and internal state.  This boosts the
performance of DistributedMutex even further: up to 4x relative to the old
benchmark on dual-socket Broadwell and up to 5x on single-socket Skylake
machines.  The win might be bigger when the cost of mutex migration is higher,
e.g. when the data being protected is wider than a single L1 cache line.
Small critical sections, when used in combinable mode, can now go more than
10x faster than the small locks, about 6x faster than std::mutex, up to 2-3x
faster than the implementations of flat combining we benchmarked against, and
about as fast as a CAS instruction/loop (faster on some NUMA-less and more
parallel architectures like Skylake).  This also allows flat combining to be
used in situations where fine-grained locking would be beneficial, with
virtually no overhead; DistributedMutex retains its original size of 8 bytes.
DistributedMutex resolves contention through flat combining up to a constant
factor of 2 contention chains to prevent issues with fairness and latency
outliers, so we retain the fairness benefits of the original implementation
with no noticeable regression when switching between the lock methods.

The implementation of combined critical sections here is different from the
original flat combining paper.  It uses the same stack-based LIFO contention
chains from DistributedMutex to allow the leader to resolve lock requests from
peers.  Combine records are located on the stack along with the wait node as
an InlineFunctionRef instance to avoid memory allocation overhead or expensive
copying.  Using InlineFunctionRef also means that function calls are resolved
without having to go through the double lookup of a vtable-based
implementation; InlineFunctionRef can flatten the virtual table and callable
object in situ, so we have just one indirection.  Additionally, we use
preemption as a signal to speed up lock requests in the case where the latency
of acquisition would otherwise have gone beyond our control.  As a side
benefit, this also results in much simpler code.
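
To make the single-indirection point concrete, here is a minimal sketch of an
InlineFunctionRef-style wrapper.  This is an illustrative assumption, not
folly's actual detail::InlineFunctionRef: the callable is copied into inline
storage next to a single invoke-function pointer, so a call costs one indirect
call and never touches the heap.
```
// Sketch only: an inline, type-erased callable reference (hypothetical
// InlineFnRefSketch, not folly's detail::InlineFunctionRef)
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

template <std::size_t kStorage>
class InlineFnRefSketch {
 public:
  template <typename F>
  explicit InlineFnRefSketch(F f) {
    static_assert(sizeof(F) <= kStorage, "callable too large for inline storage");
    static_assert(
        std::is_trivially_copyable<F>::value,
        "this sketch only handles trivially copyable callables");
    // copy the callable into the inline buffer, no allocation
    ::new (&storage_) F(std::move(f));
    // a single function pointer that knows how to invoke the erased callable
    call_ = [](void* storage) { (*static_cast<F*>(storage))(); };
  }

  // one indirect call, no vtable double lookup
  void operator()() { call_(&storage_); }

 private:
  void (*call_)(void*) = nullptr;
  std::aligned_storage_t<kStorage, alignof(std::max_align_t)> storage_;
};
```
A lambda capturing a handful of pointers fits in the inline buffer, which is
why the combine record can live in the waiter's stack frame with no
allocation.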

The API looks like the following:
```
auto integer = std::uint64_t{};
auto mutex = folly::DistributedMutex{};

// ...

mutex.lock_combine([&]() {
  foo();
  integer++;
});
```

This adds three new methods for symmetry with the old lock functions:
- folly::invoke_result_t<const Func&> lock_combine(Func) noexcept;
- folly::Optional<> try_lock_combine_for(duration, Func) noexcept;
- folly::Optional<> try_lock_combine_until(time_point, Func) noexcept;
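
As a usage sketch (assuming the signatures above and the usual includes;
folly::Optional is empty when the timed acquisition fails), a combined
critical section can also return a value to the calling thread:
```
auto map = std::map<int, int>{};
auto mutex = folly::DistributedMutex{};

// the return value of the lambda is transferred back to the caller
auto size = mutex.lock_combine([&]() {
  map[1] = 2;
  return map.size();
});

// the timed variants return folly::none if the mutex could not be acquired
auto result = mutex.try_lock_combine_for(
    std::chrono::milliseconds{10}, [&]() { return map.size(); });
if (result) {
  // the critical section ran; *result holds the value it returned
}
```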

Benchmarks on Broadwell
```
std_mutex_simple(1thread)                                  617.28ns    1.62M
google_spin_simple(1thread)                      101.97%   605.33ns    1.65M
folly_microspin_simple(1thread)                   99.40%   621.01ns    1.61M
folly_picospin_simple(1thread)                   100.15%   616.36ns    1.62M
folly_microlock_simple(1thread)                   98.86%   624.37ns    1.60M
folly_sharedmutex_simple(1thread)                 86.14%   716.59ns    1.40M
folly_distributedmutex_simple(1thread)            97.95%   630.21ns    1.59M
folly_distributedmutex_flatcombining_simple(1th   98.04%   629.60ns    1.59M
folly_flatcombining_no_caching_simple(1thread)    89.85%   687.01ns    1.46M
folly_flatcombining_caching_simple(1thread)       78.36%   787.75ns    1.27M
atomics_fetch_add(1thread)                        97.88%   630.67ns    1.59M
atomic_cas(1thread)                              102.31%   603.33ns    1.66M
----------------------------------------------------------------------------
std_mutex_simple(2thread)                                    1.14us  875.72K
google_spin_simple(2thread)                      125.08%   912.95ns    1.10M
folly_microspin_simple(2thread)                  116.03%   984.14ns    1.02M
folly_picospin_simple(2thread)                   117.35%   973.04ns    1.03M
folly_microlock_simple(2thread)                  102.54%     1.11us  897.95K
folly_sharedmutex_simple(2thread)                121.04%   943.42ns    1.06M
folly_distributedmutex_simple(2thread)           128.24%   890.48ns    1.12M
folly_distributedmutex_flatcombining_simple(2th  107.99%     1.06us  945.66K
folly_flatcombining_no_caching_simple(2thread)    83.40%     1.37us  730.33K
folly_flatcombining_caching_simple(2thread)       87.47%     1.31us  766.00K
atomics_fetch_add(2thread)                       115.71%   986.85ns    1.01M
atomic_cas(2thread)                              171.35%   666.42ns    1.50M
----------------------------------------------------------------------------
std_mutex_simple(4thread)                                    1.98us  504.43K
google_spin_simple(4thread)                      103.24%     1.92us  520.76K
folly_microspin_simple(4thread)                   92.05%     2.15us  464.33K
folly_picospin_simple(4thread)                    89.16%     2.22us  449.75K
folly_microlock_simple(4thread)                   66.62%     2.98us  336.06K
folly_sharedmutex_simple(4thread)                 82.61%     2.40us  416.69K
folly_distributedmutex_simple(4thread)           108.83%     1.82us  548.98K
folly_distributedmutex_flatcombining_simple(4th  145.24%     1.36us  732.63K
folly_flatcombining_no_caching_simple(4thread)    84.77%     2.34us  427.62K
folly_flatcombining_caching_simple(4thread)       91.01%     2.18us  459.09K
atomics_fetch_add(4thread)                       142.86%     1.39us  720.62K
atomic_cas(4thread)                              223.50%   887.02ns    1.13M
----------------------------------------------------------------------------
std_mutex_simple(8thread)                                    3.70us  270.40K
google_spin_simple(8thread)                      110.24%     3.35us  298.09K
folly_microspin_simple(8thread)                   81.59%     4.53us  220.63K
folly_picospin_simple(8thread)                    57.61%     6.42us  155.77K
folly_microlock_simple(8thread)                   54.18%     6.83us  146.49K
folly_sharedmutex_simple(8thread)                 55.44%     6.67us  149.92K
folly_distributedmutex_simple(8thread)           109.86%     3.37us  297.05K
folly_distributedmutex_flatcombining_simple(8th  225.14%     1.64us  608.76K
folly_flatcombining_no_caching_simple(8thread)    96.25%     3.84us  260.26K
folly_flatcombining_caching_simple(8thread)      108.13%     3.42us  292.39K
atomics_fetch_add(8thread)                       255.40%     1.45us  690.60K
atomic_cas(8thread)                              183.68%     2.01us  496.66K
----------------------------------------------------------------------------
std_mutex_simple(16thread)                                   8.70us  114.89K
google_spin_simple(16thread)                     124.47%     6.99us  143.01K
folly_microspin_simple(16thread)                  86.46%    10.07us   99.34K
folly_picospin_simple(16thread)                   40.76%    21.36us   46.83K
folly_microlock_simple(16thread)                  54.78%    15.89us   62.94K
folly_sharedmutex_simple(16thread)                58.14%    14.97us   66.80K
folly_distributedmutex_simple(16thread)          124.53%     6.99us  143.08K
folly_distributedmutex_flatcombining_simple(16t  324.08%     2.69us  372.34K
folly_flatcombining_no_caching_simple(16thread)  134.73%     6.46us  154.79K
folly_flatcombining_caching_simple(16thread)     188.24%     4.62us  216.28K
atomics_fetch_add(16thread)                      340.07%     2.56us  390.72K
atomic_cas(16thread)                             220.15%     3.95us  252.93K
----------------------------------------------------------------------------
std_mutex_simple(32thread)                                  25.62us   39.03K
google_spin_simple(32thread)                     105.21%    24.35us   41.07K
folly_microspin_simple(32thread)                  79.64%    32.17us   31.08K
folly_picospin_simple(32thread)                   19.61%   130.67us    7.65K
folly_microlock_simple(32thread)                  42.97%    59.62us   16.77K
folly_sharedmutex_simple(32thread)                52.41%    48.88us   20.46K
folly_distributedmutex_simple(32thread)          144.48%    17.73us   56.39K
folly_distributedmutex_flatcombining_simple(32t  461.73%     5.55us  180.22K
folly_flatcombining_no_caching_simple(32thread)  207.55%    12.34us   81.01K
folly_flatcombining_caching_simple(32thread)     237.34%    10.80us   92.64K
atomics_fetch_add(32thread)                      561.68%     4.56us  219.23K
atomic_cas(32thread)                             484.13%     5.29us  188.96K
----------------------------------------------------------------------------
std_mutex_simple(64thread)                                  31.26us   31.99K
google_spin_simple(64thread)                      99.95%    31.28us   31.97K
folly_microspin_simple(64thread)                  83.63%    37.38us   26.75K
folly_picospin_simple(64thread)                   20.88%   149.68us    6.68K
folly_microlock_simple(64thread)                  45.46%    68.77us   14.54K
folly_sharedmutex_simple(64thread)                52.65%    59.38us   16.84K
folly_distributedmutex_simple(64thread)          154.90%    20.18us   49.55K
folly_distributedmutex_flatcombining_simple(64t  475.05%     6.58us  151.96K
folly_flatcombining_no_caching_simple(64thread)  195.63%    15.98us   62.58K
folly_flatcombining_caching_simple(64thread)     199.29%    15.69us   63.75K
atomics_fetch_add(64thread)                      580.23%     5.39us  185.61K
atomic_cas(64thread)                             510.76%     6.12us  163.39K
----------------------------------------------------------------------------
std_mutex_simple(128thread)                                 70.53us   14.18K
google_spin_simple(128thread)                     99.20%    71.09us   14.07K
folly_microspin_simple(128thread)                 88.73%    79.49us   12.58K
folly_picospin_simple(128thread)                  22.24%   317.06us    3.15K
folly_microlock_simple(128thread)                 50.17%   140.57us    7.11K
folly_sharedmutex_simple(128thread)               59.53%   118.47us    8.44K
folly_distributedmutex_simple(128thread)         172.74%    40.83us   24.49K
folly_distributedmutex_flatcombining_simple(128  538.22%    13.10us   76.31K
folly_flatcombining_no_caching_simple(128thread  165.11%    42.72us   23.41K
folly_flatcombining_caching_simple(128thread)    161.46%    43.68us   22.89K
atomics_fetch_add(128thread)                     606.51%    11.63us   85.99K
atomic_cas(128thread)                            578.52%    12.19us   82.03K
```

Reviewed By: djwatson

Differential Revision: D13799447

fbshipit-source-id: 923cc35e5060ef79b349690821d8545459248347
parent 11566445
@@ -15,23 +15,24 @@
*/
#include <folly/synchronization/DistributedMutex.h>
#include <folly/CachelinePadded.h>
#include <folly/Likely.h>
#include <folly/Portability.h>
#include <folly/ScopeGuard.h>
#include <folly/Utility.h>
#include <folly/chrono/Hardware.h>
#include <folly/detail/Futex.h>
#include <folly/functional/Invoke.h>
#include <folly/lang/Align.h>
#include <folly/lang/Bits.h>
#include <folly/portability/Asm.h>
#include <folly/synchronization/AtomicNotification.h>
#include <folly/synchronization/AtomicUtil.h>
#include <folly/synchronization/WaitOptions.h>
#include <folly/synchronization/detail/InlineFunctionRef.h>
#include <folly/synchronization/detail/Sleeper.h>
#include <folly/synchronization/detail/Spin.h>
#include <glog/logging.h>
#include <array>
#include <atomic>
#include <cstdint>
#include <limits>
@@ -75,6 +76,13 @@ constexpr auto kTimedWaiter = std::uintptr_t{0b10};
// this becomes significant for threads that are trying to wake up the
// uninitialized thread, if they see that the thread is not yet initialized,
// they can do nothing but spin, and wait for the thread to get initialized
//
// This also plays a role in the functioning of flat combining as implemented
// in DistributedMutex. When a thread owning the lock goes through the
// contention chain to either unlock the mutex or combine critical sections
// from the other end. The presence of kUninitialized means that the
// combining thread is not able to make progress after this point. So we
// transfer the lock.
constexpr auto kUninitialized = std::uint32_t{0b0};
// kWaiting will be set in the waiter's futex structs while they are spinning
// while waiting for the mutex
@@ -107,6 +115,20 @@ constexpr auto kAboutToWait = std::uint32_t{0b100};
// had not yet entered futex(). This interleaving causes the thread calling
// futex() to return spuriously, as the futex word is not what it should be
constexpr auto kSleeping = std::uint32_t{0b101};
// kCombined is set by the lock holder to let the waiter thread know that its
// combine request was successfully completed by the lock holder. A
// successful combine means that the thread requesting the combine operation
// does not need to unlock the mutex; in fact, doing so would be an error.
constexpr auto kCombined = std::uint32_t{0b111};
// kCombineUninitialized is like kUninitialized but is set by a thread when it
// enqueues in hopes of getting its critical section combined with the lock
// holder
constexpr auto kCombineUninitialized = std::uint32_t{0b1000};
// kCombineWaiting is set by a thread when it is ready to have its combine
// record fulfilled by the lock holder. In particular, this signals to the
// lock holder that the thread has set its next_ pointer in the contention
// chain
constexpr auto kCombineWaiting = std::uint32_t{0b1001};
// The number of spins that we are allowed to do before we resort to marking a
// thread as having slept
@@ -116,15 +138,18 @@ constexpr auto kScheduledAwaySpinThreshold = std::chrono::nanoseconds{200};
// The maximum number of spins before a thread starts yielding its processor
// in hopes of getting skipped
constexpr auto kMaxSpins = 4000;
// The maximum number of contention chains we can resolve with flat combining.
// After this number of contention chains, the mutex falls back to regular
// two-phased mutual exclusion to ensure that we don't starve the combiner
// thread
constexpr auto kMaxCombineIterations = 2;
/**
* Write only data that is available to the thread that is waking up another.
* Only the waking thread is allowed to write to this, the thread to be woken
* is allowed to read from this after a wakeup has been issued
*
* Because of the write only semantics of the data here, acquire-release (or
* stronger) memory ordering is needed to write to this
*/
template <template <typename> class Atomic>
class WakerMetadata {
public:
// This is the thread that initiated wakeups for the contention chain.
@@ -133,7 +158,7 @@ class WakerMetadata {
// woke up sees this as the next thread to wake up, it knows that it is the
// terminal node in the contention chain. This means that it was the one
// that took off the thread that had acquired the mutex off the centralized
// state. Therefore, the current thread is the last in it's contention
// state. Therefore, the current thread is the last in its contention
// chain. It will fall back to centralized storage to pick up the next
// waiter or release the mutex
//
@@ -144,41 +169,354 @@ class WakerMetadata {
// prohitively large threshold to avoid heap allocations, this strategy
// however, might cause increased cache misses on wakeup signalling
std::uintptr_t waker_{0};
// the list of threads that the waker had previously seen to be sleeping on
// a futex(),
//
// this is given to the current thread as a means to pass on
// information. When the current thread goes to unlock the mutex and does
// not see contention, it should go and wake up the head of this list. If
// the current thread sees a contention chain on the mutex, it should pass
// on this list to the next thread that gets woken up
std::uintptr_t waiters_{0};
// The futex that this waiter will sleep on
//
// how can we reuse futex_ from above for futex management?
Futex<Atomic> sleeper_{kUninitialized};
};
/**
* Type of the type-erased callable that is used for combining from the lock
* holder's end. This has 48 bytes of inline storage that can be used to
* minimize cache misses when combining
*/
using CombineFunction = detail::InlineFunctionRef<void(), 48>;
/**
* Waiter encapsulates the state required for waiting on the mutex, this
* contains potentially heavy state and is intended to be allocated on the
* stack as part of a lock() function call
*
* To ensure that synchronization does not cause unintended side effects on
* the rest of the thread stack (eg. metadata in lockImplementation(), or any
* other data in the user's thread), we aggresively pad this struct and use
* custom alignment internally to ensure that the relevant data fits within a
* single cacheline. The added alignment here also gives us some room to
* wiggle in the bottom few bits of the mutex, where we store extra metadata
*/
template <template <typename> class Atomic>
class Waiter {
public:
explicit Waiter(std::uint64_t futex) : futex_{futex} {}
Waiter() = default;
Waiter(Waiter&&) = delete;
Waiter(const Waiter&) = delete;
Waiter& operator=(Waiter&&) = delete;
Waiter& operator=(const Waiter&) = delete;
void initialize(std::uint64_t futex, CombineFunction task) {
// we only initialize the function if we were actually given a non-null
// task, otherwise
if (task) {
DCHECK_EQ(futex, kCombineUninitialized);
new (&function_) CombineFunction{task};
} else {
DCHECK((futex == kUninitialized) || (futex == kAboutToWait));
new (&metadata_) WakerMetadata<Atomic>{};
}
// this pedantic store is needed to ensure that the waking thread
// synchronizes with the state in the waiter struct when it loads the
// value of the futex word
//
// on x86, this gets optimized away to just a regular store, it might be
// needed on platforms where explicit acquire-release barriers are
// required for synchronization
//
// note that we release here at the end of the constructor because
// construction is complete here, any thread that acquires this release
// will see a well constructed wait node
futex_.store(futex, std::memory_order_release);
}
std::array<std::uint8_t, hardware_destructive_interference_size> padding1;
// the atomic that this thread will spin on while waiting for the mutex to
// be unlocked
Atomic<std::uint64_t> futex_{kUninitialized};
alignas(hardware_destructive_interference_size) Atomic<std::uint64_t> futex_{
kUninitialized};
// metadata for the waker
WakerMetadata wakerMetadata_{};
// The successor of this node. This will be the thread that had its address
// on the mutex previously
std::uintptr_t next_{0};
// the list of threads that the waker had previously seen to be sleeping on
// a futex(),
//
// this is given to the current thread as a means to pass on
// information. When the current thread goes to unlock the mutex and does
// not see contention, it should go and wake up the head of this list. If
// the current thread sees a contention chain on the mutex, it should pass
// on this list to the next thread that gets woken up
std::uintptr_t waiters_{0};
// The futex that this waiter will sleep on
//
// how can we reuse futex_ from above for futex management?
Futex<Atomic> sleeper_{kUninitialized};
// We use an anonymous union for the combined critical section request and
// the metadata that will be filled in from the leader's end. Only one is
// active at a time - if a leader decides to combine the requested critical
// section into its execution, it will not touch the metadata field. If a
// leader decides to migrate the lock to the waiter, it will not touch the
// function
//
// this allows us to transfer more state when combining a critical section
// and reduce the cache misses originating from executing an arbitrary
// lambda
//
// note that this is an anonymous union, not an unnamed union, the members
// leak into the surrounding scope
union {
// metadata for the waker
WakerMetadata<Atomic> metadata_;
// The critical section that can potentially be combined into the critical
// section of the locking thread
//
// This is kept as a FunctionRef because the original function is preserved
// until the lock_combine() function returns. A consequence of using
// FunctionRef here is that we don't need to do any allocations and can
// allow users to capture unbounded state into the critical section. Flat
// combining means that the user does not have access to the thread
// executing the critical section, so assumptions about thread local
// references can be invalidated. Being able to capture arbitrary state
// allows the user to do thread local accesses right before the critical
// section and pass them as state to the callable being referenced here
CombineFunction function_;
// The user is allowed to use a combined critical section that returns a
// value. This buffer is used to implement the value transfer to the
// waiting thread. We reuse the same union because this helps us combine
// one synchronization operation with a material value transfer.
//
// The waker thread needs to synchronize on this cacheline to issue a
// wakeup to the waiter, meaning that the entire line needs to be pulled
// into the remote core in exclusive mode. So we reuse the coherence
// operation to transfer the return value in addition to the
// synchronization signal. In the case that the user's data item is
// small, the data is transferred all inline as part of the same line,
// which pretty much arrives into the CPU cache in the same clock cycle or
// two after a read-for-ownership request. This gives us a high chance of
// coalescing the entire transitive store buffer together into one cache
// coherence operation from the waker's end. This allows us to make use
// of the CPU bus bandwidth which would have otherwise gone to waste.
// Benchmarks prove this theory under a wide range of contention, value
// sizes, NUMA interactions and processor models
//
// The current version of the Intel optimization manual confirms this
// theory somewhat as well in section 2.3.5.1 (Load and Store Operation
// Overview)
//
// When an instruction writes data to a memory location [...], the
// processor ensures that it has the line containing this memory location
// is in its L1d cache [...]. If the cache line is not there, it fetches
// from the next levels using a RFO request [...] RFO and storing the
// data happens after instruction retirement. Therefore, the store
// latency usually does not affect the store instruction itself
//
// This gives the user the ability to input up to 48 bytes into the
// combined critical section through an InlineFunctionRef and output 48
// bytes from it basically without any cost. The type of the entity
// stored in the buffer has to be matched by the type erased callable that
// the caller has used. At this point, the caller is still in the
// template instantiation leading to the combine request, so it has
// knowledge of the return type and can apply the appropriate
// reinterpret_cast and launder operation to safely retrieve the data from
// this buffer
std::aligned_storage_t<48, 8> storage_;
};
std::array<std::uint8_t, hardware_destructive_interference_size> padding2;
};
/**
* A template that helps us differentiate between the different ways to return
* a value from a combined critical section. A return value of type void
* cannot be stored anywhere, so we use specializations and pick the right one
* switched through std::conditional_t
*
* This is then used by CoalescedTask and its family of functions to implement
* efficient return value transfers to the waiting threads
*/
template <typename Func>
class RequestWithReturn {
public:
using F = Func;
using ReturnType = folly::invoke_result_t<const Func&>;
explicit RequestWithReturn(Func func) : func_{std::move(func)} {}
/**
* We need to define the destructor here because C++ requires (with good
* reason) that a union with non-default destructor be explicitly destroyed
* from the surrounding class, as neither the runtime nor compiler have the
* knowledge of what to do with a union at the time of destruction
*
* Each request that has a valid return value set will have the value
* retrieved from the get() method, where the value is destroyed. So we
* don't need to destroy it here
*/
~RequestWithReturn() {}
/**
* This method can be used to return a value from the request. This returns
* the underlying value because return type of the function we were
* instantiated with is not void
*/
ReturnType get() && {
// when the return value has been processed, we destroy the value
// contained in this request. Using a scope_exit means that we don't have
// to worry about storing the value somewhere and causing potentially an
// extra move
//
// note that the invariant here is that this function is only called if the
// requesting thread had it's critical section combined, and the value_
// member constructed through detach()
SCOPE_EXIT {
value_.~ReturnType();
};
return std::move(value_);
}
// this contains a copy of the function the waiter had requested to be
// executed as a combined critical section
Func func_;
// this stores the return value used in the request, we use a union here to
// avoid laundering and allow return types that are not default
// constructible to be propagated through the execution of the critical
// section
//
// note that this is an anonymous union, the member leaks into the
// surrounding scope as a member variable
union {
ReturnType value_;
};
};
template <typename Func>
class RequestWithoutReturn {
public:
using F = Func;
using ReturnType = void;
explicit RequestWithoutReturn(Func func) : func_{std::move(func)} {}
/**
* In this version of the request class, get() returns nothing as there is
* no stored value
*/
void get() && {}
// this contains a copy of the function the waiter had requested to be
// executed as a combined critical section
Func func_;
};
// we need to use std::integral_constant::value here as opposed to
// std::integral_constant::operator T() because MSVC errors out with the
// implicit conversion
template <typename Func>
using Request = std::conditional_t<
std::is_same<folly::invoke_result_t<const Func&>, void>::value,
RequestWithoutReturn<Func>,
RequestWithReturn<Func>>;
/**
* A template that helps us to transform a callable returning a value to one
* that returns void so it can be type erased and passed on to the waker. The
* return value gets coalesced into the wait struct when it is small enough
* for optimal data transfer
*
* This helps a combined critical section feel more normal in the case where
* the user wants to return a value, for example
*
* auto value = mutex_.lock_combine([&]() {
* return data_.value();
* });
*
* Without this, the user would typically create a dummy object that they
* would then assign to from within the lambda. With return value chaining,
* this pattern feels more natural
*
* Note that it is important to copy the entire callble into this class.
* Storing something like a reference instead is not desirable because it does
* not allow InlineFunctionRef to use inline storage to represent the user's
* callable without extra indirections
*
* We use std::conditional_t and switch to the right type of task with the
* CoalescedTask type alias
*/
template <typename Func, typename Waiter>
class TaskWithCoalesce {
public:
using ReturnType = folly::invoke_result_t<const Func&>;
explicit TaskWithCoalesce(Func func, Waiter& waiter)
: func_{std::move(func)}, waiter_{waiter} {}
void operator()() const {
auto value = func_();
new (&waiter_.storage_) ReturnType{std::move(value)};
}
private:
Func func_;
Waiter& waiter_;
static_assert(alignof(decltype(waiter_.storage_)) >= alignof(ReturnType), "");
static_assert(sizeof(decltype(waiter_.storage_)) >= sizeof(ReturnType), "");
};
template <typename Func, typename Waiter>
class TaskWithoutCoalesce {
public:
using ReturnType = void;
explicit TaskWithoutCoalesce(Func func, Waiter&) : func_{std::move(func)} {}
void operator()() const {
func_();
}
private:
Func func_;
};
// we need to use std::integral_constant::value here as opposed to
// std::integral_constant::operator T() because MSVC errors out with the
// implicit conversion
template <typename Func, typename Waiter>
using CoalescedTask = std::conditional_t<
std::is_void<folly::invoke_result_t<const Func&>>::value,
TaskWithoutCoalesce<Func, Waiter>,
TaskWithCoalesce<Func, Waiter>>;
/**
* Given a request and a wait node, coalesce them into a CoalescedTask that
* coalesces the return value into the wait node when invoked from a remote
* thread
*
* When given a null request through nullptr_t, coalesce() returns null as well
*/
template <typename Waiter>
std::nullptr_t coalesce(std::nullptr_t&, Waiter&) {
return nullptr;
}
template <
typename Request,
typename Waiter,
typename Func = typename Request::F>
CoalescedTask<Func, Waiter> coalesce(Request& request, Waiter& waiter) {
static_assert(!std::is_same<Request, std::nullptr_t>{}, "");
return CoalescedTask<Func, Waiter>{request.func_, waiter};
}
/**
* Given a CoalescedTask, a wait node and a request. Detach the return value
* into the request from the wait node and task.
*/
template <typename Waiter>
void detach(std::nullptr_t&, Waiter&) {}
template <typename Waiter, typename F>
void detach(RequestWithoutReturn<F>&, Waiter&) {}
template <typename Waiter, typename F>
void detach(RequestWithReturn<F>& request, Waiter& waiter) {
using ReturnType = typename RequestWithReturn<F>::ReturnType;
static_assert(!std::is_same<ReturnType, void>{}, "");
auto& val = *folly::launder(reinterpret_cast<ReturnType*>(&waiter.storage_));
new (&request.value_) ReturnType{std::move(val)};
val.~ReturnType();
}
/**
* Get the time since epoch in nanoseconds
*
@@ -198,14 +536,14 @@ inline std::chrono::nanoseconds time() {
* address from a uintptr_t
*/
template <typename Type>
Type* extractAddress(std::uintptr_t from) {
Type* extractPtr(std::uintptr_t from) {
// shift one bit off the end, to get all 1s followed by a single 0
auto mask = std::numeric_limits<std::uintptr_t>::max();
mask >>= 1;
mask <<= 1;
CHECK(!(mask & 0b1));
return reinterpret_cast<Type*>(from & mask);
return folly::bit_cast<Type*>(from & mask);
}
/**
@@ -241,7 +579,9 @@ class DistributedMutex<Atomic, TimePublishing>::DistributedMutexStateProxy {
next_ = std::exchange(other.next_, nullptr);
expected_ = std::exchange(other.expected_, 0);
wakerMetadata_ = std::exchange(other.wakerMetadata_, {});
timedWaiters_ = std::exchange(other.timedWaiters_, false);
combined_ = std::exchange(other.combined_, false);
waker_ = std::exchange(other.waker_, 0);
waiters_ = std::exchange(other.waiters_, nullptr);
ready_ = std::exchange(other.ready_, nullptr);
@@ -260,23 +600,25 @@ class DistributedMutex<Atomic, TimePublishing>::DistributedMutexStateProxy {
friend class DistributedMutex<Atomic, TimePublishing>;
DistributedMutexStateProxy(
CachelinePadded<Waiter<Atomic>>* next,
Waiter<Atomic>* next,
std::uintptr_t expected,
bool timedWaiter = false,
WakerMetadata wakerMetadata = {},
bool combined = false,
CachelinePadded<Waiter<Atomic>>* waiters = nullptr,
std::uintptr_t waker = 0,
CachelinePadded<Waiter<Atomic>>* ready = nullptr)
Waiter<Atomic>* waiters = nullptr,
Waiter<Atomic>* ready = nullptr)
: next_{next},
expected_{expected},
timedWaiters_{timedWaiter},
wakerMetadata_{wakerMetadata},
combined_{combined},
waker_{waker},
waiters_{waiters},
ready_{ready} {}
// the next thread that is to be woken up, this being null at the time of
// unlock() shows that the current thread acquired the mutex without
// contention or it was the terminal thread in the queue of threads waking up
CachelinePadded<Waiter<Atomic>>* next_{nullptr};
Waiter<Atomic>* next_{nullptr};
// this is the value that the current thread should expect to find on
// unlock, and if this value is not there on unlock, the current thread
// should assume that other threads are enqueued waiting for the mutex
@@ -298,18 +640,22 @@ class DistributedMutex<Atomic, TimePublishing>::DistributedMutexStateProxy {
// done so we can avoid having to issue a atomic_notify_all() call (and
// subsequently a thundering herd) when waking up timed-wait threads
bool timedWaiters_{false};
// a boolean that contains true if the state proxy is not meant to be passed
// to the unlock() function. This is set only when there is contention and
// a thread had asked for its critical section to be combined
bool combined_{false};
// metadata passed along from the thread that woke this thread up
WakerMetadata wakerMetadata_{};
std::uintptr_t waker_{0};
// the list of threads that are waiting on a futex
//
// the current threads is meant to wake up this list of waiters if it is
// able to commit an unlock() on the mutex without seeing a contention chain
CachelinePadded<Waiter<Atomic>>* waiters_{nullptr};
Waiter<Atomic>* waiters_{nullptr};
// after a thread has woken up from a futex() call, it will have the rest of
// the threads that it were waiting behind it in this list, a thread that
// unlocks has to wake up threads from this list if it has any, before it
// goes to sleep to prevent pathological unfairness
CachelinePadded<Waiter<Atomic>>* ready_{nullptr};
Waiter<Atomic>* ready_{nullptr};
};
template <template <typename> class Atomic, bool TimePublishing>
@@ -317,8 +663,9 @@ DistributedMutex<Atomic, TimePublishing>::DistributedMutex()
: state_{kUnlocked} {}
template <typename Waiter>
bool spin(Waiter& waiter) {
bool spin(Waiter& waiter, std::uint32_t& sig, std::uint32_t mode) {
auto spins = 0;
auto waitMode = (mode == kCombineUninitialized) ? kCombineWaiting : kWaiting;
while (true) {
// publish our current time in the futex as a part of the spin waiting
// process
@@ -328,14 +675,15 @@ bool spin(Waiter& waiter) {
// timestamp to force the waking thread to skip us
++spins;
auto now = (spins < kMaxSpins) ? time() : decltype(time())::zero();
auto data = strip(now) | kWaiting;
auto data = strip(now) | waitMode;
auto signal = waiter.futex_.exchange(data, std::memory_order_acq_rel);
signal &= std::numeric_limits<std::uint8_t>::max();
// if we got skipped, make a note of it and return if we got a skipped
// signal or a signal to wake up
auto skipped = signal == kSkipped;
auto skipped = (signal == kSkipped);
if (skipped || (signal == kWake)) {
if (skipped || (signal == kWake) || (signal == kCombined)) {
sig = signal;
return !skipped;
}
@@ -379,7 +727,7 @@ void doFutexWake(Waiter* waiter) {
//
// this dangilng pointer possibility is why we use a pointer to the futex
// word, and avoid dereferencing after the store() operation
auto sleeper = &(*waiter)->sleeper_;
auto sleeper = &waiter->metadata_.sleeper_;
sleeper->store(kWake, std::memory_order_release);
futexWake(sleeper, 1);
}
@@ -389,7 +737,7 @@ template <typename Waiter>
bool doFutexWait(Waiter* waiter, Waiter*& next) {
// first we get ready to sleep by calling exchange() on the futex with a
// kSleeping value
DCHECK((*waiter)->futex_.load(std::memory_order_relaxed) == kAboutToWait);
DCHECK(waiter->futex_.load(std::memory_order_relaxed) == kAboutToWait);
// note the semantics of using a futex here, when we exchange the sleeper_
// with kSleeping, we are getting ready to sleep, but before sleeping we get
@@ -397,7 +745,8 @@ bool doFutexWait(Waiter* waiter, Waiter*& next) {
// sleeper_ might have changed. We can also wake up because of a spurious
// wakeup, so we always check against the value in sleeper_ after returning
// from futexWait(), if the value is not kWake, then we continue
auto pre = (*waiter)->sleeper_.exchange(kSleeping, std::memory_order_acq_rel);
auto pre =
waiter->metadata_.sleeper_.exchange(kSleeping, std::memory_order_acq_rel);
// Seeing a kSleeping on a futex word before we set it ourselves means only
// one thing - an unlocking thread caught us before we went to futex(), and
@@ -424,25 +773,25 @@ bool doFutexWait(Waiter* waiter, Waiter*& next) {
// Because the corresponding futexWake() above does not synchronize
// wakeups around the futex word. Because doing so would become
// inefficient
futexWait(&(*waiter)->sleeper_, kSleeping);
futexWait(&waiter->metadata_.sleeper_, kSleeping);
pre = (*waiter)->sleeper_.load(std::memory_order_acquire);
pre = waiter->metadata_.sleeper_.load(std::memory_order_acquire);
DCHECK((pre == kSleeping) || (pre == kWake));
}
// when coming out of a futex, we might have some other sleeping threads
// that we were supposed to wake up, assign that to the next pointer
DCHECK(next == nullptr);
next = extractAddress<Waiter>((*waiter)->next_);
next = extractPtr<Waiter>(waiter->next_);
return false;
}
template <typename Waiter>
bool wait(Waiter* waiter, bool shouldSleep, Waiter*& next) {
bool wait(Waiter* waiter, std::uint32_t mode, Waiter*& next, uint32_t& signal) {
if (shouldSleep) {
if (mode == kAboutToWait) {
return doFutexWait(waiter, next);
}
return spin(**waiter);
return spin(*waiter, signal, mode);
}
inline void recordTimedWaiterAndClearTimedBit(
@@ -461,26 +810,131 @@ inline void recordTimedWaiterAndClearTimedBit(
}
}
template <typename Atomic>
void wakeTimedWaiters(Atomic* state, bool timedWaiters) {
if (UNLIKELY(timedWaiters)) {
atomic_notify_one(state);
}
}
template <template <typename> class Atomic, bool TimePublishing>
template <typename Func>
auto DistributedMutex<Atomic, TimePublishing>::lock_combine(Func func) noexcept
-> folly::invoke_result_t<const Func&> {
// invoke the lock implementation function and check whether we came out of
// it with our task executed as a combined critical section. This usually
// happens when the mutex is contended.
//
// In the absence of contention, we just return from the try_lock() function
// with the lock acquired. So we need to invoke the task and unlock
// the mutex before returning
auto&& task = Request<Func>{func};
auto&& state = lockImplementation(*this, state_, task);
if (!state.combined_) {
// to avoid having to play a return-value dance when the combinable
// returns void, we use a scope exit to perform the unlock after the
// function return has been processed
SCOPE_EXIT {
unlock(std::move(state));
};
return func();
}
// if we are here, that means we were able to get our request combined, we
// can return the value that was transferred to us
//
// each thread that enqueues as a part of a contention chain takes up the
// responsibility of any timed waiter that had come immediately before it,
// so we wake up timed waiters before exiting the lock function. Another
// strategy might be to add the timed waiter information to the metadata and
// let a single leader wake up a timed waiter for better concurrency. But
// this has proven not to be useful in benchmarks beyond a small 5% delta,
// so we avoid taking the complexity hit and branch to wake up timed waiters
// from each thread
wakeTimedWaiters(&state_, state.timedWaiters_);
return std::move(task).get();
}
template <template <typename> class Atomic, bool TimePublishing>
typename DistributedMutex<Atomic, TimePublishing>::DistributedMutexStateProxy
DistributedMutex<Atomic, TimePublishing>::lock() {
auto null = nullptr;
return lockImplementation(*this, state_, null);
}
template <template <typename> class Atomic, bool TimePublishing>
template <typename Rep, typename Period, typename Func, typename ReturnType>
folly::Optional<ReturnType>
DistributedMutex<Atomic, TimePublishing>::try_lock_combine_for(
const std::chrono::duration<Rep, Period>& duration,
Func func) noexcept {
auto state = try_lock_for(duration);
if (state) {
SCOPE_EXIT {
unlock(std::move(state));
};
return func();
}
return folly::none;
}
template <template <typename> class Atomic, bool TimePublishing>
template <typename Clock, typename Duration, typename Func, typename ReturnType>
folly::Optional<ReturnType>
DistributedMutex<Atomic, TimePublishing>::try_lock_combine_until(
const std::chrono::time_point<Clock, Duration>& deadline,
Func func) noexcept {
auto state = try_lock_until(deadline);
if (state) {
SCOPE_EXIT {
unlock(std::move(state));
};
return func();
}
return folly::none;
}
template <
template <typename> class Atomic,
bool TimePublishing,
typename State,
typename Request>
typename DistributedMutex<Atomic, TimePublishing>::DistributedMutexStateProxy
lockImplementation(
DistributedMutex<Atomic, TimePublishing>& mutex,
State& atomic,
Request& request) {
// first try and acquire the lock as a fast path, the underlying
// implementation is slightly faster than using std::atomic::exchange() as
// is used in this function. So we get a small perf boost in the
// uncontended case
if (auto state = try_lock()) {
//
// We only go through this fast path for the lock/unlock usage and avoid this
// for combined critical sections. This check adds unnecessary overhead in
// that case as it causes an extra cacheline bounce
constexpr auto combineRequested = !std::is_same<Request, std::nullptr_t>{};
if (!combineRequested) {
if (auto state = mutex.try_lock()) {
return state;
}
}
auto previous = std::uintptr_t{0};
auto waitMode = kUninitialized;
auto waitMode = combineRequested ? kCombineUninitialized : kUninitialized;
auto nextWaitMode = kAboutToWait;
auto timedWaiter = false;
CachelinePadded<Waiter<Atomic>>* nextSleeper = nullptr;
Waiter<Atomic>* nextSleeper = nullptr;
while (true) {
// construct the state needed to wait
auto&& state = CachelinePadded<Waiter<Atomic>>{waitMode};
auto&& address = reinterpret_cast<std::uintptr_t>(&state);
//
// We can't use auto here because MSVC errors out due to a missing copy
// constructor
Waiter<Atomic> state{};
auto&& task = coalesce(request, state);
auto&& address = folly::bit_cast<std::uintptr_t>(&state);
state.initialize(waitMode, std::move(task));
DCHECK(!(address & 0b1));
// set the locked bit in the address we will be persisting in the mutex
@@ -496,17 +950,24 @@ DistributedMutex<Atomic, TimePublishing>::lock() {
// other threads that read the address of this value should see the full
// well-initialized node we are going to wait on if the mutex acquisition
// was unsuccessful
previous = state_.exchange(address, std::memory_order_acq_rel);
previous = atomic.exchange(address, std::memory_order_acq_rel);
recordTimedWaiterAndClearTimedBit(timedWaiter, previous);
state->next_ = previous;
state.next_ = previous;
if (previous == kUnlocked) {
return {nullptr, address, timedWaiter, {}, nullptr, nextSleeper};
return {/* next */ nullptr,
/* expected */ address,
/* timedWaiter */ timedWaiter,
/* combined */ false,
/* waker */ 0,
/* waiters */ nullptr,
/* ready */ nextSleeper};
}
DCHECK(previous & kLocked);
// wait until we get a signal from another thread, if this returns false,
// we got skipped and had probably been scheduled out, so try again
if (!wait(&state, (waitMode == kAboutToWait), nextSleeper)) {
auto signal = kUninitialized;
if (!wait(&state, waitMode, nextSleeper, signal)) {
std::swap(waitMode, nextWaitMode);
continue;
}
@@ -531,52 +992,172 @@ DistributedMutex<Atomic, TimePublishing>::lock() {
// relationship until broken
auto next = previous;
auto expected = address;
if (previous == state->wakerMetadata_.waker_) {
if (previous == state.metadata_.waker_) {
next = 0;
expected = kLocked;
}
// if we were given a combine signal, detach the return value from the
// wait struct into the request, so the current thread can access it
// outside this function
if (signal == kCombined) {
detach(request, state);
}
// if we are just coming out of a futex call, then it means that the next
// waiter we are responsible for is also a waiter waiting on a futex, so
// we return that list in the list of ready threads. We wlil be waking up
// the ready threads on unlock no matter what
return {extractAddress<CachelinePadded<Waiter<Atomic>>>(next),
expected,
timedWaiter,
state->wakerMetadata_,
extractAddress<CachelinePadded<Waiter<Atomic>>>(state->waiters_),
nextSleeper};
return {/* next */ extractPtr<Waiter<Atomic>>(next),
/* expected */ expected,
/* timedWaiter */ timedWaiter,
/* combined */ combineRequested && (signal == kCombined),
/* waker */ state.metadata_.waker_,
/* waiters */ extractPtr<Waiter<Atomic>>(state.metadata_.waiters_),
/* ready */ nextSleeper};
}
}
inline bool preempted(std::uint64_t value) {
inline bool preempted(std::uint64_t value, std::chrono::nanoseconds now) {
auto currentTime = recover(strip(time()));
auto currentTime = recover(strip(now));
auto nodeTime = recover(value);
auto preempted = currentTime > nodeTime + kScheduledAwaySpinThreshold.count();
// we say that the thread has been preempted if its timestamp says so, and
// also if it is neither uninitialized nor skipped
DCHECK(value != kSkipped);
return (preempted) && (value != kUninitialized);
return (preempted) && (value != kUninitialized) &&
(value != kCombineUninitialized);
}
inline bool isSleeper(std::uintptr_t value) {
return (value == kAboutToWait);
}
inline bool isInitialized(std::uintptr_t value) {
return (value != kUninitialized) && (value != kCombineUninitialized);
}
inline bool isCombiner(std::uintptr_t value) {
auto mode = (value & 0xff);
return (mode == kCombineWaiting) || (mode == kCombineUninitialized);
}
inline bool isWaitingCombiner(std::uintptr_t value) {
return (value & 0xff) == kCombineWaiting;
}
template <typename Waiter>
CombineFunction loadTask(Waiter* current, std::uintptr_t value) {
// if we know that the waiter is a combiner of some sort, it is safe to read
// and copy the value of the function in the waiter struct, since we know
// that a waiter would have set it before enqueueing
if (isCombiner(value)) {
return current->function_;
}
return nullptr;
}
template <template <typename> class Atomic>
std::uintptr_t tryCombine(
std::uintptr_t value,
Waiter<Atomic>* waiter,
std::uint64_t iteration,
std::chrono::nanoseconds now,
CombineFunction task) {
// it is important to load the value of next_ before checking the value of
// function_ in the next if condition. This is because of two things, the
// first being cache locality - it is helpful to read the value of the
// variable that is closer to futex_, since we just loaded from that before
// entering this function. The second is cache coherence, the wait struct
// is shared between two threads, one thread is spinning on the futex
// waiting for a signal while the other is possibly combining the requested
// critical section into its own. This means that there is a high chance
// we would cause the cachelines to bounce between the threads in the next
// if block.
//
// This leads to a degenerate case where the FunctionRef object ends up in a
// different cacheline thereby making it seem like benchmarks avoid this
// problem. When compiled differently (eg. with link time optimization)
// the wait struct ends up on the stack in a manner that causes the
// FunctionRef object to be in the same cacheline as the other data, thereby
// forcing the current thread to bounce on the cacheline twice (first to
// load the data from the other thread, that presumably owns the cacheline
// due to timestamp publishing) and then to signal the thread
//
// To avoid this sort of non-deterministic behavior based on compilation and
// stack layout, we load the value before executing the other thread's
// critical section
//
// Note that the waiting thread writes the value to the wait struct after
// enqueuing, but never writes to it after the value in the futex_ is
// initialized (showing that the thread is in the spin loop); this makes it
// safe for us to read next_ without synchronization
auto next = std::uintptr_t{0};
if (isInitialized(value)) {
next = waiter->next_;
}
// if the waiter has asked for a combine operation, we should combine its
// critical section and move on to the next waiter
//
// the waiter is combinable if the following conditions are satisfied
//
// 1) the state in the futex word is not uninitialized (kUninitialized)
// 2) it has a valid combine function
// 3) we are not past the limit of the number of combines we can perform
// or the waiter thread has been preempted. If the waiter gets preempted,
// it is better to just execute its critical section before moving on,
// as it will have to re-queue itself after preemption anyway,
// leading to further delays in critical section completion
//
// if all the above are satisfied, then we can combine the critical section.
// Note that it is only safe to read from the waiter struct if the value is
// not uninitialized. If the state is not uninitialized, we have synchronized
// with the write to the next_ member in the lock function. If the value is
// uninitialized, reading the next_ value would be a race
if (isWaitingCombiner(value) &&
(iteration <= kMaxCombineIterations || preempted(value, now))) {
task();
waiter->futex_.store(kCombined, std::memory_order_release);
return next;
}
return 0;
}
template <typename Waiter>
std::uintptr_t tryWake(
bool publishing,
Waiter* waiter,
std::uintptr_t value,
std::uintptr_t waker,
Waiter*& sleepers,
std::uint64_t iteration,
CombineFunction task) {
// try and combine the waiter's request first, if that succeeds that means
// we have successfully executed their critical section and can move on to
// the rest of the chain
auto now = time();
if (auto next = tryCombine(value, waiter, iteration, now, task)) {
return next;
}
// first we see if we can wake the current thread that is spinning
if ((!publishing || !preempted(value, now)) && !isSleeper(value)) {
// the Metadata class should be trivially destructible as we use placement
// new to set the relevant metadata without calling any destructor. We
// need to use placement new because the class contains a futex, which is
// non-movable and non-copyable
using Metadata = std::decay_t<decltype(waiter->metadata_)>;
static_assert(std::is_trivially_destructible<Metadata>{}, "");
// we need release here because of the write to waker_ and also because we
// are unlocking the mutex, the thread we do the handoff to here should
// see the modified data
new (&waiter->metadata_) Metadata{waker, bit_cast<uintptr_t>(sleepers)};
waiter->futex_.store(kWake, std::memory_order_release);
return 0;
}
...@@ -600,9 +1181,10 @@ std::uintptr_t tryWake(
// still sees the locked bit, and never gets woken up
//
// Can we relax this?
DCHECK(preempted(value, now));
DCHECK(!isCombiner(value));
auto next = waiter->next_;
waiter->futex_.store(kSkipped, std::memory_order_release);
return next;
}
...@@ -623,15 +1205,16 @@ std::uintptr_t tryWake(
// that the thread was already sleeping, we have synchronized with the write
// to next_ in the context of the sleeping thread
//
// Also we need to set the value of waiters_ and waker_ in the thread before
// doing the exchange because we need to pass on the list of sleepers in the
// event that we were able to catch the thread before it went to futex().
// If we were unable to catch the thread before it slept, these fields will
// be ignored when the thread wakes up anyway
DCHECK(isSleeper(value));
waiter->metadata_.waker_ = waker;
waiter->metadata_.waiters_ = folly::bit_cast<std::uintptr_t>(sleepers);
auto pre =
waiter->metadata_.sleeper_.exchange(kSleeping, std::memory_order_acq_rel);
// we were able to catch the thread before it went to sleep, return true
if (pre != kSleeping) {
...@@ -643,8 +1226,8 @@ std::uintptr_t tryWake(
//
// we also need to collect this sleeper in the list of sleepers being built
// up
auto next = waiter->next_;
waiter->next_ = folly::bit_cast<std::uintptr_t>(sleepers);
sleepers = waiter;
return next;
}
...@@ -653,15 +1236,24 @@ template <typename Waiter>
bool wake(
bool publishing,
Waiter& waiter,
std::uintptr_t waker,
Waiter*& sleepers,
std::uint64_t iter) {
// loop till we find a node that is either at the end of the list (as
// specified by waker) or we find a node that is active (as specified by
// the last published timestamp of the node)
auto current = &waiter;
while (current) {
// it is important that we load the value of function after the initial
// acquire load. This is required because we need to synchronize with the
// construction of the waiter struct before reading from it
auto value = current->futex_.load(std::memory_order_acquire);
auto task = loadTask(current, value);
auto next =
tryWake(publishing, current, value, waker, sleepers, iter, task);
// if there is no next node, we have managed to wake someone up and have
// successfully migrated the lock to another thread
if (!next) {
return true;
}
...@@ -670,20 +1262,12 @@ bool wake(
// it, this is because after we skip it the node might wake up and enqueue
// itself, and thereby gain a new next node
CHECK(publishing);
current = (next == waker) ? nullptr : extractPtr<Waiter>(next);
}
return false;
}
template <typename Atomic>
void wakeTimedWaiters(Atomic* state, bool timedWaiters) {
if (UNLIKELY(timedWaiters)) {
atomic_notify_one(state);
}
}
template <typename Atomic, typename Proxy, typename Sleepers>
bool tryUnlockClean(Atomic& state, Proxy& proxy, Sleepers sleepers) {
auto expected = proxy.expected_;
...@@ -717,6 +1301,7 @@ void DistributedMutex<Atomic, Publish>::unlock(
DistributedMutex::DistributedMutexStateProxy proxy) {
// we always wake up ready threads and timed waiters if we saw either
DCHECK(proxy) << "Invalid proxy passed to DistributedMutex::unlock()";
DCHECK(!proxy.combined_) << "Cannot unlock mutex after a successful combine";
SCOPE_EXIT {
doFutexWake(proxy.ready_);
wakeTimedWaiters(&state_, proxy.timedWaiters_);
...@@ -726,7 +1311,7 @@ void DistributedMutex<Atomic, Publish>::unlock(
// don't bother with the mutex state
auto sleepers = proxy.waiters_;
if (proxy.next_) {
if (wake(Publish, *proxy.next_, proxy.waker_, sleepers, 0)) {
return;
}
...@@ -741,7 +1326,7 @@ void DistributedMutex<Atomic, Publish>::unlock(
proxy.expected_ = kLocked;
}
for (std::uint64_t i = 0; true; ++i) {
// otherwise, since we don't have anyone we need to wake up, we try and
// release the mutex just as is
//
...@@ -764,13 +1349,10 @@ void DistributedMutex<Atomic, Publish>::unlock(
// terminal node of the new chain will see kLocked in the central storage
auto head = state_.exchange(kLocked, std::memory_order_acq_rel);
recordTimedWaiterAndClearTimedBit(proxy.timedWaiters_, head);
auto next = extractPtr<Waiter<Atomic>>(head);
auto expected = std::exchange(proxy.expected_, kLocked);
DCHECK((head & kLocked) && (head != kLocked)) << "incorrect state " << head;
if (wake(Publish, *next, expected, sleepers, i)) {
break;
}
}
...
/*
* Copyright 2004-present Facebook, Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include <folly/synchronization/DistributedMutex.h>
namespace folly {
namespace detail {
namespace distributed_mutex {
template class DistributedMutex<std::atomic, true>;
} // namespace distributed_mutex
} // namespace detail
} // namespace folly
...@@ -15,6 +15,9 @@
*/
#pragma once
#include <folly/Optional.h>
#include <folly/functional/Invoke.h>
#include <atomic>
#include <chrono>
#include <cstdint>
...@@ -26,32 +29,39 @@ namespace distributed_mutex {
/**
* DistributedMutex is a small, exclusive-only mutex that distributes the
* bookkeeping required for mutual exclusion in the stacks of threads that are
* contending for it. It has a mode that can combine critical sections when
* the mutex experiences contention; this allows the implementation to elide
* several expensive coherence and synchronization operations to boost
* throughput, surpassing even some atomic CAS instructions in some cases. It
* has no dependencies on heap allocation and tries to come at a lower space
* cost than std::mutex while still trying to maintain the fairness benefits
* that come from using std::mutex. DistributedMutex provides the entire API
* included in std::mutex, and more, with slight modifications. It is the
* same width as a single pointer (8 bytes on most platforms), whereas
* std::mutex and pthread_mutex_t are both 40 bytes. It is larger than some
* of the other smaller locks, but the wide majority of cases using the small
* locks are wasting the difference in alignment padding anyway
*
* Benchmark results are good - at the time of writing, in the common
* uncontended case, it is a few cycles faster than folly::MicroLock but a bit
* slower than std::mutex. In the contended case, for lock/unlock based
* critical sections, it is about 4-5x faster than some of the smaller locks
* and about ~2x faster than std::mutex. When used in combinable mode, it can
* go more than 10x faster than the small locks, about 6x faster than
* std::mutex and up to 2-3x faster than the implementations of flat combining
* we benchmarked against. DistributedMutex is also resistant to tail latency
* pathologies, unlike many of the other mutexes in use, which sleep for large
* time quantums to reduce spin churn; this causes elevated latencies for
* threads that enter the sleep cycle. The tail latency of lock acquisition
* can go up to 10x lower because of a more deterministic scheduling algorithm
* that is managed almost entirely in userspace
*
* DistributedMutex reduces cache line contention in userspace and in the
* kernel by making each thread wait on a thread local spinlock and futex.
* This allows threads to keep working only on their own cache lines without
* requiring cache coherence operations when a mutex sees heavy contention.
* This strategy does not require sequential ordering on the centralized
* atomic storage for wakeup operations, as each thread is assigned its own
* wait state
*
* Non-timed mutex acquisitions are scheduled through intrusive LIFO
* contention chains. Each thread starts by spinning for a short quantum and
...@@ -88,6 +98,23 @@ namespace distributed_mutex {
* own, thinking a mutex is functionally identical to a binary semaphore,
* which, unlike a mutex, is a suitable primitive for that usage
*
* Combined critical sections allow the implementation to elide several
* expensive operations during the lifetime of a critical section that cause
* slowdowns with regular lock/unlock based usage. DistributedMutex resolves
* contention through combining up to a constant factor of 2 contention chains
* to prevent issues with fairness and latency outliers, so we retain the
* fairness benefits of the lock/unlock implementation with no noticeable
* regression when switching between the lock methods. Despite the efficiency
* benefits, combined critical sections can only be used when the critical
* section does not depend on thread local state and does not introduce new
* dependencies between threads when the critical section gets combined. For
* example, locking or unlocking an unrelated mutex in a combined critical
* section might lead to unexpected results or even undefined behavior. This
* can happen if, for example, a different thread unlocks a mutex locked by
* the calling thread, leading to undefined behavior as the mutex might not
* allow locking and unlocking from unrelated threads (the posix and C++
* standard disallow this usage for their mutexes)
*
* Timed locking through DistributedMutex is implemented through a centralized
* algorithm - all waiters wait on the central mutex state, by setting and
* resetting bits within the pointer-length word. Since pointer length atomic
...@@ -121,8 +148,15 @@ class DistributedMutex {
*
* The proxy has no public API and is intended to be for internal usage only
*
* There are three notable cases where undefined behavior might come up:
* - This is not a recursive mutex. Trying to acquire the mutex twice from
* the same thread without unlocking it results in undefined behavior
* - Thread, coroutine or fiber migrations are disallowed. This is because
* the implementation requires owning the stack frame through the
* execution of the critical section for both lock/unlock or combined
* critical sections. This means that you cannot allow another thread,
* fiber or coroutine to unlock the mutex
* - This mutex cannot be used in a program compiled with segmented stacks
*/
DistributedMutexStateProxy lock();
...@@ -132,6 +166,9 @@ class DistributedMutex {
* The proxy returned by lock must be passed to unlock as an rvalue. No
* other option is possible here, since the proxy is only movable and not
* copyable
*
* It is undefined behavior to unlock from a thread that did not lock the
* mutex
*/
void unlock(DistributedMutexStateProxy);
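A minimal usage sketch of the proxy based lock()/unlock() pair (illustrative only; `mutex` is assumed to be a DistributedMutex instance, and the proxy must stay on the thread that called lock()):
```
auto proxy = mutex.lock();
// ... critical section runs on the same thread that called lock() ...
mutex.unlock(std::move(proxy));
```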
...@@ -173,6 +210,102 @@ class DistributedMutex {
DistributedMutexStateProxy try_lock_until(
const std::chrono::time_point<Clock, Duration>& deadline);
/**
* Execute a task as a combined critical section
*
* Unlike traditional lock and unlock methods, lock_combine() enqueues the
* passed task for execution on any arbitrary thread. This allows the
* implementation to prevent cache line invalidations originating from
* expensive synchronization operations. The thread holding the lock is
* allowed to execute the task before unlocking, thereby forming a "combined
* critical section".
*
* This idea is inspired by Flat Combining. Flat Combining was introduced
* in the SPAA 2010 paper titled "Flat Combining and the
* Synchronization-Parallelism Tradeoff", by Danny Hendler, Itai Incze, Nir
* Shavit, and Moran Tzafrir -
* https://www.cs.bgu.ac.il/~hendlerd/papers/flat-combining.pdf. The
* implementation used here is significantly different from that described
* in the paper. The high-level goal of reducing the overhead of
* synchronization, however, is the same.
*
* Combined critical sections work best when kept simple. Since the
* critical section might be executed on any arbitrary thread, relying on
* things like thread local state or mutex locking and unlocking might cause
* incorrectness. Associativity is important. For example
*
* auto one = std::unique_lock{one_};
* two_.lock_combine([&]() {
* if (bar()) {
* one.unlock();
* }
* });
*
* This has the potential to cause undefined behavior because mutexes are
* only meant to be acquired and released from the owning thread. Similar
* errors can arise from a combined critical section introducing implicit
* dependencies based on the state of the combining thread. For example
*
* // thread 1
* auto one = std::unique_lock{one_};
* auto two = std::unique_lock{two_};
*
* // thread 2
* two_.lock_combine([&]() {
* auto three = std::unique_lock{three_};
* });
*
* Here, because we used a combined critical section, we have introduced a
* dependency from one -> three that might not be obvious to the reader
*
* There are three notable cases where undefined behavior might come up:
* - This is not a recursive mutex. Trying to acquire the mutex twice from
* the same thread without unlocking it results in undefined behavior
* - Thread, coroutine or fiber migrations are disallowed. This is because
* the implementation requires the locking entity to own the stack frame
* through the execution of the critical section for both lock/unlock or
* combined critical sections. This means that you cannot allow another
* thread, fiber or coroutine to unlock the mutex
* - This mutex cannot be used in a program compiled with segmented stacks
*/
template <typename Task>
auto lock_combine(Task task) noexcept -> folly::invoke_result_t<const Task&>;
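A small sketch of the value-returning form described above (illustrative only; the map and mutex here are assumptions, not part of the header):
```
auto guarded = std::map<int, int>{};
auto mutex = folly::DistributedMutex{};

// the value returned by the critical section is forwarded out of lock_combine
auto value = mutex.lock_combine([&]() {
  auto it = guarded.find(3);
  return (it == guarded.end()) ? 0 : it->second;
});
```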
/**
* Try to combine a task as a combined critical section until the given time
*
* Like the other try_lock() methods, this is allowed to fail spuriously,
* and is not guaranteed to return true even when the mutex is currently
* unlocked.
*
* Note that this does not necessarily have the same performance
* characteristics as the non-timed version of the combine method. If
* performance is critical, use that one instead
*/
template <
typename Rep,
typename Period,
typename Task,
typename ReturnType = decltype(std::declval<Task&>()())>
folly::Optional<ReturnType> try_lock_combine_for(
const std::chrono::duration<Rep, Period>& duration,
Task task) noexcept;
/**
* Try to combine a task as a combined critical section until the given time
*
* Other than the difference in the meaning of the second argument, the
* semantics of this function are identical to try_lock_combine_for()
*/
template <
typename Clock,
typename Duration,
typename Task,
typename ReturnType = decltype(std::declval<Task&>()())>
folly::Optional<ReturnType> try_lock_combine_until(
const std::chrono::time_point<Clock, Duration>& deadline,
Task task) noexcept;
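A minimal usage sketch for the timed variants (illustrative only; the 10ms deadline and the counter are assumptions). The returned folly::Optional is empty when the acquisition times out and otherwise holds the critical section's return value:
```
auto counter = std::uint64_t{0};
auto mutex = folly::DistributedMutex{};

auto result = mutex.try_lock_combine_for(std::chrono::milliseconds{10}, [&]() {
  return ++counter;
});
if (result) {
  // the critical section ran; *result holds the incremented value
}
```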
private:
Atomic<std::uintptr_t> state_{0};
};
...@@ -184,6 +317,7 @@ class DistributedMutex {
* Bring the default instantiation of DistributedMutex into the folly
* namespace without requiring any template arguments for public usage
*/
extern template class detail::distributed_mutex::DistributedMutex<>;
using DistributedMutex = detail::distributed_mutex::DistributedMutex<>;
} // namespace folly
...
...@@ -16,12 +16,14 @@
#include <folly/synchronization/DistributedMutex.h>
#include <folly/MapUtil.h>
#include <folly/Synchronized.h>
#include <folly/container/Array.h>
#include <folly/container/Foreach.h>
#include <folly/portability/GTest.h>
#include <folly/synchronization/Baton.h>
#include <folly/test/DeterministicSchedule.h>
#include <chrono>
#include <cmath>
#include <thread>
using namespace std::literals;
...@@ -186,6 +188,7 @@ void atomic_notify_one(const ManualAtomic<std::uintptr_t>*) {
namespace {
DEFINE_int32(stress_factor, 1000, "The stress test factor for tests");
DEFINE_int32(stress_test_seconds, 2, "Duration for stress tests");
constexpr auto kForever = 100h;
using DSched = test::DeterministicSchedule;
...@@ -206,6 +209,7 @@ void basicNThreads(int numThreads, int iterations = FLAGS_stress_factor) {
for (auto j = 0; j < iterations; ++j) {
auto lck = std::unique_lock<std::decay_t<decltype(mutex)>>{mutex};
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
result.push_back(id);
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
}
...@@ -225,14 +229,328 @@ void basicNThreads(int numThreads, int iterations = FLAGS_stress_factor) {
}
EXPECT_EQ(total, sum(numThreads) * iterations);
}
template <template <typename> class Atom = std::atomic>
void lockWithTryAndTimedNThreads(
int numThreads,
std::chrono::seconds duration) {
auto&& mutex = detail::distributed_mutex::DistributedMutex<Atom>{};
auto&& barrier = std::atomic<int>{0};
auto&& threads = std::vector<std::thread>{};
auto&& stop = std::atomic<bool>{false};
auto&& lockUnlockFunction = [&]() {
while (!stop.load()) {
auto lck = std::unique_lock<std::decay_t<decltype(mutex)>>{mutex};
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
}
};
auto tryLockFunction = [&]() {
while (!stop.load()) {
using Mutex = std::decay_t<decltype(mutex)>;
auto lck = std::unique_lock<Mutex>{mutex, std::defer_lock};
if (lck.try_lock()) {
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
}
}
};
auto timedLockFunction = [&]() {
while (!stop.load()) {
using Mutex = std::decay_t<decltype(mutex)>;
auto lck = std::unique_lock<Mutex>{mutex, std::defer_lock};
if (lck.try_lock_for(kForever)) {
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
}
}
};
for (auto i = 0; i < (numThreads / 3); ++i) {
threads.push_back(DSched::thread(lockUnlockFunction));
}
for (auto i = 0; i < (numThreads / 3); ++i) {
threads.push_back(DSched::thread(tryLockFunction));
}
for (auto i = 0; i < (numThreads / 3); ++i) {
threads.push_back(DSched::thread(timedLockFunction));
}
/* sleep override */
std::this_thread::sleep_for(duration);
stop.store(true);
for (auto& thread : threads) {
DSched::join(thread);
}
}
template <template <typename> class Atom = std::atomic>
void combineNThreads(int numThreads, std::chrono::seconds duration) {
auto&& mutex = detail::distributed_mutex::DistributedMutex<Atom>{};
auto&& barrier = std::atomic<int>{0};
auto&& threads = std::vector<std::thread>{};
auto&& stop = std::atomic<bool>{false};
auto&& function = [&]() {
return [&] {
auto&& expected = std::uint64_t{0};
auto&& local = std::atomic<std::uint64_t>{0};
auto&& result = std::atomic<std::uint64_t>{0};
while (!stop.load()) {
++expected;
auto current = mutex.lock_combine([&]() {
result.fetch_add(1);
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
return local.fetch_add(1);
});
EXPECT_EQ(current, expected - 1);
}
EXPECT_EQ(expected, result.load());
};
};
for (auto i = 1; i <= numThreads; ++i) {
threads.push_back(DSched::thread(function()));
}
/* sleep override */
std::this_thread::sleep_for(duration);
stop.store(true);
for (auto& thread : threads) {
DSched::join(thread);
}
}
template <template <typename> class Atom = std::atomic>
void combineWithLockNThreads(int numThreads, std::chrono::seconds duration) {
auto&& mutex = detail::distributed_mutex::DistributedMutex<Atom>{};
auto&& barrier = std::atomic<int>{0};
auto&& threads = std::vector<std::thread>{};
auto&& stop = std::atomic<bool>{false};
auto&& lockUnlockFunction = [&]() {
while (!stop.load()) {
auto lck = std::unique_lock<std::decay_t<decltype(mutex)>>{mutex};
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
}
};
auto&& combineFunction = [&]() {
auto&& expected = std::uint64_t{0};
auto&& total = std::atomic<std::uint64_t>{0};
while (!stop.load()) {
++expected;
auto current = mutex.lock_combine([&]() {
auto iteration = total.fetch_add(1);
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
return iteration;
});
EXPECT_EQ(expected, current + 1);
}
EXPECT_EQ(expected, total.load());
};
for (auto i = 1; i < (numThreads / 2); ++i) {
threads.push_back(DSched::thread(combineFunction));
}
for (auto i = 0; i < (numThreads / 2); ++i) {
threads.push_back(DSched::thread(lockUnlockFunction));
}
/* sleep override */
std::this_thread::sleep_for(duration);
stop.store(true);
for (auto& thread : threads) {
DSched::join(thread);
}
}
template <template <typename> class Atom = std::atomic>
void combineWithTryLockNThreads(int numThreads, std::chrono::seconds duration) {
auto&& mutex = detail::distributed_mutex::DistributedMutex<Atom>{};
auto&& barrier = std::atomic<int>{0};
auto&& threads = std::vector<std::thread>{};
auto&& stop = std::atomic<bool>{false};
auto&& lockUnlockFunction = [&]() {
while (!stop.load()) {
auto lck = std::unique_lock<std::decay_t<decltype(mutex)>>{mutex};
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
}
};
auto&& combineFunction = [&]() {
auto&& expected = std::uint64_t{0};
auto&& total = std::atomic<std::uint64_t>{0};
while (!stop.load()) {
++expected;
auto current = mutex.lock_combine([&]() {
auto iteration = total.fetch_add(1);
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
return iteration;
});
EXPECT_EQ(expected, current + 1);
}
EXPECT_EQ(expected, total.load());
};
auto tryLockFunction = [&]() {
while (!stop.load()) {
using Mutex = std::decay_t<decltype(mutex)>;
auto lck = std::unique_lock<Mutex>{mutex, std::defer_lock};
if (lck.try_lock()) {
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
}
}
};
for (auto i = 0; i < (numThreads / 3); ++i) {
threads.push_back(DSched::thread(lockUnlockFunction));
}
for (auto i = 0; i < (numThreads / 3); ++i) {
threads.push_back(DSched::thread(combineFunction));
}
for (auto i = 0; i < (numThreads / 3); ++i) {
threads.push_back(DSched::thread(tryLockFunction));
}
/* sleep override */
std::this_thread::sleep_for(duration);
stop.store(true);
for (auto& thread : threads) {
DSched::join(thread);
}
}
template <template <typename> class Atom = std::atomic>
void combineWithLockTryAndTimedNThreads(
int numThreads,
std::chrono::seconds duration) {
auto&& mutex = detail::distributed_mutex::DistributedMutex<Atom>{};
auto&& barrier = std::atomic<int>{0};
auto&& threads = std::vector<std::thread>{};
auto&& stop = std::atomic<bool>{false};
auto&& lockUnlockFunction = [&]() {
while (!stop.load()) {
auto lck = std::unique_lock<std::decay_t<decltype(mutex)>>{mutex};
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
}
};
auto&& combineFunction = [&]() {
auto&& expected = std::uint64_t{0};
auto&& total = std::atomic<std::uint64_t>{0};
while (!stop.load()) {
++expected;
auto current = mutex.lock_combine([&]() {
auto iteration = total.fetch_add(1);
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
// return a non-trivially-copyable object that occupies all the
// storage we use to coalesce returns to test that codepath
return folly::make_array(
iteration,
iteration + 1,
iteration + 2,
iteration + 3,
iteration + 4,
iteration + 5);
});
EXPECT_EQ(expected, current[0] + 1);
EXPECT_EQ(expected, current[1]);
EXPECT_EQ(expected, current[2] - 1);
EXPECT_EQ(expected, current[3] - 2);
EXPECT_EQ(expected, current[4] - 3);
EXPECT_EQ(expected, current[5] - 4);
}
EXPECT_EQ(expected, total.load());
};
auto tryLockFunction = [&]() {
while (!stop.load()) {
using Mutex = std::decay_t<decltype(mutex)>;
auto lck = std::unique_lock<Mutex>{mutex, std::defer_lock};
if (lck.try_lock()) {
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
}
}
};
auto timedLockFunction = [&]() {
while (!stop.load()) {
using Mutex = std::decay_t<decltype(mutex)>;
auto lck = std::unique_lock<Mutex>{mutex, std::defer_lock};
if (lck.try_lock_for(kForever)) {
EXPECT_EQ(barrier.fetch_add(1, std::memory_order_relaxed), 0);
std::this_thread::yield();
EXPECT_EQ(barrier.fetch_sub(1, std::memory_order_relaxed), 1);
}
}
};
for (auto i = 0; i < (numThreads / 4); ++i) {
threads.push_back(DSched::thread(lockUnlockFunction));
}
for (auto i = 0; i < (numThreads / 4); ++i) {
threads.push_back(DSched::thread(combineFunction));
}
for (auto i = 0; i < (numThreads / 4); ++i) {
threads.push_back(DSched::thread(tryLockFunction));
}
for (auto i = 0; i < (numThreads / 4); ++i) {
threads.push_back(DSched::thread(timedLockFunction));
}
/* sleep override */
std::this_thread::sleep_for(duration);
stop.store(true);
for (auto& thread : threads) {
DSched::join(thread);
}
}
} // namespace
TEST(DistributedMutex, InternalDetailTestOne) {
auto value = 0;
auto ptr = reinterpret_cast<std::uintptr_t>(&value);
EXPECT_EQ(detail::distributed_mutex::extractPtr<int>(ptr), &value);
ptr = ptr | 0b1;
EXPECT_EQ(detail::distributed_mutex::extractPtr<int>(ptr), &value);
}
TEST(DistributedMutex, Basic) {
...@@ -434,6 +752,159 @@ TEST(DistributedMutex, StressHardwareConcurrencyThreads) {
basicNThreads(std::thread::hardware_concurrency());
}
TEST(DistributedMutex, StressThreeThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreads(
3, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreads(
6, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressTwelveThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreads(
12, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressTwentyFourThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreads(
24, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressFourtyEightThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreads(
48, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixtyFourThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreads(
64, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressHwConcThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreads(
std::thread::hardware_concurrency(),
std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressTwoThreadsCombine) {
combineNThreads(2, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressThreeThreadsCombine) {
combineNThreads(3, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressFourThreadsCombine) {
combineNThreads(4, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressFiveThreadsCombine) {
combineNThreads(5, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixThreadsCombine) {
combineNThreads(6, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSevenThreadsCombine) {
combineNThreads(7, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressEightThreadsCombine) {
combineNThreads(8, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixteenThreadsCombine) {
combineNThreads(16, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressThirtyTwoThreadsCombine) {
combineNThreads(32, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixtyFourThreadsCombine) {
combineNThreads(64, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressHundredThreadsCombine) {
combineNThreads(100, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressHardwareConcurrencyThreadsCombine) {
combineNThreads(
std::thread::hardware_concurrency(),
std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressTwoThreadsCombineAndLock) {
combineWithLockNThreads(2, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressFourThreadsCombineAndLock) {
combineWithLockNThreads(4, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressEightThreadsCombineAndLock) {
combineWithLockNThreads(8, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixteenThreadsCombineAndLock) {
combineWithLockNThreads(16, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressThirtyTwoThreadsCombineAndLock) {
combineWithLockNThreads(32, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixtyFourThreadsCombineAndLock) {
combineWithLockNThreads(64, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressHardwareConcurrencyThreadsCombineAndLock) {
combineWithLockNThreads(
std::thread::hardware_concurrency(),
std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressThreeThreadsCombineTryLockAndLock) {
combineWithTryLockNThreads(
3, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixThreadsCombineTryLockAndLock) {
combineWithTryLockNThreads(
6, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressTwelveThreadsCombineTryLockAndLock) {
combineWithTryLockNThreads(
12, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressTwentyFourThreadsCombineTryLockAndLock) {
combineWithTryLockNThreads(
24, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressFourtyEightThreadsCombineTryLockAndLock) {
combineWithTryLockNThreads(
48, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixtyFourThreadsCombineTryLockAndLock) {
combineWithTryLockNThreads(
64, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressHardwareConcurrencyThreadsCombineTryLockAndLock) {
combineWithTryLockNThreads(
std::thread::hardware_concurrency(),
std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressThreeThreadsCombineTryLockLockAndTimed) {
combineWithLockTryAndTimedNThreads(
3, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixThreadsCombineTryLockLockAndTimed) {
combineWithLockTryAndTimedNThreads(
6, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressTwelveThreadsCombineTryLockLockAndTimed) {
combineWithLockTryAndTimedNThreads(
12, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressTwentyFourThreadsCombineTryLockLockAndTimed) {
combineWithLockTryAndTimedNThreads(
24, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressFourtyEightThreadsCombineTryLockLockAndTimed) {
combineWithLockTryAndTimedNThreads(
48, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressSixtyFourThreadsCombineTryLockLockAndTimed) {
combineWithLockTryAndTimedNThreads(
64, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressHwConcurrencyThreadsCombineTryLockLockAndTimed) {
combineWithLockTryAndTimedNThreads(
std::thread::hardware_concurrency(),
std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, StressTryLock) {
auto&& mutex = DistributedMutex{};
...@@ -464,6 +935,73 @@ void runBasicNThreadsDeterministic(int threads, int iterations) {
static_cast<void>(schedule);
}
}
void combineNThreadsDeterministic(int threads, std::chrono::seconds t) {
const auto kNumPasses = 3.0;
const auto seconds = std::ceil(static_cast<double>(t.count()) / kNumPasses);
const auto time = std::chrono::seconds{static_cast<std::uint64_t>(seconds)};
for (auto pass = 0; pass < kNumPasses; ++pass) {
auto&& schedule = DSched{DSched::uniform(pass)};
combineNThreads<test::DeterministicAtomic>(threads, time);
static_cast<void>(schedule);
}
}
void combineAndLockNThreadsDeterministic(int threads, std::chrono::seconds t) {
const auto kNumPasses = 3.0;
const auto seconds = std::ceil(static_cast<double>(t.count()) / kNumPasses);
const auto time = std::chrono::seconds{static_cast<std::uint64_t>(seconds)};
for (auto pass = 0; pass < kNumPasses; ++pass) {
auto&& schedule = DSched{DSched::uniform(pass)};
combineWithLockNThreads<test::DeterministicAtomic>(threads, time);
static_cast<void>(schedule);
}
}
void combineTryLockAndLockNThreadsDeterministic(
int threads,
std::chrono::seconds t) {
const auto kNumPasses = 3.0;
const auto seconds = std::ceil(static_cast<double>(t.count()) / kNumPasses);
const auto time = std::chrono::seconds{static_cast<std::uint64_t>(seconds)};
for (auto pass = 0; pass < kNumPasses; ++pass) {
auto&& schedule = DSched{DSched::uniform(pass)};
combineWithTryLockNThreads<test::DeterministicAtomic>(threads, time);
static_cast<void>(schedule);
}
}
void lockWithTryAndTimedNThreadsDeterministic(
int threads,
std::chrono::seconds t) {
const auto kNumPasses = 3.0;
const auto seconds = std::ceil(static_cast<double>(t.count()) / kNumPasses);
const auto time = std::chrono::seconds{static_cast<std::uint64_t>(seconds)};
for (auto pass = 0; pass < kNumPasses; ++pass) {
auto&& schedule = DSched{DSched::uniform(pass)};
lockWithTryAndTimedNThreads<test::DeterministicAtomic>(threads, time);
static_cast<void>(schedule);
}
}
void combineWithTryLockAndTimedNThreadsDeterministic(
int threads,
std::chrono::seconds t) {
const auto kNumPasses = 3.0;
const auto seconds = std::ceil(static_cast<double>(t.count()) / kNumPasses);
const auto time = std::chrono::seconds{static_cast<std::uint64_t>(seconds)};
for (auto pass = 0; pass < kNumPasses; ++pass) {
auto&& schedule = DSched{DSched::uniform(pass)};
combineWithLockTryAndTimedNThreads<test::DeterministicAtomic>(
threads, time);
static_cast<void>(schedule);
}
}
} // namespace
TEST(DistributedMutex, DeterministicStressTwoThreads) {
...@@ -482,6 +1020,156 @@ TEST(DistributedMutex, DeterministicStressThirtyTwoThreads) {
runBasicNThreadsDeterministic(32, numIterationsDeterministicTest(32));
}
TEST(DistributedMutex, DeterministicStressThreeThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreadsDeterministic(
3, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, DeterministicStressSixThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreadsDeterministic(
6, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, DeterministicStressTwelveThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreadsDeterministic(
12, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, DeterministicStressTwentyFourThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreadsDeterministic(
24, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, DeterministicStressFourtyEightThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreadsDeterministic(
48, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, DeterministicStressSixtyFourThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreadsDeterministic(
64, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, DeterministicStressHwConcThreadsLockTryAndTimed) {
lockWithTryAndTimedNThreadsDeterministic(
std::thread::hardware_concurrency(),
std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineDeterministicStressTwoThreads) {
combineNThreadsDeterministic(
2, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineDeterministicStressFourThreads) {
combineNThreadsDeterministic(
4, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineDeterministicStressEightThreads) {
combineNThreadsDeterministic(
8, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineDeterministicStressSixteenThreads) {
combineNThreadsDeterministic(
16, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineDeterministicStressThirtyTwoThreads) {
combineNThreadsDeterministic(
32, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineDeterministicStressSixtyFourThreads) {
combineNThreadsDeterministic(
64, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineDeterministicStressHardwareConcurrencyThreads) {
combineNThreadsDeterministic(
std::thread::hardware_concurrency(),
std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineAndLockDeterministicStressTwoThreads) {
combineAndLockNThreadsDeterministic(
2, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineAndLockDeterministicStressFourThreads) {
combineAndLockNThreadsDeterministic(
4, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineAndLockDeterministicStressEightThreads) {
combineAndLockNThreadsDeterministic(
8, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineAndLockDeterministicStressSixteenThreads) {
combineAndLockNThreadsDeterministic(
16, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineAndLockDeterministicStressThirtyTwoThreads) {
combineAndLockNThreadsDeterministic(
32, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineAndLockDeterministicStressSixtyFourThreads) {
combineAndLockNThreadsDeterministic(
64, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineAndLockDeterministicStressHWConcurrencyThreads) {
combineAndLockNThreadsDeterministic(
std::thread::hardware_concurrency(),
std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndLockDeterministicStressThreeThreads) {
combineTryLockAndLockNThreadsDeterministic(
3, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndLockDeterministicStressSixThreads) {
combineTryLockAndLockNThreadsDeterministic(
6, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndLockDeterministicStressTwelveThreads) {
combineTryLockAndLockNThreadsDeterministic(
12, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndLockDeterministicStressTwentyThreads) {
combineTryLockAndLockNThreadsDeterministic(
24, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndLockDeterministicStressFortyThreads) {
combineTryLockAndLockNThreadsDeterministic(
48, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndLockDeterministicStressSixtyThreads) {
combineTryLockAndLockNThreadsDeterministic(
64, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndLockDeterministicStressHWConcThreads) {
combineTryLockAndLockNThreadsDeterministic(
std::thread::hardware_concurrency(),
std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndTimedDeterministicStressThreeThreads) {
combineWithTryLockAndTimedNThreadsDeterministic(
3, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndTimedDeterministicStressSixThreads) {
combineWithTryLockAndTimedNThreadsDeterministic(
6, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndTimedDeterministicStressTwelveThreads) {
combineWithTryLockAndTimedNThreadsDeterministic(
12, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndTimedDeterministicStressTwentyThreads) {
combineWithTryLockAndTimedNThreadsDeterministic(
24, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndTimedDeterministicStressFortyThreads) {
combineWithTryLockAndTimedNThreadsDeterministic(
48, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndTimedDeterministicStressSixtyThreads) {
combineWithTryLockAndTimedNThreadsDeterministic(
64, std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, CombineTryLockAndTimedDeterministicStressHWConcThreads) {
combineWithTryLockAndTimedNThreadsDeterministic(
std::thread::hardware_concurrency(),
std::chrono::seconds{FLAGS_stress_test_seconds});
}
TEST(DistributedMutex, TimedLockTimeout) {
auto&& mutex = DistributedMutex{};
auto&& start = folly::Baton<>{};
...@@ -833,4 +1521,131 @@ TEST(DistributedMutex, DeterministicTryLockSixtyFourThreads) {
}
}
namespace {
class TestConstruction {
public:
TestConstruction() = delete;
explicit TestConstruction(int) {
defaultConstructs().fetch_add(1, std::memory_order_relaxed);
}
TestConstruction(TestConstruction&&) noexcept {
moveConstructs().fetch_add(1, std::memory_order_relaxed);
}
TestConstruction(const TestConstruction&) {
copyConstructs().fetch_add(1, std::memory_order_relaxed);
}
TestConstruction& operator=(const TestConstruction&) {
copyAssigns().fetch_add(1, std::memory_order_relaxed);
return *this;
}
TestConstruction& operator=(TestConstruction&&) {
moveAssigns().fetch_add(1, std::memory_order_relaxed);
return *this;
}
~TestConstruction() {
destructs().fetch_add(1, std::memory_order_relaxed);
}
static std::atomic<std::uint64_t>& defaultConstructs() {
static auto&& atomic = std::atomic<std::uint64_t>{0};
return atomic;
}
static std::atomic<std::uint64_t>& moveConstructs() {
static auto&& atomic = std::atomic<std::uint64_t>{0};
return atomic;
}
static std::atomic<std::uint64_t>& copyConstructs() {
static auto&& atomic = std::atomic<std::uint64_t>{0};
return atomic;
}
static std::atomic<std::uint64_t>& moveAssigns() {
static auto&& atomic = std::atomic<std::uint64_t>{0};
return atomic;
}
static std::atomic<std::uint64_t>& copyAssigns() {
static auto&& atomic = std::atomic<std::uint64_t>{0};
return atomic;
}
static std::atomic<std::uint64_t>& destructs() {
static auto&& atomic = std::atomic<std::uint64_t>{0};
return atomic;
}
static void reset() {
defaultConstructs().store(0);
moveConstructs().store(0);
copyConstructs().store(0);
copyAssigns().store(0);
destructs().store(0);
}
};
} // namespace
TEST(DistributedMutex, TestAppropriateDestructionAndConstructionWithCombine) {
auto&& mutex = folly::DistributedMutex{};
auto&& stop = std::atomic<bool>{false};
// test the simple return path to make sure that in the absence of
// contention, we get the right number of constructs and destructs
mutex.lock_combine([]() { return TestConstruction{1}; });
auto moves = TestConstruction::moveConstructs().load();
auto defaults = TestConstruction::defaultConstructs().load();
EXPECT_EQ(TestConstruction::defaultConstructs().load(), 1);
EXPECT_TRUE(moves == 0 || moves == 1);
EXPECT_EQ(TestConstruction::destructs().load(), moves + defaults);
// loop and make sure we were able to test the path where the critical
// section of the thread gets combined, and assert that we see the expected
// number of constructions and destructions
//
// this implements a timed backoff to test the combined path, so we use the
// smallest possible delay in tests
auto thread = std::thread{[&]() {
auto&& duration = std::chrono::milliseconds{10};
while (!stop.load()) {
TestConstruction::reset();
auto&& ready = folly::Baton<>{};
auto&& release = folly::Baton<>{};
// make one thread start its critical section, then signal and wait for
// another thread to enqueue, to test the combined path
auto innerThread = std::thread{[&]() {
mutex.lock_combine([&]() {
ready.post();
release.wait();
/* sleep override */
std::this_thread::sleep_for(duration);
});
}};
// wait for the thread to get in its critical section, then tell it to go
ready.wait();
release.post();
mutex.lock_combine([&]() { return TestConstruction{1}; });
innerThread.join();
// at this point we should have only one default construct, either 3
// or 4 move constructs, and the same number of destructions as
// constructions
auto innerDefaults = TestConstruction::defaultConstructs().load();
auto innerMoves = TestConstruction::moveConstructs().load();
auto destructs = TestConstruction::destructs().load();
EXPECT_EQ(innerDefaults, 1);
EXPECT_TRUE(innerMoves == 3 || innerMoves == 4 || innerMoves == 1);
EXPECT_EQ(destructs, innerMoves + innerDefaults);
EXPECT_EQ(TestConstruction::moveAssigns().load(), 0);
EXPECT_EQ(TestConstruction::copyAssigns().load(), 0);
// increase duration by 100ms each iteration
duration = duration + 100ms;
}
}};
/* sleep override */
std::this_thread::sleep_for(std::chrono::seconds{FLAGS_stress_test_seconds});
stop.store(true);
thread.join();
}
} // namespace folly
...@@ -15,6 +15,7 @@
*/
#include <algorithm>
#include <array>
#include <cmath>
#include <condition_variable>
#include <iostream>
...@@ -25,9 +26,12 @@
#include <google/base/spinlock.h>
#include <folly/Benchmark.h>
#include <folly/CachelinePadded.h>
#include <folly/SharedMutex.h>
#include <folly/experimental/flat_combining/FlatCombining.h>
#include <folly/synchronization/DistributedMutex.h>
#include <folly/synchronization/SmallLocks.h>
#include <folly/synchronization/Utility.h>
/* "Work cycle" is just an additional nop loop iteration. /* "Work cycle" is just an additional nop loop iteration.
* A smaller number of work cyles will result in more contention, * A smaller number of work cyles will result in more contention,
...@@ -50,13 +54,6 @@ static void burn(size_t n) { ...@@ -50,13 +54,6 @@ static void burn(size_t n) {
} }
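// A rough sketch of what burn() above does, inferred from the comment: spin
// through n "work cycle" iterations of essentially no-op work (the real body
// is collapsed in this diff, so treat this as an approximation only):
//
//   static void burn(size_t n) {
//     for (size_t i = 0; i < n; ++i) {
//       folly::doNotOptimizeAway(i);
//     }
//   }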
namespace {
struct SimpleBarrier {
explicit SimpleBarrier(int count) : count_(count) {}
void wait() {
...@@ -105,8 +102,135 @@ class GoogleSpinLockAdapter {
SpinLock lock_;
};
class DistributedMutexFlatCombining {
public:
folly::DistributedMutex mutex_;
};
class NoLock {
public:
void lock() {}
void unlock() {}
};
class FlatCombiningMutexNoCaching
: public folly::FlatCombining<FlatCombiningMutexNoCaching> {
public:
using Super = folly::FlatCombining<FlatCombiningMutexNoCaching>;
template <typename CriticalSection>
auto lock_combine(CriticalSection func, std::size_t) {
auto record = this->allocRec();
auto value = folly::invoke_result_t<CriticalSection&>{};
this->requestFC([&]() { value = func(); }, record);
this->freeRec(record);
return value;
}
};
class FlatCombiningMutexCaching
: public folly::FlatCombining<FlatCombiningMutexCaching> {
public:
using Super = folly::FlatCombining<FlatCombiningMutexCaching>;
FlatCombiningMutexCaching() {
for (auto i = 0; i < 256; ++i) {
this->records_.push_back(this->allocRec());
}
}
template <typename CriticalSection>
auto lock_combine(CriticalSection func, std::size_t index) {
auto value = folly::invoke_result_t<CriticalSection&>{};
this->requestFC([&]() { value = func(); }, records_.at(index));
return value;
}
std::vector<Super::Rec*> records_;
};
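// The two FlatCombining adapters above differ only in how they obtain
// combining records: FlatCombiningMutexNoCaching calls allocRec()/freeRec()
// around every operation, while FlatCombiningMutexCaching hands each thread
// index a pre-allocated record, keeping record management out of the
// measured critical path.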
template <typename Mutex, typename CriticalSection>
auto lock_and(Mutex& mutex, std::size_t, CriticalSection func) {
auto lck = folly::make_unique_lock(mutex);
return func();
}
template <typename F>
auto lock_and(DistributedMutexFlatCombining& mutex, std::size_t, F func) {
return mutex.mutex_.lock_combine(std::move(func));
}
template <typename F>
auto lock_and(FlatCombiningMutexNoCaching& mutex, std::size_t i, F func) {
return mutex.lock_combine(func, i);
}
template <typename F>
auto lock_and(FlatCombiningMutexCaching& mutex, std::size_t i, F func) {
return mutex.lock_combine(func, i);
}
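// A small illustrative sketch (the local names below are hypothetical, not
// part of this change): the lock_and() overloads above give every mutex
// flavor the same call shape, so the contended benchmark loop stays
// identical whether the critical section is combined or run under a
// conventional lock:
//
//   auto counter = std::uint64_t{0};
//   auto plain = std::mutex{};
//   auto combining = DistributedMutexFlatCombining{};
//   lock_and(plain, 0, [&] { return ++counter; });      // unique_lock path
//   lock_and(combining, 0, [&] { return ++counter; });  // lock_combine path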
template <typename Mutex>
std::unique_lock<Mutex> lock(Mutex& mutex) {
return std::unique_lock<Mutex>{mutex};
}
template <typename Mutex, typename Other>
void unlock(Mutex&, Other) {}
/**
 * Functions to initialize, write to and read from the protected data
 *
 * These are overloaded on the data type so the contended benchmark can run a
 * different critical section depending on what it protects; a sketch of how
 * another payload type could be added follows these overloads
 */
std::uint64_t write(std::uint64_t& value) {
return ++value;
}
void read(std::uint64_t value) {
folly::doNotOptimizeAway(value);
}
void initialize(std::uint64_t& value) {
value = 1;
}
class alignas(folly::hardware_destructive_interference_size) Ints {
public:
std::array<folly::CachelinePadded<std::uint64_t>, 5> ints_;
};
std::uint64_t write(Ints& vec) {
auto sum = std::uint64_t{0};
for (auto& integer : vec.ints_) {
sum += (*integer += 1);
}
return sum;
}
void initialize(Ints&) {}
class alignas(folly::hardware_destructive_interference_size) AtomicsAdd {
public:
std::array<folly::CachelinePadded<std::atomic<std::uint64_t>>, 5> ints_;
};
std::uint64_t write(AtomicsAdd& atomics) {
auto sum = std::uint64_t{0};
for (auto& integer : atomics.ints_) {
sum += integer->fetch_add(1);
}
return sum;
}
void initialize(AtomicsAdd&) {}
class alignas(folly::hardware_destructive_interference_size) AtomicCas {
public:
std::atomic<std::uint64_t> integer_{0};
};
std::uint64_t write(AtomicCas& atomic) {
auto value = atomic.integer_.load();
while (!atomic.integer_.compare_exchange_strong(value, value + 1)) {
}
return value;
}
void initialize(AtomicCas&) {}
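// A minimal sketch of the extension point described above (the TwoCounters
// type is hypothetical and not part of this change): a new payload only needs
// its own initialize()/write() overloads, since read() already accepts the
// std::uint64_t that write() returns.
//
//   class alignas(folly::hardware_destructive_interference_size) TwoCounters {
//    public:
//     std::array<folly::CachelinePadded<std::uint64_t>, 2> counters_;
//   };
//
//   std::uint64_t write(TwoCounters& data) {
//     // touch both cache lines under the lock, like the Ints payload does
//     return (*data.counters_[0] += 1) + (*data.counters_[1] += 1);
//   }
//   void initialize(TwoCounters&) {}
//
// which could then be benchmarked with, for example,
//   runContended<folly::DistributedMutex, TwoCounters>(numOps, numThreads);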
template <typename Lock, typename Data = std::uint64_t>
static void
runContended(size_t numOps, size_t numThreads, size_t work = FLAGS_work) {
folly::BenchmarkSuspender braces;
size_t totalthreads = std::thread::hardware_concurrency();
if (totalthreads < numThreads) {
...@@ -117,11 +241,14 @@ static void runContended(size_t numOps, size_t numThreads) {
char padding1[128];
Lock mutex;
char padding2[128];
Data value;
};
auto locks = std::vector<lockstruct>(threadgroups);
for (auto& data : locks) {
initialize(data.value);
}
folly::makeUnpredictable(locks);
char padding3[128];
(void)padding3;
...@@ -134,10 +261,11 @@ static void runContended(size_t numOps, size_t numThreads) {
lockstruct* mutex = &locks[t % threadgroups];
runbarrier.wait();
for (size_t op = 0; op < numOps; op += 1) {
auto val = lock_and(mutex->mutex, t, [& value = mutex->value, work] {
burn(work);
return write(value);
});
read(val);
burn(FLAGS_unlocked_work);
}
});
...@@ -257,7 +385,9 @@ template <typename Mutex>
void runUncontended(std::size_t iters) {
auto&& mutex = Mutex{};
for (auto i = std::size_t{0}; i < iters; ++i) {
folly::makeUnpredictable(mutex);
auto state = lock(mutex);
folly::makeUnpredictable(mutex);
unlock(mutex, std::move(state));
}
}
...@@ -348,6 +478,53 @@ static void folly_sharedmutex(size_t numOps, size_t numThreads) {
static void folly_distributedmutex(size_t numOps, size_t numThreads) {
runContended<folly::DistributedMutex>(numOps, numThreads);
}
static void folly_distributedmutex_combining(size_t ops, size_t threads) {
runContended<DistributedMutexFlatCombining>(ops, threads);
}
static void folly_flatcombining_no_caching(size_t numOps, size_t numThreads) {
runContended<FlatCombiningMutexNoCaching>(numOps, numThreads);
}
static void folly_flatcombining_caching(size_t numOps, size_t numThreads) {
runContended<FlatCombiningMutexCaching>(numOps, numThreads);
}
static void std_mutex_simple(size_t numOps, size_t numThreads) {
runContended<std::mutex, Ints>(numOps, numThreads, 0);
}
static void google_spin_simple(size_t numOps, size_t numThreads) {
runContended<GoogleSpinLockAdapter, Ints>(numOps, numThreads, 0);
}
static void folly_microspin_simple(size_t numOps, size_t numThreads) {
runContended<InitLock<folly::MicroSpinLock>, Ints>(numOps, numThreads, 0);
}
static void folly_picospin_simple(size_t numOps, size_t numThreads) {
runContended<InitLock<folly::PicoSpinLock<uint16_t>>, Ints>(
numOps, numThreads, 0);
}
static void folly_microlock_simple(size_t numOps, size_t numThreads) {
runContended<folly::MicroLock, Ints>(numOps, numThreads, 0);
}
static void folly_sharedmutex_simple(size_t numOps, size_t numThreads) {
runContended<folly::SharedMutex, Ints>(numOps, numThreads, 0);
}
static void folly_distributedmutex_simple(size_t numOps, size_t numThreads) {
runContended<folly::DistributedMutex, Ints>(numOps, numThreads, 0);
}
static void folly_distributedmutex_combining_simple(size_t o, size_t t) {
runContended<DistributedMutexFlatCombining, Ints>(o, t, 0);
}
static void atomics_fetch_add(size_t numOps, size_t numThreads) {
runContended<NoLock, AtomicsAdd>(numOps, numThreads, 0);
}
static void atomic_cas(size_t numOps, size_t numThreads) {
runContended<NoLock, AtomicCas>(numOps, numThreads, 0);
}
static void folly_flatcombining_no_caching_simple(size_t ops, size_t threads) {
runContended<FlatCombiningMutexNoCaching>(ops, threads, 0);
}
static void folly_flatcombining_caching_simple(size_t ops, size_t threads) {
runContended<FlatCombiningMutexCaching>(ops, threads, 0);
}
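// The *_simple variants above pass zero work cycles to model very small
// critical sections; most of them protect the wider Ints payload, and
// atomics_fetch_add/atomic_cas use NoLock so they measure the raw atomic
// operations as a lock-free baseline.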
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex, 1thread, 1)
...@@ -357,6 +534,9 @@ BENCH_REL(folly_picospin, 1thread, 1)
BENCH_REL(folly_microlock, 1thread, 1)
BENCH_REL(folly_sharedmutex, 1thread, 1)
BENCH_REL(folly_distributedmutex, 1thread, 1)
BENCH_REL(folly_distributedmutex_combining, 1thread, 1)
BENCH_REL(folly_flatcombining_no_caching, 1thread, 1)
BENCH_REL(folly_flatcombining_caching, 1thread, 1)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex, 2thread, 2)
BENCH_REL(google_spin, 2thread, 2)
...@@ -365,6 +545,9 @@ BENCH_REL(folly_picospin, 2thread, 2)
BENCH_REL(folly_microlock, 2thread, 2)
BENCH_REL(folly_sharedmutex, 2thread, 2)
BENCH_REL(folly_distributedmutex, 2thread, 2)
BENCH_REL(folly_distributedmutex_combining, 2thread, 2)
BENCH_REL(folly_flatcombining_no_caching, 2thread, 2)
BENCH_REL(folly_flatcombining_caching, 2thread, 2)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex, 4thread, 4)
BENCH_REL(google_spin, 4thread, 4)
...@@ -373,6 +556,9 @@ BENCH_REL(folly_picospin, 4thread, 4)
BENCH_REL(folly_microlock, 4thread, 4)
BENCH_REL(folly_sharedmutex, 4thread, 4)
BENCH_REL(folly_distributedmutex, 4thread, 4)
BENCH_REL(folly_distributedmutex_combining, 4thread, 4)
BENCH_REL(folly_flatcombining_no_caching, 4thread, 4)
BENCH_REL(folly_flatcombining_caching, 4thread, 4)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex, 8thread, 8)
BENCH_REL(google_spin, 8thread, 8)
...@@ -381,6 +567,9 @@ BENCH_REL(folly_picospin, 8thread, 8)
BENCH_REL(folly_microlock, 8thread, 8)
BENCH_REL(folly_sharedmutex, 8thread, 8)
BENCH_REL(folly_distributedmutex, 8thread, 8)
BENCH_REL(folly_distributedmutex_combining, 8thread, 8)
BENCH_REL(folly_flatcombining_no_caching, 8thread, 8)
BENCH_REL(folly_flatcombining_caching, 8thread, 8)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex, 16thread, 16)
BENCH_REL(google_spin, 16thread, 16)
...@@ -389,6 +578,9 @@ BENCH_REL(folly_picospin, 16thread, 16)
BENCH_REL(folly_microlock, 16thread, 16)
BENCH_REL(folly_sharedmutex, 16thread, 16)
BENCH_REL(folly_distributedmutex, 16thread, 16)
BENCH_REL(folly_distributedmutex_combining, 16thread, 16)
BENCH_REL(folly_flatcombining_no_caching, 16thread, 16)
BENCH_REL(folly_flatcombining_caching, 16thread, 16)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex, 32thread, 32)
BENCH_REL(google_spin, 32thread, 32)
...@@ -397,6 +589,9 @@ BENCH_REL(folly_picospin, 32thread, 32)
BENCH_REL(folly_microlock, 32thread, 32)
BENCH_REL(folly_sharedmutex, 32thread, 32)
BENCH_REL(folly_distributedmutex, 32thread, 32)
BENCH_REL(folly_distributedmutex_combining, 32thread, 32)
BENCH_REL(folly_flatcombining_no_caching, 32thread, 32)
BENCH_REL(folly_flatcombining_caching, 32thread, 32)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex, 64thread, 64)
BENCH_REL(google_spin, 64thread, 64)
...@@ -405,6 +600,9 @@ BENCH_REL(folly_picospin, 64thread, 64)
BENCH_REL(folly_microlock, 64thread, 64)
BENCH_REL(folly_sharedmutex, 64thread, 64)
BENCH_REL(folly_distributedmutex, 64thread, 64)
BENCH_REL(folly_distributedmutex_combining, 64thread, 64)
BENCH_REL(folly_flatcombining_no_caching, 64thread, 64)
BENCH_REL(folly_flatcombining_caching, 64thread, 64)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex, 128thread, 128)
BENCH_REL(google_spin, 128thread, 128)
...@@ -413,6 +611,114 @@ BENCH_REL(folly_picospin, 128thread, 128)
BENCH_REL(folly_microlock, 128thread, 128)
BENCH_REL(folly_sharedmutex, 128thread, 128)
BENCH_REL(folly_distributedmutex, 128thread, 128)
BENCH_REL(folly_distributedmutex_combining, 128thread, 128)
BENCH_REL(folly_flatcombining_no_caching, 128thread, 128)
BENCH_REL(folly_flatcombining_caching, 128thread, 128)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex_simple, 1thread, 1)
BENCH_REL(google_spin_simple, 1thread, 1)
BENCH_REL(folly_microspin_simple, 1thread, 1)
BENCH_REL(folly_picospin_simple, 1thread, 1)
BENCH_REL(folly_microlock_simple, 1thread, 1)
BENCH_REL(folly_sharedmutex_simple, 1thread, 1)
BENCH_REL(folly_distributedmutex_simple, 1thread, 1)
BENCH_REL(folly_distributedmutex_combining_simple, 1thread, 1)
BENCH_REL(folly_flatcombining_no_caching_simple, 1thread, 1)
BENCH_REL(folly_flatcombining_caching_simple, 1thread, 1)
BENCH_REL(atomics_fetch_add, 1thread, 1)
BENCH_REL(atomic_cas, 1thread, 1)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex_simple, 2thread, 2)
BENCH_REL(google_spin_simple, 2thread, 2)
BENCH_REL(folly_microspin_simple, 2thread, 2)
BENCH_REL(folly_picospin_simple, 2thread, 2)
BENCH_REL(folly_microlock_simple, 2thread, 2)
BENCH_REL(folly_sharedmutex_simple, 2thread, 2)
BENCH_REL(folly_distributedmutex_simple, 2thread, 2)
BENCH_REL(folly_distributedmutex_combining_simple, 2thread, 2)
BENCH_REL(folly_flatcombining_no_caching_simple, 2thread, 2)
BENCH_REL(folly_flatcombining_caching_simple, 2thread, 2)
BENCH_REL(atomics_fetch_add, 2thread, 2)
BENCH_REL(atomic_cas, 2thread, 2)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex_simple, 4thread, 4)
BENCH_REL(google_spin_simple, 4thread, 4)
BENCH_REL(folly_microspin_simple, 4thread, 4)
BENCH_REL(folly_picospin_simple, 4thread, 4)
BENCH_REL(folly_microlock_simple, 4thread, 4)
BENCH_REL(folly_sharedmutex_simple, 4thread, 4)
BENCH_REL(folly_distributedmutex_simple, 4thread, 4)
BENCH_REL(folly_distributedmutex_combining_simple, 4thread, 4)
BENCH_REL(folly_flatcombining_no_caching_simple, 4thread, 4)
BENCH_REL(folly_flatcombining_caching_simple, 4thread, 4)
BENCH_REL(atomics_fetch_add, 4thread, 4)
BENCH_REL(atomic_cas, 4thread, 4)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex_simple, 8thread, 8)
BENCH_REL(google_spin_simple, 8thread, 8)
BENCH_REL(folly_microspin_simple, 8thread, 8)
BENCH_REL(folly_picospin_simple, 8thread, 8)
BENCH_REL(folly_microlock_simple, 8thread, 8)
BENCH_REL(folly_sharedmutex_simple, 8thread, 8)
BENCH_REL(folly_distributedmutex_simple, 8thread, 8)
BENCH_REL(folly_distributedmutex_combining_simple, 8thread, 8)
BENCH_REL(folly_flatcombining_no_caching_simple, 8thread, 8)
BENCH_REL(folly_flatcombining_caching_simple, 8thread, 8)
BENCH_REL(atomics_fetch_add, 8thread, 8)
BENCH_REL(atomic_cas, 8thread, 8)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex_simple, 16thread, 16)
BENCH_REL(google_spin_simple, 16thread, 16)
BENCH_REL(folly_microspin_simple, 16thread, 16)
BENCH_REL(folly_picospin_simple, 16thread, 16)
BENCH_REL(folly_microlock_simple, 16thread, 16)
BENCH_REL(folly_sharedmutex_simple, 16thread, 16)
BENCH_REL(folly_distributedmutex_simple, 16thread, 16)
BENCH_REL(folly_distributedmutex_combining_simple, 16thread, 16)
BENCH_REL(folly_flatcombining_no_caching_simple, 16thread, 16)
BENCH_REL(folly_flatcombining_caching_simple, 16thread, 16)
BENCH_REL(atomics_fetch_add, 16thread, 16)
BENCH_REL(atomic_cas, 16thread, 16)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex_simple, 32thread, 32)
BENCH_REL(google_spin_simple, 32thread, 32)
BENCH_REL(folly_microspin_simple, 32thread, 32)
BENCH_REL(folly_picospin_simple, 32thread, 32)
BENCH_REL(folly_microlock_simple, 32thread, 32)
BENCH_REL(folly_sharedmutex_simple, 32thread, 32)
BENCH_REL(folly_distributedmutex_simple, 32thread, 32)
BENCH_REL(folly_distributedmutex_combining_simple, 32thread, 32)
BENCH_REL(folly_flatcombining_no_caching_simple, 32thread, 32)
BENCH_REL(folly_flatcombining_caching_simple, 32thread, 32)
BENCH_REL(atomics_fetch_add, 32thread, 32)
BENCH_REL(atomic_cas, 32thread, 32)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex_simple, 64thread, 64)
BENCH_REL(google_spin_simple, 64thread, 64)
BENCH_REL(folly_microspin_simple, 64thread, 64)
BENCH_REL(folly_picospin_simple, 64thread, 64)
BENCH_REL(folly_microlock_simple, 64thread, 64)
BENCH_REL(folly_sharedmutex_simple, 64thread, 64)
BENCH_REL(folly_distributedmutex_simple, 64thread, 64)
BENCH_REL(folly_distributedmutex_combining_simple, 64thread, 64)
BENCH_REL(folly_flatcombining_no_caching_simple, 64thread, 64)
BENCH_REL(folly_flatcombining_caching_simple, 64thread, 64)
BENCH_REL(atomics_fetch_add, 64thread, 64)
BENCH_REL(atomic_cas, 64thread, 64)
BENCHMARK_DRAW_LINE();
BENCH_BASE(std_mutex_simple, 128thread, 128)
BENCH_REL(google_spin_simple, 128thread, 128)
BENCH_REL(folly_microspin_simple, 128thread, 128)
BENCH_REL(folly_picospin_simple, 128thread, 128)
BENCH_REL(folly_microlock_simple, 128thread, 128)
BENCH_REL(folly_sharedmutex_simple, 128thread, 128)
BENCH_REL(folly_distributedmutex_simple, 128thread, 128)
BENCH_REL(folly_distributedmutex_combining_simple, 128thread, 128)
BENCH_REL(folly_flatcombining_no_caching_simple, 128thread, 128)
BENCH_REL(folly_flatcombining_caching_simple, 128thread, 128)
BENCH_REL(atomics_fetch_add, 128thread, 128)
BENCH_REL(atomic_cas, 128thread, 128)
template <typename Mutex>
void fairnessTest(std::string type, std::size_t numThreads) {
...@@ -585,78 +891,415 @@ Lock time stats in us: mean 56 stddev 960 max 32873
============================================================================
folly/synchronization/test/SmallLocksBenchmark.cpprelative time/iter iters/s
============================================================================
StdMutexUncontendedBenchmark 18.85ns 53.04M
GoogleSpinUncontendedBenchmark 11.25ns 88.87M
MicroSpinLockUncontendedBenchmark 10.95ns 91.34M
PicoSpinLockUncontendedBenchmark 20.38ns 49.06M
MicroLockUncontendedBenchmark 28.60ns 34.96M
SharedMutexUncontendedBenchmark 19.51ns 51.25M
DistributedMutexUncontendedBenchmark 25.27ns 39.58M
AtomicFetchAddUncontendedBenchmark 5.47ns 182.91M
----------------------------------------------------------------------------
----------------------------------------------------------------------------
std_mutex(1thread) 797.34ns 1.25M
google_spin(1thread) 101.28% 787.29ns 1.27M
folly_microspin(1thread) 118.32% 673.90ns 1.48M
folly_picospin(1thread) 118.36% 673.66ns 1.48M
folly_microlock(1thread) 117.98% 675.84ns 1.48M
folly_sharedmutex(1thread) 118.40% 673.41ns 1.48M
folly_distributedmutex(1thread) 116.11% 686.74ns 1.46M
folly_distributedmutex_combining(1thread) 115.05% 693.05ns 1.44M
folly_flatcombining_no_caching(1thread) 90.40% 882.05ns 1.13M
folly_flatcombining_caching(1thread) 107.30% 743.08ns 1.35M
----------------------------------------------------------------------------
std_mutex(2thread) 1.14us 874.72K
google_spin(2thread) 120.79% 946.42ns 1.06M
folly_microspin(2thread) 136.28% 838.90ns 1.19M
folly_picospin(2thread) 133.80% 854.45ns 1.17M
folly_microlock(2thread) 111.09% 1.03us 971.76K
folly_sharedmutex(2thread) 109.19% 1.05us 955.10K
folly_distributedmutex(2thread) 106.62% 1.07us 932.65K
folly_distributedmutex_combining(2thread) 105.45% 1.08us 922.42K
folly_flatcombining_no_caching(2thread) 74.73% 1.53us 653.70K
folly_flatcombining_caching(2thread) 82.78% 1.38us 724.05K
----------------------------------------------------------------------------
std_mutex(4thread) 2.39us 418.41K
google_spin(4thread) 128.49% 1.86us 537.63K
folly_microspin(4thread) 102.60% 2.33us 429.28K
folly_picospin(4thread) 111.94% 2.14us 468.37K
folly_microlock(4thread) 78.19% 3.06us 327.16K
folly_sharedmutex(4thread) 86.30% 2.77us 361.11K
folly_distributedmutex(4thread) 138.25% 1.73us 578.44K
folly_distributedmutex_combining(4thread) 146.96% 1.63us 614.90K
folly_flatcombining_no_caching(4thread) 87.93% 2.72us 367.90K
folly_flatcombining_caching(4thread) 96.09% 2.49us 402.04K
----------------------------------------------------------------------------
std_mutex(8thread) 3.84us 260.54K
google_spin(8thread) 98.58% 3.89us 256.83K
folly_microspin(8thread) 64.01% 6.00us 166.77K
folly_picospin(8thread) 64.76% 5.93us 168.72K
folly_microlock(8thread) 44.31% 8.66us 115.45K
folly_sharedmutex(8thread) 50.20% 7.65us 130.78K
folly_distributedmutex(8thread) 120.38% 3.19us 313.64K
folly_distributedmutex_combining(8thread) 190.44% 2.02us 496.18K
folly_flatcombining_no_caching(8thread) 102.17% 3.76us 266.19K
folly_flatcombining_caching(8thread) 129.25% 2.97us 336.76K
----------------------------------------------------------------------------
std_mutex(16thread) 9.09us 110.05K
google_spin(16thread) 110.38% 8.23us 121.47K
folly_microspin(16thread) 79.81% 11.39us 87.83K
folly_picospin(16thread) 33.62% 27.03us 37.00K
folly_microlock(16thread) 49.93% 18.20us 54.95K
folly_sharedmutex(16thread) 46.15% 19.69us 50.79K
folly_distributedmutex(16thread) 145.48% 6.25us 160.10K
folly_distributedmutex_combining(16thread) 275.84% 3.29us 303.56K
folly_flatcombining_no_caching(16thread) 151.81% 5.99us 167.06K
folly_flatcombining_caching(16thread) 153.44% 5.92us 168.86K
----------------------------------------------------------------------------
std_mutex(32thread) 26.15us 38.24K
google_spin(32thread) 111.41% 23.47us 42.60K
folly_microspin(32thread) 84.76% 30.85us 32.41K
folly_picospin(32thread) 27.30% 95.80us 10.44K
folly_microlock(32thread) 48.93% 53.45us 18.71K
folly_sharedmutex(32thread) 54.64% 47.86us 20.89K
folly_distributedmutex(32thread) 158.31% 16.52us 60.53K
folly_distributedmutex_combining(32thread) 314.13% 8.33us 120.12K
folly_flatcombining_no_caching(32thread) 175.18% 14.93us 66.99K
folly_flatcombining_caching(32thread) 206.73% 12.65us 79.05K
----------------------------------------------------------------------------
std_mutex(64thread) 30.72us 32.55K
google_spin(64thread) 113.69% 27.02us 37.00K
folly_microspin(64thread) 87.23% 35.22us 28.39K
folly_picospin(64thread) 27.66% 111.06us 9.00K
folly_microlock(64thread) 49.93% 61.53us 16.25K
folly_sharedmutex(64thread) 54.00% 56.89us 17.58K
folly_distributedmutex(64thread) 162.10% 18.95us 52.77K
folly_distributedmutex_combining(64thread) 317.85% 9.67us 103.46K
folly_flatcombining_no_caching(64thread) 160.43% 19.15us 52.22K
folly_flatcombining_caching(64thread) 185.57% 16.56us 60.40K
----------------------------------------------------------------------------
std_mutex(128thread) 72.86us 13.72K
google_spin(128thread) 114.50% 63.64us 15.71K
folly_microspin(128thread) 99.89% 72.95us 13.71K
folly_picospin(128thread) 31.49% 231.40us 4.32K
folly_microlock(128thread) 57.76% 126.14us 7.93K
folly_sharedmutex(128thread) 61.49% 118.50us 8.44K
folly_distributedmutex(128thread) 188.86% 38.58us 25.92K
folly_distributedmutex_combining(128thread) 372.60% 19.56us 51.14K
folly_flatcombining_no_caching(128thread) 149.17% 48.85us 20.47K
folly_flatcombining_caching(128thread) 165.93% 43.91us 22.77K
----------------------------------------------------------------------------
std_mutex_simple(1thread) 623.35ns 1.60M
google_spin_simple(1thread) 103.37% 603.04ns 1.66M
folly_microspin_simple(1thread) 103.18% 604.15ns 1.66M
folly_picospin_simple(1thread) 103.27% 603.63ns 1.66M
folly_microlock_simple(1thread) 102.75% 606.68ns 1.65M
folly_sharedmutex_simple(1thread) 99.03% 629.43ns 1.59M
folly_distributedmutex_simple(1thread) 100.62% 619.52ns 1.61M
folly_distributedmutex_combining_simple(1thread 99.43% 626.92ns 1.60M
folly_flatcombining_no_caching_simple(1thread) 81.20% 767.71ns 1.30M
folly_flatcombining_caching_simple(1thread) 79.80% 781.15ns 1.28M
atomics_fetch_add(1thread) 100.67% 619.22ns 1.61M
atomic_cas(1thread) 104.04% 599.13ns 1.67M
----------------------------------------------------------------------------
std_mutex_simple(2thread) 1.13us 884.14K
google_spin_simple(2thread) 119.42% 947.08ns 1.06M
folly_microspin_simple(2thread) 118.54% 954.12ns 1.05M
folly_picospin_simple(2thread) 117.00% 966.67ns 1.03M
folly_microlock_simple(2thread) 114.90% 984.36ns 1.02M
folly_sharedmutex_simple(2thread) 110.79% 1.02us 979.53K
folly_distributedmutex_simple(2thread) 110.43% 1.02us 976.34K
folly_distributedmutex_combining_simple(2thread 105.80% 1.07us 935.43K
folly_flatcombining_no_caching_simple(2thread) 82.28% 1.37us 727.43K
folly_flatcombining_caching_simple(2thread) 89.85% 1.26us 794.41K
atomics_fetch_add(2thread) 107.37% 1.05us 949.27K
atomic_cas(2thread) 173.23% 652.92ns 1.53M
----------------------------------------------------------------------------
std_mutex_simple(4thread) 2.12us 471.59K
google_spin_simple(4thread) 101.25% 2.09us 477.50K
folly_microspin_simple(4thread) 97.79% 2.17us 461.17K
folly_picospin_simple(4thread) 98.80% 2.15us 465.92K
folly_microlock_simple(4thread) 79.65% 2.66us 375.61K
folly_sharedmutex_simple(4thread) 82.35% 2.57us 388.35K
folly_distributedmutex_simple(4thread) 113.43% 1.87us 534.91K
folly_distributedmutex_combining_simple(4thread 158.22% 1.34us 746.17K
folly_flatcombining_no_caching_simple(4thread) 89.95% 2.36us 424.22K
folly_flatcombining_caching_simple(4thread) 98.86% 2.14us 466.24K
atomics_fetch_add(4thread) 160.21% 1.32us 755.54K
atomic_cas(4thread) 283.73% 747.35ns 1.34M
----------------------------------------------------------------------------
std_mutex_simple(8thread) 3.81us 262.49K
google_spin_simple(8thread) 118.19% 3.22us 310.23K
folly_microspin_simple(8thread) 87.11% 4.37us 228.66K
folly_picospin_simple(8thread) 66.31% 5.75us 174.05K
folly_microlock_simple(8thread) 61.18% 6.23us 160.59K
folly_sharedmutex_simple(8thread) 61.65% 6.18us 161.82K
folly_distributedmutex_simple(8thread) 116.66% 3.27us 306.22K
folly_distributedmutex_combining_simple(8thread 222.30% 1.71us 583.53K
folly_flatcombining_no_caching_simple(8thread) 105.97% 3.59us 278.17K
folly_flatcombining_caching_simple(8thread) 119.21% 3.20us 312.92K
atomics_fetch_add(8thread) 248.65% 1.53us 652.70K
atomic_cas(8thread) 171.55% 2.22us 450.30K
----------------------------------------------------------------------------
std_mutex_simple(16thread) 9.02us 110.93K
google_spin_simple(16thread) 115.67% 7.79us 128.31K
folly_microspin_simple(16thread) 85.45% 10.55us 94.79K
folly_picospin_simple(16thread) 46.06% 19.57us 51.09K
folly_microlock_simple(16thread) 53.34% 16.90us 59.17K
folly_sharedmutex_simple(16thread) 47.16% 19.12us 52.31K
folly_distributedmutex_simple(16thread) 131.65% 6.85us 146.03K
folly_distributedmutex_combining_simple(16threa 353.51% 2.55us 392.13K
folly_flatcombining_no_caching_simple(16thread) 175.03% 5.15us 194.16K
folly_flatcombining_caching_simple(16thread) 169.24% 5.33us 187.73K
atomics_fetch_add(16thread) 428.31% 2.10us 475.10K
atomic_cas(16thread) 194.29% 4.64us 215.52K
----------------------------------------------------------------------------
std_mutex_simple(32thread) 22.66us 44.12K
google_spin_simple(32thread) 114.91% 19.72us 50.70K
folly_microspin_simple(32thread) 70.53% 32.13us 31.12K
folly_picospin_simple(32thread) 17.21% 131.71us 7.59K
folly_microlock_simple(32thread) 39.17% 57.86us 17.28K
folly_sharedmutex_simple(32thread) 46.84% 48.39us 20.67K
folly_distributedmutex_simple(32thread) 128.80% 17.60us 56.83K
folly_distributedmutex_combining_simple(32threa 397.59% 5.70us 175.43K
folly_flatcombining_no_caching_simple(32thread) 205.08% 11.05us 90.49K
folly_flatcombining_caching_simple(32thread) 247.48% 9.16us 109.20K
atomics_fetch_add(32thread) 466.03% 4.86us 205.63K
atomic_cas(32thread) 439.89% 5.15us 194.10K
----------------------------------------------------------------------------
std_mutex_simple(64thread) 30.55us 32.73K
google_spin_simple(64thread) 105.69% 28.91us 34.59K
folly_microspin_simple(64thread) 83.06% 36.79us 27.18K
folly_picospin_simple(64thread) 20.28% 150.63us 6.64K
folly_microlock_simple(64thread) 45.10% 67.75us 14.76K
folly_sharedmutex_simple(64thread) 54.07% 56.50us 17.70K
folly_distributedmutex_simple(64thread) 151.84% 20.12us 49.70K
folly_distributedmutex_combining_simple(64threa 465.77% 6.56us 152.45K
folly_flatcombining_no_caching_simple(64thread) 186.46% 16.39us 61.03K
folly_flatcombining_caching_simple(64thread) 250.81% 12.18us 82.09K
atomics_fetch_add(64thread) 530.59% 5.76us 173.67K
atomic_cas(64thread) 510.57% 5.98us 167.12K
----------------------------------------------------------------------------
std_mutex_simple(128thread) 69.85us 14.32K
google_spin_simple(128thread) 97.54% 71.61us 13.97K
folly_microspin_simple(128thread) 88.01% 79.36us 12.60K
folly_picospin_simple(128thread) 22.31% 313.13us 3.19K
folly_microlock_simple(128thread) 50.49% 138.34us 7.23K
folly_sharedmutex_simple(128thread) 59.30% 117.78us 8.49K
folly_distributedmutex_simple(128thread) 174.90% 39.94us 25.04K
folly_distributedmutex_combining_simple(128thre 531.75% 13.14us 76.13K
folly_flatcombining_no_caching_simple(128thread 212.56% 32.86us 30.43K
folly_flatcombining_caching_simple(128thread) 183.68% 38.03us 26.30K
atomics_fetch_add(128thread) 629.64% 11.09us 90.15K
atomic_cas(128thread) 562.01% 12.43us 80.46K
============================================================================
./small_locks_benchmark --bm_min_iters=100000
Intel(R) Xeon(R) D-2191 CPU @ 1.60GHz
============================================================================
folly/synchronization/test/SmallLocksBenchmark.cpprelative time/iter iters/s
============================================================================
StdMutexUncontendedBenchmark 37.65ns 26.56M
GoogleSpinUncontendedBenchmark 21.97ns 45.52M
MicroSpinLockUncontendedBenchmark 21.97ns 45.53M
PicoSpinLockUncontendedBenchmark 40.80ns 24.51M
MicroLockUncontendedBenchmark 57.76ns 17.31M
SharedMutexUncontendedBenchmark 39.55ns 25.29M
DistributedMutexUncontendedBenchmark 51.47ns 19.43M
AtomicFetchAddUncontendedBenchmark 10.67ns 93.73M
----------------------------------------------------------------------------
----------------------------------------------------------------------------
std_mutex(1thread) 1.36us 737.48K
google_spin(1thread) 94.81% 1.43us 699.17K
folly_microspin(1thread) 100.17% 1.35us 738.74K
folly_picospin(1thread) 100.40% 1.35us 740.41K
folly_microlock(1thread) 82.90% 1.64us 611.34K
folly_sharedmutex(1thread) 101.07% 1.34us 745.36K
folly_distributedmutex(1thread) 101.50% 1.34us 748.54K
folly_distributedmutex_combining(1thread) 99.09% 1.37us 730.79K
folly_flatcombining_no_caching(1thread) 91.37% 1.48us 673.80K
folly_flatcombining_caching(1thread) 99.19% 1.37us 731.48K
----------------------------------------------------------------------------
std_mutex(2thread) 1.65us 605.33K
google_spin(2thread) 113.28% 1.46us 685.74K
folly_microspin(2thread) 117.23% 1.41us 709.63K
folly_picospin(2thread) 113.56% 1.45us 687.40K
folly_microlock(2thread) 106.92% 1.55us 647.22K
folly_sharedmutex(2thread) 107.24% 1.54us 649.15K
folly_distributedmutex(2thread) 114.89% 1.44us 695.47K
folly_distributedmutex_combining(2thread) 83.44% 1.98us 505.10K
folly_flatcombining_no_caching(2thread) 75.89% 2.18us 459.42K
folly_flatcombining_caching(2thread) 76.96% 2.15us 465.86K
----------------------------------------------------------------------------
std_mutex(4thread) 2.88us 347.43K
google_spin(4thread) 132.08% 2.18us 458.88K
folly_microspin(4thread) 160.15% 1.80us 556.43K
folly_picospin(4thread) 189.27% 1.52us 657.60K
folly_microlock(4thread) 155.13% 1.86us 538.97K
folly_sharedmutex(4thread) 148.96% 1.93us 517.55K
folly_distributedmutex(4thread) 106.64% 2.70us 370.51K
folly_distributedmutex_combining(4thread) 138.83% 2.07us 482.33K
folly_flatcombining_no_caching(4thread) 87.67% 3.28us 304.59K
folly_flatcombining_caching(4thread) 93.32% 3.08us 324.23K
----------------------------------------------------------------------------
std_mutex(8thread) 7.01us 142.65K
google_spin(8thread) 127.58% 5.49us 182.00K
folly_microspin(8thread) 137.50% 5.10us 196.14K
folly_picospin(8thread) 114.66% 6.11us 163.56K
folly_microlock(8thread) 107.90% 6.50us 153.92K
folly_sharedmutex(8thread) 114.21% 6.14us 162.93K
folly_distributedmutex(8thread) 129.43% 5.42us 184.63K
folly_distributedmutex_combining(8thread) 271.46% 2.58us 387.23K
folly_flatcombining_no_caching(8thread) 148.27% 4.73us 211.50K
folly_flatcombining_caching(8thread) 170.26% 4.12us 242.88K
----------------------------------------------------------------------------
std_mutex(16thread) 13.11us 76.30K
google_spin(16thread) 122.81% 10.67us 93.71K
folly_microspin(16thread) 91.61% 14.31us 69.90K
folly_picospin(16thread) 62.60% 20.94us 47.76K
folly_microlock(16thread) 73.44% 17.85us 56.04K
folly_sharedmutex(16thread) 74.68% 17.55us 56.98K
folly_distributedmutex(16thread) 142.42% 9.20us 108.67K
folly_distributedmutex_combining(16thread) 332.10% 3.95us 253.39K
folly_flatcombining_no_caching(16thread) 177.20% 7.40us 135.21K
folly_flatcombining_caching(16thread) 186.60% 7.02us 142.37K
----------------------------------------------------------------------------
std_mutex(32thread) 25.45us 39.30K
google_spin(32thread) 122.57% 20.76us 48.17K
folly_microspin(32thread) 73.58% 34.58us 28.92K
folly_picospin(32thread) 50.29% 50.60us 19.76K
folly_microlock(32thread) 58.33% 43.63us 22.92K
folly_sharedmutex(32thread) 55.89% 45.53us 21.96K
folly_distributedmutex(32thread) 142.80% 17.82us 56.12K
folly_distributedmutex_combining(32thread) 352.23% 7.22us 138.42K
folly_flatcombining_no_caching(32thread) 237.42% 10.72us 93.30K
folly_flatcombining_caching(32thread) 251.05% 10.14us 98.66K
----------------------------------------------------------------------------
std_mutex(64thread) 43.02us 23.25K
google_spin(64thread) 120.68% 35.65us 28.05K
folly_microspin(64thread) 70.09% 61.38us 16.29K
folly_picospin(64thread) 42.05% 102.31us 9.77K
folly_microlock(64thread) 54.50% 78.94us 12.67K
folly_sharedmutex(64thread) 50.37% 85.40us 11.71K
folly_distributedmutex(64thread) 135.17% 31.83us 31.42K
folly_distributedmutex_combining(64thread) 319.01% 13.49us 74.15K
folly_flatcombining_no_caching(64thread) 218.18% 19.72us 50.72K
folly_flatcombining_caching(64thread) 211.05% 20.38us 49.06K
----------------------------------------------------------------------------
std_mutex(128thread) 84.62us 11.82K
google_spin(128thread) 120.25% 70.37us 14.21K
folly_microspin(128thread) 66.54% 127.16us 7.86K
folly_picospin(128thread) 33.40% 253.38us 3.95K
folly_microlock(128thread) 51.91% 163.03us 6.13K
folly_sharedmutex(128thread) 49.51% 170.90us 5.85K
folly_distributedmutex(128thread) 131.90% 64.15us 15.59K
folly_distributedmutex_combining(128thread) 273.55% 30.93us 32.33K
folly_flatcombining_no_caching(128thread) 183.86% 46.02us 21.73K
folly_flatcombining_caching(128thread) 180.95% 46.76us 21.38K
----------------------------------------------------------------------------
std_mutex_simple(1thread) 1.20us 833.55K
google_spin_simple(1thread) 105.03% 1.14us 875.52K
folly_microspin_simple(1thread) 102.64% 1.17us 855.57K
folly_picospin_simple(1thread) 101.94% 1.18us 849.74K
folly_microlock_simple(1thread) 101.01% 1.19us 841.96K
folly_sharedmutex_simple(1thread) 100.82% 1.19us 840.37K
folly_distributedmutex_simple(1thread) 100.15% 1.20us 834.83K
folly_distributedmutex_combining_simple(1thread 102.37% 1.17us 853.32K
folly_flatcombining_no_caching_simple(1thread) 93.19% 1.29us 776.81K
folly_flatcombining_caching_simple(1thread) 100.03% 1.20us 833.80K
atomic_fetch_add(1thread) 98.13% 1.22us 817.99K
atomic_cas(1thread) 101.95% 1.18us 849.82K
----------------------------------------------------------------------------
std_mutex_simple(2thread) 1.56us 641.79K
google_spin_simple(2thread) 110.31% 1.41us 707.98K
folly_microspin_simple(2thread) 115.05% 1.35us 738.35K
folly_picospin_simple(2thread) 110.28% 1.41us 707.78K
folly_microlock_simple(2thread) 107.14% 1.45us 687.60K
folly_sharedmutex_simple(2thread) 113.16% 1.38us 726.22K
folly_distributedmutex_simple(2thread) 108.31% 1.44us 695.14K
folly_distributedmutex_combining_simple(2thread 104.39% 1.49us 669.95K
folly_flatcombining_no_caching_simple(2thread) 87.04% 1.79us 558.63K
folly_flatcombining_caching_simple(2thread) 97.59% 1.60us 626.30K
atomic_fetch_add(2thread) 103.06% 1.51us 661.42K
atomic_cas(2thread) 123.77% 1.26us 794.32K
----------------------------------------------------------------------------
std_mutex_simple(4thread) 2.72us 368.29K
google_spin_simple(4thread) 122.17% 2.22us 449.96K
folly_microspin_simple(4thread) 142.12% 1.91us 523.43K
folly_picospin_simple(4thread) 160.27% 1.69us 590.27K
folly_microlock_simple(4thread) 143.16% 1.90us 527.24K
folly_sharedmutex_simple(4thread) 139.18% 1.95us 512.61K
folly_distributedmutex_simple(4thread) 111.52% 2.43us 410.71K
folly_distributedmutex_combining_simple(4thread 138.74% 1.96us 510.96K
folly_flatcombining_no_caching_simple(4thread) 96.48% 2.81us 355.34K
folly_flatcombining_caching_simple(4thread) 105.15% 2.58us 387.28K
atomic_fetch_add(4thread) 148.73% 1.83us 547.75K
atomic_cas(4thread) 213.49% 1.27us 786.28K
----------------------------------------------------------------------------
std_mutex_simple(8thread) 7.04us 142.04K
google_spin_simple(8thread) 127.59% 5.52us 181.23K
folly_microspin_simple(8thread) 135.94% 5.18us 193.09K
folly_picospin_simple(8thread) 113.86% 6.18us 161.72K
folly_microlock_simple(8thread) 112.07% 6.28us 159.18K
folly_sharedmutex_simple(8thread) 113.25% 6.22us 160.86K
folly_distributedmutex_simple(8thread) 124.12% 5.67us 176.30K
folly_distributedmutex_combining_simple(8thread 309.01% 2.28us 438.91K
folly_flatcombining_no_caching_simple(8thread) 134.62% 5.23us 191.21K
folly_flatcombining_caching_simple(8thread) 147.13% 4.79us 208.99K
atomic_fetch_add(8thread) 347.94% 2.02us 494.21K
atomic_cas(8thread) 412.06% 1.71us 585.28K
----------------------------------------------------------------------------
std_mutex_simple(16thread) 12.87us 77.73K
google_spin_simple(16thread) 122.44% 10.51us 95.17K
folly_microspin_simple(16thread) 99.49% 12.93us 77.33K
folly_picospin_simple(16thread) 72.60% 17.72us 56.43K
folly_microlock_simple(16thread) 80.39% 16.00us 62.48K
folly_sharedmutex_simple(16thread) 78.76% 16.34us 61.22K
folly_distributedmutex_simple(16thread) 118.58% 10.85us 92.17K
folly_distributedmutex_combining_simple(16threa 483.44% 2.66us 375.76K
folly_flatcombining_no_caching_simple(16thread) 194.22% 6.62us 150.96K
folly_flatcombining_caching_simple(16thread) 229.03% 5.62us 178.02K
atomic_fetch_add(16thread) 617.57% 2.08us 480.01K
atomic_cas(16thread) 258.86% 4.97us 201.20K
----------------------------------------------------------------------------
std_mutex_simple(32thread) 22.85us 43.77K
google_spin_simple(32thread) 123.96% 18.43us 54.25K
folly_microspin_simple(32thread) 73.35% 31.15us 32.11K
folly_picospin_simple(32thread) 46.43% 49.21us 20.32K
folly_microlock_simple(32thread) 55.62% 41.08us 24.34K
folly_sharedmutex_simple(32thread) 52.67% 43.38us 23.05K
folly_distributedmutex_simple(32thread) 106.87% 21.38us 46.78K
folly_distributedmutex_combining_simple(32threa 581.80% 3.93us 254.64K
folly_flatcombining_no_caching_simple(32thread) 280.19% 8.15us 122.63K
folly_flatcombining_caching_simple(32thread) 350.87% 6.51us 153.57K
atomic_fetch_add(32thread) 1031.35% 2.22us 451.41K
atomic_cas(32thread) 209.10% 10.93us 91.52K
----------------------------------------------------------------------------
std_mutex_simple(64thread) 39.55us 25.28K
google_spin_simple(64thread) 124.15% 31.86us 31.39K
folly_microspin_simple(64thread) 72.27% 54.73us 18.27K
folly_picospin_simple(64thread) 39.96% 98.98us 10.10K
folly_microlock_simple(64thread) 53.10% 74.48us 13.43K
folly_sharedmutex_simple(64thread) 48.83% 81.00us 12.35K
folly_distributedmutex_simple(64thread) 103.91% 38.06us 26.27K
folly_distributedmutex_combining_simple(64threa 520.61% 7.60us 131.63K
folly_flatcombining_no_caching_simple(64thread) 288.46% 13.71us 72.93K
folly_flatcombining_caching_simple(64thread) 306.57% 12.90us 77.51K
atomic_fetch_add(64thread) 982.24% 4.03us 248.34K
atomic_cas(64thread) 191.87% 20.61us 48.51K
----------------------------------------------------------------------------
std_mutex_simple(128thread) 77.79us 12.85K
google_spin_simple(128thread) 123.39% 63.05us 15.86K
folly_microspin_simple(128thread) 69.13% 112.53us 8.89K
folly_picospin_simple(128thread) 30.32% 256.57us 3.90K
folly_microlock_simple(128thread) 50.78% 153.20us 6.53K
folly_sharedmutex_simple(128thread) 48.00% 162.07us 6.17K
folly_distributedmutex_simple(128thread) 102.79% 75.68us 13.21K
folly_distributedmutex_combining_simple(128thre 433.00% 17.97us 55.66K
folly_flatcombining_no_caching_simple(128thread 186.46% 41.72us 23.97K
folly_flatcombining_caching_simple(128thread) 204.22% 38.09us 26.25K
atomic_fetch_add(128thread) 965.10% 8.06us 124.06K
atomic_cas(128thread) 184.01% 42.28us 23.65K
============================================================================
*/