Add flat combining to DistributedMutex
Summary: Add combined critical sections to DistributedMutex. The implementation uses the existing framework within DistributedMutex as the point of reference for contention, and resolves contention by either combining the lock requests of peers or migrating the lock, based on usage and internal state.

This further boosts the performance of DistributedMutex: up to 4x relative to the old benchmark on dual-socket Broadwell and up to 5x on single-socket Skylake machines. The win might be bigger when the cost of mutex migration is higher, e.g. when the data being protected is wider than a single L1 cache line.

Small critical sections, when used in combinable mode, can now go more than 10x faster than the small locks, about 6x faster than std::mutex, up to 2-3x faster than the implementations of flat combining we benchmarked against, and about as fast as a CAS instruction/loop (faster on some NUMA-less and more parallel architectures like Skylake). This also allows flat combining to be used, with virtually no overhead, in situations where fine-grained locking would be beneficial: DistributedMutex retains its original size of 8 bytes.

DistributedMutex resolves contention through flat combining up to a constant factor of 2 contention chains to prevent issues with fairness and latency outliers. We therefore retain the fairness benefits of the original implementation, with no noticeable regression when switching between the lock methods.

The implementation of combined critical sections here differs from the original flat combining paper. It uses the same stack-based LIFO contention chains from DistributedMutex to allow the leader to resolve lock requests from peers. Combine records are located on the stack along with the wait-node, as an InlineFunctionRef instance, to avoid memory allocation overhead or expensive copying. Using InlineFunctionRef also means that function calls are resolved without going through the double lookup of a vtable-based implementation.
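As a rough illustration of the idea described above (requests published as stack-resident records on a LIFO chain, then executed by whichever thread becomes the leader), here is a minimal sketch. It is not the DistributedMutex implementation: `ToyCombiningLock`, `combine`, `drain` and `run_combining_demo` are hypothetical names, the leader is chosen with a separate `std::mutex` rather than the mutex's own 8-byte state, a plain function pointer plus `void*` stands in for InlineFunctionRef, and there is no fairness bound, return value or timed variant. It also assumes the callable does not throw.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical toy, not folly code: a combining lock built from a std::mutex
// plus a lock-free LIFO stack of stack-allocated request records.
class ToyCombiningLock {
  struct Node {
    void (*invoke)(void*) = nullptr;  // type-erased call, one indirection
    void* ctx = nullptr;              // points at the caller's callable
    Node* next = nullptr;
    std::atomic<bool> done{false};
  };

 public:
  template <typename Func>
  void combine(Func func) {
    Node node;  // the record lives on this thread's stack, no allocation
    node.ctx = &func;
    node.invoke = [](void* p) { (*static_cast<Func*>(p))(); };

    // Publish the request by pushing it onto the LIFO chain.
    node.next = head_.load(std::memory_order_relaxed);
    while (!head_.compare_exchange_weak(node.next, &node,
                                        std::memory_order_release,
                                        std::memory_order_relaxed)) {
    }

    // Wait until some leader (possibly this thread) has run the request.
    while (!node.done.load(std::memory_order_acquire)) {
      if (mutex_.try_lock()) {
        drain();  // this thread is the leader: run every published request
        mutex_.unlock();
      } else {
        std::this_thread::yield();
      }
    }
  }

 private:
  void drain() {
    Node* cur = head_.exchange(nullptr, std::memory_order_acquire);
    while (cur != nullptr) {
      Node* next = cur->next;  // read first: the node dies once done is set
      cur->invoke(cur->ctx);
      cur->done.store(true, std::memory_order_release);
      cur = next;
    }
  }

  std::atomic<Node*> head_{nullptr};
  std::mutex mutex_;
};

// Increments a plain (non-atomic) counter from several threads through the
// toy lock and returns the final value.
std::uint64_t run_combining_demo(int threads, int iterations) {
  ToyCombiningLock lock;
  std::uint64_t counter = 0;
  std::vector<std::thread> workers;
  for (int t = 0; t < threads; ++t) {
    workers.emplace_back([&] {
      for (int i = 0; i < iterations; ++i) {
        lock.combine([&] { ++counter; });
      }
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  return counter;
}
```

Because every request runs while the leader holds the lock, the protected data tends to stay in the leader's cache instead of migrating between cores, which is the effect the numbers in this summary attribute to combining.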
InlineFunctionRef can flatten the virtual table and callable object in-situ so we have just one indirection.

Additionally, we use preemption as a signal to speed up lock requests in the case where latency of acquisition would have otherwise gone beyond our control. As a side-bonus, this also results in much simpler code.

The API looks like the following:

```
auto integer = std::uint64_t{};
auto mutex = folly::DistributedMutex{};

// ...

mutex.lock_combine([&]() {
  foo();
  integer++;
});
```

This adds three new methods for symmetry with the old lock functions:

- folly::invoke_result_t<const Func&> lock_combine(Func) noexcept;
- folly::Optional<> try_lock_combine_for(duration, Func) noexcept;
- folly::Optional<> try_lock_combine_until(time_point, Func) noexcept;

Benchmarks on Broadwell

```
std_mutex_simple(1thread)                                 617.28ns    1.62M
google_spin_simple(1thread)                      101.97%  605.33ns    1.65M
folly_microspin_simple(1thread)                   99.40%  621.01ns    1.61M
folly_picospin_simple(1thread)                   100.15%  616.36ns    1.62M
folly_microlock_simple(1thread)                   98.86%  624.37ns    1.60M
folly_sharedmutex_simple(1thread)                 86.14%  716.59ns    1.40M
folly_distributedmutex_simple(1thread)            97.95%  630.21ns    1.59M
folly_distributedmutex_flatcombining_simple(1th   98.04%  629.60ns    1.59M
folly_flatcombining_no_caching_simple(1thread)    89.85%  687.01ns    1.46M
folly_flatcombining_caching_simple(1thread)       78.36%  787.75ns    1.27M
atomics_fetch_add(1thread)                        97.88%  630.67ns    1.59M
atomic_cas(1thread)                              102.31%  603.33ns    1.66M
----------------------------------------------------------------------------
std_mutex_simple(2thread)                                   1.14us  875.72K
google_spin_simple(2thread)                      125.08%  912.95ns    1.10M
folly_microspin_simple(2thread)                  116.03%  984.14ns    1.02M
folly_picospin_simple(2thread)                   117.35%  973.04ns    1.03M
folly_microlock_simple(2thread)                  102.54%    1.11us  897.95K
folly_sharedmutex_simple(2thread)                121.04%  943.42ns    1.06M
folly_distributedmutex_simple(2thread)           128.24%  890.48ns    1.12M
folly_distributedmutex_flatcombining_simple(2th  107.99%    1.06us  945.66K
folly_flatcombining_no_caching_simple(2thread)    83.40%    1.37us  730.33K
folly_flatcombining_caching_simple(2thread)       87.47%    1.31us  766.00K
atomics_fetch_add(2thread)                       115.71%  986.85ns    1.01M
atomic_cas(2thread)                              171.35%  666.42ns    1.50M
----------------------------------------------------------------------------
std_mutex_simple(4thread)                                   1.98us  504.43K
google_spin_simple(4thread)                      103.24%    1.92us  520.76K
folly_microspin_simple(4thread)                   92.05%    2.15us  464.33K
folly_picospin_simple(4thread)                    89.16%    2.22us  449.75K
folly_microlock_simple(4thread)                   66.62%    2.98us  336.06K
folly_sharedmutex_simple(4thread)                 82.61%    2.40us  416.69K
folly_distributedmutex_simple(4thread)           108.83%    1.82us  548.98K
folly_distributedmutex_flatcombining_simple(4th  145.24%    1.36us  732.63K
folly_flatcombining_no_caching_simple(4thread)    84.77%    2.34us  427.62K
folly_flatcombining_caching_simple(4thread)       91.01%    2.18us  459.09K
atomics_fetch_add(4thread)                       142.86%    1.39us  720.62K
atomic_cas(4thread)                              223.50%  887.02ns    1.13M
----------------------------------------------------------------------------
std_mutex_simple(8thread)                                   3.70us  270.40K
google_spin_simple(8thread)                      110.24%    3.35us  298.09K
folly_microspin_simple(8thread)                   81.59%    4.53us  220.63K
folly_picospin_simple(8thread)                    57.61%    6.42us  155.77K
folly_microlock_simple(8thread)                   54.18%    6.83us  146.49K
folly_sharedmutex_simple(8thread)                 55.44%    6.67us  149.92K
folly_distributedmutex_simple(8thread)           109.86%    3.37us  297.05K
folly_distributedmutex_flatcombining_simple(8th  225.14%    1.64us  608.76K
folly_flatcombining_no_caching_simple(8thread)    96.25%    3.84us  260.26K
folly_flatcombining_caching_simple(8thread)      108.13%    3.42us  292.39K
atomics_fetch_add(8thread)                       255.40%    1.45us  690.60K
atomic_cas(8thread)                              183.68%    2.01us  496.66K
----------------------------------------------------------------------------
std_mutex_simple(16thread)                                  8.70us  114.89K
google_spin_simple(16thread)                     124.47%    6.99us  143.01K
folly_microspin_simple(16thread)                  86.46%   10.07us   99.34K
folly_picospin_simple(16thread)                   40.76%   21.36us   46.83K
folly_microlock_simple(16thread)                  54.78%   15.89us   62.94K
folly_sharedmutex_simple(16thread)                58.14%   14.97us   66.80K
folly_distributedmutex_simple(16thread)          124.53%    6.99us  143.08K
folly_distributedmutex_flatcombining_simple(16t  324.08%    2.69us  372.34K
folly_flatcombining_no_caching_simple(16thread)  134.73%    6.46us  154.79K
folly_flatcombining_caching_simple(16thread)     188.24%    4.62us  216.28K
atomics_fetch_add(16thread)                      340.07%    2.56us  390.72K
atomic_cas(16thread)                             220.15%    3.95us  252.93K
----------------------------------------------------------------------------
std_mutex_simple(32thread)                                 25.62us   39.03K
google_spin_simple(32thread)                     105.21%   24.35us   41.07K
folly_microspin_simple(32thread)                  79.64%   32.17us   31.08K
folly_picospin_simple(32thread)                   19.61%  130.67us    7.65K
folly_microlock_simple(32thread)                  42.97%   59.62us   16.77K
folly_sharedmutex_simple(32thread)                52.41%   48.88us   20.46K
folly_distributedmutex_simple(32thread)          144.48%   17.73us   56.39K
folly_distributedmutex_flatcombining_simple(32t  461.73%    5.55us  180.22K
folly_flatcombining_no_caching_simple(32thread)  207.55%   12.34us   81.01K
folly_flatcombining_caching_simple(32thread)     237.34%   10.80us   92.64K
atomics_fetch_add(32thread)                      561.68%    4.56us  219.23K
atomic_cas(32thread)                             484.13%    5.29us  188.96K
----------------------------------------------------------------------------
std_mutex_simple(64thread)                                 31.26us   31.99K
google_spin_simple(64thread)                      99.95%   31.28us   31.97K
folly_microspin_simple(64thread)                  83.63%   37.38us   26.75K
folly_picospin_simple(64thread)                   20.88%  149.68us    6.68K
folly_microlock_simple(64thread)                  45.46%   68.77us   14.54K
folly_sharedmutex_simple(64thread)                52.65%   59.38us   16.84K
folly_distributedmutex_simple(64thread)          154.90%   20.18us   49.55K
folly_distributedmutex_flatcombining_simple(64t  475.05%    6.58us  151.96K
folly_flatcombining_no_caching_simple(64thread)  195.63%   15.98us   62.58K
folly_flatcombining_caching_simple(64thread)     199.29%   15.69us   63.75K
atomics_fetch_add(64thread)                      580.23%    5.39us  185.61K
atomic_cas(64thread)                             510.76%    6.12us  163.39K
----------------------------------------------------------------------------
std_mutex_simple(128thread)                                70.53us   14.18K
google_spin_simple(128thread)                     99.20%   71.09us   14.07K
folly_microspin_simple(128thread)                 88.73%   79.49us   12.58K
folly_picospin_simple(128thread)                  22.24%  317.06us    3.15K
folly_microlock_simple(128thread)                 50.17%  140.57us    7.11K
folly_sharedmutex_simple(128thread)               59.53%  118.47us    8.44K
folly_distributedmutex_simple(128thread)         172.74%   40.83us   24.49K
folly_distributedmutex_flatcombining_simple(128  538.22%   13.10us   76.31K
folly_flatcombining_no_caching_simple(128thread  165.11%   42.72us   23.41K
folly_flatcombining_caching_simple(128thread)    161.46%   43.68us   22.89K
atomics_fetch_add(128thread)                     606.51%   11.63us   85.99K
atomic_cas(128thread)                            578.52%   12.19us   82.03K
```

Reviewed By: djwatson

Differential Revision: D13799447

fbshipit-source-id: 923cc35e5060ef79b349690821d8545459248347