Commit f11ea4c6 authored by Aaryaman Sagar, committed by Facebook Github Bot

Add flat combining to DistributedMutex

Summary:
Add combined critical sections to DistributedMutex.  The implementation uses
the framework within DistributedMutex as the point of reference for contention,
and resolves contention by either combining the lock requests of peers or
migrating the lock, based on usage and internal state.  This boosts the
performance of DistributedMutex further: up to 4x relative to the old
benchmark on dual-socket Broadwell machines and up to 5x on single-socket
Skylake machines.  The win may be bigger when the cost of mutex migration is
higher, e.g. when the data being protected is wider than a single L1 cache
line.  Small critical sections, when used in combinable mode, can now run more
than 10x faster than with the small locks, about 6x faster than std::mutex, up
to 2-3x faster than the flat-combining implementations we benchmarked against,
and about as fast as a CAS instruction/loop (faster on some NUMA-less and more
parallel architectures like Skylake).  This also allows flat combining to be
used in situations where fine-grained locking would be beneficial, with
virtually no overhead; DistributedMutex retains its original size of 8 bytes.
DistributedMutex resolves contention through flat combining only up to a
constant factor of 2 contention chains, to prevent fairness issues and latency
outliers.  So we retain the fairness benefits of the original implementation,
with no noticeable regression when switching between the lock methods.

The implementation of combined critical sections here differs from the
original flat combining paper.  It uses the same stack-based LIFO contention
chains from DistributedMutex to allow the leader to resolve lock requests from
peers.  Combine records are located on the stack along with the wait-node as an
InlineFunctionRef instance, to avoid memory-allocation overhead and expensive
copying.  Using InlineFunctionRef also means that function calls are resolved
without going through the double lookup of a vtable-based implementation:
InlineFunctionRef can flatten the virtual table and the callable object
in situ, so we have just one indirection.  Additionally, we use preemption as a
signal to speed up lock requests in cases where acquisition latency would
otherwise have gone beyond our control.  As a side bonus, this also results in
much simpler code.

The API looks like the following:
```
auto integer = std::uint64_t{};
auto mutex = folly::DistributedMutex{};

// ...

mutex.lock_combine([&]() {
  foo();
  integer++;
});
```

This adds three new methods, for symmetry with the old lock functions:
- folly::invoke_result_t<const Func&> lock_combine(Func) noexcept;
- folly::Optional<> try_lock_combine_for(duration, Func) noexcept;
- folly::Optional<> try_lock_combine_until(time_point, Func) noexcept;

Benchmarks on Broadwell
```
std_mutex_simple(1thread)                                  617.28ns    1.62M
google_spin_simple(1thread)                      101.97%   605.33ns    1.65M
folly_microspin_simple(1thread)                   99.40%   621.01ns    1.61M
folly_picospin_simple(1thread)                   100.15%   616.36ns    1.62M
folly_microlock_simple(1thread)                   98.86%   624.37ns    1.60M
folly_sharedmutex_simple(1thread)                 86.14%   716.59ns    1.40M
folly_distributedmutex_simple(1thread)            97.95%   630.21ns    1.59M
folly_distributedmutex_flatcombining_simple(1th   98.04%   629.60ns    1.59M
folly_flatcombining_no_caching_simple(1thread)    89.85%   687.01ns    1.46M
folly_flatcombining_caching_simple(1thread)       78.36%   787.75ns    1.27M
atomics_fetch_add(1thread)                        97.88%   630.67ns    1.59M
atomic_cas(1thread)                              102.31%   603.33ns    1.66M
----------------------------------------------------------------------------
std_mutex_simple(2thread)                                    1.14us  875.72K
google_spin_simple(2thread)                      125.08%   912.95ns    1.10M
folly_microspin_simple(2thread)                  116.03%   984.14ns    1.02M
folly_picospin_simple(2thread)                   117.35%   973.04ns    1.03M
folly_microlock_simple(2thread)                  102.54%     1.11us  897.95K
folly_sharedmutex_simple(2thread)                121.04%   943.42ns    1.06M
folly_distributedmutex_simple(2thread)           128.24%   890.48ns    1.12M
folly_distributedmutex_flatcombining_simple(2th  107.99%     1.06us  945.66K
folly_flatcombining_no_caching_simple(2thread)    83.40%     1.37us  730.33K
folly_flatcombining_caching_simple(2thread)       87.47%     1.31us  766.00K
atomics_fetch_add(2thread)                       115.71%   986.85ns    1.01M
atomic_cas(2thread)                              171.35%   666.42ns    1.50M
----------------------------------------------------------------------------
std_mutex_simple(4thread)                                    1.98us  504.43K
google_spin_simple(4thread)                      103.24%     1.92us  520.76K
folly_microspin_simple(4thread)                   92.05%     2.15us  464.33K
folly_picospin_simple(4thread)                    89.16%     2.22us  449.75K
folly_microlock_simple(4thread)                   66.62%     2.98us  336.06K
folly_sharedmutex_simple(4thread)                 82.61%     2.40us  416.69K
folly_distributedmutex_simple(4thread)           108.83%     1.82us  548.98K
folly_distributedmutex_flatcombining_simple(4th  145.24%     1.36us  732.63K
folly_flatcombining_no_caching_simple(4thread)    84.77%     2.34us  427.62K
folly_flatcombining_caching_simple(4thread)       91.01%     2.18us  459.09K
atomics_fetch_add(4thread)                       142.86%     1.39us  720.62K
atomic_cas(4thread)                              223.50%   887.02ns    1.13M
----------------------------------------------------------------------------
std_mutex_simple(8thread)                                    3.70us  270.40K
google_spin_simple(8thread)                      110.24%     3.35us  298.09K
folly_microspin_simple(8thread)                   81.59%     4.53us  220.63K
folly_picospin_simple(8thread)                    57.61%     6.42us  155.77K
folly_microlock_simple(8thread)                   54.18%     6.83us  146.49K
folly_sharedmutex_simple(8thread)                 55.44%     6.67us  149.92K
folly_distributedmutex_simple(8thread)           109.86%     3.37us  297.05K
folly_distributedmutex_flatcombining_simple(8th  225.14%     1.64us  608.76K
folly_flatcombining_no_caching_simple(8thread)    96.25%     3.84us  260.26K
folly_flatcombining_caching_simple(8thread)      108.13%     3.42us  292.39K
atomics_fetch_add(8thread)                       255.40%     1.45us  690.60K
atomic_cas(8thread)                              183.68%     2.01us  496.66K
----------------------------------------------------------------------------
std_mutex_simple(16thread)                                   8.70us  114.89K
google_spin_simple(16thread)                     124.47%     6.99us  143.01K
folly_microspin_simple(16thread)                  86.46%    10.07us   99.34K
folly_picospin_simple(16thread)                   40.76%    21.36us   46.83K
folly_microlock_simple(16thread)                  54.78%    15.89us   62.94K
folly_sharedmutex_simple(16thread)                58.14%    14.97us   66.80K
folly_distributedmutex_simple(16thread)          124.53%     6.99us  143.08K
folly_distributedmutex_flatcombining_simple(16t  324.08%     2.69us  372.34K
folly_flatcombining_no_caching_simple(16thread)  134.73%     6.46us  154.79K
folly_flatcombining_caching_simple(16thread)     188.24%     4.62us  216.28K
atomics_fetch_add(16thread)                      340.07%     2.56us  390.72K
atomic_cas(16thread)                             220.15%     3.95us  252.93K
----------------------------------------------------------------------------
std_mutex_simple(32thread)                                  25.62us   39.03K
google_spin_simple(32thread)                     105.21%    24.35us   41.07K
folly_microspin_simple(32thread)                  79.64%    32.17us   31.08K
folly_picospin_simple(32thread)                   19.61%   130.67us    7.65K
folly_microlock_simple(32thread)                  42.97%    59.62us   16.77K
folly_sharedmutex_simple(32thread)                52.41%    48.88us   20.46K
folly_distributedmutex_simple(32thread)          144.48%    17.73us   56.39K
folly_distributedmutex_flatcombining_simple(32t  461.73%     5.55us  180.22K
folly_flatcombining_no_caching_simple(32thread)  207.55%    12.34us   81.01K
folly_flatcombining_caching_simple(32thread)     237.34%    10.80us   92.64K
atomics_fetch_add(32thread)                      561.68%     4.56us  219.23K
atomic_cas(32thread)                             484.13%     5.29us  188.96K
----------------------------------------------------------------------------
std_mutex_simple(64thread)                                  31.26us   31.99K
google_spin_simple(64thread)                      99.95%    31.28us   31.97K
folly_microspin_simple(64thread)                  83.63%    37.38us   26.75K
folly_picospin_simple(64thread)                   20.88%   149.68us    6.68K
folly_microlock_simple(64thread)                  45.46%    68.77us   14.54K
folly_sharedmutex_simple(64thread)                52.65%    59.38us   16.84K
folly_distributedmutex_simple(64thread)          154.90%    20.18us   49.55K
folly_distributedmutex_flatcombining_simple(64t  475.05%     6.58us  151.96K
folly_flatcombining_no_caching_simple(64thread)  195.63%    15.98us   62.58K
folly_flatcombining_caching_simple(64thread)     199.29%    15.69us   63.75K
atomics_fetch_add(64thread)                      580.23%     5.39us  185.61K
atomic_cas(64thread)                             510.76%     6.12us  163.39K
----------------------------------------------------------------------------
std_mutex_simple(128thread)                                 70.53us   14.18K
google_spin_simple(128thread)                     99.20%    71.09us   14.07K
folly_microspin_simple(128thread)                 88.73%    79.49us   12.58K
folly_picospin_simple(128thread)                  22.24%   317.06us    3.15K
folly_microlock_simple(128thread)                 50.17%   140.57us    7.11K
folly_sharedmutex_simple(128thread)               59.53%   118.47us    8.44K
folly_distributedmutex_simple(128thread)         172.74%    40.83us   24.49K
folly_distributedmutex_flatcombining_simple(128  538.22%    13.10us   76.31K
folly_flatcombining_no_caching_simple(128thread  165.11%    42.72us   23.41K
folly_flatcombining_caching_simple(128thread)    161.46%    43.68us   22.89K
atomics_fetch_add(128thread)                     606.51%    11.63us   85.99K
atomic_cas(128thread)                            578.52%    12.19us   82.03K
```

Reviewed By: djwatson

Differential Revision: D13799447

fbshipit-source-id: 923cc35e5060ef79b349690821d8545459248347
parent 11566445
/*
* Copyright 2004-present Facebook, Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include <folly/synchronization/DistributedMutex.h>
namespace folly {
namespace detail {
namespace distributed_mutex {
template class DistributedMutex<std::atomic, true>;
} // namespace distributed_mutex
} // namespace detail
} // namespace folly