Fix concurrency issues in ConcurrentSkipList.
Summary:

## An Overview of ConcurrentSkipList Synchronization

folly::ConcurrentSkipList, at a high level, is a normal skip list, except:

* Access to the nodes' "next" pointers is atomic. (The implementation used release/consume, which this diff changes to release/acquire. It's not clear the original use of consume was correct, and consume is very complicated without any practical benefit, so we should just avoid it.)
* All accesses (read/write) must go through an Accessor, which is essentially an RAII object that calls addRef() on construction and releaseRef() on destruction.
* Deleting a node defers the deletion to a "recycler", which is just a vector of nodes to be recycled. When releaseRef() drops the refcount to zero, the nodes in the recycler are deleted. Intuitively, when the refcount drops to zero, it is safe to delete the nodes in the recycler because nobody holds any Accessors. It's a very simple way to manage the lifetime of the nodes, without having to worry about *which* nodes are being accessed or are about to be deleted.

However, this refcount/recycling behavior is very hard to get right when using atomics as the main synchronization mechanism. I'll highlight three relevant parts of the buggy implementation before this diff:

* To find an element in the skip list (whether to fetch, insert, or delete it), we start from the head node and skip forward by following successor pointers at the appropriate levels until we arrive at the node in question. Rough pseudocode:

```
def find(val):
    node = head
    while read(node) < val:
        node = skip(node)  # read the node to find the successor at some level
    return node
```

* To delete an element from the skip list, after finding the element, we modify the predecessors at each level by changing their successors to point to the deleted element's successors, and place the deleted element into the recycler.
```
def delete(node):
    for level in range(...):
        node->pred->setSkip(node->succ)
    recycle(node)

def recycle(node):
    lock(RECYCLER_LOCK)
    recycler.add(node)
    unlock(RECYCLER_LOCK)
```

* releaseRef() and addRef():

```
def releaseRef():
    if refcount > 1:
        refcount--
        return
    lock(RECYCLER_LOCK)
    if refcount > 1:
        refcount--
        unlock(RECYCLER_LOCK)
        return
    to_recycle, recycler = recycler, []  # swap out nodes to recycle
    unlock(RECYCLER_LOCK)
    for node in to_recycle:
        free(node)
    refcount--

def addRef():
    refcount++
```

## The Safety Argument

The Accessor/deletion mechanism is safe if we can ensure the following:

* If, for a particular node X, a thread performs read(X) and a different thread performs free(X), then read(X) happens-before free(X).

### Problem 1: Relaxed decrement

The buggy implementation used relaxed memory order when doing `refcount--`. Let's see why this is a problem. Let thread 1 be the one performing read(X) and thread 2 the one performing free(X). The intuitive argument is that free(X) can only happen after the refcount drops to zero, which cannot be true while read(X) is happening. The somewhat more formal argument is that read(X) happens before thread 1's refcount--, which happens before thread 2's refcount--, which happens before free(X). But because we use relaxed memory order, the two refcount-- operations do not synchronize.

### Problem 2: Relaxed increment

The implementation also used relaxed memory order for addRef(). Normally, for a refcount, it is OK to use relaxed increments, but this refcount is different: the count can go back up once it reaches zero. When reusing a refcount this way, we can no longer use relaxed increments.
To see why, suppose thread 2 performs the following, in this order:

```
setSkip(P, not X)  # step before deleting X; P is a predecessor at some level
recycle(X)
refcount--         # acq_rel, gets 0
free(X)
```

and thread 1 performs the following, in this order:

```
refcount++  # relaxed
skip(P)     # gets X
read(X)
```

See Appendix A below; it's possible for the refcount to reach 0 and for thread 1 to get X when reading the successor from P. This means that free(X) might theoretically still race with read(X), as we failed to show that once we delete something from the skip list, another accessor can't possibly reach X again. This might feel like an unnecessary argument, but without this reasoning, we would not have found a problem even if we had modified releaseRef() to instead delete some random nodes from the list - which would certainly cause trouble when someone else later tried to read them.

### Problem 3: No release operation on refcount before freeing nodes

This one is much more subtle. Suppose thread 1 performs the following:

```
refcount--  # end of some previous accessor
refcount++  # begin of new accessor
skip(P)
```

and thread 2 does this:

```
setSkip(P, not X)  # step before deleting X; P is a predecessor
recycle(X)
read(refcount)     # gets 1, proceeds to lock
lock(RECYCLER_LOCK)
read(refcount)     # gets 1 ***
unlock(RECYCLER_LOCK)
free(X)
refcount--
```

The interleaving that makes this possible is:

```
thread 1: refcount--
thread 2: everything until free(X)
thread 1: refcount++
thread 2: free(X)
thread 2: refcount--
```

We wish to show that `setSkip(P, not X)` happens before `skip(P)`, because this allows us to prove that `skip(P) != X` (otherwise, we would not be able to show that a subsequent read(X) in thread 1 happens before free(X) - it might legitimately not). The intuitive argument is that `setSkip(P, not X)` happened before thread 2's decrement of the refcount, which happens before the increment of the refcount in thread 1, which happens before `skip(P)`.
However, because thread 2 actually decremented refcount quite late, it might be the case that thread 1's `refcount++` happened before thread 2's `refcount--` (and the increment synchronized with its own decrement earlier). There's nothing else in the middle that provided a synchronizes-with relationship (in particular, the read(refcount) operations do not provide synchronization because those are *loads* - wrong direction!).

### Correct implementation

In addition to using acq_rel memory order on all operations on refcount, this diff modifies releaseRef() like this:

```
def releaseRef():
    if refcount > 1:  # optimization
        refcount--
        return
    lock(RECYCLER_LOCK)
    if --refcount == 0:  # "GUARD"
        to_recycle, recycler = recycler, []  # swap out nodes to recycle
    unlock(RECYCLER_LOCK)
    for node in to_recycle:
        free(node)
```

I'll use "GUARD" to denote the event that the --refcount within the lock *returns 0*. I'll use this for the correctness proof.

### Correct implementation proof

The proof will still be to show that if thread 1 performs read(X) and thread 2 performs free(X), read(X) happens-before free(X).

Proof: thread 1 must have grabbed an accessor while reading X, so its sequence of actions looks like this:

```
refcount++
skip(P)  # gets X
read(X)
refcount--
```

thread 2 performs:

```
GUARD
free(X)
```

Now, all writes on refcount are RMW operations and they all use acq_rel ordering, so the RMW operations on refcount form a total order in which successive operations have a synchronizes-with relationship. We'll look at where GUARD might stand in this total order.

* Case 1: GUARD is after refcount-- from thread 1 in the total order.
  * In this case, read(X) happens before refcount-- in thread 1, which happens before GUARD, which happens before free(X).
* Case 2: GUARD is between refcount++ and refcount-- from thread 1 in the total order.
  * In this case, observe (by looking at the total order on refcount RMWs) that at least two threads (1 and 2) contribute 1 to the refcount right before GUARD. In other words, GUARD could not possibly have returned 0, which is a contradiction.
* Case 3: GUARD is before refcount++ from thread 1 in the total order.
  * Let `setSkip(P, not X)` be a predecessor write operation performed before X is added to the recycler (this can happen on any thread). We will prove in the lemma below that `setSkip(P, not X)` happens before GUARD. Once we have that, `setSkip(P, not X)` happens before GUARD, which happens before thread 1's refcount++, which happens before `skip(P)`, and that renders it impossible for `skip(P)` to return X - a contradiction.

It remains to prove that `setSkip(P, not X)` happens before GUARD. The thread that performs a `setSkip(P, not X)` operation subsequently performs `recycle(X)`, which adds X to the recycler while holding RECYCLER_LOCK. In thread 2, GUARD happens within RECYCLER_LOCK, and the subsequent swap of the recycler vector contained X (shown by the fact that we free(X) after GUARD), so the lock must have been grabbed *after* the lock that added X to the recycler. In other words, `setSkip(P, not X)` happens before `recycler.add(X)`, which happens before `unlock(RECYCLER_LOCK)`, which happens before `lock(RECYCLER_LOCK)`, which happens before GUARD.

Note that, just like in the original implementation, the fast path on top of releaseRef() is not a perfect optimization; it may delay the deletion of otherwise safe-to-delete nodes. However, that does not affect our correctness argument, because it is always at least as safe to *delay* deletions (this hand-wavy argument is not part of the proof).
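As a concrete illustration of the corrected protocol, here is a minimal, self-contained C++ sketch. This is not folly's actual code: the `Recycler` struct, the `pending` vector, and the `freed` counter are invented for this demo, and "nodes" are plain heap ints rather than skip-list nodes. What it shows is the fixed shape of the protocol: all refcount RMWs use acq_rel, and the decrement that can observe zero - the GUARD - happens while holding the recycler lock.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical model of the fixed refcount/recycler protocol (demo only).
struct Recycler {
  std::atomic<int> refcount{0};
  std::mutex recyclerLock;     // plays the role of RECYCLER_LOCK
  std::vector<int*> pending;   // nodes waiting to be freed
  std::atomic<int> freed{0};   // instrumentation for this demo only

  void addRef() { refcount.fetch_add(1, std::memory_order_acq_rel); }

  void recycle(int* node) {
    std::lock_guard<std::mutex> g(recyclerLock);
    pending.push_back(node);
  }

  void releaseRef() {
    // Fast path: clearly not the last accessor. Two racing threads can both
    // take this branch and let the count reach zero without recycling, but
    // that only *delays* deletion until the next GUARD, which is safe.
    if (refcount.load(std::memory_order_acquire) > 1) {
      refcount.fetch_sub(1, std::memory_order_acq_rel);
      return;
    }
    std::vector<int*> toFree;
    {
      std::lock_guard<std::mutex> g(recyclerLock);
      // GUARD: only the decrement that returns-to-zero swaps out the list.
      if (refcount.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        toFree.swap(pending);
      }
    }
    for (int* n : toFree) {
      delete n;
      freed.fetch_add(1, std::memory_order_relaxed);
    }
  }
};
```

Because of the fast path, nodes can remain pending after the count reaches zero; the next addRef()/releaseRef() cycle (i.e., the next GUARD) drains them, consistent with the note above that delaying deletions is always conservative.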
## Appendix A

Consider the following two threads:

```
std::atomic<int> x{0}, y{1};

// Thread 1:
x.store(1, std::memory_order_release);                // A
int y1 = y.fetch_add(-1, std::memory_order_acq_rel);  // B

// Thread 2:
y.fetch_add(1, std::memory_order_relaxed);            // C
int x2 = x.load(std::memory_order_acquire);           // D
```

Intuitively, if y1 = 1, then thread 1's fetch_add was executed first, so thread 2 should get x2 = 1. Otherwise, if thread 2's fetch_add happened first, then y1 = 2, and x2 could be either 0 or 1. But could it happen that y1 = 1 and x2 = 0?

Let's look at the happens-before relationships between these operations. For intra-thread (sequenced-before) relationships, we have A < B and C < D (I'm using < to denote happens-before). For inter-thread (synchronizes-with) relationships, the only pair that could establish a synchronizes-with relationship is A and D. (B and C are not eligible because C uses relaxed ordering.) For y1 = 1 and x2 = 0 to happen, D must not read the result of A, so it must be the case that A does not synchronize with D. But that's as far as we can go; there's nothing that really enforces much of an ordering between the two threads.

We can also think about this in terms of reordering of memory operations. Thread 1 is fairly safe from reordering because of the acq_rel, but in thread 2, an "acquire" ordering typically means no memory operations after D may be reordered before D - it doesn't prevent C from being reordered after D. C itself does not prevent reordering, being a relaxed operation. So if thread 2 executed D and then C, it would be trivially possible to have y1 = 1 and x2 = 0.

The point of this is to highlight that merely having a release/acquire pair does not magically *order* the two operations. The pair provides a synchronizes-with relationship *if* the read happens to obtain the value written by the write, but no guarantee of *which* value will be read.
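For contrast, here is a sketch of one way to restore the intuitive guarantee in this example (an illustration, not part of the diff): upgrade C from relaxed to acq_rel. Then, if y1 = 1, B read y's initial value, so B precedes C in y's modification order; C, being an RMW, must read the value written by B, so the release RMW B synchronizes with the acquire RMW C, giving A < B < C < D and forcing D to observe x = 1. The `trial` helper below is invented for the demo.

```cpp
#include <atomic>
#include <thread>
#include <utility>

// One trial of the Appendix A program, with thread 2's increment upgraded
// from relaxed to acq_rel. Returns the observed pair (y1, x2).
std::pair<int, int> trial() {
  std::atomic<int> x{0}, y{1};
  int y1 = 0, x2 = 0;
  std::thread t1([&] {
    x.store(1, std::memory_order_release);            // A
    y1 = y.fetch_add(-1, std::memory_order_acq_rel);  // B
  });
  std::thread t2([&] {
    y.fetch_add(1, std::memory_order_acq_rel);        // C: no longer relaxed
    x2 = x.load(std::memory_order_acquire);           // D
  });
  t1.join();
  t2.join();
  return {y1, x2};
}
```

With this change, y1 = 1 and x2 = 0 is no longer a permitted outcome, no matter how the two threads interleave.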
## Appendix B

Problem 1 is detected by TSAN, but problems 2 and 3 are not. Why?

TSAN detects data races by deriving synchronization relationships from the *actual* interleaving of atomics at runtime. If the actual interleaving would always happen in practice but is not *guaranteed* by the standard, there may be a real, undetected data race. For example, it is well known that the following code will be flagged by TSAN as a data race on the int "x":

```
int x = 1;
std::atomic<bool> y{false};

// Thread 1
x = 2;
y.store(true, memory_order_relaxed);

// Thread 2
while (!y.load(memory_order_acquire)) {
  // acquire is useless because the writer used relaxed
}
std::cout << x << std::endl;
```

TSAN reports a data race on x because `y` fails to provide a proper synchronizes-with relationship between the two threads due to the incorrect memory ordering. However, when compiling on x86, we will most likely end up with a binary that always exhibits the intuitively desired behavior anyway.

Now consider the following code:

```
std::atomic<int> x{1};
std::atomic<bool> y{false};
int z = 8;

// Thread 1
z = 9;
x.store(2, memory_order_release);
y.store(true, memory_order_relaxed);

// Thread 2
while (!y.load(memory_order_acquire)) {
  // acquire is useless because the writer used relaxed
}
x.load(memory_order_acquire);
std::cout << z << std::endl;
```

There is a data race on the access to z, because the happens-before chain `write z -> x.store -> y.store -> y.load -> x.load -> read z` is broken at the `y.store -> y.load` link. However, TSAN will not report a data race, because it sees the chain as `write z -> x.store -> x.load -> read z`. It treats x.store as synchronizing with x.load because it *observed* that x.load obtained the value written by x.store *at runtime*, so it inferred that the synchronization was valid. This is not guaranteed, though: it is possible in some execution (in theory) for x.load not to get the value written by x.store (similar to Appendix A).
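For completeness, here is a hypothetical repaired variant of the second example (names `writerThread`/`readerThread` are invented for the demo): upgrading the y.store to release lets y itself carry the synchronizes-with edge, so the chain to z no longer depends on the x.store/x.load pair at all, and the program has a genuine happens-before guarantee rather than one TSAN merely infers from a lucky interleaving.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int z = 8;
std::atomic<bool> y{false};

void writerThread() {
  z = 9;
  y.store(true, std::memory_order_release);  // was relaxed; now release
}

void readerThread() {
  while (!y.load(std::memory_order_acquire)) {
    // spin until the release store becomes visible
  }
  // Guaranteed by release/acquire on y: z = 9 happens-before this read.
  assert(z == 9);
}
```

Under this version the chain `write z -> y.store -> y.load -> read z` holds in every execution, not just the ones a race detector happens to observe.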
Reviewed By: yfeldblum

Differential Revision: D29248955

fbshipit-source-id: 2a3c9379c7c3a6469183df64582ca9cf763c0890