Commit 2e45ff8d authored by Robin Cheng, committed by Facebook GitHub Bot

Fix a TSAN-reported race condition in ThreadLocal.

Summary:
This is a pretty cryptic race condition, but it works like this:

Each thread local instance (i.e. one ThreadLocalPtr) is assigned a unique ID by the static meta. This ID is freed up when the thread local is destroyed, and may then be reused by another newly created thread local.

For each thread we keep an array of elements (ElementWrapper), indexed by the thread local ID. Each element holds metadata, such as the pointer to the object cached for that particular combination of thread and thread local.

There are a few cases where we access an ElementWrapper object (a simplified sketch follows this list):
 - We write to it under a global lock when destroying a thread local object (for each thread, only the i-th element is written to, where 'i' is the destroyed thread local's ID).
 - We write to it under a global lock when reallocating the thread's array of elements because a new thread local ID exceeds the array's capacity; this can happen while reading the thread local from that thread.
 - We write to it under a global lock when the corresponding thread is destroyed.
 - We read from it, *without* any locks, when reading a thread local's value from a thread and the above reallocation case does not occur.
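
To make those access patterns concrete, here is a heavily simplified, hypothetical sketch of the data structures and the locked vs. unlocked paths. The names mirror the description above, but none of this is folly's actual code:

```
#include <atomic>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical, simplified model of the structures described above.
struct ElementWrapper {
  void* ptr = nullptr; // object cached for this (thread, thread local) pair
};

struct ThreadEntry {
  std::vector<ElementWrapper> elements; // indexed by thread local ID
};

std::mutex metaLock;             // stands in for the static meta's global lock
thread_local ThreadEntry entry;  // stands in for the per-thread element array

struct ThreadLocalSketch {
  std::atomic<uint32_t> id{0}; // ID handed out by the static meta

  void* get() {
    // Fast path: the ID load and the element read happen with NO lock held.
    uint32_t i = id.load(std::memory_order_relaxed); // the load in question
    if (i >= entry.elements.size()) {
      // Slow path: grow this thread's array under the global lock.
      std::lock_guard<std::mutex> g(metaLock);
      entry.elements.resize(i + 1);
    }
    return entry.elements[i].ptr; // unlocked read of the ElementWrapper
  }

  void destroyForThisThread() {
    // Destruction path: the element is cleared under the global lock.
    std::lock_guard<std::mutex> g(metaLock);
    uint32_t i = id.load(std::memory_order_relaxed);
    if (i < entry.elements.size()) {
      entry.elements[i].ptr = nullptr;
    }
  }
};
```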

The race condition happens in this case:
 - Thread locals A and B are created
 - Thread local A is accessed by thread Z and gets, say, ID = 5; the corresponding element E is initialized.
 - Thread local A is destroyed by thread X, and under the global lock, the element E is written to (setting element->ptr = nullptr); ID 5 is released to the list of free IDs.
 - Thread local B is accessed for the first time by thread Y, and therefore needs to grab an ID; under the global lock, it happens to get ID = 5, and stores that ID into the thread local instance as an atomic int; the store here uses std::memory_order_seq_cst.
 - Thread local B is then accessed again from thread Z; this time the atomic int ID is loaded with std::memory_order_relaxed, with the resulting value of 5; we then proceed to read element E's ptr value *without any locks*.

TSAN reports the last read of element E as racing with the earlier write to element E during thread X's destruction of thread local A. Even though a mutex was locked and unlocked in between (by threads X and Y), thread Z never acquires that mutex on its read path; the only ordering on that path is the std::memory_order_relaxed load of the ID, which establishes no happens-before relationship with the earlier write, so TSAN considers the read unsynchronized.

This conjecture is confirmed by the following small reproducible test case:

```
#include <atomic>
#include <iostream>
#include <mutex>
#include <thread>

#include <gtest/gtest.h>

TEST(MyTest, Test) {
  std::mutex mutex;
  int x = 0;
  bool written = false;
  std::atomic<bool> seen{false};

  // Reader: spins on a relaxed load of 'seen', then reads x with no lock held.
  std::thread t0([&] {
    while (!seen.load(std::memory_order_relaxed)) {
    }
    std::cout << x;
  });

  // Observer: under the mutex, waits for 'written' and then publishes 'seen'
  // with a release store.
  std::thread t1([&] {
    while (true) {
      std::lock_guard<std::mutex> guard(mutex);
      if (written) {
        seen.store(true, std::memory_order_release);
        break;
      }
    }
  });

  // Writer: under the mutex, writes x and sets 'written'.
  std::thread t2([&] {
    std::lock_guard<std::mutex> guard(mutex);
    x = 1;
    written = true;
  });

  t0.join();
  t1.join();
  t2.join();
}
```

(This example, when run under TSAN, reports a race condition between "x = 1" and "std::cout << x".)

This change makes the load use std::memory_order_acquire so that it synchronizes with the release store. This should have no performance implications on x86, but I'm not totally confident about that.
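
Applied to the standalone repro above, the analogous fix would be to make t0's spin loop use an acquire load. This is a sketch only; the actual change is to getOrInvalid() in the diff below:

```
  std::thread t0([&] {
    // The acquire load synchronizes with t1's release store; combined with the
    // mutex ordering between t2 and t1, t2's write to x is visible here.
    while (!seen.load(std::memory_order_acquire)) {
    }
    std::cout << x;
  });
```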

While we're at it, this change also switches the allocate() and destroy() functions to more relaxed memory orders; sequential consistency is unnecessary there, since both already run under the global meta lock, and the release exchange in allocate() is what the new acquire load in getOrInvalid() pairs with.

Reviewed By: yfeldblum

Differential Revision: D22561685

fbshipit-source-id: 0d53925486ef78139ffe71681827b0c0959dd29e
parent 6c21531d
```
@@ -227,7 +227,7 @@ uint32_t StaticMetaBase::allocate(EntryID* ent) {
   auto& meta = *this;
   std::lock_guard<std::mutex> g(meta.lock_);
 
-  id = ent->value.load();
+  id = ent->value.load(std::memory_order_relaxed);
   if (id != kEntryIDInvalid) {
     return id;
   }
@@ -239,7 +239,7 @@ uint32_t StaticMetaBase::allocate(EntryID* ent) {
     id = meta.nextId_++;
   }
-  uint32_t old_id = ent->value.exchange(id);
+  uint32_t old_id = ent->value.exchange(id, std::memory_order_release);
   DCHECK_EQ(old_id, kEntryIDInvalid);
   reserveHeadUnlocked(id);
@@ -269,7 +269,8 @@ void StaticMetaBase::destroy(EntryID* ent) {
   {
     std::lock_guard<std::mutex> g(meta.lock_);
-    uint32_t id = ent->value.exchange(kEntryIDInvalid);
+    uint32_t id =
+        ent->value.exchange(kEntryIDInvalid, std::memory_order_relaxed);
     if (id == kEntryIDInvalid) {
       return;
     }
@@ -328,11 +328,7 @@ struct StaticMetaBase {
     EntryID& operator=(const EntryID& other) = delete;
 
     uint32_t getOrInvalid() {
-      // It's OK for this to be relaxed, even though we're effectively doing
-      // double checked locking in using this value. We only care about the
-      // uniqueness of IDs, getOrAllocate does not modify any other memory
-      // this thread will use.
-      return value.load(std::memory_order_relaxed);
+      return value.load(std::memory_order_acquire);
     }
 
     uint32_t getOrAllocate(StaticMetaBase& meta) {
```