UCS: Introduce lightweight rwlock #10355

Merged
merged 1 commit into from
Jun 18, 2025

Conversation

Artemy-Mellanox
Contributor

@Artemy-Mellanox Artemy-Mellanox commented Dec 5, 2024

[ RUN      ] test_rwlock.perf <> <>
[     INFO ] 8.22035 ms builtin 1 threads 1 writers per 256 
[     INFO ] 9.91718 ms builtin 1 threads 25 writers per 256 
[     INFO ] 15.6226 ms builtin 1 threads 128 writers per 256 
[     INFO ] 8.53102 ms builtin 1 threads 250 writers per 256 
[     INFO ] 14.8544 ms builtin 2 threads 1 writers per 256 
[     INFO ] 20.3778 ms builtin 2 threads 25 writers per 256 
[     INFO ] 25.2576 ms builtin 2 threads 128 writers per 256 
[     INFO ] 18.0413 ms builtin 2 threads 250 writers per 256 
[     INFO ] 15.9894 ms builtin 4 threads 1 writers per 256 
[     INFO ] 25.9471 ms builtin 4 threads 25 writers per 256 
[     INFO ] 30.6886 ms builtin 4 threads 128 writers per 256 
[     INFO ] 23.7099 ms builtin 4 threads 250 writers per 256 
[     INFO ] 109.585 ms builtin 64 threads 1 writers per 256 
[     INFO ] 498.465 ms builtin 64 threads 25 writers per 256 
[     INFO ] 981.667 ms builtin 64 threads 128 writers per 256 
[     INFO ] 891.406 ms builtin 64 threads 250 writers per 256 
[       OK ] test_rwlock.perf (2699 ms)
[ RUN      ] test_rwlock.pthread <> <>
[     INFO ] 15.7748 ms pthread 1 threads 1 writers per 256 
[     INFO ] 17.1273 ms pthread 1 threads 25 writers per 256 
[     INFO ] 26.9475 ms pthread 1 threads 128 writers per 256 
[     INFO ] 15.2211 ms pthread 1 threads 250 writers per 256 
[     INFO ] 29.4559 ms pthread 2 threads 1 writers per 256 
[     INFO ] 215.037 ms pthread 2 threads 25 writers per 256 
[     INFO ] 104.325 ms pthread 2 threads 128 writers per 256 
[     INFO ] 41.5187 ms pthread 2 threads 250 writers per 256 
[     INFO ] 35.4409 ms pthread 4 threads 1 writers per 256 
[     INFO ] 196.073 ms pthread 4 threads 25 writers per 256 
[     INFO ] 122.957 ms pthread 4 threads 128 writers per 256 
[     INFO ] 62.3138 ms pthread 4 threads 250 writers per 256 
[     INFO ] 73.8066 ms pthread 64 threads 1 writers per 256 
[     INFO ] 198.412 ms pthread 64 threads 25 writers per 256 
[     INFO ] 165.476 ms pthread 64 threads 128 writers per 256 
[     INFO ] 118.79 ms pthread 64 threads 250 writers per 256 
[       OK ] test_rwlock.pthread (1439 ms)

@Artemy-Mellanox Artemy-Mellanox force-pushed the topic/gdrcopy-perf-2 branch 2 times, most recently from 65f025c to 89415df Compare December 5, 2024 15:38
ucs_cpu_relax();
}

x = __atomic_fetch_add(&lock->l, UCS_RWLOCK_READ, __ATOMIC_ACQUIRE);
Contributor

maybe we can use the atomic operations defined in atomic.h?

Contributor

maybe we can replace the deprecated __sync* functions from atomic.h with the new __atomic variants?
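For context, the migration being suggested here is mechanical; a sketch (the function names are made up for illustration, not UCX API):

```c
#include <stdint.h>

/* Deprecated __sync builtin: always a full barrier, no control
 * over the memory order. */
static inline uint32_t legacy_fadd(uint32_t *ptr, uint32_t val)
{
    return __sync_fetch_and_add(ptr, val);
}

/* __atomic replacement: same read-modify-write semantics, but the
 * memory order is explicit (acquire here, matching what the rwlock
 * read-lock path needs). */
static inline uint32_t modern_fadd(uint32_t *ptr, uint32_t val)
{
    return __atomic_fetch_add(ptr, val, __ATOMIC_ACQUIRE);
}
```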

Comment on lines 187 to 192
int m = std::thread::hardware_concurrency();
std::vector<int> threads = {1, 2, 4, m};
std::vector<int> writers_per_256 = {1, 25, 128, 250};

Contributor

  1. i think we also want to measure overhead of read lock+unlock regardless of concurrency since it is the reason we added lightweight rwlock
  2. can you pls post example output in the PR description?

Contributor Author

  1. you mean write percent 0?
  2. didn't I do that?

Contributor

  1. yes (I see you added it)
  2. yes, pls update with percent 0


UCS_TEST_F(test_rwlock, perf) {
ucs_rwlock_t lock = UCS_RWLOCK_STATIC_INITIALIZER;
measure(
Contributor

This is a good performance test, but it must change some state to verify the lock's correctness. That is the whole purpose of a litmus test. For instance, these functions could perform some simple math and we check the invariant:

    // Invariant: counter2 is 2 times bigger than counter1
    int counter1 = 1;
    int counter2 = 2;
    measure(
            [&]() {
                ucs_rwlock_read_lock(&lock);
                UCS_ASSERT_EQ(counter1 * 2, counter2);
                ucs_rwlock_read_unlock(&lock);
            },
            [&]() {
                ucs_rwlock_write_lock(&lock);
                counter1 += counter1;
                counter2 += counter2;
                if (counter2 > 100000) {
                    counter1 = 1;
                    counter2 = 2;
                }
                ucs_rwlock_write_unlock(&lock);
            },

Contributor Author

I'm not sure this is a good idea. This test doesn't guarantee lock correctness, and it will very easily give a false-negative result.

Contributor

Well, testing the correctness of MT algorithms is a hard topic. You're right about false-negative results, and that's OK. That's actually the nature of litmus tests: when they pass, it does not guarantee that your algorithm is 100% correct, but when they fail, it's obviously broken. Personally I've caught tons of MT issues with the help of litmus tests.
If you propose some other way of testing correctness, let's discuss that. What I'm proposing is a well-established industry practice, and I'm sure it's better to have it than no testing at all.

Moreover, the example that I provided is just scratching the surface. We should also consider adding tests with nested locks, try-locks, etc

int x;

while (1) {
while (lock->l & UCS_RWLOCK_MASK) {
Contributor

@iyastreb iyastreb Dec 6, 2024

I'm not sure why we read atomic without mem order guarantees, normally it should be something

__atomic_load_n(&lock->l, __ATOMIC_RELAXED)

Same in other read cases

Contributor Author

why? AFAIU, relaxed mem order means no mem order guarantees

Contributor

Well, my point here is to use the uniform API to intercept loads/stores of this variable, so it does not need to be volatile. And then we can specify an appropriate mem order, whether it's relaxed or acquire.
Btw, I also see a performance improvement after replacing volatile with __atomic_load


static inline void ucs_rwlock_write_unlock(ucs_rwlock_t *lock)
{
__atomic_fetch_sub(&lock->l, UCS_RWLOCK_WRITE, __ATOMIC_RELAXED);
Contributor

i think __ATOMIC_RELEASE is usually used on all release paths, to make sure that whatever happened under the lock is visible to the next acquirer; any reason for not doing so in this PR?
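What the reviewer is suggesting, as a sketch (the `DEMO_` names and bit value are assumptions, not the actual UCX code): a release fetch_sub on the unlock path pairs with the acquire on the lock path, so stores made inside the critical section are visible to the next owner.

```c
/* Assumed bit value, illustration only */
#define DEMO_RWLOCK_WRITE 0x2

typedef struct { int l; } demo_rwlock_t;

/* Release on unlock: everything written while holding the lock
 * happens-before the next acquire of the same lock word. With
 * __ATOMIC_RELAXED here, that guarantee would be lost. */
static inline void demo_write_unlock(demo_rwlock_t *lock)
{
    __atomic_fetch_sub(&lock->l, DEMO_RWLOCK_WRITE, __ATOMIC_RELEASE);
}
```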

}
};

UCS_TEST_F(test_rwlock, lock) {
Contributor

Do we have a way to run this/some tests with maximal optimizations (maybe an inline compiler pragma, ...)? I suspect we end up running it without optimizations, which might affect correctness-related tests.

Contributor

Can we run this test and others with all optimizations?



/**
* Read-write lock.
Contributor

spinlock

#include <errno.h>

/**
* The ucs_rwlock_t type.
Contributor

suggestion: ucs_rw_spinlock_t and for all apis?

}


static inline void ucs_rwlock_read_unlock(ucs_rwlock_t *lock)
Contributor

add assertions/checks to detect underflow using returned value, only on debug builds (and also overflow on the other path)
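A sketch of what that could look like (the `demo_`/`DEMO_` names are made up; assert() stands in for UCX's debug-only assertion macro): fetch_sub returns the value *before* the subtraction, so the old value must contain at least one reader.

```c
#include <assert.h>

/* Assumed bit value, illustration only */
#define DEMO_RWLOCK_READ 0x4

typedef struct { int l; } demo_rwlock_t;

static inline void demo_read_unlock(demo_rwlock_t *lock)
{
    int prev = __atomic_fetch_sub(&lock->l, DEMO_RWLOCK_READ,
                                  __ATOMIC_RELEASE);

    /* The returned pre-subtraction value must hold at least one
     * reader, otherwise the counter would underflow. The check is
     * free on release builds where assert() compiles out. */
    assert(prev >= DEMO_RWLOCK_READ);
    (void)prev;
}
```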


static inline void ucs_rwlock_write_unlock(ucs_rwlock_t *lock)
{
__atomic_fetch_sub(&lock->l, UCS_RWLOCK_WRITE, __ATOMIC_RELAXED);
Contributor

detect underflow on debug builds?

* Read-write lock.
*/
typedef struct {
volatile int l;
Contributor

use unsigned int to have defined behavior on overflow/underflow (and detect it)?

Contributor Author

we don't expect overflow by design, so signed int is more suitable in this case IMO

}


static inline void ucs_rwlock_write_lock(ucs_rwlock_t *lock)
Contributor

maybe annotate for coverity with something like /* coverity[lock] */, to help it with sanity checks?

Contributor

same for lock paths

@Artemy-Mellanox Artemy-Mellanox force-pushed the topic/gdrcopy-perf-2 branch 2 times, most recently from d41b883 to 2c3a0e8 Compare January 12, 2025 02:32
static UCS_F_ALWAYS_INLINE void ucs_cpu_relax()
{
#ifdef __SSE2__
Contributor

minor: do something else if SSE2 not defined, like "":::"memory" ?
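What the fallback could look like, as a sketch (the function name is made up): _mm_pause() when SSE2 is available, otherwise an empty asm with a memory clobber so the compiler cannot hoist the lock-word load out of the spin loop.

```c
#ifdef __SSE2__
#include <emmintrin.h>
#endif

static inline void demo_cpu_relax(void)
{
#ifdef __SSE2__
    /* Hint to the CPU that this is a spin-wait loop */
    _mm_pause();
#else
    /* Compiler barrier only: forces the spinning load to be re-read
     * from memory on each iteration */
    __asm__ __volatile__("" ::: "memory");
#endif
}
```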

Comment on lines 150 to 158
if (flags & UCS_ATOMIC_FENCE_LOCK) {
return __ATOMIC_ACQUIRE;
}

if (flags & UCS_ATOMIC_FENCE_UNLOCK) {
return __ATOMIC_RELEASE;
}

return __ATOMIC_RELAXED;
Contributor

why are UCS_ATOMIC_xx defined as flags if there is no meaning to combine more than one of them?
also, how does it relate to UCS_DEFINE_ATOMIC_xx macros? should we remove/extend them?

Comment on lines 162 to 189
static UCS_F_ALWAYS_INLINE int ucs_atomic_get(int *ptr, unsigned flags)
{
return __atomic_load_n(ptr, ucs_atomic_memorder(flags));
}


static UCS_F_ALWAYS_INLINE int
ucs_atomic_fadd(int *ptr, int val, unsigned flags)
{
return __atomic_fetch_add(ptr, val, ucs_atomic_memorder(flags));
}


static UCS_F_ALWAYS_INLINE void
ucs_atomic_sub(int *ptr, int val, unsigned flags)
{
__atomic_fetch_sub(ptr, val, ucs_atomic_memorder(flags));
}


static UCS_F_ALWAYS_INLINE void ucs_atomic_or(int *ptr, int val, unsigned flags)
{
__atomic_fetch_or(ptr, val, ucs_atomic_memorder(flags));
}


static UCS_F_ALWAYS_INLINE int
ucs_atomic_cswap(int *ptr, int old_val, int new_val, unsigned flags)
Contributor

should we define it as macros to allow any type (not just int)?

Comment on lines 192 to 193
flags & UCS_ATOMIC_WEAK,
ucs_atomic_memorder(flags),
Contributor

why in case of success we want weak mem order?

Contributor Author

it's used in trylock, so it may use a faster instruction
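The point being made, as a sketch (the `demo_`/`DEMO_` names and bit value are assumptions, not the UCX API): a weak CAS may fail spuriously (e.g. on LL/SC architectures) but can compile to a cheaper instruction sequence, and trylock tolerates that, since the caller already handles failure as "lock busy".

```c
#include <stdbool.h>

/* Assumed bit value, illustration only */
#define DEMO_RWLOCK_WRITE 0x2

typedef struct { int l; } demo_rwlock_t;

static inline bool demo_write_trylock(demo_rwlock_t *lock)
{
    int expected = 0; /* lock must be completely free */

    /* weak = true: spurious failure is allowed, which just looks
     * like a busy lock to the trylock caller */
    return __atomic_compare_exchange_n(&lock->l, &expected,
                                       DEMO_RWLOCK_WRITE,
                                       true /* weak */,
                                       __ATOMIC_ACQUIRE,
                                       __ATOMIC_RELAXED);
}
```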

*/
typedef struct {
volatile int l;
} ucs_rwlock_t;
int state;
Contributor

maybe uint32_t, since it's a flags value and not a numeric value?

(__atomic_compare_exchange_n(&lock->l, &x, x + UCS_RWLOCK_WRITE, 1,
__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))) {
return 0;
ucs_atomic_cswap(&lock->state, x, x + UCS_RWLOCK_WRITE,
Contributor

maybe use
"!(x & UCS_RWLOCK_WRITE)" instead of "x < UCS_RWLOCK_WRITE"
"x | UCS_RWLOCK_WRITE" instead of "x + UCS_RWLOCK_WRITE"
since this is bit value

Contributor

maybe keep wait bit included in the check too

Contributor Author

maybe "!(x & ~(UCS_RWLOCK_WRITE-1))"? "!(x & UCS_RWLOCK_WRITE)" is not the same
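The distinction can be made concrete with a small sketch (bit values are assumptions matching the usual layout: WAIT below WRITE, reader count above): `x < WRITE` rejects any state with readers, while the plain bit test `!(x & WRITE)` would wrongly allow taking the write lock while readers hold it.

```c
#include <stdbool.h>

/* Assumed bit layout, illustration only */
#define DEMO_WAIT  0x1  /* a writer is waiting */
#define DEMO_WRITE 0x2  /* a writer holds the lock */
#define DEMO_READ  0x4  /* reader count starts at this bit */

/* "x < WRITE" accepts only x == 0 or x == WAIT:
 * no writer AND no readers. */
static bool demo_check_less(int x) { return x < DEMO_WRITE; }

/* "!(x & WRITE)" also accepts states where readers are present,
 * so it is NOT an equivalent condition for taking the write lock. */
static bool demo_check_bit(int x)  { return !(x & DEMO_WRITE); }
```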


static UCS_F_ALWAYS_INLINE void ucs_rw_spinlock_cleanup(ucs_rw_spinlock_t *lock)
{
ucs_assert(lock->state == 0);
Contributor

assertv


int write_taken = 0;
bool write_taken = 0;
Contributor

use true/false for bool instead of 1/0

@@ -138,4 +139,59 @@ UCS_DEFINE_ATOMIC_BOOL_CSWAP(16, w);
UCS_DEFINE_ATOMIC_BOOL_CSWAP(32, l);
UCS_DEFINE_ATOMIC_BOOL_CSWAP(64, q);


#define UCS_ATOMIC_WEAK UCS_BIT(0)
Contributor

If we want to encapsulate these values, then maybe it should be an enum?
Also, we're missing a few other values:
__ATOMIC_CONSUME - probably we don't need this one: data dependency only, for both barrier and synchronization with another thread.
__ATOMIC_ACQ_REL - I'm not sure about this one: full barrier in both directions; synchronizes with acquire loads and release stores in another thread.
__ATOMIC_SEQ_CST - the strongest mode, we definitely need it: full barrier in both directions; synchronizes with acquire loads and release stores in all threads.

#define UCS_ATOMIC_FENCE_UNLOCK UCS_BIT(2)


static UCS_F_ALWAYS_INLINE int ucs_atomic_memorder(unsigned flags)
Contributor

maybe we can simply use the actual flag (__ATOMIC_ACQUIRE etc) directly

static UCS_F_ALWAYS_INLINE void
ucs_rw_spinlock_read_unlock(ucs_rw_spinlock_t *lock)
{
ucs_assert(lock->state >= UCS_RWLOCK_READ);
Contributor

assertv with details?

@@ -0,0 +1,137 @@
/*
* Copyright (c) NVIDIA CORPORATION & AFFILIATES, 2024. ALL RIGHTS RESERVED.
Contributor

the copyright year would need to be 2025

@yosefe yosefe requested a review from tvegas1 May 11, 2025 09:02
Contributor

@tvegas1 tvegas1 left a comment

looks good, is the test code run with optim too?

@gleon99
Contributor

gleon99 commented Jun 17, 2025

@Artemy-Mellanox please squash

@yosefe yosefe enabled auto-merge June 18, 2025 08:48
@yosefe yosefe merged commit 27927ed into openucx:master Jun 18, 2025
151 checks passed