UCS: Introduce lightweight rwlock #10355

Merged
merged 1 commit into from
Jun 18, 2025

Conversation

Artemy-Mellanox
Contributor

@Artemy-Mellanox Artemy-Mellanox commented Dec 5, 2024

[ RUN      ] test_rwlock.perf <> <>
[     INFO ] 8.22035 ms builtin 1 threads 1 writers per 256 
[     INFO ] 9.91718 ms builtin 1 threads 25 writers per 256 
[     INFO ] 15.6226 ms builtin 1 threads 128 writers per 256 
[     INFO ] 8.53102 ms builtin 1 threads 250 writers per 256 
[     INFO ] 14.8544 ms builtin 2 threads 1 writers per 256 
[     INFO ] 20.3778 ms builtin 2 threads 25 writers per 256 
[     INFO ] 25.2576 ms builtin 2 threads 128 writers per 256 
[     INFO ] 18.0413 ms builtin 2 threads 250 writers per 256 
[     INFO ] 15.9894 ms builtin 4 threads 1 writers per 256 
[     INFO ] 25.9471 ms builtin 4 threads 25 writers per 256 
[     INFO ] 30.6886 ms builtin 4 threads 128 writers per 256 
[     INFO ] 23.7099 ms builtin 4 threads 250 writers per 256 
[     INFO ] 109.585 ms builtin 64 threads 1 writers per 256 
[     INFO ] 498.465 ms builtin 64 threads 25 writers per 256 
[     INFO ] 981.667 ms builtin 64 threads 128 writers per 256 
[     INFO ] 891.406 ms builtin 64 threads 250 writers per 256 
[       OK ] test_rwlock.perf (2699 ms)
[ RUN      ] test_rwlock.pthread <> <>
[     INFO ] 15.7748 ms pthread 1 threads 1 writers per 256 
[     INFO ] 17.1273 ms pthread 1 threads 25 writers per 256 
[     INFO ] 26.9475 ms pthread 1 threads 128 writers per 256 
[     INFO ] 15.2211 ms pthread 1 threads 250 writers per 256 
[     INFO ] 29.4559 ms pthread 2 threads 1 writers per 256 
[     INFO ] 215.037 ms pthread 2 threads 25 writers per 256 
[     INFO ] 104.325 ms pthread 2 threads 128 writers per 256 
[     INFO ] 41.5187 ms pthread 2 threads 250 writers per 256 
[     INFO ] 35.4409 ms pthread 4 threads 1 writers per 256 
[     INFO ] 196.073 ms pthread 4 threads 25 writers per 256 
[     INFO ] 122.957 ms pthread 4 threads 128 writers per 256 
[     INFO ] 62.3138 ms pthread 4 threads 250 writers per 256 
[     INFO ] 73.8066 ms pthread 64 threads 1 writers per 256 
[     INFO ] 198.412 ms pthread 64 threads 25 writers per 256 
[     INFO ] 165.476 ms pthread 64 threads 128 writers per 256 
[     INFO ] 118.79 ms pthread 64 threads 250 writers per 256 
[       OK ] test_rwlock.pthread (1439 ms)

@Artemy-Mellanox Artemy-Mellanox force-pushed the topic/gdrcopy-perf-2 branch 2 times, most recently from 65f025c to 89415df Compare December 5, 2024 15:38
ucs_cpu_relax();
}

x = __atomic_fetch_add(&lock->l, UCS_RWLOCK_READ, __ATOMIC_ACQUIRE);
Contributor

maybe we can use the atomic operations defined in atomic.h?

Contributor

maybe we can replace the deprecated __sync* functions from atomic.h with the new __atomic variants?
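For context, the migration being suggested here is mechanical; a sketch (the function names are made up for illustration, not UCX API):

```c
#include <stdint.h>

/* Deprecated __sync builtin: always a full barrier, no control
 * over the memory order. */
static inline uint32_t legacy_fadd(uint32_t *ptr, uint32_t val)
{
    return __sync_fetch_and_add(ptr, val);
}

/* __atomic replacement: same read-modify-write semantics, but the
 * memory order is explicit (acquire here, matching what the rwlock
 * read-lock path needs). */
static inline uint32_t modern_fadd(uint32_t *ptr, uint32_t val)
{
    return __atomic_fetch_add(ptr, val, __ATOMIC_ACQUIRE);
}
```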

Comment on lines 187 to 192
int m = std::thread::hardware_concurrency();
std::vector<int> threads = {1, 2, 4, m};
std::vector<int> writers_per_256 = {1, 25, 128, 250};

Contributor

  1. i think we also want to measure overhead of read lock+unlock regardless of concurrency since it is the reason we added lightweight rwlock
  2. can you pls post example output in the PR description?

Contributor Author

  1. you mean write percent 0?
  2. didn't I do that?

Contributor

  1. yes (I see you added it)
  2. yes, pls update with percent 0


UCS_TEST_F(test_rwlock, perf) {
ucs_rwlock_t lock = UCS_RWLOCK_STATIC_INITIALIZER;
measure(
Contributor

This is a good performance test, but it must change some state to verify the lock's correctness. That is the whole purpose of a litmus test. For instance, these functions could perform some simple math and we check the invariant:

    // Invariant: counter2 is 2 times bigger than counter1
    int counter1 = 1;
    int counter2 = 2;
    measure(
            [&]() {
                ucs_rwlock_read_lock(&lock);
                UCS_ASSERT_EQ(counter1 * 2, counter2);
                ucs_rwlock_read_unlock(&lock);
            },
            [&]() {
                ucs_rwlock_write_lock(&lock);
                counter1 += counter1;
                counter2 += counter2;
                if (counter2 > 100000) {
                    counter1 = 1;
                    counter2 = 2;
                }
                ucs_rwlock_write_unlock(&lock);
            },

Contributor Author

I'm not sure this is a good idea. This test doesn't guarantee lock correctness, and it will very easily give a false-negative result.

Contributor

Well, testing the correctness of MT algorithms is a hard topic. You're right about false-negative results, and that's OK. That's actually the nature of litmus tests: when they pass, it does not guarantee that your algorithm is 100% correct, but when they fail, it's obviously broken. Personally I've caught tons of MT issues with the help of litmus tests.
If you propose some other way of testing correctness, let's discuss that. What I'm proposing is a well-established industry practice, and I'm sure it's better to have it than no testing at all.

Moreover, the example that I provided is just scratching the surface. We should also consider adding tests with nested locks, try-locks, etc

int x;

while (1) {
while (lock->l & UCS_RWLOCK_MASK) {
Contributor

@iyastreb iyastreb Dec 6, 2024

I'm not sure why we read atomic without mem order guarantees, normally it should be something

__atomic_load_n(&lock->l, __ATOMIC_RELAXED)

Same in other read cases

Contributor Author

why? AFAIU, relaxed mem order means no mem order guarantees

Contributor

Well, my point here is to use the uniform API to intercept loads/stores of this variable, so it does not need to be volatile. And then we can specify an appropriate mem order, whether it's relaxed or acquire.
Btw, I also see a performance improvement after replacing volatile with __atomic_load


static inline void ucs_rwlock_write_unlock(ucs_rwlock_t *lock)
{
__atomic_fetch_sub(&lock->l, UCS_RWLOCK_WRITE, __ATOMIC_RELAXED);
Contributor

i think __ATOMIC_RELEASE is usually used on all release paths, to make sure that whatever happened under the lock is visible to the next acquirer; any reason for not doing so in this PR?
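What the reviewer is suggesting, as a sketch (the `DEMO_` names and bit value are assumptions, not the actual UCX code): a release fetch_sub on the unlock path pairs with the acquire on the lock path, so stores made inside the critical section are visible to the next owner.

```c
/* Assumed bit value, illustration only */
#define DEMO_RWLOCK_WRITE 0x2

typedef struct { int l; } demo_rwlock_t;

/* Release on unlock: everything written while holding the lock
 * happens-before the next acquire of the same lock word. With
 * __ATOMIC_RELAXED here, that guarantee would be lost. */
static inline void demo_write_unlock(demo_rwlock_t *lock)
{
    __atomic_fetch_sub(&lock->l, DEMO_RWLOCK_WRITE, __ATOMIC_RELEASE);
}
```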

}
};

UCS_TEST_F(test_rwlock, lock) {
Contributor

Do we have a way to run this/some tests with maximal optimizations (maybe an inline compiler pragma, ...)? I suspect we end up running it without optimizations, which might affect correctness-related tests.

Contributor

Can we run this test and others with all optimizations?



/**
* Read-write lock.
Contributor

spinlock

#include <errno.h>

/**
* The ucs_rwlock_t type.
Contributor

suggestion: ucs_rw_spinlock_t and for all apis?

}


static inline void ucs_rwlock_read_unlock(ucs_rwlock_t *lock)
Contributor

add assertions/checks to detect underflow using returned value, only on debug builds (and also overflow on the other path)
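A sketch of what that could look like (the `demo_`/`DEMO_` names are made up; assert() stands in for UCX's debug-only assertion macro): fetch_sub returns the value *before* the subtraction, so the old value must contain at least one reader.

```c
#include <assert.h>

/* Assumed bit value, illustration only */
#define DEMO_RWLOCK_READ 0x4

typedef struct { int l; } demo_rwlock_t;

static inline void demo_read_unlock(demo_rwlock_t *lock)
{
    int prev = __atomic_fetch_sub(&lock->l, DEMO_RWLOCK_READ,
                                  __ATOMIC_RELEASE);

    /* The returned pre-subtraction value must hold at least one
     * reader, otherwise the counter would underflow. The check is
     * free on release builds where assert() compiles out. */
    assert(prev >= DEMO_RWLOCK_READ);
    (void)prev;
}
```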


static inline void ucs_rwlock_write_unlock(ucs_rwlock_t *lock)
{
__atomic_fetch_sub(&lock->l, UCS_RWLOCK_WRITE, __ATOMIC_RELAXED);
Contributor

detect underflow on debug builds?

* Read-write lock.
*/
typedef struct {
volatile int l;
Contributor

use unsigned int to have defined behavior on overflow/underflow (and detect it)?

Contributor Author

we don't expect overflow by design, so signed int is more suitable in this case IMO

}


static inline void ucs_rwlock_write_lock(ucs_rwlock_t *lock)
Contributor

maybe annotate for coverity with something like /* coverity[lock] */, to help it with sanity checks?

Contributor

same for lock paths

@Artemy-Mellanox Artemy-Mellanox force-pushed the topic/gdrcopy-perf-2 branch 2 times, most recently from d41b883 to 2c3a0e8 Compare January 12, 2025 02:32
static UCS_F_ALWAYS_INLINE void ucs_cpu_relax()
{
#ifdef __SSE2__
Contributor

minor: do something else if SSE2 not defined, like "":::"memory" ?
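What the fallback could look like, as a sketch (the function name is made up): _mm_pause() when SSE2 is available, otherwise an empty asm with a memory clobber so the compiler cannot hoist the lock-word load out of the spin loop.

```c
#ifdef __SSE2__
#include <emmintrin.h>
#endif

static inline void demo_cpu_relax(void)
{
#ifdef __SSE2__
    /* Hint to the CPU that this is a spin-wait loop */
    _mm_pause();
#else
    /* Compiler barrier only: forces the spinning load to be re-read
     * from memory on each iteration */
    __asm__ __volatile__("" ::: "memory");
#endif
}
```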

Comment on lines 150 to 158
if (flags & UCS_ATOMIC_FENCE_LOCK) {
return __ATOMIC_ACQUIRE;
}

if (flags & UCS_ATOMIC_FENCE_UNLOCK) {
return __ATOMIC_RELEASE;
}

return __ATOMIC_RELAXED;
Contributor

why are UCS_ATOMIC_xx defined as flags if there is no meaning to combine more than one of them?
also, how does it relate to UCS_DEFINE_ATOMIC_xx macros? should we remove/extend them?

Comment on lines 162 to 189
static UCS_F_ALWAYS_INLINE int ucs_atomic_get(int *ptr, unsigned flags)
{
return __atomic_load_n(ptr, ucs_atomic_memorder(flags));
}


static UCS_F_ALWAYS_INLINE int
ucs_atomic_fadd(int *ptr, int val, unsigned flags)
{
return __atomic_fetch_add(ptr, val, ucs_atomic_memorder(flags));
}


static UCS_F_ALWAYS_INLINE void
ucs_atomic_sub(int *ptr, int val, unsigned flags)
{
__atomic_fetch_sub(ptr, val, ucs_atomic_memorder(flags));
}


static UCS_F_ALWAYS_INLINE void ucs_atomic_or(int *ptr, int val, unsigned flags)
{
__atomic_fetch_or(ptr, val, ucs_atomic_memorder(flags));
}


static UCS_F_ALWAYS_INLINE int
ucs_atomic_cswap(int *ptr, int old_val, int new_val, unsigned flags)
Contributor

should we define it as macros to allow any type (not just int)?

Comment on lines 192 to 193
flags & UCS_ATOMIC_WEAK,
ucs_atomic_memorder(flags),
Contributor

why in case of success we want weak mem order?

Contributor Author

it's used in trylock, so it may use a faster instruction
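The point being made, as a sketch (the `demo_`/`DEMO_` names and bit value are assumptions, not the UCX API): a weak CAS may fail spuriously (e.g. on LL/SC architectures) but can compile to a cheaper instruction sequence, and trylock tolerates that, since the caller already handles failure as "lock busy".

```c
#include <stdbool.h>

/* Assumed bit value, illustration only */
#define DEMO_RWLOCK_WRITE 0x2

typedef struct { int l; } demo_rwlock_t;

static inline bool demo_write_trylock(demo_rwlock_t *lock)
{
    int expected = 0; /* lock must be completely free */

    /* weak = true: spurious failure is allowed, which just looks
     * like a busy lock to the trylock caller */
    return __atomic_compare_exchange_n(&lock->l, &expected,
                                       DEMO_RWLOCK_WRITE,
                                       true /* weak */,
                                       __ATOMIC_ACQUIRE,
                                       __ATOMIC_RELAXED);
}
```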

*/
typedef struct {
volatile int l;
} ucs_rwlock_t;
int state;
Contributor

maybe uint32_t, since it's a flags value and not a numeric value?

(__atomic_compare_exchange_n(&lock->l, &x, x + UCS_RWLOCK_WRITE, 1,
__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))) {
return 0;
ucs_atomic_cswap(&lock->state, x, x + UCS_RWLOCK_WRITE,
Contributor

maybe use
"!(x & UCS_RWLOCK_WRITE)" instead of "x < UCS_RWLOCK_WRITE"
"x | UCS_RWLOCK_WRITE" instead of "x + UCS_RWLOCK_WRITE"
since this is bit value

Contributor

maybe keep wait bit included in the check too

Contributor Author

maybe "!(x & ~(UCS_RWLOCK_WRITE-1))"? "!(x & UCS_RWLOCK_WRITE)" is not the same
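The distinction can be made concrete with a small sketch (bit values are assumptions matching the usual layout: WAIT below WRITE, reader count above): `x < WRITE` rejects any state with readers, while the plain bit test `!(x & WRITE)` would wrongly allow taking the write lock while readers hold it.

```c
#include <stdbool.h>

/* Assumed bit layout, illustration only */
#define DEMO_WAIT  0x1  /* a writer is waiting */
#define DEMO_WRITE 0x2  /* a writer holds the lock */
#define DEMO_READ  0x4  /* reader count starts at this bit */

/* "x < WRITE" accepts only x == 0 or x == WAIT:
 * no writer AND no readers. */
static bool demo_check_less(int x) { return x < DEMO_WRITE; }

/* "!(x & WRITE)" also accepts states where readers are present,
 * so it is NOT an equivalent condition for taking the write lock. */
static bool demo_check_bit(int x)  { return !(x & DEMO_WRITE); }
```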


static UCS_F_ALWAYS_INLINE void ucs_rw_spinlock_cleanup(ucs_rw_spinlock_t *lock)
{
ucs_assert(lock->state == 0);
Contributor

assertv


int write_taken = 0;
bool write_taken = 0;
Contributor

use true/false for bool instead of 1/0

@@ -138,4 +139,59 @@ UCS_DEFINE_ATOMIC_BOOL_CSWAP(16, w);
UCS_DEFINE_ATOMIC_BOOL_CSWAP(32, l);
UCS_DEFINE_ATOMIC_BOOL_CSWAP(64, q);


#define UCS_ATOMIC_WEAK UCS_BIT(0)
Contributor

If we want to encapsulate these values, then maybe it should be an enum?
Also, we're missing a few other values:
__ATOMIC_CONSUME - probably we don't need this one: data dependency only, for both barrier and synchronization with another thread.
__ATOMIC_ACQ_REL - I'm not sure about this one: full barrier in both directions; synchronizes with acquire loads and release stores in another thread.
__ATOMIC_SEQ_CST - the strongest mode, we definitely need it: full barrier in both directions; synchronizes with acquire loads and release stores in all threads.

#define UCS_ATOMIC_FENCE_UNLOCK UCS_BIT(2)


static UCS_F_ALWAYS_INLINE int ucs_atomic_memorder(unsigned flags)
Contributor

maybe we can simply use the actual flag (__ATOMIC_ACQUIRE etc) directly

static UCS_F_ALWAYS_INLINE void
ucs_rw_spinlock_read_unlock(ucs_rw_spinlock_t *lock)
{
ucs_assert(lock->state >= UCS_RWLOCK_READ);
Contributor

assertv with details?

@@ -0,0 +1,137 @@
/*
* Copyright (c) NVIDIA CORPORATION & AFFILIATES, 2024. ALL RIGHTS RESERVED.
Contributor

the copyright year would need to be 2025

@yosefe yosefe requested a review from tvegas1 May 11, 2025 09:02
Contributor

@tvegas1 tvegas1 left a comment

looks good, is the test code run with optim too?

@gleon99
Contributor

gleon99 commented Jun 17, 2025

@Artemy-Mellanox please squash

@yosefe yosefe enabled auto-merge June 18, 2025 08:48
@yosefe yosefe merged commit 27927ed into openucx:master Jun 18, 2025
151 checks passed