Chapter 3 - Part 1. Sharing data between threads

If not specified, all the code is written in the C++23 standard, compiled with g++ 13.2.
Command: g++ -std=c++2b -O0 ch3.cpp -o out && ./out

One perk multithreading has over multiprocessing is shared memory: all threads of a process see the same address space, which makes communication between threads much easier. However, this convenience comes with costs.
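
For example, two threads can read and write the same variable directly, with no message passing or serialization involved. A minimal sketch (correct only because join() orders the write before the read):

shared memory sketch

#include <iostream>
#include <thread>

int main() {
    int shared_value = 0; // one object, visible to every thread in the process

    // the lambda captures shared_value by reference: both threads
    // operate on the very same memory location
    std::thread writer([&shared_value] { shared_value = 42; });
    writer.join(); // wait for the write before reading, so no race here

    std::cout << shared_value << std::endl; // prints 42
}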

Race condition

Imagine two threads, A and B, operating on the same object, for example a std::vector. If both threads are adding elements to the container, the order in which the elements end up will vary from run to run, because the scheduling and speed of each thread vary. Often that is harmless, but if you are, say, sorting an array in increasing order (decreasing order is just the reverse) by dividing that array in half, sorting each half in its own thread, and merging the results back together, the order matters.

Suppose thread A has checked that the ith element is less than some element k and is about to insert k at position i + 1. Meanwhile thread B wants to insert an element l that is greater than the ith element but less than k. If thread B finishes its insertion after thread A's comparison but before thread A's insertion, l lands at position i + 1 first, and thread A's insertion of k then pushes l to position i + 2. We end up with [element at i] < k > l at positions i, i + 1, i + 2, which is not ordered.

%%{
    init: {
        'theme': 'base', 
        'themeVariables': { 
            'primaryColor': '#add8e6', 
            'noteTextColor': '#0f0f0f', 
            'noteBkgColor': '#f5f6fa',
            'textColor': '#EAA1AF'
        }
    }
}%%

sequenceDiagram
    participant A as Thread A
    participant B as Thread B
    participant C as Container
    A->>C: check ith element
    B->>C: check ith element
    B->>C: insert l at i + 1
    A->>C: insert k at i + 1
    C->>B: l get pushed to i + 2
    Note right of B: end up with [element at i] < k > l

Such a situation is called a data race: multiple threads access the same object concurrently and at least one of them modifies it. In this case, after thread A determined where to insert its element, the container was modified by thread B, which thread A did not expect.
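
A classic minimal demonstration of a data race is two threads incrementing a shared counter: ++counter is a read-modify-write sequence, not a single indivisible step, so increments can be lost. A sketch (the exact output varies from run to run):

data race demo

#include <iostream>
#include <thread>

int main() {
    long counter = 0;

    auto work = [&counter] {
        for (int i = 0; i < 100000; ++i)
            ++counter; // read, add one, write back: another thread can
                       // interleave between these steps and an update is lost
    };

    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();

    // typically prints less than 200000 because of lost updates
    std::cout << counter << std::endl;
}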

Avoiding race conditions

There are several ways to avoid race conditions. The most intuitive one is to use a protective wrapper that prevents the object from being modified by multiple threads at the same time.

Protective wrapper

The most basic and intuitive way of avoiding a race condition is to lock the whole object while one thread is modifying it, achieving MUTual EXclusion, or mutex.

This is a simple example of using mutex to protect a linked list:

safe linked list

#include <mutex>

template<typename T>
class Node {
public:
    Node(T value): data(value), next(nullptr) {}

    Node* getNext() const noexcept {
        return next;
    }

    void setNext(Node* next) noexcept {
        this->next = next;
    }

    T getValue() const noexcept {
        return data;
    }

    void setValue(T val) noexcept {
        data = val;
    }

private:
    T data;
    Node<T>* next;
};

template<typename T>
class LinkedList {
public:
    LinkedList(): head(nullptr) {}

    void pushBack(const T& val) {
        // the whole modification is guarded by the mutex
        std::lock_guard<std::mutex> lock(mutex);
        if (head == nullptr) {
            head = new Node<T>(val);
        } else {
            auto ptr = head;
            // next is private, so walk the list through the accessors
            while (ptr->getNext() != nullptr) ptr = ptr->getNext();
            ptr->setNext(new Node<T>(val));
        }
    }

private:
    Node<T>* head;
    std::mutex mutex;
};

This is a simple definition of a singly linked list protected by a mutex. When pushBack is called, the mutex is locked by std::lock_guard, making sure only one thread can modify the list at a time.

std::lock_guard is a RAII wrapper for a mutex: it locks the mutex when it is constructed and unlocks it when it is destructed. Therefore it is exception safe, since the mutex is released even if the protected code throws.
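
A sketch of why that matters: even when the protected code throws, the mutex is still released during stack unwinding, so no other thread is left blocked forever.

lock_guard exception safety

#include <iostream>
#include <mutex>
#include <stdexcept>

std::mutex m;

void mayThrow() {
    std::lock_guard<std::mutex> lock(m); // m locked here
    throw std::runtime_error("oops");    // unwinding destroys lock,
                                         // which unlocks m automatically
}

int main() {
    try { mayThrow(); } catch (const std::exception&) {}
    // a manual m.lock()/m.unlock() pair would have left m locked here,
    // because the exception skips the unlock() call
    std::lock_guard<std::mutex> lock(m); // locks fine, no deadlock
    std::cout << "still alive" << std::endl;
}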

Problems with the protective wrapper

This wrapper does make sure the linked list can only be accessed by one thread at a time. However, suppose we want to give users some flexibility by letting them pass in self-defined functions, like the following code:

not so safe

// ... inside the LinkedList definition from above

template<typename F>
void oops(F f) {
    std::lock_guard<std::mutex> lock(mutex);
    f(*head); // hands the user-supplied function a reference to our head node
}

// ... rest of the linked list definition omitted

The function f is user-provided and receives a reference to the head node of the linked list, and nothing restricts what it does with that reference. Therefore, it can do basically anything, like the following code:

exploit

// assumes LinkedList also provides a getHead() accessor returning Node<T>*
Node<int>* unprotected;

void bad(Node<int>& head) {
    unprotected = &head; // smuggle the reference out of the protected scope
}

int main() {
    LinkedList<int> list;
    for (int i = 5; i > 0; i--) {
        list.pushBack(i);
    }

    list.oops(bad);
    std::cout << list.getHead()->getValue() << std::endl;
    unprotected->setValue(114514); // modify the list with no lock held
    std::cout << list.getHead()->getValue() << std::endl;
}

Output:

oops
5
114514

Here we obtained a reference to a node in the linked list, and now we can modify the head node through that reference with no restriction, which is very bad.

The problem is not with the mutex itself but with giving out a reference to a node (the same applies to pointers). If we can make sure everything that happens to a node reference stays under our control, the problem is solved. So, DO NOT GIVE POINTERS OR REFERENCES TO UNPROTECTED SCOPES BY ANY MEANS.
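
One safer alternative is to never let the user's function see a node at all: hand it a copy of the stored value instead. A sketch of a hypothetical visitHead member of LinkedList (the name is made up here, and it assumes T is copyable):

safer visitor

// hypothetical member of LinkedList<T>: the callback only ever sees
// a copy, so it cannot retain a pointer or reference into the list
template<typename F>
void visitHead(F f) {
    std::lock_guard<std::mutex> lock(mutex);
    if (head != nullptr) {
        T copy = head->getValue(); // copy made while holding the lock
        f(copy);                   // whatever f does, the list is untouched
    }
}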

More problems with the protective wrapper

Even with the protective wrapper, things can still go wrong. Consider a simple stack implementation that looks thread-safe:

stack
#include <initializer_list>
#include <mutex>
#include <vector>

template<typename T>
class Stack {
    using mtx = std::lock_guard<std::mutex>;

public:
    explicit Stack(std::initializer_list<T> li): data(li) {}

    bool empty() const {
        mtx lock(mutex);
        return data.empty();
    }

    std::size_t size() const {
        mtx lock(mutex);
        return data.size();
    }

    void push(const T& val) {
        mtx lock(mutex);
        data.push_back(val);
    }

    void pop() {
        mtx lock(mutex);
        data.pop_back();
    }

    T top() const {
        mtx lock(mutex);
        // read data.back() directly: calling size() here would lock the
        // (non-recursive) mutex a second time and deadlock
        return data.back();
    }

private:
    std::vector<T> data;
    mutable std::mutex mutex; // mutable so const members can lock it
};

It looks thread-safe because every function is protected by a mutex. However, consider the following code:

stack problem

template<typename T>
void ok1(Stack<T>& stack) {
    if (!stack.empty()) {
        std::cout << "op1 start" << std::endl;
        // do things that assume the stack is not empty
        std::this_thread::sleep_for(std::chrono::seconds(3));
        std::cout << "op1 end" << std::endl;
    }
}

template<typename T>
void ok2(Stack<T>& stack) {
    std::cout << "op2 start" << std::endl;
    // do things that involve popping values from the stack
    std::this_thread::sleep_for(std::chrono::seconds(3));
    std::cout << "op2 end" << std::endl;
}

template<typename T>
void notOk(Stack<T>& stack) {
    // the explicit <T> picks the concrete instantiation to hand to std::thread
    std::thread th1(ok1<T>, std::ref(stack));
    std::thread th2(ok2<T>, std::ref(stack));
    th1.join();
    th2.join();
}

output:

stack problem
op1 startop2 start

op1 end
op2 end

Here "op1 start" and "op2 start" are scrambled because both threads write to std::cout concurrently.
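
If the scrambled output bothers you, one common remedy is to serialize the prints with a mutex dedicated to output. A sketch (print_mutex and safePrint are names made up here):

guarded print

#include <iostream>
#include <mutex>
#include <string>

std::mutex print_mutex; // protects std::cout only

void safePrint(const std::string& msg) {
    std::lock_guard<std::mutex> lock(print_mutex);
    std::cout << msg << std::endl; // only one thread writes at a time
}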

When ok1 and ok2 are launched, they run at the same time, so after ok1 has passed its emptiness check, ok2 might pop every element from the stack before ok1 finishes, which is not expected.

In this case, every single operation is protected by the mutex, yet when multiple operations are combined, the race condition still exists. The protective barrier alone cannot prevent it.

Also, consider this code:

stack problem 2
template<typename T>
void notOk2(Stack<T>& stack) {
    if (!stack.empty()) {
        T val = stack.top();
        stack.pop();
        // ...do things with val
    }
}

This code runs perfectly fine in a single thread, but with concurrency, since two threads can run it at the same time, the execution sequence might look like this (let the top element be i and the element under it be j):

%%{
    init: {
        'theme': 'base', 
        'themeVariables': { 
            'primaryColor': '#add8e6', 
            'noteTextColor': '#0f0f0f', 
            'noteBkgColor': '#f5f6fa',
            'textColor': '#EAA1AF'
        }
    }
}%%
sequenceDiagram
    participant A as Thread A
    participant B as Thread B
    participant C as Stack
    A->>C: check if empty
    B->>C: check if empty
    A->>C: get top (i)
    B->>C: get top (i)
    A->>C: pop i
    B->>C: pop j
    A->>C: do things with i
    B->>C: do things with i

In this case, i is processed twice but j is not processed at all. This is not what we want.

Better design

In the first example, the race condition occurs because two threads operate on the same data: each individual operation is protected by the mutex, but the combination of operations is not. To prevent this, we must make sure one thread's whole check-then-act sequence cannot interleave with another thread's modifications. One solution is to use a sync lock that a thread holds for the entire duration of its compound operation.

sync lock

template<typename T>
class BetterStack {

/*
duplicate definitions omitted...
*/

    void startSync() {
        sync.lock();
    }

    void endSync() {
        sync.unlock();
    }

private:
/*
duplicate definitions omitted...
*/
    std::mutex sync;
};

This code adds another mutex that is locked and unlocked by calling the startSync and endSync methods. While one thread holds the sync lock, every other thread that calls startSync blocks until it is released, so the compound operations bracketed by startSync/endSync are serialized. Therefore, we can rewrite the ok1 and ok2 methods like this:

better methods

template<typename T>
void realOk1(BetterStack<T>& stack) {
    stack.startSync();
    // note: if anything in here throws, endSync() is skipped;
    // a RAII wrapper around sync would be safer
    if (!stack.empty()) {
        std::cout << "op1 start" << std::endl;
        std::this_thread::sleep_for(std::chrono::seconds(3));
        std::cout << "op1 end" << std::endl;
    }
    stack.endSync();
}

template<typename T>
void realOk2(BetterStack<T>& stack) {
    stack.startSync();
    std::cout << "op2 start" << std::endl;
    std::this_thread::sleep_for(std::chrono::seconds(3));
    std::cout << "op2 end" << std::endl;
    stack.endSync();
}

template<typename T>
void nowOk(BetterStack<T>& stack) {
    std::thread th1(realOk1<T>, std::ref(stack));
    std::thread th2(realOk2<T>, std::ref(stack));
    th1.join();
    th2.join();
}

output:

better methods

op1 start
op1 end
op2 start
op2 end

The first problem is solved.

In the second example, there's a mutex, but it didn't protect the data as expected. We certainly could use the sync lock to protect the sequence here as well, but there's a better way.

If we take a closer look at the problem, the race condition occurs because the top and pop calls of different threads can interleave. Therefore, we can make top and pop indivisible by combining them into one operation with a fixed sequence. Since the combined operation is protected by the mutex, the two steps can no longer be separated by another thread, and the problem is solved.

We can rewrite the pop method like this:

better pop

// ... inside the stack definition

T pop() {
    std::lock_guard<std::mutex> lock(mutex);
    // read the top element directly: calling top() here would try to
    // lock the non-recursive mutex a second time and deadlock
    T val = data.back();
    data.pop_back();
    return val;
}

// ... rest of the stack definition omitted

We first obtain the top element and then pop it, returning the value. Since the whole sequence happens under a single mutex lock, the process is thread-safe and the problem is solved.
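
With the combined operation, the problematic notOk2 collapses into a single protected call. A sketch (it assumes the stack now uses the combined pop above; a production version would also need pop to handle an empty stack, for example by throwing):

combined pop usage

template<typename T>
void nowOk2(Stack<T>& stack) {
    // top-then-pop is one indivisible operation under the mutex, so two
    // threads can no longer both receive the same element
    T val = stack.pop();
    // ...do things with val
}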

More problems with the design

This problem is probably not straightforward, but when we are dealing with big objects and restricted memory, it cannot be ignored. Assume the stored object is very, very big (like a std::vector with thousands of elements). When we store the top element in val, we copy the whole object, which is expensive in both time and memory, and the copy could even throw a std::bad_alloc exception.

The dangerous copy is the one that returns the value to the caller: it can happen after pop_back has already run, so if that copy throws std::bad_alloc, the caller never receives the value, yet the element is already gone from the stack. The program doesn't behave as we expected and the value of the top element is lost. Luckily, we can access the container's elements directly, so we don't have to go through top, which performs a copy when returning the value.

One solution is to receive the value through a reference to a T object passed in by the caller, and then pop. This way, we avoid the middle variable and the extra copy made on return, using less memory.

reference pop

void pop(T& val) {
    std::lock_guard<std::mutex> lock(mutex);
    // the copy happens before the removal: if it throws, the stack is unchanged
    val = data.back();
    data.pop_back();
}

However, to pass in a reference to a T, the caller must construct a T object first, which is sometimes hard or even impossible: T might not have a default constructor, or its constructors might require data that is not available at that moment. In such cases we cannot pre-construct a T, which means we cannot use T& to get the value.
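
A quick illustration with a type that can only be built from data we don't have yet (Widget is a made-up example; the error lines are commented out so the snippet compiles):

no default constructor

struct Widget {
    explicit Widget(int id): id(id) {} // the only constructor needs an id
    int id;
};

int main() {
    // Widget w;      // error: no default constructor to call
    // stack.pop(w);  // so there is no way to prepare the out-parameter
    Widget ok(1);     // a Widget can only exist once we already know its value
}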

Another solution uses move semantics and a smart pointer. Instead of returning a plain value, we return a std::shared_ptr<T> constructed with std::make_shared<T>. Since we pass std::move(data.back()) to make_shared, T's move constructor is used (when T is movable), so we don't have to copy the object and we use less memory. Also, since we use a smart pointer rather than a raw pointer, we don't have to worry about memory leaks: the responsibility of memory management is handed to the shared_ptr.

better pop 2

std::shared_ptr<T> pop() {
    mtx lock(mutex); // mtx is the lock_guard alias from the class
    // move the top element into a heap-allocated T owned by the shared_ptr
    std::shared_ptr<T> const val = std::make_shared<T>(std::move(data.back()));
    data.pop_back();
    return val;
}

Then, our modified thread-safe stack will be like this:

better stack

#include <initializer_list>
#include <memory>
#include <mutex>
#include <vector>

template<typename T>
class BetterStack {
    using mtx = std::lock_guard<std::mutex>;

public:
    explicit BetterStack(std::initializer_list<T> li) {
        for (const auto& i : li) {
            data.push_back(i);
        }
    }

    bool empty() const {
        mtx lock(mutex);
        return data.empty();
    }

    std::size_t size() const {
        mtx lock(mutex);
        return data.size();
    }

    void push(const T& val) {
        mtx lock(mutex);
        data.push_back(val);
    }

    std::shared_ptr<T> pop() {
        mtx lock(mutex);
        std::shared_ptr<T> const val = std::make_shared<T>(std::move(data.back()));
        data.pop_back();
        return val;
    }

    T top() const {
        mtx lock(mutex);
        return data.back(); // not size(): that would lock the mutex again
    }

    void startSync() {
        sync.lock();
    }

    void endSync() {
        sync.unlock();
    }

private:
    std::vector<T> data;
    mutable std::mutex mutex; // mutable so const members can lock it
    std::mutex sync;
};

Summary of how the protective wrapper fails

The protective wrapper fails when the lock cannot cover the entire operation. The amount of work a lock covers is called the granularity of the lock. Fine granularity is good for performance, but it is sometimes bad for safety: in our original stack implementation the granularity was very fine, since each lock covered only one small operation, so when we performed multiple operations together we could not guarantee their sequence. Therefore, we needed a sync lock to make the granularity coarser and protect the whole compound operation from being interrupted.

In the second example and the later optimization, we also increased the granularity, but in a more elegant way: instead of using a sync lock, we made top and pop one transaction.

The word transaction is DBMS terminology; it refers to a sequence of database operations that must be performed as a unit, and if one step fails for any reason, the whole transaction is rolled back. By combining the two operations into one transaction, we made sure they are indivisible while protected by the mutex, and therefore thread-safe.

Flexible locking with std::unique_lock and std::defer_lock

std::unique_lock is another RAII wrapper like std::lock_guard, but it can be locked and unlocked multiple times within the same scope (std::lock_guard keeps the mutex locked for its whole scope). std::defer_lock is a counterpart of std::adopt_lock: while std::adopt_lock tells the wrapper that it is adopting a mutex that is already locked, std::defer_lock tells the wrapper not to acquire the lock right away. It only creates the wrapper, and the wrapper can be locked later, for example with std::lock.

In fact, since std::lock_guard doesn't track whether it owns the lock and locks the mutex immediately, std::defer_lock cannot be used with std::lock_guard; with std::unique_lock it is possible.

The following code:

unique_lock
class ObjectWithExposedMutex {
public:
    // ...
    std::mutex lock;
    // ...
};

void someOp(ObjectWithExposedMutex& obj1, ObjectWithExposedMutex& obj2) {
    // ...
    std::unique_lock<std::mutex> lock_a(obj1.lock, std::defer_lock);
    std::unique_lock<std::mutex> lock_b(obj2.lock, std::defer_lock);
    std::lock(lock_a, lock_b); // lock both together without risking deadlock
    // ...
    lock_a.unlock();
    lock_b.unlock();
    // ...
    std::lock(lock_a, lock_b); // and lock them again in the same scope
}

shows that a unique lock can be locked and unlocked within the same scope, and that std::defer_lock tells the wrapper to take ownership of the mutex without locking it right away.

If you change the type of lock_a and lock_b to std::lock_guard, the code won't compile, because std::lock_guard has no unlock method and its constructor only accepts std::adopt_lock as the second parameter.

Another perk of std::unique_lock is that it supports std::try_to_lock as the second constructor parameter. The constructor then calls try_lock instead of lock: if the mutex is already held elsewhere, the wrapper simply doesn't own it (which you can check with owns_lock()) instead of blocking. This makes std::unique_lock handy when a thread has something better to do than wait.
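
A sketch of std::try_to_lock in action (tryWork is a name made up here):

try_to_lock

#include <iostream>
#include <mutex>

std::mutex m;

void tryWork() {
    // the constructor calls m.try_lock() instead of m.lock()
    std::unique_lock<std::mutex> lock(m, std::try_to_lock);
    if (lock.owns_lock()) {
        std::cout << "got the lock, touching shared data" << std::endl;
    } else {
        std::cout << "lock busy, doing something else instead" << std::endl;
    }
}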


With this flexibility of std::unique_lock, we can manipulate the granularity of the lock to make the code run faster. Consider the following code:

time benchmark

#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

class ObjectWithExposedMutex {
public:
    // ...
    std::mutex lock;
    // ...
};

void slowOp(ObjectWithExposedMutex& obj1, ObjectWithExposedMutex& obj2) {
    // both mutexes are held for the entire function
    std::scoped_lock lock(obj1.lock, obj2.lock);

    // time consuming work but doesn't access shared object
    std::this_thread::sleep_for(std::chrono::seconds(1));
    // access shared object
    std::this_thread::sleep_for(std::chrono::seconds(1));
    // time consuming work but doesn't access shared object
    std::this_thread::sleep_for(std::chrono::seconds(1));
    // access shared object
    std::this_thread::sleep_for(std::chrono::seconds(1));
}

void fastOp(ObjectWithExposedMutex& obj1, ObjectWithExposedMutex& obj2) {
    std::unique_lock lock1(obj1.lock, std::defer_lock);
    std::unique_lock lock2(obj2.lock, std::defer_lock);

    // time consuming work but doesn't access shared object
    std::this_thread::sleep_for(std::chrono::seconds(1));
    // access shared object
    std::lock(lock1, lock2);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    lock1.unlock();
    lock2.unlock();
    // time consuming work but doesn't access shared object
    std::this_thread::sleep_for(std::chrono::seconds(1));
    // access shared object
    std::lock(lock1, lock2);
    std::this_thread::sleep_for(std::chrono::seconds(1));
}

int main() {
    ObjectWithExposedMutex obj1, obj2;

    for (int j = 0; j < 6; j++) {
        std::vector<std::thread> slow_works;
        auto start = std::chrono::system_clock::now();
        for (int i = 0; i < 6; i++) {
            slow_works.push_back(std::thread(slowOp, std::ref(obj1), std::ref(obj2)));
        }
        for (auto& i : slow_works) {
            i.join();
        }
        auto end = std::chrono::system_clock::now();
        auto elapsed = end - start;
        std::cout << "std::scoped_lock: " << elapsed.count() << '\n';

        std::vector<std::thread> fast_works;
        start = std::chrono::system_clock::now();
        for (int i = 0; i < 6; i++) {
            fast_works.push_back(std::thread(fastOp, std::ref(obj1), std::ref(obj2)));
        }
        for (auto& i : fast_works) {
            i.join();
        }
        end = std::chrono::system_clock::now();
        elapsed = end - start;
        std::cout << "std::unique_lock: " << elapsed.count() << "\n\n";
    }
}

Output:

output

std::scoped_lock: 24084662000
std::unique_lock: 13057103000

std::scoped_lock: 24078523000
std::unique_lock: 13034365000

std::scoped_lock: 24083858000
std::unique_lock: 13048024000

std::scoped_lock: 24089318000
std::unique_lock: 13041106000

std::scoped_lock: 24073756000
std::unique_lock: 13047171000

std::scoped_lock: 24075716000
std::unique_lock: 13049178000

Output format: std::scoped_lock: <time in nanoseconds>, then std::unique_lock: <time in nanoseconds>. Lower is better.

We can see fastOp is nearly twice as fast as slowOp. This is because in fastOp we only hold the mutexes while we are actually accessing the shared objects, so the operations that don't require the lock run in parallel, which makes the code faster.

In this test we assumed that the work which accesses the shared object takes as long as the work which doesn't, but in real life the latter is generally longer (like reading a big file from disk, or waiting for an API response). Therefore, the difference will be even bigger and the improvement more significant.

See more in next part~
