
Chapter 3 - Part 1. Sharing data between threads

If not specified, all the code is written in the C++23 standard and compiled with g++ 13.2.
Command: g++ -std=c++2b -O0 ch3.cpp -o out && ./out
One perk of multithreading over multiprocessing is that threads share the same memory space, which makes communication between threads easier. However, such convenience comes with costs.
Race condition
Imagine two threads, A and B, operating on the same object, for example a std::vector. If thread A and thread B are both adding elements to the container, the order in which elements are added will vary from run to run, because the execution sequence and speed of each thread vary. Normally this is not a big deal, but if you are sorting an array (assume ascending order; descending is just the reverse) by dividing the array in half, sorting each part, and merging them back, the order matters.
Suppose thread A has checked that the i-th element is less than an element k and is about to insert k at position i + 1. Meanwhile, thread B inserts an element l that is greater than the i-th element but less than k. If thread B finishes its insertion after thread A's comparison but before thread A's insertion, the element inserted by thread B gets pushed to position i + 2, since thread A inserts its element at the position just after i. We end up with [element at i] < k > l, which is not ordered.
```mermaid
%%{ init: { 'theme': 'base', 'themeVariables': { 'primaryColor': '#add8e6', 'noteTextColor': '#0f0f0f', 'noteBkgColor': '#f5f6fa', 'textColor': '#EAA1AF' } } }%%
sequenceDiagram
    participant A as Thread A
    participant B as Thread B
    participant C as Container
    A->>C: check ith element
    B->>C: check ith element
    B->>C: insert l at i + 1
    A->>C: insert k at i + 1
    C->>B: l gets pushed to i + 2
    Note right of B: end up with:<br/>[element at i] < k > l
```
Such a condition is called a data race, which refers to concurrent, unsynchronized modification of one object by multiple threads. In this case, after thread A determined where to insert its element, the container was modified by thread B, which thread A did not expect.
Avoiding race condition
There are several ways to avoid race conditions. The most intuitive one is to use a protective wrapper that prevents the object from being modified by multiple threads at the same time.
Protective wrapper
The most basic and intuitive way of avoiding a race condition is to lock the whole object while one thread is modifying it, achieving MUTual EXclusion, or mutex.
This is a simple example of using a mutex to protect a linked list:
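(A minimal sketch, assuming a singly linked list of Node objects; the pushBack name follows the text below.)

```cpp
#include <memory>
#include <mutex>
#include <utility>

template <typename T>
class ThreadSafeList {
    struct Node {
        T value;
        std::unique_ptr<Node> next;
        explicit Node(T v) : value(std::move(v)) {}
    };

    std::unique_ptr<Node> head;
    Node* tail = nullptr;
    std::mutex m;

public:
    void pushBack(T value) {
        std::lock_guard<std::mutex> guard(m); // locked for the whole call
        auto node = std::make_unique<Node>(std::move(value));
        if (!head) {                          // empty list: node becomes head
            head = std::move(node);
            tail = head.get();
        } else {                              // otherwise append after tail
            tail->next = std::move(node);
            tail = tail->next.get();
        }
    }
};
```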
This is a simple definition of a singly linked list protected by a mutex. When pushBack is called, the mutex is locked by the std::lock_guard, ensuring only one thread can modify the list at a time.
std::lock_guard is an RAII wrapper around a mutex: it locks the mutex when constructed and unlocks it when destructed. Therefore, it is exception safe.
Problem with the protective wrapper
This wrapper indeed ensures the linked list can only be accessed by one thread at a time. However, suppose we want to give users some flexibility by letting them pass in self-defined functions, like the following code:
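(A sketch of such an interface, added as a member of the ThreadSafeList above; the name forEach and its signature are assumptions.)

```cpp
// Inside ThreadSafeList: run a user-supplied function while holding the lock.
template <typename F>
void forEach(F f) {
    std::lock_guard<std::mutex> guard(m);
    f(head); // hands the user a reference to the internal head pointer
}
```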
The function f is user-provided, has access to a reference to the head pointer of the linked list, and is not restricted by us. Therefore, it can do basically anything, like the following code:
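(A hypothetical caller; the point is that the lambda can smuggle a pointer out of the protected scope.)

```cpp
#include <iostream>

int main() {
    ThreadSafeList<int> list;
    list.pushBack(1);

    int* leaked = nullptr;
    list.forEach([&](auto& head) { // head refers to the internal head pointer
        leaked = &head->value;     // escape a pointer to the head node's data
    });

    *leaked = 5;                   // modified with no mutex held!
    std::cout << *leaked << std::endl;
}
```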
Output:
```
5
```
Here we obtained a reference to a node in the linked list, and now we can modify the head node through that reference with no restriction, which is very bad.
The problem is not related to the mutex, but to handing out a reference to a node (the same concept applies to pointers). If we can make sure that anything that happens to that reference is under our control, the problem is solved. So, DO NOT GIVE POINTERS OR REFERENCES TO UNPROTECTED SCOPES BY ANY MEANS.
More problems with the protective wrapper
Even with the protective wrapper, things can still go wrong. Consider a simple stack implementation that looks thread-safe:
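(A sketch, assuming a std::stack underneath; the exact member list is an assumption, but every member locks the same mutex.)

```cpp
#include <mutex>
#include <stack>
#include <utility>

template <typename T>
class ThreadSafeStack {
    std::stack<T> data;
    mutable std::mutex m; // mutable so empty()/top() can lock in const methods

public:
    void push(T value) {
        std::lock_guard<std::mutex> guard(m);
        data.push(std::move(value));
    }
    void pop() {
        std::lock_guard<std::mutex> guard(m);
        data.pop();
    }
    T top() const {
        std::lock_guard<std::mutex> guard(m);
        return data.top(); // returns a copy
    }
    bool empty() const {
        std::lock_guard<std::mutex> guard(m);
        return data.empty();
    }
};
```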
It looks thread-safe because every function is protected by a mutex. However, consider the following code:
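(A hypothetical reconstruction of the two operations, based on the discussion below.)

```cpp
#include <iostream>
#include <thread>

ThreadSafeStack<int> st;

void ok1() {
    std::cout << "op1 start";
    if (!st.empty()) {    // check...
        int v = st.top(); // ...then act; another thread can run in between,
        st.pop();         // so top()/pop() may hit an already-emptied stack
        (void)v;          // do something with v
    }
}

void ok2() {
    std::cout << "op2 start";
    while (!st.empty())   // drain the stack
        st.pop();
}

int main() {
    st.push(42);
    std::thread a(ok1), b(ok2);
    a.join();
    b.join();
}
```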
output:
```
op1 startop2 start
```
Here op1 start and op2 start run together on one line because the std::cout calls of the two threads execute concurrently.
When ok1 and ok2 are launched, after ok1 has done its check, since the two threads are operating at the same time, ok2 might pop all the elements in the stack before ok1 finishes, which is not expected.
In this case, even though every single operation is protected by the mutex, the race condition still exists when multiple operations are combined. Therefore, even with this protective barrier, race conditions cannot be prevented.
Also, consider this code:
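(A sketch of the check-read-pop pattern; doSomething is a hypothetical placeholder.)

```cpp
#include <iostream>

template <typename T>
void doSomething(const T& val) { std::cout << val << '\n'; } // placeholder

template <typename T>
void process(ThreadSafeStack<T>& st) {
    if (!st.empty()) {    // 1. check
        T val = st.top(); // 2. read the top element
        st.pop();         // 3. remove it
        doSomething(val); // 4. use it -- steps 1-4 can interleave across threads
    }
}
```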
This code runs perfectly fine in a single thread, but with concurrency, since the steps around val can interleave across threads, the execution sequence might look like this (let the top element be i and the element under it be j):
```mermaid
%%{ init: { 'theme': 'base', 'themeVariables': { 'primaryColor': '#add8e6', 'noteTextColor': '#0f0f0f', 'noteBkgColor': '#f5f6fa', 'textColor': '#EAA1AF' } } }%%
sequenceDiagram
    participant A as Thread A
    participant B as Thread B
    participant C as Stack
    A->>C: check if empty
    B->>C: check if empty
    A->>C: get top (i)
    B->>C: get top (i)
    A->>C: pop i
    B->>C: pop j
    A->>C: do things with i
    B->>C: do things with i
```
In this case, i is processed twice while j is not processed at all. This is not what we want.
Better design
In the first example, the race condition occurs because two operations are modifying the same data, and the combination of the operations is not protected by the mutex. To prevent this, we must make sure reading and writing don't occur at the same time. One solution is to use a sync lock that covers each whole operation.
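(A sketch of one possible implementation; startSync and endSync come from the text, while the use of std::recursive_mutex, so that the thread holding the sync lock can still call the individual operations, is an assumption.)

```cpp
#include <mutex>
#include <stack>
#include <utility>

template <typename T>
class SyncStack {
    std::stack<T> data;
    mutable std::recursive_mutex syncM; // recursive: the owner may re-lock per call

public:
    void startSync() { syncM.lock(); }  // held across a whole sequence of calls
    void endSync() { syncM.unlock(); }

    void push(T value) {
        std::lock_guard<std::recursive_mutex> guard(syncM);
        data.push(std::move(value));
    }
    void pop() {
        std::lock_guard<std::recursive_mutex> guard(syncM);
        data.pop();
    }
    T top() const {
        std::lock_guard<std::recursive_mutex> guard(syncM);
        return data.top();
    }
    bool empty() const {
        std::lock_guard<std::recursive_mutex> guard(syncM);
        return data.empty();
    }
};
```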
This code defines another lock that can be acquired and released by calling the startSync and endSync methods. While the sync lock is held, no other thread can access the stack except the thread that locked it, and all threads are synchronized. Therefore, we can rewrite the ok1 and ok2 methods like this:
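(A sketch; newlines are added to the messages for readability. The key point is that each whole operation now runs under the sync lock.)

```cpp
#include <iostream>

SyncStack<int> st;

void ok1() {
    st.startSync();             // take the sync lock for the whole operation
    std::cout << "op1 start\n";
    if (!st.empty()) {
        int v = st.top();
        st.pop();
        (void)v;                // do something with v
    }
    st.endSync();
}

void ok2() {
    st.startSync();
    std::cout << "op2 start\n";
    while (!st.empty())
        st.pop();
    st.endSync();
}
```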
output (one possible ordering):
```
op1 start
op2 start
```
The first problem is solved.
In the second example, there's a mutex, but it didn't protect the data as expected. Here we could certainly use the sync lock to protect the sequence, but there's a better way.
If we take a closer look at the problem, the race condition occurs because the sequence of top and pop can be interleaved by different threads. Therefore, we can make the top and pop operations indivisible by combining them into one operation with a fixed sequence. Since the combined operation is protected by the mutex, there is no way the steps can be executed in a different sequence, and the problem is solved.
We can rewrite the pop method like this:
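(A sketch, as a member of the ThreadSafeStack above.)

```cpp
// Combined top-and-pop, protected by a single lock.
T pop() {
    std::lock_guard<std::mutex> guard(m);
    T val = data.top(); // read the top element...
    data.pop();         // ...and remove it, all under the same lock
    return val;
}
```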
We first obtain the top element and then pop it, returning the value. Since the whole operation is protected by the mutex, the process is thread-safe and the problem is solved.
More problems with this design
This problem is probably not obvious, but when we are dealing with big objects and restricted memory, it cannot be ignored. Assume the object we are storing is very big (like a std::vector with thousands of elements). When we store the object in val with top, we copy the whole object, which is expensive both in time and memory, and could lead to a std::bad_alloc exception.
When std::bad_alloc occurs, val doesn't hold the value of the top element, but the top element has already been popped. In this case, the program doesn't behave as we expected and the value of the top element is lost. Luckily, we can access the container's element directly, so we don't have to use top, which performs a copy when returning the value.
One solution is to receive the value through a reference to an object of type T and then pop. This way, we avoid an intermediate variable and use less memory.
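(A sketch of the output-parameter variant.)

```cpp
// Caller supplies the storage for the value.
void pop(T& val) {
    std::lock_guard<std::mutex> guard(m);
    val = data.top(); // if this assignment throws, nothing has been popped yet
    data.pop();
}
```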
However, if we want to accept a reference to a T object, the caller must construct a T object first, which is sometimes hard and even impossible: T might not have a default constructor, and sometimes the constructor requires data that's not available at the moment. In such cases we cannot construct a T object, which means we cannot use a T& to get the value.
Another solution is to use rvalue references and smart pointers. Instead of returning a value, we can return a std::shared_ptr<T>, constructed with std::make_shared<T>. Since std::make_shared<T> accepts an rvalue reference, we don't have to copy the object but can simply move the value into the shared object. This way, we avoid copying the object and use less memory. Also, since we are using a smart pointer rather than a raw pointer, we don't have to worry about memory leaks: the responsibility of memory management is transferred to the smart pointer.
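(A sketch of the smart-pointer variant.)

```cpp
#include <memory>

std::shared_ptr<T> pop() {
    std::lock_guard<std::mutex> guard(m);
    auto res = std::make_shared<T>(std::move(data.top())); // move, don't copy
    data.pop();
    return res; // copying a shared_ptr is cheap and doesn't touch T
}
```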
Then, our modified thread-safe stack looks like this:
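(A sketch; throwing on an empty stack is an assumption.)

```cpp
#include <memory>
#include <mutex>
#include <stack>
#include <stdexcept>
#include <utility>

template <typename T>
class ThreadSafeStack {
    std::stack<T> data;
    mutable std::mutex m;

public:
    void push(T value) {
        std::lock_guard<std::mutex> guard(m);
        data.push(std::move(value));
    }

    // Combined top-and-pop via an output parameter.
    void pop(T& val) {
        std::lock_guard<std::mutex> guard(m);
        if (data.empty()) throw std::runtime_error("pop on empty stack");
        val = data.top();
        data.pop();
    }

    // Combined top-and-pop returning a smart pointer.
    std::shared_ptr<T> pop() {
        std::lock_guard<std::mutex> guard(m);
        if (data.empty()) throw std::runtime_error("pop on empty stack");
        auto res = std::make_shared<T>(std::move(data.top()));
        data.pop();
        return res;
    }

    bool empty() const {
        std::lock_guard<std::mutex> guard(m);
        return data.empty();
    }
};
```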
Summary of how the protective wrapper dysfunctions
The protective wrapper dysfunctions because the lock cannot cover the entire operation. The range a lock covers is called the granularity of the lock. Even though fine granularity is good in terms of performance, it is sometimes bad in terms of safety. In our original implementation of the stack, we had a very fine granularity, since each lock only covered a very small operation; when we performed multiple operations together, we could not guarantee the sequence of the operations. Therefore, we needed a sync lock to make the granularity bigger and protect the whole operation from being interrupted.
In the second example and the later optimization, we also increased the granularity, but in a more elegant way. Instead of using a sync lock, we made top and pop a transaction.
The word transaction is DBMS terminology; it refers to a sequence of database operations that must be performed as a unit, and if one step of the transaction fails for any reason, the whole transaction is rolled back. By combining the two operations into one transaction, we made sure these two operations are indivisible and protected by the mutex, and therefore thread-safe.
Flexible locking with std::unique_lock and std::defer_lock
std::unique_lock is another RAII wrapper like std::lock_guard, but it can be locked and unlocked multiple times within the same scope (std::lock_guard keeps the mutex locked throughout the scope). std::defer_lock is similar to std::adopt_lock, except that it doesn't make the RAII wrapper acquire the lock right away: it only creates the wrapper, and the wrapper can be locked later, for example with std::lock.
In fact, since std::lock_guard doesn't store information about the lock and locks it right away, std::defer_lock cannot be used with std::lock_guard; with std::unique_lock it's possible.
The following code:
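(A sketch: the class name ObjectWithExposedMutex comes from the text; the member layout and the demo function are assumptions.)

```cpp
#include <mutex>

class ObjectWithExposedMutex {
public:
    std::mutex m;
    int data = 0;
};

void demo(ObjectWithExposedMutex& a, ObjectWithExposedMutex& b) {
    std::unique_lock<std::mutex> lock_a(a.m, std::defer_lock); // wrapped, not locked
    std::unique_lock<std::mutex> lock_b(b.m, std::defer_lock);
    std::lock(lock_a, lock_b); // lock both later, without deadlock

    ++a.data;
    ++b.data;

    lock_a.unlock(); // release early...
    // ...do work that doesn't need a.m...
    lock_a.lock();   // ...and lock again within the same scope
    ++a.data;
}
```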
It shows that a unique lock can be locked and unlocked within the same scope, and that std::defer_lock tells the wrapper to take ownership of the mutex without locking it right away.
If you change the type of lock_a and lock_b to std::lock_guard, the code won't compile, because std::lock_guard doesn't have an unlock method and its constructor only accepts std::adopt_lock as the second parameter.
Another perk of std::unique_lock is that it supports the std::try_to_lock option as its second constructor parameter, which attempts to acquire the mutex without blocking. If the attempt fails, the wrapper doesn't throw an exception; its owns_lock() method simply returns false.
With this flexibility of std::unique_lock, we can manipulate the granularity of the lock to make the code run faster. Consider the following code:
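(A sketch: the names slowOp and fastOp appear below, while the timing harness and the simulated workload are assumptions.)

```cpp
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;
long shared_value = 0;

// Simulated work that does NOT touch the shared object.
void unrelatedWork() { std::this_thread::sleep_for(std::chrono::milliseconds(5)); }

// Holds the lock for the whole function, including the unrelated work.
void slowOp() {
    std::scoped_lock lock(m);
    ++shared_value;  // needs the lock
    unrelatedWork(); // doesn't need the lock, but we still hold it
}

// Holds the lock only while touching the shared object.
void fastOp() {
    std::unique_lock<std::mutex> lock(m);
    ++shared_value;
    lock.unlock();   // shrink the critical section
    unrelatedWork(); // now runs in parallel across threads
}

long timeIt(void (*op)()) {
    auto start = std::chrono::steady_clock::now();
    std::thread t1(op), t2(op);
    t1.join();
    t2.join();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}

int main() {
    std::cout << "std::scoped_lock: " << timeIt(slowOp) << '\n'
              << "std::unique_lock: " << timeIt(fastOp) << '\n';
}
```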
Output:
```
std::scoped_lock: <time in nanoseconds>
std::unique_lock: <time in nanoseconds>
```
Output format: std::scoped_lock: <time in nanoseconds>, std::unique_lock: <time in nanoseconds>. Less time is better.
We can see that fastOp is nearly twice as fast as slowOp. This is because in fastOp we only lock the mutex while we are accessing the shared object; the operations that don't require the lock run in parallel, which makes the code faster.
In this test we assume the time needed by operations that access the shared object is the same as the time needed by those that don't, but in real life the latter is generally longer (like reading a big file from disk, or waiting for an API response). Therefore, the difference will be even bigger and the improvement more significant.
See more in next part~