Chapter 4 - Part 3. Synchronizing concurrent operations


Ayano Kagurazaka

If not specified, all the code is written against the C++23 standard and compiled with g++ 13.2.
command: g++ -std=c++2b -O0 ch4.cpp -o out && ./out

Chapter 4. Synchronizing concurrent operations - Part 3

Futuristic(?) programming

One major source of race conditions is the data race, where multiple threads modify the same data. We can avoid this with a mutex, but that costs efficiency, because it forces the conflicting operations to run one at a time instead of concurrently. With std::future, however, we can build something more efficient and still safe: pass the results of operations around instead of sharing the data they work on.

Functional programming (FP)

One critical trait of functional programming is that functions don't depend on external state: given the same input, a function always returns the same result. Another trait of a pure function is that it doesn't (and shouldn't) modify external values; its entire effect is limited to its return value.
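
A minimal illustration of the difference (the function names here are made up for this example):

// pure: the result depends only on the arguments, and calling it
// has no observable effect on anything outside the function
int add(int a, int b) { return a + b; }

// not pure: it reads and writes external state, so two calls with
// the same argument can produce different results
int counter = 0;
int add_and_count(int a) {
    counter += a;
    return counter;
}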

C++ also has some support for FP, such as lambda expressions, std::function, std::bind, and auto return type deduction for functions. The first two let functions be passed around as variables and parameters, the third lets us create function objects by binding parameters, and the last makes function definitions look more like those in a pure functional language (like Haskell or OCaml). And std::future makes FP-style concurrency possible in C++, because it lets us pass the results of functions between functions without explicit access to shared data.
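
As a small sketch of that idea (this snippet isn't from the book; sum_range is just an arbitrary example function), the caller never touches shared data — the result travels only through the returned future:

#include <future>
#include <numeric>
#include <vector>

int sum_range(std::vector<int> v) {  // takes its input by value: no shared data
    return std::accumulate(v.begin(), v.end(), 0);
}

int main() {
    std::vector<int> data{1, 2, 3, 4, 5};
    // std::async copies `data` into the task and returns a std::future<int>;
    // get() is the only channel through which the result comes back
    std::future<int> total = std::async(std::launch::async, sum_range, data);
    return total.get() == 15 ? 0 : 1;
}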

FP-style concurrency example: quick sort

The idea behind quick sort is simple: pick a pivot, move all elements smaller than the pivot to its left and all larger elements to its right, then recursively apply the same operation to the left and right parts. The following diagram illustrates this process:

[Diagram: quickSort — partitioning around a pivot, then recursing on each part]


Legend:

  • green: not sorted
  • purple: pivot
  • gray: sorted
  • white: sorted after operation

We first implement a normal quick sort algorithm:

template<typename T>
std::list<T> qsort(std::list<T> in) {
    if (in.empty()) return in;
    std::list<T> result;
    // move the first element into result and use it as the pivot
    result.splice(result.begin(), in, in.begin());
    T pivot = result.back();
    // move everything smaller than the pivot to the front of `in`
    auto div = std::partition(in.begin(), in.end(),
                              [&](const T& t) { return t < pivot; });
    // transfer the lower part into its own list
    std::list<T> lower;
    lower.splice(lower.end(), in, in.begin(), div);
    // recursively sort both parts
    auto new_lower = qsort(std::move(lower));
    auto new_higher = qsort(std::move(in));
    // stitch the pieces back together around the pivot
    result.splice(result.end(), new_higher);
    result.splice(result.begin(), new_lower);
    return result;
}

This is a normal quick sort, but it already uses some functional programming ideas: the function doesn't modify any external data, and it uses a lambda expression. Functional style tends to introduce a lot of copying, so apart from that, we lean on standard library functions to avoid copies and keep the code short.


std::partition takes a range and a predicate; it moves all elements that satisfy the predicate to the front of the range and returns an iterator to the first element that doesn't satisfy it.
std::list::splice takes a position and a range from another list; it moves the elements of that range to that position by relinking nodes, without copying or moving the elements themselves.
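
A quick demonstration of these two building blocks (my own toy example; the values are arbitrary):

#include <algorithm>
#include <iostream>
#include <list>

int main() {
    std::list<int> in{5, 1, 7, 2, 9};
    // move every element smaller than 5 to the front; `div` points at the
    // first element that is not smaller than 5
    auto div = std::partition(in.begin(), in.end(), [](int x) { return x < 5; });
    std::list<int> lower;
    // transfer [in.begin(), div) into `lower` by relinking nodes, not copying
    lower.splice(lower.end(), in, in.begin(), div);
    for (int x : lower) std::cout << x << ' ';  // the elements 1 and 2, in some order
    std::cout << '\n';
    for (int x : in) std::cout << x << ' ';     // the elements 5, 7 and 9, in some order
    std::cout << '\n';
}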

Now, let's convert this into an FP-style concurrent quick sort. Since it's already a pure function, only a slight twist of the code is needed:

template<typename T>
std::list<T> qsort_parallel(std::list<T> in) {
    if (in.empty()) return in;
    std::list<T> result;
    result.splice(result.begin(), in, in.begin());
    T pivot = result.back();
    auto div = std::partition(in.begin(), in.end(),
                              [&](const T& t) { return t < pivot; });
    std::list<T> lower;
    lower.splice(lower.end(), in, in.begin(), div);
    // sort the lower part on (potentially) another thread
    auto new_lower = std::async(&qsort_parallel<T>, std::move(lower));
    // sort the upper part on this thread
    auto new_higher = qsort_parallel(std::move(in));
    result.splice(result.end(), new_higher);
    result.splice(result.begin(), new_lower.get());  // get() waits for the async task
    return result;
}

Here we made the sort of the lower part asynchronous, which should save a lot of time… or not? Let's look at the results:

benchmark code
#include <chrono>
#include <iostream>
#include <list>
#include <random>
// (plus <algorithm> and <future> for the qsort implementations above)

int main() {
    std::list<int> c, d;
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> rand(1, 100);
    auto hd = 12;
    // small inputs: 0 to hd - 1 elements
    for (int element_cnt = 0; element_cnt < hd; element_cnt++) {
        c.clear();
        d.clear();
        std::cout << element_cnt << " elements:\n";
        for (int i = 0; i < element_cnt; i++) {
            c.push_back(rand(gen));
            d.push_back(c.back());
        }
        auto begin = std::chrono::steady_clock::now();
        qsort_parallel<int>(c);
        auto end = std::chrono::steady_clock::now();
        std::cout << "\tparallel: " << (end - begin).count() << std::endl;
        begin = std::chrono::steady_clock::now();
        qsort<int>(d);
        end = std::chrono::steady_clock::now();
        std::cout << "\tsequential: " << (end - begin).count() << std::endl;
    }
    // larger inputs: hd to hd * 20 elements, in steps of hd
    for (int element_cnt = hd; element_cnt < hd * 20; element_cnt += hd) {
        c.clear();
        d.clear();
        std::cout << element_cnt << " elements:\n";
        for (int i = 0; i < element_cnt; i++) {
            c.push_back(rand(gen));
            d.push_back(c.back());
        }
        auto begin = std::chrono::steady_clock::now();
        qsort_parallel<int>(c);
        auto end = std::chrono::steady_clock::now();
        std::cout << "\tparallel: " << (end - begin).count() << std::endl;
        begin = std::chrono::steady_clock::now();
        qsort<int>(d);
        end = std::chrono::steady_clock::now();
        std::cout << "\tsequential: " << (end - begin).count() << std::endl;
    }
}

This output is very long.

Output
0 elements:
parallel: 416
sequential: 166
1 elements:
parallel: 34167
sequential: 8250
2 elements:
parallel: 53167
sequential: 1459
3 elements:
parallel: 61541
sequential: 7084
4 elements:
parallel: 100000
sequential: 3041
5 elements:
parallel: 90042
sequential: 3666
6 elements:
parallel: 114458
sequential: 4333
7 elements:
parallel: 117250
sequential: 5416
8 elements:
parallel: 144667
sequential: 5416
9 elements:
parallel: 160208
sequential: 14458
10 elements:
parallel: 166167
sequential: 7000
11 elements:
parallel: 177250
sequential: 7334
12 elements:
parallel: 229292
sequential: 7583
24 elements:
parallel: 326125
sequential: 20625
36 elements:
parallel: 447417
sequential: 22666
48 elements:
parallel: 605583
sequential: 30542
60 elements:
parallel: 655625
sequential: 37625
72 elements:
parallel: 764541
sequential: 46000
84 elements:
parallel: 994291
sequential: 52250
96 elements:
parallel: 949875
sequential: 66458
108 elements:
parallel: 1124250
sequential: 73667
120 elements:
parallel: 1237500
sequential: 77416
132 elements:
parallel: 1595750
sequential: 82750
144 elements:
parallel: 1721875
sequential: 89708
156 elements:
parallel: 1470042
sequential: 102458
168 elements:
parallel: 1936083
sequential: 106125
180 elements:
parallel: 1748125
sequential: 113292
192 elements:
parallel: 1215500
sequential: 128292
204 elements:
parallel: 2028375
sequential: 152583
216 elements:
parallel: 2286917
sequential: 183250
228 elements:
parallel: 2238375
sequential: 145833

We can see that the parallel version is much slower than the sequential version. The main reason is that std::async (when it doesn't defer the call) spawns a new thread for every call to qsort_parallel, and creating a thread is expensive. Once we spawn far more threads than there are cores, the extra threads only add scheduling and context-switching overhead rather than useful parallelism. And even if we assume the system allocates those resources efficiently, every call to std::partition still runs sequentially in the calling thread, and across the whole recursion that partitioning adds up to O(n log n) work that is not parallelized at all.
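
One common way to mitigate this (my own sketch, not code from the book) is to spawn asynchronous tasks only for the first few levels of recursion, so that we create roughly one task per hardware thread and fall back to plain recursion below that:

#include <algorithm>
#include <future>
#include <list>

// sketch: parallelize only while `depth` is small; the cutoff of 3 levels
// (about 8 tasks) is an assumption that should be tuned per machine, e.g.
// derived from std::thread::hardware_concurrency()
template<typename T>
std::list<T> qsort_limited(std::list<T> in, unsigned depth = 0) {
    if (in.empty()) return in;
    std::list<T> result;
    result.splice(result.begin(), in, in.begin());
    T pivot = result.back();
    auto div = std::partition(in.begin(), in.end(),
                              [&](const T& t) { return t < pivot; });
    std::list<T> lower;
    lower.splice(lower.end(), in, in.begin(), div);

    std::list<T> new_lower;
    if (depth < 3) {
        // only the top levels of the recursion get their own task
        auto fut = std::async(std::launch::async, &qsort_limited<T>,
                              std::move(lower), depth + 1);
        auto new_higher = qsort_limited(std::move(in), depth + 1);
        result.splice(result.end(), new_higher);
        new_lower = fut.get();
    } else {
        // below the cutoff, just recurse sequentially
        new_lower = qsort_limited(std::move(lower), depth + 1);
        auto new_higher = qsort_limited(std::move(in), depth + 1);
        result.splice(result.end(), new_higher);
    }
    result.splice(result.begin(), new_lower);
    return result;
}

Even with this, the partitioning itself still happens sequentially in each call, so it only recovers part of the potential speedup.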

Communicating Sequential Processes (CSP)

Another paradigm of concurrent programming is Communicating Sequential Processes, where threads are by definition separate and independent, and can only communicate with each other through message passing. This is the model used by Erlang, and by the Message Passing Interface (MPI) for high-performance computing in C++.

The idea of CSP is that a thread is defined purely by how it behaves in response to the messages it receives. Each thread is effectively a state machine whose state changes as messages arrive. One way to implement CSP is therefore to write an explicit finite state machine, but that's not the only way: the state machine structure can also be implicit in the structure of the application, and that's fine.

warning

Pure CSP has no shared data at all, which is impossible to guarantee in C++, because all threads in a process share the same address space. It is therefore the library (or application) author's job to ensure that no data is shared between the communicating threads.

CSP-style concurrency example: ATM scenario

A typical ATM state machine (withdraw only) can be illustrated as the following diagram:

%%{
    init: {
        'theme': 'base',
        'themeVariables': {
            'primaryColor': '#add8e6',
            'noteTextColor': '#0f0f0f',
            'noteBkgColor': '#f5f6fa',
            'lineColor': '#eaa1af',
            'labelBoxBkgColor': '#add8e6'
        }
    }
}%%
flowchart TD
        init -->|card inserted| pin[getting pin]
        pin -->|last pin pressed|pin_check[check pin]
        pin -->|digit pressed|pin
        pin -->|delete last digit|pin
        pin_check -->|pin not ok|done
        pin_check -->|pin ok|action[wait for withdraw amount]
        action -->|cancel|done
        action -->wait[getting withdraw amount]
        wait -->|withdraw amount entered|amount[check balance]
        amount -->|insufficient fund|done
        amount -->|give cash|done

In this diagram, each blue box represents a state, and each arrow represents a transition; the label on the arrow is the message that triggers it.

In an actual implementation, we use a message queue to pass messages between threads. Each message is a struct that implements a common interface. For the processor itself, we have a class that represents the state machine, with each state written as a member function. We can then use a pointer to member function as the current state and drive the state machine with a simple loop.
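
Here is a rough, self-contained sketch of that structure. The message queue below is just a minimal stand-in (std::queue plus a mutex and a condition variable), not the book's messaging library, and the states are heavily simplified compared to the diagram above:

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>

// minimal stand-in for a thread-safe message queue; messages are plain
// strings here for brevity
struct message_queue {
    std::queue<std::string> q;
    std::mutex m;
    std::condition_variable cv;
    void push(std::string msg) {
        {
            std::lock_guard<std::mutex> lk(m);
            q.push(std::move(msg));
        }
        cv.notify_one();
    }
    std::string pop() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !q.empty(); });
        std::string msg = std::move(q.front());
        q.pop();
        return msg;
    }
};

// each state is a member function; `state` points at the current one
class atm {
    message_queue& incoming;
    void (atm::*state)();

    void waiting_for_card() {
        if (incoming.pop() == "card_inserted") state = &atm::getting_pin;
    }
    void getting_pin() {
        // heavily simplified: a single message decides the outcome
        if (incoming.pop() == "pin_ok") state = &atm::done;
    }
    void done() {}

public:
    explicit atm(message_queue& mq) : incoming(mq), state(&atm::waiting_for_card) {}
    void run() {
        while (state != &atm::done) (this->*state)();  // drive the state machine
    }
};

int main() {
    message_queue mq;
    // in a real program another thread (the hardware interface) would push these
    mq.push("card_inserted");
    mq.push("pin_ok");
    atm machine(mq);
    machine.run();
    std::cout << "ATM reached the done state\n";
}

The point is the shape: the current state is just a pointer to a member function, and each state blocks on the queue, interprets the message it receives, and picks the next state.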

The implementation of the message queue is complicated and will be explained in another article.


Originally the book discusses the Concurrency TS after this part, but according to this paper from open-std, the committee is still not sure whether to include it in the standard, and its facilities are still experimental (under the std::experimental namespace). Thus, I will skip that part.

Summary

  • Functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids changing state and mutable data.
    • Since functions in functional programming don't modify external data, and variables are by definition immutable, it's easier to write correct concurrent code.
  • Concurrency doesn't necessarily mean faster execution, since thread creation, resource allocation, and context switching between threads are expensive.
  • CSP is a programming paradigm that treats threads as state machines that communicate with each other only through message passing.
    • C++ is by definition not CSP, since threads share memory, but we can still follow this paradigm by using message queues.