2. Sorting and Order Statistics

2. Sorting and Order Statistics

Ayano Kagurazaka Lv3

~well since I’m learning rust I will use rust to implement algorithms~

A brief review of sorting problem

A sorting problem is usually defined as below:

Given a sequence of n numbers a1,a2,...,ana_1, a_2, ..., a_n, we want the algorithm to output a reordering of original sequence a1,a2,...,ana_1', a_2', ..., a_n', such that a1a2...ana_1 \leq a_2 \leq ... \leq a_n

It’s worth noticing that the “numbers” can be anything that can be ordered, and the “sequence” can be any data structure that can contain an ordered collection of objects. Thus the focus of this part is on the problem itself, not the specification of individual type of objects.

Recall the algorithm we introduced in the previous part: insertion sort and quick sort, with their own properties. Insertion sort takes more time than merge sort, but it has good performance on small size input, since it sorts the array in place, means there is only constant number of elements are stored out side of the input array(Recall that the merge function of merge sort will create additional array as the size of array increases, where insertion sort only stores the max/min element no matter how large the input array is).

In this part, we will introduce two new sorting algorithms: heap sort and quick sort.

Heap sort

Heap sort is an sort algorithm that stores array in-place and its run time is O(nlogn)O(n\log n). Heap sort utilizes a data structure called binary heap, which is a complete binary tree that the parent node is always greater/smaller than the child node. When all the parents are greater than their childs, the heap is called max heap, otherwise it’s called a min heap.

Recall that in a binary tree represented by an array, give the i-th index of the array, its parent node idx is i/2i / 2, and its left child and right child are i2+1i * 2 + 1 and i2+2i * 2 + 2, if we start count from 0 and floor the division.

For a binary tree with nn node, the height of the tree(the number of edges from of the longest downward path from the root to a leaf) is at least floor(log(n)).

Proof:

To proof, we use this fact:

for a complete binary tree, the h-th layer has 2h12^{h - 1} nodes (i.e. layer 1 has 20=12^0 = 1 node)

Thus, a tree with height hh has at least

i=1h2i1=2h1\sum_{i = 1}^{h}{2^{i - 1}} = 2^h - 1

number of nodes.

Let

n=2h1n = 2^h - 1

take log2\log_2 on both side, we get:

h=logn+1h = \log{n + 1}

since log\log is increasing, we can conclude that logn+1>logn\log{n + 1} > \lfloor \log{n} \rfloor, and thus the lower limit is what we desired.

Construction of heap

To create a max heap, we can follow the following algorithm:

The input is the ithi-th node. And the function left(x) and right(x) stands for the index of the left child and the right child of the x-th node.

  1. Check the left node and the right node to find the largest node index of the i-th node(if the i-th node is greater than all of them, the greatest node index is i)
  2. If the i-th node is not the largest node, swap the i-th node and the greatest node(either left or right node).
  3. Apply this procedure on the largest node index we found previously.

The rust code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
fn left(v: &[i32], idx: usize) -> Option<usize> {
if v.len() > idx * 2 + 1 {
return Some(idx * 2 + 1);
}
None
}

fn right(v: &[i32], idx: usize) -> Option<usize> {
if v.len() > idx * 2 + 2 {
return Some(idx * 2 + 2);
}
None
}

fn swap_element(vec: &mut [i32], lhs: usize, rhs: usize) {
vec[lhs] += vec[rhs];
vec[rhs] = vec[lhs] - vec[rhs];
vec[lhs] -= vec[rhs];
}

fn max_heapify(vec: &mut [i32], root: usize) {
let left = left(vec, root);
let right = right(vec, root);
let mut largest = root;
if let Some(i) = left {
if vec[largest] < vec[i] {
largest = i;
}
}
if let Some(i) = right {
if vec[largest] < vec[i] {
largest = i;
}
}

if largest != root {
swap_element(vec, largest, root);
max_heapify(vec, largest);
}
}

fn print_vec(vec: &[i32]) {
for i in vec {
print!("{} ", i);
}
}

fn main() {
let mut v = [1, 1, 4, 5, 1, 4];
max_heapify(&mut v, 0);
print_vec(&v);
}

The output:

1
4 1 4 5 1 1

Originally the tree looks like this:

1
2
3
4
5
6
7
      1
/ \
/ \
/ \
1 4
/ \ / \
5 1 4 N

and after adjustment, we have:

1
2
3
4
5
6
7
      4
/ \
/ \
/ \
1 4
/ \ / \
5 1 1 N

This doesn’t look like a heap, since we only processed one branch, which is the largest branch of the root node.

To implement the full heap sort, we need to adjust the tree from the second lowest layer to the upper most layer:

Why the second lowerst layer

The reason why we don’t begin with the root node because in our scenerio, if we apply the procedure twice, we get:

1
2
3
4
5
6
7
      4
/ \
/ \
/ \
5 4
/ \ / \
1 1 1 N

which is not a valid heap.

To come up with a full procedure, we iterate from the last node with children to the root node, and apply the procedure max_heapify to each node.

The rust code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
fn max_heapify(vec: &mut [i32], root: usize) {
let left = left(vec, root);
let right = right(vec, root);
let mut largest = root;
if let Some(i) = left {
if vec[largest] < vec[i] {
largest = i;
}
}
if let Some(i) = right {
if vec[largest] < vec[i] {
largest = i;
}
}

if largest != root {
swap_element(vec, largest, root);
max_heapify(vec, largest);
}
}

fn make_heap(vec: &mut [i32]) {
for it in (0..vec.len() / 2).rev() {
max_heapify(vec, it);
}
}

fn print_vec(vec: &[i32]) {
for i in vec {
print!("{} ", i);
}
}

The output:

1
2
1 1 4 5 1 4
5 1 4 1 1 4

If we draw the latter as a binary tree we get:

1
2
3
4
5
6
7
      5
/ \
/ \
/ \
1 4
/ \ / \
1 1 4 N

To prove this works, we once again uses loop invarient, since there is a loop in this code:

At the start of each loop, the node i...vec.len()1i ... vec.len() - 1 are root node of a max heap

Now we are about to proof this actually remains invarient throughout the loop. But we will first proof one lemma:

Lemma 2.1

Given a complete binary tree vec, if node ii has child that is the root of a max heap, after the function max_heapify applied to ii, ii will be the root of a max heap.

proof

Note: the right left child and right left child has symmetric proof, we will only discuss the left child.

Assume the left child of the iith node is 2i+12i + 1. Since 2i+12i + 1 is a max-heap root, it is greater than all of its child nodes. There are two cases:

  1. i2i+1i \geq 2i + 1: In this case, ii is already a max heap and we do nothing.
  2. i<2i+1i < 2i + 1: In this case, the function max_heapify will swap that element. Given that 2i+12i + 1 is the root of a max heap, its child nodes are also valid roots for max heaps, which comes back to our initial condition.

After some iterations, case 2 either goes to case 1 or hits the bottom of the tree. When the iteration hits the bottom of the tree, according to our algorithm, the element must be smaller than all the previous elements, and given that all the previous elements are in order, the new binary tree formed by node 2i+12i + 1 is a valid max heap.

Initialization: This loop starts from vec.len()/2vec.len() / 2, which is the last element in the second last layer. Notice that the element $vec.len() / 2 $ to vec.len()vec.len() are tivially max heap.

Maintenance: In each heap, we construct a max heap with root node as node ii, By our initialization condition, the child node of ii are root of a max heap. By Lemma 2.1, after adjustment, ii should become a new root of a max heap.

Termination: The loop terminates at the root node of the binary tree. At that time, we have applied max_heapify algorithm to all the nodes, thus every nodes are valid root of max heap, thus the whole tree is a valid max heap.

Comments
On this page
2. Sorting and Order Statistics