PERFORMANCE.md 6.56 KB
Newer Older
1 2
# Notes on performance measures during development

3 4 5
#### Commit 52fcb51f - Add basic random stealing

Slight improvement, needs further measurement after removing more important bottlenecks.
6 7 8 9
Below are three individual measurements of the difference.
Overall the trend (sum of all numbers/last number),
go down (98.7%, 96.9% and 100.6%), but with the one measurement
above 100% we think the improvements are minor.
10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27

| | | | | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
old     |    1659.01 us|    967.19 us|    830.08 us|    682.69 us|    737.71 us|    747.92 us|    749.37 us|    829.75 us|   7203.73 us
new     |    1676.06 us|    981.56 us|    814.71 us|    698.72 us|    680.87 us|    737.68 us|    756.91 us|    764.71 us|   7111.22 us
change  |    101.03  %|    101.49  %|     98.15  %|    102.35  %|     92.30  %|     98.63  %|    101.01  %|     92.16  %|     98.72  %

| | | | | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
old     |    1648.65 us|    973.33 us|    820.18 us|    678.80 us|    746.21 us|    767.63 us|    747.17 us|   1025.35 us|   7407.32 us
new     |    1655.09 us|    964.99 us|    807.57 us|    731.34 us|    747.47 us|    714.71 us|    794.35 us|    760.28 us|   7175.80 us
change  |    100.39  %|     99.14  %|     98.46  %|    107.74  %|    100.17  %|     93.11  %|    106.31  %|     74.15  %|     96.87  %

| | | | | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
old     |    1654.26 us|    969.12 us|    832.13 us|    680.69 us|    718.70 us|    750.80 us|    744.12 us|    775.24 us|   7125.07 us
new     |    1637.04 us|    978.09 us|    799.93 us|    709.33 us|    746.42 us|    684.87 us|    822.30 us|    787.61 us|   7165.59 us
change  |     98.96  %|    100.93  %|     96.13  %|    104.21  %|    103.86  %|     91.22  %|    110.51  %|    101.60  %|    100.57  %
28 29 30 31 32 33

#### Commit 3535cbd8  - Cache Align scheduler_memory

Big improvements of about 6% in our test. This seems like a little,
but 6% from the scheduler is a lot, as the 'main work' is the tasks
itself, not the scheduler.
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65

This change unsurprisingly yields the biggest improvement yet.

#### Commit b9bb90a4  - Try to figure out the 'high thread bottleneck'

We are currently seeing good performance on low core counts
(up to 1/2 of the machines cores), but after that performance
plumishes:

Bana-Pi Best-Case:

<img src="./media/b9bb90a4-banana-pi-best-case.png" width="400"/>

Bana-Pi Average-Case:

<img src="./media/b9bb90a4-banana-pi-average-case.png" width="400"/>

Laptop Best-Case:

<img src="./media/b9bb90a4-laptop-best-case.png" width="400"/>

Laptop Average-Case:

<img src="./media/b9bb90a4-laptop-average-case.png" width="400"/>


As we can see, in average the performance of PLS starts getting
way worse than TBB and EMBB after 4 cores. We suspect this is due
to contemption, but could not resolve it with any combination
of `tas_spinlock` vs `ttas_spinlock` and `lock` vs `try_lock`.

This issue clearly needs further investigation.
66

67 68
### Commit aa27064 - Performance with ttsa spinlocks (and 'full blocking' top level)

69
<img src="media/aa27064_fft_average.png" width="400"/>
70 71 72

### Commit d16ad3e - Performance with rw-lock and backoff

73
<img src="media/d16ad3e_fft_average.png" width="400"/>
74 75

### Commit 18b2d744 - Performance with lock-free deque
76 77 78 79 80 81 82 83

After much tinkering we still have performance problems with higher
thread counts in the FFT benchmark. Upward from 4/5 threads the
performance gains start to saturate (before removing the top level
locks we even saw a slight drop in performance).

Currently the FFT benchmark shows the following results (average):

84
<img src="media/18b2d744_fft_average.png" width="400"/>
85 86 87 88

We want to positively note that the overall trend of 'performance drops'
at the hyperthreading mark is not really bad anymore, it rather
seems similar to EMBB now (with backoff + lockfree deque + top level
89 90 91
reader-writers lock). This comes partly because the spike at 4 threads
is lower (less performance at 4 threads). We also see better times
on the multiprogramed system with the lock-free deque.
92 93 94 95 96

This is discouraging after many tests. To see where the overhead lies
we also implemented the unbalanced tree search benchmark,
resulting in the following, suprisingly good, results (average):

97
<img src="media/18b2d744_unbalanced_average.png" width="400"/>
98 99 100 101 102 103 104

The main difference between the two benchmarks is, that the second
one has more work and the work is relatively independent.
Additionaly, the first one uses our high level API (parallel invoke),
while the second one uses our low level API.
It is worth investigating if either or high level API or the structure
of the memory access in FFT are the problem.
105 106 107 108 109 110 111 112 113 114 115

### Commit cf056856 - Remove two-level scheduler

In this test we replace the two level scheduler with ONLY fork_join
tasks. This removes the top level steal overhead and performs only
internal stealing. For this we set the fork_join task as the only
possible task type and removed the top level rw-lock, the digging
down to our level and solely use internal stealing.

Average results FFT:

116
<img src="media/cf056856_fft_average.png" width="400"/>
117 118 119

Average results Unbalanced:

120
<img src="media/cf056856_unbalanced_average.png" width="400"/>
121 122 123 124

There seems to be only a minor performance difference between the two,
suggesting tha our two-level approach is not the part causing our
weaker performance.
125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157

### Commit afd0331b - Some notes on scaling problems

After tweaking individual values and parameters we can still not find
the main cause for our slowdown on multiple processors.
We also use intel's vtune amplifier to measure performance on our run
and find that we always spend way too much time 'waiting for work',
e.g. in the backoff mechanism when enabled or in the locks for stealing
work when backoff is disabled. This leads us to believe that our problems
might be connected to some issue with work distribution on the FFT case,
as the unbalanced tree search (with a lot 'local' work) performs good.

To get more data in we add benchmarks on matrix multiplication implemented
in two fashions: once with a 'native' array stealing task and once with
a fork-join task. Both implementations use the same minimum array
sub-size of 4 elements and we can hopefully see if they have any
performance differences.

Best case fork-join:

<img src="media/afd0331b_matrix_best_case_fork.png" width="400"/>

Average case fork-join:

<img src="media/afd0331b_matrix_average_case_fork.png" width="400"/>

Best case Native:

<img src="media/afd0331b_matrix_best_case_native.png" width="400"/>

Average case Native:

<img src="media/afd0331b_matrix_average_case_native.png" width="400"/>