diff --git a/PERFORMANCE.md b/PERFORMANCE.md
index 343c880..7cc7c1e 100644
--- a/PERFORMANCE.md
+++ b/PERFORMANCE.md
@@ -173,3 +173,72 @@
 Our conclusion after this long hunting for performance is, that
 we might just be bound by some general performance issues with our code.
 The next step will therefore be to read the other frameworks and our code carefully, trying to find potential issues.
+
+### Commit 116cf4af - Removing Exponential Backoff
+
+In the steal loop we first had a backoff mechanism as often seen in
+locks (spin with a relaxed CPU, then sleep/yield after too many backoffs).
+The rationale behind this is to relax the memory bus by not busily
+working on atomic variables. We introduced it out of the fear that
+keeping the CPU busy with spinning would degrade the performance of the
+other worker threads. However, the above examination with Intel VTune
+showed that this does not seem to be the main problem of our implementation
+(TBB shows the same CPI increases with more threads, so our implementation
+seems fine in this regard).
+
+To further reduce elements that could cause performance problems, we
+therefore decided to perform one more measurement without this backoff.
+
+#### Results of FFT
+
+The first measurement is on the FFT. Here we tested two variants:
+one with a 'yield/sleep' statement after a worker thread failed
+to steal any work despite trying once on every other thread, and
+one without this sleep. The rationale behind the sleep is that
+it relaxes the CPU (it is also found in EMBB).
+
+Average with sleep:
+
+![FFT average with sleep](media/116cf4af_fft_average_sleep.png)
+
+Average without sleep:
+
+![FFT average without sleep](media/116cf4af_fft_average_no_sleep.png)
+
+We clearly observe that the version without the sleep statement
+is faster, and we will thus exclude this statement in future
+experiments/measurements. This also makes sense, as our
+steal loop can fail even though there potentially is work
+(because of our lock-free deque implementation).
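The removed mechanism can be sketched roughly as follows. This is an illustrative reconstruction, not our actual code: the `Backoff` class, the spin/sleep constants, and the `try_steal` callback are all hypothetical stand-ins for the lock-free deque's steal operation, which may fail spuriously even when work exists.

```cpp
#include <atomic>
#include <chrono>
#include <optional>
#include <thread>

// Exponential backoff as often seen in spin locks: spin with a CPU
// 'relax' hint, doubling the spin count each round, then give up the
// time slice entirely once the spin budget is exhausted.
class Backoff {
    static constexpr int kSpinLimit = 1024;
    int spins_ = 1;

public:
    void relax() {
        if (spins_ <= kSpinLimit) {
            for (int i = 0; i < spins_; ++i) {
                // Stand-in for a dedicated pause instruction
                // (e.g. x86 'pause') that relaxes the memory bus.
                std::this_thread::yield();
            }
            spins_ *= 2; // exponential growth
        } else {
            // Too many failed attempts: sleep instead of spinning.
            std::this_thread::sleep_for(std::chrono::microseconds(100));
        }
    }

    void reset() { spins_ = 1; }
};

// Sketch of a steal loop using the backoff. Removing the backoff
// (as in commit 116cf4af) means simply deleting the relax() call,
// so a spurious steal failure is retried immediately.
template <typename StealFn>
int steal_loop_with_backoff(StealFn try_steal, int attempts) {
    Backoff backoff;
    for (int i = 0; i < attempts; ++i) {
        if (std::optional<int> task = try_steal()) {
            return *task; // stole a task, go execute it
        }
        backoff.relax(); // the line removed in the no-backoff variant
    }
    return -1; // gave up without finding work
}
```

Because the deque's steal can fail spuriously, sleeping after a failure delays picking up work that is actually available, which matches the faster results we measured without the sleep.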
+
+#### Results Matrix
+
+We re-ran our benchmarks on the fork-join and native matrix
+multiplication implementations to see how they change without
+the backoff. We expect good results, as the matrix multiplication
+mostly has enough work to keep all threads busy, so workers
+spend less time spinning in the steal loop.
+
+Average Fork-Join Matrix:
+
+![Fork-join matrix average](media/116cf4af_matrix_average_fork.png)
+
+Average Native Matrix:
+
+![Native matrix average](media/116cf4af_matrix_average_native.png)
+
+The results are far better than the previous ones, and indicate that
+removing the backoff can drastically improve performance.
+
+#### Conclusion
+
+We will exclude the backoff mechanisms in further tests, as this
+seems to generally improve performance (or, in the case of FFT,
+at least not harm it).
+
+We also want to note that all these measurements are not very
+controlled/scientific; they were simply run on our notebook for
+fast iterations over different potential issues with our scheduler.
diff --git a/media/116cf4af_fft_average_no_sleep.png b/media/116cf4af_fft_average_no_sleep.png
new file mode 100644
index 0000000..d40844e
Binary files /dev/null and b/media/116cf4af_fft_average_no_sleep.png differ
diff --git a/media/116cf4af_fft_average_sleep.png b/media/116cf4af_fft_average_sleep.png
new file mode 100644
index 0000000..94519db
Binary files /dev/null and b/media/116cf4af_fft_average_sleep.png differ
diff --git a/media/116cf4af_matrix_average_fork.png b/media/116cf4af_matrix_average_fork.png
new file mode 100644
index 0000000..814480d
Binary files /dev/null and b/media/116cf4af_matrix_average_fork.png differ
diff --git a/media/116cf4af_matrix_average_native.png b/media/116cf4af_matrix_average_native.png
new file mode 100644
index 0000000..760c5fe
Binary files /dev/null and b/media/116cf4af_matrix_average_native.png differ