Commit 9ab06d6f by FritzFlorian

Performance notes on removing the exponential-backoff mechanism.

parent 116cf4af
Our conclusion after this long hunt for performance is that we
might just be bound by some general performance issues with our code.
The next step will therefore be to read the other frameworks and our
code carefully, trying to find potential issues.
### Commit 116cf4af - Removing Exponential Backoff
In the steal loop we originally had a backoff mechanism as often seen in
locks (spin with a relaxed CPU, then sleep/yield after too many backoffs).
The rationale behind this is to relax the memory bus by not busily
working on atomic variables. We first introduced it out of fear that
keeping the CPU busy with spinning would degrade the performance of the
other worker threads. However, the above examination with Intel VTune
showed that this does not seem to be the main problem of our implementation
(TBB shows the same CPI increase with more threads, so our implementation
seems fine in this regard).
To further reduce elements that could cause performance problems, we
therefore decided to perform one more measurement without this backoff.
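For illustration, the following is a minimal sketch of the kind of exponential backoff that was removed, as commonly used in spin locks. The names (`backoff`, `cpu_relax`) and the concrete spin/sleep parameters are hypothetical and not the actual scheduler API.

```cpp
#include <chrono>
#include <thread>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#endif

// Relax the CPU inside a spin loop (PAUSE on x86, plain yield elsewhere).
inline void cpu_relax() {
#if defined(__x86_64__) || defined(_M_X64)
  _mm_pause();
#else
  std::this_thread::yield();
#endif
}

// Exponential backoff as often seen in spin locks: spin with a relaxed CPU,
// doubling the number of spins after every failed attempt, and give up the
// time slice with a short sleep once a threshold is reached.
class backoff {
  static constexpr int max_spins_before_sleep = 1 << 10;  // illustrative value
  int spins_ = 1;

 public:
  void do_backoff() {
    if (spins_ <= max_spins_before_sleep) {
      for (int i = 0; i < spins_; i++) {
        cpu_relax();
      }
      spins_ *= 2;  // exponential growth of the spin count
    } else {
      // Too many failed attempts: sleep to take pressure off the memory bus.
      std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
  }

  void reset() { spins_ = 1; }  // called after a successful steal
};
```

In such a scheme a worker would call `do_backoff()` after every failed steal attempt and `reset()` once it acquires work again.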
#### Results of FFT
The first measurement is on the FFT. Here we tested two variants:
one with a 'yield/sleep' statement after a worker thread failed
to steal any work on its first pass over every other thread, and
one without this sleep. The rationale behind the sleep is that
it relaxes the CPU (a similar mechanism is also found in EMBB).
Average with sleep:
<img src="media/116cf4af_fft_average_sleep.png" width="400"/>
Average without sleep:
<img src="media/116cf4af_fft_average_no_sleep.png" width="400"/>
We clearly observe that the version without a sleep statement
is faster, and we will therefore exclude this statement from future
experiments/measurements. This also makes sense, as our
steal loop can fail even though there potentially is work
(because of our lock-free deque implementation).
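To make the two variants concrete, here is a hedged sketch of a steal loop with an optional yield after one full unsuccessful round over all other threads; `steal_loop`, `try_steal_from`, and the surrounding names are illustrative assumptions, not our actual implementation.

```cpp
#include <cstddef>
#include <thread>

// Sketch of the two measured steal-loop variants. TryStealFn is any callable
// Task*(std::size_t victim_id) that returns nullptr on a failed steal; with a
// lock-free deque such a failure can be spurious even though the victim still
// holds work. All names here are illustrative, not the actual scheduler API.
template <typename Task, typename TryStealFn>
Task* steal_loop(std::size_t my_id, std::size_t num_threads,
                 TryStealFn try_steal_from, bool with_yield) {
  while (true) {
    for (std::size_t victim = 0; victim < num_threads; victim++) {
      if (victim == my_id) continue;
      if (Task* stolen = try_steal_from(victim)) {
        return stolen;  // success: go back to executing work
      }
    }
    // One full, unsuccessful round over every other thread.
    if (with_yield) {
      std::this_thread::yield();  // variant with 'yield/sleep' (as in EMBB)
    }
    // Without the yield we retry immediately, since a failed steal may be
    // spurious and work could still be available.
  }
}
```

The two measurements above correspond to `with_yield = true` versus `with_yield = false`.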
#### Results of Matrix Multiplication
We re-ran our benchmarks on the fork-join and native matrix
multiplication implementations to see how they change without
the backoff. We expect good results, as the matrix multiplication
mostly has enough work to keep all threads busy, so workers
spend less time spinning in the steal loop.
Average Fork-Join Matrix:
<img src="media/116cf4af_matrix_average_fork.png" width="400"/>
Average Native Matrix:
<img src="media/116cf4af_matrix_average_native.png" width="400"/>
The results are far better than the previous ones and indicate that
removing the backoff can drastically improve performance.
#### Conclusion
We will exclude the backoff mechanism from further tests, as removing it
seems to generally improve performance (or, in the case of the FFT,
at least not harm it).
We also want to note that all these measurements are not very
controlled/scientific, but were simply run on our notebook for
fast iterations over different potential issues with our scheduler.