Commit 9ab06d6f by FritzFlorian

Performance notes on removing the exponential-backoff mechanism.

parent 116cf4af
......@@ -173,3 +173,72 @@ Our conclusion after this long hunt for performance is that we
might simply be bound by some general performance issues in our code.
The next step will therefore be to read the other frameworks and our
code carefully, trying to find potential issues.
### Commit 116cf4af - Removing Exponential Backoff
In the steal loop we initially had a backoff mechanism as often seen in
locks (spin with a relaxed CPU, then sleep/yield after too many backoffs).
The rationale behind this is to relieve the memory bus by not busily
working on atomic variables. We originally introduced it out of concern that
keeping the CPU busy with spinning would degrade the performance of the
other working threads. However, the above examination with Intel VTune
showed that this does not seem to be the main problem of our implementation
(TBB shows the same CPI increases with more threads, so our implementation
seems fine in this regard).
To further reduce elements that could cause performance problems, we
therefore decided to perform one more measurement without this backoff.
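For readers unfamiliar with the pattern, the following is a minimal sketch of a lock-style exponential backoff as described above: spin with a relaxed CPU, double the spin count after each failure, and yield once a threshold is exceeded. The class name `Backoff`, its parameters, and the `cpu_relax` helper are hypothetical and chosen for illustration only; this is not the code that was removed from the scheduler.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical helper illustrating exponential backoff; not the scheduler's actual class.
class Backoff {
 public:
  Backoff(std::uint32_t initial_spins, std::uint32_t yield_threshold)
      : initial_(initial_spins), threshold_(yield_threshold), current_(initial_spins) {
    assert(initial_spins > 0 && yield_threshold >= initial_spins);
  }

  // Call after a failed attempt (e.g. a failed steal).
  // Returns true if the thread gave up its time slice instead of spinning.
  bool wait() {
    if (current_ > threshold_) {
      std::this_thread::yield();  // too many backoffs: let other threads run
      return true;
    }
    for (std::uint32_t i = 0; i < current_; ++i) {
      cpu_relax();  // spin without touching shared atomics
    }
    current_ *= 2;  // exponential growth of the spin count
    return false;
  }

  // Call after a successful attempt to start over with short spins.
  void reset() { current_ = initial_; }

  std::uint32_t current_spins() const { return current_; }

 private:
  static void cpu_relax() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_ia32_pause();  // x86 PAUSE hint for spin-wait loops
#else
    std::atomic_signal_fence(std::memory_order_seq_cst);  // compiler barrier as a fallback
#endif
  }

  std::uint32_t initial_;
  std::uint32_t threshold_;
  std::uint32_t current_;
};
```

A steal loop using such a helper would call `wait()` after every failed steal and `reset()` after a successful one. Removing the backoff amounts to retrying the steal immediately instead.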
#### Results of FFT
The first measurement is on the FFT. Here we tested two variants:
one with a 'yield/sleep' statement that is executed after a worker
thread has tried every other thread once and failed to steal any work,
and one without this sleep. The rationale behind the sleep is that
it relaxes the CPU (it is also found in EMBB).
Average with sleep:
<img src="media/116cf4af_fft_average_sleep.png" width="400"/>
Average without sleep:
<img src="media/116cf4af_fft_average_no_sleep.png" width="400"/>
We clearly observe that the version without a sleep statement
is faster, and we will therefore exclude this statement in future
experiments/measurements. This also makes sense, as a steal attempt
in our loop can fail even though there potentially is work
(because of our lock-free deque implementation).
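To illustrate the two variants, here is a simplified sketch of one steal round with an optional yield after failure. The names `VictimQueue`, `try_steal` and `steal_round` are hypothetical, and the queue models only the thief side of a lock-free deque. It is an assumption-laden illustration, not our actual deque or steal loop. Note how `try_steal` performs a single compare-and-swap: if another thief wins the race, the attempt reports failure even though the queue may still hold work, which is why sleeping after a failed round can waste time.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a worker's lock-free deque (thief side only).
struct VictimQueue {
  std::vector<int> tasks;           // task ids, filled before stealing starts
  std::atomic<std::size_t> top{0};  // index of the next task to steal

  std::optional<int> try_steal() {
    std::size_t t = top.load(std::memory_order_acquire);
    if (t >= tasks.size()) {
      return std::nullopt;  // queue is really empty
    }
    int task = tasks[t];
    // Single CAS attempt: losing the race counts as a failed steal,
    // even though more work may remain in this queue.
    if (!top.compare_exchange_strong(t, t + 1, std::memory_order_acq_rel)) {
      return std::nullopt;
    }
    return task;
  }
};

// Tries every other worker once. If yield_on_failure is set, the thread
// yields after an unsuccessful round (the 'with sleep' variant).
std::optional<int> steal_round(std::vector<VictimQueue>& victims, std::size_t self,
                               bool yield_on_failure) {
  assert(self < victims.size());
  for (std::size_t i = 1; i < victims.size(); ++i) {
    std::size_t target = (self + i) % victims.size();
    if (auto task = victims[target].try_steal()) {
      return task;
    }
  }
  if (yield_on_failure) {
    std::this_thread::yield();
  }
  return std::nullopt;
}
```

In the variant without sleep, the caller simply invokes `steal_round` again right away, so a steal that failed only because of a lost race is retried with minimal delay.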
#### Results Matrix
We re-ran our benchmarks on the fork-join and native matrix
multiplication implementations to see how those change without
the backoff. We expect good results, as the matrix multiplication
mostly has enough work to keep all threads busy, so workers
spend less time spinning in the steal loop.
Average Fork-Join Matrix:
<img src="media/116cf4af_matrix_average_fork.png" width="400"/>
Average Native Matrix:
<img src="media/116cf4af_matrix_average_native.png" width="400"/>
The results are far better than the last ones, and indicate that
removing the backoff can drastically improve performance.
#### Conclusion
We will exclude the backoff mechanisms for further tests, as this
seems to generally improve performance (or at least not harm it in
the case of FFT).
We also want to note that all these measurements are not very
controlled/scientific, but were simply run on our notebook for
fast iterations over different potential issues with our scheduler.