Add further notes on CPI performance.

Commit a78ff0d1 authored May 28, 2019 by FritzFlorian

Showing with 18 additions and 0 deletions

PERFORMANCE.md
+18 -0

--- a/PERFORMANCE.md
+++ b/PERFORMANCE.md
@@ -155,3 +155,21 @@ Best case Native:
 Average case Native:

 <img src="media/afd0331b_matrix_average_case_native.png" width="400"/>
+
+What we find very interesting is that the best-case times of our
+pls library are very fast (as good as TBB), but the average-case times
+are much worse. We currently do not know why this is the case.
+
+### Commit afd0331b - Intel VTune Amplifier
+
+We did several measurements with Intel's VTune Amplifier profiling
+tool. The main thing we notice is that the cycles per instruction (CPI)
+for our useful work blocks increase, thus requiring more CPU time
+for the actual useful work.
+
+We also measured an implementation using TBB and found no significant
+difference, i.e. TBB also has a higher CPI with 8 threads.
+Our conclusion after this long hunt for performance is that we
+might just be bound by some general performance issues in our code.
+The next step will therefore be to read the other frameworks' code and
+our own carefully, trying to find potential issues.
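
The note above argues that a higher CPI means more CPU time for the same useful work. A minimal sketch of that arithmetic follows. It assumes cycle and retired-instruction counts have already been read from a profiler. The `CounterSample` struct and both function names are hypothetical, introduced only for illustration, and are not part of pls, TBB, or VTune.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical snapshot of two hardware counters for one work block.
// The field names are illustrative and do not mirror any VTune export format.
struct CounterSample {
    std::uint64_t cycles;        // unhalted CPU clock cycles
    std::uint64_t instructions;  // retired instructions
};

// CPI = cycles / retired instructions.
// Returns an empty optional when no instructions were retired.
std::optional<double> cycles_per_instruction(const CounterSample& s) {
    if (s.instructions == 0) {
        return std::nullopt;
    }
    return static_cast<double>(s.cycles) / static_cast<double>(s.instructions);
}

// Ratio of the loaded CPI to the baseline CPI. If both runs execute the same
// instructions at the same clock frequency, this is also the factor by which
// CPU time for the useful work grows.
std::optional<double> cpu_time_factor(const CounterSample& baseline,
                                      const CounterSample& loaded) {
    const auto base_cpi = cycles_per_instruction(baseline);
    const auto load_cpi = cycles_per_instruction(loaded);
    if (!base_cpi || !load_cpi || *base_cpi == 0.0) {
        return std::nullopt;
    }
    return *load_cpi / *base_cpi;
}
```

Comparing a single-threaded sample against an 8-thread sample of the same work block this way would show how much of the slowdown is explained by CPI alone, under the assumption that the instruction count per block stays roughly constant.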