Commit ca2c8abd by FritzFlorian

Remove notes for PLS_v1.

If required, they are still in the history. However, they are not up to date and thus mostly confusing.
…Banana Pi M3. For this we generally
[follow the instructions given](https://docs.armbian.com/Developer-Guide_Build-Preparation/);
below are notes on what to do to get the RT kernel patch into it and to build.
You can also use our pre-built image [on Google Drive](https://drive.google.com/open?id=1RiHymBO_XjOk5tMAL31iOSJGfncrWFQh)
and skip the build process below. Just use [Etcher](https://www.balena.io/etcher/) or a similar tool,
flash an SD card and the PI should boot up. The default login is root/1234; follow the instructions,
then continue with the isolating system setup steps for more accurate measurements.
# Notes on performance measures during development
#### Commit 52fcb51f - Add basic random stealing
A slight improvement; it needs further measurement after removing more
important bottlenecks.
Below are three individual measurements of the difference.
Overall the trend (sum of all numbers / last number)
goes down (98.7%, 96.9% and 100.6%), but with one measurement
above 100% we think the improvement is minor.
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | sum |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| old | 1659.01 us | 967.19 us | 830.08 us | 682.69 us | 737.71 us | 747.92 us | 749.37 us | 829.75 us | 7203.73 us |
| new | 1676.06 us | 981.56 us | 814.71 us | 698.72 us | 680.87 us | 737.68 us | 756.91 us | 764.71 us | 7111.22 us |
| change | 101.03 % | 101.49 % | 98.15 % | 102.35 % | 92.30 % | 98.63 % | 101.01 % | 92.16 % | 98.72 % |

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | sum |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| old | 1648.65 us | 973.33 us | 820.18 us | 678.80 us | 746.21 us | 767.63 us | 747.17 us | 1025.35 us | 7407.32 us |
| new | 1655.09 us | 964.99 us | 807.57 us | 731.34 us | 747.47 us | 714.71 us | 794.35 us | 760.28 us | 7175.80 us |
| change | 100.39 % | 99.14 % | 98.46 % | 107.74 % | 100.17 % | 93.11 % | 106.31 % | 74.15 % | 96.87 % |

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | sum |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| old | 1654.26 us | 969.12 us | 832.13 us | 680.69 us | 718.70 us | 750.80 us | 744.12 us | 775.24 us | 7125.07 us |
| new | 1637.04 us | 978.09 us | 799.93 us | 709.33 us | 746.42 us | 684.87 us | 822.30 us | 787.61 us | 7165.59 us |
| change | 98.96 % | 100.93 % | 96.13 % | 104.21 % | 103.86 % | 91.22 % | 110.51 % | 101.60 % | 100.57 % |
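For context, basic random stealing means an idle worker picks its victim uniformly at random from the other workers. A minimal sketch of such a victim selection, with all names hypothetical (this is not the PLS source):

```c++
#include <cstddef>
#include <random>

// Hypothetical sketch: choose a victim worker index uniformly at
// random, excluding the calling thread's own index.
size_t pick_victim(size_t my_id, size_t num_threads, std::mt19937 &rng) {
    // draw from num_threads - 1 candidates...
    std::uniform_int_distribution<size_t> dist(0, num_threads - 2);
    size_t victim = dist(rng);
    if (victim >= my_id) {
        victim++; // ...and shift past our own index so we never pick ourselves
    }
    return victim;
}
```

The stealing loop would call this repeatedly until a steal succeeds or the scheduler shuts down.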
#### Commit 3535cbd8 - Cache Align scheduler_memory
A big improvement of about 6% in our test. This may seem like little,
but 6% saved in the scheduler is a lot, as the 'main work' is the tasks
themselves, not the scheduler.
This change unsurprisingly yields the biggest improvement yet.
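The idea behind cache aligning the scheduler memory can be sketched as follows (illustrative code, not the actual `scheduler_memory` definition): giving each worker's state its own cache line prevents false sharing, where two threads writing to neighboring fields repeatedly invalidate each other's cache lines even though they never touch the same variable.

```c++
#include <atomic>
#include <cstddef>

// 64 bytes is a common cache line size on x86 and many ARM cores
// (the actual value used by PLS may differ).
constexpr size_t CACHE_LINE_SIZE = 64;

struct alignas(CACHE_LINE_SIZE) aligned_thread_state {
    std::atomic<size_t> tasks_processed{0};
    // ...further per-thread scheduler data would live here...
};

// alignas also pads the struct, so an array of these places each
// worker's state on distinct cache lines.
static_assert(sizeof(aligned_thread_state) % CACHE_LINE_SIZE == 0,
              "each per-thread state occupies whole cache lines");
```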
#### Commit b9bb90a4 - Try to figure out the 'high thread bottleneck'
We are currently seeing good performance at low core counts
(up to 1/2 of the machine's cores), but beyond that performance
plummets:
Banana Pi Best-Case:
<img src="./media/b9bb90a4-banana-pi-best-case.png" width="400"/>
Banana Pi Average-Case:
<img src="./media/b9bb90a4-banana-pi-average-case.png" width="400"/>
Laptop Best-Case:
<img src="./media/b9bb90a4-laptop-best-case.png" width="400"/>
Laptop Average-Case:
<img src="./media/b9bb90a4-laptop-average-case.png" width="400"/>
As we can see, on average the performance of PLS gets
far worse than TBB and EMBB after 4 cores. We suspect this is due
to contention, but could not resolve it with any combination
of `tas_spinlock` vs `ttas_spinlock` and `lock` vs `try_lock`.
This issue clearly needs further investigation.
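For reference, minimal sketches of the two spinlock variants compared above (illustrative, not the PLS implementations): the TAS lock issues an atomic exchange on every spin iteration, generating constant bus traffic, while the TTAS lock first spins on a plain read and only attempts the atomic exchange once the lock looks free.

```c++
#include <atomic>

class tas_spinlock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // spin, hammering the cache line with atomic writes
        }
    }
    bool try_lock() { return !flag_.test_and_set(std::memory_order_acquire); }
    void unlock() { flag_.clear(std::memory_order_release); }
};

class ttas_spinlock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        while (true) {
            while (locked_.load(std::memory_order_relaxed)) {
                // read-only spin: stays in the local cache until the
                // holder writes, avoiding constant bus contention
            }
            if (!locked_.exchange(true, std::memory_order_acquire))
                return;
        }
    }
    bool try_lock() { return !locked_.exchange(true, std::memory_order_acquire); }
    void unlock() { locked_.store(false, std::memory_order_release); }
};
```

Under heavy contention TTAS usually generates less coherence traffic, which is why both variants were worth trying here.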
### Commit aa27064 - Performance with ttas spinlocks (and 'full blocking' top level)
<img src="media/aa27064_fft_average.png" width="400"/>
### Commit d16ad3e - Performance with rw-lock and backoff
<img src="media/d16ad3e_fft_average.png" width="400"/>
### Commit 18b2d744 - Performance with lock-free deque
After much tinkering we still have performance problems with higher
thread counts in the FFT benchmark. Upward of 4-5 threads the
performance gains start to saturate (before removing the top level
locks we even saw a slight drop in performance).
Currently the FFT benchmark shows the following results (average):
<img src="media/18b2d744_fft_average.png" width="400"/>
We want to note positively that the overall 'performance drop'
at the hyperthreading mark is no longer really bad; it now
seems similar to EMBB (with backoff + lock-free deque + top level
reader-writer lock). This is partly because the spike at 4 threads
is lower (less performance at 4 threads). We also see better times
on the multiprogrammed system with the lock-free deque.
This is discouraging after many tests. To see where the overhead lies
we also implemented the unbalanced tree search benchmark,
resulting in the following, surprisingly good, results (average):
<img src="media/18b2d744_unbalanced_average.png" width="400"/>
The main difference between the two benchmarks is that the second
one has more work and the work is relatively independent.
Additionally, the first one uses our high level API (parallel invoke),
while the second one uses our low level API.
It is worth investigating whether our high level API or the structure
of the memory accesses in FFT is the problem.
### Commit cf056856 - Remove two-level scheduler
In this test we replace the two level scheduler with ONLY fork_join
tasks. This removes the top level steal overhead and performs only
internal stealing. For this we set the fork_join task as the only
possible task type, removed the top level rw-lock and the digging
down to our level, and solely use internal stealing.
Average results FFT:
<img src="media/cf056856_fft_average.png" width="400"/>
Average results Unbalanced:
<img src="media/cf056856_unbalanced_average.png" width="400"/>
There seems to be only a minor performance difference between the two,
suggesting that our two-level approach is not the part causing our
weaker performance.
### Commit afd0331b - Some notes on scaling problems
After tweaking individual values and parameters we still cannot find
the main cause of our slowdown on multiple processors.
We also used Intel's VTune Amplifier to profile our runs
and found that we always spend far too much time 'waiting for work',
e.g. in the backoff mechanism when enabled, or in the locks for stealing
work when backoff is disabled. This leads us to believe that our problems
might be connected to some issue with work distribution in the FFT case,
as the unbalanced tree search (with a lot of 'local' work) performs well.
To gather more data we added benchmarks on matrix multiplication implemented
in two fashions: once with a 'native' array stealing task and once with
a fork-join task. Both implementations use the same minimum array
sub-size of 4 elements, so we can hopefully see if they have any
performance differences.
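The divide-and-conquer structure of the fork-join matrix multiplication can be sketched as follows (illustrative code, not the benchmark source): the row range is split recursively until at most the minimum sub-size of 4 rows remains, and those rows are then multiplied directly. Here the two halves are simply called sequentially; in the real benchmark each half would be spawned as a task.

```c++
#include <cstddef>
#include <vector>

constexpr size_t MIN_SUB_SIZE = 4; // matches the minimum sub-size above

void multiply_rows(const std::vector<std::vector<double>> &a,
                   const std::vector<std::vector<double>> &b,
                   std::vector<std::vector<double>> &result,
                   size_t row_begin, size_t row_end) {
    if (row_end - row_begin <= MIN_SUB_SIZE) {
        // small enough: multiply this block of rows directly
        for (size_t i = row_begin; i < row_end; i++)
            for (size_t j = 0; j < b[0].size(); j++) {
                double sum = 0.0;
                for (size_t k = 0; k < b.size(); k++)
                    sum += a[i][k] * b[k][j];
                result[i][j] = sum;
            }
        return;
    }
    size_t mid = row_begin + (row_end - row_begin) / 2;
    multiply_rows(a, b, result, row_begin, mid); // would be spawned as a task
    multiply_rows(a, b, result, mid, row_end);   // second task / continuation
}
```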
Best case fork-join:
<img src="media/afd0331b_matrix_best_case_fork.png" width="400"/>
Average case fork-join:
<img src="media/afd0331b_matrix_average_case_fork.png" width="400"/>
Best case Native:
<img src="media/afd0331b_matrix_best_case_native.png" width="400"/>
Average case Native:
<img src="media/afd0331b_matrix_average_case_native.png" width="400"/>
We find it very interesting that the best case times of our
pls library are very fast (as good as TBB), but the average times
drop badly. We currently do not know why this is the case.
### Commit afd0331b - Intel VTune Amplifier
We did several measurements with Intel's VTune Amplifier profiling
tool. The main thing we notice is that the cycles per instruction (CPI)
for our useful work blocks increase, thus requiring more CPU time
for the actual useful work.
We also measured an implementation using TBB and found no significant
difference, e.g. TBB also has a higher CPI with 8 threads.
Our conclusion after this long hunt for performance is that we
might just be bound by some general performance issues in our code.
The next step will therefore be to read the other frameworks' code and our
own carefully, trying to find potential issues.
### Commit 116cf4af - Removing Exponential Backoff
In the steal loop we first had a backoff mechanism as often seen in
locks (spin with a relaxed CPU, then sleep/yield after too many backoffs).
The rationale behind this is to relax the memory bus by not busily
working on atomic variables. We introduced it out of fear that
keeping the CPU busy with spinning would degrade the performance of the
other working threads. However, the above examination with Intel VTune
showed that this does not seem to be the main problem of our implementation
(TBB shows the same CPI increases with more threads; our implementation
seems fine in this regard).
To further reduce elements that could cause performance problems, we
therefore decided to perform one more measurement without this backoff.
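The backoff mechanism described above can be sketched like this (illustrative constants and names, not the exact PLS values): spin with a relaxed CPU for an exponentially growing number of iterations, then fall back to yielding once the backoff grows too large.

```c++
#include <thread>

class backoff {
    unsigned spins_ = 1;
    static constexpr unsigned MAX_SPINS = 1u << 10;
public:
    void relax() {
        if (spins_ <= MAX_SPINS) {
            for (unsigned i = 0; i < spins_; i++) {
                // a real implementation would issue a pause instruction
                // here to relax the memory bus while spinning
            }
            spins_ *= 2; // exponential growth between failed steal attempts
        } else {
            std::this_thread::yield(); // hand the core to working threads
        }
    }
    void reset() { spins_ = 1; }          // called after a successful steal
    unsigned current_spins() const { return spins_; }
};
```

A stealing worker would call `relax()` after every failed attempt and `reset()` once it finds work; removing the mechanism, as tested below, simply drops these calls from the steal loop.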
#### Results of FFT
The first measurement is on the FFT. Here we tested two variants:
one with a 'yield/sleep' statement after a worker thread fails
to steal any work from every other thread on a first pass, and
one without this sleep. The rationale behind the sleep is that
it relaxes the CPU (it is also found in EMBB).
Average with sleep:
<img src="media/116cf4af_fft_average_sleep.png" width="400"/>
Average without sleep:
<img src="media/116cf4af_fft_average_no_sleep.png" width="400"/>
We clearly observe that the version without a sleep statement
is faster, and will thus exclude this statement in future
experiments/measurements. This also makes sense, as our
steal loop can fail even though there potentially is work
(because of our lock-free deque implementation).
#### Results Matrix
We re-ran our benchmarks on the fork-join and native matrix
multiplication implementations to see how they change without
the backoff. We expect good results, as the matrix multiplication
mostly has enough work to keep all threads busy, so workers spend
less time spinning in the steal loop.
Average Fork-Join Matrix:
<img src="media/116cf4af_matrix_average_fork.png" width="400"/>
Average Native Matrix:
<img src="media/116cf4af_matrix_average_native.png" width="400"/>
The results are far better than the last ones, and indicate that
removing the backoff can drastically improve performance.
#### Conclusion
We will exclude the backoff mechanisms from further tests, as this
seems to generally improve performance (or at least not harm it, in
the case of FFT).
We also want to note that all these measurements are not very
controlled/scientific, but are simply run on our notebook for
fast iterations over different potential issues with our scheduler.
### Commit 116cf4af - VTune Amplifier and MRSW top level lock
When looking at why our code works quite well on problems with
mostly busy workers and not so well on code with spinning/waiting
workers (like in the FFT), we take a closer look at the FFT and
matrix multiplication in VTune.
FFT:
<img src="media/116cf4af_fft_vtune.png" width="400"/>
Matrix:
<img src="media/116cf4af_matrix_vtune.png" width="400"/>
The sections highlighted in red represent time spent
spinning in the work-stealing loop.
We can see that as long as our workers are mostly busy/find work
in the stealing loop, the overhead spent on spinning is minimal.
We can also see that in the FFT considerable amounts of time are
spent spinning.
A general observation is the high CPI rate of our spinning code.
This makes sense, as we are currently working on locks that share
atomic variables in order to work, thus leading to cache misses.
### Commit 116cf4af - 2D Heat Diffusion
As a last test for our current state on performance we implemented the
2D heat diffusion benchmark using our framework (using fork-join based
parallel_for, 512 heat array size):
<img src="media/116cf4af_heat_average.png" width="400"/>
We observe solid performance from our implementation.
(Again, not a very scientific test environment, but good enough for
our general direction.)
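The per-row work that a fork-join based parallel_for would distribute in this benchmark can be sketched as a Jacobi update (illustrative code, not the benchmark source): each inner cell moves toward the average of its four neighbors, while the border cells stay fixed.

```c++
#include <cstddef>
#include <vector>

// One 2D heat diffusion step over the row range [row_begin, row_end);
// a parallel_for would split this range across the workers.
void diffusion_step(const std::vector<std::vector<double>> &current,
                    std::vector<std::vector<double>> &next,
                    size_t row_begin, size_t row_end) {
    for (size_t i = row_begin; i < row_end; i++)
        for (size_t j = 1; j < current[i].size() - 1; j++)
            next[i][j] = 0.25 * (current[i - 1][j] + current[i + 1][j] +
                                 current[i][j - 1] + current[i][j + 1]);
}
```

Because every row's update is independent, this workload keeps all threads busy, which matches the solid scaling observed above.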
### Commit 3bdaba42 - Move to pure fork-join tasks (remove two level)
We moved away from our two-level scheduler approach towards a
pure fork-join task model (in order to remove any locks in the
code more easily and to make further tests simpler/more focused
on one specific aspect).
These are the measurements made after the change
(without any performance optimizations done):
FFT Average:
<img src="media/3bdaba42_fft_average.png" width="400"/>
Heat Diffusion Average:
<img src="media/3bdaba42_heat_average.png" width="400"/>
Matrix Multiplication Average:
<img src="media/3bdaba42_matrix_average.png" width="400"/>
Unbalanced Tree Search Average:
<img src="media/3bdaba42_unbalanced_average.png" width="400"/>
We note that in heat diffusion, matrix multiplication and unbalanced
tree search - all three benchmarks with mostly enough work available at
all times - our implementation performs head to head with Intel's
TBB. Only the FFT benchmark remains a major problem for our library.
We notice a MAJOR drop in performance exactly at the hyperthreading
mark, indicating problems with resources shared between the spinning
threads (threads without any actual work) and the threads actually
performing work. Most likely there is a shared resource on the same cache
line that hinders the working threads, but we cannot really
figure out which one it is.
### Commit be2cdbfe - Locking Deque
Switching to a locking deque has not improved (and even slightly hurt)
performance; we therefore think that the deque itself is not the
part slowing down our execution.
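A minimal sketch of the kind of locking deque tried in this commit (illustrative, not the PLS implementation): the owning worker pushes and pops at the bottom, thieves pop at the top, and a single mutex serializes all access.

```c++
#include <deque>
#include <mutex>
#include <optional>
#include <utility>

template <typename Task>
class locking_deque {
    std::deque<Task> tasks_;
    mutable std::mutex mutex_;
public:
    void push_bottom(Task t) { // called by the owning worker
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push_back(std::move(t));
    }
    std::optional<Task> pop_bottom() { // called by the owning worker
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.back());
        tasks_.pop_back();
        return t;
    }
    std::optional<Task> pop_top() { // called by stealing workers
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.front());
        tasks_.pop_front();
        return t;
    }
};
```

That swapping this in changes little supports the conclusion that the deque is not the bottleneck.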
### Commit 5044f0a1 - Performance Bottleneck in FFT FIXED
By moving from directly calling one of the parallel invocations
```c++
scheduler::spawn_child(sub_task_2);
function1(); // Execute first function 'inline' without spawning a sub_task object
```
to spawning two tasks
```c++
scheduler::spawn_child(sub_task_2);
scheduler::spawn_child(sub_task_1);
```
we were able to fix the bad performance of our framework in the
FFT benchmark (where some worker threads spend a lot of time
spinning/idling).
We think this is due to some sort of cache misses/bus contention
on the finishing counters. This would make sense, as the drop
at the hyperthreading mark indicates problems with this part of the
CPU pipeline (although it did not show clearly in our profiling runs).
We will now try to find the exact spot where the problem originates and
fix the source rather than 'circumventing' it with these extra tasks.
(This, then again, should hopefully even boost the performance of all
other workloads, as contention on the bus/cache is always bad.)
After some research we think that the issue comes down to many threads
referencing the same atomic reference counter. We think so because
even cache aligning the shared reference count does not fix the issue
when using the direct function call. Also, forcing a new method call
(going down one function call in the call stack) does not solve the
issue (thus making sure that it is not related to some caching issue
in the call itself).
In conclusion there seems to be a hyperthreading issue with this
shared reference count. We keep this in mind in case we eventually get
tasks with changing data members (as this problem could reappear there,
since the ref_count then actually is in the same memory region as our
'user variables'). For now we leave the code as it is.
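For context, the finishing-counter protocol referred to above works roughly like this (illustrative sketch, not the PLS code): a parent sets the counter to the number of outstanding children, every finishing child decrements it, and the child that brings it to zero knows the parent may continue. Because every child hits this one atomic, it is a natural spot for cache-line contention under many hardware threads.

```c++
#include <atomic>

struct finish_counter {
    std::atomic<int> ref_count{0};

    void spawned(int children) {
        ref_count.store(children, std::memory_order_relaxed);
    }
    // Returns true only for the child that finished last; that child
    // is responsible for resuming the parent's continuation.
    bool child_done() {
        return ref_count.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};
```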
FFT Average with new call method:
<img src="media/5044f0a1_fft_average.png" width="400"/>
The performance of our new call method looks shockingly similar
to TBB, trailing it by a slight, constant margin.
This makes sense, as the basic principles (lock-free, classic work
stealing deque and the parallel call structure) are nearly the same.
We will see if minor optimizations can close this last gap.
Overall the performance at this point is good enough to move on
to implementing more functionality and to running tests on different
queues/stealing tactics etc.
#### Commit e34ea267 - 05.12.2019 - First Version of new Algorithm - Scaling Problems
The first version of our memory trading work stealing algorithm works. It still shows scaling issues above
the hyperthreading mark, very similar to what we saw in version 1. This indicates some sort of
contention between the threads when running the FFT algorithm.
Analyzing the current version we find an issue with the frequent calls to `thread_state_for(id)` in
the stealing loop.
![](./media/e34ea267_thread_state_for.png)
It is obvious that the method takes a noticeable amount of runtime, as FFT has a structure that tends to
work only on the continuations at the end of the computation (the critical path of FFT can only be executed
after most parallel tasks are done).
![](./media/e34ea267_fft_execution_pattern.png)
What we can see here is the long tail of continuations running at the end of the computation. During
this time the non-working threads constantly steal, thus repeatedly invoking the `thread_state_for(id)`
virtual method and potentially hindering other threads from doing their work properly.
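One possible mitigation suggested by this profile (a hypothetical sketch, with all names illustrative, not the PLS API): instead of resolving `thread_state_for(id)` through a virtual call on every steal attempt, the stealing loop could snapshot the per-thread state pointers once at startup and then index a plain array in the hot path.

```c++
#include <cstddef>
#include <utility>
#include <vector>

struct thread_state { /* per-worker deques, ids, ... */ };

class state_cache {
    std::vector<thread_state *> states_;
public:
    explicit state_cache(std::vector<thread_state *> states)
        : states_(std::move(states)) {}
    // Plain array lookup: no virtual dispatch inside the steal loop.
    thread_state *state_for(size_t id) const { return states_[id]; }
};
```

Whether this removes the overhead seen in the profile would of course need to be measured.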