diff --git a/BANANAPI.md b/BANANAPI.md
index 588fef1..62962b5 100644
--- a/BANANAPI.md
+++ b/BANANAPI.md
@@ -12,7 +12,7 @@ bananaPI m3.
 For this we generally [follow the instructions given](https://docs.armbian.com/Developer-Guide_Build-Preparation/),
 below are notes on what to do to get the rt kernel patch into it and to build.
 
-You can also use [our pre-build image](https://drive.google.com/open?id=1RiHymBO_XjOk5tMAL31iOSJGfncrWFQh)
+You can also use our pre-built image [on Google Drive](https://drive.google.com/open?id=1RiHymBO_XjOk5tMAL31iOSJGfncrWFQh)
 and skip the build process below. Just use etcher (https://www.balena.io/etcher/) or similar, flash
 an sd card and the PI should boot up. Default login is root/1234, follow the instructions, then continue
 with the isolating system setup steps for more accurate measurements.
diff --git a/PERFORMANCE-v1.md b/PERFORMANCE-v1.md
deleted file mode 100644
index 7112439..0000000
--- a/PERFORMANCE-v1.md
+++ /dev/null
@@ -1,384 +0,0 @@
-# Notes on performance measures during development
-
-#### Commit 52fcb51f - Add basic random stealing
-
-Slight improvement, needs further measurement after removing more important bottlenecks.
-Below are three individual measurements of the difference.
-Overall the trend (the last column, i.e. the sum of all numbers) goes down
-(98.7%, 96.9% and 100.6%), but with one measurement above 100%
-we think the improvements are minor.
-
-| | | | | | | | | | |
-| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-old | 1659.01 us| 967.19 us| 830.08 us| 682.69 us| 737.71 us| 747.92 us| 749.37 us| 829.75 us| 7203.73 us
-new | 1676.06 us| 981.56 us| 814.71 us| 698.72 us| 680.87 us| 737.68 us| 756.91 us| 764.71 us| 7111.22 us
-change | 101.03 %| 101.49 %| 98.15 %| 102.35 %| 92.30 %| 98.63 %| 101.01 %| 92.16 %| 98.72 %
-
-| | | | | | | | | | |
-| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-old | 1648.65 us| 973.33 us| 820.18 us| 678.80 us| 746.21 us| 767.63 us| 747.17 us| 1025.35 us| 7407.32 us
-new | 1655.09 us| 964.99 us| 807.57 us| 731.34 us| 747.47 us| 714.71 us| 794.35 us| 760.28 us| 7175.80 us
-change | 100.39 %| 99.14 %| 98.46 %| 107.74 %| 100.17 %| 93.11 %| 106.31 %| 74.15 %| 96.87 %
-
-| | | | | | | | | | |
-| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-old | 1654.26 us| 969.12 us| 832.13 us| 680.69 us| 718.70 us| 750.80 us| 744.12 us| 775.24 us| 7125.07 us
-new | 1637.04 us| 978.09 us| 799.93 us| 709.33 us| 746.42 us| 684.87 us| 822.30 us| 787.61 us| 7165.59 us
-change | 98.96 %| 100.93 %| 96.13 %| 104.21 %| 103.86 %| 91.22 %| 110.51 %| 101.60 %| 100.57 %
-
-#### Commit 3535cbd8 - Cache Align scheduler_memory
-
-A big improvement of about 6% in our test. This may seem small,
-but 6% from the scheduler is a lot, as the 'main work' is the tasks
-themselves, not the scheduler.
-
-This change unsurprisingly yields the biggest improvement yet.
-
-#### Commit b9bb90a4 - Try to figure out the 'high thread bottleneck'
-
-We are currently seeing good performance on low core counts
-(up to 1/2 of the machine's cores), but after that performance
-plummets:
-
-Banana-Pi Best-Case:
-
-Banana-Pi Average-Case:
-
-Laptop Best-Case:
-
-Laptop Average-Case:
-
-As we can see, on average the performance of PLS starts getting
-way worse than TBB and EMBB after 4 cores. We suspect this is due
-to contention, but could not resolve it with any combination
-of `tas_spinlock` vs `ttas_spinlock` and `lock` vs `try_lock`.
-
-This issue clearly needs further investigation.
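For reference, a minimal sketch of the two spinlock flavours compared above (plain test-and-set vs. test-and-test-and-set). The class names, memory orderings, and `try_lock` shapes are illustrative assumptions, not the actual `pls` implementations:

```c++
#include <atomic>

// Plain TAS lock: every spin iteration performs an atomic exchange,
// keeping the cache line bouncing between all spinning cores.
class tas_spinlock_sketch {
  std::atomic_flag flag_ = ATOMIC_FLAG_INIT;

 public:
  void lock() {
    while (flag_.test_and_set(std::memory_order_acquire)) { /* spin */ }
  }
  bool try_lock() { return !flag_.test_and_set(std::memory_order_acquire); }
  void unlock() { flag_.clear(std::memory_order_release); }
};

// TTAS lock: spin on a plain load first and only attempt the atomic
// exchange once the lock looks free, which reduces bus traffic.
class ttas_spinlock_sketch {
  std::atomic<bool> locked_{false};

 public:
  void lock() {
    for (;;) {
      while (locked_.load(std::memory_order_relaxed)) { /* spin on cached copy */ }
      if (!locked_.exchange(true, std::memory_order_acquire)) return;
    }
  }
  bool try_lock() {
    return !locked_.load(std::memory_order_relaxed) &&
           !locked_.exchange(true, std::memory_order_acquire);
  }
  void unlock() { locked_.store(false, std::memory_order_release); }
};
```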
-
-### Commit aa27064 - Performance with ttas spinlocks (and 'full blocking' top level)
-
-### Commit d16ad3e - Performance with rw-lock and backoff
-
-### Commit 18b2d744 - Performance with lock-free deque
-
-After much tinkering we still have performance problems with higher
-thread counts in the FFT benchmark. From 4/5 threads upward the
-performance gains start to saturate (before removing the top level
-locks we even saw a slight drop in performance).
-
-Currently the FFT benchmark shows the following results (average):
-
-We want to positively note that the overall trend of 'performance drops'
-at the hyperthreading mark is no longer as pronounced; it rather
-seems similar to EMBB now (with backoff + lock-free deque + top level
-reader-writer lock). This comes partly because the spike at 4 threads
-is lower (less performance at 4 threads). We also see better times
-on the multiprogrammed system with the lock-free deque.
-
-This is discouraging after many tests. To see where the overhead lies
-we also implemented the unbalanced tree search benchmark,
-resulting in the following, surprisingly good, results (average):
-
-The main difference between the two benchmarks is that the second
-one has more work and the work is relatively independent.
-Additionally, the first one uses our high level API (parallel invoke),
-while the second one uses our low level API.
-It is worth investigating whether our high level API or the structure
-of the memory accesses in FFT is the problem.
-
-### Commit cf056856 - Remove two-level scheduler
-
-In this test we replace the two level scheduler with ONLY fork_join
-tasks. This removes the top level steal overhead and performs only
-internal stealing. For this we set the fork_join task as the only
-possible task type, removed the top level rw-lock and the digging
-down to our level, and solely use internal stealing.
-
-Average results FFT:
-
-Average results Unbalanced:
-
-There seems to be only a minor performance difference between the two,
-suggesting that our two-level approach is not the part causing our
-weaker performance.
-
-### Commit afd0331b - Some notes on scaling problems
-
-After tweaking individual values and parameters we still can not find
-the main cause for our slowdown on multiple processors.
-We also use Intel's VTune Amplifier to measure performance on our runs
-and find that we always spend way too much time 'waiting for work',
-e.g. in the backoff mechanism when enabled or in the locks for stealing
-work when backoff is disabled. This leads us to believe that our problems
-might be connected to some issue with work distribution in the FFT case,
-as the unbalanced tree search (with a lot of 'local' work) performs well.
-
-To get more data we add benchmarks on matrix multiplication implemented
-in two fashions: once with a 'native' array stealing task and once with
-a fork-join task. Both implementations use the same minimum array
-sub-size of 4 elements, so we can hopefully see whether they have any
-performance differences.
-
-Best case fork-join:
-
-Average case fork-join:
-
-Best case Native:
-
-Average case Native:
-
-What we find very interesting is that the best case times of our
-pls library are very fast (as good as TBB), but the average times
-drop badly. We currently do not know why this is the case.
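To make the fork-join variant mentioned above more concrete, here is a rough sequential sketch of the recursion it performs. Plain recursion stands in for the task spawning our framework does, and the cutoff of 4 mirrors the minimum array sub-size; the function name and row-splitting scheme are assumptions for illustration, not the benchmark code itself:

```c++
#include <cstddef>
#include <vector>

// Recursively split the output rows of C = A * B (all n x n, row-major)
// until at most 4 rows remain, then multiply that block directly.
void multiply_rows(const std::vector<double>& a, const std::vector<double>& b,
                   std::vector<double>& c, std::size_t n,
                   std::size_t row_begin, std::size_t row_end) {
  if (row_end - row_begin <= 4) {  // minimum sub-size reached
    for (std::size_t i = row_begin; i < row_end; i++) {
      for (std::size_t j = 0; j < n; j++) {
        double sum = 0.0;
        for (std::size_t k = 0; k < n; k++) sum += a[i * n + k] * b[k * n + j];
        c[i * n + j] = sum;
      }
    }
    return;
  }
  std::size_t mid = row_begin + (row_end - row_begin) / 2;
  // In the benchmark these two halves are spawned as parallel fork-join tasks.
  multiply_rows(a, b, c, n, row_begin, mid);
  multiply_rows(a, b, c, n, mid, row_end);
}
```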
-
-### Commit afd0331b - Intel VTune Amplifier
-
-We did several measurements with Intel's VTune Amplifier profiling
-tool. The main thing that we notice is that the cycles per instruction
-for our useful work blocks increase, thus requiring more CPU time
-for the actual useful work.
-
-We also measured an implementation using TBB and found no significant
-difference, e.g. TBB also has a higher CPI with 8 threads.
-Our conclusion after this long hunt for performance is that we
-might just be bound by some general performance issues with our code.
-The next step will therefore be to read the other frameworks and our
-code carefully, trying to find potential issues.
-
-### Commit 116cf4af - Removing Exponential Backoff
-
-In the steal loop we first had a backoff mechanism as often seen in
-locks (spin with a relaxed CPU, then sleep/yield after too many backoffs).
-The rationale behind this is to relax the memory bus by not busily
-working on atomic variables. We introduced it out of the fear that
-keeping the CPU busy with spinning would degrade the performance of the
-other working threads. However, the above examination with Intel VTune
-showed that this seems not to be the main problem of our implementation
-(TBB shows the same CPI increases with more threads, so our implementation
-seems fine in this regard).
-
-To further reduce elements that could cause performance problems, we
-therefore decided to perform one more measurement without this backoff.
-
-#### Results of FFT
-
-The first measurement is on the FFT. Here we tested two variants:
-one with a 'yield/sleep' statement after a worker thread failed
-to steal any work after the first try on every other thread, and
-one without this sleep. The rationale behind the sleep is that
-it relaxes the CPU (it is also found in EMBB).
-
-Average with sleep:
-
-Average without sleep:
-
-We clearly observe that the version without a sleep statement
-is faster, and we will thus exclude this statement in future
-experiments/measurements. This also makes sense, as our
-steal loop can fail even though there potentially is work
-(because of our lock-free deque implementation).
-
-#### Results Matrix
-
-We re-ran our benchmarks on the fork-join and native matrix
-multiplication implementations to see how those change without
-the backoff. We expect good results, as the matrix multiplication
-mostly has enough work to keep all threads busy, so workers spend
-less time spinning in the steal loop.
-
-Average Fork-Join Matrix:
-
-Average Native Matrix:
-
-The results are far better than the last ones, and indicate that
-removing the backoff can drastically improve performance.
-
-#### Conclusion
-
-We will exclude the backoff mechanism from further tests, as this
-seems to generally improve performance (or at least not harm it in
-the case of FFT).
-
-We also want to note that all these measurements are not very
-controlled/scientific, but simply run on our notebook for
-fast iterations over different, potential issues with our scheduler.
-
-### Commit 116cf4af - VTune Amplifier and MRSW top level lock
-
-When looking at why our code works quite well on problems with
-mostly busy workers and not so well on code with spinning/waiting
-workers (like in the FFT), we take a closer look at the FFT and
-matrix multiplication in VTune.
-
-FFT:
-
-Matrix:
-
-The sections highlighted in red represent parts of the code spent
-on spinning in the work-stealing loop.
-We can see that as long as our workers are mainly busy/find work
-in the stealing loop, the overhead spent on spinning is minimal.
-We can also see that in the FFT considerable amounts of time are
-spent spinning.
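For orientation, this is roughly the shape of a work-stealing loop with the exponential backoff discussed in this commit (spin with a relaxed CPU first, then back off harder once the budget is exhausted). The `try_steal` deque interface and all names are assumptions for illustration, not the actual pls code:

```c++
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

// Illustrative steal loop: scan all victim deques, and on failure relax the
// CPU/memory bus with an exponentially growing backoff before retrying.
template <typename Deque, typename Task>
bool steal_with_backoff(std::vector<Deque*>& victims, Task& task,
                        const std::atomic<bool>& shutdown) {
  unsigned backoff = 1;
  while (!shutdown.load(std::memory_order_relaxed)) {
    for (Deque* victim : victims) {
      if (victim->try_steal(task)) {
        return true;  // found work, leave the loop immediately
      }
    }
    if (backoff < 1024) {
      // Relax instead of hammering the shared deques again right away.
      for (unsigned i = 0; i < backoff; i++) {
        std::this_thread::yield();
      }
      backoff *= 2;
    } else {
      // After too many failed rounds, back off harder by sleeping.
      std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
  }
  return false;
}
```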
-
-A general observation is the high CPI rate of our spinning code.
-This makes sense, as our locks work on shared atomic variables,
-thus leading to cache misses.
-
-### Commit 116cf4af - 2D Heat Diffusion
-
-As a last test of our current state of performance we implemented the
-2D heat diffusion benchmark using our framework (using a fork-join based
-parallel_for, 512 heat array size):
-
-We observe solid performance from our implementation.
-(Again, not a very scientific test environment, but good enough for
-our general direction.)
-
-### Commit 3bdaba42 - Move to pure fork-join tasks (remove two level)
-
-We moved away from our two-level scheduler approach towards a
-pure fork-join task model (in order to remove any locks in the
-code more easily and to make further tests simpler/more focused
-on one specific aspect).
-These are the measurements made after the change
-(without any performance optimizations done):
-
-FFT Average:
-
-Heat Diffusion Average:
-
-Matrix Multiplication Average:
-
-Unbalanced Tree Search Average:
-
-We note that in heat diffusion, matrix multiplication and unbalanced
-tree search - all three benchmarks with mostly enough work available at
-all times - our implementation performs head to head with Intel's
-TBB. Only the FFT benchmark is a major problem for our library.
-We notice a MAJOR drop in performance exactly at the hyperthreading
-mark, indicating problems with limited resources shared between the
-spinning threads (threads without any actual work) and the threads
-actually performing work. Most likely there is a shared resource on the
-same cache line that hinders the working threads, but we can not really
-figure out which one it is.
-
-### Commit be2cdbfe - Locking Deque
-
-Switching to a locking deque has not improved performance (it even
-slightly hurt it); we therefore think that the deque itself is not the
-part slowing down our execution.
-
-### Commit 5044f0a1 - Performance Bottleneck in FFT FIXED
-
-By moving from directly calling one of the parallel invocations
-
-```c++
-scheduler::spawn_child(sub_task_2);
-function1(); // Execute first function 'inline' without spawning a sub_task object
-```
-
-to spawning two tasks
-
-```c++
-scheduler::spawn_child(sub_task_2);
-scheduler::spawn_child(sub_task_1);
-```
-
-we were able to fix the bad performance of our framework in the
-FFT benchmark (where there is a lot of spinning/idling of some
-worker threads).
-
-We think this is due to some sort of cache misses/bus contention
-on the finishing counters. This would make sense, as the drop
-at the hyperthreading mark indicates problems with this part of the
-CPU pipeline (although it did not show clearly in our profiling runs).
-We will now try to find the exact spot where the problem originates and
-fix the source rather than 'circumventing' it with these extra tasks.
-(This, again, should hopefully even boost the performance of all other
-workloads, as contention on the bus/cache is always bad.)
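To make the cache-line angle concrete, this is a minimal sketch of what padding such a shared finishing/reference counter onto its own cache line looks like (the 'cache aligning' experiment described next). The struct name and the 64-byte line size are assumptions for illustration, not the actual pls code:

```c++
#include <atomic>

// Force the shared finishing counter onto its own cache line so that
// threads spinning on it do not share a line with other task data.
// alignas(64) also rounds sizeof up to a multiple of the alignment,
// so neighbouring objects cannot land on the same line either.
struct alignas(64) aligned_finishing_counter {
  std::atomic<int> remaining_children{0};
};

static_assert(sizeof(aligned_finishing_counter) % 64 == 0,
              "counter occupies whole cache lines");
```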
-
-After some research we think that the issue is down to many threads
-referencing the same atomic reference counter. We think so because
-even cache aligning the shared reference count does not fix the issue
-when using the direct function call. Also, forcing a new method call
-(going down one function call in the call stack) does not solve the
-issue (thus making sure that it is not related to some caching issue
-in the call itself).
-
-In conclusion there seems to be a hyperthreading issue with this
-shared reference count. We keep this in mind in case we eventually get
-tasks with changing data members (as this problem could reappear there,
-as then the ref_count actually is in the same memory region as our
-'user variables'). For now we leave the code as it is.
-
-FFT Average with new call method:
-
-The performance of our new call method looks strikingly similar
-to TBB, trailing it by a slight, constant margin.
-This makes sense, as the basic principles (a lock-free, classic work
-stealing deque and the parallel call structure) are nearly the same.
-
-We will see if minor optimizations can close even this last gap.
-Overall the performance at this point is good enough to move on
-to implementing more functionality and to running tests on different
-queues/stealing tactics etc.
diff --git a/PERFORMANCE-v2.md b/PERFORMANCE-v2.md
deleted file mode 100644
index b8a27df..0000000
--- a/PERFORMANCE-v2.md
+++ /dev/null
@@ -1,22 +0,0 @@
-# Notes on performance measures during development
-
-#### Commit e34ea267 - 05.12.2019 - First Version of new Algorithm - Scaling Problems
-
-The first version of our memory trading work stealing algorithm works. It still shows scaling issues over
-the hyperthreading mark, very similar to what we have seen in version 1. This indicates some sort of
-contention between the threads when running the FFT algorithm.
-
-Analyzing the current version we find an issue with the frequent call to `thread_state_for(id)` in
-the stealing loop.
-
-![](./media/e34ea267_thread_state_for.png)
-
-It is obvious that the method takes some amount of runtime, as FFT has a structure that tends to only
-work on the continuations at the end of the computation (the critical path of FFT can only be executed
-after most parallel tasks are done).
-
-![](./media/e34ea267_fft_execution_pattern.png)
-
-What we can see here is the long tail of continuations running at the end of the computation. During
-this time the non-working threads constantly steal, thus repeatedly calling the `thread_state_for(id)`
-virtual method, potentially hindering other threads from doing their work properly.
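To illustrate the hot path described above, here is a rough reconstruction of a single steal attempt. Only the `thread_state_for(id)` name is taken from the text; the surrounding types, the random victim selection, and the counter standing in for the deque are assumptions for illustration:

```c++
#include <atomic>
#include <cstddef>
#include <random>

// Stand-in for the per-thread state holding that thread's deque.
struct thread_state {
  std::atomic<int> queued_tasks{0};
};

class scheduler_base {
 public:
  virtual ~scheduler_base() = default;
  // The virtual lookup paid on every steal attempt by every idle thread.
  virtual thread_state &thread_state_for(std::size_t id) = 0;
};

// During the long tail of continuations all idle threads loop over attempts
// like this one, going through the virtual call even when the chosen victim
// has nothing to steal.
bool try_steal_once(scheduler_base &scheduler, std::size_t num_threads,
                    std::mt19937 &rng) {
  std::uniform_int_distribution<std::size_t> pick_victim(0, num_threads - 1);
  thread_state &victim = scheduler.thread_state_for(pick_victim(rng));

  int expected = victim.queued_tasks.load(std::memory_order_relaxed);
  return expected > 0 &&
         victim.queued_tasks.compare_exchange_strong(expected, expected - 1);
}
```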
diff --git a/media/116cf4af_fft_average_no_sleep.png b/media/116cf4af_fft_average_no_sleep.png
deleted file mode 100644
index d40844e..0000000
Binary files a/media/116cf4af_fft_average_no_sleep.png and /dev/null differ
diff --git a/media/116cf4af_fft_average_sleep.png b/media/116cf4af_fft_average_sleep.png
deleted file mode 100644
index 94519db..0000000
Binary files a/media/116cf4af_fft_average_sleep.png and /dev/null differ
diff --git a/media/116cf4af_fft_vtune.png b/media/116cf4af_fft_vtune.png
deleted file mode 100644
index 01008cb..0000000
Binary files a/media/116cf4af_fft_vtune.png and /dev/null differ
diff --git a/media/116cf4af_heat_average.png b/media/116cf4af_heat_average.png
deleted file mode 100644
index 24f159b..0000000
Binary files a/media/116cf4af_heat_average.png and /dev/null differ
diff --git a/media/116cf4af_matrix_average_fork.png b/media/116cf4af_matrix_average_fork.png
deleted file mode 100644
index 814480d..0000000
Binary files a/media/116cf4af_matrix_average_fork.png and /dev/null differ
diff --git a/media/116cf4af_matrix_average_native.png b/media/116cf4af_matrix_average_native.png
deleted file mode 100644
index 760c5fe..0000000
Binary files a/media/116cf4af_matrix_average_native.png and /dev/null differ
diff --git a/media/116cf4af_matrix_vtune.png b/media/116cf4af_matrix_vtune.png
deleted file mode 100644
index 8182121..0000000
Binary files a/media/116cf4af_matrix_vtune.png and /dev/null differ
diff --git a/media/18b2d744_fft_average.png b/media/18b2d744_fft_average.png
deleted file mode 100644
index d8a6017..0000000
Binary files a/media/18b2d744_fft_average.png and /dev/null differ
diff --git a/media/18b2d744_unbalanced_average.png b/media/18b2d744_unbalanced_average.png
deleted file mode 100644
index 2a2fded..0000000
Binary files a/media/18b2d744_unbalanced_average.png and /dev/null differ
diff --git a/media/2ad04ce5_fft_boxplot_isolated.png b/media/2ad04ce5_fft_boxplot_isolated.png
deleted file mode 100644
index 53a10e6..0000000
Binary files a/media/2ad04ce5_fft_boxplot_isolated.png and /dev/null differ
diff --git a/media/2ad04ce5_fft_boxplot_multiprogrammed.png b/media/2ad04ce5_fft_boxplot_multiprogrammed.png
deleted file mode 100644
index 3b62236..0000000
Binary files a/media/2ad04ce5_fft_boxplot_multiprogrammed.png and /dev/null differ
diff --git a/media/2ad04ce5_matrix_boxplot_isolated.png b/media/2ad04ce5_matrix_boxplot_isolated.png
deleted file mode 100644
index 5ea9653..0000000
Binary files a/media/2ad04ce5_matrix_boxplot_isolated.png and /dev/null differ
diff --git a/media/2ad04ce5_matrix_boxplot_multiprogrammed.png b/media/2ad04ce5_matrix_boxplot_multiprogrammed.png
deleted file mode 100644
index 9552893..0000000
Binary files a/media/2ad04ce5_matrix_boxplot_multiprogrammed.png and /dev/null differ
diff --git a/media/3bdaba42_fft_average.png b/media/3bdaba42_fft_average.png
deleted file mode 100644
index b558191..0000000
Binary files a/media/3bdaba42_fft_average.png and /dev/null differ
diff --git a/media/3bdaba42_heat_average.png b/media/3bdaba42_heat_average.png
deleted file mode 100644
index edb98d2..0000000
Binary files a/media/3bdaba42_heat_average.png and /dev/null differ
diff --git a/media/3bdaba42_matrix_average.png b/media/3bdaba42_matrix_average.png
deleted file mode 100644
index 9538c36..0000000
Binary files a/media/3bdaba42_matrix_average.png and /dev/null differ
diff --git a/media/3bdaba42_unbalanced_average.png b/media/3bdaba42_unbalanced_average.png
deleted file mode 100644
index ee1415b..0000000
Binary files a/media/3bdaba42_unbalanced_average.png and /dev/null differ
diff --git a/media/5044f0a1_fft_average.png b/media/5044f0a1_fft_average.png
deleted file mode 100644
index 73736b2..0000000
Binary files a/media/5044f0a1_fft_average.png and /dev/null differ
diff --git a/media/7874c2a2_pipeline_speedup.png b/media/7874c2a2_pipeline_speedup.png
deleted file mode 100644
index 8c0d38f..0000000
Binary files a/media/7874c2a2_pipeline_speedup.png and /dev/null differ
diff --git a/media/aa27064_fft_average.png b/media/aa27064_fft_average.png
deleted file mode 100644
index 74229a0..0000000
Binary files a/media/aa27064_fft_average.png and /dev/null differ
diff --git a/media/afd0331b_matrix_average_case_fork.png b/media/afd0331b_matrix_average_case_fork.png
deleted file mode 100644
index 2cd6eb5..0000000
Binary files a/media/afd0331b_matrix_average_case_fork.png and /dev/null differ
diff --git a/media/afd0331b_matrix_average_case_native.png b/media/afd0331b_matrix_average_case_native.png
deleted file mode 100644
index 68ca0e5..0000000
Binary files a/media/afd0331b_matrix_average_case_native.png and /dev/null differ
diff --git a/media/afd0331b_matrix_best_case_fork.png b/media/afd0331b_matrix_best_case_fork.png
deleted file mode 100644
index db4c9cf..0000000
Binary files a/media/afd0331b_matrix_best_case_fork.png and /dev/null differ
diff --git a/media/afd0331b_matrix_best_case_native.png b/media/afd0331b_matrix_best_case_native.png
deleted file mode 100644
index 22ece98..0000000
Binary files a/media/afd0331b_matrix_best_case_native.png and /dev/null differ
diff --git a/media/b9bb90a4-banana-pi-average-case.png b/media/b9bb90a4-banana-pi-average-case.png
deleted file mode 100644
index 0d414cb..0000000
Binary files a/media/b9bb90a4-banana-pi-average-case.png and /dev/null differ
diff --git a/media/b9bb90a4-banana-pi-best-case.png b/media/b9bb90a4-banana-pi-best-case.png
deleted file mode 100644
index b090449..0000000
Binary files a/media/b9bb90a4-banana-pi-best-case.png and /dev/null differ
diff --git a/media/b9bb90a4-laptop-average-case.png b/media/b9bb90a4-laptop-average-case.png
deleted file mode 100644
index 8153ac1..0000000
Binary files a/media/b9bb90a4-laptop-average-case.png and /dev/null differ
diff --git a/media/b9bb90a4-laptop-best-case.png b/media/b9bb90a4-laptop-best-case.png
deleted file mode 100644
index 57bb039..0000000
Binary files a/media/b9bb90a4-laptop-best-case.png and /dev/null differ
diff --git a/media/cf056856_fft_average.png b/media/cf056856_fft_average.png
deleted file mode 100644
index ec55027..0000000
Binary files a/media/cf056856_fft_average.png and /dev/null differ
diff --git a/media/cf056856_unbalanced_average.png b/media/cf056856_unbalanced_average.png
deleted file mode 100644
index 75d2829..0000000
Binary files a/media/cf056856_unbalanced_average.png and /dev/null differ
diff --git a/media/d16ad3e_fft_average.png b/media/d16ad3e_fft_average.png
deleted file mode 100644
index a29018b..0000000
Binary files a/media/d16ad3e_fft_average.png and /dev/null differ
diff --git a/media/e34ea267_fft_execution_pattern.png b/media/e34ea267_fft_execution_pattern.png
deleted file mode 100644
index 108ab8d..0000000
Binary files a/media/e34ea267_fft_execution_pattern.png and /dev/null differ
diff --git a/media/e34ea267_thread_state_for.png b/media/e34ea267_thread_state_for.png
deleted file mode 100644
index 8431bfc..0000000
Binary files a/media/e34ea267_thread_state_for.png and /dev/null differ