diff --git a/BANANAPI.md b/BANANAPI.md
index 588fef1..62962b5 100644
--- a/BANANAPI.md
+++ b/BANANAPI.md
@@ -12,7 +12,7 @@ bananaPI m3. For this we generally
[follow the instructions given](https://docs.armbian.com/Developer-Guide_Build-Preparation/),
below are notes on what to do to get the rt kernel patch into it and to build.
-You can also use [our pre-build image](https://drive.google.com/open?id=1RiHymBO_XjOk5tMAL31iOSJGfncrWFQh)
+You can also use our pre-built image [on Google Drive](https://drive.google.com/open?id=1RiHymBO_XjOk5tMAL31iOSJGfncrWFQh)
and skip the build process below. Just use etcher (https://www.balena.io/etcher/) or similar,
flash an sd card and the PI should boot up. Default login is root/1234, follow the instructions,
then continue with the isolating system setup steps for more accurate measurements.
diff --git a/PERFORMANCE-v1.md b/PERFORMANCE-v1.md
deleted file mode 100644
index 7112439..0000000
--- a/PERFORMANCE-v1.md
+++ /dev/null
@@ -1,384 +0,0 @@
-# Notes on performance measures during development
-
-#### Commit 52fcb51f - Add basic random stealing
-
-Slight improvement; this needs further measurement once the more important bottlenecks are removed.
-Below are three individual measurements of the difference.
-Overall the trend (the change in the summed runtimes, i.e. the last column)
-goes down (98.7 %, 96.9 % and 100.6 %), but with one of the measurements
-above 100 % we consider the improvement minor.
-
-| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | sum |
-| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-old | 1659.01 us| 967.19 us| 830.08 us| 682.69 us| 737.71 us| 747.92 us| 749.37 us| 829.75 us| 7203.73 us
-new | 1676.06 us| 981.56 us| 814.71 us| 698.72 us| 680.87 us| 737.68 us| 756.91 us| 764.71 us| 7111.22 us
-change | 101.03 %| 101.49 %| 98.15 %| 102.35 %| 92.30 %| 98.63 %| 101.01 %| 92.16 %| 98.72 %
-
-| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | sum |
-| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-old | 1648.65 us| 973.33 us| 820.18 us| 678.80 us| 746.21 us| 767.63 us| 747.17 us| 1025.35 us| 7407.32 us
-new | 1655.09 us| 964.99 us| 807.57 us| 731.34 us| 747.47 us| 714.71 us| 794.35 us| 760.28 us| 7175.80 us
-change | 100.39 %| 99.14 %| 98.46 %| 107.74 %| 100.17 %| 93.11 %| 106.31 %| 74.15 %| 96.87 %
-
-| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | sum |
-| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-old | 1654.26 us| 969.12 us| 832.13 us| 680.69 us| 718.70 us| 750.80 us| 744.12 us| 775.24 us| 7125.07 us
-new | 1637.04 us| 978.09 us| 799.93 us| 709.33 us| 746.42 us| 684.87 us| 822.30 us| 787.61 us| 7165.59 us
-change | 98.96 %| 100.93 %| 96.13 %| 104.21 %| 103.86 %| 91.22 %| 110.51 %| 101.60 %| 100.57 %
-
-#### Commit 3535cbd8 - Cache Align scheduler_memory
-
-Big improvement of about 6% in our test. This may seem like little,
-but 6% from the scheduler is a lot, as the 'main work' is the tasks
-themselves, not the scheduler.
-
-This change unsurprisingly yields the biggest improvement yet.
-
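-A rough sketch of the idea behind this change (illustrative names and layout, not the actual `scheduler_memory` structure):
-
-```c++
-#include <atomic>
-#include <cstddef>
-
-// Assumed cache line size; the real value is platform dependent.
-constexpr std::size_t CACHE_LINE_SIZE = 64;
-
-// Aligning (and thereby padding) each worker's state to its own cache
-// line prevents false sharing: a write by one worker no longer
-// invalidates the cache line holding another worker's state.
-struct alignas(CACHE_LINE_SIZE) aligned_thread_state {
-  std::atomic<unsigned> deque_head{0};
-  std::atomic<unsigned> deque_tail{0};
-};
-
-// One entry per worker thread, each starting on a fresh cache line.
-static aligned_thread_state thread_states[8];
-```
-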
-#### Commit b9bb90a4 - Try to figure out the 'high thread bottleneck'
-
-We are currently seeing good performance on low core counts
-(up to 1/2 of the machine's cores), but after that the performance
-plummets:
-
-Banana-Pi Best-Case:
-
-
-
-Banana-Pi Average-Case:
-
-
-
-Laptop Best-Case:
-
-
-
-Laptop Average-Case:
-
-
-
-
-As we can see, on average the performance of PLS starts getting
-way worse than TBB and EMBB after 4 cores. We suspect this is due
-to contention, but could not resolve it with any combination
-of `tas_spinlock` vs `ttas_spinlock` and `lock` vs `try_lock`.
-
-This issue clearly needs further investigation.
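-
-For reference, the difference between the two lock variants we experimented with comes down to the following test-and-test-and-set idea (a rough sketch, not our exact `ttas_spinlock` implementation):
-
-```c++
-#include <atomic>
-
-// Test-and-test-and-set: spin on a plain load and only attempt the
-// expensive atomic exchange once the lock looks free. This keeps the
-// flag's cache line in a shared state while waiting, reducing bus
-// traffic compared to a plain test-and-set (tas) spinlock.
-class ttas_spinlock {
-  std::atomic<bool> locked_{false};
-
- public:
-  void lock() {
-    while (true) {
-      // Spin on a cheap load until the lock looks free (the 'test' part)...
-      while (locked_.load(std::memory_order_relaxed)) {
-        // ...optionally issuing a CPU pause hint here.
-      }
-      // ...then try the actual atomic test-and-set.
-      if (!locked_.exchange(true, std::memory_order_acquire)) {
-        return;
-      }
-    }
-  }
-
-  bool try_lock() {
-    return !locked_.load(std::memory_order_relaxed) &&
-           !locked_.exchange(true, std::memory_order_acquire);
-  }
-
-  void unlock() { locked_.store(false, std::memory_order_release); }
-};
-```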
-
-### Commit aa27064 - Performance with ttas spinlocks (and 'full blocking' top level)
-
-
-
-### Commit d16ad3e - Performance with rw-lock and backoff
-
-
-
-### Commit 18b2d744 - Performance with lock-free deque
-
-After much tinkering we still have performance problems with higher
-thread counts in the FFT benchmark. From 4 to 5 threads upwards the
-performance gains start to saturate (before removing the top level
-locks we even saw a slight drop in performance).
-
-Currently the FFT benchmark shows the following results (average):
-
-
-
-We want to positively note that the overall trend of 'performance drops'
-at the hyperthreading mark is not really bad anymore; rather, it now
-seems similar to EMBB (with backoff + lock-free deque + top level
-readers-writer lock). This comes partly because the spike at 4 threads
-is lower (less performance at 4 threads). We also see better times
-on the multiprogrammed system with the lock-free deque.
-
-This is discouraging after many tests. To see where the overhead lies
-we also implemented the unbalanced tree search benchmark,
-resulting in the following, surprisingly good, results (average):
-
-
-
-The main difference between the two benchmarks is that the second
-one has more work and the work is relatively independent.
-Additionally, the first one uses our high level API (parallel invoke),
-while the second one uses our low level API.
-It is worth investigating whether the high level API or the structure
-of the memory accesses in FFT is the problem.
-
-### Commit cf056856 - Remove two-level scheduler
-
-In this test we replace the two-level scheduler with ONLY fork_join
-tasks. This removes the top level steal overhead and performs only
-internal stealing. For this we set the fork_join task as the only
-possible task type, removed the top level rw-lock and the digging
-down to our level, and solely use internal stealing.
-
-Average results FFT:
-
-
-
-Average results Unbalanced:
-
-
-
-There seems to be only a minor performance difference between the two,
-suggesting that our two-level approach is not the part causing our
-weaker performance.
-
-### Commit afd0331b - Some notes on scaling problems
-
-After tweaking individual values and parameters we still cannot find
-the main cause for our slowdown on multiple processors.
-We also use Intel's VTune Amplifier to measure performance on our runs
-and find that we always spend way too much time 'waiting for work',
-e.g. in the backoff mechanism when enabled or in the locks for stealing
-work when backoff is disabled. This leads us to believe that our problems
-might be connected to some issue with work distribution in the FFT case,
-as the unbalanced tree search (with a lot of 'local' work) performs well.
-
-To gather more data we add matrix multiplication benchmarks implemented
-in two fashions: once with a 'native' array stealing task and once with
-a fork-join task. Both implementations use the same minimum array
-sub-size of 4 elements, so we can hopefully see if they show any
-performance differences.
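-
-A rough sketch of how the fork-join variant might split the work (plain `std::thread` stands in for a stealable sub-task here; the names and the exact splitting in our benchmark may differ):
-
-```c++
-#include <functional>
-#include <thread>
-#include <vector>
-
-constexpr int MIN_BLOCK = 4;  // minimum sub-size used in the benchmark
-
-// Compute rows [row_begin, row_end) of C = A * B for n x n matrices
-// stored in row-major order.
-void multiply_rows(const std::vector<double>& A, const std::vector<double>& B,
-                   std::vector<double>& C, int n, int row_begin, int row_end) {
-  if (row_end - row_begin <= MIN_BLOCK) {
-    for (int i = row_begin; i < row_end; i++) {
-      for (int j = 0; j < n; j++) {
-        double sum = 0.0;
-        for (int k = 0; k < n; k++) sum += A[i * n + k] * B[k * n + j];
-        C[i * n + j] = sum;
-      }
-    }
-    return;
-  }
-  int mid = row_begin + (row_end - row_begin) / 2;
-  // Fork: in the real benchmark the first half is spawned as a stealable task.
-  std::thread upper(multiply_rows, std::cref(A), std::cref(B), std::ref(C),
-                    n, row_begin, mid);
-  multiply_rows(A, B, C, n, mid, row_end);  // work on the second half ourselves
-  upper.join();                             // join before returning to the caller
-}
-```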
-
-Best case fork-join:
-
-
-
-Average case fork-join:
-
-
-
-Best case Native:
-
-
-
-Average case Native:
-
-
-
-What we find very interesting is that the best case times of our
-pls library are very fast (as good as TBB), but the average times
-drop badly. We currently do not know why this is the case.
-
-### Commit afd0331b - Intel VTune Amplifier
-
-We did several measurements with Intel's VTune Amplifier profiling
-tool. The main thing we notice is that the cycles per instruction (CPI)
-for our useful work blocks increase, thus requiring more CPU time
-for the actual useful work.
-
-We also measured an implementation using TBB and found no significant
-difference, e.g. TBB also has a higher CPI with 8 threads.
-Our conclusion after this long hunting for performance is, that we
-might just be bound by some general performance issues with our code.
-The next step will therefore be to read the other frameworks and our
-code carefully, trying to find potential issues.
-
-### Commit 116cf4af - Removing Exponential Backoff
-
-In the steal loop we first had a backoff mechanism as often seen in
-locks (spin with a relaxed CPU, then sleep/yield after too many backoffs).
-The rationale behind this is to relax the memory bus by not busily
-working on atomic variables. We introduced it with the fear that
-keeping the CPU busy with spinning would degrade the performance of the
-other working threads. However, the above examination with Intel VTune
-showed that this seems to not be the main problem of our implementation
-(TBB shows the same CPI increases with more threads, our implementation
-seems fine in this regard).
-
-To further reduce elements that could cause performance problems, we
-therefore decided to perform one more measurement without this backoff.
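-
-The removed mechanism looked roughly like the following sketch (illustrative thresholds and sleep time, not our exact implementation):
-
-```c++
-#include <chrono>
-#include <thread>
-
-// Stand-in for a CPU relax hint (e.g. _mm_pause() on x86); left as a
-// no-op in this sketch.
-inline void cpu_relax() {}
-
-// Exponential backoff as used in the steal loop: spin with relax hints
-// for an exponentially growing number of iterations, then yield/sleep
-// once too many consecutive steal attempts have failed.
-class backoff {
-  unsigned failed_attempts_ = 0;
-  static constexpr unsigned SLEEP_THRESHOLD = 8;
-
- public:
-  void do_backoff() {
-    failed_attempts_++;
-    if (failed_attempts_ < SLEEP_THRESHOLD) {
-      for (unsigned i = 0; i < (1u << failed_attempts_); i++) {
-        cpu_relax();
-      }
-    } else {
-      std::this_thread::sleep_for(std::chrono::microseconds(100));
-    }
-  }
-
-  void reset() { failed_attempts_ = 0; }
-};
-```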
-
-#### Results of FFT
-
-The first measurement is on the FFT. Here we tested two variants:
-one with a 'yield/sleep' statement after a worker thread has failed
-to steal any work from every other thread on its first round, and
-one without this sleep. The rationale behind the sleep is that
-it relaxes the CPU (a similar mechanism is also found in EMBB).
-
-Average with sleep:
-
-
-
-
-Average without sleep:
-
-
-
-
-We clearly observe that the version without a sleep statement
-is faster; we will therefore exclude this statement in future
-experiments/measurements. This also makes sense, as our
-steal loop can fail even though there potentially is work
-(because of our lock-free deque implementation).
-
-#### Results Matrix
-
-We re-ran our benchmarks on the fork-join and native matrix
-multiplication implementations to see how those change without
-the backoff. We expect good results, as the matrix multiplication
-mostly has enough work to keep all threads busy, so workers
-spend less time spinning in the steal loop.
-
-Average Fork-Join Matrix:
-
-
-
-
-Average Native Matrix:
-
-
-
-The results are far better than the last ones, and indicate that
-removing the backoff can drastically improve performance.
-
-#### Conclusion
-
-We will exclude the backoff mechanism for further tests, as this
-seems to generally improve performance (or, in the case of FFT,
-at least not harm it).
-
-We also want to note that all these measurements are not very
-controlled/scientific, but simply run on our notebook for
-fast iteration over different potential issues with our scheduler.
-
-
-### Commit 116cf4af - VTune Amplifier and MRSW top level lock
-
-When looking at why our code works quite well on problems with
-mostly busy workers and not so well on code with spinning/waiting
-workers (like in the FFT), we take a closer look at the FFT and
-matrix multiplication in VTune.
-
-FFT:
-
-
-
-Matrix:
-
-
-
-The sections highlighted in red represent the parts of the code where
-time is spent spinning in the work-stealing loop.
-We can see that as long as our workers are mainly busy/find work
-in the stealing loop, the overhead spent on spinning is minimal.
-We can also see that in the FFT considerable amounts of time are
-spent spinning.
-
-A general observation is the high CPI rate of our spinning code.
-This makes sense, as the spinning currently works on locks whose
-atomic variables are shared between threads, leading to cache misses.
-
-### Commit 116cf4af - 2D Heat Diffusion
-
-As a last test of our current performance we implemented the
-2D heat diffusion benchmark using our framework (using fork-join based
-parallel_for, 512 heat array size):
-
-
-
-We observe solid performance from our implementation.
-(Again, not a very scientific test environment, but good enough for
-judging our general direction.)
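-
-One plausible shape of the benchmark kernel (we parallelize over grid rows here purely for illustration; the `parallel_for_rows` parameter is a stand-in, not the actual pls parallel_for signature):
-
-```c++
-#include <vector>
-
-// One update step of the 2D heat diffusion on a size x size grid
-// (row-major storage). The outer loop over inner rows is what gets
-// handed to the fork-join based parallel_for in the benchmark.
-template <typename ParallelForRows>
-void heat_step(const std::vector<double>& in, std::vector<double>& out,
-               int size, ParallelForRows parallel_for_rows) {
-  parallel_for_rows(1, size - 1, [&](int i) {
-    for (int j = 1; j < size - 1; j++) {
-      out[i * size + j] =
-          0.25 * (in[(i - 1) * size + j] + in[(i + 1) * size + j] +
-                  in[i * size + j - 1] + in[i * size + j + 1]);
-    }
-  });
-}
-```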
-
-### Commit 3bdaba42 - Move to pure fork-join tasks (remove two level)
-
-We moved away from our two-level scheduler approach towards a
-pure fork-join task model (in order to remove any locks in the
-code more easily and to make further tests simpler/more focused
-on one specific aspect).
-These are the measurements made after the change
-(without any performance optimizations done):
-
-FFT Average:
-
-
-
-Heat Diffusion Average:
-
-
-
-Matrix Multiplication Average:
-
-
-
-Unbalanced Tree Search Average:
-
-
-
-
-We note that in heat diffusion, matrix multiplication and unbalanced
-tree search - all three benchmarks with mostly enough work available at
-all times - our implementation performs head to head with Intel's
-TBB. Only the FFT benchmark remains a major problem for our library.
-We notice a MAJOR drop in performance exactly at the hyperthreading
-mark, indicating contention for limited resources between the spinning
-threads (threads without any actual work) and the threads actually
-performing work. Most likely some shared resource on the same cache
-line hinders the working threads, but we cannot really
-figure out which one it is.
-
-### Commit be2cdbfe - Locking Deque
-
-Switching to a locking deque has not improved (or has even slightly hurt)
-performance; we therefore think that the deque itself is not the
-part slowing down our execution.
-
-### Commit 5044f0a1 - Performance Bottleneck in FFT FIXED
-
-By moving from directly calling one of the parallel invocations
-
-```c++
-scheduler::spawn_child(sub_task_2);
-function1(); // Execute first function 'inline' without spawning a sub_task object
-```
-
-to spawning two tasks
-```c++
-scheduler::spawn_child(sub_task_2);
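-// Spawn the first function as a sub_task as well, instead of executing it 'inline'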
-scheduler::spawn_child(sub_task_1);
-```
-
-we were able to fix the bad performance of our framework in the
-FFT benchmark (where there is a lot of spinning/idling of some
-worker threads).
-
-We think this is due to some sort of cache misses/bus contention
-on the finishing counters. This would make sense, as the drop
-at the hyperthreading mark indicates problems with this part of the
-CPU pipeline (although it did not show clearly in our profiling runs).
-We will now try to find the exact spot where the problem originates and
-fix the source rather than 'circumventing' it with these extra tasks.
-(This, in turn, should hopefully even boost the performance of all other
-workloads, as contention on the bus/cache is always bad.)
-
-
-After some research we think that the issue is down to many threads
-referencing the same atomic reference counter. We think so because
-even cache aligning the shared reference count does not fix the issue
-when using the direct function call. Also, forcing a new method call
-(going one function call down in the call stack) does not solve the
-issue (making sure that it is not related to some caching issue
-in the call itself).
-
-In conclusion there seems to be a hyperthreading issue with this
-shared reference count. We will keep this in mind in case we eventually get
-tasks with changing data members (as this problem could reappear there,
-because the ref_count would then be in the same memory region as our
-'user variables'). For now we leave the code as it is.
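-
-To illustrate what we mean by the shared reference count, here is a sketch of such a finishing counter (illustrative, not our exact data structure):
-
-```c++
-#include <atomic>
-
-// Finishing counter shared between a parent task and its children:
-// every spawned child increments it, every completed child decrements
-// it, and the parent may only continue once it reaches zero. With many
-// hyperthreads, all children hammer this single atomic, which is where
-// we suspect the contention comes from.
-struct finish_counter {
-  std::atomic<int> open_children{0};
-
-  void on_spawn() { open_children.fetch_add(1, std::memory_order_relaxed); }
-  void on_child_finished() { open_children.fetch_sub(1, std::memory_order_acq_rel); }
-  bool all_children_done() const {
-    return open_children.load(std::memory_order_acquire) == 0;
-  }
-};
-```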
-
-
-FFT Average with new call method:
-
-
-
-The performance of our new call method looks strikingly similar
-to TBB's, with a slight, constant performance gap behind it.
-This makes sense, as the basic principles (a lock-free, classic
-work-stealing deque and the parallel call structure) are nearly the same.
-
-We will see if minor optimizations can even close this last gap.
-Overall the performance at this point is good enough to move on
-to implementing more functionality and to running tests on different
-queues/stealing tactics etc.
diff --git a/PERFORMANCE-v2.md b/PERFORMANCE-v2.md
deleted file mode 100644
index b8a27df..0000000
--- a/PERFORMANCE-v2.md
+++ /dev/null
@@ -1,22 +0,0 @@
-# Notes on performance measures during development
-
-#### Commit e34ea267 - 05.12.2019 - First Version of new Algorithm - Scaling Problems
-
-The first version of our memory trading work stealing algorithm works. It still shows scaling issues over
-the hyperthreading mark, very similar to what we have seen in version 1. This indicates some sort of
-contention between the threads when running the FFT algorithm.
-
-Analyzing the current version, we find an issue with the frequent call to `thread_state_for(id)` in
-the stealing loop.
-
-![](./media/e34ea267_thread_state_for.png)
-
-It is obvious that the method takes up a noticeable amount of runtime, as FFT has a structure that tends to only
-work on the continuations at the end of the computation (the critical path of FFT can only be executed
-after most parallel tasks are done).
-
-![](./media/e34ea267_fft_execution_pattern.png)
-
-What we can see here is the long tail of continuations running at the end of the computation. During
-this time the non-working threads constantly steal, thus repeatedly invoking the virtual
-`thread_state_for(id)` method and potentially hindering other threads from doing their work properly.
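-
-In code, the pattern looks roughly like this (the types are illustrative stand-ins, not the actual pls classes):
-
-```c++
-#include <atomic>
-#include <cstdlib>
-#include <vector>
-
-// Hypothetical stand-ins for the scheduler's per-thread state and its lookup.
-struct thread_state {
-  std::atomic<int> queue_size{0};
-};
-
-struct scheduler_view {
-  std::vector<thread_state> states;
-  explicit scheduler_view(unsigned n) : states(n) {}
-  virtual thread_state* thread_state_for(unsigned id) { return &states[id]; }
-  virtual ~scheduler_view() = default;
-};
-
-// Shape of the stealing loop described above: while a thread finds no
-// work it keeps picking victims, and every attempt goes through the
-// virtual thread_state_for(id) lookup, which is what shows up in the
-// profile during the long continuation tail of the FFT.
-bool steal_once(scheduler_view& s, unsigned num_threads) {
-  unsigned victim = static_cast<unsigned>(std::rand()) % num_threads;
-  thread_state* state = s.thread_state_for(victim);  // virtual call per attempt
-  return state->queue_size.load(std::memory_order_relaxed) > 0;
-}
-```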
diff --git a/media/116cf4af_fft_average_no_sleep.png b/media/116cf4af_fft_average_no_sleep.png
deleted file mode 100644
index d40844e..0000000
Binary files a/media/116cf4af_fft_average_no_sleep.png and /dev/null differ
diff --git a/media/116cf4af_fft_average_sleep.png b/media/116cf4af_fft_average_sleep.png
deleted file mode 100644
index 94519db..0000000
Binary files a/media/116cf4af_fft_average_sleep.png and /dev/null differ
diff --git a/media/116cf4af_fft_vtune.png b/media/116cf4af_fft_vtune.png
deleted file mode 100644
index 01008cb..0000000
Binary files a/media/116cf4af_fft_vtune.png and /dev/null differ
diff --git a/media/116cf4af_heat_average.png b/media/116cf4af_heat_average.png
deleted file mode 100644
index 24f159b..0000000
Binary files a/media/116cf4af_heat_average.png and /dev/null differ
diff --git a/media/116cf4af_matrix_average_fork.png b/media/116cf4af_matrix_average_fork.png
deleted file mode 100644
index 814480d..0000000
Binary files a/media/116cf4af_matrix_average_fork.png and /dev/null differ
diff --git a/media/116cf4af_matrix_average_native.png b/media/116cf4af_matrix_average_native.png
deleted file mode 100644
index 760c5fe..0000000
Binary files a/media/116cf4af_matrix_average_native.png and /dev/null differ
diff --git a/media/116cf4af_matrix_vtune.png b/media/116cf4af_matrix_vtune.png
deleted file mode 100644
index 8182121..0000000
Binary files a/media/116cf4af_matrix_vtune.png and /dev/null differ
diff --git a/media/18b2d744_fft_average.png b/media/18b2d744_fft_average.png
deleted file mode 100644
index d8a6017..0000000
Binary files a/media/18b2d744_fft_average.png and /dev/null differ
diff --git a/media/18b2d744_unbalanced_average.png b/media/18b2d744_unbalanced_average.png
deleted file mode 100644
index 2a2fded..0000000
Binary files a/media/18b2d744_unbalanced_average.png and /dev/null differ
diff --git a/media/2ad04ce5_fft_boxplot_isolated.png b/media/2ad04ce5_fft_boxplot_isolated.png
deleted file mode 100644
index 53a10e6..0000000
Binary files a/media/2ad04ce5_fft_boxplot_isolated.png and /dev/null differ
diff --git a/media/2ad04ce5_fft_boxplot_multiprogrammed.png b/media/2ad04ce5_fft_boxplot_multiprogrammed.png
deleted file mode 100644
index 3b62236..0000000
Binary files a/media/2ad04ce5_fft_boxplot_multiprogrammed.png and /dev/null differ
diff --git a/media/2ad04ce5_matrix_boxplot_isolated.png b/media/2ad04ce5_matrix_boxplot_isolated.png
deleted file mode 100644
index 5ea9653..0000000
Binary files a/media/2ad04ce5_matrix_boxplot_isolated.png and /dev/null differ
diff --git a/media/2ad04ce5_matrix_boxplot_multiprogrammed.png b/media/2ad04ce5_matrix_boxplot_multiprogrammed.png
deleted file mode 100644
index 9552893..0000000
Binary files a/media/2ad04ce5_matrix_boxplot_multiprogrammed.png and /dev/null differ
diff --git a/media/3bdaba42_fft_average.png b/media/3bdaba42_fft_average.png
deleted file mode 100644
index b558191..0000000
Binary files a/media/3bdaba42_fft_average.png and /dev/null differ
diff --git a/media/3bdaba42_heat_average.png b/media/3bdaba42_heat_average.png
deleted file mode 100644
index edb98d2..0000000
Binary files a/media/3bdaba42_heat_average.png and /dev/null differ
diff --git a/media/3bdaba42_matrix_average.png b/media/3bdaba42_matrix_average.png
deleted file mode 100644
index 9538c36..0000000
Binary files a/media/3bdaba42_matrix_average.png and /dev/null differ
diff --git a/media/3bdaba42_unbalanced_average.png b/media/3bdaba42_unbalanced_average.png
deleted file mode 100644
index ee1415b..0000000
Binary files a/media/3bdaba42_unbalanced_average.png and /dev/null differ
diff --git a/media/5044f0a1_fft_average.png b/media/5044f0a1_fft_average.png
deleted file mode 100644
index 73736b2..0000000
Binary files a/media/5044f0a1_fft_average.png and /dev/null differ
diff --git a/media/7874c2a2_pipeline_speedup.png b/media/7874c2a2_pipeline_speedup.png
deleted file mode 100644
index 8c0d38f..0000000
Binary files a/media/7874c2a2_pipeline_speedup.png and /dev/null differ
diff --git a/media/aa27064_fft_average.png b/media/aa27064_fft_average.png
deleted file mode 100644
index 74229a0..0000000
Binary files a/media/aa27064_fft_average.png and /dev/null differ
diff --git a/media/afd0331b_matrix_average_case_fork.png b/media/afd0331b_matrix_average_case_fork.png
deleted file mode 100644
index 2cd6eb5..0000000
Binary files a/media/afd0331b_matrix_average_case_fork.png and /dev/null differ
diff --git a/media/afd0331b_matrix_average_case_native.png b/media/afd0331b_matrix_average_case_native.png
deleted file mode 100644
index 68ca0e5..0000000
Binary files a/media/afd0331b_matrix_average_case_native.png and /dev/null differ
diff --git a/media/afd0331b_matrix_best_case_fork.png b/media/afd0331b_matrix_best_case_fork.png
deleted file mode 100644
index db4c9cf..0000000
Binary files a/media/afd0331b_matrix_best_case_fork.png and /dev/null differ
diff --git a/media/afd0331b_matrix_best_case_native.png b/media/afd0331b_matrix_best_case_native.png
deleted file mode 100644
index 22ece98..0000000
Binary files a/media/afd0331b_matrix_best_case_native.png and /dev/null differ
diff --git a/media/b9bb90a4-banana-pi-average-case.png b/media/b9bb90a4-banana-pi-average-case.png
deleted file mode 100644
index 0d414cb..0000000
Binary files a/media/b9bb90a4-banana-pi-average-case.png and /dev/null differ
diff --git a/media/b9bb90a4-banana-pi-best-case.png b/media/b9bb90a4-banana-pi-best-case.png
deleted file mode 100644
index b090449..0000000
Binary files a/media/b9bb90a4-banana-pi-best-case.png and /dev/null differ
diff --git a/media/b9bb90a4-laptop-average-case.png b/media/b9bb90a4-laptop-average-case.png
deleted file mode 100644
index 8153ac1..0000000
Binary files a/media/b9bb90a4-laptop-average-case.png and /dev/null differ
diff --git a/media/b9bb90a4-laptop-best-case.png b/media/b9bb90a4-laptop-best-case.png
deleted file mode 100644
index 57bb039..0000000
Binary files a/media/b9bb90a4-laptop-best-case.png and /dev/null differ
diff --git a/media/cf056856_fft_average.png b/media/cf056856_fft_average.png
deleted file mode 100644
index ec55027..0000000
Binary files a/media/cf056856_fft_average.png and /dev/null differ
diff --git a/media/cf056856_unbalanced_average.png b/media/cf056856_unbalanced_average.png
deleted file mode 100644
index 75d2829..0000000
Binary files a/media/cf056856_unbalanced_average.png and /dev/null differ
diff --git a/media/d16ad3e_fft_average.png b/media/d16ad3e_fft_average.png
deleted file mode 100644
index a29018b..0000000
Binary files a/media/d16ad3e_fft_average.png and /dev/null differ
diff --git a/media/e34ea267_fft_execution_pattern.png b/media/e34ea267_fft_execution_pattern.png
deleted file mode 100644
index 108ab8d..0000000
Binary files a/media/e34ea267_fft_execution_pattern.png and /dev/null differ
diff --git a/media/e34ea267_thread_state_for.png b/media/e34ea267_thread_state_for.png
deleted file mode 100644
index 8431bfc..0000000
Binary files a/media/e34ea267_thread_state_for.png and /dev/null differ