# Notes on Continuation/Parent Stealing Implementation

The new version of pls uses a more complicated/less user friendly API in favor of performance and memory guarantees. For the old version refer to the second half of this document.

# 31.03.2020 - Test setup on BananaPI

We currently use a Banana Pi M3 single board computer for running our evaluations. The reason is that we aim for clean, isolated measurements on a system that introduces little jitter in CPU frequencies/performance and has multiple equivalent cores. It is also a rather low power board with a TDP of about 15W, satisfying our desire for a platform that mimics embedded devices. On top of that it is rather cheap at under $100.

The main concern is software. Options are the vendor image (kernel 3.4 with RT patch), an [arch image from the forum](http://forum.banana-pi.org/t/bananapi-bpi-m3-new-image-archlinux-4-18-1-1-arch-2018-08-19/6544) (kernel 4.19 with RT patch) or [armbian](https://www.armbian.com/bananapi-m3/) (kernel 5.4, NO RT patch). We tried all three. The vendor image is simply too old to be comparable to modern changes in Linux. We then compared the Arch RT kernel and the most recent Armbian kernel using [cyclictest](https://manpages.debian.org/jessie/rt-tests/cyclictest.8). We ran a normal SMP test using `cyclictest --smp -p95 -m` on both systems while stressing the processor with lower priority applications to 100% utilization. The RT Arch kernel had maximum latencies between 100us and 180us on different cores, the Armbian kernel had latencies between 550us and 730us on different cores. As our benchmarks are in the magnitude of a few milliseconds (e.g. the matrix with 3ms), we think it is important to choose an RT kernel, especially if we later look at applications running periodically. However, the Arch kernel is 2 years old, making it very hard to update its software components with Arch's rolling releases. Even with tinkering we could not get it to update the system without also touching the kernel and running into boot issues.

Following the above discussion we see that Armbian is easy to set up, but has no prebuilt RT image. Because of this, we decide to use the Armbian build script and add the RT patch to it, giving us the best of both worlds. Full setup documentation for the OS and how to further tune it for isolated performance can be found in a separate document: [BANANAPI.md](./BANANAPI.md)

# 24.03.2020 - mmap stack

We added different means of allocating stacks for our coroutines. The biggest difference is that we now support mmap'ed stacks with guard pages. This leads to segmentation faults (SIGSEGV) if the stack space is exhausted at runtime, which is better than undefined behaviour. The minimum page size is usually 4kB, which is plenty in release builds. Other solutions to the cactus stack problem (memory mapped cactus stacks) share this 'minimum resource per depth' limitation. They argue that they only need the physical memory in the worst case, whereas we need it in any execution schedule. As we aim for worst case bounds, this seems not to be a big issue. Using any smaller stack size requires giving up on the guard pages, but is possible with heap allocated stacks.

See [mmap](http://man7.org/linux/man-pages/man2/mmap.2.html), [mprotect](https://linux.die.net/man/2/mprotect) and [sysconf](http://man7.org/linux/man-pages/man3/sysconf.3.html) for implementation details. The general approach of mmap'ing the stack and then protecting a page above it should be straightforward.
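To make the approach concrete, here is a minimal sketch (hypothetical helper names, simplified error handling, not the actual pls allocator) that mmaps a stack and revokes access to the page at its low end, assuming a platform where stacks grow downwards:

```C++
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Hypothetical helper type, not part of pls.
struct guarded_stack {
  char *mapping;           // start of the whole mapping (the guard page)
  char *stack_top;         // highest usable address, the stack grows down from here
  std::size_t usable_size; // usable stack size in bytes
};

inline guarded_stack allocate_guarded_stack(std::size_t requested_size) {
  const auto page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
  // Round the usable size up to whole pages and add one extra guard page.
  const std::size_t usable = ((requested_size + page - 1) / page) * page;
  const std::size_t total = usable + page;

  void *mapping = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  // A real implementation must handle mapping == MAP_FAILED here.

  // Revoke all access to the lowest page: running into it raises SIGSEGV
  // instead of silently corrupting whatever lies next to the stack.
  mprotect(mapping, page, PROT_NONE);

  return guarded_stack{static_cast<char *>(mapping),
                       static_cast<char *>(mapping) + total, usable};
}

inline void free_guarded_stack(const guarded_stack &stack) {
  munmap(stack.mapping,
         stack.usable_size + static_cast<std::size_t>(sysconf(_SC_PAGESIZE)));
}
```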
# 24.03.2020 - Bug in tsan

While implementing the coroutines required for the current parent-stealing scheduler we found issues when running with thread sanitizer. At first we thought the issues were related to the way we handle context switching, but after some investigation we found that they were due to a memory leak in tsan. Specifically, tsan leaked memory mappings for each created/destroyed fiber, which led to [linux limits](https://stackoverflow.com/questions/22779556/linux-error-from-munmap) preventing the application from performing more mappings. This leads to the application hanging or crashing after some time. We addressed the issue in a [patch request](https://reviews.llvm.org/D76073) to the LLVM/Clang tool suite. When the patch is merged we need to set up our test runner to use the current version of Clang and add the information to the documentation.

# 18.03.2020 - C++17 benefit, overaligned new

We have many cache line aligned types. Before C++17 they could only be allocated with correct alignment using alignas(...) when they were stored in static storage (e.g. global memory, stack memory). As we move towards more modern RAII style resource management, the fact that [C++ 17 supports alignment of arbitrary length, even in vectors](https://www.bfilipek.com/2019/08/newnew-align.html) is extremely helpful. Before, we needed wrappers to do aligned heap allocations.
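As a small illustration of what this buys us (the counter type is made up, not one of our scheduler types): in C++17 both `std::make_unique` and `std::vector` go through the aligned overloads of `operator new`, so over-aligned objects can live on the heap without a custom wrapper.

```C++
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical example type: one counter per cache line to avoid false sharing.
struct alignas(64) cache_aligned_counter {
  std::uint64_t value{0};
};

int main() {
  // Both allocations use the C++17 aligned overloads of operator new,
  // so the 64 byte alignment is respected on the heap as well.
  auto single = std::make_unique<cache_aligned_counter>();
  std::vector<cache_aligned_counter> per_thread_counters(8);

  const bool aligned =
      reinterpret_cast<std::uintptr_t>(single.get()) % 64 == 0 &&
      reinterpret_cast<std::uintptr_t>(per_thread_counters.data()) % 64 == 0;
  return aligned ? 0 : 1;
}
```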
# 18.03.2020 - Coding Standard/C++ Standard

We previously stuck to strict 'static' allocation in global static storage or on the stack for most memory we need. However, we found this to be very restrictive, making our code overly complicated and not at all 'modern C++'. After some research we found that this restrictive style is common in very constrained embedded domains. However, we target multi and many core systems that run many applications on a shared operating system in a more traditional 'PC-like' manner.

The model best fitting this application style is the Adaptive AUTOSAR standard. It models the system as a POSIX compliant OS running specialized software for application deployment and management. Under this stack there is usually a normal OS scheduler, e.g. a Linux image tuned for low latencies. The target platforms for such systems are Intel and ARM processors with 2 to 16 cores, making them ideal for our framework. Following this logic, we decide to orient ourselves towards the [C++ 14 guidelines of AUTOSAR](https://www.autosar.org/fileadmin/user_upload/standards/adaptive/17-03/AUTOSAR_RS_CPP14Guidelines.pdf). Specifically, they encourage the use of dynamic memory allocation and modern C++, as long as some restrictions are upheld. The memory usage of an app must be bounded and best practice is to allocate resources 'outside of real-time sections, e.g. startup', making it, again, a good fit for our concept. We do not plan to follow all these rules, but we see them as an example for where our research can be applied later on. As the AUTOSAR standard is [moving towards C++ 17 and being integrated into MISRA C++](https://www.autosar.org/news-events/details/misra-consortium-announce-integration-of-autosar-c-coding-guidelines-into-updated-industry-standar/) we decide to go for C++ 17 in our project. These facts make development a lot more convenient and are enough to test our proof of concept prototype.

# 13.03.2020 - Coroutines and 'Nice API'

Before Christmas we finished the 'ugly' inline/object based API. Because this was very hard to use (even for us), the only options were to go to a class based solution like TBB (everything MUST be a class, as soon as you call into user code promises are gone) or to go with a fully fledged continuation based API. We decided to go for the continuation API, as one of our project questions was whether we can incorporate a nice high level API in a library-only solution. Following this we need coroutines and we also need the stealing to work for more than two child tasks.

## Coroutines basic implementation

We go for an implementation similar to [boost.context](https://github.com/boostorg/context), where the context is stored and changed using custom assembly. We do so because setjmp/longjmp is not portable at all (i.e. the behaviour for the jumps we intend to do is EXPLICITLY not allowed in the standard, see the [docs](https://en.cppreference.com/w/c/program/setjmp) on jumping other than what exceptions are allowed to do). Inspirations for what we use in our end product are [deboost.context](https://github.com/septag/deboost.context) and [tinycoroutine](https://github.com/soundsrc/tinycoroutine). Our final implementation uses custom assembly on supported platforms and is faster than boost.context, as we choose to omit unnecessary setup steps: the focus is on quick switching into a new routine. We plan to also use this for 'tail calls' in our future code (and think that this is a good fit if we plan for composable APIs in future work).

## TSan support

We definitely want to integrate tsan into our fibers, as it has helped us in the past to catch various bugs. Unfortunately, we found that our program crashes when running with it. The first step was to adopt the new (mid 2019) API for explicit fiber switching. This seemed to work, however, the program crashed after some time. After long research, we found that this was caused by a memory leak in the tsan implementation. [We fixed it](https://reviews.llvm.org/D76073) and integrated support for it into a separate coroutine library.
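For reference, a rough sketch of how the explicit fiber annotations from `<sanitizer/tsan_interface.h>` are meant to wrap a context switch (hypothetical wrapper code, not our coroutine library; only meaningful when building with `-fsanitize=thread`):

```C++
#include <sanitizer/tsan_interface.h>

// Hypothetical per-coroutine bookkeeping; the actual assembly context switch
// is not shown here.
struct coroutine {
  void *tsan_fiber = nullptr;
};

void create_coroutine(coroutine &coro) {
  // Tell tsan that a new logical thread of execution (fiber) exists.
  coro.tsan_fiber = __tsan_create_fiber(0);
}

void switch_to_coroutine(coroutine &coro) {
  // Must be announced right before the actual context switch, so tsan can
  // track the stack and synchronization state of the fiber we jump into.
  __tsan_switch_to_fiber(coro.tsan_fiber, 0);
  // switch_context(coro); // the real assembly switch would happen here
}

void destroy_coroutine(coroutine &coro) {
  __tsan_destroy_fiber(coro.tsan_fiber);
  coro.tsan_fiber = nullptr;
}
```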
## Multiple child steal

We then needed to adapt our stealing procedure/queue to support multiple child tasks. Doing so in a lock-free manner required some tinkering. Further details are in our paper notebook for now. In summary, we use a flag to atomically trade in resources when stealing a task and use a lock-free stack to store the resources pending on a task.

# 05.11.2019 - Memory Allocation, 'Fat Objects'

We change the memory allocation for all memory the scheduler requires from allocating buffers (char* arrays) separately from the actual data structures to 'fat data structures' that use templating to create an object that actually holds all the data. This makes it simpler to add fields for managing tasks and continuations: we no longer need to change the scheduler_memory (adding additional buffers), but only have to add the fields directly to the container objects.

# 04.11.2019 - Memory Allocation and Initialization

Our framework tries to be explicit about how and where memory is allocated. In any production build of the framework we will only use fixed size memory pools/blocks to manage all data structures required by the scheduler, as this property is our main research goal. Nevertheless, we want to offer different choices for where these fixed pools are allocated. Some people might prefer to store them on the stack, some in heap memory, and others might want to place them into memory managed by custom allocators. Currently we support a stack based 'fat' object and a heap based memory object that stores each thread's state in a vector (this could be changed to lists in the future to avoid the one big memory block allocated by the vector).

# Notes on Blocking/Child Stealing Implementation

Notes on the child stealing implementation of pls. This corresponds to tag v0.1.

## 02.08.2019 - Ideas for sleeping threads when no work is available

A new requirement discovered during theoretical analysis is to put threads to sleep when there is no work left to do. In contrast to the classic Arora, Blumofe and Plaxton (ABP) yielding mechanism, we want threads without work to allow LOWER PRIORITY processes to take over. Simply yielding is not sufficient, as it will not consider lower priority tasks. Therefore, we need to actually sleep/wait during the time without any work. This sleep is a conditional sleep: once work is available, the thread should wake up and perform it to make sure high priority processes finish in time.

Usually one would use condition variables for this kind of wait, but a POSIX condition variable requires holding a mutex to be race free, which we would like to avoid to keep the whole system lock-free (and thus keep our progress guarantee). As a lock-free alternative we found the `futex` Linux syscall. This is platform dependent on Linux, but as a system call is required for any kind of sleep functionality we think it is ok to go for OS specific primitives to implement this (a fallback using locking, i.e. a POSIX mutex + condition variable, is always possible). Some information on `futex` can be found [in this blog post](https://gustedt.wordpress.com/2011/01/28/linux-futexes-non-blocking-integer-valued-condition-variables/) and [in the linux man-pages](http://man7.org/linux/man-pages/man2/futex.2.html). Basically, it allows a thread to sleep conditionally based on an atomic compare of a 32 bit word, which is exactly what we need for our lock-free sleep based on an indicator variable of available work.
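A rough sketch of this idea (hypothetical names, not the pls scheduler code), assuming a Linux target: workers block on a 32 bit 'work available' word with `FUTEX_WAIT` and are woken with `FUTEX_WAKE` once new work is published. The kernel only puts a thread to sleep if the word still holds the expected value, which closes the race between checking for work and going to sleep.

```C++
#include <atomic>
#include <climits>
#include <cstdint>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static std::atomic<std::uint32_t> work_available{0};

static long futex(std::uint32_t *addr, int op, std::uint32_t val) {
  return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
}

// Called by a worker that found its queues empty.
void sleep_until_work() {
  while (work_available.load(std::memory_order_acquire) == 0) {
    // Sleeps only if the word is still 0, otherwise returns immediately.
    // The cast assumes std::atomic<uint32_t> shares the representation of a
    // plain uint32_t, which holds on the Linux targets we care about.
    futex(reinterpret_cast<std::uint32_t *>(&work_available),
          FUTEX_WAIT_PRIVATE, 0);
  }
}

// Called by a worker that published new work.
void announce_work() {
  work_available.store(1, std::memory_order_release);
  futex(reinterpret_cast<std::uint32_t *>(&work_available),
        FUTEX_WAKE_PRIVATE, INT_MAX); // wake all sleeping workers
}
```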
## 31.07.2019 - First benchmark of implementation

We chose the implementation that allocates one invocation instance in one task when running. Our first benchmarks show that this is comparable to existing solutions, although we will investigate optimizations to the internals of the task scheduler to avoid unneeded copying and deeply nested tasks. Average case results of the image pipeline (executed on notebook):

## 19.07.2019 - Colored tokens, recursion and where to put memory

While implementing dataflow graphs we encountered some obstacles. The most severe one, impacting the internal design of the API, end user features and performance, is how to handle token flows, especially colored tokens for parallel and/or recursive calls. The basic issue is that each execution instance requires its own isolated environment to execute in. Therefore, each invocation instance (parallel or recursive) needs a unique identifier. This identifier is usually realized using colored tokens, as for example in the classic dataflow language implementation of \[1]. The problem with this is that in our environment we are constrained to static memory allocation and lock-free programming. To handle static memory allocation we first decided to follow a model introduced by EMBB, associating a clock with each colored token, which maps it to a slot in an array of possible parallel execution instances. This works fine in EMBB, as they put two major limitations on their execution model: no cycles and no recursion. (Issues arise with correctly coloring tokens in a lock-free manner without too much contention, leading us to believe that this method scales very badly, especially with smaller workloads.)

While we could simply implement that, we wondered if there is a way to better fit dataflow into our existing, task-stack based programming approach. After some investigation we found some core statements:

- The dataflow graph itself describes only program execution, it should therefore not be tightly coupled with data/buffer management
- Each invocation of a dataflow graph is logically a task in our model, it therefore makes sense to map the memory and coordination resources required for one invocation instance directly to this task
- What we do is in fact not 'pure' dataflow programming, as we do not try to re-create a full programming language (e.g. we do not care about memory management and loops as described in \[1])
- What we do is closer to functional programming with single assignment rules and recursion (see Elixir for a nice syntactic example)

Our plan is therefore the following: separate the structure of the dataflow from its execution and map one execution instance to one active task in our runtime system. This, conveniently, also mitigates most issues related to memory allocation/parallelism in the graph and makes for a nicer end user API (no more buffer types/parallelism in templates).

\[1] J. B. Dennis, “First version of a data flow procedure language,” in Programming Symposium, vol. 19, B. Robinet, Ed. Berlin, Heidelberg: Springer Berlin Heidelberg, 1974, pp. 362–376.

## 19.07.2019 - Variadic Templates for input/output groups

We searched for the correct way to represent a node's or graph's input/output pairs for a while and found the solution in partial specialization of templates:

```C++
template <typename INS, typename OUTS>
class graph {};

template <typename... I, typename... O>
class graph<inputs<I...>, outputs<O...>> {
  ...
};
```

The above code allows us to enforce an end-user API that is clean while still giving us full access to the input/output variadic template types. The end user API is the following:

```C++
graph<inputs<...>, outputs<...>> g;
```

## 03.07.2019 - Outline/Plan for our Dataflow API

The following describes our ideas for what we expect/want to build in our dataflow API. The restrictions and decisions try to strike a balance between usability, feasibility of the implementation and feature richness (what kind of flows can even be represented using the API). We will model our dataflow closely after the EMBB implementation. There are buffers that are controlled by a color/clock and a DF graph has sources (inputs) and sinks (outputs). Some notable properties we want (and that can differ from EMBB):

- A DF graph must be well behaved, i.e. for each full set of input values exactly one set of output values is produced (each parallel invocation therefore corresponds to exactly one value in the in- and outputs)
- A DF graph must be declared with its full interface, i.e. its full set of input types and output types must be known when declaring it

```c++
dataflow<Inputs<...>, Outputs<...>, NUM_PARAL> df_instance;
```

- Nodes inside the DF graph are produced by the graph itself, allowing it to propagate parallelism constraints (for example directly creating the children with correct buffer capacities). This is done by having factory functions on a concrete DF graph instance.
- A DF graph's sources/sinks are 'packed', i.e. the user must provide a full set of inputs to trigger the DF graph and is provided a full set of outputs when reading a result from the DF graph (this highlights our intent of well behaved flow graphs, as users do not even get the notion that it is ok to only return partial results)

```c++
auto source = df.source([](String &out1, Int &out2) {
  if (elements_available) {
    out1 = ...;
    out2 = ...;
    return true;
  } else {
    return false;
  }
});

...

auto sink = df.sink([](const Image &, const Int &) {
  ...;
});
```

- Easy APIs for working with array data are provided in the form of iterator sources/sinks

```c++
auto source = df.iterator_source();
auto sink = df.iterator_sink();

...

source.reset(input.begin(), input.end());
sink.reset(output.begin());
df.wait_for_all();
```

- In the first version nodes are always fully parallel, further versions might include the per node property of unordered_serial (nodes are executed at most once, but not ordered, e.g. for accessing shared memory) or ordered_serial (e.g. for logging something in correct order to a file/console).
- Sinks/user accessed outputs of a DF graph are always ordered_serial, preventing confusion for the end user and ensuring deterministic execution
- Memory management for all buffers and the DF graph itself is made explicitly visible to the end user by requiring them to hold all components of the graph in memory when using it. This keeps with our philosophy of not having hidden memory allocations, making development for e.g. embedded platforms simpler, as it is clear where and what resources are used (one can simply sizeof(...) all parts of the graph and find out how much memory the buffers and so on require)
- This model in principle allows recursive invocation. We will not implement this at first, but keep the option for later. This will potentially allow different patterns, like stencil operations, to be implemented with the system.

## 03.07.2019 - Dataflow in EMBB

EMBB's dataflow is a very simple but effective implementation of k-colored tokens (a maximum of k concurrent invocations, data/tokens marked by an individual color per parallel invocation). They force an acyclic, recursion-free flow and require source nodes and sink nodes to be set explicitly (acting as in- and outputs). This allows them to send signals down arcs even if there is no value, e.g. if there is a split in control flow the 'not used' side of the flow will be fed an 'empty' token, signaling sinks that the execution of this parallel instance has reached the sink. Once one color of tokens (so one parallel execution instance) reaches ALL sinks the model allows a new one to be input. This requirement that all tokens reach the sinks before new ones can enter is ordered, thus potentially limiting concurrency, but at the same time it makes for a model that is very simple to implement. Computational nodes between sources and sinks are associated with input buffers (having the same capacity as the number of parallel invocations allowed). These can hold values from predecessors until all inputs for the node are ready. The node is started as a process as soon as the last needed input is provided (this event is triggered by a reference counter of missing inputs).

## 26.06.2019 - Notes on Dataflow Implementation

### Dataflow in general

Dataflow based programming is nothing unique to task based programming, it rather is its own programming paradigm, and using it inside a task scheduler is simply an adoption of the general idea of organizing programs by the flow of available data.
Therefore, we first look at the general domain of dataflow programming to understand the basic concepts. The work in \[1] gives a good overview of dataflow programming before 2000. It presents the basic execution concept and the idea of data driving program execution. Two main modes of execution can be distinguished: static and dynamic execution. Static execution allows only one token per arc in the execution graph, making it simple, but not allowing much parallel execution. Dynamic execution allows an unbounded number of tokens per edge, making it harder to implement, but allowing unlimited parallelism. There are also models allowing a maximum of k tokens on an arc, which is probably where we are going, as we will try to keep memory allocation static once again. Normally, dataflow programming is seen as a pure execution model, meaning that everything is dataflow, even expressions like `x = y + z` are executed with dataflow, and the idea is to create special hardware to execute such dependency graphs.

We are interested in tasks coupled by dataflow dependencies. \[1] mentions this under names like threaded, hybrid or large-grain dataflow. Looking further into these coarse-grained dataflow models we discovered \[2] as a good overview of this area. The paper focuses mainly on implementations that also rely on special hardware, but the general principles hold for our implementation. Our kind of dataflow falls under the flow/instruction category: high-level coordination is achieved using dataflow, individual work items are tasks in an imperative language. We need to further see if ideas from these languages are helpful to us, maybe individual papers can give some clues if we need them later. For now we can conclude that we will probably implement some sort of k-limited, dynamic system (using tokens with different IDs).

\[1] W. M. Johnston, J. R. P. Hanna, and R. J. Millar, “Advances in dataflow programming languages,” ACM Computing Surveys, vol. 36, no. 1, pp. 1–34, Mar. 2004.

\[2] F. Yazdanpanah, C. Alvarez-Martinez, D. Jimenez-Gonzalez, and Y. Etsion, “Hybrid Dataflow/von-Neumann Architectures,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6, pp. 1489–1509, Jun. 2014.

### Dataflow in TBB and EMBB

TBB's dataflow implementation follows (as far as we can see) no general dataflow theory and implements its own method. It relies on explicit concepts of buffering, multiple arcs ending in the same node, concepts of pulling and pushing modes for arcs, and so on. We think this is overly complicated and too far away from the classic model. EMBB seems to follow a token based model with IDs distinguishing tokens belonging to different parallel executions. It uses arcs that can buffer a limited number of data items. Overall this seems rather close to what we have in mind and we will look into it further.

## 24.06.2019 - Further Benchmark Settings and First Results

As a further option to reduce potential interference with our benchmark process we add the `CPUAffinity` option to systemd, setting the CPU affinity of all processes started by the startup procedure to core 0, thus isolating cores 1 to 7 for our benchmarks. For docs on this see [the systemd manual](https://www.freedesktop.org/software/systemd/man/systemd.exec.html). (Please note that this is not perfect; it seems that a kernel parameter would perform even better, but our current measurements show this working well enough to not go that deep as long as we do not see the need in further testing.)
We could potentially improve our results further by using a preemption patch on a newer kernel, e.g. Arch Linux with the preempt patch.

Our first results look promising and show that our measurement methodology can indeed work (measured on the Banana Pi with the above settings). Background tasks are started on cores 1 to 7 with 500us work and 5000us pause. Benchmarks are started pinned to cores 1 to 7:

- FFT without background tasks
- FFT with background tasks
- Matrix without background tasks
- Matrix with background tasks

The plots include all outliers, showing that the worst case times can be measured and compared under these circumstances. One option we might change in the future is how/when the background preempting workers are executed, to better 'produce' bad situations for the scheduler, e.g. have a higher active time for the workers (close to maybe 1/4 of the benchmark times), to actually fully remove one worker from a few tasks (we would like to do that with our tests on different queue types, to e.g. see if the locking queue actually is worse than lock-free implementations by blocking up other tasks).

## 14.06.2019 - Benchmark Settings/Real Time Execution on Linux

The goal we set for this project is predictable execution times. To ensure this it would be perfect to run our benchmarks in complete isolation/in a perfectly predictable and controlled environment. A truly predictable environment is only possible on e.g. a fully real time OS on a board that supports it. As this is a lot of work, we look at real time Linux components, namely the preempt rt patch.

We already execute our program with `chrt -f 99` to run it as a real time task. We have still seen quite big jumps in latency. This can be due to interrupts or other processes preempting our benchmark. We found [in this blog post](https://ambrevar.xyz/real-time/) some more interesting details on real time processes. There is a variable `/proc/sys/kernel/sched_rt_runtime_us` to set the maximum time a real time process can run per second, ensuring that the system does not freeze when running an RT process. When setting this to 1 second the RT processes are allowed to fully take over the system (we will do this for further tests, as first results show that this drastically improves jitter). Further information on RT groups can be [found here](https://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt). The RT patch of Linux apparently allows the OS to preempt even most parts of interrupts, thus providing even more isolation (see also [this blog post](https://ambrevar.xyz/real-time/)). We will further investigate this by looking at a book on RT Linux.

## 14.06.2019 - Range Class

For iterating over integers (e.g. from 0 to 10) a translation from integers to iterators over them is necessary. Instead of rolling our own implementation we found a [standalone header file](https://bitbucket.org/AraK/range/src/default/) that we will use for now to avoid the boilerplate work.

## 11.06.2019 - Parallel Scan

Our parallel scan is oriented on the ['shared memory: two level algorithm'](https://en.wikipedia.org/wiki/Prefix_sum); our understanding of it is based on [slides from a cs course](http://www.cs.princeton.edu/courses/archive/fall13/cos326/lec/23-parallel-scan.pdf).
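As a reminder of how the two level algorithm works, a compact sketch (hypothetical helper written with plain loops; in the library each per-chunk loop would become one task): every chunk computes a local scan, the per-chunk sums are then scanned themselves, and each chunk finally adds its offset.

```C++
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Inclusive prefix sum in two parallel-friendly sweeps over fixed chunks.
std::vector<int> two_level_scan(const std::vector<int> &in, std::size_t num_chunks) {
  std::vector<int> out(in.size());
  std::vector<int> chunk_sums(num_chunks, 0);
  const std::size_t chunk = (in.size() + num_chunks - 1) / num_chunks;

  // Pass 1 (parallel over chunks): local inclusive scan per chunk.
  for (std::size_t c = 0; c < num_chunks; c++) {
    int running = 0;
    for (std::size_t i = c * chunk; i < std::min(in.size(), (c + 1) * chunk); i++) {
      running += in[i];
      out[i] = running;
    }
    chunk_sums[c] = running;
  }

  // Sequential scan over the (small) per-chunk sums to get chunk offsets.
  std::exclusive_scan(chunk_sums.begin(), chunk_sums.end(), chunk_sums.begin(), 0);

  // Pass 2 (parallel over chunks): add each chunk's offset to its elements.
  for (std::size_t c = 0; c < num_chunks; c++) {
    for (std::size_t i = c * chunk; i < std::min(in.size(), (c + 1) * chunk); i++) {
      out[i] += chunk_sums[c];
    }
  }
  return out;
}
```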
## 06.05.2019 - Relaxed Atomics

At the end of the atomics talk it is mentioned how relaxed atomics correspond to modern hardware, stating that starting with ARMv8 and on x86 there is no real need for relaxed atomics, as the 'strict' ordering has no real performance impact. We will therefore ignore this for now (taking some potential performance hits on our Banana Pi M2 ARMv7 architecture), as it adds a lot of complexity and introduces many possibilities for subtle concurrency bugs. TODO: find some papers on the topic in case we want to mention it in our project report.

## 06.05.2019 - Atomics

Atomics are a big part of getting lock-free code running and are also often useful for details like counters to spin on (e.g. is a task finished). Resources to understand them can be found in [a talk online](https://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/) and in the book 'C++ Concurrency in Action' by Anthony Williams. The key takeaway is that we need to be careful about the ordering possibilities of variable reads/writes. We will start off using strict (sequentially consistent) ordering and will have to see if this impacts performance and if we need to optimize it.

## 12.04.2019 - Unique IDs

Assigning unique IDs to logically different tasks is key to the current model of separating timing/memory guarantees for certain parallel patterns. We want to assign these IDs automatically in most cases (a call to some parallel API should not require the end user to specify a unique ID for each different call, as this would be error prone). Instead we want to make sure that each DIFFERENT API call is separated automatically, while leaving the option of manual ID assignment for later implementations of GPU offloading (tasks need identifiers for this to work properly). Our first approach was using the `__COUNTER__` macro, but this only works within ONE COMPILATION UNIT and is not portable to all compilers. As all our unique instances are coupled to a class/type, we decided to implement the unique IDs using typeid(tuple) in automatic cases (bound to a specific type) and fully manual IDs in the other cases. All this is wrapped in a helper::unique_id class.

## 11.04.2019 - Lambda Pointer Abstraction

The question is whether we could use a pointer to a lambda without needing templating (for the type of the lambda) at the place where we use it. We looked into different techniques to achieve this:

- Using std::function<...>
- Using custom wrappers

std::function uses dynamic memory, thus ruling it out. All methods that we can think of involve storing a pointer to the lambda and then calling it later on. This works well enough with a simple wrapper, but has one major downside: it involves virtual function calls, making it impossible to inline the lambda. This property broke the technique for us in most places, as inlining is crucial, especially in small functions like loop iterations. See `invoke_parallel_impl.h` for an example where we did this (a wrapper with a virtual function call); we only did it there because the generic `fork_join_sub_task` requires the virtual call anyway, so we do not lose ANY performance with this technique.
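For illustration, a stripped-down version of such a wrapper (hypothetical names, not the actual code in `invoke_parallel_impl.h`): a non-template base class with a virtual call operator lets us store and invoke arbitrary lambdas through one pointer type, at the price of the virtual dispatch that prevents inlining.

```C++
// Non-template interface the scheduler side can store and call.
struct callback_base {
  virtual void operator()() = 0;
  virtual ~callback_base() = default;
};

// Templated wrapper that only holds a pointer to the caller's lambda,
// so the lambda itself stays wherever the caller placed it.
template <typename Lambda>
struct callback_ref : callback_base {
  explicit callback_ref(Lambda &lambda) : lambda_{&lambda} {}
  void operator()() override { (*lambda_)(); } // virtual call, no inlining
  Lambda *lambda_;
};

int main() {
  int sum = 0;
  auto work = [&]() { sum += 1; };

  callback_ref<decltype(work)> stored{work};
  callback_base &erased = stored; // the scheduler side only sees this type
  erased();                       // virtual call into the lambda
  return sum == 1 ? 0 : 1;
}
```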
## 11.04.2019 - Notes on C++ Templating

After working more with templating and talking to Mike, it seems like the common way to go is the following:

- If possible, add template arguments to data containers only (separate from logic).
- If logic and data are coupled (as is often the case with lambdas), add the declaration of the interface into the normal header some_class.h and add its implementation into an extra implementation file some_class_impl.h that is included at the end of the header.

## 09.04.2019 - Cache Alignment

Aligning for the cache requires both parts: data types with correct alignment and base memory with correct alignment. Our first tests show that the initial alignment (commit 3535cbd8) boosted the performance of the fft_benchmark in our library to Intel TBB's speedup when running on up to 4 threads. When crossing the boundary to hyper-threading this falls off. We therefore think that contention/cache misses are the reason for bad performance above 4 threads, but have to investigate further to pin down the issue.

## 08.04.2019 - Random Numbers

We decided to go for a simple linear congruential random number generator, [std::minstd_rand](http://www.cplusplus.com/reference/random/minstd_rand/), as it requires less memory and is faster. The decreased quality of the random numbers is probably ok (read up if there is literature on this), as work stealing does not rely on a mathematically perfect distribution.

## 02.04.2019 - CMake Export

We build our project using CMake to make it portable and easy to set up. To allow others to use our library we need to make it installable on other systems. For this we use CMake's install feature and a [tutorial](https://pabloariasal.github.io/2018/02/19/its-time-to-do-cmake-right/) on how to correctly configure a CMake library to be included by other projects.

## 28.03.2019 - custom new operators

When initializing sub_tasks we want to place them on our custom 'stack like' data structure per thread. We looked at TBB's API and noticed that they somehow implicitly set parent relationships in the new operator. After further investigation we see that this style of initialization is a 'hack' to avoid passing references and counters around. It can be found at the bottom of the `task.h` file:

```C++
inline void *operator new( size_t bytes, const tbb::internal::allocate_child_proxy& p ) {
    return &p.allocate(bytes);
}

inline void operator delete( void* task, const tbb::internal::allocate_child_proxy& p ) {
    p.free( *static_cast<tbb::task*>(task) );
}
```

It simply constructs a temporary 'allocator type' that is passed as the second argument to new. This type is then called inside new and allocates the required memory.

## 27.03.2019 - atomics

C++11 offers atomics, however these require careful usage and are not always lock free. We plan on doing more research on these operations when we try to transform our code from using spin locks to using more fine grained locks. Resources can be found [here](https://www.justsoftwaresolutions.co.uk/files/ndc_oslo_2016_safety_off.pdf) and [here](http://www.modernescpp.com/index.php/c-core-guidelines-the-remaining-rules-to-lock-free-programming).

## 27.03.2019 - variable sized lambdas

When working with lambdas one faces the problem that they do not have a fixed size, because they can capture variables from the surrounding scope. To 'fix' this in normal C++ one would use a std::function, wrapping the lambda by moving it onto the heap. This is of course a problem when trying to prevent dynamic memory allocation. When we want static allocation we have two options:

1) keep the lambda on the stack and only call into it while it is valid
2) use templating to create variable sized classes for each lambda used

Option 1) is preferable, as it does not create extra templating code (longer compile times, code can not be separated into CPP files). However, we can encounter situations where the lambda is not on the stack when used, especially when working with sub-tasks.

## 21.03.2019 - Allocation on stack/static memory

We can use the [placement new](https://www.geeksforgeeks.org/placement-new-operator-cpp/) operator for our tasks and other objects to manage memory. This can allow the pure 'stack based' approach without any memory management suggested by Mike.
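A minimal, self-contained illustration of the mechanism (the task type is made up): construct an object into pre-reserved, properly aligned storage with placement new and destroy it explicitly when the slot is recycled.

```C++
#include <new>
#include <type_traits>

// Hypothetical task type, standing in for our real sub_task objects.
struct my_task {
  explicit my_task(int id) : id_{id} {}
  int id_;
};

int main() {
  // Storage that could live on a per-thread 'task stack', no heap involved.
  std::aligned_storage_t<sizeof(my_task), alignof(my_task)> slot;

  my_task *task = new (&slot) my_task{42}; // construct in place
  const int id = task->id_;
  task->~my_task();                        // manual destruction, slot can be reused
  return id == 42 ? 0 : 1;
}
```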
## 20.03.2019 - Prohibit New

We want to write this library without using any runtime memory allocation, to better fit the needs of the embedded market. To make sure we do not do so we add a trick: we link a new implementation into the project (when testing) that will cause a linker error if new is used anywhere. If the linker reports such an error we can switch to debugging by using a new implementation with a breakpoint in it. That way we, for example, ruled out std::thread, as we found the dynamic memory allocation used in it.

## 20.03.2019 - callable objects and memory allocation / why we use no std::thread

When working with any sort of functionality that can be passed to an object or function, it is usually passed as:

1. a function pointer and a list of parameter values
2. a lambda, capturing any surrounding parameters needed

When we want to pass ANY functionality (with any number of parameters or captured variables) we cannot determine the amount of memory needed before the call is made, making the callable (function + parameters) dynamically sized. This can be a problem when implementing e.g. a thread class, as the callable has to be stored somewhere. The **std::thread** implementation allocates memory at runtime using **new** when called with any form of parameters for the started function. Because of this (and because the implementation can differ from system to system) we decided not to provide a **std::thread** backend for our internal thread class (which does not use dynamic memory, as it lives on the stack and knows its size at compile time through templates).

Lambdas can be used as long as we are sure the outer scope still exists while executing (the lambda lies in the caller's stack), or if we copy the lambda manually to some memory that we know will persist during the call. It is important to ensure the lambda is not freed while it is used, as the captured variables used inside the body are held in the lambda object.
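A small sketch of this idea (simplified, not our actual internal thread class): because the callable's type is a template parameter, its size is known at compile time and it can be stored by value inside the object, avoiding the runtime allocation that std::thread performs.

```C++
#include <utility>

// Stores an arbitrary callable by value; sizeof(static_callable<F>) is known
// at compile time, so instances can live on the stack or in static storage.
template <typename Function>
class static_callable {
 public:
  explicit static_callable(Function function) : function_{std::move(function)} {}
  void operator()() { function_(); }

 private:
  Function function_; // captured state lives inside the object, no heap
};

int main() {
  int result = 0;
  auto work = [&result]() { result = 42; };

  static_callable<decltype(work)> callable{work}; // copies the lambda by value
  callable();
  return result == 42 ? 0 : 1;
}
```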