Commit ec660473 by FritzFlorian

Rework README

parent ca2c8abd
Pipeline #1598 passed with stages
in 4 minutes 37 seconds
[![pipeline status](http://lab.las3.de/gitlab/las3/development/scheduling/predictable_parallel_patterns/badges/master/pipeline.svg)](http://lab.las3.de/gitlab/las3/development/scheduling/predictable_parallel_patterns/commits/master)
## Getting Started
PLS is a C++ work-stealing library designed to stay close to Blumofe's original randomized work-stealing algorithm.
It therefore maintains the algorithm's theoretical time and space bounds. The key to this is to never
block parallel tasks that are ready to execute. In order to do this, tasks are modeled as stackful
coroutines, allowing them to be paused and resumed at any point. Additionally, PLS allocates all
its memory statically and manages it in a decentralized manner within the stealing procedure.
As a result, a PLS scheduler instance acquires all required resources at startup and performs no
further general-purpose memory management or allocations during runtime.
Further [general notes](NOTES.md) and [performance notes](PERFORMANCE-v2.md) can be found in
their respective files.
PLS is a research prototype developed as a student project in the
Laboratory for Safe and Secure Systems (https://las3.de/).
## API overview
### Basic Usage
PLS implements a nested fork-join API. This programming model originates from Cilk and
its better-known C++ successor Cilk Plus. In contrast to these projects, PLS does not require any
compiler support and can be used as a plain library.
```c++
#include <pls/pls.h>
#include <iostream>
// Static memory allocation (described in detail in a later section)
static const int MAX_SPAWN_DEPTH = 32;
static const int MAX_STACK_SIZE = 4096;
static const int NUM_THREADS = 8;

long fib(long n);

int main() {
// Create a scheduler with the static amount of resources.
// All memory and system resources are allocated here.
pls::scheduler scheduler{NUM_THREADS, MAX_SPAWN_DEPTH, MAX_STACK_SIZE};
// Wake up the thread pool and perform work.
scheduler.perform_work([&] {
    // ...
  });
}

long fib(long n) {
  if (n <= 1) {
    return n;
  }
// Example of the main functions spawn and sync.
// pls::spawn(...) starts a lambda as an asynchronous task.
int a, b;
pls::spawn([&a, n] { a = fib(n - 1); });
pls::spawn([&b, n] { b = fib(n - 2); });
// pls::sync() ensures that all child tasks are finished.
pls::sync();
// After the sync() the produced results can be used safely.
return a + b;
}
```
### Execution Trees and Static Resource Allocation
The API primarily exposes a spawn(...) and a sync() function, which are used to create
nested fork-join parallelism. A spawn(...) call allows the passed lambda to execute
asynchronously; a sync() call forces all direct child invocations to finish before it
returns. The programming model therefore acts mostly as 'asynchronous sub procedure calls'
and is a good fit for all algorithms that are naturally expressed as recursive functions.
The resulting asynchronous invocation tree is then executed in parallel by the runtime system,
which uses randomized work-stealing for load balancing.
To size the static allocation, you need to determine the maximum resources a parallel invocation requires.
The parameters used to allocate the scheduler resources statically are as follows:
- NUM_THREADS: The number of worker threads used for the parallel invocation.
- MAX_SPAWN_DEPTH: The maximum depth of parallel invocations, i.e. the number of nested spawn calls.
- MAX_STACK_SIZE: The stack size used for the coroutines. It must be big enough to fit the stack of a passed
  lambda function up to the point where the next spawn statement appears.
## Installation
This section will give a brief introduction on how to get a minimal
project setup that uses the PLS library.
### Building
PLS has no external dependencies. To compile and install it you
only need CMake and a recent C++ 17 compatible compiler.
Care might be required on not explicitly supported systems
(currently we support Linux x86 and ARMv7; however, all platforms
supported by boost.context should work fine).
Clone the repository and open a terminal session in its folder.
Create a build folder using `mkdir cmake-build-release`
and switch into it using `cd cmake-build-release`.
Set up the CMake project using `cmake ../ -DCMAKE_BUILD_TYPE=RELEASE`,
then install it as a system wide dependency using `sudo make install.pls`.
At this point the library is installed on your system.
To use it simply add it to your existing cmake project using
`find_package(pls REQUIRED)` and then link it to your project
using `target_link_libraries(your_target pls::pls)`.
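Putting the two commands together, a minimal consuming project could use a CMakeLists.txt along these lines (the `hello_pls` project/target name and `main.cpp` source are placeholders):

```cmake
cmake_minimum_required(VERSION 3.10)
project(hello_pls CXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

add_executable(hello_pls main.cpp)

# Locate the system-wide PLS installation and link it to the target.
find_package(pls REQUIRED)
target_link_libraries(hello_pls pls::pls)
```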
Available Settings:
- `-DPLS_PROFILER=ON/OFF`
  - default OFF
  - Enabling it will record execution DAGs with memory and runtime stats
  - Enabling has a BIG performance hit (use only for development)
- `-DSLEEP_WORKERS=ON/OFF`
  - default OFF
  - Enabling it will make workers keep a central 'all workers empty flag'
  - Workers try to sleep if there is no work in the system
  - Has performance impact on isolated runs, but can benefit multiprogrammed systems
- `-DADDRESS_SANITIZER=ON/OFF`
  - default OFF
  - Enables address sanitizer to be linked to the executable
- `-DDEBUG_SYMBOLS=ON/OFF`
  - default OFF
  - Enables the build with debug symbols
  - Use for e.g. profiling the release build
- `-DEASY_PROFILER=ON/OFF`
  - deprecated, not currently used in the project
  - default OFF
  - Enabling will link the easy profiler library and enable its macros
  - Enabling has a performance hit (do not use in releases)
Note that these settings are persistent for one CMake build folder.
If you, for example, set a flag in the debug build, it will not influence
the release build, but it will persist in the debug build folder
until you explicitly change it back.
## Execution Trees and Static Resource Allocation in PLS
As mentioned in the introduction, PLS allocates all resources it requires
when the scheduler instance is created. For this, some parameters have to be provided by
the user. In order to understand them, it is important to be aware of the execution model of
PLS.
During an invocation, sub procedure calls are potentially executed in parallel. Each parallel call is
executed on a stackful coroutine with its own stack region, i.e. each spawn(...) places and executes
the passed lambda on a new stack. The static allocation therefore must know how big the coroutines' stacks will be.
The MAX_STACK_SIZE parameter controls the size of these stacks and should be big enough to hold
one such subroutine invocation. Note that each stack must only be big enough to reach the next
spawn(...) call, as that call is executed on its own new stack. By default, PLS allocates stacks that
are multiples of the system's page size, so that outrunning a stack triggers a segmentation fault.
Thus, it should be easy to detect if the chosen stack space is too small. Optionally, the profiling
mechanism in PLS can be used to find the exact sizes of the stacks at runtime.
![](./media/invocation_tree.png)
The figure above shows a call tree resulting from a parallel invocation,
where the boxes between the dotted lines are stackful coroutines in the parallel region.
The region is entered through a `scheduler.perform_work([&]() { ... });` call at the top dotted line,
switching execution onto the coroutines, and left with a `pls::serial(...)` call that switches back
to a continuous stack (as indicated by the shaded boxes below the dotted line).
The MAX_SPAWN_DEPTH parameter indicates the maximum nested spawn level, i.e. the depth of the
invocation tree in the parallel region. If this depth is reached,
PLS automatically disables any further parallelism and switches to serial execution.

The work-stealing algorithm guarantees the busy-leaves property: during an invocation of the
call tree at most P branches (P being the number of worker threads configured using NUM_THREADS)
are active. Therefore, the invocation never uses more than NUM_THREADS * MAX_SPAWN_DEPTH stacks
and can allocate them during startup using NUM_THREADS * MAX_SPAWN_DEPTH * MAX_STACK_SIZE memory.
During execution, the pre-allocated memory is balanced between worker threads using a trading scheme
integrated into the work-stealing procedure. This way, the statically allocated memory suffices for every
possible order in which the call tree can be invoked (as randomized stealing can lead to different execution orders).
The user only has to picture a regular, serial call tree in order to set the above limits and reason about
parallel invocations, as the required memory simply scales up linearly with additional worker threads.
## Project Structure
The project uses [CMAKE](https://cmake.org/) as its build system;
the recommended IDE is either a simple text editor or [CLion](https://www.jetbrains.com/clion/).
We divide the project into sub-targets to separate the library
itself, testing and example code. The library itself can be found in
`lib/pls`, the context switching implementation in `lib/context_switcher`,
testing related code in `test`, and example and playground/benchmark apps in `app`.
The main scheduling code can be found in `/pls/internal/scheduling` and the
resource trading/work stealing deque implementation responsible for balancing the static
resources and stealing work can be found in `/pls/internal/scheduling/lock_free`.
### Testing
Testing is done using [Catch2](https://github.com/catchorg/Catch2/)
in the `test` subfolder. Tests are built into a target called `tests`
and can be executed simply by building this executable and running it.
Currently, only basic tests are implemented.
### PLS profiler
The profiler records execution DAGs with memory and runtime stats,
which can later be rendered by the dot software to inspect the actual
executed graph.
The most useful statistics to analyze are the maximum memory required per
coroutine stack, the computational depth, T_1 and T_inf.
To query the stats use the profiler object attached to a scheduler instance.
```c++
// You can disable the memory measure for better performance
scheduler.get_profiler().disable_memory_measure();
// Invoke some parallel algorithm you want to benchmark
scheduler.perform_work([&]() { ... });
// Collect stats from last run as required
std::cout << scheduler.get_profiler().current_run().t_1_ << std::endl;
std::cout << scheduler.get_profiler().current_run().t_inf_ << std::endl;
scheduler.get_profiler().current_run().print_dag(std::cout);
...
```