diff --git a/README.md b/README.md index 79b6a7a..bd479fe 100644 --- a/README.md +++ b/README.md @@ -5,39 +5,29 @@ [![pipeline status](http://lab.las3.de/gitlab/las3/development/scheduling/predictable_parallel_patterns/badges/master/pipeline.svg)](http://lab.las3.de/gitlab/las3/development/scheduling/predictable_parallel_patterns/commits/master) -## Getting Started +PLS is a C++ work-stealing library designed to stay close to Blumofe's original randomized work-stealing algorithm. +It therefore upholds the algorithm's theoretical time and space bounds. The key to this is to never +block parallel tasks that are ready to execute. In order to do this, tasks are modeled as stackful +coroutines, allowing them to be paused and resumed at any point. Additionally, PLS allocates all +its memory statically and manages it in a decentralized manner within the stealing procedure. +By doing this, a PLS scheduler instance allocates all required resources at startup, performing no +further general-purpose memory management or allocations at runtime. -This section will give a brief introduction on how to get a minimal -project setup that uses the PLS library. -Further [general notes](NOTES.md) and [performance notes](PERFORMANCE-v2.md) can be found in -their respective files. - -### Installation - -PLS has no external dependencies. To compile and install it you -only need cmake and a recent C++ 17 compatible compiler. -Care might be required on not explicitly supported systems -(currently we support Linux x86 and ARMv7). - -Clone the repository and open a terminal session in its folder. -Create a build folder using `mkdir cmake-build-release` -and switch into it `cd cmake-build-release`. -Setup the cmake project using `cmake ../ -DCMAKE_BUILD_TYPE=RELEASE`, -then install it as a system wide dependency using `sudo make install.pls`. +PLS is a research prototype developed as a student project in the +Laboratory for Safe and Secure Systems (https://las3.de/). 
-At this point the library is installed on your system. -To use it simply add it to your existing cmake project using -`find_package(pls REQUIRED)` and then link it to your project -using `target_link_libraries(your_target pls::pls)`. +## API overview -### Basic Usage +PLS implements a nested fork-join API. This programming model originates from Cilk and +the better known C++ successor Cilk Plus. In contrast to these projects, PLS does not require any compiler +support and can be used as a plain library. ```c++ #include #include -// Static memory allocation (see execution trees for how to configure) -static const int MAX_NUM_TASKS = 32; +// Static memory allocation (described in detail in a later section) +static const int MAX_SPAWN_DEPTH = 32; static const int MAX_STACK_SIZE = 4096; static const int NUM_THREADS = 8; @@ -46,7 +36,7 @@ long fib(long n); int main() { // Create a scheduler with the static amount of resources. // All memory and system resources are allocated here. - pls::scheduler scheduler{NUM_THREADS, MAX_NUM_TASKS, MAX_STACK_SIZE}; + pls::scheduler scheduler{NUM_THREADS, MAX_SPAWN_DEPTH, MAX_STACK_SIZE}; // Wake up the thread pool and perform work. scheduler.perform_work([&] { @@ -64,57 +54,66 @@ long fib(long n) { return n; } - // Example for the high level API. - // Will run both functions in parallel as separate tasks. + // Example of the main functions spawn and sync. + // pls::spawn(...) starts a lambda as an asynchronous task. int a, b; - pls::invoke( - [&a, n] { a = fib(n - 1); }, - [&b, n] { b = fib(n - 2); } - ); + pls::spawn([&a, n] { a = fib(n - 1); }); + pls::spawn([&b, n] { b = fib(n - 2); }); + + // pls::sync() ensures that all child tasks are finished. + pls::sync(); + + // After the sync() the produced results can be used safely. return a + b; } ``` -### Execution Trees and Static Resource Allocation +The API primarily exposes a spawn(...) and sync() function, which are used to create +nested fork-join parallelism. A spawn(...) 
call allows the passed lambda to execute +asynchronously, while a sync() call forces all direct child invocations to finish before it +returns. The programming model therefore acts mostly as 'asynchronous sub procedure calls' +and is a good fit for all algorithms that are naturally expressed as recursive functions. +The described asynchronous invocation tree is then executed in parallel by the runtime system, +which uses randomized work-stealing for load balancing. -TODO: For the static memory allocation you need to find the maximum required resources. +The parameters used to allocate the scheduler resources statically are as follows: +- NUM_THREADS: The number of worker threads used for the parallel invocation. +- MAX_SPAWN_DEPTH: The maximum depth of parallel invocations, i.e. the number of nested spawn calls. +- MAX_STACK_SIZE: The stack size used for coroutines. It must be big enough to fit the stack of the passed +lambda function until the next spawn statement appears. -## Project Structure +## Installation -The project uses [CMAKE](https://cmake.org/) as it's build system, -the recommended IDE is either a simple text editor or [CLion](https://www.jetbrains.com/clion/). -We divide the project into sub-targets to separate for the library -itself, testing and example code. The library itself can be found in -`lib/pls`, the context switching implementation in `lib/context_switcher`, -testing related code is in `test`, example and playground/benchmark apps are in `app`. +This section gives a brief introduction to setting up a minimal +project that uses the PLS library. -### Buiding +PLS has no external dependencies. To compile and install it you +only need CMake and a recent C++17-compatible compiler. +Care might be required on systems that are not explicitly supported +(currently we support Linux x86 and ARMv7; however, all platforms +supported by boost.context should work fine). 
-To build the project first create a folder for the build -(typically as a subfolder to the project) using `mkdir cmake-build-debug`. -Change to the new folder `cd cmake-build-debug` and init the cmake -project using `cmake ../ -DCMAKE_BUILD_TYPE=DEBUG`. For realease builds -do the same only with build type `RELEASE`. Other build time settings -can also be passed at this setup step. +Clone the repository and open a terminal session in its folder. +Create a build folder using `mkdir cmake-build-release` +and switch into it using `cd cmake-build-release`. +Set up the CMake project using `cmake ../ -DCMAKE_BUILD_TYPE=RELEASE`, +then install it as a system-wide dependency using `sudo make install.pls`. -After this is done you can use normal `make` commands like -`make` to build everything `make ` to build a target -or `make install` to install the library globally. +At this point the library is installed on your system. +To use it, simply add it to your existing CMake project using +`find_package(pls REQUIRED)` and then link it to your project +using `target_link_libraries(your_target pls::pls)`. 
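For reference, a minimal consuming `CMakeLists.txt` might look like the following sketch. The project name `pls_example`, the target name `your_target`, and the source file `main.cpp` are placeholders; the C++17 requirement and the `pls::pls` imported target are taken from this README.

```cmake
cmake_minimum_required(VERSION 3.10)
project(pls_example CXX)

# PLS requires a C++17-compatible compiler.
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

add_executable(your_target main.cpp)

# Locate the system-wide PLS installation and link it to the target.
find_package(pls REQUIRED)
target_link_libraries(your_target pls::pls)
```

Because `pls::pls` is an imported target, linking it also propagates any required include directories and compile options to `your_target`.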
Available Settings: -- `-DPLS_PROFILER=ON/OFF` - - default OFF - - Enabling it will record execution DAGs with memory and runtime stats - - Enabling has a BIG performance hit (use only for development) - `-DSLEEP_WORKERS=ON/OFF` - default OFF - Enabling it will make workers keep a central 'all workers empty flag' - Workers try to sleep if there is no work in the system - Has performance impact on isolated runs, but can benefit multiprogrammed systems -- `-DEASY_PROFILER=ON/OFF` +- `-DPLS_PROFILER=ON/OFF` - default OFF - - Enabling will link the easy profiler library and enable its macros - - Enabling has a performance hit (do not use in releases) + - Enabling it will record execution DAGs with memory and runtime stats + - Enabling has a BIG performance hit (use only for development) - `-DADDRESS_SANITIZER=ON/OFF` - default OFF - Enables address sanitizer to be linked to the executable @@ -129,17 +128,76 @@ Available Settings: - default OFF - Enables the build with debug symbols - Use for e.g. profiling the release build +- `-DEASY_PROFILER=ON/OFF` + - deprecated, not currently used in the project + - default OFF + - Enabling will link the easy profiler library and enable its macros + - Enabling has a performance hit (do not use in releases) Note that these settings are persistent for one CMake build folder. If you, e.g., set a flag in the debug build, it will not influence the release build, but it will persist in the debug build folder until you explicitly change it back. +## Execution Trees and Static Resource Allocation in PLS + +As mentioned in the introduction, PLS allocates all resources it requires +when creating the scheduler instance. For this, some parameters have to be provided by +the user. In order to understand them, it is important to be aware of the execution model of +PLS. + +During an invocation, sub procedure calls are potentially executed in parallel. Each parallel call is +executed on a stackful coroutine with its own stack region, i.e. 
each spawn(...) places and executes +the passed lambda on a new stack. Therefore, the static allocation must know how big the coroutines' stacks will be. +The MAX_STACK_SIZE parameter controls the size of these stacks and should be big enough to hold +one such subroutine invocation. Note that this stack must only be big enough to reach the next +spawn(...) call, as that is executed on its own new stack. By default, PLS allocates stacks that +are a multiple of the system's page size to provoke segmentation faults (SIGSEGV) when overrunning them. Thus, it should +be easy to detect if the chosen stack space is too small. Optionally, the profiling mechanism +in PLS can be used to find the exact sizes of the stacks at runtime. + +![](./media/invocation_tree.png) + +The figure above shows a call tree resulting from a parallel invocation, +where the boxes between the dotted lines are stackful coroutines in the parallel region. +The region is entered through a `scheduler.perform_work([&]() { ... });` call at the top dotted line, +switching execution onto the coroutines, and left with a `pls::serial(...)` call that switches back +to a continuous stack (as indicated by the shaded boxes below the dotted line). + +The MAX_SPAWN_DEPTH parameter indicates the maximum nested spawn level, i.e. the depth of the +invocation tree in the parallel region. If this depth is reached, +PLS automatically disables any further parallelism and switches to serial execution. + +The work-stealing algorithm guarantees the busy-leaves property: during an invocation of the +call tree, at most P branches (P being the number of worker threads configured using NUM_THREADS) +are active. Therefore, the invocation never uses more than NUM_THREADS * MAX_SPAWN_DEPTH stacks +and can allocate them during startup using NUM_THREADS * MAX_SPAWN_DEPTH * MAX_STACK_SIZE bytes of memory. + +During execution, the pre-allocated memory is balanced between worker threads using a trading scheme +integrated into the work-stealing procedure. 
This way, the statically allocated memory suffices for every +possible order the call tree can be invoked (as randomized stealing can lead to different execution orders). +The user only needs a mental image of a regular, serial call tree in order to set the above limits and reason about +parallel invocations, as the required memory simply scales up linearly with additional worker threads. + +## Project Structure + +The project uses [CMAKE](https://cmake.org/) as its build system; +the recommended IDE is either a simple text editor or [CLion](https://www.jetbrains.com/clion/). +We divide the project into sub-targets to separate the library +itself, testing, and example code. The library itself can be found in +`lib/pls`, the context switching implementation in `lib/context_switcher`, +testing related code is in `test`, and example and playground/benchmark apps are in `app`. + +The main scheduling code can be found in `/pls/internal/scheduling` and the +resource trading/work-stealing deque implementation responsible for balancing the static +resources and stealing work can be found in `/pls/internal/scheduling/lock_free`. + ### Testing Testing is done using [Catch2](https://github.com/catchorg/Catch2/) in the test subfolder. Tests are built into a target called `tests` and can be executed simply by building this executable and running it. +Currently, only basic tests are implemented. ### PLS profiler @@ -149,55 +207,21 @@ which can later be rendered by the dot software to inspect the actual executed graph. The most useful tools are to analyze the maximum memory required per -coroutine stack, th computational depth, T_1 and T_inf. - -### Data Race Detection - -WARNING: the latest build of clang/thread sanitizer is required for this to work, -as a recent bug-fix regarding user level thread is required! 
- -As this project contains a lot concurrent code we use -[Thread Sanitizer](https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual) -in our CI process and optional in other builds. To setup CMake builds -with sanitizer enabled add the cmake option `-DTHREAD_SANITIZER=ON`. -Please regularly test with thread sanitizer enabled and make sure to not -keep the repository in a state where the sanitizer reports errors. - -Consider reading [the section on common data races](https://github.com/google/sanitizers/wiki/ThreadSanitizerPopularDataRaces) -to get an idea of what we try to avoid in our code. - -### Profiling EasyProfiler - -To make profiling portable and allow us to later analyze the logs -programaticly we use [easy_profiler](https://github.com/yse/easy_profiler) -for capturing data. To enable profiling install the library on your system -(best building it and then running `make install`) and set the -cmake option `-DEASY_PROFILER=ON`. - -After that see the `invoke_parallel` example app for activating the -profiler. This will generate a trace file that can be viewed with -the `profiler_gui ` command. - -Please note that the profiler adds overhead when looking at sub millisecond -method invokations as we do and it can not replace a seperate -profiler like `gperf`, `valgrind` or `vtune amplifier` for detailed analysis. -We still think it makes sense to add it in as an optional feature, -as the customizable colors and fine grained events (including collection -of variables) can be used to visualize the `big picture` of -program execution. Also, we hope to use it to log 'events' like -successful and failed steals in the future, as the general idea of logging -information per thread efficiently might be helpful for further -analysis. - - -### Profiling VTune Amplifier - -For detailed profiling of small performance hotspots we prefer -to use [Intel's VTune Amplifier](https://software.intel.com/en-us/vtune). 
-It gives insights in detailed microachitecture usage and performance -hotspots. Follow the instructions by Intel for using it. -Make sure to enable debug symbols (`-DDEBUG_SYMBOLS=ON`) in the -analyzed build and that all optimizations are turned on -(by choosing the release build). +coroutine stack, the computational depth, T_1 and T_inf. + +To query the stats use the profiler object attached to a scheduler instance. +```c++ +// You can disable the memory measure for better performance +scheduler.get_profiler().disable_memory_measure(); + +// Invoke some parallel algorithm you want to benchmark +scheduler.perform_work([&]() { ... }); + +// Collect stats from last run as required +std::cout << scheduler.get_profiler().current_run().t_1_ << std::endl; +std::cout << scheduler.get_profiler().current_run().t_inf_ << std::endl; +scheduler.get_profiler().current_run().print_dag(std::cout); +... +``` diff --git a/media/invocation_tree.png b/media/invocation_tree.png new file mode 100644 index 0000000..e81d9e9 Binary files /dev/null and b/media/invocation_tree.png differ