Commit ec660473 by FritzFlorian

Rework README

parent ca2c8abd
Pipeline #1598 passed with stages
in 4 minutes 37 seconds
[![pipeline status](http://lab.las3.de/gitlab/las3/development/scheduling/predictable_parallel_patterns/badges/master/pipeline.svg)](http://lab.las3.de/gitlab/las3/development/scheduling/predictable_parallel_patterns/commits/master)
## Getting Started
PLS is a C++ work-stealing library designed to stay close to Blumofe's original randomized work-stealing algorithm.
It therefore maintains the algorithm's theoretical time and space bounds. The key to this is to never
block parallel tasks that are ready to execute. In order to do this, tasks are modeled as stackful
coroutines, allowing them to be paused and resumed at any point. Additionally, PLS allocates all
its memory statically and manages it in a decentralized manner within the stealing procedure.
As a result, a PLS scheduler instance acquires all required resources at startup and performs no
further general-purpose memory management or allocations during runtime.
Further [general notes](NOTES.md) and [performance notes](PERFORMANCE-v2.md) can be found in
their respective files.
PLS is a research prototype developed as a student project in the
Laboratory for Safe and Secure Systems (https://las3.de/).
## API overview
### Basic Usage
PLS implements a nested fork-join API. This programming model originates from Cilk and
its better-known C++ successor Cilk Plus. In contrast to these projects, PLS does not require any
compiler support and can be used as a plain library.
```c++
#include <pls/pls.h>
#include <iostream>
// Static memory allocation (described in detail in a later section)
static const int MAX_SPAWN_DEPTH = 32;
static const int MAX_STACK_SIZE = 4096;
static const int NUM_THREADS = 8;

long fib(long n);

int main() {
// Create a scheduler with the static amount of resources.
// All memory and system resources are allocated here.
pls::scheduler scheduler{NUM_THREADS, MAX_SPAWN_DEPTH, MAX_STACK_SIZE};
// Wake up the thread pool and perform work.
scheduler.perform_work([&] {
    // ...
  });
}

long fib(long n) {
  if (n <= 1) {
    return n;
  }
// Example of the main functions spawn and sync.
// pls::spawn(...) starts a lambda as an asynchronous task.
int a, b;
pls::spawn([&a, n] { a = fib(n - 1); });
pls::spawn([&b, n] { b = fib(n - 2); });
// pls::sync() ensures that all child tasks are finished.
pls::sync();
// After the sync() the produced results can be used safely.
return a + b;
}
```
### Execution Trees and Static Resource Allocation
The API primarily exposes a spawn(...) and a sync() function, which are used to create
nested fork-join parallelism. A spawn(...) call allows the passed lambda to execute
asynchronously; a sync() call forces all direct child invocations to finish before it
returns. The programming model therefore acts mostly as 'asynchronous sub procedure calls'
and is a good fit for all algorithms that are naturally expressed as recursive functions.
The resulting asynchronous invocation tree is then executed in parallel by the runtime system,
which uses randomized work-stealing for load balancing.
To size the static allocation, you need to determine the maximum resources a parallel invocation requires.
The parameters used to allocate the scheduler resources statically are as follows:
- NUM_THREADS: The number of worker threads used for the parallel invocation.
- MAX_SPAWN_DEPTH: The maximum depth of parallel invocations, i.e. the number of nested spawn calls.
- MAX_STACK_SIZE: The stack size used for the coroutines. It must be big enough to fit the stack of a passed
  lambda function up to the point where the next spawn statement appears.
## Installation
This section will give a brief introduction on how to get a minimal
project setup that uses the PLS library.
### Building
PLS has no external dependencies. To compile and install it you
only need CMake and a recent C++ 17 compatible compiler.
Care might be required on not explicitly supported systems
(currently we support Linux x86 and ARMv7; however, all platforms
supported by boost.context should work fine).
Clone the repository and open a terminal session in its folder.
Create a build folder using `mkdir cmake-build-release`
and switch into it using `cd cmake-build-release`.
Set up the CMake project using `cmake ../ -DCMAKE_BUILD_TYPE=RELEASE`,
then install it as a system wide dependency using `sudo make install.pls`.
At this point the library is installed on your system.
To use it simply add it to your existing cmake project using
`find_package(pls REQUIRED)` and then link it to your project
using `target_link_libraries(your_target pls::pls)`.
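Putting the two commands together, a minimal consuming project could use a CMakeLists.txt along these lines (the `hello_pls` project/target name and `main.cpp` source are placeholders):

```cmake
cmake_minimum_required(VERSION 3.10)
project(hello_pls CXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

add_executable(hello_pls main.cpp)

# Locate the system-wide PLS installation and link it to the target.
find_package(pls REQUIRED)
target_link_libraries(hello_pls pls::pls)
```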
Available Settings:
- `-DPLS_PROFILER=ON/OFF`
  - default OFF
  - Enabling it will record execution DAGs with memory and runtime stats
  - Enabling has a BIG performance hit (use only for development)
- `-DSLEEP_WORKERS=ON/OFF`
  - default OFF
  - Enabling it will make workers keep a central 'all workers empty flag'
  - Workers try to sleep if there is no work in the system
  - Has performance impact on isolated runs, but can benefit multiprogrammed systems
- `-DADDRESS_SANITIZER=ON/OFF`
  - default OFF
  - Enables address sanitizer to be linked to the executable
- `-DDEBUG_SYMBOLS=ON/OFF`
  - default OFF
  - Enables the build with debug symbols
  - Use for e.g. profiling the release build
- `-DEASY_PROFILER=ON/OFF`
  - deprecated, not currently used in the project
  - default OFF
  - Enabling will link the easy profiler library and enable its macros
  - Enabling has a performance hit (do not use in releases)
Note that these settings are persistent for one CMake build folder.
If you, for example, set a flag in the debug build, it will not influence
the release build, but it will persist in the debug build folder
until you explicitly change it back.
## Execution Trees and Static Resource Allocation in PLS
As mentioned in the introduction, PLS allocates all resources it requires
when the scheduler instance is created. For this, some parameters have to be provided by
the user. In order to understand them, it is important to be aware of the execution model of
PLS.
During an invocation, sub procedure calls are potentially executed in parallel. Each parallel call is
executed on a stackful coroutine with its own stack region, i.e. each spawn(...) places and executes
the passed lambda on a new stack. The static allocation therefore must know how big the coroutines' stacks will be.
The MAX_STACK_SIZE parameter controls the size of these stacks and should be big enough to hold
one such subroutine invocation. Note that each stack must only be big enough to reach the next
spawn(...) call, as that call is executed on its own new stack. By default, PLS allocates stacks that
are multiples of the system's page size, so that outrunning a stack triggers a segmentation fault.
Thus, it should be easy to detect if the chosen stack space is too small. Optionally, the profiling
mechanism in PLS can be used to find the exact sizes of the stacks at runtime.
![](./media/invocation_tree.png)
The figure above shows a call tree resulting from a parallel invocation,
where the boxes between the dotted lines are stackful coroutines in the parallel region.
The region is entered through a `scheduler.perform_work([&]() { ... });` call at the top dotted line,
switching execution onto the coroutines, and left with a `pls::serial(...)` call that switches back
to a continuous stack (as indicated by the shaded boxes below the dotted line).
The MAX_SPAWN_DEPTH parameter indicates the maximum nested spawn level, i.e. the depth of the
invocation tree in the parallel region. If this depth is reached,
PLS automatically disables any further parallelism and switches to serial execution.

The work-stealing algorithm guarantees the busy-leaves property: during an invocation of the
call tree at most P branches (P being the number of worker threads configured using NUM_THREADS)
are active. Therefore, the invocation never uses more than NUM_THREADS * MAX_SPAWN_DEPTH stacks
and can allocate them during startup using NUM_THREADS * MAX_SPAWN_DEPTH * MAX_STACK_SIZE memory.
During execution, the pre-allocated memory is balanced between worker threads using a trading scheme
integrated into the work-stealing procedure. This way, the statically allocated memory suffices for every
possible order in which the call tree can be invoked (as randomized stealing can lead to different execution orders).
The user only has to picture a regular, serial call tree in order to set the above limits and reason about
parallel invocations, as the required memory simply scales up linearly with additional worker threads.
## Project Structure
The project uses [CMAKE](https://cmake.org/) as its build system;
the recommended IDE is either a simple text editor or [CLion](https://www.jetbrains.com/clion/).
We divide the project into sub-targets to separate the library
itself, testing and example code. The library itself can be found in
`lib/pls`, the context switching implementation in `lib/context_switcher`,
testing related code in `test`, and example and playground/benchmark apps in `app`.
The main scheduling code can be found in `/pls/internal/scheduling` and the
resource trading/work stealing deque implementation responsible for balancing the static
resources and stealing work can be found in `/pls/internal/scheduling/lock_free`.
### Testing
Testing is done using [Catch2](https://github.com/catchorg/Catch2/)
in the `test` subfolder. Tests are built into a target called `tests`
and can be executed simply by building this executable and running it.
Currently, only basic tests are implemented.
### PLS profiler
The profiler records execution DAGs with memory and runtime stats,
which can later be rendered by the dot software to inspect the actual
executed graph.
The most useful statistics to analyze are the maximum memory required per
coroutine stack, the computational depth, T_1 and T_inf.
To query the stats use the profiler object attached to a scheduler instance.
```c++
// You can disable the memory measure for better performance
scheduler.get_profiler().disable_memory_measure();
// Invoke some parallel algorithm you want to benchmark
scheduler.perform_work([&]() { ... });
// Collect stats from last run as required
std::cout << scheduler.get_profiler().current_run().t_1_ << std::endl;
std::cout << scheduler.get_profiler().current_run().t_inf_ << std::endl;
scheduler.get_profiler().current_run().print_dag(std::cout);
...
```