Commit dd8fb1e9 by FritzFlorian

Merge remote-tracking branch 'remotes/origin/master' into parallel_for

# Conflicts:
#	PERFORMANCE.md
#	lib/pls/CMakeLists.txt
#	lib/pls/include/pls/internal/scheduling/fork_join_task.h
parents 4fe555b7 70f72790
......@@ -34,11 +34,12 @@ add_subdirectory(app/playground)
add_subdirectory(app/test_for_new)
add_subdirectory(app/invoke_parallel)
add_subdirectory(app/benchmark_fft)
add_subdirectory(app/benchmark_unbalanced)
# Add optional tests
option(PACKAGE_TESTS "Build the tests" ON)
if(PACKAGE_TESTS)
if (PACKAGE_TESTS)
enable_testing()
add_subdirectory(test)
add_test(NAME AllTests COMMAND tests)
endif()
endif ()
......@@ -4,6 +4,33 @@ A collection of stuff that we noticed during development.
Useful later on to write a project report and to go back
in time to find out why certain decisions were made.
## 06.05.2019 - Relaxed Atomics
At the end of the atomic talk it is mentioned how the relaxed
atomics correspond to modern hardware, stating that starting with
ARMv8 and x86 there is no real need for relaxed atomics,
as the 'strict' ordering has no real performance impact.
We will therefore ignore this for now (taking some potential
performance hits on our banana pi m2 ARMv7 architecture),
as it adds a lot of complexity and introduces many possibilities
for subtle concurrency bugs.
TODO: find some papers on the topic in case we want to mention
it in our project report.
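
As a rough illustration of the trade-off above, the following sketch increments one counter with either the default 'strict' (`std::memory_order_seq_cst`) ordering or `std::memory_order_relaxed`. It uses only the standard library; `count_with_order` is a made-up helper and not part of PLS.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper, not part of PLS: increments a shared counter from
// several threads using the given memory order and returns the final value.
inline long count_with_order(std::memory_order order, int num_threads, int increments_per_thread) {
  assert(num_threads > 0 && increments_per_thread >= 0);
  std::atomic<long> counter{0};
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; t++) {
    threads.emplace_back([&counter, order, increments_per_thread] {
      for (int i = 0; i < increments_per_thread; i++) {
        // The increment itself is atomic for every memory order;
        // the order only controls how surrounding reads/writes may be reordered.
        counter.fetch_add(1, order);
      }
    });
  }
  for (auto &thread : threads) {
    thread.join();
  }
  // join() synchronizes with thread completion, so this load sees all increments.
  return counter.load();
}
```

Both orderings give the same count here because nothing else depends on the counter. The subtle bugs mentioned above only appear once other data is published through a relaxed atomic.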
## 06.05.2019 - Atomics
Atomics are a big part of getting lock-free code running and are also
often useful for details like counters to spin on (e.g. is a task
finished). Resources to understand them are found in
[a talk online](https://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/)
and in the book 'C++ concurrency in action, Anthony Williams'.
The key takeaway is that we need to be careful about ordering
possibilities of variable reads/writes. We will start off with
strict ordering and will have to see whether this
impacts performance and whether we need to optimize it.
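
A minimal sketch of the 'counter to spin on' idea, assuming a parent that waits for its children with the default strict ordering. `counting_task` and `run_children_and_wait` are hypothetical names for illustration, not PLS types.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for a task with children; not the PLS task type.
struct counting_task {
  std::atomic<int> unfinished_children{0};
  std::atomic<int> sum{0};
};

// Spawns one thread per child, then spins until every child has reported back.
inline int run_children_and_wait(int num_children) {
  assert(num_children >= 0);
  counting_task task;
  task.unfinished_children.store(num_children);  // default ordering: seq_cst

  std::vector<std::thread> children;
  for (int i = 1; i <= num_children; i++) {
    children.emplace_back([&task, i] {
      task.sum.fetch_add(i);                  // publish the child's result
      task.unfinished_children.fetch_sub(1);  // then signal completion
    });
  }

  // Spin on the counter; with strict ordering every result write
  // is visible once the counter reads 0.
  while (task.unfinished_children.load() != 0) {
    std::this_thread::yield();
  }
  int result = task.sum.load();

  for (auto &child : children) {
    child.join();
  }
  return result;
}
```

Calling `run_children_and_wait(4)` adds up the child indices 1 to 4 after the spin loop has observed the counter reaching zero.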
## 12.04.2019 - Unique IDs
Assigning unique IDs to logically different tasks is key to the
......
......@@ -63,3 +63,62 @@ to contention, but could not resolve it with any combination
of `tas_spinlock` vs `ttas_spinlock` and `lock` vs `try_lock`.
This issue clearly needs further investigation.
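
For reference, a test-and-test-and-set lock can be sketched as below. This is an assumption-laden illustration using only `std::atomic`; `sketch_ttas_lock` is a made-up name and not the PLS `ttas_spinlock` implementation. A plain test-and-set lock would skip the initial load and always call `exchange`.

```cpp
#include <atomic>

// Illustrative test-and-test-and-set lock; name and layout are assumptions.
class sketch_ttas_lock {
  std::atomic<bool> locked_{false};

 public:
  bool try_lock() {
    // First 'test': a plain load keeps the cache line shared while the lock is held.
    if (locked_.load()) {
      return false;
    }
    // 'Test-and-set': only now attempt the exclusive write.
    return !locked_.exchange(true);
  }

  void lock() {
    while (!try_lock()) {
      // Busy wait; a real lock would add backoff or yielding here.
    }
  }

  void unlock() {
    locked_.store(false);
  }
};
```

The difference between `lock` and `try_lock` in this sketch is only whether the caller retries, which matches the combinations tried above.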
### Commit aa27064 - Performance with ttas spinlocks (and 'full blocking' top level)
<img src="media/aa27064_fft_average.png" width="400"/>
### Commit d16ad3e - Performance with rw-lock and backoff
<img src="media/d16ad3e_fft_average.png" width="400"/>
### Commit 18b2d744 - Performance with lock-free deque
After much tinkering we still have performance problems with higher
thread counts in the FFT benchmark. Upward from 4/5 threads the
performance gains start to saturate (before removing the top level
locks we even saw a slight drop in performance).
Currently the FFT benchmark shows the following results (average):
<img src="media/18b2d744_fft_average.png" width="400"/>
On the positive side, the overall trend of 'performance drops'
at the hyperthreading mark is no longer that bad. It now
seems similar to EMBB (with backoff + lock-free deque + top level
reader-writers lock). This is partly because the spike at 4 threads
is lower (less performance at 4 threads). We also see better times
on the multiprogrammed system with the lock-free deque.
This is discouraging after many tests. To see where the overhead lies
we also implemented the unbalanced tree search benchmark,
resulting in the following, surprisingly good, results (average):
<img src="media/18b2d744_unbalanced_average.png" width="400"/>
The main difference between the two benchmarks is that the second
one has more work and the work is relatively independent.
Additionally, the first one uses our high level API (parallel invoke),
while the second one uses our low level API.
It is worth investigating whether our high level API or the structure
of the memory access in FFT is the problem.
### Commit cf056856 - Remove two-level scheduler
In this test we replace the two level scheduler with ONLY fork_join
tasks. This removes the top level steal overhead and performs only
internal stealing. For this we set the fork_join task as the only
possible task type, removed the top level rw-lock and the digging
down to our level, and solely use internal stealing.
Average results FFT:
<img src="media/cf056856_fft_average.png" width="400"/>
Average results Unbalanced:
<img src="media/cf056856_unbalanced_average.png" width="400"/>
There seems to be only a minor performance difference between the two,
suggesting that our two-level approach is not the part causing our
weaker performance.
......@@ -7,6 +7,8 @@
This section will give a brief introduction on how to get a minimal
project setup that uses the PLS library.
Further [general notes](NOTES.md) and [performance notes](PERFORMANCE.md) can be found in
their respective files.
Further notes on [performance](PERFORMANCE.md) and general
[notes](NOTES.md) on the development progress can be found in
......
......@@ -6,9 +6,9 @@
#include <complex>
#include <vector>
static constexpr int CUTOFF = 10;
static constexpr int CUTOFF = 16;
static constexpr int NUM_ITERATIONS = 1000;
static constexpr int INPUT_SIZE = 2064;
static constexpr int INPUT_SIZE = 8192;
typedef std::vector<std::complex<double>> complex_vector;
void divide(complex_vector::iterator data, int n) {
......@@ -73,7 +73,6 @@ complex_vector prepare_input(int input_size) {
return data;
}
int main() {
PROFILE_ENABLE
complex_vector initial_input = prepare_input(INPUT_SIZE);
......
add_executable(benchmark_unbalanced main.cpp node.h node.cpp picosha2.h)
target_link_libraries(benchmark_unbalanced pls)
if (EASY_PROFILER)
target_link_libraries(benchmark_unbalanced easy_profiler)
endif ()
MIT License
Copyright (c) 2017 okdshin
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
\ No newline at end of file
#include <pls/pls.h>
#include <pls/internal/helpers/profiler.h>
#include <pls/internal/helpers/mini_benchmark.h>
#include "node.h"
const int SEED = 42;
const int ROOT_CHILDREN = 140;
const double Q = 0.124875;
const int NORMAL_CHILDREN = 8;
const int NUM_NODES = 71069;
int count_child_nodes(uts::node &node) {
int child_count = 1;
std::vector<uts::node> children = node.spawn_child_nodes();
if (children.empty()) {
return child_count;
}
auto current_task = pls::fork_join_sub_task::current();
std::vector<int> results(children.size());
for (size_t i = 0; i < children.size(); i++) {
size_t index = i;
auto lambda = [&, index] { results[index] = count_child_nodes(children[index]); };
pls::fork_join_lambda_by_value<decltype(lambda)> sub_task(lambda);
current_task->spawn_child(sub_task);
}
current_task->wait_for_all();
for (auto result : results) {
child_count += result;
}
return child_count;
}
int unbalanced_tree_search(int seed, int root_children, double q, int normal_children) {
static auto id = pls::unique_id::create(42);
int result;
auto lambda = [&] {
uts::node root(seed, root_children, q, normal_children);
result = count_child_nodes(root);
};
pls::fork_join_lambda_by_reference<decltype(lambda)> sub_task(lambda);
pls::fork_join_task root_task{&sub_task, id};
pls::scheduler::execute_task(root_task);
return result;
}
int main() {
PROFILE_ENABLE
pls::internal::helpers::run_mini_benchmark([&] {
unbalanced_tree_search(SEED, ROOT_CHILDREN, Q, NORMAL_CHILDREN);
}, 8, 4000);
PROFILE_SAVE("test_profile.prof")
}
//int main() {
// PROFILE_ENABLE
// pls::malloc_scheduler_memory my_scheduler_memory{8, 2u << 18};
// pls::scheduler scheduler{&my_scheduler_memory, 8};
//
// scheduler.perform_work([&] {
// PROFILE_MAIN_THREAD
// for (int i = 0; i < 10; i++) {
// PROFILE_WORK_BLOCK("Top Level")
// int result = unbalanced_tree_search(SEED, ROOT_CHILDREN, Q, NORMAL_CHILDREN);
// std::cout << result << std::endl;
// }
// });
//
// PROFILE_SAVE("test_profile.prof")
//}
#include "node.h"
namespace uts {
node_state node::generate_child_state(uint32_t index) {
node_state result;
picosha2::hash256_one_by_one hasher;
hasher.process(state_.begin(), state_.end());
auto index_begin = reinterpret_cast<uint8_t *>(&index);
hasher.process(index_begin, index_begin + 4);
hasher.finish();
hasher.get_hash_bytes(result.begin(), result.end());
return result;
}
double node::get_state_random() {
int32_t state_random_integer;
uint32_t b = ((uint32_t) state_[16] << 24) |
((uint32_t) state_[17] << 16) |
((uint32_t) state_[18] << 8) |
((uint32_t) state_[19] << 0);
b = b & 0x7fffffff; // Mask out negative values
state_random_integer = static_cast<int32_t>(b);
return (double) state_random_integer / (double) INT32_MAX;
}
}
#ifndef UTS_NODE_H
#define UTS_NODE_H
#include <cstdint>
#include <array>
#include <vector>
#include "picosha2.h"
namespace uts {
using node_state = std::array<uint8_t, 20>;
/**
* Node of an unbalanced binomial tree (https://www.cs.unc.edu/~olivier/LCPC06.pdf).
* To build up the tree recursively call spawn_child_nodes on each node until leaves are reached.
* The tree is not built up directly in memory, but rather by the recursive calls.
*/
class node {
// The state is used to allow a deterministic tree construction using sha256 hashes.
node_state state_;
// Set this to a positive number for the root node to start the tree with a specific size
int root_children_;
// general branching factors
double q_;
int b_;
// Private constructor for children
node(node_state state, double q, int b) : state_{state}, root_children_{-1}, q_{q}, b_{b} {}
std::array<uint8_t, 20> generate_child_state(uint32_t index);
double get_state_random();
public:
node(int seed, int root_children, double q, int b) : state_({{}}), root_children_{root_children}, q_{q}, b_{b} {
for (int i = 0; i < 16; i++) {
state_[i] = 0;
}
state_[16] = static_cast<uint8_t>(0xFF & (seed >> 24));
state_[17] = static_cast<uint8_t>(0xFF & (seed >> 16));
state_[18] = static_cast<uint8_t>(0xFF & (seed >> 8));
state_[19] = static_cast<uint8_t>(0xFF & (seed >> 0));
picosha2::hash256_one_by_one hasher;
hasher.process(state_.begin(), state_.end());
hasher.finish();
hasher.get_hash_bytes(state_.begin(), state_.end());
}
std::vector<node> spawn_child_nodes() {
double state_random = get_state_random();
int num_children;
if (root_children_ > 0) {
num_children = root_children_; // Root always spawns children
} else if (state_random < q_) {
num_children = b_;
} else {
num_children = 0;
}
std::vector<node> result;
for (int i = 0; i < num_children; i++) {
result.push_back(node(generate_child_state(i), q_, b_));
}
return result;
}
};
}
#endif //UTS_NODE_H
/*
The MIT License (MIT)
Copyright (C) 2017 okdshin
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
#ifndef PICOSHA2_H
#define PICOSHA2_H
// picosha2:20140213
#ifndef PICOSHA2_BUFFER_SIZE_FOR_INPUT_ITERATOR
#define PICOSHA2_BUFFER_SIZE_FOR_INPUT_ITERATOR \
1048576 //=1024*1024: default is 1MB memory
#endif
#include <algorithm>
#include <cassert>
#include <iterator>
#include <sstream>
#include <vector>
#include <fstream>
namespace picosha2 {
typedef unsigned long word_t;
typedef unsigned char byte_t;
static const size_t k_digest_size = 32;
namespace detail {
inline byte_t mask_8bit(byte_t x) { return x & 0xff; }
inline word_t mask_32bit(word_t x) { return x & 0xffffffff; }
const word_t add_constant[64] = {
0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1,
0x923f82a4, 0xab1c5ed5, 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174, 0xe49b69c1, 0xefbe4786,
0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147,
0x06ca6351, 0x14292967, 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85, 0xa2bfe8a1, 0xa81a664b,
0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a,
0x5b9cca4f, 0x682e6ff3, 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2};
const word_t initial_message_digest[8] = {0x6a09e667, 0xbb67ae85, 0x3c6ef372,
0xa54ff53a, 0x510e527f, 0x9b05688c,
0x1f83d9ab, 0x5be0cd19};
inline word_t ch(word_t x, word_t y, word_t z) { return (x & y) ^ ((~x) & z); }
inline word_t maj(word_t x, word_t y, word_t z) {
return (x & y) ^ (x & z) ^ (y & z);
}
inline word_t rotr(word_t x, std::size_t n) {
assert(n < 32);
return mask_32bit((x >> n) | (x << (32 - n)));
}
inline word_t bsig0(word_t x) { return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22); }
inline word_t bsig1(word_t x) { return rotr(x, 6) ^ rotr(x, 11) ^ rotr(x, 25); }
inline word_t shr(word_t x, std::size_t n) {
assert(n < 32);
return x >> n;
}
inline word_t ssig0(word_t x) { return rotr(x, 7) ^ rotr(x, 18) ^ shr(x, 3); }
inline word_t ssig1(word_t x) { return rotr(x, 17) ^ rotr(x, 19) ^ shr(x, 10); }
template <typename RaIter1, typename RaIter2>
void hash256_block(RaIter1 message_digest, RaIter2 first, RaIter2 last) {
assert(first + 64 == last);
static_cast<void>(last); // for avoiding unused-variable warning
word_t w[64];
std::fill(w, w + 64, 0);
for (std::size_t i = 0; i < 16; ++i) {
w[i] = (static_cast<word_t>(mask_8bit(*(first + i * 4))) << 24) |
(static_cast<word_t>(mask_8bit(*(first + i * 4 + 1))) << 16) |
(static_cast<word_t>(mask_8bit(*(first + i * 4 + 2))) << 8) |
(static_cast<word_t>(mask_8bit(*(first + i * 4 + 3))));
}
for (std::size_t i = 16; i < 64; ++i) {
w[i] = mask_32bit(ssig1(w[i - 2]) + w[i - 7] + ssig0(w[i - 15]) +
w[i - 16]);
}
word_t a = *message_digest;
word_t b = *(message_digest + 1);
word_t c = *(message_digest + 2);
word_t d = *(message_digest + 3);
word_t e = *(message_digest + 4);
word_t f = *(message_digest + 5);
word_t g = *(message_digest + 6);
word_t h = *(message_digest + 7);
for (std::size_t i = 0; i < 64; ++i) {
word_t temp1 = h + bsig1(e) + ch(e, f, g) + add_constant[i] + w[i];
word_t temp2 = bsig0(a) + maj(a, b, c);
h = g;
g = f;
f = e;
e = mask_32bit(d + temp1);
d = c;
c = b;
b = a;
a = mask_32bit(temp1 + temp2);
}
*message_digest += a;
*(message_digest + 1) += b;
*(message_digest + 2) += c;
*(message_digest + 3) += d;
*(message_digest + 4) += e;
*(message_digest + 5) += f;
*(message_digest + 6) += g;
*(message_digest + 7) += h;
for (std::size_t i = 0; i < 8; ++i) {
*(message_digest + i) = mask_32bit(*(message_digest + i));
}
}
} // namespace detail
template <typename InIter>
void output_hex(InIter first, InIter last, std::ostream& os) {
os.setf(std::ios::hex, std::ios::basefield);
while (first != last) {
os.width(2);
os.fill('0');
os << static_cast<unsigned int>(*first);
++first;
}
os.setf(std::ios::dec, std::ios::basefield);
}
template <typename InIter>
void bytes_to_hex_string(InIter first, InIter last, std::string& hex_str) {
std::ostringstream oss;
output_hex(first, last, oss);
hex_str.assign(oss.str());
}
template <typename InContainer>
void bytes_to_hex_string(const InContainer& bytes, std::string& hex_str) {
bytes_to_hex_string(bytes.begin(), bytes.end(), hex_str);
}
template <typename InIter>
std::string bytes_to_hex_string(InIter first, InIter last) {
std::string hex_str;
bytes_to_hex_string(first, last, hex_str);
return hex_str;
}
template <typename InContainer>
std::string bytes_to_hex_string(const InContainer& bytes) {
std::string hex_str;
bytes_to_hex_string(bytes, hex_str);
return hex_str;
}
class hash256_one_by_one {
public:
hash256_one_by_one() { init(); }
void init() {
buffer_.clear();
std::fill(data_length_digits_, data_length_digits_ + 4, 0);
std::copy(detail::initial_message_digest,
detail::initial_message_digest + 8, h_);
}
template <typename RaIter>
void process(RaIter first, RaIter last) {
add_to_data_length(static_cast<word_t>(std::distance(first, last)));
std::copy(first, last, std::back_inserter(buffer_));
std::size_t i = 0;
for (; i + 64 <= buffer_.size(); i += 64) {
detail::hash256_block(h_, buffer_.begin() + i,
buffer_.begin() + i + 64);
}
buffer_.erase(buffer_.begin(), buffer_.begin() + i);
}
void finish() {
byte_t temp[64];
std::fill(temp, temp + 64, 0);
std::size_t remains = buffer_.size();
std::copy(buffer_.begin(), buffer_.end(), temp);
temp[remains] = 0x80;
if (remains > 55) {
std::fill(temp + remains + 1, temp + 64, 0);
detail::hash256_block(h_, temp, temp + 64);
std::fill(temp, temp + 64 - 4, 0);
} else {
std::fill(temp + remains + 1, temp + 64 - 4, 0);
}
write_data_bit_length(&(temp[56]));
detail::hash256_block(h_, temp, temp + 64);
}
template <typename OutIter>
void get_hash_bytes(OutIter first, OutIter last) const {
for (const word_t* iter = h_; iter != h_ + 8; ++iter) {
for (std::size_t i = 0; i < 4 && first != last; ++i) {
*(first++) = detail::mask_8bit(
static_cast<byte_t>((*iter >> (24 - 8 * i))));
}
}
}
private:
void add_to_data_length(word_t n) {
word_t carry = 0;
data_length_digits_[0] += n;
for (std::size_t i = 0; i < 4; ++i) {
data_length_digits_[i] += carry;
if (data_length_digits_[i] >= 65536u) {
carry = data_length_digits_[i] >> 16;
data_length_digits_[i] &= 65535u;
} else {
break;
}
}
}
void write_data_bit_length(byte_t* begin) {
word_t data_bit_length_digits[4];
std::copy(data_length_digits_, data_length_digits_ + 4,
data_bit_length_digits);
// convert byte length to bit length (multiply 8 or shift 3 times left)
word_t carry = 0;
for (std::size_t i = 0; i < 4; ++i) {
word_t before_val = data_bit_length_digits[i];
data_bit_length_digits[i] <<= 3;
data_bit_length_digits[i] |= carry;
data_bit_length_digits[i] &= 65535u;
carry = (before_val >> (16 - 3)) & 65535u;
}
// write data_bit_length
for (int i = 3; i >= 0; --i) {
(*begin++) = static_cast<byte_t>(data_bit_length_digits[i] >> 8);
(*begin++) = static_cast<byte_t>(data_bit_length_digits[i]);
}
}
std::vector<byte_t> buffer_;
word_t data_length_digits_[4]; // as 64bit integer (16bit x 4 integer)
word_t h_[8];
};
inline void get_hash_hex_string(const hash256_one_by_one& hasher,
std::string& hex_str) {
byte_t hash[k_digest_size];
hasher.get_hash_bytes(hash, hash + k_digest_size);
return bytes_to_hex_string(hash, hash + k_digest_size, hex_str);
}
inline std::string get_hash_hex_string(const hash256_one_by_one& hasher) {
std::string hex_str;
get_hash_hex_string(hasher, hex_str);
return hex_str;
}
namespace impl {
template <typename RaIter, typename OutIter>
void hash256_impl(RaIter first, RaIter last, OutIter first2, OutIter last2, int,
std::random_access_iterator_tag) {
hash256_one_by_one hasher;
// hasher.init();
hasher.process(first, last);
hasher.finish();
hasher.get_hash_bytes(first2, last2);
}
template <typename InputIter, typename OutIter>
void hash256_impl(InputIter first, InputIter last, OutIter first2,
OutIter last2, int buffer_size, std::input_iterator_tag) {
std::vector<byte_t> buffer(buffer_size);
hash256_one_by_one hasher;
// hasher.init();
while (first != last) {
int size = buffer_size;
for (int i = 0; i != buffer_size; ++i, ++first) {
if (first == last) {
size = i;
break;
}
buffer[i] = *first;
}
hasher.process(buffer.begin(), buffer.begin() + size);
}
hasher.finish();
hasher.get_hash_bytes(first2, last2);
}
}
template <typename InIter, typename OutIter>
void hash256(InIter first, InIter last, OutIter first2, OutIter last2,
int buffer_size = PICOSHA2_BUFFER_SIZE_FOR_INPUT_ITERATOR) {
picosha2::impl::hash256_impl(
first, last, first2, last2, buffer_size,
typename std::iterator_traits<InIter>::iterator_category());
}
template <typename InIter, typename OutContainer>
void hash256(InIter first, InIter last, OutContainer& dst) {
hash256(first, last, dst.begin(), dst.end());
}
template <typename InContainer, typename OutIter>
void hash256(const InContainer& src, OutIter first, OutIter last) {
hash256(src.begin(), src.end(), first, last);
}
template <typename InContainer, typename OutContainer>
void hash256(const InContainer& src, OutContainer& dst) {
hash256(src.begin(), src.end(), dst.begin(), dst.end());
}
template <typename InIter>
void hash256_hex_string(InIter first, InIter last, std::string& hex_str) {
byte_t hashed[k_digest_size];
hash256(first, last, hashed, hashed + k_digest_size);
std::ostringstream oss;
output_hex(hashed, hashed + k_digest_size, oss);
hex_str.assign(oss.str());
}
template <typename InIter>
std::string hash256_hex_string(InIter first, InIter last) {
std::string hex_str;
hash256_hex_string(first, last, hex_str);
return hex_str;
}
inline void hash256_hex_string(const std::string& src, std::string& hex_str) {
hash256_hex_string(src.begin(), src.end(), hex_str);
}
template <typename InContainer>
void hash256_hex_string(const InContainer& src, std::string& hex_str) {
hash256_hex_string(src.begin(), src.end(), hex_str);
}
template <typename InContainer>
std::string hash256_hex_string(const InContainer& src) {
return hash256_hex_string(src.begin(), src.end());
}
template<typename OutIter>void hash256(std::ifstream& f, OutIter first, OutIter last){
hash256(std::istreambuf_iterator<char>(f), std::istreambuf_iterator<char>(), first,last);
}
}// namespace picosha2
#endif // PICOSHA2_H
......@@ -2,48 +2,99 @@
#include <pls/internal/helpers/profiler.h>
#include <iostream>
#include <complex>
#include <vector>
static pls::static_scheduler_memory<8, 2 << 14> my_scheduler_memory;
static constexpr int CUTOFF = 16;
static constexpr int INPUT_SIZE = 8192;
typedef std::vector<std::complex<double>> complex_vector;
static constexpr int CUTOFF = 10;
long fib_serial(long n) {
if (n == 0) {
return 0;
void divide(complex_vector::iterator data, int n) {
complex_vector tmp_odd_elements(n / 2);
for (int i = 0; i < n / 2; i++) {
tmp_odd_elements[i] = data[i * 2 + 1];
}
for (int i = 0; i < n / 2; i++) {
data[i] = data[i * 2];
}
if (n == 1) {
return 1;
for (int i = 0; i < n / 2; i++) {
data[i + n / 2] = tmp_odd_elements[i];
}
}
void combine(complex_vector::iterator data, int n) {
for (int i = 0; i < n / 2; i++) {
std::complex<double> even = data[i];
std::complex<double> odd = data[i + n / 2];
return fib_serial(n - 1) + fib_serial(n - 2);
// w is the "twiddle-factor".
// this could be cached, but we run the same 'data_structures' algorithm parallel/serial,
// so it won't impact the performance comparison.
std::complex<double> w = exp(std::complex<double>(0, -2. * M_PI * i / n));
data[i] = even + w * odd;
data[i + n / 2] = even - w * odd;
}
}
long fib(long n) {
if (n <= CUTOFF) {
return fib_serial(n);
void fft(complex_vector::iterator data, int n) {
if (n < 2) {
return;
}
// Actual 'invoke_parallel' logic/code
int left, right;
PROFILE_WORK_BLOCK("Divide")
divide(data, n);
PROFILE_END_BLOCK
PROFILE_WORK_BLOCK("Invoke Parallel")
if (n == CUTOFF) {
PROFILE_WORK_BLOCK("FFT Serial")
fft(data, n / 2);
fft(data + n / 2, n / 2);
} else if (n <= CUTOFF) {
fft(data, n / 2);
fft(data + n / 2, n / 2);
} else {
pls::invoke_parallel(
[&] { left = fib(n - 1); },
[&] { right = fib(n - 2); }
[&] { fft(data, n / 2); },
[&] { fft(data + n / 2, n / 2); }
);
return left + right;
}
PROFILE_END_BLOCK
PROFILE_WORK_BLOCK("Combine")
combine(data, n);
PROFILE_END_BLOCK
}
complex_vector prepare_input(int input_size) {
std::vector<double> known_frequencies{2, 11, 52, 88, 256};
complex_vector data(input_size);
// Set our input data to match a time series of the known_frequencies.
// When applying fft to this time-series we should find these frequencies.
for (int i = 0; i < input_size; i++) {
data[i] = std::complex<double>(0.0, 0.0);
for (auto frequencie : known_frequencies) {
data[i] += sin(2 * M_PI * frequencie * i / input_size);
}
}
return data;
}
int main() {
PROFILE_ENABLE
pls::malloc_scheduler_memory my_scheduler_memory{8, 2u << 14};
pls::scheduler scheduler{&my_scheduler_memory, 8};
long result;
complex_vector initial_input = prepare_input(INPUT_SIZE);
scheduler.perform_work([&] {
PROFILE_MAIN_THREAD
// Call looks just the same, only requirement is
// the enclosure in the perform_work lambda.
for (int i = 0; i < 10; i++) {
result = fib(30);
std::cout << "Fib(30)=" << result << std::endl;
PROFILE_WORK_BLOCK("Top Level FFT")
complex_vector input = initial_input;
fft(input.begin(), input.size());
}
});
......
......@@ -10,8 +10,6 @@
#include <pls/internal/scheduling/root_task.h>
#include <pls/internal/helpers/unique_id.h>
int main() {
std::cout << pls::internal::scheduling::root_task<void(*)>::create_id().type_.hash_code() << std::endl;
std::cout << pls::internal::helpers::unique_id::create<pls::internal::scheduling::root_task<void(*)>>().type_.hash_code() << std::endl;
}
......@@ -5,9 +5,8 @@ using namespace pls::internal::base;
int global = 0;
int main() {
// Try to use every feature, to trigger the prohibited use of new if found somewhere
auto t1 = start_thread([] (){});
auto t1 = start_thread([]() {});
t1.join();
}
......@@ -6,15 +6,17 @@
#include "pls/internal/scheduling/scheduler.h"
namespace pls {
namespace algorithm {
template<typename Function1, typename Function2>
void invoke_parallel(const Function1& function1, const Function2& function2);
namespace algorithm {
template<typename Function1, typename Function2, typename Function3>
void invoke_parallel(const Function1& function1, const Function2& function2, const Function3& function3);
template<typename Function1, typename Function2>
void invoke_parallel(const Function1 &function1, const Function2 &function2);
// ...and so on, add more if we decide to keep this design
}
template<typename Function1, typename Function2, typename Function3>
void invoke_parallel(const Function1 &function1, const Function2 &function2, const Function3 &function3);
// ...and so on, add more if we decide to keep this design
}
}
#include "invoke_parallel_impl.h"
......
......@@ -2,70 +2,73 @@
#ifndef PLS_INVOKE_PARALLEL_IMPL_H
#define PLS_INVOKE_PARALLEL_IMPL_H
#include <pls/internal/scheduling/fork_join_task.h>
#include "pls/internal/scheduling/fork_join_task.h"
#include "pls/internal/scheduling/scheduler.h"
#include "pls/internal/helpers/unique_id.h"
#include "pls/internal/base/alignment.h"
namespace pls {
namespace algorithm {
namespace internal {
using namespace ::pls::internal::scheduling;
namespace algorithm {
namespace internal {
using namespace ::pls::internal::scheduling;
template<typename Body>
inline void run_body(const Body& internal_body, const abstract_task::id& id) {
template<typename Body>
inline void run_body(const Body &internal_body, const abstract_task::id &id) {
// Make sure we are in the context of this invoke_parallel instance,
// if not we will spawn it as a new 'fork-join-style' task.
auto current_task = scheduler::current_task();
if (current_task->unique_id() == id) {
auto current_sub_task = reinterpret_cast<fork_join_task*>(current_task)->currently_executing();
internal_body(current_sub_task);
internal_body();
} else {
fork_join_lambda<Body> root_body(&internal_body);
fork_join_lambda_by_reference<Body> root_body(internal_body);
fork_join_task root_task{&root_body, id};
scheduler::execute_task(root_task);
}
}
}
}
}
template<typename Function1, typename Function2>
void invoke_parallel(const Function1& function1, const Function2& function2) {
template<typename Function1, typename Function2>
void invoke_parallel(const Function1 &function1, const Function2 &function2) {
using namespace ::pls::internal::scheduling;
using namespace ::pls::internal::helpers;
using namespace ::pls::internal::base;
static abstract_task::id id = unique_id::create<Function1, Function2>();
auto internal_body = [&] (fork_join_sub_task* this_task){
auto sub_task_body_1 = [&] (fork_join_sub_task*){ function1(); };
auto sub_task_1 = fork_join_lambda<decltype(sub_task_body_1)>(&sub_task_body_1);
auto internal_body = [&]() {
auto current_task = fork_join_sub_task::current();
auto sub_task_2 = fork_join_lambda_by_reference<Function2>(function2);
this_task->spawn_child(sub_task_1);
function2(); // Execute last function 'inline' without spawning a sub_task object
this_task->wait_for_all();
current_task->spawn_child(sub_task_2);
function1(); // Execute first function 'inline' without spawning a sub_task object
current_task->wait_for_all();
};
internal::run_body(internal_body, id);
}
}
template<typename Function1, typename Function2, typename Function3>
void invoke_parallel(const Function1& function1, const Function2& function2, const Function3& function3) {
template<typename Function1, typename Function2, typename Function3>
void invoke_parallel(const Function1 &function1, const Function2 &function2, const Function3 &function3) {
using namespace ::pls::internal::scheduling;
using namespace ::pls::internal::helpers;
static abstract_task::id id = unique_id::create<Function1, Function2, Function3>();
auto internal_body = [&] (fork_join_sub_task* this_task){
auto sub_task_body_1 = [&] (fork_join_sub_task*){ function1(); };
auto sub_task_1 = fork_join_lambda<decltype(sub_task_body_1)>(&sub_task_body_1);
auto sub_task_body_2 = [&] (fork_join_sub_task*){ function2(); };
auto sub_task_2 = fork_join_lambda<decltype(sub_task_body_2)>(&sub_task_body_2);
auto internal_body = [&]() {
auto current_task = fork_join_sub_task::current();
auto sub_task_2 = fork_join_lambda_by_reference<Function2>(function2);
auto sub_task_3 = fork_join_lambda_by_reference<Function3>(function3);
this_task->spawn_child(sub_task_1);
this_task->spawn_child(sub_task_2);
function3(); // Execute last function 'inline' without spawning a sub_task object
this_task->wait_for_all();
current_task->spawn_child(sub_task_2);
current_task->spawn_child(sub_task_3);
function1(); // Execute first function 'inline' without spawning a sub_task object
current_task->wait_for_all();
};
internal::run_body(internal_body, id);
}
}
}
}
}
#endif //PLS_INVOKE_PARALLEL_IMPL_H
......@@ -8,21 +8,45 @@
#include "system_details.h"
namespace pls {
namespace internal {
namespace base {
namespace alignment {
template<typename T>
struct aligned_wrapper {
namespace internal {
namespace base {
namespace alignment {
template<typename T>
struct aligned_wrapper {
alignas(system_details::CACHE_LINE_SIZE) unsigned char data[sizeof(T)];
T* pointer() { return reinterpret_cast<T*>(data); }
};
void* allocate_aligned(size_t size);
T *pointer() { return reinterpret_cast<T *>(data); }
};
void *allocate_aligned(size_t size);
system_details::pointer_t next_alignment(system_details::pointer_t size);
system_details::pointer_t previous_alignment(system_details::pointer_t size);
char *next_alignment(char *pointer);
}
template<typename T>
struct aligned_aba_pointer {
const system_details::pointer_t pointer_;
explicit aligned_aba_pointer(T *pointer, unsigned int aba = 0) : pointer_{
reinterpret_cast<system_details::pointer_t >(pointer) + aba} {}
std::uintptr_t next_alignment(std::uintptr_t size);
char* next_alignment(char* pointer);
T *pointer() const {
return reinterpret_cast<T *>(pointer_ & system_details::CACHE_LINE_ADDRESS_USED_BITS);
}
unsigned int aba() const {
return pointer_ & system_details::CACHE_LINE_ADDRESS_UNUSED_BITS;
}
aligned_aba_pointer set_aba(unsigned int aba) const {
return aligned_aba_pointer(pointer(), aba);
}
};
}
}
}
#endif //PLS_ALIGNMENT_H
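Review note: `aligned_aba_pointer` relies on cache-line-aligned addresses having their low 6 bits free. The sketch below illustrates that packing in isolation. The names (`kAlignBits`, `pack`, `unpack_pointer`, `unpack_aba`, `node`) are hypothetical and not part of the library. Unlike the constructor in the diff, which adds the counter unmasked, this sketch masks the counter so it cannot spill into the address bits.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the system_details constants (assumed 64-byte cache lines).
constexpr std::uintptr_t kAlignBits = 6;
constexpr std::uintptr_t kUsedMask = ~std::uintptr_t{0} << kAlignBits;  // real address bits
constexpr std::uintptr_t kUnusedMask = ~kUsedMask;                      // low 6 bits, free for a counter

// Pack a cache-line-aligned pointer and a small ABA counter into one word.
template<typename T>
std::uintptr_t pack(T *pointer, unsigned int aba) {
  // Mask the counter so it wraps around instead of corrupting the address.
  return reinterpret_cast<std::uintptr_t>(pointer) | (aba & kUnusedMask);
}

// Recover the pointer by clearing the counter bits.
template<typename T>
T *unpack_pointer(std::uintptr_t packed) {
  return reinterpret_cast<T *>(packed & kUsedMask);
}

// Recover the counter by clearing the address bits.
inline unsigned int unpack_aba(std::uintptr_t packed) {
  return static_cast<unsigned int>(packed & kUnusedMask);
}

// Example payload; alignas guarantees the low 6 address bits are zero.
struct alignas(64) node {
  int value;
};
```

Because pointer and counter share one word, a single compare-and-swap can update both, which is what makes the counter usable for ABA protection.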
#ifndef PLS_BACKOFF_H_
#define PLS_BACKOFF_H_
#include "pls/internal/base/system_details.h"
#include "pls/internal/helpers/profiler.h"
#include "pls/internal/base/thread.h"
#include <algorithm> // std::min
#include <random>
#include <math.h>
namespace pls {
namespace internal {
namespace base {
class backoff {
const unsigned long INITIAL_SPIN_ITERS = 2u << 4u;
const unsigned long MAX_SPIN_ITERS = 2u << 8u;
const unsigned long MAX_ITERS = 2u << 10u;
const unsigned long YELD_ITERS = 2u << 10u;
unsigned long current_ = INITIAL_SPIN_ITERS;
std::minstd_rand random_;
static void spin(unsigned long iterations) {
for (volatile unsigned long i = 0; i < iterations; i++)
system_details::relax_cpu(); // Spin
}
public:
backoff() : current_{INITIAL_SPIN_ITERS}, random_{std::random_device{}()} {}
void do_backoff() {
PROFILE_LOCK("Backoff")
spin(random_() % std::min(current_, MAX_SPIN_ITERS));
if (current_ >= YELD_ITERS) {
PROFILE_LOCK("Yield")
this_thread::yield();
}
current_ = std::min(current_ * 2, MAX_ITERS);
}
void reset() {
current_ = INITIAL_SPIN_ITERS;
}
};
}
}
}
#endif //PLS_BACKOFF_H_
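Review note: the `backoff` class combines a randomized spin with an exponentially growing limit and a yield once the limit is reached. The standalone sketch below shows the same idea using only the standard library. `simple_backoff`, its constants, and the `current()` accessor are hypothetical. It substitutes an empty loop for `relax_cpu()` and `std::this_thread::yield()` for `this_thread::yield()`.

```cpp
#include <algorithm>
#include <random>
#include <thread>

// Illustrative limits, chosen to match 2u << 4u, 2u << 8u and 2u << 10u.
constexpr unsigned long kInitialIters = 32;
constexpr unsigned long kMaxSpinIters = 512;
constexpr unsigned long kMaxIters = 2048;

class simple_backoff {
  unsigned long current_ = kInitialIters;
  std::minstd_rand random_{std::random_device{}()};

 public:
  void do_backoff() {
    // Spin for a random number of iterations below the current (capped) limit.
    unsigned long limit = std::min(current_, kMaxSpinIters);
    unsigned long spins = random_() % limit;
    for (volatile unsigned long i = 0; i < spins; i = i + 1) {
      // Busy wait; a real implementation would issue a CPU relax hint here.
    }
    // Once the limit is saturated, give the time slice back to the OS.
    if (current_ >= kMaxIters) {
      std::this_thread::yield();
    }
    // Double the limit for the next failed attempt.
    current_ = std::min(current_ * 2, kMaxIters);
  }

  void reset() { current_ = kInitialIters; }
  unsigned long current() const { return current_; }
};
```

The randomization keeps competing threads from retrying in lockstep; `reset()` is meant to be called after a successful acquisition.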
......@@ -5,9 +5,10 @@
#include <pthread.h>
namespace pls {
namespace internal {
namespace base {
/**
namespace internal {
namespace base {
/**
* Provides standard barrier behaviour.
* `count` threads have to call `wait()` before any of the `wait()` calls returns,
* thus blocking all threads until everyone reached the barrier.
......@@ -15,7 +16,7 @@ namespace pls {
* PORTABILITY:
* Current implementation is based on pthreads.
*/
class barrier {
class barrier {
pthread_barrier_t barrier_;
public:
......@@ -23,9 +24,10 @@ namespace pls {
~barrier();
void wait();
};
}
}
};
}
}
}
#endif //PLS_BARRIER_H
......@@ -11,5 +11,6 @@
* (or its inclusion adds too much overhead).
*/
#define PLS_ERROR(msg) std::cout << msg << std::endl; exit(1);
#define PLS_ASSERT(cond, msg) if (!(cond)) { PLS_ERROR(msg) }
#endif //PLS_ERROR_HANDLING_H
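Review note: function-like macros that negate an argument are sensitive to operator precedence, and multi-statement macros can misbehave inside unbraced `if`/`else`. The sketch below uses hypothetical `DEMO_*` macros (not the project's definitions) to show the effect of parenthesizing the argument and wrapping the body in `do { ... } while (0)`.

```cpp
#include <cstdlib>
#include <iostream>

// Hypothetical macros for illustration only.
// do { ... } while (0) makes the macro behave like a single statement.
#define DEMO_ERROR(msg) do { std::cerr << (msg) << std::endl; std::exit(1); } while (0)
#define DEMO_ASSERT(cond, msg) do { if (!(cond)) { DEMO_ERROR(msg); } } while (0)

// With cond = `a || b`, the unsafe form expands to `!a || b`,
// while the safe form expands to `!(a || b)`.
#define DEMO_NEGATE_UNSAFE(cond) (!cond)
#define DEMO_NEGATE_SAFE(cond) (!(cond))

// Usage: the assertion guards the division.
inline int checked_div(int a, int b) {
  DEMO_ASSERT(b != 0, "division by zero");
  return a / b;
}
```

For `a = true, b = true` the safe form yields `false` while the unsafe form yields `true`, so an assertion built on the unsafe form would fire on a condition that actually holds.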
......@@ -6,12 +6,14 @@
#include "ttas_spin_lock.h"
namespace pls {
namespace internal {
namespace base {
// Default Spin-Lock implementation for this project.
using spin_lock = tas_spin_lock;
}
}
namespace internal {
namespace base {
// Default Spin-Lock implementation for this project.
using spin_lock = ttas_spin_lock;
}
}
}
#endif //PLS_SPINLOCK_H
#ifndef PLS_SWMR_SPIN_LOCK_LOCK_H_
#define PLS_SWMR_SPIN_LOCK_LOCK_H_
#include <atomic>
#include "pls/internal/helpers/profiler.h"
namespace pls {
namespace internal {
namespace base {
/**
* Single writer, multiple reader spin lock.
* The writer is required to be the same thread all the time (single writer),
* while multiple threads can read.
* Readers fail to lock when the writer requests the lock,
* the writer then acquires the lock after all remaining readers have left the critical section.
*/
class swmr_spin_lock {
std::atomic<int> readers_;
std::atomic<int> write_request_;
public:
explicit swmr_spin_lock() : readers_{0}, write_request_{0} {}
bool reader_try_lock();
void reader_unlock();
void writer_lock();
void writer_unlock();
};
}
}
}
#endif //PLS_SWMR_SPIN_LOCK_LOCK_H_
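Review note: the header above only declares the four lock operations; their definitions are not part of this excerpt. The sketch below is one way they could be implemented, assuming sequentially consistent atomics and a single writer thread. `swmr_lock_sketch` is a hypothetical name, and the body is an assumption rather than the project's actual code.

```cpp
#include <atomic>

class swmr_lock_sketch {
  std::atomic<int> readers_{0};
  std::atomic<int> write_request_{0};

 public:
  bool reader_try_lock() {
    // Fast path: do not even register while the writer wants the lock.
    if (write_request_.load() == 1) {
      return false;
    }
    // Register as a reader, then re-check: the writer may have
    // published its request between the check and the increment.
    readers_.fetch_add(1);
    if (write_request_.load() == 1) {
      readers_.fetch_sub(1);
      return false;
    }
    return true;
  }

  void reader_unlock() { readers_.fetch_sub(1); }

  void writer_lock() {
    // Publish the request first so that new readers back off ...
    write_request_.store(1);
    // ... then wait until the readers already inside have left.
    while (readers_.load() > 0) {
      // Spin
    }
  }

  void writer_unlock() { write_request_.store(0); }
};
```

The re-check after the increment is what makes the handshake safe: either the reader sees the write request and backs out, or the writer sees the reader count and waits.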
......@@ -3,31 +3,76 @@
#define PLS_SYSTEM_DETAILS_H
#include <cstdint>
#if (COMPILER == MVCC)
#include <emmintrin.h>
#endif
namespace pls {
namespace internal {
namespace base {
/**
namespace internal {
namespace base {
/**
* Collection of system details, e.g. hardware cache line size.
*
* PORTABILITY:
* Currently sane default values for x86.
*/
namespace system_details {
/**
* Most processors have 64 byte cache lines
namespace system_details {
/**
* Pointer Types needed for ABA protection mixed into addresses.
* pointer_t should be an integer type capable of holding ANY pointer value.
*/
using pointer_t = std::uintptr_t;
constexpr pointer_t ZERO_POINTER = 0;
constexpr pointer_t MAX_POINTER = ~ZERO_POINTER;
/**
* Biggest type that supports atomic CAS operations.
* Usually it is sane to assume a pointer can be swapped in a single CAS operation.
*/
using cas_integer = pointer_t;
constexpr cas_integer MIN_CAS_INTEGER = 0;
constexpr cas_integer MAX_CAS_INTEGER = ~MIN_CAS_INTEGER;
constexpr cas_integer FIRST_HALF_CAS_INTEGER = MAX_CAS_INTEGER << ((sizeof(cas_integer) / 2) * 8);
constexpr cas_integer SECOND_HALF_CAS_INTEGER = ~FIRST_HALF_CAS_INTEGER;
/**
* Most processors have 64 byte cache lines (the last 6 bits of an address are zero at the start of a line).
*/
constexpr std::uintptr_t CACHE_LINE_SIZE = 64;
constexpr unsigned int CACHE_LINE_ADDRESS_BITS = 6;
constexpr pointer_t CACHE_LINE_SIZE = 2u << (CACHE_LINE_ADDRESS_BITS - 1);
constexpr pointer_t CACHE_LINE_ADDRESS_USED_BITS = MAX_POINTER << CACHE_LINE_ADDRESS_BITS;
constexpr pointer_t CACHE_LINE_ADDRESS_UNUSED_BITS = ~CACHE_LINE_ADDRESS_USED_BITS;
/**
/**
* Choose one of the following ways to store thread specific data.
* Try to choose the fastest available on this processor/system.
*/
// #define PLS_THREAD_SPECIFIC_PTHREAD
#define PLS_THREAD_SPECIFIC_COMPILER
}
}
}
//#define PLS_THREAD_SPECIFIC_PTHREAD
#define PLS_THREAD_SPECIFIC_COMPILER
/**
* When spinning, one wants to 'relax' the CPU, i.e. tell it that it is in a busy-wait loop
* so it can hold back speculative execution and draw less power.
* This is good both for power draw and for hyperthreading.
*
* Choose the implementation appropriate for your compiler-cpu combination.
*/
#if (COMPILER == MVCC)
inline void relax_cpu() {
_mm_pause();
}
#elif (COMPILER == GCC || COMPILER == LLVM)
inline void relax_cpu() {
asm("pause");
}
#endif
}
}
}
}
#endif //PLS_SYSTEM_DETAILS_H
......@@ -10,30 +10,29 @@
#include "pls/internal/base/thread.h"
namespace pls {
namespace internal {
namespace base {
/**
namespace internal {
namespace base {
/**
* A simple set and test_and_set based spin lock implementation.
*
* PORTABILITY:
* Current implementation is based on C++ 11 atomic_flag.
*/
class tas_spin_lock {
class tas_spin_lock {
std::atomic_flag flag_;
unsigned int yield_at_tries_;
public:
tas_spin_lock(): flag_{ATOMIC_FLAG_INIT}, yield_at_tries_{1024} {};
tas_spin_lock(const tas_spin_lock& other): flag_{ATOMIC_FLAG_INIT}, yield_at_tries_{other.yield_at_tries_} {}
tas_spin_lock() : flag_{ATOMIC_FLAG_INIT} {};
tas_spin_lock(const tas_spin_lock &/*other*/) : flag_{ATOMIC_FLAG_INIT} {}
void lock();
bool try_lock(unsigned int num_tries=1);
bool try_lock(unsigned int num_tries = 1);
void unlock();
};
}
}
}
};
}
}
}
#endif //PLS_TAS_SPIN_LOCK_H
......@@ -9,15 +9,17 @@
#include <functional>
#include <pthread.h>
#include <atomic>
#include <time.h>
#include "system_details.h"
namespace pls {
namespace internal {
namespace base {
using thread_entrypoint = void();
namespace internal {
namespace base {
/**
using thread_entrypoint = void();
/**
* Static methods that can be performed on the current thread.
*
* usage:
......@@ -27,21 +29,27 @@ namespace pls {
* PORTABILITY:
* Current implementation is based on pthreads.
*/
class this_thread {
class this_thread {
template<typename Function, typename State>
friend class thread;
friend
class thread;
#ifdef PLS_THREAD_SPECIFIC_PTHREAD
static pthread_key_t local_storage_key_;
static bool local_storage_key_initialized_;
#endif
#ifdef PLS_THREAD_SPECIFIC_COMPILER
static __thread void* local_state_;
static __thread void *local_state_;
#endif
public:
static void yield() {
pthread_yield();
}
static void sleep(long microseconds) {
timespec time{0, 1000 * microseconds};
nanosleep(&time, nullptr);
}
/**
* Retrieves the local state pointer.
*
......@@ -49,7 +57,7 @@ namespace pls {
* @return The state pointer held for this thread.
*/
template<typename T>
static T* state();
static T *state();
/**
* Stores a pointer to the thread local state object.
......@@ -60,10 +68,10 @@ namespace pls {
* @param state_pointer A pointer to the thread's state object.
*/
template<typename T>
static void set_state(T* state_pointer);
};
static void set_state(T *state_pointer);
};
/**
/**
* Abstraction for starting a function in a separate thread.
*
* @tparam Function Lambda being started on the new thread.
......@@ -79,43 +87,44 @@ namespace pls {
* PORTABILITY:
* Current implementation is based on pthreads.
*/
template<typename Function, typename State>
class thread {
template<typename Function, typename State>
class thread {
friend class this_thread;
// Keep a copy of the function (lambda) in this object to make sure it is valid when called!
Function function_;
State* state_pointer_;
State *state_pointer_;
// We need to wait for the started function to read
// the function_ and state_pointer_ properties before returning
// from the constructor, as the object might be moved after this.
std::atomic_flag* startup_flag_;
std::atomic_flag *startup_flag_;
// Keep handle to native implementation
pthread_t pthread_thread_;
static void* start_pthread_internal(void* thread_pointer);
static void *start_pthread_internal(void *thread_pointer);
public:
explicit thread(const Function& function, State* state_pointer);
explicit thread(const Function &function, State *state_pointer);
public:
void join();
// make object move only
thread(thread&&) noexcept = default;
thread& operator=(thread&&) noexcept = default;
thread(thread &&) noexcept = default;
thread &operator=(thread &&) noexcept = default;
thread(const thread&) = delete;
thread& operator=(const thread&) = delete;
};
thread(const thread &) = delete;
thread &operator=(const thread &) = delete;
};
template<typename Function, typename State>
thread<Function, State> start_thread(const Function& function, State* state_pointer);
template<typename Function>
thread<Function, void> start_thread(const Function& function);
}
}
template<typename Function, typename State>
thread<Function, State> start_thread(const Function &function, State *state_pointer);
template<typename Function>
thread<Function, void> start_thread(const Function &function);
}
}
}
#include "thread_impl.h"
......
......@@ -3,33 +3,34 @@
#define PLS_THREAD_IMPL_H
namespace pls {
namespace internal {
namespace base {
template<typename T>
T* this_thread::state() {
namespace internal {
namespace base {
template<typename T>
T *this_thread::state() {
#ifdef PLS_THREAD_SPECIFIC_PTHREAD
return reinterpret_cast<T*>(pthread_getspecific(local_storage_key_));
#endif
#ifdef PLS_THREAD_SPECIFIC_COMPILER
return reinterpret_cast<T*>(local_state_);
return reinterpret_cast<T *>(local_state_);
#endif
}
}
template<typename T>
void this_thread::set_state(T* state_pointer) {
template<typename T>
void this_thread::set_state(T *state_pointer) {
#ifdef PLS_THREAD_SPECIFIC_PTHREAD
pthread_setspecific(this_thread::local_storage_key_, (void*)state_pointer);
#endif
#ifdef PLS_THREAD_SPECIFIC_COMPILER
local_state_ = state_pointer;
#endif
}
}
template<typename Function, typename State>
void* thread<Function, State>::start_pthread_internal(void* thread_pointer) {
auto my_thread = reinterpret_cast<thread*>(thread_pointer);
template<typename Function, typename State>
void *thread<Function, State>::start_pthread_internal(void *thread_pointer) {
auto my_thread = reinterpret_cast<thread *>(thread_pointer);
Function my_function_copy = my_thread->function_;
State* my_state_pointer_copy = my_thread->state_pointer_;
State *my_state_pointer_copy = my_thread->state_pointer_;
// Now we have copies of everything we need on the stack.
// The original thread object can be moved freely (no more
......@@ -41,10 +42,10 @@ namespace pls {
// Finished executing the user function
pthread_exit(nullptr);
}
}
template<typename Function, typename State>
thread<Function, State>::thread(const Function& function, State* state_pointer):
template<typename Function, typename State>
thread<Function, State>::thread(const Function &function, State *state_pointer):
function_{function},
state_pointer_{state_pointer},
startup_flag_{nullptr},
......@@ -62,27 +63,27 @@ namespace pls {
startup_flag_ = &startup_flag;
startup_flag.test_and_set(); // Set the flag, pthread will clear it when it is safe to return
pthread_create(&pthread_thread_, nullptr, start_pthread_internal, (void *)(this));
while (startup_flag.test_and_set())
; // Busy waiting for the starting flag to clear
}
pthread_create(&pthread_thread_, nullptr, start_pthread_internal, (void *) (this));
while (startup_flag.test_and_set()); // Busy waiting for the starting flag to clear
}
template<typename Function, typename State>
void thread<Function, State>::join() {
template<typename Function, typename State>
void thread<Function, State>::join() {
pthread_join(pthread_thread_, nullptr);
}
}
template<typename Function, typename State>
thread<Function, State> start_thread(const Function& function, State* state_pointer) {
template<typename Function, typename State>
thread<Function, State> start_thread(const Function &function, State *state_pointer) {
return thread<Function, State>(function, state_pointer);
}
}
template<typename Function>
thread<Function, void> start_thread(const Function& function) {
template<typename Function>
thread<Function, void> start_thread(const Function &function) {
return thread<Function, void>(function, nullptr);
}
}
}
}
}
}
}
#endif //PLS_THREAD_IMPL_H
......@@ -6,32 +6,30 @@
#include <iostream>
#include "pls/internal/base/thread.h"
#include "pls/internal/base/backoff.h"
namespace pls {
namespace internal {
namespace base {
/**
namespace internal {
namespace base {
/**
* A simple test-and-test-and-set based spin lock implementation.
*
* PORTABILITY:
* Current implementation is based on C++ 11 std::atomic.
*/
class ttas_spin_lock {
class ttas_spin_lock {
std::atomic<int> flag_;
const unsigned int yield_at_tries_;
public:
ttas_spin_lock(): flag_{0}, yield_at_tries_{1024} {};
ttas_spin_lock(const ttas_spin_lock& other): flag_{0}, yield_at_tries_{other.yield_at_tries_} {}
ttas_spin_lock() : flag_{0} {};
ttas_spin_lock(const ttas_spin_lock &/*other*/) : flag_{0} {}
void lock();
bool try_lock(unsigned int num_tries=1);
bool try_lock(unsigned int num_tries = 1);
void unlock();
};
}
}
};
}
}
}
#endif //PLS_TTAS_SPIN_LOCK_H
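Review note: the definitions of `lock()`, `try_lock()` and `unlock()` are not shown in this excerpt. As a hedged illustration of the test-and-test-and-set idea, the sketch below first spins on a plain load and only attempts the atomic exchange once the lock looks free. `ttas_lock_sketch` is a hypothetical name; it uses default (sequentially consistent) ordering and `std::this_thread::yield()` instead of the project's `backoff` class.

```cpp
#include <atomic>
#include <thread>

class ttas_lock_sketch {
  std::atomic<int> flag_{0};

 public:
  void lock() {
    while (true) {
      // 'Test': wait on a read-only load, which stays in the local cache
      // and does not invalidate the line for other spinning threads.
      while (flag_.load() == 1) {
        std::this_thread::yield();
      }
      // 'Test-and-set': only now try the expensive atomic write.
      int expected = 0;
      if (flag_.compare_exchange_weak(expected, 1)) {
        return;
      }
    }
  }

  bool try_lock() {
    if (flag_.load() == 1) {
      return false;
    }
    int expected = 0;
    return flag_.compare_exchange_strong(expected, 1);
  }

  void unlock() { flag_.store(0); }
};
```

Compared with a plain test-and-set lock, waiting threads generate far less cache-coherence traffic, since they only write when the lock appears to be free.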
......@@ -9,9 +9,12 @@
#include "pls/internal/base/alignment.h"
namespace pls {
namespace internal {
namespace data_structures {
/**
namespace internal {
namespace data_structures {
using base::system_details::pointer_t;
/**
* Generic stack-like data structure that allows allocating arbitrary objects in a given memory region.
* The objects will be stored aligned in the stack, making the storage cache friendly and very fast
* (as long as one can live with the stack restrictions).
......@@ -23,31 +26,33 @@ namespace pls {
* T* pointer = stack.push(some_object); // Copy-construct the object on top of the stack
* stack.pop<T>(); // Deconstruct the top object of type T
*/
class aligned_stack {
class aligned_stack {
// Keep bounds of our memory block
char* memory_start_;
char* memory_end_;
pointer_t memory_start_;
pointer_t memory_end_;
// Current head will always be aligned to cache lines
char* head_;
pointer_t head_;
public:
typedef char* state;
typedef pointer_t state;
aligned_stack(): memory_start_{nullptr}, memory_end_{nullptr}, head_{nullptr} {};
aligned_stack(char* memory_region, std::size_t size);
aligned_stack() : memory_start_{0}, memory_end_{0}, head_{0} {};
aligned_stack(pointer_t memory_region, std::size_t size);
aligned_stack(char *memory_region, std::size_t size);
template<typename T>
T* push(const T& object);
T *push(const T &object);
template<typename T>
void* push();
void *push();
template<typename T>
T pop();
state save_state() const { return head_; }
void reset_state(state new_state) { head_ = new_state; }
};
}
}
};
}
}
}
#include "aligned_stack_impl.h"
......
......@@ -3,17 +3,18 @@
#define PLS_ALIGNED_STACK_IMPL_H
namespace pls {
namespace internal {
namespace data_structures {
template<typename T>
T* aligned_stack::push(const T& object) {
namespace internal {
namespace data_structures {
template<typename T>
T *aligned_stack::push(const T &object) {
// Copy-Construct
return new ((void*)push<T>())T(object);
}
return new(push < T > ())T(object);
}
template<typename T>
void* aligned_stack::push() {
void* result = reinterpret_cast<T*>(head_);
template<typename T>
void *aligned_stack::push() {
void *result = reinterpret_cast<T *>(head_);
// Move head to next aligned position after new object
head_ = base::alignment::next_alignment(head_ + sizeof(T));
......@@ -22,15 +23,16 @@ namespace pls {
}
return result;
}
}
template<typename T>
T aligned_stack::pop() {
template<typename T>
T aligned_stack::pop() {
head_ = head_ - base::alignment::next_alignment(sizeof(T));
return *reinterpret_cast<T*>(head_);
}
}
}
return *reinterpret_cast<T *>(head_);
}
}
}
}
#endif //PLS_ALIGNED_STACK_IMPL_H
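Review note: after this change the stack tracks its bounds and head as integers (`pointer_t`) instead of `char*`. The sketch below shows the resulting bump-allocation scheme in isolation. `bump_stack_sketch`, `kLine` and this `next_alignment` are hypothetical; in particular, the round-up formula is an assumption about what `base::alignment::next_alignment` does, since its definition is not part of this excerpt.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>

constexpr std::uintptr_t kLine = 64;  // assumed cache line size

// Round up to the next multiple of the cache line size (kLine is a power of two).
inline std::uintptr_t next_alignment(std::uintptr_t value) {
  return (value + kLine - 1) & ~(kLine - 1);
}

class bump_stack_sketch {
  std::uintptr_t start_;
  std::uintptr_t end_;
  std::uintptr_t head_;  // always aligned to a cache line

 public:
  bump_stack_sketch(char *memory, std::size_t size)
      : start_{next_alignment(reinterpret_cast<std::uintptr_t>(memory))},
        end_{reinterpret_cast<std::uintptr_t>(memory) + size},
        head_{start_} {}

  // Copy-construct an object at the head and move the head
  // to the next aligned position behind it.
  template<typename T>
  T *push(const T &object) {
    std::uintptr_t new_head = next_alignment(head_ + sizeof(T));
    if (new_head > end_) {
      return nullptr;  // out of memory
    }
    void *slot = reinterpret_cast<void *>(head_);
    head_ = new_head;
    return new (slot) T(object);
  }

  // Saving and resetting the head releases everything pushed in between.
  std::uintptr_t save_state() const { return head_; }
  void reset_state(std::uintptr_t state) { head_ = state; }
};
```

Saving the head before spawning work and resetting it afterwards frees all intermediate allocations in constant time, which is the pattern `stack_state_` and `deque_state_` follow in the scheduler.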
......@@ -5,56 +5,58 @@
#include "pls/internal/base/spin_lock.h"
namespace pls {
namespace internal {
namespace data_structures {
/**
namespace internal {
namespace data_structures {
/**
* Turns any object into a deque item when inheriting from this.
*/
class deque_item {
class deque_item {
friend class deque_internal;
deque_item* prev_;
deque_item* next_;
deque_item *prev_;
deque_item *next_;
};
};
class deque_internal {
class deque_internal {
protected:
deque_item* head_;
deque_item* tail_;
deque_item *head_;
deque_item *tail_;
base::spin_lock lock_;
deque_item* pop_head_internal();
deque_item* pop_tail_internal();
deque_item *pop_head_internal();
deque_item *pop_tail_internal();
void push_tail_internal(deque_item *new_item);
};
};
/**
/**
* A doubly linked list based deque.
* Storage is therefore only needed for the individual items.
*
* @tparam Item The type of items stored in this deque
*/
template<typename Item>
class deque: deque_internal {
template<typename Item>
class deque : deque_internal {
public:
explicit deque(): deque_internal{} {}
explicit deque() : deque_internal{} {}
inline Item* pop_head() {
return static_cast<Item*>(pop_head_internal());
inline Item *pop_head() {
return static_cast<Item *>(pop_head_internal());
}
inline Item* pop_tail() {
return static_cast<Item*>(pop_tail_internal());
inline Item *pop_tail() {
return static_cast<Item *>(pop_tail_internal());
}
inline void push_tail(Item* new_item) {
inline void push_tail(Item *new_item) {
push_tail_internal(new_item);
}
};
}
}
};
}
}
}
#endif //PLS_DEQUE_H
#ifndef PLS_WORK_STEALING_DEQUE_H_
#define PLS_WORK_STEALING_DEQUE_H_
#include <atomic>
#include <mutex>
#include <pls/internal/scheduling/thread_state.h>
#include "pls/internal/base/system_details.h"
#include "pls/internal/base/spin_lock.h"
#include "pls/internal/base/error_handling.h"
#include "aligned_stack.h"
namespace pls {
namespace internal {
namespace data_structures {
using cas_integer = base::system_details::cas_integer;
using pointer_t = base::system_details::pointer_t;
static cas_integer get_stamp(cas_integer n) {
return (n & base::system_details::FIRST_HALF_CAS_INTEGER) >> ((sizeof(cas_integer) / 2) * 8);
}
static cas_integer get_offset(cas_integer n) {
return n & base::system_details::SECOND_HALF_CAS_INTEGER;
}
static cas_integer set_stamp(cas_integer n, cas_integer new_value) {
return (new_value << ((sizeof(cas_integer) / 2) * 8)) | (n & base::system_details::SECOND_HALF_CAS_INTEGER);
}
//static cas_integer set_offset(cas_integer n, cas_integer new_value) {
// return new_value | (n & base::system_details::FIRST_HALF_CAS_INTEGER);
//}
class work_stealing_deque_item {
// Pointer to the actual data
pointer_t data_;
// Index (relative to stack base) to the next and previous element
cas_integer next_item_;
cas_integer previous_item_;
public:
work_stealing_deque_item() : data_{0}, next_item_{0}, previous_item_{0} {}
template<typename Item>
Item *data() {
return reinterpret_cast<Item *>(data_);
}
template<typename Item>
void set_data(Item *data) {
data_ = reinterpret_cast<pointer_t >(data);
}
cas_integer next_item() {
return next_item_;
}
void set_next_item(cas_integer next_item) {
next_item_ = next_item;
}
cas_integer previous_item() {
return previous_item_;
}
void set_previous_item(cas_integer previous_item) {
previous_item_ = previous_item;
}
};
static_assert(sizeof(work_stealing_deque_item) < base::system_details::CACHE_LINE_SIZE,
"Work stealing deque relies on memory layout and requires cache lines to be longer than one 'work_stealing_deque_item' instance!");
template<typename Item>
class work_stealing_deque {
// Deque 'takes over' stack and handles memory management while in use.
// At any point in time the deque can stop using more memory and the stack can be used by other entities.
aligned_stack *stack_;
pointer_t base_pointer_;
std::atomic<cas_integer> head_;
std::atomic<cas_integer> tail_;
cas_integer previous_tail_;
base::spin_lock lock_{}; // TODO: Remove after debugging
public:
using state = aligned_stack::state;
explicit work_stealing_deque(aligned_stack *stack) : stack_{stack},
base_pointer_{0},
head_{0},
tail_{0},
previous_tail_{0} {
reset_base_pointer();
}
work_stealing_deque(const work_stealing_deque &other) : stack_{other.stack_},
base_pointer_{other.base_pointer_},
head_{other.head_.load()},
tail_{other.tail_.load()},
previous_tail_{other.previous_tail_} {}
void reset_base_pointer() {
base_pointer_ = reinterpret_cast<pointer_t >(stack_->save_state()); // Keep the base of our region in the stack
}
work_stealing_deque_item *item_at(cas_integer position) {
return reinterpret_cast<work_stealing_deque_item *>(base_pointer_
+ (base::system_details::CACHE_LINE_SIZE * position));
}
cas_integer current_stack_offset() {
return (stack_->save_state() - base_pointer_) / base::system_details::CACHE_LINE_SIZE;
}
template<typename T>
std::pair<work_stealing_deque_item, T> *allocate_item(const T &new_item) {
// 'Union' type to push both on stack
using pair_t = std::pair<work_stealing_deque_item, T>;
// Allocate space on stack
auto new_pair = reinterpret_cast<pair_t *>(stack_->push<pair_t>());
// Initialize memory on stack
new((void *) &(new_pair->first)) work_stealing_deque_item();
new((void *) &(new_pair->second)) T(new_item);
return new_pair;
}
template<typename T>
Item *push_tail(const T &new_item) {
cas_integer local_tail = tail_;
auto new_pair = allocate_item(new_item);
// Prepare current tail to point to correct next items
auto tail_deque_item = item_at(local_tail);
tail_deque_item->set_data(&(new_pair->second));
tail_deque_item->set_next_item(current_stack_offset());
tail_deque_item->set_previous_item(previous_tail_);
previous_tail_ = local_tail;
// Linearization point, item appears after this write
cas_integer new_tail = current_stack_offset();
tail_ = new_tail;
return &(new_pair->second);
}
Item *pop_tail() {
cas_integer local_tail = tail_;
cas_integer local_head = head_;
if (local_tail <= get_offset(local_head)) {
return nullptr; // EMPTY
}
work_stealing_deque_item *previous_tail_item = item_at(previous_tail_);
cas_integer new_tail = previous_tail_;
previous_tail_ = previous_tail_item->previous_item();
// Publish our wish to set the tail back
tail_ = new_tail;
// Get the state of local head AFTER we published our wish
local_head = head_; // Linearization point, outside knows list is empty
if (get_offset(local_head) < new_tail) {
return previous_tail_item->data<Item>(); // Success, enough distance to other threads
}
if (get_offset(local_head) == new_tail) {
cas_integer new_head = set_stamp(new_tail, get_stamp(local_head) + 1);
// Try competing with consumers by updating the head's stamp value
if (head_.compare_exchange_strong(local_head, new_head)) {
return previous_tail_item->data<Item>(); // SUCCESS, we won the competition with other threads
}
}
// Some other thread either won the competition or it already set the head further than we are
// before we even tried to compete with it.
// Reset the queue into an empty state => head_ = tail_
tail_ = get_offset(local_head); // ...we give up to the other winning thread
return nullptr; // EMPTY, we lost the competition with other threads
}
Item *pop_head() {
cas_integer local_head = head_;
cas_integer local_tail = tail_;
if (local_tail <= get_offset(local_head)) {
return nullptr; // EMPTY
}
// Load info on current deque item.
// In case we have a race with a new (aba) overwritten item at this position,
// there has to be a competition over the tail -> the stamp increased and our next
// operation will fail anyways!
work_stealing_deque_item *head_deque_item = item_at(get_offset(local_head));
cas_integer next_item_offset = head_deque_item->next_item();
Item *head_data_item = head_deque_item->data<Item>();
// We try to set the head to this new position.
// Possible outcomes:
// 1) no one interrupted us, we win this competition
// 2) other thread took the head, we lose to this
// 3) owning thread removed tail, we lose to this
cas_integer new_head = set_stamp(next_item_offset, get_stamp(local_head) + 1);
if (head_.compare_exchange_strong(local_head, new_head)) {
return head_data_item; // SUCCESS, we won the competition
}
return nullptr; // EMPTY, we lost the competition
}
void release_memory_until(state state) {
cas_integer item_offset = (state - base_pointer_) / base::system_details::CACHE_LINE_SIZE;
cas_integer local_head = head_;
cas_integer local_tail = tail_;
stack_->reset_state(state);
if (item_offset < local_tail) {
tail_ = item_offset;
if (get_offset(local_head) >= local_tail) {
head_ = set_stamp(item_offset, get_stamp(local_head) + 1);
}
}
}
void release_memory_until(Item *item) {
release_memory_until(reinterpret_cast<pointer_t >(item));
}
state save_state() {
return stack_->save_state();
}
};
}
}
}
#endif //PLS_WORK_STEALING_DEQUE_H_
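Review note: the deque's `head_` stores a stamp in the upper half of the word and an offset in the lower half, so a compare-and-swap fails whenever the stamp has moved on, even if the offset looks unchanged. The sketch below isolates that mechanism. `cas_word`, `stamp_of`, `offset_of`, `with_stamp` and `try_advance` are hypothetical names that mirror, but do not replace, `get_stamp`, `get_offset` and `set_stamp`.

```cpp
#include <atomic>
#include <cstdint>

using cas_word = std::uintptr_t;

constexpr unsigned kHalfBits = (sizeof(cas_word) / 2) * 8;
constexpr cas_word kStampMask = ~cas_word{0} << kHalfBits;  // upper half
constexpr cas_word kOffsetMask = ~kStampMask;               // lower half

inline cas_word stamp_of(cas_word n) { return (n & kStampMask) >> kHalfBits; }
inline cas_word offset_of(cas_word n) { return n & kOffsetMask; }

// Combine an offset with a stamp into one CAS-able word.
inline cas_word with_stamp(cas_word offset, cas_word stamp) {
  return (stamp << kHalfBits) | (offset & kOffsetMask);
}

// Try to move the head to new_offset, based on a previously observed value.
// The stamp is incremented on every successful update, so a stale
// observation is rejected even if the offset has returned to the same value.
inline bool try_advance(std::atomic<cas_word> &head, cas_word observed, cas_word new_offset) {
  cas_word desired = with_stamp(new_offset, stamp_of(observed) + 1);
  return head.compare_exchange_strong(observed, desired);
}
```

A thief that read the head before the owner emptied and refilled the deque holds a stale stamp, so its `try_advance` fails instead of stealing an item that was overwritten in the meantime.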
......@@ -9,11 +9,12 @@
#include <iostream>
namespace pls {
namespace internal {
namespace helpers {
// TODO: Clean up (separate into small functions and .cpp file)
template<typename Function>
void run_mini_benchmark(const Function& lambda, size_t max_threads, unsigned long max_runtime_ms=1000) {
namespace internal {
namespace helpers {
// TODO: Clean up (separate into small functions and .cpp file)
template<typename Function>
void run_mini_benchmark(const Function &lambda, size_t max_threads, unsigned long max_runtime_ms = 1000) {
using namespace std;
using namespace pls::internal::scheduling;
......@@ -37,7 +38,7 @@ namespace pls {
});
long time = chrono::duration_cast<chrono::microseconds>(end_time - start_time).count();
double time_per_iteration = (double)time / iterations;
double time_per_iteration = (double) time / iterations;
std::cout << time_per_iteration;
if (num_threads < max_threads) {
......@@ -45,9 +46,10 @@ namespace pls {
}
}
std::cout << std::endl;
}
}
}
}
}
}
}
#endif //PLS_MINI_BENCHMARK_H
......@@ -15,7 +15,7 @@
#ifdef NEW_LINK_ERROR
// This will cause a linker error if new is used in the code.
// We also exit if it is somehow still called.
inline void * operator new (std::size_t) {
inline void *operator new(std::size_t) {
extern int bare_new_erroneously_called();
exit(bare_new_erroneously_called() | 1);
}
......
......@@ -7,12 +7,13 @@
#include <stdint.h>
namespace pls {
namespace internal {
namespace helpers {
struct unique_id {
namespace internal {
namespace helpers {
struct unique_id {
const uint32_t id_;
const std::type_info& type_;
bool operator==(const unique_id& other) const { return id_ == other.id_ && type_ == other.type_; }
const std::type_info &type_;
bool operator==(const unique_id &other) const { return id_ == other.id_ && type_ == other.type_; }
static constexpr unique_id create(const uint32_t id) {
return unique_id(id, typeid(void));
......@@ -22,10 +23,11 @@ namespace pls {
return unique_id(UINT32_MAX, typeid(std::tuple<T...>));
}
private:
explicit constexpr unique_id(const uint32_t id, const std::type_info& type): id_{id}, type_{type} {};
};
}
}
explicit constexpr unique_id(const uint32_t id, const std::type_info &type) : id_{id}, type_{type} {};
};
}
}
}
#endif //PLS_UNIQUE_ID_H
......@@ -2,42 +2,44 @@
#ifndef PLS_ABSTRACT_TASK_H
#define PLS_ABSTRACT_TASK_H
#include "pls/internal/base/spin_lock.h"
#include "pls/internal/base/swmr_spin_lock.h"
#include "pls/internal/helpers/unique_id.h"
namespace pls {
namespace internal {
namespace scheduling {
class abstract_task {
namespace internal {
namespace scheduling {
class abstract_task {
public:
using id = helpers::unique_id;
private:
unsigned int depth_;
abstract_task::id unique_id_;
abstract_task* child_task_;
abstract_task *volatile child_task_;
public:
abstract_task(const unsigned int depth, const abstract_task::id& unique_id):
abstract_task(const unsigned int depth, const abstract_task::id &unique_id) :
depth_{depth},
unique_id_{unique_id},
child_task_{nullptr} {}
virtual void execute() = 0;
void set_child(abstract_task* child_task) { child_task_ = child_task; }
abstract_task* child() { return child_task_; }
void set_child(abstract_task *child_task) { child_task_ = child_task; }
abstract_task *child() const { return child_task_; }
void set_depth(unsigned int depth) { depth_ = depth; }
unsigned int depth() const { return depth_; }
id unique_id() const { return unique_id_; }
protected:
virtual bool internal_stealing(abstract_task* other_task) = 0;
virtual bool split_task(base::spin_lock* lock) = 0;
virtual bool internal_stealing(abstract_task *other_task) = 0;
virtual bool split_task(base::swmr_spin_lock *lock) = 0;
bool steal_work();
};
}
}
};
}
}
}
#endif //PLS_ABSTRACT_TASK_H
......@@ -5,30 +5,34 @@
#include "pls/internal/helpers/profiler.h"
#include "pls/internal/data_structures/aligned_stack.h"
#include "pls/internal/data_structures/deque.h"
#include "pls/internal/data_structures/work_stealing_deque.h"
#include "abstract_task.h"
#include "thread_state.h"
namespace pls {
namespace internal {
namespace scheduling {
class fork_join_task;
class fork_join_sub_task: public data_structures::deque_item {
namespace internal {
namespace scheduling {
class fork_join_task;
class fork_join_sub_task {
friend class fork_join_task;
// Coordinate finishing of sub_tasks
std::atomic_uint32_t ref_count_;
fork_join_sub_task* parent_;
fork_join_sub_task *parent_;
// Access to TBB scheduling environment
fork_join_task* tbb_task_;
fork_join_task *tbb_task_;
bool executed = false;
int executed_at = -1;
// Stack Management (reset stack pointer after wait_for_all() calls)
data_structures::aligned_stack::state stack_state_;
data_structures::work_stealing_deque<fork_join_sub_task>::state deque_state_;
protected:
explicit fork_join_sub_task();
fork_join_sub_task(const fork_join_sub_task& other);
fork_join_sub_task(const fork_join_sub_task &other);
// Overridden with the behaviour of child tasks
virtual void execute_internal() = 0;
......@@ -36,63 +40,84 @@ namespace pls {
public:
// Only use them when actually executing this sub_task (only public for simpler API design)
template<typename T>
void spawn_child(const T& sub_task);
void spawn_child(T &sub_task);
void wait_for_all();
static fork_join_sub_task *current();
private:
void spawn_child_internal(fork_join_sub_task* sub_task);
void execute();
};
};
template<typename Function>
class fork_join_lambda_by_reference : public fork_join_sub_task {
const Function &function_;
public:
explicit fork_join_lambda_by_reference(const Function &function) : fork_join_sub_task{}, function_{function} {};
template<typename Function>
class fork_join_lambda: public fork_join_sub_task {
const Function* function_;
protected:
void execute_internal() override {
function_();
}
};
template<typename Function>
class fork_join_lambda_by_value : public fork_join_sub_task {
const Function function_;
public:
explicit fork_join_lambda(const Function* function): function_{function} {};
explicit fork_join_lambda_by_value(const Function &function) : fork_join_sub_task{}, function_{function} {};
protected:
void execute_internal() override {
(*function_)(this);
function_();
}
};
};
class fork_join_task: public abstract_task {
class fork_join_task : public abstract_task {
friend class fork_join_sub_task;
fork_join_sub_task* root_task_;
fork_join_sub_task* currently_executing_;
data_structures::aligned_stack* my_stack_;
fork_join_sub_task *root_task_;
fork_join_sub_task *currently_executing_;
// Double-Ended Queue management
data_structures::deque<fork_join_sub_task> deque_;
data_structures::work_stealing_deque<fork_join_sub_task> deque_;
// Steal Management
fork_join_sub_task* last_stolen_;
fork_join_sub_task *last_stolen_;
fork_join_sub_task* get_local_sub_task();
fork_join_sub_task* get_stolen_sub_task();
fork_join_sub_task *get_local_sub_task();
fork_join_sub_task *get_stolen_sub_task();
protected:
bool internal_stealing(abstract_task* other_task) override;
bool split_task(base::spin_lock* /*lock*/) override;
bool internal_stealing(abstract_task *other_task) override;
bool split_task(base::swmr_spin_lock * /*lock*/) override;
public:
explicit fork_join_task(fork_join_sub_task* root_task, const abstract_task::id& id);
explicit fork_join_task(fork_join_sub_task *root_task, const abstract_task::id &id);
void execute() override;
fork_join_sub_task* currently_executing() const;
};
fork_join_sub_task *currently_executing() const;
};
template<typename T>
void fork_join_sub_task::spawn_child(const T& task) {
template<typename T>
void fork_join_sub_task::spawn_child(T &task) {
PROFILE_FORK_JOIN_STEALING("spawn_child")
static_assert(std::is_base_of<fork_join_sub_task, T>::value, "Only pass fork_join_sub_task subclasses!");
T* new_task = tbb_task_->my_stack_->push(task);
spawn_child_internal(new_task);
}
}
}
// Keep our refcount up to date
ref_count_++;
// Assign forced values
task.parent_ = this;
task.tbb_task_ = tbb_task_;
task.deque_state_ = tbb_task_->deque_.save_state();
// Push on our deque
const T const_task = task;
tbb_task_->deque_.push_tail(const_task);
}
}
}
}
#endif //PLS_TBB_LIKE_TASK_H
......@@ -5,25 +5,26 @@
#include <mutex>
#include "pls/internal/helpers/profiler.h"
#include "pls/internal/base/spin_lock.h"
#include "pls/internal/base/swmr_spin_lock.h"
#include "abstract_task.h"
namespace pls {
namespace internal {
namespace scheduling {
template<typename Function>
class root_task : public abstract_task {
namespace internal {
namespace scheduling {
template<typename Function>
class root_task : public abstract_task {
Function function_;
std::atomic_uint8_t finished_;
public:
static constexpr auto create_id = helpers::unique_id::create<root_task<Function>>;
explicit root_task(Function function):
explicit root_task(Function function) :
abstract_task{0, create_id()},
function_{function},
finished_{0} {}
root_task(const root_task& other):
root_task(const root_task &other) :
abstract_task{0, create_id()},
function_{other.function_},
finished_{0} {}
......@@ -38,23 +39,23 @@ namespace pls {
finished_ = 1;
}
bool internal_stealing(abstract_task* /*other_task*/) override {
bool internal_stealing(abstract_task * /*other_task*/) override {
return false;
}
bool split_task(base::spin_lock* /*lock*/) override {
bool split_task(base::swmr_spin_lock * /*lock*/) override {
return false;
}
};
};
template<typename Function>
class root_worker_task : public abstract_task {
root_task<Function>* master_task_;
template<typename Function>
class root_worker_task : public abstract_task {
root_task<Function> *master_task_;
public:
static constexpr auto create_id = root_task<Function>::create_id;
explicit root_worker_task(root_task<Function>* master_task):
explicit root_worker_task(root_task<Function> *master_task) :
abstract_task{0, create_id()},
master_task_{master_task} {}
......@@ -65,16 +66,17 @@ namespace pls {
} while (!master_task_->finished());
}
bool internal_stealing(abstract_task* /*other_task*/) override {
bool internal_stealing(abstract_task * /*other_task*/) override {
return false;
}
bool split_task(base::spin_lock* /*lock*/) override {
bool split_task(base::swmr_spin_lock * /*lock*/) override {
return false;
}
};
}
}
};
}
}
}
#endif //PLS_ROOT_MASTER_TASK_H
......@@ -12,12 +12,14 @@
#include "scheduler.h"
namespace pls {
namespace internal {
namespace scheduling {
template<typename Function>
class run_on_n_threads_task : public abstract_task {
namespace internal {
namespace scheduling {
template<typename Function>
class run_on_n_threads_task : public abstract_task {
template<typename F>
friend class run_on_n_threads_task_worker;
friend
class run_on_n_threads_task_worker;
Function function_;
......@@ -38,7 +40,7 @@ namespace pls {
public:
static constexpr auto create_id = helpers::unique_id::create<run_on_n_threads_task<Function>>;
run_on_n_threads_task(Function function, int num_threads):
run_on_n_threads_task(Function function, int num_threads) :
abstract_task{0, create_id()},
function_{function},
counter{num_threads - 1} {}
......@@ -55,21 +57,21 @@ namespace pls {
std::cout << "Finished Master!" << std::endl;
}
bool internal_stealing(abstract_task* /*other_task*/) override {
bool internal_stealing(abstract_task * /*other_task*/) override {
return false;
}
bool split_task(base::spin_lock* lock) override;
};
bool split_task(base::swmr_spin_lock *lock) override;
};
template<typename Function>
class run_on_n_threads_task_worker : public abstract_task {
template<typename Function>
class run_on_n_threads_task_worker : public abstract_task {
Function function_;
run_on_n_threads_task<Function>* root_;
run_on_n_threads_task<Function> *root_;
public:
static constexpr auto create_id = helpers::unique_id::create<run_on_n_threads_task_worker<Function>>;
run_on_n_threads_task_worker(Function function, run_on_n_threads_task<Function>* root):
run_on_n_threads_task_worker(Function function, run_on_n_threads_task<Function> *root) :
abstract_task{0, create_id()},
function_{function},
root_{root} {}
......@@ -83,36 +85,36 @@ namespace pls {
}
}
bool internal_stealing(abstract_task* /*other_task*/) override {
bool internal_stealing(abstract_task * /*other_task*/) override {
return false;
}
bool split_task(base::spin_lock* /*lock*/) override {
bool split_task(base::swmr_spin_lock * /*lock*/) override {
return false;
}
};
};
template<typename Function>
bool run_on_n_threads_task<Function>::split_task(base::spin_lock* lock) {
template<typename Function>
bool run_on_n_threads_task<Function>::split_task(base::swmr_spin_lock *lock) {
if (get_counter() <= 0) {
return false;
}
// In success case, unlock.
// TODO: this locking is complicated and error prone.
lock->unlock();
lock->reader_unlock();
auto scheduler = base::this_thread::state<thread_state>()->scheduler_;
auto task = run_on_n_threads_task_worker<Function>{function_, this};
scheduler->execute_task(task, depth());
return true;
}
}
template<typename Function>
run_on_n_threads_task<Function> create_run_on_n_threads_task(Function function, int num_threads) {
template<typename Function>
run_on_n_threads_task<Function> create_run_on_n_threads_task(Function function, int num_threads) {
return run_on_n_threads_task<Function>{function, num_threads};
}
}
}
}
}
}
}
#endif //PLS_RUN_ON_N_THREADS_TASK_H
......@@ -17,21 +17,22 @@
#include "scheduler_memory.h"
namespace pls {
namespace internal {
namespace scheduling {
void worker_routine();
using scheduler_thread = base::thread<decltype(&worker_routine), thread_state>;
namespace internal {
namespace scheduling {
class scheduler {
void worker_routine();
using scheduler_thread = base::thread<decltype(&worker_routine), thread_state>;
class scheduler {
friend void worker_routine();
const unsigned int num_threads_;
scheduler_memory* memory_;
scheduler_memory *memory_;
base::barrier sync_barrier_;
bool terminated_;
public:
explicit scheduler(scheduler_memory* memory, unsigned int num_threads);
explicit scheduler(scheduler_memory *memory, unsigned int num_threads);
~scheduler();
/**
......@@ -50,17 +51,18 @@ namespace pls {
* @param depth Optional: depth of the new task, otherwise set implicitly.
*/
template<typename Task>
static void execute_task(Task& task, int depth=-1);
static void execute_task(Task &task, int depth = -1);
static abstract_task* current_task() { return base::this_thread::state<thread_state>()->current_task_; }
static abstract_task *current_task() { return base::this_thread::state<thread_state>()->current_task_; }
void terminate(bool wait_for_workers=true);
void terminate(bool wait_for_workers = true);
unsigned int num_threads() const { return num_threads_; }
thread_state* thread_state_for(size_t id) { return memory_->thread_state_for(id); }
};
}
}
thread_state *thread_state_for(size_t id) { return memory_->thread_state_for(id); }
};
}
}
}
#include "scheduler_impl.h"
......
......@@ -3,10 +3,11 @@
#define PLS_SCHEDULER_IMPL_H
namespace pls {
namespace internal {
namespace scheduling {
template<typename Function>
void scheduler::perform_work(Function work_section) {
namespace internal {
namespace scheduling {
template<typename Function>
void scheduler::perform_work(Function work_section) {
PROFILE_WORK_BLOCK("scheduler::perform_work")
root_task<Function> master{work_section};
......@@ -31,42 +32,45 @@ namespace pls {
root_worker_task<Function> worker{new_master};
memory_->task_stack_for(0)->pop<decltype(worker)>();
}
}
}
template<typename Task>
void scheduler::execute_task(Task& task, int depth) {
template<typename Task>
void scheduler::execute_task(Task &task, int depth) {
static_assert(std::is_base_of<abstract_task, Task>::value, "Only pass abstract_task subclasses!");
auto my_state = base::this_thread::state<thread_state>();
abstract_task* old_task;
abstract_task* new_task;
abstract_task *old_task;
abstract_task *new_task;
// Init Task
{
std::lock_guard<base::spin_lock> lock{my_state->lock_};
old_task = my_state->current_task_;
new_task = my_state->task_stack_->push(task);
new_task->set_depth(depth >= 0 ? depth : old_task->depth() + 1);
{
my_state->lock_.writer_lock();
my_state->current_task_ = new_task;
old_task->set_child(new_task);
my_state->lock_.writer_unlock();
}
// Run Task
new_task->execute();
// Teardown state back to before the task was executed
{
std::lock_guard<base::spin_lock> lock{my_state->lock_};
my_state->task_stack_->pop<Task>();
{
my_state->lock_.writer_lock();
old_task->set_child(nullptr);
my_state->current_task_ = old_task;
my_state->task_stack_->pop<Task>();
}
}
}
my_state->lock_.writer_unlock();
}
}
}
}
}
#endif //PLS_SCHEDULER_IMPL_H
......@@ -7,21 +7,22 @@
#define PLS_SCHEDULER_MEMORY_H
namespace pls {
namespace internal {
namespace scheduling {
void worker_routine();
using scheduler_thread = base::thread<decltype(&worker_routine), thread_state>;
namespace internal {
namespace scheduling {
class scheduler_memory {
void worker_routine();
using scheduler_thread = base::thread<decltype(&worker_routine), thread_state>;
class scheduler_memory {
public:
virtual size_t max_threads() const = 0;
virtual thread_state* thread_state_for(size_t id) = 0;
virtual scheduler_thread* thread_for(size_t id) = 0;
virtual data_structures::aligned_stack* task_stack_for(size_t id) = 0;
};
virtual thread_state *thread_state_for(size_t id) = 0;
virtual scheduler_thread *thread_for(size_t id) = 0;
virtual data_structures::aligned_stack *task_stack_for(size_t id) = 0;
};
template<size_t MAX_THREADS, size_t TASK_STACK_SIZE>
class static_scheduler_memory: public scheduler_memory {
template<size_t MAX_THREADS, size_t TASK_STACK_SIZE>
class static_scheduler_memory : public scheduler_memory {
// Every one of these types has to live on its own cache line,
// as each thread uses one of them independently.
// Therefore it would be a major performance hit if we shared cache lines on these.
......@@ -38,17 +39,18 @@ namespace pls {
public:
static_scheduler_memory() {
for (size_t i = 0; i < MAX_THREADS; i++) {
new ((void*)task_stacks_[i].pointer()) data_structures::aligned_stack(task_stacks_memory_[i].pointer()->data(), TASK_STACK_SIZE);
new((void *) task_stacks_[i].pointer()) data_structures::aligned_stack(task_stacks_memory_[i].pointer()->data(),
TASK_STACK_SIZE);
}
}
size_t max_threads() const override { return MAX_THREADS; }
thread_state* thread_state_for(size_t id) override { return thread_states_[id].pointer(); }
scheduler_thread* thread_for(size_t id) override { return threads_[id].pointer(); }
data_structures::aligned_stack* task_stack_for(size_t id) override { return task_stacks_[id].pointer(); }
};
thread_state *thread_state_for(size_t id) override { return thread_states_[id].pointer(); }
scheduler_thread *thread_for(size_t id) override { return threads_[id].pointer(); }
data_structures::aligned_stack *task_stack_for(size_t id) override { return task_stacks_[id].pointer(); }
};
class malloc_scheduler_memory: public scheduler_memory {
class malloc_scheduler_memory : public scheduler_memory {
// Every one of these types has to live on its own cache line,
// as each thread uses one of them independently.
// Therefore it would be a major performance hit if we shared cache lines on these.
......@@ -58,21 +60,22 @@ namespace pls {
const size_t num_threads_;
aligned_thread* threads_;
aligned_thread_state * thread_states_;
char** task_stacks_memory_;
aligned_aligned_stack * task_stacks_;
aligned_thread *threads_;
aligned_thread_state *thread_states_;
char **task_stacks_memory_;
aligned_aligned_stack *task_stacks_;
public:
explicit malloc_scheduler_memory(size_t num_threads, size_t memory_per_stack = 2 << 16);
~malloc_scheduler_memory();
size_t max_threads() const override { return num_threads_; }
thread_state* thread_state_for(size_t id) override { return thread_states_[id].pointer(); }
scheduler_thread* thread_for(size_t id) override { return threads_[id].pointer(); }
data_structures::aligned_stack* task_stack_for(size_t id) override { return task_stacks_[id].pointer(); }
};
}
}
thread_state *thread_state_for(size_t id) override { return thread_states_[id].pointer(); }
scheduler_thread *thread_for(size_t id) override { return threads_[id].pointer(); }
data_structures::aligned_stack *task_stack_for(size_t id) override { return task_stacks_[id].pointer(); }
};
}
}
}
#endif //PLS_SCHEDULER_MEMORY_H
......@@ -5,41 +5,46 @@
#include <random>
#include "pls/internal/data_structures/aligned_stack.h"
#include "pls/internal/base/swmr_spin_lock.h"
#include "abstract_task.h"
namespace pls {
namespace internal {
namespace scheduling {
// forward declaration
class scheduler;
struct thread_state {
scheduler* scheduler_;
abstract_task* root_task_;
abstract_task* current_task_;
data_structures::aligned_stack* task_stack_;
size_t id_;
base::spin_lock lock_;
std::minstd_rand random_;
thread_state():
namespace internal {
namespace scheduling {
// forward declaration
class scheduler;
struct thread_state {
alignas(base::system_details::CACHE_LINE_SIZE) scheduler *scheduler_;
alignas(base::system_details::CACHE_LINE_SIZE) abstract_task *root_task_;
alignas(base::system_details::CACHE_LINE_SIZE) abstract_task *current_task_;
alignas(base::system_details::CACHE_LINE_SIZE) data_structures::aligned_stack *task_stack_;
alignas(base::system_details::CACHE_LINE_SIZE) size_t id_;
alignas(base::system_details::CACHE_LINE_SIZE) base::swmr_spin_lock lock_;
alignas(base::system_details::CACHE_LINE_SIZE) std::minstd_rand random_;
thread_state() :
scheduler_{nullptr},
root_task_{nullptr},
current_task_{nullptr},
task_stack_{nullptr},
id_{0},
lock_{},
random_{id_} {};
thread_state(scheduler* scheduler, data_structures::aligned_stack* task_stack, unsigned int id):
thread_state(scheduler *scheduler, data_structures::aligned_stack *task_stack, unsigned int id) :
scheduler_{scheduler},
root_task_{nullptr},
current_task_{nullptr},
task_stack_{task_stack},
id_{id},
lock_{},
random_{id_} {}
};
}
}
};
}
}
}
#endif //PLS_THREAD_STATE_H
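Note on the `alignas` change in `thread_state`: the sketch below illustrates what per-member `alignas` does to the layout. It is only an illustration under assumptions: `kAssumedCacheLine` stands in for `base::system_details::CACHE_LINE_SIZE` (whose value is not shown in this diff), and `padded_state_sketch` is a hypothetical two-member stand-in for the real struct.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: 64-byte cache lines. The real constant is
// base::system_details::CACHE_LINE_SIZE, which this diff does not show.
constexpr std::size_t kAssumedCacheLine = 64;

// Hypothetical stand-in for thread_state: every member starts on its
// own cache line, so writes to one member never invalidate the line
// that holds the other.
struct padded_state_sketch {
  alignas(kAssumedCacheLine) int id_;
  alignas(kAssumedCacheLine) int lock_;
};
```

The cost is size: two `int` members now occupy two full cache lines instead of eight bytes, which is presumably acceptable for a small per-thread struct.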
......@@ -8,18 +8,22 @@
#include "pls/internal/helpers/unique_id.h"
namespace pls {
using internal::scheduling::static_scheduler_memory;
using internal::scheduling::malloc_scheduler_memory;
using internal::scheduling::scheduler;
using task_id = internal::scheduling::abstract_task::id;
using internal::scheduling::static_scheduler_memory;
using internal::scheduling::malloc_scheduler_memory;
using unique_id = internal::helpers::unique_id;
using internal::scheduling::scheduler;
using task_id = internal::scheduling::abstract_task::id;
using internal::scheduling::fork_join_sub_task;
using internal::scheduling::fork_join_task;
using unique_id = internal::helpers::unique_id;
using internal::scheduling::fork_join_sub_task;
using internal::scheduling::fork_join_lambda_by_reference;
using internal::scheduling::fork_join_lambda_by_value;
using internal::scheduling::fork_join_task;
using algorithm::invoke_parallel;
using algorithm::invoke_parallel;
}
#endif
......@@ -2,26 +2,37 @@
#include "pls/internal/base/system_details.h"
namespace pls {
namespace internal {
namespace base {
namespace alignment {
void* allocate_aligned(size_t size) {
namespace internal {
namespace base {
namespace alignment {
void *allocate_aligned(size_t size) {
return aligned_alloc(system_details::CACHE_LINE_SIZE, size);
}
}
std::uintptr_t next_alignment(std::uintptr_t size) {
std::uintptr_t miss_alignment = size % base::system_details::CACHE_LINE_SIZE;
system_details::pointer_t next_alignment(system_details::pointer_t size) {
system_details::pointer_t miss_alignment = size % base::system_details::CACHE_LINE_SIZE;
if (miss_alignment == 0) {
return size;
} else {
return size + (base::system_details::CACHE_LINE_SIZE - miss_alignment);
}
}
}
char* next_alignment(char* pointer) {
return reinterpret_cast<char*>(next_alignment(reinterpret_cast<std::uintptr_t >(pointer)));
}
}
}
system_details::pointer_t previous_alignment(system_details::pointer_t size) {
system_details::pointer_t miss_alignment = size % base::system_details::CACHE_LINE_SIZE;
if (miss_alignment == 0) {
return size;
} else {
return size - miss_alignment;
}
}
char *next_alignment(char *pointer) {
return reinterpret_cast<char *>(next_alignment(reinterpret_cast<system_details::pointer_t >(pointer)));
}
}
}
}
}
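Note on `next_alignment` / `previous_alignment`: a small worked sketch of the rounding logic, assuming a 64-byte cache line. `kCacheLineSketch` is a made-up constant for illustration, and `std::uintptr_t` is used in place of `system_details::pointer_t`, whose definition is not part of this diff.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 64-byte cache lines (the real value comes from
// base::system_details::CACHE_LINE_SIZE).
constexpr std::uintptr_t kCacheLineSketch = 64;

// Round up to the next multiple of the cache line size.
// Already aligned values are returned unchanged.
constexpr std::uintptr_t next_alignment(std::uintptr_t value) {
  return (value % kCacheLineSketch == 0)
             ? value
             : value + (kCacheLineSketch - value % kCacheLineSketch);
}

// Round down to the previous multiple of the cache line size.
// Subtracting the remainder covers the aligned case as well.
constexpr std::uintptr_t previous_alignment(std::uintptr_t value) {
  return value - value % kCacheLineSketch;
}
```

With these definitions an address such as 65 rounds up to 128 and down to 64, while 64 itself is left untouched by both functions.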
#include "pls/internal/base/barrier.h"
namespace pls {
namespace internal {
namespace base {
barrier::barrier(const unsigned int count): barrier_{} {
namespace internal {
namespace base {
barrier::barrier(const unsigned int count) : barrier_{} {
pthread_barrier_init(&barrier_, nullptr, count);
}
}
barrier::~barrier() {
barrier::~barrier() {
pthread_barrier_destroy(&barrier_);
}
}
void barrier::wait() {
void barrier::wait() {
pthread_barrier_wait(&barrier_);
}
}
}
}
}
}
}
#include "pls/internal/base/swmr_spin_lock.h"
#include "pls/internal/base/system_details.h"
namespace pls {
namespace internal {
namespace base {
bool swmr_spin_lock::reader_try_lock() {
PROFILE_LOCK("Try Acquire Read Lock")
if (write_request_.load(std::memory_order_acquire) == 1) {
return false;
}
// We think we can enter the region
readers_.fetch_add(1, std::memory_order_acquire);
if (write_request_.load(std::memory_order_acquire) == 1) {
// Whoops, the writer is acquiring the lock, so we back off again
readers_.fetch_add(-1, std::memory_order_release);
return false;
}
return true;
}
void swmr_spin_lock::reader_unlock() {
PROFILE_LOCK("Release Read Lock")
readers_--;
}
void swmr_spin_lock::writer_lock() {
PROFILE_LOCK("Acquire Write Lock")
// Tell the readers that we would like to write
write_request_ = 1;
// Wait for all of them to exit the critical section
while (readers_ > 0)
system_details::relax_cpu(); // Spin until all readers have left (the implicit load above is seq_cst, not relaxed)
}
void swmr_spin_lock::writer_unlock() {
PROFILE_LOCK("Release Write Lock")
write_request_ = 0;
}
}
}
}
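Note on the single-writer/multiple-reader protocol: the sketch below restates the handshake from `swmr_spin_lock` in a self-contained form. It is not the library's class. `swmr_lock_sketch` is a hypothetical name, the profiling macros and `relax_cpu()` are left out, and every atomic operation uses the default sequentially consistent ordering instead of the acquire/release orderings in the diff, in line with the 'strict ordering first' decision noted in the performance log.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for swmr_spin_lock (one writer, many readers).
class swmr_lock_sketch {
  std::atomic<int> readers_{0};
  std::atomic<int> write_request_{0};

 public:
  // Readers never block: they either enter or give up immediately.
  bool reader_try_lock() {
    if (write_request_.load() == 1) {
      return false;  // a writer is active or waiting
    }
    readers_.fetch_add(1);  // announce ourselves
    if (write_request_.load() == 1) {
      readers_.fetch_sub(1);  // the writer arrived in between, back off
      return false;
    }
    return true;
  }

  void reader_unlock() { readers_.fetch_sub(1); }

  // The single writer raises its flag, then waits for readers to drain.
  void writer_lock() {
    write_request_.store(1);
    while (readers_.load() > 0) {
      // spin
    }
  }

  void writer_unlock() { write_request_.store(0); }
};
```

The second check of `write_request_` in `reader_try_lock()` is what makes the scheme work: a reader that incremented `readers_` after the writer raised its flag sees the flag and withdraws, so the writer's wait loop terminates. A stealing thread would take the reader side (and simply skip the victim when `reader_try_lock()` fails), while the owning thread takes the writer side when it changes its task chain.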
#include "pls/internal/helpers/profiler.h"
#include "pls/internal/base/tas_spin_lock.h"
#include "pls/internal/base/backoff.h"
namespace pls {
namespace internal {
namespace base {
void tas_spin_lock::lock() {
namespace internal {
namespace base {
void tas_spin_lock::lock() {
PROFILE_LOCK("Acquire Lock")
int tries = 0;
while (flag_.test_and_set(std::memory_order_acquire)) {
tries++;
if (tries % yield_at_tries_ == 0) {
this_thread::yield();
}
backoff backoff_strategy;
while (true) {
if (flag_.test_and_set(std::memory_order_acquire) == 0) {
return;
}
backoff_strategy.do_backoff();
}
}
bool tas_spin_lock::try_lock(unsigned int num_tries) {
bool tas_spin_lock::try_lock(unsigned int num_tries) {
PROFILE_LOCK("Try Acquire Lock")
while (flag_.test_and_set(std::memory_order_acquire)) {
backoff backoff_strategy;
while (true) {
if (flag_.test_and_set(std::memory_order_acquire) == 0) {
return true;
}
num_tries--;
if (num_tries <= 0) {
return false;
}
backoff_strategy.do_backoff();
}
return true;
}
}
void tas_spin_lock::unlock() {
void tas_spin_lock::unlock() {
PROFILE_LOCK("Unlock")
flag_.clear(std::memory_order_release);
}
}
}
}
}
}
}
#include "pls/internal/base/thread.h"
namespace pls {
namespace internal {
namespace base {
namespace internal {
namespace base {
#ifdef PLS_THREAD_SPECIFIC_PTHREAD
pthread_key_t this_thread::local_storage_key_ = false;
bool this_thread::local_storage_key_initialized_;
pthread_key_t this_thread::local_storage_key_ = false;
bool this_thread::local_storage_key_initialized_;
#endif
#ifdef PLS_THREAD_SPECIFIC_COMPILER
__thread void* this_thread::local_state_;
__thread void *this_thread::local_state_;
#endif
// implementation in header (C++ templating)
}
}
// implementation in header (C++ templating)
}
}
}
#include "pls/internal/helpers/profiler.h"
#include "pls/internal/base/ttas_spin_lock.h"
#include "pls/internal/base/backoff.h"
namespace pls {
namespace internal {
namespace base {
void ttas_spin_lock::lock() {
namespace internal {
namespace base {
void ttas_spin_lock::lock() {
PROFILE_LOCK("Acquire Lock")
int tries = 0;
int expected = 0;
backoff backoff_;
do {
while (flag_.load(std::memory_order_relaxed) == 1) {
tries++;
if (tries % yield_at_tries_ == 0) {
this_thread::yield();
}
}
while (true) {
while (flag_.load(std::memory_order_relaxed) == 1)
system_details::relax_cpu(); // Spin
expected = 0;
} while (!flag_.compare_exchange_weak(expected, 1, std::memory_order_acquire));
if (flag_.compare_exchange_weak(expected, 1, std::memory_order_acquire)) {
return;
}
backoff_.do_backoff();
}
}
bool ttas_spin_lock::try_lock(unsigned int num_tries) {
bool ttas_spin_lock::try_lock(unsigned int num_tries) {
PROFILE_LOCK("Try Acquire Lock")
int expected = 0;
backoff backoff_;
do {
while (flag_.load(std::memory_order_relaxed) == 1) {
while (true) {
while (flag_.load() == 1) {
num_tries--;
if (num_tries <= 0) {
return false;
}
system_details::relax_cpu();
}
expected = 0;
} while (!flag_.compare_exchange_weak(expected, 1, std::memory_order_acquire));
if (flag_.compare_exchange_weak(expected, 1, std::memory_order_acquire)) {
return true;
}
void ttas_spin_lock::unlock() {
flag_.store(0, std::memory_order_release);
}
num_tries--;
if (num_tries <= 0) {
return false;
}
backoff_.do_backoff();
}
}
void ttas_spin_lock::unlock() {
PROFILE_LOCK("Unlock")
flag_.store(0, std::memory_order_release);
}
}
}
}
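Note on the test-and-test-and-set pattern: a minimal sketch of the idea behind `ttas_spin_lock`, assuming nothing beyond `<atomic>`. `ttas_lock_sketch` is a hypothetical name. The `backoff` helper from the diff is omitted because its interface is not shown here, and `compare_exchange_strong` replaces `compare_exchange_weak` so that a single attempt on a free lock cannot fail spuriously.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for ttas_spin_lock, without backoff.
class ttas_lock_sketch {
  std::atomic<int> flag_{0};

 public:
  // Try up to num_tries times to take the lock.
  bool try_lock(unsigned int num_tries = 1) {
    for (; num_tries > 0; num_tries--) {
      // "Test": a plain load keeps the cache line in shared state
      // while the lock is held by someone else.
      if (flag_.load(std::memory_order_relaxed) != 0) {
        continue;
      }
      // "Test-and-set": only now attempt the expensive atomic write.
      int expected = 0;
      if (flag_.compare_exchange_strong(expected, 1,
                                        std::memory_order_acquire)) {
        return true;
      }
    }
    return false;
  }

  void lock() {
    while (!try_lock(1)) {
      // spin until the lock becomes free
    }
  }

  void unlock() { flag_.store(0, std::memory_order_release); }
};
```

Compared with the plain TAS lock, waiting threads only read the flag, so they do not keep pulling the cache line into exclusive state while the lock is taken.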
......@@ -2,12 +2,19 @@
#include "pls/internal/base/system_details.h"
namespace pls {
namespace internal {
namespace data_structures {
aligned_stack::aligned_stack(char* memory_region, const std::size_t size):
namespace internal {
namespace data_structures {
aligned_stack::aligned_stack(pointer_t memory_region, const std::size_t size) :
memory_start_{memory_region},
memory_end_{memory_region + size},
head_{base::alignment::next_alignment(memory_start_)} {}
}
}
aligned_stack::aligned_stack(char *memory_region, const std::size_t size) :
memory_start_{(pointer_t) memory_region},
memory_end_{(pointer_t) memory_region + size},
head_{base::alignment::next_alignment(memory_start_)} {}
}
}
}
......@@ -3,56 +3,58 @@
#include "pls/internal/data_structures/deque.h"
namespace pls {
namespace internal {
namespace data_structures {
deque_item* deque_internal::pop_head_internal() {
namespace internal {
namespace data_structures {
deque_item *deque_internal::pop_head_internal() {
std::lock_guard<base::spin_lock> lock{lock_};
if (head_ == nullptr) {
return nullptr;
}
deque_item* result = head_;
head_ = head_->prev_;
deque_item *result = head_;
head_ = head_->next_;
if (head_ == nullptr) {
tail_ = nullptr;
} else {
head_->next_ = nullptr;
head_->prev_ = nullptr;
}
return result;
}
}
deque_item* deque_internal::pop_tail_internal() {
deque_item *deque_internal::pop_tail_internal() {
std::lock_guard<base::spin_lock> lock{lock_};
if (tail_ == nullptr) {
return nullptr;
}
deque_item* result = tail_;
tail_ = tail_->next_;
deque_item *result = tail_;
tail_ = tail_->prev_;
if (tail_ == nullptr) {
head_ = nullptr;
} else {
tail_->prev_ = nullptr;
tail_->next_ = nullptr;
}
return result;
}
}
void deque_internal::push_tail_internal(deque_item *new_item) {
void deque_internal::push_tail_internal(deque_item *new_item) {
std::lock_guard<base::spin_lock> lock{lock_};
if (tail_ != nullptr) {
tail_->prev_ = new_item;
tail_->next_ = new_item;
} else {
head_ = new_item;
}
new_item->next_ = tail_;
new_item->prev_ = nullptr;
new_item->prev_ = tail_;
new_item->next_ = nullptr;
tail_ = new_item;
}
}
}
}
}
}
}
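Note on the pointer convention in `deque_internal`: after this change `next_` points from head towards tail and `prev_` points back towards head. The sketch below replays that convention on a lock-free-of-locks, single-threaded toy version. `item_sketch` and `deque_sketch` are hypothetical names, and the `spin_lock` guard from the real code is left out on purpose.

```cpp
#include <cassert>

// Hypothetical intrusive list node, standing in for deque_item.
struct item_sketch {
  int value;
  item_sketch *prev_ = nullptr;  // neighbour closer to the head
  item_sketch *next_ = nullptr;  // neighbour closer to the tail
  explicit item_sketch(int v) : value{v} {}
};

// Single-threaded stand-in for deque_internal (no locking).
class deque_sketch {
  item_sketch *head_ = nullptr;
  item_sketch *tail_ = nullptr;

 public:
  void push_tail(item_sketch *new_item) {
    if (tail_ != nullptr) {
      tail_->next_ = new_item;
    } else {
      head_ = new_item;  // first element is head and tail at once
    }
    new_item->prev_ = tail_;
    new_item->next_ = nullptr;
    tail_ = new_item;
  }

  item_sketch *pop_head() {
    if (head_ == nullptr) return nullptr;
    item_sketch *result = head_;
    head_ = head_->next_;
    if (head_ == nullptr) {
      tail_ = nullptr;  // deque is now empty
    } else {
      head_->prev_ = nullptr;
    }
    return result;
  }

  item_sketch *pop_tail() {
    if (tail_ == nullptr) return nullptr;
    item_sketch *result = tail_;
    tail_ = tail_->prev_;
    if (tail_ == nullptr) {
      head_ = nullptr;  // deque is now empty
    } else {
      tail_->next_ = nullptr;
    }
    return result;
  }
};
```

In the scheduler the owning thread works at the tail (`get_local_sub_task`), while thieves take from the head (`get_stolen_sub_task`), so the oldest and typically largest piece of work is the one that gets stolen.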
#include <pls/internal/base/backoff.h>
#include "pls/internal/helpers/profiler.h"
#include "pls/internal/scheduling/thread_state.h"
......@@ -5,31 +6,35 @@
#include "pls/internal/scheduling/scheduler.h"
namespace pls {
namespace internal {
namespace scheduling {
bool abstract_task::steal_work() {
namespace internal {
namespace scheduling {
bool abstract_task::steal_work() {
thread_local static base::backoff backoff{};
PROFILE_STEALING("abstract_task::steal_work")
const auto my_state = base::this_thread::state<thread_state>();
const auto my_scheduler = my_state->scheduler_;
const size_t my_id = my_state->id_;
const size_t offset = my_state->random_() % my_scheduler->num_threads();
const size_t max_tries = 1; // my_scheduler->num_threads(); TODO: Tune this value
const size_t max_tries = my_scheduler->num_threads() - 1; // TODO: Tune this value
for (size_t i = 0; i < max_tries; i++) {
size_t target = (offset + i) % my_scheduler->num_threads();
if (target == my_id) {
continue;
target = (target + 1) % my_scheduler->num_threads();
}
auto target_state = my_scheduler->thread_state_for(target);
// TODO: Cleaner Locking Using std::guarded_lock
target_state->lock_.lock();
if (!target_state->lock_.reader_try_lock()) {
continue;
}
// Dig down to our level
PROFILE_STEALING("Go to our level")
abstract_task* current_task = target_state->root_task_;
abstract_task *current_task = target_state->root_task_;
while (current_task != nullptr && current_task->depth() < depth()) {
current_task = current_task->child_task_;
current_task = current_task->child();
}
PROFILE_END_BLOCK
......@@ -41,12 +46,13 @@ namespace pls {
current_task->depth_ == depth_) {
if (internal_stealing(current_task)) {
// internal steal was a success, hand it back to the internal scheduler
target_state->lock_.unlock();
target_state->lock_.reader_unlock();
backoff.reset();
return true;
}
// No success, we need to steal work from a deeper level using 'top level task stealing'
current_task = current_task->child_task_;
current_task = current_task->child();
}
}
PROFILE_END_BLOCK;
......@@ -58,19 +64,22 @@ namespace pls {
while (current_task != nullptr) {
auto lock = &target_state->lock_;
if (current_task->split_task(lock)) {
// internal steal was no success (we did a top level task steal)
// top level steal was a success (we did a top level task steal)
backoff.reset();
return false;
}
current_task = current_task->child_task_;
}
PROFILE_END_BLOCK;
target_state->lock_.unlock();
target_state->lock_.reader_unlock();
}
// internal steal was no success
backoff.do_backoff();
return false;
};
}
}
}
}
}
}
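Note on victim selection in `steal_work()`: the sketch below extracts only the index arithmetic of the new loop (random offset, `num_threads - 1` tries, skipping our own id by moving one slot forward). `steal_targets` is a hypothetical helper that does not exist in the code base. It returns the victims in probing order so the behaviour can be inspected.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: lists the thread ids that the new steal loop
// would probe for a given random offset.
std::vector<std::size_t> steal_targets(std::size_t my_id,
                                       std::size_t offset,
                                       std::size_t num_threads) {
  std::vector<std::size_t> targets;
  if (num_threads < 2) {
    return targets;  // nobody to steal from
  }
  const std::size_t max_tries = num_threads - 1;
  for (std::size_t i = 0; i < max_tries; i++) {
    std::size_t target = (offset + i) % num_threads;
    if (target == my_id) {
      // Never steal from ourselves, move on to the neighbour instead.
      target = (target + 1) % num_threads;
    }
    targets.push_back(target);
  }
  return targets;
}
```

One property worth knowing when tuning `max_tries`: because the skip only shifts the current probe, the neighbour after our own id can be probed twice in one round while another thread is not probed at all. For example, with four threads, `my_id == 1` and `offset == 0`, the probing order is 0, 2, 2.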
......@@ -4,51 +4,39 @@
#include "pls/internal/scheduling/fork_join_task.h"
namespace pls {
namespace internal {
namespace scheduling {
fork_join_sub_task::fork_join_sub_task():
data_structures::deque_item{},
namespace internal {
namespace scheduling {
fork_join_sub_task::fork_join_sub_task() :
ref_count_{0},
parent_{nullptr},
tbb_task_{nullptr},
stack_state_{nullptr} {}
deque_state_{0} {}
fork_join_sub_task::fork_join_sub_task(const fork_join_sub_task& other):
data_structures::deque_item(other),
fork_join_sub_task::fork_join_sub_task(const fork_join_sub_task &other) :
ref_count_{0},
parent_{nullptr},
tbb_task_{nullptr},
stack_state_{nullptr} {}
parent_{other.parent_},
tbb_task_{other.tbb_task_},
deque_state_{other.deque_state_} {}
void fork_join_sub_task::execute() {
void fork_join_sub_task::execute() {
PROFILE_WORK_BLOCK("execute sub_task")
auto last_executing = tbb_task_->currently_executing_;
tbb_task_->currently_executing_ = this;
execute_internal();
tbb_task_->currently_executing_ = nullptr;
tbb_task_->currently_executing_ = last_executing;
PROFILE_END_BLOCK
wait_for_all();
if (parent_ != nullptr) {
parent_->ref_count_--;
}
}
void fork_join_sub_task::spawn_child_internal(fork_join_sub_task* sub_task) {
// Keep our refcount up to date
ref_count_++;
// Assign forced values
sub_task->parent_ = this;
sub_task->tbb_task_ = tbb_task_;
sub_task->stack_state_ = tbb_task_->my_stack_->save_state();
tbb_task_->deque_.push_tail(sub_task);
}
}
void fork_join_sub_task::wait_for_all() {
void fork_join_sub_task::wait_for_all() {
while (ref_count_ > 0) {
PROFILE_STEALING("get local sub task")
fork_join_sub_task* local_task = tbb_task_->get_local_sub_task();
fork_join_sub_task *local_task = tbb_task_->get_local_sub_task();
PROFILE_END_BLOCK
if (local_task != nullptr) {
local_task->execute();
......@@ -63,20 +51,24 @@ namespace pls {
}
}
}
tbb_task_->my_stack_->reset_state(stack_state_);
}
tbb_task_->deque_.release_memory_until(deque_state_);
}
fork_join_sub_task* fork_join_task::get_local_sub_task() {
fork_join_sub_task *fork_join_task::get_local_sub_task() {
return deque_.pop_tail();
}
}
fork_join_sub_task* fork_join_task::get_stolen_sub_task() {
fork_join_sub_task *fork_join_task::get_stolen_sub_task() {
return deque_.pop_head();
}
}
bool fork_join_task::internal_stealing(abstract_task* other_task) {
fork_join_sub_task *fork_join_sub_task::current() {
return dynamic_cast<fork_join_task *>(scheduler::current_task())->currently_executing();
}
bool fork_join_task::internal_stealing(abstract_task *other_task) {
PROFILE_STEALING("fork_join_task::internal_stealin")
auto cast_other_task = reinterpret_cast<fork_join_task*>(other_task);
auto cast_other_task = reinterpret_cast<fork_join_task *>(other_task);
auto stolen_sub_task = cast_other_task->get_stolen_sub_task();
if (stolen_sub_task == nullptr) {
......@@ -84,51 +76,54 @@ namespace pls {
} else {
// Make sub-task belong to our fork_join_task instance
stolen_sub_task->tbb_task_ = this;
stolen_sub_task->stack_state_ = my_stack_->save_state();
stolen_sub_task->deque_state_ = deque_.save_state();
// We will execute this next without explicitly moving it onto our stack storage
last_stolen_ = stolen_sub_task;
return true;
}
}
}
bool fork_join_task::split_task(base::spin_lock* lock) {
bool fork_join_task::split_task(base::swmr_spin_lock *lock) {
PROFILE_STEALING("fork_join_task::split_task")
fork_join_sub_task* stolen_sub_task = get_stolen_sub_task();
fork_join_sub_task *stolen_sub_task = get_stolen_sub_task();
if (stolen_sub_task == nullptr) {
return false;
}
fork_join_task task{stolen_sub_task, this->unique_id()};
// In success case, unlock.
// TODO: this locking is complicated and error prone.
lock->unlock();
lock->reader_unlock();
scheduler::execute_task(task, depth());
return true;
}
}
void fork_join_task::execute() {
void fork_join_task::execute() {
PROFILE_WORK_BLOCK("execute fork_join_task");
// Bind this instance to our OS thread
my_stack_ = base::this_thread::state<thread_state>()->task_stack_;
// TODO: See if we did this right
// my_stack_ = base::this_thread::state<thread_state>()->task_stack_;
deque_.reset_base_pointer();
root_task_->tbb_task_ = this;
root_task_->stack_state_ = my_stack_->save_state();
root_task_->deque_state_ = deque_.save_state();
// Execute it on our OS thread until it's finished
root_task_->execute();
}
}
fork_join_sub_task* fork_join_task::currently_executing() const { return currently_executing_; }
fork_join_sub_task *fork_join_task::currently_executing() const { return currently_executing_; }
fork_join_task::fork_join_task(fork_join_sub_task* root_task, const abstract_task::id& id):
fork_join_task::fork_join_task(fork_join_sub_task *root_task,
const abstract_task::id &id) :
abstract_task{0, id},
root_task_{root_task},
currently_executing_{nullptr},
my_stack_{nullptr},
deque_{},
last_stolen_{nullptr} {};
}
}
deque_{base::this_thread::state<thread_state>()->task_stack_},
last_stolen_{nullptr} {}
}
}
}
#include "pls/internal/scheduling/root_task.h"
namespace pls {
namespace internal {
namespace scheduling {
namespace internal {
namespace scheduling {
}
}
}
}
}
#include "pls/internal/scheduling/run_on_n_threads_task.h"
namespace pls {
namespace internal {
namespace scheduling {
namespace internal {
namespace scheduling {
}
}
}
}
}
......@@ -2,9 +2,10 @@
#include "pls/internal/base/error_handling.h"
namespace pls {
namespace internal {
namespace scheduling {
scheduler::scheduler(scheduler_memory* memory, const unsigned int num_threads):
namespace internal {
namespace scheduling {
scheduler::scheduler(scheduler_memory *memory, const unsigned int num_threads) :
num_threads_{num_threads},
memory_{memory},
sync_barrier_{num_threads + 1},
......@@ -15,16 +16,17 @@ namespace pls {
for (unsigned int i = 0; i < num_threads_; i++) {
// Placement new is required, as the memory of `memory_` is not required to be initialized.
new((void*)memory_->thread_state_for(i)) thread_state{this, memory_->task_stack_for(i), i};
new ((void*)memory_->thread_for(i))base::thread<void(*)(), thread_state>(&worker_routine, memory_->thread_state_for(i));
}
new((void *) memory_->thread_state_for(i)) thread_state{this, memory_->task_stack_for(i), i};
new((void *) memory_->thread_for(i))base::thread<void (*)(), thread_state>(&worker_routine,
memory_->thread_state_for(i));
}
}
scheduler::~scheduler() {
scheduler::~scheduler() {
terminate();
}
}
void worker_routine() {
void worker_routine() {
auto my_state = base::this_thread::state<thread_state>();
while (true) {
......@@ -40,9 +42,9 @@ namespace pls {
my_state->scheduler_->sync_barrier_.wait();
}
}
}
void scheduler::terminate(bool wait_for_workers) {
void scheduler::terminate(bool wait_for_workers) {
if (terminated_) {
return;
}
......@@ -55,7 +57,8 @@ namespace pls {
memory_->thread_for(i)->join();
}
}
}
}
}
}
}
}
}
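Reviewer note on the constructor comment "Placement new is required, as the memory of `memory_` is not required to be initialized": below is a minimal sketch of that pattern, not the pls code. `worker_state` and `sum_of_ids` are hypothetical names made up for illustration, and plain `std::malloc` stands in for `base::alignment::allocate_aligned`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical stand-in for a per-thread state object (not pls::thread_state).
struct worker_state {
  unsigned int id;
  int *stack;
  worker_state(unsigned int id, int *stack) : id{id}, stack{stack} {}
};

// Allocates raw storage, constructs objects in it with placement new,
// reads them back, then destroys them explicitly before freeing the storage.
inline unsigned int sum_of_ids(std::size_t count) {
  void *raw = std::malloc(count * sizeof(worker_state));
  if (raw == nullptr) {
    return 0;
  }
  auto *states = static_cast<worker_state *>(raw);

  // The memory is uninitialized, so assignment would be invalid here;
  // placement new runs the constructor directly in the given storage.
  for (std::size_t i = 0; i < count; i++) {
    new (static_cast<void *>(states + i)) worker_state{static_cast<unsigned int>(i), nullptr};
  }

  unsigned int sum = 0;
  for (std::size_t i = 0; i < count; i++) {
    sum += states[i].id;
  }

  // Objects created with placement new must be destroyed by hand.
  for (std::size_t i = 0; i < count; i++) {
    states[i].~worker_state();
  }
  std::free(raw);
  return sum;
}
```

The explicit destructor calls are the counterpart of placement new: `free` only releases the storage and never runs destructors.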
#include "pls/internal/scheduling/scheduler_memory.h"
namespace pls {
namespace internal {
namespace scheduling {
malloc_scheduler_memory::malloc_scheduler_memory(const size_t num_threads, const size_t memory_per_stack):
namespace internal {
namespace scheduling {
malloc_scheduler_memory::malloc_scheduler_memory(const size_t num_threads, const size_t memory_per_stack) :
num_threads_{num_threads} {
threads_ = reinterpret_cast<aligned_thread *>(base::alignment::allocate_aligned(num_threads * sizeof(aligned_thread)));
thread_states_ = reinterpret_cast<aligned_thread_state *>(base::alignment::allocate_aligned(num_threads * sizeof(aligned_thread_state)));
threads_ =
reinterpret_cast<aligned_thread *>(base::alignment::allocate_aligned(num_threads * sizeof(aligned_thread)));
thread_states_ = reinterpret_cast<aligned_thread_state *>(base::alignment::allocate_aligned(
num_threads * sizeof(aligned_thread_state)));
task_stacks_ = reinterpret_cast<aligned_aligned_stack *>(base::alignment::allocate_aligned(num_threads * sizeof(aligned_aligned_stack)));
task_stacks_memory_ = reinterpret_cast<char**>(base::alignment::allocate_aligned(num_threads * sizeof(char*)));
task_stacks_ = reinterpret_cast<aligned_aligned_stack *>(base::alignment::allocate_aligned(
num_threads * sizeof(aligned_aligned_stack)));
task_stacks_memory_ = reinterpret_cast<char **>(base::alignment::allocate_aligned(num_threads * sizeof(char *)));
for (size_t i = 0; i < num_threads_; i++) {
task_stacks_memory_[i] = reinterpret_cast<char*>(base::alignment::allocate_aligned(memory_per_stack));
new ((void*)task_stacks_[i].pointer()) data_structures::aligned_stack(task_stacks_memory_[i], memory_per_stack);
}
task_stacks_memory_[i] = reinterpret_cast<char *>(base::alignment::allocate_aligned(memory_per_stack));
new((void *) task_stacks_[i].pointer()) data_structures::aligned_stack(task_stacks_memory_[i], memory_per_stack);
}
}
malloc_scheduler_memory::~malloc_scheduler_memory() {
malloc_scheduler_memory::~malloc_scheduler_memory() {
free(threads_);
free(thread_states_);
......@@ -25,7 +29,8 @@ namespace pls {
}
free(task_stacks_);
free(task_stacks_memory_);
}
}
}
}
}
}
}
#include "pls/internal/scheduling/thread_state.h"
namespace pls {
namespace internal {
namespace scheduling {
namespace internal {
namespace scheduling {
}
}
}
}
}
add_executable(tests
main.cpp
base_tests.cpp scheduling_tests.cpp data_structures_test.cpp)
data_structures_test.cpp)
target_link_libraries(tests catch2 pls)
......@@ -13,7 +13,7 @@ static bool base_tests_visited;
static int base_tests_local_value_one;
static vector<int> base_tests_local_value_two;
TEST_CASE( "thread creation and joining", "[internal/data_structures/thread.h]") {
TEST_CASE("thread creation and joining", "[internal/data_structures/thread.h]") {
base_tests_visited = false;
auto t1 = start_thread([]() { base_tests_visited = true; });
t1.join();
......@@ -21,7 +21,7 @@ TEST_CASE( "thread creation and joining", "[internal/data_structures/thread.h]")
REQUIRE(base_tests_visited);
}
TEST_CASE( "thread state", "[internal/data_structures/thread.h]") {
TEST_CASE("thread state", "[internal/data_structures/thread.h]") {
int state_one = 1;
vector<int> state_two{1, 2};
......@@ -36,12 +36,12 @@ TEST_CASE( "thread state", "[internal/data_structures/thread.h]") {
int base_tests_shared_counter;
TEST_CASE( "spinlock protects concurrent counter", "[internal/data_structures/spinlock.h]") {
TEST_CASE("spinlock protects concurrent counter", "[internal/data_structures/spinlock.h]") {
constexpr int num_iterations = 1000000;
base_tests_shared_counter = 0;
spin_lock lock{};
SECTION( "lock can be used by itself" ) {
SECTION("lock can be used by itself") {
auto t1 = start_thread([&]() {
for (int i = 0; i < num_iterations; i++) {
lock.lock();
......@@ -63,7 +63,7 @@ TEST_CASE( "spinlock protects concurrent counter", "[internal/data_structures/sp
REQUIRE(base_tests_shared_counter == 0);
}
SECTION( "lock can be used with std::lock_guard" ) {
SECTION("lock can be used with std::lock_guard") {
auto t1 = start_thread([&]() {
for (int i = 0; i < num_iterations; i++) {
std::lock_guard<spin_lock> my_lock{lock};
......
......@@ -4,6 +4,7 @@
#include <pls/internal/data_structures/aligned_stack.h>
#include <pls/internal/data_structures/deque.h>
#include <pls/internal/data_structures/work_stealing_deque.h>
#include <vector>
#include <mutex>
......@@ -12,13 +13,12 @@ using namespace pls::internal::data_structures;
using namespace pls::internal::base;
using namespace std;
TEST_CASE( "aligned stack stores objects correctly", "[internal/data_structures/aligned_stack.h]") {
TEST_CASE("aligned stack stores objects correctly", "[internal/data_structures/aligned_stack.h]") {
constexpr long data_size = 1024;
char data[data_size];
aligned_stack stack{data, data_size};
SECTION( "stack correctly pushes sub linesize objects" ) {
SECTION("stack correctly pushes sub linesize objects") {
std::array<char, 5> small_data_one{'a', 'b', 'c', 'd', 'e'};
std::array<char, 64> small_data_two{};
std::array<char, 1> small_data_three{'A'};
......@@ -32,7 +32,7 @@ TEST_CASE( "aligned stack stores objects correctly", "[internal/data_structures/
REQUIRE(reinterpret_cast<std::uintptr_t>(pointer_three) % system_details::CACHE_LINE_SIZE == 0);
}
SECTION( "stack correctly pushes above linesize objects" ) {
SECTION("stack correctly pushes above linesize objects") {
std::array<char, 5> small_data_one{'a', 'b', 'c', 'd', 'e'};
std::array<char, system_details::CACHE_LINE_SIZE + 10> big_data_one{};
......@@ -43,7 +43,7 @@ TEST_CASE( "aligned stack stores objects correctly", "[internal/data_structures/
REQUIRE(reinterpret_cast<std::uintptr_t>(small_pointer_one) % system_details::CACHE_LINE_SIZE == 0);
}
SECTION( "stack correctly stores and retrieves objects" ) {
SECTION("stack correctly stores and retrieves objects") {
std::array<char, 5> data_one{'a', 'b', 'c', 'd', 'e'};
stack.push(data_one);
......@@ -52,7 +52,7 @@ TEST_CASE( "aligned stack stores objects correctly", "[internal/data_structures/
REQUIRE(retrieved_data == std::array<char, 5>{'a', 'b', 'c', 'd', 'e'});
}
SECTION( "stack can push and pop multiple times with correct alignment" ) {
SECTION("stack can push and pop multiple times with correct alignment") {
std::array<char, 5> small_data_one{'a', 'b', 'c', 'd', 'e'};
std::array<char, 64> small_data_two{};
std::array<char, 1> small_data_three{'A'};
......@@ -76,15 +76,15 @@ TEST_CASE( "aligned stack stores objects correctly", "[internal/data_structures/
}
}
TEST_CASE( "deque stores objects correctly", "[internal/data_structures/deque.h]") {
class my_item: public deque_item {
TEST_CASE("deque stores objects correctly", "[internal/data_structures/deque.h]") {
class my_item : public deque_item {
};
deque<my_item> deque;
my_item one, two, three;
SECTION( "add and remove items form the tail" ) {
SECTION("add and remove items form the tail") {
deque.push_tail(&one);
deque.push_tail(&two);
deque.push_tail(&three);
......@@ -94,7 +94,7 @@ TEST_CASE( "deque stores objects correctly", "[internal/data_structures/deque.h]
REQUIRE(deque.pop_tail() == &one);
}
SECTION( "handles getting empty by popping the tail correctly" ) {
SECTION("handles getting empty by popping the tail correctly") {
deque.push_tail(&one);
REQUIRE(deque.pop_tail() == &one);
......@@ -102,7 +102,7 @@ TEST_CASE( "deque stores objects correctly", "[internal/data_structures/deque.h]
REQUIRE(deque.pop_tail() == &two);
}
SECTION( "remove items form the head" ) {
SECTION("remove items form the head") {
deque.push_tail(&one);
deque.push_tail(&two);
deque.push_tail(&three);
......@@ -112,7 +112,7 @@ TEST_CASE( "deque stores objects correctly", "[internal/data_structures/deque.h]
REQUIRE(deque.pop_head() == &three);
}
SECTION( "handles getting empty by popping the head correctly" ) {
SECTION("handles getting empty by popping the head correctly") {
deque.push_tail(&one);
REQUIRE(deque.pop_head() == &one);
......@@ -120,7 +120,7 @@ TEST_CASE( "deque stores objects correctly", "[internal/data_structures/deque.h]
REQUIRE(deque.pop_head() == &two);
}
SECTION( "handles getting empty by popping the head and tail correctly" ) {
SECTION("handles getting empty by popping the head and tail correctly") {
deque.push_tail(&one);
REQUIRE(deque.pop_tail() == &one);
......@@ -131,3 +131,90 @@ TEST_CASE( "deque stores objects correctly", "[internal/data_structures/deque.h]
REQUIRE(deque.pop_tail() == &three);
}
}
TEST_CASE("work stealing deque stores objects correctly", "[internal/data_structures/work_stealing_deque.h]") {
constexpr long data_size = 2 << 14;
char data[data_size];
aligned_stack stack{data, data_size};
work_stealing_deque<int> deque{&stack};
int one = 1, two = 2, three = 3, four = 4;
SECTION("add and remove items from the tail") {
deque.push_tail(one);
deque.push_tail(two);
deque.push_tail(three);
REQUIRE(*deque.pop_tail() == three);
REQUIRE(*deque.pop_tail() == two);
REQUIRE(*deque.pop_tail() == one);
}
SECTION("handles getting empty by popping the tail correctly") {
deque.push_tail(one);
REQUIRE(*deque.pop_tail() == one);
deque.push_tail(two);
REQUIRE(*deque.pop_tail() == two);
}
SECTION("remove items from the head") {
deque.push_tail(one);
deque.push_tail(two);
deque.push_tail(three);
REQUIRE(*deque.pop_head() == one);
REQUIRE(*deque.pop_head() == two);
REQUIRE(*deque.pop_head() == three);
}
SECTION("handles getting empty by popping the head correctly") {
deque.push_tail(one);
REQUIRE(*deque.pop_head() == one);
deque.push_tail(two);
REQUIRE(*deque.pop_head() == two);
}
SECTION("handles getting empty by popping the head and tail correctly") {
deque.push_tail(one);
REQUIRE(*deque.pop_tail() == one);
deque.push_tail(two);
REQUIRE(*deque.pop_head() == two);
deque.push_tail(three);
REQUIRE(*deque.pop_tail() == three);
}
SECTION("handles jumps bigger 1 correctly") {
deque.push_tail(one);
deque.push_tail(two);
REQUIRE(*deque.pop_tail() == two);
deque.push_tail(three);
deque.push_tail(four);
REQUIRE(*deque.pop_head() == one);
REQUIRE(*deque.pop_head() == three);
REQUIRE(*deque.pop_head() == four);
}
SECTION("handles stack reset 1 correctly when emptied by tail") {
deque.push_tail(one);
deque.push_tail(two);
auto tmp_result = deque.pop_tail();
REQUIRE(*tmp_result == two);
deque.release_memory_until(tmp_result);
REQUIRE(*deque.pop_tail() == one);
deque.push_tail(three);
deque.push_tail(four);
REQUIRE(*deque.pop_head() == three);
REQUIRE(*deque.pop_tail() == four);
}
SECTION("syncs correctly") {
}
}
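Reviewer note: the new `work_stealing_deque` tests rely on the usual work-stealing ordering, where the owner pushes and pops at the tail and a thief pops at the head. The sketch below models only that ordering. It is a single-threaded stand-in built on `std::deque`, not the lock-free pls implementation, and `model_deque` and `replay_mixed_pops` are hypothetical names. It replays the section that pops from the tail between two rounds of pushes.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical single-threaded model of the deque interface used in the tests.
// It only captures the ordering, not the concurrency or memory handling.
template<typename T>
class model_deque {
  std::deque<T> items_;

 public:
  // Owner side: add the newest item at the tail.
  void push_tail(const T &item) { items_.push_back(item); }

  // Owner side: take the newest item; returns false if the deque is empty.
  bool pop_tail(T &out) {
    if (items_.empty()) return false;
    out = items_.back();
    items_.pop_back();
    return true;
  }

  // Thief side: take the oldest item; returns false if the deque is empty.
  bool pop_head(T &out) {
    if (items_.empty()) return false;
    out = items_.front();
    items_.pop_front();
    return true;
  }
};

// Push 1, 2; pop the tail; push 3, 4; then drain from the head.
// Returns the values in the order they were popped.
inline std::vector<int> replay_mixed_pops() {
  model_deque<int> deque;
  std::vector<int> popped;
  int value = 0;

  deque.push_tail(1);
  deque.push_tail(2);
  if (deque.pop_tail(value)) popped.push_back(value);

  deque.push_tail(3);
  deque.push_tail(4);
  while (deque.pop_head(value)) popped.push_back(value);

  return popped;
}
```

In this model the popped order is 2, 1, 3, 4, which matches the `REQUIRE` sequence in that test section: the value removed from the tail never shows up again at the head.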
......@@ -4,11 +4,11 @@
using namespace pls;
class once_sub_task: public fork_join_sub_task {
std::atomic<int>* counter_;
class once_sub_task : public fork_join_sub_task {
std::atomic<int> *counter_;
int children_;
protected:
protected:
void execute_internal() override {
(*counter_)++;
for (int i = 0; i < children_; i++) {
......@@ -16,38 +16,37 @@ protected:
}
}
public:
explicit once_sub_task(std::atomic<int>* counter, int children):
public:
explicit once_sub_task(std::atomic<int> *counter, int children) :
fork_join_sub_task(),
counter_{counter},
children_{children} {}
};
class force_steal_sub_task: public fork_join_sub_task {
std::atomic<int>* parent_counter_;
std::atomic<int>* overall_counter_;
class force_steal_sub_task : public fork_join_sub_task {
std::atomic<int> *parent_counter_;
std::atomic<int> *overall_counter_;
protected:
protected:
void execute_internal() override {
(*overall_counter_)--;
if (overall_counter_->load() > 0) {
std::atomic<int> counter{1};
spawn_child(force_steal_sub_task(&counter, overall_counter_));
while (counter.load() > 0)
; // Spin...
while (counter.load() > 0); // Spin...
}
(*parent_counter_)--;
}
public:
explicit force_steal_sub_task(std::atomic<int>* parent_counter, std::atomic<int>* overall_counter):
public:
explicit force_steal_sub_task(std::atomic<int> *parent_counter, std::atomic<int> *overall_counter) :
fork_join_sub_task(),
parent_counter_{parent_counter},
overall_counter_{overall_counter} {}
};
TEST_CASE( "tbb task are scheduled correctly", "[internal/scheduling/fork_join_task.h]") {
TEST_CASE("tbb task are scheduled correctly", "[internal/scheduling/fork_join_task.h]") {
malloc_scheduler_memory my_scheduler_memory{8, 2 << 12};
SECTION("tasks are executed exactly once") {
......@@ -56,7 +55,7 @@ TEST_CASE( "tbb task are scheduled correctly", "[internal/scheduling/fork_join_t
int total_tasks = 1 + 4 + 4 * 3 + 4 * 3 * 2 + 4 * 3 * 2 * 1;
std::atomic<int> counter{0};
my_scheduler.perform_work([&] (){
my_scheduler.perform_work([&]() {
once_sub_task sub_task{&counter, start_counter};
fork_join_task task{&sub_task, unique_id::create(42)};
scheduler::execute_task(task);
......@@ -68,7 +67,7 @@ TEST_CASE( "tbb task are scheduled correctly", "[internal/scheduling/fork_join_t
SECTION("tasks can be stolen") {
scheduler my_scheduler{&my_scheduler_memory, 8};
my_scheduler.perform_work([&] (){
my_scheduler.perform_work([&]() {
std::atomic<int> dummy_parent{1}, overall_counter{8};
force_steal_sub_task sub_task{&dummy_parent, &overall_counter};
fork_join_task task{&sub_task, unique_id::create(42)};
......