Commit c87c8e3f authored May 28, 2019 by FritzFlorian
Add performance notes on matrix multiplication.
parent afd0331b
Pipeline #1228 passed with stages in 3 minutes 50 seconds
Showing 2 changed files with 34 additions and 0 deletions:

PERFORMANCE.md +33 -0
lib/pls/include/pls/pls.h +1 -0
PERFORMANCE.md
@@ -122,3 +122,36 @@ Average results Unbalanced:
There seems to be only a minor performance difference between the two,
suggesting that our two-level approach is not the part causing our
weaker performance.
### Commit afd0331b - Some notes on scaling problems
After tweaking individual values and parameters we still cannot find
the main cause of our slowdown on multiple processors.
We also use Intel's VTune Amplifier to profile our runs
and find that we always spend far too much time 'waiting for work',
e.g. in the backoff mechanism when it is enabled, or in the locks
guarding work stealing when backoff is disabled. This leads us to believe
that our problems might be connected to some issue with work distribution
in the FFT case, as the unbalanced tree search (with a lot of 'local' work)
performs well.
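
For reference, the 'waiting for work' time accrues in loops of roughly the
following shape. This is a minimal sketch of an exponential backoff; the
actual pls backoff code is not part of this commit, so the names and bounds
here are illustrative assumptions:

```cpp
#include <algorithm>
#include <chrono>
#include <thread>

// Sketch of an exponential backoff loop. A worker that repeatedly fails
// to steal work sleeps for exponentially growing intervals; all of this
// time is attributed to 'waiting for work' by the profiler.
class backoff {
  std::chrono::microseconds wait_time_{1};

 public:
  // Called after a failed steal attempt: sleep, then double the wait,
  // capped at an (illustrative) upper bound of 100 microseconds.
  void do_backoff() {
    std::this_thread::sleep_for(wait_time_);
    wait_time_ = std::min(wait_time_ * 2, std::chrono::microseconds{100});
  }

  // Called as soon as work is found again.
  void reset() { wait_time_ = std::chrono::microseconds{1}; }
};
```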
To get more data we add benchmarks on matrix multiplication, implemented
in two fashions: once with a 'native' array stealing task and once with
a fork-join task. Both implementations use the same minimum array
sub-size of 4 elements, so we can hopefully see whether they show any
performance differences. A sketch of the fork-join variant follows below.
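
A minimal sketch of how the fork-join variant could look, assuming a
recursive split over the rows of the result matrix with `std::async`
standing in for the pls fork-join task (the actual benchmark code is not
part of this commit):

```cpp
#include <cstddef>
#include <future>
#include <vector>

using matrix = std::vector<std::vector<double>>;

// Recursively split the rows of the result until at most 4 remain
// (mirroring the minimum sub-size of 4 used by both variants), then
// multiply that strip sequentially. Assumes c is zero-initialized and
// all dimensions match.
void multiply_rows(const matrix& a, const matrix& b, matrix& c,
                   std::size_t begin, std::size_t end) {
  if (end - begin <= 4) {
    for (std::size_t i = begin; i < end; i++)
      for (std::size_t k = 0; k < b.size(); k++)
        for (std::size_t j = 0; j < b[k].size(); j++)
          c[i][j] += a[i][k] * b[k][j];
    return;
  }
  std::size_t mid = begin + (end - begin) / 2;
  // Fork the first half, work on the second half, then join.
  auto first_half = std::async(std::launch::async, multiply_rows,
                               std::cref(a), std::cref(b), std::ref(c),
                               begin, mid);
  multiply_rows(a, b, c, mid, end);
  first_half.wait();
}
```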
Best case fork-join:

<img src="media/afd0331b_matrix_best_case_fork.png" width="400" />

Average case fork-join:

<img src="media/afd0331b_matrix_average_case_fork.png" width="400" />

Best case Native:

<img src="media/afd0331b_matrix_best_case_native.png" width="400" />

Average case Native:

<img src="media/afd0331b_matrix_average_case_native.png" width="400" />
lib/pls/include/pls/pls.h
@@ -25,6 +25,7 @@ using internal::scheduling::fork_join_task;
 using algorithm::invoke_parallel;
 using algorithm::parallel_for_fork_join;
+using algorithm::parallel_for;
 }
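
The new export makes `pls::parallel_for` available to library users
alongside `invoke_parallel` and `parallel_for_fork_join`. A hypothetical
call site, assuming an iterator-range-plus-functor signature (only the
export itself, not the signature, is shown in this diff):

```cpp
#include <vector>

#include <pls/pls.h>

// Hypothetical call site: double every element in parallel. The
// iterator-range-plus-functor signature is an assumption; only the
// pls::parallel_for export is part of this commit.
void scale_all(std::vector<double>& values) {
  pls::parallel_for(values.begin(), values.end(),
                    [](double& value) { value *= 2.0; });
}
```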