Add performance notes on matrix multiplication.

4 jobs from parallel_for in 3 minutes 50 seconds (queued for 2 seconds)
Stage      Status  Job ID  Name                   Duration  Coverage
Build      passed  #3055   build_cmake            00:45
Test       passed  #3056   run_tests              00:46
Sanitizer  passed  #3058   run_address_sanitizer  01:22
Sanitizer  passed  #3057   run_thread_sanitizer   00:57