Allow the use of both __thread locals and pthread_locals.

The __thread implementation of thread-local static variables appears to be more efficient, so it is enabled by default. If it turns out to be unavailable, or slower than the pthread version, on a particular system or processor, it can be switched off for that platform later using macros.
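A minimal sketch of how such a macro toggle could look, not the code from this change: the macro name USE_PTHREAD_TLS and the helpers counter_slot and count_in_thread are hypothetical and chosen only for illustration. It assumes a GCC/Clang-style compiler that supports __thread, and falls back to pthread_key_create / pthread_getspecific when the macro is set.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical toggle (not from the original change):
 * build with -DUSE_PTHREAD_TLS=1 to force the pthread-based path. */
#ifndef USE_PTHREAD_TLS
#define USE_PTHREAD_TLS 0
#endif

#if !USE_PTHREAD_TLS
/* Default path: compiler-provided thread-local storage. */
static __thread int tls_counter = 0;

static int *counter_slot(void) { return &tls_counter; }
#else
/* Fallback path: one pthread key, with a lazily allocated int per thread. */
static pthread_key_t counter_key;
static pthread_once_t counter_once = PTHREAD_ONCE_INIT;

static void make_key(void) {
    /* free() runs as the destructor when a thread exits. */
    if (pthread_key_create(&counter_key, free) != 0) abort();
}

static int *counter_slot(void) {
    pthread_once(&counter_once, make_key);
    int *p = pthread_getspecific(counter_key);
    if (p == NULL) {
        p = calloc(1, sizeof *p);
        if (p == NULL || pthread_setspecific(counter_key, p) != 0) abort();
    }
    return p;
}
#endif

/* Increments this thread's counter n times and reports the result. */
static void *worker(void *arg) {
    int n = *(const int *)arg;
    for (int i = 0; i < n; i++) ++*counter_slot();
    *(int *)arg = *counter_slot();
    return NULL;
}

/* Runs worker in a fresh thread and returns that thread's final count. */
int count_in_thread(int n) {
    assert(n >= 0);
    pthread_t t;
    int value = n;
    if (pthread_create(&t, NULL, worker, &value) != 0) abort();
    pthread_join(t, NULL);
    return value;
}
```

Both paths expose the same counter_slot() accessor, so callers do not need to know which mechanism was selected. Each new thread starts from its own zeroed counter, and the calling thread's counter is left untouched.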