The __thread implementation of thread local, static variables seems more efficient, so we activate this by default. If it is (for some reason) not available/slower than the pthread version, one can toggle it for the specific system/processor later on using macros.