Abstract Application programmers sometimes write hand-coded synchronization routines instead of using the constructs provided by a threading API, either to reduce synchronization overhead or to provide functionality that the existing constructs do not offer. Unfortunately, hand-coded synchronization routines may have a negative impact on the performance, performance tuning, or debugging of multi-threaded applications.
Read the original:
Use Synchronization Routines Provided by the Threading API Rather than Hand-Coded Synchronization
Abstract In multithreaded applications, locks are used to synchronize entry to regions of code that access shared resources. The region of code protected by a lock is called a critical section.
The rest is here:
Managing Lock Contention: Large and Small Critical Sections
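The idea of a small critical section guarded by a lock from the threading API can be sketched as follows. This is a minimal illustration that assumes POSIX threads. The names `parallel_sum`, `sum_slice`, `slice_t`, and `NUM_THREADS` are hypothetical and do not come from the linked articles. Each thread sums its own slice without holding the lock and enters the critical section only for a single addition to the shared total.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4   /* hypothetical thread count, chosen for illustration */

static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;
static long total = 0;  /* shared resource protected by total_lock */

typedef struct {
    const int *data;    /* start of the slice owned by one thread */
    size_t count;       /* number of elements in the slice */
} slice_t;

/* Sum a private slice without holding the lock, then publish the result
 * in a critical section that contains a single addition. */
static void *sum_slice(void *arg)
{
    const slice_t *s = arg;
    long local = 0;
    for (size_t i = 0; i < s->count; ++i)
        local += s->data[i];

    pthread_mutex_lock(&total_lock);    /* critical section begins */
    total += local;
    pthread_mutex_unlock(&total_lock);  /* critical section ends */
    return NULL;
}

/* Split data across NUM_THREADS workers and return the combined sum. */
long parallel_sum(const int *data, size_t n)
{
    pthread_t threads[NUM_THREADS];
    slice_t slices[NUM_THREADS];
    size_t chunk = n / NUM_THREADS;

    total = 0;
    for (int t = 0; t < NUM_THREADS; ++t) {
        slices[t].data = data + (size_t)t * chunk;
        /* the last thread also takes the remainder */
        slices[t].count = (t == NUM_THREADS - 1) ? n - (size_t)t * chunk : chunk;
        int rc = pthread_create(&threads[t], NULL, sum_slice, &slices[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < NUM_THREADS; ++t)
        pthread_join(threads[t], NULL);
    return total;  /* safe to read: all workers have been joined */
}
```

Calling `parallel_sum(data, n)` from a single thread returns the same value as a serial loop. Whether a small or a large critical section is the better choice depends on how often the lock is taken, which is the trade-off the article title refers to.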
Abstract Many compute-intensive applications involve complex transformations of ordered input data to ordered output data. Examples include sound and video transcoding, lossless data compression, and seismic data processing.
Read more:
Exploiting Data Parallelism in Ordered Data Streams
Abstract Tasks are a light-weight alternative to threads that provide faster startup and shutdown times, better load balancing, efficient use of available resources, and a higher level of abstraction. Three programming models that include task-based programming are Intel® Threading Building Blocks (Intel® TBB), and OpenMP*
Go here to see the original:
Using Tasks Instead of Threads
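As a rough illustration of task-based programming, the sketch below uses OpenMP tasks in C to sum an array by recursive splitting. It is an assumption-laden example, not code from the linked article: the function names `task_sum` and `sum_range` and the cutoff of 64 elements are hypothetical. When the file is compiled without OpenMP support the pragmas are ignored and the code runs serially with the same result.

```c
#include <stddef.h>

#define TASK_CUTOFF 64  /* hypothetical size below which we stop creating tasks */

/* Recursively sum data[0..n) by handing each half of the range to a task.
 * The runtime, not the programmer, decides which thread runs each task. */
static long sum_range(const int *data, size_t n)
{
    if (n <= TASK_CUTOFF) {
        long s = 0;
        for (size_t i = 0; i < n; ++i)
            s += data[i];
        return s;
    }

    long left = 0, right = 0;
    size_t half = n / 2;

    #pragma omp task shared(left)
    left = sum_range(data, half);

    #pragma omp task shared(right)
    right = sum_range(data + half, n - half);

    #pragma omp taskwait  /* wait for both child tasks before combining */
    return left + right;
}

/* Create a thread team once; one thread spawns the root of the task tree. */
long task_sum(const int *data, size_t n)
{
    long result = 0;
    #pragma omp parallel
    #pragma omp single
    result = sum_range(data, n);
    return result;
}
```

The cutoff keeps tasks from becoming so small that scheduling overhead dominates the work. A suitable value depends on the workload and would need to be measured.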
Abstract Memory sub-system components contribute significantly to the performance characteristics of an application. As an increasing number of threads or processes share the limited resources of cache capacity and memory bandwidth, the scalability of a threaded application can become constrained. Memory-intensive threaded applications can suffer from memory bandwidth saturation as more threads are introduced.
Go here to see the original:
Detecting Memory Bandwidth Saturation in Threaded Applications
Abstract In symmetric multiprocessor (SMP) systems, each processor has a local cache. The memory system must guarantee cache coherence.
Read the original here:
Avoiding and Identifying False Sharing Among Threads
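One common way to avoid false sharing is to place data written by different threads on different cache lines. The sketch below shows this with C11 alignment and POSIX threads. It assumes a 64-byte cache line, which is typical but not universal, and the names `padded_counter_t`, `bump`, and `count_in_parallel` are hypothetical and not taken from the linked article.

```c
#include <pthread.h>
#include <stdalign.h>
#include <stddef.h>

#define NUM_THREADS 4        /* hypothetical thread count */
#define CACHE_LINE  64       /* assumed cache line size in bytes */
#define ITERATIONS  100000L  /* increments performed by each thread */

/* Aligning the member to CACHE_LINE makes each array element occupy its own
 * cache line, so a write by one thread does not invalidate the line that
 * holds another thread's counter. */
typedef struct {
    alignas(CACHE_LINE) long value;
} padded_counter_t;

static padded_counter_t counters[NUM_THREADS];

/* Each thread increments only its own counter; no lock is needed. */
static void *bump(void *arg)
{
    padded_counter_t *c = arg;
    for (long i = 0; i < ITERATIONS; ++i)
        c->value++;
    return NULL;
}

/* Run the workers and return the sum of the per-thread counters. */
long count_in_parallel(void)
{
    pthread_t threads[NUM_THREADS];
    long sum = 0;

    for (int t = 0; t < NUM_THREADS; ++t) {
        counters[t].value = 0;
        pthread_create(&threads[t], NULL, bump, &counters[t]);
    }
    for (int t = 0; t < NUM_THREADS; ++t) {
        pthread_join(threads[t], NULL);
        sum += counters[t].value;
    }
    return sum;
}
```

With a plain `long counters[NUM_THREADS]` array the result would be identical, but several counters could share one cache line and the threads could slow each other down. The padding trades memory for the removal of that contention.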
When an application uses FFTW and links with Intel® MKL 10.2 update 3, its fftw_malloc and fftw_free calls must be matched consistently. From MKL 10.2 update 3 onwards, fftwl_malloc and fftwl_free call MKL_malloc and MKL_free.
Read the original post:
FFTW mismatching malloc and free calls result error
Technical Notes: Starting with version 10.2 of Intel® MKL, the names of some service functions have become obsolete and will be removed in subsequent releases.
Original post:
Some service functions have become obsolete and will be removed in subsequent releases.
When linking with Intel® Math Kernel Library 10.2 update 2 on an SGI* workstation with an Intel® Xeon processor, the following errors are reported:
libmkl_scalapack_lp64.so: undefined reference to `MKL_SCALAPACK_INT'
libmkl_scalapack_lp64.so: undefined reference to `Cdsendrecv'
The issue occurs only when SGI's MPI library and the libmkl_blacs_sgimpt_lp64.a library from MKL are used together. This is a known issue and will be fixed in a future version. Workaround: compile a C source file containing the two lines below and link it in addition to MKL:
#include <mpi.h>
int MKL_SCALAPACK_INT = (int) MPI_INT;
See original here:
libmkl_scalapack_lp64.so: undefined reference to `MKL_SCALAPACK_INT'