This paper is an overview of synchronization methods across the hardware stack, with an analysis that attributes high-level performance differences to low-level hardware constraints.
Among the key takeaways:
Latency of atomic operations: There is a heavy penalty for cross-socket sharing, i.e., passing data between cores that sit on different sockets, as opposed to sharing within a socket between cores or hyperthreads. This penalty outweighs other sources of latency by far.
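As a rough way to see this on a real machine, here is a minimal sketch (not from the paper; it assumes Linux, GCC, and pthreads, and the CPU numbers are placeholders) that pins two threads to specific cores and times a contended fetch-and-add. Pinning the threads to cores on the same socket versus different sockets makes the cross-socket penalty visible.

```c
// Build: gcc -O2 -pthread atomic_ping.c
// CPU numbers 0 and 1 below are assumptions; map them to same-socket or
// cross-socket cores on your machine (see lscpu) and compare the results.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000

static _Atomic unsigned long shared_counter;   // the contended cache line

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *worker(void *arg) {
    pin_to_cpu((int)(long)arg);
    for (long i = 0; i < ITERS; i++)
        atomic_fetch_add(&shared_counter, 1); // RMW forces a cache-line transfer
    return NULL;
}

int main(void) {
    struct timespec t0, t1;
    pthread_t a, b;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&a, NULL, worker, (void *)0L);  // pinned to CPU 0
    pthread_create(&b, NULL, worker, (void *)1L);  // pinned to CPU 1
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per fetch_add (2 threads, %d iterations each)\n",
           ns / (2.0 * ITERS), ITERS);
    return 0;
}
```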
Locking vs message passing: Message passing performs better when the data is highly contended; locking wins under low contention.
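The sketch below (my illustration, not the paper's benchmark; `mailbox_t`, `client_increment`, and `server_loop` are made-up names) contrasts the two patterns for a shared counter. In the lock-based version every client competes for the same lock, so the lock word and the counter bounce between caches; in the message-passing (delegation) version a single server thread owns the counter and clients only post requests to it, so under heavy contention the data stays hot in one core's cache.

```c
// Build: gcc -O2 -pthread delegation.c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Lock-based: every client grabs the same lock. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long locked_counter;

static void locked_increment(void) {
    pthread_mutex_lock(&lock);
    locked_counter++;
    pthread_mutex_unlock(&lock);
}

/* Message passing: one server thread owns the counter; clients post requests
 * to per-client mailboxes and wait for them to be acknowledged. */
#define NCLIENTS 2
typedef struct { _Atomic bool pending; } mailbox_t;
static mailbox_t mailbox[NCLIENTS];
static unsigned long server_counter;      /* touched only by the server */
static _Atomic bool stop;

static void client_increment(int id) {
    atomic_store_explicit(&mailbox[id].pending, true, memory_order_release);
    while (atomic_load_explicit(&mailbox[id].pending, memory_order_acquire))
        ;                                 /* spin until the server has replied */
}

static void *server_loop(void *arg) {
    (void)arg;
    while (!atomic_load(&stop)) {
        for (int i = 0; i < NCLIENTS; i++) {
            if (atomic_load_explicit(&mailbox[i].pending, memory_order_acquire)) {
                server_counter++;         /* counter stays in the server's cache */
                atomic_store_explicit(&mailbox[i].pending, false,
                                      memory_order_release);
            }
        }
    }
    return NULL;
}

int main(void) {
    pthread_t server;
    pthread_create(&server, NULL, server_loop, NULL);
    locked_increment();                   /* lock-based path */
    client_increment(0);                  /* delegation path */
    atomic_store(&stop, true);
    pthread_join(server, NULL);
    printf("locked=%lu delegated=%lu\n", locked_counter, server_counter);
    return 0;
}
```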
Performance of different types of locks: Each of the nine locks considered is the best performer in certain situations, though the ticket lock is the best performer in most low-contention cases. A "ticket lock" is a spinlock that grants the lock in FIFO order: each arriving thread takes a ticket number and proceeds only when its number is being served.
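A minimal sketch of such a lock using C11 atomics (my illustration, not the paper's code): taking a ticket is a single fetch-and-add, and each thread then spins until the now_serving counter reaches its ticket, which is what produces the FIFO ordering.

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;  // next ticket number to hand out
    atomic_uint now_serving;  // ticket currently allowed to hold the lock
} ticket_lock_t;

#define TICKET_LOCK_INIT { 0, 0 }

static void ticket_lock(ticket_lock_t *l) {
    // Take a ticket; this fetch-and-add is the only contended RMW operation.
    unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    // Spin until our ticket is being served (threads are served in FIFO order).
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my)
        ;  // optionally back off proportionally to (my - now_serving)
}

static void ticket_unlock(ticket_lock_t *l) {
    // Only the lock holder writes now_serving, so a plain read-then-store is safe.
    unsigned next = atomic_load_explicit(&l->now_serving,
                                         memory_order_relaxed) + 1;
    atomic_store_explicit(&l->now_serving, next, memory_order_release);
}
```

Usage is the usual pattern: declare `static ticket_lock_t lk = TICKET_LOCK_INIT;`, then wrap the critical section in `ticket_lock(&lk); ... ticket_unlock(&lk);`.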