The purpose of concurrent programming is to make programs run faster, but adding concurrency does not automatically achieve that. Its advantages only appear once the workload reaches a certain scale, so concurrency is really only worth discussing under genuinely high load. Even if you have never built a high-concurrency system, learning concurrency helps you better understand distributed architectures. When concurrency is low, a single-threaded program can actually be more efficient than a multi-threaded one. Why? Anyone familiar with operating systems knows that the CPU implements multithreading by allocating time slices to each thread. When the CPU switches from one task to another, it saves the state of the current task so that execution can resume from that state when the task runs again. This process is called a context switch, and its cost is the overhead a multi-threaded program pays that a single-threaded one does not.
In Java multithreading, the synchronized keyword and the volatile keyword both play important roles: each can be used to coordinate threads. But how are they implemented underneath?
volatile
volatile only guarantees the visibility of a variable to every thread; it does not guarantee atomicity. I won't say much about how to use volatile in Java. My suggestion is to prefer the classes in the java.util.concurrent.atomic package rather than relying on volatile in other situations. See this article for a fuller explanation.
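To make the visibility-but-not-atomicity point concrete, here is a minimal sketch (class and field names are hypothetical) of the one pattern where volatile alone is enough: a flag written by one thread and only read by another, with no compound read-modify-write involved.

```java
// Sketch: volatile guarantees the worker eventually sees the writer's
// update to `running`. Without volatile, the JIT could hoist the read
// out of the loop and spin forever.
public class StopFlag {
    private volatile boolean running = true;

    public void stop() { running = false; }

    public void runUntilStopped() {
        while (running) {
            // do work; there is no compound operation on shared state here,
            // so visibility alone suffices and no lock is needed
            Thread.yield();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        StopFlag f = new StopFlag();
        Thread worker = new Thread(f::runUntilStopped);
        worker.start();
        Thread.sleep(100);
        f.stop();            // the volatile write is published to the worker
        worker.join(2000);   // worker should exit promptly
        System.out.println(worker.isAlive() ? "still running" : "stopped");
    }
}
```

The moment the flag participates in a read-modify-write such as `i++`, this guarantee is no longer enough, as the next example shows.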
Introduction
Consider the following code:
```java
package org.go;

public class Go {
    volatile int i = 0;

    private void inc() { i++; }

    public static void main(String[] args) {
        Go go = new Go();
        for (int i = 0; i < 10; i++) {
            new Thread(() -> {
                for (int j = 0; j < 1000; j++) go.inc();
            }).start();
        }
        while (Thread.activeCount() > 1) {
            Thread.yield();
        }
        System.out.println(go.i);
    }
}
```

The result differs on every run, and the printed number is almost always less than 10000. This is because the i++ inside inc() is not an atomic operation. Some would suggest synchronizing inc() with synchronized, or using a lock from the java.util.concurrent.locks package to control thread synchronization. But neither is as good as the following solution:
```java
package org.go;

import java.util.concurrent.atomic.AtomicInteger;

public class Go {
    AtomicInteger i = new AtomicInteger(0);

    private void inc() { i.getAndIncrement(); }

    public static void main(String[] args) {
        Go go = new Go();
        for (int i = 0; i < 10; i++) {
            new Thread(() -> {
                for (int j = 0; j < 1000; j++) go.inc();
            }).start();
        }
        while (Thread.activeCount() > 1) {
            Thread.yield();
        }
        System.out.println(go.i);
    }
}
```

At this point, if you don't know how the atomic classes are implemented, you might well suspect that AtomicInteger uses a lock underneath and is therefore not very efficient. So what is it, exactly? Let's take a look.
Internal implementation of atomic classes
Whether it is AtomicInteger or ConcurrentLinkedQueue's node class ConcurrentLinkedQueue.Node, each has a static field:

```java
private static final sun.misc.Unsafe UNSAFE;
```

This class is the Java-side wrapper around the native sun::misc::Unsafe implementation that provides the atomic semantics. I wanted to see the underlying implementation, and I happen to have the gcc 4.8 source code at hand; rather than quoting local paths, it is more convenient to link the corresponding files on GitHub. Look here.
Take the implementation of getAndIncrement() as an example.
AtomicInteger.java
```java
private static final Unsafe unsafe = Unsafe.getUnsafe();

public final int getAndIncrement() {
    for (;;) {
        int current = get();
        int next = current + 1;
        if (compareAndSet(current, next))
            return current;
    }
}

public final boolean compareAndSet(int expect, int update) {
    return unsafe.compareAndSwapInt(this, valueOffset, expect, update);
}
```

Note the for loop: the method only returns once compareAndSet succeeds; otherwise it keeps retrying the compareAndSet.
This compareAndSet in turn calls into Unsafe. Here I noticed that the Oracle JDK implementation is slightly different: looking at the src shipped with the JDK, Oracle's getAndIncrement() delegates the whole operation to Unsafe. Even so, I believe that when Oracle JDK implements Unsafe, it only needs a compare-and-swap primitive underneath, because a single compareAndSet is enough to build atomic increment, decrement, and set operations.
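That claim, that one CAS primitive suffices for every read-modify-write, can be illustrated with a short sketch. The helper below is hypothetical (not JDK code) and builds a fetch-and-add using nothing but AtomicInteger.compareAndSet in a retry loop, mirroring the getAndIncrement shown above.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: any read-modify-write can be built from a single CAS primitive
// by retrying in a loop until no other thread interferes.
public class CasOnly {
    // fetch-and-add implemented purely with compareAndSet
    static int getAndAdd(AtomicInteger a, int delta) {
        for (;;) {
            int current = a.get();
            int next = current + delta;
            if (a.compareAndSet(current, next))
                return current;   // CAS succeeded: we "won" this update
            // otherwise another thread changed the value first; retry
        }
    }

    public static void main(String[] args) {
        AtomicInteger a = new AtomicInteger(0);
        System.out.println(getAndAdd(a, 1));   // prints 0 (the old value)
        System.out.println(getAndAdd(a, -1));  // prints 1
        System.out.println(a.get());           // prints 0
    }
}
```

Increment, decrement, or an arbitrary update are all just different `next` computations around the same CAS.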
Unsafe.java
```java
public native boolean compareAndSwapInt(Object obj, long offset, int expect, int update);
```
The C++ implementation is invoked through JNI.
natUnsafe.cc
```cpp
jboolean
sun::misc::Unsafe::compareAndSwapInt (jobject obj, jlong offset,
                                      jint expect, jint update)
{
  jint *addr = (jint *)((char *)obj + offset);
  return compareAndSwap (addr, expect, update);
}

static inline bool
compareAndSwap (volatile jint *addr, jint old, jint new_val)
{
  jboolean result = false;
  spinlock lock;
  if ((result = (*addr == old)))
    *addr = new_val;
  return result;
}
```

Unsafe::compareAndSwapInt calls the static function compareAndSwap, which uses a spinlock to protect the compare and the assignment. The spinlock here behaves like a lock guard: it acquires the lock in its constructor and releases it in its destructor.
The spinlock is what we need to focus on: it is what makes the compare-and-swap a true atomic operation until the lock is released.
What is a spinlock
A spinlock acquires a resource by busy-waiting. Unlike a mutex, which blocks the current thread and releases the CPU while waiting, a spinlock never suspends the thread: it keeps polling until the condition is met rather than sleeping and later re-competing for the CPU. This means a spinlock beats a mutex only when the expected wait for the lock is shorter than the cost of a thread context switch.
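Before looking at libgcj's C++ version, here is a minimal Java sketch of the same idea (a hypothetical class, not production code): a flag flipped with CAS, with Thread.yield() playing the role of _Jv_ThreadYield.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal spin-lock sketch: busy-waits on a flag with CAS instead of
// suspending the thread the way a mutex would.
public class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void lock() {
        // spin until we flip the flag from false (free) to true (held)
        while (!locked.compareAndSet(false, true))
            Thread.yield();   // analogous to _Jv_ThreadYield()
    }

    public void unlock() {
        locked.set(false);    // analogous to release_set(&lock, 0)
    }

    // demo: two threads incrementing a plain int under the spin lock
    static int counter = 0;

    public static void main(String[] args) throws InterruptedException {
        SpinLock lock = new SpinLock();
        Runnable task = () -> {
            for (int i = 0; i < 10000; i++) {
                lock.lock();
                try { counter++; } finally { lock.unlock(); }
            }
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counter);   // prints 20000
    }
}
```

The AtomicBoolean's volatile semantics also establish the happens-before edge that makes the plain `counter++` safe between unlock and the next lock.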
natUnsafe.cc
```cpp
class spinlock
{
  static volatile obj_addr_t lock;

public:
  spinlock ()
  {
    while (! compare_and_swap (&lock, 0, 1))
      _Jv_ThreadYield ();
  }
  ~spinlock ()
  {
    release_set (&lock, 0);
  }
};
```

A static variable, static volatile obj_addr_t lock;, serves as the flag bit, and a guard is implemented through C++ RAII, so the so-called lock is really just this static member. Note that volatile in C++ does not guarantee synchronization; what does is the compare_and_swap called in the constructor on that static lock variable. While the variable is 1, you must wait; when it is 0, you set it to 1 through an atomic operation, which means you have acquired the lock.
It is actually surprising that a static variable is used here: it means all lock-free structures share the same flag (in fact a size_t) to decide whether to lock. While this variable is 1, every other spinlock has to wait. Why not add a per-instance volatile obj_addr_t lock to sun::misc::Unsafe and pass it to spinlock as a constructor argument? That would give each Unsafe its own flag bit. Wouldn't that perform better?
_Jv_ThreadYield, shown in the file below, gives up the CPU via the sched_yield system call (man 2 sched_yield). The macro HAVE_SCHED_YIELD is defined by configure; if it is not defined at compile time, the yield becomes a no-op and spinlock turns into a pure spin lock.
posix-threads.h
```cpp
inline void
_Jv_ThreadYield (void)
{
#ifdef HAVE_SCHED_YIELD
  sched_yield ();
#endif /* HAVE_SCHED_YIELD */
}
```

The locks.h header has a different implementation on each platform. We take ia64 (Intel Itanium, not to be confused with x86-64) as an example; the other implementations can be seen here.
ia64/locks.h
```cpp
typedef size_t obj_addr_t;

inline static bool
compare_and_swap (volatile obj_addr_t *addr, obj_addr_t old, obj_addr_t new_val)
{
  return __sync_bool_compare_and_swap (addr, old, new_val);
}

inline static void
release_set (volatile obj_addr_t *addr, obj_addr_t new_val)
{
  __asm__ __volatile__ ("" : : : "memory");
  *(addr) = new_val;
}
```

__sync_bool_compare_and_swap is a GCC built-in function, and the empty inline assembly with the "memory" clobber acts as a memory barrier before the release store.
In short, the hardware guarantees synchronization across CPU cores, and the Unsafe implementation is made as efficient as possible. If gcj is this efficient, I believe Oracle JDK and OpenJDK will be no worse.
Atomic operations and GCC built-in atomic operations
Atomic operation
Neither Java expressions nor C++ expressions are atomic operations. That means in code like this:

```java
i++;  // suppose i is a variable shared between threads
```

in a multithreaded environment, the access to i is not atomic; it actually breaks down into three steps:

1. read the current value of i;
2. add 1 to it;
3. write the result back to i.

On top of that, the compiler may change the order of execution, so the result may not be what you expect.
GCC built-in atomic operation
gcc has built-in atomic operations, added in 4.1.2; before that, atomicity had to be implemented with inline assembly.
```c
type __sync_fetch_and_add (type *ptr, type value, ...)
type __sync_fetch_and_sub (type *ptr, type value, ...)
type __sync_fetch_and_or (type *ptr, type value, ...)
type __sync_fetch_and_and (type *ptr, type value, ...)
type __sync_fetch_and_xor (type *ptr, type value, ...)
type __sync_fetch_and_nand (type *ptr, type value, ...)
type __sync_add_and_fetch (type *ptr, type value, ...)
type __sync_sub_and_fetch (type *ptr, type value, ...)
type __sync_or_and_fetch (type *ptr, type value, ...)
type __sync_and_and_fetch (type *ptr, type value, ...)
type __sync_xor_and_fetch (type *ptr, type value, ...)
type __sync_nand_and_fetch (type *ptr, type value, ...)
bool __sync_bool_compare_and_swap (type *ptr, type oldval, type newval, ...)
type __sync_val_compare_and_swap (type *ptr, type oldval, type newval, ...)
void __sync_synchronize (...)
type __sync_lock_test_and_set (type *ptr, type value, ...)
void __sync_lock_release (type *ptr, ...)
```
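The fetch_and_op / op_and_fetch naming distinction above maps directly onto the AtomicInteger API; the sketch below (a hypothetical demo class) shows the correspondence.

```java
import java.util.concurrent.atomic.AtomicInteger;

// How some GCC builtin pairs map onto java.util.concurrent.atomic:
//   __sync_fetch_and_add         -> getAndAdd      (returns the OLD value)
//   __sync_add_and_fetch         -> addAndGet      (returns the NEW value)
//   __sync_bool_compare_and_swap -> compareAndSet  (returns success/failure)
public class BuiltinAnalogues {
    public static void main(String[] args) {
        AtomicInteger a = new AtomicInteger(10);
        System.out.println(a.getAndAdd(5));         // prints 10: old value
        System.out.println(a.addAndGet(5));         // prints 20: new value
        System.out.println(a.compareAndSet(20, 0)); // prints true
        System.out.println(a.get());                // prints 0
    }
}
```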
OpenJDK related files
Below are some of the atomic-operation implementations in OpenJDK 9 on GitHub, for those who want to dig further; after all, OpenJDK is far more widely used than gcj. There is still no public source for the Oracle JDK, although the difference between the OpenJDK and Oracle JDK sources is said to be very small.
AtomicInteger.java
Unsafe.java::compareAndExchangeObject
unsafe.cpp::Unsafe_CompareAndExchangeObject
oop.inline.hpp::oopDesc::atomic_compare_exchange_oop
atomic_linux_x86.hpp::Atomic::cmpxchg
```cpp
inline jlong Atomic::cmpxchg (jlong exchange_value, volatile jlong* dest,
                              jlong compare_value, cmpxchg_memory_order order) {
  bool mp = os::is_MP();
  __asm__ __volatile__ (LOCK_IF_MP(%4) "cmpxchgq %1,(%3)"
                        : "=a" (exchange_value)
                        : "r" (exchange_value), "a" (compare_value),
                          "r" (dest), "r" (mp)
                        : "cc", "memory");
  return exchange_value;
}
```

For Java programmers unfamiliar with C/C++, the format of GCC inline assembly is as follows:
```cpp
__asm__ [__volatile__] (assembly template
                        : [output operand list]
                        : [input operand list]
                        : [clobber list]);
```
%1, %3, and %4 in the assembly template correspond to the operands "r" (exchange_value), "r" (dest), and "r" (mp); operands are separated by commas and numbered from 0. Output operands come after the first colon and input operands after the second. "r" means the operand is placed in a general-purpose register, "a" means the accumulator register (EAX/RAX), and "=" marks an operand as output (written back). The cmpxchg instruction implicitly uses the accumulator, i.e. operand %2, "a" (compare_value).
Other details are not listed here. To sum up: gcj's implementation passes the pointer down and, after a successful comparison, assigns the new value with a plain (non-atomic) store; atomicity is guaranteed by the surrounding spinlock.
OpenJDK's implementation also passes the pointer down, but performs the assignment directly through the cmpxchgq assembly instruction, so atomicity is guaranteed by the instruction itself. Of course, on x86 the __sync_bool_compare_and_swap underlying gcj's spinlock ultimately compiles down to the same lock-prefixed cmpxchg instruction.
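As a closing note: since Java 9 you no longer need the internal sun.misc.Unsafe to reach this CAS primitive; the public VarHandle API exposes the same operation. A hedged sketch (the class and field names are mine, not JDK code):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Sketch (Java 9+): VarHandle exposes the same CAS primitive that
// Unsafe.compareAndSwapInt provides, without internal APIs.
public class VarHandleCas {
    private volatile int value = 0;

    private static final VarHandle VALUE;
    static {
        try {
            VALUE = MethodHandles.lookup()
                    .findVarHandle(VarHandleCas.class, "value", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public int getAndIncrement() {
        for (;;) {   // same retry loop as AtomicInteger.getAndIncrement
            int current = value;
            if (VALUE.compareAndSet(this, current, current + 1))
                return current;  // CAS succeeded
        }
    }

    public static void main(String[] args) {
        VarHandleCas c = new VarHandleCas();
        System.out.println(c.getAndIncrement()); // prints 0
        System.out.println(c.getAndIncrement()); // prints 1
        System.out.println(c.value);             // prints 2
    }
}
```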