This article focuses on the performance of multithreaded applications. We first define performance and scalability, then take a careful look at Amdahl's law. In the sections that follow, we examine different techniques for reducing lock contention and show how to apply them in code.
1. Performance
We all know that multithreading can improve the performance of a program, the reason being that modern machines have multiple CPUs or multi-core CPUs. Each core can work on a task of its own, so splitting a large task into a series of smaller tasks that run independently of one another can improve overall throughput. Take, for example, a program that resizes all images in a folder on the hard disk. With a single-threaded approach we can only iterate over the image files and resize them one after another; if the CPU has several cores, only one of them is used. With multithreading, we can instead let a producer thread scan the file system and add each image to a queue, while several worker threads take images from the queue and resize them. If the number of worker threads equals the number of CPU cores, every core has work to do until all images are processed.
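A minimal sketch of this producer/worker arrangement, not taken from the original article: it assumes a hypothetical resize() step, a folder path passed as the first command-line argument, and JPEG files only. The main thread acts as the producer; one worker per core drains the queue.

import java.nio.file.*;
import java.util.concurrent.*;

public class ImageResizer {

    public static void main(String[] args) throws Exception {
        int workers = Runtime.getRuntime().availableProcessors();
        BlockingQueue<Path> queue = new LinkedBlockingQueue<>();
        Path poison = Paths.get("");                    // marker that tells a worker to stop

        // Worker threads: one per CPU core, each resizing images taken from the queue.
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    for (Path image = queue.take(); image != poison; image = queue.take()) {
                        resize(image);                  // hypothetical CPU-bound resize step
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Producer: scans the folder and feeds every image file into the queue.
        try (DirectoryStream<Path> images = Files.newDirectoryStream(Paths.get(args[0]), "*.jpg")) {
            for (Path image : images) {
                queue.put(image);
            }
        }
        for (int i = 0; i < workers; i++) {
            queue.put(poison);                          // one stop marker per worker
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void resize(Path image) {
        // placeholder for the actual image scaling work
    }
}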
Multithreading can also improve the overall performance of programs that spend much of their time waiting for IO. Suppose we want to write a program that crawls all HTML pages of a particular website and stores them on the local disk. The program starts from one page, parses out all links that point to the same site, fetches those pages in turn, and repeats the process. Because a considerable amount of time passes between sending a request to the remote site and receiving the full page data, we can hand this work to several threads: one or a few threads parse the received HTML pages and put the links they find into a queue, while all the other threads fetch the pages. Unlike the previous example, performance here can keep improving even when the number of threads exceeds the number of CPU cores.
The two examples above show that high performance means getting as much work done as possible within a given time window, which is the classic interpretation of the term. Beyond that, threads can also greatly improve the responsiveness of a program. Imagine a graphical application with an input field and, below it, a button labelled "process". When the user presses the button, the application has to re-render the button (it appears pressed and returns to its original state when the mouse button is released) and start processing the user's input. If processing that input takes a long time, a single-threaded program cannot respond to further user actions, such as mouse clicks or mouse-move events delivered by the operating system. Responding to those events requires a separate thread that stays free to handle them.
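A minimal Swing sketch of this idea, not from the original article: the event-dispatch thread only starts the work and later applies the result, while the expensive processing (a placeholder sleep here) runs on its own thread so the UI keeps responding.

import javax.swing.*;

public class ResponsiveUi {

    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("Demo");
            JTextField input = new JTextField(20);
            JButton process = new JButton("process");
            JLabel status = new JLabel("idle");
            process.addActionListener(e -> {
                String text = input.getText();
                status.setText("working...");
                // The long-running work runs on a separate thread, not on the
                // event-dispatch thread, so the UI keeps repainting and reacting.
                new Thread(() -> {
                    String result = longRunningProcess(text);
                    // UI updates must go back to the event-dispatch thread.
                    SwingUtilities.invokeLater(() -> status.setText(result));
                }).start();
            });
            JPanel panel = new JPanel();
            panel.add(input);
            panel.add(process);
            panel.add(status);
            frame.add(panel);
            frame.pack();
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.setVisible(true);
        });
    }

    private static String longRunningProcess(String text) {
        try {
            Thread.sleep(3000); // placeholder for the expensive processing of the input
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "done: " + text;
    }
}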
Scalability is the ability of a program to achieve higher performance when more computing resources are added. Imagine again that we need to resize a large number of images. Because the number of CPU cores in our machine is limited, adding more and more threads does not keep improving performance. On the contrary, since the scheduler has to create, manage, and shut down additional threads, that extra work itself consumes CPU time and may even reduce performance.
1.1 Amdahl's Law
The previous paragraph noted that in some cases adding computing resources improves a program's overall performance. To work out how much improvement we can expect from additional resources, we need to determine which parts of the program must run serially (synchronized) and which parts can run in parallel. If we denote the fraction of the program that must execute serially by B (for example, the proportion of the code that runs under synchronization) and the number of CPU cores by n, then according to Amdahl's law the upper bound on the speedup we can obtain is:

speedup <= 1 / (B + (1 - B) / n)
As n approaches infinity, the term (1 - B) / n converges to 0, so we can ignore it and the achievable speedup converges to 1 / B, where B is the fraction of the code that must run serially. If B is 0.5, meaning half of the program cannot run in parallel, the reciprocal of 0.5 is 2: even with an unlimited number of CPU cores we get at most a 2x speedup. Suppose we now modify the program so that only 0.25 of the code must run serially. Since 1 / 0.25 = 4, the program will run roughly four times faster on hardware with a large number of CPUs than on single-core hardware.
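To make the arithmetic easy to reproduce, here is a small helper, not part of the original article, that plugs B and n into the bound above:

public class AmdahlBound {

    // Upper bound on speedup from Amdahl's law: 1 / (B + (1 - B) / n),
    // where B is the serial fraction and n the number of cores.
    static double maxSpeedup(double serialFraction, int cores) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / cores);
    }

    public static void main(String[] args) {
        System.out.println(maxSpeedup(0.5, 2));      // ~1.33 on two cores
        System.out.println(maxSpeedup(0.5, 1000));   // approaches 2 as cores grow
        System.out.println(maxSpeedup(0.25, 1000));  // approaches 4
    }
}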
Conversely, Amdahl's law also tells us what fraction of the program may run serially if we want to reach a given speedup target. If we want a 100x speedup, then since 1 / 100 = 0.01, at most 1% of the code may execute serially.
To summarize Amdahl's law: the maximum performance gain from adding extra CPUs is determined by how small the serial fraction of the program is. Although in practice this ratio is not always easy to determine, especially for large commercial systems, Amdahl's law teaches us an important lesson: we must pay close attention to the code that has to execute serially and try hard to reduce it.
1.2 Effect on performance
So far we have made the point that adding threads can improve a program's performance and responsiveness. On the other hand, these benefits do not come for free: using threads has costs of its own, which also affect performance.
The first cost is incurred when a thread is created. During creation, the JVM has to request resources from the underlying operating system and initialize the data structures the scheduler uses to decide the order in which threads execute.
If the number of threads equals the number of CPU cores, each thread runs on its own core and may not be interrupted very often. In practice, however, the operating system has work of its own that needs CPU time while our program runs, so even in this case threads are interrupted and have to wait until the operating system lets them resume. The situation gets worse when the number of threads exceeds the number of CPU cores: the scheduler then interrupts some threads so that others can execute. On every switch, the state of the currently running thread must be saved so that it can be restored the next time the thread runs. On top of that, the scheduler has to update its own internal data structures, which also costs CPU cycles. All of this means that every context switch between threads consumes CPU resources and therefore adds overhead compared with the single-threaded case.
Another overhead of multithreaded programs comes from synchronizing access to shared data. We can protect shared data with the synchronized keyword, or share values between threads with the volatile keyword. If more than one thread wants to access a shared data structure, contention occurs: the JVM has to decide which thread goes first and which must wait. If the thread chosen to run is not the currently running thread, a thread switch takes place. The current thread has to wait until it successfully acquires the lock, and the JVM decides how that waiting is carried out. If the lock is expected to become available shortly, aggressive (spin) waiting, repeatedly trying to acquire the lock until it succeeds, can be the more efficient choice, because spinning is still cheaper than a context switch, and moving a waiting thread back into the run queue adds overhead of its own.
We should therefore do our best to avoid the context switches caused by lock contention. The following sections describe several techniques for reducing such contention.
1.3 Lock contention
As mentioned in the previous section, two or more threads competing for a lock causes additional overhead, because the contention forces the scheduler either to let a thread spin aggressively or to suspend it, which costs two context switches. The impact of lock contention can be mitigated in the following ways:
1. Reduce the scope of locks;
2. Reduce the frequency with which locks must be acquired;
3. Use hardware-supported optimistic operations instead of locks;
4. Use synchronized as little as possible;
5. Avoid object pooling.
1.3.1 Reduce the scope of synchronization
The first technique applies when a lock is held longer than necessary. Often one or more lines of code can be moved out of the synchronized block to shorten the time the current thread holds the lock. The less code that runs inside the synchronized block, the sooner the current thread releases the lock and the sooner other threads can acquire it. This is in line with Amdahl's law, because it reduces the amount of code that has to execute serially.
For a better understanding, look at the following source code:
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class ReduceLockDuration implements Runnable {

    private static final int NUMBER_OF_THREADS = 5;
    private static final Map<String, Integer> map = new HashMap<String, Integer>();

    public void run() {
        for (int i = 0; i < 10000; i++) {
            synchronized (map) {
                UUID randomUUID = UUID.randomUUID();
                Integer value = Integer.valueOf(42);
                String key = randomUUID.toString();
                map.put(key, value);
            }
            Thread.yield();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[NUMBER_OF_THREADS];
        for (int i = 0; i < NUMBER_OF_THREADS; i++) {
            threads[i] = new Thread(new ReduceLockDuration());
        }
        long startMillis = System.currentTimeMillis();
        for (int i = 0; i < NUMBER_OF_THREADS; i++) {
            threads[i].start();
        }
        for (int i = 0; i < NUMBER_OF_THREADS; i++) {
            threads[i].join();
        }
        System.out.println((System.currentTimeMillis() - startMillis) + "ms");
    }
}

In the example above, five threads compete for access to the shared Map instance. To ensure that only one thread accesses the Map at a time, the code that adds a key/value pair to the Map is placed inside a synchronized block. Looking at the code closely, we can see that the statements that compute the key and the value do not need to be synchronized: key and value belong only to the thread executing the code, are meaningful only to that thread, and cannot be modified by other threads. We can therefore move these statements out of the protected block, as follows:
public void run() {
    for (int i = 0; i < 10000; i++) {
        UUID randomUUID = UUID.randomUUID();
        Integer value = Integer.valueOf(42);
        String key = randomUUID.toString();
        synchronized (map) {
            map.put(key, value);
        }
        Thread.yield();
    }
}

The effect of shrinking the synchronized block is measurable: on my machine, the execution time of the whole program dropped from 420 ms to 370 ms. Simply moving three lines of code out of the protected block reduced the run time by about 11%. The Thread.yield() call is there to provoke context switches: it tells the JVM that the current thread is willing to give up its compute resources so that other waiting threads can run. This also increases lock contention; without it, a thread would keep its core for longer and there would be fewer context switches.
1.3.2 Lock splitting
Another way to reduce lock contention is to split one lock-protected block of code into several smaller blocks, each protected by its own lock. This technique helps when a single lock in your program protects several different objects. Suppose we want to gather some statistics and implement a simple counter class that holds several different metrics, each represented by a plain long variable. Because the program is multithreaded, access to these variables has to be synchronized, since the updates come from different threads. The simplest way to achieve this is to mark every method that touches these variables as synchronized.
public static class CounterOneLock implements Counter {

    private long customerCount = 0;
    private long shippingCount = 0;

    public synchronized void incrementCustomer() {
        customerCount++;
    }

    public synchronized void incrementShipping() {
        shippingCount++;
    }

    public synchronized long getCustomerCount() {
        return customerCount;
    }

    public synchronized long getShippingCount() {
        return shippingCount;
    }
}

With this implementation, every access to either variable locks the entire Counter instance, because the implicit monitor is this. A thread that wants to increment one counter therefore has to wait even if another thread is currently incrementing the other, unrelated counter. In such a case, protecting each variable with its own lock improves throughput.
public static class CounterSeparateLock implements Counter {

    private static final Object customerLock = new Object();
    private static final Object shippingLock = new Object();
    private long customerCount = 0;
    private long shippingCount = 0;

    public void incrementCustomer() {
        synchronized (customerLock) {
            customerCount++;
        }
    }

    public void incrementShipping() {
        synchronized (shippingLock) {
            shippingCount++;
        }
    }

    public long getCustomerCount() {
        synchronized (customerLock) {
            return customerCount;
        }
    }

    public long getShippingCount() {
        synchronized (shippingLock) {
            return shippingCount;
        }
    }
}

This implementation introduces a separate lock object for each metric. A thread that wants to increment the customer count therefore only has to wait for another thread that is also incrementing the customer count, not for a thread that is incrementing the shipping count.
With the following class we can easily measure the performance improvement brought by lock splitting.
import java.util.concurrent.ThreadLocalRandom;

public class LockSplitting implements Runnable {

    private static final int NUMBER_OF_THREADS = 5;
    private Counter counter;

    public interface Counter {
        void incrementCustomer();
        void incrementShipping();
        long getCustomerCount();
        long getShippingCount();
    }

    public static class CounterOneLock implements Counter { ... }

    public static class CounterSeparateLock implements Counter { ... }

    public LockSplitting(Counter counter) {
        this.counter = counter;
    }

    public void run() {
        for (int i = 0; i < 100000; i++) {
            if (ThreadLocalRandom.current().nextBoolean()) {
                counter.incrementCustomer();
            } else {
                counter.incrementShipping();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[NUMBER_OF_THREADS];
        Counter counter = new CounterOneLock();
        for (int i = 0; i < NUMBER_OF_THREADS; i++) {
            threads[i] = new Thread(new LockSplitting(counter));
        }
        long startMillis = System.currentTimeMillis();
        for (int i = 0; i < NUMBER_OF_THREADS; i++) {
            threads[i].start();
        }
        for (int i = 0; i < NUMBER_OF_THREADS; i++) {
            threads[i].join();
        }
        System.out.println((System.currentTimeMillis() - startMillis) + "ms");
    }
}

On my machine, the single-lock implementation takes about 56 ms on average, while the implementation with two separate locks takes about 38 ms, a reduction of roughly 32%.
We can go a step further and protect reads and writes with different locks. The original Counter class offers separate methods for reading and writing the metrics, but the read operations do not actually need to block each other: multiple threads can safely read the current value of a metric in parallel. Only the write operations must be exclusive. The java.util.concurrent package provides the ReadWriteLock interface, which makes this distinction easy to implement.
The ReentrantReadWriteLock implementation maintains two locks, one protecting read access and one protecting write access, each with its own acquire and release operations. The write lock can only be acquired when no thread holds the read lock; conversely, as long as the write lock is not held, the read lock can be held by several threads at the same time. To demonstrate this approach, the following Counter class uses a ReadWriteLock:
public static class CounterReadWriteLock implements Counter {

    private final ReentrantReadWriteLock customerLock = new ReentrantReadWriteLock();
    private final Lock customerWriteLock = customerLock.writeLock();
    private final Lock customerReadLock = customerLock.readLock();
    private final ReentrantReadWriteLock shippingLock = new ReentrantReadWriteLock();
    private final Lock shippingWriteLock = shippingLock.writeLock();
    private final Lock shippingReadLock = shippingLock.readLock();
    private long customerCount = 0;
    private long shippingCount = 0;

    public void incrementCustomer() {
        customerWriteLock.lock();
        customerCount++;
        customerWriteLock.unlock();
    }

    public void incrementShipping() {
        shippingWriteLock.lock();
        shippingCount++;
        shippingWriteLock.unlock();
    }

    public long getCustomerCount() {
        customerReadLock.lock();
        long count = customerCount;
        customerReadLock.unlock();
        return count;
    }

    public long getShippingCount() {
        shippingReadLock.lock();
        long count = shippingCount;
        shippingReadLock.unlock();
        return count;
    }
}

All read operations are guarded by the read lock and all write operations by the write lock. If the program performs far more reads than writes, this implementation can bring an even bigger performance gain than the previous one, because the reads can proceed concurrently.
1.3.3 Lock striping
The example above showed how to split a single lock into several separate locks so that each thread only acquires the lock for the object it is about to modify. On the other hand, this approach increases the complexity of the program and can lead to deadlocks if implemented carelessly.
Lock striping is similar to lock splitting, but where lock splitting uses different locks for different code fragments or objects, lock striping uses a set of locks to protect different ranges of values. The ConcurrentHashMap class in the JDK's java.util.concurrent package uses this idea to improve the performance of programs that rely heavily on a HashMap. Instead of wrapping a single synchronized HashMap, ConcurrentHashMap internally uses 16 different locks, each responsible for synchronizing access to one sixteenth of the hash buckets. Threads inserting keys that fall into different segments are therefore protected by different locks and do not block each other. The downside is that some operations now require several locks instead of one: copying the whole map, for example, requires acquiring all 16 locks.
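To illustrate the idea, here is a deliberately simplified striped map sketch, not the actual ConcurrentHashMap implementation: each key is hashed to one of 16 stripes, and only that stripe's lock is taken for a put or get.

import java.util.HashMap;
import java.util.Map;

public class StripedMap<K, V> {

    private static final int STRIPES = 16;
    private final Object[] locks = new Object[STRIPES];
    private final Map<K, V>[] segments;

    @SuppressWarnings("unchecked")
    public StripedMap() {
        segments = new Map[STRIPES];
        for (int i = 0; i < STRIPES; i++) {
            locks[i] = new Object();
            segments[i] = new HashMap<K, V>();
        }
    }

    // Each key maps to one of 16 stripes; only that stripe's lock is acquired.
    private int stripeFor(Object key) {
        return (key.hashCode() & 0x7fffffff) % STRIPES;
    }

    public V put(K key, V value) {
        int stripe = stripeFor(key);
        synchronized (locks[stripe]) {
            return segments[stripe].put(key, value);
        }
    }

    public V get(Object key) {
        int stripe = stripeFor(key);
        synchronized (locks[stripe]) {
            return segments[stripe].get(key);
        }
    }

    // Operations spanning the whole map have to visit every stripe lock
    // (taken one at a time here, so the total is only approximate under concurrent updates).
    public int size() {
        int total = 0;
        for (int i = 0; i < STRIPES; i++) {
            synchronized (locks[i]) {
                total += segments[i].size();
            }
        }
        return total;
    }
}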
1.3.4 Atomic Operation
Another way to reduce lock contention is to use atomic operations; their underlying mechanics are covered in detail in other articles. The java.util.concurrent.atomic package provides atomic wrapper classes for the most common basic data types. These classes are implemented on top of the compare-and-swap (CAS) instruction provided by the processor, which performs an update only if the current value still equals the expected old value supplied by the caller.
This mechanism can be used to increment a variable optimistically: the thread reads the current value and then tries to apply the increment with a CAS operation. If another thread has modified the variable in the meantime, the expected value no longer matches the real value, so the thread reads the current value again and retries, repeating until it succeeds. Although this looping can waste some CPU cycles, the benefit is that no explicit synchronization is needed.
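The retry loop described above can be written explicitly with AtomicLong.compareAndSet; this sketch is only for illustration, since incrementAndGet already performs an equivalent loop internally.

import java.util.concurrent.atomic.AtomicLong;

public class CasIncrement {

    private final AtomicLong counter = new AtomicLong();

    // Optimistic increment: read the current value, try to swap in current + 1,
    // and retry if another thread changed the value in the meantime.
    public long increment() {
        while (true) {
            long current = counter.get();
            long next = current + 1;
            if (counter.compareAndSet(current, next)) {
                return next;
            }
            // CAS failed: another thread updated the counter, so loop and try again.
        }
    }
}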
The Counter implementation below uses atomic variables; as you can see, it contains no synchronized code at all.
public static class CounterAtomic implements Counter {

    private AtomicLong customerCount = new AtomicLong();
    private AtomicLong shippingCount = new AtomicLong();

    public void incrementCustomer() {
        customerCount.incrementAndGet();
    }

    public void incrementShipping() {
        shippingCount.incrementAndGet();
    }

    public long getCustomerCount() {
        return customerCount.get();
    }

    public long getShippingCount() {
        return shippingCount.get();
    }
}

Compared with the CounterSeparateLock class, the average running time drops from 39 ms to 16 ms, a reduction of about 58%.
1.3.5 Avoid hotspot code segments
A typical list implementation keeps track of the number of elements it contains by maintaining an internal count variable. Every time an element is added to or removed from the list, this variable is updated. In a single-threaded application this is a sensible optimization: each call to size() can simply return the last computed value, whereas without the internal counter every size() call would have to traverse the list and count the elements.
This optimization, used by many data structures, becomes a problem in a multithreaded environment. Suppose a list is shared between several threads that concurrently add and remove elements and query its length. The internal count variable then becomes a shared resource, and all access to it must be synchronized. The counter turns into a hot spot of the whole list implementation.
The following code snippet shows this problem:
public static class CarRepositoryWithCounter implements CarRepository {

    private Map<String, Car> cars = new HashMap<String, Car>();
    private Map<String, Car> trucks = new HashMap<String, Car>();
    private Object carCountSync = new Object();
    private int carCount = 0;

    public void addCar(Car car) {
        if (car.getLicencePlate().startsWith("C")) {
            synchronized (cars) {
                Car foundCar = cars.get(car.getLicencePlate());
                if (foundCar == null) {
                    cars.put(car.getLicencePlate(), car);
                    synchronized (carCountSync) {
                        carCount++;
                    }
                }
            }
        } else {
            synchronized (trucks) {
                Car foundCar = trucks.get(car.getLicencePlate());
                if (foundCar == null) {
                    trucks.put(car.getLicencePlate(), car);
                    synchronized (carCountSync) {
                        carCount++;
                    }
                }
            }
        }
    }

    public int getCarCount() {
        synchronized (carCountSync) {
            return carCount;
        }
    }
}

This CarRepository implementation holds two maps internally, one for car entries and one for truck entries, and provides a method that returns the total number of entries in both. As an optimization, it increments an internal count variable every time a vehicle is added; the increment, like the read in getCarCount(), is protected by synchronization on carCountSync.
To avoid this extra synchronization overhead, look at the second CarRepository implementation below: it drops the internal count variable and instead computes the total on the fly in the method that returns the number of vehicles:
public static class CarRepositoryWithoutCounter implements CarRepository {

    private Map<String, Car> cars = new HashMap<String, Car>();
    private Map<String, Car> trucks = new HashMap<String, Car>();

    public void addCar(Car car) {
        if (car.getLicencePlate().startsWith("C")) {
            synchronized (cars) {
                Car foundCar = cars.get(car.getLicencePlate());
                if (foundCar == null) {
                    cars.put(car.getLicencePlate(), car);
                }
            }
        } else {
            synchronized (trucks) {
                Car foundCar = trucks.get(car.getLicencePlate());
                if (foundCar == null) {
                    trucks.put(car.getLicencePlate(), car);
                }
            }
        }
    }

    public int getCarCount() {
        synchronized (cars) {
            synchronized (trucks) {
                return cars.size() + trucks.size();
            }
        }
    }
}

Now only the getCarCount() method needs to synchronize access to the two maps; the per-insertion synchronization overhead of the previous implementation is gone.
1.3.6 Avoid object pooling
In the first versions of the Java VM, creating objects with the new keyword was relatively expensive, so many developers got used to object reuse patterns: to avoid creating the same kind of object over and over again, they maintained a pool, returned instances to the pool after use, and let other threads fetch instances from the pool the next time one was needed.
At first glance this seems reasonable, but the pattern causes problems in multithreaded applications. Because the object pool is shared between threads, all access to the objects in it has to be synchronized, and that synchronization overhead can easily exceed the cost of creating the objects in the first place. Creating many objects does increase the load on the garbage collector, but even taking that into account, avoiding the pool and the contention on its lock is usually the better trade-off.
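To make the contention point concrete, here is a deliberately naive pool sketch, hypothetical and not taken from any library: every borrow and return funnels through the same lock, so the pool itself becomes the shared hot spot described above.

import java.util.ArrayDeque;
import java.util.Deque;

// A simple object pool: functionally correct, but every caller synchronizes
// on the same monitor, which is exactly the contention the text describes.
public class SimplePool<T> {

    private final Deque<T> free = new ArrayDeque<T>();

    public synchronized T borrow() {
        return free.poll();            // null means the caller must create a new instance
    }

    public synchronized void giveBack(T instance) {
        free.push(instance);
    }
}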
The optimizations described in this article show once again that every candidate optimization must be evaluated carefully before it is applied. Premature optimizations may look plausible on the surface, yet turn out to be performance bottlenecks themselves.