At first glance, people who write Java seem to have nothing to do with the CPU. At most we care about how to keep the CPU busy and how to set the number of threads, as mentioned earlier, and even that formula is only a reference: many scenarios need practical tuning of their own. And once we do get the CPU running hot, we start thinking about how to keep it from being quite so full. Haha, humans, that's how it is. OK, this article is about something else. Perhaps when you write Java you barely pay attention to the CPU, because satisfying the business comes first. But if you want to work at the framework level and provide the framework with a lot of shared, cached data, there will inevitably be plenty of contention over that data. Of course, Java provides many classes in the concurrent package and you can simply use them, but to use them well you need to understand how they work internally; otherwise it may be better not to use them at all. Those classes are not the focus of this article, though, because, like the clickbait title says: we want to talk about the CPU, haha.
Having just said that Java seems to have nothing to do with the CPU, let's see what is actually going on:
1. When we run into a shared variable, our first thought is to guarantee consistent reads through volatile, that is, absolute visibility. Visibility here means that every time a thread wants to use the data, the CPU does not use any cached copy but fetches it from main memory, and this holds even across multiple CPUs; at that moment the CPU and memory are in sync for that variable. The CPU issues an assembly instruction along the lines of lock addl $0, which adds 0 and therefore changes nothing by itself, but the lock prefix acts at the bus level: once the instruction completes, later accesses by other threads see the up-to-date value. That is the absolute visibility volatile can achieve; what it cannot achieve is consistency of compound operations. In other words, volatile cannot make an operation like i++ consistent under multi-threaded concurrency, because i++ is decomposed into:
int tmp = i;
tmp = tmp + 1;
i = tmp;
These three steps. From this you can also see why i++ can use the old value first and only add 1 to itself afterwards: the value has already been copied into another variable.
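A minimal sketch of the problem just described, assuming two threads hammering one shared counter (the class name and the iteration counts are made up for illustration):

public class VolatileCounterDemo {
    // volatile guarantees visibility of writes, but not atomicity of ++
    private static volatile int counter = 0;

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                counter++; // read, add 1, write back: three steps, not one
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // almost always prints less than 200000, because increments get lost
        System.out.println(counter);
    }
}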
2. If we want consistent updates under multi-threaded concurrency, we need some locking mechanism. At present the Atomic* classes basically meet this requirement. Internally they rely on methods of the Unsafe class: by repeatedly comparing against the absolutely visible value, they ensure the data they act on is the latest. Next, let's continue with other CPU matters.
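As a hedged illustration of the compare-and-swap loop these classes use internally (the retry structure mirrors what AtomicInteger.getAndIncrement does conceptually, though the real implementation goes through Unsafe):

import java.util.concurrent.atomic.AtomicInteger;

public class CasLoopDemo {
    private static final AtomicInteger counter = new AtomicInteger(0);

    // Hand-written CAS loop: re-read the current (visible) value and
    // retry until compareAndSet succeeds against that exact value.
    static int incrementWithCas() {
        for (;;) {
            int current = counter.get();
            int next = current + 1;
            if (counter.compareAndSet(current, next)) {
                return next;
            }
            // another thread won the race; loop and try again
        }
    }

    public static void main(String[] args) {
        System.out.println(incrementWithCas());        // 1
        System.out.println(counter.incrementAndGet()); // 2, the built-in equivalent
    }
}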
3. In the past we struggled just to keep the CPU busy, and however we went about it we simply ignored the latency between memory and the CPU. Since it has come up today, let's talk briefly about that latency. Generally speaking, current CPUs have three levels of cache, and the latencies differ across generations, so the numbers can only be rough. Today an L1 access is typically 1-2ns, L2 is generally a few ns up to around ten ns, L3 is usually between 30ns and 50ns, and a main memory access generally reaches 70ns or even more (computers evolve quickly; these values come from a few specific CPUs and are only a reference range). These latencies look tiny, all at the nanosecond level, but once your program is broken down into instructions you will find there are a huge number of CPU-memory interactions, and if each one carries this kind of latency, system performance changes noticeably.
4. Back to the volatile mentioned just now: every time it gets data, it goes to memory and abandons the cache, so of course some operations, even single-threaded ones, become slower. Sometimes we have no choice: both reads and writes need consistency, and sometimes an entire data block has to be synchronized. All we can do is reduce the granularity of the lock as far as possible; we can never have no locks at all, since even the CPU itself has instruction-level restrictions of this kind.
5. The atomic primitives at the CPU level are generally called barriers: read barriers, write barriers and so on, usually triggered at a single point. When a program's instructions are sent to the CPU, some of them may not execute in program order and some must, as long as the end result is consistent with program order. As for reordering, the JIT will reorder at runtime, and the CPU will reorder again at the instruction level; the main purpose is to optimize the instruction stream so the program runs faster.
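A hedged sketch of why this reordering matters to Java code (the field names are made up; without the volatile on ready, the JIT or the CPU would be allowed to reorder the two writes, so the reader could observe ready == true while data is still 0):

public class ReorderingDemo {
    private int data = 0;
    private volatile boolean ready = false; // the volatile write acts as the barrier here

    // writer thread
    void writer() {
        data = 42;     // plain write
        ready = true;  // volatile write: the data write may not be moved after it
    }

    // reader thread
    void reader() {
        if (ready) {                   // volatile read
            System.out.println(data);  // guaranteed to see 42, not 0
        }
    }
}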
6. At the CPU level, memory is operated on in cache lines. A cache line means a contiguous chunk of memory is read in one go; its size depends on the CPU model and architecture. Most CPUs today read 64 bytes of contiguous memory at a time, while early ones used 32 bytes. This is why traversing an array in order is fast (traversing it column by column is very slow), but even that is not completely correct; some opposite situations will be compared below.
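A rough sketch of the traversal effect just described, assuming an ordinary two-dimensional int array (the size is arbitrary and this is a crude timing, not a rigorous benchmark); row-by-row access walks memory contiguously and benefits from cache lines, column-by-column does not:

public class CacheLineTraversalDemo {
    private static final int N = 2048;
    private static final int[][] matrix = new int[N][N];

    public static void main(String[] args) {
        long sum = 0;

        long start = System.nanoTime();
        // Row-major: consecutive elements of a row sit next to each other,
        // so each cache line that is fetched is fully used.
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                sum += matrix[i][j];
            }
        }
        long rowMajor = System.nanoTime() - start;

        start = System.nanoTime();
        // Column-major: each access jumps to a different row (a different
        // int[] object), wasting most of every cache line it pulls in.
        for (int j = 0; j < N; j++) {
            for (int i = 0; i < N; i++) {
                sum += matrix[i][j];
            }
        }
        long colMajor = System.nanoTime() - start;

        System.out.println("row-major ns: " + rowMajor
                + ", column-major ns: " + colMajor + ", sum: " + sum);
    }
}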
7. Now let's talk about the states a data block goes through when the CPU modifies it. If the data is only being read, it can be read in parallel by multiple threads on multiple CPUs. Writing is different: the data block can be in exclusive, modified, invalid and other states, and once it is modified the copies elsewhere naturally become invalid. When multiple threads on multiple CPUs modify the same data block, data copies travel between the CPUs over the bus (QPI). Of course, if they really are modifying the same variable there is nothing we can do; but going back to the cache line from point 6, the problem gets more troublesome. If the data sits in the same array, neighbouring elements of the array get cached into one CPU's cache line together, and the QPI traffic from multiple threads becomes very frequent. Sometimes the problem appears even when the array is assembled out of objects, for example:
class InputInteger {
    private int value;
    public InputInteger(int i) {
        this.value = i;
    }
}

InputInteger[] integers = new InputInteger[SIZE];
for (int i = 0; i < SIZE; i++) {
    integers[i] = new InputInteger(i);
}

Here everything in integers is an object and the array holds only references; in theory the objects themselves are independent and need not be stored contiguously. In practice, however, Java usually allocates object memory contiguously in the Eden area, so if no other thread allocates during the for loop these objects end up side by side, and even after GC promotes them to the old generation they will very likely still sit together. So relying on "they are separate objects" to escape the cache-line problem on this array is unreliable. An int is 4 bytes; in 64-bit mode such an object takes 24 bytes (4 bytes of padding included), or 16 bytes with pointer compression, which means one cache line can hold 3-4 of these objects. How do we let the CPU cache them without driving up QPI? Don't count on keeping the objects apart, because the GC's memory copying is likely to put them back together. The best way is to pad them out: a bit of a memory waste, but the most reliable method, namely padding each object up to 64 bytes. With pointer compression disabled the object already takes 24 bytes, so another 40 bytes are needed, which means adding 5 longs inside the object.
class InputInteger {
    public int value;
    private long a1, a2, a3, a4, a5;
}

Haha, this method is crude, but it works very well. The only catch is that sometimes, when the JVM compiles the class, it notices these fields are never used and strips them out for you, and the optimization becomes void. The countermeasure is equally simple: add a method whose body touches all 5 fields, and then never call it.
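A sketch of that trick, with made-up field and method names; the never-called preventOptimization method exists only so the padding fields look used and (per the author's experience) are less likely to be stripped away:

class PaddedInputInteger {
    public volatile int value;
    // padding: 5 longs = 40 bytes, pushing the object towards a full 64-byte cache line
    private long p1, p2, p3, p4, p5;

    // Never called; its only purpose is to reference the padding fields
    // so they appear "used" to the compiler.
    long preventOptimization() {
        return p1 + p2 + p3 + p4 + p5;
    }
}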
8. At the CPU level, charging in and doing the write first is not always the winning move. Take operations like AtomicIntegerFieldUpdater: call getAndSet(true) from a single thread and you will find it runs quite fast, but on a multi-core CPU it starts to slow down. Why? For the reason explained above: getAndSet modifies first and asks questions later, so the cache line bounces between cores and QPI traffic gets very high. In that situation it is better to do a get first and only attempt the modification when it is actually needed; and trying once and, if you cannot get it, yielding so other threads can do other things is also a good approach.
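A hedged sketch of this "get first, then set" pattern on a simple spin flag (the class is illustrative, not from the original text); compare the naive version that hammers getAndSet with the version that spins on a plain read and only attempts the write when the flag looks free:

import java.util.concurrent.atomic.AtomicBoolean;

public class TestAndTestAndSetLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    // Naive: every loop iteration is a write (getAndSet), so the cache line
    // ping-pongs between cores even while someone else holds the lock.
    public void lockNaive() {
        while (locked.getAndSet(true)) {
            // spin
        }
    }

    // Better: spin on a read first; only attempt the write when the lock
    // appears free, which keeps the line in a shared state while we wait.
    public void lock() {
        for (;;) {
            while (locked.get()) {
                Thread.onSpinWait(); // spin hint, available since Java 9
            }
            if (!locked.getAndSet(true)) {
                return;
            }
        }
    }

    public void unlock() {
        locked.set(false);
    }
}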
9. Sometimes, to even out which CPUs are busy and which are not, many algorithms are used; NUMA, for example, is one of the solutions. But whatever the architecture, it helps in certain scenarios and is not effective for all of them. There are also queue-based lock mechanisms for managing this state, but they run into the cache-line problem again because the state changes frequently; so, to cooperate with the CPU and use it more effectively, various algorithms have been produced, such as the CLH queue lock.
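A compact sketch of the CLH queue lock named above, in its textbook form (the class is an assumption for illustration, not code from this article); each thread spins on its predecessor's node, so different waiters spin on different cache lines instead of one shared flag:

import java.util.concurrent.atomic.AtomicReference;

public class ClhLock {
    static final class Node {
        volatile boolean locked; // true while the owner holds or wants the lock
    }

    private final AtomicReference<Node> tail = new AtomicReference<>(new Node());
    private final ThreadLocal<Node> pred = new ThreadLocal<>();
    private final ThreadLocal<Node> node = ThreadLocal.withInitial(Node::new);

    public void lock() {
        Node n = node.get();
        n.locked = true;
        // enqueue ourselves at the tail and remember our predecessor
        Node p = tail.getAndSet(n);
        pred.set(p);
        // spin on the predecessor's flag, not on a single global flag
        while (p.locked) {
            Thread.onSpinWait();
        }
    }

    public void unlock() {
        Node n = node.get();
        n.locked = false;      // release our successor
        node.set(pred.get());  // reuse the predecessor's node next time
    }
}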
There are many more details like this: accumulating in a loop with an ordinary variable, with a volatile variable, or with the Atomic* classes behaves completely differently; so does looping over a multi-dimensional array in different dimension orders. Understanding why is where the inspiration comes from in real optimization work. The details of locking are fine-grained enough to make you dizzy, and at the bottom of the system there are always some lightweight atomic operations; no matter who claims their code needs no locks, at the finest level the CPU can still only execute one instruction at each moment, and multi-core CPUs still have shared areas controlled at the bus level, covering reads, writes, memory and so on. Reduce the granularity of locking as far as possible in each different scenario, and better system performance is the natural, self-evident result.