1. Preface
In fact, since I started writing Java code, I have run into countless mojibake and transcoding problems: garbled text when reading a file into a String, garbled HTTP request parameters in a servlet, garbled results from JDBC queries, and so on. These problems are so common that a quick search usually turns up a fix, so I never developed an in-depth understanding of them.
That changed two days ago, when a classmate brought me a Java source file encoding problem (analyzed in the last example below), and one question led to a whole series of others. We discussed and searched until late at night, when we finally found the key clue in a blog post. Everything fell into place, and explanations that had never made sense before suddenly became clear. So I decided to use this essay to record my understanding of these encoding problems and the results of my experiments.
Some of the concepts below are my own understanding based on observed behavior. If there is any error, please be sure to correct me.
2. Concept summary
In the early days, before the Internet took off, computers were mostly used to process local data, so many countries and regions designed encoding schemes for their local languages. These region-specific encodings are collectively called ANSI encodings (because they are extensions of ANSI ASCII). However, they were designed independently, with no coordination on mutual compatibility, and that planted the root of encoding conflicts. For example, GB2312, used in mainland China, conflicts with Big5, used in Taiwan: the same two bytes represent different characters in the two schemes. With the rise of the Internet, a single document often contains multiple languages, and a computer displaying it runs into trouble because it cannot tell which encoding a given pair of bytes belongs to.
Such problems were common worldwide, so calls grew for a universal character set that would assign a unified number to every character in the world.
As a result, Unicode was born: it assigns a unique number to every character in the world. Since a character can now be identified uniquely, fonts only need to be designed against Unicode. Note, however, that the Unicode standard defines a character set but does not by itself mandate an encoding scheme; that is, it defines abstract numbers and their corresponding characters, not how a sequence of those numbers is stored. Storage is handled by schemes such as UTF-8, UTF-16, and UTF-32. Encodings whose names begin with UTF can be converted to and from Unicode values (code points) by direct calculation. As the name suggests, UTF-8 uses 8-bit code units; it is a variable-length encoding that, in its original design, uses 1 to 6 bytes per character (constrained by the actual Unicode range, it is at most 4 bytes in practice). UTF-16 uses 16-bit code units and is also variable length: either 2 or 4 bytes per character. UTF-32 is fixed length, storing every Unicode number in exactly 4 bytes.
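To make the size differences concrete, here is a minimal sketch of my own (assuming a standard JDK, where the UTF-16BE and UTF-32BE charsets are available) that encodes one BMP character and one non-BMP character in each scheme and prints the byte counts:

import java.nio.charset.Charset;

public class UtfLengths {
    public static void main(String[] args) {
        // "好" is a BMP character (U+597D); the G clef is outside the BMP (U+1D11E)
        String[] samples = {"\u597D", new String(Character.toChars(0x1D11E))};
        String[] charsets = {"UTF-8", "UTF-16BE", "UTF-32BE"};
        for (String s : samples) {
            for (String cs : charsets) {
                // getBytes performs the Unicode -> encoding conversion
                int len = s.getBytes(Charset.forName(cs)).length;
                System.out.printf("U+%X in %s: %d bytes%n", s.codePointAt(0), cs, len);
            }
        }
        // Expected: U+597D takes 3/2/4 bytes; U+1D11E takes 4 bytes in all three
    }
}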
Actually, I had misunderstood Unicode for a long time. I was under the impression that Unicode values only went up to 0xFFFF, meaning at most 2^16 characters could be represented. After reading Wikipedia carefully, I realized this was true only of the early UCS-2 scheme: UCS-2 uses a fixed two bytes per character, so it can only encode characters within the BMP (Basic Multilingual Plane, 0x0000-0xFFFF, which contains the world's most commonly used characters). To encode characters with Unicode values above 0xFFFF, UCS-2 was extended into UTF-16, which is variable length. Within the BMP, UTF-16 is identical to UCS-2; outside the BMP, UTF-16 uses 4 bytes per character.
To simplify the discussion below, let me introduce the concept of a code unit: the basic building block of a given encoding. For example, the code unit of UTF-8 is 1 byte, and the code unit of UTF-16 is 2 bytes. The definition sounds abstract, but the idea is simple.
To support multiple languages and stay cross-platform, Java's String stores characters as Unicode. It originally used the UCS-2 scheme; later, when the BMP proved insufficient, it did not move up to UCS-4 (i.e. UTF-32, fixed 4-byte encoding), for reasons of memory consumption and compatibility, but adopted the UTF-16 described above. The char type can be regarded as its code unit. This causes some trouble: as long as every character is within the BMP, all is well, but once a character falls outside the BMP, one code unit no longer corresponds to one character. The length method returns the number of code units, not characters; charAt naturally returns a code unit rather than a character, which makes traversal awkward. Although newer methods (such as codePointAt) were added for this, they are still inconvenient and do not give random access by character.
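A quick sketch of my own showing the mismatch between code units and characters for a non-BMP character:

public class CodeUnitsVsCodePoints {
    public static void main(String[] args) {
        // U+1D11E (musical G clef) lies outside the BMP,
        // so UTF-16 stores it as the surrogate pair D834 DD1E
        String s = new String(Character.toChars(0x1D11E));
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 character
        System.out.printf("%X %X%n", (int) s.charAt(0), (int) s.charAt(1));
        // prints D834 DD1E -- the code units, not the character
    }
}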
In addition, I found that Java does not accept Unicode escapes above 0xFFFF in literals (\u takes exactly four hex digits). So if you cannot type a non-BMP character but you know its Unicode value, you have to use a rather clumsy method to get it into a String: manually compute the character's UTF-16 encoding (four bytes), write the first two bytes and the last two bytes as two separate \u escapes, and assign that to the String. The sample code is as follows.
public static void main(String[] args) {
    // String str = "𝄞"; // We want to assign this character, assuming my input method cannot type it
    // But I know its Unicode value is 0x1D11E
    // String str = "\u1D11E"; // This will not be recognized
    // So compute its UTF-16 encoding: D834 DD1E
    String str = "\uD834\uDD1E";
    System.out.println(str); // Successfully outputs "𝄞"
}

(Incidentally, the JDK can do this calculation for you: new String(Character.toChars(0x1D11E)) produces the same string.)

The Notepad that comes with Windows can save files as "Unicode" encoding, which actually means UTF-16. As mentioned above, the most commonly used characters all lie within the BMP, and within the BMP the UTF-16 code of each character is numerically equal to its Unicode value, which is probably why Microsoft calls it Unicode. For example, I typed the two characters "好a" in Notepad, saved the file as Unicode big endian (high byte first), and opened it in WinHex. The first two bytes of the file are the Byte Order Mark: (FE FF) marks the byte order as big-endian. Then (59 7D) is the Unicode code of "好", and (00 61) is the Unicode code of "a".
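The same bytes can be reproduced in Java. This is my own sketch of the Notepad experiment; it relies on the JDK's "UTF-16" charset, whose encoder writes a big-endian BOM by default:

import java.io.UnsupportedEncodingException;

public class Utf16Bom {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Java's "UTF-16" encoder emits a big-endian byte order mark first
        byte[] data = "\u597Da".getBytes("UTF-16");
        for (byte b : data) System.out.printf("%02X ", b);
        // prints: FE FF 59 7D 00 61 -- the same bytes WinHex showed
        System.out.println();
    }
}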
Unicode alone did not solve the problem overnight. First, there is a huge amount of data in the world in non-Unicode legacy encodings, and we cannot simply discard it. Second, Unicode encodings often take more space than ANSI encodings, so from a resource standpoint ANSI encodings are still worth having. Therefore a conversion mechanism is needed, so that ANSI-encoded data can be converted to Unicode for unified processing, and Unicode can be converted back to an ANSI encoding to meet a platform's requirements.
The conversion methods are straightforward to describe. For the UTF family and Unicode-compatible encodings such as ISO-8859-1, conversion can be done by direct calculation from the Unicode values (though in practice a lookup table may still be used). For legacy ANSI encodings, only table lookup works. Microsoft calls such a mapping table a CodePage (code page) and classifies and numbers them by encoding. For example, the familiar cp936 is the GBK code page, and cp65001 is the UTF-8 code page. Microsoft's official website publishes the GBK-to-Unicode mapping table, and there is likewise a reverse Unicode-to-GBK table.
With code pages, any encoding conversion becomes easy. To convert from GBK to UTF-8, for example, you just split the data into characters according to the GBK encoding rules, look up each character's bytes in the GBK code page to get its Unicode value, then look that Unicode value up in the UTF-8 code page (or compute it directly) to get the corresponding UTF-8 bytes. The reverse direction works the same way. Note that UTF-8 is a standard implementation of Unicode: its code page covers all Unicode values, so converting any encoding to UTF-8 and back loses nothing. At this point we can draw a conclusion: the crux of any encoding conversion is converting to Unicode correctly, so choosing the right character set (code page) is the key.
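To preview the Java examples in the next section, here is a minimal sketch of my own of exactly this GBK-to-UTF-8 conversion, with Unicode as the intermediate form:

import java.io.UnsupportedEncodingException;

public class GbkToUtf8 {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // "你好" in GBK
        byte[] gbkData = {(byte) 0xc4, (byte) 0xe3, (byte) 0xba, (byte) 0xc3};
        // GBK bytes -> Unicode (look up the GBK code page)
        String unicode = new String(gbkData, "GBK");
        // Unicode -> UTF-8 bytes (look up / compute the UTF-8 encoding)
        byte[] utf8Data = unicode.getBytes("UTF-8");
        for (byte b : utf8Data) System.out.printf("%02X ", b);
        // prints: E4 BD A0 E5 A5 BD
        System.out.println();
    }
}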
After understanding the nature of transcoding loss, I suddenly understood why the JSP framework decodes HTTP request parameters with ISO-8859-1, which forces us to write statements like this to read Chinese parameters:
String param = new String(s.getBytes("iso-8859-1"), "UTF-8");
The JSP framework receives the parameter as a raw byte stream. It does not know (or care) what encoding the bytes are in, so it cannot know which code page to use to convert them to Unicode. It therefore picks a scheme that never loses data: it assumes the bytes are ISO-8859-1 and looks them up in the ISO-8859-1 code page to get a Unicode sequence. ISO-8859-1 is a single-byte encoding and, unlike ASCII, it assigns a character to every value from 0 to 255, so any byte can be found in its code page, and converting the Unicode back to the original byte stream loses nothing. European and American programmers who never consider other languages can use the framework's String directly; to support other languages, you only need to convert back to the original bytes and decode them with the actual code page.
That covers the concepts behind Unicode and character encoding. Next, let's get a feel for them through Java examples.
3. Example analysis
1. Converting to Unicode: the String constructor
The String constructor converts data in various encodings into a Unicode sequence (stored internally as UTF-16). The test code below shows it in action. No non-BMP characters are involved in these examples, so the codePointAt family of methods is not needed.
import java.io.IOException;

public class Test {
    public static void main(String[] args) throws IOException {
        // "你好" ("Hello") in GBK
        byte[] gbkData = {(byte) 0xc4, (byte) 0xe3, (byte) 0xba, (byte) 0xc3};
        // "你好" ("Hello") in Big5
        byte[] big5Data = {(byte) 0xa7, (byte) 0x41, (byte) 0xa6, (byte) 0x6e};
        // Construct Strings, decoding to Unicode
        String strFromGBK = new String(gbkData, "GBK");
        String strFromBig5 = new String(big5Data, "BIG5");
        // Print the Unicode sequences
        showUnicode(strFromGBK);
        showUnicode(strFromBig5);
    }

    public static void showUnicode(String str) {
        for (int i = 0; i < str.length(); i++) {
            System.out.printf("\\u%x", (int) str.charAt(i));
        }
        System.out.println();
    }
}

The output is two identical lines, \u4f60\u597d: the GBK data and the Big5 data decode to the same Unicode sequence.
2. Converting from Unicode: the getBytes method

As you can see, since String holds the Unicode, converting it to other encodings is easy: the getBytes method looks up the target code page and returns the encoded bytes.
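A minimal sketch of my own illustrating getBytes, re-encoding the same string into both GBK and Big5:

import java.io.UnsupportedEncodingException;

public class GetBytesDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String str = "\u4f60\u597d"; // "你好", held internally as Unicode
        // Unicode -> GBK: prints C4 E3 BA C3
        for (byte b : str.getBytes("GBK")) System.out.printf("%02X ", b);
        System.out.println();
        // Unicode -> Big5: prints A7 41 A6 6E
        for (byte b : str.getBytes("Big5")) System.out.printf("%02X ", b);
        System.out.println();
    }
}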
3. Using Unicode as a bridge: converting between encodings
With the two building blocks above, converting between encodings is very simple; you just use them together. First, new String converts the original encoded data into a Unicode sequence, then getBytes converts the Unicode to the target encoding.
For example, a very simple GBK-to-Big5 conversion looks like this:
public static void main(String[] args) throws UnsupportedEncodingException {
    // Suppose these bytes were read from a file (GBK encoded)
    byte[] gbkData = {(byte) 0xc4, (byte) 0xe3, (byte) 0xba, (byte) 0xc3};
    // Convert to Unicode
    String tmp = new String(gbkData, "GBK");
    // Convert from Unicode to Big5
    byte[] big5Data = tmp.getBytes("Big5");
    // subsequent operations...
}

(The resulting big5Data is {0xa7, 0x41, 0xa6, 0x6e}, the same Big5 bytes used in the first example.)

4. Encoding loss problem
The reason the JSP framework decodes with ISO-8859-1 was explained above. Let's first use an example to simulate this restore process; the code is as follows:
import java.io.UnsupportedEncodingException;

public class Test {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // The JSP framework receives 6 bytes of data
        byte[] data = {(byte) 0xe4, (byte) 0xbd, (byte) 0xa0, (byte) 0xe5, (byte) 0xa5, (byte) 0xbd};
        // Print the original bytes
        showBytes(data);
        // The framework assumes ISO-8859-1 and builds a String
        String tmp = new String(data, "ISO-8859-1");
        // ******************************
        // The developer prints it and sees 6 European characters instead of the expected "你好"
        System.out.println("ISO decoding result: " + tmp);
        // So first recover the original 6 bytes (reverse lookup of the ISO-8859-1 code page)
        byte[] utfData = tmp.getBytes("ISO-8859-1");
        // Print the recovered bytes
        showBytes(utfData);
        // The developer knows the data is UTF-8, so rebuild the String with the UTF-8 code page
        String result = new String(utfData, "UTF-8");
        // Print again: correct!
        System.out.println("UTF-8 decoding result: " + result);
    }

    public static void showBytes(byte[] data) {
        for (byte b : data) System.out.printf("0x%x ", b);
        System.out.println();
    }
}

The first decoded output is wrong because the wrong decoding rule was used: checking the wrong code page produced the wrong Unicode sequence. And yet, looking those wrong Unicode values back up in the ISO-8859-1 code page restores the original bytes perfectly, after which decoding with the correct code page (UTF-8) yields the expected "你好".
5. Source file encoding problem

Finally, the source file encoding problem from the preface: a Java source file is saved as UTF-8, then compiled without an encoding parameter on a system whose default encoding is GBK. With a literal containing the single character "中", compilation fails; but the key observation is that if you replace "中" with "中国", compilation succeeds, and the program prints strange characters instead. Experimenting further, compilation fails whenever the number of Chinese characters in the literal is odd and passes when it is even. Why is this? Let's analyze it in detail below.
Because Java's String uses Unicode internally, the compiler transcodes our string literals at compile time, converting them from the source file's encoding to Unicode (according to Wikipedia, the class file stores them in a slightly modified form of UTF-8). Since we did not specify an encoding parameter when compiling, the compiler decodes the source in the platform default, GBK. If you know a little about UTF-8 and GBK, you know that a Chinese character usually takes 3 bytes in UTF-8 but only 2 bytes in GBK. That explains why the parity of the character count matters: 2 Chinese characters occupy 6 bytes in UTF-8, which GBK mis-decodes as exactly 3 characters; with 1 character there is a leftover byte that cannot be mapped, which shows up as a question mark in the compiler's output.
To be more specific: the UTF-8 encoding of "中国" in the source file is e4 b8 ad e5 9b bd. The compiler decodes these six bytes as GBK: the 3 byte pairs, looked up in cp936, yield the 3 Unicode values 6d93, e15e and 6d57, which correspond to the three strange characters in the output. After compilation, these three Unicode values are stored in the .class file in that UTF-8-like encoding; at run time the JVM holds the Unicode, but the final output is encoded once more when passed to the terminal, using the encoding set by the system locale, so changing the terminal's encoding settings will garble it yet again. Our e15e here has no character defined in the Unicode standard, so its display differs across fonts and platforms.
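The compiler's mis-decoding can be reproduced at run time. Here is a sketch of my own that recreates exactly those three Unicode values:

import java.io.UnsupportedEncodingException;

public class MisDecodeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // What the compiler sees: the UTF-8 bytes of "中国" in the source file
        byte[] sourceBytes = "\u4e2d\u56fd".getBytes("UTF-8"); // e4 b8 ad e5 9b bd
        // What the compiler does: decode those bytes as GBK
        String misread = new String(sourceBytes, "GBK");
        for (int i = 0; i < misread.length(); i++) {
            System.out.printf("\\u%x ", (int) misread.charAt(i));
        }
        // prints the three strange characters: u6d93 ue15e u6d57
        System.out.println();
    }
}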
Conversely, imagine storing the source file in GBK and then telling the compiler it is UTF-8. In that case it will essentially never compile, no matter how many Chinese characters you type, because UTF-8 is highly structured and arbitrary byte combinations almost never conform to its encoding rules.
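This, too, can be checked at run time with a strict decoder. A sketch of my own, using the java.nio.charset API, shows GBK bytes being rejected as UTF-8:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class StrictUtf8Check {
    public static void main(String[] args) throws Exception {
        byte[] gbkBytes = "\u4e2d\u56fd".getBytes("GBK"); // d6 d0 b9 fa
        // A decoder that reports malformed input instead of silently replacing it,
        // roughly what the compiler does when it rejects the file
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(gbkBytes));
            System.out.println("decoded as valid UTF-8 (unlikely)");
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8: " + e);
        }
    }
}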
Of course, the most direct way to have the compiler convert to Unicode correctly is to honestly tell it what the source file's encoding is, for example with javac -encoding UTF-8.
4. Summary
Through this round of research and experiments, I learned many encoding-related concepts and became familiar with the concrete process of encoding conversion. These ideas generalize to other programming languages, where the implementation principles are similar. I don't think this kind of problem will mystify me anymore.