1. Java encoding conversion process
Java class files are what interact with users most directly (input and output), and the text involved in that interaction may contain Chinese. Whether these Java classes talk to a database or to a front-end page, their life cycle is always the same:
(1) A programmer writes the program code in an editor on the operating system and saves it as a .java file. We call these files source files.
(2) The source files are compiled with javac.exe from the JDK into .class class files.
(3) Run these classes directly or deploy them in a WEB container to get the output result.
These steps are only the macro view, and they are not enough to understand what actually happens. We need to look at how Java really encodes and decodes:
Step 1: When we write a java source file in an editor, the file is saved using the operating system's default encoding (on a Chinese operating system this is usually GBK), forming a .java file. In other words, the source file is stored by default in the file.encoding charset of the operating system. The following code prints the system's file.encoding value:
System.out.println(System.getProperty("file.encoding"));

Step 2: When we compile the java file with javac.exe, the JDK first checks the -encoding compilation parameter to determine the source charset. If no parameter is given, the JDK falls back to the operating system's default file.encoding. The JDK then converts the source program from that charset into Java's internal UNICODE representation and holds it in memory.
Step 3: The JDK writes this compiled in-memory representation out to form a .class file. At this point the .class file is Unicode encoded; that is, the characters in a .class file, whether Chinese or English, have all been converted to Unicode (string constants in the constant pool are stored in a UTF-8 variant of Unicode).
In this step, JSP source files are handled a bit differently: the WEB container invokes the JSP compiler, which first checks whether the JSP file declares its own encoding (for example via the pageEncoding attribute). If it is not set, the JSP compiler lets the JDK convert the JSP file into a temporary servlet class using the default encoding, then compiles it into a .class file kept in a temporary folder.
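If the JSP does declare its own encoding, it usually does so in a page directive. A minimal sketch (standard JSP syntax; the attribute values are chosen for illustration):

```jsp
<%@ page pageEncoding="UTF-8" contentType="text/html; charset=UTF-8" %>
```

pageEncoding tells the JSP compiler how to read the source file, while the charset in contentType controls the encoding of the response sent to the browser.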
Step 4: Run the compiled class. There are several situations here:
(1) Run directly on the console.
(2) JSP/Servlet class.
(3) Between the java class and the database.
Each of these three situations is handled differently.
1. Classes running on Console
In this case, the JVM first reads the class file saved on the operating system into memory, where it is Unicode encoded, and then runs it. If the user needs to enter information, the input arrives in the file.encoding charset and is converted to Unicode to be held in memory. After the program runs, the result is converted back to file.encoding, returned to the operating system, and printed to the console. The entire process is as follows:
Every encoding conversion in the process above must be correct; an error at any step produces garbled text.
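The kind of mismatch described above is easy to reproduce in code. The following is a minimal, self-contained sketch (the string and charsets are chosen for illustration): it encodes a Chinese string with UTF-8 and then decodes the bytes with GBK, which is exactly the conversion error that garbles output.

```java
public class MojibakeDemo {
    public static void main(String[] args) throws Exception {
        String original = "我是cm";                    // "I am cm"
        byte[] utf8Bytes = original.getBytes("UTF-8"); // encode with UTF-8
        // Decode the same bytes with the wrong charset: GBK
        String garbled = new String(utf8Bytes, "GBK");
        System.out.println(garbled);                   // garbled, not "我是cm"
        // Decoding with the matching charset recovers the text
        String recovered = new String(utf8Bytes, "UTF-8");
        System.out.println(recovered.equals(original)); // true
    }
}
```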
2. Servlet classes
Since JSP files are eventually converted into servlet classes (only the storage location differs), JSP files are included here as well.
When a user requests a servlet, the WEB container asks its JVM to run it. The JVM first loads the servlet class into memory, where the code is in Unicode. The JVM then runs the servlet in memory. If the servlet needs to accept data sent by the client (such as form or URL parameters), the WEB container receives it; if the program has set an encoding for incoming parameters, that charset is used, otherwise the default ISO-8859-1 is used. After receiving the data, the JVM converts it to Unicode and keeps it in memory. After the servlet runs, the output is generated, still in Unicode. The WEB container then sends the generated string to the client: if the program specifies an encoding at output time, that charset is used, otherwise the default ISO-8859-1. The entire process flow chart is as follows:
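When the container has already decoded a parameter with the default ISO-8859-1 but the client actually sent UTF-8 bytes, a common workaround is to re-encode the mis-decoded string back to bytes and decode it again with the real charset. The core of the trick is plain Java and can be sketched outside a servlet (the variable names are illustrative; in a real servlet the value would come from request.getParameter()):

```java
public class ReDecodeDemo {
    public static void main(String[] args) throws Exception {
        String sent = "我是cm";
        // Simulate the container decoding UTF-8 bytes as ISO-8859-1
        byte[] wire = sent.getBytes("UTF-8");
        String misDecoded = new String(wire, "ISO-8859-1");
        // Workaround: go back to the original bytes, then decode correctly.
        // ISO-8859-1 maps bytes 0x00-0xFF one-to-one, so no bytes are lost.
        String fixed = new String(misDecoded.getBytes("ISO-8859-1"), "UTF-8");
        System.out.println(fixed.equals(sent)); // true
    }
}
```

This works only because ISO-8859-1 is a lossless byte-to-char mapping; the cleaner fix is to call request.setCharacterEncoding("UTF-8") before reading any parameter.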
3. Database part
We know that Java programs connect to databases through a JDBC driver, and the JDBC driver defaults to the ISO-8859-1 encoding. That is to say, when we pass data to the database from a Java program, JDBC first converts the data from Unicode to ISO-8859-1 and then stores it; in other words, the database saves data in ISO-8859-1 by default.
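Accordingly, most JDBC drivers let the connection charset be set explicitly so data is not forced through ISO-8859-1. A sketch of what that looks like (assuming MySQL Connector/J; the host, database name, and option values are illustrative, and other drivers use different options):

```java
public class JdbcUrlDemo {
    public static void main(String[] args) {
        // Hypothetical connection URL; useUnicode/characterEncoding are
        // Connector/J options that fix the charset used on the wire.
        String url = "jdbc:mysql://localhost:3306/testdb"
                   + "?useUnicode=true&characterEncoding=UTF-8";
        System.out.println(url);
    }
}
```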
2. Encoding & decoding
The following lists the occasions on which Java needs to perform encoding and decoding, and sorts out the intermediate process in detail to further master Java's encoding and decoding. There are four main scenarios in Java that require encoding and decoding:
(1): I/O operation
(2): Memory
(3): Database
(4): javaWeb
The following mainly introduces the first two scenarios. As long as the database part is configured correctly, it causes no problems. The javaWeb scenario is too large to cover here (it requires understanding URL encoding, GET and POST parameter encoding, and servlet decoding), so the author covers it separately.
1. I/O operations
As mentioned earlier, garbled text comes down to inconsistent charsets during transcoding, for example encoding with UTF-8 and decoding with GBK. The most fundamental cause, however, is a problem in the character-to-byte or byte-to-character conversion, and the main place such conversions happen is I/O. I/O here includes network I/O (that is, javaWeb) and disk I/O; network I/O is covered separately.
First, let’s look at the I/O encoding operation.
InputStream is the superclass of all byte input stream classes, and Reader is the abstract superclass for reading character streams. Java reads files either by byte stream or by character stream, and InputStream and Reader are the superclasses of these two approaches.
By byte: we usually use the InputStream.read() method to read bytes from the stream (read() reads only one byte at a time, which is slow, so in practice we use read(byte[])), save them into a byte[] array, and finally convert that array to a String. When we read a file, the encoding of the bytes is whatever encoding the file was written with, and the conversion to String involves encoding as well; if the two charsets differ, problems appear. For example, suppose test.txt is encoded in UTF-8, so the bytes read from it via a byte stream are UTF-8. If we do not specify a charset when constructing the String, the system default (GBK here) is used for decoding. Since the two charsets are inconsistent, constructing the String is guaranteed to produce garbled text, as follows:
File file = new File("C://test.txt");
InputStream input = new FileInputStream(file);
StringBuffer buffer = new StringBuffer();
byte[] bytes = new byte[1024];
for (int n; (n = input.read(bytes)) != -1; ) {
    buffer.append(new String(bytes, 0, n));
}
System.out.println(buffer);

The output result is garbled...
The content of test.txt is: 我是 cm ("I am cm").
To avoid garbled output, specify the charset when constructing the String so that encoding and decoding use the same charset:
buffer.append(new String(bytes,0,n,"UTF-8"));
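One caveat about the loop above: even with "UTF-8" specified, decoding each byte[] chunk separately can still garble the output, because a multi-byte UTF-8 character may be split across two read() calls, and each half then decodes to a replacement character. A minimal sketch of that failure (simulating the two chunks in memory instead of reading a file):

```java
import java.util.Arrays;

public class SplitCharDemo {
    public static void main(String[] args) throws Exception {
        byte[] bytes = "我是cm".getBytes("UTF-8");
        // Split mid-character: "我" is E6 88 91; cut after the first byte
        byte[] chunk1 = Arrays.copyOfRange(bytes, 0, 1);
        byte[] chunk2 = Arrays.copyOfRange(bytes, 1, bytes.length);
        String perChunk = new String(chunk1, "UTF-8")
                        + new String(chunk2, "UTF-8");
        System.out.println(perChunk.equals("我是cm")); // false: split char mangled
        // Decoding all the bytes at once is safe
        System.out.println(new String(bytes, "UTF-8").equals("我是cm")); // true
    }
}
```

This is one reason reading by character (with a Reader) is more robust: the decoder keeps state across reads instead of decoding isolated chunks.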
By character: a character stream can be regarded as a wrapper stream. Underneath, it still uses a byte stream to read bytes, and then decodes those bytes into characters with the specified charset. In Java, Reader is the superclass for reading character streams. So at the bottom there is no difference between reading a file by bytes and by characters; the difference when reading is that a character read consumes several bytes at a time, while a byte stream reads one byte at a time.
Byte & character conversion: converting bytes to characters is essentially the job of InputStreamReader. The API describes it as follows: InputStreamReader is a bridge from byte streams to character streams: it reads bytes and decodes them into characters using a specified charset. The charset it uses may be specified by name or given explicitly, or the platform's default charset may be accepted. Each invocation of one of InputStreamReader's read() methods may cause one or more bytes to be read from the underlying byte input stream. To enable efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation. The API explanation is very clear: InputStreamReader still reads the file as bytes at the bottom; after reading, the bytes are parsed into characters according to a specified charset, and if no charset is specified the system default is used.
String file = "C://test.txt";
String charset = "UTF-8";

// Write characters, converting to a byte stream
FileOutputStream outputStream = new FileOutputStream(file);
OutputStreamWriter writer = new OutputStreamWriter(outputStream, charset);
try {
    writer.write("I am cm");
} finally {
    writer.close();
}

// Read bytes, converting to characters
FileInputStream inputStream = new FileInputStream(file);
InputStreamReader reader = new InputStreamReader(inputStream, charset);
StringBuffer buffer = new StringBuffer();
char[] buf = new char[64];
int count = 0;
try {
    while ((count = reader.read(buf)) != -1) {
        buffer.append(buf, 0, count);
    }
} finally {
    reader.close();
}
System.out.println(buffer);

2. Memory
First, let's look at the following simple code
String s = "我是 cm"; // "I am cm"
byte[] bytes = s.getBytes();
String s1 = new String(bytes, "GBK");
String s2 = new String(bytes);
This code contains three encoding conversion processes (one encode, two decodes). Let's look at String.getBytes() first:
public byte[] getBytes() {
    return StringCoding.encode(value, 0, value.length);
}

It internally calls the StringCoding.encode() method:
static byte[] encode(char[] ca, int off, int len) {
    String csn = Charset.defaultCharset().name();
    try {
        // use charset name encode() variant which provides caching.
        return encode(csn, ca, off, len);
    } catch (UnsupportedEncodingException x) {
        warnUnsupportedCharset(csn);
    }
    try {
        return encode("ISO-8859-1", ca, off, len);
    } catch (UnsupportedEncodingException x) {
        // If this code is hit during VM initialization, MessageUtils is
        // the only way we will be able to get any kind of error message.
        MessageUtils.err("ISO-8859-1 charset not available: " + x.toString());
        // If we can not find ISO-8859-1 (a required encoding) then things
        // are seriously wrong with the installation.
        System.exit(1);
        return null;
    }
}

The encode(char[] ca, int off, int len) method uses the system's default charset; if that charset is unavailable, it falls back to ISO-8859-1. Going one level deeper, in the encode(String charsetName, ...) variant, a null charset name also falls back to ISO-8859-1:
String csn = (charsetName == null) ? "ISO-8859-1" : charsetName;
Similarly, the String constructor calls the StringCoding.decode() method:
public String(byte bytes[], int offset, int length, Charset charset) {
    if (charset == null)
        throw new NullPointerException("charset");
    checkBounds(bytes, offset, length);
    this.value = StringCoding.decode(charset, bytes, offset, length);
}

The decode method handles the charset the same way encode does.
For the above two situations, as long as we set one consistent charset, garbled text generally does not occur.
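A minimal sketch of that advice: passing the same explicit charset to both getBytes() and the String constructor makes the round trip deterministic, regardless of the platform default.

```java
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    public static void main(String[] args) {
        String s = "我是cm";
        // Encode and decode with the same explicit charset: always safe
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        String back = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(back.equals(s)); // true
        // By contrast, the no-argument getBytes()/new String(bytes) pair
        // only works when both ends happen to share the platform default.
    }
}
```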
3. Encoding & encoding format
First, look at the Java encoding class diagram
First, a Charset is obtained for the specified charset name; then a CharsetEncoder object is created from that Charset; finally, CharsetEncoder.encode is called to encode the string. Each encoding type corresponds to its own class, and the actual encoding work is done in those classes. The following sequence diagram shows the detailed encoding process:
Through this class diagram and sequence diagram you can understand the detailed encoding process. Below, a simple program encodes the same string with ISO-8859-1, GBK, and UTF-8.
public class Test02 {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String string = "我是 cm"; // "I am cm"
        Test02.printChart(string.toCharArray());
        Test02.printChart(string.getBytes("ISO-8859-1"));
        Test02.printChart(string.getBytes("GBK"));
        Test02.printChart(string.getBytes("UTF-8"));
    }

    /**
     * Convert char to hexadecimal
     */
    public static void printChart(char[] chars) {
        for (int i = 0; i < chars.length; i++) {
            System.out.print(Integer.toHexString(chars[i]) + " ");
        }
        System.out.println("");
    }

    /**
     * Convert byte to hexadecimal
     */
    public static void printChart(byte[] bytes) {
        for (int i = 0; i < bytes.length; i++) {
            String hex = Integer.toHexString(bytes[i] & 0xFF);
            if (hex.length() == 1) {
                hex = '0' + hex;
            }
            System.out.print(hex.toUpperCase() + " ");
        }
        System.out.println("");
    }
}

Output:
6211 662f 20 63 6d
3F 3F 20 63 6D
CE D2 CA C7 20 63 6D
E6 88 91 E6 98 AF 20 63 6D
Through the program we can see that the encoding results of "我是 cm" ("I am cm") are:

char[]:     6211 662f 20 63 6d
ISO-8859-1: 3F 3F 20 63 6D
GBK:        CE D2 CA C7 20 63 6D
UTF-8:      E6 88 91 E6 98 AF 20 63 6D
The picture is as follows:
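Note the 3F 3F in the ISO-8859-1 row: ISO-8859-1 cannot represent Chinese characters, so the encoder substitutes '?' (0x3F) for each one. This substitution is easy to verify directly:

```java
public class Latin1Demo {
    public static void main(String[] args) throws Exception {
        byte[] bytes = "我是 cm".getBytes("ISO-8859-1");
        // Each unmappable character is replaced with '?' (0x3F)
        System.out.println(bytes[0] == 0x3F && bytes[1] == 0x3F); // true
    }
}
```

This is why data that passes through an ISO-8859-1 stage can be permanently lost: unlike the re-decoding trick, a '?' cannot be turned back into the original character.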