Summary of experience in handling character encoding problems in Java

Author：Eve Cole Update Time：2025-04-28 09:00:04

A stream of bytes has no definite meaning until you know which encoding it uses.
Keep this in mind whenever you convert characters to bytes or bytes to characters; otherwise garbled text is likely to follow.
Garbled text is, in essence, the result of encoding and decoding with different character encodings. Once this is understood, most garbling problems are easy to diagnose.
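The mismatch described above can be reproduced in a few lines. The following is a minimal sketch, not a definitive diagnostic: the class name MismatchDemo and the helper roundTrip are made up for illustration, and the sample text is 你好, written with Unicode escapes so that the encoding of the source file does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MismatchDemo {
    // Hypothetical helper: encode with one charset, decode with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String original = "\u4f60\u597d";

        // Same charset on both sides: the text survives.
        String ok = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        // Different charsets: each of the 6 UTF-8 bytes becomes one Latin-1 char.
        String garbled = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(ok.equals(original));      // true
        System.out.println(garbled.equals(original)); // false
        System.out.println(garbled.length());         // 6
    }
}
```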
The places where this commonly comes up in Java are as follows:
1. The String class has a constructor that takes a byte array, String(byte[] bytes), and two overloads that specify the encoding:
(1) String(byte[] bytes, Charset charset)
(2) String(byte[] bytes, String charsetName)

2. The getBytes method of the String class, byte[] getBytes(), also has two overloads that specify the encoding:
(1) byte[] getBytes(Charset charset)
(2) byte[] getBytes(String charsetName)
Every variant that does not take an encoding uses the platform's default charset, which can be read with System.getProperty("file.encoding") or Charset.defaultCharset().
3. PrintStream's print(String s) is affected by the same issue. For this reason PrintStream offers not only the constructor PrintStream(File file) but also PrintStream(File file, String csn), where csn names the charset.
If no charset is given, the string's characters are converted into bytes according to the platform's default character encoding.
DataOutputStream has no constructor that specifies an encoding, but it provides writeUTF(String str), which always writes the string in a modified UTF-8 format.
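
To make point 3 concrete, here is a small sketch that writes the same text through a PrintStream with an explicit charset and through DataOutputStream.writeUTF. The class and helper names are made up for illustration. It uses the PrintStream(OutputStream, boolean, String) constructor rather than the File-based one so that the bytes can be captured in memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.PrintStream;

public class StreamEncodingDemo {
    // Bytes produced by PrintStream.print when the charset is named explicitly.
    static byte[] printWith(String text, String charsetName) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, charsetName)) {
            out.print(text);
        }
        return sink.toByteArray();
    }

    // Bytes produced by DataOutputStream.writeUTF:
    // a 2-byte length prefix followed by modified UTF-8 data.
    static byte[] writeUtf(String text) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeUTF(text);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String text = "\u4f60\u597d"; // two Chinese characters
        System.out.println(printWith(text, "UTF-8").length);    // 6
        System.out.println(printWith(text, "UTF-16BE").length); // 4
        System.out.println(writeUtf(text).length);              // 8
    }
}
```

The same two characters produce 6, 4, or 8 bytes depending on how they are written, which is why a reader has to know which encoding the writer used.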

Give the examples at the beginning to illustrate the need for specifying the encoding:
If a web page specifies the encoding as utf-8, <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />, there is a form on the page, submitted to a servlet
Then the byte stream passed by the characters entered by the user is encoding according to the specified encoding. For example, if you enter "Hello Hello", if it is utf-8, then the transmitted is as follows:

 [104, 101, 108, 108, 111, -28, -67, -96, -27, -91, -67]

Each of the two Chinese characters takes 3 bytes, as the UTF-8 encoding rules prescribe for characters in this range.
But if the page specifies GBK, the transmitted bytes are different:

 [104, 101, 108, 108, 111, -60, -29, -70, -61]
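
Both byte sequences can be reproduced with String.getBytes. This is a minimal sketch with made-up class and method names. GBK is not among the charsets every JVM is required to support, so the sketch checks for it first; on a typical full JDK it is available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FormBytesDemo {
    // "hello" followed by the two Chinese characters from the example,
    // written as Unicode escapes so the source file encoding does not matter.
    static final String INPUT = "hello\u4f60\u597d";

    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        // [104, 101, 108, 108, 111, -28, -67, -96, -27, -91, -67]
        System.out.println(Arrays.toString(encode(INPUT, StandardCharsets.UTF_8)));

        if (Charset.isSupported("GBK")) {
            // [104, 101, 108, 108, 111, -60, -29, -70, -61]
            System.out.println(Arrays.toString(encode(INPUT, Charset.forName("GBK"))));
        }
    }
}
```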

Therefore, when request.getParameter is called on the servlet side, the container internally does something like
String s = new String(bytes, request.getCharacterEncoding()). If the request encoding has not been set, getCharacterEncoding() returns null and the container falls back to its own default (ISO-8859-1 under the Servlet specification, or a platform default such as GBK in some setups) rather than the encoding of the page, and the Chinese text becomes garbled.
Therefore, to avoid garbled text, JSP sites generally install a filter so that all pages and servlets use one unified encoding, by calling request.setCharacterEncoding and response.setCharacterEncoding.
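
The effect of a missing request encoding can be simulated without a servlet container. The sketch below is only a model: decodeParameter is a hypothetical stand-in for what a container does internally, and it assumes the container falls back to ISO-8859-1 when no encoding is declared, as the Servlet specification describes. The repair helper shows a common workaround for text that was already decoded with that fallback.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ParameterDecodingSketch {
    // Hypothetical model of a container decoding a form parameter:
    // use the declared encoding, or ISO-8859-1 when none was set.
    static String decodeParameter(byte[] raw, String declaredEncoding) {
        Charset charset = (declaredEncoding == null)
                ? StandardCharsets.ISO_8859_1
                : Charset.forName(declaredEncoding);
        return new String(raw, charset);
    }

    // Undo a wrong ISO-8859-1 decode. This is lossless because
    // ISO-8859-1 maps every byte value to exactly one char.
    static String repair(String garbled, Charset actualEncoding) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), actualEncoding);
    }

    public static void main(String[] args) {
        String typed = "hello\u4f60\u597d";
        byte[] sent = typed.getBytes(StandardCharsets.UTF_8); // what the browser transmits

        String withEncoding = decodeParameter(sent, "UTF-8");
        String withoutEncoding = decodeParameter(sent, null);

        System.out.println(withEncoding.equals(typed));    // true
        System.out.println(withoutEncoding.equals(typed)); // false
        System.out.println(repair(withoutEncoding, StandardCharsets.UTF_8).equals(typed)); // true
    }
}
```

Setting the encoding once in a filter is the cleaner choice; the repair step only helps when the wrong decode was ISO-8859-1, because other charsets may drop or replace bytes.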

Conceptually, a Java String is a sequence of char values, each a 16-bit UTF-16 code unit. For this reason, whenever you convert characters and strings into bytes to write them to a file or the network, or turn bytes read from a file or the network back into meaningful characters, you must know which encoding is in use.
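
The 16-bit unit is visible through the String API itself. The following sketch uses made-up class and helper names, and the sample strings are chosen for illustration: two Chinese characters, and one character outside the Basic Multilingual Plane (U+1F600), which needs two char units.

```java
import java.nio.charset.StandardCharsets;

public class Utf16UnitsDemo {
    // Number of 16-bit UTF-16 code units in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (characters as Unicode defines them).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String chinese = "\u4f60\u597d";      // two characters, one char each
        String supplementary = "\uD83D\uDE00"; // U+1F600 as a surrogate pair

        System.out.println(codeUnits(chinese));        // 2
        System.out.println(codePoints(chinese));       // 2
        System.out.println(codeUnits(supplementary));  // 2
        System.out.println(codePoints(supplementary)); // 1

        // The byte count depends on the target encoding, not on length().
        System.out.println(chinese.getBytes(StandardCharsets.UTF_16BE).length); // 4
        System.out.println(chinese.getBytes(StandardCharsets.UTF_8).length);    // 6
    }
}
```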

Some experiences
1. A String always holds Unicode text (UTF-16 code units), regardless of the encoding of the bytes it was created from.
2. Pay attention to the use of String.getBytes():
If no character set parameter is given, the result depends on the JVM's default charset. This is generally UTF-8 on Linux and GBK on Chinese-language Windows; from JDK 18 onward the default is UTF-8 on all platforms. (To change the JVM's default charset, pass the option -Dfile.encoding=UTF-8 when starting the JVM.)
To be safe, always call it with a parameter, for example: s.getBytes("UTF-8").
3. The Charset class is very useful.
(1) Charset.encode performs encoding: it encodes a String (or CharBuffer) in the charset you specify and returns the bytes as a ByteBuffer.
(2) Charset.decode performs decoding: it decodes a ByteBuffer in the charset you specify and returns a CharBuffer, which toString() turns into a String.
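
Point 2 above can be checked on any machine with a short sketch. The class and method names are made up for illustration, and the printed charset names differ from one platform and JDK version to another, so no output is shown for them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefaultCharsetDemo {
    // getBytes() with no argument behaves like getBytes(Charset.defaultCharset()).
    static boolean usesDefaultCharset(String text) {
        return Arrays.equals(text.getBytes(), text.getBytes(Charset.defaultCharset()));
    }

    // With an explicit charset the result is the same on every platform.
    static byte[] portableBytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Platform dependent: inspect what this JVM uses by default.
        System.out.println(Charset.defaultCharset());
        System.out.println(System.getProperty("file.encoding"));

        String text = "hello\u4f60\u597d";
        System.out.println(usesDefaultCharset(text));   // true
        System.out.println(portableBytes(text).length); // 11
    }
}
```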

As an example:

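 import java.nio.ByteBuffer;
 import java.nio.CharBuffer;
 import java.nio.charset.Charset;
 import java.nio.charset.StandardCharsets;

 public class CharsetDemo {
     public static void main(String[] args) {
         // Name of the platform's default charset
         String s = Charset.defaultCharset().displayName();
         System.out.println(s);

         // Encode with String.getBytes and print each byte as hex
         String s1 = "I like you,My Love";
         ByteBuffer bb1 = ByteBuffer.wrap(s1.getBytes(StandardCharsets.UTF_8));
         for (byte bt : bb1.array()) {
             System.out.printf("%x", bt);
         }
         System.out.println();

         // char[] usage ('you' is not a valid char literal, so it is spelled out)
         char[] chArray = {'I', 'L', 'o', 'v', 'e', 'y', 'o', 'u'};
         // CharBuffer usage: wrap() leaves the buffer ready to read,
         // so flip() is not called (it would set the limit to 0)
         CharBuffer cb = CharBuffer.wrap(chArray);
         String s2 = new String(chArray);

         // Encode with Charset: CharBuffer or String in, ByteBuffer out
         ByteBuffer bb2 = StandardCharsets.UTF_8.encode(cb);
         ByteBuffer bb3 = StandardCharsets.UTF_8.encode(s1);
         // Copy only the valid bytes; array() may include unused capacity
         byte[] b = new byte[bb3.remaining()];
         bb3.get(b);

         // Decode with Charset: ByteBuffer in, CharBuffer out
         ByteBuffer bb4 = ByteBuffer.wrap(b);
         String s3 = StandardCharsets.UTF_8.decode(bb4).toString();
         System.out.println(s3);
     }
 }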