Detailed explanation of character encoding format in Java

Author：Eve Cole Update Time：2025-06-07 18:00:04

1. Preface

While analyzing Comparable and Comparator, I looked at the compareTo method of the String class. Internally, String stores its contents in a char[] array (in JDK 8 and earlier; since JDK 9 it is a byte[] plus a coder flag, although the API still exposes char values), and compareTo compares the two strings character by character. That made me wonder: can a char in Java store a Chinese character? It turns out that it can, which leads to the question of how characters are encoded in Java.

2. Java storage format

In Java, the following code prints the bytes of the character '张' (Zhang) in several encodings.

    import java.io.UnsupportedEncodingException;

    public class Test {
        // Encodes content with the named charset and returns the bytes as uppercase hex.
        public static String getCode(String content, String format) throws UnsupportedEncodingException {
            byte[] bytes = content.getBytes(format);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < bytes.length; i++) {
                // Mask to 0..255 and pad to two digits so small values keep their leading zero.
                sb.append(String.format("%02X ", bytes[i] & 0xff));
            }
            return sb.toString();
        }

        public static void main(String[] args) throws UnsupportedEncodingException {
            // '张' is the code point U+5F20.
            System.out.println("gbk : " + getCode("张", "gbk"));
            System.out.println("gb2312: " + getCode("张", "gb2312"));
            System.out.println("iso-8859-1: " + getCode("张", "iso-8859-1"));
            System.out.println("unicode : " + getCode("张", "unicode"));
            System.out.println("utf-16 : " + getCode("张", "utf-16"));
            System.out.println("utf-8 : " + getCode("张", "utf-8"));
        }
    }

Running results:

    gbk : D5 C5
    gb2312: D5 C5
    iso-8859-1: 3F
    unicode : FE FF 5F 20
    utf-16 : FE FF 5F 20
    utf-8 : E5 BC A0

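Note: The results show that the character '张' (Zhang) has the same bytes in gbk and gb2312, and that Java's "unicode" charset produces the same bytes as utf-16, while the iso-8859-1, utf-16, and utf-8 results all differ from one another. The iso-8859-1 result 3F is simply '?', because that charset cannot represent the character. The leading FE FF in the utf-16 result is a byte order mark, so the character itself is 5F 20. So which encoding does the JVM use to store the character? Let's start the analysis below.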
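To make the byte order mark and the '?' replacement concrete, here is a minimal sketch. The class name BomDemo and the helper hex are made up for illustration, and the character is written as the escape \u5F20 so the example does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    // Formats bytes as two-digit uppercase hex values separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u5F20"; // the character 张

        // UTF-16 without an explicit byte order writes a big-endian BOM (FE FF) first.
        System.out.println("utf-16   : " + hex(s.getBytes(StandardCharsets.UTF_16)));   // FE FF 5F 20
        // The explicit big-endian and little-endian variants write no BOM.
        System.out.println("utf-16be : " + hex(s.getBytes(StandardCharsets.UTF_16BE))); // 5F 20
        System.out.println("utf-16le : " + hex(s.getBytes(StandardCharsets.UTF_16LE))); // 20 5F

        // ISO-8859-1 cannot represent the character, so getBytes substitutes '?' (0x3F).
        System.out.println("can encode: " + StandardCharsets.ISO_8859_1.newEncoder().canEncode(s)); // false
        System.out.println("iso-8859-1: " + hex(s.getBytes(StandardCharsets.ISO_8859_1)));          // 3F
    }
}
```

Using the StandardCharsets constants instead of charset names also avoids the checked UnsupportedEncodingException.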

3. Exploring the question

1. View the storage format of the .class file constant pool

The test code is as follows:

    public class Test {
        public static void main(String[] args) {
            String str = "张";
        }
    }

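Disassemble the class with javap -verbose Test.class and look at the constant pool: the string literal refers to a Utf8 entry that holds the character.

Then open the class file in a hex editor such as WinHex: in the constant pool the character is stored as the three bytes E5 BC A0, the same bytes as the utf-8 result above.

Note: Both views show that the character is stored in the class file in utf-8 format (strictly speaking, the JVM's modified UTF-8, which is identical to standard UTF-8 for this character).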
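The class file format stores strings in the constant pool in "modified UTF-8", the same variant that DataOutputStream.writeUTF produces. The sketch below therefore uses writeUTF as a stand-in for inspecting a class file. That substitution is an assumption made for illustration, and the class name ModifiedUtf8Demo and its helper are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class ModifiedUtf8Demo {
    // Returns the modified UTF-8 bytes of s as hex, without the 2-byte length prefix
    // that writeUTF puts in front of the encoded data.
    static String modifiedUtf8Hex(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] bytes = buffer.toByteArray();
        StringBuilder sb = new StringBuilder();
        for (int i = 2; i < bytes.length; i++) { // skip the length prefix
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // For 张 (U+5F20), modified UTF-8 and standard UTF-8 agree.
        System.out.println(modifiedUtf8Hex("\u5F20")); // E5 BC A0
        // One place where they differ: the NUL character is written as two bytes.
        System.out.println(modifiedUtf8Hex("\0"));     // C0 80
    }
}
```

For ordinary Chinese text the two encodings give the same bytes, which is why the hex dump of the class file matches the utf-8 output shown earlier.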

But is it also utf-8 at runtime? Let's continue exploring.

2. Find out with a program

Use the following code:

    public class Test {
        public static void main(String[] args) {
            String str = "张";
            // Print the code point of the first character as uppercase hex.
            System.out.println(Integer.toHexString(str.codePointAt(0)).toUpperCase());
        }
    }

Running results:

5F20

Note: codePointAt returns the Unicode code point, U+5F20, which for this character is identical to its single utf-16 code unit (the 5F 20 seen earlier). This indicates that at runtime the JVM represents characters in utf-16. A utf-16 code unit is two bytes; a character that does not fit in two bytes is represented with four bytes (a surrogate pair). A later article will cover this. The source code of the Character class also states that char values are utf-16 code units, so both approaches lead to the same answer.
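The two-byte versus four-byte case can be seen directly by printing the utf-16 code units of a string. This is a small sketch: the class name SurrogateDemo, the helper codeUnits, and the choice of U+20BB7 as a sample character outside the two-byte range are all made up for illustration.

```java
class SurrogateDemo {
    // Returns each UTF-16 code unit (each char) of s as four hex digits.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String twoBytes = "\u5F20";                                // 张, one code unit
        String fourBytes = new String(Character.toChars(0x20BB7)); // needs a surrogate pair

        System.out.println(codeUnits(twoBytes));  // 5F20
        System.out.println(codeUnits(fourBytes)); // D842 DFB7

        // length() counts code units, codePointCount() counts characters.
        System.out.println(fourBytes.length());                               // 2
        System.out.println(fourBytes.codePointCount(0, fourBytes.length()));  // 1
        System.out.println(Integer.toHexString(fourBytes.codePointAt(0)).toUpperCase()); // 20BB7
    }
}
```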

3. Can the char type store Chinese characters?

From the exploration above, we know that characters in Java class files are encoded in utf-8 and are stored as utf-16 when the JVM is running. The character used above, U+5F20, fits in two bytes, and a char in Java is also two bytes, so a char can store it. The same holds for any Chinese character in the Basic Multilingual Plane; rarer characters beyond U+FFFF need two char values.
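As a quick check of this conclusion, the following sketch assigns the character to a char and asks the Character class how many char values a code point needs. The class name CharDemo and the helper fitsInOneChar are hypothetical names used only here.

```java
class CharDemo {
    // True if the code point can be held by a single char (one UTF-16 code unit).
    static boolean fitsInOneChar(int codePoint) {
        return Character.charCount(codePoint) == 1;
    }

    public static void main(String[] args) {
        char c = '\u5F20'; // 张 stored in a single char
        System.out.println(Integer.toHexString(c).toUpperCase()); // 5F20
        System.out.println(Character.BYTES);                      // 2

        System.out.println(fitsInOneChar(0x5F20));  // true
        System.out.println(fitsInOneChar(0x20BB7)); // false: needs two char values
    }
}
```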

4. Summary

After the above analysis, we know:

1. Characters are encoded in utf-8 format in the class file and in utf-16 format inside the running JVM.

2. The char type is two bytes and can store Chinese characters, as long as the character fits in a single utf-16 code unit.

During this investigation I read a lot of material about characters. I learned a great deal and found it particularly interesting. I will share more of it next, with a brief introduction to encoding issues in Java. Stay tuned.