Truncating strings that contain Chinese characters by byte length in Java (recommended)

Author：Eve Cole Update Time：2025-04-24 12:32:01

The Oracle column used by the interface has a fixed size in bytes, and an incoming string may be larger than that. In that case the string has to be truncated so that its encoded length does not exceed the byte limit of the column.

I followed examples found online and implemented the truncation with a recursive call. The byte length of the result must not exceed the byte length of the column, so if the character at the cut point is a Chinese character that does not fit completely, the cut has to move back to before that character.

    /**
     * Check whether the string exceeds the given number of bytes. If it does,
     * drop the last character and recurse until it fits. Always specify the
     * character encoding explicitly: the default encoding differs between
     * systems, and so does the resulting number of bytes.
     *
     * @param s   the original string
     * @param num the maximum number of bytes
     * @return the truncated string
     * @throws UnsupportedEncodingException if UTF-8 is not supported
     */
    public static String idgui(String s, int num) throws UnsupportedEncodingException {
        int changdu = s.getBytes("UTF-8").length; // changdu: byte length
        if (changdu > num) {
            s = s.substring(0, s.length() - 1);
            s = idgui(s, num);
        }
        return s;
    }

A Java interview question:

Write a function that intercepts strings, inputs as a string and bytes, and outputs as a string intercepted by bytes. However, you must ensure that the Chinese characters are not cut off half. For example, "I ABC" 4 should be cut off as "I AB", enter "I ABC Chinese DEF", and 6 should be output as "I ABC" instead of "I ABC+ Chinese half".

Many popular languages, such as C# and Java, represent strings internally as UTF-16 (historically UCS-2). In this representation every char is a 16-bit code unit, that is, two bytes. Therefore, if the string to be truncated mixes Chinese characters, English letters, and digits, counting characters is not the same as counting bytes in the target encoding. Consider the following string:

String s = "a plus b equals c, if a etc. 1 and b equals 2, then c etc. 3";

The string above contains both Chinese characters, English characters and numbers. If you want to intercept the characters of the first 6 bytes, it should be "a plus b, etc.", but if you use the substring method to intercept the first 6 characters, it will become "a plus b equals c". The reason for this problem is that the substring method treats double-byte Chinese characters as one byte character (UCS2 characters).

The number of bytes occupied by English letters and Chinese characters in different encoding formats is also different. We can use the following examples to see how many bytes a English letter and a Chinese character occupy in some common encoding formats.

    import java.io.UnsupportedEncodingException;

    public class EncodeTest {

        /**
         * Print the number of bytes of the string under the given encoding,
         * followed by the encoding name.
         *
         * @param s            the string
         * @param encodingName the encoding name
         */
        public static void printByteLength(String s, String encodingName) {
            System.out.print("Number of bytes: ");
            try {
                System.out.print(s.getBytes(encodingName).length);
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
            }
            System.out.println("; Encoding: " + encodingName);
        }

        public static void main(String[] args) {
            String en = "A";
            String ch = "人";

            // Number of bytes of an English letter under various encodings
            System.out.println("English letter: " + en);
            EncodeTest.printByteLength(en, "GB2312");
            EncodeTest.printByteLength(en, "GBK");
            EncodeTest.printByteLength(en, "GB18030");
            EncodeTest.printByteLength(en, "ISO-8859-1");
            EncodeTest.printByteLength(en, "UTF-8");
            EncodeTest.printByteLength(en, "UTF-16");
            EncodeTest.printByteLength(en, "UTF-16BE");
            EncodeTest.printByteLength(en, "UTF-16LE");

            System.out.println();

            // Number of bytes of a Chinese character under various encodings
            System.out.println("Chinese character: " + ch);
            EncodeTest.printByteLength(ch, "GB2312");
            EncodeTest.printByteLength(ch, "GBK");
            EncodeTest.printByteLength(ch, "GB18030");
            EncodeTest.printByteLength(ch, "ISO-8859-1");
            EncodeTest.printByteLength(ch, "UTF-8");
            EncodeTest.printByteLength(ch, "UTF-16");
            EncodeTest.printByteLength(ch, "UTF-16BE");
            EncodeTest.printByteLength(ch, "UTF-16LE");
        }
    }

The output is as follows:

    English letter: A
    Number of bytes: 1; Encoding: GB2312
    Number of bytes: 1; Encoding: GBK
    Number of bytes: 1; Encoding: GB18030
    Number of bytes: 1; Encoding: ISO-8859-1
    Number of bytes: 1; Encoding: UTF-8
    Number of bytes: 4; Encoding: UTF-16
    Number of bytes: 2; Encoding: UTF-16BE
    Number of bytes: 2; Encoding: UTF-16LE

    Chinese character: 人
    Number of bytes: 2; Encoding: GB2312
    Number of bytes: 2; Encoding: GBK
    Number of bytes: 2; Encoding: GB18030
    Number of bytes: 1; Encoding: ISO-8859-1
    Number of bytes: 3; Encoding: UTF-8
    Number of bytes: 4; Encoding: UTF-16
    Number of bytes: 2; Encoding: UTF-16BE
    Number of bytes: 2; Encoding: UTF-16LE

UTF-16BE and UTF-16LE are two members of the Unicode encoding family. The Unicode standard defines three encoding forms (UTF-8, UTF-16, and UTF-32) and seven encoding schemes (UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE). Java holds strings in memory as UTF-16 code units. When encoding with the plain "UTF-16" scheme, Java writes a 2-byte byte order mark first, which is why UTF-16 reports 4 bytes in the results while UTF-16BE and UTF-16LE report 2. The results also show that in GB2312, GBK, and GB18030 the Chinese character takes two bytes and the English letter one, so all three fit the assumptions of the question. The answer below uses GBK as an example.
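The 4-byte result for UTF-16 can be inspected directly. The following sketch (Utf16Bom and hexOf are hypothetical names) prints the encoded bytes in hexadecimal. When encoding, Java's plain UTF-16 charset writes a big-endian byte order mark (FE FF) before the data, while UTF-16BE and UTF-16LE write none.

```java
import java.nio.charset.Charset;

class Utf16Bom {
    // Hypothetical helper: encode s with the named charset and format the bytes as hex.
    static String hexOf(String s, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hexOf("A", "UTF-16"));   // FE FF 00 41
        System.out.println(hexOf("A", "UTF-16BE")); // 00 41
        System.out.println(hexOf("A", "UTF-16LE")); // 41 00
        System.out.println(hexOf("人", "UTF-16"));   // FE FF 4E BA
    }
}
```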

We cannot use the substring(int beginIndex, int endIndex) method of the String class directly, because it works on characters. Both '我' and 'Z' count as one character, each with a length of 1. In fact, once we can distinguish Chinese characters from English letters, the problem is easy to solve. The difference is that in GBK a Chinese character takes two bytes and an English letter takes one.
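Both observations can be verified with a small sketch. It assumes the GBK charset is available. As a stricter alternative to comparing byte counts, it also shows a check based on the Unicode script of the character. HanCheck, isHan, and gbkBytes are hypothetical names used only for this illustration.

```java
import java.nio.charset.Charset;

class HanCheck {
    // Hypothetical helper: true when the code point belongs to the Han script.
    static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    // Hypothetical helper: number of bytes of s in GBK.
    static int gbkBytes(String s) {
        return s.getBytes(Charset.forName("GBK")).length;
    }

    public static void main(String[] args) {
        System.out.println("我".length() + " " + "Z".length());  // 1 1
        System.out.println(gbkBytes("我") + " " + gbkBytes("Z")); // 2 1
        System.out.println(isHan('我'));                          // true
        // A full-width comma also takes two bytes in GBK but is not a Han character.
        System.out.println(gbkBytes("，") + " " + isHan('，'));    // 2 false
    }
}
```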

 package com.newyulong.iptv.billing.ftpupload;import java.io.UnsupportedEncodingException;public class CutString { /** * Determine whether it is a Chinese character* * @param c * Character* @return true means it is a Chinese character, false means it is an English letter* @throws UnsupportedEncodingException * Used encoding formats that are not supported by JAVA*/ public static boolean isChineseChar(char c) throws UnsupportedEncodingException { // If the number of bytes is greater than 1, it is a Chinese character// This way is not very rigorous in distinguishing English letters from Chinese characters, but in this question, this judgment is enough to return String.valueOf(c).getBytes("UTF-8").length > 1; } /** * Intercept string by byte* * @param original string* @param count * Intercepted digits* @return Intercepted string* @throws UnsupportedEncodingException * Use encoding formats that JAVA does not support*/ public static String substring(String original, int count) throws UnsupportedEncodingException { // The original character is not null, nor is it an empty string if (orignal != null && !"".equals(orignal)) { // Convert the original string to GBK encoding format orignal = new String(orignal.getBytes(), "UTF-8");// // System.out.println(orignal); //System.out.println(orignal.getBytes().length); // The number of bytes to be intercepted is greater than 0 and less than the number of bytes of the original string if (count > 0 && count < orignal.getBytes("UTF-8").length) { StringBuffer buff = new StringBuffer(); char c; for (int i = 0; i < count; i++) { System.out.println(count); c = original.charAt(i); buff.append(c); if (CutString.isChineseChar(c)) { // When encountering Chinese characters, cut the total number of bytes bytes by 1 --count; } } // System.out.println(new String(buff.toString().getBytes("GBK"),"UTF-8")); return new String(buff.toString().getBytes(),"UTF-8"); } } return original; } /** * Intercept string by byte* * @param original string* @param 
count * Intercept digits* @return Intercepted string* @throws UnsupportedEncodingException * Used encoding formats that JAVA does not support*/ public static String gsubstring(String original, int count) throws UnsupportedEncodingException { // The original characters are not null, nor are they empty strings if (orignal != null && !"".equals(orignal)) { // Convert the original string to GBK encoding format orignal = new String(orignal.getBytes(), "GBK"); // The number of bytes to be intercepted is greater than 0 and less than the number of bytes of the original string if (count > 0 && count < orignal.getBytes("GBK").length) { StringBuffer buff = new StringBuffer(); char c; for (int i = 0; i < count; i++) { c = orignal.charAt(i); buff.append(c); if (CutString.isChineseChar(c)) { // When encountering Chinese characters, cut the total number of bytes into 1 --count; } } return buff.toString(); } } return original; } /** * Determine whether the passed string is greater than the specified bytes, if it is greater than the recursive call* until it is less than the specified bytes* @param s * Original string* @param num * Passing in to specify the number of bytes* @return String The intercepted string*/ public static String idgui(String s,int num){ int changdu = s.getBytes().length; if(changdu > num){ s = s.substring(0, s.length() - 1); s = idgui(s,num); } return s; } public static void main(String[] args) throws Exception{ // Original string String s = "I ZWR love you JAVA"; System.out.println("Raw string: " + s + " : The number of bytes is: " + s.getBytes().length); /* System.out.println("Intercept the first 1 digit: " + CutString.substring(s, 1)); System.out.println("Intercept the first 2 digits: " + CutString.substring(s, 2)); System.out.println("Intercept the first 4 bits: " + CutString.substring(s, 4)); */ //System.out.println("Intercept the first 12 bits: " + CutString.substring(s, 12)); System.out.println("Intercept the first 12 bytes: " + CutString.idgui(s, 11)); 
} }

That is everything I have to share about truncating strings that contain Chinese characters by bytes in Java (recommended). I hope it serves as a useful reference, and I hope you will continue to support Wulin.com.