Encoding and decoding in Java web applications, and fixing garbled Chinese URLs

Author：Eve Cole Update Time：2025-05-06 18:32:02

Encoding & decoding
The following figure shows where transcoding takes place in a Java web application:

The user sends an HTTP request to the server. The parts that need encoding are the URL, the cookies, and the parameters. The server receives the HTTP request, parses it, and decodes the URL, cookies, and parameters. While handling business logic, the server may need to read from databases, local files, or other files on the network, and each of these steps also involves encoding and decoding. When processing is complete, the server encodes the data and sends it to the client, and the browser decodes it and displays it to the user. Encoding and decoding happen many times over this whole process, and garbled text is most likely to appear in the interaction between the client and the server.
In short: the page encodes the data and passes it to the server; the server decodes it, runs its business logic, and encodes the result; the client decodes the result and displays it to the user. The rest of this article therefore explains encoding and decoding in Java web applications from the point of view of the request.
A client can send a request to the server in four ways:
1. Direct access by URL
2. Page link
3. Form GET submission
4. Form POST submission
URL method: If a URL contains only English (ASCII) characters, there is no problem. If it contains Chinese, encoding is involved. How is it encoded? By what rules? And how is it decoded? These questions are answered one by one below. First, look at the components of a URL:

In this URL, the browser will encode path and parameter. To better explain the encoding process, use the following URL
http://127.0.0.1:8080/perbank/I am cm?name=I am cm
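As a rough illustration of the parts involved, the sketch below uses the JDK's java.net.URI to split the example URL into its path and query and to produce an ASCII-only form. The class name UrlParts is hypothetical, and the Chinese characters are written as Unicode escapes (\u6211\u662F) so the source file stays ASCII. Note that URI.toASCIIString() always percent-encodes non-ASCII characters as UTF-8, which is what some browsers do but not necessarily all of them.

```java
import java.net.URI;

final class UrlParts {
    // Example URL from the text, with the two Chinese characters as Unicode escapes
    static final String EXAMPLE =
            "http://127.0.0.1:8080/perbank/\u6211\u662Fcm?name=\u6211\u662Fcm";

    /** Returns the (decoded) path component of the given URL string. */
    static String pathOf(String url) {
        return URI.create(url).getPath();
    }

    /** Returns the (decoded) query component, or null if there is none. */
    static String queryOf(String url) {
        return URI.create(url).getQuery();
    }

    /** Returns the URL with every non-ASCII character percent-encoded as UTF-8. */
    static String asciiForm(String url) {
        return URI.create(url).toASCIIString();
    }

    public static void main(String[] args) {
        System.out.println("path:  " + pathOf(EXAMPLE));
        System.out.println("query: " + queryOf(EXAMPLE));
        System.out.println("ascii: " + asciiForm(EXAMPLE));
    }
}
```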
Enter the address above into the browser's address bar. By inspecting the HTTP message headers, we can see how the browser encodes it. Here is how three browsers encode it:

You can see that the major browsers encode "我是" as follows:

	path part	Query String
Firefox	E6 88 91 E6 98 AF	E6 88 91 E6 98 AF
Chrome	E6 88 91 E6 98 AF	E6 88 91 E6 98 AF
IE	E6 88 91 E6 98 AF	CE D2 CA C7

Comparing these bytes with the encodings described in the previous blog post, we can see that for the path all three browsers (Firefox, Chrome, and IE) use UTF-8, while for the query string Firefox and Chrome use UTF-8 and IE uses GBK. As for the "%" prefix: the URL encoding specification requires the browser to convert non-ASCII characters into bytes in some character encoding and to write each byte as "%" followed by its two-digit hexadecimal value.
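The rule can be sketched in a few lines of Java. The class and method names below are hypothetical, and the sketch simplifies real URL encoding by copying every ASCII character unchanged. It also assumes the JDK in use ships the GBK charset, which standard builds do. The input is the two Chinese characters from the table followed by "cm", written as Unicode escapes.

```java
import java.nio.charset.Charset;

final class PercentEncodingDemo {
    /**
     * Percent-encodes every non-ASCII character of text: the character is
     * converted to bytes in the named charset and each byte is written as %XX.
     * ASCII characters are copied unchanged to keep the sketch short.
     */
    static String percentEncode(String text, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            if (codePoint < 0x80) {
                out.append((char) codePoint);
                continue;
            }
            String ch = new String(Character.toChars(codePoint));
            for (byte b : ch.getBytes(charset)) {
                out.append(String.format("%%%02X", b & 0xff));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "\u6211\u662Fcm";
        System.out.println(percentEncode(text, "UTF-8")); // %E6%88%91%E6%98%AFcm
        System.out.println(percentEncode(text, "GBK"));   // %CE%D2%CA%C7cm
    }
}
```

The two results correspond to the UTF-8 and GBK byte sequences listed in the table.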
Of course, different browsers, different versions of the same browser, different operating systems, and other environmental factors can all produce different encoding results. The case listed above is only one example, and it is too early to draw any conclusions about URL encoding rules from it. Because the encoding of the URI path and the query string may differ across browsers and operating systems, decoding on the server becomes troublesome. Let's see how Tomcat decodes the URL.
The request URL is parsed in the parseRequestLine method of org.apache.coyote.http11.InternalInputBuffer. This method stores the raw URL byte[] in the corresponding property of org.apache.coyote.Request. At this point the URL is still in byte form; the conversion to char happens in the convertURI method of org.apache.catalina.connector.CoyoteAdapter:

    protected void convertURI(MessageBytes uri, Request request) throws Exception {
        ByteChunk bc = uri.getByteChunk();
        int length = bc.getLength();
        CharChunk cc = uri.getCharChunk();
        cc.allocate(length, -1);
        String enc = connector.getURIEncoding(); // Get the URI decoding charset
        if (enc != null) {
            B2CConverter conv = request.getURIConverter();
            try {
                if (conv == null) {
                    conv = new B2CConverter(enc);
                    request.setURIConverter(conv);
                }
            } catch (IOException e) {...}
            if (conv != null) {
                try {
                    conv.convert(bc, cc, cc.getBuffer().length - cc.getEnd());
                    uri.setChars(cc.getBuffer(), cc.getStart(), cc.getLength());
                    return;
                } catch (IOException e) {...}
            }
        }
        // Default encoding: fast conversion
        byte[] bbuf = bc.getBuffer();
        char[] cbuf = cc.getBuffer();
        int start = bc.getStart();
        for (int i = 0; i < length; i++) {
            cbuf[i] = (char) (bbuf[i + start] & 0xff);
        }
        uri.setChars(cbuf, 0, length);
    }

From the code above, we can see that URI decoding first obtains the Connector's URI charset, which is configured in server.xml:

 <Connector URIEncoding="utf-8" />

If it is not defined, the default encoding ISO-8859-1 is used for parsing. (This is the default for the Tomcat code shown here; Tomcat 8 and later default to UTF-8.)
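To see why that fallback garbles Chinese, the following sketch copies the "fast conversion" loop from convertURI into a standalone, hypothetically named class and compares it with the JDK's ISO-8859-1 decoder. Each byte becomes exactly one char, so the six UTF-8 bytes of the two Chinese characters turn into six unrelated Latin-1 characters.

```java
import java.nio.charset.StandardCharsets;

final class Latin1FallbackDemo {
    /** Mirrors the fast-conversion loop: each byte becomes one char in 0..255. */
    static String fastConvert(byte[] bytes) {
        char[] chars = new char[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            chars[i] = (char) (bytes[i] & 0xff);
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // UTF-8 bytes of the two Chinese characters: E6 88 91 E6 98 AF
        byte[] utf8 = "\u6211\u662F".getBytes(StandardCharsets.UTF_8);
        String viaLoop = fastConvert(utf8);
        String viaCharset = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(viaLoop.equals(viaCharset)); // true: same result as ISO-8859-1
        System.out.println(viaLoop.length());           // 6 chars instead of 2
    }
}
```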
For the query string, whether parameters are submitted through GET or POST, they are all stored in Parameters, and we read them with request.getParameter. Decoding happens the first time getParameter is called: getParameter calls the parseParameters method of org.apache.catalina.connector.Request, which decodes the submitted parameters. The following code is only part of parseParameters:

    // Get the encoding (the charset defined in Content-Type)
    String enc = getCharacterEncoding();
    // Whether the body encoding should also apply to the query string
    boolean useBodyEncodingForURI = connector.getUseBodyEncodingForURI();
    if (enc != null) {
        // If the encoding is not null, set the parameter encoding to enc
        parameters.setEncoding(enc);
        if (useBodyEncodingForURI) {
            // If a charset is set, decode the query string with that charset
            parameters.setQueryStringEncoding(enc);
        }
    } else {
        // Set the default decoding method
        parameters.setEncoding(org.apache.coyote.Constants.DEFAULT_CHARACTER_ENCODING);
        if (useBodyEncodingForURI) {
            parameters.setQueryStringEncoding(org.apache.coyote.Constants.DEFAULT_CHARACTER_ENCODING);
        }
    }

From the code above, we can see that the query string is decoded either with the configured charset or with the default, ISO-8859-1. Note that the charset here is the one defined in the Content-Type of the HTTP header. For that charset to apply to the query string, the Connector must also be configured as follows:

 <Connector URIEncoding="UTF-8" useBodyEncodingForURI="true"/>
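The decision in the excerpt can be restated as a small model. This is a simplified sketch, not Tomcat code: the class and method names are hypothetical, and it assumes that when useBodyEncodingForURI is false the query string is decoded with the connector's URIEncoding (or the default when that is not configured), which the excerpt itself does not show.

```java
final class QueryCharsetModel {
    // Default named in the text (DEFAULT_CHARACTER_ENCODING in the excerpt)
    static final String DEFAULT = "ISO-8859-1";

    /** Charset used for the request body, i.e. POST parameters. */
    static String bodyCharset(String contentTypeCharset) {
        return contentTypeCharset != null ? contentTypeCharset : DEFAULT;
    }

    /**
     * Charset used for the query string.
     * Assumption: with useBodyEncodingForURI=false the connector's URIEncoding
     * applies, falling back to the default when it is not configured.
     */
    static String queryStringCharset(String contentTypeCharset, String uriEncoding,
                                     boolean useBodyEncodingForURI) {
        if (useBodyEncodingForURI) {
            return bodyCharset(contentTypeCharset);
        }
        return uriEncoding != null ? uriEncoding : DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println(queryStringCharset("UTF-8", null, true));  // UTF-8
        System.out.println(queryStringCharset("UTF-8", null, false)); // ISO-8859-1
        System.out.println(queryStringCharset(null, "UTF-8", false)); // UTF-8
    }
}
```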

The part above describes the encoding and decoding of URL requests in detail. In practice, though, we more often submit data through forms.
Form GET
We know that submitting data directly in a URL easily causes garbled text, so we tend to use forms. When the user submits a form, the browser encodes the data with the page's configured encoding and sends it to the server. Data submitted through GET is appended to the URL (it can be regarded as the query string), so URIEncoding governs how Tomcat decodes it. Tomcat decodes according to the configured URIEncoding, and falls back to the default ISO-8859-1 if none is set. If the page encoding is UTF-8 and URIEncoding is either not UTF-8 or not set, the server produces garbled text when decoding. In that case we can usually recover the correct data with new String(request.getParameter("name").getBytes("iso-8859-1"), "utf-8").
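The recovery trick works because ISO-8859-1 maps every byte to exactly one char, so the original bytes can be restored without loss. The sketch below simulates the mis-decoding and the recovery outside a servlet container; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class GetParamRecovery {
    /** Simulates a server that decodes UTF-8 bytes as ISO-8859-1. */
    static String misdecode(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
    }

    /** Undoes the mis-decoding: restore the raw bytes, then decode as UTF-8. */
    static String recover(String misdecoded) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u6211\u662Fcm";
        String garbled = misdecode(original);
        System.out.println(recover(garbled).equals(original)); // true
        // Applied to text that was already decoded correctly, the trick is
        // destructive: the Chinese characters cannot be represented in
        // ISO-8859-1 and become question marks.
        System.out.println(recover(original)); // ??cm
    }
}
```

The second call shows the main caveat: the conversion should only be applied when the container really did decode as ISO-8859-1, otherwise it destroys correct data.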
Form POST
For POST, the encoding is also determined by the page, that is, by contentType. When a form is submitted by clicking the submit button, the browser first encodes the POST form parameters using the charset from contentType and sends them to the server. The server also decodes using the charset set in contentType (this is where it differs from GET). This means that parameters submitted through a POST form generally do not come out garbled. We can also set the character encoding ourselves with request.setCharacterEncoding(charset), which must be called before any parameter is read.
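As a minimal sketch of where that charset comes from, the hypothetical helper below extracts the charset parameter from a Content-Type header value and falls back to ISO-8859-1 when none is present. It is not the container's actual parser, only an illustration of the lookup.

```java
import java.util.Locale;

final class ContentTypeCharset {
    static final String DEFAULT = "ISO-8859-1";

    /** Returns the charset parameter of a Content-Type value, or the default. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return DEFAULT;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? DEFAULT : value;
            }
        }
        return DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("application/x-www-form-urlencoded; charset=UTF-8"));
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
    }
}
```

Browsers often send form submissions without a charset parameter, in which case a lookup like this falls back to the default. That is one reason request.setCharacterEncoding is frequently called explicitly.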

Solving garbled Chinese in URLs
We mainly send requests to the server in two ways: URLs and forms. Forms generally do not cause garbled text; the problems are mainly with URLs. From the previous posts we know that the way a URL request gets encoded on its way to the server is really confusing: different operating systems, different browsers, and different page character sets lead to completely different encoding results. It would be daunting for programmers to account for every combination. Is there a way to ensure that the client uses only one encoding when it sends a request to the server?
Yes! Here are the main approaches.
JavaScript
Encoding in JavaScript gives the browser no chance to intervene: encode the data first, send the request to the server, and then decode it on the server. To use this approach we need to know JavaScript's three encoding functions: escape(), encodeURI(), and encodeURIComponent().
escape
The specified string is encoded using the SIO Latin character set. All non-ASCII characters are encoded as strings in %xx format, where xx represents the hexadecimal number corresponding to the character in the character set. For example, the encoding corresponding to the format is %20. Its corresponding decoding method is unescape().

In fact, escape() cannot be used directly for URL encoding, its real function is to return a character's Unicode-encoded value. For example, the result of "I am cm" above is %u6211%u662Fcm, where the corresponding encoding of "I" is 6211, the encoding of "Yes" is 662F, and the encoding of "cm" is cm.
Note that escape() is not encoded by "+". But we know that when a web page submits a form, if there are spaces, it will be converted into + characters. When the server processes data, the + sign will be processed into spaces. Therefore, be careful when using it.
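To make the behavior concrete, here is a rough Java emulation of escape(). The class name is hypothetical, and the set of characters left unescaped (letters, digits, and @ * _ + - . /) is taken from the ECMAScript description of escape(); treat the whole thing as an illustrative sketch rather than a reference implementation.

```java
final class JsEscapeSketch {
    // Characters that escape() leaves unchanged
    private static final String UNESCAPED =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789@*_+-./";

    /** Emulates JavaScript escape(): %XX below 256, %uXXXX for everything else. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (UNESCAPED.indexOf(c) >= 0) {
                out.append(c);
            } else if (c < 256) {
                out.append(String.format("%%%02X", (int) c));
            } else {
                out.append(String.format("%%u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("\u6211\u662Fcm")); // %u6211%u662Fcm
        System.out.println(escape("a b+c"));          // a%20b+c
    }
}
```

The second line of output shows the pitfall mentioned above: the space is escaped but the "+" is passed through, so a server that treats "+" as a space will alter the data.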
encodeURI
Encodes an entire URL and outputs the encoded string in UTF-8. However, encodeURI does not encode ASCII letters and digits, nor certain special characters such as ! @ # $ & * ( ) = : / ; ? + '.

encodeURIComponent
Converts a URI string into an escaped string using UTF-8. encodeURIComponent is more thorough than encodeURI: it also encodes the symbols ; / ? : @ & = + $ , # that encodeURI() leaves alone. For that reason encodeURIComponent is meant for encoding the individual components of a URL, not the entire URL. The corresponding decoding function is decodeURIComponent().
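The difference between the two functions comes down to which ASCII characters they leave alone. The following Java sketch emulates both under that view; the class name is hypothetical, the character sets follow the ECMAScript definitions, and unlike the JavaScript originals it does not reject malformed surrogate pairs.

```java
import java.nio.charset.StandardCharsets;

final class JsUriEncodeSketch {
    private static final String ALNUM =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    private static final String MARKS = "-_.!~*'()";
    private static final String RESERVED = ";/?:@&=+$,#";

    /** Emulates encodeURI(): reserved URL characters are kept as they are. */
    static String encodeURI(String s) {
        return encode(s, ALNUM + MARKS + RESERVED);
    }

    /** Emulates encodeURIComponent(): reserved URL characters are encoded too. */
    static String encodeURIComponent(String s) {
        return encode(s, ALNUM + MARKS);
    }

    // Percent-encodes, as UTF-8, every character that is not in the keep set.
    private static String encode(String s, String keep) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 0x80 && keep.indexOf(cp) >= 0) {
                out.append((char) cp);
                continue;
            }
            String ch = new String(Character.toChars(cp));
            for (byte b : ch.getBytes(StandardCharsets.UTF_8)) {
                out.append(String.format("%%%02X", b & 0xff));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeURI("http://h/p?name=\u6211\u662Fcm"));
        System.out.println(encodeURIComponent("a&b=\u6211"));
    }
}
```

With this sketch, the whole-URL call keeps ":", "/", "?", and "=" intact, while the component call turns "&" and "=" into %26 and %3D, which is why it suits individual parameter values.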
Of course, we usually use encodeURI for encoding, and the commonly cited technique of encoding twice in JavaScript and decoding on the server uses it as well. There are two ways to solve the problem in JavaScript: transcoding once and transcoding twice.
Transcoding once
JavaScript transcoding:

 var url = '<s:property value="webPath" />/ShowMoblieQRCode.servlet?name=I am cm'; window.location.href = encodeURI(url);

Transcoded URL: http://127.0.0.1:8080/perbank/ShowMoblieQRCode.servlet?name=%E6%88%91%E6%98%AFcm
Backend processing:

 String name = request.getParameter("name"); System.out.println("Foreground incoming parameters: " + name); name = new String(name.getBytes("ISO-8859-1"),"UTF-8"); System.out.println("Decoded parameters: " + name);

Output result:
Incoming parameters in the front desk: ??????cm
After decoding parameters: I am cm
Secondary transcoding
javascript

 var url = '<s:property value="webPath" />/ShowMoblieQRCode.servlet?name=I am cm'; window.location.href = encodeURI(encodeURI(url));

Transcoded url: http://127.0.0.1:8080/perbank/ShowMoblieQRCode.servlet?name=%25E6%2588%2591%25E6%2598%25AFcm
Backend processing:

 String name = request.getParameter("name"); System.out.println("Foreground incoming parameters: " + name); name = URLDecoder.decode(name,"UTF-8"); System.out.println("Decoded parameters: " + name);

Output result:
Front-end incoming parameters: E68891E698AFcm
After decoding parameters: I am cm
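Encoding twice works because, after the second encoding, the value contains only ASCII characters, so the container's automatic first decode gives the same result whatever charset it uses. The sketch below simulates the full round trip with the JDK's URLEncoder and URLDecoder. The class and method names are hypothetical, the Charset overloads require Java 10 or later, and URLEncoder is only a stand-in for encodeURI: the two agree for this input but differ for spaces and some punctuation.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class DoubleEncodingDemo {
    /** Client side: percent-encode the value twice as UTF-8. */
    static String encodeTwice(String value) {
        String once = URLEncoder.encode(value, StandardCharsets.UTF_8);
        return URLEncoder.encode(once, StandardCharsets.UTF_8);
    }

    /** Container side: the automatic first decode, here with ISO-8859-1. */
    static String containerDecode(String raw) {
        return URLDecoder.decode(raw, StandardCharsets.ISO_8859_1);
    }

    /** Application side: the explicit second decode as UTF-8. */
    static String applicationDecode(String param) {
        return URLDecoder.decode(param, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sent = encodeTwice("\u6211\u662Fcm");
        System.out.println(sent);                  // %25E6%2588%2591%25E6%2598%25AFcm
        String param = containerDecode(sent);
        System.out.println(param);                 // %E6%88%91%E6%98%AFcm
        String name = applicationDecode(param);
        System.out.println(name.equals("\u6211\u662Fcm")); // true
    }
}
```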

Filter
A filter can be used in two ways: the first sets the request encoding, and the second performs the decoding directly in the filter.
Filter 1
This filter directly sets the encoding format of the request.

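    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;

    public class CharacterEncoding implements Filter {
        private FilterConfig config;
        String encoding = null;

        public void destroy() {
            config = null;
        }

        public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
                throws IOException, ServletException {
            // Only override the request encoding when one was configured
            if (encoding != null) {
                request.setCharacterEncoding(encoding);
            }
            chain.doFilter(request, response);
        }

        public void init(FilterConfig config) throws ServletException {
            this.config = config;
            // Get configuration parameters
            String str = config.getInitParameter("encoding");
            if (str != null) {
                encoding = str;
            }
        }
    }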

Configuration:

    <!-- Chinese filter configuration -->
    <filter>
        <filter-name>chineseEncoding</filter-name>
        <filter-class>com.test.filter.CharacterEncoding</filter-class>
        <init-param>
            <param-name>encoding</param-name>
            <param-value>utf-8</param-value>
        </init-param>
    </filter>
    <filter-mapping>
        <filter-name>chineseEncoding</filter-name>
        <url-pattern>/*</url-pattern>
    </filter-mapping>

Filter 2
In its doFilter method, this filter decodes the parameters directly and then stores the decoded values on the request as attributes, so downstream code reads them with request.getAttribute rather than request.getParameter.

    import java.io.IOException;
    import java.io.UnsupportedEncodingException;
    import java.util.Enumeration;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class CharacterEncoding implements Filter {
        protected FilterConfig filterConfig;
        String encoding = null;

        public void destroy() {
            this.filterConfig = null;
        }

        /**
         * Initialize
         */
        public void init(FilterConfig filterConfig) {
            this.filterConfig = filterConfig;
        }

        /**
         * Convert inStr into its UTF-8 form
         *
         * @param inStr input string (decoded as ISO-8859-1)
         * @return the same bytes decoded as UTF-8
         * @throws UnsupportedEncodingException
         */
        private String toUTF(String inStr) throws UnsupportedEncodingException {
            String outStr = "";
            if (inStr != null) {
                outStr = new String(inStr.getBytes("iso-8859-1"), "UTF-8");
            }
            return outStr;
        }

        /**
         * Garbled Chinese filtering
         */
        public void doFilter(ServletRequest servletRequest, ServletResponse servletResponse,
                FilterChain chain) throws IOException, ServletException {
            HttpServletRequest request = (HttpServletRequest) servletRequest;
            HttpServletResponse response = (HttpServletResponse) servletResponse;

            // Get the request method (1. post or 2. get) and handle each one differently
            String method = request.getMethod();
            // 1. For requests submitted with post, set the encoding to UTF-8 directly
            if (method.equalsIgnoreCase("post")) {
                try {
                    request.setCharacterEncoding("UTF-8");
                } catch (UnsupportedEncodingException e) {
                    e.printStackTrace();
                }
            }
            // 2. For requests submitted with get
            else {
                // Get the set of parameter names submitted by the client
                Enumeration<String> paramNames = request.getParameterNames();
                // Traverse the set to get the name and values of each parameter
                while (paramNames.hasMoreElements()) {
                    String name = paramNames.nextElement(); // Parameter name
                    String values[] = request.getParameterValues(name); // Values for that name
                    // If the parameter has values
                    if (values != null) {
                        // Traverse the parameter values
                        for (int i = 0; i < values.length; i++) {
                            try {
                                // Call toUTF(values[i]) on each value in turn to convert its character encoding
                                String vlustr = toUTF(values[i]);
                                values[i] = vlustr;
                            } catch (UnsupportedEncodingException e) {
                                e.printStackTrace();
                            }
                        }
                        // Store the values on the request as an attribute
                        request.setAttribute(name, values);
                    }
                }
            }
            // Set the response content type with a character set that supports Chinese
            response.setContentType("text/html;charset=UTF-8");

            // Continue with the next filter; if there is none, the request itself is executed
            chain.doFilter(request, response);
        }
    }

Configuration:

    <!-- Chinese filter configuration -->
    <filter>
        <filter-name>chineseEncoding</filter-name>
        <filter-class>com.test.filter.CharacterEncoding</filter-class>
    </filter>
    <filter-mapping>
        <filter-name>chineseEncoding</filter-name>
        <url-pattern>/*</url-pattern>
    </filter-mapping>

Other
1. Set pageEncoding and contentType

 <%@ page language="java" contentType="text/html;charset=UTF-8" pageEncoding="UTF-8"%>

2. Set Tomcat's URIEncoding
By default, the Tomcat server decodes the request URL with ISO-8859-1. This is controlled by the URIEncoding parameter, so we only need to add URIEncoding="utf-8" to the <Connector> tag in Tomcat's server.xml.