URLs are everywhere, yet many developers do not seem to really understand them: Stack Overflow gets a steady stream of questions about how to build a URL correctly. If you want to know how URL syntax works, the article by Lunatech on the subject is very good.
This article does not cover URL syntax in depth (for the full picture, read RFC 3986, RFC 1738, the Lunatech article mentioned above, and the W3C documentation). Instead, I want to look at how some common libraries handle URLs, and at how to build URLs correctly with url-builder, a Java library we published for exactly that purpose.
Question 1: Java's URLEncoder
Not only is this class poorly named, but the first sentence of its documentation does not match the name either:
Utility class for HTML form encoding.
You may have wondered why it is called URLEncoder in the first place; after reading that line, you are left speechless.
If you have read the Lunatech blog post, you already understand that this class cannot magically turn a URL string into a safe, correctly encoded URL. If you have not done that homework, here is a small example to help.
Suppose you have an HTTP service endpoint http://foo.com/search that accepts a query parameter q, whose value is the string to search for. If you search for the string "You & I", your first attempt at the URL might be http://foo.com/search?q=You & I. Of course this does not work, because & is the separator between name/value pairs in the query string. If you are handed this malformed URL string there is little you can do, because you cannot even parse it correctly.
OK, let's use URLEncoder. URLEncoder.encode("You & I", "UTF-8") returns You+%26+I. %26 decodes to &, and + represents a space in a form-encoded query string, so the resulting URL, http://foo.com/search?q=You+%26+I, works.
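The query-parameter case above can be reproduced with a few lines of Java. This is only a minimal sketch: the class name FormEncodingDemo and the helper encodeQueryValue are made up for illustration, while URLEncoder.encode(String, String) is the standard JDK method discussed in the text.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class FormEncodingDemo {
    // Hypothetical helper: form-encodes a single query parameter value.
    static String encodeQueryValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be supported by every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String url = "http://foo.com/search?q=" + encodeQueryValue("You & I");
        System.out.println(url); // http://foo.com/search?q=You+%26+I
    }
}
```

Note that only the parameter value is passed to the encoder, never the whole URL; encoding the full string would also escape the : and / characters that give the URL its structure.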
Now suppose you want to put the search string into the URL path instead of a query parameter. Obviously, http://foo.com/search/You & I is wrong. Unfortunately, the result of URLEncoder.encode() is wrong too: the path of http://foo.com/search/You+%26+I decodes to /search/You+&+I, because + does not represent a space in a URL path.
URLEncoder may be adequate for some of your use cases. Unfortunately, its overly generic name makes it easy to misuse. The best approach is therefore not to use it at all, so that other developers building on your code are not led into mistakes (unless you really are doing "HTML form encoding").
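To see the path problem concretely, one option is to let java.net.URI parse the URL and return the decoded path. This is a small sketch for illustration; the class name PathPlusDemo and the helper decodedPath are invented here, and only URI.create and URI.getPath come from the JDK.

```java
import java.net.URI;

class PathPlusDemo {
    // Hypothetical helper: parses a URL and returns its percent-decoded path.
    static String decodedPath(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        // Form-encoded text in the path: the + survives decoding as a literal plus sign.
        System.out.println(decodedPath("http://foo.com/search/You+%26+I"));     // /search/You+&+I

        // Percent-encoded space in the path: this is what the path actually needs.
        System.out.println(decodedPath("http://foo.com/search/You%20%26%20I")); // /search/You & I
    }
}
```

Only the second form round-trips to the original search string, which is why a space in a path has to be written as %20 rather than +.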
Question 2: Groovy HttpBuilder and Java's URI
HTTP Builder is an HTTP client library for Groovy.
Creating a normal GET request is very simple:
new HTTPBuilder("http://localhost:18080").request(Method.GET) { uri.path = "/foo" }
This code sends GET /foo HTTP/1.1 to the server (you can verify this by running nc -l -p 18080 and then executing the code).
Now let's try a path that contains a space.
new HTTPBuilder("http://localhost:18080").request(Method.GET) { uri.path = "/foo bar" }
This sends GET /foo%20bar HTTP/1.1, which looks pretty good.
Now suppose one of our path segments is foo/bar. We cannot simply send foo/bar, because that would be treated as two path segments, foo and bar. Let's try foo%2Fbar (replacing / with its percent-encoded form).
new HTTPBuilder('http://localhost:18080').request(Method.GET) { uri.path = '/foo%2Fbar' }
This sends GET /foo%252Fbar HTTP/1.1. That is not good: the % in %2F has been encoded again, so the path decodes to foo%2Fbar instead of foo/bar. The real culprit here is java.net.URI, which HTTPBuilder's URIBuilder class uses.
The uri property exposed in the configuration closure above is of type URIBuilder. When you update its path through uri.path = …, it eventually calls one of URI's constructors. The documentation of that constructor describes the path argument as follows:
If a path is given then it is appended. Any character not in the unreserved, punct, escaped, or other categories (translator's note: these categories are defined in RFC 2396), and not equal to the slash character ('/') or the commercial-at character ('@'), is quoted.
This approach makes little sense, because if the unencoded text contains special characters, it cannot produce a correctly encoded path segment. In other words, it falls for the "I'll just encode this string and then it will be correct" fallacy. If the string is already correctly encoded, there is nothing to do; if it is not, you are out of luck, because the string can no longer be parsed unambiguously. In fact, by saying that / is not escaped, the documentation assumes the path string is already correctly encoded (every / in it is a genuine segment separator) and at the same time not correctly encoded (everything other than / still needs encoding).
It would be great if HTTPBuilder did not use this flawed feature of the URI class. Of course, it would be even better if URI itself were not flawed.
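The behaviour described here can be observed with java.net.URI alone, without Groovy. The sketch below uses the seven-argument URI constructor; which overload HTTPBuilder's URIBuilder actually calls is an assumption on my part, and the class name UriPathQuotingDemo and helper rawPathFor are made up for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriPathQuotingDemo {
    // Hypothetical helper: builds a URI from components and returns the encoded (raw) path.
    static String rawPathFor(String path) {
        try {
            return new URI("http", null, "localhost", 18080, path, null, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rawPathFor("/foo bar"));   // /foo%20bar   (space is quoted)
        System.out.println(rawPathFor("/foo/bar"));   // /foo/bar     (slash is left alone)
        System.out.println(rawPathFor("/foo%2Fbar")); // /foo%252Fbar (percent is quoted again)
    }
}
```

With this constructor there is no input that yields a single segment containing an encoded slash: a literal / is kept as a separator, and a pre-encoded %2F has its % quoted a second time.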
The correct way to do it
We wrote url-builder to help developers assemble all kinds of URLs easily. It follows the encoding rules in the references listed at the beginning of this article, and it provides a fluent API. The following example covers almost every usage scenario:
UrlBuilder.forHost("http", "foo.com")
    .pathSegment("with spaces")
    .pathSegments("path", "with", "varArgs")
    .pathSegment("&=?/")
    .queryParam("fancy + name", "fancy?=value")
    .matrixParam("matrix", "param?")
    .fragment("#?=")
    .toUrlString()
The result is: http://foo.com/with%20spaces/path/with/varArgs/&=%3F%2F;matrix=param%3F?fancy%20%2B%20name=fancy?%3Dvalue#%23?=
This example shows that each part of a URL has its own encoding rules. For example, & and = may appear unencoded in a path segment, while ? and / must be encoded. In a query parameter, on the other hand, = must be encoded but ? need not be, because we are already inside the query string (translator's note: the query string starts at the first ?, so later ? characters can appear in it literally).
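To make the idea of per-component rules more tangible, here is a minimal sketch of a path-segment encoder based on the pchar rule in RFC 3986. It is not url-builder's implementation; the class PathSegmentEncoder and the method encodePathSegment are hypothetical names, and the allowed-character set is my reading of the RFC.

```java
import java.nio.charset.StandardCharsets;

class PathSegmentEncoder {
    // RFC 3986: pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
    // (assumption: input is raw text, so % is always encoded rather than passed through)
    private static final String ALLOWED =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
            + "-._~"          // unreserved
            + "!$&'()*+,;="   // sub-delims
            + ":@";

    // Hypothetical helper: percent-encodes one unencoded path segment as UTF-8.
    static String encodePathSegment(String raw) {
        StringBuilder out = new StringBuilder();
        for (byte b : raw.getBytes(StandardCharsets.UTF_8)) {
            int value = b & 0xFF;
            if (ALLOWED.indexOf((char) value) >= 0) {
                out.append((char) value);                   // safe in a path segment
            } else {
                out.append(String.format("%%%02X", value)); // everything else becomes %XX
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodePathSegment("with spaces")); // with%20spaces
        System.out.println(encodePathSegment("&=?/"));        // &=%3F%2F
        System.out.println(encodePathSegment("foo/bar"));     // foo%2Fbar
    }
}
```

The key design choice is that the function takes a single unencoded segment, so a / inside it is unambiguously data and gets encoded. A library that also supports matrix parameters would presumably need to encode ; inside a segment as well, which this sketch does not do.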
Thank you for reading; I hope this helps.