Note: The code below is based on httpclient 4.5.2.
Fetching a web page with a GET request is an easy task with Java's HttpClient:
    public static String get(String url) {
        CloseableHttpResponse response = null;
        BufferedReader in = null;
        String result = "";
        try {
            // A brand-new client is created on every call
            CloseableHttpClient httpClient = HttpClients.createDefault();
            HttpGet httpGet = new HttpGet(url);
            response = httpClient.execute(httpGet);
            in = new BufferedReader(new InputStreamReader(response.getEntity().getContent()));
            StringBuilder sb = new StringBuilder();
            String line;
            String NL = System.getProperty("line.separator");
            // Read the response body line by line
            while ((line = in.readLine()) != null) {
                sb.append(line).append(NL);
            }
            in.close();
            result = sb.toString();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (null != response) response.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return result;
    }

The method above also works when GET requests are issued from multiple threads. However, it creates a new HttpClient instance every time get is called, and each instance serves a single request before being discarded. This is obviously not an optimal implementation.
HttpClient provides its own solution for multi-threaded requests; see the "Pooling connection manager" section of the official documentation. Multi-threaded requests in HttpClient are built on its connection pool, and the key class is PoolingHttpClientConnectionManager, which manages that pool. It has two key methods: setMaxTotal sets the maximum number of connections in the whole pool, and setDefaultMaxPerRoute sets the default maximum number of connections per route. A third method, setMaxPerRoute, sets the maximum number of connections for one particular site, like this:
    HttpHost host = new HttpHost("localhost", 80);
    cm.setMaxPerRoute(new HttpRoute(host), 50);

Following the documentation, we adjust our GET implementation slightly:
    package com.zhyea.robin;

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class HttpUtil {

        // One client, backed by a connection pool, shared by all threads
        private static CloseableHttpClient httpClient;

        static {
            PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
            cm.setMaxTotal(200);           // cap for the whole pool
            cm.setDefaultMaxPerRoute(50);  // default cap for each route
            httpClient = HttpClients.custom().setConnectionManager(cm).build();
        }

        public static String get(String url) {
            CloseableHttpResponse response = null;
            BufferedReader in = null;
            String result = "";
            try {
                HttpGet httpGet = new HttpGet(url);
                response = httpClient.execute(httpGet);
                in = new BufferedReader(new InputStreamReader(response.getEntity().getContent()));
                StringBuilder sb = new StringBuilder();
                String line;
                String NL = System.getProperty("line.separator");
                while ((line = in.readLine()) != null) {
                    sb.append(line).append(NL);
                }
                in.close();
                result = sb.toString();
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                try {
                    // Closing the response releases the connection
                    if (null != response) response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            return result;
        }

        public static void main(String[] args) {
            System.out.println(get("https://www.baidu.com/"));
        }
    }

That's about it. Personally, though, I prefer HttpClient's fluent API. For example, the GET request we just implemented can be written as simply as this:
    package com.zhyea.robin;

    import org.apache.http.client.fluent.Request;

    import java.io.IOException;

    public class HttpUtil {

        public static String get(String url) {
            String result = "";
            try {
                result = Request.Get(url)
                        .connectTimeout(1000)
                        .socketTimeout(1000)
                        .execute().returnContent().asString();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return result;
        }

        public static void main(String[] args) {
            System.out.println(get("https://www.baidu.com/"));
        }
    }

All we have to do is replace the previous httpclient dependency with the fluent-hc dependency:
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>fluent-hc</artifactId>
        <version>4.5.2</version>
    </dependency>

The fluent implementation uses PoolingHttpClientConnectionManager out of the box. It sets maxTotal to 200 and defaultMaxPerRoute to 100:
    CONNMGR = new PoolingHttpClientConnectionManager(sfr);
    CONNMGR.setDefaultMaxPerRoute(100);
    CONNMGR.setMaxTotal(200);

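To make the interplay of the two limits concrete, here is a toy model of the bookkeeping only, written with plain java.util.concurrent semaphores. It is not how HttpClient implements its pool: the class ToyPoolLimits and its methods are hypothetical names invented for this illustration, and the route strings are made up. The sketch only shows that a lease has to fit under both the per-route cap and the total cap.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Toy model of maxTotal / defaultMaxPerRoute bookkeeping. NOT HttpClient's real implementation. */
public class ToyPoolLimits {
    private final Semaphore total;
    private final int maxPerRoute;
    private final Map<String, Semaphore> perRoute = new ConcurrentHashMap<>();

    public ToyPoolLimits(int maxTotal, int defaultMaxPerRoute) {
        this.total = new Semaphore(maxTotal);
        this.maxPerRoute = defaultMaxPerRoute;
    }

    /**
     * Tries to lease one slot for the given route.
     * Returns false as soon as either cap is reached (this toy never waits).
     */
    public boolean tryLease(String route) {
        Semaphore routeSlots = perRoute.computeIfAbsent(route, r -> new Semaphore(maxPerRoute));
        if (!routeSlots.tryAcquire()) {
            return false;              // per-route cap reached
        }
        if (!total.tryAcquire()) {
            routeSlots.release();      // roll back: total cap reached
            return false;
        }
        return true;
    }

    /** Gives back a slot that was previously leased for the route. */
    public void release(String route) {
        perRoute.get(route).release();
        total.release();
    }

    public static void main(String[] args) {
        // Same numbers as the fluent defaults quoted above: 200 total, 100 per route
        ToyPoolLimits pool = new ToyPoolLimits(200, 100);
        int a = 0, b = 0, c = 0;
        for (int i = 0; i < 150; i++) if (pool.tryLease("host-a:80")) a++;
        for (int i = 0; i < 150; i++) if (pool.tryLease("host-b:80")) b++;
        for (int i = 0; i < 150; i++) if (pool.tryLease("host-c:80")) c++;
        System.out.println(a + " " + b + " " + c); // prints: 100 100 0
    }
}
```

In this model, two busy hosts are enough to use up all 200 slots, and a third host gets nothing until something is released.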
The one annoyance is that Executor offers no setter for maxTotal or defaultMaxPerRoute. In most cases the defaults are enough. If they really are not, you can supply your own client: Executor.newInstance also accepts an HttpClient, so you can pass in one built on your own connection manager and run the GET request through that Executor directly:

    // httpClient: a client built on your own PoolingHttpClientConnectionManager
    Executor.newInstance(httpClient).execute(Request.Get(url))
            .returnContent().asString();

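Since the whole point of pooling is to share one client across threads, here is a minimal sketch of how such calls might be driven from a fixed thread pool. The class ParallelFetch and its fetchAll method are hypothetical names for this illustration, and the fetcher parameter stands in for a method such as HttpUtil.get. The main method uses a fake fetcher, so the sketch runs without a network or the HttpClient jars.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper: runs a fetch function for many URLs on a fixed thread pool. */
public class ParallelFetch {

    /** Returns one result per URL, in the same order as the input list. */
    public static List<String> fetchAll(List<String> urls,
                                        Function<String, String> fetcher,
                                        int threads) {
        ExecutorService workers = Executors.newFixedThreadPool(threads);
        try {
            // Submit every URL first so the requests run concurrently
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                pending.add(workers.submit(task));
            }
            // Then collect the results in input order
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                try {
                    results.add(future.get());
                } catch (ExecutionException e) {
                    results.add("");   // same convention as get(): empty string on failure
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    results.add("");
                }
            }
            return results;
        } finally {
            workers.shutdown();
        }
    }

    public static void main(String[] args) {
        // Fake fetcher; in real use this could be HttpUtil::get
        Function<String, String> fake = url -> "body of " + url;
        List<String> urls = Arrays.asList("http://a.example/", "http://b.example/");
        for (String body : fetchAll(urls, fake, 4)) {
            System.out.println(body);
        }
    }
}
```

With a pooled client behind the fetcher, it makes little sense to use many more threads per site than the per-route limit allows, because the extra threads would only wait for a free connection.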
That's all!