I have always been curious about the web, and I used to think about writing a crawler, but I was too lazy to implement one: it felt like a troublesome project where a single small mistake could cost a lot of debugging time.
Later I decided that since I had already promised myself I would do it, I should just get a first version working: start simple, add features gradually as time allows, and optimize the code along the way.
Below is a simple implementation that fetches a specified web page and saves it to disk. There are actually several ways to do this, and I will walk through a few of them.
URLConnection crawling implementation
package html;

import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

public class Spider {

    public static void main(String[] args) {
        String filepath = "d:/124.html";
        String url_str = "http://www.hao123.com/";
        URL url = null;
        try {
            url = new URL(url_str);
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }
        String charset = "utf-8";
        int sec_cont = 1000;
        try {
            URLConnection url_con = url.openConnection();
            url_con.setReadTimeout(10 * sec_cont);
            url_con.setRequestProperty("User-Agent",
                    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)");
            InputStream htm_in = url_con.getInputStream();
            String htm_str = InputStream2String(htm_in, charset);
            saveHtml(filepath, htm_str);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Method: saveHtml
     * Description: save a String to a file
     * @param filepath file path to save to
     * @param str      string to be saved
     */
    public static void saveHtml(String filepath, String str) {
        try {
            OutputStreamWriter outs = new OutputStreamWriter(
                    new FileOutputStream(filepath, true), "utf-8");
            outs.write(str);
            System.out.print(str);
            outs.close();
        } catch (IOException e) {
            System.out.println("Error at save html...");
            e.printStackTrace();
        }
    }

    /**
     * Method: InputStream2String
     * Description: read an InputStream into a String
     * @param in_st   input stream to be converted
     * @param charset encoding of the stream
     * @throws IOException if an error occurred
     */
    public static String InputStream2String(InputStream in_st, String charset) throws IOException {
        BufferedReader buff = new BufferedReader(new InputStreamReader(in_st, charset));
        StringBuffer res = new StringBuffer();
        String line = "";
        while ((line = buff.readLine()) != null) {
            res.append(line);
        }
        return res.toString();
    }
}

During the implementation the most troublesome issue was garbled Chinese characters in the crawled pages, which comes down to decoding the response with the wrong character set.
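Those garbled characters usually show up when the page is not actually encoded as utf-8 (many Chinese sites serve GBK or GB2312). One way to deal with it is to read the declared charset from the Content-Type response header before decoding the stream. The detectCharset helper below is only a rough sketch of that idea and is not part of the original program; the result can be passed to InputStream2String instead of the hard-coded "utf-8":

    // Rough sketch (hypothetical helper, not from the original code):
    // guess the page encoding from the Content-Type response header,
    // e.g. "text/html; charset=gb2312", and fall back to utf-8 otherwise.
    public static String detectCharset(URLConnection url_con) {
        String contentType = url_con.getContentType();
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                part = part.trim();
                if (part.toLowerCase().startsWith("charset=")) {
                    return part.substring("charset=".length());
                }
            }
        }
        return "utf-8"; // assumed default when the server does not declare one
    }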
HttpClient crawling implementation
I ran into quite a few problems when crawling pages with HttpClient. First, two different things go by the name HttpClient: one said to be built into Sun's JDK and one that is an Apache open-source project. The Sun one does not actually seem to be built in, so I did not pursue it and used the Apache project instead (every mention of HttpClient below refers to the Apache open-source version). Second, the latest release differs from older ones: starting with HttpClient 4.x the packages to import have changed, and much of the sample code found on the Internet targets HttpClient 3.x, so if you use the latest version it is better to consult the official documentation.
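To make the difference concrete, the snippet below shows roughly what the same page fetch looks like against the old 3.x API (package org.apache.commons.httpclient). It is only a sketch to illustrate why 3.x examples found online will not compile against 4.x; the class name Spider3x is made up here, and the 4.x version actually used in this article follows later.

package html;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

public class Spider3x {
    public static void main(String[] args) throws Exception {
        // HttpClient 3.x style: create a client, then execute a GetMethod
        HttpClient client = new HttpClient();
        GetMethod get = new GetMethod("http://www.hao123.com");
        int status = client.executeMethod(get);      // returns the HTTP status code
        System.out.println(status);
        String body = get.getResponseBodyAsString();  // whole page as one String
        System.out.println(body.length());
        get.releaseConnection();                      // give the connection back
    }
}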
I am using Eclipse, so the build environment needs to be configured to import the required jar packages.
First, download HttpClient from http://hc.apache.org/downloads.cgi; I am using HttpClient 4.2.
Then unzip it and find commons-codec-1.6.jar, commons-logging-1.1.jar, httpclient-4.2.5.jar, and httpcore-4.2.4.jar in the /lib folder (version numbers vary with the download; there are other jar files in there that I do not need for now, so only the required ones are imported).
Finally, add the jar files above to the classpath: right-click the project => Build Path => Configure Build Path => Add External JARs..., then add the packages above.
Another method is to directly copy the package above to the lib folder under the project folder.
Here is the implementation code:
package html;

import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class SpiderHttpClient {

    public static void main(String[] args) throws Exception {
        String url_str = "http://www.hao123.com";
        String charset = "utf-8";
        String filepath = "d:/125.html";

        HttpClient hc = new DefaultHttpClient();
        HttpGet hg = new HttpGet(url_str);
        HttpResponse response = hc.execute(hg);
        HttpEntity entity = response.getEntity();
        InputStream htm_in = null;
        if (entity != null) {
            System.out.println(entity.getContentLength());
            htm_in = entity.getContent();
            String htm_str = InputStream2String(htm_in, charset);
            saveHtml(filepath, htm_str);
        }
    }

    /**
     * Method: saveHtml
     * Description: save a String to a file
     * @param filepath file path to save to
     * @param str      string to be saved
     */
    public static void saveHtml(String filepath, String str) {
        try {
            OutputStreamWriter outs = new OutputStreamWriter(
                    new FileOutputStream(filepath, true), "utf-8");
            outs.write(str);
            outs.close();
        } catch (IOException e) {
            System.out.println("Error at save html...");
            e.printStackTrace();
        }
    }

    /**
     * Method: InputStream2String
     * Description: read an InputStream into a String
     * @param in_st   input stream to be converted
     * @param charset encoding of the stream
     * @throws IOException if an error occurred
     */
    public static String InputStream2String(InputStream in_st, String charset) throws IOException {
        BufferedReader buff = new BufferedReader(new InputStreamReader(in_st, charset));
        StringBuffer res = new StringBuffer();
        String line = "";
        while ((line = buff.readLine()) != null) {
            res.append(line);
        }
        return res.toString();
    }
}

That is all for this article. I hope it is helpful for your learning.
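As a final note, HttpClient 4.x also ships org.apache.http.util.EntityUtils, which can replace the hand-written InputStream2String step. The helper below is a minimal sketch of that alternative (the method name entityToString is made up here), assuming the same HttpEntity as in the code above:

import org.apache.http.HttpEntity;
import org.apache.http.util.EntityUtils;

// Sketch only: decode an HttpEntity to a String in one call.
// EntityUtils.toString reads the whole body and consumes the stream,
// so the manual BufferedReader loop is no longer needed.
public static String entityToString(HttpEntity entity) throws Exception {
    return EntityUtils.toString(entity, "utf-8");
}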