In the previous installment, we decided to build a Zhihu crawler in Java; this time, we will look at how to fetch the content of a web page with code.
First of all, if you have no experience with HTML, CSS, JS, or AJAX, I recommend spending a little time with the W3C tutorials first.
Speaking of HTML, this brings up the distinction between GET and POST requests.
If you are unfamiliar with it, the W3C article "GET vs. POST" is a good read, so I won't go into details here.
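To make the distinction concrete, here is a minimal sketch in Java. The endpoint `http://example.com/search` and the parameter `q=java` are made up for illustration; the point is only that GET carries its parameters in the URL's query string, while POST carries them in the request body.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class GetVsPost {
    // GET: parameters ride along in the URL's query string
    static String buildGetUrl(String base, String query) {
        return base + "?" + query;
    }

    // POST: parameters travel in the request body instead of the URL
    static void sendPost(String urlStr, String body) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlStr).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true); // allows us to write a request body
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        conn.getResponseCode(); // actually fires the request
    }

    public static void main(String[] args) {
        // A GET for "q=java" is just a URL like this:
        System.out.println(buildGetUrl("http://example.com/search", "q=java"));
    }
}
```

The POST helper is not called in main because it needs a live server; it is only there to show where the body goes.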
Next, we need Java to fetch the content of a web page.
This is where Baidu comes in handy.
That's right: no longer just the site we ping to check our connection, it is about to become our crawler's guinea pig!
Let’s take a look at Baidu’s homepage first:
I believe everyone knows that a page like this is the result of the joint work of HTML and CSS.
We right-click the page in the browser and select "View page source code":
That's right, it's something like this. This is the source code of the Baidu page.
Our next task is to use our crawler to get the same thing.
Let’s first look at a simple source code:
import java.io.*;
import java.net.*;

public class Main {
    public static void main(String[] args) {
        // Define the link to be visited
        String url = "http://www.baidu.com";
        // Define a string to store the web page content
        String result = "";
        // Define a buffered character input stream
        BufferedReader in = null;
        try {
            // Convert the string into a URL object
            URL realUrl = new URL(url);
            // Initialize a connection to that URL
            URLConnection connection = realUrl.openConnection();
            // Start the actual connection
            connection.connect();
            // Initialize a BufferedReader to read the URL's response
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream()));
            // Temporarily stores each line as it is read
            String line;
            while ((line = in.readLine()) != null) {
                // Append each line read to result
                result += line;
            }
        } catch (Exception e) {
            System.out.println("Exception occurred when sending GET request!" + e);
            e.printStackTrace();
        } finally {
            // Use finally to close the input stream
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        System.out.println(result);
    }
}
The above is a Java main method that simulates a GET request to Baidu.
You can run it to see the results:
Aha, it’s exactly the same as what we saw in the browser earlier. At this point, the simplest crawler is ready.
But a big pile of output like that is not all what we want. How do we pick out just the part we need?
Take Baidu's paw-print logo as an example.
Our temporary requirement:
Get the image link of the Baidu logo.
First, the browser way.
Right-click the image and select "Inspect Element" (Firefox, Chrome, and IE11 all have this feature, under slightly different names):
Aha, there is the poor img tag, buried under a pile of divs.
Its src attribute is the link to the image.
So how do we do this in Java?
A note in advance: to keep the demonstration simple, the code is not split into separate classes; please bear with that.
First, let's wrap the previous code into a sendGet function:
import java.io.*;
import java.net.*;

public class Main {
    static String sendGet(String url) {
        // Define a string to store the web page content
        String result = "";
        // Define a buffered character input stream
        BufferedReader in = null;
        try {
            // Convert the string into a URL object
            URL realUrl = new URL(url);
            // Initialize a connection to that URL
            URLConnection connection = realUrl.openConnection();
            // Start the actual connection
            connection.connect();
            // Initialize a BufferedReader to read the URL's response
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream()));
            // Temporarily stores each line as it is read
            String line;
            while ((line = in.readLine()) != null) {
                // Append each line read to result
                result += line;
            }
        } catch (Exception e) {
            System.out.println("Exception occurred when sending GET request!" + e);
            e.printStackTrace();
        } finally {
            // Use finally to close the input stream
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Define the link to be visited
        String url = "http://www.baidu.com";
        // Access the link and get the page content
        String result = sendGet(url);
        System.out.println(result);
    }
}
This looks a little tidier, please forgive my obsessive-compulsive disorder.
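One aside while we are tidying up: `result += line` in a loop re-copies the whole accumulated string on every iteration, which gets slow on large pages. A StringBuilder avoids that. Here is a small sketch of the same reading loop (fed from a StringReader so it runs without the network):

```java
import java.io.BufferedReader;
import java.io.StringReader;

public class BuilderDemo {
    // Same line-by-line loop as sendGet, but appending to a StringBuilder,
    // which grows in place instead of copying the string each time
    static String readAll(BufferedReader in) throws Exception {
        StringBuilder result = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            result.append(line);
        }
        return result.toString();
    }

    public static void main(String[] args) throws Exception {
        // Simulate a two-line "page" without touching the network
        BufferedReader in = new BufferedReader(new StringReader("hello\nworld"));
        System.out.println(readAll(in)); // prints helloworld
    }
}
```

For a one-off demo the difference hardly matters, but it is worth knowing for real crawls.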
The next task is to find the link to the picture from a lot of things obtained.
The first method that comes to mind is to search the page source string result with indexOf.
Yes, that approach can get the job done: indexOf("src") finds the starting position, and from there you scan forward for the closing quote to get the ending position.
But we can't rely on it forever: plain string searching is fine for a first outing, but sooner or later we will want sturdier tools.
Forgive the digression; let's continue.
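For completeness, here is a minimal sketch of that indexOf approach. It is deliberately brittle, breaking on single quotes, extra spaces, or a differently ordered attribute, which is exactly why we will move on to something better:

```java
public class IndexOfDemo {
    // Extract the first src="..." value by plain string searching.
    // Brittle: assumes double quotes and no space around the equals sign.
    static String firstSrc(String html) {
        int start = html.indexOf("src=\"");
        if (start == -1) return "";
        start += "src=\"".length();         // skip past the attribute prefix
        int end = html.indexOf('"', start); // find the closing quote
        return end == -1 ? "" : html.substring(start, end);
    }

    public static void main(String[] args) {
        String html = "<img src=\"/img/bd_logo1.png\" width=\"270\">";
        System.out.println(firstSrc(html)); // prints /img/bd_logo1.png
    }
}
```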
So how do we find the src of this picture?
That's right, as the audience has guessed: regular expression matching.
If you are unsure about regular expressions, you can refer to this article: [Python] Web Crawler (7): Regular Expressions in Python.
Simply put, a regular expression is a pattern that describes what to match.
For example, three fat men are standing here, wearing red clothes, blue clothes, and green clothes.
The rule is: catch the one in green!
Then he caught the fat green man alone.
It's that simple.
However, regex syntax is extensive and profound, and it is easy to get a little lost on first contact.
I recommend trying patterns out in an online regular expression tester.
With regex as our weapon, how do we use it in Java?
Let's look at a simple little example first.
// Define a pattern template using a regular expression; the content to capture is in parentheses
// It's like setting a trap: whatever matches falls in.
Pattern pattern = Pattern.compile("href=\"(.+?)\"");
// Define a matcher to run the match
Matcher matcher = pattern.matcher("<a href=\"index.html\">My homepage</a>");
// If a match is found
if (matcher.find()) {
    // Print the captured group
    System.out.println(matcher.group(1));
}
Running results:
index.html
Yes, this is our first piece of regex code.
With this, grabbing the image link should be well within reach.
Let's wrap the regex matching into a function, then modify the code as follows:
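One thing to know before we do: find() only grabs the first match, and a real page has many links. Calling find() again resumes where the previous match ended, so a while loop collects them all. A self-contained sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindAll {
    // Collect every href value instead of stopping at the first match
    static List<String> allHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"(.+?)\"").matcher(html);
        while (m.find()) {         // each find() resumes after the previous match
            links.add(m.group(1)); // group(1) is the part in parentheses
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"index.html\">Home</a><a href=\"news.html\">News</a>";
        System.out.println(allHrefs(html)); // prints [index.html, news.html]
    }
}
```

Our function below only needs the first match, so a single if (matcher.find()) is enough there.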
import java.io.*;
import java.net.*;
import java.util.regex.*;

public class Main {
    static String sendGet(String url) {
        // Define a string to store the web page content
        String result = "";
        // Define a buffered character input stream
        BufferedReader in = null;
        try {
            // Convert the string into a URL object
            URL realUrl = new URL(url);
            // Initialize a connection to that URL
            URLConnection connection = realUrl.openConnection();
            // Start the actual connection
            connection.connect();
            // Initialize a BufferedReader to read the URL's response
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream()));
            // Temporarily stores each line as it is read
            String line;
            while ((line = in.readLine()) != null) {
                // Append each line read to result
                result += line;
            }
        } catch (Exception e) {
            System.out.println("Exception occurred when sending GET request!" + e);
            e.printStackTrace();
        } finally {
            // Use finally to close the input stream
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        return result;
    }

    static String regexString(String targetStr, String patternStr) {
        // Define a pattern template using a regular expression; the content to capture is in parentheses
        // It's like setting a trap: whatever matches falls in.
        Pattern pattern = Pattern.compile(patternStr);
        // Define a matcher to run the match
        Matcher matcher = pattern.matcher(targetStr);
        // If a match is found
        if (matcher.find()) {
            // Return the captured group
            return matcher.group(1);
        }
        return "";
    }

    public static void main(String[] args) {
        // Define the link to be visited
        String url = "http://www.baidu.com";
        // Access the link and get the page content
        String result = sendGet(url);
        // Use a regular expression to match the src of the image
        String imgSrc = regexString(result, "regex to be filled in");
        // Print the result
        System.out.println(imgSrc);
    }
}
Okay, now everything is ready except the regex itself!
So what regex is appropriate?
We notice that as long as we can grab the substring src="xxxxxx", we have the entire src link.
So a simple regex is: src=\"(.+?)\"
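Why the lazy `.+?` rather than a plain greedy `.+`? Because a greedy quantifier would run on past the first closing quote to the last quote it can find, swallowing other attributes along the way. A small demo (the img tag here is a made-up example, not Baidu's actual markup):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    // Return the first captured group, or "" if the pattern doesn't match
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String html = "<img src=\"logo.png\" alt=\"Baidu\">";
        // Lazy .+? stops at the first closing quote
        System.out.println(firstGroup("src=\"(.+?)\"", html)); // prints logo.png
        // Greedy .+ runs on to the LAST quote, capturing too much
        System.out.println(firstGroup("src=\"(.+)\"", html));  // prints logo.png" alt="Baidu
    }
}
```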
The complete code is as follows:
import java.io.*;
import java.net.*;
import java.util.regex.*;

public class Main {
    static String sendGet(String url) {
        // Define a string to store the web page content
        String result = "";
        // Define a buffered character input stream
        BufferedReader in = null;
        try {
            // Convert the string into a URL object
            URL realUrl = new URL(url);
            // Initialize a connection to that URL
            URLConnection connection = realUrl.openConnection();
            // Start the actual connection
            connection.connect();
            // Initialize a BufferedReader to read the URL's response
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream()));
            // Temporarily stores each line as it is read
            String line;
            while ((line = in.readLine()) != null) {
                // Append each line read to result
                result += line;
            }
        } catch (Exception e) {
            System.out.println("Exception occurred when sending GET request!" + e);
            e.printStackTrace();
        } finally {
            // Use finally to close the input stream
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        return result;
    }

    static String regexString(String targetStr, String patternStr) {
        // Define a pattern template using a regular expression; the content to capture is in parentheses
        // It's like setting a trap: whatever matches falls in.
        Pattern pattern = Pattern.compile(patternStr);
        // Define a matcher to run the match
        Matcher matcher = pattern.matcher(targetStr);
        // If a match is found
        if (matcher.find()) {
            // Return the captured group
            return matcher.group(1);
        }
        return "Nothing";
    }

    public static void main(String[] args) {
        // Define the link to be visited
        String url = "http://www.baidu.com";
        // Access the link and get the page content
        String result = sendGet(url);
        // Use a regular expression to match the src of the image
        String imgSrc = regexString(result, "src=\"(.+?)\"");
        // Print the result
        System.out.println(imgSrc);
    }
}
And with that, we can grab the link to the Baidu logo with Java.
Well, although we have spent a lot of time on Baidu, the foundation has to be laid solidly. Next time, we will officially turn our attention to Zhihu!