Building on the previous article, which crawled the links of a target website, this article goes a step further: it captures the content we need from the target pages and saves it in a database. The test case here is a movie download website that I often use (http://www.80s.la/). I originally wanted to crawl all of the movie download links on the site, but that would have taken too long, so I narrowed the goal to crawling only the download links for 2015 movies.
1. Introduction to the principle
The principle is basically the same as in the first article. The difference is that this site has far too many category list pages; if we did not pick and choose among those tags, the time required would be unimaginable.
Therefore, category links and tag links are not used to crawl other pages at all. The movie lists on other pages are reached only through the pagination links at the bottom of the all-movies list page. At the same time, on a movie details page only the movie title and the Thunder (thunder://) download link are grabbed; the page is not crawled any deeper, and the recommended movies and other links on the details page are ignored.
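In other words, only two URL shapes are ever queued: the paginated 2015 list pages and the movie details pages. Below is a minimal sketch of that filter, using the same patterns that appear later in checkUrl and isMoviePage; the page number and movie id in the sample URLs are made up for illustration.

import java.util.regex.Pattern;

public class UrlFilterDemo {
    public static void main(String[] args) {
        // Only these two URL shapes are followed by the crawler
        Pattern listPage = Pattern.compile("http://www.80s.la/movie/list/-2015-----p\\d*");
        Pattern moviePage = Pattern.compile("http://www.80s.la/movie/\\d+");

        // The page number and movie id below are hypothetical
        System.out.println(listPage.matcher("http://www.80s.la/movie/list/-2015-----p2").find());   // true
        System.out.println(moviePage.matcher("http://www.80s.la/movie/12345").find());              // true
        System.out.println(moviePage.matcher("http://www.80s.la/movie/list/-2015-----p2").find());  // false
    }
}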
Finally, the download links of all the movies found are collected in the videoLinkMap collection, and the data is saved into MySQL by traversing this collection.
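The crawler code in the next section obtains its MySQL connection from a helper class via JDBCDemo.getConnection(), which is not listed in this article. The following is only a minimal sketch of what such a helper might look like: the class name and the movie(MovieName, MovieLink) table match what the crawler expects, while the JDBC URL, user name, password, and the exact table definition are assumptions you would replace with your own.

package action;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/*
 * Minimal sketch of the JDBC helper used by the crawler (not the original class).
 * Assumed table, matching the insert statement in saveData():
 *   CREATE TABLE movie (
 *       Id        INT AUTO_INCREMENT PRIMARY KEY,
 *       MovieName VARCHAR(255),
 *       MovieLink VARCHAR(1000)
 *   );
 */
public class JDBCDemo {
    // Placeholder connection settings; replace with your own database, user and password
    private static final String URL = "jdbc:mysql://localhost:3306/test?useUnicode=true&characterEncoding=UTF-8";
    private static final String USER = "root";
    private static final String PASSWORD = "123456";

    public static Connection getConnection() throws SQLException {
        try {
            Class.forName("com.mysql.jdbc.Driver"); // legacy Connector/J 5.x driver class
        } catch (ClassNotFoundException e) {
            throw new SQLException("MySQL JDBC driver not found", e);
        }
        return DriverManager.getConnection(URL, USER, PASSWORD);
    }
}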
2. Code implementation
The implementation principle has been described above, and the code contains detailed comments, so I won't repeat it here. The code is as follows:
package action;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VideoLinkGrab {

    public static void main(String[] args) {
        VideoLinkGrab videoLinkGrab = new VideoLinkGrab();
        videoLinkGrab.saveData("http://www.80s.la/movie/list/-2015-----p");
    }

    /**
     * Save the retrieved data in the database
     *
     * @param baseUrl
     *            Crawler starting point
     */
    public void saveData(String baseUrl) {
        Map<String, Boolean> oldMap = new LinkedHashMap<String, Boolean>(); // Link -> whether it has been traversed
        Map<String, String> videoLinkMap = new LinkedHashMap<String, String>(); // Video download links
        String oldLinkHost = ""; // host

        Pattern p = Pattern.compile("(https?://)?[^/\\s]*"); // For example: http://www.zifangsky.cn
        Matcher m = p.matcher(baseUrl);
        if (m.find()) {
            oldLinkHost = m.group();
        }

        oldMap.put(baseUrl, false);
        videoLinkMap = crawlLinks(oldLinkHost, oldMap);

        // Traverse the collection and save the data in the database
        try {
            Connection connection = JDBCDemo.getConnection();
            for (Map.Entry<String, String> mapping : videoLinkMap.entrySet()) {
                PreparedStatement pStatement = connection
                        .prepareStatement("insert into movie(MovieName,MovieLink) values(?,?)");
                pStatement.setString(1, mapping.getKey());
                pStatement.setString(2, mapping.getValue());
                pStatement.executeUpdate();
                pStatement.close();
                // System.out.println(mapping.getKey() + " : " + mapping.getValue());
            }
            connection.close();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    /**
     * Crawl all the reachable page links of the website. The idea is breadth-first: keep issuing
     * GET requests for links that have not been traversed yet, until the whole collection has been
     * traversed and no new links turn up, at which point the task ends.
     *
     * When requesting a page, search it with regular expressions for the video links we need and
     * save them in the collection videoLinkMap.
     *
     * @param oldLinkHost
     *            Domain name, such as: http://www.zifangsky.cn
     * @param oldMap
     *            Collection of links to be traversed
     * @return The collection of all crawled video download links
     */
    private Map<String, String> crawlLinks(String oldLinkHost,
            Map<String, Boolean> oldMap) {
        Map<String, Boolean> newMap = new LinkedHashMap<String, Boolean>(); // New links found in this pass
        Map<String, String> videoLinkMap = new LinkedHashMap<String, String>(); // Video download links
        String oldLink = "";

        for (Map.Entry<String, Boolean> mapping : oldMap.entrySet()) {
            // System.out.println("link:" + mapping.getKey() + "--------check:" + mapping.getValue());
            // If it has not been traversed yet
            if (!mapping.getValue()) {
                oldLink = mapping.getKey();
                // Initiate a GET request
                try {
                    URL url = new URL(oldLink);
                    HttpURLConnection connection = (HttpURLConnection) url
                            .openConnection();
                    connection.setRequestMethod("GET");
                    connection.setConnectTimeout(2500);
                    connection.setReadTimeout(2500);

                    if (connection.getResponseCode() == 200) {
                        InputStream inputStream = connection.getInputStream();
                        BufferedReader reader = new BufferedReader(
                                new InputStreamReader(inputStream, "UTF-8"));
                        String line = "";
                        Pattern pattern = null;
                        Matcher matcher = null;

                        // Movie details page: extract the video download link and do not crawl any deeper
                        if (isMoviePage(oldLink)) {
                            boolean checkTitle = false;
                            String title = "";

                            while ((line = reader.readLine()) != null) {
                                // Extract the video title on the page
                                if (!checkTitle) {
                                    pattern = Pattern.compile("([^\\s]+).*?</title>");
                                    matcher = pattern.matcher(line);
                                    if (matcher.find()) {
                                        title = matcher.group(1);
                                        checkTitle = true;
                                        continue;
                                    }
                                }

                                // Extract the video download link on the page
                                pattern = Pattern
                                        .compile("(thunder:[^\"]+).*thunder[rR]es[tT]itle=\"[^\"]*\"");
                                matcher = pattern.matcher(line);
                                if (matcher.find()) {
                                    videoLinkMap.put(title, matcher.group(1));
                                    System.out.println("Video name: " + title
                                            + " ------ Video link: " + matcher.group(1));
                                    break; // The current page has been handled
                                }
                            }
                        }
                        // Movie list page
                        else if (checkUrl(oldLink)) {
                            while ((line = reader.readLine()) != null) {
                                pattern = Pattern
                                        .compile("<a href=\"([^\"\\s]*)\"");
                                matcher = pattern.matcher(line);
                                while (matcher.find()) {
                                    String newLink = matcher.group(1).trim(); // Link
                                    // Determine whether the obtained link starts with http
                                    if (!newLink.startsWith("http")) {
                                        if (newLink.startsWith("/"))
                                            newLink = oldLinkHost + newLink;
                                        else
                                            newLink = oldLinkHost + "/" + newLink;
                                    }
                                    // Remove a trailing / at the end of the link
                                    if (newLink.endsWith("/"))
                                        newLink = newLink.substring(0,
                                                newLink.length() - 1);
                                    // Deduplicate and discard links from other websites
                                    if (!oldMap.containsKey(newLink)
                                            && !newMap.containsKey(newLink)
                                            && (checkUrl(newLink) || isMoviePage(newLink))) {
                                        System.out.println("temp: " + newLink);
                                        newMap.put(newLink, false);
                                    }
                                }
                            }
                        }

                        reader.close();
                        inputStream.close();
                    }
                    connection.disconnect();
                } catch (MalformedURLException e) {
                    e.printStackTrace();
                } catch (IOException e) {
                    e.printStackTrace();
                }

                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                oldMap.replace(oldLink, false, true);
            }
        }

        // There are new links: continue to traverse
        if (!newMap.isEmpty()) {
            oldMap.putAll(newMap);
            videoLinkMap.putAll(crawlLinks(oldLinkHost, oldMap)); // Due to the characteristics of Map, there will be no duplicate key-value pairs
        }
        return videoLinkMap;
    }

    /**
     * Judge whether the URL is a 2015 movie list page
     *
     * @param url URL to be checked
     * @return status
     */
    public boolean checkUrl(String url) {
        Pattern pattern = Pattern.compile("http://www.80s.la/movie/list/-2015-----p\\d*");
        Matcher matcher = pattern.matcher(url);
        if (matcher.find())
            return true; // 2015 movie list
        else
            return false;
    }

    /**
     * Judge whether the page is a movie details page
     *
     * @param url Page link
     * @return status
     */
    public boolean isMoviePage(String url) {
        Pattern pattern = Pattern.compile("http://www.80s.la/movie/\\d+");
        Matcher matcher = pattern.matcher(url);
        if (matcher.find())
            return true; // Movie details page
        else
            return false;
    }
}

Note: If you want to crawl specific content from another website, you need to adjust some of the regular expressions above according to the actual situation.
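As a quick way to check such adjustments, the snippet below runs the title and Thunder-link patterns from the code against made-up input lines. Both strings are hypothetical and only shaped the way the patterns expect; they are not real page content from the site.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexCheckDemo {
    public static void main(String[] args) {
        // Hypothetical lines, shaped only to exercise the two patterns used by the crawler
        String titleLine = "ExampleMovie.2015.HD1080P</title>";
        String linkLine = "<a href=\"thunder://QUFodHRwOi8vZXhhbXBsZQ==\" thunderrestitle=\"ExampleMovie.2015.HD1080P.mkv\">Download</a>";

        // Title pattern: captures the non-whitespace run that precedes </title>
        Matcher titleMatcher = Pattern.compile("([^\\s]+).*?</title>").matcher(titleLine);
        if (titleMatcher.find()) {
            System.out.println("title: " + titleMatcher.group(1)); // ExampleMovie.2015.HD1080P
        }

        // Thunder-link pattern: captures the thunder:// URL that appears before thunderrestitle="..."
        Matcher linkMatcher = Pattern
                .compile("(thunder:[^\"]+).*thunder[rR]es[tT]itle=\"[^\"]*\"")
                .matcher(linkLine);
        if (linkMatcher.find()) {
            System.out.println("link: " + linkMatcher.group(1)); // thunder://QUFodHRwOi8vZXhhbXBsZQ==
        }
    }
}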
3. Test results
That is all the content of this article. I hope it is helpful to your learning, and I hope you will continue to support Wulin.com.