Detailed explanation of Java Douban movie crawler - the growth story of the little crawler (with source code)

Author: Eve Cole. Update time: 2025-06-14 03:32:01

I have used crawlers before, for example using Nutch to crawl from specified seeds and search over the crawled data, and I have skimmed some of its source code. Nutch, of course, handles crawling very comprehensively and meticulously. Whenever I watched the crawled page information and processing logs scroll across the screen, it felt like black magic. This time, while brushing up on Spring MVC, I wanted to build a small crawler of my own. It did not need to be sophisticated, and a few small bugs would not matter; all I needed was to start from a seed site and crawl the information I wanted. Exceptions came up along the way, caused by misused APIs, abnormal HTTP response statuses, or database read/write problems. Through this cycle of hitting exceptions and fixing them, JewelCrawler (my son's nickname) can now crawl data on its own, and it has also picked up a small skill: sentiment analysis based on the Word2Vec algorithm.

There may still be unknown exceptions waiting to be solved, and some performance aspects need optimizing, such as database interaction and data reads and writes. However, I will not have much energy to spend on this for the rest of the year, so today I am giving a brief summary. The previous two articles focused mainly on features and results. This one covers how JewelCrawler came into being and puts the code on GitHub (the source code address is at the end of the article). If you are interested, feel free to follow along (for communication and learning only; dear Douban, please: more sincerity, less harm).

Environment introduction

Development tool: IntelliJ IDEA 14

Database: MySQL 5.5, with Navicat as the database management tool (used to connect to and query the database)

Language: Java

Dependency management: Maven

Version control: Git

Directory structure

com.ansj.vec is a Java implementation of the Word2Vec algorithm

com.jackie.crawler.doubanmovie is the crawler implementation module. Some of its packages are empty because they are not used yet. Of the others:

The constants package holds the constant classes
The crawl package holds the crawler entry program
The entity package holds the entity classes mapped to database tables
The test package holds test classes
The utils package holds utility classes

The resource module stores configuration files and resource files, such as

beans.xml: configuration file for the Spring context
seed.properties: seed file
stopwords.dic: stop-word dictionary
comment12031715.txt: crawled short-comment data
tokenizerResult.txt: the result file after word segmentation with IKAnalyzer
vector.mod: model data trained with the Word2Vec algorithm

The test module holds the unit tests.

Database configuration

1. Add dependency packages

JewelCrawler is managed with Maven, so you only need to add the corresponding dependencies to pom.xml.

    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-jdbc</artifactId>
        <version>4.1.1.RELEASE</version>
    </dependency>
    <dependency>
        <groupId>commons-pool</groupId>
        <artifactId>commons-pool</artifactId>
        <version>1.6</version>
    </dependency>
    <dependency>
        <groupId>commons-dbcp</groupId>
        <artifactId>commons-dbcp</artifactId>
        <version>1.4</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.38</version>
    </dependency>

2. Declare the data source bean

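The data source bean needs to be declared in beans.xml

    <context:property-placeholder location="classpath*:*.properties"/>

    <!-- The class attribute is assumed here to be the commons-dbcp 1.x BasicDataSource, matching the commons-dbcp dependency added in step 1 -->
    <bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close">
        <property name="driverClassName" value="${jdbc.driver}"/>
        <property name="url" value="${jdbc.url}"/>
        <property name="username" value="${jdbc.username}"/>
        <property name="password" value="${jdbc.password}"/>
    </bean>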

Note: the external configuration file jdbc.properties is bound here, and the concrete data source parameters are read from that file.
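As a rough illustration of what that binding amounts to, the sketch below loads the four keys referenced by the bean (jdbc.driver, jdbc.url, jdbc.username, jdbc.password) with plain java.util.Properties. The values, including the database name and credentials, are placeholders invented for this example and are not taken from the project. Spring's property placeholder performs an equivalent lookup by key.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Sketch only: mimics what a jdbc.properties file might contain.
final class JdbcPropertiesSketch {

    // Hypothetical file content; every value here is an assumption.
    static final String SAMPLE =
            "jdbc.driver=com.mysql.jdbc.Driver\n"
          + "jdbc.url=jdbc:mysql://localhost:3306/example_db\n"
          + "jdbc.username=example_user\n"
          + "jdbc.password=example_password\n";

    // Parses properties-format text into a Properties object.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE);
        // Each ${...} placeholder in beans.xml is resolved by a key lookup like this.
        System.out.println(props.getProperty("jdbc.driver"));
        System.out.println(props.getProperty("jdbc.url"));
    }
}
```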

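If you run into the error "SQL [insert into user(id) values(?)]; Field 'name' doesn't have a default value", the INSERT is leaving out a column that is declared NOT NULL and has no default value. The fix I used was to set the corresponding field of the table to auto-increment; giving the column a default value or including it in the INSERT also resolves the error.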

Problems encountered when parsing pages

The crawled page data has to be parsed as a DOM structure to extract the data you want. Along the way I ran into the following errors.

org.htmlparser.Node cannot be resolved

Solution: add the jar dependency

    <dependency>
        <groupId>org.htmlparser</groupId>
        <artifactId>htmlparser</artifactId>
        <version>1.6</version>
    </dependency>

org.apache.http.HttpEntity cannot be resolved

Solution: add the jar dependency

    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.2</version>
    </dependency>

These were only problems I hit along the way; in the end, page parsing is done with Jsoup.

Slow downloads from the Maven repository

I used to use the default Maven central repository, and downloading jar packages was very slow. I do not know whether it was my network or something else. Later I found Alibaba Cloud's Maven repository online. After switching to it, downloads were dramatically faster than before.

    <mirrors>
        <mirror>
            <id>alimaven</id>
            <name>aliyun maven</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <mirrorOf>central</mirrorOf>
        </mirror>
    </mirrors>

Find Maven's settings.xml file and add this mirror.

A way to read files under the resource module

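For example, to read the seed.properties file

    @Test
    public void testFile() {
        // getResource("/...") resolves the name against the classpath root.
        // Note: URL.getPath() is not URL-decoded, so this fails if the path contains spaces.
        File seedFile = new File(this.getClass().getResource("/seed.properties").getPath());
        System.out.print("===========" + seedFile.length() + "===========");
    }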
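An alternative that avoids going through a File path at all is to read the resource as a stream, which also works when the resources are packaged inside a jar. The sketch below assumes seed.properties is a standard properties-format file at the classpath root; the helper name loadFromClasspath is made up for illustration, and it returns an empty Properties object when the resource is missing.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

final class ResourceLoadingSketch {

    // Loads a properties file from the classpath; empty result if it is absent.
    static Properties loadFromClasspath(String name) {
        Properties props = new Properties();
        // getResourceAsStream returns null when the resource does not exist;
        // try-with-resources tolerates a null resource.
        try (InputStream in = ResourceLoadingSketch.class.getResourceAsStream(name)) {
            if (in != null) {
                props.load(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties seeds = loadFromClasspath("/seed.properties");
        System.out.println("Loaded " + seeds.size() + " entries");
    }
}
```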

About regular expressions

When using regular expressions (java.util.regex), even if the input does match the pattern you defined, you must call the matcher's find method before you can use the group method to get the matched substring. Calling group directly does not give you the result you want.
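A minimal sketch of this behavior follows. The pattern and the sample input are invented for illustration and are not taken from JewelCrawler's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindBeforeGroupSketch {

    // Hypothetical pattern: captures the numeric id in a /subject/<id>/ path.
    private static final Pattern SUBJECT_ID = Pattern.compile("/subject/(\\d+)/");

    // Returns true if group(1) throws when no match has been attempted yet.
    static boolean groupThrowsWithoutFind(String input) {
        Matcher m = SUBJECT_ID.matcher(input);
        try {
            m.group(1); // no find() call before this
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Calls find() first, then reads the captured group; null if nothing matches.
    static String extractId(String input) {
        Matcher m = SUBJECT_ID.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "http://example.com/subject/12031715/"; // illustrative input
        System.out.println(groupThrowsWithoutFind(url));
        System.out.println(extractId(url));
    }
}
```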

To see why, I looked at the source code of the Matcher class:

    package java.util.regex;

    import java.util.Objects;

    public final class Matcher implements MatchResult {

        /**
         * The Pattern object that created this Matcher.
         */
        Pattern parentPattern;

        /**
         * The storage used by groups. They may contain invalid values if
         * a group was skipped during the matching.
         */
        int[] groups;

        /**
         * The range within the sequence that is to be matched. Anchors
         * will match at these "hard" boundaries. Changing the region
         * changes these values.
         */
        int from, to;

        /**
         * Lookbehind uses this value to ensure that the subexpression
         * match ends at the point where the lookbehind was encountered.
         */
        int lookbehindTo;

        /**
         * The original string being matched.
         */
        CharSequence text;

        /**
         * Matcher state used by the last node. NOANCHOR is used when a
         * match does not have to consume all of the input. ENDANCHOR is
         * the mode used for matching all the input.
         */
        static final int ENDANCHOR = 1;
        static final int NOANCHOR = 0;
        int acceptMode = NOANCHOR;

        /**
         * The range of string that last matched the pattern. If the last
         * match failed then first is -1; last initially holds 0 then it
         * holds the index of the end of the last match (which is where the
         * next search starts).
         */
        int first = -1, last = 0;

        /**
         * The end index of what matched in the last match operation.
         */
        int oldLast = -1;

        /**
         * The index of the last position applied in a substitution.
         */
        int lastAppendPosition = 0;

        /**
         * Storage used by nodes to tell what repetition they are on in
         * a pattern, and where groups begin. The nodes themselves are stateless,
         * so they rely on this field to hold state during a match.
         */
        int[] locals;

        /**
         * Boolean indicating whether or not more input could change
         * the results of the last match.
         *
         * If hitEnd is true, and a match was found, then more input
         * might cause a different match to be found.
         * If hitEnd is true and a match was not found, then more
         * input could cause a match to be found.
         * If hitEnd is false and a match was not found, then more
         * input will not cause a match to be found.
         */
        boolean hitEnd;

        /**
         * Boolean indicating whether or not more input could change
         * a positive match into a negative one.
         *
         * If requireEnd is true, and a match was found, then more
         * input could cause the match to be lost.
         * If requireEnd is false and a match was found, then more
         * input might change the match but the match won't be lost.
         * If a match was not found, then requireEnd has no meaning.
         */
        boolean requireEnd;

        /**
         * If transparentBounds is true then the boundaries of this
         * matcher's region are transparent to lookahead, lookbehind,
         * and boundary matching constructs that try to see beyond them.
         */
        boolean transparentBounds = false;

        /**
         * If anchoringBounds is true then the boundaries of this
         * matcher's region match anchors such as ^ and $.
         */
        boolean anchoringBounds = true;

        /**
         * No default constructor.
         */
        Matcher() {
        }

        /**
         * All matchers have the state used by Pattern during a match.
         */
        Matcher(Pattern parent, CharSequence text) {
            this.parentPattern = parent;
            this.text = text;

            // Allocate state storage
            int parentGroupCount = Math.max(parent.capturingGroupCount, 10);
            groups = new int[parentGroupCount * 2];
            locals = new int[parent.localCount];

            // Put fields into initial states
            reset();
        }

        ....

        /**
         * Returns the input subsequence matched by the previous match.
         *
         * <p> For a matcher <i>m</i> with input sequence <i>s</i>,
         * the expressions <i>m.</i><tt>group()</tt> and
         * <i>s.</i><tt>substring(</tt><i>m.</i><tt>start(),</tt> <i>m.</i><tt>end())</tt>
         * are equivalent. </p>
         *
         * <p> Note that some patterns, for example <tt>a*</tt>, match the empty
         * string. This method will return the empty string when the pattern
         * successfully matches the empty string in the input. </p>
         *
         * @return The (possibly empty) subsequence matched by the previous match,
         *         in string form
         *
         * @throws IllegalStateException
         *         If no match has yet been attempted,
         *         or if the previous match operation failed
         */
        public String group() {
            return group(0);
        }

        /**
         * Returns the input subsequence captured by the given group during the
         * previous match operation.
         *
         * <p> For a matcher <i>m</i>, input sequence <i>s</i>, and group index
         * <i>g</i>, the expressions <i>m.</i><tt>group(</tt><i>g</i><tt>)</tt> and
         * <i>s.</i><tt>substring(</tt><i>m.</i><tt>start(</tt><i>g</i><tt>),</tt> <i>m.</i><tt>end(</tt><i>g</i><tt>))</tt>
         * are equivalent. </p>
         *
         * <p> <a href="Pattern.html#cg">Capturing groups</a> are indexed from left
         * to right, starting at one. Group zero denotes the entire pattern, so
         * the expression <tt>m.group(0)</tt> is equivalent to <tt>m.group()</tt>.
         * </p>
         *
         * <p> If the match was successful but the group specified failed to match
         * any part of the input sequence, then <tt>null</tt> is returned. Note
         * that some groups, for example <tt>(a*)</tt>, match the empty string.
         * This method will return the empty string when such a group successfully
         * matches the empty string in the input. </p>
         *
         * @param group
         *        The index of a capturing group in this matcher's pattern
         *
         * @return The (possibly empty) subsequence captured by the group
         *         during the previous match, or <tt>null</tt> if the group
         *         failed to match part of the input
         *
         * @throws IllegalStateException
         *         If no match has yet been attempted,
         *         or if the previous match operation failed
         *
         * @throws IndexOutOfBoundsException
         *         If there is no capturing group in the pattern
         *         with the given index
         */
        public String group(int group) {
            if (first < 0)
                throw new IllegalStateException("No match found");
            if (group < 0 || group > groupCount())
                throw new IndexOutOfBoundsException("No group " + group);
            if ((groups[group*2] == -1) || (groups[group*2+1] == -1))
                return null;
            return getSubSequence(groups[group * 2], groups[group * 2 + 1]).toString();
        }

        /**
         * Attempts to find the next subsequence of the input sequence that matches
         * the pattern.
         *
         * <p> This method starts at the beginning of this matcher's region, or, if
         * a previous invocation of the method was successful and the matcher has
         * not since been reset, at the first character not matched by the previous
         * match.
         *
         * <p> If the match succeeds then more information can be obtained via the
         * <tt>start</tt>, <tt>end</tt>, and <tt>group</tt> methods. </p>
         *
         * @return <tt>true</tt> if, and only if, a subsequence of the input
         *         sequence matches this matcher's pattern
         */
        public boolean find() {
            int nextSearchIndex = last;
            if (nextSearchIndex == first)
                nextSearchIndex++;

            // If next search starts before region, start it at region
            if (nextSearchIndex < from)
                nextSearchIndex = from;

            // If next search starts beyond region then it fails
            if (nextSearchIndex > to) {
                for (int i = 0; i < groups.length; i++)
                    groups[i] = -1;
                return false;
            }
            return search(nextSearchIndex);
        }

        /**
         * Initiates a search to find a Pattern within the given bounds.
         * The groups are filled with default values and the match of the root
         * of the state machine is called. The state machine will hold the state
         * of the match as it proceeds in this matcher.
         *
         * Matcher.from is not set here, because it is the "hard" boundary
         * of the start of the search which anchors will set to. The from param
         * is the "soft" boundary of the start of the search, meaning that the
         * regex tries to match at that index but ^ won't match there. Subsequent
         * calls to the search methods start at a new "soft" boundary which is
         * the end of the previous match.
         */
        boolean search(int from) {
            this.hitEnd = false;
            this.requireEnd = false;
            from = from < 0 ? 0 : from;
            this.first = from;
            this.oldLast = oldLast < 0 ? from : oldLast;
            for (int i = 0; i < groups.length; i++)
                groups[i] = -1;
            acceptMode = NOANCHOR;
            boolean result = parentPattern.root.match(this, from, text);
            if (!result)
                this.first = -1;
            this.oldLast = this.last;
            return result;
        }

        ...
    }

The reason is as follows. If you call group() directly without calling find() first, group() delegates to group(int group), whose body begins with the check if (first < 0). That condition holds at this point because first is initialized to -1, so an IllegalStateException ("No match found") is thrown. If you call find() first, it ends up calling search(nextSearchIndex). Note that nextSearchIndex has been assigned from last, whose initial value is 0. Execution then moves into the search method:

    boolean search(int from) {
        this.hitEnd = false;
        this.requireEnd = false;
        from = from < 0 ? 0 : from;
        this.first = from;
        this.oldLast = oldLast < 0 ? from : oldLast;
        for (int i = 0; i < groups.length; i++)
            groups[i] = -1;
        acceptMode = NOANCHOR;
        boolean result = parentPattern.root.match(this, from, text);
        if (!result)
            this.first = -1;
        this.oldLast = this.last;
        return result;
    }

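Here nextSearchIndex is passed in as from, and from is assigned to first in the method body. So after a successful call to find(), first is no longer -1 and group() no longer throws. If the match fails, search resets first to -1, so group() would still throw; this is why the return value of find() should be checked before calling group().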
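The behavior traced above can be checked with a short sketch. The class name, method names, and inputs are invented for this example. It loops over every match with while (m.find()) and then confirms that, once find() has returned false, group() throws again.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindLoopSketch {

    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Collects every run of digits; each successful find() resumes
    // where the previous match ended.
    static List<String> allNumbers(String input) {
        Matcher m = DIGITS.matcher(input);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    // Returns true if group() throws once find() has reported no further match.
    static boolean groupThrowsAfterFailedFind(String input) {
        Matcher m = DIGITS.matcher(input);
        while (m.find()) {
            // consume all matches until find() returns false
        }
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(allNumbers("a1b22c333"));
        System.out.println(groupThrowsAfterFailedFind("a1b22"));
    }
}
```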

The source code has been uploaded to Baidu Netdisk: http://pan.baidu.com/s/1dFwtvNz

The problems described above are rather fragmentary; they are notes taken while running into issues and solving them. Other problems will come up in actual use. If you have any questions or suggestions, please feel free to raise them ^^.

Finally, here are some figures for the data crawled so far

Record table

It holds 79,032 records, of which 48,471 are web pages that have been crawled

Movie table

So far, 2,964 film and television works have been crawled

Comments table

29,711 comment records have been crawled
