Parsing HTML in Java (Java web spider example)

Author: Eve Cole  Updated: 2025-02-23 16:00:04

Complex, cluttered HTML pages are discouraging to work with, because it is hard to extract the data you need from them.

The oldest approach is to use regular expressions. In practice this is tedious work that is rarely worth the cost and wastes precious time.
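To make the regular expression approach concrete, here is a minimal sketch that pulls href values out of anchor tags using only the JDK. The class name RegexLinkSketch, the method extractLinks, and the sample markup are hypothetical names chosen for illustration. The pattern only handles quoted attribute values, which hints at why this route gets tedious on real pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class RegexLinkSketch {

    // Matches href="..." or href='...' inside an <a ...> tag, case-insensitively.
    // Group 1 is the quote character, group 2 is the link itself.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Assumed sample markup; the third link is unquoted and is missed.
        String html = "<p><a href=\"/a.html\">A</a> <A class='x' HREF='/b.html'>B</A>"
                + " <a href=/c.html>C</a></p>";
        System.out.println(extractLinks(html)); // prints [/a.html, /b.html]
    }
}
```

Every extra case (unquoted values, links inside comments or scripts, broken markup) needs another tweak to the pattern, which is the cost described above.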

The second approach is the open source htmlparser package. It is an old project, and in my experience the results are not very good: it does not seem able to analyze HTML in depth, and appears to handle structures only about five levels deep.

Here is some htmlparser code that gets all the hyperlinks on a page:

package test;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;

public class GetLinkTest {

    public static void main(String[] args) {
        try {
            // The URL is kept as it appears in the source. Parser expects a full URL
            // (with a scheme such as http://) or a local file path.
            Parser parser = new Parser("//www.VeVB.COM");

            // Keep only the <A> tags by filtering on LinkTag
            NodeList nodeList = parser.extractAllNodesThatMatch(new NodeFilter() {
                // Implement this method to decide which nodes to keep
                public boolean accept(Node node) {
                    return node instanceof LinkTag;
                }
            });

            // Print the links that match the target address
            for (int i = 0; i < nodeList.size(); i++) {
                LinkTag n = (LinkTag) nodeList.elementAt(i);
                //System.out.print(n.getStringText() + " ==>> ");
                //System.out.println(n.extractLink());
                try {
                    if (n.extractLink().equals("//www.VeVB.COM")) {
                        System.out.println(n.extractLink());
                    }
                } catch (Exception e) {
                    // Skip a link that cannot be extracted and continue with the next one
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The third approach is the one I use now: first clean the HTML into XML, then parse the XML with Java to get the data. Here is Java source code that cleans HTML:

package exec;

import java.io.File;
import java.io.IOException;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyXmlSerializer;
import org.htmlcleaner.TagNode;

/**
 * Cleans an HTML file into well-formed XML with HtmlCleaner.
 */
public class HtmlClean {

    public void cleanHtml(String htmlurl, String xmlurl) {
        try {
            long start = System.currentTimeMillis();

            // Configure how the cleaner normalizes the markup
            HtmlCleaner cleaner = new HtmlCleaner();
            CleanerProperties props = cleaner.getProperties();
            props.setUseCdataForScriptAndStyle(true);
            props.setRecognizeUnicodeChars(true);
            props.setUseEmptyElementTags(true);
            props.setAdvancedXmlEscape(true);
            props.setTranslateSpecialEntities(true);
            props.setBooleanAttributeValues("empty");

            // Parse the HTML file into a cleaned tag tree
            TagNode node = cleaner.clean(new File(htmlurl));
            System.out.println("time after clean (ms): " + (System.currentTimeMillis() - start));

            // Write the tree out as pretty-printed XML
            new PrettyXmlSerializer(props).writeXmlToFile(node, xmlurl);
            System.out.println("time after write (ms): " + (System.currentTimeMillis() - start));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
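
The second half of this approach, parsing the cleaned XML with Java to get the data, might look roughly like the sketch below. It uses only the JDK's DOM and XPath APIs. The class name CleanedXmlLinks, the method extractHrefs, and the inline sample XML are assumptions for illustration; in real use the input would be the XML file written by the cleaning step, and a namespace or DOCTYPE in that file may need extra handling.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration; not part of HtmlCleaner.
class CleanedXmlLinks {

    // Returns the href of every <a> element, in document order.
    static List<String> extractHrefs(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // XPath reaches links at any nesting depth
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList attrs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);

            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                hrefs.add(attrs.item(i).getNodeValue());
            }
            return hrefs;
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample standing in for the cleaned XML output
        String xml = "<html><body><div><a href=\"/a.html\">A</a>"
                + "<p><a href=\"/b.html\">B</a></p></div></body></html>";
        System.out.println(extractHrefs(xml)); // prints [/a.html, /b.html]
    }
}
```

Because the input is already well-formed XML, the query stays a single XPath expression no matter how deeply the links are nested.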