Over the past two days I needed to scrape some information from other people's web pages, and I ended up using htmlparser to parse the HTML.
Let's go straight to the code.
First, note which package to import: the parser classes come from org.htmlparser.
The code is as follows:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.nodes.TagNode;
import org.htmlparser.util.NodeList;

List<Mp3> mp3List = new ArrayList<Mp3>();
try {
    // Initialize the Parser (note that it is org.htmlparser.Parser). The constructor accepts
    // several kinds of argument; here htmlStr is HTML text fetched in advance. A URL can be passed in instead.
    Parser parser = new Parser(htmlStr);
    parser.setEncoding("utf-8"); // Set the encoding
    // Filter for the div whose id is songListWrapper
    AndFilter filter = new AndFilter(
        new TagNameFilter("div"),
        new HasAttributeFilter("id", "songListWrapper"));
    NodeList nodes = parser.parse(filter); // Get the matching nodes through the filter
    Node node = nodes.elementAt(0);
    NodeList nodesChild = node.getChildren();
    Node[] nodesArr = nodesChild.toNodeArray();
    NodeList nodesChild2 = nodesArr[1].getChildren();
    Node[] nodesArr2 = nodesChild2.toNodeArray();
    Node nodeul = nodesArr2[1];
    Node[] nodesli = nodeul.getChildren().toNodeArray(); // nodesli now holds the li nodes we want
    for (int i = 2; i < nodesli.length; i++) {
        //System.out.println(nodesli[i].toHtml());
        Node tempNode = nodesli[i];
        // Attributes are read through TagNode: a Node has to be converted to a TagNode
        // before the attributes of a tag can be read
        TagNode tagNode = new TagNode();
        tagNode.setText(tempNode.toHtml());
        // claStr looks like: bb-dotimg clearfix song-item-hook { 'songItem': { 'sid': '113275822', 'sname': 'My requirements are not high ', 'author': 'Huang Bo' } }
        String claStr = tagNode.getAttribute("class");
        if (claStr == null) {
            continue; // Skip nodes that have no class attribute
        }
        claStr = claStr.replaceAll(" ", ""); // Note: this also removes spaces inside sname and author
        if (claStr.indexOf("\\?") == -1) {
            // The pattern contains no spaces because they were all stripped from claStr above
            Pattern pattern = Pattern.compile("[\\s\\wa-z\\-]+\\{'songItem':\\{'sid':'([\\d]+)','sname':'([\\s\\S]*)','author':'([\\s\\S]*)'\\}\\}");
            Matcher matcher = pattern.matcher(claStr);
            if (matcher.find()) {
                Mp3 mp3 = new Mp3();
                mp3.setSid(matcher.group(1));
                mp3.setSname(matcher.group(2));
                mp3.setAuthor(matcher.group(3));
                mp3List.add(mp3);
                //for(int j=1;j<=matcher.groupCount();j++){
                //System.out.print(" "+j+"--->"+matcher.group(j));
                //}
            }
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}
The above is what I worked out for my project. htmlparser is relatively simple to use and easy to get started with.
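The Mp3 class used in the snippet is not shown in this post. Here is a minimal sketch of what it might look like, assuming it is a plain bean with three String fields. The setter names are taken from the setSid, setSname and setAuthor calls in the code; the getters and toString are my own additions for illustration.

```java
// Hypothetical bean for one parsed song entry; only the three setters
// are known from the parsing code, the rest is assumed.
public class Mp3 {
    private String sid;     // song id, e.g. "113275822"
    private String sname;   // song name
    private String author;  // artist name

    public String getSid() { return sid; }
    public void setSid(String sid) { this.sid = sid; }

    public String getSname() { return sname; }
    public void setSname(String sname) { this.sname = sname; }

    public String getAuthor() { return author; }
    public void setAuthor(String author) { this.author = author; }

    @Override
    public String toString() {
        return "Mp3{sid=" + sid + ", sname=" + sname + ", author=" + author + "}";
    }
}
```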
The claStr value shown in the code comment (bb-dotimg clearfix song-item-hook { 'songItem': { 'sid': '113275822', 'sname': 'My requirements are not high', 'author': 'Huang Bo' } }) is the content parsed from the web page.
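To try the regular-expression step on its own, without htmlparser, something like the following sketch can be used. It runs on the sample class string above. The class name SongItemRegexSketch and the extract helper are hypothetical, and the capture groups use [^']* rather than the [\s\S]* of the original code, so that each group stops at the closing quote. Treat it as an illustration, not as the code from the project.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SongItemRegexSketch {
    // Applied after all spaces have been stripped, so the pattern contains none.
    static final Pattern SONG_ITEM = Pattern.compile(
        "\\{'songItem':\\{'sid':'(\\d+)','sname':'([^']*)','author':'([^']*)'\\}\\}");

    /** Returns {sid, sname, author}, or null if the attribute has no songItem data. */
    static String[] extract(String classAttr) {
        String compact = classAttr.replace(" ", ""); // same space stripping as the original code
        Matcher m = SONG_ITEM.matcher(compact);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String sample = "bb-dotimg clearfix song-item-hook { 'songItem': { 'sid': '113275822', "
            + "'sname': 'My requirements are not high', 'author': 'Huang Bo' } }";
        String[] fields = extract(sample);
        // prints: 113275822 | Myrequirementsarenothigh | HuangBo
        System.out.println(String.join(" | ", fields));
    }
}
```

Because every space is removed before matching, the spaces inside the song name and author are lost as well. If they matter, one option is to leave the string untouched and allow optional whitespace (\s*) between the tokens of the pattern.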