HTML is the core of the web. Every page you see on the Internet is HTML, whether it is static or dynamically generated by JavaScript, JSP, PHP, ASP or another web technology, and your browser parses that HTML and renders it for you. But what if you need to parse an HTML document yourself in a Java program, to find certain elements, tags or attributes, or to check whether a specific element exists? If you have been programming in Java for many years, you have almost certainly parsed XML with a DOM or SAX parser, but it is quite likely that you have never parsed HTML. Ironically, ordinary Java applications (Servlets and other Java web technologies aside) rarely need to parse HTML documents. What's worse, the core JDK does not ship a convenient, general-purpose HTML parser, at least none that I know of. This is why, when it comes to parsing HTML files, many Java programmers have to Google first to find out how to extract an HTML tag in Java.
When I had this need, I expected to find some open source library for the job, but I did not expect one as cool and fully featured as JSoup. It not only reads and parses HTML documents, but also lets you extract any element from an HTML file, along with its attributes and CSS classes, and modify them too. With JSoup you can do almost anything with an HTML document. In this article we will see examples of how to download and parse an HTML page from the Google homepage, or any other URL, in Java.
What is the JSoup library
Jsoup is an open source Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS and jQuery-like methods. Jsoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers such as Chrome and Firefox. Here are some useful features of the Jsoup library:
1. Jsoup can fetch and parse HTML from a URL, a file, or a string.
2. Jsoup can find and extract data, using DOM traversal or CSS selectors.
3. You can use Jsoup to modify HTML elements, attributes and text.
4. Jsoup can clean user-submitted content against a safelist (called a whitelist in older releases) to prevent XSS attacks.
5. Jsoup can also output tidy HTML.
Jsoup is designed to handle all kinds of HTML found in the wild, from correct and valid documents to incomplete and invalid tag soup. One of Jsoup's core strengths is its robustness.
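To make a few of the features above concrete, here is a minimal sketch. The class name JsoupFeatureSketch, its helper methods and the sample markup are made up for this illustration and are not part of Jsoup. The cleaning feature is left out on purpose, because the name of the class involved differs between Jsoup releases. It assumes a reasonably recent Jsoup on the classpath (selectFirst is not available in very old versions).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupFeatureSketch {

    // Hypothetical sample markup, used only for this illustration.
    static final String SAMPLE =
            "<div id=\"news\">"
            + "<a class=\"story\" href=\"/a\">First</a>"
            + "<a class=\"story\" href=\"/b\">Second</a>"
            + "</div>";

    // Feature 2: find and extract data with a CSS selector.
    static String firstStoryText(Document doc) {
        Element first = doc.selectFirst("div#news a.story");
        return first == null ? "" : first.text();
    }

    // Feature 3: modify an element's attribute and text in place.
    static void retitleFirstStory(Document doc, String newText) {
        Element first = doc.selectFirst("a.story");
        if (first != null) {
            first.attr("rel", "nofollow").text(newText);
        }
    }

    public static void main(String[] args) {
        // Feature 1: parse HTML from a String.
        Document doc = Jsoup.parse(SAMPLE);
        System.out.println("First story: " + firstStoryText(doc));

        retitleFirstStory(doc, "Updated");

        // Feature 5: Jsoup prints the modified tree as tidy HTML.
        System.out.println(doc.body().html());
    }
}
```

The helpers take a Document rather than a String, so the same parsed tree can be queried and then modified without parsing twice.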
Using Jsoup for HTML parsing in Java
In this Java HTML parsing tutorial, we will see three different examples of using Jsoup to parse and traverse HTML in Java. In the first example, we will parse an HTML string built from Java string literals. In the second example, we will download an HTML document from the web, and in the third example we will load a sample HTML file, login.html, and parse it. This file contains a title tag and, inside the body, a div tag that holds a form. The form has input tags for the username and password, as well as submit and reset buttons. It is correct and valid HTML, that is, all tags and attributes are closed correctly. Here is our sample HTML file:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Login Page</title>
</head>
<body>
<div id="login" class="simple">
<form action="login.do">
Username : <input id="username" type="text" /><br>
Password : <input id="password" type="password" /><br>
<input id="submit" type="submit" />
<input id="reset" type="reset" />
</form>
</div>
</body>
</html>
Using Jsoup to parse HTML is very simple: you just call the static method Jsoup.parse() and pass it your HTML string. Jsoup provides several overloaded parse() methods, which can read HTML from a String, a File, a URL or even an InputStream, optionally together with a base URI for resolving relative links. If the document is not encoded in UTF-8, you can also specify the character encoding so that the HTML is read correctly. The simplest overload, parse(String html), parses the input HTML into a new Document. In Jsoup, Document extends Element, which in turn extends Node; TextNode extends Node as well. As long as you pass in a non-null string, you are guaranteed a successful, sensible parse and a Document containing at least head and body elements. Once you have this Document, you can call the appropriate methods on Document and its parent classes Element and Node to get the data you want.
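As a rough illustration of the points above, the following sketch uses the parse(String html, String baseUri) overload and walks a few Node objects. The class ParseOverloadSketch, its helper methods and the example.com URL are assumptions made for this example only; the Jsoup calls themselves (parse, selectFirst, absUrl, childNodes) are regular API methods.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class ParseOverloadSketch {

    // parse(String html, String baseUri): the base URI lets Jsoup
    // resolve relative links to absolute URLs.
    static String absoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    // Element and TextNode both extend Node, so the children of an
    // element come back as a mixed list of Node objects.
    static int countTextNodes(Element element) {
        int count = 0;
        for (Node child : element.childNodes()) {
            if (child instanceof TextNode) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical base URI, chosen only for the illustration.
        System.out.println(absoluteLink("<a href=\"page.html\">Page</a>",
                "http://example.com/docs/"));

        Document doc = Jsoup.parse("<p>Hello <b>bold</b> world</p>");
        System.out.println("Text nodes in <p>: "
                + countTextNodes(doc.selectFirst("p")));

        // Even an empty string yields a Document with head and body.
        Document empty = Jsoup.parse("");
        System.out.println(empty.head() != null && empty.body() != null);
    }
}
```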
Java program that parses HTML documents
Below is a complete Java program that parses an HTML string, an HTML page downloaded from the Internet, and an HTML file from the local file system. You can run it from Eclipse, from any other IDE, or even from the command line; either way, the Jsoup JAR has to be on the classpath. In Eclipse it is very simple: copy this code, create a new Java project, right-click the src folder and paste. Eclipse will create the correct package and a Java source file with the matching name, so the effort is minimal. If you already have a Java sample project, it takes only one step. The first and third examples use the parse method to obtain a Document object, which you can then query to extract any tag value or attribute value. The second example uses Jsoup.connect(...).get(), which opens a connection to the URL, downloads the HTML and parses it. It also returns a Document, which can be queried in the same way.
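The Jsoup.connect step described above can also be configured before the request is sent. The sketch below sets a user agent and a timeout on the Connection object; the class ConnectSketch, its method names, the user-agent string and the 10 second timeout are arbitrary choices for illustration, not recommendations from the Jsoup project.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ConnectSketch {

    // Builds a configured Connection without sending any request yet.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (jsoup tutorial sketch)") // assumed value
                .timeout(10_000);                                 // milliseconds
    }

    // Executes a GET request, parses the response and returns the page title.
    static String fetchTitle(String url) throws IOException {
        return prepare(url).get().title();
    }

    public static void main(String[] args) {
        try {
            System.out.println("Title: " + fetchTitle("http://google.com/"));
        } catch (IOException e) {
            // Network failures, timeouts and HTTP errors all end up here.
            System.err.println("Could not fetch page: " + e.getMessage());
        }
    }
}
```

Keeping prepare() separate from fetchTitle() means the configuration can be inspected or reused without touching the network.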
import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/**
 * Java program to parse/read HTML documents from a String, a URL and a File
 * using the Jsoup library. Jsoup is an open source library which allows Java
 * developers to parse HTML files and extract elements, manipulate data and
 * change style using DOM, CSS and jQuery-like methods.
 *
 * @author Javin Paul
 */
public class HTMLParser {

    public static void main(String[] args) {

        // JSoup Example 1 - Parsing an HTML String
        String htmlString = "<!DOCTYPE html>"
                + "<html>"
                + "<head>"
                + "<title>JSoup Example</title>"
                + "</head>"
                + "<body>"
                + "<table><tr><td><h1>HelloWorld</h1> </tr>"
                + "</table>"
                + "</body>"
                + "</html>";

        Document html = Jsoup.parse(htmlString);
        String title = html.title();
        String h1 = html.body().getElementsByTag("h1").text();

        System.out.println("Input HTML String to JSoup:" + htmlString);
        System.out.println("After parsing, Title : " + title);
        System.out.println("After parsing, Heading : " + h1);

        // JSoup Example 2 - Reading HTML page from URL
        try {
            Document doc = Jsoup.connect("http://google.com/").get();
            title = doc.title();
            System.out.println("Jsoup Can read HTML page from URL, title : " + title);
        } catch (IOException e) {
            e.printStackTrace();
        }

        // JSoup Example 3 - Parsing an HTML file in Java
        // Wrong: Jsoup.parse("login.html", "ISO-8859-1") treats the first
        // argument as HTML content and the second as a base URI.
        // Right: pass a File together with the character encoding.
        try {
            Document htmlFile = Jsoup.parse(new File("login.html"), "ISO-8859-1");
            title = htmlFile.title();
            Element div = htmlFile.getElementById("login");
            // getting class from HTML element (guard against a missing div)
            String cssClass = (div != null) ? div.className() : "";

            System.out.println("Jsoup can also parse HTML file directly");
            System.out.println("title : " + title);
            System.out.println("class of div tag : " + cssClass);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Output:
Input HTML String to JSoup:<!DOCTYPE html><html><head><title>JSoup Example</title></head><body><table><tr><td><h1>HelloWorld</h1> </tr></table></body></html>
After parsing, Title : JSoup Example
After parsing, Heading : HelloWorld
Jsoup Can read HTML page from URL, title : Google
Jsoup can also parse HTML file directly
title : Login Page
class of div tag : simple
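Besides the title and the CSS class, the same kind of Document can be queried for the form fields of the login page. The sketch below embeds a condensed copy of the sample markup as a string so that it runs without the login.html file; the class LoginFormSketch and the method inputIds are hypothetical names introduced for this example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LoginFormSketch {

    // Condensed copy of the sample login page, inlined for illustration.
    static final String LOGIN_HTML =
            "<html><head><title>Login Page</title></head><body>"
            + "<div id=\"login\" class=\"simple\">"
            + "<form action=\"login.do\">"
            + "Username : <input id=\"username\" type=\"text\" /><br>"
            + "Password : <input id=\"password\" type=\"password\" /><br>"
            + "<input id=\"submit\" type=\"submit\" />"
            + "<input id=\"reset\" type=\"reset\" />"
            + "</form></div></body></html>";

    // Collects the id attribute of every input inside the login form,
    // in document order.
    static List<String> inputIds(Document doc) {
        List<String> ids = new ArrayList<>();
        for (Element input : doc.select("div#login form input")) {
            ids.add(input.id());
        }
        return ids;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(LOGIN_HTML);
        System.out.println("form action : " + doc.selectFirst("form").attr("action"));
        System.out.println("input ids   : " + inputIds(doc));
    }
}
```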
The great advantage of Jsoup is its robustness. The Jsoup HTML parser makes every attempt to create a clean parse from the HTML you provide, regardless of whether the HTML is well-formed. It handles unclosed tags (e.g. <p>Java <p>Scala becomes <p>Java</p> <p>Scala</p>) and implicit tags (e.g. a bare <td>Java is Great</td> is wrapped in a <table><tr><td>), and it always creates a proper document structure (an html element containing a head and a body, with only appropriate elements inside the head).
That is how to parse HTML in Java. Jsoup is an excellent and robust open source library that makes it very simple to read HTML documents, body fragments and HTML strings, and to parse HTML content directly from the web. In this article we learned how to get a specific HTML tag in Java: in the first example we extracted the title and the text of the h1 tag, and in the third example we learned how to read an attribute value from an HTML tag by extracting its CSS class. Besides the powerful jQuery-style html.body().getElementsByTag("h1").text() call, which can extract arbitrary HTML tags, Jsoup also provides convenience methods such as Document.title() and Element.className(), which quickly give you the title and the CSS class. I hope you have fun with JSoup, and we will see some more examples of this API soon.
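The robustness described above is easy to try out. The following sketch feeds Jsoup a fragment with unclosed paragraph tags and a title outside any head element; the class RobustnessSketch, its helper methods and the input string are invented for this illustration. It deliberately checks only the unclosed-tag and head/body behaviour.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RobustnessSketch {

    // Counts the <p> elements Jsoup builds from possibly unclosed tags.
    static int paragraphCount(String html) {
        return Jsoup.parse(html).select("p").size();
    }

    // Checks that a <title> found in the input ends up inside <head>.
    static boolean titleEndsUpInHead(String html) {
        Document doc = Jsoup.parse(html);
        return doc.head().select("title").size() == 1;
    }

    public static void main(String[] args) {
        // Malformed input: no html/head/body tags and unclosed <p> tags.
        String messy = "<title>Languages</title><p>Java<p>Scala";

        System.out.println("paragraphs    : " + paragraphCount(messy));
        System.out.println("title in head : " + titleEndsUpInHead(messy));

        // Print the normalised document that Jsoup built from the input.
        System.out.println(Jsoup.parse(messy).outerHtml());
    }
}
```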