There are many ways to remove comments from HTML text. Here is one approach I wrote, kept as a note for anyone who needs it.
Comments in HTML text have several characteristics:
1. Comment tags appear in pairs: every start tag has a matching end tag.
2. Comments cannot be nested: the next comment tag after a start tag (hereinafter <!--) must be its corresponding end tag (hereinafter -->).
3. A single line may contain several comments.
4. A comment may span multiple lines.
Roughly the following situations can occur:
<html>
<!--This is a head-->
<head>A Head</head>
<!--This is
a div -->
<div>A Div</div>
<!--This is
a span--><!--span in
a div--><div>a div</div>
<div><span>A span</span></div>
<!--This is a
span--><div>A div</div><!--span in a div-->
<div><span>A span</span></div>
</html>
Ideas:
1. Read one line of text at a time.
2. If the line contains both <!-- and -->, with <!-- before -->, delete the comment between the two tags and keep the rest.
3. If the line contains both <!-- and -->, but <!-- comes after -->, keep the content between the two tags and record that a <!-- tag has been encountered (a comment is now open).
4. If the line contains only <!--, keep the content before the tag and record that a comment is now open.
5. If the line contains only -->, keep the content after the tag and record that the comment is now closed.
6. Repeat steps 2, 3, 4, and 5 on the remaining content of the line.
7. Save what is left.
8. Read the next line.
import java.io.BufferedReader;
import java.io.IOException;
// Assumption: StringUtils is the Apache Commons Lang class; the original note does not show its imports
import org.apache.commons.lang3.StringUtils;

public class HtmlCommentHandler {
    /**
     * Detector for comments in HTML content
     *
     * @author boyce
     * @version 2013-12-3
     */
    private static class HtmlCommentDetector {
        private static final String COMMENT_START = "<!--";
        private static final String COMMENT_END = "-->";

        // Whether the line contains a complete comment: a start tag followed by an end tag, "<!-- -->"
        private static boolean isCommentLine(String line) {
            return containsCommentStartTag(line) && containsCommentEndTag(line)
                    && line.indexOf(COMMENT_START) < line.indexOf(COMMENT_END);
        }

        // Whether the line contains the comment start tag
        private static boolean containsCommentStartTag(String line) {
            return StringUtils.isNotEmpty(line) &&
                    line.indexOf(COMMENT_START) != -1;
        }

        // Whether the line contains the comment end tag
        private static boolean containsCommentEndTag(String line) {
            return StringUtils.isNotEmpty(line) &&
                    line.indexOf(COMMENT_END) != -1;
        }

        /**
         * Delete the complete comments in this line
         */
        private static String deleteCommentInLine(String line) {
            while (isCommentLine(line)) {
                int start = line.indexOf(COMMENT_START);
                int end = line.indexOf(COMMENT_END) + COMMENT_END.length();
                // Keep the text before and after the comment, not the comment itself
                line = line.substring(0, start) + line.substring(end);
            }
            return line;
        }

        // Get the content before the comment start tag
        private static String getBeforeCommentContent(String line) {
            if (!containsCommentStartTag(line))
                return line;
            return line.substring(0, line.indexOf(COMMENT_START));
        }

        // Get the content after the comment end tag
        private static String getAfterCommentContent(String line) {
            if (!containsCommentEndTag(line))
                return line;
            return line.substring(line.indexOf(COMMENT_END) + COMMENT_END.length());
        }
    }

    /**
     * Read the HTML content and remove the comments.
     * Line breaks are not preserved: the remaining text of each line is appended directly.
     */
    public static String readHtmlContentWithoutComment(BufferedReader reader) throws IOException {
        StringBuilder builder = new StringBuilder();
        String line;
        // Whether the text being read is inside a comment that has not been closed yet
        boolean inComment = false;
        while ((line = reader.readLine()) != null) {
            // Keep going while the line holds a tag that matters in the current state:
            // an end tag inside a comment, a start tag outside one
            while (inComment ? HtmlCommentDetector.containsCommentEndTag(line)
                    : HtmlCommentDetector.containsCommentStartTag(line)) {
                if (inComment) {
                    // -->content
                    // The open comment ends here. Keep the content after the end tag;
                    // if a start tag follows (xxx -->content<!--), the next pass handles it
                    inComment = false;
                    line = HtmlCommentDetector.getAfterCommentContent(line);
                } else if (HtmlCommentDetector.isCommentLine(line)) {
                    // content<!-- comment -->content
                    // Delete the comments whose start and end tags are both on this line
                    line = HtmlCommentDetector.deleteCommentInLine(line);
                } else {
                    // content<!--
                    // A comment opens here. Save the content before the start tag
                    // and continue with the text after it
                    inComment = true;
                    builder.append(HtmlCommentDetector.getBeforeCommentContent(line));
                    line = line.substring(line.indexOf(HtmlCommentDetector.COMMENT_START)
                            + HtmlCommentDetector.COMMENT_START.length());
                }
            }
            // Save what is left of the line unless it belongs to an open comment
            if (!inComment)
                builder.append(line);
        }
        return builder.toString();
    }
}
Of course, there are many other approaches: comments can also be removed with a regular expression, or by using a stack to match start and end tags, and so on.
The code above has been tested and used, and I hope it will be useful to anyone who needs it.
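For comparison, the two alternatives mentioned above could look roughly like the sketch below. It is only an illustration, not the code from this note: the class name CommentStripSketch and both method names are made up, and it assumes the whole HTML text is already in memory as one String. Because comments do not nest, the "stack" would never hold more than one open tag, so the second variant simply tracks a search position. Unlike the line-by-line reader, both variants keep line breaks that fall outside comments.

```java
import java.util.regex.Pattern;

final class CommentStripSketch {
    // "<!--", then as few characters as possible, then "-->".
    // DOTALL lets '.' match line breaks, so multi-line comments are covered.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    /** Regular-expression variant: removes every complete comment. */
    static String stripWithRegex(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    /** Scanning variant: walks from one start tag to the next end tag. */
    static String stripWithScan(String html) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (true) {
            int start = html.indexOf("<!--", pos);
            if (start < 0) {
                // No further comment: keep the rest of the text
                out.append(html, pos, html.length());
                return out.toString();
            }
            // Keep the text in front of the comment
            out.append(html, pos, start);
            int end = html.indexOf("-->", start + 4);
            if (end < 0) {
                // Unclosed comment: everything after the start tag is dropped
                return out.toString();
            }
            pos = end + 3;
        }
    }

    public static void main(String[] args) {
        String html = "<div>A</div><!--one\ntwo--><span>B</span><!--x-->";
        System.out.println(stripWithRegex(html)); // <div>A</div><span>B</span>
        System.out.println(stripWithScan(html));  // <div>A</div><span>B</span>
    }
}
```

The two variants differ on malformed input: the regular expression leaves an unclosed <!-- in place, while the scanning version discards everything after it. Which behavior is preferable depends on how the output is used.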