This article shows, by example, how to use a Java regular expression to match the Chinese characters inside <a> tags in HTML. It is shared here for reference.
Today a friend in a chat group asked a regular expression question about the following content:
<a href='www.baidu.comds=id32434#comment'rewr>Special432</a>453543<a guhll,,l>a1Special123 Are you? </a><a href=id=32434#comment'ewrer>Special 2</a><a>Text 2</a><a>Text </a>
The goal is to extract the Chinese characters from every <a> tag whose content contains Chinese and whose attributes do not contain comment. (In the sample above, English words such as Special and Text stand in for the Chinese text.)
The solution is as follows:
1. First, match the <a> tags whose attributes do not contain comment;
2. Then run a second match on each result to extract the Chinese characters.
The code is as follows:
package com.mmq.regex;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @use Match the Chinese characters in the <a> tags of HTML
 * @ProjectName stuff
 * @Author mumaoqiang
 * @FullName com.mmq.regex.MatchChineseCharacters.java
 * @JDK 1.6.0
 * @Version 1.0
 */
public class MatchChineseCharacters {
    /**
     * Given the input, extracts the Chinese characters from the <a> tags whose
     * content contains Chinese and whose attributes do not contain comment.
     * @param source content to match
     * @return the Chinese characters found in the matching <a> tags
     */
    public static String matchChineseCharacters(String source) {
        // Match <a> tags that contain Chinese but whose attributes do not contain comment
        String reg = "<a((?!comment).)*?>([^<>]*?[\\u4e00-\\u9fa5]+[^<>]*?)+(?=</a>)";
        Pattern pattern = Pattern.compile(reg);
        Matcher matcher = pattern.matcher(source);
        // The Chinese-only pattern is the same for every tag, so compile it once
        Pattern p1 = Pattern.compile("[\\u4e00-\\u9fa5]+");
        StringBuilder character = new StringBuilder();
        while (matcher.find()) {
            String result = matcher.group();
            System.out.println(result);
            // Run a second match on the result to pick out the Chinese characters
            Matcher m1 = p1.matcher(result);
            while (m1.find()) {
                character.append(m1.group());
            }
        }
        return character.toString();
    }

    public static void main(String[] args) {
        String result = matchChineseCharacters("<a href='www.baidu.comds=id32434#comment'rewr>Special432</a>453543<a guhll,,l>a1Special123Hello123?</a><a href=id=32434#comment'ewrer>Special2</a><a>Text2</a><a>Text</a>");
        System.out.println(result);
    }
}

The output is as follows:
<a guhll,,l>a1Special123Hello123?
<a>Text2
<a>Text
SpecialHelloTextText
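Because the sample text in this copy uses English words where the Chinese text should be, the two-step approach is easier to verify on an input that really contains Chinese. The following is a minimal sketch, not the author's original program: the class name ChineseInAnchorsDemo, the method name extract, and the Chinese words in the input are all assumptions chosen for illustration. Only the two regular expressions are taken from the article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChineseInAnchorsDemo {
    // The article's two patterns, compiled once and reused.
    private static final Pattern TAG = Pattern.compile(
            "<a((?!comment).)*?>([^<>]*?[\\u4e00-\\u9fa5]+[^<>]*?)+(?=</a>)");
    private static final Pattern CHINESE = Pattern.compile("[\\u4e00-\\u9fa5]+");

    static String extract(String source) {
        StringBuilder out = new StringBuilder();
        Matcher tag = TAG.matcher(source);
        while (tag.find()) {                        // step 1: qualifying <a> tags
            Matcher zh = CHINESE.matcher(tag.group());
            while (zh.find()) {                     // step 2: Chinese runs in each match
                out.append(zh.group());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed sample: the Chinese words are illustrative substitutions.
        String html = "<a href='x#comment'>特殊432</a>453543"
                + "<a guhll,,l>a1特殊123你好123?</a>"
                + "<a href=id=32434#comment'ewrer>特殊2</a>"
                + "<a>文字2</a><a>文字</a>";
        System.out.println(extract(html)); // 特殊你好文字文字
    }
}
```

The first and third tags are skipped because their attributes contain comment, so only the text of the second, fourth, and fifth tags contributes to the result.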
Here is an explanation:
String reg = "<a((?!comment).)*?>([^<>]*?[\\u4e00-\\u9fa5]+[^<>]*?)+(?=</a>)";
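This pattern matches <a> tags whose content contains Chinese and whose attributes do not contain comment. In ((?!comment).)*?, the negative lookahead (?!comment) is checked before every character, so the attribute part can never run across the word comment. A lookbehind (?<=...) cannot do this job, because Java requires a lookbehind to have a bounded length and the length of the tag's attributes is not known in advance. [\\u4e00-\\u9fa5]+ matches a run of Chinese characters, and (?=</a>) is a lookahead that requires the closing tag to follow without including it in the result.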
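The two lookaround constructs explained above can be tried in isolation. This is a small sketch under stated assumptions: the class LookaroundDemo, its two helper methods, and the input strings are hypothetical and exist only to show each construct on its own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    /** True if the opening <a ...> tag at the start of s has no "comment" in its attributes. */
    static boolean opensWithoutComment(String s) {
        // lookingAt() anchors the match at the start, so only the first tag is tested.
        return Pattern.compile("<a((?!comment).)*?>").matcher(s).lookingAt();
    }

    /** Returns the text right before </a>; the lookahead checks the closing tag without consuming it. */
    static String textBeforeClose(String s) {
        Matcher m = Pattern.compile("[^<>]+(?=</a>)").matcher(s);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(opensWithoutComment("<a id='k'>x</a>"));          // true
        System.out.println(opensWithoutComment("<a href='#comment'>x</a>")); // false
        System.out.println(textBeforeClose("<a>文字2</a>"));                  // 文字2
    }
}
```

In the second call the lookahead fails as soon as the scan reaches the c of comment, and no closing > has been seen by then, so the tag is rejected.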
That solves the problem. The same approach is easy to adapt if you need to match other content inside other tags. If you know a better pattern, please leave a comment so we can learn from each other.
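One possible way to generalize the idea to other tags and other content is sketched below. It is an assumption-laden variation rather than the article's code: the class TagTextExtractor and its extract method are hypothetical, the attribute scan uses [^>] instead of . so that it stays inside the opening tag, a word boundary is added so that a does not also match tags such as abbr, and the element text is read from a capturing group instead of a lookahead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagTextExtractor {
    /**
     * Collects every run matching contentRegex from the text of <tag> elements
     * whose attributes do not contain the word forbidden.
     */
    static String extract(String html, String tag, String forbidden, String contentRegex) {
        String t = Pattern.quote(tag);
        // Group 1 is one attribute character; group 2 is the element text.
        Pattern tagPattern = Pattern.compile(
                "<" + t + "\\b((?!" + Pattern.quote(forbidden) + ")[^>])*>([^<>]*)</" + t + ">");
        Pattern content = Pattern.compile(contentRegex);
        StringBuilder out = new StringBuilder();
        Matcher m = tagPattern.matcher(html);
        while (m.find()) {
            Matcher c = content.matcher(m.group(2));
            while (c.find()) {
                out.append(c.group());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical input, not the article's sample.
        String html = "<a href='x#comment'>你好</a><a id='k'>文字1</a><b>粗体</b>";
        System.out.println(extract(html, "a", "comment", "[\\u4e00-\\u9fa5]+")); // 文字
        System.out.println(extract(html, "b", "comment", "[\\u4e00-\\u9fa5]+")); // 粗体
    }
}
```

Pattern.quote is used so that a tag name or forbidden word containing regex metacharacters is treated literally.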
I hope this article will be helpful to everyone's Java programming.