Summary of regular-expression constructs
Construct Matches
Characters
x The character x
\\ The backslash character
\0n The character with octal value 0n (0 <= n <= 7)
\0nn The character with octal value 0nn (0 <= n <= 7)
\0mnn The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh The character with hexadecimal value 0xhh
\uhhhh The character with hexadecimal value 0xhhhh
\t The tab character ('\u0009')
\n The newline (line feed) character ('\u000A')
\r The carriage-return character ('\u000D')
\f The form-feed character ('\u000C')
\a The alert (bell) character ('\u0007')
\e The escape character ('\u001B')
\cx The control character corresponding to x
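The escapes above can be tried out with a minimal sketch like the following. It assumes the table describes java.util.regex.Pattern; the class name EscapeDemo and the helper matches are only illustrative.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Returns true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Three ways to write the letter 'A' (code point 0x41) inside a regex.
        System.out.println(matches("\\x41", "A"));     // hexadecimal escape
        System.out.println(matches("\\u0041", "A"));   // four-digit hex escape read by the regex engine
        System.out.println(matches("\\0101", "A"));    // octal escape (octal 101 = decimal 65)
        // Control and whitespace escapes.
        System.out.println(matches("a\\tb", "a\tb"));  // tab
        System.out.println(matches("\\cA", "\u0001")); // control-A
    }
}
```

Each backslash is doubled because the regex is written as a Java string literal; every call in main is expected to print true.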
Character classes
[abc] a, b, or c (simple class)
[^abc] Any character except a, b, or c (negation)
[a-zA-Z] a through z or A through Z, inclusive (range)
[a-d[m-p]] a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] d, e, or f (intersection)
[a-z&&[^bc]] a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] a through z, and not m through p: [a-lq-z] (subtraction)
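As a rough illustration of union, intersection, and subtraction, the sketch below filters a string down to the characters accepted by a class. The class name CharClassDemo and the helper keep are hypothetical names used only for this example.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Keeps only the characters of input that the given single-character class accepts.
    static String keep(String charClass, String input) {
        Pattern p = Pattern.compile(charClass);
        StringBuilder sb = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(keep("[a-d[m-p]]", "abcxyzmnop")); // union
        System.out.println(keep("[a-z&&[def]]", "abcdefg"));  // intersection
        System.out.println(keep("[a-z&&[^bc]]", "abcd"));     // subtraction
        System.out.println(keep("[a-z&&[^m-p]]", "lmnopq"));  // subtraction of a range
    }
}
```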
Predefined character classes
. Any character (may or may not match line terminators)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]
POSIX character classes (US-ASCII only)
\p{Lower} A lower-case alphabetic character: [a-z]
\p{Upper} An upper-case alphabetic character: [A-Z]
\p{ASCII} All ASCII: [\x00-\x7F]
\p{Alpha} An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit} A decimal digit: [0-9]
\p{Alnum} An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct} Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph} A visible character: [\p{Alnum}\p{Punct}]
\p{Print} A printable character: [\p{Graph}\x20]
\p{Blank} A space or a tab: [ \t]
\p{Cntrl} A control character: [\x00-\x1F\x7F]
\p{XDigit} A hexadecimal digit: [0-9a-fA-F]
\p{Space} A whitespace character: [ \t\n\x0B\f\r]
java.lang.Character classes (simple java character type)
\p{javaLowerCase} Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase} Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace} Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored} Equivalent to java.lang.Character.isMirrored()
Classes for Unicode blocks and categories
\p{InGreek} A character in the Greek block (simple block)
\p{Lu} An uppercase letter (simple category)
\p{Sc} A currency symbol
\P{InGreek} Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]] Any letter except an uppercase letter (subtraction)
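A short sketch of the Unicode block and category classes follows. UnicodeClassDemo and matches are illustrative names; non-ASCII test characters are written as Unicode escapes so the example does not depend on the source file's encoding.

```java
import java.util.regex.Pattern;

class UnicodeClassDemo {
    // Returns true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String alpha = "\u03B1"; // Greek small letter alpha
        String euro = "\u20AC";  // euro sign
        System.out.println(matches("\\p{InGreek}", alpha));         // inside the Greek block
        System.out.println(matches("\\P{InGreek}", "a"));           // outside the Greek block
        System.out.println(matches("\\p{Lu}", "A"));                // uppercase letter
        System.out.println(matches("\\p{Sc}", euro));               // currency symbol
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "a"));   // letter that is not uppercase
    }
}
```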
Boundary matchers
^ The beginning of a line
$ The end of a line
\b A word boundary
\B A non-word boundary
\A The beginning of the input
\G The end of the previous match
\Z The end of the input but for the final terminator, if any
\z The end of the input
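The difference between some of these matchers is easier to see in a small experiment. The following is a sketch only; BoundaryDemo and found are hypothetical names.

```java
import java.util.regex.Pattern;

class BoundaryDemo {
    // Returns true if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // \b requires a word boundary on both sides of "cat".
        System.out.println(found("\\bcat\\b", "concat cat")); // true: the standalone word
        System.out.println(found("\\bcat\\b", "concat"));     // false: "cat" is inside a word
        // \Z tolerates one final line terminator, \z does not.
        System.out.println(found("abc\\Z", "abc\n"));         // true
        System.out.println(found("abc\\z", "abc\n"));         // false
        // \A anchors at the very beginning of the input.
        System.out.println(found("\\Aabc", "abc"));           // true
    }
}
```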
Greedy quantifiers
X? X, once or not at all
X* X, zero or more times
X+ X, one or more times
X{n} X, exactly n times
X{n,} X, at least n times
X{n,m} X, at least n but not more than m times
Reluctant quantifiers
X?? X, once or not at all
X*? X, zero or more times
X+? X, one or more times
X{n}? X, exactly n times
X{n,}? X, at least n times
X{n,m}? X, at least n but not more than m times
Possessive quantifiers
X?+ X, once or not at all
X*+ X, zero or more times
X++ X, one or more times
X{n}+ X, exactly n times
X{n,}+ X, at least n times
X{n,m}+ X, at least n but not more than m times
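The three quantifier families accept the same repetition counts but differ in how much input they take and whether they give any back. The sketch below, with the illustrative names QuantifierDemo and firstMatch, shows one way to observe that.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the text of the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<a><b>";
        System.out.println(firstMatch("<.+>", input));  // greedy: <a><b>
        System.out.println(firstMatch("<.+?>", input)); // reluctant: <a>
        System.out.println(firstMatch(".*b", "aab"));   // greedy, backtracks to find b: aab
        System.out.println(firstMatch(".*+b", "aab"));  // possessive, never backtracks: null
    }
}
```

The possessive form fails here because .*+ consumes the trailing b and refuses to release it, so the literal b that follows has nothing left to match.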
Logical operators
XY X followed by Y
X|Y Either X or Y
(X) X, as a capturing group
Back references
\n Whatever the nth capturing group matched
Quotation
\ Nothing, but quotes the following character
\Q Nothing, but quotes all characters until \E
\E Nothing, but ends quoting started by \Q
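A minimal sketch of back references and quotation follows. BackRefDemo and its helpers are illustrative names; Pattern.quote is the standard library method that wraps a string in \Q and \E.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackRefDemo {
    // Returns the text of the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Returns true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // \1 must repeat whatever group 1 captured, so this finds a doubled character.
        System.out.println(firstMatch("(\\w)\\1", "hello"));        // ll
        // Unquoted, the dot matches any character; quoted, it is a literal dot.
        System.out.println(matches("a.b", "axb"));                  // true
        System.out.println(matches("\\Qa.b\\E", "axb"));            // false
        System.out.println(matches("\\Qa.b\\E", "a.b"));            // true
        // Pattern.quote produces a quoted pattern for an arbitrary literal.
        System.out.println(matches(Pattern.quote("1+1"), "1+1"));   // true
    }
}
```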
Special constructs (non-capturing)
(?:X) X, as a non-capturing group
(?idmsux-idmsux) Nothing, but turns match flags i d m s u x on - off
(?idmsux-idmsux:X) X, as a non-capturing group with the given flags i d m s u x on - off
(?=X) X, via zero-width positive lookahead
(?!X) X, via zero-width negative lookahead
(?<=X) X, via zero-width positive lookbehind
(?<!X) X, via zero-width negative lookbehind
(?>X) X, as an independent, non-capturing group
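The special constructs can be explored with a sketch such as the one below. SpecialConstructDemo and its helpers are hypothetical names, and the sample inputs are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpecialConstructDemo {
    // Returns the text of the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Returns true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Inline flag: case-insensitive for the rest of the pattern.
        System.out.println(matches("(?i)hello", "HELLO"));
        // Flag scoped to a non-capturing group: only the h is case-insensitive.
        System.out.println(matches("(?i:h)ello", "Hello"));
        // Lookahead and lookbehind check context without consuming it.
        System.out.println(firstMatch("\\d+(?= USD)", "100 USD, 200 EUR")); // 100
        System.out.println(firstMatch("q(?!u)\\w*", "quit qat"));           // qat
        System.out.println(firstMatch("(?<=\\$)\\d+", "cost $42"));         // 42
        System.out.println(firstMatch("(?<!-)\\b\\d+", "-5 7"));            // 7
        // Independent (atomic) group: a+ keeps all the a's, so the trailing "ab" cannot match.
        System.out.println(matches("(?>a+)ab", "aaab"));                    // false
        System.out.println(matches("a+ab", "aaab"));                        // true
    }
}
```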
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Backslashes, escapes, and quoting
The backslash character ('\') serves to introduce escaped constructs, as defined in the table above, and also to quote characters that would otherwise be interpreted as unescaped constructs. Thus the expression \\ matches a single backslash, and \{ matches a left brace.
It is an error to use a backslash before any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language. A backslash may be used before a non-alphabetic character regardless of whether that character is part of an unescaped construct.
As required by the Java Language Specification, backslashes within string literals in Java source code are interpreted as either Unicode escapes or other character escapes. Backslashes must therefore be doubled in string literals that represent regular expressions, to protect them from interpretation by the Java compiler. The string literal "\b", for example, matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and causes a compile-time error; to match the string (hello), the string literal "\\(hello\\)" must be used.
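The doubling rule for Java string literals can be checked with a short sketch. LiteralEscapeDemo and found are illustrative names, not part of any library.

```java
import java.util.regex.Pattern;

class LiteralEscapeDemo {
    // Returns true if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // "\\b" reaches the regex engine as \b, a word boundary.
        System.out.println(found("\\bcat\\b", "a cat sat"));    // true
        // "\b" is a Java string escape for the backspace character itself.
        System.out.println(found("\b", "x\by"));                // true: input contains a backspace
        System.out.println(found("\b", "cat"));                 // false: no backspace in the input
        // Parentheses are quoted with a doubled backslash; "\(hello\)" would not compile.
        System.out.println(found("\\(hello\\)", "say (hello)")); // true
        // Four backslashes in source become the regex \\, which matches one backslash.
        System.out.println(found("\\\\", "C:\\temp"));          // true
    }
}
```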
Character classes
Character classes may appear within other character classes, and may be composed by the union operator (implicit) and the intersection operator (&&). The union operator denotes a class that contains every character that is in at least one of its operand classes. The intersection operator denotes a class that contains every character that is in both of its operand classes.
The precedence of character-class operators is as follows, from highest to lowest:
1 Literal escape \x
2 Grouping [...]
3 Range a-z
4 Union [a-e][i-u]
5 Intersection [a-z&&[aeiou]]
Note that a different set of metacharacters is in effect inside a character class than outside it. For example, the regular expression . loses its special meaning inside a character class, while the expression - becomes a range-forming metacharacter.
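To make the change of metacharacters concrete, here is a small sketch. ClassMetaDemo and matches are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

class ClassMetaDemo {
    // Returns true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Outside a class the dot matches any character; inside it is a literal dot.
        System.out.println(matches(".", "a"));       // true
        System.out.println(matches("[.]", "a"));     // false
        System.out.println(matches("[.]", "."));     // true
        // Inside a class the hyphen forms a range unless it is escaped.
        System.out.println(matches("[a-c]", "b"));   // true: b lies in the range a to c
        System.out.println(matches("[a\\-c]", "b")); // false: only a, - and c are allowed
        System.out.println(matches("[a\\-c]", "-")); // true
        // Intersection binds more loosely than range, so a-z is formed first.
        System.out.println(matches("[a-z&&[aeiou]]", "e")); // true
        System.out.println(matches("[a-z&&[aeiou]]", "b")); // false
    }
}
```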
Line terminators
A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:
A newline (line feed) character ('\n'),
A carriage-return character followed immediately by a newline character ("\r\n"),
A standalone carriage-return character ('\r'),
A next-line character ('\u0085'),
A line-separator character ('\u2028'), or
A paragraph-separator character ('\u2029').
If UNIX_LINES mode is activated, the newline character is the only line terminator recognized.
Unless the DOTALL flag is specified, the regular expression . matches any character except a line terminator.
By default, the regular expressions ^ and $ ignore line terminators and match only at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated, ^ matches at the beginning of the input and after any line terminator except at the end of the input. In MULTILINE mode, $ matches just before a line terminator or at the end of the input sequence.
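The effect of the DOTALL, MULTILINE, and UNIX_LINES flags can be observed with a sketch like this one. LineTerminatorDemo and its helpers are illustrative names; the flags themselves are constants of java.util.regex.Pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineTerminatorDemo {
    // Returns true if the whole input matches the regex compiled with the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // Counts the non-overlapping matches of regex in input under the given flags.
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // The dot skips line terminators unless DOTALL is set.
        System.out.println(matches("a.b", 0, "a\nb"));                   // false
        System.out.println(matches("a.b", Pattern.DOTALL, "a\nb"));      // true
        // With UNIX_LINES only the newline is a terminator, so the dot matches a carriage return.
        System.out.println(matches("a.b", 0, "a\rb"));                   // false
        System.out.println(matches("a.b", Pattern.UNIX_LINES, "a\rb"));  // true
        // ^ and $ anchor to the whole input by default and to each line in MULTILINE mode.
        String text = "one\ntwo\nthree";
        System.out.println(count("^\\w+$", 0, text));                    // 0
        System.out.println(count("^\\w+$", Pattern.MULTILINE, text));    // 3
    }
}
```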
Groups and capturing
Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:
1 ((A)(B(C)))
2 (A)
3 (B(C))
4 (C)
Group zero always stands for the entire expression.
Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence may be used later in the expression via a back reference, and may also be retrieved from the matcher once the match operation is complete.
The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification, its previously captured value, if any, is retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.
Groups beginning with (? are pure non-capturing groups that do not capture text and do not count towards the group total (the named capturing groups (?<name>X) added in Java 7 are the exception).
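The numbering and retention rules above can be verified with a minimal sketch. GroupDemo and groups are hypothetical names; the patterns and inputs are the ones used in the text.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns groups 0..groupCount() when the whole input matches, or null otherwise.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Groups are numbered by their opening parenthesis, left to right.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC"))); // [ABC, ABC, A, BC, C]
        // Group 2 keeps "b" from the first iteration because (b)? matches nothing in the second.
        System.out.println(Arrays.toString(groups("(a(b)?)+", "aba")));    // [aba, a, b]
        // The non-capturing group is not counted, so only group 0 and group 1 exist.
        System.out.println(Arrays.toString(groups("(?:a)(b)", "ab")));     // [ab, b]
    }
}
```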