Differences between regular expressions in JavaScript and in other languages

Author：Eve Cole Update Time：2025-07-31 23:00:03

Preface

Recently I noticed that regular expressions in JavaScript behave somewhat differently in a few places from those in other languages and tools, in ways that are rather unusual. You will almost never write patterns like these and will rarely need the rules described below, but they are still worth understanding.

The code examples in this article were run in a JavaScript environment compatible with ES5. Behavior in IE versions before IE9, in Firefox versions around 4, and so on may well differ from what is described below.

1. Empty character class

A character class that contains no characters, [], is called an empty character class (empty char class). You have probably never heard the term, because in other languages this syntax is illegal, and documentation and tutorials do not discuss illegal syntax. Here is how other languages and tools report the error:

    $ echo | grep '[]'
    grep: Unmatched [ or [^
    $ echo | sed '/[]/'
    sed: -e expression #1, char 4: unterminated address regex
    $ echo | awk '/[]/'
    awk: cmd. line:1: /[]/
    awk: cmd. line:1: ^ unterminated regexp
    awk: cmd. line:1: error: Unmatched [ or [^: /[]//
    $ echo | perl -ne '/[]/'
    Unmatched [ in regex; marked by <-- HERE in m/[ <-- HERE ]/ at -e line 1.
    $ echo | ruby -ne '/[]/'
    -e:1: empty char-class: /[]/
    $ python -c 'import re;re.match("[]","")'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "E:\Python\lib\re.py", line 137, in match
        return _compile(pattern, flags).match(string)
      File "E:\Python\lib\re.py", line 244, in _compile
        raise error, v # invalid expression
    sre_constants.error: unexpected end of regular expression

In JavaScript, the empty character class is a legal regex component, but it never matches: every match attempt against it fails. It is equivalent to an empty negative lookahead, (?!):

    js> "whatever\n".match(/[]/g)    // empty character class, never matches
    null
    js> "whatever\n".match(/(?!)/g)  // empty negative lookahead, never matches
    null
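A minimal sketch of the behavior described above, assuming a modern JavaScript engine such as Node.js. The helper name `neverMatches` and the sample strings are hypothetical, introduced only for illustration. It checks that both spellings fail on every input, including the empty string.

```javascript
// Hypothetical helper: true if `re` fails against every sample string.
function neverMatches(re, samples) {
  return samples.every(function (s) { return !re.test(s); });
}

const samples = ["", "a", "]", "whatever\n"];

const emptyClassNeverMatches = neverMatches(/[]/, samples);
const emptyLookaheadNeverMatches = neverMatches(/(?!)/, samples);

console.log(emptyClassNeverMatches);     // true
console.log(emptyLookaheadNeverMatches); // true
console.log("whatever\n".match(/[]/g));  // null
```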

Obviously, this kind of thing is useless in JavaScript.

2. Negated empty character class

A negated character class that contains no characters, [^], can be called a negated empty character class (negative empty char class, or empty negative char class). Either name will do, because I coined the term myself by analogy with the empty character class above. This syntax is also illegal in other languages:

    $ echo | grep '[^]'
    grep: Unmatched [ or [^
    $ echo | sed '/[^]/'
    sed: -e expression #1, char 5: unterminated address regex
    $ echo | awk '/[^]/'
    awk: cmd. line:1: /[^]/
    awk: cmd. line:1: ^ unterminated regexp
    awk: cmd. line:1: error: Unmatched [ or [^: /[^]//
    $ echo | perl -ne '/[^]/'
    Unmatched [ in regex; marked by <-- HERE in m/[ <-- HERE ^]/ at -e line 1.
    $ echo | ruby -ne '/[^]/'
    -e:1: empty char-class: /[^]/
    $ python -c 'import re;re.match("[^]","")'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "E:\Python\lib\re.py", line 137, in match
        return _compile(pattern, flags).match(string)
      File "E:\Python\lib\re.py", line 244, in _compile
        raise error, v # invalid expression
    sre_constants.error: unexpected end of regular expression

In JavaScript, the negated empty character class is a legal regex component. Its effect is exactly the opposite of the empty character class: it matches any character, including the newline "\n", so it is equivalent to the common idioms [\s\S] and [\w\W]:

    js> "whatever\n".match(/[^]/g)     // negated empty character class, matches any character
    ["w", "h", "a", "t", "e", "v", "e", "r", "\n"]
    js> "whatever\n".match(/[\s\S]/g)  // complementary character classes, matches any character
    ["w", "h", "a", "t", "e", "v", "e", "r", "\n"]

Note that it cannot be called an "always-matching regex", because a character class needs a character to match. If the target string is empty, or has already been fully consumed by the part of the regex to its left, the match fails. For example:

    js> /abc[^]/.test("abc")  // no character follows the c, so the match fails
    false
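The following sketch puts both points side by side, assuming a modern JavaScript engine such as Node.js. The variable names and the sample string "ab\n" are illustrative only. It shows that [^] and [\s\S] return the same matches, and that [^] fails whenever no character is left to consume.

```javascript
const text = "ab\n";

// Both spellings consume every character, the newline included.
const withNegatedEmpty = text.match(/[^]/g);
const withComplement = text.match(/[\s\S]/g);
console.log(JSON.stringify(withNegatedEmpty)); // ["a","b","\n"]
console.log(JSON.stringify(withComplement));   // ["a","b","\n"]

// A character class must still consume one character, so it fails
// when nothing is left to consume.
const onEmptyString = /[^]/.test("");
const nothingLeft = /abc[^]/.test("abc");
const oneLeft = /abc[^]/.test("abc\n");
console.log(onEmptyString, nothingLeft, oneLeft); // false false true
```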

If you want to see a regex that truly always matches, see an article I translated earlier on "empty" regexes.

3. []] and [^]]

This one is relatively simple. In the regular expressions of Perl and some Linux commands, if a right square bracket immediately follows the opening left square bracket of a character class, as in []], that right square bracket is treated as an ordinary character, so the class matches only "]". In JavaScript, the same pattern is parsed as an empty character class followed by a right square bracket, and since the empty character class matches nothing, the whole pattern never matches. [^]] is similar: in JavaScript it matches any one character (the negated empty character class) followed by a right square bracket, such as "a]" or "b]", whereas in other languages it matches any single character other than "]".

    $ perl -e 'print "]" =~ /[]]/'
    1
    $ js -e 'print(/[]]/.test("]"))'
    false
    $ perl -e 'print "x" =~ /[^]]/'
    1
    $ js -e 'print(/[^]]/.test("x"))'
    false
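A short sketch of how JavaScript parses these two patterns, assuming a modern engine such as Node.js and patterns written without the u flag. The variable names are illustrative. The last two lines show one way to get the Perl-style meaning in JavaScript: escape the bracket inside the class.

```javascript
// In JavaScript, []] parses as the empty class [] followed by a literal "]".
const emptyThenBracket = /[]]/;
console.log(emptyThenBracket.test("]"));  // false
console.log(emptyThenBracket.test("]]")); // false

// [^]] parses as "any one character" followed by a literal "]".
const anyThenBracket = /[^]]/;
console.log(anyThenBracket.test("x"));    // false: no "]" follows the x
console.log(anyThenBracket.test("a]"));   // true
console.log(anyThenBracket.test("\n]"));  // true: [^] also matches a newline

// Escaping the bracket gives the Perl-style meaning in JavaScript as well.
const bracketOnly = /[\]]/;
const notBracket = /[^\]]/;
console.log(bracketOnly.test("]")); // true
console.log(notBracket.test("x"));  // true
```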

4. The $ anchor

Some beginners think that $ matches the newline character "\n", which is a serious mistake. $ is a zero-width assertion: it cannot match an actual character, only a position. The difference I want to discuss arises in non-multiline mode. You might assume that in non-multiline mode $ matches only the position after the last character, but it is not that simple. In most other languages, if the last character of the target string is a newline "\n", $ also matches the position before that newline. In other words, it matches at both positions on either side of the trailing newline.

Many languages have the two anchors \Z and \z. If you know the difference between them, you will see that in other languages (Perl, Python, PHP, Java, C#...) $ in non-multiline mode is equivalent to Perl's \Z, while in JavaScript $ in non-multiline mode is equivalent to Perl's \z (it matches only at the very end, regardless of whether the last character is a newline). Ruby is a special case because it is in multiline mode by default. In multiline mode $ matches the position before every newline, which of course includes a newline at the very end. Yu Sheng's book "Regular Guidelines" also covers these points.

    $ perl -e 'print "whatever\n" =~ s/$/replace character/rg'  // global replace
    whateverreplace character  // the position before the newline is replaced
    replace character          // the position after the newline is replaced
    $ js -e 'print("whatever\n".replace(/$/g,"replace character"))'  // global replace
    whatever
    replace character          // the position after the newline is replaced
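The JavaScript side of this comparison can be sketched as follows, assuming a modern engine such as Node.js. The marker string "<END>" and the variable names are illustrative. The results are passed through JSON.stringify so that the newline stays visible in the output.

```javascript
const text = "whatever\n";

// Without the m flag, $ matches only at the very end of the string.
const plain = text.replace(/$/g, "<END>");
console.log(JSON.stringify(plain)); // "whatever\n<END>"

// With the m flag, $ also matches before the line terminator.
const multiline = text.replace(/$/gm, "<END>");
console.log(JSON.stringify(multiline)); // "whatever<END>\n<END>"

// Consequence: /a$/ rejects a string that ends in "a\n" unless m is set.
const strictEnd = /a$/.test("a\n");
const lineEnd = /a$/m.test("a\n");
console.log(strictEnd, lineEnd); // false true
```

This matters mostly when validating input that may carry a trailing newline: a pattern ported from Perl or Python can accept such input there and reject it in JavaScript.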

5. Dot metacharacter "."

In JavaScript regular expressions, the dot metacharacter "." matches every character except the four line terminators (\r carriage return, \n line feed, \u2028 line separator, \u2029 paragraph separator), whereas in many other common languages, such as Perl and Python, only the line feed \n is excluded.
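A quick check of the four exclusions, assuming a modern JavaScript engine such as Node.js. The variable names are illustrative. For contrast, the sketch also runs the negated empty character class from section 2 over the same characters.

```javascript
// The four line terminators that "." refuses to match in JavaScript.
const terminators = ["\r", "\n", "\u2028", "\u2029"];

const dotResults = terminators.map(function (ch) { return /./.test(ch); });
console.log(JSON.stringify(dotResults)); // [false,false,false,false]

// The negated empty class from section 2 has no such exceptions.
const anyResults = terminators.map(function (ch) { return /[^]/.test(ch); });
console.log(JSON.stringify(anyResults)); // [true,true,true,true]

// Other control characters, such as tab, are matched by "." as usual.
const dotMatchesTab = /./.test("\t");
console.log(dotMatchesTab); // true
```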

6. Forward references

We all know that regexes have backreferences: a backslash plus a number refers to the string already matched by an earlier capture group, either to match it again or to use it in a replacement result (where \ becomes $). But there is a special case: what happens if a backreference is used before the capture group it refers to has started (taking the group's left parenthesis as its start)? For example, in the regex /(\2(a)){2}/, (a) is the second capture group, but its match result is used to its left. Since a regex is matched from left to right, this is where the section title "forward reference" comes from. It is not a rigorous term. So think about it: what will the following JavaScript code return?

    js> /(\2(a)){2}/.exec("aaa")
    ???

Before answering, let's look at how other languages behave. In other languages, a pattern written this way is essentially useless:

    $ echo aaa | grep '(\2(a)){2}'
    grep: Invalid back reference
    $ echo aaa | sed -r '/(\2(a)){2}/'
    sed: -e expression #1, char 12: Invalid back reference
    $ echo aaa | awk '/(\2(a)){2}/'
    $ echo aaa | perl -ne 'print /(\2(a)){2}/'
    $ echo aaa | ruby -ne 'print $_ =~ /(\2(a)){2}/'
    $ python -c 'import re;print re.match("(\2(a)){2}","aaa")'
    None

awk reports no error because awk does not support backreferences, so \2 is interpreted as the character with ASCII code 2. Perl, Ruby, and Python report no error either. I don't know why they were designed this way (presumably the others copied Perl), but the effect is the same in all of them: in this situation the match can never succeed.

In JavaScript, not only is no error reported, the match actually succeeds. Let's see whether the answer is what you expected:

    js> /(\2(a)){2}/.exec("aaa")
    ["aa", "a", "a"]

In case you have forgotten what the exec method returns: the first element is the full matched string, i.e. RegExp["$&"], followed by the text matched by each capture group, i.e. RegExp.$1 and RegExp.$2. Why does the match succeed, and what does the matching process look like? My understanding is as follows.

First, we enter the first capture group (the leftmost left parenthesis). The first thing to match there is \2, but the second capture group (a) has not had its turn yet, so the value of RegExp.$2 is still undefined. \2 therefore matches the empty string to the left of the first a in the target string, in other words a "position", just like ^ and the other zero-width assertions. The point is that the match succeeds. Moving on, the second capture group (a) matches the first a in the target string, and RegExp.$2 is set to "a". Then the first capture group ends (the rightmost right parenthesis), and RegExp.$1 is also "a".

Next comes the quantifier {2}: starting after the first a in the target string, a new round of matching (\2(a)) begins. Here is the key question: does \2 now match the value assigned at the end of the first round, "a"? The answer is no. The values of RegExp.$1 and RegExp.$2 are reset to undefined, so \1 and \2 behave as they did the first time and successfully match the empty string (which amounts to having no effect, whether written or not). The second a in the target string is then matched, RegExp.$1 and RegExp.$2 become "a" again, and RegExp["$&"] becomes the full matched string, the first two a's: "aa".

In earlier versions of Firefox (3.6), a new iteration of a quantifier did not clear the values of existing capture groups, so in the second round \2 would match the second a, giving:

    js> /(\2(a)){2}/.exec("aaa")
    ["aaa", "a"]

In addition, a capture group ends only once its right parenthesis has been passed. Take /(a\1){3}/: although the first capture group has already started matching when \1 is used, it has not yet ended, so this is also a forward reference and \1 still matches the empty string:

    js> /(a\1){3}/.exec("aaa")
    ["aaa", "a"]
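The two forward-reference cases can be placed next to an ordinary backreference for contrast. This is a sketch assuming a modern JavaScript engine such as Node.js. The variable names and the pattern /(a)\1/ are illustrative additions. JSON.stringify is used so that only the array elements of the exec result are printed.

```javascript
// \2 is used before group 2 has captured anything, so it matches the
// empty string and each of the two iterations consumes a single "a".
const forward = /(\2(a)){2}/.exec("aaa");
console.log(JSON.stringify(forward)); // ["aa","a","a"]

// \1 is used before group 1 has closed, so it also matches the empty string.
const inside = /(a\1){3}/.exec("aaa");
console.log(JSON.stringify(inside)); // ["aaa","a"]

// For contrast, an ordinary backreference placed after its group has
// closed must repeat the captured text.
const ordinary = /(a)\1/.exec("aaa");
const ordinaryFails = /(a)\1/.test("ab");
console.log(JSON.stringify(ordinary)); // ["aa","a"]
console.log(ordinaryFails);            // false
```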

Another example:

    js> /(?:(f)(o)(o)|(b)(a)(r))*/.exec("foobar")
    ["foobar", undefined, undefined, undefined, "b", "a", "r"]

* is a quantifier. After the first round of matching: $1 is "f", $2 is "o", $3 is "o", $4 is undefined, $5 is undefined, and $6 is undefined.

At the start of the second round of matching: all captured values are reset to undefined.

After the second round of matching: $1 is undefined, $2 is undefined, $3 is undefined, $4 is "b", $5 is "a", and $6 is "r".

$& is set to "foobar", and the match ends.
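The reset at the start of each iteration can also be observed on a smaller pattern. The sketch below assumes a modern JavaScript engine such as Node.js. The helper `captures` and the pattern /(?:(a)|b)*/ are hypothetical, introduced only for illustration.

```javascript
// Hypothetical helper: list the capture groups of a match, spelling out
// undefined so that it survives printing.
function captures(match) {
  return match.slice(1).map(function (g) {
    return g === undefined ? "undefined" : g;
  });
}

const foobar = /(?:(f)(o)(o)|(b)(a)(r))*/.exec("foobar");
console.log(foobar[0]);                  // foobar
console.log(captures(foobar).join(",")); // undefined,undefined,undefined,b,a,r

// The same reset on a smaller pattern: whichever alternative ran in the
// last iteration decides what group 1 holds.
const endsWithB = /(?:(a)|b)*/.exec("ab");
const endsWithA = /(?:(a)|b)*/.exec("ba");
console.log(captures(endsWithB).join(",")); // undefined
console.log(captures(endsWithA).join(",")); // a
```

In practice this means a capture group inside a repeated group reports only what its last iteration captured, so code that reads such a group should be prepared for undefined.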

Summary

That is everything I have to say about the differences between regular expressions in JavaScript and in other languages. I hope this article is helpful for your study and work.