1. What is Unicode?
Unicode grew out of a very simple idea: collect every character in the world into a single character set. As long as a computer supports this character set, it can display any character, and garbled text disappears for good.
Unicode starts at 0 and assigns a number to each symbol, called a "code point". For example, the symbol at code point 0 is null (meaning all binary bits are 0).
```
U+0000 = null
```
In the formula above, U+ indicates that the hexadecimal number that follows is a Unicode code point.
At the time of writing, the latest version of Unicode is 7.0, containing a total of 109,449 symbols, of which 74,500 are Chinese, Japanese, and Korean (CJK) characters. Roughly speaking, more than two-thirds of the world's existing symbols come from East Asian writing systems. For example, the code point of the Chinese character 好 ("good") is the hexadecimal number 597D.
```
U+597D = 好
```
With so many symbols, Unicode was not defined all at once, but plane by plane. Each plane can hold 65,536 (2^16) characters. There are currently 17 planes, so the entire Unicode character set now spans 17 × 2^16 = 1,114,112 code points (a little over 2^20).
The first 65,536 code points form the basic plane (the Basic Multilingual Plane, abbreviated BMP), ranging from 0 to 2^16 − 1, or U+0000 to U+FFFF in hexadecimal. All of the most common characters live in this plane, which was the first one Unicode defined and published.
The remaining characters are placed in the supplementary planes (abbreviated SMP), with code points ranging from U+010000 to U+10FFFF.
2. UTF-32 and UTF-8
Unicode only specifies each character's code point. Which byte sequence is actually used to represent that code point is a matter of the encoding method.
The most straightforward encoding represents every code point with four bytes whose value corresponds to the code point one-to-one. This encoding is called UTF-32. For example, code point 0 is four zero bytes, and code point 0x597D is the same value padded with two leading zero bytes.
```
U+0000 = 0x0000 0000
U+597D = 0x0000 597D
```
The advantage of UTF-32 is that the conversion rule is simple and direct and lookup is fast. The disadvantage is wasted space: English text becomes four times larger than the same text in ASCII. This drawback is fatal in practice, and essentially nobody uses this encoding; the HTML5 standard explicitly stipulates that web pages must not be encoded as UTF-32.
What people really need is a space-saving encoding, and that need led to the birth of UTF-8. UTF-8 is a variable-length encoding, with character lengths ranging from 1 to 4 bytes: the more common the character, the shorter its encoding. The first 128 characters use only 1 byte, exactly matching ASCII.
```
Code point range       Bytes
0x0000   - 0x007F      1
0x0080   - 0x07FF      2
0x0800   - 0xFFFF      3
0x010000 - 0x10FFFF    4
```
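Assuming a modern JavaScript engine is available, the table can be spot-checked with the standard TextEncoder API (built into browsers and Node.js), which always encodes strings to UTF-8:

```javascript
var encoder = new TextEncoder(); // TextEncoder always produces UTF-8

console.log(encoder.encode('A').length);  // 1 byte:  U+0041 falls in 0x0000-0x007F
console.log(encoder.encode('é').length);  // 2 bytes: U+00E9 falls in 0x0080-0x07FF
console.log(encoder.encode('好').length); // 3 bytes: U+597D falls in 0x0800-0xFFFF
console.log(encoder.encode('𝌆').length);  // 4 bytes: U+1D306 falls in 0x010000-0x10FFFF
```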
Due to UTF-8's space-saving feature, it has become the most common web encoding on the Internet. However, it has little to do with today's topic, so I won't go into it. For the specific transcoding method, please refer to "Character Encoding Notes" I wrote many years ago.
3. Introduction to UTF-16
UTF-16 encoding is between UTF-32 and UTF-8, and it combines the characteristics of two encoding methods: fixed length and variable length.
Its encoding rule is simple: characters in the basic plane take 2 bytes, and characters in the supplementary planes take 4 bytes. That is, a UTF-16 encoding is either 2 bytes long (U+0000 to U+FFFF) or 4 bytes long (U+010000 to U+10FFFF).
This raises a question: when we encounter two bytes, how do we know whether they form a character by themselves or must be interpreted together with the next two bytes?
The solution is quite clever, and I don't know whether it was intentional. In the basic plane, the range U+D800 to U+DFFF is an empty segment: these code points correspond to no characters. This empty segment can therefore be used to map the characters of the supplementary planes.
Specifically, the supplementary planes contain 2^20 code points, so representing these characters requires at least 20 binary bits. UTF-16 splits those 20 bits in half. The first 10 bits are mapped into U+D800 to U+DBFF (a space of size 2^10), called the high surrogate (H); the last 10 bits are mapped into U+DC00 to U+DFFF (also of size 2^10), called the low surrogate (L). In other words, a supplementary plane character is expressed as two basic plane "characters".
Therefore, when we encounter two bytes whose code point lies between U+D800 and U+DBFF, we can conclude that the code point of the next two bytes must lie between U+DC00 and U+DFFF, and that the four bytes must be interpreted together.
4. UTF-16 transcoding formula
When converting a Unicode code point to UTF-16, first determine whether it is a basic plane character or a supplementary plane character. If it is the former, convert the code point directly to the corresponding hexadecimal form, two bytes long.
```
U+597D = 0x597D
```
If it is a supplementary plane character, Unicode version 3.0 gives the transcoding formula.
```javascript
H = Math.floor((c - 0x10000) / 0x400) + 0xD800
L = (c - 0x10000) % 0x400 + 0xDC00
```
Take the character 𝌆 as an example. It is a supplementary plane character with code point U+1D306. The calculation for converting it to UTF-16 goes as follows.
```javascript
H = Math.floor((0x1D306 - 0x10000) / 0x400) + 0xD800 = 0xD834
L = (0x1D306 - 0x10000) % 0x400 + 0xDC00 = 0xDF06
```
Therefore, the UTF-16 encoding of 𝌆 is 0xD834 0xDF06, four bytes long.
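The formula above can be wrapped in a small helper function (the name codePointToUTF16 is my own, not part of any standard API):

```javascript
// Convert a Unicode code point to its UTF-16 code units (a sketch).
function codePointToUTF16(c) {
  if (c < 0x10000) return [c]; // basic plane: the code point is its own code unit
  var H = Math.floor((c - 0x10000) / 0x400) + 0xD800; // high surrogate
  var L = (c - 0x10000) % 0x400 + 0xDC00;             // low surrogate
  return [H, L];
}

var pair = codePointToUTF16(0x1D306);
console.log(pair.map(function (n) { return n.toString(16); })); // ['d834', 'df06']
console.log(String.fromCharCode(pair[0], pair[1]) === '𝌆');     // true
```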
5. Which encoding is used in JavaScript?
The JavaScript language uses Unicode character sets, but only supports one encoding method.
This encoding is neither UTF-16, nor UTF-8, nor UTF-32. JavaScript does not use the above encoding methods.
JavaScript uses UCS-2!
6. UCS-2 encoding
Why did a UCS-2 suddenly appear? This requires a little history.
In the era when the Internet had not yet appeared, there were two teams who both wanted to create a unified character set. One is the Unicode team established in 1988, and the other is the UCS team established in 1989. When they discovered each other's existence, they quickly reached an agreement: there is no need for two unified character sets in the world.
In October 1991, the two teams decided to merge the character set. In other words, only one set of character sets will be released from now on, which is Unicode, and the previously released character sets will be revised, and the code points of UCS will be exactly the same as Unicode.
UCS development moved faster than Unicode's. In 1990, it announced its first encoding method, UCS-2, which uses 2 bytes to represent a character's code point. (At the time there was only one plane, the basic plane, so 2 bytes was enough.) UTF-16 was not announced until July 1996, and it was explicitly declared a superset of UCS-2: basic plane characters keep the UCS-2 encoding, while supplementary plane characters are given a 4-byte representation.
Simply put, UTF-16 superseded UCS-2; or rather, UCS-2 was absorbed into UTF-16. So today there is only UTF-16, no UCS-2.
7. The birth background of JavaScript
So, why does JavaScript not choose the more advanced UTF-16, but uses the already obsolete UCS-2?
The answer is very simple: it is not that it didn't want to, it is that it couldn't. When the JavaScript language appeared, the UTF-16 encoding did not yet exist.
In May 1995, Brendan Eich designed the JavaScript language in 10 days; in October, the first interpreter shipped; in November of the following year, Netscape formally submitted the language standard to ECMA (see "The Birth of JavaScript" for the full story). Comparing these dates with UTF-16's release (July 1996), you can see that Netscape had no other choice at the time: only the UCS-2 encoding was available!
8. Limitations of JavaScript character functions
Since JavaScript can only handle UCS-2, all characters in this language are 2 bytes; a 4-byte character is treated as two 2-byte characters. JavaScript's string functions are all affected by this and cannot return correct results.
Take 𝌆 again as an example: its UTF-16 encoding is the 4 bytes 0xD834 0xDF06. The problem is that this 4-byte encoding does not belong to UCS-2, so JavaScript does not recognize it and simply treats it as the two separate "characters" U+D834 and U+DF06. As mentioned earlier, these two code points are empty (they correspond to no character), so JavaScript regards 𝌆 as a string composed of two non-characters!
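As a sketch, here is roughly what this looks like in the console (the results are the same in old and new engines alike, because these functions still operate on 2-byte code units):

```javascript
var s = '𝌆'; // U+1D306, stored internally as the surrogate pair 0xD834 0xDF06

console.log(s.length);                     // 2 — counted as two characters
console.log(s.charCodeAt(0).toString(16)); // 'd834' — the high surrogate alone
console.log(s.charCodeAt(1).toString(16)); // 'df06' — the low surrogate alone
console.log(s.charAt(0));                  // a lone surrogate, not a displayable character
```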
In other words, JavaScript believes the length of 𝌆 is 2; charAt() returns a lone surrogate instead of a real character; and charCodeAt(0) returns 0xD834 rather than the true code point 0x1D306. None of these results is correct!
To solve this problem, you must test the code point and then adjust manually. The following is the correct way to traverse a string.
```javascript
var string = '𝌆A';
var index = -1;
var length = string.length;
var output = [];
while (++index < length) {
  var character = string.charAt(index);
  var charCode = character.charCodeAt(0);
  if (charCode >= 0xD800 && charCode <= 0xDBFF) {
    // High surrogate: the next code unit belongs to the same character
    output.push(character + string.charAt(++index));
  } else {
    output.push(character);
  }
}
```
The code above shows that when traversing a string, you must test each code point: whenever it falls in the range 0xD800 to 0xDBFF, it must be read together with the next 2 bytes.
Similar problems exist in all JavaScript character manipulation functions.
String.prototype.replace()
String.prototype.substring()
String.prototype.slice()
...
All of the functions above only work correctly with 2-byte code points. To process 4-byte code points correctly, you must write your own version of each one, checking the code point range of the current character.
9. ECMAScript 6
The next version of JavaScript, ECMAScript 6 (ES6 for short), greatly enhances Unicode support and basically solves this problem.
(1) Correctly identify characters
ES6 can automatically recognize 4-byte code points. Therefore, it is much easier to traverse strings.
```javascript
for (let s of string) {
  // ...
}
```
However, to maintain compatibility, the length property keeps its original behavior. To get the true character count of a string, you can use the following method.
```javascript
Array.from(string).length
```
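Putting length and the ES6 facilities side by side (the sample string is my own):

```javascript
var s = '𝌆A';

console.log(s.length);             // 3 — length still counts UTF-16 code units
console.log(Array.from(s).length); // 2 — counts actual characters

for (let c of s) {
  console.log(c); // logs '𝌆', then 'A' — the surrogate pair stays intact
}
```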
(2) Code point representation
JavaScript allows Unicode characters to be represented directly with code points, which is written as "backslash + u + code points".
```javascript
'好' === '\u597D' // true
```
However, this notation does not work for 4-byte code points. ES6 fixes the problem: as long as the code point is placed in curly braces, it is recognized correctly.
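A quick illustration of both notations (the particular characters are incidental):

```javascript
// The old \uXXXX notation reads exactly four hex digits,
// so a five-digit code point is misparsed:
console.log('\u1D306' === '\u1D30' + '6'); // true — not the character U+1D306

// ES6: wrap the code point in curly braces
console.log('\u{1D306}' === '\uD834\uDF06'); // true
```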
(3) String processing function
ES6 has added several new functions that specifically deal with 4-byte code points.
String.fromCodePoint(): returns the character corresponding to a Unicode code point
String.prototype.codePointAt(): returns the code point corresponding to a character
String.prototype.at(): returns the character at a given position in a string
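For example, the first two behave as follows:

```javascript
// fromCodePoint accepts full code points, including supplementary-plane ones
var s = String.fromCodePoint(0x1D306);
console.log(s === '𝌆'); // true

// codePointAt reads the whole surrogate pair starting at index 0
console.log(s.codePointAt(0).toString(16)); // '1d306'
// but indices still count code units: index 1 is the lone low surrogate
console.log(s.codePointAt(1).toString(16)); // 'df06'
```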
(4) Regular expression
ES6 provides the u modifier, which adds support for 4-byte code points to regular expressions.
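For instance:

```javascript
var s = '𝌆';

// Without the u flag, '.' matches a single UTF-16 code unit,
// so a supplementary-plane character needs two of them:
console.log(/^.$/.test(s));  // false
console.log(/^..$/.test(s)); // true

// With the u flag, '.' matches a whole code point:
console.log(/^.$/u.test(s)); // true
```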
(5) Unicode normalization
Some characters carry additional marks beyond the base letter. In Chinese pinyin, for example, the tone mark above a letter is such an additional symbol; tone and accent marks are also very important in many European languages.
Unicode provides two ways to represent them. One is a single precomposed character that includes the mark, i.e. one code point for one character; for example, Ǒ's code point is U+01D1. The other gives the mark its own code point and combines it with the base character for display, i.e. two code points for one character; for example, Ǒ can also be written as O (U+004F) + ˇ (U+030C).
```javascript
// Method 1
'\u01D1' // 'Ǒ'

// Method 2
'\u004F\u030C' // 'Ǒ'
```
These two representations are identical in appearance and semantics and should be treated as equivalent. JavaScript, however, cannot tell them apart.
```javascript
'\u01D1' === '\u004F\u030C' // false
```
ES6 provides the normalize() method to perform "Unicode normalization", that is, to convert both representations into the same internal sequence.
```javascript
'\u01D1'.normalize() === '\u004F\u030C'.normalize() // true
```
For more on ES6, see "Introduction to ECMAScript 6".