JavaScript's character set:
JavaScript programs are written using the Unicode character set. Unicode is a superset of ASCII and Latin-1 and supports almost all languages on Earth. ECMAScript 3 requires JavaScript implementations to support Unicode 2.1 or later, while ECMAScript 5 requires support for Unicode 3 or later. So the strings our JavaScript programs produce are all Unicode-encoded (in practice, as 16-bit UTF-16 code units).
UTF-8
UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode and is also a prefix code.
It can represent any character in the Unicode standard, and its single-byte encodings are identical to ASCII, so software originally written to handle ASCII text can keep working with little or no modification. As a result, it has gradually become the preferred encoding for email, web pages, and other applications that store or transmit text.
Most websites currently use UTF-8 encoding.
Converting a Unicode string produced by JavaScript into a UTF-8 encoded string
As the title says, this is a very common scenario. For example, when sending binary data to a server, the server may stipulate that the binary content must be UTF-8 encoded. In that case, our program has to convert JavaScript's Unicode string into a UTF-8 encoded string.
Conversion method
Before converting, we must understand that the Unicode encoding JavaScript uses for strings has a fixed-width structure: every code unit is 16 bits (2 bytes).
If you don't believe it, try the charCodeAt method of String and see how many bytes the returned code needs.
•The code of an English letter fits in 1 byte, while the code of a Chinese character needs 2 bytes
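A quick way to try this out is sketched below. The variable names are only illustrative; the point is that charCodeAt returns a number, and the size of that number tells you how many bytes are needed to hold it.

```javascript
// Illustrative check of the values charCodeAt returns.
var letterCode = 'A'.charCodeAt(0);
var hanCode = '中'.charCodeAt(0);

// The code of "A" is small enough to fit in a single byte (<= 0xff).
console.log(letterCode, letterCode <= 0xff);

// The code of "中" does not fit in one byte, but fits in two (<= 0xffff).
console.log(hanCode, hanCode <= 0xff, hanCode <= 0xffff);
```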
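UTF-8, by contrast, is variable-length: the number of bytes depends on the size of each character's Unicode code.
The list below shows how many bytes a character takes for each code range. In the original UTF-8 design a single Unicode character can take up to 6 bytes; the current standard (RFC 3629) restricts UTF-8 to at most 4 bytes, ending at 0x10FFFF.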
•1 byte: Unicode code is 0 - 127
•2 bytes: Unicode code is 128 - 2047
•3 bytes: Unicode code is 2048 - 0xFFFF
•4 bytes: Unicode code is 65536 - 0x1FFFFF
•5 bytes: Unicode code is 0x200000 - 0x3FFFFFF
•6 bytes: Unicode code is 0x4000000 - 0x7FFFFFFF
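The ranges above can be turned into a small lookup. This is only a sketch: utf8ByteLength is a hypothetical helper name, and it follows the six-range list as written here rather than the 4-byte limit of the current standard.

```javascript
// Hypothetical helper: how many UTF-8 bytes one Unicode code needs,
// following the six ranges listed above.
function utf8ByteLength(codePoint) {
  if (codePoint < 0 || codePoint > 0x7fffffff) {
    throw new RangeError('code out of range: ' + codePoint);
  }
  if (codePoint <= 0x7f) return 1;
  if (codePoint <= 0x7ff) return 2;
  if (codePoint <= 0xffff) return 3;
  if (codePoint <= 0x1fffff) return 4;
  if (codePoint <= 0x3ffffff) return 5;
  return 6;
}

console.log(utf8ByteLength('A'.charCodeAt(0)));  // 1
console.log(utf8ByteLength('中'.charCodeAt(0))); // 3
```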
For details, please see the picture:
Because the Unicode codes of English letters and the other ASCII characters are 0 - 127, English text is byte-for-byte the same in ASCII and in UTF-8, and each character occupies only 1 byte. This is why UTF-8 is a superset of ASCII!
Now let's discuss Chinese characters. The Unicode code range of the commonly used Chinese characters is 0x2e80 - 0x9fff, so a Chinese character takes 3 bytes in UTF-8.
So how does a Chinese character go from 2 bytes of Unicode to 3 bytes of UTF-8?
Suppose we need to convert the Chinese character "中" into UTF-8 encoding.
1. Get the Unicode value of the Chinese character
var str = '中';
var charCode = str.charCodeAt(0);
console.log(charCode); // => 20013
2. Determine the UTF-8 length from the value
From the previous step, the charCode of "中" is 20013. Since 20013 falls in the range 2048 - 0xFFFF, "中" occupies 3 bytes in UTF-8.
3. Fill in the bits
Now that we know "中" needs 3 bytes, how do we get those 3 bytes?
This requires a fill pattern (a bit template). The logic is as follows:
OK, I know the picture is hard to follow, so I'll just explain it!
The fill patterns are listed below, one for each length from 1 to 6 bytes. "x" marks an empty position that is filled with the bits of the code.
•0xxxxxxx
•110xxxxx 10xxxxxx
•1110xxxx 10xxxxxx 10xxxxxx
•11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
•111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
•1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
Note: have you noticed? The first byte of each pattern tells you how many bytes the whole UTF-8 sequence occupies: the number of leading 1 bits equals the byte count, and a leading 0 means a single byte. Decoding UTF-8 back to Unicode relies on this feature.
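As a rough sketch of how a decoder can use this, the function below reads the sequence length from the first byte by masking its leading bits. The name utf8SequenceLength is hypothetical, and returning 0 for a byte that cannot start a sequence is just one possible convention.

```javascript
// Hypothetical helper: read the byte count of a UTF-8 sequence from its
// first byte by testing the leading bits against each fill pattern.
function utf8SequenceLength(firstByte) {
  if ((firstByte & 0x80) === 0x00) return 1; // 0xxxxxxx
  if ((firstByte & 0xe0) === 0xc0) return 2; // 110xxxxx
  if ((firstByte & 0xf0) === 0xe0) return 3; // 1110xxxx
  if ((firstByte & 0xf8) === 0xf0) return 4; // 11110xxx
  if ((firstByte & 0xfc) === 0xf8) return 5; // 111110xx
  if ((firstByte & 0xfe) === 0xfc) return 6; // 1111110x
  // Anything else (for example a 10xxxxxx continuation byte) cannot
  // start a sequence.
  return 0;
}

console.log(utf8SequenceLength(0x41)); // 1
console.log(utf8SequenceLength(0xe4)); // 3
```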
Let's start with a simple example: converting the English letter "A" to UTF-8.
1. The charCode of "A" is 65
2. 65 is in the range 0 - 127, so "A" occupies one byte
3. The one-byte fill pattern in UTF-8 is 0xxxxxxx, where x marks an empty position to be filled
4. Convert 65 to binary to get 1000001
5. Fill 1000001 into the empty positions of 0xxxxxxx, from front to back, to get 01000001
6. Convert 01000001 back into a string to get "A"
7. So "A" is unchanged after UTF-8 encoding.
This small example confirms once again that UTF-8 is a superset of ASCII!
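The steps for "A" can be replayed in a few lines. This is just an illustration with made-up variable names; it uses padStart to place the 7 bits into the 0xxxxxxx pattern.

```javascript
// Replaying the "A" example step by step.
var letterCode = 'A'.charCodeAt(0);          // step 1: 65
var letterBits = letterCode.toString(2);     // step 4: '1000001'
var letterByte = letterBits.padStart(8, '0'); // step 5: fill 0xxxxxxx
var letterBack = String.fromCharCode(parseInt(letterByte, 2)); // step 6

console.log(letterBits); // 1000001
console.log(letterByte); // 01000001
console.log(letterBack); // A
```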
Okay, back to the Chinese character "中". We already know its charCode is 20013, which in binary is 01001110 00101101:
var code = 20013;
code.toString(2); // => 100111000101101, i.e. 01001110 00101101 when padded to 16 bits
Then we fill in the positions the same way we did for "A" above.
Fill 01001110 00101101 from front to back into 1110xxxx 10xxxxxx 10xxxxxx to get 11100100 10111000 10101101.
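The same filling can be done with shifts and masks instead of by hand. The sketch below assumes a code in the 3-byte range; the variable names are illustrative.

```javascript
// Filling 1110xxxx 10xxxxxx 10xxxxxx for the code of "中" with bit operations.
var hanCode = 0x4e2d; // 20013

var byte1 = 0xe0 | (hanCode >> 12);         // 1110xxxx: top 4 bits
var byte2 = 0x80 | ((hanCode >> 6) & 0x3f); // 10xxxxxx: middle 6 bits
var byte3 = 0x80 | (hanCode & 0x3f);        // 10xxxxxx: low 6 bits

var hanBytes = [byte1, byte2, byte3];
console.log(hanBytes.map(function (b) { return b.toString(2); }).join(' '));
// prints: 11100100 10111000 10101101
```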
4. Get the UTF-8 encoded content
From the steps above we get the three UTF-8 bytes of "中": 11100100 10111000 10101101.
Converting each byte to hexadecimal gives 0xE4 0xB8 0xAD.
These bytes, 0xE4 0xB8 0xAD, are the UTF-8 encoding we were after.
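If you want to check the binary-to-hexadecimal step, a short illustrative snippet such as this one will do it (the variable names are made up for the example):

```javascript
// Convert each binary byte string to its hexadecimal form.
var binaryBytes = ['11100100', '10111000', '10101101'];
var hexBytes = binaryBytes.map(function (bits) {
  return '0x' + parseInt(bits, 2).toString(16).toUpperCase();
});

console.log(hexBytes.join(' ')); // 0xE4 0xB8 0xAD
```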
Let's use a Node.js Buffer to verify the result.
var buffer = Buffer.from('中'); // new Buffer(string) is deprecated in current Node.js
console.log(buffer.length); // => 3
console.log(buffer); // => <Buffer e4 b8 ad>
// Finally we get three bytes: 0xe4 0xb8 0xad
Because hexadecimal is case-insensitive, this is exactly the 0xE4 0xB8 0xAD we calculated.
Now let's write the encoding logic above as functions.
// Encode a string into UTF-8 bytes
var writeUTF = function (str, isGetBytes) {
    var back = [];
    var byteSize = 0;
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        if (0x00 <= code && code <= 0x7f) {
            // 1 byte: 0xxxxxxx
            byteSize += 1;
            back.push(code);
        } else if (0x80 <= code && code <= 0x7ff) {
            // 2 bytes: 110xxxxx 10xxxxxx
            byteSize += 2;
            back.push(192 | (31 & (code >> 6)));
            back.push(128 | (63 & code));
        } else if ((0x800 <= code && code <= 0xd7ff) || (0xe000 <= code && code <= 0xffff)) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            byteSize += 3;
            back.push(224 | (15 & (code >> 12)));
            back.push(128 | (63 & (code >> 6)));
            back.push(128 | (63 & code));
        }
        // Surrogate code units (0xd800 - 0xdfff) are skipped, so characters
        // above 0xffff are not handled by this function.
    }
    for (i = 0; i < back.length; i++) {
        back[i] &= 0xff;
    }
    if (isGetBytes) {
        return back;
    }
    // Prefix the bytes with a two-byte length (high byte first)
    if (byteSize <= 0xff) {
        return [0, byteSize].concat(back);
    } else {
        return [byteSize >> 8, byteSize & 0xff].concat(back);
    }
};

writeUTF('中'); // => [0, 3, 228, 184, 173]
// The first two bytes hold the length of the UTF-8 bytes that follow. The length is 3, so they are `0, 3`
// The content is `228, 184, 173`, which in hexadecimal is `0xE4 0xB8 0xAD`

// Decode UTF-8 bytes (with the two-byte length prefix) back into a Unicode string
var readUTF = function (arr) {
    if (typeof arr === 'string') {
        return arr;
    }
    // The original called a helper, this.init(arr), that is not shown. Here we
    // assume its job was to strip the two-byte length prefix added by writeUTF.
    var UTF = '', _arr = arr.slice(2);
    for (var i = 0; i < _arr.length; i++) {
        var one = _arr[i].toString(2),
            v = one.match(/^1+?(?=0)/);
        if (v && one.length == 8) {
            // Multi-byte sequence: the number of leading 1s is the byte count
            var bytesLength = v[0].length;
            // The payload of the first byte starts after the leading 1s and the 0
            var store = one.slice(bytesLength + 1);
            for (var st = 1; st < bytesLength; st++) {
                // Drop the "10" prefix of each continuation byte
                store += _arr[st + i].toString(2).slice(2);
            }
            UTF += String.fromCharCode(parseInt(store, 2));
            i += bytesLength - 1;
        } else {
            UTF += String.fromCharCode(_arr[i]);
        }
    }
    return UTF;
};

readUTF([0, 3, 228, 184, 173]); // => '中'
Another method: obtaining the UTF-8 bytes of Chinese text
There is another, simpler way to convert Chinese text into UTF-8 bytes. The browser provides it, and everyone has used it. What is it? It's encodeURI. Of course, encodeURIComponent works too.
That's right, that's the method. So how does it convert a Unicode-encoded Chinese character into UTF-8 bytes?
var str = '中';
var code = encodeURI(str);
console.log(code); // => %E4%B8%AD
As you can see, we get an escaped string, and its content matches the bytes we computed above.
Next we convert %E4%B8%AD into an array of numbers.
var codeList = code.split('%').slice(1); // drop the empty string before the first '%'
codeList = codeList.map(item => parseInt(item, 16));
console.log(codeList); // => [228, 184, 173]
Simple, isn't it?
What is the principle behind this simple method?
It comes from how query strings are encoded in URIs. By convention, non-ASCII characters in a URI are transmitted as percent-encoded UTF-8 bytes, while JavaScript strings are Unicode, so the browser gives us the encodeURI/encodeURIComponent methods. These methods first convert non-English characters (think about it: why only non-English characters?) into UTF-8 bytes, then put a % in front of each byte and join them, which is how escaping the Chinese character "中" gave us "%E4%B8%AD".
Well, that's the whole principle, nothing more.
However, this method has a drawback: it only escapes non-English characters. English letters and other unescaped characters come through as themselves rather than as %XX, so when we need their UTF-8 bytes too, we have to convert those characters ourselves.
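One way to work around that drawback is sketched below. The helper name utf8BytesViaEncodeURI is hypothetical. It walks the escaped string and treats each %XX as one byte and every unescaped character as its own ASCII code. Note that encodeURIComponent throws a URIError if the string contains a lone surrogate.

```javascript
// Hypothetical helper: get the UTF-8 bytes of a string from the output of
// encodeURIComponent, covering unescaped ASCII characters as well.
function utf8BytesViaEncodeURI(str) {
  var escaped = encodeURIComponent(str);
  var bytes = [];
  for (var i = 0; i < escaped.length; i++) {
    if (escaped.charAt(i) === '%') {
      // '%XX' stands for one escaped byte
      bytes.push(parseInt(escaped.substring(i + 1, i + 3), 16));
      i += 2;
    } else {
      // Unescaped characters are ASCII, so the char code is the byte itself
      bytes.push(escaped.charCodeAt(i));
    }
  }
  return bytes;
}

console.log(utf8BytesViaEncodeURI('A中')); // => [65, 228, 184, 173]
```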
So what do we do when we want to decode? Just use decodeURI/decodeURIComponent.
var codeList = [228, 184, 173];
var code = codeList.map(item => '%' + ('0' + item.toString(16)).slice(-2)).join(''); // pad each byte to two hex digits
decodeURI(code); // => '中'
That wraps up this introduction to UTF-8 encoding.
I hope it helps you understand the principles of UTF-8 encoding.