Support for parsing string types and character encoding in JavaScript

Author：Eve Cole Update Time：2025-07-04 13:32:01

Definition

A string is zero or more characters arranged together, placed in single or double quotes.

 'abc'
 "abc"

Double quotes can be used inside a single quote string. Inside a double quote string, single quotes can be used.

 'key = "value"'
 "It's a long journey"

Both of the above are legal strings.

If you want to use single quotes inside a single-quoted string (or double quotes inside a double-quoted string), you must prefix each inner quote with a backslash to escape it.

 'Did she say \'Hello\'?'
 // "Did she say 'Hello'?"

 "Did she say \"Hello\"?"
 // "Did she say "Hello"?"

Since HTML attribute values use double quotes, many projects agree to use only single quotes for JavaScript strings, and this tutorial follows that convention. Of course, using only double quotes is also perfectly fine. What matters is sticking to one style and not mixing the two.

By default, a string must be written on a single line. Splitting it across multiple lines causes an error.

 'a
 b
 c'
 // SyntaxError: Unexpected token ILLEGAL

The above code splits a string across three lines, so JavaScript reports an error.

If a long string must be split across multiple lines, a backslash can be used at the end of each line.

 var longString = "Long \
long \
long \
string";

 longString
 // "Long long long string"

The above code shows that, with a backslash at the end of each line, a string that would otherwise be written on one line can be split across several lines. The output is still a single line, exactly as if the string had been written on one line. Note that the backslash must be immediately followed by the newline; no other character (such as a space) may come between them, otherwise an error is reported.

The concatenation operator (+) can also join several single-line strings, so that a long string is written across multiple lines while the output is still a single line.

 var longString = 'Long '
   + 'long '
   + 'long '
   + 'string';
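As a quick check of the two approaches above, the following sketch builds the same string once with line continuations and once with the + operator, then compares the results. The variable names `continued` and `concatenated` are illustrative only.

```javascript
// Line continuation: each backslash must be the very last character on its line.
var continued = 'Long \
long \
long \
string';

// Concatenation: each piece is an ordinary single-line string.
var concatenated = 'Long ' +
  'long ' +
  'long ' +
  'string';

console.log(continued === concatenated); // true
console.log(continued.indexOf('\n'));    // -1, the value contains no newline
```

Concatenation is usually the safer of the two, because a stray space after a continuation backslash is invisible in most editors.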

If you want to output a multi-line string, one workaround is to make use of a multi-line comment.

 (function () { /*
line 1
line 2
line 3
*/}).toString().split('\n').slice(1, -1).join('\n')
 // "line 1
 // line 2
 // line 3"

In the above example, the output string spans multiple lines.

Escape

The backslash (\) has a special meaning in a string: it is used to represent certain special characters, so it is also called the escape character.

Special characters that need to be escaped with backslashes are mainly as follows:

\0 null (\u0000)
\b backspace (\u0008)
\f form feed (\u000C)
\n newline (\u000A)
\r carriage return (\u000D)
\t tab (\u0009)
\v vertical tab (\u000B)
\' single quote (\u0027)
\" double quote (\u0022)
\\ backslash (\u005C)

When preceded by a backslash, each of the above characters takes on a special meaning.

 console.log('1\n2')
 // 1
 // 2

In the above code, \n represents a newline, so the output is split across two lines.
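The escape list above can be checked mechanically. The sketch below pairs each escape sequence with the code point given in the list and confirms that each one produces exactly one character with that code point. The names `escapes` and `allMatch` are illustrative only.

```javascript
// Each entry pairs an escape sequence with the code point from the list above.
var escapes = [
  ['\0', 0x0000], // null
  ['\b', 0x0008], // backspace
  ['\f', 0x000C], // form feed
  ['\n', 0x000A], // newline
  ['\r', 0x000D], // carriage return
  ['\t', 0x0009], // tab
  ['\v', 0x000B], // vertical tab
  ['\'', 0x0027], // single quote
  ['\"', 0x0022], // double quote
  ['\\', 0x005C]  // backslash
];

var allMatch = escapes.every(function (pair) {
  // An escape sequence is written with two characters but yields only one.
  return pair[0].length === 1 && pair[0].charCodeAt(0) === pair[1];
});

console.log(allMatch); // true
```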

There are three special uses for backslashes.

(1) \HHH

The backslash is followed by three octal digits (000 to 377), representing a character. HHH is the Unicode code point of the character written in octal; for example, \251 represents the copyright symbol. This form can only express 256 characters. Note that octal escapes are a legacy form and cause a SyntaxError in strict mode.

(2) \xHH

\x is followed by two hexadecimal digits (00 to FF), representing a character. HH is the Unicode code point of the character; for example, \xA9 represents the copyright symbol. This form can also only express 256 characters.

(3) \uXXXX

\u is followed by four hexadecimal digits (0000 to FFFF), representing a character. XXXX is the Unicode code point of the character; for example, \u00A9 represents the copyright symbol.

Below are examples of these three notations.

 '\251' // "©"
 '\xA9' // "©"
 '\u00A9' // "©"

 '\172' === 'z' // true
 '\x7A' === 'z' // true
 '\u007A' === 'z' // true
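To make the relationship between a code point and its hexadecimal escape forms concrete, here is a minimal sketch of a hypothetical helper, `toEscape`, that produces the escape text for a code point between 0x0000 and 0xFFFF. It is not a built-in function, and the octal form is left out because it is not accepted in strict mode.

```javascript
// Hypothetical helper: returns the escape-sequence text for a code point
// in the range 0x0000 to 0xFFFF.
function toEscape(codePoint) {
  var hex = codePoint.toString(16).toUpperCase();
  if (codePoint <= 0xFF) {
    // \xHH always takes exactly two hexadecimal digits.
    return '\\x' + ('00' + hex).slice(-2);
  }
  // \uXXXX always takes exactly four hexadecimal digits.
  return '\\u' + ('0000' + hex).slice(-4);
}

console.log(toEscape(0xA9));   // \xA9
console.log(toEscape(0x7A));   // \x7A
console.log(toEscape(0x4F60)); // \u4F60

// Both hexadecimal forms denote the same character.
console.log('\xA9' === '\u00A9'); // true
```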

If a backslash is used before a non-special character, the backslash is ignored.

 '\a'
 // "a"

In the above code, a is an ordinary character, so putting a backslash before it has no special meaning and the backslash is dropped automatically.

If a backslash needs to appear as part of the normal content of a string, another backslash must be placed before it so that it escapes itself.

 "Prev \\ Next"
 // "Prev \ Next"

Strings and arrays

A string can be treated as an array of characters, so the array square bracket operator can be used to return the character at a given position (positions are numbered from 0).

 var s = 'hello';
 s[0] // "h"
 s[1] // "e"
 s[4] // "o"

 // The square bracket operator can be applied directly to a string
 'hello'[1] // "e"

If the number in the square brackets is outside the bounds of the string, or if the value in the square brackets is not a number at all, undefined is returned.

 'abc'[3] // undefined
 'abc'[-1] // undefined
 'abc'['x'] // undefined

However, the similarity between strings and arrays ends there. In fact, it is impossible to change an individual character in a string.

 var s = 'hello';

 delete s[0];
 s // "hello"

 s[1] = 'a';
 s // "hello"

 s[5] = '!';
 s // "hello"

The above code shows that individual characters inside a string cannot be changed, added, or deleted; these operations fail silently.
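Because a string cannot be modified in place, the usual approach is to build a new string instead. The following is a minimal sketch using a hypothetical helper, `replaceAt`, which is not a built-in method.

```javascript
// Hypothetical helper: returns a copy of str with the character at `index`
// replaced. The original string is left untouched.
function replaceAt(str, index, replacement) {
  if (index < 0 || index >= str.length) {
    return str; // out of range: nothing to replace
  }
  return str.slice(0, index) + replacement + str.slice(index + 1);
}

var s = 'hello';
var t = replaceAt(s, 1, 'a');

console.log(t);       // "hallo"
console.log(s);       // "hello", the original is unchanged
console.log(s + '!'); // "hello!", appending also yields a new string
```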

Strings resemble character arrays only because, when the square bracket operator is applied to a string, the string is automatically converted into a String object.

length attribute

The length attribute returns the length of the string. It cannot be changed.

 var s = 'hello';
 s.length // 5

 s.length = 3;
 s.length // 5

 s.length = 7;
 s.length // 5

The above code shows that the length attribute of a string cannot be changed, although attempting to do so does not raise an error.

Character Set

JavaScript uses the Unicode character set, which means that inside JavaScript all characters are represented in Unicode.

Not only does JavaScript store characters internally as Unicode, but Unicode can also be used directly in programs. Every such character can be written in the form "\uxxxx", where xxxx is the Unicode code point of the character. For example, \u00A9 represents the copyright symbol.

 var s = '\u00A9';
 s // "©"

Inside JavaScript, each character is stored in 16-bit (i.e. 2-byte) UTF-16 format. In other words, the unit of character length in JavaScript is fixed at 16 bits, that is, 2 bytes.

However, UTF-16 has two lengths. For characters between U+0000 and U+FFFF, the length is 16 bits (i.e. 2 bytes). For characters between U+10000 and U+10FFFF, the length is 32 bits (i.e. 4 bytes), where the first two bytes are between 0xD800 and 0xDBFF and the last two bytes are between 0xDC00 and 0xDFFF. For example, the character corresponding to U+1D306 is 𝌆, which is written in UTF-16 as 0xD834 0xDF06. The browser correctly recognizes these four bytes as one character, but the character length inside JavaScript is always fixed at 16 bits, so these four bytes are treated as two characters.

 var s = '\uD834\uDF06';

 s // "𝌆"
 s.length // 2
 /^.$/.test(s) // false
 s.charAt(0) // ""
 s.charAt(1) // ""
 s.charCodeAt(0) // 55348
 s.charCodeAt(1) // 57094

The above code shows that JavaScript always treats a character between U+10000 and U+10FFFF as two characters (its length attribute is 2). A regular expression meant to match a single character fails (JavaScript considers that there is more than one character here), the charAt method cannot return the whole character, and the charCodeAt method returns the decimal value of each 16-bit unit.

This must therefore be taken into account when processing strings. For a 4-byte Unicode character, let C be the Unicode code point of the character, H the first two bytes, and L the last two bytes. The conversion between them is as follows.

 // Convert a character larger than U+FFFF from Unicode to UTF-16
 H = Math.floor((C - 0x10000) / 0x400) + 0xD800
 L = (C - 0x10000) % 0x400 + 0xDC00

 // Convert a character larger than U+FFFF from UTF-16 to Unicode
 C = (H - 0xD800) * 0x400 + L - 0xDC00 + 0x10000
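The two conversions can be wrapped in functions and checked against the U+1D306 example from the text. This is a minimal sketch; `toSurrogatePair` and `fromSurrogatePair` are hypothetical names, and neither function validates that its input lies in the expected range.

```javascript
// Hypothetical helper: code point above U+FFFF -> [H, L] in UTF-16.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;
  var high = Math.floor(offset / 0x400) + 0xD800;
  var low = (offset % 0x400) + 0xDC00;
  return [high, low];
}

// Hypothetical helper: UTF-16 pair (H, L) -> code point.
function fromSurrogatePair(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

var pair = toSurrogatePair(0x1D306);

console.log(pair[0].toString(16)); // d834
console.log(pair[1].toString(16)); // df06
console.log(fromSurrogatePair(0xD834, 0xDF06).toString(16)); // 1d306

// The pair builds the same two-unit string as the escape notation.
console.log(String.fromCharCode(pair[0], pair[1]) === '\uD834\uDF06'); // true
```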

The following regular expression recognizes every UTF-16 character.

 ([\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])
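As a sketch of how this pattern might be used, the example below applies it with the global flag to split a string into whole characters. The variable names are illustrative, and the sample string is an assumption chosen to contain one 4-byte character.

```javascript
// One character outside the surrogate range, or one high/low surrogate pair.
var symbolPattern = /[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

var text = 'a\uD834\uDF06b'; // "a", U+1D306, "b"
var symbols = text.match(symbolPattern);

console.log(text.length);    // 4, counted in 16-bit units
console.log(symbols.length); // 3, counted in whole characters
console.log(symbols[1] === '\uD834\uDF06'); // true, the pair stays together
```

A lone surrogate (one without its partner) matches neither alternative, so it is simply skipped by `match`.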

Because the JavaScript engine (strictly speaking, the ES5 specification) cannot automatically recognize Unicode characters in the supplementary planes (code points greater than 0xFFFF), string processing functions produce incorrect results when they encounter such characters. To carry out string operations correctly, you must check whether each character falls within the range 0xD800 to 0xDFFF.

Below is a function that handles string traversal correctly.

 function getSymbols(string) {
   var length = string.length;
   var index = -1;
   var output = [];
   var character;
   var charCode;
   while (++index < length) {
     character = string.charAt(index);
     charCode = character.charCodeAt(0);
     if (charCode >= 0xD800 && charCode <= 0xDBFF) {
       // First half of a 4-byte character: take the next unit as well
       output.push(character + string.charAt(++index));
     } else {
       output.push(character);
     }
   }
   return output;
 }

 var symbols = getSymbols('𝌆');

 symbols.forEach(function(symbol) {
   // ...
 });

Other string operations, such as replacement (String.prototype.replace) and substring extraction (String.prototype.substring, String.prototype.slice), must be handled in a similar way.
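As an illustration of handling such operations, the sketch below shows surrogate-aware versions of slicing and reversing. The helpers `toSymbols`, `symbolSlice`, and `symbolReverse` are hypothetical names, and the splitting logic is a simplified variant of the traversal approach that assumes any high surrogate followed by another unit forms a pair.

```javascript
// Split a string into whole characters, keeping surrogate pairs together.
function toSymbols(str) {
  var output = [];
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    if (code >= 0xD800 && code <= 0xDBFF && i + 1 < str.length) {
      // High surrogate with a following unit: keep both as one symbol.
      output.push(str.charAt(i) + str.charAt(i + 1));
      i++;
    } else {
      output.push(str.charAt(i));
    }
  }
  return output;
}

// Hypothetical counterparts of slice and reverse that count whole characters.
function symbolSlice(str, start, end) {
  return toSymbols(str).slice(start, end).join('');
}

function symbolReverse(str) {
  return toSymbols(str).reverse().join('');
}

var text = 'a\uD834\uDF06b';

console.log(text.slice(0, 2) === 'a\uD834');               // true, the pair is cut in half
console.log(symbolSlice(text, 0, 2) === 'a\uD834\uDF06');  // true, the pair is intact
console.log(symbolReverse(text) === 'b\uD834\uDF06a');     // true
```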

Base64 transcoding

Base64 is an encoding method that converts arbitrary characters into printable characters. Its main purpose is not encryption but avoiding special characters, which simplifies program processing.

JavaScript natively provides two Base64-related methods.

btoa(): converts a string or binary value to Base64 encoding
atob(): converts a Base64-encoded value back to the original encoding

 var string = 'Hello World!';
 btoa(string) // "SGVsbG8gV29ybGQh"
 atob('SGVsbG8gV29ybGQh') // "Hello World!"

These two methods are not suitable for non-ASCII characters and will report an error.

 btoa('你好')
 // Uncaught DOMException: The string to be encoded contains characters outside of the Latin1 range.

To convert non-ASCII characters to Base64, a transcoding step must be inserted before these two methods are used.

 function b64Encode(str) {
   return btoa(encodeURIComponent(str));
 }

 function b64Decode(str) {
   return decodeURIComponent(atob(str));
 }

 b64Encode('你好') // "JUU0JUJEJUEwJUU1JUE1JUJE"
 b64Decode('JUU0JUJEJUEwJUU1JUE1JUJE') // "你好"
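To see why the transcoding step works, the sketch below looks only at what encodeURIComponent does to a string before it reaches btoa(). The helper `isLatin1` is a hypothetical name, and the sample string "你好" (written with \u escapes) is an assumption used for illustration.

```javascript
// Hypothetical helper: true when every character is in U+0000 to U+00FF,
// the range named in the btoa() error message.
function isLatin1(str) {
  return /^[\u0000-\u00FF]*$/.test(str);
}

var original = '\u4F60\u597D'; // "你好"
var transcoded = encodeURIComponent(original);

console.log(isLatin1(original));   // false, btoa() would throw on this
console.log(transcoded);           // %E4%BD%A0%E5%A5%BD
console.log(isLatin1(transcoded)); // true, safe to pass to btoa()
console.log(decodeURIComponent(transcoded) === original); // true
```

The percent-encoded form contains only ASCII characters, which is why it can be Base64-encoded and later restored with decodeURIComponent.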