I have spent the past few days studying UTF-8 and character encodings, and I found them very confusing. Below are my thoughts so far. If anything is wrong, please point it out and set me straight.
Some related background first:
1. Operating system
Windows is Unicode internally. Folder names, file names and so on are all stored as Unicode and display correctly on a system of any language.
2. Input method:
Microsoft Pinyin outputs Unicode, while Smart ABC outputs Simplified Chinese (so Smart ABC cannot enter Chinese at all on a non-Simplified-Chinese system; it can only type English).
3. Textarea of web page
A web page's textarea works in Unicode, so whatever you type into it is displayed correctly. Some input boxes built in Flash, however, do not behave this way.
4. Access 2000
Data saved in Access is Unicode and can be displayed on a system of any language.
If some characters look wrong in the datasheet view, it is because the display font is not a Unicode font.
Switch to the Arial Unicode MS font and everything is displayed. (In Access Help, search for "unicode" to find the instructions.)
5. Word
Word can convert between Simplified and Traditional Chinese. After converting Simplified Chinese text to Traditional, the internal code is still Simplified Chinese. In effect you just get Traditional characters inside the Simplified Chinese encoding.
6. ASP
ASP is Unicode internally: all text is held as Unicode and is converted to a specified character set only when necessary.
First, the conclusion:
<%@ codepage=936 %>    Simplified Chinese
<%@ codepage=950 %>    Traditional Chinese
<%@ codepage=65001 %>  UTF-8
The codepage specifies the encoding IIS uses to read incoming strings (form submissions, values passed in the address bar, and so on).
It also specifies the encoding that all text variables are converted to from Unicode.
And it specifies the encoding that data retrieved from the database is converted to from Unicode. (Note this last point; it is very important.)
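To make the three code pages concrete, here is a minimal sketch in Python (not ASP, only so that it can be run anywhere). The codec names "gbk", "big5" and "utf-8" are my assumption for the closest Python equivalents of code pages 936, 950 and 65001; the helper name to_codepage is made up for illustration.

```python
# Assumed Python stand-ins for the Windows code pages used by ASP.
CODEPAGE_TO_CODEC = {
    936: "gbk",      # Simplified Chinese
    950: "big5",     # Traditional Chinese
    65001: "utf-8",  # UTF-8
}

def to_codepage(text, codepage):
    """Convert Unicode text to bytes in the given code page."""
    return text.encode(CODEPAGE_TO_CODEC[codepage])

def from_codepage(data, codepage):
    """Read bytes according to the given code page, giving Unicode text."""
    return data.decode(CODEPAGE_TO_CODEC[codepage])

# The same Unicode character has a different byte form in each code page.
for cp in CODEPAGE_TO_CODEC:
    print(cp, to_codepage("化", cp).hex())
```

The point is only that one Unicode character has a different byte form under each code page, and that the same code page must be used in both directions to get the character back.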
Keywords:
Reading: the same string gives one set of characters when read as Simplified Chinese and a different set when read as Traditional Chinese; the encoding of the string itself has not changed.
Conversion: the system actively converts, for example from Unicode's "化" to Big5's "化", and the internal code becomes the Big5 one. If Big5 has no corresponding character, the Unicode form (&#xxxx;) is kept.
Simplified Chinese: Six conclusions
Unicode hexadecimal form: six conclusions
Unicode decimal form: six conclusions
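The difference between "reading" and "conversion" can be sketched in a few lines of Python. This is an illustration under assumptions, not what IIS does internally: "gbk" and "big5" are Python codec names standing in for code pages 936 and 950, and "论" is a sample character I picked because it has no Big5 equivalent.

```python
text = "化"
big5_bytes = text.encode("big5")

# Reading: the same bytes interpreted under two code pages.
# The bytes do not change; only the interpretation does.
read_as_big5 = big5_bytes.decode("big5")
read_as_gbk = big5_bytes.decode("gbk", errors="replace")
print(big5_bytes.hex(), read_as_big5, read_as_gbk)

# Conversion: Unicode is actively mapped to Big5. A character with no
# Big5 equivalent is kept in its Unicode (numeric reference) form.
simplified_only = "论"
converted = simplified_only.encode("big5", errors="xmlcharrefreplace")
print(converted)

# The two spellings of the Unicode form mentioned above.
hex_form = "&#x{:X};".format(ord(simplified_only))
dec_form = "&#{};".format(ord(simplified_only))
print(hex_form, dec_form)
```

Python's xmlcharrefreplace handler produces the decimal form, which is why the converted bytes match dec_form rather than hex_form.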
The following is the encoding conversion process as I understand it (this is speculation on my part):
Client: input method (Unicode) -> input box (Unicode) -> converted from Unicode to the encoding given by charset -> form encoding -> sent
Server, receiving: IIS decodes the form -> reads the string according to the encoding specified by codepage -> converts it to the corresponding Unicode -> the value can be read with request("") -> some processing -> saved to the database as Unicode
Server, sending: reads the Unicode data from the database -> converts it to the encoding specified by codepage -> generates the page source -> IE reads and displays it according to charset
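The speculated process above can be simulated with a small Python function. This is only a sketch of my understanding, not a description of what IIS really does: the function name round_trip is made up, the arguments are Python codec names ("gbk" assumed for 936/GB2312, "big5" assumed for 950/Big5), and "论" is a sample character chosen because Big5 does not contain it.

```python
def round_trip(text, form_charset, add_codepage, read_codepage, page_charset):
    """Follow one string from the input box to the final display.

    Returns (what the database stores, what the browser shows).
    """
    # Client: Unicode -> form charset; characters the charset lacks
    # are sent in numeric reference form (&#NNNN;).
    sent = text.encode(form_charset, errors="xmlcharrefreplace")
    # add.asp: the bytes are read per codepage and become Unicode.
    stored = sent.decode(add_codepage, errors="replace")
    # read.asp: database Unicode -> codepage; unmappable becomes "?".
    page_bytes = stored.encode(read_codepage, errors="replace")
    # Browser: the bytes are read according to the page charset.
    shown = page_bytes.decode(page_charset, errors="replace")
    return stored, shown

# Everything consistent: the text survives.
print(round_trip("化", "big5", "big5", "big5", "big5"))
# add.asp reads Big5 bytes as 936: the database holds a misread string.
print(round_trip("化", "big5", "gbk", "gbk", "big5"))
# A character missing from Big5 travels as a numeric reference all the way.
print(round_trip("论", "big5", "big5", "big5", "big5"))
```

Changing one argument at a time is a convenient way to see which step a mismatch affects: the stored value depends only on the first two encodings, and the displayed value depends on all four.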
Here are some examples:
Example 1:
Assume three ASP pages that make up a typical message board:
1. write.asp is a simple input form that submits to add.asp.
<META http-equiv="Content-Type" content="text/html; charset=big5">
2. add.asp receives the message and saves it to the database.
<%@ codepage=936%>
3. read.asp fetches the messages from the database and displays them.
<%@ codepage=936%> charset=GB2312 or
<%@ codepage=950%> charset=big5
Take a guess: I use the Microsoft Pinyin input method to type "Hua Liu Discussion" in write.asp. What is finally displayed in read.asp?
Dizzy yet? Let's analyze it from the beginning.
Example 2:
What will happen if we change the <%@ codepage=936%> in add.asp in Example 1 to <%@ codepage=950%>?
What can we conclude from this?
1. If the text you type is not covered by the page's charset, characters in Unicode form (&#xxxx;) may appear once the conversion happens (see "Conversion" above). From then on they are kept unchanged through the entire process.
2. The codepage in add.asp determines which language's Unicode the text is saved as in the database. For example, with codepage=936
the database stores Simplified Chinese Unicode (read back on a Simplified Chinese system, everything is normal),
while codepage=950 stores Traditional Chinese Unicode (read back on a Simplified Chinese system, it comes out wrong).
3. Pay attention to how the string changes along the way:
1) Input method -> charset: mapping from Unicode to the character set specified by charset.
2) Charset -> form encoding: a simple encoding of the string.
3) Form decoding: the reverse of the previous step; the two steps cancel out.
4) The string is read according to codepage: the string itself does not change. This step may cause a "misreading".
5) Conversion to the corresponding Unicode: mapping from the character set specified by codepage to Unicode.
6) Intermediate processing: no change; the text goes into the database directly in Unicode form.
7) The database is read according to codepage: mapping from Unicode to the character set specified by codepage.
8) Display: the string is read according to the character set specified by charset; the string itself does not change.
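Steps 2) and 3) are the easiest to check in isolation. The sketch below assumes that "form encoding" means ordinary percent-encoding of the charset bytes, and uses Python's urllib with the "big5" codec as a stand-in for a page declared as charset=big5.

```python
from urllib.parse import quote, unquote_to_bytes

text = "化"

# Step 1: Unicode -> bytes in the page charset.
charset_bytes = text.encode("big5")

# Step 2: form encoding, a simple percent-encoding of those bytes.
wire = quote(text, encoding="big5")

# Step 3: form decoding, the reverse of step 2.
decoded = unquote_to_bytes(wire)

# The two steps cancel out: the server holds the same bytes the charset produced.
print(wire, decoded == charset_bytes)

# Step 4 onward depends on codepage; reading with the matching one recovers the text.
print(decoded.decode("big5"))
```

So nothing can go wrong in steps 2) and 3); any damage comes from step 4), where the bytes are read under whatever codepage add.asp declares.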
Let’s illustrate with example 1:
Example 2:
Dizzy? Now let's put this knowledge to use.
Case 1.
Code that runs fine on a Simplified Chinese system writes garbled text to the database once it is moved to overseas hosting, and the existing data is displayed garbled as well.
Analysis: most of us use a Simplified Chinese system, where the default is codepage=936, so it does no harm to leave it out.
On overseas hosting, however, the problem appears. The Unicode in the database is now converted to the English encoding, so the existing Simplified Chinese data, once converted to English and displayed as GB, is naturally garbled.
As shown in the picture, newly entered text is displayed normally, but what is saved in the database is English Unicode.
Solution: add <%@ codepage=936 %> to every page.
The whole process then only involves conversion between Simplified Chinese and the corresponding Unicode.
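Case 1 can be reproduced in Python under one assumption: that the overseas server defaults to the Western European code page 1252 (Python codec "cp1252"). The sample text "中文" is my own, and the variable names are made up for illustration.

```python
SERVER_DEFAULT = "cp1252"  # assumed default code page on the overseas server

# New message: the browser sends GBK bytes, the server reads them as 1252.
sent = "中文".encode("gbk")
stored_new = sent.decode(SERVER_DEFAULT)          # "English Unicode" in the database
shown_new = stored_new.encode(SERVER_DEFAULT).decode("gbk")

# Old data: real Chinese Unicode that was stored while codepage was 936.
stored_old = "中文"
old_bytes = stored_old.encode(SERVER_DEFAULT, errors="replace")
shown_old = old_bytes.decode("gbk", errors="replace")

# Fix: with codepage=936 on every page, both directions use the same mapping.
fixed_old = stored_old.encode("gbk").decode("gbk")

print(repr(stored_new), shown_new, shown_old, fixed_old)
```

New text looks fine on the page because the misreading is undone on the way out, yet the database now holds Western letters instead of Chinese. The old data has no representation in code page 1252 at all, so it cannot survive the conversion.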
Case 2:
How do I convert Simplified Chinese code and data into a complete Traditional Chinese version?
Analysis: 1. Change the encoding of all code files to Big5, so that the files themselves are saved in Traditional Chinese.
2. <%@ codepage=950 %>
3. charset=big5
4. The Access version does not matter, because the data in Access is Unicode.
5. OK, the code can now run on a pure Traditional Chinese system.
6. Remaining issue: some question marks appear when reading the original Simplified Chinese data. The effect is the same as reading with 950 and displaying as big5 in Example 1. The Simplified Chinese Unicode is converted to Traditional Chinese, and characters that do not exist in Traditional Chinese become question marks.
7. Solution: use a temporary ASP page with codepage=65001 to read the data as Simplified Chinese Unicode, convert it to Traditional Chinese with a Unicode->Big5 function, and then write it back to the database. This should work, right?
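A rough Python sketch of the conversion step in point 7 follows. It is an illustration only: the three-entry table S2T is a made-up sample (a real conversion needs a complete Simplified-to-Traditional mapping), the record "讨论" is hypothetical, and reading from and writing back to the database is left out.

```python
# Tiny sample table; a real one must cover every Simplified character in the data.
S2T = str.maketrans({"论": "論", "讨": "討", "简": "簡"})

def to_traditional(text):
    """Replace Simplified characters with Traditional ones, in Unicode."""
    return text.translate(S2T)

def fits_big5(text):
    """True if every character can be converted to Big5 (no question marks)."""
    try:
        text.encode("big5")
        return True
    except UnicodeEncodeError:
        return False

record = "讨论"                      # hypothetical row read as Unicode
converted = to_traditional(record)   # still Unicode, now Traditional
print(fits_big5(record), converted, fits_big5(converted))
```

The conversion works on Unicode strings the whole time; Big5 only comes into play when read.asp later converts the stored text with codepage=950. Checking each converted record with something like fits_big5 before writing it back would show which characters the table still misses.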
Both cases are deduced purely from theory and have not been verified.
If you have had similar experiences, criticism and corrections are welcome.