UTF-8 character encoding

Unicode supports almost all existing character sets. The best Unicode encoding form is UTF-8: it offers compatibility with ASCII, resistance to data corruption, efficiency and ease of processing. But first things first.

Encoding forms

Computers handle numbers not as abstract mathematical objects but as combinations of fixed-size units of storage and processing, such as bytes and 32-bit words. An encoding standard must take this into account when defining how characters are represented by numbers.

In computer systems, integers are stored in memory cells of 8 bits (1 byte), 16 bits or 32 bits. Each Unicode encoding form defines which sequence of such cells (code units) represents the integer assigned to a particular character. The standard provides three encoding forms, based on 8-, 16- and 32-bit code units and called UTF-8, UTF-16 and UTF-32 respectively. UTF stands for Unicode Transformation Format. All three forms are equally valid ways of representing Unicode characters, and each has advantages in particular applications.

Each of these encodings can represent all Unicode characters. They are therefore fully interoperable, so solutions that use different encoding forms for different reasons can work together. Each encoding can be converted unambiguously into either of the other two without loss of data.
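The lossless conversion between the three forms can be checked with a short Python sketch. The sample string and the helper name `transcode` are illustrative assumptions, not part of any standard API; the little-endian codec variants are used only so that no byte order mark is added.

```python
# Sample text mixing characters that need 1, 2, 3 and 4 bytes in UTF-8.
text = "A\u00e9\u20ac\U0001D11E"


def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from one encoding form and re-encode them in another."""
    return data.decode(source).encode(target)


utf8 = text.encode("utf-8")
utf16 = transcode(utf8, "utf-8", "utf-16-le")
utf32 = transcode(utf16, "utf-16-le", "utf-32-le")
back_to_utf8 = transcode(utf32, "utf-32-le", "utf-8")

# The round trip through all three forms reproduces the original bytes.
print(back_to_utf8 == utf8)
```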

Principle of non-overlap

Each Unicode encoding form is designed so that partial overlap cannot occur. By contrast, Windows code page 932 encodes characters as one or two bytes. The length of the sequence is determined by the first byte, so the values of the lead byte of a two-byte sequence and of a single byte do not overlap. However, the values of a single byte and of the trail byte of a two-byte sequence can coincide. This means, for example, that a search for the character "D" (code 44) can falsely match the second byte of the two-byte sequence for the character "Д" (code 84 44). To determine which interpretation is correct, the program must examine the preceding bytes.

The situation becomes more complicated if lead and trail byte values can also coincide. To resolve the ambiguity, the program then has to scan backward until it reaches the start of the text or an unambiguous code sequence. This is not only inefficient but also fragile, because a single bad byte is enough to make the entire text unreadable.

The Unicode transformation formats avoid this problem, because the value ranges of the leading, trailing and single code units do not overlap. As a result, all Unicode encodings are suitable for searching and comparison and never give a false result because parts of different character codes happen to coincide. Observing this principle of non-overlap distinguishes them from the multibyte East Asian encodings.
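The false match can be reproduced with a naive byte search. This is a minimal sketch that assumes Python's built-in `cp932` codec as a stand-in for Windows code page 932 and uses the Cyrillic letter "Д" as the two-byte character; the function name `naive_find` is hypothetical.

```python
needle = "D"
haystack = "\u0414"  # Cyrillic capital letter De; the text contains no Latin "D"

cp932_bytes = haystack.encode("cp932")
utf8_bytes = haystack.encode("utf-8")


def naive_find(encoded_text: bytes, encoded_needle: bytes) -> bool:
    """Byte-level substring search that ignores character boundaries."""
    return encoded_needle in encoded_text


# In code page 932 the trail byte of the two-byte sequence equals ASCII "D",
# so the search reports a match that is not really there.
false_hit_cp932 = naive_find(cp932_bytes, needle.encode("cp932"))

# In UTF-8 every byte of a multibyte sequence is 0x80 or above,
# so an ASCII needle cannot match inside it.
false_hit_utf8 = naive_find(utf8_bytes, needle.encode("utf-8"))

print(cp932_bytes.hex(), false_hit_cp932)
print(utf8_bytes.hex(), false_hit_utf8)
```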

Another aspect of non-overlap is that each character has clearly defined boundaries. This eliminates the need to scan an indeterminate number of preceding characters. This property is sometimes called self-synchronization. Corruption of one code unit corrupts only one character, and the surrounding characters remain intact. In UTF-8, if a pointer refers to a byte of the form 10xxxxxx (in binary), one to three backward steps are enough to find the start of the character.
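The backward scan over 10xxxxxx bytes might look like the following sketch. It assumes well-formed UTF-8 input, and the function names `is_continuation` and `char_start` are illustrative.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx."""
    return byte & 0b1100_0000 == 0b1000_0000


def char_start(data: bytes, index: int) -> int:
    """Return the index of the first byte of the character covering `index`.

    Assumes well-formed UTF-8, where a character has at most three
    continuation bytes, so at most three backward steps are taken.
    """
    start = index
    while start > 0 and index - start < 3 and is_continuation(data[start]):
        start -= 1
    return start


data = "a\u20acb".encode("utf-8")  # bytes: 61 E2 82 AC 62
print(char_start(data, 3))  # index 3 is the last byte of the euro sign
```

Because the scan never needs more than three steps, a program can start reading at an arbitrary byte offset and still recover character boundaries.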

Consistency

The Unicode Consortium fully supports all three encoding forms. UTF-8 should not be contrasted with Unicode, because all the transformation formats are equally legitimate implementations of the Unicode character encoding forms.

Byte-orientation

In UTF-32, each character is represented by a single 32-bit code unit whose value equals the Unicode code point. UTF-16 uses one or two 16-bit code units, and UTF-8 uses one to four bytes.
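As an illustration, the number of code units per character can be counted in Python. The helper name `code_units` and the sample characters are assumptions made for this sketch; the little-endian codecs are chosen so that no byte order mark is counted.

```python
def code_units(ch: str) -> dict:
    """Count the code units one character occupies in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),           # 8-bit units
        "UTF-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit units
        "UTF-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit units
    }


for ch in ("A", "\u0414", "\u20ac", "\U0001D11E"):
    print(f"U+{ord(ch):04X}", code_units(ch))

# The single UTF-32 code unit holds the code point value itself.
clef = "\U0001D11E"
utf32_value = int.from_bytes(clef.encode("utf-32-le"), "little")
print(utf32_value == ord(clef))
```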

UTF-8 was created for compatibility with byte-oriented, ASCII-based systems. Most existing software and information technology practice has long relied on representing characters as sequences of bytes. Many protocols depend on ASCII values remaining unchanged and either use or avoid particular control characters. A simple way to adapt Unicode to such situations is to use an 8-bit encoding in which every Unicode character that corresponds to an ASCII character or control character is represented by the same byte value. This is what UTF-8 is designed for.

Variable length

UTF-8 is a variable-length encoding made up of 8-bit code units whose high-order bits indicate which part of the sequence each byte belongs to. One range of values is reserved for the first unit of a code sequence and another for the subsequent units. This ensures that the encoding is non-overlapping.

ASCII

UTF-8 fully preserves the ASCII codes (0x00-0x7F). Unicode characters U+0000-U+007F are converted to the single bytes 0x00-0x7F and are thus indistinguishable from ASCII. Moreover, to avoid ambiguity, the values 0x00-0x7F are never used in any byte of a multibyte character representation. Characters in the range U+0080-U+07FF, which includes, for example, Greek, Cyrillic, Hebrew and Arabic, are encoded as two bytes. Characters in the range U+0800-U+FFFF are represented by three bytes, and supplementary characters with codes above U+FFFF require four bytes.
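The ranges above map onto bit patterns as in the following hand-written encoder. It is a teaching sketch rather than a replacement for Python's built-in codec, and the function name `encode_utf8` is an assumption.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8, by hand."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0x7F:
        # 0xxxxxxx: identical to ASCII
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for cp in (0x41, 0x414, 0x20AC, 0x1D11E):
    print(f"U+{cp:04X}", encode_utf8(cp).hex(" "))
```

Every byte of a multibyte sequence has its high bit set, which is why the values 0x00-0x7F can only ever stand for ASCII characters.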

Application area

UTF-8 is usually the preferred encoding for HTML and similar protocols.

XML was the first standard with full UTF-8 support. Standards organizations recommend it as well. The problem of supporting non-ASCII characters in URLs was resolved when the W3C and the IETF agreed to encode all URLs exclusively in UTF-8.
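In practice, non-ASCII characters in a URL are written as percent-encoded UTF-8 bytes. The sketch below uses `quote` and `unquote` from Python's standard `urllib.parse` module, which default to UTF-8; the path is a made-up example.

```python
from urllib.parse import quote, unquote

path = "/wiki/\u0414"  # hypothetical path containing a Cyrillic letter

# Each UTF-8 byte of the non-ASCII character becomes a %XX escape.
encoded = quote(path)
print(encoded)

# Decoding the escapes as UTF-8 restores the original path.
print(unquote(encoded) == path)
```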

Compatibility with ASCII eases the transition to new software. Most text editors work with UTF-8, including jEdit, Emacs, BBEdit, Eclipse and Windows Notepad. No other Unicode encoding form enjoys such broad tool support.

An advantage of the encoding is that it consists of a sequence of bytes, so UTF-8 strings are easy to work with in C and other programming languages. It is also the only encoding form that requires neither a byte order mark (BOM) nor an encoding declaration in XML.

Self-synchronization

In an environment that processes 8-bit characters, UTF-8 has the following advantages over other multibyte encodings:

The first byte of a code sequence indicates its length. This makes forward scanning more efficient.
It is easier to find the start of a character, since the initial byte is restricted to a fixed range of values.
The value ranges of single, leading and trailing bytes do not overlap.
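The first advantage can be sketched as follows: the lead byte alone tells a forward scanner how many bytes to consume. The function names `sequence_length` and `split_characters` are illustrative, and the code assumes well-formed UTF-8 input.

```python
from typing import List


def sequence_length(lead: int) -> int:
    """Return the length of a UTF-8 sequence, judged from its first byte."""
    if lead <= 0x7F:
        return 1  # 0xxxxxxx
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4  # 11110xxx
    raise ValueError(f"0x{lead:02X} cannot start a UTF-8 sequence")


def split_characters(data: bytes) -> List[bytes]:
    """Split well-formed UTF-8 into per-character byte sequences."""
    chars = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chars.append(data[i:i + n])
        i += n  # jump straight to the next lead byte
    return chars


sample = "a\u0414\u20ac\U0001D11E"
print([c.hex() for c in split_characters(sample.encode("utf-8"))])
```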

Comparison of advantages

UTF-8 is compact. However, East Asian (Chinese, Japanese and Korean) text written in Chinese characters requires 3-byte sequences. UTF-8 is also slower to process than the other encoding forms. Binary sorting of UTF-8 strings produces the same order as binary sorting by Unicode code point.
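The sorting property can be checked with a few sample strings. This sketch relies on the fact that Python compares `str` values by code point; the sample list is an assumption chosen to include a supplementary character, which is where UTF-16 binary order diverges.

```python
words = ["\uff5e", "\U00010000", "a", "\u00e9"]

# Python compares str objects code point by code point.
by_code_point = sorted(words)

# Binary (bytewise) comparison of the encoded forms.
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf16 = sorted(words, key=lambda s: s.encode("utf-16-be"))

print(by_utf8 == by_code_point)
print(by_utf16 == by_code_point)
```

With these samples the UTF-8 order matches the code point order, while the UTF-16 order does not, because the surrogate pair for U+10000 begins with D8, which sorts before FF.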

Character encoding scheme

A character encoding scheme consists of a character encoding form plus a method of serializing its code units into bytes. To identify the encoding scheme, the Unicode standard provides for an initial byte order mark (BOM).

When a BOM is present in UTF-8, its function is limited to indicating which encoding form is in use. Byte order is not an issue in UTF-8, since its code unit is one byte in size. Use of a BOM with this encoding form is neither required nor recommended. A BOM may appear in text converted from other encodings that use a byte order mark, or serve as a UTF-8 signature. It is the 3-byte sequence EF₁₆ BB₁₆ BF₁₆.
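Detecting and removing the signature takes only a few lines. The sketch below uses the `codecs.BOM_UTF8` constant and the `utf-8-sig` codec from Python's standard library; the helper name `strip_utf8_bom` is an assumption.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 signature (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")
print(strip_utf8_bom(with_bom))

# The "utf-8-sig" codec skips the signature while decoding.
print(with_bom.decode("utf-8-sig"))
```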

How to set UTF-8 encoding

In HTML, UTF-8 encoding is set with the following code:

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

In PHP, UTF-8 encoding is specified with the header() function at the very beginning of the file, after setting the error reporting level:

<?php
error_reporting(-1);
header("Content-Type: text/html; charset=utf-8");

For connections to MySQL databases, UTF-8 encoding is set as follows:

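<?php
// Note: the legacy mysql extension was removed in PHP 7. With mysqli the equivalent call is
// mysqli_set_charset($link, "utf8mb4"), where $link stands for your own connection handle.
mysql_set_charset("utf8");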

In CSS files, UTF-8 character encoding is specified as follows:

@charset "utf-8";

When you save files of any type, select UTF-8 encoding without a BOM. Otherwise the BOM bytes are sent to the browser before any PHP code runs, which can break header() calls and the page itself. To do this in Dreamweaver, choose the menu item "Modify - Page Properties - Title/Encoding" and change the encoding to UTF-8. Then reload the page, uncheck the box "Include Unicode Signature (BOM)" and apply the changes. If any text on the page or in the database was entered in another encoding, it must be re-entered or converted. When working with regular expressions, always use the u modifier.

You can also save a file in UTF-8 encoding in Windows Notepad: choose "File - Save As ...", select the required encoding and save the file.

In the Notepad++ text editor, if the encoding differs from UTF-8, convert and save the file using the menu item "Convert to UTF-8 without BOM".

There is no alternative

In the context of globalization, as political and language boundaries fade, character sets with local characteristics become less useful. Unicode is the only character set that supports all localizations. And UTF-8 is an example of a correct implementation of Unicode, which: