Chapter 4. Coded Character Sets And Encodings in the World
Characters in the range U+0021 to U+007E are the same as in ASCII, and characters in the range
U+00A0 to U+00FF are the same as in ISO 8859-1. Thus it is very easy to convert between ASCII or
ISO 8859-1 and UCS.
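As a minimal sketch of why this conversion is easy, the following Python snippet decodes a few ASCII and ISO 8859-1 bytes and checks that each resulting code point has the same numeric value as the original byte. The sample byte values and variable names are illustrative assumptions, not taken from the text.

```python
# Sample bytes chosen for illustration only.
ascii_bytes = b"Hi~"                      # all within 0x21..0x7E
latin1_bytes = bytes([0xA0, 0xE9, 0xFF])  # all within 0xA0..0xFF

# Decode with Python's built-in codecs, then read off the code points.
ascii_code_points = [ord(ch) for ch in ascii_bytes.decode("ascii")]
latin1_code_points = [ord(ch) for ch in latin1_bytes.decode("latin-1")]

# In these ranges each byte value equals its UCS code point.
print(ascii_code_points == list(ascii_bytes))
print(latin1_code_points == list(latin1_bytes))
```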
Unicode (version 3.0.1) uses a 20-bit subset of UCS-4 as a CCS. [4]
The unique feature of these CCS compared with other CCS is their open repertoire. They continue
to be developed even after release: characters will be added in the future, but characters that
are already coded will not be changed. Unicode version 3.0.1 includes 49194 distinct coded characters.
4.4.2 UTF as Character Encoding Schemes
A few CES are used to construct encodings which use UCS as a CCS. They are UTF-7, UTF-8,
UTF-16, UTF-16LE, and UTF-16BE. UTF means Unicode (or UCS) Transformation Format. Since
these CES always take UCS as their only CCS, they are also names of encodings. [5]
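To illustrate that these names each identify one concrete encoding of the same CCS, here is a small sketch that encodes a single UCS character under three of the schemes using Python's built-in codecs. The choice of U+20AC as the sample character is an assumption made for illustration.

```python
# One UCS character, three different CES. U+20AC is an arbitrary sample.
ch = "\u20ac"

as_utf8 = ch.encode("utf-8")         # variable-length, ASCII-compatible
as_utf16be = ch.encode("utf-16-be")  # 16-bit units, most significant byte first
as_utf16le = ch.encode("utf-16-le")  # 16-bit units, least significant byte first

for name, data in (("UTF-8", as_utf8), ("UTF-16BE", as_utf16be), ("UTF-16LE", as_utf16le)):
    print(name, data.hex())
```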
UTF-8
UTF-8 is an encoding whose CCS is UCS-4. UTF-8 is designed to be upward compatible with ASCII.
UTF-8 is a multibyte encoding, and the number of bytes needed to express one character ranges from 1 to 6.
Conversion from UCS-4 to UTF-8 is performed using a simple conversion rule.
UCS-4 (binary)                         UTF-8 (binary)
00000000 00000000 00000000 0???????    0???????
00000000 00000000 00000??? ????????    110????? 10??????
00000000 00000000 ???????? ????????    1110???? 10?????? 10??????
00000000 000????? ???????? ????????    11110??? 10?????? 10?????? 10??????
000000?? ???????? ???????? ????????    111110?? 10?????? 10?????? 10?????? 10??????
0??????? ???????? ???????? ????????    1111110? 10?????? 10?????? 10?????? 10?????? 10??????
Note that the shortest representation is always used, even though a longer representation could also express a smaller UCS value.
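The conversion rule can be sketched in Python as follows. This is a minimal illustration of the 1-to-6-byte scheme as described here, not a production encoder; the function name `ucs4_to_utf8` is hypothetical. Current standards restrict UTF-8 to at most four bytes (up to U+10FFFF), so the 5- and 6-byte branches apply only to the original definition given in this section.

```python
def ucs4_to_utf8(cp: int) -> bytes:
    """Encode one UCS-4 value with the shortest form of the 1-to-6-byte rule."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("UCS-4 value out of range")
    if cp < 0x80:
        return bytes([cp])  # one byte: 0???????

    # (exclusive upper limit, total bytes, lead-byte prefix bits)
    rules = (
        (0x800, 2, 0xC0),       # 110?????
        (0x10000, 3, 0xE0),     # 1110????
        (0x200000, 4, 0xF0),    # 11110???
        (0x4000000, 5, 0xF8),   # 111110??
        (0x80000000, 6, 0xFC),  # 1111110?
    )
    # Picking the first matching limit guarantees the shortest form.
    for limit, nbytes, prefix in rules:
        if cp < limit:
            break

    out = []
    for _ in range(nbytes - 1):
        out.append(0x80 | (cp & 0x3F))  # continuation byte: 10??????
        cp >>= 6
    out.append(prefix | cp)             # remaining high bits go in the lead byte
    return bytes(reversed(out))


print(ucs4_to_utf8(0x41).hex())    # one byte, identical to ASCII
print(ucs4_to_utf8(0x20AC).hex())  # three bytes
```

For values up to U+10FFFF (excluding surrogates), the result should agree with Python's built-in `chr(cp).encode("utf-8")`.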
UTF-8 seems to be one of the major candidates for standard codesets in the future. For example,
the Linux console and xterm support UTF-8. The Debian package locales (version 2.1.97-1) contains
a ko_KR.UTF-8 locale. I think the number of UTF-8 locales will increase.
[4] Strictly speaking, U+000000 to U+10FFFF.
[5] Compare UTF and EUC. There are a few variants of EUC whose CCS are different (EUC-JP, EUC-KR, and so on).
This is why we cannot call EUC an encoding. In other words, the name 'EUC' alone does not specify an encoding. On the
other hand, 'UTF-8' is the name of a specific, concrete encoding.