
Chapter 4. Coded Character Sets And Encodings in the World

4.4 ISO 10646 and Unicode

ISO 10646 and Unicode are another standard intended to make it easy to develop international software. The special features of this new standard are:

  A single unified CCS which intends to include all characters in the world. (ISO 2022 consists of multiple CCS.)

  The character set intends to cover all conventional (or legacy) CCS in the world.

  Compatibility with ASCII and ISO 8859-1 is considered.

  Chinese, Japanese, and Korean ideograms are unified. This comes from a limitation of Unicode; it is not a merit.

ISO 10646 is an official international standard. Unicode is developed by the Unicode Consortium (http://www.unicode.org). The two are almost identical; indeed, they are exactly identical at the code points that are available in both standards. Unicode is updated from time to time, and the newest version is 3.0.1.

4.4.1 UCS as a Coded Character Set

ISO 10646 defines two CCS (coded character sets), UCS-2 and UCS-4. UCS-2 is a subset of UCS-4.

UCS-4 is a 31-bit CCS. These 31 bits are divided into fields of 7, 8, 8, and 8 bits, and each field has its own name.

  The top 7 bits are called Group.

  The next 8 bits are called Plane.

  The next 8 bits are called Row.

  The lowest 8 bits are called Cell.
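
The field layout above can be sketched in code. The following is a minimal illustration, not part of ISO 10646 itself; the function name split_ucs4 is hypothetical, and the example assumes the code point is given as a plain non-negative integer.

```python
def split_ucs4(code_point):
    """Split a 31-bit UCS-4 code point into (group, plane, row, cell).

    split_ucs4 is a hypothetical helper name used only for illustration.
    """
    # UCS-4 is a 31-bit CCS, so the valid range is 0 .. 0x7FFFFFFF.
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("not a 31-bit UCS-4 code point")
    group = (code_point >> 24) & 0x7F  # top 7 bits
    plane = (code_point >> 16) & 0xFF  # next 8 bits
    row = (code_point >> 8) & 0xFF     # next 8 bits
    cell = code_point & 0xFF           # lowest 8 bits
    return group, plane, row, cell


if __name__ == "__main__":
    # 0x3042 has Group 0, Plane 0, Row 0x30, Cell 0x42.
    print(split_ucs4(0x3042))
```

Shifting and masking keeps the four fields independent, so each one can be inspected without string conversion.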

The first plane (Group = 0, Plane = 0) is called the BMP (Basic Multilingual Plane), and UCS-2 is the same as the BMP. Thus, UCS-2 is a 16-bit CCS.
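
As a rough sketch of what membership in the BMP means numerically: a code point belongs to the BMP exactly when its Group and Plane fields are both zero, that is, when it fits in 16 bits. The function name is_in_bmp is an assumption made for this example.

```python
def is_in_bmp(code_point):
    """Return True if the UCS-4 code point has Group = 0 and Plane = 0.

    is_in_bmp is a hypothetical helper name used only for illustration.
    """
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("not a 31-bit UCS-4 code point")
    # Group and Plane occupy the bits above the low 16 (Row and Cell),
    # so both are zero exactly when the value fits in 16 bits.
    return (code_point >> 16) == 0


if __name__ == "__main__":
    print(is_in_bmp(0xFFFF))   # last code point of the BMP
    print(is_in_bmp(0x10000))  # first code point outside the BMP
```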

Code points in UCS are often expressed as u+????, where ???? is the hexadecimal expression of the code point.
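
The notation can be produced and read back mechanically. The sketch below assumes that the hexadecimal part is padded to at least four digits and follows the lowercase "u+" prefix used in this text; both function names are hypothetical.

```python
def format_code_point(code_point):
    """Render a code point in the u+???? notation (assumed: 4+ hex digits)."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    return "u+{:04x}".format(code_point)


def parse_code_point(text):
    """Read a u+???? string back into an integer code point."""
    # Accept either "u+" or "U+" as the prefix.
    if not text.lower().startswith("u+"):
        raise ValueError("expected a string starting with u+")
    return int(text[2:], 16)


if __name__ == "__main__":
    print(format_code_point(0x3042))
    print(parse_code_point("u+0041"))
```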

Footnote: The claim that the character set covers all conventional CCS is obviously not true for CNS 11643, because CNS 11643 contains 48711 characters while Unicode 3.0.1 contains 49194 characters, only 483 more than CNS 11643.
