Chapter 4. Coded Character Sets And Encodings in the World
Cross Mapping Tables
Unicode intends to be a superset of all major encodings in the world, such as ISO-8859-*,
EUC-*, KOI8-*, and so on. The aim is to preserve round-trip compatibility and to enable smooth
migration from other encodings to Unicode.
Providing a superset alone is not sufficient. Reliable cross mapping tables between Unicode and
other encodings are also needed. The Unicode Consortium provides them
(http://www.unicode.org/Public/MAPPINGS/).
However, tables for East Asian encodings are no longer provided. They used to be, but they are
now marked obsolete (http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/).
You may want to use these mapping tables even though they are obsolete, because no other
mapping tables are available. However, they have a severe problem: there are several different
mapping tables for the Japanese encodings that include the JIS X 0208 character set. Thus, the
same JIS X 0208 character can be mapped to different Unicode characters depending on which
table is used. For example, Microsoft and Sun use different tables, with the result that Java
on MS Windows sometimes corrupts Japanese characters.
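The divergence described above can be observed with two codecs that ship with Python. This is only a sketch under stated assumptions: it takes Python's "shift_jis" codec as a stand-in for a JIS-derived table and "cp932" as a stand-in for the Microsoft table, and it uses the Shift_JIS bytes 0x8160 (JIS X 0208 0x2141) as a sample character that is not named in the text above.

```python
import unicodedata

# Shift_JIS byte sequence for JIS X 0208 0x2141 (sample character chosen
# for illustration; it is an assumption, not taken from the text).
SAMPLE_BYTES = b"\x81\x60"

# Decode the same bytes with a JIS-derived table and with Microsoft's table.
via_jis = SAMPLE_BYTES.decode("shift_jis")
via_ms = SAMPLE_BYTES.decode("cp932")

for label, ch in (("shift_jis", via_jis), ("cp932", via_ms)):
    print(f"{label:10s} U+{ord(ch):04X} {unicodedata.name(ch)}")

# Try to send the JIS-derived result back through the Microsoft table.
try:
    via_jis.encode("cp932")
    print("round trip through cp932 succeeded")
except UnicodeEncodeError:
    print("round trip through cp932 failed")
```

The same two bytes yield two different Unicode characters, so text that is decoded with one table and re-encoded with the other may not survive the round trip.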
Though we Open Source people should respect interoperability, we cannot achieve sufficient
interoperability because of this problem. All we can achieve is interoperability among Open
Source software.
GNU libc uses JIS/JIS0208.TXT
(http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT)
with a small modification. The modification is:
original JIS0208.TXT: 0x815F 0x2140 0x005C # REVERSE SOLIDUS
modified: 0x815F 0x2140 0xFF3C # FULLWIDTH REVERSE SOLIDUS
The reason for this modification is that the JIS X 0208 character set is almost always used in
combination with ASCII, in the form of EUC-JP and so on. ASCII 0x5C, not JIS X 0208 0x2140,
should be mapped to U+005C. The modified table is found at
/usr/share/i18n/charmaps/EUC-JP.gz on a Debian system. Of course, this mapping table is
NEITHER authorized NOR reliable.
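As a rough illustration of the modification above, the following sketch parses a line in the three-column layout of JIS0208.TXT (Shift_JIS code, JIS X 0208 code, Unicode code point, then a comment) and overrides the entry for 0x2140. The function names parse_mapping and apply_override are hypothetical and chosen only for this example; this is not how GNU libc itself builds its tables.

```python
# One sample line in the JIS0208.TXT layout, taken from the text above.
SAMPLE = "0x815F 0x2140 0x005C # REVERSE SOLIDUS\n"


def parse_mapping(text):
    """Return {JIS X 0208 code: Unicode code point} from JIS0208.TXT-style lines."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        _sjis, jis, uni = (int(field, 16) for field in line.split())
        table[jis] = uni
    return table


def apply_override(table):
    """Return a copy with 0x2140 remapped to FULLWIDTH REVERSE SOLIDUS."""
    patched = dict(table)
    patched[0x2140] = 0xFF3C  # leaves U+005C free for ASCII 0x5C
    return patched


original = parse_mapping(SAMPLE)
modified = apply_override(original)
print(f"original: U+{original[0x2140]:04X}")
print(f"modified: U+{modified[0x2140]:04X}")
```

With the override in place, ASCII 0x5C and JIS X 0208 0x2140 no longer compete for the same Unicode code point when both appear in EUC-JP text.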
I hope the Unicode Consortium will release a single authorized, reliable mapping table between
Unicode and JIS X 0208. The details of this problem are described at
(http://www.debian.or.jp/~kubota/unicode-symbols.html).
Combining Characters
Unicode provides a way to synthesize an accented character by combining an accent symbol with
a base character. For example, combining 'a' and '~' makes 'a' with tilde. More than one accent
symbol can be added to a base character.
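A minimal sketch of this, using Python's standard unicodedata module. Note that the accent is a dedicated combining character, U+0303 COMBINING TILDE, rather than the ASCII '~'. The use of NFC normalization to obtain the precomposed form is an addition for illustration and is not discussed in the text above.

```python
import unicodedata

base = "a"
combining_tilde = "\u0303"  # COMBINING TILDE
combining_acute = "\u0301"  # COMBINING ACUTE ACCENT

# Base character followed by one combining accent: two code points.
decomposed = base + combining_tilde
print(len(decomposed), [f"U+{ord(c):04X}" for c in decomposed])

# NFC normalization replaces the pair with the precomposed character.
composed = unicodedata.normalize("NFC", decomposed)
print(len(composed), unicodedata.name(composed))

# More than one accent can follow the same base character.
stacked = base + combining_tilde + combining_acute
print(len(stacked), [unicodedata.name(c) for c in stacked])
```

Both decomposed and composed display as the same accented letter, but they are different code point sequences, so comparing them reliably requires normalizing both to the same form first.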