“This article examines how character encodings are handled in
XML and Perl. We will look at what character encodings are and
how they relate to XML. We will then move on to how
encodings are handled in Perl, and end with some practical examples
of translating between encodings.”
“Encodings! The hidden face of XML. For most people, at least
here in the US, XML is simply a data format that specifies elements
and attributes, and how to write them properly in a nice tree
structure.”
“But the truth is that, before you can store or exchange text, you
first need to choose an encoding for it, that is, a mapping between
characters and the bytes that represent them. The most common of all
encodings (at least in Western countries) is without a doubt ASCII.
Other encodings you may have come across include the following:
EBCDIC, which will remind some of you of the good old days when
computer and IBM meant the same thing; Shift-JIS, one of the
encodings used for Japanese characters; and Big 5, a Chinese
encoding.”
“What all of these encodings have in common is that they are
mutually incompatible. There are very good reasons for this, the
first being that most Western languages can get by with 256 characters
or fewer, each encoded in a single 8-bit byte, while Eastern languages
use many more, thus requiring multi-byte encodings. Recently, a new
standard was created to replace all of those various encodings:
Unicode, a.k.a. ISO 10646. (Strictly speaking, these are two different
standards, ISO 10646 from ISO and Unicode from the Unicode Consortium,
but they are so close that we can consider them equivalent for most
purposes.)”
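
To make the incompatibility concrete, here is a minimal sketch using the Encode module that ships with Perl 5.8 and later. The sample strings and the hex_bytes helper are illustrative assumptions, not part of the original discussion. The sketch prints the bytes that a few of the encodings mentioned above produce for the same characters.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode $text in $encoding and return the
# resulting bytes as space-separated hex pairs.
sub hex_bytes {
    my ($encoding, $text) = @_;
    return join ' ', unpack '(H2)*', encode($encoding, $text);
}

my $latin = "caf\x{e9}";           # "cafe" with an acute accent (U+00E9)
my $kanji = "\x{65e5}\x{672c}";    # two kanji (U+65E5 U+672C)

# The same letter "A" has different byte values in ASCII and EBCDIC.
print hex_bytes('ascii', 'A'), "\n";            # 41
print hex_bytes('cp37',  'A'), "\n";            # c1

# One byte per character in Latin-1, but two bytes for U+00E9 in UTF-8.
print hex_bytes('iso-8859-1', $latin), "\n";    # 63 61 66 e9
print hex_bytes('UTF-8',      $latin), "\n";    # 63 61 66 c3 a9

# Japanese text needs a multi-byte encoding either way.
print hex_bytes('shiftjis', $kanji), "\n";      # 93 fa 96 7b
print hex_bytes('UTF-8',    $kanji), "\n";      # e6 97 a5 e6 9c ac
```

Because a byte sequence does not say which encoding produced it, the two Shift-JIS characters above would show up as four unrelated characters if a program read them as ISO-8859-1.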