Linux Today: Linux News On Internet Time.

XML.com: Character Encodings within XML and Perl

Apr 30, 2000, 15:35
(Other stories by Michel Rodriguez)

"This article examines the handling of character encodings in XML and Perl. I will look at what character encodings are and what their relationship to XML is. We will then move on to how encodings are handled in Perl, and end with some practical examples of translating between encodings."

"Encodings! The hidden face of XML. For most people, at least here in the US, XML is simply a data format that specifies elements and attributes, and how to write them properly in a nice tree structure."

"But the truth is that, in order to encode text or data, you first need to specify an encoding for it. The most common of all encodings (at least in Western countries) is without a doubt ASCII. Other encodings you may have come across include the following: EBCDIC, which will remind some of you of the good old days when computer and IBM meant the same thing; Shift-JIS, one of the encodings used for Japanese characters; and Big 5, a Chinese encoding."

"What all of these encodings have in common is that they are largely incompatible. There are very good reasons for this, the first being that Western languages can live with 256 characters, encoded in 8 bits, while Eastern languages use many more, thus requiring multi-byte encodings. Recently, a new standard was created to replace all of those various encodings: Unicode, a.k.a. ISO 10646. (Actually they are two different standards -- ISO 10646 from ISO and Unicode from the Unicode Consortium -- but they are so close that we can consider them equivalent for most purposes.)"
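
As a rough illustration of the incompatibility described in the excerpt (not taken from the article, which works in Perl), the short Python sketch below encodes the same characters under several of the encodings mentioned and prints the resulting bytes. The helper name `encoded_hex` and the sample characters are assumptions chosen for this example; `cp037` is one of several EBCDIC code pages.

```python
def encoded_hex(text, encoding):
    """Return the bytes of `text` under `encoding` as a hex string."""
    return text.encode(encoding).hex()

# A Latin letter: one byte in both encodings, but a different byte value.
print("A  ascii    ", encoded_hex("A", "ascii"))   # 41
print("A  cp037    ", encoded_hex("A", "cp037"))   # c1 (an EBCDIC code page)

# A CJK character: multi-byte, and different in every encoding.
kanji = "\u65e5"  # the character for "sun/day"
for name in ("shift_jis", "big5", "utf-8"):
    print("U+65E5", name.ljust(10), encoded_hex(kanji, name))

# ASCII cannot represent the character at all.
try:
    kanji.encode("ascii")
except UnicodeEncodeError:
    print("U+65E5 has no ASCII representation")
```

The same abstract character maps to different byte sequences, which is why a document has to declare its encoding before the bytes can be interpreted.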
