Non-European computing

  • European and non-European languages on the net: Further detail

    Character sets

    The idea of a character set is the key to the problems involved in using many languages on a computer. It sounds very nerdy, but is simple if you keep in mind what the word 'computer' actually means. It is in reality just an advanced calculating machine, and it only understands and deals with numbers.

    We humans, however, write words made up of characters. Thus, somehow characters - human-speak - must be translated to numbers - machine-speak. This we can do by simply giving each character in the alphabet a number - let's say that 'a' is 1, 'b' is 2, 'c' is 3. Then the word 'cab' can be written as '312' - c-a-b. If we do this, we are both happy - the computer has a set of numbers; we humans a word made up of characters.

    This is what happens. When we press the 'c' key on the keyboard, the number '3' is sent to the computer. The software checks the font, finds shape number 3 - which our human eye recognizes as a 'c' - and displays that shape on the screen. If the character is stored on disk, or sent across the network, the computer doesn't send the shape of the 'c'. It just sends the number, 3, leaving it to the computer that receives it to figure out how to display it.

    That is all fine and dandy, but what if the computer that receives this '3' from the network uses a totally different number system? What if it is programmed to consider 'z' to be 1, 'y' to be 2, and 'x' to be 3? If so, it will do just as its counterpart does: find shape number 3 and display that shape on the screen - an 'x' instead of the intended 'c'. So, if we want the 'c' to be displayed as a 'c' we must make certain that both computers use the same number system.
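    The mismatch described above can be tried out in a few lines of Python. This is only a toy sketch of the made-up 'a is 1, b is 2' numbering used in the text; the names sender_set, receiver_table, encode and decode are invented for the illustration and do not correspond to any real character set.

```python
import string

letters = string.ascii_lowercase

# Sender's toy character set: 'a' is 1, 'b' is 2, 'c' is 3, ...
sender_set = {ch: number for number, ch in enumerate(letters, start=1)}
# The matching lookup table for a receiver that agrees with the sender.
sender_table = {number: ch for ch, number in sender_set.items()}
# A receiver with a different numbering: 'z' is 1, 'y' is 2, 'x' is 3, ...
receiver_table = {number: ch for number, ch in enumerate(reversed(letters), start=1)}


def encode(word, charset):
    """Turn characters into numbers (what is stored or sent)."""
    return [charset[ch] for ch in word]


def decode(numbers, table):
    """Turn numbers back into characters (what is displayed)."""
    return "".join(table[number] for number in numbers)


numbers = encode("cab", sender_set)
print(numbers)                          # [3, 1, 2]
print(decode(numbers, sender_table))    # cab
print(decode(numbers, receiver_table))  # xzy
```

    The numbers on the wire are the same in both cases; only the table used to read them differs.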

    This is then what a 'character set' is. It is a table of all the characters in the alphabet with the numbers they correspond to in machine-speak. (In 'characters' we also include numerals, period, comma and other punctuation marks etc.)

    Every computer has some form of character set - otherwise humans could not communicate with it - and all PCs, Macs and Unix machines share the same character set for the basic characters used in English. That is why transferring English text across the network is normally fairly easy. Since both PCs and Macs use the same number for 'a', an 'a' will be displayed correctly however it is sent. However, there is a point or two to be added to this issue:

    Keymaps and code pages

    We said that when we pressed a key on the keyboard, a signal was sent to the brain of the computer. Since the letters are painted onto the keys, we think of this as pretty fixed and unchangeable, but in fact it is not. The keyboard only knows and transmits the key location. It is up to the software to link this key location to a character number. This link, which is known as a keymap, can therefore also be manipulated.

    The reason for dividing this process in two is that different countries have different keyboards - the Brits want a £ on the key that the Americans use for $. By putting the key-character link in software, manufacturers can use the same hardware, just painting different letters on the keys for each market.

    Thus the road from key-press to screen actually has three elements:

    Pressing the key marked 'A' --> Keyboard sends location signal 'xxyy'
    'xxyy' checked against keymap --> Character value '#1' is sent to screen
    '#1' is checked against font --> Character shape 'a' is displayed

    Here Macs and PCs vary slightly in how they express these things. In the PC world, these elements are combined into the idea of a code page, which is a combination of a keymap and a character set. You can have several code pages installed on your computer; if you choose one, both keymap and character set (and font) change in unison. This hides the mechanics from the user. If you choose 'Code Page 850', and use a Norwegian keyboard to press the 'å' key - we have one such - xzzz is sent to keymap 850, which translates it to character value #134. This is passed to character set 850, which displays 134 as an 'å' in the chosen font. If, however, we move to a newer machine that uses code page 1252, we still can press key xzzz, using the same keyboard. Keymap 1252 translates this to character value #229, it is then passed to the character map, where again, 229 has the 'å' shape.

    Thus, the user suspects nothing; he presses the 'å' key and gets an 'å' on screen in both cases. The problem only appears during network communication, because in the first case, the value 134 is passed on; in the second a totally different one, 229.
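    The two values mentioned above can be checked with Python's built-in codecs, which include tables for both code pages under the names 'cp850' and 'cp1252'. The sketch below is only an illustration: it shows the value each code page gives to 'å', and what a receiver gets if it reads the value using the other code page. The variable names are invented for the example.

```python
char = "å"

# The same character has a different value in each code page.
value_850 = char.encode("cp850")[0]
value_1252 = char.encode("cp1252")[0]
print(value_850, value_1252)  # 134 229

# If sender and receiver agree on the code page, all is well.
print(bytes([value_850]).decode("cp850"))    # å
print(bytes([value_1252]).decode("cp1252"))  # å

# If they disagree, the receiver displays some other character.
misread_as_1252 = bytes([value_850]).decode("cp1252")
misread_as_850 = bytes([value_1252]).decode("cp850")
print(repr(misread_as_1252), repr(misread_as_850))
```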

    The Mac doesn't really use the code page concept. Instead, it operates with fonts - which really are raw character sets - and 'keyboards' that are selected in a menu or control panel. It is up to the user to make sure that the keyboard and font correspond, if he or she uses a non-standard one (except that the Mac does keep each writing system separate - Chinese is linked to Chinese, Arabic to Arabic). It comes to the same thing in the end, only that the Mac isn't set up in the same way to operate with different keymap/character set combinations within the same language.

    The reason for this is historical; in the old DOS times, the PC side gave each application program fairly free rein in how it wanted to define European (non-English) characters and fonts. Thus, it was quite possible to run programs on the same computer that had different character values e.g. for 'ø' and 'å'. The code page combination became a simple way to hide this difference from the user. The Mac system kept much tighter control over this; fonts are shared by all programs on the computer. Therefore, there was not the same need to accommodate multiple solutions on the Mac as under DOS. (Under Windows, the PC side has now also gained greater standardization, but has kept the code page concept.)


    Formal rule

    The reason behind this rule is the way network computers used to send characters in days of yore. As you know, somewhere far inside the computer, it translates everything to 'binary' numbers. How large a binary number may be depends on the number of digits it can contain (just as with a decimal number: with four digits the highest possible number is 9,999; with three it is only 999). Most computers write characters with binary numbers containing 8 digits. The way it works, this gives 256 possible values, numbered 0 through 255 [why? Multiply 2 by itself until you have used it eight times: 2, 4, 8, 16, 32, 64, 128, 256.] Thus, with eight digits ('bits') you can display up to 256 different shapes or characters.

    However, in olden days, some computers needed to use one of these digits as a 'stop bit', which separated one character from the next. This left only seven digits free for use, and with 7 binary digits, the number of different values you can express - and thus the highest number of characters - is 128 (0 through 127). Today, this is no longer the case; modern systems can use all eight bits to express character information. Thus, the two systems are known as '7-bit' and '8-bit'. (Notice that this has nothing to do with the kind of technical information you sometimes see in advertisements, such as some computers being '16-bit' or having a '32-bit data path'. That describes something completely different in a PC's hardware.)
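    The arithmetic behind '7-bit' and '8-bit' is easy to check. The short Python sketch below (the function name value_count is made up for the illustration) counts how many different values fit in a given number of binary digits, and shows the highest one written out in binary.

```python
def value_count(bits):
    """Number of different values that fit in the given number of binary digits."""
    return 2 ** bits


for bits in (7, 8):
    count = value_count(bits)
    highest = count - 1  # values are numbered from 0, so the highest is one less
    print(bits, "bits:", count, "values, highest is", highest, "=", format(highest, "b"))
# 7 bits: 128 values, highest is 127 = 1111111
# 8 bits: 256 values, highest is 255 = 11111111
```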


    Corruption of extended characters

    The most common way an eight-bit based message is corrupted when passing through a 'rule enforcer' is simply that the enforcer removes the information in the eighth digit, leaving you with seven - the same way as if, e.g., the decimal number 2,345 was sent to a display that could only show three digits. It would then probably only show 345, chopping off the highest digit.

    For binary numbers, this has the effect of deducting 128 from the character's numeric value (as you remember, 128 is the difference between the 8bit maximum and the 7bit maximum). So, if you try to send an accented é, it will in the original 8bit character set have the value 233. The rule enforcer chops off the equivalent of 128, which leaves 233 minus 128: 105. This is the character value that is passed on, and eventually displayed on the receiver's machine. In the standard character set, 'i' has the value 105. So the original André is displayed as Andri.
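    The André example can be reproduced directly. The sketch below assumes the text is in Latin 1 (iso-8859-1), where é is 233; the function strip_eighth_bit is a made-up stand-in for the 'rule enforcer', not the code of any real mail system.

```python
def strip_eighth_bit(data):
    """Clear the eighth (highest) bit of every byte, as a 7bit-only system might."""
    return bytes(value & 0x7F for value in data)


original = "André".encode("iso-8859-1")
print(list(original))       # [65, 110, 100, 114, 233]

corrupted = strip_eighth_bit(original)
print(list(corrupted))      # [65, 110, 100, 114, 105]
print(corrupted.decode("ascii"))  # Andri
```

    Characters below 128 pass through untouched, which is why plain English text survives such a system.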

    We generally call a message that contains any single character with a value of 128 or above an '8bit text', although the majority of the text is normally within the A-Z range, i.e. within what can be expressed with 7 bits. If there is no such 'extended' character, we call the message a 7bit message.


    Character encoding with QP

    Quoted-Printable codes look awful, but the principle is simple. In a system that only allows 128 different characters, we cannot send 256 different ones. Half of them - the 'upper' half - will not pass through the system. However, what if we take one of these 'unacceptable' characters, and translate it into several separate characters, each of which is among the 128 allowable ones?

    That is what QP does, changing the 'upper', extended characters into several standard characters. To simplify the encoding, it uses each character's numeric value, but written in 'human form', i.e. instead of sending character # 233, it sends the number '233' - '2', '3', '3'. The numerals '2' and '3' are of course quite acceptable, so this does not conflict with the 7bit rule.

    To distinguish this code from when the user actually has written '2' or '3' in the text, each code has an equal sign - = - before it. Also, QP doesn't actually write '233': to save a little space, it uses hexadecimal numbers. They are more efficient: 233 is written in hex as E9. So, the code for é is =E9. If your software understands MIME and QP and can decode it, you will see 'André' in the text; if it doesn't, you will see 'Andr=E9'.
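    Python's standard library has a module called quopri that does this encoding, so the example can be tried out. The sketch below assumes the text is Latin 1 (iso-8859-1); the variable names are invented for the illustration.

```python
import quopri

raw = "André".encode("iso-8859-1")    # é is the single value 233 here
encoded = quopri.encodestring(raw)
print(encoded)                        # b'Andr=E9'

# Every value in the encoded form is below 128, so it passes the 7bit rule.
print(all(value < 128 for value in encoded))  # True

# A program that understands QP turns the code back into the original value.
decoded = quopri.decodestring(encoded).decode("iso-8859-1")
print(decoded)                        # André

# A 'real' equal sign has to be encoded too, so it is not mistaken for a code.
print(quopri.encodestring(b"1+1=2"))  # b'1+1=3D2'
```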

    Hex numbers
    While decimal numbers - the 'human' ones - are based on units of 10 (100 is 10 times 10) and binaries on 2 (100 is 2 times 2), hex is based on 16 (100 is 16 times 16). As we humans don't have number shapes for the digits that come after 9, hex numbers add the characters A-F for these 'extra' digits (A is ten, F is fifteen).

    Thus hex numbers are more space efficient than decimals. The 8-bit maximum, which we write 255 in decimals, is written 'FF' in hex - two digits instead of three. It looks like a code, but it is actually just a number written efficiently.
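    A few conversions make the pattern visible. This small Python sketch uses only the built-in format and int functions; the list of sample numbers is chosen just for illustration.

```python
samples = [9, 10, 15, 16, 105, 233, 255]

for number in samples:
    # format(..., "X") writes the number in hex with capital letters
    print(number, "is written", format(number, "X"), "in hex")

hex_e9 = format(233, "X")
print(hex_e9)          # E9
print(int("FF", 16))   # 255, the 8-bit maximum
print(int("100", 16))  # 256, i.e. 16 times 16
```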

    In addition, the message must contain information about what character set is used to encode the characters, so you may also get some awful things like =?iso-8859-1?= in the text. QP also encodes a few other things, like overlong lines, and - a strange detail - an 'F' at the beginning of a paragraph - this last because some email systems may think this means a "From" header. A 'real' equal sign is of course also encoded, as =3D.

    QP is part of the MIME system, described below, so all MIME email programs will understand QP, at least the basic elements, while no program that does not have MIME will understand QP. The two are completely linked.



    MIME

    This term is often used incorrectly in an extremely confusing fashion. Therefore, please take note: What you understand and have read about as being 'MIME' may well be just a small part of it. MIME is not a way of encoding attachments. It is not a way of sending sound or pictures in email. It is related to these tasks, but it is both much more and much less.

    Basically, MIME - which stands for 'Multipurpose Internet Mail Extensions' - doesn't actually 'do' anything at all. It is merely a way for a sending program to inform or declare to the receiver what the email message contains. It is thus just a few extra lines in the email message's header section which include extra information.

    All email messages will have some lines at the top saying 'From:', 'To:', 'Date:' etc. These are used by the mail software when transmitting the message. If the mail software is aware of the MIME agreement, it may include some extra lines saying 'Character set: The following is meant to be Arabic', 'Type of content: This message contains a video clip in the Video1 format' etc. The receiving program, if it is able to display Arabic, can then use an Arabic font for the text. Or, if, God forbid, it can display videos inside the email window, it will use that resource to display the video clip.

    This is all MIME does: it declares the content using a standardised format. It is up to the receiving program to act upon the information. Clearly, not all machines in the world can display Arabic text, so for them, the 'Arabic' information is meaningless. But if the machine can do Arabic, then the MIME system can tell it that this is appropriate. There is no other way the receiving email program can know that the text is supposed to be Arabic, French or whatever.

    MIME announces itself by having a header saying 'MIME-Version: 1.0'. Without this header, the receiver will ignore any other MIME headers.

    The charset header

    Character set information is optional. The sending program does not have to include it. The rules say that if there is no such information, the character set is assumed to be plain 7bit ASCII ('us-ascii'), although in practice many programs then treat the text as Latin 1 (iso-8859-1, see below). If the information is included, it takes the form of a 'charset=' parameter on the 'Content-Type:' header, and the name should be given in a set fashion (thus: 'windows-1252' and not 'Windows Latin 1'). A message may be multipart, i.e. divided into sections with different charsets; the charset information is then included above each section. This is however currently not implemented by anyone.

    In addition to the charset header, a MIME message should also contain information on whether the text is encoded or not (the 'Content-Transfer-Encoding:' header). If the text is encoded, it is almost always 'quoted-printable'. Other types of encoding are theoretically possible, but for text only QP is universally understood by MIME programs.

    If it is not encoded, the message may declare that the message is in 8bit (i.e. in violation of the 7bit rule; the header is however legal) or 7bit. While all MIME software will allow such headers to be added or add them automatically, it does not always do so correctly. It is quite common to see uncoded 8bit text in a message that declares itself as 'content-transfer-encoding: 7bit'.
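    To see what these headers look like in a real message, one can let Python's standard email package build one. This is only a sketch of one possible sending program: the addresses are invented placeholders, and other mail software may well choose a different encoding or header layout for the same text.

```python
from email.mime.text import MIMEText

# Build a one-word message and declare it as Latin 1 text.
msg = MIMEText("André", "plain", "iso-8859-1")
msg["From"] = "sender@example.org"    # placeholder address
msg["To"] = "receiver@example.org"    # placeholder address

# The full message, headers first, as it would be handed to the mail system.
print(msg.as_string())

# The MIME declarations a receiving program would look at:
print(msg["MIME-Version"])               # 1.0
print(msg.get_content_type())            # text/plain
print(msg.get_content_charset())         # iso-8859-1
print(msg["Content-Transfer-Encoding"])  # quoted-printable

# Undoing the transfer encoding gives back the original 8bit value for é.
body = msg.get_payload(decode=True)
print(body.decode(msg.get_content_charset()))  # André
```

    Note that the headers only declare what the body contains; it is the receiving program that has to decode the body and pick a suitable font.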

    Enclosed files

    Attached files are also important in the MIME system, although they do not directly concern our topic at hand. Such files that you send along with your email message must be encoded for the same reason as European text inside the message: PCs and Macs use all 8 bits when they create any kind of file (except unformatted text files), so these files cannot pass the 7bit rule.

    Therefore attached files have always been encoded as '8bit sent as 7bit', using different methods. Common encoding systems have been 'Binhex', most common on Macs, and 'uuencode', mostly on PCs and Unix machines. MIME will accept both of these, that is, both are legal transfer-encoding systems under MIME, although not all MIME-compatible programs will necessarily understand them. In addition, the MIME system adds a new one, which all MIME programs must understand (but older systems may not), called 'Base64' encoding. To avoid this 'techy' name, it is often called 'MIME encoding' - thus in Eudora/PC, while Eudora/Mac calls it 'AppleDouble'. But it is all the same thing.
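    The effect of Base64 can be seen with Python's standard base64 module. The sketch below uses four arbitrary byte values as a stand-in for the contents of an attached file; it only shows that the encoded form stays within the 7bit range and can be turned back into the original.

```python
import base64

# Stand-in for file contents: values from the whole 8bit range.
file_bytes = bytes([0, 233, 255, 128])

encoded = base64.b64encode(file_bytes)
print(encoded)

# Every value in the encoded form is below 128, so it passes the 7bit rule.
print(all(value < 128 for value in encoded))  # True

# The receiver decodes it and gets exactly the original bytes back.
restored = base64.b64decode(encoded)
print(restored == file_bytes)                 # True
```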


    ISO character sets

    The International Organization for Standardization (ISO) deals with everything from metric screw threads to the code for country names; that is why any standard made by them gets a high and strange number. Their standard no. 8859 concerns computer character sets. It only covers those languages which can be expressed with single-byte characters, i.e. not Chinese, Japanese, or Korean, for which there are other standards. But it does cover Russian (Cyrillic), Hebrew and Arabic among other languages.

    ISO 8859 consists of more than a dozen different character sets, numbered iso-8859-1, iso-8859-2 and so on. All of these are identical in the lower half of the table, i.e. the part that is expressed by the first seven bits of the character. They only vary in the upper half. Because of this, basic English characters - A-Z, numbers, punctuation - are part of, and the same in, all parts of 8859. This half also corresponds to the older 'American Standard Code for Information Interchange', and is thus known by its acronym ASCII. ISO 8859 is thus based on the ASCII character set and only extends this by adding 128 further, 'extended' characters, which vary between the parts.
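    This shared lower half and varying upper half can be checked with Python's built-in codecs for some of the 8859 parts. The sketch below is only an illustration; it picks one value from each half (65 and 233) and decodes it with five of the parts. The choice of parts and values is arbitrary.

```python
parts = ["iso8859-1", "iso8859-5", "iso8859-6", "iso8859-7", "iso8859-8"]

low_value = bytes([65])    # lower half: the same in every part
high_value = bytes([233])  # upper half: depends on the part

low_results = [low_value.decode(part) for part in parts]
high_results = [high_value.decode(part) for part in parts]

for part, low, high in zip(parts, low_results, high_results):
    print(part, low, high)

print(set(low_results))        # {'A'}
print(len(set(high_results)))  # 5 - a different character in each part
```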

    We sometimes call these two halves, respectively, 'low ASCII' and 'high ASCII'; this is thus incorrect, as only the 'low' half is actually defined by ASCII. Also, there is a tradition, in particular on the PC side, to call a text file without formatting an 'ASCII file' - it only contains the plain characters. Again, that is confusing; the distinction between formatting and not formatting text has no relevance for what ASCII actually is.

    Out of these 8859 parts, the most commonly used is 8859-1, which covers the characters used in most West European languages (including e.g. Icelandic, but excluding Catalan). Parts 2 and 3 cover the Eastern European languages that use Latin characters - Polish, Czech etc. Part 5 is for Cyrillic (Russian, Ukrainian etc.), 6 for Arabic, 7 for Greek, 8 for Hebrew. The others are less used, and cover specific situations such as Baltic languages etc.

    (Slightly confusingly, those parts of 8859 that cover only European languages using a Latin script are named Latin-1, Latin-2, Latin-3 etc. Latin 2 is the same as ISO 8859-2, but e.g. the eighth of the Latin-based tables, Latin-8, is not 8859-8 (that is Hebrew, not Latin), but ISO 8859-14. Confusing, so it is not advisable to use the name Latin-x for others than Latin 1 and perhaps -2 and -3.)

    The 8859 system is not complete; there is e.g. no part for Persian or Urdu, while there are on the other hand several for the Baltic region, because of historical disagreements. Also, some parts are not much used; thus 8859-5 for Cyrillic is less used on the net than older standards like the one called 'koi-r'. So, 8859 is not a completely universal answer.

    But for some regions, the relevant 8859 part is clearly the dominant character set on the Net, and the one it will be easier to comply with. That includes Western Europe and 8859-1 (Latin-1), the Arab world and 8859-6, the Hebrew world and 8859-8, and probably most of the Eastern European Latin languages.

    The 8859 is not the only character set accepted by the ISO; they also have older systems, like ir-11/koi-r for Russian, and 646 for 7bit characters - where 'plain' iso-646 is the same as ASCII, while there are about a dozen national 'variants' that change three or four characters to accommodate German, Swedish, Danish etc. These 'national 7bits' are now mostly obsolete.

    [And, to be meticulously precise: We say here that 'all computers' agree on using ASCII as a basis for the character sets. By this we actually mean 'microcomputers and related', the kind we normally see. However, many older large-scale computers (mainframes) use character sets specific to themselves, such as IBM's EBCDIC, which has no similarity to ASCII whatsoever. But when such computers communicate with us mortals, it is always done in the form of ASCII characters, so we do not normally need to know about these older character sets.]


    Windows and 8859

    It is actually more correct to say that the standard Windows code page (code page 1252) is based on iso-8859-1. For historical reasons [i.e. because some older machine types are unable to use these], 8859 leaves a number of character code values undefined, or 'empty' - to be precise, the first 32 characters of the extended set; character code # 128-159 (in hex numbers, 80-9F). Windows does have access to these code values and does not need to follow 8859-1 in this; so they have filled them with various symbols and special characters that don't exist in 8859-1. However, wherever 8859-1 has a defined character, Windows follows it using the same character value. Thus, Windows uses a 'Latin-1 plus'.
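    The 'Latin-1 plus' relationship can be checked with Python's built-in tables for the two character sets ('cp1252' and 'iso8859-1'). The sketch below is only an illustration: it compares every value from 160 to 255, and then looks at one value in the 128-159 range, 147 (hex 93), which Windows uses for a curly opening quotation mark.

```python
# From 160 (hex A0) upwards, the two tables are identical.
same_upper_range = all(
    bytes([value]).decode("cp1252") == bytes([value]).decode("iso8859-1")
    for value in range(160, 256)
)
print(same_upper_range)  # True

# In the 128-159 range Windows has added characters of its own.
windows_char = bytes([147]).decode("cp1252")
print(windows_char)                     # “
print("“".encode("cp1252")[0])          # 147

# 8859-1 has no such character at all, so it cannot be sent as Latin 1.
try:
    "“".encode("iso8859-1")
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False
print(in_latin1)                        # False
```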

    The Mac, on the other hand, is slightly older than Windows 3, and did not take 8859 into account when it defined its standard character/font set. It does, of course, follow ASCII for the lower part of its character set, like all computers do. The 'higher' half contains most of the same characters as 8859-1 and Windows, but has given them different code values than they have in Latin 1. There is also one notable discrepancy: The standard Mac font set does not include the three characters required for Icelandic [you need a special Icelandic font/keyboard to use these], unlike 8859-1. Otherwise, Mac, Windows and 8859-1 cover the same languages in their standard English/European set-ups.



    In order to illustrate what we mean, let us take an example. Remember that the computer sees any character only as a number. It does not inherently 'know' whether it is an Arabic or Danish character - this is only determined by the font chosen to display the text - and the font is not transmitted with the email message, just the character numbers, which are shared between the European and Arabic characters.

    So, let us say we use Danish, and want to send an 'ø' from a Mac to a PC. On the Mac, 'ø' has number 191. In the PC character set (Latin 1), however, the 'ø' has number 248. So, to make sure it is displayed correctly, we transform all characters with value '191' to '248'. That way, the PC will display them as ø's.

    Then, we switch to Arabic, and in that Arabic text there is a question mark. Now, it so happens that the Arabic question mark also has character value 191, just in a different font. And the Mac and PC agree that Arabic question mark is 191.

    But the email program does not know whether the text was written in Arabic or European. It will probably assume that it is a European text, and change all the 191s to 248s. If it is a Dane writing to his mother, fine. But if it is an Arab, all his question marks will be changed into something quite different (Persian g's, actually). And not only question marks: every Arabic character will be changed into something else. Thus, an Arabic text sent from a Mac to a Windows machine under the assumption that it is actually European will be unreadable; all the character values will have been changed. So, we need to pass along the information that this text is 'intended to be' Arabic or European or whatever. This is what the MIME charset headers described above are for.
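    The example can be traced in Python, which has tables for the Mac's standard European set ('mac_roman'), Latin 1 and the Arabic part of 8859 ('iso8859-6'). The sketch is only an illustration under two assumptions: the table MAC_TO_LATIN1 contains just the single conversion discussed here, whereas a real email program would convert the whole upper half, and iso8859-6 is used as a stand-in for the Arabic character set on both machines. The function name is invented.

```python
# Only the one conversion from the example; a real table would cover many more values.
MAC_TO_LATIN1 = {191: 248}


def naive_mac_to_pc(data):
    """Convert Mac values to Latin 1 values, assuming the text is European."""
    return bytes(MAC_TO_LATIN1.get(value, value) for value in data)


# A Dane writes 'ø' on the Mac: value 191, converted to 248, correct on the PC.
danish = "ø".encode("mac_roman")
print(list(danish))                                # [191]
print(naive_mac_to_pc(danish).decode("latin-1"))   # ø

# An Arab writes a question mark: it also has value 191 ...
arabic = "؟".encode("iso8859-6")
print(list(arabic))                                # [191]

# ... so the same conversion silently changes it to 248, which is something else.
converted = naive_mac_to_pc(arabic)
print(list(converted))                             # [248]
print(converted == arabic)                         # False
```

    The conversion itself cannot tell the two cases apart; only a declaration of the intended character set can.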

    Knut S. Vikør
    December, 1996

    Responsible for this Web page is Knut S. Vikør.