Arabic

Non-European computing


    EUROPEAN AND NON-EUROPEAN LANGUAGES ON THE NET

    A SURVEY OF PROBLEMS AND SOLUTIONS

    Over the years, the originally English-language dominance over the computer world has disappeared, giving room for not only European languages but even non-European scripts like Russian, Arabic or Chinese. Giving normal, non-specialized computers the possibility to work in such languages is increasingly a fairly trivial and inexpensive affair.

    At the same time, however, computers have ceased to be islands separate from each other, where each of us processes our words and publishes our desktops in splendid isolation. They are increasingly a means of communication with other computers over a network. How does our Russian, Chinese or German fare in this Internetted world? Can we use our non-English languages, or are we forced to step back into the English-only primitive era of computing?

    The answer is yes and no to both. One can use any language in Internet communication, e-mail, newsgroups or the Web. But it is neither automatic nor ubiquitous, and even if we are able to send messages in Chinese - or in European languages like French and German - there is no guarantee that the message will reach the addressee in the form we want. In a few years we may perhaps be able to say, 'write, send and forget it', but for the moment we do need some awareness of what goes on if we want to be reasonably certain that the reader reads what we intended to write.

    The following is therefore an introduction to the general theme of using non-English languages on the net, written by a non-specialist for the semi-literate computer user, and ending up in a description of some tools we have on this web-site for using various European and non-European languages in email.



    Unfortunately, it is virtually impossible to write about such things without touching on matters of high nerdiness, using strange terms and abbreviations. Rather than defining them as I go along - making the text even less readable - I have linked each such term to an explanatory sub-text. Hopefully they will clarify more than mystify.

    Although the French and German languages are closer to English than to Arabic or Chinese, some of the same problems apply whenever you use a language other than English on the net. Let us therefore first look at the problems generally, before branching off into the specific problems of non-European scripts.


    COMPUTERS AND CHARACTERS

    There are two main reasons why using non-English languages poses problems. One is that there are different computer types - DOS, Windows, Mac, Unix - on the Net, and these differ in how they handle non-English text. The other is caused by the network itself and the specific restrictions built into it, which particularly hurt non-English languages, that is, languages that require more characters than unaccented A-Z.

    Why does it matter that Windows and Macs are different? Because on the Net, we don't always know what kind of computer is at the other end of the cable. This can cause confusion. The computers transmit each letter in a text as a numeric value, and the correspondence between letters and numeric values is called a character set. Now, the Mac, PC and so on are all completely standardized in their character set for English; those characters have identical numeric values on all computers. But everything else is basically a free-for-all. Not only do the character sets for non-European scripts like Chinese and Arabic differ from computer type to computer type; even the accents and characters needed for European languages other than English - Spanish ñ, German ß, Danish ø - are handled differently by each computer type. The numeric value of e is the same on a Mac and a PC, but the numeric value of é is quite different.
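
    The point about e and é can be checked with a few lines of Python. This is only a sketch: the codec names are Python's own, and pairing "mac_roman", "cp1252" and "cp437" with the Mac, Windows and DOS of the day is an assumption made for illustration.

```python
# Numeric value of the same letter under three character sets.
# The codec names are Python's; which one a given machine actually
# used is an assumption made for this illustration.
CHARSETS = {
    "Mac": "mac_roman",
    "Windows": "cp1252",
    "DOS": "cp437",
}


def numeric_values(char: str) -> dict:
    """Return the single-byte value of `char` in each character set."""
    return {name: char.encode(codec)[0] for name, codec in CHARSETS.items()}


for ch in ("e", "é"):
    values = numeric_values(ch)
    print(ch, {name: hex(value) for name, value in values.items()})
```

    The unaccented letter gets one and the same value everywhere, while the accented one gets three different values.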

    Thus, any software that handles relations between PC and Mac has to take this difference into account. Normal programs, word processors and the like usually do this OK; if you use a program like WordPerfect/Mac to open a document made by WordPerfect/PC, then the é, ñ and other accents are translated from the PC to the Mac way without the user even noticing. Unfortunately, network software isn't as well behaved. If you put an é into an email message from a PC, the software will send the numeric value of the é under the PC system. The Mac will receive this character value and display the corresponding character in its own character set, which, whatever it is, is not an é. The result is a confused or corrupted message.
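
    What the Mac reader ends up seeing can be simulated as follows. The function name is hypothetical, and the choice of "cp1252" for the PC and "mac_roman" for the Mac is an assumption, not a statement about any particular mail program.

```python
def display_on_mac(pc_bytes: bytes) -> str:
    """Interpret bytes written on a Windows PC as if they were Mac text.

    No conversion is done, which is exactly the mistake described above.
    """
    return pc_bytes.decode("mac_roman")


sent = "légère".encode("cp1252")   # what the PC puts on the wire
print(display_on_mac(sent))         # the accented letters come out wrong
print(display_on_mac(b"plain"))     # unaccented English survives untouched
```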

    It is even worse for non-European languages. At least, Europe is a fairly large market. Thus, some 'international' software exists that will convert from PC characters to Mac characters. But only for European characters (accents and similar). Which is fine for the French. But transmitting an Arabic or Japanese text brings up quite different issues, and what solves a problem in French can cause new ones in Arabic. And as the software has no awareness of what language or script the message was originally written in, it will often just assume that everything that is not English is West European. Which may just add to the confusion.

    That makes communication between different computers difficult. But, beyond this, even Mac-to-Mac or Windows-to-Windows communication can be impossible, both in French and in Russian or Chinese. This is because of the restrictions on the net itself. For historical reasons, the net has a formal rule that certain types of messages - in particular email - can only use a limited number of characters. While any Mac or PC can display about 220 different characters in any font, only about 90 of these should be used on the net. 90 sounds like a lot, but when you take away the numerals, punctuation etc., you only have room left for A-Z in upper and lower case. No accents, no national characters, no non-European scripts. This rule, called the 7-bit rule because each character must fit into seven bits (128 possible values, some of them reserved for control codes), basically stops the use of characters not present in English in email and the other services concerned.



    In the following, we will call the standard English characters, A-Z and 0-9 - those that fit into the 90-strong character set - the basic character set, and any character outside it, be it French accents or Chinese ideographs, the extended characters. As mentioned, both PC and Mac can display at least 130 such extended characters, and it is here that they, in different ways, store both European accents and all non-European scripts and characters.
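
    The split between basic and extended characters is easy to test for. The sketch below treats everything with a numeric value above 127 as extended, i.e. anything that does not fit into seven bits; the function names are made up for this example.

```python
SEVEN_BIT_VALUES = 2 ** 7  # 128 values fit into seven bits


def extended_characters(text: str) -> list:
    """Return the characters in `text` that fall outside the 7-bit range."""
    return [ch for ch in text if ord(ch) >= SEVEN_BIT_VALUES]


def is_7bit_safe(text: str) -> bool:
    """True if the text uses only the basic character set."""
    return not extended_characters(text)


print(is_7bit_safe("Hello, world 123"))
print(extended_characters("Spanish ñ, German ß, Danish ø"))
```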

    EUROPEAN SOLUTIONS

    Clearly, this is not an acceptable situation. We non-English are becoming more demanding; English may be the lingua franca of the Net, but we will be damned if we have to write it in English to our own compatriots, just because of some stupid ole rule. So solutions are sought, and found. Unfortunately, not everyone has found the same solution. Thus, while there are ways, even officially sanctioned and standardized ways, of getting around all of the problems mentioned, they are not necessarily applied in the software in existence on the net; all the more so since most network software is produced in the US, where these kinds of problems are considered somewhat exotic.

    So, one must expect various kinds of solutions or non-solutions among any group of users, and apply the old network wisdom: 'be liberal in what you accept, but strict [following the standard] in what you send out'. For European languages, that adage applies to the software developers, for non-European languages it also applies to the users.

    Problem 1: Beating the 7bit-rule

    Let us look at some of these solutions. The first and basic problem is the regulation that you can only use the restricted set of 90 basic characters on the net. There are four possible kinds of solutions to this:
    1. Accept it, and try to squeeze the required characters into the restricted set, by replacing English letters or standard symbols.
    2. Ignore it. Use the full set of extended PC/Mac characters as if the rule never existed.
    3. Adapt to it, by encoding the characters so that a full range of extended characters can be transported over a 7bit communication line.
    4. Change it, by adopting an 'extended' network system that properly negotiates a full, 8bit, communication link between computers that can handle it.

    Beyond acceptance, not yet change

    Of these, option (1) belongs to the past. It was common earlier, but could never be more than a local solution - we could fit either the Swedish, or the French characters, either the Arabic or the English, but not both; which was in the long run not tenable.

    Option (4) belongs to the future. It would solve problems if it was generally used, but such 'extended' post office software is still rare. As it is not done at the user level, but by network administrators, we will not discuss it further here.

    Law-breaking and its dangers

    Options (2) and (3) are both practical and possible. The 7bit rule is in reality obsolete, and is mainly kept because in the anarchic Internet world we cannot guarantee that there isn't a stone age computer somewhere that will actually break down if it sees an extended character. But if such exist, they must be rare by now. Thus many specialists consider it unproblematic to send a message that contains the full range of characters, basic and extended.

    The problem with this 'illegal' option is that not everyone agrees you can break rules at your leisure. Thus, much network software is set up to check for and 'correct' any such behaviour. In the worst case, such police procedures will return your email as 'undeliverable'. More commonly, however, they will allow it through, but change any extended character to a basic one. Typically, an é - extended - is changed into an i - basic, an ø - extended - is changed into an x - basic. Légère will become lighre.
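
    The particular substitutions mentioned (é to i, ø to x) are what you get if the eighth bit of each Latin 1 byte is simply cleared. The following sketch reproduces that; it assumes the 'rule enforcer' works by bit-stripping, which is one plausible mechanism and not a description of any specific program.

```python
def strip_high_bit(text: str, charset: str = "iso-8859-1") -> str:
    """Clear the 8th bit of every byte, as a strict 7-bit relay might."""
    raw = text.encode(charset)
    return bytes(b & 0x7F for b in raw).decode("ascii")


print(strip_high_bit("Légère"))  # Lighre
print(strip_high_bit("ø"))       # x
```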

    It may be the choice of your network administrator if this 'rule enforcer' is operational or not; in Europe it often is not, in the US it is rather common.

    Encoding - a solution for all?

    So, mail and be damned is not officially accepted, and can be impossible for some. The other option, (3) - character encoding - is the official solution, and is in fact also the most useful one for non-European mail, as will be explained. However, it also has an important drawback: if the receiver doesn't have the software to decode the message, it is quite unreadable to him or her; which is why the most common system of encoding, 'quoted-printable' or QP, is often called 'quoted-unreadable'. Such readers will just see codes that look like this: l=E9g=E8re.
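
    Quoted-printable is simple enough to try out directly. Python's standard quopri module implements it; the two wrapper functions below are hypothetical helpers, and Latin 1 is assumed as the character set of the text.

```python
import quopri


def qp_encode(text: str, charset: str = "iso-8859-1") -> str:
    """Encode text so that only 7-bit characters go on the wire."""
    return quopri.encodestring(text.encode(charset)).decode("ascii")


def qp_decode(encoded: str, charset: str = "iso-8859-1") -> str:
    """Reverse the encoding, as MIME-aware software does on receipt."""
    return quopri.decodestring(encoded.encode("ascii")).decode(charset)


wire = qp_encode("légère")
print(wire)             # l=E9g=E8re
print(qp_decode(wire))  # légère
```

    A reader without decoding software sees the first line; a reader with it sees the second.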

    This kind of encoding is part of the larger specification for email software called MIME, a term often used in a confused fashion. MIME is not the name of a particular program; it is a specification that any modern email program should adhere to. Easing the use of non-English languages in email is one of its aims. Unfortunately, not all MIME-compatible software puts much emphasis on this part of the specification, and there is still some US-based software that does not even try to follow this standard.

    Thus there is no panacea either way. There is nothing you can do to guarantee that the receiver reads what you intended to write. Some people will receive one system fine, but gag on the other, others will only be able to use the other system, not the first. But there are ways to improve the chances of success.


    Solution: Check your software, check your mail

    These steps will increase your chances of communicating in non-English languages:
    • You should yourself make sure that the e-mail software you use is MIME compatible. Check the manual, or send a message to yourself. When it comes back, look in the header section of the mail (up with 'To:', 'From:', 'Subject:' etc.) and see if you can find a line saying MIME-Version: 1.0. If you do, you're OK. If not, try to change to an e-mail program that does support MIME. That will improve your chances of receiving e-mail correctly, however it is sent.
    1. When you are writing to someone from whom you have received email, check if he or she is in the same category: look at a recent message from him or her, and see if you can find the "MIME-Version: 1.0" header.
    2. If you find this header, turn QP encoding on. He will receive your accented characters fine. If you don't find it, turn QP encoding off and never use it. With QP on, he would not be able to read your accented characters.
      • How you turn QP on or off, depends on the software. Good MIME software should allow you to turn encoding on or off for individual messages (Eudora has a "QP" icon that you can click on). Bad software should at least let you set this in Preferences or Configuration. Lousy MIME software will always have QP on, and not give you the choice.
    3. If your correspondent doesn't have MIME software, so that you turn QP off in messages to him or her, my suggestion is to try using extended (accented) characters anyway and see what happens. If you are lucky, the correspondent will receive everything OK (more likely if the sender is in Europe). If not, he will most likely tell you; in that case you should advise him or her to upgrade their software, and until then you must avoid extended characters in writing to him or her.
    4. If you are writing to a group, or do not know about the receiver, there is no general advice. I would suggest turning QP encoding on (if you have the choice) and seeing if you get any cries of anguish back. Others say one should first try with QP off - it is a matter of choice. Experiment and see what gives the best result in each case.
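
    The rule of thumb in the steps above can be sketched as a small decision function. It looks for the header under its actual field name, MIME-Version, using Python's standard email package; the function names and the sample messages are invented for the example, and defaulting to QP on for unknown receivers is just the suggestion from step 4.

```python
from email import message_from_string
from typing import Optional


def correspondent_uses_mime(raw_message: str) -> bool:
    """True if a received message carries a MIME-Version header."""
    return message_from_string(raw_message)["MIME-Version"] is not None


def should_use_qp(last_message_from_them: Optional[str]) -> bool:
    """Decide whether to turn QP on for a reply."""
    if last_message_from_them is None:
        return True  # unknown receiver: step 4, a matter of taste
    return correspondent_uses_mime(last_message_from_them)


with_mime = "From: a@example.org\nMIME-Version: 1.0\n\nBonjour"
without_mime = "From: b@example.org\n\nHello"
print(should_use_qp(with_mime), should_use_qp(without_mime))
```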

    Problem 2:
    A Mac is not a PC, nor is a Window a Dos.

    The above is as far as we can come in beating the 7bit rule, and most of the time for most of the people, it will solve that issue. But there was another basic problem with extended characters: that they are different on Mac and PC (and between DOS and Windows, too, for that matter). How do we settle that? Again, there are two possible ways to go:
    • Either everyone agrees that in email and on the net we use only one common character set, that is, we standardize our network software.
    • Or we allow the use of many character sets, but we inform the receiver about what character set we have used in our email, so that the receiver (the software) can treat the mail accordingly.
    These can of course be combined. There is an organization that specializes in this, the International Organization for Standardization (ISO), and it has created such a standard. For bureaucratic reasons, this character set standard has got the number 8859, and is divided into many parts for various continents and scripts. The part concerning European accented and extended characters is the first, 8859-1, and is often also called 'Latin 1' (Latin or Roman is the name for English + West European script, as opposed to Cyrillic, Arabic etc.).

    As it happens, Windows uses Latin 1, as do most Unix machines, while the Mac does not. Thus, it is fairly common for European mail to accept this ISO standard, and let the Mac network software take care of the conversions needed between the Mac's own character set and Latin 1. (Most Mac network software will do that, by converting text coming from the network to the Mac standard, and vice versa for texts being sent out.)
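
    The conversion done by Mac network software amounts to re-mapping each numeric value from one table to the other. A minimal sketch, assuming Python's "mac_roman" codec stands in for the Mac's own character set; the function names are hypothetical, and characters missing from the target set are replaced by a question mark.

```python
def network_to_mac(data: bytes) -> bytes:
    """Convert incoming Latin 1 text to the Mac's own character set."""
    return data.decode("iso-8859-1").encode("mac_roman", errors="replace")


def mac_to_network(data: bytes) -> bytes:
    """Convert outgoing Mac text to Latin 1 before it is sent."""
    return data.decode("mac_roman").encode("iso-8859-1", errors="replace")


incoming = "légère".encode("iso-8859-1")
on_the_mac = network_to_mac(incoming)
print(incoming.hex(), "->", on_the_mac.hex())
print(mac_to_network(on_the_mac) == incoming)
```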

    This normally works, but it might still be a good idea to state in each message what character set we have used. This can be done under the MIME system we discussed above: a standardized line can be added to the email's header saying which character set it was written in. However, most email software today assumes that any message is either in the restricted A-Z (7bit) character set, or in Latin 1 (iso-8859-1).
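
    For readers who want to see what such a header looks like, Python's standard email package can build a message that declares its character set. This is a sketch only: the addresses and subject are placeholders, and the helper function is invented for the example.

```python
from email.message import EmailMessage


def build_message(body: str, charset: str = "iso-8859-1") -> EmailMessage:
    """Build a message that declares its character set and uses QP."""
    msg = EmailMessage()
    msg["From"] = "sender@example.org"    # placeholder address
    msg["To"] = "receiver@example.org"    # placeholder address
    msg["Subject"] = "Charset test"
    msg.set_content(body, charset=charset, cte="quoted-printable")
    return msg


msg = build_message("légère")
print(msg["MIME-Version"])
print(msg["Content-Type"])
print(msg["Content-Transfer-Encoding"])
print(msg.get_payload())
```

    The Content-Type line is where the character set is announced; the body itself travels in quoted-printable form.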


    Solution: Check your software

    Yes, the solution to this problem ends up being the same as that above: If you have made sure that the email software you use follows the MIME specifications (is MIME compatible), you have solved the PC/Mac problem as far as European extended characters are concerned. For these languages, the standard is Latin 1, and all MIME-compatible email software will understand and be able to handle Latin 1 messages correctly. (You may have to inform the more stupid programs that you are using European - Latin 1 - characters, and not just US English, in a Configuration or Preferences setting.)


    BEYOND EUROPE

    So, in this way, we have more or less solved the European problem. But what about non-European scripts? They certainly produce problems beyond the few accents and characters you need to add in order to write French or Danish. But the problems are of the same type as the European problems, and the same kind of solutions that we have sketched above for Europe also apply to non-European mail. Our major problem is not with the standards and theory of solutions, but with the sloppiness of the software, which does not always apply these solutions properly.



    In order to use e.g. Arabic on your PC or Mac, you need a special Arabic font, as well as some software that can take care of the peculiarities of the script (for Arabic, right-to-left writing direction, different shapes of the characters etc.). Network software alone will not do this for you: In order to have any benefit from the following, you must already have a Mac, Windows or other machine that has the resources required to use the scripts of interest to you, and the ability to use it in your email program or external text editor. That is different on each machine; in this context we are only concerned with the network side of the equation, how your email program communicates with others.

    The first problem we discussed, that of the 7bit rule, is the same for non-European as for European scripts, and the same advice applies. Non-European scripts generally require an extended character set (an exception being Japanese, which early developed a method using a restricted character set), so we must get 8bit text across.

    Problem 3: Which script is this?

    The second problem, that of the multitude of character sets in the computer world, is however not quite as easy when we go outside Europe. We mentioned two alternatives, one just assuming that everyone uses the same character set, so that the software does not have to worry about the difference; the other adding a line to each message's headers informing about what character set was used.

    Our conclusion for Europe was that the second option might be helpful. Outside Europe, it is more than that, it is crucial. We cannot 'assume' that everyone uses the same character set if some messages may be in Swedish and others in Chinese. Not only do these use different scripts - that could be corrected by selecting the appropriate font manually - but the network software must handle European, Cyrillic and Arabic texts quite differently even before they come as far as being displayed in a font. They must use different conversion methods on the messages depending on the script used.

    So, outside Europe, we must use the second option, so that the email software has the information it needs to handle each individual message. And as mentioned, this method exists, and is part of the MIME standard. Its 'charset' header contains this information - while it is normally set to represent only European characters, e.g. set to 'iso-8859-1', an Arabic user can change that to 'iso-8859-6' and the receiving email program can then act accordingly, depending on its ability to handle Arabic text.
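
    Why the charset information is crucial can be seen by decoding the same bytes twice, once with the declared character set and once with the usual Latin 1 assumption. The function name and the sample word are chosen for illustration only.

```python
def decode_body(raw: bytes, charset: str = "iso-8859-1") -> str:
    """Turn received bytes into text using the declared character set.

    Falls back to Latin 1, the assumption most software is said to make.
    """
    return raw.decode(charset, errors="replace")


arabic = "سلام"
wire = arabic.encode("iso-8859-6")      # what an Arabic sender transmits

print(decode_body(wire, "iso-8859-6"))  # the original Arabic word
print(decode_body(wire))                # West European letters instead
```

    Displaying the result still requires an Arabic font and right-to-left support on the receiving machine; the sketch covers only the numeric side.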


    The ideal package for non-European email

    Thus, the standards are already in place. To have a correct rendering of non-European languages in email, the software should have the following requirements:
    • It should allow text editing and display in the non-European script
    • It should be MIME compatible, and preferably
      • Allow the user to set quoted-printable encoding to 'on' or 'off' -- preferably for each individual message.
      • Allow the user to set charset headers -- preferably for each individual message.
      • -- Do any required adaptation of the email message based on the information in the charset header.

    - An ideal that does not yet exist?

    MIME-compatible email software is no longer a rarity; it exists on all platforms and most up-to-date email software claims MIME compatibility. Unfortunately, what interests us, the character sets, is not well developed in current email programs (they are much more keen on multimedia, file attachments and other flashy things). Thus, while most (but not all) software claiming MIME compatibility will allow some of the features above, other features -- largely what I have written after the double hyphen -- are seldom or never implemented. The software may allow you to write a charset header into the program's overall configuration file, but it will almost never let you easily change that header between a message that is supposed to be in a European language and another which is to be in Arabic.

    Further, in the last and most crucial point above, I have put nothing before the double hyphen at all. Most software, even if it allows you to type 'iso-8859-6' (i.e. Arabic) in a charset header, will do nothing to ensure that the message is sent or displayed accordingly. What "adaptation" means varies from computer to computer. On the Mac, it normally means converting the text itself (the numeric values) to conform to the Mac's standard; on the PC it may mean choosing a 'code page' that fits the message. But most software will not do either. Sometimes, it can be done manually by the user, if only he knows how; in other cases, the omission is fatal and leaves the messages unreadable.

    So, the solution is possible and doable, but the awareness of the needs for multi-language email has not reached the software engineers yet. Or the marketing people, whoever makes the decisions.

    An example: Eudora/Mac, a world-wide email program

    I know of only one proper exception, which is the Eudora email program on the Mac, and only the Mac version -- Eudora for Windows does not support all this yet. Standard Mac Eudora supports MIME, it allows you to set QP for each message, and it allows a multitude of scripts and character sets for any number of languages you may want, selectable from a menu. Granted, you need to get the non-European charset plugins from another source (on the Web), and it is not all automatic yet: The user has to check a couple of settings and manually select the Arabic font (for incoming text) or the Arabic charset (for outgoing texts); Eudora will not combine these automatically yet. But Eudora is still a model for how it could be done. If every program had the same capability, the problem would have been solved. (Read my non-European Eudora and "Mac Arabic on the net" pages for more on how to set up Eudora correctly).


    Non-European solution, so far:

    As it is, what users of non-European languages should shoot for, is at least to
    • Install MIME compatible email software
    • Find out how to set a charset header in the message - if possible - and make sure that it corresponds to the language you send in, even if you have to change the configuration for every message you send.

    When that is done, there is at least a fair chance that a correspondent with a computer that has the same language/script installed will be able to read the message correctly.


    OTHER NET SERVICES

    Most of what is said about email does actually also go for other Internet services such as Usenet news, the Web etc., because most of these do include the MIME system that we have pointed to as the way forward for non-European mail interchange.

    Usenet

    Thus the Usenet news can be based on MIME. However, there are many news programs that are not MIME compatible, and it is generally considered bad manners to use QP-encoded messages in newsgroups, because a large number of the readers will not be able to read the message. While the 7bit rule in theory also applies to news (it has adopted the email rules en bloc), here it is normally ignored, and there are no technical restrictions on using uncoded 8bit text in News.

    However, without MIME, information about which character set is used is absent. In practice this is to some extent decided by informal consensus within each newsgroup. In groups where non-European languages are used, most people tend to gather around certain character sets (such as e.g. the koi8 standard for Russian groups), and the users then know what to expect in that particular newsgroup. This is not a complete solution, but it alleviates the problem.
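
    In software terms, such a consensus is just a lookup table from newsgroup to character set. The sketch below is purely illustrative: the group names are invented, "koi8_r" is Python's name for one koi8 variant, and the Latin 1 fallback is an assumption.

```python
# Hypothetical table of per-newsgroup conventions; the group names are
# invented and the entries would have to be filled in by the user.
GROUP_CHARSETS = {
    "example.russian.talk": "koi8_r",
    "example.arabic.talk": "iso-8859-6",
}


def read_article(raw: bytes, newsgroup: str) -> str:
    """Decode an article using the charset its newsgroup has settled on."""
    charset = GROUP_CHARSETS.get(newsgroup, "iso-8859-1")
    return raw.decode(charset, errors="replace")


posted = "мир".encode("koi8_r")
print(read_article(posted, "example.russian.talk"))  # readable Russian
print(read_article(posted, "example.unknown"))       # wrong guess, garbage
```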

    Some MIME-compatible news programs that allow non-European text do exist. One such for Mac users is the newest version of "Yet Another Newswatcher", which comes with conversion tables for many non-European languages (and others are available at my site, Middle East News tables, 28 K). This also includes MIME charset header insertion and conversions.


    Usenet suggestion

    Choose a Newsreader in which you can read and write your preferred script. MIME compatibility is a plus, but not essential in news. Find out which character set(s) is commonly used in the newsgroups you want to access, and try to adapt your software to send in and display messages from this character set / code page / font. One cannot be more specific on the last score, as it will depend on each newsreader program how and if this can be done. (Read my "Mac Arabic on the net" page for an example on the Mac.)

    The Web

    Again, the World-wide Web system is supposed to include MIME. However, the integration of this, and in particular the usage of character set information, is not yet properly implemented in the Law of the Web - the http protocol. One would imagine that its character set information could be added to a page's header section like in email, but this is not the case. No browser will recognize this information, and as far as I understand, http envisages a rather more complex setup with external databases, not operational today.
    • So, other solutions are found. One that does not require the reader's machine to have any special fonts or resources installed, is to save the non-European text as pictures. The Web of course has the capability to display pictures as well as text. So, the author 'photographs' the page of Arabic or Chinese text and publishes this picture. The quality of the picture may be fine. However, as it is just a picture, the reader cannot of course search the text or copy it into another program.
    The main drawback with this procedure, however, is that a 'photograph' of a page of text takes up much more computer storage space than the real text, about 15-20 times as much. And it is correspondingly slower to transfer and display. In a day where most people still have fairly slow network connections, that is a major argument against using this method.
    • Another, and more useful - but more technically demanding - solution, is to map out the most common character sets used for each script, and then, on the server, keep separate text files for each variant, doling them out as requested. The way it works is, again using Arabic as an example: Of the many variant character sets for Arabic, there are basically two that are most common, one used by Mac and Unix machines, the other by Arabic Windows. So, the author of an Arabic Web page produces two versions of every page, one in the Mac and the other in the Windows character set.
    Now, when a reader clicks on any link in a Web page, his Web browser program sends a request to that address to have the linked page transferred. In this request, the Web browser identifies itself, saying e.g. 'This is Netscape 2.0 Mac. Please transfer "/home/welcome.html" '. So, the server has a little program that checks the request, and if it comes from a Mac Netscape, it sends the Mac version of the text, if from a Windows Netscape, it sends the Windows version.
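
    The 'little program' on the server can be very small indeed. The following is a minimal sketch, not the code of any existing site: the file names are invented, the browser is recognized by a crude substring test on its self-identification, and everything that is neither Mac nor Windows (e.g. Unix) is given the Mac version, following the grouping described above.

```python
# Hypothetical file names for the two versions of the same page.
VARIANTS = {
    "mac": "welcome.mac.html",
    "windows": "welcome.win.html",
}


def pick_variant(user_agent: str, default: str = "mac") -> str:
    """Choose which version of a page to send, based on the browser's
    self-identification string."""
    ua = user_agent.lower()
    if "mac" in ua:
        return VARIANTS["mac"]
    if "win" in ua:
        return VARIANTS["windows"]
    return VARIANTS[default]


print(pick_variant("Netscape 2.0 Mac"))
print(pick_variant("Netscape 2.0 Windows"))
print(pick_variant("Some Unix browser"))
```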

    All of this is completely invisible to the user, who has no idea that these things go on behind the scenes. But it does require the publisher / author to have the technical savvy to produce the program that can perform this checking. There are such solutions in use today, but not many. It works for Arabic, where character sets normally correspond to machine types.

    But where conflicting character sets may be used on the same machine type - as in Russian - the browser self-identification will not work. However, one may of course do this manually: on the 'Welcome' page of the site - which would be in English - the reader may be invited to click on a button corresponding to his setup, and is then taken to the pages written in the character set he uses.

    -- A possible third alternative is the arrogant approach: 'we have decided to support only this standard, and if that doesn't fit your computer, that is your problem'. That is possible, but not recommended - the cost of creating enemies is larger than the savings in work or server space. However, one may meet such attitudes on the Net. If as a reader you meet this, you should avoid that service.


    Web suggestion

    • From the author/publisher's point of view, both options - text as pictures, or different Web pages suited to different readers - have their points. However, it is probably preferable to transfer text as text, so that the author produces as many versions of the text page as there are commonly used character sets / computer standards. Once set up, it is normally an easy process, and the extra storage space required is insignificant.

    • As for the reader, he or she should try to get hold of Web browser software that allows the display of different character sets, preferably by manual choice in a menu, and of course which has the ability to display the non-European script properly.

    Such software is not yet fully developed, but recent versions of the major Web browsers from Netscape and Microsoft are moving along this way. (Read my "Mac Arabic on the net" page for more detail; the same procedure will basically be valid in Windows versions.) There are also some browsers that specialize in displaying Web pages in this or that script; they may on the other hand lack other functions that the user will want. However, on the whole, the non-European user is advised to keep informed about recent developments in browser software. Things are changing rapidly in this segment.

    As a conclusion, then, we are moving towards a stage where the use of non-European scripts will become easy. So far, it is possible, for most people, but at the cost of wasting some time in making sure one has the correct software and configuration.

    Knut S. Vikør
    15.4.97


    Responsible for this Web page is Knut S. Vikør.