EUROPEAN AND NON-EUROPEAN LANGUAGES ON THE NET
A SURVEY OF PROBLEMS AND SOLUTIONS
Over the years, the original dominance of English over the computer world has disappeared, giving room not only for European languages but even for non-European scripts like Russian, Arabic or Chinese. Giving normal, non-specialized computers the ability to work in such languages is increasingly a fairly trivial and inexpensive affair.
At the same time, however, computers have ceased to be islands separate from each other, where each of us processes our words and publishes our desktops in splendid isolation. They are increasingly a means of communication with other computers over a network. How does our Russian, Chinese or German fare in this Internetted world? Can we use our non-English languages, or are we forced to step back into the English-only primitive era of computing?
The answer is yes and no to both. One can use any language in Internet communication, e-mail, newsgroups or the Web. But it is neither automatic nor ubiquitous, and even if we are able to send messages in Chinese - or in European languages like French and German - there is no guarantee that the message will reach the addressee in the form we want. In a few years we may perhaps be able to say, 'write, send and forget it', but for the moment we do need some awareness of what goes on if we want to be reasonably certain that the reader reads what we intended to write.
The following is therefore an introduction to the general theme of using non-English languages on the net, written by a non-specialist for the semi-literate computer user, and ending up in a description of some tools we have on this web-site for using various European and non-European languages in email.
Unfortunately, it is virtually impossible to write about such things without touching on matters of high nerdiness, using strange terms and abbreviations. Rather than defining them as I go along - making the text even less readable, I have linked each such term to an explanatory sub-text. Hopefully they will clarify more than mystify.
Although the French and German languages are closer to English than to Arabic or Chinese, some of the same problems apply whenever you use a language other than English on the net. Let us therefore first look at the problems generally, before branching off into the specific problems of non-European scripts.
COMPUTERS AND CHARACTERS
There are two main reasons why using non-English languages poses problems. One is that there are different computer types - DOS, Windows, Mac, Unix - on the Net, and these differ in how they handle non-English. The other is caused by the network itself, and the specific restrictions that are built into it, which hurt non-English languages in particular, that is, languages that require more characters than unaccented A-Z.
Why does it matter that Windows and Macs are different? Because on the Net, we don't always know what kind of computer is at the other end of the cable. This can cause confusion. The computers transmit each letter in a text as a numeric value, and the correspondence between letters and numeric values is called a character set. Now, the Mac, PC and so on are all completely standardized in their character set for English; those characters have identical numeric values on all computers. But everything else is basically a free-for-all. Not only do the character sets for non-European scripts like Chinese and Arabic differ from computer type to computer type; even the accents and characters needed for European languages other than English - Spanish ñ, German ß, Danish ø - are handled differently by each computer type. The numeric value of e is the same on a Mac and a PC, but the numeric value of é is quite different.
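To make the point about numeric values concrete, here is a minimal sketch that looks up the byte value of a character under three legacy character sets. The choice of Python and of the codecs mac_roman, cp437 and cp1252 as stand-ins for "Mac", "DOS" and "Windows" is my assumption for illustration; the function name numeric_values is hypothetical.

```python
# Codec names are Python's built-in names for three legacy character sets.
# Treating them as "the" Mac, DOS and Windows sets is an assumption made
# for illustration only.
CHARSETS = {
    "Mac": "mac_roman",
    "DOS": "cp437",
    "Windows": "cp1252",
}


def numeric_values(char):
    """Return {platform: byte value} for one character (hypothetical helper)."""
    return {name: char.encode(codec)[0] for name, codec in CHARSETS.items()}


for ch in ("e", "é"):
    values = numeric_values(ch)
    print(ch, {name: hex(value) for name, value in values.items()})
```

Running it shows a single shared value for e and three different values for é, which is exactly the situation described above.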
Thus, any software that handles relations between PC and Mac has to take this difference into account. Normal programs, word processors and the like, usually do this well; if you use a program like WordPerfect/Mac to open a document made by WordPerfect/PC, then the é, ñ and other accents are translated from the PC to the Mac way without the user even noticing. Unfortunately, network software isn't as well behaved. If you put an é into an email message from a PC, the software will send the numeric value of the é under the PC system. The Mac will receive this character value, and display the corresponding character in its own character set, which, whatever it is, is not an é. The result is a confused or corrupted message.
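The corruption described above can be reproduced in a few lines. This is only a sketch: it assumes the sender uses the Windows set cp1252 and the receiver the Mac set mac_roman, and the helper name received_as is hypothetical.

```python
def received_as(text, sender_codec, receiver_codec):
    """Encode text with the sender's character set, then read the same
    bytes back with the receiver's character set, with no conversion."""
    raw = text.encode(sender_codec)
    return raw.decode(receiver_codec, errors="replace")


original = "légère"
print("sent:    ", original)
print("received:", received_as(original, "cp1252", "mac_roman"))
```

The unaccented letters survive because they share numeric values on both machines; only the accented ones are displayed as something else.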
It is even worse for non-European languages. At least, Europe is a fairly large market. Thus, some 'international' software exists that will convert from PC characters to Mac characters. But only for European characters (accents and similar). Which is fine for the French. But, transmitting an Arabic or Japanese text brings up quite different issues, and what solves a problem in French can cause new ones in Arabic. And as the software has no awareness of what language or script the message was originally written in, it will often just assume that everything that is not English is West European. Which may just add to the confusion.
That makes communication between different computers difficult. But, beyond this, even Mac-to-Mac communication or Windows-to-Windows can be impossible, both in French and in Russian or Chinese. This is because of the restrictions on the net itself. For historical reasons, the net has a formal rule that certain types of messages - in particular email - can only use a limited number of characters. While any Mac or PC can display about 220 different characters in any font, only about 90 of these should be used on the net. 90 sounds like a lot, but when you take away the numerals, punctuation etc., you only have room left for A-Z in upper case and lower case. No accents, no national characters, no non-European scripts. This rule, called the 7-bit rule - click on the link to see why it is called that - basically stops the use of characters not present in English in email and the other services concerned.
In the following, we will call the standard English A-Z, 0-9 characters - those that you can fit into the 90-strong character set - the basic character set, and any character outside it, be it a French accent or a Chinese ideograph, an extended character. As mentioned, both PC and Mac can display at least 130 such extended characters, and it is here that they, in different ways, store both European accents and all non-European scripts and characters.
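The split between basic and extended characters can be checked mechanically. Seven bits give 128 possible values, and once the control codes are set aside, 95 printable characters remain (counting the space), which is the "about 90" mentioned above. The sketch below assumes a single-byte character set (ISO-8859-1 by default); the function names are hypothetical.

```python
def printable_7bit_count():
    """Count the printable characters among the 128 seven-bit values."""
    return sum(1 for value in range(128) if chr(value).isprintable())


def extended_characters(message, codec="latin-1"):
    """Return the characters whose byte value needs the eighth bit.

    Assumes a single-byte character set, so each character is one byte.
    """
    return [ch for ch in message if ch.encode(codec)[0] > 0x7F]


def is_7bit_clean(message, codec="latin-1"):
    """True if the message uses only the basic character set."""
    return not extended_characters(message, codec)


print(printable_7bit_count())
print(is_7bit_clean("legere"), is_7bit_clean("légère"))
print(extended_characters("Smørrebrød på café"))
```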
EUROPEAN SOLUTIONS
Clearly, this is not an acceptable situation. We non-English are becoming more demanding; English may be the lingua franca of the Net, but we will be damned if we have to write in English to our own compatriots, just because of some stupid ole rule. So solutions are sought, and found. Unfortunately, not everyone has found the same solution. Thus, while there are ways, even officially sanctioned and standardized ways, of getting around all of the problems mentioned, they are not necessarily applied in the software in existence on the net; all the more so since most network software is produced in the US, where these kinds of problems are considered somewhat exotic.
So, one must expect various kinds of solutions or non-solutions among any group of users, and apply the old network wisdom: 'be liberal in what you accept, but strict [following the standard] in what you send out'. For European languages, that adage applies to the software developers, for non-European languages it also applies to the users.
Problem 1: Beating the 7bit-rule
Let us look at some of these solutions. The first and basic problem is the regulation that you can only use the restricted set of 90 basic characters on the net. There are three possible kinds of solutions to this:
Beyond acceptance, not yet change
Of these, option (1) belongs to the past. It was common earlier, but could never be more than a local solution - we could fit either the Swedish or the French characters, either the Arabic or the English, but not both; which was in the long run not tenable.
Option (4) belongs to the future. It would solve problems if it was generally used, but such 'extended' post office software is still rare. As it is not done at the user level, but by network administrators, we will not discuss it further here.
Law-breaking and its dangers
Options (2) and (3) are both practical and possible. The 7bit rule is in reality obsolete, and is mainly kept because in the anarchic Internet world we cannot guarantee that there isn't a stone age computer somewhere that will actually break down if it sees an extended character. But if such computers exist, they must be rare by now. Thus many specialists consider it unproblematic to send a message that contains the full range of characters, basic and extended.
The problem with this 'illegal' option is that not everyone agrees you can break rules at your leisure. Thus, much network software is set up to check for and 'correct' any such behaviour. In the worst case, such police procedures will return your email as 'undeliverable'; more commonly, however, they will allow it through, but change any extended character to a basic one. Typically, an é - extended - is changed into an i - basic, and an ø - extended - is changed into an x - basic. Légère will become lighre.
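The substitutions quoted above (é to i, ø to x) are what you get if the eighth bit of each byte is simply cleared, assuming the message was written in ISO-8859-1. Whether every 'rule enforcer' works this way is my assumption, not something stated here; the sketch below only shows that the arithmetic fits. The function name is hypothetical.

```python
def strip_eighth_bit(text, codec="latin-1"):
    """Clear the top bit of every byte, as a 7-bit-only relay might do.

    Assumes the text is encoded in a single-byte set (ISO-8859-1 here).
    """
    raw = text.encode(codec)
    stripped = bytes(byte & 0x7F for byte in raw)
    return stripped.decode("ascii")


for word in ("é", "ø", "Légère"):
    print(word, "->", strip_eighth_bit(word))
```

The damage cannot be undone at the receiving end, since an i that came from an é looks exactly like an i that was always an i.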
It may be the choice of your network administrator if this 'rule enforcer' is operational or not; in Europe it often is not, in the US it is rather common.
Encoding - a solution for all?
So, mail and be damned is not officially accepted, and can be impossible for some. The other option, (3) - character encoding - is the official solution, and is in fact also the most useful one for non-European mail, as will be explained. However, it also has an important drawback: if the receiver doesn't have the software to decode the message, it is quite unreadable to him or her, which is why the most common system of encoding, 'quoted-printable' or QP, is often called 'quoted-unreadable'. Such readers will just see codes that look like this: l=E9g=E8re.
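For the curious, the QP form shown above can be produced and reversed with Python's standard quopri module. This is a sketch assuming the text is in ISO-8859-1; the wrapper names qp_encode and qp_decode are hypothetical.

```python
import quopri


def qp_encode(text, charset="iso-8859-1"):
    """Encode text as quoted-printable: each extended byte becomes =XX."""
    return quopri.encodestring(text.encode(charset)).decode("ascii")


def qp_decode(encoded, charset="iso-8859-1"):
    """Reverse the encoding; the receiver must know the character set."""
    return quopri.decodestring(encoded.encode("ascii")).decode(charset)


encoded = qp_encode("légère")
print(encoded)             # what a reader without a QP decoder sees
print(qp_decode(encoded))  # what a reader with a QP decoder sees
```

The encoded form contains only basic characters, so it passes the 7bit rule untouched; the price is that it has to be decoded again before it is readable.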
This kind of encoding is part of the larger specification for email software called MIME, a term often used in a confused fashion. MIME is not the name of a particular program, it is a specification that any modern email program should adhere to. Easing the use of non-English languages in email is one of its aims. Unfortunately, not all MIME-compatible software places much emphasis on this part of the specification, and there is still some US-based software that does not even try to follow this standard.
Thus there is no panacea either way. There is nothing you can do to guarantee that the receiver reads what you intended to write. Some people will receive one system fine, but gag on the other, others will only be able to use the other system, not the first. But there are ways to improve the chances of success.
The above is as far as we can come in beating the 7bit rule, and most of the time for most of the people, it will solve that issue. But there was another basic problem with extended characters: that they are different on Mac and PC (and between DOS and Windows, too, for that matter). How do we settle that?
Again, there are two possible ways to go:
Our conclusion for Europe was that the second option might be helpful. Outside Europe, it is more than that, it is crucial. We cannot 'assume' that everyone uses the same character set if some messages may be in Swedish and others in Chinese. Not only do these use different scripts - that could be corrected by selecting the appropriate font manually - but the network software must handle European, Cyrillic and Arabic texts quite differently even before they come as far as being displayed in a font. They must use different conversion methods on the messages depending on the script used.
So, outside Europe, we must use the second option, so that the email software has the information it needs to handle each individual message. And as mentioned, this method exists, and is part of the MIME standard. Its 'charset' header contains this information - while it is normally set to represent only European characters, e.g. set to 'iso-8859-1', an Arabic user can change that to 'iso-8859-6' and the receiving email program can then act accordingly, depending on its ability to handle Arabic text.
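As an illustration of the 'charset' header, the sketch below builds a message with Python's standard email package and labels it 'iso-8859-6'. The subject line, the sample text and the helper name build_message are my own placeholders, and quoted-printable is chosen here only to tie in with the encoding discussed earlier.

```python
from email.message import EmailMessage


def build_message(body, charset):
    """Build a text message whose Content-Type header names the charset."""
    msg = EmailMessage()
    msg["Subject"] = "charset demo"  # placeholder subject
    # The body is converted to the named character set and then
    # QP-encoded so that only basic characters travel on the net.
    msg.set_content(body, charset=charset, cte="quoted-printable")
    return msg


msg = build_message("مرحبا", "iso-8859-6")
print(msg["Content-Type"])
print(msg["Content-Transfer-Encoding"])
print(msg.get_content_charset())
```

A receiving program that reads the header knows which conversion to apply; one that ignores it is left guessing, which is the situation the text goes on to describe.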
Further, in the last and most crucial point above, I have put nothing before the double hyphen at all. Most software, even if it allows you to type 'iso-8859-6' (i.e. Arabic) in a charset header, will do nothing to ensure that the message is sent or displayed accordingly. What "adaptation" means varies from computer to computer. On the Mac, it normally means converting the text itself (the numeric values) to conform to the Mac's standard; on the PC it may mean choosing a 'code page' that fits the message. But most software will do neither. Sometimes it can be done manually by the user, if only he knows how; in other cases it cannot, and the messages remain unreadable.
So, the solution is possible and doable, but the awareness of the needs for multi-language email has not reached the software engineers yet. Or the marketing people, whoever makes the decisions.
When that is done, there is at least a fair chance that a correspondent with a computer that has the same language/script installed will be able to read the message correctly.
However, without MIME, information about which character set is used is absent. In practice this is to some extent decided by informal consensus within each newsgroup. In groups where non-European languages are used, most people tend to gather around certain character sets (such as the koi8 standard for Russian groups), and the users then know what to expect in that particular newsgroup. This is not a complete solution, but it alleviates the problem.
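Why such a consensus matters can be seen by reading the same bytes under different assumptions. In this sketch the message is written in koi8 (Python's koi8_r codec); the alternatives cp1251 and latin-1 are simply examples I have picked of what a reader's software might assume instead, and the helper name read_as is hypothetical.

```python
def read_as(raw, assumed_charset):
    """Interpret raw message bytes under the reader's assumed charset."""
    return raw.decode(assumed_charset, errors="replace")


# A Russian greeting as it would travel in a koi8 newsgroup.
raw = "привет".encode("koi8_r")

for charset in ("koi8_r", "cp1251", "latin-1"):
    print(charset.ljust(8), read_as(raw, charset))
```

Only the reader who shares the group's convention sees the original text; the others see Cyrillic or Latin gibberish made from the very same bytes.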
Some MIME-compatible news programs that allow non-European email do exist. One such program for Mac users is the newest version of "Yet Another Newswatcher", which comes with conversion tables for many non-European languages (and others are available at my site, Middle East News tables, 28 K). It also handles MIME charset header insertion and conversions.
All of this is completely invisible to the user, who has no idea that these things go on behind the scenes. But it does require the publisher / author to have the technical savvy to produce the program that can perform this checking. There are such solutions in use today, but not many. It is also applicable to Arabic, where character sets normally correspond to machine types.
But where conflicting character sets may be used on the same machine type - as in Russian - the browser self-identification will not work. However, one may of course do this manually: on the 'Welcome' page of the site - which would be in English - the reader may be invited to click on a button corresponding to his setup, and is then taken to the pages written in the character set he uses.
-- A possible third alternative is the arrogant approach: 'we have decided to support only this standard, and if that doesn't fit your computer, that is your problem'. That is possible, but not recommended - the cost of creating enemies is larger than the savings in work or server space. However, one may meet such attitudes on the Net. If as a reader you meet this, you should avoid that service.
Such software is not yet fully developed, but recent versions of the major Web browsers from Netscape and Microsoft are moving along this way. (Read my "Mac Arabic on the net" page for more detail; the same procedure will basically be valid in Windows versions.) There are also some browsers that specialize in displaying Web pages in this or that script; they may on the other hand lack other functions that the user will want. However, on the whole, the non-European user is advised to keep informed about recent developments in browser software. Things are changing rapidly in this segment.
As a conclusion, then, we are moving towards a stage where the use of non-European scripts will become easy. So far, it is possible, for most people, but at the cost of wasting some time in making sure one has the correct software and configuration.
Knut S. Vikør