Microsoft® Office XP Resource Kit

http://www.microsoft.com/office/ork  


Unicode Support and Multilingual Documents

Sharing documents in a multilingual environment can be challenging when the languages involved span multiple Microsoft Windows code pages. However, using the Unicode® character encoding standard overcomes many of these challenges.

Without Unicode, systems typically use a code page–based environment, in which each script has its own table of characters. Documents based on the code page of one operating system rarely travel well to an operating system that uses another code page. In some cases, the documents cannot contain text that uses characters from more than one script.

For example, if a user running the English version of the Microsoft Windows® 98 operating system with the Latin code page opens a plain text file created in the Japanese version of Windows 98, the code points of the Japanese code page are mapped to unexpected or nonexistent characters in the Western script, and the resulting text is unintelligible.
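
The mismatch described above can be reproduced in a few lines. The following is a minimal Python sketch, assuming that code page 932 stands in for the Japanese version of Windows and code page 1252 for the English version; the sample string and the variable names are illustrative and do not come from the original text.

```python
# Text as a Japanese system might save it (assumption: code page 932).
original = "日本語"
raw_bytes = original.encode("cp932")

# The same bytes read on a system that assumes the Latin code page 1252.
# errors="replace" substitutes U+FFFD for any byte that cp1252 leaves undefined.
misread = raw_bytes.decode("cp1252", errors="replace")

print(raw_bytes.hex(" "))  # the stored bytes are identical on both systems
print(misread)             # but they no longer spell the original text
```

The bytes in the file never change; only the table used to interpret them does, which is why the text becomes unintelligible without any data being lost.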

The universal character set provided by Unicode overcomes this problem. In Microsoft Office XP, all applications are capable of using Unicode.

Scripts and code pages

Multilingual documents can contain text in languages that require different scripts. A single script can be used to represent many languages.

For example, the Latin or Roman script has character shapes — glyphs — for the 26 letters (both uppercase and lowercase) of the English alphabet, as well as accented (extended) characters used to represent sounds in other Western European languages.

The Latin script has glyphs to represent all of the characters in most European languages and a few others. Other European languages, such as Greek or Russian, have characters for which there are no glyphs in the Latin script; these languages have their own scripts.

Some Asian languages use ideographic scripts that have glyphs based on Chinese characters. Other languages, such as Thai and Arabic, use scripts that have glyphs that are composed of several smaller glyphs or glyphs that must be shaped differently depending on adjacent characters. These scripts are referred to throughout the documentation as complex scripts.

A common way to store plain text is to represent each character by using a single byte. The value of each byte is a numeric index, or code point, in a table of characters; a code point corresponds to a character in the default code page of the computer on which the text document is created. For example, a byte with the decimal value 65 (code point 65) might represent the capital letter 'A' on a computer that uses a Western European code page.

A table of characters grouped together is called a code page. Because each character in a single-byte code page is represented by one byte, and a byte can take only 256 distinct values, such a code page can contain at most 256 characters.
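
The lookup from byte value to character can be sketched as follows. This is an illustrative Python example, assuming that code page 1252 is the Western European code page in question; the helper name char_for_byte is hypothetical.

```python
# Assumption: code page 1252 plays the role of the Western European code page.
CODE_PAGE = "cp1252"

def char_for_byte(value: int, code_page: str = CODE_PAGE) -> str:
    """Return the character that one byte value selects in a code page."""
    if not 0 <= value <= 255:
        # A single byte cannot index beyond the 256 entries of the table.
        raise ValueError("a single byte holds only the values 0 through 255")
    return bytes([value]).decode(code_page)

print(char_for_byte(65))   # A
print(char_for_byte(233))  # é
```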

One code page with its limit of 256 characters cannot accommodate all languages because all languages together use far more than 256 characters. Therefore, different scripts use separate code pages. There is one code page for Greek, another for Cyrillic, and so on.

In addition, single-byte code pages cannot accommodate Asian languages, which commonly use more than 5,000 Chinese-based characters. Double-byte code pages were developed to support these languages.
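
As a rough illustration of a double-byte code page, the Python sketch below assumes code page 932 (Japanese) as the example; the original text does not name a specific code page, and the helper name encoded_length is hypothetical.

```python
def encoded_length(character: str, code_page: str) -> int:
    """Return how many bytes one character occupies in the given code page."""
    return len(character.encode(code_page))

# In code page 932, a basic Latin letter still fits in one byte,
# while a Chinese-based character needs two.
print(encoded_length("A", "cp932"))   # 1
print(encoded_length("日", "cp932"))  # 2

# A single-byte code page such as 1252 has no entry for the character at all.
try:
    "日".encode("cp1252")
except UnicodeEncodeError:
    print("not representable in cp1252")
```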

One drawback of the code page system is that the character represented by a particular code point depends on the code page in which it was entered. If you do not know which code page a code point is from, you cannot interpret the code point unambiguously. This can cause problems when a text document is shared between users on different computers.

For example, unless you know which code page it comes from, the code point 230 might be the Greek lowercase zeta (ζ), the Cyrillic lowercase zhe (ж), or the Western European diphthong (æ). All three characters have the same code point (230), but the code point is from three different code pages (1253, 1251, and 1252, respectively). Users exchanging documents between these languages are likely to see incorrect characters.
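
The ambiguity of code point 230 can be checked directly. The short Python sketch below decodes the same byte with each of the three code pages named above; the variable names are illustrative.

```python
code_point = 230  # the same single byte in every case

# Interpret the byte through each of the three code pages.
readings = {
    code_page: bytes([code_point]).decode(code_page)
    for code_page in ("cp1253", "cp1251", "cp1252")
}

for code_page, character in readings.items():
    print(code_page, character)
# cp1253 ζ
# cp1251 ж
# cp1252 æ
```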

Unicode: a worldwide character set

Unicode was developed to create a universal character set that can accommodate all known scripts. Unicode assigns a unique code point, typically stored as a two-byte value, to every character; in contrast to code pages, no two characters share a code point. For example, the Unicode code point of lowercase zeta (ζ) is the hexadecimal value 03B6, lowercase zhe (ж) is 0436, and the diphthong (æ) is 00E6.
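
The three code points can be confirmed with a brief Python sketch. Python strings are sequences of Unicode code points, so the built-in ord function returns the value directly; the labels used here are illustrative.

```python
characters = {"zeta": "ζ", "zhe": "ж", "ae": "æ"}

# Each character has a distinct Unicode code point, regardless of code page.
code_points = {name: ord(ch) for name, ch in characters.items()}

for name, value in code_points.items():
    print(f"{name}: U+{value:04X}")
# zeta: U+03B6
# zhe: U+0436
# ae: U+00E6
```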

Unicode 2.0 defines code points for approximately 40,000 characters. More definitions were added in Unicode 2.1 and Unicode 3.0. Built-in expansion mechanisms in Unicode allow for more than one million characters to be defined, which is more than sufficient for all known scripts.
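
The original text does not name the expansion mechanism. The Python sketch below assumes it refers to UTF-16 surrogate pairs, in which two reserved two-byte values combine to address one character beyond the first 65,536 code points. The function name and the sample code point (U+1D11E) are chosen for illustration only.

```python
from typing import Tuple

def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 high/low surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000 through U+10FFFF need a pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

high, low = to_surrogate_pair(0x1D11E)
print(f"{high:04X} {low:04X}")  # D834 DD1E

# 1,024 high surrogates x 1,024 low surrogates: the extra characters available.
extra_code_points = 1024 * 1024
print(extra_code_points)  # 1048576
```

Under this assumption, the pairs add 1,048,576 code points to the original 65,536, which is consistent with the figure of more than one million characters.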

Currently in the Microsoft Windows operating systems, the two systems of storing text — code pages and Unicode — coexist. However, Unicode-based systems are replacing code page–based systems. For example, Microsoft Windows NT® 4, Microsoft Windows 2000, Microsoft Office 97 and later, Microsoft Internet Explorer 4.0 and later, and Microsoft SQL Server™ 7.0 and later all support Unicode.



 
© 2001 Microsoft Corporation. All rights reserved.