Microsoft® Office XP Resource Kit

http://www.microsoft.com/office/ork

Topics in this chapter

	Unicode Support and Multilingual Documents

	Taking Advantage of Unicode Support

	Changing Language Settings

	Removing Multilingual User Interface Files

	Managing Language Settings for Each Application

Unicode Support and Multilingual Documents

Sharing documents in a multilingual environment can be challenging when the languages involved span multiple Microsoft Windows code pages. However, using the Unicode® character encoding standard overcomes many of these challenges.

Without Unicode, systems typically use a code page–based environment, in which each script has its own table of characters. Documents based on the code page of one operating system rarely travel well to an operating system that uses another code page. In some cases, the documents cannot contain text that uses characters from more than one script.

For example, if a user running the English version of the Microsoft Windows® 98 operating system with the Latin code page opens a plain text file created in the Japanese version of Windows 98, the code points of the Japanese code page are mapped to unexpected or nonexistent characters in the Western script, and the resulting text is unintelligible.
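The mismatch described above can be reproduced in a few lines. This is a minimal sketch, assuming Python's `cp932` and `cp1252` codecs as stand-ins for the Japanese and Latin Windows code pages; the sample text is illustrative and does not come from the Resource Kit.

```python
# Sketch: text written under one code page and read under another.
# cp932 and cp1252 are assumed stand-ins for the Windows Japanese and
# Western European (Latin) code pages; the sample text is illustrative.
original = "日本語"                  # the word for "Japanese language"
raw = original.encode("cp932")       # bytes as the Japanese system stores them

# A system that assumes the Latin code page maps every byte on its own.
garbled = raw.decode("cp1252", errors="replace")

print(len(raw))              # 6: each character takes two bytes in cp932
print(garbled)               # unrelated Western characters
print(raw.decode("cp932"))   # readable again once the right code page is used
```

The bytes themselves are never damaged; only the table used to interpret them differs.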

The universal character set provided by Unicode overcomes this problem. In Microsoft Office XP, all applications are capable of using Unicode.

Scripts and code pages

Multilingual documents can contain text in languages that require different scripts. A single script can be used to represent many languages.

For example, the Latin or Roman script has character shapes — glyphs — for the 26 letters (both uppercase and lowercase) of the English alphabet, as well as accented (extended) characters used to represent sounds in other Western European languages.

The Latin script has glyphs to represent all of the characters in most European languages and a few others. Other European languages, such as Greek or Russian, have characters for which there are no glyphs in the Latin script; these languages have their own scripts.

Some Asian languages use ideographic scripts that have glyphs based on Chinese characters. Other languages, such as Thai and Arabic, use scripts whose glyphs are composed of several smaller glyphs or must be shaped differently depending on adjacent characters. These scripts are referred to throughout the documentation as complex scripts.

A common way to store plain text is to represent each character by using a single byte. The value of each byte is a numeric index, or code point, in a table of characters; a code point corresponds to a character in the default code page of the computer on which the text document is created. For example, a byte with the decimal value 65 (that is, code point 65) represents the capital letter 'A' on a computer that uses the Western European code page.

A table of characters grouped together is called a code page. For single-byte code pages, each code page contains a maximum of 256 byte values; because each character in the code page is represented by a single byte, a code page can contain as many as 256 characters.
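As a rough illustration of a single-byte code page as a lookup table, the sketch below walks all 256 possible byte values. It assumes Python's `cp1252` codec as the Western European code page; the names `CODE_PAGE`, `letter`, and `table` are hypothetical.

```python
CODE_PAGE = "cp1252"  # assumed: Python's name for the Western European code page

# Byte value 65 is the code point of capital 'A' in this code page.
letter = bytes([65]).decode(CODE_PAGE)

# Walk all 256 possible byte values and keep those the code page defines.
table = {}
for value in range(256):
    try:
        table[value] = bytes([value]).decode(CODE_PAGE)
    except UnicodeDecodeError:
        pass  # this byte value is unassigned in the code page

print(letter)      # A
print(len(table))  # never more than 256
```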

One code page with its limit of 256 characters cannot accommodate all languages because all languages together use far more than 256 characters. Therefore, different scripts use separate code pages. There is one code page for Greek, another for Cyrillic, and so on.

In addition, single-byte code pages cannot accommodate Asian languages, which commonly use more than 5,000 Chinese-based characters. Double-byte code pages were developed to support these languages.
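A short sketch of the double-byte idea follows. It assumes Python's `cp932` codec as the Windows Japanese code page, in which Latin letters still take one byte and Chinese-based characters take two.

```python
JAPANESE = "cp932"  # assumed: Python's name for the Windows Japanese code page

ascii_bytes = "A".encode(JAPANESE)   # Latin letter: one byte
kanji_bytes = "日".encode(JAPANESE)  # Chinese-based character: two bytes

print(len(ascii_bytes), len(kanji_bytes))  # 1 2

# Two bytes allow up to 256 * 256 combinations, far more than 5,000.
print(256 * 256)  # 65536
```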

One drawback of the code page system is that the character represented by a particular code point depends on the specific code page on which it was entered. If you do not know which code page a code point is from, you cannot determine how to interpret the code point unambiguously. This can cause problems when a text document is shared between users on different computers.

For example, unless you know which code page it comes from, the code point 230 might be the Greek lowercase zeta (ζ), the Cyrillic lowercase zhe (ж), or the Western European diphthong (æ). All three characters have the same code point (230), but the code point is from three different code pages (1253, 1251, and 1252, respectively). Users exchanging documents between these languages are likely to see incorrect characters.
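The ambiguity of code point 230 can be checked directly. This sketch assumes that Python's `cp1253`, `cp1251`, and `cp1252` codecs match the Windows code pages of the same numbers.

```python
code_point = bytes([230])  # one byte, decimal value 230

# Decode the same byte under three code pages (codec names are assumptions).
interpretations = {
    name: code_point.decode(name)
    for name in ("cp1253", "cp1251", "cp1252")
}

for name, character in interpretations.items():
    print(name, character)
```

The byte is identical in all three cases; only the code page chosen for decoding changes the character that appears.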

Unicode: a worldwide character set

Unicode was developed to create a universal character set that can accommodate all known scripts. Unicode uses a unique, two-byte encoding for every character; so in contrast to code pages, every character has its own unique code point. For example, the Unicode code point of lowercase zeta (ζ) is the hexadecimal value 03B6, lowercase zhe (ж) is 0436, and the diphthong (æ) is 00E6.
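The Unicode code points quoted above can be confirmed with a brief sketch. The dictionary keys are hypothetical labels, and `utf-16-be` is used here as an assumed stand-in for the two-byte encoding the text describes.

```python
characters = {"zeta": "ζ", "zhe": "ж", "ae": "æ"}

# Each character has its own code point, shown as four hexadecimal digits.
code_points = {name: f"{ord(ch):04X}" for name, ch in characters.items()}

# Each of these characters occupies exactly two bytes in UTF-16.
two_byte = {name: ch.encode("utf-16-be") for name, ch in characters.items()}

for name in characters:
    print(name, code_points[name], two_byte[name].hex())
```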

Unicode 2.0 defines code points for approximately 40,000 characters. More definitions were added in Unicode 2.1 and Unicode 3.0. Built-in expansion mechanisms in Unicode allow for more than one million characters to be defined, which is more than sufficient for all known scripts.
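The text does not name the expansion mechanism. Assuming it refers to UTF-16 surrogate pairs, in which two reserved two-byte values combine to address one additional character, the arithmetic behind "more than one million" can be sketched as follows. The function name `to_surrogates` and the sample code point are hypothetical.

```python
def to_surrogates(code_point):
    """Split a code point above FFFF into a high and a low surrogate."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

# 1,024 high surrogates times 1,024 low surrogates.
extra_characters = 1024 * 1024

high, low = to_surrogates(0x10400)
encoded = chr(0x10400).encode("utf-16-be")

print(extra_characters)            # 1048576
print(f"{high:04X} {low:04X}")     # D801 DC00
print(encoded.hex().upper())       # D801DC00
```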

Currently in the Microsoft Windows operating systems, the two systems of storing text, code pages and Unicode, coexist. However, Unicode-based systems are replacing code page–based systems. For example, Microsoft Windows NT® 4.0, Microsoft Windows 2000, Microsoft Office 97 and later, Microsoft Internet Explorer 4.0 and later, and Microsoft SQL Server™ 7.0 and later all support Unicode.

© 2001 Microsoft Corporation. All rights reserved.