ASCII and Unicode

Traditionally, computers in English-speaking countries used only the ASCII (American Standard Code for Information Interchange) character set to represent alphanumeric characters. Each ASCII character is represented by 7 bits, so the set contains 128 characters in total. These include the lowercase and uppercase Latin letters, digits, and punctuation marks.
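
Because every ASCII character corresponds to a number between 0 and 127, you can observe these codes directly. The following minimal sketch (the class name AsciiDemo is just illustrative) casts a few Java chars to int to reveal their ASCII values:

    public class AsciiDemo {
        public static void main(String[] args) {
            // Casting a char to int reveals its character code.
            // For ASCII characters the value always falls between 0 and 127.
            char[] samples = {'A', 'a', '0', '!'};
            for (char c : samples) {
                System.out.println("'" + c + "' = " + (int) c); // e.g., 'A' = 65, 'a' = 97
            }
        }
    }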

The ASCII character set was later extended with another 128 characters, such as the German letters ä, ö, and ü and the British currency symbol £. This character set is called extended ASCII, and each of its characters is represented by 8 bits.

ASCII and extended ASCII are only two of the many character sets available.

Another popular one is ISO-8859-1, also known as Latin-1, standardized by the ISO (International Organization for Standardization). Each character in ISO-8859-1 is likewise represented by eight bits. This character set contains all the characters required for writing text in many Western European languages, such as German, Danish, Dutch, French, Italian, Spanish, Portuguese and, of course, English. An 8-bit character set is convenient because a byte is also 8 bits long: each character fits in exactly one byte, which makes storing and transmitting such text very efficient.
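
The one-byte-per-character property is easy to verify in Java. In this small sketch (the class name is illustrative), a string of Western European characters is encoded as ISO-8859-1 and comes out at exactly one byte per character:

    import java.nio.charset.StandardCharsets;

    public class LatinOneDemo {
        public static void main(String[] args) {
            String text = "Füße"; // four characters, all within Latin-1
            byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
            // Prints "4 chars -> 4 bytes": one byte per character.
            System.out.println(text.length() + " chars -> " + bytes.length + " bytes");
        }
    }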

The Need for Unicode

However, not every language uses Latin letters. Chinese and Japanese are examples of languages that use different character sets. Each character in written Chinese, for example, represents a word or concept rather than a letter. There are thousands of such characters, and eight bits are not enough to represent them all. Japanese uses yet another character set. In total, there are hundreds of different character sets covering the world's languages. To unify all these character sets, a computing standard called Unicode was created.

Unicode is a character set developed by a non-profit organization called the Unicode Consortium (www.unicode.org). This body aims to include every character of every language in the world in one single character set. Each character in Unicode is assigned a unique number, called a code point, that identifies exactly that character. The standard is revised regularly and has gone through many versions; Unicode is used in Java, XML, ECMAScript, LDAP, and many other technologies.
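
In Java, these unique numbers are exposed as code points. The sketch below (class name illustrative) prints the code points of a Latin letter, a currency symbol, and a Chinese character in the conventional U+XXXX notation:

    public class CodePointDemo {
        public static void main(String[] args) {
            String s = "A€中"; // a Latin letter, the euro sign, a Chinese character
            // Each character maps to exactly one Unicode code point.
            s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
            // Prints U+0041, U+20AC, U+4E2D
        }
    }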

There is a catch, however. While Unicode provides enough room for all the characters used in all languages, storing and transmitting Unicode text is not as efficient as storing and transmitting ASCII or Latin-1 text. On the Internet, this matters a great deal: representing every character with a full 32-bit number would mean transferring four times as much data as the equivalent ASCII text.

Fortunately, character encodings make it more efficient to store and transmit Unicode text. You can think of a character encoding as loosely analogous to data compression, and many encodings are available today. The Unicode Consortium endorses three of them (a byte-count comparison in Java follows the list):

  • UTF-8. This encoding, popular for HTML and for network protocols, transforms Unicode characters into a variable-length sequence of bytes. It has the advantage that the Unicode characters corresponding to the familiar ASCII set keep the same byte values as ASCII, so text encoded as UTF-8 can be used with much existing software. Most browsers support the UTF-8 character encoding.
  • UTF-16. In this encoding, all of the most commonly used characters fit into a single 16-bit code unit, while less common characters are accessed via pairs of 16-bit code units.
  • UTF-32. This encoding uses 32 bits for every single character, which makes it a poor choice for Internet applications, at least at present.
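
The size differences are easy to observe in Java. The sketch below (class name illustrative; it also assumes the JVM provides the optional "UTF-32" charset, which is not part of java.nio.charset.StandardCharsets) encodes short strings in each of the three encodings and compares the byte counts:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class EncodingSizes {
        public static void main(String[] args) {
            String ascii = "hello";
            String mixed = "héllo"; // contains one non-ASCII character

            // UTF-8: ASCII characters stay at one byte each; é takes two.
            System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);  // 5
            System.out.println(mixed.getBytes(StandardCharsets.UTF_8).length);  // 6

            // UTF-16: two bytes per code unit, plus a two-byte byte-order mark.
            System.out.println(ascii.getBytes(StandardCharsets.UTF_16).length); // 12

            // UTF-32: four bytes for every character.
            System.out.println(ascii.getBytes(Charset.forName("UTF-32")).length); // 20
        }
    }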

ASCII characters still play a dominant role in software programming. Java, too, uses ASCII for almost all of its input elements; the exceptions are comments, identifiers, and the contents of character and string literals, for which Java supports the full Unicode character set. This means you can write comments, identifiers, and strings in languages other than English.
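
As a quick illustration (the identifiers here are chosen arbitrarily; compile with the source file saved as UTF-8, e.g. javac -encoding UTF-8), the following is legal Java:

    public class UnicodeSource {
        public static void main(String[] args) {
            // Identifiers and string literals may use non-ASCII characters.
            String grüße = "こんにちは"; // German identifier, Japanese string contents
            double π = 3.14159;          // Greek letter as a variable name
            System.out.println(grüße + " " + π);
        }
    }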