Unicode

From RogueBasin

Unicode is the international character encoding standard. A character encoding assigns each character (a letter, a glyph, or some other symbol) to a code point (a number). For example, Unicode maps the character "A" to code point 65 and the yen sign "¥" to code point 165. These code points are then stored (in memory or on disk) as a sequence of bits.
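The mapping from character to code point to bits can be sketched in Python with the built-ins `ord`, `chr` and `str.encode`. The choice of UTF-8 as the stored form and the helper name `bits` are assumptions made for illustration only:

```python
# Characters map to code points (integers), and code points map back.
code_a = ord("A")          # 65
code_hat = ord("â")        # 226, written U+00E2
char_back = chr(65)        # "A"

# A code point only becomes bits once an encoding is chosen (UTF-8 here).
stored_a = "A".encode("utf-8")      # one byte
stored_hat = "â".encode("utf-8")    # two bytes


def bits(data: bytes) -> str:
    """Render bytes as space-separated groups of 8 binary digits."""
    return " ".join(f"{byte:08b}" for byte in data)


print(code_a, bits(stored_a))       # 65 01000001
print(code_hat, bits(stored_hat))   # 226 11000011 10100010
```

The code point is the same on every system; the bit pattern depends on which encoding is used to store it.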

History of character encodings

ASCII is a character encoding that was introduced in 1963. Before ASCII, few computers agreed on which bit pattern represented, for example, the letter 'A'. ASCII defines 128 (2^7) code points, which fit in 7 bits and so use only half of the 256 values an 8-bit byte can hold:

0-31    [non-printable control characters]
32-63    !"#$%&'()*+,-./0123456789:;<=>?
64-95   @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
96-127  `abcdefghijklmnopqrstuvwxyz{|}~

Speakers of languages other than English found that their accented letters, such as a-with-a-hat-on-top (â), could not be represented. This was addressed by various extensions to ASCII, which used the upper half of an 8-bit character set to define accented characters and various other graphical symbols:

128-159 ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ¢£¥₧ƒ
160-191 áíóúñÑªº¿⌐¬½¼¡«»░▒▓│┤╡╢╖╕╣║╗╝╜╛┐
192-223 └┴┬├─┼╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀
224-255 αßΓπΣσµτΦΘΩδ∞φε∩≡±≥≤⌠⌡÷≈°∙·√ⁿ²■ 

The trouble with these extensions was that there were so many to choose from. While the 100th character was always the same from computer to computer, the 200th character was not: it depended upon which "code page" was loaded. Assuming that two machines used the same code page led to strange gibberish being displayed rather than the correct accented character. The other problem was that neither ASCII nor any single 8-bit extension had room for the thousands of characters used by languages that are not written in the Latin alphabet.
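The effect is easy to reproduce in Python, whose standard codecs include many of the old code pages. The sketch below decodes the same two bytes under three code pages chosen purely as examples (`cp437`, `latin-1`, `cp1251`); the helper name `decode_all` is hypothetical:

```python
CODE_PAGES = ("cp437", "latin-1", "cp1251")


def decode_all(raw: bytes) -> dict:
    """Decode the same bytes under each example code page."""
    return {name: raw.decode(name) for name in CODE_PAGES}


# Byte 100 is in the ASCII range; byte 200 is in the upper half.
for name, text in decode_all(bytes([100, 200])).items():
    print(name, text)
# cp437 d╚
# latin-1 dÈ
# cp1251 dИ

# An accented letter written under one code page and read under another:
garbled = "â".encode("latin-1").decode("cp437")
print(garbled)   # Γ
```

Byte 100 decodes to "d" everywhere, while byte 200 turns into a box-drawing corner, an accented E, or a Cyrillic letter depending on the code page assumed.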

Unicode is an attempt to resolve this once and for all by using a much larger character set of 1,114,112 (2^20 + 2^16) code points, allowing it to map all known human glyphs into one consistent character set. More than 10% of those code points have been assigned so far.
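The size of the code space can be checked with a few lines of arithmetic. The sketch below also uses the fact, not stated above, that Unicode divides the code space into 17 "planes" of 2^16 code points each; the variable names are illustrative:

```python
import sys

PLANES = 17                      # plane 0 plus 16 supplementary planes
CODE_POINTS_PER_PLANE = 2**16    # 65,536

total = 2**20 + 2**16            # 1,048,576 + 65,536
assert total == PLANES * CODE_POINTS_PER_PLANE == 1_114_112

# Code points are numbered from 0, so the highest one is total - 1.
highest = total - 1
print(hex(highest))                 # 0x10ffff
print(sys.maxunicode == highest)    # True on Python 3.3 and later
```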

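Unicode may be stored using a variety of encodings, such as UTF-8, UTF-16 and UTF-32. Java, for example, represents strings as UTF-16: most characters take up one 16-bit code unit (2 bytes), while characters beyond the first 65,536 code points take up two code units (4 bytes). Java originally assumed that 16 bits per character would always be enough, as in the older fixed-width UCS-2 encoding.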
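To illustrate how the choice of encoding changes the stored size, here is a small Python sketch that compares three common Unicode encodings. The sample characters and the helper name `encoded_sizes` are assumptions chosen for illustration; the little-endian codec variants are used so that no byte-order mark is counted:

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes ch occupies in each encoding."""
    return {name: len(ch.encode(name)) for name in ENCODINGS}


samples = [
    "A",             # U+0041, ASCII letter
    "\u00e2",        # U+00E2, a-with-a-hat-on-top
    "\u4e2d",        # U+4E2D, a CJK ideograph
    "\U0001F600",    # U+1F600, outside the first 65,536 code points
]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
# U+0041 {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
# U+00E2 {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
# U+4E2D {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
# U+1F600 {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

Only UTF-32 uses the same number of bytes for every character; in UTF-16 the last sample needs two 16-bit code units.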

Unicode defines the characters, but fonts handle their display, and a given font may not contain glyphs for every character. As a result, while it is guaranteed that the a-with-a-hat-on-top will be defined as an a-with-a-hat-on-top on every computer under Unicode, it is not at all guaranteed that the user will be able to see it.

Unicode in Roguelikes

Many roguelikes, including NetHack, have included options to take advantage of the upper half of an 8-bit code page (often loosely called "high ASCII" or "extended ASCII") to get extra symbols, allowing them to draw walls with lines rather than # marks.
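As a rough sketch of the difference this makes, the Python snippet below draws the same room outline once with # marks and once with line characters. It uses the Unicode box-drawing characters rather than any particular code page, and the names `ASCII_WALLS`, `LINE_WALLS` and `draw_room` are hypothetical, not taken from any existing roguelike:

```python
# Wall glyph sets: plain ASCII versus line-drawing characters.
ASCII_WALLS = {"h": "#", "v": "#", "tl": "#", "tr": "#", "bl": "#", "br": "#"}
LINE_WALLS = {"h": "─", "v": "│", "tl": "┌", "tr": "┐", "bl": "└", "br": "┘"}


def draw_room(width: int, height: int, walls: dict) -> list:
    """Return the rows of a width x height room outline with '.' as floor."""
    inner = width - 2
    top = walls["tl"] + walls["h"] * inner + walls["tr"]
    middle = walls["v"] + "." * inner + walls["v"]
    bottom = walls["bl"] + walls["h"] * inner + walls["br"]
    return [top] + [middle] * (height - 2) + [bottom]


for walls in (ASCII_WALLS, LINE_WALLS):
    print("\n".join(draw_room(7, 4, walls)))
    print()
```

Keeping the glyphs in a lookup table like this makes it simple to offer the wall style as a user option.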

Unicode has the potential to let roguelikes access these special glyphs in a platform-independent manner. It also allows for the use of glyphs from East Asian writing systems (which may be ideal for those who can read them, as an entire word may be represented by a single character). Unfortunately, because Unicode sets out to embrace every known glyph, it cannot guarantee that the font you are using actually contains the glyph you want.
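One partial safeguard is to fall back to a plain ASCII glyph when the output cannot represent the preferred one. The Python sketch below checks only whether the output encoding can encode the character; it cannot tell whether the user's font has the glyph, which would require inspecting the font itself. The function name `pick_glyph` and the choice of glyphs are assumptions for illustration:

```python
import sys


def pick_glyph(preferred: str, fallback: str, encoding: str) -> str:
    """Return preferred if encoding can represent it, otherwise fallback."""
    try:
        preferred.encode(encoding)
    except (UnicodeEncodeError, LookupError):
        # LookupError covers an unknown encoding name.
        return fallback
    return preferred


# sys.stdout.encoding may be missing or None when output is redirected.
encoding = getattr(sys.stdout, "encoding", None) or "ascii"
wall = pick_glyph("█", "#", encoding)
print(wall * 10)
```

For example, `pick_glyph("█", "#", "ascii")` returns "#", while `pick_glyph("█", "#", "utf-8")` returns "█".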

Roguelikes that use Unicode are:

External links