== What is Unicode ==
'''Unicode''' is the international character encoding standard. A character encoding assigns each character (a letter, a glyph, or some other symbol) to a code point (a number). For example, the character "A" is mapped to code point 65, while the yen sign "¥" is mapped to code point 157 under IBM code page 437 (Unicode itself places it at 165, U+00A5). These code points are stored (in memory or on disk) as a sequence of bits.


== History of character encodings ==


[[ASCII]] is a character encoding introduced in 1963. Before ASCII, no two computers would agree on what bit patterns constituted, for example, the letter 'A'. ASCII reserves 128 (2^7) code points, half the range of an 8-bit byte:


<pre>
0-31    [non-printable control characters]
32-63    !"#$%&'()*+,-./0123456789:;<=>?
64-95   @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
96-127  `abcdefghijklmnopqrstuvwxyz{|}~
</pre>
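
Because every ASCII code point fits comfortably in a single byte, a character in C simply ''is'' its code point, and character arithmetic is code point arithmetic. A minimal sketch (plain C, nothing roguelike-specific assumed) that regenerates the printable half of the table above:

<pre>
#include <stdio.h>

int main(void)
{
    /* Print each printable ASCII character beside its code point.
       In C, a char in this range is identical to its code point. */
    for (int c = 32; c < 127; c++) {
        printf("%3d %c   ", c, c);
        if (c % 8 == 7)            /* start a new row every 8 columns */
            printf("\n");
    }
    printf("\n");
    return 0;
}
</pre>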


Non-English languages found that their fancy a-with-a-hat-on-top (i.e. â) could not be represented. This was solved with the introduction of various extensions to ASCII, which used the upper half of an 8-bit character set to define accented characters and various other graphical symbols (shown here as assigned by IBM code page 437):
 
<pre>
128-159 ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ¢£¥₧ƒ
160-191 áíóúñÑªº¿⌐¬½¼¡«»░▒▓│┤╡╢╖╕╣║╗╝╜╛┐
192-223 └┴┬├─┼╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀
224-255 αßΓπΣσµτΦΘΩδ∞φε∩≡±≥≤⌠⌡÷≈°∙·√ⁿ²■
</pre>
 
The trouble with these extensions was that there were so many to choose from. While the 100th character was always the same from computer to computer, the 200th character was not: it depended upon which [[Code page]] was loaded. Assumptions about code pages being the same led to strange gibberish being displayed rather than the correct accented character. The other problem was that even extended ASCII had no room for the thousands of characters used by non-Roman languages.
 
Unicode is an attempt to resolve this once and for all by using a much larger set of 1,114,112 (2^20 + 2^16) code points, allowing it to map all known human glyphs into one consistent character set. As of 2005, nearly 10% of those code points are already assigned.
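
The arithmetic behind that figure: Unicode defines 17 planes of 2^16 code points each, and 17 × 65,536 = 1,114,112 = 2^20 + 2^16. As a sketch of what that boundary means in code (the function name is illustrative, not from any particular library):

<pre>
#include <stdio.h>

/* A Unicode code point is any value in 0..0x10FFFF (17 planes of
   0x10000 code points each), minus the surrogate range 0xD800-0xDFFF,
   which UTF-16 reserves for its own encoding machinery. */
int is_valid_code_point(long cp)
{
    if (cp < 0 || cp > 0x10FFFF)
        return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    return 1;
}

int main(void)
{
    printf("%d\n", 17 * 0x10000);                  /* 1114112 */
    printf("%d\n", is_valid_code_point(0x2500));   /* 1 (box drawing) */
    printf("%d\n", is_valid_code_point(0x110000)); /* 0 (out of range) */
    return 0;
}
</pre>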
 
Unicode may be stored using a variety of encodings. [[Java]], for example, uses UCS-2, and each character takes up 2 bytes, or 16 bits.
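
Another widely used encoding is UTF-8, which stores each code point in one to four bytes and leaves plain ASCII untouched. The following hand-rolled encoder is only a sketch of how the bit-packing works, not code from any particular library:

<pre>
#include <stdio.h>

/* Encode one code point as UTF-8; returns the number of bytes
   written to out (1-4), or 0 for an invalid code point. */
int utf8_encode(long cp, unsigned char out[4])
{
    if (cp < 0)
        return 0;
    if (cp < 0x80) {                    /* 7 bits: plain ASCII byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* up to 11 bits: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* up to 16 bits: three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* up to 21 bits: four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

int main(void)
{
    unsigned char buf[4];
    int n = utf8_encode(0xE2, buf);     /* U+00E2, "â" */
    for (int i = 0; i < n; i++)
        printf("%02X ", buf[i]);        /* prints: C3 A2 */
    printf("\n");
    return 0;
}
</pre>

Encoding U+00E2 ("â") yields the two bytes C3 A2, which is why a UTF-8 file opened under a one-byte code page shows two garbage characters where a single accented letter should be.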
 
Unicode handles definitions, but fonts handle the display, and a certain font may not be equipped to display certain characters. As a result, while it is guaranteed that the a-with-a-hat-on-top will be defined as an a-with-a-hat-on-top on every computer under Unicode, it is not at all guaranteed that the user will be able to see it.


== Unicode in Roguelikes ==
Many roguelikes, including [[NetHack]], have included options to take advantage of high-order [[ASCII]] to get extra symbols, allowing them to draw walls with lines rather than # marks.


Unicode has the potential of letting roguelikes access these special glyphs in a platform-independent manner. Unfortunately, Unicode, by its "embrace every known glyph" mentality, can't guarantee that the font you are using actually has the glyph defined.
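
As an illustration, on a terminal configured for a Unicode locale a roguelike can emit box-drawing code points directly through the standard C wide-character API. This is a minimal sketch; it assumes a UTF-8 locale and a font that actually carries the glyphs:

<pre>
#include <locale.h>
#include <wchar.h>

int main(void)
{
    /* Adopt the user's locale (e.g. en_US.UTF-8) so wide characters
       are converted to the terminal's encoding on output. */
    setlocale(LC_ALL, "");

    /* A tiny room drawn with box-drawing glyphs instead of '#':
       U+2500-U+2518 hold the single-line edges and corners. */
    wprintf(L"\u250C\u2500\u2500\u2500\u2510\n");   /* top wall    */
    wprintf(L"\u2502.@.\u2502\n");                  /* side walls  */
    wprintf(L"\u2514\u2500\u2500\u2500\u2518\n");   /* bottom wall */
    return 0;
}
</pre>

If the font lacks those glyphs, the program still runs, but the terminal substitutes blanks or replacement characters, which is exactly the caveat described above.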


Roguelikes that use Unicode are:

== External links ==