Difference between revisions of "Unicode"

(Strengthened assertion that fonts lack character defintions.)
 
(9 intermediate revisions by 8 users not shown)
'''Unicode''' is the international character encoding standard. A character encoding assigns each character (a letter, a glyph, or some other symbol) to a code point (a number). For example, the character "A" is mapped to code point 65, while the symbol for Yen, "¥", is mapped to code point 165 in Unicode (some older code pages placed it at 157 instead). These code points are stored (in memory or on disk) as a sequence of bits.


== History of character encodings ==


[[ASCII]] is a character encoding which was introduced in 1963. Before ASCII, few computers would agree on what bit patterns constituted, for example, the letter 'A'. ASCII reserves 128 (2^7) code points, which is half of the 256 values that one byte can hold:


<pre>
0-31    [non-printable control characters]
32-63    !"#$%&'()*+,-./0123456789:;<=>?
64-95  @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
96-127 `abcdefghijklmnopqrstuvwxyz{|}~
</pre>


Speakers of non-English languages found that their accented letters, such as a-with-a-hat-on-top (â), could not be represented. This was solved with the introduction of various extensions to ASCII, which used the upper half of an 8-bit character set to define accented characters and various other graphical symbols:


<pre>
128-159 Çüéâä??ç?ë??î?Ä?É??ôö????ÖÜ?????
160-191 áíóú??????¬???«»????????????????
192-223 ????????????????????????????????
224-255 ?ß????µ??????????±????÷?°?·????
</pre>


The trouble with these extensions was that there were so many to choose from. While the 100th character was always the same from computer to computer, the 200th character was not. It depended upon which "code page" was loaded. Assumptions about code pages being the same led to strange gibberish being displayed rather than the correct accented character. The other problem was that ASCII had no support for any of the thousands of characters used by non-Roman languages.
 
Unicode is an attempt to resolve this once and for all by using a larger character set of 1,114,112 (2^20 + 2^16) code points, allowing it to map all known human glyphs into one consistent character set. Nearly 10% of those code points are already assigned.
 
Unicode may be stored using a variety of encodings. [[Java]], for example, uses UTF-16 (originally UCS-2), in which most characters take up 2 bytes, or 16 bits, and the remainder take up 4 bytes.


Unicode handles the problem of defining characters, but it is up to fonts to handle the display of characters. Almost every font is only equipped to display a small subset of Unicode characters. As a result, while it is guaranteed that a-with-a-hat-on-top will be defined as a-with-a-hat-on-top on every computer under Unicode, it is not at all guaranteed that the user will be able to see it.


== Unicode in Roguelikes ==


Many roguelikes, including [[NetHack]], have included options to take advantage of the high-order (8-bit) extensions to [[ASCII]] to get extra symbols, allowing them to draw walls with lines rather than # marks.


Unicode has the potential of letting roguelikes access these special glyphs in a platform-independent manner. It also allows for the use of glyphs from East Asian scripts (which may be ideal for those who can read them, as an entire word may be represented by a single character). Unfortunately, Unicode, by its "embrace every known glyph" mentality, can't guarantee that the font that you are using actually has the glyph defined.


== Roguelikes using Unicode ==
Roguelikes that use Unicode are:
* [[ChessRogue]]
* [[Legerdemain]]
* [[Dungeon Crawl Stone Soup]]
* [[Neon]]


== External links ==


* [https://github.com/globalcitizen/zomia/blob/master/USEFUL-UNICODE.md Useful Unicode characters for Roguelikes] (courtesy of [[Zomia]]; not hosted here as the wiki breaks many Unicode characters; contributions welcome)
* [http://en.wikipedia.org/wiki/Unicode Everything you wanted to know about Unicode but were afraid to ask]
* [http://www.joelonsoftware.com/articles/Unicode.html The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)]


[[category:roguelike development]]
[[category:articles]]

Latest revision as of 07:53, 22 September 2016

Unicode is a character encoding standard, similar in spirit to ASCII but encoding vastly more characters. Whereas there are only 94 printable ASCII characters (95 if the space is counted), there are well over a hundred thousand printable Unicode characters.

Roguelike games are historically console-based games that use characters to represent monsters, treasures, and maps. While some developers now prefer to use graphical tiles instead, Unicode gives roguelikes that still draw with characters a much greater variety of symbols to choose from.

Unicode is derived from a host of earlier character encodings, each of which added some characters to the ASCII set, but most of which added no more than 128 additional characters by using the high bit (8th bit). While ASCII with its 94 printable characters was a reasonable way to represent English text, these high-bit encodings, or code pages, each of which was ASCII plus some selected characters, were a reasonably complete way to encode one, two, or a few additional languages each. But it wouldn't work if you took the high-bit encoding that handled, say, Arabic characters and tried to use it to represent German text, which requires Roman characters with umlauts.

The trouble with these extensions was that there were so many to choose from. While characters 0-127 (which were encoded by ASCII) were always the same from computer to computer, characters 128-255 (which were encoded by whatever high-bit encoding was in use locally) were not. Assumptions about code pages being the same led to strange gibberish being displayed rather than the correct characters.

Unicode is an attempt to resolve this once and for all by using a larger character set of 1,114,112 (2^20 + 2^16) code points, allowing it to map all known human glyphs into one consistent character set. Nearly 10% of those code points are already assigned. Any Unicode character can be represented with 21 bits, although depending on which encoding your local system uses, the actual number of bits devoted to a particular character may be more or less. The first 128 code points (0-127), like the first 128 code points of the ISO 8859 standards, are the same as in ASCII.
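The arithmetic above, and the way the storage size varies with the encoding, can be checked with a short Python sketch. The sample characters and the helper name encoded_bits are illustrative choices, not something taken from this article.

```python
# Sketch: how many bits one character occupies under three Unicode encodings.
# The sample characters below are arbitrary illustrations.
samples = ["A", "\u00e2", "\u2502", "\U0001F409"]  # A, a-circumflex, box-drawing wall, dragon

def encoded_bits(ch):
    """Return the number of bits ch occupies in UTF-8, UTF-16 and UTF-32."""
    return {
        enc: 8 * len(ch.encode(enc))
        for enc in ("utf-8", "utf-16-le", "utf-32-le")
    }

# The size of the code space and the width of the largest code point.
print(2**20 + 2**16)             # total number of code points
print((0x10FFFF).bit_length())   # bits needed for the highest code point

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_bits(ch))
```

UTF-8 spends the fewest bits on ASCII characters and more on everything else, while UTF-32 always spends 32 bits, more than the 21 that any code point strictly needs.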

The majority of the most useful characters are encoded in the first 65536 locations of Unicode, which is called the Basic Multilingual Plane. Since this is the number of different bit patterns that can be represented with two bytes, some programming languages (such as Java) use two bytes as their basic character type.
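A consequence of a two-byte character type is that characters outside the Basic Multilingual Plane need two such units (a surrogate pair). The following Python sketch illustrates this; the helper names utf16_units and in_bmp are made up for the example.

```python
def in_bmp(ch):
    """True if ch lies in the Basic Multilingual Plane (U+0000 to U+FFFF)."""
    return ord(ch) <= 0xFFFF

def utf16_units(ch):
    """Return the list of 16-bit code units that store ch in UTF-16."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

block = "\u2588"        # FULL BLOCK, inside the BMP
dragon = "\U0001F409"   # DRAGON, outside the BMP

for ch in (block, dragon):
    units = [f"{u:04X}" for u in utf16_units(ch)]
    print(f"U+{ord(ch):04X}", "BMP" if in_bmp(ch) else "non-BMP", units)
```

A roguelike that sticks to BMP glyphs can treat one two-byte unit as one map cell; one that uses characters beyond it cannot make that assumption.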

Unicode in Roguelikes

Many roguelikes, including NetHack, have historically included options to take advantage of high-bit encodings to get extra symbols, allowing them to draw walls with lines rather than # marks. But if a different code page than the expected one happens to be loaded, the results can be humorous, since the code points that the program expects to contain lines may contain, say, Arabic or Cyrillic characters instead. This can be seen when NetHack is played with its IBMgraphics option under DOS code page 850.
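The mismatch can be reproduced by decoding the same bytes under different code pages. This is a minimal Python sketch; the three bytes are an arbitrary wall segment chosen for illustration, not NetHack's actual symbol set, and the code pages are ones that ship with Python's standard codecs.

```python
# Three bytes that a program written for code page 437 would send
# to draw the top edge of a double-line box.
WALL_BYTES = bytes([0xC9, 0xCD, 0xBB])

def render(raw, codepage):
    """Interpret raw bytes the way a display using codepage would."""
    return raw.decode(codepage)

for codepage in ("cp437", "cp1251", "latin-1"):
    print(f"{codepage:8}", render(WALL_BYTES, codepage))
```

Under cp437 the bytes come out as box-drawing characters; under cp1251 (a Cyrillic code page) and Latin-1 the very same bytes come out as letters and punctuation.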

Modern terminal programs, however, mostly use Unicode now rather than the various earlier code pages, so using Unicode characters allows roguelikes to access these special glyphs in a platform-independent manner. It also allows for the use of glyphs from East Asian scripts (which may be ideal for those who can read them, as an entire word may be represented by a single character).

Unfortunately, Unicode, by its "embrace every known glyph" mentality, can't guarantee that the font that you are using actually has the glyph defined.
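One common way to cope is to keep a plain ASCII fallback for every fancy glyph. The Python sketch below is only an illustration under stated assumptions: the GLYPHS table and the function names are hypothetical, and the check it performs is whether the output encoding can carry the character, which says nothing about whether the user's font can draw it. For that reason it also takes a use_unicode switch that a game could expose as a player option.

```python
# Hypothetical glyph table: name -> (preferred Unicode glyph, ASCII fallback).
GLYPHS = {
    "wall_h": ("\u2500", "-"),
    "wall_v": ("\u2502", "|"),
    "floor": ("\u00b7", "."),
}

def can_encode(ch, encoding):
    """True if ch can be written in the given output encoding."""
    try:
        ch.encode(encoding)
        return True
    except (UnicodeEncodeError, LookupError):
        return False

def pick_glyph(name, encoding, use_unicode=True):
    """Choose the Unicode glyph when allowed and encodable, else the fallback."""
    fancy, plain = GLYPHS[name]
    if use_unicode and can_encode(fancy, encoding):
        return fancy
    return plain

print(pick_glyph("wall_h", "utf-8"))
print(pick_glyph("wall_h", "ascii"))
print(pick_glyph("wall_v", "utf-8", use_unicode=False))
```

In a real game the encoding argument might come from sys.stdout.encoding, but the player-controlled switch remains the only reliable answer to a font that lacks the glyph.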

It is surprisingly difficult to get a C program to use Unicode characters effectively in combination with a curses display, mostly due to inadequate documentation of the requirements. These requirements are listed at the page on Ncursesw.

Roguelikes using Unicode

Roguelikes that use Unicode are:

External links