Unicode Issues
Glossary
(adapted from NG:digitalmars.D/11247)
code unit: the technical name for a single primitive fragment of UTF-8, UTF-16 or UTF-32 (that is, the value held in a single char, wchar or dchar; see the sketch after this glossary). I tend to use the phrases UTF-8 fragment, UTF-16 fragment and UTF-32 fragment to express this concept.
code point: the technical name for the numerical value associated with a character. In Unicode, valid codepoints go from 0 to 0x10FFFF inclusive. In D, only a dchar is wide enough to hold every possible codepoint.
character: officially, the smallest unit of textual information with semantic meaning. Practically speaking, this means either (a) a control code; (b) something printable; or (c) a combiner, such as an accent you can place over another character. Every character has a unique codepoint, and every assigned codepoint in the range 0 to 0x10FFFF identifies a unique Unicode character. Unicode characters are often written in the form U+#### (for example, U+20AC, which is the character corresponding to codepoint 0x20AC).
As an observation, over 99% of all the characters you are likely to use, and which are involved in text processing, will occur in the range U+0000 to U+FFFF. Therefore an array of sixteen-bit values interpreted as characters will likely be sufficient for most purposes. (A UTF-16 string may be interpreted in this way.) If you want that extra 1%, as some apps will, you'll need to go the whole hog and recognise characters all the way up to U+10FFFF.
grapheme: a printable base character which may have been modified by zero or more combining characters (for example 'a' followed by combining-acute-accent).
glyph: one or more graphemes glued together to form a single printable symbol. The Unicode character zero-width-joiner usually acts as the glue.
Unicode: a standard that defines a complete listing of all characters (glyphs, printable symbols) in use all over the world in all written languages. Unicode is published as a book (4.0, 1500+ pages, $57, ISBN 0321185781) and on the internet [1]. A Unicode character definition connects a unique number (code U+####), a unique picture of the character, and a unique name for the character.
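To tie these terms together, here is a minimal D sketch. It assumes nothing beyond the built-in string types and std.stdio; the counts in the comments follow from the definitions above.

    import std.stdio;

    void main()
    {
        // Two characters (codepoints): 'é' (U+00E9) and '€' (U+20AC).
        string  s8  = "é€";   // UTF-8:  array of char  -- 1-byte code units
        wstring s16 = "é€";   // UTF-16: array of wchar -- 2-byte code units
        dstring s32 = "é€";   // UTF-32: array of dchar -- 4-byte code units

        // .length counts code units, not characters.
        writeln(s8.length);   // 5 (2 units for é, 3 for €)
        writeln(s16.length);  // 2 (both fit in a single 16-bit unit)
        writeln(s32.length);  // 2 (one code unit per codepoint)

        // A grapheme may be more than one codepoint:
        // 'e' followed by U+0301 (combining acute accent) displays as é.
        dstring g = "e\u0301";
        writeln(g.length);    // 2 codepoints, but 1 grapheme
    }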
D Libraries
- ICU Bindings for D (Mango.icu)
- libiconv: http://www.algonet.se/~afb/d/libiconv.d ( NG:digitalmars.D/17913)
- StringClasses
- DsourceProject:deimos has some Unicode modules (in the future they may be moved to Phobos).
Links for learning about Unicode
- http://www.icu-project.org/docs/papers/forms_of_unicode/
- http://www.hackcraft.net/xmlUnicode/
- http://www.cl.cam.ac.uk/~mgk25/unicode.html
- http://www.unicode.org/unicode/faq/utf_bom.html
- http://en.wikipedia.org/wiki/UTF-8
- http://www.joelonsoftware.com/articles/Unicode.html
Links
- Unicode Home Page ( http://www.unicode.org/)
- CharsAndStrs
- StringsInD
- IdeaDiscussion/Internationalization
- Tech tip: Using UTF on Windows
What are Unicode and UTF?
(adapted from NG:digitalmars.D/11409)
Well, they are different kinds of objects. Unicode is a character set; UTF-16 is an encoding. Bear with me - I'll try to make that clearer.
A character set is a set of characters in which each character has a number associated with it, called its "codepoint". For example, in the ASCII character set, the character 'A' has a codepoint of 65 (more usually written in hex, as 0x41). In the Unicode character set, 'A' also has a codepoint of 65, and the character '€' (not present in ASCII) has a codepoint of 8,364 (more normally written in hex as 0x20AC).
Unicode characters are often written as U+ followed by their codepoint in hexadecimal. That is, U+20AC means the same thing as €.
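To make the numbers concrete, here is a small D sketch using only character literals and std.stdio; the values in the comments are the codepoints quoted above.

    import std.stdio;

    void main()
    {
        dchar a    = 'A';
        dchar euro = '€';

        // A character value is just its codepoint.
        writefln("U+%04X = %s", cast(uint) a,    a);    // U+0041 = A
        writefln("U+%04X = %s", cast(uint) euro, euro); // U+20AC = €
        assert(a == 65 && euro == 0x20AC);
    }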
Once upon a time, Unicode was going to be a sixteen-bit wide character set. That is, there were going to be (at most) 65,536 characters in it. Thus, every Unicode string would fit comfortably into an array of 16-bit-wide words.
Then things changed. Unicode grew too big. Suddenly, 65,536 characters wasn't going to be enough. But too many important real-life applications had come to rely on characters being 16-bits wide (for example: Java and Windows, to name a couple of biggies). Something had to be done. That something was UTF-16.
UTF-16 is a sneaky way of squeezing >65535 characters into an array originally designed for 16-bit words. Unicode characters with codepoints <0x10000 still occupy only one word; Unicode characters with codepoints >=0x10000 now occupy two words. (A special range of otherwise unused codepoints makes this possible).
In general, an "encoding" is a bidirectional mapping which maps each codepoint to an array of fixed-width objects called "code units". How wide is a code unit? Well, it depends on the encoding. UTF-8 code units are 8 bits wide; UTF-16 code units are 16 bits wide; and UTF-32 code units are 32 bits wide. So UTF-16 is a mapping from Unicode codepoints to arrays of 16-bit wide units. For example, the codepoint 0x10000 maps (in UTF-16) to the array [ 0xD800, 0xDC00 ].
You can learn all about this in much more detail here: http://www.unicode.org/faq/utf_bom.html
Newsgroup Threads Involving Unicode
FolderDiscussions