Text in D
Daniel Keep
Copyright: This document has been placed in the Public Domain.
Many thanks to Bill Baxter, Jarrett Billingsley, Anders F Björklund, Lutger Blijdestijn, Thomas Kuehne, Pierre Rouleau and Max Samuha for their input, and to Walter Bright for making such a great language.
One of the great features of D is its fantastic support for text. However, many people new to D have trouble understanding why things are the way they are. People coming from a C or C++ background are quickly confused by the fact that char does not appear to work the way they expect it to, whilst people coming from a Java, C# or interpreted language background wonder why D has three different character types, and no string class.
This article will hopefully address these questions, and help explain the how and why of text in D. But first, some background.
Back when C was created, the dominant character encoding in use was ASCII. ASCII was cool because it could encode every letter of the western alphabet, numbers, and a whole bunch of punctuation. If you needed more characters, then by golly you could just stick them in the upper 128 fields as an extension to ASCII.
This led to the rather unfortunate mess that is character encodings. They arose out of the impossibility of fitting every language’s symbols into just 128 characters. Things became worse with multi-byte character sets like Shift_JIS, where you couldn’t even count on each 8-bit code being an actual symbol. You also had to carry around a description of which code page you were using. It only got worse if you wanted to use multiple character encodings in a single text document: you usually can’t.
In the end, this led to the creation of Unicode: a character encoding to replace all other character encodings. Unicode differs significantly from most other character encodings in that it identifies every one of its symbols by a unique integer called a code point. For example, the N-ary summation symbol “∑” is identified in Unicode as code point 0x2211. By contrast, this symbol is not defined in most character encodings, usually because there simply isn’t room.
However, Unicode by itself does not specify how to actually store these code points; it merely defines what they mean. This is where the Unicode Transformation Formats come into play.
UTF-32 is the easiest to understand. Every Unicode code point is stored literally as a 32-bit unsigned integer. The obvious disadvantage to this is that it requires a large amount of space to store even the simplest of text.
UTF-16 is somewhat more complex. As the name suggests, it is based around 16-bit unsigned integers. However, since you cannot represent every Unicode code point with only 16 bits, it uses variable-length encoding to make sure you can store any code point you please. Most normal code points will only use a single 16-bit value, with more uncommon code points taking up two. Each of these 16-bit values is called a “code unit.”
UTF-8 applies the same variable-length idea to 8-bit integers. Code points that fall into the traditional ASCII range remain exactly the same (meaning ASCII is effectively a subset of UTF-8), with all other code points taking between 2 and 4 bytes (that is, code units) to store.
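To make the difference between the three formats concrete, here is a small sketch that counts the code units one code point needs in each encoding. It assumes a current D2 compiler and a source file saved as UTF-8; the helper name unitCounts is made up for this example.

```d
import std.stdio : writeln;

// Returns how many code units the same text needs in UTF-8, UTF-16 and UTF-32.
// (unitCounts is a hypothetical helper, not part of any library.)
size_t[3] unitCounts(string utf8, wstring utf16, dstring utf32)
{
    return [utf8.length, utf16.length, utf32.length];
}

void main()
{
    // U+2211 fits in a single 16-bit value.
    writeln(unitCounts("∑", "∑"w, "∑"d)); // [3, 1, 1]

    // U+1D11E (a musical clef) does not, so UTF-16 needs two code units.
    writeln(unitCounts("𝄞", "𝄞"w, "𝄞"d)); // [4, 2, 1]
}
```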
So, by now, you’re probably thinking “what a complete and total mess!” To a degree it is, but it’s important to realise that this is a huge simplification of how things used to be. What’s important to take from all this is that there are three distinct ways of representing Unicode text, and all three are supported directly in D.
Unlike C, which says nothing on, for example, how to store Japanese text, D is designed to use Unicode internally for all text storage. This means that instead of having to support multiple character encodings in your programs, you only need to support one, possibly using a library to convert to and from Unicode as necessary.
Specifically, here is how the various encodings translate to D types:
- char is a UTF-8 code unit,
- wchar is a UTF-16 code unit,
- dchar is a UTF-32 code point,
- char[] is a UTF-8 string,
- wchar[] is a UTF-16 string and
- dchar[] is a UTF-32 string.
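The mapping above can be checked by the compiler itself. This is only a sketch, assuming a current D2 compiler, where the string, wstring and dstring aliases name immutable arrays of the three character types.

```d
// Each character type is exactly one code unit of its encoding.
static assert(char.sizeof  == 1); // UTF-8 code unit
static assert(wchar.sizeof == 2); // UTF-16 code unit
static assert(dchar.sizeof == 4); // UTF-32 code unit

// In D2, the string aliases are immutable arrays of those types.
static assert(is(string  == immutable(char)[]));
static assert(is(wstring == immutable(wchar)[]));
static assert(is(dstring == immutable(dchar)[]));

void main()
{
    char[]  a = "text".dup;  // mutable UTF-8 string
    wchar[] b = "text"w.dup; // mutable UTF-16 string
    dchar[] c = "text"d.dup; // mutable UTF-32 string
    assert(a.length == 4 && b.length == 4 && c.length == 4);
}
```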
The first thing that trips up people new to D is that the following program works:
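A minimal sketch of such a program, assuming a current D2 compiler (where string literals have the type string rather than char[]); printUnits is a name made up for this example.

```d
import std.stdio : writeln;

// Prints each code unit on its own line and returns how many were printed.
// (printUnits is a hypothetical helper.)
size_t printUnits(string text)
{
    size_t printed;
    foreach (char c; text)
    {
        writeln(c);
        ++printed;
    }
    return printed;
}

void main()
{
    // Every code unit here is a complete ASCII character.
    printUnits("hello");
}
```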
But this one doesn’t:
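A sketch of the failing case, again assuming D2. Whether writing a lone code unit raises an error by itself depends on the library version, so this version makes the check explicit with std.utf.validate; printUnitsChecked is a made-up name.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

// Writes a string one code unit at a time, insisting that each unit
// is valid UTF-8 on its own. (printUnitsChecked is a hypothetical helper.)
void printUnitsChecked(string text)
{
    foreach (i; 0 .. text.length)
    {
        auto unit = text[i .. i + 1];
        validate(unit); // throws UTFException for half a code point
        writeln(unit);
    }
}

void main()
{
    try
    {
        // "h" is printed, then the first half of "є" is rejected.
        printUnitsChecked("hєllѲ");
    }
    catch (UTFException e)
    {
        writeln("error: ", e.msg);
    }
}
```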
It simply crashes out with an error saying something about “invalid UTF sequence.” Many people see this and wonder what’s going on.
The answer is something like this: remember how UTF-8 encodes code points using somewhere between one and four individual code units? Well, in D, a char is only a single UTF-8 code unit, so it cannot contain all possible code points. The problem is that “є” requires two code units to represent; it is actually stored as "\xD1\x94".
So when the program comes to print out the second “character,” the standard library throws up the red flag saying “wait a second, 0xD1 isn’t a valid UTF-8 sequence; you can’t print that!” You’re basically trying to write out half a code point, which really doesn’t make any sense.
Is the standard library at fault? Not really; you don’t exactly want to be outputting incomplete code points, otherwise other programs could choke on your output. You certainly wouldn’t appreciate being fed garbage text.
The way to fix this is to realise that you’re using the wrong type for the job. Remember, a single char cannot possibly hold all valid code points. What you need to do is use a type which can:
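A sketch of the fix, assuming a current D2 compiler: the only real change is asking foreach for dchar instead of char. The helper codePointsOf is made up so the result can be inspected.

```d
import std.stdio : writeln;

// Collects the code points of a UTF-8 string.
// (codePointsOf is a hypothetical helper.)
dchar[] codePointsOf(string text)
{
    dchar[] result;
    foreach (dchar c; text) // foreach decodes one whole code point per step
        result ~= c;
    return result;
}

void main()
{
    foreach (dchar c; "hєllѲ")
        writeln(c); // five lines, one per code point
}
```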
The above code works perfectly, since the foreach loop is smart enough to decode a single complete code point at a time.
The second problem comes up when programmers discover the power of D’s arrays. They see things like the built-in length property and slicing and think “cool; I can use those on strings!”
When their code fails miserably on international text, they wonder just what’s gone wrong. The problem is, once again, that UTF-8 and UTF-16 don’t necessarily store a single code point in a single code unit. For example, if s is the UTF-8 string "hєllѲ", then s[1] does not give you “є”. It gives you "\xD1", which isn’t what you really wanted. Similarly, s[1..5] gives you “єll” and not “єllѲ” as you would expect (since the “є” actually takes up two chars). The reason for this is that decoding a UTF-8 or UTF-16 stream from the start is all well and good, but finding the nth code point in the middle is difficult to do efficiently.
Similarly, the length property of a UTF-8 or UTF-16 string can be misleading; it is counting the number of code units, not the number of actual code points.
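These pitfalls are easy to see for yourself. The sketch below assumes a current D2 compiler and a source file saved as UTF-8.

```d
import std.stdio : writeln;

void main()
{
    string s = "hєllѲ";

    // Five code points, but seven UTF-8 code units.
    assert(s.length == 7);

    // Indexing yields a code unit: the first half of "є".
    assert(s[1] == 0xD1);

    // Slicing counts code units too, so "Ѳ" is left out.
    assert(s[1 .. 5] == "єll");

    // Slicing on code point boundaries is fine.
    assert(s[1 .. 3] == "є");
    writeln(s[1 .. 3]);
}
```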
The simplest way to deal with this is to stick to UTF-32 strings (that is, dchar[]) if you’re going to be doing a lot of indexing or slicing. This is because they do not suffer from these variable-length encoding problems. Another possible way to do this is to use a foreach loop to convert your string into individual code points, and manually extract the slice you want as you go.
The std.utf module provides many functions which you might find useful:
- std.utf.toUTF8(s) – converts s from any UTF encoding to UTF-8, and returns the result.
- std.utf.toUTF16(s) – as above, but for UTF-16.
- std.utf.toUTF32(s) – as above, but for UTF-32.
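As a rough illustration of how these functions help with slicing, the sketch below converts to UTF-32, slices by code point position, and converts back. It assumes the std.utf shipped with a current D2 compiler; sliceCodePoints is a made-up name.

```d
import std.stdio : writeln;
import std.utf : toUTF8, toUTF16, toUTF32;

// Slices a UTF-8 string by code point positions rather than code units.
// (sliceCodePoints is a hypothetical helper.)
string sliceCodePoints(string s, size_t from, size_t to)
{
    dstring wide = toUTF32(s);       // one array element per code point
    return toUTF8(wide[from .. to]); // back to UTF-8 for the caller
}

void main()
{
    writeln(sliceCodePoints("hєllѲ", 1, 5)); // єllѲ
    writeln(toUTF16("hєllѲ").length);        // 5
}
```

Converting the whole string costs time and memory, so this is best kept for text you will index or slice many times.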
Another trick to keep in mind is that when using foreach, you can also ask it to give you the index of each code point within the string:
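A sketch of this form of foreach, assuming a current D2 compiler; codePointStarts is a made-up helper that records the indices so they can be reused for slicing.

```d
import std.stdio : writeln;

// Returns the code unit index at which each code point begins.
// (codePointStarts is a hypothetical helper.)
size_t[] codePointStarts(string s)
{
    size_t[] starts;
    foreach (i, dchar c; s) // i counts code units, c is a whole code point
        starts ~= i;
    return starts;
}

void main()
{
    foreach (i, dchar c; "hєllѲ")
        writeln(i, ": ", c);
}
```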
Run over the UTF-8 string “hєllѲ”, for example, such a loop reports h at index 0, є at 1, l at 3, l at 4 and Ѳ at 5 (displaying the characters assumes your terminal can handle UTF-8).
Note that the index is that of the first code unit for that code point. These indices can be used in slicing operations to ensure you get a valid UTF sequence.
This is an area of active discussion. Many people assert that D needs a string class, whilst others say that it is unnecessary. Instead of trying to convince you either way, I’ll just explain why D doesn’t have a string class, and show what you can do without one.
An important thing to remember is that C++ grew a string class because C’s string handling was so incredibly painful. Java has a string class because Java is object-oriented to the extreme, and it makes sense to have one.
On the other hand, D does many of the things that C++ needed the string class for quite nicely by itself:
- Since all strings are arrays, all strings have a length property, meaning you don’t need a function to go looking for the end of a string.
- Strings can also be trivially concatenated together using the concatenation operator ~.
- Slicing works as expected for UTF-32 strings, and in UTF-8 and UTF-16 strings as long as you slice on known code point positions.
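Finally, any function that takes a string as its first argument can be called as though it were a method of that string: a call written as func(str, args) can be rewritten as str.func(args). This means that although you don’t have a string class, you can “fake” it by simply writing functions that take strings as their first argument; you aren’t even limited to what comes in the standard library, unlike in C++ and Java!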
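A small sketch of “faking” a string method this way, assuming a current D2 compiler; exclaim is an invented function, not something from the standard library.

```d
import std.stdio : writeln;

// An ordinary free function whose first parameter is a string.
// (exclaim is a hypothetical example function.)
string exclaim(string s, size_t times)
{
    foreach (_; 0 .. times)
        s ~= "!";
    return s;
}

void main()
{
    writeln(exclaim("hello", 2)); // ordinary call: hello!!
    writeln("hello".exclaim(2));  // the same call, written method-style
}
```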
For a full list of what string manipulation functions come with D, take a look at http://www.digitalmars.com/d/phobos/std_string.html.
If you really, really can’t live without the warm comforting embrace of a string class, you can find a good one at http://www.dprogramming.com/dstring.php.
By now you should understand the problems that arise because of D’s use of UTF encodings. However, there is another problem that comes about because of how D represents arrays.
Back before D had the std.stdio.writefln function, most examples used the old C function printf. This worked fine until you tried to output a string:
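As a stand-in for that kind of statement, here is one well-defined way to see the mismatch with a current D2 compiler: hand C the pointer of a D slice. The slice knows it is five code units long, but C only sees a pointer and reads on until it finds a NUL. The helper cLengthOfPrefix is made up for this sketch.

```d
import core.stdc.stdio : printf;
import core.stdc.string : strlen;

// Length that C sees when given only the pointer of the first n code units.
// (cLengthOfPrefix is a hypothetical helper; it assumes s is NUL-terminated,
// which D guarantees for string literals only.)
size_t cLengthOfPrefix(string s, size_t n)
{
    return strlen(s[0 .. n].ptr);
}

void main()
{
    string whole = "hello world";
    string part = whole[0 .. 5];

    assert(part.length == 5);                // what D knows
    assert(cLengthOfPrefix(whole, 5) == 11); // what C sees

    printf("%s\n", part.ptr); // prints "hello world", not "hello"
}
```

With a string that did not come from a literal, there may be no NUL at all, and C will keep reading whatever happens to follow in memory.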
Statements like the above are very likely to print out garbage, which leaves many people scratching their heads. The reason is that C uses NUL-terminated strings, whereas D uses true arrays. In other words:
- Strings in C are a pointer to the first character. A string ends at the first NUL character.
- Strings in D are a pointer to the first code unit, followed by a length. There is no terminator.
Thankfully, there is an easy solution:
The std.string.toStringz function converts any char[] string to a C-compatible char* string by ensuring that there is a terminating NUL.
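A minimal sketch of passing a D string to printf through toStringz, assuming the std.string and core.stdc modules of a current D2 compiler:

```d
import core.stdc.stdio : printf;
import core.stdc.string : strlen;
import std.string : toStringz;

void main()
{
    string whole = "hello world";
    string part = whole[0 .. 5]; // no NUL directly after these five units

    // toStringz hands back a pointer to a NUL-terminated copy of the text.
    const(char)* cstr = toStringz(part);
    assert(strlen(cstr) == 5);

    printf("%s\n", cstr); // prints "hello"
}
```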
So you’ve been clever and added some nifty symbols into your source file using Unicode, only to have the compiler barf on them. “What's wrong?” I hear you ask; “I thought D supported Unicode source!”
In fact, it does. There are two problems you might run into:
- The editor you used may support Unicode, but didn't end up saving in it. Go back and double-check that the file really is Unicode. How you do this depends on your editor, but there's usually an option lying around somewhere to set a file's character encoding.
- The other is a bit obscure: if you save your source file in Unicode without a Byte Order Mark and the first character is outside the ASCII character range, D won't be able to read it properly.
Use an editor that properly supports UTF. Seriously, even Windows Notepad does it correctly!
Yes, it can. D source files support four character encodings: ASCII, UTF-8, UTF-16 and UTF-32. Provided your source file is saved in one of these encodings, you can include any character you like.
Of course, this requires that you use an editor that properly supports UTF; as stated above, using an editor that incorrectly writes out UTF files can cause the D compiler to choke on your source files.
There are two ways to do this:
- Enter the characters you want directly, and save the source file in one of the UTF encodings.
- Find out what the code point for the symbol you want to use is, and then manually enter it into the string literal using \uXXXX for code points 0xFFFF and below, or \UXXXXXXXX if they don't fit in the first form. Remember, each X is a hexadecimal digit.
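Both approaches produce exactly the same string, as this sketch shows. It assumes a current D2 compiler and a source file saved as UTF-8.

```d
import std.stdio : writeln;

void main()
{
    // \uXXXX takes four hexadecimal digits.
    assert("\u2211" == "∑");

    // \UXXXXXXXX takes eight, for code points above 0xFFFF.
    assert("\U0001D11E" == "𝄞");

    // Either way, the literal holds the code point that was asked for.
    dstring clef = "\U0001D11E"d;
    assert(clef.length == 1 && clef[0] == 0x1D11E);

    writeln("\u2211 and \U0001D11E");
}
```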
You can store ASCII text directly using char strings. Remember, ASCII is a subset of UTF-8, which means that all ASCII strings are valid UTF-8 strings.
You can use pretty much any character allowed in C99. This boils down to any of the following:
- underscore (_),
- code points greater than or equal to \u00A0 and less than \uD800 and
- code points greater than \uDFFF.
- \u0024 ($),
- \u0040 (@) and
- \u0060 (`).
For that, you will need to use a ubyte[] array. You should not use char[] for this purpose, since char[] is supposed to contain UTF-8 strings, and text in other encodings more than likely isn't valid UTF-8.
To convert between Unicode and your chosen code page, you will want to use a library designed to do this: iconv < http://www.gnu.org/software/libiconv/ > is a popular open source library for code page conversions.
On Windows, you can look in std.windows.charset at the functions toMBSz() and fromMBSz() for converting to and from the Win-ANSI/OEM encodings.
Not directly. You can either roll your own system, or use an existing library like gettext < http://www.gnu.org/software/gettext/ > to do this.
This one’s tricky to answer.
For most cases, char[] is more than sufficient. It’s also usually the most succinct encoding for Unicode text. Problems only really arise when you need to look at a string's length or do indexing/slicing on fixed locations.
The first problem (getting the length) can be solved by using a function like the following:
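One straightforward way to write such a function, assuming a current D2 compiler, is to let foreach do the decoding and simply count. The name codePointCount is made up for this sketch.

```d
import std.stdio : writeln;

// Counts code points rather than code units.
// (codePointCount is a hypothetical helper.)
size_t codePointCount(const(char)[] s)
{
    size_t count;
    foreach (dchar c; s) // one iteration per decoded code point
        ++count;
    return count;
}

void main()
{
    string s = "hєllѲ";
    writeln(s.length);          // 7 code units
    writeln(codePointCount(s)); // 5 code points
}
```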
This will give you the correct answer. The second problem’s a little trickier. First of all, it’s important to realise that you can slice a UTF-8 or UTF-16 string: you just need to make sure you're not slicing in the middle of a code sequence. For example:
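A sketch of slicing at a searched-for position. It assumes a current D2 compiler, where std.string.indexOf plays the role of the find function; tailFrom is a made-up helper, and ö is written as the escape \u00F6 so the example does not depend on how the source file was saved.

```d
import std.stdio : writeln;
import std.string : indexOf;

// Returns everything from the first occurrence of needle onwards,
// or null if needle does not occur. (tailFrom is a hypothetical helper.)
string tailFrom(string s, string needle)
{
    auto at = s.indexOf(needle); // a code unit index, or -1
    return at < 0 ? null : s[at .. $];
}

void main()
{
    string band = "Mot\u00F6rhead"; // "Motörhead"
    writeln(band.indexOf("head"));  // 6, not 5: ö takes two code units
    writeln(tailFrom(band, "head")); // head
}
```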
Works just fine since the find function returns the code unit index, and not the code point index. What you need to be careful of is code like this:
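A sketch of the kind of fixed-position slice to watch out for, assuming a current D2 compiler; isValidUtf8 is a made-up helper built on std.utf.validate.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

// True when s is well-formed UTF-8. (isValidUtf8 is a hypothetical helper.)
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string band = "Mot\u00F6rhead"; // "Motörhead"

    // "The first four characters" taken as four code units cuts ö in half.
    writeln(isValidUtf8(band[0 .. 4])); // false

    // The first four code points actually occupy five code units.
    writeln(isValidUtf8(band[0 .. 5])); // true
}
```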
This doesn't work because ‘ö’ requires two UTF-8 code units to encode. Currently, there is no function in the standard library for extracting the nth character from a string, however you can use something like this:
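One simple way to write such a function, assuming a current D2 compiler, is to walk the string with foreach and stop at the code point you want. The name nthCodePoint and the choice to throw a plain Exception are assumptions of this sketch.

```d
import std.stdio : writeln;

// Returns the nth code point (counting from zero) of a UTF-8 string.
// (nthCodePoint is a hypothetical helper; it runs in linear time.)
dchar nthCodePoint(const(char)[] s, size_t n)
{
    size_t seen;
    foreach (dchar c; s)
    {
        if (seen == n)
            return c;
        ++seen;
    }
    throw new Exception("code point index is past the end of the string");
}

void main()
{
    string band = "Mot\u00F6rhead"; // "Motörhead"
    writeln(nthCodePoint(band, 3)); // ö
    writeln(nthCodePoint(band, 4)); // r
}
```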
Once you take care of those two problems, aside from things like the system API or what kind of text you're storing, it doesn't really matter which encoding you use.
It is a bit, actually. Here's a fast version written by Derek Parnell and Frits van Bommel that supports any given string type passed to it (not just char[]).
See the newsgroup thread starting at http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D.learn&article_id=7444.
Here's the long and short of it:
- Windows: the ANSI code page for old Win9x APIs, UTF-16 for WinNT APIs. You can tell the difference because ANSI APIs have a trailing 'A' on their name, whilst UTF-16 APIs have a trailing 'W'. For example: GetCommandLineA and GetCommandLineW.
- Linux: Depends on what you're calling, how it was compiled, system (locale) settings, etc. Best to read the documentation.
- Mac OS X: Usually UTF-8; some old-old-old functions may expect MacRoman (yuck!). Be careful with filenames though, because they allow only a specific normalized subset of UTF-8 (you can read them as UTF-8, but you can't use arbitrary UTF-8 as a filename unless you normalize it). http://developer.apple.com/qa/qa2001/qa1173.html
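For the UTF-16 case, a string usually has to be converted and NUL-terminated before it is handed over. The sketch below assumes std.utf.toUTF16z from a current D2 Phobos; it calls no system function, and wideLength is a made-up helper that measures the result the way a C API would.

```d
import std.stdio : writeln;
import std.utf : toUTF16z;

// Counts UTF-16 code units up to the terminating zero, as C code would.
// (wideLength is a hypothetical helper.)
size_t wideLength(const(wchar)* p)
{
    size_t n;
    while (p[n] != 0)
        ++n;
    return n;
}

void main()
{
    // A NUL-terminated UTF-16 copy, in the form "W"-style APIs expect.
    const(wchar)* wide = toUTF16z("hєllѲ");
    writeln(wideLength(wide)); // 5
}
```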
So here’s the short and sweet on text in D:
- char is a UTF-8 code unit, and may not be a complete code point.
- wchar is a UTF-16 code unit, and may not be a complete code point.
- dchar is a UTF-32 code unit, which is guaranteed to be a complete code point.
- char[] is a UTF-8 string, and uses one to four bytes per code point.
- wchar[] is a UTF-16 string, and uses two or four bytes per code point.
- dchar[] is a UTF-32 string and uses four bytes per code point.
- Outputting an incomplete UTF-8 or UTF-16 sequence will result in an error.
- You cannot reliably index or slice a UTF-8 or UTF-16 string at arbitrary positions due to variable-length encoding.
- The length property of a char[] or wchar[] array is the number of code units, not code points.
- Strings destined for a C function that expects NUL-terminated strings need to be passed through std.string.toStringz first (or manually make sure the NUL-terminator exists).
This version was manually transcribed from the original, and so there may be a few formatting errors.
If you update this document, please inform the original author.