Non Unicode Text In D
“Working with non-Unicode text files (e.g. in latin1-encoding)”
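D's built-in string types assume Unicode: char[] holds UTF-8, wchar[] UTF-16 and dchar[] UTF-32. Most other common encodings are single-byte, e.g. ISO 8859-1/-15 (Latin-1/Latin-9) or Windows-1252, depending on the operating system in use.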
If you still have text files in such encodings, there are several possibilities:
- convert all text files to UTF-8
- keep the text files in their current encoding, but convert the characters to Unicode when reading a file and back to the original encoding when writing it (Java does this, for example); see below.
- work with the current encoding without converting anything. Such strings can be stored as char[] or ubyte[] and processed as usual. Take care not to pass them to standard library functions that expect UTF-8 data, and treat them as binary data when reading from or writing to files or the console (see next section).
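The third option can be sketched as follows. This is only a minimal illustration: the function names loadRaw and countByte are hypothetical, and the file is assumed to be small enough to read in one piece. The bytes are never decoded, so no UTF-8 validation can fail.

```d
import std.file : read;

/// Loads a file as raw bytes; no decoding or validation takes place.
/// (Hypothetical helper, not part of any library.)
ubyte[] loadRaw(string path)
{
    return cast(ubyte[]) read(path);
}

/// Counts how often one byte value occurs, e.g. 0xE4 for 'ä' in Latin-1.
size_t countByte(const(ubyte)[] data, ubyte value)
{
    size_t n = 0;
    foreach (b; data)
    {
        if (b == value)
            ++n;
    }
    return n;
}
```

Storing the data as ubyte[] rather than char[] makes it harder to pass it to a UTF-8 function by accident.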
“How to print non-utf8 strings (e.g. in latin1-encoding)”
You cannot use writefln for this, because you will get an "invalid utf8-sequence" error. You have to use a lower-level function.
see HowTo/printf
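As one possible lower-level route, the C function fwrite can be called through core.stdc.stdio. The sketch below uses a hypothetical helper name, writeRaw, and assumes the console or file receiving the output interprets the bytes as Latin-1.

```d
import core.stdc.stdio : fwrite, stdout;

/// Writes the bytes to stdout unchanged, without any UTF-8 validation.
/// Returns the number of bytes actually written.
/// (Hypothetical helper; typical call: writeRaw(latin1Bytes);)
size_t writeRaw(const(ubyte)[] bytes)
{
    return fwrite(bytes.ptr, 1, bytes.length, stdout);
}
```

For instance, the Latin-1 bytes 0x47 0x72 0xFC 0xDF 0x65 ("Grüße") pass through as they are; whether they display correctly depends on the encoding the terminal expects.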
Latin-1 to Utf32 conversion (for reading from file)
This function can be used to convert latin1 to unicode. It is easy, because latin1 and unicode share the same first 256 codepoints.
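A minimal sketch of such a conversion is shown here. It is not the original listing from this page; the name latin1ToUtf32 is an assumption, and the input is assumed to be plain ISO 8859-1 (not Windows-1252, whose 0x80 to 0x9F range differs). Each byte value is used directly as the code point.

```d
/// Converts Latin-1 bytes to UTF-32: byte value N becomes code point U+00NN.
/// (Illustrative sketch; the function name is hypothetical.)
dchar[] latin1ToUtf32(const(ubyte)[] latin1)
{
    auto result = new dchar[](latin1.length);
    foreach (i, b; latin1)
        result[i] = cast(dchar) b; // same numeric value, wider type
    return result;
}
```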
Resulting dchars allow for easy character manipulation in the program (one dchar = one character).
(adapted from http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=54530)
Utf32 to Latin-1 conversion (writing to a file)
similar, to be done
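Until this section is filled in, here is one possible sketch of the reverse direction. Code points up to U+00FF map directly to a Latin-1 byte; anything above has no Latin-1 equivalent. Replacing such characters with '?' (0x3F) is an assumption made for this example, as are the function and parameter names; throwing an error would be an equally valid choice.

```d
/// Converts UTF-32 text to Latin-1 bytes.
/// Code points above U+00FF cannot be represented and are replaced.
/// (Illustrative sketch; names and the '?' fallback are assumptions.)
ubyte[] utf32ToLatin1(const(dchar)[] text, ubyte replacement = 0x3F)
{
    auto result = new ubyte[](text.length);
    foreach (i, c; text)
    {
        if (c <= 0xFF)
            result[i] = cast(ubyte) c; // direct mapping
        else
            result[i] = replacement;   // e.g. the euro sign U+20AC
    }
    return result;
}
```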
WinAnsi to/from Utf8 conversion (reading/writing textfiles on Windows)
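As a portable alternative to calling the Windows API, the Phobos module std.encoding can transcode between Windows-1252 and UTF-8. This sketch assumes that "WinAnsi" means code page 1252 (on a real system the ANSI code page depends on the locale), and the two function names are hypothetical.

```d
import std.encoding : transcode, Windows1252String;

/// Decodes Windows-1252 bytes into a UTF-8 string.
/// (Hypothetical helper built on std.encoding.transcode.)
string win1252ToUtf8(immutable(ubyte)[] bytes)
{
    string result;
    transcode(cast(Windows1252String) bytes, result);
    return result;
}

/// Encodes a UTF-8 string as Windows-1252 bytes.
immutable(ubyte)[] utf8ToWin1252(string text)
{
    Windows1252String encoded;
    transcode(text, encoded);
    return cast(immutable(ubyte)[]) encoded;
}
```

Unlike a Windows API call, this does not follow the system's active code page, so it only fits files known to be in Windows-1252.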
Latin1, WinAnsi ...
Latin-1 (ISO 8859-1) is similar to several other encodings. Latin-9 (ISO 8859-15), for example, differs mainly by adding the euro sign (€). Windows uses its own code pages such as Windows-1252, which differs from Latin-1 in the range 0x80 to 0x9F. On Windows, API calls should be used to convert these code pages to Unicode (see above).
see:
http://en.wikipedia.org/wiki/ISO/IEC_8859-1
http://en.wikipedia.org/wiki/ISO/IEC_8859-15
http://en.wikipedia.org/wiki/Windows-1252 (aka cp1252)
See also