6/07/2010

06-07-10 - Unicode CMD Code Page Checkup

I wrote a while ago about the horrible fuckedness of Unicode support on Windows :

cbloom rants 06-21-08 - 3
cbloom rants 06-15-08 - 2
cbloom rants 06-14-08 - 3

Part of the nastiness was that in Win32 , command line apps get args in OEM code page, but most Windows APIs expect files in ANSI code page. (see my pages above - simply doing OEM to ANSI conversions is not the correct way to fix that) (also SetFileAPIsToOEM is a very bad idea, don't do that).

Here is what I have figured out on Win64 so far :

1. CMD is still using 8 bit characters. Technically they will tell you that CMD is a "true unicode console". eh, sort of. It uses whatever the current code page is set to. Many of those code pages are destructive - they do not preserve the full unicode name. This is what causes the problem I have talked about before of the possibility of having many unicode names which show up as the same file in CMD.

2. "chcp 65001" changes you code page to UTF-8 which is a non-destructive code page. This only works with some fonts that support it (I believe Lucida Concole works if you like that). The raster fonts do not support it.

3. printf() with unicode in the MSVC clib appears to just be written wrong; it does not do the correct code page translation. Even wprintf() passed in correct unicode file names does not do the code page translation correctly. It appears to me they are always converting to ANSI code page. On the other hand, WriteConsoleW does appear to be doing the code page translation correctly. (as usual the pedantic rule lawyers will spout a bunch of bullshit about the fact that printf is just fine the way it is and it doesn't do translations and just passes through binary; not true! if I give it 16 bit chars and it outputs 8 bit chars, clearly it is doing translation and it should let me control how!)


expected : printf with wide strings (unicode) would do translation to the console code page 
(as selected with chcp) so that  characters show up right.
(this should probably be done in the stdout -> console pipe)

observed : does not translate to CMD code page, appears to always use A code page even with
the console is set to another code page

4. CMD has a /U switch to enable unicode. This is not what you think, all it does is make the output of built-in commands unicode. Note that your command line apps might be piped to a unicode text file. To handle this correctly in your own console app, you need to detect that you are being piped to unicode and do unicode output instead of converting to console CP. Ugly ugly.

5. CMD display is still OEM code page by default. In the old days that was almost never changed, but nowadays more people are in fact changing it. To be polite, your app should use GetConsoleOutputCP() , you should *NOT* use SetConsoleOutputCP from a normal command line app because the user's font choice might not support the code page you want.

6. CMD argc/argv argument encoding is still in the console code page (not unicode). That is, if you run a command line app from CMD and auto-complete to select a file with unicode name, you are passed the code page encoding of that unicode name. (eg. it will be bytewise identical to if you did FindFirstW and then did UnicodeToOem). This means GetCommandLineW() is still useless for command line apps - you cannot get back to the original unicode version of the command line string. It is possible for you to get started with unicode args (eg. if somebody many you from CreateProcessW), in which case GetCommandLineW actually is useful, but that is so rare it's not really worth worrying about.


expected : GetCommandLineW (or some other method) would give you the original full unicode arguments
(in all cases)

observed : arguments are only available in CMD code page

7. If I then call system() from my app with the CMD code page name, it fails. If I find the Unicode original and convert it to Ansi, it is found. It appears that system() uses the ANSI code page (like other 8-bit file apis). ( system() just winds up being CreateProcess ). This means that if you just take your own command line that called you and do the same thing again with system() - it might fail. There appears to be no way to take a command line that works in CMD and run it from your app. _wsystem() seems to behave well, so that might be the cleanest way to proceed (presuming you are already doing the work of promoting your CMD code page arguments to proper unicode).


repro : write an app that takes your own full argc/argv array, and spawn a process with those same args 
(use an exe name that involved troublesome characters)

expected : same app runs again

observed : if A and Console code pages differ, your own app may not be found

8. Copy-pasting from CMD consoles seems to be hopeless. You cannot copy a chunk of unicode text from Word or something and paste it into a console and have it work (you would expect it to translate the unicode into the console's code page, but no). eg. you can't copy a unicode file name in explorer and paste it into a console. My oh my.


repro : copy-paste a file name from (unicode) Explorer into CMD

expected : unicode is translated into current CMD code page and thus usable for command line arguments

observed : does not work

9. "dir" seems to cheat. It displays chars that aren't in the OEM code page; I think they must be changing the code page to something else to list the files then changing it back (their output seems to be the same as mine in UTF8 code page). This is sort of okay, but also sort of fucked when you consider problem #8 : because of this dir can show file names which will then not be found if you copy-paste them to your command line!


repro : dir a file with strange characters in it.  copy-paste the text output from dir and type 
"dir <paste>" on the command line

expected : file is found by dir of copy-paste

observed : depending on code page, the file is not be found

So far as I can tell there is no API tell you the code page that your argc/argv was in. That's a pretty insane ommission. (hmm, it might be GetConsoleCP , though I'm not sure about that). (I'm a little unclear about when exactly GetConsoleCP and GetConsoleOutputCP can not be the same; I think the answer is they are only different if output is piped to a file).

I haven't tracked down all the quirks yet, but at the moment my recommendation for best practices for command line apps goes like this :

1. Use GetConsoleCP() to find the input CP. Take your argc/argv and match any file arguments using FindFirstW to get the unicode original names. (I strongly advising using cblib/FileUtil for this as it's the only code I've ever seen that's even close to being correct). For arguments that aren't files, convert from the console code page to wchar.

2. Work internally with wchar. Use the W versions of the Win32 File APIs (not A versions). Use the _w versions of clib FILE APIs.

3. To printf, either just write your own printf that uses WriteConsoleW internally, or convert wide char strings to GetConsoleOutputCP() before calling into printf.

For more information :

Console unicode output - Junfeng Zhang's Windows Programming Notes - Site Home - MSDN Blogs
windows cmd pipe not unicode even with U switch - Stack Overflow
Unicode output on Windows command line - Stack Overflow
INFO SetConsoleOutputCP Only Effective with Unicode Fonts
GetConsoleOutputCP Function (Windows)
BdP Win32ConsoleANSI

Addendum : I've updated cblib and chsh with new versions of everything that should now do all this at least semi-correctly.


BTW a semi-related rant :

WTF are you people who define these APIs not actually programmers? Why the fuck is it called "wcslen" and not "wstrlen" ? And how about just fucking calling it strlen and using the C++ overload capability? Here are some sane ways to do things :


typedef wchar_t wchar; // it's not fucking char_t

// yes int, not size_t mutha fucka
int wstrlen(const wchar * str) { return wcslen(str); }
int strlen(const wchar * str) { return wstrlen(str); }

// wsizeof replaces sizeof for wchar arrays
#define wsizeof(str)    (sizeof(str)/sizeof(wchar))
// best to just always use countof for strings instead of sizeof
#define countof(str)	(sizeof(str)/sizeof(str[0]))

fucking be reasonable to your users. You ask too much and make me do stupid busy work with the pointless difference in names between the non-w and the w string function names.

Also the fact that wprintf exists and yet is horribly fucked is pretty awful. It's one of those cases where I'm tempted to do a


#define wprintf wprintf_does_not_do_what_you_want

kind of thing.

7 comments:

nerdasha said...
This comment has been removed by the author.
Thatcher Ulrich said...

UTF-8 FTW!

Anonymous said...

I imagine the _t came about because they wanted to avoid collisions in code that was already using the term 'wchar' (or 'size', or etc.).

As far as I can imagine wcslen() must have been a decision not to want things to be one character longer. Possibly also it's "wcs" for "wide character string" and someone pedantically decided that "wide string" sounded lame and so "wstr" wouldn't do? I don't know, I can't get behind the wcs thing at all, because it just totally fucks. (E.g. wstrlen, wstrcpy, wsprintf, wfgets... that could have all been consistent.)

I use utf8 everywhere and have wrappers around fopen etc. that in windows convert to utf16 and call the appropriate functions. And I just blow off functioning correctly for filenames from the commandline.

cbloom said...

"I use utf8 everywhere and have wrappers around fopen etc. that in windows convert to utf16 and call the appropriate functions. "

Yeah I thought about doing that when I originally went unicode. On the surface it has the appeal that your existing char * based string code all just works, so that's cool. The problem I have is that workage is a bit illusory. eg. lots of those string routines will assume that a char is one letter. And you have to be really careful to always call the wrapped routines, it makes it hard to make sure you converted everything to unicode correctly because you can easily be calling plain old A-page stuff from anywhere.

"And I just blow off functioning correctly for filenames from the commandline. "

CLI4LIFE !

cbloom said...

"Yeah I thought about doing that when I originally went unicode. "

Anyway, the internal unicode part in UTF-whatever is not the hard part, that's just coding, and Coding is Easy. The hard part is interacting with the Windows CLI which is thoroughly fucked.

(I have yet to determine if other Windows mechanisms like DDE or Clipboard or any of those ways of getting file paths sent to you are functional or not).

Anonymous said...

"chcp 65001" works in Win XP. Well, I haven't checked that it actually sends commandline args in utf8. I just mean that if I type the command it claims to work.

cbloom said...

"chcp 65001" works in Win XP. Well, I haven't checked that it actually sends commandline args in utf8. I just mean that if I type the command it claims to work."

Yeah chcp 65001 works on any modern Windows - depending on what font you have set on your command line. I haven't actually tested thoroughly how useful that is, because if you're trying to write command line apps that work on any system you can't rely on that.

old rants