Opinion: Yes, UTF-16 should be considered harmful. The very reason it exists is that some time ago there was a misguided belief that widechar was going to be what UCS-4 now is.
Despite the "anglo-centrism" of UTF-8, it should be considered the only useful encoding for text. One can argue that source code of programs, web pages, XML files, OS file names and other computer-to-computer text interfaces should never have existed. But when they do, text is not only for human readers.
On the other hand, the UTF-8 overhead is a small price to pay, while it has significant advantages, such as compatibility with unaware code that just passes strings as char*. This is a great thing. There are few useful characters which are SHORTER in UTF-16 than they are in UTF-8.
I believe that all other encodings will die eventually. This means that MS-Windows, Java, ICU and Python will have to stop using UTF-16 as their favorite. After long research and discussions, the development conventions at my company ban using UTF-16 anywhere except in OS API calls, and this despite the importance of performance in our applications and the fact that we use Windows. Conversion functions were developed to convert always-assumed-UTF-8 std::strings to native UTF-16, which Windows itself does not support properly.
To people who say "use what is needed where it is needed", I say: there's a huge advantage to using the same encoding everywhere, and I see no sufficient reason to do otherwise. In particular, I think adding wchar_t to C++ was a mistake, and so are the Unicode additions to C++0x. What must be demanded from STL implementations, though, is that every char* parameter be considered Unicode-compatible.
I am also against the "use what you want" approach. I see no reason for such liberty. There's enough confusion on the subject of text, resulting in all this broken software. Having said the above, I am convinced that programmers must finally reach a consensus on UTF-8 as the one proper way. (I come from a non-ASCII-speaking country and grew up on Windows, so I'd be the last one expected to attack UTF-16 on religious grounds.)
I'd like to share more information on how I do text on Windows, and what I recommend to everyone else for compile-time-checked Unicode correctness, ease of use, and better portability of the code. The suggestion differs substantially from what is usually recommended as the proper way of using Unicode on Windows. Yet, in-depth research of these recommendations led to the same conclusion. So here goes:
- Do not use std::wstring anywhere except at points adjacent to APIs that accept UTF-16.
- Don't use L"" UTF-16 literals (these should, IMO, be taken out of the standard as part of UTF-16 deprecation).
- Don't use types, functions, or their derivatives that are sensitive to the _UNICODE constant, such as LPTSTR or CreateWindow(). Still, keep _UNICODE always defined, so that passing char* strings to WinAPI fails to compile instead of being silently accepted.
- char* strings anywhere in the program are considered UTF-8 (if not said otherwise).
- All my strings are std::string, though you can pass a char* or a string literal to convert(const std::string &).
- Only use Win32 functions that accept wide chars (LPWSTR), never those that accept LPSTR. Pass parameters this way:
::SetWindowTextW(Utils::convert(someStdString or "string literal").c_str())
(The policy uses the conversion functions below.)
With MFC strings:
CString someoneElse; // something that arrived from MFC. Converted as soon as possible, before passing any further away from the API call:
std::string s = str(boost::format("Hello %s\n") % Convert(someoneElse));
AfxMessageBox(MfcUtils::Convert(s), _T("Error"), MB_OK);
Working with files, filenames, and fstream on Windows:
- Never pass const char* filename arguments to the fstream family. The MSVC STL does not support UTF-8 arguments, but it has a non-standard extension which should be used as follows:
- Convert std::string arguments to std::wstring with Utils::convert:
std::ifstream ifs(Utils::convert("hello"), std::ios_base::in | std::ios_base::binary);
- We'll have to manually remove the conversion when MSVC's attitude to fstream changes.
- This code is not multi-platform and may have to be changed manually in the future.
- See fstream Unicode research/discussion case 4215 for more info.
- Never produce text output files with non-UTF-8 content.
- Avoid using fopen() for RAII/OOD reasons. If necessary, use _wfopen() and the WinAPI conventions above.
// For interface to win32 API functions
std::string convert(const std::wstring& str, unsigned int codePage /*= CP_UTF8*/)
// Ask me for implementation..
std::wstring convert(const std::string& str, unsigned int codePage /*= CP_UTF8*/)
// Ask me for implementation..
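Since the answer withholds the implementation ("Ask me.."), here is a portable sketch of what such a pair can look like. Everything in it is my own assumption: it hand-rolls the UTF-8/UTF-16 transcoding, ignores the codePage parameter (i.e. assumes CP_UTF8), assumes well-formed input, and stores UTF-16 code units in wchar_t the way a Windows build (16-bit wchar_t) would. A real Windows implementation would more likely wrap MultiByteToWideChar / WideCharToMultiByte.

```cpp
#include <cstdint>
#include <string>

// UTF-8 -> UTF-16 (code units stored in wchar_t). Assumes well-formed input.
std::wstring convert(const std::string& str) {
    std::wstring out;
    for (std::size_t i = 0; i < str.size();) {
        unsigned char c = str[i];
        std::uint32_t cp;
        int len;
        if      (c < 0x80) { cp = c;        len = 1; }  // ASCII
        else if (c < 0xE0) { cp = c & 0x1F; len = 2; }  // 2-byte sequence
        else if (c < 0xF0) { cp = c & 0x0F; len = 3; }  // 3-byte sequence
        else               { cp = c & 0x07; len = 4; }  // 4-byte sequence
        for (int k = 1; k < len; ++k)
            cp = (cp << 6) | (str[i + k] & 0x3F);       // fold continuation bytes
        i += len;
        if (cp < 0x10000) {
            out += wchar_t(cp);
        } else {                                        // encode a surrogate pair
            cp -= 0x10000;
            out += wchar_t(0xD800 + (cp >> 10));
            out += wchar_t(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}

// UTF-16 -> UTF-8. Assumes well-formed input.
std::string convert(const std::wstring& str) {
    std::string out;
    for (std::size_t i = 0; i < str.size(); ++i) {
        std::uint32_t cp = str[i];
        if (cp >= 0xD800 && cp < 0xDC00 && i + 1 < str.size())  // surrogate pair
            cp = 0x10000 + ((cp - 0xD800) << 10) + (std::uint32_t(str[++i]) - 0xDC00);
        if (cp < 0x80) {
            out += char(cp);
        } else if (cp < 0x800) {
            out += char(0xC0 | (cp >> 6));
            out += char(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += char(0xE0 | (cp >> 12));
            out += char(0x80 | ((cp >> 6) & 0x3F));
            out += char(0x80 | (cp & 0x3F));
        } else {
            out += char(0xF0 | (cp >> 18));
            out += char(0x80 | ((cp >> 12) & 0x3F));
            out += char(0x80 | ((cp >> 6) & 0x3F));
            out += char(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

A round trip such as convert(convert(someUtf8String)) should return the original bytes for any valid input.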
// Interface to MFC
std::string convert(const CString &mfcString)
{
#ifdef UNICODE
    return Utils::convert(std::wstring(mfcString.GetString()));
#else
    return mfcString.GetString(); // This branch is deprecated.
#endif
}

CString convert(const std::string &s)
{
#ifdef UNICODE
    return CString(Utils::convert(s).c_str());
#else
    Exceptions::Assert(false, "Unicode policy violation. See W569"); // This branch is deprecated as it does not support unicode
    return s.c_str();
#endif
}