VC++ MFC Tutorial: Unicode, MBCS and Generic text mappings
Chris Maunder
Environment: Unicode, MBCS
To allow your programs to be used in international markets, it is worth making your application Unicode or MBCS aware. On Windows, Unicode is a "wide character" set (2 bytes per character) that aims to cover the characters of every language, including technical symbols and special publishing characters. A multibyte character set (MBCS) uses either 1 or 2 bytes per character and is used for character sets that contain large numbers of different characters (e.g. Asian language character sets).
Which character set you use depends on the language and the operating system. Unicode requires more space than MBCS, since each character is 2 bytes. On Windows NT it is also faster than MBCS: NT uses Unicode as standard, so non-Unicode strings passed to and from the operating system must be translated, incurring overhead. However, Unicode is not supported on Win95, so MBCS may be a better choice there. Note that if you wish to develop applications for the Windows CE environment, all applications must be compiled in Unicode.
Using MBCS or Unicode
The best way to use Unicode or MBCS, or indeed even ASCII, in your programs is to use the generic text mapping macros provided by Visual C++. That way you can simply use a single define to swap between Unicode, MBCS and ASCII without having to do any recoding.
To use MBCS or Unicode you need only define either _MBCS or _UNICODE in your project. For Unicode you will also need to specify the entry point symbol in your Project settings as wWinMainCRTStartup. Please note that if both _MBCS and _UNICODE are defined, the result will be unpredictable.
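Because defining both symbols gives unpredictable results, it can help to make the compiler reject that combination outright. The following is a minimal sketch, not something Visual C++ provides for you; the function name CharacterSetName is hypothetical and exists only to report which mapping is active in the current build.

```cpp
#include <cassert>
#include <cstring>

// Refuse to build if both character-set symbols are defined at once.
#if defined(_UNICODE) && defined(_MBCS)
#error "Define either _UNICODE or _MBCS, not both"
#endif

// Hypothetical helper: names the character set selected at compile time.
const char* CharacterSetName()
{
#if defined(_UNICODE)
    return "Unicode";
#elif defined(_MBCS)
    return "MBCS";
#else
    return "SBCS";   // neither symbol defined: single-byte (ASCII) build
#endif
}
```

Placing the guard in a header that every source file includes means a misconfigured project fails at compile time instead of misbehaving at run time.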
Generic Text mappings and portable functions
The generic text mappings replace the standard char or LPSTR types with the generic TCHAR or LPTSTR types. These map to different types and functions depending on whether you have compiled with _UNICODE or _MBCS (or neither) defined. The simplest way to use the TCHAR type is to use the CString class: it is extremely flexible and does most of the work for you.
In conjunction with the generic character type, there is a set of generic string manipulation functions prefixed by _tcs. For instance, instead of using the strrev function in your code, you should use the _tcsrev function, which will map to the correct function depending on which character set you have compiled for. The table below demonstrates:
#define | Compiled Version | Example
_UNICODE | Unicode (wide-character) | _tcsrev maps to _wcsrev
_MBCS | Multibyte-character | _tcsrev maps to _mbsrev
None (the default: neither _UNICODE nor _MBCS defined) | SBCS (ASCII) | _tcsrev maps to strrev
Each str* function has a corresponding _tcs* function that should be used instead. See the TCHAR.H file for all the mappings and macros that are available. To find the equivalent portable function, look up the string function in question in the online help.
Note: Do not use the str* family of functions with Unicode strings, since Unicode strings are likely to contain embedded null bytes.
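To see why the str* functions fail on wide strings, the sketch below views a wide literal as raw bytes and hands it to strlen. It is portable standard C++ and does not depend on Visual C++; the function names NarrowLengthOfWide and WideLength are hypothetical. For ASCII text the character code fits in one byte, so the other byte or bytes of the first wide character are zero and strlen stops almost immediately.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical helper: what strlen reports when handed wide-character data.
std::size_t NarrowLengthOfWide(const wchar_t* wide)
{
    // Reading the buffer through a char pointer is permitted. strlen stops
    // at the first zero byte, which for ASCII text lies inside the first
    // wide character.
    return std::strlen(reinterpret_cast<const char*>(wide));
}

// The correct wide-character count, for comparison.
std::size_t WideLength(const wchar_t* wide)
{
    return std::wcslen(wide);
}
```

For the literal L"Hello", WideLength reports 5 while NarrowLengthOfWide reports a much smaller number; the exact value depends on the byte order of the machine.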
The next important point is that each literal string should be enclosed by the TEXT() (or _T()) macro. This macro prepends an "L" to literal strings if the project is being compiled in Unicode, and does nothing if MBCS or ASCII is being used. For instance, the string _T("Hello") will be interpreted as "Hello" in MBCS or ASCII, and as L"Hello" in Unicode. If you are working in Unicode and do not use the _T() macro, you will usually get compiler errors or warnings about mismatched string types.
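The mechanism behind the generic mappings can be sketched in a few lines. This is an illustration only and not the contents of TCHAR.H: DemoTChar, DEMO_T and demo_tcslen are hypothetical names standing in for TCHAR, _T() and _tcslen, and the MBCS case is left out for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

#ifdef _UNICODE
// Unicode build: wide characters, L-prefixed literals, wcs* functions.
typedef wchar_t DemoTChar;
#define DEMO_T(x) L##x
#define demo_tcslen std::wcslen
#else
// Default build: plain chars, unmodified literals, str* functions.
typedef char DemoTChar;
#define DEMO_T(x) x
#define demo_tcslen std::strlen
#endif

// The same source line compiles correctly in either build.
std::size_t GreetingLength()
{
    const DemoTChar* greeting = DEMO_T("Hello");
    return demo_tcslen(greeting);
}
```

GreetingLength returns the character count of the literal whichever symbol is defined, which is the whole point of writing against the generic names.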
Note that you can use ASCII and Unicode within the same program, but not within the same string.
All MFC functions except for database class member functions are Unicode aware.
Converting between Generic types and ASCII
Visual C++ provides a set of very useful macros for converting between different character formats. The basic form of these macros is X2Y(), where X is the source format and Y is the destination format. Possible conversion formats are shown in the following table.
String Type | Abbreviation
ASCII (LPSTR) | A
WIDE (LPWSTR) | W
OLE (LPOLESTR) | OLE
Generic (LPTSTR) | T
Const | C
Thus, A2W converts an LPSTR to an LPWSTR, OLE2T converts an LPOLESTR to an LPTSTR, and so on.
There are also const forms (denoted by a C) that convert to a const string. For instance, A2CT converts from LPSTR to LPCTSTR.
When using the string conversion macros you need to include the USES_CONVERSION macro at the beginning of your function:
void foo(LPSTR lpsz)
{
    USES_CONVERSION;    // must appear before any conversion macro is used
    ...
    LPTSTR szGeneric = A2T(lpsz);
    // Do something with szGeneric
    ...
}
Two caveats on using the conversion macros:
- Never use the conversion macros inside a tight loop. The macros allocate their temporary buffers on the stack, and that memory is not released until the enclosing function returns, so every pass through the loop allocates more memory and the code will be slow. It is better to perform the conversion outside the loop and pass the converted value into the loop.
- Never return the result of the macros directly from a function, unless the return value implies making a copy of the data before returning. For instance, if you have a function that returns an LPTSTR, then do not do the following:
LPTSTR BadReturn(LPSTR lpsz)
{
    USES_CONVERSION;
    // do something
    return A2T(lpsz);
}
Instead, you should return the value as a CString, which implies that a copy of the string is made before the function returns:
CString GoodReturn(LPSTR lpsz)
{
    USES_CONVERSION;
    // do something
    return A2T(lpsz);
}
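Both caveats come down to who owns the converted buffer, and the same rules can be illustrated without the conversion macros. The sketch below is portable standard C++ and is not how the macros themselves work: WidenAscii is a hypothetical stand-in for A2W that handles 7-bit ASCII only and returns an owning std::wstring (much as GoodReturn returns a CString), and CountMatches is a hypothetical caller that converts once, before its loop.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for A2W: widens 7-bit ASCII text only.
std::wstring WidenAscii(const char* narrow)
{
    std::wstring wide;
    for (const char* p = narrow; *p != '\0'; ++p)
    {
        assert(static_cast<unsigned char>(*p) < 0x80);  // ASCII input assumed
        wide.push_back(static_cast<wchar_t>(*p));
    }
    return wide;  // returned by value, so the caller owns its own copy
}

// Counts the lines that contain 'needle'. The conversion is done once,
// outside the loop, and the converted value is reused on every pass.
std::size_t CountMatches(const char* needle,
                         const std::wstring lines[], std::size_t count)
{
    const std::wstring wideNeedle = WidenAscii(needle);
    std::size_t matches = 0;
    for (std::size_t i = 0; i < count; ++i)
    {
        if (lines[i].find(wideNeedle) != std::wstring::npos)
            ++matches;
    }
    return matches;
}
```

Returning an owning string type costs one copy, but the result stays valid after the function that produced it has returned.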
Tips and Traps
- The TRACE statement
The TRACE macros have a few cousins, namely the TRACE0, TRACE1, TRACE2 and TRACE3 macros. These macros allow you to specify a format string (as in the normal TRACE macro) and either 0, 1, 2 or 3 parameters, without the need to enclose your literal format string in the _T() macro. For instance,
TRACE(_T("This is trace statement number %d\n"), 1);
can be written
TRACE1("This is trace statement number %d\n", 1);
- Viewing Unicode strings in the debugger
If you are using Unicode in your application and wish to view Unicode strings in the debugger, then you will need to go to Tools | Options | Debug and click on "Display Unicode Strings".
- The Length of strings
Be careful when performing operations that depend on the size or length of a string. For instance, CString::GetLength returns the number of characters in a string, NOT the size in bytes. If you were to write the string to a CArchive object, then you would need to multiply the length of the string by the size of each character in the string to get the number of bytes to write:
CString str = _T("Hello, World");
archive.Write( str, str.GetLength( ) * sizeof( TCHAR ) );
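The same distinction between characters and bytes holds for any string type. As a portable sketch using std::wstring in place of CString (the function name ByteCount is hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: bytes occupied by the characters of a wide string,
// not counting the terminating null.
std::size_t ByteCount(const std::wstring& text)
{
    // size() counts characters; multiply by the character size for bytes.
    return text.size() * sizeof(wchar_t);
}
```

With Visual C++ sizeof(wchar_t) is 2, so the 12 characters of "Hello, World" occupy 24 bytes. Other compilers may use a larger wchar_t, which is one more reason to write sizeof instead of a literal 2.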
- Reading and Writing ASCII text files
If you are using Unicode or MBCS then you need to be careful when writing ASCII files. The safest and easiest way to write text files is to use the CStdioFile class provided with MFC. Just use the CString class and the ReadString and WriteString member functions and nothing should go wrong. However, if you need to use the CFile class and its associated Read and Write functions, and you use the following code:
CFile file(...);
CString str = _T("This is some text");
file.Write( str, (str.GetLength()+1) * sizeof( TCHAR ) );
instead of
CStdioFile file(...);
CString str = _T("This is some text");
file.WriteString(str);
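then the results will be significantly different. The first snippet writes the raw TCHAR buffer, terminating null included, so in a Unicode build each character occupies 2 bytes in the file.
[Figure: one line of text from each file, created using the first and second code snippets respectively, viewed using WordPad]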
- Not all structures use the generic text mappings
For instance, the CHARFORMAT structure, if the RichEditControl version is less than 2.0, uses a char[] for the szFaceName field, instead of a TCHAR[] as would be expected. You must be careful not to blindly change "..." to _T("...") without first checking. In this case, you would probably need to convert from TCHAR to char before copying any data to the szFaceName field.
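A minimal sketch of such a conversion follows. DemoCharFormat is a hypothetical stand-in for a structure with a char-only text field (it is not the real CHARFORMAT definition), SetFaceName is a hypothetical helper, and the standard wcstombs function is used here in place of any Windows-specific conversion routine; it converts according to the current C locale.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for a structure whose text field is always char.
struct DemoCharFormat
{
    char szFaceName[32];
};

// Copies a wide-character face name into the char field, truncating the
// name if it is too long. Returns false if a character cannot be converted.
bool SetFaceName(DemoCharFormat& format, const wchar_t* faceName)
{
    const std::size_t capacity = sizeof(format.szFaceName);

    // Convert at most capacity - 1 bytes, leaving room for the terminator.
    const std::size_t written =
        std::wcstombs(format.szFaceName, faceName, capacity - 1);

    if (written == static_cast<std::size_t>(-1))
    {
        format.szFaceName[0] = '\0';
        return false;  // a character had no multibyte representation
    }

    // wcstombs does not terminate the output when it fills the buffer.
    format.szFaceName[written] = '\0';
    return true;
}
```

In a Unicode build the face name held in a TCHAR buffer is already a wide string and can be passed straight in; in an MBCS or ASCII build no conversion is needed at all.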
- Copying text to the Clipboard
This is one area where you may need to use ASCII and Unicode in the same program, since the CF_TEXT format for the clipboard uses ASCII only. NT systems also offer the CF_UNICODETEXT format if you wish to use Unicode on the clipboard.
- Installing the Unicode MFC libraries
The Unicode versions of the MFC libraries are not copied to your hard drive unless you select them during a Custom installation. They are not copied during other types of installation. If you attempt to build or run an MFC Unicode application without the MFC Unicode files, you may get errors.
(From the online docs) To copy the files to your hard drive, rerun Setup, choose Custom installation, clear all other components except "Microsoft Foundation Class Libraries," click the Details button, and select both "Static Library for Unicode" and "Shared Library for Unicode."