Common Coding Mistakes – Wide Character Arrays

This post contains a few of my thoughts on common coding mistakes we see during code reviews when developers deal with wide character arrays. Manipulating wide character strings is reasonably easy to get right, but there are plenty of “gotchas” still popping up. Coders should make sure they take care because a few things can slip your mind when dealing with these strings and result in mistakes.

A little bit of background:

The term wide character generally refers to character data types with a width larger than a byte (the width of a normal char). The actual size of a wide character varies between implementations, but the most common sizes are 2 bytes (i.e. Windows) and 4 bytes (i.e. Unix-like OSes). Wide characters usually represent a particular character using one of the Unicode character sets: in Windows this will be UTF-16 and for Unix-like systems, whose wide characters are twice the size, this will usually be UTF-32.

Windows seems to love wide character strings and has made them standard. As a result, many Windows APIs have two versions: functionNameA and functionNameW, an ANSI version and a wide char string version, respectively. If you’ve done any development on Windows systems, you’ll definitely be no stranger to wide character strings.

There are definite advantages to representing strings as wide char arrays, but there are a lot of mistakes to make, especially if you’re used to developing on Unix-like systems or you forget to consider the fact that one character does not equal one byte.

For example, consider the following scenario, where a Windows developer begins to unsuspectingly parse a packet that follows their proprietary network protocol. The code shown takes a UTF-16 string length (unsigned int) from the packet and performs a bounds check. If the check passes, a string of the specified length (assumed to be a UTF-16 string) is copied from the packet buffer to a fresh wide char array on heap.

[ … ]

if(packet->dataLen > 34 || packet->dataLen < sizeof(wchar_t)) bailout_and_exit();

size_t bufLen = packet->dataLen / sizeof(wchar_t);

wchar_t *appData = new wchar_t[bufLen];

memcpy(appData, packet->payload, packet->dataLen);

[ … ]

This might look okay at first glance; after all, we’re just copying a chunk of data to a new wide char array. But consider what would happen if packet->dataLen was an odd number. For example, if packet->dataLen = 11, we end up with size_t bufLen = 11 / 2 = 5 since the remainder of the division will be discarded.

So, a five-element–long wide character buffer is allocated into which the memcpy() copies 11 bytes. Since five wide chars on Windows is 10 bytes (and 11 bytes are copied), we have an off-by-one overflow. To avoid this, the modulo operator should be used to check that packet->dataLen is even to begin with; that is:

if(packet->dataLen % 2) bailout()

Another common occurrence is to forget that the NULL terminator on the end of a wide character buffer is not a single NULL byte: it’s two NULL bytes (or 4, on a UNIX-like box). This can lead to problems when the usual len + 1 is used instead of the len + 2 that is required to account for the extra NULL byte(s) needed to terminate wide char arrays, for example:

int alloc_len = len + 1;

wchar_t *buf = (wchar_t *)malloc(alloc_len);

memset(buf, 0x00, len);

wcsncpy(buf, srcBuf, len);

If srcBuf had len wide chars in it, all of these would be copied into buf, but wcsncpy() would not NULL terminatebuf. With normal character arrays, the added byte (which will be a NULL because of the memset) would be the NULL terminator and everything would be fine. But since wide char strings need either a two- or four-byte NULL terminator (Windows and UNIX, respectively), we now have a non-terminated string that could cause problems later on.

Some developers also slip up when they wrongly interchange the number of bytes and the number of characters. That is, they use the number of bytes as a copy length when what the function was asking for was the number of characters to copy; for example, something like the following is pretty common:

int destLen = (stringLen * (sizeof(wchar_t)) + sizeof(wchar_t);

wchar_t *destBuf = (wchar_t *)malloc(destLen);

MultiByteToWideChar(CP_UTF8, 0, srcBuf, stringLen, destBuf, destLen);

[ do something ]

The MultiByteToWide API is documented at <http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx>.

The problem with the sample shown above is that the sixth parameter to MultiByteToWideChar is the length of the destination buffer in wide characters, not in bytes, as the call above was done. Our destination length is out by a factor of two here (or four on UNIX-like systems, generally) and ultimately we can end up overrunning the buffer. These sorts of mistakes result in overflows and they’re surprisingly common.

The same sort of mistake can also be made when using “safe” wide char string functions, like wcsncpy(), for example:

unsigned int destLen = (stringLen * sizeof(wchar_t)) + sizeof(wchar_t);

wchar_t destBuf[destLen];

memset(destBuf, 0x00, destLen);

wcsncpy(destBuf, srcBuf, sizeof(destBuf));

Although using sizeof(destuff) for maximum destination size would be fine if we were dealing with normal characters, this doesn’t work for wide character buffers. Instead, sizeof(destBuf) will return the number of bytes indestBuf, which means the wcsncpy() call above it can end up copying twice as many bytes to destBuf as intended—again, an overflow.

The other wide char equivalent string manipulation functions are also prone to misuse in the same ways as their normal char counterparts—look for all the wide char equivalents when auditing such functions as swprintf,wcscpy, wcsncpy, etc. There also are a few wide char-specific APIs that are easily misused; take, for example,wcstombs(), which converts a wide char string to a multi-byte string. The prototype looks like this:

size_t wcstombs(char *restrict s, const wchar_t *restrict pwcs, size_t n);

It does bounds checking, so the conversion stops when n bytes have been written to s or when a NULL terminator is encountered in pwcs (the source buffer). If an error occurs, i.e. a wide char in pwcs can’t be converted, the conversion stops and the function returns (size_t)-1, else the number of bytes written is returned. The MSDN considerswcstombs() to be deprecated, but there are still a few common ways to mess when using it, and they all revolve around not checking return values.

If a bad wide character is encountered in the conversion and you’re not expecting a negative number to be returned, you could end up under-indexing your array; for example:

int i;

i = wcstombs( … )  // wcstombs() can return -1

buf[i] = L'';

If a bad wide character is found during conversion, the destination buffer will not be NULL terminated and may contain uninitialized data if you didn’t zero it or otherwise initialize it beforehand.

Additionally, if the return value is n, the destination buffer won’t be NULL terminated, so any string operations later carried out on or using the destination buffer could run past the end of the buffer. Two possible consequences are a potential page fault if an operation runs off the end of a page or potential memory corruption bugs, depending on howdestbuf is usedlater . Developers should avoid wcstombs() and use wcstombs_s() or another, safer alternative. Bottom line: always read the docs before using a new function since APIs don’t always do what you’d expect (or want) them to do.

Another thing to watch out for is accidentally interchanging wide char and normal char functions. A good example would be incorrectly using strlen() on a wide character string instead of wcslen()—since wchar strings are chock full of NULL bytes, strlen() isn’t going to return the length you were after. It’s easy to see how this can end up causing security problems if a memory allocation is done based on a strlen() that was incorrectly performed on a wide char array.

Mistakes can also be made when trying to develop cross-platform or portable code—don’t hardcode the presumed length of wchars. In the examples above, I have assumed sizeof(wchar_t) = 2; however, as I’ve said a few times, this is NOT necessarily the case at all, since many UNIX-like systems have sizeof(wchar_t) = 4.

Making these assumptions about width could easily result in overflows when they are violated. Let’s say someone runs your code on a platform where wide characters aren’t two bytes in length, but are four; consider what would happen here:

wchar_t *destBuf = (wchar_t *)malloc(32 * 2 + 2);

wcsncpy(destBuf, srcBuf, 32);

On Windows, this would be fine since there’s enough room in destBuff for 32 wide chars + NULL terminator (66 bytes). But as soon as you run this on a Linux box—where wide chars are four bytes—you’re going to get wcsncpy()writing 4 * 32 + 2 = 130 bytes and resulting in a pretty obvious overflow.

So don’t make assumptions about how large wide characters are supposed to be since it can and does vary. Always usesizeof(wchar_t) to find out.

When you’re reviewing code, keep your eye out for the use of unsafe wide char functions, and ensure the math is right when allocating and copying memory. Make sure you check your return values properly and, most obviously, read the docs to make absolutely sure you’re not making any mistakes or missing something important.