Wide character conversions
[C library]

These functions convert between wide chars (16 or 32 bit) and their UTF-8 representation.

Defines

#define MB_CUR_MAX   _mb_cur_max()
 Maximum UTF-8 sequence length.
#define WEOF   ((wchar_t) -1)
 End-of-file marker for wchar_t streams.

Functions

int mbtowc (wchar_t *wc, const char *mb, size_t ml)
 Convert a UTF-8 sequence to a wide character.
int mbtowci (wchar_t **pp, const char *mb, size_t ml)
 Convert a UTF-8 sequence to a wide character.
int wctomb (char *mb, wchar_t wc)
 Converts a wide character to UTF-8 encoding.
int wcitomb (char *mb, wchar_t **pp)
 Converts a wide character stored in memory to UTF-8 encoding.

Detailed Description

These functions convert between wide chars (16 or 32 bit) and their UTF-8 representation.

Whether the wide character is 16 or 32 bits wide depends on whether you pass the -fshort-wchar switch to gcc: with the switch wchar_t is 16 bits, without it (the compiler default) wchar_t is 32 bits.

UTF-8, in its original form with sequences of up to 6 bytes, can encode any value between 0x00000000 and 0x7FFFFFFF. RFC3629 specifies that the valid range for a Unicode character is between 0x00000000 and 0x0010FFFF and that the range 0x0000D800 - 0x0000DFFF must not be used. In addition, the RFC specifies that UTF-8 sequences that are longer than necessary should be treated as erroneous.
You can configure the library to enforce RFC3629 conformance by defining the preprocessor symbol BLC_RFC3629_CONFORMANCE before you include the wchar.h header file.
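
The rules above can be sketched as a small predicate. This is only an illustration of the three RFC3629 restrictions, not the library's implementation; the helper names utf8_shortest_len and rfc3629_accepts are hypothetical. The first two lines show where the configuration symbol has to be placed.

```c
#define BLC_RFC3629_CONFORMANCE   /* must precede the include to take effect */
#include <wchar.h>
#include <stdint.h>

/* Shortest UTF-8 length for a code point in the original 6-byte scheme.
 * Returns 0 if the value cannot be encoded at all. (Hypothetical helper.) */
static int utf8_shortest_len(uint32_t cp)
{
    if (cp < 0x80u)         return 1;
    if (cp < 0x800u)        return 2;
    if (cp < 0x10000u)      return 3;
    if (cp < 0x200000u)     return 4;
    if (cp < 0x4000000u)    return 5;
    if (cp <= 0x7FFFFFFFu)  return 6;
    return 0;
}

/* Hypothetical helper: would a strict (RFC3629) decoder accept code point
 * cp when it arrived encoded as a seq_len-byte sequence? */
static int rfc3629_accepts(uint32_t cp, int seq_len)
{
    if (cp > 0x10FFFFu)
        return 0;                              /* beyond the Unicode range */
    if (cp >= 0xD800u && cp <= 0xDFFFu)
        return 0;                              /* reserved range */
    return seq_len == utf8_shortest_len(cp);   /* reject overlong forms */
}
```

For example, U+002F arriving as a 2-byte sequence is rejected by this predicate because one byte would have been enough.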

The functions declared by the header are actually macros, which select the underlying function depending on the size of the wide character type and whether you specified RFC3629 conformance or not.

To use these functions you have to include wchar.h.


Define Documentation

#define MB_CUR_MAX   _mb_cur_max()

Maximum UTF-8 sequence length.

The maximum number of bytes in the UTF-8 expansion of a wchar_t object. If wchar_t is 16 bits wide, then it is 3. If wchar_t is 32 bits wide and strict RFC conformance is not specified, then it is 6. For 32-bit wchar_t with strict RFC conformance, it is 4.
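
A typical use of MB_CUR_MAX is sizing an output buffer for the worst case. The sketch below assumes a hypothetical helper name, utf8_worst_case_size, and includes stdlib.h as well so that it also builds against a standard C library, where MB_CUR_MAX is declared there.

```c
#include <stdint.h>   /* SIZE_MAX */
#include <stdlib.h>   /* MB_CUR_MAX in standard C */
#include <wchar.h>    /* MB_CUR_MAX with this library */

/* Hypothetical helper: number of bytes that is always enough to hold the
 * UTF-8 form of nchars wide characters plus a terminating 0 byte.
 * Returns 0 if the computation would overflow size_t. */
static size_t utf8_worst_case_size(size_t nchars)
{
    size_t per_char = (size_t)MB_CUR_MAX;   /* 3, 4 or 6 with this library */

    if (nchars > (SIZE_MAX - 1) / per_char)
        return 0;
    return nchars * per_char + 1;
}
```

With 16-bit wchar_t, for instance, a 10 character string needs at most 10 * 3 + 1 = 31 bytes.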


Function Documentation

int mbtowc ( wchar_t *  wc,
const char *  mb,
size_t  ml 
)

Convert a UTF-8 sequence to a wide character.

If the character string pointer is not NULL, the function examines the multibyte string mb. If it is an empty string (i.e. its first character is 0), then the function returns 0. Otherwise, it examines not more than ml bytes of it and checks whether they form a properly UTF-8 encoded wide character. If so, the sequence is decoded and the wide character is stored at the location pointed to by wc. The return value is the length of the UTF-8 sequence that was decoded. If the sequence is in error or longer than the specified length, -1 is returned and no conversion takes place.

Parameters:
wc Pointer to the location where the wide character result should be stored, or NULL, if the result shall not be stored.
mb Pointer to a character string that represents the multibyte sequence that should be decoded.
ml The maximum number of bytes to examine from the sequence.
Returns:
The number of bytes used from the string if the conversion was successful, -1 if a conversion error was encountered, or 0 if the string pointer was NULL or the first character of the string was 0. If the string pointer was not NULL but the first byte of the string was 0, and the wc pointer is not NULL, then a 0 wide character is stored at the location wc points to.
Note:
RFC3629 declares that the U+D800 - U+DFFF codepoint range of the Basic Multilingual Plane is reserved and should not be used. In addition, the RFC also defines that the largest valid codepoint is U+10FFFF. Furthermore, the RFC explicitly specifies that the UTF-8 sequence that encodes a Unicode character must be the shortest possible sequence.
If you define BLC_RFC3629_CONFORMANCE before you include the <wchar.h> header, then this function will check for RFC conformance and will report an error if the multibyte sequence breaks it. If you specified 16-bit wchar_t by using the -fshort-wchar compiler switch, thus limiting yourself to the Basic Multilingual Plane, then only UTF-8 sequences that encode a codepoint no larger than U+FFFF will be accepted; all other sequences will be flagged as errors, even if they are otherwise valid and RFC compliant. If you use 32-bit wchar_t (the compiler default) and you do not define BLC_RFC3629_CONFORMANCE, then any sequence up to 6 bytes long will be accepted as long as the marker bits are correct in the bytes.
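
The decoding rules described for mbtowc() can be sketched as follows. This is a stand-in written for illustration, not the library's code: the name sketch_utf8_decode and the strict parameter (which mimics BLC_RFC3629_CONFORMANCE) are hypothetical, the result type is fixed at 32 bits so the 16-bit wchar_t limit is not modelled, and treating ml == 0 as an error is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the number of bytes consumed, 0 for a NULL or empty string,
 * or -1 on error. out may be NULL to discard the result. */
static int sketch_utf8_decode(uint32_t *out, const char *mb, size_t ml, int strict)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_cp[7] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    const unsigned char *s = (const unsigned char *)mb;
    uint32_t cp;
    int len, i;

    if (s == NULL) return 0;
    if (ml == 0)   return -1;                  /* assumption, see lead-in */
    if (s[0] == 0) { if (out) *out = 0; return 0; }

    /* The marker bits of the first byte give the sequence length. */
    if      (s[0] < 0x80)           { len = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else if ((s[0] & 0xFC) == 0xF8) { len = 5; cp = s[0] & 0x03; }
    else if ((s[0] & 0xFE) == 0xFC) { len = 6; cp = s[0] & 0x01; }
    else return -1;                            /* not a valid first byte */

    if ((size_t)len > ml) return -1;           /* longer than ml allows */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return -1;  /* bad continuation byte */
        cp = (cp << 6) | (s[i] & 0x3Fu);
    }
    if (strict) {
        if (cp < min_cp[len]) return -1;       /* longer than necessary */
        if (cp > 0x10FFFFu) return -1;         /* above U+10FFFF */
        if (cp >= 0xD800u && cp <= 0xDFFFu) return -1;  /* reserved range */
    }
    if (out) *out = cp;
    return len;
}
```

With strict set, the overlong pair 0xC0 0xAF is rejected; without it, the same bytes decode to U+002F, which matches the relaxed behaviour described in the note.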
int mbtowci ( wchar_t **  pp,
const char *  mb,
size_t  ml 
)

Convert a UTF-8 sequence to a wide character.

This function has the same semantics and behaviour as the mbtowc() function, except that instead of a pointer to the location where the wide char should be stored, a pointer to the pointer is passed to the function. The pointer in memory (i.e. the one that points to the location of the result) will be incremented so repeated calls to this function can iterate through a wchar_t array.

Parameters:
pp Pointer to a pointer to the location where the wide character result should be stored. If this argument is NULL, then the resulting wide character will be discarded and no pointer in memory is incremented. If this pointer is valid, but the pointer it points to is NULL, then the resulting wide character will be discarded but the pointer in memory will be incremented (and thus will no longer be NULL). If both the argument and the pointer identified by it are valid, then the result will be stored and the pointer in memory will be incremented. Note that the increment is performed even if the conversion fails.
mb Pointer to a character string that represents the multibyte sequence that should be decoded.
ml The maximum number of bytes to examine from the sequence.
Returns:
The number of bytes used from the string if the conversion was successful, -1 if a conversion error was encountered or 0 if the string pointer was NULL or if the first character of the string was 0.
Attention:
This is not a standard library function.
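
The iteration pattern that mbtowci() is meant for might look like the sketch below. Because mbtowci() is specific to this library, the example uses a hypothetical stand-in, sketch_mbtowci, built on the standard mbtowc() from stdlib.h so that it compiles on a hosted system; there the default "C" locale only guarantees plain ASCII input. The documented case of a NULL destination pointer that is still incremented is deliberately left out of the stand-in.

```c
#include <stddef.h>
#include <stdlib.h>   /* standard mbtowc(); this library declares it in wchar.h */
#include <wchar.h>

/* Hypothetical stand-in mirroring the documented pointer handling. */
static int sketch_mbtowci(wchar_t **pp, const char *mb, size_t ml)
{
    wchar_t wc = 0;
    int n = mbtowc(&wc, mb, ml);

    if (pp != NULL && *pp != NULL) {
        if (mb != NULL && n >= 0)
            **pp = wc;      /* nothing is stored when the conversion fails */
        ++*pp;              /* advanced even when the conversion fails */
    }
    return n;
}

/* Decodes at most cap characters from the len bytes at src into dst and
 * returns how many were decoded. (Hypothetical helper.) */
static size_t decode_all(wchar_t *dst, size_t cap, const char *src, size_t len)
{
    wchar_t *cur = dst;
    size_t count = 0;

    while (len > 0 && count < cap) {
        int n = sketch_mbtowci(&cur, src, len);
        if (n <= 0)
            break;          /* 0: terminating 0 byte, -1: malformed input */
        src += n;
        len -= (size_t)n;
        count++;
    }
    /* After a failure cur has still moved forward, so the count is kept
     * separately rather than computed as cur - dst. */
    return count;
}
```

Keeping a separate counter is a design choice that follows from the unconditional increment: the destination pointer alone does not say how many results are valid.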
int wcitomb ( char *  mb,
wchar_t **  pp 
)

Converts a wide character stored in memory to UTF-8 encoding.

The function has the same functionality and behaviour as the wctomb() function, except that its second parameter is not a wide character but a pointer to a pointer to wide chars. The function fetches the pointer, fetches the wchar pointed by it, converts it and then increments the pointer by 1 and stores it back. Thus, using this function you can iterate through a wchar_t array.

Parameters:
mb The pointer to the result buffer (or NULL). If the pointer is not NULL, then it must point to a buffer at least MB_CUR_MAX characters long.
pp Pointer to a pointer to wchar_t. The pointer in memory will be used to fetch the wide character to convert. The pointer in memory will be incremented to point to the next wide character. The increment is done even if the conversion of the character fails.
Return values:
-1 The conversion failed, either because the **pp character cannot be transformed to UTF-8, or because RFC3629 compliance was configured and the **pp character is not in the valid character range as per the RFC.
0 The **pp wide character was 0. The buffer was not modified.
>0 The value is the length of the resulting UTF-8 sequence. The sequence was written into the buffer (unless the buffer pointer was NULL, in which case the resulting multibyte sequence is discarded). The length of the sequence is never longer than MB_CUR_MAX.
Attention:
This is not a standard library function.
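
Converting a whole 0-terminated wchar_t array with wcitomb() could be sketched as below. The stand-in sketch_wcitomb is hypothetical and is built on the standard wctomb() from stdlib.h so that the example compiles on a hosted system (where the default "C" locale only guarantees plain ASCII). Using MB_LEN_MAX from limits.h for the scratch buffers is an assumption; with this library any size of at least MB_CUR_MAX bytes is what the text requires.

```c
#include <limits.h>   /* MB_LEN_MAX */
#include <stdlib.h>   /* standard wctomb(); this library declares it in wchar.h */
#include <string.h>   /* memcpy() */
#include <wchar.h>

/* Hypothetical stand-in mirroring the documented behaviour. */
static int sketch_wcitomb(char *mb, wchar_t **pp)
{
    char scratch[MB_LEN_MAX];
    wchar_t wc = **pp;

    ++*pp;                          /* advanced even if the conversion fails */
    if (wc == 0)
        return 0;                   /* documented: buffer is not modified */
    return wctomb(mb != NULL ? mb : scratch, wc);
}

/* Encodes the 0-terminated array src into dst (cap bytes, including the
 * terminating 0 byte) and returns the number of bytes written before it.
 * Stops early if a character does not fit or cannot be converted. */
static size_t encode_all(char *dst, size_t cap, wchar_t *src)
{
    char tmp[MB_LEN_MAX];
    size_t used = 0;

    if (cap == 0)
        return 0;
    for (;;) {
        int n = sketch_wcitomb(tmp, &src);
        if (n <= 0 || used + (size_t)n >= cap)
            break;
        memcpy(dst + used, tmp, (size_t)n);
        used += (size_t)n;
    }
    dst[used] = '\0';
    return used;
}
```

Each character is converted into a temporary buffer first, so that a sequence is only copied once it is known to fit into the remaining space.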
int wctomb ( char *  mb,
wchar_t  wc 
)

Converts a wide character to UTF-8 encoding.

The function converts a wide character to UTF-8 representation. If the conversion is valid, then the resulting character sequence is stored at the given location. The function returns the length of the UTF-8 sequence that corresponds to the given wide character. If the character to convert is (wchar_t) 0, then the function returns 0. If the character can not be converted, the return value is -1. If the result pointer is null, then the return value is still valid, but the resulting UTF-8 sequence is discarded.

Parameters:
mb The pointer to the result buffer (or NULL). If the pointer is not NULL, then it must point to a buffer at least MB_CUR_MAX characters long.
wc The wide character to convert
Return values:
-1 The conversion failed, either because the wc argument cannot be transformed to UTF-8, or because RFC3629 compliance was configured and the wc character is not in the valid character range as per the RFC.
0 The wc character was 0. The buffer was not modified.
>0 The value is the length of the resulting UTF-8 sequence. The sequence was written into the buffer (unless the buffer pointer was NULL, in which case the resulting multibyte sequence is discarded). The length of the sequence is never longer than MB_CUR_MAX.
Note:
RFC3629 declares that the U+D800 - U+DFFF codepoint range of the Basic Multilingual Plane is reserved and should not be used. Furthermore, if you configured 32-bit wide chars, any code value above U+10FFFF is invalid.
If you define BLC_RFC3629_CONFORMANCE before you include the <wchar.h> header, then these restrictions are enforced. Otherwise, any character between U+0000 and U+FFFF (for 16-bit wchar_t) or between U+0000 and U+7FFFFFFF (for 32-bit wchar_t) is accepted.
Attention:
If the mb argument is NULL, then the standard requires this function to return non-zero or 0 depending on whether the multibyte encoding is or is not state-dependent. Since UTF-8 is not state-dependent, a conforming function would return 0 when it receives a NULL pointer. This implementation deviates from the standard: even when it receives a NULL pointer, it returns the length of the resulting multibyte sequence, and the resulting UTF-8 sequence is discarded.
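
The encoding direction, including the length-only query with a NULL buffer, can be sketched as follows. As before this is a stand-in for illustration and not the library's code: sketch_utf8_encode and its strict parameter (mimicking BLC_RFC3629_CONFORMANCE) are hypothetical names, and the input is taken as a 32-bit value.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the length of the UTF-8 sequence for cp, 0 if cp is 0, or -1 if
 * cp cannot be converted. mb may be NULL to query the length only;
 * otherwise it must have room for at least 6 bytes. */
static int sketch_utf8_encode(char *mb, uint32_t cp, int strict)
{
    unsigned char lead;
    int len, i;

    if (cp == 0)
        return 0;                            /* buffer is left untouched */
    if (cp > 0x7FFFFFFFu)
        return -1;                           /* does not fit in 6 bytes */
    if (strict && (cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu)))
        return -1;                           /* outside the RFC3629 range */

    if      (cp < 0x80u)      { len = 1; lead = 0x00; }
    else if (cp < 0x800u)     { len = 2; lead = 0xC0; }
    else if (cp < 0x10000u)   { len = 3; lead = 0xE0; }
    else if (cp < 0x200000u)  { len = 4; lead = 0xF0; }
    else if (cp < 0x4000000u) { len = 5; lead = 0xF8; }
    else                      { len = 6; lead = 0xFC; }

    if (mb != NULL) {
        /* Continuation bytes carry 6 bits each and are filled last to first. */
        for (i = len - 1; i > 0; i--) {
            mb[i] = (char)(0x80u | (cp & 0x3Fu));
            cp >>= 6;
        }
        mb[0] = (char)(lead | cp);           /* remaining bits go in the lead byte */
    }
    return len;
}
```

Calling the sketch with a NULL buffer returns the sequence length without writing anything, which corresponds to the non-standard NULL behaviour described above.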
Generated on Tue Jul 13 16:51:45 2010 by  doxygen 1.6.3