These functions convert between wide chars (16 or 32 bit) and their UTF-8 representation.
Defines | |
| #define | MB_CUR_MAX _mb_cur_max() |
| Maximum UTF-8 sequence length. | |
| #define | WEOF ((wchar_t) -1) |
| End-of-file marker for wchar_t streams. | |
Functions | |
| int | mbtowc (wchar_t *wc, const char *mb, size_t ml) |
| Convert a UTF-8 sequence to a wide character. | |
| int | mbtowci (wchar_t **pp, const char *mb, size_t ml) |
| Convert a UTF-8 sequence to a wide character. | |
| int | wctomb (char *mb, wchar_t wc) |
| Converts a wide character to UTF-8 encoding. | |
| int | wcitomb (char *mb, wchar_t **pp) |
| Converts a wide character stored in memory to UTF-8 encoding. | |
Whether the wide character is 16 or 32 bits wide depends on whether you pass the -fshort-wchar switch to gcc.
In its original form, with sequences of up to 6 bytes, UTF-8 can encode any value between 0x00000000 and 0x7FFFFFFF. RFC3629 specifies that the valid range for a Unicode character is 0x00000000 to 0x0010FFFF and that the range 0x0000D800 to 0x0000DFFF must not be used. The RFC also specifies that UTF-8 sequences that are longer than necessary should be treated as erroneous.
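The RFC3629 rules described above can be sketched as a single check. This is only an illustration: `sketch_rfc3629_ok` is a hypothetical name and is not part of the library, whose internal checks may be organised differently.

```c
#include <stdint.h>

/* Hypothetical check, not part of the library: returns 1 if a code point
 * that was decoded from a sequence of seq_len bytes satisfies the RFC3629
 * rules, and 0 otherwise. */
static int sketch_rfc3629_ok(uint32_t cp, int seq_len)
{
    int min_len;

    if (cp > 0x10FFFFu)                 return 0; /* above the Unicode range */
    if (cp >= 0xD800u && cp <= 0xDFFFu) return 0; /* reserved surrogate range */

    /* Shortest possible encoding for this code point. */
    if (cp < 0x80u)         min_len = 1;
    else if (cp < 0x800u)   min_len = 2;
    else if (cp < 0x10000u) min_len = 3;
    else                    min_len = 4;

    return seq_len == min_len;  /* anything longer is an overlong sequence */
}
```

For example, the letter 'A' (0x41) passes when it came from a 1-byte sequence but is rejected when it was padded out to 2 bytes.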
You can configure the library to enforce RFC3629 conformance by defining the preprocessor symbol BLC_RFC3629_CONFORMANCE before you include the wchar.h header file.
The functions declared by the header are actually macros, which select the underlying function depending on the size of the wide character type and whether you specified RFC3629 conformance or not.
To use these functions you have to include wchar.h.
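A minimal configuration fragment for strict checking might look like the following. It assumes that defining the symbol without a value is sufficient, which the text suggests but does not state explicitly.

```c
/* Request strict RFC3629 checking; the define must precede the include. */
#define BLC_RFC3629_CONFORMANCE
#include <wchar.h>
```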
| #define MB_CUR_MAX _mb_cur_max() |
Maximum UTF-8 sequence length.
The maximum number of bytes in the UTF-8 expansion of a wchar_t object. If wchar_t is 16 bits wide, it is 3. If wchar_t is 32 bits wide and strict RFC conformance is not specified, it is 6. For 32-bit wchar_t with strict RFC conformance, it is 4.
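These three maxima follow from the number of payload bits that each sequence length carries. The sketch below shows the relationship; `sketch_utf8_len` is a hypothetical helper and is not how the library's _mb_cur_max() is necessarily implemented.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the library: number of bytes that the
 * original (pre-RFC3629) UTF-8 scheme needs for a code point.
 * Returns 0 for values that do not fit into 31 bits. */
static int sketch_utf8_len(uint32_t cp)
{
    if (cp < 0x80u)       return 1;  /*  7 payload bits */
    if (cp < 0x800u)      return 2;  /* 11 payload bits */
    if (cp < 0x10000u)    return 3;  /* 16 payload bits */
    if (cp < 0x200000u)   return 4;  /* 21 payload bits */
    if (cp < 0x4000000u)  return 5;  /* 26 payload bits */
    if (cp < 0x80000000u) return 6;  /* 31 payload bits */
    return 0;
}
```

The largest 16-bit value, 0xFFFF, needs 3 bytes; the largest RFC3629 code point, 0x10FFFF, needs 4; larger 32-bit values can need up to 6.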
| int mbtowc | ( | wchar_t * | wc, | |
| const char * | mb, | |||
| size_t | ml | |||
| ) |
Convert a UTF-8 sequence to a wide character.
If the character string pointer is not NULL, the function examines the multibyte string mb. If it is an empty string (i.e. its first character is 0), the function returns 0. Otherwise, it examines no more than ml bytes of the string and checks whether they form a properly UTF-8 encoded wide character. If they do, the sequence is decoded and the wide character is stored at the location pointed to by wc. The return value is the length of the UTF-8 sequence that was decoded. If the sequence is in error or longer than the specified length, -1 is returned and no conversion takes place.
| wc | Pointer to the location where the wide character result should be stored, or NULL, if the result shall not be stored. | |
| mb | Pointer to a character string that represents the multibyte sequence that should be decoded. | |
| ml | The maximum number of bytes to examine from the sequence. |
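The decoding behaviour described above can be sketched as follows. This is not the library's code: `sketch_decode` is a hypothetical function that stores into a uint32_t, accepts any sequence of up to 6 bytes with correct marker bits (the non-conformance behaviour), and makes two assumptions that the text does not settle: a NULL mb returns 0, and ml equal to 0 returns -1.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder mirroring the documented return convention:
 * 0 for an empty string, the sequence length on success, and -1 for a
 * malformed sequence or one longer than ml bytes. */
static int sketch_decode(uint32_t *out, const char *mb, size_t ml)
{
    const unsigned char *p = (const unsigned char *)mb;
    uint32_t cp;
    int len, i;

    if (p == NULL) return 0;   /* assumption: treated like an empty string */
    if (ml == 0)   return -1;  /* assumption: no byte may be examined */
    if (p[0] == 0) { if (out) *out = 0; return 0; }

    /* The lead byte determines the length and the first payload bits. */
    if (p[0] < 0x80u)                 { cp = p[0];         len = 1; }
    else if ((p[0] & 0xE0u) == 0xC0u) { cp = p[0] & 0x1Fu; len = 2; }
    else if ((p[0] & 0xF0u) == 0xE0u) { cp = p[0] & 0x0Fu; len = 3; }
    else if ((p[0] & 0xF8u) == 0xF0u) { cp = p[0] & 0x07u; len = 4; }
    else if ((p[0] & 0xFCu) == 0xF8u) { cp = p[0] & 0x03u; len = 5; }
    else if ((p[0] & 0xFEu) == 0xFCu) { cp = p[0] & 0x01u; len = 6; }
    else return -1;            /* stray continuation byte, 0xFE or 0xFF */

    if ((size_t)len > ml) return -1;  /* sequence longer than allowed */

    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0u) != 0x80u) return -1;  /* bad continuation marker */
        cp = (cp << 6) | (p[i] & 0x3Fu);         /* append 6 payload bits */
    }
    if (out) *out = cp;
    return len;
}
```

Decoding the three bytes E2 82 AC with ml set to 3 yields U+20AC and a return value of 3; the same bytes with ml set to 2 yield -1.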
| 0 | The mb pointer was NULL or the first character of the string was 0. If the pointer was not NULL but the first byte of the string was 0 and the wc pointer was not NULL, then a 0 wide character is stored at the location wc points to. |
If you select 16-bit wchar_t by using the -fshort-wchar compiler switch, thus limiting yourself to the Basic Multilingual Plane, then only UTF-8 sequences that encode a code point no larger than U+FFFF are accepted; all other sequences are flagged as errors, even if they are otherwise valid and RFC compliant. If you use 32-bit wchar_t (the compiler default) and you do not define BLC_RFC3629_CONFORMANCE, then any sequence up to 6 bytes long is accepted as long as the marker bits in its bytes are correct.
| int mbtowci | ( | wchar_t ** | pp, | |
| const char * | mb, | |||
| size_t | ml | |||
| ) |
Convert a UTF-8 sequence to a wide character.
This function has the same semantics and behaviour as the mbtowc() function, except that instead of a pointer to the location where the wide char should be stored, a pointer to the pointer is passed to the function. The pointer in memory (i.e. the one that points to the location of the result) will be incremented so repeated calls to this function can iterate through a wchar_t array.
| pp | Pointer to a pointer to the location where the wide character result should be stored. If this argument is NULL, then the resulting wide character is discarded and no pointer in memory is incremented. If this pointer is valid but the pointer it points to is NULL, then the resulting wide character is discarded but the pointer in memory is incremented (and thus is no longer NULL). If both the argument and the pointer identified by it are valid, then the result is stored and the pointer in memory is incremented. Note that the increment is performed even if the conversion fails. | |
| mb | Pointer to a character string that represents the multibyte sequence that should be decoded. | |
| ml | The maximum number of bytes to examine from the sequence. |
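The pointer handling described above can be sketched with a small stand-in. All names here are hypothetical, the stand-in decoder only handles sequences of 1 to 3 bytes to keep the example short, and the documented case of a valid pp whose target pointer is NULL is left out because incrementing a null pointer is not portable C.

```c
#include <stddef.h>
#include <stdint.h>

/* Compact stand-in decoder for sequences of 1 to 3 bytes (hypothetical). */
static int sketch_decode3(uint32_t *out, const unsigned char *p, size_t ml)
{
    if (ml == 0) return -1;
    if (p[0] == 0) { *out = 0; return 0; }
    if (p[0] < 0x80u) { *out = p[0]; return 1; }
    if ((p[0] & 0xE0u) == 0xC0u && ml >= 2 && (p[1] & 0xC0u) == 0x80u) {
        *out = ((uint32_t)(p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
        return 2;
    }
    if ((p[0] & 0xF0u) == 0xE0u && ml >= 3 &&
        (p[1] & 0xC0u) == 0x80u && (p[2] & 0xC0u) == 0x80u) {
        *out = ((uint32_t)(p[0] & 0x0Fu) << 12) |
               ((uint32_t)(p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
        return 3;
    }
    return -1;
}

/* Hypothetical wrapper: store through *pp unless decoding failed, and
 * advance *pp in either case. */
static int sketch_decode_iter(uint32_t **pp, const char *mb, size_t ml)
{
    uint32_t cp = 0;
    int r = sketch_decode3(&cp, (const unsigned char *)mb, ml);

    if (pp != NULL && *pp != NULL) {
        if (r >= 0) **pp = cp;
        (*pp)++;               /* advanced even when r is -1 */
    }
    return r;
}

/* Decode a 0-terminated UTF-8 string of at most n bytes (terminator
 * included) into dst, which must have room for the terminator as well.
 * Returns the number of characters before the terminator, or -1. */
static int sketch_decode_all(uint32_t *dst, const char *src, size_t n)
{
    uint32_t *out = dst;
    int r;

    while ((r = sketch_decode_iter(&out, src, n)) > 0) {
        src += r;
        n -= (size_t)r;
    }
    return (r < 0) ? -1 : (int)(out - dst) - 1;
}
```

Because the output pointer advances on every call, the loop needs no index of its own; the final call stores the terminating 0 and returns 0, which ends the loop.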
| int wcitomb | ( | char * | mb, | |
| wchar_t ** | pp | |||
| ) |
Converts a wide character stored in memory to UTF-8 encoding.
The function has the same functionality and behaviour as the wctomb() function, except that its second parameter is not a wide character but a pointer to a pointer to wide chars. The function fetches the pointer, fetches the wide character it points to, converts it, then increments the pointer by 1 and stores it back. Thus, using this function you can iterate through a wchar_t array.
| mb | The pointer to the result buffer (or NULL). If the pointer is not NULL, then it must point to a buffer at least MB_CUR_MAX characters long. | |
| pp | Pointer to a pointer to wchar_t. The pointer in memory will be used to fetch the wide character to convert. The pointer in memory will be incremented to point to the next wide character. The increment is done even if the conversion of the character fails. |
| -1 | The conversion failed, either because the **pp argument cannot be transformed to UTF-8, or because RFC3629 compliance was configured and the **pp character is not in the valid character range as per the RFC. | |
| 0 | The **pp wide character was 0. The buffer was not modified. | |
| >0 | The value is the length of the resulting UTF-8 sequence. The sequence was written into the buffer (unless the buffer pointer was NULL, in which case the resulting multibyte sequence is discarded). The length of the sequence is never longer than MB_CUR_MAX. |
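The fetch, convert and increment cycle described above can be sketched as follows. The names are hypothetical, the stand-in encoder only covers values up to 0xFFFF to keep the example short, and the terminator is written by the caller because a return value of 0 leaves the buffer unmodified.

```c
#include <stddef.h>
#include <stdint.h>

/* Compact stand-in encoder for values up to 0xFFFF (hypothetical). */
static int sketch_encode3(unsigned char *mb, uint32_t wc)
{
    if (wc == 0) return 0;                 /* buffer left untouched */
    if (wc < 0x80u) { mb[0] = (unsigned char)wc; return 1; }
    if (wc < 0x800u) {
        mb[0] = (unsigned char)(0xC0u | (wc >> 6));
        mb[1] = (unsigned char)(0x80u | (wc & 0x3Fu));
        return 2;
    }
    if (wc < 0x10000u) {
        mb[0] = (unsigned char)(0xE0u | (wc >> 12));
        mb[1] = (unsigned char)(0x80u | ((wc >> 6) & 0x3Fu));
        mb[2] = (unsigned char)(0x80u | (wc & 0x3Fu));
        return 3;
    }
    return -1;
}

/* Hypothetical wrapper: fetch through *pp, advance *pp, then convert. */
static int sketch_encode_iter(unsigned char *mb, const uint32_t **pp)
{
    uint32_t wc = **pp;
    (*pp)++;                   /* advanced even if the conversion fails */
    return sketch_encode3(mb, wc);
}

/* Encode a 0-terminated array into dst, which must have room for 3 bytes
 * per character plus the terminator. Returns the number of bytes written
 * before the terminator, or -1 on the first failure. */
static int sketch_encode_all(unsigned char *dst, const uint32_t *src)
{
    int total = 0, r;

    while ((r = sketch_encode_iter(dst + total, &src)) > 0)
        total += r;
    if (r < 0) return -1;
    dst[total] = 0;            /* the 0 character itself wrote nothing */
    return total;
}
```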
| int wctomb | ( | char * | mb, | |
| wchar_t | wc | |||
| ) |
Converts a wide character to UTF-8 encoding.
The function converts a wide character to its UTF-8 representation. If the conversion is valid, the resulting character sequence is stored at the given location. The function returns the length of the UTF-8 sequence that corresponds to the given wide character. If the character to convert is (wchar_t) 0, the function returns 0. If the character cannot be converted, the return value is -1. If the result pointer is NULL, the return value is still valid, but the resulting UTF-8 sequence is discarded.
| mb | The pointer to the result buffer (or NULL). If the pointer is not NULL, then it must point to a buffer at least MB_CUR_MAX characters long. | |
| wc | The wide character to convert. |
| -1 | The conversion failed, either because the wc argument cannot be transformed to UTF-8, or because RFC3629 compliance was configured and the wc character is not in the valid character range as per the RFC. | |
| 0 | The wc character was 0. The buffer was not modified. | |
| >0 | The value is the length of the resulting UTF-8 sequence. The sequence was written into the buffer (unless the buffer pointer was NULL, in which case the resulting multibyte sequence is discarded). The length of the sequence is never longer than MB_CUR_MAX. |
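The return values listed above can be illustrated with a small encoder. This is a sketch, not the library's code: `sketch_encode` is a hypothetical name, it takes a uint32_t rather than a wchar_t, and it models the non-conformance behaviour in which sequences of up to 6 bytes are produced.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoder mirroring the documented return convention:
 * 0 for the zero character, -1 if the value cannot be encoded, otherwise
 * the sequence length. A NULL buffer discards the bytes but the length
 * is still returned. */
static int sketch_encode(char *mb, uint32_t wc)
{
    /* Lead-byte marker bits, indexed by sequence length. */
    static const unsigned char lead[7] =
        { 0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    unsigned char *u = (unsigned char *)mb;
    int len, i;

    if (wc == 0) return 0;                 /* buffer left untouched */
    if (wc < 0x80u)            len = 1;
    else if (wc < 0x800u)      len = 2;
    else if (wc < 0x10000u)    len = 3;
    else if (wc < 0x200000u)   len = 4;
    else if (wc < 0x4000000u)  len = 5;
    else if (wc < 0x80000000u) len = 6;
    else return -1;                        /* does not fit into 31 bits */

    if (u != NULL) {
        /* Fill continuation bytes from the end, 6 payload bits each. */
        for (i = len - 1; i > 0; i--) {
            u[i] = (unsigned char)(0x80u | (wc & 0x3Fu));
            wc >>= 6;
        }
        u[0] = (unsigned char)(lead[len] | wc);
    }
    return len;
}
```

Passing NULL as the buffer is a convenient way to measure how many bytes a character needs before allocating space for it.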
If the s argument (named mb here) is NULL, then the standard requires this function to return non-0 or 0 depending on whether the multibyte encoding does or does not have a state-dependent encoding. Since UTF-8 has none, the function should return 0 if it receives a NULL pointer. This implementation breaks the standard in that even if it receives a NULL pointer, it returns the length of the resulting multibyte sequence; the resulting UTF-8 sequence is discarded.