MBRTOC16(3C) | Standard C Library Functions | MBRTOC16(3C) |
mbrtoc16, mbrtoc32, mbrtowc, mbrtowc_l - convert characters to wide
characters
#include <wchar.h>

size_t
mbrtowc(wchar_t *restrict pwc, const char *restrict str, size_t len,
    mbstate_t *restrict ps);
#include <wchar.h>
#include <xlocale.h>

size_t
mbrtowc_l(wchar_t *restrict pwc, const char *restrict str, size_t len,
    mbstate_t *restrict ps, locale_t loc);
#include <uchar.h>

size_t
mbrtoc16(char16_t *restrict p16c, const char *restrict str, size_t len,
    mbstate_t *restrict ps);

size_t
mbrtoc32(char32_t *restrict p32c, const char *restrict str, size_t len,
    mbstate_t *restrict ps);
The mbrtoc16(), mbrtoc32(), mbrtowc(), and mbrtowc_l() functions convert
character sequences, which may contain multi-byte characters, into
different character formats. The functions work in the following formats:
mbrtoc16()
    UTF-16, stored as one or two char16_t values per code point.
mbrtoc32()
    UTF-32, stored as a single char32_t value per code point.
mbrtowc(), mbrtowc_l()
    Wide characters, stored as a single wchar_t value per character.

The functions consume up to len bytes from the string str and accumulate
them in ps until a valid character is found. What counts as a valid
character depends on the LC_CTYPE category of the current locale. For
example, in the C locale, only ASCII characters are recognized, while in a
UTF-8 based locale like en_US.UTF-8, UTF-8 multi-byte character sequences
that represent Unicode code points are recognized. The mbrtowc_l()
function uses the locale passed in loc rather than the locale of the
current thread.
When a valid character sequence has been found, it is converted to either
a 16-bit character sequence for mbrtoc16() or a 32-bit character sequence
for mbrtoc32() and is stored in p16c or p32c respectively.
The ps argument represents a multi-byte conversion state which can be used
across multiple calls to a given function (but not mixed between
functions). This allows a character to be completed from a subsequent
buffer, e.g. a different value of str. The functions may be called from
multiple threads as long as they use unique values for ps. If ps is NULL,
then a function-specific buffer will be used for the conversion state;
however, this buffer is shared between all threads and its use is not
recommended.
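The following is a minimal sketch of carrying a conversion state across two buffers. It assumes the caller has already selected a UTF-8 locale with setlocale(), and that the first buffer ends in the middle of a character, in which case the C standard return value (size_t)-2 is expected. The helper name decode_split() is hypothetical and is not part of the library.

```c
#include <assert.h>
#include <locale.h>	/* setlocale(), called by the program beforehand */
#include <string.h>
#include <uchar.h>

/*
 * Hypothetical helper: decode one character whose bytes arrive in two
 * separate buffers. Returns the total number of bytes consumed, or
 * (size_t)-1 if either call reports something unexpected.
 */
size_t
decode_split(char32_t *out, const char *part1, size_t len1,
    const char *part2, size_t len2)
{
	mbstate_t mbs;
	size_t ret;

	assert(out != NULL && part1 != NULL && part2 != NULL);

	/* A zeroed mbstate_t is the initial conversion state. */
	(void) memset(&mbs, 0, sizeof (mbs));

	/* The first buffer ends mid-character, so expect (size_t)-2. */
	ret = mbrtoc32(out, part1, len1, &mbs);
	if (ret != (size_t)-2)
		return ((size_t)-1);

	/* Passing the same state lets the second call finish the character. */
	ret = mbrtoc32(out, part2, len2, &mbs);
	if (ret == (size_t)-1 || ret == (size_t)-2 || ret == (size_t)-3)
		return ((size_t)-1);

	return (len1 + ret);
}
```

Because the state lives in a local variable, each caller gets its own mbstate_t and nothing is shared between threads.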
When using these functions, more than one character may be output for a given set of consumed input characters. An example of this is a code point that is represented as a surrogate pair in UTF-16, which requires two 16-bit characters. When this occurs, the call that consumes the input stores the first character and returns the number of bytes consumed; the following call stores the remaining character, consumes no input, and returns the special return value (size_t)-3.
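As an illustration of handling the -3 return value in a loop, the sketch below converts a whole NUL-terminated string into UTF-16 code units. It assumes a UTF-8 locale has already been selected by the caller; the helper name mbs_to_utf16() and its error convention are hypothetical. The terminating NUL is included in the length so that a return value of 0 marks the end of the string.

```c
#include <assert.h>
#include <locale.h>	/* setlocale(), called by the program beforehand */
#include <string.h>
#include <uchar.h>

/*
 * Hypothetical helper: convert the NUL-terminated multi-byte string src
 * into at most cap UTF-16 code units in dst. Returns the number of code
 * units stored, or (size_t)-1 if the input is invalid or dst is too small.
 */
size_t
mbs_to_utf16(char16_t *dst, size_t cap, const char *src)
{
	mbstate_t mbs;
	size_t remaining, count = 0;

	assert(dst != NULL && src != NULL);
	(void) memset(&mbs, 0, sizeof (mbs));
	remaining = strlen(src) + 1;	/* include the terminating NUL */

	for (;;) {
		char16_t unit;
		size_t ret = mbrtoc16(&unit, src, remaining, &mbs);

		if (ret == 0)		/* reached the terminating NUL */
			break;
		if (ret == (size_t)-1 || ret == (size_t)-2)
			return ((size_t)-1);
		if (count == cap)	/* no room left in dst */
			return ((size_t)-1);
		dst[count++] = unit;

		/* On -3 the unit came from the state; no input was consumed. */
		if (ret != (size_t)-3) {
			src += ret;
			remaining -= ret;
		}
	}
	return (count);
}
```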
The functions all have a special behavior when NULL is passed for str.
They instead treat the call as though pwc, p16c, or p32c were NULL, str
had been passed as the empty string, "", and the length, len, were 1. In
other words, the functions behave as if called as:

    mbrtowc(NULL, "", 1, ps)
    mbrtowc_l(NULL, "", 1, ps, loc)
    mbrtoc16(NULL, "", 1, ps)
    mbrtoc32(NULL, "", 1, ps)
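One use of this form is returning a state to the initial conversion state, since converting the null character leaves the state initial. The sketch below assumes the state is not in the middle of a character when it is reset (otherwise the call may fail), and the helper name reset_state() is hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/*
 * Hypothetical helper: pass NULL for str so the call is treated as
 * mbrtoc32(NULL, "", 1, ps). Returns 1 if mbsinit() then reports the
 * state as initial, and 0 otherwise.
 */
int
reset_state(mbstate_t *ps)
{
	assert(ps != NULL);

	/* The output pointer and the length are ignored when str is NULL. */
	(void) mbrtoc32(NULL, NULL, 0, ps);

	return (mbsinit(ps) != 0);
}
```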
Regardless of the locale, the characters returned will be encoded as though the code point were the corresponding value in Unicode. This means that if a character in the locale corresponds to a code point that requires a surrogate pair in the UTF-16 encoding, it will still be returned as UTF-16 surrogates.
This behavior of the mbrtoc16() and mbrtoc32() functions should not be
relied upon, is not portable, and is subject to change for non-Unicode
locales.
The mbrtoc16(), mbrtoc32(), mbrtowc(), and mbrtowc_l() functions return
the following values:
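0
    len or fewer bytes of str were consumed and the null wide character
    was written into the wide character buffer (pwc, p16c, p32c).
between 1 and len
    The specified number of bytes were consumed and a single character
    was written into the wide character buffer (pwc, p16c, p32c).
(size_t)-1
    An encoding error has occurred: the next len bytes of str do not
    contribute to a valid character. errno has been set to EILSEQ. No
    data was written into the wide character buffer (pwc, p16c, p32c).
(size_t)-2
    len bytes of str were consumed, but a complete multi-byte character
    sequence has not been found. No data was written into the wide
    character buffer (pwc, p16c, p32c).
(size_t)-3
    A character has been written into the wide character buffer (p16c,
    p32c). This character was left over from a previous call (such as
    the second half of a UTF-16 surrogate pair) and no input was
    consumed. This value is only returned by the mbrtoc16() and
    mbrtoc32() functions.

Example 1 Using the mbrtoc32() function to convert a multibyte string.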
    #include <locale.h>
    #include <stdlib.h>
    #include <string.h>
    #include <err.h>
    #include <stdio.h>
    #include <uchar.h>

    int
    main(void)
    {
            mbstate_t mbs;
            char32_t out;
            size_t ret;
            /* UTF-8 encoding of U+5149 */
            const char *uchar_str = "\xe5\x85\x89";

            /* Start from the initial conversion state. */
            (void) memset(&mbs, 0, sizeof (mbs));
            (void) setlocale(LC_CTYPE, "en_US.UTF-8");
            ret = mbrtoc32(&out, uchar_str, strlen(uchar_str), &mbs);
            if (ret != strlen(uchar_str)) {
                    errx(EXIT_FAILURE, "failed to convert string, got %zd",
                        ret);
            }

            (void) printf("Converted %zu bytes into UTF-32 character "
                "0x%x\n", ret, out);
            return (0);
    }
When compiled and run, this produces:
    $ ./a.out
    Converted 3 bytes into UTF-32 character 0x5149
Example 2 Handling surrogate pairs from the mbrtoc16() function.
    #include <locale.h>
    #include <stdlib.h>
    #include <string.h>
    #include <err.h>
    #include <stdio.h>
    #include <uchar.h>

    int
    main(void)
    {
            mbstate_t mbs;
            char16_t first, second;
            size_t ret;
            /* UTF-8 encoding of U+1F4A9, which needs a surrogate pair */
            const char *uchar_str = "\xf0\x9f\x92\xa9";

            (void) memset(&mbs, 0, sizeof (mbs));
            (void) setlocale(LC_CTYPE, "en_US.UTF-8");

            /* The first call consumes all input and stores one surrogate. */
            ret = mbrtoc16(&first, uchar_str, strlen(uchar_str), &mbs);
            if (ret != strlen(uchar_str)) {
                    errx(EXIT_FAILURE, "failed to convert string, got %zd",
                        ret);
            }

            /* The second call consumes nothing and stores the other one. */
            ret = mbrtoc16(&second, "", 0, &mbs);
            if (ret != (size_t)-3) {
                    errx(EXIT_FAILURE, "didn't get second surrogate pair, "
                        "got %zd", ret);
            }

            (void) printf("UTF-16 surrogates: 0x%x 0x%x\n", first, second);
            return (0);
    }
When compiled and run, this produces:
    $ ./a.out
    UTF-16 surrogates: 0xd83d 0xdca9
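The examples above each convert a single character. As a further sketch, the helper below walks a whole buffer with mbrtoc32() and reacts to each class of return value. The name count_chars() and the choice to treat a null character as the end of the input are assumptions made for illustration only, and the caller is expected to have selected the desired locale already.

```c
#include <assert.h>
#include <locale.h>	/* setlocale(), called by the program beforehand */
#include <string.h>
#include <uchar.h>

/*
 * Hypothetical helper: count the characters in the first len bytes of
 * str. Returns the count, or (size_t)-1 if the buffer contains an
 * invalid sequence or ends in the middle of a character.
 */
size_t
count_chars(const char *str, size_t len)
{
	mbstate_t mbs;
	size_t count = 0;

	assert(str != NULL);
	(void) memset(&mbs, 0, sizeof (mbs));

	while (len > 0) {
		char32_t c;
		size_t ret = mbrtoc32(&c, str, len, &mbs);

		if (ret == (size_t)-1)	/* invalid sequence, errno is EILSEQ */
			return ((size_t)-1);
		if (ret == (size_t)-2)	/* input ended mid-character */
			return ((size_t)-1);
		if (ret == 0)		/* null character: stop (assumption) */
			break;

		count++;
		if (ret == (size_t)-3)	/* c came from the state, no input used */
			continue;

		str += ret;
		len -= ret;
	}
	return (count);
}
```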
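The mbrtoc16(), mbrtoc32(), mbrtowc(), and mbrtowc_l() functions will fail
if:

EINVAL
    The conversion state in ps is invalid.
EILSEQ
    An invalid character sequence has been detected.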
The mbrtoc16(), mbrtoc32(), mbrtowc(), and mbrtowc_l() functions are
MT-Safe as long as different mbstate_t structures are passed in ps. If ps
is NULL or different threads use the same value for ps, then the functions
are Unsafe.
September 20, 2021 | OmniOS |