mblen
mbstowcs
mbtowc
wcstombs
wctomb
There are two kinds of DBCS sequences, mixed and pure (in
Standard terminology, multibyte and wide; this discussion
uses DBCS terms). Mixed sequences may contain both single- and
double-byte characters, while pure sequences contain only double-byte
characters.
Mixed DBCS Sequences
Several methods exist for handling mixed DBCS sequences. For example,
an encoding scheme may set aside a subrange of values to signal
multibyte sequences. Another popular encoding scheme sets aside a
single byte value to indicate a shift out from a normal
interpretation of character codes to an alternate interpretation, where
groups of bytes represent certain characters. This method is referred
to as shift-out/shift-in encoding and is the method the SAS/C
Compiler uses to handle multibyte sequences. This encoding scheme uses
shift states, which indicate how a byte value or set of byte
values will be interpreted. The SAS/C Compiler uses shift-out/shift-in
encoding because it is the DBCS encoding defined for the EBCDIC character set.
A mixed DBCS sequence must follow these rules:
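Shift-out (SO) switches to the double-byte state and shift-in (SI) returns to the single-byte state. The value for SO is \x0E and the value for SI is \x0F. For example, the following is a mixed DBCS string in hex:
\x81\x82\x83\x0E\x41\x52\x0F\x81
The \x41\x52 between the \x0E and \x0F is a double-byte character. The other characters are single-byte.
Single-byte characters use their EBCDIC values. For example, 'a' is \x81 in hex.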
In the double-byte state, each character is represented by 2 bytes. Double-byte characters must conform to the following constraints:
The first byte must be between \x41 and \xFE, except for the encoding of the blank space.
The second byte must be between \x41 and \xFE, except for the encoding of the blank space.
The encoding of the blank space is \x40\x40.
An SO may not be followed immediately by an SI (\x0E\x0F). For example, the following sequence (in hex), which might be construed as a single multibyte character, is not valid:
\x0E\x0F\x0E\x0F\x0E\x0F\x0E\x41\x81\x0F
This restriction is imposed because, without it, the number of bytes used to represent a multibyte character would be unbounded in theory, whereas the Standard requires an implementation to define a maximum byte length for a multibyte character.
On the other hand, consecutive SI/SO pairs (\x0F\x0E) are permitted because they may result from string concatenation. For example, the following sequence (in hex) is valid:
\x0E\x41\x81\x0F\x0E\x41\x83\x0F
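The rules above can be sketched as a small validator. This is only an illustration, not the SAS/C Library's implementation: the name is_valid_mixed is hypothetical, the sequence is assumed to be null-terminated and to end in the single-byte state, and both bytes of a double-byte character are assumed to lie in \x41 through \xFE (or to form the blank \x40\x40).

```c
#include <stddef.h>

#define SO 0x0E   /* shift out: enter the double-byte state    */
#define SI 0x0F   /* shift in: return to the single-byte state */

/* Nonzero if b may appear in a double-byte character (assumed range). */
static int dbcs_byte_ok(unsigned char b)
{
    return b >= 0x41 && b <= 0xFE;
}

/* Hypothetical helper: returns 1 if the null-terminated sequence s
   obeys the mixed DBCS rules described above, and 0 otherwise.      */
int is_valid_mixed(const unsigned char *s)
{
    int dbl = 0;                       /* 0 = single-byte state */

    while (*s != 0x00) {
        if (!dbl) {
            if (*s == SI)
                return 0;              /* SI without a preceding SO */
            if (*s == SO) {
                if (s[1] == SI)
                    return 0;          /* SO followed directly by SI */
                dbl = 1;
            }
            s++;
        } else if (*s == SI) {
            dbl = 0;
            s++;
        } else {
            /* A double-byte character needs two acceptable bytes. */
            int blank = (s[0] == 0x40 && s[1] == 0x40);
            if (!blank && !(dbcs_byte_ok(s[0]) && dbcs_byte_ok(s[1])))
                return 0;
            s += 2;
        }
    }
    return !dbl;                       /* must end in the single-byte state */
}
```

Applied to the three hex sequences shown in this section, the sketch accepts the first and third and rejects the one that begins with \x0E\x0F.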
Pure DBCS Sequences
The wide character type, wchar_t, is implementation-defined as an integer type capable of representing all the codes for the largest character set in locales supported by the implementation. wchar_t is implemented by the SAS/C Library in <stddef.h> as follows:
typedef unsigned short wchar_t;
A pure DBCS sequence is stored as an array of wchar_t elements. When a mixed character sequence contains characters that require only a single byte, these characters are converted to wchar_t, but their values are unchanged. For example, the mixed string "abc" is represented as follows:
\x81\x82\x83\x00
When converted to a pure DBCS sequence, the string becomes the following:
\x00\x81\x00\x82\x00\x83\x00\x00
Use the mbtowc function to convert one multibyte character to a double-byte character. Use the mbstowcs function to convert a sequence of multibyte characters to a double-byte sequence. Note that this function assumes the sequence is terminated by the null character, \x00. You can also use regular string-handling functions with mixed DBCS sequences. For example, you can use strlen to determine the byte length of a sequence, as long as the sequence is null-terminated.
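To make the mixed-to-pure conversion concrete, here is a minimal hand-written sketch. It is not how mbstowcs is implemented; the names dbcs_wchar and mixed_to_pure are hypothetical, and the wide value of a double-byte character is assumed to be its first byte in the high-order half and its second byte in the low-order half.

```c
#include <stddef.h>

typedef unsigned short dbcs_wchar;   /* stand-in for the library's wchar_t */

#define SO 0x0E
#define SI 0x0F

/* Hypothetical sketch: convert the null-terminated mixed sequence s into
   wide elements in out (capacity n elements, including the terminating
   zero). Returns the number of elements stored, excluding the terminator,
   or (size_t)-1 if the sequence is malformed or does not fit.            */
size_t mixed_to_pure(dbcs_wchar *out, const unsigned char *s, size_t n)
{
    size_t count = 0;
    int dbl = 0;                      /* 0 = single-byte state */

    while (*s != 0x00) {
        unsigned char b = *s;

        if (!dbl && b == SO) { dbl = 1; s++; continue; }
        if (dbl && b == SI)  { dbl = 0; s++; continue; }
        if (b == SO || b == SI)
            return (size_t)-1;        /* shift code in the wrong state */
        if (count + 1 >= n)
            return (size_t)-1;        /* keep room for the terminator */

        if (dbl) {
            if (s[1] == 0x00)
                return (size_t)-1;    /* truncated double-byte character */
            out[count++] = (dbcs_wchar)((s[0] << 8) | s[1]);
            s += 2;
        } else {
            out[count++] = b;         /* single-byte value is unchanged */
            s++;
        }
    }
    if (dbl || n == 0)
        return (size_t)-1;            /* must end in the single-byte state */
    out[count] = 0;
    return count;
}
```

The shift codes themselves never appear in the output; they only select how the following bytes are grouped.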
When converting from pure to mixed, SO/SI pairs are added to the sequence as necessary. Use the wctomb function to convert one double-byte character to a multibyte character. Use the wcstombs function to convert a sequence of double-byte characters to a multibyte sequence. Note that this function assumes the sequence is terminated by the null wide character, \x00\x00.
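The reverse direction can be sketched the same way. Again this is an illustration rather than the library's wcstombs: dbcs_wchar and pure_to_mixed are hypothetical names, and any wide value above \xFF is assumed to be a double-byte character.

```c
#include <stddef.h>

typedef unsigned short dbcs_wchar;   /* stand-in for the library's wchar_t */

/* Hypothetical sketch: convert the zero-terminated pure sequence w into a
   mixed sequence in out (capacity n bytes, including the terminating
   null), inserting SO and SI where the character width changes. Returns
   the number of bytes stored, excluding the terminator, or (size_t)-1 if
   the result does not fit.                                               */
size_t pure_to_mixed(unsigned char *out, const dbcs_wchar *w, size_t n)
{
    size_t len = 0;
    int dbl = 0;                             /* 0 = single-byte state */

    for (; *w != 0; w++) {
        int want_dbl = (*w > 0xFF);
        size_t need = (want_dbl != dbl ? 1 : 0) + (want_dbl ? 2 : 1);

        if (len + need >= n)
            return (size_t)-1;               /* keep room for the null */
        if (want_dbl != dbl) {
            out[len++] = want_dbl ? 0x0E : 0x0F;   /* SO or SI */
            dbl = want_dbl;
        }
        if (want_dbl)
            out[len++] = (unsigned char)(*w >> 8);
        out[len++] = (unsigned char)(*w & 0xFF);
    }
    if (len + (dbl ? 2 : 1) > n)
        return (size_t)-1;
    if (dbl)
        out[len++] = 0x0F;                   /* end in the initial state */
    out[len] = 0x00;
    return len;
}
```

Adjacent double-byte characters share one SO/SI pair in this sketch, so the output never contains an SO followed directly by an SI.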
DBCS Support with SPE
The multibyte character functions can be used with the SPE framework. Normally this framework does not support locales, and by default DBCS support is not enabled. To enable DBCS support with SPE, turn on the CRABDBCS bit in CRABFLGM in your start-up routine or in L$UMAIN.
Formatted I/O Functions and Multibyte Character Sequences
Mixed DBCS sequences are supported in the format string for the formatted I/O functions such as printf, sprintf, scanf, sscanf, and strftime, as required by the Standard. Recognition of a mixed sequence within a format requires that a double-byte locale such as "DBCS" be in effect.
Mixed sequences are treated like any other character sequence in the format string: they are copied unchanged to output or matched on scanf input. The one exception is that an invalid sequence may cause premature termination of the function. The conversion specifier % and the specifications associated with it, which are embedded within the format string, are recognized only while in single-byte mode, which is the initial shift state at the beginning of the format string.
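The shift-state rule for % can be illustrated with a small scanner. This is a sketch only: the name count_single_byte_percents is hypothetical, it counts % bytes rather than parsing full conversion specifications, and the code point of % is passed as a parameter because it depends on the character set (\x6C in EBCDIC).

```c
#include <stddef.h>

/* Hypothetical sketch: count the bytes equal to percent that occur while
   the null-terminated format string fmt is in single-byte mode. Bytes
   with the same value inside an SO ... SI section are ignored, because
   there they are halves of double-byte characters.                      */
size_t count_single_byte_percents(const unsigned char *fmt,
                                  unsigned char percent)
{
    size_t count = 0;
    int dbl = 0;              /* a format string starts in single-byte mode */

    for (; *fmt != 0x00; fmt++) {
        if (!dbl && *fmt == 0x0E)
            dbl = 1;          /* SO */
        else if (dbl && *fmt == 0x0F)
            dbl = 0;          /* SI cannot be half of a valid character */
        else if (!dbl && *fmt == percent)
            count++;
    }
    return count;
}
```

For example, in the byte sequence \x6C\x84\x0E\x6C\x6C\x0F\x6C\xA2 only the first and the last \x6C are counted; the two between SO and SI form a double-byte character.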
Locales and Multibyte Character Sequences
The processing of multibyte character sequences is dependent on the current locale. (See Localization for a full discussion of locales.) For example, some locales support DBCS sequences and some do not.
The standard locales "S370" and "POSIX" do not support DBCS sequences.
The default locale, "", may or may not support DBCS sequences, depending on the values of locale-related environment variables.
Of the three locales supplied by the SAS/C Library, "DBCS" and "DBEX" support DBCS sequences, while "SAMP" does not.
The macro MB_CUR_MAX, defined in <stdlib.h>, gives the maximum number of bytes needed to represent a single multibyte character in the current locale. The macro MB_LEN_MAX, defined in <limits.h>, is not locale-dependent and gives the maximum number of bytes in a multibyte character across all locales.
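One practical consequence is that a buffer of MB_LEN_MAX bytes is large enough for any single multibyte character, whatever locale is in effect. The following sketch shows this with wctomb; the helper name encode_one is hypothetical.

```c
#include <limits.h>   /* MB_LEN_MAX */
#include <stdlib.h>   /* MB_CUR_MAX, wctomb, wchar_t */

/* Hypothetical helper: store the multibyte form of wc in buf, which must
   have room for at least MB_LEN_MAX bytes. Returns the number of bytes
   stored, or -1 if wc has no multibyte representation.                  */
int encode_one(char *buf, wchar_t wc)
{
    wctomb(NULL, 0);           /* reset to the initial shift state */
    return wctomb(buf, wc);    /* result never exceeds MB_CUR_MAX  */
}
```

A caller would typically declare the buffer as char buf[MB_LEN_MAX], because MB_LEN_MAX is a compile-time constant while MB_CUR_MAX may change when the locale changes.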
Function Descriptions
Descriptions of each multibyte character function follow. Each description includes a synopsis, a description, discussions of return values and portability issues, and an example. Also, errors, cautions, diagnostics, implementation details, and usage notes are included if appropriate. None of the multibyte character functions are supported by traditional UNIX C Compilers.
#include <stdlib.h>
int mblen(const char *s, size_t n);
mblen determines how many bytes are needed to represent the multibyte character pointed to by s. n specifies the maximum number of bytes of the multibyte character sequence to examine.
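If s is not NULL, the return value is as follows:
0 if s points to the null character.
The number of bytes in the multibyte character, if n or fewer bytes constitute a valid multibyte character.
-1 if n or fewer bytes do not constitute a valid multibyte character.
If s is NULL, the return value is as follows:
0 if multibyte character encodings are not state-dependent, and a nonzero value if they are (as the Standard specifies).
A diagnostic is not issued if mblen encounters invalid data; a return value of -1 is the only indication of an error.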
/* This example counts multibyte characters (not including  */
/* terminating null) in a DBCS mixed string using mblen().  */
#include <locale.h>
#include <limits.h>
#include <stdlib.h>
#include <stdio.h>

/* "strptr" points to the beginning of a DBCS MIXED string. */
/* RETURNS: number of multibyte characters                  */
int count1(char *strptr)
{
   int i = 0;       /* number of multibyte characters found */
   int charlen;     /* byte length of current character     */

   /* Inform library that we will be accepting a DBCS string. */
   /* That is, SO and SI are not regular control characters:  */
   /* they indicate a change in shift state.                  */
   setlocale(LC_ALL, "dbcs");

   /* Reset to initial shift state. (A valid mixed string     */
   /* must begin in initial shift state).                     */
   mblen(NULL, 0);

   /* One loop iteration per character. Advance "strptr" by   */
   /* number of bytes consumed.                               */
   while ((charlen = mblen(strptr, MB_LEN_MAX)) != 0) {
      if (charlen < 0) {
         fputs("Invalid MIXED DBCS string\n", stderr);
         fclose(stderr);
         abort();
      }
      strptr += charlen;
      i++;
   }
   return i;
}
#include <stdlib.h>
size_t mbstowcs(wchar_t *pwcs, const char *s, size_t n);
mbstowcs converts a sequence of multibyte characters (mixed DBCS sequence) pointed to by s into a sequence of corresponding wide characters (pure DBCS sequence) and stores the output sequence in the array pointed to by pwcs. The multibyte character sequence is assumed to begin in the initial shift state. n specifies the maximum number of wide characters to be stored.
mbstowcs returns the number of elements of pwcs that were modified, excluding the terminating 0 code, if any. If the sequence of multibyte characters is invalid, mbstowcs returns -1 (converted to size_t).
mbtowc
.
If copying takes place between objects that overlap, the behavior of mbstowcs is undefined.
A diagnostic is not issued if mbstowcs encounters invalid data; a return value of -1 is the only indication of an error.
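Because mbstowcs returns a size_t, the error value has to be compared as (size_t)-1, and a result equal to n means that the output array was filled without a terminating 0 code being stored. A minimal sketch of a wrapper that checks both conditions follows; the name to_wide_checked is hypothetical.

```c
#include <stdlib.h>

/* Hypothetical wrapper: convert the multibyte string s into out, which
   has room for n wide elements. Returns 0 if the whole string, including
   its terminating 0 code, was stored, and -1 otherwise.                 */
int to_wide_checked(wchar_t *out, const char *s, size_t n)
{
    size_t r;

    if (n == 0)
        return -1;
    r = mbstowcs(out, s, n);
    if (r == (size_t)-1)
        return -1;          /* invalid multibyte sequence */
    if (r == n)
        return -1;          /* array filled; no terminator was stored */
    return 0;
}
```

Checking for r == n before using the result avoids passing an unterminated wide array to a later call such as wcstombs.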
The following example uses mbstowcs and wcstombs.
#include <locale.h>
#include <limits.h>
#include <stdlib.h>
#include <stdio.h>

#define MAX_CHARACTERS 81

/* "old_string" is the input MIXED DBCS string. "new_string" */
/* is the output MIXED DBCS string. "old_wchar" is the wide  */
/* character to be replaced. "new_wchar" is the wide         */
/* character to replace it with.                             */
void mbsrepl(char *old_string, char *new_string,
             wchar_t old_wchar, wchar_t new_wchar)
{
   wchar_t work[MAX_CHARACTERS];
   size_t nchars;
   size_t i;

   /* Inform library that we will be accepting a DBCS string.*/
   /* That is, SO and SI are not regular control characters: */
   /* they indicate a change in shift state.                 */
   setlocale(LC_ALL, "dbcs");

   nchars = mbstowcs(work, old_string, MAX_CHARACTERS);
   if (nchars == (size_t)-1) {
      fputs("Invalid DBCS string.\n", stderr);
      fclose(stderr);
      abort();
   }

   /* "work" must have room left for the terminating 0 code. */
   if (nchars == MAX_CHARACTERS) {
      fputs("Input string too large.\n", stderr);
      fclose(stderr);
      abort();
   }

   /* Perform the actual substitution. */
   for (i = 0; i < nchars; i++)
      if (work[i] == old_wchar)
         work[i] = new_wchar;

   /* Convert back to MIXED format. */
   nchars = wcstombs(new_string, work, MAX_CHARACTERS);

   /* See if the replacement was invalid or caused the       */
   /* string to overflow.                                    */
   if (nchars == (size_t)-1 || nchars == MAX_CHARACTERS) {
      fputs("Replacement string invalid or too large.\n", stderr);
      fclose(stderr);
      abort();
   }
}
#include <stdlib.h>
int mbtowc(wchar_t *pwc, const char *s, size_t n);
mbtowc determines how many bytes are needed to represent the multibyte character pointed to by s. If s is not NULL, mbtowc then stores the corresponding wide character in the object pointed to by pwc. n specifies the maximum number of bytes to examine in the array pointed to by s.
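If s is not NULL, the return value is as follows:
0 if s points to the null character.
The number of bytes in the multibyte character, if n or fewer bytes constitute a valid multibyte character.
-1 if n or fewer bytes do not constitute a valid multibyte character.
If s is NULL, the return value is as follows:
0 if multibyte character encodings are not state-dependent, and a nonzero value if they are (as the Standard specifies).
A diagnostic is not issued if mbtowc encounters invalid data; a return value of -1 is the only indication of an error.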
The following example uses mbtowc.
#include <locale.h>
#include <limits.h>
#include <stdlib.h>
#include <stdio.h>

/* "begstr" points to the beginning of a DBCS MIXED string.  */
/* "mbc_sought" is the character value we're looking for.    */
int mbfind(char *begstr, wchar_t mbc_sought)
{
   int mbclen;      /* length (in bytes) of current character */
   wchar_t mbc;     /* value of current character             */
   char *strptr;    /* pointer to current location in string  */

   strptr = begstr;

   /* Inform library that we will be accepting a DBCS string.*/
   /* That is, SO and SI are not regular control characters: */
   /* they indicate a change in shift state.                 */
   setlocale(LC_ALL, "dbcs");

   /* Reset to initial shift state. (A valid mixed string    */
   /* must begin in initial shift state).                    */
   mbtowc((wchar_t *)NULL, NULL, 0);

   /* One loop iteration per character. Advance "strptr" by  */
   /* number of bytes consumed.                              */
   while ((mbclen = mbtowc(&mbc, strptr, MB_LEN_MAX)) != 0) {
      if (mbclen < 0) {
         fputs("Invalid MIXED DBCS string\n", stderr);
         abort();
      }
      if (mbc == mbc_sought)
         break;
      strptr += mbclen;
   }

   /* Last character was not '\0' -- must have found it */
   if (mbclen) {
      printf("MBFIND: found at byte offset %d\n",
             (int)(strptr - begstr));
      return 1;
   }
   else {
      puts("MBFIND: character not found");
      return 0;
   }
}
#include <stdlib.h>
size_t wcstombs(char *s, const wchar_t *pwcs, size_t n);
wcstombs converts a sequence of wide characters (pure DBCS sequence) to a sequence of multibyte characters (mixed DBCS sequence). The wide characters are in the array pointed to by pwcs, and the resulting multibyte characters are stored in the array pointed to by s. The resulting multibyte character sequence begins in the initial shift state. n specifies the maximum number of bytes to be filled with multibyte characters. The conversion stops if a multibyte character would exceed the limit of n bytes or if a null character is stored.
wcstombs returns the number of bytes of s that were modified, excluding the terminating 0 byte, if any. If a wide character in the sequence does not correspond to a valid multibyte character, wcstombs returns -1 (converted to size_t).
If copying takes place between objects that overlap, the behavior of wcstombs is undefined.
A diagnostic is not issued if wcstombs encounters invalid data; a return value of -1 is the only indication of an error.
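When the size of the output is not known in advance, one conservative approach is to allow MB_CUR_MAX bytes for every wide character, including the terminating one. The sketch below takes that approach; the name wide_to_mixed_dup is hypothetical, and the sizing rule is an assumption based on the Standard's definition of MB_CUR_MAX rather than anything specific to the SAS/C Library.

```c
#include <stdlib.h>

/* Hypothetical helper: return a malloc'ed multibyte copy of the
   zero-terminated wide string w, or NULL on failure. The caller is
   responsible for freeing the result.                              */
char *wide_to_mixed_dup(const wchar_t *w)
{
    size_t nwide = 0;
    size_t cap;
    size_t r;
    char *buf;

    while (w[nwide] != 0)
        nwide++;

    /* Worst case: every character, and the terminator together with
       any final shift sequence, takes MB_CUR_MAX bytes.              */
    cap = (nwide + 1) * MB_CUR_MAX;
    buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    r = wcstombs(buf, w, cap);
    if (r == (size_t)-1 || r == cap) {   /* invalid or unterminated */
        free(buf);
        return NULL;
    }
    return buf;
}
```

The buffer may be larger than strictly necessary, which trades a little memory for not having to convert the string twice.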
See the example for mbstowcs.
#include <stdlib.h>
int wctomb(char *s, wchar_t wchar);
wctomb determines how many bytes are needed to represent the multibyte character corresponding to the wide (pure DBCS) character whose value is wchar, including any change in shift state. It stores the multibyte character representation in the array pointed to by s, assuming s is not NULL. If the value of wchar is 0, wctomb is left in the initial shift state.
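If s is not NULL, the return value is the number of bytes that make up the multibyte character corresponding to the value of wchar.
If s is NULL, the return value is as follows:
0 if multibyte character encodings are not state-dependent, and a nonzero value if they are (as the Standard specifies).
The return value is never greater than the value of the MB_CUR_MAX macro.
A diagnostic is not issued if wctomb encounters invalid data; a return value of -1 is the only indication of an error.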
The following example uses wctomb.
#include <locale.h>
#include <limits.h>
#include <stdlib.h>
#include <stdio.h>

/* "pure_string" is the input PURE DBCS string.              */
/* "mixed_string" is the output MIXED DBCS string.           */
void mbline(wchar_t *pure_string, char *mixed_string)
{
   int i;
   int mbclen;
   wchar_t wc;

   /* Inform library that we will be accepting a DBCS string. */
   /* That is, SO and SI are not regular control characters:  */
   /* they indicate a change in shift state.                  */
   setlocale(LC_ALL, "dbcs");

   wctomb(NULL, 0);     /* Reset to initial shift state. */

   /* One loop iteration per character. Advance "mixed_string" */
   /* by number of bytes in character. Stop after a newline.   */
   i = 0;
   do {
      wc = pure_string[i++];
      mbclen = wctomb(mixed_string, wc);
      if (mbclen < 0) {
         puts("Invalid PURE DBCS string.");
         fclose(stdout);
         abort();
      }
      mixed_string += mbclen;
   } while (wc != L'\n');
   *mixed_string = '\0';
}