NAME
iconv_intro, iconv - Introduction to codeset conversion
DESCRIPTION
Conversion of character encoding from one coded character
set (codeset) to another is an operation that often has to
be performed by the operating system and some applications.
For example, the man command supports codeset conversion
to allow one set of reference page files to meet
the needs of locales that support the same language and
territory but different codesets (see man(1)).
The following commands and library interfaces give users
and application developers direct access to codeset conversion
operations:

   The iconv command converts characters in a data file
   from one codeset to another (see iconv(1)).

   The iconv(), iconv_open(), and iconv_close() functions
   convert a string of characters from one codeset to another
   (see iconv(3), iconv_open(3), and iconv_close(3)). The
   iconv command uses these interfaces to convert characters.
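The library interfaces named above can be combined as in the following minimal sketch. It assumes the standard iconv(3) calling sequence; the helper name convert_buffer is hypothetical, and the codeset names that iconv_open() accepts differ from platform to platform.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert inlen bytes of `in` from codeset `from` to codeset `to`.
 * Writes at most outcap bytes to `out` and returns the number of bytes
 * written, or (size_t)-1 on failure with errno left as set by iconv_open()
 * or iconv(). convert_buffer is a hypothetical helper, not a system call. */
size_t convert_buffer(const char *from, const char *to,
                      const char *in, size_t inlen,
                      char *out, size_t outcap)
{
    assert(from && to && in && out);

    /* Note the argument order: the destination codeset comes first. */
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;   /* iconv() takes char **; input bytes are only read */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1) {
        /* Flush any pending shift sequence for stateful codesets. */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);
    }

    int saved = errno;        /* keep the conversion error across iconv_close() */
    iconv_close(cd);
    errno = saved;

    return rc == (size_t)-1 ? (size_t)-1 : (size_t)(outp - out);
}
```

A caller passes the source and destination codeset names that are valid on the local system and checks the return value against (size_t)-1 before using the output.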
There are two types of codeset converters: algorithmic and
table. Algorithmic converters, which reside in the
/usr/lib/nls/loc/iconv directory, are shared libraries
with a predefined entry point for invocation by functions
in the libiconv.so library. Algorithmic converters are
needed for the conversion of multibyte codesets, in part
because table converters cannot handle the required number
of character values and also because some of these codesets
require complex handling (see NOTES). Algorithmic
converters are supplied as part of the operating system
product; the internal interfaces that they require are not
published for external use.
Table converters, which reside in the
/usr/lib/nls/loc/iconvTable directory, can be created by
using the genxlt command (see genxlt(1)). These converters
can support single-byte codesets and up to 256 encoded
character values.
Names of codeset converters are in the following form:
from-codeset_to-codeset
For example, the following converter converts values from
Super DEC Kanji to Japanese Extended UNIX Code:
sdeckanji_eucJP
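The naming rule can be expressed as a small helper. This is an illustrative sketch only; converter_name is a hypothetical function, not part of any system library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Build the converter name "from-codeset_to-codeset" in buf.
 * Returns 0 on success, or -1 if buf (of `size` bytes) is too small. */
int converter_name(char *buf, size_t size, const char *from, const char *to)
{
    assert(buf && from && to);

    /* Source name, one underscore, destination name, terminating null. */
    size_t need = strlen(from) + 1 + strlen(to) + 1;
    if (need > size)
        return -1;

    snprintf(buf, size, "%s_%s", from, to);
    return 0;
}
```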
The codeset converters produce an invalid character error
in response to characters that cannot be converted from
the source codeset to the destination codeset. This error
is always produced for character codes that are invalid in
the source codeset. However, if the error results from
characters that are valid in the source codeset but have
no counterparts in the destination codeset, you can eliminate
the error by defining the ICONV_DEFSTR environment
variable to specify a substitute output string. See the
ENVIRONMENT VARIABLES section for more information about
using the ICONV_DEFSTR variable.
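An application that calls iconv() directly sees the invalid character error as a return value of (size_t)-1 with errno set to EILSEQ. The following sketch covers only the first case described above, input that is invalid in the source codeset; how a given platform reports valid but unconvertible characters may differ. The function name first_invalid_offset is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Return the byte offset of the first input sequence that the converter
 * rejects with EILSEQ. Returns -1 if no such sequence is found and -2 if
 * no converter can be opened for the codeset pair. Converted output is
 * discarded; only the position of the error is of interest here. */
long first_invalid_offset(const char *from, const char *to,
                          const char *in, size_t inlen)
{
    assert(from && to && in);

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return -2;

    char scratch[256];
    char *inp = (char *)in;
    size_t inleft = inlen;
    long result = -1;

    while (inleft > 0) {
        char *outp = scratch;
        size_t outleft = sizeof scratch;

        size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
        if (rc != (size_t)-1)
            break;                      /* all input was consumed */
        if (errno == E2BIG)
            continue;                   /* scratch is full: reuse it and go on */
        if (errno == EILSEQ)
            result = (long)(inp - in);  /* inp points at the rejected sequence */
        break;                          /* EILSEQ, or EINVAL for truncated input */
    }

    iconv_close(cd);
    return result;
}
```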
It is possible to convert data directly between two
codesets or by way of an intermediate codeset, such as
UTF-16, UCS-4, or UTF-8. For conversion of Chinese characters,
be aware that the results of converting a Traditional
Chinese codeset directly to a Simplified Chinese
codeset may not be the same as the results of converting
Traditional Chinese first to UTF-16, UCS-4, or UTF-8 and
then to Simplified Chinese.
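One way to check whether the two routes agree for a particular input is to run both and compare the bytes. The following is a sketch under stated assumptions: the helper names are hypothetical, the buffers are fixed at 1024 bytes for brevity, and the codeset names must be ones that the local iconv_open() accepts.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* One-shot conversion; returns bytes written or (size_t)-1 on any error. */
static size_t convert_once(const char *from, const char *to,
                           const char *in, size_t inlen,
                           char *out, size_t outcap)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);  /* flush shift state */

    iconv_close(cd);
    return rc == (size_t)-1 ? (size_t)-1 : (size_t)(outp - out);
}

/* Compare a direct conversion (from -> to) with a two-step conversion
 * through an intermediate codeset (from -> pivot -> to).
 * Returns 1 if the results are byte-identical, 0 if they differ,
 * and -1 if any of the three conversions fails. */
int direct_matches_pivot(const char *from, const char *pivot, const char *to,
                         const char *in, size_t inlen)
{
    assert(from && pivot && to && in);

    char direct[1024], mid[1024], viapivot[1024];

    size_t nd = convert_once(from, to, in, inlen, direct, sizeof direct);
    size_t nm = convert_once(from, pivot, in, inlen, mid, sizeof mid);
    if (nd == (size_t)-1 || nm == (size_t)-1)
        return -1;

    size_t np = convert_once(pivot, to, mid, nm, viapivot, sizeof viapivot);
    if (np == (size_t)-1)
        return -1;

    return (nd == np && memcmp(direct, viapivot, nd) == 0) ? 1 : 0;
}
```

A return value of 0 for Traditional Chinese input would indicate the kind of difference that the paragraph above warns about.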
ENVIRONMENT VARIABLES
Some codeset converters require more complex algorithms
than can be provided through tables. The following environment
variables provide control over conversion behavior
for different kinds of codeset converters:
ICONV_ACTION
Controls the behavior for the many-to-one value conversions
for conversion of Traditional Chinese (except for
Traditional Chinese encoded in Telecode) to Simplified
Chinese. The valid settings for this environment variable
are as follows:
   batch
   Specifies that the preferred mapping value (the first one
   in the one-to-many mapping list) is always taken. The
   batch setting is the ICONV_ACTION default.
   conv_all
   Specifies that all the possible values are printed to the
   standard output, enclosed by braces ({ }), so that the
   user can later manually edit the converted file and select
   the one to use.
   conv_all_nosym
   Specifies that all the possible values are printed to the
   standard output except for punctuation symbols, for which
   only the preferred mapping value is printed. As is true
   for conv_all, the conv_all_nosym setting prints value
   choices enclosed by braces so that the converted file can
   later be edited.
ICONV_BYTEORDER
Sets byte ordering for UTF-16 or UCS-4 (UTF-32) converters
only. Valid values are little-endian or big-endian.
If ICONV_NOBOM is set to a non-null value, the
default byte ordering is big-endian. If ICONV_NOBOM
is not set, the default byte ordering is little-endian.
Setting the ICONV_BYTEORDER and
ICONV_NOBOM environment variables may be necessary
when producing UTF-16 or UCS-4 output that will be
processed by codeset converters on platforms other
than Tru64 UNIX.
ICONV_DEFSTR
Defines the default string to be
substituted in output for valid input characters
that cannot be converted from the source codeset to
the destination codeset. The variable value can be
an arbitrary string or a code number. If the value
is a code number (for example, 10, 07, 0x10, or,
for Unicode converters, U+1234), the corresponding
character in the output codeset (to-codeset) is
printed.
For a given type of codeset conversion, a matching
ICONV_DEFSTR_from-codeset_to-codeset variable has
precedence over the ICONV_DEFSTR variable without
the from-codeset_to-codeset suffix. When defining
the variable with the suffix, replace from-codeset_to-codeset
with the name of the codeset converter
to which the variable applies. The
ICONV_DEFSTR variable (defined without the suffix)
is used by a converter when no ICONV_DEFSTR_from-codeset_to-codeset
variable has been defined
specifically for the type of conversion being done.
If these variables are not defined or are set to
the null string, the characters that cannot be converted
are skipped and have no representation in
converted output.
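The precedence rule can be sketched in application code as follows. This illustrates the lookup order only and is not the converters' actual implementation; the function name lookup_defstr is hypothetical, and the sketch assumes that a converter-specific variable that is defined but empty still takes precedence over ICONV_DEFSTR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the substitute string that applies to a from -> to conversion.
 * The converter-specific variable ICONV_DEFSTR_<from>_<to> is consulted
 * first; the generic ICONV_DEFSTR is used only when the specific variable
 * is not defined. Returns NULL when the applicable variable is undefined
 * or empty, which corresponds to skipping unconvertible characters. */
const char *lookup_defstr(const char *from, const char *to)
{
    assert(from && to);

    char name[128];
    const char *val = NULL;

    int n = snprintf(name, sizeof name, "ICONV_DEFSTR_%s_%s", from, to);
    if (n > 0 && (size_t)n < sizeof name)
        val = getenv(name);

    if (val == NULL)
        val = getenv("ICONV_DEFSTR");

    return (val != NULL && val[0] != '\0') ? val : NULL;
}
```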
The following converter-specific restrictions apply
to ICONV_DEFSTR* variables:

   ICONV_DEFSTR* environment variables do not work for
   converters that convert between Japanese codesets or
   between Korean codesets.

   For converters that handle UTF-16, UCS-4 or UTF-8 format,
   the only valid variable value is a code number (such as
   U+1234 or 0x10) or a string whose value is a single ASCII
   character (such as ?). For these converters, any string
   value other than a single ASCII character is ignored and
   any characters that cannot be converted have no
   representation in output.

   For converters that handle output in UTF-16, UCS-4 or
   UTF-8 format, characters that cannot be converted and for
   which no valid ICONV_DEFSTR* value has been defined
   produce an error condition that aborts the conversion
   process.
ICONV_NOBOM
Disables generation of the byte-order mark at the
beginning of UTF-16 or UCS-4 (UTF-32) output. A
valid setting is any value other than a null
string. If ICONV_NOBOM is set, big-endian is established
as the default byte ordering and BOM generation
is disabled. If ICONV_NOBOM is not set, little-endian
is established as the default byte
ordering and BOM generation is enabled.
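The interaction between ICONV_NOBOM and ICONV_BYTEORDER can be sketched as follows. This is an illustration of the defaults described above, not the converters' own code: the structure and function names are hypothetical, and the sketch assumes that a recognized ICONV_BYTEORDER value overrides the default ordering implied by ICONV_NOBOM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Output options for a UTF-16 encoder (hypothetical structure). */
struct utf16_opts {
    int big_endian;   /* 1 = big-endian, 0 = little-endian */
    int emit_bom;     /* 1 = write U+FEFF first, 0 = no byte-order mark */
};

/* Derive the options from the environment: a non-empty ICONV_NOBOM
 * disables the BOM and makes big-endian the default; otherwise the BOM
 * is written and little-endian is the default. ICONV_BYTEORDER, when
 * set to "big-endian" or "little-endian", overrides the default order. */
struct utf16_opts utf16_opts_from_env(void)
{
    const char *nobom = getenv("ICONV_NOBOM");
    const char *order = getenv("ICONV_BYTEORDER");
    int bom_disabled = (nobom != NULL && nobom[0] != '\0');

    struct utf16_opts o;
    o.emit_bom = !bom_disabled;
    o.big_endian = bom_disabled;

    if (order != NULL) {
        if (strcmp(order, "big-endian") == 0)
            o.big_endian = 1;
        else if (strcmp(order, "little-endian") == 0)
            o.big_endian = 0;
    }
    return o;
}

/* Store one 16-bit code unit in the selected byte order. */
static void put_unit(unsigned char *p, uint16_t u, int big_endian)
{
    p[0] = (unsigned char)(big_endian ? (u >> 8) : (u & 0xFF));
    p[1] = (unsigned char)(big_endian ? (u & 0xFF) : (u >> 8));
}

/* Write `count` UTF-16 code units to out, preceded by a BOM if requested.
 * Returns the number of bytes written, or 0 if outcap is too small. */
size_t utf16_encode(const uint16_t *units, size_t count,
                    struct utf16_opts o, unsigned char *out, size_t outcap)
{
    assert(units && out);

    size_t need = 2 * (count + (o.emit_bom ? 1 : 0));
    if (need > outcap)
        return 0;

    size_t pos = 0;
    if (o.emit_bom) {
        put_unit(out, 0xFEFF, o.big_endian);
        pos = 2;
    }
    for (size_t i = 0; i < count; i++, pos += 2)
        put_unit(out + pos, units[i], o.big_endian);

    return pos;
}
```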
Codeset converters that process UTF-16 or UCS-4
data on platforms other than Tru64 UNIX usually
require the byte-order mark. The ICONV_NOBOM and
ICONV_BYTEORDER environment variables provide you
with the means to control the generation of a byte-order
mark and byte ordering. Thus, you can establish
codeset conversion that is appropriate to the
requirements of other platforms or is compatible
with output produced by codeset converters that
were included in versions of Tru64 UNIX prior to
Version 4.0D.
ICONV_PHRCONV
Activates phrase conversion for converters
that convert from a Traditional Chinese
codeset (except for Traditional Chinese encoded in
Telecode) to a Simplified Chinese codeset or the
reverse. When phrase conversion is activated, a
whole phrase in Traditional Chinese is converted to
a different phrase in Simplified Chinese or the
reverse.
If ICONV_PHRCONV is set to mark, the converted
phrases are bracketed by [ and ] to highlight
the conversion result for visual checking.
The phrase conversion databases in the
/usr/share/phrdb directory are normal text files
with the same file names as those of the algorithmic
converters in /usr/lib/nls/loc/iconv/*. These
phrase conversion databases contain entries for
phrase conversion pairs.
FILES
/usr/lib/nls/loc/iconv/*
   Algorithmic converters
/usr/lib/nls/loc/iconvTable/*
   Table converters
/usr/share/phrdb/*
   Phrase conversion databases
SEE ALSO
Commands: genxlt(1), iconv(1), phrase(1)
Functions: iconv(3), iconv_close(3), iconv_open(3)
Others: i18n_intro(5), l10n_intro(5)
iconv_intro(5)