utf8 - FreeBSD

UTF8(5)

NAME

     utf8 -- UTF-8, a transformation format of ISO 10646

SYNOPSIS

     ENCODING "UTF-8"

DESCRIPTION

     The UTF-8 encoding represents UCS-4 characters as a sequence of octets,
     using between 1 and 6 octets for each character.  It is backwards
     compatible with ASCII, so 0x00-0x7f refer to the ASCII character set.
     The multibyte encoding of each non-ASCII character consists entirely of
     bytes whose high order bit is set.  The actual encoding is represented
     by the following table:

     [0x00000000 - 0x0000007f] [00000000.0bbbbbbb] -> 0bbbbbbb
     [0x00000080 - 0x000007ff] [00000bbb.bbbbbbbb] -> 110bbbbb, 10bbbbbb
     [0x00000800 - 0x0000ffff] [bbbbbbbb.bbbbbbbb] ->
	     1110bbbb, 10bbbbbb, 10bbbbbb
     [0x00010000 - 0x001fffff] [00000000.000bbbbb.bbbbbbbb.bbbbbbbb] ->
	     11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
     [0x00200000 - 0x03ffffff] [000000bb.bbbbbbbb.bbbbbbbb.bbbbbbbb] ->
	     111110bb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
     [0x04000000 - 0x7fffffff] [0bbbbbbb.bbbbbbbb.bbbbbbbb.bbbbbbbb] ->
	     1111110b, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
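
     The table above can be read as an algorithm: pick the row whose range
     contains the value, put the top bits in the lead octet, and emit the
     remaining bits six at a time.  A minimal Python sketch follows; the
     name utf8_encode is hypothetical and is not part of any FreeBSD
     interface.

```python
def utf8_encode(cp):
    """Encode one 31-bit UCS-4 value as 1 to 6 octets (illustrative only)."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit range")
    if cp < 0x80:
        return bytes([cp])          # ASCII passes through unchanged
    # (upper bound of row, lead-octet marker, number of continuation octets)
    rows = (
        (0x7FF, 0xC0, 1),
        (0xFFFF, 0xE0, 2),
        (0x1FFFFF, 0xF0, 3),
        (0x3FFFFFF, 0xF8, 4),
        (0x7FFFFFFF, 0xFC, 5),
    )
    for limit, lead, ncont in rows:
        if cp <= limit:
            break
    # Lead octet carries the highest bits; each continuation carries 6 bits.
    out = [lead | (cp >> (6 * ncont))]
    for i in range(ncont - 1, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
    return bytes(out)
```

     For example, utf8_encode(0xE9) yields the two octets 0xC3 0xA9, which
     matches the second row of the table.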

     If more than a single representation of a value exists (for example,
     0x00; 0xC0 0x80; 0xE0 0x80 0x80) the shortest representation is always
     used.  Longer ones are rejected as errors because they pose a potential
     security risk and destroy the 1:1 mapping between characters and octet
     sequences.
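
     One way to enforce the shortest-form rule is to compare each decoded
     value against the smallest value its sequence length is allowed to
     carry.  The Python sketch below illustrates that check; the name
     utf8_decode_one is hypothetical, and this is not the libc
     implementation.

```python
def utf8_decode_one(data):
    """Decode the first character of data; return (value, octets_used)."""
    if not data:
        raise ValueError("empty input")
    b0 = data[0]
    if b0 < 0x80:
        return b0, 1
    # (lead mask, lead marker, continuation count, smallest legal value)
    rows = (
        (0xE0, 0xC0, 1, 0x80),
        (0xF0, 0xE0, 2, 0x800),
        (0xF8, 0xF0, 3, 0x10000),
        (0xFC, 0xF8, 4, 0x200000),
        (0xFE, 0xFC, 5, 0x4000000),
    )
    for mask, marker, ncont, minimum in rows:
        if (b0 & mask) == marker:
            break
    else:
        raise ValueError("invalid lead octet")
    if len(data) < ncont + 1:
        raise ValueError("truncated sequence")
    value = b0 & ~mask & 0xFF       # payload bits of the lead octet
    for b in data[1:ncont + 1]:
        if (b & 0xC0) != 0x80:
            raise ValueError("invalid continuation octet")
        value = (value << 6) | (b & 0x3F)
    if value < minimum:
        # A shorter sequence could have encoded this value.
        raise ValueError("non-shortest form")
    return value, ncont + 1
```

     With this check, 0xC0 0x80 and 0xE0 0x80 0x80 both raise an error,
     while the single octet 0x00 decodes to the value 0.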

COMPATIBILITY

     The utf8 encoding supersedes the utf2(5) encoding.  The only differences
     between the two are that utf8 handles the full 31-bit character set of
     ISO 10646 whereas utf2(5) is limited to a 16-bit character set, and that
     utf2(5) accepts redundant, non-``shortest form'' representations of
     characters.

STANDARDS

     The utf8 encoding is compatible with RFC 2279 and Unicode 3.2.

BUGS

     Byte order mark (BOM) characters are neither added to nor removed from
     UTF-8-encoded wide character stdio(3) streams.
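
     Because the BOM is passed through untouched, a program that does not
     want it has to drop it itself.  A minimal Python sketch, assuming the
     data is already in memory as bytes; the names UTF8_BOM and strip_bom
     are hypothetical:

```python
# U+FEFF encoded as UTF-8 per the three-octet row of the encoding table.
UTF8_BOM = b"\xef\xbb\xbf"

def strip_bom(data):
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data
```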


FreeBSD 5.2.1		       October 30, 2002 		 FreeBSD 5.2.1
