Improve Unicode / UTF-8 documentation

Albrecht Schlosser 2020-01-26 15:10:53 +01:00
parent f3724f7488
commit 30a868dc0f


@@ -2,12 +2,12 @@
 \page unicode Unicode and UTF-8 Support
 This chapter explains how FLTK handles international
 text via Unicode and UTF-8.
-Unicode support was only recently added to FLTK and is
-still incomplete. This chapter is Work in Progress, reflecting
-the current state of Unicode support.
+Unicode support was added to FLTK starting with version 1.3.0 and is
+still incomplete but mostly functional. This chapter is Work in Progress,
+reflecting the current state of Unicode support.
 \section unicode_about About Unicode, ISO 10646 and UTF-8
@@ -16,11 +16,11 @@ deliberately brief and provides just enough information for
 the rest of this chapter.
 For further information, please see:
-- http://www.unicode.org
-- http://www.iso.org
-- http://en.wikipedia.org/wiki/Unicode
-- http://www.cl.cam.ac.uk/~mgk25/unicode.html
-- http://www.apps.ietf.org/rfc/rfc3629.html
+- https://unicode.org
+- https://iso.org
+- https://en.wikipedia.org/wiki/Unicode
+- https://www.cl.cam.ac.uk/~mgk25/unicode.html
+- https://tools.ietf.org/html/rfc3629
 \par The Unicode Standard
@@ -33,7 +33,7 @@ and is supported by most of the major computing companies in the world.
 Before Unicode, many different systems, on different platforms,
 had been developed for encoding characters for different languages,
 but no single encoding could satisfy all languages.
-Unicode provides access to over 100,000 characters
+Unicode provides access to over 130,000 characters
 used in all the major languages written today,
 and is independent of platform and language.
@@ -78,7 +78,10 @@ U+10FFFF. The complete character set is sub-divided into \e planes.
 used characters from previous encoding standards. Other planes
 contain characters for specialist applications.
-\todo Do we need this info about planes?
+\todo FLTK 1.3 and later supports the full Unicode range (21 bits), but
+there are a few exceptions, for instance binary shortcut values in menus
+(\ref Fl_Shortcut) can only be used with characters from the BMP (16 bits).
+This may be extended in a future FLTK version.
 The UCS also defines various methods of encoding characters as
 a sequence of bytes.
@@ -95,8 +98,8 @@ UTF-16 and UTF-32 are based on units of two and four bytes.
 UCS characters requiring more than 16 bits are encoded using
 "surrogate pairs" in UTF-16.
 UTF-8 encodes all Unicode characters into variable length
 sequences of bytes. Unicode characters in the 7-bit ASCII
 range map to the same value and are represented as a single byte,
 making the transformation to Unicode quick and easy.
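Reviewer note: the context lines of this hunk describe UTF-8 as a variable-length encoding and UTF-16 surrogate pairs. The sketch below illustrates both ideas under stated assumptions. The names `encode_utf8` and `to_surrogate_pair` are hypothetical helpers written for this note; they are not FLTK functions and make no claim about how fl_utf8encode() is implemented.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of FLTK): encode one Unicode code point
// as UTF-8. Returns an empty string for surrogates (U+D800..U+DFFF) and
// for values above U+10FFFF, which are not valid Unicode scalar values.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 7-bit ASCII maps to the same value in a single byte.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // reject surrogates
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper (not part of FLTK): split a code point that needs
// more than 16 bits into a UTF-16 surrogate pair. Returns false if the
// code point is outside U+10000..U+10FFFF.
bool to_surrogate_pair(std::uint32_t cp, std::uint16_t &hi, std::uint16_t &lo) {
    if (cp < 0x10000 || cp > 0x10FFFF) return false;
    cp -= 0x10000;                                      // 20 bits remain
    hi = static_cast<std::uint16_t>(0xD800 + (cp >> 10));    // top 10 bits
    lo = static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF));  // low 10 bits
    return true;
}
```

For example, `encode_utf8(0x41)` yields the single byte `0x41`, while U+20AC needs three bytes and U+1F600 needs four. In UTF-16, U+1F600 does not fit into one 16-bit unit and is split into two.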
@@ -139,6 +142,11 @@ some level of synchronisation and error detection.
 </tr>
 </table>
+\note This table contains theoretical values outside the valid Unicode
+range (<tt>U+000000 - U+10FFFF</tt>). Such values can only be returned by
+conversion functions for illegal input values (see \ref unicode_illegals).
 \par
 Moving from ASCII encoding to Unicode will allow all new FLTK
@@ -175,7 +183,7 @@ the following limitations:
 are LIMITED to 24 bit Unicode values, but also says that only 16 bits
 are really used under linux and win32.
 <b>[Can we verify this?]</b>
 - The [<b>fltk2</b>] %fl_utf8encode() and %fl_utf8decode() functions are
 designed to handle Unicode characters in the range U+000000 to U+10FFFF
 inclusive, which covers all UTF-16 characters, as specified in RFC 3629.
@@ -189,7 +197,7 @@ the following limitations:
 and not on a general Unicode character basis.
 - FLTK will not handle right-to-left or bi-directional text.
 \todo
 Verify 16/24 bit Unicode limit for different character sets?
 OksiD's code appears limited to 16-bit whereas the FLTK2 code
@@ -249,7 +257,7 @@ about error handling and return values.
 \section unicode_fltk_calls FLTK Unicode and UTF-8 Functions
-This section currently provides a brief overview of the functions.
+This section provides a brief overview of the functions.
 For more details, consult the main text for each function via its link.
 int fl_utf8locale()