diff --git a/doc/src/sgml/admin.sgml b/doc/src/sgml/admin.sgml index 0fce2cc9af..3fa9da921a 100644 --- a/doc/src/sgml/admin.sgml +++ b/doc/src/sgml/admin.sgml @@ -1,5 +1,5 @@ @@ -97,6 +98,7 @@ Derived from postgres.sgml. &intro-ag; &installation; &installw; + &charset; &runtime; &client-auth; &manage-ag; diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml new file mode 100644 index 0000000000..93b3d021e9 --- /dev/null +++ b/doc/src/sgml/charset.sgml @@ -0,0 +1,700 @@ + + Character Sets + + + + Describes the available language and character set support in + Postgres. + + + + + Postgres supports non-ASCII character + sets with two approaches: + + + + + Using locale features in underlying + system libraries. This allows single-byte character sets to be + configured with a locale-specific collation order, provided that + the underlying system supports the required locale. This + technique supports only one character set per server, and can + not support multi-byte character sets. + + + + + + Using explicit multiple-byte character sets defined in the + Postgres server. These character sets + are also known to some client libraries. The number of character + sets is fixed at the time the server is compiled, and internal + operations such as string comparisons require expansion of each + character into a 32-bit word. + + + + + + + Multi-byte Support + + + Author + + + Tatsuo Ishii, + last updated 2000-03-22. + Check Tatsuo's + web site for more information. + + + + + Multi-byte (MB) support is intended to allow + Postgres to handle + multiple-byte character sets such as EUC (Extended Unix Code), Unicode and + Mule internal code. With MB enabled you can use multi-byte + character sets in regular expressions (regexp), LIKE, and some + other functions. The default + encoding system is selected while initializing your + Postgres installation using + initdb. 
Note that this can be + overridden when you create a database using + createdb or by using the SQL command + CREATE DATABASE. You can therefore have multiple databases, each with + a different encoding system. + + + + MB also fixes some problems concerning 8-bit single-byte + character sets, including ISO8859. (I would not say that all problems + have been fixed. I have only confirmed that the regression tests run cleanly + and that a few French characters can be used with the patch. Please let + me know if you find any problems while using 8-bit characters.) + + + + Enabling MB + + + Run configure with a multibyte option: + + +% ./configure --enable-multibyte[=encoding_system] + + + where encoding_system can be one of the + values in the following table: + + + <productname>Postgres</productname> Character Set Encodings + Encodings + + + + Encoding + Description + + + + + SQL_ASCII + ASCII + + + EUC_JP + Japanese EUC + + + EUC_CN + Chinese EUC + + + EUC_KR + Korean EUC + + + EUC_TW + Taiwan EUC + + + UNICODE + Unicode (UTF-8) + + + MULE_INTERNAL + Mule internal + + + LATIN1 + ISO 8859-1 English and some European languages + + + LATIN2 + ISO 8859-2 English and some European languages + + + LATIN3 + ISO 8859-3 English and some European languages + + + LATIN4 + ISO 8859-4 English and some European languages + + + LATIN5 + ISO 8859-5 English and some European languages + + + KOI8 + KOI8-R + + + WIN + Windows CP1251 + + + ALT + Windows CP866 + + + +
+
+ + + Here is an example of configuring + Postgres to use a Japanese encoding by + default: + + +% ./configure --enable-multibyte=EUC_JP + + + + + If the encoding system is omitted (./configure --enable-multibyte), + SQL_ASCII is assumed. + +
+ + + Setting the Encoding + + + initdb defines the default encoding + for a Postgres installation. For example: + + +% initdb -E EUC_JP + + + sets the default encoding to EUC_JP (Extended Unix Code for Japanese). + Note that you can use "--encoding" instead of "-E" if you prefer + to type longer option strings. + If no -E or --encoding option is given, the encoding + specified at compile time is used. + + + + You can create a database with a different encoding: + + +% createdb -E EUC_KR korean + + + will create a database named "korean" with EUC_KR encoding. + Another way to accomplish this is to use a SQL command: + + +CREATE DATABASE korean WITH ENCODING = 'EUC_KR'; + + + The encoding for a database is stored in the + encoding column of the + pg_database system catalog. + You can see it with the -l option or the \l command of + psql. + + +$ psql -l + List of databases + Database | Owner | Encoding +---------------+---------+--------------- + euc_cn | t-ishii | EUC_CN + euc_jp | t-ishii | EUC_JP + euc_kr | t-ishii | EUC_KR + euc_tw | t-ishii | EUC_TW + mule_internal | t-ishii | MULE_INTERNAL + regression | t-ishii | SQL_ASCII + template1 | t-ishii | EUC_JP + test | t-ishii | EUC_JP + unicode | t-ishii | UNICODE +(9 rows) + + + + + + Automatic encoding translation between backend and + frontend + + + Postgres supports automatic + encoding translation between backend + and frontend for some encodings. + + + <productname>Postgres</productname> Client/Server Character Set Encodings + Communication Encodings + + + + Server Encoding + Available Client Encodings + + + + + EUC_JP + EUC_JP, SJIS + + + EUC_TW + EUC_TW, BIG5 + + + LATIN2 + LATIN2, WIN1250 + + + LATIN5 + LATIN5, WIN, ALT + + + MULE_INTERNAL + EUC_JP, SJIS, EUC_KR, EUC_CN, + EUC_TW, BIG5, LATIN1 to LATIN5, + WIN, ALT, WIN1250 + + + +
+
+ + + To enable the automatic encoding translation, you have to tell + Postgres the encoding you would like + to use in the frontend. There are + several ways to accomplish this. + + + + + Using the \encoding command in + psql. + \encoding allows you to change the frontend + encoding on the fly. For + example, to change the encoding to SJIS, type: + + +\encoding SJIS + + + + + + + Using libpq functions. + \encoding itself calls + PQsetClientEncoding() to do this. + + +int PQsetClientEncoding(PGconn *conn, const char *encoding) + + + where conn is a connection to the backend, + and encoding is the encoding you + want to use. If it successfully sets the encoding, it returns 0, + otherwise -1. The current encoding for this connection can be obtained + with: + + +int PQclientEncoding(const PGconn *conn) + + + Note that it returns the encoding id, not an encoding symbol string + such as "EUC_JP". To convert an encoding id to an encoding symbol, you + can use: + + +char *pg_encoding_to_char(int encoding_id) + + + + + + + Using PGCLIENTENCODING. + + If the environment variable PGCLIENTENCODING is defined in the + frontend's environment, automatic encoding translation is done by the backend. + + + + + + Using SET CLIENT_ENCODING TO. + + The frontend encoding can be set with a SQL command: + + +SET CLIENT_ENCODING TO 'encoding'; + + + You can also use the SQL92 syntax "SET NAMES" for this purpose: + + +SET NAMES 'encoding'; + + + To query the current frontend encoding: + + +SHOW CLIENT_ENCODING; + + + To return to the default encoding: + + +RESET CLIENT_ENCODING; + + + + + +
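The client/server table earlier in this section can be restated as a small lookup that a client might consult before choosing a frontend encoding. The C fragment below is only an illustrative sketch: the names encoding_table and client_encoding_listed are hypothetical and are not part of libpq or the Postgres server, and the table simply copies the rows listed in this document, with "LATIN1 to LATIN5" expanded into individual names.

```c
#include <stddef.h>
#include <string.h>

/* One row of the client/server encoding table from this document. */
struct encoding_row {
    const char *server;       /* server (database) encoding */
    const char *clients[16];  /* NULL-terminated list of client encodings */
};

/* Hypothetical table: copied from the documentation, not from server code. */
static const struct encoding_row encoding_table[] = {
    {"EUC_JP", {"EUC_JP", "SJIS", NULL}},
    {"EUC_TW", {"EUC_TW", "BIG5", NULL}},
    {"LATIN2", {"LATIN2", "WIN1250", NULL}},
    {"LATIN5", {"LATIN5", "WIN", "ALT", NULL}},
    {"MULE_INTERNAL", {"EUC_JP", "SJIS", "EUC_KR", "EUC_CN", "EUC_TW", "BIG5",
                       "LATIN1", "LATIN2", "LATIN3", "LATIN4", "LATIN5",
                       "WIN", "ALT", "WIN1250", NULL}},
};

/*
 * Hypothetical helper: returns 1 if the table lists `client` as an
 * available client encoding for `server`, and 0 otherwise. Names are
 * compared exactly (case-sensitive). A server encoding that has no row
 * in the table always yields 0.
 */
int client_encoding_listed(const char *server, const char *client)
{
    size_t nrows = sizeof(encoding_table) / sizeof(encoding_table[0]);

    for (size_t i = 0; i < nrows; i++) {
        if (strcmp(encoding_table[i].server, server) != 0)
            continue;
        for (size_t j = 0; encoding_table[i].clients[j] != NULL; j++) {
            if (strcmp(encoding_table[i].clients[j], client) == 0)
                return 1;
        }
        return 0;  /* server row found, client not listed */
    }
    return 0;      /* no row for this server encoding */
}
```

For example, client_encoding_listed("EUC_JP", "SJIS") returns 1, while client_encoding_listed("EUC_JP", "BIG5") returns 0. A caller could make such a check before passing the encoding name to PQsetClientEncoding().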
+ + + About Unicode + + + Automatic encoding translation between Unicode and other + encodings is not yet supported. + + + + + What happens if the translation is not possible? + + + Suppose you choose EUC_JP for the backend and LATIN1 for the frontend; + then some Japanese characters cannot be translated into LATIN1. In + this case, a letter that cannot be represented in the LATIN1 character set + is transformed as: + + +(HEXADECIMAL) + + + + + + References + + + These are good sources for starting to learn about the various kinds of encoding + systems. + + + + + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf + Detailed explanations of EUC_JP, EUC_CN, EUC_KR, EUC_TW + appear in section 3.2. + + + + + + Unicode: http://www.unicode.org/ + The home page of Unicode. + + + + + + RFC 2044 + UTF-8 is defined here. + + + + + + + + History + + + +May 20, 2000 + * SJIS UDC (NEC selection IBM kanji) support contributed + by Eiji Tokuya + * Changes above will appear in 7.0.1 + +Mar 22, 2000 + * Add new libpq functions PQsetClientEncoding, PQclientEncoding + * ./configure --with-mb=EUC_JP + is now deprecated; use + ./configure --enable-multibyte=EUC_JP + instead + * Add SQL_ASCII regression test case + * Add SJIS User Defined Character (UDC) support + * All of the above will appear in 7.0 + +July 11, 1999 + * Add support for WIN1250 (Windows Czech) as a client encoding + (contributed by Pavel Behal) + * fix some compiler warnings (contributed by Tomoaki Nishiyama) + +Mar 23, 1999 + * Add support for KOI8(KOI8-R), WIN(CP1251), ALT(CP866) + (thanks Oleg Broytmann for testing) + * Fix problem with MB and locale + +Jan 26, 1999 + * Add support for Big5 as a frontend encoding + (you need to create a database with EUC_TW to use Big5) + * Add regression test case for EUC_TW + (contributed by Jonah Kuo) + +Dec 15, 1998 + * Bugs related to SQL_ASCII support fixed + +Nov 5, 1998 + * 6.4 release. 
In this version, pg_database has an "encoding" + column that represents the database encoding + +Jul 22, 1998 + * determine encoding at initdb/createdb rather than compile time + * support for PGCLIENTENCODING when issuing COPY command + * support for SQL92 syntax "SET NAMES" + * support for LATIN2-5 + * add UNICODE regression test case + * new test suite for MB + * clean up source files + +Jun 5, 1998 + * add support for the encoding translation between the backend + and the frontend + * new command SET CLIENT_ENCODING etc. added + * add support for LATIN1 character set + * enhance 8 bit cleanness + +April 21, 1998 some enhancements/fixes + * character_length(), position(), substring() are now aware of + multi-byte characters + * add octet_length() + * add --with-mb option to configure + * new regression tests for EUC_KR + (contributed by Soonmyung. Hong) + * add some test cases to the EUC_JP regression test + * fix problem in regress/regress.sh in case of System V + * fix toupper(), tolower() to handle 8bit chars + +Mar 25, 1998 MB PL2 is incorporated into PostgreSQL 6.3.1 + +Mar 10, 1998 PL2 released + * add regression test for EUC_JP, EUC_CN and MULE_INTERNAL + * add an English document (this file) + * fix problems concerning 8-bit single byte characters + +Mar 1, 1998 PL1 released + + + + + + WIN1250 on Windows/ODBC + + + + + The WIN1250 character set on Windows client platforms can be used + with Postgres with locale support + enabled. + + + + The following should be kept in mind: + + + + + Success depends on proper system locales. This has been tested + with RH6.0 and Slackware 3.6, with the cs_CZ.iso8859-2 locale. + + + + + + Never try to set the server multibyte database encoding to WIN1250. + Always use LATIN2 instead, since there is no WIN1250 locale + in Unix. + + + + + + WIN1250 encoding is usable only for MS Windows ODBC clients. The + characters are recoded on the fly, to be displayed and stored + back properly. 
+ + + + + + + When running, it is important to remember the following: + + + + + This configuration changes your sort order depending on your + LC_x settings. Do not be + confused by the regression test results, since they do not use + locale. + + + + + + A locale such as "ch" is correctly sorted + only if your system + supports that locale; older systems may not do so but new ones + (e.g. RH6.0) do. + + + + + + You have to insert money as '162,50' (note the + comma within the single quotes). + + + + + + At the time of writing (early 1999), this configuration has + not received extensive testing. Please let us know of any + changes you had to make! + + + + + + + WIN1250 on Windows/ODBC + + + Change the three relevant files in the source directories. + + + + + + Compile Postgres with locale enabled + and the multibyte encoding set to LATIN2. + + + + + + Set up your installation. Do not forget to create locale + variables in your profile (environment). For example (this may + not be correct for your environment): + + +LC_ALL=cs_CZ.ISO8859-2 +LC_COLLATE=cs_CZ.ISO8859-2 +LC_CTYPE=cs_CZ.ISO8859-2 +LC_MONETARY=cs_CZ.ISO8859-2 +LC_NUMERIC=cs_CZ.ISO8859-2 +LC_TIME=cs_CZ.ISO8859-2 + + + + + + + You have to start the postmaster with locales set! + + + + + + Try it with the Czech language, and have it sort the results of a query. + + + + + + Install the ODBC driver for PgSQL on your MS Windows machine. + + + + + + Set up your data source properly. Include this line in your ODBC + configuration dialog in the field Connect Settings: + + +SET CLIENT_ENCODING = 'WIN1250'; + + + + + + + Now try it again, but in Windows with ODBC. + + + + +
+
+ + diff --git a/doc/src/sgml/postgres.sgml b/doc/src/sgml/postgres.sgml index 0dcdecb95c..19f93c5aae 100644 --- a/doc/src/sgml/postgres.sgml +++ b/doc/src/sgml/postgres.sgml @@ -1,5 +1,5 @@ - - - - - - - - - - - + + + + + + + + + + + + @@ -170,16 +171,17 @@ $Header: /cvsroot/pgsql/doc/src/sgml/postgres.sgml,v 1.40 2000/08/25 15:17:37 th included twice. &intro-ag; --> - &installation; - &installw; - &runtime; - &client-auth; - &manage-ag; - &user-manag; - &backup; - &recovery; - &regress; - &release; + &installation; + &installw; + &charset; + &runtime; + &client-auth; + &manage-ag; + &user-manag; + &backup; + &recovery; + &regress; + &release; @@ -199,20 +201,20 @@ $Header: /cvsroot/pgsql/doc/src/sgml/postgres.sgml,v 1.40 2000/08/25 15:17:37 th included twice. &intro-pg; --> - &arch-pg; - &extend; - &xfunc; - &xtypes; - &xoper; - &xaggr; - &rules; - &xindex; - &indexcost; - &gist; - &dfunc; - &trigger; - &spi; - &xplang; + &arch-pg; + &extend; + &xfunc; + &xtypes; + &xoper; + &xaggr; + &rules; + &xindex; + &indexcost; + &gist; + &dfunc; + &trigger; + &spi; + &xplang;