diff --git a/doc/src/sgml/admin.sgml b/doc/src/sgml/admin.sgml
index 0fce2cc9af..3fa9da921a 100644
--- a/doc/src/sgml/admin.sgml
+++ b/doc/src/sgml/admin.sgml
@@ -1,5 +1,5 @@
@@ -97,6 +98,7 @@ Derived from postgres.sgml.
&intro-ag;
&installation;
&installw;
+ &charset;
&runtime;
&client-auth;
&manage-ag;
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
new file mode 100644
index 0000000000..93b3d021e9
--- /dev/null
+++ b/doc/src/sgml/charset.sgml
@@ -0,0 +1,700 @@
+
+ Character Sets
+
+
+
+ Describes the available language and character set support in
+ Postgres.
+
+
+
+
+ Postgres supports non-ASCII character
+ sets with two approaches:
+
+
+
+
+ Using locale features in underlying
+ system libraries. This allows single-byte character sets to be
+ configured with a locale-specific collation order, provided that
+ the underlying system supports the required locale. This
+ technique supports only one character set per server, and can
+ not support multi-byte character sets.
+
+
+
+
+
+ Using explicit multiple-byte character sets defined in the
+ Postgres server. These character sets
+ are also known to some client libraries. The number of character
+ sets is fixed at the time the server is compiled, and internal
+ operations such as string comparisons require expansion of each
+ character into a 32-bit word.
+
+
+
+
+
+
+ Multi-byte Support
+
+
+ Author
+
+
+ Tatsuo Ishii,
+ last updated 2000-03-22.
+ Check Tatsuo's
+ web site for more information.
+
+
+
+
+ Multi-byte (MB) support is intended to allow
+ Postgres to handle
+ multiple-byte character sets such as EUC (Extended Unix Code), Unicode and
+ Mule internal code. With MB enabled you can use multi-byte
+ character sets in regular expressions (regexp), LIKE, and some
+ other functions. The default
+ encoding system is selected while initializing your
+ Postgres installation using
+ initdb. Note that this can be
+ overridden when you create a database using
+ createdb or by using the SQL command
+ CREATE DATABASE. So you can have multiple databases each with
+ a different encoding system.
+
+
+
+ MB also fixes some problems concerning 8-bit single byte
+ character sets including ISO8859. (I would not say all of problems
+ have been fixed. I just confirmed that the regression test ran fine
+ and a few French characters could be used with the patch. Please let
+ me know if you find any problem while using 8-bit characters.)
+
+
+
+ Enabling MB
+
+
+ Run configure with a multibyte option:
+
+
+% ./configure --enable-multibyte[=encoding_system]
+
+
+ where encoding_system can be one of the
+ values in the following table:
+
+
+ Postgres Character Set Encodings
+ Encodings
+
+
+
+ Encoding
+ Description
+
+
+
+
+ SQL_ASCII
+ ASCII
+
+
+ EUC_JP
+ Japanese EUC
+
+
+ EUC_CN
+ Chinese EUC
+
+
+ EUC_KR
+ Korean EUC
+
+
+ EUC_TW
+ Taiwan EUC
+
+
+ UNICODE
+ Unicode(UTF-8)
+
+
+ MULE_INTERNAL
+ Mule internal
+
+
+ LATIN1
+ ISO 8859-1 English and some European languages
+
+
+ LATIN2
+ ISO 8859-2 English and some European languages
+
+
+ LATIN3
+ ISO 8859-3 English and some European languages
+
+
+ LATIN4
+ ISO 8859-4 English and some European languages
+
+
+ LATIN5
+ ISO 8859-5 English and some European languages
+
+
+ KOI8
+ KOI8-R
+
+
+ WIN
+ Windows CP1251
+
+
+ ALT
+ Windows CP866
+
+
+
+
+
+
+
+ Here is an example of configuring
+ Postgres to use a Japanese encoding by
+ default:
+
+
+% ./configure --enable-multibyte=EUC_JP
+
+
+
+
+ If the encoding system is omitted (./configure --enable-multibyte),
+ SQL_ASCII is assumed.
+
+
+
+
+ Setting the Encoding
+
+
+ initdb defines the default encoding
+ for a Postgres installation. For example:
+
+
+% initdb -E EUC_JP
+
+
+ sets the default encoding to EUC_JP(Extended Unix Code for Japanese).
+ Note that you can use "--encoding" instead of "-E" if you prefer
+ to type longer option strings.
+ If no -E or --encoding option is given, the encoding
+ specified at the compile time is used.
+
+
+
+ You can create a database with a different encoding:
+
+
+% createdb -E EUC_KR korean
+
+
+ will create a database named "korean" with EUC_KR encoding. The
+ another way to accomplish this is to use a SQL command:
+
+
+CREATE DATABASE korean WITH ENCODING = 'EUC_KR';
+
+
+ The encoding for a database is represented as an
+ encoding column in the
+ pg_database system catalog.
+ You can see that by using -l or \l of psql
+ command.
+
+
+$ psql -l
+ List of databases
+ Database | Owner | Encoding
+---------------+---------+---------------
+ euc_cn | t-ishii | EUC_CN
+ euc_jp | t-ishii | EUC_JP
+ euc_kr | t-ishii | EUC_KR
+ euc_tw | t-ishii | EUC_TW
+ mule_internal | t-ishii | MULE_INTERNAL
+ regression | t-ishii | SQL_ASCII
+ template1 | t-ishii | EUC_JP
+ test | t-ishii | EUC_JP
+ unicode | t-ishii | UNICODE
+(9 rows)
+
+
+
+
+
+ Automatic encoding translation between backend and
+ frontend
+
+
+ Postgres supports an automatic
+ encoding translation between backend
+ and frontend for some encodings.
+
+
+ Postgres Client/Server Character Set Encodings
+ Communication Encodings
+
+
+
+ Server Encoding
+ Available Client Encodings
+
+
+
+
+ EUC_JP
+ EUC_JP, SJIS
+
+
+ EUC_TW
+ EUC_TW, BIG5
+
+
+ LATIN2
+ LATIN2, WIN1250
+
+
+ LATIN5
+ LATIN5, WIN, ALT
+
+
+ MULE_INTERNAL
+ EUC_JP, SJIS, EUC_KR, EUC_CN,
+ EUC_TW, BIG5, LATIN1 to LATIN5,
+ WIN, ALT, WIN1250
+
+
+
+
+
+
+
+ To enable the automatic encoding translation, you have to tell
+ Postgres the encoding you would like
+ to use in frontend. There are
+ several ways to accomplish this.
+
+
+
+
+ Using the \encoding command in
+ psql.
+ \encoding allows you to change frontend
+ encoding on the fly. For
+ example, to change the encoding to SJIS, type:
+
+
+\encoding SJIS
+
+
+
+
+
+
+ Using libpq functions.
+ \encoding actually calls
+ PQsetClientEncoding() for its purpose.
+
+
+int PQsetClientEncoding(PGconn *conn, const char *encoding)
+
+
+ where conn is a connection to the backend,
+ and encoding is an encoding you
+ want to use. If it successfully sets the encoding, it returns 0,
+ otherwise -1. The current encoding for this connection can be shown by
+ using:
+
+
+int PQclientEncoding(const PGconn *conn)
+
+
+ Note that it returns the "encoding id," not the encoding symbol string
+ such as "EUC_JP." To convert an encoding id to an encoding symbol, you
+ can use:
+
+
+char *pg_encoding_to_char(int encoding_id)
+
+
+
+
+
+
+ Using PGCLIENTENCODING.
+
+ If an environment variable PGCLIENTENCODING is defined in the
+ frontend, an automatic encoding translation is done by the backend.
+
+
+
+
+
+ Using SET CLIENT_ENCODING TO.
+
+ Setting the frontend side encoding can be done a SQL command:
+
+
+SET CLIENT_ENCODING TO 'encoding';
+
+
+ Also you can use SQL92 syntax "SET NAMES" for this purpose:
+
+
+SET NAMES 'encoding';
+
+
+ To query the current the frontend encoding:
+
+
+SHOW CLIENT_ENCODING;
+
+
+ To return to the default encoding:
+
+
+RESET CLIENT_ENCODING;
+
+
+
+
+
+
+
+
+ About Unicode
+
+
+ An automatic encoding translation between Unicode and other
+ encodings is not yet supported.
+
+
+
+
+ What happens if the translation is not possible?
+
+
+ Suppose you choose EUC_JP for the backend, LATIN1 for the frontend,
+ then some Japanese characters could not be translated into LATIN1. In
+ this case, a letter cannot be represented in the LATIN1 character set,
+ would be transformed as:
+
+
+(HEXA DECIMAL)
+
+
+
+
+
+ References
+
+
+ These are good sources to start learning various kind of encoding
+ systems.
+
+
+
+
+
+ ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
+ Detailed explanations of EUC_JP, EUC_CN, EUC_KR, EUC_TW
+ appear in section 3.2.
+
+
+
+
+
+ Unicode: http://www.unicode.org/
+ The homepage of UNICODE.
+
+
+
+
+
+ RFC 2044
+ UTF-8 is defined here.
+
+
+
+
+
+
+
+ History
+
+
+
+May 20, 2000
+ * SJIS UDC (NEC selection IBM kanji) support contributed
+ by Eiji Tokuya
+ * Changes above will appear in 7.0.1
+
+Mar 22, 2000
+ * Add new libpq functions PQsetClientEncoding, PQclientEncoding
+ * ./configure --with-mb=EUC_JP
+ now deprecated. use
+ ./configure --enable-multibyte=EUC_JP
+ instead
+ * Add SQL_ASCII regression test case
+ * Add SJIS User Defined Character (UDC) support
+ * All of above will appear in 7.0
+
+July 11, 1999
+ * Add support for WIN1250 (Windows Czech) as a client encoding
+ (contributed by Pavel Behal)
+ * fix some compiler warnings (contributed by Tomoaki Nishiyama)
+
+Mar 23, 1999
+ * Add support for KOI8(KOI8-R), WIN(CP1251), ALT(CP866)
+ (thanks Oleg Broytmann for testing)
+ * Fix problem with MB and locale
+
+Jan 26, 1999
+ * Add support for Big5 for fronend encoding
+ (you need to create a database with EUC_TW to use Big5)
+ * Add regression test case for EUC_TW
+ (contributed by Jonah Kuo)
+
+Dec 15, 1998
+ * Bugs related to SQL_ASCII support fixed
+
+Nov 5, 1998
+ * 6.4 release. In this version, pg_database has "encoding"
+ column that represents the database encoding
+
+Jul 22, 1998
+ * determine encoding at initdb/createdb rather than compile time
+ * support for PGCLIENTENCODING when issuing COPY command
+ * support for SQL92 syntax "SET NAMES"
+ * support for LATIN2-5
+ * add UNICODE regression test case
+ * new test suite for MB
+ * clean up source files
+
+Jun 5, 1998
+ * add support for the encoding translation between the backend
+ and the frontend
+ * new command SET CLIENT_ENCODING etc. added
+ * add support for LATIN1 character set
+ * enhance 8 bit cleaness
+
+April 21, 1998 some enhancements/fixes
+ * character_length(), position(), substring() are now aware of
+ multi-byte characters
+ * add octet_length()
+ * add --with-mb option to configure
+ * new regression tests for EUC_KR
+ (contributed by Soonmyung. Hong)
+ * add some test cases to the EUC_JP regression test
+ * fix problem in regress/regress.sh in case of System V
+ * fix toupper(), tolower() to handle 8bit chars
+
+Mar 25, 1998 MB PL2 is incorporated into PostgreSQL 6.3.1
+
+Mar 10, 1998 PL2 released
+ * add regression test for EUC_JP, EUC_CN and MULE_INTERNAL
+ * add an English document (this file)
+ * fix problems concerning 8-bit single byte characters
+
+Mar 1, 1998 PL1 released
+
+
+
+
+
+ WIN1250 on Windows/ODBC
+
+
+
+
+ The WIN1250 character set on Windows client platforms can be used
+ with Postgres with locale support
+ enabled.
+
+
+
+ The following should be kept in mind:
+
+
+
+
+ Success depends on proper system locales. This has been tested
+ with RH6.0 and Slackware 3.6, with cs_CZ.iso8859-2 locale.
+
+
+
+
+
+ Never try to set the server multibyte database encoding to WIN1250.
+ Always use LATIN2 instead since there is not a WIN1250 locale
+ in Unix.
+
+
+
+
+
+ WIN1250 encoding is useable only for M$W ODBC clients. The
+ characters are recoded on the fly, to be displayed and stored
+ back properly.
+
+
+
+
+
+
+ When running, it is important to remember the following:
+
+
+
+
+ This configuration reorders your sort order depending on your
+ LC_x settings. Don't be
+ confused with the regression test results since they don't use
+ locale.
+
+
+
+
+
+ A locale such as "ch" is correctly sorted
+ only if your system
+ supports that locale; older systems may not do so but new ones
+ (e.g. RH6.0) do.
+
+
+
+
+
+ You have to insert money as '162,50' (note
+ comma within the single-quotes).
+
+
+
+
+
+ At the time of writing (early 1999), this configuration has
+ not received extensive testing. Please let us know of any
+ changes you had to make!
+
+
+
+
+
+
+ WIN1250 on Windows/ODBC
+
+
+ Change the three relevant files in the source directories.
+
+
+
+
+
+ Compile Postgres with local enabled
+ and the multibyte encoding set to LATIN2.
+
+
+
+
+
+ Set up your instalation. Do not forget to create locale
+ variables in your profile (environment). For example (this may
+ not be correct for your environment):
+
+
+LC_ALL=cs_CZ.ISO8859-2
+LC_COLLATE=cs_CZ.ISO8859-2
+LC_CTYPE=cs_CZ.ISO8859-2
+LC_MONETARY=cs_CZ.ISO8859-2
+LC_NUMERIC=cs_CZ.ISO8859-2
+LC_TIME=cs_CZ.ISO8859-2
+
+
+
+
+
+
+ You have to start the postmaster with locales set!
+
+
+
+
+
+ Try it with Czech language, and have it sort on a query.
+
+
+
+
+
+ Install ODBC driver for PgSQL on your M$ Windows machine.
+
+
+
+
+
+ Setup properly your data source. Include this line in your ODBC
+ configuration dialog in the field Connect Settings:
+
+
+SET CLIENT_ENCODING = 'WIN1250';
+
+
+
+
+
+
+ Now try it again, but in Windows with ODBC.
+
+
+
+
+
+
+
+
diff --git a/doc/src/sgml/postgres.sgml b/doc/src/sgml/postgres.sgml
index 0dcdecb95c..19f93c5aae 100644
--- a/doc/src/sgml/postgres.sgml
+++ b/doc/src/sgml/postgres.sgml
@@ -1,5 +1,5 @@
-
-
-
-
-
-
-
-
-
-
-
+
+
+
+
+
+
+
+
+
+
+
+
@@ -170,16 +171,17 @@ $Header: /cvsroot/pgsql/doc/src/sgml/postgres.sgml,v 1.40 2000/08/25 15:17:37 th
included twice.
&intro-ag;
-->
- &installation;
- &installw;
- &runtime;
- &client-auth;
- &manage-ag;
- &user-manag;
- &backup;
- &recovery;
- ®ress;
- &release;
+ &installation;
+ &installw;
+ &charset;
+ &runtime;
+ &client-auth;
+ &manage-ag;
+ &user-manag;
+ &backup;
+ &recovery;
+ ®ress;
+ &release;
@@ -199,20 +201,20 @@ $Header: /cvsroot/pgsql/doc/src/sgml/postgres.sgml,v 1.40 2000/08/25 15:17:37 th
included twice.
&intro-pg;
-->
- &arch-pg;
- &extend;
- &xfunc;
- &xtypes;
- &xoper;
- &xaggr;
- &rules;
- &xindex;
- &indexcost;
- &gist;
- &dfunc;
- &trigger;
- &spi;
- &xplang;
+ &arch-pg;
+ &extend;
+ &xfunc;
+ &xtypes;
+ &xoper;
+ &xaggr;
+ &rules;
+ &xindex;
+ &indexcost;
+ &gist;
+ &dfunc;
+ &trigger;
+ &spi;
+ &xplang;