Update multi-byte support README
parent
853cf66176
commit
5b1f92eaa7
178
doc/README.mb
@@ -1,7 +1,7 @@
|
|||||||
postgresql 6.5.1 multi-byte (MB) support README July 11 1999
|
PostgreSQL 7.0 multi-byte (MB) support README Mar 22 2000
|
||||||
|
|
||||||
Tatsuo Ishii
|
Tatsuo Ishii
|
||||||
t-ishii@sra.co.jp
|
ishii@postgresql.org
|
||||||
http://www.sra.co.jp/people/t-ishii/PostgreSQL/
|
http://www.sra.co.jp/people/t-ishii/PostgreSQL/
|
||||||
|
|
||||||
0. Introduction
|
0. Introduction
|
||||||
@@ -9,12 +9,12 @@ postgresql 6.5.1 multi-byte (MB) support README July 11 1999
|
|||||||
The MB support is intended for allowing PostgreSQL to handle
|
The MB support is intended for allowing PostgreSQL to handle
|
||||||
multi-byte character sets such as EUC(Extended Unix Code), Unicode and
|
multi-byte character sets such as EUC(Extended Unix Code), Unicode and
|
||||||
Mule internal code. With the MB enabled you can use multi-byte
|
Mule internal code. With the MB enabled you can use multi-byte
|
||||||
character sets in regexp ,LIKE and some functions. The default
|
character sets in regexp ,LIKE and some other functions. The default
|
||||||
encoding system chosen is determined while initializing your
|
encoding system chosen is determined while initializing your
|
||||||
PostgreSQL installation using initdb(1). Note that this can be
|
PostgreSQL installation using initdb(1). Note that this can be
|
||||||
overridden when you create a database using createdb(1) or create
|
overridden when you create a database using createdb(1) or by using a
|
||||||
database SQL command. So you could have multiple databases with
|
create database SQL command. So you could have multiple databases with
|
||||||
different encoding systems.
|
a different encoding system each.
|
||||||
|
|
||||||
MB also fixes some problems concerning with 8-bit single byte
|
MB also fixes some problems concerning with 8-bit single byte
|
||||||
character sets including ISO8859. (I would not say all of problems
|
character sets including ISO8859. (I would not say all of problems
|
||||||
@@ -24,11 +24,11 @@ me know if you find any problem while using 8-bit characters)
|
|||||||
|
|
||||||
1. How to use
|
1. How to use
|
||||||
|
|
||||||
run configure with the mb option:
|
run configure with a multibyte option:
|
||||||
|
|
||||||
% configure --with-mb=encoding_system
|
% ./configure --enable-multibyte[=encoding_system]
|
||||||
|
|
||||||
where encoding_system is one of:
|
where the encoding_system is one of:
|
||||||
|
|
||||||
SQL_ASCII ASCII
|
SQL_ASCII ASCII
|
||||||
EUC_JP Japanese EUC
|
EUC_JP Japanese EUC
|
||||||
@@ -48,21 +48,21 @@ where encoding_system is one of:
|
|||||||
|
|
||||||
Example:
|
Example:
|
||||||
|
|
||||||
% configure --with-mb=EUC_JP
|
% ./configure --enable-multibyte=EUC_JP
|
||||||
|
|
||||||
If MB is disabled, nothing is changed except better supporting for
|
If the encoding system is omitted (./configure --enable-multibyte),
|
||||||
8-bit single byte character sets.
|
SQL_ASCII is assumed.
|
||||||
|
|
||||||
2. How to set encoding
|
2. How to set the encoding
|
||||||
|
|
||||||
initdb command defines the default encoding for a PostgreSQL
|
initdb command defines the default encoding for a PostgreSQL
|
||||||
installation. For example:
|
installation. For example:
|
||||||
|
|
||||||
% initdb -e EUC_JP
|
% initdb -E EUC_JP
|
||||||
|
|
||||||
sets the default encoding to EUC_JP(Extended Unix Code for Japanese).
|
sets the default encoding to EUC_JP(Extended Unix Code for Japanese).
|
||||||
Note that you can use "-pgencoding" instead of "-e" if you like longer
|
Note that you can use "--encoding" instead of "-E" if you like longer
|
||||||
option string:-) If no -e or -pgencoding option is given, the encoding
|
option string:-) If no -E or --encoding option is given, the encoding
|
||||||
specified at the compile time is used.
|
specified at the compile time is used.
|
||||||
|
|
||||||
You can create a database with a different encoding.
|
You can create a database with a different encoding.
|
||||||
@@ -75,78 +75,85 @@ another way to accomplish this is to use a SQL command:
|
|||||||
CREATE DATABASE korean WITH ENCODING = 'EUC_KR';
|
CREATE DATABASE korean WITH ENCODING = 'EUC_KR';
|
||||||
|
|
||||||
The encoding for a database is represented as "encoding" column in the
|
The encoding for a database is represented as "encoding" column in the
|
||||||
pg_database system catalog.
|
pg_database system catalog. You can see it with psql's -l option or \l
|
||||||
|
command.
|
||||||
|
|
||||||
datname |datdba|encoding|datpath
|
$ psql -l
|
||||||
-------------+------+--------+-------------
|
List of databases
|
||||||
template1 | 1739| 1|template1
|
Database | Owner | Encoding
|
||||||
postgres | 1739| 0|postgres
|
---------------+---------+---------------
|
||||||
euc_jp | 1739| 1|euc_jp
|
euc_cn | t-ishii | EUC_CN
|
||||||
euc_kr | 1739| 3|euc_kr
|
euc_jp | t-ishii | EUC_JP
|
||||||
euc_cn | 1739| 2|euc_cn
|
euc_kr | t-ishii | EUC_KR
|
||||||
unicode | 1739| 5|unicode
|
euc_tw | t-ishii | EUC_TW
|
||||||
mule_internal| 1739| 6|mule_internal
|
mule_internal | t-ishii | MULE_INTERNAL
|
||||||
|
regression | t-ishii | SQL_ASCII
|
||||||
|
template1 | t-ishii | EUC_JP
|
||||||
|
test | t-ishii | EUC_JP
|
||||||
|
unicode | t-ishii | UNICODE
|
||||||
|
(9 rows)
|
||||||
|
|
||||||
A number in the encoding column is "encoding id" and can be translated
|
3. Automatic encoding translation between backend and frontend
|
||||||
to the encoding name using pg_encoding command.
|
|
||||||
|
|
||||||
$ pg_encoding 1
|
PostgreSQL supports an automatic encoding translation between backend
|
||||||
EUC_JP
|
and frontend for some encodings.
|
||||||
|
|
||||||
If an argument to pg_encoding is not a number, then it is regarded as
|
encoding of backend available encoding of frontend
|
||||||
an encoding name and pg_encoding will return the encoding id.
|
--------------------------------------------------------------------
|
||||||
|
EUC_JP EUC_JP, SJIS
|
||||||
|
|
||||||
|
EUC_TW EUC_TW, BIG5
|
||||||
|
|
||||||
|
LATIN2 LATIN2, WIN1250
|
||||||
|
|
||||||
|
LATIN5 LATIN5, WIN, ALT
|
||||||
|
|
||||||
|
MULE_INTERNAL EUC_JP, SJIS, EUC_KR, EUC_CN,
|
||||||
|
EUC_TW, BIG5, LATIN1 to LATIN5,
|
||||||
|
WIN, ALT, WIN1250
|
||||||
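The table in the new README text can be read as a lookup from a backend encoding to the frontend encodings it accepts. The following is only a sketch of that reading, not PostgreSQL's own implementation: the names translation_row, translation_table and can_translate are hypothetical, and "LATIN1 to LATIN5" is assumed to mean LATIN1, LATIN2, LATIN3, LATIN4 and LATIN5.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of the README table: a backend encoding and the frontend
 * encodings listed next to it. The struct and table names are
 * hypothetical and exist only for this illustration. */
struct translation_row {
    const char *backend;
    const char *frontends[16];   /* NULL-terminated list */
};

static const struct translation_row translation_table[] = {
    {"EUC_JP", {"EUC_JP", "SJIS", NULL}},
    {"EUC_TW", {"EUC_TW", "BIG5", NULL}},
    {"LATIN2", {"LATIN2", "WIN1250", NULL}},
    {"LATIN5", {"LATIN5", "WIN", "ALT", NULL}},
    /* Assumption: "LATIN1 to LATIN5" expands to the five names below. */
    {"MULE_INTERNAL", {"EUC_JP", "SJIS", "EUC_KR", "EUC_CN", "EUC_TW",
                       "BIG5", "LATIN1", "LATIN2", "LATIN3", "LATIN4",
                       "LATIN5", "WIN", "ALT", "WIN1250", NULL}},
};

/* Returns 1 if the table lists `frontend` for `backend`, otherwise 0.
 * Backends that are not in the table (for example UNICODE) yield 0. */
int can_translate(const char *backend, const char *frontend)
{
    assert(backend != NULL && frontend != NULL);

    size_t nrows = sizeof translation_table / sizeof translation_table[0];
    for (size_t i = 0; i < nrows; i++) {
        if (strcmp(translation_table[i].backend, backend) != 0)
            continue;
        for (size_t j = 0; translation_table[i].frontends[j] != NULL; j++) {
            if (strcmp(translation_table[i].frontends[j], frontend) == 0)
                return 1;
        }
        return 0;   /* backend found, frontend not listed for it */
    }
    return 0;       /* backend not in the table at all */
}
```

For instance, can_translate("EUC_JP", "SJIS") is true under this reading, while can_translate("EUC_JP", "BIG5") is not. The comparison is case-sensitive, which is a simplification of this sketch.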
|
|
||||||
$ pg_encoding EUC_JP
|
To enable the automatic encoding translation, you have to tell
|
||||||
1
|
PostgreSQL the encoding you would like to use in frontend. There are
|
||||||
|
several ways to accomplish this.
|
||||||
|
|
||||||
3. PGCLIENTENCODING
|
o using \encoding command in psql
|
||||||
|
|
||||||
If an environment variable PGCLIENTENCODING is defined on the
|
\encoding allows you to change frontend encoding on the fly. For
|
||||||
frontend, automatic encoding translation is done by the backend. For
|
example, to change the encoding to SJIS, type:
|
||||||
example, if the backend has been compiled with MB=EUC_JP and
|
|
||||||
PGCLIENTENCODING=SJIS(Shift JIS: yet another Japanese encoding
|
|
||||||
system), then any SJIS strings coming from the frontend would be
|
|
||||||
translated to EUC_JP before going into the parser. Outputs from the
|
|
||||||
backend would be translated to SJIS of course.
|
|
||||||
|
|
||||||
Supported encodings for PGCLIENTENCODING are:
|
\encoding SJIS
|
||||||
|
|
||||||
SQL_ASCII ASCII
|
o using libpq functions
|
||||||
EUC_JP Japanese EUC
|
|
||||||
SJIS Yet another Japanese encoding
|
|
||||||
EUC_CN Chinese EUC
|
|
||||||
EUC_KR Korean EUC
|
|
||||||
EUC_TW Taiwan EUC
|
|
||||||
BIG5 Traditional Chinese
|
|
||||||
MULE_INTERNAL Mule internal
|
|
||||||
LATIN1 ISO 8859-1 English and some European languages
|
|
||||||
LATIN2 ISO 8859-2 English and some European languages
|
|
||||||
LATIN3 ISO 8859-3 English and some European languages
|
|
||||||
LATIN4 ISO 8859-4 English and some European languages
|
|
||||||
LATIN5 ISO 8859-5 English and some European languages
|
|
||||||
KOI8 KOI8-R
|
|
||||||
WIN Windows CP1251
|
|
||||||
ALT Windows CP866
|
|
||||||
WIN1250 Windows CP1250 (Czech)
|
|
||||||
|
|
||||||
Note that UNICODE is not supported(yet). Also note that the
|
\encoding actually calls PQsetClientEncoding() for its purpose.
|
||||||
translation is not always possible. Suppose you choose EUC_JP for the
|
|
||||||
backend, LATIN1 for the frontend, then some Japanese characters cannot
|
|
||||||
be translated into latin. In this case, a letter cannot be represented
|
|
||||||
in the Latin character set, would be transformed as:
|
|
||||||
|
|
||||||
(HEXA DECIMAL)
|
int PQsetClientEncoding(PGconn *conn, const char *encoding)
|
||||||
|
|
||||||
3. SET CLIENT_ENCODING TO command
|
conn is a connection to the backend, and encoding is an encoding you
|
||||||
|
want to use. If it successfully sets the encoding, it returns 0,
|
||||||
|
otherwise -1. The current encoding for this connection can be shown by
|
||||||
|
using:
|
||||||
|
|
||||||
Actually setting the frontend side encoding information is done by a
|
int PQclientEncoding(const PGconn *conn)
|
||||||
new command:
|
|
||||||
|
Note that it returns the "encoding id," not the encoding symbol string
|
||||||
|
such as "EUC_JP." To convert an encoding id to an encoding symbol, you
|
||||||
|
can use:
|
||||||
|
|
||||||
|
char *pg_encoding_to_char(int encoding_id)
|
||||||
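To make the id/name relationship concrete, here is a small stand-alone sketch of such a mapping. It is not the libpq function itself: sketch_encoding_to_char and sketch_char_to_encoding are hypothetical names, and the id values are only those that can be read off the removed side of this diff (pg_encoding 1 gives EUC_JP, and the old pg_database listing pairs the databases euc_cn, euc_kr, unicode and mule_internal with 2, 3, 5 and 6). That each of those databases uses the encoding it is named after is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical id/name pairs, inferred from the old README text in
 * this diff. Ids that the diff does not show are deliberately left out. */
struct encoding_entry {
    int id;
    const char *name;
};

static const struct encoding_entry known_encodings[] = {
    {1, "EUC_JP"},
    {2, "EUC_CN"},
    {3, "EUC_KR"},
    {5, "UNICODE"},
    {6, "MULE_INTERNAL"},
};

#define N_KNOWN (sizeof known_encodings / sizeof known_encodings[0])

/* Id to name, in the spirit of pg_encoding_to_char().
 * Returns NULL for ids this sketch does not know about. */
const char *sketch_encoding_to_char(int encoding_id)
{
    for (size_t i = 0; i < N_KNOWN; i++) {
        if (known_encodings[i].id == encoding_id)
            return known_encodings[i].name;
    }
    return NULL;
}

/* Name to id, the direction the old pg_encoding command offered when
 * given a non-numeric argument. Returns -1 for unknown names. */
int sketch_char_to_encoding(const char *name)
{
    assert(name != NULL);

    for (size_t i = 0; i < N_KNOWN; i++) {
        if (strcmp(known_encodings[i].name, name) == 0)
            return known_encodings[i].id;
    }
    return -1;
}
```

In a real client the id would come from PQclientEncoding(conn) and the name from pg_encoding_to_char(); the table above only stands in for that so the example can be compiled without libpq.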
|
|
||||||
|
o using PGCLIENTENCODING
|
||||||
|
|
||||||
|
If an environment variable PGCLIENTENCODING is defined in the
|
||||||
|
frontend, an automatic encoding translation is done by the backend.
|
||||||
|
|
||||||
|
o using SET CLIENT_ENCODING TO command
|
||||||
|
|
||||||
|
Setting the frontend side encoding can be done with a SQL command:
|
||||||
|
|
||||||
SET CLIENT_ENCODING TO 'encoding';
|
SET CLIENT_ENCODING TO 'encoding';
|
||||||
|
|
||||||
where encoding is one of the encodings those can be set to
|
Also you can use SQL92 syntax "SET NAMES" for this purpose:
|
||||||
PGCLIENTENCODING. Also you can use SQL92 syntax "SET NAMES" for this
|
|
||||||
purpose:
|
|
||||||
|
|
||||||
SET NAMES 'encoding';
|
SET NAMES 'encoding';
|
||||||
|
|
||||||
@@ -158,10 +165,21 @@ To return to the default encoding:
|
|||||||
|
|
||||||
RESET CLIENT_ENCODING;
|
RESET CLIENT_ENCODING;
|
||||||
|
|
||||||
This would reset the frontend encoding to same as the backend
|
4. About Unicode
|
||||||
encoding, thus no encoding translation would be performed.
|
|
||||||
|
|
||||||
4. References
|
An automatic encoding translation between Unicode and any other
|
||||||
|
encoding is not supported (yet).
|
||||||
|
|
||||||
|
5. What happens if the translation is not possible?
|
||||||
|
|
||||||
|
Suppose you choose EUC_JP for the backend, LATIN1 for the frontend,
|
||||||
|
then some Japanese characters could not be translated into LATIN1. In
|
||||||
|
this case, a letter that cannot be represented in the LATIN1 character set
|
||||||
|
would be transformed as:
|
||||||
|
|
||||||
|
(HEXA DECIMAL)
|
||||||
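The "(HEXA DECIMAL)" placeholder suggests that an untranslatable character is replaced by its byte values written in hexadecimal between parentheses. A minimal sketch of that idea follows. The function name format_untranslatable is hypothetical, and the exact output format (two lowercase hex digits per byte, no separators) is an assumption rather than something the README states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Writes the bytes of one untranslatable character as "(hh..hh)" into
 * `out` and returns the number of characters written, not counting the
 * terminating NUL. Assumed format: two lowercase hex digits per byte. */
size_t format_untranslatable(const unsigned char *bytes, size_t len,
                             char *out, size_t outsize)
{
    size_t needed = 2 * len + 2;   /* '(' + two digits per byte + ')' */

    /* The buffer must also have room for the terminating NUL. */
    assert(bytes != NULL && out != NULL && outsize > needed);

    char *p = out;
    *p++ = '(';
    for (size_t i = 0; i < len; i++)
        p += sprintf(p, "%02x", (unsigned int) bytes[i]);
    *p++ = ')';
    *p = '\0';

    return needed;
}
```

Under these assumptions, a two-byte character with the byte values 0xb0 0xa1 that has no LATIN1 equivalent would be written as (b0a1).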
|
|
||||||
|
6. References
|
||||||
|
|
||||||
These are good sources to start learning various kind of encoding
|
These are good sources to start learning various kind of encoding
|
||||||
systems.
|
systems.
|
||||||
@@ -178,6 +196,16 @@ Unicode: http://www.unicode.org/
|
|||||||
|
|
||||||
5. History
|
5. History
|
||||||
|
|
||||||
|
Mar 22, 2000
|
||||||
|
* Add new libpq functions PQsetClientEncoding, PQclientEncoding
|
||||||
|
* ./configure --with-mb=EUC_JP
|
||||||
|
is now deprecated. Use
|
||||||
|
./configure --enable-multibyte=EUC_JP
|
||||||
|
instead
|
||||||
|
* Add SQL_ASCII regression test case
|
||||||
|
* Add SJIS User Defined Character (UDC) support
|
||||||
|
* All of the above will appear in 7.0
|
||||||
|
|
||||||
July 11, 1999
|
July 11, 1999
|
||||||
* Add support for WIN1250 (Windows Czech) as a client encoding
|
* Add support for WIN1250 (Windows Czech) as a client encoding
|
||||||
(contributed by Pavel Behal)
|
(contributed by Pavel Behal)
|
||||||
|