README.mb has been unified into SGML documents.
parent 07c741e61c
commit 31a81ea8ec

doc/README.mb
@@ -1,325 +0,0 @@
PostgreSQL 7.0 multi-byte (MB) support README        Mar 22 2000

                                Tatsuo Ishii
                                ishii@postgresql.org
                http://www.sra.co.jp/people/t-ishii/PostgreSQL/

0. Introduction

The MB support is intended to allow PostgreSQL to handle multi-byte
character sets such as EUC (Extended Unix Code), Unicode and Mule
internal code. With MB enabled you can use multi-byte character sets
in regexp, LIKE and some other functions. The default encoding system
is chosen when you initialize your PostgreSQL installation using
initdb(1). Note that this can be overridden when you create a database
using createdb(1) or the CREATE DATABASE SQL command, so you can have
multiple databases, each with a different encoding system.

MB also fixes some problems concerning 8-bit single-byte character
sets, including ISO 8859. (I would not say that all problems have been
fixed. I have only confirmed that the regression tests ran fine and
that a few French characters could be used with the patch. Please let
me know if you find any problem while using 8-bit characters.)

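To illustrate why multi-byte awareness matters for functions that
count or match characters, here is a minimal sketch in Python. It is
not part of PostgreSQL; the sample string and the use of Python's
built-in codecs as stand-ins for the database encodings are assumptions
made purely for illustration. It shows that the number of bytes and the
number of characters differ once a multi-byte encoding is in use:

```python
# Illustration only: Python's codecs stand in for the database encodings.
text = "日本語"  # three Japanese characters (hypothetical sample)

euc_jp_bytes = text.encode("euc_jp")  # EUC_JP representation
utf8_bytes = text.encode("utf-8")     # UNICODE (UTF-8) representation

char_count = len(text)                # number of characters
euc_jp_octets = len(euc_jp_bytes)     # number of bytes in EUC_JP
utf8_octets = len(utf8_bytes)         # number of bytes in UTF-8

print(char_count, euc_jp_octets, utf8_octets)
```

A function that treats each byte as one character would report the
byte counts rather than the character count.
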
1. How to use

Run configure with the multibyte option:

    % ./configure --enable-multibyte[=encoding_system]

where encoding_system is one of:

    SQL_ASCII       ASCII
    EUC_JP          Japanese EUC
    EUC_CN          Chinese EUC
    EUC_KR          Korean EUC
    EUC_TW          Taiwan EUC
    UNICODE         Unicode (UTF-8)
    MULE_INTERNAL   Mule internal
    LATIN1          ISO 8859-1 English and some European languages
    LATIN2          ISO 8859-2 English and some European languages
    LATIN3          ISO 8859-3 English and some European languages
    LATIN4          ISO 8859-4 English and some European languages
    LATIN5          ISO 8859-5 (Latin/Cyrillic)
    KOI8            KOI8-R
    WIN             Windows CP1251
    ALT             Windows CP866

Example:

    % ./configure --enable-multibyte=EUC_JP

If the encoding system is omitted (./configure --enable-multibyte),
SQL_ASCII is assumed.

2. How to set the encoding

The initdb command defines the default encoding for a PostgreSQL
installation. For example:

    % initdb -E EUC_JP

sets the default encoding to EUC_JP (Extended Unix Code for Japanese).
Note that you can use "--encoding" instead of "-E" if you prefer the
longer option string :-) If no -E or --encoding option is given, the
encoding specified at compile time is used.

You can create a database with a different encoding:

    % createdb -E EUC_KR korean

will create a database named "korean" with the EUC_KR encoding.
Another way to accomplish this is to use an SQL command:

    CREATE DATABASE korean WITH ENCODING = 'EUC_KR';

The encoding of a database is recorded in the "encoding" column of the
pg_database system catalog. You can see it with the -l option or the
\l command of psql.

    $ psql -l
                List of databases
       Database    |  Owner  |   Encoding
    ---------------+---------+---------------
     euc_cn        | t-ishii | EUC_CN
     euc_jp        | t-ishii | EUC_JP
     euc_kr        | t-ishii | EUC_KR
     euc_tw        | t-ishii | EUC_TW
     mule_internal | t-ishii | MULE_INTERNAL
     regression    | t-ishii | SQL_ASCII
     template1     | t-ishii | EUC_JP
     test          | t-ishii | EUC_JP
     unicode       | t-ishii | UNICODE
    (9 rows)

3. Automatic encoding translation between backend and frontend

PostgreSQL supports automatic encoding translation between the backend
and the frontend for some encodings.

    encoding of backend     available encoding of frontend
    --------------------------------------------------------------------
    EUC_JP                  EUC_JP, SJIS

    EUC_TW                  EUC_TW, BIG5

    LATIN2                  LATIN2, WIN1250

    LATIN5                  LATIN5, WIN, ALT

    MULE_INTERNAL           EUC_JP, SJIS, EUC_KR, EUC_CN,
                            EUC_TW, BIG5, LATIN1 to LATIN5,
                            WIN, ALT, WIN1250

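As a rough illustration of what such a translation involves, the
following Python sketch re-encodes the same text from a backend
encoding to a frontend encoding and back. It uses Python's standard
codecs as stand-ins and is not the backend's actual conversion routine;
the helper name translate() and the mapping of EUC_JP to "euc_jp" and
SJIS to "shift_jis" are assumptions made for the example.

```python
# Illustration only: Python codec names are assumed equivalents of the
# PostgreSQL encoding names (EUC_JP -> "euc_jp", SJIS -> "shift_jis").
def translate(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from the source encoding into the target encoding."""
    return data.decode(source).encode(target)

stored = "日本語".encode("euc_jp")               # bytes as held in an EUC_JP database
sent = translate(stored, "euc_jp", "shift_jis")  # bytes an SJIS frontend would see
back = translate(sent, "shift_jis", "euc_jp")    # the reverse direction

print(stored.hex(), sent.hex(), back == stored)
```

The text is the same on both sides; only its byte representation
changes, which is why the frontend has to tell the backend which
encoding it expects.
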
To enable the automatic encoding translation, you have to tell
PostgreSQL the encoding you would like to use in the frontend. There
are several ways to accomplish this.

o using the \encoding command in psql

\encoding allows you to change the frontend encoding on the fly. For
example, to change the encoding to SJIS, type:

    \encoding SJIS

o using libpq functions

\encoding actually calls PQsetClientEncoding() for this purpose.

    int PQsetClientEncoding(PGconn *conn, const char *encoding)

conn is a connection to the backend, and encoding is the encoding you
want to use. If it successfully sets the encoding, it returns 0,
otherwise -1. The current encoding for this connection can be obtained
by using:

    int PQclientEncoding(const PGconn *conn)

Note that it returns the "encoding id", not the encoding symbol string
such as "EUC_JP". To convert an encoding id to an encoding symbol, you
can use:

    char *pg_encoding_to_char(int encoding_id)

o using PGCLIENTENCODING

If the environment variable PGCLIENTENCODING is defined in the
frontend's environment, the backend performs automatic encoding
translation for that frontend encoding.

o using the SET CLIENT_ENCODING TO command

The frontend encoding can also be set with an SQL command:

    SET CLIENT_ENCODING TO 'encoding';

You can also use the SQL92 syntax "SET NAMES" for this purpose:

    SET NAMES 'encoding';

To query the current frontend encoding:

    SHOW CLIENT_ENCODING;

To return to the default encoding:

    RESET CLIENT_ENCODING;

4. About Unicode

Automatic encoding translation between Unicode and any other encoding
is not supported (yet).

5. What happens if the translation is not possible?

Suppose you choose EUC_JP for the backend and LATIN1 for the frontend.
Some Japanese characters then cannot be translated into LATIN1. In
this case, a character that cannot be represented in the LATIN1
character set is transformed as:

    (HEXADECIMAL)

6. References

These are good sources to start learning about various kinds of
encoding systems.

ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
    Detailed explanations of EUC_JP, EUC_CN, EUC_KR and EUC_TW
    appear in section 3.2.

Unicode: http://www.unicode.org/
    The home page of Unicode.

RFC 2044
    UTF-8 is defined here.

7. History

May 20, 2000
    * SJIS UDC (NEC selection IBM kanji) support contributed
      by Eiji Tokuya
    * Changes above will appear in 7.0.1

Mar 22, 2000
    * Add new libpq functions PQsetClientEncoding, PQclientEncoding
    * ./configure --with-mb=EUC_JP
      is now deprecated. Use
      ./configure --enable-multibyte=EUC_JP
      instead
    * Add SQL_ASCII regression test case
    * Add SJIS User Defined Character (UDC) support
    * All of the above will appear in 7.0

July 11, 1999
    * Add support for WIN1250 (Windows Czech) as a client encoding
      (contributed by Pavel Behal)
    * Fix some compiler warnings (contributed by Tomoaki Nishiyama)

Mar 23, 1999
    * Add support for KOI8 (KOI8-R), WIN (CP1251), ALT (CP866)
      (thanks to Oleg Broytmann for testing)
    * Fix problem with MB and locale

Jan 26, 1999
    * Add support for Big5 as a frontend encoding
      (you need to create a database with EUC_TW to use Big5)
    * Add regression test case for EUC_TW
      (contributed by Jonah Kuo <jonahkuo@mail.ttn.com.tw>)

Dec 15, 1998
    * Bugs related to SQL_ASCII support fixed

Nov 5, 1998
    * 6.4 release. In this version, pg_database has an "encoding"
      column that represents the database encoding

Jul 22, 1998
    * determine encoding at initdb/createdb rather than compile time
    * support for PGCLIENTENCODING when issuing COPY command
    * support for SQL92 syntax "SET NAMES"
    * support for LATIN2-5
    * add UNICODE regression test case
    * new test suite for MB
    * clean up source files

Jun 5, 1998
    * add support for the encoding translation between the backend
      and the frontend
    * new command SET CLIENT_ENCODING etc. added
    * add support for LATIN1 character set
    * enhance 8-bit cleanness

April 21, 1998 some enhancements/fixes
    * character_length(), position(), substring() are now aware of
      multi-byte characters
    * add octet_length()
    * add --with-mb option to configure
    * new regression tests for EUC_KR
      (contributed by "Soonmyung. Hong" <hong@lunaris.hanmesoft.co.kr>)
    * add some test cases to the EUC_JP regression test
    * fix problem in regress/regress.sh in case of System V
    * fix toupper(), tolower() to handle 8-bit chars

Mar 25, 1998 MB PL2 is incorporated into PostgreSQL 6.3.1

Mar 10, 1998 PL2 released
    * add regression test for EUC_JP, EUC_CN and MULE_INTERNAL
    * add an English document (this file)
    * fix problems concerning 8-bit single-byte characters

Mar 1, 1998 PL1 released

Appendix:

[Here is a good document from Pavel Behal explaining how to use
WIN1250 on Windows/ODBC. Please note that Installation step 1) is not
necessary in 6.5.1 -- Tatsuo]

Version: 0.91 for PgSQL 6.5
Author: Pavel Behal
Revised by: Tatsuo Ishii
Email: behal@opf.slu.cz
Licence: The same as PostgreSQL

Sorry for my English and C code, I'm not native :-)

!!!!!!!!!!!!!!!!!!!!!!!!! NO WARRANTY !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Installation:
-------------
1) Change three affected files in the source directories
   (I don't have time to create proper patch diffs, I don't know how)
2) Compile with locale enabled and multibyte set to LATIN2
3) Set up your installation properly; do not forget to create the
   locale variables in your profile (environment). Ex. (may not be
   exactly true):
        LC_ALL=cs_CZ.ISO8859-2
        LC_COLLATE=cs_CZ.ISO8859-2
        LC_CTYPE=cs_CZ.ISO8859-2
        LC_MONETARY=cs_CZ.ISO8859-2
        LC_NUMERIC=cs_CZ.ISO8859-2
        LC_TIME=cs_CZ.ISO8859-2
4) You have to start the postmaster with the locales set!
5) Try it with the Czech language; it has to sort correctly
6) Install the ODBC driver for PgSQL into your M$ Windows
7) Set up your data source properly. Include this line in the
   "Connect Settings:" field of your ODBC configuration dialog:
        SET CLIENT_ENCODING = 'WIN1250';
8) Now try it again, but in Windows with ODBC.

Description:
------------
- Depends on proper system locales; tested with RH6.0 and Slackware 3.6,
  with the cs_CZ.iso8859-2 locale
- Never try to set the server multibyte database encoding to WIN1250;
  always use LATIN2 instead. There is no WIN1250 locale in Unix
- The WIN1250 encoding is usable only for M$W ODBC clients. The
  characters are recoded on the fly so that they are displayed and
  stored back properly

Important:
----------
- It changes your sort order depending on your LC_... settings, so
  don't be confused by the regression tests; they don't use locale
- "ch" is sorted correctly only in some newer locales (Ex. RH6.0)
- You have to insert money as '162,50' (with a comma, in apostrophes!)
- Not tested properly
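
To see why on-the-fly recoding between LATIN2 on the server and WIN1250
on the Windows client is needed, the following Python sketch compares
the byte values of a few Czech letters in both encodings. It is an
illustration only: Python's "iso8859_2" and "cp1250" codecs stand in
for LATIN2 and WIN1250, and the helper names byte_values() and
to_client() are hypothetical.

```python
# Illustration only: "iso8859_2" and "cp1250" are Python's codecs for
# ISO 8859-2 (LATIN2) and Windows code page 1250 (WIN1250).
def byte_values(ch: str) -> tuple:
    """Return the (LATIN2, WIN1250) byte representations of a character."""
    return ch.encode("iso8859_2"), ch.encode("cp1250")

def to_client(data: bytes) -> bytes:
    """Recode LATIN2 bytes from the server into WIN1250 for the client."""
    return data.decode("iso8859_2").encode("cp1250")

for ch in "čšž":
    latin2, win1250 = byte_values(ch)
    status = "same" if latin2 == win1250 else "differs"
    print(ch, latin2.hex(), win1250.hex(), status)
```

Some letters share a code in both encodings while others do not, so
text sent to a Windows client without recoding would be displayed only
partly correctly.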