postgres

Author	SHA1	Message	Date
Michael Paquier	59f47fb98d	unaccent: Add support for quoted translated characters As reported in bug #18057, the extension unaccent removes in its rule file whitespace characters that are intentionally specified when building unaccent.rules from UnicodeData.txt, causing an incorrect translation for some characters like numeric symbols. This is caused by the fact that all whitespaces before and after the origin and target characters are all discarded (this limitation is documented). This commit makes possible the use of quotes around target characters, so as whitespaces can be considered part of target characters. Some target characters use a double quote, these require an extra double quote. The documentation is updated to show how to use quoted areas, generate_unaccent_rules.py is updated to generate unaccent.rules and a couple of tests are added for numeric symbols. While working on this patch, I have implemented a fake rule file to test the parsing logic implemented, which is not included here as it would just consume extra cycles in the tests, and it requires the manipulation of an installation tree to be able to work correctly. As this requires a change of format in unaccent.rules, this cannot be backpatched, unfortunately. The idea to use double quotes as escaped characters comes from Tom Lane. Reported-by: Martin Schlossarek Author: Michael Paquier Discussion: https://postgr.es/m/18057-62712cad01bd202c@postgresql.org	2023-09-20 12:29:36 +09:00
Michael Paquier	44e73a498c	Fix regression tests of unaccent to work without UTF8 support The tests of unaccent rely on UTF8 characters, and unlike any other test suite in the tree (fuzzystrmatch, citext, hstore, etc.), they would fail if run on a database that does not support UTF8 encoding. This commit fixes the tests of unaccent so as these are skipped when run on a database without UTF8 support, using the same method as the other test suits based on \if, getdatabaseencoding() and an alternate output file. This has been broken for a long time, but nobody has complained about that either, so no backpatch is done. This can be reproduced with something like REGRESS_OPTS="--no-locale --encoding=sql_ascii", for instance. To defend against that, this module's Makefile and meson.build enforced a UTF8 encoding without locales, but it did not offer protection for options given by REGRESS_OPTS. This switch makes this regression test suite more consistent with all the others, as well. Reviewed-by: Peter Eisentraut Discussion: https://postgr.es/m/ZIq1HUnIV2ksW85x@paquier.xyz	2023-07-04 08:05:00 +09:00
Jeff Davis	f413941f41	Fix t_isspace(), etc., when datlocprovider=i and datctype=C. Check whether the datctype is C to determine whether t_isspace() and related functions use isspace() or iswspace(). Previously, t_isspace() checked whether the database default collation was C; which is incorrect when the default collation uses the ICU provider. Discussion: https://postgr.es/m/79e4354d9eccfdb00483146a6b9f6295202e7890.camel@j-davis.com Reviewed-by: Peter Eisentraut Backpatch-through: 15	2023-03-17 12:08:46 -07:00
Jeff Davis	27b62377b4	Use ICU by default at initdb time. If the ICU locale is not specified, initialize the default collator and retrieve the locale name from that. Discussion: https://postgr.es/m/510d284759f6e943ce15096167760b2edcb2e700.camel@j-davis.com Reviewed-by: Peter Eisentraut	2023-03-09 10:52:41 -08:00
Michael Paquier	e3dd7c06e6	Simplify a bit the special rules generating unaccent.rules As noted by Thomas Munro, CLDR 36 has added SOUND RECORDING COPYRIGHT (U+2117), and we use CLDR 41, so this can be removed from the set of special cases. The set of regression tests is expanded for degree signs, which are two of the special cases, and a fancy case with U+210C in Latin-ASCII.xml that we have discovered about when diving into what could be done for Cyrillic characters (this last part is material for a future patch, not tackled yet). While on it, some of the assertions of generate_unaccent_rules.py are expanded to report the codepoint on which a failure is found, something useful for debugging. Extracted from a larger patch by the same author. Author: Przemysław Sztoch Discussion: https://postgr.es/m/8478da0d-3b61-d24f-80b4-ce2f5e971c60@sztoch.pl	2022-07-05 16:17:51 +09:00
Thomas Munro	456e3718e7	Add combining characters to unaccent.rules. Strip certain classes of combining characters, so that accents encoded this way are removed. Author: Hugh Ranalli Discussion: https://postgr.es/m/15548-cef1b3f8de190d4f%40postgresql.org	2019-02-01 15:23:01 +01:00
Michael Paquier	e1c1d5444e	Update unaccent rules with release 34 of CLDR for Latin-ASCII.xml This has required an update of the python script generating the rules, as its format has changed in release 29. This release has also added new punctuation and symbols, and a new set of rules has been generated to include them. The way to find newest versions of Latin-ASCII gets also more clearly documented. Author: Hugh Ranalli, Michael Paquier Discussion: https://postgr.es/m/15548-cef1b3f8de190d4f@postgresql.org	2019-01-10 14:10:21 +09:00
Peter Eisentraut	b6f3649bba	Convert unaccent tests to UTF-8 This makes it easier to add new tests that are specific to Unicode features. The files were previously in KOI8-R. Discussion: https://www.postgresql.org/message-id/8506.1545111362@sss.pgh.pa.us	2019-01-02 18:36:05 +01:00
Tom Lane	629b3af27d	Convert contrib modules to use the extension facility. This isn't fully tested as yet, in particular I'm not sure that the "foo--unpackaged--1.0.sql" scripts are OK. But it's time to get some buildfarm cycles on it. sepgsql is not converted to an extension, mainly because it seems to require a very nonstandard installation process. Dimitri Fontaine and Tom Lane	2011-02-13 22:54:49 -05:00
Tom Lane	4b98b613f6	Print the actual DB encoding in the unaccent regression test. This is to help make it more obvious what the problem is, if the encoding isn't what the test expects.	2009-08-18 16:00:50 +00:00
Teodor Sigaev	92e05bc6a5	Unaccent dictionary.	2009-08-18 10:34:39 +00:00

11 Commits