Previously we searched for code points where the Unicode data file
listed an equivalent combining character sequence that added accents.
Some code points redirect to a single other code point instead of
listing a combining sequence. We can follow those references
recursively to get the answer.
Per bug report #18362, which reported missing Ancient Greek
characters: precomposed characters with oxia (from the polytonic
accent system used for ancient Greek) just point to precomposed
characters with tonos (from the monotonic accent system for modern
Greek), and we have to follow the extra hop to find out that they are
composed with an acute accent.
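A minimal sketch of that recursion in Python, assuming a pre-parsed
dict mapping each code point to the code points of its UnicodeData.txt
decomposition field (illustrative names, not the script's actual
structures):

# Tiny sample of decompositions, pre-parsed from UnicodeData.txt:
decomposition = {
    0x1F71: [0x03AC],          # alpha with oxia -> alpha with tonos
    0x03AC: [0x03B1, 0x0301],  # alpha with tonos -> alpha + acute accent
}

def follow_singletons(codepoint):
    decomp = decomposition.get(codepoint, [])
    if len(decomp) == 1:
        return follow_singletons(decomp[0])  # follow the extra hop
    return codepoint

# follow_singletons(0x1F71) returns 0x03AC, whose two-code-point
# decomposition then yields the acute accent as before.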
Besides those, the new rule also:
* pulls in a lot of 'Mathematical Alphanumeric Symbols', which are
copies of the Latin and Greek alphabets and numbers rendered
in different typefaces, and
* corrects a single mathematical letter that previously came from the
CLDR transliteration file but that the new rule extracts from the main
Unicode database file; the latter is clearly right and the former
wrong (reported to CLDR).
Reported-by: Cees van Zeeland <cees.van.zeeland@freedom.nl>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/18362-be6d0cfe122b6354%40postgresql.org
As coded, the module's Makefile would fail to set a value for PYTHON,
as it only checked whether the variable was defined. When compiling
without --with-python, PYTHON is defined but set to an empty value, so
the existing check was not able to do its work.
This commit switches the rule to check whether the value is empty
rather than defined, allowing the generation of unaccent.rules even if
--with-python is not used, as long as "python" exists. BISON and FLEX
are handled the same way in pgxs.mk, for instance.
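For illustration, the same defined-versus-empty distinction sketched
in Python (an analogy for the Makefile logic, not the actual change):

import os
import shutil

# A defined-but-empty PYTHON must be treated like an unset one, so
# test for emptiness rather than mere definedness.
python = os.environ.get("PYTHON") or shutil.which("python")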
Thinko in f85a485f89.
Author: Japin Li
Discussion: https://postgr.es/m/MEYP282MB1669F86C0DC7B4DC48489CB0B6C3A@MEYP282MB1669.AUSP282.PROD.OUTLOOK.COM
Backpatch-through: 13
This led to an overestimation of the size allocated for both the quoted
and non-quoted cases, while using an inconsistent style. Thinkos in
59f47fb98d.
Per report from Coverity, with extra input from Tom Lane.
As reported in bug #18057, the unaccent extension strips from its
rule file whitespace characters that are intentionally specified when
building unaccent.rules from UnicodeData.txt, causing an incorrect
translation for some characters like numeric symbols. This is caused
by the fact that all whitespace before and after the source and target
characters is discarded (a documented limitation).
This commit makes it possible to use quotes around target characters,
so that whitespace can be treated as part of the target. Target
characters that themselves include a double quote require the quote to
be doubled.
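A minimal sketch of the convention for the target field, under the
assumption of a simple source-then-target line format (not the
extension's actual parser):

def parse_target(field):
    # A quoted target keeps its whitespace; a literal double quote
    # inside it is written as two double quotes.
    if field.startswith('"') and field.endswith('"'):
        return field[1:-1].replace('""', '"')
    return field

# parse_target('" 1/4"') == ' 1/4'  (leading space preserved)
# parse_target('""""')   == '"'     (doubled quote escapes itself)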
The documentation is updated to show how to use quoted areas,
generate_unaccent_rules.py is updated to emit the new format of
unaccent.rules, and a couple of tests are added for numeric symbols.
While working on this patch, I implemented a fake rule file to test
the parsing logic; it is not included here, as it would just consume
extra cycles in the tests and requires manipulating an installation
tree to work correctly.
As this requires a change of format in unaccent.rules, this cannot be
backpatched, unfortunately. The idea to use double quotes as escaped
characters comes from Tom Lane.
Reported-by: Martin Schlossarek
Author: Michael Paquier
Discussion: https://postgr.es/m/18057-62712cad01bd202c@postgresql.org
The tests of unaccent rely on UTF8 characters and, unlike any other
test suite in the tree (fuzzystrmatch, citext, hstore, etc.), they
would fail if run on a database whose encoding is not UTF8.
This commit fixes the tests of unaccent so that they are skipped when
run on a database without UTF8 support, using the same method as the
other test suites based on \if, getdatabaseencoding() and an alternate
output file.
This has been broken for a long time, but nobody has complained about
it either, so no backpatch is done. The failure can be reproduced with
something like REGRESS_OPTS="--no-locale --encoding=sql_ascii", for
instance. To defend against that, this module's Makefile and
meson.build enforced a UTF8 encoding without locales, but that did not
offer protection against options given through REGRESS_OPTS. This
switch also makes the regression test suite more consistent with all
the others.
Reviewed-by: Peter Eisentraut
Discussion: https://postgr.es/m/ZIq1HUnIV2ksW85x@paquier.xyz
Specify --no-locale and --encoding=UTF8 to be consistent with the
Makefile, which specifies NO_LOCALE=1. This fixes the test for some
locales when meson is used and ICU is disabled. May have been an
oversight in e6927270cd.
Also switch argument order in unaccent/meson.build to make it
consistent in style.
Discussion: https://postgr.es/m/CABwTF4Wz41pNMJ9q3tpH=6mnvg6aopDU5Lzvers5=6=WJVekww@mail.gmail.com
Author: Gurjeet Singh
Author: Jeff Davis
Check whether the datctype is C to determine whether t_isspace() and
related functions use isspace() or iswspace().
Previously, t_isspace() checked whether the database default
collation was C, which is incorrect when the default collation uses
the ICU provider.
Discussion: https://postgr.es/m/79e4354d9eccfdb00483146a6b9f6295202e7890.camel@j-davis.com
Reviewed-by: Peter Eisentraut
Backpatch-through: 15
Previously, the default encoding was derived from the locale when
using libc, while it was always UTF-8 when using ICU. That would throw
an error when the locale was not compatible with UTF-8.
This commit causes initdb to derive the default encoding from the
locale for both providers. If --no-locale is specified (or if the
locale is C or POSIX), the default encoding will be UTF-8 for ICU
(because ICU does not support SQL_ASCII) and SQL_ASCII for libc.
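Sketched in Python for illustration (the helper and its parsing are
simplified assumptions, not initdb's code):

def encoding_from_locale(loc):
    # Grossly simplified stand-in for the real locale-to-encoding lookup.
    return loc.split(".", 1)[1] if "." in loc else "SQL_ASCII"

def default_encoding(provider, loc):
    if loc in (None, "C", "POSIX"):     # covers --no-locale as well
        # ICU does not support SQL_ASCII, so it gets UTF-8.
        return "UTF8" if provider == "icu" else "SQL_ASCII"
    return encoding_from_locale(loc)    # same derivation for both providers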
Per buildfarm failure on system "hoverfly" related to commit
27b62377b4.
Discussion: https://postgr.es/m/d191d5841347301a8f1238f609471ddd957fc47e.camel%40j-davis.com
The generated resource files aren't exactly the same as the ones the
old build systems generate. Previously "InternalName" and
"OriginalFileName" were mostly wrong or not set (despite being
required), but that was hard to fix in at least the make build.
Additionally, the meson build falls back to an "auto-generated"
description when one is not set, and doesn't set it in a few cases;
it's unlikely that anybody looks at these descriptions in detail.
Author: Andres Freund <andres@anarazel.de>
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Autoconf is showing its age; fewer and fewer contributors know how to
wrangle it. Recursive make has a lot of hard-to-resolve dependency
issues and slow incremental rebuilds. Our home-grown MSVC build system
is hard to maintain for developers not using Windows and runs tests
serially. While these and other
issues could individually be addressed with incremental improvements, together
they seem best addressed by moving to a more modern build system.
After evaluating different build system choices, we chose to use meson, to a
good degree based on the adoption by other open source projects.
We decided that it's more realistic to commit a relatively early version of
the new build system and mature it in tree.
This commit adds an initial version of a meson-based build system. It
supports building postgres on at least AIX, FreeBSD, Linux, macOS,
NetBSD, OpenBSD, Solaris and Windows (however, only gcc is supported
on AIX and Solaris). For Windows/MSVC, postgres can now be built with
ninja (faster, particularly for incremental builds) and msbuild
(supporting the Visual Studio GUI, but building more slowly).
Several aspects (e.g. Windows rc file generation, PGXS compatibility, LLVM
bitcode generation, documentation adjustments) are done in subsequent commits
requiring further review. Other aspects (e.g. not installing test-only
extensions) are not yet addressed.
When building on Windows with msbuild, builds are slower when using a
Visual Studio version older than 2019, because those versions do not
support MultiToolTask, which meson requires for intra-target
parallelism.
The plan is to remove the MSVC-specific build system in src/tools/msvc
soon after reaching feature parity. However, we're not planning to
remove the autoconf/make build system in the near future. We're likely
to keep at least the parts required for PGXS around until all
supported versions build with meson.
Some initial help for postgres developers is at
https://wiki.postgresql.org/wiki/Meson
With contributions from Thomas Munro, John Naylor, Stone Tickle and others.
Author: Andres Freund <andres@anarazel.de>
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Author: Peter Eisentraut <peter@eisentraut.org>
Reviewed-By: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Discussion: https://postgr.es/m/20211012083721.hvixq4pnh2pixr3j@alap3.anarazel.de
As noted by Thomas Munro, CLDR 36 has added SOUND RECORDING COPYRIGHT
(U+2117), and we use CLDR 41, so this can be removed from the set of
special cases.
The set of regression tests is expanded for the degree signs, which
are two of the special cases, and for a fancy case with U+210C in
Latin-ASCII.xml that we discovered while diving into what could be
done for Cyrillic characters (this last part is material for a future
patch, not tackled yet).
While on it, some of the assertions in generate_unaccent_rules.py are
expanded to report the codepoint on which a failure is found,
something useful for debugging.
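For instance, in the expanded style (an illustrative sketch, not the
script's exact assertions):

def assert_plain_letter(cp):
    # Report the offending code point instead of failing anonymously.
    # The range check is a gross simplification for the sketch.
    assert 0x0041 <= cp <= 0x024F, "codepoint not handled: U+%04X" % cp

assert_plain_letter(ord("a"))  # a bad input would now name itself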
Extracted from a larger patch by the same author.
Author: Przemysław Sztoch
Discussion: https://postgr.es/m/8478da0d-3b61-d24f-80b4-ce2f5e971c60@sztoch.pl
Apparently, the previous update
(2e0e066679) must have used a stale
input file and missed a few additions that were added shortly before
the CLDR release. Update this now so that the next update really only
changes things new in that version.
Andres Freund pointed out that allowing non-superusers to run
"CREATE EXTENSION ... FROM unpackaged" has security risks, since
the unpackaged-to-1.0 scripts don't try to verify that the existing
objects they're modifying are what they expect. Just attaching such
objects to an extension doesn't seem too dangerous, but some of them
do more than that.
We could have resolved this, perhaps, by still requiring superuser
privilege to use the FROM option. However, it's fair to ask just what
we're accomplishing by continuing to lug the unpackaged-to-1.0 scripts
forward. None of them have received any real testing since 9.1 days,
so they may not even work anymore (even assuming that one could still
load the previous "loose" object definitions into a v13 database).
And an installation that's trying to go from pre-9.1 to v13 or later
in one jump is going to have worse compatibility problems than whether
there's a trivial way to convert their contrib modules into extension
style.
Hence, let's just drop both those scripts and the core-code support
for "CREATE EXTENSION ... FROM".
Discussion: https://postgr.es/m/20200213233015.r6rnubcvl4egdh5r@alap3.anarazel.de
This allows these modules to be installed into a database without
superuser privileges (assuming that the DBA or sysadmin has installed
the module's files in the expected place). You only need CREATE
privilege on the current database, which by default would be
available to the database owner.
The following modules are marked trusted:
btree_gin
btree_gist
citext
cube
dict_int
earthdistance
fuzzystrmatch
hstore
hstore_plperl
intarray
isn
jsonb_plperl
lo
ltree
pg_trgm
pgcrypto
seg
tablefunc
tcn
tsm_system_rows
tsm_system_time
unaccent
uuid-ossp
In the future we might mark some more modules trusted, but there
seems to be no debate about these, and on the whole it seems wise
to be conservative with use of this feature to start out with.
Discussion: https://postgr.es/m/32315.1580326876@sss.pgh.pa.us
We currently have several sets of files generated from data provided
by Unicode. These all have ad hoc rules and instructions for updating
when new Unicode versions appear, and it's not done consistently.
This patch centralizes and automates the process and makes it part of
the release checklist. The Unicode and CLDR versions are specified in
Makefile.global.in. There is a new make target "update-unicode" that
downloads all the relevant files and runs the generation script.
There is also a new script for generating the table of combining
characters for ucs_wcwidth(). That table is now in a separate include
file rather than hardcoded into the middle of other code. This is
based on the script that was used for generating
d8594d123c, but the script itself wasn't
committed at that time.
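A sketch of how such a table can be derived from UnicodeData.txt (the
category choices here are a simplified assumption about the real
script):

# Collect combining (zero-width) code points and merge them into
# [first, last] ranges suitable for a lookup table.
ranges = []
with open("UnicodeData.txt") as f:
    for line in f:
        fields = line.split(";")
        cp, category = int(fields[0], 16), fields[2]
        if category in ("Mn", "Me"):        # nonspacing/enclosing marks
            if ranges and cp == ranges[-1][1] + 1:
                ranges[-1][1] = cp          # extend the current range
            else:
                ranges.append([cp, cp])     # start a new range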
Reviewed-by: John Naylor <john.naylor@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/c8d05f42-443e-6c23-819b-05b31759a37c@2ndquadrant.com
When maintaining or merging patches, one of the most common sources
of conflicts is the lists of objects in makefiles. Especially when the
split across lines has been changed on both sides, which is somewhat
common due to attempts to stay below 80 columns, those conflicts are
unnecessarily laborious to resolve.
By splitting OBJS-style lines into one object per line and sorting
them alphabetically, conflicts should be less frequent, and easier to
resolve when they still occur.
Author: Andres Freund
Discussion: https://postgr.es/m/20191029200901.vww4idgcxv74cwes@alap3.anarazel.de
As originally coded, the script would fail on Windows 10 with Python
3 because stdout was switched to UTF-8 only for Python 2. This patch
makes the switch apply to both versions.
Also add Python 2 compatibility markers so that we know what to remove
once we drop support for it, and use a "with" clause to ensure the
file descriptor is closed promptly.
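A sketch of both changes (simplified, not the script's exact code):

import codecs
import sys

# Wrap stdout as UTF-8 on both major versions; Python 3 exposes the
# underlying byte stream as sys.stdout.buffer.
if sys.version_info.major >= 3:
    sys.stdout = codecs.getwriter("UTF-8")(sys.stdout.buffer)
else:  # Python 2 compatibility; remove once support is dropped
    sys.stdout = codecs.getwriter("UTF-8")(sys.stdout)

# The context manager guarantees the file is closed promptly.
with open("Latin-ASCII.xml") as f:
    data = f.read()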
Author: Hugh Ranalli, Ramanarayana
Reviewed-by: Kyotaro Horiguchi
Discussion: https://postgr.es/m/CAKm4Xs7_61XMyOWmHs3n0mmkS0O4S0pvfWk=7cQ5P0gs177f7A@mail.gmail.com
Discussion: https://postgr.es/m/15548-cef1b3f8de190d4f@postgresql.org
This has required an update of the Python script generating the
rules, as the format of Latin-ASCII.xml changed in CLDR release 29.
This release has also added new punctuation and symbols, and a new set
of rules has been generated to include them. The way to find the
newest version of Latin-ASCII.xml is also documented more clearly.
Author: Hugh Ranalli, Michael Paquier
Discussion: https://postgr.es/m/15548-cef1b3f8de190d4f@postgresql.org
Previously tables declared WITH OIDS, including a significant fraction
of the catalog tables, stored the oid column not as a normal column,
but as part of the tuple header.
This special column was not shown by default, which was somewhat odd,
as it's often (consider e.g. pg_class.oid) one of the more important
parts of a row. Neither pg_dump nor COPY included the contents of the
oid column by default.
The fact that the oid column was not an ordinary column necessitated a
significant amount of special-case code to support oid columns. That
already was painful for the existing code, but upcoming work aiming to
make table storage pluggable would have required expanding and
duplicating that "specialness" significantly.
WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0).
Remove it.
Removing includes:
- CREATE TABLE and ALTER TABLE syntax for declaring the table to be
WITH OIDS has been removed (WITH (oids[ = true]) will error out)
- pg_dump does not support dumping tables declared WITH OIDS and will
issue a warning when dumping one (and ignore the oid column).
- restoring a pg_dump archive with pg_restore will warn when
restoring a table with oid contents (and ignore the oid column)
- COPY will refuse to load a binary dump that includes oids.
- pg_upgrade will error out when encountering tables declared WITH
OIDS; they have to be altered to remove the oid column first.
- Functionality to access the oid of the last inserted row (like
plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed.
The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false)
for CREATE TABLE) is still supported. While that requires a bit of
support code, it seems unnecessary to break applications / dumps that
do not use oids, and are explicit about not using them.
The biggest user of WITH OID columns was postgres' catalog. This
commit changes all 'magic' oid columns to be columns that are normally
declared and stored. To reduce unnecessary query breakage all the
newly added columns are still named 'oid', even if a table's column
naming scheme would indicate 'reloid' or such. This obviously
requires adapting a lot of code, mostly replacing oid access via
HeapTupleGetOid() with access to the underlying Form_pg_*->oid column.
The bootstrap process now assigns oids in genbki.pl for all oid
columns that do not have an explicit value (starting at the largest
oid previously used); only oids assigned later will be above
FirstBootstrapObjectId. As the oid column now is a normal column, the
special bootstrap syntax for oids has been removed.
Oids are not automatically assigned during insertion anymore; all
backend code explicitly assigns oids with GetNewOidWithIndex(). For
the rare case where insertions into the catalog via SQL are called
for, the new pg_nextoid() function can be used (which only works on
catalog tables).
The fact that oid columns on system tables are now normal columns
means that they will be included in the set of columns expanded
by * (i.e. SELECT * FROM pg_class will now include the table's oid,
previously it did not). It would not technically be hard to hide the
oid column by default, but that would mean the confusing behavior
would either have to be carried forward forever, or it would cause
breakage down the line.
While it's not unlikely that further adjustments are needed, the
scope/invasiveness of the patch makes it worthwhile to get this merged
now. It's painful to maintain externally, too complicated to commit
after the code freeze, and a dependency of a number of other patches.
Catversion bump, for obvious reasons.
Author: Andres Freund, with contributions by John Naylor
Discussion: https://postgr.es/m/20180930034810.ywp2c7awz7opzcfr@alap3.anarazel.de
Since the fixes for CVE-2018-1058, we've advised people to schema-qualify
function references in order to fix failures in code that executes under
a minimal search_path setting. However, that's insufficient to make the
single-argument form of unaccent() work, because it looks up the "unaccent"
text search dictionary using the search path.
The most expedient answer seems to be to remove the search_path dependency
by making it look in the same schema that the unaccent() function itself
is declared in. This will definitely work for the normal usage of this
function with the unaccent dictionary provided by the extension.
It's barely possible that there are people who were relying on the
search-path-dependent behavior to select other dictionaries with the same
name; but if there are any such people at all, they can still get that
behavior by writing unaccent('unaccent', ...), or possibly
unaccent('unaccent'::text::regdictionary, ...) if the lookup has to be
postponed to runtime.
Per complaint from Gunnlaugur Thor Briem. Back-patch to all supported
branches.
Discussion: https://postgr.es/m/CAPs+M8LCex6d=DeneofdsoJVijaG59m9V0ggbb3pOH7hZO4+cQ@mail.gmail.com
We have a lot of code in which option names, which from the user's
viewpoint are logically keywords, are passed through the grammar as plain
identifiers, and then matched to string literals during command execution.
This approach avoids making words into lexer keywords unnecessarily. Some
places matched these strings using plain strcmp, some using pg_strcasecmp.
But the latter should be unnecessary since identifiers would have been
downcased on their way through the parser. Aside from any efficiency
concerns (probably not a big factor), the lack of consistency in this area
creates a hazard of subtle bugs due to different places coming to different
conclusions about whether two option names are the same or different.
Hence, standardize on using strcmp() to match any option names that are
expected to have been fed through the parser.
This does create a user-visible behavioral change, which is that while
formerly all of these would work:
alter table foo set (fillfactor = 50);
alter table foo set (FillFactor = 50);
alter table foo set ("fillfactor" = 50);
alter table foo set ("FillFactor" = 50);
now the last case will fail because that double-quoted identifier is
different from the others. However, none of our documentation says that
you can use a quoted identifier in such contexts at all, and we should
discourage doing so since it would break if we ever decide to parse such
constructs as true lexer keywords rather than poor man's substitutes.
So this shouldn't create a significant compatibility issue for users.
Daniel Gustafsson, reviewed by Michael Paquier, small changes by me
Discussion: https://postgr.es/m/29405B24-564E-476B-98C0-677A29805B84@yesql.se
Improve generate_unaccent_rules.py to handle composed characters whose base
is another composed character rather than a plain letter. The net effect
of this is to add a bunch of multi-accented Vietnamese characters to
unaccent.rules.
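A minimal sketch of that recursion, assuming a pre-parsed
decomposition dict like the earlier sketch (illustrative names, not
the script's actual structures):

decomposition = {
    0x1EA5: [0x00E2, 0x0301],  # a-circumflex-acute -> a-circumflex + acute
    0x00E2: [0x0061, 0x0302],  # a-circumflex -> a + combining circumflex
}

def base_letter(codepoint):
    decomp = decomposition.get(codepoint)
    if not decomp:
        return codepoint           # a plain letter: recursion bottoms out
    return base_letter(decomp[0])  # first element is the (composed) base

# base_letter(0x1EA5) == 0x61, so the Vietnamese letter maps to "a".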
Original complaint from Kha Nguyen, diagnosis of the script's shortcoming
by Thomas Munro.
Dang Minh Huong and Michael Paquier
Discussion: https://postgr.es/m/CALo3sF6EC8cy1F2JUz=GRf5h4LMUJTaG3qpdoiLrNbWEXL-tRg@mail.gmail.com
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
This makes almost all core code follow the policy introduced in the
previous commit. Specific decisions:
- Text search support functions with char* and length arguments, such as
prsstart and lexize, may receive unaligned strings. I doubt
maintainers of non-core text search code will notice.
- Use plain VARDATA() on values detoasted or synthesized earlier in the
same function. Use VARDATA_ANY() on varlenas sourced outside the
function, even if they happen to always have four-byte headers. As an
exception, retain the universal practice of using VARDATA() on return
values of SendFunctionCall().
- Retain PG_GETARG_BYTEA_P() in pageinspect. (Page images are too large
for a one-byte header, so this misses no optimization.) Sites that do
not call get_page_from_raw() typically need the four-byte alignment.
- For now, do not change btree_gist. Its use of four-byte headers in
memory is partly entangled with storage of 4-byte headers inside
GBT_VARKEY, on disk.
- For now, do not change gtrgm_consistent() or gtrgm_distance(). They
incorporate the varlena header into a cache, and there are multiple
credible implementation strategies to consider.