Add simple codepoint redirections to unaccent.rules.

Previously we searched for code points where the Unicode data file
listed an equivalent combining character sequence that added accents.
Some codepoints redirect to a single other codepoint, instead of doing
any combining.  We can follow those references recursively to get the
answer.

Per bug report #18362, which reported missing Ancient Greek characters.
Specifically, precomposed characters with oxia (from the polytonic
accent system used for old Greek) just point to precomposed characters
with tonos (from the monotonic accent system for modern Greek), and we
have to follow the extra hop to find out that they are composed with
an acute accent.
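
The extra hop can be illustrated with Python's standard unicodedata
module instead of the script's own table.  This is only a sketch: the
helper names decomposition_ids and resolve_base_and_marks are invented
for illustration and are not taken from the script changed here.

```python
import unicodedata


def decomposition_ids(char):
    """Return the code points in char's decomposition, dropping any <tag>."""
    fields = unicodedata.decomposition(char).split()
    return [int(f, 16) for f in fields if not f.startswith("<")]


def resolve_base_and_marks(char):
    """Follow single-codepoint redirections until marks show up.

    Returns (base, marks); marks is empty if no combining marks are found.
    Hypothetical helper, simplified compared to the real rule generator.
    """
    ids = decomposition_ids(char)
    if len(ids) == 1:
        # Simple redirection to one other codepoint: take the extra hop.
        return resolve_base_and_marks(chr(ids[0]))
    if len(ids) > 1 and all(unicodedata.combining(chr(i)) for i in ids[1:]):
        # A base character followed only by combining marks.
        return chr(ids[0]), [chr(i) for i in ids[1:]]
    return char, []


# U+1F71 GREEK SMALL LETTER ALPHA WITH OXIA redirects to U+03AC (alpha with
# tonos), which in turn decomposes to U+03B1 plus U+0301 (combining acute).
base, marks = resolve_base_and_marks("\u1f71")
print(hex(ord(base)), [hex(ord(m)) for m in marks])
```

Without the recursion, the oxia form stops at a one-element decomposition
and is never recognised as a letter with an accent.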

Besides those, the new rule also:

* pulls in a lot of 'Mathematical Alphanumeric Symbols', which are
  copies of the Latin and Greek alphabets and numbers rendered
  in different typefaces, and

* corrects a single mathematical letter that previously came from the
  CLDR transliteration file, but which the new rule extracts from the
  main Unicode database file; clearly the latter is right and the
  former is wrong (reported to CLDR).
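
The mathematical letters are reached through the same kind of
single-codepoint redirection, except that the Unicode data prefixes it
with a <font> tag.  A minimal sketch, assuming Python's unicodedata
module stands in for the main Unicode database file; plain_letter is a
hypothetical helper, not a function from the script:

```python
import unicodedata


def plain_letter(char):
    """Follow single-codepoint redirections, ignoring tags such as <font>."""
    fields = unicodedata.decomposition(char).split()
    ids = [int(f, 16) for f in fields if not f.startswith("<")]
    if len(ids) == 1:
        return plain_letter(chr(ids[0]))
    return char


# U+210C BLACK-LETTER CAPITAL H, U+1D400 MATHEMATICAL BOLD CAPITAL A and
# U+1D6C2 MATHEMATICAL BOLD SMALL ALPHA each carry a "<font> XXXX" entry.
for ch in ("\u210c", "\U0001d400", "\U0001d6c2"):
    print(unicodedata.name(ch), "->", plain_letter(ch))
```

For U+210C this resolves to H, which is the value the main Unicode
database implies and the one the corrected rule now produces.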

Reported-by: Cees van Zeeland <cees.van.zeeland@freedom.nl>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/18362-be6d0cfe122b6354%40postgresql.org
Thomas Munro 2024-07-05 15:25:31 +12:00
parent 1eff8279d4
commit 18501841bc
3 changed files with 1025 additions and 9 deletions


@@ -176,6 +176,6 @@ SELECT ts_lexize('unaccent', '〝');
 SELECT unaccent('ℌ');
  unaccent
 ----------
- x
+ H
 (1 row)


@@ -104,10 +104,11 @@ def is_letter_with_marks(codepoint, table):
     """Returns true for letters combined with one or more marks."""
     # See https://www.unicode.org/reports/tr44/tr44-14.html#General_Category_Values
-    # Letter may have no combining characters, in which case it has
-    # no marks.
-    if len(codepoint.combining_ids) == 1:
-        return False
+    # Some codepoints redirect directly to another, instead of doing any
+    # "combining"... but sometimes they redirect to a codepoint that doesn't
+    # exist, so ignore those.
+    if len(codepoint.combining_ids) == 1 and codepoint.combining_ids[0] in table:
+        return is_letter_with_marks(table[codepoint.combining_ids[0]], table)
     # A letter without diacritical marks has none of them.
     if any(is_mark(table[i]) for i in codepoint.combining_ids[1:]) is False:
@@ -148,8 +149,7 @@ def get_plain_letter(codepoint, table):
 def is_ligature(codepoint, table):
     """Return true for letters combined with letters."""
-    return all(is_letter(table[i], table) for i in codepoint.combining_ids)
+    return all(i in table and is_letter(table[i], table) for i in codepoint.combining_ids)
 def get_plain_letters(codepoint, table):
     """Return a list of plain letters from a ligature."""
@@ -200,6 +200,11 @@ def parse_cldr_latin_ascii_transliterator(latinAsciiFilePath):
             # the parser of unaccent only accepts non-whitespace characters
             # for "src" and "trg" (see unaccent.c)
             if not src.isspace() and not trg.isspace():
+                if src == "\u210c":
+                    # This mapping seems to be in error, and causes a collision
+                    # by disagreeing with the main Unicode database file:
+                    # https://unicode-org.atlassian.net/browse/CLDR-17656
+                    continue
                 charactersSet.add((ord(src), trg))
     return charactersSet
@@ -251,7 +256,7 @@ def main(args):
     # walk through all the codepoints looking for interesting mappings
     for codepoint in all:
         if codepoint.general_category.startswith('L') and \
-           len(codepoint.combining_ids) > 1:
+           len(codepoint.combining_ids) > 0:
             if is_letter_with_marks(codepoint, table):
                 charactersSet.add((codepoint.id,
                                    chr(get_plain_letter(codepoint, table).id)))

File diff suppressed because it is too large