The previous coding of this function could get into situations where it
would never terminate, because successive passes would re-add EMPTY arcs
that had been removed by the previous pass. Rewrite the function
completely using a new algorithm that is guaranteed to terminate, and
also seems to be usually faster than the old one. Per Tcl bugs 3604074
and 3606683.
Tom Lane and Don Porter
parseqatom() failed to check for an error return (NULL result) from its
recursive call to parsebranch(), and in consequence could crash with a
null-pointer dereference after an error return. This bug has been there
since day one, but wasn't noticed before, probably because most error cases
in parsebranch() didn't actually lead to returning NULL. Add the missing
error check, and also tweak parsebranch() to exit in a less indirect
fashion after a call to parseqatom() fails.
Report by Tomasz Karlik, fix by me.
rfree() failed to cope with the case that pg_regcomp() had initialized the
regex_t struct but then failed to allocate any memory for re->re_guts (ie,
the first malloc call in pg_regcomp() failed). It would try to touch the
guts struct anyway, and thus dump core. This is a sufficiently narrow
corner case that it's not surprising it's never been seen in the field;
but still a bug is a bug, so patch all active branches.
Noted while investigating whether we need to call pg_regfree after a
failure return from pg_regcomp. Other than this bug, it turns out we
don't, so adjust comments appropriately.
The previous coding abused the first element of a cNFA state's arcs list
to hold a per-state flag bit, which was confusing, undocumented, and not
even particularly efficient. Get rid of that in favor of a separate
"stflags" vector. Since there's only one bit in use, I chose to allocate a
char per state; we could possibly replace this with a bitmap at some point,
but that would make accesses a little slower. It's already about 8X
smaller than before, so let's not get overly tense.
Also document the representation better than it was before, which is to say
not at all.
This patch is a byproduct of investigations towards extracting a "fixed
prefix" string from the compact-NFA representation of regex patterns.
Might need to back-patch it if we decide to back-patch that fix, but for
now it's just code cleanup so I'll just put it in HEAD.
Apparently some primordial version of Spencer's engine needed cdissect()
and child functions to be able to continue matching from a previous
position when re-called. That is dead code, though, since trivial
inspection shows that cdissect can never be entered without having
previously done zapmem which resets the relevant retry counter. I have
also verified experimentally that no case in the Tcl regression tests
reaches cdissect with a nonzero retry value. Accordingly, remove that
logic. This doesn't really save any noticeable number of cycles in itself,
but it is one step towards making dissect() and cdissect() equivalent,
which will allow removing hundreds of lines of near-duplicated code.
Since struct subre's "retry" field is no longer particularly related to
any kind of retry, rename it to "id". As of this commit it's only used
for identifying a subre node in debug printouts, so you might think we
should get rid of the field entirely; but I have a plan for another use.
Cases where a back-reference is part of a larger subexpression that
is quantified have never worked in Spencer's regex engine, because
he used a compile-time transformation that neglected the need to
check the back-reference match in iterations before the last one.
(That was okay for capturing parens, and we still do it if the
regex has *only* capturing parens ... but it's not okay for backrefs.)
To make this work properly, we have to add an "iteration" node type
to the regex engine's vocabulary of sub-regex nodes. Since this is a
moderately large change with a fair risk of introducing new bugs of its
own, apply to HEAD only, even though it's a fix for a longstanding bug.
The syntax "\n*", that is a backref with a * quantifier directly applied
to it, has never worked correctly in Spencer's library. This has been an
open bug in the Tcl bug tracker since 2005:
https://sourceforge.net/tracker/index.php?func=detail&aid=1115587&group_id=10894&atid=110894
The core of the problem is in parseqatom(), which first changes "\n*" to
"\n+|" and then applies repeat() to the NFA representing the backref atom.
repeat() thinks that any arc leading into its "rp" argument is part of the
sub-NFA to be repeated. Unfortunately, since parseqatom() already created
the arc that was intended to represent the empty bypass around "\n+", this
arc gets moved too, so that it now leads into the state loop created by
repeat(). Thus, what was supposed to be an "empty" bypass gets turned into
something that represents zero or more repetitions of the NFA representing
the backref atom. In the original example, in place of
^([bc])\1*$
we now have something that acts like
^([bc])(\1+|[bc]*)$
At runtime, the branch involving the actual backref fails, as it's supposed
to, but then the other branch succeeds anyway.
We could no doubt fix this by some rearrangement of the operations in
parseqatom(), but that code is plenty ugly already, and what's more the
whole business of converting "x*" to "x+|" probably needs to go away to fix
another problem I'll mention in a moment. Instead, this patch suppresses
the *-conversion when the target is a simple backref atom, leaving the case
of m == 0 to be handled at runtime. This makes the patch in regcomp.c a
one-liner, at the cost of having to tweak cbrdissect() a little. In the
event I went a bit further than that and rewrote cbrdissect() to check all
the string-length-related conditions before it starts comparing characters.
It seems a bit stupid to possibly iterate through many copies of an
n-character backreference, only to fail at the end because the target
string's length isn't a multiple of n --- we could have found that out
before starting. The existing coding could only be a win if integer
division is hugely expensive compared to character comparison, but I don't
know of any modern machine where that might be true.
This does not fix all the problems with quantified back-references. In
particular, the code is still broken for back-references that appear within
a larger expression that is quantified (so that direct insertion of the
quantification limits into the BACKREF node doesn't apply). I think fixing
that will take some major surgery on the NFA code, specifically introducing
an explicit iteration node type instead of trying to transform iteration
into concatenation of modified regexps.
Back-patch to all supported branches. In HEAD, also add a regression test
case for this. (It may seem a bit silly to create a regression test file
for just one test case; but I'm expecting that we will soon import a whole
bunch of regex regression tests from Tcl, so might as well create the
infrastructure now.)
Create src/backend/regex/README to hold an implementation overview of
the regex package, and fill it in with some preliminary notes about
the code's DFA/NFA processing and colormap management. Much more to
do there of course.
Also, improve some code comments around the colormap and cvec code.
No functional changes except to add one missing assert.
This involves getting the character classification and case-folding
functions in the regex library to use the collations infrastructure.
Most of this work had been done already in connection with the upper/lower
and LIKE logic, so it was a simple matter of transposition.
While at it, split out these functions into a separate source file
regc_pg_locale.c, so that they can be correctly labeled with the Postgres
project's license rather than the Scriptics license. These functions are
100% Postgres-written code whereas what remains in regc_locale.c is still
mostly not ours, so lumping them both under the same copyright notice was
getting more and more misleading.
from "clang". The VERR changes make an assignment unconditional, which is
probably easier to read/understand anyway, and one can hardly argue that
it's worth shaving cycles off the case of reporting another error when
one has already been detected. The INSIST change limits where that macro
can be used, but not in a way that creates a problem for any existing call.
was Tcl 8.4.8. The main changes are to remove the never-fully-implemented
code for multi-character collating elements, and to const-ify some stuff a
bit more fully. In combination with the recent security patch, this commit
brings us into line with Tcl 8.5.0.
Note that I didn't make any effort to duplicate a lot of cosmetic changes
that they made to bring their copy into line with their own style
guidelines, such as adding braces around single-line IF bodies. Most of
those we either had done already (such as ANSI-fication of function headers)
or there is no point because pgindent would undo the change anyway.
conversion of basic ASCII letters. Remove all uses of strcasecmp and
strncasecmp in favor of new functions pg_strcasecmp and pg_strncasecmp;
remove most but not all direct uses of toupper and tolower in favor of
pg_toupper and pg_tolower. These functions use the same notions of
case folding already developed for identifier case conversion. I left
the straight locale-based folding in place for situations where we are
just manipulating user data and not trying to match it to built-in
strings --- for example, the SQL upper() function is still locale
dependent. Perhaps this will prove not to be what's wanted, but at
the moment we can initdb and pass regression tests in Turkish locale.
(extracted from Tcl 8.4.1 release, as Henry still hasn't got round to
making it a separate library). This solves a performance problem for
multibyte, as well as upgrading our regexp support to match recent Tcl
and nearly match recent Perl.
postgres.h or c.h includes a system header (such as stdio.h or
stdlib.h), there's no need to specifically include it in any of the .c
files in the backend.
Neil Conway
Implement SQL99 SIMILAR TO as a synonym for our existing operator "~".
Implement SQL99 regular expression SUBSTRING(string FROM pat FOR escape).
Extend the definition to make the FOR clause optional.
Define textregexsubstr() to actually implement this feature.
Update the regression test to include these new string features.
All tests pass.
Rename the regular expression support routines from "pg95_xxx" to "pg_xxx".
Define CREATE CHARACTER SET in the parser per SQL99. No implementation yet.
Attached is a pacth against 7.2 which adds locale awareness to the
character classes of the regular expression engine.
...
> > I still think the xdigit class could be handled the same way the digit
> > class is (by enumeration rather than using the isxdigit function). That
> > saves you a cicle, and I don't think there's any loss.
>
> In fact, I will email you when I apply the original patch.
I miss that case :-(. Here is the pached patch.
...
Here is a patch which addresses Tatsuo's concerns (it does return an
static struct instead of constructing it).
definitions from K&R to ANSI C style, and fix broken assumption that
int and long are the same datatype. This repairs problems observed
on Alpha with regexps having between 32 and 63 states.
I have implemented a framework of encoding translation between the
backend and the frontend. Also I have added a new variable setting
command:
SET CLIENT_ENCODING TO 'encoding';
Other features include:
Latin1 support more 8 bit cleaness
See doc/README.mb for more details. Note that the pacthes are
against May 30 snapshot.
Tatsuo Ishii
Included are patches intended for allowing PostgreSQL to handle
multi-byte charachter sets such as EUC(Extende Unix Code), Unicode and
Mule internal code. With the MB patch you can use multi-byte character
sets in regexp and LIKE. The encoding system chosen is determined at
the compile time.
To enable the MB extension, you need to define a variable "MB" in
Makefile.global or in Makefile.custom. For further information please
take a look at README.mb under doc directory.
(Note that unlike "jp patch" I do not use modified GNU regexp any
more. I changed Henry Spencer's regexp coming with PostgreSQL.)