style by default. Per discussion, there seems to be hardly anything that
really relies on being able to change the regex flavor, so the ability to
select it via embedded options ought to be enough for any stragglers.
Also, if we didn't remove the GUC, we'd really be morally obligated to
mark the regex functions non-immutable, which'd possibly create performance
issues.
* Stop escaping ? and {. As of SQL:2008, SIMILAR TO is defined to have
POSIX-compatible interpretation of ? as well as {m,n} and related constructs,
so we should allow these things through to our regex engine.
* Escape ^ and $. It appears that our regex engine will treat ^^ at the
beginning of the string the same as ^, and similarly for $$ at the end of
the string, which meant that SIMILAR TO was effectively ignoring ^ at the
start of the pattern and $ at the end. Since these are not supposed to be
metacharacters, this is a bug.
The second part of this is arguably a back-patchable bug fix, but I'm
hesitant to do that because it might break applications that are expecting
something like "col SIMILAR TO '^foo$'" to work like a POSIX pattern.
Seems safer to only change it at a major version boundary.
Per discussion of an example from Doug Gorley.
case where there is a match to the pattern overall but the user has specified
a parenthesized subexpression and that subexpression hasn't got a match.
An example is substring('foo' from 'foo(bar)?'). This should return NULL,
since (bar) isn't matched, but it was mistakenly returning the whole-pattern
match instead (ie, 'foo'). Per bug #4044 from Rui Martins.
This has been broken since the beginning; patch in all supported versions.
The old behavior was sufficiently inconsistent that it's impossible to believe
anyone is depending on it.
to not cause needless copying of text datums that have 1-byte headers.
Greg Stark, in response to performance gripe from Guillaume Smet and
ITAGAKI Takahiro.
regexp_split_to_table() within a single query. This is only a partial
solution, as it turns out that with enough matches per string these
functions can also tickle a repalloc() misbehavior. But fixing that
is a topic for a separate patch.
that cached compiled patterns will still be there when the function is next
called. Clean up looping logic, thereby fixing bug identified by Pavel
Stehule. Share setup code between the two functions, add some comments, and
avoid risky mixing of int and size_t variables. Clean up the documentation a
tad, and accept all the flag characters mentioned in table 9-19 rather than
just a subset.
and regexp_split_to_table. These functions provide access to the
capture groups resulting from a POSIX regular expression match,
and provide the ability to split a string on a POSIX regular
expression, respectively. Patch from Jeremy Drake; code review by
Neil Conway, additional comments and suggestions from Tom and
Peter E.
This patch bumps the catversion, adds some regression tests,
and updates the docs.
Get rid of VARATT_SIZE and VARATT_DATA, which were simply redundant with
VARSIZE and VARDATA, and as a consequence almost no code was using the
longer names. Rename the length fields of struct varlena and various
derived structures to catch anyplace that was accessing them directly;
and clean up various places so caught. In itself this patch doesn't
change any behavior at all, but it is necessary infrastructure if we hope
to play any games with the representation of varlena headers.
Greg Stark and Tom Lane
form '^(foo)$'. Before, these could never be optimized into indexscans.
The recent changes to make psql and pg_dump generate such patterns (for \d
commands and -t and related switches, respectively) therefore represented
a big performance hit for people with large pg_class catalogs, as seen in
recent gripe from Erik Jones. While at it, be more paranoid about
case-sensitivity checking in multibyte encodings, and fix some other
corner cases in which a regex might be interpreted too liberally.
alternatives ("|" symbol). The original coding allowed the added ^ and $
constraints to be absorbed into the first and last alternatives, producing
a pattern that would match more than it should. Per report from Eric Noriega.
I also changed the pattern to add an ARE director ("***:"), ensuring that
SIMILAR TO patterns do not change behavior if regex_flavor is changed. This
is necessary to make the non-capturing parentheses work, and seems like a
good idea on general principles.
Back-patched as far as 7.4. 7.3 also has the bug, but a fix seems impractical
because that version's regex engine doesn't have non-capturing parens.
comment line where output as too long, and update typedefs for /lib
directory. Also fix case where identifiers were used as variable names
in the backend, but as typedefs in ecpg (favor the backend for
indenting).
Backpatch to 8.1.X.
fix problems with replacement-string backslashes that aren't followed by
one of the expected characters, avoid giving the impression that
replace_text_regexp() is meant to be called directly as a SQL function,
etc.
The specification of this function is as follows.
regexp_replace(source text, pattern text, replacement text, [flags
text])
returns text
Replace string that matches to regular expression in source text to
replacement text.
- pattern is regular expression pattern.
- replacement is replace string that can use '\1'-'\9', and '\&'.
'\1'-'\9': back reference to the n'th subexpression.
'\&' : entire matched string.
- flags can use the following values:
g: global (replace all)
i: ignore case
When the flags is not specified, case sensitive, replace the first
instance only.
Atsushi Ogawa
Also performed an initial run through of upgrading our Copyright date to
extend to 2005 ... first run here was very simple ... change everything
where: grep 1996-2004 && the word 'Copyright' ... scanned through the
generated list with 'less' first, and after, to make sure that I only
picked up the right entries ...
error conditions during regexp compile, but not during regexp execution;
any sort of "can't happen" errors would be treated as no-match instead
of being reported as they should be. Noticed while trying to duplicate
a reported Tcl bug.
conversion of basic ASCII letters. Remove all uses of strcasecmp and
strncasecmp in favor of new functions pg_strcasecmp and pg_strncasecmp;
remove most but not all direct uses of toupper and tolower in favor of
pg_toupper and pg_tolower. These functions use the same notions of
case folding already developed for identifier case conversion. I left
the straight locale-based folding in place for situations where we are
just manipulating user data and not trying to match it to built-in
strings --- for example, the SQL upper() function is still locale
dependent. Perhaps this will prove not to be what's wanted, but at
the moment we can initdb and pass regression tests in Turkish locale.
should not be too eager to reject paths involving unknown schemas, since
it can't really tell whether the schemas exist in the target database.
(Also, when reading pg_dumpall output, it could be that the schemas
don't exist yet, but eventually will.) ALTER USER SET has a similar issue.
So, reduce the normal ERROR to a NOTICE when checking search_path values
for these commands. Supporting this requires changing the API for GUC
assign_hook functions, which causes the patch to touch a lot of places,
but the changes are conceptually trivial.
expression accepted by the regex operators, per discussion yesterday.
Along the way, reduce deadlock_timeout from PGC_POSTMASTER to PGC_SIGHUP
category. It is probably best to insist that all backends share the same
setting, but that doesn't mean it has to be frozen at startup.
(extracted from Tcl 8.4.1 release, as Henry still hasn't got round to
making it a separate library). This solves a performance problem for
multibyte, as well as upgrading our regexp support to match recent Tcl
and nearly match recent Perl.
the SQL99 standard. (I'm not sure that the character-class features are
quite right, but that can be fixed later.) Document SQL99 and POSIX
regexps as being different features; provide variants of SUBSTRING for
each.
Will optimize the case for repeated calls for the same expression,
which seems to be the most common case. Formerly, always searched
from the first entry.
May want to look at the least-recently-used algorithm to make sure it
is identifying the right slots to reclaim. Seems silly to do math when
it seems that we could simply use an incrementing counter...
Implement SQL99 SIMILAR TO as a synonym for our existing operator "~".
Implement SQL99 regular expression SUBSTRING(string FROM pat FOR escape).
Extend the definition to make the FOR clause optional.
Define textregexsubstr() to actually implement this feature.
Update the regression test to include these new string features.
All tests pass.
Rename the regular expression support routines from "pg95_xxx" to "pg_xxx".
Define CREATE CHARACTER SET in the parser per SQL99. No implementation yet.