This commit adds XML_PARSE_HUGE to the libxml2 functions used in core
for the parsing of XML objects, raising up the original limit of 10MB
supported by libxml2.
In most code paths of upstream, XML_MAX_TEXT_LENGTH (10^7) is the
historical limit that gets upgraded to XML_MAX_HUGE_LENGTH (10^9) once
XML_PARSE_HUGE is given to the parser calls. These are still limited by
any palloc() calls for text, up to 1GB.
This offers the possibility to handle within the backend XML objects
larger than 10MB in general, with also a higher depth limit. This
change affects the contrib module xml2, the xml data type and SQL/XML.
Author: Dmitry Koval
Reviewed-by: Tom Lane, Michael Paquier
Discussion: https://postgr.es/m/18274-98d16bc03520665f@postgresql.org
As in c27f8621eed et al.
Not as critical as other cases we've handled, but I figure if we're
going to add JsonbTableRoutine using TableFuncRoutine, this makes it
easier to jump around the code.
This function implements the standard XMLTest function, which
converts text into xml text nodes. It uses the libxml2 function
xmlEncodeSpecialChars to escape predefined entities (&"<>), so
that those do not cause any conflict when concatenating the text
node output with existing xml documents.
This also adds a note in features.sgml about not supporting
XML(SEQUENCE). The SQL specification defines a RETURNING clause
to a set of XML functions, where RETURNING CONTENT or RETURNING
SEQUENCE can be defined. Since PostgreSQL doesn't support
XML(SEQUENCE) all of these functions operate with an
implicit RETURNING CONTENT.
Author: Jim Jones <jim.jones@uni-muenster.de>
Reviewed-by: Vik Fearing <vik@postgresfriends.org>
Discussion: https://postgr.es/m/86617a66-ec95-581f-8d54-08059cca8885@uni-muenster.de
cursor_to_xmlschema() assumed that any Portal must have a tupDesc,
which is not so. Add a defensive check.
It's plausible that this mistake occurred because of the rather
poorly chosen name of the lookup function SPI_cursor_find(),
which in such cases is returning something that isn't very much
like a cursor. Add some documentation to try to forestall future
errors of the same ilk.
Report and patch by Boyu Yang (docs changes by me). Back-patch
to all supported branches.
Discussion: https://postgr.es/m/dd343010-c637-434c-a8cb-418f53bda3b8.yangboyu.yby@alibaba-inc.com
Run pgindent, pgperltidy, and reformat-dat-files.
This set of diffs is a bit larger than typical. We've updated to
pg_bsd_indent 2.1.2, which properly indents variable declarations that
have multi-line initialization expressions (the continuation lines are
now indented one tab stop). We've also updated to perltidy version
20230309 and changed some of its settings, which reduces its desire to
add whitespace to lines to make assignments etc. line up. Going
forward, that should make for fewer random-seeming changes to existing
code.
Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql
This adds the ability to pretty-print XML documents ... according to
libxml's somewhat idiosyncratic notions of what's pretty, anyway.
One notable divergence from a strict reading of the spec is that
libxml is willing to collapse empty nodes "<node></node>" to just
"<node/>", whereas SQL and the underlying XML spec say that this
option should only result in whitespace tweaks. Nonetheless,
it seems close enough to justify using the SQL-standard syntax.
Jim Jones, reviewed by Peter Smith and myself
Discussion: https://postgr.es/m/2f5df461-dad8-6d7d-4568-08e10608a69b@uni-muenster.de
This ancient bit of code was summarily trapping any ereport longjmp
whatsoever and assuming that it must represent an invalid-XML report.
It's not really appropriate to handle OOM-like situations that way:
maybe the input is valid or maybe not, but we couldn't find out.
And it'd be a seriously bad idea to ignore, say, a query cancel
error that way. (Perhaps that can't happen because there is no
CHECK_FOR_INTERRUPTS anywhere within xml_parse, but even if that's
true today it's obviously a very fragile assumption.)
But in the wake of the previous commit, we can drop the PG_TRY
here altogether, and use the soft error mechanism to catch only
the kinds of errors that are legitimate to treat as invalid-XML.
(This is our first use of the soft error mechanism for something
not directly related to a datatype input function. It won't be
the last.)
xml_is_document can be converted in the same way. That one is
not actively broken, because it was checking specifically for
ERRCODE_INVALID_XML_DOCUMENT rather than trapping everything;
but the code is still shorter and probably faster this way.
Discussion: https://postgr.es/m/3564577.1671142683@sss.pgh.pa.us
The key idea here is that xml_parse must distinguish hard errors
from soft errors. We want to throw a hard error for libxml
initialization failures: those might be out-of-memory, or something
else, but in any case they are not the fault of the input string.
If we get to the point of parsing the input, and something goes
wrong, we can fairly consider that to mean bad input.
One thing that arguably does mean bad input, but I didn't trouble
to handle softly, is encoding conversion failure while converting
the server encoding to UTF8. This might be something to improve
later, but it seems like a pretty low-probability scenario.
Discussion: https://postgr.es/m/3564577.1671142683@sss.pgh.pa.us
This is similar to 7d25958, and this commit takes care of all the
remaining inconsistencies between the initial value used in the C
variable associated to a GUC and its default value stored in the GUC
tables (as of pg_settings.boot_val).
Some of the initial values of the GUCs updated rely on a compile-time
default. These are refactored so as the GUC table and its C declaration
use the same values. This makes everything consistent with other
places, backend_flush_after, bgwriter_flush_after, port,
checkpoint_flush_after doing so already, for example.
Extracted from a larger patch by Peter Smith. The spots updated in the
modules are from me.
Author: Peter Smith, Michael Paquier
Reviewed-by: Nathan Bossart, Tom Lane, Justin Pryzby
Discussion: https://postgr.es/m/CAHut+PtHE0XSfjjRQ6D4v7+dqzCw=d+1a64ujra4EX8aoc_Z+w@mail.gmail.com
Make sure that function declarations use names that exactly match the
corresponding names from function definitions in optimizer, parser,
utility, libpq, and "commands" code, as well as in remaining library
code. Do the same for all code related to frontend programs (with the
exception of pg_dump/pg_dumpall related code).
Like other recent commits that cleaned up function parameter names, this
commit was written with help from clang-tidy. Later commits will handle
ecpg and pg_dump/pg_dumpall.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznJt9CMM9KJTMjJh_zbL5hD9oX44qdJ4aqZtjFi-zA3Tg@mail.gmail.com
Since these macros just cast whatever you give them to the designated
output type, and many normal uses also cast the output type further, a
number of incorrect uses go undiscovered. The fixes in this patch
have been discovered by changing these macros to inline functions,
which is the subject of a future patch.
Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://www.postgresql.org/message-id/flat/8528fb7e-0aa2-6b54-85fb-0c0886dbd6ed%40enterprisedb.com
There were many calls to construct_array() and deconstruct_array() for
built-in types, for example, when dealing with system catalog columns.
These all hardcoded the type attributes necessary to pass to these
functions.
To simplify this a bit, add construct_array_builtin(),
deconstruct_array_builtin() as wrappers that centralize this hardcoded
knowledge. This simplifies many call sites and reduces the amount of
hardcoded stuff that is spread around.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/2914356f-9e5f-8c59-2995-5997fc48bcba%40enterprisedb.com
These are usually not useful since users will use packaged
distributions and won't be interested in rebuilding their installation
from source. Also, we have only used these kinds of hints for some
features and in some places, not consistently throughout.
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/2552aed7-d0e9-280a-54aa-2dc7073f371d%40enterprisedb.com
The output of table_to_xmlschema() and allied functions includes
a regex describing valid values for these types ... but the regex
was itself invalid, as it failed to escape a literal "+" sign.
Report and fix by Renan Soares Lopes. Back-patch to all
supported branches.
Discussion: https://postgr.es/m/7f6fabaa-3f8f-49ab-89ca-59fbfe633105@me.com
SQL includes provisions for numeric Unicode escapes in string
literals and identifiers. Previously we only accepted those
if they represented ASCII characters or the server encoding
was UTF-8, making the conversion to internal form trivial.
This patch adjusts things so that we'll call the appropriate
encoding conversion function in less-trivial cases, allowing
the escape sequence to be accepted so long as it corresponds
to some character available in the server encoding.
This also applies to processing of Unicode escapes in JSONB.
However, the old restriction still applies to client-side
JSON processing, since that hasn't got access to the server's
encoding conversion infrastructure.
This patch includes some lexer infrastructure that simplifies
throwing errors with error cursors pointing into the middle of
a string (or other complex token). For the moment I only used
it for errors relating to Unicode escapes, but we might later
expand the usage to some other cases.
Patch by me, reviewed by John Naylor.
Discussion: https://postgr.es/m/2393.1578958316@sss.pgh.pa.us
Our usual practice for "poor man's enum" catalog columns is to define
macros for the possible values and use those, not literal constants,
in C code. But for some reason lost in the mists of time, this was
never done for typalign/attalign or typstorage/attstorage. It's never
too late to make it better though, so let's do that.
The reason I got interested in this right now is the need to duplicate
some uses of the TYPSTORAGE constants in an upcoming ALTER TYPE patch.
But in general, this sort of change aids greppability and readability,
so it's a good idea even without any specific motivation.
I may have missed a few places that could be converted, and it's even
more likely that pending patches will re-introduce some hard-coded
references. But that's not fatal --- there's no expectation that
we'd actually change any of these values. We can clean up stragglers
over time.
Discussion: https://postgr.es/m/16457.1583189537@sss.pgh.pa.us
SPI gets used to build a list of relation OIDs for XML object
generation, and one code path building a list uses SPI_execute() without
looking at errors it produces. So fix that.
Author: Mark Dilger
Reviewed-by: Michael Paquier, Pavel Stehule
Discussion: https://postgr.es/m/17d30445-4862-7917-170f-84328dcd292d@gmail.com
This gives an alternative way of catching exceptions, for the common
case where the cleanup code is the same in the error and non-error
cases. So instead of
PG_TRY();
{
... code that might throw ereport(ERROR) ...
}
PG_CATCH();
{
cleanup();
PG_RE_THROW();
}
PG_END_TRY();
cleanup();
one can write
PG_TRY();
{
... code that might throw ereport(ERROR) ...
}
PG_FINALLY();
{
cleanup();
}
PG_END_TRY();
Discussion: https://www.postgresql.org/message-id/flat/95a822c3-728b-af0e-d7e5-71890507ae0c%402ndquadrant.com
When we already know the length that we're going to append, then it
makes sense to use appendBinaryStringInfo instead of
appendStringInfoString so that the append can be performed with a simple
memcpy() using a known length rather than having to first perform a
strlen() call to obtain the length.
Discussion: https://postgr.es/m/CAKJS1f8+FRAM1s5+mAa3isajeEoAaicJ=4e0WzrH3tAusbbiMQ@mail.gmail.com
xml.c passed format = 1 to xmlNodeDump(), resulting in sometimes getting
extra whitespace (newlines + spaces) in the output. We don't really want
that, first because whitespace might be semantically significant in some
XML uses, and second because it happens only very inconsistently. Only
one case in our regression tests is affected.
This potentially affects the results of xpath() and the XMLTABLE construct,
when emitting nodeset values.
Note that the older code in contrib/xml2 doesn't do this; it seems
to have been an aboriginal bad decision in commit ea3b212fe.
While this definitely seems like a bug to me, the small number of
complaints to date argues against back-patching a behavioral change.
Hence, fix in HEAD only, at least for now.
Per report from Jean-Marc Voillequin.
Discussion: https://postgr.es/m/1EC8157EB499BF459A516ADCF135ADCE3A23A9CA@LON-WGMSX712.ad.moodys.net
Previously we were using the SQL:2003 definition, which doesn't allow
this, but that creates a serious dump/restore gotcha: there is no
setting of xmloption that will allow all valid XML data. Hence,
switch to the 2006 definition.
Since libxml doesn't accept <!DOCTYPE> directives in the mode we
use for CONTENT parsing, the implementation is to detect <!DOCTYPE>
in the input and switch to DOCUMENT parsing mode. This should not
cost much, because <!DOCTYPE> should be close to the front of the
input if it's there at all. It's possible that this causes the
error messages for malformed input to be slightly different than
they were before, if said input includes <!DOCTYPE>; but that does
not seem like a big problem.
In passing, buy back a few cycles in parsing of large XML documents
by not doing strlen() of the whole input in parse_xml_decl().
Back-patch because dump/restore failures are not nice. This change
shouldn't break any cases that worked before, so it seems safe to
back-patch.
Chapman Flack (revised a bit by me)
Discussion: https://postgr.es/m/CAN-V+g-6JqUQEQZ55Q3toXEN6d5Ez5uvzL4VR+8KtvJKj31taw@mail.gmail.com
Certain libxml2 versions (such as the 2.7.6 commonly seen in older
distributions, but apparently only on x86_64) contain a bug that causes
xmlCopyNode, when called on a XML_DOCUMENT_NODE, to return a node that
xmlFreeNode crashes on. Arrange to call xmlFreeDoc instead of
xmlFreeNode for those nodes.
Per buildfarm members lapwing and grison.
Author: Pavel Stehule, light editing by Álvaro.
Discussion: https://postgr.es/m/20190308024436.GA2374@alvherre.pgsql
Correctly process nodes of more types than previously. In some cases,
nodes were being ignored (nothing was output); in other cases, trying to
return them resulted in errors about unrecognized nodes. In yet other
cases, necessary escaping (of XML special characters) was not being
done. Fix all those (as far as the authors could find) and add
regression tests cases verifying the new behavior.
I (Álvaro) was of two minds about backpatching these changes. They do
seem bugfixes that would benefit most users of the affected functions;
but on the other hand it would change established behavior in minor
releases, so it seems prudent not to.
Authors: Pavel Stehule, Markus Winand, Chapman Flack
Discussion:
https://postgr.es/m/CAFj8pRA6J25CtAZ2TuRvxK3gat7-bBUYh0rfE2yM7Hj9GD14Dg@mail.gmail.comhttps://postgr.es/m/8BDB0627-2105-4564-AA76-7849F028B96E@winand.at
The elephant in the room as pointed out by Chapman Flack, not fixed in
this commit, is that we still have XMLTABLE operating on XPath 1.0
instead of the standard-mandated XQuery (or even its subset XPath 2.0).
Fixing that is a major undertaking, however.
While this isn't really supposed to happen, it can occur in OOM
situations and perhaps others. Instead of crashing, substitute
"(no message provided)".
I didn't worry about localizing this text, since we aren't
localizing anything else here; besides, if we're on the edge of
OOM, it's unlikely gettext() would work.
Report and fix by Sergio Conde Gómez in bug #15624.
Discussion: https://postgr.es/m/15624-4dea54091a2864e6@postgresql.org
heapam.h previously was included in a number of widely used
headers (e.g. execnodes.h, indirectly in executor.h, ...). That's
problematic on its own, as heapam.h contains a lot of low-level
details that don't need to be exposed that widely, but becomes more
problematic with the upcoming introduction of pluggable table storage
- it seems inappropriate for heapam.h to be included that widely
afterwards.
heapam.h was largely only included in other headers to get the
HeapScanDesc typedef (which was defined in heapam.h, even though
HeapScanDescData is defined in relscan.h). The better solution here
seems to be to just use the underlying struct (forward declared where
necessary). Similar for BulkInsertState.
Another problem was that LockTupleMode was used in executor.h - parts
of the file tried to cope without heapam.h, but due to the fact that
it indirectly included it, several subsequent violations of that goal
were not not noticed. We could just reuse the approach of declaring
parameters as int, but it seems nicer to move LockTupleMode to
lockoptions.h - that's not a perfect location, but also doesn't seem
bad.
As a number of files relied on implicitly included heapam.h, a
significant number of files grew an explicit include. It's quite
probably that a few external projects will need to do the same.
Author: Andres Freund
Reviewed-By: Alvaro Herrera
Discussion: https://postgr.es/m/20190114000701.y4ttcb74jpskkcfb@alap3.anarazel.de
According to the SQL standard, the context of XMLTABLE's XPath
row_expression is the document node of the XML input document, not the
root node. This becomes visible when a relative path rather than
absolute is used as row expression. Absolute paths is what was used in
original tests and docs (and the most common form used in examples
throughout the interwebs), which explains why this wasn't noticed
before.
Other functions such as xpath() and xpath_exists() also have this
problem. While not specified by the SQL standard, it would be pretty
odd to leave those functions to behave differently than XMLTABLE, so
change them too. However, this is a backwards-incompatible change.
No backpatch, out of fear of breaking code depending on the original
broken behavior.
Author: Markus Winand
Reported-By: Markus Winand
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/0684A598-002C-42A2-AE12-F024A324EAE4@winand.at
Column expressions that match TEXT or CDATA nodes must return the
contents of the nodes themselves, not the content of non-existing
children (i.e. the empty string).
Author: Markus Winand
Reported-by: Markus Winand
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/0684A598-002C-42A2-AE12-F024A324EAE4@winand.at
When a value contained an XML declaration naming some other encoding,
this function interpreted UTF8 bytes as the named encoding, yielding
mojibake. xml_parse() already has similar logic. This would be
necessary but not sufficient for non-UTF8 databases, so preserve
behavior there until the xpath facility can support such databases
comprehensively. Back-patch to 9.3 (all supported versions).
Pavel Stehule and Noah Misch
Discussion: https://postgr.es/m/CAFj8pRC-dM=tT=QkGi+Achkm+gwPmjyOayGuUfXVumCxkDgYWg@mail.gmail.com
The lower case spellings are C and C++ standard and are used in most
parts of the PostgreSQL sources. The upper case spellings are only used
in some files/modules. So standardize on the standard spellings.
The APIs for ICU, Perl, and Windows define their own TRUE and FALSE, so
those are left as is when using those APIs.
In code comments, we use the lower-case spelling for the C concepts and
keep the upper-case spelling for the SQL concepts.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>