This patch introduces the ability for complex datatypes to have an
in-memory representation that is different from their on-disk format.
On-disk formats are typically optimized for minimal size, and in any case
they can't contain pointers, so they are often not well-suited for
computation. Now a datatype can invent an "expanded" in-memory format
that is better suited for its operations, and then pass that around among
the C functions that operate on the datatype. There are also provisions
(rudimentary as yet) to allow an expanded object to be modified in-place
under suitable conditions, so that operations like assignment to an element
of an array need not involve copying the entire array.
The initial application for this feature is arrays, but it is not hard
to foresee using it for other container types like JSON, XML and hstore.
I have hopes that it will be useful to PostGIS as well.
In this initial implementation, a few heuristics have been hard-wired
into plpgsql to improve performance for arrays that are stored in
plpgsql variables. We would like to generalize those hacks so that
other datatypes can obtain similar improvements, but figuring out some
appropriate APIs is left as a task for future work. (The heuristics
themselves are probably not optimal yet, either, as they sometimes
force expansion of arrays that would be better left alone.)
Preliminary performance testing shows impressive speed gains for plpgsql
functions that do element-by-element access or update of large arrays.
There are other cases that get a little slower, as a result of added array
format conversions; but we can hope to improve anything that's annoyingly
bad. In any case most applications should see a net win.
Tom Lane, reviewed by Andres Freund
... and rename it and its sibling array_offsets to array_position and
array_positions, to account for the changed behavior.
Having the functions return subscripts better matches existing practice,
and is better suited to using the result value as a subscript into the
array directly. For one-based arrays, the new definition is identical
to what was originally committed.
(We use the term "subscript" in the documentation, which is what we use
whenever we talk about arrays; but the functions themselves are named
using the word "position" to match the standard-defined POSITION()
functions.)
Author: Pavel Stěhule
Behavioral problem noted by Dean Rasheed.
When I rewrote this in commit 56a79a869bedc4bf6c35853642694cc0b0594dd2,
I forgot that it's possible for the input array type to change from one
call to the next (this can happen when applying the function to
pg_statistic columns, for instance). Fix that.
Previously, each new array created a new memory context that started
out at 8kB. This is incredibly wasteful when there are lots of small
groups of just a few elements each.
Change initArrayResult() and friends to accept a "subcontext" argument
to indicate whether the caller wants the ArrayBuildState allocated in
a new subcontext or not. If not, it can no longer be released
separately from the rest of the memory context.
Fixes bug report by Frank van Vugt on 2013-10-19.
Tomas Vondra. Reviewed by Ali Akbar, Tom Lane, and me.
There wasn't any good reason for a single C function to implement both
these SQL functions: it saved very little code overall, and it required
significant pushups to re-determine at runtime which case applied. Redoing
it as two functions ends up with just slightly more lines of code, but it's
simpler to understand, and faster too because we need not repeat syscache
lookups on every call.
An important side benefit is that this eliminates the only case in which
different aliases of the same C function had both anyarray and anyelement
arguments at the same position, which would almost always be a mistake.
The opr_sanity regression test will now notice such mistakes since there's
no longer a valid case where it happens.
These cases formerly failed with errors about "could not find array type
for data type". Now they yield arrays of the same element type and one
higher dimension.
The implementation involves creating functions with API similar to the
existing accumArrayResult() family. I (tgl) also extended the base family
by adding an initArrayResult() function, which allows callers to avoid
special-casing the zero-inputs case if they just want an empty array as
result. (Not all do, so the previous calling convention remains valid.)
This allowed simplifying some existing code in xml.c and plperl.c.
Ali Akbar, reviewed by Pavel Stehule, significantly modified by me
Per recent discussion, it's important for all computed datums (not only the
results of input functions) to not contain any ill-defined (uninitialized)
bits. Failing to ensure that can result in equal() reporting that
semantically indistinguishable Consts are not equal, which in turn leads to
bizarre and undesirable planner behavior, such as in a recent example from
David Johnston. We might eventually try to fix this in a general manner by
allowing datatypes to define identity-testing functions, but for now the
path of least resistance is to expect datatypes to force all unused bits
into consistent states.
Per some testing by Noah Misch, array and path functions seem to be the
only ones presenting risks at the moment, so I looked through all the
functions in adt/array*.c and geo_ops.c and fixed them as necessary. In
the array functions, the easiest/safest fix is to allocate result arrays
with palloc0 instead of palloc. Possibly in future someone will want to
look into whether we can just zero the padding bytes, but that looks too
complex for a back-patchable fix. In the path functions, we already had a
precedent in path_in for just zeroing the one known pad field, so duplicate
that code as needed.
Back-patch to all supported branches.
better handling of NULL elements within the arrays. The third parameter
is a string that should be used to represent a NULL element, or should
be translated into a NULL element, respectively. If the third parameter
is NULL it behaves the same as the two-parameter form.
There are two incompatible changes in the behavior of the two-parameter form
of string_to_array. First, it will return an empty (zero-element) array
rather than NULL when the input string is of zero length. Second, if the
field separator is NULL, the function splits the string into individual
characters, rather than returning NULL as before. These two changes make
this form fully compatible with the behavior of the new three-parameter form.
Pavel Stehule, reviewed by Brendan Jurd
being called as aggregates, and to get the aggregate transition state memory
context if needed. Use it instead of poking directly into AggState and
WindowAggState in places that shouldn't know so much.
We should have done this in 8.4, probably, but better late than never.
Revised version of a patch by Hitoshi Harada.
of individually pfree'ing pass-by-reference transition values. This should
be at least as fast as the prior coding, and it has the major advantage of
clearing out any working data an aggregate function may have stored in or
underneath the aggcontext. This avoids memory leakage when an aggregate
such as array_agg() is used in GROUP BY mode. Per report from Chris Spotts.
Back-patch to 8.4. In principle the problem could arise in prior versions,
but since they didn't have array_agg the issue seems not critical.
ArrayBuildState, per trouble report from Merlin Moncure. By adopting
this fix, we are essentially deciding that aggregate final-functions
should not modify their inputs ever. Adjust documentation and comments
to match that conclusion.
"array_agg_finalfn(null)". We should modify pg_proc entries to prevent this
query from being accepted, but let's just make the function itself secure too.
Per my note of today.
Get rid of VARATT_SIZE and VARATT_DATA, which were simply redundant with
VARSIZE and VARDATA, and as a consequence almost no code was using the
longer names. Rename the length fields of struct varlena and various
derived structures to catch anyplace that was accessing them directly;
and clean up various places so caught. In itself this patch doesn't
change any behavior at all, but it is necessary infrastructure if we hope
to play any games with the representation of varlena headers.
Greg Stark and Tom Lane
the array (for array_push) or higher-dimensional array (for array_cat)
rather than decrementing it as before. This avoids generating lower
bounds other than one for any array operation within the SQL spec. Per
recent discussion.
Interestingly, this seems to have been the original behavior, because
while updating the docs I noticed that a large fraction of relevant
examples were *wrong* for the old behavior and are now right. Is it
worth correcting this in the back-branch docs?
functionality, but I still need to make another pass looking at places
that incidentally use arrays (such as ACL manipulation) to make sure they
are null-safe. Contrib needs work too.
I have not changed the behaviors that are still under discussion about
array comparison and what to do with lower bounds.
be anything yielding an array of the proper kind, not only sub-ARRAY[]
constructs; do subscript checking at runtime not parse time. Also,
adjust array_cat to make array || array comply with the SQL99 spec.
Joe Conway
ANYELEMENT. The effect is to postpone typechecking of the function
body until runtime. Documentation is still lacking.
Original patch by Joe Conway, modified to postpone type checking
by Tom Lane.
comparison functions), replacing the highly bogus bitwise array_eq. Create
a btree index opclass for ANYARRAY --- it is now possible to create indexes
on array columns.
Arrange to cache the results of catalog lookups across multiple array
operations, instead of repeating the lookups on every call.
Add string_to_array and array_to_string functions.
Remove singleton_array, array_accum, array_assign, and array_subscript
functions, since these were for proof-of-concept and not intended to become
supported functions.
Minor adjustments to behavior in some corner cases with empty or
zero-dimensional arrays.
Joe Conway (with some editorializing by Tom Lane).