diff --git a/doc/src/sgml/xoper.sgml b/doc/src/sgml/xoper.sgml
index 2a1957476b..4ee4ac99e2 100644
--- a/doc/src/sgml/xoper.sgml
+++ b/doc/src/sgml/xoper.sgml
@@ -4,17 +4,32 @@
Postgres supports left unary,
right unary and binary
- operators. Operators can be overloaded, or re-used
- with different numbers and types of arguments. If
+ operators. Operators can be overloaded; that is,
+ the same operator name can be used for different operators
+ that have different numbers and types of arguments. If
there is an ambiguous situation and the system cannot
determine the correct operator to use, it will return
- an error and you may have to typecast the left and/or
+ an error. You may have to typecast the left and/or
right operands to help it understand which operator you
meant to use.
- To create an operator for adding two complex numbers
- can be done as follows. First we need to create a
- function to add the new types. Then, we can create the
- operator with the function.
+
+
+
+ Every operator is "syntactic sugar" for a call to an
+ underlying function that does the real work; so you must
+ first create the underlying function before you can create
+ the operator. However, an operator is not
+ merely syntactic sugar, because it carries additional information
+ that helps the query planner optimize queries that use the
+ operator. Much of this chapter will be devoted to explaining
+ that additional information.
+
+
+
+ Here is an example of creating an operator for adding two
+ complex numbers. We assume we've already created the definition
+ of type complex. First we need a function that does the work;
+ then we can define the operator:
CREATE FUNCTION complex_add(complex, complex)
@@ -32,11 +47,7 @@ CREATE OPERATOR + (
- We've shown how to create a binary operator here. To
- create unary operators, just omit one of leftarg (for
- left unary) or rightarg (for right unary).
- If we give the system enough type information, it can
- automatically figure out which operators to use.
+ Now we can do:
SELECT (a + b) AS c FROM test_complex;
@@ -51,8 +62,18 @@ SELECT (a + b) AS c FROM test_complex;
+
+ We've shown how to create a binary operator here. To
+ create unary operators, just omit one of leftarg (for
+ left unary) or rightarg (for right unary). The procedure
+ clause and the argument clauses are the only required items
+ in CREATE OPERATOR. The COMMUTATOR clause shown in the example
+ is an optional hint to the query optimizer. Further details about
+ COMMUTATOR and other optimizer hints appear below.
+
+
- Hash Join Operators
+ Operator Optimization Information
Author
@@ -62,39 +83,335 @@ SELECT (a + b) AS c FROM test_complex;
- The assumption underlying hash join is that two values that will be
- considered equal by the comparison operator will always have the same
- hash value. If two values get put in different hash buckets, the join
- will never compare them at all, so they are necessarily treated as
- unequal.
+ A Postgres operator definition can include
+ several optional clauses that tell the system useful things about how
+ the operator behaves. These clauses should be provided whenever
+ appropriate, because they can make for considerable speedups in execution
+ of queries that use the operator. But if you provide them, you must be
+ sure that they are right! Incorrect use of an optimization clause can
+ result in backend crashes, subtly wrong output, or other Bad Things.
+ You can always leave out an optimization clause if you are not sure
+ about it --- the only consequence is that queries might run slower than
+ they need to.
- But we have a number of datatypes for which the "=" operator is not
- a straight bitwise comparison. For example, intervaleq is not bitwise
- at all; it considers two time intervals equal if they have the same
+ Additional optimization clauses might be added in future versions of
+ Postgres. The ones described here are all
+ the ones that release 6.5 understands.
+
+
+
+ COMMUTATOR
+
+
+ The COMMUTATOR clause, if provided, names an operator that is the
+ commutator of the operator being defined. We say that operator A is the
+ commutator of operator B if (x A y) equals (y B x) for all possible input
+ values x,y. Notice that B is also the commutator of A. For example,
+ operators '<' and '>' for a particular datatype are usually each others'
+ commutators, and operator '+' is usually commutative with itself.
+ But operator '-' is usually not commutative with anything.
+
+
+
+ The left argument type of a commuted operator is the same as the
+ right argument type of its commutator, and vice versa. So the name of
+ the commutator operator is all that Postgres
+ needs to be given to look up the commutator, and that's all that need
+ be provided in the COMMUTATOR clause.
+
+
+
+ When you are defining a self-commutative operator, you just do it.
+ When you are defining a pair of commutative operators, things are
+ a little trickier: how can the first one to be defined refer to the
+ other one, which you haven't defined yet? There are two solutions
+ to this problem:
+
+
+
+ One way is to omit the COMMUTATOR clause in the first operator that
+ you define, and then provide one in the second operator's definition.
+ Since Postgres knows that commutative
+ operators come in pairs, when it sees the second definition it will
+ automatically go back and fill in the missing COMMUTATOR clause in
+ the first definition.
+
+
+
+ The other, more straightforward way is just to include COMMUTATOR clauses
+ in both definitions. When Postgres processes
+ the first definition and realizes that COMMUTATOR refers to a non-existent
+ operator, the system will make a dummy entry for that operator in the
+ system's pg_operator table. This dummy entry will have valid data only
+ for the operator name, left and right argument types, and result type,
+ since that's all that Postgres can deduce
+ at this point. The first operator's catalog entry will link to this
+ dummy entry. Later, when you define the second operator, the system
+ updates the dummy entry with the additional information from the second
+ definition. If you try to use the dummy operator before it's been filled
+ in, you'll just get an error message. (Note: this procedure did not work
+ reliably in Postgres versions before 6.5,
+ but it is now the recommended way to do things.)
+
+
+
+
+
+ NEGATOR
+
+
+ The NEGATOR clause, if provided, names an operator that is the
+ negator of the operator being defined. We say that operator A
+ is the negator of operator B if both return boolean results and
+ (x A y) equals NOT (x B y) for all possible inputs x,y.
+ Notice that B is also the negator of A.
+ For example, '<' and '>=' are a negator pair for most datatypes.
+ An operator can never be validly be its own negator.
+
+
+
+ Unlike COMMUTATOR, a pair of unary operators could validly be marked
+ as each others' negators; that would mean (A x) equals NOT (B x)
+ for all x, or the equivalent for right-unary operators.
+
+
+
+ An operator's negator must have the same left and/or right argument types
+ as the operator itself, so just as with COMMUTATOR, only the operator
+ name need be given in the NEGATOR clause.
+
+
+
+ Providing NEGATOR is very helpful to the query optimizer since
+ it allows expressions like NOT (x = y) to be simplified into
+ x <> y. This comes up more often than you might think, because
+ NOTs can be inserted as a consequence of other rearrangements.
+
+
+
+ Pairs of negator operators can be defined using the same methods
+ explained above for commutator pairs.
+
+
+
+
+
+ RESTRICT
+
+
+ The RESTRICT clause, if provided, names a restriction selectivity
+ estimation function for the operator (note that this is a function
+ name, not an operator name). RESTRICT clauses only make sense for
+ binary operators that return boolean. The idea behind a restriction
+ selectivity estimator is to guess what fraction of the rows in a
+ table will satisfy a WHERE-clause condition of the form
+
+ field OP constant
+
+ for the current operator and a particular constant value.
+ This assists the optimizer by
+ giving it some idea of how many rows will be eliminated by WHERE
+ clauses that have this form. (What happens if the constant is on
+ the left, you may be wondering? Well, that's one of the things that
+ COMMUTATOR is for...)
+
+
+
+ Writing new restriction selectivity estimation functions is far beyond
+ the scope of this chapter, but fortunately you can usually just use
+ one of the system's standard estimators for many of your own operators.
+ These are the standard restriction estimators:
+
+ eqsel for =
+ neqsel for <>
+ intltsel for < or <=
+ intgtsel for > or >=
+
+ It might seem a little odd that these are the categories, but they
+ make sense if you think about it. '=' will typically accept only
+ a small fraction of the rows in a table; '<>' will typically reject
+ only a small fraction. '<' will accept a fraction that depends on
+ where the given constant falls in the range of values for that table
+ column (which, it just so happens, is information collected by
+ VACUUM ANALYZE and made available to the selectivity estimator).
+ '<=' will accept a slightly larger fraction than '<' for the same
+ comparison constant, but they're close enough to not be worth
+ distinguishing, especially since we're not likely to do better than a
+ rough guess anyhow. Similar remarks apply to '>' and '>='.
+
+
+
+ You can frequently get away with using either eqsel or neqsel for
+ operators that have very high or very low selectivity, even if they
+ aren't really equality or inequality. For example, the regular expression
+ matching operators (~, ~*, etc) use eqsel on the assumption that they'll
+ usually only match a small fraction of the entries in a table.
+
+
+
+ JOIN
+
+
+ The JOIN clause, if provided, names a join selectivity
+ estimation function for the operator (note that this is a function
+ name, not an operator name). JOIN clauses only make sense for
+ binary operators that return boolean. The idea behind a join
+ selectivity estimator is to guess what fraction of the rows in a
+ pair of tables will satisfy a WHERE-clause condition of the form
+
+ table1.field1 OP table2.field2
+
+ for the current operator. As with the RESTRICT clause, this helps
+ the optimizer very substantially by letting it figure out which
+ of several possible join sequences is likely to take the least work.
+
+
+
+ As before, this chapter will make no attempt to explain how to write
+ a join selectivity estimator function, but will just suggest that
+ you use one of the standard estimators if one is applicable:
+
+ eqjoinsel for =
+ neqjoinsel for <>
+ intltjoinsel for < or <=
+ intgtjoinsel for > or >=
+
+
+
+
+
+
+ HASHES
+
+
+ The HASHES clause, if present, tells the system that it is OK to
+ use the hash join method for a join based on this operator. HASHES
+ only makes sense for binary operators that return boolean --- and
+ in practice, the operator had better be equality for some data type.
+
+
+
+ The assumption underlying hash join is that the join operator can
+ only return TRUE for pairs of left and right values that hash to the
+ same hash code. If two values get put in different hash buckets, the
+ join will never compare them at all, implicitly assuming that the
+ result of the join operator must be FALSE. So it never makes sense
+ to specify HASHES for operators that do not represent equality.
+
+
+
+ In fact, logical equality is not good enough either --- the operator
+ had better represent pure bitwise equality, because the hash function
+ will be computed on the memory representation of the values regardless
+ of what the bits mean. For example, equality of
+ time intervals is not bitwise equality; the interval equality operator
+ considers two time intervals equal if they have the same
duration, whether or not their endpoints are identical. What this means
- is that a join using "=" between interval fields will yield different
+ is that a join using "=" between interval fields would yield different
results if implemented as a hash join than if implemented another way,
because a large fraction of the pairs that should match will hash to
- different values and will never be compared.
+ different values and will never be compared by the hash join. But
+ if the optimizer chose to use a different kind of join, all the pairs
+ that the equality operator says are equal will be found.
+ We don't want that kind of inconsistency, so we don't mark interval
+ equality as hashable.
- I believe the same problem exists for float data; for example, on
- IEEE-compliant machines, minus zero and plus zero have different bit
- patterns (hence different hash values) but should be considered equal.
- A hashjoin will get it wrong.
+ There are also machine-dependent ways in which a hash join might fail
+ to do the right thing. For example, if your datatype
+ is a structure in which there may be uninteresting pad bits, it's unsafe
+ to mark the equality operator HASHES. (Unless, perhaps, you write
+ your other operators to ensure that the unused bits are always zero.)
+ Another example is that the FLOAT datatypes are unsafe for hash
+ joins. On machines that meet the IEEE floating point standard, minus
+ zero and plus zero are different values (different bit patterns) but
+ they are defined to compare equal. So, if float equality were marked
+ HASHES, a minus zero and a plus zero would probably not be matched up
+ by a hash join, but they would be matched up by any other join process.
- I will go through pg_operator and remove the hashable flag from
- operators that are not safely hashable, but I see no way to
- automatically check for this sort of mistake. The only long-term
- answer is to raise the consciousness of datatype creators about what
- it means to set the oprcanhash flag. Don't do it unless your equality
- operator can be implemented as memcmp()!
+ The bottom line is that you should probably only use HASHES for
+ equality operators that are (or could be) implemented by memcmp().
+
+
+
+
+ SORT1 and SORT2
+
+
+ The SORT clauses, if present, tell the system that it is OK to use
+ the merge join method for a join based on the current operator.
+ Both must be specified if either is. The current operator must be
+ equality for some pair of data types, and the SORT1 and SORT2 clauses
+ name the ordering operator ('<' operator) for the left and right-side
+ data types respectively.
+
+
+
+ Merge join is based on the idea of sorting the left and righthand tables
+ into order and then scanning them in parallel. So, both data types must
+ be capable of being fully ordered, and the join operator must be one
+ that can only succeed for pairs of values that fall at the "same place"
+ in the sort order. In practice this means that the join operator must
+ behave like equality. But unlike hashjoin, where the left and right
+ data types had better be the same (or at least bitwise equivalent),
+ it is possible to mergejoin two
+ distinct data types so long as they are logically compatible. For
+ example, the int2-versus-int4 equality operator is mergejoinable.
+ We only need sorting operators that will bring both datatypes into a
+ logically compatible sequence.
+
+
+
+ When specifying merge sort operators, the current operator and both
+ referenced operators must return boolean; the SORT1 operator must have
+ both input datatypes equal to the current operator's left argument type,
+ and the SORT2 operator must have
+ both input datatypes equal to the current operator's right argument type.
+ (As with COMMUTATOR and NEGATOR, this means that the operator name is
+ sufficient to specify the operator, and the system is able to make dummy
+ operator entries if you happen to define the equality operator before
+ the other ones.)
+
+
+
+ In practice you should only write SORT clauses for an '=' operator,
+ and the two referenced operators should always be named '<'. Trying
+ to use merge join with operators named anything else will result in
+ hopeless confusion, for reasons we'll see in a moment.
+
+
+
+ There are additional restrictions on operators that you mark
+ mergejoinable. These restrictions are not currently checked by
+ CREATE OPERATOR, but a merge join may fail at runtime if any are
+ not true:
+
+
+
+ The mergejoinable equality operator must have a commutator
+ (itself if the two data types are the same, or a related equality operator
+ if they are different).
+
+
+
+ There must be '<' and '>' ordering operators having the same left and
+ right input datatypes as the mergejoinable operator itself. These
+ operators must be named '<' and '>' --- you do
+ not have any choice in the matter, since there is no provision for
+ specifying them explicitly. Note that if the left and right data types
+ are different, neither of these operators is the same as either
+ SORT operator. But they had better order the data values compatibly
+ with the SORT operators, or mergejoin will fail to work.
+
+
+
+