Improve multivariate statistics documentation

Extended statistics commit 7b504eb282c did not include appropriate documentation next to where we document regular planner statistics (I ripped what was submitted before commit and then forgot to put it back), and while later commit 2686ee1b7ccf added some material, it structurally depended on what I had ripped out, so the end result wasn't proper. Fix those problems by shuffling what was added by 2686ee1b7ccf and including some additional material, so that now chapter 14 "Performance Tips" now describes the types of multivariate statistics we currently have, and chapter 68 "How the Planner Uses Statistics" shows some examples. The new text should be more in line with previous material, in (hopefully) the appropriate depth. While at it, fix a small bug in pg_statistic_ext docs: one column was listed in the wrong spot.
2017-04-20 15:43:33 -03:00 · 2017-04-20 15:43:33 -03:00 · 919f6d746e
commit 919f6d746e
parent 8bcb31ad5a
3 changed files with 323 additions and 146 deletions
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@ -4320,6 +4320,17 @@
      <entry>Owner of the statistic</entry>
     </row>

+     <row>
+      <entry><structfield>stxkeys</structfield></entry>
+      <entry><type>int2vector</type></entry>
+      <entry><literal><link linkend="catalog-pg-attribute"><structname>pg_attribute</structname></link>.attnum</literal></entry>
+      <entry>
+       This is an array of values that indicate which table columns this
+       statistic covers. For example a value of <literal>1 3</literal> would
+       mean that the first and the third table columns make up the statistic key.
+      </entry>
+     </row>
+
     <row>
      <entry><structfield>stxkind</structfield></entry>
      <entry><type>char[]</type></entry>
@ -4332,17 +4343,6 @@
      </entry>
     </row>

-     <row>
-      <entry><structfield>stxkeys</structfield></entry>
-      <entry><type>int2vector</type></entry>
-      <entry><literal><link linkend="catalog-pg-attribute"><structname>pg_attribute</structname></link>.attnum</literal></entry>
-      <entry>
-       This is an array of values that indicate which table columns this
-       statistic covers. For example a value of <literal>1 3</literal> would
-       mean that the first and the third table columns make up the statistic key.
-      </entry>
-     </row>
-
     <row>
      <entry><structfield>stxndistinct</structfield></entry>
      <entry><type>pg_ndistinct</type></entry>
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@ -906,6 +906,8 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 &lt; 100 AND unique2 &gt; 9000
   <secondary>of the planner</secondary>
  </indexterm>

+  <sect2>
+   <title>Single-Column Statistics</title>
  <para>
   As we saw in the previous section, the query planner needs to estimate
   the number of rows retrieved by a query in order to make good choices
@ -1043,7 +1045,200 @@ WHERE tablename = 'road';
   Further details about the planner's use of statistics can be found in
   <xref linkend="planner-stats-details">.
  </para>
+  </sect2>

+  <sect2 id="planner-stats-extended">
+   <title>Extended Statistics</title>
+
+   <indexterm zone="planner-stats-extended">
+    <primary>statistics</primary>
+    <secondary>of the planner</secondary>
+   </indexterm>
+
+   <indexterm>
+    <primary>correlation</primary>
+    <secondary>in the query planner</secondary>
+   </indexterm>
+
+   <indexterm>
+    <primary>pg_statistic_ext</primary>
+   </indexterm>
+
+   <para>
+    It is common to see slow queries running bad execution plans because
+    multiple columns used in the query clauses are correlated.
+    The planner normally assumes that multiple conditions
+    are independent of each other,
+    an assumption that does not hold when column values are correlated.
+    Regular statistics, because of their per-individual-column nature,
+    do not capture the knowledge of cross-column correlation;
+    <firstterm>multivariate statistics</firstterm> can be used to instruct
+    the server to obtain statistics across such a set of columns,
+    which are later used by the query optimizer
+    to determine cardinality and selectivity
+    of clauses involving those columns.
+    Multivariate statistics are currently the only use of
+    <firstterm>extended statistics</firstterm>.
+   </para>
+
+   <para>
+    Extended statistics are created using
+    <xref linkend="sql-createstatistics">, which see for more details.
+    Data collection is deferred until the next <command>ANALYZE</command>
+    on the table, after which the stored values can be examined in the
+    <link linkend="catalog-pg-statistic-ext"><structname>pg_statistic_ext</structname></link>
+    catalog.
+   </para>
+
+   <para>
+    The following subsections describe the types of extended statistics
+    that are currently supported.
+   </para>
+
+   <sect3>
+    <title>Functional Dependencies</title>
+
+    <para>
+     The simplest type of extended statistics are functional dependencies,
+     a concept used in definitions of database normal forms.
+     Put simply, it is said that column <literal>b</> is functionally
+     dependent on column <literal>a</> if knowledge of the value of
+     <literal>a</> is sufficient to determine the value of <literal>b</>.
+     In normalized databases, functional dependencies are allowed only on
+     primary keys and superkeys. However, many data sets are in practice not
+     fully normalized for various reasons; intentional denormalization for
+     performance reasons is a common example.
+    </para>
+
+    <para>
+     The existance of functional dependencies directly affects the accuracy
+     of estimates in certain queries.
+     The reason is that conditions on the dependent columns do not
+     restrict the result set, but the query planner (lacking functional
+     dependency knowledge) considers them independent, resulting in
+     underestimates.
+     To inform the planner about the functional dependencies, we collect
+     measurements of dependency during <command>ANALYZE</>. Assessing
+     the degree of dependency between all sets of columns would be
+     prohibitively expensive, so the search is limited to potential
+     dependencies defined using the <literal>dependencies</> option of
+     extended statistics.  It is advisable to create
+     <literal>dependencies</> statistics if and only if functional
+     dependencies actually exist, to avoid unnecessary overhead on both
+     <command>ANALYZE</> and query planning.
+    </para>
+
+    <para>
+     To inspect functional dependencies on a statistics
+     <literal>stts</literal>, you may do this:
+<programlisting>
+CREATE STATISTICS stts WITH (dependencies)
+               ON (zip, city) FROM zipcodes;
+ANALYZE zipcodes;
+SELECT stxname, stxkeys, stxdependencies
+  FROM pg_statistic_ext
+ WHERE stxname = 'stts';
+ stxname | stxkeys |              stxdependencies               
+---------+---------+--------------------------------------------
+ stts    | 1 5     | [{1 => 5 : 1.000000}, {5 => 1 : 0.423130}]
+(1 row)
+</programlisting>
+     where it can be seen that column 1 (a zip code) fully determines column
+     5 (city) so the coefficient is 1.0, while city only determines zip code
+     about 42% of the time, meaning that there are many cities (58%) that are
+     represented by more than a single ZIP code.
+    </para>
+
+    <para>
+     When computing the selectivity, the planner inspects all conditions and
+     attempts to identify which conditions are already implied by other
+     conditions.  The selectivity estimates from any redundant conditions are
+     ignored from a selectivity point of view. In the example query above,
+     the selectivity estimates for either of the conditions may be eliminated,
+     thus improving the overall estimate.
+    </para>
+
+    <sect4>
+     <title>Limitations of Functional Dependencies</title>
+
+     <para>
+      Functional dependencies are a very simple type of statistics, and
+      as such have several limitations. The first limitation is that they
+      only work with simple equality conditions, comparing columns and constant
+      values. It's not possible to use them to eliminate equality conditions
+      comparing two columns or a column to an expression, range clauses,
+      <literal>LIKE</> or any other type of conditions.
+     </para>
+
+     <para>
+      When eliminating the implied conditions, the planner assumes that the
+      conditions are compatible. Consider the following example, where
+      this assumption does not hold:
+
+<programlisting>
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 10;
+                                 QUERY PLAN                                  
+-----------------------------------------------------------------------------
+ Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual rows=0 loops=1)
+   Filter: ((a = 1) AND (b = 10))
+   Rows Removed by Filter: 10000
+</programlisting>
+
+      While there are no rows with such combination of values, the planner
+      is unable to verify whether the values match &mdash; it only knows that
+      the columns are functionally dependent.
+     </para>
+
+     <para>
+      This assumption is related to queries executed on the database; in many
+      cases, it's actually satisfied (e.g. when the GUI only allows selecting
+      compatible values). But if that's not the case, functional dependencies
+      may not be a viable option.
+     </para>
+    </sect4>
+   </sect3>
+
+   <sect3>
+    <title>Multivariate N-Distinct Coefficients</title>
+
+    <para>
+     Single-column statistics store the number of distinct values in each
+     column.  Estimates of the number of distinct values on more than one
+     column (for example, for <literal>GROUP BY a, b</literal>) are
+     frequently wrong when the planner only has single-column statistical
+     data, however, causing it to select bad plans.
+     In order to improve n-distinct estimation when multiple columns are
+     grouped together, the <literal>ndistinct</> option of extended statistics
+     can be used, which instructs <command>ANALYZE</> to collect n-distinct
+     estimates for all possible combinations of two or more columns of the set
+     of columns in the statistics object (the per-column estimates are already
+     available in <structname>pg_statistic</>).
+    </para>
+
+    <para>
+     Continuing the above example, the n-distinct coefficients in a ZIP
+     code table may look like the following:
+<programlisting>
+CREATE STATISTICS stts2 WITH (ndistinct)
+               ON (zip, state, city) FROM zipcodes;
+ANALYZE zipcodes;
+SELECT stxkeys AS k, stxndistinct AS nd
+  FROM pg_statistic_ext
+ WHERE stxname = 'stts2';
+-[ RECORD 1 ]---------------------------------------------
+k  | 1 2 5
+nd | [{(b 1 2), 33178.000000}, {(b 1 5), 33178.000000},
+      {(b 2 5), 27435.000000}, {(b 1 2 5), 33178.000000}]
+(1 row)
+</programlisting>
+     which indicates that there are three combinations of columns that
+     have 33178 distinct values: ZIP code and state; ZIP code and city;
+     and ZIP code, city and state (the fact that they are all equal is
+     expected given the nature of ZIP-code data).  On the other hand,
+     the combination of city and state only has 27435 distinct values.
+    </para>
+   </sect3>
+  </sect2>
 </sect1>

 <sect1 id="explicit-joins">
--- a/doc/src/sgml/planstats.sgml
+++ b/doc/src/sgml/planstats.sgml
@ -445,159 +445,141 @@ rows = (outer_cardinality * inner_cardinality) * selectivity
   operator-specific selectivity functions are mostly found
   in <filename>src/backend/utils/adt/selfuncs.c</filename>.
  </para>
+ </sect1>

-  <sect2 id="functional-dependencies">
-   <title>Functional Dependencies</title>
+ <sect1 id="multivariate-statistics-examples">
+  <title>Multivariate Statistics Examples</title>

+  <indexterm>
+   <primary>row estimation</primary>
+   <secondary>multivariate</secondary>
+  </indexterm>
+
+  <sect2>
+   <title>Functional dependencies</title>
   <para>
-    The simplest type of extended statistics are functional dependencies,
-    used in definitions of database normal forms. When simplified, saying that
-    <literal>b</> is functionally dependent on <literal>a</> means that
-    knowledge of value of <literal>a</> is sufficient to determine value of
-    <literal>b</>.
-   </para>
-
-   <para>
-    In normalized databases, only functional dependencies on primary keys
-    and superkeys are allowed. However, in practice, many data sets are not
-    fully normalized, for example, due to intentional denormalization for
-    performance reasons.
-   </para>
-
-   <para>
-    Functional dependencies directly affect accuracy of the estimates, as
-    conditions on the dependent column(s) do not restrict the result set,
-    resulting in underestimates.
-   </para>
-
-   <para>
-    To inform the planner about the functional dependencies, we collect
-    measurements of dependency during <command>ANALYZE</>. Assessing
-    dependency between all sets of columns would be prohibitively
-    expensive, so we limit our search to potential dependencies defined
-    using the <command>CREATE STATISTICS</> command.
+    Multivariate correlation can be seen with a very simple data set &mdash; a
+    table with two columns, both containing the same values:

 <programlisting>
 CREATE TABLE t (a INT, b INT);
-INSERT INTO t SELECT i/100, i/100 FROM generate_series(1,10000) s(i);
-CREATE STATISTICS s1 WITH (dependencies) ON (a, b) FROM t;
+INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
 ANALYZE t;
-EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
-                                          QUERY PLAN                                             
-------------------------------------------------------------------------------------------------
- Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual time=0.095..3.118 rows=100 loops=1)
+</programlisting>
+
+    As explained in <xref linkend="planner-stats">, the planner can determine
+    cardinality of <structname>t</structname> using the number of pages and
+    rows obtained from <structname>pg_class</structname>:
+
+<programlisting>
+SELECT relpages, reltuples FROM pg_class WHERE relname = 't';
+
+ relpages | reltuples
+----------+-----------
+       45 |     10000
+</programlisting>
+
+    The data distribution is very simple; there are only 100 distinct values
+    in each column, uniformly distributed.
+   </para>
+
+   <para>
+    The following example shows the result of estimating a <literal>WHERE</>
+    condition on the <structfield>a</> column:
+
+<programlisting>
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1;
+                                 QUERY PLAN                                  
+-------------------------------------------------------------------------------
+ Seq Scan on t  (cost=0.00..170.00 rows=100 width=8) (actual rows=100 loops=1)
+   Filter: (a = 1)
+   Rows Removed by Filter: 9900
+</programlisting>
+
+    The planner examines the condition and determines the selectivity
+    of this clause to be 1%.  By comparing this estimate and the actual
+    number of rows, we see that the estimate is very accurate
+    (in fact exact, as the table is very small).  Changing the
+    <literal>WHERE</> to use the <structfield>b</> column, an identical
+    plan is generated.  Observe what happens if we apply the same
+    condition on both columns combining them with <literal>AND</>:
+
+<programlisting>
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
+                                 QUERY PLAN                                  
+-----------------------------------------------------------------------------
+ Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual rows=100 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900
- Planning time: 0.367 ms
- Execution time: 3.380 ms
-(5 rows)
 </programlisting>

-    The planner is now aware of the functional dependencies and considers
-    them when computing the selectivity of the second condition.  Running
-    the query without the statistics would lead to quite different estimates.
+    The planner estimates the selectivity for each condition individually,
+    arriving to the 1% estimates as above, and then multiplies them, getting
+    the final 0.01% estimate. The <quote>actual</quote> figures, however,
+    show that this results in a significant underestimate, as the actual
+    number of rows matching the conditions (100) is two orders of magnitude
+    higher than the estimated value.
+   </para>
+
+   <para>
+    This problem can be fixed by applying functional-dependency
+    multivariate statistics on the two columns:

 <programlisting>
-DROP STATISTICS s1;
-EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
-                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
- Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual time=0.000..6.379 rows=100 loops=1)
+CREATE STATISTICS stts WITH (dependencies) ON (a, b) FROM t;
+ANALYZE t;
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
+ Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual rows=100 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900
- Planning time: 0.000 ms
- Execution time: 6.379 ms
-(5 rows)
 </programlisting>
   </para>
-
-   <para>
-    If no dependency exists, the collected statistics do not influence the
-    query plan. The only effect is to slow down <command>ANALYZE</>. Should
-    partial dependencies exist these will also be stored and applied
-    during planning.
-   </para>
-
-   <para>
-    Similarly to per-column statistics, extended statistics are stored in
-    a system catalog called <structname>pg_statistic_ext</structname>.
-    To inspect the statistics <literal>s1</literal> defined above,
-    you may do this:
-
-<programlisting>
-SELECT stxname,stxdependencies FROM pg_statistic_ext WHERE stxname = 's1';
- stxname |              stxdependencies
---------+--------------------------------------------
- s1      | [{1 => 2 : 1.000000}, {2 => 1 : 1.000000}]
-(1 row)
-</programlisting>
-
-     This shows that the statistics are defined on table <structname>t</>,
-     <structfield>attnums</structfield> lists attribute numbers of columns
-     (references <structname>pg_attribute</structname>). It also shows
-     the length in bytes of the functional dependencies, as found by
-     <command>ANALYZE</> when serialized into a <literal>bytea</> column.
-   </para>
-
-   <para>
-    When computing the selectivity, the planner inspects all conditions and
-    attempts to identify which conditions are already implied by other
-    conditions.  The selectivity estimates from any redundant conditions are
-    ignored from a selectivity point of view. In the example query above,
-    the selectivity estimates for either of the conditions may be eliminated,
-    thus improving the overall estimate.
-   </para>
-
-    <sect3 id="functional-dependencies-limitations">
-     <title>Limitations of functional dependencies</title>
-
-     <para>
-      Functional dependencies are a very simple type of statistics, and
-      as such have several limitations. The first limitation is that they
-      only work with simple equality conditions, comparing columns and constant
-      values. It's not possible to use them to eliminate equality conditions
-      comparing two columns or a column to an expression, range clauses,
-      <literal>LIKE</> or any other type of conditions.
-     </para>
-
-     <para>
-      When eliminating the implied conditions, the planner assumes that the
-      conditions are compatible. Consider the following example, violating
-      this assumption:
-
-<programlisting>
-EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 10;
-                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
- Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual time=2.992..2.992 rows=0 loops=1)
-   Filter: ((a = 1) AND (b = 10))
-   Rows Removed by Filter: 10000
- Planning time: 0.232 ms
- Execution time: 3.033 ms
-(5 rows)
-</programlisting>
-
-      While there are no rows with such combination of values, the planner
-      is unable to verify whether the values match - it only knows that
-      the columns are functionally dependent.
-     </para>
-
-     <para>
-      This assumption is more about queries executed on the database - in many
-      cases, it's actually satisfied (e.g. when the GUI only allows selecting
-      compatible values). But if that's not the case, functional dependencies
-      may not be a viable option.
-     </para>
-
-     <para>
-      For additional information about functional dependencies, see
-      <filename>src/backend/statistics/README.dependencies</>.
-     </para>
-
-    </sect3>
-
  </sect2>

- </sect1>
+  <sect2>
+   <title>Multivariate N-Distinct coefficients</title>
+   <para>
+    A similar problem occurs with estimation of the cardinality of distinct
+    elements, used to determine the number of groups that would be generated
+    by a <command>GROUP BY</command> clause.  When <command>GROUP BY</command>
+    lists a single column, the n-distinct estimate (which can be seen as the
+    number of rows returned by the aggregate execution node) is very accurate:
+<programlisting>
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a;
+                                       QUERY PLAN                                        
+-----------------------------------------------------------------------------------------
+ HashAggregate  (cost=195.00..196.00 rows=100 width=12) (actual rows=100 loops=1)
+   Group Key: a
+   -&gt;  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=4) (actual rows=10000 loops=1)
+</programlisting>
+    But without multivariate statistics, the estimate for the number of
+    groups in a query with two columns in <command>GROUP BY</command>, as
+    in the following example, is off by an order of magnitude:
+<programlisting>
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
+                                       QUERY PLAN                                        
+--------------------------------------------------------------------------------------------
+ HashAggregate  (cost=220.00..230.00 rows=1000 width=16) (actual rows=100 loops=1)
+   Group Key: a, b
+   -&gt;  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)
+</programlisting>
+    By dropping the existing statistics and re-creating it to include n-distinct
+    calculation, the estimate is much improved:
+<programlisting>
+DROP STATISTICS stts;
+CREATE STATISTICS stts WITH (dependencies, ndistinct) ON (a, b) FROM t;
+ANALYZE t;
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
+                                       QUERY PLAN                                        
+--------------------------------------------------------------------------------------------
+ HashAggregate  (cost=220.00..221.00 rows=100 width=16) (actual rows=100 loops=1)
+   Group Key: a, b
+   -&gt;  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)
+</programlisting>
+   </para>

+  </sect2>
+ </sect1>
 </chapter>