Provide a bit more high-level documentation for the GEQO planner.

Per request from Luca Ferrari.
2007-07-21 04:02:41 +00:00 · ddb93cac24
commit ddb93cac24
parent 7abe764f17
2 changed files with 85 additions and 21 deletions
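The Python sketch below is not part of the commit. It illustrates the search loop described by the "Generating Possible Plans with GEQO" section that this commit adds to geqo.sgml: random initial join sequences, fitness based on estimated cost, discarding the least fit candidate, and building new candidates from portions of low-cost sequences. Every name in it (`sequence_cost`, `recombine`, `geqo_sketch`, and the row-count and selectivity inputs) is hypothetical. The cost function is a toy stand-in for the standard planner's cost estimation, and the recombination step is a simple order-preserving crossover, assumed here for illustration and not taken from the PostgreSQL source.

```python
import random


def sequence_cost(seq, rows, selectivity):
    """Toy fitness: sum of intermediate result sizes for a left-deep join order.

    Stand-in (assumption) for invoking the standard planner on one join sequence.
    """
    joined = [seq[0]]
    size = rows[seq[0]]
    total = 0.0
    for rel in seq[1:]:
        sel = 1.0
        for prev in joined:
            # Pairs with no join clause default to selectivity 1.0 (cross product).
            sel *= selectivity.get(frozenset((prev, rel)), 1.0)
        size = size * rows[rel] * sel
        total += size
        joined.append(rel)
    return total


def recombine(a, b, rng):
    """Keep a random slice of parent a in place; fill the rest in parent b's order."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    kept = a[i:j]
    rest = [r for r in b if r not in kept]
    return rest[:i] + kept + rest[i:]


def geqo_sketch(rels, cost, pool_size=8, generations=50, seed=0):
    """Steady-state genetic search over join sequences (illustrative only)."""
    rng = random.Random(seed)
    # Initial stage: some possible join sequences generated at random.
    pool = [rng.sample(rels, len(rels)) for _ in range(pool_size)]
    pool.sort(key=cost)
    for _ in range(generations):
        # Combine portions of two of the more-fit candidates.
        a, b = rng.sample(pool[: pool_size // 2], 2)
        pool[-1] = recombine(a, b, rng)  # discard the least fit candidate
        pool.sort(key=cost)
    # Only the worst entry is ever replaced, so pool[0] is the best sequence seen.
    return pool[0]


rows = {"a": 1000, "b": 50, "c": 200, "d": 10, "e": 5000}
selectivity = {
    frozenset(("a", "b")): 0.01,
    frozenset(("b", "c")): 0.05,
    frozenset(("c", "d")): 0.1,
    frozenset(("d", "e")): 0.001,
}
best = geqo_sketch(list(rows), lambda s: sequence_cost(s, rows, selectivity))
print(best, sequence_cost(best, rows, selectivity))
```

Passing a different `seed` can yield a different join sequence, which mirrors the nondeterminism that the new documentation text warns about.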
--- a/doc/src/sgml/arch-dev.sgml
+++ b/doc/src/sgml/arch-dev.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/arch-dev.sgml,v 2.29 2007/01/31 20:56:16 momjian Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/arch-dev.sgml,v 2.30 2007/07/21 04:02:41 tgl Exp $ -->
 <chapter id="overview">
  <title>Overview of PostgreSQL Internals</title>
@@ -345,9 +345,10 @@
     can be executed would take an excessive amount of time and memory
     space. In particular, this occurs when executing queries
     involving large numbers of join operations. In order to determine
-     a reasonable (not optimal) query plan in a reasonable amount of
+     a reasonable (not necessarily optimal) query plan in a reasonable amount
-     time, <productname>PostgreSQL</productname> uses a <xref
+     of time, <productname>PostgreSQL</productname> uses a <xref
-     linkend="geqo" endterm="geqo-title">.
+     linkend="geqo" endterm="geqo-title"> when the number of joins
     exceeds a threshold (see <xref linkend="guc-geqo-threshold">).
    </para>
   </note>
@@ -380,20 +381,17 @@
     the index's <firstterm>operator class</>, another plan is created using
     the B-tree index to scan the relation. If there are further indexes
     present and the restrictions in the query happen to match a key of an
-     index further plans will be considered.
+     index, further plans will be considered.  Index scan plans are also
     generated for indexes that have a sort ordering that can match the
     query's <literal>ORDER BY</> clause (if any), or a sort ordering that
     might be useful for merge joining (see below).
    </para>
    <para>
-     After all feasible plans have been found for scanning single relations,
+     If the query requires joining two or more relations,
-     plans for joining relations are created. The planner/optimizer
+     plans for joining relations are considered
-     preferentially considers joins between any two relations for which there
+     after all feasible plans have been found for scanning single relations.
-     exist a corresponding join clause in the <literal>WHERE</literal> qualification (i.e. for
+     The three available join strategies are:
     which a restriction like <literal>where rel1.attr1=rel2.attr2</literal>
     exists). Join pairs with no join clause are considered only when there
     is no other choice, that is, a particular relation has no available
     join clauses to any other relation. All possible plans are generated for
     every join pair considered
     by the planner/optimizer. The three possible join strategies are:
     <itemizedlist>
      <listitem>
@@ -439,6 +437,26 @@
     cheapest one.
    </para>
    <para>
     If the query uses fewer than <xref linkend="guc-geqo-threshold">
     relations, a near-exhaustive search is conducted to find the best
     join sequence.  The planner preferentially considers joins between any
     two relations for which there exist a corresponding join clause in the
     <literal>WHERE</literal> qualification (i.e. for
     which a restriction like <literal>where rel1.attr1=rel2.attr2</literal>
     exists). Join pairs with no join clause are considered only when there
     is no other choice, that is, a particular relation has no available
     join clauses to any other relation. All possible plans are generated for
     every join pair considered by the planner, and the one that is
     (estimated to be) the cheapest is chosen.
    </para>
    <para>
     When <varname>geqo_threshold</varname> is exceeded, the join
     sequences considered are determined by heuristics, as described
     in <xref linkend="geqo">.  Otherwise the process is the same.
    </para>
    <para>
     The finished plan tree consists of sequential or index scans of
     the base relations, plus nested-loop, merge, or hash join nodes as
--- a/doc/src/sgml/geqo.sgml
+++ b/doc/src/sgml/geqo.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/geqo.sgml,v 1.39 2007/02/16 03:50:29 momjian Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/geqo.sgml,v 1.40 2007/07/21 04:02:41 tgl Exp $ -->
 <chapter id="geqo">
  <chapterinfo>
@@ -186,11 +186,6 @@
    <productname>PostgreSQL</productname> optimizer.
   </para>
   <para>
    Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's Genitor
    algorithm.
   </para>
   <para>
    Specific characteristics of the <acronym>GEQO</acronym>
    implementation in <productname>PostgreSQL</productname>
@@ -224,6 +219,11 @@
    </itemizedlist>
   </para>
   <para>
    Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's
    Genitor algorithm.
   </para>
   <para>
    The <acronym>GEQO</acronym> module allows
    the <productname>PostgreSQL</productname> query optimizer to
@@ -231,6 +231,42 @@
    non-exhaustive search.
   </para>
  <sect2>
   <title>Generating Possible Plans with <acronym>GEQO</acronym></title>
   <para>
    The <acronym>GEQO</acronym> planning process uses the standard planner
    code to generate plans for scans of individual relations.  Then join
    plans are developed using the genetic approach.  As shown above, each
    candidate join plan is represented by a sequence in which to join
    the base relations.  In the initial stage, the <acronym>GEQO</acronym>
    code simply generates some possible join sequences at random.  For each
    join sequence considered, the standard planner code is invoked to
    estimate the cost of performing the query using that join sequence.
    (For each step of the join sequence, all three possible join strategies
    are considered; and all the initially-determined relation scan plans
    are available.  The estimated cost is the cheapest of these
    possibilities.)  Join sequences with lower estimated cost are considered
    <quote>more fit</> than those with higher cost.  The genetic algorithm
    discards the least fit candidates.  Then new candidates are generated
    by combining genes of more-fit candidates &mdash; that is, by using
    randomly-chosen portions of known low-cost join sequences to create
    new sequences for consideration.  This process is repeated until a
    preset number of join sequences have been considered; then the best
    one found at any time during the search is used to generate the finished
    plan.
   </para>
   <para>
    This process is inherently nondeterministic, because of the randomized
    choices made during both the initial population selection and subsequent
    <quote>mutation</> of the best candidates.  Hence different plans may
    be selected from one run to the next, resulting in varying run time
    and varying output row order.
   </para>
  </sect2>
  <sect2 id="geqo-future">
   <title>Future Implementation Tasks for
    <productname>PostgreSQL</> <acronym>GEQO</acronym></title>
@@ -257,6 +293,16 @@
      </itemizedlist>
     </para>
     <para>
      In the current implementation, the fitness of each candidate join
      sequence is estimated by running the standard planner's join selection
      and cost estimation code from scratch.  To the extent that different
      candidates use similar sub-sequences of joins, a great deal of work
      will be repeated.  This could be made significantly faster by retaining
      cost estimates for sub-joins.  The problem is to avoid expending
      unreasonable amounts of memory on retaining that state.
     </para>
     <para>
      At a more basic level, it is not clear that solving query optimization
      with a GA algorithm designed for TSP is appropriate.  In the TSP case,