
Previously, a value was included in the MCV list if its frequency was 25% larger than the estimated average frequency of all nonnull values in the table. For uniform distributions, that can lead to values being included in the MCV list and significantly overestimated on the basis of relatively few (sometimes just 2) instances being seen in the sample. For non-uniform distributions, it can lead to too few values being included in the MCV list, since the overall average frequency may be dominated by a small number of very common values, while the remaining values may still have a large spread of frequencies, causing both substantial overestimation and underestimation of the remaining values. Furthermore, increasing the statistics target may have little effect because the overall average frequency will remain relatively unchanged. Instead, populate the MCV list with the largest set of common values that are statistically significantly more common than the average frequency of the remaining values. This takes into account the variance of the sample counts, which depends on the counts themselves and on the proportion of the table that was sampled. As a result, it constrains the relative standard error of estimates based on the frequencies of values in the list, reducing the chances of too many values being included. At the same time, it allows more values to be included, since the MCVs need only be more common than the remaining non-MCVs, rather than the overall average. Thus it tends to produce fewer MCVs than the previous code for uniform distributions, and more for non-uniform distributions, reducing estimation errors in both cases. In addition, the algorithm responds better to increasing the statistics target, allowing more values to be included in the MCV list when more of the table is sampled. Jeff Janes, substantially modified by me. Reviewed by John Naylor and Tomas Vondra. Discussion: https://postgr.es/m/CAMkU=1yvdGvW9TmiLAhz2erFnvnPFYHbOZuO+a=4DVkzpuQ2tw@mail.gmail.com
PostgreSQL Database Management System ===================================== This directory contains the source code distribution of the PostgreSQL database management system. PostgreSQL is an advanced object-relational database management system that supports an extended subset of the SQL standard, including transactions, foreign keys, subqueries, triggers, user-defined types and functions. This distribution also contains C language bindings. PostgreSQL has many language interfaces, many of which are listed here: https://www.postgresql.org/download See the file INSTALL for instructions on how to build and install PostgreSQL. That file also lists supported operating systems and hardware platforms and contains information regarding any other software packages that are required to build or run the PostgreSQL system. Copyright and license information can be found in the file COPYRIGHT. A comprehensive documentation set is included in this distribution; it can be read as described in the installation instructions. The latest version of this software may be obtained at https://www.postgresql.org/download/. For more information look at our web site located at https://www.postgresql.org/.
Description
Languages
C
85.7%
PLpgSQL
5.8%
Perl
4.1%
Yacc
1.3%
Makefile
0.7%
Other
2.3%