Reverse PG_BINARY defines

parent cc2b5e5815
commit a305c7d675

doc/FAQ_BSDI (new file, 34 lines added)

@@ -0,0 +1,34 @@
This outlines how to increase the number of shared memory buffers
supported by BSD/OS.  By default, only 4MB of shared memory is supported
by BSDI.

Keep in mind that shared memory is not pageable.  It is locked in RAM.

Bruce Momjian (pgman@candle.pha.pa.us)

---------------------------------------------------------------------------

Increase SHMMAXPGS by 1024 for every additional 4MB of shared
memory:

/sys/sys/shm.h:69:#define SHMMAXPGS 1024 /* max hardware pages...

The default setting of 1024 is for a maximum of 4MB of shared memory.

For those running 4.1 or later, just recompile the kernel and reboot.
For those running earlier releases, there are more steps outlined below.

---------------------------------------------------------------------------

Use bpatch to find the sysptsize value for the current kernel.
This is computed dynamically at bootup.

$ bpatch -r sysptsize
0x9 = 9

Next, change SYSPTSIZE to a hard-coded value.  Use the bpatch value,
plus add 1 for every additional 4MB of shared memory you desire.

/sys/i386/i386/i386_param.c:28:#define SYSPTSIZE 0 /* dynamically...

sysptsize can not be changed by sysctl on the fly.
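
As a quick check of the arithmetic, here is a small C sketch (mine, not part of the FAQ) that computes SHMMAXPGS for a desired amount of shared memory, assuming the 4KB hardware page size implied by "1024 pages per 4MB":

    #include <stdio.h>

    /* Assumes the 4KB hardware page size implied by "1024 pages per 4MB". */
    #define SHM_PAGE_SIZE   4096
    #define ONE_MB          (1024 * 1024)

    /* SHMMAXPGS needed for a given amount of shared memory, in megabytes. */
    static int shmmaxpgs_for_mb(int mb)
    {
        return (mb * ONE_MB) / SHM_PAGE_SIZE;   /* 1024 pages per 4MB */
    }

    int main(void)
    {
        printf("4MB  => SHMMAXPGS %d\n", shmmaxpgs_for_mb(4));   /* 1024, the default */
        printf("32MB => SHMMAXPGS %d\n", shmmaxpgs_for_mb(32));  /* 8192 */
        return 0;
    }
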
@@ -1055,3 +1055,534 @@ Hiroshi Inoue
Inoue@tpf.co.jp


From owner-pgsql-hackers@hub.org Thu Jan 20 18:45:32 2000
Received: from renoir.op.net (root@renoir.op.net [207.29.195.4])
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id TAA00672
for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 19:45:30 -0500 (EST)
Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.14 $) with ESMTP id TAA01989 for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 19:39:15 -0500 (EST)
Received: from localhost (majordom@localhost)
by hub.org (8.9.3/8.9.3) with SMTP id TAA00957;
Thu, 20 Jan 2000 19:35:19 -0500 (EST)
(envelope-from owner-pgsql-hackers)
Received: by hub.org (bulk_mailer v1.5); Thu, 20 Jan 2000 19:33:34 -0500
Received: (from majordom@localhost)
by hub.org (8.9.3/8.9.3) id TAA00581
for pgsql-hackers-outgoing; Thu, 20 Jan 2000 19:32:37 -0500 (EST)
(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])
by hub.org (8.9.3/8.9.3) with ESMTP id TAA98940
for <pgsql-hackers@postgreSQL.org>; Thu, 20 Jan 2000 19:31:49 -0500 (EST)
(envelope-from tgl@sss.pgh.pa.us)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id TAA25390
for <pgsql-hackers@postgreSQL.org>; Thu, 20 Jan 2000 19:31:32 -0500 (EST)
To: pgsql-hackers@postgreSQL.org
Subject: [HACKERS] Some notes on optimizer cost estimates
Date: Thu, 20 Jan 2000 19:31:32 -0500
Message-ID: <25387.948414692@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Sender: owner-pgsql-hackers@postgreSQL.org
Status: OR

I have been spending some time measuring actual runtimes for various
sequential-scan and index-scan query plans, and have learned that the
current Postgres optimizer's cost estimation equations are not very
close to reality at all.

Presently we estimate the cost of a sequential scan as

	Nblocks + CPU_PAGE_WEIGHT * Ntuples

--- that is, the unit of cost is the time to read one disk page,
and we have a "fudge factor" that relates CPU time per tuple to
disk time per page.  (The default CPU_PAGE_WEIGHT is 0.033, which
is probably too high for modern hardware --- 0.01 seems like it
might be a better default, at least for simple queries.)  OK,
it's a simplistic model, but not too unreasonable so far.

The cost of an index scan is measured in these same terms as

	Nblocks + CPU_PAGE_WEIGHT * Ntuples +
		CPU_INDEX_PAGE_WEIGHT * Nindextuples

Here Ntuples is the number of tuples selected by the index qual
condition (typically, it's less than the total table size used in
sequential-scan estimation).  CPU_INDEX_PAGE_WEIGHT essentially
estimates the cost of scanning an index tuple; by default it's 0.017 or
half CPU_PAGE_WEIGHT.  Nblocks is estimated as the index size plus an
appropriate fraction of the main table size.

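For illustration, the two estimates above can be restated as a small C sketch; the function names and the example numbers are mine, only the formulas and the default weights (0.033 and 0.017) come from the mail:

    #include <stdio.h>

    static const double CPU_PAGE_WEIGHT = 0.033;
    static const double CPU_INDEX_PAGE_WEIGHT = 0.017;

    /* unit of cost = time to read one disk page */
    static double cost_seqscan(double nblocks, double ntuples)
    {
        return nblocks + CPU_PAGE_WEIGHT * ntuples;
    }

    static double cost_indexscan(double nblocks, double ntuples, double nindextuples)
    {
        return nblocks + CPU_PAGE_WEIGHT * ntuples
                       + CPU_INDEX_PAGE_WEIGHT * nindextuples;
    }

    int main(void)
    {
        /* a hypothetical 1000-page, 100000-tuple table; the index qual selects
         * 1000 tuples, and Nblocks for the index scan is assumed to be 50 index
         * pages plus 100 visited table pages */
        printf("seqscan:   %.1f\n", cost_seqscan(1000, 100000));        /* 4300.0 */
        printf("indexscan: %.1f\n", cost_indexscan(150, 1000, 1000));   /* 200.0 */
        return 0;
    }
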
There are two big problems with this:

1. Since main-table tuples are visited in index order, we'll be hopping
around from page to page in the table.  The current cost estimation
method essentially assumes that the buffer cache plus OS disk cache will
be 100% efficient --- we will never have to read the same page of the
main table twice in a scan, due to having discarded it between
references.  This of course is unreasonably optimistic.  Worst case
is that we'd fetch a main-table page for each selected tuple, but in
most cases that'd be unreasonably pessimistic.

2. The cost of a disk page fetch is estimated at 1.0 unit for both
sequential and index scans.  In reality, sequential access is *much*
cheaper than the quasi-random accesses performed by an index scan.
This is partly a matter of physical disk seeks, and partly a matter
of benefitting (or not) from any read-ahead logic the OS may employ.

As best I can measure on my hardware, the cost of a nonsequential
disk read should be estimated at 4 to 5 times the cost of a sequential
one --- I'm getting numbers like 2.2 msec per disk page for sequential
scans, and as much as 11 msec per page for index scans.  I don't
know, however, if this ratio is similar enough on other platforms
to be useful for cost estimating.  We could make it a parameter like
we do for CPU_PAGE_WEIGHT ... but you know and I know that no one
ever bothers to adjust those numbers in the field ...

The other effect that needs to be modeled, and currently is not, is the
"hit rate" of buffer cache.  Presumably, this is 100% for tables smaller
than the cache and drops off as the table size increases --- but I have
no particular thoughts on the form of the dependency.  Does anyone have
ideas here?  The problem is complicated by the fact that we don't really
know how big the cache is; we know the number of buffers Postgres has,
but we have no idea how big a disk cache the kernel is keeping.  As near
as I can tell, finding a hit in the kernel disk cache is not a lot more
expensive than having the page sitting in Postgres' own buffers ---
certainly it's much much cheaper than a disk read.

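One possible shape for that dependency, purely as a sketch: assume an effective cache size and let the hit rate fall off once the table no longer fits. Both the knob and the falloff curve below are assumptions of mine, not something proposed in the mail:

    #include <stdio.h>

    /* Hit rate is 100% while the table fits in an assumed effective cache,
     * then falls off in proportion to how much bigger the table is. */
    static double cache_hit_rate(double table_pages, double effective_cache_pages)
    {
        if (table_pages <= effective_cache_pages)
            return 1.0;
        return effective_cache_pages / table_pages;
    }

    /* Expected cost per page fetch, charging only misses a full page read and
     * treating cache hits as nearly free, as the mail suggests they are. */
    static double page_fetch_cost(double table_pages, double effective_cache_pages)
    {
        return 1.0 - cache_hit_rate(table_pages, effective_cache_pages);
    }

    int main(void)
    {
        printf("%.2f\n", page_fetch_cost(500.0, 1000.0));    /* 0.00: table fully cached */
        printf("%.2f\n", page_fetch_cost(4000.0, 1000.0));   /* 0.75 of a page read */
        return 0;
    }
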
BTW, if you want to do some measurements of your own, try turning on
PGOPTIONS="-d 2 -te".  This will dump a lot of interesting numbers
into the postmaster log, if your platform supports getrusage().

			regards, tom lane

************

From owner-pgsql-hackers@hub.org Thu Jan 20 20:26:33 2000
Received: from hub.org (hub.org [216.126.84.1])
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA06630
for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 21:26:32 -0500 (EST)
Received: from localhost (majordom@localhost)
by hub.org (8.9.3/8.9.3) with SMTP id VAA35022;
Thu, 20 Jan 2000 21:22:08 -0500 (EST)
(envelope-from owner-pgsql-hackers)
Received: by hub.org (bulk_mailer v1.5); Thu, 20 Jan 2000 21:20:35 -0500
Received: (from majordom@localhost)
by hub.org (8.9.3/8.9.3) id VAA34569
for pgsql-hackers-outgoing; Thu, 20 Jan 2000 21:19:38 -0500 (EST)
(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from hercules.cs.ucsb.edu (hercules.cs.ucsb.edu [128.111.41.30])
by hub.org (8.9.3/8.9.3) with ESMTP id VAA34534
for <pgsql-hackers@postgreSQL.org>; Thu, 20 Jan 2000 21:19:26 -0500 (EST)
(envelope-from xun@cs.ucsb.edu)
Received: from xp10-06.dialup.commserv.ucsb.edu (root@xp10-06.dialup.commserv.ucsb.edu [128.111.253.249])
by hercules.cs.ucsb.edu (8.8.6/8.8.6) with ESMTP id SAA04655
for <pgsql-hackers@postgreSQL.org>; Thu, 20 Jan 2000 18:19:22 -0800 (PST)
Received: from xp10-06.dialup.commserv.ucsb.edu (xun@localhost)
by xp10-06.dialup.commserv.ucsb.edu (8.9.3/8.9.3) with ESMTP id SAA22377
for <pgsql-hackers@postgreSQL.org>; Thu, 20 Jan 2000 18:19:40 -0800
Message-Id: <200001210219.SAA22377@xp10-06.dialup.commserv.ucsb.edu>
To: pgsql-hackers@postgreSQL.org
Reply-to: xun@cs.ucsb.edu
Subject: Re. [HACKERS] Some notes on optimizer cost estimates
Date: Thu, 20 Jan 2000 18:19:40 -0800
From: Xun Cheng <xun@cs.ucsb.edu>
Sender: owner-pgsql-hackers@postgreSQL.org
Status: OR

I'm very glad you bring up this cost estimate issue.
Recent work in database research has argued a more
detailed disk access cost model should be used for
large queries especially joins.
Traditional cost estimate only considers the number of
disk pages accessed.  However a more detailed model
would consider three parameters: avg. seek, avg. latency
and avg. page transfer.  For old disk, typical values are
SEEK=9.5 milliseconds, LATENCY=8.3 ms, TRANSFER=2.6ms.
A sequential continuous reading of a table (assuming
1000 continuous pages) would cost
(SEEK+LATENCY+1000*TRANSFER=2617.8ms); while quasi-randomly
reading 200 times with 2 continuous pages/time would
cost (SEEK+200*LATENCY+400*TRANSFER=2700ms).
Someone from IBM lab re-studied the traditional
ad hoc join algorithms (nested, sort-merge, hash) using the detailed cost model
and found some interesting results.

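Re-computing the example in C (the macro names are mine; the 9.5ms/8.3ms/2.6ms figures and the two access patterns are the ones given in the mail):

    #include <stdio.h>

    #define SEEK_MS      9.5
    #define LATENCY_MS   8.3
    #define TRANSFER_MS  2.6

    int main(void)
    {
        /* one continuous read of 1000 pages */
        double sequential = SEEK_MS + LATENCY_MS + 1000 * TRANSFER_MS;
        /* 200 quasi-random reads of 2 continuous pages each (400 pages total) */
        double quasi_random = SEEK_MS + 200 * LATENCY_MS + 400 * TRANSFER_MS;

        printf("sequential, 1000 pages:      %.1f ms\n", sequential);    /* 2617.8 */
        printf("quasi-random, 200 x 2 pages: %.1f ms\n", quasi_random);  /* 2709.5, quoted as ~2700ms */
        return 0;
    }
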
>I have been spending some time measuring actual runtimes for various
>sequential-scan and index-scan query plans, and have learned that the
>current Postgres optimizer's cost estimation equations are not very
>close to reality at all.

One interesting question I'd like to ask is if this non-closeness
really affects the optimal choice of postgresql's query optimizer.
And to what degree the effects might be?  My point is that
if the optimizer estimated the cost for sequential-scan is 10 and
the cost for index-scan is 20 while the actual costs are 10 vs. 40,
it should be ok because the optimizer would still choose sequential-scan
as it should.

>1. Since main-table tuples are visited in index order, we'll be hopping
>around from page to page in the table.

I'm not sure about the implementation in postgresql.  One thing you might
be able to do is to first collect all must-read page addresses from
the index scan and then order them before the actual ordered page fetching.
It would at least avoid the same page being read twice (not entirely
true depending on the context (like in join) and algo.)

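A minimal sketch of that collect-sort-fetch idea; the types and the fetch_page() stub are hypothetical stand-ins, not PostgreSQL code:

    #include <stdio.h>
    #include <stdlib.h>

    typedef unsigned int BlockNum;

    static int cmp_block(const void *a, const void *b)
    {
        BlockNum x = *(const BlockNum *) a;
        BlockNum y = *(const BlockNum *) b;

        return (x > y) - (x < y);
    }

    static void fetch_page(BlockNum blk)
    {
        printf("fetch block %u\n", blk);   /* stand-in for the real heap read */
    }

    int main(void)
    {
        /* block numbers as they might come back from an index scan */
        BlockNum blocks[] = { 17, 3, 17, 42, 3, 8 };
        size_t n = sizeof(blocks) / sizeof(blocks[0]);
        size_t i;

        qsort(blocks, n, sizeof(BlockNum), cmp_block);

        /* fetch in physical order, skipping consecutive duplicates */
        for (i = 0; i < n; i++)
            if (i == 0 || blocks[i] != blocks[i - 1])
                fetch_page(blocks[i]);
        return 0;
    }
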
>The current cost estimation
>method essentially assumes that the buffer cache plus OS disk cache will
>be 100% efficient --- we will never have to read the same page of the
>main table twice in a scan, due to having discarded it between
>references.  This of course is unreasonably optimistic.  Worst case
>is that we'd fetch a main-table page for each selected tuple, but in
>most cases that'd be unreasonably pessimistic.

This is actually the motivation that I asked before if postgresql
has a raw disk facility.  That way we have much control on this cache
issue.  Of course only if we can provide some algo. better than OS
cache algo. (depending on the context, like large joins), a raw disk
facility will be worthwhile (besides the recoverability).

Actually I have another question for you guys which is somehow related
to this cost estimation issue.  You know the difference between OLTP
and OLAP.  My question is how you target postgresql on both kinds
of applications or just OLTP.  From what I know OLTP and OLAP would
have a big difference in query characteristics and thus
optimization difference.  If postgresql is only targeted on
OLTP, the above cost estimation issue might not be that
important.  However for OLAP, large tables and large queries are
common and optimization would be difficult.

xun


************

From owner-pgsql-hackers@hub.org Thu Jan 20 20:41:44 2000
Received: from hub.org (hub.org [216.126.84.1])
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA07020
for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 21:41:43 -0500 (EST)
Received: from localhost (majordom@localhost)
by hub.org (8.9.3/8.9.3) with SMTP id VAA40222;
Thu, 20 Jan 2000 21:34:08 -0500 (EST)
(envelope-from owner-pgsql-hackers)
Received: by hub.org (bulk_mailer v1.5); Thu, 20 Jan 2000 21:32:35 -0500
Received: (from majordom@localhost)
by hub.org (8.9.3/8.9.3) id VAA38388
for pgsql-hackers-outgoing; Thu, 20 Jan 2000 21:31:38 -0500 (EST)
(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])
by hub.org (8.9.3/8.9.3) with ESMTP id VAA37422
for <pgsql-hackers@postgreSQL.org>; Thu, 20 Jan 2000 21:31:02 -0500 (EST)
(envelope-from tgl@sss.pgh.pa.us)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id VAA26761;
Thu, 20 Jan 2000 21:30:41 -0500 (EST)
To: "Hiroshi Inoue" <Inoue@tpf.co.jp>
cc: pgsql-hackers@postgreSQL.org
Subject: Re: [HACKERS] Some notes on optimizer cost estimates
In-reply-to: <000b01bf63b1$093cbd40$2801007e@tpf.co.jp>
References: <000b01bf63b1$093cbd40$2801007e@tpf.co.jp>
Comments: In-reply-to "Hiroshi Inoue" <Inoue@tpf.co.jp>
message dated "Fri, 21 Jan 2000 10:44:20 +0900"
Date: Thu, 20 Jan 2000 21:30:41 -0500
Message-ID: <26758.948421841@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Sender: owner-pgsql-hackers@postgreSQL.org
Status: ORr

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> I've wondered why we couldn't analyze database without vacuum.
> We couldn't run vacuum light-heartedly because it acquires an
> exclusive lock for the target table.

There is probably no real good reason, except backwards compatibility,
why the ANALYZE function (obtaining pg_statistic data) is part of
VACUUM at all --- it could just as easily be a separate command that
would only use read access on the database.  Bruce is thinking about
restructuring VACUUM, so maybe now is a good time to think about
splitting out the ANALYZE code too.

> In addition, vacuum error occurs with analyze option in most
> cases AFAIK.

Still, with current sources?  What's the error message?  I fixed
a problem with pg_statistic tuples getting too big...

			regards, tom lane

************

From tgl@sss.pgh.pa.us Thu Jan 20 21:10:28 2000
Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id WAA08412
for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 22:10:26 -0500 (EST)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id WAA27080;
Thu, 20 Jan 2000 22:10:28 -0500 (EST)
To: Bruce Momjian <pgman@candle.pha.pa.us>
cc: Hiroshi Inoue <Inoue@tpf.co.jp>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Some notes on optimizer cost estimates
In-reply-to: <200001210248.VAA07186@candle.pha.pa.us>
References: <200001210248.VAA07186@candle.pha.pa.us>
Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
message dated "Thu, 20 Jan 2000 21:48:57 -0500"
Date: Thu, 20 Jan 2000 22:10:28 -0500
Message-ID: <27077.948424228@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Status: OR

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> It is nice that ANALYZE is done during vacuum.  I can't imagine why you
> would want to do an analyze without adding a vacuum to it.  I guess
> that's why I made them the same command.

Well, the main bad thing about ANALYZE being part of VACUUM is that
it adds to the length of time that VACUUM is holding an exclusive
lock on the table.  I think it'd make more sense for it to be a
separate command.

I have also been thinking about how to make ANALYZE produce a more
reliable estimate of the most common value.  The three-element list
that it keeps now is a good low-cost hack, but it really doesn't
produce a trustworthy answer unless the MCV is pretty darn common (since
it will never pick up on the MCV at all until there are at least
two occurrences in three adjacent tuples).  The only idea I've come
up with is to use a larger list, which would be slower and take
more memory.  I think that'd be OK in a separate command, but I
hesitate to do it inside VACUUM --- VACUUM has its own considerable
memory requirements, and there's still the issue of not holding down
an exclusive lock longer than you have to.

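One way a "larger list" could work is a bounded candidate table with decaying counts (a lossy-counting style sketch); this is my illustration of the idea, not what ANALYZE actually does:

    #include <stdio.h>
    #include <string.h>

    #define NCANDIDATES 8     /* illustrative size; the current code keeps 3 */

    typedef struct
    {
        int  value;           /* an int-valued column, for simplicity */
        long count;
        int  used;
    } Candidate;

    static void mcv_add(Candidate *cand, int value)
    {
        int i;

        for (i = 0; i < NCANDIDATES; i++)      /* already a candidate? */
            if (cand[i].used && cand[i].value == value)
            {
                cand[i].count++;
                return;
            }
        for (i = 0; i < NCANDIDATES; i++)      /* room for a new candidate? */
            if (!cand[i].used)
            {
                cand[i].value = value;
                cand[i].count = 1;
                cand[i].used = 1;
                return;
            }
        for (i = 0; i < NCANDIDATES; i++)      /* list full: decay all counts */
            if (--cand[i].count == 0)
                cand[i].used = 0;
    }

    int main(void)
    {
        Candidate cand[NCANDIDATES];
        int sample[] = { 7, 7, 3, 7, 9, 7, 2, 7, 1 };
        int i;

        memset(cand, 0, sizeof(cand));
        for (i = 0; i < (int) (sizeof(sample) / sizeof(sample[0])); i++)
            mcv_add(cand, sample[i]);

        for (i = 0; i < NCANDIDATES; i++)
            if (cand[i].used)
                printf("value %d seen ~%ld times\n", cand[i].value, cand[i].count);
        return 0;
    }
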
			regards, tom lane

From Inoue@tpf.co.jp Thu Jan 20 21:08:32 2000
Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34])
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id WAA08225
for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 22:08:29 -0500 (EST)
Received: from cadzone ([126.0.1.40] (may be forged))
by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP
id MAA04148; Fri, 21 Jan 2000 12:08:30 +0900
From: "Hiroshi Inoue" <Inoue@tpf.co.jp>
To: "Bruce Momjian" <pgman@candle.pha.pa.us>, "Tom Lane" <tgl@sss.pgh.pa.us>
Cc: <pgsql-hackers@postgreSQL.org>
Subject: RE: [HACKERS] Some notes on optimizer cost estimates
Date: Fri, 21 Jan 2000 12:14:10 +0900
Message-ID: <001301bf63bd$95cbe680$2801007e@tpf.co.jp>
MIME-Version: 1.0
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
In-Reply-To: <200001210248.VAA07186@candle.pha.pa.us>
Importance: Normal
Status: OR

> -----Original Message-----
> From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
>
> > "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > > I've wondered why we couldn't analyze database without vacuum.
> > > We couldn't run vacuum light-heartedly because it acquires an
> > > exclusive lock for the target table.
> >
> > There is probably no real good reason, except backwards compatibility,
> > why the ANALYZE function (obtaining pg_statistic data) is part of
> > VACUUM at all --- it could just as easily be a separate command that
> > would only use read access on the database.  Bruce is thinking about
> > restructuring VACUUM, so maybe now is a good time to think about
> > splitting out the ANALYZE code too.
>
> I put it in vacuum because at the time I didn't know how to do such
> things and vacuum already scanned the table.  I just linked on to the
> scan.  Seemed like a good idea at the time.
>
> It is nice that ANALYZE is done during vacuum.  I can't imagine why you
> would want to do an analyze without adding a vacuum to it.  I guess
> that's why I made them the same command.
>
> If I made them separate commands, both would have to scan the table,
> though the analyze could do it without the exclusive lock, which would
> be good.
>

The functionality of VACUUM and ANALYZE is quite different.
I don't prefer to charge VACUUM more than now about analyzing
database.  Probably looong lock, more aborts ....
Various kind of analysis would be possible by splitting out ANALYZE.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

From owner-pgsql-hackers@hub.org Fri Jan 21 11:01:59 2000
Received: from hub.org (hub.org [216.126.84.1])
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id MAA07821
for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 12:01:57 -0500 (EST)
Received: from localhost (majordom@localhost)
by hub.org (8.9.3/8.9.3) with SMTP id LAA77357;
Fri, 21 Jan 2000 11:52:25 -0500 (EST)
(envelope-from owner-pgsql-hackers)
Received: by hub.org (bulk_mailer v1.5); Fri, 21 Jan 2000 11:50:46 -0500
Received: (from majordom@localhost)
by hub.org (8.9.3/8.9.3) id LAA76756
for pgsql-hackers-outgoing; Fri, 21 Jan 2000 11:49:50 -0500 (EST)
(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from eclipse.pacifier.com (eclipse.pacifier.com [199.2.117.78])
by hub.org (8.9.3/8.9.3) with ESMTP id LAA76594
for <pgsql-hackers@postgreSQL.org>; Fri, 21 Jan 2000 11:49:01 -0500 (EST)
(envelope-from dhogaza@pacifier.com)
Received: from desktop (dsl-dhogaza.pacifier.net [216.65.147.68])
by eclipse.pacifier.com (8.9.3/8.9.3pop) with SMTP id IAA00225;
Fri, 21 Jan 2000 08:47:26 -0800 (PST)
Message-Id: <3.0.1.32.20000121081044.01036290@mail.pacifier.com>
X-Sender: dhogaza@mail.pacifier.com
X-Mailer: Windows Eudora Pro Version 3.0.1 (32)
Date: Fri, 21 Jan 2000 08:10:44 -0800
To: xun@cs.ucsb.edu, pgsql-hackers@postgreSQL.org
From: Don Baccus <dhogaza@pacifier.com>
Subject: Re: Re. [HACKERS] Some notes on optimizer cost estimates
In-Reply-To: <200001210219.SAA22377@xp10-06.dialup.commserv.ucsb.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Sender: owner-pgsql-hackers@postgreSQL.org
Status: OR

At 06:19 PM 1/20/00 -0800, Xun Cheng wrote:
>I'm very glad you bring up this cost estimate issue.
>Recent work in database research has argued a more
>detailed disk access cost model should be used for
>large queries especially joins.
>Traditional cost estimate only considers the number of
>disk pages accessed.  However a more detailed model
>would consider three parameters: avg. seek, avg. latency
>and avg. page transfer.  For old disk, typical values are
>SEEK=9.5 milliseconds, LATENCY=8.3 ms, TRANSFER=2.6ms.
>A sequential continuous reading of a table (assuming
>1000 continuous pages) would cost
>(SEEK+LATENCY+1000*TRANSFER=2617.8ms); while quasi-randomly
>reading 200 times with 2 continuous pages/time would
>cost (SEEK+200*LATENCY+400*TRANSFER=2700ms).
>Someone from IBM lab re-studied the traditional
>ad hoc join algorithms (nested, sort-merge, hash) using the detailed cost
model
>and found some interesting results.

One complication when doing an index scan is that you are
accessing two separate files (table and index), which can frequently
be expected to cause a considerable increase in average seek time.

Oracle and other commercial databases recommend spreading indices and
tables over several spindles if at all possible in order to minimize
this effect.

I suspect it also helps their optimizer make decisions that are
more consistently good for customers with the largest and most
complex databases and queries, by making cost estimates more predictably
reasonable.

Still...this doesn't help with the question about the effect of the
filesystem cache.  I wandered around the web for a little bit
last night, and found one summary of a paper by Osterhout on the
effect of the Solaris cache on a fileserver serving diskless workstations.
There was reference to the hierarchy involved (i.e. the local workstation
cache is faster than the fileserver's cache which has to be read via
the network which in turn is faster than reading from the fileserver's
disk).  It appears the rule-of-thumb for the cache-hit ratio on reads,
presumably based on measuring some internal Sun systems, used in their
calculations was 80%.

Just a datapoint to think about.

There's also considerable operating system theory on paging systems
that might be useful for thinking about trying to estimate the
Postgres cache/hit ratio.  Then again, maybe Postgres could just
keep count of how many pages of a given table are in the cache at
any given time?  Or simply keep track of the current ratio of hits
and misses?

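The hit/miss bookkeeping suggested here would be tiny; a toy sketch of it (mine, not anything that exists in Postgres):

    #include <stdio.h>

    typedef struct
    {
        long hits;
        long misses;
    } CacheStats;

    static void record_access(CacheStats *s, int was_hit)
    {
        if (was_hit)
            s->hits++;
        else
            s->misses++;
    }

    static double hit_ratio(const CacheStats *s)
    {
        long total = s->hits + s->misses;

        return total > 0 ? (double) s->hits / total : 0.0;
    }

    int main(void)
    {
        CacheStats s = { 0, 0 };

        record_access(&s, 1);
        record_access(&s, 1);
        record_access(&s, 0);
        printf("observed hit ratio: %.2f\n", hit_ratio(&s));   /* 0.67 */
        return 0;
    }
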
>>I have been spending some time measuring actual runtimes for various
>>sequential-scan and index-scan query plans, and have learned that the
>>current Postgres optimizer's cost estimation equations are not very
>>close to reality at all.

>One interesting question I'd like to ask is if this non-closeness
>really affects the optimal choice of postgresql's query optimizer.
>And to what degree the effects might be?  My point is that
>if the optimizer estimated the cost for sequential-scan is 10 and
>the cost for index-scan is 20 while the actual costs are 10 vs. 40,
>it should be ok because the optimizer would still choose sequential-scan
>as it should.

This is crucial, of course - if there are only two types of scans
available, whatever heuristic is used only has to be accurate enough
to pick the right one.  Once the choice is made, it doesn't really
matter (from the optimizer's POV) just how long it will actually take,
the time will be spent and presumably it will be shorter than the
alternative.

How frequently will the optimizer choose wrongly if:

1. All of the tables and indices were in PG buffer cache or filesystem
cache?  (i.e. fixed access times for both types of scans)

or

2. The table's so big that only a small fraction can reside in RAM
during the scan and join, which means that the non-sequential
disk access pattern of the indexed scan is much more expensive.

Also, if you pick sequential scans more frequently based on a presumption
that index scans are expensive due to increased average seek time, how
often will this penalize the heavy-duty user that invests in extra
drives and lots of RAM?

...

>>The current cost estimation
>>method essentially assumes that the buffer cache plus OS disk cache will
>>be 100% efficient --- we will never have to read the same page of the
>>main table twice in a scan, due to having discarded it between
>>references.  This of course is unreasonably optimistic.  Worst case
>>is that we'd fetch a main-table page for each selected tuple, but in
>>most cases that'd be unreasonably pessimistic.
>
>This is actually the motivation that I asked before if postgresql
>has a raw disk facility.  That way we have much control on this cache
>issue.  Of course only if we can provide some algo. better than OS
>cache algo. (depending on the context, like large joins), a raw disk
>facility will be worthwhile (besides the recoverability).

Postgres does have control over its buffer cache.  The one thing that
raw disk I/O would give you is control over where blocks are placed,
meaning you could more accurately model the cost of retrieving them.
So presumably the cache could be tuned to the allocation algorithm
used to place various structures on the disk.

I still wonder just how much gain you get by this approach.  Compared
to, say, simply spending $2,000 on a gigabyte of RAM.  Heck, PCs even
support a couple gigs of RAM now.

>Actually I have another question for you guys which is somehow related
>to this cost estimation issue.  You know the difference between OLTP
>and OLAP.  My question is how you target postgresql on both kinds
>of applications or just OLTP.  From what I know OLTP and OLAP would
>have a big difference in query characteristics and thus
>optimization difference.  If postgresql is only targeted on
>OLTP, the above cost estimation issue might not be that
>important.  However for OLAP, large tables and large queries are
>common and optimization would be difficult.



- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.

************

@@ -1403,7 +1403,7 @@ From owner-pgsql-hackers@hub.org Sat Jan 22 02:31:03 2000
Received: from renoir.op.net (root@renoir.op.net [207.29.195.4])
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id DAA06743
for <pgman@candle.pha.pa.us>; Sat, 22 Jan 2000 03:31:02 -0500 (EST)
-Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.2 $) with ESMTP id DAA07529 for <pgman@candle.pha.pa.us>; Sat, 22 Jan 2000 03:25:13 -0500 (EST)
+Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.3 $) with ESMTP id DAA07529 for <pgman@candle.pha.pa.us>; Sat, 22 Jan 2000 03:25:13 -0500 (EST)
Received: from localhost (majordom@localhost)
by hub.org (8.9.3/8.9.3) with SMTP id DAA31900;
Sat, 22 Jan 2000 03:19:53 -0500 (EST)

@@ -1475,7 +1475,7 @@ From tgl@sss.pgh.pa.us Sat Jan 22 10:31:02 2000
Received: from renoir.op.net (root@renoir.op.net [207.29.195.4])
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id LAA20882
for <pgman@candle.pha.pa.us>; Sat, 22 Jan 2000 11:31:00 -0500 (EST)
-Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2]) by renoir.op.net (o1/$Revision: 1.2 $) with ESMTP id LAA26612 for <pgman@candle.pha.pa.us>; Sat, 22 Jan 2000 11:12:44 -0500 (EST)
+Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2]) by renoir.op.net (o1/$Revision: 1.3 $) with ESMTP id LAA26612 for <pgman@candle.pha.pa.us>; Sat, 22 Jan 2000 11:12:44 -0500 (EST)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id LAA20569;
Sat, 22 Jan 2000 11:11:26 -0500 (EST)

@@ -1499,3 +1499,43 @@ Or equivalently, vacuum after updating all the rows.

			regards, tom lane

From tgl@sss.pgh.pa.us Thu Jan 20 23:51:49 2000
Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id AAA13919
for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 00:51:47 -0500 (EST)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id AAA03644;
Fri, 21 Jan 2000 00:51:51 -0500 (EST)
To: Bruce Momjian <pgman@candle.pha.pa.us>
cc: PostgreSQL-development <pgsql-hackers@postgreSQL.org>
Subject: Re: vacuum timings
In-reply-to: <200001210543.AAA13592@candle.pha.pa.us>
References: <200001210543.AAA13592@candle.pha.pa.us>
Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
message dated "Fri, 21 Jan 2000 00:43:49 -0500"
Date: Fri, 21 Jan 2000 00:51:51 -0500
Message-ID: <3641.948433911@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Status: ORr

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I loaded 10,000,000 rows into CREATE TABLE test (x INTEGER);  Table is
> 400MB and index is 160MB.

> With index on the single int4 column, I got:
> 78 seconds for a vacuum
> 121 seconds for vacuum after deleting a single row
> 662 seconds for vacuum after deleting the entire table

> With no index, I got:
> 43 seconds for a vacuum
> 43 seconds for vacuum after deleting a single row
> 43 seconds for vacuum after deleting the entire table

> I find this quite interesting.

How long does it take to create the index on your setup --- ie,
if vacuum did a drop/create index, would it be competitive?

			regards, tom lane

@@ -8,7 +8,7 @@
* Portions Copyright (c) 1996-2000, PostgreSQL, Inc
* Portions Copyright (c) 1994, Regents of the University of California
*
-* $Id: c.h,v 1.71 2000/06/02 15:57:40 momjian Exp $
+* $Id: c.h,v 1.72 2000/06/02 16:33:17 momjian Exp $
*
*-------------------------------------------------------------------------
*/

@@ -896,7 +896,7 @@ extern char *vararg_format(const char *fmt,...);
* ----------------------------------------------------------------
*/

-#ifndef __CYGWIN32__
+#ifdef __CYGWIN32__
#define PG_BINARY 0
#define PG_BINARY_R "rb"
#define PG_BINARY_W "wb"
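
For context on why these macros matter, a standalone sketch (not PostgreSQL source) of the text-versus-binary file-mode issue the PG_BINARY_* defines address; the MY_* names and the file name are hypothetical:

    #include <stdio.h>

    /* On Cygwin/Windows, text-mode I/O translates line endings and treats
     * Ctrl-Z as end-of-file, so data files must be opened with the "b" modes;
     * on other platforms the distinction is a no-op. */
    #ifdef __CYGWIN32__
    #define MY_BINARY_R "rb"
    #define MY_BINARY_W "wb"
    #else
    #define MY_BINARY_R "r"
    #define MY_BINARY_W "w"
    #endif

    int main(void)
    {
        FILE *fp = fopen("datafile", MY_BINARY_W);

        if (fp == NULL)
        {
            perror("fopen");
            return 1;
        }
        fputc(0x1A, fp);   /* a byte that text-mode reads could mistake for EOF */
        fclose(fp);

        fp = fopen("datafile", MY_BINARY_R);
        if (fp != NULL)
        {
            printf("first byte: 0x%02X\n", fgetc(fp));
            fclose(fp);
        }
        return 0;
    }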
|
Loading…
x
Reference in New Issue
Block a user