From owner-pgsql-hackers@hub.org Sun Jun 14 18:45:04 1998
Received: from hub.org (hub.org [209.47.148.200])
    by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id SAA03690
    for <maillist@candle.pha.pa.us>; Sun, 14 Jun 1998 18:45:00 -0400 (EDT)
Received: from localhost (majordom@localhost) by hub.org (8.8.8/8.7.5) with SMTP id SAA28049; Sun, 14 Jun 1998 18:39:42 -0400 (EDT)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Sun, 14 Jun 1998 18:36:06 +0000 (EDT)
Received: (from majordom@localhost) by hub.org (8.8.8/8.7.5) id SAA27943 for pgsql-hackers-outgoing; Sun, 14 Jun 1998 18:36:04 -0400 (EDT)
Received: from angular.illustra.com (ifmxoak.illustra.com [206.175.10.34]) by hub.org (8.8.8/8.7.5) with ESMTP id SAA27925 for <pgsql-hackers@postgresql.org>; Sun, 14 Jun 1998 18:35:47 -0400 (EDT)
Received: from hawk.illustra.com (hawk.illustra.com [158.58.61.70]) by angular.illustra.com (8.7.4/8.7.3) with SMTP id PAA21293 for <pgsql-hackers@postgresql.org>; Sun, 14 Jun 1998 15:35:12 -0700 (PDT)
Received: by hawk.illustra.com (5.x/smail2.5/06-10-94/S)
    id AA07922; Sun, 14 Jun 1998 15:35:13 -0700
From: dg@illustra.com (David Gould)
Message-Id: <9806142235.AA07922@hawk.illustra.com>
Subject: [HACKERS] performance tests, initial results
To: pgsql-hackers@postgreSQL.org
Date: Sun, 14 Jun 1998 15:35:13 -0700 (PDT)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-pgsql-hackers@hub.org
Precedence: bulk
Status: RO

I have been playing a little with the performance tests found in
pgsql/src/test/performance and have a few observations that might be of
minor interest.

The tests themselves are simple enough, although the result parsing in the
driver did not work on Linux. I am enclosing a patch below to fix this. I
think it will also work better on the other systems.

A summary of the results from my testing is below. Details are at the bottom
of this message.

My test system is 'leslie':

    linux 2.0.32, gcc version 2.7.2.3
    P133, HX chipset, 512K L2, 32MB mem
    NCR810 fast scsi, Quantum Atlas 2GB drive (7200 rpm).

                      Results Summary (times in seconds)

                          Single txn     8K txn   Create   8K idx  8K random   Simple
Case Description           8K insert  8K insert    Index   Insert      Scans  Orderby
========================  ==========  =========  =======  =======  =========  =======
1 From Distribution
  P90 FreeBsd -B256            39.56    1190.98     3.69    46.65      65.49     2.27
  IDE

2 Running on leslie
  P133 Linux 2.0.32            15.48     326.75     2.99    20.69      35.81     1.68
  SCSI 32M

3 leslie, -o -F
  no forced writes             15.90      24.98     2.63    20.46      36.43     1.69

4 leslie, -o -F
  no ASSERTS                   14.92      23.23     1.38    18.67      33.79     1.58

5 leslie, -o -F -B2048
  more buffers                 21.31      42.28     2.65    25.74      42.26     1.72

6 leslie, -o -F -B2048
  more bufs, no ASSERT         20.52      39.79     1.40    24.77      39.51     1.55

                Case to Case Difference Factors (+ is faster)

                          Single txn     8K txn   Create   8K idx  8K random   Simple
Case Description           8K insert  8K insert    Index   Insert      Scans  Orderby
========================  ==========  =========  =======  =======  =========  =======

leslie vs BSD P90.              2.56       3.65     1.23     2.25       1.83     1.35

(noflush -F) vs no -F          -1.03      13.08     1.14     1.01      -1.02     1.00

No Assert vs Assert             1.05       1.07     1.90     1.06       1.07     1.09

-B256 vs -B2048                 1.34       1.69     1.01     1.26       1.16     1.02

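The factors in the table above look like plain ratios of the summary times. A minimal Python sketch of that arithmetic follows. The function name and the sign convention are assumptions read off the "(+ is faster)" note and the negative entries in the table, not something taken from the test harness: a positive value is slower time divided by faster time when the second case wins, and the negated inverse when it loses.

```python
def difference_factor(baseline: float, candidate: float) -> float:
    """Signed speed ratio of candidate against baseline (times in seconds).

    Assumed convention: positive when the candidate is at least as fast,
    negative when the candidate is slower.
    """
    if candidate <= baseline:
        return round(baseline / candidate, 2)
    return -round(candidate / baseline, 2)


# Times copied from the results summary: case 2 (no -F) and case 3 (-o -F).
case2 = {"single_txn_insert": 15.48, "8k_txn_insert": 326.75, "create_index": 2.99}
case3 = {"single_txn_insert": 15.90, "8k_txn_insert": 24.98, "create_index": 2.63}

factors = {name: difference_factor(case2[name], case3[name]) for name in case2}
print(factors)
```

With these inputs the three values agree with the first three entries of the "(noflush -F) vs no -F" row: -1.03, 13.08 and 1.14.
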
Observations:

 - leslie (P133 linux) appears to be about 1.8 times faster than the
   P90 BSD system used for the test result distributed with the source, not
   counting the 8K txn insert case, which was completely disk bound.

 - SCSI disks make a big (factor of 3.6) difference. During this test the
   disk was hammering and cpu utilization was < 10%.

 - Assertion checking seems to cost about 7%, except for create index, where
   it costs 90%.

 - The -F option to avoid flushing buffers has a tremendous effect if there are
   many very small transactions. Or, put another way, flushing at the end of
   the transaction is a major disaster for performance.

 - Something is very wrong with our buffer cache implementation. Going from
   256 buffers to 2048 buffers costs an average of 25%. In the 8K txn case
   it costs about 70%. Looking at the code and profiling, I see that in the 8K
   txn case this is in BufferSync(), which examines all the buffers at commit
   time. I don't quite understand why it is so costly for the single 8K row
   txn (35%) though.

It would be nice to have some more tests. Maybe the Wisconsin stuff will
be useful.

----------------- patch to test harness. apply from pgsql ------------
*** src/test/performance/runtests.pl.orig Sun Jun 14 11:34:04 1998
--- src/test/performance/runtests.pl Sun Jun 14 12:07:30 1998
***************
*** 84,123 ****
  open (STDERR, ">$TmpFile") or die;
  select (STDERR); $| = 1;

! for ($i = 0; $i <= $#perftests; $i++)
! {
      $test = $perftests[$i];
      ($test, $XACTBLOCK) = split (/ /, $test);
      $runtest = $test;
!     if ( $test =~ /\.ntm/ )
!     {
!         #
          # No timing for this queries
-         #
          close (STDERR); # close $TmpFile
          open (STDERR, ">/dev/null") or die;
          $runtest =~ s/\.ntm//;
      }
!     else
!     {
          close (STDOUT);
          open(STDOUT, ">&SAVEOUT");
          print STDOUT "\nRunning: $perftests[$i+1] ...";
          close (STDOUT);
          open (STDOUT, ">/dev/null") or die;
          select (STDERR); $| = 1;
!         printf "$perftests[$i+1]: ";
      }

      do "sqls/$runtest";

      # Restore STDERR to $TmpFile
!     if ( $test =~ /\.ntm/ )
!     {
          close (STDERR);
          open (STDERR, ">>$TmpFile") or die;
      }
-
      select (STDERR); $| = 1;
      $i++;
  }
--- 84,116 ----
  open (STDERR, ">$TmpFile") or die;
  select (STDERR); $| = 1;

! for ($i = 0; $i <= $#perftests; $i++) {
      $test = $perftests[$i];
      ($test, $XACTBLOCK) = split (/ /, $test);
      $runtest = $test;
!     if ( $test =~ /\.ntm/ ) {
          # No timing for this queries
          close (STDERR); # close $TmpFile
          open (STDERR, ">/dev/null") or die;
          $runtest =~ s/\.ntm//;
      }
!     else {
          close (STDOUT);
          open(STDOUT, ">&SAVEOUT");
          print STDOUT "\nRunning: $perftests[$i+1] ...";
          close (STDOUT);
          open (STDOUT, ">/dev/null") or die;
          select (STDERR); $| = 1;
!         print "$perftests[$i+1]: ";
      }

      do "sqls/$runtest";

      # Restore STDERR to $TmpFile
!     if ( $test =~ /\.ntm/ ) {
          close (STDERR);
          open (STDERR, ">>$TmpFile") or die;
      }
      select (STDERR); $| = 1;
      $i++;
  }
***************
*** 128,138 ****
  open (TMPF, "<$TmpFile") or die;
  open (RESF, ">$ResFile") or die;

! while (<TMPF>)
! {
!     $str = $_;
!     ($test, $rtime) = split (/:/, $str);
!     ($tmp, $rtime, $rest) = split (/[ ]+/, $rtime);
!     print RESF "$test: $rtime\n";
  }

--- 121,130 ----
  open (TMPF, "<$TmpFile") or die;
  open (RESF, ">$ResFile") or die;

! while (<TMPF>) {
!     if (m/^(.*: ).* ([0-9:.]+) *elapsed/) {
!         ($test, $rtime) = ($1, $2);
!         print RESF $test, $rtime, "\n";
!     }
  }

------------------------------------------------------------------------

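To see what the new parsing rule in the patch accepts, here is a small Python sketch that reuses the same regular expression. The helper name and the sample line are hypothetical: the sample only imitates a "label: ... N:NN.NNelapsed" timing line and is not captured output from the harness.

```python
import re

# Same pattern as the patched runtests.pl: a label ending in ": ", anything,
# then a space and the elapsed-time token immediately before "elapsed".
ELAPSED_RE = re.compile(r"^(.*: ).* ([0-9:.]+) *elapsed")


def parse_timing_line(line):
    """Return (label, elapsed) for a timing line, or None if it does not match."""
    match = ELAPSED_RE.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Hypothetical timing line, made up for illustration.
sample = "Create INDEX on SIMPLE: 0.01user 0.02system 0:02.99elapsed 1%CPU"
print(parse_timing_line(sample))
print(parse_timing_line("a line without timing data"))
```

Unlike the old split-on-colon code, lines that do not carry an elapsed time are skipped rather than written to the result file with an empty value.
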
------------------------- testcase detail --------------------------

1. from distribution
   DBMS: PostgreSQL 6.2b10
   OS: FreeBSD 2.1.5-RELEASE
   HardWare: i586/90, 24M RAM, IDE
   StartUp: postmaster -B 256 '-o -S 2048' -S
   Compiler: gcc 2.6.3
   Compiled: -O, without CASSERT checking, with
             -DTBL_FREE_CMD_MEMORY (to free memory
             if BEGIN/END after each query execution)
   DB connection startup: 0.20
   8192 INSERTs INTO SIMPLE (1 xact): 39.58
   8192 INSERTs INTO SIMPLE (8192 xacts): 1190.98
   Create INDEX on SIMPLE: 3.69
   8192 INSERTs INTO SIMPLE with INDEX (1 xact): 46.65
   8192 random INDEX scans on SIMPLE (1 xact): 65.49
   ORDER BY SIMPLE: 2.27


2. run on leslie with asserts
   DBMS: PostgreSQL 6.3.2 (plus changes to 98/06/01)
   OS: Linux 2.0.32 leslie
   HardWare: i586/133 HX 512, 32M RAM, fast SCSI, 7200rpm
   StartUp: postmaster -B 256 '-o -S 2048' -S
   Compiler: gcc 2.7.2.3
   Compiled: -O, WITH CASSERT checking, with
             -DTBL_FREE_CMD_MEMORY (to free memory
             if BEGIN/END after each query execution)
   DB connection startup: 0.10
   8192 INSERTs INTO SIMPLE (1 xact): 15.48
   8192 INSERTs INTO SIMPLE (8192 xacts): 326.75
   Create INDEX on SIMPLE: 2.99
   8192 INSERTs INTO SIMPLE with INDEX (1 xact): 20.69
   8192 random INDEX scans on SIMPLE (1 xact): 35.81
   ORDER BY SIMPLE: 1.68


3. with -F to avoid forced i/o
   DBMS: PostgreSQL 6.3.2 (plus changes to 98/06/01)
   OS: Linux 2.0.32 leslie
   HardWare: i586/133 HX 512, 32M RAM, fast SCSI, 7200rpm
   StartUp: postmaster -B 256 '-o -S 2048 -F' -S
   Compiler: gcc 2.7.2.3
   Compiled: -O, WITH CASSERT checking, with
             -DTBL_FREE_CMD_MEMORY (to free memory
             if BEGIN/END after each query execution)
   DB connection startup: 0.10
   8192 INSERTs INTO SIMPLE (1 xact): 15.90
   8192 INSERTs INTO SIMPLE (8192 xacts): 24.98
   Create INDEX on SIMPLE: 2.63
   8192 INSERTs INTO SIMPLE with INDEX (1 xact): 20.46
   8192 random INDEX scans on SIMPLE (1 xact): 36.43
   ORDER BY SIMPLE: 1.69


4. no asserts, -F to avoid forced I/O
   DBMS: PostgreSQL 6.3.2 (plus changes to 98/06/01)
   OS: Linux 2.0.32 leslie
   HardWare: i586/133 HX 512, 32M RAM, fast SCSI, 7200rpm
   StartUp: postmaster -B 256 '-o -S 2048 -F' -S
   Compiler: gcc 2.7.2.3
   Compiled: -O, No CASSERT checking, with
             -DTBL_FREE_CMD_MEMORY (to free memory
             if BEGIN/END after each query execution)
   DB connection startup: 0.10
   8192 INSERTs INTO SIMPLE (1 xact): 14.92
   8192 INSERTs INTO SIMPLE (8192 xacts): 23.23
   Create INDEX on SIMPLE: 1.38
   8192 INSERTs INTO SIMPLE with INDEX (1 xact): 18.67
   8192 random INDEX scans on SIMPLE (1 xact): 33.79
   ORDER BY SIMPLE: 1.58


5. with more buffers (2048 vs 256) and -F to avoid forced i/o
   DBMS: PostgreSQL 6.3.2 (plus changes to 98/06/01)
   OS: Linux 2.0.32 leslie
   HardWare: i586/133 HX 512, 32M RAM, fast SCSI, 7200rpm
   StartUp: postmaster -B 2048 '-o -S 2048 -F' -S
   Compiler: gcc 2.7.2.3
   Compiled: -O, WITH CASSERT checking, with
             -DTBL_FREE_CMD_MEMORY (to free memory
             if BEGIN/END after each query execution)
   DB connection startup: 0.11
   8192 INSERTs INTO SIMPLE (1 xact): 21.31
   8192 INSERTs INTO SIMPLE (8192 xacts): 42.28
   Create INDEX on SIMPLE: 2.65
   8192 INSERTs INTO SIMPLE with INDEX (1 xact): 25.74
   8192 random INDEX scans on SIMPLE (1 xact): 42.26
   ORDER BY SIMPLE: 1.72


6. No Asserts, more buffers (2048 vs 256) and -F to avoid forced i/o
   DBMS: PostgreSQL 6.3.2 (plus changes to 98/06/01)
   OS: Linux 2.0.32 leslie
   HardWare: i586/133 HX 512, 32M RAM, fast SCSI, 7200rpm
   StartUp: postmaster -B 2048 '-o -S 2048 -F' -S
   Compiler: gcc 2.7.2.3
   Compiled: -O, No CASSERT checking, with
             -DTBL_FREE_CMD_MEMORY (to free memory
             if BEGIN/END after each query execution)
   DB connection startup: 0.11
   8192 INSERTs INTO SIMPLE (1 xact): 20.52
   8192 INSERTs INTO SIMPLE (8192 xacts): 39.79
   Create INDEX on SIMPLE: 1.40
   8192 INSERTs INTO SIMPLE with INDEX (1 xact): 24.77
   8192 random INDEX scans on SIMPLE (1 xact): 39.51
   ORDER BY SIMPLE: 1.55
---------------------------------------------------------------------

-dg

David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
"Don't worry about people stealing your ideas. If your ideas are any
 good, you'll have to ram them down people's throats." -- Howard Aiken


From owner-pgsql-hackers@hub.org Tue Oct 19 10:31:10 1999
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
    by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id KAA29087
    for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 10:31:08 -0400 (EDT)
Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.16 $) with ESMTP id KAA27535 for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 10:19:47 -0400 (EDT)
Received: from localhost (majordom@localhost)
    by hub.org (8.9.3/8.9.3) with SMTP id KAA30328;
    Tue, 19 Oct 1999 10:12:10 -0400 (EDT)
    (envelope-from owner-pgsql-hackers)
Received: by hub.org (bulk_mailer v1.5); Tue, 19 Oct 1999 10:11:55 -0400
Received: (from majordom@localhost)
    by hub.org (8.9.3/8.9.3) id KAA30030
    for pgsql-hackers-outgoing; Tue, 19 Oct 1999 10:11:00 -0400 (EDT)
    (envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])
    by hub.org (8.9.3/8.9.3) with ESMTP id KAA29914
    for <pgsql-hackers@postgreSQL.org>; Tue, 19 Oct 1999 10:10:33 -0400 (EDT)
    (envelope-from tgl@sss.pgh.pa.us)
Received: from sss.sss.pgh.pa.us (localhost [127.0.0.1])
    by sss.sss.pgh.pa.us (8.9.1/8.9.1) with ESMTP id KAA09038;
    Tue, 19 Oct 1999 10:09:15 -0400 (EDT)
To: "Hiroshi Inoue" <Inoue@tpf.co.jp>
cc: "Vadim Mikheev" <vadim@krs.ru>, pgsql-hackers@postgreSQL.org
Subject: Re: [HACKERS] mdnblocks is an amazing time sink in huge relations
In-reply-to: Your message of Tue, 19 Oct 1999 19:03:22 +0900
    <000801bf1a19$2d88ae20$2801007e@cadzone.tpf.co.jp>
Date: Tue, 19 Oct 1999 10:09:15 -0400
Message-ID: <9036.940342155@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Sender: owner-pgsql-hackers@postgreSQL.org
Status: RO

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> 1. shared cache holds committed system tuples.
> 2. private cache holds uncommitted system tuples.
> 3. relpages of shared cache are updated immediately by
>    physical change and corresponding buffer pages are
>    marked dirty.
> 4. on commit, the contents of uncommitted tuples except
>    relpages,reltuples,... are copied to corresponding tuples
>    in shared cache and the combined contents are
>    committed.
> If so, catalog cache invalidation would be no longer needed.
> But synchronization of the step 4. may be difficult.

I think the main problem is that relpages and reltuples shouldn't
be kept in pg_class columns at all, because they need to have
very different update behavior from the other pg_class columns.

The rest of pg_class is update-on-commit, and we can lock down any one
row in the normal MVCC way (if transaction A has modified a row and
transaction B also wants to modify it, B waits for A to commit or abort,
so it can know which version of the row to start from). Furthermore,
there can legitimately be several different values of a row in use in
different places: the latest committed, an uncommitted modification, and
one or more old values that are still being used by active transactions
because they were current when those transactions started. (BTW, the
present relcache is pretty bad about maintaining pure MVCC transaction
semantics like this, but it seems clear to me that that's the direction
we want to go in.)

relpages cannot operate this way. To be useful for avoiding lseeks,
relpages *must* change exactly when the physical file changes. It
matters not at all whether the particular transaction that extended the
file ultimately commits or not. Moreover there can be only one correct
value (per relation) across the whole system, because there is only one
length of the relation file.

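As an illustration of the relpages requirement, the sketch below contrasts asking the OS for the file length with keeping a block count that is bumped at the moment the file is extended. It is a single-process Python toy under stated assumptions: BLCKSZ is taken to be 8192, and CachedRelation is a hypothetical class, not PostgreSQL code. A real implementation would need the counter to be shared by all backends.

```python
import os

BLCKSZ = 8192  # assumed page size in bytes


def nblocks_by_seek(fd):
    """Ask the OS for the file length (one lseek) and convert it to whole blocks."""
    return os.lseek(fd, 0, os.SEEK_END) // BLCKSZ


class CachedRelation:
    """Hypothetical wrapper that keeps the block count next to the descriptor."""

    def __init__(self, fd):
        self.fd = fd
        self.nblocks = nblocks_by_seek(fd)  # one seek when the file is opened

    def extend(self):
        """Append one zero-filled block and return its block number."""
        os.lseek(self.fd, self.nblocks * BLCKSZ, os.SEEK_SET)
        os.write(self.fd, bytes(BLCKSZ))
        # The count changes exactly when the physical file changes,
        # whether or not the extending transaction later commits.
        self.nblocks += 1
        return self.nblocks - 1
```

After any number of extend() calls, rel.nblocks and nblocks_by_seek(rel.fd) report the same value, but only the latter costs a system call per lookup.
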
If we want to take reltuples seriously and try to maintain it
on-the-fly, then I think it needs still a third behavior. Clearly
it cannot be updated using MVCC rules, or we lose all writer
concurrency (if A has added tuples to a rel, B would have to wait
for A to commit before it could update reltuples...). Furthermore
"updating" isn't a simple matter of storing what you think the new
value is; otherwise two transactions adding tuples in parallel would
leave the wrong answer after B commits and overwrites A's value.
I think it would work for each transaction to keep track of a net delta
in reltuples for each table it's changed (total tuples added less total
tuples deleted), and then atomically add that value to the table's
shared reltuples counter during commit. But that still leaves the
problem of how you use the counter during a transaction to get an
accurate answer to the question "If I scan this table now, how many tuples
will I see?" At the time the question is asked, the current shared
counter value might include the effects of transactions that have
committed since your transaction started, and therefore are not visible
under MVCC rules. I think getting the correct answer would involve
making an instantaneous copy of the current counter at the start of
your xact, and then adding your own private net-uncommitted-delta to
the saved shared counter value when asked the question. This doesn't
look real practical --- you'd have to save the reltuples counts of
*all* tables in the database at the start of each xact, on the off
chance that you might need them. Ugh. Perhaps someone has a better
idea. In any case, reltuples clearly needs different mechanisms than
the ordinary fields in pg_class do, because updating it will be a
performance bottleneck otherwise.

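The net-delta scheme described above can be sketched for a single table as follows. This is an illustrative Python model, not backend code: SharedTupleCounter and Transaction are hypothetical names, and a threading.Lock stands in for whatever atomic update the shared counter would really use.

```python
import threading


class SharedTupleCounter:
    """Shared per-table tuple count, updated only by atomic additions."""

    def __init__(self, initial=0):
        self._value = initial
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def add(self, delta):
        with self._lock:
            self._value += delta


class Transaction:
    """Tracks a private net delta and a copy of the counter taken at start."""

    def __init__(self, counter):
        self._counter = counter
        self._snapshot = counter.read()  # instantaneous copy at xact start
        self._delta = 0                  # tuples added less tuples deleted

    def insert(self, n=1):
        self._delta += n

    def delete(self, n=1):
        self._delta -= n

    def visible_tuples(self):
        """Answer to: if I scan this table now, how many tuples will I see?"""
        return self._snapshot + self._delta

    def commit(self):
        self._counter.add(self._delta)  # add the delta, never overwrite
        self._delta = 0

    def abort(self):
        self._delta = 0  # the shared counter was never touched
```

With a counter starting at 100, if A inserts 10 and commits while B has inserted 5 and deleted 2, B still sees 103 and the shared counter ends at 113 once B commits. The sketch also makes the objection concrete: every transaction would need such a snapshot for every table it might later ask about.
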
If we allow reltuples to be updated only by vacuum-like events, as
it is now, then I think keeping it in pg_class is still OK.

In short, it seems clear to me that relpages should be removed from
pg_class and kept somewhere else if we want to make it more reliable
than it is now, and the same for reltuples (but reltuples doesn't
behave the same as relpages, and probably ought to be handled
differently).

            regards, tom lane

************

From owner-pgsql-hackers@hub.org Tue Oct 19 21:25:30 1999
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
    by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA28130
    for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 21:25:26 -0400 (EDT)
Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.16 $) with ESMTP id VAA10512 for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 21:15:28 -0400 (EDT)
Received: from localhost (majordom@localhost)
    by hub.org (8.9.3/8.9.3) with SMTP id VAA50745;
    Tue, 19 Oct 1999 21:07:23 -0400 (EDT)
    (envelope-from owner-pgsql-hackers)
Received: by hub.org (bulk_mailer v1.5); Tue, 19 Oct 1999 21:07:01 -0400
Received: (from majordom@localhost)
    by hub.org (8.9.3/8.9.3) id VAA50644
    for pgsql-hackers-outgoing; Tue, 19 Oct 1999 21:06:06 -0400 (EDT)
    (envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34])
    by hub.org (8.9.3/8.9.3) with ESMTP id VAA50584
    for <pgsql-hackers@postgreSQL.org>; Tue, 19 Oct 1999 21:05:26 -0400 (EDT)
    (envelope-from Inoue@tpf.co.jp)
Received: from cadzone ([126.0.1.40] (may be forged))
    by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP
    id KAA01715; Wed, 20 Oct 1999 10:05:14 +0900
From: "Hiroshi Inoue" <Inoue@tpf.co.jp>
To: "Tom Lane" <tgl@sss.pgh.pa.us>
Cc: <pgsql-hackers@postgreSQL.org>
Subject: RE: [HACKERS] mdnblocks is an amazing time sink in huge relations
Date: Wed, 20 Oct 1999 10:09:13 +0900
Message-ID: <000501bf1a97$b925a860$2801007e@cadzone.tpf.co.jp>
MIME-Version: 1.0
Content-Type: text/plain;
    charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0
X-Mimeole: Produced By Microsoft MimeOLE V4.72.2106.4
Importance: Normal
Sender: owner-pgsql-hackers@postgreSQL.org
Status: RO

> -----Original Message-----
> From: Hiroshi Inoue [mailto:Inoue@tpf.co.jp]
> Sent: Tuesday, October 19, 1999 6:45 PM
> To: Tom Lane
> Cc: pgsql-hackers@postgreSQL.org
> Subject: RE: [HACKERS] mdnblocks is an amazing time sink in huge
> relations
>
>
> >
> > "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
>
> [snip]
>
> >
> > > Deletion is necessary only not to consume disk space.
> > >
> > > For example vacuum could remove not deleted files.
> >
> > Hmm ... interesting idea ... but I can hear the complaints
> > from users already...
> >
>
> My idea is only an analogy of PostgreSQL's simple recovery
> mechanism of tuples.
>
> And my main point is
>     "delete fails after commit" doesn't harm the database
>     except that not deleted files consume disk space.
>
> Of course, it's preferable to delete relation files immediately
> after (or just when) commit.
> Useless files are visible though useless tuples are invisible.
>

Anyway, I don't need "DROP TABLE inside transactions" now,
and my idea was originally for that issue.

After some thought, I propose the following solution.

1. mdcreate() can't create relation files that already exist.
   If the existing file is of length zero, we would overwrite
   the file (the comment in md.c seems to say so, but the
   code doesn't do so).
   If the file is an index relation file, we would overwrite
   the file.

2. mdunlink() can't unlink non-existent relation files.
   mdunlink() should not call elog(ERROR) even if the file
   doesn't exist, though I couldn't find where to change
   this yet.
   mdopen() should not call elog(ERROR) even if the file
   doesn't exist, and should leave the relation as CLOSED.

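The two rules proposed above can be sketched at the file-system level like this. It is a Python illustration only: create_relation_file and unlink_relation_file are hypothetical stand-ins for the behaviour suggested for mdcreate() and mdunlink(), not their actual code.

```python
import os


def create_relation_file(path, is_index=False):
    """Create a relation file, reusing a leftover file only in the proposed cases.

    A leftover file is overwritten when it is empty or belongs to an index;
    any other existing file is still an error.
    """
    try:
        return os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        if is_index or os.path.getsize(path) == 0:
            return os.open(path, os.O_RDWR | os.O_TRUNC)
        raise


def unlink_relation_file(path):
    """Remove a relation file; a file that is already gone is not an error."""
    try:
        os.unlink(path)
        return True
    except FileNotFoundError:
        return False
```

The design choice is the same in both helpers: a leftover or missing file from an earlier failed operation is tolerated as long as no table data can be lost by doing so.
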
Comments?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

************

From pgsql-hackers-owner+M6267@hub.org Sun Aug 27 21:46:37 2000
Received: from hub.org (root@hub.org [216.126.84.1])
    by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA07972
    for <pgman@candle.pha.pa.us>; Sun, 27 Aug 2000 20:46:36 -0400 (EDT)
Received: from hub.org (majordom@localhost [127.0.0.1])
    by hub.org (8.10.1/8.10.1) with SMTP id e7S0kaL27996;
    Sun, 27 Aug 2000 20:46:36 -0400 (EDT)
Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])
    by hub.org (8.10.1/8.10.1) with ESMTP id e7S05aL24107
    for <pgsql-hackers@postgreSQL.org>; Sun, 27 Aug 2000 20:05:36 -0400 (EDT)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
    by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id UAA01604
    for <pgsql-hackers@postgreSQL.org>; Sun, 27 Aug 2000 20:05:29 -0400 (EDT)
To: pgsql-hackers@postgreSQL.org
Subject: [HACKERS] Possible performance improvement: buffer replacement policy
Date: Sun, 27 Aug 2000 20:05:29 -0400
Message-ID: <1601.967421129@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
X-Mailing-List: pgsql-hackers@postgresql.org
Precedence: bulk
Sender: pgsql-hackers-owner@hub.org
Status: RO

Those of you with long memories may recall a benchmark that Edmund Mergl
drew our attention to back in May '99. That test showed extremely slow
performance for updating a table with many indexes (about 20). At the
time, it seemed the problem was due to bad performance of btree with
many equal keys, so I thought I'd go back and retry the benchmark after
this latest round of btree hackery.

The good news is that btree itself seems to be pretty well fixed; the
bad news is that the benchmark is still slow for large numbers of rows.
The problem is I/O: the CPU mostly sits idle waiting for the disk.
As best I can tell, the difficulty is that the working set of pages
needed to update this many indexes is too large compared to the number
of disk buffers Postgres is using. (I was running with -B 1000 and
looking at behavior for a 100000-row test table. This gave me a table
size of 3876 pages, plus 11526 pages in 20 indexes.)

Of course, there's only so much we can do when the number of buffers
is too small, but I still started to wonder if we are using the buffers
as effectively as we can. Some tracing showed that most of the pages
of the indexes were being read and written multiple times within a
single UPDATE query, while most of the pages of the table proper were
fetched and written only once. That says we're not using the buffers
as well as we could; the index pages are not being kept in memory when
they should be. In a query like this, we should displace main-table
pages sooner to allow keeping more index pages in cache --- but with
the simple LRU replacement method we use, once a page has been loaded
it will stay in cache for at least the next NBuffers (-B) page
references, no matter what. With a large NBuffers that's a long time.

I've come across an interesting article:
    The LRU-K Page Replacement Algorithm For Database Disk Buffering
    Elizabeth J. O'Neil, Patrick E. O'Neil, Gerhard Weikum
    Proceedings of the 1993 ACM SIGMOD international conference
    on Management of Data, May 1993
(If you subscribe to the ACM digital library, you can get a PDF of this
from there.) This article argues that standard LRU buffer management is
inherently not great for database caches, and that it's much better to
replace pages on the basis of time since the K'th most recent reference,
not just time since the most recent one. K=2 is enough to get most of
the benefit. The big win is that you are measuring an actual page
interreference time (between the last two references) and not just
dealing with a lower-bound guess on the interreference time. Frequently
used pages are thus much more likely to stay in cache.

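For readers who want to see the replacement rule in action, here is a deliberately simplified Python sketch of LRU-2 victim selection. It is an illustration under assumptions, not the algorithm exactly as published or any PostgreSQL code: reference history is kept only for resident pages, there is no correlated-reference period, and LRU2Cache is a hypothetical name.

```python
import itertools


class LRU2Cache:
    """Toy buffer pool that evicts by second-most-recent reference time."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._clock = itertools.count(1)  # logical time, one tick per reference
        self._history = {}                # page -> (last_ref, previous_ref); 0 = none

    def reference(self, page):
        """Reference a page and return the page evicted to make room, if any."""
        now = next(self._clock)
        evicted = None
        if page in self._history:
            last, _ = self._history[page]
            self._history[page] = (now, last)
        else:
            if len(self._history) >= self.capacity:
                evicted = self._choose_victim()
                del self._history[evicted]
            self._history[page] = (now, 0)
        return evicted

    def _choose_victim(self):
        # The page with the oldest second-most-recent reference goes first.
        # Pages referenced only once (previous_ref == 0) therefore go before
        # any page referenced twice; ties fall back to plain LRU order.
        return min(self._history,
                   key=lambda p: (self._history[p][1], self._history[p][0]))
```

With capacity 3 and the reference string i, i, a, b, c, plain LRU would evict the twice-used page i when c arrives, whereas this sketch evicts a, which has been touched only once. That is the behaviour wanted for index pages competing with table pages that are read once.
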
It looks like it wouldn't take too much work to replace shared buffers
on the basis of LRU-2 instead of LRU, so I'm thinking about trying it.

Has anyone looked into this area? Is there a better method to try?

            regards, tom lane

From prlw1@newn.cam.ac.uk Fri Jan 19 12:54:45 2001
Received: from henry.newn.cam.ac.uk (henry.newn.cam.ac.uk [131.111.204.130])
    by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id MAA29822
    for <pgman@candle.pha.pa.us>; Fri, 19 Jan 2001 12:54:44 -0500 (EST)
Received: from [131.111.204.180] (helo=quartz.newn.cam.ac.uk)
    by henry.newn.cam.ac.uk with esmtp (Exim 3.13 #1)
    id 14JfkU-0001WA-00; Fri, 19 Jan 2001 17:54:54 +0000
Received: from prlw1 by quartz.newn.cam.ac.uk with local (Exim 3.13 #1)
    id 14Jfj6-0001cq-00; Fri, 19 Jan 2001 17:53:28 +0000
Date: Fri, 19 Jan 2001 17:53:28 +0000
From: Patrick Welche <prlw1@newn.cam.ac.uk>
To: Bruce Momjian <pgman@candle.pha.pa.us>
Cc: Tom Lane <tgl@sss.pgh.pa.us>, pgsql-hackers@postgreSQL.org
Subject: Re: [HACKERS] Possible performance improvement: buffer replacement policy
Message-ID: <20010119175328.A6223@quartz.newn.cam.ac.uk>
Reply-To: prlw1@cam.ac.uk
References: <1601.967421129@sss.pgh.pa.us> <200101191703.MAA25873@candle.pha.pa.us>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2i
In-Reply-To: <200101191703.MAA25873@candle.pha.pa.us>; from pgman@candle.pha.pa.us on Fri, Jan 19, 2001 at 12:03:58PM -0500
Status: RO

On Fri, Jan 19, 2001 at 12:03:58PM -0500, Bruce Momjian wrote:
>
> Tom, did we ever test this? I think we did and found that it was the
> same or worse, right?

(Funnily enough, I just read that message:)

To: Bruce Momjian <pgman@candle.pha.pa.us>
cc: pgsql-hackers@postgreSQL.org
Subject: Re: [HACKERS] Possible performance improvement: buffer replacement policy
In-reply-to: <200010161541.LAA06653@candle.pha.pa.us>
References: <200010161541.LAA06653@candle.pha.pa.us>
Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
    message dated "Mon, 16 Oct 2000 11:41:41 -0400"
Date: Mon, 16 Oct 2000 11:49:52 -0400
Message-ID: <26100.971711392@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
X-Mailing-List: pgsql-hackers@postgresql.org
Precedence: bulk
Sender: pgsql-hackers-owner@hub.org
Status: RO
Content-Length: 947
Lines: 19

Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> It looks like it wouldn't take too much work to replace shared buffers
>> on the basis of LRU-2 instead of LRU, so I'm thinking about trying it.
>>
>> Has anyone looked into this area? Is there a better method to try?

> Sounds like a perfect idea. Good luck. :-)

Actually, the idea went down in flames :-(, but I neglected to report
back to pghackers about it. I did do some code to manage buffers as
LRU-2. I didn't have any good performance test cases to try it with,
but Richard Brosnahan was kind enough to re-run the TPC tests previously
published by Great Bridge with that code in place. Wasn't any faster,
in fact possibly a little slower, likely due to the extra CPU time spent
on buffer freelist management. It's possible that other scenarios might
show a better result, but right now I feel pretty discouraged about the
LRU-2 idea and am not pursuing it.

            regards, tom lane


From pgsql-hackers-owner+M3455@postgresql.org Fri Jan 19 13:18:12 2001
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
    by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA02092
    for <pgman@candle.pha.pa.us>; Fri, 19 Jan 2001 13:18:12 -0500 (EST)
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
    by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f0JIFJ037872;
    Fri, 19 Jan 2001 13:15:19 -0500 (EST)
    (envelope-from pgsql-hackers-owner+M3455@postgresql.org)
Received: from sectorbase2.sectorbase.com ([208.48.122.131])
    by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f0JI7V036780
    for <pgsql-hackers@postgreSQL.org>; Fri, 19 Jan 2001 13:07:31 -0500 (EST)
    (envelope-from vmikheev@SECTORBASE.COM)
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
    id <DG1W4LRZ>; Fri, 19 Jan 2001 09:46:14 -0800
Message-ID: <8F4C99C66D04D4118F580090272A7A234D329F@sectorbase1.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Tom Lane'" <tgl@sss.pgh.pa.us>, Bruce Momjian <pgman@candle.pha.pa.us>
Cc: pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Possible performance improvement: buffer replacement policy
Date: Fri, 19 Jan 2001 10:07:27 -0800
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
    charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: RO

|
> > Tom, did we ever test this? I think we did and found that
|
|
> > it was the same or worse, right?
|
|
>
|
|
> I tried it and didn't see any noticeable improvement on the particular
|
|
> test case I was using, so I got discouraged and didn't pursue the idea
|
|
> further. I'd like to come back to it someday, though.
|
|
|
|
I don't know how much useful could be LRU-2 but with WAL we should try
|
|
to reuse undirty free buffers first, not dirty ones, just to postpone
|
|
writes as long as we can. (BTW, this is what Oracle does.)
|
|
So, we probably should put new unfree dirty buffer just before first
|
|
dirty one in LRU.
|
|
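A minimal sketch of one reading of the suggestion above: when a buffer must be
reused, take the least recently used clean buffer first and fall back to a
dirty one only when no clean buffer is left. The class and method names are
hypothetical and are not taken from the PostgreSQL buffer manager.

```python
from collections import OrderedDict


class BufferPool:
    """Toy LRU free list that reuses clean buffers before dirty ones."""

    def __init__(self):
        # Oldest entry first; the value is True when the buffer is dirty.
        self.lru = OrderedDict()

    def release(self, buf_id, dirty):
        # Re-inserting moves the buffer to the most-recently-used end.
        self.lru.pop(buf_id, None)
        self.lru[buf_id] = dirty

    def pick_victim(self):
        # Prefer the least recently used clean buffer: no write is needed.
        for buf_id, dirty in self.lru.items():
            if not dirty:
                del self.lru[buf_id]
                return buf_id, False
        # Every candidate is dirty: plain LRU, and the caller pays the write.
        return self.lru.popitem(last=False)


pool = BufferPool()
pool.release(1, dirty=True)
pool.release(2, dirty=False)
pool.release(3, dirty=True)
pool.release(4, dirty=False)
victims = [pool.pick_victim() for _ in range(4)]
```

The clean buffers 2 and 4 are handed out before the dirty buffers 1 and 3, so
the writes are postponed for as long as clean buffers remain.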
|
|
Vadim
|
|
|
|
From markw@mohawksoft.com Thu Jun 7 14:40:02 2001
|
|
Return-path: <markw@mohawksoft.com>
|
|
Received: from gromit.dotclick.com (ipn9-f8366.net-resource.net [216.204.83.66])
|
|
by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f57Ie1c14004
|
|
for <pgman@candle.pha.pa.us>; Thu, 7 Jun 2001 14:40:02 -0400 (EDT)
|
|
Received: from mohawksoft.com (IDENT:markw@localhost.localdomain [127.0.0.1])
|
|
by gromit.dotclick.com (8.9.3/8.9.3) with ESMTP id OAA04973;
|
|
Thu, 7 Jun 2001 14:37:00 -0400
|
|
Sender: markw@gromit.dotclick.com
|
|
Message-ID: <3B1FC9CB.57C72AD6@mohawksoft.com>
|
|
Date: Thu, 07 Jun 2001 14:36:59 -0400
|
|
From: mlw <markw@mohawksoft.com>
|
|
X-Mailer: Mozilla 4.75 [en] (X11; U; Linux 2.4.2 i686)
|
|
X-Accept-Language: en
|
|
MIME-Version: 1.0
|
|
To: Bruce Momjian <pgman@candle.pha.pa.us>,
|
|
"pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
|
|
Subject: Re: 7.2 items
|
|
References: <200106071503.f57F32n03924@candle.pha.pa.us>
|
|
Content-Type: text/plain; charset=us-ascii
|
|
Content-Transfer-Encoding: 7bit
|
|
Status: RO
|
|
|
|
Bruce Momjian wrote:
|
|
|
|
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
|
|
> >
|
|
> > > Here is a small list of big TODO items. I was wondering which ones
|
|
> > > people were thinking about for 7.2?
|
|
> >
|
|
> > A friend of mine wants to use PostgreSQL instead of Oracle for a large
|
|
> > application, but has run into a snag when speed comparisons looked
|
|
> > good until the Oracle folks added a couple of BITMAP indexes. I can't
|
|
> > recall seeing any discussion about that here -- are there any plans?
|
|
>
|
|
> It is not on our list and I am not sure what they do.
|
|
|
|
Do you have access to any Oracle Documentation? There is a good explanation
|
|
of them.
|
|
|
|
However, I will try to explain.
|
|
|
|
Say you have a table, locations, with 1,000,000 records.
|
|
|
|
In oracle you do this:
|
|
|
|
create bitmap index bitmap_foo on locations (state) ;
|
|
|
|
For each unique value of 'state' oracle will create a bitmap with 1,000,000
|
|
bits in it, with a one representing a match and a zero representing no
|
|
match. Record '0' in the table is represented by bit '0' in the bitmap,
|
|
record '1' is represented by bit '1', record two by bit '2' and so on.
|
|
|
|
In a table where comparatively few different values are to be indexed in a
|
|
large table, a bitmap index can be quite small and not suffer the N * log(N)
|
|
disk I/O most tree based indexes suffer. If the bitmap is fairly sparse or
|
|
dense (or has periods of denseness and sparseness), it can be compressed
|
|
very efficiently as well.
|
|
|
|
When the statement:
|
|
|
|
select * from locations where state = 'MA';
|
|
|
|
Is executed, the bitmap is read into memory in very few disk operations.
|
|
(Perhaps even as few as one or two). It is a simple operation of rifling
|
|
through the bitmap for '1's that indicate the record has the property,
|
|
'state' = 'MA';
|
|
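The description above can be sketched in a few lines. This is only an
illustration of the idea, assuming a Python integer as the bitmap (bit i
stands for record i); it says nothing about how Oracle stores or compresses
its bitmaps, and the function names are made up for the example.

```python
def build_bitmap_index(values):
    """Map each distinct value to a bitmap in which bit i is set for row i."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def matching_rows(bitmap):
    """Walk the bitmap and return the row numbers whose bit is 1."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# A six-row stand-in for the 'state' column of the locations table.
states = ["MA", "OH", "MA", "NY", "OH", "MA"]
state_index = build_bitmap_index(states)

# Equivalent of: select * from locations where state = 'MA';
ma_rows = matching_rows(state_index["MA"])
```

With only three distinct values the whole index is three small integers, which
is the property that makes the approach attractive for low-cardinality columns.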
|
|
|
|
From mascarm@mascari.com Thu Jun 7 15:36:25 2001
|
|
Return-path: <mascarm@mascari.com>
|
|
Received: from corvette.mascari.com (dhcp065-024-161-045.columbus.rr.com [65.24.161.45])
|
|
by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f57JaOc21943
|
|
for <pgman@candle.pha.pa.us>; Thu, 7 Jun 2001 15:36:24 -0400 (EDT)
|
|
Received: from ferrari (ferrari.mascari.com [192.168.2.1])
|
|
by corvette.mascari.com (8.9.3/8.9.3) with SMTP id PAA25607;
|
|
Thu, 7 Jun 2001 15:29:31 -0400
|
|
Received: by localhost with Microsoft MAPI; Thu, 7 Jun 2001 15:34:18 -0400
|
|
Message-ID: <01C0EF67.5105D2E0.mascarm@mascari.com>
|
|
From: Mike Mascari <mascarm@mascari.com>
|
|
Reply-To: "mascarm@mascari.com" <mascarm@mascari.com>
|
|
To: "'mlw'" <markw@mohawksoft.com>, Bruce Momjian <pgman@candle.pha.pa.us>,
|
|
"pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
|
|
Subject: RE: [HACKERS] Re: 7.2 items
|
|
Date: Thu, 7 Jun 2001 15:34:17 -0400
|
|
Organization: Mascari Development Inc.
|
|
X-Mailer: Microsoft Internet E-mail/MAPI - 8.0.0.4211
|
|
MIME-Version: 1.0
|
|
Content-Type: text/plain; charset="us-ascii"
|
|
Content-Transfer-Encoding: 7bit
|
|
Status: RO
|
|
|
|
And in addition,
|
|
|
|
If you submitted the query:
|
|
|
|
SELECT * FROM addresses WHERE state = 'OH'
|
|
AND areacode = '614'
|
|
|
|
Then, with bitmap indexes, the bitmaps are just logically ANDed
|
|
together, and the final bitmap determines the matching rows.
|
|
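A small sketch of that AND step, again using Python integers as stand-in
bitmaps. The sample data and the helper name bitmap_for are invented for the
illustration.

```python
def bitmap_for(column, wanted):
    """Return a bitmap with bit i set when column[i] equals wanted."""
    bits = 0
    for row, value in enumerate(column):
        if value == wanted:
            bits |= 1 << row
    return bits


# Two columns of a five-row stand-in for the addresses table.
state = ["OH", "OH", "MA", "OH", "NY"]
areacode = ["614", "216", "614", "614", "212"]

# WHERE state = 'OH' AND areacode = '614' becomes a single bitwise AND.
both = bitmap_for(state, "OH") & bitmap_for(areacode, "614")
rows = [row for row in range(len(state)) if (both >> row) & 1]
```

Only the rows set in the combined bitmap need to be fetched from the table.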
|
|
Mike Mascari
|
|
mascarm@mascari.com
|
|
|
|
-----Original Message-----
|
|
From: mlw [SMTP:markw@mohawksoft.com]
|
|
|
|
Bruce Momjian wrote:
|
|
|
|
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
|
|
> >
|
|
> > > Here is a small list of big TODO items. I was wondering which ones
|
|
> > > people were thinking about for 7.2?
|
|
> >
|
|
> > A friend of mine wants to use PostgreSQL instead of Oracle for a large
|
|
> > application, but has run into a snag when speed comparisons looked
|
|
> > good until the Oracle folks added a couple of BITMAP indexes. I can't
|
|
> > recall seeing any discussion about that here -- are there any plans?
|
|
>
|
|
> It is not on our list and I am not sure what they do.
|
|
|
|
Do you have access to any Oracle Documentation? There is a good explanation
|
|
of them.
|
|
|
|
However, I will try to explain.
|
|
|
|
If you have a table, locations. It has 1,000,000 records.
|
|
|
|
In oracle you do this:
|
|
|
|
create bitmap index bitmap_foo on locations (state) ;
|
|
|
|
For each unique value of 'state' oracle will create a bitmap with 1,000,000
|
|
bits in it. With a one representing a match and a zero representing no
|
|
match. Record '0' in the table is represented by bit '0' in the bitmap,
|
|
record '1' is represented by bit '1', record two by bit '2' and so on.
|
|
|
|
In a table where comparatively few different values are to be indexed in a
|
|
large table, a bitmap index can be quite small and not suffer the N * log(N)
|
|
disk I/O most tree based indexes suffer. If the bitmap is fairly sparse or
|
|
dense (or have periods of denseness and sparseness), it can be compressed
|
|
very efficiently as well.
|
|
|
|
When the statement:
|
|
|
|
select * from locations where state = 'MA';
|
|
|
|
Is executed, the bitmap is read into memory in very few disk operations.
|
|
(Perhaps even as few as one or two). It is a simple operation of rifling
|
|
through the bitmap for '1's that indicate the record has the property,
|
|
'state' = 'MA';
|
|
|
|
|
|
|
|
From oleg@sai.msu.su Thu Jun 7 15:39:15 2001
|
|
Return-path: <oleg@sai.msu.su>
|
|
Received: from ra.sai.msu.su (ra.sai.msu.su [158.250.29.2])
|
|
by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f57Jd7c22010
|
|
for <pgman@candle.pha.pa.us>; Thu, 7 Jun 2001 15:39:08 -0400 (EDT)
|
|
Received: from ra (ra [158.250.29.2])
|
|
by ra.sai.msu.su (8.9.3/8.9.3) with ESMTP id WAA07783;
|
|
Thu, 7 Jun 2001 22:38:20 +0300 (GMT)
|
|
Date: Thu, 7 Jun 2001 22:38:20 +0300 (GMT)
|
|
From: Oleg Bartunov <oleg@sai.msu.su>
|
|
X-X-Sender: <megera@ra.sai.msu.su>
|
|
To: mlw <markw@mohawksoft.com>
|
|
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
|
|
"pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Re: 7.2 items
|
|
In-Reply-To: <3B1FC9CB.57C72AD6@mohawksoft.com>
|
|
Message-ID: <Pine.GSO.4.33.0106072234120.6015-100000@ra.sai.msu.su>
|
|
MIME-Version: 1.0
|
|
Content-Type: TEXT/PLAIN; charset=US-ASCII
|
|
Status: RO
|
|
|
|
I think it's possible to implement bitmap indexes with a little
|
|
effort using GiST. at least I know one implementation
|
|
http://www.it.iitb.ernet.in/~rvijay/dbms/proj/
|
|
If you are interested, you could implement bitmap indexes yourself;
|
|
unfortunately, we're very busy.
|
|
|
|
Oleg
|
|
On Thu, 7 Jun 2001, mlw wrote:
|
|
|
|
> Bruce Momjian wrote:
|
|
>
|
|
> > > Bruce Momjian <pgman@candle.pha.pa.us> writes:
|
|
> > >
|
|
> > > > Here is a small list of big TODO items. I was wondering which ones
|
|
> > > > people were thinking about for 7.2?
|
|
> > >
|
|
> > > A friend of mine wants to use PostgreSQL instead of Oracle for a large
|
|
> > > application, but has run into a snag when speed comparisons looked
|
|
> > > good until the Oracle folks added a couple of BITMAP indexes. I can't
|
|
> > > recall seeing any discussion about that here -- are there any plans?
|
|
> >
|
|
> > It is not on our list and I am not sure what they do.
|
|
>
|
|
> Do you have access to any Oracle Documentation? There is a good explanation
|
|
> of them.
|
|
>
|
|
> However, I will try to explain.
|
|
>
|
|
> If you have a table, locations. It has 1,000,000 records.
|
|
>
|
|
> In oracle you do this:
|
|
>
|
|
> create bitmap index bitmap_foo on locations (state) ;
|
|
>
|
|
> For each unique value of 'state' oracle will create a bitmap with 1,000,000
|
|
> bits in it. With a one representing a match and a zero representing no
|
|
> match. Record '0' in the table is represented by bit '0' in the bitmap,
|
|
> record '1' is represented by bit '1', record two by bit '2' and so on.
|
|
>
|
|
> In a table where comparatively few different values are to be indexed in a
|
|
> large table, a bitmap index can be quite small and not suffer the N * log(N)
|
|
> disk I/O most tree based indexes suffer. If the bitmap is fairly sparse or
|
|
> dense (or have periods of denseness and sparseness), it can be compressed
|
|
> very efficiently as well.
|
|
>
|
|
> When the statement:
|
|
>
|
|
> select * from locations where state = 'MA';
|
|
>
|
|
> Is executed, the bitmap is read into memory in very few disk operations.
|
|
> (Perhaps even as few as one or two). It is a simple operation of rifling
|
|
> through the bitmap for '1's that indicate the record has the property,
|
|
> 'state' = 'MA';
|
|
>
|
|
>
|
|
> ---------------------------(end of broadcast)---------------------------
|
|
> TIP 6: Have you searched our list archives?
|
|
>
|
|
> http://www.postgresql.org/search.mpl
|
|
>
|
|
|
|
Regards,
|
|
Oleg
|
|
_____________________________________________________________
|
|
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
|
|
Sternberg Astronomical Institute, Moscow University (Russia)
|
|
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
|
|
phone: +007(095)939-16-83, +007(095)939-23-83
|
|
|
|
|
|
From pgsql-general-owner+M2497@hub.org Fri Jun 16 18:31:03 2000
|
|
Received: from renoir.op.net (root@renoir.op.net [207.29.195.4])
|
|
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id RAA04165
|
|
for <pgman@candle.pha.pa.us>; Fri, 16 Jun 2000 17:31:01 -0400 (EDT)
|
|
Received: from hub.org (root@hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.16 $) with ESMTP id RAA13110 for <pgman@candle.pha.pa.us>; Fri, 16 Jun 2000 17:20:12 -0400 (EDT)
|
|
Received: from hub.org (majordom@localhost [127.0.0.1])
|
|
by hub.org (8.10.1/8.10.1) with SMTP id e5GLDaM14477;
|
|
Fri, 16 Jun 2000 17:13:36 -0400 (EDT)
|
|
Received: from home.dialix.com ([203.15.150.26])
|
|
by hub.org (8.10.1/8.10.1) with ESMTP id e5GLCQM14064
|
|
for <pgsql-general@postgresql.org>; Fri, 16 Jun 2000 17:12:27 -0400 (EDT)
|
|
Received: from nemeton.com.au ([202.76.153.71])
|
|
by home.dialix.com (8.9.3/8.9.3/JustNet) with SMTP id HAA95516
|
|
for <pgsql-general@postgresql.org>; Sat, 17 Jun 2000 07:11:44 +1000 (EST)
|
|
(envelope-from giles@nemeton.com.au)
|
|
Received: (qmail 10213 invoked from network); 16 Jun 2000 09:52:29 -0000
|
|
Received: from nemeton.com.au (203.8.3.17)
|
|
by nemeton.com.au with SMTP; 16 Jun 2000 09:52:29 -0000
|
|
To: Jurgen Defurne <defurnj@glo.be>
|
|
cc: Mark Stier <kalium@gmx.de>,
|
|
postgreSQL general mailing list <pgsql-general@postgresql.org>
|
|
Subject: Re: [GENERAL] optimization by removing the file system layer?
|
|
In-Reply-To: Message from Jurgen Defurne <defurnj@glo.be>
|
|
of "Thu, 15 Jun 2000 20:26:57 +0200." <39491FF1.E1E583F8@glo.be>
|
|
Date: Fri, 16 Jun 2000 19:52:28 +1000
|
|
Message-ID: <10210.961149148@nemeton.com.au>
|
|
From: Giles Lean <giles@nemeton.com.au>
|
|
X-Mailing-List: pgsql-general@postgresql.org
|
|
Precedence: bulk
|
|
Sender: pgsql-general-owner@hub.org
|
|
Status: OR
|
|
|
|
|
|
|
|
> I think that the Un*x filesystem is one of the reasons that large
|
|
> database vendors rather use raw devices, than filesystem storage
|
|
> files.
|
|
|
|
This used to be the preference, back in the late 80s and possibly
|
|
early 90s. I'm seeing a preference toward using the filesystem now,
|
|
possibly with some sort of async I/O and co-operation from the OS
|
|
filesystem about interactions with the filesystem cache.
|
|
|
|
Performance preferences don't stand still. The hardware changes, the
|
|
software changes, the volume of data changes, and different solutions
|
|
become preferable.
|
|
|
|
> Using a raw device on the disk gives them the possibility to have
|
|
> complete control over their files, indices and objects without being
|
|
> bothered by the operating system.
|
|
>
|
|
> This speeds up things in several ways :
|
|
> - the least possible OS intervention
|
|
|
|
Not that this is especially useful, necessarily. If the "raw" device
|
|
is in fact managed by a logical volume manager doing mirroring onto
|
|
some sort of storage array there is still plenty of OS code involved.
|
|
|
|
The cost of using a filesystem in addition may not be much if anything
|
|
and of course a filesystem is considerably more flexible to
|
|
administer (backup, move, change size, check integrity, etc.)
|
|
|
|
> - choose block sizes according to applications
|
|
> - reducing fragmentation
|
|
> - packing data in nearby cylinders
|
|
|
|
... but when this storage area is spread over multiple mechanisms in a
|
|
smart storage array with write caching, you've no idea what is where
|
|
anyway. Better to let the hardware or at least the OS manage this;
|
|
there are so many levels of caching between a database and the
|
|
magnetic media that working hard to influence layout is almost
|
|
certainly a waste of time.
|
|
|
|
Kirk McKusick tells a lovely story that once upon a time it used to be
|
|
sensible to check some registers on a particular disk controller to
|
|
find out where the heads were when scheduling I/O. Needless to say,
|
|
that is history now!
|
|
|
|
There's a considerable cost in complexity and code in using "raw"
|
|
storage too, and it's not a one off cost: as the technologies change,
|
|
the "fast" way to do things will change and the code will have to be
|
|
updated to match. Better to leave this to the OS vendor where
|
|
possible, and take advantage of the tuning they do.
|
|
|
|
> - Anyone other ideas -> the sky is the limit here
|
|
|
|
> It also aids portability, at least on platforms that have an
|
|
> equivalent of a raw device.
|
|
|
|
I don't understand that claim. Not much is portable about raw
|
|
devices, and they're typically not nearly as well documented as the
|
|
filesystem interfaces.
|
|
|
|
> It is also independent of the standard implemented Un*x filesystems,
|
|
> for which you will have to pay extra if you want to take extra
|
|
> measures against power loss.
|
|
|
|
Rather, it is worse. With a Unix filesystem you get quite defined
|
|
semantics about what is written when.
|
|
|
|
> The problem with e.g. e2fs, is that it is not robust enough if a CPU
|
|
> fails.
|
|
|
|
ext2fs doesn't even claim to have Unix filesystem semantics.
|
|
|
|
Regards,
|
|
|
|
Giles
|
|
|
|
|
|
|
|
From pgsql-hackers-owner+M1795@postgresql.org Thu Dec 7 18:47:52 2000
|
|
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
|
|
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id SAA09172
|
|
for <pgman@candle.pha.pa.us>; Thu, 7 Dec 2000 18:47:52 -0500 (EST)
|
|
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
|
|
by mail.postgresql.org (8.11.1/8.11.1) with SMTP id eB7NjFP10612;
|
|
Thu, 7 Dec 2000 18:45:15 -0500 (EST)
|
|
(envelope-from pgsql-hackers-owner+M1795@postgresql.org)
|
|
Received: from thor.tht.net (thor.tht.net [209.47.145.4])
|
|
by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id eB7N6BP08233
|
|
for <pgsql-hackers@postgresql.org>; Thu, 7 Dec 2000 18:06:11 -0500 (EST)
|
|
(envelope-from bright@fw.wintelcom.net)
|
|
Received: from fw.wintelcom.net (bright@ns1.wintelcom.net [209.1.153.20])
|
|
by thor.tht.net (8.9.3/8.9.3) with ESMTP id SAA97456
|
|
for <pgsql-hackers@postgresql.org>; Thu, 7 Dec 2000 18:57:32 GMT
|
|
(envelope-from bright@fw.wintelcom.net)
|
|
Received: (from bright@localhost)
|
|
by fw.wintelcom.net (8.10.0/8.10.0) id eB7MvWE21269
|
|
for pgsql-hackers@postgresql.org; Thu, 7 Dec 2000 14:57:32 -0800 (PST)
|
|
Date: Thu, 7 Dec 2000 14:57:32 -0800
|
|
From: Alfred Perlstein <bright@wintelcom.net>
|
|
To: pgsql-hackers@postgresql.org
|
|
Subject: [HACKERS] Patches with vacuum fixes available for 7.0.x
|
|
Message-ID: <20001207145732.X16205@fw.wintelcom.net>
|
|
MIME-Version: 1.0
|
|
Content-Type: text/plain; charset=us-ascii
|
|
Content-Disposition: inline
|
|
User-Agent: Mutt/1.2.5i
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
Status: ORr
|
|
|
|
We recently had a very satisfactory contract completed by
|
|
Vadim.
|
|
|
|
Basically Vadim has been able to reduce the amount of time
|
|
taken by a vacuum from 10-15 minutes down to under 10 seconds.
|
|
|
|
We've been running with these patches under heavy load for
|
|
about a week now without any problems except one:
|
|
don't 'lazy' (new option for vacuum) a table which has just
|
|
had an index created on it, or at least don't expect it to
|
|
take any less time than a normal vacuum would.
|
|
|
|
There are three patchsets, available at:
|
|
|
|
http://people.freebsd.org/~alfred/vacfix/
|
|
|
|
complete diff:
|
|
http://people.freebsd.org/~alfred/vacfix/v.diff
|
|
|
|
only lazy vacuum option to speed up index vacuums:
|
|
http://people.freebsd.org/~alfred/vacfix/vlazy.tgz
|
|
|
|
only lazy vacuum option to only scan from start of modified
|
|
data:
|
|
http://people.freebsd.org/~alfred/vacfix/mnmb.tgz
|
|
|
|
Although the patches are for 7.0.x I'm hoping that they
|
|
can be forward ported (if Vadim hasn't done it already)
|
|
to 7.1.
|
|
|
|
enjoy!
|
|
|
|
--
|
|
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
|
|
"I have the heart of a child; I keep it in a jar on my desk."
|
|
|
|
From pgsql-hackers-owner+M1809@postgresql.org Thu Dec 7 20:27:39 2000
|
|
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
|
|
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA11827
|
|
for <pgman@candle.pha.pa.us>; Thu, 7 Dec 2000 20:27:38 -0500 (EST)
|
|
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
|
|
by mail.postgresql.org (8.11.1/8.11.1) with SMTP id eB81PsP22362;
|
|
Thu, 7 Dec 2000 20:25:54 -0500 (EST)
|
|
(envelope-from pgsql-hackers-owner+M1809@postgresql.org)
|
|
Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20])
|
|
by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id eB81JkP21783
|
|
for <pgsql-hackers@postgresql.org>; Thu, 7 Dec 2000 20:19:46 -0500 (EST)
|
|
(envelope-from bright@fw.wintelcom.net)
|
|
Received: (from bright@localhost)
|
|
by fw.wintelcom.net (8.10.0/8.10.0) id eB81JwU25447;
|
|
Thu, 7 Dec 2000 17:19:58 -0800 (PST)
|
|
Date: Thu, 7 Dec 2000 17:19:58 -0800
|
|
From: Alfred Perlstein <bright@wintelcom.net>
|
|
To: Tom Lane <tgl@sss.pgh.pa.us>
|
|
cc: pgsql-hackers@postgresql.org
|
|
Subject: Re: [HACKERS] Patches with vacuum fixes available for 7.0.x
|
|
Message-ID: <20001207171958.B16205@fw.wintelcom.net>
|
|
References: <20001207145732.X16205@fw.wintelcom.net> <28791.976236143@sss.pgh.pa.us>
|
|
MIME-Version: 1.0
|
|
Content-Type: text/plain; charset=us-ascii
|
|
Content-Disposition: inline
|
|
User-Agent: Mutt/1.2.5i
|
|
In-Reply-To: <28791.976236143@sss.pgh.pa.us>; from tgl@sss.pgh.pa.us on Thu, Dec 07, 2000 at 07:42:23PM -0500
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
Status: OR
|
|
|
|
* Tom Lane <tgl@sss.pgh.pa.us> [001207 17:10] wrote:
|
|
> Alfred Perlstein <bright@wintelcom.net> writes:
|
|
> > Basically Vadim has been able to reduce the amount of time
|
|
> > taken by a vacuum from 10-15 minutes down to under 10 seconds.
|
|
>
|
|
> Cool. What's it do, exactly?
|
|
|
|
================================================================
|
|
|
|
The first is a bonus that Vadim gave us to speed up index
|
|
vacuums, I'm not sure I understand it completely, but it
|
|
works really well. :)
|
|
|
|
here's the README he gave us:
|
|
|
|
Vacuum LAZY index cleanup option
|
|
|
|
LAZY vacuum option introduces new way of indices cleanup.
|
|
Instead of reading entire index file to remove index tuples
|
|
pointing to deleted table records, with LAZY option vacuum
|
|
performs index scans using keys fetched from table record
|
|
to be deleted. Vacuum checks each result returned by index
|
|
scan if it points to target heap record and removes
|
|
corresponding index tuple.
|
|
This can greatly speed up indices cleaning if not so many
|
|
table records were deleted/modified between vacuum runs.
|
|
Vacuum uses the new option on the user's demand.
|
|
|
|
New vacuum syntax is:
|
|
|
|
vacuum [verbose] [analyze] [lazy] [table [(columns)]]
|
|
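To make the difference concrete, here is a rough sketch of the two cleanup
strategies the README describes. It is not Vadim's patch: the index is
modelled as a sorted Python list of (key, tid) pairs, and the function names
are hypothetical.

```python
import bisect


def full_scan_cleanup(index, dead_tids):
    """Classic cleanup: read every index entry and drop the dead ones."""
    return [(key, tid) for key, tid in index if tid not in dead_tids]


def lazy_cleanup(index, dead_rows):
    """LAZY-style cleanup: one index probe per dead table row.

    index is a sorted list of (key, tid) pairs and is modified in place;
    dead_rows maps the tid of each dead row to the key fetched from it.
    """
    for tid, key in dead_rows.items():
        # Jump to the first index entry carrying this key.
        pos = bisect.bisect_left(index, (key,))
        # Check each entry returned for the key against the target row.
        while pos < len(index) and index[pos][0] == key:
            if index[pos][1] == tid:
                del index[pos]
                break
            pos += 1
    return index


index = [("MA", 1), ("MA", 4), ("NY", 2), ("OH", 3)]
remaining = lazy_cleanup(list(index), {4: "MA"})
```

The lazy variant touches only the entries for the dead row's key, which pays
off when few rows were deleted; with many dead rows the single full pass can
be the cheaper of the two.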
|
|
================================================================
|
|
|
|
The second is one of the suggestions I gave on the lists a while
|
|
back, keeping track of the "last dirtied" block in the data files
|
|
to only scan the tail end of the file for deleted rows, I think
|
|
what he instead did was keep a table that holds all the modified
|
|
blocks and vacuum only scans those:
|
|
|
|
Minimal Number Modified Block (MNMB)
|
|
|
|
This feature is to track MNMB of required tables with triggers
|
|
to avoid reading unmodified table pages by vacuum. Triggers
|
|
store MNMB in per-table files in specified directory
|
|
($LIBDIR/contrib/mnmb by default) and create these files if they do not
|
|
exist.
|
|
|
|
Vacuum first looks up functions
|
|
|
|
mnmb_getblock(Oid databaseId, Oid tableId)
|
|
mnmb_setblock(Oid databaseId, Oid tableId, Oid block)
|
|
|
|
in catalog. If *both* functions were found *and* there was no
|
|
ANALYZE option specified then vacuum calls mnmb_getblock to obtain
|
|
MNMB for table being vacuumed and starts reading this table from
|
|
block number returned. After table was processed vacuum calls
|
|
mnmb_setblock to update data in file to last table block number.
|
|
Neither mnmb_getblock nor mnmb_setblock try to create file.
|
|
If there was no file for table being vacuumed then mnmb_getblock
|
|
returns 0 and mnmb_setblock does nothing.
|
|
mnmb_setblock() may be used to set in file MNMB to 0 and force
|
|
vacuum to read entire table if required.
|
|
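The flow described above might be sketched as follows. This is an
illustration only: an in-memory dict stands in for the per-table files, the
table is identified by name rather than by Oid, and the class and method
names are invented rather than taken from the patch.

```python
class MnmbTracker:
    """Toy stand-in for the per-table MNMB files."""

    def __init__(self):
        self.mnmb = {}  # table name -> minimal modified block number

    def on_modify(self, table, block):
        # Trigger side: creates the record if needed and keeps the minimum.
        current = self.mnmb.get(table)
        self.mnmb[table] = block if current is None else min(current, block)

    def getblock(self, table):
        # No record for the table means "read from block 0".
        return self.mnmb.get(table, 0)

    def setblock(self, table, block):
        # Never creates a record; does nothing for an untracked table.
        if table in self.mnmb:
            self.mnmb[table] = block


def vacuum(tracker, table, total_blocks):
    """Return the block numbers a vacuum would read for this table."""
    start = tracker.getblock(table)
    scanned = list(range(start, total_blocks))
    # After processing, record the last block number of the table.
    tracker.setblock(table, total_blocks - 1)
    return scanned


tracker = MnmbTracker()
tracker.on_modify("locations", 7)
tracker.on_modify("locations", 5)
scanned = vacuum(tracker, "locations", 10)
```

Here the vacuum skips blocks 0 through 4 because no trigger reported a change
below block 5.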
|
|
To compile MNMB you have to add -DMNMB to CUSTOM_COPT
|
|
in src/Makefile.custom.
|
|
|
|
--
|
|
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
|
|
"I have the heart of a child; I keep it in a jar on my desk."
|
|
|
|
From pgsql-general-owner+M4010@postgresql.org Mon Feb 5 18:50:47 2001
|
|
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
|
|
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id SAA02209
|
|
for <pgman@candle.pha.pa.us>; Mon, 5 Feb 2001 18:50:46 -0500 (EST)
|
|
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
|
|
by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f15Nn8x86486;
|
|
Mon, 5 Feb 2001 18:49:08 -0500 (EST)
|
|
(envelope-from pgsql-general-owner+M4010@postgresql.org)
|
|
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
|
|
by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f15N7Ux81124
|
|
for <pgsql-general@postgresql.org>; Mon, 5 Feb 2001 18:07:30 -0500 (EST)
|
|
(envelope-from pgsql-general-owner@postgresql.org)
|
|
Received: from news.tht.net (news.hub.org [216.126.91.242])
|
|
by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f0V0Twq69854
|
|
for <pgsql-general@postgresql.org>; Tue, 30 Jan 2001 19:29:58 -0500 (EST)
|
|
(envelope-from news@news.tht.net)
|
|
Received: (from news@localhost)
|
|
by news.tht.net (8.11.1/8.11.1) id f0V0RAO01011
|
|
for pgsql-general@postgresql.org; Tue, 30 Jan 2001 19:27:10 -0500 (EST)
|
|
(envelope-from news)
|
|
From: Mike Hoskins <mikehoskins@yahoo.com>
|
|
X-Newsgroups: comp.databases.postgresql.general
|
|
Subject: Re: [GENERAL] MySQL file system
|
|
Date: Tue, 30 Jan 2001 18:30:36 -0600
|
|
Organization: Hub.Org Networking Services (http://www.hub.org)
|
|
Lines: 120
|
|
Message-ID: <3A775CAB.C416AA16@yahoo.com>
|
|
References: <016e01c080b7$ea554080$330a0a0a@6014cwpza006>
|
|
MIME-Version: 1.0
|
|
Content-Type: text/plain; charset=us-ascii
|
|
Content-Transfer-Encoding: 7bit
|
|
X-Complaints-To: scrappy@hub.org
|
|
X-Mailer: Mozilla 4.76 [en] (Windows NT 5.0; U)
|
|
X-Accept-Language: en
|
|
To: pgsql-general@postgresql.org
|
|
Precedence: bulk
|
|
Sender: pgsql-general-owner@postgresql.org
|
|
Status: OR
|
|
|
|
This idea is such a popular (even old) one that Oracle developed it for 8i --
|
|
IFS. Yep, AS/400 has had it forever, and BeOS is another example. Informix has
|
|
had its DataBlades for years, as well. In fact, Reiser-FS is an FS implemented
|
|
on a DB, albeit probably not a SQL DB. AIX's LVM and JFS are extent/DB-based, as
|
|
well. Let's see now, why would all those guys do that? (Now, some of those that
|
|
aren't SQL-based probably won't allow SQL queries on files, so just think about
|
|
those that do, for a minute)....
|
|
|
|
Rather than asking why, a far better question is why not? There is SO much
|
|
functionality to be gained here that it's silly to ask why. At a higher level,
|
|
treating BLOBs as files and as DB entries simultaneously has so many uses, that
|
|
one has trouble answering the question properly without the puzzled stare back
|
|
at the questioner. Again, look at the above list, particularly at AS/400 -- the
|
|
entire OS's FS sits on top of DB/2!
|
|
|
|
For example, think how easily dynamically generated web sites could access online
|
|
catalog information, with all those JPEG's, GIFs, PNGs, HTML files, Text files,
|
|
.PDF's, etc., both in the DB and in the FS. This would be so much easier to
|
|
maintain, when you have webmasters, web designers, artists, programmers,
|
|
sysadmins, dba's, etc., all trying to manage a big, dynamic, graphics-rich web
|
|
site. Who cares if the FS is a bit slow, as long as it's not too slow? That's
|
|
not the point, anyway.
|
|
|
|
The point is easy access to data: asset management, version control, the
|
|
ability to access the same data as a file and as a BLOB simultaneously, the
|
|
ability to replicate more easily, the ability to use more tools on the same info,
|
|
etc. It's not for speed, per se; instead, it's for accessibility.
|
|
|
|
Think about this issue. You have some already compiled text-based program that
|
|
works on binary files, but not on databases -- it was simply never designed into
|
|
the program. How are you going to get your graphics BLOBs into that program?
|
|
Oh yeah, let's write another program to transform our data into files, first,
|
|
then after processing delete them in some cleanup routine.... Why? If you have
|
|
a DB'ed FS, then file data can simultaneously have two views -- one for the DB
|
|
and one as an FS. (You can easily reverse the scenario.) Not only does this
|
|
save time and disk space; it saves you from having to pay for the most expensive
|
|
element of all -- programmer time.
|
|
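As a rough illustration of the "two views" idea, the sketch below stores a
BLOB in a database and reads the same bytes back through a file-like object.
It uses SQLite from the Python standard library purely as a stand-in; the
table layout and the open_as_file helper are assumptions made for the example
and have nothing to do with PGFS or the MySQL file system.

```python
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, data BLOB)")
conn.execute(
    "INSERT INTO files VALUES (?, ?)",
    ("/catalog/logo.gif", b"GIF89a..."),
)


def open_as_file(path):
    """File view: hand the stored BLOB to code that expects a file object."""
    row = conn.execute(
        "SELECT data FROM files WHERE path = ?", (path,)
    ).fetchone()
    if row is None:
        raise FileNotFoundError(path)
    return io.BytesIO(row[0])


# A program that only understands files reads the first six bytes.
header = open_as_file("/catalog/logo.gif").read(6)

# Database view: the same bytes are still reachable through SQL.
size = conn.execute(
    "SELECT length(data) FROM files WHERE path = ?",
    ("/catalog/logo.gif",),
).fetchone()[0]
```

No temporary files are written and nothing has to be cleaned up afterwards,
which is the programmer-time saving the paragraph above argues for.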
|
|
BTW, once this FS-on-a-DB concept really sinks in, imagine how tightly
|
|
integrated Linux/Unix apps could be written. Imagine if a bunch of GPL'ed
|
|
software started coding for this and used this as a means to exchange data, all
|
|
using a common set of libraries. You could get to the point of uniting files,
|
|
BLOBs, data of all sorts, IPC, version control, etc., all under one umbrella,
|
|
especially if XML was the means data was exchanged. Heck, distributed
|
|
authentication, file access, data access, etc., could be improved greatly.
|
|
Well, this paragraph sounds like flame bait, but really consider the
|
|
ramifications. Also, read the next paragraph....
|
|
|
|
Something like this *has* existed for Postgres for a long time -- PGFS, by Brian
|
|
Bartholomew. It's even supposedly matured with age. Unfortunately, I cannot
|
|
get to http://www.wv.com/ (Working Version's main site). Working Version is a
|
|
version control system that keeps old versions of files around in the FS. It
|
|
uses PG as the back-end DB and lets you mount it like another FS. It's
|
|
supposedly an awesome system, but where is it? It's not some clunky korbit
|
|
thingy, either. (If someone can find it, please let me know by email, if
|
|
possible.)
|
|
|
|
The only thing I can find on this is from a Google search, which caches
|
|
everything but the actual software:
|
|
|
|
http://www.google.com/search?q=pgfs+postgres&num=100&hl=en&lr=lang_en&newwindow=1&safe=active
|
|
|
|
Also, there is the Perl-FS that can be transformed into something like PGFS:
|
|
http://www.assurdo.com/perlfs/ It allows you to write Perl code that can mount
|
|
various protocols or data types as an FS, in user space. (One example is the
|
|
ability to mount FTP sites, BTW.)
|
|
|
|
Instead of ridiculing something you've never tried, consider that MySQL-FS,
|
|
Oracle (IFS), Informix (DataBlades), AS/400 (DB/2), BeOS, and Reiser-FS are
|
|
doing this today. Do you want to be left behind and let them tell us what it's
|
|
good for? Or, do we want this for PG? (Reiser-FS, BTW, is FASTER than ext2,
|
|
but has no SQL hooks).
|
|
|
|
There were many posts on this on slashdot:
|
|
http://slashdot.org/article.pl?sid=01/01/16/1855253&mode=thread
|
|
(I wrote some comments here, as well, just look for mikehoskins)
|
|
|
|
I, for one, want to see this succeed for MySQL, PostgreSQL, msql, etc. It's an
|
|
awesome feature that doesn't need to be speedy because it can save HUMANS time.
|
|
|
|
The question really is, "When do we want to catch up to everyone else?" We are
|
|
always moving to higher levels of abstraction, anyway, so it's just a matter of
|
|
time. PG should participate.
|
|
|
|
|
|
Adam Lang wrote:
|
|
|
|
> I wasn't following the thread too closely, but database for a filesystem has
|
|
> been done. BeOS uses a database for a filesystem as well as AS/400 and
|
|
> Mainframes.
|
|
>
|
|
> Adam Lang
|
|
> Systems Engineer
|
|
> Rutgers Casualty Insurance Company
|
|
> http://www.rutgersinsurance.com
|
|
> ----- Original Message -----
|
|
> From: "Alfred Perlstein" <bright@wintelcom.net>
|
|
> To: "Robert D. Nelson" <RDNELSON@co.centre.pa.us>
|
|
> Cc: "Joseph Shraibman" <jks@selectacast.net>; "Karl DeBisschop"
|
|
> <karl@debisschop.net>; "Ned Lilly" <ned@greatbridge.com>; "PostgreSQL
|
|
> General" <pgsql-general@postgresql.org>
|
|
> Sent: Wednesday, January 17, 2001 12:23 PM
|
|
> Subject: Re: [GENERAL] MySQL file system
|
|
>
|
|
> > * Robert D. Nelson <RDNELSON@co.centre.pa.us> [010117 05:17] wrote:
|
|
> > > >Raw disk access allows:
|
|
> > >
|
|
> > > If I'm correct, mysql is providing a filesystem, not a way to access raw
|
|
> > > disk, like Oracle does. Huge difference there - with a filesystem, you
|
|
> have
|
|
> > > overhead of FS *and* SQL at the same time.
|
|
> >
|
|
> > Oh, so it's sort of like /proc for mysql?
|
|
> >
|
|
> > What a terrible waste of time and resources. :(
|
|
> >
|
|
> > --
|
|
> > -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
|
|
> > "I have the heart of a child; I keep it in a jar on my desk."
|
|
|
|
|
|
From pgsql-general-owner+M4049@postgresql.org Tue Feb 6 01:26:19 2001
|
|
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
|
|
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id BAA21425
|
|
for <pgman@candle.pha.pa.us>; Tue, 6 Feb 2001 01:26:18 -0500 (EST)
|
|
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
|
|
by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f166Nxx26400;
|
|
Tue, 6 Feb 2001 01:23:59 -0500 (EST)
|
|
(envelope-from pgsql-general-owner+M4049@postgresql.org)
|
|
Received: from simecity.com ([202.188.254.2])
|
|
by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f166GUx25754
|
|
for <pgsql-general@postgresql.org>; Tue, 6 Feb 2001 01:16:30 -0500 (EST)
|
|
(envelope-from lyeoh@pop.jaring.my)
|
|
Received: (from mail@localhost)
|
|
by simecity.com (8.9.3/8.8.7) id OAA23910;
|
|
Tue, 6 Feb 2001 14:28:48 +0800
|
|
Received: from <lyeoh@pop.jaring.my> (ilab2.mecomb.po.my [192.168.3.22]) by cirrus.simecity.com via smap (V2.1)
|
|
id xma023908; Tue, 6 Feb 01 14:28:34 +0800
|
|
Message-ID: <3.0.5.32.20010206141555.00a3d100@192.228.128.13>
|
|
X-Sender: lyeoh@192.228.128.13
|
|
X-Mailer: QUALCOMM Windows Eudora Light Version 3.0.5 (32)
|
|
Date: Tue, 06 Feb 2001 14:15:55 +0800
|
|
To: Mike Hoskins <mikehoskins@yahoo.com>, pgsql-general@postgresql.org
|
|
From: Lincoln Yeoh <lyeoh@pop.jaring.my>
|
|
Subject: [GENERAL] Re: MySQL file system
|
|
In-Reply-To: <3A775CF7.3C5F1909@yahoo.com>
|
|
References: <016e01c080b7$ea554080$330a0a0a@6014cwpza006>
|
|
MIME-Version: 1.0
|
|
Content-Type: text/plain; charset="us-ascii"
|
|
Precedence: bulk
|
|
Sender: pgsql-general-owner@postgresql.org
|
|
Status: OR
|
|
|
|
What you're saying seems to be to have a data structure where the same data
|
|
can be accessed in both the filesystem style and the RDBMS style. How does
|
|
that work? How is the mapping done between both structures? Slapping a
|
|
filesystem on top of an RDBMS doesn't do that, does it?
|
|
|
|
Most filesystems are basically databases already, just differently
|
|
structured and featured databases. And so far most of them do their job
|
|
pretty well. You move a folder/directory somewhere, and everything inside
|
|
it moves. Tons of data are already arranged in that form. Though porting
|
|
over data from one filesystem to another is not always straightforward,
|
|
RDBMSes are far worse.
|
|
|
|
Maybe what would be nice is not a filesystem based on a database, rather
|
|
one influenced by databases. One with a decent fulltextindex for data and
|
|
filenames, where you have the option to ignore or not ignore
|
|
nonalphanumerics and still get an indexed search.
|
|
|
|
Then perhaps we could do something like the following:
|
|
|
|
select file.name from path "/var/logs/" where file.name like '%.log%' and
|
|
file.lastmodified > '2000/1/1' and file.contents =~ 'te_st[0-9]+\.gif$' use
|
|
index
|
|
|
|
Checkpoints would be nice too. Then I can rollback to a known point if I
|
|
screw up ;).
|
|
|
|
In fact the SQL style interface doesn't have to be built in at all. Neither
|
|
does the index have to be realtime. I suppose there could be an option to
|
|
make it realtime if performance is not an issue.
|
|
|
|
What could be done is to use some fast filesystem. Then we add tools to
|
|
maintain indexes, for SQL style interfaces and other style interfaces.
|
|
Checkpoints and rollbacks would be harder of course.
|
|
|
|
Cheerio,
|
|
Link.
|
|
|
|
|
|
From pgsql-hackers-owner+M20329@postgresql.org Tue Mar 19 18:00:15 2002
|
|
Return-path: <pgsql-hackers-owner+M20329@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g2K00EA02465
|
|
for <pgman@candle.pha.pa.us>; Tue, 19 Mar 2002 19:00:14 -0500 (EST)
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id 8C7164763EF; Tue, 19 Mar 2002 18:22:08 -0500 (EST)
|
|
Received: from CopelandConsulting.Net (dsl-24293-ld.customer.centurytel.net [209.142.135.135])
|
|
by postgresql.org (Postfix) with ESMTP id E4DAD475F1F
|
|
for <pgsql-hackers@postgresql.org>; Tue, 19 Mar 2002 18:02:17 -0500 (EST)
|
|
Received: from mouse.copelandconsulting.net (mouse.copelandconsulting.net [192.168.1.2])
|
|
by CopelandConsulting.Net (8.10.1/8.10.1) with ESMTP id g2JN0jh13185;
|
|
Tue, 19 Mar 2002 17:00:45 -0600 (CST)
|
|
X-Trade-Id: <CCC.Tue, 19 Mar 2002 17:00:45 -0600 (CST).Tue, 19 Mar 2002 17:00:45 -0600 (CST).200203192300.g2JN0jh13185.g2JN0jh13185@CopelandConsulting.Net.
|
|
Subject: Re: [HACKERS] Bitmap indexes?
|
|
From: Greg Copeland <greg@CopelandConsulting.Net>
|
|
To: Matthew Kirkwood <matthew@hairy.beasts.org>
|
|
cc: Oleg Bartunov <oleg@sai.msu.su>,
|
|
PostgresSQL Hackers Mailing List <pgsql-hackers@postgresql.org>
|
|
<Pine.LNX.4.33.0203192118140.29494-100000@sphinx.mythic-beasts.com>
|
|
<Pine.LNX.4.33.0203192118140.29494-100000@sphinx.mythic-beasts.com>
|
|
Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature";
|
|
boundary="=-Ivchb84S75fOMzJ9DxwK"
|
|
X-Mailer: Evolution/1.0.2
|
|
Date: 19 Mar 2002 17:00:53 -0600
|
|
Message-ID: <1016578854.14670.450.camel@mouse.copelandconsulting.net>
|
|
MIME-Version: 1.0
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
Status: OR
|
|
|
|
--=-Ivchb84S75fOMzJ9DxwK
|
|
Content-Type: text/plain
|
|
Content-Transfer-Encoding: quoted-printable
|
|
|
|
On Tue, 2002-03-19 at 15:30, Matthew Kirkwood wrote:
|
|
> On Tue, 19 Mar 2002, Oleg Bartunov wrote:
|
|
>
|
|
> Sorry to reply over you, Oleg.
|
|
>
|
|
> > On 13 Mar 2002, Greg Copeland wrote:
|
|
> >
|
|
> > > One of the reasons why I originally started following the hackers list is
|
|
> > > because I wanted to implement bitmap indexes. I found in the archives,
|
|
> > > the following link, http://www.it.iitb.ernet.in/~rvijay/dbms/proj/, which
|
|
> > > was extracted from this,
|
|
> > > http://groups.google.com/groups?hl=en&threadm=01C0EF67.5105D2E0.mascarm%40mascari.com&rnum=1&prev=/groups%3Fq%3Dbitmap%2Bindex%2Bgroup:comp.databases.postgresql.hackers%26hl%3Den%26selm%3D01C0EF67.5105D2E0.mascarm%2540mascari.com%26rnum%3D1, archive thread.
|
|
>
|
|
> For every case I have used a bitmap index on Oracle, a
|
|
> partial index[0] made more sense (especially since it
|
|
> could usefully be compound).
|
|
|
|
That's very true, however, often bitmap indexes are used where partial
|
|
indexes may not work well. It may be that you were trying to apply the cure
|
|
for the wrong disease. ;)
|
|
|
|
>
|
|
> Our troublesome case (on Oracle) is a table of "events"
|
|
> where maybe fifty to a couple of hundred are "published"
|
|
> (ie. web-visible) at any time. The events are categorised
|
|
> by sport (about a dozen) and by "event type" (about five).
|
|
> We never really query events except by PK or by sport/type/
|
|
> published.
|
|
|
|
The reason why bitmap indexes are primarily used for DSS and data
|
|
warehousing applications is because they are best used on extremely
|
|
large to very large tables which have low cardinality (e.g., 10,000,000
|
|
rows having 200 distinct values). On top of that, bitmap indexes also
|
|
tend to be much smaller than their "standard" cousins. On large and
|
|
very large tables, this can sometimes save gigs in index space alone
|
|
(serious space benefit). Plus, their small index size tends to result
|
|
in much less I/O (serious speed benefit). This, of course, can result
|
|
in several orders of magnitude speed improvements when index scans are
|
|
required. As an added bonus, AND, OR, XOR and NOT predicates are
|
|
exceptionally fast and if implemented properly, can even take advantage
|
|
of some 64-bit hardware for further speed improvements. This, of
|
|
course, further speeds lookups. The primary down side is that inserts
|
|
and updates to bitmap indexes are very costly (comparatively) which is,
|
|
yet again, why they excel in read-only environments (DSS & data
|
|
warehousing).
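
The AND/OR behaviour described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Oracle's or anyone else's implementation: the six-row table, the column names and both helper functions are hypothetical, and a Python integer stands in for an uncompressed bitmap with one bit per row.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = defaultdict(int)
    for row_number, value in enumerate(values):
        index[value] |= 1 << row_number
    return dict(index)


def rows_from_bitmap(bitmap):
    """Return the row numbers whose bits are set, in ascending order."""
    rows = []
    row_number = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row_number)
        bitmap >>= 1
        row_number += 1
    return rows


# Hypothetical six-row table with two low-cardinality columns.
sport = ["golf", "tennis", "golf", "rugby", "golf", "tennis"]
published = ["Y", "N", "N", "Y", "Y", "Y"]

sport_index = build_bitmap_index(sport)
published_index = build_bitmap_index(published)

# WHERE sport = 'golf' AND published = 'Y' becomes one bitwise AND.
golf_and_published = sport_index["golf"] & published_index["Y"]
# WHERE sport = 'golf' OR sport = 'rugby' becomes one bitwise OR.
golf_or_rugby = sport_index["golf"] | sport_index["rugby"]

print(rows_from_bitmap(golf_and_published))
print(rows_from_bitmap(golf_or_rugby))
```

The sketch also hints at why updates are expensive: changing one row's value means touching the bitmap of the old value and the bitmap of the new one.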
|
|
|
|
It should also be noted that RDBMSs, such as Oracle, often use multiple
|
|
types of bitmap indexes. This further impedes insert/update
|
|
performance, however, the additional bitmap index types usually allow
|
|
for range predicates while still making use of the bitmap index. If I'm
|
|
not mistaken, several other types of bitmaps are available as well as
|
|
many ways to encode and compress (rle, quad compression, etc) bitmap
|
|
indexes which further save on an already compact indexing scheme.
|
|
|
|
Given the proper problem domain, index bitmaps can be a big win.
|
|
|
|
>
|
|
> We make a bitmap index on "published", and trust Oracle to
|
|
> use it correctly, and hope that our other indexes are also
|
|
> useful.
|
|
>
|
|
> On Postgres[1] we would make a partial compound index:
|
|
>
|
|
> create index ... on events(sport_id,event_type_id)
|
|
> where published='Y';
|
|
|
|
|
|
Generally speaking, bitmap indexes will not serve you very well on
|
|
tables having low row counts, high cardinality or where they are
|
|
attached to tables which are primarily used in an OLTP capacity.
|
|
Situations where you have a low row count and low cardinality or high
|
|
row count and high cardinality tend to be better addressed by partial
|
|
indexes; which seem to make much more sense. In your example, it sounds
|
|
like you did "the right thing"(tm). ;)
|
|
|
|
|
|
Greg
|
|
|
|
|
|
--=-Ivchb84S75fOMzJ9DxwK
|
|
Content-Type: application/pgp-signature; name=signature.asc
|
|
Content-Description: This is a digitally signed message part
|
|
|
|
-----BEGIN PGP SIGNATURE-----
|
|
Version: GnuPG v1.0.6 (GNU/Linux)
|
|
Comment: For info see http://www.gnupg.org
|
|
|
|
iD8DBQA8l8Ml4lr1bpbcL6kRAhldAJ9Aoi9dwm1OteZjySfsd1o42trWLACfegQj
|
|
OEV6eO8MnBSlbJMHiQ08gNE=
|
|
=PQvW
|
|
-----END PGP SIGNATURE-----
|
|
|
|
--=-Ivchb84S75fOMzJ9DxwK--
|
|
|
|
|
|
From pgsql-hackers-owner+M26157@postgresql.org Tue Aug 6 23:06:34 2002
|
|
Date: Wed, 7 Aug 2002 13:07:38 +1000 (EST)
|
|
From: Gavin Sherry <swm@linuxworld.com.au>
|
|
To: Curt Sampson <cjs@cynic.net>
|
|
cc: pgsql-hackers@postgresql.org
|
|
Subject: Re: [HACKERS] CLUSTER and indisclustered
|
|
In-Reply-To: <Pine.NEB.4.44.0208071126590.1214-100000@angelic.cynic.net>
|
|
Message-ID: <Pine.LNX.4.21.0208071259210.13438-100000@linuxworld.com.au>
|
|
X-Virus-Scanned: by AMaViS new-20020517
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Virus-Scanned: by AMaViS new-20020517
|
|
Content-Length: 1357
|
|
|
|
On Wed, 7 Aug 2002, Curt Sampson wrote:
|
|
|
|
> But after doing some benchmarking of various sorts of random reads
|
|
> and writes, it occurred to me that there might be optimizations
|
|
> that could help a lot with this sort of thing. What if, when we've
|
|
> got an index block with a bunch of entries, instead of doing the
|
|
> reads in the order of the entries, we do them in the order of the
|
|
> blocks the entries point to? That would introduce a certain amount
|
|
> of "sequentialness" to the reads that the OS is not capable of
|
|
> introducing (since it can't reschedule the reads you're doing, the
|
|
> way it could reschedule, say, random writes).
|
|
|
|
This sounds more or less like the method employed by Firebird as described
|
|
by Ann Douglas to Tom at OSCON (correct me if I get this wrong).
|
|
|
|
Basically, Firebird populates a bitmap with entries the scan is interested
|
|
in. The bitmap is populated in page order so that all entries on the same
|
|
heap page can be fetched at once.
|
|
|
|
This is totally different to the way postgres does things and would
|
|
require significant modification to the index access methods.
|
|
|
|
Gavin
|
|
|
|
|
|
---------------------------(end of broadcast)---------------------------
|
|
TIP 3: if posting/reading through Usenet, please send an appropriate
|
|
subscribe-nomail command to majordomo@postgresql.org so that your
|
|
message can get through to the mailing list cleanly
|
|
|
|
From pgsql-hackers-owner+M26162@postgresql.org Wed Aug 7 00:42:35 2002
|
|
To: Curt Sampson <cjs@cynic.net>
|
|
cc: mark Kirkwood <markir@slithery.org>, Gavin Sherry <swm@linuxworld.com.au>,
|
|
Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
|
|
Subject: Re: [HACKERS] CLUSTER and indisclustered
|
|
In-Reply-To: <Pine.NEB.4.44.0208071126590.1214-100000@angelic.cynic.net>
|
|
References: <Pine.NEB.4.44.0208071126590.1214-100000@angelic.cynic.net>
|
|
Comments: In-reply-to Curt Sampson <cjs@cynic.net>
|
|
message dated "Wed, 07 Aug 2002 11:31:32 +0900"
|
|
Date: Wed, 07 Aug 2002 00:41:47 -0400
|
|
Message-ID: <12593.1028695307@sss.pgh.pa.us>
|
|
From: Tom Lane <tgl@sss.pgh.pa.us>
|
|
X-Virus-Scanned: by AMaViS new-20020517
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Virus-Scanned: by AMaViS new-20020517
|
|
Content-Length: 3063
|
|
|
|
Curt Sampson <cjs@cynic.net> writes:
|
|
> But after doing some benchmarking of various sorts of random reads
|
|
> and writes, it occurred to me that there might be optimizations
|
|
> that could help a lot with this sort of thing. What if, when we've
|
|
> got an index block with a bunch of entries, instead of doing the
|
|
> reads in the order of the entries, we do them in the order of the
|
|
> blocks the entries point to?
|
|
|
|
I thought to myself "didn't I just post something about that?"
|
|
and then realized it was on a different mailing list. Here ya go
|
|
(and no, this is not the first time around on this list either...)
|
|
|
|
|
|
I am currently thinking that bitmap indexes per se are not all that
|
|
interesting. What does interest me is bitmapped index lookup, which
|
|
came back into mind after hearing Ann Harrison describe how FireBird/
|
|
InterBase does it.
|
|
|
|
The idea is that you don't scan the index and base table concurrently
|
|
as we presently do it. Instead, you scan the index and make a list
|
|
of the TIDs of the table tuples you need to visit. This list can
|
|
be conveniently represented as a sparse bitmap. After you've finished
|
|
looking at the index, you visit all the required table tuples *in
|
|
physical order* using the bitmap. This eliminates multiple fetches
|
|
of the same heap page, and can possibly let you get some win from
|
|
sequential access.
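
A minimal sketch of the "visit in physical order" idea follows. It assumes a TID is simply a (block, offset) pair and that a heap page is re-fetched every time the block number changes, which ignores any buffer cache; the function names and the sample TIDs are made up for illustration.

```python
from itertools import groupby


def count_page_fetches(tids):
    """Count heap page fetches, assuming a re-read whenever the block changes."""
    fetches = 0
    previous_block = None
    for block, _offset in tids:
        if block != previous_block:
            fetches += 1
            previous_block = block
    return fetches


def physical_order(tids):
    """Sort (block, offset) pairs so each heap page is visited exactly once."""
    return sorted(set(tids))


def group_by_page(tids):
    """Return [(block, [offsets...]), ...] in ascending block order."""
    return [
        (block, [offset for _, offset in group])
        for block, group in groupby(physical_order(tids), key=lambda tid: tid[0])
    ]


# Hypothetical TIDs in the order an index scan happened to return them.
tids_in_index_order = [(7, 2), (3, 1), (7, 5), (1, 4), (3, 3), (7, 1)]

print(count_page_fetches(tids_in_index_order))
print(count_page_fetches(physical_order(tids_in_index_order)))
print(group_by_page(tids_in_index_order))
```

With these sample TIDs the index-order visit touches a page six times, while the sorted visit touches each of the three pages once.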
|
|
|
|
Once you have built this mechanism, you can then move on to using
|
|
multiple indexes in interesting ways: you can do several indexscans
|
|
in one query and then AND or OR their bitmaps before doing the heap
|
|
scan. This would allow, for example, "WHERE a = foo and b = bar"
|
|
to be handled by ANDing results from separate indexes on the a and b
|
|
columns, rather than having to choose only one index to use as we do
|
|
now.
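
Combining two index scans before touching the heap might look roughly like this. The sketch uses Python sets of (block, offset) pairs in place of sparse bitmaps, and the two dictionaries are hypothetical stand-ins for separate indexes on columns a and b; none of this is taken from an actual executor.

```python
# Hypothetical single-column indexes: key value -> TIDs of matching heap tuples.
index_on_a = {
    "foo": [(1, 2), (4, 1), (4, 3), (9, 5)],
    "baz": [(2, 1)],
}
index_on_b = {
    "bar": [(4, 3), (6, 2), (9, 5)],
    "qux": [(1, 2)],
}


def index_scan(index, key):
    """Scan one index completely and return the set of matching TIDs."""
    return set(index.get(key, []))


a_matches = index_scan(index_on_a, "foo")
b_matches = index_scan(index_on_b, "bar")

# WHERE a = foo AND b = bar: intersect, then visit the heap in TID order.
both = sorted(a_matches & b_matches)
# WHERE a = foo OR b = bar: union, then visit the heap in TID order.
either = sorted(a_matches | b_matches)

print(both)
print(either)
```

Note that both scans have to run to completion before the intersection is known, so no heap tuple is fetched until the combined set is final.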
|
|
|
|
Some thoughts about implementation: FireBird's implementation seems
|
|
to depend on an assumption about a fixed number of tuple pointers
|
|
per page. We don't have that, but we could probably get away with
|
|
just allocating BLCKSZ/sizeof(HeapTupleHeaderData) bits per page.
|
|
Also, the main downside of this approach is that the bitmap could
|
|
get large --- but you could have some logic that causes you to fall
|
|
back to plain sequential scan if you get too many index hits. (It's
|
|
interesting to think of this as lossy compression of the bitmap...
|
|
which leads to the idea of only being fuzzy in limited areas of the
|
|
bitmap, rather than losing all the information you have.)
|
|
|
|
A possibly nasty issue is that lazy VACUUM has some assumptions in it
|
|
about indexscans holding pins on index pages --- that's what prevents
|
|
it from removing heap tuples that a concurrent indexscan is just about
|
|
to visit. It might be that there is no problem: even if lazy VACUUM
|
|
removes a heap tuple and someone else then installs a new tuple in that
|
|
same TID slot, you should be okay because the new tuple is too new to
|
|
pass your visibility test. But I'm not convinced this is safe.
|
|
|
|
regards, tom lane
|
|
|
|
---------------------------(end of broadcast)---------------------------
|
|
TIP 6: Have you searched our list archives?
|
|
|
|
http://archives.postgresql.org
|
|
|
|
From pgsql-hackers-owner+M26172@postgresql.org Wed Aug 7 02:49:56 2002
|
|
X-Authentication-Warning: rh72.home.ee: hannu set sender to hannu@tm.ee using -f
|
|
Subject: Re: [HACKERS] CLUSTER and indisclustered
|
|
From: Hannu Krosing <hannu@tm.ee>
|
|
To: Tom Lane <tgl@sss.pgh.pa.us>
|
|
cc: Curt Sampson <cjs@cynic.net>, mark Kirkwood <markir@slithery.org>,
|
|
Gavin Sherry <swm@linuxworld.com.au>,
|
|
Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
|
|
In-Reply-To: <12776.1028697148@sss.pgh.pa.us>
|
|
References: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net>
|
|
<12776.1028697148@sss.pgh.pa.us>
|
|
X-Mailer: Ximian Evolution 1.0.7
|
|
Date: 07 Aug 2002 09:46:29 +0500
|
|
Message-ID: <1028695589.2133.11.camel@rh72.home.ee>
|
|
X-Virus-Scanned: by AMaViS new-20020517
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Virus-Scanned: by AMaViS new-20020517
|
|
Content-Length: 1064
|
|
|
|
On Wed, 2002-08-07 at 10:12, Tom Lane wrote:
|
|
> Curt Sampson <cjs@cynic.net> writes:
|
|
> > On Wed, 7 Aug 2002, Tom Lane wrote:
|
|
> >> Also, the main downside of this approach is that the bitmap could
|
|
> >> get large --- but you could have some logic that causes you to fall
|
|
> >> back to plain sequential scan if you get too many index hits.
|
|
>
|
|
> > Well, what I was thinking of, should the list of TIDs to fetch get too
|
|
> > long, was just to break it down into chunks.
|
|
>
|
|
> But then you lose the possibility of combining multiple indexes through
|
|
> bitmap AND/OR steps, which seems quite interesting to me. If you've
|
|
> visited only a part of each index then you can't apply that concept.
|
|
|
|
When the tuples are small relative to pagesize, you may get some
|
|
"compression" by saving just pages and not the actual tids in the
|
|
bitmap.
|
|
|
|
-------------
|
|
Hannu
|
|
|
|
---------------------------(end of broadcast)---------------------------
|
|
TIP 2: you can get off all lists at once with the unregister command
|
|
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
|
|
|
|
From pgsql-hackers-owner+M26166@postgresql.org Wed Aug 7 00:55:52 2002
|
|
Date: Wed, 7 Aug 2002 13:55:41 +0900 (JST)
|
|
From: Curt Sampson <cjs@cynic.net>
|
|
To: Tom Lane <tgl@sss.pgh.pa.us>
|
|
cc: mark Kirkwood <markir@slithery.org>, Gavin Sherry <swm@linuxworld.com.au>,
|
|
Bruce Momjian <pgman@candle.pha.pa.us>, <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] CLUSTER and indisclustered
|
|
In-Reply-To: <12593.1028695307@sss.pgh.pa.us>
|
|
Message-ID: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net>
|
|
X-Virus-Scanned: by AMaViS new-20020517
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Virus-Scanned: by AMaViS new-20020517
|
|
Content-Length: 1840
|
|
|
|
On Wed, 7 Aug 2002, Tom Lane wrote:
|
|
|
|
> I thought to myself "didn't I just post something about that?"
|
|
> and then realized it was on a different mailing list. Here ya go
|
|
> (and no, this is not the first time around on this list either...)
|
|
|
|
Wow. I'm glad to see you looking at this, because this feature would do
|
|
*so* much for the performance of some of my queries, and really, really
|
|
impress my "billion-row-database" client.
|
|
|
|
> The idea is that you don't scan the index and base table concurrently
|
|
> as we presently do it. Instead, you scan the index and make a list
|
|
> of the TIDs of the table tuples you need to visit.
|
|
|
|
Right.
|
|
|
|
> Also, the main downside of this approach is that the bitmap could
|
|
> get large --- but you could have some logic that causes you to fall
|
|
> back to plain sequential scan if you get too many index hits.
|
|
|
|
Well, what I was thinking of, should the list of TIDs to fetch get too
|
|
long, was just to break it down into chunks. If you want to limit to,
|
|
say, 1000 TIDs, and your index has 3000, just do the first 1000, then
|
|
the next 1000, then the last 1000. This would still result in much less
|
|
disk head movement and speed the query immensely.
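
The chunking scheme can be sketched as below, assuming TIDs are (block, offset) pairs that arrive in index order. The helper name, the sample TIDs and the chunk size of 3 (standing in for the 1000 mentioned above) are illustrative only.

```python
def chunked_visit_order(tids_in_index_order, chunk_size):
    """Yield successive chunks of TIDs, each chunk sorted into physical order."""
    for start in range(0, len(tids_in_index_order), chunk_size):
        yield sorted(tids_in_index_order[start:start + chunk_size])


# Hypothetical TIDs as an index scan might return them.
tids = [(9, 1), (2, 3), (5, 2), (2, 1), (8, 4), (1, 1), (5, 5)]

chunks = list(chunked_visit_order(tids, chunk_size=3))
for chunk in chunks:
    print(chunk)
```

Memory use is bounded by the chunk size, at the price that a page can be visited again in a later chunk (blocks 2 and 5 in this sample).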
|
|
|
|
(BTW, I have verified this empirically during testing of random read vs.
|
|
random write on a RAID controller. The writes were 5-10 times faster
|
|
than the reads because the controller was caching a number of writes and
|
|
then doing them in the best possible order, whereas the reads had to be
|
|
satisfied in the order they were submitted to the controller.)
|
|
|
|
cjs
|
|
--
|
|
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
|
|
Don't you know, in this new Dark Age, we're all light. --XTC
|
|
|
|
|
|
---------------------------(end of broadcast)---------------------------
|
|
TIP 5: Have you checked our extensive FAQ?
|
|
|
|
http://www.postgresql.org/users-lounge/docs/faq.html
|
|
|
|
From pgsql-hackers-owner+M26167@postgresql.org Wed Aug 7 01:12:54 2002
|
|
To: Curt Sampson <cjs@cynic.net>
|
|
cc: mark Kirkwood <markir@slithery.org>, Gavin Sherry <swm@linuxworld.com.au>,
|
|
Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
|
|
Subject: Re: [HACKERS] CLUSTER and indisclustered
|
|
In-Reply-To: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net>
|
|
References: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net>
|
|
Comments: In-reply-to Curt Sampson <cjs@cynic.net>
|
|
message dated "Wed, 07 Aug 2002 13:55:41 +0900"
|
|
Date: Wed, 07 Aug 2002 01:12:28 -0400
|
|
Message-ID: <12776.1028697148@sss.pgh.pa.us>
|
|
From: Tom Lane <tgl@sss.pgh.pa.us>
|
|
X-Virus-Scanned: by AMaViS new-20020517
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Virus-Scanned: by AMaViS new-20020517
|
|
Content-Length: 1428
|
|
|
|
Curt Sampson <cjs@cynic.net> writes:
|
|
> On Wed, 7 Aug 2002, Tom Lane wrote:
|
|
>> Also, the main downside of this approach is that the bitmap could
|
|
>> get large --- but you could have some logic that causes you to fall
|
|
>> back to plain sequential scan if you get too many index hits.
|
|
|
|
> Well, what I was thinking of, should the list of TIDs to fetch get too
|
|
> long, was just to break it down into chunks.
|
|
|
|
But then you lose the possibility of combining multiple indexes through
|
|
bitmap AND/OR steps, which seems quite interesting to me. If you've
|
|
visited only a part of each index then you can't apply that concept.
|
|
|
|
Another point to keep in mind is that the bigger the bitmap gets, the
|
|
less useful an indexscan is, by definition --- sooner or later you might
|
|
as well fall back to a seqscan. So the idea of lossy compression of a
|
|
large bitmap seems really ideal to me. In principle you could seqscan
|
|
the parts of the table where matching tuples are thick on the ground,
|
|
and indexscan the parts where they ain't. Maybe this seems natural
|
|
to me as an old JPEG campaigner, but if you don't see the logic I
|
|
recommend thinking about it a little ...
|
|
|
|
regards, tom lane
|
|
|
|
From tgl@sss.pgh.pa.us Wed Aug 7 09:27:05 2002
|
|
To: Hannu Krosing <hannu@tm.ee>
|
|
cc: Curt Sampson <cjs@cynic.net>, mark Kirkwood <markir@slithery.org>,
|
|
Gavin Sherry <swm@linuxworld.com.au>,
|
|
Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
|
|
Subject: Re: [HACKERS] CLUSTER and indisclustered
|
|
In-Reply-To: <1028726966.13418.12.camel@taru.tm.ee>
|
|
References: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net> <12776.1028697148@sss.pgh.pa.us> <1028695589.2133.11.camel@rh72.home.ee> <1028726966.13418.12.camel@taru.tm.ee>
|
|
Comments: In-reply-to Hannu Krosing <hannu@tm.ee>
|
|
message dated "07 Aug 2002 15:29:26 +0200"
|
|
Date: Wed, 07 Aug 2002 09:26:42 -0400
|
|
Message-ID: <15010.1028726802@sss.pgh.pa.us>
|
|
From: Tom Lane <tgl@sss.pgh.pa.us>
|
|
Content-Length: 1120
|
|
|
|
Hannu Krosing <hannu@tm.ee> writes:
|
|
> Now I remembered my original preference for page bitmaps (vs. tuple
|
|
> bitmaps): one can't actually make good use of a bitmap of tuples because
|
|
> there is no fixed tuples/page ratio and thus no way to quickly go from
|
|
> bit position to actual tuple. You mention the same problem but propose a
|
|
> different solution.
|
|
|
|
> Using page bitmap, we will at least avoid fetching any unneeded pages -
|
|
> essentially we will have a sequential scan over possibly interesting
|
|
> pages.
|
|
|
|
Right. One form of the "lossy compression" idea I suggested is to
|
|
switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets
|
|
too large to work with. Again, one could imagine doing that only in
|
|
denser areas of the bitmap.
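
One way to picture the per-tuple to per-page switch is the sketch below. It is a toy under stated assumptions, not a proposed data structure: the class name, the page budget, and the rule of degrading the densest page first are all hypothetical choices made for illustration.

```python
class TidBitmap:
    """Toy TID set that degrades the densest pages to page-level (lossy) entries."""

    def __init__(self, max_exact_pages):
        self.max_exact_pages = max_exact_pages
        self.exact = {}     # block -> set of tuple offsets known to match
        self.lossy = set()  # blocks where only "something here matches" is kept

    def add(self, block, offset):
        if block in self.lossy:
            return  # already page-level; the offset detail is not tracked
        self.exact.setdefault(block, set()).add(offset)
        if len(self.exact) > self.max_exact_pages:
            self._lossify_densest()

    def _lossify_densest(self):
        # Pick the page with the most matches; ties go to the lowest block number.
        densest = max(self.exact, key=lambda b: (len(self.exact[b]), -b))
        del self.exact[densest]
        self.lossy.add(densest)

    def pages(self):
        """Yield (block, offsets) in physical order; offsets is None for lossy pages."""
        for block in sorted(self.exact.keys() | self.lossy):
            if block in self.exact:
                yield block, sorted(self.exact[block])
            else:
                yield block, None  # caller must scan the page and recheck the qual


bitmap = TidBitmap(max_exact_pages=2)
for tid in [(1, 1), (1, 2), (1, 3), (5, 2), (8, 1)]:
    bitmap.add(*tid)

print(list(bitmap.pages()))
```

A lossy page costs a recheck of every tuple on it, which matters least where matches are already dense.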
|
|
|
|
> But I guess that CLUSTER support for INSERT will not be touched for 7.3
|
|
> as will real bitmap indexes ;)
|
|
|
|
All of this is far-future work I think. Adding a new scan type to the
|
|
executor would probably be pretty localized, but the ramifications in
|
|
the planner could be extensive --- especially if you want to do plans
|
|
involving ANDed or ORed bitmaps.
|
|
|
|
regards, tom lane
|
|
|
|
From pgsql-hackers-owner+M26178@postgresql.org Wed Aug 7 08:28:14 2002
|
|
X-Authentication-Warning: taru.tm.ee: hannu set sender to hannu@tm.ee using -f
|
|
Subject: Re: [HACKERS] CLUSTER and indisclustered
|
|
From: Hannu Krosing <hannu@tm.ee>
|
|
To: Hannu Krosing <hannu@tm.ee>
|
|
cc: Tom Lane <tgl@sss.pgh.pa.us>, Curt Sampson <cjs@cynic.net>,
|
|
mark Kirkwood <markir@slithery.org>, Gavin Sherry <swm@linuxworld.com.au>,
|
|
Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
|
|
In-Reply-To: <1028695589.2133.11.camel@rh72.home.ee>
|
|
References: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net>
|
|
<12776.1028697148@sss.pgh.pa.us> <1028695589.2133.11.camel@rh72.home.ee>
|
|
X-Mailer: Ximian Evolution 1.0.3.99
|
|
Date: 07 Aug 2002 15:29:26 +0200
|
|
Message-ID: <1028726966.13418.12.camel@taru.tm.ee>
|
|
X-Virus-Scanned: by AMaViS new-20020517
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Virus-Scanned: by AMaViS new-20020517
|
|
Content-Length: 1837
|
|
|
|
On Wed, 2002-08-07 at 06:46, Hannu Krosing wrote:
|
|
> On Wed, 2002-08-07 at 10:12, Tom Lane wrote:
|
|
> > Curt Sampson <cjs@cynic.net> writes:
|
|
> > > On Wed, 7 Aug 2002, Tom Lane wrote:
|
|
> > >> Also, the main downside of this approach is that the bitmap could
|
|
> > >> get large --- but you could have some logic that causes you to fall
|
|
> > >> back to plain sequential scan if you get too many index hits.
|
|
> >
|
|
> > > Well, what I was thinking of, should the list of TIDs to fetch get too
|
|
> > > long, was just to break it down into chunks.
|
|
> >
|
|
> > But then you lose the possibility of combining multiple indexes through
|
|
> > bitmap AND/OR steps, which seems quite interesting to me. If you've
|
|
> > visited only a part of each index then you can't apply that concept.
|
|
>
|
|
> When the tuples are small relative to pagesize, you may get some
|
|
> "compression" by saving just pages and not the actual tids in the
|
|
> bitmap.
|
|
|
|
Now I remembered my original preference for page bitmaps (vs. tuple
|
|
bitmaps): one can't actually make good use of a bitmap of tuples because
|
|
there is no fixed tuples/page ratio and thus no way to quickly go from
|
|
bit position to actual tuple. You mention the same problem but propose a
|
|
different solution.
|
|
|
|
Using page bitmap, we will at least avoid fetching any unneeded pages -
|
|
essentially we will have a sequential scan over possibly interesting
|
|
pages.
|
|
|
|
If we were to use page-bitmap index for something with only a few values
|
|
like booleans, some insert-time local clustering should be useful, so
|
|
that TRUEs and FALSEs end up on different pages.
|
|
|
|
But I guess that CLUSTER support for INSERT will not be touched for 7.3
|
|
as will real bitmap indexes ;)
|
|
|
|
---------------
|
|
Hannu
|
|
|
|
From pgsql-hackers-owner+M26192@postgresql.org Wed Aug 7 10:26:30 2002
|
|
To: Hannu Krosing <hannu@tm.ee>
|
|
cc: Curt Sampson <cjs@cynic.net>, mark Kirkwood <markir@slithery.org>,
|
|
Gavin Sherry <swm@linuxworld.com.au>,
|
|
Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
|
|
Subject: Re: [HACKERS] CLUSTER and indisclustered
|
|
In-Reply-To: <1028733234.13418.113.camel@taru.tm.ee>
|
|
References: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net> <12776.1028697148@sss.pgh.pa.us> <1028695589.2133.11.camel@rh72.home.ee> <1028726966.13418.12.camel@taru.tm.ee> <15010.1028726802@sss.pgh.pa.us> <1028733234.13418.113.camel@taru.tm.ee>
|
|
Comments: In-reply-to Hannu Krosing <hannu@tm.ee>
|
|
message dated "07 Aug 2002 17:13:54 +0200"
|
|
Date: Wed, 07 Aug 2002 10:26:13 -0400
|
|
Message-ID: <15622.1028730373@sss.pgh.pa.us>
|
|
From: Tom Lane <tgl@sss.pgh.pa.us>
|
|
X-Virus-Scanned: by AMaViS new-20020517
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Virus-Scanned: by AMaViS new-20020517
|
|
Content-Length: 1224
|
|
|
|
Hannu Krosing <hannu@tm.ee> writes:
|
|
> On Wed, 2002-08-07 at 15:26, Tom Lane wrote:
|
|
>> Right. One form of the "lossy compression" idea I suggested is to
|
|
>> switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets
|
|
>> too large to work with.
|
|
|
|
> If it is a real bitmap, should it not be easiest to allocate at the
|
|
> start ?
|
|
|
|
But it isn't a "real bitmap". That would be a really poor
|
|
implementation, both for space and speed --- do you really want to scan
|
|
over a couple of megs of zeroes to find the few one-bits you care about,
|
|
in the typical case? "Bitmap" is a convenient term because it describes
|
|
the abstract behavior we want, but the actual data structure will
|
|
probably be nontrivial. If I recall Ann's description correctly,
|
|
Firebird's implementation uses run length coding of some kind (anyone
|
|
care to dig in their source and get all the details?). If we tried
|
|
anything in the way of lossy compression then there'd be even more stuff
|
|
lurking under the hood.
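
To show why a sparse "bitmap" need not be stored as a flat array of bits, here is a generic run-length sketch. It is not Firebird's encoding, whose details are not given here; the (zeros skipped, ones in run) pair format and both function names are assumptions chosen for illustration.

```python
def rle_encode(set_bits):
    """Encode set bit positions as (zeros_skipped, ones_in_run) pairs."""
    bits = sorted(set(set_bits))
    runs = []
    position = 0  # first bit position not yet covered by an emitted run
    i = 0
    while i < len(bits):
        j = i
        # Extend the run while the next set bit is adjacent to the current one.
        while j + 1 < len(bits) and bits[j + 1] == bits[j] + 1:
            j += 1
        run_length = j - i + 1
        runs.append((bits[i] - position, run_length))
        position = bits[i] + run_length
        i = j + 1
    return runs


def rle_decode(runs):
    """Expand (zeros_skipped, ones_in_run) pairs back into set bit positions."""
    bits = []
    position = 0
    for zeros, ones in runs:
        position += zeros
        bits.extend(range(position, position + ones))
        position += ones
    return bits


# Five set bits spread over a million positions need only two pairs.
sparse = [3, 4, 5, 1000000, 1000001]
encoded = rle_encode(sparse)
print(encoded)
print(rle_decode(encoded) == sparse)
```

Finding the set bits never requires walking over the zeroes, which is the point made above about scanning megabytes of empty bitmap.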
|
|
|
|
regards, tom lane
|
|
|
|
From pgsql-hackers-owner+M26188@postgresql.org Wed Aug 7 10:12:26 2002
|
|
X-Authentication-Warning: taru.tm.ee: hannu set sender to hannu@tm.ee using -f
|
|
Subject: Re: [HACKERS] CLUSTER and indisclustered
|
|
From: Hannu Krosing <hannu@tm.ee>
|
|
To: Tom Lane <tgl@sss.pgh.pa.us>
|
|
cc: Curt Sampson <cjs@cynic.net>, mark Kirkwood <markir@slithery.org>,
|
|
Gavin Sherry <swm@linuxworld.com.au>,
|
|
Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
|
|
In-Reply-To: <15010.1028726802@sss.pgh.pa.us>
|
|
References: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net>
|
|
<12776.1028697148@sss.pgh.pa.us> <1028695589.2133.11.camel@rh72.home.ee>
|
|
<1028726966.13418.12.camel@taru.tm.ee> <15010.1028726802@sss.pgh.pa.us>
|
|
X-Mailer: Ximian Evolution 1.0.3.99
|
|
Date: 07 Aug 2002 17:13:54 +0200
|
|
Message-ID: <1028733234.13418.113.camel@taru.tm.ee>
|
|
X-Virus-Scanned: by AMaViS new-20020517
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Virus-Scanned: by AMaViS new-20020517
|
|
Content-Length: 2812
|
|
|
|
On Wed, 2002-08-07 at 15:26, Tom Lane wrote:
|
|
> Hannu Krosing <hannu@tm.ee> writes:
|
|
> > Now I remembered my original preference for page bitmaps (vs. tuple
|
|
> > bitmaps): one can't actually make good use of a bitmap of tuples because
|
|
> > there is no fixed tuples/page ratio and thus no way to quickly go from
|
|
> > bit position to actual tuple. You mention the same problem but propose a
|
|
> > different solution.
|
|
>
|
|
> > Using page bitmap, we will at least avoid fetching any unneeded pages -
|
|
> > essentially we will have a sequential scan over possibly interesting
|
|
> > pages.
|
|
>
|
|
> Right. One form of the "lossy compression" idea I suggested is to
|
|
> switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets
|
|
> too large to work with.
|
|
|
|
If it is a real bitmap, should it not be easiest to allocate at the
|
|
start ?
|
|
|
|
A page bitmap for a 100 000 000 tuple table with 10 tuples/page will be
|
|
sized 10000000/8 = 1.25 MB, which does not look too big for me for that
|
|
amount of data (the data table itself would occupy 80 GB).
|
|
|
|
Even having the bitmap of 16 bits/page (with the bits 0-14 meaning
|
|
tuples 0-14 and bit 15 meaning "seq scan the rest of page") would
|
|
consume just 20 MB of _local_ memory, and would be quite justifiable
|
|
for a query on a table that large.
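
The sizing arithmetic above can be checked with a short Python sketch. The function name and the 8K page size are assumptions made for illustration; the tuple count, tuples per page, and bits per page are the figures from the message.

```python
PAGE_SIZE = 8 * 1024  # bytes per heap page (assumed 8K block size)


def bitmap_bytes(n_tuples, tuples_per_page, bits_per_page):
    """Bytes needed for a bitmap that spends bits_per_page bits on every heap page."""
    n_pages = (n_tuples + tuples_per_page - 1) // tuples_per_page  # round up
    return (n_pages * bits_per_page + 7) // 8                      # round up to bytes


n_tuples, per_page = 100_000_000, 10
n_pages = n_tuples // per_page

print("1 bit/page  :", bitmap_bytes(n_tuples, per_page, 1) / 1e6, "MB")
print("16 bits/page:", bitmap_bytes(n_tuples, per_page, 16) / 1e6, "MB")
print("heap size   :", n_pages * PAGE_SIZE / 1e9, "GB (about 80 GB)")
```

The one-bit-per-page case gives 1.25 MB and the 16-bit case gives 20 MB, matching the figures quoted in the message.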
|
|
|
|
For a real bitmap index the tuples-per-page should be a user-supplied
|
|
tuning parameter.
|
|
|
|
> Again, one could imagine doing that only in denser areas of the bitmap.
|
|
|
|
I would hardly call the resulting structure "a bitmap" ;)
|
|
|
|
And I'm not sure the overhead for a more complex structure would win us
|
|
any additional performance for most cases.
|
|
|
|
> > But I guess that CLUSTER support for INSERT will not be touched for 7.3
|
|
> > as will real bitmap indexes ;)
|
|
>
|
|
> All of this is far-future work I think.
|
|
|
|
After we do that we will probably be able to claim support for
|
|
"datawarehousing" ;)
|
|
|
|
> Adding a new scan type to the
|
|
> executor would probably be pretty localized, but the ramifications in
|
|
> the planner could be extensive --- especially if you want to do plans
|
|
> involving ANDed or ORed bitmaps.
|
|
|
|
Also going to "smart inserter" which can do local clustering on sets of
|
|
real bitmap indexes for INSERTS (and INSERT side of UPDATE) would
|
|
probably be a major change from our current "stupid inserter" ;)
|
|
|
|
This will not be needed for bitmap resolution higher than 1bit/page but
|
|
default local clustering on bitmap indexes will probably buy us some
|
|
extra performance by avoiding data page fetches when such indexes are
|
|
used.
|
|
|
|
And anyway the support for INSERT being aware of clustering will probably
|
|
come up sometime.
|
|
|
|
------------
|
|
Hannu
|
|
|
|
|
|
|
|
---------------------------(end of broadcast)---------------------------
|
|
TIP 2: you can get off all lists at once with the unregister command
|
|
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
|
|
|
|
From hannu@tm.ee Wed Aug 7 11:22:53 2002
|
|
X-Authentication-Warning: taru.tm.ee: hannu set sender to hannu@tm.ee using -f
|
|
Subject: Re: [HACKERS] CLUSTER and indisclustered
|
|
From: Hannu Krosing <hannu@tm.ee>
|
|
To: Tom Lane <tgl@sss.pgh.pa.us>
|
|
cc: Curt Sampson <cjs@cynic.net>, mark Kirkwood <markir@slithery.org>,
|
|
Gavin
|
|
Sherry <swm@linuxworld.com.au>,
|
|
Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
|
|
In-Reply-To: <15622.1028730373@sss.pgh.pa.us>
|
|
References: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net>
|
|
<12776.1028697148@sss.pgh.pa.us> <1028695589.2133.11.camel@rh72.home.ee>
|
|
<1028726966.13418.12.camel@taru.tm.ee> <15010.1028726802@sss.pgh.pa.us>
|
|
<1028733234.13418.113.camel@taru.tm.ee> <15622.1028730373@sss.pgh.pa.us>
|
|
X-Mailer: Ximian Evolution 1.0.3.99
|
|
Date: 07 Aug 2002 18:24:30 +0200
|
|
Message-ID: <1028737470.13419.182.camel@taru.tm.ee>
|
|
Content-Length: 2382
|
|
|
|
On Wed, 2002-08-07 at 16:26, Tom Lane wrote:
|
|
> Hannu Krosing <hannu@tm.ee> writes:
|
|
> > On Wed, 2002-08-07 at 15:26, Tom Lane wrote:
|
|
> >> Right. One form of the "lossy compression" idea I suggested is to
|
|
> >> switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets
|
|
> >> too large to work with.
|
|
>
|
|
> > If it is a real bitmap, should it not be easiest to allocate at the
|
|
> > start ?
|
|
>
|
|
> But it isn't a "real bitmap". That would be a really poor
|
|
> implementation, both for space and speed --- do you really want to scan
|
|
> over a couple of megs of zeroes to find the few one-bits you care about,
|
|
> in the typical case?
|
|
|
|
I guess that depends on data. The typical case should be something the
|
|
stats process will find out so the optimiser can use it.
|
|
|
|
The bitmap must be less than 1/48 (size of TID) full for best
|
|
uncompressed "active-tid-list" to be smaller than plain bitmap. If there
|
|
were some structure above list then this ratio would be even higher.
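
The break-even point described above can be worked through as a small Python sketch. It assumes, as the message does, a 48-bit TID per list entry and one bit per tuple slot in the plain bitmap; the function names are hypothetical.

```python
TID_BITS = 48  # one 6-byte tuple identifier per entry in the uncompressed list


def tid_list_bits(n_set):
    """Storage for an uncompressed list holding one TID per matching tuple."""
    return n_set * TID_BITS


def plain_bitmap_bits(n_slots):
    """Storage for a plain bitmap with one bit per possible tuple position."""
    return n_slots


def list_is_smaller(n_set, n_slots):
    """True when the TID list beats the bitmap, i.e. density is below 1/48."""
    return tid_list_bits(n_set) < plain_bitmap_bits(n_slots)


n_slots = 4800
print(list_is_smaller(99, n_slots))   # density just under 1/48
print(list_is_smaller(100, n_slots))  # density exactly 1/48: sizes are equal
```

Any per-entry overhead in a real list structure would raise the cost of the list and move the break-even density below 1/48.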
|
|
|
|
I have had good experience using "compressed delta lists", which will
|
|
scale well over the whole "fullness" spectrum of bitmap, but this is for
|
|
storage, not for initial constructing of lists.
|
|
|
|
> "Bitmap" is a convenient term because it describes
|
|
> the abstract behavior we want, but the actual data structure will
|
|
> probably be nontrivial. If I recall Ann's description correctly,
|
|
> Firebird's implementation uses run length coding of some kind (anyone
|
|
> care to dig in their source and get all the details?).
|
|
|
|
Plain RLL is probably a good way to store it and for merging two or more
|
|
bitmaps, but not as good for constructing it bit-by-bit. I guess the
|
|
most effective structure for updating is often still a plain bitmap
|
|
(maybe not if it is very sparse and all of it does not fit in cache),
|
|
followed by some kind of balanced tree (maybe rb-tree).
|
|
|
|
If the bitmap is relatively full then the plain bitmap is almost always
|
|
the most effective to update.
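
A minimal Python sketch of the trade-off discussed above: a run-length coded bitmap merges cheaply run by run, whereas building one bit at a time is simpler on a plain list of bits. The representation as (value, run_length) pairs and both function names are assumptions for illustration, not a description of Firebird's or any other implementation.

```python
from itertools import groupby


def rle_encode(bits):
    """Collapse a sequence of 0/1 values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(bits)]


def rle_and(a, b):
    """AND two run-length coded bitmaps of equal total length without expanding them."""
    out = []
    i = j = 0
    left_a = a[0][1] if a else 0
    left_b = b[0][1] if b else 0
    while i < len(a) and j < len(b):
        step = min(left_a, left_b)          # longest stretch where both runs are constant
        value = a[i][0] & b[j][0]
        if out and out[-1][0] == value:     # extend the previous run if the value repeats
            out[-1] = (value, out[-1][1] + step)
        else:
            out.append((value, step))
        left_a -= step
        left_b -= step
        if left_a == 0:
            i += 1
            left_a = a[i][1] if i < len(a) else 0
        if left_b == 0:
            j += 1
            left_b = b[j][1] if j < len(b) else 0
    return out


x = [0, 0, 1, 1, 1, 0, 1, 1]
y = [1, 0, 1, 0, 1, 1, 1, 0]
print(rle_and(rle_encode(x), rle_encode(y)))
```

Setting a single bit in the middle of the encoded form would mean splitting a run, which is why a plain bitmap (or a tree) is the more natural structure while the set is still being built.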
|
|
|
|
> If we tried anything in the way of lossy compression then there'd
|
|
> be even more stuff lurking under the hood.
|
|
|
|
Having three-valued (0,1,maybe) RLL-encoded "tritmap" would be a good
|
|
way to represent lossy compression, and it would also be quite
|
|
straightforward to merge two of these using AND or OR. It may even be
|
|
possible to easily construct it using a fixed-length b-tree and going
|
|
from 1 to "maybe" for nodes that get too dense.
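
The three-valued merge can be sketched in Python as follows. This is only an illustration of the idea: the `Tri` enum, the ordering NO < MAYBE < YES, and the `make_lossy` helper are assumptions, and the sketch works on plain lists rather than a run-length coded form.

```python
from enum import IntEnum


class Tri(IntEnum):
    NO = 0      # definitely not a match
    MAYBE = 1   # lossy entry: must be rechecked against the heap
    YES = 2     # definitely a match


def tri_and(a, b):
    """AND of two tritmaps: with NO < MAYBE < YES this is the element-wise minimum."""
    return [min(x, y) for x, y in zip(a, b)]


def tri_or(a, b):
    """OR of two tritmaps: the element-wise maximum under the same ordering."""
    return [max(x, y) for x, y in zip(a, b)]


def make_lossy(bits, start, stop):
    """Degrade entries in [start, stop) to MAYBE if any of them is set."""
    out = list(bits)
    if any(v != Tri.NO for v in out[start:stop]):
        out[start:stop] = [Tri.MAYBE] * (stop - start)
    return out


a = [Tri.YES, Tri.MAYBE, Tri.NO, Tri.MAYBE]
b = [Tri.YES, Tri.YES, Tri.MAYBE, Tri.MAYBE]
print([v.name for v in tri_and(a, b)])
print([v.name for v in tri_or(a, b)])
```

A NO on either side of an AND stays NO even when the other side is MAYBE, so lossy regions never cause a definite non-match to be fetched.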
|
|
|
|
---------------
|
|
Hannu
|
|
|
|
|
|
From pgsql-hackers-owner+M21991@postgresql.org Wed Apr 24 23:37:37 2002
|
|
Return-path: <pgsql-hackers-owner+M21991@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P3ba416337
|
|
for <pgman@candle.pha.pa.us>; Wed, 24 Apr 2002 23:37:36 -0400 (EDT)
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id CF13447622B; Wed, 24 Apr 2002 23:37:31 -0400 (EDT)
|
|
Received: from sraigw.sra.co.jp (sraigw.sra.co.jp [202.32.10.2])
|
|
by postgresql.org (Postfix) with ESMTP id 3EE92474E4B
|
|
for <pgsql-hackers@postgresql.org>; Wed, 24 Apr 2002 23:37:19 -0400 (EDT)
|
|
Received: from srascb.sra.co.jp (srascb [133.137.8.65])
|
|
by sraigw.sra.co.jp (8.9.3/3.7W-sraigw) with ESMTP id MAA76393;
|
|
Thu, 25 Apr 2002 12:35:44 +0900 (JST)
|
|
Received: (from root@localhost)
|
|
by srascb.sra.co.jp (8.11.6/8.11.6) id g3P3ZCK64299;
|
|
Thu, 25 Apr 2002 12:35:12 +0900 (JST)
|
|
(envelope-from t-ishii@sra.co.jp)
|
|
Received: from sranhm.sra.co.jp (sranhm [133.137.170.62])
|
|
by srascb.sra.co.jp (8.11.6/8.11.6av) with ESMTP id g3P3ZBV64291;
|
|
Thu, 25 Apr 2002 12:35:11 +0900 (JST)
|
|
(envelope-from t-ishii@sra.co.jp)
|
|
Received: from localhost (IDENT:t-ishii@srapc1474.sra.co.jp [133.137.170.59])
|
|
by sranhm.sra.co.jp (8.9.3+3.2W/3.7W-srambox) with ESMTP id MAA25562;
|
|
Thu, 25 Apr 2002 12:35:43 +0900
|
|
To: tgl@sss.pgh.pa.us
|
|
cc: cjs@cynic.net, pgman@candle.pha.pa.us, pgsql-hackers@postgresql.org
|
|
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
|
|
In-Reply-To: <12342.1019705420@sss.pgh.pa.us>
|
|
References: <Pine.NEB.4.43.0204251118040.445-100000@angelic.cynic.net>
|
|
<12342.1019705420@sss.pgh.pa.us>
|
|
X-Mailer: Mew version 1.94.2 on Emacs 20.7 / Mule 4.1
|
|
=?iso-2022-jp?B?KBskQjAqGyhCKQ==?=
|
|
MIME-Version: 1.0
|
|
Content-Type: Text/Plain; charset=us-ascii
|
|
Content-Transfer-Encoding: 7bit
|
|
Message-ID: <20020425123429E.t-ishii@sra.co.jp>
|
|
Date: Thu, 25 Apr 2002 12:34:29 +0900
|
|
From: Tatsuo Ishii <t-ishii@sra.co.jp>
|
|
X-Dispatcher: imput version 20000228(IM140)
|
|
Lines: 12
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
Status: OR
|
|
|
|
> Curt Sampson <cjs@cynic.net> writes:
|
|
> > Grabbing bigger chunks is always optimal, AFAICT, if they're not
|
|
> > *too* big and you use the data. A single 64K read takes very little
|
|
> > longer than a single 8K read.
|
|
>
|
|
> Proof?
|
|
|
|
A long time ago I tested with a 32k block size and got a 1.5-2x speedup
|
|
compared with the ordinary 8k block size in the sequential scan case.
|
|
FYI, if this is the case.
|
|
--
|
|
Tatsuo Ishii
|
|
|
|
---------------------------(end of broadcast)---------------------------
|
|
TIP 5: Have you checked our extensive FAQ?
|
|
|
|
http://www.postgresql.org/users-lounge/docs/faq.html
|
|
|
|
From mloftis@wgops.com Thu Apr 25 01:43:14 2002
|
|
Return-path: <mloftis@wgops.com>
|
|
Received: from free.wgops.com (root@dsl092-002-178.sfo1.dsl.speakeasy.net [66.92.2.178])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P5hC426529
|
|
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 01:43:13 -0400 (EDT)
|
|
Received: from wgops.com ([10.1.2.207])
|
|
by free.wgops.com (8.11.3/8.11.3) with ESMTP id g3P5hBR43020;
|
|
Wed, 24 Apr 2002 22:43:11 -0700 (PDT)
|
|
(envelope-from mloftis@wgops.com)
|
|
Message-ID: <3CC7976F.7070407@wgops.com>
|
|
Date: Wed, 24 Apr 2002 22:43:11 -0700
|
|
From: Michael Loftis <mloftis@wgops.com>
|
|
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.4.1) Gecko/20020314 Netscape6/6.2.2
|
|
X-Accept-Language: en-us
|
|
MIME-Version: 1.0
|
|
To: Tom Lane <tgl@sss.pgh.pa.us>
|
|
cc: Curt Sampson <cjs@cynic.net>, Bruce Momjian <pgman@candle.pha.pa.us>,
|
|
PostgreSQL-development <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
|
|
References: <Pine.NEB.4.43.0204251118040.445-100000@angelic.cynic.net> <12342.1019705420@sss.pgh.pa.us>
|
|
Content-Type: text/plain; charset=us-ascii; format=flowed
|
|
Content-Transfer-Encoding: 7bit
|
|
Status: OR
|
|
|
|
|
|
|
|
Tom Lane wrote:
|
|
|
|
>Curt Sampson <cjs@cynic.net> writes:
|
|
>
|
|
>>Grabbing bigger chunks is always optimal, AFAICT, if they're not
|
|
>>*too* big and you use the data. A single 64K read takes very little
|
|
>>longer than a single 8K read.
|
|
>>
|
|
>
|
|
>Proof?
|
|
>
|
|
I contend this statement.
|
|
|
|
It's optimal to a point. I know that my system settles into its best
|
|
read-speeds @ 32K or 64K chunks. 8K chunks are far below optimal for my
|
|
system. Most systems I work on do far better at 16K than at 8K, and
|
|
most don't see any degradation when going to 32K chunks. (this is
|
|
across numerous OSes and configs -- results are interpretations from
|
|
bonnie disk i/o marks).
|
|
|
|
Depending on what you're doing it is more efficient to read bigger
|
|
blocks up to a point. If you're multi-thread or reading in non-blocking
|
|
mode, take as big a chunk as you can handle or are ready to process in
|
|
quick order. If you're picking up a bunch of little chunks here and
|
|
there and know you're not using them again then choose a size that will
|
|
hopefully cause some of the reads to overlap; failing that, pick the
|
|
smallest usable read size.
|
|
|
|
The OS can never do that stuff for you.
|
|
|
|
|
|
|
|
From cjs@cynic.net Thu Apr 25 03:29:05 2002
|
|
Return-path: <cjs@cynic.net>
|
|
Received: from angelic.cynic.net ([202.232.117.21])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P7T3404027
|
|
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 03:29:03 -0400 (EDT)
|
|
Received: from localhost (localhost [127.0.0.1])
|
|
by angelic.cynic.net (Postfix) with ESMTP
|
|
id 1C44E870E; Thu, 25 Apr 2002 16:28:51 +0900 (JST)
|
|
Date: Thu, 25 Apr 2002 16:28:51 +0900 (JST)
|
|
From: Curt Sampson <cjs@cynic.net>
|
|
To: Tom Lane <tgl@sss.pgh.pa.us>
|
|
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
|
|
PostgreSQL-development <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
|
|
In-Reply-To: <12342.1019705420@sss.pgh.pa.us>
|
|
Message-ID: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
|
|
MIME-Version: 1.0
|
|
Content-Type: TEXT/PLAIN; charset=US-ASCII
|
|
Status: OR
|
|
|
|
On Wed, 24 Apr 2002, Tom Lane wrote:
|
|
|
|
> Curt Sampson <cjs@cynic.net> writes:
|
|
> > Grabbing bigger chunks is always optimal, AFAICT, if they're not
|
|
> > *too* big and you use the data. A single 64K read takes very little
|
|
> > longer than a single 8K read.
|
|
>
|
|
> Proof?
|
|
|
|
Well, there are various sorts of "proof" for this assertion. What
|
|
sort do you want?
|
|
|
|
Here's a few samples; if you're looking for something different to
|
|
satisfy you, let's discuss it.
|
|
|
|
1. Theoretical proof: two components of the delay in retrieving a
|
|
block from disk are the disk arm movement and the wait for the
|
|
right block to rotate under the head.
|
|
|
|
When retrieving, say, eight adjacent blocks, these will be spread
|
|
across no more than two cylinders (with luck, only one). The worst
|
|
case access time for a single block is the disk arm movement plus
|
|
the full rotational wait; this is the same as the worst case for
|
|
eight blocks if they're all on one cylinder. If they're not on one
|
|
cylinder, they're still on adjacent cylinders, requiring a very
|
|
short seek.
|
|
|
|
2. Proof by others using it: SQL server uses 64K reads when doing
|
|
table scans, as they say that their research indicates that the
|
|
major limitation is usually the number of I/O requests, not the
|
|
I/O capacity of the disk. BSD's FFS explicitly separates the optimum
|
|
allocation size for storage (1K fragments) and optimum read size
|
|
(8K blocks) because they found performance to be much better when
|
|
a larger size block was read. Most file system vendors, too, do
|
|
read-ahead for this very reason.
|
|
|
|
3. Proof by testing. I wrote a little ruby program to seek to a
|
|
random point in the first 2 GB of my raw disk partition and read
|
|
1-8 8K blocks of data. (This was done as one I/O request.) (Using
|
|
the raw disk partition I avoid any filesystem buffering.) Here are
|
|
typical results:
|
|
|
|
125 reads of 16x8K blocks: 1.9 sec, 66.04 req/sec. 15.1 ms/req, 0.946 ms/block
|
|
250 reads of 8x8K blocks: 1.9 sec, 132.3 req/sec. 7.56 ms/req, 0.945 ms/block
|
|
500 reads of 4x8K blocks: 2.5 sec, 199 req/sec. 5.03 ms/req, 1.26 ms/block
|
|
1000 reads of 2x8K blocks: 3.8 sec, 261.6 req/sec. 3.82 ms/req, 1.91 ms/block
|
|
2000 reads of 1x8K blocks: 6.4 sec, 310.4 req/sec. 3.22 ms/req, 3.22 ms/block
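
The Ruby program itself is not included in the message. A rough Python sketch of the same kind of test might look like the following; the function name, its parameters, and the fixed seed are all hypothetical. Note that running it against an ordinary file goes through the filesystem cache, unlike the raw partition used for the numbers above, and that `os.pread` is only available on Unix-like systems.

```python
import os
import random
import time

BLOCK = 8 * 1024  # 8K blocks, as in the test described in the message


def timed_random_reads(path, n_requests, blocks_per_request, span_bytes, seed=0):
    """Seek to random block-aligned offsets and read several adjacent blocks per request.

    Returns (elapsed_seconds, total_bytes_read). Each request is one pread call.
    """
    rng = random.Random(seed)
    size = BLOCK * blocks_per_request
    max_block = (span_bytes - size) // BLOCK  # last start block that still fits
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        total = 0
        for _ in range(n_requests):
            offset = rng.randint(0, max_block) * BLOCK
            total += len(os.pread(fd, size, offset))
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed, total
```

Calling it with the request counts from the table (2000 requests of 1 block, 1000 of 2, and so on) keeps the total amount of data read constant across runs, which is what makes the elapsed times comparable.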
|
|
|
|
The ratios of data retrieval speed per read for groups of adjacent
|
|
8K blocks, assuming a single 8K block reads in 1 time unit, are:
|
|
|
|
1 block 1.00
|
|
2 blocks 1.18
|
|
4 blocks 1.56
|
|
8 blocks 2.34
|
|
16 blocks 4.68
|
|
|
|
At less than 20% more expensive, certainly two-block read requests
|
|
could be considered to cost "very little more" than one-block read
|
|
requests. Even four-block read requests are only half-again as
|
|
expensive. And if you know you're really going to be using the
|
|
data, read in 8 block chunks and your cost per block (in terms of
|
|
time) drops to less than a third of the cost of single-block reads.
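
The ratios and per-block costs discussed above follow directly from the ms/req column of the single-process results. A small Python sketch of the arithmetic (function names are illustrative; the timings are the ones quoted in the message):

```python
# blocks per request -> milliseconds per request, single-process run
single_proc_ms = {1: 3.22, 2: 3.82, 4: 5.03, 8: 7.56, 16: 15.1}


def cost_ratios(ms_per_request):
    """Cost of one request relative to a single-block request."""
    base = ms_per_request[1]
    return {n: ms / base for n, ms in ms_per_request.items()}


def per_block_ms(ms_per_request):
    """Average time spent per 8K block at each request size."""
    return {n: ms / n for n, ms in ms_per_request.items()}


for n, ratio in cost_ratios(single_proc_ms).items():
    print(f"{n:2d} blocks: {ratio:.2f}x per request, "
          f"{per_block_ms(single_proc_ms)[n]:.3f} ms per block")
```

Because the published timings are already rounded, the computed ratios can differ from the table in the last digit.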
|
|
|
|
Let me put paid to comments about multiple simultaneous readers
|
|
making this invalid. Here's a typical result I get with four
|
|
instances of the program running simultaneously:
|
|
|
|
125 reads of 16x8K blocks: 4.4 sec, 28.21 req/sec. 35.4 ms/req, 2.22 ms/block
|
|
250 reads of 8x8K blocks: 3.9 sec, 64.88 req/sec. 15.4 ms/req, 1.93 ms/block
|
|
500 reads of 4x8K blocks: 5.8 sec, 86.52 req/sec. 11.6 ms/req, 2.89 ms/block
|
|
1000 reads of 2x8K blocks: 10 sec, 100.2 req/sec. 9.98 ms/req, 4.99 ms/block
|
|
2000 reads of 1x8K blocks: 18 sec, 110 req/sec. 9.09 ms/req, 9.09 ms/block
|
|
|
|
Here's the ratio table again, with another column comparing the
|
|
aggregate number of requests per second for one process and four
|
|
processes:
|
|
|
|
1 block 1.00 310 : 440
|
|
2 blocks 1.10 262 : 401
|
|
4 blocks 1.28 199 : 346
|
|
8 blocks 1.69 132 : 260
|
|
16 blocks 3.89 66 : 113
|
|
|
|
Note that here the relative increase in performance for increasing
|
|
sizes of reads is even *better* until we get past 64K chunks. The
|
|
overall throughput is better, of course, because with more requests
|
|
per second coming in, the disk seek ordering code has more to work
|
|
with and the average seek time spent seeking vs. reading will be
|
|
reduced.
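
The four-process columns can be reproduced from the per-process figures: the ratio comes from ms/req, and the aggregate is the per-process request rate multiplied by the number of processes. A short Python sketch, with a hypothetical `summarize` helper and the numbers taken from the four-instance run above:

```python
# blocks per request -> (per-process req/sec, ms/req), four-instance run
four_proc = {
    1: (110.0, 9.09),
    2: (100.2, 9.98),
    4: (86.52, 11.6),
    8: (64.88, 15.4),
    16: (28.21, 35.4),
}


def summarize(results, processes):
    """Return {blocks: (cost ratio vs. 1 block, aggregate req/sec over all processes)}."""
    base_ms = results[1][1]
    return {
        n: (round(ms / base_ms, 2), round(rps * processes))
        for n, (rps, ms) in results.items()
    }


for n, (ratio, aggregate) in summarize(four_proc, 4).items():
    print(f"{n:2d} blocks  ratio {ratio:.2f}  aggregate {aggregate} req/sec")
```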
|
|
|
|
You know, this is not rocket science; I'm sure there must be papers
|
|
all over the place about this. If anybody still disagrees that it's
|
|
a good thing to read chunks up to 64K or so when the blocks are
|
|
adjacent and you know you'll need the data, I'd like to see some
|
|
tangible evidence to support that.
|
|
|
|
cjs
|
|
--
|
|
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
|
|
Don't you know, in this new Dark Age, we're all light. --XTC
|
|
|
|
|
|
From cjs@cynic.net Thu Apr 25 03:55:59 2002
|
|
Return-path: <cjs@cynic.net>
|
|
Received: from angelic.cynic.net ([202.232.117.21])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P7tv405489
|
|
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 03:55:57 -0400 (EDT)
|
|
Received: from localhost (localhost [127.0.0.1])
|
|
by angelic.cynic.net (Postfix) with ESMTP
|
|
id 188EC870E; Thu, 25 Apr 2002 16:55:51 +0900 (JST)
|
|
Date: Thu, 25 Apr 2002 16:55:50 +0900 (JST)
|
|
From: Curt Sampson <cjs@cynic.net>
|
|
To: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
|
|
In-Reply-To: <200204250404.g3P44OI19061@candle.pha.pa.us>
|
|
Message-ID: <Pine.NEB.4.43.0204251636550.3111-100000@angelic.cynic.net>
|
|
MIME-Version: 1.0
|
|
Content-Type: TEXT/PLAIN; charset=US-ASCII
|
|
Status: OR
|
|
|
|
On Thu, 25 Apr 2002, Bruce Momjian wrote:
|
|
|
|
> Well, we are guilty of trying to push as much as possible on to other
|
|
> software. We do this for portability reasons, and because we think our
|
|
> time is best spent dealing with db issues, not issues that can be dealt
|
|
> with by other existing software, as long as the software is decent.
|
|
|
|
That's fine. I think that's a perfectly fair thing to do.
|
|
|
|
It was just the wording (i.e., "it's this other software's fault
|
|
that blah de blah") that got to me. To say, "We don't do readahead
|
|
because most OSes supply it, and we feel that other things would
|
|
help more to improve performance," is fine by me. Or even, "Well,
|
|
nobody feels like doing it. You want it, do it yourself," I have
|
|
no problem with.
|
|
|
|
> Sure, that is certainly true. However, it is hard to know what the
|
|
> future will hold even if we had perfect knowledge of what was happening
|
|
> in the kernel. We don't know who else is going to start doing I/O once
|
|
> our I/O starts. We may have a better idea with kernel knowledge, but we
|
|
> still don't know 100% what will be cached.
|
|
|
|
Well, we do if we use raw devices and do our own caching, using
|
|
pages that are pinned in RAM. That was sort of what I was aiming
|
|
at for the long run.
|
|
|
|
> We have free-behind on our list.
|
|
|
|
Uh...can't do it, if you're relying on the OS to do the buffering.
|
|
How do you tell the OS that you're no longer going to use a page?
|
|
|
|
> I think LRU-K will do this quite well
|
|
> and be a nice general solution for more than just sequential scans.
|
|
|
|
LRU-K sounds like a great idea to me, as does putting pages read
|
|
for a table scan at the LRU end of the cache, rather than the MRU
|
|
(assuming we do something to ensure that they stay in cache until
|
|
read once, at any rate).
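
The idea of placing table-scan pages at the LRU end can be illustrated with a toy Python cache. This is a plain LRU list with a cold-end insertion option, not LRU-K and not PostgreSQL's buffer manager; the class and parameter names are made up for the example.

```python
from collections import OrderedDict


class BufferCache:
    """Toy page cache: the first entry is the next eviction victim (LRU end)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()

    def access(self, page_id, sequential_scan=False):
        """Touch a page; return True on a cache hit, False on a miss."""
        if page_id in self.pages:
            if not sequential_scan:
                self.pages.move_to_end(page_id)        # promote to the MRU end
            return True
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)             # evict from the LRU end
        self.pages[page_id] = None                     # new page lands at the MRU end
        if sequential_scan:
            self.pages.move_to_end(page_id, last=False)  # scan pages go to the LRU end
        return False


cache = BufferCache(capacity=3)
for page in ("a", "b", "c"):
    cache.access(page)
for page in ("s1", "s2", "s3"):
    cache.access(page, sequential_scan=True)
print(list(cache.pages))
```

In this toy version each scan page becomes the next eviction victim as soon as it is loaded, so the scan recycles a single slot and the hot pages "b" and "c" survive. A real implementation would also need to pin the page until it has been read once, as the message notes.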
|
|
|
|
But again, great for your own cache, but doesn't work with the OS
|
|
cache. And I'm a bit scared to crank up too high the amount of
|
|
memory I give Postgres, lest the OS try to too aggressively buffer
|
|
all that I/O in what memory remains to it, and start blowing programs
|
|
(like maybe the backend binary itself) out of RAM. But maybe this
|
|
isn't typically a problem; I don't know.
|
|
|
|
> There may be validity in this. It is easy to do (I think) and could be
|
|
> a win.
|
|
|
|
It didn't look too difficult to me, when I looked at the code, and
|
|
you can see what kind of win it is from the response I just made
|
|
to Tom.
|
|
|
|
> > 1. It is *not* true that you have no idea where data is when
|
|
> > using a storage array or other similar system. While you
|
|
> > certainly ought not worry about things such as head positions
|
|
> > and so on, it's been a given for a long, long time that two
|
|
> > blocks that have close index numbers are going to be close
|
|
> > together in physical storage.
|
|
>
|
|
> SCSI drivers, for example, are pretty smart. Not sure we can take
|
|
> advantage of that from user-land I/O.
|
|
|
|
Looking at the NetBSD ones, I don't see what they're doing that's
|
|
so smart. (Aside from some awfully clever workarounds for stupid
|
|
hardware limitations that would otherwise kill performance.) What
|
|
sorts of "smart" are you referring to?
|
|
|
|
> Yes, but we are seeing some db's moving away from raw I/O.
|
|
|
|
Such as whom? And are you certain that they're moving to using the
|
|
OS buffer cache, too? MS SQL server, for example, uses the filesystem,
|
|
but turns off all buffering on those files.
|
|
|
|
> Our performance numbers beat most of the big db's already, so we must
|
|
> be doing something right.
|
|
|
|
Really? Do the performance numbers for simple, bulk operations
|
|
(imports, exports, table scans) beat the others handily? My intuition
|
|
says not, but I'll happily be convinced otherwise.
|
|
|
|
> Yes, but do we spend our time doing that. Is the payoff worth it, vs.
|
|
> working on other features. Sure it would be great to have all these
|
|
> fancy things, but is this where our time should be spent, considering
|
|
> other items on the TODO list?
|
|
|
|
I agree that these things need to be assessed.
|
|
|
|
> Jumping in and doing the I/O ourselves is a big undertaking, and looking
|
|
> at our TODO list, I am not sure if it is worth it right now.
|
|
|
|
Right. I'm not trying to say this is a critical priority, I'm just
|
|
trying to determine what we do right now, what we could do, and
|
|
the potential performance increase that would give us.
|
|
|
|
cjs
|
|
--
|
|
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
|
|
Don't you know, in this new Dark Age, we're all light. --XTC
|
|
|
|
|
|
From cjs@cynic.net Thu Apr 25 05:19:11 2002
|
|
Return-path: <cjs@cynic.net>
|
|
Received: from angelic.cynic.net ([202.232.117.21])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P9J9412878
|
|
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 05:19:10 -0400 (EDT)
|
|
Received: from localhost (localhost [127.0.0.1])
|
|
by angelic.cynic.net (Postfix) with ESMTP
|
|
id 50386870E; Thu, 25 Apr 2002 18:19:03 +0900 (JST)
|
|
Date: Thu, 25 Apr 2002 18:19:02 +0900 (JST)
|
|
From: Curt Sampson <cjs@cynic.net>
|
|
To: Tom Lane <tgl@sss.pgh.pa.us>
|
|
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
|
|
PostgreSQL-development <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
|
|
In-Reply-To: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
|
|
Message-ID: <Pine.NEB.4.43.0204251805000.3111-100000@angelic.cynic.net>
|
|
MIME-Version: 1.0
|
|
Content-Type: TEXT/PLAIN; charset=US-ASCII
|
|
Status: OR
|
|
|
|
On Thu, 25 Apr 2002, Curt Sampson wrote:
|
|
|
|
> Here's the ratio table again, with another column comparing the
|
|
> aggregate number of requests per second for one process and four
|
|
> processes:
|
|
>
|
|
|
|
Just for interest, I ran this again with 20 processes working
|
|
simultaneously. I did six runs at each blockread size and summed
|
|
the tps for each process to find the aggregate number of reads per
|
|
second during the test. I dropped the highest and the lowest ones,
|
|
and averaged the rest. Here's the new table:
|
|
|
|
1 proc 4 procs 20 procs
|
|
|
|
1 block 310 440 260
|
|
2 blocks 262 401 481
|
|
4 blocks 199 346 354
|
|
8 blocks 132 260 250
|
|
16 blocks 66 113 116
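
The averaging procedure described above (six runs, drop the highest and the lowest, average the rest) is a trimmed mean. A minimal Python sketch follows; the sample values are invented for illustration and are not measurements from the message.

```python
def trimmed_mean(samples):
    """Drop the single highest and lowest value, then average what remains."""
    if len(samples) < 3:
        raise ValueError("need at least three samples to trim both ends")
    kept = sorted(samples)[1:-1]
    return sum(kept) / len(kept)


# Hypothetical aggregate reads/sec for six runs at one request size.
runs = [250, 262, 258, 300, 240, 261]
print(trimmed_mean(runs))
```

Each entry in `runs` stands for the sum of the per-process rates in one run, which is how the aggregate figure in the table is described.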
|
|
|
|
I'm not sure at all why performance gets so much *worse* with a lot of
|
|
contention on the single-block (8K) reads. This could have something to do with NetBSD, or
|
|
its buffer cache, or my laptop's crappy little disk drive....
|
|
|
|
Or maybe I'm just running out of CPU.
|
|
|
|
cjs
|
|
--
|
|
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
|
|
Don't you know, in this new Dark Age, we're all light. --XTC
|
|
|
|
|
|
From tgl@sss.pgh.pa.us Thu Apr 25 09:54:35 2002
|
|
Return-path: <tgl@sss.pgh.pa.us>
|
|
Received: from sss.pgh.pa.us (root@[192.204.191.242])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3PDsY407038
|
|
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 09:54:34 -0400 (EDT)
|
|
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
|
|
by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g3PDsXF25059;
|
|
Thu, 25 Apr 2002 09:54:33 -0400 (EDT)
|
|
To: Curt Sampson <cjs@cynic.net>
|
|
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
|
|
PostgreSQL-development <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
|
|
In-Reply-To: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
|
|
References: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
|
|
Comments: In-reply-to Curt Sampson <cjs@cynic.net>
|
|
message dated "Thu, 25 Apr 2002 16:28:51 +0900"
|
|
Date: Thu, 25 Apr 2002 09:54:32 -0400
|
|
Message-ID: <25056.1019742872@sss.pgh.pa.us>
|
|
From: Tom Lane <tgl@sss.pgh.pa.us>
|
|
Status: OR
|
|
|
|
Curt Sampson <cjs@cynic.net> writes:
|
|
> 1. Theoretical proof: two components of the delay in retrieving a
|
|
> block from disk are the disk arm movement and the wait for the
|
|
> right block to rotate under the head.
|
|
|
|
> When retrieving, say, eight adjacent blocks, these will be spread
|
|
> across no more than two cylinders (with luck, only one).
|
|
|
|
Weren't you contending earlier that with modern disk mechs you really
|
|
have no idea where the data is? You're asserting as an article of
|
|
faith that the OS has been able to place the file's data blocks
|
|
optimally --- or at least well enough to avoid unnecessary seeks.
|
|
But just a few days ago I was getting told that random_page_cost
|
|
was BS because there could be no such placement.
|
|
|
|
I'm getting a tad tired of sweeping generalizations offered without
|
|
proof, especially when they conflict.
|
|
|
|
> 3. Proof by testing. I wrote a little ruby program to seek to a
|
|
> random point in the first 2 GB of my raw disk partition and read
|
|
> 1-8 8K blocks of data. (This was done as one I/O request.) (Using
|
|
> the raw disk partition I avoid any filesystem buffering.)
|
|
|
|
And also ensure that you aren't testing the point at issue.
|
|
The point at issue is that *in the presence of kernel read-ahead*
|
|
it's quite unclear that there's any benefit to a larger request size.
|
|
Ideally the kernel will have the next block ready for you when you
|
|
ask, no matter what the request is.
|
|
|
|
There's been some talk of using the AIO interface (where available)
|
|
to "encourage" the kernel to do read-ahead. I don't foresee us
|
|
writing our own substitute filesystem to make this happen, however.
|
|
Oracle may have the manpower for that sort of boondoggle, but we
|
|
don't...
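
As an aside on hinting the kernel from user space: the message mentions AIO, which is not shown here. A different and simpler mechanism on POSIX systems is `posix_fadvise`, sketched below in Python. This is an illustration of a related technique under the assumption that the platform provides it; it is not something proposed in the thread, and the helper name is hypothetical.

```python
import os


def open_with_sequential_hint(path):
    """Open a file read-only and, where supported, advise the kernel of sequential access."""
    fd = os.open(path, os.O_RDONLY)
    if hasattr(os, "posix_fadvise"):  # not available on every platform
        # offset 0 and length 0 mean "the whole file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    return fd
```

The advice is only a hint; whether and how much the kernel changes its read-ahead in response is platform dependent.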
|
|
|
|
regards, tom lane
|
|
|
|
From pgsql-hackers-owner+M22053@postgresql.org Thu Apr 25 20:45:42 2002
|
|
Return-path: <pgsql-hackers-owner+M22053@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3Q0jg405210
|
|
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 20:45:42 -0400 (EDT)
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id 17CE6476270; Thu, 25 Apr 2002 20:45:38 -0400 (EDT)
|
|
Received: from doppelbock.patentinvestor.com (ip146.usw5.rb1.bel.nwlink.com [209.20.249.146])
|
|
by postgresql.org (Postfix) with ESMTP id 257DC47591C
|
|
for <pgsql-hackers@postgresql.org>; Thu, 25 Apr 2002 20:45:25 -0400 (EDT)
|
|
Received: (from kaf@localhost)
|
|
by doppelbock.patentinvestor.com (8.11.6/8.11.2) id g3Q0erX14397;
|
|
Thu, 25 Apr 2002 17:40:53 -0700
|
|
From: Kyle <kaf@nwlink.com>
|
|
MIME-Version: 1.0
|
|
Content-Type: text/plain; charset=us-ascii
|
|
Content-Transfer-Encoding: 7bit
|
|
Message-ID: <15560.41493.529847.635632@doppelbock.patentinvestor.com>
|
|
Date: Thu, 25 Apr 2002 17:40:53 -0700
|
|
To: PostgreSQL-development <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
|
|
In-Reply-To: <25056.1019742872@sss.pgh.pa.us>
|
|
References: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
|
|
<25056.1019742872@sss.pgh.pa.us>
|
|
X-Mailer: VM 6.95 under 21.1 (patch 14) "Cuyahoga Valley" XEmacs Lucid
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
Status: ORr
|
|
|
|
Tom Lane wrote:
|
|
> ...
|
|
> Curt Sampson <cjs@cynic.net> writes:
|
|
> > 3. Proof by testing. I wrote a little ruby program to seek to a
|
|
> > random point in the first 2 GB of my raw disk partition and read
|
|
> > 1-8 8K blocks of data. (This was done as one I/O request.) (Using
|
|
> > the raw disk partition I avoid any filesystem buffering.)
|
|
>
|
|
> And also ensure that you aren't testing the point at issue.
|
|
> The point at issue is that *in the presence of kernel read-ahead*
|
|
> it's quite unclear that there's any benefit to a larger request size.
|
|
> Ideally the kernel will have the next block ready for you when you
|
|
> ask, no matter what the request is.
|
|
> ...
|
|
|
|
I have to agree with Tom. I think the numbers below show that with
|
|
kernel read-ahead, block size isn't an issue.
|
|
|
|
The big_file1 file used below is 2.0 gig of random data, and the
|
|
machine has 512 mb of main memory. This ensures that we're not
|
|
just getting cached data.
|
|
|
|
foreach i (4k 8k 16k 32k 64k 128k)
|
|
echo $i
|
|
time dd bs=$i if=big_file1 of=/dev/null
|
|
end
|
|
|
|
and the results:
|
|
|
|
bs user kernel elapsed
|
|
4k: 0.260 7.740 1:27.25
|
|
8k: 0.210 8.060 1:30.48
|
|
16k: 0.090 7.790 1:30.88
|
|
32k: 0.060 8.090 1:32.75
|
|
64k: 0.030 8.190 1:29.11
|
|
128k: 0.070 9.830 1:28.74
|
|
|
|
so with kernel read-ahead, we have basically the same elapsed (wall
|
|
time) regardless of block size. Sure, user time drops to a low at 64k
|
|
blocksize, but kernel time is increasing.
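
For readers without csh and dd at hand, the same sequential-read comparison can be sketched in Python. The function name is hypothetical, and the file must be larger than main memory, as in the test above, for the timings to reflect disk reads rather than cached data.

```python
import time


def time_sequential_read(path, block_size):
    """Read a file front to back in block_size chunks; return (elapsed_seconds, bytes_read)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:  # unbuffered: one read call per chunk
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return time.perf_counter() - start, total
```

Looping over block sizes of 4k through 128k and calling this once per size mirrors the `foreach` loop in the message, although it reports only wall-clock time and not the user and kernel split that `time` gives.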
|
|
|
|
|
|
You could argue that this is a contrived example, no other I/O is
|
|
being done. Well I created a second 2.0g file (big_file2) and did two
|
|
simultaneous reads from the same disk. Sure performance went to hell
|
|
but it shows blocksize is still irrelevant in a multi I/O environment
|
|
with sequential read-ahead.
|
|
|
|
foreach i ( 4k 8k 16k 32k 64k 128k )
|
|
echo $i
|
|
time dd bs=$i if=big_file1 of=/dev/null &
|
|
time dd bs=$i if=big_file2 of=/dev/null &
|
|
wait
|
|
end
|
|
|
|
bs user kernel elapsed
|
|
4k: 0.480 8.290 6:34.13 bigfile1
|
|
0.320 8.730 6:34.33 bigfile2
|
|
8k: 0.250 7.580 6:31.75
|
|
0.180 8.450 6:31.88
|
|
16k: 0.150 8.390 6:32.47
|
|
0.100 7.900 6:32.55
|
|
32k: 0.190 8.460 6:24.72
|
|
0.060 8.410 6:24.73
|
|
64k: 0.060 9.350 6:25.05
|
|
0.150 9.240 6:25.13
|
|
128k: 0.090 10.610 6:33.14
|
|
0.110 11.320 6:33.31
|
|
|
|
|
|
the differences in read times are basically in the mud. Blocksize
|
|
just doesn't matter much with the kernel doing readahead.
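
The two-reader numbers can be summarized as combined throughput. This is a sketch under the same assumption that each file is 2048 MB; FINISH and aggregate are illustrative names, and the figures are only as good as that size assumption.

```python
FILE_MB = 2.0 * 1024  # assumption: each "2.0g" file is read as 2048 MB

# Elapsed times of the two concurrent dd processes, from the table above.
FINISH = {
    "4k": ("6:34.13", "6:34.33"),
    "8k": ("6:31.75", "6:31.88"),
    "16k": ("6:32.47", "6:32.55"),
    "32k": ("6:24.72", "6:24.73"),
    "64k": ("6:25.05", "6:25.13"),
    "128k": ("6:33.14", "6:33.31"),
}


def to_seconds(mmss):
    """Convert a time(1)-style 'M:SS.ss' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + float(seconds)


# Both files are done once the slower process finishes, so the combined
# rate is total data divided by the longer of the two elapsed times.
aggregate = {
    bs: 2 * FILE_MB / max(to_seconds(t) for t in pair)
    for bs, pair in FINISH.items()
}

for bs, rate in aggregate.items():
    print(f"{bs:>5}: {rate:5.1f} MB/s combined")
```

On these figures the disk delivers between 10 and 11 MB/s in total at every block size, less than half of what a single sequential reader achieved in the first test.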
|
|
|
|
-Kyle
|
|
|
|
|
|
|
|
From pgsql-hackers-owner+M22055@postgresql.org Thu Apr 25 22:19:07 2002
|
|
Return-path: <pgsql-hackers-owner+M22055@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3Q2J7411254
|
|
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 22:19:07 -0400 (EDT)
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id F3924476208; Thu, 25 Apr 2002 22:19:02 -0400 (EDT)
|
|
Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
|
|
by postgresql.org (Postfix) with ESMTP id 6741D474E71
|
|
for <pgsql-hackers@postgresql.org>; Thu, 25 Apr 2002 22:18:50 -0400 (EDT)
|
|
Received: (from pgman@localhost)
|
|
by candle.pha.pa.us (8.11.6/8.10.1) id g3Q2Ili11246;
|
|
Thu, 25 Apr 2002 22:18:47 -0400 (EDT)
|
|
From: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
Message-ID: <200204260218.g3Q2Ili11246@candle.pha.pa.us>
|
|
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
|
|
In-Reply-To: <15560.41493.529847.635632@doppelbock.patentinvestor.com>
|
|
To: Kyle <kaf@nwlink.com>
|
|
Date: Thu, 25 Apr 2002 22:18:47 -0400 (EDT)
|
|
cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
|
|
X-Mailer: ELM [version 2.4ME+ PL97 (25)]
|
|
MIME-Version: 1.0
|
|
Content-Transfer-Encoding: 7bit
|
|
Content-Type: text/plain; charset=US-ASCII
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
Status: OR
|
|
|
|
|
|
Nice test. Would you test simultaneous 'dd' on the same file, perhaps
|
|
with a slight delay between the two so they don't read each other's
|
|
blocks?
|
|
|
|
seek() in the file will turn off read-ahead in most OS's. I am not
|
|
saying this is a major issue for PostgreSQL but the numbers would be
|
|
interesting.
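
One way to look at the effect of seeking on read-ahead is to read the same blocks once in file order and once in shuffled order. The following is only a measurement harness, a sketch assuming a Unix-like system; timed_read and block_offsets are hypothetical helper names, and whether a given kernel actually disables read-ahead after a seek has to be observed, not assumed.

```python
import os
import random
import time


def block_offsets(path, block_size=8192):
    """Return the starting offset of every block in the file."""
    return list(range(0, os.path.getsize(path), block_size))


def timed_read(path, offsets, block_size=8192):
    """Seek to each offset and read one block.

    Returns (bytes_read, elapsed_seconds).
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        total = 0
        for offset in offsets:
            os.lseek(fd, offset, os.SEEK_SET)
            total += len(os.read(fd, block_size))
        return total, time.perf_counter() - start
    finally:
        os.close(fd)


def compare(path, block_size=8192):
    """Time a sequential pass and a shuffled pass over the same blocks."""
    sequential = block_offsets(path, block_size)
    shuffled = random.sample(sequential, len(sequential))
    return {
        "sequential": timed_read(path, sequential, block_size),
        "shuffled": timed_read(path, shuffled, block_size),
    }
```

For the comparison to mean anything, the file has to be much larger than main memory (as with the 2.0 gig file and 512 mb machine above); otherwise the second pass is served from the cache.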
|
|
|
|
|
|
---------------------------------------------------------------------------
|
|
|
|
Kyle wrote:
|
|
> Tom Lane wrote:
|
|
> > ...
|
|
> > Curt Sampson <cjs@cynic.net> writes:
|
|
> > > 3. Proof by testing. I wrote a little ruby program to seek to a
|
|
> > > random point in the first 2 GB of my raw disk partition and read
|
|
> > > 1-8 8K blocks of data. (This was done as one I/O request.) (Using
|
|
> > > the raw disk partition I avoid any filesystem buffering.)
|
|
> >
|
|
> > And also ensure that you aren't testing the point at issue.
|
|
> > The point at issue is that *in the presence of kernel read-ahead*
|
|
> > it's quite unclear that there's any benefit to a larger request size.
|
|
> > Ideally the kernel will have the next block ready for you when you
|
|
> > ask, no matter what the request is.
|
|
> > ...
|
|
>
|
|
> I have to agree with Tom. I think the numbers below show that with
|
|
> kernel read-ahead, block size isn't an issue.
|
|
>
|
|
> The big_file1 file used below is 2.0 gig of random data, and the
|
|
> machine has 512 mb of main memory. This ensures that we're not
|
|
> just getting cached data.
|
|
>
|
|
> foreach i (4k 8k 16k 32k 64k 128k)
>     echo $i
>     time dd bs=$i if=big_file1 of=/dev/null
> end
|
|
>
|
|
> and the results:
|
|
>
|
|
> bs      user   kernel   elapsed
> 4k:    0.260    7.740   1:27.25
> 8k:    0.210    8.060   1:30.48
> 16k:   0.090    7.790   1:30.88
> 32k:   0.060    8.090   1:32.75
> 64k:   0.030    8.190   1:29.11
> 128k:  0.070    9.830   1:28.74
|
|
>
|
|
> so with kernel read-ahead, we have basically the same elapsed (wall
|
|
> time) regardless of block size. Sure, user time drops to a low at 64k
|
|
> blocksize, but kernel time is increasing.
|
|
>
|
|
>
|
|
> You could argue that this is a contrived example, no other I/O is
|
|
> being done. Well I created a second 2.0g file (big_file2) and did two
|
|
> simultaneous reads from the same disk. Sure performance went to hell
|
|
> but it shows blocksize is still irrelevant in a multi I/O environment
|
|
> with sequential read-ahead.
|
|
>
|
|
> foreach i ( 4k 8k 16k 32k 64k 128k )
>     echo $i
>     time dd bs=$i if=big_file1 of=/dev/null &
>     time dd bs=$i if=big_file2 of=/dev/null &
>     wait
> end
|
|
>
|
|
> bs      user   kernel   elapsed
> 4k:    0.480    8.290   6:34.13  bigfile1
>        0.320    8.730   6:34.33  bigfile2
> 8k:    0.250    7.580   6:31.75
>        0.180    8.450   6:31.88
> 16k:   0.150    8.390   6:32.47
>        0.100    7.900   6:32.55
> 32k:   0.190    8.460   6:24.72
>        0.060    8.410   6:24.73
> 64k:   0.060    9.350   6:25.05
>        0.150    9.240   6:25.13
> 128k:  0.090   10.610   6:33.14
>        0.110   11.320   6:33.31
|
|
>
|
|
>
|
|
> the differences in read times are basically in the mud. Blocksize
|
|
> just doesn't matter much with the kernel doing readahead.
|
|
>
|
|
> -Kyle
|
|
>
|
|
|
|
>
|
|
|
|
--
|
|
Bruce Momjian | http://candle.pha.pa.us
|
|
pgman@candle.pha.pa.us | (610) 853-3000
|
|
+ If your life is a hard drive, | 830 Blythe Avenue
|
|
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
|
|
|
|
|
|
|
|
From cjs@cynic.net Thu Apr 25 22:27:23 2002
|
|
Return-path: <cjs@cynic.net>
|
|
Received: from angelic.cynic.net ([202.232.117.21])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3Q2RL411868
|
|
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 22:27:22 -0400 (EDT)
|
|
Received: from localhost (localhost [127.0.0.1])
|
|
by angelic.cynic.net (Postfix) with ESMTP
|
|
id AF60C870E; Fri, 26 Apr 2002 11:27:17 +0900 (JST)
|
|
Date: Fri, 26 Apr 2002 11:27:17 +0900 (JST)
|
|
From: Curt Sampson <cjs@cynic.net>
|
|
To: Tom Lane <tgl@sss.pgh.pa.us>
|
|
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
|
|
PostgreSQL-development <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
|
|
In-Reply-To: <25056.1019742872@sss.pgh.pa.us>
|
|
Message-ID: <Pine.NEB.4.43.0204261028110.449-100000@angelic.cynic.net>
|
|
MIME-Version: 1.0
|
|
Content-Type: TEXT/PLAIN; charset=US-ASCII
|
|
Status: OR
|
|
|
|
On Thu, 25 Apr 2002, Tom Lane wrote:
|
|
|
|
> Curt Sampson <cjs@cynic.net> writes:
|
|
> > 1. Theoretical proof: two components of the delay in retrieving a
|
|
> > block from disk are the disk arm movement and the wait for the
|
|
> > right block to rotate under the head.
|
|
>
|
|
> > When retrieving, say, eight adjacent blocks, these will be spread
|
|
> > across no more than two cylinders (with luck, only one).
|
|
>
|
|
> Weren't you contending earlier that with modern disk mechs you really
|
|
> have no idea where the data is?
|
|
|
|
No, that was someone else. I contend that with pretty much any
|
|
large-scale storage mechanism (i.e., anything beyond ramdisks),
|
|
you will find that accessing two adjacent blocks is almost always
|
|
1) close to as fast as accessing just the one, and 2) much, much
|
|
faster than accessing two blocks that are relatively far apart.
|
|
|
|
There will be the odd case where the two adjacent blocks are
|
|
physically far apart, but this is rare.
|
|
|
|
If this idea doesn't hold true, the whole idea that sequential
|
|
reads are faster than random reads falls apart, and the optimizer
|
|
shouldn't even have the option to make random reads cost more, much
|
|
less have it set to four rather than one (or whatever it's set to).
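
The role of that cost ratio can be illustrated with a toy model. This is not PostgreSQL's actual planner arithmetic, which accounts for much more than page fetches; it only assumes that a sequential page fetch costs 1 and a random page fetch costs 4, the value mentioned above. All names here are hypothetical.

```python
SEQ_PAGE_COST = 1.0      # assumed cost of fetching the next page in order
RANDOM_PAGE_COST = 4.0   # assumed cost of fetching a page out of order


def seq_scan_cost(total_pages):
    """Cost of reading every page of a table in order."""
    return total_pages * SEQ_PAGE_COST


def random_fetch_cost(pages_fetched):
    """Cost of fetching only the needed pages, each with a random read."""
    return pages_fetched * RANDOM_PAGE_COST


def break_even_fraction():
    """Fraction of the table at which both strategies cost the same."""
    return SEQ_PAGE_COST / RANDOM_PAGE_COST


total = 1000
for wanted in (100, 250, 400):
    cheaper = "random" if random_fetch_cost(wanted) < seq_scan_cost(total) else "sequential"
    print(f"{wanted:4d} of {total} pages: {cheaper} access is cheaper or equal")
```

In this model random access only pays off while fewer than a quarter of the pages are needed. If adjacent blocks were not cheaper to read than distant ones, the ratio would be 1 and the break-even point would move to the whole table.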
|
|
|
|
> You're asserting as an article of
|
|
> faith that the OS has been able to place the file's data blocks
|
|
> optimally --- or at least well enough to avoid unnecessary seeks.
|
|
|
|
So are you, in the optimizer. But that's all right; the OS often
|
|
can and does do this placement; the FFS filesystem is explicitly
|
|
designed to do this sort of thing. If the filesystem isn't empty
|
|
and the files grow a lot they'll be split into large fragments,
|
|
but the fragments will be contiguous.
|
|
|
|
> But just a few days ago I was getting told that random_page_cost
|
|
> was BS because there could be no such placement.
|
|
|
|
I've been arguing against that point as well.
|
|
|
|
> And also ensure that you aren't testing the point at issue.
|
|
> The point at issue is that *in the presence of kernel read-ahead*
|
|
> it's quite unclear that there's any benefit to a larger request size.
|
|
|
|
I will test this.
|
|
|
|
cjs
|
|
--
|
|
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
|
|
Don't you know, in this new Dark Age, we're all light. --XTC
|
|
|
|
|
|
From cjs@cynic.net Wed Apr 24 23:19:23 2002
|
|
Return-path: <cjs@cynic.net>
|
|
Received: from angelic.cynic.net ([202.232.117.21])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P3JM414917
|
|
for <pgman@candle.pha.pa.us>; Wed, 24 Apr 2002 23:19:22 -0400 (EDT)
|
|
Received: from localhost (localhost [127.0.0.1])
|
|
by angelic.cynic.net (Postfix) with ESMTP
|
|
id 1F36F870E; Thu, 25 Apr 2002 12:19:14 +0900 (JST)
|
|
Date: Thu, 25 Apr 2002 12:19:14 +0900 (JST)
|
|
From: Curt Sampson <cjs@cynic.net>
|
|
To: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
|
|
Subject: Re: Sequential Scan Read-Ahead
|
|
In-Reply-To: <200204250156.g3P1ufh05751@candle.pha.pa.us>
|
|
Message-ID: <Pine.NEB.4.43.0204251118040.445-100000@angelic.cynic.net>
|
|
MIME-Version: 1.0
|
|
Content-Type: TEXT/PLAIN; charset=US-ASCII
|
|
Status: OR
|
|
|
|
On Wed, 24 Apr 2002, Bruce Momjian wrote:
|
|
|
|
> > 1. Not all systems do readahead.
|
|
>
|
|
> If they don't, that isn't our problem. We expect it to be there, and if
|
|
> it isn't, the vendor/kernel is at fault.
|
|
|
|
It is your problem when another database kicks Postgres' ass
|
|
performance-wise.
|
|
|
|
And at that point, *you're* at fault. You're the one who's knowingly
|
|
decided to do things inefficiently.
|
|
|
|
Sorry if this sounds harsh, but this, "Oh, someone else is to blame"
|
|
attitude gets me steamed. It's one thing to say, "We don't support
|
|
this." That's fine; there are often good reasons for that. It's a
|
|
completely different thing to say, "It's an unrelated entity's fault we
|
|
don't support this."
|
|
|
|
At any rate, relying on the kernel to guess how to optimise for
|
|
the workload will never work as well as the software that
|
|
knows the workload doing the optimization.
|
|
|
|
The lack of support thing is no joke. Sure, lots of systems nowadays
|
|
support unified buffer cache and read-ahead. But how many, besides
|
|
Solaris, support free-behind, which is also very important to avoid
|
|
blowing out your buffer cache when doing sequential reads? And who
|
|
at all supports read-ahead for reverse scans? (Or does Postgres
|
|
not do those, anyway? I can see the support is there.)
|
|
|
|
And even when the facilities are there, you create problems by
|
|
using them. Look at the OS buffer cache, for example. Not only do
|
|
we lose efficiency by using two layers of caching, but (as people
|
|
have pointed out recently on the lists), the optimizer can't even
|
|
know how much or what is being cached, and thus can't make decisions
|
|
based on that.
|
|
|
|
> Yes, seek() in file will turn off read-ahead. Grabbing bigger chunks
|
|
> would help here, but if you have two people already reading from the
|
|
> same file, grabbing bigger chunks of the file may not be optimal.
|
|
|
|
Grabbing bigger chunks is always optimal, AFAICT, if they're not
|
|
*too* big and you use the data. A single 64K read takes very little
|
|
longer than a single 8K read.
|
|
|
|
> > 3. Even when the read-ahead does occur, you're still doing more
|
|
> > syscalls, and thus more expensive kernel/userland transitions, than
|
|
> > you have to.
|
|
>
|
|
> I would guess the performance impact is minimal.
|
|
|
|
If it were minimal, people wouldn't work so hard to build multi-level
|
|
thread systems, where multiple userland threads are scheduled on
|
|
top of kernel threads.
|
|
|
|
However, it does depend on how much CPU your particular application
|
|
is using. You may have it to spare.
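
The syscall argument is easy to quantify for a plain sequential pass. This sketch only counts read() calls for a 2 GB file, the size used in the dd tests elsewhere in this thread; it says nothing about what each call costs, which depends on the system. FILE_BYTES and read_calls are illustrative names.

```python
FILE_BYTES = 2 * 1024**3  # assumption: a 2 GB file, as in the dd tests

BLOCK_SIZES = {
    "4k": 4 * 1024,
    "8k": 8 * 1024,
    "16k": 16 * 1024,
    "32k": 32 * 1024,
    "64k": 64 * 1024,
    "128k": 128 * 1024,
}


def read_calls(file_bytes, block_size):
    """Number of read() calls needed to consume the file (ceiling division)."""
    return -(-file_bytes // block_size)


calls = {name: read_calls(FILE_BYTES, size) for name, size in BLOCK_SIZES.items()}

for name, count in calls.items():
    print(f"{name:>5}: {count:7d} read() calls")
```

Going from 8K to 64K requests cuts the number of kernel/userland transitions by a factor of eight, from 262144 to 32768. Whether that saving is visible depends on how much CPU the application has to spare.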
|
|
|
|
> http://candle.pha.pa.us/mhonarc/todo.detail/performance/msg00009.html
|
|
|
|
Well, this message has some points in it that I feel are just incorrect.
|
|
|
|
1. It is *not* true that you have no idea where data is when
|
|
using a storage array or other similar system. While you
|
|
certainly ought not worry about things such as head positions
|
|
and so on, it's been a given for a long, long time that two
|
|
blocks that have close index numbers are going to be close
|
|
together in physical storage.
|
|
|
|
2. Raw devices are quite standard across Unix systems (except
|
|
in the unfortunate case of Linux, which I think has been
|
|
remedied, hasn't it?). They're very portable, and have just as
|
|
well--if not better--defined write semantics as a filesystem.
|
|
|
|
3. My observations of OS performance tuning over the past six
|
|
or eight years contradict the statement, "There's a considerable
|
|
cost in complexity and code in using "raw" storage too, and
|
|
it's not a one off cost: as the technologies change, the "fast"
|
|
way to do things will change and the code will have to be
|
|
updated to match." While optimizations have been removed over
|
|
the years, the basic optimizations (order reads by block number,
|
|
do larger reads rather than smaller, cache the data) have
|
|
remained unchanged for a long, long time.
|
|
|
|
4. "Better to leave this to the OS vendor where possible, and
|
|
take advantage of the tuning they do." Well, sorry guys, but
|
|
have a look at the tuning they do. It hasn't changed in years,
|
|
except to remove now-unnecessary complexity related to really,
|
|
really old and slow disk devices, and to add a few things that
|
|
guess workload but still do a worse job than if the workload
|
|
generator just did its own optimisations in the first place.
|
|
|
|
> http://candle.pha.pa.us/mhonarc/todo.detail/optimizer/msg00011.html
|
|
|
|
Well, this one, with statements like "Postgres does have control
|
|
over its buffer cache," I don't know what to say. You can interpret
|
|
the statement however you like, but in the end Postgres has very little
|
|
control at all over how data is moved between memory and disk.
|
|
|
|
BTW, please don't take me as saying that all control over physical
|
|
IO should be done by Postgres. I just think that Postgres could do
|
|
a better job of managing data transfer between disk and memory than
|
|
the OS can. The rest of the things (using raw partitions, read-ahead,
|
|
free-behind, etc.) just drop out of that one idea.
|
|
|
|
cjs
|
|
--
|
|
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
|
|
Don't you know, in this new Dark Age, we're all light. --XTC
|
|
|
|
|
|
From kaf@nwlink.com Fri Apr 26 14:22:39 2002
|
|
Return-path: <kaf@nwlink.com>
|
|
Received: from doppelbock.patentinvestor.com (ip146.usw5.rb1.bel.nwlink.com [209.20.249.146])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3QIMc400783
|
|
for <pgman@candle.pha.pa.us>; Fri, 26 Apr 2002 14:22:38 -0400 (EDT)
|
|
Received: (from kaf@localhost)
|
|
by doppelbock.patentinvestor.com (8.11.6/8.11.2) id g3QII0l16824;
|
|
Fri, 26 Apr 2002 11:18:00 -0700
|
|
From: Kyle <kaf@nwlink.com>
|
|
MIME-Version: 1.0
|
|
Content-Type: text/plain; charset=us-ascii
|
|
Content-Transfer-Encoding: 7bit
|
|
Message-ID: <15561.39384.296503.501888@doppelbock.patentinvestor.com>
|
|
Date: Fri, 26 Apr 2002 11:18:00 -0700
|
|
To: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
|
|
In-Reply-To: <200204261444.g3QEiFh11090@candle.pha.pa.us>
|
|
References: <15561.26116.817541.950416@doppelbock.patentinvestor.com>
|
|
<200204261444.g3QEiFh11090@candle.pha.pa.us>
|
|
X-Mailer: VM 6.95 under 21.1 (patch 14) "Cuyahoga Valley" XEmacs Lucid
|
|
Status: ORr
|
|
|
|
Hey Bruce,
|
|
|
|
I'll forward this to the list if you think they'd benefit from it.
|
|
I'm not sure it says anything about read-ahead, I think this is more a
|
|
kernel caching issue. But I've been known to be wrong in the past.
|
|
Anyway...
|
|
|
|
|
|
the test:
|
|
|
|
foreach i (5 15 20 25 30 )
    echo $i
    time dd bs=8k if=big_file1 of=/dev/null &
    sleep $i
    time dd bs=8k if=big_file1 of=/dev/null &
    wait
end
|
|
|
|
I did a couple more runs in the low range since there is a drastic
|
|
jump in elapsed (wall clock) time after doing a 6 second sleep:
|
|
|
|
             first process              second process
sleep     user  kernel  elapsed      user  kernel  elapsed
 0 sec   0.200   7.980  1:26.57     0.240   7.720  1:26.56
 3 sec   0.260   7.600  1:25.71     0.260   8.100  1:22.60
 5 sec   0.160   7.890  1:26.04     0.220   8.180  1:21.04
 6 sec   0.220   8.070  1:19.59     0.230   7.620  1:25.69
 7 sec   0.210   9.270  1:57.92     0.100   8.750  1:50.76
 8 sec   0.240   8.060  4:47.47     0.300   7.800  4:40.40
15 sec   0.200   8.500  4:51.11     0.180   7.280  4:44.36
20 sec   0.160   8.040  4:40.72     0.240   7.790  4:37.24
25 sec   0.170   8.150  4:37.58     0.140   8.200  4:33.08
30 sec   0.200   7.390  4:37.01     0.230   8.220  4:31.83
|
|
|
|
|
|
|
|
with a sleep of > 6 seconds, either the second process isn't getting
|
|
cached data or readahead is being turned off. I'd guess the former, I
|
|
don't see why read-ahead would be turned off since they're both doing
|
|
sequential operations.
|
|
|
|
Although with 512mb of memory and the disk reading at about 22 mb/sec,
|
|
maybe we're not hitting the cache. I'd guess at least ~400 megs of
|
|
kernel cache is being used for buffering this 2 gig file. free(1)
|
|
reports:
|
|
|
|
% free
|
|
             total       used       free     shared    buffers     cached
Mem:        512924     508576       4348          0       2640     477960
-/+ buffers/cache:      27976     484948
Swap:       527152      15864     511288
|
|
|
|
so shouldn't we be getting cached data even with a sleep of up to
|
|
about (400/22) 18 seconds...? Maybe I'm just in the dark on what's
|
|
really happening. I should point out that this is linux 2.4.18.
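
The arithmetic behind that estimate, and what the observed cliff would imply, can be written out. This is a sketch assuming the cache simply holds the most recently read data and that the disk sustains the 22 mb/sec quoted above; the function name is illustrative, and the result is one reading of the numbers rather than an explanation of the kernel's behavior.

```python
CACHED_KB = 477960       # "cached" column from the free(1) output above
DISK_MB_PER_S = 22.0     # sequential read rate quoted in the message


def head_start_seconds(cache_mb, rate=DISK_MB_PER_S):
    """How long the first reader can run before it has read cache_mb of data."""
    return cache_mb / rate


# The rough estimate from the message: ~400 MB of usable cache.
estimate_400 = head_start_seconds(400)

# The same estimate using the full "cached" figure reported by free.
estimate_free = head_start_seconds(CACHED_KB / 1024)

# Working backwards: data read by the first process after a 6 and an 8
# second head start, the two delays on either side of the slowdown.
implied_mb = [seconds * DISK_MB_PER_S for seconds in (6, 8)]

print(f"400 MB of cache  -> {estimate_400:.1f} s head start")
print(f"cached per free  -> {estimate_free:.1f} s head start")
print(f"data read after 6 s and 8 s: {implied_mb} MB")
```

On these assumptions the slowdown would be expected at a delay of about 18 to 21 seconds. It actually appears once the first reader is only about 130 to 180 MB ahead, so something other than total cache size seems to be limiting how far apart the two readers can be.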
|
|
|
|
|
|
|
|
|
|
Bruce Momjian wrote:
|
|
>
|
|
> I am trying to illustrate how kernel read-ahead could be turned off in
|
|
> certain cases.
|
|
>
|
|
> ---------------------------------------------------------------------------
|
|
>
|
|
> Kyle wrote:
|
|
> > What are you trying to test, the kernel's cache vs disk speed?
|
|
> >
|
|
> >
|
|
> > Bruce Momjian wrote:
|
|
> > >
|
|
> > > Nice test. Would you test simultaneous 'dd' on the same file, perhaps
|
|
> > > with a slight delay between the two so they don't read each other's
|
|
> > > blocks?
|
|
> > >
|
|
> > > seek() in the file will turn off read-ahead in most OS's. I am not
|
|
> > > saying this is a major issue for PostgreSQL but the numbers would be
|
|
> > > interesting.
|
|
|