TOAST performance (was Re: [GENERAL] Delete Performance)

TOAST performance (was Re: [GENERAL] Delete Performance)

Post by Tom La » Sun, 18 Nov 2001 10:16:09




Quote:> [ complains that deletes are slow in table containing toasted data ]

I did some experimentation here and found a rather surprising
dependency: the time to delete a bunch of data is pretty much
directly proportional to the disk space it occupies.  This says
that we're paying through the nose for having XLOG make copies
of about-to-be-modified pages.

I did:

create table foo (f1 text);
insert into foo values ('a short entry');
insert into foo select * from foo;
-- repeat enough times to build up 32K rows total
delete from foo;

The "delete" took about 2 seconds.  I then did it over with the
'value' being a 5K chunk of text, which according to octet_length
got compressed to 3900 bytes.  (This'd require two rows in the TOAST
table.)  This time the delete took 127 seconds.  I was expecting
about a 3X penalty since we needed to delete three rows not one,
but what I got was a 60X penalty.

Trying to understand this, I did some profiling and found that most
of the time was going into XLogInsert and XLOG I/O.  That's when I
remembered that the actual data volume involved is considerably
different in the two cases.  Allowing for tuple header overhead and
so forth, the small-data case involves about 1.8MB, the large-data
case about 131MB, or about 70 times as much data.

I believe this indicates that what's determining the runtime is the fact
that the XLOG code writes out an image of each page modified in the
transaction.  These page images will be the bulk of the XLOG traffic
for the TOAST table (since there are only four or so tuples on each
TOAST page, the actual XLOG delete records take little space by
comparison).

I've worried for some time that the decision to XLOG page images was
costing us a lot more performance than could be justified...

One trick we could perhaps pull is to postpone deletion of TOAST tuples
until VACUUM, so that the bulk of the work is done in a noncritical path
(from the point of view of the application anyway).  I'm not sure how
this interacts with the way that we re-use a TOAST entry when other
fields in the row are updated, however.  It might be too difficult for
VACUUM to tell when to delete a TOAST item.

                        regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

 
 
 

TOAST performance (was Re: [GENERAL] Delete Performance)

Post by Bruce Momji » Sun, 18 Nov 2001 11:21:24


Quote:> The "delete" took about 2 seconds.  I then did it over with the
> 'value' being a 5K chunk of text, which according to octet_length
> got compressed to 3900 bytes.  (This'd require two rows in the TOAST
> table.)  This time the delete took 127 seconds.  I was expecting
> about a 3X penalty since we needed to delete three rows not one,
> but what I got was a 60X penalty.

Wow.  Can someone remind me why we take page images on delete?  We
aren't really writing anything special to the page except a transction
id.

Quote:> I've worried for some time that the decision to XLOG page images was
> costing us a lot more performance than could be justified...

Is it because we take a snapshot of the page before we write it in case
we only write part of the page?

--
  Bruce Momjian                        |  http://candle.pha.pa.us

  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

 
 
 

TOAST performance (was Re: [GENERAL] Delete Performance)

Post by Tom La » Sun, 18 Nov 2001 11:22:32



> Is it because we take a snapshot of the page before we write it in case
> we only write part of the page?

AFAIR, the partial-page-write problem is the entire reason for doing it.
If we could be certain that writes to datafile pages were atomic, we'd
not need this.

Of course we can't be certain of that.  But I'm wondering if there isn't
a cheaper solution.

                        regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate

message can get through to the mailing list cleanly

 
 
 

TOAST performance (was Re: [GENERAL] Delete Performance)

Post by Tom La » Sun, 18 Nov 2001 11:34:01



> Could we add code to detect a partial write when we recover from one
> using WAL so we can know if these partial writes are ever
> happening?

What's your point?  It clearly *can* happen during power-failure
scenarios.  All the monitoring in the world won't disprove that.

                        regards, tom lane

---------------------------(end of broadcast)---------------------------

 
 
 

TOAST performance (was Re: [GENERAL] Delete Performance)

Post by Bruce Momji » Sun, 18 Nov 2001 14:16:30



> > Seriously, how do OS's handle partial page write, especially to
> > directories?

I realize UPDATE also requires pre-page writes for the old tuples.  What
bothers me is that unlike INSERT and UPDATE of new rows, DELETE and
UPDATE of old rows is not writing new data but just setting transaction
ID's.  I wish there was a way to store those XID's somewhere else so the
page wouldn't have to be modified.

--
  Bruce Momjian                        |  http://candle.pha.pa.us

  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate

message can get through to the mailing list cleanly

 
 
 

TOAST performance (was Re: [GENERAL] Delete Performance)

Post by Bruce Momji » Sun, 18 Nov 2001 14:26:21



> > Seriously, how do OS's handle partial page write, especially to
> > directories?

> ... fsck ...

But how can it handle partial writes to a directory when many files
exist in that single block?

--
  Bruce Momjian                        |  http://candle.pha.pa.us

  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command

 
 
 

TOAST performance (was Re: [GENERAL] Delete Performance)

Post by Tom La » Sun, 18 Nov 2001 14:37:15



> Seriously, how do OS's handle partial page write, especially to
> directories?

... fsck ...

                        regards, tom lane

---------------------------(end of broadcast)---------------------------

 
 
 

TOAST performance (was Re: [GENERAL] Delete Performance)

Post by Josh Rove » Wed, 21 Nov 2001 03:49:06


Tom Lane wrote

Quote:

>I did some experimentation here and found a rather surprising
>dependency: the time to delete a bunch of data is pretty much
>directly proportional to the disk space it occupies.  This says
>that we're paying through the nose for having XLOG make copies
>of about-to-be-modified pages.

At least now I know I wasn't imagining things....            :-)

Which brings up the question, what is the best way to deal with many
thousands of variable-length binary chunks.  Net input == net output
over the course of a day.   The new vacuum should help (both lo_ and
toasted tables take a long time to vacuum full), but I'm running into
the "Hotel California" situation.  Data goes in fast, but can't be
deleted fast enough to keep the database from continuously growing
in size.

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

 
 
 

TOAST performance (was Re: [GENERAL] Delete Performance)

Post by Hannu Krosi » Wed, 21 Nov 2001 09:38:55



> Tom Lane wrote

>> I did some experimentation here and found a rather surprising
>> dependency: the time to delete a bunch of data is pretty much
>> directly proportional to the disk space it occupies.  This says
>> that we're paying through the nose for having XLOG make copies
>> of about-to-be-modified pages.

Can't we somehow WAL only metadata and not the actual pages for
DELETEs - as delete is essentially (though currently not technically)
just metadata it should be a possible thing to do.

Quote:>> ------------------

Hannu

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

 
 
 

TOAST performance (was Re: [GENERAL] Delete Performance)

Post by Hannu Krosi » Fri, 23 Nov 2001 19:02:29




> > What bothers me about this is that we have the original page with the
> > old data.  It would be nice if we could write the new page in a
> > different location, make the new page active and recycle the old page at
> > some later time.

> I don't see how that reduces the total amount of disk traffic?

> It's also kind of unclear how to do it without doubling (or worse) the
> amount of table space used in many common scenarios.  I doubt many
> people will be happy if "DELETE FROM foo" requires transient space equal
> to twice the original size of foo.

IIRC the double space requrement is what has kept us from implementing
DROP COLUMN.

-----------
Hannu

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

 
 
 

TOAST performance (was Re: [GENERAL] Delete Performance)

Post by Hannu Krosi » Fri, 23 Nov 2001 19:03:14




> >> I don't see how that reduces the total amount of disk traffic?

> > Well, right now we write the pre-image to WAL, then write the new page
> > over the old one.  In my case, you just write the new, and somewhere
> > record that the old page is no longer active.

> The devil is in the details of that last little bit.  How is "mark a
> page inactive" cheaper than "mark a tuple dead"?  More specifically,
> how do you propose to avoid WAL-logging the page you are going to do
> this marking in?  Seems you still end up with a WAL page image for
> something.

Assuming that we WAL with the granularity of disk sector (512b) I think
that restructuring of database heap page (8kb) would be a big win for
delete/update.

The idea is to move metadata (oid,tableoid,xmin,cmin,xmax,cmax,ctid)
to the beginning of heap page to the same space with tuple pointers.
It's easy (<grin>) as all of it is fixed length.
Then a change in metadata like setting xmax for deleted/updated tuple
will dirty only the first disk page and not all of them.

The new structure of ItemId will be (
 itemId-pointer nbits
 itemId-flags   32-n bits
 oid,
 tableoid,
 xmin,
 cmin,
 xmax,
 cmax,
 ctid
)

Assuming that we do account of dirty pages and WAL with the granularity
of database page we may get a big win by just moving to smaller
gramularity.

The win from increasing cranularity was not very big before WAL, as the
database pages are continuous on disk, but will be significant when we
have to log all dirty pages.

------------------
Hannu

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate

message can get through to the mailing list cleanly

 
 
 

TOAST performance (was Re: [GENERAL] Delete Performance)

Post by Zeugswetter Andreas SB S » Fri, 23 Nov 2001 20:41:49


Quote:> With DELETE FROM foo, let's suppose you have 10 pages in the table.
To
> modify page 1, you write to page 11

But what with the indexes ? They would all need to be modified
accordingly.
If you did something like chaining, then before long all tuples would be

chained, even those that were not touched.

If you really want to avoid the page writes to WAL, imho the best way
would be
to revive the original PG page design where the physical position of
slots in a
heap page where only changed by vacuum.

Then, a heap page that was only partly written would only be a problem
iff
the hardware wrote wrong data, not if it only skipped part of the write.

Reasonable hardware does detect such corrupted pages.
E.g. on AIX if you reduce the PG pagesize to 4k, an only partly written
page
that stays undetected can be ruled out.

Then you would only need to write index pages to WAL, but not heap
pages.

Maybe a better idea would be to only conditionally write pages to WAL if
slot
positions changed. In the "delete" example heap slot positions certainly
do
not need to change.
To be extra safe it would probably be necessary to not split tuple
headers
(at least the xact info) across physical pages. Then it would also be
safe to
use a pg pagesize that is a multiple of the physical page size.

or so ? ...
Andreas

---------------------------(end of broadcast)---------------------------

 
 
 

TOAST performance (was Re: [GENERAL] Delete Performance)

Post by Bruce Momji » Sat, 24 Nov 2001 11:02:04


Quote:> > > It's also kind of unclear how to do it without doubling (or worse) the
> > > amount of table space used in many common scenarios.  I doubt many
> > > people will be happy if "DELETE FROM foo" requires transient space equal
> > > to twice the original size of foo.

> > IIRC the double space requrement is what has kept us from implementing
> > DROP COLUMN.

> The correct solution then, according methinks to my old Human Computer
> Interaction lecturer, is to implement the feature anyway, and warn the DBA
> what the consequences are.  That way, the DBA can do it if she wants, unlike
> the current situation where it's next to impossible (with lots of
> referencing foreign keys).

Yes, I personally am going to try this for 7.3, as well as fix CLUSTER.
I think someone has already started on CLUSTER anyway.

--
  Bruce Momjian                        |  http://candle.pha.pa.us

  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

 
 
 

TOAST performance (was Re: [GENERAL] Delete Performance)

Post by Christopher Kings-Lynn » Sat, 24 Nov 2001 11:02:34


Quote:> > It's also kind of unclear how to do it without doubling (or worse) the
> > amount of table space used in many common scenarios.  I doubt many
> > people will be happy if "DELETE FROM foo" requires transient space equal
> > to twice the original size of foo.

> IIRC the double space requrement is what has kept us from implementing
> DROP COLUMN.

The correct solution then, according methinks to my old Human Computer
Interaction lecturer, is to implement the feature anyway, and warn the DBA
what the consequences are.  That way, the DBA can do it if she wants, unlike
the current situation where it's next to impossible (with lots of
referencing foreign keys).

Chris

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

 
 
 

1. TOAST performance (was Re: [GENERAL] Delete

Is it possible to do ordered writes, the way ext3 does?

http://www-106.ibm.com/developerworks/linux/library/l-fs7/

Is an interesting article discussing the approach.

Regards,
                                        Andrew.
--
--------------------------------------------------------------------

WEB: http://catalyst.net.nz/        PHYS: Level 2, 150-154 Willis St
DDI: +64(4)916-7201    MOB: +64(21)635-694    OFFICE: +64(4)499-2267

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

2. JFS vs. RLV on RS6000/AIX 3.2.4

3. SQL Server Performance Problem - Good query performance, bad update performance

4. Initiating a procedure from ODBC

5. I am using performance monitor...

6. US-TX-ORACLE 11I APPLICATIONS PROGRAMMER

7. Am I Improving performance?

8. Query performance issue - What Am I doing wrong?

9. joins and performances : general question

10. Very general performance question about Servlets

11. last latch ratio (redo copy) and general performance advice

12. joins and performances : general question