2.4.x write barriers (updated for ext3)

Post by Chris Mason » Sat, 23 Feb 2002 08:40:08



Hi everyone,

I've changed the write barrier code around a little so the block layer
isn't forced to fail barrier requests the queue can't handle.

This makes it much easier to add support for ide writeback
flushing to things like ext3 and lvm, where I want to make
the minimal possible changes to make things safe.

The full patch is at:
ftp.suse.com/pub/people/mason/patches/2.4.18/queue-barrier-8.diff

There might be additional spots in ext3 where ordering needs to be
enforced; I've included the ext3 code below in hopes of getting
some comments.

The only other change was to make reiserfs use the IDE flushing mode
by default.  It falls back to non-ordered calls on scsi.

-chris

--- linus.23/fs/jbd/commit.c Mon, 28 Jan 2002 09:51:50 -0500

                struct buffer_head *bh = jh2bh(descriptor);
                clear_bit(BH_Dirty, &bh->b_state);
                bh->b_end_io = journal_end_buffer_io_sync;
+
+               /* if we're on an ide device, setting BH_Ordered_Flush
+                  will force a write cache flush before and after the
+                  commit block.  Otherwise, it'll do nothing.  */
+
+               set_bit(BH_Ordered_Flush, &bh->b_state);
                submit_bh(WRITE, bh);
+               clear_bit(BH_Ordered_Flush, &bh->b_state);
+
                wait_on_buffer(bh);
                put_bh(bh);             /* One for getblk() */
                journal_unlock_journal_head(descriptor);

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org

More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


2.4.x write barriers (updated for ext3)

Post by Stephen C. Tweedie » Sat, 23 Feb 2002 23:30:12


Hi,


> This makes it much easier to add support for ide writeback
> flushing to things like ext3 and lvm, where I want to make
> the minimal possible changes to make things safe.

Nice.

> There might be additional spots in ext3 where ordering needs to be
> enforced, I've included the ext3 code below in hopes of getting
> some comments.

No.  However, there is another optimisation which we can make.

Most ext3 commits, in practice, are lazy, asynchronous commits, and we
only need BH_Ordered_Tag for that, not *_Flush.  It would be easy
enough to track whether a given transaction has any synchronous
waiters, and if not, to use the async *_Tag request for the commit
block instead of forcing a flush.

We'd also have to track the sync status of the most recent
transaction, so that on fsync of a non-dirty file/inode, we make
sure that its data has been forced to disk by at least one synchronous
flush.

But that's really only a win for SCSI, where proper async ordered tags
are supported.  For IDE, the single BH_Ordered_Flush is quite
sufficient.

Cheers,
 Stephen

2.4.x write barriers (updated for ext3)

Post by Chris Mason » Sun, 24 Feb 2002 00:40:06



>> There might be additional spots in ext3 where ordering needs to be
>> enforced, I've included the ext3 code below in hopes of getting
>> some comments.

> No.  However, there is another optimisation which we can make.

> Most ext3 commits, in practice, are lazy, asynchronous commits, and we
> only need BH_Ordered_Tag for that, not *_Flush.  It would be easy
> enough to track whether a given transaction has any synchronous
> waiters, and if not, to use the async *_Tag request for the commit
> block instead of forcing a flush.

Just a note: the scsi code doesn't implement flush at all; flush
either gets ignored or failed (if BH_Ordered_Hard is set), the
assumption being that scsi devices don't write back by default, so
wait_on_buffer() is enough.

The reiserfs code tries to be smart with _Tag; in practice I haven't
found a device that gains from it, so I didn't want to make the larger
changes to ext3 until I was sure it was worthwhile ;-)

It seems the scsi drives don't do tag ordering as nicely as we'd
hoped; I'm hoping someone with a big raid controller can help
benchmark the ordered tag mode on scsi.  Also, check the barrier
threads from last week on how write errors might break the
ordering with the current scsi code.

-chris


2.4.x write barriers (updated for ext3)

Post by Stephen C. Tweedie » Sun, 24 Feb 2002 01:20:15


Hi,


> Finally, I think the driver ordering problem can be solved easily as long as
> an observation I have about your barrier is true.  It seems to me that the
> barrier is only semi-permeable, namely its purpose is to complete *after* a
> particular set of commands do.  This means that it doesn't matter if later
> commands move through the barrier, it only matters that earlier commands
> cannot move past it?

No.  A commit block must be fully ordered.  

If the commit block fails to be written, then we must be able to roll
the filesystem back to the consistent, pre-commit state, which implies
that any later IOs (which might be writeback IOs updating
now-committed metadata to final locations on disk) must not be allowed
to overtake the commit block.

However, in the current code, we don't assume that ordered queuing
works, so that later writeback will never be scheduled until we get a
positive completion acknowledgement for the commit block.  In other
words, right now, the scenario you describe is not a problem.

But ideally, with ordered queueing we would want to be able to relax
things by allowing writeback to be queued immediately the commit is
queued.  The ordered tag must be honoured in both directions in that
case.

There is a get-out for ext3 --- we can submit new journal IOs without
waiting for the commit IO to complete, but hold back on writeback IOs.
That still has the desired advantage of allowing us to stream to the
journal, but only requires that the commit block be ordered with
respect to older, not newer, IOs.  That gives us most of the benefits
of tagged queuing without any problems in your scenario.

--Stephen

2.4.x write barriers (updated for ext3)

Post by Chris Mason » Sun, 24 Feb 2002 01:20:11



[ very interesting stuff ]

> Finally, I think the driver ordering problem can be solved easily as long as
> an observation I have about your barrier is true.  It seems to me that the
> barrier is only semi-permeable, namely its purpose is to complete *after* a
> particular set of commands do.

This is my requirement for reiserfs, where I still want to wait on the
commit block to check for IO errors.  sct might have other plans.

> This means that it doesn't matter if later
> commands move through the barrier, it only matters that earlier commands
> cannot move past it?  If this is true, then we can fix the slot problem simply
> by having a slot dedicated to barrier tags, so the processing engine goes over
> it once per cycle.  However, if it finds the barrier slot full, it doesn't
> issue the command until the *next* cycle, thus ensuring that all commands sent
> down before the barrier (plus a few after) are accepted by the device queue
> before we send the barrier with its ordered tag.

Interesting, certainly sounds good.

-chris


2.4.x write barriers (updated for ext3)

Post by James Bottomley » Sun, 24 Feb 2002 02:40:10



> There is a get-out for ext3 --- we can submit new journal IOs without
> waiting for the commit IO to complete, but hold back on writeback IOs.
> That still has the desired advantage of allowing us to stream to the
> journal, but only requires that the commit block be ordered with
> respect to older, not newer, IOs.  That gives us most of the benefits
> of tagged queuing without any problems in your scenario.

Actually, I intended the tagged queueing discussion to be discouraging.  The
amount of work that would have to be done to implement it is huge, touching,
as it does, every low level driver's interrupt routine.  For the drivers that
require scripting changes to the chip engine, it's even worse: only someone
with specialised knowledge can actually make the changes.

It's feasible, but I think we'd have to demonstrate some quite significant
performance or other improvements before changes on this scale would fly.

Neither of you commented on the original suggestion.  What I was wondering is
if we could benchmark (or preferably improve on) it:


> The easy way out of the problem, I think, is to impose the barrier as
> an effective queue plug in the SCSI mid-layer, so that after the
> mid-layer receives the barrier, it plugs the device queue from below,
> drains the drive tag queue, sends the barrier and unplugs the device
> queue on barrier I/O completion.

If you need strict barrier ordering, then the queue is double plugged since
the barrier has to be sent down and waited for on its own.  If you allow the
discussed permeability, the queue is only single plugged since the barrier can
be sent down along with the subsequent writes.

I can take a look at implementing this in the SCSI mid-layer and you could see
what the benchmark figures look like with it in place.  If it really is the
performance pig it looks like, then we could go back to the linux-scsi list
with the tag change suggestions.

James


2.4.x write barriers (updated for ext3)

Post by Chris Mason » Sun, 24 Feb 2002 03:20:16




>> There is a get-out for ext3 --- we can submit new journal IOs without
>> waiting for the commit IO to complete, but hold back on writeback IOs.
>> That still has the desired advantage of allowing us to stream to the
>> journal, but only requires that the commit block be ordered with
>> respect to older, not newer, IOs.  That gives us most of the benefits
>> of tagged queuing without any problems in your scenario.

> Actually, I intended the tagged queueing discussion to be discouraging.  

;-)

> The
> amount of work that would have to be done to implement it is huge, touching,
> as it does, every low level driver's interrupt routine.  For the drivers that
> require scripting changes to the chip engine, it's even worse: only someone
> with specialised knowledge can actually make the changes.

> It's feasible, but I think we'd have to demonstrate some quite significant
> performance or other improvements before changes on this scale would fly.

Very true.  At best, we pick one card we know it could work on, and
one target that we know is smart about tags, and try to demonstrate
the improvement.

> Neither of you commented on the original suggestion.  What I was wondering is
> if we could benchmark (or preferably improve on) it:


>> The easy way out of the problem, I think, is to impose the barrier as
>> an effective queue plug in the SCSI mid-layer, so that after the
>> mid-layer receives the barrier, it plugs the device queue from below,
>> drains the drive tag queue, sends the barrier and unplugs the device
>> queue on barrier I/O completion.

The main way the barriers could help performance is by allowing the
drive to write all the transaction and commit blocks at once.  Your
idea increases the chance the drive heads will still be correctly
positioned to write the commit block, but doesn't let the drive
stream things better.

The big advantage to using wait_on_buffer() instead is that it doesn't
order against data writes at all (from bdflush, or some proc
other than a commit), allowing the drive to optimize those
at the same time it is writing the commit.  Using ordered tags has the
same problem; it might just be that wait_on_buffer is the best way to
go.

-chris


2.4.x write barriers (updated for ext3)

Post by Helge Hafting » Tue, 26 Feb 2002 20:10:09


[...]

> Unfortunately, there's actually a hole in the SCSI spec that means ordered
> tags are actually extremely difficult to use in the way you want (although I
> think this is an accident; conceptually, I think they were supposed to be used
> for this).  For the interested, I attach the details at the bottom.

[...]
> The SCSI tag system allows all devices to have a dynamic queue.  This means
> that there is no a priori guarantee about how many tags the device will accept
> before the queue becomes full.

I just wonder - isn't the number of outstanding requests a device
can handle constant?  If so, the user could determine this (from the spec or
by running a utility that generates "too much" traffic.)

The max number of requests may then be compiled in or added as
a kernel boot parameter.  The kernel would honor this and never ever
have more outstanding requests than it believes the device
can handle.  

Those who don't want to bother can use some low default or accept the
risk.

Helge Hafting

2.4.x write barriers (updated for ext3)

Post by James Bottomley » Wed, 27 Feb 2002 00:10:10



> I just wonder - isn't the number of outstanding requests a device can
> handle constant?  If so, the user could determine this (from the spec or
> by running a utility that generates "too much" traffic.)

The spec doesn't make any statements about this, so the devices are allowed to
do whatever seems best.  Although it is undoubtedly implemented as a fixed
queue on a few devices, there are others whose queue depth depends on the
available resources (most disk arrays function this way---they tend to juggle
tag queue depth dynamically per lun).

Even if the queue depth is fixed, you have to probe it dynamically because it
will be different for each device.  Even worse, on a SAN or other shared bus,
you might not be the only initiator using the device queue, so even for a
device with a fixed queue depth you don't own all the slots, and the queue
depth you see varies.

The bottom line is that you have to treat the queue full return as a normal
part of I/O flow control to SCSI devices.

James


2.4.x write barriers (updated for ext3)

Post by Chris Mason » Sat, 02 Mar 2002 01:50:12



> Doug Gilbert prompted me to re-examine my notions about SCSI drive caching,
> and sure enough the standard says (and all the drives I've looked at so far
> come with) write back caching enabled by default.

Really.  Has it always been this way?


> Since this is a threat to the integrity of Journalling FS in power failure
> situations now, I think it needs to be addressed with some urgency.

> The "quick fix" would obviously be to get the sd driver to do a mode select at
> probe time to turn off the WCE and RCD bits (this will place the cache into
> write through mode), which would match the assumptions all the JFSs currently
> make.  I'll see if I can code up a quick patch to do this.

Ok.


> A longer term solution might be to keep the writeback cache but send down a
> SYNCHRONIZE CACHE command as part of the back end completion of a barrier
> write, so the fs wouldn't get a completion until the write was done and all
> the dirty cache blocks flushed to the medium.

Right, they could just implement ORDERED_FLUSH in the barrier patch.


> Clearly, there would also have to be a mechanism to flush the cache on
> unmount, so if this were done by ioctl, would you prefer that the filesystem
> be in charge of flushing the cache on barrier writes, or would you like the sd
> device to do it transparently?

How about triggering it when the block device is closed?  That would also cover
people like Oracle that do stuff to the raw device.

-chris


2.4.x write barriers (updated for ext3)

Post by James Bottomley » Sat, 02 Mar 2002 01:50:15


Doug Gilbert prompted me to re-examine my notions about SCSI drive caching,
and sure enough the standard says (and all the drives I've looked at so far
come with) write-back caching enabled by default.

Since this is a threat to the integrity of Journalling FS in power failure
situations now, I think it needs to be addressed with some urgency.

The "quick fix" would obviously be to get the sd driver to do a mode select at
probe time to turn off the WCE and RCD bits (this will place the cache into
write through mode), which would match the assumptions all the JFSs currently
make.  I'll see if I can code up a quick patch to do this.

A longer term solution might be to keep the writeback cache but send down a
SYNCHRONIZE CACHE command as part of the back end completion of a barrier
write, so the fs wouldn't get a completion until the write was done and all
the dirty cache blocks flushed to the medium.

Clearly, there would also have to be a mechanism to flush the cache on
unmount, so if this were done by ioctl, would you prefer that the filesystem
be in charge of flushing the cache on barrier writes, or would you like the sd
device to do it transparently?

James


2.4.x write barriers (updated for ext3)

Post by Chris Mason » Sat, 02 Mar 2002 04:40:09




>> A longer term solution might be to keep the writeback cache but send down a
>> SYNCHRONIZE CACHE command as part of the back end completion of a barrier
>> write, so the fs wouldn't get a completion until the write was done and all
>> the dirty cache blocks flushed to the medium.

> Right, they could just implement ORDERED_FLUSH in the barrier patch.

So, a little testing with scsi_info shows my scsi drives do have
writeback cache on.  Great.  What's interesting is they
must be doing additional work for ordered tags.  If they were treating
the block as written once in cache, using the tags should not change
performance at all.  But I can clearly show the tags changing
performance, and hear the drive write pattern change when tags are on.

-chris


2.4.x write barriers (updated for ext3)

Post by Mike Anderson » Sat, 02 Mar 2002 04:40:14



> ..snip..

> > Clearly, there would also have to be a mechanism to flush the cache on
> > unmount, so if this were done by ioctl, would you prefer that the filesystem
> > be in charge of flushing the cache on barrier writes, or would you like the sd
> > device to do it transparently?

> How about triggered by closing the block device.  That would also cover
> people like oracle that do stuff to the raw device.

> -chris

Doing something in sd_release should cover the raw case:
raw_release -> blkdev_put -> bdev->bd_op->release ("sd_release").

At least, that's my understanding of the raw release call path :-).
-Mike
--
Michael Anderson


2.4.x write barriers (updated for ext3)

Post by James Bottomley » Sat, 02 Mar 2002 12:20:16



> So, a little testing with scsi_info shows my scsi drives do have
> writeback cache on.  Great.  What's interesting is they must be doing
> additional work for ordered tags.  If they were treating the block as
> written once in cache, using the tags should not change performance
> at all.  But I can clearly show the tags changing performance, and
> hear the drive write pattern change when tags are on.

I checked all mine and they're write through.  However, I inherited all my
drives from an enterprise vendor so this might not be that surprising.

I can surmise why ordered tags kill performance on your drive: since an
ordered tag is required to affect the ordering of the write to the medium, not
the cache, it is probably implemented with an implicit cache flush.

Anyway, the attached patch against 2.4.18 (and I know it's rather gross code)
will probe the cache type and try to set it to write through on boot.  See
what this does to your performance ordinarily, and also to your tagged write
barrier performance.

James

  sd-cache.diff
3K Download