In the IBM Redbook, AIX Logical Volume Manager, from A to Z:
Introduction and Concepts, reference SG24-5432-00, Chapter 6,
Subsection 1 discusses on-line backups: breaking an active mirror,
backing up the now-consistent split-off copy, and then re-mirroring,
in order to achieve complete and consistent backups while keeping the
system available and usable for the large majority of the time.
This chapter goes on at length, and regularly emphasises that the
methods developed prior to the extra commands being available in AIX
4.3.3 are a "hack".
As such, we have reworked all our scripts to use the new
functionality (chfs -a splitcopy=<new fs name> -a copy=2 <old fs
name>), and it works fine... however, we discovered the following
"feature", which, having followed our normal support channels, was
written off as being the way it was designed. I'm sorry, but if this
is the way it was designed, then IBM wants to shoot the person who
designed this feature... I've no problem with the method itself, but
SURELY this is a bug!!!
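For reference, the core of our reworked script boils down to something
like the following sketch (the filesystem names, mount point, and tape
device are made up for illustration; the chfs attributes are as
described in the 4.3.3 documentation, and the backup/rmfs steps are
our own wrapping around them):

```shell
# Sketch of the 4.3.3-style on-line backup flow.
# /data, /backup and /dev/rmt0 are hypothetical names.
online_backup() {
    # Split copy 2 of the mirrored /data filesystem off as a new,
    # consistent filesystem mounted at /backup.
    chfs -a splitcopy=/backup -a copy=2 /data || return 1

    # Back up the frozen copy while /data stays in use.
    backup -f /dev/rmt0 -0 /backup || return 1

    # Remove the split-off filesystem so the copy can be
    # re-mirrored; it goes stale and resyncs in the meantime.
    umount /backup
    rmfs /backup
}
```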
I'd be very interested to hear comments from ANYONE about this... am I
being overly demanding? I mean, I have scripts in place that check my
errorlog and mail me the differences, so I see these entries very
quickly, and it bugs me... but maybe everyone else can live with it??
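In case it helps anyone, the errorlog check is nothing clever, just a
diff of errpt output against the previous run (paths and the mail
recipient here are placeholders, not our real setup):

```shell
# Sketch of an errorlog watcher: mail whatever errpt reports
# that wasn't there on the previous run. Paths are placeholders.
check_errlog() {
    STATE=/var/adm/errpt.last
    NEW=/tmp/errpt.$$
    errpt > "$NEW"
    touch "$STATE"
    # Anything new since the last snapshot gets mailed out.
    if ! diff "$STATE" "$NEW" > /tmp/errpt.diff.$$; then
        mail -s "errorlog changes" admin < /tmp/errpt.diff.$$
    fi
    mv "$NEW" "$STATE"
    rm -f /tmp/errpt.diff.$$
}
```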
Thanks for listening,
------begin fault description--------
One "feature" of using the on-line backup method is that the
split-off mirror which is being backed up becomes stale. This is
completely expected, as the whole purpose of using this method is to
be able to perform a backup of a system which requires as close as
possible to 24 hour processing. This implies that the two copies, once
split, will become out of sync.
What we see as a result of this "feature" is that the error log gets
many entries with label LVM_SA_STALEPP. These appear to be made up of:
one stale PP in the JFSLOG lv
one stale PP in the "split" lv
Sometimes, two entries are logged for the same stale partition. The
above is obviously a minimum. Given a particularly active LV, it
could be that during the time of the backup EVERY LP becomes stale,
and for a large LV (which it is likely to be, given that it is
commonly going to hold a database) there could be many thousands of
errorlog entries.
As an example, I have a filesystem consisting of 20 LPs. Having
split it in order to back it up, I filled the previously empty
filesystem and then emptied it again, and as a result I had 29
LVM_SA_STALEPP errorlog entries.
IBM's response to this report was to change the errorlog template to
not log STALEPP errors.
OK, so it wouldn't log these masses of errors, but it also wouldn't
log any real STALEPP errors, thus putting the customer at risk of not
noticing a failing disk until it had failed fatally.
Given that these errors are entirely expected, and that the
splitlvcopy functionality was added by IBM in response to people using
"the hack" to perform a similar function, the impression given by
these errors, and by IBM's response to the report of these problems,
is that their solution is almost as bad a "hack" as the original fix.
It would seem that the ideal solution is for the splitlvcopy routine
to mark the split-off copy in some way indicating that staleness there
is expected, so that any staleness in this case is not logged to the
errorlog.
I hope that explains the situation... for us it is not so vital
anymore, as we have added a routine to our script which, once the
on-line backup has completed successfully, removes the STALEPP errors
for the appropriate LVs from the errorlog, but this is not ideal.
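Concretely, the clean-up step looks something like this (the LV names
are placeholders; errclear's -J and -N flags select entries by error
label and resource name, and the trailing 0 clears matching entries
regardless of age):

```shell
# Sketch of the post-backup clean-up. lv00 and loglv00 are
# placeholder LV names, not our real ones.
clear_stale_errors() {
    for lv in lv00 loglv00; do
        # -J selects by error label, -N by resource name;
        # the 0 means "entries of any age".
        errclear -J LVM_SA_STALEPP -N "$lv" 0
    done
}
```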
----end fault description-------
Malcolm (recent 2-1-0 sav%86.96 GAA 3.95 - career 32-31-1 84.69% 6.45)
Goaltending is 90% mental, the other 10% is in your head (ICQ#8195978)
Hockey Results & Tables: http://homepages.tcp.co.uk/~sonic/hockey.html