SCSI BUS Target Resets

Post by racma » Wed, 14 Sep 2005 11:43:16



We have an AIX, Solaris and Red Hat/SUSE environment attached to an EMC
Symmetrix DMX 3000 SAN. We run all different OS versions, Veritas
Clusters and Veritas Volume Manager, with no PowerPath, just Veritas DMP
for multipathing. Some systems have two paths, some have one. Some systems
have the latest firmware and drivers, others don't. Our HBAs are JNI and
QLogic for Solaris and Emulex for our AIX/Linux. Brocade switches with
many FA ports, etc. We also have a clustered Windows 2000 box with
Emulex, and an Oracle RAC environment.

The problem is that since we migrated our servers from HP and Shark to
EMC, our systems either crash or spit out disk errors in groups almost
daily. Some servers crash a few times a day. All outages are random. EMC
hooked up an analyzer, which traced a SCSI target reset to the Windows
servers. When we shut down half of the Windows cluster, errors were
reported on some of our other Unix servers. This did not resolve the
issues. We feel there may be many resets coming from different servers.

The AIX systems are set to arbitrated loop, but IBM claims that although
the setting says arbitrated, it still tries point-to-point first. I
thought this may have been the issue. There was also talk of setting the
C bit and D bit on the disks. What about setting the max number of
commands queued to the HBA? How about the queue depth settings on the
disks set by the OS?

Some of our servers don't show errors, but you'll see I/O errors when
moving a file within the filesystems, and corruption after a planned
reboot. What could be causing these servers to crash like this?
 
 
 

Post by base6 » Wed, 14 Sep 2005 12:08:46


[snip]
> Also, there was also
> talk of setting the c bit and d bit on the disks. Also, setting max
> number of commands queued to the HBA? How about queue depth settings on
> the disks set by the OS? Some of our servers don't have errors but you'll
> see I/O errors when moving a file within the filesystems and corruption
> after a planned reboot. What could be causing these servers to crash
> like this?

Similar environment...

Take a snap next time one of the AIX systems drops and ask IBM
to look at it... they were pretty helpful in our case even
though it wasn't their hardware.

Have EMC verify their microcode level against the OS & patch level...
after a microcode upgrade to our EMC SANs, our problem somehow
vanished.

Probably just a coincidence :)

 
 
 

Post by Adrian Bridget » Wed, 14 Sep 2005 23:53:33


In my experience AIX tries point-to-point/switch first, but then falls
back to fcal.  Motto being "plug fibres in first".  Doing an lsattr -El
fscsiX should say "switch" rather than fcal.  To reset them you have to
rmdev the fscsiX device (and the fcsX device, I think) and then cfgmgr it
back in again.
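
For what it's worth, the check-and-reconfigure sequence above might look
roughly like this. Treat it as a sketch, not a tested procedure: fcs0 and
fscsi0 are placeholder names, the run helper and DRY_RUN variable are made
up for illustration, and by default it only prints the commands. The rmdev
step will normally fail if anything under the adapter is still in use, so
this is maintenance-window material.

```shell
#!/bin/sh
# Sketch: check how an AIX FC adapter attached to the fabric, then rebuild it.
# ASSUMPTIONS: fcs0/fscsi0 are placeholder device names; DRY_RUN and run()
# are illustrative helpers, not AIX features. Default is print-only.

DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

check_and_reset() {
    fcs=$1      # adapter, e.g. fcs0
    fscsi=$2    # protocol device on that adapter, e.g. fscsi0

    # Show the current attach type; "switch" is the one wanted on a fabric.
    run lsattr -El "$fscsi" -a attach

    # Unconfigure the adapter and its children (definitions are kept),
    # then let cfgmgr configure them again with the fibre plugged in.
    run rmdev -Rl "$fcs"
    run cfgmgr -l "$fcs"

    # Look at the attach type again after the reconfigure.
    run lsattr -El "$fscsi" -a attach
}
```

Calling check_and_reset fcs0 fscsi0 prints the four commands in order;
setting DRY_RUN=0 first would actually run them.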

I'd also check the zoning - we normally put all boxes in a cluster in a
zone (e.g. hostA, hostB, arrayA, arrayB) but I've heard some people go
even further to say only one initiator per zone. (e.g. hostA, array A,
array B; then hostB, array A, array B in another).
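
To make the one-initiator-per-zone idea concrete, here is a small sketch
that prints Brocade zonecreate lines, one zone per host HBA. All the alias
names (hostA_hba0, arrayA_fa and so on) are hypothetical and would have to
exist on the switch already; the function name is made up too.

```shell
#!/bin/sh
# Sketch: print Brocade zonecreate commands, one zone per host initiator.
# ASSUMPTIONS: every alias name below is hypothetical and would already
# have to be defined on the switch. Nothing here talks to a switch.

single_initiator_zones() {
    targets=$1          # semicolon-separated array port aliases
    shift
    for initiator in "$@"; do
        # Each zone holds exactly one initiator plus the array ports it uses.
        printf 'zonecreate "z_%s", "%s; %s"\n' "$initiator" "$initiator" "$targets"
    done
}

single_initiator_zones "arrayA_fa; arrayB_fa" hostA_hba0 hostB_hba0
```

The new zones would still have to be added to the zoning config (cfgadd)
and the config enabled (cfgenable) before they do anything. With one
initiator per zone, the hosts cannot see each other through the fabric.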

HTH

Adrian

 
 
 

Post by vlad... » Thu, 15 Sep 2005 20:33:46


The "attach" attribute is controlled by the "init_link" attribute of
the parent of fscsiX, the corresponding fcsX, so:

chdev -a init_link=pt2pt -P -l fcsX

and yes, zoning would be the first thing to look at.
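
If there are several adapters, the chdev above could be wrapped along
these lines. This is only a sketch: the adapter names are placeholders,
and run and DRY_RUN are illustrative helpers that default to printing the
commands. Note that -P only updates the ODM, so the running adapter keeps
its old setting until it is reconfigured or the box is rebooted.

```shell
#!/bin/sh
# Sketch: stage init_link=pt2pt on one or more AIX FC adapters.
# ASSUMPTIONS: adapter names are placeholders; DRY_RUN and run() are
# illustrative helpers, not AIX features. Default is print-only.

DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

stage_pt2pt() {
    for fcs in "$@"; do
        # Record the current value before changing anything.
        run lsattr -El "$fcs" -a init_link
        # -P writes the change to the ODM only; it is applied the next
        # time the adapter is configured.
        run chdev -a init_link=pt2pt -P -l "$fcs"
    done
}
```

stage_pt2pt fcs0 fcs1 prints an lsattr and a chdev line per adapter.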