Has anyone here had problems with HSG80 controllers creating new dubious
connections and then the server not being able to connect to its disks?
ES40 running Tru64 5.1
> ES40 running Tru64 5.1
SCSI chains can be * to debug; a SAN can be even nastier. Without
a few hints nobody will even want to guess.
Florian
--
Theodorus will be pleased at my death,
And someone else will be pleased at the death of Theodorus,
And yet everyone speaks evil of death.
-- Ezra Pound, "Homage to Quintus Septimus Florens Christianus"
>>Has anyone here had problems with HSG80 controllers creating new dubious
>>connections and then the server not being able to connect to its disks?
>>ES40 running Tru64 5.1
> Could you fill us in on
> * The HSG80 firmware version
> * KGPSA firmware version
KGPSA-CA : Driver Rev 1.29 : F/W Rev 3.03X2(1.11) : wwn 1000-0000-c924-fb41
> * Loop or Fabric?
Fabric
> * if Fabric: switch firmware version
Fabric OS: v2.1.9a1 - we were told after this problem it should be
2.1.9g or m
> * if Fabric: do you do zoning?
No
> * How many servers in your SAN
2
> * How many RAID boxes in your SAN?
1 - 2 HSG80, 2 FC, 3 shelves
> * What did you do to cause the controller to create new connections?
Nothing your honour !!
> * How did you determine the server could not see its disks?
When I came in at 0800AM the server had error message on blue >>> screen
Message similar to - cannot connect to dga2001.1.0.1
When show conn was first run the !NEWCON29 & !NEWCON30 did not exist
Second time these were Online & BM1LA & BM1LB were Offline
BMXHG01U>show conn
Connection                                                        Unit
Name       Operating system  Controller  Port  Address  Status    Offset
!NEWCON25  TRU64_UNIX        THIS        1              offline   0
           HOST_ID=2000-0000-C924-FB41  ADAPTER_ID=1000-0000-C924-FBC1
!NEWCON26  TRU64_UNIX        OTHER       1              offline   0
           HOST_ID=2000-0000-C924-FB41  ADAPTER_ID=1000-0000-C924-FBC1
!NEWCON27  TRU64_UNIX        OTHER       1              offline   0
           HOST_ID=2000-0000-C924-FB41  ADAPTER_ID=1000-0080-C924-FB41
!NEWCON28  TRU64_UNIX        THIS        1              offline   0
           HOST_ID=2000-0000-C924-FB41  ADAPTER_ID=1000-0080-C924-FB41
!NEWCON29  TRU64_UNIX        THIS        1              offline   0
           HOST_ID=2000-0000-C924-FBC1  ADAPTER_ID=1000-0000-C924-FBC1
!NEWCON30  TRU64_UNIX        OTHER       1              offline   0
           HOST_ID=2000-0000-C924-FBC1  ADAPTER_ID=1000-0000-C924-FBC1
BM1LA      TRU64_UNIX        THIS        1     011200   OL this   0
           HOST_ID=2000-0000-C924-FB41  ADAPTER_ID=1000-0000-C924-FB41
BM1LB      TRU64_UNIX        OTHER       1     011200   OL other  0
           HOST_ID=2000-0000-C924-FB41  ADAPTER_ID=1000-0000-C924-FB41
BM1UA      TRU64_UNIX        THIS        2     011200   OL this   0
           HOST_ID=2000-0000-C924-F95A  ADAPTER_ID=1000-0000-C924-F95A
BM1UB      TRU64_UNIX        OTHER       2     011200   OL other  0
           HOST_ID=2000-0000-C924-F95A  ADAPTER_ID=1000-0000-C924-F95A
BM2LA      TRU64_UNIX        THIS        2     011300   OL this   0
           HOST_ID=2000-0000-C924-FB2F  ADAPTER_ID=1000-0000-C924-FB2F
BM2LB      TRU64_UNIX        OTHER       2     011300   OL other  0
           HOST_ID=2000-0000-C924-FB2F  ADAPTER_ID=1000-0000-C924-FB2F
BM2RA      TRU64_UNIX        THIS        1     011300   OL this   0
           HOST_ID=2000-0000-C925-6BEA  ADAPTER_ID=1000-0000-C925-6BEA
BM2RB      TRU64_UNIX        OTHER       1     011300   OL other  0
           HOST_ID=2000-0000-C925-6BEA  ADAPTER_ID=1000-0000-C925-6BEA
BMXHG01U>
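One way to pick the odd entries out of a listing like this is to compare each connection's HOST_ID with its ADAPTER_ID. The following is only a rough Python sketch. It assumes (this is a guess based on the named connections in the listing, not something stated in the thread) that on a healthy adapter the two WWNs differ only in the leading group, and that the adapter WWNs of the named connections can be trusted. The function names and the cut-down sample are made up for illustration.

```python
import re

# Cut-down sample of the listing (whitespace simplified, four entries only).
SHOW_CONN = """\
!NEWCON25 TRU64_UNIX THIS 1 offline 0
HOST_ID=2000-0000-C924-FB41 ADAPTER_ID=1000-0000-C924-FBC1
!NEWCON27 TRU64_UNIX OTHER 1 offline 0
HOST_ID=2000-0000-C924-FB41 ADAPTER_ID=1000-0080-C924-FB41
!NEWCON29 TRU64_UNIX THIS 1 offline 0
HOST_ID=2000-0000-C924-FBC1 ADAPTER_ID=1000-0000-C924-FBC1
BM1LA TRU64_UNIX THIS 1 011200 OL this 0
HOST_ID=2000-0000-C924-FB41 ADAPTER_ID=1000-0000-C924-FB41
"""

ID_LINE = re.compile(r"HOST_ID=([0-9A-Fa-f-]+)\s+ADAPTER_ID=([0-9A-Fa-f-]+)")


def parse_connections(text):
    """Return a list of (name, host_id, adapter_id) tuples."""
    conns = []
    name = None
    for line in text.splitlines():
        match = ID_LINE.search(line)
        if match and name:
            conns.append((name, match.group(1).upper(), match.group(2).upper()))
            name = None
        elif line.strip():
            # The connection name is the first word of the line before the IDs.
            name = line.split()[0]
    return conns


def flag_suspects(conns):
    """Flag connections whose IDs disagree or whose adapter WWN is unknown."""
    # Assumption: adapter WWNs of named (non-!NEWCON) connections are good.
    known = {adapter for name, _, adapter in conns
             if not name.startswith("!NEWCON")}
    suspects = []
    for name, host, adapter in conns:
        # Compare everything after the leading 2000-/1000- group.
        mismatch = host.split("-")[1:] != adapter.split("-")[1:]
        if mismatch or adapter not in known:
            suspects.append(name)
    return suspects


if __name__ == "__main__":
    for name in flag_suspects(parse_connections(SHOW_CONN)):
        print(name)
```

On the sample this flags the three !NEWCON entries and leaves BM1LA alone. The second check matters because !NEWCON29 has matching IDs, both of which end in FBC1.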
Current problem is with BM1 (ES40). 5 weeks ago we had same problem with
BM2 (DS20), showing 2 !NEWCON99 connections. At the same time BM1 (ES40)
server went down and had PCI motherboard changed.
> SCSI chains can be * to debug; a SAN can be even nastier. Without
> a few hints nobody will even want to guess.
Regards
Colin Bull (Using Sue's Email address !!)
>> ES40 running Tru64 5.1
>Could you fill us in on
>* The HSG80 firmware version
>* KGPSA firmware version
>* Loop or Fabric?
>* if Fabric: switch firmware version
>* if Fabric: do you do zoning?
>* How many servers in your SAN
>* How many RAID boxes in your SAN?
>* What did you do to cause the controller to create new connections?
>* How did you determine the server could not see its disks?
>SCSI chains can be * to debug; a SAN can be even nastier. Without
>a few hints nobody will even want to guess.
Wilko
--
Wilko Bulte  Arnhem, The Netherlands
>>>Has anyone here had problems with HSG80 controllers creating new dubious
>>>connections and then the server not being able to connect to its disks?
>>>ES40 running Tru64 5.1
>> Could you fill us in on
>> * The HSG80 firmware version
>BMXHG01U>show this
>Controller:
> HSG80 ZG10800585 Software V85F-0, Hardware E12
> NODE_ID = 5000-1FE1-0010-A9B0
> ALLOCATION_CLASS = 0
> SCSI_VERSION = SCSI-3
> Configured for MULTIBUS_FAILOVER with ZG91606315
> In dual-redundant configuration
> Device Port SCSI address 7
> Time: NOT SET
> Command Console LUN is lun 0 (IDENTIFIER = 9999)
>> * KGPSA firmware version
>KGPSA-CA : Driver Rev 1.29 : F/W Rev 3.03X2(1.11) : wwn 1000-0000-c924-fb41
> KGPSA-CA : Driver Rev 1.29 : F/W Rev 3.03X2(1.11) : wwn 1000-0000-c924-f95a
> (from binary.errlog)
>> * if Fabric: switch firmware version
>Fabric OS: v2.1.9a1 - we were told after this problem it should be
>2.1.9g or m
2.1.9M released to manufacturing 2 weeks (IIRC) ago.
--
Wilko Bulte  Arnhem, The Netherlands
> > "s.smale" <s.sm...@ntlworld.com> writes:
> >>Has anyone here had problems with HSG80 controllers creating new dubious
> >>connections and then the server not being able to connect to its disks?
> >>ES40 running Tru64 5.1
> > Could you fill us in on
> > * The HSG80 firmware version
> BMXHG01U>show this
> Controller:
> HSG80 ZG10800585 Software V85F-0, Hardware E12
> NODE_ID = 5000-1FE1-0010-A9B0
> ALLOCATION_CLASS = 0
> SCSI_VERSION = SCSI-3
> Configured for MULTIBUS_FAILOVER with ZG91606315
> In dual-redundant configuration
> Device Port SCSI address 7
> Time: NOT SET
> Command Console LUN is lun 0 (IDENTIFIER = 9999)
> KGPSA-CA : Driver Rev 1.29 : F/W Rev 3.03X2(1.11) : wwn 1000-0000-c924-fb41
> > * Loop or Fabric?
> Fabric
> > * if Fabric: switch firmware version
> Fabric OS: v2.1.9a1 - we were told after this problem it should be
> 2.1.9g or m
> > * if Fabric: do you do zoning?
> No
> > * How many servers in your SAN
> 2
> > * How many RAID boxes in your SAN?
> 1 - 2 HSG80, 2 FC, 3 shelves
> > * What did you do to cause the controller to create new connections?
> Nothing your honour !!
> > * How did you determine the server could not see its disks?
> When I came in at 0800AM the server had error message on blue >>> screen
> Message similar to - cannot connect to dga2001.1.0.1
> When show conn was first run the !NEWCON29 & !NEWCON30 did not exist
> Second time these were Online & BM1LA & BM1LB were Offline
> BMXHG01U>show conn
> Connection                                                        Unit
> Name       Operating system  Controller  Port  Address  Status    Offset
> !NEWCON25  TRU64_UNIX        THIS        1              offline   0
>            HOST_ID=2000-0000-C924-FB41  ADAPTER_ID=1000-0000-C924-FBC1
> !NEWCON26  TRU64_UNIX        OTHER       1              offline   0
>            HOST_ID=2000-0000-C924-FB41  ADAPTER_ID=1000-0000-C924-FBC1
> !NEWCON27  TRU64_UNIX        OTHER       1              offline   0
>            HOST_ID=2000-0000-C924-FB41  ADAPTER_ID=1000-0080-C924-FB41
> !NEWCON28  TRU64_UNIX        THIS        1              offline   0
>            HOST_ID=2000-0000-C924-FB41  ADAPTER_ID=1000-0080-C924-FB41
> !NEWCON29  TRU64_UNIX        THIS        1              offline   0
>            HOST_ID=2000-0000-C924-FBC1  ADAPTER_ID=1000-0000-C924-FBC1
> !NEWCON30  TRU64_UNIX        OTHER       1              offline   0
>            HOST_ID=2000-0000-C924-FBC1  ADAPTER_ID=1000-0000-C924-FBC1
> BM1LA      TRU64_UNIX        THIS        1     011200   OL this   0
>            HOST_ID=2000-0000-C924-FB41  ADAPTER_ID=1000-0000-C924-FB41
> BM1LB      TRU64_UNIX        OTHER       1     011200   OL other  0
>            HOST_ID=2000-0000-C924-FB41  ADAPTER_ID=1000-0000-C924-FB41
> BM1UA      TRU64_UNIX        THIS        2     011200   OL this   0
>            HOST_ID=2000-0000-C924-F95A  ADAPTER_ID=1000-0000-C924-F95A
> BM1UB      TRU64_UNIX        OTHER       2     011200   OL other  0
>            HOST_ID=2000-0000-C924-F95A  ADAPTER_ID=1000-0000-C924-F95A
> BM2LA      TRU64_UNIX        THIS        2     011300   OL this   0
>            HOST_ID=2000-0000-C924-FB2F  ADAPTER_ID=1000-0000-C924-FB2F
> BM2LB      TRU64_UNIX        OTHER       2     011300   OL other  0
>            HOST_ID=2000-0000-C924-FB2F  ADAPTER_ID=1000-0000-C924-FB2F
> BM2RA      TRU64_UNIX        THIS        1     011300   OL this   0
>            HOST_ID=2000-0000-C925-6BEA  ADAPTER_ID=1000-0000-C925-6BEA
> BM2RB      TRU64_UNIX        OTHER       1     011300   OL other  0
>            HOST_ID=2000-0000-C925-6BEA  ADAPTER_ID=1000-0000-C925-6BEA
> BMXHG01U>
> Current problem is with BM1 (ES40). 5 weeks ago we had same problem with
> BM2 (DS20), showing 2 !NEWCON99 connections. At the same time BM1 (ES40)
> server went down and had PCI motherboard changed.
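I suspect a fishy KGPSA. Look at the wwn in !NEWCON29 and !NEWCON30:
they end in FBC1. The HBA in BM1LA and BM1LB has FB41, as can be seen
further up.
Looking at !NEWCON25 and !NEWCON26, I see similar chaos.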
Now the only time I had an HBA change its wwn was when I upgraded the
Firmware on it. And then it didn't change back.
So what may have happened is that the KGPSA became unsure of its wwn,
causing much confusion on the part of the controller.
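To see how small the change is, the odd adapter WWNs from the show conn listing can be compared bit by bit with the adapter's own WWN (1000-0000-c924-fb41, as reported in the errlog line). A short Python sketch; the helper names are made up for illustration.

```python
def wwn_to_int(wwn):
    """Turn a WWN such as '1000-0000-C924-FB41' into a 64-bit integer."""
    return int(wwn.replace("-", ""), 16)


def differing_bits(a, b):
    """Return the bit positions (0 = least significant) where two WWNs differ."""
    diff = wwn_to_int(a) ^ wwn_to_int(b)
    return [bit for bit in range(64) if (diff >> bit) & 1]


GOOD = "1000-0000-C924-FB41"
ODD = ("1000-0000-C924-FBC1", "1000-0080-C924-FB41")

if __name__ == "__main__":
    for wwn in ODD:
        print(wwn, differing_bits(GOOD, wwn))
```

Each of the two odd values differs from the real WWN in exactly one bit (bit 7 and bit 39 respectively), which would fit a flaky adapter better than a deliberate reconfiguration.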
You are running multipath failover; I don't quite understand why the
server lost all connections when (if) only one HBA failed. Perhaps
Unix got confused as well by the sudden change in wwn.
Do you restrict the connections on your units? Perhaps the second HBA
wasn't authorized to access the units, causing the server crash?
Does /var/adm/messages say anything about the last moments before the
crash? The binary error log? - Er, if everything is on the HSG, forget
the last question.
That's an intriguing problem you've got there.
Florian
Thanks for all the replies
>>>>Has anyone here had problems with HSG80 controllers creating new dubious
>>>>connections and then the server not being able to connect to its disks?
>>>>ES40 running Tru64 5.1
> Patched, of course? I think the jumbo patch du jour is #3...
>>>Could you fill us in on
>>>* The HSG80 firmware version
>> HSG80 ZG10800585 Software V85F-0, Hardware E12
> We had trouble (different from yours) with 8.5F; consider upgrading to
> 8.6F. After we upgraded, we still had to upload a patch to the HSG80.
>>>* KGPSA firmware version
>>KGPSA-CA : Driver Rev 1.29 : F/W Rev 3.03X2(1.11) : wwn 1000-0000-c924-fb41
^^^^
Told today this should be upgraded to V3.83 for performance benefits
> Note the wwn.
Spot on
Compaq have told us today there is a bit sticking on the KGPSA and an
engineer is due today to replace it. This has taken since Thursday last.
>> KGPSA-CA : Driver Rev 1.29 : F/W Rev 3.03X2(1.11) : wwn 1000-0000-c924-f95a
>> (from binary.errlog)
>>>* What did you do to cause the controller to create new connections?
>>Nothing your honour !!
Seems like the SAN was being given duff info from the KGPSA and treating it
as a new connection
> Aha! they always say that.
> I suspect a fishy KGPSA. Look at the wwn in !NEWCON29 and !NEWCON30:
> they end in FBC1. The HBA in BM1LA and BM1LB has FB41, as can be seen
> further up.
> Looking at !NEWCON25 and !NEWCON26, I see similar chaos.
> Now the only time I had an HBA change its wwn was when I upgraded the
> Firmware on it. And then it didn't change back.
> So what may have happened is that the KGPSA became unsure of its wwn,
> causing much confusion on the part of the controller.
> You are running multipath failover; I don't quite understand why the
> server lost all connections when (if) only one HBA failed. Perhaps
> Unix got confused as well by the sudden change in wwn.
It appears the faulty adaptor corrupted the root filesystem and that is
> That's an intriguing problem you've got there.
> Florian
So much for all that redundancy !!
Colin Bull
>>>>Has anyone here had problems with HSG80 controllers creating new dubious
>>>>connections and then the server not being able to connect to its disks?
>>>>ES40 running Tru64 5.1
>>> Could you fill us in on
>>> * The HSG80 firmware version
>>BMXHG01U>show this
>>Controller:
>> HSG80 ZG10800585 Software V85F-0, Hardware E12
--
Wilko Bulte  Arnhem, The Netherlands