HSG80 SAN problems

Post by s.smal » Mon, 08 Oct 2001 20:26:55



Has anyone here had problems with HSG80 controllers creating new dubious
connections and then the server not being able to connect to its disks?

ES40 running Tru64 5.1

 
 
 

Post by Florian Wep » Tue, 09 Oct 2001 06:08:47



> Has anyone here had problems with HSG80 controllers creating new dubious
> connections and then the server not being able to connect to its disks?

> ES40 running Tru64 5.1

Could you fill us in on
* The HSG80 firmware version
* KGPSA firmware version
* Loop or Fabric?
* if Fabric: switch firmware version
* if Fabric: do you do zoning?
* How many servers in your SAN
* How many RAID boxes in your SAN?
* What did you do to cause the controller to create new connections?
* How did you determine the server could not see its disks?

SCSI chains can be * to debug; a SAN can be even nastier. Without
a few hints nobody will even want to guess.

Florian

--
Theodorus will be pleased at my death,
And someone else will be pleased at the death of Theodorus,
And yet everyone speaks evil of death.
                -- Ezra Pound, "Homage to Quintus Septimus Florens Christianus"

 
 
 

Post by s.smal » Wed, 10 Oct 2001 04:09:20




>>Has anyone here had problems with HSG80 controllers creating new dubious
>>connections and then the server not being able to connect to its disks?

>>ES40 running Tru64 5.1

> Could you fill us in on
> * The HSG80 firmware version

BMXHG01U>show this
Controller:
         HSG80 ZG10800585 Software V85F-0, Hardware  E12
         NODE_ID          = 5000-1FE1-0010-A9B0
         ALLOCATION_CLASS = 0
         SCSI_VERSION     = SCSI-3
         Configured for MULTIBUS_FAILOVER with ZG91606315
             In dual-redundant configuration
         Device Port SCSI address 7
         Time: NOT SET
         Command Console LUN is lun 0 (IDENTIFIER = 9999)

> * KGPSA firmware version

KGPSA-CA : Driver Rev 1.29 : F/W Rev 3.03X2(1.11) : wwn 1000-0000-c924-fb41
        KGPSA-CA : Driver Rev 1.29 : F/W Rev 3.03X2(1.11) : wwn 1000-0000-c924-f95a
        (from binary.errlog)

> * Loop or Fabric?

Fabric

> * if Fabric: switch firmware version

Fabric OS:  v2.1.9a1  - we were told after this problem it should be
2.1.9g or m

> * if Fabric: do you do zoning?

No

> * How many servers in your SAN

2

> * How many RAID boxes in your SAN?

1 - 2 HSG80, 2 FC, 3 shelves

> * What did you do to cause the controller to create new connections?

Nothing your honour !!
No cables have been moved and nothing else changed. The problem happened
at 01.30 AM

> * How did you determine the server could not see its disks?

When I came in at 0800AM the server had an error message on the blue >>> screen
Message similar to - cannot connect to dga2001.1.0.1

When show conn was first run, !NEWCON29 and !NEWCON30 did not exist.
The second time, these were online and BM1LA and BM1LB were offline.
BMXHG01U>show conn
Connection                                                                Unit
   Name      Operating system    Controller  Port    Address    Status   Offset
!NEWCON25      TRU64_UNIX           THIS       1               offline       0
           HOST_ID=2000-0000-C924-FB41         ADAPTER_ID=1000-0000-C924-FBC1              
!NEWCON26      TRU64_UNIX           OTHER      1               offline       0
           HOST_ID=2000-0000-C924-FB41         ADAPTER_ID=1000-0000-C924-FBC1
!NEWCON27      TRU64_UNIX           OTHER      1               offline       0
           HOST_ID=2000-0000-C924-FB41         ADAPTER_ID=1000-0080-C924-FB41
!NEWCON28      TRU64_UNIX           THIS       1               offline       0
           HOST_ID=2000-0000-C924-FB41         ADAPTER_ID=1000-0080-C924-FB41
!NEWCON29      TRU64_UNIX           THIS       1               offline       0
           HOST_ID=2000-0000-C924-FBC1         ADAPTER_ID=1000-0000-C924-FBC1
!NEWCON30      TRU64_UNIX           OTHER      1               offline       0
          HOST_ID=2000-0000-C924-FBC1         ADAPTER_ID=1000-0000-C924-FBC1              
BM1LA          TRU64_UNIX           THIS       1      011200   OL this       0
           HOST_ID=2000-0000-C924-FB41         ADAPTER_ID=1000-0000-C924-FB41
BM1LB          TRU64_UNIX           OTHER      1      011200   OL other      0
           HOST_ID=2000-0000-C924-FB41         ADAPTER_ID=1000-0000-C924-FB41
BM1UA          TRU64_UNIX           THIS       2      011200   OL this       0
           HOST_ID=2000-0000-C924-F95A         ADAPTER_ID=1000-0000-C924-F95A
BM1UB          TRU64_UNIX           OTHER      2      011200   OL other      0
           HOST_ID=2000-0000-C924-F95A         ADAPTER_ID=1000-0000-C924-F95A
BM2LA          TRU64_UNIX           THIS       2      011300   OL this       0
           HOST_ID=2000-0000-C924-FB2F         ADAPTER_ID=1000-0000-C924-FB2F
BM2LB          TRU64_UNIX           OTHER      2      011300   OL other      0
           HOST_ID=2000-0000-C924-FB2F         ADAPTER_ID=1000-0000-C924-FB2F
BM2RA          TRU64_UNIX           THIS       1      011300   OL this       0
           HOST_ID=2000-0000-C925-6BEA         ADAPTER_ID=1000-0000-C925-6BEA
BM2RB          TRU64_UNIX           OTHER      1      011300   OL other      0
           HOST_ID=2000-0000-C925-6BEA         ADAPTER_ID=1000-0000-C925-6BEA
BMXHG01U>      

The current problem is with BM1 (ES40). 5 weeks ago we had the same problem
with BM2 (DS20), showing 2 !NEWCON99 connections. At the same time the BM1
(ES40) server went down and had its PCI motherboard changed.

> SCSI chains can be * to debug; a SAN can be even nastier. Without
> a few hints nobody will even want to guess.

Regards

Colin Bull (using Sue's email address !!)

 
 
 

Post by Wilko Bul » Wed, 10 Oct 2001 04:35:11




>> Has anyone here had problems with HSG80 controllers creating new dubious
>> connections and then the server not being able to connect to its disks?

>> ES40 running Tru64 5.1

>Could you fill us in on
>* The HSG80 firmware version
>* KGPSA firmware version
>* Loop or Fabric?
>* if Fabric: switch firmware version
>* if Fabric: do you do zoning?
>* How many servers in your SAN
>* How many RAID boxes in your SAN?
>* What did you do to cause the controller to create new connections?
>* How did you determine the server could not see its disks?
>SCSI chains can be * to debug; a SAN can be even nastier. Without
>a few hints nobody will even want to guess.

In addition, are you talking about Tru64 having problems, or are you
referring to the SRM console (wwidmgr etc.)?

Wilko

--

|/|/ / / /(  (_)  Bulte         Arnhem, The Netherlands

 
 
 

Post by Wilko Bul » Wed, 10 Oct 2001 04:37:05





>>>Has anyone here had problems with HSG80 controllers creating new dubious
>>>connections and then the server not being able to connect to its disks?

>>>ES40 running Tru64 5.1

>> Could you fill us in on
>> * The HSG80 firmware version
>BMXHG01U>show this
>Controller:
>         HSG80 ZG10800585 Software V85F-0, Hardware  E12
>         NODE_ID          = 5000-1FE1-0010-A9B0
>         ALLOCATION_CLASS = 0
>         SCSI_VERSION     = SCSI-3
>         Configured for MULTIBUS_FAILOVER with ZG91606315
>             In dual-redundant configuration
>         Device Port SCSI address 7
>         Time: NOT SET
>         Command Console LUN is lun 0 (IDENTIFIER = 9999)
>> * KGPSA firmware version
>KGPSA-CA : Driver Rev 1.29 : F/W Rev 3.03X2(1.11) : wwn 1000-0000-c924-fb41
>    KGPSA-CA : Driver Rev 1.29 : F/W Rev 3.03X2(1.11) : wwn 1000-0000-c924-f95a
>    (from binary.errlog)

...

>> * if Fabric: switch firmware version
>Fabric OS:  v2.1.9a1  - we were told after this problem it should be
>2.1.9g or m

2.1.9M was released to manufacturing 2 weeks ago (IIRC).

--

|/|/ / / /(  (_)  Bulte         Arnhem, The Netherlands

 
 
 

Post by Florian Wep » Wed, 10 Oct 2001 05:53:10


"s.smale" <s.sm...@ntlworld.com> writes:
> Florian Weps wrote:

> > "s.smale" <s.sm...@ntlworld.com> writes:

> >>Has anyone here had problems with HSG80 controllers creating new dubious
> >>connections and then the server not being able to connect to its disks?

> >>ES40 running Tru64 5.1

Patched, of course? I think the jumbo patch du jour is #3...

> > Could you fill us in on
> > * The HSG80 firmware version

> BMXHG01U>show this
> Controller:
>          HSG80 ZG10800585 Software V85F-0, Hardware  E12
>          NODE_ID          = 5000-1FE1-0010-A9B0
>          ALLOCATION_CLASS = 0
>          SCSI_VERSION     = SCSI-3
>          Configured for MULTIBUS_FAILOVER with ZG91606315
>              In dual-redundant configuration
>          Device Port SCSI address 7
>          Time: NOT SET
>          Command Console LUN is lun 0 (IDENTIFIER = 9999)

We had trouble (different from yours) with 8.5F; consider upgrading to
8.6F. After we upgraded, we still had to upload a patch to the HSG80.

> > * KGPSA firmware version

> KGPSA-CA : Driver Rev 1.29 : F/W Rev 3.03X2(1.11) : wwn 1000-0000-c924-fb41

                                                                         ^^^^
Note the wwn.

>    KGPSA-CA : Driver Rev 1.29 : F/W Rev 3.03X2(1.11) : wwn 1000-0000-c924-f95a
>    (from binary.errlog)

> > * Loop or Fabric?

> Fabric

> > * if Fabric: switch firmware version

> Fabric OS:  v2.1.9a1  - we were told after this problem it should be
> 2.1.9g or m

> > * if Fabric: do you do zoning?

> No

> > * How many servers in your SAN

> 2

> > * How many RAID boxes in your SAN?

> 1 - 2 HSG80, 2 FC, 3 shelves

> > * What did you do to cause the controller to create new connections?

> Nothing your honour !!

Aha! They always say that.

> No cables have been moved and nothing else changed. The problem happened
> at 01.30 AM

> > * How did you determine the server could not see its disks?

> When I came in at 0800AM the server had an error message on the blue >>> screen
> Message similar to - cannot connect to dga2001.1.0.1

> When show conn was first run, !NEWCON29 and !NEWCON30 did not exist.
> The second time, these were online and BM1LA and BM1LB were offline.
> BMXHG01U>show conn
> Connection                                                                Unit
>    Name      Operating system    Controller  Port    Address    Status   Offset
> !NEWCON25      TRU64_UNIX           THIS       1               offline       0
>            HOST_ID=2000-0000-C924-FB41         ADAPTER_ID=1000-0000-C924-FBC1              
> !NEWCON26      TRU64_UNIX           OTHER      1               offline       0
>            HOST_ID=2000-0000-C924-FB41         ADAPTER_ID=1000-0000-C924-FBC1
> !NEWCON27      TRU64_UNIX           OTHER      1               offline       0
>            HOST_ID=2000-0000-C924-FB41         ADAPTER_ID=1000-0080-C924-FB41
> !NEWCON28      TRU64_UNIX           THIS       1               offline       0
>            HOST_ID=2000-0000-C924-FB41         ADAPTER_ID=1000-0080-C924-FB41
> !NEWCON29      TRU64_UNIX           THIS       1               offline       0
>            HOST_ID=2000-0000-C924-FBC1         ADAPTER_ID=1000-0000-C924-FBC1
> !NEWCON30      TRU64_UNIX           OTHER      1               offline       0
>           HOST_ID=2000-0000-C924-FBC1         ADAPTER_ID=1000-0000-C924-FBC1              
> BM1LA          TRU64_UNIX           THIS       1      011200   OL this       0
>            HOST_ID=2000-0000-C924-FB41         ADAPTER_ID=1000-0000-C924-FB41
> BM1LB          TRU64_UNIX           OTHER      1      011200   OL other      0
>            HOST_ID=2000-0000-C924-FB41         ADAPTER_ID=1000-0000-C924-FB41
> BM1UA          TRU64_UNIX           THIS       2      011200   OL this       0
>            HOST_ID=2000-0000-C924-F95A         ADAPTER_ID=1000-0000-C924-F95A
> BM1UB          TRU64_UNIX           OTHER      2      011200   OL other      0
>            HOST_ID=2000-0000-C924-F95A         ADAPTER_ID=1000-0000-C924-F95A
> BM2LA          TRU64_UNIX           THIS       2      011300   OL this       0
>            HOST_ID=2000-0000-C924-FB2F         ADAPTER_ID=1000-0000-C924-FB2F
> BM2LB          TRU64_UNIX           OTHER      2      011300   OL other      0
>            HOST_ID=2000-0000-C924-FB2F         ADAPTER_ID=1000-0000-C924-FB2F
> BM2RA          TRU64_UNIX           THIS       1      011300   OL this       0
>            HOST_ID=2000-0000-C925-6BEA         ADAPTER_ID=1000-0000-C925-6BEA
> BM2RB          TRU64_UNIX           OTHER      1      011300   OL other      0
>            HOST_ID=2000-0000-C925-6BEA         ADAPTER_ID=1000-0000-C925-6BEA
> BMXHG01U>      

> The current problem is with BM1 (ES40). 5 weeks ago we had the same problem
> with BM2 (DS20), showing 2 !NEWCON99 connections. At the same time the BM1
> (ES40) server went down and had its PCI motherboard changed.

I suspect a fishy KGPSA. Look at the wwn in !NEWCON29 and !NEWCON30:
they end in FBC1. The HBA in BM1LA and BM1LB has FB41, as can be seen
further up.

Looking at !NEWCON25 and !NEWCON26, I see similar chaos.
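
The same comparison can be done mechanically. What follows is only a
minimal sketch in Python, not an HSG80 or Compaq tool: the function names
parse_connections and suspicious are made up for illustration, the sample
text is an abridged copy of the show conn listing quoted above, and the
"known" adapter WWNs are the two the host reported in binary.errlog.

```python
import re

# Abridged copy of the `show conn` listing quoted in this thread.
SHOW_CONN = """\
!NEWCON25      TRU64_UNIX           THIS       1               offline       0
           HOST_ID=2000-0000-C924-FB41         ADAPTER_ID=1000-0000-C924-FBC1
!NEWCON27      TRU64_UNIX           OTHER      1               offline       0
           HOST_ID=2000-0000-C924-FB41         ADAPTER_ID=1000-0080-C924-FB41
!NEWCON29      TRU64_UNIX           THIS       1               offline       0
           HOST_ID=2000-0000-C924-FBC1         ADAPTER_ID=1000-0000-C924-FBC1
BM1LA          TRU64_UNIX           THIS       1      011200   OL this       0
           HOST_ID=2000-0000-C924-FB41         ADAPTER_ID=1000-0000-C924-FB41
BM1UA          TRU64_UNIX           THIS       2      011200   OL this       0
           HOST_ID=2000-0000-C924-F95A         ADAPTER_ID=1000-0000-C924-F95A
"""

# WWNs of the two KGPSAs as the host itself logged them (binary.errlog).
KNOWN_ADAPTERS = {"1000-0000-C924-FB41", "1000-0000-C924-F95A"}

ID_LINE = re.compile(r"HOST_ID=(\S+)\s+ADAPTER_ID=(\S+)")


def parse_connections(text):
    """Return (name, host_id, adapter_id) for each connection in the listing."""
    conns = []
    name = None
    for line in text.splitlines():
        match = ID_LINE.search(line)
        if match:
            # An ID line belongs to the connection name seen just before it.
            if name is not None:
                conns.append((name, match.group(1).upper(), match.group(2).upper()))
            name = None
        elif line.strip():
            # Any other non-blank line starts with the connection name.
            name = line.split()[0]
    return conns


def suspicious(conns, known=KNOWN_ADAPTERS):
    """Names of connections whose ADAPTER_ID is not a WWN we expect to see."""
    return [name for name, _host_id, adapter_id in conns if adapter_id not in known]


if __name__ == "__main__":
    print(suspicious(parse_connections(SHOW_CONN)))
```

On the sample above this flags the !NEWCON entries and leaves BM1LA and
BM1UA alone, which matches what reading the listing by eye suggests.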

Now the only time I had an HBA change its wwn was when I upgraded the
firmware on it. And then it didn't change back.

So what may have happened is that the KGPSA became unsure of its wwn,
causing much confusion on the part of the controller.

You are running multibus failover; I don't quite understand why the
server lost all connections when (if) only one HBA failed. Perhaps
Unix got confused as well by the sudden change in wwn.

Do you restrict the connections on your units? Perhaps the second HBA
wasn't authorized to access the units, causing the server crash?

Does /var/adm/messages say anything about the last moments before the
crash? The binary error log? - Er, if everything is on the HSG, forget
the last question.

That's an intriguing problem you've got there.

Florian

--
Theodorus will be pleased at my death,
And someone else will be pleased at the death of Theodorus,
And yet everyone speaks evil of death.
                -- Ezra Pound, "Homage to Quintus Septimus Florens Christianus"

 
 
 

Post by s.smal » Wed, 10 Oct 2001 21:13:06


Thanks for all the replies



>>>>Has anyone here had problems with HSG80 controllers creating new dubious
>>>>connections and then the server not being able to connect to its disks?

>>>>ES40 running Tru64 5.1

> Patched, of course? I think the jumbo patch du jour is #3...

Yes patch kit 3 applied

>>>Could you fill us in on
>>>* The HSG80 firmware version
>>         HSG80 ZG10800585 Software V85F-0, Hardware  E12

> We had trouble (different from yours) with 8.5F; consider upgrading to
> 8.6F. After we upgraded, we still had to upload a patch to the HSG80.

>>>* KGPSA firmware version

>>KGPSA-CA : Driver Rev 1.29 : F/W Rev 3.03X2(1.11) : wwn 1000-0000-c924-fb41

Told today this should be upgraded to V3.83 for performance benefits

                                                                        ^^^^

> Note the wwn.

Spot on

Compaq told us today that there is a bit sticking on the KGPSA, and an
engineer is due today to replace it. This has taken since last Thursday.
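
The "bit sticking" explanation can be checked against the WWNs in the
show conn listing. A minimal sketch in Python (the helper names wwn_to_int
and differing_bits are made up for illustration) that XORs the adapter's
real WWN with the two odd ones and lists the bit positions that differ:

```python
def wwn_to_int(wwn):
    """Turn a WWN such as '1000-0000-C924-FB41' into a 64-bit integer."""
    return int(wwn.replace("-", ""), 16)


def differing_bits(a, b):
    """Bit positions (0 = least significant) where the two WWNs differ."""
    diff = wwn_to_int(a) ^ wwn_to_int(b)
    return [i for i in range(64) if (diff >> i) & 1]


GOOD = "1000-0000-C924-FB41"  # WWN the host logs for this KGPSA

if __name__ == "__main__":
    # The two odd ADAPTER_ID values seen on the !NEWCON connections.
    for odd in ("1000-0000-C924-FBC1", "1000-0080-C924-FB41"):
        print(odd, differing_bits(GOOD, odd))
```

Each odd WWN differs from the real one in exactly one bit (bit 7 and
bit 39), and in both cases it is the top bit of a byte, which fits the
single sticking bit Compaq described.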

>>        KGPSA-CA : Driver Rev 1.29 : F/W Rev 3.03X2(1.11) : wwn 1000-0000-c924-f95a
>>        (from binary.errlog)

>>>* What did you do to cause the controller to create new connections?

>>Nothing your honour !!

Seems like the SAN was being given duff info by the KGPSA and treated it as a new connection.

> Aha! They always say that.

> I suspect a fishy KGPSA. Look at the wwn in !NEWCON29 and !NEWCON30:
> they end in FBC1. The HBA in BM1LA and BM1LB has FB41, as can be seen
> further up.

> Looking at !NEWCON25 and !NEWCON26, I see similar chaos.

> Now the only time I had an HBA change its wwn was when I upgraded the
> Firmware on it. And then it didn't change back.

> So what may have happened is that the KGPSA became unsure of its wwn,
> causing much confusion on the part of the controller.

> You are running multipath failover; I don't quite understand why the
> server lost all connections when (if) only one HBA failed. Perhaps
> Unix got confused as well by the sudden change in wwn.

It appears the faulty adaptor corrupted the root filesystem, and that is
why the server cannot see the disks. Not sure why that is the case, but
I am looking at it from a position of ignorance.

> That's an intriguing problem you've got there.

> Florian

So much for all that redundancy !!

Colin Bull

 
 
 

Post by Wilko Bul » Sat, 13 Oct 2001 05:13:32






>>>>Has anyone here had problems with HSG80 controllers creating new dubious
>>>>connections and then the server not being able to connect to its disks?

>>>>ES40 running Tru64 5.1

>>> Could you fill us in on
>>> * The HSG80 firmware version
>>BMXHG01U>show this
>>Controller:
>>         HSG80 ZG10800585 Software V85F-0, Hardware  E12

You are missing at least a bunch of patches for the ACS firmware. I think
6 patches are out there (but I could have lost count).

--

|/|/ / / /(  (_)  Bulte         Arnhem, The Netherlands

 
 
 
