Intermittent shmat failure in 4.3.2, SMP


Post by Joerg Bruehe » Thu, 27 Jan 2000 04:00:00



Greetings to all !

((Due to a configuration error in our news-server, my original
posting did not make it to the outside world. Please excuse
the inconvenience if this is a re-posting for any of you.))

In our software, the client communicates with the server via
a SHM segment created by the server; the client just does a
'shmat' with address 0 (= let the system determine).

At a customer site, we now experienced some 'shmat' failures
with errno = EINVAL. They are intermittent and may occur at
peak load times, I cannot check the machine load from here.

Checking our code (which has had no problems in that area for a long time)
and the reference manual for 'shmat', we see no reason for that.
It just mentions 4 possible reasons for EINVAL:

1) The SHM_RDONLY and SHM_COPY flags are both set.
2) The SharedMemoryID parameter is not a valid shared memory identifier.
3) The SharedMemoryAddress parameter is not equal to 0, and ...
   points outside the address space of the process.
4) The SharedMemoryAddress parameter is not equal to 0,
   the SHM_RND flag is not set in the SharedMemoryFlag parameter,
   and ... points to a location outside of ...

In this case, both the code and the error message written by it agree:
1) The flag parameter is 0 - none set.
2) The ID is valid, a call to 'shmctl ( id, IPC_STAT, ...)'
   (immediately after the error) returns information as expected.
   The segment was created at server start (hours or days ago).
3+4) The address passed is 0.
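
A minimal sketch of the sequence described above (attach at address 0 with
no flags, then query the segment with IPC_STAT if the attach fails). This is
not the poster's actual code: the helper name attach_with_diagnostics is
hypothetical, and the SHM_DEST check is guarded because that macro is an
assumption that may not hold on every system.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical helper: attach segment `id` at a system-chosen address.
 * Returns the attach address, or NULL on failure.
 * *saved_errno receives the errno from shmat (0 on success). */
void *attach_with_diagnostics(int id, int *saved_errno)
{
    void *addr = shmat(id, NULL, 0);        /* address 0, no flags */
    if (addr != (void *)-1) {
        *saved_errno = 0;
        return addr;
    }

    *saved_errno = errno;
    fprintf(stderr, "shmat(%d) failed: %s\n", id, strerror(*saved_errno));

    /* Same check as in the post: does IPC_STAT still know the ID? */
    struct shmid_ds ds;
    if (shmctl(id, IPC_STAT, &ds) == 0) {
        fprintf(stderr, "  IPC_STAT ok: size=%lu nattch=%lu mode=%o\n",
                (unsigned long)ds.shm_segsz,
                (unsigned long)ds.shm_nattch,
                (unsigned int)ds.shm_perm.mode);
#ifdef SHM_DEST
        /* Assumption: where SHM_DEST exists, it marks a pending delete. */
        if (ds.shm_perm.mode & SHM_DEST)
            fprintf(stderr, "  segment is marked for deletion\n");
#endif
    } else {
        fprintf(stderr, "  IPC_STAT failed too: %s\n", strerror(errno));
    }
    return NULL;
}
```

Logging shm_nattch and the mode bits at the moment of failure would show
whether the segment was already marked for deletion when 'shmat' returned
EINVAL.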

The site runs AIX 4.3.2.0 on a 4-way SMP machine.

Questions:

a) Are there any known problems with 'shmat' in general,
   or with AIX 4.3.2.0, or on SMP machines ?

b) Is there any specific PTF or other SW upgrade they should
   install ?

c) Is there any other known reason for 'shmat' to return EINVAL
   than the four I quoted from the manual ?

   It now appears that the segment in question has the "delete"
   flag set (but still exists because it is still attached by
   some processes) - is this a reason for the errno value "EINVAL" ?

d) If the "delete" is the reason - are there any known problems
   or effects which would cause AIX or some component to
   accidentally delete a SHM segment, or to apply the "delete"
   to another segment than the intended one ?
   Two cases we analyzed both had it happen to ID 16 - any clue ?

Thank you for all hints.

Regards, Joerg Bruehe

--
Joerg Bruehe, SQL Datenbanksysteme GmbH, Berlin, Germany
     (speaking only for himself)


Post by Nicholas Dronen » Fri, 28 Jan 2000 04:00:00



> Greetings to all !
> In our software, the client communicates with the server via
> a SHM segment created by the server; the client just does a
> 'shmat' with address 0 (= let the system determine).
> At a customer site, we now experienced some 'shmat' failures
> with errno = EINVAL. They are intermittent and may occur at
> peak load times, I cannot check the machine load from here.
>    It now appears that the segment in question has the "delete"
>    flag set (but still exists because it is still attached by
>    some processes) - is this a reason for the errno value "EINVAL" ?

The following is from Stevens' _Advanced Programming in the Unix
Environment_, p. 465.

        IPC_RMID        Remove the shared memory segment set from the system.
                        Since an attachment count is maintained for shared
                        memory segments (the shm_nattch field in the shmid_ds
                        structure) the segment is not actually removed until
                        the last process using the segment terminates or
                        detaches it.  Regardless whether the segment is still
                        in use or not, the segment's identifier is immediately
                        removed so that shmat can no longer attach the segment.

The last sentence appears to be the most instructive.  Given the documented
error values that shmat returns, it appears that this is what's happening
with your program.
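
A small probe, offered only as a sketch, can show how a given system treats
an attach after IPC_RMID while another attachment is still open. The
function name attach_after_rmid is hypothetical, and the outcome is not
assumed here: whether the second 'shmat' fails may differ between systems,
so the code reports what happens rather than expecting a particular result.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical probe.
 * Returns  1 if a segment marked with IPC_RMID could still be attached,
 *          0 if the attach failed (*err receives the shmat errno),
 *         -1 if the test segment could not be set up. */
int attach_after_rmid(int *err)
{
    *err = 0;

    int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    /* First attachment keeps the segment alive after IPC_RMID. */
    void *first = shmat(id, NULL, 0);
    if (first == (void *)-1) {
        shmctl(id, IPC_RMID, NULL);
        return -1;
    }

    /* Mark the segment for deletion while it is still attached. */
    if (shmctl(id, IPC_RMID, NULL) != 0) {
        shmdt(first);
        return -1;
    }

    /* This is the situation a late client would find itself in. */
    int result;
    void *second = shmat(id, NULL, 0);
    if (second == (void *)-1) {
        *err = errno;
        result = 0;
    } else {
        shmdt(second);
        result = 1;
    }

    shmdt(first);   /* last detach lets the system free the segment */
    return result;
}
```

If this returns 0 with EINVAL on the customer's machine, that would be
consistent with the Stevens text quoted above.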

The behavior appears to be different from SysV message queues and semaphores,
which return EIDRM when a process attempts to perform an operation on a message
queue or semaphore that has been removed.  In those cases, however, the removal
itself is immediate, whereas a shared memory segment can linger after a process
issues shmctl with IPC_RMID as an argument.  That doesn't explain to my
satisfaction why the return values are different.  Can anyone comment on this?

I don't see any modules with names resembling any of the SysV IPC
services in /usr/lib/drivers or /etc/drivers, which leads me to believe
that all of the SysV IPC code is in the kernel.  Any fixes for IPC
would most likely be in bos.rte.libc and bos.mp (for an SMP machine).

Regards,

Nicholas Dronen

