AIX malloc and fault tolerance

Post by Scott Hanso » Fri, 04 Sep 1992 22:51:56



In AIX, a nonzero return from malloc does not guarantee that the
memory requested has been allocated to the process.  Neither does
a successful launch of a program containing a large static buffer
guarantee that the buffer may be fully usable.

I was quite surprised to discover this in a recent conversation that
my program had with the malloc function.  The dialog went something
like this:
    program: Hey malloc!  I'd like a few megabytes please.
    malloc:  No problem, here you go -- it's at this address.
    program: Thanks very much.  I'll just go and fill this memory, and --
    malloc:  Just kidding!  I didn't really give you the memory.  And by
             the way, you're going to die if you continue to use it.  Ha!

It turns out that when my program went to fill in the memory, it was sent
a SIGKILL by AIX.  This was clearly a bug of some kind, so I reported it.
I sent a demo program that just malloc'd as big a buffer as it could get,
and then started zeroing a byte every 4k.  When I started up a couple of
them, they all died.  Some got bus errors, others were sent SIGKILL.
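
A demo along the lines described above might look like the following sketch. It is not the program that was sent to IBM: the names touch_pages and grab_and_touch, the halving strategy, and the caller-supplied 4096-byte stride are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Write one zero byte into every `stride`-byte page of buf and return
 * the number of pages dirtied.  On a system that defers paging-space
 * allocation, the process may be killed partway through this loop. */
size_t touch_pages(char *buf, size_t len, size_t stride)
{
    size_t touched = 0;
    assert(stride > 0);
    for (size_t off = 0; off < len; off += stride) {
        buf[off] = 0;
        touched++;
    }
    return touched;
}

/* Halve the request until malloc succeeds, then dirty what was obtained.
 * Returns the size actually obtained, or 0 if not even one page fit. */
size_t grab_and_touch(size_t start_len, size_t stride)
{
    size_t len = start_len;
    char *buf = NULL;
    assert(stride > 0);
    while (len >= stride && (buf = malloc(len)) == NULL)
        len /= 2;
    if (buf == NULL)
        return 0;
    touch_pages(buf, len, stride);
    free(buf);
    return len;
}
```

Running several copies at once with a large start_len is what exposes the behavior: malloc succeeds in each process, and the failure only shows up inside touch_pages.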

IBM's response is that this is working as designed!

IBM said that they don't allocate the paging space until it's needed, in
order to accommodate programs which ask for large amounts of memory that
they never use (some nonsense about sparse arrays).  I said that I need
to know if the memory is really there before I start using it.  They
referred me to some sample code.  If it weren't so sad, it would have
been funny.

The sample code contained a malloc wrapper that returns nonzero only if
the memory is actually there.  It sets up a handler for SIGDANGER, then
calls malloc.  If it returns nonzero (a virtual certainty) it proceeds
to touch pages of memory.  If it gets SIGDANGER, the handler longjmps to
code which "untouches" the memory, frees the successfully-malloc'd-but-
not-really-there buffer and causes the malloc wrapper to return zero.
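
The general shape of such a wrapper can be sketched as below. This is not IBM's sample code, which is not reproduced in this thread; checked_malloc, on_danger, and the stride parameter are hypothetical names, and the sketch simply frees the buffer rather than "untouching" it first. SIGDANGER exists only on AIX, so those parts are guarded and the function degrades to a plain malloc-and-touch elsewhere.

```c
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>

static jmp_buf danger_env;

#ifdef SIGDANGER                    /* defined on AIX only */
static void on_danger(int sig)
{
    (void)sig;
    longjmp(danger_env, 1);         /* abandon the touch loop */
}
#endif

/* Returns a buffer whose pages have all been dirtied, or NULL if malloc
 * failed or a paging-space warning arrived while touching the pages.
 * `stride` is the assumed page size. */
void *checked_malloc(size_t len, size_t stride)
{
    char *buf;
    assert(stride > 0);
    buf = malloc(len);              /* handler not yet installed here */
    if (buf == NULL)
        return NULL;
    if (setjmp(danger_env) == 0) {
#ifdef SIGDANGER
        signal(SIGDANGER, on_danger);
#endif
        for (size_t off = 0; off < len; off += stride)
            buf[off] = 0;           /* dirty the page */
    } else {
        /* Reached via longjmp from the handler. */
        free(buf);
        buf = NULL;
    }
#ifdef SIGDANGER
    signal(SIGDANGER, SIG_DFL);     /* sketch: discards any earlier handler */
#endif
    return buf;
}
```

The handler is installed only after malloc returns, so the longjmp can never leave malloc half-finished. On a POSIX system sigsetjmp and siglongjmp would be the safer pair, since they save and restore the signal mask. Note also that SIGDANGER goes to every process, so a NULL return here does not mean this particular request caused the shortage.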

I have questions:
    1) Has anyone seen a system where static memory may not really be
       there, or where a nonzero malloc doesn't guarantee the successful
       usage of the memory?
    2) Has anyone heard of SIGDANGER before?
    3) Read the POSIX standard for malloc from a "legal" standpoint.
       If IBM claims POSIX compliance, can I use this as a weapon?
    4) Even if I use this malloc wrapper everywhere in my own code,
       how do I deal with third-party code I purchase that calls the
       unwrapped malloc?

 
 
 

Post by Tom McConne » Sat, 05 Sep 1992 06:34:32



> In AIX, a nonzero return from malloc does not guarantee that the
> memory requested has been allocated to the process.  Neither does
> a successful launch of a program containing a large static buffer
> guarantee that the buffer may be fully usable.

  You might also note that "malloc(0)" currently returns "0", although it is not
guaranteed to do this in the future. This is what breaks the current version of
g++ (version 2.2.2) if you call new[] with a size of 0.
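
One common way to sidestep the malloc(0) ambiguity is to never pass zero in the first place. The wrapper below is a minimal sketch of that idea; malloc_nonzero is a hypothetical name and is not taken from g++ or from AIX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Ask for one byte when the caller asks for zero, so that a NULL
 * return always means "out of memory" and never "size was 0". */
void *malloc_nonzero(size_t len)
{
    return malloc(len != 0 ? len : 1);
}
```

The result is released with free() as usual.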

    Cheers,

    Tom McConnell
--

 Intel, Corp. C3-91     |     Phone: (602)-554-8229
 5000 W. Chandler Blvd. | The opinions expressed are my own. No one in
 Chandler, AZ  85226    | their right mind would claim them.

 
 
 

Post by Steve Los » Sat, 05 Sep 1992 06:45:27



(Scott Hansohn) writes:

|> I have questions:
|>     1) Has anyone seen a system where static memory may not really be
|>        there, or where a nonzero malloc doesn't guarantee the successful
|>        usage of the memory?

When you malloc a huge chunk of memory, you
do indeed have the memory mapped into your address space.  You just
haven't allocated any pages from swap. If you read this newly
malloc-ed memory (kinda boring since it's all zeros) you still don't
allocate any new pages from swap.  You have to actually dirty a page
in order for the system to give you one.

About sparse arrays:  Some algorithms require a very large address
space, but actually write to a very small portion of it.  Such
algorithms run efficiently under AIX because of this design feature.

|>     2) Has anyone heard of SIGDANGER before?

Only on AIX

|>     3) Read the POSIX standard for malloc from a "legal" standpoint.
|>        If IBM claims POSIX compliance, can I use this as a weapon?

I have no idea.  Personally, I don't think it's a bad feature.  Just
different.  I can see your point, but I can also see advantages in
AIX as well.  All in all, I think AIX does a nice job with
memory management.  For instance, AIX will make some use of all your
RAM, even when you aren't running enough processes to fill it up.
AIX just uses any spare RAM for caching disk data.  SunOS will leave
RAM lying about totally unused if you have more than you need.

|>     4) Even if I use this malloc wrapper everywhere in my own code,
|>        how do I deal with third-party code I purchase that calls the
|>        unwrapped malloc?

You can run lsps -a before running any third-party code to see how much
swap is available.  I admit this is crude.

I haven't seen the malloc wrapper, but it sounds like it might be
overly complex.  There is a function called psdanger() that reports
the amount of free swap space.  Instead of writing a signal handler
for SIGDANGER and all that, you could simply see if the system has
enough swap before you malloc.  Admittedly, some other process could
start gobbling pages before you could dirty all of yours, and your
process would get killed in that case.  However, what would you do
if you ever got a SIGDANGER?  Probably exit.
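
A check-before-malloc along these lines might be sketched as follows. The psdanger() call and its header are used as recalled from the AIX documentation (with SIGDANGER as the argument it is understood to return the number of free paging-space blocks above the warning level); treat that, the 4096-byte block size, and the helper names blocks_needed and paging_space_ok as assumptions to verify on a real system. On anything other than AIX the function just reports that it cannot tell.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#ifdef _AIX
#include <sys/vminfo.h>     /* psdanger() */
#endif

#define PS_BLOCK 4096u      /* assumed size of one paging-space block */

/* Round a byte count up to whole paging-space blocks. */
size_t blocks_needed(size_t bytes)
{
    return bytes / PS_BLOCK + (bytes % PS_BLOCK != 0);
}

/* Returns 1 if there is enough headroom above the warning level right
 * now, 0 if there is not, and -1 if it cannot tell (not AIX).
 * The answer can go stale as soon as another process dirties pages. */
int paging_space_ok(size_t bytes)
{
#ifdef _AIX
    long headroom = (long)psdanger(SIGDANGER);
    return headroom > 0 && (size_t)headroom >= blocks_needed(bytes);
#else
    (void)bytes;
    return -1;
#endif
}
```

As noted above, this only narrows the window: another process can still consume the space between the check and the first write.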

--

University of *ia Academic Computing Center

 
 
 

Post by Karl Denning » Sat, 05 Sep 1992 13:41:25



>In AIX, a nonzero return from malloc does not guarantee that the
>memory requested has been allocated to the process.  Neither does
>a successful launch of a program containing a large static buffer
>guarantee that the buffer may be fully usable.

>    program: Hey malloc!  I'd like a few megabytes please.
>    malloc:  No problem, here you go -- it's at this address.
>    program: Thanks very much.  I'll just go and fill this memory, and --
>    malloc:  Just kidding!  I didn't really give you the memory.  And by
>             the way, you're going to die if you continue to use it.  Ha!

>It turns out that when my program went to fill in the memory, it was sent
>a SIGKILL by AIX.  This was clearly a bug of some kind, so I reported it.
>I sent a demo program that just malloc'd as big a buffer as it could get,
>and then started zeroing a byte every 4k.  When I started up a couple of
>them, they all died.  Some got bus errors, others were sent SIGKILL.

>IBM said that they don't allocate the paging space until it's needed, in
>order to accommodate programs which ask for large amounts of memory that
>they never use (some nonsense about sparse arrays).  

>I have questions:
>    1) Has anyone seen a system where static memory may not really be
>       there, or where a nonzero malloc doesn't guarantee the successful
>       usage of the memory?

Yeah, on AIX :-)

>    2) Has anyone heard of SIGDANGER before?

You bet. I've had entire PRODUCTION SYSTEMS come crashing down because of
this.  Believe it or not, I have had >curses< programs get SIGDANGER when
the machine was heavily loaded.  The result is a nice program crash.

Further, I have had processes receive SIGDANGER when only one of two paging
spaces on the system was close to filling.  It seems that if >any< page
space on an AIX box gets close to being full you get the SIGDANGER, even if
there are other paging spaces with lots of room left.  Yikes!  So much for
the performance advantages of spreading the page space across spindles!

>    3) Read the POSIX standard for malloc from a "legal" standpoint.
>       If IBM claims POSIX compliance, can I use this as a weapon?

Probably ;-)

>    4) Even if I use this malloc wrapper everywhere in my own code,
>       how do I deal with third-party code I purchase that calls the
>       unwrapped malloc?

You don't, other than to allow it to die.  Oh, you had something >important<
in that program going on, like perhaps a financial transaction?  Too bad --
that SIGKILL you just received can't be caught!  So much for reliable
software.

This is one of the reasons I hate AIX.  There are lots of them, but this is
definitely one of the top 5.  When malloc() returns non-NULL, you are supposed
to have the space available >period<.  Same is true for static arrays -- I
typically will declare these for things I >must< be able to get at and can't
afford a NULL malloc() return for.  It is quite a surprise to get SIGDANGER
or SIGKILL when you don't expect it, and have no way to deal with it.
Oh, you mean that large static array I have declared really >can't< be
used, and you won't tell me ahead of time?!  Oh, that array is declared in a
library (like internal to Curses)?  Now what the hell do I do about it?

This is a >big< problem on heavily-loaded machines.

--

VideOcart Inc.          Voice: (312) 987-5022

 
 
 

Post by Tim Cha » Sun, 06 Sep 1992 05:21:52



>I have questions:
>    1) Has anyone seen a system where static memory may not really be
>       there, or where a nonzero malloc doesn't guarantee the successful
>       usage of the memory?

Yes, Apollo's Aegis version 9.7 behaved this way.  In future versions,
I believe they changed it to the more "traditional" method where the backing
store is also allocated at malloc time, but I believe there was a (possibly
undocumented) option to revert to the old behavior to satisfy those who
wanted to use sparse arrays, etc.

--
Tim Chase                          Introl Corp. Milwaukee, WI USA

 
 
 

Post by Thomas Braunbeck/1310 » Sun, 06 Sep 1992 04:24:28



|> You don't, other than to allow it to die.  Oh, you had something >important<
|> in that program going on, like perhaps a financial transaction?  Too bad --
|> that SIGKILL you just received can't be caught!  So much for reliable
|> software.
|>
Only processes that do not have a signal handler for the SIGDANGER signal
will get the SIGKILL.
It is possible to suspend your process if the system sends the
SIGDANGER until other processes release paging space.
Use the psdanger subroutine to check the amount of free paging space.

From General Concepts and Procedures, GC23-2202-02, page17-2:
Fri Sep  4 21:09:06 MEZ 1992 Copyright (c) 1991 IBM Corporation     Page 1

Understanding Paging Space Allocation Policies
The amount of paging space required depends on the type of activities
performed on the system.  If paging space runs low, processes may be
lost, and if paging space runs out, the system may panic.  When a paging
space low condition is detected, additional paging space should be
defined.

The system monitors the number of free paging space blocks and detects
when a paging space shortage condition exists.  When the number of free
paging space blocks falls below a threshold known as the paging space
warning level, the system informs all processes (except kprocs) of this
condition by sending the SIGDANGER signal.  If the shortage continues
and falls below a second threshold known as the paging space kill level,
the system sends the SIGKILL signal to processes that are the major
users of paging space and that do not have a signal handler for the
SIGDANGER signal (the default action for the SIGDANGER signal is to
ignore the signal).  The system continues sending SIGKILL signals until
the number of free paging space blocks is above the paging space kill
level.

Processes that dynamically allocate memory can ensure that sufficient
paging space exists by monitoring the paging space levels with the
psdanger subroutine or by using special allocation routines (see the
psmalloc.c file for sample code which uses memory allocation routines
that allocate paging space at memory allocation time).  Processes can
keep from getting ended when the paging space kill level is reached by
defining a signal handler for the SIGDANGER signal and by releasing
memory and paging space resources allocated in their data and stack
areas and in shared memory segments, using the disclaim subroutine.
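
The handler-plus-disclaim approach described in that passage could be sketched roughly as below. The cache, the function names, and the choice to set a flag in the handler and release memory later at a safe point are all assumptions made for illustration; disclaim() with ZERO_MEM is used as recalled from the AIX documentation and is compiled only on AIX.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>
#ifdef _AIX
#include <sys/shm.h>        /* disclaim(), ZERO_MEM */
#endif

static volatile sig_atomic_t paging_low = 0;

/* The handler does the minimum: record that the warning arrived. */
void on_paging_warning(int sig)
{
    (void)sig;
    paging_low = 1;
}

/* Hypothetical discardable cache owned by the application. */
static char *cache = NULL;
static size_t cache_len = 0;

int cache_init(size_t len)
{
    assert(len > 0);
    cache = calloc(1, len);
    cache_len = cache ? len : 0;
    return cache != NULL;
}

/* Call where data structures are consistent.  Returns 1 if the cache
 * was given up, 0 if no warning was pending. */
int release_if_paging_low(void)
{
    if (!paging_low)
        return 0;
    paging_low = 0;
#ifdef _AIX
    if (cache != NULL)      /* hand the paging space back to the system */
        disclaim(cache, (unsigned int)cache_len, ZERO_MEM);
#endif
    free(cache);
    cache = NULL;
    cache_len = 0;
    return 1;
}

void install_paging_handler(void)
{
#ifdef SIGDANGER            /* defined on AIX only */
    signal(SIGDANGER, on_paging_warning);
#endif
}
```

A program would call install_paging_handler() once at startup and release_if_paging_low() from its main loop. Whether that is prompt enough to stay ahead of the kill level is not something this sketch can promise.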
--

Best regards,   Thomas Braunbeck
AS Software Service AIX, Germany

        All the opinions expressed are my own and
        do not necessarily reflect those of IBM



|       DEIBM3M3 at IBMMAIL

| Voice +49-6131-84-2445,  FAX +49-6131-84-6585

 
 
 

Post by Jon Alper » Wed, 09 Sep 1992 21:15:33


Umm.......not quite the truth...


|> Only processes that do not have a signal handler for the SIGDANGER signal
|> will get the SIGKILL.
|> {...}
|> From General Concepts and Procedures, GC23-2202-02, page17-2:
|> Fri Sep  4 21:09:06 MEZ 1992 Copyright (c) 1991 IBM Corporation     Page 1
|>
|> {...}  Processes can
|> keep from getting ended when the paging space kill level is reached by
|> defining a signal handler for the SIGDANGER signal and by releasing
                                                             ^^^^^^^^^
|> memory and paging space resources allocated in their data and stack
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|> areas and in shared memory segments, using the disclaim subroutine.
|> --
|>
|> Best regards,   Thomas Braunbeck

As can be seen by the entire post, trapping for SIGDANGER in and of itself does
nothing. If you trap for, and receive a SIGDANGER, you must free page space
immediately, or you will receive a SIGKILL shortly thereafter.

jon

--
Jon Alperin
Bell Communications Research


---> Voicenet: (908) 699-8674
---> UUNET: uunet!bcr!jona

* All opinions and stupid questions are my own *

 
 
 

Post by Pierre Assel » Thu, 10 Sep 1992 04:54:32




[He bumped into the malloc virtual allocation nonsense again.
 I still get mad just thinking about it.]

>I have questions:
>    1) Has anyone seen a system where static memory may not really be
>       there, or where a nonzero malloc doesn't guarantee the successful
>       usage of the memory?

Alas, yes.  When this thread started about 18 months ago, I tried a
malloc-touch loop on a DG Aviion.  No NULLs from malloc, processes
get killed.  Damned!  I didn't try large static arrays.

DG has been reasonably good at not reinventing the wheel; this
wonderful feature may come to us from AT&T;  netters confirmed this.
I was told, however, that it is not a part of the SVID.

>    2) Has anyone heard of SIGDANGER before?

It's an IBM innovation.

>    4) Even if I use this malloc wrapper everywhere in my own code,
>       how do I deal with third-party code I purchase that calls the
>       unwrapped malloc?

Or with the pre-linked code in libc.a?
Get your money back.  (Hah! right.)

Some arguments to the effect that vapour-memory was a good thing were:
 -Lets you use gigantic sparse arrays.
 -Lets vendors ship Fortran binaries with static arrays dimensioned
  to maximum size, and yet have them run on small machines for small
  problems that use only part of the arrays.

I'm skeptical.  Sparse arrays at 4kB/page?  As for the Fortran bit, it
only makes sense on machines dedicated to a single application.  That
sure isn't the way we use ours.
--

--Pierre Asselin, Magnetoresistive Head Engineering, Applied Magnetics.

 
 
 

Post by sc.. » Thu, 10 Sep 1992 02:22:08



>Further, I have had processes receive SIGDANGER when only one of two paging
>spaces on the system was close to filling.  It seems that if >any< page
>space on an AIX box gets close to being full you get the SIGDANGER, even if
>there are other paging spaces with lots of room left.  Yikes!  So much for
>the performance advantages of spreading the page space across spindles!

This shouldn't happen.  You should only get SIGDANGER or SIGKILL when the
TOTAL paging space runs low.  It might be that one of your paging spaces
wasn't activated.

Scott L. Porter                         IBM PSP Austin / AIX Kernel Development

 
 
 

Post by Alex Martel » Fri, 11 Sep 1992 19:57:31


        ...
:>    4) Even if I use this malloc wrapper everywhere in my own code,
:>       how do I deal with third-party code I purchase that calls the
:>       unwrapped malloc?
:
:You don't, other than to allow it to die.  Oh, you had something >important<
:in that program going on, like perhaps a financial transaction?  Too bad --
:that SIGKILL you just received can't be caught!  So much for reliable
:software.

Exactly!

:This is one of the reasons I hate AIX.  There are lots of them, but this is

I don't have any other important reason for AIX-hating, but the horrendously
bogus malloc() semantics may be enough...

--

CAD.LAB s.p.a., v. Ronzani 7/29, Casalecchio, Italia   Fax: ++39 (51) 6130294

 
 
 

Post by Alex Martel » Sat, 12 Sep 1992 19:10:04




:
:[He bumped into the malloc virtual allocation nonsense again.
: I still get mad just thinking about it.]

DITTO!  And the excuses we get about it look just like that, EXCUSES!

:Some arguments to the effect that vapour-memory was a good thing were:
: -Lets you use gigantic sparse arrays.
: -Lets vendors ship Fortran binaries with static arrays dimensioned
:  to maximum size, and yet have them run on small machines for small
:  problems that use only part of the arrays.
:
:I'm skeptical.  Sparse arrays at 4kB/page?  As for the Fortran bit, it
:only makes sense on machines dedicated to a single application.  That
:sure isn't the way we use ours.

Dedicating a WS to a single (set of) application IS the typical way that
cad.lab customers would use their machines, and we are heavy Fortran
users, and the malloc()-but-not-really idea STILL stinks.  We are
selling INDUSTRIAL-STRENGTH applications, that will be used for CRUCIAL
PRODUCTION WORK; it's COMPLETELY UNACCEPTABLE for our customers to lose
data because the application dumps abruptly!!!  So our apps are full of
error-checks.

In particular, we do NOT place data, that will grow for large problems,
inside Fortran arrays; they reside, instead, in areas which are
dynamically allocated by an underlying library written in C, and
accessed via functions or subroutines by the Fortran portions.  On
machines where malloc() semantics make sense, the C routine will return
an error indicator to the Fortran portion if it's unable to get the
memory requested; in this case, the application communicates to the
interactive user that the requested operation cannot be completed due to
running out of virtual memory, but the app is still alive and the user
can save hir work so far, and restart from there presumably after having
swap space reconfigured.

We've been particularly careful that nothing in the save-to-disk
subsystem NEEDS to allocate extra memory, so that the saving will work
even in crucial memory-low situations; we even had to recode the
output-to-file portions as C subroutines running over low-level
system calls, as we found with surprise that Fortran I/O, and C stdio, on
some platforms, may need a malloc() to succeed and will die if it fails
(and, yes, our applications ARE and WILL REMAIN extremely portable
code).
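
An output routine in that spirit, one that goes straight to the write() system call and allocates nothing, might look like the sketch below. It is not CAD.LAB's code; write_all is a hypothetical name, and the retry policy is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write len bytes to fd using only write(2): no stdio, no malloc.
 * Retries on short writes and on EINTR.
 * Returns 0 on success, -1 on error. */
int write_all(int fd, const void *data, size_t len)
{
    const char *p = data;
    assert(data != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before writing: retry */
            return -1;
        }
        if (n == 0)
            return -1;              /* no progress: give up, do not spin */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

The caller opens the file descriptor ahead of time, so that nothing on the save path has to allocate once memory is already short.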

All this care, of course, is for naught on the IBM R/6000 (thankfully we
don't presently run on DG Aviion, where malloc() is reportedly similarly
broken).  And no, we can't just set "limit datasize" appropriately,
because it depends on what the user is doing exactly: sometimes the 3D
modeler will be running alone, other times it will be scheduled together
with the 2D drafter and/or the surface renderer and/or the relational
database and/or the tool which builds programs for numerically
controlled tools and/or...  each of these applications is written to be
able to run alone OR communicate with its brethren.

We've tried the tricks IBM suggested to stop our application from dying
in unexpected places, but what happens then is that OTHER processes
die -- and the first to go is typically the X server (a memory hog, I
guess!), so the user cannot communicate with the apps to ask to save...
and NO, we CANNOT just do the saving from the SIGDANGER handler as a
safety net; the handler can basically be entered from anywhere in the
application, including "critical sections" where the data structures
are in transition and inconsistent (and NO, we CANNOT protect the
critical sections by turning off signals there, or we'll die for
lack of SIGDANGER handling).  

Yes, I know that a thousand clever tricks spring to mind to work around
one or the other of these problems, but believe me:  we must have tried
at least 900 of them and they don't work.  We've spent more time and
effort on battling this malloc() idiocy than on any other single porting
problem EVER (and with the huge list of platforms we've supported over
the years we've had quite SOME such problems, believe you me!)!!!  Most
porting problems come from bugs in the target system, some from bugs in
our code, but here we're fighting against something BROKEN AS DESIGNED
-- ***HORRIBLY*** BROKEN.  I would say it's been half the cost of the
IBM R/6000 port, if it weren't for the fact that the monstrously slow
linker (thankfully remedied in 3.2, but this port was started right at
system announcement...)  and the bugs in the early X have driven that
cost way up.  Anyway, at the end, we've given up and just document to
our customers how AND WHY their work may go up in smoke on IBM R/6000
and not on DEC, Olivetti, Sun, HP, Sony or other platforms.

If IBM ever gives us a malloc() WHICH WORKS, we'll be glad to use it.
And I hope that periodically rekindled flames about it will do some
good -- if we could get together with everybody who's suffered for
this and blackmail IBM into it the world would become a better place
in at least this small way...
--

CAD.LAB s.p.a., v. Ronzani 7/29, Casalecchio, Italia   Fax: ++39 (51) 6130294

 
 
 

Post by Beirne Konars » Tue, 15 Sep 1992 21:40:30



>Yes, I know that a thousand clever tricks spring to mind to work around
>one or the other of these problems, but believe me:  we must have tried
>at least 900 of them and they don't work.  We've spent more time and
>effort on battling this malloc() idiocy than on any other single porting
>problem EVER (and with the huge list of platforms we've supported over
>the years we've had quite SOME such problems, believe you me!)!!!  Most
>porting problems come from bugs in the target system, some from bugs in
>our code, but here we're fighting against something BROKEN AS DESIGNED
>-- ***HORRIBLY*** BROKEN.  I would say it's been half the cost of the
>IBM R/6000 port, if it weren't for the fact that the monstrously slow
>linker (thankfully remedied in 3.2, but this port was started right at
>system announcement...)  and the bugs in the early X have driven that
>cost way up.  Anyway, at the end, we've given up and just document to
>our customers how AND WHY their work may go up in smoke on IBM R/6000
>and not on DEC, Olivetti, Sun, HP, Sony or other platforms.

In another posting IBM has recommended setting MALLOCTYPE=3.1, saying this works
at runtime.  Have you tried this?  If so, did it work?
--
-------------------------------------------------------------------------------
Beirne Konarski                 | Reading maketh a full man, conference a

                                |       -- Francis Bacon
 
 
 

Post by John Ger » Wed, 16 Sep 1992 00:32:19


|>
|> In another posting IBM has recommended setting MALLOCTYPE=3.1, saying this works
|> at runtime.  Have you tried this?  If so, did it work?
|>

Setting MALLOCTYPE isn't related to the problems being discussed in this thread.
Here the issue is AIX's policy of not actually allocating malloc'ed storage
until it is touched and this basic policy is part of both 3.1 and 3.2 AIX.

--

 
 
 

Post by Ramon Pant » Wed, 16 Sep 1992 14:43:44


--- DON'T REPLY TO THE SENDER.  I'm posting this followup for a friend,
--- for some reason he can't post articles.  Hopefully I've got the
--- right article.


>I don't have any other important reason for AIX-hating, but the horrendously
>bogus malloc() semantics may be enough...

I've had the same frustrations with the malloc/SIGKILL problem,
BSD tty code and other AIXisms.

Does anybody know about a port of SVR4 that IBM had contracted out
to one of the UNIX/386 houses?  I remember reading about this about a year
ago in some UNIX magazine (maybe in "Unix Today").  I believe
that this was a special product to be used only for large bids
that explicitly asked for SVR4.

Is there any interest out there for a straight port of SVR4 to the
RS/6000?  Does anybody know if this is really happening?  Any other
details?

I was curious what would be the minimal requirements for such a product.
My requirements would be:
        - Binary compatible with AIX 3.X applications.  Not sure how
          the SVR4 shared objects and the AIX shared library models
          match.
        - Support for most IBM provided hardware (though I don't
          care about diskless workstations in particular).
        - Good support.

My non-requirements would be:
        - No compatibility with kernel extensions or device drivers.
        - No disk space compatibility, i.e. JFS filesystems or logical
          volumes would be useless, reformatting and a different filesystem
          would be required (UFS maybe VXFS).

What would be your requirements?

--- Disclaimer: the sender doesn't have any interest in this posting,
--- just acting as a gateway.