Bus Error Blues [some code at end of post]

Post by Th » Fri, 12 Oct 2001 07:03:07



*sighs*

Allrighty... I just wrote a long message explaining my problem. Then I
thought I found the solution. Hence long message -> garbage.
My solution of course did not work.
This is my problem:
         I have a linked list of structures which contain linked lists of
elements with linked lists of coords within.

IE: [struct 1]
        |
        Elem -> Coord-> Coord-> Coord-> Coord
        |
        Elem -> Coord-> Coord
        |
        Elem -> Coord-> Coord-> Coord-> Coord-> Coord-> Coord-> Coord-> Coord
etc

I wrote a function to run through the array and free all of the elements
here, but it doesn't seem to work very well.
It will free all of the struct1's and elements if I comment out the coord
freeing part.
It will also free all of the structs, elems and the *first* coord in the
linked list of coords, if I comment out the code to free the links after
the first coord.

However, it will *not* free the whole linked list of coords.
It successfully frees the first 50-100 struct1's in the array, but after
that it dies.  [50 out of 900-1000]
Depending on the file size it dies at different spots, but for any
particular file size it dies at the same spot.

When allocating memory for the coords, it will die sooner if I allocate an
extra coord on the end.
IE: if currentCoord is the *last* non-empty coord then the following makes
it die sooner
        currentCoord->next = (coord *) malloc(sizeof(coord));
It still, however, dies eventually. [75-125 structs freed before death]

Other info:
        it will successfully work if everything is empty. (no coords, no
elems, no structs)
        if each struct entry is the same, it still dies. [and not on the
first struct either]

Hmm....
Let me think now...
That's all the info I can think of for now. The code I wrote for actually
freeing the structs/elems/coords is below.

Thanks for your help....

-Xlegna

int freeAll()
{   int count;
    struct1 myEvilStruct;
    Element *myEvilElement, *nextElem;

    for(count = 0; count < MAX_NUM_STRUCTS; count++)
    {  myEvilStruct = myStructs[count];
        if(myEvilStruct.name != NULL)
        {   fprintf(stderr,"Freeing %s (%u).\n",myEvilStruct.name,count);
             for(myEvilElement = myEvilStruct.first; myEvilElement != NULL; myEvilElement = nextElem)
            {   nextElem = (Element *) myEvilElement->next;
                freeElement(myEvilElement);
                myEvilElement = nextElem;
            }
            free(myEvilStruct.lmt);
            free(myEvilStruct.lat);
            fprintf(stderr,"%s freed.\n",myEvilStruct.name);
            free(myEvilStruct.name);
            free(myEvilStruct);
        }
    }
    free(nextElem);
    return 0;

}

void freeElement(Element *anEvilElement)
{       Coord *myEvilCoord, *nextCoord;
        if(OUTPUT & DEBUG_ELEM_FREE)  {  fprintf(stderr,"%u free started.\n", anEvilElement->type);    }
        if(anEvilElement != NULL)
        {
            for(myEvilCoord = anEvilElement->xy; myEvilCoord != NULL; myEvilCoord = nextCoord)
            {   nextCoord = (Coord *) myEvilCoord->next;
        /*   ^^^^ It says it dies on this line. Hmph.
        If nextCoord is set to NULL, and that line commented out then it
works. Except of course it doesn't free all the links in the linked-list.
Only the first.
        */
                freeCoord(myEvilCoord);
            }
            free(anEvilElement->myopt);
            free(anEvilElement);
      }

        free(nextCoord);

}

void freeCoord(Coord *anEvilCoord)
{   if(anEvilCoord !=NULL)
    {  free(anEvilCoord);
     }
}

 
 
 


Post by Eric Sosma » Fri, 12 Oct 2001 07:31:35



> [...]
> int freeAll()
> {   int count;
>     struct1 myEvilStruct;
>     Element *myEvilElement, *nextElem;

>     for(count = 0; count < MAX_NUM_STRUCTS; count++)
>     {  myEvilStruct = myStructs[count];
>         if(myEvilStruct.name != NULL)
>         {   fprintf(stderr,"Freeing %s (%u).\n",myEvilStruct.name,count);
>              for(myEvilElement = myEvilStruct.first; myEvilElement != NULL; myEvilElement = nextElem)
>             {   nextElem = (Element *) myEvilElement->next;
>                 freeElement(myEvilElement);
>                 myEvilElement = nextElem;
>             }
>             free(myEvilStruct.lmt);
>             free(myEvilStruct.lat);
>             fprintf(stderr,"%s freed.\n",myEvilStruct.name);
>             free(myEvilStruct.name);
>             free(myEvilStruct);
> [...]

    I think your problem (one of them, anyhow) is in the last
free() above.  Notice that `myEvilStruct' is a struct, not a
pointer to something; when free() gets hold of this garbage
non-pointer value, pretty much anything can happen.

    A guess: You didn't #include <stdlib.h> in this code,
because if you had the compiler would have known that free()
requires a pointer argument and would have complained about
this line.  Hint, hint.
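
To make the pointer/non-pointer distinction concrete, here is a minimal sketch of the cleanup. It assumes (the thread does not say) that `myStructs` is an ordinary array of `struct1`, so the array slots themselves were never returned by malloc() and must not be handed to free(); only the malloc'd members are released. The three-member layout and the helper name `free_struct1_members` are hypothetical stand-ins, not the poster's actual definitions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the poster's struct1; only the
   heap-allocated members matter for this sketch. */
typedef struct {
    char *name;
    char *lmt;
    char *lat;
} struct1;

/* Free what the struct owns and clear the pointers so that a
   second call is harmless. The struct itself is NOT passed to
   free(): it is an array slot, not something malloc() returned. */
void free_struct1_members(struct1 *s)
{
    assert(s != NULL);
    free(s->lmt);    /* free(NULL) is a no-op, so no NULL checks needed */
    free(s->lat);
    free(s->name);
    s->lmt = NULL;
    s->lat = NULL;
    s->name = NULL;
}
```

Under those assumptions the call in the loop would be `free_struct1_members(&myStructs[count]);`, passing the address of the slot rather than a copy of the struct.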

--


 
 
 


Post by Th » Fri, 12 Oct 2001 08:12:13


>    I think your problem (one of them, anyhow) is in the last
>free() above.  Notice that `myEvilStruct' is a struct, not a
>pointer to something; when free() gets hold of this garbage
>non-pointer value, pretty much anything can happen.

>   A guess: You didn't #include <stdlib.h> in this code,
>because if you had the compiler would have known that free()
>requires a pointer argument and would have complained about
>this line.  Hint, hint.

*smacks himself upside the head*

I forgot to paste it into my header when I cut it from my other file. Gah
[btw, is that bad form? to have #include's in the header instead of code
file?]

Doesn't fix my problem, but yes that is indeed wrong ;)

Any more ideas ? ;)

-Xlegna

 
 
 


Post by Chuck Dillo » Fri, 12 Oct 2001 22:36:36



> >    I think your problem (one of them, anyhow) is in the last
> >free() above.  Notice that `myEvilStruct' is a struct, not a
> >pointer to something; when free() gets hold of this garbage
> >non-pointer value, pretty much anything can happen.

> >   A guess: You didn't #include <stdlib.h> in this code,
> >because if you had the compiler would have known that free()
> >requires a pointer argument and would have complained about
> >this line.  Hint, hint.

> *smacks himself upside the head*

> I forgot to paste it into my header when I cut it from my other file. Gah
> [btw, is that bad form? to have #include's in the header instead of code
> file?]

> Doesn't fix my problem, but yes that is indeed wrong ;)

> Any more ideas ? ;)

First, there is nothing about unix programming in your post.  You
should direct questions about coding in C to a group that is about
coding in C.

Having said that:
        1) Nothing in a computer 'dies' that I'm aware of.  When your
           program fails the system or compiler or whatever tries to
           give you the info it has.  You should let the people you
           query about the problem in on what the system has told you.
        2) The fact that it 'dies' while freeing does not necessarily
           mean the 'bug' is in the freeing code.  You could be
           trashing memory anywhere between that allocation and the free.
           You could potentially also be trashing memory in data structures
           that were malloced but are unrelated to your lists.
        3) There are tools available for tracking these kinds of problems.
           Debuggers that monitor memory operations, debug versions of
           malloc et al., they are your friend.

Have fun!

-- ced

--
Chuck Dillon
Senior Software Engineer
Accelrys Inc., a subsidiary of Pharmacopeia, Inc.

 
 
 


Post by Th » Fri, 12 Oct 2001 23:14:29


>First, there is nothing about unix programming in your post.  You
>should direct questions about coding in C to a group that is about
>coding in C.

A'hm sorry, I forgot something in my original post... It was the second
time I rewrote it ;)
I am currently running in HP-UX and also wanted to ask whether it could be
that I am having a memory alignment problem. I know that this type of
problem causes a SIGBUS, but I am sure there must be other causes for
SIGBUS out there.
I am doing all my memory allocation with malloc() however, so I think that
my memory alignment is fine.
[The man page states that malloc() returns a block of memory that is
'correctly aligned for any use';
"The  malloc() function returns a pointer to  a  block  of  at  least size
bytes suitably aligned for any use."]

I would also like to know where I can find more information about what
*causes* a SIGBUS. How does this differ from a Segmentation fault? Etc.
The documents I have found through a google search were not of much help;
most saying that Segmentation Faults and Bus Errors were almost
interchangeable.

For future reference, what is a good newsgroup for straight C postings?

>Having said that:
>                1) Nothing in a computer 'dies' that I'm aware of.  When your
>                   program fails the system or compiler or whatever tries to
>                   give you the info it has.  You should let the people you
>                   query about the problem in on what the system has told you.

Indeed, my apologies for being vague once again.
The computer does not 'die', and to be more exact it does not even
"freeze".
My program receives a SIGBUS, which terminates it. I believe I did mention
this in my first post.
The output of my program is
"...
AOI32X1 freed.
Freeing AOI32X2 (num 70).
Bus Error (core dumped)

The exact error the system tells me is, of course, "Bus Error (Core dumped)".
The exit status is 138.

>                2) The fact that it 'dies' while freeing does not necessarily
>                   mean the 'bug' is in the freeing code.  You could be
>                   trashing memory anywhere between that allocation and the free.
>                   You could potentially also be trashing memory in data structures
>                   that were malloced but are unrelated to your lists.

Indeed, this very thought had occurred to me.
My 'solution' which I originally thought would work was a rewriting of the
memory allocation code, because I thought that my problem lay therein.
This rewritten version did not, of course, fix my problem and so I am
stranded writing to these groups asking for help.
How do you mean 'trashing' memory? How does one go about 'trashing'
memory?
I might be able to track down if it happens if I knew the signs of it ;)

>                3) There are tools available for tracking these kinds of problems.
>                   Debuggers that monitor memory operations, debug versions of
>                   malloc et al., they are your friend.

Indeed, and I am not very familiar with them.
I attempted to use GDB to track+fix my program, but since I am
unfamiliar with it, I was only able to track down that the error occurred
when I do:
"nextCoord = (Coord *) myEvilCoord->next;"

I also did various printf debugging which I find more useful than many
debuggers out there.
However, if you could point out a good unix debugger I would be grateful.

Thank you very much for your response Chuck! :)

-Xlegna

 
 
 


Post by Joe Durusa » Sat, 13 Oct 2001 03:54:57



OK, since you now know where the bad line is, print out the value of
myEvilCoord and find out why myEvilCoord->next results in an unaligned
reference or some such.  You might want to print to a file, since I doubt
that it fails on the first try, based on your earlier comments.

An example of how the failure could occur:

A Coord * is 4 bytes long, next has an offset of 2 from the beginning of
the structure (or whatever) pointed to by myEvilCoord, and the struct
itself is aligned on a 4-byte boundary.

In other words,

myEvilCoord = 0x100,
offset of next = 2,
and Coord * is 4 bytes long, so the 4-byte load from 0x102 is misaligned.
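
The arithmetic in that example can be checked with a small sketch. The helper names `is_aligned` and `member_address` are hypothetical, and "aligned" here means natural alignment (the address is a multiple of the access size), which is an assumption about the hardware rather than something the thread states. Note also that a compiler normally pads a struct so that a pointer member lands on an aligned offset, so in practice a misaligned access like this more often comes from a pointer whose value has been overwritten.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if an access of `size` bytes at `addr` is naturally
   aligned (addr is a multiple of size), 0 otherwise. */
int is_aligned(uintptr_t addr, size_t size)
{
    assert(size != 0);
    return addr % size == 0;
}

/* Address of a member, given the struct's base address and the
   member's byte offset within the struct. */
uintptr_t member_address(uintptr_t base, size_t offset)
{
    return base + offset;
}
```

With the numbers above, `is_aligned(0x100, 4)` holds for the struct itself, but `is_aligned(member_address(0x100, 2), 4)` does not, which is the kind of access that can raise SIGBUS on strict-alignment hardware.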

Speaking only for myself,

Joe Durusau

 
 
 


Post by George R. Gonzale » Sat, 13 Oct 2001 01:47:33


What I've done many times is add a "password" field
to each structure.  Then write wrapper functions,
one that does all the malloc( sizeof(node) )
another that does all the free()'s.
The malloc one will set the password field,
the free one will check the field before doing the free,
then change the pw field.  Good choices for the pw field are
4-character strings into which you can stuff "busy", "free", etc...

It's even better if you put in 2 password fields,
one at the beginning of the node, another at the end,
then set and check both of them.

It wouldn't hurt to have another function that traverses all the
lists and checks all the passwords.   Call this one several times
in your program to verify the consistency of the nodes.

Usually you'll find you're freeing something twice, or
one node is linked into two lists and getting freed twice.
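
A minimal sketch of the password idea, with two tags per node and wrapper functions around malloc() and free(). The `Node` layout and the names `node_alloc`, `node_ok`, `node_free` and `list_count_bad` are hypothetical. One caveat: inspecting the tag of a node that has already been freed reads freed memory, which the C standard does not define, so this catches double frees only as a best-effort debugging aid.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TAG_LEN 4

/* Hypothetical node type with a tag at each end. */
typedef struct Node {
    char head_tag[TAG_LEN];   /* holds "busy" while allocated */
    int value;
    struct Node *next;
    char tail_tag[TAG_LEN];
} Node;

/* Allocate a node and stamp both tags. Returns NULL on failure. */
Node *node_alloc(int value)
{
    Node *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    memcpy(n->head_tag, "busy", TAG_LEN);
    memcpy(n->tail_tag, "busy", TAG_LEN);
    n->value = value;
    n->next = NULL;
    return n;
}

/* Returns 1 if both tags still read "busy", 0 otherwise. */
int node_ok(const Node *n)
{
    return n != NULL
        && memcmp(n->head_tag, "busy", TAG_LEN) == 0
        && memcmp(n->tail_tag, "busy", TAG_LEN) == 0;
}

/* Free a node only if its tags are intact. Returns 0 on success,
   -1 if the node looks corrupted or already freed; in that case it
   is deliberately not passed to free(). */
int node_free(Node *n)
{
    if (!node_ok(n))
        return -1;
    memcpy(n->head_tag, "free", TAG_LEN);
    memcpy(n->tail_tag, "free", TAG_LEN);
    free(n);
    return 0;
}

/* Walk a list and count the nodes whose tags are damaged. */
int list_count_bad(const Node *head)
{
    int bad = 0;
    for (; head != NULL; head = head->next)
        if (!node_ok(head))
            bad++;
    return bad;
}
```

Calling `list_count_bad()` at a few points between building the lists and freeing them narrows down where the damage first appears.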




 
 
 


Post by Morris Dove » Sat, 13 Oct 2001 02:16:30



> For future reference, what is a good newsgroup for straight C postings?


--
Morris Dovey
West Des Moines, Iowa USA
 
 
 


Post by Chuck Dillo » Sat, 13 Oct 2001 06:35:55



> I would also like to know where I can find more information about what
> *causes* a SIGBUS. How does this differ from a Segmentation fault? Etc.
> The documents I have found through a google search were not of much help;
> most saying that Segmentation Faults and Bus Errors were almost
> interchangeable.

Both imply bad addresses/pointer values.  SIGSEGV means the address is
not available to your program, IOW points outside your address space.
SIGBUS means what you provided as an address is improper and can't
possibly point to memory on the bus.  This case exists when the hardware
has restrictions on what constitutes a valid address.  For example, hardware
requires addresses to be 32-bit aligned and you give it the value 1.

> For future reference, what is a good newsgroup for straight C postings?

comp.lang.c

> How do you mean 'trashing' memory? How does one go about 'trashing'
> memory?
> I might be able to track down if it happens if I knew the signs of it ;)

The following code will trash memory:
        struct mylist {
          char name[10];
          struct mylist *next;
        };

        {
         struct mylist *node=(struct mylist *)malloc(sizeof(struct mylist));

         node->next=NULL;
         node->name[0]=0;

         ... lots of complicated code ...

         strcpy(node->name,"abcdefghijklmnopqrstuvwxyz");  /* just trashed node->next */

         ... lots more ingenious code ...

         if (node->next)
           free(node->next); /* probably a --> SIGBUS or SIGSEGV */
        }

The strcpy copies 27 bytes to a location where only 10 bytes have been allocated.
Those 17 extra bytes were copied somewhere.  In the above case 4 of those chars ended up
in node->next and those 4 bytes probably don't represent a valid pointer, definitely
not to anything you want them to.
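
For contrast, a minimal sketch of a copy that cannot run past the 10-byte field. It reuses the `struct mylist` from the example above; the helper `set_name` is hypothetical, and it assumes a C99 `snprintf()`, which truncates to the buffer size and reports how long the full string would have been.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct mylist {
    char name[10];
    struct mylist *next;
};

/* Copy `src` into node->name, truncating if necessary so the write
   never goes past the 10-byte field. Returns 1 if the whole string
   fit, 0 if it was truncated. */
int set_name(struct mylist *node, const char *src)
{
    int needed;

    assert(node != NULL && src != NULL);
    needed = snprintf(node->name, sizeof node->name, "%s", src);
    return needed >= 0 && (size_t)needed < sizeof node->name;
}
```

With the 26-letter string from the example, `set_name()` stores only the first nine letters plus the terminator, returns 0 to flag the truncation, and leaves `node->next` untouched.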

This is a simple case but such situations can be far trickier to solve.  Think of
it like this:  Malloc'd memory is all in a bucket.  Also in that bucket are data
structures that support doing bookkeeping, IOW keeping track of malloc'd memory
chunks so that free can figure out what to do.  All of this memory is packed together
and there is nothing to protect the boundaries between what you malloc nor to
protect the bookkeeping data.

If you malloc N bytes at an address and then copy M>N bytes to that address those
M-N bytes will almost surely cause a problem *somewhere* in your code.  It might
just trash bookkeeping data and cause a free to fail or crash.  It might trash
some of your data structures that were malloc'd but they might be totally unrelated
to the data where you copied M bytes to.  The trashed data might have been malloc'd
by some library you called for example.  The trashed memory becomes a kind of booby
trap set to go off when the code that cares about it comes back to access it.

So you have to be very careful about using malloc'd memory.  Something as simple
as forgetting to allocate a byte for the NUL terminator of a string can often
cause a crash.  If you're lucky it will be a reproducible crash.  Most often it
is an intermittent crash that is very hard to find.

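Oh yeah, and automatic variables can be trashed the same way (they live on the
stack rather than in malloc's heap, but nothing guards their boundaries either).
IOW, if you do: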
        void myfunc() {
          char str[10];

          strcpy(str,"asdfadfasdfasdasdf");
        }
You are trashing memory and the symptoms might or might not occur in myfunc().

> >                3) There are tools available for tracking these kinds of problems.
> >                   Debuggers that monitor memory operations, debug versions of
> >                   malloc et al., they are your friend.
> Indeed, and I am not very familiar with them.
> I attempted to use GDB to track+fix my program, but since I am
> unfamiliar with it, I was only able to track down that the error occurred
> when I do:
> "nextCoord = (Coord *) myEvilCoord->next;"

> I also did various printf debugging which I find more useful than many
> debuggers out there.
> However, if you could point out a good unix debugger I would be grateful.

I've never developed on an HP so I'm not sure what to point you to.  Printf
debugging is not going to be very effective with this kind of problem.  Adding
debug print statements moves things in memory and can cause symptoms to move
or go away.  It's really tough to isolate this kind of problem that way.

Since it appears you know what is being trashed, you should be able to use
gdb to watch that address (i.e. &myEvilCoord).  What you do is run the program
in the debugger and after the soon to be trashed data structure is allocated
you tell the debugger to monitor that address and break when it changes.  This
might lead you to the point where it gets trashed and you might be surprised
to find out where it is.

Alternatively investigate whether there is a debug version of malloc on your
system and try using it.  When I see these kinds of things I use dbx on Solaris
and enable memory checking.  It often uncovers the problem quickly but it won't
uncover all kinds of corruption problems.

Have fun,

-- ced

--
Chuck Dillon
Senior Software Engineer
Accelrys Inc., a subsidiary of Pharmacopeia, Inc.

 
 
 


Post by T.. » Sun, 14 Oct 2001 07:04:04


<snip>

>SIGSEGV means the address is
>not available to your program, IOW points outside your address space.
>SIGBUS means what you provided as an address is improper and can't
>possibly point to memory on the bus.

Aha!
Finally someone explains concisely what the difference is without
confusing me.
Thank you very much for your patience and helpfulness!
I am pretty much a newbie to C, and this is the first time I've run across
a SIGBUS ;)
I'm sorry as well for many of my questions probably seeming very ignorant.

Hopefully they/I will become less-so with the answers. :)

>This is a simple case but such situations can be far trickier to solve.  Think of
>it like this:  Malloc'd memory is all in a bucket.  Also in that bucket are data
>structures that support doing bookkeeping, IOW keeping track of malloc'd memory
>chunks so that free can figure out what to do.  All of this memory is packed together
>and there is nothing to protect the boundaries between what you malloc nor to
>protect the bookkeeping data.

>If you malloc N bytes at an address and then copy M>N bytes to that address those
>M-N bytes will almost surely cause a problem *somewhere* in your code.  It might
>just trash bookkeeping data and cause a free to fail or crash.  It might trash
>some of your data structures that were malloc'd but they might be totally unrelated
>to the data where you copied M bytes to.

Okay, so if I understand correctly, I allocate each structure N bytes,
then a structure which in actuality takes M>N bytes might/will overwrite
another structure 'next door' as it were?
This doesn't only happen within the structure itself?
IE:
struct somestruct {
        char yadda[5];
        int a;
        long b;
};

yadda, by your own example, could overwrite a and b, but it could also
overwrite another structure that 'happened' to be next to it?

There was a suggestion saying add in a 'password' char field to a
structure, instead could one 'pad' the structure with  2 large empty char
arrays?
Would this allow you to track down whether or not structures were
overstepping their own bounds?
It seems to me that it would, as long as memory is allocated according to
the order in which you declare your vars in your struct.
Erm, memory *is* allocated according to the order in which you declare
your vars in your struct, right?....?
If memory *is* allocated in this fashion, I guess we could even pad
particular variables within structures to check whether they are
overflowing as well, yes?
Hmm...
EG:
struct somestruct {
        char wastedspace[50];
        char yadda[5];
        char empty[50];
        int a;
        long b;

};

So, if yadda overflows, it will [hopefully] get caught in empty?
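
On the ordering question: C does lay out struct members in declaration order, although the compiler may insert padding between them. A minimal sketch of the guard idea follows, under the assumption that the guards are filled with a known byte pattern so that an overflow can be detected afterwards. The names `guards_init`, `guards_intact`, `GUARD_LEN` and `GUARD_BYTE` are hypothetical, and the guards are kept small here for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GUARD_LEN  8
#define GUARD_BYTE 0x5A   /* arbitrary fill pattern */

struct somestruct {
    char guard_before[GUARD_LEN];
    char yadda[5];
    char guard_after[GUARD_LEN];
    int a;
    long b;
};

/* Fill both guards with the pattern; call once after allocation. */
void guards_init(struct somestruct *s)
{
    assert(s != NULL);
    memset(s->guard_before, GUARD_BYTE, GUARD_LEN);
    memset(s->guard_after, GUARD_BYTE, GUARD_LEN);
}

/* Returns 1 if every guard byte still holds the pattern, 0 if
   something has written into either guard. */
int guards_intact(const struct somestruct *s)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < GUARD_LEN; i++) {
        if (s->guard_before[i] != GUARD_BYTE ||
            s->guard_after[i] != GUARD_BYTE)
            return 0;
    }
    return 1;
}
```

This only notices overflows that stay within the guards and that happen to change a byte; a write that jumps further, or one that stores the same pattern, goes undetected.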

>So you have to be very careful about using malloc'd memory.  Something as simple
>as forgetting to allocate a byte for the NUL terminator of a string can often
>cause a crash.  If you're lucky it will be a reproducible crash.  Most often it
>is an intermittent crash that is very hard to find.

Well, my crash is quite reproducible, but it certainly *is* hard to find
(the cause I mean).
I have a number of test files that reproduce it.
You have mentioned strings quite a bit, and I am wondering whether my
problem comes from the fact that many of my structs have 'nametags' and
such.
Variables representing numbers shouldn't cause problems like this should
they?
An unsigned int of 70000 simply gets converted into 4464 right? ... So I
can eliminate that possibility?
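
The 4464 figure is what a 16-bit unsigned type gives (70000 mod 65536); where unsigned int is 32 bits wide, 70000 is stored unchanged. Either way, conversion to an unsigned type is well defined in C and does not by itself write outside the variable. A small sketch of the wraparound, using a hypothetical helper `wrap_unsigned`:

```c
#include <assert.h>

/* Reduce `value` modulo 2^bits, which is what conversion to an
   unsigned type of that width does. */
unsigned long wrap_unsigned(unsigned long value, unsigned bits)
{
    assert(bits >= 1 && bits <= 31);
    return value & ((1UL << bits) - 1UL);
}
```

For example, `wrap_unsigned(70000UL, 16)` evaluates to 4464.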

I am using strdup() to make copies of strings, and I am no longer sure
whether or not they are actually terminated by nulls.
When I read the man doc on strdup() I didn't notice whether or not it
looks for a terminating null.
*thinks for a moment*
I guess it must, since otherwise it will just start reading random data
past the end of the string, eh??
*goes to try and fix his code*
Hmm... changed all my strdup()'s to strncpy()'s but it still bugs out.
I'm going to have to think about this one for awhile.
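
On the strdup() question: strdup() does rely on the terminating NUL, and it allocates room for it in the copy. strncpy() is the riskier of the two, because it leaves the destination unterminated whenever the source is at least as long as the count given. A minimal sketch of a wrapper that always terminates, with the hypothetical name `copy_terminated`:

```c
#include <assert.h>
#include <string.h>

/* Copy at most dst_size - 1 characters of src into dst and always
   terminate the result. Plain strncpy() leaves dst unterminated
   when strlen(src) >= the count passed to it. */
void copy_terminated(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL && dst_size > 0);
    strncpy(dst, src, dst_size - 1);
    dst[dst_size - 1] = '\0';
}
```

The caller still has to pass the real size of the destination; passing a size larger than the buffer reintroduces the overflow.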

<snip>

>I've never developed on an HP so I'm not sure what to point you to.  Printf
>debugging is not going to be very effective with this kind of problem.  Adding
>debug print statements moves things in memory and can cause symptoms to move
>or go away.  It's really tough to isolate this kind of problem that way.

Yes, I noticed this myself rather early. That's why I turned to gdb....
I'm not yet adept at its uses though, and I wasn't able to do much.
After much playing, I am glad to admit I can actually <pause for Beer
Call> set breakpoints and watches, print data values, etc.
Still, however, I haven't found the source of my bug.

>Since it appears you know what is being trashed, you should be able to use
>gdb to watch that address (i.e. &myEvilCoord).  What you do is run the program
>in the debugger and after the soon to be trashed data structure is allocated
>you tell the debugger to monitor that address and break when it changes.  This
>might lead you to the point where it gets trashed and you might be surprised
>to find out where it is.

<snip>

I've tried doing that.... It's rather difficult finding the right one.
In one of the test files I have, there are 50,000 coords total, and the
(first) corrupted structure occurs at the 14th struct.
I have not *yet* been able to get a smaller test file that generates the
error.
Mainly because I can't make my own test files, and must rely on what my
coworkers give me. Oh well.

Thank you again for all your help and patience, Chuck. :)

-Xlegna

 
 
 


Post by Chuck Dillo » Tue, 16 Oct 2001 23:21:59



> Okay, so if I understand correctly, I allocate each structure N bytes,
> then a structure which in actuality takes M>N bytes might/will overwrite
> another structure 'next door' as it were?
> This doesn't only happen within the structure itself?
> IE:
> struct somestruct {
>         char yadda[5];
>         int a;
>         long b;
> };
> yadda, by your own example, could overwrite a and b, but it could also
> overwrite another structure that 'happened' to be next to it?

Yes, and it could overwrite the location where malloc stored information
about some allocated space.  IOW, memory that your code knows nothing
about explicitly.
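
To make the point concrete without actually corrupting anything, here is a small sketch that uses offsetof to work out how far a write that starts at yadda[0] would reach. The struct is the one from the question; the helper overflow_reach and its return codes are invented for illustration, and the exact offsets depend on the compiler's padding.

```c
#include <assert.h>
#include <stddef.h>

struct somestruct {
    char yadda[5];
    int  a;
    long b;
};

/* Hypothetical helper: given how many bytes are written starting at
 * yadda[0], report what the write would reach.
 * 0 = stays inside yadda (or compiler padding after it)
 * 1 = reaches a or b inside the same struct
 * 2 = runs past the end of the struct into whatever lies next */
static int overflow_reach(size_t bytes_written)
{
    size_t end = offsetof(struct somestruct, yadda) + bytes_written;

    if (end <= offsetof(struct somestruct, a))
        return 0;
    if (end <= sizeof(struct somestruct))
        return 1;
    return 2;
}
```

Anything that returns 2 here lands in memory the struct does not own: possibly a neighbouring structure, possibly malloc's own bookkeeping, which is why the crash can show up far from the faulty copy.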

Quote:

> There was a suggestion saying add in a 'password' char field to a
> structure, instead could one 'pad' the structure with  2 large empty char
> arrays?
> Would this allow you to track down whether or not structures were
> overstepping their own bounds?

Possibly yes; however, you have to make an assumption about how much of an
overflow is occurring.  If you are only copying an extra few bytes then a small
amount of 'pad' would be sufficient.  But if many bytes are being written out
of bounds you need big pads.

Quote:> It seems to me that it would, as long as memory is allocated according to
> the order in which you declare your vars in your struct.
> Erm, memory *is* allocated according to the order in which you declare
> your vars in your struct, right?....?
> If memory *is* allocated in this fashion, I guess we could even pad
> particular variables within structures to check whether they are
> overflowing as well, yes?
> Hmm...
> EG:
> struct somestruct {
>         char wastedspace[50];
>         char yadda[5];
>         char empty[50];
>         int a;
>         long b;
> };
> So, if yadda overflows, it will [hopefully] get caught in empty?

I suggest that if you don't see anyone providing a development tool that
is designed to validate memory usage that you follow a strategy like the
following...  Rather than pepper your data structures with many pads use
a single large pad strategically to find the problem through a process of
elimination.  Make your best estimation of which data structure is being
overrun.  Not what is being clobbered but what is being overrun to clobber
something else.  Pad that structure with a very big buffer.  For example,
if you think it's somewhere in struct foo, add an 8k block of chars to the
end of struct foo and see if the problem goes away.  If it does, then you are
probably somehow overrunning something in struct foo.  If the problem
persists, use the debugger to look at the contents of any corrupted data
structures, including the pad block, and see if you learn anything.  Then
reestimate the most likely source of the problem and move your pad block.
Perhaps elsewhere in struct foo or perhaps to another data structure.
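
The pad strategy described above could be sketched roughly as follows. Everything here is an assumption made for illustration: the name field standing in for the suspect data, the 0xA5 fill pattern, and the two helper functions. Filling the pad with a known pattern makes it easy to see afterwards whether anything wrote into it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAD_SIZE 8192        /* the "8k block of chars" */
#define PAD_BYTE 0xA5        /* arbitrary fill pattern (an assumption) */

struct foo {
    char name[16];                 /* hypothetical suspect field */
    unsigned char pad[PAD_SIZE];   /* temporary debugging pad at the end */
};

/* Fill the pad with a known pattern right after the struct is created. */
static void foo_pad_init(struct foo *f)
{
    assert(f != NULL);
    memset(f->pad, PAD_BYTE, sizeof f->pad);
}

/* Return how many pad bytes no longer hold the pattern; nonzero means
 * something wrote past the fields in front of the pad. */
static size_t foo_pad_damage(const struct foo *f)
{
    size_t i, damaged = 0;

    for (i = 0; i < sizeof f->pad; i++)
        if (f->pad[i] != PAD_BYTE)
            damaged++;
    return damaged;
}
```

Calling foo_pad_damage() at a few checkpoints (or just before each free) narrows down when the overrun happens, and the damaged bytes themselves often hint at what was being copied.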

Also, look at every malloc/calloc/realloc call and consider whether you
are allocating sufficient space.  Consider applying your 8k pad to those
calls (one at a time).

Quote:> Variables representing numbers shouldn't cause problems like this should
> they?

Correct, no assignment to a plain non-pointer variable will cause this
problem.  Bad pointer arithmetic (including an out-of-range array index)
might.  String manipulations and memcpy calls
are notorious for causing these kinds of problems because there is no
protection from unintended corruption of data or of the execution
instructions for that matter.  Also, malloc/calloc/realloc calls that
fail to allocate sufficient space due to a logic error set you up
for memory corruption problems.
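
Two of the most common sizing mistakes can be shown in a short sketch. The coord layout below is a guess (the real fields are not shown in this part of the thread), and coord_new and string_copy are made-up helper names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node type; the real coord fields are not shown here. */
typedef struct coord {
    double x, y;
    struct coord *next;
} coord;

/* sizeof *c always matches the pointed-to type. Writing sizeof(coord *)
 * here would allocate only pointer-sized storage. */
static coord *coord_new(double x, double y)
{
    coord *c = malloc(sizeof *c);

    if (c != NULL) {
        c->x = x;
        c->y = y;
        c->next = NULL;   /* never leave the link uninitialised */
    }
    return c;
}

/* Copy a string into fresh storage. The + 1 is for the terminating NUL;
 * malloc(strlen(s)) would be one byte short. */
static char *string_copy(const char *s)
{
    size_t len;
    char *copy;

    assert(s != NULL);
    len = strlen(s) + 1;
    copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}
```

Setting next to NULL at allocation time also matters for the freeing loop: a list whose last link was never initialised will walk off into garbage when it is freed.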

Quote:

> Yes, I noticed this myself rather early. That's why I turned to gdb....
> Still, however, I haven't found the source of my bug.

Debuggers are great for debugging logic problems but most lack any
capability for tracking down these kinds of problems.  The kind of
problem you have could actually crash the debugger.

I strongly suggest you discuss, with your coworkers and manager, what
development tools are available for finding memory corruption and
leak problems.  Your time is better spent learning to apply such tools
as opposed to using 'manual' methods.  If your organization has no
such tools, I suggest you point out to whoever is in charge that such
tools pay for themselves.

-- ced

--
Chuck Dillon
Senior Software Engineer
Accelrys Inc., a subsidiary of Pharmacopeia, Inc.

 
 
 


Post by Ben Hutching » Thu, 18 Oct 2001 00:18:06




<snip>
> > Erm, memory *is* allocated according to the order in which you declare
> > your vars in your struct, right?....?
> > If memory *is* allocated in this fashion, I guess we could even pad
> > particular variables within structures to check whether they are
> > overflowing as well, yes?
<snip>
> > So, if yadda overflows, it will [hopefully] get caught in empty?

> I suggest that if you don't see anyone providing a development tool that
> is designed to validate memory usage that you follow a strategy like the
> following...

<snip>

Electric Fence is free (in both senses of the word) and does a pretty
good job of trapping both read and write over-runs in heap-allocated
memory.  It works under various flavours of Unix.  You can get it from
<ftp://ftp.perens.com/pub/ElectricFence/>.

<http://www-1.ibm.com/servers/eserver/zseries/os/linux/ldt/whitepaper2...>
describes a couple of other tools as well as Electric Fence, though
since the article is specifically about Linux I don't know whether
they are at all portable.

 
 
 

1. BusLogic BT445 DMA error - Beware the Blue Lightning/486SLC2 and 32 bit busses

Hi All,

   I've just had an unhappy experience and perhaps making it public will
prevent it from happening to others.

About a year ago, I bought an Alaris motherboard "Leopard 486SLC2 System
board (RevC)" - 2 vesa, 4-16 bit ISA, 1-8 bit ISA slots because it was
pretty cheap.  I used it to run Linux (various versions) with a Fahrenheit
1280 video card in one ISA,  WD 8013 ethernet, SIIG I/O board in the 8 bit
slot, USLI coprocessor, and an Adaptec 1540C SCSI controller.  Recently,
the SCSI controller started to give me lots of trouble on the external
chain, and after piddling around with it for far too long, I went out and
got a Buslogic BT445C, based on the numerous good things said about it in
the Hardware HOWTO.

   It didn't work.  After several days, it turns out that it's almost
certainly the MB.  Like some html pages at DiamondMM imply, the problem is
with the bozo CPU, an IBM Blue Lightning 486SLC2.  It's 32 bit internal,
but only 16 bit external so that most cards that expect to find a 32 bit
path to the CPU (almost all VLB cards) will fail.  
   The whole point about VLB is 32 bit thruput so I'm puzzled why Alaris
would even design a board based on the Blue Lightning.  From their own
admission, there are a limited number of VLB cards that will work in this
thing.

So.... while the board is cheap (probably already out of production) and
will run linux, avoid it if you're planning to use the VLB slots.  Chances
are that they won't work.

Thanks to all those that mailed me advice and pointers.

Anyone want to buy a motherboard very cheap?

Cheers
Harry
--
Harry J Mangalam, Microbiology and Molecular Genetics, UC Irvine,
      Irvine, CA, 92717, (714) 824-4824, fax (714) 824 8598
                 --- knowledge is fractal ---
            http://hornet.mmg.uci.edu/~hjm/hjm.html
  Computational Biology..SGI..Woodworking..Bicycling..Linux..WWW

2. HELP!! Removing Linux

3. The Netscape 4.02 Bus error Blues

4. Dual CPU Compaq Proliant problem with Solaris 7

5. Re-Post - gcc error: control reaches end of non-void function..

6. Install Corel Linux from hard drive?

7. Source code: test pointer for seg, bus errors

8. boot without keyboard

9. bus error from this code - why?

10. Netscape bus error fixed, replaced with exit code 135!!

11. Proxy Bad Gateway (502) Errors between front-end ssl and back-end mod_perl server!

12. DeCSS code - Find It, post it, and post it again.

13. Legality of posting code that contains GPL'd code