Segmentation Faults and Bus Errors

Post by Richard E Sgrigno » Thu, 09 Jan 2003 02:54:06



Question below on what may be causing "Segmentation Faults" (coredump)
(Return Code=139) and "Bus Errors".....

First, some info you may need.....
Primary Players:  PeopleSoft Release 8 and MicroFocus COBOL (Server
Express Version 2.0.11)
Operating System:  Solaris 8 Kernel Level 108528-17
Hardware:  Sun E10000 4-Board Domain with 14 CPUs (400 MHz) and 8 GB of
memory

Description of Problem:

We run a handful of COBOL programs within our batch
cycle.....PeopleSoft-delivered General Ledger, Accounts-Payable,
Purchasing, et cetera.....however, of this handful of modules, the
error ONLY seems to affect ONE -- the General Ledger.....

The jobs are initiated from the mainframe (MVS) and execute scripts on
the Open Systems (UNIX) server.  These jobs may run SUCCESSFULLY many
times, anywhere from 10-60 times, but then suddenly a subsequent batch
may abend, randomly.  The main errors are either "segmentation fault"
or "bus error".

------------------- Script Error Message --------------------
signal fault in critical section
signal number: 11, signal code: 1       fault address: 0xfff
libthread panic: fault in libthread critical section : dumping core
(PID: #####
stacktrace:
        feeb4e10
        feebcc18
        0
<directory path>/hm_cobol_delivered.262.:27490 Segmentation
-------------------- End of Script Error Message --------------------

Abended in PROC09 with Return Code = 139 (Segmentation Fault)
Abended in PROC04 with Return Code = 139 (Segmentation Fault)
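
For reference, a return code of 139 is the usual Bourne-style shell encoding of "killed by a signal": 128 plus the signal number, so 139 corresponds to signal 11 (SIGSEGV). A minimal Python sketch of that arithmetic follows. It assumes the calling script reports status the way such shells do, and the helper name `decode_return_code` is made up for illustration.

```python
import signal

def decode_return_code(rc):
    """Split a shell-style return code into (killed_by_signal, number, name)."""
    if rc > 128:
        signum = rc - 128
        try:
            # Names come from the machine running this script, not the server.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown"
        return True, signum, name
    return False, rc, "exit"

print(decode_return_code(139))
```

Signal numbers other than 11 differ between systems: SIGBUS is 10 on Solaris but 7 on Linux, so a bus error on this server would show up as return code 138.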

When the job abends, it is simply reinitiated from the MVS side at the
step where it failed, and it completes successfully.  NOTE: In these
situations where the jobs were re-run, there were NO changes to the
program, environment, configuration, arguments being passed, et
cetera.....it just simply completed successfully on the re-try.

Attempts to Resolve:

PeopleSoft has been consulted on this issue, which has been plaguing
us for several weeks ever since upgrading to PeopleSoft Release 8;
however, there has been no progress.  PeopleSoft is unwilling to move
beyond current level of support if the problem cannot be replicated on
a CONSISTENT basis.  Again, these abends occur randomly.

Questions:

Unsure if this is a PeopleSoft issue, a COBOL (Server Express) issue,
or a UNIX issue.  Could this be related to a memory issue?  I am
providing some "vmstat" data which surrounds the times of some of the
abends for your review.

 procs     memory                  page                 disk         faults        cpu
 r b w    swap    free   re    mf    pi  po fr de sr s7 s8 s7 s7   in    sy   cs us sy id
Mon Jan  6 16:50:00 EST 2003
 2 1 0 5278328  659088 3800 18030 20456   8  8  0  0  9  0 17  0 5055 40629 5919 31 34 36
 0 3 0 5287416  667664 3201  9829 19440  64  0  0  0  6  0  6  0 6209 26650 5653 21 24 55
 0 2 0 5281232  662240 3105  7058 18736  72  8  0  0  7  0  7  0 4081 21691 2559 39 19 43
Mon Jan  6 16:41:00 EST 2003
 2 3 0 5273232  654120 3556  8035 25632  16  0  0  0 18  0  8  0 6340 33407 5951 37 25 38
 1 2 0 5281440  661472 3178 10186 23656 240  0  0  0 41 42 42 55 4975 37476 3677 30 25 45
 2 1 0 5282336  658032 2811 13198 15160  72  8  0  0  8  0  8  0 3931 46298 4554 39 32 29
Mon Jan  6 16:42:00 EST 2003
 2 0 0 5267616  644344 2861  4579 20272 120 16  0  0 12  0 18  0 4306 27143 2985 47 19 34
 0 2 0 5271040  737592 2500  3791 19848  16  8  0  0  2  0  2  0 5519 23832 3817 30 17 53
 0 0 0 5288496  761184  765  7251    88  16  0  0  0  2  0  2  0 2357 20717 2758 23 12 65
Mon Jan  6 16:43:00 EST 2003
 0 0 0 5282336  808976 3561 11009 18504   8  8  0  0  5  0 10  0 4922 20066 3442 18 24 58
 0 0 0 5275568  856912  495  2414  2904   0  0  0  0  1  0  1  0 3232  7951 2478 17  6 78
 0 1 0 5284024  885480  656  3002 16504   0  0  0  0  5  0  4  0 3629 18645 2873 21 10 69
Mon Jan  6 16:44:00 EST 2003
 0 2 0 5271384  942416  525  5494   256   8  8  0  0  5  0  4  0 3029 43548 3065 13 13 74
 0 0 0 5287416  910008  634  2832  2152  64  0  0  0  6  0  6  0 3450 58254 4916 23 12 64
 0 0 0 5287624  967440 1731  2057   344 200 16  0  0 15  5 15  9 2843 41656 2993 21 11 69
Mon Jan  6 16:45:00 EST 2003
 0 1 0 5293968  948328 1663 17499  2664  32 16  0  0 20  0 16  0 3876 43643 6932 25 31 44
 0 2 0 5281368  930984  765  8964  2560   0  0  0  0  0  0  0  0 5381 16128 4547 15 16 69
 0 1 0 5284472 1025248 1395 17755    96   8  8  0  0  1  0  1  0 2559 31961 3999 24 24 52
Mon Jan  6 16:46:00 EST 2003
 0 3 0 5279272 1009928  979 10872  1136  96  8  0  0  5  0  5  0 2986 21320 3601 21 15 65
 1 1 0 5284752 1009552  693  8703   128  40  0  0  0  5  0  5  0 2756 17887 3506 20 12 68
 0 0 0 5273264  979648  266  2719   832   0  0  0  0  0  0  0  0 2488  8195 2655 13  6 81
Mon Jan  6 16:47:00 EST 2003
 0 1 0 5241488  977224 1217 12185  1136  88  8  0  0  6  0  8  0 2854 29627 3929 22 18 59
 0 0 0 5280848  979072  873 10057   216   0  0  0  0  0  0  0  0 2306 16722 3127 18 12 70
 0 0 0 5278128  976592  812  9571   176   0  0  0  0  0  0  0  0 2697 18991 3771 19 13 68
Mon Jan  6 16:48:00 EST 2003
 0 1 0 5281688  991344  831  9884  1800  16  0  0  0  5  1  3  0 3888 22427 4811 25 15 60
 0 0 0 5289832 1006616  805  7820  1456  16  0  0  0  2  0  2  1 3442 29261 4941 25 16 59
 0 1 0 5285352  986696  584  7667   544   0  0  0  0  6  0  6  0 2976 64067 3948 36 17 47
Mon Jan  6 16:49:00 EST 2003
 1 6 0 5274568  699024  727  6336 16288   0  0  0  0  1  0  0  0 7621 55165 8947 39 22 40
 0 1 0 5278824  688080  636  2481  3920   8  8  0  0  1  0  1  0 3118  9568 2587 19  6 75
 0 0 0 5301376  708048  339  4255   168   0  0  0  0  0  0  0  0 2191 12442 2510  6  8 86
Mon Jan  6 16:50:00 EST 2003

What are some of the areas which we should be looking at to resolve
these abends?  Do you feel this is a PeopleSoft issue, and what should
we look for and pass on to PeopleSoft that will persuade them to look
into this further?  Is this a UNIX issue?  What changes can/should be
made on the UNIX side to potentially resolve the issue and/or
eliminate that as a contributing factor?

Any info and/or pointers you may be able to provide, either from
specific experience with this problem or your expertise with how all
the above interact, would be GREATLY appreciated.

You may either respond within the newsgroup or, preferably, direct to
my e-mail address:

Many thanks.

Richard E Sgrignoli

 
 
 

Segmentation Faults and Bus Errors

Post by Richard E Sgrigno » Thu, 09 Jan 2003 23:58:00


I guess I left out some info relating to the VMSTATS and time of abend.....

The abend occurred at 16:48:33 on 6 January.....

 
 
 

Segmentation Faults and Bus Errors

Post by Paul Pluzhniko » Fri, 10 Jan 2003 15:04:52



Quote:> Question below on what may be causing "Segmentation Faults" (coredump)
> (Return Code=139) and "Bus Errors".....

Any number of things, but most likely a program bug.

Quote:> Description of Problem:

> We run a handful of COBOL programs within our batch
> cycle.....PeopleSoft-delivered General Ledger, Accounts-Payable,
> Purchasing, et cetera.....however, of this handful of modules, the
> error ONLY seems to affect ONE -- the General Ledger.....

> The jobs are initiated from the mainframe (MVS) and execute scripts on
> the Open Systems (UNIX) server.  These jobs may run SUCCESSFULLY many
> times, anywhere from 10-60 times, but then suddenly a subsequent batch
> may abend, randomly.  The main errors are either "segmentation fault"
> or "bus error".

These symptoms are quite typical of heap corruption and stack
overflow bugs: most of the time the data that gets "stepped on"
is not critical, but once in a while it is.
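
To make the "once in a while" point concrete, here is a toy Python model, not a reproduction of the actual bug. It assumes each run makes one stray one-byte write at a random offset into a hypothetical 1000-byte heap, and that the run only crashes when the write lands in a small "critical" region (say, allocator or thread-library bookkeeping). The sizes, the uniform randomness, and the function name are all invented for illustration.

```python
import random

def simulate_stray_writes(runs, heap_size=1000, critical_bytes=20, seed=0):
    """Count simulated runs in which a stray write hits critical data."""
    rng = random.Random(seed)            # seeded so the sketch is repeatable
    critical = range(0, critical_bytes)  # hypothetical location of bookkeeping data
    crashes = 0
    for _ in range(runs):
        offset = rng.randrange(heap_size)  # where this run's stray write lands
        if offset in critical:
            crashes += 1
    return crashes

crashes = simulate_stray_writes(runs=10_000)
print(f"{crashes} of 10000 simulated runs crashed")
```

With 20 critical bytes out of 1000, roughly one run in fifty fails, which is the same order as "10-60 successful runs, then an abend". The bug is present in every run; only its visible effect is intermittent.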

Quote:> signal fault in critical section
> signal number: 11, signal code: 1  fault address: 0xfff
> libthread panic: fault in libthread critical section : dumping core
> (PID: #####
> stacktrace:
>    feeb4e10
>    feebcc18
>    0
> <directory path>/hm_cobol_delivered.262.:27490 Segmentation

Somebody corrupted libthread's internal data.
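
The panic line itself carries a little more information than the return code. As a rough Python sketch (the regular expression and the `parse_panic` helper are my own, written against the one line quoted above), it can be pulled apart like this. The two `si_code` values listed for SIGSEGV are the ones defined on Solaris and Linux.

```python
import re

PANIC_LINE = "signal number: 11, signal code: 1       fault address: 0xfff"

# si_code values for SIGSEGV as defined on Solaris and Linux.
SEGV_CODES = {
    1: "SEGV_MAPERR (address not mapped)",
    2: "SEGV_ACCERR (invalid permissions)",
}

PATTERN = re.compile(
    r"signal number:\s*(\d+),\s*signal code:\s*(\d+)\s+"
    r"fault address:\s*(0x[0-9a-fA-F]+)"
)

def parse_panic(line):
    """Return (signal number, signal code, fault address) from a panic line."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("unrecognised panic line")
    return int(m.group(1)), int(m.group(2)), int(m.group(3), 16)

signum, code, addr = parse_panic(PANIC_LINE)
print(signum, SEGV_CODES.get(code, "other"), hex(addr))
```

Taking the posted address at face value, signal 11 with code 1 at 0xfff means an access to an unmapped address in the very first page of memory. That is what a dereference of a null or near-null pointer looks like, which fits a pointer having been overwritten, though it does not say who overwrote it.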

Quote:> When the job abends, it is simply reinitiated from the MVS side at the
> step where it failed, and it completes successfully.  NOTE: In these
> situations where the jobs were re-run, there were NO changes to the
> program, environment, configuration, arguments being passed, et
> cetera.....it just simply completed successfully on the re-try.

That is often the case even for non-threaded programs, and with
threads even a slight change in timing or system load can
make the bug disappear.

Quote:> Attempts to Resolve:

> PeopleSoft has been consulted on this issue, which has been plaguing
> us for several weeks ever since upgrading to PeopleSoft Release 8;
> however, there has been no progress.  PeopleSoft is unwilling to move
> beyond current level of support if the problem cannot be replicated on
> a CONSISTENT basis.  Again, these abends occur randomly.

Tell them you are not going to renew maintenance unless and until
this issue is resolved.

Quote:> Questions:

> Unsure if this is a PeopleSoft issue, a COBOL (Server Express) issue,
> or a UNIX issue.  

It could be any.
Does the program that abends include your code, or is it
a pure PeopleSoft-compiled binary? If the latter, COBOL
probably has nothing to do with it.
I have no idea what you mean by "a UNIX issue".

Quote:> Could this be related to a memory issue?

Are you having any other programs "randomly" coredump?
If not, you can probably exclude your hardware from the
list of culprits. And since your other "modules" (just what
do you call modules?) do not exhibit the same problem, you can
probably exclude Solaris libraries as well.

Quote:>  I am providing some "vmstat" data ...

Irrelevant, I think.
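
One way to check that judgement against the posted numbers: on Solaris, `vmstat` reports `free` in kilobytes and `sr` as the page scanner's scan rate, and a persistently non-zero `sr` is the usual sign of a real memory shortage. The Python sketch below looks at the three samples listed under the 16:48:00 timestamp (the abend was reported at 16:48:33). The column positions are taken from the header in the original post; the function name is made up.

```python
SAMPLES = """\
0 1 0 5281688 991344 831 9884 1800 16 0 0 0 5 1 3 0 3888 22427 4811 25 15 60
0 0 0 5289832 1006616 805 7820 1456 16 0 0 0 2 0 2 1 3442 29261 4941 25 16 59
0 1 0 5285352 986696 584 7667 544 0 0 0 0 6 0 6 0 2976 64067 3948 36 17 47
"""

FREE, SR = 4, 11  # zero-based positions of the "free" and "sr" columns

def memory_pressure(text):
    """Return (lowest free memory in KB, highest scan rate) over all rows."""
    rows = [[int(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]
    min_free_kb = min(row[FREE] for row in rows)
    max_sr = max(row[SR] for row in rows)
    return min_free_kb, max_sr

min_free_kb, max_sr = memory_pressure(SAMPLES)
print(f"lowest free memory: {min_free_kb / 1024:.0f} MB, highest scan rate: {max_sr}")
```

In these samples the scan rate stays at zero and free memory stays close to 1 GB, so nothing here suggests the machine was short of memory when the job died.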

Quote:> What are some of the areas which we should be looking at to resolve
> these abends?

Are you *qualified* to look at these abends?
Do you know how to use a debugger?
If so, what is the crash stack trace?

If not, hire someone who does.
An expert will likely be able to tell you a short list of suspect
areas within a day or two.

Quote:>  Do you feel this is a PeopleSoft issue, and what should

It doesn't matter what we feel. You haven't supplied nearly enough
details to feel anything. You may have better luck talking to other
people who run the same release of PeopleSoft on similar hardware.

Cheers,
--
In order to understand recursion you must first understand recursion.

 
 
 

Segmentation Faults and Bus Errors

Post by those who know me have no need of my nam » Fri, 10 Jan 2003 16:46:35


[fu-t set]

in comp.unix.questions i read:

Quote:>> Could this be related to a memory issue?

>Are you having any other programs "randomly" coredump?
>If not, you can probably exclude your hardware from the
>list of culprits.

provided that at least one other program has been run on the system that
stresses it to the same degree as the one that faults.  if the system is
typically quiescent, existing only to run this job, then it might be that
it is the only one to hit the bad memory, or controller, or cpu, or various
bus interfaces, or a myriad of other devices that may exist.

Quote:>If so, what is the crash stack trace?

if it's in the peoplesoft code it's not likely that much beyond a hex dump
of the stack frames will be available, and given that peoplesoft doesn't
seem to be responding usefully it might be something of a problem actually
finding the culprit.

--
bringing you boring signatures for 17 years

 
 
 
