hi there,
I'm looking for performance tuning tips on digging deeper into mpstat
and lockstat.
I have a machine that is performing poorly, and I can find no info on
the web about where to go with lockstat output. Details are as follows:
The machine has 2 cpus & 4 GB of ram:
# psrinfo -v
Status of virtual processor 0 as of: 02/07/2006 16:11:30
on-line since 12/11/2005 10:05:16.
The sparcv9 processor operates at 1281 MHz,
and has a sparcv9 floating point processor.
Status of virtual processor 1 as of: 02/07/2006 16:11:30
on-line since 12/11/2005 10:05:15.
The sparcv9 processor operates at 1281 MHz,
and has a sparcv9 floating point processor.
# prtconf -v | grep Memo
Memory size: 4096 Megabytes
The load average is quite high:
$ sar -q -f /var/adm/sa/sa20060206
00:00:00 runq-sz %runocc swpq-sz %swpocc
<snip>
08:00:01 29.6 39 0.0 0
08:20:00 34.6 45 0.0 0
08:40:01 34.6 52 0.0 0
09:00:00 39.1 75 0.0 0
09:20:00 40.5 55 0.0 0
09:40:00 38.6 57 0.0 0
10:00:00 43.7 72 0.0 0
10:20:00 28.7 85 0.0 0
10:40:00 26.7 92 0.0 0
11:00:00 41.6 64 0.0 0
11:20:00 45.2 37 0.0 0
11:40:00 57.0 37 0.0 0
12:00:01 70.0 40 0.0 0
12:20:00 58.8 42 0.0 0
12:40:01 51.0 28 0.0 0
13:00:00 75.3 23 0.0 0
13:20:00 52.3 24 0.0 0
13:40:00 41.2 46 0.0 0
14:00:00 54.8 47 0.0 0
14:20:00 34.0 57 0.0 0
14:40:00 53.7 47 0.0 0
15:00:00 43.3 58 0.0 0
15:20:00 41.0 40 0.0 0
15:40:00 53.5 36 0.0 0
16:00:00 64.0 22 0.0 0
16:20:00 44.6 17 0.0 0
16:40:00 76.0 9 0.0 0
17:00:00 72.2 11 0.0 0
17:20:00 52.9 7 0.0 0
<snip>
Average 40.6 24 0.0 0
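(To put that in context: with only 2 cpus, a runq-sz of ~40 means
roughly 20 runnable threads queued per cpu. A rough check, assuming
the sar header lines don't have a numeric second field:
$ sar -q -f /var/adm/sa/sa20060206 | nawk '
    $1 != "Average" && $2 ~ /^[0-9]/ { sum += $2; n++ }
    END { printf "%.1f runnable threads per cpu\n", sum / n / 2 }'
)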
The cpu utilisation is high, but not 100% all day:
$ sar -u -f /var/adm/sa/sa20060206
<snip>
00:00:00 %usr %sys %wio %idle
<snip>
08:00:01 48 19 26 6
08:20:00 44 21 25 10
08:40:01 50 22 23 6
09:00:00 56 28 16 0
09:20:00 49 24 26 1
09:40:00 54 24 19 3
10:00:00 62 25 11 2
10:20:00 73 22 5 0
10:40:00 76 21 3 0
11:00:00 64 23 8 5
11:20:00 41 20 20 19
11:40:00 38 20 15 26
12:00:01 35 28 9 28
12:20:00 49 21 12 18
12:40:01 39 16 12 34
13:00:00 26 20 7 46
13:20:00 32 16 18 33
13:40:00 53 21 18 8
14:00:00 44 25 11 21
14:20:00 54 24 16 6
14:40:00 44 25 17 14
15:00:00 51 25 8 15
15:20:00 48 21 24 6
15:40:00 50 18 17 16
16:00:00 31 17 15 37
16:20:00 31 14 26 29
16:40:00 16 9 7 67
17:00:00 32 11 5 53
<snip>
Average 32 14 19 35
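(As I mention further down, disk load looks negligible; the check
behind that claim is along these lines, watching the service times
(wsvc_t/asvc_t) and %b columns:
$ iostat -xn 5 3
)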
Normally I would read a high load average combined with non-zero idle
cpu as processes waiting on short i/o's being counted in the run
queue. However, the machine is noticeably slow to use, so I don't
think this is a case of just one metric being out of whack.
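One way I know of to separate time spent waiting for a cpu from time
spent sleeping on locks or i/o is per-thread microstate accounting,
which prstat can show on Solaris 9 (LAT = waiting for cpu, LCK =
waiting on user-level locks, SLP = sleeping):
$ prstat -mL 5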
It would seem that there is contention for CPU resources & kernel
mutexes:
# mpstat 2 20
CPU minf mjf xcal intr ithr  csw icsw migr smtx srw syscl usr sys wt idl
  0  193  34  196   93    2   54  162  311   40   0   215  20  10 16  54
  1  203  35  468  503  359  663  189  311  125   0   783  22  11 16  51
CPU minf mjf xcal intr ithr  csw icsw migr smtx srw syscl usr sys wt idl
  0  590 136 1535  131    1 6970  146 1834 1251   4 19560  32  12 47   8
  1  755 108 1381  654  520 8379  130 1926 1233   1 20421  32  18 43   6
CPU minf mjf xcal intr ithr  csw icsw migr smtx srw syscl usr sys wt idl
  0  141 116 1213  111    1 7477  116 2247 1497   0 14384  25  18 46  10
  1  400 144 1593  699  563 8843  177 2167 1498   0 18754  36  14 43   7
CPU minf mjf xcal intr ithr  csw icsw migr smtx srw syscl usr sys wt idl
  0   96 120 1317  114    1 6972  138 1974 1246   0 18495  17  11 56  16
  1  213 113 1445  734  602 7616  124 1945 1291   0 17512  25  13 54   8
CPU minf mjf xcal intr ithr  csw icsw migr smtx srw syscl usr sys wt idl
  0   57 127 1208  134    1 7839  159 2001 1474   0 15872  31  13 50   6
  1    7 148 1430  704  572 8065  131 2016 1412   0 18480  27  15 54   4
CPU minf mjf xcal intr ithr  csw icsw migr smtx srw syscl usr sys wt idl
  0  215 164 1517  169    1 8130  199 2204 1164   0 18284  39  11 48   2
  1  321  88 1139  694  560 7671  135 2193 1480   0 14894  36  22 35   7
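If the migrations (migr ~2000/sec) and involuntary context switches
are part of the cost, one cheap experiment I'm considering is binding
the busiest database processes to one cpu and watching whether
smtx/migr improve (12345 is a placeholder pid):
# pbind -b 0 12345
# pbind -u 12345        (to unbind afterwards)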
The high value for stalled mutexes (smtx) leads me to lockstat, which
shows that a single lock, with cv_wait_sig as the caller, accounts for
almost all the mutex spins and mutex blocks:
# lockstat -i 971 sleep 15
Adaptive mutex spin: 44962 events in 15.148 seconds (2968 events/sec)
Count indv cuml rcnt     spin Lock          Caller
---------------------------------------------------------------------------
43097 96% 96% 1.00 2 0x30000056568 cv_wait_sig+0x1a4
<snip>
Adaptive mutex block: 239 events in 15.148 seconds (16 events/sec)
Count indv cuml rcnt     nsec Lock          Caller
---------------------------------------------------------------------------
232 97% 97% 1.00 163433 0x30000056568 cv_wait_sig+0x1a4
<snip>
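My next thought is to grab stack traces for those events so I can see
who is calling cv_wait_sig; if I've read the lockstat man page
correctly, -s records stacks to a given depth and -D limits output to
the top events:
# lockstat -s 10 -D 5 sleep 15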
The high value for xcal (in mpstat) might be reduced by increasing
segmap_percent, but I find this unlikely as the memory situation is
generally healthy, i.e. the file cache has free memory available to it:
$ /opt/RMCmem/bin/prtmem
Total memory:             3916 Megabytes
Kernel Memory:             544 Megabytes
Application:              1394 Megabytes
Executable & libs:          21 Megabytes
File Cache:                400 Megabytes
Free, file cache:         1551 Megabytes
Free, free:                  6 Megabytes
$
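(For reference, if I did want to experiment with it, my understanding
is that segmap_percent is an /etc/system tunable that takes effect
after a reboot, along the lines of:
set segmap_percent=20
but as I said, I doubt memory is the issue here.)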
Virtual Adrian has nothing to say except for cpu busy:
# /opt/RICHPse/bin/se virtual_adrian.se
<snip>
Adrian detected CPU overload (red): Tue Feb 7 16:14:12 2006
CPU overload
 procs      memory            page          faults           cpu
 r b w   swap     free    pi  po sr  rt  in    sy    cs smtx us sy id
17 1 0 10602132 1457662 3756 28  0 600 785 31207 16015 2789 33 15 53
Adrian detected CPU overload (amber): Tue Feb 7 16:14:42 2006
CPU busy
 procs      memory            page          faults           cpu
 r b w   swap     free    pi  po sr  rt  in    sy    cs smtx us sy id
 6 1 0 10598704 1458007 2911 24  0 600 793 37778 16319 2923 28 13 59
Adrian detected CPU overload (red): Tue Feb 7 16:15:12 2006
CPU overload
 procs      memory            page          faults           cpu
 r b w   swap     free    pi  po sr  rt  in    sy    cs smtx us sy id
29 1 0 10597442 1456310 3462 50  0 600 827 35931 16184 2740 37 17 46
The load on network, memory and disk seems negligible. I have another
12 machines with identical hardware, identical o/s configuration (built
by Flash), and very similar database configuration and user usage.
None of them are experiencing these problems.
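One thing I do plan to try is diffing the kernel configuration between
this box and a healthy one, e.g.:
$ sysdef > /tmp/sysdef.`hostname`
(run the same on a good machine and diff the two files)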
Truss only shows system calls, not activity inside the kernel, and I
don't have DTrace as this is Solaris 9, not Solaris 10.
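That said, truss can at least count and time the system calls of a
busy process, and I believe lockstat's profiling-interrupt mode can
act as a crude kernel profiler on Solaris 9 (12345 is a placeholder
pid):
# truss -c -p 12345
# lockstat -kIW -D 20 sleep 30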
Any ideas on how I can progress the investigation into cv_wait_sig?
Any help appreciated (especially from legends like Casper Dik,
Brendan Gregg or Adrian!!!) :)
cheers
nf