>***Jack
> >You don't indicate how long this box has been up. How long did it take
> >it to get to 12 billion isam instructions? That would be kind of
> >interesting to know. 1200 checkpoint waits - so 1000 hours? Or are you
> >getting checkpoints more frequently than CKPTINTVL - my guess is yes, so
> >maybe 2-300 hours?
>***Dirk
> Yes, sorry, the system was up for approximately 4 days by the time I sent
> the mail.
>***Jack
> >You don't include an onstat -d - but from the -D I can see some layout
> >issues. Your dbspaces are deep - probably designed around tables? They
> >are not balanced to the number of CPUs. A table fragmented over eight or
> >twelve dbspaces might perform better than one built on a single
> >seven-chunk dbspace. I'm not really sure what your schema is or your
> >application - so YMMV. However you can see that the load on the disks is
> >not even. Your disks are getting hammered. Your number of temp spaces
> >should be a multiple of the number of CPUs. I would pull the whole thing
> >apart and rebuild it just to get the disks to group on the onstat -d
> >output.
>***Dirk
> I am starting to see disk bottlenecks, yes. The previous weekend I
> unloaded and fragmented one of our bigger tables that gets accessed
> a lot.
> Fragmented by round robin, and the disks seem much better this week.
> (This specific table gets a lot of reads and not so many writes.)
> Could you perhaps elaborate a bit more on the issue of balancing dbspaces
> to the number of CPUs? Are you just referring to fragmentation of tables
> so that multithreading can take place?
>***Jack
>If you have a table fragmented over 4 dbspaces and 4 CPUs, you will get 4
>scan threads. Odds are that they will run together and complete in about
>the same amount of time. If you have 5 dbspaces, then you will get 5 scan
>threads - one of those will have to bounce around from CPU to CPU. 8
>threads and you would get two per CPU - no bouncing. Mind you, this is best
>when your database has to do scans of all fragments.
>Fragmented indices also tend to be shallower - hence not as many reads when
>looking for data. If the index is big...
>Round Robin is a lovely thing when it comes time to load, but it means no
>fragment elimination when it comes time to read. Read on.
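(Just to make the trade-off concrete - a rough sketch, with made-up table,
column and dbspace names, so check the exact syntax against the manual:
round robin spreads rows evenly but gives the optimizer nothing to
eliminate with, while expression fragmentation lets a read with a predicate
skip whole fragments.)

  -- round robin: even spread for loads, but no fragment elimination on reads
  create table scans_rr (scan_id integer, scan_date date, depot smallint)
    fragment by round robin in dbs1, dbs2, dbs3, dbs4;

  -- expression: a query filtering on scan_date can skip whole fragments
  -- (date literals assume an mm/dd/yyyy DBDATE setting)
  create table scans_ex (scan_id integer, scan_date date, depot smallint)
    fragment by expression
      (scan_date <  '01/01/2003') in dbs1,
      (scan_date >= '01/01/2003' and scan_date < '04/01/2003') in dbs2,
      (scan_date >= '04/01/2003' and scan_date < '07/01/2003') in dbs3,
      remainder in dbs4;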
***Dirk
This makes perfect sense, thank you. The only thing that I still wonder
about is the round robin fragmentation. I would guess from what you are
saying that round robin fragmentation doesn't help for reading, because the
records are not contiguous, if you understand what I mean - i.e. if I have
to do a sequential scan over a range of records, accessing them will
actually be slower because they are now scattered across fragments?
>***Jack
> >Nice write cache hit rate. Read cache hit - ok.
> >4.5 million compresses? Someone is doing a lot of deletes to cause this.
> >Find them, blame them, make them pay. Look into why they are deleting -
> >perhaps truncates, or table rewrite would be more effective.
> In our environment we do a lot of deletes, yes. Our data gets old quickly
> and then gets moved to "history" tables. The data is still online and
> reports / queries still take place on the older data, especially the
> very recent records.
> By the way, what exactly are compresses - is this index rebuilds?
>When you 'delete' a row, that space is not automatically re-available.
>There is a 'next space' pointer in the page header which always points to
>the end of the data - new data is always placed at the 'end' of the page.
>Over time the engine will recognize that a page is getting 'holey' and will
>re-write the page, freeing up space at the end - this is a compress. Note
>that there is stuff at the end of the physical page - but let's not go
>there now - hence the tic marks around the word 'end'.
>In a delete situation like this, consider fragmenting by the date of the
>data (provided it is not updated - that would be a disaster). Then when
>you need to delete, you can detach the old data fragment and re-attach it
>as a new fragment for the new data. This may imply a lot of fragments -
>depending on how low a granularity you need and how long you keep the
>data. It may also mean unbalancing your fragments against the number of
>CPUs. But it's way fast. I've done 33 fragments on a 16-CPU box for this
>sort of thing (40GB of data). The table screamed.
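(A sketch of the detach/attach cycle being described, again with invented
names and a monthly granularity assumed - the ALTER FRAGMENT syntax here is
from memory and worth verifying before trying it on a live table:)

  -- roll the oldest month out of the table into its own standalone table
  alter fragment on table scans detach dbs_200201 scans_200201;
  -- archive or drop scans_200201 at leisure

  -- build the new month's table in its own dbspace, then roll it in
  create table scans_new (scan_id integer, scan_date date, depot smallint)
    in dbs_200301;
  alter fragment on table scans attach scans_new
    as (scan_date >= '01/01/2003' and scan_date < '02/01/2003');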
***Dirk
I will have to take another look at fragmentation. I just don't know if it
will help with our compression rate. It is highly likely that the programs
do deletes on individual records as statuses change, and that I will not
have much control over this.
>***Jack
> >2.7 million seq scans? Do you have a lot of small tables without
> >indices? Or without stats updated? If sequential scans are desirable
> >then you might want to look into light scans and bumping up your DS
> >parameters to support those.
> Yes, we have some small tables and do a lot of sequential scans.
>Check into those and check the size of the tables involved - if they are 8
>pages (or whatever your disk read is) then that means one disk access to
>get to your data. If they are 32 pages (e.g.) that means on average two
>disk accesses. An indexed read is always one index read + 1 data read, and
>for each level your index descends you may need one more read to find your
>key. Hence sequential scans CAN be more effective in some situations.
>Generally in OLTP they are undesirable.
>A quick check might be:
>  select tabname from systables where npused > 8 and nindexes = 0;
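(A complementary check, if you want to see where the sequential scans are
actually landing - this assumes the sysmaster:sysptprof column names I
remember, so treat it as a sketch:)

  -- tables accumulating the most sequential scans since the engine came up
  select dbsname, tabname, seqscans, isreads
    from sysmaster:sysptprof
   where seqscans > 0
   order by seqscans desc;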
***Dirk
Thanks, I will have a look at these.
>***Dirk
> Update stats runs every night and is up to date. I double checked this
> recently by running the query -
> select tabname, max(constructed) from
> systables,sysdistrib
> where systables.tabid = sysdistrib.tabid
> group by 1
> order by 2,1
> I have a problem with the DS (PDQ) parameters. When I give a value to
> MAX_PDQPRIORITY, all the users start using PDQ, even without PDQPRIORITY
> being set in their environments. I had to disable MAX_PDQPRIORITY to get
> around this problem.
> I did the same for DS_TOTAL_MEMORY, gave it the lowest value I could.
>***Jack
>Yes, it is an issue. XPS has some stored procedures which will be run
>before a user session starts where you can change that, but I see your
>problem. You can also set the environment variable PDQPRIORITY=1 to
acheive
>the same effect.
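(For what it's worth, a report session can also raise PDQ only around the
heavy query and drop it again afterwards - a sketch, with a placeholder
table name:)

  set pdqpriority 40;
  select scan_date, count(*) from scans group by 1;  -- the big scan/sort
  set pdqpriority 0;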
> >While we're there, your RA_PAGES and RA_THRESHOLD are way out of whack.
> >They should be closer to one another. With a setting of 128, you are
> >doing a lot of reading ahead - which goes into your buffers and implies
> >management of same. This may be desirable, in which case your threshold
> >should increase some to meet it - 120, for example. If you want light
> >scans then 128/120 is an ideal setting. However if you are doing OLTP
> >work - which the rest of this engine looks like - then you want to
> >decrease these dependent on the size of your tables. Again, I don't know
> >your application or how the data is being read/used. You may have a
> >buffer management issue here: 13 million buffer waits, 42 million latch
> >waits - depending on how long the engine has been up. You may have a
> >problem here where you are flooding your buffers with read-ahead for a
> >random-access set of applications.
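(If it helps while experimenting, the values the engine is actually running
with can be read out of sysmaster rather than the onconfig file - assuming
the sysconfig column names here are right:)

  select cf_name, cf_original, cf_effective
    from sysmaster:sysconfig
   where cf_name in ('RA_PAGES', 'RA_THRESHOLD');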
> I will play around with these values, thank you.
> >Impressively low lockwaits - especially considering 14 billion lock
> >requests. Why so many? Is there really a need for all of those locks, or
> >has some developer gone amok and decided that he will always lock
> >everything (s)he reads - even when running a report. Find them....
> I do not know where the lock requests come from - is there any way I can
> trace this?
> We do a lot of inserts on this system. To give you a better picture, this
> is a transport company scanning thousands of barcodes into and out of
> vehicles. Isn't this perhaps where the lock requests come from?
>You can look at onstat -u and see who has a lot of lock requests and then
>see what they're doing. For something like that I generally do up a script
>which scans every 'n' seconds; when it finds something big in the locks
>column it calls a function to run onstat -g sql and onstat -g ses against
>that session.
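(An SQL alternative to scripting onstat -u, assuming the sysmaster table
and column names here are as I remember them - a sketch only:)

  -- sessions with the most lock requests, plus enough detail to chase them
  select p.sid, s.username, s.pid, p.lockreqs, p.isreads, p.iswrites
    from sysmaster:syssesprof p, sysmaster:syssessions s
   where p.sid = s.sid
   order by p.lockreqs desc;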
> >Are you really supporting 2600 connections?
>***Dirk
> My NETTYPE was originally set to 4,300, which I thought meant 1200
> connections. Tech Support (and also a recent discussion on this list)
> told me that it is in fact only 300 connections, and they requested me
> to change it to 4,600, which according to them is now only 600
> connections.
> Apparently the number of poll threads does not affect the number of
> connections like I originally thought.
>***Jack
>Surprises me too. Of course it's one of those 'set and forget' things.
>I'll have to go read Manuella for that one.
>***Dirk
> The change to 4,600 was due to the error
> -25580 System error occurred in network function.
>***Jack
> >First thing I would do is decrease the read ahead.
> >Next thing I would do is plan a weekend of pulling apart the dbspaces and
> >rebuilding them.
> >Then I would re-visit and see what effect that had.
> >Of course that's my .02 worth. I'm sure others will gently correct me or
> >provide other insight.
>***Dirk
> Thank you for all the input. We are moving from the current L2000 to
> an N4000 server next weekend (at long last!) and from there on I think
> most of my effort will go into disk reorgs.
>***Jack
>"Know thy data and know thy queries". Once you can figure out what the
>users are doing against what data you can plan more effectively on how to
>deal with it.
>Let me know if the RA helps things.
***Dirk
Thanks, will let you know, and thank you for all the info.