Linpack (9/28) c.be FAQ

Posted by Eugene N. Miya, Wed, 10 Nov 1993 21:25:10



9       Linpack                                 <This panel>
10
11      NIST source and .orgs
12      Measurement Environments
13      SLALOM
14
15      12 Ways to Fool the Masses with Benchmarks
16      SPEC
17      Benchmark invalidation methods
18
19      WPI Benchmark
20      Equivalence
21      TPC
22
23
24
25      Ridiculously short benchmarks
26      Other miscellaneous benchmarks
27
28      References
1       Introduction to FAQ chain and netiquette
2       Benchmarking Concepts
3       PERFECT
4
5       Performance Metrics
6       Temporary scaffold of New FAQ material
7       Music to benchmark by
8       Benchmark types

w/ great help from Patrick McGehearty

The LINPACK benchmark is a very simple LU decomposition of a dense linear
system (Gaussian elimination), maintained by Jack Dongarra, one of the
developers of the LINPACK library and the netlib numerical software server.

Ref: Dongarra's article in CACM on Netlib.
And ACM SIGARCH Computer Architecture News and SIGNUM Newsletter.

It consists of three parts:
100x100 ("LINPACK Benchmark") All Fortran, no changes allowed
1000x1000 ("TPP", best effort) No limits on algorithm selection, or use of
        assembly language to improve performance.
and
"A Look at Parallel Processing" (problem size = NxN with N selected by vendor)

Advantages:
Simple, fairly portable FORTRAN.  One of the shorter benchmarks.
Source is small enough to be carried on disk or in Jack's laptop without
consuming too much porting time.
A good attempt at experiment control, with stringent execution requirements.

Dongarra also records the compiler options used to invoke the Fortran
compilers.  Record keeping is good.  Reports are quickly available
electronically and published with some frequency in Supercomputing Review.

The 100x100 case represents a well-defined type of floating point
computation.  The 1000x1000 case allows vendors to showcase their product's
potential if they are so inclined.  The third problem set is intended for
use by vendors of highly parallel systems which find even the 1000x1000
problem set too small when spread over hundreds or thousands of processors.
In this case, the vendor selects N and demonstrates the asymptotic effective
rate of their highly parallel machine.

Disadvantages:
Diminishing parallelism during the decomposition (as in all Gaussian
elimination).  It only tests some numeric aspects of a system, on
data with well-defined behavior patterns.

The 100x100 problem set is quite small by today's standards, and can have
problems with accurate measurements on those machines which do not offer
sub-millisecond timer resolution.  Also, the 100x100 problem set is too
small to show the performance potential of machines with high startup costs,
such as massively parallel or parallel-vector architectures.  It also can
fit entirely in a machine with a large cache, failing to measure the cache
miss behavior of a slightly larger problem.  Finally, the algorithm used by
the all-Fortran code is suboptimal for machines which can do significantly
more floating point operations than memory-to-register transfers.
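
On machines with coarse clocks, the usual workaround is to repeat the
measured kernel enough times that the elapsed interval dwarfs the timer
resolution, then divide by the repetition count.  A minimal sketch, using
the Fortran 90 SYSTEM_CLOCK intrinsic rather than the machine-specific
timers of the era; the repetition count and the stand-in DAXPY loop are
inventions of this sketch, not part of the benchmark rules:

      program timerep
c     Repeat a kernel NREP times so the elapsed time swamps the clock
c     granularity; the DAXPY-like loop is only a stand-in for the
c     factorization being measured.
      integer i, k, nrep, c0, c1, rate
      parameter (nrep = 10000)
      double precision x(100), y(100), secs
      do 10 i = 1, 100
         x(i) = 1.0d0
         y(i) = 2.0d0
   10 continue
      call system_clock(c0, rate)
      do 30 k = 1, nrep
         do 20 i = 1, 100
            y(i) = y(i) + 3.0d0 * x(i)
   20    continue
   30 continue
      call system_clock(c1, rate)
      secs = dble(c1 - c0) / dble(rate) / dble(nrep)
      write(*,*) 'seconds per repetition =', secs, '  y(1) =', y(1)
      end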

The 1000x1000 problem set is intended to address these concerns.  Each
machine vendor is allowed to use whatever algorithm they choose, including
assembly language if they desire.  By changing algorithms and increasing the
problem size, many vendors are able to demonstrate the full potential of
their machines on the 1000x1000 problem set.  Generating true "best effort"
results is not free, and vendors which do not put a high priority on
floating point performance or which do not expect a significant improvement
from 100x100 to 1000x1000 may not report results for the 1000x1000 problem
set.

NETLIB benchmark index (Linpack benchmark)
net...@ornl.gov
        send index from benchmark
includes linpack (100x100,300x300,1000x1000).

The entries of the report change drastically with time.  Anyone
interested in floating point performance should get a new copy from
netlib from time to time.

Of additional interest about these sizes is that they are not
the powers of 2 which characterize many benchmarks.  Powers of 2
can bias in favor of some architectures and bias against other
architectures.

                   ^ A  
                s / \ r                
               m /   \ c              
              h /     \ h            
             t /       \ i          
            i /         \ t        
           r /           \ e      
          o /             \ c    
         g /               \ t  
        l /                 \ u
       A /                   \ r
        <_____________________> e  
                Language

=====temp tag ======

Date: Sat, 9 Jan 93 07:13:12 PST
From: d...@validgh.com (David G. Hough on validgh)
Subject: comp.benchmarks FAQ nits

I think you mean

  Jack Dongarra, one of the developers of the

> Jack Dongarra the developer of the
> LINPACK library and the netlib numerical software server

and under "Disadvantages" it's worth mentioning that on high performance
systems, the 100x100 system is too fast to be interesting, and the 1000x1000
measures memory system much more than fp hardware.

Also the "MFLOPS" rating can be a little misleading if Winograd-Strassen
methods are used in a matrix-multiply oriented version of LU decomposition.
I think the actual elapsed time to do the decomposition is a more useful
statistic than MFLOPS,
although I guess this argument is really with Dongarra rather than the FAQ.

Finally, it's worth mentioning that the "Linpack Report" has changed
drastically in the last couple of years, and anybody who hasn't looked at
it lately should get a new copy from netlib.

Date: Mon, 15 Feb 93 18:46:04 PST
From: d...@validgh.com (David G. Hough on validgh)
Subject: Re:  1000*1000 linpack, your note to me

You're right, with respect to pipelined machines, I should have said
throughput instead of latency.   With respect to current PC's,
of course, there's hardly any difference.

More to the point, vector architectures are sometimes tuned
so that the memory throughput for saxpy-type operations is matched,
more or less, to the floating-point throughput for saxpy-type operations,
so that indeed neither is a bottleneck, by definition.  
This is great for unit-stride
implementations of linear equation solving based on saxpy-type operations,
but may not always work so well
for other linear algebra problems like matrix multiply in
which one dimension is inherently not unit stride.

I'll ask Miya to revise that paragraph of my comment on the FAQ as follows:

and under "Disadvantages" it's worth mentioning that on high performance
systems, the 100x100 system is too fast to be interesting, and on RISC
workstations with cache memory systems, the 1000x1000
measures memory system much more than fp hardware, and even so tends to
produce misleading expectations because, while linear equations can be solved
using unit-stride "saxpy" inner loops, many other problems of scientific
computation don't fit that mold.
Matrix multiplication, for instance, is a primary primitive at the core of
the LAPACK routines for linear algebra, which are intended to supersede the
Linpack library that this benchmark was originally written for.

Date: Tue, 16 Feb 93 11:56:36 EST
From: j...@watson.ibm.com
Subject: 1000*1000 linpack, your last note to me

         I remain in substantial disagreement with what you are saying.
     1.)  Even if TPP measured memory performance more than fp hardware
(which it doesn't, at least on high end systems) I don't see why this
would be a disadvantage since real applications tend to be much more
memory bound than TPP.  The real disadvantage is TPP doesn't stress the
memory system as much as a typical large application.
     2.)  You suggest TPP has the same memory bandwidth to fp throughput
ratio as a unit stride daxpy operation.  This is untrue.  A unit stride
daxpy operation stresses the memory system much more than TPP.
For example on an IBM 550 workstation it is fairly easy to max out the
fpu on TPP achieving .84 of the peak rate.  However a unit stride daxpy
is hopelessly memory bound on the 550 running at only a bit more than .2
of the fpu peak rate.  This means any vector machine with enough memory
bandwidth to peak out the fpu on daxpy operations is hopelessly fpu
bound on TPP.
     3.)  You suggest that matrix multiply is inherently not unit
stride.  This is untrue.  It is trivial to do matrix multiply using unit
stride daxpy operations.  It is easy to further reduce the stress on the
memory system.  In fact TPP and matrix multiply are basically the same
problem.  Matrix multiply should in fact get a better fraction of the
peak rate than TPP.  For example an IBM 550 workstation achieves .87 of
the fpu peak rate doing a 1000*1000 by 1000*1000 matrix multiply.
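
The loop ordering in question looks roughly like the following sketch,
written for this FAQ (not the ESSL code):

      subroutine colmm(a, b, c, lda, n)
c     Matrix multiply C = C + A*B ordered so that the inner loop is
c     a unit stride daxpy down a column.  A sketch only, not ESSL.
      integer lda, n, i, j, k
      double precision a(lda,n), b(lda,n), c(lda,n), t
      do 30 j = 1, n
         do 20 k = 1, n
            t = b(k,j)
c           unit stride daxpy: add t * column k of A to column j of C
            do 10 i = 1, n
               c(i,j) = c(i,j) + t * a(i,k)
   10       continue
   20    continue
   30 continue
      return
      end
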
         I believe a legitimate objection to TPP is that it allows the
vendor to use any method.  This means for example he may code the
whole thing in assembler which is not a realistic option for most real
applications.
         It is also true that TPP is very vectorizable and does not stress
the memory system.  Whether this is an advantage or disadvantage depends
on what you are using the benchmark for.
         Finally you also stated the 100*100 benchmark is too fast to be
interesting.  This makes no sense to me.  A legitimate comment would be
that it is small enough to fit in cache on many systems.  Whether this
is good or bad depends on what you are trying to measure.
                          James B. Shearer

Date: Wed, 17 Feb 93 09:31:32 PST
From: d...@validgh.com (David G. Hough on validgh)
Subject: Re:  1000*1000 linpack, your last note to me

Our e-mail discussion has revolved around what's wrong with the following
paragraph:

> and under "Disadvantages" it's worth mentioning that on high performance
> systems, the 100x100 system is too fast to be interesting, and the 1000x1000
> measures memory system much more than fp hardware.

and a replacement I proposed

> and under "Disadvantages" it's worth mentioning that on high performance
> systems, the 100x100 system is too fast to be interesting, and on RISC
> workstations with cache memory systems, the 1000x1000
> measures memory system much more than fp hardware, and even so tends to
> produce misleading expectations because, while linear equations can be solved
> using unit-stride "saxpy" inner loops, many other problems of scientific
> computation don't fit that mold.
> Matrix multiplication, for instance, is a primary primitive at the core of
> the LAPACK routines for linear algebra, which are intended to supersede the
> Linpack library that this benchmark was originally written for.

which might be further edited to something like

> and under "Disadvantages" it's worth mentioning that:
> The 100x100 system is too small to be interesting on anything faster than
> PC's.    The 1000x1000 system may mostly measure floating-point unit
> performance on low-end PC's and on high-end vector processors, and mostly
> memory system bandwidth on cache-based RISC workstations. Since linear
> equation solving may be readily coded by unit-stride inner loops,
> and the 1000x1000 Linpack may be solved that way or by any other method,
> including assembly language - and the algorithm used needn't be published -
> it is seldom useful for predicting performance of realistic scientific
> applications, which may not lend themselves to unit-stride
> or machine-specific assembly language algorithms.

Whether this content expansion is in the best interest of somebody reading
the comp.benchmarks FAQ for advice is arguable.   The whole notion of a
monthly-cycle FAQ in progress seems to baffle many.

Thus there are many ways to do matrix multiplication, and it's often not easy
to tell which is the best in advance of trying them.  But a digression on that
subject is more than this comp.benchmarks FAQ really needs.

> From: "John D. McCalpin" <uunet!perelandra.cms.udel.edu!mccalpin>

> I am not sure what the trouble is on the DEC Alpha systems.  While the
> RS/6000 runs LINPACK 1000 at about 85% of "peak", the DEC Alpha machines
> run it at no better than 56% of "peak".   This may be simply due to
> code immaturity, but since the rules allow DEC to do a fully assembly-
> language implementation if desired, the result is disturbing....

My guess is that the high-end RS/6000 implementations are
unusual among current RISC workstations in the amount of design
effort devoted to memory bandwidth,
so that they approach their "PEAK" floating-point ratings better than
most others.   Also in the amount of manpower devoted to finding optimal
matrix multiply/matrix factorization codes for specific implementations.

It's the belief of Sun's hardware designers, right or wrong, that such designs
don't scale down to cost-effective low-end mass-market machines.
And it's my belief that the assembly language manpower effort is
not worthwhile for products with two-year lifetimes, especially since it's
so hard to get such codes incorporated into ISV's products even if available.

And it's the belief of some other Sun people that it's not a good idea
to promote the idea of model-specific software optimization; that tends to
fragment the RISC market and thus
undermine the credibility of RISC as an alternative to x86 PC's.

At least as far as current Sun designs go,
even if the fpops took zero time that wouldn't help performance much on
systems with caches << 8 MB, which are indeed memory bound for 1000x1000
Linpack.

Date: Wed, 17 Feb 93 23:01:12 EST
From: j...@watson.ibm.com
Subject: 1000*1000 linpack, your last note

         I remain in disagreement.  Instead of expanding your comment
I believe it should just be removed because it is incorrect.
         You continue to assert TPP is memory bound on cache based work-
stations.  This is not true on the high end IBM machines.  I would like
to know why you believe it to be true on DEC or HP or SGI machines.  For
that matter I would like to see the calculations which show it's memory
bound on Suns.  Do you believe matrix multiply is memory bound on Suns?
         In any case the TPP benchmark requires less memory bandwidth
than just about any real application.  Hence if a machine cannot max
out the fpu on TPP the extra peak rate is basically useless.  TPP is
measuring the effective peak fpu rate.  The fact that this effective
peak rate may be below the theoretical peak rate on machines with lousy
memory systems is an advantage of TPP not a disadvantage.
         Your objections to TPP seem to boil down to Sun doesn't do well
on it therefore it is a poor benchmark.  I do not believe this is a
legitimate objection.
         I also disagree with your dismissal of the 100*100 benchmark.
Some large applications will solve moderate sized sets of linear systems
millions of times.  Linpack 100*100 may be a perfectly reasonable
benchmark for predicting the performance of such applications.  On some
systems accurately timing Linpack 100*100 may be a little tricky;
however, this is a detail, not a fundamental flaw.
                           James B. Shearer

From: David.Ho...@Eng.Sun.COM (David Hough)
Subject: 1000x1000 Linpack

Thanks to your insightful comments, I've sharpened what I would want to convey
about 1000x1000 in the comp.benchmarks FAQ:

  What a 1000x1000 Linpack MFLOPS claim tells you about performance on
  scientific computations may be difficult to discern.  The TPP listings in
  the Linpack report, of which some but not all are produced by unpublished
  assembly language or Fortran codings tuned for very specific computer models,
  are perhaps less informative than would be multi-part listings of the best
  1000x1000 performances obtained with

    1) any algorithm,
    2) any algorithm coded in standard Fortran-90,
    3) any algorithm coded in standard Fortran-77,
    4) the fixed Fortran-77 algorithm specified for 100x100 Linpack.

  Such a listing would suggest the relationship
  between obtainable performance and recoding effort that might obtain for
  other scientific applications.   Some high-performance
  systems can obtain close to their "guaranteed speed limit"
  peak floating-point performance with suitably specialized algorithms,
  yet will be limited by memory bandwidth with more general codes.

> >          You continue to assert TPP is memory bound on cache based work-
> > stations.  This is not true on the high end IBM machines.  I would like
> > to know why you believe it to be true on DEC or HP or SGI machines.

McCalpin has already testified about Alpha, and Mashey has commented on
comp.arch that SGI machines can show performance degradation on linear algebra
problems that exceed the size of the secondary cache, similar to what I reported
for the SS10/41.   Some extracts from Dongarra's 12/7/92 report:

        RS 6000/970, 50 Mhz     84 MFLOPS achieved out of 100 limit     84%
        RS 6000/580, 62.5 Mhz   80 of 125                               64%
        Alpha 200 Mhz           112 of 200                              56%
        HP 720, 50 Mhz          58 of 100                               58%
        SGI Crimson 50 MHz      32 of 50                                64%

Since the HP system is capable of issuing two
fpops per cycle in some circumstances, I expect that the correct "Theoretical
Peak" - really the guaranteed speed limit -
for this system is 100 MFLOPS rather than the 50 MFLOPS listed, just as
RS/6000 systems list their speed limit as twice the clock rate.

One of my own interim results:
        SS10/41, 40 MHz         17 of 40                                43%

The last is not comparable to the others, being written in standard Fortran and
probably subject to further improvement in source code and compilers.   I've
been studying matrix multiplication on SS10 multiprocessors for a couple
of months and will be posting a report eventually.   Extreme sensitivity to
details that affect memory usage but not the number of fpops suggests that
the SS10 performance on large matrices will always be memory bound, and the
results for the systems other than 6000/970 suggest, for this problem,
that they too are either
inherently memory bound or are not yet close to finding the optimal algorithm.
So for them, at this stage of the game, 1000x1000 Linpack looks memory bound,
even though some other systems may not be, and realistic applications
are usually even more memory bound.   It may even be true that systems for
scientific computation should be designed so that 1000x1000 Linpack will be
FPU-bound with readily discernible and reasonably portable algorithms,
but the marketplace may prioritize things differently.

Date: Thu, 18 Feb 93 19:48:25 EST
From: j...@watson.ibm.com
Subject: 1000*1000 linpack

         I have no major objections to your latest effort.
         Concerning your other comments, while it is true that other
workstations do not achieve as high a fraction of the theoretical peak
rate on TPP as the IBM machines it does not follow that TPP is memory
bound on these machines.  The performance may be falling short for
other reasons.
         For example the latency of the multiply and add pipelines on
the DEC alpha is I believe 6 cycles compared to 2 for a fused multiply-
add on the IBM machines.  This may cause scheduling problems which
could account for some of the shortfall.
         The HP 720 achieves 36 of 50 megaflops on TPP, a ratio of .72.
I don't know where you got the numbers in your note.  58 is the specfp92
for the 720 and 50 is the correct peak rate.  The 720 has multiply and
add pipelines which can each accept instructions every other cycle.
This means if both pipelines issue an instruction in some particular
cycle (using the special HP instruction which does this) then neither
can issue an instruction the next cycle.  Hence the peak rate is equal
to the megahertz rate.  This holds for the 730 and 750 as well.  For
the 735 (99 megahertz, 103 TPP, 198 peak rate) HP improved its pipeline
design so that the multiply and add pipelines can now each accept operands
every cycle.  Hence for the 735 the peak rate is twice the megahertz rate.
However, note this peak rate can only be achieved by issuing the special
multiply-add instruction every cycle.  I suspect this is
awkward to do (does the special instruction preserve all its operands?)
and that this accounts in part for the HP 735 ratio of .52.
         Your TPP value for the IBM 580 is obviously wrong since it is
less than that for the 970 (a slower machine).  Dongarra's report dated
1/13/93 gives 105 also a ratio of .84.  The IBM models 560, 950, 550,
530H, 540 (Dongarra's report incorrectly states the 540 has a peak of
66, the correct value is 60), 930 and 530 all have ratios of .84 (or
.83).  The code achieving these values is (or could easily be put) in
standard fortran 77 and is sold by IBM (in object form) as part of the
ESSL6000 library.  I believe other high performance vendors also make
available to customers the code they use for TPP.  The techniques used
to achieve high performance on the IBM machines are not secret.  They
are explained at length in IBM publications such as "Optimization and
Tuning Guide for the XL Fortran and XL C compilers" (SC09-1545-00).
                          James B. Shearer

Date: Fri, 19 Feb 93 15:01:20 PST
From: David.Ho...@Eng.Sun.COM (David Hough)
Subject:  1000*1000 linpack

>          Concerning your other comments, while it is true that other
> workstations do not achieve as high a fraction of the theoretical peak
> rate on TPP as the IBM machines it does not follow that TPP is memory
> bound on these machines.  The performance may be falling short for
> other reasons.
>          For example the latency of the multiply and add pipelines on
> the DEC alpha is I believe 6 cycles compared to 2 for a fused multiply-
> add on the IBM machines.  This may cause scheduling problems which
> could account for some of the shortfall.

I think floating-point latency problems, being fixed (constant),
can usually be overcome by compiler
techniques, unless there aren't enough registers.   That would be the case
for ld/st latency too, except the latter isn't constant in complicated memory
systems, which is why I'm inclined to attribute performance problems there
until proven otherwise.

> I don't know where you got the numbers in your note.

I commented in the FAQ
that the Linpack benchmark report had changed a lot in the last
couple of years.   Evidently it's changed a lot even in the last couple
of months.   Thanks for the updates.

> The code achieving these values is (or could easily be put) in
> standard fortran 77 and is sold by IBM (in object form) as part of the
> ESSL6000 library.  I believe other high performance vendors also make
> available to customers the code they use for TPP.  The techniques used
> to achieve high performance on the IBM machines are not secret.  They
> are explained at length in IBM publications such as "Optimization and
> Tuning Guide for the XL Fortran and XL C compilers" (SC09-1545-00).

At Sun we discovered it was a mistake to assume that ISV's, in general, would
go to much trouble to optimize for a particular platform.  Most (not all) are
mainly interested in how fast you can run their existing portable higher
level language source code without any changes on their part.   When it comes
to platform-specific optimizations, SPARC is usually lower in priority
than PC's and Macintosh, with the other RISC vendors lower than that, all
strictly based on installed base.    This is different from the situation with
supercomputers.   Many supercomputer sites have their own source code and
performance specialists to tune it.

The relevance to this discussion is that while TPP numbers as currently
constituted are probably highly relevant to supercomputers, the 1000x1000
performance on public portable code would probably be more relevant to the
situation of most workstation users using third-party software.

Interestingly enough, the 2/93 issue of Sun World features the RS/6000 and
has an interesting discussion of its pros and cons.

From: neid...@nestvx.enet.dec.com (Burkhard Neidecker-Lutz)
Subject: Re: [l/m 3/23/92] Linpack                      (9/28)  c.be FAQ

Given that I've seen this posting before, I think you should stop posting
old, outdated information...

In article <C8CqE2....@nas.nasa.gov> eug...@amelia.nas.nasa.gov (Eugene N. Miya) writes:

>> From: "John D. McCalpin" <uunet!perelandra.cms.udel.edu!mccalpin>

>> I am not sure what the trouble is on the DEC Alpha systems.  While the
>> RS/6000 runs LINPACK 1000 at about 85% of "peak", the DEC Alpha machines
>> run it at no better than 56% of "peak".   This may be simply due to
>> code immaturity, but since the rules allow DEC to do a fully assembly-
>> language implementation if desired, the result is disturbing....

        But we didn't at the time of the report....

> Some extracts from Dongarra's 12/7/92 report:
>    RS 6000/580, 62.5 Mhz   80 of 125                               64%
>    Alpha 200 Mhz           112 of 200                              56%
>    HP 720, 50 Mhz          58 of 100                               58%

And here are actual numbers from more recent runs (same hardware, better
code):

        Alpha, 200 Mhz (DEC 10000/610):         155 of 200              77 %

So can we just put to rest the myth about the RS/6000 being infinitely
better for such large codes?

                Burkhard Neidecker-Lutz

Distributed Multimedia Group, CEC Karlsruhe  
Software Motion Pictures & BERKOM II Project
Digital Equipment Corporation
neidec...@nestvx.enet.dec.com

Date: Thu, 9 Sep 93 14:27:00 PDT
From: br...@oregon.cray.com (Brad Carlile)
Subject: Linpack in comp.benchmarks FAQ

Hi,

I think you are missing an important part about the Linpack benchmarks.  
Central to the differences is the amount of data movement required to perform
the algorithms.  The key is Compute Intensity, a term defined by Hockney &
Jesshope.  Compute intensity is defined as:

     Compute Intensity = Operations/word

This can be used to estimate performance as well since the memory bandwidth
limited performance of an algorithm is determined by:

     Performance = Intensity * Bandwidth

A complete LU solver has a compute intensity of:

     Compute Intensity = (2/3*N**3 operations)/(2*N**2 words) = N/3 = .3333*N

This sounds wonderful; even a 100x100 Linpack has a compute intensity of 33.
However, the rules say that you can only optimize the FORTRAN provided.
It was written with BLAS 1 algorithms (DAXPY).  Daxpy has a compute intensity
of 2/3 (two operations per 3 memory references) no matter what the size of the
matrix.  This requires a lot of memory bandwidth to get any performance.
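
As a worked illustration of the two intensities above (the bandwidth figure
below is made up purely for the example, not a measurement of any machine):

      program cibw
c     Performance estimate = compute intensity * memory bandwidth.
c     The bandwidth figure is hypothetical, chosen only to show the
c     gap between the DAXPY coding and the whole-solver intensity.
      double precision bw, cidax, cilu, pdax, plu
c     assume a memory system that delivers 50 million words/second
      bw    = 50.0d6
c     DAXPY coding: 2 flops per 3 words, regardless of matrix size
      cidax = 2.0d0 / 3.0d0
c     whole 100x100 factorization: N/3 = 33 operations per word
      cilu  = 100.0d0 / 3.0d0
      pdax  = cidax * bw
      plu   = cilu * bw
      write(*,*) 'DAXPY-coded limit (MFLOPS)  ', pdax / 1.0d6
      write(*,*) 'full-reuse limit  (MFLOPS)  ', plu / 1.0d6
      end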

The Linpack 1000 with no limits on algorithm means that everyone uses a LAPACK
solver based on the BLAS 3 kernels (DGEMM).  These have a compute intensity
that is equal to the blocking used in the algorithm.
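
A crude sketch of where that blocking-dependent intensity comes from (the
block size nb is a free parameter here; this is not the LAPACK/DGEMM or
ESSL code):

      subroutine blkmm(a, b, c, lda, n, nb)
c     Naive blocked matrix multiply C = C + A*B with block size nb.
c     Each nb x nb block stays in use for about nb passes, so the
c     operations-per-word ratio grows roughly with nb.  A sketch
c     only, not the LAPACK/DGEMM or ESSL code.
      integer lda, n, nb, i, j, k, ii, jj, kk
      double precision a(lda,n), b(lda,n), c(lda,n)
      do 60 jj = 1, n, nb
         do 50 kk = 1, n, nb
            do 40 ii = 1, n, nb
               do 30 j = jj, min(jj+nb-1, n)
                  do 20 k = kk, min(kk+nb-1, n)
                     do 10 i = ii, min(ii+nb-1, n)
                        c(i,j) = c(i,j) + a(i,k) * b(k,j)
   10                continue
   20             continue
   30          continue
   40       continue
   50    continue
   60 continue
      return
      end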

Most vendors understand this, but most users don't realize that this is the
true limiting factor for Linpack.

I would change your description from:

> It consists of three parts:
> 100x100 ("LINPACK Benchmark") All Fortran, no changes allowed
> 1000x1000 ("TPP", best effort) No limits on algorithm selection, or use of
>       assembly language to improve performance.
> and
> "A Look at Parallel Processing" (problem size = NxN with N selected by vendor)

Change this to:
------------------------------------------------
It consists of three parts:
100x100 ("LINPACK Benchmark") All Fortran, no changes allowed, old algorithm
                              that has low compute intensity and makes poor
                              use of memory bandwidth.
1000x1000 ("TPP", best effort) No limits on algorithm selection, or use of
              assembly language to improve performance.  Best implementations
              currently use LAPACK solvers that make efficient use of memory.
and
"A Look at Parallel Processing" (problem size = NxN with N selected by vendor)
              Best implementations currently use LAPACK solvers that make
              efficient use of memory.

Brad Carlile
Cray Research Superservers, Inc
br...@oregon.cray.com