Optimized BLAS/Lapack routine


Post by Kazushige Go » Sat, 10 Feb 2001 22:27:08



  This is an announcement of optimized BLAS/LAPACK routines for Alpha
EV5 and EV6.

You can download it from

ftp://www.netstat.ne.jp/pub/Linux/Linux-Alpha-JP/BLAS/

Features:
  This is an optimized BLAS (Level 1, Level 3, and part of Level 2)
and partial LAPACK (GESV, LASWP, GETF2, GETRF) library for Alpha.  The
Level 3 routines and some LAPACK routines (GETRF, LASWP) are
parallelized with POSIX threads.  You can choose between the SINGLE
and SMP versions by editing the file "Makefile.rule".

Supported OS:
  Linux/Alpha and Tru64 UNIX (it works fine on V4.0F and V5.1; GNU
make is required).
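
  For reference, the routines keep the standard Fortran BLAS/LAPACK
interfaces, so they can be called from C in the usual way (trailing
underscore, all arguments passed by reference).  The sketch below uses
DGESV, one of the LAPACK routines included in this package; the symbol
name, library name and link line are assumptions for illustration, not
taken from the announcement.

  /* solve.c -- minimal sketch of calling the library from C */
  #include <stdio.h>

  /* Fortran LAPACK driver: solves A*x = b (dgesv_ symbol name assumed) */
  extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                     int *ipiv, double *b, int *ldb, int *info);

  int main(void)
  {
      int n = 3, nrhs = 1, lda = 3, ldb = 3, info, ipiv[3];
      /* column-major (Fortran order) 3x3 matrix and right-hand side */
      double a[9] = { 4, 2, 1,   2, 5, 3,   1, 3, 6 };
      double b[3] = { 1, 2, 3 };

      dgesv_(&n, &nrhs, a, &lda, ipiv, b, &ldb, &info);
      printf("info = %d, x = %g %g %g\n", info, b[0], b[1], b[2]);
      return 0;
  }

  /* Possible link line (names and paths are guesses):
   *   cc solve.c -L/path/to/this/library -lblas -lpthread -lm
   * The SMP version needs -lpthread because the Level 3 and GETRF
   * code is parallelized with POSIX threads. */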

Good Luck,
  g...@statabo.rim.or.jp

=====================================================================

1. Total Performance(Linpack GETRF)

   The Linpack TPP benchmark now reaches 1025 MFlops (XP1000, 21264
667MHz, Tru64 UNIX V4.0F; this is the peak, the average is about
1015 MFlops), which matches Compaq's announced performance.  On Linux,
TPP is a little slower (the peak is 1016 MFlops, the average 1010 MFlops).
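
   As a sanity check, the "mflops" column in tables 1-1 and 1-2 below
follows directly from the standard Linpack operation count for order n,
2/3*n^3 + 2*n^2, divided by the reported total time.  A minimal
calculation in C (the time is simply copied from table 1-1):

  /* tpp_mflops.c -- reproduce the mflops figure from the reported time */
  #include <stdio.h>

  int main(void)
  {
      double n     = 1000.0;    /* matrix order used by the TPP run     */
      double total = 6.526e-1;  /* "total" time from table 1-1, seconds */
      double ops   = 2.0/3.0*n*n*n + 2.0*n*n;  /* Linpack operation count */
      printf("%.1f MFlops\n", ops / total / 1.0e6);  /* about 1024.6 */
      return 0;
  }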

     1-1. TPP performance(XP1000, 21264 667MHz, Tru64 UNIX V4.0F)
  -----------------------------------------------------------------------
     norm. resid      resid           machep         x(1)          x(n)
  4.56250082E+00  5.06539255E-13  2.22044605E-16  1.00000000E+00  1.00000000E+00

    times are reported for matrices of order  1000
      factor     solve      total     mflops       unit      ratio
 times for array with leading dimension of 1001
  6.437E-01  8.940E-03  6.526E-01  1.025E+03  1.952E-03  1.165E+01
  end of tests -- this version dated 10/12/92
  -----------------------------------------------------------------------

     1-2. TPP performance(LX164, 21164 600MHz, Tru64 UNIX V5.1)
  -----------------------------------------------------------------------
     norm. resid      resid           machep         x(1)          x(n)
  4.15450074E+00  4.61242156E-13  2.22044605E-16  1.00000000E+00  1.00000000E+00

    times are reported for matrices of order  1000
      factor     solve      total     mflops       unit      ratio
 times for array with leading dimension of 1001
  1.015E+00  2.731E-02  1.042E+00  6.417E+02  3.116E-03  1.861E+01
  end of tests -- this version dated 10/12/92
  -----------------------------------------------------------------------

     1-3. GETRF performance(XP1000, 21264 667MHz, Linux-2.4.1,
                                        "()" is CXML-5.0 performance)
  --------------------------------------------------------------------------
   SIZE |     SGETRF     |     DGETRF     |    CGETRF      |    ZGETRF
  ------+----------------+----------------+----------------+----------------
    100 | 616.71( 232.45)| 545.55( 193.18)| 660.58( 418.63)| 634.77( 305.95)
    400 | 978.10( 535.13)| 838.54( 439.63)| 998.10( 734.06)| 938.81( 551.33)
   1000 |1096.57( 773.86)|1014.98( 682.78)|1099.80( 922.46)|1047.88( 791.71)
   2000 |1130.48( 902.63)|1055.53( 834.52)|1123.82(1018.72)|1086.88( 911.97)
   3000 |1144.48( 971.80)|1087.45( 905.77)|1135.26(1056.64)|1102.71( 960.29)
   4000 |1158.67(1003.11)|1098.55( 948.78)|1135.94(1076.42)|1113.87( 990.93)
  --------------------------------------------------------------------------

     1-4. GETRF(SMP) performance(DP264, 21264 500MHz x 2, Linux-2.2.18)
  --------------------------------------------------------------------------
   SIZE |     SGETRF     |     DGETRF     |    CGETRF      |    ZGETRF
  ------+----------------+----------------+----------------+----------------
    100 | 341.18         | 341.18         | 545.89         | 454.83
    400 | 970.43         | 856.25         |1343.64         |1230.10
   1000 |1476.88         |1340.51         |1538.49         |1479.29
   2000 |1585.37         |1462.91         |1580.13         |1545.11
   3000 |1629.23         |1501.86         |1614.24         |1557.35
   4000 |1643.55         |1510.83         |1617.98         |1572.15
  --------------------------------------------------------------------------

     1-5. GETRF performance(LX164, 21164 600MHz, Linux-2.2.18,
                                        "()" is CXML-5.0 performance)
  --------------------------------------------------------------------------
   SIZE |     SGETRF     |     DGETRF     |    CGETRF      |    ZGETRF
  ------+----------------+----------------+----------------+----------------
    400 | 728.60( 619.29)| 624.51( 514.30)| 874.32( 672.55)| 728.60( 495.36)
   1000 | 819.02( 728.53)| 660.86( 558.70)| 941.25( 764.95)| 784.28( 595.47)
   2000 | 850.70( 794.59)| 719.85( 646.24)| 988.58( 822.58)| 819.74( 646.31)
   3000 | 880.03( 817.35)| 743.50( 677.19)|1005.40( 847.46)|  Out of memory
  --------------------------------------------------------------------------

2.  BLAS Level 3 Performance

2-1. GEMM routines

  Older GEMM routines (including CXML's GEMM) are slow on small
matrices.  This release improves small-matrix performance dramatically
by adjusting the block size dynamically and suppressing system calls.
Large-matrix performance is also improved a little: DGEMM reaches
1206 MFlops, and on a machine with a DDR cache it may reach 1230 MFlops
(92.5% of peak performance).
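
  The post does not describe the blocking scheme itself, so the
following is only an illustrative sketch of the idea: a plain blocked
C = C + A*B where the block size is chosen from the problem size, so
that small matrices are not penalized by an overhead tuned for large
ones.  The heuristic in pick_block() is hypothetical.

  /* gemm_block.c -- sketch of cache blocking with a size-dependent block */
  #include <stdio.h>

  /* Hypothetical heuristic: small problems are handled as a single
   * block, larger ones use a fixed cache-friendly block size. */
  static int pick_block(int n)
  {
      return (n <= 64) ? n : 96;
  }

  /* C += A * B for square n x n row-major matrices */
  static void gemm_blocked(int n, const double *A, const double *B, double *C)
  {
      int nb = pick_block(n);
      for (int ii = 0; ii < n; ii += nb)
          for (int kk = 0; kk < n; kk += nb)
              for (int jj = 0; jj < n; jj += nb)
                  for (int i = ii; i < ii + nb && i < n; i++)
                      for (int k = kk; k < kk + nb && k < n; k++) {
                          double a = A[i*n + k];
                          for (int j = jj; j < jj + nb && j < n; j++)
                              C[i*n + j] += a * B[k*n + j];
                      }
  }

  int main(void)
  {
      double A[4] = { 1, 2, 3, 4 }, B[4] = { 5, 6, 7, 8 }, C[4] = { 0 };
      gemm_blocked(2, A, B, C);
      printf("%g %g %g %g\n", C[0], C[1], C[2], C[3]);  /* 19 22 43 50 */
      return 0;
  }

  For scale, the 21264 can retire two floating-point operations per
cycle, so a 667MHz part peaks at about 1334 MFlops; 1230 MFlops is
indeed roughly 92.5% of that.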

     2-1. GEMM performance(XP1000, 21264 667MHz, Linux-2.4.1,
                        No-Transposed,  "()" is CXML-5.0 performance)
          *) sizes 8 - 200 are peak performance
  --------------------------------------------------------------------------
   SIZE |     SGEMM      |     DGEMM      |    CGEMM       |    ZGEMM
  ------+----------------+----------------+----------------+----------------
      8 | 697.66(  32.47)| 681.65(  24.02)| 566.81(  86.06)| 512.77(  60.72)
     16 | 971.39( 107.98)| 972.60(  85.53)| 792.64( 303.32)| 717.21( 173.79)
     24 |1058.44( 184.70)|1058.20( 152.53)| 938.35( 479.13)| 875.50( 332.73)
     48 |1136.14( 545.26)|1130.35( 431.06)|1051.16( 862.81)| 973.43( 713.48)
     64 |1126.03( 746.81)|1031.52( 582.65)|1077.58( 970.89)|1021.62( 839.38)
    100 |1167.60( 954.24)|1064.11( 826.18)|1113.63(1078.54)|1061.30( 990.31)
    144 |1173.76(1078.98)|1111.62( 968.76)|1135.79(1133.13)|1095.61(1064.44)
    200 |1200.83(1145.67)|1142.48(1056.37)|1154.26(1169.39)|1109.87(1114.74)
    400 |1184.96(1107.40)|1071.52(1098.27)|1123.04(1194.26)|1094.35(1147.66)
   1000 |1231.71(1200.55)|1193.69(1143.45)|1186.36(1195.59)|1144.59(1148.33)
   2000 |1252.07(1227.28)|1202.42(1157.27)|1189.10(1198.91)|1165.85(1152.46)
   3000 |1248.96(1218.57)|1199.80(1170.99)|1187.22(1200.16)|1155.04(1172.45)
   4000 |1252.95(1237.43)|1206.30(1169.26)|1191.71(1199.17)| Out of memory
  --------------------------------------------------------------------------

2-2. Other Level 3 routines

  The other Level 3 routines are also improved, since they are built
on top of the improved GEMM and GEMV routines.
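
  Why these routines track GEMM performance: each of them can be
blocked so that nearly all floating-point work lands in a GEMM-like
update, with only a small triangular or symmetric kernel per diagonal
block.  A rough sketch for a lower-triangular solve follows; the block
size NB and the unblocked inner solve are illustrative, not the
library's actual code.

  /* trsm_block.c -- blocked solve of L * X = B (L lower triangular,
   * n x n; B is n x m and is overwritten by X; row-major storage) */
  #include <stdio.h>

  #define NB 2   /* hypothetical block size */

  static void trsm_blocked(int n, int m, const double *L, double *B)
  {
      for (int k = 0; k < n; k += NB) {
          int kb = (k + NB <= n) ? NB : n - k;

          /* small forward substitution on the diagonal block */
          for (int j = 0; j < m; j++)
              for (int i = k; i < k + kb; i++) {
                  double x = B[i*m + j];
                  for (int p = k; p < i; p++)
                      x -= L[i*n + p] * B[p*m + j];
                  B[i*m + j] = x / L[i*n + i];
              }

          /* GEMM-style update of the trailing rows: B2 -= L21 * X1.
           * For large n almost all flops are spent here, so TRSM speed
           * follows GEMM speed. */
          for (int i = k + kb; i < n; i++)
              for (int j = 0; j < m; j++)
                  for (int p = k; p < k + kb; p++)
                      B[i*m + j] -= L[i*n + p] * B[p*m + j];
      }
  }

  int main(void)
  {
      double L[9] = { 2, 0, 0,
                      1, 3, 0,
                      4, 5, 6 };
      double B[3] = { 2, 7, 32 };
      trsm_blocked(3, 1, L, B);
      printf("X = %g %g %g\n", B[0], B[1], B[2]);  /* 1 2 3 */
      return 0;
  }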

     2-2. Other Level 3 performance(XP1000, 21264 667MHz, Linux-2.4.1,
        No-Transposed, Upper, Non-Unit, "()" is CXML-5.0 performance)
 ------------------------------------------------------------------------------
 FUNC(SIZE)|  Single Real   |  Double Real   | Single Complex |Double Complex
 ----------+----------------+----------------+----------------+----------------
 TRSM  200 | 968.76( 308.37)| 880.09( 251.07)|1020.47( 464.44)| 945.24( 325.44)
      1000 |1148.36( 771.74)|1074.18( 695.84)|1095.45( 911.86)|1099.07( 779.44)
      3000 |1170.10( 975.88)|1144.26( 931.50)|1154.61(1054.18)|1132.81( 958.63)
 ----------+----------------+----------------+----------------+----------------
 TRMM  200 | 967.82( 282.85)| 862.35( 248.34)|1031.46( 482.91)| 897.69( 322.04)
      1000 |1144.53( 768.54)|1069.89( 701.05)|1097.58( 912.20)|1086.52( 599.84)
      3000 |1179.92( 982.40)|1150.31( 925.54)|1156.26(1055.74)|1127.97( 955.99)
 ----------+----------------+----------------+----------------+----------------
 SYMM  200 |1082.91( 772.35)| 994.35( 653.78)|1071.47(1029.83)|1024.44( 886.60)
      1000 |1195.76(1077.46)|1102.83( 788.23)|1143.53(1141.23)|1107.63(1072.97)
      3000 |1223.63(1105.22)|1146.79(1044.05)|1161.99(1148.40)|1114.69(1088.97)
 ----------+----------------+----------------+----------------+----------------
 SYRK  200 | 890.57( 426.08)| 826.70( 816.74)|1021.52(1049.39)| 964.87( 938.61)
      1000 |1140.09( 985.11)|1077.68( 980.45)|1096.32(1119.22)|1034.97( 804.01)
      3000 |1182.35(1084.61)|1101.77(1033.34)|1150.74(1127.06)|1102.38(1054.79)
 ----------+----------------+----------------+----------------+----------------
 SYR2K 200 | 880.62( 991.39)| 822.16( 881.54)| 945.11(1124.25)| 785.15(1021.71)
      1000 |1121.03(1175.67)| 972.98(1063.17)|1105.07(1163.93)| 994.64(1146.42)
      3000 |1195.83(1193.61)|1096.98(1104.29)|1070.97(1187.44)|1067.48(1152.66)
 ------------------------------------------------------------------------------

 
 
 

Lapack/Blas Library compiling on Sun Solaris

Hi,

Does anyone have experience compiling the BLAS and LAPACK libraries on
Solaris 8/9 (using f77)?

How long does it usually take on a 4-6 CPU Sun workgroup server?

Mine has been running for 2-3 hours, and I am not sure whether
something is wrong.

-- Daniel
