Fast BLAS routine(Level 3)

Post by Kazushige Go » Tue, 25 Jul 2000 04:00:00


I have released optimized BLAS routines (all of Level 3 and some of Level 2).

Although the GEMM-based Level 3 routines are available from netlib, I
have written separate optimized (blocked) routines.  The available
routines are listed at the end of this message.
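
For readers unfamiliar with blocking, the idea can be sketched in plain C. This is only an illustration under stated assumptions, not the code in this library (which is tuned for the Alpha): the function name `blocked_dgemm`, the block size `BLOCK`, and the restriction to C := C + A*B with column-major storage are all hypothetical choices made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block size; a tuned library would choose it per cache level. */
#define BLOCK 32

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/*
 * C := C + A * B for column-major matrices.
 * A is m x k (leading dimension lda), B is k x n (ldb), C is m x n (ldc).
 */
void blocked_dgemm(size_t m, size_t n, size_t k,
                   const double *a, size_t lda,
                   const double *b, size_t ldb,
                   double *c, size_t ldc)
{
    assert(lda >= m && ldb >= k && ldc >= m);
    for (size_t jj = 0; jj < n; jj += BLOCK) {
        size_t jend = min_size(jj + BLOCK, n);
        for (size_t pp = 0; pp < k; pp += BLOCK) {
            size_t pend = min_size(pp + BLOCK, k);
            for (size_t ii = 0; ii < m; ii += BLOCK) {
                size_t iend = min_size(ii + BLOCK, m);
                /* Multiply one tile of A by one tile of B, so the
                   working set stays small enough to remain in cache. */
                for (size_t j = jj; j < jend; j++) {
                    for (size_t p = pp; p < pend; p++) {
                        double bpj = b[p + j * ldb];
                        for (size_t i = ii; i < iend; i++)
                            c[i + j * ldc] += a[i + p * lda] * bpj;
                    }
                }
            }
        }
    }
}
```

The three outer loops walk over tiles and the three inner loops do the arithmetic within one tile; the result is the same as the unblocked triple loop, only the order of memory accesses changes.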


1.  Fast and Fast

    Most Level 3 routines run at over 1 GFlops on a 677MHz 21264
    and are faster than CXML or ATLAS (they use my optimized routine).
    In particular, the complex routines (c-, z-) run much faster than
    CXML and ATLAS.
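
    A GFlops figure for GEMM is usually derived from the operation
    count and the measured run time.  The helper below is a sketch of
    that arithmetic, assuming the common convention of 2*m*n*k flops
    for real GEMM and 8*m*n*k for complex GEMM; the name `gemm_gflops`
    is hypothetical, and the post does not say how its numbers were
    measured.

```c
#include <assert.h>

/*
 * Rate of an m x n x k GEMM in GFlops.
 * Assumed convention: a real multiply-add counts as 2 flops,
 * a complex multiply-add as 8 flops (6 for the multiply, 2 for the add).
 */
double gemm_gflops(double m, double n, double k,
                   double seconds, int is_complex)
{
    assert(seconds > 0.0);
    double flops_per_madd = is_complex ? 8.0 : 2.0;
    return flops_per_madd * m * n * k / seconds / 1e9;
}
```

    For instance, a real 1000 x 1000 x 1000 multiply that takes two
    seconds works out to 1 GFlops under this convention.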

2.  Size-independent

    These routines sustain high speed even when the matrices are
    large, and small-matrix performance is better than before.

3.  Auto-detect architecture

    You do not have to check your machine's architecture, because
    these Level 3 and Level 2 routines automatically detect whether
    it is based on ev5 or ev6.  You only need to link your program
    with this library.
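
    One common way to get this behaviour is to select an
    implementation through a function pointer at run time.  The sketch
    below is an assumption about how such dispatch can be organised,
    not a description of this library's internals.  The names
    `select_daxpy`, `daxpy_generic` and `daxpy_unrolled` are
    hypothetical, and the `implver` argument stands in for the value
    of the Alpha IMPLVER instruction (1 for ev5, 2 for ev6).

```c
#include <assert.h>
#include <stddef.h>

typedef void (*daxpy_fn)(size_t n, double alpha,
                         const double *x, double *y);

/* Stand-in for an ev5-tuned kernel: a plain loop computing y += alpha*x. */
static void daxpy_generic(size_t n, double alpha,
                          const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Stand-in for an ev6-tuned kernel: the same result, unrolled by four. */
static void daxpy_unrolled(size_t n, double alpha,
                           const double *x, double *y)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        y[i]     += alpha * x[i];
        y[i + 1] += alpha * x[i + 1];
        y[i + 2] += alpha * x[i + 2];
        y[i + 3] += alpha * x[i + 3];
    }
    for (; i < n; i++)            /* remainder elements */
        y[i] += alpha * x[i];
}

/*
 * Pick a kernel from an IMPLVER-style value.  On a real Alpha the
 * caller would obtain this value from the processor (assumption).
 */
daxpy_fn select_daxpy(long implver)
{
    assert(implver >= 0);
    return implver >= 2 ? daxpy_unrolled : daxpy_generic;
}
```

    A library would typically call such a selector once and cache the
    returned pointer, so the caller links a single library and never
    names the architecture.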

4.  'R' option is supported

    Some Level 3 routines support the 'R' option (non-transposed,
    conjugated), as in CXML.
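
    To make the difference from the standard options concrete, the
    following sketch returns element (i, j) of op(A) for each option
    letter, with 'R' meaning conjugation without transposition as
    described above.  The helper name `op_elem` is hypothetical and
    column-major storage is assumed; it is an illustration, not code
    from the library.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/*
 * Element (i, j) of op(A), where A is stored column-major with
 * leading dimension lda.
 *   'N': op(A) = A              'T': op(A) = transpose of A
 *   'C': op(A) = conjugate transpose of A
 *   'R': op(A) = conjugate of A, not transposed
 */
double complex op_elem(char trans, const double complex *a, size_t lda,
                       size_t i, size_t j)
{
    switch (trans) {
    case 'N': return a[i + j * lda];
    case 'T': return a[j + i * lda];
    case 'C': return conj(a[j + i * lda]);
    case 'R': return conj(a[i + j * lda]);
    default:
        assert(0 && "unknown trans option");
        return 0.0;
    }
}
```

    With 'N', 'T' and 'C' alone, a conjugated but untransposed operand
    would need a separate copy of the matrix; 'R' avoids that copy.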

5. Available optimized routines.

    All Level 1 routines (except for xerbla)

        lsame,  dcabs1, scabs1
        isamax, idamax, icamax, izamax
        saxpy,  daxpy,  caxpy,  zaxpy
        scopy,  dcopy,  ccopy,  zcopy
        sdot,   sdsdot, ddot,   dsdot
        cdotc,  cdotu,  zdotc,  zdotu
        snrm2,  dnrm2,  scnrm2, dznrm2
        srot,   drot,   csrot,  zdrot
        crotg,  srotg,  drotg,  zrotg
        srotm,  drotm,  srotmg, drotmg
        sscal,  dscal,  cscal,  zscal,  csscal, zdscal
        sasum,  dasum,  scasum, dzasum
        sswap,  dswap,  cswap,  zswap

    Some Level 2 routines
        sgemv,  dgemv,  cgemv,  zgemv
        sger,   dger,   cger,   zger
        strsv,  dtrsv,  ctrsv,  ztrsv

    All Level 3 routines
        sgemm,  dgemm,  cgemm,  zgemm
        ssymm,  dsymm,  csymm,  zsymm
        strmm,  dtrmm,  ctrmm,  ztrmm
        strsm,  dtrsm,  ctrsm,  ztrsm
        ssyrk,  dsyrk,  csyrk,  zsyrk
        ssyr2k, dsyr2k, csyr2k, zsyr2k
                        chemm,  zhemm
                        cherk,  zherk
                        cher2k, zher2k

6.  TO DO

    The remaining Level 2 routines will be available soon.  I still
    need to re-optimize the GEMV and GER routines, because I do not
    think they are well optimized yet.

7.  Getting source

Enjoy "GFLOPS" World!!



1. Fast optimized BLAS (Level 1) routines are available


I have released fast optimized BLAS routines. For now I can release
only the Level 1 routines (amax, axpy, ..., dot, rot), but
these routines are written in assembler and are much faster than
the generic BLAS routines (as fast as CXML?).

If you want this library, please see.


I have also put up a fast DGETRF (LU decomposition routine) as a test.
Please try it with my fast gemm and ger routines.

