Automatically Tuned Linear Algebra Software

Automatically Tuned Linear Algebra Software (ATLAS) is a software library for linear algebra. It provides a mature open source implementation of BLAS APIs for C and FORTRAN 77.

Stable release: 3.10.3 / July 28, 2016
Repository: github.com/math-atlas/math-atlas
Written in: C, FORTRAN 77
Type: Software library
License: BSD License
Website: math-atlas.sourceforge.net

ATLAS is often recommended as a way to automatically generate an optimized BLAS library. While its performance often trails that of specialized libraries written for one specific hardware platform, it is often the first or even only optimized BLAS implementation available on new systems and is a large improvement over the generic BLAS available at Netlib. For this reason, ATLAS is sometimes used as a performance baseline for comparison with other products.

ATLAS runs on most Unix-like operating systems and on Microsoft Windows (using Cygwin). It is released under a BSD-style license without advertising clause, and many well-known mathematics applications including MATLAB, Mathematica, Scilab, SageMath, and some builds of GNU Octave may use it.

Functionality

ATLAS provides a full implementation of the BLAS APIs as well as some additional functions from LAPACK, a higher-level library built on top of BLAS. In BLAS, functionality is divided into three groups called levels 1, 2 and 3.

  • Level 1 contains vector operations of the form
    $\mathbf{y} \leftarrow \alpha \mathbf{x} + \mathbf{y}$
as well as scalar dot products and vector norms, among other things.
  • Level 2 contains matrix-vector operations of the form
    $\mathbf{y} \leftarrow \alpha A \mathbf{x} + \beta \mathbf{y}$
as well as solving $T\mathbf{x} = \mathbf{y}$ for $\mathbf{x}$ with $T$ being triangular, among other things.
  • Level 3 contains matrix-matrix operations such as the widely used General Matrix Multiply (GEMM) operation
    $C \leftarrow \alpha A B + \beta C$
as well as solving $B \leftarrow \alpha T^{-1} B$ for triangular matrices $T$, among other things; a minimal C call example of GEMM follows this list.
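
As an illustration of how these routines are used, the GEMM operation above can be called through the CBLAS interface that ATLAS provides. The following minimal C sketch assumes a working ATLAS (or other CBLAS) installation and multiplies two 2×2 matrices:

#include <stdio.h>
#include <cblas.h>   /* CBLAS header, shipped by ATLAS and other BLAS implementations */

int main(void)
{
    /* 2x2 matrices stored in row-major order */
    double A[] = { 1.0, 2.0,
                   3.0, 4.0 };
    double B[] = { 5.0, 6.0,
                   7.0, 8.0 };
    double C[] = { 0.0, 0.0,
                   0.0, 0.0 };

    /* C <- alpha*A*B + beta*C (double-precision GEMM, level 3 BLAS) */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,          /* M, N, K       */
                1.0, A, 2,        /* alpha, A, lda */
                B, 2,             /*        B, ldb */
                0.0, C, 2);       /* beta,  C, ldc */

    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
    return 0;
}

Linking is typically done against the ATLAS libraries (for example -lcblas -latlas), although the exact library names vary by installation.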

Optimization approach

The optimization approach is called Automated Empirical Optimization of Software (AEOS), which identifies four fundamental approaches to computer-assisted optimization, of which ATLAS employs three:[1]

  1. Parameterization—searching over the parameter space of a function, used for blocking factor, cache edge, etc.
  2. Multiple implementation—searching through various approaches to implementing the same function, e.g., hand-written kernels providing SSE support before intrinsics made it available in C code
  3. Code generation—programs that write programs incorporating what knowledge they can about what will produce the best performance for the system
  • Optimization of the level 1 BLAS uses parameterization and multiple implementation
Every ATLAS level 1 BLAS function has its own kernel. Since it would be difficult to maintain thousands of architecture-specific cases in ATLAS, there is little architecture-specific optimization for the level 1 BLAS; instead, multiple implementation is relied upon, allowing compiler optimization to produce a high-performance implementation for the system.
  • Optimization of the level 2 BLAS uses parameterization and multiple implementation
With $N^{2}$ data and $N^{2}$ operations to perform, the function is usually limited by bandwidth to memory, and thus there is not much opportunity for optimization.
All routines in the ATLAS level 2 BLAS are built from two Level 2 BLAS kernels (simple reference versions of both are sketched after this list):
    • GEMV—matrix-by-vector multiply update: $\mathbf{y} \leftarrow \alpha A \mathbf{x} + \beta \mathbf{y}$
    • GER—general rank-1 update from an outer product: $A \leftarrow \alpha \mathbf{x} \mathbf{y}^{T} + A$
  • Optimization of the level 3 BLAS uses code generation and the other two techniques
Since we have $N^{3}$ operations with only $N^{2}$ data, there are many opportunities for optimization.
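
The two level 2 kernels can be illustrated with simple, unoptimized reference versions in C (row-major storage and unit strides assumed; ATLAS's actual kernels are tuned and partly generated per platform):

/* GEMV: y <- alpha*A*x + beta*y, where A is m x n (illustrative only) */
static void ref_gemv(int m, int n, double alpha, const double *A,
                     const double *x, double beta, double *y)
{
    for (int i = 0; i < m; ++i) {
        double t = 0.0;
        for (int j = 0; j < n; ++j)
            t += A[i * n + j] * x[j];
        y[i] = alpha * t + beta * y[i];
    }
}

/* GER: A <- alpha*x*y^T + A, where A is m x n (illustrative only) */
static void ref_ger(int m, int n, double alpha,
                    const double *x, const double *y, double *A)
{
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            A[i * n + j] += alpha * x[i] * y[j];
}

Both kernels touch each element of A only once, which is why level 2 performance is bounded by memory bandwidth rather than arithmetic.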

Level 3 BLAS

Most of the Level 3 BLAS is derived from GEMM, so that is the primary focus of the optimization.

$O(n^{3})$ operations vs. $O(n^{2})$ data

The intuition that the $n^{3}$ operations will dominate over the $n^{2}$ data accesses only works for roughly square matrices. The real measure should be some kind of surface-area-to-volume ratio, and the difference becomes important for very non-square matrices.
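
As a rough illustration (the numbers are not from the ATLAS papers): multiplying an M×K matrix by a K×N matrix takes about $2MNK$ floating-point operations on $MK + KN + MN$ matrix elements. For square matrices with $M = N = K = 1000$ that is roughly $2\times10^{9}$ operations on $3\times10^{6}$ elements, about 667 operations per element, so arithmetic dominates. For a skinny case such as $M = N = 1000$, $K = 4$, it is $8\times10^{6}$ operations on about $10^{6}$ elements, only around 8 operations per element, and memory traffic dominates instead.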

Can it afford to copy?

Copying the inputs allows the data to be arranged in a way that provides optimal access for the kernel functions, but this comes at the cost of allocating temporary space, and an extra read and write of the inputs.

So the first question GEMM faces is, can it afford to copy the inputs?

If so,

  • Put into block major format with good alignment
  • Take advantage of user contributed kernels and cleanup
  • Handle the transpose cases with the copy: make everything into TN (transpose - no-transpose)
  • Deal with α in the copy

If not,

  • Use the nocopy version
  • Make no assumptions about the strides of matrices A and B in memory
  • Handle all transpose cases explicitly
  • No guarantee about alignment of data
  • Support α specific code
  • Run the risk of TLB issues, bad strides, etc.

The actual decision is made through a simple heuristic which checks for "skinny cases".
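
The details of that heuristic are internal to ATLAS; a hypothetical sketch in C, with illustrative names and thresholds that do not reflect the actual decision logic, might look like this:

#include <stddef.h>

/* Hypothetical "skinny case" check (names and thresholds are illustrative,
   not ATLAS's actual logic): copy only when every dimension spans at least
   one block and the copy workspace is affordable. */
static int should_copy(int M, int N, int K, int NB, size_t max_workspace)
{
    if (M < NB || N < NB || K < NB)   /* very thin matrix: copying costs more than it saves */
        return 0;

    /* workspace for block-major copies of A (M x K) and B (K x N), double precision */
    size_t need = ((size_t)M * K + (size_t)K * N) * sizeof(double);
    return need <= max_workspace;
}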

Cache edge

For level-2 cache blocking, a single cache edge parameter is used. The high-level routines choose an order in which to traverse the blocks: ijk, jik, ikj, jki, kij, or kji. This need not be the same as the order in which the product is computed within a block.

Typically chosen orders are ijk or jik. For jik the ideal situation would be to copy A and the NB-wide panel of B. For ijk, swap the roles of A and B.

Choosing the bigger of M or N for the outer loop reduces the footprint of the copy. But for large K, ATLAS does not even allocate such a large amount of memory. Instead it defines a parameter, Kp, to give the best use of the L2 cache, and panels are limited to Kp in length. It first tries to allocate (in the jik case) $M \cdot Kp + NB \cdot Kp + NB \cdot NB$. If that fails it tries $2 \cdot Kp \cdot NB + NB \cdot NB$. (If that also fails, it uses the no-copy version of GEMM, but this case is unlikely for reasonable choices of cache edge.) Kp is a function of cache edge and NB.
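
Assuming the workspace formulas above, the fallback can be sketched in C as follows (double precision; the function and parameter names are illustrative, not ATLAS's own):

#include <stdlib.h>

/* Illustrative sketch of the copy-workspace fallback for the jik case. */
static double *alloc_gemm_workspace(int M, int NB, int Kp, int *full_copy)
{
    double *w;

    /* First choice: a copy of A (M x Kp), an NB-wide panel of B (Kp x NB),
       and an NB x NB block of C. */
    *full_copy = 1;
    w = malloc(((size_t)M * Kp + (size_t)NB * Kp + (size_t)NB * NB) * sizeof *w);
    if (w != NULL)
        return w;

    /* Fallback: just two Kp x NB panels plus the NB x NB block.
       A NULL result here would mean falling back to the no-copy GEMM. */
    *full_copy = 0;
    return malloc(((size_t)2 * Kp * NB + (size_t)NB * NB) * sizeof *w);
}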

LAPACK

When integrating the ATLAS BLAS with LAPACK, an important consideration is the choice of blocking factor for LAPACK. If the ATLAS blocking factor is small enough, the blocking factor of LAPACK can be set to match it.

To take advantage of recursive factorization, ATLAS provides replacement routines for some LAPACK routines. These simply overwrite the corresponding LAPACK routines from Netlib.

Need for installation

Installing ATLAS on a particular platform is a challenging process which is typically done by a system vendor or a local expert and made available to a wider audience.

For many systems, architectural default parameters are available; these are essentially saved searches plus the results of hand tuning. If the arch defaults work they will likely get 10-15% better performance than the install search. On such systems the installation process is greatly simplified.

References

  1. ^ R. Clint Whaley; Antoine Petitet & Jack J. Dongarra (2001). "Automated Empirical Optimization of Software and the ATLAS Project" (PDF). Parallel Computing. 27 (1–2): 3–35. CiteSeerX 10.1.1.35.2297. doi:10.1016/S0167-8191(00)00087-9. Retrieved 2006-10-06.

External links

  • Automatically Tuned Linear Algebra Software on SourceForge
  • User contribution to ATLAS
  • A Collaborative guide to ATLAS Development
  • The FAQ has links to the Quick reference guide to BLAS and Quick reference to ATLAS LAPACK API reference
  • Microsoft Visual C++ Howto for ATLAS (archived 2007-09-28 at the Wayback Machine)
