fbpx
Wikipedia

Loop fission and fusion

In computer science, loop fission (or loop distribution) is a compiler optimization in which a loop is broken into multiple loops over the same index range with each taking only a part of the original loop's body.[1][2] The goal is to break down a large loop body into smaller ones to achieve better utilization of locality of reference. This optimization is most efficient in multi-core processors that can split a task into multiple tasks for each processor.

Conversely, loop fusion (or loop jamming) is a compiler optimization and loop transformation which replaces multiple loops with a single one.[3][2] Loop fusion does not always improve run-time speed. On some architectures, two loops may actually perform better than one loop because, for example, there is increased data locality within each loop. One of the main benefits of loop fusion is that it allows temporary allocations to be avoided, which can lead to huge performance gains in numerical computing languages such as Julia when doing elementwise operations on arrays (however, Julia's loop fusion is not technically a compiler optimization, but a syntactic guarantee of the language).[4]

Other benefits of loop fusion are that it avoids the overhead of the loop control structures, and also that it allows the loop body to be parallelized by the processor[5] by taking advantage of instruction-level parallelism. This is possible when there are no data dependencies between the bodies of the two loops (this is in stark contrast to the other main benefit of loop fusion described above, which only presents itself when there are data dependencies that require an intermediate allocation to store the results). If loop fusion is able to remove redundant allocations, performance increases can be large.[4] Otherwise, there is a more complex trade-off between data locality, instruction-level parallelism, and loop overhead (branching, incrementing, etc.) that may make loop fusion, loop fission, or neither, the preferable optimization.

Fission edit

Example in C edit

int i, a[100], b[100]; for (i = 0; i < 100; i++) {  a[i] = 1;  b[i] = 2; } 

is equivalent to:

int i, a[100], b[100]; for (i = 0; i < 100; i++) {  a[i] = 1; } for (i = 0; i < 100; i++) {  b[i] = 2; } 

Fusion edit

Example in C++ and MATLAB edit

Consider the following MATLAB code:

x = 0:999; % Create an array of numbers from 0 to 999 (range is inclusive) y = sin(x) + 4; % Take the sine of x (element-wise) and add 4 to each element 

You could achieve the same syntax in C++ by using function and operator overloading:

#include <cmath> #include <cassert> #include <memory> #include <iostream> class Array {  size_t length;  std::unique_ptr<float[]> data;    // Internal constructor that produces an uninitialized array  Array(size_t n) : length(n), data(new float[n]) { }   public:  // Factory method to produce an array over an integer range (the upper  // bound is exclusive, unlike MATLAB's ranges).  static Array Range(size_t start, size_t end) {  assert(end > start);  size_t length = end - start;  Array a(length);  for (size_t i = 0; i < length; ++i) {  a[i] = start + i;  }  return a;  }    // Basic array operations  size_t size() const { return length; }  float& operator[](size_t i) { return data[i]; }  const float& operator[](size_t i) const { return data[i]; }  // Declare an overloaded addition operator as a free friend function (this  // syntax defines operator+ as a free function that is a friend of this  // class, despite it appearing as a member function declaration).  friend Array operator+(const Array& a, float b) {  Array c(a.size());  for (size_t i = 0; i < a.size(); ++i) {  c[i] = a[i] + b;  }  return c;  }    // Similarly, we can define an overload for the sin() function. In practice,  // it would be unwieldy to define all possible overloaded math operations as  // friends inside the class like this, but this is just an example.  friend Array sin(const Array& a) {  Array b(a.size());  for (size_t i = 0; i < a.size(); ++i) {  b[i] = std::sin(a[i]);  }  return b;  } }; int main(int argc, char* argv[]) {  // Here, we perform the same computation as the MATLAB example  auto x = Array::Range(0, 1000);  auto y = sin(x) + 4;    // Print the result out - just to make sure the optimizer doesn't remove  // everything (if it's smart enough to do so).  std::cout << "The result is: " << std::endl;  for (size_t i = 0; i < y.size(); ++i) {  std::cout << y[i] << std::endl;  }    return 0; } 

However, the above example unnecessarily allocates a temporary array for the result of sin(x). A more efficient implementation would allocate a single array for y, and compute y in a single loop. To optimize this, a C++ compiler would need to:

  1. Inline the sin and operator+ function calls.
  2. Fuse the loops into a single loop.
  3. Remove the unused stores into the temporary arrays (can use a register or stack variable instead).
  4. Remove the unused allocation and free.

All of these steps are individually possible. Even step four is possible despite the fact that functions like malloc and free have global side effects, since some compilers hardcode symbols such as malloc and free so that they can remove unused allocations from the code.[6] However, as of clang 12.0.0 and gcc 11.1, this loop fusion and redundant allocation removal does not occur - even on the highest optimization level.[7][8]

Some languages specifically targeted towards numerical computing such as Julia might have the concept of loop fusion built into it at a high level, where the compiler will notice adjacent elementwise operations and fuse them into a single loop.[9] Currently, to achieve the same syntax in general purpose languages like C++, the sin and operator+ functions must pessimistically allocate arrays to store their results, since they do not know what context they will be called from. This issue can be avoided in C++ by using a different syntax that does not rely on the compiler to remove unnecessary temporary allocations (e.g., using functions and overloads for in-place operations, such as operator+= or std::transform).

See also edit

References edit

  1. ^ Y.N. Srikant; Priti Shankar (3 October 2018). The Compiler Design Handbook: Optimizations and Machine Code Generation, Second Edition. CRC Press. ISBN 978-1-4200-4383-9.
  2. ^ a b Kennedy, Ken & Allen, Randy. (2001). Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann. ISBN 1-55860-286-0.
  3. ^ Steven Muchnick; Muchnick and Associates (15 August 1997). Advanced Compiler Design Implementation. Morgan Kaufmann. ISBN 978-1-55860-320-2. loop fusion.
  4. ^ a b Johnson, Steven G. (21 January 2017). "More Dots: Syntactic Loop Fusion in Julia". julialang.org. Retrieved 25 June 2021.
  5. ^ "Loop Fusion". Intel. Retrieved 2021-06-25.
  6. ^ Godbolt, Matt. "Compiler Explorer - C++ (x86-64 clang 12.0.0)". godbolt.org. Retrieved 2021-06-25.
  7. ^ Godbolt, Matt. "Compiler Explorer - C++ (x86-64 clang 12.0.0)". godbolt.org. Retrieved 2021-06-25.
  8. ^ Godbolt, Matt. "Compiler Explorer - C++ (x86-64 gcc 11.1)". godbolt.org. Retrieved 2021-06-25.
  9. ^ "Functions · The Julia Language". docs.julialang.org. Retrieved 2021-06-25.

    loop, fission, fusion, computer, science, loop, fission, loop, distribution, compiler, optimization, which, loop, broken, into, multiple, loops, over, same, index, range, with, each, taking, only, part, original, loop, body, goal, break, down, large, loop, bod. In computer science loop fission or loop distribution is a compiler optimization in which a loop is broken into multiple loops over the same index range with each taking only a part of the original loop s body 1 2 The goal is to break down a large loop body into smaller ones to achieve better utilization of locality of reference This optimization is most efficient in multi core processors that can split a task into multiple tasks for each processor Conversely loop fusion or loop jamming is a compiler optimization and loop transformation which replaces multiple loops with a single one 3 2 Loop fusion does not always improve run time speed On some architectures two loops may actually perform better than one loop because for example there is increased data locality within each loop One of the main benefits of loop fusion is that it allows temporary allocations to be avoided which can lead to huge performance gains in numerical computing languages such as Julia when doing elementwise operations on arrays however Julia s loop fusion is not technically a compiler optimization but a syntactic guarantee of the language 4 Other benefits of loop fusion are that it avoids the overhead of the loop control structures and also that it allows the loop body to be parallelized by the processor 5 by taking advantage of instruction level parallelism This is possible when there are no data dependencies between the bodies of the two loops this is in stark contrast to the other main benefit of loop fusion described above which only presents itself when there are data dependencies that require an intermediate allocation to store the results If loop fusion is able to remove redundant allocations performance increases can be large 4 Otherwise there is a more complex trade off between data locality instruction level parallelism and loop overhead branching incrementing etc that may make loop fusion loop fission or neither the preferable optimization Contents 1 Fission 1 1 Example in C 2 Fusion 2 1 Example in C and MATLAB 3 See also 4 ReferencesFission editExample in C edit int i a 100 b 100 for i 0 i lt 100 i a i 1 b i 2 is equivalent to int i a 100 b 100 for i 0 i lt 100 i a i 1 for i 0 i lt 100 i b i 2 Fusion editExample in C and MATLAB editConsider the following MATLAB code x 0 999 Create an array of numbers from 0 to 999 range is inclusive y sin x 4 Take the sine of x element wise and add 4 to each elementYou could achieve the same syntax in C by using function and operator overloading include lt cmath gt include lt cassert gt include lt memory gt include lt iostream gt class Array size t length std unique ptr lt float gt data Internal constructor that produces an uninitialized array Array size t n length n data new float n public Factory method to produce an array over an integer range the upper bound is exclusive unlike MATLAB s ranges static Array Range size t start size t end assert end gt start size t length end start Array a length for size t i 0 i lt length i a i start i return a Basic array operations size t size const return length float amp operator size t i return data i const float amp operator size t i const return data i Declare an overloaded addition operator as a free friend function this syntax defines operator as a free function that is a friend of this class despite it appearing as a member function declaration friend Array operator const Array amp a float b Array c a size for size t i 0 i lt a size i c i a i b return c Similarly we can define an overload for the sin function In practice it would be unwieldy to define all possible overloaded math operations as friends inside the class like this but this is just an example friend Array sin const Array amp a Array b a size for size t i 0 i lt a size i b i std sin a i return b int main int argc char argv Here we perform the same computation as the MATLAB example auto x Array Range 0 1000 auto y sin x 4 Print the result out just to make sure the optimizer doesn t remove everything if it s smart enough to do so std cout lt lt The result is lt lt std endl for size t i 0 i lt y size i std cout lt lt y i lt lt std endl return 0 However the above example unnecessarily allocates a temporary array for the result of sin x A more efficient implementation would allocate a single array for y and compute y in a single loop To optimize this a C compiler would need to Inline the sin and operator function calls Fuse the loops into a single loop Remove the unused stores into the temporary arrays can use a register or stack variable instead Remove the unused allocation and free All of these steps are individually possible Even step four is possible despite the fact that functions like malloc and free have global side effects since some compilers hardcode symbols such as malloc and free so that they can remove unused allocations from the code 6 However as of clang 12 0 0 and gcc 11 1 this loop fusion and redundant allocation removal does not occur even on the highest optimization level 7 8 Some languages specifically targeted towards numerical computing such as Julia might have the concept of loop fusion built into it at a high level where the compiler will notice adjacent elementwise operations and fuse them into a single loop 9 Currently to achieve the same syntax in general purpose languages like C the sin and operator functions must pessimistically allocate arrays to store their results since they do not know what context they will be called from This issue can be avoided in C by using a different syntax that does not rely on the compiler to remove unnecessary temporary allocations e g using functions and overloads for in place operations such as operator or std transform See also editExpression templates Loop transformationReferences edit Y N Srikant Priti Shankar 3 October 2018 The Compiler Design Handbook Optimizations and Machine Code Generation Second Edition CRC Press ISBN 978 1 4200 4383 9 a b Kennedy Ken amp Allen Randy 2001 Optimizing Compilers for Modern Architectures A Dependence based Approach Morgan Kaufmann ISBN 1 55860 286 0 Steven Muchnick Muchnick and Associates 15 August 1997 Advanced Compiler Design Implementation Morgan Kaufmann ISBN 978 1 55860 320 2 loop fusion a b Johnson Steven G 21 January 2017 More Dots Syntactic Loop Fusion in Julia julialang org Retrieved 25 June 2021 Loop Fusion Intel Retrieved 2021 06 25 Godbolt Matt Compiler Explorer C x86 64 clang 12 0 0 godbolt org Retrieved 2021 06 25 Godbolt Matt Compiler Explorer C x86 64 clang 12 0 0 godbolt org Retrieved 2021 06 25 Godbolt Matt Compiler Explorer C x86 64 gcc 11 1 godbolt org Retrieved 2021 06 25 Functions The Julia Language docs julialang org Retrieved 2021 06 25 Loop fission Retrieved from https en wikipedia org w index php title Loop fission and fusion amp oldid 1177434205, wikipedia, wiki, book, books, library,

    article

    , read, download, free, free download, mp3, video, mp4, 3gp, jpg, jpeg, gif, png, picture, music, song, movie, book, game, games.