Wikipedia

Multiply–accumulate operation

In computing, especially digital signal processing, the multiply–accumulate (MAC) or multiply-add (MAD) operation is a common step that computes the product of two numbers and adds that product to an accumulator. The hardware unit that performs the operation is known as a multiplier–accumulator (MAC unit); the operation itself is also often called a MAC or a MAD operation. The MAC operation modifies an accumulator a:

a ← a + (b × c)

When done with floating point numbers, it might be performed with two roundings (typical in many DSPs), or with a single rounding. When performed with a single rounding, it is called a fused multiply–add (FMA) or fused multiply–accumulate (FMAC).

Modern computers may contain a dedicated MAC, consisting of a multiplier implemented in combinational logic followed by an adder and an accumulator register that stores the result. The output of the register is fed back to one input of the adder, so that on each clock cycle, the output of the multiplier is added to the register. Combinational multipliers require a large amount of logic, but can compute a product much more quickly than the method of shifting and adding typical of earlier computers. Percy Ludgate was the first to conceive a MAC in his Analytical Machine of 1909,[1] and the first to exploit a MAC for division (using multiplication seeded by reciprocal, via the convergent series (1+x)−1). The first modern processors to be equipped with MAC units were digital signal processors, but the technique is now also common in general-purpose processors.[2][3][4][5]

In floating-point arithmetic

When done with integers, the operation is typically exact (computed modulo some power of two). However, floating-point numbers have only a certain amount of mathematical precision. That is, digital floating-point arithmetic is generally not associative or distributive. (See Floating point § Accuracy problems.) Therefore, it makes a difference to the result whether the multiply–add is performed with two roundings, or in one operation with a single rounding (a fused multiply–add). IEEE 754-2008 specifies that it must be performed with one rounding, yielding a more accurate result.[6]

Fused multiply–add

A fused multiply–add (FMA or fmadd)[7] is a floating-point multiply–add operation performed in one step, with a single rounding. That is, where an unfused multiply–add would compute the product b × c, round it to N significant bits, add the result to a, and round back to N significant bits, a fused multiply–add would compute the entire expression a + (b × c) to its full precision before rounding the final result down to N significant bits.

A fast FMA can speed up and improve the accuracy of many computations that involve the accumulation of products:

  - Dot product
  - Matrix multiplication
  - Polynomial evaluation (e.g., with Horner's rule)
  - Newton's method for evaluating functions (from the inverse function)
  - Convolutions and artificial neural networks
  - Multiplication in double-double arithmetic

Fused multiply–add can usually be relied on to give more accurate results. However, William Kahan has pointed out that it can give problems if used unthinkingly.[8] If x² − y² is evaluated as ((x × x) − y × y) (following Kahan's suggested notation in which redundant parentheses direct the compiler to round the (x × x) term first) using fused multiply–add, then the result may be negative even when x = y, due to the first multiplication discarding low-significance bits. This could then lead to an error if, for instance, the square root of the result is then evaluated.

When implemented inside a microprocessor, an FMA can be faster than a multiply operation followed by an add. However, standard industrial implementations based on the original IBM RS/6000 design require a 2N-bit adder to compute the sum properly.[9]

Another benefit of including this instruction is that it allows an efficient software implementation of division (see division algorithm) and square root (see methods of computing square roots) operations, thus eliminating the need for dedicated hardware for those operations.[10]

Dot product instruction

Some machines combine multiple fused multiply–add operations into a single step, e.g. performing a four-element dot product on two 128-bit SIMD registers, a0×b0 + a1×b1 + a2×b2 + a3×b3, with single-cycle throughput.

Support

The FMA operation is included in IEEE 754-2008.

The Digital Equipment Corporation (DEC) VAX's POLY instruction is used for evaluating polynomials with Horner's rule using a succession of multiply and add steps. Instruction descriptions do not specify whether the multiply and add are performed using a single FMA step.[11] This instruction has been a part of the VAX instruction set since its original 11/780 implementation in 1977.

The 1999 standard of the C programming language supports the FMA operation through the fma() standard math library function and the automatic transformation of a multiplication followed by an addition (contraction of floating-point expressions), which can be explicitly enabled or disabled with standard pragmas (#pragma STDC FP_CONTRACT). The GCC and Clang C compilers do such transformations by default for processor architectures that support FMA instructions. With GCC, which does not support the aforementioned pragma,[12] this can be globally controlled by the -ffp-contract command line option.[13]

The fused multiply–add operation was introduced as "multiply–add fused" in the IBM POWER1 (1990) processor,[14] but has been added to numerous other processors since then:

  - HP PA-8000 (1996) and above
  - Hitachi SuperH SH-4 (1998)
  - SCE/Toshiba Emotion Engine (1999)
  - Intel Itanium (2001)
  - STI Cell (2006)
  - Fujitsu SPARC64 VI (2007) and above
  - MIPS-compatible Loongson-2F (2008)[15]
  - Elbrus-8SV (2018)
  - x86 processors with the FMA3 and/or FMA4 instruction set
    - AMD Bulldozer (2011, FMA4 only)
    - AMD Piledriver (2012, FMA3 and FMA4)[16]
    - AMD Steamroller (2014)
    - AMD Excavator (2015)
    - AMD Zen (2017, FMA3 only)
    - Intel Haswell (2013, FMA3 only)[17]
    - Intel Skylake (2015, FMA3 only)
  - ARM processors with VFPv4 and/or NEONv2
    - ARM Cortex-M4F (2010)
    - ARM Cortex-A5 (2012)
    - ARM Cortex-A7 (2013)
    - ARM Cortex-A15 (2012)
    - Qualcomm Krait (2012)
    - Apple A6 (2012)
    - All ARMv8 processors
  - Fujitsu A64FX (four-operand FMA with prefix instruction)
  - IBM z/Architecture (since 1998)
  - GPUs and GPGPU boards
    - AMD GPUs (2009 and newer): TeraScale 2 "Evergreen" series based, Graphics Core Next based
    - Nvidia GPUs (2010 and newer): Fermi based (2010), Kepler based (2012), Maxwell based (2014), Pascal based (2016), Volta based (2017)
    - Intel GPUs since Sandy Bridge
    - Intel MIC (2012)
    - ARM Mali-T600 series (2012 and above)
  - Vector processors: NEC SX-Aurora TSUBASA
  - RISC-V instruction set (2010)

References

  1. ^ "The Feasibility of Ludgate's Analytical Machine". Archived from the original on 2019-08-07. Retrieved 2020-08-30.
  2. ^ Lyakhov, Pavel; Valueva, Maria; Valuev, Georgii; Nagornov, Nikolai (January 2020). "A Method of Increasing Digital Filter Performance Based on Truncated Multiply-Accumulate Units". Applied Sciences. 10 (24): 9052. doi:10.3390/app10249052.
  3. ^ Tung Thanh Hoang; Sjalander, M.; Larsson-Edefors, P. (May 2009). "Double Throughput Multiply-Accumulate unit for FlexCore processor enhancements". 2009 IEEE International Symposium on Parallel Distributed Processing: 1–7. doi:10.1109/IPDPS.2009.5161212. ISBN 978-1-4244-3751-1. S2CID 14535090.
  4. ^ Kang, Jongsung; Kim, Taewhan (2020-03-01). "PV-MAC: Multiply-and-accumulate unit structure exploiting precision variability in on-device convolutional neural networks". Integration. 71: 76–85. doi:10.1016/j.vlsi.2019.11.003. ISSN 0167-9260.
  5. ^ "mad - ps". Retrieved 2021-08-14.
  6. ^ Whitehead, Nathan; Fit-Florea, Alex (2011). "Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs" (PDF). nvidia. Retrieved 2013-08-31.
  7. ^ "fmadd instrs".
  8. ^ Kahan, William (1996-05-31). "IEEE Standard 754 for Binary Floating-Point Arithmetic".
  9. ^ Quinnell, Eric (May 2007). Floating-Point Fused Multiply–Add Architectures (PDF) (PhD thesis). Retrieved 2011-03-28.
  10. ^ Markstein, Peter (November 2004). Software Division and Square Root Using Goldschmidt's Algorithms (PDF). 6th Conference on Real Numbers and Computers. CiteSeerX 10.1.1.85.9648.
  11. ^ "VAX instruction of the week: POLY". Archived from the original on 2020-02-13.
  12. ^ "Bug 20785 - Pragma STDC * (C99 FP) unimplemented". gcc.gnu.org. Retrieved 2022-02-02.
  13. ^ "Optimize Options (Using the GNU Compiler Collection (GCC))". gcc.gnu.org. Retrieved 2022-02-02.
  14. ^ Montoye, R. K.; Hokenek, E.; Runyon, S. L. (January 1990). "Design of the IBM RISC System/6000 floating-point execution unit". IBM Journal of Research and Development. 34 (1): 59–70. doi:10.1147/rd.341.0059. 
  15. ^ "Godson-3 Emulates x86: New MIPS-Compatible Chinese Processor Has Extensions for x86 Translation".
  16. ^ Hollingsworth, Brent (October 2012). "New "Bulldozer" and "Piledriver" Instructions". AMD Developer Central.
  17. ^ "Intel adds 22nm octo-core Haswell to CPU design roadmap". The Register. Archived from the original on 2012-02-17. Retrieved 2008-08-19.

