
Quadruple-precision floating-point format

In computing, quadruple precision (or quad precision) is a binary floating-point–based computer number format that occupies 16 bytes (128 bits) with precision at least twice the 53-bit double precision.

This 128-bit quadruple precision is designed not only for applications requiring results in higher than double precision,[1] but also, as a primary function, to allow the computation of double precision results more reliably and accurately by minimising overflow and round-off errors in intermediate calculations and scratch variables. William Kahan, primary architect of the original IEEE 754 floating-point standard, noted, "For now the 10-byte Extended format is a tolerable compromise between the value of extra-precise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a 16-byte format ... That kind of gradual evolution towards wider precision was already in view when IEEE Standard 754 for Floating-Point Arithmetic was framed."[2]

In IEEE 754-2008 the 128-bit base-2 format is officially referred to as binary128.

IEEE 754 quadruple-precision binary floating-point format: binary128

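The IEEE 754 standard specifies a binary128 as having:

  • Sign bit: 1 bit
  • Exponent width: 15 bits
  • Significand precision: 113 bits (112 explicitly stored)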

This gives from 33 to 36 significant decimal digits precision. If a decimal string with at most 33 significant digits is converted to the IEEE 754 quadruple-precision format, giving a normal number, and then converted back to a decimal string with the same number of digits, the final result should match the original string. If an IEEE 754 quadruple-precision number is converted to a decimal string with at least 36 significant digits, and then converted back to quadruple-precision representation, the final result must match the original number.[3]
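The 33 and 36 digit figures can be reproduced from the significand width alone. The sketch below assumes the usual bounds for a p-bit binary significand, floor((p − 1) × log₁₀ 2) digits for a decimal string that survives a round trip and ceil(1 + p × log₁₀ 2) digits to recover the binary value; the function name is chosen here for illustration only.

```python
import math

def decimal_digit_bounds(p):
    """Return (d_low, d_high) for a binary format with a p-bit significand.

    d_low:  decimal strings with up to d_low significant digits survive
            decimal -> binary -> decimal conversion.
    d_high: printing d_high significant digits is enough for
            binary -> decimal -> binary conversion to be exact.
    """
    log10_2 = math.log10(2)
    d_low = math.floor((p - 1) * log10_2)
    d_high = math.ceil(1 + p * log10_2)
    return d_low, d_high

print(decimal_digit_bounds(113))  # binary128: (33, 36)
print(decimal_digit_bounds(53))   # binary64, for comparison: (15, 17)
```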

The format is written with an implicit lead bit with value 1 unless the exponent is stored with all zeros. Thus only 112 bits of the significand appear in the memory format, but the total precision is 113 bits (approximately 34 decimal digits: log₁₀(2^113) ≈ 34.016). The bits are laid out, from most to least significant, as 1 sign bit (bit 127), 15 exponent bits (bits 126 to 112) and 112 significand bits (bits 111 to 0).

Exponent encoding

The quadruple-precision binary floating-point exponent is encoded using an offset binary representation, with the zero offset being 16383; this is also known as exponent bias in the IEEE 754 standard.

  • Emin = 0001₁₆ − 3FFF₁₆ = −16382
  • Emax = 7FFE₁₆ − 3FFF₁₆ = 16383
  • Exponent bias = 3FFF₁₆ = 16383

Thus, as defined by the offset binary representation, in order to get the true exponent, the offset of 16383 has to be subtracted from the stored exponent.

The stored exponents 0000₁₆ and 7FFF₁₆ are interpreted specially.

Exponent            | Significand zero | Significand non-zero    | Equation
0000₁₆              | 0, −0            | subnormal numbers       | (−1)^signbit × 2^−16382 × 0.significandbits₂
0001₁₆, ..., 7FFE₁₆ | normalized value | normalized value        | (−1)^signbit × 2^(exponentbits₂ − 16383) × 1.significandbits₂
7FFF₁₆              | ±∞               | NaN (quiet, signalling) |

The minimum strictly positive (subnormal) value is 2^−16494 ≈ 10^−4965 and has a precision of only one bit. The minimum positive normal value is 2^−16382 ≈ 3.3621 × 10^−4932 and has a precision of 113 bits, so the spacing to its neighbouring values is 2^−16494 as well. The maximum representable value is 2^16384 − 2^16271 ≈ 1.1897 × 10^4932.
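The encoding rules of this section can be sketched as a small decoder. This is an illustration under stated assumptions, not a reference implementation: the function name decode_binary128 is hypothetical, the bit pattern is passed as a string of 32 hexadecimal digits, and exact rational arithmetic from Python's fractions module stands in for a real binary128 type.

```python
from fractions import Fraction

BIAS = 16383          # exponent bias (3FFF in hexadecimal)
FRACTION_BITS = 112   # explicitly stored significand bits

def decode_binary128(hex_string):
    """Decode a binary128 bit pattern given as 32 hex digits (spaces allowed).

    Returns an exact Fraction for finite values, or one of the strings
    'inf', '-inf', 'nan' for the special encodings. The sign of zero is
    not preserved, because Fraction has no negative zero.
    """
    bits = int(hex_string.replace(" ", ""), 16)
    sign = bits >> 127
    stored_exponent = (bits >> FRACTION_BITS) & 0x7FFF
    fraction = bits & ((1 << FRACTION_BITS) - 1)

    if stored_exponent == 0x7FFF:
        if fraction:
            return "nan"
        return "-inf" if sign else "inf"

    if stored_exponent == 0:
        # Zero or subnormal: no implicit leading 1, exponent fixed at Emin.
        significand = Fraction(fraction, 1 << FRACTION_BITS)
        exponent = 1 - BIAS
    else:
        # Normal number: implicit leading 1, subtract the bias.
        significand = 1 + Fraction(fraction, 1 << FRACTION_BITS)
        exponent = stored_exponent - BIAS

    magnitude = significand * Fraction(2) ** exponent
    return -magnitude if sign else magnitude

print(decode_binary128("c000 0000 0000 0000 0000 0000 0000 0000"))  # -2
```

Because the result is an exact rational, it can be compared directly with closed forms such as 2^16384 − 2^16271 for the largest normal number.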

Quadruple precision examples

These examples are given in bit representation, in hexadecimal, of the floating-point value. This includes the sign, (biased) exponent, and significand.

0000 0000 0000 0000 0000 0000 0000 0001₁₆ = 2^−16382 × 2^−112 = 2^−16494 ≈ 6.4751751194380251109244389582276465525 × 10^−4966 (smallest positive subnormal number)
0000 ffff ffff ffff ffff ffff ffff ffff₁₆ = 2^−16382 × (1 − 2^−112) ≈ 3.3621031431120935062626778173217519551 × 10^−4932 (largest subnormal number)
0001 0000 0000 0000 0000 0000 0000 0000₁₆ = 2^−16382 ≈ 3.3621031431120935062626778173217526026 × 10^−4932 (smallest positive normal number)
7ffe ffff ffff ffff ffff ffff ffff ffff₁₆ = 2^16383 × (2 − 2^−112) ≈ 1.1897314953572317650857593266280070162 × 10^4932 (largest normal number)
3ffe ffff ffff ffff ffff ffff ffff ffff₁₆ = 1 − 2^−113 ≈ 0.9999999999999999999999999999999999037 (largest number less than one)
3fff 0000 0000 0000 0000 0000 0000 0000₁₆ = 1 (one)
3fff 0000 0000 0000 0000 0000 0000 0001₁₆ = 1 + 2^−112 ≈ 1.0000000000000000000000000000000001926 (smallest number larger than one)
c000 0000 0000 0000 0000 0000 0000 0000₁₆ = −2
0000 0000 0000 0000 0000 0000 0000 0000₁₆ = 0
8000 0000 0000 0000 0000 0000 0000 0000₁₆ = −0
7fff 0000 0000 0000 0000 0000 0000 0000₁₆ = infinity
ffff 0000 0000 0000 0000 0000 0000 0000₁₆ = −infinity
4000 921f b544 42d1 8469 898c c517 01b8₁₆ ≈ π
3ffd 5555 5555 5555 5555 5555 5555 5555₁₆ ≈ 1/3

By default, 1/3 rounds down like double precision, because of the odd number of bits in the significand. So the bits beyond the rounding point are 0101... which is less than 1/2 of a unit in the last place.
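The rounding of 1/3 can be checked with exact integer arithmetic. The following sketch assumes round-to-nearest, ties-to-even and handles only positive values in the normal range; encode_binary128 is a hypothetical helper written for this illustration.

```python
from fractions import Fraction

def encode_binary128(value):
    """Round a positive rational in the normal range to binary128
    (round-to-nearest, ties-to-even) and return 32 hex digits."""
    value = Fraction(value)
    if value <= 0:
        raise ValueError("only positive values are handled in this sketch")
    # Find e such that 2**e <= value < 2**(e + 1).
    e = value.numerator.bit_length() - value.denominator.bit_length()
    if Fraction(2) ** e > value:
        e -= 1
    # Scale so that the integer part is the 113-bit significand.
    scaled = value / Fraction(2) ** (e - 112)
    q, r = divmod(scaled.numerator, scaled.denominator)
    # r / denominator is the discarded part, in units in the last place.
    if 2 * r > scaled.denominator or (2 * r == scaled.denominator and q & 1):
        q += 1
    if q == 1 << 113:          # rounding carried into the next binade
        q >>= 1
        e += 1
    if not -16382 <= e <= 16383:
        raise ValueError("value is outside the normal range")
    bits = ((e + 16383) << 112) | (q - (1 << 112))
    return format(bits, "032x")

print(encode_binary128(Fraction(1, 3)))  # 3ffd5555555555555555555555555555
```

For 1/3 the discarded remainder is exactly one third of a unit in the last place, which is below one half, so the significand is rounded down.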

Double-double arithmetic

A common software technique to implement nearly quadruple precision using pairs of double-precision values is sometimes called double-double arithmetic.[4][5][6] Using pairs of IEEE double-precision values with 53-bit significands, double-double arithmetic provides operations on numbers with significands of at least[4] 2 × 53 = 106 bits (actually 107 bits[7] except for some of the largest values, due to the limited exponent range), only slightly less precise than the 113-bit significand of IEEE binary128 quadruple precision. The range of a double-double remains essentially the same as the double-precision format because the exponent still has 11 bits,[4] significantly fewer than the 15-bit exponent of IEEE quadruple precision (a range of 1.8 × 10^308 for double-double versus 1.2 × 10^4932 for binary128).

In particular, a double-double/quadruple-precision value q in the double-double technique is represented implicitly as a sum q = x + y of two double-precision values x and y, each of which supplies half of q's significand.[5] That is, the pair (x, y) is stored in place of q, and operations on q values (+, −, ×, ...) are transformed into equivalent (but more complicated) operations on the x and y values. Thus, arithmetic in this technique reduces to a sequence of double-precision operations; since double-precision arithmetic is commonly implemented in hardware, double-double arithmetic is typically substantially faster than more general arbitrary-precision arithmetic techniques.[4][5]
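As a minimal sketch of how an operation on q reduces to double-precision operations on (x, y), the code below adds two double-double values. It assumes round-to-nearest binary64 arithmetic (which Python floats provide on common platforms) and uses the classic error-free "two-sum" transformation; the names two_sum and dd_add are chosen for illustration, and production libraries use more careful variants.

```python
from fractions import Fraction

def two_sum(a, b):
    """Error-free addition: returns (s, e) with s = fl(a + b) and
    s + e equal to a + b exactly, barring overflow."""
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    e = (a - a_virtual) + (b - b_virtual)
    return s, e

def dd_add(x, y):
    """Add two double-double values, each given as a (high, low) pair.
    Simplified variant: the low parts are added with ordinary rounding."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    # Renormalise so that the high part carries the leading bits.
    hi = s + e
    lo = e - (hi - s)
    return hi, lo

tiny = 2.0 ** -80
print(1.0 + tiny == 1.0)               # True: a plain double loses the small term
hi, lo = dd_add((1.0, 0.0), (tiny, 0.0))
print(Fraction(hi) + Fraction(lo) == 1 + Fraction(1, 2**80))  # True
```

The pair keeps the small term in its low-order part, which is where the extra precision of the technique comes from.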

Note that double-double arithmetic has the following special characteristics:[8]

  • As the magnitude of the value decreases, the amount of extra precision also decreases. Therefore, the range over which the full precision is available is narrower than the normalized range of double precision. The smallest number with full precision is 1000...0₂ (106 zeros) × 2^−1074, or 1.000...0₂ (106 zeros) × 2^−968. Numbers whose magnitude is smaller than 2^−1021 will not have additional precision compared with double precision.
  • The actual number of bits of precision can vary. In general, the magnitude of the low-order part of the number is no greater than half an ULP of the high-order part. If the low-order part is less than half an ULP of the high-order part, significant bits (either all 0s or all 1s) are implied between the significands of the high-order and low-order numbers. Certain algorithms that rely on having a fixed number of bits in the significand can fail when using double-double numbers, such as a 128-bit long double implemented as a double-double.
  • Because of the reason above, it is possible to represent values like 1 + 2^−1074, which is the smallest representable number greater than 1.
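The last point can be illustrated with exact rational arithmetic. This sketch assumes Python floats are IEEE binary64 values and represents the double-double as an explicit (hi, lo) pair; the variable names are illustrative only.

```python
import math
from fractions import Fraction

hi = 1.0
lo = math.ldexp(1.0, -1074)   # 2**-1074, the smallest positive subnormal double

# The pair (hi, lo) stands for the exact sum hi + lo.
exact = Fraction(hi) + Fraction(lo)
print(exact == 1 + Fraction(1, 2**1074))   # True

# Number of significand bits a single fixed-width format would need
# to hold the same value.
bits_needed = exact.numerator.bit_length()
print(bits_needed)                          # 1075

# In plain double precision the small term is simply lost.
print(hi + lo == 1.0)                       # True
```

A fixed 113-bit significand cannot hold this value, which is why algorithms that assume a fixed significand width can misbehave on double-double numbers.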

In addition to the double-double arithmetic, it is also possible to generate triple-double or quad-double arithmetic if higher precision is required without any higher precision floating-point library. They are represented as a sum of three (or four) double-precision values respectively. They can represent operations with at least 159/161 and 212/215 bits respectively.

A similar technique can be used to produce a double-quad arithmetic, which is represented as a sum of two quadruple-precision values. They can represent operations with at least 226 (or 227) bits.[9]

Implementations

Quadruple precision is often implemented in software by a variety of techniques (such as the double-double technique above, although that technique does not implement IEEE quadruple precision), since direct hardware support for quadruple precision is, as of 2016, less common (see "Hardware support" below). One can use general arbitrary-precision arithmetic libraries to obtain quadruple (or higher) precision, but specialized quadruple-precision implementations may achieve higher performance.

Computer-language support

A separate question is the extent to which quadruple-precision types are directly incorporated into computer programming languages.

Quadruple precision is specified in Fortran by the real(real128) (module iso_fortran_env from Fortran 2008 must be used, the constant real128 is equal to 16 on most processors), or as real(selected_real_kind(33, 4931)), or in a non-standard way as REAL*16. (Quadruple-precision REAL*16 is supported by the Intel Fortran Compiler[10] and by the GNU Fortran compiler[11] on x86, x86-64, and Itanium architectures, for example.)

For the C programming language, ISO/IEC TS 18661-3 (floating-point extensions for C, interchange and extended types) specifies _Float128 as the type implementing the IEEE 754 quadruple-precision format (binary128).[12] Alternatively, in C/C++ with a few systems and compilers, quadruple precision may be specified by the long double type, but this is not required by the language (which only requires long double to be at least as precise as double), nor is it common.

On x86 and x86-64, the most common C/C++ compilers implement long double as either 80-bit extended precision (e.g. the GNU C Compiler gcc[13] and the Intel C++ Compiler with a /Qlong‑double switch[14]) or simply as being synonymous with double precision (e.g. Microsoft Visual C++[15]), rather than as quadruple precision. The procedure call standard for the ARM 64-bit architecture (AArch64) specifies that long double corresponds to the IEEE 754 quadruple-precision format.[16] On a few other architectures, some C/C++ compilers implement long double as quadruple precision, e.g. gcc on PowerPC (as double-double[17][18][19]) and SPARC,[20] or the Sun Studio compilers on SPARC.[21] Even if long double is not quadruple precision, however, some C/C++ compilers provide a nonstandard quadruple-precision type as an extension. For example, gcc provides a quadruple-precision type called __float128 for x86, x86-64 and Itanium CPUs,[22] and on PowerPC as IEEE 128-bit floating-point using the -mfloat128-hardware or -mfloat128 options;[23] and some versions of Intel's C/C++ compiler for x86 and x86-64 supply a nonstandard quadruple-precision type called _Quad.[24]

Google's work-in-progress language Carbon provides support for it with the type called 'f128'.[25]

Libraries and toolboxes

  • The GCC quad-precision math library, libquadmath, provides __float128 and __complex128 operations.
  • The Boost multiprecision library Boost.Multiprecision provides unified cross-platform C++ interface for __float128 and _Quad types, and includes a custom implementation of the standard math library.[26]
  • The Multiprecision Computing Toolbox for MATLAB allows quadruple-precision computations in MATLAB. It includes basic arithmetic functionality as well as numerical methods, dense and sparse linear algebra.[27]
  • The DoubleFloats[28] package provides support for double-double computations for the Julia programming language.
  • The doubledouble.py[29] library enables double-double computations in Python.[citation needed]
  • Mathematica supports IEEE quad-precision numbers: 128-bit floating-point values (Real128), and 256-bit complex values (Complex256).[citation needed]

Hardware support

IEEE quadruple precision was added to the IBM System/390 G5 in 1998,[30] and is supported in hardware in subsequent z/Architecture processors.[31][32] The IBM POWER9 CPU (Power ISA 3.0) has native 128-bit hardware support.[23]

Native support of IEEE 128-bit floats is defined in PA-RISC 1.0,[33] and in SPARC V8[34] and V9[35] architectures (e.g. there are 16 quad-precision registers %q0, %q4, ...), but no SPARC CPU implements quad-precision operations in hardware as of 2004.[36]

Non-IEEE extended-precision (128 bits of storage, 1 sign bit, 7 exponent bits, 112 fraction bits, 8 bits unused) was added to the IBM System/370 series (1970s–1980s) and was available on some System/360 models in the 1960s (System/360-85,[37] -195, and others by special request or simulated by OS software).

The Siemens 7.700 and 7.500 series mainframes and their successors support the same floating-point formats and instructions as the IBM System/360 and System/370.

The VAX processor implemented non-IEEE quadruple-precision floating point as its "H Floating-point" format. It had one sign bit, a 15-bit exponent and 112 fraction bits; however, the layout in memory was significantly different from IEEE quadruple precision and the exponent bias also differed. Only a few of the earliest VAX processors implemented H Floating-point instructions in hardware; all the others emulated H Floating-point in software.

The NEC Vector Engine architecture supports adding, subtracting, multiplying and comparing 128-bit binary IEEE 754 quadruple-precision numbers.[38] Two neighboring 64-bit registers are used. Quadruple-precision arithmetic is not supported in the vector register.[39]

The RISC-V architecture specifies a "Q" (quad-precision) extension for 128-bit binary IEEE 754-2008 floating-point arithmetic.[40] The "L" extension (not yet certified) will specify 64-bit and 128-bit decimal floating point.[41]

Quadruple-precision (128-bit) hardware implementation should not be confused with "128-bit FPUs" that implement SIMD instructions, such as Streaming SIMD Extensions or AltiVec, which refers to 128-bit vectors of four 32-bit single-precision or two 64-bit double-precision values that are operated on simultaneously.

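See also

  • IEEE 754, IEEE standard for floating-point arithmetic
  • ISO/IEC 10967, Language independent arithmetic
  • Primitive data type
  • Q notation (scientific notation)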

References

  1. ^ David H. Bailey; Jonathan M. Borwein (July 6, 2009). "High-Precision Computation and Mathematical Physics" (PDF).
  2. ^ Higham, Nicholas (2002). "Designing stable algorithms" in Accuracy and Stability of Numerical Algorithms (2 ed). SIAM. p. 43.
  3. ^ William Kahan (1 October 1987). "Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic" (PDF).
  4. ^ a b c d Yozo Hida, X. Li, and D. H. Bailey, Quad-Double Arithmetic: Algorithms, Implementation, and Application, Lawrence Berkeley National Laboratory Technical Report LBNL-46996 (2000). Also Y. Hida et al., Library for double-double and quad-double arithmetic (2007).
  5. ^ a b c J. R. Shewchuk, Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates, Discrete & Computational Geometry 18:305–363, 1997.
  6. ^ Knuth, D. E. The Art of Computer Programming (2nd ed.). chapter 4.2.3. problem 9.
  7. ^ Robert Munafo F107 and F161 High-Precision Floating-Point Data Types (2011).
  8. ^ 128-Bit Long Double Floating-Point Data Type
  9. ^ sourceware.org Re: The state of glibc libm
  10. ^ "Intel Fortran Compiler Product Brief" (PDF). Su. Archived from the original on October 25, 2008. Retrieved 2010-01-23.
  11. ^ "GCC 4.6 Release Series - Changes, New Features, and Fixes". Retrieved 2010-02-06.
  12. ^ "ISO/IEC TS 18661-3" (PDF). 2015-06-10. Retrieved 2019-09-22.
  13. ^ i386 and x86-64 Options, Using the GNU Compiler Collection.
  14. ^ Intel Developer Site
  15. ^ MSDN homepage, about Visual C++ compiler
  16. ^ "Procedure Call Standard for the ARM 64-bit Architecture (AArch64)" (PDF). 2013-05-22. Archived from the original (PDF) on 2019-10-16. Retrieved 2019-09-22.
  17. ^ RS/6000 and PowerPC Options, Using the GNU Compiler Collection.
  18. ^ Inside Macintosh - PowerPC Numerics. Archived October 9, 2012, at the Wayback Machine.
  19. ^ 128-bit long double support routines for Darwin
  20. ^ SPARC Options, Using the GNU Compiler Collection.
  21. ^ The Math Libraries, Sun Studio 11 Numerical Computation Guide (2005).
  22. ^ Additional Floating Types, Using the GNU Compiler Collection
  23. ^ a b "GCC 6 Release Series - Changes, New Features, and Fixes". Retrieved 2016-09-13.
  24. ^ Intel C++ Forums (2007).
  25. ^ "Carbon Language's main repository - Language design". GitHub. 2022-08-09. Retrieved 2022-09-22.
  26. ^ "Boost.Multiprecision - float128". Retrieved 2015-06-22.
  27. ^ Pavel Holoborodko (2013-01-20). "Fast Quadruple Precision Computations in MATLAB". Retrieved 2015-06-22.
  28. ^ "DoubleFloats.jl". GitHub.
  29. ^ "doubledouble.py". GitHub.
  30. ^ Schwarz, E. M.; Krygowski, C. A. (September 1999). "The S/390 G5 floating-point unit". IBM Journal of Research and Development. 43 (5/6): 707–721. CiteSeerX 10.1.1.117.6711. doi:10.1147/rd.435.0707.
  31. ^ Gerwig, G.; Wetter, H.; Schwarz, E. M.; Haess, J.; Krygowski, C. A.; Fleischer, B. M.; Kroener, M. (May 2004). "The IBM eServer z990 floating-point unit". IBM Journal of Research and Development. 48: 311–322.
  32. ^ Eric Schwarz (June 22, 2015). "The IBM z13 SIMD Accelerators for Integer, String, and Floating-Point" (PDF). Retrieved July 13, 2015.
  33. ^ "Implementor support for the binary interchange formats". grouper.ieee.org. Archived from the original on 2017-10-27. Retrieved 2021-07-15.
  34. ^ "The SPARC Architecture Manual: Version 8" (PDF). SPARC International, Inc. 1992. Archived from the original (PDF) on 2005-02-04. Retrieved 2011-09-24. SPARC is an instruction set architecture (ISA) with 32-bit integer and 32-, 64-, and 128-bit IEEE Standard 754 floating-point as its principal data types.
  35. ^ David L. Weaver; Tom Germond, eds. (1994). "The SPARC Architecture Manual: Version 9" (PDF). SPARC International, Inc. Archived from the original (PDF) on 2012-01-18. Retrieved 2011-09-24. Floating-point: The architecture provides an IEEE 754-compatible floating-point instruction set, operating on a separate register file that provides 32 single-precision (32-bit), 32 double-precision (64-bit), 16 quad-precision (128-bit) registers, or a mixture thereof.
  36. ^ "SPARC Behavior and Implementation". Numerical Computation Guide — Sun Studio 10. Sun Microsystems, Inc. 2004. Retrieved 2011-09-24. There are four situations, however, when the hardware will not successfully complete a floating-point instruction: ... The instruction is not implemented by the hardware (such as ... quad-precision instructions on any SPARC FPU).
  37. ^ Padegs A (1968). "Structural aspects of the System/360 Model 85, III: Extensions to floating-point architecture". IBM Systems Journal. 7: 22–29. doi:10.1147/sj.71.0022.
  38. ^ Vector Engine AssemblyLanguage Reference Manual, Chapter4 Assembler Syntax page 23.
  39. ^ SX-Aurora TSUBASA Architecture Guide Revision 1.1 (p. 38, 60).
  40. ^ RISC-V ISA Specification v. 20191213, Chapter 13, “Q” Standard Extension for Quad-Precision Floating-Point, page 79.
  41. ^ [1] Chapter 15 (p. 95).

External links

  • High-Precision Software Directory
  • QPFloat, a free software (GPL) software library for quadruple-precision arithmetic
  • HPAlib, a free software (LGPL) software library for quad-precision arithmetic
  • libquadmath, the GCC quad-precision math library
  • IEEE-754 Analysis, Interactive web page for examining Binary32, Binary64, and Binary128 floating-point values

quadruple, precision, floating, point, format, computing, quadruple, precision, quad, precision, binary, floating, point, based, computer, number, format, that, occupies, bytes, bits, with, precision, least, twice, double, precision, this, quadruple, precision. In computing quadruple precision or quad precision is a binary floating point based computer number format that occupies 16 bytes 128 bits with precision at least twice the 53 bit double precision This 128 bit quadruple precision is designed not only for applications requiring results in higher than double precision 1 but also as a primary function to allow the computation of double precision results more reliably and accurately by minimising overflow and round off errors in intermediate calculations and scratch variables William Kahan primary architect of the original IEEE 754 floating point standard noted For now the 10 byte Extended format is a tolerable compromise between the value of extra precise arithmetic and the price of implementing it to run fast very soon two more bytes of precision will become tolerable and ultimately a 16 byte format That kind of gradual evolution towards wider precision was already in view when IEEE Standard 754 for Floating Point Arithmetic was framed 2 In IEEE 754 2008 the 128 bit base 2 format is officially referred to as binary128 Contents 1 IEEE 754 quadruple precision binary floating point format binary128 1 1 Exponent encoding 1 2 Quadruple precision examples 2 Double double arithmetic 3 Implementations 3 1 Computer language support 3 2 Libraries and toolboxes 3 3 Hardware support 4 See also 5 References 6 External linksIEEE 754 quadruple precision binary floating point format binary128 editThe IEEE 754 standard specifies a binary128 as having Sign bit 1 bit Exponent width 15 bits Significand precision 113 bits 112 explicitly stored This gives from 33 to 36 significant decimal digits precision If a decimal string with at most 33 significant digits is converted to the 
IEEE 754 quadruple precision format giving a normal number and then converted back to a decimal string with the same number of digits the final result should match the original string If an IEEE 754 quadruple precision number is converted to a decimal string with at least 36 significant digits and then converted back to quadruple precision representation the final result must match the original number 3 The format is written with an implicit lead bit with value 1 unless the exponent is stored with all zeros Thus only 112 bits of the significand appear in the memory format but the total precision is 113 bits approximately 34 decimal digits log10 2113 34 016 The bits are laid out as nbsp Exponent encoding edit The quadruple precision binary floating point exponent is encoded using an offset binary representation with the zero offset being 16383 this is also known as exponent bias in the IEEE 754 standard Emin 000116 3FFF16 16382 Emax 7FFE16 3FFF16 16383 Exponent bias 3FFF16 16383Thus as defined by the offset binary representation in order to get the true exponent the offset of 16383 has to be subtracted from the stored exponent The stored exponents 000016 and 7FFF16 are interpreted specially Exponent Significand zero Significand non zero Equation000016 0 0 subnormal numbers 1 signbit 2 16382 0 significandbits2000116 7FFE16 normalized value 1 signbit 2exponentbits2 16383 1 significandbits27FFF16 NaN quiet signalling The minimum strictly positive subnormal value is 2 16494 10 4965 and has a precision of only one bit The minimum positive normal value is 2 16382 3 3621 10 4932 and has a precision of 113 bits i e 2 16494 as well The maximum representable value is 216384 216271 1 1897 104932 Quadruple precision examples edit These examples are given in bit representation in hexadecimal of the floating point value This includes the sign biased exponent and significand 0000 0000 0000 0000 0000 0000 0000 000116 2 16382 2 112 2 16494 6 4751751194380251109244389582276465525 10 
4966 smallest positive subnormal number 0000 ffff ffff ffff ffff ffff ffff ffff16 2 16382 1 2 112 3 3621031431120935062626778173217519551 10 4932 largest subnormal number 0001 0000 0000 0000 0000 0000 0000 000016 2 16382 3 3621031431120935062626778173217526026 10 4932 smallest positive normal number 7ffe ffff ffff ffff ffff ffff ffff ffff16 216383 2 2 112 1 1897314953572317650857593266280070162 104932 largest normal number 3ffe ffff ffff ffff ffff ffff ffff ffff16 1 2 113 0 9999999999999999999999999999999999037 largest number less than one 3fff 0000 0000 0000 0000 0000 0000 000016 1 one 3fff 0000 0000 0000 0000 0000 0000 000116 1 2 112 1 0000000000000000000000000000000001926 smallest number larger than one c000 0000 0000 0000 0000 0000 0000 000016 2 0000 0000 0000 0000 0000 0000 0000 000016 0 8000 0000 0000 0000 0000 0000 0000 000016 0 7fff 0000 0000 0000 0000 0000 0000 000016 infinity ffff 0000 0000 0000 0000 0000 0000 000016 infinity 4000 921f b544 42d1 8469 898c c517 01b816 p 3ffd 5555 5555 5555 5555 5555 5555 555516 1 3 By default 1 3 rounds down like double precision because of the odd number of bits in the significand So the bits beyond the rounding point are 0101 which is less than 1 2 of a unit in the last place Double double arithmetic editA common software technique to implement nearly quadruple precision using pairs of double precision values is sometimes called double double arithmetic 4 5 6 Using pairs of IEEE double precision values with 53 bit significands double double arithmetic provides operations on numbers with significands of at least 4 2 53 106 bits actually 107 bits 7 except for some of the largest values due to the limited exponent range only slightly less precise than the 113 bit significand of IEEE binary128 quadruple precision The range of a double double remains essentially the same as the double precision format because the exponent has still 11 bits 4 significantly lower than the 15 bit exponent of IEEE quadruple precision a range of 1 
8 10308 for double double versus 1 2 104932 for binary128 In particular a double double quadruple precision value q in the double double technique is represented implicitly as a sum q x y of two double precision values x and y each of which supplies half of q s significand 5 That is the pair x y is stored in place of q and operations on q values are transformed into equivalent but more complicated operations on the x and y values Thus arithmetic in this technique reduces to a sequence of double precision operations since double precision arithmetic is commonly implemented in hardware double double arithmetic is typically substantially faster than more general arbitrary precision arithmetic techniques 4 5 Note that double double arithmetic has the following special characteristics 8 As the magnitude of the value decreases the amount of extra precision also decreases Therefore the smallest number in the normalized range is narrower than double precision The smallest number with full precision is 1000 02 106 zeros 2 1074 or 1 000 02 106 zeros 2 968 Numbers whose magnitude is smaller than 2 1021 will not have additional precision compared with double precision The actual number of bits of precision can vary In general the magnitude of the low order part of the number is no greater than half ULP of the high order part If the low order part is less than half ULP of the high order part significant bits either all 0s or all 1s are implied between the significant of the high order and low order numbers Certain algorithms that rely on having a fixed number of bits in the significand can fail when using 128 bit long double numbers Because of the reason above it is possible to represent values like 1 2 1074 which is the smallest representable number greater than 1 In addition to the double double arithmetic it is also possible to generate triple double or quad double arithmetic if higher precision is required without any higher precision floating point library They are 
represented as a sum of three or four double precision values respectively They can represent operations with at least 159 161 and 212 215 bits respectively A similar technique can be used to produce a double quad arithmetic which is represented as a sum of two quadruple precision values They can represent operations with at least 226 or 227 bits 9 Implementations editQuadruple precision is often implemented in software by a variety of techniques such as the double double technique above although that technique does not implement IEEE quadruple precision since direct hardware support for quadruple precision is as of 2016 less common see Hardware support below One can use general arbitrary precision arithmetic libraries to obtain quadruple or higher precision but specialized quadruple precision implementations may achieve higher performance Computer language support edit A separate question is the extent to which quadruple precision types are directly incorporated into computer programming languages Quadruple precision is specified in Fortran by the real real128 module iso fortran env from Fortran 2008 must be used the constant real128 is equal to 16 on most processors or as real selected real kind 33 4931 or in a non standard way as REAL 16 Quadruple precision REAL 16 is supported by the Intel Fortran Compiler 10 and by the GNU Fortran compiler 11 on x86 x86 64 and Itanium architectures for example For the C programming language ISO IEC TS 18661 3 floating point extensions for C interchange and extended types specifies Float128 as the type implementing the IEEE 754 quadruple precision format binary128 12 Alternatively in C C with a few systems and compilers quadruple precision may be specified by the long double type but this is not required by the language which only requires long double to be at least as precise as double nor is it common On x86 and x86 64 the most common C C compilers implement long double as either 80 bit extended precision e g the GNU C 
Compiler gcc 13 and the Intel C Compiler with a Qlong double switch 14 or simply as being synonymous with double precision e g Microsoft Visual C 15 rather than as quadruple precision The procedure call standard for the ARM 64 bit architecture AArch64 specifies that long double corresponds to the IEEE 754 quadruple precision format 16 On a few other architectures some C C compilers implement long double as quadruple precision e g gcc on PowerPC as double double 17 18 19 and SPARC 20 or the Sun Studio compilers on SPARC 21 Even if long double is not quadruple precision however some C C compilers provide a nonstandard quadruple precision type as an extension For example gcc provides a quadruple precision type called float128 for x86 x86 64 and Itanium CPUs 22 and on PowerPC as IEEE 128 bit floating point using the mfloat128 hardware or mfloat128 options 23 and some versions of Intel s C C compiler for x86 and x86 64 supply a nonstandard quadruple precision type called Quad 24 Google s work in progress language Carbon provides support for it with the type called f128 25 Libraries and toolboxes edit The GCC quad precision math library libquadmath provides float128 and complex128 operations The Boost multiprecision library Boost Multiprecision provides unified cross platform C interface for float128 and Quad types and includes a custom implementation of the standard math library 26 The Multiprecision Computing Toolbox for MATLAB allows quadruple precision computations in MATLAB It includes basic arithmetic functionality as well as numerical methods dense and sparse linear algebra 27 The DoubleFloats 28 package provides support for double double computations for the Julia programming language The doubledouble py 29 library enables double double computations in Python citation needed Mathematica supports IEEE quad precision numbers 128 bit floating point values Real128 and 256 bit complex values Complex256 citation needed Hardware support edit IEEE quadruple precision was 
added to the IBM System 390 G5 in 1998 30 and is supported in hardware in subsequent z Architecture processors 31 32 The IBM POWER9 CPU Power ISA 3 0 has native 128 bit hardware support 23 Native support of IEEE 128 bit floats is defined in PA RISC 1 0 33 and in SPARC V8 34 and V9 35 architectures e g there are 16 quad precision registers q0 q4 but no SPARC CPU implements quad precision operations in hardware as of 2004 update 36 Non IEEE extended precision 128 bits of storage 1 sign bit 7 exponent bits 112 fraction bits 8 bits unused was added to the IBM System 370 series 1970s 1980s and was available on some System 360 models in the 1960s System 360 85 37 195 and others by special request or simulated by OS software The Siemens 7 700 and 7 500 series mainframes and their successors support the same floating point formats and instructions as the IBM System 360 and System 370 The VAX processor implemented non IEEE quadruple precision floating point as its H Floating point format It had one sign bit a 15 bit exponent and 112 fraction bits however the layout in memory was significantly different from IEEE quadruple precision and the exponent bias also differed Only a few of the earliest VAX processors implemented H Floating point instructions in hardware all the others emulated H Floating point in software The NEC Vector Engine architecture supports adding subtracting multiplying and comparing 128 bit binary IEEE 754 quadruple precision numbers 38 Two neighboring 64 bit registers are used Quadruple precision arithmetic is not supported in the vector register 39 The RISC V architecture specifies a Q quad precision extension for 128 bit binary IEEE 754 2008 floating point arithmetic 40 The L extension not yet certified will specify 64 bit and 128 bit decimal floating point 41 Quadruple precision 128 bit hardware implementation should not be confused with 128 bit FPUs that implement SIMD instructions such as Streaming SIMD Extensions or AltiVec which refers to 128 bit 
vectors of four 32-bit single-precision or two 64-bit double-precision values that are operated on simultaneously.

See also

IEEE 754, IEEE standard for floating-point arithmetic
ISO/IEC 10967, Language independent arithmetic
Primitive data type
Q notation (scientific notation)

References

1. David H. Bailey; Jonathan M. Borwein (July 6, 2009). "High-Precision Computation and Mathematical Physics" (PDF).
2. Higham, Nicholas (2002). "Designing stable algorithms" in Accuracy and Stability of Numerical Algorithms (2 ed.). SIAM. p. 43.
3. William Kahan (1 October 1987). "Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic" (PDF).
4. Yozo Hida, X. Li, and D. H. Bailey, Quad-Double Arithmetic: Algorithms, Implementation, and Application, Lawrence Berkeley National Laboratory Technical Report LBNL-46996 (2000). Also Y. Hida et al., Library for double-double and quad-double arithmetic (2007).
5. J. R. Shewchuk, Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates, Discrete & Computational Geometry 18: 305-363, 1997.
6. Knuth, D. E. The Art of Computer Programming (2nd ed.). Chapter 4.2.3, problem 9.
7. Robert Munafo. F107 and F161 High-Precision Floating-Point Data Types (2011).
8. 128-Bit Long Double Floating-Point Data Type.
9. sourceware.org. Re: The state of glibc libm.
10. "Intel Fortran Compiler Product Brief (archived copy on web.archive.org)" (PDF). Archived from the original on October 25, 2008. Retrieved 2010-01-23.
11. "GCC 4.6 Release Series: Changes, New Features, and Fixes". Retrieved 2010-02-06.
12. "ISO/IEC TS 18661-3" (PDF). 2015-06-10. Retrieved 2019-09-22.
13. i386 and x86-64 Options (archived copy on web.archive.org), Using the GNU Compiler Collection.
14. Intel Developer Site.
15. MSDN homepage, about Visual C++ compiler.
16. "Procedure Call Standard for the ARM 64-bit Architecture (AArch64)" (PDF). 2013-05-22. Archived from the original (PDF) on 2019-10-16. Retrieved 2019-09-22.
17. RS/6000 and PowerPC Options, Using the GNU Compiler Collection.
18. Inside
Macintosh: PowerPC Numerics. Archived October 9, 2012, at the Wayback Machine.
19. 128-bit long double support routines for Darwin.
20. SPARC Options, Using the GNU Compiler Collection.
21. The Math Libraries, Sun Studio 11 Numerical Computation Guide (2005).
22. Additional Floating Types, Using the GNU Compiler Collection.
23. "GCC 6 Release Series: Changes, New Features, and Fixes". Retrieved 2016-09-13.
24. Intel C++ Forums (2007).
25. "Carbon Language's main repository: Language design". GitHub. 2022-08-09. Retrieved 2022-09-22.
26. "Boost.Multiprecision: float128". Retrieved 2015-06-22.
27. Pavel Holoborodko (2013-01-20). "Fast Quadruple Precision Computations in MATLAB". Retrieved 2015-06-22.
28. "DoubleFloats.jl". GitHub.
29. "doubledouble.py". GitHub.
30. Schwarz, E. M.; Krygowski, C. A. (September 1999). "The S/390 G5 floating-point unit". IBM Journal of Research and Development. 43 (5/6): 707-721. CiteSeerX 10.1.1.117.6711. doi:10.1147/rd.435.0707.
31. Gerwig, G.; Wetter, H.; Schwarz, E. M.; Haess, J.; Krygowski, C. A.; Fleischer, B. M.; Kroener, M. (May 2004). "The IBM eServer z990 floating-point unit". IBM J. Res. Dev. 48: 311-322.
32. Eric Schwarz (June 22, 2015). "The IBM z13 SIMD Accelerators for Integer, String, and Floating-Point" (PDF). Retrieved July 13, 2015.
33. "Implementor support for the binary interchange formats". grouper.ieee.org. Archived from the original on 2017-10-27. Retrieved 2021-07-15.
34. The SPARC Architecture Manual: Version 8 (archived copy on web.archive.org) (PDF). SPARC International, Inc. 1992. Archived from the original (PDF) on 2005-02-04. Retrieved 2011-09-24. "SPARC is an instruction set architecture (ISA) with 32-bit integer and 32-, 64-, and 128-bit IEEE Standard 754 floating-point as its principal data types."
35. David L. Weaver; Tom Germond, eds. (1994). The SPARC Architecture Manual: Version 9 (archived copy on web.archive.org) (PDF). SPARC International, Inc. Archived from the original (PDF) on 2012-01-18. Retrieved 2011-09-24. "Floating-point: The architecture provides an IEEE 754-compatible floating-
point instruction set, operating on a separate register file that provides 32 single-precision (32-bit), 32 double-precision (64-bit), 16 quad-precision (128-bit) registers, or a mixture thereof."
36. "SPARC Behavior and Implementation". Numerical Computation Guide: Sun Studio 10. Sun Microsystems, Inc. 2004. Retrieved 2011-09-24. "There are four situations, however, when the hardware will not successfully complete a floating-point instruction: The instruction is not implemented by the hardware (such as quad-precision instructions on any SPARC FPU)."
37. Padegs, A. (1968). "Structural aspects of the System/360 Model 85, III: Extensions to floating-point architecture". IBM Systems Journal. 7: 22-29. doi:10.1147/sj.71.0022.
38. Vector Engine Assembly Language Reference Manual, Chapter 4, Assembler Syntax, page 23.
39. SX-Aurora TSUBASA Architecture Guide, Revision 1.1 (pp. 38, 60).
40. RISC-V ISA Specification v. 20191213, Chapter 13, "Q" Standard Extension for Quad-Precision Floating-Point, page 79.
41. [1] Chapter 15 (p. 95).

External links

High-Precision Software Directory
QPFloat, a free software (GPL) software library for quadruple-precision arithmetic
HPAlib, a free software (LGPL) software library for quad-precision arithmetic
libquadmath, the GCC quad-precision math library
IEEE-754 Analysis, interactive web page for examining Binary32, Binary64, and Binary128 floating-point values

Retrieved from "https://en.wikipedia.org/w/index.php?title=Quadruple-precision_floating-point_format&oldid=1180647258"
