
Double-precision floating-point format

Double-precision floating-point format (sometimes called FP64 or float64) is a floating-point number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.

Double precision may be chosen when the range or precision of single precision would be insufficient.

In the IEEE 754-2008 standard, the 64-bit base-2 format is officially referred to as binary64; it was called double in IEEE 754-1985. IEEE 754 specifies additional floating-point formats, including 32-bit base-2 single precision and, more recently, base-10 representations (decimal floating point).

One of the first programming languages to provide floating-point data types was Fortran.[citation needed] Before the widespread adoption of IEEE 754-1985, the representation and properties of floating-point data types depended on the computer manufacturer and computer model, and upon decisions made by programming-language implementers. E.g., GW-BASIC's double-precision data type was the 64-bit MBF floating-point format.

IEEE 754 double-precision binary floating-point format: binary64

Double-precision binary floating-point is a commonly used format on PCs, due to its wider range over single-precision floating point, in spite of its performance and bandwidth cost. It is commonly known simply as double. The IEEE 754 standard specifies a binary64 as having:

  • Sign bit: 1 bit
  • Exponent: 11 bits
  • Significand precision: 53 bits (52 explicitly stored)

The sign bit determines the sign of the number (including when this number is zero, which is signed).

The exponent field is an 11-bit unsigned integer from 0 to 2047, in biased form: an exponent value of 1023 represents the actual zero. Exponents range from −1022 to +1023 because exponents of −1023 (all 0s) and +1024 (all 1s) are reserved for special numbers.

The 53-bit significand precision gives from 15 to 17 significant decimal digits precision (2⁻⁵³ ≈ 1.11 × 10⁻¹⁶). If a decimal string with at most 15 significant digits is converted to the IEEE 754 double-precision format, giving a normal number, and then converted back to a decimal string with the same number of digits, the final result should match the original string. If an IEEE 754 double-precision number is converted to a decimal string with at least 17 significant digits, and then converted back to double-precision representation, the final result must match the original number.[1]
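Both round trips can be checked directly; here is a minimal Python sketch (the 15- and 17-digit figures come from the paragraph above; the test values are illustrative):

```python
# 17 significant digits always recover the exact double:
x = 0.1
assert float(f"{x:.17g}") == x

# A decimal string with 15 significant digits survives a round trip
# through a double and back:
s = "0.123456789012345"
assert f"{float(s):.15g}" == s
```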

The format is written with the significand having an implicit integer bit of value 1 (except for special data, see the exponent encoding below). With the 52 bits of the fraction (F) significand appearing in the memory format, the total precision is therefore 53 bits (approximately 16 decimal digits, 53 log₁₀(2) ≈ 15.955). The bits are laid out as follows:

sign (1 bit, bit 63) | exponent (11 bits, bits 62–52) | fraction (52 bits, bits 51–0)

The real value assumed by a given 64-bit double-precision datum with a given biased exponent $e$ and a 52-bit fraction is

$(-1)^{\text{sign}} (1.b_{51} b_{50} \ldots b_{0})_{2} \times 2^{e-1023}$

or

$(-1)^{\text{sign}} \left(1 + \sum_{i=1}^{52} b_{52-i} 2^{-i}\right) \times 2^{e-1023}$
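This decoding is mechanical enough to spell out in code. The following Python sketch (illustrative only; the helper names are made up here) splits a double into its fields and applies the formula above for normal numbers:

```python
import struct

def decode_double(x: float) -> tuple[int, int, int]:
    """Split a double into its (sign, biased exponent, fraction) fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF    # 11-bit biased exponent
    fraction = bits & ((1 << 52) - 1)  # 52 explicitly stored fraction bits
    return sign, exponent, fraction

def reconstruct(sign: int, exponent: int, fraction: int) -> float:
    """Value of a normal number (0 < exponent < 2047) per the formula above."""
    return (-1) ** sign * (1 + fraction * 2.0**-52) * 2.0 ** (exponent - 1023)

s, e, f = decode_double(3.0)
assert (s, e, f) == (0, 0x400, 1 << 51)  # +2^1 × 1.1₂ = 3
assert reconstruct(s, e, f) == 3.0
```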

Between 2⁵² = 4,503,599,627,370,496 and 2⁵³ = 9,007,199,254,740,992 the representable numbers are exactly the integers. For the next range, from 2⁵³ to 2⁵⁴, everything is multiplied by 2, so the representable numbers are the even ones, etc. Conversely, for the previous range from 2⁵¹ to 2⁵², the spacing is 0.5, etc.

The spacing as a fraction of the numbers in the range from 2ⁿ to 2ⁿ⁺¹ is 2ⁿ⁻⁵². The maximum relative rounding error when rounding a number to the nearest representable one (the machine epsilon) is therefore 2⁻⁵³.
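These spacings can be probed with math.ulp (available since Python 3.9); a small sketch:

```python
import math

assert math.ulp(1.0) == 2**-52   # spacing just above 1.0
assert math.ulp(2.0**52) == 1.0  # integers are exactly representable here
assert math.ulp(2.0**53) == 2.0  # spacing doubles in the next binade

# Machine epsilon: adding half an ulp to 1.0 rounds back to 1.0 (ties to even)
assert 1.0 + 2**-53 == 1.0
```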

The 11-bit width of the exponent allows the representation of numbers between 10⁻³⁰⁸ and 10³⁰⁸, with full 15–17 decimal digits precision. By compromising precision, the subnormal representation allows even smaller values, down to about 5 × 10⁻³²⁴.
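These limits are exposed at runtime; for example, through Python's sys.float_info (a quick illustrative check):

```python
import sys

print(sys.float_info.max)      # 1.7976931348623157e+308 (largest double)
print(sys.float_info.min)      # 2.2250738585072014e-308 (smallest normal)
print(sys.float_info.epsilon)  # 2.220446049250313e-16 = 2^-52
print(5e-324)                  # smallest positive subnormal
```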

Exponent encoding

The double-precision binary floating-point exponent is encoded using an offset-binary representation, with the zero offset being 1023 (known as the exponent bias in the IEEE 754 standard). Examples of such representations:

e = 00000000001₂ = 001₁₆ = 1: 2¹⁻¹⁰²³ = 2⁻¹⁰²² (smallest exponent for normal numbers)
e = 01111111111₂ = 3ff₁₆ = 1023: 2¹⁰²³⁻¹⁰²³ = 2⁰ (zero offset)
e = 10000000101₂ = 405₁₆ = 1029: 2¹⁰²⁹⁻¹⁰²³ = 2⁶
e = 11111111110₂ = 7fe₁₆ = 2046: 2²⁰⁴⁶⁻¹⁰²³ = 2¹⁰²³ (highest exponent)

The exponents 000₁₆ and 7ff₁₆ have a special meaning:

  • 00000000000₂ = 000₁₆ is used to represent a signed zero (if F = 0) and subnormal numbers (if F ≠ 0); and
  • 11111111111₂ = 7ff₁₆ is used to represent ∞ (if F = 0) and NaNs (if F ≠ 0),

where F is the fractional part of the significand. All bit patterns are valid encodings.

Apart from those exceptions, the entire double-precision number is described by:

$(-1)^{\text{sign}} \times 2^{e-1023} \times 1.\text{fraction}$

In the case of subnormal numbers (e = 0) the double-precision number is described by:

$(-1)^{\text{sign}} \times 2^{1-1023} \times 0.\text{fraction} = (-1)^{\text{sign}} \times 2^{-1022} \times 0.\text{fraction}$
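Extending the earlier sketch, a subnormal simply drops the implicit leading 1 and pins the scale at 2⁻¹⁰²² (again an illustrative Python fragment, not a reference implementation):

```python
def decode_value(sign: int, exponent: int, fraction: int) -> float:
    if exponent == 0:  # zero or subnormal: 0.fraction scaled by 2^-1022
        return (-1) ** sign * (fraction * 2.0**-52) * 2.0**-1022
    # normal: implicit leading 1
    return (-1) ** sign * (1 + fraction * 2.0**-52) * 2.0 ** (exponent - 1023)

# Smallest positive subnormal: fraction = 1, exponent = 0, giving 2^-1074
assert decode_value(0, 0, 1) == 5e-324
```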

Endianness

Although many processors use little-endian storage for all types of data (integer, floating point), there are a number of hardware architectures where floating-point numbers are represented in big-endian form while integers are represented in little-endian form.[2] There are ARM processors that have mixed-endian floating-point representation for double-precision numbers: each of the two 32-bit words is stored as little-endian, but the most significant word is stored first. VAX floating point stores little-endian 16-bit words in big-endian order. Because there have been many floating-point formats with no network standard representation for them, the XDR standard uses big-endian IEEE 754 as its representation. It may therefore appear strange that the widespread IEEE 754 floating-point standard does not specify endianness.[3] Theoretically, this means that even standard IEEE floating-point data written by one machine might not be readable by another. However, on modern standard computers (i.e., implementing IEEE 754), one may safely assume that the endianness is the same for floating-point numbers as for integers, making the conversion straightforward regardless of data type. Small embedded systems using special floating-point formats may be another matter, however.
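The byte-order question is easy to make concrete; a short Python sketch using the standard struct format characters:

```python
import struct

# The same double, 1.0 = 3FF0 0000 0000 0000₁₆, in both byte orders:
print(struct.pack(">d", 1.0).hex())  # big-endian:    3ff0000000000000
print(struct.pack("<d", 1.0).hex())  # little-endian: 000000000000f03f
```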

Double-precision examples

0 01111111111 0000000000000000000000000000000000000000000000000000₂ ≙ 3FF0 0000 0000 0000₁₆ ≙ +2⁰ × 1 = 1
0 01111111111 0000000000000000000000000000000000000000000000000001₂ ≙ 3FF0 0000 0000 0001₁₆ ≙ +2⁰ × (1 + 2⁻⁵²) ≈ 1.0000000000000002, the smallest number > 1
0 01111111111 0000000000000000000000000000000000000000000000000010₂ ≙ 3FF0 0000 0000 0002₁₆ ≙ +2⁰ × (1 + 2⁻⁵¹) ≈ 1.0000000000000004
0 10000000000 0000000000000000000000000000000000000000000000000000₂ ≙ 4000 0000 0000 0000₁₆ ≙ +2¹ × 1 = 2
1 10000000000 0000000000000000000000000000000000000000000000000000₂ ≙ C000 0000 0000 0000₁₆ ≙ −2¹ × 1 = −2
0 10000000000 1000000000000000000000000000000000000000000000000000₂ ≙ 4008 0000 0000 0000₁₆ ≙ +2¹ × 1.1₂ = 11₂ = 3
0 10000000001 0000000000000000000000000000000000000000000000000000₂ ≙ 4010 0000 0000 0000₁₆ ≙ +2² × 1 = 100₂ = 4
0 10000000001 0100000000000000000000000000000000000000000000000000₂ ≙ 4014 0000 0000 0000₁₆ ≙ +2² × 1.01₂ = 101₂ = 5
0 10000000001 1000000000000000000000000000000000000000000000000000₂ ≙ 4018 0000 0000 0000₁₆ ≙ +2² × 1.1₂ = 110₂ = 6
0 10000000011 0111000000000000000000000000000000000000000000000000₂ ≙ 4037 0000 0000 0000₁₆ ≙ +2⁴ × 1.0111₂ = 10111₂ = 23
0 01111111000 1000000000000000000000000000000000000000000000000000₂ ≙ 3F88 0000 0000 0000₁₆ ≙ +2⁻⁷ × 1.1₂ = 0.00000011₂ = 0.01171875 (3/256)
0 00000000000 0000000000000000000000000000000000000000000000000001₂ ≙ 0000 0000 0000 0001₁₆ ≙ +2⁻¹⁰²² × 2⁻⁵² = 2⁻¹⁰⁷⁴ ≈ 4.9406564584124654 × 10⁻³²⁴ (Min. subnormal positive double)
0 00000000000 1111111111111111111111111111111111111111111111111111₂ ≙ 000F FFFF FFFF FFFF₁₆ ≙ +2⁻¹⁰²² × (1 − 2⁻⁵²) ≈ 2.2250738585072009 × 10⁻³⁰⁸ (Max. subnormal double)
0 00000000001 0000000000000000000000000000000000000000000000000000₂ ≙ 0010 0000 0000 0000₁₆ ≙ +2⁻¹⁰²² × 1 ≈ 2.2250738585072014 × 10⁻³⁰⁸ (Min. normal positive double)
0 11111111110 1111111111111111111111111111111111111111111111111111₂ ≙ 7FEF FFFF FFFF FFFF₁₆ ≙ +2¹⁰²³ × (1 + (1 − 2⁻⁵²)) ≈ 1.7976931348623157 × 10³⁰⁸ (Max. double)
0 00000000000 0000000000000000000000000000000000000000000000000000₂ ≙ 0000 0000 0000 0000₁₆ ≙ +0
1 00000000000 0000000000000000000000000000000000000000000000000000₂ ≙ 8000 0000 0000 0000₁₆ ≙ −0
0 11111111111 0000000000000000000000000000000000000000000000000000₂ ≙ 7FF0 0000 0000 0000₁₆ ≙ +∞ (positive infinity)
1 11111111111 0000000000000000000000000000000000000000000000000000₂ ≙ FFF0 0000 0000 0000₁₆ ≙ −∞ (negative infinity)
0 11111111111 0000000000000000000000000000000000000000000000000001₂ ≙ 7FF0 0000 0000 0001₁₆ ≙ NaN (sNaN on most processors, such as x86 and ARM)
0 11111111111 1000000000000000000000000000000000000000000000000001₂ ≙ 7FF8 0000 0000 0001₁₆ ≙ NaN (qNaN on most processors, such as x86 and ARM)
0 11111111111 1111111111111111111111111111111111111111111111111111₂ ≙ 7FFF FFFF FFFF FFFF₁₆ ≙ NaN (an alternative encoding of NaN)
0 01111111101 0101010101010101010101010101010101010101010101010101₂ = 3FD5 5555 5555 5555₁₆ ≙ +2⁻² × (1 + 2⁻² + 2⁻⁴ + ... + 2⁻⁵²) ≈ 1/3
0 10000000000 1001001000011111101101010100010001000010110100011000₂ = 4009 21FB 5444 2D18₁₆ ≈ pi

Encodings of qNaN and sNaN are not completely specified in IEEE 754 and depend on the processor. Most processors, such as the x86 family and the ARM family processors, use the most significant bit of the significand field to indicate a quiet NaN; this is what is recommended by IEEE 754. The PA-RISC processors use the bit to indicate a signaling NaN.
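The quiet-NaN convention can be observed directly (a Python sketch; the exact bit pattern produced for math.nan is platform-dependent, and 0x7ff8000000000000 is the typical x86/ARM value):

```python
import math
import struct

bits = struct.unpack(">Q", struct.pack(">d", math.nan))[0]
print(hex(bits))              # typically 0x7ff8000000000000 on x86 and ARM
assert (bits >> 51) & 1 == 1  # most significant fraction bit set: quiet NaN
```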

By default, 1/3 rounds down, instead of up like single precision, because of the odd number of bits in the significand.

In more detail:

Given the hexadecimal representation 3FD5 5555 5555 5555₁₆:

  Sign = 0
  Exponent = 3FD₁₆ = 1021
  Exponent bias = 1023 (constant value; see above)
  Fraction = 5 5555 5555 5555₁₆

  Value = 2^(Exponent − Exponent bias) × 1.Fraction (note that the Fraction must not be converted to decimal here)
        = 2⁻² × (15 5555 5555 5555₁₆ × 2⁻⁵²)
        = 2⁻⁵⁴ × 15 5555 5555 5555₁₆
        = 0.333333333333333314829616256247390992939472198486328125
        ≈ 1/3
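This is quick to confirm (illustrative Python):

```python
import struct

third = struct.unpack(">d", bytes.fromhex("3FD5555555555555"))[0]
assert third == 1 / 3  # the double nearest to 1/3
print(f"{third:.54f}")
# 0.333333333333333314829616256247390992939472198486328125
```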

Execution speed with double-precision arithmetic

Using double-precision floating-point variables is usually slower than working with their single precision counterparts. One area of computing where this is a particular issue is parallel code running on GPUs. For example, when using NVIDIA's CUDA platform, calculations with double precision can take, depending on hardware, from 2 to 32 times as long to complete compared to those done using single precision.[4]

Additionally, many mathematical functions (e.g., sin, cos, atan2, log, exp and sqrt) need more computations to give accurate double-precision results, and are therefore slower.

Precision limitations on integer values

  • Integers from −2⁵³ to 2⁵³ (−9,007,199,254,740,992 to 9,007,199,254,740,992) can be exactly represented.
  • Integers between 2⁵³ and 2⁵⁴ = 18,014,398,509,481,984 round to a multiple of 2 (even number).
  • Integers between 2⁵⁴ and 2⁵⁵ = 36,028,797,018,963,968 round to a multiple of 4.
  • Integers between 2ⁿ and 2ⁿ⁺¹ round to a multiple of 2ⁿ⁻⁵².
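These limits are easy to demonstrate (a minimal Python sketch of the list above):

```python
big = float(2**53)
assert big + 1 == big                    # 2^53 + 1 rounds back to 2^53
assert big + 2 == float(2**53 + 2)       # even integers still exact here
assert float(2**54 + 2) == float(2**54)  # above 2^54: multiples of 4 only
```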

Implementations

Doubles are implemented in many programming languages in different ways, such as the following. On processors with only dynamic precision, such as x86 without SSE2 (or when SSE2 is not used, for compatibility purposes) and with extended precision used by default, software may have difficulty fulfilling some requirements.

C and C++

C and C++ offer a wide variety of arithmetic types. Double precision is not required by the standards (except by the optional annex F of C99, covering IEEE 754 arithmetic), but on most systems, the double type corresponds to double precision. However, on 32-bit x86 with extended precision by default, some compilers may not conform to the C standard or the arithmetic may suffer from double rounding.[5]

Fortran

Fortran provides several integer and real types, and the 64-bit kind real64, accessible via Fortran's intrinsic module iso_fortran_env, corresponds to double precision.

Common Lisp

Common Lisp provides the types SHORT-FLOAT, SINGLE-FLOAT, DOUBLE-FLOAT and LONG-FLOAT. Most implementations provide SINGLE-FLOATs and DOUBLE-FLOATs, with the other types being appropriate synonyms. Common Lisp provides exceptions for catching floating-point underflows and overflows, and the inexact floating-point exception, as per IEEE 754. Infinities and NaNs are not described in the ANSI standard; however, several implementations do provide these as extensions.

Java

In Java before version 1.2, every implementation had to be IEEE 754 compliant. Version 1.2 allowed implementations to use extra precision in intermediate computations, for platforms like x87. A strictfp modifier was therefore introduced to enforce strict IEEE 754 computations. Strict floating point has been restored in Java 17.[6]

JavaScript

As specified by the ECMAScript standard, all arithmetic in JavaScript shall be done using double-precision floating-point arithmetic.[7]

JSON

The JSON data encoding format supports numeric values, and the grammar to which numeric expressions must conform has no limits on the precision or range of the numbers so encoded. However, RFC 8259 advises that, since IEEE 754 binary64 numbers are widely implemented, good interoperability can be achieved by implementations processing JSON if they expect no more precision or range than binary64 offers.[8]
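The interoperability caveat is visible with any binary64-based consumer; a Python sketch (Python's json module parses JSON integers as arbitrary-precision ints, but JSON reals as doubles):

```python
import json

# 2^53 + 1 = 9007199254740993 is not representable as a double
print(json.loads("9007199254740993"))    # int path: 9007199254740993 (exact)
print(json.loads("9007199254740993.0"))  # double path: 9007199254740992.0
```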

Notes and references

  1. ^ William Kahan (1 October 1997). "Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic" (PDF). Archived (PDF) from the original on 8 February 2012.
  2. ^ Savard, John J. G. (2018) [2005], "Floating-Point Formats", quadibloc, archived from the original on 2018-07-03, retrieved 2018-07-16.
  3. ^ "pack – convert a list into a binary representation".
  4. ^ "Nvidia's New Titan V Pushes 110 Teraflops From A Single Chip". Tom's Hardware. 2017-12-08. Retrieved 2018-11-05.
  5. ^ "Bug 323 – optimized code gives strange floating point results". gcc.gnu.org. Archived from the original on 30 April 2018. Retrieved 30 April 2018.
  6. ^ Darcy, Joseph D. "JEP 306: Restore Always-Strict Floating-Point Semantics". Retrieved 2021-09-12.
  7. ^ ECMA-262 ECMAScript Language Specification (PDF) (5th ed.). Ecma International. p. 29, §8.5 The Number Type. Archived (PDF) from the original on 2012-03-13.
  8. ^ "The JavaScript Object Notation (JSON) Data Interchange Format". Internet Engineering Task Force. December 2017. Retrieved 2022-02-01.
