Fermi (microarchitecture)

Fermi is the codename for a graphics processing unit (GPU) microarchitecture developed by Nvidia, first released to retail in April 2010, as the successor to the Tesla microarchitecture. It was the primary microarchitecture used in the GeForce 400 series and GeForce 500 series. All desktop Fermi GPUs were manufactured in 40 nm; mobile Fermi GPUs were manufactured in 40 nm and 28 nm[citation needed]. Fermi is the oldest microarchitecture from Nvidia to support Microsoft's rendering API Direct3D 12 at feature level 11.

Nvidia Fermi

The NVIDIA GeForce GTX 590, of the GeForce 500 line of graphics cards, was the final major iteration featuring the Fermi microarchitecture (GF110-351-A1).

  • Release date: April 2010
  • Manufactured by: TSMC
  • Designed by: Nvidia
  • Fabrication process: 40 nm and 28 nm[citation needed]

History
  • Predecessor: Tesla
  • Successor: Kepler

Support status
  • Unsupported

[Photo of Enrico Fermi, eponym of the architecture]

Fermi was followed by Kepler, and used alongside Kepler in the GeForce 600 series, GeForce 700 series, and GeForce 800 series, in the latter two only in mobile GPUs.

In the workstation market, Fermi found use in the Quadro x000 series, Quadro NVS models, and in Nvidia Tesla computing modules.

The architecture is named after Enrico Fermi, an Italian physicist.

Overview

[Image: NVIDIA GeForce GTX 480 of the GeForce 400 line of graphics cards; the first iteration to feature the Fermi microarchitecture (GF100-375-A3).]
[Fig. 1: NVIDIA Fermi architecture. Convention in figures: orange = scheduling and dispatch; green = execution; light blue = registers and caches.]
[Image: Die shot of the GF100 GPU found inside GeForce GTX 470 cards.]

Fermi Graphics Processing Units (GPUs) feature 3.0 billion transistors; a schematic is sketched in Fig. 1. Key features (several of which can be queried at runtime; see the sketch after the list):

  • Streaming Multiprocessor (SM): composed of 32 CUDA cores (see Streaming Multiprocessor and CUDA core sections).
  • GigaThread global scheduler: distributes thread blocks to SM thread schedulers and manages the context switches between threads during execution (see Warp Scheduling section).
  • Host interface: connects the GPU to the CPU via a PCI-Express v2 bus (peak transfer rate of 8 GB/s).
  • DRAM: supported up to 6 GB of GDDR5 DRAM memory thanks to the 64-bit addressing capability (see Memory Architecture section).
  • Clock frequency: 1.5 GHz (not released by NVIDIA, but estimated by Insight 64).
  • Peak performance: 1.5 TFlops.
  • Global memory clock: 2 GHz.
  • DRAM bandwidth: 192 GB/s.
  • H.264 FHD decode support.
  • H.265 FHD decode support (GT 730 only).[1]
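
Several of the figures above (SM count, memory size, clock rates, bus width) are reported by the CUDA runtime API. A minimal sketch, assuming a CUDA toolkit installation and a Fermi-class card installed as device 0:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0
        printf("Name:             %s\n", prop.name);
        printf("SMs:              %d\n", prop.multiProcessorCount);
        printf("Global memory:    %zu MB\n", prop.totalGlobalMem >> 20);
        printf("Shader clock:     %.3f GHz\n", prop.clockRate / 1e6);       // clockRate is in kHz
        printf("Memory clock:     %.3f GHz\n", prop.memoryClockRate / 1e6); // also in kHz
        printf("Memory bus width: %d bits\n", prop.memoryBusWidth);
        return 0;
    }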

Streaming multiprocessor

Each SM features 32 single-precision CUDA cores, 16 load/store units, four Special Function Units (SFUs), a 64 KB block of high-speed on-chip memory (see L1+Shared Memory subsection) and an interface to the L2 cache (see L2 Cache subsection).

Load/Store Units

The load/store units allow source and destination addresses to be calculated for 16 threads per clock, and load or store data from/to cache or DRAM.

Special Function Units (SFUs)

Execute transcendental instructions such as sine, cosine, reciprocal, and square root. Each SFU executes one instruction per thread, per clock, so a warp executes over eight clocks. The SFU pipeline is decoupled from the dispatch unit, allowing the dispatch unit to issue to other execution units while the SFU is occupied.
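
In CUDA C, the SFU's reduced-precision hardware paths are exposed through fast-math intrinsics such as __sinf() and __expf(). A minimal device-code sketch (the kernel and data names are illustrative):

    __global__ void sfu_demo(const float* x, float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // __sinf/__expf execute on the SFUs; plain sinf()/expf() are
            // slower but more accurate software routines.
            y[i] = __sinf(x[i]) + __expf(-x[i]);
        }
    }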

CUDA core

Integer Arithmetic Logic Unit (ALU)

Supports full 32-bit precision for all instructions, consistent with standard programming language requirements.[which?] It is also optimized to efficiently support 64-bit operations in workstation and server models, but these are artificially limited in consumer versions.

Floating Point Unit (FPU)

Implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. Up to 16 double precision fused multiply-add operations can be performed per SM, per clock.[2]

Fused multiply-add

Fused multiply-add (FMA) performs multiplication and addition (i.e., A*B+C) with a single final rounding step, with no loss of precision in the addition. FMA is therefore more accurate than performing the operations separately.
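
A minimal device-code sketch contrasting the two forms (the kernel name is illustrative); fmaf() maps to a single FMA instruction:

    __global__ void fma_demo(const float* a, const float* b,
                             const float* c, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Unfused: the product is rounded to 32 bits, then the sum is
            // rounded again (nvcc may still fuse this automatically unless
            // built with --fmad=false).
            float separate = a[i] * b[i] + c[i];
            // Fused: one multiply-add with a single final rounding step.
            float fused = fmaf(a[i], b[i], c[i]);
            out[i] = fused - separate;   // usually 0, at most ~1 ulp apart
        }
    }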

Warp scheduling

The Fermi architecture uses a two-level, distributed thread scheduler.

Each SM can issue instructions consuming any two of the four green execution columns shown in the schematic Fig. 1. For example, the SM can mix 16 operations from the 16 first-column cores with 16 operations from the 16 second-column cores, or 16 operations from the load/store units with four from the SFUs, or any other combination the program specifies.

64-bit floating-point operations require the first two execution columns together, and so run at half the speed of 32-bit operations.

Dual Warp Scheduler

At the SM level, each warp scheduler distributes warps of 32 threads to its execution units. Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. The dual warp scheduler selects two warps and issues one instruction from each warp to a group of 16 cores, 16 load/store units, or four SFUs. Most instructions can be dual-issued: two integer instructions, two floating-point instructions, or a mix of integer, floating-point, load, store, and SFU instructions can be issued concurrently. Double-precision instructions do not support dual dispatch with any other operation.[citation needed]

Performance

The theoretical single-precision processing power of a Fermi GPU in GFLOPS is computed as 2 (operations per FMA instruction per CUDA core per cycle) × number of CUDA cores × shader clock speed (in GHz). Note that the previous generation Tesla could dual-issue MAD+MUL to CUDA cores and SFUs in parallel, but Fermi lost this ability as it can only issue 32 instructions per cycle per SM which keeps just its 32 CUDA cores fully utilized.[3] Therefore, it is not possible to leverage the SFUs to reach more than 2 operations per CUDA core per cycle.
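
As a worked example of this formula (a sketch; the GTX 480 figures of 480 CUDA cores and a 1401 MHz shader clock come from the card's specifications, not from this article):

    #include <cstdio>

    int main() {
        const int ops_per_core_per_cycle = 2;    // one FMA counts as 2 FLOPs
        const int cuda_cores = 480;              // GeForce GTX 480 (GF100)
        const double shader_clock_ghz = 1.401;   // GTX 480 shader clock
        double gflops = ops_per_core_per_cycle * cuda_cores * shader_clock_ghz;
        printf("Peak FP32: %.1f GFLOPS\n", gflops);   // prints ~1345.0
        return 0;
    }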

The theoretical double-precision processing power of a Fermi GPU is 1/2 of the single precision performance on GF100/110. However, in practice this double-precision power is only available on professional Quadro and Tesla cards, while consumer GeForce cards are capped to 1/8.[4]

Memory

Fermi has an L1 cache per SM and a unified L2 cache that services all operations (load, store, and texture).

Registers

Each SM has 32 K (32,768) 32-bit registers. Each thread has access to its own registers and not those of other threads. The maximum number of registers that can be used per thread by a CUDA kernel is 63. The number of registers available per thread degrades gracefully from 63 to 21 as the number of threads (and hence the resource requirements) increases. Registers have a very high bandwidth: about 8,000 GB/s.
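
The per-thread register budget can be influenced at compile time. A minimal sketch (the kernel is illustrative; __launch_bounds__ and the nvcc option are real CUDA features):

    // __launch_bounds__ tells the compiler the maximum block size, so it
    // can budget registers per thread accordingly; compiling with
    // "nvcc --maxrregcount=N" caps register use per thread directly instead.
    __global__ void __launch_bounds__(256)       // at most 256 threads per block
    saxpy(float a, const float* x, float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = fmaf(a, x[i], y[i]);          // y = a*x + y, one FMA
    }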

L1+Shared Memory

On-chip memory that can be used either to cache data for individual threads (register spilling/L1 cache) and/or to share data among several threads (shared memory). This 64 KB memory can be configured as either 48 KB of shared memory with 16 KB of L1 cache, or 16 KB of shared memory with 48 KB of L1 cache (see the runtime sketch below). Shared memory is accessible by the threads in the same thread block; it enables them to cooperate, facilitates extensive reuse of on-chip data, and greatly reduces off-chip traffic. It provides low-latency access (10-20 cycles) and very high bandwidth (1,600 GB/s) to moderate amounts of data, such as intermediate results in a series of calculations, one row or column of data for matrix operations, or a line of video. David Patterson notes that this shared memory uses the idea of a local scratchpad.[5]
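
The split is selected per kernel through the CUDA runtime. A minimal sketch, where my_kernel is an illustrative kernel name:

    #include <cuda_runtime.h>

    __global__ void my_kernel(float* data);   // defined elsewhere (illustrative)

    void configure() {
        // Request 48 KB shared memory / 16 KB L1 cache for this kernel...
        cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);
        // ...or swap the split the other way:
        // cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);
    }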

Local Memory

Local memory is a memory location used to hold "spilled" registers. Register spilling occurs when a thread block requires more register storage than is available on an SM. Local memory is used only for some automatic variables (those declared in device code without any of the __device__, __shared__, or __constant__ qualifiers). Generally, an automatic variable resides in a register, except in the following cases: (1) arrays that the compiler cannot determine to be indexed with constant quantities; (2) large structures or arrays that would consume too much register space; (3) any variable the compiler decides to spill to local memory when a kernel uses more registers than are available on the SM.
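
A minimal sketch of case (1) above: an automatic array indexed with a value unknown at compile time, which the compiler must place in local memory (names illustrative):

    __global__ void spill_demo(const int* idx, float* out, int n) {
        // "scratch" is an automatic variable, but because it is indexed with
        // a value unknown at compile time, it cannot live in registers and
        // is placed in (off-chip, cached) local memory instead.
        float scratch[32];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            for (int k = 0; k < 32; ++k)
                scratch[k] = k * 0.5f;
            out[i] = scratch[idx[i] & 31];   // runtime index forces local memory
        }
    }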

L2 Cache

768 KB unified L2 cache, shared among the 16 SMs, that services all loads and stores from/to global memory, including copies to/from the CPU host, as well as texture requests. The L2 cache subsystem also implements atomic operations, used for managing access to data that must be shared across thread blocks or even kernels.
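
A minimal sketch of such an atomic: a global histogram updated concurrently by every thread block, with each read-modify-write resolved by the L2 subsystem (names illustrative):

    __global__ void histogram(const unsigned char* data, unsigned int* bins, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // atomicAdd on global memory keeps the counts consistent
            // across all thread blocks on the chip.
            atomicAdd(&bins[data[i]], 1u);
        }
    }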

Global memory

Global memory (VRAM) is accessible by all threads directly as well as the host system over the PCIe bus. It has a high latency of 400-800 cycles.[citation needed]
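
A minimal host-side sketch of this traffic: allocating global memory and copying across the PCIe bus (buffer size illustrative):

    #include <cstdlib>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 64 << 20;   // 64 MB, illustrative
        float* host = static_cast<float*>(malloc(bytes));
        float* dev = nullptr;
        cudaMalloc(&dev, bytes);         // allocate in global memory (VRAM)
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);   // over PCIe
        // ... kernels read/write dev directly, at 400-800 cycles latency ...
        cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
        cudaFree(dev);
        free(host);
        return 0;
    }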

Video decompression/compression

See Nvidia NVDEC (formerly called NVCUVID) as well as Nvidia PureVideo.

Nvidia's NVENC video encoding technology was not yet available on Fermi; it was introduced with its successor, Kepler.

Fermi chips

  • GF100
  • GF104
  • GF106
  • GF108
  • GF110
  • GF114
  • GF116
  • GF117
  • GF119

See also

  • Qualcomm Adreno
  • CUDA
  • List of eponyms of Nvidia GPU microarchitectures
  • List of Nvidia graphics processing units
  • Scalable Link Interface (SLI)
References

  1. ^ "NVIDIA GPU Decoder Device Information".
  2. ^ "NVIDIA's Next Generation CUDA Compute Architecture: Fermi" (PDF). 2009. Retrieved December 7, 2015.
  3. ^ Glaskowsky, Peter N. (September 2009). "NVIDIA's Fermi: The First Complete GPU Computing Architecture" (PDF). p. 22. Retrieved December 6, 2015. A total of 32 instructions from one or two warps can be dispatched in each cycle to any two of the four execution blocks within a Fermi SM
  4. ^ Smith, Ryan (March 26, 2010). "NVIDIA's GeForce GTX 480 and GTX 470: 6 Months Late, Was It Worth the Wait?". AnandTech. p. 6. Retrieved December 6, 2015. the GTX 400 series' FP64 performance is capped at 1/8th (12.5%) of its FP32 performance, as opposed to what the hardware natively can do of 1/2 (50%) FP32
  5. ^ Patterson, David (September 30, 2009). "The Top 10 Innovations in the New NVIDIA Fermi Architecture, and the Top 3 Next Challenges" (PDF). Parallel Computing Research Laboratory & NVIDIA. Retrieved October 3, 2013.

General

  • N. Brookwood, "NVIDIA Solves the GPU Computing Puzzle."
  • P.N. Glaskowsky, "NVIDIA’s Fermi: The First Complete GPU Computing Architecture."
  • N. Whitehead, A. Fit-Florea, "Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs.", 2011.
  • Oberman, S.F.; Siu, M.Y. (2005). "A High-Performance Area-Efficient Multifunction Interpolator". 17th IEEE Symposium on Computer Arithmetic (ARITH'05). pp. 272–279. doi:10.1109/arith.2005.7. ISBN 0-7695-2366-8. S2CID 14975421.
  • R. Farber, "CUDA Application Design and Development," Morgan Kaufmann, 2011.
  • NVIDIA Application Note "Tuning CUDA applications for Fermi".

External links

  • NVIDIA Fermi Architecture on Orange Owl Solutions, archived January 4, 2022, at the Wayback Machine
