Machine-check exception

A machine check exception (MCE) is a type of computer error that occurs when a problem involving the computer's hardware is detected. With most mass-market personal computers, an MCE indicates faulty or misconfigured hardware.

The nature and causes of MCEs vary by architecture and system generation. In some designs, an MCE is always an unrecoverable error that halts the machine and requires a reboot. In other architectures, some MCEs may be non-fatal, such as single-bit errors corrected by ECC memory. On some architectures, such as PowerPC, certain software bugs, such as an invalid memory access, can cause MCEs. On other architectures, such as x86, MCEs typically originate from hardware only.

Reporting

IBM mainframe operating systems

IBM System/360 Operating System (OS/360) records input/output errors in a dataset called SYS1.LOGREC. Since then IBM has coined the term error recording data set (ERDS) for successor versions that allow the installation to choose the name and for operating systems not derived from OS/360.[1]

OS/360

In OS/360, the installation can choose several levels of support for handling machine checks. The most sophisticated, Machine Check Handler (MCH), records failure data on SYS1.LOGREC and attempts recovery. The installation can print those data using the Environmental Record Editing and Printing Program (EREP) service aid or the stand-alone version SEREP. The MCH can handle memory failures in refreshable nucleus control sections by reading a fresh copy from SYS1.ASRLIB and can handle memory errors in SVC transient areas by reading a fresh copy of the SVC module from SYS1.SVCLIB.

z/OS

In z/OS, the installation can either use an ERDS or define a z/OS System Logger log stream[2] to hold the error data. As with OS/360, the installation uses EREP to print those data; SEREP is no longer available. The MCH is no longer optional, and it handles many more failure modes than the OS/360 MCH.

Microsoft Windows

On Microsoft Windows platforms, in the event of an unrecoverable MCE, the system generates a BugCheck, also called a STOP error or a Blue Screen of Death.

More recent versions of Windows use the Windows Hardware Error Architecture (WHEA), and generate STOP code 0x124, WHEA_UNCORRECTABLE_ERROR. The four parameters (in parentheses) will vary, but the first is always 0x0 for an MCE.[3] Example:

 STOP: 0x00000124 (0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000) 

Older versions of Windows use the Machine Check Architecture, with STOP code 0x9C, MACHINE_CHECK_EXCEPTION.[4] Example:

 STOP: 0x0000009C (0x00000030, 0x00000002, 0x00000001, 0x80003CBA) 
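The two STOP line formats above can be taken apart mechanically. The following is a minimal sketch in Python, assuming the line has exactly the `STOP: code (p1, p2, p3, p4)` shape shown in the examples; the function name `parse_stop_line` and the `is_machine_check` field are hypothetical names introduced for illustration, and the only rules applied are the ones stated in the text (0x9C is a machine check; 0x124 is a machine check when the first parameter is 0x0).

```python
import re

# Names for the two STOP codes discussed in the text.
KNOWN_STOP_CODES = {
    0x124: "WHEA_UNCORRECTABLE_ERROR",
    0x9C: "MACHINE_CHECK_EXCEPTION",
}

# Matches "STOP: 0x... (p1, p2, ...)" as shown in the examples.
STOP_PATTERN = re.compile(r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_stop_line(line):
    """Split a STOP line into its code, symbolic name, and parameters."""
    match = STOP_PATTERN.search(line)
    if match is None:
        raise ValueError("not a STOP line: %r" % line)
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    # Per the text: 0x9C is always a machine check, and for 0x124 the
    # first parameter is 0x0 when the error source is an MCE.
    is_mce = code == 0x9C or (code == 0x124 and bool(params) and params[0] == 0)
    return {
        "code": code,
        "name": KNOWN_STOP_CODES.get(code, "UNKNOWN"),
        "params": params,
        "is_machine_check": is_mce,
    }
```

The parameters themselves are not interpreted here; their meaning depends on the STOP code and is described in the Microsoft bug check references.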

Linux

On Linux, the kernel writes messages about MCEs to the kernel message log and the system console. When the MCEs are not fatal, they will also typically be copied to the system log and/or systemd journal. For some systems, ECC and other correctable errors may be reported through MCE facilities.[5]

Example:

 CPU 0: Machine Check Exception: 0000000000000004
 Bank 2: f200200000000863
 Kernel panic: CPU context corrupt
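The hexadecimal values in a message like the one above are raw register contents on x86: the first is the global machine-check status (MCG_STATUS) and the value after "Bank N" is that bank's status register (MCi_STATUS). The sketch below, in Python, pulls them out and names the architectural flag bits described in the Intel manual's Machine-Check Architecture chapter. It assumes the older message layout shown in the example (newer kernels print MCEs differently), and the names `decode_mce` and `flags_set` are hypothetical helpers, not part of any existing tool.

```python
import re

# Architectural flag bits of an x86 bank status register (MCi_STATUS).
STATUS_FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MCi_MISC holds additional information
    58: "ADDRV",  # MCi_ADDR holds an address
    57: "PCC",    # processor context corrupt
}

# Flag bits of the global status register (MCG_STATUS).
MCG_STATUS_FLAGS = {2: "MCIP", 1: "EIPV", 0: "RIPV"}

# Assumed layout: "CPU n: Machine Check Exception: <hex> Bank n: <hex>".
MCE_PATTERN = re.compile(
    r"CPU (\d+): Machine Check Exception: ([0-9a-fA-F]+)\s+"
    r"Bank (\d+): ([0-9a-fA-F]+)"
)


def flags_set(value, table):
    """Return the names of the flag bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def decode_mce(text):
    """Extract CPU, bank, and status fields from an MCE log message."""
    match = MCE_PATTERN.search(text)
    if match is None:
        raise ValueError("no machine check message found")
    cpu, mcg_hex, bank, status_hex = match.groups()
    mcg_status = int(mcg_hex, 16)
    status = int(status_hex, 16)
    return {
        "cpu": int(cpu),
        "bank": int(bank),
        "mcg_flags": flags_set(mcg_status, MCG_STATUS_FLAGS),
        "status_flags": flags_set(status, STATUS_FLAGS),
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }
```

For the example message, this reports the flags VAL, OVER, UC, EN and PCC with MCA error code 0x0863. The PCC flag is consistent with the "CPU context corrupt" panic text. Interpreting the error code itself requires the vendor documentation or one of the decoding tools listed later in this article.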

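Problem types

Some of the main hardware problems that cause MCEs include:

  • System bus errors: errors communicating between the processor and the motherboard.
  • Memory errors: parity checking detects when a memory error has occurred. Error correction code (ECC) can correct limited memory errors so that processing can continue.
  • CPU cache errors in the processor.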

Possible causes

On x86 systems, machine checks generally indicate a hardware problem, not a software problem. They are often the result of overclocking or overheating. In some cases, the CPU will shut itself off once it passes a thermal limit to avoid permanent damage. They can also be caused by bus errors introduced by other failing components, such as memory or I/O devices. Possible causes include:

  • Poor CPU cooling due to a CPU heatsink or case fans (or filters) that are clogged with dust or have come loose.
  • Overclocking beyond the highest clock rate at which the CPU is still reliable.
  • Failing motherboard.
  • Failing processor.
  • Failing memory.
  • Failing I/O controllers, on either the motherboard or separate cards.
  • Failing I/O devices.
  • Inadequate or failing power supply.

Cooling problems are usually obvious upon inspection. A failing motherboard or processor can be identified by swapping it with a functioning part. Memory can be checked by booting into a diagnostic tool such as memtest86. Failing non-essential I/O devices and controllers can be identified by unplugging them, where possible, or disabling them to see whether the problem disappears. If failures tend to occur either soon after the OS boots or not for days afterward, a power supply problem may be the cause: with an inadequate or failing power supply, the failure often occurs when power demand peaks as the OS starts up external devices for use.

Decoding MCEs

For IA-32 and Intel 64 processors, consult the Intel 64 and IA-32 Architectures Software Developer's Manual[6] Chapter 15 (Machine-Check Architecture), or the Microsoft KB Article on Windows Exceptions.[7]

Programs to decode Intel and AMD MCEs

  • rasdaemon[8] is a RAS (reliability, availability and serviceability) logging tool for Linux. It records memory errors using the EDAC tracing events. EDAC is a Linux kernel subsystem that handles detection of ECC errors from memory controllers for most chipsets on i386 and x86_64 architectures. EDAC drivers for other architectures, such as ARM, also exist. rasdaemon is the recommended way to gather MCE information on Linux systems because mcelog has been deprecated as of 2017.[9][10][11][12]
  • mcelog[13] is a Linux daemon by Andi Kleen to handle MCEs for x86 processors. mcelog can also decode machine checks. mcelog is considered functionally obsolete as of 2017.[11][12] The replacement for mcelog on Linux systems is rasdaemon.[9][10]
  • parsemce[14] is a Linux program by Dave Jones to decode MCEs from AMD K7 processors.
  • mced[15] (mcedaemon) is a Linux program by Tim Hockin to gather MCEs from the kernel and alert interested applications. Note that it does not try to interpret the MCE data, it simply alerts other programs.
  • mcat is a Windows command-line program from AMD to decode MCEs from AMD K8, Family 0x10 and 0x11 processors.
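As a complement to the tools above, the EDAC subsystem mentioned in the rasdaemon entry also exposes per-memory-controller error counters through sysfs. The following Python sketch reads the corrected (`ce_count`) and uncorrected (`ue_count`) totals. It assumes the usual location `/sys/devices/system/edac/mc` and that an EDAC driver is loaded for the platform; the function name `read_edac_counts` is a hypothetical helper, and the `root` parameter exists only so the code can be pointed at a different directory.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int, "ue_count": int}}.

    A counter that is missing or unreadable is reported as None.
    """
    counts = {}
    # Memory controllers appear as mc0, mc1, ... under the EDAC root.
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc_dir / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None
        counts[mc_dir.name] = entry
    return counts
```

A rising `ce_count` points to correctable memory errors that ECC is absorbing, while a nonzero `ue_count` indicates errors that could not be corrected. This only reads counters; it does not decode individual MCE records the way rasdaemon or mcelog do.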

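See also

  • Machine Check Architecture (MCA)
  • High availability (HA)
  • Reliability, availability and serviceability (RAS)
  • Windows Hardware Error Architecture (WHEA)
  • Blue screen of death
  • Kernel panic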

References

  1. ^ "Chapter 1. Introducing EREP". Environmental Record Editing and Printing Program (EREP) 3.5 - User's Guide (PDF). IBM. September 30, 2021. p. 1. GC35-0151-50. Retrieved February 20, 2023.
  2. ^ System Programmer's Guide to: z/OS System Logger (PDF) (Second ed.). IBM. July 2007. SG24-6898-01. Retrieved February 20, 2023.
  3. ^ "Bug Check 0x124: WHEA_UNCORRECTABLE_ERROR". Microsoft. 2022-11-03. Retrieved 2022-12-11.
  4. ^ "Bug Check 0x9C: MACHINE_CHECK_EXCEPTION". Microsoft. 2021-12-14. Retrieved 2022-12-11.
  5. ^ "mcelog not working with AMD processor family 16 and above on SLES11 SP3". SuSE. 2022-09-27. Retrieved 2022-12-11.
  6. ^ "Machine Check Architecture". Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3B: System Programming Guide, Part 2. Intel Corporation. November 2018.
  7. ^ "Stop error message in Windows XP that you may receive: "0x0000009C (0x00000004, 0x00000000, 0xb2000000, 0x00020151)"". MSDN. 2015-12-07. Retrieved 2017-07-13.
  8. ^ Mauro Carvalho Chehab (mchehab) (2023-02-20). "rasdaemon is a RAS (Reliability, Availability and Serviceability) logging tool". github.com. Retrieved 2023-02-20.
  9. ^ a b "Machine-check exception". wiki.archlinux.org. 2021-05-08. Retrieved 2023-02-21.
  10. ^ a b "ECC RAM". wiki.gentoo.org. 2022-12-30. Retrieved 2023-02-21.
  11. ^ a b "x86/mce: Factor out and deprecate the /dev/mcelog driver". git.kernel.org. 2017-03-28. Retrieved 2023-02-21.
  12. ^ a b "x86/mce: Factor out and deprecate the /dev/mcelog driver". github.com/torvalds/linux/. 2017-03-28. Retrieved 2023-02-21.
  13. ^ "mcelog: Advanced hardware error handling for x86 Linux". 2015-04-20. Retrieved 2017-07-13.
  14. ^ "parsemce: Linux Machine check exception handler parser". 2003-07-22. Retrieved 2017-07-13.
  15. ^ mcedaemon on GitHub

External links

  • mcelog: Advanced hardware error handling for x86 Linux
  • parsemce: Linux Machine check exception handler parser
