See this LWN article for further details about this issue. Reserved kernel pages and pages with a zero reference count are ignored, at the risk of a system panic. Once the hardware delivers the message to the OS via a machine check, the OS is free to deal with the machine check however it pleases. For “Action Optional” machine checks, which can happen asynchronously to program execution (for example, due to scrubbing), the OS can queue up a handler to deal with the affected page, either by poisoning it, unmapping it, or taking some other action. Handling can then be safely postponed until a later time when the page might be referenced. System programming guide https: Linux EDAC project on sourceforge.
Now that Intel is supporting MCA Recovery on x86 machines, some desktop users may also enjoy its benefits in the near future. Action Optional means that the CPU detected some form of corruption in the background and tells the OS about it using a machine check exception. I found a different paper, "Machine check handling on Linux" (paper and slides for Linux Kongress). Now look with me at which SRAO errors are architecturally defined there: background scrubbing gives a machine check.
I can definitely see a design where the machine check happens, and the OS deals with it, before the data error is consumed. Background scrubbing works by reading memory locations, checking the ECC, and proactively correcting correctable errors before they become uncorrectable.
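The scrubbing loop described above can be sketched with a toy code. This is a minimal illustration, assuming a 4-bit data word protected by an extended Hamming (8,4) code; real memory controllers use wider codes, and the names `encode`, `check`, and `scrub` are hypothetical, not taken from any hardware or kernel interface. Single-bit errors are corrected in place, and double-bit errors are reported as uncorrectable, which is the case where hardware would raise a machine check.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (toy model)."""
    code = [0] * 8
    # Data bits live at positions 3, 5, 6, 7; parity bits at 1, 2, 4.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p and i != p) % 2
    # Position 0 holds overall parity, which allows double-error detection.
    code[0] = sum(code[1:]) % 2
    return code


def check(code):
    """Check one codeword, correcting a single-bit error in place."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        return "ok"
    if overall == 1:
        # Odd number of flips: assume one. Syndrome 0 means bit 0 itself flipped.
        code[syndrome] ^= 1
        return "corrected"
    # Nonzero syndrome with even parity: two bits flipped, cannot be repaired.
    return "uncorrectable"


def scrub(memory):
    """Walk all words; return addresses that were corrected and that are lost."""
    corrected, uncorrectable = [], []
    for addr, word in enumerate(memory):
        status = check(word)
        if status == "corrected":
            corrected.append(addr)
        elif status == "uncorrectable":
            uncorrectable.append(addr)  # hardware would signal a machine check here
    return corrected, uncorrectable
```

Running `scrub` periodically is what keeps a single flipped bit from being joined by a second one in the same word, at which point this code could no longer repair it.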
mcelog — further reading
Studies about memory errors: A good study on memory errors from the University of Rochester. Maybe the article is confusing multiple scenarios. If the faulting word is due to a prefetch, or is late in the cache line that was read due to a demand fetch, that data may arrive at the CPU long after the instruction that triggered that line fill.
Can it be any clearer? By delaying, some transient errors may not reoccur or may be irrelevant.
Clean pages in either the swap or page cache can be easily recovered by invalidating the cache entry for these pages. The handler must allow for multiple poisoning events occurring in a short time window. If background scrubbing detects something uncorrectable, it can (and it seems like it ought to) signal a machine check.
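One way to picture a handler that tolerates several poisoning events in a short window is a queue that records each report and processes them later. The sketch below is a user-space illustration only; `PoisonQueue`, `report`, and `drain` are hypothetical names, and the assumption that repeated reports for the same page frame can be collapsed into one is ours, not something the text states.

```python
from collections import deque


class PoisonQueue:
    """Toy model of deferred handling for asynchronous poison reports."""

    def __init__(self):
        self._pending = deque()
        self._seen = set()  # frames already reported; a poisoned frame stays poisoned

    def report(self, pfn):
        """Record a poison event; return False if this frame was already reported."""
        if pfn in self._seen:
            return False
        self._seen.add(pfn)
        self._pending.append(pfn)
        return True

    def drain(self, handle):
        """Run the deferred handler on every queued frame, in arrival order."""
        handled = []
        while self._pending:
            pfn = self._pending.popleft()
            handle(pfn)
            handled.append(pfn)
        return handled
```

Because `report` only appends to a queue, it stays cheap enough to be called several times in quick succession, while the expensive work (invalidating or unmapping) happens in `drain`.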
How can the CPU continue executing and generate a machine check at some arbitrarily later time? Potentially corrupted processes can then be located by finding all processes that have the corrupted page mapped.
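Locating the affected processes amounts to a reverse lookup from a page frame to every mapping of it. The following is a toy model, assuming page tables are plain dictionaries from virtual address to frame number; `POISON`, `BusError`, `find_mappers`, `poison_page`, and `access` are all hypothetical names for illustration. Leaving a marker in the slot, rather than clearing it, is shown as one way a later access could be recognized as touching poisoned data; the text here does not say how the kernel does this.

```python
POISON = object()  # hypothetical marker standing in for a poisoned mapping


class BusError(Exception):
    """Toy stand-in for the SIGBUS a process would receive."""


def find_mappers(page_tables, pfn):
    """Return (pid, vaddr) for every mapping of the given page frame."""
    return [
        (pid, vaddr)
        for pid, table in page_tables.items()
        for vaddr, frame in table.items()
        if frame == pfn
    ]


def poison_page(page_tables, pfn):
    """Replace every mapping of the frame with the poison marker."""
    victims = find_mappers(page_tables, pfn)
    for pid, vaddr in victims:
        page_tables[pid][vaddr] = POISON
    return victims


def access(page_tables, pid, vaddr):
    """Simulate a memory access; poisoned mappings fault instead of returning data."""
    frame = page_tables[pid][vaddr]
    if frame is POISON:
        raise BusError(f"pid {pid} touched poisoned address {vaddr:#x}")
    return frame
```

Processes that never touch the poisoned address again are unaffected in this model, which matches the idea that handling can be deferred until the page is actually referenced.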
One downside to the ever-increasing memory size available on computers is an increase in memory failures. Do you have different documentation that suggests otherwise? On a later page fault the associated application will be killed.
In the most recent Intel architectures, they support a notion of “recoverable machine check,” wherein the hardware tells the OS that no CPU state was corrupted when it noticed the problem. Thus, processes can decide how they want to handle the data poisoning.
Perhaps this is handled properly, but by just unmapping, aren't you running the risk that some later memory allocation by that process might get the same virtual address, so that instead of receiving a SIGBUS the process keeps running with corrupted memory? It's still a machine check. The OS can then take appropriate action, like killing the process with the corrupted data or logging the event properly to disk.
However, this is infeasible for two reasons. Note that this property would be system dependent—not all systems would necessarily be this imprecise.
Automatic page offlining is a good idea. Additionally, the architecture must support data poisoning. Huge pages fail since reverse mapping is not supported to identify the process that owns the page. See Chapter 15 in this reference where it says: The handler ignores the following types of pages: MCE is the mechanism by which the hardware reports the bad page to the operating system.
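The per-page-type behaviour mentioned in this text (reserved and zero-count pages ignored, huge pages failing, clean cache pages invalidated) can be summarized as a small decision function. This is only a sketch: the `Page` fields and the `decide` function are hypothetical, the order of the checks is an assumption, and the fall-through case of unmapping and signalling on a later fault is inferred from the surrounding discussion rather than stated as a rule.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical description of a page, for illustration only."""
    reserved: bool = False   # reserved kernel page
    refcount: int = 1        # zero means nobody holds a reference
    huge: bool = False       # part of a huge page
    in_cache: bool = False   # lives in the swap cache or page cache
    dirty: bool = False      # modified since it was last written back


def decide(page):
    """Pick an action for a page reported as corrupted."""
    if page.reserved or page.refcount == 0:
        return "ignore"            # ignored, at the risk of a later panic
    if page.huge:
        return "fail"              # no reverse mapping to find the owner
    if page.in_cache and not page.dirty:
        return "invalidate"        # a clean copy can be re-read from storage
    return "unmap_and_signal"      # assumption: kill on a later page fault
```

For example, `decide(Page(in_cache=True))` selects invalidation, while `decide(Page(in_cache=True, dirty=True))` falls through to the unmap path because the only up-to-date copy is the corrupted one.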