- What is hwpoison?
- Upcoming Intel CPUs have support for recovering from some memory errors
- (``MCA recovery''). This requires the OS to declare a page "poisoned",
- kill the processes associated with it and avoid using it in the future.
- This patchkit implements the necessary infrastructure in the VM.
- To quote the overview comment:
- * High level machine check handler. Handles pages reported by the
- * hardware as being corrupted usually due to a 2bit ECC memory or cache
- * failure.
- *
- * This focusses on pages detected as corrupted in the background.
- * When the current CPU tries to consume corruption the currently
- * running process can just be killed directly instead. This implies
- * that if the error cannot be handled for some reason it's safe to
- * just ignore it because no corruption has been consumed yet. Instead
- * when that happens another machine check will happen.
- *
- * Handles page cache pages in various states. The tricky part
- * here is that we can access any page asynchronous to other VM
- * users, because memory failures could happen anytime and anywhere,
- * possibly violating some of their assumptions. This is why this code
- * has to be extremely careful. Generally it tries to use normal locking
- * rules, as in get the standard locks, even if that means the
- * error handling takes potentially a long time.
- *
- * Some of the operations here are somewhat inefficient and have non
- * linear algorithmic complexity, because the data structures have not
- * been optimized for this case. This is in particular the case
- * for the mapping from a vma to a process. Since this case is expected
- * to be rare we hope we can get away with this.
- The code consists of the high level handler in mm/memory-failure.c,
- a new page poison bit and various checks in the VM to handle poisoned
- pages.
- The main target right now is KVM guests, but it works for all kinds
- of applications. KVM support requires a recent qemu-kvm release.
- For the KVM use there was need for a new signal type so that
- KVM can inject the machine check into the guest with the proper
- address. This in theory allows other applications to handle
- memory failures too. The expectation is that nearly all applications
- won't do that, but some very specialized ones might.
- ---
- There are two (actually three) modes memory failure recovery can be in:
- vm.memory_failure_recovery sysctl set to zero:
- All memory failures cause a panic. Do not attempt recovery.
- (on x86 this can also be affected by the tolerant level of the
- MCE subsystem)
- early kill
- (can be controlled globally and per process)
- Send SIGBUS to the application as soon as the error is detected.
- This allows applications that can process memory errors gracefully
- (e.g. by dropping the affected object) to do so.
- This is the mode used by KVM qemu.
- late kill
- Send SIGBUS when the application runs into the corrupted page.
- This is best for memory error unaware applications and is the default.
- Note some pages are always handled as late kill.
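The way the sysctls and the per-process setting combine into one of the three modes can be sketched as follows. This is only an illustration of the description above, not kernel code: the function name, the ThreadPolicy enum and the returned strings are made up for this example, and it does not model the pages that are always handled as late kill.

```python
from enum import Enum


class ThreadPolicy(Enum):
    """Hypothetical stand-in for the per-process early/late kill setting."""
    DEFAULT = "default"   # follow the global early kill sysctl
    EARLY = "early"
    LATE = "late"


def effective_mode(recovery_enabled: bool,
                   global_early_kill: bool,
                   policy: ThreadPolicy = ThreadPolicy.DEFAULT) -> str:
    """Return which of the three modes applies, per the text above."""
    # vm.memory_failure_recovery == 0: no recovery is attempted at all.
    if not recovery_enabled:
        return "panic"
    # An explicit per-process choice overrides the global setting.
    if policy is ThreadPolicy.EARLY:
        return "early kill"
    if policy is ThreadPolicy.LATE:
        return "late kill"
    # Otherwise fall back to the global early kill setting; late kill is
    # the default when it is not enabled.
    return "early kill" if global_early_kill else "late kill"
```

With recovery enabled and nothing else configured, effective_mode(True, False) gives "late kill", matching the default stated above.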
- ---
- User control:
- vm.memory_failure_recovery
- See sysctl.txt
- vm.memory_failure_early_kill
- Enable early kill mode globally
- PR_MCE_KILL
- Set early/late kill mode/revert to system default
- arg1: PR_MCE_KILL_CLEAR: Revert to system default
- arg1: PR_MCE_KILL_SET: arg2 defines thread specific mode
- PR_MCE_KILL_EARLY: Early kill
- PR_MCE_KILL_LATE: Late kill
- PR_MCE_KILL_DEFAULT: Use system global default
- Note that if you want to have a dedicated thread which handles
- the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should
- call prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0)
- on the designated thread. Otherwise, the SIGBUS is sent to the
- main thread.
- PR_MCE_KILL_GET
- return current mode
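A minimal sketch of driving the prctl interface above from user space, here via Python's ctypes. The numeric constants are the values found in <linux/prctl.h>; the helper names (set_mce_kill, clear_mce_kill, get_mce_kill) are invented for this example. It assumes a Linux libc that exports prctl, and the calls may fail with an OSError on kernels or sandboxes that do not support these options.

```python
import ctypes
import os

# Values from <linux/prctl.h>.
PR_MCE_KILL = 33
PR_MCE_KILL_GET = 34
PR_MCE_KILL_CLEAR = 0      # arg2: revert to system default
PR_MCE_KILL_SET = 1        # arg2: arg3 selects the mode
PR_MCE_KILL_LATE = 0       # arg3 values
PR_MCE_KILL_EARLY = 1
PR_MCE_KILL_DEFAULT = 2

_libc = ctypes.CDLL(None, use_errno=True)


def _prctl(option, arg2=0, arg3=0):
    """Call prctl(option, arg2, arg3, 0, 0); raise OSError on failure."""
    ul = ctypes.c_ulong
    ret = _libc.prctl(ctypes.c_int(option), ul(arg2), ul(arg3), ul(0), ul(0))
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret


def set_mce_kill(mode):
    """Set the calling thread's mode to one of the PR_MCE_KILL_* arg3 values."""
    _prctl(PR_MCE_KILL, PR_MCE_KILL_SET, mode)


def clear_mce_kill():
    """Revert the calling thread to the system default."""
    _prctl(PR_MCE_KILL, PR_MCE_KILL_CLEAR)


def get_mce_kill():
    """Return the calling thread's current mode (a PR_MCE_KILL_* arg3 value)."""
    return _prctl(PR_MCE_KILL_GET)
```

A thread that wants early notification would call set_mce_kill(PR_MCE_KILL_EARLY) from within that thread, since the setting applies to the caller.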
- ---
- Testing:
- madvise(MADV_HWPOISON, ....)
- (as root)
- Poison a page in the process for testing
- hwpoison-inject module through debugfs
- /sys/kernel/debug/hwpoison/
- corrupt-pfn
- Inject hwpoison fault at PFN echoed into this file. This does
- some early filtering to avoid corrupting unintended pages in test suites.
- unpoison-pfn
- Software-unpoison page at PFN echoed into this file. This
- way a page can be reused again.
- This only works for Linux injected failures, not for real
- memory failures.
- Note these injection interfaces are not stable and might change between
- kernel versions
- corrupt-filter-dev-major
- corrupt-filter-dev-minor
- Only handle memory failures to pages associated with the file system defined
- by block device major/minor. -1U is the wildcard value.
- This should be only used for testing with artificial injection.
- corrupt-filter-memcg
- Limit injection to pages owned by a memory cgroup (memcg), specified
- by the inode number of the memcg.
- Example:
- mkdir /sys/fs/cgroup/mem/hwpoison
- usemem -m 100 -s 1000 &
- echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks
- memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ')
- echo $memcg_ino > /sys/kernel/debug/hwpoison/corrupt-filter-memcg
- page-types -p `pidof init` --hwpoison # shall do nothing
- page-types -p `pidof usemem` --hwpoison # poison its pages
- corrupt-filter-flags-mask
- corrupt-filter-flags-value
- When specified, only poison pages if ((page_flags & mask) == value).
- This allows stress testing of many kinds of pages. The page_flags
- are the same as in /proc/kpageflags. The flag bits are defined in
- include/linux/kernel-page-flags.h and documented in
- Documentation/vm/pagemap.txt
- Architecture specific MCE injector
- x86 has mce-inject, mce-test
- Some portable hwpoison test programs in mce-test, see below.
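The corrupt-filter-flags-mask / corrupt-filter-flags-value check described above is a plain mask-and-compare. The sketch below shows how a mask/value pair selects pages. The bit positions are the ones exported through /proc/kpageflags; the function name and the "mask is None means no filter was specified" convention are assumptions made for this example only.

```python
# A subset of the /proc/kpageflags bits (see kernel-page-flags.h).
KPF_DIRTY = 1 << 4
KPF_LRU = 1 << 5
KPF_ANON = 1 << 12


def passes_flag_filter(page_flags, mask=None, value=0):
    """Return True if a page with these flags would be eligible for poisoning."""
    if mask is None:
        # No flag filter specified: every page passes this check.
        return True
    # Bits outside the mask are ignored; bits inside must match exactly.
    return (page_flags & mask) == value


# Example pair: dirty LRU pages that are NOT anonymous.
mask = KPF_LRU | KPF_DIRTY | KPF_ANON    # bits that are examined
value = KPF_LRU | KPF_DIRTY              # required state of those bits
```

Here mask is 0x1030 and value is 0x30. A bit set in the mask but clear in the value, like KPF_ANON in this pair, means "must not be set".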
- ---
- References:
- http://halobates.de/mce-lc09-2.pdf
- Overview presentation from LinuxCon 09
- git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
- Test suite (hwpoison specific portable tests in tsrc)
- git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
- x86 specific injector
- ---
- Limitations:
- - Not all page types are supported and never will be. Most kernel internal
- objects cannot be recovered, only LRU pages for now.
- - Right now hugepage support is missing.
- ---
- Andi Kleen, Oct 2009