= Transparent Hugepage Support =

== Objective ==

Performance critical computing applications dealing with large memory
working sets are already running on top of libhugetlbfs and in turn
hugetlbfs. Transparent Hugepage Support is an alternative means of
using huge pages for the backing of virtual memory. It supports the
automatic promotion and demotion of page sizes and does not have the
shortcomings of hugetlbfs.

Currently it only works for anonymous memory mappings and tmpfs/shmem,
but in the future it can expand to other filesystems.

Applications run faster because of two factors. The first factor is
almost completely irrelevant and not of significant interest, because
it also has the downside of requiring larger clear-page and copy-page
operations in page faults, which is a potentially negative effect. The
first factor consists of taking a single page fault for each 2M
virtual region touched by userland (thus reducing the enter/exit
kernel frequency by a factor of 512). This only matters the first time
the memory is accessed for the lifetime of a memory mapping. The
second, long lasting and much more important, factor affects all
subsequent accesses to the memory for the whole runtime of the
application. It consists of two components: 1) the TLB miss will run
faster (especially with virtualization using nested pagetables, but
almost always also on bare metal without virtualization) and 2) a
single TLB entry will be mapping a much larger amount of virtual
memory, in turn reducing the number of TLB misses. With virtualization
and nested pagetables, TLB entries of the larger size can only be used
if both KVM and the Linux guest are using hugepages, but a significant
speedup already happens if only one of the two is using hugepages,
just because of the fact that the TLB miss runs faster.
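
As a concrete illustration of the second component, covering a 1G
working set takes 262144 TLB translations with 4k pages but only 512
translations with 2M pages, so the reach of a fixed-size TLB grows by
the same factor of 512.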

== Design ==

- "graceful fallback": mm components which don't have transparent
  hugepage knowledge fall back to breaking huge pmd mappings into a
  table of ptes and, if necessary, splitting a transparent hugepage.
  Therefore these components can continue working on the regular pages
  or regular pte mappings.

- if a hugepage allocation fails because of memory fragmentation,
  regular pages should be gracefully allocated instead and mixed in
  the same vma without any failure or significant delay and without
  userland noticing

- if some task quits and more hugepages become available (either
  immediately in the buddy or through the VM), guest physical memory
  backed by regular pages should be relocated on hugepages
  automatically (with khugepaged)

- it doesn't require memory reservation and in turn it uses hugepages
  whenever possible (the only possible reservation here is kernelcore=
  to prevent unmovable pages from fragmenting all the memory, but such
  a tweak is not specific to transparent hugepage support and it's a
  generic feature that applies to all dynamic high order allocations
  in the kernel)

Transparent Hugepage Support maximizes the usefulness of free memory
when compared to the reservation approach of hugetlbfs by allowing all
unused memory to be used as cache or other movable (or even unmovable)
entities. It doesn't require reservation to prevent hugepage
allocation failures from being noticeable from userland. It allows
paging and all other advanced VM features to be available on the
hugepages. It requires no modifications for applications to take
advantage of it.

Applications however can be further optimized to take advantage of
this feature, just as they have been optimized before to avoid a flood
of mmap system calls for every malloc(4k). Optimizing userland is by
far not mandatory and khugepaged already can take care of long lived
page allocations even for hugepage unaware applications that deal with
large amounts of memory.

In certain cases when hugepages are enabled system wide, applications
may end up allocating more memory resources. An application may mmap a
large region but only touch 1 byte of it; in that case a 2M page might
be allocated instead of a 4k page for no good reason. This is why it's
possible to disable hugepages system-wide and to only have them inside
MADV_HUGEPAGE madvise regions.

Embedded systems should enable hugepages only inside madvise regions
to eliminate any risk of wasting precious bytes of memory and to only
run faster.

Applications that get a lot of benefit from hugepages and that don't
risk losing memory by using hugepages should use
madvise(MADV_HUGEPAGE) on their critical mmapped regions.
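
The following userspace sketch (only an illustration; the function
name is made up and error handling is reduced to the bare minimum)
shows how such a region could be set up:

#include <string.h>
#include <sys/mman.h>

/* len would typically be a multiple of the 2M hugepage size. */
static void *alloc_thp_region(size_t len)
{
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return NULL;
        /*
         * Hint that this region should be backed by hugepages; this
         * also works when "enabled" is set to "madvise".
         */
        madvise(p, len, MADV_HUGEPAGE);
        memset(p, 0, len);      /* faulting the memory in allocates it */
        return p;
}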

== sysfs ==

Transparent Hugepage Support for anonymous memory can be entirely
disabled (mostly for debugging purposes), or only enabled inside
MADV_HUGEPAGE regions (to avoid the risk of consuming more memory
resources), or enabled system wide. This can be achieved with one of:

echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
echo never >/sys/kernel/mm/transparent_hugepage/enabled

It's also possible to limit the defrag efforts the VM spends on
generating anonymous hugepages, when they're not immediately free, to
madvise regions only; or to never try to defrag memory and simply fall
back to regular pages unless hugepages are immediately available.
Clearly if we spend CPU time to defrag memory, we would expect to gain
even more by the fact we use hugepages later instead of regular pages.
This isn't always guaranteed, but it may be more likely in case the
allocation is for a MADV_HUGEPAGE region.

echo always >/sys/kernel/mm/transparent_hugepage/defrag
echo defer >/sys/kernel/mm/transparent_hugepage/defrag
echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
echo never >/sys/kernel/mm/transparent_hugepage/defrag
- "always" means that an application requesting THP will stall on allocation
- failure and directly reclaim pages and compact memory in an effort to
- allocate a THP immediately. This may be desirable for virtual machines
- that benefit heavily from THP use and are willing to delay the VM start
- to utilise them.
- "defer" means that an application will wake kswapd in the background
- to reclaim pages and wake kcompact to compact memory so that THP is
- available in the near future. It's the responsibility of khugepaged
- to then install the THP pages later.
- "madvise" will enter direct reclaim like "always" but only for regions
- that are have used madvise(MADV_HUGEPAGE). This is the default behaviour.
- "never" should be self-explanatory.

By default the kernel tries to use the huge zero page on read page
faults to anonymous mappings. It's possible to disable the huge zero
page by writing 0 or enable it back by writing 1:

echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page

khugepaged will be automatically started when
transparent_hugepage/enabled is set to "always" or "madvise", and it'll
be automatically shut down if it's set to "never".

khugepaged usually runs at low frequency, so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However it's
also possible to disable defrag in khugepaged by writing 0 or enable
defrag in khugepaged by writing 1:

echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag

You can also control how many pages khugepaged should scan at each
pass:

/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

and how many milliseconds to wait in khugepaged between each pass (you
can set this to 0 to run khugepaged at 100% utilization of one core):

/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

and how many milliseconds to wait in khugepaged if there's a hugepage
allocation failure, to throttle the next allocation attempt:

/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
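
For example (the value below is only illustrative), a system that does
not mind dedicating one core to khugepaged could make it scan
continuously with:

echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs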

The khugepaged progress can be seen in the number of pages collapsed:

/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed

and in the number of completed scan passes:

/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans

max_ptes_none specifies how many extra small pages (that are
not already mapped) can be allocated when collapsing a group
of small pages into one large page:

/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

A higher value means more previously unmapped pages get allocated
during collapsing, so programs may end up using additional memory. A
lower value reduces that waste but also reduces how often collapsing
can happen, and therefore the THP performance gain. The value of
max_ptes_none has a negligible effect on CPU time, so it can be
ignored in that respect.

max_ptes_swap specifies how many pages can be brought in from
swap when collapsing a group of pages into a transparent huge page:

/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap

A higher value can cause excessive swap IO and waste
memory. A lower value can prevent THPs from being
collapsed, resulting in fewer pages being collapsed into
THPs, and lower memory access performance.

== Boot parameter ==

You can change the sysfs boot time defaults of Transparent Hugepage
Support by passing the parameter "transparent_hugepage=always" or
"transparent_hugepage=madvise" or "transparent_hugepage=never"
(without the quotes) on the kernel command line.

== Hugepages in tmpfs/shmem ==

You can control the hugepage allocation policy in tmpfs with the mount
option "huge=". It can have the following values:

  - "always":
        Attempt to allocate huge pages every time we need a new page;

  - "never":
        Do not allocate huge pages;

  - "within_size":
        Only allocate a huge page if it will be fully within i_size.
        Also respect fadvise()/madvise() hints;

  - "advise":
        Only allocate huge pages if requested with fadvise()/madvise();

The default policy is "never".
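
For example, a tmpfs instance that always tries to allocate huge pages
could be mounted with (the mount point below is only an example):

mount -t tmpfs -o huge=always tmpfs /mnt/mytmpfs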
- "mount -o remount,huge= /mountpoint" works fine after mount: remounting
- huge=never will not attempt to break up huge pages at all, just stop more
- from being allocated.

There's also a sysfs knob to control the hugepage allocation policy for
the internal shmem mount:
/sys/kernel/mm/transparent_hugepage/shmem_enabled. This mount is used
for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
MAP_ANONYMOUS), GPU drivers' DRM objects, and Ashmem.

In addition to the policies listed above, shmem_enabled allows two
further values:

  - "deny":
        For use in emergencies, to force the huge option off from
        all mounts;

  - "force":
        Force the huge option on for all - very useful for testing;

== Need of application restart ==

The transparent_hugepage/enabled values and the tmpfs mount option only
affect future behavior. So to make them effective you need to restart
any application that could have been using hugepages. This also applies
to the regions registered in khugepaged.

== Monitoring usage ==

The number of anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in /proc/meminfo.
To identify which applications are using anonymous transparent huge
pages, it is necessary to read /proc/PID/smaps and count the
AnonHugePages fields for each mapping.

The number of file transparent huge pages mapped to userspace is
available by reading the ShmemPmdMapped and ShmemHugePages fields in
/proc/meminfo. To identify which applications are mapping file
transparent huge pages, it is necessary to read /proc/PID/smaps and
count the FileHugeMapped fields for each mapping.

Note that reading the smaps file is expensive and reading it
frequently will incur overhead.
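
As a hedged illustration (the helper name is made up for this example
and error handling is minimal), such a scan could look like:

#include <stdio.h>

/* Sum the AnonHugePages fields of /proc/<pid>/smaps, in kB. */
static long anon_thp_kb(const char *pid)
{
        char path[64], line[256];
        long kb, total = 0;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%s/smaps", pid);
        f = fopen(path, "r");
        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f))
                if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
                        total += kb;
        fclose(f);
        return total;
}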

There are a number of counters in /proc/vmstat that may be used to
monitor how successfully the system is providing huge pages for use.

thp_fault_alloc is incremented every time a huge page is successfully
        allocated to handle a page fault. This applies both to the
        first time a page is faulted and to COW faults.

thp_collapse_alloc is incremented by khugepaged when it has found
        a range of pages to collapse into one huge page and has
        successfully allocated a new huge page to store the data.

thp_fault_fallback is incremented if a page fault fails to allocate
        a huge page and instead falls back to using small pages.

thp_collapse_alloc_failed is incremented if khugepaged found a range
        of pages that should be collapsed into one huge page but failed
        the allocation.

thp_file_alloc is incremented every time a file huge page is
        successfully allocated.

thp_file_mapped is incremented every time a file huge page is mapped
        into user address space.

thp_split_page is incremented every time a huge page is split into base
        pages. This can happen for a variety of reasons but a common
        reason is that a huge page is old and is being reclaimed.
        This action implies splitting all PMDs the page is mapped with.

thp_split_page_failed is incremented if the kernel fails to split a
        huge page. This can happen if the page was pinned by somebody.

thp_deferred_split_page is incremented when a huge page is put onto the
        split queue. This happens when a huge page is partially unmapped
        and splitting it would free up some memory. Pages on the split
        queue are going to be split under memory pressure.

thp_split_pmd is incremented every time a PMD is split into a table of
        PTEs. This can happen, for instance, when an application calls
        mprotect() or munmap() on part of a huge page. It doesn't split
        the huge page, only the page table entry.

thp_zero_page_alloc is incremented every time a huge zero page is
        successfully allocated. It includes allocations which were
        dropped due to a race with another allocation. Note, it doesn't
        count every map of the huge zero page, only its allocation.

thp_zero_page_alloc_failed is incremented if the kernel fails to
        allocate the huge zero page and falls back to using small pages.
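
All of the thp_ counters above can be read at once with, for example:

grep thp_ /proc/vmstat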

As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in /proc/vmstat to help
monitor this overhead.

compact_stall is incremented every time a process stalls to run
        memory compaction so that a huge page is free for use.

compact_success is incremented if the system compacted memory and
        freed a huge page for use.

compact_fail is incremented if the system tries to compact memory
        but failed.

compact_pages_moved is incremented each time a page is moved. If
        this value is increasing rapidly, it implies that the system
        is copying a lot of data to satisfy the huge page allocation.
        It is possible that the cost of copying exceeds any savings
        from reduced TLB misses.

compact_pagemigrate_failed is incremented when the underlying mechanism
        for moving a page failed.

compact_blocks_moved is incremented each time memory compaction examines
        a huge page aligned range of pages.

It is possible to establish how long the stalls were by using the
function tracer to record how much time was spent in
__alloc_pages_nodemask and using the mm_page_alloc tracepoint to
identify which allocations were for huge pages.

== get_user_pages and follow_page ==

get_user_pages and follow_page, if run on a hugepage, will return the
head or tail pages as usual (exactly as they would do on
hugetlbfs). Most gup users will only care about the actual physical
address of the page and its temporary pinning to release after the I/O
is complete, so they won't ever notice the fact that the page is huge.
But if any driver is going to mangle over the page structure of the
tail page (like checking page->mapping or other bits that are relevant
for the head page and not the tail page), it should be updated to
check the head page instead. Taking a reference on any head/tail page
would prevent the page from being split by anyone.

NOTE: these aren't new constraints to the GUP API, and they match the
same constraints that apply to hugetlbfs too, so any driver capable
of handling GUP on hugetlbfs will also work fine on transparent
hugepage backed mappings.

In case you can't handle compound pages if they're returned by
follow_page, the FOLL_SPLIT bit can be specified as a parameter to
follow_page, so that it will split the hugepages before returning
them. Migration for example passes FOLL_SPLIT as a parameter to
follow_page because it's not hugepage aware and in fact it can't work
at all on hugetlbfs (but it instead works fine on transparent
hugepages thanks to FOLL_SPLIT). Migration simply can't deal with
hugepages being returned (as it's not only checking the pfn of the
page and pinning it during the copy but it pretends to migrate the
memory in regular page sizes and with regular pte/pmd mappings).
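
As a hedged illustration of that pattern (the helper name is made up
for this example and locking is reduced to a comment), a caller that
cannot deal with compound pages could do:

#include <linux/mm.h>

/*
 * Must be called with mmap_sem held. FOLL_SPLIT asks follow_page() to
 * split any THP first, FOLL_GET pins the returned base page, to be
 * released later with put_page().
 */
static struct page *get_base_page(struct vm_area_struct *vma,
                                  unsigned long addr)
{
        return follow_page(vma, addr, FOLL_GET | FOLL_SPLIT);
}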

== Optimizing the applications ==

To be guaranteed that the kernel will map a 2M page immediately in any
memory region, the mmapped region has to be hugepage naturally
aligned. posix_memalign() can provide that guarantee.
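
A minimal sketch of that approach (the 2M hugepage size is an x86-64
assumption, and the madvise() hint is optional, see above):

#include <stdlib.h>
#include <sys/mman.h>

#define HPAGE_SIZE      (2UL * 1024 * 1024)

static void *alloc_hugepage_aligned(size_t len)
{
        void *p = NULL;

        /*
         * Align the buffer to the hugepage size so the very first 2M
         * of it can be mapped by a single huge pmd.
         */
        if (posix_memalign(&p, HPAGE_SIZE, len))
                return NULL;
        madvise(p, len, MADV_HUGEPAGE);
        return p;
}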

== Hugetlbfs ==

You can use hugetlbfs on a kernel that has transparent hugepage
support enabled just fine as always. No difference can be noted in
hugetlbfs other than there will be less overall fragmentation. All
usual features belonging to hugetlbfs are preserved and
unaffected. libhugetlbfs will also work fine as usual.

== Graceful fallback ==

Code walking pagetables but unaware about huge pmds can simply call
split_huge_pmd(vma, pmd, addr) where the pmd is the one returned by
pmd_offset. It's trivial to make the code transparent hugepage aware
by just grepping for "pmd_offset" and adding split_huge_pmd where
missing after pmd_offset returns the pmd. Thanks to the graceful
fallback design, with a one liner change, you can avoid writing
hundreds if not thousands of lines of complex code to make your code
hugepage aware.

If you're not walking pagetables but you run into a physical hugepage
that you can't handle natively in your code, you can split it by
calling split_huge_page(page). This is what the Linux VM does before
it tries to swap out the hugepage, for example. split_huge_page() can
fail if the page is pinned and you must handle this correctly.

Example to make mremap.c transparent hugepage aware with a one liner
change:

diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru
 		return NULL;

 	pmd = pmd_offset(pud, addr);
+	split_huge_pmd(vma, pmd, addr);
 	if (pmd_none_or_clear_bad(pmd))
 		return NULL;

== Locking in hugepage aware code ==

We want as much code as possible to be hugepage aware, as calling
split_huge_page() or split_huge_pmd() has a cost.

To make pagetable walks huge pmd aware, all you need to do is to call
pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
mmap_sem in read (or write) mode to be sure a huge pmd cannot be
created from under you by khugepaged (khugepaged collapse_huge_page
takes the mmap_sem in write mode in addition to the anon_vma lock). If
pmd_trans_huge returns false, you just fall back to the old code
paths. If instead pmd_trans_huge returns true, you have to take the
page table lock (pmd_lock()) and re-run pmd_trans_huge. Taking the
page table lock will prevent the huge pmd from being converted into a
regular pmd from under you (split_huge_pmd can run in parallel to the
pagetable walk). If the second pmd_trans_huge returns false, you
should just drop the page table lock and fall back to the old code as
before. Otherwise you can proceed to process the huge pmd and the
hugepage natively. Once finished you can drop the page table lock.
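
A sketch of that pattern (illustrative only; the function name is made
up and the actual processing is reduced to comments):

static void walk_one_pmd(struct vm_area_struct *vma, pmd_t *pmd,
                         unsigned long addr)
{
        /* Caller holds mmap_sem for read (or write). */
        if (pmd_trans_huge(*pmd)) {
                spinlock_t *ptl = pmd_lock(vma->vm_mm, pmd);

                /*
                 * Re-check: split_huge_pmd() may have run in parallel
                 * before we took the page table lock.
                 */
                if (pmd_trans_huge(*pmd)) {
                        /* ... process the huge pmd natively ... */
                        spin_unlock(ptl);
                        return;
                }
                spin_unlock(ptl);
        }
        /* ... fall back to the regular pte-based code path ... */
}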

== Refcounts and transparent huge pages ==

Refcounting on THP is mostly consistent with refcounting on other
compound pages:

  - get_page()/put_page() and GUP operate on the head page's
    ->_refcount.

  - ->_refcount in tail pages is always zero: get_page_unless_zero()
    never succeeds on tail pages.

  - map/unmap of individual pages with a PTE entry increments/decrements
    ->_mapcount on the relevant sub-page of the compound page.

  - map/unmap of the whole compound page is accounted in
    compound_mapcount (stored in the first tail page). For file huge
    pages, we also increment ->_mapcount of all sub-pages in order to
    have race-free detection of the last unmap of subpages.

PageDoubleMap() indicates that the page is *possibly* mapped with PTEs.

For anonymous pages, PageDoubleMap() also indicates that ->_mapcount in
all subpages is offset up by one. This additional reference is required
to get race-free detection of unmap of subpages when we have them
mapped with both PMDs and PTEs.

This optimization is required to lower the overhead of per-subpage
mapcount tracking. The alternative would be to alter ->_mapcount in all
subpages on each map/unmap of the whole compound page.

For anonymous pages, we set PG_double_map when a PMD of the page gets
split for the first time, but the page still has a PMD mapping. The
additional references go away with the last compound_mapcount.

File pages get PG_double_map set on the first map of the page with a
PTE, and the flag goes away when the page gets evicted from the page
cache.

split_huge_page internally has to distribute the refcounts in the head
page to the tail pages before clearing all PG_head/tail bits from the
page structures. It can be done easily for refcounts taken by page
table entries. But we don't have enough information on how to
distribute any additional pins (i.e. from get_user_pages).
split_huge_page() fails any request to split a pinned huge page: it
expects the page count to be equal to the sum of the mapcounts of all
sub-pages plus one (the split_huge_page caller must have a reference on
the head page).

split_huge_page uses migration entries to stabilize page->_refcount and
page->_mapcount of anonymous pages. File pages just get unmapped.

We are safe against physical memory scanners too: the only legitimate
way a scanner can get a reference to a page is get_page_unless_zero().

All tail pages have zero ->_refcount until atomic_add(). This prevents
the scanner from getting a reference to a tail page up to that point.
After the atomic_add() we don't care about the ->_refcount value. We
already know how many references should be uncharged from the head
page.

For the head page get_page_unless_zero() will succeed and we don't
mind. It's clear where the reference should go after split: it will
stay on the head page.

Note that split_huge_pmd() doesn't have any limitation on refcounting:
the pmd can be split at any point and never fails.

== Partial unmap and deferred_split_huge_page() ==

Unmapping part of a THP (with munmap() or some other way) is not going
to free the memory immediately. Instead, we detect that a subpage of
the THP is not in use in page_remove_rmap() and queue the THP for
splitting in case memory pressure comes. Splitting will free up the
unused subpages.

Splitting the page right away is not an option due to the locking
context in the place where we can detect partial unmap. It also might
be counterproductive since in many cases partial unmap happens during
exit(2) if a THP crosses a VMA boundary.

The function deferred_split_huge_page() is used to queue a page for
splitting. The splitting itself will happen when we get memory pressure
via the shrinker interface.