mm/hugetlb_vmemmap: fix race with speculative PFN walkers

While investigating HVO for THPs [1], it turns out that speculative PFN
walkers like compaction can race with vmemmap modifications, e.g.,

  CPU 1 (vmemmap modifier)         CPU 2 (speculative PFN walker)
  -------------------------------  ------------------------------
  Allocates an LRU folio page1
                                   Sees page1
  Frees page1

  Allocates a hugeTLB folio page2
  (page1 being a tail of page2)

  Updates vmemmap mapping page1
                                   get_page_unless_zero(page1)

Even though page1->_refcount is zero after HVO, get_page_unless_zero() can
still try to modify this read-only field, resulting in a crash.

An independent report [2] confirmed this race.

There are two discussed approaches to fix this race:
1. Make RO vmemmap RW so that get_page_unless_zero() can fail without
   triggering a PF.
2. Use RCU to make sure get_page_unless_zero() either sees zero
   page->_refcount through the old vmemmap or non-zero page->_refcount
   through the new one.

The second approach is preferred here because:
1. It can prevent illegal modifications to struct page[] that has been
   HVO'ed;
2. It can be generalized, in a way similar to ZERO_PAGE(), to fix
   similar races in other places, e.g., arch_remove_memory() on x86
   [3], which frees vmemmap mapping offlined struct page[].

While adding synchronize_rcu(), the goal is to be surgical, rather than
optimized. Specifically, calls to synchronize_rcu() on the error handling
paths can be coalesced, but it is not done for the sake of simplicity:
noticeably, this fix removes ~50% more lines than it adds.

According to the hugetlb_optimize_vmemmap section in
Documentation/admin-guide/sysctl/vm.rst, enabling HVO makes allocating or
freeing hugeTLB pages "~2x slower than before". Having synchronize_rcu()
on top makes those operations even worse, and this also affects the user
interface /proc/sys/vm/nr_overcommit_hugepages.

This is *very* hard to trigger:

1. Most hugeTLB use cases I know of are static, i.e., reserved at
   boot time, because allocating at runtime is not reliable at all.
2. On top of that, someone has to be very unlucky to get tripped
   over above, because the race window is so small -- I wasn't able to
   trigger it with a stress testing that does nothing but that (with
   THPs though).

[1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/
[2] https://lore.kernel.org/917FFC7F-0615-44DD-90EE-9F85F8EA9974@linux.dev/
[3] https://lore.kernel.org/be130a96-a27e-4240-ad78-776802f57cad@redhat.com/

Link: https://lkml.kernel.org/r/20240627222705.2974207-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
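
For readers unfamiliar with the "get unless zero" idiom that the message above relies on, here is a minimal userspace sketch using C11 atomics. It is a model only, not the kernel's implementation: struct model_page and model_get_unless_zero() are hypothetical names, and neither RCU nor the read-only vmemmap mapping is modeled. It only shows the intended outcome, namely that a reference is taken when the count is non-zero and refused when it is zero.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only the refcount is modeled. */
struct model_page {
	atomic_int refcount;
};

/*
 * Take a reference only if the count is currently non-zero.
 * Returns true if the reference was taken, false if the count was zero.
 */
static bool model_get_unless_zero(struct model_page *p)
{
	int old = atomic_load(&p->refcount);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;
	}
	return false;
}
```

In this model a caller that sees a zero count simply gets false back. The race described in the message is about which mapping that attempt goes through, which is what the second approach addresses with RCU.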