  1. Nov 27, 2023
    • hugetlb: fix null-ptr-deref in hugetlb_vma_lock_write · fe697327
      Mike Kravetz authored
      The routine __vma_private_lock tests for the existence of a reserve map
      associated with a private hugetlb mapping.  A pointer to the reserve map
      is in vma->vm_private_data.  __vma_private_lock was checking the pointer
      for NULL.  However, it is possible that the low bits of the pointer could
      be used as flags.  In such instances, vm_private_data is not NULL and not
      a valid pointer.  This results in the null-ptr-deref reported by syzbot:
      
      general protection fault, probably for non-canonical address 0xdffffc000000001d: 0000 [#1] PREEMPT SMP KASAN
      KASAN: null-ptr-deref in range [0x00000000000000e8-0x00000000000000ef]
      CPU: 0 PID: 5048 Comm: syz-executor139 Not tainted 6.6.0-rc7-syzkaller-00142-g888cf78c29e2 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
      RIP: 0010:__lock_acquire+0x109/0x5de0 kernel/locking/lockdep.c:5004
      ...
      Call Trace:
       <TASK>
       lock_acquire kernel/locking/lockdep.c:5753 [inline]
       lock_acquire+0x1ae/0x510 kernel/locking/lockdep.c:5718
       down_write+0x93/0x200 kernel/locking/rwsem.c:1573
       hugetlb_vma_lock_write mm/hugetlb.c:300 [inline]
       hugetlb_vma_lock_write+0xae/0x100 mm/hugetlb.c:291
       __hugetlb_zap_begin+0x1e9/0x2b0 mm/hugetlb.c:5447
       hugetlb_zap_begin include/linux/hugetlb.h:258 [inline]
       unmap_vmas+0x2f4/0x470 mm/memory.c:1733
       exit_mmap+0x1ad/0xa60 mm/mmap.c:3230
       __mmput+0x12a/0x4d0 kernel/fork.c:1349
       mmput+0x62/0x70 kernel/fork.c:1371
       exit_mm kernel/exit.c:567 [inline]
       do_exit+0x9ad/0x2a20 kernel/exit.c:861
       __do_sys_exit kernel/exit.c:991 [inline]
       __se_sys_exit kernel/exit.c:989 [inline]
       __x64_sys_exit+0x42/0x50 kernel/exit.c:989
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Mask off the low bit flags before checking for a NULL pointer.  In addition, the
      reserve map only 'belongs' to the OWNER (parent in parent/child
      relationships) so also check for the OWNER flag.
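      
      A minimal sketch of the check described above, assuming flag and mask
      names along the lines of the HPAGE_RESV_* definitions in mm/hugetlb.c
      (the helper name here is illustrative, not the exact fix):
      
          /* Treat vm_private_data as a reserve map pointer only when the
           * low-bit flags are masked off and this VMA is the reservation
           * OWNER (the parent in parent/child relationships). */
          static bool vma_has_private_resv_lock(struct vm_area_struct *vma)
          {
              unsigned long priv = (unsigned long)vma->vm_private_data;
      
              return !(vma->vm_flags & VM_MAYSHARE) &&   /* private mapping */
                     (priv & ~HPAGE_RESV_MASK) &&        /* real pointer bits remain */
                     (priv & HPAGE_RESV_OWNER);          /* VMA owns the resv_map */
          }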
      
      Link: https://lkml.kernel.org/r/20231114012033.259600-1-mike.kravetz@oracle.com
      
      
      Reported-by: <syzbot+6ada951e7c0f7bc8a71e@syzkaller.appspotmail.com>
      Closes: https://lore.kernel.org/linux-mm/00000000000078d1e00608d7878b@google.com/
      
      
      Fixes: bf491692 ("hugetlbfs: extend hugetlb_vma_lock to private VMAs")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Cc: Edward Adam Davis <eadavis@qq.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Tom Rix <trix@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fe697327
  2. Nov 21, 2023
  3. Oct 25, 2023
    • mempolicy: mmap_lock is not needed while migrating folios · 72e315f7
      Hugh Dickins authored
      mbind(2) holds down_write of current task's mmap_lock throughout
      (exclusive because it needs to set the new mempolicy on the vmas);
      migrate_pages(2) holds down_read of pid's mmap_lock throughout.
      
      They both hold mmap_lock across the internal migrate_pages(), under which
      all new page allocations (huge or small) are made.  I'm nervous about it;
      and migrate_pages() certainly does not need mmap_lock itself.  It's done
      this way for mbind(2), because its page allocator is vma_alloc_folio() or
      alloc_hugetlb_folio_vma(), both of which depend on vma and address.
      
      Now that we have alloc_pages_mpol(), depending on (refcounted) memory
      policy and interleave index, mbind(2) can be modified to use that or
      alloc_hugetlb_folio_nodemask(), and then not need mmap_lock across the
      internal migrate_pages() at all: add alloc_migration_target_by_mpol() to
      replace mbind's new_page().
      
      (After that change, alloc_hugetlb_folio_vma() is used by nothing but a
      userfaultfd function: move it out of hugetlb.h and into the #ifdef.)
      
      migrate_pages(2) has chosen its target node before migrating, so can
      continue to use the standard alloc_migration_target(); but let it take and
      drop mmap_lock just around migrate_to_node()'s queue_pages_range():
      neither the node-to-node calculations nor the page migrations need it.
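      
      A rough sketch of that locking shape; the helpers marked as hypothetical
      are simplifications, not the exact mempolicy.c code:
      
          static long migrate_to_node_sketch(struct mm_struct *mm, int source, int dest)
          {
              LIST_HEAD(pagelist);
              nodemask_t nmask;
              long nr_failed;
      
              nodes_clear(nmask);
              node_set(source, nmask);
      
              /* only the page-table walk needs mmap_lock */
              mmap_read_lock(mm);
              queue_source_folios(mm, &nmask, &pagelist);          /* hypothetical helper */
              mmap_read_unlock(mm);
      
              /* neither the node-to-node calculation nor the migration needs it */
              nr_failed = migrate_folios_to_node(&pagelist, dest); /* hypothetical helper */
              if (!list_empty(&pagelist))
                  putback_movable_pages(&pagelist);
              return nr_failed;
          }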
      
      It seems unlikely, but it is conceivable that some userspace depends on
      the kernel's mmap_lock exclusion here, instead of doing its own locking:
      more likely in a testsuite than in real life.  It is also possible, of
      course, that some pages on the list will be munmapped by another thread
      before they are migrated, or a newer memory policy applied to the range by
      that time: but such races could happen before, as soon as mmap_lock was
      dropped, so it does not appear to be a concern.
      
      Link: https://lkml.kernel.org/r/21e564e8-269f-6a89-7ee2-fd612831c289@google.com
      
      
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tejun heo <tj@kernel.org>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      72e315f7
    • hugetlb_vmemmap: use folio argument for hugetlb_vmemmap_* functions · c5ad3233
      Usama Arif authored
      Most function calls in hugetlb.c are made with folio arguments.  This
      brings the hugetlb_vmemmap calls in line with them by using the folio
      instead of the head struct page.  The head struct page is still needed
      within these functions.
      
      The set/clear/test functions for hugepages are also changed to folio
      versions.
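      
      The calling convention change, roughly (a sketch of the idea only; see
      mm/hugetlb_vmemmap.h for the real prototypes):
      
          /* before: callers passed the head struct page */
          hugetlb_vmemmap_optimize(h, &folio->page);
          hugetlb_vmemmap_restore(h, &folio->page);
      
          /* after: callers pass the folio; the head page is derived inside */
          hugetlb_vmemmap_optimize_folio(h, folio);
          hugetlb_vmemmap_restore_folio(h, folio);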
      
      Link: https://lkml.kernel.org/r/20231011144557.1720481-2-usama.arif@bytedance.com
      
      
      Signed-off-by: Usama Arif <usama.arif@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Punit Agrawal <punit.agrawal@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c5ad3233
    • hugetlb: perform vmemmap restoration on a list of pages · cfb8c750
      Mike Kravetz authored
      The routine update_and_free_pages_bulk already performs vmemmap
      restoration on the list of hugetlb pages in a separate step.  In
      preparation for more functionality to be added in this step, create a new
      routine hugetlb_vmemmap_restore_folios() that will restore vmemmap for a
      list of folios.
      
      This new routine must provide sufficient feedback about errors and actual
      restoration performed so that update_and_free_pages_bulk can perform
      optimally.
      
      Special care must be taken when encountering an error from
      hugetlb_vmemmap_restore_folios.  We want to continue making as much
      forward progress as possible.  A new routine bulk_vmemmap_restore_error
      handles this specific situation.
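      
      The shape of such a list-based restore, as a simplified sketch (not the
      literal hugetlb_vmemmap.c code): keep going on error so that as much
      vmemmap as possible is restored, and report both the first error and
      the number of folios actually restored.
      
          static long restore_folios_sketch(const struct hstate *h,
                                            struct list_head *folio_list)
          {
              struct folio *folio;
              long restored = 0;
              long ret = 0;
      
              list_for_each_entry(folio, folio_list, lru) {
                  int err = hugetlb_vmemmap_restore(h, &folio->page);
      
                  if (err)
                      ret = err;      /* remember the error ...            */
                  else
                      restored++;     /* ... but keep making progress      */
              }
      
              return ret ? ret : restored;
          }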
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-5-mike.kravetz@oracle.com
      
      
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cfb8c750
    • hugetlb: perform vmemmap optimization on a list of pages · 79359d6d
      Mike Kravetz authored
      When adding hugetlb pages to the pool, we first create a list of the
      allocated pages before adding to the pool.  Pass this list of pages to a
      new routine hugetlb_vmemmap_optimize_folios() for vmemmap optimization.
      
      Due to significant differences in vmemmap initialization for
      bootmem-allocated hugetlb pages, a new routine prep_and_add_bootmem_folios
      is created.
      
      We also modify the routine vmemmap_should_optimize() to check for pages
      that are already optimized.  There are code paths that might request
      vmemmap optimization twice and we want to make sure this is not attempted.
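      
      A sketch of the list-based pass and the already-optimized check
      (simplified; the flag test and helper names are approximations of the
      real hugetlb_vmemmap.c code):
      
          static void optimize_folios_sketch(const struct hstate *h,
                                             struct list_head *folio_list)
          {
              struct folio *folio;
      
              list_for_each_entry(folio, folio_list, lru) {
                  /* vmemmap_should_optimize(): skip pages whose vmemmap is
                   * already optimized, e.g. when optimization is requested
                   * twice for the same page */
                  if (HPageVmemmapOptimized(&folio->page))
                      continue;
                  hugetlb_vmemmap_optimize(h, &folio->page);
              }
          }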
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-4-mike.kravetz@oracle.com
      
      
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      79359d6d
    • hugetlb: restructure pool allocations · d67e32f2
      Mike Kravetz authored
      Allocation of a hugetlb page for the hugetlb pool is done by the routine
      alloc_pool_huge_page.  This routine will allocate contiguous pages from a
      low level allocator, prep the pages for usage as a hugetlb page and then
      add the resulting hugetlb page to the pool.
      
      In the 'prep' stage, optional vmemmap optimization is done.  For
      performance reasons we want to perform vmemmap optimization on multiple
      hugetlb pages at once.  To do this, restructure the hugetlb pool
      allocation code such that vmemmap optimization can be isolated and later
      batched.
      
      The code to allocate hugetlb pages from bootmem was also modified to
      allow batching.
      
      No functional changes, only code restructure.
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-3-mike.kravetz@oracle.com
      
      
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d67e32f2
    • hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles · d2cf88c2
      Mike Kravetz authored
      Patch series "Batch hugetlb vmemmap modification operations", v8.
      
      When hugetlb vmemmap optimization was introduced, the overhead of enabling
      the option was measured as described in commit 426e5c42 [1].  The
      summary states that allocating a hugetlb page should be ~2x slower with
      optimization and freeing a hugetlb page should be ~2-3x slower.  Such
      overhead was deemed an acceptable trade off for the memory savings
      obtained by freeing vmemmap pages.
      
      It was recently reported that the overhead associated with enabling
      vmemmap optimization could be as high as 190x for hugetlb page
      allocations.  Yes, 190x!  Some actual numbers from other environments are:
      
      Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
      ------------------------------------------------
      Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
      time echo 500000 > .../hugepages-2048kB/nr_hugepages
      real    0m4.119s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    0m4.477s
      
      Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
      time echo 500000 > .../hugepages-2048kB/nr_hugepages
      real    0m28.973s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    0m36.748s
      
      VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
      -----------------------------------------------------------
      Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
      time echo 524288 > .../hugepages-2048kB/nr_hugepages
      real    0m2.463s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    0m2.931s
      
      Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
      time echo 524288 > .../hugepages-2048kB/nr_hugepages
      real    2m27.609s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    2m29.924s
      
      In the VM environment, enabling hugetlb vmemmap optimization resulted in
      allocation times being 61x slower.
      
      A quick profile showed that the vast majority of this overhead was due to
      TLB flushing.  Each time we modify the kernel pagetable we need to flush
      the TLB.  For each hugetlb page that is optimized, there could potentially be
      two TLB flushes performed.  One for the vmemmap pages associated with the
      hugetlb page, and potentially another one if the vmemmap pages are mapped
      at the PMD level and must be split.  The TLB flushes required for the
      kernel pagetable result in a broadcast IPI with each CPU having to flush
      a range of pages, or do a global flush if a threshold is exceeded.  So,
      the flush time increases with the number of CPUs.  In addition, in virtual
      environments the broadcast IPI can’t be accelerated by hypervisor
      hardware and leads to traps that need to wakeup/IPI all vCPUs which is
      very expensive.  Because of this the slowdown in virtual environments is
      even worse than on bare metal as the number of vCPUs/CPUs increases.
      
      The following series attempts to reduce the amount of time spent in TLB
      flushing.  The idea is to batch the vmemmap modification operations for
      multiple hugetlb pages.  Instead of doing one or two TLB flushes for each
      page, we do two TLB flushes for each batch of pages.  One flush after
      splitting pages mapped at the PMD level, and another after remapping
      vmemmap associated with all hugetlb pages.  Results of such batching are
      as follows:
      
      Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
      ------------------------------------------------
      next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
      time echo 500000 > .../hugepages-2048kB/nr_hugepages
      real    0m4.719s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    0m4.245s
      
      next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
      time echo 500000 > .../hugepages-2048kB/nr_hugepages
      real    0m7.267s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    0m13.199s
      
      VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
      -----------------------------------------------------------
      next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
      time echo 524288 > .../hugepages-2048kB/nr_hugepages
      real    0m2.715s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    0m3.186s
      
      next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
      time echo 524288 > .../hugepages-2048kB/nr_hugepages
      real    0m4.799s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    0m5.273s
      
      With batching, results are back in the 2-3x slowdown range.
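      
      Schematically, the batching described above replaces per-page flushes
      with two flushes per batch; helper and flag names below are
      approximations of the series' code, not exact:
      
          /* pass 1: split any PMD-mapped vmemmap for every folio, then flush once */
          list_for_each_entry(folio, folio_list, lru)
              hugetlb_vmemmap_split(h, &folio->page);            /* assumed helper */
          flush_tlb_all();
      
          /* pass 2: remap vmemmap for every folio with per-page flushes
           * suppressed, then flush once for the whole batch */
          list_for_each_entry(folio, folio_list, lru)
              __hugetlb_vmemmap_optimize(h, &folio->page,
                                         VMEMMAP_REMAP_NO_TLB_FLUSH); /* assumed flag */
          flush_tlb_all();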
      
      
      This patch (of 8):
      
      update_and_free_pages_bulk is designed to free a list of hugetlb pages
      back to their associated lower level allocators.  This may require
      allocating vmemmap pages associated with each hugetlb page.  The hugetlb
      page destructor must be changed before pages are freed to lower level
      allocators.  However, the destructor must be changed under the hugetlb
      lock.  This means there is potentially one lock cycle per page.
      
      Minimize the number of lock cycles in update_and_free_pages_bulk by (as
      sketched below):
      1) allocating the necessary vmemmap for all hugetlb pages on the list
      2) taking the hugetlb lock and clearing the destructor for all pages on the list
      3) freeing all pages on the list back to the low level allocators
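      
      A simplified sketch of that structure (the helpers used in steps 2 and 3
      are assumed names, not the literal mm/hugetlb.c code):
      
          static void update_and_free_pages_bulk_sketch(struct hstate *h,
                                                        struct list_head *folio_list)
          {
              struct folio *folio, *t_folio;
      
              /* 1) restore vmemmap for every folio on the list, no lock needed */
              list_for_each_entry(folio, folio_list, lru)
                  hugetlb_vmemmap_restore(h, &folio->page);
      
              /* 2) one lock cycle to clear the hugetlb destructor on all folios */
              spin_lock_irq(&hugetlb_lock);
              list_for_each_entry(folio, folio_list, lru)
                  __clear_hugetlb_destructor(h, folio);          /* assumed helper */
              spin_unlock_irq(&hugetlb_lock);
      
              /* 3) hand every folio back to the low level allocator */
              list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
                  list_del(&folio->lru);
                  __update_and_free_hugetlb_folio(h, folio);     /* assumed helper */
              }
          }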
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20231019023113.345257-2-mike.kravetz@oracle.com
      
      
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: James Houghton <jthoughton@google.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d2cf88c2
  4. Oct 18, 2023
    • hugetlb: memcg: account hugetlb-backed memory in memory controller · 8cba9576
      Nhat Pham authored
      Currently, hugetlb memory usage is not accounted for in the memory
      controller, which could lead to memory overprotection for cgroups with
      hugetlb-backed memory.  This has been observed in our production system.
      
      For instance, here is one of our usecases: suppose there are two 32G
      containers.  The machine is booted with hugetlb_cma=6G, and each container
      may or may not use up to 3 gigantic pages, depending on the workload within
      it.  The rest is anon, cache, slab, etc.  We can set the hugetlb cgroup
      limit of each cgroup to 3G to enforce hugetlb fairness.  But it is very
      difficult to configure memory.max to keep overall consumption, including
      anon, cache, slab, etc., fair.
      
      What we have had to resort to is constantly polling hugetlb usage and
      readjusting memory.max.  A similar procedure is done for other memory
      limits (memory.low, for example).  However, this is rather cumbersome and
      buggy.  Furthermore, when there is a delay in correcting the memory limits
      (for example, when hugetlb usage changes between consecutive runs of the
      userspace agent), the system could be in an over- or underprotected state.
      
      This patch rectifies this issue by charging the memcg when the hugetlb
      folio is utilized, and uncharging when the folio is freed (analogous to
      the hugetlb controller).  Note that we do not charge when the folio is
      allocated to the hugetlb pool, because at this point it is not owned by
      any memcg.
      
      Some caveats to consider:
        * This feature is only available on cgroup v2.
        * There is no hugetlb pool management involved in the memory
          controller. As stated above, hugetlb folios are only charged towards
      the memory controller when they are used. Host overcommit management
          has to consider it when configuring hard limits.
        * Failure to charge towards the memcg results in SIGBUS. This could
          happen even if the hugetlb pool still has pages (but the cgroup
          limit is hit and reclaim attempt fails).
        * When this feature is enabled, hugetlb pages contribute to memory
          reclaim protection. low, min limits tuning must take into account
          hugetlb memory.
        * Hugetlb pages utilized while this option is not selected will not
          be tracked by the memory controller (even if cgroup v2 is remounted
          later on).
      
      Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
      
      
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Frank van der Linden <fvdl@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tejun heo <tj@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8cba9576
    • mm, hugetlb: remove HUGETLB_CGROUP_MIN_ORDER · 59838b25
      Frank van der Linden authored
      Originally, hugetlb_cgroup was the only hugetlb user of tail page
      structure fields.  So, the code defined and checked against
      HUGETLB_CGROUP_MIN_ORDER to make sure pages weren't too small to use.
      
      However, by now, tail page #2 is used to store hugetlb hwpoison and
      subpool information as well.  In other words, without that tail page
      hugetlb doesn't work.
      
      Acknowledge this fact by getting rid of HUGETLB_CGROUP_MIN_ORDER and
      checks against it.  Instead, just check for the minimum viable page order
      at hstate creation time.
      
      Link: https://lkml.kernel.org/r/20231004153248.3842997-1-fvdl@google.com
      
      
      Signed-off-by: Frank van der Linden <fvdl@google.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      59838b25
    • mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap() · 06968625
      David Hildenbrand authored
      Let's convert it to consume a folio.
      
      [akpm@linux-foundation.org: fix kerneldoc]
      Link: https://lkml.kernel.org/r/20231002142949.235104-3-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      06968625
    • mm/rmap: move SetPageAnonExclusive() out of page_move_anon_rmap() · 5ca43289
      David Hildenbrand authored
      Patch series "mm/rmap: convert page_move_anon_rmap() to
      folio_move_anon_rmap()".
      
      Convert page_move_anon_rmap() to folio_move_anon_rmap(), letting the
      callers handle PageAnonExclusive.  I'm including cleanup patch #3 because
      it fits into the picture and can be done cleaner by the conversion.
      
      
      This patch (of 3):
      
      Let's move it into the caller: there is a difference between whether an
      anon folio can only be mapped by one process (e.g., into one VMA), and
      whether it is truly exclusive (e.g., no references -- including GUP --
      from other processes).
      
      Further, for large folios the page might not actually be pointing at the
      head page of the folio, so it better be handled in the caller.  This is a
      preparation for converting page_move_anon_rmap() to consume a folio.
      
      Link: https://lkml.kernel.org/r/20231002142949.235104-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20231002142949.235104-2-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5ca43289
    • fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs · 52526ca7
      Muhammad Usama Anjum authored
      The PAGEMAP_SCAN IOCTL on the pagemap file can be used to get or optionally
      clear the info about page table entries. The following operations are
      supported in this IOCTL:
      - Scan the address range and get the memory ranges matching the provided
        criteria. This is performed when the output buffer is specified.
      - Write-protect the pages. PM_SCAN_WP_MATCHING is used to write-protect
        the pages of interest. PM_SCAN_CHECK_WPASYNC aborts the operation if
        pages that are not async write protected are found. PM_SCAN_WP_MATCHING
        can be used with or without PM_SCAN_CHECK_WPASYNC.
      - Both of those operations can be combined into one atomic operation where
        we can get and write protect the pages as well.
      
      The following flags about pages are currently supported:
      - PAGE_IS_WPALLOWED - Page has async-write-protection enabled
      - PAGE_IS_WRITTEN - Page has been written to from the time it was write protected
      - PAGE_IS_FILE - Page is file backed
      - PAGE_IS_PRESENT - Page is present in the memory
      - PAGE_IS_SWAPPED - Page is swapped out
      - PAGE_IS_PFNZERO - Page has zero PFN
      - PAGE_IS_HUGE - Page is THP or Hugetlb backed
      
      This IOCTL can be extended to get information about more PTE bits. The
      entire address range passed by the user, [start, end), is scanned until
      either the user-provided buffer is full or max_pages pages have been found.
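      
      A hedged userspace sketch of the interface (flag and struct field names
      as described here and, to the best of recollection, in the merged uapi
      headers; error handling omitted): report which pages in a range were
      written, and atomically write-protect them again.
      
          #include <fcntl.h>
          #include <stdio.h>
          #include <unistd.h>
          #include <sys/ioctl.h>
          #include <linux/fs.h>
      
          static int dump_written(void *start, size_t len)
          {
              struct page_region regions[32];
              struct pm_scan_arg arg = {
                  .size = sizeof(arg),
                  .flags = PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC,
                  .start = (unsigned long)start,
                  .end = (unsigned long)start + len,
                  .vec = (unsigned long)regions,
                  .vec_len = 32,
                  .category_mask = PAGE_IS_WRITTEN,   /* match written pages only */
                  .return_mask = PAGE_IS_WRITTEN,
              };
              int fd = open("/proc/self/pagemap", O_RDONLY);
              /* returns the number of filled-in regions, or a negative error */
              int n = ioctl(fd, PAGEMAP_SCAN, &arg);
      
              for (int i = 0; i < n; i++)
                  printf("written: %llx-%llx\n", regions[i].start, regions[i].end);
              close(fd);
              return n;
          }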
      
      [akpm@linux-foundation.org: update it for "mm: hugetlb: add huge page size param to set_huge_pte_at()"]
      [akpm@linux-foundation.org: fix CONFIG_HUGETLB_PAGE=n warning]
      [arnd@arndb.de: hide unused pagemap_scan_backout_range() function]
        Link: https://lkml.kernel.org/r/20230927060257.2975412-1-arnd@kernel.org
      [sfr@canb.auug.org.au: fix "fs/proc/task_mmu: hide unused pagemap_scan_backout_range() function"]
        Link: https://lkml.kernel.org/r/20230928092223.0625c6bf@canb.auug.org.au
      Link: https://lkml.kernel.org/r/20230821141518.870589-3-usama.anjum@collabora.com
      
      
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Reviewed-by: Andrei Vagin <avagin@gmail.com>
      Reviewed-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Miroslaw <emmir@google.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Paul Gofman <pgofman@codeweavers.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yun Zhou <yun.zhou@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      52526ca7
    • userfaultfd: UFFD_FEATURE_WP_ASYNC · d61ea1cb
      Peter Xu authored
      Patch series "Implement IOCTL to get and optionally clear info about
      PTEs", v33.
      
      *Motivation*
      The real motivation for adding the PAGEMAP_SCAN IOCTL is to emulate the
      Windows GetWriteWatch() and ResetWriteWatch() syscalls [1].
      GetWriteWatch() retrieves the addresses of the pages that have been
      written to in a region of virtual memory.
      
      This syscall is used in Windows applications and games, among others, and
      is currently emulated in userspace in a pretty slow manner.  Our purpose
      is to enhance the kernel so that it can be translated efficiently.
      Currently some out-of-tree hack patches are being used to emulate it
      efficiently in some kernels.  We intend to replace those with these
      patches, so gaming on Linux as a whole can benefit.  It means there would
      be tons of users of this code.
      
      CRIU use case [2] was mentioned by Andrei and Danylo:
      > Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
      > MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
      > shadow memory [4]. Being able to migrate such binaries allows to highly
      > reduce the amount of work needed to identify and fix post-migration
      > crashes, which happen constantly.
      
      Andrei defines the following uses of this code:
      * it is more granular and allows us to track changed pages more
        effectively. The current interface can clear dirty bits for the entire
        process only. In addition, reading info about pages is a separate
        operation. It means we must freeze the process to read information
        about all its pages, reset dirty bits, only then we can start dumping
        pages. The information about pages becomes more and more outdated,
        while we are processing pages. The new interface solves both these
        downsides. First, it allows us to read pte bits and clear the
        soft-dirty bit atomically. It means that CRIU will not need to freeze
        processes to pre-dump their memory. Second, it clears soft-dirty bits
        for a specified region of memory. It means CRIU will have actual info
        about pages to the moment of dumping them.
      * The new interface has to be much faster because basic page filtering
        is happening in the kernel. With the old interface, we have to read
        pagemap for each page.
      
      *Implementation Evolution (Short Summary)*
      From the definition of GetWriteWatch(), we felt that the kernel's soft-dirty
      feature could be used under the hood with some additions, such as:
      * reset soft-dirty flag for only a specific region of memory instead of
      clearing the flag for the entire process
      * get and clear soft-dirty flag for a specific region atomically
      
      So we decided to use an ioctl on the pagemap file to read and/or reset the
      soft-dirty flag.  But with the soft-dirty flag, we sometimes get extra pages
      which weren't even written: they had become soft-dirty because of VMA merging
      and the VM_SOFTDIRTY flag.  This breaks the definition of GetWriteWatch().
      We were able to bypass this shortcoming by ignoring VM_SOFTDIRTY until David
      reported that mprotect etc. messes up the soft-dirty flag while ignoring
      VM_SOFTDIRTY [5].  This wasn't happening until [6] was introduced.  We
      discussed whether we could revert these patches, but could not reach any
      conclusion.  So at this point, I made a couple of attempts to solve this
      whole VM_SOFTDIRTY issue by correcting the soft-dirty implementation:
      * [7] Correct the bug fixed wrongly back in 2014.  It had the potential to
      cause a regression, so we left it behind.
      * [8] Keep a list of the soft-dirty parts of a VMA across splits and merges.
      I got the reply not to increase the size of the VMA by 8 bytes.
      
      At this point, we left soft-dirty behind, considering it too delicate, and
      userfaultfd [9] seemed like the only way forward.  From there onward, we
      have been basing soft-dirty emulation on the userfaultfd wp feature, where
      the kernel resolves the faults itself when the WP_ASYNC feature is used.
      It was straightforward to add the WP_ASYNC feature to userfaultfd.  Now we
      get only those pages reported as dirty or written-to which were really
      written.  (PS: another userfaultfd feature, WP_UNPOPULATED, is required to
      avoid pre-faulting memory before write-protecting [9].)
      
      All the different masks were added at the request of the CRIU devs to make
      the interface more generic and useful.
      
      [1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-getwritewatch
      [2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
      [3] https://github.com/google/sanitizers
      [4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
      [5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
      [6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
      [7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.com
      [8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.com
      [9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
      [10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
      
      
      This patch (of 6):
      
      Add a new userfaultfd-wp feature UFFD_FEATURE_WP_ASYNC, that allows
      userfaultfd wr-protect faults to be resolved by the kernel directly.
      
      It can be used like a high-accuracy version of soft-dirty, without vma
      modifications during tracking, and with ranged support by default rather
      than whole-mm resets, since protections are reset through the existing
      ioctl(UFFDIO_WRITEPROTECT).
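      
      A hedged userspace sketch of enabling async uffd-wp for dirty tracking:
      request WP_ASYNC (which, per this changelog, implies WP_UNPOPULATED),
      register a range in WP mode, and use UFFDIO_WRITEPROTECT to (re)protect
      it.  Error handling omitted; not a complete tracker.
      
          #include <fcntl.h>
          #include <sys/ioctl.h>
          #include <sys/syscall.h>
          #include <unistd.h>
          #include <linux/userfaultfd.h>
      
          static int start_async_wp(void *addr, size_t len)
          {
              int uffd = syscall(SYS_userfaultfd,
                                 O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
      
              struct uffdio_api api = {
                  .api = UFFD_API,
                  .features = UFFD_FEATURE_WP_ASYNC | UFFD_FEATURE_WP_UNPOPULATED,
              };
              ioctl(uffd, UFFDIO_API, &api);
      
              struct uffdio_register reg = {
                  .range = { .start = (unsigned long)addr, .len = len },
                  .mode = UFFDIO_REGISTER_MODE_WP,
              };
              ioctl(uffd, UFFDIO_REGISTER, &reg);
      
              /* write-protect the range; faults are resolved by the kernel */
              struct uffdio_writeprotect wp = {
                  .range = { .start = (unsigned long)addr, .len = len },
                  .mode = UFFDIO_WRITEPROTECT_MODE_WP,
              };
              ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
      
              return uffd;
          }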
      
      Several goals of such a dirty tracking interface:
      
      1. All types of memory should be supported and trackable. This is natural
         for soft-dirty, but is worth mentioning in the userfaultfd context,
         because it used to only support anon/shmem/hugetlb. The problem is that
         for dirty tracking purposes these three types may not be enough, and it's
         legal to track anything, e.g. any page cache writes from mmap.
      
      2. Protections can be applied to part of a memory range, without vma
         split/merge fuss.  The hope is that the tracking itself should not
         affect any vma layout change.  It also helps when reset happens because
         the reset will not need mmap write lock which can block the tracee.
      
      3. Accuracy needs to be maintained.  This means we need pte markers to work
         on any type of VMA.
      
      One could argue that the whole concept of async dirty tracking is not
      really what userfaultfd was fundamentally about: it's not "a fault to be
      serviced by userspace" anymore. However, using userfaultfd-wp here as a
      framework is convenient for us in at least the following ways:
      
      1. VM_UFFD_WP vma flag, which has a very good name to suit something like
         this, so we don't need VM_YET_ANOTHER_SOFT_DIRTY. Just use a new
         feature bit to distinguish it from a sync version of uffd-wp registration.
      
      2. PTE marker logic can be leveraged across the whole kernel to maintain
         the uffd-wp bit as long as the arch supports it. This also applies to
         this case, where the uffd-wp bit will be a hint of dirty information
         and will not be lost easily (e.g. when some page cache ptes get zapped).
      
      3. The ioctl(UFFDIO_WRITEPROTECT) interface can be reused for either
         starting or resetting a range of memory, while there's no counterpart
         in the old soft-dirty world; if this were wanted in a new design, we
         would need a new interface.
      
      We can somehow understand that commonality because uffd-wp was
      fundamentally a similar idea of write-protecting pages just like
      soft-dirty.
      
      This implementation allows WP_ASYNC to imply WP_UNPOPULATED, because so
      far WP_ASYNC does not seem usable without WP_UNPOPULATED.  This also
      gives us a chance to change the implementation of WP_ASYNC in case it no
      longer depends on WP_UNPOPULATED in future kernels.  It's also fine
      to imply that because both features will rely on PTE_MARKER_UFFD_WP config
      option, so they'll show up together (or both missing) in an UFFDIO_API
      probe.
      
      vma_can_userfault() now allows any VMA if the userfaultfd registration is
      only about async uffd-wp.  So we can track dirty for all kinds of memory
      including generic file systems (like XFS, EXT4 or BTRFS).
      
      One trick worth mentioning in do_wp_page() is that we need to manually update
      vmf->orig_pte here because it can be used later with a pte_same() check -
      this path always has FAULT_FLAG_ORIG_PTE_VALID set in the flags.
      
      The major defect of this approach to dirty tracking is that we need to
      populate the pgtables when tracking starts.  Soft-dirty doesn't do it like
      that.  It's unwanted in the case where the range of memory to track is huge
      and unpopulated (e.g., tracking updates on a 10G file with mmap() on top,
      without having any page cache installed yet).  One way to improve this is
      to allow pte markers to exist at levels above PTE (PMD+).  That will not
      change the interface if implemented, so we can leave it for later.
      
      Link: https://lkml.kernel.org/r/20230821141518.870589-1-usama.anjum@collabora.com
      Link: https://lkml.kernel.org/r/20230821141518.870589-2-usama.anjum@collabora.com
      
      
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Co-developed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Miroslaw <emmir@google.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Paul Gofman <pgofman@codeweavers.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yun Zhou <yun.zhou@windriver.com>
      Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d61ea1cb
    • hugetlb: check for hugetlb folio before vmemmap_restore · 30a89adf
      Mike Kravetz authored
      In commit d8f5f7e4 ("hugetlb: set hugetlb page flag before
      optimizing vmemmap") checks were added to print a warning if
      hugetlb_vmemmap_restore was called on a non-hugetlb page.
      
      This was mostly due to ordering issues in the hugetlb page setup and
      teardown sequences.  One place missed was the routine
      dissolve_free_huge_page.
      
      Naoya Horiguchi noted: "I saw that VM_WARN_ON_ONCE() in
      hugetlb_vmemmap_restore is triggered when memory_failure() is called on a
      free hugetlb page with vmemmap optimization disabled (the warning is not
      triggered if vmemmap optimization is enabled).  I think that we need check
      folio_test_hugetlb() before dissolve_free_huge_page() calls
      hugetlb_vmemmap_restore_folio()."
      
      Perform the check as suggested by Naoya.
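      
      The shape of the added check, simplified (not the literal
      dissolve_free_huge_page() hunk):
      
          if (folio_test_hugetlb(folio)) {
              rc = hugetlb_vmemmap_restore(h, &folio->page);
              if (rc) {
                  /* restore failed: put the page back into the pool
                   * and report the error (simplified error path) */
                  add_hugetlb_folio(h, folio, false);
                  return rc;
              }
          }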
      
      Link: https://lkml.kernel.org/r/20231017032140.GA3680@monkey
      
      
      Fixes: d8f5f7e4 ("hugetlb: set hugetlb page flag before optimizing vmemmap")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Suggested-by: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Tested-by: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      30a89adf
    • hugetlbfs: close race between MADV_DONTNEED and page fault · 2820b0f0
      Rik van Riel authored
      Malloc libraries, like jemalloc and tcmalloc, take decisions on when to
      call madvise independently from the code in the main application.
      
      This sometimes results in the application page faulting on an address,
      right after the malloc library has shot down the backing memory with
      MADV_DONTNEED.
      
      Usually this is harmless, because we always have some 4kB pages sitting
      around to satisfy a page fault.  However, with hugetlbfs, systems often
      allocate only the exact number of huge pages that the application wants.
      
      Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
      any lock taken on the page fault path, which can open up the following
      race condition:
      
             CPU 1                            CPU 2
      
             MADV_DONTNEED
             unmap page
             shoot down TLB entry
                                             page fault
                                             fail to allocate a huge page
                                             killed with SIGBUS
             free page
      
      Fix that race by pulling the locking from __unmap_hugepage_final_range
      into helper functions called from zap_page_range_single.  This ensures
      page faults stay locked out of the MADV_DONTNEED VMA until the huge pages
      have actually been freed.
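      
      The resulting flow in zap_page_range_single(), roughly (mmu notifier
      setup omitted; a sketch of the ordering rather than the exact mm/memory.c
      code):
      
          hugetlb_zap_begin(vma, &range.start, &range.end); /* takes the hugetlb vma lock */
          tlb_gather_mmu(&tlb, vma->vm_mm);
          unmap_single_vma(&tlb, vma, address, end, details, false);
          tlb_finish_mmu(&tlb);            /* the huge pages are actually freed here */
          hugetlb_zap_end(vma, details);   /* only now may racing faults proceed */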
      
      Link: https://lkml.kernel.org/r/20231006040020.3677377-4-riel@surriel.com
      
      
      Fixes: 04ada095 ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2820b0f0
    • hugetlbfs: extend hugetlb_vma_lock to private VMAs · bf491692
      Rik van Riel authored
      Extend the locking scheme used to protect shared hugetlb mappings from
      truncate vs page fault races, in order to protect private hugetlb mappings
      (with resv_map) against MADV_DONTNEED.
      
      Add a read-write semaphore to the resv_map data structure, and use that
      from the hugetlb_vma_(un)lock_* functions, in preparation for closing the
      race between MADV_DONTNEED and page faults.
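      
      Schematically (a sketch of the data-structure change, not the full
      patch):
      
          struct resv_map {
              struct kref refs;
              spinlock_t lock;
              struct list_head regions;
              /* ... existing fields ... */
              struct rw_semaphore rw_sema;    /* new: serializes private mappings */
          };
      
          void hugetlb_vma_lock_read(struct vm_area_struct *vma)
          {
              if (__vma_shareable_lock(vma)) {
                  struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
      
                  down_read(&vma_lock->rw_sema);
              } else if (__vma_private_lock(vma)) {
                  struct resv_map *resv_map = vma_resv_map(vma);
      
                  down_read(&resv_map->rw_sema);
              }
          }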
      
      Link: https://lkml.kernel.org/r/20231006040020.3677377-3-riel@surriel.com
      
      
      Fixes: 04ada095 ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bf491692
    • hugetlbfs: clear resv_map pointer if mmap fails · 92fe9dcb
      Rik van Riel authored
      Patch series "hugetlbfs: close race between MADV_DONTNEED and page fault", v7.
      
      Malloc libraries, like jemalloc and tcmalloc, take decisions on when to
      call madvise independently from the code in the main application.
      
      This sometimes results in the application page faulting on an address,
      right after the malloc library has shot down the backing memory with
      MADV_DONTNEED.
      
      Usually this is harmless, because we always have some 4kB pages sitting
      around to satisfy a page fault.  However, with hugetlbfs, systems often
      allocate only the exact number of huge pages that the application wants.
      
      Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
      any lock taken on the page fault path, which can open up the following
      race condition:
      
             CPU 1                            CPU 2
      
             MADV_DONTNEED
             unmap page
             shoot down TLB entry
                                             page fault
                                             fail to allocate a huge page
                                             killed with SIGBUS
             free page
      
      Fix that race by extending the hugetlb_vma_lock locking scheme to also
      cover private hugetlb mappings (with resv_map), and pulling the locking
      from __unmap_hugepage_final_range into helper functions called from
      zap_page_range_single.  This ensures page faults stay locked out of the
      MADV_DONTNEED VMA until the huge pages have actually been freed.
      
      
      This patch (of 3):
      
      Hugetlbfs leaves a dangling pointer in the VMA if mmap fails.  This has
      not been a problem so far, but other code in this patch series tries to
      follow that pointer.
      
      Link: https://lkml.kernel.org/r/20231006040020.3677377-1-riel@surriel.com
      Link: https://lkml.kernel.org/r/20231006040020.3677377-2-riel@surriel.com
      
      
      Fixes: 04ada095 ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      92fe9dcb
  5. Oct 16, 2023
    • mm/hugetlb: replace page_ref_freeze() with folio_ref_freeze() in hugetlb_folio_init_vmemmap() · a48bf7b4
      Sidhartha Kumar authored
      No functional difference, folio_ref_freeze() is currently a wrapper for
      page_ref_freeze().
      
      Link: https://lkml.kernel.org/r/20230926174433.81241-1-sidhartha.kumar@oracle.com
      
      
      Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a48bf7b4
    • mm/filemap: remove hugetlb special casing in filemap.c · a08c7193
      Sidhartha Kumar authored
      Remove special cased hugetlb handling code within the page cache by
      changing the granularity of ->index to the base page size rather than the
      huge page size.  The motivation of this patch is to reduce complexity
      within the filemap code while also increasing performance by removing
      branches that are evaluated on every page cache lookup.
      
      To support the change in index, new wrappers for hugetlb page cache
      interactions are added.  These wrappers perform the conversion to a linear
      index which is now expected by the page cache for huge pages.
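      
      A sketch of the index convention (illustrative only; the real wrappers
      live in the hugetlb headers): hugetlb callers convert their
      huge-page-sized index to a base-page (linear) index before touching the
      page cache.
      
          /* huge-page-sized index -> base-page-sized index */
          static inline pgoff_t hugetlb_basepage_index_sketch(struct hstate *h,
                                                              pgoff_t idx)
          {
              return idx << huge_page_order(h);
          }
      
          /* e.g. a lookup wrapper would then do something like: */
          struct folio *folio = filemap_lock_folio(mapping,
                                                   idx << huge_page_order(h));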
      
      ========================= PERFORMANCE ======================================
      
      Perf was used to check the performance differences after the patch.
      Overall the performance is similar to mainline, with a small additional
      overhead in __filemap_add_folio() and hugetlb_add_to_page_cache().  This
      is because of the extra overhead in xa_load() and xa_store(), as the
      xarray now uses more entries to store hugetlb folios in the page cache.
      
      Timing
      
      aarch64
          2MB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
                  real    1m49.568s
                  user    0m0.000s
                  sys     1m49.461s
      
              6.5-rc3:
                  [root]# time fallocate -l 700GB test.txt
                  real    1m47.495s
                  user    0m0.000s
                  sys     1m47.370s
          1GB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
                  real    1m47.024s
                  user    0m0.000s
                  sys     1m46.921s
      
              6.5-rc3:
                  [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
                  real    1m44.551s
                  user    0m0.000s
                  sys     1m44.438s
      
      x86
          2MB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
                  real    0m22.383s
                  user    0m0.000s
                  sys     0m22.255s
      
              6.5-rc3:
                  [opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
                  real    0m22.735s
                  user    0m0.038s
                  sys     0m22.567s
      
          1GB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
                  real    0m25.786s
                  user    0m0.001s
                  sys     0m25.589s
      
              6.5-rc3:
                  [root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
                  real    0m33.454s
                  user    0m0.001s
                  sys     0m33.193s
      
      aarch64:
          workload - fallocate a 700GB file backed by huge pages
      
          6.5-rc3 + this patch:
              2MB Page Size:
                  --100.00%--__arm64_sys_fallocate
                                ksys_fallocate
                                vfs_fallocate
                                hugetlbfs_fallocate
                                |
                                |--95.04%--__pi_clear_page
                                |
                                |--3.57%--clear_huge_page
                                |          |
                                |          |--2.63%--rcu_all_qs
                                |          |
                                |           --0.91%--__cond_resched
                                |
                                 --0.67%--__cond_resched
                  0.17%     0.00%             0  fallocate  [kernel.vmlinux]       [k] hugetlb_add_to_page_cache
                  0.14%     0.10%            11  fallocate  [kernel.vmlinux]       [k] __filemap_add_folio
      
          6.5-rc3
              2MB Page Size:
                      --100.00%--__arm64_sys_fallocate
                                ksys_fallocate
                                vfs_fallocate
                                hugetlbfs_fallocate
                                |
                                |--94.91%--__pi_clear_page
                                |
                                |--4.11%--clear_huge_page
                                |          |
                                |          |--3.00%--rcu_all_qs
                                |          |
                                |           --1.10%--__cond_resched
                                |
                                 --0.59%--__cond_resched
                  0.08%     0.01%             1  fallocate  [kernel.kallsyms]  [k] hugetlb_add_to_page_cache
                  0.05%     0.03%             3  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio
      
      x86
          workload - fallocate a 100GB file backed by huge pages
      
          6.5-rc3 + this patch:
              2MB Page Size:
                  hugetlbfs_fallocate
                  |
                  --99.57%--clear_huge_page
                      |
                      --98.47%--clear_page_erms
                          |
                          --0.53%--asm_sysvec_apic_timer_interrupt
      
                  0.04%     0.04%             1  fallocate  [kernel.kallsyms]     [k] xa_load
                  0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] hugetlb_add_to_page_cache
                  0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] __filemap_add_folio
                  0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] xas_store
      
          6.5-rc3
              2MB Page Size:
                      --99.93%--__x64_sys_fallocate
                                vfs_fallocate
                                hugetlbfs_fallocate
                                |
                                 --99.38%--clear_huge_page
                                           |
                                           |--98.40%--clear_page_erms
                                           |
                                            --0.59%--__cond_resched
                  0.03%     0.03%             1  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio
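
       For reference, the operation timed and profiled above is simply
       fallocate(2) on a file in a hugetlbfs mount.  A minimal C sketch of
       that workload follows; the mount point and 100GB size mirror the runs
       shown above, everything else is illustrative:

           /* Sketch only: preallocate a large file on a hugetlbfs mount. */
           #define _GNU_SOURCE
           #define _FILE_OFFSET_BITS 64
           #include <fcntl.h>
           #include <stdio.h>
           #include <unistd.h>

           int main(void)
           {
                   /* 100GB, matching 'fallocate -l 100GB' (GB = 10^9 bytes) */
                   const off_t len = 100ULL * 1000 * 1000 * 1000;
                   int fd = open("/dev/hugepages/test.txt", O_CREAT | O_RDWR, 0644);

                   if (fd < 0) {
                           perror("open");
                           return 1;
                   }
                   /* hugetlbfs satisfies the preallocation from the huge page pool */
                   if (fallocate(fd, 0, 0, len) < 0) {
                           perror("fallocate");
                           return 1;
                   }
                   close(fd);
                   return 0;
           }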
      
      ========================= TESTING ======================================
      
      This patch passes libhugetlbfs tests and LTP hugetlb tests
      
      ********** TEST SUMMARY
      *                      2M
      *                      32-bit 64-bit
      *     Total testcases:   110    113
      *             Skipped:     0      0
      *                PASS:   107    113
      *                FAIL:     0      0
      *    Killed by signal:     3      0
      *   Bad configuration:     0      0
      *       Expected FAIL:     0      0
      *     Unexpected PASS:     0      0
      *    Test not present:     0      0
      * Strange test result:     0      0
      **********
      
          Done executing testcases.
          LTP Version:  20220527-178-g2761a81c4
      
       Page migration was also tested using Mike Kravetz's test program.[8]
      
       [dan.carpenter@linaro.org: fix a NULL vs IS_ERR() bug]
        Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain
      Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com
      
      
       Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
       Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
       Reported-and-tested-by: <syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
      
      
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a08c7193
  6. Oct 04, 2023
  7. Sep 30, 2023
    • mm: hugetlb: add huge page size param to set_huge_pte_at() · 935d4f0c
      Ryan Roberts authored
      Patch series "Fix set_huge_pte_at() panic on arm64", v2.
      
      This series fixes a bug in arm64's implementation of set_huge_pte_at(),
      which can result in an unprivileged user causing a kernel panic.  The
      problem was triggered when running the new uffd poison mm selftest for
      HUGETLB memory.  This test (and the uffd poison feature) was merged for
      v6.5-rc7.
      
      Ideally, I'd like to get this fix in for v6.6 and I've cc'ed stable
      (correctly this time) to get it backported to v6.5, where the issue first
      showed up.
      
      
      Description of Bug
      ==================
      
      arm64's huge pte implementation supports multiple huge page sizes, some of
      which are implemented in the page table with multiple contiguous entries. 
      So set_huge_pte_at() needs to work out how big the logical pte is, so that
      it can also work out how many physical ptes (or pmds) need to be written. 
      It previously did this by grabbing the folio out of the pte and querying
      its size.
      
      However, there are cases when the pte being set is actually a swap entry. 
      But this also used to work fine, because for huge ptes, we only ever saw
      migration entries and hwpoison entries.  And both of these types of swap
      entries have a PFN embedded, so the code would grab that and everything
      still worked out.
      
      But over time, more calls to set_huge_pte_at() have been added that set
      swap entry types that do not embed a PFN.  And this causes the code to go
      bang.  The triggering case is for the uffd poison test, commit
      99aa7721 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which
       causes a PTE_MARKER_POISONED swap entry to be set, courtesy of commit
      8a13897f ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") -
      added in v6.5-rc7.  Although review shows that there are other call sites
      that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger
      on arm64 because arm64 doesn't support UFFD WP.
      
      If CONFIG_DEBUG_VM is enabled, we do at least get a BUG(), but otherwise,
      it will dereference a bad pointer in page_folio():
      
          static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
          {
              VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));
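               /*
                * Illustrative note: migration and hwpoison entries both embed
                * a PFN, so the page_folio(pfn_to_page(...)) lookup below works
                * for them.  A PFN-less entry such as PTE_MARKER_POISONED hits
                * the VM_BUG_ON above when CONFIG_DEBUG_VM is enabled; without
                * it, the lookup below dereferences a bad pointer, as described
                * above.
                */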
      
              return page_folio(pfn_to_page(swp_offset_pfn(entry)));
          }
      
      
      Fix
      ===
      
      The simplest fix would have been to revert the dodgy cleanup commit
      18f39629 ("mm: hugetlb: kill set_huge_swap_pte_at()"), but since
      things have moved on, this would have required an audit of all the new
      set_huge_pte_at() call sites to see if they should be converted to
      set_huge_swap_pte_at().  As per the original intent of the change, it
      would also leave us open to future bugs when people invariably get it
      wrong and call the wrong helper.
      
      So instead, I've added a huge page size parameter to set_huge_pte_at(). 
      This means that the arm64 code has the size in all cases.  It's a bigger
      change, due to needing to touch the arches that implement the function,
      but it is entirely mechanical, so in my view, low risk.
      
       I've compile-tested all touched arches: arm64, parisc, powerpc, riscv,
      s390, sparc (and additionally x86_64).  I've additionally booted and run
      mm selftests against arm64, where I observe the uffd poison test is fixed,
      and there are no other regressions.
      
      
      This patch (of 2):
      
      In order to fix a bug, arm64 needs to be told the size of the huge page
      for which the pte is being set in set_huge_pte_at().  Provide for this by
      adding an `unsigned long sz` parameter to the function.  This follows the
      same pattern as huge_pte_clear().
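
       Roughly, the change to the interface looks like the sketch below
       (prototypes only, not the full patch; the final call line is
       illustrative, with h standing for the relevant hstate):

           /* Before: the arch had to infer the huge page size from the pte. */
           void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
                                pte_t *ptep, pte_t pte);

           /* After: callers pass the size explicitly, as huge_pte_clear() already does. */
           void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
                                pte_t *ptep, pte_t pte, unsigned long sz);

           /* A core mm call site then looks roughly like: */
           set_huge_pte_at(mm, address, ptep, entry, huge_page_size(h));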
      
      This commit makes the required interface modifications to the core mm as
      well as all arches that implement this function (arm64, parisc, powerpc,
      riscv, s390, sparc).  The actual arm64 bug will be fixed in a separate
      commit.
      
      No behavioral changes intended.
      
      Link: https://lkml.kernel.org/r/20230922115804.2043771-1-ryan.roberts@arm.com
      Link: https://lkml.kernel.org/r/20230922115804.2043771-2-ryan.roberts@arm.com
      
      
      Fixes: 8a13897f ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs")
       Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>	[powerpc 8xx]
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>	[vmalloc change]
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: <stable@vger.kernel.org>	[6.5+]
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      935d4f0c
  8. Aug 24, 2023
  9. Aug 21, 2023
  10. Aug 18, 2023