  4. Oct 18, 2023
    • hugetlb: memcg: account hugetlb-backed memory in memory controller · 8cba9576
      Nhat Pham authored
      Currently, hugetlb memory usage is not accounted for in the memory
      controller, which could lead to memory overprotection for cgroups with
      hugetlb-backed memory.  This has been observed in our production system.
      
      For instance, here is one of our use cases: suppose there are two 32G
      containers.  The machine is booted with hugetlb_cma=6G, and each container
      may or may not use up to 3 gigantic pages, depending on the workload within
      it.  The rest is anon, cache, slab, etc.  We can set the hugetlb cgroup
      limit of each cgroup to 3G to enforce hugetlb fairness.  But it is very
      difficult to configure memory.max to keep overall consumption, including
      anon, cache, slab, etc., fair.
      
      What we have had to resort to is constantly polling hugetlb usage and
      readjusting memory.max.  A similar procedure is applied to other memory
      limits (memory.low, for example).  However, this is rather cumbersome and
      buggy.  Furthermore, when there is a delay in memory limit correction
      (for example, when hugetlb usage changes between consecutive runs of the
      userspace agent), the system could be in an over- or underprotected state.
      
      This patch rectifies this issue by charging the memcg when the hugetlb
      folio is utilized, and uncharging when the folio is freed (analogous to
      the hugetlb controller).  Note that we do not charge when the folio is
      allocated to the hugetlb pool, because at this point it is not owned by
      any memcg.
      
      Some caveats to consider:
        * This feature is only available on cgroup v2.
        * There is no hugetlb pool management involved in the memory
          controller. As stated above, hugetlb folios are only charged towards
          the memory controller when they are used. Host overcommit management
          has to take this into account when configuring hard limits.
        * Failure to charge towards the memcg results in SIGBUS. This could
          happen even if the hugetlb pool still has pages (but the cgroup
          limit is hit and reclaim attempt fails).
        * When this feature is enabled, hugetlb pages contribute to memory
          reclaim protection. memory.low and memory.min tuning must take
          hugetlb memory into account.
        * Hugetlb pages utilized while this option is not selected will not
          be tracked by the memory controller (even if cgroup v2 is remounted
          later on).
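
      As a rough illustration of the charge-on-use behavior, here is a minimal
      user-space sketch (not part of the patch; it assumes 2MB hugepages have
      already been reserved by the administrator and that the task runs in a
      cgroup v2 memcg with this accounting enabled):

          #include <stdio.h>
          #include <string.h>
          #include <sys/mman.h>

          int main(void)
          {
                  size_t len = 2UL << 20; /* one 2MB huge page */
                  void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

                  if (p == MAP_FAILED) {
                          perror("mmap(MAP_HUGETLB)"); /* pool empty or not configured */
                          return 1;
                  }
                  memset(p, 0, len);      /* fault in: this is when the memcg is charged */
                  getchar();              /* pause here to inspect the cgroup's memory.current */
                  munmap(p, len);         /* freeing the folio uncharges the memcg */
                  return 0;
          }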
      
      Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
      
      
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Frank van der Linden <fvdl@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8cba9576
    • mm/migrate: remove unused mm argument from do_move_pages_to_node · ec47e250
      Gregory Price authored
      This function does not actively use the mm_struct, so the argument can be removed.
      
      Link: https://lkml.kernel.org/r/20231003144857.752952-2-gregory.price@memverge.com
      
      
      Signed-off-by: Gregory Price <gregory.price@memverge.com>
      Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Gregory Price <gregory.price@memverge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ec47e250
  5. Oct 16, 2023
    • mm/filemap: remove hugetlb special casing in filemap.c · a08c7193
      Sidhartha Kumar authored
      Remove special cased hugetlb handling code within the page cache by
      changing the granularity of ->index to the base page size rather than the
      huge page size.  The motivation of this patch is to reduce complexity
      within the filemap code while also increasing performance by removing
      branches that are evaluated on every page cache lookup.
      
      To support the change in index, new wrappers for hugetlb page cache
      interactions are added.  These wrappers perform the conversion to a linear
      index which is now expected by the page cache for huge pages.
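
      For example, a quick arithmetic sketch (not from the patch) of what the
      index change means for a 2MB hugetlb folio with 4KB base pages: the folio
      used to occupy a single huge-page-sized index, and now spans 512
      consecutive base-page indices, which is why the xarray needs more entries:

          #include <stdio.h>

          int main(void)
          {
                  const unsigned long PAGE_SHIFT = 12;    /* 4KB base pages */
                  const unsigned long HPAGE_SHIFT = 21;   /* 2MB huge pages */
                  unsigned long off = 6UL << 20;          /* folio starting at byte offset 6MB */

                  unsigned long old_index = off >> HPAGE_SHIFT;   /* huge-page-sized index: 3 */
                  unsigned long new_index = off >> PAGE_SHIFT;    /* base-page (linear) index: 1536 */
                  unsigned long entries = 1UL << (HPAGE_SHIFT - PAGE_SHIFT);  /* 512 */

                  printf("old index %lu -> new indices %lu..%lu (%lu xarray entries)\n",
                         old_index, new_index, new_index + entries - 1, entries);
                  return 0;
          }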
      
      ========================= PERFORMANCE ======================================
      
      Perf was used to check the performance differences after the patch.
      Overall the performance is similar to mainline, with a small additional
      overhead in __filemap_add_folio() and hugetlb_add_to_page_cache().  This
      comes from the larger overhead in xa_load() and xa_store(), as the xarray
      now uses more entries to store hugetlb folios in the page cache.
      
      Timing
      
      aarch64
          2MB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
                  real    1m49.568s
                  user    0m0.000s
                  sys     1m49.461s
      
              6.5-rc3:
                  [root]# time fallocate -l 700GB test.txt
                  real    1m47.495s
                  user    0m0.000s
                  sys     1m47.370s
          1GB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
                  real    1m47.024s
                  user    0m0.000s
                  sys     1m46.921s
      
              6.5-rc3:
                  [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
                  real    1m44.551s
                  user    0m0.000s
                  sys     1m44.438s
      
      x86
          2MB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
                  real    0m22.383s
                  user    0m0.000s
                  sys     0m22.255s
      
              6.5-rc3:
                  [opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
                  real    0m22.735s
                  user    0m0.038s
                  sys     0m22.567s
      
          1GB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
                  real    0m25.786s
                  user    0m0.001s
                  sys     0m25.589s
      
              6.5-rc3:
                  [root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
                  real    0m33.454s
                  user    0m0.001s
                  sys     0m33.193s
      
      aarch64:
          workload - fallocate a 700GB file backed by huge pages
      
          6.5-rc3 + this patch:
              2MB Page Size:
                  --100.00%--__arm64_sys_fallocate
                                ksys_fallocate
                                vfs_fallocate
                                hugetlbfs_fallocate
                                |
                                |--95.04%--__pi_clear_page
                                |
                                |--3.57%--clear_huge_page
                                |          |
                                |          |--2.63%--rcu_all_qs
                                |          |
                                |           --0.91%--__cond_resched
                                |
                                 --0.67%--__cond_resched
                  0.17%     0.00%             0  fallocate  [kernel.vmlinux]       [k] hugetlb_add_to_page_cache
                  0.14%     0.10%            11  fallocate  [kernel.vmlinux]       [k] __filemap_add_folio
      
          6.5-rc3
              2MB Page Size:
                      --100.00%--__arm64_sys_fallocate
                                ksys_fallocate
                                vfs_fallocate
                                hugetlbfs_fallocate
                                |
                                |--94.91%--__pi_clear_page
                                |
                                |--4.11%--clear_huge_page
                                |          |
                                |          |--3.00%--rcu_all_qs
                                |          |
                                |           --1.10%--__cond_resched
                                |
                                 --0.59%--__cond_resched
                  0.08%     0.01%             1  fallocate  [kernel.kallsyms]  [k] hugetlb_add_to_page_cache
                  0.05%     0.03%             3  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio
      
      x86
          workload - fallocate a 100GB file backed by huge pages
      
          6.5-rc3 + this patch:
              2MB Page Size:
                  hugetlbfs_fallocate
                  |
                  --99.57%--clear_huge_page
                      |
                      --98.47%--clear_page_erms
                          |
                          --0.53%--asm_sysvec_apic_timer_interrupt
      
                  0.04%     0.04%             1  fallocate  [kernel.kallsyms]     [k] xa_load
                  0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] hugetlb_add_to_page_cache
                  0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] __filemap_add_folio
                  0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] xas_store
      
          6.5-rc3
              2MB Page Size:
                      --99.93%--__x64_sys_fallocate
                                vfs_fallocate
                                hugetlbfs_fallocate
                                |
                                 --99.38%--clear_huge_page
                                           |
                                           |--98.40%--clear_page_erms
                                           |
                                            --0.59%--__cond_resched
                  0.03%     0.03%             1  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio
      
      ========================= TESTING ======================================
      
      This patch passes the libhugetlbfs tests and the LTP hugetlb tests.
      
      ********** TEST SUMMARY
      *                      2M
      *                      32-bit 64-bit
      *     Total testcases:   110    113
      *             Skipped:     0      0
      *                PASS:   107    113
      *                FAIL:     0      0
      *    Killed by signal:     3      0
      *   Bad configuration:     0      0
      *       Expected FAIL:     0      0
      *     Unexpected PASS:     0      0
      *    Test not present:     0      0
      * Strange test result:     0      0
      **********
      
          Done executing testcases.
          LTP Version:  20220527-178-g2761a81c4
      
      Page migration was also tested using Mike Kravetz's test program.[8]
      
      [dan.carpenter@linaro.org: fix a NULL vs IS_ERR() bug]
        Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain
      Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com
      
      
      Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
      Reported-and-tested-by: <syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
      
      
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a08c7193
  6. Oct 06, 2023
    • mm/migrate: fix do_pages_move for compat pointers · 229e2253
      Gregory Price authored
      do_pages_move does not handle compat pointers for the page list
      correctly.  Add an in_compat_syscall() check and an appropriate
      get_user() fetch when iterating the page list.
      
      It makes the syscall in compat mode (32-bit userspace, 64-bit kernel)
      work the same way as the native 32-bit syscall again, restoring the
      behavior before my broken commit 5b1b561b ("mm: simplify
      compat_sys_move_pages").
      
      More specifically, my patch moved the parsing of the 'pages' array from
      the main entry point into do_pages_stat(), which left the syscall
      working correctly for the 'stat' operation (nodes = NULL), while the
      'move' operation (nodes != NULL) is now missing the conversion and
      interprets 'pages' as an array of 64-bit pointers instead of the
      intended 32-bit userspace pointers.
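
      A small user-space illustration (hypothetical, not the kernel fix itself)
      of why the missing conversion matters: on a little-endian 64-bit kernel,
      reading a compat array of 32-bit pointers as 64-bit values fuses
      neighbouring entries:

          #include <stdint.h>
          #include <stdio.h>
          #include <string.h>

          int main(void)
          {
                  uint32_t compat_pages[4] = { 0x1000, 0x2000, 0x3000, 0x4000 };
                  uint64_t fused;

                  /* What a naive 64-bit read of entry 0 sees: entries 0 and 1 fused. */
                  memcpy(&fused, compat_pages, sizeof(fused));
                  printf("misread entry 0: 0x%llx\n", (unsigned long long)fused);

                  /* The fix: in compat mode, fetch 32-bit entries and widen them. */
                  for (int i = 0; i < 4; i++)
                          printf("entry %d: 0x%x\n", i, compat_pages[i]);
                  return 0;
          }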
      
      It is possible that nobody noticed this bug because the few
      applications that actually call move_pages are unlikely to run in
      compat mode because of their large memory requirements, but this
      clearly fixes a user-visible regression and should have been caught by
      ltp.
      
      Link: https://lkml.kernel.org/r/20231003144857.752952-1-gregory.price@memverge.com
      
      
      Fixes: 5b1b561b ("mm: simplify compat_sys_move_pages")
      Signed-off-by: Gregory Price <gregory.price@memverge.com>
      Reported-by: Arnd Bergmann <arnd@arndb.de>
      Co-developed-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      229e2253
  8. Sep 30, 2023
    • mm: hugetlb: add huge page size param to set_huge_pte_at() · 935d4f0c
      Ryan Roberts authored
      Patch series "Fix set_huge_pte_at() panic on arm64", v2.
      
      This series fixes a bug in arm64's implementation of set_huge_pte_at(),
      which can result in an unprivileged user causing a kernel panic.  The
      problem was triggered when running the new uffd poison mm selftest for
      HUGETLB memory.  This test (and the uffd poison feature) was merged for
      v6.5-rc7.
      
      Ideally, I'd like to get this fix in for v6.6 and I've cc'ed stable
      (correctly this time) to get it backported to v6.5, where the issue first
      showed up.
      
      
      Description of Bug
      ==================
      
      arm64's huge pte implementation supports multiple huge page sizes, some of
      which are implemented in the page table with multiple contiguous entries. 
      So set_huge_pte_at() needs to work out how big the logical pte is, so that
      it can also work out how many physical ptes (or pmds) need to be written. 
      It previously did this by grabbing the folio out of the pte and querying
      its size.
      
      However, there are cases when the pte being set is actually a swap entry. 
      But this also used to work fine, because for huge ptes, we only ever saw
      migration entries and hwpoison entries.  And both of these types of swap
      entries have a PFN embedded, so the code would grab that and everything
      still worked out.
      
      But over time, more calls to set_huge_pte_at() have been added that set
      swap entry types that do not embed a PFN.  And this causes the code to go
      bang.  The triggering case is for the uffd poison test, commit
      99aa7721 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which
      causes a PTE_MARKER_POISONED swap entry to be set, courtesy of commit
      8a13897f ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") -
      added in v6.5-rc7.  Although review shows that there are other call sites
      that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger
      on arm64 because arm64 doesn't support UFFD WP.
      
      If CONFIG_DEBUG_VM is enabled, we do at least get a BUG(), but otherwise,
      it will dereference a bad pointer in page_folio():
      
          static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
          {
              VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));
      
              return page_folio(pfn_to_page(swp_offset_pfn(entry)));
          }
      
      
      Fix
      ===
      
      The simplest fix would have been to revert the dodgy cleanup commit
      18f39629 ("mm: hugetlb: kill set_huge_swap_pte_at()"), but since
      things have moved on, this would have required an audit of all the new
      set_huge_pte_at() call sites to see if they should be converted to
      set_huge_swap_pte_at().  As per the original intent of the change, it
      would also leave us open to future bugs when people invariably get it
      wrong and call the wrong helper.
      
      So instead, I've added a huge page size parameter to set_huge_pte_at(). 
      This means that the arm64 code has the size in all cases.  It's a bigger
      change, due to needing to touch the arches that implement the function,
      but it is entirely mechanical, so in my view, low risk.
      
      I've compile-tested all touched arches; arm64, parisc, powerpc, riscv,
      s390, sparc (and additionally x86_64).  I've additionally booted and run
      mm selftests against arm64, where I observe the uffd poison test is fixed,
      and there are no other regressions.
      
      
      This patch (of 2):
      
      In order to fix a bug, arm64 needs to be told the size of the huge page
      for which the pte is being set in set_huge_pte_at().  Provide for this by
      adding an `unsigned long sz` parameter to the function.  This follows the
      same pattern as huge_pte_clear().
      
      This commit makes the required interface modifications to the core mm as
      well as all arches that implement this function (arm64, parisc, powerpc,
      riscv, s390, sparc).  The actual arm64 bug will be fixed in a separate
      commit.
      
      No behavioral changes intended.
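
      Sketched below is the resulting interface as described above (a sketch
      only, with stand-in types so the snippet is self-contained; the kernel of
      course uses its real mm types):

          struct mm_struct;               /* stand-in forward declaration */
          typedef unsigned long pte_t;    /* stand-in for the arch pte type */

          /*
           * Previously the arch was passed only (mm, addr, ptep, pte) and had to
           * infer the huge page size from the pte; now the size is passed
           * explicitly, mirroring huge_pte_clear():
           */
          void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
                               pte_t *ptep, pte_t pte, unsigned long sz);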
      
      Link: https://lkml.kernel.org/r/20230922115804.2043771-1-ryan.roberts@arm.com
      Link: https://lkml.kernel.org/r/20230922115804.2043771-2-ryan.roberts@arm.com
      
      
      Fixes: 8a13897f ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs")
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>	[powerpc 8xx]
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>	[vmalloc change]
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: <stable@vger.kernel.org>	[6.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      935d4f0c
  9. Aug 18, 2023
    • migrate: use folio_set_bh() instead of set_bh_page() · d5db4f9d
      Matthew Wilcox (Oracle) authored
      This function was converted before folio_set_bh() existed.  Catch up to
      the new API.
      
      Link: https://lkml.kernel.org/r/20230713035512.4139457-5-willy@infradead.org
      
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Tom Rix <trix@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d5db4f9d
    • mm: merge folio_has_private()/filemap_release_folio() call pairs · 0201ebf2
      David Howells authored
      Patch series "mm, netfs, fscache: Stop read optimisation when folio
      removed from pagecache", v7.
      
      This fixes an optimisation in fscache whereby we don't read from the cache
      for a particular file until we know that there's data there that we don't
      have in the pagecache.  The problem is that I'm no longer using PG_fscache
      (aka PG_private_2) to indicate that the page is cached and so I don't get
      a notification when a cached page is dropped from the pagecache.
      
      The first patch merges some folio_has_private() and
      filemap_release_folio() pairs and introduces a helper,
      folio_needs_release(), to indicate if a release is required.
      
      The second patch is the actual fix.  Following Willy's suggestions[1], it
      adds an AS_RELEASE_ALWAYS flag to an address_space that will make
      filemap_release_folio() always call ->release_folio(), even if
      PG_private/PG_private_2 aren't set.  folio_needs_release() is altered to
      add a check for this.
      
      
      This patch (of 2):
      
      Make filemap_release_folio() check folio_has_private().  Then, in most
      cases, where a call to folio_has_private() is immediately followed by a
      call to filemap_release_folio(), we can get rid of the test in the pair.
      
      There are a couple of sites in mm/vmscan.c where this can't so easily be
      done.  In shrink_folio_list(), there are actually three cases (something
      different is done for incompletely invalidated buffers), but
      filemap_release_folio() elides two of them.
      
      In shrink_active_list(), we don't have the folio lock yet, so the
      check allows us to avoid locking the page unnecessarily.
      
      A wrapper function to check if a folio needs release is provided for those
      places that still need to do it in the mm/ directory.  This will acquire
      additional parts to the condition in a future patch.
      
      After this, the only remaining caller of folio_has_private() outside of
      mm/ is a check in fuse.
      
      Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com
      Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com
      
      
      Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Steve French <sfrench@samba.org>
      Cc: Shyam Prasad N <nspmangalore@gmail.com>
      Cc: Rohith Surabattula <rohiths.msft@gmail.com>
      Cc: Dave Wysochanski <dwysocha@redhat.com>
      Cc: Dominique Martinet <asmadeus@codewreck.org>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Xiubo Li <xiubli@redhat.com>
      Cc: Jingbo Xu <jefflexu@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0201ebf2
  11. Jul 11, 2023
    • mm: Make pte_mkwrite() take a VMA · 161e393c
      Rick Edgecombe authored
      
      The x86 Shadow stack feature includes a new type of memory called shadow
      stack. This shadow stack memory has some unusual properties, which requires
      some core mm changes to function properly.
      
      One of these unusual properties is that shadow stack memory is writable,
      but only in limited ways. These limits are applied via a specific PTE
      bit combination. Nevertheless, the memory is writable, and core mm code
      will need to apply the writable permissions in the typical paths that
      call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so
      that the x86 implementation of it can know whether to create regular
      writable or shadow stack mappings.
      
      But there are a couple of challenges to this. Modifying the signatures of
      each arch pte_mkwrite() implementation would be error prone because some
      are generated with macros and would need to be re-implemented. Also, some
      pte_mkwrite() callers operate on kernel memory without a VMA.
      
      So this can be done in a three step process. First pte_mkwrite() can be
      renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite()
      added that just calls pte_mkwrite_novma(). Next callers without a VMA can
      be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers
      can be changed to take/pass a VMA.
      
      Previous work renamed pte_mkwrite() to pte_mkwrite_novma() and converted
      callers that don't have a VMA to use pte_mkwrite_novma().  So now change
      pte_mkwrite() to take a VMA and change the remaining callers to pass a
      VMA.  Apply the same changes for pmd_mkwrite().
      
      No functional change.
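
      A rough sketch of the resulting generic fallback (stand-in types so it
      compiles in isolation, not the literal kernel code): architectures that
      don't care about the VMA keep their existing behaviour by deferring to
      pte_mkwrite_novma().

          struct vm_area_struct;          /* stand-in forward declaration */
          typedef unsigned long pte_t;    /* stand-in for the arch pte type */

          pte_t pte_mkwrite_novma(pte_t pte);     /* provided by each arch */

          /* Generic pte_mkwrite(): takes a VMA but ignores it unless the arch
           * (e.g. x86 shadow stack) overrides this to inspect the VMA. */
          static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
          {
                  (void)vma;
                  return pte_mkwrite_novma(pte);
          }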
      
      Suggested-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/all/20230613001108.3040476-4-rick.p.edgecombe%40intel.com
      161e393c
  13. Jun 19, 2023
    • mm: ptep_get() conversion · c33c7948
      Ryan Roberts authored
      Convert all instances of direct pte_t* dereferencing to instead use
      ptep_get() helper.  This means that by default, the accesses change from a
      C dereference to a READ_ONCE().  This is technically the correct thing to
      do since where pgtables are modified by HW (for access/dirty) they are
      volatile and therefore we should always ensure READ_ONCE() semantics.
      
      But more importantly, by always using the helper, it can be overridden by
      the architecture to fully encapsulate the contents of the pte.  Arch code
      is deliberately not converted, as the arch code knows best.  It is
      intended that arch code (arm64) will override the default with its own
      implementation that can (e.g.) hide certain bits from the core code, or
      determine young/dirty status by mixing in state from another source.
      
      Conversion was done using Coccinelle:
      
      ----
      
      // $ make coccicheck \
      //          COCCI=ptepget.cocci \
      //          SPFLAGS="--include-headers" \
      //          MODE=patch
      
      virtual patch
      
      @ depends on patch @
      pte_t *v;
      @@
      
      - *v
      + ptep_get(v)
      
      ----
      
      Then reviewed and hand-edited to avoid multiple unnecessary calls to
      ptep_get(), instead opting to store the result of a single call in a
      variable, where it is correct to do so.  This aims to negate any cost of
      READ_ONCE() and will benefit arch-overrides that may be more complex.
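
      A minimal user-space analogue of the idiom (illustrative only, not the
      kernel implementation): take one READ_ONCE()-style snapshot through the
      helper and reuse it, instead of dereferencing the pte pointer repeatedly.

          #include <stdint.h>
          #include <stdio.h>

          typedef uint64_t pte_t;                 /* stand-in pte type */
          #define PTE_DIRTY (1ULL << 55)          /* stand-in software dirty bit */

          /* Analogue of the default ptep_get(): a single volatile (READ_ONCE-like) load. */
          static inline pte_t ptep_get(const pte_t *ptep)
          {
                  return *(const volatile pte_t *)ptep;
          }

          int main(void)
          {
                  pte_t entry = 0x200000ULL | PTE_DIRTY;  /* could be updated by HW */
                  pte_t pte = ptep_get(&entry);           /* one snapshot, reused below */

                  printf("address bits: 0x%llx, dirty: %d\n",
                         (unsigned long long)(pte & ~PTE_DIRTY), !!(pte & PTE_DIRTY));
                  return 0;
          }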
      
      Included is a fix for an issue in an earlier version of this patch that
      was pointed out by kernel test robot.  The issue arose because config
      MMU=n elides definition of the ptep helper functions, including
      ptep_get().  HUGETLB_PAGE=n configs still define a simple
      huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
      So when both configs are disabled, this caused a build error because
      ptep_get() is not defined.  Fix by continuing to do a direct dereference
      when MMU=n.  This is safe because for this config the arch code cannot be
      trying to virtualize the ptes because none of the ptep helpers are
      defined.
      
      Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
      
      
      Reported-by: kernel test robot <lkp@intel.com>
      Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
      
      
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Dave Airlie <airlied@gmail.com>
      Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c33c7948
    • mm/various: give up if pte_offset_map[_lock]() fails · 04dee9e8
      Hugh Dickins authored
      Following the examples of nearby code, various functions can just give up
      if pte_offset_map() or pte_offset_map_lock() fails.  And there's no need
      for a preliminary pmd_trans_unstable() or other such check, since such
      cases are now safely handled inside.
      
      Link: https://lkml.kernel.org/r/7b9bd85d-1652-cbf2-159d-f503b45e5b@google.com
      
      
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      04dee9e8
    • mm/migrate: remove cruft from migration_entry_wait()s · 0cb8fd4d
      Hugh Dickins authored
      migration_entry_wait_on_locked() does not need to take a mapped pte
      pointer, its callers can do the unmap first.  Annotate it with
      __releases(ptl) to reduce sparse warnings.
      
      Fold __migration_entry_wait_huge() into migration_entry_wait_huge().  Fold
      __migration_entry_wait() into migration_entry_wait(), preferring the
      tighter pte_offset_map_lock() to pte_offset_map() and pte_lockptr().
      
      Link: https://lkml.kernel.org/r/b0e2a532-cdf2-561b-e999-f3b13b8d6d3@google.com
      
      
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0cb8fd4d
  14. Jun 09, 2023
    • mm: convert migrate_pages() to work on folios · 4e096ae1
      Matthew Wilcox (Oracle) authored
      Almost all of the callers & implementors of migrate_pages() were already
      converted to use folios.  compaction_alloc() & compaction_free() are
      trivial to convert as part of this patch and not worth splitting out.
      
      Link: https://lkml.kernel.org/r/20230513001101.276972-1-willy@infradead.org
      
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4e096ae1
    • migrate_pages_batch: simplify retrying and failure counting of large folios · 124abced
      Huang Ying authored
      After recent changes to the retrying and failure counting in
      migrate_pages_batch(), it was found that it's unnecessary to count
      retrying and failure for normal, large, and THP folios separately,
      because the retry and failure counts of large folios are not used
      directly.  So, in this patch, the retrying and failure counting of large
      folios is simplified by counting retrying and failure of normal and
      large folios together.  This reduces the line count.
      
      Previously, in migrate_pages_batch() we needed to track whether the
      source folio was large/THP before splitting, so is_large was used to
      cache the folio_test_large() result.  That variable is no longer needed,
      because we no longer count retrying and failure of large folios
      separately (only that of THP folios).  So, in this patch, is_large is
      removed to simplify the code.
      
      This is just code cleanup, no functionality changes are expected.
      
      Link: https://lkml.kernel.org/r/20230510031829.11513-1-ying.huang@intel.com
      
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      124abced
    • migrate_pages: avoid blocking for IO in MIGRATE_SYNC_LIGHT · 4bb6dc79
      Douglas Anderson authored
      The MIGRATE_SYNC_LIGHT mode is intended to block for things that will
      finish quickly but not for things that will take a long time.  Exactly
      how long is too long is not well defined, but waits of tens of
      milliseconds are likely non-ideal.
      
      When putting a Chromebook under memory pressure (opening over 90 tabs on a
      4GB machine) it was fairly easy to see delays waiting for some locks in
      the kcompactd code path of > 100 ms.  While the laptop wasn't amazingly
      usable in this state, it was still limping along and this state isn't
      something artificial.  Sometimes we simply end up with a lot of memory
      pressure.
      
      Putting the same Chromebook under memory pressure while it was running
      Android apps (though not stressing them) showed a much worse result (NOTE:
      this was on an older kernel but the codepaths here are similar).  Android
      apps on ChromeOS currently run from a 128K-block, zlib-compressed,
      loopback-mounted squashfs disk.  If we get a page fault from something
      backed by the squashfs filesystem we could end up holding a folio lock
      while reading enough from disk to decompress 128K (and then decompressing
      it using the somewhat slow zlib algorithms).  That reading goes through
      the ext4 subsystem (because it's a loopback mount) before eventually
      ending up in the block subsystem.  This extra jaunt adds extra overhead. 
      Without much work I could see cases where we ended up blocked on a folio
      lock for over a second.  With more extreme memory pressure I could see up
      to 25 seconds.
      
      We considered adding a timeout in the case of MIGRATE_SYNC_LIGHT for the
      two locks that were seen to be slow [1] and that generated much
      discussion.  After discussion, it was decided that we should avoid waiting
      for the two locks during MIGRATE_SYNC_LIGHT if they were being held for
      IO.  We'll continue with the unbounded wait for the more full SYNC modes.
      
      With this change, I couldn't see any slow waits on these locks with my
      previous testcases.
      
      NOTE: The reason I started digging into this originally isn't because some
      benchmark had gone awry, but because we've received in-the-field crash
      reports where we have a hung task waiting on the page lock (which is the
      equivalent code path on old kernels).  While the root cause of those
      crashes is likely unrelated and won't be fixed by this patch, analyzing
      those crash reports did point out these very long waits seemed like
      something good to fix.  With this patch we should no longer hang waiting
      on these locks, but presumably the system will still be in a bad shape and
      hang somewhere else.
      
      [1] https://lore.kernel.org/r/20230421151135.v2.1.I2b71e11264c5c214bc59744b9e13e4c353bc5714@changeid
      
      Link: https://lkml.kernel.org/r/20230428135414.v3.1.Ia86ccac02a303154a0b8bc60567e7a95d34c96d3@changeid
      
      
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4bb6dc79
  15. Apr 21, 2023
    • mm: don't check VMA write permissions if the PTE/PMD indicates write permissions · f3ebdf04
      David Hildenbrand authored
      Staring at the comment "Recheck VMA as permissions can change since
      migration started" in remove_migration_pte() can result in confusion,
      because if the source PTE/PMD indicates write permissions, then there
      should be no need to check VMA write permissions when restoring migration
      entries or PTE-mapping a PMD.
      
      Commit d3cb8bf6 ("mm: migrate: Close race between migration completion
      and mprotect") introduced the maybe_mkwrite() handling in
      remove_migration_pte() in 2014, stating that a race between mprotect() and
      migration finishing would be possible, and that we could end up with a
      writable PTE that should be readable.
      
      However, mprotect() code first updates vma->vm_flags / vma->vm_page_prot
      and then walks the page tables to (a) set all present writable PTEs to
      read-only and (b) convert all writable migration entries to readable
      migration entries.  While walking the page tables and modifying the
      entries, migration code has to grab the PT locks to synchronize against
      concurrent page table modifications.
      
      Assuming migration would find a writable migration entry (while holding
      the PT lock) and replace it with a writable present PTE, surely mprotect()
      code didn't stumble over the writable migration entry yet (converting it
      into a readable migration entry) and would instead wait for the PT lock to
      convert the now present writable PTE into a read-only PTE.  As mprotect()
      didn't finish yet, the behavior is just like migration didn't happen: a
      writable PTE will be converted to a read-only PTE.
      
      So it's fine to rely on the writability information in the source PTE/PMD
      and not recheck against the VMA as long as we're holding the PT lock to
      synchronize with anyone who concurrently wants to downgrade write
      permissions (like mprotect()) by first adjusting vma->vm_flags /
      vma->vm_page_prot to then walk over the page tables to adjust the page
      table entries.
      
      Running test cases that should reveal such races -- mprotect(PROT_READ)
      racing with page migration or THP splitting -- for multiple hours did not
      reveal an issue with this cleanup.
      
      Link: https://lkml.kernel.org/r/20230418142113.439494-1-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f3ebdf04
    • migrate_pages_batch: fix statistics for longterm pin retry · 851ae642
      Huang Ying authored
      In commit fd4a7ac3 ("mm: migrate: try again if THP split is failed due
      to page refcnt"), if the THP splitting fails due to the page reference
      count, we retry to improve the migration success rate.  But the failed
      splitting is counted as both a migration failure and a migration retry,
      which causes duplicated failure counting.  So, in this patch, this is
      fixed by undoing the failure counting if we decide to retry.  The patch
      is tested via failure injection.
      
      Link: https://lkml.kernel.org/r/20230416235929.1040194-1-ying.huang@intel.com
      
      
      Fixes: fd4a7ac3 ("mm: migrate: try again if THP split is failed due to page refcnt")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      851ae642
  19. Mar 08, 2023
    • migrate_pages: try migrate in batch asynchronously firstly · 2ef7dbb2
      Huang Ying authored
      When we have locked more than one folio, we cannot wait for a lock or
      bit (e.g., page lock, buffer head lock, writeback bit) synchronously;
      otherwise a deadlock may be triggered.  This makes it hard to batch
      synchronous migration directly.
      
      This patch re-enables batching for synchronous migration by first trying
      to migrate in batch asynchronously.  Any folios that fail to be migrated
      asynchronously are then migrated synchronously, one by one.
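
      A user-space analogue of this control flow (a sketch only, with the
      details of folio migration abstracted away into stand-in helpers):
      attempt the whole batch non-blockingly first, then handle only the
      failures synchronously, one at a time.

          #include <stdbool.h>
          #include <stdio.h>

          /* Stand-ins for asynchronous (non-blocking) and synchronous migration. */
          static bool migrate_async(int folio) { return folio % 3 != 0; }
          static bool migrate_sync(int folio)  { (void)folio; return true; }

          int main(void)
          {
                  int folios[] = { 1, 2, 3, 4, 5, 6 };
                  int failed[6], nr_failed = 0;

                  /* Pass 1: batched and asynchronous, so we never block while
                   * holding the locks of several folios at once. */
                  for (int i = 0; i < 6; i++)
                          if (!migrate_async(folios[i]))
                                  failed[nr_failed++] = folios[i];

                  /* Pass 2: the failures, synchronously, strictly one by one. */
                  for (int i = 0; i < nr_failed; i++)
                          migrate_sync(failed[i]);

                  printf("%d folios needed the synchronous fallback\n", nr_failed);
                  return 0;
          }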
      
      Test shows that this can restore the TLB flushing batching performance for
      synchronous migration effectively.
      
      Link: https://lkml.kernel.org/r/20230303030155.160983-4-ying.huang@intel.com
      
      
      Fixes: 5dfab109 ("migrate_pages: batch _unmap and _move")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Tested-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Stefan Roesch <shr@devkernel.io>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Xin Hao <xhao@linux.alibaba.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2ef7dbb2
    • migrate_pages: move split folios processing out of migrate_pages_batch() · a21d2133
      Huang Ying authored
      To simplify the code logic and reduce the line count.
      
      Link: https://lkml.kernel.org/r/20230303030155.160983-3-ying.huang@intel.com
      
      
      Fixes: 5dfab109 ("migrate_pages: batch _unmap and _move")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Stefan Roesch <shr@devkernel.io>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Xin Hao <xhao@linux.alibaba.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a21d2133
    • migrate_pages: fix deadlock in batched migration · fb3592c4
      Huang Ying authored
      Patch series "migrate_pages: fix deadlock in batched synchronous
      migration", v2.
      
      Two deadlock bugs were reported for the migrate_pages() batching series.
      Thanks Hugh and Pengfei.  Analysis shows that if we have locked some
      other folios besides the one we are migrating, it's not safe in general
      to wait synchronously, for example, to wait for writeback to complete or
      to wait to lock a buffer head.
      
      So 1/3 fixes the deadlock in a simple way, by disabling the batching
      support for synchronous migration.  The change is straightforward and
      easy to understand.  3/3 then re-introduces batching for synchronous
      migration by optimistically trying to migrate asynchronously in batch,
      and falling back to migrating the folios that failed, synchronously and
      one by one.  Test shows that this can restore the TLB flushing batching
      performance for synchronous migration effectively.
      
      
      This patch (of 3):
      
      Two deadlock bugs were reported for the migrate_pages() batching series. 
      Thanks Hugh and Pengfei!  For example, in the following deadlock trace
      snippet,
      
       INFO: task kworker/u4:0:9 blocked for more than 147 seconds.
             Not tainted 6.2.0-rc4-kvm+ #1314
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       task:kworker/u4:0    state:D stack:0     pid:9     ppid:2      flags:0x00004000
       Workqueue: loop4 loop_rootcg_workfn
       Call Trace:
        <TASK>
        __schedule+0x43b/0xd00
        schedule+0x6a/0xf0
        io_schedule+0x4a/0x80
        folio_wait_bit_common+0x1b5/0x4e0
        ? __pfx_wake_page_function+0x10/0x10
        __filemap_get_folio+0x73d/0x770
        shmem_get_folio_gfp+0x1fd/0xc80
        shmem_write_begin+0x91/0x220
        generic_perform_write+0x10e/0x2e0
        __generic_file_write_iter+0x17e/0x290
        ? generic_write_checks+0x12b/0x1a0
        generic_file_write_iter+0x97/0x180
        ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
        do_iter_readv_writev+0x13c/0x210
        ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
        do_iter_write+0xf6/0x330
        vfs_iter_write+0x46/0x70
        loop_process_work+0x723/0xfe0
        loop_rootcg_workfn+0x28/0x40
        process_one_work+0x3cc/0x8d0
        worker_thread+0x66/0x630
        ? __pfx_worker_thread+0x10/0x10
        kthread+0x153/0x190
        ? __pfx_kthread+0x10/0x10
        ret_from_fork+0x29/0x50
        </TASK>
      
       INFO: task repro:1023 blocked for more than 147 seconds.
             Not tainted 6.2.0-rc4-kvm+ #1314
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       task:repro           state:D stack:0     pid:1023  ppid:360    flags:0x00004004
       Call Trace:
        <TASK>
        __schedule+0x43b/0xd00
        schedule+0x6a/0xf0
        io_schedule+0x4a/0x80
        folio_wait_bit_common+0x1b5/0x4e0
        ? compaction_alloc+0x77/0x1150
        ? __pfx_wake_page_function+0x10/0x10
        folio_wait_bit+0x30/0x40
        folio_wait_writeback+0x2e/0x1e0
        migrate_pages_batch+0x555/0x1ac0
        ? __pfx_compaction_alloc+0x10/0x10
        ? __pfx_compaction_free+0x10/0x10
        ? __this_cpu_preempt_check+0x17/0x20
        ? lock_is_held_type+0xe6/0x140
        migrate_pages+0x100e/0x1180
        ? __pfx_compaction_free+0x10/0x10
        ? __pfx_compaction_alloc+0x10/0x10
        compact_zone+0xe10/0x1b50
        ? lock_is_held_type+0xe6/0x140
        ? check_preemption_disabled+0x80/0xf0
        compact_node+0xa3/0x100
        ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
        ? _find_first_bit+0x7b/0x90
        sysctl_compaction_handler+0x5d/0xb0
        proc_sys_call_handler+0x29d/0x420
        proc_sys_write+0x2b/0x40
        vfs_write+0x3a3/0x780
        ksys_write+0xb7/0x180
        __x64_sys_write+0x26/0x30
        do_syscall_64+0x3b/0x90
        entry_SYSCALL_64_after_hwframe+0x72/0xdc
       RIP: 0033:0x7f3a2471f59d
       RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001
       RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d
       RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
       RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010
       R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0
       R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000
        </TASK>
      
      The page migration task holds the lock of shmem folio A and is waiting
      for the writeback of folio B, which belongs to the file system on the
      loop block device, to complete.  Meanwhile, the loop worker task that
      writes back folio B is waiting to lock shmem folio A, because folio A
      backs folio B in the loop device.  Thus a deadlock is triggered.
      
      In general, if we have locked some other folios besides the one we are
      migrating, it's not safe to wait synchronously, for example, to wait for
      writeback to complete or to wait to lock a buffer head.
      
      To fix the deadlock, in this patch we avoid batching the page migration
      except in MIGRATE_ASYNC mode, where synchronous waiting is avoided.
      
      The fix can be improved further.  We will do that as soon as possible.
      
      Link: https://lkml.kernel.org/r/20230303030155.160983-1-ying.huang@intel.com
      Link: https://lore.kernel.org/linux-mm/87a6c8c-c5c1-67dc-1e32-eb30831d6e3d@google.com/
      Link: https://lore.kernel.org/linux-mm/874jrg7kke.fsf@yhuang6-desk2.ccr.corp.intel.com/
      Link: https://lore.kernel.org/linux-mm/20230227110614.dngdub2j3exr6dfp@quack3/
      Link: https://lkml.kernel.org/r/20230303030155.160983-2-ying.huang@intel.com
      
      
      Fixes: 5dfab109 ("migrate_pages: batch _unmap and _move")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reported-by: Hugh Dickins <hughd@google.com>
      Reported-by: "Xu, Pengfei" <pengfei.xu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Stefan Roesch <shr@devkernel.io>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Xin Hao <xhao@linux.alibaba.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fb3592c4