  1. Oct 18, 2023
    • mm: drop the assumption that VM_SHARED always implies writable · e8e17ee9
      Lorenzo Stoakes authored
      Patch series "permit write-sealed memfd read-only shared mappings", v4.
      
      The man page for fcntl() describing memfd file seals states the following
      about F_SEAL_WRITE:-
      
          Furthermore, trying to create new shared, writable memory-mappings via
          mmap(2) will also fail with EPERM.
      
      With emphasis on 'writable'.  It turns out, in fact, that currently the
      kernel simply disallows all new shared memory mappings for a memfd with
      F_SEAL_WRITE applied, rendering this documentation inaccurate.
      
      This matters because users are therefore unable to obtain any shared
      mapping of a memfd once it has been write-sealed, which limits its
      usefulness.  This was reported in the discussion thread [1] originating
      from a bug report [2].
      
      This is a product of both using the struct address_space->i_mmap_writable
      atomic counter to determine whether writing may be permitted, and the
      kernel adjusting this counter when any VM_SHARED mapping is performed and
      more generally implicitly assuming VM_SHARED implies writable.
      
      It seems sensible that we should only update this counter if VM_MAYWRITE
      is specified, i.e. if it is possible that this mapping could at any
      point be written to.
      
      If we do so then all we need to do to permit write seals to function as
      documented is to clear VM_MAYWRITE when mapping read-only.  It turns out
      this functionality already exists for F_SEAL_FUTURE_WRITE - we can
      therefore simply adapt this logic to do the same for F_SEAL_WRITE.
      
      We then hit a chicken and egg situation in mmap_region() where the check
      for VM_MAYWRITE occurs before we are able to clear this flag.  To work
      around this, perform this check after we invoke call_mmap(), with careful
      consideration of error paths.
      
      Thanks to Andy Lutomirski for the suggestion!
      
      [1]:https://lore.kernel.org/all/20230324133646.16101dfa666f253c4715d965@linux-foundation.org/
      [2]:https://bugzilla.kernel.org/show_bug.cgi?id=217238
      
      
      This patch (of 3):
      
      There is a general assumption that VMAs with the VM_SHARED flag set are
      writable.  If the VM_MAYWRITE flag is not set, then this is simply not the
      case.
      
      Update those checks which affect the struct address_space->i_mmap_writable
      field to explicitly test for this by introducing
      [vma_]is_shared_maywrite() helper functions.
      
      This remains entirely conservative, as the lack of VM_MAYWRITE guarantees
      that the VMA cannot be written to.
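
      For illustration, a minimal sketch of the helpers described above might
      look like the following (names follow the patch description; the exact
      upstream definitions may differ slightly):

          static inline bool is_shared_maywrite(vm_flags_t vm_flags)
          {
                  /* Both VM_SHARED and VM_MAYWRITE must be set. */
                  return (vm_flags & (VM_SHARED | VM_MAYWRITE)) ==
                         (VM_SHARED | VM_MAYWRITE);
          }

          static inline bool vma_is_shared_maywrite(struct vm_area_struct *vma)
          {
                  return is_shared_maywrite(vma->vm_flags);
          }

      Call sites that previously tested vma->vm_flags & VM_SHARED before
      touching i_mmap_writable would then use vma_is_shared_maywrite(vma)
      instead.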
      
      Link: https://lkml.kernel.org/r/cover.1697116581.git.lstoakes@gmail.com
      Link: https://lkml.kernel.org/r/d978aefefa83ec42d18dfa964ad180dbcde34795.1697116581.git.lstoakes@gmail.com
      
      
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Suggested-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e8e17ee9
    • filemap: remove use of wait bookmarks · b0b598ee
      Matthew Wilcox (Oracle) authored
      The original problem of the overly long list of waiters on a locked page
      was solved properly by commit 9a1ea439 ("mm:
      put_and_wait_on_page_locked() while page is migrated").  In the meantime,
      using bookmarks for the writeback bit can cause livelocks, so we need to
      stop using them.
      
      Link: https://lkml.kernel.org/r/20231010035829.544242-1-willy@infradead.org
      
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Bin Lai <sclaibin@gmail.com>
      Cc: Benjamin Segall <bsegall@google.com>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Cc: Valentin Schneider <vschneid@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b0b598ee
    • memcontrol: only transfer the memcg data for migration · 85ce2c51
      Nhat Pham authored
      For most migration use cases, only transfer the memcg data from the old
      folio to the new folio, and clear the old folio's memcg data.  No charging
      and uncharging will be done.
      
      This shaves off some work on the migration path, and avoids the temporary
      double charging of a folio during its migration.
      
      The only exception is replace_page_cache_folio(), which will use the old
      mem_cgroup_migrate() (now renamed to mem_cgroup_replace_folio).  In that
      context, the isolation of the old page isn't quite as thorough as with
      migration, so we cannot use our new implementation directly.
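
      Conceptually, the transfer-only path boils down to something like the
      sketch below; this is heavily simplified and only illustrates the idea
      (the real mem_cgroup_migrate() also deals with flags, kmem folios and
      accounting):

          void mem_cgroup_migrate(struct folio *old, struct folio *new)
          {
                  /* Move the memcg reference rather than uncharging the old
                   * folio and charging the new one. */
                  new->memcg_data = old->memcg_data;
                  old->memcg_data = 0;
          }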
      
      This patch is the result of the following discussion on the new hugetlb
      memcg accounting behavior:
      
      https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/
      
      Link: https://lkml.kernel.org/r/20231006184629.155543-3-nphamcs@gmail.com
      
      
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Frank van der Linden <fvdl@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      85ce2c51
    • mm: use folio_xor_flags_has_waiters() in folio_end_writeback() · 2580d554
      Matthew Wilcox (Oracle) authored
      Match how folio_unlock() works by combining the test for PG_waiters with
      the clearing of PG_writeback.  This should have a small performance win,
      and removes the last user of folio_wake().
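
      The folio_unlock() shape being matched looks roughly like this (sketch):

          void folio_unlock(struct folio *folio)
          {
                  /* Clear PG_locked and, in the same atomic operation, learn
                   * whether PG_waiters was set; only then take the slow path. */
                  if (folio_xor_flags_has_waiters(folio, 1 << PG_locked))
                          folio_wake_bit(folio, PG_locked);
          }

      folio_end_writeback() now applies the same pattern to PG_writeback.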
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-18-willy@infradead.org
      
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2580d554
    • mm: make __end_folio_writeback() return void · 7d0795d0
      Matthew Wilcox (Oracle) authored
      Rather than check the result of test-and-clear, just check that we have
      the writeback bit set at the start.  This wouldn't catch every case, but
      it's good enough (and enables the next patch).
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-17-willy@infradead.org
      
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7d0795d0
    • mm: add folio_xor_flags_has_waiters() · 0410cd84
      Matthew Wilcox (Oracle) authored
      Optimise folio_end_read() by setting the uptodate bit at the same time we
      clear the unlock bit.  This saves at least one memory barrier and one
      write-after-write hazard.
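
      The new helper is essentially a thin wrapper (sketch based on the
      description above; the exact upstream form may differ):

          static inline bool folio_xor_flags_has_waiters(struct folio *folio,
                          unsigned long mask)
          {
                  /* Atomically xor the given flag mask into the folio's flags
                   * and report whether PG_waiters was set. */
                  return xor_unlock_is_negative_byte(mask, folio_flags(folio, 0));
          }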
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-16-willy@infradead.org
      
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0410cd84
    • mm: delete checks for xor_unlock_is_negative_byte() · f12fb73b
      Matthew Wilcox (Oracle) authored
      Architectures which don't define their own use the one in
      asm-generic/bitops/lock.h.  Get rid of all the ifdefs around "maybe we
      don't have it".
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-15-willy@infradead.org
      
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f12fb73b
    • bitops: add xor_unlock_is_negative_byte() · 247dbcdb
      Matthew Wilcox (Oracle) authored
      Replace clear_bit_and_unlock_is_negative_byte() with
      xor_unlock_is_negative_byte().  We have a few places that like to lock a
      folio, set a flag and unlock it again.  Allow for the possibility of
      combining the latter two operations for efficiency.  We are guaranteed
      that the caller holds the lock, so it is safe to unlock it with the xor. 
      The caller must guarantee that nobody else will set the flag without
      holding the lock; it is not safe to do this with the PG_dirty flag, for
      example.
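
      On architectures that do not provide their own version, the generic
      helper can be sketched as follows (assumption: modelled on the
      asm-generic/bitops/lock.h fallback):

          static inline bool xor_unlock_is_negative_byte(unsigned long mask,
                          volatile unsigned long *p)
          {
                  unsigned long old;

                  /* Release ordering pairs with the acquire in the lock path. */
                  old = atomic_long_fetch_xor_release(mask, (atomic_long_t *)p);

                  /* "Negative byte": was bit 7 (PG_waiters) of the low byte set? */
                  return old & BIT(7);
          }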
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-8-willy@infradead.org
      
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      247dbcdb
    • mm: add folio_end_read() · 0b237047
      Matthew Wilcox (Oracle) authored
      Provide a function for filesystems to call when they have finished reading
      an entire folio.
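
      A filesystem's read-completion path would use it roughly like this
      (hypothetical caller; the myfs_* name and err variable are invented for
      illustration):

          static void myfs_read_done(struct folio *folio, int err)
          {
                  /* Marks the folio uptodate on success, then unlocks it and
                   * wakes any waiters in a single atomic flag operation. */
                  folio_end_read(folio, err == 0);
          }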
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-4-willy@infradead.org
      
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0b237047
    • filemap: call filemap_get_folios_tag() from filemap_get_folios() · bafd7e9d
      Pankaj Raghav authored
      filemap_get_folios() is filemap_get_folios_tag() with XA_PRESENT as the
      tag that is being matched.  Return filemap_get_folios_tag() with
      XA_PRESENT as the tag instead of duplicating the code in
      filemap_get_folios().
      
      No functional changes.
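
      The resulting function body is essentially a one-liner (sketch of the
      change described above):

          unsigned filemap_get_folios(struct address_space *mapping,
                          pgoff_t *start, pgoff_t end,
                          struct folio_batch *fbatch)
          {
                  return filemap_get_folios_tag(mapping, start, end,
                                                XA_PRESENT, fbatch);
          }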
      
      Link: https://lkml.kernel.org/r/20231006110120.136809-1-kernel@pankajraghav.com
      
      
      Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bafd7e9d
    • mm: make lock_folio_maybe_drop_mmap() VMA lock aware · 5d74b2ab
      Matthew Wilcox (Oracle) authored
      Patch series "Handle more faults under the VMA lock", v2.
      
      At this point, we're handling the majority of file-backed page faults
      under the VMA lock, using the ->map_pages entry point.  This patch set
       attempts to expand that for the following situations:
      
       - We have to do a read.  This could be because we've hit the point in
         the readahead window where we need to kick off the next readahead,
         or because the page is simply not present in cache.
       - We're handling a write fault.  Most applications don't do I/O by writes
         to shared mmaps for very good reasons, but some do, and it'd be nice
         to not make that slow unnecessarily.
       - We're doing a COW of a private mapping (both PTE already present
         and PTE not-present).  These are two different codepaths and I handle
         both of them in this patch set.
      
      There is no support in this patch set for drivers to mark themselves as
      being VMA lock friendly; they could implement the ->map_pages
      vm_operation, but if they do, they would be the first.  This is probably
      something we want to change at some point in the future, and I've marked
      where to make that change in the code.
      
      There is very little performance change in the benchmarks we've run;
      mostly because the vast majority of page faults are handled through the
      other paths.  I still think this patch series is useful for workloads that
      may take these paths more often, and just for cleaning up the fault path
      in general (it's now clearer why we have to retry in these cases).
      
      
      This patch (of 6):
      
      Drop the VMA lock instead of the mmap_lock if that's the one which
      is held.
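
      A hedged sketch of the lock-release logic this describes (helper and
      flag names follow the kernel's VMA-lock work, but the surrounding code
      is simplified):

          static void release_fault_lock(struct vm_fault *vmf)
          {
                  if (vmf->flags & FAULT_FLAG_VMA_LOCK)
                          vma_end_read(vmf->vma);
                  else
                          mmap_read_unlock(vmf->vma->vm_mm);
          }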
      
      Link: https://lkml.kernel.org/r/20231006195318.4087158-1-willy@infradead.org
      Link: https://lkml.kernel.org/r/20231006195318.4087158-2-willy@infradead.org
      
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5d74b2ab
    • mm/filemap: clarify filemap_fault() comments for not uptodate case · 6facf36e
      Lorenzo Stoakes authored
      The existing comments in filemap_fault() suggest that, after either a
      minor fault has occurred and filemap_get_folio() found a folio in the page
      cache, or a major fault arose and __filemap_get_folio(FGP_CREATE...) did
      the job (having relied on do_sync_mmap_readahead() or filemap_read_folio()
      to read in the folio), the only possible reason it could not be uptodate
      is because of an error.
      
      This is not so, as if, for instance, the fault occurred within a VMA which
      had the VM_RAND_READ flag set (via madvise() with the MADV_RANDOM flag
      specified), this would cause even synchronous readahead to fail to read in
      the folio.
      
      I confirmed this by dropping page caches and faulting in memory
      madvise()'d this way, observing that this code path was reached on each
      occasion.
      
      Clarify the comments to include this case, and additionally update the
      comment recently added around the invalidate lock logic to make it clear
      the comment explicitly refers to the minor fault case.
      
      In addition, while we're here, refer to folios rather than pages.
      
      [lstoakes@gmail.com: correct indentation as per Christopher's feedback]
        Link: https://lkml.kernel.org/r/2c7014c0-6343-4e76-8697-3f84f54350bd@lucifer.local
      Link: https://lkml.kernel.org/r/20230930231029.88196-1-lstoakes@gmail.com
      
      
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6facf36e
  2. Oct 16, 2023
    • mm/filemap: remove hugetlb special casing in filemap.c · a08c7193
      Sidhartha Kumar authored
      Remove special cased hugetlb handling code within the page cache by
      changing the granularity of ->index to the base page size rather than the
      huge page size.  The motivation of this patch is to reduce complexity
      within the filemap code while also increasing performance by removing
      branches that are evaluated on every page cache lookup.
      
      To support the change in index, new wrappers for hugetlb page cache
      interactions are added.  These wrappers perform the conversion to a linear
      index which is now expected by the page cache for huge pages.
      
      ========================= PERFORMANCE ======================================
      
      Perf was used to check the performance differences after the patch. 
      Overall, performance is similar to mainline, with a slightly larger
      overhead in __filemap_add_folio() and
      hugetlb_add_to_page_cache().  This is because of the larger overhead in
      xa_load() and xa_store(), as the xarray now uses more entries
      to store hugetlb folios in the page cache.
      
      Timing
      
      aarch64
          2MB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
                  real    1m49.568s
                  user    0m0.000s
                  sys     1m49.461s
      
              6.5-rc3:
                  [root]# time fallocate -l 700GB test.txt
                  real    1m47.495s
                  user    0m0.000s
                  sys     1m47.370s
          1GB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
                  real    1m47.024s
                  user    0m0.000s
                  sys     1m46.921s
      
              6.5-rc3:
                  [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
                  real    1m44.551s
                  user    0m0.000s
                  sys     1m44.438s
      
      x86
          2MB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
                  real    0m22.383s
                  user    0m0.000s
                  sys     0m22.255s
      
              6.5-rc3:
                  [opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
                  real    0m22.735s
                  user    0m0.038s
                  sys     0m22.567s
      
          1GB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
                  real    0m25.786s
                  user    0m0.001s
                  sys     0m25.589s
      
              6.5-rc3:
                  [root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
                  real    0m33.454s
                  user    0m0.001s
                  sys     0m33.193s
      
      aarch64:
          workload - fallocate a 700GB file backed by huge pages
      
          6.5-rc3 + this patch:
              2MB Page Size:
                  --100.00%--__arm64_sys_fallocate
                                ksys_fallocate
                                vfs_fallocate
                                hugetlbfs_fallocate
                                |
                                |--95.04%--__pi_clear_page
                                |
                                |--3.57%--clear_huge_page
                                |          |
                                |          |--2.63%--rcu_all_qs
                                |          |
                                |           --0.91%--__cond_resched
                                |
                                 --0.67%--__cond_resched
                  0.17%     0.00%             0  fallocate  [kernel.vmlinux]       [k] hugetlb_add_to_page_cache
                  0.14%     0.10%            11  fallocate  [kernel.vmlinux]       [k] __filemap_add_folio
      
          6.5-rc3
              2MB Page Size:
                      --100.00%--__arm64_sys_fallocate
                                ksys_fallocate
                                vfs_fallocate
                                hugetlbfs_fallocate
                                |
                                |--94.91%--__pi_clear_page
                                |
                                |--4.11%--clear_huge_page
                                |          |
                                |          |--3.00%--rcu_all_qs
                                |          |
                                |           --1.10%--__cond_resched
                                |
                                 --0.59%--__cond_resched
                  0.08%     0.01%             1  fallocate  [kernel.kallsyms]  [k] hugetlb_add_to_page_cache
                  0.05%     0.03%             3  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio
      
      x86
          workload - fallocate a 100GB file backed by huge pages
      
          6.5-rc3 + this patch:
              2MB Page Size:
                  hugetlbfs_fallocate
                  |
                  --99.57%--clear_huge_page
                      |
                      --98.47%--clear_page_erms
                          |
                          --0.53%--asm_sysvec_apic_timer_interrupt
      
                  0.04%     0.04%             1  fallocate  [kernel.kallsyms]     [k] xa_load
                  0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] hugetlb_add_to_page_cache
                  0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] __filemap_add_folio
                  0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] xas_store
      
          6.5-rc3
              2MB Page Size:
                      --99.93%--__x64_sys_fallocate
                                vfs_fallocate
                                hugetlbfs_fallocate
                                |
                                 --99.38%--clear_huge_page
                                           |
                                           |--98.40%--clear_page_erms
                                           |
                                            --0.59%--__cond_resched
                  0.03%     0.03%             1  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio
      
      ========================= TESTING ======================================
      
      This patch passes libhugetlbfs tests and LTP hugetlb tests
      
      ********** TEST SUMMARY
      *                      2M
      *                      32-bit 64-bit
      *     Total testcases:   110    113
      *             Skipped:     0      0
      *                PASS:   107    113
      *                FAIL:     0      0
      *    Killed by signal:     3      0
      *   Bad configuration:     0      0
      *       Expected FAIL:     0      0
      *     Unexpected PASS:     0      0
      *    Test not present:     0      0
      * Strange test result:     0      0
      **********
      
          Done executing testcases.
          LTP Version:  20220527-178-g2761a81c4
      
      Page migration was also tested using Mike Kravetz's test program [8].
      
      [dan.carpenter@linaro.org: fix a NULL vs IS_ERR() bug]
        Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain
      Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com
      
      
      Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
      Reported-and-tested-by: <syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
      
      
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a08c7193
  3. Sep 05, 2023
    • mm: memory-failure: use rcu lock instead of tasklist_lock when collect_procs() · d256d1cd
      Tong Tiangen authored
      We found a softlockup issue in our testing, analyzed the logs, and found
      the relevant CPU call traces, as follows:
      
      CPU0:
        _do_fork
          -> copy_process()
            -> write_lock_irq(&tasklist_lock)  //Disable irq,waiting for
            					 //tasklist_lock
      
      CPU1:
        wp_page_copy()
          ->pte_offset_map_lock()
            -> spin_lock(&page->ptl);        //Hold page->ptl
          -> ptep_clear_flush()
            -> flush_tlb_others() ...
              -> smp_call_function_many()
                -> arch_send_call_function_ipi_mask()
                  -> csd_lock_wait()         //Waiting for other CPUs respond
      	                               //IPI
      
      CPU2:
        collect_procs_anon()
          -> read_lock(&tasklist_lock)       //Hold tasklist_lock
            ->for_each_process(tsk)
              -> page_mapped_in_vma()
                -> page_vma_mapped_walk()
      	    -> map_pte()
                    ->spin_lock(&page->ptl)  //Waiting for page->ptl
      
      We can see that CPU1 is waiting for CPU0 to respond to the IPI, CPU0 is
      waiting for CPU2 to unlock tasklist_lock, and CPU2 is waiting for CPU1
      to unlock page->ptl.  As a result, a softlockup is triggered.
      
      For collect_procs_anon(), what we are doing is task list iteration.
      During the iteration, with the help of call_rcu(), the task_struct
      object is freed only after one or more grace periods have elapsed; the
      logic is as follows:
      
      release_task()
        -> __exit_signal()
          -> __unhash_process()
            -> list_del_rcu()
      
        -> put_task_struct_rcu_user()
          -> call_rcu(&task->rcu, delayed_put_task_struct)
      
      delayed_put_task_struct()
        -> put_task_struct()
        -> if (refcount_sub_and_test())
           	__put_task_struct()
                -> free_task()
      
      Therefore, under the protection of the RCU read lock, we can use
      get_task_struct() to take a safe reference to the task_struct during
      the iteration.
      
      By removing the use of tasklist_lock in the task list iteration, we can
      break the softlockup chain above.
      
      The same logic can also be applied to:
       - collect_procs_file()
       - collect_procs_fsdax()
       - collect_procs_ksm()
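
      In all of these, the iteration pattern becomes roughly the following
      (illustrative sketch; task_maps_poisoned_page() is a hypothetical
      stand-in for the real per-task checks):

          rcu_read_lock();
          for_each_process(tsk) {
                  /* If tsk must outlive the RCU read section (e.g. it is
                   * added to the kill list), take a reference first. */
                  if (task_maps_poisoned_page(tsk))
                          get_task_struct(tsk);
          }
          rcu_read_unlock();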
      
      Link: https://lkml.kernel.org/r/20230828022527.241693-1-tongtiangen@huawei.com
      
      
      Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d256d1cd
  4. Aug 18, 2023
    • mm: merge folio_has_private()/filemap_release_folio() call pairs · 0201ebf2
      David Howells authored
      Patch series "mm, netfs, fscache: Stop read optimisation when folio
      removed from pagecache", v7.
      
      This fixes an optimisation in fscache whereby we don't read from the cache
      for a particular file until we know that there's data there that we don't
      have in the pagecache.  The problem is that I'm no longer using PG_fscache
      (aka PG_private_2) to indicate that the page is cached and so I don't get
      a notification when a cached page is dropped from the pagecache.
      
      The first patch merges some folio_has_private() and
      filemap_release_folio() pairs and introduces a helper,
      folio_needs_release(), to indicate if a release is required.
      
      The second patch is the actual fix.  Following Willy's suggestions[1], it
      adds an AS_RELEASE_ALWAYS flag to an address_space that will make
      filemap_release_folio() always call ->release_folio(), even if
      PG_private/PG_private_2 aren't set.  folio_needs_release() is altered to
      add a check for this.
      
      
      This patch (of 2):
      
      Make filemap_release_folio() check folio_has_private().  Then, in most
      cases, where a call to folio_has_private() is immediately followed by a
      call to filemap_release_folio(), we can get rid of the test in the pair.
      
      There are a couple of sites in mm/vmscan.c where this can't so easily be
      done.  In shrink_folio_list(), there are actually three cases (something
      different is done for incompletely invalidated buffers), but
      filemap_release_folio() elides two of them.
      
      In shrink_active_list(), we don't have the folio lock yet, so the
      check allows us to avoid locking the page unnecessarily.
      
      A wrapper function to check if a folio needs release is provided for
      those places that still need to do it in the mm/ directory.  The
      condition it checks will gain additional parts in a future patch.
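
      At this stage the wrapper is trivial (sketch; the second patch extends
      the condition with the AS_RELEASE_ALWAYS check):

          static inline bool folio_needs_release(struct folio *folio)
          {
                  return folio_has_private(folio);
          }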
      
      After this, the only remaining caller of folio_has_private() outside of
      mm/ is a check in fuse.
      
      Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com
      Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com
      
      
      Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Steve French <sfrench@samba.org>
      Cc: Shyam Prasad N <nspmangalore@gmail.com>
      Cc: Rohith Surabattula <rohiths.msft@gmail.com>
      Cc: Dave Wysochanski <dwysocha@redhat.com>
      Cc: Dominique Martinet <asmadeus@codewreck.org>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Xiubo Li <xiubli@redhat.com>
      Cc: Jingbo Xu <jefflexu@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0201ebf2
    • mm/filemap.c: fix update prev_pos after one read request done · f04d16ee
      Haibo Li authored
      ra->prev_pos tracks the last visited byte in the previous read request. 
      It is used to check whether it is sequential read in ondemand_readahead
      and thus affects the readahead window.
      
      After commit 06c04442 ("mm/filemap.c: generic_file_buffered_read() now
      uses find_get_pages_contig"), the update logic of prev_pos changed.
      prev_pos is now updated after each return from filemap_get_pages(), but
      the read request from userspace may not be fully completed at this
      point.  The updated prev_pos impacts the subsequent readahead window.
      
      The real-world problem is a performance drop in fsck_msdos between
      linux-5.4 and linux-5.15 (also linux-6.4).  Compared to linux-5.4, it
      spends about 110% of the time and reads 140% of the pages.  The read
      pattern of fsck_msdos is not fully sequential.
      
      A simplified read pattern of fsck_msdos looks like this:
      1. read at page offset 0xa, size 0x1000
      2. read at another page offset, e.g. 0x20, size 0x1000
      3. read at page offset 0xa, size 0x4000
      4. read at page offset 0xe, size 0x1000
      
      Here is the read status on linux-6.4:
      1.after read at page offset 0xa,size 0x1000
          ->page ofs 0xa go into pagecache
      2.after read at page offset 0x20,size 0x1000
          ->page ofs 0x20 go into pagecache
      3.read at page offset 0xa,size 0x4000
          ->filemap_get_pages read ofs 0xa from pagecache and returns
          ->prev_pos is updated to 0xb and goto next loop
          ->filemap_get_pages tends to read ofs 0xb,size 0x3000
          ->initial_readahead case in ondemand_readahead since prev_pos is
            the same as request ofs.
          ->read 8 pages while async size is 5 pages
            (PageReadahead flag at page 0xe)
      4.read at page offset 0xe,size 0x1000
          ->hit page 0xe with PageReadahead flag set,double the ra_size.
            read 16 pages while async size is 16 pages
      Now it reads 24 pages while actually using only 5 pages.
      
      on linux-5.4:
      1.the same as 6.4
      2.the same as 6.4
      3.read at page offset 0xa,size 0x4000
          ->read ofs 0xa from pagecache
          ->read ofs 0xb,size 0x3000 using page_cache_sync_readahead
            read 3 pages
          ->prev_pos is updated to 0xd before generic_file_buffered_read
            returns
      4.read at page offset 0xe,size 0x1000
          ->initial_readahead case in ondemand_readahead since
            request ofs-prev_pos==1
          ->read 4 pages while async size is 3 pages
      
      Now it reads 7 pages while actually using only 5 pages.
      
      In the above demo, the initial_readahead case is triggered by the offset
      of the user request on linux-5.4, while on linux-6.4 it may be triggered
      by the update logic of prev_pos.
      
      To fix the performance drop, update prev_pos after finishing one read
      request.
      
      Link: https://lkml.kernel.org/r/20230628110220.120134-1-haibo.li@mediatek.com
      
      
      Signed-off-by: Haibo Li <haibo.li@mediatek.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f04d16ee
    • mm: increase usage of folio_next_index() helper · 87b11f86
      Sidhartha Kumar authored
      Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using
      the existing helper folio_next_index().
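
      The conversion is mechanical, for example (illustrative):

          /* before */
          next = folio->index + folio_nr_pages(folio);

          /* after */
          next = folio_next_index(folio);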
      
      Link: https://lkml.kernel.org/r/20230627174349.491803-1-sidhartha.kumar@oracle.com
      
      
      Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Suggested-by: Christoph Hellwig <hch@infradead.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      87b11f86
  5. Jun 19, 2023
    • mm: kill lock|unlock_page_memcg() · 6c77b607
      Kefeng Wang authored
      Since commit c7c3dec1 ("mm: rmap: remove lock_page_memcg()") there are
      no more users, so kill lock_page_memcg() and unlock_page_memcg().
      
      Link: https://lkml.kernel.org/r/20230614143612.62575-1-wangkefeng.wang@huawei.com
      
      
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6c77b607
    • mm: ptep_get() conversion · c33c7948
      Ryan Roberts authored
      Convert all instances of direct pte_t* dereferencing to instead use
      ptep_get() helper.  This means that by default, the accesses change from a
      C dereference to a READ_ONCE().  This is technically the correct thing to
      do since where pgtables are modified by HW (for access/dirty) they are
      volatile and therefore we should always ensure READ_ONCE() semantics.
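
      For reference, the generic default is simply a READ_ONCE() wrapper
      (sketch of the default behaviour described above; arch code may
      override it):

          static inline pte_t ptep_get(pte_t *ptep)
          {
                  return READ_ONCE(*ptep);
          }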
      
      But more importantly, by always using the helper, it can be overridden by
      the architecture to fully encapsulate the contents of the pte.  Arch code
      is deliberately not converted, as the arch code knows best.  It is
      intended that arch code (arm64) will override the default with its own
      implementation that can (e.g.) hide certain bits from the core code, or
      determine young/dirty status by mixing in state from another source.
      
      Conversion was done using Coccinelle:
      
      ----
      
      // $ make coccicheck \
      //          COCCI=ptepget.cocci \
      //          SPFLAGS="--include-headers" \
      //          MODE=patch
      
      virtual patch
      
      @ depends on patch @
      pte_t *v;
      @@
      
      - *v
      + ptep_get(v)
      
      ----
      
      Then reviewed and hand-edited to avoid multiple unnecessary calls to
      ptep_get(), instead opting to store the result of a single call in a
      variable, where it is correct to do so.  This aims to negate any cost of
      READ_ONCE() and will benefit arch-overrides that may be more complex.
      
      Included is a fix for an issue in an earlier version of this patch that
      was pointed out by kernel test robot.  The issue arose because config
      MMU=n elides definition of the ptep helper functions, including
      ptep_get().  HUGETLB_PAGE=n configs still define a simple
      huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
      So when both configs are disabled, this caused a build error because
      ptep_get() is not defined.  Fix by continuing to do a direct dereference
      when MMU=n.  This is safe because for this config the arch code cannot be
      trying to virtualize the ptes because none of the ptep helpers are
      defined.
      
      Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
      
      
      Reported-by: kernel test robot <lkp@intel.com>
      Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
      
      
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Dave Airlie <airlied@gmail.com>
      Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c33c7948
    • mm/filemap: allow pte_offset_map_lock() to fail · 65747aaf
      Hugh Dickins authored
      In filemap_map_pages(), allow pte_offset_map_lock() to fail; and remove
      the pmd_devmap_trans_unstable() check from filemap_map_pmd(), which can
      safely return to filemap_map_pages() and let pte_offset_map_lock()
      discover that.
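
      The failure-tolerant call pattern looks roughly like this (simplified
      sketch):

          vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
          if (!vmf->pte)
                  goto out;   /* page table changed under us; the fault is retried */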
      
      Link: https://lkml.kernel.org/r/54607cf4-ddb6-7ef3-043-1d2de1a9a71@google.com
      
      
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      65747aaf
    • mm/migrate: remove cruft from migration_entry_wait()s · 0cb8fd4d
      Hugh Dickins authored
      migration_entry_wait_on_locked() does not need to take a mapped pte
      pointer, its callers can do the unmap first.  Annotate it with
      __releases(ptl) to reduce sparse warnings.
      
      Fold __migration_entry_wait_huge() into migration_entry_wait_huge().  Fold
      __migration_entry_wait() into migration_entry_wait(), preferring the
      tighter pte_offset_map_lock() to pte_offset_map() and pte_lockptr().
      
      Link: https://lkml.kernel.org/r/b0e2a532-cdf2-561b-e999-f3b13b8d6d3@google.com
      
      
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0cb8fd4d
  6. Jun 12, 2023
    • page cache: fix page_cache_next/prev_miss off by one · 9425c591
      Mike Kravetz authored
      Ackerley Tng reported an issue with hugetlbfs fallocate here[1].  The
      issue showed up after the conversion of hugetlb page cache lookup code to
      use page_cache_next_miss.  Code in hugetlb fallocate, userfaultfd and GUP
      is now using page_cache_next_miss to determine if a page is present in
      the page cache.  The following statement is used:
      
      	present = page_cache_next_miss(mapping, index, 1) != index;
      
      There are two issues with page_cache_next_miss when used in this way.
      1) If the passed value for index is equal to the 'wrap-around' value,
         the same index will always be returned.  This wrap-around value is 0,
         so 0 will be returned even if a page is present at index 0.
      2) If there is no gap in the range passed, the last index in the range
         will be returned.  When passed a range of 1 as above, the passed
         index value will be returned even if the page is present.
      The end result is the statement above will NEVER indicate a page is
      present in the cache, even if it is.
      
      As noted by Ackerley in [1], users can see this by hugetlb fallocate
      incorrectly returning EEXIST if pages are already present in the file.  In
      addition, hugetlb pages will not be included in core dumps if they need to
      be brought in via GUP.  userfaultfd UFFDIO_COPY also uses this code and
      will not notice pages already present in the cache.  It may try to
      allocate a new page and potentially return ENOMEM as opposed to EEXIST.
      
      Both page_cache_next_miss and page_cache_prev_miss have similar issues.
      Fix by:
      - Check for index equal to 'wrap-around' value and do not exit early.
      - If no gap is found in range, return index outside range.
      - Update function description to say 'wrap-around' value could be
        returned if passed as index.
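
      A sketch of the repaired presence-check semantics (simplified; not the
      exact upstream code, and wrap-around handling is elided):

          static pgoff_t next_miss_fixed(struct xarray *xa, pgoff_t index,
                          unsigned long max_scan)
          {
                  unsigned long i;

                  for (i = 0; i < max_scan; i++)
                          if (!xa_load(xa, index + i))
                                  return index + i;    /* first hole */

                  /* No gap found: return an index outside the scanned range,
                   * so a "!= index" presence check works even for a range
                   * of 1. */
                  return index + max_scan;
          }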
      
      [1] https://lore.kernel.org/linux-mm/cover.1683069252.git.ackerleytng@google.com/
      
      Link: https://lkml.kernel.org/r/20230602225747.103865-2-mike.kravetz@oracle.com
      
      
      Fixes: d0ce0e47 ("mm/hugetlb: convert hugetlb fault paths to use alloc_hugetlb_folio()")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Ackerley Tng <ackerleytng@google.com>
      Reviewed-by: Ackerley Tng <ackerleytng@google.com>
      Tested-by: Ackerley Tng <ackerleytng@google.com>
      Cc: Erdem Aktas <erdemaktas@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Vishal Annapurve <vannapurve@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9425c591