- Jun 17, 2024
-
-
Elliot Berman authored
GUP'ing the same pages again should return the same pages; test it. In the FOLL_EXCLUSIVE case, the second pin should fail to get any pages. Note: this change ought to be refactored to pull out the GUP'ing bits that are duplicated between the original and the second GUP. Signed-off-by:
Elliot Berman <quic_eberman@quicinc.com>
-
Elliot Berman authored
Add a test that pages have the exclusive pin bias when FOLL_EXCLUSIVE is provided. Signed-off-by:
Elliot Berman <quic_eberman@quicinc.com>
-
Fuad Tabba authored
When a page is shared, the exclusive pin is dropped, but one normal pin is maintained. In order to be able to unshare a page, add the ability to reacquire the exclusive pin, but only if there is exactly one normal pin on the page, and only if the page is marked as AnonExclusive. Co-Developed-by:
Elliot Berman <quic_eberman@quicinc.com> Signed-off-by:
Elliot Berman <quic_eberman@quicinc.com> Signed-off-by:
Fuad Tabba <tabba@google.com>
-
Fuad Tabba authored
Introduce the ability to obtain an exclusive long-term pin on a page. This exclusive pin can only be held if there are no other pins on the page, regular or exclusive. Moreover, once this pin is held, no other pins can be grabbed until the exclusive pin is released. This pin is grabbed using the (new) FOLL_EXCLUSIVE flag, and is gated by the EXCLUSIVE_PIN configuration option. Similar to how the normal GUP pin is obtained, the exclusive pin overloads the _refcount field for normal pages, or the _pincount field for large pages. It appropriates bit 30 of these two fields, which still allows the detection of overflows into bit 31. It does, however, halve the number of potential normal pins for a page. In order to avoid the possibility of COWing such a page, once an exclusive pin has been obtained, it is marked as AnonExclusive. Co-Developed-by:
Elliot Berman <quic_eberman@quicinc.com> Signed-off-by:
Elliot Berman <quic_eberman@quicinc.com> Signed-off-by:
Fuad Tabba <tabba@google.com>
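As a rough illustration of the encoding described above, here is a toy Python model of the pin-refusal rules. The constants and the check ordering are simplified assumptions for illustration, not the kernel's actual code.

```python
# Toy model (assumption, not kernel code): a normal FOLL_PIN biases the
# refcount; the exclusive pin appropriates bit 30, leaving bit 31 free so
# overflow detection still works, at the cost of halving normal-pin space.
GUP_PIN_BIAS = 1 << 10    # bias added per normal pin (assumed value)
EXCL_PIN_BIAS = 1 << 30   # bit 30: the exclusive pin

def try_pin(refcount, exclusive=False):
    """Return the new refcount, or None if the pin must be refused."""
    if refcount & EXCL_PIN_BIAS:
        return None                       # an exclusive pin blocks all pins
    if exclusive and refcount >= GUP_PIN_BIAS:
        return None                       # other pins exist: no exclusive pin
    return refcount + (EXCL_PIN_BIAS if exclusive else GUP_PIN_BIAS)

ref = 1                                                    # one base reference
assert try_pin(ref, exclusive=True) == 1 + EXCL_PIN_BIAS   # exclusive pin ok
assert try_pin(1 + GUP_PIN_BIAS, exclusive=True) is None   # already pinned
assert try_pin(1 + EXCL_PIN_BIAS) is None                  # exclusively pinned
```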
-
Fuad Tabba authored
No functional change intended. Signed-off-by:
Fuad Tabba <tabba@google.com>
-
- May 16, 2024
-
-
Elliot Berman authored
In arm64 pKVM and QuIC's Gunyah protected VM model, we want to support grabbing shmem user pages instead of using KVM's guestmemfd. When these pages are lent to a guest and made inaccessible to Linux, we need to ensure that the kernel doesn't try to access these pages: a fault occurs otherwise. To do this, we need to ensure that no page pins exist prior to lending the page to the guest. As previously discussed in a PUCK session and in [1], we introduce the concept of "exclusive GUP pinning", which enforces that only one GUP pin is allowed when the flag is set. When FOLL_EXCLUSIVE is set, the corresponding pin ensures that no other pins have been made, and sets a bias in the refcount that ensures no future pins (FOLL_EXCLUSIVE or otherwise) are allowed. This behavior doesn't affect FOLL_GET or any other folio refcount operations that don't go through the FOLL_PIN path. [1]: https://lore.kernel.org/all/20240319143119.GA2736@willie-the-truck/ To: Andrew Morton <akpm@linux-foundation.org> To: Shuah Khan <shuah@kernel.org> To: David Hildenbrand <david@redhat.com> To: Matthew Wilcox <willy@infradead.org> To: maz@kernel.org, will@kernel.org, qperret@google.com, keirf@google.com, seanjc@google.com, Vishal Annapurve <vannapurve@google.com>, Fuad Tabba <tabba@google.com> Cc: kvm@vger.kernel.org, kvmarm@lists.linux.dev Cc: linux-arm-msm@vger.kernel.org Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Cc: linux-kselftest@vger.kernel.org Signed-off-by:
Elliot Berman <quic_eberman@quicinc.com>
-
- May 11, 2024
-
-
Xiu Jianfeng authored
Since commit 857f2139 ("memcg, oom: remove unnecessary check in mem_cgroup_oom_synchronize()"), memcg_oom_gfp_mask and memcg_oom_order are no longer used. Link: https://lkml.kernel.org/r/20240509032628.1217652-1-xiujianfeng@huawei.com Signed-off-by:
Xiu Jianfeng <xiujianfeng@huawei.com> Acked-by:
Michal Hocko <mhocko@suse.com> Acked-by:
Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by:
Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Benjamin Segall <bsegall@google.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Dev Jain authored
Currently, the size used in mmap() is statically defined, so the test is skipped on a hugepage size other than 2 MB, since munmap() won't free the hugepage for a size greater than 2 MB. Hence, query the size at runtime. Also, there is no reason why a hugepage allocation should fail, since we are using a simple mmap() with MAP_HUGETLB; hence, instead of skipping the test, make it fail. Link: https://lkml.kernel.org/r/20240509095447.3791573-1-dev.jain@arm.com Signed-off-by:
Dev Jain <dev.jain@arm.com> Reviewed-by:
Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
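For context, the default hugepage size can be queried at runtime from /proc/meminfo; a minimal sketch of the idea in Python (the selftest itself is C, so this is illustrative only):

```python
def default_hugepage_size():
    """Return the default hugepage size in bytes, per /proc/meminfo."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Hugepagesize:"):
                return int(line.split()[1]) * 1024   # field is in kB
    raise RuntimeError("Hugepagesize not found; no hugepage support?")

print(default_hugepage_size())   # e.g. 2097152 on a 2 MB hugepage system
```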
-
Oscar Salvador authored
commit 1cb9dc4b ("mm: hwpoison: support recovery from HugePage copy-on-write faults") added support to use the mc variants when copying hugetlb pages on CoW faults. Add the missing VM_FAULT_SET_HINDEX, so the right si_addr_lsb will be passed to userspace to report the extent of the faulty area. Link: https://lkml.kernel.org/r/20240509100148.22384-3-osalvador@suse.de Signed-off-by:
Oscar Salvador <osalvador@suse.de> Acked-by:
Peter Xu <peterx@redhat.com> Acked-by:
Axel Rasmussen <axelrasmussen@google.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Oscar Salvador authored
Patch series "Minor fixups for hugetlb fault path". This series contains a couple of fixups for hugetlb_fault and hugetlb_wp respectively, where a VM_FAULT_SET_HINDEX call was missing. I did not bother with a Fixes tag because the missing piece here is that we will not report to userspace the right extension of the faulty area by adjusting struct kernel_siginfo.si_addr_lsb, but I do not consider that to be a big issue because I assume that userspace already knows the size of the mapping anyway. This patch (of 2): commit af19487f ("mm: make PTE_MARKER_SWAPIN_ERROR more general") added the code to handle pte_markers in hugetlb faulting path. In case of an UFFD_POISON event, a PTE_MARKER_POISONED will be created and we will return VM_FAULT_HWPOISON_LARGE upon detecting that in the fault path. Add the missing VM_FAULT_SET_HINDEX, so the right si_addr_lsb will be passed to userspace to report the extension of the faulty area. Link: https://lkml.kernel.org/r/20240509100148.22384-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20240509100148.22384-2-osalvador@suse.de Signed-off-by:
Oscar Salvador <osalvador@suse.de> Acked-by:
Peter Xu <peterx@redhat.com> Acked-by:
Axel Rasmussen <axelrasmussen@google.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
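To make the si_addr_lsb mechanics concrete: the field carries the log2 of the faulting granule, from which userspace can recover the extent of the affected range. A small illustrative computation (not kernel code):

```python
def si_addr_lsb(granule_size):
    """log2 of the faulting granule, as reported via siginfo.si_addr_lsb."""
    assert granule_size & (granule_size - 1) == 0   # must be a power of two
    return granule_size.bit_length() - 1

assert si_addr_lsb(4096) == 12                        # base page: 4 KiB
assert si_addr_lsb(2 << 20) == 21                     # 2 MiB hugetlb page
assert 1 << si_addr_lsb(2 << 20) == 2 * 1024 * 1024   # extent in bytes
```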
-
Usama Arif authored
Attempt writeback with the below steps, and check using memory.stat.zswpwb whether zswap writeback occurred:
1. Allocate memory.
2. Reclaim memory equal to the amount that was allocated in step 1. This will move it into zswap.
3. Save the current zswap usage.
4. Move the memory allocated in step 1 back in from zswap.
5. Set zswap.max to half the amount that was recorded in step 3.
6. Attempt to reclaim memory equal to the amount that was allocated; this will either trigger writeback if it's enabled, or reclamation will fail if writeback is disabled, as there isn't enough zswap space.
Link: https://lkml.kernel.org/r/20240508171359.1545744-1-usamaarif642@gmail.com Signed-off-by:
Usama Arif <usamaarif642@gmail.com> Suggested-by:
Nhat Pham <nphamcs@gmail.com> Acked-by:
Yosry Ahmed <yosryahmed@google.com> Acked-by:
Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
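A rough Python sketch of those steps against the cgroup v2 interface (the actual selftest is C; the cgroup path is an assumption, the allocation itself happens in the test process, and error handling is omitted):

```python
CG = "/sys/fs/cgroup/test"   # assumed, pre-created test cgroup

def cg_write(name, value):
    with open(f"{CG}/{name}", "w") as f:
        f.write(str(value))

def cg_stat(key):
    with open(f"{CG}/memory.stat") as f:
        for line in f:
            k, v = line.split()
            if k == key:
                return int(v)
    return 0

size = 64 << 20                                   # step 1: process allocates
cg_write("memory.reclaim", size)                  # step 2: push it to zswap
usage = int(open(f"{CG}/memory.zswap.current").read())  # step 3: save usage
# step 4: the test process touches its memory, faulting it back from zswap
cg_write("memory.zswap.max", usage // 2)          # step 5: halve zswap.max
cg_write("memory.reclaim", size)                  # step 6: writeback (or
                                                  # failure if disabled)
assert cg_stat("zswpwb") > 0                      # writeback occurred
```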
-
Xiu Jianfeng authored
alloc_mem_cgroup_per_node_info() returns an int that doesn't map to any errno error code. The only existing caller doesn't really need an error code, so change the function to return bool (true on success), because this is slightly less confusing and more consistent with the other code. Link: https://lkml.kernel.org/r/20240507132324.1158510-1-xiujianfeng@huawei.com Signed-off-by:
Xiu Jianfeng <xiujianfeng@huawei.com> Acked-by:
Michal Hocko <mhocko@suse.com> Acked-by:
Shakeel Butt <shakeel.butt@linux.dev> Acked-by:
Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Alex Rusuf authored
damos_wmark_metric_value()'s return type is 'unsigned long', so returning -EINVAL as 'unsigned long' may turn out to be very different from the expected value (due to 2's complement) and be treated as a usual metric value. Fix that by checking whether the returned value is not 0. Link: https://lkml.kernel.org/r/20240506180238.53842-1-sj@kernel.org Fixes: ee801b7d ("mm/damon/schemes: activate schemes based on a watermarks mechanism") Signed-off-by:
Alex Rusuf <yorha.op@gmail.com> Reviewed-by:
SeongJae Park <sj@kernel.org> Signed-off-by:
SeongJae Park <sj@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
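The problem is easy to see numerically: viewed through a 64-bit 'unsigned long', -EINVAL is an enormous value, not an error:

```python
EINVAL = 22
as_unsigned = (-EINVAL) & ((1 << 64) - 1)   # what 'unsigned long' holds
print(hex(as_unsigned))                     # 0xffffffffffffffea
print(as_unsigned)                          # 18446744073709551594: would be
                                            # treated as a huge metric value
```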
-
Yosry Ahmed authored
Previously, all NR_VM_EVENT_ITEMS stats were maintained per-memcg, although some of those fields are not exposed anywhere. Commit 14e0f6c957e39 ("memcg: reduce memory for the lruvec and memcg stats") changed this such that we only maintain the stats we actually expose per-memcg via a translation table. Additionally, commit 514462bbe927b ("memcg: warn for unexpected events and stats") added a warning if a per-memcg stat update is attempted for a stat that is not in the translation table. The warning started firing for the NR_{FILE/SHMEM}_PMDMAPPED stat updates in the rmap code. These stats are not maintained per-memcg, and hence are not in the translation table. Do not use __lruvec_stat_mod_folio() when updating NR_FILE_PMDMAPPED and NR_SHMEM_PMDMAPPED. Use __mod_node_page_state() instead, which updates the global per-node stats only. Link: https://lkml.kernel.org/r/20240506192924.271999-1-yosryahmed@google.com Fixes: 514462bbe927 ("memcg: warn for unexpected events and stats") Signed-off-by:
Yosry Ahmed <yosryahmed@google.com> Reported-by:
<syzbot+9319a4268a640e26b72b@syzkaller.appspotmail.com> Closes: https://lore.kernel.org/lkml/0000000000001b9d500617c8b23c@google.com Acked-by:
Shakeel Butt <shakeel.butt@linux.dev> Acked-by:
David Hildenbrand <david@redhat.com> Reviewed-by:
Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
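A toy Python model of the mechanism described above; the names are illustrative, not the kernel's:

```python
import warnings

MEMCG_TRACKED = {"NR_FILE_MAPPED": 0}   # per-memcg stats translation table
NODE_STATS = {"NR_FILE_PMDMAPPED": 0}   # global per-node stats

def lruvec_stat_mod(item, delta):
    """Per-memcg update path: warns for stats not in the table."""
    if item not in MEMCG_TRACKED:
        warnings.warn(f"unexpected per-memcg stat item: {item}")
        return
    MEMCG_TRACKED[item] += delta

def node_page_state_mod(item, delta):
    """Node-level-only update path: the right choice for untracked items."""
    NODE_STATS[item] += delta

lruvec_stat_mod("NR_FILE_PMDMAPPED", 1)      # fires the warning (the bug)
node_page_state_mod("NR_FILE_PMDMAPPED", 1)  # the fix: per-node stats only
```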
-
Usama Arif authored
The memory controller is already enabled in main(), which invokes the test, so this does not need to be done again in test_no_kmem_bypass(). Link: https://lkml.kernel.org/r/20240502200529.4193651-2-usamaarif642@gmail.com Signed-off-by:
Usama Arif <usamaarif642@gmail.com> Acked-by:
Yosry Ahmed <yosryahmed@google.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
The document says that any patches for review should be based on mm-unstable instead of damon/next. That should remain the recommended process, but sometimes patches based on damon/next may be posted for good reasons. Actually, the DAMON-based tiered memory management patchset[1] was written on top of the 'young page' DAMOS filter patchset, which was in the damon/next tree as of this writing. Allow such cases, and just ask that this be clearly specified. [1] https://lore.kernel.org/20240405060858.2818-1-honggyu.kim@sk.com Link: https://lkml.kernel.org/r/20240503180318.72798-11-sj@kernel.org Signed-off-by:
SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
The document says the maintainer works only in PST. The maintainer observes daylight saving time, though. Update the time zone to PT. Link: https://lkml.kernel.org/r/20240503180318.72798-10-sj@kernel.org Signed-off-by:
SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
The Filters section lists the currently supported filter types in a normal paragraph. Since there are more than four types, it is not easy to find a specific one. Use a list to make specific types easier to find. [sj@kernel.org: fix build warning] Link: https://lkml.kernel.org/r/20240507161747.52430-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240503180318.72798-9-sj@kernel.org Signed-off-by:
SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
To update the effective size quota of DAMOS schemes via the DAMON sysfs file interface, the user should write 'update_schemes_effective_quotas' to the kdamond 'state' file. But the document mistakenly gives the input string as 'update_schemes_effective_bytes'. Fix it (s/bytes/quotas/). Link: https://lkml.kernel.org/r/20240503180318.72798-8-sj@kernel.org Fixes: a6068d6d ("Docs/admin-guide/mm/damon/usage: document effective_bytes file") Signed-off-by:
SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.9.x] Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
The example usage of the DAMOS filter sysfs files, specifically the part about writing the 'matching' file for the memcg type filter, is wrong. The intention is to exclude pages of a memcg that is already getting enough care from a given scheme, but the example sets the filter to apply the scheme to only the pages of the memcg. Fix it. Link: https://lkml.kernel.org/r/20240503180318.72798-7-sj@kernel.org Fixes: 9b7f9322 ("Docs/admin-guide/mm/damon/usage: document DAMOS filters of sysfs") Closes: https://lore.kernel.org/r/20240317191358.97578-1-sj@kernel.org Signed-off-by:
SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.3.x] Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
DAMON selftests can be classified into two categories: functionalities and regressions. Functionality tests check whether the function works as specified, while regression tests are basically reproducers of previously reported and fixed bugs. Tests of the two categories are mixed in the selftests Makefile. Separate them for easier understanding of the types of tests. Link: https://lkml.kernel.org/r/20240503180318.72798-6-sj@kernel.org Signed-off-by:
SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
_damon_sysfs.py is using '==' or '!=' for 'None'. Since 'None' is a singleton, using 'is' or 'is not' is more efficient. Use the more efficient one. Link: https://lkml.kernel.org/r/20240503180318.72798-5-sj@kernel.org Signed-off-by:
SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
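For reference, 'is' checks identity directly, while '==' dispatches to __eq__(), which is slower and can even be overridden to lie; a short illustration:

```python
class Weird:
    def __eq__(self, other):
        return True        # claims equality with everything, even None

w = Weird()
print(w == None)           # True: dispatches to __eq__, misleading
print(w is None)           # False: identity check, always correct
print(None is None)        # True: and cheaper than '==' for the singleton
```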
-
SeongJae Park authored
_damon_sysfs.py assumes sysfs is mounted at /sys. In some systems, that might not be true. Find the mount point from the content of /proc/mounts. Link: https://lkml.kernel.org/r/20240503180318.72798-4-sj@kernel.org Signed-off-by:
SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
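A minimal sketch of the approach (illustrative, not the script's exact code):

```python
def sysfs_mount_point():
    """Return the sysfs mount point from /proc/mounts, usually '/sys'."""
    with open("/proc/mounts") as f:
        for line in f:
            dev, mount_point, fstype = line.split()[:3]
            if fstype == "sysfs":
                return mount_point
    return None

print(sysfs_mount_point())
```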
-
SeongJae Park authored
The DAMON context staging method in _damon_sysfs.py does not check the error returned from the nr_schemes file read. Check it. Link: https://lkml.kernel.org/r/20240503180318.72798-3-sj@kernel.org Fixes: f5f0e5a2 ("selftests/damon/_damon_sysfs: implement kdamonds start function") Signed-off-by:
SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
Patch series "mm/damon: misc fixes and improvements". Add miscelleneous and non-urgent fixes and improvements for DAMON code, selftests, and documents. This patch (of 10): damos_quota_init_priv() function should initialize all private fields of struct damos_quota. However, it is not initializing ->esz_bp field. This could result in use of uninitialized variable from damon_feed_loop_next_input() function. There is no such issue at the moment because every caller of the function is passing damos_quota object that already having the field zero value. But we cannot guarantee the future, and the function is not doing what it is promising. A bug is a bug. This fix is for preventing possible future issues. Link: https://lkml.kernel.org/r/20240503180318.72798-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240503180318.72798-2-sj@kernel.org Fixes: 9294a037 ("mm/damon/core: implement goal-oriented feedback-driven quota auto-tuning") Signed-off-by:
SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
Add a selftest for the DAMOS quota goal. It tests the feature by setting a user_input metric based goal, changing the current feedback, and checking whether the effective quota size is increased and decreased as expected. Link: https://lkml.kernel.org/r/20240502172718.74166-3-sj@kernel.org Signed-off-by:
SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
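Conceptually, the test drives the goal through sysfs roughly as below. This is a hedged sketch: the paths and the goal file names (target_value, current_value) as well as the commit command are assumptions about the DAMON sysfs layout; only 'update_schemes_effective_quotas' and 'effective_bytes' appear in the commits above.

```python
# Hedged sketch: paths and file names below are assumptions, not verified
# against the actual DAMON sysfs interface.
BASE = "/sys/kernel/mm/damon/admin/kdamonds/0"
GOAL = BASE + "/contexts/0/schemes/0/quotas/goals/0"
EFFECTIVE = BASE + "/contexts/0/schemes/0/quotas/effective_bytes"

def write(path, value):
    with open(path, "w") as f:
        f.write(str(value))

def read_int(path):
    with open(path) as f:
        return int(f.read())

write(GOAL + "/target_value", 10000)    # user_input metric goal (assumed)
write(GOAL + "/current_value", 0)       # feedback far below the target ...
write(BASE + "/state", "commit_schemes_quota_goals")   # (assumed command)
write(BASE + "/state", "update_schemes_effective_quotas")
grown = read_int(EFFECTIVE)             # ... so the quota should have grown

write(GOAL + "/current_value", 100000)  # feedback far above the target ...
write(BASE + "/state", "commit_schemes_quota_goals")
write(BASE + "/state", "update_schemes_effective_quotas")
assert read_int(EFFECTIVE) < grown      # ... so the quota should shrink
```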
-
SeongJae Park authored
Patch series "selftests/damon: add DAMOS quota goal test". Extend DAMON selftest-purpose sysfs wrapper to support DAMOS quota goal, and implement a simple selftest for the feature using it. This patch (of 2): The DAMON sysfs test purpose wrapper, _damon_sysfs.py, is not supporting quota goals. Implement the support for testing the feature. The test will be implemented and added by the following commit. Link: https://lkml.kernel.org/r/20240502172718.74166-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240502172718.74166-2-sj@kernel.org Signed-off-by:
SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- May 07, 2024
-
-
Matthew Wilcox (Oracle) authored
We now handle order-1 folios correctly, so we don't need this assertion any more. Link: https://lkml.kernel.org/r/20240429190114.3126789-1-willy@infradead.org Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
All reclaim_folio_list() callers are passing 'true' for the 'ignore_references' parameter. In other words, the parameter is not really being used. Simplify the code by removing the parameter. Link: https://lkml.kernel.org/r/20240429224451.67081-5-sj@kernel.org Signed-off-by:
SeongJae Park <sj@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
All reclaim_pages() callers are setting the 'ignore_references' parameter to 'true'. In other words, the parameter is not really being used. Remove the argument to simplify the code. Link: https://lkml.kernel.org/r/20240429224451.67081-4-sj@kernel.org Signed-off-by:
SeongJae Park <sj@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
The 'pageout' DAMOS action implementation of the 'paddr' DAMON operations set asks reclaim_pages() to do a page level access check if the user is not asking DAMOS to do that on its own. Simplify the logic by making the check always be done by 'paddr'. Link: https://lkml.kernel.org/r/20240429224451.67081-3-sj@kernel.org Signed-off-by:
SeongJae Park <sj@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
Patch series "mm/damon/paddr: simplify page level access re-check for pageout. The 'pageout' DAMOS action implementation of 'paddr' asks reclaim_pages() to do page level access check again. But the user can ask 'paddr' to do the page level access check on its own, using DAMOS filter of 'young page' type. Meanwhile, 'paddr' is the only user of reclaim_pages() that asks the page level access check. Make 'paddr' does the page level access check on its own always, and simplify reclaim_pages() by removing the page level access check request handling logic. As a result of the change for reclaim_pages(), reclaim_folio_list(), which is called by reclaim_pages(), also no more need to do the page level access check. Simplify the function, too. This patch (of 4): 'pageout' DAMOS action implementation of 'paddr' asks reclaim_pages() to do the page level access check. User could ask DAMOS to do the page level access check on its own using 'young page' type DAMOS filter. In the case, pageout DAMOS action unnecessarily asks reclaim_pages() to do the check again. Ask the page level access check only if the scheme is not having the filter. Link: https://lkml.kernel.org/r/20240429224451.67081-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240429224451.67081-2-sj@kernel.org Signed-off-by:
SeongJae Park <sj@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
Commit a12083d7 added hugepd handling for gup-slow, reusing gup-fast functions. follow_hugepd() correctly took the vma pointer in, however it didn't pass it over into the lower functions, which was overlooked. The issue is that gup_fast_hugepte() uses the vma pointer to make the correct decision on whether an unshare is needed for a FOLL_PIN|FOLL_LONGTERM. Now, without the vma pointer, it will constantly return "true" (needs an unshare) for a page cache page, even though in the SHARED case it would be wrong to unshare. The other problem is that, even if an unshare is needed, it now returns 0 rather than -EMLINK, which will not trigger a follow-up FAULT_FLAG_UNSHARE fault. That will need to be fixed too when the unshare is wanted. The gup_longterm test didn't expose this issue in the past because it didn't yet test R/O unshare in this case; another separate patch will enable that in future tests. Fix it by passing the vma correctly to the bottom, rename gup_fast_hugepte() back to gup_hugepte() as it is shared between the fast/slow paths, and also allow -EMLINK to be returned properly by gup_hugepte() even though gup-fast will treat it the same as zero. Link: https://lkml.kernel.org/r/20240430131303.264331-1-peterx@redhat.com Fixes: a12083d7 ("mm/gup: handle hugepd for follow_page()") Signed-off-by:
Peter Xu <peterx@redhat.com> Reported-by:
David Hildenbrand <david@redhat.com> Reviewed-by:
David Hildenbrand <david@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
David Hildenbrand authored
In our FOLL_LONGTERM tests, we prefault the page tables for the GUP-fast test cases to be able to find a PTE and exercise the "longterm pinning allowed" logic on the GUP-fast path where possible. For now, we always prefault the page tables writable, resulting in PTEs that are writable. Let's cover more cases to also test if our unsharing logic works as expected (and is able to make progress when there is nothing to unshare) by mprotect'ing the range R/O when R/O-pinning, so we don't get PTEs that are writable. This change would have found an issue introduced by commit a12083d7 ("mm/gup: handle hugepd for follow_page()"), whereby R/O pinning was not able to make progress in all cases, because unsharing logic was not provided with the VMA to decide at some point that long-term R/O pinning a !anon page is fine. Link: https://lkml.kernel.org/r/20240430131508.86924-1-david@redhat.com Signed-off-by:
David Hildenbrand <david@redhat.com> Acked-by:
Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Frank van der Linden authored
Align the CMA area for hugetlb gigantic pages to their size, not the size that they can be demoted to. Otherwise there might be misaligned sections at the start and end of the CMA area that will never be used for hugetlb page allocations. Link: https://lkml.kernel.org/r/20240430161437.2100295-1-fvdl@google.com Fixes: a01f4390 ("hugetlb: be sure to free demoted CMA pages to CMA") Signed-off-by:
Frank van der Linden <fvdl@google.com> Reviewed-by:
David Hildenbrand <david@redhat.com> Reviewed-by:
Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
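The cost is easy to quantify: with the area aligned only to the demote size, up to nearly a full gigantic page at each end can never serve a gigantic-page allocation. A quick illustration of the alignment arithmetic:

```python
GIGANTIC = 1 << 30      # 1 GiB gigantic page (x86-64)
DEMOTE = 2 << 20        # 2 MiB page it can be demoted to

def usable_gigantic_bytes(base, size, align=GIGANTIC):
    """Bytes usable for gigantic pages within [base, base + size)."""
    start = -(-base // align) * align      # round the base up to alignment
    end = (base + size) // align * align   # round the end down
    return max(0, end - start)

base = 3 * DEMOTE       # area aligned only to 2 MiB, not to 1 GiB
size = 4 * GIGANTIC
print(usable_gigantic_bytes(base, size) // GIGANTIC)   # 3: one page wasted
```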
-
Vishal Verma authored
In size_show(), the dax_dev_rwsem only needs a read lock, but was acquiring a write lock. Change it to down_read_interruptible() so it doesn't unnecessarily hold a write lock. Link: https://lkml.kernel.org/r/20240430-vv-dax_abi_fixes-v3-4-e3dcd755774c@intel.com Fixes: c05ae9d8 ("dax/bus.c: replace driver-core lock usage by a local rwsem") Signed-off-by:
Vishal Verma <vishal.l.verma@intel.com> Reviewed-by:
Dan Williams <dan.j.williams@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Vishal Verma authored
Change an instance of down_write_killable() to a simple down_write() where there is no user process that might want to interrupt the operation. Link: https://lkml.kernel.org/r/20240430-vv-dax_abi_fixes-v3-3-e3dcd755774c@intel.com Fixes: c05ae9d8 ("dax/bus.c: replace driver-core lock usage by a local rwsem") Signed-off-by:
Vishal Verma <vishal.l.verma@intel.com> Reported-by:
Dan Williams <dan.j.williams@intel.com> Reviewed-by:
Dan Williams <dan.j.williams@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Vishal Verma authored
Commit c05ae9d8 ("dax/bus.c: replace driver-core lock usage by a local rwsem") aimed to undo device_lock() abuses for protecting changes to dax-driver internal data structures, from the dax_region resource tree to device-dax-instance range structures. However, the device_lock() was legitimately enforcing that devices to be deleted were not currently attached to any driver nor assigned any capacity from the region. As a result of the device_lock() restoration in delete_store(), the conditional locking in unregister_dev_dax() and unregister_dax_mapping() can be removed. Link: https://lkml.kernel.org/r/20240430-vv-dax_abi_fixes-v3-2-e3dcd755774c@intel.com Fixes: c05ae9d8 ("dax/bus.c: replace driver-core lock usage by a local rwsem") Signed-off-by:
Vishal Verma <vishal.l.verma@intel.com> Reported-by:
Dan Williams <dan.j.williams@intel.com> Reviewed-by:
Dan Williams <dan.j.williams@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Vishal Verma authored
Patch series "dax/bus.c: Fixups for dax-bus locking", v3. Commit Fixes: c05ae9d8 ("dax/bus.c: replace driver-core lock usage by a local rwsem") introduced a few problems that this series aims to fix. Add back device_lock() where it was correctly used (during device manipulation operations), remove conditional locking in unregister_dax_dev() and unregister_dax_mapping(), use non-interruptible versions of rwsem locks when not called from a user process, and fix up a write vs. read usage of an rwsem. This patch (of 4): In [1], Dan points out that all of the WARN_ON_ONCE() usage in the referenced patch should be replaced with lockdep_assert_held, or lockdep_held_assert_write(). Replace these as appropriate. Link: https://lkml.kernel.org/r/20240430-vv-dax_abi_fixes-v3-0-e3dcd755774c@intel.com Link: https://lore.kernel.org/r/65f0b5ef41817_aa222941a@dwillia2-mobl3.amr.corp.intel.com.notmuch [1] Link: https://lkml.kernel.org/r/20240430-vv-dax_abi_fixes-v3-1-e3dcd755774c@intel.com Fixes: c05ae9d8 ("dax/bus.c: replace driver-core lock usage by a local rwsem") Signed-off-by:
Vishal Verma <vishal.l.verma@intel.com> Reported-by:
Dan Williams <dan.j.williams@intel.com> Reviewed-by:
Dan Williams <dan.j.williams@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Breno Leitao authored
A memcg pointer in the per-cpu stock can be accessed by drain_all_stock() and consume_stock() in parallel, causing a potential race, which is believed to be harmless. KCSAN shows this data-race clearly in the splat below:
BUG: KCSAN: data-race in drain_all_stock.part.0 / try_charge_memcg
write to 0xffff88903f8b0788 of 4 bytes by task 35901 on cpu 2:
 try_charge_memcg (mm/memcontrol.c:2323 mm/memcontrol.c:2746)
 __mem_cgroup_charge (mm/memcontrol.c:7287 mm/memcontrol.c:7301)
 do_anonymous_page (mm/memory.c:1054 mm/memory.c:4375 mm/memory.c:4433)
 __handle_mm_fault (mm/memory.c:3878 mm/memory.c:5300 mm/memory.c:5441)
 handle_mm_fault (mm/memory.c:5606)
 do_user_addr_fault (arch/x86/mm/fault.c:1363)
 exc_page_fault (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:72 arch/x86/mm/fault.c:1513 arch/x86/mm/fault.c:1563)
 asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623)
read to 0xffff88903f8b0788 of 4 bytes by task 287 on cpu 27:
 drain_all_stock.part.0 (mm/memcontrol.c:2433)
 mem_cgroup_css_offline (mm/memcontrol.c:5398 mm/memcontrol.c:5687)
 css_killed_work_fn (kernel/cgroup/cgroup.c:5521 kernel/cgroup/cgroup.c:5794)
 process_one_work (kernel/workqueue.c:3254)
 worker_thread (kernel/workqueue.c:3329 kernel/workqueue.c:3416)
 kthread (kernel/kthread.c:388)
 ret_from_fork (arch/x86/kernel/process.c:147)
 ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
value changed: 0x00000014 -> 0x00000013
This happens because drain_all_stock() is reading stock->nr_pages while consume_stock() might be updating the same address, causing a potential data-race. Make the shared addresses bulletproof with respect to reads and writes, similar to what is done for stock->cached_objcg and stock->cached. Annotate all accesses to stock->nr_pages with READ_ONCE()/WRITE_ONCE(). Link: https://lkml.kernel.org/r/20240501095420.679208-1-leitao@debian.org Signed-off-by:
Breno Leitao <leitao@debian.org> Acked-by:
Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by:
Roman Gushchin <roman.gushchin@linux.dev> Acked-by:
Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-