- Mar 21, 2025
Lokesh Gidra authored
The following issues were reported in the MOVE ioctl:

1. Panic when trying to move a source page which is in the swap-cache [1]
2. Livelock when multiple threads try to move the same source page [2]

Three patches have been upstreamed to fix these issues [3, 4, 5].

The MOVE ioctl was backported to ACK 6.1 and 6.6 for the ART GC to use [6]. This mode is therefore added so that, on these kernels, userspace can identify whether the fixes are included.

NOTE: the UFFDIO_MOVE_MODE_CONFIRM_FIXED mode is only for the 6.1 and 6.6 kernels, and will go away afterwards.

[1] https://lore.kernel.org/linux-mm/20250219112519.92853-1-21cnbao@gmail.com/
[2] https://github.com/lokeshgidra/uffd_move_ioctl_deadlock
[3] https://web.git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-hotfixes-stable&id=c50f8e6053b0503375c2975bf47f182445aebb4c
[4] https://web.git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-hotfixes-stable&id=37b338eed10581784e854d4262da05c8d960c748
[5] https://web.git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-hotfixes-stable&id=927e926d72d9155fde3264459fe9bfd7b5e40d28
[6] b/274911254

Bug: 401790618
Bug: 405066974
Change-Id: Ibd854ec7ac9ae6a2ca416767d032b6c71f1bc688
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
(cherry picked from commit 9bcabbda)
Signed-off-by: Yinchu Chen <chenyc5@motorola.com>
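A minimal userspace probe for this mode might look like the sketch below. It assumes a userfaultfd has already been created and registered over src/dst, and that UFFDIO_MOVE and struct uffdio_move come from the kernel UAPI header; the numeric value of the ACK-only UFFDIO_MOVE_MODE_CONFIRM_FIXED flag is an assumption here and should be taken from the ACK header. On a 6.1/6.6 kernel without the fixes, the unknown mode bit is rejected with EINVAL.

    #include <errno.h>
    #include <stddef.h>
    #include <sys/ioctl.h>
    #include <linux/userfaultfd.h>

    /* ACK-specific flag; value assumed for illustration, use the ACK uapi header. */
    #ifndef UFFDIO_MOVE_MODE_CONFIRM_FIXED
    #define UFFDIO_MOVE_MODE_CONFIRM_FIXED ((__u64)1 << 2)
    #endif

    /* Returns 1 if the MOVE fixes are present, 0 if not, -1 on other errors. */
    int uffd_move_fixes_present(int uffd, void *src, void *dst, size_t len)
    {
            struct uffdio_move move = {
                    .src  = (unsigned long)src,
                    .dst  = (unsigned long)dst,
                    .len  = len,
                    .mode = UFFDIO_MOVE_MODE_CONFIRM_FIXED,
            };

            if (ioctl(uffd, UFFDIO_MOVE, &move) == 0)
                    return 1;       /* mode accepted: fixed kernel */
            if (errno == EINVAL)
                    return 0;       /* unknown mode bit: fixes missing */
            return -1;
    }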
Suren Baghdasaryan authored
Current implementation of move_pages_pte() copies source and destination PTEs in order to detect concurrent changes to PTEs involved in the move. However these copies are also used to unmap the PTEs, which will fail if CONFIG_HIGHPTE is enabled because the copies are allocated on the stack. Fix this by using the actual PTEs which were kmap()ed.

Link: https://lkml.kernel.org/r/20250226185510.2732648-3-surenb@google.com
Fixes: adef4406 ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
(cherry-picked from commit 927e926d https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-hotfixes-stable)
Change-Id: I0ee6c1b509ea7c4fa68056d6e512d4ac167c9234
Bug: 401790618
Bug: 405066974
(cherry picked from commit 8d8d44ff)
Signed-off-by: Yinchu Chen <chenyc5@motorola.com>
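For illustration, a simplified sketch of the pattern being fixed, assuming the pte_offset_map_nolock() helper of 6.6-era kernels; with CONFIG_HIGHPTE the PTE page is kmap()ed, so pte_unmap() must receive that mapped pointer, never the address of a stack copy. This is not the literal patch, just the shape of it:

    spinlock_t *ptl;
    pte_t *src_pte = pte_offset_map_nolock(mm, src_pmd, src_addr, &ptl);
    pte_t orig_src_pte = ptep_get(src_pte);  /* stack copy: fine for comparisons */

    /* Buggy with CONFIG_HIGHPTE: kunmap of a stack address, not the kmap */
    /* pte_unmap(&orig_src_pte); */

    /* Fixed: unmap the pointer that pte_offset_map_nolock() returned */
    pte_unmap(src_pte);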
Suren Baghdasaryan authored
Lokesh recently raised an issue about UFFDIO_MOVE getting into a deadlock state when it goes into split_folio() with raised folio refcount. split_folio() expects the reference count to be exactly mapcount + num_pages_in_folio + 1 (see can_split_folio()) and fails with EAGAIN otherwise.

If multiple processes are trying to move the same large folio, they raise the refcount (all tasks succeed in that) then one of them succeeds in locking the folio, while others will block in folio_lock() while keeping the refcount raised. The winner of this race will proceed with calling split_folio() and will fail returning EAGAIN to the caller and unlocking the folio. The next competing process will get the folio locked and will go through the same flow. In the meantime the original winner will be retried and will block in folio_lock(), getting into the queue of waiting processes only to repeat the same path. All this results in a livelock.

An easy fix would be to avoid waiting for the folio lock while holding folio refcount, similar to madvise_free_huge_pmd() where folio lock is acquired before raising the folio refcount. Since we lock and take a refcount of the folio while holding the PTE lock, changing the order of these operations should not break anything.

Modify move_pages_pte() to try locking the folio first and if that fails and the folio is large then return EAGAIN without touching the folio refcount. If the folio is single-page then split_folio() is not called, so we don't have this issue.

Lokesh has a reproducer [1] and I verified that this change fixes the issue.

[1] https://github.com/lokeshgidra/uffd_move_ioctl_deadlock

[akpm@linux-foundation.org: reflow comment to 80 cols, s/end/end up/]
Link: https://lkml.kernel.org/r/20250226185510.2732648-2-surenb@google.com
Fixes: adef4406 ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
(cherry-picked from commit 37b338ee https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-hotfixes-stable)
Change-Id: I71b307add9707ad3518a44623aea2e2ca417b95a
Bug: 401790618
Bug: 405066974
(cherry picked from commit af439acc)
Signed-off-by: Yinchu Chen <chenyc5@motorola.com>
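The ordering change can be sketched as below. This is a simplified fragment, not the literal diff; the real move_pages_pte() in mm/userfaultfd.c also handles retries and treats single-page folios differently:

    /* Try-lock before taking a reference, all under the PTE lock, so a
     * competing mover never sleeps in folio_lock() while inflating the
     * refcount that split_folio() will check. */
    if (!folio_trylock(src_folio)) {
            pte_unmap(src_pte);
            spin_unlock(ptl);
            /* large folio: bail out without a raised refcount */
            return -EAGAIN;
    }
    folio_get(src_folio);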
Barry Song authored
userfaultfd_move() checks whether the PTE entry is present or a swap entry.

- If the PTE entry is present, move_present_pte() handles folio migration by setting:

      src_folio->index = linear_page_index(dst_vma, dst_addr);

- If the PTE entry is a swap entry, move_swap_pte() simply copies the PTE to the new dst_addr.

This approach is incorrect because, even if the PTE is a swap entry, it can still reference a folio that remains in the swap cache. This creates a race window between steps 2 and 4:

1. add_to_swap: The folio is added to the swapcache.
2. try_to_unmap: PTEs are converted to swap entries.
3. pageout: The folio is written back.
4. Swapcache is cleared.

If userfaultfd_move() occurs in the window between steps 2 and 4, after the swap PTE has been moved to the destination, accessing the destination triggers do_swap_page(), which may locate the folio in the swapcache. However, since the folio's index has not been updated to match the destination VMA, do_swap_page() will detect a mismatch.

This can result in two critical issues depending on the system configuration. If KSM is disabled, both small and large folios can trigger a BUG during the add_rmap operation due to:

    page_pgoff(folio, page) != linear_page_index(vma, address)

    [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
    [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
    [   13.337716] memcg:ffff00000405f000
    [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
    [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
    [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
    [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
    [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
    [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
    [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
    [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
    [   13.340190] ------------[ cut here ]------------
    [   13.340316] kernel BUG at mm/rmap.c:1380!
    [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
    [   13.340969] Modules linked in:
    [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
    [   13.341470] Hardware name: linux,dummy-virt (DT)
    [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
    [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
    [   13.342018] sp : ffff80008752bb20
    [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
    [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
    [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
    [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
    [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
    [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
    [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
    [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
    [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
    [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
    [   13.343876] Call trace:
    [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
    [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
    [   13.344333]  do_swap_page+0x1060/0x1400
    [   13.344417]  __handle_mm_fault+0x61c/0xbc8
    [   13.344504]  handle_mm_fault+0xd8/0x2e8
    [   13.344586]  do_page_fault+0x20c/0x770
    [   13.344673]  do_translation_fault+0xb4/0xf0
    [   13.344759]  do_mem_abort+0x48/0xa0
    [   13.344842]  el0_da+0x58/0x130
    [   13.344914]  el0t_64_sync_handler+0xc4/0x138
    [   13.345002]  el0t_64_sync+0x1ac/0x1b0
    [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
    [   13.345504] ---[ end trace 0000000000000000 ]---
    [   13.345715] note: a.out[107] exited with irqs disabled
    [   13.345954] note: a.out[107] exited with preempt_count 2

If KSM is enabled, Peter Xu also discovered that do_swap_page() may trigger an unexpected CoW operation for small folios because ksm_might_need_to_copy() allocates a new folio when the folio index does not match linear_page_index(vma, addr).

This patch also checks the swapcache when handling swap entries. If a match is found in the swapcache, it processes it similarly to a present PTE. However, there are some differences. For example, the folio is no longer exclusive because folio_try_share_anon_rmap_pte() is performed during unmapping. Furthermore, in the case of swapcache, the folio has already been unmapped, eliminating the risk of concurrent rmap walks and removing the need to acquire src_folio's anon_vma or lock.

Note that for large folios, in the swapcache handling path, we directly return -EBUSY since split_folio() will return -EBUSY regardless of whether the folio is under writeback or unmapped. This is not an urgent issue, so a follow-up patch may address it separately.

[v-songbaohua@oppo.com: minor cleanup according to Peter Xu]
Link: https://lkml.kernel.org/r/20250226024411.47092-1-21cnbao@gmail.com
Link: https://lkml.kernel.org/r/20250226001400.9129-1-21cnbao@gmail.com
Fixes: adef4406 ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nicolas Geoffray <ngeoffray@google.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: ZhangPeng <zhangpeng362@huawei.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
    1. mm/userfaultfd.c
    [Removed pmd arguments being passed to move_swap_pte() to resolve conflicts - Lokesh Gidra]
    [Replaced swap_cache_index() with swp_offset() as the former doesn't exist - Lokesh Gidra]
    [Replaced folio_move_anon_rmap() with page_move_anon_rmap() as the former doesn't exist - Lokesh Gidra]
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
(cherry-picked from commit c50f8e60 https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-hotfixes-stable)
Change-Id: I94caeac5bf78add4d78650929303a25d54d8a638
Bug: 401790618
Bug: 405066974
(cherry picked from commit 7d6124b6)
Signed-off-by: Yinchu Chen <chenyc5@motorola.com>
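A heavily simplified sketch of the added swapcache check, written against the 6.1/6.6 backport environment described in the Conflicts note (swp_offset() instead of swap_cache_index()); the real patch additionally takes the folio lock, moves the anon rmap, and handles the large-folio -EBUSY case:

    struct folio *folio;
    swp_entry_t entry = pte_to_swp_entry(orig_src_pte);

    /* A swap PTE may still have a folio sitting in the swap cache. */
    folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
    if (!IS_ERR_OR_NULL(folio)) {       /* NULL on 6.1, ERR_PTR on 6.6+ misses */
            /* Treat it like a present PTE: fix up the anon index so a later
             * do_swap_page() at dst_addr finds a matching folio. */
            folio->index = linear_page_index(dst_vma, dst_addr);
            folio_put(folio);
    }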
- Mar 20, 2025
Michal Vrastil authored
commit 51cdd69d upstream.

This reverts commit ec6ce707.

Fix installation of the WinUSB driver using OS descriptors. Without the fix the drivers are not installed correctly and the property 'DeviceInterfaceGUID' is missing on the host side.

The original change was based on the assumption that the interface number is in the high byte of wValue, but it is in the low byte instead. Unfortunately, the fix is based on MS documentation which is also wrong.

The actual USB request for OS descriptors (using a USB analyzer) looks like:

    Offset  0  1  2  3  4  5  6  7
    0x000   C1 A1 02 00 05 00 8E 00

    C1:   bmRequestType (device to host, vendor, interface)
    A1:   nas magic number
    0002: wValue (2: nas interface)
    0005: wIndex (5: get extended property, i.e. nas interface GUID)
    008E: wLength (142)

The fix was tested on Windows 10 and Windows 11.

Cc: stable@vger.kernel.org
Fixes: ec6ce707 ("usb: gadget: composite: fix OS descriptors w_value logic")
Signed-off-by: Michal Vrastil <michal.vrastil@hidglobal.com>
Signed-off-by: Elson Roy Serrao <quic_eserrao@quicinc.com>
Acked-by: Peter Korsgaard <peter@korsgaard.com>
Link: https://lore.kernel.org/r/20241113235433.20244-1-quic_eserrao@quicinc.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug: 405046934
(cherry picked from commit c17418f4)
Change-Id: Ibed2f523154a016106e48d345d3d368adaeddd48
Signed-off-by: Yinchu Chen <chenyc5@motorola.com>
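The byte layout above implies the following parsing in the gadget's setup handler. This is a sketch of the point being made, not the literal composite.c diff:

    /* wValue = 0x0002 in the capture above: the interface number lives in
     * the low byte, so the restored logic reads it from there. */
    u8 interface = w_value & 0xFF;      /* correct: low byte */
    /* u8 interface = w_value >> 8; */  /* reverted assumption: high byte */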
- Mar 10, 2025
zhanghui authored
1 function symbol(s) added:
    'int __traceiter_android_vh_filemap_map_pages_range(void*, struct file*, unsigned long, unsigned long, vm_fault_t)'

1 variable symbol(s) added:
    'struct tracepoint __tracepoint_android_vh_filemap_map_pages_range'

Bug: 398130226
Bug: 400290767
Bug: 401380632
Change-Id: I789a16f5d0bc3d11b9518c548276b2ce19514ead
Signed-off-by: zhanghui <zhanghui31@xiaomi.com>
(cherry picked from commit 6b227a1f)
(cherry picked from commit 08e9e947)
zhanghui authored
In the current vendor hook, if next_uptodate_folio() returns NULL, first_pgoff is set to zero and last_pgoff is set to start_pgoff. As a result, the collected range runs from 0 to start_pgoff:

    |-----------|------------|-------------|------------------|
    0      start_pgoff  first_pgoff   last_pgoff      end_pgoff

We want to collect the range from first_pgoff to last_pgoff, so we have to add a new vendor hook, as sketched below.

Bug: 398130226
Bug: 400290767
Bug: 401380632
Change-Id: I19d54c601e2ffc5de5ec2dafcd43fbdcdc84b0d2
Signed-off-by: zhanghui <zhanghui31@xiaomi.com>
(cherry picked from commit eaffa3e3)
(cherry picked from commit 5ef6cde4)
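Based on the traceiter signature exported in the previous entry, a vendor module would attach to the new hook roughly like this; the handler name is hypothetical and the header path is an assumption (ACK vendor hooks for mm usually live under trace/hooks/):

    #include <linux/module.h>
    #include <trace/hooks/mm.h>   /* assumed location of the vendor hook */

    static void vh_filemap_map_pages_range(void *data, struct file *file,
                                           unsigned long first_pgoff,
                                           unsigned long last_pgoff,
                                           vm_fault_t ret)
    {
            /* collect exactly the [first_pgoff, last_pgoff] range mapped */
    }

    static int __init vh_init(void)
    {
            return register_trace_android_vh_filemap_map_pages_range(
                            vh_filemap_map_pages_range, NULL);
    }
    module_init(vh_init);
    MODULE_LICENSE("GPL");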
Marcus Ma authored
Adding the following symbols:
    - page_swap_info

Bug: 397308736
Bug: 400290767
Bug: 401380632
Change-Id: Ica1c945fd0401c0276d0409ff284fe9debc352a3
Signed-off-by: Marcus Ma <maminghui5@xiaomi.corp-partner.google.com>
(cherry picked from commit 4deb2cd7)
(cherry picked from commit 24104369)
Marcus Ma authored
We have a specific requirement regarding memory management and I/O operations. In our project, we focus on handling scenarios where I/O delays are triggered by anonymous pages. During this period, we need to obtain the swap_info_struct for a page in order to get the corresponding block device id.

Bug: 397308736
Bug: 400290767
Bug: 401380632
Change-Id: Ibc11f412964245658cec60af42cf9486adc96e1a
Signed-off-by: Marcus Ma <maminghui5@xiaomi.corp-partner.google.com>
(cherry picked from commit 75c1d11b)
(cherry picked from commit be18f50e)
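With page_swap_info() exported (see the previous entry), the lookup described here might look like the following sketch, assuming the caller knows the page is swap-backed and its swap entry is still valid:

    #include <linux/swap.h>
    #include <linux/blkdev.h>

    /* Sketch: map an anonymous page to the block device backing its swap slot. */
    static dev_t swap_bdev_of_page(struct page *page)
    {
            struct swap_info_struct *si = page_swap_info(page);

            return (si && si->bdev) ? si->bdev->bd_dev : 0;
    }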
- Feb 26, 2025
Chao Yu authored
There is very similar code in inc_valid_block_count() and inc_valid_node_count(), used for the available user block count calculation. This patch introduces a new helper, get_available_block_count(), to hold that common code, and uses it to clean things up.

Bug: 399286969
Change-Id: Ie2ce55bdac091bc4880478eeba2a76e1608726e3
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
(cherry picked from commit 0f1c6ede)
[Added line for F2FS_IO_ALIGNED, which was removed in later kernels]
Signed-off-by: Daniel Rosenberg <drosen@google.com>
(cherry picked from commit 38149d53)
Qi Han authored
UPSTREAM: f2fs: modify f2fs_is_checkpoint_ready logic to allow more data to be written with the CP disable

When the free segments are used up while checkpointing is disabled, many write or ioctl operations will get ENOSPC error codes, even if there are still many blocks available. We can reproduce it with the following steps:

    dd if=/dev/zero of=f2fs.img bs=1M count=65
    mkfs.f2fs -f f2fs.img
    mount f2fs.img f2fs_dir -o checkpoint=disable:10%
    cd f2fs_dir
    i=1 ; while [[ $i -lt 50 ]] ; do (file_name=./2M_file$i ; dd \
        if=/dev/random of=$file_name bs=1M count=2); i=$((i+1)); done
    sync
    i=1 ; while [[ $i -lt 50 ]] ; do (file_name=./2M_file$i ; truncate \
        -s 1K $file_name); i=$((i+1)); done
    sync
    dd if=/dev/zero of=./file bs=1M count=20

Since f2fs_need_SSR() allows SSR block allocation while checkpointing is disabled, f2fs_is_checkpoint_ready() can likewise consider the number of invalid blocks when free segments are not enough, and return ENOSPC only if the number of invalid blocks is also not enough.

Bug: 399286969
Signed-off-by: Qi Han <hanqi@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
(cherry picked from commit 84b5bb8b)
Signed-off-by: Daniel Rosenberg <drosen@google.com>
(cherry picked from https://android-review.googlesource.com/q/commit:225caf3bdf7a4977ae50ba9b5c5470a16d480100)
Merged-In: I41ad315f603cd764d0e9b8ef984447e7116b1514
Change-Id: I41ad315f603cd764d0e9b8ef984447e7116b1514
(cherry picked from commit 182b090f)
- Feb 17, 2025
Commit "mm,page_alloc,cma: conditionally prefer cma pageblocks for movable allocations" (16867664) introduced balancing of movable allocations between CMA and normal areas. Commit "ANDROID: cma: redirect page allocation to CMA" (f60c5572) removes it, making allocations go in CMA area first. 1. Reintroduce the condition, so that CMA and normal area are used in a balanced way(as it used to be), so it prevents depleting of CMA region; 2. Back-port a command line option(from 6.6), "restrict_cma_redirect", that can be used if only MOVABLE allocations marked as __GFP_CMA are eligible to be redirected to CMA region. By default it is true. The purpose of this change is to keep using CMA for movable allocations, but at the same time, to have enough free CMA pages for critical system areas such as modem initialization, GPU initialization and so on. Bug: 397184449 Bug: 381168812 Signed-off-by:
Sebastian Achim <sebastian.1.achim@sony.com> Signed-off-by:
Uladzislau Rezki <uladzislau.rezki@sony.com> Signed-off-by:
Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Change-Id: I5fd6d022340715e27754c687189c5ea0e56d9ee6 (cherry picked from commit ccc91578)
-
- Jan 22, 2025
Thiébaud Weksteen authored
commit 900f83cf upstream.

When evaluating extended permissions, ignore unknown permissions instead of calling BUG(). This commit ensures that future permissions can be added without interfering with older kernels.

Bug: 381754752
Cc: stable@vger.kernel.org
Fixes: fa1aa143 ("selinux: extended permissions for ioctls")
Signed-off-by: Thiébaud Weksteen <tweek@google.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Thiébaud Weksteen <tweek@google.com>
Change-Id: I1689d8c5084a24c1a34ef3d15d71f8cfa4122447
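The shape of the change in security/selinux/ss/services.c is roughly the following sketch (simplified; the upstream patch touches each switch over the avtab xperms kind, and the exact warning text is an assumption):

    switch (node->datum.u.xperms->specified) {
    case AVTAB_XPERMS_IOCTLFUNCTION:
    case AVTAB_XPERMS_IOCTLDRIVER:
            /* decode the ioctl ranges as before */
            break;
    default:
            /* was: BUG(); now unknown kinds are skipped so newer
             * policies don't crash older kernels */
            pr_warn_once("SELinux: unknown extended permission (%u) ignored\n",
                         node->datum.u.xperms->specified);
            return;
    }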
- Jan 21, 2025
Hyunwoo Kim authored
commit 6ca57537 upstream.

During loopback communication, a dangling pointer can be created in vsk->trans, potentially leading to a use-after-free condition. This issue is resolved by initializing vsk->trans to NULL.

Bug: 378870958
Cc: stable <stable@kernel.org>
Fixes: 06a8fc78 ("VSOCK: Introduce virtio_vsock_common.ko")
Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
Signed-off-by: Wongi Lee <qwerty@theori.io>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Message-Id: <2024102245-strive-crib-c8d3@gregkh>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b110196f)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I5eb7b5ccf7f0d96644cc4313548c0114e8836149
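Judging from the description, the fix amounts to clearing the back-pointer when the transport frees its private state, along the lines of this sketch of virtio_transport_destruct() in net/vmw_vsock/virtio_transport_common.c:

    static void virtio_transport_destruct(struct vsock_sock *vsk)
    {
            struct virtio_vsock_sock *vvs = vsk->trans;

            kfree(vvs);
            vsk->trans = NULL;      /* don't leave a dangling pointer behind */
    }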
- Jan 13, 2025
Yosry Ahmed authored
Previous patches removed the only caller of cgroup_rstat_flush_atomic(). Remove the function and simplify the code.

Link: https://lkml.kernel.org/r/20230421174020.2994750-6-yosryahmed@google.com
Change-Id: I28939a92d4dcc6b8ad03dfaf7243e6d2b6af9100
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 0a2dc6ac)
Bug: 322544714
Bug: 389118629
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Yosry Ahmed authored
Previous patches removed all callers of mem_cgroup_flush_stats_atomic(). Remove the function and simplify the code.

Link: https://lkml.kernel.org/r/20230421174020.2994750-5-yosryahmed@google.com
Change-Id: Ieef476cf2c0aacb48af4e82e4bb96e5cc69292c5
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 35822fda)
Bug: 322544714
Bug: 389118629
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Yosry Ahmed authored
Currently, we approximate the root usage by adding the memcg stats for anon, file, and conditionally swap (for memsw). To read the memcg stats we need to invoke an rstat flush. rstat flushes can be expensive; they scale with the number of cpus and cgroups on the system. mem_cgroup_usage() is called by memcg_events()->mem_cgroup_threshold() with irqs disabled, so such an expensive operation with irqs disabled can cause problems.

Instead, approximate the root usage from global state. This is not 100% accurate, but the root usage has always been ill-defined anyway.

Link: https://lkml.kernel.org/r/20230421174020.2994750-4-yosryahmed@google.com
Change-Id: Ifbc895f45fc119f3079c17fd8f5e3d87267428c4
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit f82a7a86)
Bug: 322544714
Bug: 389118629
Signed-off-by: T.J. Mercier <tjmercier@google.com>
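The global-state approximation can be sketched as follows, paraphrased from the description above (simplified; the non-root branch keeps reading the page counters as before):

    if (mem_cgroup_is_root(memcg)) {
            /* Approximate root usage from global counters: no rstat flush,
             * so it is safe with irqs disabled. Anon + file, plus the
             * swapped-out pages when accounting memsw. */
            val = global_node_page_state(NR_FILE_PAGES) +
                  global_node_page_state(NR_ANON_MAPPED);
            if (swap)
                    val += total_swap_pages - get_nr_swap_pages();
    } else {
            val = swap ? page_counter_read(&memcg->memsw) :
                         page_counter_read(&memcg->memory);
    }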
Yosry Ahmed authored
The previous patch moved the wb_over_bg_thresh()->mem_cgroup_wb_stats() code path in wb_writeback() outside the lock section. We no longer need to flush the stats atomically. Flush the stats non-atomically.

Link: https://lkml.kernel.org/r/20230421174020.2994750-3-yosryahmed@google.com
Change-Id: Ibbb2dd41fac7c26f1e18d0a984d7197b45294f6f
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 190409ca)
Bug: 322544714
Bug: 389118629
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Yosry Ahmed authored
Patch series "cgroup: eliminate atomic rstat flushing", v5. A previous patch series [1] changed most atomic rstat flushing contexts to become non-atomic. This was done to avoid an expensive operation that scales with # cgroups and # cpus to happen with irqs disabled and scheduling not permitted. There were two remaining atomic flushing contexts after that series. This series tries to eliminate them as well, eliminating atomic rstat flushing completely. The two remaining atomic flushing contexts are: (a) wb_over_bg_thresh()->mem_cgroup_wb_stats() (b) mem_cgroup_threshold()->mem_cgroup_usage() For (a), flushing needs to be atomic as wb_writeback() calls wb_over_bg_thresh() with a spinlock held. However, it seems like the call to wb_over_bg_thresh() doesn't need to be protected by that spinlock, so this series proposes a refactoring that moves the call outside the lock criticial section and makes the stats flushing in mem_cgroup_wb_stats() non-atomic. For (b), flushing needs to be atomic as mem_cgroup_threshold() is called with irqs disabled. We only flush the stats when calculating the root usage, as it is approximated as the sum of some memcg stats (file, anon, and optionally swap) instead of the conventional page counter. This series proposes changing this calculation to use the global stats instead, eliminating the need for a memcg stat flush. After these 2 contexts are eliminated, we no longer need mem_cgroup_flush_stats_atomic() or cgroup_rstat_flush_atomic(). We can remove them and simplify the code. [1] https://lore.kernel.org/linux-mm/20230330191801.1967435-1-yosryahmed@google.com/ This patch (of 5): wb_over_bg_thresh() calls mem_cgroup_wb_stats() which invokes an rstat flush, which can be expensive on large systems. Currently, wb_writeback() calls wb_over_bg_thresh() within a lock section, so we have to do the rstat flush atomically. On systems with a lot of cpus and/or cgroups, this can cause us to disable irqs for a long time, potentially causing problems. Move the call to wb_over_bg_thresh() outside the lock section in preparation to make the rstat flush in mem_cgroup_wb_stats() non-atomic. The list_empty(&wb->work_list) check should be okay outside the lock section of wb->list_lock as it is protected by a separate lock (wb->work_lock), and wb_over_bg_thresh() doesn't seem like it is modifying any of wb->b_* lists the wb->list_lock is protecting. Also, the loop seems to be already releasing and reacquring the lock, so this refactoring looks safe. Link: https://lkml.kernel.org/r/20230421174020.2994750-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20230421174020.2994750-2-yosryahmed@google.com Change-Id: Id9dbfad96cfd32f1381d7640e1e2cf62c0a56189 Signed-off-by:
Yosry Ahmed <yosryahmed@google.com> Reviewed-by:
Michal Koutný <mkoutny@suse.com> Reviewed-by:
Jan Kara <jack@suse.cz> Acked-by:
Shakeel Butt <shakeelb@google.com> Acked-by:
Tejun Heo <tj@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 2816ea2a) Bug: 322544714 Bug: 389118629 Signed-off-by:
T.J. Mercier <tjmercier@google.com>
-
Yosry Ahmed authored
In some situations, we may end up calling memcg_rstat_updated() with a value of 0, which means the stat was not actually updated. An example is if we fail to reclaim any pages in shrink_folio_list(). Do not add the cgroup to the rstat updated tree in this case, to avoid unnecessarily flushing it.

Link: https://lkml.kernel.org/r/20230330191801.1967435-9-yosryahmed@google.com
Change-Id: I8142aa6c2962e5af590a5e93f89fbf2313d4b741
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vasily Averin <vasily.averin@linux.dev>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit f9d911ca)
Bug: 322544714
Bug: 389118629
Signed-off-by: T.J. Mercier <tjmercier@google.com>
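The change boils down to an early return in memcg_rstat_updated() in mm/memcontrol.c, roughly:

    static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
    {
            /* A zero delta means no stat actually changed: don't put this
             * memcg on the per-cpu updated tree just to flush nothing later. */
            if (!val)
                    return;

            cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
            /* ... accumulate |val| into stats_flush_threshold as before ... */
    }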
Yosry Ahmed authored
Memory reclaim is a sleepable context. Flushing is an expensive operation that scales with the number of cpus and the number of cgroups in the system, so avoid doing it atomically unnecessarily. This can slow down reclaim code if flushing stats is taking too long, but there are already multiple cond_resched()'s in reclaim code.

Link: https://lkml.kernel.org/r/20230330191801.1967435-8-yosryahmed@google.com
Change-Id: Ia0f0d42131e67a060dc7c9b868ef5247d78e05c8
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vasily Averin <vasily.averin@linux.dev>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 0d856cfe)
Bug: 322544714
Bug: 389118629
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Yosry Ahmed authored
In workingset_refault(), we call mem_cgroup_flush_stats_atomic_ratelimited() to read accurate stats within an RCU read section and with sleeping disallowed. Move the call above the RCU read section to make it non-atomic.

Flushing is an expensive operation that scales with the number of cpus and the number of cgroups in the system, so avoid doing it atomically where possible.

Since workingset_refault() is the only caller of mem_cgroup_flush_stats_atomic_ratelimited(), just make it non-atomic, and rename it to mem_cgroup_flush_stats_ratelimited().

Link: https://lkml.kernel.org/r/20230330191801.1967435-7-yosryahmed@google.com
Change-Id: Ia073b9bcef12dfced053ebea7a5da96ce942596c
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vasily Averin <vasily.averin@linux.dev>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 4009b2f1)
Bug: 322544714
Bug: 389118629
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Yosry Ahmed authored
Currently, all contexts that flush memcg stats do so with sleeping not allowed. Some of these contexts are perfectly safe to sleep in, such as reading cgroup files from userspace or the background periodic flusher. Flushing is an expensive operation that scales with the number of cpus and the number of cgroups in the system, so avoid doing it atomically where possible.

Refactor the code to make mem_cgroup_flush_stats() non-atomic (aka sleepable), and provide a separate atomic version. The atomic version is used in reclaim, refault, writeback, and in mem_cgroup_usage(). All other code paths are left to use the non-atomic version. This includes callbacks for userspace reads and the periodic flusher.

Since refault is the only caller of mem_cgroup_flush_stats_ratelimited(), change it to mem_cgroup_flush_stats_atomic_ratelimited(). Reclaim and refault code paths are modified to do non-atomic flushing in separate later patches -- so it will eventually be changed back to mem_cgroup_flush_stats_ratelimited().

Link: https://lkml.kernel.org/r/20230330191801.1967435-6-yosryahmed@google.com
Change-Id: I9c28e852e1a37202fbd3ee419c72acf667d63404
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vasily Averin <vasily.averin@linux.dev>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 9fad9aee)
Bug: 322544714
Bug: 389118629
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Yosry Ahmed authored
As Johannes notes in [1], stats_flush_lock is currently used to:
(a) Protect updates to stats_flush_threshold.
(b) Protect updates to flush_next_time.
(c) Serialize calls to cgroup_rstat_flush() based on those ratelimits.

However:

1. stats_flush_threshold is already an atomic.

2. flush_next_time is not atomic. The writer is locked, but the reader is lockless. If the reader races with a flush, you could see this:

        if (time_after(jiffies, flush_next_time))
        spin_trylock()
         flush_next_time = now + delay
         flush()
        spin_unlock()
                                        spin_trylock()
                                         flush_next_time = now + delay
                                         flush()
                                        spin_unlock()

   which means we already can get flushes at a higher frequency than FLUSH_TIME during races. But it isn't really a problem. The reader could also see garbled partial updates if the compiler decides to split the write, so it needs at least READ_ONCE and WRITE_ONCE protection.

3. Serializing cgroup_rstat_flush() calls against the ratelimit factors is currently broken because of the race in 2. But the race is actually harmless; all we might get is the occasional earlier flush. If there is no delta, the flush won't do much. And if there is, the flush is justified.

So the lock can be removed altogether. However, the lock also served the purpose of preventing a thundering herd problem for concurrent flushers, see [2]. Use an atomic instead to serve the purpose of unifying concurrent flushers.

[1] https://lore.kernel.org/lkml/20230323172732.GE739026@cmpxchg.org/
[2] https://lore.kernel.org/lkml/20210716212137.1391164-2-shakeelb@google.com/

Link: https://lkml.kernel.org/r/20230330191801.1967435-5-yosryahmed@google.com
Change-Id: I98e8344b440486162426186c4abdf21e02eebd43
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vasily Averin <vasily.averin@linux.dev>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 3cd9992b)
Bug: 322544714
Bug: 389118629
Signed-off-by: T.J. Mercier <tjmercier@google.com>
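The replacement for the lock looks roughly like this sketch of mm/memcontrol.c (paraphrased; at this point in the series the flush itself is still the atomic variant):

    static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);

    static void __mem_cgroup_flush_stats(void)
    {
            /* A flush is already in flight: piggyback on it instead of
             * piling up behind a lock (the old thundering-herd problem). */
            if (atomic_read(&stats_flush_ongoing) ||
                atomic_xchg(&stats_flush_ongoing, 1))
                    return;

            WRITE_ONCE(flush_next_time, jiffies_64 + 2 * FLUSH_TIME);
            cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
            atomic_set(&stats_flush_threshold, 0);
            atomic_set(&stats_flush_ongoing, 0);
    }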
Yosry Ahmed authored
Currently, the only context in which we can invoke an rstat flush from irq context is through mem_cgroup_usage() on the root memcg when called from memcg_check_events(). An rstat flush is an expensive operation that should not be done in irq context, so do not flush stats and use the stale stats in this case.

Arguably, usage threshold events are not reliable on the root memcg anyway since its usage is ill-defined.

Link: https://lkml.kernel.org/r/20230330191801.1967435-4-yosryahmed@google.com
Change-Id: If230311168f126e3741afaeab1f20cb1949190f0
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vasily Averin <vasily.averin@linux.dev>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit a2174e95)
Bug: 322544714
Bug: 389118629
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Yosry Ahmed authored
mem_cgroup_flush_stats_delayed() suggests this is using a delayed_work, but it actually sometimes flushes directly from the callsite. What it is doing is ratelimited calls. A better name would be mem_cgroup_flush_stats_ratelimited().

Link: https://lkml.kernel.org/r/20230330191801.1967435-3-yosryahmed@google.com
Change-Id: Ib418ae17599d10c520b766f0c0e4396a43906256
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vasily Averin <vasily.averin@linux.dev>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 92fbbc72)
Bug: 322544714
Bug: 389118629
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Yosry Ahmed authored
Patch series "memcg: avoid flushing stats atomically where possible", v3. rstat flushing is an expensive operation that scales with the number of cpus and the number of cgroups in the system. The purpose of this series is to minimize the contexts where we flush stats atomically. Patches 1 and 2 are cleanups requested during reviews of prior versions of this series. Patch 3 makes sure we never try to flush from within an irq context. Patches 4 to 7 introduce separate variants of mem_cgroup_flush_stats() for atomic and non-atomic flushing, and make sure we only flush the stats atomically when necessary. Patch 8 is a slightly tangential optimization that limits the work done by rstat flushing in some scenarios. This patch (of 8): cgroup_rstat_flush_irqsafe() can be a confusing name. It may read as "irqs are disabled throughout", which is what the current implementation does (currently under discussion [1]), but is not the intention. The intention is that this function is safe to call from atomic contexts. Name it as such. Link: https://lkml.kernel.org/r/20230330191801.1967435-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20230330191801.1967435-2-yosryahmed@google.com Change-Id: I7a030bc657b330ce700a29ded19f995e26f3aec1 Signed-off-by:
Yosry Ahmed <yosryahmed@google.com> Suggested-by:
Johannes Weiner <hannes@cmpxchg.org> Acked-by:
Shakeel Butt <shakeelb@google.com> Acked-by:
Johannes Weiner <hannes@cmpxchg.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Vasily Averin <vasily.averin@linux.dev> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 8bff9a04) Bug: 322544714 Bug: 389118629 Signed-off-by:
T.J. Mercier <tjmercier@google.com>
-
- Jan 10, 2025
Qun-Wei Lin authored
commit 70457385 upstream.

This patch addresses an issue introduced by commit 1a83a716 ("mm: krealloc: consider spare memory for __GFP_ZERO") which causes MTE (Memory Tagging Extension) to falsely report a slab-out-of-bounds error.

The problem occurs when zeroing out spare memory in __do_krealloc. The original code only considered software-based KASAN and did not account for MTE. It does not reset the KASAN tag before calling memset, leading to a mismatch between the pointer tag and the memory tag, resulting in a false positive.

Example of the error:

    ==================================================================
    swapper/0: BUG: KASAN: slab-out-of-bounds in __memset+0x84/0x188
    swapper/0: Write at addr f4ffff8005f0fdf0 by task swapper/0/1
    swapper/0: Pointer tag: [f4], memory tag: [fe]
    swapper/0:
    swapper/0: CPU: 4 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12.
    swapper/0: Hardware name: MT6991(ENG) (DT)
    swapper/0: Call trace:
    swapper/0:  dump_backtrace+0xfc/0x17c
    swapper/0:  show_stack+0x18/0x28
    swapper/0:  dump_stack_lvl+0x40/0xa0
    swapper/0:  print_report+0x1b8/0x71c
    swapper/0:  kasan_report+0xec/0x14c
    swapper/0:  __do_kernel_fault+0x60/0x29c
    swapper/0:  do_bad_area+0x30/0xdc
    swapper/0:  do_tag_check_fault+0x20/0x34
    swapper/0:  do_mem_abort+0x58/0x104
    swapper/0:  el1_abort+0x3c/0x5c
    swapper/0:  el1h_64_sync_handler+0x80/0xcc
    swapper/0:  el1h_64_sync+0x68/0x6c
    swapper/0:  __memset+0x84/0x188
    swapper/0:  btf_populate_kfunc_set+0x280/0x3d8
    swapper/0:  __register_btf_kfunc_id_set+0x43c/0x468
    swapper/0:  register_btf_kfunc_id_set+0x48/0x60
    swapper/0:  register_nf_nat_bpf+0x1c/0x40
    swapper/0:  nf_nat_init+0xc0/0x128
    swapper/0:  do_one_initcall+0x184/0x464
    swapper/0:  do_initcall_level+0xdc/0x1b0
    swapper/0:  do_initcalls+0x70/0xc0
    swapper/0:  do_basic_setup+0x1c/0x28
    swapper/0:  kernel_init_freeable+0x144/0x1b8
    swapper/0:  kernel_init+0x20/0x1a8
    swapper/0:  ret_from_fork+0x10/0x20
    ==================================================================

Bug: 388132060
(cherry picked from commit 486aeb5f https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/ linux-6.1.y)
Fixes: 1a83a716 ("mm: krealloc: consider spare memory for __GFP_ZERO")
Signed-off-by: Qun-Wei Lin <qun-wei.lin@mediatek.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Seiya Wang <seiya.wang@mediatek.com>
Change-Id: Iea0ba629183042d594665ab51b410965963d167e
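The fix described here is essentially one line in __do_krealloc(), sketched below; kasan_reset_tag() strips the pointer's tag bits so the memset() write is not checked against whatever tag the spare memory currently carries under MTE (condition shown as I understand the original commit):

    /* Zero out the spare region [ks, new_size) of the old allocation. */
    if (flags & __GFP_ZERO && new_size > ks)
            memset(kasan_reset_tag(p) + ks, 0, new_size - ks);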
Seiya Wang authored
Add bt_sock_linked to protected symbol list.

Bug: 387804010
Bug: 388980392
Change-Id: I96abbc18d9cb122708a07d80ae9f8fa2da276ef2
Signed-off-by: Seiya Wang <seiya.wang@mediatek.com>
(cherry picked from commit 770852bf)
- Jan 08, 2025
Dan Carpenter authored
commit f7d306b4 upstream.

The usb_get_descriptor() function does DMA, so we're not allowed to use a stack buffer for that. Doing DMA to the stack is not portable to all architectures. Move the "new_device_descriptor" from being stored on the stack and allocate it with kmalloc() instead.

Bug: 382243530
Fixes: b909df18 ("ALSA: usb-audio: Fix potential out-of-bound accesses for Extigy and Mbox devices")
Cc: stable@kernel.org
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://patch.msgid.link/60e3aa09-039d-46d2-934c-6f123026c2eb@stanley.mountain
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Benoît Sevens <bsevens@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 44a7b041)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I469212aa538584e3d8cc5b0087b68c99acf43f64
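The resulting pattern looks roughly like this sketch of the quirk code in sound/usb/quirks.c (simplified; error handling abbreviated):

    struct usb_device_descriptor *new_device_descriptor;
    int err;

    /* DMA-safe: heap allocation instead of an on-stack descriptor */
    new_device_descriptor = kmalloc(sizeof(*new_device_descriptor), GFP_KERNEL);
    if (!new_device_descriptor)
            return -ENOMEM;

    /* usb_get_descriptor() DMAs into the buffer, so it must not be on the stack */
    err = usb_get_descriptor(dev, USB_DT_DEVICE, 0, new_device_descriptor,
                             sizeof(*new_device_descriptor));
    if (err < 0)
            dev_dbg(&dev->dev, "error usb_get_descriptor: %d\n", err);
    kfree(new_device_descriptor);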
Benoît Sevens authored
commit b909df18 upstream.

A bogus device can provide a bNumConfigurations value that exceeds the initial value used in usb_get_configuration for allocating dev->config. This can lead to out-of-bounds accesses later, e.g. in usb_destroy_configuration.

Bug: 382243530
Signed-off-by: Benoît Sevens <bsevens@google.com>
Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Cc: stable@kernel.org
Link: https://patch.msgid.link/20241120124144.3814457-1-bsevens@google.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b8f8b81d)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I1aa1a442a5c87116200dcab02f84e1bd48f86bb5
- Dec 19, 2024
Takashi Iwai authored
commit a3dd4d63 upstream.

The current USB-audio driver code doesn't check bLength of each descriptor at traversing for clock descriptors. That is, when a device provides a bogus descriptor with a shorter bLength, the driver might hit out-of-bounds reads.

For addressing it, this patch adds sanity checks to the validator functions for the clock descriptor traversal. When the descriptor length is shorter than expected, it's skipped in the loop. For the clock source and clock multiplier descriptors, we can just check bLength against the sizeof() of each descriptor type. OTOH, the clock selector descriptor of UAC2 and UAC3 has an array of bNrInPins elements and two more fields at its tail, hence those have to be checked in addition to the sizeof() check.

Bug: 382239029
Reported-by: Benoît Sevens <bsevens@google.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/20241121140613.3651-1-bsevens@google.com
Link: https://patch.msgid.link/20241125144629.20757-1-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 74cb86e1)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I13e916ffd46fce6fd08f7b9f96cea82bb4bc475d
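For the UAC2 clock selector, the check described above amounts to something like the following sketch (not the literal sound/usb/validate.c diff; the struct layout follows include/linux/usb/audio-v2.h, where baCSourceID is a flexible array followed by bmControls and iClockSelector):

    static bool validate_clock_selector_v2(const void *p)
    {
            const struct uac_clock_selector_descriptor *cs = p;

            /* the fixed header must be complete before bNrInPins is trusted */
            if (cs->bLength < sizeof(*cs))
                    return false;
            /* variable-length pin array plus the two trailing bytes */
            return cs->bLength >= sizeof(*cs) + cs->bNrInPins + 2;
    }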
- Dec 16, 2024
Seiya Wang authored
1 function symbol(s) added:
    'unsigned int tty_buffer_space_avail(struct tty_port*)'

Bug: 383039133
Bug: 384396059
Change-Id: Ic92096ecc482c120b8cda325a7f2ae44f99e8527
Signed-off-by: Seiya Wang <seiya.wang@mediatek.com>
(cherry picked from commit d8ccecb7)
Seiya Wang authored
1 function symbol(s) added:
    'int snd_pcm_format_set_silence(snd_pcm_format_t, void*, unsigned int)'

Bug: 382303912
Bug: 384396059
Signed-off-by: Seiya Wang <seiya.wang@mediatek.com>
Change-Id: I7de7c367b52114b9770697cf3eaeb47008a0e52b
(cherry picked from commit c276c539)
Seiya Wang authored
5 function symbol(s) added:
    'void* devm_memremap_pages(struct device*, struct dev_pagemap*)'
    'int rproc_detach(struct rproc*)'
    'struct socket* tun_get_socket(struct file*)'
    'struct ptr_ring* tun_get_tx_ring(struct file*)'
    'void tun_ptr_free(void*)'

Bug: 381284575
Bug: 384396059
Signed-off-by: Seiya Wang <seiya.wang@mediatek.com>
Change-Id: I593a5dd82f5b7fde7bb3ebbd74fead839fcfdf9f
(cherry picked from commit cdea241b)
- Dec 12, 2024
Jiri Kosina authored
[ Upstream commit 177f25d1 ]

Since the report buffer is used by all kinds of drivers in various ways, let's zero-initialize it during allocation to make sure that it can't ever be used to leak kernel memory via a specially-crafted report.

Bug: 380395346
Fixes: 27ce4050 ("HID: fix data access in implement()")
Reported-by: Benoît Sevens <bsevens@google.com>
Acked-by: Benjamin Tissoires <bentiss@kernel.org>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 9d9f5c75)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I31f64f2745347137bbc415eb35b7fab5761867f3
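In drivers/hid/hid-core.c this is essentially a one-word change in hid_alloc_report_buf(), roughly:

    u8 *hid_alloc_report_buf(struct hid_report *report, gfp_t flags)
    {
            /* +7 bytes of padding for 64-bit report-number handling;
             * kzalloc (was kmalloc) so stale kernel memory can never
             * leak out through a crafted report */
            u32 len = hid_report_len(report) + 7;

            return kzalloc(len, flags);
    }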
Todd Kjos authored
This reverts commit 62bbb08a.

Reason for revert: b/382800956

Bug: 382800956
Change-Id: Ic7a0cdbb060c12c1628a5859d795e78cd6b9341d
Signed-off-by: Todd Kjos <tkjos@google.com>
(cherry picked from commit c3766284)
Signed-off-by: Lee Jones <joneslee@google.com>
- Dec 02, 2024
Todd Kjos authored
Bug: 380360698
Signed-off-by: Todd Kjos <tkjos@google.com>
Change-Id: I71aeaa612f23804e376a3d9ebe33ac991e62e3b6
- Nov 29, 2024
Quentin Perret authored
Similar to how we failed to cross-check the state from the completer's PoV on the hyp_ack_unshare() path, we fail to do so from host_ack_unshare(). This shouldn't cause problems in practice as this can only be called on the guest_unshare_host() path, and guests currently don't have the ability to share their pages with anybody other than the host. But this again is rather fragile, so let's simply do the proper check -- it isn't very costly thanks to the hyp_vmemmap optimisation.

Bug: 381409114
Change-Id: I3770b7db55c579758863e41f50ab30f6a8bb4a0c
Signed-off-by: Quentin Perret <qperret@google.com>
Quentin Perret authored
There are multiple pKVM memory transitions where the state of a page is not cross-checked from the completer's PoV for performance reasons. For example, if a page is PKVM_PAGE_OWNED from the initiator's PoV, we should be guaranteed by construction that it is PKVM_NOPAGE for everybody else, hence allowing us to save a page-table lookup.

When it was introduced, hyp_ack_unshare() followed that logic and bailed out without checking the PKVM_PAGE_SHARED_BORROWED state in the hypervisor's stage-1. This was correct as we could safely assume that all host-initiated shares were directed at the hypervisor at the time. But with the introduction of other types of shares (e.g. for FF-A or non-protected guests), it is now very much required to cross-check this state to prevent the host from running __pkvm_host_unshare_hyp() on a page shared with TZ or a non-protected guest.

Thankfully, if an attacker were to try this, the hyp_unmap() call from hyp_complete_unshare() would fail, causing a WARN() from __do_unshare() with the host lock held, which is fatal. But this is fragile at best, and can hardly be considered a security measure. Let's just do the right thing and always check the state from hyp_ack_unshare().

Bug: 381409114
Link: https://lore.kernel.org/kvmarm/20241128154406.602875-1-qperret@google.com/
Change-Id: Id3bbd1fc3c75df506b0919f4d6f7be74b6f013f3
Signed-off-by: Quentin Perret <qperret@google.com>
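In arch/arm64/kvm/hyp/nvhe/mem_protect.c the always-check version would look roughly like this sketch (helper and state names follow the existing pKVM code; the real function also handles the transition descriptor in more detail):

    static int hyp_ack_unshare(u64 addr, const struct pkvm_mem_transition *tx)
    {
            u64 size = tx->nr_pages * PAGE_SIZE;

            /* No initiator-only fast path anymore: always verify that the
             * hypervisor really sees these pages as borrowed from the host. */
            return __hyp_check_page_state_range(addr, size,
                                                PKVM_PAGE_SHARED_BORROWED);
    }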