  1. Dec 23, 2024
  2. Dec 06, 2024
    • btrfs: flush delalloc workers queue before stopping cleaner kthread during unmount · f10bef73
      Filipe Manana authored
      
      During the unmount path, at close_ctree(), we first stop the cleaner
      kthread, using kthread_stop() which frees the associated task_struct, and
      then stop and destroy all the work queues. However, after we stop the
      cleaner we may still have a worker from the delalloc_workers queue running
      inode.c:submit_compressed_extents(), which calls btrfs_add_delayed_iput(),
      which in turn tries to wake up the cleaner kthread. The cleaner was already
      destroyed, resulting in a use-after-free on the task_struct.
      
      Syzbot reported this with the following stack traces:
      
        BUG: KASAN: slab-use-after-free in __lock_acquire+0x78/0x2100 kernel/locking/lockdep.c:5089
        Read of size 8 at addr ffff8880259d2818 by task kworker/u8:3/52
      
        CPU: 1 UID: 0 PID: 52 Comm: kworker/u8:3 Not tainted 6.13.0-rc1-syzkaller-00002-gcdd30ebb1b9f #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
        Workqueue: btrfs-delalloc btrfs_work_helper
        Call Trace:
         <TASK>
         __dump_stack lib/dump_stack.c:94 [inline]
         dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
         print_address_description mm/kasan/report.c:378 [inline]
         print_report+0x169/0x550 mm/kasan/report.c:489
         kasan_report+0x143/0x180 mm/kasan/report.c:602
         __lock_acquire+0x78/0x2100 kernel/locking/lockdep.c:5089
         lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
         __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
         _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
         class_raw_spinlock_irqsave_constructor include/linux/spinlock.h:551 [inline]
         try_to_wake_up+0xc2/0x1470 kernel/sched/core.c:4205
         submit_compressed_extents+0xdf/0x16e0 fs/btrfs/inode.c:1615
         run_ordered_work fs/btrfs/async-thread.c:288 [inline]
         btrfs_work_helper+0x96f/0xc40 fs/btrfs/async-thread.c:324
         process_one_work kernel/workqueue.c:3229 [inline]
         process_scheduled_works+0xa66/0x1840 kernel/workqueue.c:3310
         worker_thread+0x870/0xd30 kernel/workqueue.c:3391
         kthread+0x2f0/0x390 kernel/kthread.c:389
         ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
         ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
         </TASK>
      
        Allocated by task 2:
         kasan_save_stack mm/kasan/common.c:47 [inline]
         kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
         unpoison_slab_object mm/kasan/common.c:319 [inline]
         __kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:345
         kasan_slab_alloc include/linux/kasan.h:250 [inline]
         slab_post_alloc_hook mm/slub.c:4104 [inline]
         slab_alloc_node mm/slub.c:4153 [inline]
         kmem_cache_alloc_node_noprof+0x1d9/0x380 mm/slub.c:4205
         alloc_task_struct_node kernel/fork.c:180 [inline]
         dup_task_struct+0x57/0x8c0 kernel/fork.c:1113
         copy_process+0x5d1/0x3d50 kernel/fork.c:2225
         kernel_clone+0x223/0x870 kernel/fork.c:2807
         kernel_thread+0x1bc/0x240 kernel/fork.c:2869
         create_kthread kernel/kthread.c:412 [inline]
         kthreadd+0x60d/0x810 kernel/kthread.c:767
         ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
         ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
      
        Freed by task 24:
         kasan_save_stack mm/kasan/common.c:47 [inline]
         kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
         kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:582
         poison_slab_object mm/kasan/common.c:247 [inline]
         __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264
         kasan_slab_free include/linux/kasan.h:233 [inline]
         slab_free_hook mm/slub.c:2338 [inline]
         slab_free mm/slub.c:4598 [inline]
         kmem_cache_free+0x195/0x410 mm/slub.c:4700
         put_task_struct include/linux/sched/task.h:144 [inline]
         delayed_put_task_struct+0x125/0x300 kernel/exit.c:227
         rcu_do_batch kernel/rcu/tree.c:2567 [inline]
         rcu_core+0xaaa/0x17a0 kernel/rcu/tree.c:2823
         handle_softirqs+0x2d4/0x9b0 kernel/softirq.c:554
         run_ksoftirqd+0xca/0x130 kernel/softirq.c:943
         smpboot_thread_fn+0x544/0xa30 kernel/smpboot.c:164
         kthread+0x2f0/0x390 kernel/kthread.c:389
         ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
         ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
      
        Last potentially related work creation:
         kasan_save_stack+0x3f/0x60 mm/kasan/common.c:47
         __kasan_record_aux_stack+0xac/0xc0 mm/kasan/generic.c:544
         __call_rcu_common kernel/rcu/tree.c:3086 [inline]
         call_rcu+0x167/0xa70 kernel/rcu/tree.c:3190
         context_switch kernel/sched/core.c:5372 [inline]
         __schedule+0x1803/0x4be0 kernel/sched/core.c:6756
         __schedule_loop kernel/sched/core.c:6833 [inline]
         schedule+0x14b/0x320 kernel/sched/core.c:6848
         schedule_timeout+0xb0/0x290 kernel/time/sleep_timeout.c:75
         do_wait_for_common kernel/sched/completion.c:95 [inline]
         __wait_for_common kernel/sched/completion.c:116 [inline]
         wait_for_common kernel/sched/completion.c:127 [inline]
         wait_for_completion+0x355/0x620 kernel/sched/completion.c:148
         kthread_stop+0x19e/0x640 kernel/kthread.c:712
         close_ctree+0x524/0xd60 fs/btrfs/disk-io.c:4328
         generic_shutdown_super+0x139/0x2d0 fs/super.c:642
         kill_anon_super+0x3b/0x70 fs/super.c:1237
         btrfs_kill_super+0x41/0x50 fs/btrfs/super.c:2112
         deactivate_locked_super+0xc4/0x130 fs/super.c:473
         cleanup_mnt+0x41f/0x4b0 fs/namespace.c:1373
         task_work_run+0x24f/0x310 kernel/task_work.c:239
         ptrace_notify+0x2d2/0x380 kernel/signal.c:2503
         ptrace_report_syscall include/linux/ptrace.h:415 [inline]
         ptrace_report_syscall_exit include/linux/ptrace.h:477 [inline]
         syscall_exit_work+0xc7/0x1d0 kernel/entry/common.c:173
         syscall_exit_to_user_mode_prepare kernel/entry/common.c:200 [inline]
         __syscall_exit_to_user_mode_work kernel/entry/common.c:205 [inline]
         syscall_exit_to_user_mode+0x24a/0x340 kernel/entry/common.c:218
         do_syscall_64+0x100/0x230 arch/x86/entry/common.c:89
         entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
        The buggy address belongs to the object at ffff8880259d1e00
         which belongs to the cache task_struct of size 7424
        The buggy address is located 2584 bytes inside of
         freed 7424-byte region [ffff8880259d1e00, ffff8880259d3b00)
      
        The buggy address belongs to the physical page:
        page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x259d0
        head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
        memcg:ffff88802f4b56c1
        flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
        page_type: f5(slab)
        raw: 00fff00000000040 ffff88801bafe500 dead000000000100 dead000000000122
        raw: 0000000000000000 0000000000040004 00000001f5000000 ffff88802f4b56c1
        head: 00fff00000000040 ffff88801bafe500 dead000000000100 dead000000000122
        head: 0000000000000000 0000000000040004 00000001f5000000 ffff88802f4b56c1
        head: 00fff00000000003 ffffea0000967401 ffffffffffffffff 0000000000000000
        head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
        page dumped because: kasan: bad access detected
        page_owner tracks the page as allocated
        page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 12, tgid 12 (kworker/u8:1), ts 7328037942, free_ts 0
         set_page_owner include/linux/page_owner.h:32 [inline]
         post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1556
         prep_new_page mm/page_alloc.c:1564 [inline]
         get_page_from_freelist+0x3651/0x37a0 mm/page_alloc.c:3474
         __alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4751
         alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265
         alloc_slab_page+0x6a/0x140 mm/slub.c:2408
         allocate_slab+0x5a/0x2f0 mm/slub.c:2574
         new_slab mm/slub.c:2627 [inline]
         ___slab_alloc+0xcd1/0x14b0 mm/slub.c:3815
         __slab_alloc+0x58/0xa0 mm/slub.c:3905
         __slab_alloc_node mm/slub.c:3980 [inline]
         slab_alloc_node mm/slub.c:4141 [inline]
         kmem_cache_alloc_node_noprof+0x269/0x380 mm/slub.c:4205
         alloc_task_struct_node kernel/fork.c:180 [inline]
         dup_task_struct+0x57/0x8c0 kernel/fork.c:1113
         copy_process+0x5d1/0x3d50 kernel/fork.c:2225
         kernel_clone+0x223/0x870 kernel/fork.c:2807
         user_mode_thread+0x132/0x1a0 kernel/fork.c:2885
         call_usermodehelper_exec_work+0x5c/0x230 kernel/umh.c:171
         process_one_work kernel/workqueue.c:3229 [inline]
         process_scheduled_works+0xa66/0x1840 kernel/workqueue.c:3310
         worker_thread+0x870/0xd30 kernel/workqueue.c:3391
        page_owner free stack trace missing
      
        Memory state around the buggy address:
         ffff8880259d2700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff8880259d2780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        >ffff8880259d2800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                    ^
         ffff8880259d2880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff8880259d2900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ==================================================================
      
      Fix this by flushing the delalloc workers queue before stopping the
      cleaner kthread.
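
      The ordering problem can be illustrated with a minimal single-threaded
      userspace sketch. It is not the kernel code: struct model_fs and the
      model_*() functions are hypothetical names, and a counter stands in for
      the use-after-free that KASAN reports.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_fs {
    bool cleaner_alive;      /* false once the cleaner is stopped and freed */
    int pending_delalloc;    /* queued delalloc work items */
    int stale_wakeups;       /* wakeups aimed at an already freed cleaner */
};

/* Drain the queue. Every work item ends by waking the cleaner, as
 * btrfs_add_delayed_iput() does in the commit message above. */
static void model_flush_delalloc(struct model_fs *fs)
{
    assert(fs != NULL);
    while (fs->pending_delalloc > 0) {
        fs->pending_delalloc--;
        if (!fs->cleaner_alive)
            fs->stale_wakeups++;  /* the real kernel would touch freed memory */
    }
}

static void model_stop_cleaner(struct model_fs *fs)
{
    fs->cleaner_alive = false;
}

/* Buggy order: the cleaner is gone while work items can still run. */
int model_teardown_buggy(struct model_fs *fs)
{
    model_stop_cleaner(fs);
    model_flush_delalloc(fs);
    return fs->stale_wakeups;
}

/* Fixed order: flush the delalloc workers first, then stop the cleaner. */
int model_teardown_fixed(struct model_fs *fs)
{
    model_flush_delalloc(fs);
    model_stop_cleaner(fs);
    return fs->stale_wakeups;
}
```

      With three queued work items, the buggy order records three stale
      wakeups and the fixed order records none.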
      
      Reported-by: syzbot+b7cf50a0c173770dcb14@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-btrfs/674ed7e8.050a0220.48a03.0031.GAE@google.com/
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: handle bio_split() errors · c7c97cef
      Johannes Thumshirn authored
      
      Commit e546fe1d ("block: Rework bio_split() return value") changed
      bio_split() so that it can return errors.
      
      Add error handling for it in btrfs_split_bio() and ultimately
      btrfs_submit_chunk(). As the bio is not submitted, the bio counter must
      be decremented to pair with btrfs_bio_counter_inc_blocked().
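
      A minimal userspace sketch of that pattern follows. It assumes
      error-encoded pointers in the style of the kernel's ERR_PTR() helpers.
      All model_*() names, the split rule and the plain integer counter are
      hypothetical stand-ins, not the btrfs implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/IS_ERR()/PTR_ERR(). */
#define MODEL_MAX_ERRNO 4095
static void *model_err_ptr(long err) { return (void *)(intptr_t)err; }
static int model_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}
static long model_ptr_err(const void *p) { return (long)(intptr_t)p; }

struct model_bio { unsigned int sectors; };

static int model_bio_counter;   /* in-flight bios; hypothetical stand-in */

/* Hypothetical split: carve the first `at` sectors into `front`. */
static struct model_bio *model_split(struct model_bio *bio, unsigned int at,
                                     struct model_bio *front)
{
    if (at == 0 || at >= bio->sectors)
        return model_err_ptr(-EINVAL);
    front->sectors = at;
    bio->sectors -= at;
    return front;
}

int model_submit_chunk(struct model_bio *bio, unsigned int split_at)
{
    struct model_bio front;
    struct model_bio *split;

    assert(bio != NULL);
    model_bio_counter++;            /* taken before submission */
    split = model_split(bio, split_at, &front);
    if (model_is_err(split)) {
        model_bio_counter--;        /* nothing was submitted: undo it */
        return (int)model_ptr_err(split);
    }
    return 0;                       /* counter is dropped at completion */
}

int model_in_flight(void) { return model_bio_counter; }
```

      A failed split leaves the counter where it started, while a successful
      one keeps it raised until completion would run.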
      
      Reviewed-by: John Garry <john.g.garry@oracle.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: properly wait for writeback before buffered write · c83d77eb
      Qu Wenruo authored
      
      [BUG]
      Before commit e820dbeb ("btrfs: convert btrfs_buffered_write() to
      use folios"), function prepare_one_folio() always waited for folio
      writeback to finish before returning the folio.

      However, commit e820dbeb ("btrfs: convert btrfs_buffered_write() to
      use folios") switched to FGP_STABLE to do the writeback wait. But
      FGP_STABLE calls folio_wait_stable(), which only calls
      folio_wait_writeback() if the address space has AS_STABLE_WRITES, which
      is not set for btrfs inodes.
      
      This means we will not wait for the folio writeback at all.
      
      [CAUSE]
      The cause is that FGP_STABLE does not wait for writeback
      unconditionally, but only for address spaces with AS_STABLE_WRITES.
      Normally that flag is set when the super block has the
      SB_I_STABLE_WRITES flag.

      That super block flag is set when the block device has hardware digest
      support or has an internal checksum requirement.

      I'd argue btrfs should set that super block flag due to its default
      data checksum behavior, but it is not set yet, so the FGP_STABLE flag
      has no effect at all.
      
      (For NODATASUM inodes, we can skip the waiting in theory but that should
      be an optimization in the future.)
      
      This can lead to data checksum mismatches, as we can modify the folio
      while it's still under writeback, making the contents differ from the
      contents at submission and checksum calculation.
      
      [FIX]
      Instead of fully relying on FGP_STABLE, manually wait for folio
      writeback, until we set the address space or super block flag.
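
      The difference between the two waits can be sketched in userspace as
      follows. The model_*() names are hypothetical, and "waiting" is modelled
      by clearing a flag. The only behaviour taken from the description above
      is that the stable wait is conditional on the mapping flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_mapping { bool stable_writes; };  /* models AS_STABLE_WRITES */
struct model_folio {
    struct model_mapping *mapping;
    bool under_writeback;
};

/* Models folio_wait_writeback(): returns once writeback has finished. */
static void model_wait_writeback(struct model_folio *folio)
{
    folio->under_writeback = false;
}

/* Models folio_wait_stable(): waits only for stable-writes mappings. */
static void model_wait_stable(struct model_folio *folio)
{
    if (folio->mapping->stable_writes)
        model_wait_writeback(folio);
}

/* Old behaviour: rely on the FGP_STABLE-style wait alone.
 * Returns true if the folio is safe to modify. */
bool model_prepare_stable_only(struct model_folio *folio)
{
    assert(folio != NULL && folio->mapping != NULL);
    model_wait_stable(folio);
    return !folio->under_writeback;
}

/* Fixed behaviour: additionally wait for writeback unconditionally. */
bool model_prepare_explicit_wait(struct model_folio *folio)
{
    assert(folio != NULL && folio->mapping != NULL);
    model_wait_stable(folio);
    model_wait_writeback(folio);
    return !folio->under_writeback;
}
```

      For a mapping without the stable-writes flag, only the second variant
      returns a folio that is no longer under writeback.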
      
      Fixes: e820dbeb ("btrfs: convert btrfs_buffered_write() to use folios")
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. Dec 03, 2024
    • btrfs: fix missing snapshot drew unlock when root is dead during swap activation · 9c803c47
      Filipe Manana authored
      
      When activating a swap file we acquire the root's snapshot drew lock and
      then check if the root is dead, failing and returning with -EPERM if it's
      dead but without unlocking the root's snapshot lock. Fix this by adding
      the missing unlock.
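
      The shape of the bug and of the fix can be sketched in userspace. This
      is a simplified model: struct model_root and the model_*() functions are
      hypothetical, an integer stands in for the drew lock, and the sketch
      assumes the lock is released again before a successful return.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_root {
    int snapshot_lock_held;  /* 1 while the write side is held */
    bool dead;
};

static void model_lock(struct model_root *root)
{
    assert(root != NULL && root->snapshot_lock_held == 0);
    root->snapshot_lock_held = 1;
}

static void model_unlock(struct model_root *root)
{
    assert(root != NULL && root->snapshot_lock_held == 1);
    root->snapshot_lock_held = 0;
}

/* Buggy variant: the early return for a dead root leaks the lock. */
int model_swap_activate_buggy(struct model_root *root)
{
    model_lock(root);
    if (root->dead)
        return -EPERM;          /* lock is still held here */
    model_unlock(root);
    return 0;
}

/* Fixed variant: every exit path releases the lock. */
int model_swap_activate_fixed(struct model_root *root)
{
    model_lock(root);
    if (root->dead) {
        model_unlock(root);     /* the missing unlock */
        return -EPERM;
    }
    model_unlock(root);
    return 0;
}
```

      Both variants return -EPERM for a dead root, but only the fixed one
      leaves the lock released afterwards.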
      
      Fixes: 60021bd7 ("btrfs: prevent subvol with swapfile from being deleted")
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix mount failure due to remount races · 951a3f59
      Qu Wenruo authored
      
      [BUG]
      The following reproducer can cause btrfs mount to fail:
      
        dev="/dev/test/scratch1"
        mnt1="/mnt/test"
        mnt2="/mnt/scratch"
      
        mkfs.btrfs -f $dev
        mount $dev $mnt1
        btrfs subvolume create $mnt1/subvol1
        btrfs subvolume create $mnt1/subvol2
        umount $mnt1
      
        mount $dev $mnt1 -o subvol=subvol1
        while mount -o remount,ro $mnt1; do mount -o remount,rw $mnt1; done &
        bg=$!
      
        while mount $dev $mnt2 -o subvol=subvol2; do umount $mnt2; done
      
        kill $bg
        wait
        umount -R $mnt1
        umount -R $mnt2
      
      The script will fail with the following error:
      
        mount: /mnt/scratch: /dev/mapper/test-scratch1 already mounted on /mnt/test.
              dmesg(1) may have more information after failed mount system call.
        umount: /mnt/test: target is busy.
        umount: /mnt/scratch/: not mounted
      
      And there is no kernel error message.
      
      [CAUSE]
      During the btrfs mount, to support mounting different subvolumes with
      different RO/RW flags, we need to detect the mismatch and retry if needed:

        Retry with matching RO flags if the initial mount fails with -EBUSY.
      
      The problem is that during the retry we do not hold any super block
      lock (s_umount), so a remount process can change the RO flags of the
      original fs super block.

      If that happens, the retry can also fail with -EBUSY. This time we
      treat any failure as final, without another retry, which causes the
      above EBUSY mount failure.
      
      [FIX]
      The current retry behavior is racy because we do not have a super
      block, and thus have no way to hold s_umount to prevent the race with
      remount.
      
      Solve the root problem by allowing fc->sb_flags to mismatch from the
      sb->s_flags at btrfs_get_tree_super().
      
      Then at the re-entry point btrfs_get_tree_subvol(), manually check
      fc->sb_flags against sb->s_flags. If it's a RO->RW mismatch, reconfigure
      with the s_umount lock held.
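
      The check described above can be sketched in userspace as follows.
      This is only an illustration of the RO->RW comparison: MODEL_RDONLY,
      struct model_sb and model_reconcile_flags() are hypothetical names, and
      a boolean stands in for holding s_umount.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_RDONLY 0x1u   /* stands in for the read-only super block flag */

struct model_sb {
    unsigned int flags;
    bool umount_locked;     /* models holding s_umount */
};

/* Compare the requested flags with the existing super block's flags.
 * Returns true if the super block was switched from RO to RW. */
bool model_reconcile_flags(struct model_sb *sb, unsigned int fc_flags)
{
    bool changed = false;

    assert(sb != NULL);
    sb->umount_locked = true;   /* no remount can race while this is held */
    if ((sb->flags & MODEL_RDONLY) && !(fc_flags & MODEL_RDONLY)) {
        sb->flags &= ~MODEL_RDONLY;   /* RO->RW mismatch: reconfigure */
        changed = true;
    }
    sb->umount_locked = false;
    return changed;
}
```

      Because the comparison and the flag change happen under the same lock,
      the model has no window in which a concurrent remount could flip the
      flags between the check and the reconfigure.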
      
      Reported-by: Enno Gotthold <egotthold@suse.com>
      Reported-by: Fabian Vogt <fvogt@suse.com>
      [ Special thanks for the reproducer and early analysis pointing to btrfs. ]
      Fixes: f044b318 ("btrfs: handle the ro->rw transition for mounting different subvolumes")
      Link: https://bugzilla.suse.com/show_bug.cgi?id=1231836
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. Nov 29, 2024
    • btrfs: fix lockdep warnings on io_uring encoded reads · 22d2e48e
      Mark Harmstone authored
      
      Lockdep doesn't like the fact that btrfs_uring_read_extent() returns to
      userspace still holding the inode lock, even though we release it once
      the I/O finishes. Add calls to rwsem_release() and rwsem_acquire_read() to
      work round this.
      
      Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Fixes: 34310c44 ("btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)")
      Signed-off-by: Mark Harmstone <maharmstone@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: ref-verify: fix use-after-free after invalid ref action · 7c4e39f9
      Filipe Manana authored
      
      At btrfs_ref_tree_mod(), after we successfully insert the new ref entry
      (local variable 'ref') into the respective block entry's rbtree (local
      variable 'be'), if we find an unexpected action of BTRFS_DROP_DELAYED_REF,
      we error out and free the ref entry without removing it from the block
      entry's rbtree.

      Then, in the error path of btrfs_ref_tree_mod(), we call
      btrfs_free_ref_cache(), which iterates over all block entries and calls
      free_block_entry() for each one. When it reaches the block entry whose
      rbtree holds the freed ref entry, we trigger a use-after-free, since
      that rbtree still points to the ref entry that was freed in the error
      path of btrfs_ref_tree_mod().

      Fix this by removing the new ref entry from the rbtree before freeing it.
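
      The rule "unlink before free" can be sketched in userspace. To keep the
      example short it uses a singly linked list in place of the rbtree, which
      is a deliberate substitution. struct model_ref, struct model_block_entry
      and the model_*() functions are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct model_ref { struct model_ref *next; int id; };
struct model_block_entry { struct model_ref *refs; };

/* Remove `ref` from the block entry's list if it is linked there. */
static void model_unlink(struct model_block_entry *be, struct model_ref *ref)
{
    struct model_ref **pp = &be->refs;

    while (*pp != NULL && *pp != ref)
        pp = &(*pp)->next;
    if (*pp == ref)
        *pp = ref->next;
}

/* Insert a new ref, then reject it if the action turns out to be invalid. */
int model_ref_mod(struct model_block_entry *be, int id, bool invalid_action)
{
    struct model_ref *ref;

    assert(be != NULL);
    ref = calloc(1, sizeof(*ref));
    if (ref == NULL)
        return -ENOMEM;
    ref->id = id;
    ref->next = be->refs;
    be->refs = ref;

    if (invalid_action) {
        model_unlink(be, ref);  /* the fix: unlink before freeing */
        free(ref);
        return -EINVAL;
    }
    return 0;
}

/* Free every ref still linked to the block entry; returns how many. */
int model_free_block_entry(struct model_block_entry *be)
{
    int freed = 0;

    while (be->refs != NULL) {
        struct model_ref *ref = be->refs;

        be->refs = ref->next;
        free(ref);
        freed++;
    }
    return freed;
}
```

      Without the model_unlink() call, model_free_block_entry() would walk
      into the already freed node, which is the list analogue of the rb_first()
      access in the report.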
      
      Syzbot reported this with the following stack traces:
      
         BTRFS error (device loop0 state EA):   Ref action 2, root 5, ref_root 0, parent 8564736, owner 0, offset 0, num_refs 18446744073709551615
            __btrfs_mod_ref+0x7dd/0xac0 fs/btrfs/extent-tree.c:2523
            update_ref_for_cow+0x9cd/0x11f0 fs/btrfs/ctree.c:512
            btrfs_force_cow_block+0x9f6/0x1da0 fs/btrfs/ctree.c:594
            btrfs_cow_block+0x35e/0xa40 fs/btrfs/ctree.c:754
            btrfs_search_slot+0xbdd/0x30d0 fs/btrfs/ctree.c:2116
            btrfs_insert_empty_items+0x9c/0x1a0 fs/btrfs/ctree.c:4314
            btrfs_insert_empty_item fs/btrfs/ctree.h:669 [inline]
            btrfs_insert_orphan_item+0x1f1/0x320 fs/btrfs/orphan.c:23
            btrfs_orphan_add+0x6d/0x1a0 fs/btrfs/inode.c:3482
            btrfs_unlink+0x267/0x350 fs/btrfs/inode.c:4293
            vfs_unlink+0x365/0x650 fs/namei.c:4469
            do_unlinkat+0x4ae/0x830 fs/namei.c:4533
            __do_sys_unlinkat fs/namei.c:4576 [inline]
            __se_sys_unlinkat fs/namei.c:4569 [inline]
            __x64_sys_unlinkat+0xcc/0xf0 fs/namei.c:4569
            do_syscall_x64 arch/x86/entry/common.c:52 [inline]
            do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
            entry_SYSCALL_64_after_hwframe+0x77/0x7f
         BTRFS error (device loop0 state EA):   Ref action 1, root 5, ref_root 5, parent 0, owner 260, offset 0, num_refs 1
            __btrfs_mod_ref+0x76b/0xac0 fs/btrfs/extent-tree.c:2521
            update_ref_for_cow+0x96a/0x11f0
            btrfs_force_cow_block+0x9f6/0x1da0 fs/btrfs/ctree.c:594
            btrfs_cow_block+0x35e/0xa40 fs/btrfs/ctree.c:754
            btrfs_search_slot+0xbdd/0x30d0 fs/btrfs/ctree.c:2116
            btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:411
            __btrfs_update_delayed_inode+0x1e7/0xb90 fs/btrfs/delayed-inode.c:1030
            btrfs_update_delayed_inode fs/btrfs/delayed-inode.c:1114 [inline]
            __btrfs_commit_inode_delayed_items+0x2318/0x24a0 fs/btrfs/delayed-inode.c:1137
            __btrfs_run_delayed_items+0x213/0x490 fs/btrfs/delayed-inode.c:1171
            btrfs_commit_transaction+0x8a8/0x3740 fs/btrfs/transaction.c:2313
            prepare_to_relocate+0x3c4/0x4c0 fs/btrfs/relocation.c:3586
            relocate_block_group+0x16c/0xd40 fs/btrfs/relocation.c:3611
            btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4081
            btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3377
            __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4161
            btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4538
         BTRFS error (device loop0 state EA):   Ref action 2, root 5, ref_root 0, parent 8564736, owner 0, offset 0, num_refs 18446744073709551615
            __btrfs_mod_ref+0x7dd/0xac0 fs/btrfs/extent-tree.c:2523
            update_ref_for_cow+0x9cd/0x11f0 fs/btrfs/ctree.c:512
            btrfs_force_cow_block+0x9f6/0x1da0 fs/btrfs/ctree.c:594
            btrfs_cow_block+0x35e/0xa40 fs/btrfs/ctree.c:754
            btrfs_search_slot+0xbdd/0x30d0 fs/btrfs/ctree.c:2116
            btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:411
            __btrfs_update_delayed_inode+0x1e7/0xb90 fs/btrfs/delayed-inode.c:1030
            btrfs_update_delayed_inode fs/btrfs/delayed-inode.c:1114 [inline]
            __btrfs_commit_inode_delayed_items+0x2318/0x24a0 fs/btrfs/delayed-inode.c:1137
            __btrfs_run_delayed_items+0x213/0x490 fs/btrfs/delayed-inode.c:1171
            btrfs_commit_transaction+0x8a8/0x3740 fs/btrfs/transaction.c:2313
            prepare_to_relocate+0x3c4/0x4c0 fs/btrfs/relocation.c:3586
            relocate_block_group+0x16c/0xd40 fs/btrfs/relocation.c:3611
            btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4081
            btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3377
            __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4161
            btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4538
         ==================================================================
         BUG: KASAN: slab-use-after-free in rb_first+0x69/0x70 lib/rbtree.c:473
         Read of size 8 at addr ffff888042d1af38 by task syz.0.0/5329
      
         CPU: 0 UID: 0 PID: 5329 Comm: syz.0.0 Not tainted 6.12.0-rc7-syzkaller #0
         Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
         Call Trace:
          <TASK>
          __dump_stack lib/dump_stack.c:94 [inline]
          dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
          print_address_description mm/kasan/report.c:377 [inline]
          print_report+0x169/0x550 mm/kasan/report.c:488
          kasan_report+0x143/0x180 mm/kasan/report.c:601
          rb_first+0x69/0x70 lib/rbtree.c:473
          free_block_entry+0x78/0x230 fs/btrfs/ref-verify.c:248
          btrfs_free_ref_cache+0xa3/0x100 fs/btrfs/ref-verify.c:917
          btrfs_ref_tree_mod+0x139f/0x15e0 fs/btrfs/ref-verify.c:898
          btrfs_free_extent+0x33c/0x380 fs/btrfs/extent-tree.c:3544
          __btrfs_mod_ref+0x7dd/0xac0 fs/btrfs/extent-tree.c:2523
          update_ref_for_cow+0x9cd/0x11f0 fs/btrfs/ctree.c:512
          btrfs_force_cow_block+0x9f6/0x1da0 fs/btrfs/ctree.c:594
          btrfs_cow_block+0x35e/0xa40 fs/btrfs/ctree.c:754
          btrfs_search_slot+0xbdd/0x30d0 fs/btrfs/ctree.c:2116
          btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:411
          __btrfs_update_delayed_inode+0x1e7/0xb90 fs/btrfs/delayed-inode.c:1030
          btrfs_update_delayed_inode fs/btrfs/delayed-inode.c:1114 [inline]
          __btrfs_commit_inode_delayed_items+0x2318/0x24a0 fs/btrfs/delayed-inode.c:1137
          __btrfs_run_delayed_items+0x213/0x490 fs/btrfs/delayed-inode.c:1171
          btrfs_commit_transaction+0x8a8/0x3740 fs/btrfs/transaction.c:2313
          prepare_to_relocate+0x3c4/0x4c0 fs/btrfs/relocation.c:3586
          relocate_block_group+0x16c/0xd40 fs/btrfs/relocation.c:3611
          btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4081
          btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3377
          __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4161
          btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4538
          btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3673
          vfs_ioctl fs/ioctl.c:51 [inline]
          __do_sys_ioctl fs/ioctl.c:907 [inline]
          __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
          do_syscall_x64 arch/x86/entry/common.c:52 [inline]
          do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
          entry_SYSCALL_64_after_hwframe+0x77/0x7f
         RIP: 0033:0x7f996df7e719
         RSP: 002b:00007f996ede7038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
         RAX: ffffffffffffffda RBX: 00007f996e135f80 RCX: 00007f996df7e719
         RDX: 0000000020000180 RSI: 00000000c4009420 RDI: 0000000000000004
         RBP: 00007f996dff139e R08: 0000000000000000 R09: 0000000000000000
         R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
         R13: 0000000000000000 R14: 00007f996e135f80 R15: 00007fff79f32e68
          </TASK>
      
         Allocated by task 5329:
          kasan_save_stack mm/kasan/common.c:47 [inline]
          kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
          poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
          __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:394
          kasan_kmalloc include/linux/kasan.h:257 [inline]
          __kmalloc_cache_noprof+0x19c/0x2c0 mm/slub.c:4295
          kmalloc_noprof include/linux/slab.h:878 [inline]
          kzalloc_noprof include/linux/slab.h:1014 [inline]
          btrfs_ref_tree_mod+0x264/0x15e0 fs/btrfs/ref-verify.c:701
          btrfs_free_extent+0x33c/0x380 fs/btrfs/extent-tree.c:3544
          __btrfs_mod_ref+0x7dd/0xac0 fs/btrfs/extent-tree.c:2523
          update_ref_for_cow+0x9cd/0x11f0 fs/btrfs/ctree.c:512
          btrfs_force_cow_block+0x9f6/0x1da0 fs/btrfs/ctree.c:594
          btrfs_cow_block+0x35e/0xa40 fs/btrfs/ctree.c:754
          btrfs_search_slot+0xbdd/0x30d0 fs/btrfs/ctree.c:2116
          btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:411
          __btrfs_update_delayed_inode+0x1e7/0xb90 fs/btrfs/delayed-inode.c:1030
          btrfs_update_delayed_inode fs/btrfs/delayed-inode.c:1114 [inline]
          __btrfs_commit_inode_delayed_items+0x2318/0x24a0 fs/btrfs/delayed-inode.c:1137
          __btrfs_run_delayed_items+0x213/0x490 fs/btrfs/delayed-inode.c:1171
          btrfs_commit_transaction+0x8a8/0x3740 fs/btrfs/transaction.c:2313
          prepare_to_relocate+0x3c4/0x4c0 fs/btrfs/relocation.c:3586
          relocate_block_group+0x16c/0xd40 fs/btrfs/relocation.c:3611
          btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4081
          btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3377
          __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4161
          btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4538
          btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3673
          vfs_ioctl fs/ioctl.c:51 [inline]
          __do_sys_ioctl fs/ioctl.c:907 [inline]
          __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
          do_syscall_x64 arch/x86/entry/common.c:52 [inline]
          do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
          entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
         Freed by task 5329:
          kasan_save_stack mm/kasan/common.c:47 [inline]
          kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
          kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579
          poison_slab_object mm/kasan/common.c:247 [inline]
          __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264
          kasan_slab_free include/linux/kasan.h:230 [inline]
          slab_free_hook mm/slub.c:2342 [inline]
          slab_free mm/slub.c:4579 [inline]
          kfree+0x1a0/0x440 mm/slub.c:4727
          btrfs_ref_tree_mod+0x136c/0x15e0
          btrfs_free_extent+0x33c/0x380 fs/btrfs/extent-tree.c:3544
          __btrfs_mod_ref+0x7dd/0xac0 fs/btrfs/extent-tree.c:2523
          update_ref_for_cow+0x9cd/0x11f0 fs/btrfs/ctree.c:512
          btrfs_force_cow_block+0x9f6/0x1da0 fs/btrfs/ctree.c:594
          btrfs_cow_block+0x35e/0xa40 fs/btrfs/ctree.c:754
          btrfs_search_slot+0xbdd/0x30d0 fs/btrfs/ctree.c:2116
          btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:411
          __btrfs_update_delayed_inode+0x1e7/0xb90 fs/btrfs/delayed-inode.c:1030
          btrfs_update_delayed_inode fs/btrfs/delayed-inode.c:1114 [inline]
          __btrfs_commit_inode_delayed_items+0x2318/0x24a0 fs/btrfs/delayed-inode.c:1137
          __btrfs_run_delayed_items+0x213/0x490 fs/btrfs/delayed-inode.c:1171
          btrfs_commit_transaction+0x8a8/0x3740 fs/btrfs/transaction.c:2313
          prepare_to_relocate+0x3c4/0x4c0 fs/btrfs/relocation.c:3586
          relocate_block_group+0x16c/0xd40 fs/btrfs/relocation.c:3611
          btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4081
          btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3377
          __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4161
          btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4538
          btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3673
          vfs_ioctl fs/ioctl.c:51 [inline]
          __do_sys_ioctl fs/ioctl.c:907 [inline]
          __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
          do_syscall_x64 arch/x86/entry/common.c:52 [inline]
          do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
          entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
         The buggy address belongs to the object at ffff888042d1af00
          which belongs to the cache kmalloc-64 of size 64
         The buggy address is located 56 bytes inside of
          freed 64-byte region [ffff888042d1af00, ffff888042d1af40)
      
         The buggy address belongs to the physical page:
         page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x42d1a
         anon flags: 0x4fff00000000000(node=1|zone=1|lastcpupid=0x7ff)
         page_type: f5(slab)
         raw: 04fff00000000000 ffff88801ac418c0 0000000000000000 dead000000000001
         raw: 0000000000000000 0000000000200020 00000001f5000000 0000000000000000
         page dumped because: kasan: bad access detected
         page_owner tracks the page as allocated
         page last allocated via order 0, migratetype Unmovable, gfp_mask 0x52c40(GFP_NOFS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP), pid 5055, tgid 5055 (dhcpcd-run-hook), ts 40377240074, free_ts 40376848335
          set_page_owner include/linux/page_owner.h:32 [inline]
          post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1541
          prep_new_page mm/page_alloc.c:1549 [inline]
          get_page_from_freelist+0x3649/0x3790 mm/page_alloc.c:3459
          __alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4735
          alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265
          alloc_slab_page+0x6a/0x140 mm/slub.c:2412
          allocate_slab+0x5a/0x2f0 mm/slub.c:2578
          new_slab mm/slub.c:2631 [inline]
          ___slab_alloc+0xcd1/0x14b0 mm/slub.c:3818
          __slab_alloc+0x58/0xa0 mm/slub.c:3908
          __slab_alloc_node mm/slub.c:3961 [inline]
          slab_alloc_node mm/slub.c:4122 [inline]
          __do_kmalloc_node mm/slub.c:4263 [inline]
          __kmalloc_noprof+0x25a/0x400 mm/slub.c:4276
          kmalloc_noprof include/linux/slab.h:882 [inline]
          kzalloc_noprof include/linux/slab.h:1014 [inline]
          tomoyo_encode2 security/tomoyo/realpath.c:45 [inline]
          tomoyo_encode+0x26f/0x540 security/tomoyo/realpath.c:80
          tomoyo_realpath_from_path+0x59e/0x5e0 security/tomoyo/realpath.c:283
          tomoyo_get_realpath security/tomoyo/file.c:151 [inline]
          tomoyo_check_open_permission+0x255/0x500 security/tomoyo/file.c:771
          security_file_open+0x777/0x990 security/security.c:3109
          do_dentry_open+0x369/0x1460 fs/open.c:945
          vfs_open+0x3e/0x330 fs/open.c:1088
          do_open fs/namei.c:3774 [inline]
          path_openat+0x2c84/0x3590 fs/namei.c:3933
         page last free pid 5055 tgid 5055 stack trace:
          reset_page_owner include/linux/page_owner.h:25 [inline]
          free_pages_prepare mm/page_alloc.c:1112 [inline]
          free_unref_page+0xcfb/0xf20 mm/page_alloc.c:2642
          free_pipe_info+0x300/0x390 fs/pipe.c:860
          put_pipe_info fs/pipe.c:719 [inline]
          pipe_release+0x245/0x320 fs/pipe.c:742
          __fput+0x23f/0x880 fs/file_table.c:431
          __do_sys_close fs/open.c:1567 [inline]
          __se_sys_close fs/open.c:1552 [inline]
          __x64_sys_close+0x7f/0x110 fs/open.c:1552
          do_syscall_x64 arch/x86/entry/common.c:52 [inline]
          do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
          entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
         Memory state around the buggy address:
          ffff888042d1ae00: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
          ffff888042d1ae80: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
         >ffff888042d1af00: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
                                                 ^
          ffff888042d1af80: 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc
          ffff888042d1b000: 00 00 00 00 00 fc fc 00 00 00 00 00 fc fc 00 00
      
      Reported-by: <syzbot+7325f164162e200000c1@syzkaller.appspotmail.com>
      Link: https://lore.kernel.org/linux-btrfs/673723eb.050a0220.1324f8.00a8.GAE@google.com/T/#u
      
      
      Fixes: fd708b81 ("Btrfs: add a extent ref verify tool")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7c4e39f9
    • btrfs: add a sanity check for btrfs root in btrfs_search_slot() · 3ed51857
      Lizhi Xu authored
      
      Syzbot reports a null-ptr-deref in btrfs_search_slot().
      
      The reproducer is using rescue=ibadroots, and the extent tree root is
      corrupted, thus the extent root is NULL.
      
      When scrub tries to search the extent tree to gather the needed extent
      info, btrfs_search_slot() doesn't check whether the target root is NULL,
      resulting in the null-ptr-deref.
      
      Add a sanity check for the btrfs root before using it in
      btrfs_search_slot().
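      
      A minimal userspace sketch of this kind of guard follows. The demo_
      names are hypothetical stand-ins, not kernel API, and the -EINVAL return
      value is an assumption, since the message above does not state which
      error the real check returns.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct btrfs_root. */
struct demo_root {
	int root_id;
};

/*
 * Hypothetical search helper: reject a NULL root up front instead of
 * dereferencing it. The error code is an assumption for illustration.
 */
static int demo_search_slot(const struct demo_root *root, int key, int *slot)
{
	if (root == NULL)
		return -EINVAL;

	/* Placeholder for the real tree search. */
	*slot = (root->root_id + key) % 16;
	return 0;
}

/* Usage: a missing root is reported as an error, not dereferenced. */
static void demo_search_slot_example(void)
{
	struct demo_root root = { 2 };
	int slot = -1;

	assert(demo_search_slot(NULL, 3, &slot) == -EINVAL);
	assert(slot == -1);
	assert(demo_search_slot(&root, 3, &slot) == 0);
	assert(slot == 5);
}
```

      The point of the sketch is only that the check runs before the first
      dereference of the root, so callers such as scrub get an error code back.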
      
      Reported-by: <syzbot+3030e17bd57a73d39bd7@syzkaller.appspotmail.com>
      Fixes: 42437a63 ("btrfs: introduce mount option rescue=ignorebadroots")
      Link: https://syzkaller.appspot.com/bug?extid=3030e17bd57a73d39bd7
      
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Tested-by: <syzbot+3030e17bd57a73d39bd7@syzkaller.appspotmail.com>
      Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3ed51857
    • btrfs: don't loop for nowait writes when checking for cross references · ed67f2a9
      Filipe Manana authored
      
      When checking for delayed refs while verifying if there are cross
      references for a data extent, we stop if the path has nowait set and we
      can't trylock the delayed ref head's mutex, returning -EAGAIN with the
      goal of making the write fall back to a blocking context. However, we
      ignore the -EAGAIN at btrfs_cross_ref_exist() when check_delayed_ref()
      returns it, and keep looping instead of immediately returning the
      -EAGAIN to the caller.
      
      Fix this by not looping if we get -EAGAIN and we have a nowait path.
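      
      The loop condition can be modeled in userspace roughly as below. This
      is a sketch under assumptions: the demo_ types and functions are
      hypothetical, and the "busy" counter merely simulates a delayed ref
      head whose mutex cannot be taken for a few rounds.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a path carrying the nowait flag. */
struct demo_path {
	bool nowait;
};

/* Hypothetical ref head that reports -EAGAIN for the first busy_rounds calls. */
struct demo_ref_head {
	int busy_rounds;
	int calls;
};

static int demo_check_delayed_ref(struct demo_ref_head *head)
{
	head->calls++;
	if (head->busy_rounds > 0) {
		head->busy_rounds--;
		return -EAGAIN;
	}
	return 0;
}

/* Retry on -EAGAIN only when the caller is allowed to block. */
static int demo_cross_ref_exist(struct demo_ref_head *head,
				const struct demo_path *path)
{
	int ret;

	do {
		ret = demo_check_delayed_ref(head);
	} while (ret == -EAGAIN && !path->nowait);

	return ret;
}

/* Usage: a nowait caller gets -EAGAIN after one attempt, a blocking one retries. */
static void demo_cross_ref_example(void)
{
	struct demo_ref_head head = { 2, 0 };
	struct demo_path nowait_path = { true };
	struct demo_path blocking_path = { false };

	assert(demo_cross_ref_exist(&head, &nowait_path) == -EAGAIN);
	assert(head.calls == 1);
	assert(demo_cross_ref_exist(&head, &blocking_path) == 0);
}
```

      With nowait set, the caller sees -EAGAIN after a single attempt and can
      fall back to a blocking context; without it, the loop keeps retrying.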
      
      Fixes: 26ce9114 ("btrfs: make can_nocow_extent nowait compatible")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ed67f2a9
  5. Nov 28, 2024
    • btrfs: sysfs: advertise experimental features only if CONFIG_BTRFS_EXPERIMENTAL=y · b188ad77
      Filipe Manana authored
      
      We are advertising experimental features through sysfs if
      CONFIG_BTRFS_DEBUG is set, without checking whether
      CONFIG_BTRFS_EXPERIMENTAL is set. This is wrong, as it results in
      reporting experimental features as supported when
      CONFIG_BTRFS_EXPERIMENTAL is not set but CONFIG_BTRFS_DEBUG is set.
      
      Fix this by checking for CONFIG_BTRFS_EXPERIMENTAL instead of
      CONFIG_BTRFS_DEBUG.
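      
      The difference between the two guards can be illustrated with a small
      preprocessor sketch. The DEMO_ macros and both functions are
      hypothetical; they model a build with the debug option enabled and the
      experimental option disabled, and are not the actual sysfs code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model a kernel built with CONFIG_BTRFS_DEBUG=y but without
 * CONFIG_BTRFS_EXPERIMENTAL. The DEMO_ macros are hypothetical.
 */
#define DEMO_CONFIG_BTRFS_DEBUG 1
/* DEMO_CONFIG_BTRFS_EXPERIMENTAL is intentionally left undefined. */

/* Behavior described as wrong: keyed on the debug option. */
static bool demo_advertise_keyed_on_debug(void)
{
#ifdef DEMO_CONFIG_BTRFS_DEBUG
	return true;
#else
	return false;
#endif
}

/* Behavior after the fix: keyed on the experimental option. */
static bool demo_advertise_keyed_on_experimental(void)
{
#ifdef DEMO_CONFIG_BTRFS_EXPERIMENTAL
	return true;
#else
	return false;
#endif
}

/* Usage: only the debug-keyed guard claims experimental support here. */
static void demo_advertise_example(void)
{
	assert(demo_advertise_keyed_on_debug());
	assert(!demo_advertise_keyed_on_experimental());
}
```

      In this configuration only the guard keyed on the debug option reports
      experimental features, which is the mismatch described above.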
      
      Fixes: 67cd3f22 ("btrfs: split out CONFIG_BTRFS_EXPERIMENTAL from CONFIG_BTRFS_DEBUG")
      Reviewed-by: Neal Gompa <neal@gompa.dev>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b188ad77
    • btrfs: fix deadlock between transaction commits and extent locks · 7d6872cc
      Filipe Manana authored
      
      When running a workload with fsstress and duperemove (generic/561) we can
      hit a deadlock related to transaction commits and locking extent ranges,
      as described below.
      
      Task A hanging during a transaction commit, waiting for all other writers
      to complete:
      
        [178317.334817] INFO: task fsstress:555623 blocked for more than 120 seconds.
        [178317.335693]       Not tainted 6.12.0-rc6-btrfs-next-179+ #1
        [178317.336528] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [178317.337673] task:fsstress        state:D stack:0     pid:555623 tgid:555623 ppid:555620 flags:0x00004002
        [178317.337679] Call Trace:
        [178317.337681]  <TASK>
        [178317.337685]  __schedule+0x364/0xbe0
        [178317.337691]  schedule+0x26/0xa0
        [178317.337695]  btrfs_commit_transaction+0x5c5/0x1050 [btrfs]
        [178317.337769]  ? start_transaction+0xc4/0x800 [btrfs]
        [178317.337815]  ? __pfx_autoremove_wake_function+0x10/0x10
        [178317.337819]  btrfs_mksubvol+0x381/0x640 [btrfs]
        [178317.337878]  btrfs_mksnapshot+0x7a/0xb0 [btrfs]
        [178317.337935]  __btrfs_ioctl_snap_create+0x1bb/0x1d0 [btrfs]
        [178317.337995]  btrfs_ioctl_snap_create_v2+0x103/0x130 [btrfs]
        [178317.338053]  btrfs_ioctl+0x29b/0x2a90 [btrfs]
        [178317.338118]  ? kmem_cache_alloc_noprof+0x5f/0x2c0
        [178317.338126]  ? getname_flags+0x45/0x1f0
        [178317.338133]  ? _raw_spin_unlock+0x15/0x30
        [178317.338145]  ? __x64_sys_ioctl+0x88/0xc0
        [178317.338149]  __x64_sys_ioctl+0x88/0xc0
        [178317.338152]  do_syscall_64+0x4a/0x110
        [178317.338160]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
        [178317.338190] RIP: 0033:0x7f13c28e271b
      
      Which corresponds to line 2361 of transaction.c:
      
        $ cat -n fs/btrfs/transaction.c
        (...)
        2162  int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
        2163  {
        (...)
        2349          spin_lock(&fs_info->trans_lock);
        2350          add_pending_snapshot(trans);
        2351          cur_trans->state = TRANS_STATE_COMMIT_DOING;
        2352          spin_unlock(&fs_info->trans_lock);
        2353
        2354          /*
        2355           * The thread has started/joined the transaction thus it holds the
        2356           * lockdep map as a reader. It has to release it before acquiring the
        2357           * lockdep map as a writer.
        2358           */
        2359          btrfs_lockdep_release(fs_info, btrfs_trans_num_writers);
        2360          btrfs_might_wait_for_event(fs_info, btrfs_trans_num_writers);
        2361          wait_event(cur_trans->writer_wait,
        2362                     atomic_read(&cur_trans->num_writers) == 1);
        (...)
      
      The transaction is in the TRANS_STATE_COMMIT_DOING state and so it's
      waiting for all other existing writers to complete and release their
      transaction handle.
      
      Task B is running ordered extent completion and blocked waiting to lock an
      extent range in an inode's io tree:
      
        [178317.327411] INFO: task kworker/u48:8:554545 blocked for more than 120 seconds.
        [178317.328630]       Not tainted 6.12.0-rc6-btrfs-next-179+ #1
        [178317.329635] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [178317.330872] task:kworker/u48:8   state:D stack:0     pid:554545 tgid:554545 ppid:2      flags:0x00004000
        [178317.330878] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
        [178317.330944] Call Trace:
        [178317.330945]  <TASK>
        [178317.330947]  __schedule+0x364/0xbe0
        [178317.330952]  schedule+0x26/0xa0
        [178317.330955]  __lock_extent+0x337/0x3a0 [btrfs]
        [178317.331014]  ? __pfx_autoremove_wake_function+0x10/0x10
        [178317.331017]  btrfs_finish_one_ordered+0x47a/0xaa0 [btrfs]
        [178317.331074]  ? psi_group_change+0x132/0x2d0
        [178317.331078]  btrfs_work_helper+0xbd/0x370 [btrfs]
        [178317.331140]  process_scheduled_works+0xd3/0x460
        [178317.331144]  ? __pfx_worker_thread+0x10/0x10
        [178317.331146]  worker_thread+0x121/0x250
        [178317.331149]  ? __pfx_worker_thread+0x10/0x10
        [178317.331151]  kthread+0xe9/0x120
        [178317.331154]  ? __pfx_kthread+0x10/0x10
        [178317.331157]  ret_from_fork+0x2d/0x50
        [178317.331159]  ? __pfx_kthread+0x10/0x10
        [178317.331162]  ret_from_fork_asm+0x1a/0x30
      
      This extent range locking happens after joining the current transaction,
      so task A is waiting for task B to release its transaction handle
      (decrementing the transaction's num_writers counter).
      
      Task C, while doing a fiemap, tries to join the current transaction:
      
        [242682.812815] task:pool            state:D stack:0     pid:560767 tgid:560724 ppid:555622 flags:0x00004006
        [242682.812827] Call Trace:
        [242682.812856]  <TASK>
        [242682.812864]  __schedule+0x364/0xbe0
        [242682.812879]  ? _raw_spin_unlock_irqrestore+0x23/0x40
        [242682.812897]  schedule+0x26/0xa0
        [242682.812909]  wait_current_trans+0xd6/0x130 [btrfs]
        [242682.813148]  ? __pfx_autoremove_wake_function+0x10/0x10
        [242682.813162]  start_transaction+0x3d4/0x800 [btrfs]
        [242682.813399]  btrfs_is_data_extent_shared+0xd2/0x440 [btrfs]
        [242682.813723]  fiemap_process_hole+0x2a2/0x300 [btrfs]
        [242682.813995]  extent_fiemap+0x9b8/0xb80 [btrfs]
        [242682.814249]  btrfs_fiemap+0x78/0xc0 [btrfs]
        [242682.814501]  do_vfs_ioctl+0x2db/0xa50
        [242682.814519]  __x64_sys_ioctl+0x6a/0xc0
        [242682.814531]  do_syscall_64+0x4a/0x110
        [242682.814544]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
        [242682.814556] RIP: 0033:0x7efff595e71b
      
      It tries to join the current transaction, but it can't because the
      transaction is in the TRANS_STATE_COMMIT_DOING state, so
      join_transaction() returns -EBUSY to start_transaction() and makes it
      wait for the current transaction to complete. And while it's waiting
      for the transaction to complete, it's holding an extent range locked
      in the same inode that task B is operating on, which causes a deadlock
      between these 3 tasks. The extent range for the inode was locked at
      the start of the fiemap operation, early at extent_fiemap().
      
      In short these tasks deadlock because:
      
      1) Task A is waiting for task B to release its transaction handle;
      
      2) Task B is waiting to lock an extent range for an inode while holding a
         transaction handle open;
      
      3) Task C is waiting for the current transaction to complete (for task A
         to finish the transaction commit) while holding the extent range for
         the inode locked, so task B can't progress and release its transaction
         handle.
      
      This results in an ABBA deadlock involving transaction commits and extent
      locks. Extent locks are higher level locks, like inode VFS locks, and
      should always be acquired before joining or starting a transaction, but
      recently commit 2206265f ("btrfs: remove code duplication in ordered
      extent finishing") accidentally changed btrfs_finish_one_ordered() to do
      the transaction join before locking the extent range.
      
      Fix this by making sure that btrfs_finish_one_ordered() always locks the
      extent before joining a transaction and add an explicit comment about the
      need for this order.
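      
      The ordering rule (extent lock first, transaction join second) can be
      sketched as a single-threaded userspace model that only records whether
      the order was respected. All demo_ names are hypothetical, and the
      release step is illustrative; the real release order is not described
      above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-inode state tracking the two resources involved. */
struct demo_inode_state {
	bool extent_locked;
	bool trans_joined;
	bool order_violated; /* set when the extent lock is taken too late */
};

static void demo_lock_extent(struct demo_inode_state *s)
{
	/* Extent locks are the higher level lock and must come first. */
	if (s->trans_joined)
		s->order_violated = true;
	s->extent_locked = true;
}

static void demo_join_transaction(struct demo_inode_state *s)
{
	s->trans_joined = true;
}

static void demo_release_all(struct demo_inode_state *s)
{
	s->trans_joined = false;
	s->extent_locked = false;
}

/* Ordered extent completion in the required order. */
static bool demo_finish_ordered(struct demo_inode_state *s)
{
	demo_lock_extent(s);
	demo_join_transaction(s);
	/* ... metadata updates would happen here ... */
	demo_release_all(s);
	return !s->order_violated;
}

/* Usage: the correct order passes, the reversed order is flagged. */
static void demo_lock_order_example(void)
{
	struct demo_inode_state good = { false, false, false };
	struct demo_inode_state bad = { false, false, false };

	assert(demo_finish_ordered(&good));

	demo_join_transaction(&bad);
	demo_lock_extent(&bad);
	assert(bad.order_violated);
}
```

      The model does not reproduce the deadlock itself; it only makes the
      required acquisition order explicit and checkable.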
      
      Fixes: 2206265f ("btrfs: remove code duplication in ordered extent finishing")
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7d6872cc
    • btrfs: fix use-after-free in btrfs_encoded_read_endio() · 05b36b04
      Johannes Thumshirn authored
      
      Shinichiro reported the following use-after-free that sometimes
      happens in our CI system when running fstests' btrfs/284 on a TCMU
      runner device:
      
        BUG: KASAN: slab-use-after-free in lock_release+0x708/0x780
        Read of size 8 at addr ffff888106a83f18 by task kworker/u80:6/219
      
        CPU: 8 UID: 0 PID: 219 Comm: kworker/u80:6 Not tainted 6.12.0-rc6-kts+ #15
        Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
        Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
        Call Trace:
         <TASK>
         dump_stack_lvl+0x6e/0xa0
         ? lock_release+0x708/0x780
         print_report+0x174/0x505
         ? lock_release+0x708/0x780
         ? __virt_addr_valid+0x224/0x410
         ? lock_release+0x708/0x780
         kasan_report+0xda/0x1b0
         ? lock_release+0x708/0x780
         ? __wake_up+0x44/0x60
         lock_release+0x708/0x780
         ? __pfx_lock_release+0x10/0x10
         ? __pfx_do_raw_spin_lock+0x10/0x10
         ? lock_is_held_type+0x9a/0x110
         _raw_spin_unlock_irqrestore+0x1f/0x60
         __wake_up+0x44/0x60
         btrfs_encoded_read_endio+0x14b/0x190 [btrfs]
         btrfs_check_read_bio+0x8d9/0x1360 [btrfs]
         ? lock_release+0x1b0/0x780
         ? trace_lock_acquire+0x12f/0x1a0
         ? __pfx_btrfs_check_read_bio+0x10/0x10 [btrfs]
         ? process_one_work+0x7e3/0x1460
         ? lock_acquire+0x31/0xc0
         ? process_one_work+0x7e3/0x1460
         process_one_work+0x85c/0x1460
         ? __pfx_process_one_work+0x10/0x10
         ? assign_work+0x16c/0x240
         worker_thread+0x5e6/0xfc0
         ? __pfx_worker_thread+0x10/0x10
         kthread+0x2c3/0x3a0
         ? __pfx_kthread+0x10/0x10
         ret_from_fork+0x31/0x70
         ? __pfx_kthread+0x10/0x10
         ret_from_fork_asm+0x1a/0x30
         </TASK>
      
        Allocated by task 3661:
         kasan_save_stack+0x30/0x50
         kasan_save_track+0x14/0x30
         __kasan_kmalloc+0xaa/0xb0
         btrfs_encoded_read_regular_fill_pages+0x16c/0x6d0 [btrfs]
         send_extent_data+0xf0f/0x24a0 [btrfs]
         process_extent+0x48a/0x1830 [btrfs]
         changed_cb+0x178b/0x2ea0 [btrfs]
         btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
         _btrfs_ioctl_send+0x117/0x330 [btrfs]
         btrfs_ioctl+0x184a/0x60a0 [btrfs]
         __x64_sys_ioctl+0x12e/0x1a0
         do_syscall_64+0x95/0x180
         entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
        Freed by task 3661:
         kasan_save_stack+0x30/0x50
         kasan_save_track+0x14/0x30
         kasan_save_free_info+0x3b/0x70
         __kasan_slab_free+0x4f/0x70
         kfree+0x143/0x490
         btrfs_encoded_read_regular_fill_pages+0x531/0x6d0 [btrfs]
         send_extent_data+0xf0f/0x24a0 [btrfs]
         process_extent+0x48a/0x1830 [btrfs]
         changed_cb+0x178b/0x2ea0 [btrfs]
         btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
         _btrfs_ioctl_send+0x117/0x330 [btrfs]
         btrfs_ioctl+0x184a/0x60a0 [btrfs]
         __x64_sys_ioctl+0x12e/0x1a0
         do_syscall_64+0x95/0x180
         entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
        The buggy address belongs to the object at ffff888106a83f00
         which belongs to the cache kmalloc-rnd-07-96 of size 96
        The buggy address is located 24 bytes inside of
         freed 96-byte region [ffff888106a83f00, ffff888106a83f60)
      
        The buggy address belongs to the physical page:
        page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888106a83800 pfn:0x106a83
        flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
        page_type: f5(slab)
        raw: 0017ffffc0000000 ffff888100053680 ffffea0004917200 0000000000000004
        raw: ffff888106a83800 0000000080200019 00000001f5000000 0000000000000000
        page dumped because: kasan: bad access detected
      
        Memory state around the buggy address:
         ffff888106a83e00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
         ffff888106a83e80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
        >ffff888106a83f00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
                                    ^
         ffff888106a83f80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
         ffff888106a84000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        ==================================================================
      
      Further analysis of the trace and the crash dump's vmcore file shows that
      btrfs_encoded_read_endio() calls wake_up() on the wait_queue that is in
      the private data passed to the end_io handler.
      
      Commit 4ff47df4 ("btrfs: move priv off stack in
      btrfs_encoded_read_regular_fill_pages()") moved 'struct
      btrfs_encoded_read_private' off the stack.
      
      Before that commit one can see a corruption of the private data when
      analyzing the vmcore after a crash:
      
      *(struct btrfs_encoded_read_private *)0xffff88815626eec8 = {
      	.wait = (wait_queue_head_t){
      		.lock = (spinlock_t){
      			.rlock = (struct raw_spinlock){
      				.raw_lock = (arch_spinlock_t){
      					.val = (atomic_t){
      						.counter = (int)-2005885696,
      					},
      					.locked = (u8)0,
      					.pending = (u8)157,
      					.locked_pending = (u16)40192,
      					.tail = (u16)34928,
      				},
      				.magic = (unsigned int)536325682,
      				.owner_cpu = (unsigned int)29,
      				.owner = (void *)__SCT__tp_func_btrfs_transaction_commit+0x0 = 0x0,
      				.dep_map = (struct lockdep_map){
      					.key = (struct lock_class_key *)0xffff8881575a3b6c,
      					.class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
      					.name = (const char *)0xffff88815626f100 = "",
      					.wait_type_outer = (u8)37,
      					.wait_type_inner = (u8)178,
      					.lock_type = (u8)154,
      				},
      			},
      			.__padding = (u8 [24]){ 0, 157, 112, 136, 50, 174, 247, 31, 29 },
      			.dep_map = (struct lockdep_map){
      				.key = (struct lock_class_key *)0xffff8881575a3b6c,
      				.class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
      				.name = (const char *)0xffff88815626f100 = "",
      				.wait_type_outer = (u8)37,
      				.wait_type_inner = (u8)178,
      				.lock_type = (u8)154,
      			},
      		},
      		.head = (struct list_head){
      			.next = (struct list_head *)0x112cca,
      			.prev = (struct list_head *)0x47,
      		},
      	},
      	.pending = (atomic_t){
      		.counter = (int)-1491499288,
      	},
      	.status = (blk_status_t)130,
      }
      
      Here we can see several indicators of in-memory data corruption, e.g. the
      large negative atomic values of ->pending or
      ->wait->lock->rlock->raw_lock->val, as well as the bogus spinlock magic
      0x1ff7ae32 (decimal 536325682 above) instead of 0xdead4ead or the bogus
      pointer values for ->wait->head.
      
      Fix the corruption by changing atomic_dec_return() to
      atomic_dec_and_test(), as atomic_dec_return() is defined as two
      instructions on x86_64, whereas atomic_dec_and_test() is defined as a
      single atomic operation. With atomic_dec_return(), the counter value can
      already be decremented while the if statement in
      btrfs_encoded_read_endio() is not completely processed, i.e. the test
      for 0 has not completed. If another thread continues executing
      btrfs_encoded_read_regular_fill_pages(), the atomic_dec_return() there
      can see the already updated ->pending counter and continue by freeing
      the private data. When the endio handler then continues, its test for 0
      succeeds and the wait_queue is woken up, resulting in a use-after-free.
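      
      As a userspace analogue, a decrement-and-test can be written with C11
      atomics as one read-modify-write whose result says whether this caller
      took the counter to zero. This is a sketch, not the kernel code: the
      demo_ names are hypothetical and the wakeups field stands in for the
      wake_up() call on the wait queue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct btrfs_encoded_read_private. */
struct demo_read_private {
	atomic_int pending;
	int wakeups; /* counts how often the waiter would be woken */
};

/*
 * Userspace analogue of a decrement-and-test: atomic_fetch_sub() returns
 * the value before the decrement, so a result of 1 means this caller is
 * the one that brought the counter to zero.
 */
static bool demo_dec_and_test(atomic_int *v)
{
	return atomic_fetch_sub(v, 1) == 1;
}

/* Hypothetical completion handler: wake the waiter only on the last I/O. */
static void demo_endio(struct demo_read_private *priv)
{
	if (demo_dec_and_test(&priv->pending))
		priv->wakeups++;
}

/* Usage: three outstanding I/Os produce exactly one wakeup. */
static void demo_endio_example(void)
{
	struct demo_read_private priv;

	atomic_init(&priv.pending, 3);
	priv.wakeups = 0;

	demo_endio(&priv);
	demo_endio(&priv);
	assert(priv.wakeups == 0);
	demo_endio(&priv);
	assert(priv.wakeups == 1);
}
```

      Because the decrement and the zero check come from the same atomic
      operation, exactly one caller observes the transition to zero.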
      
      Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com>
      Fixes: 1881fba8 ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      05b36b04
  6. Nov 11, 2024