  1. Oct 24, 2022
      block: fix up elevator_type refcounting · 8ed40ee3
      Jinlong Chen authored
      The current reference management logic for io scheduler modules has
      refcount problems. For example, blk_mq_init_sched may fail before or
      after calling e->ops.init_sched. If it fails before the call, it does
      nothing with the reference to the io scheduler module; if it fails
      after the call, it releases the reference by calling
      kobject_put(&eq->kobj).
      
      Since the callers of blk_mq_init_sched can't know exactly where the
      failure happened, they can't handle the reference to the io scheduler
      module properly: releasing the reference on failure results in a
      double release if blk_mq_init_sched has already released it, and not
      releasing it leaks the reference if blk_mq_init_sched did not release
      it either.
      
      The same problem also exists in io schedulers' init_sched implementations.
      
      We could address the problem by adding release calls to the error
      handling paths of blk_mq_init_sched and of the init_sched
      implementations, but that is counterintuitive and would require
      modifying every existing io scheduler.
      
      Instead, we make elevator_alloc take the io scheduler module
      reference, which is later dropped by elevator_release, and we match
      each elevator_get with an elevator_put. Each reference to an io
      scheduler module then has its own explicit getter and releaser, and
      we no longer need to worry about the refcount problems (a sketch of
      the resulting get/put pairing follows this entry).
      
      The bugs and the patch can be validated with tools here:
      https://github.com/nickyc975/linux_elv_refcnt_bug.git
      
      [hch: split out a few bits into separate patches, use a non-try
            module_get in elevator_alloc]
      
      Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20221020064819.1469928-5-hch@lst.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8ed40ee3
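
      What follows is a minimal user-space sketch of the ownership rule the
      patch establishes, not the kernel code itself; the underscored names
      are hypothetical stand-ins for module_get/module_put and the elevator
      hooks. The allocation takes the module reference and the matching
      release is the single place that drops it, so error paths never have
      to guess who owns the reference.

      #include <stdatomic.h>
      #include <stdlib.h>

      struct elv_module { atomic_int refcnt; };

      static void module_get_(struct elv_module *m) { atomic_fetch_add(&m->refcnt, 1); }
      static void module_put_(struct elv_module *m) { atomic_fetch_sub(&m->refcnt, 1); }

      struct elevator_queue_ { struct elv_module *owner; };

      /* elevator_alloc takes the module reference... */
      static struct elevator_queue_ *elevator_alloc_(struct elv_module *owner)
      {
              struct elevator_queue_ *eq = malloc(sizeof(*eq));

              if (!eq)
                      return NULL;
              eq->owner = owner;
              module_get_(owner);     /* owned by eq from here on */
              return eq;
      }

      /* ...and elevator_release is the single place that drops it: any
       * failure after a successful elevator_alloc is handled simply by
       * releasing eq, regardless of where the failure happened. */
      static void elevator_release_(struct elevator_queue_ *eq)
      {
              module_put_(eq->owner);
              free(eq);
      }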
  2. Mar 18, 2022
      block: limit request dispatch loop duration · 572299f0
      Shin'ichiro Kawasaki authored
      
      When IO requests are issued continuously and the target block device
      handles them faster than they arrive, the request dispatch loop keeps
      on dispatching the arriving requests for a very long time, more than
      a minute. Since the loop runs as a workqueue worker task, the very
      long loop duration triggers the workqueue watchdog timeout and a
      BUG [1].
      
      To avoid the very long loop duration, break the loop periodically.
      While opportunities to dispatch requests remain, check need_resched():
      if it returns true, the dispatch loop has already consumed its time
      slice, so reschedule the dispatch work and break the loop. Under
      heavy IO load, however, need_resched() may not return true for 20-30
      seconds. To cover that case, also track the time spent in the
      dispatch loop with jiffies: if more than 1 second has been spent,
      reschedule the dispatch work and break the loop (a sketch follows
      this entry).
      
      [1]
      
      [  609.691437] BUG: workqueue lockup - pool cpus=10 node=1 flags=0x0 nice=-20 stuck for 35s!
      [  609.701820] Showing busy workqueues and worker pools:
      [  609.707915] workqueue events: flags=0x0
      [  609.712615]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
      [  609.712626]     pending: drm_fb_helper_damage_work [drm_kms_helper]
      [  609.712687] workqueue events_freezable: flags=0x4
      [  609.732943]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
      [  609.732952]     pending: pci_pme_list_scan
      [  609.732968] workqueue events_power_efficient: flags=0x80
      [  609.751947]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
      [  609.751955]     pending: neigh_managed_work
      [  609.752018] workqueue kblockd: flags=0x18
      [  609.769480]   pwq 21: cpus=10 node=1 flags=0x0 nice=-20 active=3/256 refcnt=4
      [  609.769488]     in-flight: 1020:blk_mq_run_work_fn
      [  609.769498]     pending: blk_mq_timeout_work, blk_mq_run_work_fn
      [  609.769744] pool 21: cpus=10 node=1 flags=0x0 nice=-20 hung=35s workers=2 idle: 67
      [  639.899730] BUG: workqueue lockup - pool cpus=10 node=1 flags=0x0 nice=-20 stuck for 66s!
      [  639.909513] Showing busy workqueues and worker pools:
      [  639.915404] workqueue events: flags=0x0
      [  639.920197]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
      [  639.920215]     pending: drm_fb_helper_damage_work [drm_kms_helper]
      [  639.920365] workqueue kblockd: flags=0x18
      [  639.939932]   pwq 21: cpus=10 node=1 flags=0x0 nice=-20 active=3/256 refcnt=4
      [  639.939942]     in-flight: 1020:blk_mq_run_work_fn
      [  639.939955]     pending: blk_mq_timeout_work, blk_mq_run_work_fn
      [  639.940212] pool 21: cpus=10 node=1 flags=0x0 nice=-20 hung=66s workers=2 idle: 67
      
      Fixes: 6e6fcbc2 ("blk-mq: support batching dispatch in case of io")
      Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Cc: stable@vger.kernel.org # v5.10+
      Link: https://lore.kernel.org/linux-block/20220310091649.zypaem5lkyfadymg@shindev/
      Link: https://lore.kernel.org/r/20220318022641.133484-1-shinichiro.kawasaki@wdc.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      572299f0
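
      A simplified, kernel-style sketch of the bound described above (not
      the exact diff; dispatch_one_request() is a hypothetical stand-in
      for the scheduler dispatch step):

      /* Bound the dispatch loop: stop when the scheduler wants the CPU
       * back or after roughly one second, and let the workqueue resume
       * the dispatch work later instead of looping on. */
      unsigned long end = jiffies + HZ;

      do {
              if (!dispatch_one_request(hctx))        /* hypothetical */
                      break;
              if (need_resched() || time_after(jiffies, end)) {
                      /* reschedule the dispatch work and break */
                      blk_mq_delay_run_hw_queue(hctx, 0);
                      break;
              }
      } while (1);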
  3. Nov 05, 2021
      block: move queue enter logic into blk_mq_submit_bio() · 900e0807
      Jens Axboe authored
      
      Retain the old logic for the fops-based submit, but for our internal
      blk_mq_submit_bio(), move the queue-entering logic into the core
      function itself (a rough sketch follows this entry).
      
      We need to be a bit careful if going into the scheduler, as a
      scheduler or the queue mappings can arbitrarily change before we have
      entered the queue. Have the bio scheduler mapping do that separately;
      it's a very cheap operation compared to actually doing the merge
      locking and lookups.
      
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      [axboe: update to check merge post submit_bio_checks() doing remap...]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      900e0807
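
      A rough, hypothetical sketch of the resulting shape of
      blk_mq_submit_bio() (heavily abridged; the real function also deals
      with plugging, merging, and request allocation):

      void blk_mq_submit_bio(struct bio *bio)
      {
              struct request_queue *q = bdev_get_queue(bio->bi_bdev);

              /* The core function enters the queue itself now; the fops
               * based submit path keeps its old queue-enter logic. */
              if (unlikely(bio_queue_enter(bio)))
                      return;

              /* The scheduler or queue mappings may have changed before
               * the queue was entered, so the bio scheduler merge path
               * redoes its mapping separately; that is cheap compared to
               * the merge locking and lookups. */

              /* ... merge attempt, request allocation, issue ... */
      }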
  4. Jun 25, 2021
      blk: Fix lock inversion between ioc lock and bfqd lock · fd2ef39c
      Jan Kara authored
      
      Lockdep complains about lock inversion between ioc->lock and bfqd->lock:
      
      bfqd -> ioc:
       put_io_context+0x33/0x90 -> ioc->lock grabbed
       blk_mq_free_request+0x51/0x140
       blk_put_request+0xe/0x10
       blk_attempt_req_merge+0x1d/0x30
       elv_attempt_insert_merge+0x56/0xa0
       blk_mq_sched_try_insert_merge+0x4b/0x60
       bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed
       blk_mq_sched_insert_requests+0xd6/0x2b0
       blk_mq_flush_plug_list+0x154/0x280
       blk_finish_plug+0x40/0x60
       ext4_writepages+0x696/0x1320
       do_writepages+0x1c/0x80
       __filemap_fdatawrite_range+0xd7/0x120
       sync_file_range+0xac/0xf0
      
      ioc -> bfqd:
       bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed
       put_io_context_active+0x78/0xb0 -> ioc->lock grabbed
       exit_io_context+0x48/0x50
       do_exit+0x7e9/0xdd0
       do_group_exit+0x54/0xc0
      
      To avoid this inversion, change blk_mq_sched_try_insert_merge() to
      not free the merged request but rather leave that up to the caller,
      similarly to blk_mq_sched_try_merge(). And in bfq_insert_requests(),
      make sure to free all the merged requests only after dropping
      bfqd->lock (a sketch of the pattern follows this entry).
      
      Fixes: aee69d78 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Acked-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      fd2ef39c
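
      A simplified sketch of the resulting pattern on the bfq side (not
      the literal diff): merged requests are collected on a list while
      bfqd->lock is held and are only freed, which may take ioc->lock,
      after the lock has been dropped.

      LIST_HEAD(free);

      spin_lock_irq(&bfqd->lock);
      if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
              /* rq was merged; the now-redundant requests sit on 'free' */
              spin_unlock_irq(&bfqd->lock);
              /* safe: bfqd->lock is no longer held when ioc->lock may be
               * taken while freeing the requests */
              blk_mq_free_requests(&free);
              return;
      }
      /* ... normal insertion path ... */
      spin_unlock_irq(&bfqd->lock);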
  5. Jun 03, 2021
      block: Do not pull requests from the scheduler when we cannot dispatch them · 61347154
      Jan Kara authored
      
      Provided the device driver does not implement dispatch budget
      accounting (which only SCSI does), the loop in
      __blk_mq_do_dispatch_sched() pulls requests from the IO scheduler as
      long as it is willing to give any out. That defeats the scheduling
      heuristics inside the scheduler by creating the false impression that
      the device can take more IO when it in fact cannot.
      
      For example, with the BFQ IO scheduler on top of a virtio-blk device,
      setting the blkio cgroup weight has barely any impact on the observed
      throughput of async IO, because __blk_mq_do_dispatch_sched() always
      sucks all the queued IO out of BFQ. BFQ first submits IO from
      higher-weight cgroups, but when that is all dispatched, it gives out
      IO of lower-weight cgroups as well. And then we have to wait for all
      of this IO to be dispatched to the disk (which means a lot of it
      actually has to complete) before the IO scheduler is queried again
      for dispatching more requests. This completely destroys any service
      differentiation.
      
      So grab the request tag already when a request is pulled out of the
      IO scheduler in __blk_mq_do_dispatch_sched(), and do not pull any
      more requests if we cannot get one, because we are then unlikely to
      be able to dispatch them. That way only a single request waits in the
      dispatch list for some tag to free up (a sketch follows this entry).
      
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20210603104721.6309-1-jack@suse.cz
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      61347154
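
      A simplified sketch of the loop body after the change (paraphrased,
      not the exact diff):

      rq = e->type->ops.dispatch_request(hctx);
      if (!rq)
              break;

      /* Take the driver tag right away: if none is free, the device
       * cannot accept the request anyway, so park this single request
       * on the dispatch list and stop pulling more from the scheduler. */
      if (!blk_mq_get_driver_tag(rq)) {
              list_add(&rq->queuelist, &rq_list);     /* at most one waiter */
              break;
      }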
  6. May 11, 2021
      kyber: fix out of bounds access when preempted · efed9a33
      Omar Sandoval authored
      
      __blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and
      passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx
      for the current CPU again and uses that to get the corresponding Kyber
      context in the passed hctx. However, the thread may be preempted between
      the two calls to blk_mq_get_ctx(), and the ctx returned the second time
      may no longer correspond to the passed hctx. This "works"
      accidentally most of the time, but it can cause us to read garbage
      if the second ctx came from an hctx with more ctxs than the first
      one (i.e., if ctx->index_hw[hctx->type] > hctx->nr_ctx).
      
      This manifested as this UBSAN array index out of bounds error reported
      by Jakub:
      
      UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9
      index 13106 is out of range for type 'long unsigned int [128]'
      Call Trace:
       dump_stack+0xa4/0xe5
       ubsan_epilogue+0x5/0x40
       __ubsan_handle_out_of_bounds.cold.13+0x2a/0x34
       queued_spin_lock_slowpath+0x476/0x480
       do_raw_spin_lock+0x1c2/0x1d0
       kyber_bio_merge+0x112/0x180
       blk_mq_submit_bio+0x1f5/0x1100
       submit_bio_noacct+0x7b0/0x870
       submit_bio+0xc2/0x3a0
       btrfs_map_bio+0x4f0/0x9d0
       btrfs_submit_data_bio+0x24e/0x310
       submit_one_bio+0x7f/0xb0
       submit_extent_page+0xc4/0x440
       __extent_writepage_io+0x2b8/0x5e0
       __extent_writepage+0x28d/0x6e0
       extent_write_cache_pages+0x4d7/0x7a0
       extent_writepages+0xa2/0x110
       do_writepages+0x8f/0x180
       __writeback_single_inode+0x99/0x7f0
       writeback_sb_inodes+0x34e/0x790
       __writeback_inodes_wb+0x9e/0x120
       wb_writeback+0x4d2/0x660
       wb_workfn+0x64d/0xa10
       process_one_work+0x53a/0xa80
       worker_thread+0x69/0x5b0
       kthread+0x20b/0x240
       ret_from_fork+0x1f/0x30
      
      Only Kyber uses the hctx, so fix it by passing the request_queue to
      ->bio_merge() instead. BFQ and mq-deadline just use the queue, and
      Kyber can map the queues itself to avoid the mismatch (a sketch
      follows this entry).
      
      Fixes: a6088845 ("block: kyber: make kyber more friendly with merging")
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      efed9a33
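
      A simplified sketch of the fixed kyber_bio_merge() (abridged): ctx
      and hctx are now derived together from the request_queue, so they
      always agree even if the task was migrated between CPUs earlier.

      static bool kyber_bio_merge(struct request_queue *q, struct bio *bio,
                                  unsigned int nr_segs)
      {
              struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
              struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, bio->bi_opf, ctx);
              struct kyber_hctx_data *khd = hctx->sched_data;
              /* ctx and hctx come from the same mapping snapshot, so this
               * index is within hctx->nr_ctx by construction */
              struct kyber_ctx_queue *kcq = &khd->kcqs[ctx->index_hw[hctx->type]];

              /* ... attempt the merge against kcq under its lock ... */
              return false;   /* abridged */
      }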