  1. Oct 18, 2021
  2. Jul 27, 2021
  3. Jun 25, 2021
    • blk: Fix lock inversion between ioc lock and bfqd lock · fd2ef39c
      Jan Kara authored
      
      Lockdep complains about lock inversion between ioc->lock and bfqd->lock:
      
      bfqd -> ioc:
       put_io_context+0x33/0x90 -> ioc->lock grabbed
       blk_mq_free_request+0x51/0x140
       blk_put_request+0xe/0x10
       blk_attempt_req_merge+0x1d/0x30
       elv_attempt_insert_merge+0x56/0xa0
       blk_mq_sched_try_insert_merge+0x4b/0x60
       bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed
       blk_mq_sched_insert_requests+0xd6/0x2b0
       blk_mq_flush_plug_list+0x154/0x280
       blk_finish_plug+0x40/0x60
       ext4_writepages+0x696/0x1320
       do_writepages+0x1c/0x80
       __filemap_fdatawrite_range+0xd7/0x120
       sync_file_range+0xac/0xf0
      
       ioc -> bfqd:
       bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed
       put_io_context_active+0x78/0xb0 -> ioc->lock grabbed
       exit_io_context+0x48/0x50
       do_exit+0x7e9/0xdd0
       do_group_exit+0x54/0xc0
      
       To avoid this inversion we change blk_mq_sched_try_insert_merge() to not
       free the merged request but rather leave that up to the caller, similarly
       to blk_mq_sched_try_merge(). And in bfq_insert_requests() we make sure
       to free all the merged requests after dropping bfqd->lock.
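       
       A minimal sketch of the resulting pattern in BFQ's insert path, assuming the
       list-collecting parameter of blk_mq_sched_try_insert_merge() and a helper
       such as blk_mq_free_requests() as used by this series (treat exact names and
       signatures as illustrative):
       
        static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
                                       bool at_head)
        {
                struct request_queue *q = hctx->queue;
                struct bfq_data *bfqd = q->elevator->elevator_data;
                LIST_HEAD(free);
       
                spin_lock_irq(&bfqd->lock);
                if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
                        /* Merged: drop bfqd->lock first, and only then put the
                         * merged requests, so ioc->lock is never taken while
                         * bfqd->lock is held. */
                        spin_unlock_irq(&bfqd->lock);
                        blk_mq_free_requests(&free);
                        return;
                }
       
                /* ... normal insertion path ... */
                spin_unlock_irq(&bfqd->lock);
        }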
      
      Fixes: aee69d78 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
       Reviewed-by: Ming Lei <ming.lei@redhat.com>
       Acked-by: Paolo Valente <paolo.valente@linaro.org>
       Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz
      
      
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
      fd2ef39c
  4. Jun 18, 2021
  5. Jun 03, 2021
    • block: Do not pull requests from the scheduler when we cannot dispatch them · 61347154
      Jan Kara authored
      
       Provided the device driver does not implement dispatch budget accounting
       (which only SCSI does), the loop in __blk_mq_do_dispatch_sched() pulls
       requests from the IO scheduler as long as it is willing to give out any.
       That defeats scheduling heuristics inside the scheduler by creating the
       false impression that the device can take more IO when it in fact
       cannot.
      
       For example, with the BFQ IO scheduler on top of a virtio-blk device,
       setting the blkio cgroup weight has barely any impact on the observed
       throughput of async IO because __blk_mq_do_dispatch_sched() always pulls
       out all the IO queued in BFQ. BFQ first submits IO from higher-weight
       cgroups but, when that is all dispatched, it gives out IO of lower-weight
       cgroups as well. And then we have to wait for all this IO to be dispatched
       to the disk (which means a lot of it actually has to complete) before the
       IO scheduler is queried again for dispatching more requests. This
       completely destroys any service differentiation.
      
       So grab a request tag for the request pulled out of the IO scheduler
       already in __blk_mq_do_dispatch_sched(), and do not pull any more requests
       if we cannot get it, because we are unlikely to be able to dispatch it.
       That way only a single request is going to wait in the dispatch list for a
       tag to free up.
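       
       A simplified, illustrative sketch of the changed loop in
       __blk_mq_do_dispatch_sched() (budget handling elided; helper names follow
       mainline, but treat details as approximate):
       
        do {
                struct request *rq;
       
                if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
                        break;
       
                /* ... budget handling elided ... */
       
                rq = e->type->ops.dispatch_request(hctx);
                if (!rq)
                        break;
       
                /* Grab the driver tag right away; if none is available we are
                 * unlikely to be able to dispatch, so park this one request on
                 * hctx->dispatch and stop pulling more from the scheduler. */
                if (!blk_mq_get_driver_tag(rq)) {
                        blk_mq_request_bypass_insert(rq, false, true);
                        break;
                }
       
                list_add_tail(&rq->queuelist, &rq_list);
        } while (1);
        /* rq_list is then handed to blk_mq_dispatch_rq_list() as before. */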
      
       Reviewed-by: Ming Lei <ming.lei@redhat.com>
       Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20210603104721.6309-1-jack@suse.cz
      
      
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
      61347154
  6. May 24, 2021
  7. May 11, 2021
    • kyber: fix out of bounds access when preempted · efed9a33
      Omar Sandoval authored
      
      __blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and
      passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx
      for the current CPU again and uses that to get the corresponding Kyber
      context in the passed hctx. However, the thread may be preempted between
      the two calls to blk_mq_get_ctx(), and the ctx returned the second time
      may no longer correspond to the passed hctx. This "works" accidentally
      most of the time, but it can cause us to read garbage if the second ctx
      came from an hctx with more ctx's than the first one (i.e., if
      ctx->index_hw[hctx->type] > hctx->nr_ctx).
      
      This manifested as this UBSAN array index out of bounds error reported
      by Jakub:
      
      UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9
      index 13106 is out of range for type 'long unsigned int [128]'
      Call Trace:
       dump_stack+0xa4/0xe5
       ubsan_epilogue+0x5/0x40
       __ubsan_handle_out_of_bounds.cold.13+0x2a/0x34
       queued_spin_lock_slowpath+0x476/0x480
       do_raw_spin_lock+0x1c2/0x1d0
       kyber_bio_merge+0x112/0x180
       blk_mq_submit_bio+0x1f5/0x1100
       submit_bio_noacct+0x7b0/0x870
       submit_bio+0xc2/0x3a0
       btrfs_map_bio+0x4f0/0x9d0
       btrfs_submit_data_bio+0x24e/0x310
       submit_one_bio+0x7f/0xb0
       submit_extent_page+0xc4/0x440
       __extent_writepage_io+0x2b8/0x5e0
       __extent_writepage+0x28d/0x6e0
       extent_write_cache_pages+0x4d7/0x7a0
       extent_writepages+0xa2/0x110
       do_writepages+0x8f/0x180
       __writeback_single_inode+0x99/0x7f0
       writeback_sb_inodes+0x34e/0x790
       __writeback_inodes_wb+0x9e/0x120
       wb_writeback+0x4d2/0x660
       wb_workfn+0x64d/0xa10
       process_one_work+0x53a/0xa80
       worker_thread+0x69/0x5b0
       kthread+0x20b/0x240
       ret_from_fork+0x1f/0x30
      
      Only Kyber uses the hctx, so fix it by passing the request_queue to
      ->bio_merge() instead. BFQ and mq-deadline just use that, and Kyber can
      map the queues itself to avoid the mismatch.
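       
       A sketch of the fixed kyber_bio_merge(), which now receives the
       request_queue and derives a consistent ctx/hctx pair itself (close to the
       patch, but treat the details as approximate):
       
        static bool kyber_bio_merge(struct request_queue *q, struct bio *bio,
                                    unsigned int nr_segs)
        {
                struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
                struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, bio->bi_opf, ctx);
                struct kyber_hctx_data *khd = hctx->sched_data;
                /* ctx and hctx are derived from the same q here, so this index
                 * is guaranteed to stay within khd->kcqs[]. */
                struct kyber_ctx_queue *kcq = &khd->kcqs[ctx->index_hw[hctx->type]];
                unsigned int sched_domain = kyber_sched_domain(bio->bi_opf);
                struct list_head *rq_list = &kcq->rq_list[sched_domain];
                bool merged;
       
                spin_lock(&kcq->lock);
                merged = blk_bio_list_merge(q, rq_list, bio, nr_segs);
                spin_unlock(&kcq->lock);
       
                return merged;
        }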
      
      Fixes: a6088845 ("block: kyber: make kyber more friendly with merging")
       Reported-by: Jakub Kicinski <kuba@kernel.org>
       Signed-off-by: Omar Sandoval <osandov@fb.com>
      Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.com
      
      
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
      efed9a33
  8. Apr 08, 2021
  9. Mar 04, 2021
  10. Mar 01, 2021
  11. Feb 22, 2021
  12. Dec 04, 2020
  13. Oct 09, 2020
    • blk-mq: get rid of the dead flush handle code path · c7281524
      Yufen Yu authored
      
       After commit 923218f6 ("blk-mq: don't allocate driver tag upfront
       for flush rq"), blk_mq_submit_bio() calls blk_insert_flush() directly to
       handle a flush request, rather than blk_mq_sched_insert_request(), when an
       elevator is in use.
      
       As a result, every flush request either has the RQF_FLUSH_SEQ flag set by
       the time blk_mq_sched_insert_request() is called, or has already been
       inserted into hctx->dispatch. So, remove the dead code path.
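       
       For reference, the routing in blk_mq_submit_bio() that makes the removed
       path unreachable looks roughly like this (simplified sketch, not the literal
       kernel code):
       
        if (unlikely(is_flush_fua)) {
                /* Flush/FUA requests bypass the elevator entirely and are
                 * driven by the flush state machine via hctx->dispatch. */
                blk_insert_flush(rq);
                blk_mq_run_hw_queue(data.hctx, true);
        } else if (q->elevator) {
                /* Only non-flush requests reach the scheduler insert path. */
                blk_mq_sched_insert_request(rq, false, true, true);
        }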
      
       Signed-off-by: Yufen Yu <yuyufen@huawei.com>
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c7281524
  14. Oct 06, 2020
  15. Sep 08, 2020
  16. Sep 03, 2020
    • blk-mq: Facilitate a shared sbitmap per tagset · 32bc15af
      John Garry authored
       Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ...) support
       multiple reply queues with a single set of hostwide tags.
      
       In addition, these drivers want to use the interrupt affinity assignment
       provided by pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed
       in [0], CPU hotplug may cause in-flight IO completions to not be serviced
       when an interrupt is shut down. That problem is solved in commit bf0beec0
       ("blk-mq: drain I/O when all CPUs in a hctx are offline").
      
       However, to take advantage of that blk-mq feature, the HBA HW queues are
       required to be mapped to blk-mq hctxs; to do that, the HBA HW queues need
       to be exposed to the upper layer.
      
       In making that transition, the request tags per SCSI command are no
       longer unique per Scsi host - they are only unique per hctx. As such, the
       HBA LLDD would have to generate this tag internally, which has a certain
       performance overhead.
      
       Another problem is that blk-mq assumes the host may accept
       (Scsi_host.can_queue * #hw queues) commands. In commit 6eb045e0 ("scsi:
       core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
       counter, which had prevented the LLDD from being sent more than .can_queue
       commands, was removed; however, it should still be ensured that the block
       layer does not issue more than .can_queue commands to the Scsi host.
      
      To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
      which may be requested at init time.
      
      New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
      tagset to indicate whether the shared sbitmap should be used.
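       
       A sketch of how an HBA driver might request the hostwide shared sbitmap
       when allocating its tag set (the driver ops name and surrounding fields
       here are illustrative, not from this patch):
       
        struct blk_mq_tag_set *set = &shost->tag_set;
        int ret;
       
        memset(set, 0, sizeof(*set));
        set->ops          = &my_hba_mq_ops;     /* hypothetical driver ops */
        set->nr_hw_queues = nr_hw_queues;       /* expose the real HW queues */
        set->queue_depth  = shost->can_queue;   /* hostwide depth */
        set->numa_node    = NUMA_NO_NODE;
        set->flags        = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_TAG_HCTX_SHARED;
       
        ret = blk_mq_alloc_tag_set(set);
        if (ret)
                return ret;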
      
       Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
       is still allocated per hctx; the reason for this is that if tags and
       requests were only allocated for a single hctx - like hctx0 - it could
       break block drivers which expect a request to be associated with a specific
       hctx, i.e. not always hctx0. This does introduce extra memory usage.
      
      This change is based on work originally from Ming Lei in [1] and from
      Bart's suggestion in [2].
      
      [0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
      [1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
      [2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be
      
      
      
       Signed-off-by: John Garry <john.garry@huawei.com>
       Tested-by: Don Brace <don.brace@microsemi.com> # SCSI resv cmds patches used
       Tested-by: Douglas Gilbert <dgilbert@interlog.com>
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
      32bc15af
    • blk-mq: Pass flags for tag init/free · 1c0706a7
      John Garry authored
      
      Pass hctx/tagset flags argument down to blk_mq_init_tags() and
      blk_mq_free_tags() for selective init/free.
      
      For now, make it include the alloc policy flag, which can be evaluated
      when needed (in blk_mq_init_tags()).
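       
       A sketch of the resulting blk_mq_init_tags() shape, with the alloc policy
       decoded from the flags where it is needed (simplified; the exact prototype
       and body may differ from the patch):
       
        struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
                                             unsigned int reserved_tags,
                                             int node, unsigned int flags)
        {
                /* The allocation policy now travels inside 'flags' and is only
                 * decoded where it is actually evaluated. */
                int alloc_policy = BLK_MQ_FLAG_TO_ALLOC_POLICY(flags);
                struct blk_mq_tags *tags;
       
                tags = kzalloc_node(sizeof(*tags), GFP_KERNEL, node);
                if (!tags)
                        return NULL;
       
                tags->nr_tags = total_tags;
                tags->nr_reserved_tags = reserved_tags;
                /* ... initialize the sbitmaps according to alloc_policy ... */
                return tags;
        }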
      
       Signed-off-by: John Garry <john.garry@huawei.com>
       Tested-by: Douglas Gilbert <dgilbert@interlog.com>
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1c0706a7
  17. Sep 01, 2020
  18. Aug 17, 2020
    • blk-mq: order adding requests to hctx->dispatch and checking SCHED_RESTART · d7d8535f
      Ming Lei authored
       The SCHED_RESTART code path is relied on to re-run the queue for requests
       waiting in hctx->dispatch. Meanwhile, the SCHED_RESTART flag is checked
       when adding requests to hctx->dispatch.
      
       Memory barriers have to be used to order the following two pairs of operations:
      
      1) adding requests to hctx->dispatch and checking SCHED_RESTART in
      blk_mq_dispatch_rq_list()
      
      2) clearing SCHED_RESTART and checking if there is request in hctx->dispatch
      in blk_mq_sched_restart().
      
       Without the added memory barriers, either:
       
       1) blk_mq_sched_restart() may miss requests added to hctx->dispatch while
       blk_mq_dispatch_rq_list() observes SCHED_RESTART and therefore does not run
       the queue on the dispatch side,
       
       or
       
       2) blk_mq_dispatch_rq_list() still sees SCHED_RESTART and does not run the
       queue on the dispatch side, while the check for requests in hctx->dispatch
       in blk_mq_sched_restart() misses them.
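       
       The required pairing then looks roughly as follows (an illustrative sketch
       of both sides; the exact barrier primitives in the patch may differ):
       
        /* Dispatch side, blk_mq_dispatch_rq_list() (sketch): */
        spin_lock(&hctx->lock);
        list_splice_tail_init(list, &hctx->dispatch);
        spin_unlock(&hctx->lock);
       
        /* Order adding requests to hctx->dispatch against reading SCHED_RESTART
         * below; pairs with the barrier on the restart side. */
        smp_mb();
       
        if (!blk_mq_sched_needs_restart(hctx))
                blk_mq_run_hw_queue(hctx, true);
       
        /* Restart side, blk_mq_sched_restart() (sketch): */
        clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
       
        /* Order clearing SCHED_RESTART against re-checking hctx->dispatch when
         * re-running the queue. */
        smp_mb__after_atomic();
       
        blk_mq_run_hw_queue(hctx, true);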
      
       An IO hang in the ltp/fs_fill test was reported by the kernel test robot:
       
       	https://lkml.org/lkml/2020/7/26/77
       
       It turns out to be caused by the above out-of-order operations, and the IO
       hang can no longer be observed after applying this patch.
      
      Fixes: bd166ef1 ("blk-mq-sched: add framework for MQ capable IO schedulers")
       Reported-by: kernel test robot <rong.a.chen@intel.com>
       Signed-off-by: Ming Lei <ming.lei@redhat.com>
       Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Jeffery <djeffery@redhat.com>
      Cc: <stable@vger.kernel.org>
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d7d8535f
  19. Jul 31, 2020
  20. Jul 10, 2020
  21. Jun 30, 2020
  22. Apr 29, 2020
  23. Apr 24, 2020
    • block: Limit number of items taken from the I/O scheduler in one go · 28d65729
      Salman Qazi authored
      
       Flushes bypass the I/O scheduler and get added to hctx->dispatch
       in blk_mq_sched_bypass_insert.  This can happen while a kworker is running
       the hctx->run_work work item and is past the point in
       blk_mq_sched_dispatch_requests where hctx->dispatch is checked.
      
      The blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time,
      because the I/O scheduler can feed an arbitrary number of commands.
      
      Since we have only one hctx->run_work, the commands waiting in
      hctx->dispatch will wait an arbitrary length of time for run_work to be
      rerun.
      
      A similar phenomenon exists with dispatches from the software queue.
      
       The solution is to poll hctx->dispatch in blk_mq_do_dispatch_sched and
       blk_mq_do_dispatch_ctx and, when it is not empty, return from the run_work
       handler and let it be rerun.
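       
       A sketch of that polling in blk_mq_do_dispatch_sched() (illustrative; the
       return-value plumbing back to the run_work handler is elided):
       
        static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
        {
                struct elevator_queue *e = hctx->queue->elevator;
                int ret = 0;
       
                do {
                        /* Requests may have landed in hctx->dispatch behind our
                         * back (e.g. flushes via blk_mq_sched_bypass_insert).
                         * Bail out and ask for the run_work handler to be rerun. */
                        if (!list_empty_careful(&hctx->dispatch)) {
                                ret = -EAGAIN;
                                break;
                        }
       
                        if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
                                break;
       
                        /* ... get budget, pull one request, dispatch it ... */
                } while (1);
       
                return ret;
        }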
      
       Signed-off-by: Salman Qazi <sqazi@google.com>
       Reviewed-by: Ming Lei <ming.lei@redhat.com>
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
      28d65729
  24. Apr 20, 2020
    • blk-mq: Rerun dispatching in the case of budget contention · a0823421
      Douglas Anderson authored
      
      If ever a thread running blk-mq code tries to get budget and fails it
      immediately stops doing work and assumes that whenever budget is freed
      up that queues will be kicked and whatever work the thread was trying
      to do will be tried again.
      
      One path where budget is freed and queues are kicked in the normal
      case can be seen in scsi_finish_command().  Specifically:
      - scsi_finish_command()
        - scsi_device_unbusy()
          - # Decrement "device_busy", AKA release budget
        - scsi_io_completion()
          - scsi_end_request()
            - blk_mq_run_hw_queues()
      
      The above is all well and good.  The problem comes up when a thread
      claims the budget but then releases it without actually dispatching
      any work.  Since we didn't schedule any work we'll never run the path
      of finishing work / kicking the queues.
      
      This isn't often actually a problem which is why this issue has
      existed for a while and nobody noticed.  Specifically we only get into
      this situation when we unexpectedly found that we weren't going to do
      any work.  Code that later receives new work kicks the queues.  All
      good, right?
      
      The problem shows up, however, if timing is just wrong and we hit a
      race.  To see this race let's think about the case where we only have
      a budget of 1 (only one thread can hold budget).  Now imagine that a
      thread got budget and then decided not to dispatch work.  It's about
      to call put_budget() but then the thread gets context switched out for
      a long, long time.  While in this state, any and all kicks of the
       queue (like when we receive new work) will be no-ops because
      nobody can get budget.  Finally the thread holding budget gets to run
      again and returns.  All the normal kicks will have been no-ops and we
      have an I/O stall.
      
      As you can see from the above, you need just the right timing to see
       the race.  To start with, it only happens if we thought we
      had work, actually managed to get the budget, but then actually didn't
      have work.  That's pretty rare to start with.  Even then, there's
      usually a very small amount of time between realizing that there's no
      work and putting the budget.  During this small amount of time new
      work has to come in and the queue kick has to make it all the way to
      trying to get the budget and fail.  It's pretty unlikely.
      
      One case where this could have failed is illustrated by an example of
      threads running blk_mq_do_dispatch_sched():
      
      * Threads A and B both run has_work() at the same time with the same
        "hctx".  Imagine has_work() is exact.  There's no lock, so it's OK
        if Thread A and B both get back true.
      * Thread B gets interrupted for a long time right after it decides
        that there is work.  Maybe its CPU gets an interrupt and the
        interrupt handler is slow.
      * Thread A runs, get budget, dispatches work.
      * Thread A's work finishes and budget is released.
      * Thread B finally runs again and gets budget.
      * Since Thread A already took care of the work and no new work has
        come in, Thread B will get NULL from dispatch_request().  I believe
        this is specifically why dispatch_request() is allowed to return
        NULL in the first place if has_work() must be exact.
      * Thread B will now be holding the budget and is about to call
        put_budget(), but hasn't called it yet.
      * Thread B gets interrupted for a long time (again).  Dang interrupts.
      * Now Thread C (maybe with a different "hctx" but the same queue)
        comes along and runs blk_mq_do_dispatch_sched().
      * Thread C won't do anything because it can't get budget.
      * Finally Thread B will run again and put the budget without kicking
        any queues.
      
      Even though the example above is with blk_mq_do_dispatch_sched() I
      believe the race is possible any time someone is holding budget but
      doesn't do work.
      
      Unfortunately, the unlikely has become more likely if you happen to be
      using the BFQ I/O scheduler.  BFQ, by design, sometimes returns "true"
      for has_work() but then NULL for dispatch_request() and stays in this
      state for a while (currently up to 9 ms).  Suddenly you only need one
      race to hit, not two races in a row.  With my current setup this is
      easy to reproduce in reboot tests and traces have actually shown that
      we hit a race similar to the one described above.
      
      Note that we only need to fix blk_mq_do_dispatch_sched() and
      blk_mq_do_dispatch_ctx() and not the other places that put budget.  In
      other cases we know that we have work to do on at least one "hctx" and
      code already exists to kick that "hctx"'s queue.  When that work
      finally finishes all the queues will be kicked using the normal flow.
      
      One last note is that (at least in the SCSI case) budget is shared by
      all "hctx"s that have the same queue.  Thus we need to make sure to
      kick the whole queue, not just re-run dispatching on a single "hctx".
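       
       The fix then amounts to kicking every hw queue of the request_queue when
       we put budget without having dispatched anything; roughly (the delay
       helper and constant are the ones this series uses, treat them as
       approximate):
       
        if (!blk_mq_get_dispatch_budget(hctx))
                break;
       
        rq = e->type->ops.dispatch_request(hctx);
        if (!rq) {
                blk_mq_put_dispatch_budget(hctx);
                /* We held budget but dispatched nothing, so no completion will
                 * kick the queues for us.  Kick every hctx of this queue (they
                 * share the budget) after a short delay. */
                blk_mq_delay_run_hw_queues(q, BLK_MQ_BUDGET_DELAY);
                break;
        }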
      
       Signed-off-by: Douglas Anderson <dianders@chromium.org>
       Reviewed-by: Ming Lei <ming.lei@redhat.com>
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a0823421
  25. Mar 12, 2020
    • blk-mq: insert flush request to the front of dispatch queue · cc3200ea
      Ming Lei authored
      
       Commit 01e99aec ("blk-mq: insert passthrough request into
       hctx->dispatch directly") may change behaviour so that a flush request is
       added to the tail of the dispatch queue, depending on the 'add_head'
       parameter of blk_mq_sched_insert_request.
      
       It turns out this causes a performance regression on NCQ controllers,
       because flush is a non-NCQ command and cannot be queued while any NCQ
       command is in flight. When the flush rq is added to the front of
       hctx->dispatch, it is easier (because of S_SCHED_RESTART) for extra time to
       be added to the flush rq's latency than when it is added to the tail of the
       dispatch queue; the chance of flush merging is then increased, and fewer
       flush requests are issued to the controller.
      
       So always insert flush requests at the front of the dispatch queue, just as
       before commit 01e99aec ("blk-mq: insert passthrough request
       into hctx->dispatch directly") was applied.
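       
       A sketch of the shape of the change in blk_mq_sched_insert_request()
       (reconstructed; treat it as illustrative rather than the literal diff):
       
        if (blk_mq_sched_bypass_insert(hctx, !!e, rq)) {
                /* Flush requests always go to the front of hctx->dispatch;
                 * other bypassed requests keep the caller's position. */
                at_head = (rq->rq_flags & RQF_FLUSH_SEQ) ? true : at_head;
                blk_mq_request_bypass_insert(rq, at_head, false);
                goto run;
        }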
      
      Cc: Damien Le Moal <Damien.LeMoal@wdc.com>
      Cc: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
       Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Fixes: 01e99aec ("blk-mq: insert passthrough request into hctx->dispatch directly")
       Signed-off-by: Ming Lei <ming.lei@redhat.com>
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
      cc3200ea
  26. Feb 25, 2020
    • blk-mq: insert passthrough request into hctx->dispatch directly · 01e99aec
      Ming Lei authored
      
       For some reason, a device may get into a state in which it cannot handle
       FS requests, so STS_RESOURCE is always returned and the FS request
       is added to hctx->dispatch. However, a passthrough request may
       be required at that time to fix the problem. If the passthrough
       request is added to the scheduler queue, blk-mq has no chance to
       dispatch it, given that we prioritize requests in hctx->dispatch.
       Then the FS IO request may never be completed, and an IO hang results.
      
       So the passthrough request has to be added to hctx->dispatch directly
       to fix the IO hang.
      
       Fix this issue by inserting the passthrough request into hctx->dispatch
       directly, together with adding the FS request to the tail of
       hctx->dispatch in blk_mq_dispatch_rq_list(). FS requests are actually added
       to the tail of hctx->dispatch by default; see blk_mq_request_bypass_insert().
      
       This makes it consistent with the original legacy IO request
       path, in which passthrough requests were always added to q->queue_head.
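       
       A sketch of the bypass check this introduces (simplified; the flush-to-front
       tweak shown in the previous entry later builds on the same path):
       
        static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
                                               bool has_sched, struct request *rq)
        {
                /* Dispatch flush and passthrough requests directly: they must
                 * reach hctx->dispatch even when the device keeps failing FS
                 * requests with STS_RESOURCE, otherwise they could sit in the
                 * scheduler queue forever. */
                if ((rq->rq_flags & RQF_FLUSH_SEQ) || blk_rq_is_passthrough(rq))
                        return true;
       
                if (has_sched)
                        rq->rq_flags |= RQF_SORTED;
       
                return false;
        }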
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Ewan D. Milne <emilne@redhat.com>
       Signed-off-by: Ming Lei <ming.lei@redhat.com>
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
      01e99aec