  1. Dec 03, 2021
  2. Nov 29, 2021
  3. Nov 11, 2021
  4. Nov 05, 2021
    • block: move queue enter logic into blk_mq_submit_bio() · 900e0807
      Jens Axboe authored
      
      Retain the old logic for the fops-based submit, but for our internal
      blk_mq_submit_bio(), move the queue entering logic into the core
      function itself.

      We need to be a bit careful if going into the scheduler, as the
      scheduler or queue mappings can arbitrarily change before we have
      entered the queue. Have the bio scheduler mapping enter the queue
      separately; it's a very cheap operation compared to actually doing the
      merge locking and lookups. (See the illustrative sketch after this
      entry.)
      
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      [axboe: update to check merge post submit_bio_checks() doing remap...]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
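
      For illustration only: a minimal userspace sketch of the ordering this
      commit relies on, i.e. take the queue-enter reference before consulting
      any mapping or scheduler state. The names (fake_queue, queue_enter,
      submit_bio_sketch) are hypothetical stand-ins, not the blk-mq
      implementation.

      /* Illustrative userspace sketch only -- not kernel code. */
      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdio.h>

      struct fake_queue {
          atomic_int enter_count;   /* stands in for the percpu queue-enter ref */
          atomic_bool frozen;       /* set while mappings/scheduler may change  */
          int nr_hw_queues;         /* "queue mapping" state we must not trust
                                       before entering                          */
      };

      /* Take a reference before touching any per-queue state. */
      static bool queue_enter(struct fake_queue *q)
      {
          if (atomic_load(&q->frozen))
              return false;              /* queue is being torn down/remapped */
          atomic_fetch_add(&q->enter_count, 1);
          return true;
      }

      static void queue_exit(struct fake_queue *q)
      {
          atomic_fetch_sub(&q->enter_count, 1);
      }

      /* Submission path sketch: enter first, only then consult the mapping
       * (and, conceptually, the scheduler merge logic). */
      static void submit_bio_sketch(struct fake_queue *q, int bio_id)
      {
          if (!queue_enter(q)) {
              printf("bio %d: queue unavailable, bailing out\n", bio_id);
              return;
          }
          /* Safe to read mapping state only while the enter ref is held. */
          int hw_queue = bio_id % q->nr_hw_queues;
          printf("bio %d mapped to hw queue %d\n", bio_id, hw_queue);
          queue_exit(q);
      }

      int main(void)
      {
          struct fake_queue q = { .enter_count = 0, .frozen = false,
                                  .nr_hw_queues = 4 };
          for (int i = 0; i < 3; i++)
              submit_bio_sketch(&q, i);
          return 0;
      }
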
  5. Oct 30, 2021
  6. Oct 22, 2021
  7. Oct 21, 2021
  8. Oct 18, 2021
  9. Jul 27, 2021
  10. Jun 25, 2021
    • blk: Fix lock inversion between ioc lock and bfqd lock · fd2ef39c
      Jan Kara authored
      
      Lockdep complains about lock inversion between ioc->lock and bfqd->lock:
      
      bfqd -> ioc:
       put_io_context+0x33/0x90 -> ioc->lock grabbed
       blk_mq_free_request+0x51/0x140
       blk_put_request+0xe/0x10
       blk_attempt_req_merge+0x1d/0x30
       elv_attempt_insert_merge+0x56/0xa0
       blk_mq_sched_try_insert_merge+0x4b/0x60
       bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed
       blk_mq_sched_insert_requests+0xd6/0x2b0
       blk_mq_flush_plug_list+0x154/0x280
       blk_finish_plug+0x40/0x60
       ext4_writepages+0x696/0x1320
       do_writepages+0x1c/0x80
       __filemap_fdatawrite_range+0xd7/0x120
       sync_file_range+0xac/0xf0
      
      ioc -> bfqd:
       bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed
       put_io_context_active+0x78/0xb0 -> ioc->lock grabbed
       exit_io_context+0x48/0x50
       do_exit+0x7e9/0xdd0
       do_group_exit+0x54/0xc0
      
      To avoid this inversion, we change blk_mq_sched_try_insert_merge() to
      not free the merged request but rather leave that up to the caller,
      similarly to blk_mq_sched_try_merge(). And in bfq_insert_requests() we
      make sure to free all the merged requests after dropping bfqd->lock.
      (See the illustrative sketch after this entry.)
      
      Fixes: aee69d78 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Acked-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
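
      For illustration only: a minimal userspace sketch of the lock-ordering
      technique used here, i.e. collect merged requests while holding the
      scheduler lock and free them only after it is dropped, so the inner
      lock is never taken under the outer one. sched_lock/free_lock and the
      helpers are hypothetical stand-ins for bfqd->lock, ioc->lock and the
      blk-mq/BFQ code, not the actual implementation.

      #include <pthread.h>
      #include <stdio.h>
      #include <stdlib.h>

      struct req {
          int id;
          struct req *next;
      };

      static pthread_mutex_t sched_lock = PTHREAD_MUTEX_INITIALIZER; /* ~ bfqd->lock */
      static pthread_mutex_t free_lock  = PTHREAD_MUTEX_INITIALIZER; /* ~ ioc->lock  */

      /* Freeing a request needs free_lock; taking it under sched_lock would
       * create the ordering that conflicts with the exit path. */
      static void free_req(struct req *rq)
      {
          pthread_mutex_lock(&free_lock);
          printf("freeing merged request %d\n", rq->id);
          pthread_mutex_unlock(&free_lock);
          free(rq);
      }

      /* Merge step: runs under sched_lock, but instead of freeing the merged
       * request it only moves it onto a caller-provided "to free" list. */
      static void try_merge(struct req *rq, struct req **free_list)
      {
          rq->next = *free_list;
          *free_list = rq;
      }

      static void insert_requests(struct req *rq)
      {
          struct req *free_list = NULL;

          pthread_mutex_lock(&sched_lock);
          try_merge(rq, &free_list);          /* merge bookkeeping only */
          pthread_mutex_unlock(&sched_lock);

          /* Free everything only after sched_lock is dropped, so free_lock
           * is never acquired inside sched_lock. */
          while (free_list) {
              struct req *next = free_list->next;
              free_req(free_list);
              free_list = next;
          }
      }

      int main(void)
      {
          struct req *rq = malloc(sizeof(*rq));

          if (!rq)
              return 1;
          rq->id = 42;
          rq->next = NULL;
          insert_requests(rq);
          return 0;
      }
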
  11. Jun 18, 2021
  12. Jun 03, 2021
    • block: Do not pull requests from the scheduler when we cannot dispatch them · 61347154
      Jan Kara authored
      
      Provided the device driver does not implement dispatch budget accounting
      (which only SCSI does), the loop in __blk_mq_do_dispatch_sched() pulls
      requests from the IO scheduler as long as it is willing to give any out.
      That defeats the scheduling heuristics inside the scheduler by creating
      the false impression that the device can take more IO when it in fact
      cannot.
      
      For example, with the BFQ IO scheduler on top of a virtio-blk device,
      setting the blkio cgroup weight has barely any impact on the observed
      throughput of async IO because __blk_mq_do_dispatch_sched() always sucks
      out all the IO queued in BFQ. BFQ first submits IO from higher-weight
      cgroups, but when that is all dispatched, it will give out IO of
      lower-weight cgroups as well. And then we have to wait for all this IO
      to be dispatched to the disk (which means a lot of it actually has to
      complete) before the IO scheduler is queried again for dispatching more
      requests. This completely destroys any service differentiation.

      So grab a request tag for a request pulled out of the IO scheduler
      already in __blk_mq_do_dispatch_sched(), and do not pull any more
      requests if we cannot get one, because we are unlikely to be able to
      dispatch them. That way only a single request is going to wait in the
      dispatch list for a tag to free up. (See the illustrative sketch after
      this entry.)
      
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20210603104721.6309-1-jack@suse.cz
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
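
      For illustration only: a minimal userspace sketch of the dispatch
      policy described above, i.e. acquire a tag before pulling another
      request from the scheduler and stop pulling once no tag is available,
      so at most one request waits outside the scheduler. The tag array and
      pull_from_sched() are hypothetical stand-ins, not the blk-mq tag code.

      #include <stdbool.h>
      #include <stdio.h>

      #define NR_TAGS 2

      static bool tag_in_use[NR_TAGS];

      static int get_tag(void)
      {
          for (int i = 0; i < NR_TAGS; i++) {
              if (!tag_in_use[i]) {
                  tag_in_use[i] = true;
                  return i;
              }
          }
          return -1;                    /* no tag free -> device is busy */
      }

      static void put_tag(int tag)
      {
          tag_in_use[tag] = false;
      }

      /* Pretend scheduler with queued request ids. */
      static int queued[] = { 10, 11, 12, 13, 14 };
      static int nr_queued = 5, head;

      static int pull_from_sched(void)
      {
          return head < nr_queued ? queued[head++] : -1;
      }

      /* Dispatch loop: only pull another request once a tag for it is held,
       * so the scheduler keeps queueing (and prioritising) whatever the
       * device cannot accept yet.  Tags are never released here, to mimic
       * requests staying in flight. */
      static void do_dispatch(void)
      {
          for (;;) {
              int tag = get_tag();
              if (tag < 0) {
                  printf("no free tag, stop pulling; scheduler keeps the rest\n");
                  break;
              }
              int rq = pull_from_sched();
              if (rq < 0) {
                  put_tag(tag);         /* nothing queued, return the tag */
                  break;
              }
              printf("dispatch request %d with tag %d\n", rq, tag);
          }
      }

      int main(void)
      {
          do_dispatch();
          return 0;
      }
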
  13. May 24, 2021
  14. May 11, 2021
    • kyber: fix out of bounds access when preempted · efed9a33
      Omar Sandoval authored
      
      __blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and
      passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx
      for the current CPU again and uses that to get the corresponding Kyber
      context in the passed hctx. However, the thread may be preempted between
      the two calls to blk_mq_get_ctx(), and the ctx returned the second time
      may no longer correspond to the passed hctx. This "works" accidentally
      most of the time, but it can cause us to read garbage if the second ctx
      came from an hctx with more ctx's than the first one (i.e., if
      ctx->index_hw[hctx->type] > hctx->nr_ctx).
      
      This manifested as this UBSAN array index out of bounds error reported
      by Jakub:
      
      UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9
      index 13106 is out of range for type 'long unsigned int [128]'
      Call Trace:
       dump_stack+0xa4/0xe5
       ubsan_epilogue+0x5/0x40
       __ubsan_handle_out_of_bounds.cold.13+0x2a/0x34
       queued_spin_lock_slowpath+0x476/0x480
       do_raw_spin_lock+0x1c2/0x1d0
       kyber_bio_merge+0x112/0x180
       blk_mq_submit_bio+0x1f5/0x1100
       submit_bio_noacct+0x7b0/0x870
       submit_bio+0xc2/0x3a0
       btrfs_map_bio+0x4f0/0x9d0
       btrfs_submit_data_bio+0x24e/0x310
       submit_one_bio+0x7f/0xb0
       submit_extent_page+0xc4/0x440
       __extent_writepage_io+0x2b8/0x5e0
       __extent_writepage+0x28d/0x6e0
       extent_write_cache_pages+0x4d7/0x7a0
       extent_writepages+0xa2/0x110
       do_writepages+0x8f/0x180
       __writeback_single_inode+0x99/0x7f0
       writeback_sb_inodes+0x34e/0x790
       __writeback_inodes_wb+0x9e/0x120
       wb_writeback+0x4d2/0x660
       wb_workfn+0x64d/0xa10
       process_one_work+0x53a/0xa80
       worker_thread+0x69/0x5b0
       kthread+0x20b/0x240
       ret_from_fork+0x1f/0x30
      
      Only Kyber uses the hctx, so fix it by passing the request_queue to
      ->bio_merge() instead. BFQ and mq-deadline just use that, and Kyber can
      map the queues itself to avoid the mismatch. (See the illustrative
      sketch after this entry.)
      
      Fixes: a6088845 ("block: kyber: make kyber more friendly with merging")
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.com
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
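
      For illustration only: a minimal userspace sketch of why sampling the
      current CPU twice is unsafe here, and why deriving both the hctx and
      the ctx index from a single lookup (as the fix does by passing the
      request_queue and mapping once) avoids the out-of-range index. The
      arrays and function names are hypothetical stand-ins, not the
      blk-mq/Kyber data structures.

      #include <stdio.h>

      #define NR_CPUS 4

      /* Fake, uneven mapping: hctx 0 covers one cpu, hctx 1 covers three. */
      static const int cpu_to_hctx[NR_CPUS]       = { 0, 1, 1, 1 };
      static const int hctx_nr_ctx[2]             = { 1, 3 };
      /* Index of each cpu's sw context within its own hw queue. */
      static const int cpu_index_in_hctx[NR_CPUS] = { 0, 0, 1, 2 };

      static int current_cpu;           /* changes on "preemption" */

      /* Buggy pattern: the hctx comes from one cpu sample, the ctx index
       * from a second one.  If the task migrated in between, the index may
       * exceed what that hctx actually has. */
      static void buggy_merge(void)
      {
          int hctx = cpu_to_hctx[current_cpu];      /* cpu 0 -> hctx 0 */
          current_cpu = 3;                          /* simulated migration */
          int idx = cpu_index_in_hctx[current_cpu]; /* index 2 */
          printf("buggy: hctx %d, ctx index %d (only %d ctxs valid)\n",
                 hctx, idx, hctx_nr_ctx[hctx]);
      }

      /* Fixed pattern: sample the cpu once and derive both values from it. */
      static void fixed_merge(void)
      {
          int cpu  = current_cpu;
          int hctx = cpu_to_hctx[cpu];
          int idx  = cpu_index_in_hctx[cpu];
          printf("fixed: hctx %d, ctx index %d (only %d ctxs valid)\n",
                 hctx, idx, hctx_nr_ctx[hctx]);
      }

      int main(void)
      {
          current_cpu = 0;
          buggy_merge();
          current_cpu = 0;
          fixed_merge();
          return 0;
      }
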
  15. Apr 08, 2021
  16. Mar 04, 2021
  17. Mar 01, 2021
  18. Feb 22, 2021
  19. Dec 04, 2020
  20. Oct 09, 2020
    • blk-mq: get rid of the dead flush handle code path · c7281524
      Yufen Yu authored
      
      After commit 923218f6 ("blk-mq: don't allocate driver tag upfront
      for flush rq"), blk_mq_submit_bio() calls blk_insert_flush() directly
      to handle flush requests, rather than blk_mq_sched_insert_request(),
      when an elevator is in use.

      Consequently, all flush requests either already have the RQF_FLUSH_SEQ
      flag set by the time blk_mq_sched_insert_request() is called, or have
      been inserted into hctx->dispatch. So, remove the dead code path. (See
      the illustrative sketch after this entry.)
      
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
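
      For illustration only: a minimal control-flow sketch of why the removed
      branch was dead, i.e. un-sequenced flush requests are diverted to the
      flush machinery at submit time, so the scheduler-insert path only ever
      sees requests that already carry RQF_FLUSH_SEQ. The flag values and
      helper names are simplified stand-ins, not the blk-mq implementation.

      #include <stdio.h>

      enum { REQ_FLUSH = 1 << 0, RQF_FLUSH_SEQ = 1 << 1 };

      static void insert_flush(int flags)
      {
          printf("flush machinery handles request (flags=%#x)\n", flags);
      }

      static void sched_insert(int flags)
      {
          /* The removed special case checked here for a flush request that
           * did not yet carry RQF_FLUSH_SEQ -- but submit() below never lets
           * such a request reach this function, so that branch was dead. */
          printf("scheduler insert (flags=%#x)\n", flags);
      }

      static void submit(int flags)
      {
          if ((flags & REQ_FLUSH) && !(flags & RQF_FLUSH_SEQ))
              insert_flush(flags);          /* diverted at submit time */
          else
              sched_insert(flags);
      }

      int main(void)
      {
          submit(REQ_FLUSH);                 /* -> flush machinery  */
          submit(REQ_FLUSH | RQF_FLUSH_SEQ); /* -> scheduler insert */
          submit(0);                         /* -> scheduler insert */
          return 0;
      }
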
  21. Oct 06, 2020
  22. Sep 08, 2020
  23. Sep 03, 2020
    • blk-mq: Facilitate a shared sbitmap per tagset · 32bc15af
      John Garry authored
      Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
      multiple reply queues with a single set of hostwide tags.
      
      In addition, these drivers want to use interrupt assignment in
      pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
      CPU hotplug may cause in-flight IO completion to not be serviced when an
      interrupt is shut down. That problem is solved in commit bf0beec0
      ("blk-mq: drain I/O when all CPUs in a hctx are offline").
      
      However, to take advantage of that blk-mq feature, the HBA HW queues are
      required to be mapped to the blk-mq hctxs; to do that, the HBA HW
      queues need to be exposed to the upper layer.
      
      In making that transition, the per-SCSI command request tags are no
      longer unique per Scsi host - they are just unique per hctx. As such, the
      HBA LLDD would have to generate this tag internally, which has a certain
      performance overhead.
      
      However, another problem is that blk-mq assumes the host may accept
      (Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e0 ("scsi:
      core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host
      busy counter, which would stop the LLDD being sent more than .can_queue
      commands, was removed; however, it should still be ensured that the
      block layer does not issue more than .can_queue commands to the Scsi
      host.
      
      To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
      which may be requested at init time.
      
      A new flag, BLK_MQ_F_TAG_HCTX_SHARED, should be set when requesting the
      tagset to indicate that the shared sbitmap should be used.
      
      Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and
      requests is still allocated per hctx; the reason for this is that if
      tags and requests were only allocated for a single hctx - like hctx0 -
      it may break block drivers which expect a request to be associated with
      a specific hctx, i.e. not always hctx0. This does introduce extra memory
      usage. (See the illustrative sketch after this entry.)
      
      This change is based on work originally from Ming Lei in [1] and from
      Bart's suggestion in [2].
      
      [0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
      [1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
      [2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be
      
      
      
      Signed-off-by: John Garry <john.garry@huawei.com>
      Tested-by: Don Brace <don.brace@microsemi.com> #SCSI resv cmds patches used
      Tested-by: Douglas Gilbert <dgilbert@interlog.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
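
      For illustration only: a minimal userspace sketch of the hostwide-tag
      idea, i.e. one shared tag pool bounds the total number of outstanding
      commands to can_queue across all hw queues, while each queue still
      keeps its own request array. The bitmap, limits and names are toy
      stand-ins, not the blk-mq shared sbitmap implementation.

      #include <stdio.h>

      #define CAN_QUEUE    4                  /* hostwide command limit */
      #define NR_HW_QUEUES 2

      static unsigned char shared_tag_used[CAN_QUEUE];     /* one shared pool */

      struct request {
          int tag;
          int hwq;
      };

      static struct request rqs[NR_HW_QUEUES][CAN_QUEUE];  /* per-hctx requests */

      static int get_shared_tag(void)
      {
          for (int t = 0; t < CAN_QUEUE; t++) {
              if (!shared_tag_used[t]) {
                  shared_tag_used[t] = 1;
                  return t;
              }
          }
          return -1;                           /* host already at can_queue */
      }

      static struct request *start_request(int hwq)
      {
          int tag = get_shared_tag();

          if (tag < 0)
              return NULL;
          /* The request itself still lives in the per-hw-queue array. */
          struct request *rq = &rqs[hwq][tag];
          rq->tag = tag;
          rq->hwq = hwq;
          return rq;
      }

      int main(void)
      {
          /* Issue from both queues: tags are unique across the whole "host",
           * and the fifth request is refused once can_queue is reached. */
          for (int i = 0; i < 5; i++) {
              struct request *rq = start_request(i % NR_HW_QUEUES);
              if (rq)
                  printf("hwq %d got hostwide tag %d\n", rq->hwq, rq->tag);
              else
                  printf("host busy: no tag (can_queue=%d reached)\n", CAN_QUEUE);
          }
          return 0;
      }
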
    • blk-mq: Pass flags for tag init/free · 1c0706a7
      John Garry authored
      
      Pass the hctx/tagset flags argument down to blk_mq_init_tags() and
      blk_mq_free_tags() for selective init/free.

      For now, make it include the alloc policy flag, which can be evaluated
      when needed (in blk_mq_init_tags()). (See the illustrative sketch after
      this entry.)
      
      Signed-off-by: John Garry <john.garry@huawei.com>
      Tested-by: Douglas Gilbert <dgilbert@interlog.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
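
      For illustration only: a minimal sketch of the plumbing pattern, i.e.
      thread a flags word down to the init/free helpers so policy bits can be
      evaluated where the tags are actually built or torn down. The flag
      names and helpers here are hypothetical, not the blk-mq API.

      #include <stdio.h>
      #include <stdlib.h>

      enum { TAG_ALLOC_RR = 1 << 0, TAG_SHARED = 1 << 1 };

      struct tags {
          unsigned int nr;
          unsigned int flags;
      };

      static struct tags *tags_init(unsigned int nr, unsigned int flags)
      {
          struct tags *t = malloc(sizeof(*t));

          if (!t)
              return NULL;
          t->nr = nr;
          t->flags = flags;
          /* The alloc policy bit is evaluated here, where the structure is
           * built, rather than at every call site. */
          printf("init %u tags, round-robin=%d\n", nr, !!(flags & TAG_ALLOC_RR));
          return t;
      }

      static void tags_free(struct tags *t, unsigned int flags)
      {
          printf("free tags, shared=%d\n", !!(flags & TAG_SHARED));
          free(t);
      }

      int main(void)
      {
          unsigned int flags = TAG_ALLOC_RR;
          struct tags *t = tags_init(32, flags);

          if (t)
              tags_free(t, flags);
          return 0;
      }
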
  24. Sep 01, 2020