  1. Nov 11, 2017
    • blk-mq: only run the hardware queue if IO is pending · 79f720a7
      Jens Axboe authored
      
      Currently we are inconsistent in when we decide to run the queue. Using
      blk_mq_run_hw_queues() we check if the hctx has pending IO before
      running it, but we don't do that from the individual queue run function,
      blk_mq_run_hw_queue(). This potentially results in a lot of extra and
      pointless queue runs on flush requests and (much worse) in tag
      starvation situations. This is observable just by looking at top
      output, with lots of kworkers active. For the !async runs, it just
      adds to the CPU overhead of blk-mq.
      
      Move the has-pending check into the run function instead of having
      callers do it.
      
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      79f720a7
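
      A minimal C sketch of the has-pending gate described above; the struct,
      fields, and helper names are illustrative stand-ins, not the kernel's
      actual API:

        #include <stdbool.h>
        #include <stdio.h>

        /* Illustrative stand-in for a hardware queue context. */
        struct hw_queue {
            int sw_queued;      /* requests sitting in software queues */
            int dispatch_len;   /* requests parked on the dispatch list */
        };

        static bool queue_has_pending(const struct hw_queue *hctx)
        {
            return hctx->sw_queued > 0 || hctx->dispatch_len > 0;
        }

        /* After the patch, the single-queue run path checks for pending
         * work itself instead of relying on every caller to do so. */
        static void run_hw_queue(struct hw_queue *hctx, bool async)
        {
            if (!queue_has_pending(hctx))
                return;                 /* no pointless run, no idle kworker */
            printf("running queue (%s)\n", async ? "async" : "sync");
        }

        int main(void)
        {
            struct hw_queue idle = { 0, 0 }, busy = { 3, 0 };
            run_hw_queue(&idle, true);  /* skipped: nothing pending */
            run_hw_queue(&busy, true);  /* runs */
            return 0;
        }

      With the gate inside the run function itself, every caller gets the
      check for free.
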
    • Revert "blk-mq: don't handle TAG_SHARED in restart" · 05b79413
      Jens Axboe authored
      
      This reverts commit 358a3a6b.
      
      We have cases that aren't covered 100% in the drivers, so for now
      we have to retain the shared tag restart loops.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      05b79413
  2. Nov 04, 2017
    • blk-mq: don't allocate driver tag upfront for flush rq · 923218f6
      Ming Lei authored
      
      The idea behind it is simple:
      
      1) for the 'none' scheduler, the driver tag has to be borrowed for the
         flush rq, otherwise we may run out of tags and cause an IO hang. And
         get/put of the driver tag is actually a no-op for 'none', so
         reordering tags isn't necessary at all.

      2) for a real I/O scheduler, we need not allocate a driver tag upfront
         for the flush rq. It works just fine to follow the same approach as
         for normal requests: allocate the driver tag for each rq just before
         calling ->queue_rq().

      One driver-visible change is that the driver tag isn't shared across the
      flush request sequence. That won't be a problem, since the legacy path
      has always done the same.

      The flush rq then needs no special treatment w.r.t. getting/putting the
      driver tag. This cleans up the code - for instance, reorder_tags_to_front()
      can be removed, and we needn't worry about request ordering in the
      dispatch list to avoid I/O deadlock.

      Note that we also have to put the driver tag before requeueing.
      
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      923218f6
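
      A rough C sketch of the tag discipline this commit converges on: take
      the driver tag just before ->queue_rq() and put it back before any
      requeue. All names here are made up for illustration:

        #include <stdbool.h>
        #include <stdio.h>

        struct request { int tag; bool is_flush; };

        static bool get_driver_tag(struct request *rq) { rq->tag = 42; return true; }
        static void put_driver_tag(struct request *rq) { rq->tag = -1; }

        /* Stand-in driver; returns false to simulate BLK_STS_RESOURCE
         * (here we just pretend the flush hits a busy device). */
        static bool queue_rq(struct request *rq) { return !rq->is_flush; }

        /* Flush requests take the same path as everything else: tag right
         * before ->queue_rq(), tag released before requeueing. */
        static void dispatch(struct request *rq)
        {
            if (!get_driver_tag(rq))
                return;
            if (!queue_rq(rq)) {
                put_driver_tag(rq);     /* must release before requeue */
                printf("requeued without a driver tag\n");
            }
        }

        int main(void)
        {
            struct request normal = { -1, false }, flush = { -1, true };
            dispatch(&normal);
            dispatch(&flush);
            return 0;
        }
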
    • blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ · a6a252e6
      Ming Lei authored
      
      With an IO scheduler we always pre-allocate one driver tag before
      calling blk_insert_flush(), and the flush request is marked
      RQF_FLUSH_SEQ once it is in the flush machinery.

      So if RQF_FLUSH_SEQ isn't set, we call blk_insert_flush() to handle
      the request; otherwise the flush request is dispatched to the
      ->dispatch list directly.
      
      This is a preparation patch for not preallocating a driver tag for flush
      requests, and for not treating flush requests as a special case. This is
      similar to what the legacy path does.
      
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a6a252e6
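
      The decision itself fits in one branch. A self-contained C sketch,
      with an invented flag value standing in for the real RQF_FLUSH_SEQ
      bit:

        #include <stdio.h>

        enum { RQF_FLUSH_SEQ = 1 << 0 };    /* illustrative flag value */

        struct request { unsigned int rq_flags; };

        static void insert_flush(struct request *rq)    { (void)rq; puts("-> flush machinery"); }
        static void add_to_dispatch(struct request *rq) { (void)rq; puts("-> ->dispatch list"); }

        /* A flush request not yet part of a flush sequence enters the flush
         * machinery; an already-sequenced one goes straight to dispatch. */
        static void insert_request(struct request *rq)
        {
            if (!(rq->rq_flags & RQF_FLUSH_SEQ))
                insert_flush(rq);
            else
                add_to_dispatch(rq);
        }

        int main(void)
        {
            struct request fresh = { 0 }, sequenced = { RQF_FLUSH_SEQ };
            insert_request(&fresh);      /* -> flush machinery */
            insert_request(&sequenced);  /* -> ->dispatch list */
            return 0;
        }
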
    • blk-mq: don't handle failure in .get_budget · 88022d72
      Ming Lei authored
      
      It is enough to just check whether we can get budget via .get_budget();
      we don't need to deal with device state changes in .get_budget().

      For SCSI, one issue to be fixed is that we have to call
      scsi_mq_uninit_cmd() to free allocated resources if the SCSI device
      fails to handle the request. It isn't enough to simply call
      blk_mq_end_request() for that if the request is marked RQF_DONTPREP.

      Fixes: 0df21c86 ("scsi: implement .get_budget and .put_budget for blk-mq")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      88022d72
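
      A small C sketch of the simplified contract: .get_budget() only answers
      "did I get budget?", and failure handling lives with the caller. The
      counter below is an invented stand-in for a per-queue depth:

        #include <stdbool.h>
        #include <stdio.h>

        struct queue { int budget; };       /* illustrative per-queue budget */

        static bool get_budget(struct queue *q)
        {
            if (q->budget <= 0)
                return false;               /* no device-state handling here */
            q->budget--;
            return true;
        }

        static void put_budget(struct queue *q) { q->budget++; }

        int main(void)
        {
            struct queue q = { 1 };
            bool got = get_budget(&q);      /* succeeds */
            printf("second: %s\n", get_budget(&q) ? "got budget" : "busy");
            if (got)
                put_budget(&q);             /* a taken budget is always returned */
            return 0;
        }
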
  3. Nov 01, 2017
    • blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE · 1f460b63
      Ming Lei authored
      
      SCSI restarts its queue in scsi_end_request() automatically, so we don't
      need to handle this case in blk-mq.
      
      In particular, no request will be dequeued in this case, so we needn't
      worry about an IO hang caused by restart vs. dispatch.
      
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1f460b63
    • blk-mq: don't handle TAG_SHARED in restart · 358a3a6b
      Ming Lei authored

      Restart is now used in the following cases, and TAG_SHARED is for
      SCSI only.

      1) .get_budget() returns BLK_STS_RESOURCE
      - if a resource at the target/host level isn't satisfied, this SCSI
      device will be added to shost->starved_list, and the whole queue will be
      rerun (via SCSI's built-in RESTART) in scsi_end_request() after any
      request initiated from this host/target completes. Note that host-level
      resources can't be an issue for blk-mq at all.

      - the same is true if a resource at the queue level isn't satisfied.

      - if there is no outstanding request on this queue, then SCSI's RESTART
      can't work (and neither can blk-mq's), so the queue will be run after
      SCSI_QUEUE_DELAY, and finally all starved sdevs will be handled by SCSI's
      RESTART when this request finishes
      
      2) scsi_dispatch_cmd() returns BLK_STS_RESOURCE
      - if there is no in-progress request on this queue, the queue
      will be run after SCSI_QUEUE_DELAY
      
      - otherwise, SCSI's RESTART covers the rerun.
      
      3) blk_mq_get_driver_tag() failed
      - BLK_MQ_S_TAG_WAITING covers the cross-queue RESTART for driver tag
      allocation.
      
      In short, SCSI's built-in RESTART is enough to cover queue reruns, and
      we don't need to pay special attention to TAG_SHARED w.r.t. restart.

      In my test on scsi_debug (8 LUNs), this patch improves IOPS by 20%~30%
      when running I/O on these 8 LUNs concurrently.

      Also, Roman Pen reported that the current RESTART is very expensive,
      especially when there are lots of LUNs attached to one host; in his
      test, RESTART cut IOPS in half.

      Fixes: https://marc.info/?l=linux-kernel&m=150832216727524&w=2
      Fixes: 6d8c6c0f ("blk-mq: Restart a single queue if tag sets are shared")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      358a3a6b
    • blk-mq-sched: improve dispatching from sw queue · b347689f
      Ming Lei authored
      
      SCSI devices use a host-wide tagset, and the shared driver tag space is
      often quite big. However, there is also a queue depth for each LUN
      (.cmd_per_lun), which is often small; for example, on both lpfc and
      qla2xxx, .cmd_per_lun is just 3.

      So lots of requests may stay in the sw queue, and we always flush all
      of those belonging to the same hw queue and dispatch them to the driver.
      Unfortunately, the small .cmd_per_lun easily makes the queue busy. Once
      these requests are flushed out, they have to stay on hctx->dispatch,
      no bio merge can happen on them, and sequential IO performance suffers.
      
      This patch introduces blk_mq_dequeue_from_ctx() for dequeuing a request
      from a sw queue, so that we can dispatch requests in the scheduler's
      way. We can then avoid dequeueing too many requests from the sw queue,
      since we don't flush ->dispatch completely.
      
      This patch improves dispatching from sw queue by using the .get_budget
      and .put_budget callbacks.
      
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b347689f
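
      A C sketch of the one-at-a-time dequeue this commit enables; the queue
      and budget below are invented stand-ins, not blk_mq_dequeue_from_ctx()'s
      real signature:

        #include <stdbool.h>
        #include <stdio.h>

        struct sw_queue { int queued; };    /* requests in the sw queue */

        static bool get_budget(int *budget)
        {
            if (*budget <= 0)
                return false;
            (*budget)--;
            return true;
        }

        /* Instead of flushing the whole sw queue into ->dispatch, pull one
         * request per iteration and stop as soon as budget runs out; what
         * is left stays in the sw queue where bios can still merge. */
        static void dispatch_from_sw_queue(struct sw_queue *ctx, int *budget)
        {
            while (ctx->queued > 0) {
                if (!get_budget(budget)) {
                    printf("%d request(s) left in sw queue for merging\n",
                           ctx->queued);
                    return;
                }
                ctx->queued--;              /* dequeue exactly one request */
                printf("dispatched one request\n");
            }
        }

        int main(void)
        {
            struct sw_queue ctx = { 5 };
            int budget = 3;                 /* e.g. a small .cmd_per_lun */
            dispatch_from_sw_queue(&ctx, &budget);
            return 0;
        }
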
    • blk-mq: introduce .get_budget and .put_budget in blk_mq_ops · de148297
      Ming Lei authored
      
      For SCSI devices, there is often a per-request-queue depth, which needs
      to be respected before queuing one request.
      
      Currently blk-mq always dequeues the request first, then calls
      .queue_rq() to dispatch the request to the LLD. One obvious issue with
      this approach is that I/O merging may not be successful, because when
      the per-request-queue depth can't be respected, .queue_rq() has to
      return BLK_STS_RESOURCE, and the request then has to stay on the
      hctx->dispatch list. This means it never gets a chance to be merged
      with other IO.

      This patch introduces the .get_budget and .put_budget callbacks in
      blk_mq_ops, so we can try to get budget before dequeuing a request.
      If the budget for queueing I/O can't be satisfied, we don't need to
      dequeue the request at all, and it can be left in the IO scheduler
      queue for more merging opportunities.
      
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      de148297
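
      The ordering change in miniature, as a hedged C sketch (all helpers
      invented): ask for budget first, and only dequeue once it is granted,
      so a refused request never leaves the scheduler queue:

        #include <stdbool.h>
        #include <stdio.h>

        static int budget;                  /* per-queue depth, exhausted */

        static bool get_budget(void)
        {
            if (budget <= 0)
                return false;
            budget--;
            return true;
        }

        static bool scheduler_has_rq(void)      { return true; }
        static void dequeue_and_queue_rq(void)  { puts("sent to driver"); }

        /* Old order: dequeue first, learn about BLK_STS_RESOURCE later,
         * request stranded on ->dispatch where it can't merge.
         * New order: budget first; on failure nothing is dequeued. */
        static void dispatch(void)
        {
            while (scheduler_has_rq()) {
                if (!get_budget()) {
                    puts("no budget: request stays in scheduler for merging");
                    return;
                }
                dequeue_and_queue_rq();
            }
        }

        int main(void) { dispatch(); return 0; }
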
    • blk-mq-sched: move actual dispatching into one helper · caf8eb0d
      Ming Lei authored
      
      So that it becomes easy to support dispatching from the sw queue in
      the following patch.
      
      No functional change.
      
      Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Suggested-by: Christoph Hellwig <hch@lst.de> # for simplifying dispatch logic
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      caf8eb0d
    • blk-mq-sched: dispatch from scheduler IFF progress is made in ->dispatch · 5e3d02bb
      Ming Lei authored
      
      When the hw queue is busy, we shouldn't take requests from the
      scheduler queue anymore; otherwise it is difficult to do IO merging.

      This patch fixes the awful IO performance on some SCSI devices (lpfc,
      qla2xxx, ...) when mq-deadline/kyber is used, by not taking requests
      while the hw queue is busy.
      
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5e3d02bb
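
      A C sketch of the dispatch rule this commit adds; the list and flags
      are invented stand-ins for hctx->dispatch and the driver's busy
      status:

        #include <stdbool.h>
        #include <stdio.h>

        static int dispatch_list_len = 2;   /* leftovers from busy runs */
        static bool hw_queue_busy;

        /* Returns true only if the whole leftover list was accepted. */
        static bool flush_dispatch_list(void)
        {
            while (dispatch_list_len > 0) {
                if (hw_queue_busy)
                    return false;           /* driver: BLK_STS_RESOURCE */
                dispatch_list_len--;
            }
            return true;
        }

        static void take_from_scheduler(void) { puts("pulling from scheduler"); }

        /* Only take new requests out of the scheduler once ->dispatch has
         * fully drained; otherwise fresh requests pile up where they can
         * never be merged. */
        static void run_queue(void)
        {
            if (!flush_dispatch_list()) {
                puts("hw queue busy: leave scheduler queue untouched");
                return;
            }
            take_from_scheduler();
        }

        int main(void)
        {
            hw_queue_busy = true;  run_queue();
            hw_queue_busy = false; run_queue();
            return 0;
        }
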
  4. Jul 03, 2017
    • blk-mq-sched: fix performance regression of mq-deadline · 32825c45
      Ming Lei authored
      
      When mq-deadline is used, sequential read and sequential write IOPS
      drop by more than 20% on SATA (scsi-mq) devices, compared with the
      'none' scheduler.

      The reason is that the default nr_requests for the scheduler is too
      big for small-queue-depth devices, which increases latency considerably.

      Since the principle of using 256 requests for the mq scheduler is
      based on a queue depth of 128, this patch changes nr_requests to
      double min(hw queue_depth, 128), as sketched below.
      
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      32825c45
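
      The sizing rule is simple enough to state in a few lines of C; the
      function name is invented, and 128 is the queue depth the old
      256-request default assumed:

        #include <stdio.h>

        /* nr_requests = 2 * min(hw queue depth, 128) */
        static unsigned int sched_nr_requests(unsigned int hw_queue_depth)
        {
            unsigned int d = hw_queue_depth < 128 ? hw_queue_depth : 128;
            return 2 * d;
        }

        int main(void)
        {
            /* A typical SATA NCQ depth of 31 now yields 62 instead of 256. */
            printf("depth  31 -> nr_requests %u\n", sched_nr_requests(31));
            printf("depth 512 -> nr_requests %u\n", sched_nr_requests(512));
            return 0;
        }
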
  5. Apr 07, 2017
    • blk-mq-sched: provide hooks for initializing hardware queue data · ee056f98
      Omar Sandoval authored
      
      Schedulers need to be informed when a hardware queue is added or removed
      at runtime so they can allocate/free per-hardware-queue data. Therefore,
      replace the blk_mq_sched_init_hctx_data() helper, which only makes sense
      at init time, with .init_hctx() and .exit_hctx() hooks.
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ee056f98
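
      A self-contained C sketch of such per-hardware-queue hooks; the ops
      table and allocation are illustrative, not the blk-mq definitions:

        #include <stdio.h>
        #include <stdlib.h>

        struct hctx { void *sched_data; };

        /* Hooks of the kind this commit adds: called whenever a hw queue
         * comes or goes, not just at elevator init time. */
        struct sched_ops {
            int  (*init_hctx)(struct hctx *);
            void (*exit_hctx)(struct hctx *);
        };

        static int my_init_hctx(struct hctx *h)
        {
            h->sched_data = calloc(1, 64);  /* per-queue scheduler state */
            return h->sched_data ? 0 : -1;
        }

        static void my_exit_hctx(struct hctx *h)
        {
            free(h->sched_data);
            h->sched_data = NULL;
        }

        static const struct sched_ops ops = { my_init_hctx, my_exit_hctx };

        int main(void)
        {
            struct hctx h = { 0 };
            if (ops.init_hctx(&h) == 0) {   /* hw queue added at runtime */
                puts("hctx data allocated");
                ops.exit_hctx(&h);          /* hw queue removed again */
            }
            return 0;
        }
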
    • blk-mq: Restart a single queue if tag sets are shared · 6d8c6c0f
      Bart Van Assche authored
      
      To improve scalability, if hardware queues are shared, restart
      a single hardware queue in round-robin fashion. Rename
      blk_mq_sched_restart_queues() to reflect the new semantics.
      Remove blk_mq_sched_mark_restart_queue() because this function
      has no callers. Remove flag QUEUE_FLAG_RESTART because this
      patch removes the code that uses this flag.
      
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      6d8c6c0f
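
      A toy C sketch of round-robin restart across queues sharing a tag set;
      the fixed queue count and cursor are assumptions for illustration:

        #include <stdio.h>

        #define NR_SHARED_QUEUES 4

        static int next;    /* cursor over the shared queues */

        static void restart_one(int idx) { printf("restarting hw queue %d\n", idx); }

        /* Each restart event kicks a single shared queue, advancing a
         * cursor, instead of rerunning every queue sharing the tag set. */
        static void restart_shared(void)
        {
            restart_one(next);
            next = (next + 1) % NR_SHARED_QUEUES;
        }

        int main(void)
        {
            for (int i = 0; i < 5; i++)
                restart_shared();   /* queues 0, 1, 2, 3, 0 */
            return 0;
        }
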
    • blk-mq-sched: fix crash in switch error path · 54d5329d
      Omar Sandoval authored
      
      In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
      back to the original scheduler. However, at this point, we've already
      torn down the original scheduler's tags, so this causes a crash. Doing
      the fallback like the legacy elevator path does is much harder for mq,
      so fix it by just falling back to 'none' instead.
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      54d5329d
    • blk-mq-sched: set up scheduler tags when bringing up new queues · 93252632
      Omar Sandoval authored
      
      If a new hardware queue is added at runtime, we don't allocate scheduler
      tags for it, leading to a crash. This hooks up the scheduler framework
      to blk_mq_{init,exit}_hctx() to make sure everything gets properly
      initialized/freed.
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      93252632
    • blk-mq-sched: refactor scheduler initialization · 6917ff0b
      Omar Sandoval authored
      
      Preparation cleanup for the next couple of fixes: push
      blk_mq_sched_setup() and e->ops.mq.init_sched() into a helper.
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      6917ff0b
    • blk-mq: use the right hctx when getting a driver tag fails · 81380ca1
      Omar Sandoval authored
      
      While dispatching requests, if we fail to get a driver tag, we mark the
      hardware queue as waiting for a tag and put the requests on the
      hctx->dispatch list to be run later when a driver tag is freed. However,
      blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
      queues if using a single-queue scheduler with a multiqueue device. If
      blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
      are processing. This means we end up using the hardware queue of the
      previous request, which may or may not be the same as that of the
      current request. If it isn't, the wrong hardware queue will end up
      waiting for a tag, and the requests will be on the wrong dispatch list,
      leading to a hang.
      
      The fix is twofold:
      
      1. Make sure we save which hardware queue we were trying to get a
         request for in blk_mq_get_driver_tag() regardless of whether it
         succeeds or not.
      2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
         blk_mq_hw_queue to make it clear that it must handle multiple
         hardware queues, since I've already messed this up on a couple of
         occasions.
      
      This didn't appear in testing with nvme and mq-deadline because nvme has
      more driver tags than the default number of scheduler tags. However,
      with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      81380ca1
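
      The first half of the fix, reduced to a C sketch: derive the hardware
      queue from the request on every iteration, including the failure path.
      The mapping and tag check below are invented:

        #include <stdbool.h>
        #include <stdio.h>

        struct request { int hw_queue; };

        static bool get_driver_tag(const struct request *rq)
        {
            return rq->hw_queue != 1;   /* pretend queue 1 is out of tags */
        }

        /* The dispatch list may hold requests for different hardware
         * queues, so the hctx must be looked up per request rather than
         * remembered from the previous iteration. */
        static void dispatch_list(const struct request *rqs, int n)
        {
            for (int i = 0; i < n; i++) {
                int hctx = rqs[i].hw_queue;     /* per-request lookup */
                if (!get_driver_tag(&rqs[i])) {
                    printf("queue %d waits for a tag\n", hctx);
                    continue;
                }
                printf("queue %d dispatches\n", hctx);
            }
        }

        int main(void)
        {
            struct request rqs[] = { {0}, {1}, {0} };
            dispatch_list(rqs, 3);
            return 0;
        }
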
  6. Feb 23, 2017
    • blk-mq-sched: separate mark hctx and queue restart operations · d38d3515
      Omar Sandoval authored
      
      In blk_mq_sched_dispatch_requests(), we call blk_mq_sched_mark_restart()
      after we dispatch requests left over on our hardware queue dispatch
      list. This is so we'll go back and dispatch requests from the scheduler.
      In this case, it's only necessary to restart the hardware queue that we
      are running; there's no reason to run other hardware queues just because
      we are using shared tags.
      
      So, split out blk_mq_sched_mark_restart() into two operations, one for
      just the hardware queue and one for the whole request queue. The core
      code only needs the hctx variant, but I/O schedulers will want to use
      both.
      
      This also requires adjusting blk_mq_sched_restart_queues() to always
      check the queue restart flag, not just when using shared tags.
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      d38d3515
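
      The split, shown as a hedged C sketch with invented flag bits and
      types:

        #include <stdio.h>

        enum { HCTX_RESTART = 1 << 0, QUEUE_RESTART = 1 << 0 };

        struct hctx  { unsigned int flags; };
        struct queue { unsigned int flags; struct hctx hctx[2]; };

        /* One helper marks just the hardware queue being run, the other
         * marks the whole request queue. */
        static void mark_restart_hctx(struct hctx *h)   { h->flags |= HCTX_RESTART; }
        static void mark_restart_queue(struct queue *q) { q->flags |= QUEUE_RESTART; }

        int main(void)
        {
            struct queue q = { 0, { { 0 }, { 0 } } };
            mark_restart_hctx(&q.hctx[0]);  /* core code: this hctx only */
            mark_restart_queue(&q);         /* schedulers may want both */
            printf("hctx0=%u hctx1=%u queue=%u\n",
                   q.hctx[0].flags, q.hctx[1].flags, q.flags);
            return 0;
        }
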
  7. Feb 03, 2017
    • block: free merged request in the caller · e4d750c9
      Jens Axboe authored
      
      If we end up doing a request-to-request merge when we have completed
      a bio-to-request merge, we free the request from deep down in that
      path. For blk-mq-sched, the merge path has to hold the appropriate
      lock, but we don't need it for freeing the request. And in fact
      holding the lock is problematic, since we are now calling the
      mq sched put_rq_private() hook with the lock held. Other call paths
      do not hold this lock.
      
      Fix this inconsistency by ensuring that the caller frees a merged
      request. Then we can do it outside of the lock, making it both more
      efficient and fixing the blk-mq-sched problem of invoking parts of
      the scheduler with an unknown lock state.
      
      Reported-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      e4d750c9
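
      The ownership pattern in a hedged C sketch: the merge routine absorbs
      the victim request under the lock but hands it back, so the caller can
      free it after unlocking. Names and locking are illustrative only:

        #include <stdio.h>
        #include <stdlib.h>

        struct request { int id; };

        /* Merge 'next' into 'rq' under the scheduler lock, but return the
         * absorbed request instead of freeing it in place. */
        static struct request *try_merge(struct request *rq, struct request *next)
        {
            (void)rq;           /* ... absorb next's data into rq ... */
            return next;        /* caller frees, outside the lock */
        }

        int main(void)
        {
            struct request *a = malloc(sizeof *a);
            struct request *b = malloc(sizeof *b);
            a->id = 1; b->id = 2;

            /* lock(); */
            struct request *freed = try_merge(a, b);
            /* unlock(); */

            if (freed)
                free(freed);    /* no sched hook runs with the lock held */
            free(a);
            return 0;
        }
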