  1. Oct 07, 2019
    • block: Remove request_queue.nr_queues · 95662565
      Bart Van Assche authored
      
      Commit 897bb0c7 ("blk-mq: Use proper cpumask iterator"; v4.6)
      removed the last use of request_queue.nr_queues from outside
      blk_mq_init_allocated_queue(). Remove this member variable to make
      struct request_queue smaller. This patch does not change any
      functionality.
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Fix three kernel-doc warnings · 98e54402
      Bart Van Assche authored
      
      Fix the following kernel-doc warnings:
      
      block/t10-pi.c:242: warning: Function parameter or member 'rq' not described in 't10_pi_type3_prepare'
      block/t10-pi.c:249: warning: Function parameter or member 'rq' not described in 't10_pi_type3_complete'
      block/t10-pi.c:249: warning: Function parameter or member 'nr_bytes' not described in 't10_pi_type3_complete'
      
      Cc: Max Gurtovoy <maxg@mellanox.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Fixes: 54d4e6ab ("block: centralize PI remapping logic to the block layer")
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. Oct 03, 2019
  3. Sep 27, 2019
    • blk-mq: apply normal plugging for HDD · 3154df26
      Ming Lei authored
      
      Some HDDs may expose multiple hardware queues, such as those behind
      MegaRAID controllers. Let's apply normal plugging to such devices
      because sequential IO may benefit a lot from plug merging.
      
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: honor IO scheduler for multiqueue devices · a12de1d4
      Ming Lei authored
      
      If a device is using multiple queues, the IO scheduler may be bypassed.
      This may hurt performance for some slow MQ devices, and it also breaks
      zoned devices which depend on mq-deadline for respecting the write order
      in one zone.
      
      Don't bypass the IO scheduler if one is set up.

      This patch can roughly double sequential write performance on MQ
      scsi_debug when mq-deadline is applied.
      
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Javier González <javier@javigon.com>
      Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: fix null pointer dereference in blk_mq_rq_timed_out() · 8d699663
      Yufen Yu authored
      
      We got a NULL pointer dereference BUG in blk_mq_rq_timed_out()
      as follows:
      
      [  108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
      [  108.827059] PGD 0 P4D 0
      [  108.827313] Oops: 0000 [#1] SMP PTI
      [  108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
      [  108.829503] Workqueue: kblockd blk_mq_timeout_work
      [  108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
      [  108.838191] Call Trace:
      [  108.838406]  bt_iter+0x74/0x80
      [  108.838665]  blk_mq_queue_tag_busy_iter+0x204/0x450
      [  108.839074]  ? __switch_to_asm+0x34/0x70
      [  108.839405]  ? blk_mq_stop_hw_queue+0x40/0x40
      [  108.839823]  ? blk_mq_stop_hw_queue+0x40/0x40
      [  108.840273]  ? syscall_return_via_sysret+0xf/0x7f
      [  108.840732]  blk_mq_timeout_work+0x74/0x200
      [  108.841151]  process_one_work+0x297/0x680
      [  108.841550]  worker_thread+0x29c/0x6f0
      [  108.841926]  ? rescuer_thread+0x580/0x580
      [  108.842344]  kthread+0x16a/0x1a0
      [  108.842666]  ? kthread_flush_work+0x170/0x170
      [  108.843100]  ret_from_fork+0x35/0x40
      
      The bug is caused by a race between the timeout handler and the
      completion of a flush request.

      When the timeout handler blk_mq_rq_timed_out() tries to read
      'req->q->mq_ops', 'req' may already have completed and been
      reinitialized by the next flush request, which calls blk_rq_init()
      to zero 'req'.

      After commit 12f5b931 ("blk-mq: Remove generation seqeunce"),
      the lifetime of normal requests is protected by a refcount. A request
      can only be freed once 'rq->ref' drops to zero, so these requests
      cannot be reused before the timeout handler finishes.

      However, a flush request defines .end_io, and rq->end_io() is still
      called even if 'rq->ref' has not dropped to zero. After that, 'flush_rq'
      can be reused by the next flush request, resulting in the NULL
      pointer dereference.

      We fix this problem by covering the flush request with 'rq->ref'.
      If the refcount is not zero, flush_end_io() returns and waits for the
      last holder to call it again. To record the request status, we add a
      new field 'rq_status', which is used in flush_end_io().
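
      The deferral described above can be pictured with a small Python model.
      This is only a sketch under simplifying assumptions, not the kernel
      code: FlushRequest and FlushQueue are hypothetical stand-ins for the
      flush request and struct blk_flush_queue, and locking is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FlushRequest:
    """Hypothetical stand-in for the shared flush request."""
    ref: int = 1            # models 'rq->ref'
    reusable: bool = False  # True once the next flush may reuse the request


@dataclass
class FlushQueue:
    """Hypothetical stand-in for struct blk_flush_queue."""
    rq_status: Optional[int] = None  # status saved by a deferred completion
    completed: List[int] = field(default_factory=list)


def flush_end_io(fq: FlushQueue, rq: FlushRequest, status: int) -> bool:
    """Run the completion only when the last reference is dropped.

    Returns True if the completion ran, False if it was deferred.
    """
    rq.ref -= 1
    if rq.ref > 0:
        # Another holder (e.g. the timeout handler) still uses the request:
        # save the status and let the last holder call this function again.
        fq.rq_status = status
        return False
    if fq.rq_status is not None:
        # Use the status recorded by the deferred call.
        status = fq.rq_status
        fq.rq_status = None
    fq.completed.append(status)
    rq.reusable = True
    return True
```

      In this model the request only becomes reusable after the second call,
      i.e. after the timeout handler has dropped its reference.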
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: stable@vger.kernel.org # v4.18+
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      
      -------
      v2:
       - move rq_status from struct request to struct blk_flush_queue
      v3:
       - remove unnecessary '{}' pair.
      v4:
       - let the spinlock protect 'fq->rq_status'
      v5:
       - move rq_status after flush_running_idx member of struct blk_flush_queue
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • rq-qos: get rid of redundant wbt_update_limits() · 2af2783f
      Yufen Yu authored
      
      We have updated limits after calling wbt_set_min_lat(). No need to
      update again.
      
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. Sep 26, 2019
    • iocost: bump up default latency targets for hard disks · 7afcccaf
      Tejun Heo authored
      
      The default hard disk param sets latency targets at 50ms.  As the
      default target percentiles are zero, these don't directly regulate
      vrate; however, they're still used to calculate the period length -
      100ms in this case.
      
      This is excessively low.  A SATA drive with QD32 saturated with random
      IOs can easily reach avg completion latency of several hundred msecs.
      A period duration which is substantially lower than avg completion
      latency can lead to wildly fluctuating vrate.
      
      Let's bump up the default latency targets to 250ms so that the period
      duration is sufficiently long.
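
      The arithmetic can be sketched as follows. The factor of two is an
      assumption inferred only from the figures in this message (50ms
      targets giving a 100ms period); the function name is hypothetical and
      any clamping the kernel may apply is ignored.

```python
def period_ms(read_target_ms: float, write_target_ms: float) -> float:
    """Period length implied by the latency targets.

    Assumption: the period is twice the larger latency target, which
    matches the 50ms -> 100ms example in the commit message.
    """
    return 2 * max(read_target_ms, write_target_ms)


def period_covers_latency(period: float, avg_completion_ms: float) -> bool:
    """True if the period is at least as long as the average completion latency."""
    return period >= avg_completion_ms
```

      With 50ms targets a drive averaging, say, 300ms per IO is not covered
      by one period, while with 250ms targets it is.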
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • iocost: improve nr_lagging handling · 7cd806a9
      Tejun Heo authored
      
      Some IOs may span multiple periods.  As latencies are collected on
      completion, the in-between periods won't register them and may
      incorrectly decide to increase vrate.  nr_lagging tracks these IOs to
      avoid those situations.  Currently, whenever there are IOs which
      span from the previous period, busy_level is reset to 0 if
      negative, thus suppressing vrate increase.
      
      This has the following two problems.
      
      * When latency target percentiles aren't set, vrate adjustment should
        only be governed by queue depth depletion; however, the current code
        keeps nr_lagging active which pulls in latency results and can keep
        down vrate unexpectedly.
      
      * When lagging condition is detected, it resets the entire negative
        busy_level.  This turned out to be way too aggressive on some
        devices which sometimes experience extended latencies on a small
        subset of commands.  In addition, a lagging IO will be accounted as
        latency target miss on completion anyway and resetting busy_level
        amplifies its impact unnecessarily.
      
      This patch fixes the above two problems by disabling nr_lagging
      counting when latency target percentiles aren't set and blocking vrate
      increases when there are lagging IOs while leaving busy_level as-is.
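
      A heavily simplified Python model of the new behavior, for
      illustration only. The function names are hypothetical, and the real
      vrate adjustment takes more inputs than shown here.

```python
from typing import Tuple


def count_lagging(read_pct: float, write_pct: float, spanning_ios: int) -> int:
    """nr_lagging is only counted when a latency target percentile is set."""
    if read_pct == 0 and write_pct == 0:
        return 0
    return spanning_ios


def after_period(busy_level: int, nr_lagging: int) -> Tuple[int, bool]:
    """Return (busy_level, may_increase_vrate).

    busy_level is left as-is; lagging IOs only block a vrate increase.
    A negative busy_level is what would normally allow vrate to go up.
    """
    may_increase = busy_level < 0 and nr_lagging == 0
    return busy_level, may_increase
```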
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • iocost: better trace vrate changes · 25d41e4a
      Tejun Heo authored
      
      vrate_adj tracepoint traces vrate changes; however, it does so only
      when busy_level is non-zero.  busy_level turning to zero can sometimes
      be as interesting an event.  This patch also enables vrate_adj
      tracepoint on other vrate related events - busy_level changes and
      non-zero nr_lagging.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: don't release queue's sysfs lock during switching elevator · b89f625e
      Ming Lei authored
      
      cecf5d87 ("block: split .sysfs_lock into two locks") started to
      release & re-acquire sysfs_lock before registering/un-registering the
      elevator queue while switching elevators, to avoid a potential deadlock
      between showing & storing 'queue/iosched' attributes and removing the
      elevator's kobject.

      It turns out there is no such deadlock because 'q->sysfs_lock' isn't
      required in .show & .store of queue/iosched's attributes, and only the
      elevator's sysfs lock is acquired in elv_iosched_store() and
      elv_iosched_show(). So it is safe to hold the queue's sysfs lock when
      registering/un-registering the elevator queue.

      The biggest issue is that commit cecf5d87 assumes that concurrent
      writes to 'queue/scheduler' can't happen. However, this assumption isn't
      true, because kernfs_fop_write() only guarantees that concurrent writes
      aren't issued on the same open file, but the writes could come from
      different opens of the file. So we can't release & re-acquire the queue's
      sysfs lock while switching elevators, otherwise a use-after-free on the
      elevator could be triggered.

      Fix the issue by not releasing the queue's sysfs lock while switching
      elevators.
      
      Fixes: cecf5d87 ("block: split .sysfs_lock into two locks")
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: move lockdep_assert_held() into elevator_exit · 284b94be
      Ming Lei authored
      
      Commit c48dac13 ("block: don't hold q->sysfs_lock in elevator_init_mq")
      removed q->sysfs_lock from elevator_init_mq(), but forgot to deal with
      the lockdep_assert_held() called in blk_mq_sched_free_requests(), which
      runs in the failure path of elevator_init_mq().
      
      blk_mq_sched_free_requests() is called in the following 3 functions:
      
      	elevator_init_mq()
      	elevator_exit()
      	blk_cleanup_queue()
      
      In blk_cleanup_queue(), blk_mq_sched_free_requests() is followed exactly
      by 'mutex_lock(&q->sysfs_lock)'.
      
      So move the lockdep_assert_held() from blk_mq_sched_free_requests()
      into elevator_exit() to fix the report by syzbot.
      
      Reported-by: <syzbot+da3b7677bb913dc1b737@syzkaller.appspotmail.com>
      Fixes: c48dac13 ("block: don't hold q->sysfs_lock in elevator_init_mq")
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. Sep 23, 2019
  6. Sep 18, 2019
  7. Sep 15, 2019
  8. Sep 14, 2019
  9. Sep 12, 2019
    • block: fix race between switching elevator and removing queues · 0a67b5a9
      Ming Lei authored
      
      cecf5d87 ("block: split .sysfs_lock into two locks") started to
      release & re-acquire sysfs_lock while switching elevators. So it
      isn't enough to prevent an elevator switch by simply clearing
      QUEUE_FLAG_REGISTERED while holding sysfs_lock, because an
      in-progress switch can still move on after re-acquiring the lock,
      and meanwhile QUEUE_FLAG_REGISTERED won't get checked.

      Fix this issue by checking 'q->elevator' directly & locklessly after
      q->kobj is removed in blk_unregister_queue(). This is safe because
      q->elevator can't be changed at that time.
      
      Fixes: cecf5d87 ("block: split .sysfs_lock into two locks")
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: bypass blk_set_runtime_active for uninitialized q->dev · 8a15b4d7
      Stanley Chu authored
      
      Some devices may skip blk_pm_runtime_init() and have a NULL pointer
      in their request_queue->dev, for example SCSI devices of UFS Well-Known
      LUNs.

      Currently the NULL pointer is checked by the user of
      blk_set_runtime_active(), i.e., scsi_dev_type_resume(). It is better to
      check it in blk_set_runtime_active() itself instead of in its users.
      
      Signed-off-by: Stanley Chu <stanley.chu@mediatek.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. Sep 10, 2019
    • iocost_monitor: Report debt · 7c1ee704
      Tejun Heo authored
      
      Report debt and rename del_ms row to delay for consistency.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: Don't let merges push vtime into the future · e1518f63
      Tejun Heo authored
      
      Merges have the same problem that forced-bios had which is fixed by
      the previous patch.  The cost of a merge is calculated at the time of
      issue and force-advances vtime into the future.  Until global vtime
      catches up, how the cgroup's hweight changes in the meantime doesn't
      matter and it often leads to situations where the cost is calculated
      at one hweight and paid at a very different one.  See the previous
      patch for more details.
      
      Fix it by never advancing vtime into the future for merges.  If budget
      is available, vtime is advanced.  Otherwise, the cost is charged as
      debt.
      
      This brings merge cost handling in line with issue cost handling in
      ioc_rqos_throttle().
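
      The rule for merges can be sketched in a few lines of Python. This is
      a toy model, not the kernel implementation: the Cgroup class and
      charge_merge() are hypothetical names, and cost scaling by hweight is
      left out.

```python
from dataclasses import dataclass


@dataclass
class Cgroup:
    vtime: int = 0  # local virtual time consumed so far
    debt: int = 0   # cost that could not be paid from the budget yet


def charge_merge(cg: Cgroup, global_vtime: int, cost: int) -> None:
    """Charge a merge without ever pushing cg.vtime past global_vtime."""
    if cg.vtime + cost <= global_vtime:
        cg.vtime += cost  # budget is available: advance vtime
    else:
        cg.debt += cost   # no budget: remember the cost as debt
```

      Whatever the sequence of merges, the local vtime in this model never
      runs ahead of the global vtime.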
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: Account force-charged overage in absolute vtime · 36a52481
      Tejun Heo authored
      
      Currently, when a bio needs to be force-charged and there isn't enough
      budget, vtime is simply pushed into the future.  This means that the
      cost of the whole bio is scaled using the current hweight and then
      charged immediately.  Until the global vtime advances beyond this
      future vtime, the cgroup won't be allowed to issue normal IOs.
      
      This is incorrect and can lead to, for example, exploding vrate or
      extended stalls if vrate range is constrained.  Consider the following
      scenario.
      
      1. A cgroup with a very low hweight runs out of budget.
      
      2. A storm of swap-out happens on it.  All of them are scaled
         according to the current low hweight and charged to vtime pushing
         it to a far future.
      
      3. All other cgroups go idle and now the above cgroup has access to
         the whole device.  However, because vtime is already wound using
         the past low hweight, what its current hweight is doesn't matter
         until global vtime catches up to the local vtime.
      
      4. As a result, either vrate gets ramped up extremely or the IOs stall
         while the underlying device is idle.
      
      This is because the hweight the overage is calculated at is different
      from the hweight that it's being paid at.
      
      Fix it by remembering the overage in absolute vtime and continuously
      paying with the actual budget according to the current hweight at each
      period.

      Note that non-forced bios which wait already remember the cost in
      absolute vtime.  This brings forced-bio accounting in line.
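
      To see why the hweight at charge time matters, here is a small worked
      sketch in Python. The numbers and function names are made up for
      illustration, and hweight is modeled as a plain fraction of the
      device; the kernel's fixed-point arithmetic is not reproduced.

```python
from fractions import Fraction


def scaled_cost(abs_cost: Fraction, hweight: Fraction) -> Fraction:
    """Old scheme: cost in local vtime, fixed by the hweight at charge time."""
    return abs_cost / hweight


def pay_debt(abs_debt: Fraction, period_abs_budget: Fraction,
             hweight: Fraction) -> Fraction:
    """New scheme: pay absolute debt from this period's share of the budget.

    Returns the absolute debt that remains after the period.
    """
    payment = min(abs_debt, period_abs_budget * hweight)
    return abs_debt - payment
```

      With an absolute cost of 100 charged at hweight 1/100, the old scheme
      winds local vtime forward by 10000 regardless of later changes. In the
      new scheme, once the cgroup owns the whole device (hweight 1) and the
      period budget is 50, the same debt is gone after two periods.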
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: Fix incorrect operation order during iocg free · e036c4ca
      Tejun Heo authored
      
      ioc_pd_free() first cancels the hrtimers and then deactivates the
      iocg.  However, the iocg timer can run in between and reschedule the
      hrtimers, which will end up running after the iocg is freed, leading to
      crashes like the following.
      
        general protection fault: 0000 [#1] SMP
        ...
        RIP: 0010:iocg_kick_delay+0xbe/0x1b0
        RSP: 0018:ffffc90003598ea0 EFLAGS: 00010046
        RAX: 1cee00fd69512b54 RBX: ffff8881bba48400 RCX: 00000000000003e8
        RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8881bba48400
        RBP: 0000000000004e20 R08: 0000000000000002 R09: 00000000000003e8
        R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90003598ef0
        R13: 00979f3810ad461f R14: ffff8881bba4b400 R15: 25439f950d26e1d1
        FS:  0000000000000000(0000) GS:ffff88885f800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f64328c7e40 CR3: 0000000002409005 CR4: 00000000003606e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <IRQ>
         iocg_delay_timer_fn+0x3d/0x60
         __hrtimer_run_queues+0xfe/0x270
         hrtimer_interrupt+0xf4/0x210
         smp_apic_timer_interrupt+0x5e/0x120
         apic_timer_interrupt+0xf/0x20
         </IRQ>
      
      Fix it by canceling hrtimers after deactivating the iocg.
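
      The ordering problem can be reproduced in a toy Python model. It
      assumes, as a simplification, that the periodic timer only re-arms
      hrtimers of iocgs that are still active; the class and function names
      are hypothetical.

```python
from typing import Callable


class Iocg:
    """Toy iocg with an 'active' flag and one hrtimer."""

    def __init__(self) -> None:
        self.active = True
        self.timer_armed = True

    def periodic_timer(self) -> None:
        # Assumption: only active iocgs get their hrtimers re-armed.
        if self.active:
            self.timer_armed = True

    def cancel_hrtimers(self) -> None:
        self.timer_armed = False

    def deactivate(self) -> None:
        self.active = False


def free_old_order(iocg: Iocg, interleave: Callable[[], None]) -> bool:
    """Cancel first, deactivate second. Returns True if a timer is left armed."""
    iocg.cancel_hrtimers()
    interleave()  # the periodic timer may run here and re-arm the hrtimer
    iocg.deactivate()
    return iocg.timer_armed


def free_new_order(iocg: Iocg, interleave: Callable[[], None]) -> bool:
    """Deactivate first, cancel second. Returns True if a timer is left armed."""
    iocg.deactivate()
    interleave()  # the periodic timer no longer touches this iocg
    iocg.cancel_hrtimers()
    return iocg.timer_armed
```

      In the old order a timer can still be armed when the iocg is freed; in
      the new order the cancel is the last step, so nothing can re-arm it.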
      
      Fixes: 7caa4715 ("blkcg: implement blk-iocost")
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. Sep 06, 2019
    • bfq: Add per-device weight · 795fe54c
      Fam Zheng authored
      
      This adds to BFQ the missing per-device weight interfaces:
      blkio.bfq.weight_device on legacy and io.bfq.weight on unified. The
      implementation pretty closely resembles what we had in CFQ and the parsing code
      is basically reused.
      
      Tests
      =====
      
      Using two cgroups and three block devices, with weights set up as:
      
      Cgroup          test1           test2
      ============================================
      default         100             500
      sda             500             100
      sdb             default         default
      sdc             200             200
      
      cgroup v1 runs
      --------------
      
          sda.test1.out:   READ: bw=913MiB/s
          sda.test2.out:   READ: bw=183MiB/s
      
          sdb.test1.out:   READ: bw=213MiB/s
          sdb.test2.out:   READ: bw=1054MiB/s
      
          sdc.test1.out:   READ: bw=650MiB/s
          sdc.test2.out:   READ: bw=650MiB/s
      
      cgroup v2 runs
      --------------
      
          sda.test1.out:   READ: bw=915MiB/s
          sda.test2.out:   READ: bw=184MiB/s
      
          sdb.test1.out:   READ: bw=216MiB/s
          sdb.test2.out:   READ: bw=1069MiB/s
      
          sdc.test1.out:   READ: bw=621MiB/s
          sdc.test2.out:   READ: bw=622MiB/s
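
      As a sanity check on these numbers, the weight lookup and the expected
      bandwidth ratios can be written out in Python. This is an illustrative
      sketch only: the helper names are hypothetical, and it assumes that
      bandwidth is shared in proportion to the effective weights.

```python
from fractions import Fraction

# Weights from the table above: a cgroup default plus per-device overrides.
WEIGHTS = {
    "test1": {"default": 100, "sda": 500, "sdc": 200},
    "test2": {"default": 500, "sda": 100, "sdc": 200},
}


def effective_weight(cgroup: str, device: str) -> int:
    """Per-device weight if one is set, else the cgroup's default weight."""
    conf = WEIGHTS[cgroup]
    return conf.get(device, conf["default"])


def expected_ratio(device: str) -> Fraction:
    """Expected test1:test2 bandwidth ratio under proportional sharing."""
    return Fraction(effective_weight("test1", device),
                    effective_weight("test2", device))
```

      The expected ratios are 5:1 on sda, 1:5 on sdb and 1:1 on sdc, which is
      close to the measured 913:183, 213:1054 and 650:650 in the cgroup v1
      runs.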
      
      Signed-off-by: Fam Zheng <zhengfeiran@bytedance.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Paolo Valente <paolo.valente@linaro.org>

      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bfq: Extract bfq_group_set_weight from bfq_io_set_weight_legacy · 5ff047e3
      Fam Zheng authored
      
      This function will be useful when we update weight from the soon-coming
      per-device interface.
      
      Signed-off-by: Fam Zheng <zhengfeiran@bytedance.com>
      Reviewed-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bfq: Fix the missing barrier in __bfq_entity_update_weight_prio · e9d3c866
      Fam Zheng authored
      
      The comment of bfq_group_set_weight says the reading of prio_changed
      should happen before the reading of weight, but a memory barrier is
      missing here. Add it now, to match the smp_wmb() there.
      
      Signed-off-by: Fam Zheng <zhengfeiran@bytedance.com>
      Reviewed-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: fix elevator_get_by_features() · a2614255
      Jens Axboe authored
      
      The lookup logic is broken - 'e' will never be NULL, even if the
      list is empty. Maintain lookup hit in a separate variable instead.
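
      The same pattern, shown as a Python analogue rather than the kernel
      code: the loop variable is never used as the result, and the hit is
      kept in its own variable that stays None when nothing matches. The
      Elevator tuple and the function name are hypothetical.

```python
from collections import namedtuple
from typing import Iterable, Optional

Elevator = namedtuple("Elevator", ["name", "features"])


def get_by_features(elevators: Iterable[Elevator],
                    required: int) -> Optional[Elevator]:
    """Return the first elevator providing all required feature bits."""
    found = None  # separate lookup-hit variable
    for e in elevators:
        if (e.features & required) == required:
            found = e
            break
    # 'e' must not be returned here: after a full traversal it does not
    # signal "no match" (in C, the list cursor is not NULL at that point).
    return found
```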
      
      Fixes: a0958ba7 ("block: Improve default elevator selection")
      Reported-by: Julia Lawall <julia.lawall@lip6.fr>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Delay default elevator initialization · 737eb78e
      Damien Le Moal authored
      
      When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
      the only information known about the device is the number of hardware
      queues, as the block device scan by the device driver is not yet
      completed for most drivers. The device type and elevator required
      features are not set yet, which prevents correctly selecting the
      default elevator most suitable for the device.
      
      This currently affects all multi-queue zoned block devices which default
      to the "none" elevator instead of the required "mq-deadline" elevator.
      These drives currently include host-managed SMR disks connected to a
      smartpqi HBA and null_blk block devices with zoned mode enabled.
      Upcoming NVMe Zoned Namespace devices will also be affected.
      
      Fix this by adding the boolean elevator_init argument to
      blk_mq_init_allocated_queue() to control the execution of
      elevator_init_mq(). Two cases exist:
      1) elevator_init = false is used for calls to
         blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
         case, a call to elevator_init_mq() is added to __device_add_disk(),
         resulting in the delayed initialization of the queue elevator
         after the device driver finished probing the device information. This
         effectively allows elevator_init_mq() access to more information
         about the device.
      2) elevator_init = true preserves the current behavior of initializing
         the elevator directly from blk_mq_init_allocated_queue(). This case
         is used for the special request based DM devices where the device
         gendisk is created before the queue initialization and device
         information (e.g. queue limits) is already known when the queue
         initialization is executed.
      
      Additionally, to make sure that the elevator initialization is never
      done while requests are in-flight (there should be none when the device
      driver calls device_add_disk()), freeze and quiesce the device request
      queue before calling blk_mq_init_sched() in elevator_init_mq().
      
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Improve default elevator selection · a0958ba7
      Damien Le Moal authored
      
      For block devices that do not specify required features, preserve the
      current default elevator selection (mq-deadline for single queue
      devices, none for multi-queue devices). However, for devices specifying
      required features (e.g. zoned block devices ELEVATOR_F_ZBD_SEQ_WRITE
      feature), select the first available elevator providing the required
      features.
      
      In all cases, default to "none" if no elevator is available or if the
      initialization of the default elevator fails.
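
      The selection rules above can be summarized in a short Python sketch.
      It is a model of the described policy, not the kernel function: the
      default_elevator() name, the init_ok callback and the bit value given
      to ELEVATOR_F_ZBD_SEQ_WRITE are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

ELEVATOR_F_ZBD_SEQ_WRITE = 1 << 0  # illustrative bit value


def default_elevator(nr_hw_queues: int,
                     required_features: int,
                     available: List[Tuple[str, int]],
                     init_ok: Callable[[str], bool] = lambda name: True) -> str:
    """Pick the default elevator name; 'available' holds (name, features)."""
    choice: Optional[str] = None
    if required_features:
        # First available elevator providing all required features.
        for name, features in available:
            if (features & required_features) == required_features:
                choice = name
                break
    elif nr_hw_queues == 1:
        # Unchanged default for single queue devices.
        if any(name == "mq-deadline" for name, _ in available):
            choice = "mq-deadline"
    # Multi-queue devices without required features keep "none".
    if choice is None or not init_ok(choice):
        return "none"
    return choice
```

      For example, a multi-queue zoned device that requires
      ELEVATOR_F_ZBD_SEQ_WRITE gets mq-deadline in this model if it is
      available, and falls back to "none" otherwise.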
      
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Introduce elevator features · 68c43f13
      Damien Le Moal authored
      
      Introduce the definition of elevator features through the
      elevator_features flags in the elevator_type structure. Each flag can
      represent a feature supported by an elevator. The first feature defined
      by this patch is support for zoned block device sequential write
      constraint with the flag ELEVATOR_F_ZBD_SEQ_WRITE, which is implemented
      by the mq-deadline elevator using zone write locking.
      
      Other possible features are IO priorities, write hints, latency targets
      or single-LUN dual-actuator disks (for which the elevator could maintain
      one LBA ordered list per actuator).
      
      The required_elevator_features field is also added to the request_queue
      structure to allow a device driver to specify elevator feature flags
      that an elevator must support for the correct operation of the device
      (e.g. device drivers for zoned block devices can have the
      ELEVATOR_F_ZBD_SEQ_WRITE flag as a required feature).
      The helper function blk_queue_required_elevator_features() is
      defined for setting this new field.
      
      With these two new fields in place, the elevator functions
      elevator_match() and elevator_find() are modified to allow a user to set
      only an elevator with a set of features that satisfies the device
      required features. Elevators not matching the device requirements are
      not shown in the device sysfs queue/scheduler file to prevent their use.
      
      The "none" elevator can always be selected as before.
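
      The feature check amounts to a bitmask subset test. The Python sketch
      below illustrates it under assumptions: the helper names and the bit
      value are hypothetical, and the order in which names are listed is
      not meant to match the sysfs file.

```python
from typing import List, Tuple

ELEVATOR_F_ZBD_SEQ_WRITE = 1 << 0  # illustrative bit value


def supports(elevator_features: int, required_features: int) -> bool:
    """True if the elevator provides every required feature bit."""
    return (elevator_features & required_features) == required_features


def selectable_schedulers(registered: List[Tuple[str, int]],
                          required_features: int) -> List[str]:
    """Names a user could select for a queue with the given requirements."""
    names = [name for name, features in registered
             if supports(features, required_features)]
    return names + ["none"]  # "none" can always be selected
```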
      
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Change elevator_init_mq() to always succeed · 954b4a5c
      Damien Le Moal authored
      
      If the default elevator chosen is mq-deadline, elevator_init_mq() may
      return an error if mq-deadline initialization fails, leading to
      blk_mq_init_allocated_queue() returning an error, which in turn will
      cause the block device initialization to fail and the device not being
      exposed.
      
      Instead of taking such an extreme measure, handle mq-deadline
      initialization failures in the same manner as when mq-deadline is not
      available (no module to load), that is, default to the "none" scheduler.
      With this change, elevator_init_mq() return type can be changed to void.
      
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Cleanup elevator_init_mq() use · 61db437d
      Damien Le Moal authored
      
      Instead of checking a queue tag_set BLK_MQ_F_NO_SCHED flag before
      calling elevator_init_mq() to make sure that the queue supports IO
      scheduling, use the elevator.c function elv_support_iosched() in
      elevator_init_mq(). This does not introduce any functional change but
      ensures that elevator_init_mq() does the right thing based on the queue
      settings.
      
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. Sep 03, 2019