  1. Dec 12, 2018
    • block: deactivate blk_stat timer in wbt_disable_default() · 544fbd16
      Ming Lei authored
      
      The value returned by rwb_enabled() must not change while there is any inflight IO.
      
      wbt_disable_default() may set rwb->wb_normal to zero; however, the
      blk_stat timer may still be pending, and its callback will then update
      rwb->wb_normal again.
      
      This patch introduces blk_stat_deactivate() and applies it in
      wbt_disable_default(), fixing the following IO hang, triggered when
      running parted while switching the io scheduler (a sketch of the fix
      follows the trace):
      
      [  369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
      [  369.938941]       Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
      [  369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  369.940768] parted          D    0  3645   3239 0x00000000
      [  369.941500] Call Trace:
      [  369.941874]  ? __schedule+0x6d9/0x74c
      [  369.942392]  ? wbt_done+0x5e/0x5e
      [  369.942864]  ? wbt_cleanup_cb+0x16/0x16
      [  369.943404]  ? wbt_done+0x5e/0x5e
      [  369.943874]  schedule+0x67/0x78
      [  369.944298]  io_schedule+0x12/0x33
      [  369.944771]  rq_qos_wait+0xb5/0x119
      [  369.945193]  ? karma_partition+0x1c2/0x1c2
      [  369.945691]  ? wbt_cleanup_cb+0x16/0x16
      [  369.946151]  wbt_wait+0x85/0xb6
      [  369.946540]  __rq_qos_throttle+0x23/0x2f
      [  369.947014]  blk_mq_make_request+0xe6/0x40a
      [  369.947518]  generic_make_request+0x192/0x2fe
      [  369.948042]  ? submit_bio+0x103/0x11f
      [  369.948486]  ? __radix_tree_lookup+0x35/0xb5
      [  369.949011]  submit_bio+0x103/0x11f
      [  369.949436]  ? blkg_lookup_slowpath+0x25/0x44
      [  369.949962]  submit_bio_wait+0x53/0x7f
      [  369.950469]  blkdev_issue_flush+0x8a/0xae
      [  369.951032]  blkdev_fsync+0x2f/0x3a
      [  369.951502]  do_fsync+0x2e/0x47
      [  369.951887]  __x64_sys_fsync+0x10/0x13
      [  369.952374]  do_syscall_64+0x89/0x149
      [  369.952819]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  369.953492] RIP: 0033:0x7f95a1e729d4
      [  369.953996] Code: Bad RIP value.
      [  369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
      [  369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
      [  369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
      [  369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
      [  369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
      [  369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008
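
      A minimal sketch of the shape of the fix, assuming the 4.20-era
      blk-stat and wbt structures (the enable_state check and field names
      follow that era's code and are illustrative, not a verbatim diff):

          /* blk-stat.h: stop a pending stat callback, including its timer */
          static inline void blk_stat_deactivate(struct blk_stat_callback *cb)
          {
                  del_timer_sync(&cb->timer);
          }

          /* blk-wbt.c */
          void wbt_disable_default(struct request_queue *q)
          {
                  struct rq_qos *rqos = wbt_rq_qos(q);
                  struct rq_wb *rwb;

                  if (!rqos)
                          return;
                  rwb = RQWB(rqos);
                  if (rwb->enable_state == WBT_STATE_ON_DEFAULT) {
                          /*
                           * Stop the pending timer before zeroing the limits,
                           * so its callback can no longer re-enable writeback
                           * throttling behind our back.
                           */
                          blk_stat_deactivate(rwb->cb);
                          rwb->wb_normal = 0;
                  }
          }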
      
      Cc: stable@vger.kernel.org
      Cc: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. Dec 08, 2018
    • blk-mq: re-build queue map in case of kdump kernel · 59388702
      Ming Lei authored
      
      Almost all .map_queues() implementations based on managed irq
      affinity don't update the queue mapping; they just retrieve the
      previously built mapping. So if nr_hw_queues changes, the mapping
      table contains stale entries, and only blk_mq_map_queues() can
      rebuild the mapping table.
      
      One such case is that we limit nr_hw_queues to 1 in a kdump kernel.
      However, drivers often build the queue mapping before allocating the
      tagset via pci_alloc_irq_vectors_affinity(), and set->nr_hw_queues can
      be set to 1 in the kdump kernel, so a wrong queue mapping is used and
      a kernel panic[1] is observed during boot.
      
      This patch fixes the kernel panic triggered on nvme by rebuilding the
      mapping table via blk_mq_map_queues() (a sketch of the check follows
      the panic log below).
      
      [1] kernel panic log
      [    4.438371] nvme nvme0: 16/0/0 default/read/poll queues
      [    4.443277] BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
      [    4.444681] PGD 0 P4D 0
      [    4.445367] Oops: 0000 [#1] SMP NOPTI
      [    4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
      [    4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
      [    4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
      [    4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
      [    4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
      [    4.453132] RSP: 0018:ffffc900023b3cd8 EFLAGS: 00010286
      [    4.454061] RAX: 0000000000000000 RBX: ffff888174448000 RCX: 0000000000000001
      [    4.456480] RDX: 0000000000000001 RSI: ffffe8feffc506c0 RDI: 0000000000000001
      [    4.458750] RBP: ffff88810722d008 R08: ffff88817647a880 R09: 0000000000000002
      [    4.464580] R10: ffffc900023b3c10 R11: 0000000000000004 R12: ffff888174448538
      [    4.467803] R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000001
      [    4.469220] FS:  0000000000000000(0000) GS:ffff88817bac0000(0000) knlGS:0000000000000000
      [    4.471554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    4.472464] CR2: 0000000000000098 CR3: 0000000174e4e001 CR4: 0000000000760ee0
      [    4.474264] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    4.476007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [    4.477061] PKRU: 55555554
      [    4.477464] Call Trace:
      [    4.478731]  blk_mq_init_allocated_queue+0x36a/0x3ad
      [    4.479595]  blk_mq_init_queue+0x32/0x4e
      [    4.480178]  nvme_validate_ns+0x98/0x623 [nvme_core]
      [    4.480963]  ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
      [    4.481685]  ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
      [    4.482601]  nvme_scan_work+0x23a/0x29b [nvme_core]
      [    4.483269]  ? _raw_spin_unlock_irqrestore+0x25/0x38
      [    4.483930]  ? try_to_wake_up+0x38d/0x3b3
      [    4.484478]  ? process_one_work+0x179/0x2fc
      [    4.485118]  process_one_work+0x1d3/0x2fc
      [    4.485655]  ? rescuer_thread+0x2ae/0x2ae
      [    4.486196]  worker_thread+0x1e9/0x2be
      [    4.486841]  kthread+0x115/0x11d
      [    4.487294]  ? kthread_park+0x76/0x76
      [    4.487784]  ret_from_fork+0x3a/0x50
      [    4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables
      [    4.489428] Dumping ftrace buffer:
      [    4.489939]    (ftrace buffer empty)
      [    4.490492] CR2: 0000000000000098
      [    4.491052] ---[ end trace 03cd268ad5a86ff7 ]---
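
      A sketch of the check, assuming the 4.20-era tag set layout with
      set->map[0] as the default map (illustrative, not a verbatim diff):

          /* block/blk-mq.c */
          static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
          {
                  /*
                   * A driver ->map_queues() based on managed irq affinity
                   * only retrieves the pre-built mapping; in a kdump kernel,
                   * where nr_hw_queues was forced to 1, that mapping is
                   * stale, so fall back to the generic blk_mq_map_queues().
                   */
                  if (set->ops->map_queues && !is_kdump_kernel())
                          return set->ops->map_queues(set);

                  return blk_mq_map_queues(&set->map[0]);
          }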
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: linux-nvme@lists.infradead.org
      Cc: David Milburn <dmilburn@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: convert io-latency to use rq_qos_wait · d3fcdff1
      Josef Bacik authored
      
      Now that we have this common helper, convert io-latency over to use it
      as well.
      
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: convert wbt_wait() to use rq_qos_wait() · b6c7b58f
      Josef Bacik authored
      
      Now that we have rq_qos_wait() in place, convert wbt_wait() over to
      using it with its own specific callbacks.
      
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: add rq_qos_wait to rq_qos · 84f60324
      Josef Bacik authored
      
      Originally when I split out the common code from blk-wbt into rq_qos I
      left the wbt_wait() where it was and simply copied and modified it
      slightly to work for io-latency.  However they are both basically the
      same thing, and as time has gone on wbt_wait() has ended up much smarter
      and kinder than it was when I copied it into io-latency, which means
      io-latency has lost out on these improvements.
      
      Since they are essentially the same thing except for a few minor
      details, create rq_qos_wait() to replicate what wbt_wait() currently
      does, with callbacks that can be passed in for the snowflakes to do
      their own thing as appropriate.
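
      A sketch of the resulting interface (typedefs and signature per the
      blk-rq-qos additions in this series; illustrative):

          /* policies keep their own inflight accounting via callbacks */
          typedef bool (acquire_inflight_cb_t)(struct rq_wait *rqw, void *private_data);
          typedef void (cleanup_cb_t)(struct rq_wait *rqw, void *private_data);

          void rq_qos_wait(struct rq_wait *rqw, void *private_data,
                           acquire_inflight_cb_t *acquire_inflight_cb,
                           cleanup_cb_t *cleanup_cb);

      wbt_wait() then reduces to a call along the lines of
      rq_qos_wait(rqw, &data, wbt_inflight_cb, wbt_cleanup_cb), and
      io-latency passes in its own pair of callbacks.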
      
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: rename blkg_try_get() to blkg_tryget() · 7754f669
      Dennis Zhou authored
      
      blkg reference counting now uses percpu_ref rather than atomic_t.
      Let's make the naming consistent with css_tryget. This renames
      blkg_try_get to blkg_tryget, which now returns a bool rather than the
      blkg or %NULL.
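
      The renamed helper, sketched on top of the percpu_ref conversion
      below (illustrative):

          static inline bool blkg_tryget(struct blkcg_gq *blkg)
          {
                  /* mirrors css_tryget(): true if a reference was taken */
                  return percpu_ref_tryget(&blkg->refcnt);
          }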
      
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: change blkg reference counting to use percpu_ref · 7fcf2b03
      Dennis Zhou authored
      
      Every bio is now associated with a blkg putting blkg_get, blkg_try_get,
      and blkg_put on the hot path. Switch over the refcnt in blkg to use
      percpu_ref.
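
      In outline, assuming a blkg_release callback that defers the actual
      freeing (an illustrative sketch, not the full diff):

          /* at blkg allocation: refcnt starts at 1 */
          ret = percpu_ref_init(&blkg->refcnt, blkg_release, 0, GFP_KERNEL);
          if (ret)
                  goto err_free;

          /* hot-path helpers no longer touch a shared atomic_t */
          static inline void blkg_get(struct blkcg_gq *blkg)
          {
                  percpu_ref_get(&blkg->refcnt);
          }

          static inline void blkg_put(struct blkcg_gq *blkg)
          {
                  percpu_ref_put(&blkg->refcnt);
          }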
      
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: remove bio_disassociate_task() · 6f70fb66
      Dennis Zhou authored
      
      Now that a bio only holds a blkg reference, cleanup is simply putting
      back that reference. bio_disassociate_task() just calls
      bio_disassociate_blkg(), so remove it and call the latter directly.
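
      The remaining cleanup helper, sketched (illustrative):

          void bio_disassociate_blkg(struct bio *bio)
          {
                  if (bio->bi_blkg) {
                          blkg_put(bio->bi_blkg);
                          bio->bi_blkg = NULL;
                  }
          }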
      
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: remove additional reference to the css · fc5a828b
      Dennis Zhou authored
      
      The previous patch in this series removed carrying around a pointer to
      the css in blkg. However, the blkg association logic still relied on
      taking a reference on the css to ensure we wouldn't fail in getting a
      reference for the blkg.
      
      Here the implicit dependency on the css is removed. The association
      continues to rely on the tryget logic walking up the blkg tree. This
      streamlines the three ways that association can happen: normal, swap,
      and writeback.
      
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: remove bio->bi_css and instead use bio->bi_blkg · db6638d7
      Dennis Zhou authored
      
      Prior patches ensured that any bio that interacts with a request_queue
      is properly associated with a blkg. This makes bio->bi_css unnecessary
      as blkg maintains a reference to blkcg already.
      
      This removes the bio field bi_css and transfers corresponding uses to
      access via bi_blkg.
      
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: associate writeback bios with a blkg · fd42df30
      Dennis Zhou authored
      
      One of the goals of this series is to remove a separate reference to
      the css of the bio. This can and should be accessed via bio_blkcg(). In
      this patch, wbc_init_bio() now requires a bio to have a device
      associated with it.
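
      A sketch of the post-series wbc_init_bio(), where
      bio_associate_blkg_from_css() comes from earlier in this series
      (illustrative):

          static inline void wbc_init_bio(struct writeback_control *wbc,
                                          struct bio *bio)
          {
                  /*
                   * bio_set_dev() must have been called on @bio already,
                   * so it can be associated against the wb's blkcg css.
                   */
                  if (wbc->wb)
                          bio_associate_blkg_from_css(bio, wbc->wb->blkcg_css);
          }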
      
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: associate a blkg for pages being evicted by swap · 6a7f6d86
      Dennis Zhou authored
      
      A prior patch in this series added blkg association to bios issued by
      cgroups. There are two other paths through which we want to attribute
      work back to the appropriate cgroup: swap and writeback. Here we
      modify the way swap tags bios to include the blkg. Writeback will be
      tackled in the next patch.
      
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: consolidate bio_issue_init() to be a part of core · e439bedf
      Dennis Zhou authored
      
      bio_issue_init among other things initializes the timestamp for an IO.
      Rather than have this logic handled by policies, this consolidates it to
      be on the init paths (normal, clone, bounce clone).
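
      The consolidated call the core init/clone paths now make is just
      (illustrative):

          /* stamp issue time and size once, in core, not per policy */
          bio_issue_init(&bio->bi_issue, bio_sectors(bio));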
      
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: associate blkg when associating a device · 5cdf2e3f
      Dennis Zhou authored
      
      Previously, blkg association was handled by controller specific code in
      blk-throttle and blk-iolatency. However, because a blkg represents a
      relationship between a blkcg and a request_queue, it makes sense to keep
      the blkg->q and bio->bi_disk->queue consistent.
      
      This patch moves association into the bio_set_dev() macro. This
      should cover the majority of cases where the device is set or
      changed, keeping the two pointers consistent. Fallback code is added
      to blkcg_bio_issue_check() to catch any missing paths.
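
      A sketch of the resulting macro, assuming the 4.20-era bio fields
      (bi_disk/bi_partno; illustrative):

          #define bio_set_dev(bio, bdev)                          \
          do {                                                    \
                  if ((bio)->bi_disk != (bdev)->bd_disk)          \
                          bio_clear_flag(bio, BIO_THROTTLED);     \
                  (bio)->bi_disk = (bdev)->bd_disk;               \
                  (bio)->bi_partno = (bdev)->bd_partno;           \
                  bio_associate_blkg(bio);                        \
          } while (0)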
      
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • dm: set the static flush bio device on demand · 892ad71f
      Dennis Zhou authored
      
      The next patch changes the macro bio_set_dev() to associate a bio with a
      blkg based on the device set. However, dm creates a static bio to be
      used as the basis for cloning empty flush bios on creation. The
      bio_set_dev() call in alloc_dev() will cause problems with the next
      patch adding association to bio_set_dev() because the call is before the
      bdev is associated with a gendisk (bd_disk is %NULL). To get around
      this, set the device on the static bio every time and use that to clone
      to the other bios.
      
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: introduce common blkg association logic · 2268c0fe
      Dennis Zhou authored
      
      There are 3 ways blkg association can happen: association with the
      current css, with the page css (swap), or from the wbc css (writeback).
      
      This patch handles how association is done for the first case, where
      we are associating based on the current css. If there is already a
      blkg associated, the css will be reused and the association will be
      redone, as the request_queue may have changed.
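
      Roughly the shape bio_associate_blkg() takes, where blkcg_css()
      stands for a helper returning the current task's blkcg css
      (an illustrative sketch):

          void bio_associate_blkg(struct bio *bio)
          {
                  struct cgroup_subsys_state *css;

                  rcu_read_lock();

                  /* reuse the existing association's css, else current's */
                  if (bio->bi_blkg)
                          css = &bio_blkcg(bio)->css;
                  else
                          css = blkcg_css();

                  bio_associate_blkg_from_css(bio, css);

                  rcu_read_unlock();
          }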
      
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: convert blkg_lookup_create() to find closest blkg · beea9da0
      Dennis Zhou authored
      
      There are several scenarios where blkg_lookup_create() can fail, such
      as the blkcg dying, the request_queue dying, or simply being OOM.
      Most callers handle this by falling back to the q->root_blkg and
      calling it a day.
      
      This patch implements the notion of closest blkg. During
      blkg_lookup_create(), if it fails to create, return the closest blkg
      found or the q->root_blkg. blkg_try_get_closest() is introduced and used
      during association so a bio is always attached to a blkg.
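
      A sketch of blkg_try_get_closest(), before the later percpu_ref
      conversion in this series (illustrative):

          static inline struct blkcg_gq *
          blkg_try_get_closest(struct blkcg_gq *blkg)
          {
                  /* walk towards the root until a ref can be taken */
                  while (blkg && !atomic_inc_not_zero(&blkg->refcnt))
                          blkg = blkg->parent;

                  return blkg;
          }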
      
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: update blkg_lookup_create() to do locking · b978962a
      Dennis Zhou authored
      
      To know when to create a blkg, the general pattern is to do a
      blkg_lookup(), and if that fails, lock and do the lookup again, and
      if that still fails, finally create. It doesn't make much sense for
      every caller that wants to do creation to write this out itself.
      
      This changes blkg_lookup_create() to do locking and implement this
      pattern. The old blkg_lookup_create() is renamed to
      __blkg_lookup_create().  If a call site wants to do its own error
      handling or already owns the queue lock, they can use
      __blkg_lookup_create(). This will be used in upcoming patches.
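
      The factored-out pattern, sketched against the 4.20-era queue_lock
      (still a pointer at that point; illustrative):

          struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
                                              struct request_queue *q)
          {
                  struct blkcg_gq *blkg = blkg_lookup(blkcg, q);

                  if (unlikely(!blkg)) {
                          /* lock and retry: another CPU may have created it */
                          spin_lock_irq(q->queue_lock);
                          blkg = __blkg_lookup_create(blkcg, q);
                          spin_unlock_irq(q->queue_lock);
                  }

                  return blkg;
          }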
      
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: fix ref count issue with bio_blkcg() using task_css · 0fe061b9
      Dennis Zhou authored
      
      The bio_blkcg() function turns out to be inconsistent and consequently
      dangerous to use. The first part returns a blkcg where a reference is
      owned by the bio meaning it does not need to be rcu protected. However,
      the third case, the last line, is problematic:
      
      	return css_to_blkcg(task_css(current, io_cgrp_id));
      
      This can race against task migration and the cgroup dying. It is also
      semantically different as it must be called rcu protected and is
      susceptible to failure when trying to get a reference to it.
      
      This patch adds association ahead of calling bio_blkcg() rather than
      after. This makes association a required and explicit step along the
      code paths for calling bio_blkcg(). In blk-iolatency, association is
      moved above the bio_blkcg() call to ensure it will not return %NULL.
      
      BFQ uses the old bio_blkcg() function, but I do not want to address it
      in this series due to the complexity. I have created a private version
      documenting the inconsistency and noting not to use it.
      
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. Dec 07, 2018
    • blk-mq: punt failed direct issue to dispatch list · c616cbee
      Jens Axboe authored
      
      After the direct dispatch corruption fix, we permanently disallow direct
      dispatch of non read/write requests. This works fine off the normal IO
      path, as they will be retried like any other failed direct dispatch
      request. But for the blk_insert_cloned_request() that only DM uses to
      bypass the bottom level scheduler, we always first attempt direct
      dispatch. For some types of requests, that's now a permanent failure,
      and no amount of retrying will make that succeed. This results in a
      livelock.
      
      Instead of making special cases for what we can direct issue, and now
      having to deal with DM solving the livelock while still retaining a BUSY
      condition feedback loop, always just add a request that has been through
      ->queue_rq() to the hardware queue dispatch list. These are safe to use
      as no merging can take place there. Additionally, if requests do have
      prepped data from drivers, we aren't dependent on them not sharing space
      in the request structure to safely add them to the IO scheduler lists.
      
      This basically reverts ffe81d45 and is based on a patch from Ming,
      but with the list insert case covered as well.
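
      The core of the change, sketched as the BUSY handling inside the
      direct-issue path (illustrative, not a verbatim diff):

          case BLK_STS_RESOURCE:
          case BLK_STS_DEV_RESOURCE:
                  /*
                   * The request went through ->queue_rq() and failed; park
                   * it on the hctx dispatch list, where no merging can
                   * happen, instead of handing it back to the scheduler.
                   */
                  blk_mq_request_bypass_insert(rq, run_queue);
                  break;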
      
      Fixes: ffe81d45 ("blk-mq: fix corruption with direct issue")
      Cc: stable@vger.kernel.org
      Suggested-by: Ming Lei <ming.lei@redhat.com>
      Reported-by: Bart Van Assche <bvanassche@acm.org>
      Tested-by: Ming Lei <ming.lei@redhat.com>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: fix decrement of num_active_groups · ba7aeae5
      Paolo Valente authored
      
      Since commit '2d29c9f8 ("block, bfq: improve asymmetric scenarios
      detection")', if there are process groups with I/O requests waiting for
      completion, then BFQ tags the scenario as 'asymmetric'. This detection
      is needed for preserving service guarantees (for details, see comments
      on the computation of the variable asymmetric_scenario in the
      function bfq_better_to_idle).
      
      Unfortunately, commit '2d29c9f8 ("block, bfq: improve asymmetric
      scenarios detection")' contains an error exactly in the updating of
      the number of groups with I/O requests waiting for completion: if a
      group has more than one descendant process, then the above number of
      groups, which is renamed from num_active_groups to a more appropriate
      num_groups_with_pending_reqs by this commit, may happen to be wrongly
      decremented multiple times, namely every time one of the descendant
      processes gets all its pending I/O requests completed.
      
      A correct, complete solution should work as follows. Consider a group
      that is inactive, i.e., that has no descendant process with pending
      I/O inside BFQ queues. Then suppose that num_groups_with_pending_reqs
      is still accounting for this group, because the group still has some
      descendant process with some I/O request still in
      flight. num_groups_with_pending_reqs should be decremented when the
      in-flight request of the last descendant process is finally completed
      (assuming that nothing else has changed for the group in the meantime,
      in terms of composition of the group and active/inactive state of
      child groups and processes). To accomplish this, an additional
      pending-request counter must be added to entities, and must be
      updated correctly.
      
      To avoid this additional field and operations, this commit resorts to
      the following tradeoff between simplicity and accuracy: for an
      inactive group that is still counted in num_groups_with_pending_reqs,
      this commit decrements num_groups_with_pending_reqs when the first
      descendant process of the group remains with no request waiting for
      completion.
      
      This simplified scheme provides a fix to the unbalanced decrements
      introduced by 2d29c9f8. Since this error was also caused by lack
      of comments on this non-trivial issue, this commit also adds related
      comments.
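
      The simplified decrement, sketched with the per-entity flag this
      commit introduces (illustrative):

          /*
           * Decrement at most once per group: the first descendant process
           * left with no pending request clears the flag for everyone.
           */
          if (entity->in_groups_with_pending_reqs) {
                  entity->in_groups_with_pending_reqs = false;
                  bfqd->num_groups_with_pending_reqs--;
          }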
      
      Fixes: 2d29c9f8 ("block, bfq: improve asymmetric scenarios detection")
      Reported-by: Steven Barrett <steven@liquorix.net>
      Tested-by: Steven Barrett <steven@liquorix.net>
      Tested-by: Lucjan Lucjanov <lucjan.lucjanov@gmail.com>
      Reviewed-by: Federico Motta <federico@willer.it>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. Dec 05, 2018
    • blk-mq: fix corruption with direct issue · ffe81d45
      Jens Axboe authored

      If we attempt a direct issue to a SCSI device, and it returns BUSY,
      then we queue the request up normally. However, the SCSI layer may
      have already set up SG tables etc. for this particular command. If we
      later merge with this request, then the old tables are no longer
      valid. Once we issue the IO, we only read/write the original part of
      the request, not the new state of it.
      
      This causes data corruption, and is most often noticed with the file
      system complaining about the just read data being invalid:
      
      [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
      
      because most of it is garbage...
      
      This doesn't happen from the normal issue path, as we will simply defer
      the request to the hardware queue dispatch list if we fail. Once it's on
      the dispatch list, we never merge with it.
      
      Fix this from the direct issue path by flagging the request as
      REQ_NOMERGE so we don't change the size of it before issue.
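
      A sketch of the flagging on the BUSY path of direct issue
      (illustrative, not a verbatim diff):

          case BLK_STS_RESOURCE:
          case BLK_STS_DEV_RESOURCE:
                  /*
                   * The driver may have prepped SG tables already; forbid
                   * merging so the request's size can't change under it.
                   */
                  rq->cmd_flags |= REQ_NOMERGE;
                  __blk_mq_requeue_request(rq);
                  break;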
      
      See also:
        https://bugzilla.kernel.org/show_bug.cgi?id=201685

      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Fixes: 6ce3dd6e ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>