  1. Apr 11, 2025
  2. Apr 09, 2025
  3. Apr 08, 2025
    • bus: mhi: host: Address conflict between power_up and syserr · b60a0538
      Jeffrey Hugo authored and Mani Sadhasivam committed
      
      mhi_async_power_up() enables IRQs, at which point we can receive a syserr
      notification from the device.  The syserr notification queues a work item
      that cannot execute until the pm_mutex is released.
      
      If we receive a syserr notification at the right time during
      mhi_async_power_up(), we will fail to initialize the device.
      
      The syserr work item will be pending.  If mhi_async_power_up() detects the
      syserr, it will handle it.  If the device is in PBL, then the PBL state
      transition event will be queued, resulting in a work item after the
      pending syserr work item.  Once mhi_async_power_up() releases the pm_mutex
      the syserr work item can run.  It will blindly attempt to reset the MHI
      state machine, which is the recovery action for syserr.  PBL/SBL are not
      interrupt driven and will ignore the MHI Reset unless syserr is actively
      advertised.  This will cause the syserr work item to time out waiting for
      Reset to be cleared, and will leave the host state in syserr processing.
      The PBL transition work item will then run, and immediately fail because
      syserr processing is not a valid state for PBL transition.
      
      This leaves the device uninitialized.
      
      This issue has a distinctive signature in the kernel log:
      
      [  909.803598] mhi mhi3: Requested to power ON
      [  909.803775] Qualcomm Cloud AI 100 0000:36:00.0: Fatal error received from device.  Attempting to recover
      [  909.803945] mhi mhi3: Power on setup success
      [  911.808444] mhi mhi3: Device failed to exit MHI Reset state
      [  911.808448] mhi mhi3: Device MHI is not in valid state
      
      We cannot remove the syserr handling from mhi_async_power_up() because the
      device may be in the syserr state, but we missed the notification because
      the IRQ fired before IRQs were enabled.  We also can't queue the syserr
      work item from mhi_async_power_up() if syserr is detected because that may
      result in a duplicate work item, and cause the same issue since the
      duplicate item will blindly issue MHI Reset even if syserr is no longer
      active.
      
      Instead, add a check in the syserr work item to make sure that the device
      is in the syserr state if the device is in the PBL or SBL EEs.
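      
      As a rough illustration of that check (a minimal sketch, not the actual
      patch), the decision can be modeled as a pure function.  The enum names
      and syserr_should_reset() below are hypothetical stand-ins, not the
      driver's real identifiers, and the sketch assumes the work item has
      already read the current device state and EE.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's state and EE enums. */
enum dev_state { DEV_STATE_RESET, DEV_STATE_READY, DEV_STATE_M0, DEV_STATE_SYS_ERR };
enum exec_env  { EE_PBL, EE_SBL, EE_AMSS };

/*
 * Decide whether the syserr work item should issue MHI Reset.
 * PBL and SBL only honor Reset while syserr is advertised, so in those
 * EEs the reset is skipped unless the device state is still SYS_ERR.
 */
static bool syserr_should_reset(enum dev_state state, enum exec_env ee)
{
	if (state == DEV_STATE_SYS_ERR)
		return true;	/* syserr is advertised: Reset will be honored */

	/* syserr is no longer advertised: only reset outside PBL/SBL */
	return ee != EE_PBL && ee != EE_SBL;
}
```

      With this shape, a stale or duplicate syserr work item that runs while
      the device is in PBL with syserr already cleared skips the Reset
      instead of timing out.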
      
      It is unknown when this issue was introduced. It was first observed with
      commit bce3f770 ("bus: mhi: host: Add MHI_PM_SYS_ERR_FAIL state") but
      that commit does not appear to introduce the issue per code inspection.
      This issue is suspected to trace back to the introduction of MHI, but the
      relevant code paths have drastically changed since then. Therefore, do
      not identify a specific commit in a Fixes tag as confidence is low that
      such a commit would be correctly identified.
      
      Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
      Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
      Reviewed-by: Troy Hanson <quic_thanson@quicinc.com>
      Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
      Message-ID: <20250328163526.3365497-1-jeff.hugo@oss.qualcomm.com>
      Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
  4. Apr 07, 2025