May 30, 2022
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 00b5f371
Author: Vitaly Kuznetsov <vkuznets@redhat.com>
Date: Fri Mar 25 14:21:39 2022 +0100

KVM: x86: Avoid theoretical NULL pointer dereference in kvm_irq_delivery_to_apic_fast()

When kvm_irq_delivery_to_apic_fast() is called with APIC_DEST_SELF shorthand, 'src' must not be NULL. Crash the VM with KVM_BUG_ON() instead of crashing the host.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220325132140.25650-3-vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
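A minimal sketch of the shape of the fix, assuming the upstream kvm_apic_map_get_dest_lapic() call site; the surrounding names are illustrative, not a verbatim copy of the patch:

    /* APIC_DEST_SELF means "deliver to the sending vCPU", so a NULL 'src'
     * is a KVM bug.  KVM_BUG_ON() marks the VM as bugged and evaluates to
     * the condition, turning a would-be host oops into a dead guest. */
    if (irq->shorthand == APIC_DEST_SELF) {
            if (KVM_BUG_ON(!src, kvm))
                    return false;
            *dst = src;
            *bitmap = 1;
            return true;
    }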
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 7ec37d1c
Author: Vitaly Kuznetsov <vkuznets@redhat.com>
Date: Fri Mar 25 14:21:38 2022 +0100

KVM: x86: Check lapic_in_kernel() before attempting to set a SynIC irq

When KVM_CAP_HYPERV_SYNIC{,2} is activated, KVM already checks for irqchip_in_kernel() so normally SynIC irqs should never be set. It is, however, possible for a misbehaving VMM to write to SYNIC/STIMER MSRs causing erroneous behavior. The immediate issue being fixed is that kvm_irq_delivery_to_apic() (kvm_irq_delivery_to_apic_fast()) crashes when called with 'irq.shorthand = APIC_DEST_SELF' and 'src == NULL'.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220325132140.25650-2-vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
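A hedged sketch of the added guard, assuming it sits at the top of synic_set_irq() as the description implies:

    /* SynIC interrupts are delivered through the in-kernel local APIC;
     * reject the request (and mark the VM bugged) rather than letting a
     * misbehaving VMM reach the APIC_DEST_SELF + src == NULL crash path. */
    if (KVM_BUG_ON(!lapic_in_kernel(vcpu), vcpu->kvm))
            return -EINVAL;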
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 07ea4ab1
Author: Nathan Chancellor <nathan@kernel.org>
Date: Tue Mar 22 08:29:06 2022 -0700

KVM: x86: Fix clang -Wimplicit-fallthrough in do_host_cpuid()

Clang warns:

    arch/x86/kvm/cpuid.c:739:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough]
            default:
            ^
    arch/x86/kvm/cpuid.c:739:2: note: insert 'break;' to avoid fall-through
            default:
            ^
    1 error generated.

Clang is a little more pedantic than GCC, which does not warn when falling through to a case that is just break or return. Clang's version is more in line with the kernel's own stance in deprecated.rst, which states that all switch/case blocks must end in either break, fallthrough, continue, goto, or return. Add the missing break to silence the warning.

Fixes: f144c49e ("KVM: x86: synthesize CPUID leaf 0x80000021h if useful")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Message-Id: <20220322152906.112164-1-nathan@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
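An illustrative sketch of the pattern being fixed; the 0x80000021 case label is an assumption based on the Fixes tag:

    switch (function) {
    case 0x80000021:
            entry->ebx = entry->ecx = entry->edx = 0;
            /* ... leaf-specific setup ... */
            break;  /* added; clang rejects the implicit fall-through */
    default:
            entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
            break;
    }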
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 70375c2d
Author: David Matlack <dmatlack@google.com>
Date: Thu Mar 3 18:33:28 2022 +0000

Revert "KVM: set owner of cpu and vm file operations"

This reverts commit 3d3aab1b. Now that the KVM module's lifetime is tied to kvm.users_count, there is no need to also tie its lifetime to the lifetime of the VM and vCPU file descriptors.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220303183328.1499189-3-dmatlack@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 5f6de5cb
Author: David Matlack <dmatlack@google.com>
Date: Thu Mar 3 18:33:27 2022 +0000

KVM: Prevent module exit until all VMs are freed

Tie the lifetime of the KVM module to the lifetime of each VM via kvm.users_count. This way anything that grabs a reference to the VM via kvm_get_kvm() cannot accidentally outlive the KVM module.

Prior to this commit, the lifetime of the KVM module was tied to the lifetime of /dev/kvm file descriptors, VM file descriptors, and vCPU file descriptors by their respective file_operations "owner" field. This approach is insufficient because references grabbed via kvm_get_kvm() do not prevent closing any of the aforementioned file descriptors.

This fixes a long-standing theoretical bug in KVM that at least affects async page faults. kvm_setup_async_pf() grabs a reference via kvm_get_kvm(), and drops it in an asynchronous work callback. Nothing prevents the VM file descriptor from being closed and the KVM module from being unloaded before this callback runs.

Fixes: af585b92 ("KVM: Halt vcpu if page it tries to access is swapped out")
Fixes: 3d3aab1b ("KVM: set owner of cpu and vm file operations")
Cc: stable@vger.kernel.org
Suggested-by: Ben Gardon <bgardon@google.com>
[ Based on a patch from Ben implemented for Google's kernel. ]
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220303183328.1499189-2-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
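A hedged sketch of the lifetime tie-in, assuming kvm_chardev_ops is the module-owned file_operations for /dev/kvm (fragments, not the complete patch):

    /* in kvm_create_vm(): each VM pins kvm.ko for its whole lifetime */
    if (!try_module_get(kvm_chardev_ops.owner)) {
            r = -ENODEV;
            goto out_err;
    }

    /* in kvm_destroy_vm(), reached only when kvm.users_count hits zero:
     * references taken via kvm_get_kvm() can no longer outlive the module */
    module_put(kvm_chardev_ops.owner);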
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit c9b8fecd
Author: Paolo Bonzini <pbonzini@redhat.com>
Date: Tue Mar 8 04:57:39 2022 -0500

KVM: use kvcalloc for array allocations

Instead of using array_size, use a function that takes care of the multiplication. While at it, switch to kvcalloc since this allocation should not be very large.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
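An illustrative before/after, assuming an allocation of nmsrs u64 entries:

    u64 *entries;

    /* before: overflow-checked multiply spelled out via array_size() */
    entries = kvmalloc(array_size(nmsrs, sizeof(*entries)), GFP_KERNEL_ACCOUNT);

    /* after: kvcalloc() does the overflow-checked multiply itself and
     * zeroes the memory, falling back from kmalloc to vmalloc as needed */
    entries = kvcalloc(nmsrs, sizeof(*entries), GFP_KERNEL_ACCOUNT);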
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 6d849191
Author: Oliver Upton <oupton@google.com>
Date: Tue Mar 1 06:03:47 2022 +0000

KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2

KVM_CAP_DISABLE_QUIRKS is irrevocably broken. The capability does not advertise the set of quirks which may be disabled to userspace, so it is impossible to predict the behavior of KVM. Worse yet, KVM_CAP_DISABLE_QUIRKS will tolerate any value for cap->args[0], meaning it fails to reject attempts to set invalid quirk bits.

The only valid workaround for the quirky quirks API is to add a new CAP. Actually advertise the set of quirks that can be disabled to userspace so it can predict KVM's behavior. Reject values for cap->args[0] that contain invalid bits. Finally, add documentation for the new capability and describe the existing quirks.

Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20220301060351.442881-5-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Conflicts: Documentation/virt/kvm/api.rst (skipping 93b71801)

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
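A sketch of the two halves of the new capability, assuming upstream's KVM_X86_VALID_QUIRKS mask; the two case blocks live in different switch statements:

    /* kvm_vm_ioctl_check_extension(): advertise which quirks can be disabled */
    case KVM_CAP_DISABLE_QUIRKS2:
            r = KVM_X86_VALID_QUIRKS;
            break;

    /* kvm_vm_ioctl_enable_cap(): unlike the legacy cap, reject unknown bits */
    case KVM_CAP_DISABLE_QUIRKS2:
            r = -EINVAL;
            if (cap->args[0] & ~KVM_X86_VALID_QUIRKS)
                    break;
            fallthrough;
    case KVM_CAP_DISABLE_QUIRKS:
            kvm->arch.disabled_quirks = cap->args[0];
            r = 0;
            break;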
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 5e17b2ee
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Sun Nov 6 12:26:18 2011 +0100

kvm: x86: Require const tsc for RT

Non constant TSC is a nightmare on bare metal already, but with virtualization it becomes a complete disaster because the workarounds are horrible latency wise. That's also a preliminary for running RT in a guest on top of a RT host.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Message-Id: <Yh5eJSG19S2sjZfy@linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
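A sketch of the check, assuming it lives in KVM's x86 init path:

    /* Non-constant-TSC workarounds are latency poison under RT: refuse to
     * load KVM on PREEMPT_RT kernels without a constant TSC. */
    if (IS_ENABLED(CONFIG_PREEMPT_RT) && !boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
            pr_err("RT requires X86_FEATURE_CONSTANT_TSC\n");
            return -EOPNOTSUPP;
    }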
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit f144c49e
Author: Paolo Bonzini <pbonzini@redhat.com>
Date: Thu Oct 21 17:19:27 2021 -0400

KVM: x86: synthesize CPUID leaf 0x80000021h if useful

Guests have X86_BUG_NULL_SEG if and only if the host has it. Use the info from static_cpu_has_bug to form the 0x80000021 CPUID leaf that was defined for Zen3. Userspace can then set the bit even on older CPUs that do not have the bug, such as Zen2. Do the same for X86_FEATURE_LFENCE_RDTSC as well, since various processors have had very different ways of detecting it and not all of them are available to userspace.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
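A hedged sketch of the synthesized leaf; the EAX bit positions (bit 2: LFENCE always serializing, bit 6: null selector clears base) follow the 0x80000021 definition, but the fragment is illustrative:

    case 0x80000021:
            entry->ebx = entry->ecx = entry->edx = 0;
            entry->eax = 0;
            if (static_cpu_has(X86_FEATURE_LFENCE_RDTSC))
                    entry->eax |= BIT(2);   /* LFENCE is always serializing */
            if (!static_cpu_has_bug(X86_BUG_NULL_SEG))
                    entry->eax |= BIT(6);   /* NSCB: null selector clears base */
            break;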
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 58b3d12c
Author: Paolo Bonzini <pbonzini@redhat.com>
Date: Thu Oct 28 13:26:38 2021 -0400

KVM: x86: add support for CPUID leaf 0x80000021

CPUID leaf 0x80000021 defines some features (or lack of bugs) of AMD processors. Expose the ones that make sense via KVM_GET_SUPPORTED_CPUID.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit bf07be36
Author: Maxim Levitsky <mlevitsk@redhat.com>
Date: Fri Mar 18 12:27:41 2022 -0400

KVM: x86: do not use KVM_X86_OP_OPTIONAL_RET0 for get_mt_mask

KVM_X86_OP_OPTIONAL_RET0 can only be used with 32-bit return values on 32-bit systems, because unsigned long is only 32 bits wide there and 64-bit values are returned in edx:eax.

Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 873dd122
Author: Paolo Bonzini <pbonzini@redhat.com>
Date: Fri Mar 18 12:30:32 2022 -0400

Revert "KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()"

This reverts commit cf3e2642. Multi-vCPU Hyper-V guests started crashing randomly on boot with the latest kvm/queue, and the problem can be bisected to this particular patch. Basically, I'm not able to boot e.g. a 16-vCPU guest successfully anymore. Both Intel and AMD seem to be affected. Reverting the commit saves the day.

Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit fcb93eb6
Author: Paolo Bonzini <pbonzini@redhat.com>
Date: Mon Mar 21 05:05:08 2022 -0400

kvm: x86/mmu: Flush TLB before zap_gfn_range releases RCU

Since "KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()" is going to be reverted, it's not going to be true anymore that the zap-page flow does not free any 'struct kvm_mmu_page'. Introduce an early flush before tdp_mmu_zap_leafs() returns, to preserve bisectability.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit fe83f5ea
Author: Borislav Petkov <bp@suse.de>
Date: Wed Mar 16 22:05:52 2022 +0100

kvm/emulate: Fix SETcc emulation function offsets with SLS

The commit in Fixes started adding INT3 after RETs as a mitigation against straight-line speculation.

The fastop SETcc implementation in kvm's insn emulator uses macro magic to generate all possible SETcc functions and to jump to them when emulating the respective instruction. However, it hardcodes the size and alignment of those functions to 4: a three-byte SETcc insn and a single-byte RET. BUT, with SLS, there's an INT3 that gets slapped after the RET, which brings the whole scheme out of alignment:

    15: 0f 90 c0    seto  %al
    18: c3          ret
    19: cc          int3
    1a: 0f 1f 00    nopl  (%rax)
    1d: 0f 91 c0    setno %al
    20: c3          ret
    21: cc          int3
    22: 0f 1f 00    nopl  (%rax)
    25: 0f 92 c0    setb  %al
    28: c3          ret
    29: cc          int3

and this explodes like this:

    int3: 0000 [#1] PREEMPT SMP PTI
    CPU: 0 PID: 2435 Comm: qemu-system-x86 Not tainted 5.17.0-rc8-sls #1
    Hardware name: Dell Inc. Precision WorkStation T3400/0TP412, BIOS A14 04/30/2012
    RIP: 0010:setc+0x5/0x8 [kvm]
    Code: 00 00 0f 1f 00 0f b6 05 43 24 06 00 c3 cc 0f 1f 80 00 00 00 00 0f 90 c0 c3 cc 0f \
          1f 00 0f 91 c0 c3 cc 0f 1f 00 0f 92 c0 c3 cc <0f> 1f 00 0f 93 c0 c3 cc 0f 1f 00 \
          0f 94 c0 c3 cc 0f 1f 00 0f 95 c0
    Call Trace:
     <TASK>
     ? x86_emulate_insn [kvm]
     ? x86_emulate_instruction [kvm]
     ? vmx_handle_exit [kvm_intel]
     ? kvm_arch_vcpu_ioctl_run [kvm]
     ? kvm_vcpu_ioctl [kvm]
     ? __x64_sys_ioctl
     ? do_syscall_64
     ? entry_SYSCALL_64_after_hwframe
     </TASK>

Raise the alignment value when SLS is enabled and use a macro for that instead of hard-coding naked numbers.

Fixes: e463a09a ("x86: Add straight-line-speculation mitigation")
Reported-by: Jamie Heilman <jamie@audible.transient.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Jamie Heilman <jamie@audible.transient.net>
Link: https://lore.kernel.org/r/YjGzJwjrvxg5YZ0Z@audible.transient.net
[Add a comment and a bit of safety checking, since this is going to be changed again for IBT support. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Omitted-fix: 3986f65d ("kvm/emulate: Fix SETcc emulation for ENDBR") Will be picked up with the rest of IBT support.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
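A sketch of the fix, following the approach the message describes (derive the alignment from whether CONFIG_SLS appends the trailing INT3); the macro bodies are close to, but not guaranteed verbatim from, the patch:

    /* SETcc (3 bytes) + RET (1 byte) + INT3 (1 byte with CONFIG_SLS):
     * 4 or 5 bytes total, so align to 4 or 8 instead of a naked 4. */
    #define SETCC_ALIGN     (4 << IS_ENABLED(CONFIG_SLS))

    #define FOP_SETCC(op) \
            ".align " __stringify(SETCC_ALIGN) " \n\t" \
            ".type " #op ", @function \n\t" \
            #op ": \n\t" \
            #op " %al \n\t" \
            __FOP_RET(#op)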
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit afec0c65
Author: Guo Ren <guoren@linux.alibaba.com>
Date: Tue Feb 1 23:05:45 2022 +0800

KVM: compat: riscv: Prevent KVM_COMPAT from being selected

Current riscv doesn't support the 32bit KVM API. Let's make it clear by not selecting KVM_COMPAT.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Anup Patel <anup@brainfault.org>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit b58c55d5
Author: Sean Christopherson <seanjc@google.com>
Date: Sat Feb 26 00:15:46 2022 +0000

KVM: selftests: Add test to populate a VM with the max possible guest mem

Add a selftest that enables populating a VM with the maximum amount of guest memory allowed by the underlying architecture. Abuse KVM's memslots by mapping a single host memory region into multiple memslots so that the selftest doesn't require a system with terabytes of RAM.

Default to 512gb of guest memory, which isn't all that interesting, but should work on all MMUs and doesn't take an exorbitant amount of memory or time. E.g. testing with ~64tb of guest memory takes the better part of an hour, and requires 200gb of memory for KVM's page tables when using 4kb pages.

To inflict maximum abuse on KVM's MMU, default to 4kb pages (or whatever the not-hugepage size is) in the backing store (memfd). Use memfd for the host backing store to ensure that hugepages are guaranteed when requested, and to give the user explicit control of the size of hugepage being tested.

By default, spin up as many vCPUs as there are available to the selftest, and distribute the work of dirtying each 4kb chunk of memory across all vCPUs. Dirtying guest memory forces KVM to populate its page tables, and also forces KVM to write back accessed/dirty information to struct page when the guest memory is freed.

On x86, perform two passes with a MMU context reset between each pass to coerce KVM into dropping all references to the MMU root, e.g. to emulate a vCPU dropping the last reference. Perform both passes and all rendezvous on all architectures in the hope that arm64 and s390x can gain similar shenanigans in the future.

Measure and report the duration of each operation, which is helpful not only to verify the test is working as intended, but also to easily evaluate the performance differences between page sizes.

Provide command line options to limit the amount of guest memory, set the size of each slot (i.e. of the host memory region), set the number of vCPUs, and to enable usage of hugepages.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220226001546.360188-29-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
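A hedged sketch of the memslot-aliasing trick at the heart of the test: one mmap()ed memfd region is installed at many guest physical addresses, so a multi-terabyte guest costs only one slot's worth of host RAM. vm_set_user_memory_region() refers to the selftest helper added earlier in this series; the loop itself is illustrative:

    uint8_t *mem = mmap(NULL, slot_size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, memfd, 0);

    /* Alias the same host region into every slot at increasing GPAs. */
    for (uint32_t slot = first_slot; slot < first_slot + nr_slots; slot++) {
            uint64_t gpa = start_gpa + (uint64_t)(slot - first_slot) * slot_size;

            vm_set_user_memory_region(vm, slot, 0, gpa, slot_size, mem);
    }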
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 17ae5ebc
Author: Sean Christopherson <seanjc@google.com>
Date: Sat Feb 26 00:15:45 2022 +0000

KVM: selftests: Define cpu_relax() helpers for s390 and x86

Add cpu_relax() for s390 and x86 for use in arch-agnostic tests. arm64 already defines its own version.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220226001546.360188-28-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
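The helpers are one-liners; sketches matching the per-arch semantics (x86 uses PAUSE, spelled REP;NOP for old assemblers; s390 needs only a compiler barrier), each in its own arch header:

    /* x86 */
    static inline void cpu_relax(void)
    {
            asm volatile("rep; nop" ::: "memory");
    }

    /* s390 */
    static inline void cpu_relax(void)
    {
            barrier();
    }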
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit a4187c9b
Author: Sean Christopherson <seanjc@google.com>
Date: Sat Feb 26 00:15:44 2022 +0000

KVM: selftests: Split out helper to allocate guest mem via memfd

Extract the code for allocating guest memory via memfd out of vm_userspace_mem_region_add() and into a new helper, kvm_memfd_alloc(). A future selftest to populate a guest with the maximum amount of guest memory will abuse KVM's memslots to alias guest memory regions to a single memfd-backed host region, i.e. needs to back a guest with memfd memory without a 1:1 association between a memslot and a memfd instance.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220226001546.360188-27-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
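A hedged sketch of what kvm_memfd_alloc() plausibly looks like per the description (create, size, and preallocate a memfd so a hugepage request fails up front rather than at fault time); the assertion messages are illustrative:

    int kvm_memfd_alloc(size_t size, bool hugepages)
    {
            int memfd_flags = MFD_CLOEXEC;
            int fd, r;

            if (hugepages)
                    memfd_flags |= MFD_HUGETLB;

            fd = memfd_create("kvm_selftest", memfd_flags);
            TEST_ASSERT(fd != -1, "memfd_create() failed, errno: %i", errno);

            r = ftruncate(fd, size);
            TEST_ASSERT(!r, "ftruncate() failed, errno: %i", errno);

            /* Preallocate backing pages so hugepages are guaranteed. */
            r = fallocate(fd, 0, 0, size);
            TEST_ASSERT(!r, "fallocate() failed, errno: %i", errno);

            return fd;
    }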
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 3d7d6043
Author: Sean Christopherson <seanjc@google.com>
Date: Sat Feb 26 00:15:43 2022 +0000

KVM: selftests: Move raw KVM_SET_USER_MEMORY_REGION helper to utils

Move set_memory_region_test's KVM_SET_USER_MEMORY_REGION helper to KVM's utils so that it can be used by other tests. Provide a raw version as well as an assert-success version to reduce the amount of boilerplate code needed for basic usage.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220226001546.360188-26-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
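A hedged sketch of the raw helper and its assert-success wrapper, simplified to take a VM fd directly (the real helpers take the selftest's struct kvm_vm):

    static inline int __vm_set_user_memory_region(int vm_fd, uint32_t slot,
                                                  uint32_t flags, uint64_t gpa,
                                                  uint64_t size, void *hva)
    {
            struct kvm_userspace_memory_region region = {
                    .slot            = slot,
                    .flags           = flags,
                    .guest_phys_addr = gpa,
                    .memory_size     = size,
                    .userspace_addr  = (uintptr_t)hva,
            };

            return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
    }

    static inline void vm_set_user_memory_region(int vm_fd, uint32_t slot,
                                                 uint32_t flags, uint64_t gpa,
                                                 uint64_t size, void *hva)
    {
            int ret = __vm_set_user_memory_region(vm_fd, slot, flags, gpa,
                                                  size, hva);

            TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION failed, errno: %i",
                        errno);
    }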
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 396fd74d
Author: Sean Christopherson <seanjc@google.com>
Date: Sat Feb 26 00:15:42 2022 +0000

KVM: x86/mmu: WARN on any attempt to atomically update REMOVED SPTE

Disallow calling tdp_mmu_set_spte_atomic() with a REMOVED "old" SPTE. This solves a conundrum introduced by commit 3255530a ("KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails"); if the helper doesn't update old_spte in the REMOVED case, then theoretically the caller could get stuck in an infinite loop as it will fail indefinitely on the REMOVED SPTE. E.g. until recently, clear_dirty_gfn_range() didn't check for a present SPTE and would have spun until getting rescheduled.

In practice, only the page fault path should "create" a new SPTE; all other paths should only operate on existing, a.k.a. shadow present, SPTEs. Now that the page fault path pre-checks for a REMOVED SPTE in all cases, require all other paths to indirectly pre-check by verifying the target SPTE is a shadow-present SPTE.

Note, this does not guarantee the actual SPTE isn't REMOVED, nor is that scenario disallowed. The invariant is only that the caller mustn't invoke tdp_mmu_set_spte_atomic() if the SPTE was REMOVED when last observed by the caller.

Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220226001546.360188-25-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
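A sketch of the new invariant check at the top of tdp_mmu_set_spte_atomic(), per the description; the exact assertion form is an assumption:

    lockdep_assert_held_read(&kvm->mmu_lock);

    /* The caller must guarantee the old SPTE was not REMOVED when it was
     * last observed; otherwise the cmpxchg below can fail forever. */
    WARN_ON_ONCE(is_removed_spte(iter->old_spte));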
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 58298b06
Author: Sean Christopherson <seanjc@google.com>
Date: Sat Feb 26 00:15:41 2022 +0000

KVM: x86/mmu: Check for a REMOVED leaf SPTE before making the SPTE

Explicitly check for a REMOVED leaf SPTE prior to attempting to map the final SPTE when handling a TDP MMU fault. Functionally, this is a nop as tdp_mmu_set_spte_atomic() will eventually detect the frozen SPTE. Pre-checking for a REMOVED SPTE is a minor optimization, but the real goal is to allow tdp_mmu_set_spte_atomic() to have an invariant that the "old" SPTE is never a REMOVED SPTE.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-24-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit efd995da
Author: Paolo Bonzini <pbonzini@redhat.com>
Date: Fri Mar 4 11:43:13 2022 -0500

KVM: x86/mmu: Zap defunct roots via asynchronous worker

Zap defunct roots, a.k.a. roots that have been invalidated after their last reference was initially dropped, asynchronously via the existing work queue instead of forcing the work upon the unfortunate task that happened to drop the last reference. If a vCPU task drops the last reference, the vCPU is effectively blocked by the host for the entire duration of the zap. If the root being zapped happens to be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging being active, the zap can take several hundred seconds. Unsurprisingly, most guests are unhappy if a vCPU disappears for hundreds of seconds.

E.g. running a synthetic selftest that triggers a vCPU root zap with ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds. Offloading the zap to a worker drops the block time to <100ms.

There is an important nuance to this change. If the same work item was queued twice before the work function has run, it would only execute once and one reference would be leaked. Therefore, now that queueing and flushing items is not anymore protected by kvm->slots_lock, kvm_tdp_mmu_invalidate_all_roots() has to check root->role.invalid and skip already invalid roots. On the other hand, kvm_mmu_zap_all_fast() must return only after those skipped roots have been zapped as well. These two requirements can be satisfied only if _all_ places that change invalid to true now schedule the worker before releasing the mmu_lock. There are just two, kvm_tdp_mmu_put_root() and kvm_tdp_mmu_invalidate_all_roots().

Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-23-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
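A hedged sketch of handing a defunct root to the zap workqueue instead of zapping synchronously in whichever task dropped the last reference; field and function names follow the series' description but are not guaranteed verbatim:

    static void tdp_mmu_schedule_zap_root(struct kvm *kvm,
                                          struct kvm_mmu_page *root)
    {
            root->tdp_mmu_async_data = kvm;
            INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_work);

            /* Must be queued before mmu_lock is released so that
             * kvm_mmu_zap_all_fast() can flush the workqueue and thereby
             * wait on every root whose role.invalid was set to true. */
            queue_work(kvm->arch.tdp_mmu_zap_wq, &root->tdp_mmu_async_work);
    }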
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 1b6043e8
Author: Sean Christopherson <seanjc@google.com>
Date: Sat Feb 26 00:15:39 2022 +0000

KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls

When zapping a TDP MMU root, perform the zap in two passes to avoid zapping an entire top-level SPTE while holding RCU, which can induce RCU stalls. In the first pass, zap SPTEs at PG_LEVEL_1G, and then zap top-level entries in the second pass.

With 4-level paging, zapping a PGD that is fully populated with 4kb leaf SPTEs takes up to ~7 or so seconds (time varies based on kernel config, number of (v)CPUs, etc...). With 5-level paging, that time can balloon well into hundreds of seconds. Before remote TLB flushes were omitted, the problem was even worse as waiting for all active vCPUs to respond to the IPI introduced significant overhead for VMs with large numbers of vCPUs.

By zapping 1gb SPTEs (both shadow pages and hugepages) in the first pass, the amount of work that is done without dropping RCU protection is strictly bounded, with the worst case latency for a single operation being less than 100ms.

Zapping at 1gb in the first pass is not arbitrary. First and foremost, KVM relies on being able to zap 1gb shadow pages in a single shot when replacing a shadow page with a hugepage. Zapping a 1gb shadow page that is fully populated with 4kb dirty SPTEs also triggers the worst case latency due to writing back the struct page accessed/dirty bits for each 4kb page, i.e. the two-pass approach is guaranteed to work so long as KVM can cleanly zap a 1gb shadow page.

    rcu: INFO: rcu_sched self-detected stall on CPU
    rcu:     52-....: (20999 ticks this GP) idle=7be/1/0x4000000000000000 softirq=15759/15759 fqs=5058
    (t=21016 jiffies g=66453 q=238577)
    NMI backtrace for cpu 52
    Call Trace:
     ...
     mark_page_accessed+0x266/0x2f0
     kvm_set_pfn_accessed+0x31/0x40
     handle_removed_tdp_mmu_page+0x259/0x2e0
     __handle_changed_spte+0x223/0x2c0
     handle_removed_tdp_mmu_page+0x1c1/0x2e0
     __handle_changed_spte+0x223/0x2c0
     handle_removed_tdp_mmu_page+0x1c1/0x2e0
     __handle_changed_spte+0x223/0x2c0
     zap_gfn_range+0x141/0x3b0
     kvm_tdp_mmu_zap_invalidated_roots+0xc8/0x130
     kvm_mmu_zap_all_fast+0x121/0x190
     kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
     kvm_page_track_flush_slot+0x5c/0x80
     kvm_arch_flush_shadow_memslot+0xe/0x10
     kvm_set_memslot+0x172/0x4e0
     __kvm_set_memory_region+0x337/0x590
     kvm_vm_ioctl+0x49c/0xf80

Reported-by: David Matlack <dmatlack@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-22-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
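A sketch of the two-pass structure, assuming a __tdp_mmu_zap_root() helper parameterized on the zap level (illustrative):

    rcu_read_lock();

    /* Pass 1: zap at 1gb granularity so no single operation holds RCU
     * for more than one 1gb region's worth of SPTEs. */
    __tdp_mmu_zap_root(kvm, root, shared, PG_LEVEL_1G);

    /* Pass 2: sweep the now-almost-empty top-level entries. */
    __tdp_mmu_zap_root(kvm, root, shared, root->role.level);

    rcu_read_unlock();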
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 8351779c
Author: Paolo Bonzini <pbonzini@redhat.com>
Date: Thu Mar 3 01:50:21 2022 -0500

KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root

Allow yielding when zapping SPTEs after the last reference to a valid root is put. Because KVM must drop all SPTEs in response to relevant mmu_notifier events, mark defunct roots invalid and reset their refcount prior to zapping the root. Keeping the refcount elevated while the zap is in-progress ensures the root is reachable via mmu_notifier until the zap completes and the last reference to the invalid, defunct root is put.

Allowing kvm_tdp_mmu_put_root() to yield fixes soft lockup issues if the root being put has a massive paging structure, e.g. zapping a root that is backed entirely by 4kb pages for a guest with 32tb of memory can take hundreds of seconds to complete.

    watchdog: BUG: soft lockup - CPU#49 stuck for 485s! [max_guest_memor:52368]
    RIP: 0010:kvm_set_pfn_dirty+0x30/0x50 [kvm]
     __handle_changed_spte+0x1b2/0x2f0 [kvm]
     handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
     __handle_changed_spte+0x1f4/0x2f0 [kvm]
     handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
     __handle_changed_spte+0x1f4/0x2f0 [kvm]
     tdp_mmu_zap_root+0x307/0x4d0 [kvm]
     kvm_tdp_mmu_put_root+0x7c/0xc0 [kvm]
     kvm_mmu_free_roots+0x22d/0x350 [kvm]
     kvm_mmu_reset_context+0x20/0x60 [kvm]
     kvm_arch_vcpu_ioctl_set_sregs+0x5a/0xc0 [kvm]
     kvm_vcpu_ioctl+0x5bd/0x710 [kvm]
     __se_sys_ioctl+0x77/0xc0
     __x64_sys_ioctl+0x1d/0x20
     do_syscall_64+0x44/0xa0
     entry_SYSCALL_64_after_hwframe+0x44/0xae

KVM currently doesn't put a root from a non-preemptible context, so other than the mmu_notifier wrinkle, yielding when putting a root is safe.

Yield-unfriendly iteration uses for_each_tdp_mmu_root(), which doesn't take a reference to each root (it requires mmu_lock be held for the entire duration of the walk). tdp_mmu_next_root() is used only by the yield-friendly iterator. tdp_mmu_zap_root_work() is explicitly yield friendly. kvm_mmu_free_roots() => mmu_free_root_page() is a much bigger fan-out, but is still yield-friendly in all call sites, as all callers can be traced back to some combination of vcpu_run(), kvm_destroy_vm(), and/or kvm_create_vm().

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220226001546.360188-21-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 22b94c4b
Author: Paolo Bonzini <pbonzini@redhat.com>
Date: Wed Mar 2 12:02:07 2022 -0500

KVM: x86/mmu: Zap invalidated roots via asynchronous worker

Use the system worker threads to zap the roots invalidated by the TDP MMU's "fast zap" mechanism, implemented by kvm_tdp_mmu_invalidate_all_roots(). At this point, apart from allowing some parallelism in the zapping of roots, the workqueue is a glorified linked list: work items are added and flushed entirely within a single kvm->slots_lock critical section. However, the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots() assumes that it owns a reference to all invalid roots; therefore, no one can set the invalid bit outside kvm_mmu_zap_all_fast(). Putting the invalidated roots on a linked list... erm, on a workqueue ensures that tdp_mmu_zap_root_work() only puts back those extra references that kvm_mmu_zap_all_invalidated_roots() had gifted to it.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit bb95dfb9
Author: Sean Christopherson <seanjc@google.com>
Date: Sat Feb 26 00:15:37 2022 +0000

KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages

Defer TLB flushes to the caller when freeing TDP MMU shadow pages instead of immediately flushing. Because the shadow pages are freed in an RCU callback, so long as at least one CPU holds RCU, all CPUs are protected. For vCPUs running in the guest, i.e. consuming TLB entries, KVM only needs to ensure the caller services the pending TLB flush before dropping its RCU protections. I.e. use the caller's RCU as a proxy for all vCPUs running in the guest.

Deferring the flushes allows batching flushes, e.g. when installing a 1gb hugepage and zapping a pile of SPs. And when zapping an entire root, deferring flushes allows skipping the flush entirely (because flushes are not needed in that case).

Avoiding flushes when zapping an entire root is especially important as synchronizing with other CPUs via IPI after zapping every shadow page can cause significant performance issues for large VMs. The issue is exacerbated by KVM zapping entire top-level entries without dropping RCU protection, which can lead to RCU stalls even when zapping roots backing relatively "small" amounts of guest memory, e.g. 2tb. Removing the IPI bottleneck largely mitigates the RCU issues, though it's likely still a problem for 5-level paging. A future patch will further address the problem by zapping roots in multiple passes to avoid holding RCU for an extended duration.

Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220226001546.360188-20-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit bd296779
Author: Sean Christopherson <seanjc@google.com>
Date: Sat Feb 26 00:15:36 2022 +0000

KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched

When yielding in the TDP MMU iterator, service any pending TLB flush before dropping RCU protections in anticipation of using the caller's RCU "lock" as a proxy for vCPUs in the guest.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-19-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
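A sketch of the reordered resched path in the TDP MMU iterator; the exact helper and the read/write lock variant depend on how mmu_lock is held, so treat the fragment as illustrative:

    if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
            /* Service the pending flush while still inside the RCU
             * read-side section: RCU is the proxy for vCPUs that may
             * still hold translations to the zapped SPTEs. */
            if (flush)
                    kvm_flush_remote_tlbs(kvm);

            rcu_read_unlock();
            cond_resched_rwlock_read(&kvm->mmu_lock);
            rcu_read_lock();
    }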
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit cf3e2642
Author: Sean Christopherson <seanjc@google.com>
Date: Sat Feb 26 00:15:35 2022 +0000

KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()

Zap only leaf SPTEs in the TDP MMU's zap_gfn_range(), and rename various functions accordingly. When removing mappings for functional correctness (except for the stupid VFIO GPU passthrough memslots bug), zapping the leaf SPTEs is sufficient as the paging structures themselves do not point at guest memory and do not directly impact the final translation (in the TDP MMU).

Note, this aligns the TDP MMU with the legacy/full MMU, which zaps only the rmaps, a.k.a. leaf SPTEs, in kvm_zap_gfn_range() and kvm_unmap_gfn_range().

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit acbda82a
Author: Sean Christopherson <seanjc@google.com>
Date: Sat Feb 26 00:15:34 2022 +0000

KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range

Now that all callers of zap_gfn_range() hold mmu_lock for write, drop support for zapping with mmu_lock held for read. That all callers hold mmu_lock for write isn't a random coincidence; now that the paths that need to zap _everything_ have their own path, the only callers left are those that need to zap for functional correctness. And when zapping is required for functional correctness, mmu_lock must be held for write, otherwise the caller has no guarantees about the state of the TDP MMU page tables after it has run, e.g. the SPTE(s) it zapped can be immediately replaced by a vCPU faulting in a page.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-17-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit e2b5b21d
Author: Sean Christopherson <seanjc@google.com>
Date: Sat Feb 26 00:15:33 2022 +0000

KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page

Add a dedicated helper for zapping a TDP MMU root, and use it in the three flows that do "zap_all" and intentionally do not do a TLB flush if SPTEs are zapped (zapping an entire root is safe if and only if it cannot be in use by any vCPU). Because a TLB flush is never required, unconditionally pass "false" to tdp_mmu_iter_cond_resched() when potentially yielding.

Opportunistically document why KVM must not yield when zapping roots that are being zapped by kvm_tdp_mmu_put_root(), i.e. roots whose refcount has reached zero, and further harden the flow to detect improper KVM behavior with respect to roots that are supposed to be unreachable.

In addition to hardening zapping of roots, isolating zapping of roots will allow future simplification of zap_gfn_range() by having it zap only leaf SPTEs, and by removing its tricky "zap all" heuristic. By having all paths that truly need to free _all_ SPs flow through the dedicated root zapper, the generic zapper can be freed of those concerns.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 77c8cd6b
Author: Sean Christopherson <seanjc@google.com>
Date: Sat Feb 26 00:15:32 2022 +0000

KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU

Don't flush the TLBs when zapping all TDP MMU pages, as the only time KVM uses the slow version of "zap everything" is when the VM is being destroyed or the owning mm has exited. In either case, KVM_RUN is unreachable for the VM, i.e. the guest TLB entries cannot be consumed.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-15-seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit c10743a1
Author: Sean Christopherson <seanjc@google.com>
Date: Sat Feb 26 00:15:31 2022 +0000

KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery

When recovering a potential hugepage that was shattered for the iTLB multihit workaround, precisely zap only the target page instead of iterating over the TDP MMU to find the SP that was passed in. This will allow future simplification of zap_gfn_range() by having it zap only leaf SPTEs.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220226001546.360188-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 626808d1
Author: Sean Christopherson <seanjc@google.com>
Date: Sat Feb 26 00:15:30 2022 +0000

KVM: x86/mmu: Refactor low-level TDP MMU set SPTE helper to take raw values

Refactor __tdp_mmu_set_spte() to work with raw values instead of a tdp_iter object so that a future patch can modify SPTEs without doing a walk, and without having to synthesize a tdp_iter.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-13-seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 966da62a
Author: Sean Christopherson <seanjc@google.com>
Date: Sat Feb 26 00:15:29 2022 +0000

KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path

WARN if the new_spte being set by __tdp_mmu_set_spte() is a REMOVED_SPTE, which is called out by the comment as being disallowed but not actually checked. Keep the WARN on the old_spte as well, because overwriting a REMOVED_SPTE in the non-atomic path is also disallowed (as evidenced by the lack of splats with the existing WARN).

Fixes: 08f07c80 ("KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler")
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-12-seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 0e587aa7
Author: Sean Christopherson <seanjc@google.com>
Date: Sat Feb 26 00:15:28 2022 +0000

KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU

Add helpers to read and write TDP MMU SPTEs instead of open coding rcu_dereference() all over the place, and to provide a convenient location to document why KVM doesn't exempt holding mmu_lock for write from having to hold RCU (and any future changes to the rules).

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-11-seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
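Sketches of the two helpers as described: a single place for the rcu_dereference() and for the commentary on why even mmu_lock-for-write holders stay inside an RCU read-side section:

    static inline u64 kvm_tdp_mmu_read_spte(tdp_ptep_t sptep)
    {
            return READ_ONCE(*rcu_dereference(sptep));
    }

    static inline void kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
    {
            WRITE_ONCE(*rcu_dereference(sptep), new_spte);
    }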
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit a151acec
Author: Sean Christopherson <seanjc@google.com>
Date: Sat Feb 26 00:15:27 2022 +0000

KVM: x86/mmu: Drop RCU after processing each root in MMU notifier hooks

Drop RCU protection after processing each root when handling MMU notifier hooks that aren't the "unmap" path, i.e. aren't zapping. Temporarily drop RCU to let RCU do its thing between roots, and to make it clear that there's no special behavior that relies on holding RCU across all roots.

Currently, the RCU protection is completely superficial, it's necessary only to make rcu_dereference() of SPTE pointers happy. A future patch will rely on holding RCU as a proxy for vCPUs in the guest, e.g. to ensure shadow pages aren't freed before all vCPUs do a TLB flush (or rather, acknowledge the need for a flush), but in that case RCU needs to be held until the flush is complete if and only if the flush is needed because a shadow page may have been removed. And except for the "unmap" path, MMU notifier events cannot remove SPs (don't toggle PRESENT bit, and can't change the PFN for a SP).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-10-seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 93fa50f6
Author: Sean Christopherson <seanjc@google.com>
Date: Sat Feb 26 00:15:26 2022 +0000

KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier change_spte

Batch TLB flushes (with other MMUs) when handling ->change_spte() notifications in the TDP MMU. The MMU notifier path in question doesn't allow yielding and correctly flushes before dropping mmu_lock.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-9-seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit c8e5a0d0
Author: Sean Christopherson <seanjc@google.com>
Date: Sat Feb 26 00:15:25 2022 +0000

KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP removal

Look for a !leaf=>leaf conversion instead of a PFN change when checking if a SPTE change removed a TDP MMU shadow page. Convert the PFN check into a WARN, as KVM should never change the PFN of a shadow page (except when it's being zapped or replaced).

From a purely theoretical perspective, it's not illegal to replace a SP with a hugepage pointing at the same PFN. In practice, it's impossible as that would require mapping guest memory overtop a kernel-allocated SP. Either way, the check is odd.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-8-seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 614f6970
Author: Paolo Bonzini <pbonzini@redhat.com>
Date: Wed Mar 2 08:51:05 2022 -0500

KVM: x86/mmu: do not allow readers to acquire references to invalid roots

Remove the "shared" argument of for_each_tdp_mmu_root_yield_safe, thus ensuring that readers do not ever acquire a reference to an invalid root. After this patch, all readers except kvm_tdp_mmu_zap_invalidated_roots() treat refcount=0/valid, refcount=0/invalid and refcount=1/invalid in exactly the same way. kvm_tdp_mmu_zap_invalidated_roots() is different but it also does not acquire a reference to the invalid root, and it cannot see refcount=0/invalid because it is guaranteed to run after kvm_tdp_mmu_invalidate_all_roots().

Opportunistically add a lockdep assertion to the yield-safe iterator.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-
Vitaly Kuznetsov authored
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2074832

commit 7c554d8e
Author: Paolo Bonzini <pbonzini@redhat.com>
Date: Wed Mar 2 08:44:22 2022 -0500

KVM: x86/mmu: only perform eager page splitting on valid roots

Eager page splitting is an optimization; it does not have to be performed on invalid roots. It is also the only case in which a reader might acquire a reference to an invalid root, so after this change we know that readers will skip both dying and invalid roots.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
-