Commit Graph

326 Commits

Author SHA1 Message Date
Avi Kivity
5008fdf5b6 KVM: Use kernel-standard types
Noted by Joerg Roedel.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:28 +03:00
Joerg Roedel
80b7706e4c KVM: SVM: enable LBRV virtualization if available
This patch enables the virtualization of the last branch record MSRs on
SVM if this feature is available in hardware. It also introduces a small
and simple check feature for specific SVM extensions.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:28 +03:00
Avi Kivity
b8836737d9 KVM: Add fpu get/set operations
These are really helpful when migrating an floating point app to another
machine.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:28 +03:00
Avi Kivity
e8207547d2 KVM: Add physical memory aliasing feature
With this, we can specify that accesses to one physical memory range will
be remapped to another.  This is useful for the vga window at 0xa0000 which
is used as a movable window into the (much larger) framebuffer.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:28 +03:00
Avi Kivity
954bbbc236 KVM: Simply gfn_to_page()
Mapping a guest page to a host page is a common operation.  Currently,
one has first to find the memory slot where the page belongs (gfn_to_memslot),
then locate the page itself (gfn_to_page()).

This is clumsy, and also won't work well with memory aliases.  So simplify
gfn_to_page() not to require memory slot translation first, and instead do it
internally.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:28 +03:00
Dor Laor
e0fa826f96 KVM: Add mmu cache clear function
Functions that play around with the physical memory map
need a way to clear mappings to possibly nonexistent or
invalid memory.  Both the mmu cache and the processor tlb
are cleared.

Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:28 +03:00
Avi Kivity
df513e2cdd KVM: x86 emulator: fix bit string operations operand size
On x86, bit operations operate on a string of bits that can reside in
multiple words.  For example, 'btsl %eax, (blah)' will touch the word
at blah+4 if %eax is between 32 and 63.

The x86 emulator compensates for that by advancing the operand address
by (bit offset / BITS_PER_LONG) and truncating the bit offset to the
range (0..BITS_PER_LONG-1).  This has a side effect of forcing the operand
size to 8 bytes on 64-bit hosts.

Now, a 32-bit guest goes and fork()s a process.  It write protects a stack
page at 0xbffff000 using the 'btr' instruction, at offset 0xffc in the page
table, with bit offset 1 (for the write permission bit).

The emulator now forces the operand size to 8 bytes as previously described,
and an innocent page table update turns into a cross-page-boundary write,
which is assumed by the mmu code not to be a page table, so it doesn't
actually clear the corresponding shadow page table entry.  The guest and
host permissions are out of sync and guest memory is corrupted soon
afterwards, leading to guest failure.

Fix by not using BITS_PER_LONG as the word size; instead use the actual
operand size, so we get a 32-bit write in that case.

Note we still have to teach the mmu to handle cross-page-boundary writes
to guest page table; but for now this allows Damn Small Linux 0.4 (2.4.20)
to boot.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:28 +03:00
Avi Kivity
afeb1f14c5 KVM: Remove debug message
No longer interesting.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:27 +03:00
Avi Kivity
36868f7b0e KVM: Use list_move()
Use list_move() where possible.  Noticed by Dor Laor.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:27 +03:00
Michal Piotrowski
55bf402834 KVM: Remove unused function
Remove unused function

CC      drivers/kvm/svm.o
drivers/kvm/svm.c:207: warning: ‘inject_db’ defined but not used

Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:27 +03:00
Avi Kivity
0cc5064d33 KVM: SVM: Ensure timestamp counter monotonicity
When a vcpu is migrated from one cpu to another, its timestamp counter
may lose its monotonic property if the host has unsynced timestamp counters.
This can confuse the guest, sometimes to the point of refusing to boot.

As the rdtsc instruction is rather fast on AMD processors (7-10 cycles),
we can simply record the last host tsc when we drop the cpu, and adjust
the vcpu tsc offset when we detect that we've migrated to a different cpu.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:27 +03:00
Avi Kivity
d28c6cfbbc KVM: MMU: Fix hugepage pdes mapping same physical address with different access
The kvm mmu keeps a shadow page for hugepage pdes; if several such pdes map
the same physical address, they share the same shadow page.  This is a fairly
common case (kernel mappings on i386 nonpae Linux, for example).

However, if the two pdes map the same memory but with different permissions, kvm
will happily use the cached shadow page.  If the access through the more
permissive pde will occur after the access to the strict pde, an endless pagefault
loop will be generated and the guest will make no progress.

Fix by making the access permissions part of the cache lookup key.

The fix allows Xen pae to boot on kvm and run guest domains.

Thanks to Jeremy Fitzhardinge for reporting the bug and testing the fix.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:27 +03:00
Joerg Roedel
916ce2360f KVM: SVM: forbid guest to execute monitor/mwait
This patch forbids the guest to execute monitor/mwait instructions on
SVM. This is necessary because the guest can execute these instructions
if they are available even if the kvm cpuid doesn't report its
existence.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:26 +03:00
Sergey Kiselev
0e5bf0d0e4 KVM: Handle writes to MCG_STATUS msr
Some older (~2.6.7) kernels write MCG_STATUS register during kernel
boot (mce_clear_all() function, called from mce_init()). It's not
currently handled by kvm and will cause it to inject a GPF.
Following patch adds a "nop" handler for this.

Signed-off-by: Sergey Kiselev <sergey.kiselev@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:26 +03:00
Avi Kivity
fcd3410870 KVM: Remove unused and write-only variables
Trivial cleanup.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:26 +03:00
Avi Kivity
6da63cf95f KVM: Don't allow the guest to turn off the cpu cache
The cpu cache is a host resource; the guest should not be able to turn
it off (even for itself).

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:26 +03:00
Avi Kivity
038881c8be KVM: Hack real-mode segments on vmx from KVM_SET_SREGS
As usual, we need to mangle segment registers when emulating real mode
as vm86 has specific constraints.  We special case the reset segment base,
and set the "access rights" (or descriptor flags) to vm86 comaptible values.

This fixes reboot on vmx.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:26 +03:00
Avi Kivity
024aa1c02f KVM: Modify guest segments after potentially switching modes
The SET_SREGS ioctl modifies both cr0.pe (real mode/protected mode) and
guest segment registers.  Since segment handling is modified by the mode on
Intel procesors, update the segment registers after the mode switch has taken
place.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:26 +03:00
Avi Kivity
f6528b03f1 KVM: Remove set_cr0_no_modeswitch() arch op
set_cr0_no_modeswitch() was a hack to avoid corrupting segment registers.
As we now cache the protected mode values on entry to real mode, this
isn't an issue anymore, and it interferes with reboot (which usually _is_
a modeswitch).

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:25 +03:00
Avi Kivity
8cb5b03332 KVM: Workaround vmx inability to virtualize the reset state
The reset state has cs.selector == 0xf000 and cs.base == 0xffff0000,
which aren't compatible with vm86 mode, which is used for real mode
virtualization.

When we create a vcpu, we set cs.base to 0xf0000, but if we get there by
way of a reset, the values are inconsistent and vmx refuses to enter
guest mode.

Workaround by detecting the state and munging it appropriately.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:25 +03:00
Avi Kivity
aac012245a KVM: MMU: Remove global pte tracking
The initial, noncaching, version of the kvm mmu flushed the all nonglobal
shadow page table translations (much like a native tlb flush).  The new
implementation flushes translations only when they change, rendering global
pte tracking superfluous.

This removes the unused tracking mechanism and storage space.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:25 +03:00
Avi Kivity
ca5aac1f96 KVM: MMU: Remove unnecessary check for pdptr access
We already special case the pdptr access, so no need to check it again.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:25 +03:00
Avi Kivity
039576c03c KVM: Avoid guest virtual addresses in string pio userspace interface
The current string pio interface communicates using guest virtual addresses,
relying on userspace to translate addresses and to check permissions.  This
interface cannot fully support guest smp, as the check needs to take into
account two pages at one in case an unaligned string transfer straddles a
page boundary.

Change the interface not to communicate guest addresses at all; instead use
a buffer page (mmaped by userspace) and do transfers there.  The kernel
manages the virtual to physical translation and can perform the checks
atomically by taking the appropriate locks.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:25 +03:00
Avi Kivity
f0fe510864 KVM: Future-proof argument-less ioctls
Some ioctls ignore their arguments.  By requiring them to be zero now,
we allow a nonzero value to have some special meaning in the future.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:25 +03:00
Avi Kivity
07c45a366d KVM: Allow kernel to select size of mmap() buffer
This allows us to store offsets in the kernel/user kvm_run area, and be
sure that userspace has them mapped.  As offsets can be outside the
kvm_run struct, userspace has no way of knowing how much to mmap.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:24 +03:00
Avi Kivity
1961d276c8 KVM: Add guest mode signal mask
Allow a special signal mask to be used while executing in guest mode.  This
allows signals to be used to interrupt a vcpu without requiring signal
delivery to a userspace handler, which is quite expensive.  Userspace still
receives -EINTR and can get the signal via sigwait().

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:24 +03:00
Avi Kivity
6722c51c51 KVM: Initialize the apic_base msr on svm too
Older userspace didn't care, but newer userspace (with the cpuid changes)
does.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:24 +03:00
Avi Kivity
1b19f3e61d KVM: Add a special exit reason when exiting due to an interrupt
This is redundant, as we also return -EINTR from the ioctl, but it
allows us to examine the exit_reason field on resume without seeing
old data.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:24 +03:00
Avi Kivity
8eb7d334bd KVM: Fold kvm_run::exit_type into kvm_run::exit_reason
Currently, userspace is told about the nature of the last exit from the
guest using two fields, exit_type and exit_reason, where exit_type has
just two enumerations (and no need for more).  So fold exit_type into
exit_reason, reducing the complexity of determining what really happened.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:24 +03:00
Avi Kivity
b4e63f560b KVM: Allow userspace to process hypercalls which have no kernel handler
This is useful for paravirtualized graphics devices, for example.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:24 +03:00
Avi Kivity
5d308f4550 KVM: Add method to check for backwards-compatible API extensions
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:24 +03:00
Avi Kivity
106b552b43 KVM: Remove the 'emulated' field from the userspace interface
We no longer emulate single instructions in userspace.  Instead, we service
mmio or pio requests.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:23 +03:00
Avi Kivity
06465c5a3a KVM: Handle cpuid in the kernel instead of punting to userspace
KVM used to handle cpuid by letting userspace decide what values to
return to the guest.  We now handle cpuid completely in the kernel.  We
still let userspace decide which values the guest will see by having
userspace set up the value table beforehand (this is necessary to allow
management software to set the cpu features to the least common denominator,
so that live migration can work).

The motivation for the change is that kvm kernel code can be impacted by
cpuid features, for example the x86 emulator.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:23 +03:00
Avi Kivity
46fc147788 KVM: Do not communicate to userspace through cpu registers during PIO
Currently when passing the a PIO emulation request to userspace, we
rely on userspace updating %rax (on 'in' instructions) and %rsi/%rdi/%rcx
(on string instructions).  This (a) requires two extra ioctls for getting
and setting the registers and (b) is unfriendly to non-x86 archs, when
they get kvm ports.

So fix by doing the register fixups in the kernel and passing to userspace
only an abstract description of the PIO to be done.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:23 +03:00
Avi Kivity
9a2bb7f486 KVM: Use a shared page for kernel/user communication when runing a vcpu
Instead of passing a 'struct kvm_run' back and forth between the kernel and
userspace, allocate a page and allow the user to mmap() it.  This reduces
needless copying and makes the interface expandable by providing lots of
free space.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:23 +03:00
Avi Kivity
1ea252afcd KVM: Fix bogus sign extension in mmu mapping audit
When auditing a 32-bit guest on a 64-bit host, sign extension of the page
table directory pointer table index caused bogus addresses to be shown on
audit errors.

Fix by declaring the index unsigned.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:23 +03:00
Avi Kivity
bbe4432e66 KVM: Use own minor number
Use the minor number (232) allocated to kvm by lanana.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:22 +03:00
Dor Laor
510043da85 KVM: Use the generic skip_emulated_instruction() in hypercall code
Instead of twiddling the rip registers directly, use the
skip_emulated_instruction() function to do that for us.

Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:22 +03:00
Dor Laor
9b22bf5783 KVM: Fix guest register corruption on paravirt hypercall
The hypercall code mixes up the ->cache_regs() and ->decache_regs()
callbacks, resulting in guest register corruption.

Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:22 +03:00
Avi Kivity
6b8d0f9b18 KVM: Fix off-by-one when writing to a nonpae guest pde
Nonpae guest pdes are shadowed by two pae ptes, so we double the offset
twice: once to account for the pte size difference, and once because we
need to shadow pdes for a single guest pde.

But when writing to the upper guest pde we also need to truncate the
lower bits, otherwise the multiply shifts these bits into the pde index
and causes an access to the wrong shadow pde.  If we're at the end of the
page (accessing the very last guest pde) we can even overflow into the
next host page and oops.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-04-19 18:39:26 +03:00
Ingo Molnar
6d9658df07 KVM: always reload segment selectors
failed VM entry on VMX might still change %fs or %gs, thus make sure
that KVM always reloads the segment selectors. This is crutial on both
x86 and x86_64: x86 has __KERNEL_PDA in %fs on which things like
'current' depends and x86_64 has 0 there and needs MSR_GS_BASE to work.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-03-27 17:55:48 +02:00
Avi Kivity
6af11b9e82 KVM: Prevent system selectors leaking into guest on real->protected mode transition on vmx
Intel virtualization extensions do not support virtualizing real mode.  So
kvm uses virtualized vm86 mode to run real mode code.  Unfortunately, this
virtualized vm86 mode does not support the so called "big real" mode, where
the segment selector and base do not agree with each other according to the
real mode rules (base == selector << 4).

To work around this, kvm checks whether a selector/base pair violates the
virtualized vm86 rules, and if so, forces it into conformance.  On a
transition back to protected mode, if we see that the guest did not touch
a forced segment, we restore it back to the original protected mode value.

This pile of hacks breaks down if the gdt has changed in real mode, as it
can cause a segment selector to point to a system descriptor instead of a
normal data segment.  In fact, this happens with the Windows bootloader
and the qemu acpi bios, where a protected mode memcpy routine issues an
innocent 'pop %es' and traps on an attempt to load a system descriptor.

"Fix" by checking if the to-be-restored selector points at a system segment,
and if so, coercing it into a normal data segment.  The long term solution,
of course, is to abandon vm86 mode and use emulation for big real mode.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-27 17:54:38 +02:00
Avi Kivity
27aba76615 KVM: MMU: Fix host memory corruption on i386 with >= 4GB ram
PAGE_MASK is an unsigned long, so using it to mask physical addresses on
i386 (which are 64-bit wide) leads to truncation.  This can result in
page->private of unrelated memory pages being modified, with disasterous
results.

Fix by not using PAGE_MASK for physical addresses; instead calculate
the correct value directly from PAGE_SIZE.  Also fix a similar BUG_ON().

Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-18 10:49:09 +02:00
Avi Kivity
ac1b714e78 KVM: MMU: Fix guest writes to nonpae pde
KVM shadow page tables are always in pae mode, regardless of the guest
setting.  This means that a guest pde (mapping 4MB of memory) is mapped
to two shadow pdes (mapping 2MB each).

When the guest writes to a pte or pde, we intercept the write and emulate it.
We also remove any shadowed mappings corresponding to the write.  Since the
mmu did not account for the doubling in the number of pdes, it removed the
wrong entry, resulting in a mismatch between shadow page tables and guest
page tables, followed shortly by guest memory corruption.

This patch fixes the problem by detecting the special case of writing to
a non-pae pde and adjusting the address and number of shadow pdes zapped
accordingly.

Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-18 10:49:09 +02:00
Avi Kivity
f5b42c3324 KVM: Fix guest sysenter on vmx
The vmx code currently treats the guest's sysenter support msrs as 32-bit
values, which breaks 32-bit compat mode userspace on 64-bit guests.  Fix by
using the native word width of the machine.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-18 10:49:06 +02:00
Avi Kivity
ca45aaae1e KVM: Unset kvm_arch_ops if arch module loading failed
Otherwise, the core module thinks the arch module is loaded, and won't
let you reload it after you've fixed the bug.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-18 10:49:06 +02:00
Andrew Morton
e9cdb1e330 KVM: Move kvmfs magic number to <linux/magic.h>
Use the standard magic.h for kvmfs.

Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:43 +02:00
Avi Kivity
58e690e6fd KVM: Fix bogus failure in kvm.ko module initialization
A bogus 'return r' can cause an otherwise successful module load to fail.
This both denies users the use of kvm, and it also denies them the use of
their machine, as it leaves a filesystem registered with its callbacks
pointing into now-freed module memory.

Fix by returning a zero like a good module.

Thanks to Richard Lucassen <mailinglists@lucassen.org> (?) for reporting
the problem and for providing access to a machine which exhibited it.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:43 +02:00
Uri Lublin
ff990d5952 KVM: Remove write access permissions when dirty-page-logging is enabled
Enabling dirty page logging is done using KVM_SET_MEMORY_REGION ioctl.
If the memory region already exists, we need to remove write accesses,
so writes will be caught, and dirty pages will be logged.

Signed-off-by: Uri Lublin <uril@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:43 +02:00
Uri Lublin
02b27c1f80 kvm: move do_remove_write_access() up
To be called from kvm_vm_ioctl_set_memory_region()

Signed-off-by: Uri Lublin <uril@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:43 +02:00
Uri Lublin
cd1a4a982a KVM: Fix dirty page log bitmap size/access calculation
Since dirty_bitmap is an unsigned long array, the alignment and size need
to take that into account.

Signed-off-by: Uri Lublin <uril@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:42 +02:00
Uri Lublin
ab51a434c5 KVM: Add missing calls to mark_page_dirty()
A few places where we modify guest memory fail to call mark_page_dirty(),
causing live migration to fail.  This adds the missing calls.

Signed-off-by: Uri Lublin <uril@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:42 +02:00
Avi Kivity
bccf2150fe KVM: Per-vcpu inodes
Allocate a distinct inode for every vcpu in a VM.  This has the following
benefits:

 - the filp cachelines are no longer bounced when f_count is incremented on
   every ioctl()
 - the API and internal code are distinctly clearer; for example, on the
   KVM_GET_REGS ioctl, there is no need to copy the vcpu number from
   userspace and then copy the registers back; the vcpu identity is derived
   from the fd used to make the call

Right now the performance benefits are completely theoretical since (a) we
don't support more than one vcpu per VM and (b) virtualization hardware
inefficiencies completely everwhelm any cacheline bouncing effects.  But
both of these will change, and we need to prepare the API today.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:42 +02:00
Avi Kivity
c5ea766006 KVM: Move kvm_vm_ioctl_create_vcpu() around
In preparation of some hacking.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:42 +02:00
Avi Kivity
2c6f5df979 KVM: Rename some kvm_dev_ioctl_*() functions to kvm_vm_ioctl_*()
This reflects the changed scope, from device-wide to single vm (previously
every device open created a virtual machine).

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:42 +02:00
Avi Kivity
f17abe9a44 KVM: Create an inode per virtual machine
This avoids having filp->f_op and the corresponding inode->i_fop different,
which is a little unorthodox.

The ioctl list is split into two: global kvm ioctls and per-vm ioctls.  A new
ioctl, KVM_CREATE_VM, is used to create VMs and return the VM fd.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:42 +02:00
Avi Kivity
37e29d906c KVM: Add internal filesystem for generating inodes
The kvmfs inodes will represent virtual machines and vcpus, as necessary,
reducing cacheline bouncing due to inodes and filps being shared.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:41 +02:00
Avi Kivity
19d1408dfd KVM: More 0 -> NULL conversions
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:41 +02:00
Joerg Roedel
0152527b76 KVM: SVM: intercept SMI to handle it at host level
This patch changes the SVM code to intercept SMIs and handle it
outside the guest.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:41 +02:00
Avi Kivity
cd205625e9 KVM: svm: init cr0 with the wp bit set
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:41 +02:00
Avi Kivity
270fd9b96f KVM: Wire up hypercall handlers to a central arch-independent location
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:41 +02:00
Avi Kivity
02e235bc8e KVM: Add hypercall host support for svm
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:41 +02:00
Ingo Molnar
c21415e843 KVM: Add host hypercall support for vmx
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:40 +02:00
Ingo Molnar
102d8325a1 KVM: add MSR based hypercall API
This adds a special MSR based hypercall API to KVM. This is to be
used by paravirtual kernels and virtual drivers.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:40 +02:00
Markus Rechberger
5972e9535e KVM: Use page_private()/set_page_private() apis
Besides using an established api, this allows using kvm in older kernels.

Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:39 +02:00
Ahmed S. Darwish
9d8f549dc6 KVM: Use ARRAY_SIZE macro instead of manual calculation.
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:39 +02:00
Joerg Roedel
de979caacc KVM: vmx: hack set_cr0_no_modeswitch() to actually do modeswitch
The whole thing is rotten, but this allows vmx to boot with the guest reboot
fix.

Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:39 +02:00
Avi Kivity
d27d4aca18 KVM: Cosmetics
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:39 +02:00
Jeremy Katz
43934a38d7 KVM: Move virtualization deactivation from CPU_DEAD state to CPU_DOWN_PREPARE
This gives it more chances of surviving suspend.

Signed-off-by: Jeremy Katz <katzj@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:39 +02:00
Avi Kivity
bf3f8e86c2 KVM: mmu: add missing dirty page tracking cases
We fail to mark a page dirty in three cases:

- setting the accessed bit in a pte
- setting the dirty bit in a pte
- emulating a write into a pagetable

This fix adds the missing cases.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-03-04 11:12:39 +02:00
Jeremy Fitzhardinge
464d1a78fb [PATCH] i386: Convert i386 PDA code to use %fs
Convert the PDA code to use %fs rather than %gs as the segment for
per-processor data.  This is because some processors show a small but
measurable performance gain for reloading a NULL segment selector (as %fs
generally is in user-space) versus a non-NULL one (as %gs generally is).

On modern processors the difference is very small, perhaps undetectable.
Some old AMD "K6 3D+" processors are noticably slower when %fs is used
rather than %gs; I have no idea why this might be, but I think they're
sufficiently rare that it doesn't matter much.

This patch also fixes the math emulator, which had not been adjusted to
match the changed struct pt_regs.

[frederik.deweerdt@gmail.com: fixit with gdb]
[mingo@elte.hu: Fix KVM too]

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Ian Campbell <Ian.Campbell@XenSource.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Zachary Amsden <zach@vmware.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Frederik Deweerdt <frederik.deweerdt@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
2007-02-13 13:26:20 +01:00
Avi Kivity
59ae6c6b87 [PATCH] KVM: Host suspend/resume support
Add the necessary callbacks to suspend and resume a host running kvm.  This is
just a repeat of the cpu hotplug/unplug work.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:41 -08:00
Avi Kivity
774c47f1d7 [PATCH] KVM: cpu hotplug support
On hotplug, we execute the hardware extension enable sequence.  On unplug, we
decache any vcpus that last ran on the exiting cpu, and execute the hardware
extension disable sequence.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:41 -08:00
Avi Kivity
8d0be2b3bf [PATCH] KVM: VMX: add vcpu_clear()
Like the inline code it replaces, this function decaches the vmcs from the cpu
it last executed on.  in addition:

 - vcpu_clear() works if the last cpu is also the cpu we're running on
 - it is faster on larger smps by virtue of using smp_call_function_single()

Includes fix from Ingo Molnar.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:41 -08:00
Avi Kivity
133de9021d [PATCH] KVM: Add a global list of all virtual machines
This will allow us to iterate over all vcpus and see which cpus they are
running on.

[akpm@osdl.org: use standard (ugly) initialisers]
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:40 -08:00
Ingo Molnar
1e8ba6fba5 [PATCH] kvm: fix vcpu freeing bug
vcpu_load() can return NULL and it sometimes does in failure paths (for
example when the userspace ABI version is too old) - causing a preemption
count underflow in the ->vcpu_free() later on.  So check for NULL.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:40 -08:00
Avi Kivity
26bb83a755 [PATCH] kvm: VMX: Reload ds and es even in 64-bit mode
Or 32-bit userspace will get confused.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:40 -08:00
Dor Laor
54810342f1 [PATCH] kvm: Two-way apic tpr synchronization
We report the value of cr8 to userspace on an exit.  Also let userspace change
cr8 when we re-enter the guest.  The lets 64-bit guest code maintain the tpr
correctly.

Thanks for Yaniv Kamay for the idea.

Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:40 -08:00
Avi Kivity
d92899a001 [PATCH] kvm: SVM: Hack initial cpu csbase to be consistent with intel
This allows us to run the mmu testsuite on amd.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:40 -08:00
Avi Kivity
ac6c2bc592 [PATCH] kvm: Fix mmu going crazy of guest sets cr0.wp == 0
The kvm mmu relies on cr0.wp being set even if the guest does not set it.  The
vmx code correctly forces cr0.wp at all times, the svm code does not, so it
can't boot solaris without this patch.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:40 -08:00
Avi Kivity
988ad74ff6 [PATCH] kvm: vmx: handle triple faults by returning EXIT_REASON_SHUTDOWN to userspace
Just like svm.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:40 -08:00
Avi Kivity
e119d117a1 [PATCH] kvm: Fix gva_to_gpa()
gva_to_gpa() needs to be updated to the new walk_addr() calling convention,
otherwise it may oops under some circumstances.

Use the opportunity to remove all the code duplication in gva_to_gpa(), which
essentially repeats the calculations in walk_addr().

Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:40 -08:00
S.Caglar Onur
a0610ddf6b [PATCH] kvm: Fix asm constraint for lldt instruction
lldt does not accept immediate operands, which "g" allows.

Signed-off-by: S.Caglar Onur <caglar@pardus.org.tr>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:40 -08:00
Ingo Molnar
96958231ce [PATCH] kvm: optimize inline assembly
Forms like "0(%rsp)" generate an instruction with an unnecessary one byte
displacement under certain circumstances.  replace with the equivalent
"(%rsp)".

Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:40 -08:00
Al Viro
11718b4d6b [PATCH] misc NULL noise removal
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-09 09:14:07 -08:00
Al Viro
8b6d44c7bd [PATCH] kvm: NULL noise removal
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-09 09:14:07 -08:00
Al Viro
2f36698799 [PATCH] kvm: __user annotations
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-09 09:14:07 -08:00
Avi Kivity
432bd6cbf9 [PATCH] KVM: fix lockup on 32-bit intel hosts with nx disabled in the bios
Intel hosts, without long mode, and with nx support disabled in the bios
have an efer that is readable but not writable.  This causes a lockup on
switch to guest mode (even though it should exit with reason 34 according
to the documentation).

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-01 16:22:41 -08:00
Robert P. J. Day
49b14f24cc [PATCH] Fix "CONFIG_X86_64_" typo in drivers/kvm/svm.c
Fix what looks like an obvious typo in the file drivers/kvm/svm.c.

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Acked-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-30 08:26:45 -08:00
Joerg Roedel
46fe4ddd9d [PATCH] KVM: SVM: Propagate cpu shutdown events to userspace
This patch implements forwarding of SHUTDOWN intercepts from the guest on to
userspace on AMD SVM.  A SHUTDOWN event occurs when the guest produces a
triple fault (e.g.  on reboot).  This also fixes the bug that a guest reboot
actually causes a host reboot under some circumstances.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26 13:50:57 -08:00
Avi Kivity
73b1087e61 [PATCH] KVM: MMU: Report nx faults to the guest
With the recent guest page fault change, we perform access checks on our
own instead of relying on the cpu.  This means we have to perform the nx
checks as well.

Software like the google toolbar on windows appears to rely on this
somehow.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26 13:50:57 -08:00
Avi Kivity
7993ba43db [PATCH] KVM: MMU: Perform access checks in walk_addr()
Check pte permission bits in walk_addr(), instead of scattering the checks all
over the code.  This has the following benefits:

1. We no longer set the accessed bit for accessed which fail permission checks.
2. Setting the accessed bit is simplified.
3. Under some circumstances, we used to pretend a page fault was fixed when
   it would actually fail the access checks.  This caused an unnecessary
   vmexit.
4. The error code for guest page faults is now correct.

The fix helps netbsd further along booting, and allows kvm to pass the new mmu
testsuite.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26 13:50:57 -08:00
Avi Kivity
6f00e68f21 [PATCH] KVM: Emulate IA32_MISC_ENABLE msr
This allows netbsd 3.1 i386 to get further along installing.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26 13:50:57 -08:00
Leonard Norrgard
bce66ca4a2 [PATCH] KVM: SVM: Fix SVM idt confusion
There's an obvious typo in svm_{get,set}_idt, causing it to access the ldt
instead.

Because these functions are only called for save/load on AMD, the bug does not
impact normal operation.  With the fix, save/load works as expected on AMD
hosts.

Signed-off-by: Uri Lublin <uril@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26 13:50:57 -08:00
Avi Kivity
fc3dffe121 [PATCH] KVM: fix bogus pagefault on writable pages
If a page is marked as dirty in the guest pte, set_pte_common() can set the
writable bit on newly-instantiated shadow pte.  This optimization avoids
a write fault after the initial read fault.

However, if a write fault instantiates the pte, fix_write_pf() incorrectly
reports the fault as a guest page fault, and the guest oopses on what appears
to be a correctly-mapped page.

Fix is to detect the condition and only report a guest page fault on a user
access to a kernel page.

With the fix, a kvm guest can survive a whole night of running the kernel
hacker's screensaver (make -j9 in a loop).

Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-23 07:52:06 -08:00
Avi Kivity
038e51de2e [PATCH] KVM: x86 emulator: fix bit string instructions
The various bit string instructions (bts, btc, etc.) fail to adjust the
address correctly if the bit address is beyond BITS_PER_LONG.

This bug creeped in as the emulator originally relied on cr2 to contain the
memory address; however we now decode it from the mod r/m bits, and must
adjust the offset to account for large bit indices.

The patch is rather large because it switches src and dst decoding around, so
that the bit index is available when decoding the memory address.

This fixes workloads like the FC5 installer.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-23 07:52:06 -08:00
Avi Kivity
cccf748b81 [PATCH] KVM: fix race between mmio reads and injected interrupts
The kvm mmio read path looks like:

 1. guest read faults
 2. kvm emulates read, calls emulator_read_emulated()
 3. fails as a read requires userspace help
 4. exit to userspace
 5. userspace emulates read, kvm sets vcpu->mmio_read_completed
 6. re-enter guest, fault again
 7. kvm emulates read, calls emulator_read_emulated()
 8. succeeds as vcpu->mmio_read_emulated is set
 9. instruction completes and guest is resumed

A problem surfaces if the userspace exit (step 5) also requests an interrupt
injection.  In that case, the guest does not re-execute the original
instruction, but the interrupt handler.  The next time an mmio read is
exectued (likely for a different address), step 3 will find
vcpu->mmio_read_completed set and return the value read for the original
instruction.

The problem manifested itself in a few annoying ways:
- little squares appear randomly on console when switching virtual terminals
- ne2000 fails under nfs read load
- rtl8139 complains about "pci errors" even though the device model is
  incapable of issuing them.

Fix by skipping interrupt injection if an mmio read is pending.

A better fix is to avoid re-entry into the guest, and re-emulating immediately
instead.  However that's a bit more complex.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-23 07:52:06 -08:00
Avi Kivity
084384754e [PATCH] KVM: make sure there is a vcpu context loaded when destroying the mmu
This makes the vmwrite errors on vm shutdown go away.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-23 07:52:06 -08:00
Herbert Xu
e001548911 [PATCH] vmx: Fix register constraint in launch code
Both "=r" and "=g" breaks my build on i386:

  $ make
    CC [M]  drivers/kvm/vmx.o
  {standard input}: Assembler messages:
  {standard input}:3318: Error: bad register name `%sil'
  make[1]: *** [drivers/kvm/vmx.o] Error 1
  make: *** [_module_drivers/kvm] Error 2

The reason is that setbe requires an 8-bit register but "=r" does not
constrain the target register to be one that has an 8-bit version on
i386.

According to

	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10153

the correct constraint is "=q".

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-22 19:27:02 -08:00
Ingo Molnar
07031e14c1 [PATCH] KVM: add VM-exit profiling
This adds the profile=kvm boot option, which enables KVM to profile VM
exits.

Use: "readprofile -m ./System.map | sort -n" to see the resulting
output:

   [...]
   18246 serial_out                               148.3415
   18945 native_flush_tlb                         378.9000
   23618 serial_in                                212.7748
   29279 __spin_unlock_irq                        622.9574
   43447 native_apic_write                        2068.9048
   52702 enable_8259A_irq                         742.2817
   54250 vgacon_scroll                             89.3740
   67394 ide_inb                                  6126.7273
   79514 copy_page_range                           98.1654
   84868 do_wp_page                                86.6000
  140266 pit_read                                 783.6089
  151436 ide_outb                                 25239.3333
  152668 native_io_delay                          21809.7143
  174783 mask_and_ack_8259A                       783.7803
  362404 native_set_pte_at                        36240.4000
 1688747 total                                      0.5009

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-11 18:18:21 -08:00
Dor Laor
022a93080c [PATCH] KVM: Simplify test for interrupt window
No need to test for rflags.if as both VT and SVM specs assure us that on exit
caused from interrupt window opening, 'if' is set.

Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:28 -08:00
Ingo Molnar
68a99f6d37 [PATCH] KVM: Simplify mmu_alloc_roots()
Small optimization/cleanup:

    page == page_header(page->page_hpa)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:28 -08:00
Ingo Molnar
d21225ee2b [PATCH] KVM: Make loading cr3 more robust
Prevent the guest's loading of a corrupt cr3 (pointing at no guest phsyical
page) from crashing the host.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:28 -08:00
Avi Kivity
760db773fb [PATCH] KVM: MMU: Add missing dirty bit
If we emulate a write, we fail to set the dirty bit on the guest pte, leading
the guest to believe the page is clean, and thus lose data.  Bad.

Fix by setting the guest pte dirty bit under such conditions.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:28 -08:00
Avi Kivity
4db9c47c05 [PATCH] KVM: Don't set guest cr3 from vmx_vcpu_setup()
It overwrites the right cr3 set from mmu setup.  Happens only with the test
harness.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:28 -08:00
Avi Kivity
cc1d8955cb [PATCH] KVM: Add missing 'break'
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:28 -08:00
Ingo Molnar
7f7417d67e [PATCH] KVM: Avoid oom on cr3 switch
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:28 -08:00
Avi Kivity
86a2b42e81 [PATCH] KVM: Initialize vcpu->kvm a little earlier
Fixes oops on early close of /dev/kvm.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:28 -08:00
Avi Kivity
e52de1b8cf [PATCH] KVM: Improve reporting of vmwrite errors
This will allow us to see the root cause when a vmwrite error happens.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:27 -08:00
Avi Kivity
37a7d8b046 [PATCH] KVM: MMU: add audit code to check mappings, etc are correct
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:27 -08:00
Avi Kivity
9ede74e0af [PATCH] KVM: MMU: Destroy mmu while we still have a vcpu left
mmu_destroy flushes the guest tlb (indirectly), which needs a valid vcpu.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:27 -08:00
Avi Kivity
40907d5768 [PATCH] KVM: MMU: Flush guest tlb when reducing permissions on a pte
If we reduce permissions on a pte, we must flush the cached copy of the pte
from the guest's tlb.

This is implemented at the moment by flushing the entire guest tlb, and can be
improved by flushing just the relevant virtual address, if it is known.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:27 -08:00
Avi Kivity
e2dec939db [PATCH] KVM: MMU: Detect oom conditions and propagate error to userspace
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:27 -08:00
Avi Kivity
714b93da1a [PATCH] KVM: MMU: Replace atomic allocations by preallocated objects
The mmu sometimes needs memory for reverse mapping and parent pte chains.
however, we can't allocate from within the mmu because of the atomic context.

So, move the allocations to a central place that can be executed before the
main mmu machinery, where we can bail out on failure before any damage is
done.

(error handling is deffered for now, but the basic structure is there)

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:27 -08:00
Avi Kivity
f51234c2cd [PATCH] KVM: MMU: Free pages on kvm destruction
Because mmu pages have attached rmap and parent pte chain structures, we need
to zap them before freeing so the attached structures are freed.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:27 -08:00
Avi Kivity
143646567f [PATCH] KVM: MMU: Treat user-mode faults as a hint that a page is no longer a page table
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:26 -08:00
Avi Kivity
32b3562735 [PATCH] KVM: MMU: Fix cmpxchg8b emulation
cmpxchg8b uses edx:eax as the compare operand, not edi:eax.

cmpxchg8b is used by 32-bit pae guests to set page table entries atomically,
and this is emulated touching shadowed guest page tables.

Also, implement it for 32-bit hosts.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:26 -08:00
Avi Kivity
3bb65a22a4 [PATCH] KVM: MMU: Never free a shadow page actively serving as a root
We always need cr3 to point to something valid, so if we detect that we're
freeing a root page, simply push it back to the top of the active list.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:26 -08:00
Avi Kivity
86a5ba025d [PATCH] KVM: MMU: Page table write flood protection
In fork() (or when we protect a page that is no longer a page table), we can
experience floods of writes to a page, which have to be emulated.  This is
expensive.

So, if we detect such a flood, zap the page so subsequent writes can proceed
natively.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:26 -08:00
Avi Kivity
139bdb2d9e [PATCH] KVM: MMU: If an empty shadow page is not empty, report more info
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:26 -08:00
Avi Kivity
5f1e0b6abc [PATCH] KVM: MMU: Ensure freed shadow pages are clean
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:26 -08:00
Avi Kivity
260746c03d [PATCH] KVM: MMU: <ove is_empty_shadow_page() above kvm_mmu_free_page()
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:26 -08:00
Avi Kivity
0e7bc4b961 [PATCH] KVM: MMU: Handle misaligned accesses to write protected guest page tables
A misaligned access affects two shadow ptes instead of just one.

Since a misaligned access is unlikely to occur on a real page table, just zap
the page out of existence, avoiding further trouble.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:26 -08:00
Avi Kivity
73f7198e73 [PATCH] KVM: MMU: Remove release_pt_page_64()
Unused.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:26 -08:00
Avi Kivity
5f015a5b28 [PATCH] KVM: MMU: Remove invlpg interception
Since we write protect shadowed guest page tables, there is no need to trap
page invalidations (the guest will always change the mapping before issuing
the invlpg instruction).

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:25 -08:00
Avi Kivity
ebeace8609 [PATCH] KVM: MMU: oom handling
When beginning to process a page fault, make sure we have enough shadow pages
available to service the fault.  If not, free some pages.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:25 -08:00
Avi Kivity
cc4529efc7 [PATCH] KVM: MMU: kvm_mmu_put_page() only removes one link to the page
...  and so must not free it unconditionally.

Move the freeing to kvm_mmu_zap_page().

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:25 -08:00
Avi Kivity
697fe2e24a [PATCH] KVM: MMU: Implement child shadow unlinking
When removing a page table, we must maintain the parent_pte field all child
shadow page tables.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:25 -08:00
Avi Kivity
a436036baf [PATCH] KVM: MMU: If emulating an instruction fails, try unprotecting the page
A page table may have been recycled into a regular page, and so any
instruction can be executed on it.  Unprotect the page and let the cpu do its
thing.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:25 -08:00
Avi Kivity
9b7a032567 [PATCH] KVM: MMU: Zap shadow page table entries on writes to guest page tables
Iterate over all shadow pages which correspond to a the given guest page table
and remove the mappings.

A subsequent page fault will reestablish the new mapping.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:25 -08:00
Avi Kivity
da4a00f002 [PATCH] KVM: MMU: Support emulated writes into RAM
As the mmu write protects guest page table, we emulate those writes.  Since
they are not mmio, there is no need to go to userspace to perform them.

So, perform the writes in the kernel if possible, and notify the mmu about
them so it can take the approriate action.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:25 -08:00
Avi Kivity
815af8d42e [PATCH] KVM: MMU: Let the walker extract the target page gfn from the pte
This fixes a problem where set_pte_common() looked for shadowed pages based on
the page directory gfn (a huge page) instead of the actual gfn being mapped.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:25 -08:00
Avi Kivity
374cbac033 [PATCH] KVM: MMU: Write protect guest pages when a shadow is created for them
When we cache a guest page table into a shadow page table, we need to prevent
further access to that page by the guest, as that would render the cache
incoherent.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:25 -08:00
Avi Kivity
cea0f0e7ea [PATCH] KVM: MMU: Shadow page table caching
Define a hashtable for caching shadow page tables. Look up the cache on
context switch (cr3 change) or during page faults.

The key to the cache is a combination of
- the guest page table frame number
- the number of paging levels in the guest
   * we can cache real mode, 32-bit mode, pae, and long mode page
     tables simultaneously.  this is useful for smp bootup.
- the guest page table table
   * some kernels use a page as both a page table and a page directory.  this
     allows multiple shadow pages to exist for that page, one per level
- the "quadrant"
   * 32-bit mode page tables span 4MB, whereas a shadow page table spans
     2MB.  similarly, a 32-bit page directory spans 4GB, while a shadow
     page directory spans 1GB.  the quadrant allows caching up to 4 shadow page
     tables for one guest page in one level.
- a "metaphysical" bit
   * for real mode, and for pse pages, there is no guest page table, so set
     the bit to avoid write protecting the page.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:24 -08:00
Avi Kivity
25c0de2cc6 [PATCH] KVM: MMU: Make kvm_mmu_alloc_page() return a kvm_mmu_page pointer
This allows further manipulation on the shadow page table.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:24 -08:00
Avi Kivity
aef3d3fe13 [PATCH] KVM: MMU: Make the shadow page tables also special-case pae
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:24 -08:00
Avi Kivity
1b0973bd8f [PATCH] KVM: MMU: Use the guest pdptrs instead of mapping cr3 in pae mode
This lets us not write protect a partial page, and is anyway what a real
processor does.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:24 -08:00
Avi Kivity
17ac10ad2b [PATCH] KVM: MU: Special treatment for shadow pae root pages
Since we're not going to cache the pae-mode shadow root pages, allocate a
single pae shadow that will hold the four lower-level pages, which will act as
roots.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:24 -08:00
Avi Kivity
ac79c978f1 [PATCH] KVM: MMU: Fold fetch_guest() into init_walker()
It is never necessary to fetch a guest entry from an intermediate page table
level (except for large pages), so avoid some confusion by always descending
into the lowest possible level.

Rename init_walker() to walk_addr() as it is no longer restricted to
initialization.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:24 -08:00
Avi Kivity
1342d3536d [PATCH] KVM: MMU: Load the pae pdptrs on cr3 change like the processor does
In pae mode, a load of cr3 loads the four third-level page table entries in
addition to cr3 itself.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:24 -08:00
Avi Kivity
6bcbd6aba0 [PATCH] KVM: MMU: Teach the page table walker to track guest page table gfns
Saving the table gfns removes the need to walk the guest and host page tables
in lockstep.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:24 -08:00
Avi Kivity
cd4a4e5374 [PATCH] KVM: MMU: Implement simple reverse mapping
Keep in each host page frame's page->private a pointer to the shadow pte which
maps it.  If there are multiple shadow ptes mapping the page, set bit 0 of
page->private, and use the rest as a pointer to a linked list of all such
mappings.

Reverse mappings are needed because we when we cache shadow page tables, we
must protect the guest page tables from being modified by the guest, as that
would invalidate the cached ptes.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:24 -08:00
Avi Kivity
399badf315 [PATCH] KVM: Prevent stale bits in cr0 and cr4
Hardware virtualization implementations allow the guests to freely change some
of the bits in cr0 and cr4, but trap when changing the other bits.  This is
useful to avoid excessive exits due to changing, for example, the ts flag.

It also means the kvm's copy of cr0 and cr4 may be stale with respect to these
bits.  most of the time this doesn't matter as these bits are not very
interesting.  Other times, however (for example when returning cr0 to
userspace), they are, so get the fresh contents of these bits from the guest
by means of a new arch operation.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:23 -08:00
Dor Laor
c1150d8cf9 [PATCH] KVM: Improve interrupt response
The current interrupt injection mechanism might delay an interrupt under
the following circumstances:

 - if injection fails because the guest is not interruptible (rflags.IF clear,
   or after a 'mov ss' or 'sti' instruction).  Userspace can check rflags,
   but the other cases or not testable under the current API.
 - if injection fails because of a fault during delivery.  This probably
   never happens under normal guests.
 - if injection fails due to a physical interrupt causing a vmexit so that
   it can be handled by the host.

In all cases the guest proceeds without processing the interrupt, reducing
the interactive feel and interrupt throughput of the guest.

This patch fixes the situation by allowing userspace to request an exit
when the 'interrupt window' opens, so that it can re-inject the interrupt
at the right time.  Guest interactivity is very visibly improved.

Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:22 -08:00
Yoshimi Ichiyanagi
e097f35ce5 [PATCH] KVM: Recover after an arch module load failure
If we load the wrong arch module, it leaves behind kvm_arch_ops set, which
prevents loading of the correct arch module later.

Fix be not setting kvm_arch_ops until we're sure it's good.

Signed-off-by: Yoshimi Ichiyanagi <ichiyanagi.yoshimi@lab.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:22 -08:00
Ingo Molnar
d3b2c33860 [PATCH] KVM: Use raw_smp_processor_id() instead of smp_processor_id() where applicable
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:22 -08:00
Ingo Molnar
965b58a550 [PATCH] KVM: Fix GFP_KERNEL alloc in atomic section bug
KVM does kmalloc() in an atomic section while having preemption disabled via
vcpu_load().  Fix this by moving the ->*_msr setup from the vcpu_setup method
to the vcpu_create method.

(This is also a small speedup for setting up a vcpu, which can in theory be
more frequent than the vcpu_create method).

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:21 -08:00
Ingo Molnar
8018c27b26 [PATCH] kvm: fix GFP_KERNEL allocation in atomic section in kvm_dev_ioctl_create_vcpu()
fix an GFP_KERNEL allocation in atomic section: kvm_dev_ioctl_create_vcpu()
called kvm_mmu_init(), which calls alloc_pages(), while holding the vcpu.

The fix is to set up the MMU state in two phases: kvm_mmu_create() and
kvm_mmu_setup().

(NOTE: free_vcpus does an kvm_mmu_destroy() call so there's no need for any
extra teardown branch on allocation/init failure here.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30 10:56:44 -08:00
Avi Kivity
55a54f79e0 [PATCH] KVM: Fix oops on oom
__free_page() doesn't like a NULL argument, so check before calling it.  A
NULL can only happen if memory is exhausted during allocation of a memory
slot.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30 10:56:44 -08:00
Nguyen Anh Quynh
c68876fd28 [PATCH] KVM: Rename some msrs
No need to append _MSR to msr names, a prefix should suffice.

Signed-off-by: Nguyen Anh Quynh <aquynh@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30 10:56:44 -08:00
Avi Kivity
a8d13ea28b [PATCH] KVM: More msr misery
These msrs are referenced by benchmarking software when pretending to be an
Intel cpu.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30 10:56:44 -08:00
Avi Kivity
3bab1f5dda [PATCH] KVM: Move common msr handling to arch independent code
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30 10:56:44 -08:00
Avi Kivity
671d656479 [PATCH] KVM: Implement a few system configuration msrs
Resolves sourceforge bug 1622229 (guest crashes running benchmark software).

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30 10:56:44 -08:00
Yoshimi Ichiyanagi
09db28b8a3 [PATCH] KVM: Initialize kvm_arch_ops on unload
The latest version of kvm doesn't initialize kvm_arch_ops in kvm_init(), which
causes an error with the following sequence.

1. Load the supported arch's module.
2. Load the unsupported arch's module.$B!!(B(loading error)
3. Unload the unsupported arch's module.

You'll get the following error message after step 3.  "BUG: unable to handle
to handle kernel paging request at virtual address xxxxxxxx"

The problem here is that the unsupported arch's module overwrites kvm_arch_ops
of the supported arch's module at step 2.

This patch initializes kvm_arch_ops upon loading architecture specific kvm
module, and prevents overwriting kvm_arch_ops when kvm_arch_ops is already set
correctly.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30 10:56:44 -08:00
Avi Kivity
a9058ecd3c [PATCH] KVM: Simplify is_long_mode()
Instead of doing tricky stuff with the arch dependent virtualization
registers, take a peek at the guest's efer.

This simlifies some code, and fixes some confusion in the mmu branch.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30 10:56:44 -08:00
Avi Kivity
1e885461f0 [PATCH] KVM: Use boot_cpu_data instead of current_cpu_data
current_cpu_data invokes smp_processor_id(), which is inadvisable when
preemption is enabled.  Switch to boot_cpu_data instead.

Resolves sourceforge bug 1621401.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30 10:56:43 -08:00
Avi Kivity
0b76e20b27 [PATCH] KVM: API versioning
Add compile-time and run-time API versioning.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:46 -08:00
Michael Riepe
0f8e3d365a [PATCH] KVM: Handle p5 mce msrs
This allows plan9 to get a little further booting.

Signed-off-by: Michael Riepe <michael@mr511.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:46 -08:00
Michael Riepe
abacf8dff9 [PATCH] KVM: Force real-mode cs limit to 64K
This allows opensolaris to boot on kvm/intel.

Signed-off-by: Michael Riepe <michael@mr511.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:46 -08:00
Michael Riepe
bf591b24d0 [PATCH] KVM: Do not export unsupported msrs to userspace
Some msrs, such as MSR_STAR, are not available on all processors.  Exporting
them causes qemu to try to fetch them, which will fail.

So, check all msrs for validity at module load time.

Signed-off-by: Michael Riepe <michael@mr511.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:46 -08:00
Avi Kivity
2c26495710 [PATCH] KVM: Use more traditional error handling in kvm_mmu_init()
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:46 -08:00
Avi Kivity
36241b8c7c [PATCH] KVM: AMD SVM: Save and restore the floating point unit state
Fixes sf bug 1614113 (segfaults in nbench).

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:46 -08:00
Avi Kivity
0e859cacb0 [PATCH] KVM: AMD SVM: handle MSR_STAR in 32-bit mode
This is necessary for linux guests.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:46 -08:00
James Morris
5aacf0ca41 [PATCH] KVM: add valid_vcpu() helper
Consolidate the logic for checking whether a vcpu index is valid.  Also, use
likely(), as a valid value should be the overwhelmingly common case.

Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:45 -08:00
Avi Kivity
bfdc0c280a [PATCH] KVM: Fix vmx hardware_enable() on macbooks
It seems macbooks set bit 2 but not bit 0, which is an "enabled but vmxon will
fault" setting.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Tested-by: Alex Larsson (sometimes testing helps)
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:48 -08:00
Michael Riepe
3b99ab2421 [PATCH] KVM: Don't touch the virtual apic vt registers on 32-bit
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:48 -08:00
Avi Kivity
873a7c423b [PATCH] KVM: Disallow the kvm-amd module on intel hardware, and vice versa
They're not on speaking terms.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:48 -08:00
Avi Kivity
8c7bb723b4 [PATCH] KVM: MMU: Ignore pcd, pwt, and pat bits on ptes
The pcd, pwt, and pat bits on page table entries affect the cpu cache.  Since
the cache is a host resource, the guest should not be able to control it.
Moreover, the meaning of these bits changes depending on whether pat is
enabled or not.

So, force these bits to zero on shadow page table entries at all times.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:47 -08:00
Avi Kivity
0770b19b94 [PATCH] KVM: Remove extranous put_cpu() from vcpu_put()
The arch splitting patchset left an extra put_cpu() in core code, where it can
cause trouble for CONFIG_PREEMPT kernels.

Reported-by: Huihong Luo <huisinro@yahoo.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:47 -08:00
Avi Kivity
7725f0badd [PATCH] KVM: Move find_vmx_entry() to vmx.c
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:47 -08:00
Uri Lublin
f7fbf1fdf0 [PATCH] KVM: Make the GET_SREGS and SET_SREGS ioctls symmetric
This makes the SET_SREGS ioctl behave symmetrically to the GET_SREGS ioctl wrt
the segment access rights flag.

Signed-off-by: Uri Lublin <uril@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:47 -08:00
Avi Kivity
05b3e0c2c7 [PATCH] KVM: Replace __x86_64__ with CONFIG_X86_64
As per akpm's request.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:46 -08:00
Avi Kivity
5aff458e9c [PATCH] KVM: Clean up AMD SVM debug registers load and unload
By letting gcc choose the temporary register for us, we lose arch dependency
and some ugliness.  Conceivably gcc will also generate marginally better code.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:46 -08:00
Avi Kivity
fd24dc4af6 [PATCH] KVM: Put KVM in a new Virtualization menu
Instead of in the main drivers menu.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:46 -08:00
Anthony Liguori
3b3be0d1cc [PATCH] KVM: Add missing include
load_TR_desc() lives in asm/desc.h, so #include that file.

Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:46 -08:00
Avi Kivity
6aa8b732ca [PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net

mailing list: kvm-devel@lists.sourceforge.net
  (http://lists.sourceforge.net/lists/listinfo/kvm-devel)

The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture.  The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace.  Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.

Using this driver, one can start multiple virtual machines on a host.

Each virtual machine is a process on the host; a virtual cpu is a thread in
that process.  kill(1), nice(1), top(1) work as expected.  In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode.  Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm).  Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.

The driver supports i386 and x86_64 hosts and guests.  All combinations are
allowed except x86_64 guest on i386 host.  For i386 guests and hosts, both pae
and non-pae paging modes are supported.

SMP hosts and UP guests are supported.  At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.

Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch.  We plan to address this in two ways:

- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables

Currently a virtual desktop is responsive but consumes a lot of CPU.  Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization.  Linux/X is slower, probably due
to X being in a separate process.

In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.

Caveats (akpm: might no longer be true):

- The Windows install currently bluescreens due to a problem with the
  virtual APIC.  We are working on a fix.  A temporary workaround is to
  use an existing image or install through qemu
- Windows 64-bit does not work.  That's also true for qemu, so it's
  probably a problem with the device model.

[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:22 -08:00