Commit Graph

43 Commits

Author SHA1 Message Date
reinoud e6b71c1487 Implement support for trapping REP CMPS instructions in NVMM.
Qemu would abort hard when NVMM would get a memory trap on the instruction
since it didn't know it.
2020-12-27 20:56:14 +00:00
reinoud b5bbc77284 Revert (REPE) CMPS support per request of Maxime, it is incorrect. 2020-10-31 15:44:01 +00:00
reinoud cfe6ea549d Implement missing (REPE) CMPS instruction support in NVMMs x86_decode().
In apparently rare cases the (REPE) CMPS instruction can trigger an memory
assist. NVMM wouldn't recognize the instruction and thus couldn't assist and
Qemu would abort.
2020-10-30 21:06:13 +00:00
maxv 4a2e4dc388 nvmm: update copyright headers 2020-09-05 07:22:25 +00:00
joerg 7e53a6c032 Annotate a covering switch as such to avoid warnings about missing
returns.
2019-10-28 18:12:17 +00:00
maxv 06a4f19afb Use the new PTE naming, and define CR3_FRAME_* separately. No functional
change.
2019-10-27 08:30:05 +00:00
maxv e6f32a5866 Three changes in libnvmm:
- Add 'mach' and 'vcpu' backpointers in the nvmm_io and nvmm_mem
   structures.

 - Rename 'nvmm_callbacks' to 'nvmm_assist_callbacks'.

 - Rename and migrate NVMM_MACH_CONF_CALLBACKS to NVMM_VCPU_CONF_CALLBACKS,
   it now becomes per-VCPU.
2019-10-23 12:02:55 +00:00
maxv f9fb7866ce Miscellaneous changes in NVMM, to address several inconsistencies and
issues in the libnvmm API.

 - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in
   libnvmm. Introduce NVMM_USER_VERSION, for future use.

 - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to
   avoid sharing the VMs with the children if the process forks. In the
   NVMM driver, force O_CLOEXEC on open().

 - Rename the following things for consistency:
       nvmm_exit*              -> nvmm_vcpu_exit*
       nvmm_event*             -> nvmm_vcpu_event*
       NVMM_EXIT_*             -> NVMM_VCPU_EXIT_*
       NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR
       NVMM_EVENT_EXCEPTION    -> NVMM_VCPU_EVENT_EXCP
   Delete NVMM_EVENT_INTERRUPT_SW, unused already.

 - Slightly reorganize the MI/MD definitions, for internal clarity.

 - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide
   separate u.rdmsr and u.wrmsr fields. This is more consistent with the
   other exit reasons.

 - Change the types of several variables:
       event.type                  enum -> u_int
       event.vector                uint64_t -> uint8_t
       exit.u.*msr.msr:            uint64_t -> uint32_t
       exit.u.io.type:             enum -> bool
       exit.u.io.seg:              int -> int8_t
       cap.arch.mxcsr_mask:        uint64_t -> uint32_t
       cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t

 - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we
   already intercept 'monitor' so it is never armed.

 - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn().
   The 'npc' field wasn't getting filled properly during certain VMEXITs.

 - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(),
   but as its name indicates, the configuration is per-VCPU and not per-VM.
   Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID.
   This becomes per-VCPU, which makes more sense than per-VM.

 - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on
   specific leaves. Until now we could only mask the leaves. An uint32_t
   is added in the structure:
	uint32_t mask:1;
	uint32_t exit:1;
	uint32_t rsvd:30;
   The two first bits select the desired behavior on the leaf. Specifying
   zero on both resets the leaf to the default behavior. The new
   NVMM_VCPU_EXIT_CPUID exit reason is added.
2019-10-23 07:01:11 +00:00
maxv e1d43b1e6d Put back 'default', because llvm apparently doesn't realize that all cases
are covered in the switch.
2019-10-19 19:45:10 +00:00
maxv fabc9bf218 Improve nvmm_vcpu_dump(). 2019-10-14 10:43:40 +00:00
maxv c496a7b118 Implement XCHG, add associated tests, and add comments to explain. With
this in place the Windows 95 installer completes successfuly.

Part of PR/54611.
2019-10-14 10:39:24 +00:00
maxv 416eaf02dc Fix incorrect parsing: the R/M field uses a special GPR map when the
address size is 16 bits, regardless of the actual operating mode. With
this special map there can be two registers referenced at once, and
also disp16-only.

Implement this special behavior, and add associated tests. While here
simplify a few things.

With this in place, the Windows 95 installer initializes correctly.

Part of PR/54611.
2019-10-13 17:32:15 +00:00
maxv d1002cd7eb Change the NVMM API to reduce data movements. Sent to tech-kern@. 2019-06-08 07:27:44 +00:00
maxv bfb4017486 Rework the machine configuration interface.
Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and
<MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf
op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now
per-machine, and the emulators should now do:

-	nvmm_callbacks_register(&cbs);
+	nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs);

This provides more granularity, for example if the process runs two VMs
and wants different callbacks for each.
2019-05-11 07:31:56 +00:00
maxv f973734497 Modify the communication layer between the kernel NVMM driver and libnvmm:
introduce a bidirectionnal "comm page", a page of memory shared between
the kernel and userland, and used to transfer data in and out in a more
performant manner than ioctls.

The comm page contains the VCPU state, plus three flags:

 - "wanted": the states the kernel must get/set when requested via ioctls
 - "cached": the states that are in the comm page
 - "commit": the states the kernel must set in vcpu_run

The idea is to avoid performing expensive syscalls, by using the VCPU
state cached, either explicitly or speculatively, in the comm page. For
example, if the state is cached we do a direct 1->5 with no syscall:

          +---------------------------------------------+
          |                    Qemu                     |
          +---------------------------------------------+
               |                                   ^
               | (0) nvmm_vcpu_getstate            | (6) Done
               |                                   |
               V                                   |
             +---------------------------------------+
             |                libnvmm                |
             +---------------------------------------+
                  |   ^          |               ^
        (1) State |   | (2) No   | (3) Ioctl:    | (5) Ok, state
        cached?   |   |          | "please cache | fetched
                  |   |          |  the state"   |
                  V   |          |               |
              +-----------+      |               |
              | Comm Page |------+---------------+
              +-----------+      |
                       ^         |
          (4) "Alright |         V
               babe"   |     +--------+
                       +-----| Kernel |
                             +--------+

The main changes in behavior are:

 - nvmm_vcpu_getstate(): won't emit a syscall if the state is already
   cached in the comm page, will just fetch from the comm page directly
 - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache
   the wanted state in the comm page
 - nvmm_vcpu_run(): will commit the to-be-set state in the comm page,
   as previously requested by nvmm_vcpu_setstate()

In addition to this, the kernel NVMM driver is changed to speculatively
cache certain states known to be of interest, so that the future
nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use
the comm page rather than expensive syscalls. For example, if an I/O
VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS,
and now the kernel caches all of that in the comm page before returning
to userland.

Overall, in a normal run of Windows 10, this saves several millions of
syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO
goes from taking 1min35 to taking 1min16.

The libnvmm API is not changed, but the ABI is. If we changed the API it
would be possible to save expensive memcpys on libnvmm's side. This will
be avoided in a future version. The comm page can also be extended to
implement future services.
2019-04-28 14:22:13 +00:00
maxv e00a8e01f5 Check the GPA permissions too in the Assists, because it is possible that
the guest traps on a page the virtualizer marked as read-only (even if it
appears as read-write in the HVA).
2019-04-04 17:33:47 +00:00
maxv 4977f2eb4a Micro optimizations:
- Compress x86_rexpref, x86_regmodrm, x86_opcode and x86_instr.
 - Cache-align the register, opcode and group tables.
 - Modify the opcode tables to have 256 entries, and avoid a lookup.
2019-03-07 15:47:34 +00:00
maxv 7a4f551dcf Change the layout of the SEG state:
- Reorder it, to match the CPU encoding. This is the universal order,
   also used by Qemu. Drop the seg_to_nvmm[] tables.

 - Compress it. This divides its size by two.

 - Rename some of its fields, to better match the x86 spec. Also, take S
   out of Type, this was a NetBSD-ism that was likely confusing to other
   people.
2019-02-26 12:23:12 +00:00
maxv 8d8eb34b8b Set hardseg to -1 rather than 0, because 0 can be a valid segment. 2019-02-26 10:18:39 +00:00
maxv 4cdf419d72 Fix handling of SIB instructions. We were jumping to the SIB node _before_
fetching the displacement, so the node would always think there was no
displacement.

This didn't alter the final GPA we would be touching - because it is
fetched from the kernel directly and not from the computation -, but it
altered the instruction length, and on some guests (like Fedora 64bit),
the VCPU would resume execution at the wrong RIP and crash.

Now these guests work.
2019-02-17 20:25:46 +00:00
maxv 2fad18ce40 Remove the PSE check in the 32bit-PAE MMU. Setting CR4.PAE automatically
enables PSE regardless of whether CR4.PSE is set or not, so we should just
ignore it.

With this in place I can boot Windows 8.1 on NVMM.
2019-02-15 16:42:27 +00:00
maxv 27a60aeb62 Harmonize the handling of the CPL between AMD and Intel.
AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET
instructions do not force SS.DPL to predefined values. On Intel they do,
so the CPL on Intel is just the guest's SS.DPL value.

Even though technically possible on AMD, there is no sane reason for a
guest kernel to set a non-three SS.DPL, doing that would mess up several
common segmentation practices and wouldn't be compatible with Intel.

So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL.
Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually
increases performance on AMD: to detect interrupt windows the virtualizer
has to modify some fields of misc[], and because CPL was there, we had to
flush the SEG set of the VMCB cache. Now there is no flush necessary.

While here remove the CPL check for XSETBV on Intel, contrary to AMD
Intel checks the CPL before the intercept, so if we receive an XSETBV
VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the
way my check was wrong in the first place, it was reading SS.RPL instead
of SS.DPL.
2019-02-14 14:30:20 +00:00
maxv f911f1c1e1 Optimize: fetch only 5 bytes instead of 15, the instruction can have only
up to five prefixes.
2019-02-12 14:50:21 +00:00
christos 12f8b8a214 #### is not legal. 2019-02-10 19:30:28 +00:00
maxv 83ed0b5e52 Improvements:
- Emulate the instructions by executing them directly on the host CPU.
   This is easier and probably faster than doing it in software
   manually.

 - Decode SUB from Primary, CMP from Group1, TEST from Group3, and add
   associated tests.

 - Handle correctly the cases where an instruction that always implicitly
   reads the register operand is executed with the mem operand as source
   (eg: "orq (%rbx),%rax").

 - Fix the MMU handling of 32bit-PAE. Under PAE CR3 is not page-aligned,
   so there are extra bits that are valid.

With these changes in place I can boot Windows XP on Qemu+NVMM.
2019-02-07 10:58:45 +00:00
maxv 2089a3819a Fix two issues:
* Uh I put the wrong masks in some GPRs, fuck.

 * When the opsize of MOVZX is 4, we need to combine the zero-extend from
   the instruction with the natural zero-extend of long mode.

Add two associated tests.
2019-02-01 06:49:58 +00:00
pgoyette d91f98a871 Merge the [pgoyette-compat] branch 2019-01-27 02:08:33 +00:00
maxv c07836be52 Ah, fix bug: when the opcode has an immediate, we fill the src with a
register storage, but then we overwrite it without zeroing out the highest
bits of the resulting immediate (which may contain garbage from the union).
2019-01-26 14:44:54 +00:00
maxv 7ceb32d30a Handle more corner cases, clean up a little, and add a set of instructions
in Group1.
2019-01-13 10:43:22 +00:00
maxv 06484a82be Handle REPN. FreeBSD has a "repn movs", which is a bit unusual, but doesn't
seem illegal as far as I can tell from the AMD SDM.

With that, I can boot FreeBSD on Qemu+NVMM.
2019-01-08 07:34:22 +00:00
maxv 75c7df3cfe Optimize the legpref node: omit BRN (we don't care and it's the same as
OVR_CS), inline the loops, sort the checks from most to least likely
prefix, and use a compact structure.
2019-01-07 18:13:34 +00:00
maxv 04b8bfbf75 Optimize: on single memory operand instructions, take the GPA directly from
the exit structure provided by the kernel. This saves an MMU translation,
and sometimes complex address computation (eg SIB).

Drop the GVA field, it is not useful to virtualizers.
2019-01-07 16:30:25 +00:00
maxv 960d1f7675 Improvements and fixes:
* Decode AND/OR/XOR from Group1.

 * Sign-extend the immediates and displacements in 64bit mode.

 * Fix the storage of {read,write}_guest_memory, now that we batch certain
   IO operations we can copy more than 8 bytes, and shit hits the fan.

 * Remove the CR4_PSE check in the 64bit MMU. This bit is actually ignored
   in long mode, and some systems (like FreeBSD) don't set it.
2019-01-07 13:47:33 +00:00
maxv 809327425b Improvements and fixes in NVMM.
Kernel driver:

 * Don't take an extra (unneeded) reference to the UAO.

 * Provide npc for HLT. I'm not really happy with it right now, will
   likely be revisited.

 * Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide
   them in the exitstate too.

 * Don't take the TPR into account when processing INTs. The virtualizer
   can do that itself (Qemu already does).

 * Provide a hypervisor signature in CPUID, and hide SVM.

 * Ignore certain MSRs. One special case is MSR_NB_CFG in which we set
   NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC.

 * If the LWP has pending signals or softints, leave, rather than waiting
   for a rescheduling to happen later. This reduces interrupt processing
   time in the guest (Qemu sends a signal to the thread, and now we leave
   right away). This could be improved even more by sending an actual IPI
   to the CPU, but I'll see later.

Libnvmm:

 * Fix the MMU translation of large pages, we need to add the lower bits
   too.

 * Change the IO and Mem structures to take a pointer rather than a
   static array. This provides more flexibility.

 * Batch together the str+rep IO transactions. We do one big memory
   read/write, and then send the IO commands to the hypervisor all at
   once. This considerably increases performance.

 * Decode MOVZX.

With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0
in a VM with multiple VCPUs, connect to the network, etc.
2019-01-06 16:10:51 +00:00
maxv 2e9744b39f In !64bit mode RIP-relative is null+disp32, handle that correctly. 2019-01-04 10:25:39 +00:00
maxv 579fb4792d When there's no DecodeAssist in hardware, decode manually in software. This
is needed on certain AMD CPUs (like mine): the segment base of OUTS can be
overridden, and it is wrong to just assume DS.

We fetch the instruction and look at the prefixes if any to determine the
correct segment.
2019-01-02 12:18:08 +00:00
maxv 4aa536c2db Fix the segmentation check, the limit is relative, not absolute. 2018-12-29 17:54:54 +00:00
maxv 38b2a665bf Several improvements and fixes:
* Change the Assist API. Rather than passing callbacks in each call, the
   callbacks are now registered beforehand. Then change the I/O Assist to
   fetch MMIO data via the Mem callback. This allows a guest to perform an
   I/O string operation on a memory that is itself an MMIO.

 * Introduce two new functions internal to libnvmm, read_guest_memory and
   write_guest_memory. They can handle mapped memory, MMIO memory and
   cross-page transactions.

 * Allow nvmm_gva_to_gpa and nvmm_gpa_to_hva to take non-page-aligned
   addresses. This simplifies a lot of things.

 * Support the MOVS instruction, and add a test for it. This instruction
   is special, in that it takes two implicit memory operands. In
   particular, it means that the two buffers can both be in MMIO memory,
   and we handle this case.

 * Fix gross copy-pasto in nvmm_hva_unmap. Also fix a few things here and
   there.
2018-12-27 07:22:31 +00:00
maxv 3f62f34a84 Two changes:
- Fix the I/O Assist, for INS* it is RDI and not RSI, and the register
   gets updated regardless of the REP prefix.

 - Fill in the Mem Assist. We decode and emulate certain instructions,
   and pass a mem descriptor to the callback to handle the transaction.
   The disassembler could use some polishing, and there are still a
   few instructions missing; but basically it works.
2018-12-15 13:09:02 +00:00
maxv 07310f302a Don't forget to set 'prot' when the guest has paging disabled. 2018-11-17 16:11:33 +00:00
maya c587647461 Revert my own rev 1.2, the missing include was only when building the 32-bit
compat library, we no longer do this.
2018-11-13 06:57:14 +00:00
maya dd1f793151 Add missing include for struct nvmm_x64_state
(Pointed out by the clang build)
2018-11-11 00:06:48 +00:00
maxv 2760ca24b5 Add libnvmm, NetBSD's new virtualization API. It provides a way for VMM
software to effortlessly create and manage virtual machines via NVMM.

It is mostly complete, only nvmm_assist_mem needs to be filled -- I have
a draft for that, but it needs some more care. This Mem Assist should
not be needed when emulating a system in x2apic mode, so theoretically
the current form of libnvmm is sufficient to emulate a whole class of
systems.

Generally speaking, there are so many modes in x86 that it is difficult
to handle each corner case without introducing a ton of checks that just
slow down the common-case execution. Currently we check a limited number
of things; we may add more checks in the future if they turn out to be
needed, but that's rather low priority.

Libnvmm is compiled and installed only on amd64. A man page (reviewed by
wiz@) is provided.
2018-11-10 09:28:56 +00:00