address size is 16 bits, regardless of the actual operating mode. With
this special map there can be two registers referenced at once, and
also disp16-only.
Implement this special behavior, and add associated tests. While here
simplify a few things.
With this in place, the Windows 95 installer initializes correctly.
Part of PR/54611.
Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and
<MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf
op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now
per-machine, and the emulators should now do:
- nvmm_callbacks_register(&cbs);
+ nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs);
This provides more granularity, for example if the process runs two VMs
and wants different callbacks for each.
vcpu_run. This saves a few syscalls and copyins.
For example on Windows 10, moving the mouse from the left to right sides of
the screen generates ~500 events, which now don't result in syscalls.
The error handling is done in vcpu_run and it is less precise, but this
doesn't matter a lot, and will be solved with future NVMM error codes.
introduce a bidirectionnal "comm page", a page of memory shared between
the kernel and userland, and used to transfer data in and out in a more
performant manner than ioctls.
The comm page contains the VCPU state, plus three flags:
- "wanted": the states the kernel must get/set when requested via ioctls
- "cached": the states that are in the comm page
- "commit": the states the kernel must set in vcpu_run
The idea is to avoid performing expensive syscalls, by using the VCPU
state cached, either explicitly or speculatively, in the comm page. For
example, if the state is cached we do a direct 1->5 with no syscall:
+---------------------------------------------+
| Qemu |
+---------------------------------------------+
| ^
| (0) nvmm_vcpu_getstate | (6) Done
| |
V |
+---------------------------------------+
| libnvmm |
+---------------------------------------+
| ^ | ^
(1) State | | (2) No | (3) Ioctl: | (5) Ok, state
cached? | | | "please cache | fetched
| | | the state" |
V | | |
+-----------+ | |
| Comm Page |------+---------------+
+-----------+ |
^ |
(4) "Alright | V
babe" | +--------+
+-----| Kernel |
+--------+
The main changes in behavior are:
- nvmm_vcpu_getstate(): won't emit a syscall if the state is already
cached in the comm page, will just fetch from the comm page directly
- nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache
the wanted state in the comm page
- nvmm_vcpu_run(): will commit the to-be-set state in the comm page,
as previously requested by nvmm_vcpu_setstate()
In addition to this, the kernel NVMM driver is changed to speculatively
cache certain states known to be of interest, so that the future
nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use
the comm page rather than expensive syscalls. For example, if an I/O
VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS,
and now the kernel caches all of that in the comm page before returning
to userland.
Overall, in a normal run of Windows 10, this saves several millions of
syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO
goes from taking 1min35 to taking 1min16.
The libnvmm API is not changed, but the ABI is. If we changed the API it
would be possible to save expensive memcpys on libnvmm's side. This will
be avoided in a future version. The comm page can also be extended to
implement future services.
For some reason I had initially concluded that it wasn't doable; verily it
is, so let's do it.
The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes
mmap-like protection codes.
- Compress x86_rexpref, x86_regmodrm, x86_opcode and x86_instr.
- Cache-align the register, opcode and group tables.
- Modify the opcode tables to have 256 entries, and avoid a lookup.
- Reorder it, to match the CPU encoding. This is the universal order,
also used by Qemu. Drop the seg_to_nvmm[] tables.
- Compress it. This divides its size by two.
- Rename some of its fields, to better match the x86 spec. Also, take S
out of Type, this was a NetBSD-ism that was likely confusing to other
people.
fetching the displacement, so the node would always think there was no
displacement.
This didn't alter the final GPA we would be touching - because it is
fetched from the kernel directly and not from the computation -, but it
altered the instruction length, and on some guests (like Fedora 64bit),
the VCPU would resume execution at the wrong RIP and crash.
Now these guests work.
AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET
instructions do not force SS.DPL to predefined values. On Intel they do,
so the CPL on Intel is just the guest's SS.DPL value.
Even though technically possible on AMD, there is no sane reason for a
guest kernel to set a non-three SS.DPL, doing that would mess up several
common segmentation practices and wouldn't be compatible with Intel.
So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL.
Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually
increases performance on AMD: to detect interrupt windows the virtualizer
has to modify some fields of misc[], and because CPL was there, we had to
flush the SEG set of the VMCB cache. Now there is no flush necessary.
While here remove the CPL check for XSETBV on Intel, contrary to AMD
Intel checks the CPL before the intercept, so if we receive an XSETBV
VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the
way my check was wrong in the first place, it was reading SS.RPL instead
of SS.DPL.
- Emulate the instructions by executing them directly on the host CPU.
This is easier and probably faster than doing it in software
manually.
- Decode SUB from Primary, CMP from Group1, TEST from Group3, and add
associated tests.
- Handle correctly the cases where an instruction that always implicitly
reads the register operand is executed with the mem operand as source
(eg: "orq (%rbx),%rax").
- Fix the MMU handling of 32bit-PAE. Under PAE CR3 is not page-aligned,
so there are extra bits that are valid.
With these changes in place I can boot Windows XP on Qemu+NVMM.
* Uh I put the wrong masks in some GPRs, fuck.
* When the opsize of MOVZX is 4, we need to combine the zero-extend from
the instruction with the natural zero-extend of long mode.
Add two associated tests.
the exit structure provided by the kernel. This saves an MMU translation,
and sometimes complex address computation (eg SIB).
Drop the GVA field, it is not useful to virtualizers.
* Decode AND/OR/XOR from Group1.
* Sign-extend the immediates and displacements in 64bit mode.
* Fix the storage of {read,write}_guest_memory, now that we batch certain
IO operations we can copy more than 8 bytes, and shit hits the fan.
* Remove the CR4_PSE check in the 64bit MMU. This bit is actually ignored
in long mode, and some systems (like FreeBSD) don't set it.
Kernel driver:
* Don't take an extra (unneeded) reference to the UAO.
* Provide npc for HLT. I'm not really happy with it right now, will
likely be revisited.
* Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide
them in the exitstate too.
* Don't take the TPR into account when processing INTs. The virtualizer
can do that itself (Qemu already does).
* Provide a hypervisor signature in CPUID, and hide SVM.
* Ignore certain MSRs. One special case is MSR_NB_CFG in which we set
NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC.
* If the LWP has pending signals or softints, leave, rather than waiting
for a rescheduling to happen later. This reduces interrupt processing
time in the guest (Qemu sends a signal to the thread, and now we leave
right away). This could be improved even more by sending an actual IPI
to the CPU, but I'll see later.
Libnvmm:
* Fix the MMU translation of large pages, we need to add the lower bits
too.
* Change the IO and Mem structures to take a pointer rather than a
static array. This provides more flexibility.
* Batch together the str+rep IO transactions. We do one big memory
read/write, and then send the IO commands to the hypervisor all at
once. This considerably increases performance.
* Decode MOVZX.
With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0
in a VM with multiple VCPUs, connect to the network, etc.
is needed on certain AMD CPUs (like mine): the segment base of OUTS can be
overridden, and it is wrong to just assume DS.
We fetch the instruction and look at the prefixes if any to determine the
correct segment.
* Change the Assist API. Rather than passing callbacks in each call, the
callbacks are now registered beforehand. Then change the I/O Assist to
fetch MMIO data via the Mem callback. This allows a guest to perform an
I/O string operation on a memory that is itself an MMIO.
* Introduce two new functions internal to libnvmm, read_guest_memory and
write_guest_memory. They can handle mapped memory, MMIO memory and
cross-page transactions.
* Allow nvmm_gva_to_gpa and nvmm_gpa_to_hva to take non-page-aligned
addresses. This simplifies a lot of things.
* Support the MOVS instruction, and add a test for it. This instruction
is special, in that it takes two implicit memory operands. In
particular, it means that the two buffers can both be in MMIO memory,
and we handle this case.
* Fix gross copy-pasto in nvmm_hva_unmap. Also fix a few things here and
there.
Until now, the "owner" of the memory was the guest, and by calling
nvmm_gpa_map(), the virtualizer was creating a view towards the guest
memory.
Qemu expects the contrary: it wants the owner to be the virtualizer, and
nvmm_gpa_map should just create a view from the guest towards the
virtualizer's address space. Under this scheme, it is legal to have two
GPAs that point to the same HVA.
Introduce nvmm_hva_map() and nvmm_hva_unmap(), that map/unamp the HVA into
a dedicated UOBJ. Change nvmm_gpa_map() and nvmm_gpa_unmap() to just
perform an enter into the desired UOBJ.
With this change in place, all the mapping-related problems in Qemu+NVMM
are fixed.
- Fix the I/O Assist, for INS* it is RDI and not RSI, and the register
gets updated regardless of the REP prefix.
- Fill in the Mem Assist. We decode and emulate certain instructions,
and pass a mem descriptor to the callback to handle the transaction.
The disassembler could use some polishing, and there are still a
few instructions missing; but basically it works.
and smallkern, there is little interest installing them by default,
rather they can be downloaded from www. It's better this way.
While here add NVMM(4) in "SEE ALSO".
noted by agc@. These _nvmm_area_add/delete functions don't make a lot of
sense right now and will likely be rewritten to match the behavior
expected by Qemu; but still fix for the time being.
Also fix a collision check while here.