# Background
I was investigating spurious non-deterministic EINTR returns from
various 9p file system operations in a Linux guest served from the
qemu 9p server.
## EINTR, ERESTARTSYS and the linux kernel
When a signal arrives that the Linux kernel needs to deliver to user-space
while a given thread is blocked (in the 9p case waiting for a reply to its
request in 9p_client_rpc -> wait_event_interruptible), it asks whatever
driver is currently running to abort its current operation (in the 9p case
causing the submission of a TFLUSH message) and return to user space.
In these situations, the error message reported is generally ERESTARTSYS.
If the userspace processes specified SA_RESTART, this means that the
system call will get restarted upon completion of the signal handler
delivery (assuming the signal handler doesn't modify the process state
in complicated ways not relevant here). If SA_RESTART is not specified,
ERESTARTSYS gets translated to EINTR and user space is expected to handle
the restart itself.
## The 9p TFLUSH command
The 9p TFLUSH commands requests that the server abort an ongoing operation.
The man page [1] specifies:
```
If it recognizes oldtag as the tag of a pending transaction, it should
abort any pending response and discard that tag.
[...]
When the client sends a Tflush, it must wait to receive the corresponding
Rflush before reusing oldtag for subsequent messages. If a response to the
flushed request is received before the Rflush, the client must honor the
response as if it had not been flushed, since the completed request may
signify a state change in the server
```
In particular, this means that the server must not send a reply with the
orignal tag in response to the cancellation request, because the client is
obligated to interpret such a reply as a coincidental reply to the original
request.
# The bug
When qemu receives a TFlush request, it sets the `cancelled` flag on the
relevant pdu. This flag is periodically checked, e.g. in
`v9fs_co_name_to_path`, and if set, the operation is aborted and the error
is set to EINTR. However, the server then violates the spec, by returning
to the client an Rerror response, rather than discarding the message
entirely. As a result, the client is required to assume that said Rerror
response is a result of the original request, not a result of the
cancellation and thus passes the EINTR error back to user space.
This is not the worst thing it could do, however as discussed above, the
correct error code would have been ERESTARTSYS, such that user space
programs with SA_RESTART set get correctly restarted upon completion of
the signal handler.
Instead, such programs get spurious EINTR results that they were not
expecting to handle.
It should be noted that there are plenty of user space programs that do not
set SA_RESTART and do not correctly handle EINTR either. However, that is
then a userspace bug. It should also be noted that this bug has been
mitigated by a recent commit to the Linux kernel [2], which essentially
prevents the kernel from sending Tflush requests unless the process is about
to die (in which case the process likely doesn't care about the response).
Nevertheless, for older kernels and to comply with the spec, I believe this
change is beneficial.
# Implementation
The fix is fairly simple, just skipping notification of a reply if
the pdu was previously cancelled. We do however, also notify the transport
layer that we're doing this, so it can clean up any resources it may be
holding. I also added a new trace event to distinguish
operations that caused an error reply from those that were cancelled.
One complication is that we only omit sending the message on EINTR errors in
order to avoid confusing the rest of the code (which may assume that a
client knows about a fid if it sucessfully passed it off to pud_complete
without checking for cancellation status). This does mean that if the server
acts upon the cancellation flag, it always needs to set err to EINTR. I
believe this is true of the current code.
[1] https://9fans.github.io/plan9port/man/man9/flush.html
[2] https://github.com/torvalds/linux/commit/9523feac272ccad2ad8186ba4fcc891
Signed-off-by: Keno Fischer <keno@juliacomputing.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
[groug, send a zero-sized reply instead of detaching the buffer]
Signed-off-by: Greg Kurz <groug@kaod.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
No good reasons to do this outside of v9fs_device_realize_common().
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
On some architectures, qemu doesn't support vmcoreinfo device,
and dump-guest-memory fails:
(gdb) dump-guest-memory /tmp/vmcore ppc64-le
guest RAM blocks:
target_start target_end host_addr message count
---------------- ---------------- ---------------- ------- -----
0000000000000000 0000000200000000 00003ffd86980000 added 1
0000200080000000 0000200080800000 00003ffd86170000 added 2
Python Exception <class 'gdb.error'> No symbol "vmcoreinfo_realize" in current context.:
Error occurred in Python command: No symbol "vmcoreinfo_realize" in current context.
Check that vmcoreinfo_realize symbol exists before evaluating an
expression with it.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
200 currently fails on tmpfs because it sets cache=none. However,
without that (and aio=native), the test still works now and it fails
before Jeff's series (on fc7dbc119e0852a70dc9fa68bb41a318e49e4cd6). So
we can probably remove the aio=native safely, and replace cache=none by
cache=$CACHEMODE.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 20180117135015.15051-1-mreitz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
This patch prevents a possible segmentation fault when .desc members are checked
against NULL.
The ssh_runtime_opts was added by commit
8a6a80896d6af03b8ee0c17cdf37219eca2588a7 ("block/ssh: Use QemuOpts for runtime
options").
This fix was inspired by
http://lists.nongnu.org/archive/html/qemu-devel/2018-01/msg00883.html.
Fixes: 8a6a80896d6af03b8ee0c17cdf37219eca2588a7 ("block/ssh: Use QemuOpts for runtime options")
Cc: Max Reitz <mreitz@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.vnet.ibm.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
We masked the wrong bits, which prevented some of the
32-bit R registers. E.g. "fcnvxf,sgl,sgl fr22R,fr6R".
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Now that we have the prerequisites in target/hppa/,
implement the hardware for a PA7100LC.
This also enables build for hppa-softmmu.
Signed-off-by: Helge Deller <deller@gmx.de>
[rth: Since it is all new code, squashed all branch development
withing hw/hppa/ to a single patch.]
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
This is an extension to the base ISA, but we can use this in
the kernel idle loop to reduce the host cpu time consumed.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
HP-UX 10.20 CD contains "add r0, r0, r27" in a delay slot,
which uses at least 5 temps.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Unknown why this works, but if we return EXCP_ITLB_MISS we
will triple-fault the first userland instruction fetch.
Is it something to do with having a combined I/DTLB?
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Linux sets sr4-sr7 all to the same value, which means that we
need not do any runtime computation to find out what space to
use in forming the GVA.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Real hardware would use an external device to control the power.
But for the moment let's invent instructions in reserved space,
to be used by our custom firmware.
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Otherwise there's no way to clear them without an external command,
and it could lock the OS in the VM if they were stuck.
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Macro parameters should almost always have () around them when used.
llvm reported an error on this.
Remove redundant parenthesis and put parenthesis around the entire
macros with assignments in case they are used in an expression.
The macros were doing ((v) & 1) for a binary input, but that only works
if v == 0 or if v & 1. Changed to !!(v) so they work for all values.
Remove some unused macros.
Reported in https://bugs.launchpad.net/bugs/1651167
An audit of these changes found no semantic changes; this is just
cleanups for proper style and to avoid a compiler warning.
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This lets an event be added to the SEL as if a sensor had generated
it. The OpenIPMI driver uses it for storing panic event information.
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
According to the spec, from section "32.3 OEM SEL Record - Type
E0h-FFh", event types from 0x0e to 0xff do not have a timestamp.
So don't set it when adding those types. This required putting
the timestamp in a temporary buffer, since it's still required
to set the last addition time.
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
The minimum message size was on the wrong commands, for getting
the time it's zero and for setting the time it's 6.
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
However since HPPA has a software-managed TLB, and the relevant
TLB manipulation instructions are not implemented, this does not
actually do anything.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Any one TB will have only one space value. If we change spaces,
we change TBs. Thus BE and BEV must exit the TB immediately.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
These instructions force the destination privilege level
of the branch destination to be no higher than current.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
This changes the system virtual address width to 64-bit and
incorporates the space registers into load/store operations.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
While the E bit is only used for pa2.0 mfctl,w from sar,
the otherwise reserved bit does not appear to be decoded.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>