Reproducer:
A: for (;;) { mkdir("c", 0600); mkdir("c/d", 0600); mkdir("c/d/e", 0600);
rmdir("c/d/e"); rmdir("c/d"); }
B: for (;;) { mkdir("c", 0600); mkdir("c/d", 0600); mkdir("c/d/e", 0600);
rename("c", "c/d/e"); }
C: for (;;) { mkdir("c", 0600); mkdir("c/d", 0600); mkdir("c/d/e", 0600);
rename("c/d/e", "c"); }
Deadlock:
- A holds c and wants to lock d; and either
- B holds . and d and wants to lock c, or
- C holds . and d and wants to lock c.
The problem with these is that genfs_rename_enter_separate in B or C
tried the lock order .->d->c->e (in A/B, fdvp->tdvp->fvp->tvp; in A/C,
tdvp->fdvp->tvp->fvp), which violates the ancestor->descendant order
.->c->d->e.
The resolution is to change B to do fdvp->fvp->tdvp->tvp and C to do
tdvp->tvp->fdvp->fvp. But there's an edge case: tvp and fvp might be
the same (hard links), and we can't detect that until after we've
looked them both up -- and in some file systems (I'm looking at you,
ufs), there is no mere lookup operation, only lookup-and-lock, so we
can't even hold the lock on one of tvp or fvp when we look up the
other one if there's a chance they might be the same.
Fortunately the cases
(a) tvp = fvp
(b) tvp or fvp is a directory
are mutually exclusive as long as directories cannot be hard-linked.
In case (a) we can just defer locking {tvp, fvp} until the end, because
it can't possibly have {fdvp or fvp, tdvp or tvp} as descendants. In
case (b) we can just lock them in the order fdvp->fvp->tdvp->tvp or
tdvp->tvp->fdvp->fvp if the first one of {fvp, tvp} is a directory,
because it can't possibly coincide with the second one of {fvp, tvp}.
With this change, we can now prove that the locking order is consistent
with the ancestor->descendant partial ordering. Where two nodes are
incommensurate under that partial ordering, they are only ever locked
by rename and there is only ever one rename at a time.
Proof:
- For same-directory renames, genfs_rename_enter_common locks the
directory first and then the children. The order
directory->child[i] is consistent with ancestor->descendant and
child[0]/child[1] are incommensurate.
- For cross-directory renames:
. While a rename is in progress and the fs-wide rename lock is held,
directories can be created or removed but not changed, so the
outcome of gro_genealogy -- which, given fdvp and tdvp, returns
the node N relating fdvp/N/.../tdvp or null if there is none --
can only transition from finding N to not finding N, if one of
the directories is removed while any of the vnodes are unlocked.
Merely creating directories cannot change the ancestry of tdvp,
and concurrent renames are not possible.
Thus, if gro_genealogy determined the operation to have the
form fdvp/N/.../tdvp, then it might cease to have that form, but
only because tdvp was removed, which will harmlessly cause the
rename to fail later on. Similarly, if gro_genealogy determined
the operation _not_ to have the form fdvp/N/.../tdvp then it
can't begin to have that form until after the rename has
completed.
The lock order is,
=> for fdvp/.../tdvp:
1. lock fdvp
2. lookup(/lock/unlock) fvp (consistent with fdvp->fvp)
3. lock fvp if a directory (consistent with fdvp->fvp)
4. lock tdvp (consistent with fdvp->tdvp and possibly fvp->tdvp)
5. lookup(/lock/unlock) tvp (consistent with tdvp->tvp)
6. lock fvp if a nondirectory (fvp->t* or fvp->fdvp is impossible)
7. lock tvp if not fvp (tvp->f* is impossible unless tvp=fvp)
=> for incommensurate fdvp & tdvp, or for tdvp/.../fdvp:
1. lock tdvp
2. lookup(/lock/unlock) tvp (consistent with tdvp->tvp)
3. lock tvp if a directory (consistent with tdvp->tvp)
4. lock fdvp (either incommensurate with tdvp and/or tvp, or
consistent with tdvp(->tvp)->fdvp)
5. lookup(/lock/unlock) fvp (consistent with fdvp->fvp)
6. lock tvp if a nondirectory (tvp->f* or tvp->tdvp is impossible)
7. lock fvp if not tvp (fvp->t* is impossible unless fvp=tvp)
Deadlocks found by hannken@; resolution worked out with dholland@.
XXX I think we could improve concurrency somewhat -- with a likely
big win for applications like tar and rsync that create many files
with temporary names and then rename them to the permanent one in the
same directory -- by making vfs_renamelock a reader/writer lock: any
number of same-directory renames, or exactly one cross-directory
rename, at any one time.
The word "get" implies a cheap operation without side effects. Parsing,
by contrast, has lots of side effects, even if the only one is that the
parsing position is updated.
The test had failed in the releng build because it assumed it was run
with .CURDIR == .PARSEDIR. This assumption is true when the tests are
run directly from usr.bin/make, but not when they are run from
tests/usr.bin/make.
By using a Stack instead of a Lst, the available API is reduced to the
very few functions that are really needed for a stack. This prevents
accidental misuse (such as confusing Lst_Append with Lst_Prepend) and
clearly communicates what the expected behavior is.
A stack also needs fewer calls to bmake_malloc than an equally-sized
list, and the memory is contiguous. For the nested include path, all
this doesn't matter, but the type is so generic that it may be used in
other places as well.
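
As a rough illustration of why such a type is small and contiguous,
here is a minimal array-backed stack. It is a sketch under assumptions,
not the code from make: the function names are hypothetical, and plain
realloc with abort() stands in for bmake_malloc and friends.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Growable array of pointers; the top of the stack is items[len - 1]. */
typedef struct Stack {
	void **items;
	size_t len;
	size_t cap;
} Stack;

static void
Stack_Init(Stack *s)
{
	s->items = NULL;
	s->len = 0;
	s->cap = 0;
}

static bool
Stack_IsEmpty(const Stack *s)
{
	return s->len == 0;
}

static void
Stack_Push(Stack *s, void *item)
{
	if (s->len == s->cap) {
		/* Doubling keeps the number of allocations logarithmic. */
		size_t cap = s->cap == 0 ? 8 : 2 * s->cap;
		void **items = realloc(s->items, cap * sizeof(void *));

		if (items == NULL)
			abort();
		s->items = items;
		s->cap = cap;
	}
	s->items[s->len++] = item;
}

static void *
Stack_Pop(Stack *s)
{
	assert(s->len > 0);
	return s->items[--s->len];
}

static void
Stack_Done(Stack *s)
{
	free(s->items);
	Stack_Init(s);
}
```

There is no way to insert at the wrong end with this interface, which
is the point made above about Lst_Append versus Lst_Prepend.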
- Flush the guest TLB when certain CR0 bits change.
- If the guest updates a static bit in CR0, then reflect the change in
VMCS_CR0_SHADOW, for the guest to get the illusion that the change was
applied. The "real" CR0 static bits remain unchanged.
- In vmx_vcpu_{g,s}et_state(), take VMCS_CR0_SHADOW into account.
- Slightly modify the CR4 handling code, just for more symmetry with CR0.
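
The shadowing of static CR0 bits can be pictured with the sketch
below. It is an illustration only, not the vmx code: struct vcpu_cr0,
both functions and both masks are hypothetical (which bits are static,
and which ones trigger a TLB flush, is chosen arbitrarily here). Only
the CR0 bit positions themselves are architectural.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CR0_PE	UINT64_C(0x00000001)
#define CR0_WP	UINT64_C(0x00010000)
#define CR0_NW	UINT64_C(0x20000000)
#define CR0_CD	UINT64_C(0x40000000)
#define CR0_PG	UINT64_C(0x80000000)

/* Assumption for illustration: bits the guest may not really change. */
#define CR0_STATIC_MASK		(CR0_NW | CR0_CD)
/* Assumption for illustration: bits whose change needs a TLB flush. */
#define CR0_TLB_FLUSH_MASK	(CR0_PG | CR0_WP)

struct vcpu_cr0 {
	uint64_t real;		/* value the hardware actually uses */
	uint64_t shadow;	/* value the guest believes it wrote */
};

/* Apply a guest write; returns true if the guest TLB must be flushed. */
static bool
guest_write_cr0(struct vcpu_cr0 *c, uint64_t val)
{
	uint64_t old_real;

	assert(c != NULL);
	old_real = c->real;
	/* Static bits keep their real value; the rest follow the guest. */
	c->real = (old_real & CR0_STATIC_MASK) | (val & ~CR0_STATIC_MASK);
	/* The shadow records the write so that reads look consistent. */
	c->shadow = val;
	return ((old_real ^ c->real) & CR0_TLB_FLUSH_MASK) != 0;
}

/* What the guest sees: shadow for static bits, real for the others. */
static uint64_t
guest_read_cr0(const struct vcpu_cr0 *c)
{
	assert(c != NULL);
	return (c->real & ~CR0_STATIC_MASK) | (c->shadow & CR0_STATIC_MASK);
}
```

In this model a guest that clears CR0_CD reads back a CR0 without
CR0_CD, while the real register keeps the bit set.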
This way, fields 2 and 3 don't jump horizontally as often as before,
which makes the appearance of the whole file as calm and organized as it
should be.
In the previous brute force search, it seemed there was no string with
that hash code. That was probably an oversight or a little programming
mistake. Anyway, it's possible to get that hash value, so keep the
example.
The previous test vectors didn't contain any hash with a leading zero.
This could have been caused by a simple programming mistake: using %8x
instead of the intended %08x. Using snprintf wouldn't have been possible
anyway since the hex digits are printed in little-endian order, but
without reversing the bits of each digit. Kind of unusual, but doesn't
affect the distribution of the hashes.
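
The difference between the two conversions, and one reading of
"little-endian digit order", can be shown with the sketch below. The
function names are made up for illustration, and the digit loop is an
assumption about what the description means rather than a copy of the
hashing code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* "%8x" pads with spaces, so small values get no leading zero digits. */
static void
format_space_padded(char buf[9], uint32_t h)
{
	int n = snprintf(buf, 9, "%8x", (unsigned int)h);

	assert(n == 8);
	(void)n;
}

/* "%08x" pads with zeros, giving eight hex digits every time. */
static void
format_zero_padded(char buf[9], uint32_t h)
{
	int n = snprintf(buf, 9, "%08x", (unsigned int)h);

	assert(n == 8);
	(void)n;
}

/*
 * Emit the least significant hex digit first. Each digit keeps its
 * normal value; only the order of the digits is reversed.
 */
static void
format_low_digit_first(char buf[9], uint32_t h)
{
	static const char hexdigits[] = "0123456789abcdef";
	int i;

	for (i = 0; i < 8; i++) {
		buf[i] = hexdigits[h & 0xf];
		h >>= 4;
	}
	buf[8] = '\0';
}
```

For the value 0x1234 the three functions produce "    1234",
"00001234" and "43210000" respectively.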
- exception_return(): Use GET_CURLWP directly, rather than a dance
around GET_CPUINFO.
- Introduce SET_CURLWP(), to set the curlwp value.
- Garbage-collect GET_FPCURLWP.