For jobs -p for a non-job-control job, avoid just printing 0 (as
there is no process group pid) and instead output what we used to,
the pid of one of the processes in the job (usually the right one!)
XXX pullup -10 (9 and earlier not affected).
sh has been remembering the process group of a job for a while now, but
using that for almost nothing.
The old way to resume a job, was to try each pid in the job with a
SIGCONT (using it as the process group identifier via killpg()) until
one worked (or none did, in which case resuming would be impossible,
but that never actually happened). This wasn't as bad as it seems,
as in practice the first process attempted was *always* the correct
one. Why the loop was considered necessary I am not sure. Nothing
but the first could possibly work.
This worked until a fix for an obscure possible bug was added a
while ago - now a process which has already finished, and had its
zombie collected via wait*() is no longer ever considered to have
a pid which is a candidate for use in any system call. That's
because the kernel might have reassigned that pid for some newly
created process (we have no idea how much time might have passed
since the pid was returned to the kernel for reuse, it might have
happened weeks ago).
This is where the example in bin/57053 revealed a problem.
That PR is really about a quite different problem in zsh (from pksrc)
and should be pkg/57053, but as the test case also hit the problem
here, it was assumed (by some) they were the same issue.
The example is (in a small directory)
ls | less
which is then suspended (^Z), and resumed (fg). Since the directory
is small, ls will be finished, and reaped by sh - so the code would
now refuse to use its pid for the killpg() call to send the SIGCONT.
The (useless) loop would attempt to use less's pid for this purpose
(it is still alive at this point) but that would fail, as that pid
is not a process group identifier, of anything. Hence the job
could not be resumed.
Before the PR (or preceding mailing list discussion) the change here
had already been made (part of a much bigger set of changes, some of
which might follow - sometime). We now actually use the job's
remembered process group identifier when we want the process group
identifier, instead of trying to guess which pid it happens to be
(which actually never took any guessing, it was, and is always the
pid of the first process created for the job). A couple of minor
fixes to how the pgrp is obtained, and used, accompany the changes
to use it when appropriate.
(the job number, given jp a pointer to a jobs table entry)
used open coded previously in many places (mostly in DEBUG mode
trace messages, so not included in most shells, but there are
a few others).
Make the type of JNUM() be int rather than the ptrdiff_t the
open coded version became ... which when used in some printf()
type function arg list was cast to some other arbitrary (but not
consistent) int type for which there is a standard %Xd type
format conversion. Now we can (and do) just use %d for this.
If the number of jobs ever exceeds the range of an int, we would
have far more serious problems than the broken output this would
cause.
While here improve a comment or two, and use JOBRUNNING instead
of 0 where the intent is the former (JOBRUNNING is #defined as 0).
NFCI.
28 years was more than enough for the useless 'continue' statement in
this do-while-0 "loop". Without the 'continue' statement, there is no
need for the "loop" anymore. The comment at its top was confusing since
the word 'while' suggested a loop, but there was none, so remove that as
well.
Pointed out by Tom Ivar Helbekkmo on source-changes-d.
No change to the resulting binary.
do setproctitle(NULL) (which is not the same thing at all). Do the
same with jobs -Z '' as setting the title to "sh: " isn't useful.
Improve the way this is documented, and note that it is only done
this way because zsh did it first (ie: pass on the balme, doing this
in the jobs command is simply absurd.)
Correct an issue found by Oguz <oguzismailuysal@gmail.com> and reported
in e-mail (on the bug-bash list initially!) with the code changed to deal
with PR bin/48875
With:
sh -c 'echo start at $SECONDS;
(sleep 3 & (sleep 1& wait) );
echo end at $SECONDS'
The shell should say "start at 0\nend at 1\n", but instead (before
this fix, in -9 and HEAD, but not -8) does "start at 0\nend at 3\n"
(Not in -8 as the 48875 changes were never pulled up)>
There was an old problem, fixed years ago, which cause the same symptom,
related to the way the jobs table was cleared (or not) in subshells, and
it seemed like that might have resurfaced.
But not so, the issue here is the sub-shell elimination, which was part
of the 48875 "fix" (not really, it wasn't really a bug, just sub-optimal
and unexpected behaviour).
What the shell actually has been running in this case is:
sh -c 'echo start at $SECONDS;
(sleep 3 & sleep 1& wait );
echo end at $SECONDS'
as the inner subshell was deemed unnecessary - all its parent would
do is wait for its exit status, and then exit with that status - we
may as well simply replace the current sub-shell with the new one,
let it do its thing, and we're done...
But not here, the running "sleep 3" will remain a child of that merged
sub-shell, and the "wait" will thus wait for it, along with the sleep 1
which is all it should be seeing.
For now, fix this by not eliminating a sub-shell if there are existing
unwaited upon children in the current one. It might be possible to
simply disregard the old child for the purposes of wait (and "jobs", etc,
all cmds which look at the jobs table) but the bookkeeping required to
make that work reliably is likely to take some time to get correct...
Along with this fix comes a fix to DEBUG mode shells, which, in situations
like this, could dump core in the debug code if the relevant tracing was
enabled, and add a new trace for when the jobs table is cleared (which was
added predating the discovery of the actual cause of this issue, but seems
worth keeping.) Neither of these changes have any effect on shells
compiled normally.
XXX pullup -9
instead of simply assuming that the pid of the first (leftmost) process
in a pipeline is the pgrp - someday we may switch things around and
create pipelines right to left instead, which has several advantages,
but which would invalidate the assumption which was being made here.
Also enhance some of the DEBUG mode trace output (nothing visible
in a normal shell build).
A couple of very minor code changes that no-one should ever notice
(eg: one less wait() call in the case that there is nothing pending).
children happens to exit while we are waiting for another child
to exit.
This can happen with code like
sh -c '
sleep 5 &
exec sh -c "sleep 10 & wait !$"
'
when the inner "sh" is waiting for the 10 second sleep to be
done, the 5 second sleep started earlier terminates. It is
a child of our process, as the inner shell is the same process
as the outer one, but not a known child (the inner shell has no
idea what the outer one did before it started).
This was observed in the wild by Martijn Dekker (where the outer
shell was bash but that's irrelevant).
XXX pullup -9
finding a job that had previously terminated.
Now in that case JOBWANTED is set on all jobs (since any will do)
which then simplifies a later test which no longer needs to special
case "wait -n". Further, we always look to see if any wanted
job has already terminated, even if there are still running jobs
we can wait upon - if anything is already ready, that's where we start
harvesting (and finish, if -n is specified).
%x commands) generate the most useful error message (from errno value)
rather than whichever happened last.
In posix mode, cause the "jobs" command to delete records of completed
jobs it reports on (as posix requires) as is done in interactive shells.
We don't (won't) do this in !posix mode, as the ability to throw in a
"jobs" command in a script to debug what is happening is too useful to
lose -- and any script that is relying on "jobs" instead of "wait" to
cleanup background processes (from the sh jobs table, sh always collects
zombies from the kernel) is absurd and not worth considering (besides
which I've never seen one).
No visible differences expected - there is a remote chance that
some internal lossage may no longer occur in interactive shells
that receive SIGINT (untrapped) at inopportune times, but you would
have had to have been very unlucky to have ever suffered from that.
to the main handler, rather than wherever the parent shell would go.
nb: not needed for vfork(), after vfork() we never go that path - which
is good or we'd be corrupting the parent's handler.
This allows the child to always exit (when it should) rather than being
caught up doing something else (and while it would eventually exit, the
status would be incorrect in some cases).
One test is:
sh -c 'trap "(! :) && echo BUG || echo nobug" EXIT'
from Martijn Dekker
Fix from FreeBSD (missed earlier).
XXX - 2b part of the 48875 pullup to -8
the option when a pipeline is created that controls the way the exit
status of the pipeline is calculated. Previously it was the state of
the option when the exit status of the pipeline was collected.
This makes no difference at all for foreground pipelines (there is
no way to change the option between starting and completing the
pipeline) but it does for asynchronous (background) pipelines.
This was always the right way to implement it - it was originally
done the other way as I could not find any other shell implemented
this way - they all seemed to do it our previous way, and I could
not see a good reason to be the sole different shell.
However, now I know that ksh93 works as we will now work, and I
am told that if the option is added to the FreeBSD shell (apparently
the code exists, uncommitted) it will be the same.
Save more characters of command in non-interactive jobs, in case of
core dumps and similar (16 effective chars was a few too little).
Arrange for number to increase if command buffer size increases.
of uninit'd field also fix a couple more (still harmless) related
technical C usage bugs.
Explaining why these issues were harmless would take too long to include here.
-p var: set var to identifier, from arg list, or PID if no job args)
of the job for which status is returned (becomes $? after wait.)
Note: var is unset if the status returned from wait came from wait
itself rather than from some job exiting (so it is now possible to
tell whether 127 means "no such job" or "job did exit(127)", and
whether $? > 128 means "wait was interrupted" or "job was killed
by a signal or did exit(>128)". ($? is too limited to to allow
indicating whether the job died with a signal, or exited with a
status such that it looks like it did...)
process group (-g), the process leader pid (-p) ($! if the job was &'d)
and the job identifier (-j) (the %n that refers to the job) in addition to
(default) the list of all pids in the job (which it has always done).
No change to the (single) "job" arg, which is a specifier of the job:
the process leader pid, or one of the % forms, and defaults to %% (aka %+).
(This is all now documented in sh(1))
Also document the jobs command properly (no change to the command, just
document what it actually is.)
And while here, a whole new section in sh(1) "Job Control". It probably
needs better wording, but this is (perhaps) better than the nothing that
was there before.
Don't delete jobs from the jobs table merely because they finished,
if they are not the job we are waiting upon. (bin/52640 part 1)
In a sub-shell environment, don't allow wait to find jobs from the
parent shell that had already exited (before the sub-shell was
created) and return status for them as if they are our children.
(bin/52640 part 2)
Don't have the "jobs" command also be an implicit "wait" command
in non-interactive shells. (bin/52641)
Use WCONTINUED (when it exists) so we can report on stopped jobs that
"mysteriously" move back to running state without the user issuing
a "bg" command (eg: kill -CONT <pid>) Previously they would keep
being reported as stopped until they exited.
When a job is detected as having changed status just as we're
issuing a "jobs" command (i.e.: the change occurred between the last
prompt and the jobs command being entered) don't report it twice,
once from the status change, and then again in the jobs command
output. Once is enough (keep the jobs output, suppress the other).
Apply some sanity to the way jobs_invalid is processed - ignore it
in getjob() instead of just ignoring it most of the time there, and
instead always check it before calling getjob() in situations where
we can handle only children of the current shell. This allows the
(totally broken) save/clear/restore of jobs_invalid in jobscmd() to
be done away with (previously an error while in the clear state would
have left jobs_invalid incorrectly cleared - shouldn't have mattered
since jobs_invalid => subshell => error causes exit, but better to be safe).
Add/improve the DEBUG more tracing.
XXX pullup -8
code duplication, and reducing the size of /bin/sh by a trivial amount.
NFCI.
This is being done now as there are two other changes forthcoming, both
of which benefit - one would result in even more code duplication without
this, the other might need to alter how this is done, and doing it after this
means there's just one place to change (if required).
to cause (when set, which it is not by default) the exit status of a
pipe to be 0 iff all commands in the pipe exited with status 0, and
otherwise, the status of the rightmost command to exit with a non-0
status.
In the doc, while describing this, also reword some of the text about
commands in general, how they are structured, and when they are executed.
the LINENO hack, and uses the LINENO var for both ${LINENO} and $((LINENO)).
(Code to invert the LINENO hack when required, like when de-compiling the
execution tree to provide the "jobs" command strings, is still included,
that can be deleted when the LINENO hack is completely removed - look for
refs to VSLINENO throughout the code. The var funclinno in parser.c can
also be removed, it is used only for the LINENO hack.)
This version produces accurate results: $((LINENO)) was made as accurate
as the LINENO hack made ${LINENO} which is very good. That's why the
LINENO hack is not yet completely removed, so it can be easily re-enabled.
If you can tell the difference when it is in use, or not in use, then
something has broken (or I managed to miss a case somewhere.)
The way that LINENO works is documented in its own (new) section in the
man page, so nothing more about that, or the new options, etc, here.
This version introduces the possibility of having a "reference" function
associated with a variable, which gets called whenever the value of the
variable is required (that's what implements LINENO). There is just
one function pointer however, so any particular variable gets at most
one of the set function (as used for PATH, etc) or the reference function.
The VFUNCREF bit in the var flags indicates which func the variable in
question uses (if any - the func ptr, as before, can be NULL).
I would not call the results of this perfect yet, but it is close.
levels for debug output. This change accidentally omitted earlier (only
effect is incorrect nesting levels shown in trace output when the option
to show them is enabled.) NFC for any normal shell build.
one of the words happens to contain ${#var}. (This is the command
string shown by the "jobs" command, and when a background job completes)
While here, undo the LINENO hack when building that string.
And one ot two other foibles...
! ! pipeline
(And for now the other places where ! is permitted)
we should at least generate the logically correct exit
status:
! ! (exit 5); echo $?
should print 1, not 5. ksh and bosh do it this way - and it makes sense.
bash and the FreeBSD sh echo "5" (as did we until now.)
dash, zsh, yash all enforce the standard syntax, and prohibit this.
which causes fall through the to command list of the following pattern
(wuthout evaluating that pattern). This has been approved for inclusion
in the next major version of the POSIX standard (Issue 8), and is
implemented by most other shells.
Now all form a circle and together attempt to summon the great wizd
in the hopes that his magic spells can transform the poor attempt
at documenting this feature into something rational...
"jobs" output (or other places where the cmd string is shown - like
when reporting status when a background job completes.)
Without this fix, try
! sleep 5 &
jobs
wait
and try not to wonder at the '???" that appears instead of "! sleep 5"
own purposes, and move them elsewhere whenever a user redirection
happens to pick the same number. With this we can move the shell
file descriptors back to lower values (be slightly kinder to the kernel)
since we can no longer clash. (Also get rid of a little old unneeded code.)
This also completes the fdflags command, which no longer permits access
to (by way or either obtaining, or changing) the shell's internal fds.
Also, move (most of) the shell's internal use fd's to much
higher values (depending upon what ulimit -n allows) so they
are less likely to clash with user supplied fd numbers. A future
patch will (hopefully) avoid this problem completely by dynamically
moving the shell's internal fds around as needed. (From kre@)