Why may USR1 signals sent from background jobs in a Bash script not be reliably received by the parent shell process waiting for their completion?

Question

Asked by ib. on 29 December 2020 at 07:48 (601 views)

I have a Bash script running a bunch of background jobs in parallel. Under certain conditions, before a background job completes, it sends a USR1 signal to the spawning Bash process (say, to inform that some process that was run as a part of the job had terminated with a nonzero exit code).

In a simplified form, the script is equivalent to the one shown below. Here, for simplicity, each background job always sends a USR1 signal before completion, unconditionally (via the signalparent() function).

signalparent() { kill -USR1 $$; }
handlesignal() { echo 'USR1 signal caught' >&2; }
trap handlesignal USR1

for i in {1..10}; do
    {
        sleep 1
        echo "job $i finished" >&2
        signalparent
    } &
done
wait

When I run the above script (using Bash 3.2.57 on macOS 11.1, at least), I observe some behavior that I cannot explain, which makes me think that there is something in the interplay of Bash job management and signal trapping that I overlook.

Specifically, I would like an explanation for the following two behaviors.

1. Almost always, when I run the script, I see fewer “signal caught” lines in the output (from the handlesignal() function) than there are jobs started in the for-loop. Most of the time, only one to four of those lines are printed for the ten jobs started.

   Why is it that, by the time the wait call completes, there are still background jobs whose signaling kill commands have not yet been executed?

2. At the same time, every so often, in some invocations of the script, I observe the kill command (from the signalparent() function) report an error saying that the originating process running the script (i.e., the one with the $$ PID) is no longer present; see the output below.

   How come there are jobs whose signaling kill commands are still running while the parent shell process has already terminated? It was my understanding that it is impossible for the parent process to terminate before all background jobs do, due to the wait call.

```
job 2 finished
job 3 finished
job 5 finished
job 4 finished
job 1 finished
job 6 finished
USR1 signal caught
USR1 signal caught
job 10 finished
job 7 finished
job 8 finished
job 9 finished
bash: line 3: kill: (19207) - No such process
bash: line 3: kill: (19207) - No such process
bash: line 3: kill: (19207) - No such process
bash: line 3: kill: (19207) - No such process
```

Both of these behaviors suggest to me that a race condition of some kind is present, but I do not quite understand where it comes from. I would appreciate it if anyone could enlighten me on this, and perhaps even suggest how the script could be changed to avoid such race conditions.

There are 2 answers

Answered by pynexj on 29 December 2020 at 09:10

Regarding “Almost always, when I run the script, I see fewer ‘signal caught’ lines in the output”:

According to signal(7):

Standard signals do not queue. If multiple instances of a standard signal are generated while that signal is blocked, then only one instance of the signal is marked as pending (and the signal will be delivered just once when it is unblocked).

One way to change your script so that the signals do not arrive at the same time is as follows:

signalparent() {
    kill -USR1 $$
}

ncaught=0
handlesignal() {
    (( ++ncaught ))
    echo "USR1 signal caught (#=$ncaught)" >&2
}
trap handlesignal USR1

for i in {1..10}; do
    {
        sleep $i
        signalparent
    } &
done

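# A bare wait returns >128 when a trapped signal interrupts it and 0 once
# no jobs remain, so this loop keeps re-waiting until every job has exited.
nwaited=0
while (( nwaited < 10 )); do
    wait && (( ++nwaited ))
done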

Here is the output of the modified script with Bash 5.1 on macOS 10.15:

USR1 signal caught (#=1)
USR1 signal caught (#=2)
USR1 signal caught (#=3)
USR1 signal caught (#=4)
USR1 signal caught (#=5)
USR1 signal caught (#=6)
USR1 signal caught (#=7)
USR1 signal caught (#=8)
USR1 signal caught (#=9)
USR1 signal caught (#=10)
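
Staggering the sleeps only makes it unlikely that two signals are pending at once; it does not remove the limitation quoted from signal(7). If the goal is simply to learn which jobs failed, one alternative (not taken from either answer) is to drop the signals and collect each job's exit status by passing its PID to wait. The following is a minimal sketch under that assumption; the names pids and failed, and the rule that even-numbered jobs fail, are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: report job failure through exit status instead of USR1.
# "pids", "failed", and the even/odd failure rule are hypothetical.

pids=()
for i in 1 2 3 4 5; do
    {
        sleep 0.1
        (( i % 2 ))      # stand-in for real work: fails when i is even
    } &
    pids+=("$!")         # remember the PID of the job just started
done

failed=0
for pid in "${pids[@]}"; do
    # wait PID returns the exit status of that particular job
    wait "$pid" || (( ++failed ))
done

echo "$failed job(s) failed" >&2
```

Because every job is waited for individually, each one reports exactly one status, however closely together the jobs finish.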

**oguz ismail** · Accepted Answer · 2020-12-29T08:11:49+00:00

This is explained in the Bash Reference Manual as follows.

When bash is waiting for an asynchronous command via the wait builtin, the reception of a signal for which a trap has been set will cause the wait builtin to return immediately with an exit status greater than 128, immediately after which the trap is executed.

So, you need to repeat wait until it returns 0 to make sure all background jobs have terminated, e.g.:

until wait; do
    :
done
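
Applied to the script from the question, the loop might look like the sketch below. The counter ncaught is borrowed from the other answer, and the job count and sleep duration are arbitrary choices. Note that ncaught may still end up lower than the number of jobs, since signals that arrive together can be merged; the loop only ensures that the script does not exit before its jobs do.

```shell
#!/usr/bin/env bash
# Sketch: the question's script with wait repeated until it returns 0.
# "ncaught" and the timing values are illustrative assumptions.

signalparent() { kill -USR1 "$$"; }

ncaught=0
handlesignal() { (( ++ncaught )); }
trap handlesignal USR1

for i in 1 2 3 4 5; do
    {
        sleep 0.2
        signalparent
    } &
done

# Each trapped USR1 makes wait return early with a status above 128;
# keep calling it until it returns 0, i.e. until no jobs are left.
until wait; do
    :
done

echo "all jobs finished; handler ran $ncaught time(s)" >&2
```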

Regarding “It was my understanding that it is impossible for the parent process to terminate before all background jobs do, due to the wait call”:

That is a misunderstanding. wait may return because a signal with a trap set was received while jobs are still running in the background. The script can then run to its normal end and exit, leaving those jobs orphaned; when they later call kill, the parent PID no longer exists, which produces the “No such process” errors.
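
The early return can be observed directly by recording the exit status of two consecutive wait calls. This is a small sketch with arbitrary timings; the variables first and second are names introduced here. The first status should be 128 plus the signal number of USR1, which differs between platforms, so no exact value is shown.

```shell
#!/usr/bin/env bash
# Sketch: show that a trapped signal ends wait while the job still runs.
# "first" and "second" are illustrative names; timings are arbitrary.

handlesignal() { echo 'USR1 signal caught' >&2; }
trap handlesignal USR1

# A single job that signals the parent and then keeps running.
{
    sleep 0.2
    kill -USR1 "$$"
    sleep 0.5
} &

first=0
wait || first=$?      # interrupted by USR1: status is greater than 128

second=0
wait || second=$?     # the job has really finished now: status is 0

echo "first wait: $first, second wait: $second" >&2
```

Had the script stopped after the first wait, the job would still have been in its second sleep when the parent exited.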
