I'm facing a strange race condition in my bash program. I tried duplicating it via a simple enough demo program but, obviously, as true for all/most timing-related race demonstration attempts, I couldn't.
Here's an abstracted version of the program that DOES NOT duplicate the issue, but let me still explain:
# Abstracted version of the original program
# that is NOT able to demo the race.
#
function foo() {
local instance=$1
# [A lot of logic here -
# all foreground commands, nothing in the background.]
echo "$instance: test" > /tmp/foo.$instance.log
echo "Instance $instance ended"
}
# Launch the process in background...
#
echo "Launching instance 1"
foo 1 &
# ... and wait for it to complete.
#
echo "Waiting..."
wait
echo "Waiting... done. (wait exited with: $?)"
# This ls command ALWAYS fails in the real
# program in the 1st while-iteration, complaining about
# missing files, but works in the 2nd iteration!
#
# It always works in the very 1st while-iteration of the
# abstracted version.
#
while ! ls -l /tmp/foo.*; do
:
done
In my original program (and NOT in the above abstracted version), I do see Waiting... done. (wait exited with: 0) on stdout, just as I see in the above version. Yet, the ls -l always fails in the original, but always works in the above abstracted version in the very first while loop iteration.
Also, the ls command fails despite seeing the Instance 1 ended message on stdout. The output is:
$ ./myProgram
Launching instance 1
Waiting...
Waiting... done. (wait exited with: 0)
Instance 1 ended
ls: cannot access '/tmp/foo.*': No such file or directory
/tmp/foo.1
$
I noticed that the while loop can be safely done away with if I put a sleep 1 right before ls in my original program, like so:
# This too works in the original program:
sleep 1
ls -l /tmp/foo.*
Question: Why isn't wait working as expected in my original program? Any suggestions to at least help troubleshoot the problem?
I'm using bash 4.4.19 on Ubuntu 18.04.
EDIT: I just also verified that the call to wait in the original, failing program is exiting with a status code of 0.
EDIT 2: Shouldn't the Instance 1 ended message appear BEFORE Waiting... done. (wait exited with: 0)? Could this be a 'flushing problem' with OS' disk-buffer/cache when dealing with background processes in bash?
EDIT 3: If instead of the while loop or sleep 1 hacks, I issue a sync command, then, voila, it works! But why should I have to do a sync in one program but the other?
I noticed that each the following three hacks work, but not quite sure why:
Hack 1
Hack 2
Hack 3
Could this be a 'flushing problem' with OS' disk-buffer/cache, especially when dealing with background processes, especially in
bash? In other words, the call towaitseems to returning BEFORE it flushes the diskcache (or, BEFORE the OS, on its own realizes and, is done flushing the diskcache).EDIT Thanks to @Jon, his was a very close guess and got me thinking in the right direction, along with the age-old, bit-wise tweaking advice from @chepner.
The Real Problem: I was starting
foo, not directly/plainly as shown in my inaccurate abstracted version in my original question, but via anotherlaunchThreadfunction that, after doing some bookkeeping, would also sayfoo 1 &in its body. And the call tolaunchThreadwas itself suffixed with an&! So, mywaitwas really waiting onlaunchThreadand not onfoo! Thesleep,sync, andwhilejust were helping buy more time forfooto complete, which is why introducing them worked. The following is a more accurate demonstration of the problem, even though you may or may not be able to duplicate it on your own system (due to scheduling/timing variance across systems):Output of the above buggy program:
If I remove the
&from EITHERLine 2OR fromLine 3, the program works correctly, with the following as output:The program also works correctly if I remove the
$(now)part fromLine 1.