Strange Behavior with Perl NFS Lock

I am experiencing weird behavior from File::NFSLock in Perl v5.16. I am using the stale lock timeout option set to 5 minutes. Let's say I have three processes. One of them took more than 5 minutes before releasing the lock, and process 2 then acquired it. However, even though process 2 has held the lock for less than 5 minutes, the 3rd process comes along and removes the lock file, causing the 2nd process to fail when it tries to remove the NFSLock it holds itself.

My theory is that process 3 wrongly reads the last-modified time of the lock as the one written by process 1, not by process 2. I am writing the NFS lock on partitions mounted over NFS.

Does anyone have an idea, or has anyone faced a similar issue with Perl's File::NFSLock? Please refer to the snippet below.

use Fcntl qw(LOCK_EX);    # exports the LOCK_EX constant
use File::NFSLock;

my $lock = new File::NFSLock {file               => $file,
                              lock_type          => LOCK_EX,
                              blocking_timeout   => 50,      # 50 sec
                              stale_lock_timeout => 5 * 60}; # 5 min

$DB::single = 1;    # debugger breakpoint
if ($lock) {
    $lock->unlock();
}

If I hold process 1 at the debugger breakpoint for more than 5 minutes, I observe this behavior.

1 Answer

Answered by Bodo Hugo Barwich:

From reviewing the code at
https://metacpan.org/pod/File::NFSLock
I see that the lock is implemented simply as a physical file on the filesystem.
I use this same process-lock logic in almost every project.

With a process lock it is crucial not to set the stale_lock_timeout too tight.
Otherwise a "race condition" will occur, as is also mentioned in the in-code comments.

As you mentioned, the 3 processes start to compete over the same lock because the job takes more than 5 minutes while you set the stale_lock_timeout to 5 minutes.
If you have a fixed-interval trigger like the crond service, it will launch a process every 5 minutes. Each new process will treat the lock as outdated, because 5 minutes have already passed, even though the running process needs more than 5 minutes.
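For illustration, such a fixed 5-minute trigger would be a crontab entry like the following (the script path is a hypothetical placeholder, not taken from the question):

```
# hypothetical crontab entry: launch the job every 5 minutes
*/5 * * * * /path/to/db_job.pl
```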

To describe a possible scenario:
Some DB job takes 4 minutes to complete, but on a congested system it can take 7 minutes or more.
Now suppose the crond service launches a process every 5 minutes.
At 0 min the first process, process1, finds the job as new, sets the lock and starts the job, which will take up to 7 minutes.
At 5 min crond launches process2, which finds the lock of process1 but decides it is stale, because 5 minutes have already passed since the lock was created. So process2 removes the lock and re-acquires it for itself.
At 7 min process1 has finished the job and, without checking whether the lock is still its own, releases the lock of process2 and exits.
At 10 min process3 is launched, finds no lock at all (the lock of process2 was already released by process1) and sets its own lock.
This scenario is really problematic, because it leads to process accumulation, workload accumulation and unpredictable results.
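One way to avoid process1 releasing the lock of process2 is to record ownership explicitly and check it before unlocking. File::NFSLock does not expose such a check itself, so this is only a sketch using a hypothetical sidecar file that stores the owner's PID; the file paths are illustrative assumptions:

```perl
use strict;
use warnings;

my $file       = '/tmp/myjob.lock';   # hypothetical lock file path
my $owner_file = "$file.owner";       # hypothetical sidecar ownership marker

# Record our PID as the current lock owner (call right after acquiring the lock).
sub claim_ownership {
    open my $fh, '>', $owner_file or die "cannot write $owner_file: $!";
    print {$fh} $$;
    close $fh;
}

# Return true only if this process is still the recorded owner.
sub still_owner {
    open my $fh, '<', $owner_file or return 0;
    my $pid = <$fh>;
    close $fh;
    return defined $pid && $pid == $$;
}

# After the job finishes, unlock only if the lock is still ours:
# $lock->unlock() if still_owner();
```

This does not remove the race entirely, but it prevents a slow process from deleting a lock that a newer process has already taken over.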

The suggestions to fix this issue are:

  1. Set stale_lock_timeout to an amount far bigger than what the job could take (like 10 min or 15 min). The stale_lock_timeout must be bigger than the execution schedule interval.
  2. Make the execution schedule more spacious, to give each process enough time to finish its task (every 10 min or every 15 min).
  3. Consider integrating the jobs of process1, process2 and process3 into one single process_master which launches each process after the former ones have finished.
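Applying suggestions 1 and 2, the lock from the question could be created with a stale timeout well above the worst-case job duration. The file path and the concrete timeout values below are illustrative assumptions, not values prescribed by the module:

```perl
use strict;
use warnings;
use Fcntl qw(LOCK_EX);
use File::NFSLock;

my $file = '/tmp/myjob.lock';    # illustrative lock file path

# Assuming the job takes ~7 min in the worst case and is scheduled
# every 10 min, only treat the lock as stale after 15 min.
my $lock = File::NFSLock->new({
    file               => $file,
    lock_type          => LOCK_EX,
    blocking_timeout   => 50,        # wait up to 50 sec to acquire the lock
    stale_lock_timeout => 15 * 60,   # 15 min, above the worst-case job time
});

if ($lock) {
    # ... run the job ...
    $lock->unlock();
}
```

With the stale timeout above both the schedule interval and the worst-case run time, a newly launched process blocks or gives up instead of stealing a lock that is still legitimately held.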