No output when using HPC job scheduler with GNU parallel

I'm trying to submit a bash script to an HPC job scheduler (IBM Load Sharing Facility) and use GNU parallel to speed up execution at the same time. Basically, I need to run samtools to subset a large number of .bam files with a fixed set of regions. The operations on a single file don't take too long, but I'm looking to reduce the overall run time.

This is an example of my workflow, with the job parameters used:

### /some/fixed/directory/myscript.sh ###

#!/usr/bin/env bash

#BSUB -J my_job_name
#BSUB -q hpc_queue
#BSUB -n 16
#BSUB -R "rusage[mem=8GB] span[hosts=1]"
#BSUB -o logs/out.%J.log
#BSUB -e logs/err.%J.log

basedir="/some/fixed/directory"

# absolute paths to a large number of .bam files
# none of them in the same directory, so no wildcards
input_bam_files="${basedir}/input_bam_files.txt"

# used across all input files
ref_genome_file="${basedir}/ref_genome.fa"
regions_file="${basedir}/required_chr_start_end.bed"

# fixed version 1.16 on HPC
module load samtools

mkdir -p "${basedir}/output"

extract_regions_from_bam() {
    input_bam_file="$1"
    output_fn=$(basename "$input_bam_file")
    output_bam_file="${basedir}/output/${output_fn}"
    echo "Start: ${input_bam_file} -> ${output_bam_file}"
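    # Subset to the fixed regions, then sort; the temp-file prefix is quoted in case the path contains spaces
    samtools view --with-header --reference "$ref_genome_file" --regions-file "$regions_file" "$input_bam_file" \
        | samtools sort -T "${basedir}/" -O bam -o "$output_bam_file"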
    samtools index "$output_bam_file"
    echo "End: ${input_bam_file} -> ${output_bam_file}"
}

export -f extract_regions_from_bam

cat "$input_bam_files" | parallel -j 8 --joblog "${basedir}/extract_regions_from_bam.log" extract_regions_from_bam {}
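
One thing worth checking in the script above: GNU parallel runs each job in a new shell, and `export -f` exports only the function, not the plain variables it reads (`basedir`, `ref_genome_file`, `regions_file`). The sketch below is a minimal illustration of that behaviour, using a child `bash -c` as a stand-in for the shell that parallel starts. `show_basedir` is a hypothetical helper, not part of the original script, and this may or may not be the cause of the missing output.

```bash
#!/usr/bin/env bash

# Plain assignment: visible in this shell, but not passed to child processes
basedir="/some/fixed/directory"

# Hypothetical helper that reads the variable, like extract_regions_from_bam does
show_basedir() {
    echo "basedir=[${basedir}]"
}
export -f show_basedir

# The child bash sees the exported function, but not the unexported variable
without_export=$(bash -c 'show_basedir')

# After exporting the variable, the child bash sees both
export basedir
with_export=$(bash -c 'show_basedir')

echo "without export: ${without_export}"
echo "with export:    ${with_export}"
```

If this turns out to be relevant, adding `export basedir ref_genome_file regions_file` before the `parallel` call would be one way to make the values visible inside the function.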

Then I simply submit this using bsub < /some/fixed/directory/myscript.sh

I've tested these components separately with a small subset of input files: GNU parallel alone on the local machine, and the job scheduler alone without GNU parallel. Each works fine on its own, yet no output files are produced when I combine them.
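
To narrow down where the combined run diverges from the separate tests, a few diagnostic lines could be placed near the top of the job script, after `module load samtools`. This is only a sketch: `report_tools` is a hypothetical helper, and the assumption being tested is that `parallel` and `samtools` might not be on `PATH` in the batch environment even though they are available interactively.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print whether each named command is on PATH.
# Returns the number of commands that were not found.
report_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: ${tool} -> $(command -v "$tool")"
        else
            echo "MISSING: ${tool}"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Record where the job actually runs; relative paths such as logs/ resolve against this directory
echo "host: $(uname -n)"
echo "cwd:  $(pwd)"

# Check the two tools the script depends on
report_tools parallel samtools || echo "some tools are not on PATH inside the job"
```

The output would end up in the file named by `#BSUB -o`, which also makes it easy to confirm that the `logs/` directory exists relative to the directory the job was submitted from.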

Am I missing something here, in terms of making the HPC recognise the GNU parallel execution? Or is this simply not possible?

Any help would be greatly appreciated, thank you in advance!

P.S: I realise I could use workflow programs like Nextflow or Snakemake, but at this time I'm required to stick to using bash.
