A bash script for replacing patterns in multiple file names based on a 2-column mapping file

I have a bunch of files with mixed IDs in a directory (Linux environment). They look like this:

SRR7821874_1.fastq.gz
SRR7821874_2.fastq.gz
SRR7821870_1.fastq.gz
SRR7821870_2.fastq.gz

I also have a 2-column tab-delimited file (called rename.tsv) based on which I try to replace IDs:

Read        Sample
SRR7821874  GSM3385663
SRR7821870  GSM3385659

At the same time, I would like to change _1 to _S1_L001_R1_001 and _2 to _S1_L001_R2_001 in the file names, so the final result should look like this:

SRR7821874_1.fastq.gz --> GSM3385663_S1_L001_R1_001.fastq.gz
SRR7821874_2.fastq.gz --> GSM3385663_S1_L001_R2_001.fastq.gz
SRR7821870_1.fastq.gz --> GSM3385659_S1_L001_R1_001.fastq.gz
SRR7821870_2.fastq.gz --> GSM3385659_S1_L001_R2_001.fastq.gz   

I tried the following loop for the ID replacement part alone, without success, apparently because mv needs the full file names rather than just the IDs:

while read -r Read Sample; do mv "$Read" "$Sample"; done < rename.tsv

There are 2 answers

Renaud Pacalet (best answer)

You can try:

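# Skip the header, then read one tab-separated "old ID / new ID" pair per line.
tail -n+2 rename.tsv | while IFS=$'\t' read -r from to; do
  shopt -s nullglob                    # an unmatched pattern expands to nothing
  for f in "${from}_"*.fastq.gz; do
    num="${f##*_}"; num="${num%%.*}"   # SRR7821874_1.fastq.gz -> 1
    mv "$f" "${to}_S1_L001_R${num}_001.fastq.gz"
  done
done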

We use tail to skip the header line, and we enable the nullglob bash option so that "${from}_"*.fastq.gz expands to nothing, instead of to the pattern itself, when no file matches. Because the while loop is part of a pipeline, bash runs it in a subshell by default, so the nullglob setting does not persist in the calling shell once the pipeline ends.
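
To see what nullglob changes, here is a minimal sketch that runs the same kind of loop in an empty scratch directory, first with the option off and then with it enabled inside a subshell. The ID SRR0000000 and the scratch directory are made up for the illustration and are not part of the question.

```shell
#!/usr/bin/env bash
# Scratch directory that contains no files at all (illustration only).
tmpdir=$(mktemp -d)

shopt -u nullglob                                # start from bash's default
without=0
for f in "$tmpdir/SRR0000000_"*.fastq.gz; do     # hypothetical ID, no match
    without=$((without + 1))                     # runs once: f is the literal pattern
done

# Enable nullglob inside a subshell, like the while loop of the pipeline.
with=$(
    shopt -s nullglob
    n=0
    for f in "$tmpdir/SRR0000000_"*.fastq.gz; do
        n=$((n + 1))                             # never runs: nothing to iterate over
    done
    echo "$n"
)

# The option set in the subshell did not reach the parent shell.
if shopt -q nullglob; then state=on; else state=off; fi

echo "without nullglob: $without iteration(s)"
echo "with nullglob:    $with iteration(s)"
echo "nullglob in parent shell: $state"

rmdir "$tmpdir"
```

Without nullglob the loop body runs once with the unexpanded pattern, which in the renaming script would make mv fail on a file that does not exist.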

"${f##*_}" and "${num%%.*}" are two of the numerous bash parameter expansions.

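As a small illustration of these two expansions, the sketch below applies them step by step to one of the file names from the question. The variable name rest is only used here to show the intermediate value.

```shell
f="SRR7821874_1.fastq.gz"

# ${f##*_} removes the longest prefix matching "*_",
# i.e. everything up to and including the last underscore.
rest="${f##*_}"

# ${rest%%.*} removes the longest suffix matching ".*",
# i.e. everything from the first dot onwards.
num="${rest%%.*}"

echo "$rest"   # prints: 1.fastq.gz
echo "$num"    # prints: 1
```

The doubled operators (## and %%) ask for the longest match; the single forms (# and %) would remove the shortest match instead.
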
Note that you can use a more accurate pattern if needed. For instance, if you know that the number is always 1 or 2 you could replace "${from}_"*.fastq.gz with "${from}_"[12].fastq.gz. Or, if it is any one-digit number: "${from}_"[0-9].fastq.gz.
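
The difference between these patterns can be checked with a quick sketch in a scratch directory. The extra files ending in _10 and _trimmed are hypothetical and only serve to show what the looser pattern would also pick up; count is a helper defined here for the illustration.

```shell
# Scratch directory with two expected files and two hypothetical extras.
tmpdir=$(mktemp -d)
for suffix in 1 2 10 trimmed; do
    : > "$tmpdir/SRR7821874_${suffix}.fastq.gz"
done
from="$tmpdir/SRR7821874"

# Helper for this example: print how many arguments (matched files) it received.
count() { echo "$#"; }

any=$(count "${from}_"*.fastq.gz)            # matches all four files
one_or_two=$(count "${from}_"[12].fastq.gz)  # matches only _1 and _2
one_digit=$(count "${from}_"[0-9].fastq.gz)  # one digit only, so _10 is excluded

echo "*: $any, [12]: $one_or_two, [0-9]: $one_digit"

rm -r "$tmpdir"
```

With the bracket patterns, unexpected files are simply skipped instead of being renamed to something unintended.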

Ed Morton

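You could use awk to create the mapping of old to new file names and then loop through that. The echo in front of mv makes this a dry run that only prints the commands; remove it once the output looks right: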

$ cat tst.sh
#!/usr/bin/env bash

while IFS=$'\t' read -r old new; do
    echo mv -- "$old" "$new"
done < <(
    shopt -s nullglob
    sfx=".fastq.gz"
    awk -v sfx="$sfx" '
        NR == FNR {
            map[$1] = $2
            next
        }
        $1 in map {
            print $0, map[$1] "_S1_L001_R" $2+0 "_001" sfx
        }
    ' OFS='\t' FS='\t' rename.tsv FS='_' <(printf '%s\n' *"$sfx")
)

$ ./tst.sh
mv -- SRR7821870_1.fastq.gz GSM3385659_S1_L001_R1_001.fastq.gz
mv -- SRR7821870_2.fastq.gz GSM3385659_S1_L001_R2_001.fastq.gz
mv -- SRR7821874_1.fastq.gz GSM3385663_S1_L001_R1_001.fastq.gz
mv -- SRR7821874_2.fastq.gz GSM3385663_S1_L001_R2_001.fastq.gz
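
The script above does not spell out how the read number is extracted, so here is a minimal sketch of that step in isolation. With "_" as the field separator, $2 of a file name is the text after the underscore, and adding 0 makes awk convert it to a number using only its leading digits. The parentheses are added here for readability and are not in the script.

```shell
name='SRR7821874_1.fastq.gz'

# -F'_' splits the name at the underscore: $1 is the ID, $2 is "1.fastq.gz".
# ($2 + 0) forces a numeric conversion, which keeps only the leading "1".
out=$(printf '%s\n' "$name" | awk -F'_' '{ print $1 ": " $2 " -> " ($2 + 0) }')

echo "$out"   # prints: SRR7821874: 1.fastq.gz -> 1
```

In the full script the first input, rename.tsv, is read with FS set to a tab to fill the map array, and FS is switched to "_" before the list of file names is read.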