How to filter a fastq file grepping a degenerate pattern?

226 views Asked by At

I would like to filter a fastq file in order to output only sequences that present a specific pattern ("..........CAA.....GTGG..........", the dot corresponds to a whatever nucleotide A,C,G,T) with its related quality.

I.e. Input file

@Reads1
AGCATTTGATATCAAATTTGGTGGATTGGTGTTGTGG
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@Reads2
ATTATCACCAGGGCAACAAAAGTGGCCATGCATTGAGA
+
AAFFFJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJ
@Reads3
ATTATCAAAAAAAAACCCTTGGTGGCCATGCATTGAGA
+
AAFFFJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJ

Output file:

@Reads1
CATTTGATATCAAATTTGGTGGATTGGTGTTG
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@Reads2
ATCACCAGGGCAACAAAAGTGGCCATGCATTG
+
FFJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJ
2

There are 2 answers

1
kvantour On

There are a plethora of tools available that are specifically tuned to processing FASTQ or FASTA files. However, here is a standard awk program.

We assume the following convention for the FASTQ format.

A FASTQ record has four line-separated fields per sequence:

  • Field 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
  • Field 2 is the raw sequence letters.
  • Field 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
  • Field 4 encodes the quality values for the sequence in Field 2, and must contain the same number of symbols as letters in the sequence.

We can use awk to recreate this:

awk 'BEGIN{name=1;seq=2;comm=3;qual=0}
     { a[FNR%4]=$0; if (FNR%4) next }
     # At this point we have the full record
     # Perform sequence check and print if accepted:
     match(a[seq],"..........CAA.....GTGG..........") { 
        print a[1]; print substr(a[2],RSTART,RLENGTH);
        print a[3]; print substr(a[0],RSTART,RLENGTH)
     }' file.fastq
0
Kaz On

Solution in TXR.

$ txr filter.txr data
@Reads1
CATTTGATATCAAATTTGGTGGATTGGTGTTG
+
FFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@Reads2
ATCACCAGGGCAACAAAAGTGGCCATGCATTG
+
FFJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJ

Note that, based on the assumption that the corresponding subsequence of the second nucleotide sequence in each section is to be extracted, there is a discrepancy between the value we calculate and the expected data:

$ diff <(txr filter.txr data) expected
4c4
< FFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
---
> AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJ

It's also not clear whether multiple matches are expected, with the second sequence being trimmed accordingly.

Code:

@(repeat)
@@@heading
@before@{match /..........CAA.....GTGG........../}@nil
+
@other
@  (output)
@@@heading
@match
+
@{other [(len before)..(+ (len before) (len match))]}
@  (end)
@(end)