I would like to filter a FASTQ file so that only the sequences matching a specific pattern are output, together with their corresponding quality values. The pattern is "..........CAA.....GTGG..........", where each dot stands for any nucleotide (A, C, G or T).
For example, input file:
@Reads1
AGCATTTGATATCAAATTTGGTGGATTGGTGTTGTGG
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@Reads2
ATTATCACCAGGGCAACAAAAGTGGCCATGCATTGAGA
+
AAFFFJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJ
@Reads3
ATTATCAAAAAAAAACCCTTGGTGGCCATGCATTGAGA
+
AAFFFJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJ
Output file:
@Reads1
CATTTGATATCAAATTTGGTGGATTGGTGTTG
+
FFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@Reads2
ATCACCAGGGCAACAAAAGTGGCCATGCATTG
+
FFJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJ
There is a plethora of tools available that are specifically tuned for processing FASTQ or FASTA files. However, here is a standard awk program.
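We assume the common four-line convention for the FASTQ format: a header line starting with @, the sequence on a single line, a separator line starting with +, and the quality string on a single line.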
We can use awk to extract the matching region of each read together with the corresponding part of its quality string:
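A minimal sketch of such an awk program, under the four-lines-per-record assumption. The wrapper name `filter_fastq` and the variable names are only illustrative. Each dot of the pattern is expanded to `[ACGT]`, so a read with an `N` at one of those positions will not match, and only the leftmost match in each read is kept.

```shell
# Hypothetical helper: keep only reads matching the pattern, trimmed to the match.
# Usage: filter_fastq input.fastq > output.fastq
filter_fastq() {
    awk -v pat="..........CAA.....GTGG.........." '
        # Expand every dot to one explicit nucleotide class.
        BEGIN { gsub(/[.]/, "[ACGT]", pat) }

        # Line 1 of each record: the header.
        NR % 4 == 1 { header = $0 }

        # Line 2: the sequence. Remember where the match starts and how long it is.
        NR % 4 == 2 {
            hit   = match($0, pat)
            start = RSTART
            len   = RLENGTH
            seq   = substr($0, start, len)
        }

        # Line 3: the separator.
        NR % 4 == 3 { plus = $0 }

        # Line 4: the quality string. Print the record only if the sequence matched,
        # cutting the quality string at the same positions as the sequence.
        NR % 4 == 0 && hit {
            print header
            print seq
            print plus
            print substr($0, start, len)
        }
    ' "$@"
}
```

The pattern is spelled out with literal dots rather than interval expressions such as `.{10}`, because some older awk implementations do not support those. Multi-line FASTQ records, where the sequence or quality wraps over several lines, are not handled by this sketch.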