How to filter a fastq file grepping a degenerate pattern?

Question

How to filter a fastq file grepping a degenerate pattern?

226 views Asked by Marco At 09 July 2023 at 15:32

I would like to filter a fastq file in order to output only sequences that present a specific pattern ("..........CAA.....GTGG..........", the dot corresponds to a whatever nucleotide A,C,G,T) with its related quality.

I.e. Input file

@Reads1
AGCATTTGATATCAAATTTGGTGGATTGGTGTTGTGG
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@Reads2
ATTATCACCAGGGCAACAAAAGTGGCCATGCATTGAGA
+
AAFFFJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJ
@Reads3
ATTATCAAAAAAAAACCCTTGGTGGCCATGCATTGAGA
+
AAFFFJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJ

Output file:

@Reads1
CATTTGATATCAAATTTGGTGGATTGGTGTTG
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@Reads2
ATCACCAGGGCAACAAAAGTGGCCATGCATTG
+
FFJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJ

Original Q&A

There are 2 answers

**kvantour** · Answer 1 · 2023-07-09T16:25:59+00:00

There are a plethora of tools available that are specifically tuned to processing FASTQ or FASTA files. However, here is a standard awk program.

We assume the following convention for the FASTQ format.

A FASTQ record has four line-separated fields per sequence:

Field 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).

Field 2 is the raw sequence letters.

Field 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.

Field 4 encodes the quality values for the sequence in Field 2, and must contain the same number of symbols as letters in the sequence.

We can use awk to recreate this:

awk 'BEGIN{name=1;seq=2;comm=3;qual=0}
     { a[FNR%4]=$0; if (FNR%4) next }
     # At this point we have the full record
     # Perform sequence check and print if accepted:
     match(a[seq],"..........CAA.....GTGG..........") { 
        print a[1]; print substr(a[2],RSTART,RLENGTH);
        print a[3]; print substr(a[0],RSTART,RLENGTH)
     }' file.fastq

**Kaz** · Answer 2 · 2023-07-09T18:53:42+00:00

Solution in TXR.

$ txr filter.txr data
@Reads1
CATTTGATATCAAATTTGGTGGATTGGTGTTG
+
FFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@Reads2
ATCACCAGGGCAACAAAAGTGGCCATGCATTG
+
FFJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJ

Note that, based on the assumption that the corresponding subsequence of the second nucleotide sequence in each section is to be extracted, there is a discrepancy between the value we calculate and the expected data:

$ diff <(txr filter.txr data) expected
4c4
< FFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
---
> AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJ

It's also not clear whether multiple matches are expected, with the second sequence being trimmed accordingly.

Code:

@(repeat)
@@@heading
@before@{match /..........CAA.....GTGG........../}@nil
+
@other
@  (output)
@@@heading
@match
+
@{other [(len before)..(+ (len before) (len match))]}
@  (end)
@(end)

TechQA.

How to filter a fastq file grepping a degenerate pattern?

There are 2 answers

Related Questions in PYTHON

Related Questions in BASH

Related Questions in FILTER

Related Questions in FASTQ

Popular Questions

Trending Questions