I'm trying to build a database for a bacterial genus using all the published sequences, so I can calculate the coverage of my reads against this database with bowtie2 for mapping. For that, I merged all the genome sequences I downloaded from NCBI into one FASTA library (I merged 74 files into one FASTA file). The problem is that this FASTA file (the library I created) contains a lot of duplicated sequences, and that affects the coverage in a big way. So I'm asking: is there a way to eliminate the duplication in my library file, or to merge the sequences without creating duplicates in the first place, or alternatively another way to calculate the coverage of my reads against the reference sequences?

I hope I'm clear enough; please tell me if anything is unclear.
If you have control over your setup, then you could install seqkit and run the following on your FASTA file:
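(A sketch of the invocation; `merged.fa` and `dedup.fa` are placeholder names.)

```bash
# Keep one record per unique sequence; -s compares by sequence and ignores headers
seqkit rmdup -s merged.fa -o dedup.fa
```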
If you have multiple files, you can concatenate them and feed them in as standard input:
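(Again a sketch; the glob stands in for your 74 genome files, and `seqkit` reads standard input when no file argument is given.)

```bash
# Concatenate all genome FASTAs and deduplicate them in one pass
cat genomes/*.fa | seqkit rmdup -s -o dedup.fa
```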
The `rmdup` command removes duplicates, and the `-s` option calls duplicates on the basis of sequence, ignoring differences in headers. I'm not sure which header is kept in the output, but that may be something to think about.

To avoid third-party dependencies and to understand how duplicates are being removed, one can use `awk`. The idea is to read the FASTA records one by one, storing each sequence in an associative array (or hash table, also called a "dictionary" in Python) and keeping a record only if its sequence is not already in the array.
For example, start with a single-line FASTA file `in.fa` that looks like this (a made-up toy example, in which `seq3` duplicates `seq2`):
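```
>seq1
ACGTACGTACGT
>seq2
TTTTCCCCGGGG
>seq3 copy_of_seq2
TTTTCCCCGGGG
```

We can remove duplicates, preserving the first header seen for each sequence, with a sketch like this:

```bash
# For single-line FASTA: remember the current header, then print a record
# only the first time its sequence is encountered
awk '/^>/ { header = $0; next }
     !seen[$0]++ { print header; print $0 }' in.fa > dedup.fa
```

Here `dedup.fa` keeps `>seq1` and `>seq2` and drops `>seq3`.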
It requires a little knowledge of `awk` if you need modifications. This approach also depends on how your FASTA files are structured (sequences on one line or wrapped over multiple lines, etc.), though it is usually pretty easy to reformat FASTA files into the above structure of one line each for header and sequence (see the sketch at the end of this answer).

Any hash table approach also uses a fair bit of memory (I imagine that `seqkit` probably makes the same compromise for this particular task, but I haven't looked at the source). This could be an issue for very large FASTA files.

It's probably better to use `seqkit` if you have a local environment on which you can install software. If you have an IT-locked-down setup, then `awk` would work for this task as well, as it comes with most Unixes out of the box.
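If your downloaded genomes wrap their sequences over multiple lines, an `awk` sketch along these lines (file names are placeholders) flattens each record to one header line and one sequence line before deduplicating:

```bash
# Join wrapped sequence lines so every record becomes exactly two lines (header, sequence)
awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
     { seq = seq $0 }
     END { if (seq != "") print seq }' wrapped.fa > singleline.fa
```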