using awk to print header name and a substring

Question

using awk to print header name and a substring

652 views Asked by Ziv Attia At 22 November 2019 at 23:45

i try using this code for printing a header of a gene name and then pulling a substring based on its location but it doesn't work

>output_file
cat input_file | while read row; do
        echo $row > temp
        geneName=`awk '{print $1}' tmp`
        startPos=`awk '{print $2}' tmp`
        endPOs=`awk '{print $3}' tmp`
                for i in temp; do
                echo ">${geneName}" >> genes_fasta ;
                echo "awk '{val=substr($0,${startPos},${endPOs});print val}' fasta" >> genes_fasta
        done
done

input_file

nad5_exon1 250405 250551
nad5_exon2 251490 251884
nad5_exon3 195620 195641
nad5_exon4 154254 155469
nad5_exon5 156319 156548

fasta

atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc............

and this is my wrong output file

>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta

output should look like that:

>name1
atgcatgcatgcatgcatgcat
>name2
tgcatgcatgcatgcat
>name3
gcatgcatgcatgcatgcat
>namen....

Original Q&A

There are 3 answers

tshiono On 23 November 2019 at 03:29

If I'm understanding correctly, how about:

awk 'NR==FNR {fasta = fasta $0; next}
    {
        printf(">%s %s\n", $1, substr(fasta, $2, $3 - $2 + 1))
    }' fasta input_file > genes_fasta

It first reads fasta file and stores the sequence in a variable fasta.
Then it reads input_file line by line, extracts the substring of fasta starting at $2 and of length $3 - $2 + 1. (Note that the 3rd argument to substr function is length, not endpos.)

Hope this helps.

David C. Rankin On 23 November 2019 at 05:41

You can do this with a single call to awk which will be orders of magnitude more efficient than looping in a shell script and calling awk 4-times per-iteration. Since you have bash, you can simply use command substitution and redirect the contents of fasta to an awk variable and then simply output the heading and the substring containing the beginning through ending characters from your fasta file.

For example:

awk -v fasta=$(<fasta) '{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input

or using getline within the BEGIN rule:

awk 'BEGIN{getline fasta<"fasta"}
{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input

Example Input Files

Note: the beginning and ending values have been reduced to fit within the 129 characters of your example:

$ cat input
rad5_exon1 1 17
rad5_exon2 23 51
rad5_exon3 110 127
rad5_exon4 38 62
rad5_exon5 59 79

and the first 129-characters of your example fasta

$ cat fasta
atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc

Example Use/Output

$ awk -v fasta=$(<fasta) '{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
>rad5_exon1
atgcatgcatgcatgca
>rad5_exon2
gcatgcatgcatgcatgcatgcatgcatg
>rad5_exon3
tgcatgcatgcatgcatg
>rad5_exon4
tgcatgcatgcatgcatgcatgcat
>rad5_exon5
gcatgcatgcatgcatgcatg

Look thing over and let me know if I understood your question requirements. Also let me know if you have further questions on the solution.

**Ziv Attia** · Accepted Answer · 2019-11-23T06:53:59+00:00

made it work! this is the script for pulling substrings from a fasta file

cat genes_and_bounderies1 | while read row; do
        echo $row > temp
        geneName=`awk '{print $1}' temp`
        startPos=`awk '{print $2}' temp`
        endPos=`awk '{print $3}' temp`
        length=$(expr $endPos - $startPos)
                for i in temp; do
                echo ">${geneName}" >> genes_fasta
                awk -v S=$startPos -v L=$length '{print substr($0,S,L)}' unwraped_${fasta} >> genes_fasta
        done
done

TechQA.

using awk to print header name and a substring

There are 3 answers

Related Questions in BASH

Related Questions in UNIX

Related Questions in BIOINFORMATICS

Related Questions in GENOME

Related Questions in GOOGLE-GENOMICS

Popular Questions

Trending Questions