Concatenating lines using awk - awk

I have fasta file that contains two gene sequences and what I want to do is remove the fasta header (line starting with ">"), concatenate the rest of the lines and output that sequence
Here is my fasta sequence (genome.fa):
>Potrs164783
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
>Potrs164784
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Desired output
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
I am using awk to do this but I am getting this error
awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa
awk: syntax error at source line 1
context is
BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >>> >filename. <<< fa;}
awk: illegal statement at source line 1
I am basically a python person and I was given this script by someone. What am I doing wrong here?
I realized that i was not clear and so i am pasting the whole code that i got from someone. The input file and desired output remains the same
mkdir split_genome;
cd split_genome;
awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa;
ls -1 `pwd`/* > ../scaffold_list.txt;
cd ..;

If all you want to do is produce the desired output shown in your question, other solutions will work.
However, the script you have is trying to print each sequence to a file that is named using its header, and the extension .fa.
The syntax error you're getting is because filename.fa is neither a variable or a fixed string. While no Awk will allow you to print to filename.fa because it is neither in quotes or a variable (varaible names can't have a . in them), BSD Awk does not allow manipulating strings when they currently act as a file name where GNU Awk does.
So the solution:
print $0 > filename".fa"
would produce the same error in BSD Awk, but would work in GNU Awk.
To fix this, you can append the extension ".fa" to filename at assignment.
This will do the job:
$ awk '{if($0 ~ /^>/) filename=substr($0, 2)".fa"; else print $0 > filename}' file
$ cat Potrs164783.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
$ cat Potrs164784.fa
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
You'll notice I left out the BEGIN{filename="file1"} declaration statement as it is unnecessary. Also, I replaced the need for sub(...) by using the string function substr as it is more clear and requires fewer actions.

The awk code that you show attempts to do something different than produce the output that you want. Fortunately, there are much simpler ways to obtain your desired output. For example:
$ grep -v '>' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Alternatively, if you had intended to have all non-header lines concatenated into one line:
$ sed -n '/^>/!H; $!d; x; s/\n//gp' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGATTGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAACTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAATTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCCGGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC

Try this to print lines not started by > and in one line:
awk '!/^>/{printf $0}' genome.fa > filename.fa
With carriage return:
awk '!/^>/' genome.fa > filename.fa
To create single files named by the headers:
awk 'split($0,a,"^>")>1{file=a[2];next}{print >file}' genome.fa

Related

What does this Awk expression mean

I am working with bash script that has this command in it.
awk -F ‘‘ ‘/abc/{print $3}’|xargs
What is the meaning of this command?? Assume input is provided to awk.
The quick answer is it'll do different things depending on the version of awk you're running and how many fields of output the awk script produces.
I assume you meant to write:
awk -F '' '/abc/{print $3}'|xargs
not the syntactically invalid (due to "smart quotes"):
awk -F ‘’’/abc/{print $3}’|xargs
-F '' is undefined behavior per POSIX so what it will do depends on the version of awk you're running. In some awks it'll split the current line into 1 character per field. in others it'll be ignored and the line will be split into fields at every sequence of white space. In other awks still it could do anything else.
/abc/ looks for a string matching the regexp abc on the current line and if found invokes the subsequent action, in this case {print $3}.
However it's split into fields, print $3 will print the 3rd such field.
xargs as used will just print chunks of the multi-line input it's getting all on 1 line so you could get 1 line of all-fields output if you don't have many fields being output or several lines of multi-field output if you do.
I suspect the intent of that code was to do what this code actually will do in any awk alone:
awk '/abc/{printf "%s%s", sep, substr($0,3,1); sep=OFS} END{print ""}'
e.g.:
$ printf 'foo\nxabc\nyzabc\nbar\n' |
awk '/abc/{printf "%s%s", sep, substr($0,3,1); sep=OFS} END{print ""}'
b a

Match regexp at the end of the string with AWK

I am trying to match two different Regexp to long strings with awk, removing the part of the string that matches in a 35 characters window.
The problem is that the same bunch of code works when I am looking for the first (which matches at the beginnng) whereas fails to match with the second one (end of string).
Input:
Regexp1(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)Regexp2
Desired output
(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)
So far I used this code that extracts correctly Regexp1, but, unfortunately, is not able to extract also Regexp2 since indexed of RSTART and RLENGTH for Regexp2 are incorrect.
Code for extracting Regexp1 (correct output):
awk -v F="Regexp1" '{if (match(substr($1,1,35),F)) print substr($1,RSTART,RLENGTH)}' file
Code for extracting Regexp2 (wrong output)
awk -v F="Regexp2" '{if (match(substr($1,length($1)-35,35),F)) print substr($1,RSTART,RLENGTH)}' file
Despite the indexes for Regexp1 are correct, for Regexp2 indexes are wrond (RSTART=13). I cannot figure out how to extract the second Regexp.
Considering that your actual Input_file is same as shown samples, if this is the case could you please try following then(good to have new version of awk since old versions may not support number of times logic for regex).
awk '
match($0,/\([0-9]+\){5}.*\([0-9]\){4}/){
print substr($0,RSTART,RLENGTH)
}' Input_file
In case your number of parenthesis values are not fixed then you could do like as follows:
awk '
match($0,/\([0-9]+\){1,}.*\([0-9]\){1,}/){
print substr($0,RSTART,RLENGTH)
}' Input_file
If this isn't all you need:
$ sed 's/Regexp1\(.*\)Regexp2/\1/' file
(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)
or using GNU awk for gensub():
$ awk '{print gensub(/Regexp1(.*)Regexp2/,"\\1",1)}' file
(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)
then edit your question to be far clearer with your requirements and example.

Strip last field

My script will be receiving various lengths of input and I want to strip the last field separated by a "/". An example of the input I will be dealing with is.
this/that/and/more
But the issue I am running into is that the length of the input will vary like so:
this/that/maybe/more/and/more
or/even/this/could/be/it/and/maybe/more
short/more
In any case, the expected output should be the whole string minus the last "/more".
Note: The word "more" will not be a constant these are arbitrary examples.
Example input:
this/that/and/more
this/that/maybe/more/and/more
Expected output:
this/that/and
this/that/maybe/more/and
What I know works for a string you know the length of would be
cut -d'/' -f[x]
With what I need is a '/' delimited AWK command I'm assuming like:
awk '{$NF=""; print $0}'
With awk as requested:
$ awk '{sub("/[^/]*$","")} 1' file
this/that/maybe/more/and
or/even/this/could/be/it/and/maybe
short
but this is the type of job sed is best suited for:
$ sed 's:/[^/]*$::' file
this/that/maybe/more/and
or/even/this/could/be/it/and/maybe
short
The above were run against this input file:
$ cat file
this/that/maybe/more/and/more
or/even/this/could/be/it/and/maybe/more
short/more
Depending on how you have the input in your script, bash's Shell Parameter Expansion may be convenient:
$ s1=this/that/maybe/more/and/more
$ s2=or/even/this/could/be/it/and/maybe/more
$ s3=short/more
$ echo ${s1%/*}
this/that/maybe/more/and
$ echo ${s2%/*}
or/even/this/could/be/it/and/maybe
$ echo ${s3%/*}
short
(Lots of additional info on parameter expansion at https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html)
In your script, you could create a loop that removes the last character in the input string if it is not a slash through each iteration. Then, when the loop finds a slash character, exit the loop then remove the final character (which is supposed to be a slash).
Pseudo-code:
while (lastCharacter != '/') {
removeLastCharacter();
}
removeLastCharacter(); # removes the slash
(Sorry, it's been a while since I wrote a bash script.)
Another awk alternative using fields instead of regexs
awk -F/ '{printf "%s", $1; for (i=2; i<NF; i++) printf "/%s", $i; printf "\n"}'
Here is an alternative shell solution:
while read -r path; do dirname "$path"; done < file

Use AWK to search through fasta file, given a second file containing sequence names

I have a 2 files. One is a fasta file contain multiple fasta sequences, while another file includes the names of candidate sequences I want to search (file Example below).
seq.fasta
>Clone_18
GTTACGGGGGACACATTTTCCCTTCCAATGCTGCTTTCAGTGATAAATTGAGCATGATGGATGCTGATAATATCATTCCCGTGT
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC
>Clone_27-2
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTCGTTTTGTTCTAGATTAACTATCAGTTTGGTTCTGTTTGTCCTCGTACTGGGTTGTGTCAATGCACAACTT
>Clone_34-1
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCG
>Clone_34-3
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCGATATCGCTGAAGCCCAATC
>Clone_44-1
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCC
>Clone_44-3
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCCCGGCAGCGCAGCCGTCGTCTCTACCCTTCACCAGGAATAAGTTTATTTTTCTACTTAC
name.txt
Clone_23
Clone_27-1
I want to use AWK to search through the fasta file, and obtain all the fasta sequences for given candidates whose names were saved in another file.
awk 'NR==FNR{a[$1]=$1} BEGIN{RS="\n>"; FS="\n"} NR>FNR {if (match($1,">")) {sub(">","",$1)} for (p in a) {if ($1==p) print ">"$0}}' name.txt seq.fasta
The problem is that I can only extract the sequence of first candidate in name.txt, like this
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
Can anyone help to fix one-line awk command above?
If it is ok or even desired to print the name as well, you can simply use grep:
grep -Ff name.txt -A1 a.fasta
-f name.txt picks patterns from name.txt
-F treats them as literal strings rather than regular expressions
A1 prints the matching line plus the subsequent line
If the names are not desired in output I would simply pipe to another grep:
above_command | grep -v '>'
An awk solution can look like this:
awk 'NR==FNR{n[$0];next} substr($0,2) in n && getline' name.txt a.fasta
Better explained in a multiline version:
# True as long as we are reading the first file, name.txt
NR==FNR {
# Store the names in the array 'n'
n[$0]
next
}
# I use substr() to remove the leading `>` and check if the remaining
# string which is the name is a key of `n`. getline retrieves the next line
# If it succeeds the condition becomes true and awk will print that line
substr($0,2) in n && getline
$ awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' name.txt seq.fasta
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC

Processing of awk with multiple variable from previous processing?

I have a Q's for awk processing, i got a file below
cat test.txt
/home/shhh/
abc.c
/home/shhh/2/
def.c
gthjrjrdj.c
/kernel/sssh
sarawtera.c
wrawrt.h
wearwaerw.h
My goal is to make a full path from splitting sentences into /home/jhyoon/abc.c.
This is the command I got from someone:
cat test.txt | awk '/^\/.*/{path=$0}/^[a-zA-Z]/{printf("%s/%s\n",path,$0);}'
It works, but I do not understand well about how do make interpret it step by step.
Could you teach me how do I make interpret it?
Result :
/home/shhh//abc.c
/home/shhh/2//def.c
/home/shhh/2//gthjrjrdj.c
/kernel/sssh/sarawtera.c
/kernel/sssh/wrawrt.h
/kernel/sssh/wearwaerw.h
What you probably want is the following:
$ awk '/^\//{path=$0}/^[a-zA-Z]/ {printf("%s/%s\n",path,$0)}' file
/home/jhyoon//abc.c
/home/jhyoon/2//def.c
/home/jhyoon/2//gthjrjrdj.c
/kernel/sssh/sarawtera.c
/kernel/sssh/wrawrt.h
/kernel/sssh/wearwaerw.h
Explanation
/^\//{path=$0} on lines starting with a /, store it in the path variable.
/^[a-zA-Z]/ {printf("%s/%s\n",path,$0)} on lines starting with a letter, print the stored path together with the current line.
Note you can also say
awk '/^\//{path=$0; next} {printf("%s/%s\n",path,$0)}' file
Some comments
cat file | awk '...' is better written as awk '...' file.
You don't need the ; at the end of a block {} if you are executing just one command. It is implicit.