Removing the asterisk (*) from the end of a fasta sequence in a multi fasta file - awk

I have a multifasta file containing predicted proteins from 2 ab initio tools. Every sequence ends with an asterisk (*). I want to remove it from the file. My sequences are like this:
>snapgene1
SFLPSAEAIEKVLSHMSRRIIDDMKAELQQPEMRWFWP*
>snapgene2
SFLPSAEAIEKVLSHIIIIAAAAKKKPPFFDDMKAELQQPEMRWFWP*
I want the sequences like this:
>snapgene1
SFLPSAEAIEKVLSHMSRRIIDDMKAELQQPEMRWFWP
>snapgene2
SFLPSAEAIEKVLSHIIIIAAAAKKKPPFFDDMKAELQQPEMRWFWP
Can anyone help me with this? Thank you.

If the text is stored in a file "temp.txt", you can use this command (with GNU sed; on BSD/macOS sed, use -i '' instead of -i):
sed -i 's/\*$//' temp.txt
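To preview the change without touching the file, you can first run the same substitution without -i, for example:
sed 's/\*$//' temp.txt | head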

In awk, if you keep your FASTA records in a file:
$ awk '{sub(/\*$/,"")}1' file
>snapgene1
SFLPSAEAIEKVLSHMSRRIIDDMKAELQQPEMRWFWP
>snapgene2
SFLPSAEAIEKVLSHIIIIAAAAKKKPPFFDDMKAELQQPEMRWFWP
It replaces trailing * with nothing.
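The command above prints the cleaned sequences to standard output; to keep them, redirect to a new file, or edit in place if your awk is GNU awk 4.1 or later. A small sketch, assuming the input file is named proteins.fasta:
awk '{sub(/\*$/,"")}1' proteins.fasta > proteins_clean.fasta
or, with GNU awk 4.1+ only:
gawk -i inplace '{sub(/\*$/,"")}1' proteins.fasta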

Related

How to replace strings in text with id from second text?

I've got two CSV files. The first file contains organism family names and connection weight information, but I need to change the format of the file to load it into different programs like Gephi. I have created a second file where each family has an ID value. I haven't found a good example on this site of how to change the family names in the first file to the IDs from the second file. Example of my files:
$ cat edge_file.csv
Source,Target,Weight,Type,From,To
Argasidae,Alcaligenaceae,0.040968439,undirected,A_Argasidae,B_Alcaligenaceae
Argasidae,Burkholderiaceae,0.796351574,undirected,A_Argasidae,B_Burkholderiaceae
Argasidae,Methylophilaceae,0.276912259,undirected,A_Argasidae,B_Methylophilaceae
Argasidae,Oxalobacteraceae,0.460508445,undirected,A_Argasidae,B_Oxalobacteraceae
Argasidae,Rhodocyclaceae,0.764558003,undirected,A_Argasidae,B_Rhodocyclaceae
Argasidae,Sphingomonadaceae,0.70198002,undirected,A_Argasidae,B_Sphingomonadaceae
Argasidae,Zoogloeaceae,0.034648156,undirected,A_Argasidae,B_Zoogloeaceae
Argasidae,Agaricaceae,0.190482976,undirected,A_Argasidae,F_Agaricaceae
Argasidae,Bulleribasidiaceae,0.841600859,undirected,A_Argasidae,F_Bulleribasidiaceae
Argasidae,Camptobasidiaceae,0.841600859,undirected,A_Argasidae,F_Camptobasidiaceae
Argasidae,Chrysozymaceae,0.190482976,undirected,A_Argasidae,F_Chrysozymaceae
Argasidae,Cryptococcaceae,0.055650172,undirected,A_Argasidae,F_Cryptococcaceae
$ cat id_file.csv
Id,Family
1,Argasidae
2,Buthidae
3,Alcaligenaceae
4,Burkholderiaceae
5,Methylophilaceae
6,Oxalobacteraceae
7,Rhodocyclaceae
8,Oppiidae
9,Sphingomonadaceae
10,Zoogloeaceae
11,Agaricaceae
12,Bulleribasidiaceae
13,Camptobasidiaceae
14,Chrysozymaceae
15,Cryptococcaceae
I basically want the edge_file.csv output to turn into the output below, where Source and Target have changed from family names to ids instead.
Source,Target,Weight,Type,From,To
1,3,0.040968439,undirected,A_Argasidae,B_Alcaligenaceae
1,4,0.796351574,undirected,A_Argasidae,B_Burkholderiaceae
1,5,0.276912259,undirected,A_Argasidae,B_Methylophilaceae
1,6,0.460508445,undirected,A_Argasidae,B_Oxalobacteraceae
1,7,0.764558003,undirected,A_Argasidae,B_Rhodocyclaceae
1,9,0.70198002,undirected,A_Argasidae,B_Sphingomonadaceae
1,10,0.034648156,undirected,A_Argasidae,B_Zoogloeaceae
1,11,0.190482976,undirected,A_Argasidae,F_Agaricaceae
1,12,0.841600859,undirected,A_Argasidae,F_Bulleribasidiaceae
1,13,0.841600859,undirected,A_Argasidae,F_Camptobasidiaceae
1,14,0.190482976,undirected,A_Argasidae,F_Chrysozymaceae
1,15,0.055650172,undirected,A_Argasidae,F_Cryptococcaceae
I haven't been able to figure it out with awk since I'm new to it, but I tried some variations from other examples here such as (just testing it out for the "Source" column):
awk 'NR==FNR{a[$1]=$1;next}{$1=a[$1];}1' edge_file.csv id_file.csv
Everything just prints out blank. My understanding is that I should create an array keyed on the Source and Target columns of edge_file.csv and then replace them with the first column from id_file.csv, which is the Id column. I can't get the syntax to work even for just one column.
You're close. This one-liner should help:
awk -F, -v OFS=',' 'NR==FNR{a[$2]=$1;next}{$1=a[$1];$2=a[$2]}1' id_file.csv edge_file.csv
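The idea is that the first file (id_file.csv) builds a lookup table mapping each family name to its Id, and the second pass rewrites the Source and Target fields through that table. Note that the header line of edge_file.csv has no entry in the table, so the one-liner above would blank it out; a variant that leaves the header (and any name missing from id_file.csv) untouched might look like this -- a sketch, assuming the files are laid out exactly as shown:
awk -F, -v OFS=',' '
NR==FNR { a[$2]=$1; next }          # id_file.csv: map Family -> Id
FNR>1 {
    if ($1 in a) $1 = a[$1]         # rewrite Source if we know it
    if ($2 in a) $2 = a[$2]         # rewrite Target if we know it
}
1                                   # print every line, header included
' id_file.csv edge_file.csv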

AWK - Replace between two csv files and print correctly

I have 2 files:
f1.csv:
CL*VIN
AV*AZA
PS*LUG
f2.csv:
2100-12-31*1234A*Thomas*Frederuc*1931-02-20*6791237*6791238*test1*1*0*0*CL*Jame 12*13*a1*zz3*D*13*1234*Tex*F
2100-12-31*1235A*Jack*Borbin*1931-02-21*7791238*7791239*test2*1*0*0*PS*Willliams Hou*14*a2*zz4*A*13*1235*Barc*F
2100-12-31*1236A*Pierce*Matheus*1931-02-22*8791239*8791240*test3*1*1*1*AV*Magnum st*15*a3*zz5*A*13*1236*Euo*F
And I want this output:
2100-12-31*1234A*Thomas*Frederuc*1931-02-20*6791237*6791238*test1*1*0*0*VIN*Jame 12*13*a1*zz3*D*13*1234*Tex*F
2100-12-31*1235A*Jack*Borbin*1931-02-21*7791238*7791239*test2*1*0*0*LUG*Willliams Hou*14*a2*zz4*A*13*1235*Barc*F
2100-12-31*1236A*Pierce*Matheus*1931-02-22*8791239*8791240*test3*1*1*1*AZA*Magnum st*15*a3*zz5*A*13*1236*Euo*F
I have the following code:
awk -F"*" 'FNR==NR{ A[$1]=$2;next} ($12 in A){$12=A[$12];print}' OFS='*' f1.csv f2.csv
But the output is:
*Jame 12*13*a1*zz3*D*13*1234*Tex*F931-02-20*6791237*6791238*test1*1*0*0*VIN
*Willliams Hou*14*a2*zz4*A*13*1235*Barc*F791238*7791239*test2*1*0*0*LUG
*Magnum st*15*a3*zz5*A*13*1236*Euo*F-02-22*8791239*8791240*test3*1*1*1*VIN
How can I obtain my desired output?
Your code works perfectly fine here; what's your system/environment and awk version?
It seems to be something to do with carriage returns, so it's better to run this before dealing with these files:
dos2unix files
Alternatively, you can try this:
awk 'BEGIN{FS=OFS="*";RS="\r\n|\r|\n";}FNR==NR{A[$1]=$2;next}($12 in A){$12=A[$12];print}' f1.csv f2.csv
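Note that using a regular expression such as "\r\n|\r|\n" as RS is a gawk extension; with other awks you can strip a trailing carriage return from each record yourself instead -- a sketch under that assumption:
awk 'BEGIN{FS=OFS="*"}
{ sub(/\r$/,"") }                     # drop a trailing CR, if any
FNR==NR { A[$1]=$2; next }            # f1.csv: map code -> replacement
($12 in A) { $12 = A[$12]; print }    # f2.csv: rewrite field 12 and print
' f1.csv f2.csv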

extracting subtext between two characters using grep

I have a text file which has information like the following:
#Mp_chzt_1
asdjhsadhasdhdbjashdjaudashdjashdasdhasdhasdh
asdasdkasjdkaskdskadkasdkasdkjaskldasdklasdas
ahsjdasdfdfsdhghrtuztiuiuzozuoiouiouiouiouiou
asjkjieqjeroiweoriksfjksjksjkf
+
!!!#!!!!!!!!++??????????????~~~~~~~~~~~~~
BBBBBBBBBBBBMMMMMM!!!!!++LLLLLL******
#Mp_btrea_1
uokjjkzghqawsdasduihdlöklöaklöskdlkaökgzgzggz
asdasduzuqwtzeqweuvixcvdjfiisduiifuzwpqüqwoeü
kjkjiuijwiqquzwuziziqz
+
**********||||||||||||#########++++?????????
MMMMMMMMMUUUU***+++~~~~~~~~~~~~~~~~~~~~~~~~~~
#Mp_trwe_3
jhtrqhkjiqkjkqwjelasjjljiewkjkljkldjflsjljki8u
immhgwqtzopirpjgbsdkfjieipwippieoroeirkvsdjjfk
jkahdjhjhfuhjkwekksjakjeiuwiurweiurioweuroweod
poplrtm,ernmjhazqweqwjidiipfiopdifosidpfppsdif
mnasnbdhgqweqweipoipoxkajksdökalsklsaksldkasöd
asdas
+
!!!!!!!!!!!!!!!!!!#####???????????????????
I would like to extract only the region between each #Mp_* header and the + that comes right below the text, and export it to a txt file like the following:
#Mp_chzt_1
asdjhsadhasdhdbjashdjaudashdjashdasdhasdhasdh
asdasdkasjdkaskdskadkasdkasdkjaskldasdklasdas
ahsjdasdfdfsdhghrtuztiuiuzozuoiouiouiouiouiou
asjkjieqjeroiweoriksfjksjksjkf
#Mp_btrea_1
uokjjkzghqawsdasduihdlöklöaklöskdlkaökgzgzggz
asdasduzuqwtzeqweuvixcvdjfiisduiifuzwpqüqwoeü
kjkjiuijwiqquzwuziziqz
#Mp_trwe_3
jhtrqhkjiqkjkqwjelasjjljiewkjkljkldjflsjljki8u
immhgwqtzopirpjgbsdkfjieipwippieoroeirkvsdjjfk
jkahdjhjhfuhjkwekksjakjeiuwiurweiurioweuroweod
poplrtm,ernmjhazqweqwjidiipfiopdifosidpfppsdif
mnasnbdhgqweqweipoipoxkajksdökalsklsaksldkasöd
asdas
When I used the following code
grep -o -P '(?<=#MP.*).*(?=+)' query.txt > output.txt
It gave me "grep: nothing to repeat".
Could anyone point out where my mistake is and how to rectify it?
Thanks in advance.
The grep error comes from the unescaped + inside (?=+): + is a quantifier, so PCRE has nothing for it to repeat (and the variable-length lookbehind (?<=#MP.*) would be rejected as well). Better to use awk for this:
awk '/^#/{f=1} /^\+/ {f=0} f' file > output.txt
Or, if you have leading spaces, match them with \s*:
awk '/^\s*#/{f=1} /^\s*\+/ {f=0} f' file > output.txt
This uses a flag f to decide whether the line should be printed or not.
When it sees a line starting with #, it sets the flag.
When it sees a line starting with +, it clears it.
Finally, the bare f pattern prints the line whenever the flag is set; since the flag is updated before it is tested on the same line, each # header is printed but the + line is not.
With your given input it returns:
#Mp_chzt_1
asdjhsadhasdhdbjashdjaudashdjashdasdhasdhasdh
asdasdkasjdkaskdskadkasdkasdkjaskldasdklasdas
ahsjdasdfdfsdhghrtuztiuiuzozuoiouiouiouiouiou
asjkjieqjeroiweoriksfjksjksjkf
#Mp_btrea_1
uokjjkzghqawsdasduihdlöklöaklöskdlkaökgzgzggz
asdasduzuqwtzeqweuvixcvdjfiisduiifuzwpqüqwoeü
kjkjiuijwiqquzwuziziqz
#Mp_trwe_3
jhtrqhkjiqkjkqwjelasjjljiewkjkljkldjflsjljki8u
immhgwqtzopirpjgbsdkfjieipwippieoroeirkvsdjjfk
jkahdjhjhfuhjkwekksjakjeiuwiurweiurioweuroweod
poplrtm,ernmjhazqweqwjidiipfiopdifosidpfppsdif
mnasnbdhgqweqweipoipoxkajksdökalsklsaksldkasöd
asdas
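If you prefer a one-liner without a flag variable, a sed range expression does the same job -- a sketch, assuming GNU or BSD sed and the layout shown above:
sed -n '/^#/,/^+/{/^+/!p;}' file > output.txt
The range starts at each line beginning with # and ends at the next line beginning with +, and the inner /^+/!p prints every line in the range except the closing + itself.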

Scaling the values to plot a graph using gnuplot

I have a text file in the format below. The first column is a timestamp with a very high resolution, and the second number is a sequence number. I want to plot a graph of these two values, i.e. sequence number over time. For this purpose I want to scale both the sequence number and the timestamp. The timestamp can be scaled by subtracting the first timestamp from the remaining timestamps, and the sequence number should be scaled the same way. However, when scaled, the sequence number can have negative values. How do I write a bash script using awk to achieve this? This file is named print_1010171.txt. Please note that I have a number of files in the same format, so I want the script to be generic.
5698771509078629376 1133254688
5698771509371165696 1150031904
5698771510035551232 1150031904
5698771510036082688 4170258464
5698771510036715520 2895583264
5698771510037202176 1620908064
5698771510037665280 346232864
5698771510038193664 3366459424
5698771510332259072 2091784224
5698771510332816128 817109024
5698771510333344512 3837335584
5698771510339882240 2562660384
5698771510340411392 1287985184
5698771510340939776 13309984
5698771510348048896 3033536544
5698771510348577280 1758861344
5698771510349228800 484186144
5698771510632804864 3504412704
5698771510633441792 2229737504
5698771510634390272 955062304
5698771510638858496 3975288864
5698771510639347712 2700613664
5698771510642663168 1425938464
5698771510643387136 134486304
5698771510643808768 3154712864
5698771510648858368 1880037664
5698771510649410560 605362464
5698771510655600384 3625589024
5698771510656128768 2350913824
5698771510656657408 1076238624
Very similar to Dennis Williamson's solution -- this should be more efficient (though probably not by an amount you'd ever notice), and it will also silently ignore blank lines (the other solution will give very large negative numbers for blank lines).
#script coolscript.gp
if(!exists("DATAFILE")) DATAFILE='test.dat'
EXT_INDEX=strstr(DATAFILE,'.txt') #assume data has a .txt extension.
set term post enh color
set output DATAFILE[:EXT_INDEX] . '.ps' #gnuplot string slicing and concatenation
plot "< awk 'BEGIN{getline; header_col1=$1; header_col2=$2 }{if(NF){print $1-header_col1,$2-header_col2}}' ".DATAFILE using 1:2
You can definitely do this using an all-gnuplot solution (see #andyras's nice solution and my answer that he linked to). This (alternate) solution works by reading the first line in awk and assigning the variables header_col1 and header_col2 from column 1 and column 2. It then subtracts those values from the following rows (as expected) as long as the line isn't empty.
Note that this solution can be called from the commandline using:
gnuplot -e "DATAFILE='mydatafile.txt'" coolscript.gp
Unfortunately, the quotes are necessary since gnuplot needs them, meaning that if you're using this in a shell loop, you should definitely use the double quotes on the outside as I show.
for FILE in *.dat; do
gnuplot -e "DATAFILE='${FILE}'" coolscript.gp
done
awk 'NR == 1 {basets = $1; baseseq = $2} {print $1 - basets, $2 - baseseq}' inputfile
or, if you don't want to output the initial pair of zeros:
awk 'NR == 1 {basets = $1; baseseq = $2; next} {print $1 - basets, $2 - baseseq}' inputfile
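Since you mention having many files in the same format, you can wrap either one-liner in a small loop -- a sketch, assuming the files match print_*.txt and you want the scaled data written alongside them:
for f in print_*.txt; do
    awk 'NR == 1 {basets = $1; baseseq = $2} {print $1 - basets, $2 - baseseq}' "$f" > "${f%.txt}_scaled.txt"
done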
Here is a bash wrapper script which should do what you want:
#!/bin/bash
gnuplot << EOF
set terminal png truecolor size 800,600
set output 'plot_$1.png'
firstx=0
offsetx=0
funcx(x)=(offsetx=(firstx==0)?x:offsetx,firstx=1,x-offsetx)
firsty=0
offsety=0
funcy(x)=(offsety=(firsty==0)?x:offsety,firsty=1,x-offsety)
plot '$1' u (funcx(\$1)):(funcy(\$2))
EOF
To use the script, give it the name of the file you want to plot as an argument:
$ myscript.sh print_1010171.txt
I modified the answer given here to accommodate two variables. See that answer also if you want to subtract the lowest value from all data rather than the first.

awk getline skipping to last line -- possible newline character issue

I'm using
while( (getline line < "filename") > 0 )
within my BEGIN statement, but this while loop only seems to read the last line of the file instead of each line. I think it may be a newline character problem, but really I don't know. Any ideas?
I'm trying to read the data in from a file other than the main input file.
The same syntax actually works for one file, but not another, and the only difference I see is that the one for which it DOES work has "^M" at the end of each line when I look at it in Vim, and the one for which it DOESN'T work doesn't have ^M. But this seems like an odd problem to be having on my (UNIX based) Mac.
I wish I understood what was going with getline a lot better than I do.
You would have to set RS to something more lenient.
Here is an ugly hack to get things working:
RS="[\x0d\x0a\x0d]"
Now, this may require some explanation.
Different systems use different ways to mark the end of a line.
Read http://en.wikipedia.org/wiki/Carriage_return and http://en.wikipedia.org/wiki/Newline if you are interested in the details.
Normally awk handles this gracefully, but it appears that in your environment some files are being naughty.
0x0d, 0x0a, or 0x0d 0x0a (CR+LF) should be used, but not mixed.
So let's try an example of a mixed data stream:
$ echo -e "foo\x0d\x0abar\x0d\x0adoe\x0arar\x0azoe\x0dqwe\x0dtry" |awk 'BEGIN{while((getline r )>0){print "r=["r"]";}}'
Result:
r=[foo]
r=[bar]
r=[doe]
r=[rar]
try]oe
We can see that the last lines are mangled or lost.
Now, using the ugly hack for RS:
$ echo -e "foo\x0d\x0abar\x0d\x0adoe\x0arar\x0azoe\x0dqwe\x0dtry" |awk 'BEGIN{RS="[\x0d\x0a\x0d]";while((getline r )>0){print "r=["r"]";}}'
Result:
r=[foo]
r=[bar]
r=[doe]
r=[rar]
r=[zoe]
r=[qwe]
r=[try]
We can see that every line is obtained, regardless of the 0x0d / 0x0a junk :-)
Maybe you should preprocess your input file with, for example, the dos2unix utility (http://sourceforge.net/projects/dos2unix/)?
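If installing dos2unix isn't an option, tr can strip the carriage returns just as well -- a quick sketch, assuming the side file read by getline is called filename:
tr -d '\r' < filename > filename.unix
Point getline at filename.unix (or overwrite the original after checking the result) and the default RS will behave again.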