How to use sed/awk to replace the original file and get the following desired output? - awk

I'm writing a bash script that translates one file into another, and I've run into an issue.
Whenever the program sees something like this (...... stands for text not shown):
......Mul(-a1+b2-c3...+f+e)......
it should change it to:
......M(-a1)*M(b2)*M(-c3)*...*M(f)*M(e)......
The number of variables inside Mul is unknown, and Mul can occur multiple times in the file. + and - also appear in other places in the file, and variable names can be one or more characters long.
I tried grouping in sed, with a group followed by a "*", but that doesn't work because the number of variables to replace is unknown.

Here is a sed script that will do it:
:a
s/\(Mul(.[^)]*\)\([+-].\)/\1)*Mul(\2/
ta
s/Mul(+\{0,1\}/M(/g
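The trick is to use the t (test) command to jump back to the label :a whenever a substitution was made, so that each pass splits off one more term (e.g. "Mul(a+b+c)" => "Mul(a+b)*Mul(+c)"). The final substitution renames Mul to M and drops a leading +.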
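Since the title asks about replacing the original file, one option is to save the commands in a script file, write the result to a temporary file, and move that over the original. This is only a minimal sketch, assuming a POSIX sed; the names mul.sed and input.txt and the second sample line are made up for illustration:

```shell
#!/bin/sh
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)

# The sed commands from the answer, saved as a script file (hypothetical name).
cat > "$workdir/mul.sed" <<'EOF'
:a
s/\(Mul(.[^)]*\)\([+-].\)/\1)*Mul(\2/
ta
s/Mul(+\{0,1\}/M(/g
EOF

# Sample input: the line from the question plus an invented line that has
# + and - outside of Mul(...), which must stay untouched.
printf '%s\n' '......Mul(-a1+b2-c3+f+e)......' 'x = a+b - Mul(p-q)' > "$workdir/input.txt"

# Write to a temporary file, then replace the original only if sed succeeded.
sed -f "$workdir/mul.sed" "$workdir/input.txt" > "$workdir/input.txt.new" &&
    mv "$workdir/input.txt.new" "$workdir/input.txt"

cat "$workdir/input.txt"
```

GNU sed also offers -i for in-place editing; the temporary-file form is used here because -i is not part of POSIX and behaves differently across implementations.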

$ cat tst.awk
match($0,/Mul\([^()]+\)/) {
tgt = substr($0,RSTART+4,RLENGTH-5)
gsub(/[-+][[:alnum:]]+/,"*M(&)",tgt)
gsub(/\+/,"",tgt)
sub(/^\*/,"",tgt)
print substr($0,1,RSTART-1) tgt substr($0,RSTART+RLENGTH)
}
$ awk -f tst.awk file
......M(-a1)*M(b2)*M(-c3)*M(f)*M(e)......
The above was run on this input file:
$ cat file
......Mul(-a1+b2-c3+f+e)......
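The awk script above rewrites the first Mul(...) on a line and prints only lines that contain one. The question mentions that Mul can occur several times, so here is a hedged variation that loops over every occurrence and prints all lines. It is a sketch, not the answer author's code: the file names, the second sample line, and the handling of an unsigned first term are assumptions.

```shell
#!/bin/sh
workdir=$(mktemp -d)

cat > "$workdir/mul.awk" <<'EOF'
{
    out = ""
    rest = $0
    # Consume one Mul(...) at a time so every occurrence on the line is handled.
    while (match(rest, /Mul\([^()]+\)/)) {
        tgt = substr(rest, RSTART + 4, RLENGTH - 5)   # text between the parentheses
        if (tgt !~ /^[-+]/) tgt = "+" tgt             # give an unsigned first term a sign
        gsub(/[-+][^-+]+/, "*M(&)", tgt)              # wrap each signed term in *M(...)
        gsub(/\+/, "", tgt)                           # drop the plus signs
        sub(/^\*/, "", tgt)                           # drop the leading "*"
        out = out substr(rest, 1, RSTART - 1) tgt
        rest = substr(rest, RSTART + RLENGTH)
    }
    # Lines without Mul(...) pass through unchanged.
    print out rest
}
EOF

printf '%s\n' '......Mul(-a1+b2-c3+f+e)......' 'x = a+b - Mul(p-q) + Mul(r)' > "$workdir/file"
result=$(awk -f "$workdir/mul.awk" "$workdir/file")
printf '%s\n' "$result"
```

Text outside the parentheses is copied verbatim, so a + or - elsewhere on the line is left alone.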

Related

Find replace "./." in awk

I am very new to using linux and I am trying to find/replace some of the text in my file.
I have successfully been able to find and replace "0/0" using gsub:
awk '{gsub(/0\/0/,"0")}; 1' filename
However, if I try to replace "./." using the same idea
awk '{gsub(/\.\/\./,"U")}; 1' filename
the output is truncated and stops at the location of the first "./." in the file. I know that "." is a special wildcard character, but I thought that having the "\" in front of it would neutralize it. I have searched but have been unable to find an explanation why the formula I used would truncate the file.
Any thoughts would be much appreciated. Thank you.
Recall that the basic outline of an awk program is:
awk 'pattern { action }'
The most common patterns are regexes or tests against line counts:
awk '/FOO/ { do_something_with_a_line_with_FOO_in_it }'
awk 'FNR==10'
The last one has no action so the default is to print the line.
But functions that return a value are also usable as patterns. gsub is a function and returns the number of substitutions it made.
So given:
$ echo "$txt"
abc./.def line 1
ghk/lmn won't get printed
abc./.def abc./.def printed
To print only lines that have a successful substitution you can do:
$ echo "$txt" | awk 'gsub(/\.\/\./,"U")'
abcUdef line 1
abcUdef abcUdef printed
You do not need to put gsub into an action block here, since you want to run it on every line and its return value tells you what happened. Lines with at least one successful substitution are printed because gsub returns the number of substitutions, and any non-zero value counts as true.
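To make the return value visible, it can be stored and printed next to each line. A small sketch using the same sample text; the variable names n and counts are only illustrative:

```shell
#!/bin/sh
txt="abc./.def line 1
ghk/lmn won't get printed
abc./.def abc./.def printed"

# n holds the number of substitutions gsub made on the current line.
counts=$(printf '%s\n' "$txt" | awk '{ n = gsub(/\.\/\./, "U"); print n ": " $0 }')
printf '%s\n' "$counts"
```

The middle line reports 0, which is why it is skipped when gsub is used on its own as the pattern.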
If you want every line printed regardless if there is a match:
$ echo "$txt" | awk 'gsub(/\.\/\./,"U") || 1'
abcUdef line 1
ghk/lmn won't get printed
abcUdef abcUdef printed
Or, you can use the function as an action with an empty pattern and then a 1 with an empty action:
$ echo "$txt" | awk '{gsub(/\.\/\./,"U")} 1'
abcUdef line 1
ghk/lmn won't get printed
abcUdef abcUdef printed
In either case, 1 as a pattern with no action prints the line regardless if there is a match and the gsub makes the substitution if any.
The second awk is what you have. The reason it does not work is probably related to your input data.
Your awk script is fine, your input contains control-Ms, probably from being created by a Windows program. You can see them with cat -v file and use dos2unix or similar to remove them.
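If dos2unix is not installed, stripping the carriage returns with tr before awk is one alternative. A minimal sketch, assuming the input really has Windows (CR LF) line endings; the sample lines are made up:

```shell
#!/bin/sh
# Build a small sample with CR LF line endings.
sample=$(mktemp)
printf 'abc./.def\r\nghk/lmn\r\n' > "$sample"

# The awk command is the one from the question; tr removes the carriage returns first.
fixed=$(tr -d '\r' < "$sample" | awk '{gsub(/\.\/\./,"U")}; 1')
printf '%s\n' "$fixed"

rm -f "$sample"
```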

Concatenating lines using awk

I have a FASTA file that contains two gene sequences. I want to remove the FASTA headers (lines starting with ">"), concatenate the remaining lines, and output that sequence.
Here is my FASTA file (genome.fa):
>Potrs164783
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
>Potrs164784
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Desired output
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
I am using awk to do this, but I am getting this error:
awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa
awk: syntax error at source line 1
context is
BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >>> >filename. <<< fa;}
awk: illegal statement at source line 1
I am basically a Python person and was given this script by someone else. What am I doing wrong here?
I realized that I was not clear, so I am pasting the whole code that I was given. The input file and desired output remain the same.
mkdir split_genome;
cd split_genome;
awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa;
ls -1 `pwd`/* > ../scaffold_list.txt;
cd ..;
If all you want to do is produce the desired output shown in your question, other solutions will work.
However, the script you have is trying to print each sequence to a file that is named using its header, and the extension .fa.
The syntax error you're getting is because filename.fa is neither a variable nor a string constant. No Awk will let you print to an unquoted filename.fa, because variable names can't contain a "." character. In addition, an unparenthesized string concatenation after > is ambiguous in the Awk grammar: GNU Awk accepts it, but BSD Awk does not.
So the solution:
print $0 > filename".fa"
would produce the same error in BSD Awk, but would work in GNU Awk.
To fix this portably, you can either parenthesize the expression, as in print $0 > (filename ".fa"), or append the extension ".fa" to filename at assignment.
This will do the job:
$ awk '{if($0 ~ /^>/) filename=substr($0, 2)".fa"; else print $0 > filename}' file
$ cat Potrs164783.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
$ cat Potrs164784.fa
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
You'll notice I left out the BEGIN{filename="file1"} declaration statement as it is unnecessary. Also, I replaced the need for sub(...) by using the string function substr as it is more clear and requires fewer actions.
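With only two sequences this is enough, but a genome with many headers keeps one output file open per header, and some awk implementations limit the number of open files. A hedged sketch that closes each file before moving on to the next; the shortened sequences and the scratch directory are assumptions for illustration:

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Shortened stand-in for genome.fa; the header names are taken from the question.
printf '%s\n' '>Potrs164783' 'AGGAAG' 'TGTTTG' '>Potrs164784' 'TTACCC' > "$workdir/genome.fa"

(
    cd "$workdir" || exit 1
    awk '
    /^>/ {
        if (filename != "") close(filename)   # release the previous output file
        filename = substr($0, 2) ".fa"        # header without ">" plus the extension
        next
    }
    filename != "" { print > filename }       # sequence lines go to the current file
    ' genome.fa
)

ls "$workdir"
```

One caveat: if the same header appears twice, reopening the file with > after close() overwrites the earlier content, so >> would be needed in that case.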
The awk code that you show attempts to do something different than produce the output that you want. Fortunately, there are much simpler ways to obtain your desired output. For example:
$ grep -v '>' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Alternatively, if you had intended to have all non-header lines concatenated into one line:
$ sed -n '/^>/!H; $!d; x; s/\n//gp' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGATTGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAACTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAATTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCCGGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Try this to print the lines that do not start with > as a single line:
awk '!/^>/{printf "%s", $0}' genome.fa > filename.fa
To keep the line breaks:
awk '!/^>/' genome.fa > filename.fa
To create one file per sequence, named after its header:
awk 'split($0,a,"^>")>1{file=a[2];next}{print >file}' genome.fa
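The single-line concatenation ends without a final newline, which some tools dislike. A small sketch that adds one in an END block and passes the data as an argument to a "%s" format, so that a stray % in the input cannot be misread as a format specifier. The shortened sequences are made up for illustration:

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Shortened stand-in for genome.fa; the header names are taken from the question.
printf '%s\n' '>Potrs164783' 'AGGAAG' 'TGTTTG' '>Potrs164784' 'TTACCC' > "$workdir/genome.fa"

# Print every non-header line without a line break, then one newline at the end.
joined=$(awk '!/^>/ { printf "%s", $0 } END { print "" }' "$workdir/genome.fa")
printf '%s\n' "$joined"
```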

Augmentation of a path using awk and/or sed commands

I'm an awk and sed newbie. I have the following string in a file
mount --bind /vsepr_app_repo/fedora/20/plone/4.3.4/Plone/buildout-cache/downloads /buildout-cache/downloads
and I want to produce the following output from it:
mount --bind /vsepr_app_repo/fedora/20/plone/4.3.4/Plone/buildout-cache/downloads /Plone/buildout-cache/downloads
How can I do that using sed and awk commands in a shell script?
I want to repeat the same operations on many lines of my file.
Any suggestion would help me a lot.
Without knowing a few more details, the following awk command is a start.
$ cat data
mount --bind /vsepr_app_repo/fedora/20/plone/4.3.4/Plone/buildout-cache/downloads /buildout-cache/downloads
$ awk '/^mount.*buildout-cache.downloads/ { $NF = "/Plone" $NF; print }' < data
mount --bind /vsepr_app_repo/fedora/20/plone/4.3.4/Plone/buildout-cache/downloads /Plone/buildout-cache/downloads
It will prefix the last token on the line with /Plone for any line that starts with mount and contains buildout-cache/downloads.
The /^mount.*buildout-cache.downloads/ part makes the block apply only to lines that match the regular expression. The action block uses $NF, which is a reference to the last field on the line, prefixes it with "/Plone", and then prints the entire line.
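Because the question mentions a file with many lines, the following sketch keeps the non-matching lines too and writes the result back to the file. It is only an illustration: the comment line in the sample data and the guard against prefixing twice are assumptions, not part of the question.

```shell
#!/bin/sh
workdir=$(mktemp -d)

cat > "$workdir/data" <<'EOF'
# bind mounts (hypothetical comment line)
mount --bind /vsepr_app_repo/fedora/20/plone/4.3.4/Plone/buildout-cache/downloads /buildout-cache/downloads
EOF

# Prefix the last field on matching lines unless it already starts with /Plone/;
# the trailing 1 prints every line, changed or not.
awk '/^mount.*buildout-cache.downloads/ && $NF !~ /^\/Plone\// { $NF = "/Plone" $NF } 1' \
    "$workdir/data" > "$workdir/data.new" && mv "$workdir/data.new" "$workdir/data"

cat "$workdir/data"
```

Note that assigning to $NF makes awk rebuild the line with single spaces between fields, so any runs of whitespace on modified lines are collapsed.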
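For a generic folder name in the path:
sed 's#\(/[^/]\{1,\}\)\(/.\{1,\}\)\([[:space:]]\{1,\}\)\2$#\1\2\3\1\2#' YourFile
This uses the last path as the pattern to look for at the end of the first path, and copies the folder that precedes it there in front of the last path.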

Processing of awk with multiple variable from previous processing?

I have a question about awk processing. I have the file below:
cat test.txt
/home/shhh/
abc.c
/home/shhh/2/
def.c
gthjrjrdj.c
/kernel/sssh
sarawtera.c
wrawrt.h
wearwaerw.h
My goal is to build full paths such as /home/jhyoon/abc.c by joining each directory line with the file names that follow it.
This is the command I got from someone:
cat test.txt | awk '/^\/.*/{path=$0}/^[a-zA-Z]/{printf("%s/%s\n",path,$0);}'
It works, but I do not understand how to interpret it step by step.
Could you explain how to read it?
Result:
/home/shhh//abc.c
/home/shhh/2//def.c
/home/shhh/2//gthjrjrdj.c
/kernel/sssh/sarawtera.c
/kernel/sssh/wrawrt.h
/kernel/sssh/wearwaerw.h
What you probably want is the following:
$ awk '/^\//{path=$0}/^[a-zA-Z]/ {printf("%s/%s\n",path,$0)}' file
/home/jhyoon//abc.c
/home/jhyoon/2//def.c
/home/jhyoon/2//gthjrjrdj.c
/kernel/sssh/sarawtera.c
/kernel/sssh/wrawrt.h
/kernel/sssh/wearwaerw.h
Explanation
/^\//{path=$0} on lines starting with a /, store it in the path variable.
/^[a-zA-Z]/ {printf("%s/%s\n",path,$0)} on lines starting with a letter, print the stored path together with the current line.
Note you can also say
awk '/^\//{path=$0; next} {printf("%s/%s\n",path,$0)}' file
Some comments
cat file | awk '...' is better written as awk '...' file.
You don't need the ; at the end of a block {} if you are executing just one command. It is implicit.
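The output above contains // wherever the directory line already ends with a slash. If that matters, the trailing slash can be removed before the path is stored. A minimal sketch using the test.txt contents from the question; the scratch directory is an assumption:

```shell
#!/bin/sh
workdir=$(mktemp -d)

cat > "$workdir/test.txt" <<'EOF'
/home/shhh/
abc.c
/home/shhh/2/
def.c
gthjrjrdj.c
/kernel/sssh
sarawtera.c
wrawrt.h
wearwaerw.h
EOF

result=$(awk '
/^\// { path = $0; sub(/\/+$/, "", path); next }   # store the directory without trailing slashes
NF    { print path "/" $0 }                         # join directory and file name
' "$workdir/test.txt")
printf '%s\n' "$result"
```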

Getting substring using ksh script

I'm using a ksh script to determine the delimiter in a file using awk. I know this delimiter will always be in the 4th position on the first line. The issue is that the delimiter used in one particular file is a *, so instead of returning * in the variable, the script returns a file list. Here is sample text from my file along with my script:
text in file:
XXX*XX* *XX*XXXXXXX.......
Here is roughly what my script looks like (I don't have the script in front of me, but you get the gist):
delimiter=$(awk '{substr $0, 4, 1}' file.txt)
echo ${delimiter} # lists files in directory..file.txt file1.txt file2.txt instead of * which is the desired result
Thank you in advance,
Anthony
Birei is right about your problem. However, your awk expression isn't limited to the first line. You can replace it with:
'NR==1 {print substr($0, 4, 1)}'
Then you can do a simple:
echo "$delimiter"
The shell is interpreting the content of the delimiter variable. You need to quote it to avoid this behaviour:
echo "${delimiter}"
It will print *
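Putting both answers together, a short sketch of the whole flow. The sample line is the visible start of the line from the question, the second line is invented, and exit is added so awk stops reading after the first line:

```shell
#!/bin/sh
workdir=$(mktemp -d)
printf '%s\n' 'XXX*XX* *XX*XXXXXXX' 'second line' > "$workdir/file.txt"

# Take the 4th character of the first line only.
delimiter=$(awk 'NR == 1 { print substr($0, 4, 1); exit }' "$workdir/file.txt")

# Quoting the expansion keeps the shell from treating * as a file name pattern.
printf 'delimiter is: %s\n' "$delimiter"
```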