Processing of awk with multiple variable from previous processing? - awk

I have a question about awk processing. I have the file below:
cat test.txt
/home/shhh/
abc.c
/home/shhh/2/
def.c
gthjrjrdj.c
/kernel/sssh
sarawtera.c
wrawrt.h
wearwaerw.h
My goal is to build full paths such as /home/shhh/abc.c by joining each directory line with the file names that follow it.
This is the command I got from someone:
cat test.txt | awk '/^\/.*/{path=$0}/^[a-zA-Z]/{printf("%s/%s\n",path,$0);}'
It works, but I do not understand how to interpret it step by step.
Could you explain how it works?
Result:
/home/shhh//abc.c
/home/shhh/2//def.c
/home/shhh/2//gthjrjrdj.c
/kernel/sssh/sarawtera.c
/kernel/sssh/wrawrt.h
/kernel/sssh/wearwaerw.h

What you probably want is the following:
$ awk '/^\//{path=$0}/^[a-zA-Z]/ {printf("%s/%s\n",path,$0)}' file
/home/shhh//abc.c
/home/shhh/2//def.c
/home/shhh/2//gthjrjrdj.c
/kernel/sssh/sarawtera.c
/kernel/sssh/wrawrt.h
/kernel/sssh/wearwaerw.h
Explanation
/^\//{path=$0} on lines starting with a /, store it in the path variable.
/^[a-zA-Z]/ {printf("%s/%s\n",path,$0)} on lines starting with a letter, print the stored path together with the current line.
Note you can also say
awk '/^\//{path=$0; next} {printf("%s/%s\n",path,$0)}' file
Some comments
cat file | awk '...' is better written as awk '...' file.
You don't need the ; at the end of a block {} if you are executing just one command. It is implicit.
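The doubled slash in the output appears because some directory lines already end in /. A minimal sketch, assuming you want a single slash: strip any trailing / from the stored path before joining. The sub() call and the shortened sample input are additions for illustration, not part of the original answer.

```shell
# Sample input mirrors the question; the sub() call is an illustrative addition.
paths=$(printf '%s\n' /home/shhh/ abc.c /home/shhh/2/ def.c /kernel/sssh wrawrt.h |
  awk '
    /^\// { path = $0; sub(/\/$/, "", path); next }   # directory line: remember it without a trailing slash
          { print path "/" $0 }                       # any other line: join it with the remembered directory
  ')
printf '%s\n' "$paths"
```

The next statement keeps directory lines from falling through to the second rule, so that rule needs no pattern of its own.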

Related

Using pipe and shell command in awk script

I am writing an awk script which needs to produce an output which needs to be sorted.
I am able to get the desired unsorted output in an awk array. I tried the following code to sort the array. It works, but I don't know why, or whether it is the expected behavior.
Sample Input to the question:
Ram,London,200
Alex,London,500
David,Birmingham,300
Axel,Mumbai,150
John,Seoul,450
Jen,Tokyo,600
Sarah,Tokyo,630
The expected output should be:
Birmingham,300
London,700
Mumbai,150
Seoul,450
Tokyo,1230
The following script is required to show the city name along with the respective cumulative total of the integers present in the third field.
BEGIN{
FS = ","
OFS = ","
}
{
if($2 in arr){
arr[$2]+=$3;
}else{
arr[$2]=$3;
}
}
END{
for(i in arr){
print i,arr[i] | "sort"
}
}
The following code is in question:
for(i in arr){
print i,arr[i] | "sort"
}
The output of the print is piped to sort, which is a bash command.
So, how does this output travel from awk to bash?
Is this the expected behavior or a mere side effect?
Is there a better awk way to do it? Have tried asort and asorti already, but they exist with gawk and not awk.
PS: I am trying to specifically write a .awk file for the task, without using bash commands. Please suggest the same.
Addressing your specific questions in order:
So, how does this output travel from awk to bash?
A pipe to a spawned process.
Is this the expected behavior or a mere side effect?
Expected
Is there a better awk way to do it? Have tried asort and asorti already, but they exist with gawk and not awk.
Yes, pipe the output of the whole awk command to sort.
PS: I am trying to specifically write a .awk file for the task, without using bash commands. Please suggest the same.
See https://web.archive.org/web/20150928141114/http://awk.info/?Sorting for the implementation of a few common sorting algorithms in awk. See also https://rosettacode.org/wiki/Category:Sorting_Algorithms.
With respect to the question in your comments:
Since a process is spawned to sort from within the loop in the END rule, I was confused whether this will call the sort function on a single line and the spawned process will die there after, and a new process to sort will be spawned in the next iteration of the loop
The spawned process won't die until your awk script terminates or you call close("sort").
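To make the lifetime of the pipe concrete, here is a small sketch under the assumption that you want something printed after the sorted lines. The variable names (cmd, total, report) and the trailing "done" line are illustrative only. Without the close() call, awk's own output may appear before sort's output, because sort only emits its result once its input is closed.

```shell
report=$(printf '%s\n' 'Ram,London,200' 'Alex,London,500' 'David,Birmingham,300' |
  awk -F, -v OFS=, '
    { total[$2] += $3 }
    END {
      cmd = "sort"                        # the same string identifies the same pipe
      for (city in total)
        print city, total[city] | cmd     # first use spawns sort; later uses reuse it
      close(cmd)                          # send EOF to sort and wait for it to finish
      print "done"                        # emitted after the sorted lines
    }
  ')
printf '%s\n' "$report"
```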
Could you please try changing your sort to sort -t',' -k1 in your code? Since your delimiter is a comma, you need to tell sort that the delimiter is something other than whitespace. By default sort splits fields on blanks.
Also, you could remove the if/else block from your main block and use only arr[$2]+=$3. Keep the rest of the code as it is, apart from the sort change mentioned above.
I am on mobile so couldn't paste all code but explanation should help you here.
What I would suggest is piping the output of awk to sort and not try and worry about piping the output within the END rule. While GNU awk provides asorti() to allow sorting the contents of an array, in this case since it is just the output you want sorted, a single pipe to sort after your awk script completes is all you need, e.g.
$ awk -F, -v OFS=, '{a[$2]+=$3}END{for(i in a)print i, a[i]}' file | sort
Birmingham,300
London,700
Mumbai,150
Seoul,450
Tokyo,1230
Since it is a single pipe applied to the finished output, the sorting stays entirely outside the awk script. (The in-script print ... | "sort" does not spawn a process per iteration either: awk opens that pipe once and reuses it.)
If you want to avoid the | syntax and you have bash, you can use process substitution with redirection, e.g.
$ sort < <(awk -F, -v OFS=, '{a[$2]+=$3}END{for(i in a)print i, a[i]}' file)
(same result)
If you have GNU awk, then asorti() will sort the indices of a into a new array b (indexed 1 to n) and return n. Loop over b by number, because for (i in b) does not guarantee any order, and output the sorted results within the END rule, e.g.
$ awk -F, -v OFS=, '{a[$2]+=$3}END{n=asorti(a,b);for(i=1;i<=n;i++)print b[i], a[b[i]]}' file
Birmingham,300
London,700
Mumbai,150
Seoul,450
Tokyo,1230
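If neither GNU awk nor an external sort is acceptable, the keys can be sorted by hand. The following is a sketch using a plain insertion sort, which should be fine for a small number of cities. The array names keys and total are illustrative, and this is only one of many possible algorithms.

```shell
sorted=$(printf '%s\n' 'Ram,London,200' 'Alex,London,500' 'David,Birmingham,300' \
                       'Axel,Mumbai,150' 'John,Seoul,450' 'Jen,Tokyo,600' 'Sarah,Tokyo,630' |
  awk -F, -v OFS=, '
    { total[$2] += $3 }
    END {
      n = 0
      for (city in total) keys[++n] = city           # copy the indices into a numbered array
      for (i = 2; i <= n; i++) {                     # insertion sort on the city names
        k = keys[i]
        for (j = i - 1; j >= 1 && keys[j] > k; j--)
          keys[j + 1] = keys[j]                      # shift larger names one slot up
        keys[j + 1] = k
      }
      for (i = 1; i <= n; i++) print keys[i], total[keys[i]]
    }
  ')
printf '%s\n' "$sorted"
```

The same END block can be placed in a .awk file unchanged, which keeps the whole task inside awk.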

Find replace "./." in awk

I am very new to using linux and I am trying to find/replace some of the text in my file.
I have successfully been able to find and replace "0/0" using gsub:
awk '{gsub(/0\/0/,"0")}; 1' filename
However, if I try to replace "./." using the same idea
awk '{gsub(/\.\/\./,"U")}; 1' filename
the output is truncated and stops at the location of the first "./." in the file. I know that "." is a special wildcard character, but I thought that having the "\" in front of it would neutralize it. I have searched but have been unable to find an explanation why the formula I used would truncate the file.
Any thoughts would be much appreciated. Thank you.
Recall that the basic outline of an awk program is:
awk 'pattern { action }'
The most common patterns are regexes or tests against line counts:
awk '/FOO/ { do_something_with_a_line_with_FOO_in_it }'
awk 'FNR==10'
The last one has no action so the default is to print the line.
But functions that return a value are also useable as patterns. gsub is a function and returns the number of substitutions.
So given:
$ echo "$txt"
abc./.def line 1
ghk/lmn won't get printed
abc./.def abc./.def printed
To print only lines that have a successful substitution you can do:
$ echo "$txt" | awk 'gsub(/\.\/\./,"U")'
abcUdef line 1
abcUdef abcUdef printed
You do not need to put gsub into an action block, since you want to run it on every line and its return value tells you what happened. Lines with at least one substitution are printed, because gsub returns the number of substitutions and any non-zero value counts as true.
If you want every line printed regardless if there is a match:
$ echo "$txt" | awk 'gsub(/\.\/\./,"U") || 1'
abcUdef line 1
ghk/lmn won't get printed
abcUdef abcUdef printed
Or, you can use the function as an action with an empty pattern and then a 1 with an empty action:
$ echo "$txt" | awk '{gsub(/\.\/\./,"U")} 1'
abcUdef line 1
ghk/lmn won't get printed
abcUdef abcUdef printed
In either case, 1 as a pattern with no action prints the line regardless if there is a match and the gsub makes the substitution if any.
The second awk is what you have. Why it is not working is probably related to your input data.
Your awk script is fine, your input contains control-Ms, probably from being created by a Windows program. You can see them with cat -v file and use dos2unix or similar to remove them.
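If converting the file first is not convenient, the carriage returns can also be stripped inside the same awk call. This is a sketch under the assumption that every affected line ends in CR LF. The sample text is made up to imitate a Windows-created file.

```shell
# printf emits CR LF line endings to imitate a file created on Windows.
clean=$(printf 'abc./.def line 1\r\nghk/lmn\r\n' |
  awk '{ sub(/\r$/, ""); gsub(/\.\/\./, "U") } 1')   # drop the trailing CR, then substitute
printf '%s\n' "$clean"
```

A stray CR moves the cursor back to the start of the line when the terminal displays it, which is one way output can look truncated or overwritten even though the data is all there.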

Concatenating lines using awk

I have fasta file that contains two gene sequences and what I want to do is remove the fasta header (line starting with ">"), concatenate the rest of the lines and output that sequence
Here is my fasta sequence (genome.fa):
>Potrs164783
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
>Potrs164784
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Desired output
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
I am using awk to do this but I am getting this error
awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa
awk: syntax error at source line 1
context is
BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >>> >filename. <<< fa;}
awk: illegal statement at source line 1
I am basically a python person and I was given this script by someone. What am I doing wrong here?
I realized that I was not clear, so I am pasting the whole code that I got from someone. The input file and desired output remain the same.
mkdir split_genome;
cd split_genome;
awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa;
ls -1 `pwd`/* > ../scaffold_list.txt;
cd ..;
If all you want to do is produce the desired output shown in your question, other solutions will work.
However, the script you have is trying to print each sequence to a file that is named using its header, and the extension .fa.
The syntax error you're getting is because filename.fa is neither a variable nor a fixed string. No Awk will allow you to print to filename.fa, because it is neither quoted nor a valid variable name (variable names can't contain a .). In addition, BSD Awk does not accept an unparenthesized string concatenation as the target of a redirection, whereas GNU Awk does.
So the solution:
print $0 > filename".fa"
would produce the same error in BSD Awk, but would work in GNU Awk.
To fix this, you can append the extension ".fa" to filename at assignment.
This will do the job:
$ awk '{if($0 ~ /^>/) filename=substr($0, 2)".fa"; else print $0 > filename}' file
$ cat Potrs164783.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
$ cat Potrs164784.fa
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
You'll notice I left out the BEGIN{filename="file1"} declaration statement as it is unnecessary. Also, I replaced the need for sub(...) by using the string function substr as it is more clear and requires fewer actions.
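Another option, sketched here rather than taken from the answer above, is to parenthesize the concatenation in the redirection, which is the form generally recommended for portability across awk implementations. The sequence names seqA and seqB and the scratch directory are hypothetical. The close() call is an addition that helps when a genome has more headers than the system allows open files.

```shell
workdir=$(mktemp -d)    # scratch directory so the sketch does not touch real data
printf '%s\n' '>seqA' 'AAAA' 'CCCC' '>seqB' 'GGGG' > "$workdir/genome.fa"
(
  cd "$workdir" &&
  awk '
    /^>/ { if (name != "") close(name ".fa"); name = substr($0, 2); next }
         { print > (name ".fa") }    # parentheses make the redirection target unambiguous
  ' genome.fa
)
cat "$workdir/seqA.fa"
```

This assumes the file starts with a header line; a sequence line before the first header would be written to a file named just .fa.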
The awk code that you show attempts to do something different than produce the output that you want. Fortunately, there are much simpler ways to obtain your desired output. For example:
$ grep -v '>' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Alternatively, if you had intended to have all non-header lines concatenated into one line:
$ sed -n '/^>/!H; $!d; x; s/\n//gp' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGATTGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAACTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAATTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCCGGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
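Try this to print the lines that do not start with > joined into one line (the "%s" format keeps a % in the data from being treated as a format specifier):
awk '!/^>/{printf "%s", $0}' genome.fa > filename.fa
With the line breaks kept: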
awk '!/^>/' genome.fa > filename.fa
To create single files named by the headers:
awk 'split($0,a,"^>")>1{file=a[2];next}{print >file}' genome.fa

Augmentation of a path using awk and/or sed commands

I'm an awk and sed newbie. I have the following string in a file
mount --bind /vsepr_app_repo/fedora/20/plone/4.3.4/Plone/buildout-cache/downloads /buildout-cache/downloads
and I want to produce the following output from it:
mount --bind /vsepr_app_repo/fedora/20/plone/4.3.4/Plone/buildout-cache/downloads /Plone/buildout-cache/downloads
How can I do that using sed and awk commands in a shell script?
I want to repeat the same operations on many lines of my file.
Any suggestion would help me a lot.
Without knowing a few more details, the following awk command is a start.
$ cat data
mount --bind /vsepr_app_repo/fedora/20/plone/4.3.4/Plone/buildout-cache/downloads /buildout-cache/downloads
$ awk '/^mount.*buildout-cache.downloads/ { $NF = "/Plone" $NF; print }' < data
mount --bind /vsepr_app_repo/fedora/20/plone/4.3.4/Plone/buildout-cache/downloads /Plone/buildout-cache/downloads
It will prefix the last token on the line with /Plone for any line that starts with mount and contains buildout-cache/downloads. Note that assigning to $NF makes awk rebuild the line with single spaces between fields.
The /^mount.*buildout-cache.downloads/ part makes the block apply to lines that match the regular expression. The command block uses $NF which is a reference to the last field on the line, prepends it with "/Plone", and then prints the entire line out.
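For a generic folder name in the path:
sed 's#\(/[^/]\{1,\}\)\(/.\{1,\}\)\([[:space:]]\{1,\}\)\2$#\1\2\3\1\2#' YourFile
This uses the last path on the line as a pattern: it finds that path as the tail of the first path and prefixes it with the directory component that precedes it there.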
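The same generic idea can be sketched in awk, assuming the last two fields are the source and target paths and the target is a suffix of the source. The variable names src, dst, head and part are illustrative only.

```shell
line='mount --bind /vsepr_app_repo/fedora/20/plone/4.3.4/Plone/buildout-cache/downloads /buildout-cache/downloads'
fixed=$(printf '%s\n' "$line" |
  awk '{
    src = $(NF - 1); dst = $NF
    pos = length(src) - length(dst)               # where dst would start inside src
    if (pos > 0 && substr(src, pos + 1) == dst) { # src really ends with dst
      head = substr(src, 1, pos)                  # everything before the shared tail
      n = split(head, part, "/")
      $NF = "/" part[n] dst                       # prepend the last directory of head
    }
    print
  }')
printf '%s\n' "$fixed"
```

Lines where the target is not a suffix of the source are printed unchanged.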

Extract data from ASCII file with grep/AWK

I have a long ASCII log-file from a simulation and need to extract some data from it.
The lines I want have the structure:
Main step= 1 a= 0.00E+00 b=-6.85E-08 c= 4.58E-08
The phrase "Main step" is only used in the lines I want. This is easy to grep for, but I also want to include the next line following the line above, which has the structure:
Fine step= 1 t=-1.31854E+01
Note that "Fine step" is used other places in the log-file.
My question boils down to this: How can I extract the lines containing a keyword/phrase (here "Main step") and also make sure that I get the next following line using grep or AWK or some other standard Linux program?
You can use sed
sed -n '/Main step/,/./p' inputFile
This prints only the lines in a range that starts at a line matching Main step and ends at the next line matching . (the wildcard, so any non-empty line). Effectively, every Main step line and the line following it are printed.
Since the question is tagged awk, here is one way using awk's getline function:
awk '/Main step/{print; getline; print}' file
It would print the Main step line and also the next line.
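A getline-free variant is also possible using a small countdown, a common awk idiom. This is a sketch with made-up sample values in the shape of the log lines from the question. The counter name n is arbitrary.

```shell
picked=$(printf '%s\n' 'Main step= 1 a= 0.00E+00' 'Fine step= 1 t=-1.3E+01' 'other output' \
                       'Fine step= 7 t= 2.0E+00' 'Main step= 2 a= 1.00E+00' 'Fine step= 2 t=-1.2E+01' |
  awk '/Main step/ { n = 2 }    # a match arms a two-line countdown
       n && n--                 # print while the countdown is non-zero
  ')
printf '%s\n' "$picked"
```

Changing n = 2 to a larger number prints that many lines starting at each match, which plays the same role as the number given to grep -A.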
Because you tagged "grep", and since this is the most obvious solution to me:
grep -A1 'Main step' file
...although this will add "--" between matches. So to get the same output as the awk and sed answer:
grep -A1 'Main step' file | grep -v '^--$'