Using pipe and shell command in awk script - awk

I am writing an awk script whose output needs to be sorted.
I am able to get the desired unsorted output in an awk array. I tried the following code to sort the array, and it works, but I don't know why or whether this is the expected behavior.
Sample Input to the question:
Ram,London,200
Alex,London,500
David,Birmingham,300
Axel,Mumbai,150
John,Seoul,450
Jen,Tokyo,600
Sarah,Tokyo,630
The expected output should be:
Birmingham,300
London,700
Mumbai,150
Seoul,450
Tokyo,1230
The following script is required to show the city name along with the respective cumulative total of the integers present in the third field.
BEGIN{
FS = ","
OFS = ","
}
{
if($2 in arr){
arr[$2]+=$3;
}else{
arr[$2]=$3;
}
}
END{
for(i in arr){
print i,arr[i] | "sort"
}
}
The following code is in question:
for(i in arr){
print i,arr[i] | "sort"
}
The output of the print is piped to sort, which is a bash command.
So, how does this output travel from awk to bash?
Is this the expected behavior or a mere side effect?
Is there a better awk way to do it? Have tried asort and asorti already, but they exist with gawk and not awk.
PS: I am trying to specifically write a .awk file for the task, without using bash commands. Please suggest the same.

Addressing your specific questions in order:
So, how does this output travel from awk to bash?
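Through a pipe to a spawned process: awk starts a shell (sh, not necessarily bash) to run sort and writes the printed lines to that command's standard input.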
Is this the expected behavior or a mere side effect?
Expected. print ... | "command" is part of standard (POSIX) awk.
Is there a better awk way to do it? Have tried asort and asorti already, but they exist with gawk and not awk.
Yes, pipe the output of the whole awk command to sort.
PS: I am trying to specifically write a .awk file for the task, without using bash commands. Please suggest the same.
See https://web.archive.org/web/20150928141114/http://awk.info/?Sorting for the implementation of a few common sorting algorithms in awk. See also https://rosettacode.org/wiki/Category:Sorting_Algorithms.
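As a rough illustration of what such a hand-written sort can look like, here is a minimal sketch of an insertion sort over the array keys, using only POSIX awk features. The awk program is wrapped in a shell command only so it can be run as-is; the part between the single quotes could be saved as a .awk file. The names keys, n and total are my own, and the input is a shortened copy of the sample from the question.

```shell
# Sketch only: insertion sort of the array indices in plain POSIX awk.
result=$(printf '%s\n' 'Ram,London,200' 'Alex,London,500' \
                       'David,Birmingham,300' 'Axel,Mumbai,150' |
awk -F, -v OFS=, '
    { total[$2] += $3 }                      # cumulative total per city
    END {
        n = 0
        for (city in total) keys[++n] = city # copy indices into a 1..n array
        # Insertion sort; subscripts are strings, so > compares as strings.
        for (i = 2; i <= n; i++) {
            k = keys[i]
            for (j = i - 1; j >= 1 && keys[j] > k; j--)
                keys[j + 1] = keys[j]
            keys[j + 1] = k
        }
        for (i = 1; i <= n; i++) print keys[i], total[keys[i]]
    }')
printf '%s\n' "$result"
```

Insertion sort is quadratic, which is fine for a handful of cities; for large arrays one of the other algorithms from the links above would be a better fit.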
With respect to the question in your comments:
Since a process is spawned to sort from within the loop in the END rule, I was confused whether this will call the sort function on a single line and the spawned process will die there after, and a new process to sort will be spawned in the next iteration of the loop
The spawned process won't die until your awk script terminates or you call close("sort").
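To see this in action, here is a small sketch (the temporary file and the trailing "done" line are only there for the demonstration). Every print inside the loop goes to the same sort process; sort only produces its output once close() signals end of input, and anything printed afterwards follows the sorted lines.

```shell
# Sketch only: one sort process receives every line; close() ends it.
input=$(mktemp)
printf '%s\n' 'Ram,London,200' 'Alex,London,500' 'David,Birmingham,300' > "$input"

result=$(awk -F, -v OFS=, '
    { total[$2] += $3 }
    END {
        cmd = "sort"
        for (city in total)
            print city, total[city] | cmd   # same pipe on every iteration
        close(cmd)                          # sort sees EOF and writes its output
        print "done"                        # appears after the sorted lines
    }' "$input")
rm -f "$input"
printf '%s\n' "$result"
```

Keeping the command string in a variable (cmd here) avoids the classic mistake of spelling it differently in print and close(), which would leave the pipe open.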

Could you please try changing your sort to sort -t',' -k1 in your code. Since your delimiter is a comma, you need to tell sort that the delimiter is something other than whitespace. By default sort splits fields on blanks, not on commas.
You could also remove the if/else block from your main block and use only arr[$2]+=$3. Keep the rest of the code as it is, apart from the sort change mentioned above.
I am on mobile so couldn't paste all code but explanation should help you here.
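For completeness, a sketch of what the two suggestions might look like together, run on the sample data from the question. Here I sort on the second field numerically (-k2,2n) rather than on the first, purely to show a case where -t, changes the result; that choice is mine, not the asker's requirement.

```shell
# Sketch only: simplified accumulation, sort told that "," is the delimiter.
result=$(printf '%s\n' 'Ram,London,200' 'Alex,London,500' 'David,Birmingham,300' \
                       'Axel,Mumbai,150' 'John,Seoul,450' 'Jen,Tokyo,600' 'Sarah,Tokyo,630' |
awk -F, -v OFS=, '
    { arr[$2] += $3 }                                   # no if/else needed
    END { for (i in arr) print i, arr[i] | "sort -t, -k2,2n" }')
printf '%s\n' "$result"
```

Without -t, the whole line is a single field for sort, so -k2,2n would have nothing to compare.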

What I would suggest is piping the output of awk to sort and not try and worry about piping the output within the END rule. While GNU awk provides asorti() to allow sorting the contents of an array, in this case since it is just the output you want sorted, a single pipe to sort after your awk script completes is all you need, e.g.
$ awk -F, -v OFS=, '{a[$2]+=$3}END{for(i in a)print i, a[i]}' file | sort
Birmingham,300
London,700
Mumbai,150
Seoul,450
Tokyo,1230
And since it is a single pipe of the whole output, set up once by the shell, awk does not have to spawn or manage a sort process itself.
If you would rather not write the pipe explicitly and you have bash, you can use process substitution with redirection, e.g.
$ sort < <(awk -F, -v OFS=, '{a[$2]+=$3}END{for(i in a)print i, a[i]}' file)
(same result)
If you have GNU awk, then asorti() will sort a by index and you can place the sorted array in a new array b and then output the sorted results within the END rule, e.g.
$ awk -F, -v OFS=, '{a[$2]+=$3}END{n=asorti(a,b);for(i=1;i<=n;i++)print b[i], a[b[i]]}' file
Birmingham,300
London,700
Mumbai,150
Seoul,450
Tokyo,1230

How to AWK print only specific item?

I have a log file that looks like this:
RPT_LINKS=1,T1999
RPT_NUMALINKS=1
RPT_ALINKS=1,1999TK,2135,2009,31462,29467,2560
RPT_TXKEYED=1
RPT_ETXKEYED=0
I have used grep to isolate the line I am interested in with the RPT_ALINKS. In that line I want to know how to use AWK to print only the link that ends with a TK.
I am really close running this:
grep -w 'RPT_ALINKS' stats2.log | awk -F 'TK' '{print FS }'
But I am sure those who are smarter than me already know I am getting only the TK back, how do I get the entire field so that I would get a return of 1999TK?
If there is only a single TK in that line and TK is always at the end of its field:
awk '/RPT_ALINKS/{match($0,/[^=,]*TK/); print substr($0,RSTART,RLENGTH)}'
You can also use a double grep
grep -w 'RPT_ALINKS' stats2.log | grep -wo '[^=,]*TK'
The following sed solution also works nicely:
sed '/RPT_ALINKS/s/\(^.*[,=]\)\([^=,]*TK\)\(,.*\)\?/\2/'
It doesn't get any more elegant
awk -F '=' '$1=="RPT_ALINKS" {n=split($2,array,",")
for(i=1; i<=n; i++)
if (array[i] ~ /TK$/)
{print array[i]}}
' stats2.log
n=split($2,array,","): splits 1,1999TK,2135,2009,31462,29467,2560 on , into the array named array. n holds the number of array elements, here 7.
Here is a simple solution
awk -F ',|=' '/^RPT_ALINKS/ { for (i=1; i<=NF; i++) if ($i ~ /TK$/) print $i }' stats2.log
It only looks at records that begin with RPT_ALINKS, and there it checks every field. If a field ends with TK, it prints that field.
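One nice property of the field loop is that it keeps working when a line has more than one link ending in TK. A quick sketch, using a made-up line with a second TK entry (2560TK is my invention, not from the log in the question):

```shell
# Sketch only: hypothetical log line with two *TK links.
result=$(printf '%s\n' 'RPT_LINKS=1,T1999' 'RPT_ALINKS=2,1999TK,2135,2560TK' |
    awk -F '[,=]' '/^RPT_ALINKS/ { for (i = 1; i <= NF; i++) if ($i ~ /TK$/) print $i }')
printf '%s\n' "$result"
```

The bracket expression [,=] is equivalent to the ',|=' separator used above.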
Dang, I was just about to post the double-grep alternative, but got scooped. And all the good awk solutions are taken as well.
Sigh. So here we go in bash, for fun.
$ mapfile a < stats2.log
$ for i in "${a[@]}"; do [[ $i =~ ^RPT_ALINKS=(.+,)*([^,]+TK) ]] && echo "${BASH_REMATCH[2]}"; done
1999TK
This has the disadvantage of running way slower than awk and not using fields. Oh, and it won't handle multiple *TK items on a single line. And like sed, this is processing lines as patterns rather than fields, which saps elegance. And by using mapfile, we limit the size of input you can handle because your whole log is loaded into memory. Of course you don't really need to do that, but if you were going to use a pipe, you'd use a different tool anyway. :-)
Happy Thursday.
With a sed that has -E for EREs, e.g. GNU or OSX/BSD sed:
$ sed -En 's/^RPT_ALINKS=(.*,)?([^,]*TK)(,.*|$)/\2/p' file
1999TK
With GNU awk for the 3rd arg to match():
$ awk 'match($0",",/^RPT_ALINKS=(.*,)?([^,]*TK),.*/,a){print a[2]}' file
1999TK
Instead of looping over the fields, you can use another approach.
It avoids the loop, which may be faster on large input.
awk -F"TK" '/RPT_ALINKS/ {b=split($1,a,",");print a[b]FS}' stats2.log
1999TK
Here the field separator is set to TK, and only lines that contain RPT_ALINKS are processed.
That gives $1=RPT_ALINKS=1,1999 and $2=,2135,2009,31462,29467,2560
The value we want is always the part of $1 after its last comma.
So split $1 on commas using the split function; b then holds the number of pieces.
Since the value is the last piece, we print a[b] followed by FS, which contains TK.

awk command to print only part of matching lines

I need an awk command that compares the lines in a file and prints only the first line when the following lines merely add new words to it.
For example, file.txt contains:
i am going
i am going today
i am going with my friend
output should be
i am going
This will work for the sample input but may fail on the real data; without representative input we can't know...
$ awk 'NR>1 && $0~p {if(!f) print p; f=1; next} {p=$0; f=0}' file
i am going
You may want to play with p=$0 to restrict the match to a number of fields if the line lengths are not in increasing order...
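One place where the real input could trip this up: $0~p treats the previous line as a regular expression, so a line containing characters such as . or ( would match more (or less) than intended. A possible variation, assuming "adds new words" means the earlier line is a literal prefix of the later one, is to use index() instead:

```shell
# Sketch only: literal prefix test with index() instead of a regex match.
result=$(printf '%s\n' 'i am going' 'i am going today' 'i am going with my friend' |
    awk 'NR > 1 && index($0, p) == 1 { if (!f) print p; f = 1; next }
         { p = $0; f = 0 }')
printf '%s\n' "$result"
```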

Output field separators in awk after substitution in fields

Is it always the case, after modifying a specific field in awk, that information on the output field separator is lost? What happens if there are multiple field separators and I want them to be recovered?
For example, suppose I have a simple file example that contains:
a:e:i:o:u
If I just run an awk script, which takes account of the input field separator, that prints each line in my file, such as running
awk -F: '{print $0}' example
I will see the original line. If however I modify one of the fields directly, e.g. with
awk -F: '{$2=$2"!"; print $0}' example
I do not get back a modified version of the original line, rather I see the fields separated by the default whitespace separator, i.e:
a e! i o u
I can get back a modified version of the original by specifying OFS, e.g.:
awk -F: 'BEGIN {OFS=":"} {$2=$2"!"; print $0}' example
When there are multiple potential field separators, however, is there a simple way of restoring the original separators?
For example, if example had both : and ; as separators, I could use -F":|;" to process the file, but OFS would not be sufficient to restore the original separators in their relative positions.
More explicitly, if we switched to example2 containing
a:e;i:o;u
we could use
awk -F":|;" 'BEGIN {OFS=":"} {$2=$2"!"; print $0}' example2
(or -F"[:;]") to get
a:e!:i:o:u
but we've lost the distinction between : and ; which would have been maintained if we could recover
a:e!;i:o;u
You need to use GNU awk for the 4th arg to split() which saves the separators, like RT does for RS:
$ awk -F'[:;]' '{split($0,f,FS,s); $2=$2"!"; r=s[0]; for (i=1;i<=NF;i++) r=r $i s[i]; $0=r} 1' file
a:e!;i:o;u
There is no automatically populated array of FS matching strings because of how expensive it'd be in time and memory to store the string that matches FS every time you split a record into fields. Instead the GNU awk folks provided a 4th arg to split() so you can do it yourself if/when you want it. That is the result of a long conversation a few years ago in the comp.lang.awk newsgroup between experienced awk users and gawk providers before all agreeing that this was the best approach.
See split() at https://www.gnu.org/software/gawk/manual/gawk.html#String-Functions.
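If GNU awk is not available, the same idea can be approximated in POSIX awk by walking the record with match() and remembering each separator as you go. This is only a sketch under the assumption that the separators are single : or ; characters; the array names fld and sep are my own.

```shell
# Sketch only: keep the original separators without gawk's 4-argument split().
result=$(printf '%s\n' 'a:e;i:o;u' | awk '
{
    rest = $0; n = 0
    # Peel off one field plus the separator that follows it, repeatedly.
    while (match(rest, /[:;]/)) {
        n++
        fld[n] = substr(rest, 1, RSTART - 1)
        sep[n] = substr(rest, RSTART, RLENGTH)
        rest = substr(rest, RSTART + RLENGTH)
    }
    fld[++n] = rest; sep[n] = ""      # last field has no trailing separator
    fld[2] = fld[2] "!"               # the modification from the question
    out = ""
    for (i = 1; i <= n; i++) out = out fld[i] sep[i]
    print out
}')
printf '%s\n' "$result"
```

For multi-line input where records have different numbers of fields, only the first n entries of fld and sep are read back, so stale entries from a longer earlier record are ignored.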

Processing of awk with multiple variable from previous processing?

I have a question about awk processing. I have the file below:
cat test.txt
/home/shhh/
abc.c
/home/shhh/2/
def.c
gthjrjrdj.c
/kernel/sssh
sarawtera.c
wrawrt.h
wearwaerw.h
My goal is to build full paths by joining each directory line with the file names that follow it, e.g. /home/shhh/abc.c.
This is the command I got from someone:
cat test.txt | awk '/^\/.*/{path=$0}/^[a-zA-Z]/{printf("%s/%s\n",path,$0);}'
It works, but I do not really understand how to read it step by step.
Could you explain how to interpret it?
Result :
/home/shhh//abc.c
/home/shhh/2//def.c
/home/shhh/2//gthjrjrdj.c
/kernel/sssh/sarawtera.c
/kernel/sssh/wrawrt.h
/kernel/sssh/wearwaerw.h
What you probably want is the following:
$ awk '/^\//{path=$0}/^[a-zA-Z]/ {printf("%s/%s\n",path,$0)}' file
/home/shhh//abc.c
/home/shhh/2//def.c
/home/shhh/2//gthjrjrdj.c
/kernel/sssh/sarawtera.c
/kernel/sssh/wrawrt.h
/kernel/sssh/wearwaerw.h
Explanation
/^\//{path=$0} on lines starting with a /, store it in the path variable.
/^[a-zA-Z]/ {printf("%s/%s\n",path,$0)} on lines starting with a letter, print the stored path together with the current line.
Note you can also say
awk '/^\//{path=$0; next} {printf("%s/%s\n",path,$0)}' file
Some comments
cat file | awk '...' is better written as awk '...' file.
You don't need the ; before the closing } of a block; the brace already ends the last statement.
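If the doubled slashes in the output bother you, one possible tweak (my addition, not part of the original command) is to strip a trailing / from the stored path before using it:

```shell
# Sketch only: drop a trailing slash from the directory line before joining.
result=$(printf '%s\n' '/home/shhh/' 'abc.c' '/kernel/sssh' 'wrawrt.h' | awk '
    /^\// { path = $0; sub(/\/$/, "", path); next }   # remember directory
    { print path "/" $0 }                             # join directory and file name
')
printf '%s\n' "$result"
```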

Awk unable to store the value into an array

I am using a script like below
SCRIPT
declare -a GET
i=1
awk -F "=" '$1 {d[$1]++;} {for (c in d) {GET[i]=d[c];i=i+1}}' session
echo ${GET[1]} ${GET[2]}
DESCRIPTION
The problem is the GET value printed outside is not the correct value ...
I understand your question as "how can I use the results of my awk script inside the shell where awk was called". The truth is that it isn't really trivial. You wouldn't expect to be able to use the output from a C program or python script easily inside your shell. Same with awk, which is a scripting language of its own.
There are some workarounds. For a robust solution, write your results from the awk script to a file in a suitably easy format and read them from the shell. As a hack, you could also try to read the output from awk directly into the shell using $(). Combine that with the set command and you could do:
set -- $(awk <awk script here>)
and then use $1 $2 etc. but you have to be careful with spaces in the output from awk.
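A concrete sketch of that hack, assuming the session file holds key=value lines (the file contents and the keys user and host are made up for the example, since the question does not show the real file):

```shell
# Sketch only: hypothetical session file with key=value lines.
session=$(mktemp)
printf '%s\n' 'user=alice' 'host=a' 'user=bob' > "$session"

# awk prints the counts on one line; the unquoted $(...) is split on whitespace
# and the pieces become $1, $2, ... in the calling shell.
set -- $(awk -F= '{ count[$1]++ } END { print count["user"], count["host"] }' "$session")
rm -f "$session"

users=$1
hosts=$2
echo "$users $hosts"
```

This only works cleanly while the values contain no whitespace or glob characters, which is the caveat mentioned above.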