awk output format for average - awk

I am computing the average of many values and printing it with awk, using the following script.
for j in `ls *.txt`; do
for i in emptyloop dd cp sleep10 gpid forkbomb gzip bzip2; do
echo -n $j $i" "; cat $j | grep $i | awk '{ sum+=$2} END {print sum/NR}'
done;
echo ""
done
The problem is that it prints the value as 1.2345e+05, which I do not want. I want it to print rounded values, but I cannot find where to pass the output format.
EDIT: using {print "average,%3d = ",sum/NR}' in place of {print sum/NR}' does not help, because it prints "average,%3d 1.2345e+05".

You need printf instead of print. print is a much simpler routine than printf and does not interpret format strings.

for j in *.txt; do
    for i in emptyloop dd cp sleep10 gpid forkbomb gzip bzip2; do
        awk -v "i=$i" -v "j=$j" '$0 ~ i {sum += $2; n++} END {if (n) printf "%s %s average %6d\n", j, i, sum/n}' "$j"
    done
    echo
done
You don't need ls - a glob will do.
Useless use of cat.
Quote all variables when they are expanded.
It's not necessary to use echo - AWK can do the job.
It's not necessary to use grep - AWK can do the job.
If you're getting numbers like 1.2345e+05 then %6d might be a better format string than %3d. Use printf in order to use format strings - print doesn't support them.
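To make the difference between print and printf concrete, here is a minimal sketch. The function name show_formats and the value 1234567.25 are made up for illustration; the point is only that print formats a non-integer number with OFMT (default %.6g), while printf uses whatever format you give it.

```shell
# Hypothetical demo: the same number printed three ways.
show_formats() {
    awk 'BEGIN {
        avg = 4938269 / 4        # 1234567.25, a made-up average
        print avg                # print uses OFMT (default "%.6g")
        printf "%d\n", avg       # integer part only
        printf "%.2f\n", avg     # fixed-point with two decimals
    }'
}
show_formats
```

On a typical Linux awk the first line comes out in exponent notation (1.23457e+06), while the two printf lines give 1234567 and 1234567.25.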
The following all-AWK script might do what you're looking for and be quite a bit faster. Without seeing your input data I've made a few assumptions, primarily that the command name being matched is in column 1.
awk '
BEGIN {
    cmdstring = "emptyloop dd cp sleep10 gpid forkbomb gzip bzip2";
    n = split(cmdstring, cmdarray);
    for (i = 1; i <= n; i++) {
        cmds[cmdarray[i]]
    }
}
$1 in cmds {
    sums[$1, FILENAME] += $2;
    counts[$1, FILENAME]++
    files[FILENAME]
}
END {
    for (file in files) {
        for (cmd in cmds) {
            # skip commands that never appeared in this file (avoids dividing by zero)
            if ((cmd, file) in counts) {
                printf "%s %s %6d\n", file, cmd, sums[cmd, file]/counts[cmd, file]
            }
        }
    }
}' *.txt

Related

Why double quote does not work in echo statement inside cmd in awk script?

gawk 'BEGIN { FS="|"; OFS="|" }NR ==1 {print} NR >=2 {cmd1="echo -n "$2" | base64 -w 0";cmd1 | getline d1;close(cmd1); print $1,d1 }' dummy2.txt
input:
id|dummy
1|subhashree:1;user=phn
2|subha:2;user=phn
Expected output:
id|dummy
1|c3ViaGFzaHJlZToxO3VzZXI9cGhuCg==
2|c3ViaGE6Mjt1c2VyPXBobgo=
output produced by script:
id|dummy
1|subhashree:1
2|subha:2
I understand that the quoting around $2 is causing the issue: the string is not encoded properly, and everything after the semicolon is stripped off. With double quotes around the string, the same command works in the terminal and gives the proper output:
echo "subhashree:1;user=phn" | base64
c3ViaGFzaHJlZToxO3VzZXI9cGhuCg==
[root#DERATVIV04 encode]# echo "subha:2;user=phn" | base64
c3ViaGE6Mjt1c2VyPXBobgo=
I have tried different variations of single and double quotes inside awk, but none of them work. Any help will be highly appreciated.
Thanks a lot in advance.
Your existing cmd1 produces:
echo -n subhashree:1;user=phn | base64 -w 0
^ semicolon is there
So executing it produces:
$ echo -n subhashree:1;user=phn | base64 -w 0
subhashree:1
With quotes:
$ echo -n 'subhashree:1;user=phn' | base64 -w 0
c3ViaGFzaHJlZToxO3VzZXI9cGhu
The solution is simply to quote the string: echo -n '<your-string>' | base64 -w 0
$ cat file
id|dummy
1|subhashree:1;user=phn
2|subha:2;user=phn
$ gawk -v q="'" 'BEGIN { FS="|"; OFS="|" }NR ==1 {print} NR >=2 {cmd1="echo -n " q $2 q" | base64 -w 0"; cmd1 | getline d1;close(cmd1); print $1,d1 }' file
id|dummy
1|c3ViaGFzaHJlZToxO3VzZXI9cGhu
2|c3ViaGE6Mjt1c2VyPXBobg==
It can be simplified as follows:
gawk -v q="'" 'BEGIN {
    FS=OFS="|"
}
NR==1{
    print;
    next
}
{
    cmd1="echo -n " q $2 q" | base64 -w 0";
    print ( ((cmd1 | getline d1)>0) ? $1 OFS d1 : $0 );
    close(cmd1);
}
' file
Based on Ed Morton's recommendation (http://awk.freeshell.org/AllAboutGetline), always test the return value of getline:
if/while ( (getline var < file) > 0)
if/while ( (command | getline var) > 0)
if/while ( (command |& getline var) > 0)
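The reason for the explicit > 0 test is that getline returns 1 on success, 0 at end of input, and -1 on error, and -1 is "true" in a bare while condition. A minimal sketch, assuming a hypothetical helper read_first_line that reports which of the three cases occurred for a given file name:

```shell
# Hypothetical helper: classify the result of a single getline from a file.
read_first_line() {
    awk -v file="$1" 'BEGIN {
        status = (getline line < file)
        if (status > 0)       print "read: " line   # got a line
        else if (status == 0) print "empty"         # end of file straight away
        else                  print "unreadable"    # -1: file could not be opened
        close(file)
    }'
}

read_first_line /nonexistent/path/to/file
```

With a loop written as while (getline line < file), the unreadable case would never terminate, because -1 is non-zero.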
The problem is the lack of quotes when the echo command is run by the shell. What you are running is effectively
echo -n subhashree:1;user=phn | base64 -w 0
which the shell executes as two commands separated by ;. The second one, user=phn | base64 -w 0, is an assignment piped into base64; the assignment writes nothing to standard output, so base64 has nothing to encode. The first one, echo -n subhashree:1, simply echoes subhashree:1, which is what ends up in your getline variable d1.
The right way to fix your problem is to use quotes:
echo -n "subhashree:1;user=phn" | base64 -w 0
When you said you were putting quotes around $2, that is not quite right: those quotes are used by awk to delimit the string pieces that make up the command, i.e. "echo -n ", $2 and " | base64 -w 0" are simply joined together. The double quotes you need must be seen by the shell.
So with that and a few other fixes, your awk command should be as below. I added sub() to remove trailing whitespace, which was present in the input you showed, and used printf instead of echo.
awk -v FS="|" '
BEGIN {
    OFS = FS
}
NR == 1 {
    print
}
NR >= 2 {
    sub(/[[:space:]]+$/, "", $2)    # strip trailing whitespace only
    cmd = "printf \"%s\" \"" $2 "\" | base64 -w 0"
    if ((cmd | getline result) > 0) {
        $2 = result
    }
    close(cmd)
    print
}
' file
So with the command above, your command is executed as below, which would produce the right result.
printf "%s" "subhashree:1;user=phn" | base64 -w 0
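One caveat with wrapping the data in double quotes is that the shell still expands $, backticks and backslashes inside them, and wrapping it in plain single quotes breaks as soon as the data itself contains a single quote. A hedged sketch of a more defensive approach follows; shquote and roundtrip are hypothetical names, not part of any answer above. It single-quotes the value and rewrites each embedded ' as '\'' (\047 is the octal escape for a single quote, which avoids fighting the outer shell quoting).

```shell
# Hypothetical helper: pass each input line through a shell command unchanged,
# using an awk function that quotes the value safely for /bin/sh.
roundtrip() {
    printf '%s\n' "$1" | awk '
    function shquote(s) {
        gsub(/\047/, "\047\\\047\047", s)   # each single quote becomes: quote, backslash, quote, quote
        return "\047" s "\047"
    }
    {
        cmd = "printf %s " shquote($0)
        if ((cmd | getline out) > 0) print out
        close(cmd)
    }'
}

roundtrip "subhashree:1;user=phn"
```

In the original command, printf %s would be replaced by the real pipeline, e.g. printf %s followed by the quoted value and | base64 -w 0.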
You already got answers explaining how to use awk for this, but you should also consider not using awk for it. The tool for sequencing calls to other commands (e.g. base64) is a shell, not awk. What you're trying to do in terms of calls is:
shell { awk { loop_on_input { shell { base64 } } } }
whereas if you call base64 directly from shell it'd just be:
shell { loop_on_input { base64 } }
Note that the awk command is spawning a new subshell once per line of input while the direct call from shell isn't.
For example:
#!/usr/bin/env bash
file='dummy2.txt'
head -n 1 "$file"
while IFS='|' read -r id dummy; do
printf '%s|%s\n' "$id" "$(base64 -w 0 <<<"$dummy")"
done < <(tail -n +2 "$file")
Here's the difference in execution speed for an input file in which each of your data lines is duplicated 100 times, created by awk -v n=100 'NR==1{print; next} {for (i=1;i<=n;i++) print}' dummy2.txt > file100:
$ ./tst.sh file100
Awk:
real 0m23.247s
user 0m3.755s
sys 0m10.966s
Shell:
real 0m14.512s
user 0m1.530s
sys 0m4.776s
The above timing was produced by running this script (both awk scripts posted in the answers will have about the same timing, so I just picked one at random):
#!/usr/bin/env bash
doawk() {
local file="$1"
gawk -v q="'" 'BEGIN {
FS=OFS="|"
}
NR==1{
print;
next
}
{
cmd1="echo -n " q $2 q" | base64 -w 0";
print ((cmd1 | getline d1)>0)? $1 OFS d1 : $0;
close(cmd1);
}
' "$file"
}
doshell() {
local file="$1"
head -n 1 "$file"
while IFS='|' read -r id dummy; do
printf '%s|%s\n' "$id" "$(base64 -w 0 <<<"$dummy")"
done < <(tail -n +2 "$file")
}
# Use 3rd-run timing to eliminate caching as a factor
doawk "$1" >/dev/null
doawk "$1" >/dev/null
echo "Awk:"
time doawk "$1" >/dev/null
echo ""
doshell "$1" >/dev/null
doshell "$1" >/dev/null
echo "Shell:"
time doshell "$1" >/dev/null

How to merge lines using awk command so that there should be specific fields in a line

I want to merge some rows in a file so that each line contains 22 fields separated by ~.
The input file looks like this:
200269~7414~0027001~VALTD~OM3500~963~~~~716~423~2523~Y~UN~~2423~223~~~~A~200423
2269~744~2701~VALD~3500~93~~~~76~423~223~Y~
UN~~243~223~~~~A~200123
209~7414~7001~VALD~OM30~963~~~
~76~23~2523~Y~UN~~223~223~~~~A~123
and so on.
The first line looks fine. The 2nd and 3rd lines need to be merged so that they become one line with 22 fields. The 4th, 5th and 6th lines should be merged, and so on.
Expected output:
200269~7414~0027001~VALTD~OM3500~963~~~~716~423~2523~Y~UN~~2423~223~~~~A~200423
2269~744~2701~VALD~3500~93~~~~76~423~223~Y~UN~~243~223~~~~A~200123
209~7414~7001~VALD~OM30~963~~~~76~23~2523~Y~UN~~223~223~~~~A~123
The file has 10 GB of data, and the code I wrote (using a while loop) takes too much time to execute. How can I solve this problem using awk or sed?
Code Used:
IFS=$'\n'
set -f
while read line
do
count_tild=`echo $line | grep -o '~' | wc -l`
if [ $count_tild == 21 ]
then
echo $line
else
checkLine
fi
done < file.txt
function checkLine
{
current_line=$line
read line1
next_line=$line1
new_line=`echo "$current_line$next_line"`
count_tild_mod=`echo $new_line | grep -o '~' | wc -l`
if [ $count_tild_mod == 21 ]
then
echo "$new_line"
else
line=$new_line
checkLine
fi
}
Using only the shell for this is slow, error-prone, and frustrating. Try Awk instead.
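awk -F '~' 'NF==1 { next } # Hack; see below
NF<22 {
  # A line break can fall inside a field, so glue the first field of a
  # continuation line onto the last buffered field instead of starting a new one
  if(a) f[a]=f[a] $1
  for(i=(a ? 2 : 1); i<=NF; i++) f[++a]=$i }
a==22 {
  for(i=1; i<=a; ++i) printf "%s%s", f[i], (i==22 ? "\n" : "~")
  a=0 }
NF==22
END {
  if(a) for(i=1; i<=a; i++) printf "%s%s", f[i], (i==a ? "\n" : "~") }' file.txt>file.new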
This assumes that consecutive lines with too few fields will always add up to exactly 22 when you merge them. You might want to check this assumption (or perhaps accept this answer and ask a new question with more and better details). Or maybe just add something like
a>22 {
print FILENAME ":" FNR ": Too many fields " a >"/dev/stderr"
exit 1 }
The NF==1 block is a hack to bypass the weirdness of the completely empty line 5 in your sample.
Your attempt contained multiple errors and inefficiencies; for a start, try http://shellcheck.net/ to diagnose many of them.
$ cat tst.awk
BEGIN { FS="~" }
{
sub(/^[0-9]+\./,"")
gsub(/[[:space:]]+/,"")
$0 = prev $0
if ( NF == 22 ) {
print ++cnt "." $0
prev = ""
}
else {
prev = $0
}
}
$ awk -f tst.awk file
1.200269~7414~0027001~VALTD~OM3500~963~~~~716~423~2523~Y~UN~~2423~223~~~~A~200423
2.2269~744~2701~VALD~3500~93~~~~76~423~223~Y~UN~~243~223~~~~A~200123
3.209~7414~7001~VALD~OM30~963~~~~76~23~2523~Y~UN~~223~223~~~~A~123
The assumption above is that you never have more than 22 fields on one line, nor do you exceed 22 in any concatenation of the contiguous lines that each have fewer than 22 fields, just as in your sample input.
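Since these answers depend on that assumption, it may be worth verifying the merged output before trusting it. A minimal sketch, assuming a hypothetical helper check_fields that reads the merged data on standard input and takes the expected field count as its argument (22 by default):

```shell
# Hypothetical helper: report every line whose ~-separated field count is wrong.
check_fields() {
    awk -F '~' -v want="${1:-22}" '
    NF != want {
        printf "line %d: %d fields\n", FNR, NF
        bad = 1
    }
    END { exit bad + 0 }'    # non-zero exit status if any line was wrong
}

# Tiny made-up sample with an expected count of 3; the second line is short.
printf 'a~b~c\na~b\n' | check_fields 3 || echo "mismatch found"
```

For the real data this would be run as check_fields 22 < file.new, or appended to the merge command with a pipe.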
You can try this awk
awk '
BEGIN {
    FS=OFS="~"
}
{
    while(NF<22) {
        if(NF==0)
            break
        a=$0
        if((getline)<=0)    # stop at end of input
            break
        $0=a$0
    }
    if(NF!=0)
        print
}
' infile
or this sed
sed -E '
:A
s/((.*~){21})([^~]*)/\1\3/
tB
N
bA
:B
s/\n//g
' infile

tcsh error: while loop

This is a basic program but since I'm a newbie, I'm not able to figure out the solution.
I have a file named rama.xvg in the following format:
-75.635 105.879 ASN-2
-153.704 64.7089 ARG-3
-148.238 -47.6076 GLN-4
-63.2568 -8.05441 LEU-5
-97.8149 -7.34302 GLU-6
-119.276 8.99017 ARG-7
-144.198 -103.917 SER-8
-65.4354 -10.3962 GLY-9
-60.6926 12.424 ARG-10
-159.797 -0.551989 PHE-11
65.9924 -48.8993 GLY-12
179.677 -7.93138 GLY-13
..........
...........
-70.5046 38.0408 GLY-146
-155.876 153.746 TRP-147
-132.355 151.023 GLY-148
-66.2679 167.798 ASN-2
-151.342 -33.0647 ARG-3
-146.483 41.3483 GLN-4
..........
..........
-108.566 0.0212432 SER-139
47.6854 33.6991 MET-140
47.9466 40.1073 ASP-141
46.4783 48.5301 SER-142
-139.17 172.486 LYS-143
58.9514 32.0602 SER-144
60.744 18.3059 SER-145
-94.0533 165.745 GLY-146
-161.809 177.435 TRP-147
129.172 -101.736 GLY-148
I need to extract all the lines containing "ASN-2" in one file all_1.dat and so on for all the 147 residues.
If I run the following command in the terminal, it gives the desired output for ASN-2:
awk '{if( NR%147 == 1 ) printf $0 "\n"}' rama.xvg > all_1.dat
To avoid doing it repeatedly for all the residues, I have written the following code.
#!/bin/tcsh
set i = 1
while ( $i < 148)
echo $i
awk '{if( NR%147 == i ) printf $0 "\n"}' rama.xvg > all_"$i".dat
@ i++
end
But this code prints the lines containing GLY-148 into all the output files.
Please let me know what the error in this code is. I think it is related to nesting.
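In your awk line, i is an awk variable, not the shell variable! awk never sees the shell's $i, so its own i is unset and compares equal to 0; NR%147 == 0 is true only for the GLY-148 lines, which is why they end up in every file. To use the shell variable $i, pass it in with -v:
awk -v i="$i" '{if( NR%147 == i%147 ) print}' rama.xvg > all_"$i".dat
(Comparing against i%147 lets the last file, i = 147, match the lines where NR%147 is 0.)
But I think it would be better to put your while loop into awk:
awk '{ i = (NR-1)%147 + 1; print > ("all_" i ".dat") }' rama.xvg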
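The modular arithmetic is easier to check on a small case. The sketch below uses a period of 3 instead of 147 and a made-up six-line input; split_cyclic and its three arguments (input file, period, output directory) are hypothetical names chosen for illustration. Note that it keeps one output file open per residue, which gawk handles but which may hit an open-file limit in some other awks.

```shell
# Hypothetical helper: line k of the input goes to all_<((k-1) mod period)+1>.dat
split_cyclic() {
    awk -v period="$2" -v dir="$3" '
    {
        i = (NR - 1) % period + 1      # 1..period, then wraps around
        out = dir "/all_" i ".dat"
        print > out
    }' "$1"
}

# Demo on throwaway data: lines a..f with period 3.
demo=$(mktemp -d)
printf '%s\n' a b c d e f > "$demo/in.txt"
split_cyclic "$demo/in.txt" 3 "$demo"
cat "$demo/all_1.dat"
rm -rf "$demo"
```

Here all_1.dat receives lines 1 and 4, all_2.dat lines 2 and 5, and all_3.dat lines 3 and 6, which is the same pattern as 147 residues repeating frame after frame.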

Set number of lines to variable in awk

I am experimenting with an awk script (an independent file).
I want it to process a text file which looks like this:
value1: n
value2: n
value3: n
value4: n
value5: n
value6: n
value7: n
value1: n
:
The text file contains a lot of these blocks, with 7 values in each of them. I want the awk script to print some of these values (the name of the value and "n") into a new file or to the command line. I thought I'd process it with a while loop driven by a variable set to the total number of lines, but I just can't get the total number of lines in the file into a variable. It seems I have to process every line until the end of the file to get the total. I'd like to have the total in a variable first and then run the while loop until the total is reached.
Do you have an idea?
Where $1 is the input parameter to your script (myscript textfile.txt):
count=$(wc -l "$1" | cut -d' ' -f1)
echo "Number of lines in $1 is $count"
Then run your awk command, using $count as your line count.
Edit: courtesy of fedorqui
count=$(wc -l < "$1")
echo "Number of lines in $1 is $count"
Edit 2: (forgive my awk command; it's not something that I use much)
count=$(wc -l < /etc/fstab)
echo "Number of lines in /etc/fstab is $count"
awk -v count="$count" '{print $0,"\t","\tLine ",NR," of ",count;}' /etc/fstab
Either do two passes over the file:
awk 'NR==FNR { lines = NR; next }
{ ... this happens on the second pass, use lines as you wish ... }' file file
or read the lines into an array and process it in END:
awk '{ a[NR] = $0 }
END { lines = NR; for(i=1; i<=lines; ++i) { line = a[i]; ... } }' file
The first consumes I/O, the second memory.
In more detail,
awk 'NR==FNR { count++; next }
{ print "Item " FNR " of " count ": " $0 }' file file
or similarly
awk '{ a[NR] = $0; }
END { for (i=1; i<=NR; ++i) print "Item " i " of " NR ": " a[i] }' file
If you need the line count outside of Awk, you will need to print it and capture it from your script. If you have an Awk script which performs something useful and produces the count as a side effect, you will want to print the line count to standard output, and take care that all other output is directed somewhere else. Perhaps like this:
lines=$(awk '{ sum += $1; }
END { print "Average: " sum/NR >"average.txt"; print NR }' file)
# Back in Bash
echo "Total of $lines processed, output in average.txt"

Redirect input for gawk to a system command

Usually a gawk script processes each line of its stdin. Is it possible to instead specify a system command in the script, and have the rest of the script process each line of that command's output?
For example consider the following simple interaction:
$ { echo "abc"; echo "def"; } | gawk '{print NR ":" $0; }'
1:abc
2:def
I would like to get the same output without using pipe, specifying instead the echo commands as a system command.
I can of course use the pipe but that would force me to either use two different scripts or specify the gawk script inside the bash script and I am trying to avoid that.
UPDATE
The previous example is not quite representative of my usecase, this is somewhat closer:
$ { echo "abc"; echo "def"; } | gawk '/d/ {print NR ":" $0; }'
2:def
UPDATE 2
A shell script parallel would be as follows. Without the exec line the script would read from stdin; with the exec line it uses the output of the command on that line as input:
/tmp> cat t.sh
#!/bin/bash
exec 0< <(echo abc; echo def)
while read l; do
echo "line:" $l
done
/tmp> ./t.sh
line: abc
line: def
From all of your comments, it sounds like what you want is:
$ cat tst.awk
BEGIN {
if ( ("mktemp" | getline file) > 0 ) {
system("(echo abc; echo def) > " file)
ARGV[ARGC++] = file
}
close("mktemp")
}
{ print FILENAME, NR, $0 }
END {
if (file!="") {
system("rm -f \"" file "\"")
}
}
$ awk -f tst.awk
/tmp/tmp.ooAfgMNetB 1 abc
/tmp/tmp.ooAfgMNetB 2 def
but honestly, I wouldn't do it. You're munging what the shell is good at (creating/destroying files and processes) with what awk is good at (manipulating text).
I believe what you're looking for is getline:
awk '{ while ( ("echo abc; echo def" | getline line) > 0){ print line} }' <<< ''
abc
def
Adjusting the answer to your second example:
awk '{ while ( ("echo abc; echo def" | getline line) > 0){ counter++; if ( line ~ /d/){print counter":"line} } }' <<< ''
2:def
Let's break it down:
awk '{
    cmd = "echo abc; echo def"
    # the loop below reads the output of cmd into the variable line, one line at a time
    while ( ( cmd | getline line) > 0){
        # we need a counter because NR will not work for us
        counter++;
        # if the line contains the letter d
        if ( line ~ /d/){
            print counter":"line
        }
    }
    close(cmd)
}' <<< ''
2:def
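A variation on the same getline idea, offered as a sketch rather than a definitive answer: if the loop lives in a BEGIN block, awk never reads its own input, so the empty here-string is not needed. The function name filter_cmd_output and its two parameters (the command to run and the pattern to match) are hypothetical.

```shell
# Hypothetical helper: run a command from inside awk and print the numbered
# output lines that match a pattern.
filter_cmd_output() {
    awk -v cmd="$1" -v pat="$2" '
    BEGIN {
        # with only a BEGIN block, awk does not wait for a file or stdin
        while ((cmd | getline line) > 0) {
            n++                         # our own line counter
            if (line ~ pat) print n ":" line
        }
        close(cmd)
    }'
}

filter_cmd_output 'echo abc; echo def' 'd'
```

For the example command this prints 2:def, matching the piped version in the question.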