AWK sum all values of a column keeping all floats - awk

I'm using the following code to grep the lines I'm interested in, keep only the last nine, and sum over column nine:
grep -n -49 'FINAL BlaBla' output | tail -9 | awk 'BEGIN {SUM=0}; {SUM=SUM+$9}; END {printf "%.3f\n" SUM}'
However, the sum over column 9 returns 0,000.
The selected lines look as follows:
84- C -3.42056726 +1 -0.82831327 +1 1.52743549 +1 0.5647
85- N -4.78612760 +1 -1.01185554 +1 1.58894854 +1 -0.5837
86- C -5.19047197 +1 -2.20130686 +1 2.06176295 +1 0.3890
87- N -4.42537785 +1 -3.22689397 +1 2.47304603 +1 -0.4775
88- C -3.03532546 +1 -2.98933854 +1 2.38795560 +1 0.3686
89- N -2.51737448 +1 -1.78267672 +1 1.92262528 +1 -0.5526
90- Cl -6.86455806 +1 -2.45050886 +1 2.15229544 +1 0.0934
91- N -2.24043582 +1 -3.93651444 +1 2.76082642 +1 0.0890
92- N -2.94053526 +1 0.36941710 +1 1.06455738 +1 -0.3274
I can't find out where the mistake is.
I also tried to sum over $1 and correctly obtained 792,000, but when I sum over $3 I get 31,000 ...
What's wrong?

I think the problem lies in the missing comma in your printf expression:
$ awk 'BEGIN {SUM=0}; {SUM=SUM+$9}; END {printf "%.3f\n", SUM}' file
# ^
# comma!
-0.436
Note by the way that there is no need to set the variable to zero, since that is the default. So drop the BEGIN {} block and leave just:
awk '{sum+=$9}; END {printf "%.3f\n", sum}' file
For the other fields:
$ awk '{sum+=$nvar}; END {printf "%.3f\n", sum}' nvar=1 file
792.000
$ awk '{sum+=$nvar}; END {printf "%.3f\n", sum}' nvar=3 file
-35.421
$ awk '{sum+=$nvar}; END {printf "%.3f\n", sum}' nvar=9 file
-0.436
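A side note (my variant, not from the original answer): the nvar=1 file form assigns the variable only when awk reaches that argument, so it is not yet visible in a BEGIN block; assigning with -v makes it available from the start:
$ awk -v nvar=9 '{sum+=$nvar}; END {printf "%.3f\n", sum}' file
-0.436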
Why wasn't it working?
From the GNU Awk User's Guide, 5.5.1 Introduction to the printf Statement:
A simple printf statement looks like this:
printf format, item1, item2, …
As with print, the entire list of arguments may optionally be enclosed
in parentheses
The difference between printf and print is the format argument. This
is an expression whose value is taken as a string; it specifies how to
output each of the other arguments. It is called the format string.
The format string is very similar to that in the ISO C library
function printf(). Most of format is text to output verbatim.
Scattered among this text are format specifiers—one per item. Each
format specifier says to output the next item in the argument list at
that place in the format.
So if you don't put a comma between the format string and the arguments, you get this error:
$ awk 'BEGIN {printf "%.3f" 3}'
awk: cmd. line:1: fatal: not enough arguments to satisfy format string
`%.3f3'
^ ran out for this one
With the comma, it works!
$ awk 'BEGIN {printf "%.3f", 3}'
3.000
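To see what awk does without the comma: "%.3f" 3 is plain string concatenation, producing the single argument %.3f3, a format string with one specifier and nothing left to fill it (hence the error above):
$ awk 'BEGIN { print "%.3f" 3 }'
%.3f3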

Related

Countif like function in AWK with field headers

I am looking for a way of counting the number of times a value in a field appears in a range of fields in a CSV file, much the same as COUNTIF in Excel, although I would like to use an awk command if possible.
So column 6 has the range of values, and column 7 should have the number of times each column-6 value appears, as per below:
>awk -F, '{print $0}' file3
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
>awk -F, '{print $6}' file3
test
SBCD
AWER
ASDF
ASDQ
ASDQ
What I want is:
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
# adds the field name "count" that I want:
awk -F, -v OFS=, 'NR==1{ print $0, "count"}
NR>1{ print $0}' file3
How do I get the output I want?
I have tried this, from a previous/similar question, but no joy:
>awk -F, 'NR>1{c[$6]++;l[NR>1]=$0}END{for(i=0;i++<NR;){split(l[i],s,",");print l[i]","c[s[1]]}}' file3
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,
,
,
,
,
,
I would harness GNU AWK for this task in the following way. Let file.txt content be
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
then
awk 'BEGIN{FS=OFS=","}NR==1{print $0,"count";next}FNR==NR{arr[$6]+=1;next}FNR>1{print $0,arr[$6]}' file.txt file.txt
gives output
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
Explanation: this is a two-pass approach, hence file.txt appears twice. I inform GNU AWK that , is both the field separator (FS) and the output field separator (OFS). For the first line (the header) I print it followed by count and instruct GNU AWK to go to the next line, so nothing else is done with the header. Then, during the first pass, i.e. while the global line number (NR) is equal to the line number within the current file (FNR), I count the occurrences of the values in the 6th field, storing them in the array arr, and again go to the next line, so nothing else is done in this pass. During the second pass, for all lines after the 1st (FNR>1), I print the whole line ($0) followed by the corresponding value from the array arr.
(tested in GNU Awk 5.0.1)
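A quick way to watch the two passes (a toy illustration of NR versus FNR, not part of the answer above): NR keeps counting across both copies of the file while FNR resets at the start of the second, so FNR==NR holds only during the first pass.
$ awk 'FNR < 3 {print "NR=" NR, "FNR=" FNR, $0}' file.txt file.txt
NR=1 FNR=1 f1,f2,f3,f4,f5,test
NR=2 FNR=2 row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
NR=7 FNR=1 f1,f2,f3,f4,f5,test
NR=8 FNR=2 row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD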
You did not copy the code from the linked question properly. Why change l[NR] to l[NR>1] at all? On the other hand, you should change s[1] to s[6] since it's the sixth field that has the key you're counting:
awk -F, 'NR>1{c[$6]++;l[NR]=$0}END{for(i=0;i++<NR;){split(l[i],s,",");print l[i]","c[s[6]]}}'
You can also output the header with the new field name:
awk -F, -vOFS=, 'NR==1{print $0,"count"}NR>1{c[$6]++;l[NR]=$0}END{for(i=0;i++<NR;){split(l[i],s,",");print l[i],c[s[6]]}}'
One awk idea:
awk '
BEGIN { FS=OFS="," } # define input/output field delimiters as comma
{ lines[NR]=$0
if (NR==1) next
col6[NR]=$6 # copy field 6 so we do not have to parse the contents of lines[] in the END block
cnt[$6]++
}
END { for (i=1;i<=NR;i++)
print lines[i], (i==1 ? "count" : cnt[col6[i]] )
}
' file3
This generates:
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2

awk print sum of group of lines

I have a file with a column named (effect) which has rows separated by blank lines:
(effect)
1
1
1
(effect)
1
1
1
1
(effect)
1
1
I know how to print the sum of a column, like
awk '{sum+=$1;} END{print sum;}' file.txt
Using awk, how can I print the sum for each (effect) group, so that I get three lines (or more lines in other cases) like below?
sum=3
sum=4
sum=2
You can check if there is an (effect) part, and print the sum either when encountering the next (effect) part or in the END block.
awk '
$1 == "(effect)" { if(seen) print "sum="sum; seen = 1; sum = 0 }
/[0-9]/ { sum += $1 }
END { if (seen) print "sum="sum }
' file
Output
sum=3
sum=4
sum=2
With your shown samples, please try the following awk code, written and tested in GNU awk.
awk -v RS='(^|\n)?\\(effect\\)[^(]*' '
RT{
gsub(/\(effect\)\n|\n+[[:space:]]*$/,"",RT)
num=split(RT,arr,ORS)
print "sum="num
}
' Input_file
Explanation: a simple explanation would be: using GNU awk, set RS to the regex (^|\n)?\\(effect\\)[^(]* for the whole Input_file, so each (effect) header along with the values that follow it is captured in RT. In the main program, check that RT is not null; if so, use the gsub (global substitution) function to substitute (effect)\n and \n+[[:space:]]*$ (trailing newlines followed by spaces) with NULL in RT. Then split the value of RT into an array named arr on the delimiter ORS, saving the number of resulting elements in the variable num, and print sum= followed by the value of num to get the required results.
With the shown samples, the output will be as follows:
sum=3
sum=4
sum=2
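One caveat worth noting: num is the number of lines split out of RT, so the code really prints a count; it matches the sum only because every value in the sample is 1. If the values could differ, a variant that actually sums the split entries might look like this (same RS trick, summing loop added by me):
awk -v RS='(^|\n)?\\(effect\\)[^(]*' '
RT{
  gsub(/\(effect\)\n|\n+[[:space:]]*$/,"",RT)
  n=split(RT,arr,ORS)
  sum=0
  for(i=1;i<=n;i++) sum+=arr[i]
  print "sum=" sum
}
' Input_file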
This should work in any version of awk:
awk '{sum += $1} $0=="(effect)" && NR>1 {print "sum=" sum; sum=0}
END{print "sum=" sum}' file
sum=3
sum=4
sum=2
Similar to #Ravinder's answer, but does not depend on the name of the header:
awk -v RS='' -v FS='\n' '{
sum = 0
for (i=2; i<=NF; i++) sum += $i
printf "sum=%d\n", sum
}' file
RS='' means that sequences of 2 or more newlines separate records.
The Field Separator is newline.
The for loop omits field #1, the header.
However that means that empty lines truly need to be empty: no spaces or tabs allowed. If your data might have blank lines that contain whitespace, you can set
-v RS='\n[[:space:]]*\n'
$ awk -v RS='(effect)' 'NR>1{sum=0; for(i=1;i<=NF;i++) sum+=$i; print "sum="sum}' file
sum=3
sum=4
sum=2
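For what it's worth, this last one leans on gawk, where a multi-character RS is treated as a regex: the parentheses in '(effect)' actually form a regex group matching effect, so the literal ( and ) characters land inside the records and harmlessly add 0 to the sums, while NR>1 skips everything before the first header.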

awk / gawk printf when variable format string, changing zero to dash

I have a table of numbers I am printing in awk using printf.
The printf also handles some truncation of the numbers.
(cat <<E\OF
Name,Where,Grade
Bob,Sydney,75.12
Sue,Sydney,65.2475
George,Sydney,84.6
Jack,Sydney,35
Amy,Sydney,
EOF
)|gawk 'BEGIN{FS=","}
FNR==1 {print("Name","Where","Grade");next}
{if ($3<50) {$3=0}
printf("%s,%s,%d \n",$1,$2,$3)}'
This produces:
Name Where Grade
Bob,Sydney,75
Sue,Sydney,65
George,Sydney,84
Jack,Sydney,0
Amy,Sydney,0
What I want is to display scores which are less than 50, or missing, as a dash ("-").
Name Where Grade
Bob,Sydney,75
Sue,Sydney,65
George,Sydney,84
Jack,Sydney,-
Amy,Sydney,-
This requires the 3rd format specifier in the printf to change from %d to %s.
So in some rows the third column should be a value, and in some rows it should be a string. How can I tell this to GAWK? Or should I just pipe through another awk to re-format?
$ gawk 'BEGIN{FS=","}
FNR==1 {print("Name","Where","Grade");next}
{if ($3<50) {$3="-"} else {$3=sprintf("%d", $3)}
printf("%s,%s,%s \n",$1,$2,$3)}' ip.txt
Name Where Grade
Bob,Sydney,75
Sue,Sydney,65
George,Sydney,84
Jack,Sydney,-
Amy,Sydney,-
Use if-else to assign the value to $3 as needed.
sprintf lets you assign the result of the formatting to a variable.
For this case, you could use the int function as well.
Now printf has %s for $3 as well.
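Both %d and int() truncate toward zero, so they agree on these scores; a quick check (my example, not from the answer):
$ gawk 'BEGIN { printf "%d %s\n", 65.2475, int(65.2475) }'
65 65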
Assuming you missed the commas for the header and the space after the third column is not needed, you could do this with a simple one-liner:
$ awk -F, -v OFS=, 'NR>1{$3 = $3 < 50 ? "-" : int($3)} 1' ip.txt
Name,Where,Grade
Bob,Sydney,75
Sue,Sydney,65
George,Sydney,84
Jack,Sydney,-
Amy,Sydney,-
The ?: ternary operator is an alternative to if-else.
1 is an awk idiom to print the contents of $0.
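Spelled out, the ternary one-liner is equivalent to this longer form (a sketch for readability; same logic, my comments):
awk -F, -v OFS=, '
NR > 1 {
    if ($3 < 50)
        $3 = "-"        # low or missing score becomes a dash
    else
        $3 = int($3)    # otherwise truncate to an integer
}
1                       # print every record
' ip.txt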

How to increment a column value with an increasing number in a csv file

I have a text file with 3 columns as below.
$ cat test.txt
1,A,300
1,B,300
1,C,300
So far I have tried
awk -F, '{$3=$3+1;print}' OFS=, test.txt
But the output comes out as:
1,A,301
1,B,301
1,C,301
Below is my desired output; I want the third column to increment on each successive line:
1,A,300
1,B,301
1,C,302
How can I achieve the desired output?
This could work, assuming the lines are sequential as in your sample:
awk -F ',' '{sub($3"$",$3+NR-1)}7' YourFile
It uses the line number as the increment value, changing the end of the line rather than the field value (a difference from awk's point of view: it doesn't need to rebuild the line with the separator).
An alternative if there are empty or other lines in between the modifiable lines (I arbitrarily use NF as the filter, but it depends on your criteria, if any):
awk -F ',' 'NF{sub($3"$",$3+i++)}7' YourFile
awk 'BEGIN{x=0;FS=OFS=","} NF>1{$3=$3+x;x++}1' inputfile
1,A,300
1,B,301
1,C,302
Explanation:
The BEGIN block contains x, a counter initially set to zero, plus FS and OFS. NF>1 is used to skip blank lines (remove it if there are none). $3=$3+x adds the value of the counter to $3, and x++ increments the counter.
Try this: NR starts at 1, so NR-1 should give you the correct number.
awk -F, '{$3=$3+NR-1;print}' OFS=, test.txt
Yet another:
awk 'BEGIN{ FS=OFS="," } ($3+=i++)||1 ' file
awk 'BEGIN{i=0;FS=OFS=","} NF>1{$3=$3+i;i++}1' filename
It contains i, a counter initially set to zero, plus FS and OFS. NF>1 is used to skip blank lines (remove it if there are none). $3=$3+i adds the value of the counter to $3, and i++ increments it. Note that there must be a space between awk and the program, and between the program and the filename at the end.
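All the working variants above share one idea: a counter that starts at zero and grows once per data line. A minimal sketch of that idea (my naming; NF skips blank lines):
awk 'BEGIN{FS=OFS=","} NF{$3 += n++} 1' test.txt
1,A,300
1,B,301
1,C,302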

awk script for finding smallest value from column

I am a beginner in AWK, so please help me learn it. I have a text file named snd and its values are
1 0 141
1 2 223
1 3 250
1 4 280
I want to print the entire row when the third column value is the minimum.
This should do it:
awk 'NR == 1 {line = $0; min = $3}
NR > 1 && $3 < min {line = $0; min = $3}
END{print line}' file.txt
EDIT:
What this does is:
Remember the 1st line and its 3rd field.
For the other lines, if the 3rd field is smaller than the min found so far, remember the line and its 3rd field.
At the end of the script, print the line.
Note that the test NR > 1 can be skipped, as for the 1st line, $3 < min will be false. If you know that the 3rd column is always negative (never positive), you can also skip the NR == 1 ... test, since min's value at the beginning of the script is zero and any negative $3 will be below it.
EDIT2:
This is shorter:
awk 'NR == 1 || $3 < min {line = $0; min = $3}END{print line}' file.txt
You don't need awk to do what you want. Use sort
sort -nk 3 file.txt | head -n 1
Results:
1 0 141
I think sort is an excellent answer, unless for some reason what you're looking for is the awk logic to do this in a larger script, or you want to avoid the extra pipes, or the purpose of this question is to learn more about awk.
$ awk 'NR==1{x=$3;line=$0} $3<x{line=$0} END{print line}' snd
Broken out into pieces, this is:
NR==1 {x=$3;line=$0} -- On the first line, set an initial value for comparison and store the line.
$3<x{line=$0} -- On each line, compare the third field against our stored value, and if the condition is true, store the line. (We could make this run only on NR>1, but it doesn't matter.)
END{print line} -- At the end of our input, print whatever line we've stored.
You should read man awk to learn about any parts of this that don't make sense.
A short answer for this would be:
sort -k3,3n temp|head -1
Since you asked for awk:
awk '{if(min>$3||NR==1){min=$3;a[$3]=$0}}END{print a[min]}' your_file
But I always prefer the shorter one.
For calculating the smallest value in any column, let's say the last column:
awk '(FNR==1){a=$NF} {a=$NF < a?$NF:a} END {print a}'
This will only print the smallest value of the column.
In case the complete line is needed, it is better to use sort (with -r the smallest line ends up last, so take the last one):
sort -r -n -t [delimiter] -k[column] [file name] | tail -1
awk -F ";" '(NR==1){a=$NF;b=$0} {a=$NF<a?$NF:a;b=$NF>a?b:$0} END {print b}' filename
This will print the line with the smallest value that is encountered first.
awk 'BEGIN {OFS=FS=","}{if ( a[$1]>$2 || a[$1]=="") {a[$1]=$2;} if (b[$1]<$2) {b[$1]=$2;} } END {for (i in a) {print i,a[i],b[i]}}' input_file
We use || a[$1]=="" because when the first value for a key in field 1 is encountered, a[$1] still holds null.
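Since this last one goes a bit beyond the question (it tracks a per-key minimum in a[] and maximum in b[] over field 2), here is a toy run I made up to show what it does; note that for (i in a) iterates in an unspecified order:
$ printf 'x,5\nx,2\ny,7\nx,9\ny,3\n' > input_file
$ awk 'BEGIN {OFS=FS=","}{if ( a[$1]>$2 || a[$1]=="") {a[$1]=$2;} if (b[$1]<$2) {b[$1]=$2;} } END {for (i in a) {print i,a[i],b[i]}}' input_file
x,2,9
y,3,7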