Compute average and standard deviation with awk - awk

I have a file 'file.dat' containing 24 rows x 16 columns of data.
I have already tested the following awk script, which computes the average of each column.
touch aver-std.dat
awk '{ for (i=1; i<=NF; i++) { sum[i]+= $i } }
END { for (i=1; i<=NF; i++ )
{ printf "%f \n", sum[i]/NR} }' file.dat >> aver-std.dat
The output file 'aver-std.dat' has one column containing these averages.
In the same way, I would like to compute the standard deviation of each column of 'file.dat' and write it in a second column of the output file.
That is, I would like an output file with the average in the first column and the standard deviation in the second column.
I have tried several variations, such as this one:
touch aver-std.dat
awk '{ for (i=1; i<=NF; i++) { sum[i]+= $i }}
END { for (i=1; i<=NF; i++ )
{std[i] += ($i - sum[i])^2 ; printf "%f %f \n", sum[i]/NR, sqrt(std[i]/(NR-1))}}' file.dat >> aver-std.dat
and it writes values in the second column, but they are not the correct standard deviations. Something in the computation of the deviation is wrong.
I would very much appreciate any help.
Regards

The (population) standard deviation is
stdev = sqrt((1/N)*(sum of (value - mean)^2))
But there is another form of the formula which does not require you to know the mean beforehand. It is:
stdev = sqrt((1/N)*((sum of squares) - (((sum)^2)/N)))
(A quick web search for "sum of squares" formula for standard deviation will give you the derivation if you are interested)
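For reference, here is a short sketch of that derivation. The symbols are my own labels, not part of the original formulas: x_i are the values, N is their count, and \bar{x} = (1/N) \sum x_i is the mean.

```latex
\begin{aligned}
\sigma^2 &= \frac{1}{N}\sum_{i=1}^{N}\left(x_i-\bar{x}\right)^2
          = \frac{1}{N}\left(\sum_{i=1}^{N}x_i^2 \;-\; 2\bar{x}\sum_{i=1}^{N}x_i \;+\; N\bar{x}^2\right) \\
         &= \frac{1}{N}\left(\sum_{i=1}^{N}x_i^2 \;-\; \frac{2}{N}\Bigl(\sum_{i=1}^{N}x_i\Bigr)^2 \;+\; \frac{1}{N}\Bigl(\sum_{i=1}^{N}x_i\Bigr)^2\right)
          = \frac{1}{N}\left(\sum_{i=1}^{N}x_i^2 \;-\; \frac{\bigl(\sum_{i=1}^{N}x_i\bigr)^2}{N}\right)
\end{aligned}
```

Taking the square root of the last expression gives the second form of stdev above.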
To use this formula, you need to keep track of both the sum and the sum of squares of the values. So your awk script will change to:
awk '{for(i=1;i<=NF;i++) {sum[i] += $i; sumsq[i] += ($i)^2}}
END {for (i=1;i<=NF;i++) {
printf "%f %f \n", sum[i]/NR, sqrt((sumsq[i]-sum[i]^2/NR)/NR)}
}' file.dat >> aver-std.dat
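The script above divides by N, so it gives the population standard deviation, whereas the attempt in the question divided by NR-1. If the sample standard deviation is what you want, a minimal sketch could look like the following. The function name col_mean_sd is made up for illustration, and the widest row seen is remembered in nf rather than relying on NF inside END.

```shell
# Hypothetical helper: prints "mean sample-stddev" for each column of its input.
col_mean_sd() {
    awk '
    {
        if (NF > nf) nf = NF                      # remember the widest row seen
        for (i = 1; i <= NF; i++) { sum[i] += $i; sumsq[i] += $i * $i }
    }
    END {
        for (i = 1; i <= nf; i++) {
            mean = sum[i] / NR
            # sample variance: divide by NR - 1; defined as 0 for a single row
            var = (NR > 1) ? (sumsq[i] - NR * mean * mean) / (NR - 1) : 0
            if (var < 0) var = 0                  # guard against tiny negative rounding errors
            printf "%f %f\n", mean, sqrt(var)
        }
    }' "$@"
}
```

It would be called as col_mean_sd file.dat > aver-std.dat.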

To simply calculate the population standard deviation of a list of numbers, you can use a command like this:
awk '{x+=$0;y+=$0^2}END{print sqrt(y/NR-(x/NR)^2)}'
Or this calculates the sample standard deviation:
awk '{sum+=$0;a[NR]=$0}END{for(i in a)y+=(a[i]-(sum/NR))^2;print sqrt(y/(NR-1))}'
The ^ exponentiation operator is specified by POSIX. ** is supported by gawk and nawk but not by mawk.
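One caveat with the sum-of-squares form is that it subtracts two large, nearly equal numbers when the values are big compared with their spread. A common alternative is Welford's running update, which needs neither a second pass nor an array. This is only a sketch of that different technique, not what the commands above do. The function name welford_sd is made up, and it assumes one value per line in the first field.

```shell
# Hypothetical helper: prints "mean sample-stddev" for a single column of numbers.
welford_sd() {
    awk '
    NF {                                # skip empty lines
        n++
        delta = $1 - mean
        mean += delta / n               # running mean
        m2 += delta * ($1 - mean)       # running sum of squared deviations
    }
    END {
        if (n > 1) printf "%.6f %.6f\n", mean, sqrt(m2 / (n - 1))
    }' "$@"
}
```

Dividing m2 by n instead of n - 1 would give the population form.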

Here are some calculations I made on a Grinder data output file from a long soak test that had to be interrupted:
Standard deviation (biased) + average:
cat <grinder_data_file> | grep -v "1$" | awk -F ', ' '{ sum=sum+$5 ; sumX2+=(($5)^2)} END { printf "Average: %f. Standard Deviation: %f \n", sum/NR, sqrt(sumX2/(NR) - ((sum/NR)^2) )}'
Standard deviation (non-biased) + average:
cat <grinder_data_file> | grep -v "1$" | awk -F ', ' '{ sum=sum+$5 ; sumX2+=(($5)^2)} END { avg=sum/NR; printf "Average: %f. Standard Deviation: %f \n", avg, sqrt(sumX2/(NR-1) - 2*avg*(sum/(NR-1)) + ((NR*(avg^2))/(NR-1)))}'
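The cat and grep stages can be folded into awk itself, as long as the rows are counted with a separate counter instead of NR, since NR would also count the skipped lines. A sketch under the same assumptions as the commands above: fields separated by ", ", the value in field 5, and lines ending in 1 excluded. The function name grinder_stats is made up.

```shell
# Hypothetical helper: average plus biased and unbiased standard deviation of field 5.
grinder_stats() {
    awk -F ', ' '
    !/1$/ { n++; sum += $5; sumsq += $5 * $5 }    # same filter as grep -v "1$"
    END {
        if (n < 2) exit                           # not enough rows for the unbiased form
        avg = sum / n
        printf "Average: %f. Biased SD: %f. Unbiased SD: %f\n",
               avg, sqrt(sumsq / n - avg * avg), sqrt((sumsq - n * avg * avg) / (n - 1))
    }' "$@"
}
```

The unbiased expression (sumsq - n*avg^2)/(n-1) is algebraically the same as the longer three-term expression used above.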

If you want the average and standard deviation of each row rather than of each column, your script should look something like this instead:
awk '{
sum = 0
for (i=1; i<=NF; i++) {
sum += $i
}
avg = sum / NF
avga[NR] = avg
sum = 0
for (i=1; i<=NF; i++) {
sum += ($i - avg) ^ 2
}
stda[NR] = sqrt(sum / NF)
}
END { for (i = 1; i in stda; ++i) { printf "%f %f \n", avga[i], stda[i] } }' file.dat >> aver-std.dat

Related

Compute standard deviation for each row in awk

I have data with 500 fields in each row (500 columns) and 5000 rows. I want to compute the standard deviation of each row and print it as output.
Input example
3 0 2 ...(496 more values)... 1
4 1 0 ...(496 more values)... 4
1 3 0 ...(496 more values)... 2
Expected output
0.571 (std for values from the first row)
0.186 (std values from the second row)
0.612 (std values from the third row)
I found the following question, but it does not fit my case because it computes the standard deviation for each column: Compute average and standard deviation with awk
I am thinking of computing the sum of each row to get the average, then accumulating std[i] += ($i - sum[i])^2 for every field and finally taking sqrt(std[i]/(500-1)), but then I would probably need an array for every row (5000 arrays).
Maybe I should transpose the data, turning rows into columns and columns into rows?
Edit:
Yes this works fantastic
#!/bin/bash
awk 'function std1() {
s=0; t=0;
for( i=1; i<=NF; i++)
s += $i;
mean = s / NF;
for (i=1; i<=NF; i++ )
t += (mean-$i)*(mean-$i);
return sqrt(t / NF)
}
{ print std1()}' data.txt >> std.txt
I won't vouch for the calculation, but you could just do:
awk 'function sigma( s, t) {
for( i=1; i<=NF; i++)
s += $i;
mean = s / NF;
for (i=1; i<=NF; i++ )
t += (mean-$i)*(mean-$i);
return sqrt(t / NF)
}
{ print sigma()}' input-path
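A quick way to check a per-row function like this is to feed it rows whose standard deviation is known. The sketch below is a variant of the function above. The function names row_sd and sd are made up, all temporaries are declared as local parameters, and the result is printed with three decimals as in the expected output of the question.

```shell
# Hypothetical helper: prints the population standard deviation of each input row.
row_sd() {
    awk '
    function sd(    i, s, t, mean) {          # extra parameters act as local variables
        if (NF == 0) return 0                 # empty line: nothing to average
        for (i = 1; i <= NF; i++) s += $i
        mean = s / NF
        for (i = 1; i <= NF; i++) t += ($i - mean) * ($i - mean)
        return sqrt(t / NF)                   # use (NF - 1) here for the sample form
    }
    { printf "%.3f\n", sd() }' "$@"
}
```

For example, the row 2 4 4 4 5 5 7 9 has mean 5 and squared deviations summing to 32, so the population standard deviation is sqrt(32/8) = 2.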

Use row 1 column ith as output filename awk

I'm very new to the command line, so I need some help splitting a text file by columns using awk. The difficulty for me is that I want the ith output filename to be the text from the 1st row of the ith column.
This is what I had in mind:
awk '{for(i = 2; i <= NF; i++){name= ??FNR == 1 $i?? ;print $1, $i > name}}' myfile.txt
But I don't know how to set the name variable...
Input: myfile.txt
'ID' 'sample_1' 'sample_2' ...
'id_1' 1 2 ...
'id_2' 2 3 ...
Expected output:
sample_1.txt:
'ID' 'sample_1'
'id_1' 1
'id_2' 2
sample_2.txt:
'ID' 'sample_2'
'id_1' 2
'id_2' 3
Thanks
You should keep column headers in an array.
awk 'NR==1 {
for (i=2; i<=NF; ++i) {
fnames[i] = gensub(/\x27/, "", "g", $i)
print $1, $i > fnames[i] ".txt"
}
next
}
{
for (i=2; i<=NF; ++i)
print $1, "\x27" $i "\x27" > fnames[i] ".txt"
}' myfile.txt
\x27 is a single quote in hex-escaped form.
gensub(/\x27/, "", "g", $i) removes the single quotes from the column headers so that the output files are named as you wanted. Note that gensub() is a gawk extension.
You can try this awk script:
awk -F'\t' ' # tab as field separator
{
for ( i = 2 ; i <= NF ; i++ ) { # for each record loop from field 2 to last field
if ( NR == 1 ) { # if first record
a[i] = $i # keep each field in array a
gsub ( /^'\''|'\''$/ , "" , a[i] ) # remove quote at start and end in array a
}
print $1 FS $i > a[i]".txt" # print needed field in corresponding file
}
}' myfile.txt
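If gawk is not available, or the file has so many columns that awk runs out of open file descriptors, something along these lines might work. It is only a sketch: the function name split_columns is made up, fields are assumed to be separated by whitespace, and the data values are written as they appear in the input, as in the expected output of the question.

```shell
# Hypothetical helper: writes column i (plus column 1) to <header of column i>.txt
# in the current directory.
split_columns() {
    awk -v q="'" '
    NR == 1 {
        for (i = 2; i <= NF; i++) {
            name[i] = $i
            gsub(q, "", name[i])           # strip the single quotes from the header
            name[i] = name[i] ".txt"
            printf "" > name[i]            # create or truncate the output file
            close(name[i])
        }
    }
    {
        for (i = 2; i <= NF; i++) {
            print $1, $i >> name[i]        # append, then close to keep few files open
            close(name[i])
        }
    }' "$1"
}
```

Closing after every write is slow for large files. If the number of columns is small, the close() calls inside the second loop can be dropped.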

Find max/min of each column in awk

I have a file with a variable number of columns:
Input:
1 1 2
2 1 5
5 2 3
7 0 -1
4 1 4
I want to print the max and min of each column:
Desired output:
max: 7 2 5
min: 1 0 -1
For a single column, e.g. $1, I know I can find the max and min using something like:
awk '{if(min==""){min=max=$1}; if($1>max) {max=$1}; if($1<min) {min=$1};} END {printf "%.2g %.2g\n", min, max}'
Question
How can I extend this to loop over all columns (not necessarily just the 3 in my example)?
Many thanks!
awk 'NR==1{for(i=1;i<=NF;i++)min[i]=max[i]=$i;}
{for(i=1;i<=NF;i++){if($i<min[i]){min[i]=$i}else if($i>max[i])max[i]=$i;}}
END{printf "max:\t"; for(i=1;i<=NF;i++) printf "%d ",max[i]; printf "\nmin:\t"; for(i=1;i<=NF;i++) printf "%d ",min[i]; printf "\n"}' input.txt
input.txt:
1 1 2 2
2 1 5 3
5 2 3 10
7 0 -1 0
4 1 4 5
output:
max: 7 2 5 10
min: 1 0 -1 0
Like this
awk 'NR==1{for(i=1;i<=NF;i++){xmin[i]=$i;xmax[i]=$i}}
{for(i=1;i<=NF;i++){if($i<xmin[i])xmin[i]=$i;if($i>xmax[i])xmax[i]=$i}}
END{for(i=1;i<=NF;i++)print xmin[i],xmax[i]}' file
Let's try to make it a bit shorter by using the min=(current<min?current:min) expression. This is a ternary operator that is the same as saying if (current<min) min=current.
Also, printf "%.2g%s", min[i], (i==NF?"\n":" ") in the END{} block prints a newline after the last field and a space after every other field.
awk 'NR==1{for (i=1; i<=NF; i++) {min[i]=max[i]=$i}; next}
{for (i=1; i<=NF; i++) { min[i]=(min[i]>$i?$i:min[i]); max[i]=(max[i]<$i?$i:max[i]) }}
END {printf "min: "; for (i=1;i<=NF;i++) printf "%.2g%s", min[i], (i==NF?"\n":" ");
printf "max: "; for (i=1;i<=NF;i++) printf "%.2g%s", max[i], (i==NF?"\n":" ")}' file
Sample output:
$ awk 'NR==1{for (i=1; i<=NF; i++) {min[i]=max[i]=$i}; next} {for (i=1; i<=NF; i++) { min[i]=(min[i]>$i?$i:min[i]); max[i]=(max[i]<$i?$i:max[i]) }} END {printf "min: "; for (i=1;i<=NF;i++) printf "%.2g%s", min[i], (i==NF?"\n":" "); printf "max: "; for (i=1;i<=NF;i++) printf "%.2g%s", max[i], (i==NF?"\n":" ")}' file
min: 1 0 -1
max: 7 2 5
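Two details are easy to get wrong in scripts like these: an uninitialised max compares as 0, which gives wrong results when every value in a column is negative, and for (i in array) does not guarantee column order. Here is a sketch that avoids both and prints in the layout the question asked for. The function name col_minmax is made up, and values are printed exactly as they appear in the input.

```shell
# Hypothetical helper: prints "max: ..." and "min: ..." lines, one value per column.
col_minmax() {
    awk '
    {
        if (NF > nf) nf = NF                       # widest row seen so far
        for (i = 1; i <= NF; i++) {
            # the first value seen in a column initialises both min and max
            if (!(i in min) || $i + 0 < min[i] + 0) min[i] = $i
            if (!(i in max) || $i + 0 > max[i] + 0) max[i] = $i
        }
    }
    END {
        printf "max:"; for (i = 1; i <= nf; i++) printf " %s", max[i]; print ""
        printf "min:"; for (i = 1; i <= nf; i++) printf " %s", min[i]; print ""
    }' "$@"
}
```

Adding 0 to both sides forces a numeric comparison even if a field would otherwise be treated as a string.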

Calculations in AWK

I need to perform calculations in awk. For each array[$1,$2] I need to check $3. If it is "531", the value in $4 is positive; otherwise it is negative.
In the end I need to subtract all negative values from the positive values per array[$1,$2].
input
K019001^ABC^531^12
K019001^ABC^601^12
K019002^ABC^531^100
K019002^ABC^601^40
K019003^ABC^531^50
K019003^ABC^601^30
K019003^ABC^601^40
K019004^ABC^531^10
desired output
K019001^ABC^0
K019002^ABC^60
K019003^ABC^-20
K019004^ABC^10
Use this awk:
awk 'BEGIN {FS=OFS=SUBSEP="^"} {a[$1,$2] += $4 * ($3==531 ? 1 : -1)}
END {for (i in a) print i, a[i]}' file
K019001^ABC^0
K019004^ABC^10
K019002^ABC^60
K019003^ABC^-20
UPDATE: To keep the keys in the order in which they first appear:
awk 'BEGIN{FS=OFS="^"} {k=$1 FS $2; if (!(k in a)) b[c++]=k} $3==531{a[k]+=$4; next} {a[k]-=$4}
END {for (i=0; i<c; i++) print b[i], a[b[i]]}' file
K019001^ABC^0
K019002^ABC^60
K019003^ABC^-20
K019004^ABC^10
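If the wanted order is simply sorted by key rather than first-seen order, another option is to keep the short one-liner and pipe its output through sort. This is a sketch: the function name net_by_key is made up, and LC_ALL=C is set so that the ordering does not depend on the locale.

```shell
# Hypothetical helper: net value per ($1,$2) key, sorted by key.
net_by_key() {
    awk 'BEGIN { FS = OFS = SUBSEP = "^" }
         { a[$1, $2] += $4 * ($3 == 531 ? 1 : -1) }   # 531 adds, anything else subtracts
         END { for (k in a) print k, a[k] }' "$@" | LC_ALL=C sort
}
```

Setting SUBSEP to "^" is what lets the combined key print as K019001^ABC without splitting it again.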

awk math - getting natural numbers

I have two columns:
100011780 100016332
10100685 10105465
101190948 101195542
101286838 101288018
101411746 101413662
101686767 101718138
101949793 101950504
101989424 101993757
102095320 102106147
102133372 102143125
I want to get the middle value of those numbers.
Tried to:
awk '{print $1"\t"$2-$1}' input | awk '{print $1"\t"$2/2}' | awk '{print $1+$2}' > output
But some numbers are no longer whole after the division by 2, and probably because of that my output looks like this:
100014056
10103075
101193245
101287428
101412704
1.01702e+08
1.0195e+08
1.01992e+08
1.02101e+08
1.02138e+08
Maybe it's possible to detect the non-integer values and add or subtract 0.5 to make them whole?
You certainly don't need to call awk 3 times to get the average of two numbers.
awk '{printf("%d\n", ($1+$2)/2)}' input
Use printf() to control the output.
100014056
10103075
101193245
101287428
101412704
101702452
101950148
101991590
102100733
102138248
You can add, and use, this round function in your AWK file:
function round(x) {
ival = int(x);
if (ival == x)
return x;
if (x < 0) {
aval = -x;
ival = int(aval);
fraction = aval - ival;
if (fraction >= .5)
return int(x) - 1;
else
return int(x);
} else {
fraction = x - ival;
if (fraction >= .5)
return ival + 1;
else
return ival;
}
}
For example, the avg value will be:
{print round(($1+$2)/2)}
Not sure what you want when the sum is odd, but you could do it all in one go:
gawk '{printf "%i\n", ($1 + $2) / 2}' input
What you are looking for are format control options to printf.
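To make the difference between the formats concrete: print formats a non-integer number with OFMT, which defaults to "%.6g", and that is where values like 1.01702e+08 in the question come from. Here is a small sketch that shows the midpoint three ways. The function name midpoint is made up, and the int(m + 0.5) rounding assumes non-negative input.

```shell
# Hypothetical helper: prints the truncated, exact and rounded midpoint of two columns.
midpoint() {
    awk '{
        m = ($1 + $2) / 2
        # %d truncates, %.1f keeps one decimal, int(m + 0.5) rounds halves up
        printf "%d %.1f %d\n", m, m, int(m + 0.5)
    }' "$@"
}
```

For the pair 101686767 101718138 the exact midpoint is 101702452.5, so the three columns come out as 101702452, 101702452.5 and 101702453.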