I have a problem. My data consists of 500 fields in each row (500 columns), and there are 5000 rows. I want to compute the standard deviation of each row as the output.
Input example
3 0 2 ...(496 other values)... 1
4 1 0 ...(496 other values)... 4
1 3 0 ...(496 other values)... 2
Expected output
0.571 (std for values from the first row)
0.186 (std for values from the second row)
0.612 (std for values from the third row)
I found something like this, but it doesn't fit my case (it computes the std for each column): Compute average and standard deviation with awk
I thought about computing the sum of each row to get the average and then, for every field, std[i] += ($i - sum[i])^2, and at the end sqrt(std[i]/(500-1)), but then I would probably have to create an array for every row (5000 arrays).
Maybe I should transpose the data, turning rows into columns and columns into rows?
Edit:
Yes, this works fantastically:
#!/bin/bash
awk 'function std1() {
    s = 0; t = 0
    for (i = 1; i <= NF; i++)
        s += $i
    mean = s / NF
    for (i = 1; i <= NF; i++)
        t += (mean - $i) * (mean - $i)
    return sqrt(t / NF)
}
{ print std1() }' data.txt >> std.txt
I won't vouch for the calculation, but you could just do:
awk 'function sigma( s, t) {
    for (i = 1; i <= NF; i++)
        s += $i
    mean = s / NF
    for (i = 1; i <= NF; i++)
        t += (mean - $i) * (mean - $i)
    return sqrt(t / NF)
}
{ print sigma() }' input-path
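Note that sigma() above divides by NF, which gives the population standard deviation. If you want the sample standard deviation instead (dividing by N-1, as the 500-1 in the question suggests), a minimal variant would be:

awk 'function sigma( s, t) {
    for (i = 1; i <= NF; i++)
        s += $i
    mean = s / NF
    for (i = 1; i <= NF; i++)
        t += (mean - $i) * (mean - $i)
    return sqrt(t / (NF - 1))    # NF - 1: sample (unbiased) standard deviation
}
{ print sigma() }' input-path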
I have a comma-delimited file. I want to count the number of rows (i.e. the number of non-empty values) in each column, together with the column's header name.
Example Dataset:
ID, IB, IM, IZ
0.05, 0.02, 0.01, 0.09
0.06, 0.01, , 0.08
0.02, 0.06,
Expected output:
Column ID:3
Column IB:3
Column IM:1
Column IZ:2
I have tried quite a few options:
I can split these columns into separate files and then count the number of lines in each file using the wc -l file_name command.
This command is very close to what I am interested in, but I am still unable to get the header name. Any help will be highly appreciated.
I would use GNU AWK for this task in the following way. Let file.txt content be
ID, IB, IM, IZ
0.05, 0.02, 0.01, 0.09
0.06, 0.01, , 0.08
0.02, 0.06,
then
awk 'BEGIN{FS=","}NR==1{split($0,names);next}{for(i=1;i<=NF;i+=1){counts[i]+=$i~/[^[:space:]]/}}END{for(i=1;i<=length(names);i+=1){print "Column",names[i]": "counts[i]}}' file.txt
output
Column ID: 3
Column IB: 3
Column IM: 1
Column IZ: 2
Explanation: I inform GNU AWK that , is the field separator. When processing the 1st record, I split the whole line ($0) into the array names, so ID becomes names[1], IB becomes names[2], IM becomes names[3] and so on; after doing that, go to the next line. For all but the 1st line, iterate over the columns with a for loop and increase counts[i] (where i is the column number) by the result of the test "does that column contain a non-whitespace character?", which is 0 for false and 1 for true. In other words, increase it by 1 if a non-whitespace character is found, else by 0. After processing all lines, iterate over names and print each name with the corresponding value of counts.
(tested in gawk 4.2.1)
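If the one-liner is hard to read, the same program can be laid out over multiple lines; this is only a reformatting of the command above, with identical behavior:

awk 'BEGIN { FS = "," }
NR == 1 { split($0, names); next }
{
    for (i = 1; i <= NF; i += 1)
        counts[i] += $i ~ /[^[:space:]]/
}
END {
    for (i = 1; i <= length(names); i += 1)
        print "Column", names[i] ": " counts[i]
}' file.txt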
With GNU awk:
awk -F'[[:space:]]*,[[:space:]]*' 'NR == 1 {header = $0; next} \
{for(i = 1; i <= NF; i++) n[i] += ($i ~ /\S/)} \
END {$0 = header; for(i = 1; i <= NF; i++) print "Column " $i ": " n[i]}' file
Column ID: 3
Column IB: 3
Column IM: 1
Column IZ: 2
Using any awk in any shell on every Unix box:
$ cat tst.awk
BEGIN { FS=" *, *" }
NR == 1 {
numCols = split($0,tags)
next
}
{
for ( i=1; i<=NF; i++ ) {
if ( $i ~ /./ ) {
cnt[i]++
}
}
}
END {
for ( i=1; i<=numCols; i++ ) {
printf "Column %s:%d\n", tags[i], cnt[i]
}
}
$ awk -f tst.awk file
Column ID:3
Column IB:3
Column IM:1
Column IZ:2
I have the following format of data:
1 3
1.723608 0.8490000
1.743011 0.8390000
1.835833 0.7830000
2 5
1.751377 0.8350000
1.907603 0.7330000
1.780053 0.8190000
1.601427 0.9020000
1.950540 0.6970000
3 2
1.993951 0.6610000
1.796519 0.8090000
4 4
1.734961 0.8430000
1.840741 0.7800000
1.818444 0.7950000
1.810717 0.7980000
5 1
2.037940 0.6150000
6 7
1.738221 0.8330000
1.767678 0.8260000
1.788517 0.8140000
2.223586 0.4070000
1.667492 0.8760000
2.039232 0.6130000
1.758823 0.8300000
...
Data consists of data blocks. Each data block has the same format as follows:
The very first line is the header line. It contains the ID number and the number of lines of "real data" in the block. For example, the first data block's ID is 1 and it has 3 data lines; for the third data block, the ID is 3 and it has 2 data lines. All data blocks have this header line.
Then the "real data" follows. As explained, the number of lines of "real data" is given by the second integer of the header line.
Accordingly, the total number of lines for each data block is number_of_lines+1. In this example, data block 1 spans 4 lines in total, data block 2 spans 6 lines, and so on.
This format repeats for up to 10000 data blocks in my current data, but I can provide that 10000 as an input variable to the bash or awk script; I know the total number of data blocks.
Now, what I wish to do is get the average of each of the two data columns and print it together with the data block's ID number and number of lines. The output text will have:
ID_number number_of_lines average_of_column_1 average_of_column_2
using 5 spaces between columns, with 6 decimal places. The result will have 10000 lines, and each line will have the ID, the number of lines, the average of column 1, and the average of column 2 for that data block. The result for this example will look like
1 3 1.767484 0.823666
2 5 1.798200 0.797200
3 2 1.895235 0.735000
...
I know how to get the average of a simple data column in awk and bash; that has been answered on Stack Overflow many times. For example, I really favor using
awk '{ total += $2; count++ } END { print total/count }' data.txt
So, I wish to do this using awk or bash. But I really have no clue how to approach, or even start, computing this kind of average over multiple repeating data blocks with a different number of lines in each block.
I was trying based on awk, following
Awk average of n data in each column
and
https://www.unix.com/shell-programming-and-scripting/135829-partial-average-column-awk.html
But I'm not sure how I can use NR or FNR to average data when the number of lines varies from one data block to the next.
You may try this awk:
awk -v OFS='\t' '$2 ~ /\./ {s1 += $1; s2 += $2; next} {if (id) {print id, num, s1/num, s2/num; s1=s2=0} id=$1; num=$2} END {print id, num, s1/num, s2/num}' file
1 3 1.76748 0.823667
2 5 1.7982 0.7972
3 2 1.89524 0.735
4 4 1.80122 0.804
5 1 2.03794 0.615
6 7 1.85479 0.742714
To get fixed six-decimal numbers, set OFMT like this:
awk -v OFMT="%.6f" -v OFS='\t' '$2 ~ /\./ {s1 += $1; s2 += $2; next} {if (id) {print id, num, s1/num, s2/num; s1=s2=0} id=$1; num=$2} END {print id, num, s1/num, s2/num}' file
1 3 1.767484 0.823667
2 5 1.798200 0.797200
3 2 1.895235 0.735000
4 4 1.801216 0.804000
5 1 2.037940 0.615000
6 7 1.854793 0.742714
An expanded form:
awk -v OFMT='%.6f' -v OFS='\t' '
$2 ~ /\./ {
s1 += $1
s2 += $2
next
}
{
if (id) {
print id, num, s1/num, s2/num
s1 = s2 = 0
}
id = $1
num = $2
}
END {
print id, num, s1/num, s2/num
}' file
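If you literally want five spaces between the output columns, as the question asks, one small variant (same logic as above, only OFS changed) would be:

awk -v OFMT='%.6f' -v OFS='     ' '
$2 ~ /\./ { s1 += $1; s2 += $2; next }
{
    if (id) {
        print id, num, s1/num, s2/num
        s1 = s2 = 0
    }
    id = $1
    num = $2
}
END { print id, num, s1/num, s2/num }' file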
And yet another one:
awk -v num_blocks=10000 '
BEGIN {
OFS = "\t"
OFMT = "%.6f"
}
num_lines == 0 {
id = $1
num_lines = $2
sum1 = sum2 = 0
next
}
lines_read < num_lines {
sum1 += $1
sum2 += $2
lines_read++
}
lines_read >= num_lines {
print id, num_lines,
sum1 / num_lines,
sum2 / num_lines
num_lines = lines_read = 0
num_blocks--;
}
num_blocks <= 0 {
exit
}' file
You could try
awk -v qnt=none 'qnt == "none" {id = $1; qnt = $2; s1 = s2 = line = 0;next}{s1 += $1; s2 += $2; ++line} line == qnt{printf "%d %d %.6f %.6f\n", id, qnt, s1/qnt, s2/qnt; qnt="none"}'
The above is expanded as follows:
qnt == "none" {
    id = $1
    qnt = $2
    s1 = s2 = line = 0
    next
}
{
    s1 += $1
    s2 += $2
    ++line
}
line == qnt {
    printf "%d %d %.6f %.6f\n", id, qnt, s1/qnt, s2/qnt
    qnt = "none"
}
After a data block is processed (or at the beginning), record header info.
Otherwise, add to the sums, and print the result when we're done with all lines in this block.
awk '{for (i = 1; i <= NF; i++) {gsub(/[^[:alnum:]]/, " "); print tolower($i)": "NR | "sort -V | uniq";}}' input.txt
With the above code, I get this output:
line1: 2
line1: 3
line1: 5
line2: 1
line2: 2
line3: 10
I want it like below:
line1: 2, 3, 5
line2: 1, 2
line3: 10
How to achieve it?
Use gawk's array features. I'll provide actual code once I hack it up.
awk '{for (i = 1; i <= NF; i++) {
gsub(/[^[:alnum:]]/, " ");
arr[tolower($i)] = arr[tolower($i)]NR", "}
}
END {
for (x in arr) {
print x": "substr(arr[x], 1, length(arr[x])-2);
}}' input.txt | sort
Note that this includes duplicate line numbers if a word appears multiple times on the same line.
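If you want each line number to appear only once per word, one way (just a sketch building on the code above) is to skip word/line pairs you have already seen:

awk '{
    gsub(/[^[:alnum:]]/, " ")
    for (i = 1; i <= NF; i++) {
        w = tolower($i)
        if (!seen[w, NR]++)               # only record the first occurrence on each line
            arr[w] = arr[w] NR ", "
    }
}
END {
    for (x in arr)
        print x ": " substr(arr[x], 1, length(arr[x]) - 2)
}' input.txt | sort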
using perl...
#!/usr/bin/perl
while(<>){
if( /(\w+):\s*(\d+)/){ # extract the parts
$arr{lc($1)}{$2} ++ # count them
};
}
for my $k (sort keys %arr){ # print sorted alpha
print "$k: ";
$lines=$arr{$k};
print join(", ",(sort {$a<=>$b} keys %$lines)),"\n"; # print sorted numerically
}
This solution removes the duplicates and sorts the numbers. Is this what you needed?
I have a file 'file.dat' with 24 rows x 16 columns of data.
I have already tested the following awk script that computes the average of each column.
touch aver-std.dat
awk '{ for (i=1; i<=NF; i++) { sum[i]+= $i } }
END { for (i=1; i<=NF; i++ )
{ printf "%f \n", sum[i]/NR} }' file.dat >> aver-std.dat
The output 'aver-std.dat' has one column with these averages.
Similarly to the average computation, I would like to compute the standard deviation of each column of 'file.dat' and write it in a second column of the output file.
Namely, I would like an output file with the average in the first column and the standard deviation in the second column.
I have been trying different tests, like this one:
touch aver-std.dat
awk '{ for (i=1; i<=NF; i++) { sum[i]+= $i }}
END { for (i=1; i<=NF; i++ )
{std[i] += ($i - sum[i])^2 ; printf "%f %f \n", sum[i]/NR, sqrt(std[i]/(NR-1))}}' file.dat >> aver-std.dat
and it writes values in the second column, but they are not the correct values of the standard deviation; the computation of the deviation is not right somehow.
I would very much appreciate any help.
Regards
Standard deviation is
stdev = sqrt((1/N)*(sum of (value - mean)^2))
But there is another form of the formula which does not require you to know the mean beforehand. It is:
stdev = sqrt((1/N)*((sum of squares) - (((sum)^2)/N)))
(A quick web search for "sum of squares" formula for standard deviation will give you the derivation if you are interested)
To use this formula, you need to keep track of both the sum and the sum of squares of the values. So your awk script will change to:
awk '{for(i=1;i<=NF;i++) {sum[i] += $i; sumsq[i] += ($i)^2}}
END {for (i=1;i<=NF;i++) {
printf "%f %f \n", sum[i]/NR, sqrt((sumsq[i]-sum[i]^2/NR)/NR)}
}' file.dat >> aver-std.dat
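If you want the sample standard deviation instead (dividing by NR-1, as in your attempt), only the divisor in the END block changes; a sketch:

awk '{for(i=1;i<=NF;i++) {sum[i] += $i; sumsq[i] += ($i)^2}}
     END {for (i=1;i<=NF;i++) {
            printf "%f %f \n", sum[i]/NR, sqrt((sumsq[i]-sum[i]^2/NR)/(NR-1))}
     }' file.dat >> aver-std.dat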
To simply calculate the population standard deviation of a list of numbers, you can use a command like this:
awk '{x+=$0;y+=$0^2}END{print sqrt(y/NR-(x/NR)^2)}'
Or this calculates the sample standard deviation:
awk '{sum+=$0;a[NR]=$0}END{for(i in a)y+=(a[i]-(sum/NR))^2;print sqrt(y/(NR-1))}'
^ is in POSIX. ** is supported by gawk and nawk but not by mawk.
Here are some calculations I've made on a grinder data output file for a long soak test which had to be interrupted:
Standard deviation (biased) + average:
cat <grinder_data_file> | grep -v "1$" | awk -F ', ' '{ sum=sum+$5 ; sumX2+=(($5)^2)} END { printf "Average: %f. Standard Deviation: %f \n", sum/NR, sqrt(sumX2/(NR) - ((sum/NR)^2) )}'
Standard deviation(non-biased) + average:
cat <grinder_data_file> | grep -v "1$" | awk -F ', ' '{ sum=sum+$5 ; sumX2+=(($5)^2)} END { avg=sum/NR; printf "Average: %f. Standard Deviation: %f \n", avg, sqrt(sumX2/(NR-1) - 2*avg*(sum/(NR-1)) + ((NR*(avg^2))/(NR-1)))}'
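Since sum = NR*avg, the non-biased expression above simplifies algebraically to sqrt((sumX2 - NR*avg^2)/(NR-1)), so an equivalent (and slightly shorter) form would be:

cat <grinder_data_file> | grep -v "1$" | awk -F ', ' '{ sum=sum+$5 ; sumX2+=(($5)^2)} END { avg=sum/NR; printf "Average: %f. Standard Deviation: %f \n", avg, sqrt((sumX2 - NR*avg^2)/(NR-1)) }'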
Your script should somehow be in this form instead:
awk '{
sum = 0
for (i=1; i<=NF; i++) {
sum += $i
}
avg = sum / NF
avga[NR] = avg
sum = 0
for (i=1; i<=NF; i++) {
sum += ($i - avg) ^ 2
}
stda[NR] = sqrt(sum / NF)
}
END { for (i = 1; i in stda; ++i) { printf "%f %f \n", avga[i], stda[i] } }' file.dat >> aver-std.dat
I have two columns:
100011780 100016332
10100685 10105465
101190948 101195542
101286838 101288018
101411746 101413662
101686767 101718138
101949793 101950504
101989424 101993757
102095320 102106147
102133372 102143125
I want to get the middle value of those numbers.
Tried to:
awk '{print $1"\t"$2-$1}' input | awk '{print $1"\t"$2/2}' | awk '{print $1+$2}' > output
But after the division by 2 some numbers aren't natural anymore, and probably because of that my output looks like this:
100014056
10103075
101193245
101287428
101412704
1.01702e+08
1.0195e+08
1.01992e+08
1.02101e+08
1.02138e+08
Maybe it's possible to locate the non-natural values and add/subtract 0.5 to make them natural?
You certainly don't need to call awk 3 times to get the average of two numbers.
awk '{printf("%d\n", ($1+$2)/2)}' input
Use printf() to control the output.
100014056
10103075
101193245
101287428
101412704
101702452
101950148
101991590
102100733
102138248
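Note that %d truncates the fractional part (in gawk the value is simply converted to an integer), so for example 101702452.5 prints as 101702452. If you would rather round halves up, one option for the non-negative values shown here is to add 0.5 before printing; a sketch:

awk '{printf("%d\n", ($1 + $2) / 2 + 0.5)}' input    # +0.5 rounds halves up for non-negative values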
You can add, and use, this round function in your AWK file:
function round(x) {
    ival = int(x);
    if (ival == x)
        return x;
    if (x < 0) {
        aval = -x;
        ival = int(aval);
        fraction = aval - ival;
        if (fraction >= .5)
            return int(x) - 1;
        else
            return int(x);
    } else {
        fraction = x - ival;
        if (fraction >= .5)
            return ival + 1;
        else
            return ival;
    }
}
For example, the avg value will be:
{print round(($1+$2)/2)}
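A complete run might then look like this, assuming the round function above and that rule are saved together in a file (round.awk is just a made-up name here):

awk -f round.awk input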
Not sure what you want when the sum is uneven, but you could do it all in one go:
gawk '{printf "%i\n", ($1 + $2) / 2}' input
What you are looking for are format control options to printf.
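For instance, %.0f rounds to the nearest integer instead of truncating (exact .5 halves are typically rounded to the nearest even value by the underlying C library), so another variant would be:

gawk '{printf "%.0f\n", ($1 + $2) / 2}' input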