Divide floats in awk - awk

I have written a code to calculate the zscore which calculates the mean and standard deviation from one file and uses some values from rows in another file, as follows:
mean=$(awk '{total += $2; count++} END {print total/count}' ABC_avg.txt)
#calculating mean of the second column of the file
std=$(awk '{x[NR]=$2; s+=$2; n++} END{a=s/n; for (i in x){ss += (x[i]-a)^2} sd = sqrt(ss/n); print sd}' ABC_avg.txt)
#calculating standard deviation from the second column of the same file
awk '{if (std) print $2-$mean/$std}' ABC_splicedavg.txt" > ABC.tmp
#calculate the zscore for each row and store it in a temporary file
zscore=$(awk '{total += $0; count++} END {if (count) print total/count}' ABC.tmp)
#calculate an average of all the zscores in the rows and store it in a variable
echo $motif" "$zscore
rm ABC.tmp
However when I execute this code ,at the step where a temp file is created I get an error as fatal: division by zero attempted, what is the right way to implement this code? TIA I used bc -l option but it gives a very long version of the floating integer.

Here is a script to compute mean and std in one pass, you may lose some resolution if not acceptable there are alternatives...
$ awk '{print rand()}' <(seq 100)
| awk '{sum+=$1; sqsum+=$1^2}
END{print mean=sum/NR, std=sqrt(sqsum/NR-mean^2), z=mean/std}'
0.486904 0.321789 1.51312
Your script for z-score for each sample is wrong! You need to do ($2-mean)/std.

You can control the precision of your output with bc by using the scale variable:
$ echo "4/7" | bc -l
.57142857142857142857
$ echo "scale=3; 4/7" | bc -l
.571

Related

awk command or sed command

000Bxxxxx111118064085vxas - header
10000000001000000000053009-000000000053009-
10000000005000000000000000+000000000000000+
10000000030000000004025404-000000004025404-
10000000039000000000004930-000000000004930-
10000000088000005417665901-000005417665901-
90000060883328364801913 - trailer
In the above file we have header and trailer and the records which start with 1 is the detail record
in the detail record,want to sum the values starting from position 28 till 44 including the sign using awk/sed command
Here is sed, with help from bc to do the arithmetic:
sed -rn '
/header|trailer/! {
s/[[:digit:]]*[+-]([[:digit:]]+)([+-])$/\2\1/
H
}
$ {
x
s/\n//gp
}
' file | bc
I assume the +/- sign follows the number.
Using awk we can solve this problem making use of substr:
substr(s, m[, n ]):
Return the at most n-character substring of s that begins at position m, numbering from 1. If n is omitted, or if n specifies more characters than are left in the string, the length of the substring shall be limited by the length of the string s.
This allows us to take the string which represents the number. Here, I assumed that the sign before and after the number is same and thus the sign of the number :
$ echo "10000000001000000000053009-000000000053009-" \
| awk '{print length($0); print substr($0,27,43-27)}'
43
-000000000053009
Since awk implicitly converts strings to numbers if you do numeric operations on them we can write the following awk-code to achieve the requested :
$ awk '/header|trailer/{next}
{s+=substr($0,27,43-27)}
END{print s}' file.dat
-5421749244
Or in a single line:
$ awk '/header|trailer/{next}{s+=substr($0,27,43-27)} END{print s}' file.dat
-5421749244
The above examples just work on the example file given by the OP. However, if you have a file containing multiple blocks with header and trailer and you only want to use the text inside these blocks (exclude everything outside of the blocks), then you should handle it a bit differently :
$ awk '/header/{s=0;c=1;next}
/trailer/{S+=s;c=0;next}
c{s+=substr($0,27,43-27)}
END{print S}' file.dat
Here we do the following:
If a line with header is found, reset the block sum s to ZERO and set c=1 indicating that we take the next lines into account
If a line with trailer is found, add the block sum s to the overall sum S and set c=0 indicating to ignore the lines.
If c/=0 compute the block sum s
At the END, print the total sum S

gawk: sum from a record to another

I have the following table:
login:numero:sobrenome:nome
INICIO
Alcala:1234:Thomas:Alcala
Baron:1235:Alexis:Baron
Bezier:1236:Pascal:Bezier
Boutier:1237:Damien:Boutier
Buard:1238:Jeremy:Buard
Fagour:1239:Dimitri:Fagour
Fagour:1240:Stephane:Fagour
Justice:1241:Jonathan:Justice
FIM
Numero de usuario = 15
I would like to return the sum from line Bezier to Buard.
I tried the following command:
gawk '/Bezier/{init=NR}/Buard/{fin=NR}NR>=$init{Sum1+=$2}NR>$fin{Sum2+=$2}END{Sum=Sum1-Sum2;print Sum}BEGIN{FS=":"}' arq_test_awq
But no way, Sum2 always begin with /Buard/ line. Even if I put "fin=NR+1", the result is the same. I can begin with /Fagour/ to solve the problem but I just can't understand why it doesn't work with this version.
check whether these are the records of interest
$ awk -F: '/Bezier/,/Buard/' file
Bezier:1236:Pascal:Bezier
Boutier:1237:Damien:Boutier
Buard:1238:Jeremy:Buard
sum up the second fields and print at the end
$ awk -F: '/Bezier/,/Buard/{sum+=$2} END{print sum}' file
3711

Calculating 95th percentile with awk

I'm new in awk scripting and would like to have some help in calculating 95th percentile value for a file that consist of this data:
0.0001357
0.000112
0.000062
0.000054
0.000127
0.000114
0.000136
I tried:
cat filename.txt | sort -n |
awk 'BEGIN{c=0} {total[c]=$1; c++;} END{print total[int(NR*0.95-0.5)]}'
but I dont seem to get the correct value when I compare it to excel.
I am not sure if Excel does some kind of weighted percentile, but if you actually want one of the numbers that was in your original set, then your method should work correctly for rounding.
You can simplify a little bit like this, but it's the same thing.
sort -n input.txt | awk '{all[NR] = $0} END{print all[int(NR*0.95 - 0.5)]}'
Following the calculation suggested here, you can do this:
sort file -n | awk 'BEGIN{c=0} length($0){a[c]=$0;c++}END{p5=(c/100*5); p5=p5%1?int(p5)+1:p5; print a[c-p5-1]}'
Output for given input:
sort file -n | awk 'BEGIN{c=0} length($0){a[c]=$0;c++}END{p5=(c/100*5); p5=p5%1?int(p5)+1:p5; print a[c-p5-1]}'
0.0001357
Explanation:
Sort the file numerically
drop the top 5%
pick the next value
PS. The statement p5=p5%1?int(p5)+1:p5 is doing a ceil operation available in many languages.
Just for the record, there is also solution, inspired by merlin2011 answer, that prints several desired percentiles:
# get amount of values
num="$(wc -l input.txt | cut -f1 -d' ')";
# sort values
sort -n input.txt > temp && mv temp input.txt
# print the desired percentiles
for p in 50 70 80 90 92 95 99 100; do
printf "%3s%%: %-5.5sms\n" "$p" "$(head input.txt -n "$((num / 100 * $p))" | tail -n1)";
done
Update: I messed it up. Bash math can't handle floating numbers, even not if used during a "single expression". That only works for files with 100*(N>0) values. So either bc or awk is required to do the math.
In case you have an "odd" amount of values, you should replace "$((num / 100 * $p))" with "$(awk "BEGIN {print int($num/100*$p)}")" in the code above.
Finally awk is part of that answer. ;)

setting default numeric format in awk

I wanted to do a simple parsing of two files with ids and some corresponding numerical values. I didn't want awk to print numbers in scientific notation.
File looks like this:
someid-1 860025 50.0401 4.00022
someid-2 384319 22.3614 1.78758
someid-3 52096 3.03118 0.242314
someid-4 43770 2.54674 0.203587
someid-5 33747 1.96355 0.156967
someid-6 20281 1.18004 0.0943328
someid-7 12231 0.711655 0.0568899
someid-8 10936 0.636306 0.0508665
someid-9 10224.8 0.594925 0.0475585
someid-10 10188.8 0.59283 0.047391
when use print instead of printf :
awk 'BEGIN{FS=OFS="\t"} NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); print $1,k[2],k[3],k[4],$2,$3,$4}' OSCAo.txt dme_miRNA_PIWI_OSC.txt | sort -n -r -k 7 | head
i get this result:
dme-miR-iab-4-5p 0.333333 0.000016 0.000001 0.25 0.000605606 9.36543e-07
dme-miR-9c-5p 10987.300000 0.525413 0.048798 160.2 0.388072 0.000600137
dme-miR-9c-3p 731.986000 0.035003 0.003251 2.10714 0.00510439 7.89372e-06
dme-miR-9b-5p 30322.500000 1.450020 0.134670 595.067 1.4415 0.00222922
dme-miR-9b-3p 2628.280000 0.125684 0.011673 48 0.116276 0.000179816
dme-miR-9a-3p 10.365000 0.000496 0.000046 0.25 0.000605606 9.36543e-07
dme-miR-999-5p 103.433000 0.004946 0.000459 0.0769231 0.00018634 2.88167e-07
dme-miR-999-3p 1513.790000 0.072389 0.006723 28 0.0678278 0.000104893
dme-miR-998-5p 514.000000 0.024579 0.002283 73 0.176837 0.000273471
dme-miR-998-3p 3529.000000 0.168756 0.015673 42 0.101742 0.000157339
Notice the scientific notation in the last column
I understand that printf with appropriate format modifier can do the job but the code becomes very lengthy. I have to write something like this:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); printf "%s\t%3.6f\t%3.6f\t%3.6f\t%3.6f\t%3.6f\t%3.6f\n", $1,k[2],k[3],k[4],$2,$3,$4}' file1.txt file2.txt > fileout.txt
This becomes clumsy when I have to parse fileout with another similarly structured file.
Is there any way to specify default numerical output, such that any string will be printed like a string but all numbers follow a particular format.
I think You misinterpreted the meaning of %3.6f. The first number before the decimal point is the field width not the "number of digits before decimal point". (See prinft(3))
So You should use %10.6f instead. It can be tested easily in bash
$ printf "%3.6f\n%3.6f\n%3.6f" 123.456 12.345 1.234
123.456000
12.345000
1.234000
$ printf "%10.6f\n%10.6f\n%10.6f" 123.456 12.345 1.234
123.456000
12.345000
1.234000
You can see that the later aligns to the decimal point properly.
As sidharth c nadhan mentioned You can use the OFMT awk internal variable (seem awk(1)). An example:
$ awk 'BEGIN{print 123.456; print 12.345; print 1.234}'
123.456
12.345
1.234
$ awk -vOFMT=%10.6f 'BEGIN{print 123.456; print 12.345; print 1.234}'
123.456000
12.345000
1.234000
As I see in You example the number with maximum digits can be 123456.1234567, so the format %15.7f to cover all and show a nice looking table.
But unfortunately it will not work if the number has no decimal point in it or even if it does, but it ends with .0.
$ awk -vOFMT=%15.7f 'BEGIN{print 123.456;print 123;print 123.0;print 0.0+123.0}'
123.4560000
123
123
123
I even tried gawk's strtonum() function, but the integers are considered as non-OFMT strings. See
awk -vOFMT=%15.7f -vCONVFMT=%15.7f 'BEGIN{print 123.456; print strtonum(123); print strtonum(123.0)}'
It has the same output as before.
So I think, you have to use printf anyway. The script can be a little bit shorter and a bit more configurable:
awk -vf='\t'%15.7f 'NR==FNR{x[$1]=sprintf("%s"f f f,$1,$2,$3,$4);next}$1 in x{printf("%s"f f f"\n",x[$1],$2,$3,$4)}' file1.txt file2.txt
The script will not work properly if there are duplicated IDs in the first file. If it does not happen then the two conditions can be changed and the ;next can be left off.
awk 'NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); printf "%s\t%9s\t%9s\t%9s\t%9s\t%9s\t%9s\n", $1,k[2],k[3],k[4],$2,$3,$4}' file1.txt file2.txt > fileout.txt

how to get rid of awk fatal division by zero error

when ever I am trying to calculate mean and standard deviation using awk i am getting "awk: fatal: division by zero attempted" error.
my command is
awk '{s+=$3} END{print $2"\t"s/(NR)}' >> mean;
awk '{sum+=$3;sumsq+=$3*$3} END {print $2"\t"sqrt(sumsq/NR - (sum/NR)^2)}' >>sd
does any one know how to solve this ?
Your trouble is that ... you are dividing by zero.
You have two commands:
awk '{s+=$3} END{print $2"\t"s/(NR)}' >> mean;
awk '{sum+=$3;sumsq+=$3*$3} END {print $2"\t"sqrt(sumsq/NR - (sum/NR)^2)}' >>sd
The first command reads from standard input to EOF. The second command is then run, tries to read standard input, but finds that it is empty, so it has zero records read, so NR is zero, and you are dividing by 0, and crashing.
You will need to deal with both the mean and the standard deviation in a single command.
awk '{s1 += $3; s2 += $3*$3}
END { if (NR > 0){
print $2 "\t" s1 / NR;
print $2 "\t" sqrt(s2 / NR - (s1/NR)^2);
}
}'
This avoids divide-by-zero errors.