Odd `gawk` filtering of very small floating point number - awk

gawk filters a very small positive number differently depending on the threshold used, even though every threshold should retain the entry.
Example input file, tmp:
A 3.92e-373
B 5e-300
C 5e-20
D 5e-6
E 5e-3
Output:
% gawk '$2 < 5e-4' tmp
B 5e-300
C 5e-20
D 5e-6
% gawk '$2 < 5e-8' tmp
A 3.92e-373
B 5e-300
C 5e-20
Note that gawk '$2 < 5e-4' should retain entry A, since 3.92e-373 < 5e-4; it does work for gawk '$2 < 5e-8'.
Clearly this is an issue with the limits of floating point, but I find it odd that the result is inconsistent between the two thresholds. Shouldn't gawk simply round 3.92e-373 down to 0 and thus print this line under all circumstances?

I wouldn't assume that gawk can figure out what's a number vs a string given your input and hard-coded values. Make sure they're treated as numbers by using strtonum() on them:
$ gawk 'strtonum($2) < strtonum("5e-4")' file
A 3.92e-373
B 5e-300
C 5e-20
D 5e-6
$ gawk 'strtonum($2) < strtonum("5e-8")' file
A 3.92e-373
B 5e-300
C 5e-20
You can see what types gawk thinks it's dealing with by calling typeof() on each:
$ gawk '{print typeof($2), $2, typeof(5e-4), 5e-4, strtonum($2), strtonum("5e-4")}' file | column -t
string 3.92e-373 number 0.0005 0 0.0005
strnum 5e-300 number 0.0005 5e-300 0.0005
strnum 5e-20 number 0.0005 5e-20 0.0005
strnum 5e-6 number 0.0005 5e-06 0.0005
strnum 5e-3 number 0.0005 0.005 0.0005
So it looks like strtonum("5e-4") is redundant, but IMHO it improves clarity so I'd keep it.
Notice that gawk doesn't automatically recognize 3.92e-373 as a number, so the comparison for that input is string vs. number, which is performed as a string comparison (see the table at https://www.gnu.org/software/gawk/manual/gawk.html#Typing-and-Comparison).
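If gawk's strtonum() is not available, a portable alternative (my addition, not from the original answer) is to coerce the field with +0, which forces a numeric comparison in any POSIX awk:

```shell
# Coercing with +0 forces the comparison to be numeric in any POSIX awk,
# so the gawk-only strtonum() is not required; 3.92e-373 underflows to 0
awk '$2 + 0 < 5e-4' <<'EOF'
A 3.92e-373
B 5e-300
C 5e-20
D 5e-6
E 5e-3
EOF
```

This prints the A, B, C, and D lines, matching the strtonum() result above.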

Related

Awk: Why round up number doesn't work with range?

Trying to understand why the following code will print my desired range from 0-30:
awk 'BEGIN{n=300;k=sprintf("%.0f",n/10);x=k*1;for (i=0;i<=x;i++) print i}' /dev/null
While the following code will only print numbers ranging from 0-3:
awk 'BEGIN{n=300;k=sprintf("%.0f",n/10);for (i=0;i<=k;i++) print i}' /dev/null
Is there a better way to round up a number and print the range?
What are you hoping the sprintf() will do for you? All it really does is convert the number into a string, so the later comparison is string-based rather than numeric. That's why you have a problem: the string "4" is larger than the string "30". You do not need /dev/null at the end of the line, by the way. All you need is:
awk 'BEGIN{n=300;k=n/10;for (i=0;i<=k;i++) print i}'
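To make the string-vs-number effect concrete, here is a minimal demonstration (my sketch, not from the original answer; the behavior is standard POSIX awk comparison semantics):

```shell
# k from sprintf() is a plain string, so comparing a number against it falls
# back to string ordering, where "4" > "30". Adding 0 restores numeric order.
awk 'BEGIN { k = sprintf("%.0f", 30); print (4 <= k), (4 <= k + 0) }'
```

The first comparison prints 0 (false, string ordering) and the second prints 1 (true, numeric).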
Actually, I see you said something about rounding up a number; is that what you're hoping the sprintf will do? Most [s]printf implementations do unbiased rounding, so it will round .5 towards even rather than up or down as you might expect. Consider this instead to control the rounding direction:
$ awk 'BEGIN{x=6.5; print x, int(x), sprintf("%.0f",x), int(x+0.5)}'
6.5 6 6 7
$ awk 'BEGIN{x=7.5; print x, int(x), sprintf("%.0f",x), int(x+0.5)}'
7.5 7 8 8
Note in the above that for positive numbers int(x) always rounds down and int(x+0.5) always rounds up, while sprintf("%.0f",x) rounds towards the nearest even number. To do it for negative numbers too:
$ awk 'BEGIN{x=1; print x, "down:", int(x<0 ? x-0.5 : x), "up:", int(x<0 ? x : x+0.5)}'
1 down: 1 up: 1
$ awk 'BEGIN{x=0.5; print x, "down:", int(x<0 ? x-0.5 : x), "up:", int(x<0 ? x : x+0.5)}'
0.5 down: 0 up: 1
$ awk 'BEGIN{x=0; print x, "down:", int(x<0 ? x-0.5 : x), "up:", int(x<0 ? x : x+0.5)}'
0 down: 0 up: 0
$ awk 'BEGIN{x=-0.5; print x, "down:", int(x<0 ? x-0.5 : x), "up:", int(x<0 ? x : x+0.5)}'
-0.5 down: -1 up: 0
$ awk 'BEGIN{x=-1; print x, "down:", int(x<0 ? x-0.5 : x), "up:", int(x<0 ? x : x+0.5)}'
-1 down: -1 up: -1
See https://www.gnu.org/software/gawk/manual/gawk.html#Round-Function for more info, but I don't understand why the round() function given there is so complicated.
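The int()-based expressions above can be wrapped in a small helper; this is just a sketch of a round-half-away-from-zero round(), not the manual's version:

```shell
# round half away from zero: -0.5 -> -1, 0.5 -> 1, 7.5 -> 8
awk 'function round(x) { return int(x < 0 ? x - 0.5 : x + 0.5) }
     BEGIN { print round(0.5), round(-0.5), round(7.5), round(-7.5) }'
```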
From the GNU Awk documentation it is clear that sprintf returns a string, which you need to convert back to a number for the loop condition:
awk 'BEGIN{n=300;k=sprintf("%.0f",n/10);for (i=0;i<=k+0;i++) print i}' /dev/null
# adding +0 casts the string to a numeric type in awk
In the first example, since you multiply k by 1 and store the result in x before using it in the for loop, that conversion has already taken place.
As Ed's answer suggests, you could also do the conversion once rather than on every loop iteration:
awk 'BEGIN{n=300;k=(sprintf("%.0f",n/10)+0); for (i=0;i<=k;i++) print i}' /dev/null

math in awk - unexpected formatting of floating-point numbers - loss of precision [duplicate]

This question already has answers here:
How can I make awk not use scientific notation when printing small values?
(3 answers)
Closed 6 years ago.
I am trying to do some simple math in awk
user@lab-client:~$ awk '{ram=(1.8 * 1024) * 1024; print ram}'
1.88744e+06
So I assume this means that the number is too large to be stored in the variable "ram".
The total number is: 1887436.8
Let's try to store that number in the variable directly:
user@lab-client:~$ awk '{ram=1887436.8; print ram}'
1.88744e+06
Same again. But what if we get rid of the "."?
user@lab-client:~$ awk '{ram=18874368; print ram}'
18874368
Further tests show that when there is a dot in the number, it cannot be longer than 6 digits:
user@lab-client:~$ awk '{ram=188743.68; print ram}'
188744
So it's not that the number is too large; it is the dot that messes things up. How can I get around this?
You can control the number of decimal places with printf; eventually, though, the extra digits stop being significant due to the floating-point representation.
For example:
awk 'BEGIN{for(i=5;i<20;i++) printf "%."i"f\n", 1./3}'
0.33333
0.333333
0.3333333
0.33333333
0.333333333
0.3333333333
0.33333333333
0.333333333333
0.3333333333333
0.33333333333333
0.333333333333333
0.3333333333333333
0.33333333333333331
0.333333333333333315
0.3333333333333333148
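For context (my addition, not part of the original answer): awk numbers are C doubles, which carry roughly 15 to 17 significant decimal digits, so the degradation above is expected:

```shell
# 1/3 as a double is exact only to about the 16th significant digit;
# everything after that is an artifact of the binary representation
awk 'BEGIN { printf "%.20f\n", 1/3 }'
```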
To complement karakfa's helpful answer:
In addition to explicit number formatting with printf and sprintf, you can also change awk's default floating-point number formatting via the built-in variable OFMT.
Note that it does not apply to integers.
OFMT defaults to %.6g, which rounds any floating-point number to 6 significant digits and switches to scientific notation when the decimal exponent is less than -4 or at least 6.
The calculation result 1887436.8, which has 8 significant digits, is therefore represented as 1.88744e+06, i.e., in scientific notation with 6 significant digits.
The following example sets OFMT to %.1f in order to output all floating-point numbers with 1 decimal place by default:
$ awk -v OFMT='%.1f' 'BEGIN {ram=(1.8 * 1024) * 1024; print ram}'
1887436.8
Note, however, that OFMT does not apply in the following scenarios:
If the floating-point number is used in a string concatenation:
$ awk -v OFMT='%.1f' 'BEGIN { print "result: " 1 / 3 }'
result: 0.333333
# Workaround: use sprintf() with OFMT
$ awk -v OFMT='%.1f' 'BEGIN { print "result: " sprintf(OFMT, 1 / 3) }'
result: 0.3
If a literal can be parsed as an integer, even if it looks like a floating-point number:
$ awk -v OFMT='%.1f' 'BEGIN { print 1.000 }'
1
Caveat: There are many subtleties around number conversion and formatting in awk, not least because of the limited precision of floating-point numbers (which in awk are always of the ISO C double type).
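One related subtlety worth adding (my supplementary example): it is the sibling variable CONVFMT, not OFMT, that governs number-to-string conversion such as concatenation, so setting it changes the concatenation case shown above:

```shell
# CONVFMT (default "%.6g") controls number-to-string conversion; here the
# result of 1/3 is converted with "%.1f" when concatenated into the string
awk -v CONVFMT='%.1f' 'BEGIN { s = "result: " 1/3; print s }'
```

This prints "result: 0.3", whereas setting only OFMT leaves the concatenation at "0.333333".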

Divide floats in awk

I have written a script to calculate the z-score; it calculates the mean and standard deviation from one file and uses some values from rows in another file, as follows:
mean=$(awk '{total += $2; count++} END {print total/count}' ABC_avg.txt)
#calculating mean of the second column of the file
std=$(awk '{x[NR]=$2; s+=$2; n++} END{a=s/n; for (i in x){ss += (x[i]-a)^2} sd = sqrt(ss/n); print sd}' ABC_avg.txt)
#calculating standard deviation from the second column of the same file
awk '{if (std) print $2-$mean/$std}' ABC_splicedavg.txt > ABC.tmp
#calculate the zscore for each row and store it in a temporary file
zscore=$(awk '{total += $0; count++} END {if (count) print total/count}' ABC.tmp)
#calculate an average of all the zscores in the rows and store it in a variable
echo $motif" "$zscore
rm ABC.tmp
However, when I execute this code, at the step where the temp file is created I get the error fatal: division by zero attempted. What is the right way to implement this code? TIA. I used the bc -l option, but it gives a very long version of the floating-point number.
Here is a script to compute the mean and std in one pass; you may lose some resolution, and if that is not acceptable there are alternatives...
$ awk '{print rand()}' <(seq 100) |
  awk '{sum+=$1; sqsum+=$1^2}
       END{print mean=sum/NR, std=sqrt(sqsum/NR-mean^2), z=mean/std}'
0.486904 0.321789 1.51312
Your script's z-score for each sample is wrong: you need ($2-mean)/std, since without parentheses the division happens before the subtraction.
You can control the precision of your output with bc by using the scale variable:
$ echo "4/7" | bc -l
.57142857142857142857
$ echo "scale=3; 4/7" | bc -l
.571
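Also note that inside single quotes, $mean and $std are awk field references, not shell expansions, which is one source of the division-by-zero error. A sketch of wiring the shell variables in with -v (the values and input lines here are made up for illustration):

```shell
# Pass shell variables into awk with -v rather than interpolating $mean/$std;
# mean and std below are illustrative placeholders, not computed values
mean=2 std=0.5
awk -v m="$mean" -v s="$std" 's != 0 { print ($2 - m) / s }' <<'EOF'
gene1 3
gene2 1
EOF
```

This prints 2 and -2, i.e. (3-2)/0.5 and (1-2)/0.5.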

Awk to multiply all numbers in a text file by a constant subject to another constraint

I have files with contents like the following:
0.23423
0.10093
0.44231
0.45630
0.89999
I want to increase every number by a given percentage, say, 20%. So, I want to know how to multiply each value by 1.2.
The "constraint" I need to impose is that the products be less than or equal to 1, because these values are probabilities.
So, in pseudocode, I need to replace each number X in a given text file by min(1.0, X*1.2).
How can this be achieved in Awk?
try this one-liner:
awk '{p=1.2*$0;$0=p>1?1:p}7' file
test with your example:
kent$ cat f
0.23423
0.10093
0.44231
0.45630
0.89999
kent$ awk '{p=1.2*$0;$0=p>1?1:p}7' f
0.281076
0.121116
0.530772
0.54756
1
If you want to keep the precision of the floats, you could use printf:
awk '{p=1.2*$0;$0=p>1?1:p;printf "%.5f\n",$0}' file
With the same input, it gives:
kent$ awk '{p=1.2*$0;$0=p>1?1:p;printf "%.5f\n",$0}' f
0.28108
0.12112
0.53077
0.54756
1.00000
Using the C-like ternary operator in a one-liner:
awk '{res = $1 * 1.2; print (res > 1) ? 1 : res}' file
0.281076
0.121116
0.530772
0.54756
1
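A variant of the same capping, factored into a reusable min() helper (my sketch, plain POSIX awk, shown here with two of the sample values):

```shell
# Cap each scaled probability at 1.0 via a small min() function
printf '0.23423\n0.89999\n' |
awk 'function min(a, b) { return a < b ? a : b }
     { printf "%.5f\n", min(1.0, $0 * 1.2) }'
```

This prints 0.28108 and 1.00000, matching the printf-based answers above.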

setting default numeric format in awk

I wanted to do a simple parsing of two files with ids and some corresponding numerical values. I didn't want awk to print numbers in scientific notation.
File looks like this:
someid-1 860025 50.0401 4.00022
someid-2 384319 22.3614 1.78758
someid-3 52096 3.03118 0.242314
someid-4 43770 2.54674 0.203587
someid-5 33747 1.96355 0.156967
someid-6 20281 1.18004 0.0943328
someid-7 12231 0.711655 0.0568899
someid-8 10936 0.636306 0.0508665
someid-9 10224.8 0.594925 0.0475585
someid-10 10188.8 0.59283 0.047391
When I use print instead of printf:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); print $1,k[2],k[3],k[4],$2,$3,$4}' OSCAo.txt dme_miRNA_PIWI_OSC.txt | sort -n -r -k 7 | head
I get this result:
dme-miR-iab-4-5p 0.333333 0.000016 0.000001 0.25 0.000605606 9.36543e-07
dme-miR-9c-5p 10987.300000 0.525413 0.048798 160.2 0.388072 0.000600137
dme-miR-9c-3p 731.986000 0.035003 0.003251 2.10714 0.00510439 7.89372e-06
dme-miR-9b-5p 30322.500000 1.450020 0.134670 595.067 1.4415 0.00222922
dme-miR-9b-3p 2628.280000 0.125684 0.011673 48 0.116276 0.000179816
dme-miR-9a-3p 10.365000 0.000496 0.000046 0.25 0.000605606 9.36543e-07
dme-miR-999-5p 103.433000 0.004946 0.000459 0.0769231 0.00018634 2.88167e-07
dme-miR-999-3p 1513.790000 0.072389 0.006723 28 0.0678278 0.000104893
dme-miR-998-5p 514.000000 0.024579 0.002283 73 0.176837 0.000273471
dme-miR-998-3p 3529.000000 0.168756 0.015673 42 0.101742 0.000157339
Notice the scientific notation in the last column
I understand that printf with an appropriate format modifier can do the job, but the code becomes very lengthy. I have to write something like this:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); printf "%s\t%3.6f\t%3.6f\t%3.6f\t%3.6f\t%3.6f\t%3.6f\n", $1,k[2],k[3],k[4],$2,$3,$4}' file1.txt file2.txt > fileout.txt
This becomes clumsy when I have to parse fileout with another similarly structured file.
Is there any way to specify a default numeric output format, such that strings are printed as strings but all numbers follow a particular format?
I think you misinterpreted the meaning of %3.6f. The number before the decimal point is the field width, not the number of digits before the decimal point (see printf(3)).
So you should use %10.6f instead. It can be tested easily in bash:
$ printf "%3.6f\n%3.6f\n%3.6f" 123.456 12.345 1.234
123.456000
12.345000
1.234000
$ printf "%10.6f\n%10.6f\n%10.6f" 123.456 12.345 1.234
123.456000
 12.345000
  1.234000
You can see that the latter aligns at the decimal point properly.
As sidharth c nadhan mentioned, you can use the OFMT awk internal variable (see awk(1)). An example:
$ awk 'BEGIN{print 123.456; print 12.345; print 1.234}'
123.456
12.345
1.234
$ awk -vOFMT=%10.6f 'BEGIN{print 123.456; print 12.345; print 1.234}'
123.456000
 12.345000
  1.234000
As I can see in your example, the number with the most digits can be 123456.1234567, so the format %15.7f would cover them all and produce a nice-looking table.
Unfortunately, it will not work if the number has no decimal point in it, or even if it does but the value is integral (e.g. it ends with .0):
$ awk -vOFMT=%15.7f 'BEGIN{print 123.456;print 123;print 123.0;print 0.0+123.0}'
    123.4560000
123
123
123
I even tried gawk's strtonum() function, but integral values are still printed without OFMT formatting. See:
awk -vOFMT=%15.7f -vCONVFMT=%15.7f 'BEGIN{print 123.456; print strtonum(123); print strtonum(123.0)}'
It has the same output as before.
So I think you have to use printf anyway. The script can be a little bit shorter and a bit more configurable:
awk -vf='\t'%15.7f 'NR==FNR{x[$1]=sprintf("%s"f f f,$1,$2,$3,$4);next}$1 in x{printf("%s"f f f"\n",x[$1],$2,$3,$4)}' file1.txt file2.txt
The script will not work properly if there are duplicated IDs in the first file. If that does not happen, the two conditions can be swapped and the ;next can be left off.
awk 'NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); printf "%s\t%9s\t%9s\t%9s\t%9s\t%9s\t%9s\n", $1,k[2],k[3],k[4],$2,$3,$4}' file1.txt file2.txt > fileout.txt