I am surprised by the behaviour of awk when performing floating-point calculations. It led me to wrong calculations on table data.
$ awk 'BEGIN {print 2.3/0.1}'
23 <-- Ok
$ awk 'BEGIN {print int(2.3/0.1)}'
22 <-- Wrong!
$ awk 'BEGIN {print 2.3-2.2==0.1}'
0 <-- Surprise!
$ awk 'BEGIN {print 2.3-2.2>0.1}' <-- Didn't produce any output :(
$ awk 'BEGIN {print 2.3-2.2<0.1}'
1 <-- Totally confused now ...
Can somebody throw light on what's happening here?
EDIT 1
As pointed out by fedorqui, the output of the second-to-last command goes to a file named 0.1 because of the redirection operator (>).
Then how am I supposed to perform greater than (>) operation?
A solution to this is also given by fedorqui:
$ awk 'BEGIN {print (2.3-2.2>0.1)}'
0 <-- Wrong!
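One way to see what is actually being compared is to print the operands with more precision than awk's default output format shows. This is a minimal sketch; the exact digits assume IEEE 754 doubles, which essentially all awk implementations use:

```shell
# 2.3-2.2 and 0.1 are both approximations; the first is slightly smaller
awk 'BEGIN { printf "%.17g\n%.17g\n", 2.3 - 2.2, 0.1 }'
# 0.099999999999999645
# 0.10000000000000001
awk 'BEGIN { print (2.3 - 2.2 < 0.1) }'
# 1
```

With the stored values visible, the results above stop being surprising: 2.3-2.2 really is less than 0.1 in double precision, so the equality prints 0 and the less-than comparison prints 1.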
The following section from the manual should help you understand the issue you're observing:
15.1.1.2 Floating Point Numbers Are Not Abstract Numbers
Unlike numbers in the abstract sense (such as what you studied in high
school or college arithmetic), numbers stored in computers are limited
in certain ways. They cannot represent an infinite number of digits,
nor can they always represent things exactly. In particular,
floating-point numbers cannot always represent values exactly. Here is
an example:
$ awk '{ printf("%010d\n", $1 * 100) }'
515.79
-| 0000051579
515.80
-| 0000051579
515.81
-| 0000051580
515.82
-| 0000051582
Ctrl-d
This shows that some values can be represented exactly, whereas others
are only approximated. This is not a “bug” in awk, but simply an
artifact of how computers represent numbers.
A highly recommended reading:
What every computer scientist should know about floating-point arithmetic
I have an issue where I'm using printf to round a float to the proper number of decimal places. I'm getting inconsistent results, as shown below.
echo 104.45 | awk '{printf "%.1f\n",$1}'
104.5 <-- seems to be correct behaviour
echo 104.445 | awk '{printf "%.2f\n",$1}'
104.44 (should be 104.45) <-- seems to be INCORRECT behaviour
echo 104.4445 | awk '{printf "%.3f\n",$1}'
104.445 <-- seems to be correct behaviour
I've seen examples where floating-point numbers in calculations may cause problems, but did not expect this with formatting.
The number 104.4445 cannot be represented exactly as a binary number. In other words, your computer doesn't know such a number.
# echo 104.4445 | awk '{printf "%.20f\n",$1}'
104.44450000000000500222
# echo 104.445 | awk '{printf "%.20f\n",$1}'
104.44499999999999317879
That's why the former is rounded up to 104.445, while the latter is rounded down to 104.44.
sjsam's answer is relevant only to numbers that can be represented exactly in binary, i.e. m/2**n, where m and n are integers and not too big. Changing ROUNDMODE to "A" has absolutely no effect on printing 104.45, 104.445, or 104.4445:
# echo 104.4445 | awk -v ROUNDMODE="A" '{printf "%.3f\n",$1}'
104.445
# echo 104.4445 | awk '{printf "%.3f\n",$1}'
104.445
# echo 104.445 | awk -v ROUNDMODE="A" '{printf "%.2f\n",$1}'
104.44
# echo 104.445 | awk '{printf "%.2f\n",$1}'
104.44
I tried something analogous in Python and got results similar to yours:
>>> round(104.445, 2)
104.44
>>> round(104.4445, 3)
104.445
This seems to be run-of-the-mill floating-point wonkiness, especially considering that the floating-point representation of 104.445 is less than the actual mathematical value 104.445:
>>> 104.445 - 104.44
0.0049999999999954525
>>> 104.445 - 104.44 + 104.44
104.445
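The same arithmetic in awk exhibits the identical artifact; the digits below assume IEEE 754 doubles:

```shell
awk 'BEGIN { printf "%.17g\n", 104.445 - 104.44 }'
# 0.0049999999999954525
```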
I strongly suspect that the reason for this behavior has less to do with awk than with how computers store numbers. As user31264 states: "Your computer doesn't know such a number [as 104.4445]."
Here are the results of an experiment I just did with the JavaScript scratchpad in the Pale Moon web browser:
(104.45).toFixed(55)
/*
104.4500000000000028421709430404007434844970703125000000000
*/
(104.445).toFixed(55)
/*
104.4449999999999931787897367030382156372070312500000000000
*/
(104.4445).toFixed(55)
/*
104.4445000000000050022208597511053085327148437500000000000
*/
In all probability, your awk interpreter is not dealing with the decimal numbers 104.45, etc., but rather with the "wonky" values shown here. Rounding the first, second, and third of these "wonky" values to, respectively, 1, 2, and 3 decimal places will give the same results as your awk interpreter is giving you.
Using printf, one can print a character multiple times:
$ printf "%0.s-" {1..5}
-----
In awk I know that I can do something like:
$ awk 'BEGIN {while (i++ < 5) printf "-"}'
-----
But I wonder if awk's printf allows this as well.
I went through the printf modifiers page but could not find how. All in all, what Bash's printf does is expand {1..5} and print a - for every argument it gets, so it is equivalent to saying
$ printf "%0.s-" hello how are you 42
-----
However, I lack the knowledge of how to mimic this behaviour with awk's printf, if it is possible at all, because this fails:
$ awk 'BEGIN {printf "%0.s-", 1 2 3 4 5}'
-
I do not believe this is possible with awk's printf, as there is also no way to do this with printf alone in C or C++.
With awk, I think the most reasonable option is to use a loop like you have. If for some reason performance is vital and awk is a bottleneck, the following will speed things up:
awk 'BEGIN {s=sprintf("%5s","");gsub(/ /,"-",s);print s}'
This command will run logarithmically faster.[1] It won't make a noticeable difference in performance unless you're planning on printing a character many times, though. (Printing a character 1,000,000 times will be about 13x faster.)
Also, if you want a one-liner and are using gawk, even though it's the slowest of the bunch:
gawk 'BEGIN {print gensub(/ /,"-","g",sprintf("%5s",""));}'
[1] While the sprintf/gsub command should always be faster than using a loop, I'm not sure whether all versions of awk will behave the same as mine. I also do not understand why the while-loop awk command would have a time complexity of O(n*log(n)), but it does on my system.
I know this is old, but the width modifier can be used, e.g.
l = some_value
print gensub(/ /, "-", "g", sprintf("%*s", l, ""))
will print a variable number of -s, depending on the value of l.
This was with GNU Awk 3.1.8.
If you can assume a (modest) upper bound on how long the result should be, how about something like this:
l = 5;
print substr("---------------------", 1, l);
Besides being dead simple, this has the benefit that it works in versions of AWK that lack the "gensub()" function.
I know this post is old, but I thought it would be worth demonstrating functionality that allows dynamic control of the string length using basic awk.
A simple example incorporating a string of length var into a printf statement:
$ echo "1 foo baa\n2 baa foo" | awk -v var=6 'BEGIN{for(i=1;i<=var;i++) l=l "-" }{printf "%s" l "%s" l "%s\n",$1,$2,$3}'
1------foo------baa
2------baa------foo
You can either split the format string and insert your character string as I've done above, or you can give the string its own format specifier:
$ echo "1 foo baa\n2 baa foo" | awk -v var=6 'BEGIN{for(i=1;i<=var;i++) l=l "-" }{printf "%s%s%s%s%s\n",$1,l,$2,l,$3}'
Both output the same.
A more complicated example that column-justifies text. (I've actually used basic print rather than printf in this case, although you could use the latter, as above.)
$ echo "Hi, I know this\npost is old, but thought it would be worth demonsrating\nfunctionality to allow a dynamic control of\ncharacter string lengths using basic awk." |
awk '{
line[NR]=$0 ;
w_length[NR]=length($0)-gsub(" "," ",$0) ;
max=max>length($0)?max:length($0) ;
w_count[NR]=NF
}END{
for(i=1;i<=NR;i++)
{
string="" ;
for (j=1;j<=int((max-w_length[i])/(w_count[i]-1));j++)
string=string "-" ;
gsub(" ",string,line[i]) ;
print line[i]
}
}'
Hi,--------------I--------------know--------------this
post-is-old,-but-thought-it-would-be-worth-demonsrating
functionality---to---allow---a---dynamic---control---of
character---string---lengths---using---basic---awk.
Can someone explain why two different hex numbers are converted to the same decimal?
$ echo A0000044956EA2 | gawk '{print strtonum("0x" $1)}'
45035997424348832
$ echo A0000044956EA0 | gawk '{print strtonum("0x" $1)}'
45035997424348832
Starting with GNU awk 4.1 you can use --bignum or -M
$ awk -M 'BEGIN {print 0xA0000044956EA2}'
45035997424348834
$ awk -M 'BEGIN {print 0xA0000044956EA0}'
45035997424348832
§ Command-Line Options
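The underlying limit is that without -M, gawk stores all numbers as IEEE 754 doubles, whose 53-bit significand cannot distinguish neighbouring integers above 2^53 (about 9.0e15, while 0xA0000044956EA2 is about 4.5e16). A quick way to see the cliff:

```shell
# 2^53 and 2^53+1 collapse to the same double
awk 'BEGIN { printf "%.0f %.0f\n", 2^53, 2^53 + 1 }'
# 9007199254740992 9007199254740992
```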
Not so much an answer as a workaround, to at least avoid binning the strtonum function completely:
It does indeed seem to be the doubles. I found the calculation here: strtonum. Nothing wrong with it.
However, if you really need this in awk, you could strip the last digit from the hex number and manually add it back after strtonum has done its calculation on the main part.
So 0xA0000044956EA1, 0xA0000044956EA2 and 0xA0000044956EA"whatever" would all become 0xA0000044956EA0 with a simple regex, and you then append the "whatever".
Edit: Maybe I should delete this altogether, as I have to downgrade it even further. This workaround is not satisfactory either: I just tried it, and you can't actually add a number that small to a number this big, i.e. print (45035997424348832 + 4) just comes out as 45035997424348832. So the workaround is stuck producing output like 45035997424348832 + 4 for hex 0xA0000044956EA4.
I have a tab-separated file with several columns, where one column contains numbers written in a format like this
4.07794484177529E-293
I wonder if AWK understands this notation. I.e., I want to get only the lines where the numbers in that column are smaller than 0.1, but I am not sure whether AWK will understand what "4.07794484177529E-293" is: can it do arithmetic comparisons on this format?
Yes, to answer your question, awk does understand E notation.
You can confirm that by:
awk '{printf "%f\n", $1}' <<< "4.07794484177529E-3"
0.004078
In general, awk uses double-precision floating-point numbers to represent all numeric values. This gives you a range of roughly 2.2E-308 to 1.8E+308 to work with, so you are fine with 4.07794484177529E-293.
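Since awk compares numeric-looking fields numerically, a simple filter does the job. This is a minimal sketch with made-up input values; adjust $1 to whichever column holds your numbers:

```shell
# Keep only lines whose first field is numerically smaller than 0.1
printf '4.07794484177529E-293\n0.5\n0.02\n' | awk '$1 < 0.1'
# 4.07794484177529E-293
# 0.02
```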
Aside: you can specify how to format the printing of floating-point numbers with awk as follows:
awk '{printf "%5.8f\n", $1}' <<< "1.2345678901234556E+4"
12345.67890123
Explanation:
%5.8f is what formats the float
the 5 before the . specifies the minimum field width: the printed number is padded with spaces to at least 5 characters
the 8 after the . specifies how many digits to print after the decimal point
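To see the width acting as a minimum total field width (not a count of integer digits), try a wider field; this is just a small sketch:

```shell
# width 12 = minimum total field width; .3 = digits after the decimal point
awk 'BEGIN { printf "[%12.3f]\n", 1.2345678901234556E+4 }'
# [   12345.679]
```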
In the following awk command
awk '{sum+=$1; ++n} END {avg=sum/n; print "Avg monitoring time = "avg}' file.txt
what should I change to avoid scientific-notation output (very small values displayed as 1.5e-05)?
I was not able to succeed with the OMFT variable.
You should use the printf AWK statement. That way you can specify padding, precision, etc. In your case, the %f control letter seems the more appropriate.
I was not able to succeed with the OMFT variable.
It is actually OFMT (output format), so for example:
awk 'BEGIN{OFMT="%f";print 0.000015}'
will output:
0.000015
as opposed to:
awk 'BEGIN{print 0.000015}'
which outputs:
1.5e-05
The GNU AWK manual says that if you want to be POSIX-compliant, it should be a floating-point conversion specification.
Setting -v OFMT='%f' (without having to embed it in my awk statement) worked for me when all I wanted from awk was to sum columns of arbitrary floating-point numbers.
As the OP found, awk produces exponential notation with very small numbers:
$ some_accounting_widget | awk '{sum+=$0} END{print sum+0}'
8.992e-07 # Not useful to me
Setting OFMT to '%f' fixed that, but it also rounded too aggressively:
$ some_accounting_widget | awk -v OFMT='%f' '{sum+=$0} END{print sum+0}'
0.000001 # Oops. Rounded off too much. %f rounds to 6 decimal places by default.
Specifying the number of decimal places got me what I needed,
$ some_accounting_widget | awk -v OFMT='%.10f' '{sum+=$0} END{print sum+0}'
0.0000008992 # Perfect.
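One caveat worth knowing: OFMT only governs how print renders unformatted non-integer numbers; number-to-string conversion elsewhere (for example, string concatenation) uses CONVFMT instead. A small sketch of the difference:

```shell
awk 'BEGIN { OFMT = "%.10f"; CONVFMT = "%.3f"
             x = 0.000015
             print x        # print of a bare number uses OFMT
             print x "" }'  # concatenating "" converts via CONVFMT first
# 0.0000150000
# 0.000
```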