awk and gawk with large integers and large powers of 2


It was my understanding that both POSIX awk and GNU awk use IEEE 754 doubles for both integers and floats. (I know the -M switch is available on GNU awk for arbitrary precision integers. This question assumes -M is not selected.)
This means that the max size of integer result with awk / gawk / perl (those without AUTOMATIC promotion to arbitrary precision integers) would be 53 bits, since this is the max size integer that can fit in an IEEE 754 double. (At magnitudes greater than 2^53, you can no longer expect ±1 to work as it would with an integer, but floating point arithmetic still works within the limits of an IEEE double.)
It seems to be easily demonstrated.
These work as expected with correct results (to the last digit) on both awk and gawk:
$ gawk 'BEGIN{print 2**52-1}'
4503599627370495
$ gawk 'BEGIN{print 2**52+1}'
4503599627370497
$ gawk 'BEGIN{print 2**53-1}'
9007199254740991
This is off by 1 (and is what I would expect with 53 bit max integer):
$ gawk 'BEGIN{print 2**53+1}' # 9007199254740993 is the correct result
9007199254740992
But here is what I would NOT expect. With certain power of 2 values both awk and GNU awk perform integer arithmetic at far greater precision than is possible within 53 bits.
(On my system, /usr/bin/awk is MacOS POSIX awk; gawk is GNU awk.)
Consider these examples, all precise to the digit:
$ gawk 'BEGIN{print 2**230}' # float result with awk...
1725436586697640946858688965569256363112777243042596638790631055949824
$ /usr/bin/awk 'BEGIN{print 2**99}' # max that POSIX awk supports
633825300114114700748351602688
The precision of ±1 is not supported at these magnitudes but limited arithmetic operations with powers of 2 are supported. Again, precise to the digit:
$ /usr/bin/awk 'BEGIN{print 2**99-2**98}'
316912650057057350374175801344
$ /usr/bin/awk 'BEGIN{print 2**99+2**98}'
950737950171172051122527404032
$ gawk 'BEGIN{print 2**55-968}' # 2^55=36028797018963968
36028797018963000
I am speculating that awk and gawk have some sort of non-standard way of recognizing that 2^N is equivalent to 1<<N and doing some limited math inside of that arena.
Any form of [integer > 2] ^ Y with the result being greater than 2^53 has a drop in precision that is expected, i.e., 10^15 is the rough max integer for ±1 to be accurate, since 10^16 requires 54 bits.
$ gawk 'BEGIN{print 10**15+1}' # correct
1000000000000001
$ gawk 'BEGIN{print 10**16+1}' # not correct
10000000000000000
This is correct in magnitude for 10**64 but only precise for the first 16 digits (which I would expect):
$ gawk 'BEGIN{print 10**64}'
10000000000000001674705827425446886926697411428962669123675881472
# should be '1' + 64 '0'
# This is just a presentation issue of a value implying greater precision...
The GNU documentation is not exactly helpful, since it speaks of the max values for 64-bit unsigned and signed integers, implying those are used somehow. But it is easy to demonstrate that, with the exception of powers of 2, the max integer on gawk is 2**53.
Questions:
Am I correct that ALL integer calculations in awk / gawk are in fact IEEE doubles with max value of 2**53 for ±1? Is that documented somewhere?
If that is correct, what is happening with larger powers of 2?
(It would be nice if there were automatic switching to float format (the way Perl does) at that magnitude where there is a loss of precision btw.)

I cannot speak to the numeric implementations used in particular versions of gawk or awk. This answer speaks to floating-point generally, particularly IEEE-754 binary formats.
Computing 2^99 for 2**99 and 2^230 for 2**230 are simply normal operations for floating-point arithmetic. Each is represented with a significand with one significant binary digit, 1, and an exponent of 99 or 230. Whatever routine is used to implement the exponentiation operation is presumably doing its job correctly. Since binary floating-point represents a number using a sign, a significand, and a scaling of two to some power, 2^99 and 2^230 are easily represented.
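The single-bit significand can be checked directly. Here is a minimal Python sketch, assuming Python's float is an IEEE-754 binary64 double (true on common platforms), the same format the question assumes for awk. It is only an illustration of the format, not of how any awk implements exponentiation.

```python
# Assumption: Python floats are IEEE-754 binary64 doubles.
import math

for n in (99, 230):
    x = float(2 ** n)                      # int -> double conversion, exact for a power of two
    mantissa, exponent = math.frexp(x)     # x == mantissa * 2**exponent, 0.5 <= mantissa < 1
    print(n, x.hex(), mantissa, exponent)  # the hex form shows an all-zero fraction
    # The double holds the power of two exactly, so going back to int loses nothing.
    assert int(x) == 2 ** n
```

The hexadecimal form makes the point visible: every fraction digit is zero, and only the exponent changes between the two values.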
When these numbers are printed, some routine is called to convert them to decimal numerals. This routine also appears to be well implemented, producing correct output. Some work is required to do that conversion correctly, as implementing it with naïve arithmetic will introduce rounding errors that produce incorrect results. (Sometimes little engineering effort is given to conversion routines and they produce results accurate only to a limited number of significant decimal digits. This appears to be less common; correctly rounded implementations are more common now than they used to be.)
Apparent “loss of precision,” more accurately called “loss of accuracy” or “rounding errors,” occurs when results cannot be exactly represented (such as 2^53+1) or when floating-point operations are implemented without correct rounding. For 2^99 and 2^230, no such loss is imposed by the floating-point format.
“This means that the max size of integer result with awk / gawk / perl… would be 53 bits…”
This is incorrect, or at least incorrectly phrased. The last consecutive integer that can be represented in IEEE-754 64-bit binary is 2^53. But it is certainly not the maximum. 2^53+2 can also be represented, having skipped 2^53+1. There are many more integers larger than 2^53 that can be represented.
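A round trip through a double shows which integers survive. This is a small Python sketch, again assuming binary64 floats; the helper name survives_double is made up for illustration.

```python
# Assumption: Python floats are IEEE-754 binary64 doubles.
LIMIT = 2 ** 53  # the last integer before gaps appear

def survives_double(n: int) -> bool:
    """True if integer n is unchanged by a round trip through a double."""
    return int(float(n)) == n

for n in (LIMIT - 1, LIMIT, LIMIT + 1, LIMIT + 2, 2 ** 99, 2 ** 99 + 2 ** 98):
    print(n, survives_double(n))
```

Only 2^53+1 fails here. The large values from the question pass because they need very few significand bits, not because any special integer arithmetic is involved.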

UPDATE: statement on powers of 10 in a 754 double:
Anecdotally, even without bigint add-ons, if you're just looking for powers of 10 standalone, or using them to mod (%) against powers of 2, it appears you can go up to 10^22.
jot 25 | gawk -be '$++NF = sprintf("%.f", 10^$1)' # same for mawk+nawk
{ … pruned smaller ones… }
13 10000000000000
14 100000000000000
15 1000000000000000
16 10000000000000000
17 100000000000000000
18 1000000000000000000
19 10000000000000000000
20 100000000000000000000
21 1000000000000000000000
22 10000000000000000000000 <---------
23 99999999999999991611392
24 999999999999999983222784
25 10000000000000000905969664
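The cutoff at 22 has a simple explanation: 10^k = 2^k × 5^k, the factor 2^k only affects the exponent, and 5^k has to fit in the 53-bit significand. Since 5^22 < 2^53 < 5^23, 10^22 is the last exact power of ten. A short Python sketch of that check, assuming binary64 floats (it converts exact integers to doubles rather than calling a pow routine as awk does):

```python
# Assumption: Python floats are IEEE-754 binary64 doubles.
def exact_as_double(n: int) -> bool:
    """True if the integer n converts to a double without rounding."""
    return int(float(n)) == n

exact = [k for k in range(26) if exact_as_double(10 ** k)]
print(max(exact))                      # largest exponent in 0..25 that is exact
print(5 ** 22 < 2 ** 53 < 5 ** 23)     # the significand bound behind it
print(int(float(10 ** 23)))            # nearest double to 10^23, printed exactly
```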
( 2 ^ x ) % ( 10 ^ 22 )
jot -w 'x=%d;y=22; print x,"=",y,"=",(2^x)%%(10^y),"\n"; ' - 3 1023 68 |
bc |
mawk2 '$++NF = sprintf("\f\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b"\
"\b\b\b\b\b\b\b\b%.f",
( 2 ^ $1 ) % ( 10 ^ $2 ) )' FS== OFS== |
column -s= -t | gcat -n
All the powers of 2, even up to 2^1023, yield the exact value modulo 10^22 directly, without needing specialized algorithms, string ops, or bigint packages, as confirmed by GNU bc:
1 3 22 8
8
2 71 22 2361183241434822606848
2361183241434822606848
3 139 22 2991196020261297061888
2991196020261297061888
4 207 22 1983204197482918576128
1983204197482918576128
5 275 22 7804340758912662765568
7804340758912662765568
6 343 22 3047019946172856926208
3047019946172856926208
7 411 22 729696200186866434048
729696200186866434048
8 479 22 3987839644142653145088
3987839644142653145088
9 547 22 1954520592778765795328
1954520592778765795328
10 615 22 5333860664072122400768
5333860664072122400768
11 683 22 3741475809259744657408
3741475809259744657408
12 751 22 1513586051123404341248
1513586051123404341248
13 819 22 2458937421726941708288
2458937421726941708288
14 887 22 8118143630040815894528
8118143630040815894528
15 955 22 3346307143437247315968
3346307143437247315968
16 1023 22 2417678164812112068608
2417678164812112068608
=============
Not for all powers of 2. As you said, it's based on the limitations of the IEEE 754 double format. The highest I can get it to print is 2^1023.
2^1024 results in INF unless you invoke bignum mode with -M.
That said, gaps between representable integers appear past 2^53 and widen as you go further into the so-called "sub-optimal" range. As for printing those out, %d / %i is good for ±63 bits in gawk/mawk2, and %u for up to unsigned 64-bit integers (but potentially imprecise, other than exact powers of 2, once you're past 2^53).
mawk 1.3.4 appears to be limited to 31 and 32 bits respectively.
Past those ranges, %.f is just about the only way to go.
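The widening gaps and the 2^1024 overflow can be observed from Python as well. A sketch, assuming binary64 floats and Python 3.9 or later for math.ulp; note that Python raises an exception where awk prints inf.

```python
# Assumptions: binary64 floats; math.ulp requires Python 3.9+.
import math
import sys

# Distance from 2^n to the next larger double, at a few magnitudes.
for n in (52, 53, 54, 64, 99):
    print(f"2^{n}: gap = {math.ulp(math.ldexp(1.0, n)):.0f}")

print(sys.float_info.max_exp)       # 1024, so 2^1023 is the largest power of two
try:
    math.ldexp(1.0, 1024)           # out of range for a double
except OverflowError as err:
    print("overflow:", err)
```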

One special case about powers of two:
If you want just 2^N-1 for powers up to 1023, a very clean sub() will do the trick without having to work out what the last digit is:
sub(/[2468]$/, index("1:2:0:3", bits % 4), pow2str)
The last digit of positive integer powers of 2 follows a repeating, predictable pattern when you take the exponent modulo 4,
so with this specially crafted string, where the values sit at positions 7/3/1/5 (in descending modulo order), the string index itself is already the exact last digit minus 1.
e.g. 2^719: it goes 275....60288
719 % 4 = 3, located at position 7 of the reference string "1:2:0:3",
so the regex replaces the final "8" with a "7", giving you exactly 2^N-1 for any gigantic integer power of 2.
If you already know how many bits this power of 2 is supposed to be, then this way is faster; otherwise, a substring replacement approach is definitely faster than running it through logarithmic functions.
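The lookup-string trick ports to Python almost unchanged. In this sketch the function name pow2_minus_one is made up, and Python's big integers are used only to produce the input string and to check the answer.

```python
def pow2_minus_one(pow2str: str, bits: int) -> str:
    """Turn the decimal string of 2**bits (bits >= 1) into that of 2**bits - 1."""
    # awk's index() is 1-based while Python's str.index() is 0-based, hence the + 1.
    last = "1:2:0:3".index(str(bits % 4)) + 1
    # A positive power of two always ends in 2, 4, 6 or 8, so no borrow is needed.
    return pow2str[:-1] + str(last)

result = pow2_minus_one(str(2 ** 719), 719)
print(result[-5:])                       # tail of 2^719 - 1
print(result == str(2 ** 719 - 1))       # cross-check with exact arithmetic
```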

Related

Why do these 2 for looping over sequences differ?

First:
$ raku -e "for 1...6, 7...15 { .say }"
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Now:
$ raku -e "for 1...3, 7...15 { .say }"
1
2
3
7
11
15
I would expect this case to print 1,2,3,7,8,... 15.
What's happening here?

I think you might want the raku Range operator .. (two dots) and not the raku Sequence operator ... (three dots).
Here's how you examples look with the Range operator instead:
> raku -e 'for 1..6, 7..15 { .say }'
1..6
7..15
Oh, that's not good ... looks like for is just iterating over the two things 1..6 and 7..15 and stringifying them.
We can use a Slip | to fix that:
> raku -e 'for |(1..6), |(7..15) { .say }'
1
2
... (all the numbers)
14
15
And then:
raku -e 'for |(1..3), |(7..15) { .say }'
1
2
3
7
8
9
10
11
12
13
14
15
With the Sequence operator, you have made something like:
>raku -e 'for 3,7...15 { .say }'
3
7
11
15
That is raku for "make a sequence that starts with 3, then 7, then all the values until you get to the last at 15" ... and since the gap from 3 to 7 is 4, raku will count up in steps of 4. Then you began it with 1...3. ;-)
~p6steve

It's because it is two deductive sequences.
1...3
Is obviously a sequence where you add 1 to each successive value.
1, 2, 3
And since 7 is 4 more than 3, this is a sequence where you add 4 to each successive value.
3, 7 ... 15
3, 7, 11, 15
To get what you want, you could use a flattened Range.
1...3, |(7..15)
Or even a flattened Sequence.
1...3, |(7...15)
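The step deduction described in this answer can be mimicked outside Raku. The Python sketch below uses a hypothetical helper, deduce_arithmetic, that only models "take the difference of the first two values and keep adding it"; it is not a model of Raku's sequence operator in general, whose endpoint handling is more subtle.

```python
from typing import List

def deduce_arithmetic(first: int, second: int, end: int) -> List[int]:
    """Hypothetical helper: extend (first, second) by their difference, not past end."""
    step = second - first
    if step <= 0:
        raise ValueError("this sketch only handles increasing sequences")
    return list(range(first, end + 1, step))

print(deduce_arithmetic(1, 2, 3))    # step of 1
print(deduce_arithmetic(3, 7, 15))   # step of 4, the jump seen in the question
```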

TL;DR This answer focuses on addressing what you originally asked (which was about "sequences") and precisely what the code you wrote is doing, rather than providing a solution (using ranges instead).
This is a work in progress dealing with something that seems both poorly documented and hard to fathom (which may explain part though not all of the doc situation). Please bear with me! (And I may just end up deleting this answer.)
1 ... 3, 7 ... 15 ≡ 1 ... (3, 7) ... 15
In the absence of parentheses, operators within an expression are applied according to rules of "precedence" and "associativity".
Infix , has a higher precedence than infix ....¹ The above two lines of code thus produce the same result (1␤2␤3␤7␤11␤15␤):
for 1 ... 3, 7 ... 15 { .say } # Operator evaluation by precedence
for 1 ... (3, 7) ... 15 { .say } # Operator evaluation by parentheses
That said, while the result is what, given a glance at the code, I would expect based on my own "magical" DWIM ("Do What I Mean") thinking, I must say I don't yet know what the precise Raku(do) rules are that lead to it DWIMing.
The doc for infix ... says:
If the endpoint is not *, it's smartmatched against each generated element and the sequence is terminated when the smartmatch succeeded.
But that seems overly simple. What if the endpoint of one sequence is another sequence? (As, at least taking a naive view, appears to be the case in your code.)
Also, as @MustafaAydin has noted:
how does your post explain the irregular last step size (of 2) instead of 3? I mean 4, 7 ... 15 alone produces (4, 7, 10, 13). But 1... 4, 7...15 now produces 7, 10, 13, 15 in the tail. Why is 15 included? Maybe i'm missing something idk
I'm at least as confused as Mustafa.
Indeed, I'm confused about several things. How come Raku(do) flattens the two sequences? [D'oh. Because the infix comma is higher precedence than the infix ....] Why doesn't it repeat the 3 in the final combined list? [Perhaps because multiple infix ...s are smart about what to do when there's an expression that's the endpoint of one sequence and the start of another?]
I'm going to go read the old design docs and/or spelunk roast and/or the Rakudo compiler code to see if I can see what's supposedly/actually going on. But not tonight.
Footnotes
¹ There's a table of operators in the current official operator doc. Supposedly this table:
summarizes the precedence levels offered by Raku, listing them in order from high to low precedence.
Unfortunately, at the time of writing this, the central operator table in the Operators page is profoundly wrong #4071.
Until that's fixed, here are "official" and "unofficial" options for determining the precedence of operators:
"official" Use in-page search to search the official doc operator page for the operator of interest. Skip to the match in the entries on the left hand side of that same page. As you'll see, infix , is one level higher precedence than infix ...:
Comma operator precedence
infix ,
infix :
List infix precedence
infix Z
infix X
infix ...
"unofficial" Look at the corresponding page of a staging site for an improved doc site. (I don't know how up to date it is, but the central table appears to list operators by precedence order as it claims.)

Convert character variable to numeric variable in SAS

I'm trying to convert a character variable to a numeric variable, but unfortunately I'm really struggling. Help would be appreciated!
I keep getting the following error: 'Invalid argument to function INPUT at line 3259 column 17'
Syntax:
Data want;
Set have;
Dosis_num = input(Dosis, best12.);
run;
I have also tried multiplying the variable by 1. This doesn't work either.
The variable looks like this:
Dosis
155
201
2.1
0.8
123.80
12.0
3333.4
00.6
Want:
Dosis_num
155.0
201.0
2.1
0.8
123.8
12.0
3333.4
0.6
Thanks a lot!

The code will work with the data you show. So either the values in the character variable are not what you think or you are not using the right variable name for the variable.
The code is trying to only use the first 12 bytes of the character variable. Normally you don't need to restrict the number of characters you ask the INPUT() function to use. In fact, the INPUT() function does not care if the width of the informat used is larger than the length of the string being read. So just use 32. as the informat, since 32 is the maximum width that the normal numeric informat can read. Note that BEST is the name of a FORMAT; if you use it as the name of an informat, it is just an alias for the normal numeric informat.
If the variable has a length longer than 12, then perhaps there are leading spaces in the variable (note that ODS output does not properly display leading spaces). In that case, use the LEFT() function to remove them.
Dosis_num = input(left(Dosis), 32.);

The typical thing to do here is to find out what's actually in the character variable. There is likely something in there that is causing the issue.
Try this:
data have;
input @1 Dosis $8.;
datalines;
155
201
2.1
0.8
123.80
12.0
3333.4
00.6
;;;;
run;
data check;
set have;
put dosis hex32.;
run;
What I get is this:
83 data check;
84 set have;
85 put dosis hex32.;
86 run;
3135352020202020
3230312020202020
322E312020202020
302E382020202020
3132332E38302020
31322E3020202020
333333332E342020
30302E3620202020
NOTE: There were 8 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.CHECK has 8 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
All those 2020202020 are spaces, which should be there (all strings are space-padded to full length). Period/Decimal Point is 2E, Digits are 3x where x is the digit (because the ASCII for 0 is 30, not because of any other reason). So for example for the last one, 00.6, 30 means zero, 30 means zero, 2E means period, and 36 means 6.
Check to make sure that you don't have any other characters other than digits (3x) and period (2e) and space (20).
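The same inspection is easy to reproduce outside SAS if the values can be exported as text. A Python sketch with two hypothetical helpers: hex_dump mirrors the hex listing above, and unexpected_bytes reports anything that is not a digit, a period, or a space.

```python
ALLOWED = set(b"0123456789. ")  # digits, period, space

def hex_dump(value: str) -> str:
    """Uppercase hex of the string's bytes, comparable to the listing above."""
    return value.encode("utf-8").hex().upper()

def unexpected_bytes(value: str) -> list:
    """Two-digit hex codes of bytes that are not a digit, a period, or a space."""
    return [f"{b:02X}" for b in value.encode("utf-8") if b not in ALLOWED]

print(hex_dump("00.6    "))            # the last sample value, padded to 8 characters
print(unexpected_bytes("12\t.0"))      # a tab hiding inside the value
print(unexpected_bytes("123.80  "))    # clean value: nothing to report
```

Any code other than 30 to 39, 2E, or 20 in the output points at the character that makes INPUT() reject the value.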
The other thing to verify is that your system is set to use . as the decimal separator and not , as many European systems are - otherwise this requires the commaw. informat. You can actually just try the commaw. informat (comma12. is sufficient if 12 is plenty - and don't include anything after the period) as anything that 12. can read in also can be read in by commaw..

'grep' or 'awk' for extracting numeric data from a file

I have a CFD output file containing alphanumeric data. My goal is to extract certain rows having numeric data so that I can plot them. I was able to extract data which starts with a numeric value using grep. However, some of the rows of this extracted data start with a number but also contain letters, which I do not want. Here is a sample:
3185 interface metric data, zone 1444, binary.
33268 interface metric data, zone 1440, binary.
3d, double precision, pressure-based, SST k-omega solver.
1 1.0000e+00 1.0163e-01 4.9782e-06 1.2250e-05 6.5126e-06 3.8876e+01 4.1845e+03 7.8685e+02 7.9475e+02 7.8234e+02 3.0537e+00 4.4427e+02 106:48:28 4999
2 1.0000e+00 6.5455e-02 1.4961e-04 2.2052e-04 1.3530e-02 6.8334e-01 4.5948e-01 7.9448e+02 8.0249e+02 7.9007e+02 1.3742e+00 5.7040e+02 92:12:06 4998
4587 interface metric data, zone 2541, binary.
6584 interface metric data, zone 1254, binary.
3 1.0000e+00 4.2029e-02 1.5227e-04 2.1588e-04 3.0255e-03 6.4570e-01 1.2661e-01 7.8044e+02 7.9048e+02 7.7804e+02 -2.3999e+05 6.4085e+02 80:35:24 4997
4 9.9121e-01 3.0808e-02 1.1390e-04 1.7132e-04 1.6542e-03 6.0594e-01 3.4626e-02 7.8613e+02 7.9673e+02 7.8422e+02 -1.9033e+05 7.0184e+02 70:56:41 4996
This is the command I used: grep -P '^\s*\d+' file. How can I modify the grep command to give me only the last rows with numeric data, i.e.
1 1.0000e+00 1.0163e-01 4.9782e-06 1.2250e-05 6.5126e-06 3.8876e+01 4.1845e+03 7.8685e+02 7.9475e+02 7.8234e+02 3.0537e+00 4.4427e+02 106:48:28 4999
2 1.0000e+00 6.5455e-02 1.4961e-04 2.2052e-04 1.3530e-02 6.8334e-01 4.5948e-01 7.9448e+02 8.0249e+02 7.9007e+02 1.3742e+00 5.7040e+02 92:12:06 4998
3 1.0000e+00 4.2029e-02 1.5227e-04 2.1588e-04 3.0255e-03 6.4570e-01 1.2661e-01 7.8044e+02 7.9048e+02 7.7804e+02 -2.3999e+05 6.4085e+02 80:35:24 4997
4 9.9121e-01 3.0808e-02 1.1390e-04 1.7132e-04 1.6542e-03 6.0594e-01 3.4626e-02 7.8613e+02 7.9673e+02 7.8422e+02 -1.9033e+05 7.0184e+02 70:56:41 4996

How can i modify grep command to give me last 4 rows only
Pipe the grep output to tail.
grep -P '^\s*\d+' file | tail -n 4

Given that the text in the question is the only thing we have to go on, I see a few patterns we might use to extract the last four lines.
The following matches lines whose first field is a number and contain no commas:
egrep '^[[:space:]]*[0-9][^,]+$'
This one matches lines containing numbers in scientific notation:
grep '[0-9]e[+-][0-9]'
And this one matches lines containing what looks like a time followed by an integer at the end of the line:
egrep '[0-9]+(:[0-9]{2}){2} [0-9]+$'
Or if you want an explicit match for the whole line -- that is, an integer, a number of scientific numbers, a time and then an integer, you can bundle it all together:
egrep '^[[:space:]]*[0-9]+([[:space:]]+-?[0-9]+\.[0-9]+e[+-][0-9]+)+[[:space:]]+[0-9]+(:[0-9]{2}){2} [0-9]+$'
Note that I'm using explicit class names and ERE rather than shortcuts and PCRE to maintain compatibility with non-Linux environments.
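For anyone post-processing the file in a script instead, the whole-line pattern translates directly. A Python sketch: the re module has no POSIX classes such as [[:space:]], so \s stands in for them, the first field is allowed to have several digits, and the sample lines are shortened versions of the ones in the question.

```python
import re

# Python's re lacks [[:space:]], so \s is used instead.
DATA_LINE = re.compile(
    r"^\s*[0-9]+"                          # iteration number
    r"(\s+-?[0-9]+\.[0-9]+e[+-][0-9]+)+"   # one or more values in scientific notation
    r"\s+[0-9]+(:[0-9]{2}){2}"             # time such as 106:48:28
    r"\s+[0-9]+\s*$"                       # trailing integer
)

def numeric_rows(lines):
    """Keep only the lines that match the whole-line pattern."""
    return [line for line in lines if DATA_LINE.match(line)]

sample = [
    "3185 interface metric data, zone 1444, binary.",
    "3d, double precision, pressure-based, SST k-omega solver.",
    "1 1.0000e+00 1.0163e-01 4.4427e+02 106:48:28 4999",
    "3 1.0000e+00 -2.3999e+05 6.4085e+02 80:35:24 4997",
]
for row in numeric_rows(sample):
    print(row)
```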

If your desired section of data can be identified by a certain header, e.g., the 3d, before it, you can look for the header and only start printing matching lines afterwards, e.g.,
awk '/^\s*3d,/ { in_data=1; next } in_data && /^\s*[0-9]/' file
Here /^\s*3d,/ is the pattern for the header which indicates the beginning of the "data section" (from the next line, i.e., not including the header itself). And /^\s*[0-9]/ is the pattern for lines to print within the data section.
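The flag-based logic of that awk program looks like this in Python. It is a sketch that copies the two patterns as they are; the function name rows_after_header and the shortened sample lines are illustrative only.

```python
import re

HEADER = re.compile(r"^\s*3d,")      # line that announces the data section
DATA = re.compile(r"^\s*[0-9]")      # line that starts with a digit

def rows_after_header(lines):
    """Return digit-led lines that appear after the first header line."""
    in_data = False
    kept = []
    for line in lines:
        if HEADER.match(line):
            in_data = True           # start keeping lines from the next one on
            continue
        if in_data and DATA.match(line):
            kept.append(line)
    return kept

lines = [
    "3185 interface metric data, zone 1444, binary.",
    "3d, double precision, pressure-based, SST k-omega solver.",
    "1 1.0000e+00 106:48:28 4999",
    "4587 interface metric data, zone 2541, binary.",
    "2 1.0000e+00 92:12:06 4998",
]
for row in rows_after_header(lines):
    print(row)
```

Any later line that begins with a digit passes this filter, so interleaved rows such as the "4587 interface metric data" line in the question's sample are kept too and would need an additional test.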
In case there is no such header, you could try to identify the first line of data itself with a more complicated pattern, e.g., the number of fields combined with a regular expression:
awk 'NF == 15 && /^\s*[0-9]+\s/ { in_data=1 } in_data && /^\s*[0-9]/' file

How to do multi-row calculations using awk on a large file

I have a big file that is sorted on the first word. I need to add a new column for each line with the proportional value: line value/total value for that group; group is determined by the first column. In the below example, the total of group "a" = 100 and hence each line gets a proportion. The total of group "the" is 1000 and hence each line gets the proportion value of the total of that group.
I need an awk script to do this.
Sample File:
a lot 10
a few 20
a great 20
a little 40
a good 10
the best 250
the dog 750
zisty cool 20
Output:
a lot 10 0.1
a few 20 0.2
a great 20 0.2
a little 40 0.4
a good 10 0.1
the best 250 0.25
the dog 750 0.75
zisty cool 20 1

You describe this as a "big file." Consequently, this solution tries to save memory: it holds no more than one group in memory at a time. When we are done with that group, we print it out before starting on the next group:
$ awk -v i=0 'NR==1{name=$1} $1==name{a[i]=$0;b[i++]=$3;tot+=$3+0;next} {for (j=0;j<i;j++){print a[j],b[j]/tot} name=$1;a[0]=$0;tot=b[0]=$3;i=1} END{for (j=0;j<i;j++){print a[j],b[j]/tot}}' file
a lot 10 0.1
a few 20 0.2
a great 20 0.2
a little 40 0.4
a good 10 0.1
the best 250 0.25
the dog 750 0.75
zisty cool 20 1
How it works
-v i=0
This initializes the variable i to zero.
NR==1{name=$1}
For the first line, set the variable name to the first field, $1. This is the name of the group.
$1==name {a[i]=$0; b[i++]=$3; tot+=$3+0; next}
If the first field matches name, then save the whole line into array a and save the value of column (field) three into array b. Increment the variable tot by the value of the third field. Then, skip the rest of the commands and jump to the next line.
{for (j=0;j<i;j++){print a[j],b[j]/tot} name=$1;a[0]=$0;tot=b[0]=$3;i=1}
If we get to this line, then we are at the start of a new group. Print out all the values for the old group and initialize the variables for the start of the next group.
END{for (j=0;j<i;j++){print a[j],b[j]/tot}}
After we get to the last line, print out what we have for the last group.
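The same one-group-at-a-time idea can be written with itertools.groupby. This is a Python sketch of the approach, not a translation of the awk program; the function name proportions is made up, and like the awk version it relies on the input being sorted by the first column.

```python
from itertools import groupby

def proportions(lines):
    """Yield each line with value/group_total appended; input must be sorted by group."""
    rows = (line.split() for line in lines if line.strip())
    for _, group in groupby(rows, key=lambda fields: fields[0]):
        members = list(group)                      # only the current group is buffered
        total = sum(float(f[2]) for f in members)
        for f in members:
            yield f"{' '.join(f)} {float(f[2]) / total:g}"

sample = ["a lot 10", "a few 20", "a great 20", "a little 40", "a good 10",
          "the best 250", "the dog 750", "zisty cool 20"]
for out in proportions(sample):
    print(out)
```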

awk '{a[$1]+=$3; b[i++]=$0; c[j++]=$1; d[k++]=$3} END{for(i=0;i<NR;i++) {print b[i], d[i]/a[c[i]]}}' File
Example:
sdlcb@Goofy-Gen:~/AMD$ cat ff
a lot 10
a few 20
a great 20
a little 40
a good 10
the best 250
the dog 750
zisty cool 20
sdlcb@Goofy-Gen:~/AMD$ awk '{a[$1]+=$3; b[i++]=$0; c[j++]=$1; d[k++]=$3} END{for(i=0;i<NR;i++) {print b[i], d[i]/a[c[i]]}}' ff
a lot 10 0.1
a few 20 0.2
a great 20 0.2
a little 40 0.4
a good 10 0.1
the best 250 0.25
the dog 750 0.75
zisty cool 20 1
Logic: update an array (a[]) keyed by the first column, accumulating the third column for each line. Save each complete line in array b[] for printing at the end. Similarly, store the first and third column values of each line in arrays c[] and d[]. At the end, loop over all the lines processed, printing each line as is, followed by its proportion value.
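For comparison, the whole-file approach looks like this in Python. A sketch with a hypothetical function name, proportions_two_pass: it does not need sorted input, but like the awk version it keeps every line in memory, which matters for a big file.

```python
from collections import defaultdict

def proportions_two_pass(lines):
    """Total every group first, then emit each line followed by its share."""
    rows = [line.split() for line in lines if line.strip()]
    totals = defaultdict(float)
    for fields in rows:
        totals[fields[0]] += float(fields[2])      # group total keyed by column 1
    return [f"{' '.join(f)} {float(f[2]) / totals[f[0]]:g}" for f in rows]

# Deliberately unsorted input to show that order does not matter here.
unsorted_sample = ["a lot 10", "the best 250", "a few 30", "the dog 750"]
for out in proportions_two_pass(unsorted_sample):
    print(out)
```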

Is there a way to represent a number in binary where bits have approximately uniform significance?

I'm wondering if it is possible to represent a number as a sequence of bits, each having approximately the same significance, such that if we flip one of the bits, the overall value does not change by much.
For example, we can use sequences of 4-bits, where each group represents a value from 0 to 15 and the overall value is the sum of all these values.
0110 0101 1101 1010 1011 → 6 + 5 + 13 + 10 + 11 = 45
and now flipping any bit can only incur a maximum difference of 8 in the final value.
Some drawbacks obviously exist with this approach:
values have multiple representations, with some values having more representations than other ones (for example, there are 39280 distinct representations for the number 38, and only 1 for the number 0);
the amount of values that can be represented is greatly reduced (this representation allows for integers from 0 to 75, while 20 bits could normally represent 2^20 ≈ 1 million different integers).
Are there any resources I can find concerning this problem? I can't seem to find anything online, but maybe I'm not searching with the right keywords. What other alternatives exist to my approach? Do they improve on its disadvantages?
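To make the scheme and its drawbacks concrete, here is a small Python sketch. The function names decode and count_representations are made up for illustration; the counting uses a simple dynamic program over the five 4-bit groups.

```python
def decode(bits: str) -> int:
    """Sum the 4-bit groups of a space-separated bit string."""
    return sum(int(group, 2) for group in bits.split())

def count_representations(target: int, groups: int = 5, group_max: int = 15) -> int:
    """Number of ways `groups` values in 0..group_max can sum to target."""
    ways = [1] + [0] * target            # ways[s]: encodings of s using the groups so far
    for _ in range(groups):
        nxt = [0] * (target + 1)
        for s, w in enumerate(ways):
            if w:
                for v in range(min(group_max, target - s) + 1):
                    nxt[s + v] += w
        ways = nxt
    return ways[target]

print(decode("0110 0101 1101 1010 1011"))
print(count_representations(38), count_representations(0))
print(sum(count_representations(t) for t in range(76)) == 2 ** 20)
```

The last line confirms that the 2^20 bit patterns are spread over only 76 distinct values, which is the redundancy described in the two drawbacks.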
