awk n-gram extraction not correct - awk

I'm currently working on an awk script which extracts all n-grams from an input file.
When running my awk script on a file it prints out every n-gram (sorted) with the number of occurrences next to it.
When testing it on an input file, it prints the n-grams in the correct order; only the occurrence counts are wrong.
For extracting n-grams I have the following code:
$1 = $1
line = tolower($0)
split(line, chars, "")
begin_len = 0
for (i in chars) {
    ngram = ""
    for (ind = 0; ind < n; ind++) {
        ngram = ngram "" chars[i + ind]
    }
    if (begin_len == 0) {
        begin_len = length(ngram)
    }
    if (length(ngram) == begin_len) {
        counter += 1
        freq_tabel[ngram] += 1
    }
}
(sort function not included)
I was wondering if there is something wrong in the code. Or are there some aspects which I have overlooked?
The output I should have is the following:
35383
1580 n
1323 en
1081 e
940 de
839 v
780 er
716 d
713 an
615 t
Instead, I have the following output:
34845
1561 n
1302 en
1067 e
930 de
827 v
772 er
711 d
703 an
609 t
As you can see, the n-grams are correct but the occurrence counts are not.
INPUT FILE: http://cl.ly/202j3r0B1342

Not an answer but may help you (assuming n=2).
Did you happen to convert the original file (which seems to be UTF-8) to latin-1? I got two sets of figures:
==> sorted.latin1_in_utf8_locale <==
1566 n
1308 en
1072 e
929 de
836 v
==> sorted.utf8_in_utf8_locale <==
1579 n
1320 en
1080 e
940 de
838 v
With latin-1 input the figures are closer to yours; with UTF-8, to the expected ones.
However, neither matches. Scratching my head.
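To check what the input actually is, something like this would tell you (input.txt is a placeholder name):
file input.txt
iconv -f latin1 -t utf-8 input.txt > input.utf8
Note also that in gawk, split(line, chars, "") splits into characters in a UTF-8 locale but into bytes in the C locale, so the LC_ALL setting alone can change the counts.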
BTW, I am not sorting the ngrams in the script but outputting them in a form suitable for piping to sort -rn. But this should not cause a difference, I guess.
for (ngram in freq_tabel)
printf "%7i %s\n", freq_tabel[ngram], ngram

I'm in your class, so here's a couple of hints:
Copy the exact input file (using clone from github, don't do a raw copy)
Re-read the assignment, you're supposed to get rid of the leading and trailing spaces, and replace all multiple tabs/spaces with one space.
Also, what's the point of the $1 = $1 on top?
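For what it's worth, a minimal sketch of that whitespace normalization in awk (assuming the default FS):
{
    gsub(/^[ \t]+|[ \t]+$/, "")   # strip leading and trailing whitespace
    gsub(/[ \t]+/, " ")           # collapse runs of spaces/tabs into one space
}
Note that $1 = $1 on its own already achieves both: assigning to any field makes awk rebuild $0 from its fields joined by single spaces (the default OFS), which trims and squeezes in one step, so that is presumably the point of it.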

Related

awk and gawk with large integers and large powers of 2

It was my understanding that both POSIX awk and GNU awk use IEEE 754 doubles for both integers and floats. (I know the -M switch is available in GNU awk for arbitrary-precision integers. This question assumes -M is not selected.)
This means that the max size of an integer result with awk / gawk / perl (those without AUTOMATIC promotion to arbitrary-precision integers) would be 53 bits, since this is the max size integer that can fit in an IEEE 754 double. (At magnitudes greater than 2^53, you can no longer expect ±1 to work as it would with an integer, but floating-point arithmetic still works within the limits of an IEEE double.)
It seems to be easily demonstrated.
These work as expected with correct results (to the last digit) on both awk and gawk:
$ gawk 'BEGIN{print 2**52-1}'
4503599627370495
$ gawk 'BEGIN{print 2**52+1}'
4503599627370497
$ gawk 'BEGIN{print 2**53-1}'
9007199254740991
This is off by 1 (and is what I would expect with 53 bit max integer):
$ gawk 'BEGIN{print 2**53+1}' # 9007199254740993 is the correct result
9007199254740992
But here is what I would NOT expect. With certain power of 2 values both awk and GNU awk perform integer arithmetic at far greater precision than is possible within 53 bits.
(On my system, /usr/bin/awk is MacOS POSIX awk; gawk is GNU awk.)
Consider these examples, all precise to the digit:
$ gawk 'BEGIN{print 2**230}' # float result with awk...
1725436586697640946858688965569256363112777243042596638790631055949824
$ /usr/bin/awk 'BEGIN{print 2**99}' # max that POSIX awk supports
633825300114114700748351602688
The precision of ±1 is not supported at these magnitudes but limited arithmetic operations with powers of 2 are supported. Again, precise to the digit:
$ /usr/bin/awk 'BEGIN{print 2**99-2**98}'
316912650057057350374175801344
$ /usr/bin/awk 'BEGIN{print 2**99+2**98}'
950737950171172051122527404032
$ gawk 'BEGIN{print 2**55-968}' # 2^55=36028797018963968
36028797018963000
I am speculating that awk and gawk have some sort of non-standard way of recognizing that 2^N is equivalent to 1<<N and doing some limited math inside of that arena.
Any form of [integer > 2] ^ Y with the result being greater than 2^53 has a drop in precision that is expected. I.e., 10^15 is roughly the max integer for ±1 to be accurate, since 10^16 requires 54 bits.
$ gawk 'BEGIN{print 10**15+1}' # correct
1000000000000001
$ gawk 'BEGIN{print 10**16+1}' # not correct
10000000000000000
This is correct in magnitude for 10**64 but only precise for the first 16 digits (which I would expect):
$ gawk 'BEGIN{print 10**64}'
10000000000000001674705827425446886926697411428962669123675881472
# should be '1' + 64 '0'
# This is just a presentation issue of a value implying greater precision...
The GNU documentation is not exactly helpful, since it speaks of the max values for 64-bit unsigned and signed integers, implying those are used somehow. But it is easy to demonstrate that, with the exception of powers of 2, the max integer in gawk is 2**53.
Questions:
Am I correct that ALL integer calculations in awk / gawk are in fact IEEE doubles with max value of 2**53 for ±1? Is that documented somewhere?
If that is correct, what is happening with larger powers of 2?
(By the way, it would be nice if there were automatic switching to float format, the way Perl does it, at the magnitude where precision is lost.)
I cannot speak to the numeric implementations used in particular versions of gawk or awk. This answer speaks to floating-point generally, particularly IEEE-754 binary formats.
Computing 2^99 for 2**99 and 2^230 for 2**230 are simply normal operations for floating-point arithmetic. Each is represented with a significand with one significant binary digit, 1, and an exponent of 99 or 230. Whatever routine is used to implement the exponentiation operation is presumably doing its job correctly. Since binary floating-point represents a number using a sign, a significand, and a scaling of two to some power, 2^99 and 2^230 are easily represented.
When these numbers are printed, some routine is called to convert them to decimal numerals. This routine also appears to be well implemented, producing correct output. Some work is required to do that conversion correctly, as implementing it with naïve arithmetic will introduce rounding errors that produce incorrect results. (Sometimes little engineering effort is given to conversion routines and they produce results accurate only to a limited number of significant decimal digits. This appears to be less common; correctly rounded implementations are more common now than they used to be.)
Apparent “loss of precision,” more accurately called “loss of accuracy” or “rounding errors,” occurs when results cannot be exactly represented (such as 2^53+1) or when floating-point operations are implemented without correct rounding. For 2^99 and 2^230, no such loss is imposed by the floating-point format.
“This means that the max size of integer result with awk / gawk / perl… would be 53 bits…”
This is incorrect, or at least incorrectly phrased. The last consecutive integer that can be represented in IEEE-754 64-bit binary is 2^53. But it is certainly not the maximum. 2^53+2 can also be represented, having skipped 2^53+1. There are many more integers larger than 2^53 that can be represented.
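This is easy to check at the prompt; a quick demonstration (any awk built on IEEE 754 doubles should behave the same way):
$ gawk 'BEGIN{print 2**53+1}' # rounds to the nearest representable double
9007199254740992
$ gawk 'BEGIN{print 2**53+2}' # exactly representable, so exact
9007199254740994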
UPDATE: a statement on powers of 10 in IEEE 754 doubles:
Anecdotally, even without bigint add-ons, if you're just looking for powers of 10 standalone, or using them to mod (%) against powers of 2, it appears you can go up to 10^22.
jot 25 | gawk -be '$++NF = sprintf("%.f", 10^$1)' # same for mawk+nawk
{ … pruned smaller ones… }
13 10000000000000
14 100000000000000
15 1000000000000000
16 10000000000000000
17 100000000000000000
18 1000000000000000000
19 10000000000000000000
20 100000000000000000000
21 1000000000000000000000
22 10000000000000000000000 <---------
23 99999999999999991611392
24 999999999999999983222784
25 10000000000000000905969664
( 2 ^ x ) % ( 10 ^ 22 )
jot -w 'x=%d;y=22; print x,"=",y,"=",(2^x)%%(10^y),"\n"; ' - 3 1023 68 |
bc |
mawk2 '$++NF = sprintf("\f\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b"\
"\b\b\b\b\b\b\b\b%.f",
( 2 ^ $1 ) % ( 10 ^ $2 ) )' FS== OFS== |
column -s= -t | gcat -n
All the powers of 2, even up to 2^1023, can directly obtain the exact modulo of 10^22 without needing specialized algorithms, string ops, or bigint packages, as confirmed by GNU bc:
1 3 22 8
8
2 71 22 2361183241434822606848
2361183241434822606848
3 139 22 2991196020261297061888
2991196020261297061888
4 207 22 1983204197482918576128
1983204197482918576128
5 275 22 7804340758912662765568
7804340758912662765568
6 343 22 3047019946172856926208
3047019946172856926208
7 411 22 729696200186866434048
729696200186866434048
8 479 22 3987839644142653145088
3987839644142653145088
9 547 22 1954520592778765795328
1954520592778765795328
10 615 22 5333860664072122400768
5333860664072122400768
11 683 22 3741475809259744657408
3741475809259744657408
12 751 22 1513586051123404341248
1513586051123404341248
13 819 22 2458937421726941708288
2458937421726941708288
14 887 22 8118143630040815894528
8118143630040815894528
15 955 22 3346307143437247315968
3346307143437247315968
16 1023 22 2417678164812112068608
2417678164812112068608
=============
Not for all powers of 2. As you said, it's based on the limitations of the IEEE 754 double format. The highest I can get it to spit out is 2^1023.
2^1024 results in INF unless you invoke bignum mode -M
That said, the gap between representable integers begins to grow past 2^53, and increases along the way (as you go further into the so-called "sub-optimal" range). As for printing those out, %d / %i is good for +/- 63 bits in gawk/mawk2, and %u up to unsigned 64-bit ints (but potentially imprecise, other than exact powers of 2, once you're past 2^53).
mawk 1.3.4 appears to be limited to 31/32 bits instead, respectively.
Past those ranges, %.f is just about the only way to go.
One special case about powers of two: if you want just 2^N - 1 for powers up to 1023, a very clean sub() will do the trick without having to physically go figure out what the last digit is:
sub(/[2468]$/, index("1:2:0:3", bits % 4), pow2str)
The last digit of positive integer powers of 2 has this repeating and predictable pattern when you take the exponent modulo 4, so using this specially crafted string, where the values exist at positions 7/3/1/5 (in descending modulo order), the string index itself is already the exact last digit minus one.
E.g. 2^719: its decimal expansion runs 275…60288. 719 % 4 = 3, and "3" sits at position 7 of the reference string "1:2:0:3", so the regex replaces the final "8" with a "7", giving you exactly 2^N - 1 for any gigantic integer power of 2.
If you already know how many bits this power of 2 is supposed to be, then this way is faster; either way, a substring-replacement approach is definitely faster than running it through logarithmic functions.
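Putting the pieces together, a minimal self-contained sketch (bits and pow2str are the variable names assumed in the fragment above):
gawk 'BEGIN {
    bits = 719
    pow2str = sprintf("%.f", 2^bits)                      # exact decimal digits of 2^719
    sub(/[2468]$/, index("1:2:0:3", bits % 4), pow2str)   # patch the last digit down by one
    print pow2str                                         # now exactly 2^719 - 1
}'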

Awk failing extraction

I have a huge file containing the xyz positions of some atoms from different molecules. The whole file contains ~10000 configurations. I have created a script that iterates over the total number of configurations and extracts the coordinates associated with a specific atomic species that is systematically repeated at a fixed position within each frame. My code works perfectly, except when the atomic position coincides with the last position of the frame: the script skips it instead of grabbing it and printing it to the corresponding file.
Each frame contains 384 atoms. In the xyz format, we have to take into account two extra lines at the beginning, where the number of atoms (in this case 384, line #1) and a blank/commented line (line #2) are located.
The awk file with the list of atom-position lines is of the form:
{n = NR%386}
n == 1 {print "24"; next}
n == 2 ||
n == 91 ||
...
n == 378 ||
n == 380 ||
n == 381 ||
n == 386
where n = NR%386 is the line counter awk keeps within each 386-line frame, so the frame boundaries stay aligned; in
n == 1 {print "24"; next}
the code prints the number of atoms I want to extract for each frame, in this case 24.
The problem arises with the last value, in the last position of each frame before advancing to the next frame:
n == 386
When using the command
awk -f file.awk filename.xyz >> test.txt
the code will skip reading, extracting, and printing the last coordinate.
The filename.xyz I have to process is something like:
384
i = 3171, time = 3171.000, E = -3298.3005315786
C 6.66359796 19.29831718 16.63773520
C 6.19922671 19.83243350 15.35406226
C 7.73577004 21.24303011 16.94974860
C 7.32315891 21.77975003 15.67093925
N 5.08248005 17.55384984 15.51887635
N 7.75857672 23.00895664 15.43811018
N 8.58649028 22.07495287 17.61330368
N 7.45555304 19.97249138 17.42360101
...
...
...
N 3.62924684 23.22942656 15.38486984
N 4.52670891 22.25077226 17.55981432
N 3.17369677 20.23465407 17.45881199
N 2.28230853 21.30557433 14.86646780
S 1.48394488 18.18032187 17.21253664
S 0.70072709 19.13053602 14.60582837
S 4.67511560 23.53830074 16.57005901
Currently, just trying to extract only position 386
n == 386
produces something like:
1
i = 3171, time = 3171.000, E = -3298.3005315786
1
i = 3172, time = 3172.000, E = -3298.3023115390
1
i = 3173, time = 3173.000, E = -3298.3056102462
1
i = 3174, time = 3174.000, E = -3298.3101590395
which correspond to just the comment lines; apparently the script skips, or does not correctly work out, which line to grab.
I would like to understand why awk is not able to extract the last line properly and how to solve the problem.
This appears to be a math problem. NR%386 will never be 386 because of the way the modulus operator works (there is no remainder when you divide 386 by 386). So your n==386 will never get executed. Try using (NR-1)%386 instead of NR%386 and shift all your conditionals accordingly:
n == 0 {print "24"; next}
etc. If you need n for calculations, add one to it.
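For instance, a sketch of the shifted script (the middle conditions are elided here just as in the question):
{n = (NR-1)%386}
n == 0 {print "24"; next}
n == 1 ||
n == 90 ||
...
n == 377 ||
n == 379 ||
n == 380 ||
n == 385
With n == 385 now reachable, the last coordinate of each frame is printed.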

Meaning of dumpbin output below the .dll import part

Here is a part of what I got when running dumpbin on an .exe file.
Section contains the following imports:
KERNEL32.dll
5A71E8 Import Address Table
620468 Import Name Table
0 time date stamp
0 Index of first forwarder reference
458 SetErrorMode
2B9 GlobalFlags
64 CompareStringW
206 GetLocaleInfoW
26E GetSystemDefaultUILanguage
418 RtlUnwind
300 IsDebuggerPresent
304 IsProcessorFeaturePresent
B5 CreateThread
11A ExitThread
119 ExitProcess
217 GetModuleHandleExW
2D1 HeapQueryInformation
487 SetStdHandle
1F3 GetFileType
4F1 VirtualQuery
264 GetStdHandle
263 GetStartupInfoW
This part is under SECTION HEADER #2 (the .rdata section). I don't know what these lines under the KERNEL32.dll line mean.
Thanks
458 SetErrorMode
2B9 GlobalFlags
64 CompareStringW
206 GetLocaleInfoW
The right-hand column is the name of the function; the left-hand column is the index of the function in kernel32.dll's import table, in hexadecimal.
The 'W' suffix indicates that the function takes UTF-16 'wide' strings; an 'A' suffix indicates that it takes ASCII or another 8-bit encoding, according to the codepage settings. This includes UTF-8.
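For reference, output like the listing in the question can be produced with dumpbin's /imports option (app.exe is a placeholder name):
dumpbin /imports app.exe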

Graphing code size

I was curious if there exists a ready-made script that would provide a starting point for an ultimate code-size tracker tool. To start with, I'd like to be able to graph size with various optimisation options for a number of cross-compiler targets, and I'm quite tempted to put this on a revision timeline later as well.
So, taking the output from the size command:
text data bss dec hex filename
1634 0 128 1762 6e2 csv_data.o (ex libs/libxyz.a)
28 0 0 28 1c csv_data_layer.o (ex libs/libxyz.a)
1063 0 0 1063 427 http_parser.o (ex libs/libxyz.a)
1312 0 1024 2336 920 http_queries.o (ex libs/libxyz.a)
8 36 0 44 2c transport.o (ex libs/libxyz.a)
1748 0 3688 5436 153c transport_layer.o (ex libs/libxyz.a)
8 0 0 8 8 misc_allocator.o (ex libs/libxyz.a)
847 108 1 956 3bc misc_err.o (ex libs/libxyz.a)
0 4 0 4 4 misc_globals.o (ex libs/libxyz.a)
273 0 0 273 111 misc_helpers.o (ex libs/libxyz.a)
71 0 4 75 4b misc_printf.o (ex libs/libxyz.a)
1044 0 44 1088 440 misc_time.o (ex libs/libxyz.a)
3724 0 0 3724 e8c xyz.o (ex libs/libxyz.a)
627 0 0 627 273 dummy.o (ex libs/libxyz.a)
8 16 0 24 18 dummy_layer.o (ex libs/libxyz.a)
12395 164 4889 17448 4428 (TOTALS)
Most of the values differ when the library is compiled with various optimisation flags (i.e. -Os, -O0, -O1, -O2) and a variety of cross-compilers (e.g. AVR, MSP430, ARMv6, i386). I'd like to make a combined graph or set of graphs using gnuplot, d3.js, matplotlib or any other package. Has anyone seen a ready-made script which would help with part of this (e.g. at least convert the above tabular format to CSV, JSON or XML), or a study paper that presents a decent visualisation example? I have to admit, it's rather hard to find this using a web search engine.
Here is a possible visualization of the data as bar chart using gnuplot. This is of course not the ultimate visualization, but should be a good starting point.
set style data histogram
set style histogram rowstacked
set style fill solid 1.0 border lc rgb "white"
set xtics rotate 90
set key outside reverse Left
set bmargin 8
plot 'file.dat' using (!(stringcolumn(6) eq "(TOTALS)") ? column(1) : 1/0):xtic(6) title columnheader(1), \
for [i=2:5] '' using (!(stringcolumn(6) eq "(TOTALS)") ? column(i) : 1/0) title columnheader(i)
With the setting set terminal pngcairo size 1000,800, this gives a rowstacked bar chart of all four size columns per object file.
You must also decide which columns you want to use, because plotting every column for every file for every compiler will be quite messy. Maybe you want to plot only the size:
set style data histogram
set style histogram clustered
set style fill solid 1.0 noborder
set xtics rotate 90
set key outside reverse Left
set bmargin 8
plot 'file.dat' using (!(stringcolumn(6) eq "(TOTALS)") ? $4 : 1/0):xtic(6) title 'i386', \
'' using (!(stringcolumn(6) eq "(TOTALS)") ? $4*1.2 : 1/0) title 'ARMv6',\
'' using (!(stringcolumn(6) eq "(TOTALS)") ? $4*0.7 : 1/0) title 'AVR'
Which gives you a clustered bar chart with one bar per compiler for each object file.
Note that the lengthy using statements are only there to skip the last line with the TOTALS. Alternatively, you could remove this last line with head, either when generating the data files or on the fly like this:
plot '< head -n -1 file.dat' using 4:xtic(6) title 'i386', \
'' using ($4*1.2) title 'ARMv6',\
'' using ($4*0.7) title 'AVR'
Of course, for your real data you would have something like
plot '< head -n -1 file-i386.dat' using 4:xtic(6) title 'i386', \
'< head -n -1 file-armv6.dat' using ($4*1.2) title 'ARMv6',\
'< head -n -1 file-avr.dat' using ($4*0.7) title 'AVR'
I hope this gives you an idea of the different visualization possibilities. What is most appropriate, you must decide for yourself.
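As for converting the tabular size output to CSV (the part of the question about a CSV/JSON/XML intermediate), a minimal awk sketch, assuming exactly the layout shown above (size.out and size.csv are placeholder names):
awk 'NR == 1 { print "text,data,bss,dec,hex,filename"; next }  # turn the header into a CSV header
     $6 != "(TOTALS)" { print $1","$2","$3","$4","$5","$6 }     # one record per object file, totals dropped
' size.out > size.csv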

count and print the number of occurrences

I have some files as shown below
GLL ALM 654-656 654 656
SEM LYG 655-657 655 657
SEM LYG 655-657 655 657
ALM LEG 656-658 656 658
ALM LEG 656-658 656 658
ALM LEG 656-658 656 658
LEG LEG 658-660 658 660
LEG LEG 658-660 658 660
The value of GLL is 654. The value of ALM is 656. In the same way, the 4th column holds the values of the first column and the 5th column holds the values of the second column. I would like to count the unique occurrences of each number in the fourth and fifth columns.
Desired output
654 GLL 1
655 SEM 1
656 ALM 2
657 LYG 1
658 LEG 2
660 LEG 1
If I understand your question right, this script could give you the output:
awk '{d[$4] = $1; d[$5] = $2; p[$4]; l[$5]}   # map each number to its name; note which numbers occur in $4 (p) and $5 (l)
END {
    for (k in p) {
        if (k in l) {            # the number occurs in both columns
            delete l[k]
            print k, d[k], "2"
        } else
            print k, d[k], "1"
    }
    for (k in l)                 # numbers that occur only in $5
        print k, d[k], 1
}' file
With your input data, the output of the above script is:
654 GLL 1
655 SEM 1
656 ALM 2
658 LEG 2
657 LYG 1
660 LEG 1
so it is not 100% the same as your expected output (the order differs), but if you pipe it to sort -n, it will give you exactly the same thing. The sorting part could be done within the awk too. I was a bit lazy... :)
My take:
sort -u file |
awk '
BEGIN {SUBSEP = OFS}
{count[$4,$1]++; count[$5,$2]++}
END {for (key in count) print key, count[key]}
' |
sort -n
654 GLL 1
655 SEM 1
656 ALM 2
657 LYG 1
658 LEG 2
660 LEG 1
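If GNU awk is available, the trailing sort -n can be dropped by ordering the END loop itself; a sketch of the same pipeline, assuming gawk:
sort -u file |
gawk '
    { cnt[$4 OFS $1]++; cnt[$5 OFS $2]++ }       # count each (number, name) pair once per unique line
    END {
        PROCINFO["sorted_in"] = "@ind_num_asc"   # gawk-specific: iterate keys in ascending numeric order
        for (key in cnt) print key, cnt[key]
    }'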
Sorry it is so long, but it works and has a bonus built in if such a thing occurred! See edit 2 for more info. :-)
awk '
BEGIN { SUBSEP = FS;
before = 0;
between = 1;
after = 0;
}
{
offset = int((NF - after - before - between) / 2) + between;
for (i=1 + before; i <= offset + before - between; i++) {
j = i + offset;
if (! ((i, $j, $i) in entry))
entry[i, $j, $i]++;
}
}
END {
for (item in entry) {
split(item, itema);
entry[itema[2], itema[3]]++;
delete entry[item];
}
for (item in entry)
print item, entry[item];
}' filename | sort -n
The first part filters the input, only accepting unique occurrences of the pair that should be in the first and second columns of the output. The second part combines the results, adding 1 for each occurrence in a unique column (e.g. LEG,658 appears at least once in both $1,$4 and $2,$5, so it is counted twice), and prints the results, which is passed to the sort utility to sort the output numerically.
It is generalized for N pairs, so if you have something like the following in the future, the script still works, so long as only pairs are added (you can't add another separate field, or the script breaks):
GLL ALM LEG 654-660 654 656 660
I suppose if you wanted, you could add extra fields to the beginning and change the start value of i. Or maybe add at the end and subtract one more from the end value of i for each new field you add (e.g. NF - 2 if you add one more unpaired field at the end). It would require a redesign to accommodate unpaired values in the middle because the data set would be completely different.
Edit
It's only so long because it is flexible (somewhat) and because I'm still an awk newbie. I'd recommend Kent's if you don't like mine (or it doesn't work--I'm not using a computer that has awk installed at the moment).
Edit 2
Updated script. It didn't work before, and it can now handle arbitrary offsets so long as no unpaired fields split the pairs up. Something like the following works:
GLL ALM LYG 654-657 654 656 657
SEM LYG 655-657 655 657
SEM LYG LEG 655-660 655 657 660
ALM LEG 656-658 656 658
LEG LEG 658-660 658 660
LYG LEG 657-660 657 660
Output:
654 GLL 1
655 SEM 1
656 ALM 2
657 LYG 3
658 LEG 2
660 LEG 2
Edit 3
The script now handles arbitrary contiguous unpaired fields. You must configure how many fields you have before the first part of a pair begins (e.g. how many fields before the first GLL, ALM, etc. on the line), how many fields are between the first and second parts of the pairs, and how many fields are after the list of second parts of the pairs. Note that it must be contiguous and consistent, meaning you can't have something like 1 field before the first pair start component for one line and 5 fields before the first pair start component on another line, and you can't have a pair start/end component separated from another of the same (e.g. "GLL xyz ALM 654 656" doesn't work because "xyz" separates "GLL" and "ALM", which are both pair start components).
For anything more than this, actual knowledge about the data set would be required, such as if GLL may have extra information immediately after it, but ALM does not ever have such data.