How to transform a string into an int equivalent in a deterministic way with gawk 5? - awk

I am facing a case where I need to transform a string to an int equivalent with gawk 5.
This transformation must be deterministic.
My first, naive approach is to convert each letter of the string to its position in the Latin alphabet and then concatenate the results back into a string.
For example:
my_string = "AB"
A = 1
B = 2
my_int=12
However, this has several downsides:
Very long strings may generate an integer that goes beyond the maximum integer size.
What to do in case of special characters, symbols, etc.?
This requires me to hold a table of each character position in the alphabet.
So, basically, it's a no go.
What is a good and robust method to generate an integer from a string with gawk5 ?
PS: Some will comment that gawk may not be the tool for that, and they may be right; I am aware of that. But this is for a personal project that should include only awk if possible ;)

If your string contains only ASCII characters, no newlines, and if you use GNU awk, the following simply converts each character into its 3-digit ASCII code:
$ echo "abc" | awk -vFS= '
BEGIN {for(i=0;i<128;i++) c[sprintf("%c",i)]=i}
{for(i=1;i<=NF;i++) printf("%03d",c[$i])}'
097098099
Of course this expands the string by a factor of 3, which can be sub-optimal. If you know that your string contains only ASCII characters in the 32-127 range you can reduce this factor to 2:
$ echo "abc" | awk -vFS= '
BEGIN {for(i=32;i<128;i++) c[sprintf("%c",i)]=i-32}
{for(i=1;i<=NF;i++) printf("%02d",c[$i])}'
656667
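For completeness, the 3-digit form above is trivially reversible. Here is a small sketch of the reverse mapping (my addition, not part of the original answer; it assumes GNU awk and the 3-digit encoding shown first):
$ echo "097098099" | awk -vFS= '
{for(i=1;i<=NF;i+=3) printf("%c", ($i $(i+1) $(i+2)) + 0); print ""}'
abc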

Related

How to remove unix timestamp specific data from a flatfile

I have a huge file containing a list like this
email@domain.com^B1569521698
email2@domain.com,@2domain.com^B1569521798
email3@domain.com,test@2domain.com^B1569521898
email10000@domain.com^B1569521998
..
..
The file is named /usr/local/email/whitelist
The number after ^B is a unix timestamp
I need to remove from the list all the rows having a timestamp smaller than
(e.g.) 1569521898.
I tried using various awk/sed combinations with no result.
The character ^B you notice is a control character. The first 32 control characters, ASCII codes 0 through 1FH, form a special set of non-printing characters. They are called control characters because they perform various printer and display control operations rather than displaying symbols. This particular one stands for STX, or Start of Text.
You can type control characters in a shell as Ctrl+v Ctrl+b, or you can use the octal representation directly (\002).
awk -F '\002' '($2 >= 1569521898)'
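For example, applied to the file named in the question (the output filename here is just a placeholder):
awk -F '\002' '($2 >= 1569521898)' /usr/local/email/whitelist > whitelist.kept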
Since you have control characters in your Input_file, could you please try the following? This is written and tested with the given samples only.
awk '
match($0, /\002[0-9]+/){
  val = substr($0, RSTART+1, RLENGTH-1)
  if (val >= 1569521898) { print }
  val = ""
}
' Input_file

How to remove diacritics in Perl 6

Two related questions.
Perl 6 is so smart that it understands a grapheme as one character, whether it is one Unicode symbol (like ä, U+00E4) or two or more combined symbols (like p̄ and ḏ̣). This little code
my @symb;
@symb.push("ä");
@symb.push("p" ~ 0x304.chr); # "p̄"
@symb.push("ḏ" ~ 0x323.chr); # "ḏ̣"
say "$_ has {$_.chars} character" for @symb;
gives the following output:
ä has 1 character
p̄ has 1 character
ḏ̣ has 1 character
But sometimes I would like to be able to do the following.
1) Remove diacritics from ä. So I need some method like
"ä".mymethod → "a"
2) Split "combined" symbols into parts, i.e. split p̄ into p and Combining Macron U+0304. E.g. something like the following in bash:
$ echo p̄ | grep . -o | wc -l
2
Perl 6 has great Unicode processing support in the Str class. To do what you are asking in (1), you can use the samemark method/routine.
Per the documentation:
multi sub samemark(Str:D $string, Str:D $pattern --> Str:D)
method samemark(Str:D: Str:D $pattern --> Str:D)
Returns a copy of $string with the mark/accent information for each character changed such that it matches the mark/accent of the corresponding character in $pattern. If $string is longer than $pattern, the remaining characters in $string receive the same mark/accent as the last character in $pattern. If $pattern is empty no changes will be made.
Examples:
say 'åäö'.samemark('aäo'); # OUTPUT: «aäo␤»
say 'åäö'.samemark('a'); # OUTPUT: «aao␤»
say samemark('Pêrl', 'a'); # OUTPUT: «Perl␤»
say samemark('aöä', ''); # OUTPUT: «aöä␤»
This can be used both to remove marks/diacritics from letters, as well as to add them.
For (2), there are a few ways to do this (TIMTOWTDI). If you want a list of all the codepoints in a string, you can use the ords method to get a List (technically a Positional) of all the codepoints in the string.
say "p̄".ords; # OUTPUT: «(112 772)␤»
You can use the uniname method/routine to get the Unicode name for a codepoint:
.uniname.say for "p̄".ords; # OUTPUT: «LATIN SMALL LETTER P␤COMBINING MACRON␤»
or just use the uninames method/routine:
.say for "p̄".uninames; # OUTPUT: «LATIN SMALL LETTER P␤COMBINING MACRON␤»
If you just want the number of codepoints in the string, you can use codes:
say "p̄".codes; # OUTPUT: «2␤»
This is different than chars, which just counts the number of characters in the string:
say "p̄".chars; # OUTPUT: «1␤»
Also see @hobbs' answer using NFD.
This is the best I was able to come up with from the docs — there might be a simpler way, but I'm not sure.
my $in = "Él está un pingüino";
my $stripped = Uni.new($in.NFD.grep: { !uniprop($_, 'Grapheme_Extend') }).Str;
say $stripped; # El esta un pinguino
The .NFD method converts the string to normalization form D (decomposed), which separates graphemes out into base codepoints and combining codepoints whenever possible. The grep then returns a list of only those codepoints that don't have the "Grapheme_Extend" property, i.e. it removes the combining codepoints. The Uni.new(...).Str then assembles those codepoints back into a string.
You can also put these pieces together to answer your second question; e.g.:
$in.NFD.map: { Uni.new($_).Str }
will return a list of 1-character strings, each with a single decomposed codepoint, or
$in.NFD.map(&uniname).join("\n")
will make a nice little unicode debugger.
I can't say this is better or faster, but I strip diacritics in this way:
my $s = "åäö";
say $s.comb.map({.NFD[0].chr}).join; # output: "aao"

Returning uint Value from Char in gawk

I'm trying to get the value of ASCII chars I receive via RS232 in order to convert them into binary-like values.
Example:
0xFF-->########
0x01--> #
0x02--> #
...
My problem is getting the value of ASCII chars higher than 127.
Test-Code to get the int value:
echo -e "\xFF" | gawk -l ordchr -e '{printf("%c : %i", ord($0),ord($0))}'
Return:
� : -1
Test-Code 2:
echo -e "\x61" | gawk -l ordchr -e '{printf("%c : %i", ord($0),ord($0))}'
Return:
a : 97
So my solution to convert the values into an unsigned int is like this:
if (ord($0) < 0)
{
    new_char = ord($0) + 256;
}
else new_char = ord($0) + 0
But I wanted to know if there was a way to cast directly an int as uint in gawk.
Later I tried to write my own ord() function.
#!/bin/bash
echo -e "\xFF" | awk 'BEGIN {_ord_init()}
{
printf("%s : %d\n", $0, ord($0))
}
function _ord_init( i, t)
{
for (i=0; i <= 255; i++) {
t = sprintf("%c", i)
_ord_[t] = i
}
}
function ord(str, c)
{
# only first character is of interest
c = substr(str, 1, 1)
return _ord_[c]
}'
0xFF returns:
� : 0
0x61 returns:
a : 97
Can someone explain this behavior to me?
I'm using:
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4-p1, GNU MP 6.1.1)
But I wanted to know if there was a way to cast directly an int as uint in gawk.
Actually, any string in awk is, in the end, a number.
Strings are converted to numbers and numbers are converted to strings,
if the context of the awk program demands it. [...] A string is
converted to a number by interpreting any numeric prefix of the string
as numerals: "2.5" converts to 2.5, "1e3" converts to 1,000, and
"25fix" has a numeric value of 25. Strings that can’t be interpreted
as valid numbers convert to zero. source
Let's make a quick test:
BEGIN {
print 0xff
print 0xff + 0
print 0xff +0.0
print "0xff"
}
# 255
# 255
# 255
# 0xff
So, any hex is automatically interpreted as an unsigned int. Casting an int to uint is a tricky question: generally, you would convert the modulus (absolute value) of the int to hex and then add the sign bit as the MSB (if the number is negative). But you should not need to do so in awk.
Remember that conversion is made as a call to sprintf() and you may control it via the CONVFMT variable:
CONVFMT
A string that controls the conversion of numbers to strings
(see section Conversion of Strings and Numbers). It works by being
passed, in effect, as the first argument to the sprintf() function
(see section String-Manipulation Functions). Its default value is
"%.6g". CONVFMT was introduced by the POSIX standard. source
Remember that locale settings may affect the way the conversion is performed, especially with the decimal separator. For more, see this, which is out of scope.
Can someone explain this behavior to me?
I can't actually reproduce it, but I suspect this line of code:
# only first character is of interest
c = substr(str, 1, 1)
In your example, the first char is always 0 and the output should always be the same. I'm testing this online.
I'll make another example of mine:
BEGIN {
a = 0xFF
b = 0x61
printf("a: %d %f %X %s %c\n", a,a,a,a,a)
printf("b: %d %f %X %s %c\n", b,b,b,b,b)
}
# a: 255 255.000000 FF 255 ÿ
# b: 97 97.000000 61 97 a
Either run gawk in binary mode (gawk -b) to stop it from pre-stitching UTF-8 code points; split by the empty string // and each element of the resulting array will be exactly one byte wide.
For the other way around, just pre-make an array from 0 to 255. Gawk doesn't stop there at all: in my routine gawk startup sequence, I build that same custom ord table from 0x3134F all the way back to zero (around 210k entries or so). The reason to do it backwards is that, for whatever reason, some code points come out as identical characters that gawk can't differentiate; doing it in reverse ensures the lowest code point is the one assigned. For this mode, I run gawk in regular UTF-8 mode.
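A minimal sketch of that byte-mode route (my own, assuming GNU awk; bytes are mapped through a 0-255 lookup table built the same way the question does):
$ echo -e "\xFF\x61" | gawk -b '
BEGIN { for (i = 0; i < 256; i++) ord[sprintf("%c", i)] = i }
{
  n = split($0, ch, "")                # empty separator: one byte per element under -b
  for (i = 1; i <= n; i++) printf("%d\n", ord[ch[i]])
}'
255
97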
For your scenario I'd just pre-make a 4-hex-wide array from 0x0000 to 0xFFFF mapped back to their integer values, then for each 0xZZ 0xWW pair, throw ZZWW into that lookup dictionary and get back an integer.
If you just try ord( ) from 128 to 255 it usually won't work like that, because 128 is where Unicode begins using 2 bytes, 0x800 begins 3 bytes, and 0x10000 begins 4 bytes. I'm not too familiar with the encodings that extend ASCII to 256 - they usually require iconv or similar to get back to UTF-8 first.
A quick note: if you want to take raw UTF-8 bytes and figure out how many stitched UTF-8 code points there are, just delete everything in the 0x80-0xBF range. The length() of the residue is the number of code points.
In decimal lingo, out of the 4 ranges of 64 numbers from 0 to 255:
000 - 063 - ASCII
064 - 127 - ASCII
128 - 191 - UTF-8 multi-byte continuation encoding (the 0x80-0xBF range)
192 - 255 - the most significant byte of a UTF-8 multi-byte char
and this looks hideous. Luckily, octal comes to the rescue: the 0x80-0xBF range is just \200-\277. You can use any of AWK's regex facilities to find those (also for FS, RS, etc.). I was spending time manually coding up the UTF-8 algorithm with all that bit-shifting before I realized, much later, that I didn't need it to get to my end goal.
You can easily beat the system's built-in wc -m command if you want to count UTF-8 code points by combining the logic above with mawk2. On my 2.5-year-old laptop, against a 1.83 GB flat text file filled with Unicode all over, I got it down to approximately 19 seconds to count out 1.29 billion UTF-8 code points, using just awk.
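A rough GNU-awk rendering of that counting idea (my sketch, not the author's script; file.txt is a placeholder name):
gawk -b '{
  gsub(/[\200-\277]/, "")              # drop UTF-8 continuation bytes (0x80-0xBF)
  total += length($0) + 1              # +1 counts the record newline, as wc -m does
}
END { print total }' file.txt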
I've run into the same problem myself. I ended up first with a detector for whether gawk is running in Unicode mode or byte mode (check whether the length() of the 3-octal-value combo that makes up one UTF-8 code point returns 1 or 3).
Then, when it sees gawk in Unicode mode, it runs a custom shell command from gawk, uses Unix printf to print out bytes 128-255, and chunks that back into gawk as an array. If you need it I can paste the code sometime (but it's SUPER hideous, so I hope I won't get dinged for its lack of elegance).
Because there are simply bytes like C0, C1, or FF that don't exist in valid UTF-8, no matter what combination you attempt you cannot get gawk to generate all 256 byte values on its own. Another way to do it would be pre-making that chain and using something like xxd -ps to store it as a hex string, only converting it back at runtime, but that's admittedly slower.
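For reference, a minimal sketch of the mode detector described above (my own sketch, assuming GNU awk; "\344\275\240" is the three-byte UTF-8 encoding of a single code point):
gawk 'BEGIN {
  probe = "\344\275\240"                       # one code point, three bytes
  mode = (length(probe) == 1) ? "unicode" : "byte"
  print "gawk is running in " mode " mode"
}'
In a UTF-8 locale this prints "unicode"; under gawk -b or LC_ALL=C it prints "byte".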

Fortran read statement reading beyond an end of line

Do you know if the following statement is guaranteed to be true by one of the Fortran 90/95/2003 standards?
"Suppose a read statement for a character variable is given a blank line (i.e., containing only white spaces and new line characters). If the format specifier is an asterisk (*), it continues to read the subsequent lines until a non-blank line is found. If the format specifier is '(A)', a blank string is substituted to the character variable."
For example, please look at the following minimal program and input file.
program code:
PROGRAM chk_read
  INTEGER, PARAMETER :: MAXLEN=30
  CHARACTER(len=MAXLEN) :: str1, str2
  str1='minomonta'
  read(*,*) str1
  write(*,'(3A)') 'str1_start|', str1, '|str1_end'
  str2='minomonta'
  read(*,'(A)') str2
  write(*,'(3A)') 'str2_start|', str2, '|str2_end'
END PROGRAM chk_read
input file:
----'input.dat' content is below this line----

yamanakako

kawaguchiko
----'input.dat' content is above this line----
Please note that there are four lines in 'input.dat' and the first and third lines are blank (contain only white spaces and new line characters). If I run the program as
$ ../chk_read < input.dat > output.dat
I get the following output
----'output.dat' content is below this line----
str1_start|yamanakako |str1_end
str2_start| |str2_end
----'output.dat' content is above this line----
The first read statement for the variable 'str1' seems to look at the first line of 'input.dat', find a blank line, move on to the second line, find the character value 'yamanakako', and store it in 'str1'.
In contrast, the second read statement for the variable 'str2' seems to be given the third line, which is blank, and store the blank line in 'str2', without moving on to the fourth line.
I tried compiling the program by Intel Fortran (ifort 12.0.4) and GNU Fortran (gfortran 4.5.0) and got the same result.
A little bit about a background of asking this question: I am writing a subroutine to read a data file that uses a blank line as a separator of data blocks. I want to make sure that the blank line, and only the blank line, is thrown away while reading the data. I also need to make it standard conforming and portable.
Thanks for your help.
From Fortran 2008 standard draft:
List-directed input/output allows data editing according to the type
of the list item instead of by a format specification. It also allows
data to be free-field, that is, separated by commas (or semicolons) or
blanks.
Then:
The characters in one or more list-directed records constitute a
sequence of values and value separators. The end of a record has the
same effect as a blank character, unless it is within a character
constant. Any sequence of two or more consecutive blanks is treated as
a single blank, unless it is within a character constant.
This implicitly states that in list-directed input, blank lines are treated as blanks until the next non-blank value.
When using a fmt='(A)' format descriptor when reading, blank lines are read into str. On the other side, fmt=*, which implies list-directed I/O in free-form, skips blank lines until it finds a non-blank character string. To test this, do something like:
PROGRAM chk_read
  INTEGER :: cnt
  INTEGER, PARAMETER :: MAXLEN=30
  CHARACTER(len=MAXLEN) :: str
  cnt=1
  do
    read(*,fmt='(A)',end=100) str
    write(*,'(I1,3A)') cnt, ' str_start|', str, '|str_end'
    cnt=cnt+1
  enddo
100 continue
END PROGRAM chk_read
$ cat input.dat



yamanakako

kawaguchiko
EOF
Running the program gives this output:
$ a.out < input.dat
1 str_start| |str_end
2 str_start| |str_end
3 str_start| |str_end
4 str_start|yamanakako |str_end
5 str_start| |str_end
6 str_start|kawaguchiko |str_end
On the other hand, if you use default input:
read(*,fmt=*,end=100)str
You end up with this output:
$ a.out < input.dat
1 str_start|yamanakako |str_end
2 str_start|kawaguchiko |str_end
This part of the F2008 standard draft probably addresses your problem:
10.10.3 List-directed input
7 When the next effective item is of type character, the input form
consists of a possibly delimited sequence of zero or more
rep-chars whose kind type parameter is implied by the kind of the
effective item. Character sequences may be continued from the end of
one record to the beginning of the next record, but the end of record
shall not occur between a doubled apostrophe in an
apostrophe-delimited character sequence, nor between a doubled quote
in a quote-delimited character sequence. The end of the record does
not cause a blank or any other character to become part of the
character sequence. The character sequence may be continued on as many
records as needed. The characters blank, comma, semicolon, and slash
may appear in default, ASCII, or ISO 10646 character sequences.

Summing values in one-line comma delimited file

EDIT: Thanks all of you. Python solution worked lightning-fast :)
I have a file that looks like this:
132,658,165,3216,8,798,651
but it's MUCH larger (~ 600 kB). There are no newlines, except one at the end of file.
And now, I have to sum all values that are there. I expect the final result to be quite big, but if I'd sum it in C++, I possess a bignum library, so it shouldn't be a problem.
How should I do that, and in what language / program? C++, Python, Bash?
Penguin Sed, "Awk"
sed -e 's/,/\n/g' tmp.txt | awk 'BEGIN {total=0} {total += $1} END {print total}'
Assumptions
Your file is tmp.txt (you can edit this obviously)
Awk can handle numbers that large
Python
sum(map(int,open('file.dat').readline().split(',')))
The language doesn't matter, so long as you have a bignum library. A rough pseudo-code solution would be:
str = ""
sum = 0
while input
get character from input
if character is not ','
append character to back of str
else
convert str to number
add number to sum
str = ""
output sum
If all of the numbers are smaller than (2**64)/600000 (which still has 14 digits), an 8-byte datatype like "long long" in C will be enough. The program is pretty straightforward; use the language of your choice.
Since it's expensive to treat that large an input as a whole, I suggest you take a look at this post. It explains how to write a generator for string splitting. It's in C#, but it's well suited for crunching through that kind of input.
If you are worried about the total sum not fitting in an integer (say 32-bit), you can just as easily implement a bignum yourself, especially if you only need integers and addition. Just carry bit 31 to the next dword and keep adding.
If precision isn't important, just accumulate the result in a double. That should give you plenty of range.
http://www.koders.com/csharp/fid881E3E70CC37E480545A0C37C98BC8C208B06723.aspx?s=datatable#L12
A fast C# CSV parser. I've seen it crunch through a few thousand 1 MB files rather quickly; I have it running as part of a service that consumes about 6000 files a month.
No need to reinvent a fast wheel.
Python can handle the big integers.
tr "," "\n" < file | any old script for summing
Ruby is convenient, since it automatically handles big numbers. I can't remember if Awk does arbitrary-precision arithmetic, but if so, you could use
awk 'BEGIN {RS="," ; sum = 0 }
{sum += $1 }
END { print sum }' < file