Returning uint Value from Char in gawk - awk

I'm trying to get the value of an ASCII char I receive via RS232, to convert it into a binary-like representation.
Example:
0xFF-->########
0x01--> #
0x02--> #
...
My problem is getting the value of chars higher than 127.
Test code to get the int value:
echo -e "\xFF" | gawk -l ordchr -e '{printf("%c : %i", ord($0),ord($0))}'
Returns:
� : -1
Test code 2:
echo -e "\x61" | gawk -l ordchr -e '{printf("%c : %i", ord($0),ord($0))}'
Returns:
a : 97
So my solution to convert the values to unsigned int looks like this:
if (ord($0) < 0)
{
    new_char = ord($0) + 256;
}
else new_char = ord($0) + 0
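A more compact arithmetic equivalent of the same wrap-around (a sketch, assuming gawk with the same ordchr extension loaded):
# map ord()'s signed byte value (-128..127) onto 0..255
new_char = (ord($0) + 256) % 256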
But I wanted to know if there is a way to cast an int directly to uint in gawk.
Later I tried to write my own ord() function.
#!/bin/bash
echo -e "\xFF" | awk 'BEGIN { _ord_init() }
{
    printf("%s : %d\n", $0, ord($0))
}
function _ord_init(    i, t)
{
    for (i = 0; i <= 255; i++) {
        t = sprintf("%c", i)
        _ord_[t] = i
    }
}
function ord(str,    c)
{
    # only first character is of interest
    c = substr(str, 1, 1)
    return _ord_[c]
}'
0xFF returns:
� : 0
0x61 returns:
a : 97
Can someone explain this behavior to me?
I'm using:
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4-p1, GNU MP 6.1.1)

But I wanted to know if there is a way to cast an int directly to uint in gawk.
Actually, any string in awk is, in the end, a number.
Strings are converted to numbers and numbers are converted to strings,
if the context of the awk program demands it. [...] A string is
converted to a number by interpreting any numeric prefix of the string
as numerals: "2.5" converts to 2.5, "1e3" converts to 1,000, and
"25fix" has a numeric value of 25. Strings that can’t be interpreted
as valid numbers convert to zero. source
Let's make a quick test:
BEGIN {
    print 0xff
    print 0xff + 0
    print 0xff + 0.0
    print "0xff"
}
# 255
# 255
# 255
# 0xff
So, any hex literal is automatically interpreted as an unsigned int. Casting an int to uint is a tricky question in general: you would convert the modulus of the int to hex, then add the sign bit as the MSB (that is, if the number is negative). But you should not need to do any of that in awk.
Remember that conversion is made as a call to sprintf() and you may control it via the CONVFMT variable:
CONVFMT
A string that controls the conversion of numbers to strings
(see section Conversion of Strings and Numbers). It works by being
passed, in effect, as the first argument to the sprintf() function
(see section String-Manipulation Functions). Its default value is
"%.6g". CONVFMT was introduced by the POSIX standard. source
Remember that locale settings may affect the way the conversion is performed, especially with the decimal separator. For more, see this, which is out of scope.
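A quick illustration of CONVFMT in action (a sketch; note that a bare print of a number uses OFMT instead, so the example forces number-to-string conversion by concatenating):
BEGIN {
    x = 3.14159265
    print x ""    # "3.14159" with the default CONVFMT of "%.6g"
    CONVFMT = "%.2f"
    print x ""    # "3.14"
}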
Can someone explain this behavior to me?
I can't actually reproduce it, but I suspect this line of code:
# only first character is of interest
c = substr(str, 1, 1)
In your example, the first char is always 0 and the output should always be the same. I'm testing this online.
Here is another example of mine:
BEGIN {
a = 0xFF
b = 0x61
printf("a: %d %f %X %s %c\n", a,a,a,a,a)
printf("b: %d %f %X %s %c\n", b,b,b,b,b)
}
# a: 255 255.000000 FF 255 ÿ
# b: 97 97.000000 61 97 a

Run gawk in binary mode (gawk -b) to stop it from pre-stitching UTF-8 code points, then split the input on the empty string (""): each slot of the resulting array will contain something exactly one byte wide.
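A minimal sketch of that (assuming gawk; \303\241 are the two raw UTF-8 bytes of á):
gawk -b 'BEGIN {
    n = split("\303\241", b, "")    # empty-string separator: one byte per slot
    print n, b[1] b[2]              # n is 2 in byte mode (1 in unicode mode)
}'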
For the other way around, just pre-build an array covering 0 to 255. Gawk doesn't stop there at all: in my routine gawk startup sequence, I build that same custom ord table from 0x3134F all the way back down to zero (around 210k entries). The reason to do it backwards is that, for whatever reason, some code points come out as an identical character that gawk can't differentiate, and going in reverse ensures the lowest code point is the one assigned to it. For this mode, I run gawk in regular UTF-8 mode.
For your scenario I would just pre-build a 4-hex-digit-wide array from 0x0000 to 0xFFFF mapping back to integers; then, for each 0xZZ 0xWW pair, throw ZZWW at that lookup dictionary and get back an integer.
If you just try ord() from 128 to 255 it usually won't work like that, because 128 is where UTF-8 switches to 2-byte sequences (0x800 begins the 3-byte ones, and 0x10000 the 4-byte ones). I'm not too familiar with the encodings that extend ASCII to 256; they usually require iconv or similar to get back to UTF-8 first.
A quick note: if you have raw UTF-8 bytes and want to figure out how many stitched UTF-8 code points there are, just delete everything in the 0x80-0xBF range. The length() of the residue is the number of code points (see the sketch after the list below).
In decimal lingo, the four ranges of 64 numbers from 0 to 255 are:
000 - 063 : ASCII
064 - 127 : ASCII
128 - 191 : UTF-8 multi-byte continuation bytes (the 0x80-0xBF range)
192 - 255 : leading byte of a UTF-8 multi-byte character
and that looks hideous in decimal. Luckily, octal comes to the rescue: the 0x80-0xBF range is just \200-\277, and you can use any of AWK's regex facilities to find those (also in FS, RS, etc.). I was spending time manually coding up the UTF-8 bit-shifting algorithm before realizing, much later, that I didn't need it to reach my end goal.
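A byte-mode sketch of that counting trick (assuming gawk -b; a rough illustration, not a validator):
gawk -b '{ t = $0
           gsub(/[\200-\277]/, "", t)    # drop UTF-8 continuation bytes
           total += length(t) }          # one remaining byte per code point
      END { print total + 0 }' file.txt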
Combining the logic above with mawk2, you can easily beat the system's built-in wc -m command at counting UTF-8 code points. On my 2.5-year-old laptop, against a 1.83 GB flat text file filled with unicode all over, I got it down to roughly 19 seconds to count 1.29 billion UTF-8 code points, using just awk.

I've run into the same problem myself. I ended up first with a detector for whether gawk is running in unicode mode or byte mode (check whether the length() of a three-octal-byte combo that makes up one UTF-8 code point returns 1 or 3).
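A minimal sketch of that detector (assuming gawk; the three octal bytes below encode the single code point U+C7BC):
BEGIN { unicode_mode = (length("\354\236\274") == 1) }    # 1 in unicode mode, 3 in byte mode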
Then, when it detects gawk in unicode mode, it runs a custom shell command from gawk, using unix printf to print out bytes 128-255 and chunking them back into a gawk array. If you need it I can paste the code sometime (but it's SUPER hideous, so I hope I won't get dinged for its lack of elegance).
This is needed because there are bytes, like C0, C1, or FF, that simply don't exist in UTF-8: no matter what combination you attempt, you cannot generate all 256 of them from within gawk. Another way to do it would be pre-making that chain and using something like xxd -ps to store it as a hash string, only converting it back at runtime, but that's admittedly slower.

Related

Formatting in Raku

I have written a function that outputs a double, up to 25 decimal
places. I am trying to print it as formatted output from Raku.
However, the output is incorrect and truncated.
See MWE:
my $var = 0.8144262510988963255087469;
say sprintf("The variable value is: %.25f", $var)
The above code gives The variable value is: 0.8144262510988963000000000 which is not what is expected.
Also, this seems weird:
my $var = 0.8144262510988963255087469;
say $var.Str.chars; # 29 wrong, expected 27
I tested the same in C:
#include <stdio.h>

int main() {
    double var = 0.8144262510988963255087469;
    printf("The variable value is: %.25lf \n", var);
    return 0;
}
This C version works fine. Given the identical nature of sprintf and printf, I expected the same format to work in Raku too. It seems %lf is not supported.
So is there a workaround to fix this?
I think this is actually a bug in how Rat literals are created. Or at least a WAT :-).
I actually sort of expect 0.8144262510988963255087469 to either give a compile time warning, or create a Num, as it exceeds the standard precision of a Rat:
raku -e 'say 0.8144262510988963255087469'
0.814426251098896400086204416
Note that these are not the same.
There is fortunately an easy workaround, by creating a FatRat
$ raku -e 'say 0.8144262510988963255087469.FatRat'
0.8144262510988963255087469
FWIW, I think this is worthy of creating an issue.
From your question:
I have written a function that outputs a double, up to 25 decimal places.
From google:
Double precision numbers are accurate up to sixteen decimal places
From the raku docs:
When constructing a Rat (i.e. when it is not a result of some mathematical expression), however, a larger denominator can be used
so if you go
my $v = 0.8144262510988963255087469;
say $v.raku;
#<8144262510988963255087469/10000000000000000000000000>
it works.
However, do a mathematical expression such as
my $a = 8144262510988963255087469;
my $b = $a / 10000000000000000000000000;
and you get the Rat => Num degradation applied, unless you explicitly declare FatRats. I visualise this as the math operation placing the result in a Num register in the CPU.
The docs also mention that .say and .put may be less faithful than .raku, presumably because they use math operations (or coercion) internally.
Sorry to be the bearer of bad news, but 10**25 > 2**64, so what you report as an issue is correct and (fairly) well-documented behaviour, given the constraints of IEEE 754 double precision.

How to transform a string into an int equivalent in a deterministic way with gawk 5?

I am facing a case where I need to transform a string to an int equivalent with gawk5.
This transformation must be deterministic.
My first, naive approach is to convert each letter of the string to its position in the Latin alphabet, then concatenate the results back into a string.
For example:
my_string = "AB"
A = 1
B = 2
my_int=12
However, this has several downsides:
Very long strings may generate an integer that exceeds the maximum integer size.
What to do in case of special characters, symbols, etc.?
It requires me to hold a table of each character's position in the alphabet.
So, basically, it's a no-go.
What is a good and robust method to generate an integer from a string with gawk 5?
PS: Some will comment that gawk may not be the tool for that, and they may be right; I am aware of that. But this is for a personal project that should involve only awk if possible ;)
If your string contains only ASCII characters and no newlines, and if you use GNU awk, the following simply converts each character into its 3-digit ASCII code:
$ echo "abc" | awk -vFS= '
BEGIN {for(i=0;i<128;i++) c[sprintf("%c",i)]=i}
{for(i=1;i<=NF;i++) printf("%03d",c[$i])}'
097098099
Of course this expands the string by a factor of 3, which can be sub-optimal. If you know that your string contains only ASCII characters in the 32-127 range you can reduce this factor to 2:
$ echo "abc" | awk -vFS= '
BEGIN {for(i=32;i<128;i++) c[sprintf("%c",i)]=i-32}
{for(i=1;i<=NF;i++) printf("%02d",c[$i])}'
656667
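Decoding is the same mapping in reverse; a sketch under the same 32-127 assumption, reading the digits back two at a time:
$ echo "656667" | awk -vFS= '
BEGIN {for(i=32;i<128;i++) c[i-32]=sprintf("%c",i)}
{for(i=1;i<NF;i+=2) printf("%s", c[($i $(i+1))+0]); print ""}'
abc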

what's best way for awk to check arbitrary integer precision

from GNU gawk's page
https://www.gnu.org/software/gawk/manual/html_node/Checking-for-MPFR.html
they have a formula to check arbitrary precision
function adequate_math_precision(n) { return (1 != (1+(1/(2^(n-1))))) }
My question is: wouldn't it be more efficient to stay within the integer math domain with a formula such as
( 2^abs(n) - 1 ) % 2 # note 2^(n-1) vs. 2^|n| - 1
Since any power of 2 must be even, subtracting 1 must always give an odd number, so its modulo (%) by 2 becomes an indicator function for is_odd() for n >= 0, while abs(n) handles the cases where n is negative.
Or does the modulo necessitate a cast to floating point, thus nullifying any gains?
Good question. Let's tackle it.
The proposed snippet aims at checking whether gawk was invoked with the -M option.
I'll attach some digression on that option at the bottom.
The argument n of the function is the floating-point precision needed for whatever operation you'll have to perform. Say your script is in a library somewhere and will get called, but you have no control over how: you run that function at the beginning of the script to promptly throw an exception and bail out, signalling that the end result would be wrong due to a lack of bits to store the numbers.
Your code stays in the integer realm: a power of two with an integer exponent is an integer. There is no need for abs(n) here, because there is no point in specifying the number of bits you need as a negative number in the first place.
Then you subtract one from an even integer. Unless n=0 (in which case 2^0=1 and your code reads (1-1) % 2 = 0), your snippet will always return 1, because the remainder (%) of an odd number divided by two is 1.
The problem is: you are trying to compute a potentially stupidly large number inside a function whose whole job is to check whether you are able to do so in the first place.
Since any power of 2 must also be even, then subtracting 1 must always
be odd, then its modulo (%) over 2 becomes indicator function for
is_odd() for n >= 0, while the abs(n) handles the cases where it's
negative.
Except when n=0, as discussed above, you are right: the snippet will tell you that any power of 2 is even, and any power of 2, minus 1, is odd. We were, though, discussing another subject entirely.
Let's analyze the other function instead:
return (1 != (1+(1/(2^(n-1)))))
Remember that booleans in awk work like this: 0 = false and non-zero = true. So while 1+x, where x is a very small number (typically the reciprocal of a large power of two, 2^122 in the example page), is mathematically guaranteed to be != 1, in the digital world that's not the case: at some point the floating-point computation reaches its precision rock bottom, x is rounded down to 0, and the arbitrary-precision check suddenly returns 0, i.e. false: 1 is equal to 1.
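To see both behaviours side by side (a sketch; PREC sets the MPFR working precision when -M is active):
$ gawk 'BEGIN { n = 64; print (1 != (1 + 1/(2^(n-1)))) }'
0
$ gawk -M -v PREC=113 'BEGIN { n = 64; print (1 != (1 + 1/(2^(n-1)))) }'
1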
A larger discussion on types and data representation
The page you link explains the precision of gawk invoked with the -M option. This sounds like technoblahblah; let's decipher it.
At some point, your OS architecture has to decide how to store data and how to represent it in memory so that it can be accessed again and displayed. Terms like Integer, Float, Double, and Unsigned Integer are examples of data representations. We are addressing integer representation here: how is an integer stored in memory?
A 32-bit system will use 4 bytes to represent an integer, which in turn determines how large the integer can be. The 32 bits are read from most significant (MSB) to least significant (LSB), and if the type is signed, one bit represents the sign (typically the MSB, halving the maximum magnitude of the integer).
If asked to compute a large number, a machine will try to fit it into the largest type available. If the end result is larger than that, you get overflow and end up with a wrong result or an error. Many online challenges ask you to write code with arbitrarily long loops or large sums, then test it with inputs that break the 64-bit barrier, to see if you master the proper types for indexes.
AWK is not a strongly typed language: any variable can store data regardless of type. The data type can change, and it is determined at runtime by the interpreter, so the developer doesn't need to care. For instance:
$ awk 'BEGIN {a = "this is text"; print a; a = 2; print a; print a + 3.0 * 2}'
-| this is text
-| 2
-| 8
In the example, a holds text, then an integer, and can then be summed with a floating-point number and printed as an integer, without any special type handling.
The Arbitrary Precision Page presents the following snippet:
$ gawk -M 'BEGIN {
> s = 2.0
> for (i = 1; i <= 7; i++)
> s = s * (s - 1) + 1
> print s
> }'
-| 113423713055421845118910464
There is some math voodoo behind it; we will skip that. Since s is interpreted as a floating-point number, the end result is computed in floating point.
Try to input that number into the Windows calculator as a decimal and it will fail, although you can compute it in binary: you'll need the programmer setting, and up to 53 bits, to fit it as an unsigned integer.
53 is a magic number here: with the -M option, gawk uses arbitrary precision for numbers. In other words, it decides how many bits are necessary, tracks them, and breaks free of the native OS architecture. By default, gawk allocates 53 bits for any given arbitrary-precision number. Fun fact: the actual result of that snippet is wrong, and it would take up to 100 bits to compute it correctly.
To implement arbitrarily-large-number handling, gawk relies on an external library called MPFR. Provided with an arbitrarily large number, MPFR handles the memory allocation and the bit requisition to store it. However, the interface between gawk and MPFR is not perfect, and gawk can't always control the type that MPFR will use. For integers that's not an issue; for floating-point numbers it will result in rounding errors.
This brings us back to the snippet at the beginning: if gawk was called with the -M option, numbers up to 2^53 can be stored as integers. Floating-point values will be smaller than that (you need to make the decimal point disappear somehow, or rather represent it by spending some of the bits allocated for the number, just like the sign). Following the example on the page, and asking for an arbitrary precision larger than the 53 bits of a double, the snippet returns TRUE only if the -M option was passed; otherwise 1/2^(n-1) is rounded down to 0.

Using awk to detect UTF-8 multibyte character

I am using awk (symlinked to gawk on my machine) to read through a file and get a character count per line to test if a file is fixed width. I can then re-use the following script with the -b --characters-as-bytes option to see if the file is fixed width by byte.
#!/usr/bin/awk -f
BEGIN {
    width = -1;
}
{
    len = length($0);
    if (width == -1) {
        width = len;
    } else if (len != 0 && len != width) {
        exit 1;
    }
}
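A sketch of that two-pass baseline, assuming the script above is saved under the hypothetical name fixedwidth.awk:
# character-width pass, then byte-width pass with -b
gawk -f fixedwidth.awk file.txt &&
gawk -b -f fixedwidth.awk file.txt &&
echo "fixed width in characters and in bytes"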
I want to do something similar to test whether each line in a file has the same number of bytes and characters, to assume all characters are a single byte (I do realize this is subject to false negatives). The challenge is that I would like to run through the file one time and break out at the first mismatch. Is there a way to set the -b option from within an awk script, similar to how you can adjust FS? If this isn't possible, I'm open to options outside of awk. I can always just write this in C if I have to, but I wanted to make sure there isn't something already available.
Efficiency is what I am aiming for here. Having this information will help me skip a costly process, so I don't want this check itself to be costly. I'm dealing with files that can be over 100 million lines long.
Clarification
I want something like the above. Something like this
#!/usr/bin/awk -f
{
    if (length($0) != bytelength($0))
        exit 1;
}
I don't need any output; I will just trigger off the return code ($? in bash). So exit 1 if this fails. Obviously bytelength is not a real function; I'm just looking for a way to achieve this without running awk twice.
UPDATE
sundeep's solution works for what I have described above:
awk -F '' -l ordchr '{for(i=1;i<=NF;i++) if(ord($i)<0) {exit 1;}}'
I was operating under the assumption that awk would count a higher-end character from a Windows single-byte encoding above 0x7F as a single character, but it actually doesn't count it at all. So the character length would still not match the byte length. I guess I will need to write this in C for something that specific.
Conclusion
So I think I did a poor job of explaining my problem. I receive data that is encoded either in UTF-8 or in a Windows-style single-byte encoding like CP1252. I wanted to check whether there are any multibyte characters in the file and exit if one is found. I originally wanted to do this in awk, but playing with files that may have different encodings has proven difficult.
So in a nutshell if we assume a file with a single character in it:
CHARACTER FILE_ENCODING ALL_SINGLE_BYTE IN_HEX
á UTF-8 false 0xC3 0xA1
á CP1252 true 0xE1
a ANY true 0x61
You seem to be targeting UTF-8 specifically. Indeed, the first byte of a multibyte character in UTF-8 is of the form 0b11xxxxxx and the next byte must be 0b10xxxxxx, where x represents any value (from wikipedia).
So you can detect such sequence with sed by matching the hex ranges and exit with nonzero exit status if found:
LC_ALL=C sed -n '/[\xC0-\xFF][\x80-\xBF]/q1'
I.e., match bytes in the ranges [0b11000000-0b11111111][0b10000000-0b10111111].
I think \x?? and q are both GNU extensions to sed.
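In a script you would branch on the exit status; a small usage sketch:
# sed quits with status 1 at the first multibyte-looking sequence
if LC_ALL=C sed -n '/[\xC0-\xFF][\x80-\xBF]/q1' file.txt; then
    echo "no multibyte sequences found"
else
    echo "multibyte sequence found"
fi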
The best answer is IMHO actually the one with grep provided by Sundeep in the comments; you should try to get that working. The answer below makes use of sed in a similar way. I will probably delete it, as it really doesn't add anything to grep's solution.
What about this?
[[ -z "$(LANG=C sed -z '/[\x80-\xFF]/d' <(echo -e 'one\ntwo\nth⌫ree'))" ]]
echo $?
<(echo -e 'one\ntwo\nth⌫ree') is just an example file with a multibyte character in it.
The sed command does one of two things:
outputs the empty string if the file contains a multibyte character
outputs the full file if it doesn't
The [[ -z string ]] test then returns 0 (success) if the string has length zero, and 1 otherwise.
Quote from the same wikipedia page above :
Fallback and auto-detection: Only a small subset of possible byte
strings are a valid UTF-8 string: the bytes C0, C1, and F5 through FF
cannot appear, and bytes with the high bit set must be in pairs, and
other requirements.
In octal, that means xC0 = \300, xC1 = \301, and xF5 = \365 through xFF = \377 are invalid UTF-8.
Knowing that this space isn't valid UTF-8 gives you plenty of wiggle room to insert custom delimiters inside any string:
pick any of those bytes, say \373, and once a quick if statement has verified it doesn't already occur on that line, you can perform custom text-manipulation tricks of your choice with a single-byte delimiter, even if it involves inserting them right between the UTF-8 bytes of a single code point, and it won't ruin the unicode at all. Once you're done with the logic block, simply use a quick gsub() to remove all traces of it.
If that byte (\373, i.e. \xFB) does exist, you're likely looking at either a binary file or partially corrupted UTF-8 text data.
One use case, such as in my own modules, is a UTF-8 code-point-level-safe* substr() function. Instead of manually counting out the points one at a time, first use regex to determine the maximum byte width of any code point; let's say 3 bytes (since 4-byte ones are still rare in practice).
Then I apply one pad of \373 next to the 2-byte ones (to the left of [\302-\337]), and two pads, i.e. \373\373, next to the ASCII ones, and voila: now all UTF-8 code points have a fixed width, so substr() becomes a mere multiplication exercise.
Run a byte-level substr() on those start and end points, apply gsub(/[\373]+/, "", s) to throw away all the padding bytes, and now you have a usable* UTF-8-safe substr() function for all the variants of awk that aren't unicode-aware. This approach also works for multi-line records, and it absolutely does not affect how FS and RS interact with the record.
(If you need 4-byte code points, just pad more.)
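A condensed sketch of that padding trick; it assumes byte mode (gawk -b / mawk), input limited to 1-3 byte code points, and no pre-existing \373 bytes, and utf8_substr is a hypothetical name:
# hypothetical helper: code-point-safe substr() via fixed-width padding
function utf8_substr(s, start, len,    t) {
    t = s
    gsub(/[\302-\337]/, "\373&", t)         # pad 2-byte lead bytes to width 3
    gsub(/[\001-\177]/, "\373\373&", t)     # pad ASCII bytes to width 3
    t = substr(t, 3*(start-1) + 1, 3*len)   # every code point is now 3 bytes wide
    gsub(/\373/, "", t)                     # strip the padding bytes
    return t
}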
*I haven't incorporated any fancy logic to account for code points that are post-decomposition components, which are supposedly grouped together as a single logical unit for string-manipulation purposes.
For non-unicode-aware invocations of awk (gawk -b, or mawk/mawk2 with LC_ALL=C):
awk 'BEGIN {
reUTF8="([\\000-\\177]|" \
"[\\302-\\337][\\200-\\277]|" \
"\\340[\\240-\\277][\\200-\\277]|" \
"\\355[\\200-\\237][\\200-\\277]|" \
"[\\341-\\354\\356-\\357][\\200-\\277]" \
"[\\200-\\277]|\\360[\\220-\\277]" \
"[\\200-\\277][\\200-\\277]|" \
"[\\361-\\363][\\200-\\277][\\200-\\277]" \
"[\\200-\\277]|\\364[\\200-\\217]" \
"[\\200-\\277][\\200-\\277])" }'
Set this regex and you should be able to get the total count of UTF-8-compliant characters, matching what GNU wc -lcm reports, even for purely binary files like mp3s, mp4s, or compressed gz/xz/zip archives. As long as the data itself is UTF-8-compliant, this will count it, as specified in Unicode 13.
Your locale settings don't matter here whatsoever, nor does your platform, OS version, awk version, or awk variant.
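As a rough usage sketch (assuming byte mode and the reUTF8 variable set in the BEGIN block above), wc -lcm-style counts could be gathered like this:
# main rule and END block to append after the BEGIN block setting reUTF8
{
    t = $0
    bytes += length($0) + 1             # +1 for the stripped newline
    chars += gsub(reUTF8, "", t) + 1    # each reUTF8 match is one code point
}
END { print NR, chars, bytes }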
$ echo; time pvE0 < MV84/*BLITZE*webm | gwc -lcm
in0: 449MiB 0:00:10 [44.4MiB/s] [44.4MiB/s] [================================================>] 100%
1827289 250914815 471643928
real 0m10.188s
user 0m10.075s
sys 0m0.352s
$ echo; time pvE0 < MV84/*BLITZE*webm | mawk2x 'BEGIN { FS = "^$"} { bytes += lengthB0(); chars += lengthC0(); } END { print --NR, chars+NR, bytes+NR }'
in0: 449MiB 0:00:16 [27.0MiB/s] [27.0MiB/s] [================================================>] 100%
1827289=250914815=471643928
real 0m16.756s
user 0m16.621s
sys 0m0.449s
The file being tested is a 449 MB .webm music video clip from YouTube, 3840x2160 with VP9 + Opus codecs. Not too shabby for an interpreted scripting language to be this close to compiled C binaries.
And it's only this slow because of the hideously long regex needed to account for invalid bytes. If you're extremely sure your data is fully UTF-8-compliant text, you can further optimize that regex so that mawk2 goes faster than both GNU wc and BSD wc:
$ brc; time pvE0 < "${m3t}" | awkwc4m
in0: 1.85GiB 0:00:14 [ 128MiB/s] [ 128MiB/s] [================================================>] 100%
12,494,275 lines 1,285,316,715 utf8 (349,725,658 uc) 1,891.656 MB ( 1983544693) /dev/stdin
real 0m14.753s <—- Custom Bash function that's entirely AWK
$ brc; time pvE0 < "${m3t}" |gwc -lcm
in0: 1.85GiB 0:00:28 [67.3MiB/s] [67.3MiB/s] [================================================>] 100%
12494275 1285316715 1983544693
real 0m28.165s <—— GNU WC
$ brc; time pvE0 < "${m3t}" |wc -lcm
in0: 1.85GiB 0:00:22 [85.5MiB/s] [85.5MiB/s] [================================================>] 100%
12494275 1285316715
real 0m22.181s <—— BSD WC
P.S.: "${m3t}" is a 1.85 GB flat .txt file of 12.5 million rows, 13 fields each, filled to the brim with multibyte unicode characters (349.7 million of them).
gawk -e (in unicode mode) will complain about that regex. To circumvent the annoyance, use the following regex, which is the same as the one above but expanded out to make gawk -e happy:
([\000-\177]|((\302|\303|\304|\305|\306|\307|\310|\311|\312|\313|\314|\315|\316|\317|\320|\321|\322|\323|\324|\325|\326|\327|\330|\331|\332|\333|\334|\335|\336|\337)|(\340)(\240|\241|\242|\243|\244|\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257|\260|\261|\262|\263|\264|\265|\266|\267|\270|\271|\272|\273|\274|\275|\276|\277)|(\355)(\200|\201|\202|\203|\204|\205|\206|\207|\210|\211|\212|\213|\214|\215|\216|\217|\220|\221|\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|\234|\235|\236|\237))(\200|\201|\202|\203|\204|\205|\206|\207|\210|\211|\212|\213|\214|\215|\216|\217|\220|\221|\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|\234|\235|\236|\237|\240|\241|\242|\243|\244|\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257|\260|\261|\262|\263|\264|\265|\266|\267|\270|\271|\272|\273|\274|\275|\276|\277)|((\341|\342|\343|\344|\345|\346|\347|\350|\351|\352|\353|\354|\356|\357)|(\360)(\220|\221|\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|\234|\235|\236|\237|\240|\241|\242|\243|\244|\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257|\260|\261|\262|\263|\264|\265|\266|\267|\270|\271|\272|\273|\274|\275|\276|\277)|(\361|\362|\363)(\200|\201|\202|\203|\204|\205|\206|\207|\210|\211|\212|\213|\214|\215|\216|\217|\220|\221|\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|\234|\235|\236|\237|\240|\241|\242|\243|\244|\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257|\260|\261|\262|\263|\264|\265|\266|\267|\270|\271|\272|\273|\274|\275|\276|\277)|(\364)(\200|\201|\202|\203|\204|\205|\206|\207|\210|\211|\212|\213|\214|\215|\216|\217))(\200|\201|\202|\203|\204|\205|\206|\207|\210|\211|\212|\213|\214|\215|\216|\217|\220|\221|\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|\234|\235|\236|\237|\240|\241|\242|\243|\244|\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257|\260|\261|\262|\263|\264|\265|\266|\267|\270|\271|\272|\273|\274|\275|\276|\277){2})
== update = 9-20-21 ========
So it turns out even the pre-slicing isn't necessary at all.
gawk -e 'BEGIN { ORS = ":";
a0 = a = "\354\236\274";
n = 1; # this # is for how many bytes
# you'd like to see
b1 = b = \
sprintf("%.*s",n + 1,a = "\301" a);
sub("^"b, "", a)
sub(/^\301/,"", b)
sub("\236|\270|\271|\272|\273|\274|\275|\276|\277",":&", a)
# for that string,
# chain up every byte in \x80-\xBF range,
# but make sure not to tag on "( )" at the 2 ends.
# that will make the regex a lot slower,
# for reasons unclear to me
printf(":" a0 "|" b1 "|" b ORS a "|") } ' | odview
yielding this output
: 잼 ** ** | 301 354 | 354 : 236 : 274 |
072 354 236 274 174 301 354 174 354 072 236 072 274 174
: ? 9e ? | ? ? | ? : 9e : ? |
58 236 158 188 124 193 236 124 236 58 158 58 188 124
3a ec 9e bc 7c c1 ec 7c ec 3a 9e 3a bc 7c
Voila: using only sprintf() and [g]sub(), every individual byte is at your fingertips, even in unicode mode, without needing arrays at all.
===========================
Since we're on the topic of awk and UTF-8, a quick tip to share (only on the multi-byte part):
If you're in gawk's unicode-aware mode and want to access individual bytes of just a few UTF-8 chars (e.g. to perform URL encoding, analyze them individually, or pack a DWORD32), but don't want the cost-heavy approach of gsub(//,"&"SUBSEP) followed by splitting into an array, a quick-n-dirty method is just
gsub(/\302|\303|\304|\305|\306|\307|\310|\311|\312|\313|\314|\315|\316|\317|\320|\321|\322|\323|\324|\325|\326|\327|\330|\331|\332|\333|\334|\335|\336|\337|\340|\341|\342|\343|\344|\345|\346|\347|\350|\351|\352|\353|\354|\355|\356|\357|\360|\361|\362|\363|\364/, "&\300")
잼 ** ** = 354 *300*<---236 274
354 236 274 075 354 300 236 274
? 9e ? = ? ? 9e ?
236 158 188 61 236 192 158 188
ec 9e bc 3d ec c0 9e bc
Basically, this "slices" properly encoded UTF-8 characters right between the leading byte and the trailing ones. In my personal trial and error, I find the 13 bytes illegal within UTF-8 (xC0, xC1, xF5-xFF) to be best suited for this task.
Say the original variable is called b3. Then use
b2 = sprintf("%.3s",b3)
to extract out \354 \300 \236.
sub(b2,"",b3)
so now b3 will only have \274.
b1 = sprintf("%.1s", b2)
b1 will now be just \354
sub(b1"\300","",b2)
and finally, b2 will be just the 2nd byte, \236.
The reason for this painfully tedious process is that one gsub doubling every byte, then a full array split(), plus 3 more array-entry lookups, can be slightly slow. If you want to count bytes first:
lenBytes = match($0, /$/) - 1;
# I only recently discovered
# this trick; it works decently well
That match() even works on a random collection of bytes bearing no resemblance to Unicode, and gawk will happily give you the exact result. It's the only meaningful way to run match() against random bytes without getting an error from gawk (the other being match($0, /^/), but that's quite useless; try .*, ., or .+ and they will all error out about a bad character in the locale).
** Don't use index(). If you need exact positions, just split into an array.
And if you need to do a byte-level substring:
Don't directly use substr() on random bytes in gawk unicode mode.
Use sprintf("%.53s", b3) instead.
Before slicing, that syntax gives you 53 unicode characters.
After slicing, it's 53 bytes from the start of the string.
I even chain them up myself as if they were gensub(), even though it's good ole' sub():
if (sub(reANY340357,"&\301",z)||3==b) {
sub((x=sprintf("%.1s",(y=sprintf("%.3s",z))sub(y,"",z)))"\301","",y)
And once you're done with everything you need, a quick gsub(/\300|\301/, "") will restore the proper UTF-8 string.
Hope this is useful =)
Note: the code in this answer can be used to detect valid UTF-8 multi-byte characters. It will also fail if there are invalid UTF-8 byte sequences. However, it does not guarantee that your file is intended to be UTF-8: all valid UTF-8 is also valid CP1252, but not all CP1252 is valid UTF-8.
So it seems this may be a bit of a niche problem. For me, that means time to resort to C. This should work, but, in the spirit of the question, I won't accept it in case someone can come up with an awk solution.
Here is my C solution I called hasmultibyte:
#include <stdio.h>
#include <stdlib.h>

void check_for_multibyte(FILE* in)
{
    int c = 0;
    while ((c = getc(in)) != EOF) {
        /* Floating continuation byte */
        if ((c & 0xC0) == 0x80)
            exit(5);
        /* utf8 multi-byte start */
        if ((c & 0xC0) == 0xC0) {
            int continuations = 1;
            switch (c & 0xF0) {
            case 0xF0:
                continuations = 3;
                break;
            case 0xE0:
                continuations = 2;
            }
            int i = 0;
            for (; i < continuations; ++i)
                if ((getc(in) & 0xC0) != 0x80)
                    exit(5);
            exit(0);
        }
    }
}

int main(int argc, char** argv)
{
    FILE* in = stdin;
    int i = 1;
    do {
        if (i != argc) {
            in = fopen(argv[i], "r");
            if (!in) {
                perror(argv[i]);
                exit(EXIT_FAILURE);
            }
        }
        check_for_multibyte(in);
        if (in != stdin)
            fclose(in);
    } while (++i < argc);

    return 5;
}
In the shell environment, you could then use it like this:
if hasmultibyte file.txt; then
...
fi
It will also read from stdin if no file is provided, in case you want to use it at the end of a pipeline:
if cat file.txt | hasmultibyte; then
...
fi
TEST
Here is a test of the program. I created 3 files with the name Hernández in them:
name_ascii.txt - Uses a instead of á.
name_cp1252.txt - Encoded in CP1252
name_utf-8.txt - Encoded in UTF-8 (default)
The � you see is due to the invalid UTF-8 that the terminal is expecting. It is, in fact, the character á in CP1252.
> file name_*
name_ascii.txt: ASCII text
name_cp1252.txt: ISO-8859 text
name_utf-8.txt: UTF-8 Unicode text
> cat name_*
Hernandez
Hern�ndez
Hernández
> hasmultibyte name_ascii.txt && echo multibyte
> hasmultibyte name_cp1252.txt && echo multibyte
> hasmultibyte name_utf-8.txt && echo multibyte
multibyte
Update
This code has been updated from the original. It has been changed to read the first byte of a multibyte character and determine how many bytes the character should be. This can be determined as follows:
first byte    number of bytes
110xxxxx      2
1110xxxx      3
11110xxx      4
This method is more reliable and will reduce inaccuracies. The original method searched for a byte of the form 11xxxxxx and checked the next byte for a continuation byte (10xxxxxx). That will produce a false positive given something like â„x in a CP1252 file: in binary, this is 11100010 10000100 01111000. The first byte claims a character of 3 bytes, the second is a continuation byte, but the third is not, so this is not a valid UTF-8 sequence.
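For comparison, a rough byte-mode awk sketch of the same lead-byte logic (an illustration under the assumptions above, not a drop-in replacement for the C program):
gawk -b '
{
    n = split($0, b, "")                        # one byte per slot
    for (i = 1; i <= n; i++) {
        if (b[i] ~ /[\200-\277]/) exit          # floating continuation byte
        if (b[i] ~ /[\300-\377]/) {             # multibyte start byte
            k = (b[i] ~ /[\360-\377]/) ? 3 : ((b[i] ~ /[\340-\357]/) ? 2 : 1)
            found = 1
            for (j = 1; j <= k; j++)            # expect k continuation bytes
                if (b[i+j] !~ /[\200-\277]/) found = 0
            exit
        }
    }
}
END { exit (found ? 0 : 5) }' file.txt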
Additional testing
> # create files
> echo "â„¢" | iconv -f UTF-8 -t CP1252 > 3byte.txt
> echo "Ââ„¢" | iconv -f UTF-8 -t CP1252 > 3byte_fail.txt
> echo "â„x" | iconv -f UTF-8 -t CP1252 > 3byte_fail2.txt
> hasmultibyte 3byte.txt; echo $?
0
> hasmultibyte 3byte_fail.txt; echo $?
5
> hasmultibyte 3byte_fail2.txt; echo $?
5

Processing: How to convert a char datatype into its utf-8 int representation?

How can I convert a char datatype into its utf-8 int representation in Processing?
So if I had an array ['a', 'b', 'c'] I'd like to obtain another array [61, 62, 63].
After writing my answer I figured out a much easier and more direct way of converting to the kind of numbers you wanted. What you want for 'a' is 61 instead of 97, and so forth. That is not very hard, seeing that 61 is the hexadecimal representation of the decimal 97. So all you need to do is feed your char into a specific method, like so:
Integer.toHexString((int)'a');
If you have an array of chars like so:
char[] c = {'a', 'b', 'c', 'd'};
Then you can use the above thusly:
Integer.toHexString((int)c[0]);
and so on and so forth.
EDIT
As per v.k.'s example in the comments below, you can do the following in Processing:
char c = 'a';
String s = hex(c); // the hex representation of the character as a String, e.g. "0061"
// to save the hex representation as an int you need to parse it since hex() returns a String
int hexNum = PApplet.parseInt(hex(c));
// OR
int hexNum = int(c);
For the benefit of the OP and the commenter below: you will get 97 for 'a' even if you use my previous suggestion in the answer, because 97 is the decimal representation of hexadecimal 61. Seeing that UTF-8 matches the first 127 ASCII entries value for value, I don't see why one would expect anything different anyway. As for the UnsupportedEncodingException, a simple fix would be to wrap the statements in a try/catch block. However, that is not necessary, seeing that the above directly answers the question in a much simpler way.
What do you mean by "utf-8 int"? UTF-8 is a multi-byte encoding scheme for letters (technically, glyphs) represented as Unicode numbers. In your example you use trivial letters from the ASCII set, but that set has very little to do with a real unicode/utf8 question.
For simple letters, you can literally just int cast:
print((int)'a') -> 97
print((int)'A') -> 65
But you can't do that with characters outside the 16-bit char range. print((int)'二') works, giving 20108 (4E8C in hex), but print((int)'𠄢') will give a compile error because the character code for 𠄢 does not fit in 16 bits (it's supposed to be 131362, or 20122 in hex, which gets encoded as the three-byte UTF-8 sequence 239+191+189).
So for Unicode characters with a code higher than 0xFFFF you can't use int casting, and you'll actually have to think hard about what you're decoding. If you want true Unicode code point values, you'll have to literally decode the byte stream, but the Processing IDE doesn't actually let you do that; it will tell you that "𠄢".length() is 1, when in real Java it's actually 2 (a surrogate pair). There is, in current Processing, no way to get the Unicode value for any character with a code higher than 0xFFFF.
update
Someone mentioned you actually wanted hex strings. If so, use the built-in hex() function.
println(hex((int)'a')) -> 00000061
and if you only want 2, 4, or 6 characters, just use substring:
println(hex((int)'a').substring(4)) -> 0061