Using awk to detect UTF-8 multibyte characters

I am using awk (symlinked to gawk on my machine) to read through a file and get a character count per line to test if a file is fixed width. I can then re-use the following script with the -b --characters-as-bytes option to see if the file is fixed width by byte.
#!/usr/bin/awk -f
BEGIN {
    width = -1;
}
{
    len = length($0);
    if (width == -1) {
        width = len;
    } else if (len != 0 && len != width) {
        exit 1;
    }
}
I want to do something similar to test whether each line in a file has the same number of bytes and characters, so that I can assume all characters are a single byte (I do realize this is subject to false negatives). The challenge is that I would like to run through the file only once and break out at the first mismatch. Is there a way to set the -b option from within an awk script, similar to how you can adjust FS? If this isn't possible, I'm open to options outside of awk. I can always just write this in C if I have to, but I wanted to make sure there isn't something already available.
Efficiency is what I am aiming for here. Having this information will help me skip a costly process, so I don't want this check itself to be costly. I'm dealing with files that can be over 100 million lines long.
Clarification
I want something like the above. Something like this
#!/usr/bin/awk -f
{
    if (length($0) != bytelength($0))
        exit 1;
}
I don't need any output. I will just trigger off the return code ($? in bash). So exit 1 if this fails. Obviously bytelength is not a function. I'm just looking for a way to achieve this without running awk twice.
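For comparison, here is a minimal Python sketch of the single-pass check described above. It is not an awk answer, and the function name `first_multibyte_line` is made up for illustration. It assumes a line that fails strict UTF-8 decoding should be treated as single-byte text rather than as an error.

```python
def first_multibyte_line(lines):
    """Return the 1-based number of the first line whose character count
    differs from its byte count, or None if no such line exists.

    `lines` is any iterable of bytes objects, e.g. a file opened in "rb" mode.
    """
    for number, raw in enumerate(lines, start=1):
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Assumption: not valid UTF-8 means a single-byte encoding.
            continue
        if len(text) != len(raw):
            return number  # stop at the first mismatch
    return None
```

To trigger off the return code as in the question, a caller could open the file with `open(path, "rb")` and pass `1` to `sys.exit` when the function returns a line number.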
UPDATE
sundeep's solution works for what I have described above:
awk -F '' -l ordchr '{for(i=1;i<=NF;i++) if(ord($i)<0) {exit 1;}}'
I was operating under the assumption that awk would count a higher-end character with a Windows single-byte encoding above 0x7F as a single character, but it actually doesn't count it at all. So byte length would still not be the same as length. I guess I will need to write this in C for something that specific.
Conclusion
So I think I did a poor job of explaining my problem. I receive data that is encoded in either UTF-8 or a Windows-style single-byte encoding like CP1252. I wanted to check whether there are any multibyte characters in the file and exit if one is found. I originally wanted to do this in awk, but working with files that may have different encodings has proven difficult.
So in a nutshell if we assume a file with a single character in it:
CHARACTER  FILE_ENCODING  ALL_SINGLE_BYTE  IN_HEX
á          UTF-8          false            0xC3 0xA1
á          CP1252         true             0xE1
a          ANY            true             0x61
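The three rows of that table can be reproduced with a short Python sketch. The helper name `all_single_byte` is hypothetical, and the sketch judges the whole input at once: anything that is not pure ASCII and does not decode as strict UTF-8 is assumed to be a single-byte encoding such as CP1252.

```python
def all_single_byte(data: bytes) -> bool:
    """True when the input holds no well-formed UTF-8 multibyte character."""
    if data.isascii():
        return True  # row 3: plain ASCII in any encoding
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: high bytes that are not valid UTF-8 are single-byte text.
        return True  # row 2: e.g. 0xE1 in CP1252
    return False     # row 1: valid UTF-8 with at least one multibyte character
```

This is only a sketch of the decision, not a fast scanner; it reads the entire input into memory before deciding.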

You seem to be targeting UTF-8 specifically. Indeed, the first byte of a multibyte character in UTF-8 has the form 0b11xxxxxx, and the next byte must be 0b10xxxxxx, where x represents any bit value (from Wikipedia).
So you can detect such sequence with sed by matching the hex ranges and exit with nonzero exit status if found:
LC_ALL=C sed -n '/[\xC0-\xFF][\x80-\xBF]/q1'
I.e., match bytes in the ranges [0b11000000-0b11111111][0b10000000-0b10111111].
I think \x?? and q are both GNU extensions to sed.
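The same byte-range match can be sketched in Python with a bytes regex. This is only an illustration of the pattern, not a replacement for the sed one-liner; the names `LEAD_THEN_CONTINUATION` and `has_lead_continuation_pair` are invented here.

```python
import re

# A byte of the form 11xxxxxx followed by a byte of the form 10xxxxxx.
LEAD_THEN_CONTINUATION = re.compile(rb"[\xC0-\xFF][\x80-\xBF]")

def has_lead_continuation_pair(data: bytes) -> bool:
    """True if the input contains a lead byte directly followed by a continuation byte."""
    return LEAD_THEN_CONTINUATION.search(data) is not None
```

Like the sed version, this only looks at two bytes, so it does not verify that a 3- or 4-byte sequence is complete.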

The best answer is imho actually the one with grep provided by Sundeep in the comment. You should try to get that working. The answer below makes use of sed in a similar way. I will probably delete it, as it really doesn't add anything to grep's solution.
What about this?
[[ -z "$(LANG=C sed -z '/[\x80-\xFF]/d' <(echo -e 'one\ntwo\nth⌫ree'))" ]]
echo $?
<(echo -e 'one\ntwo\nth⌫ree') is just an example file with a multibyte character in it
the whole sed command does one of two things:
outputs the empty string if the file contains a multibyte character
outputs the full file if it doesn't
the [[ -z string ]] returns 0 if the string has length zero and 1 otherwise, so here the exit status is 0 when the file contains a multibyte character.

Quote from the same wikipedia page above :
Fallback and auto-detection: Only a small subset of possible byte
strings are a valid UTF-8 string: the bytes C0, C1, and F5 through FF
cannot appear, and bytes with the high bit set must be in pairs, and
other requirements.
In octal, that means \xC0 = \300, \xC1 = \301, and \xF5 through \xFF = \365 through \377 are not valid in UTF-8.
Knowing that this space isn't valid UTF-8 is plenty useful in terms of wiggle room for one to insert custom delimiters inside any string :
pick any of those bytes, say \373, and once a quick if statement is used to verify it doesn't exist for that line, you can now perform custom text manipulation tricks of your choice, with a single-byte delimiter, even if it involves inserting them right in between the UTF8 bytes for a single code point, and it won't ruin the unicode at all. once you're done with the logic block, simply use a quick gsub( ) to remove all traces of it.
If that byte (\373, i.e. \xFB) exists, well, you're likely encountering either a binary file or partially corrupted UTF8 text data.
One use case, such as in my own modules, is a UTF-8 code-point-level-safe* substr( ) function. So instead of manually counting out the points 1 at a time, first use regex to count out max bytes of any code-point. Let's say 3-bytes (since 4-bytes ones are still rare in practice).
Then i apply 1 pad of \373 next to the 2-byte ones (I pad it to the left of [\302-\337]), and 2 pads of it, i.e. \373\373, next to ASCII ones, and voila, now all UTF8 code points have a fixed width, so a substr( ) becomes a mere multiplication exercise of it.
run a byte-level substr( ) on those start and end points, apply gsub(/[\373]+/, "", s) to throw away all the padding bytes, and now you have a usable* UTF-8-safe substr( ) function for all the variants of awk that aren't unicode-aware. This approach also works for multi-line records, and absolutely does not affect how FS and RS interacts with the record.
(if u need 4-bytes, just pad more)
*i haven't incorporated any fancy logic to account for code-points that are post-decomposition components that are supposedly grouped together as a single logical unit for string manipulation purposes.
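The padding idea above can be sketched in Python as follows. This is a rough illustration rather than the author's awk module: it leans on Python's own decoder to find code-point boundaries (which a non-Unicode-aware awk would do with a regex), uses a width of 4 instead of 3, and the names `pad_to_fixed_width` and `substr_codepoints` are made up.

```python
PAD = b"\xfb"  # 0xFB (\373) can never occur in well-formed UTF-8

def pad_to_fixed_width(data: bytes, width: int = 4) -> bytes:
    """Left-pad every UTF-8 code point with PAD so each occupies `width` bytes."""
    out = bytearray()
    for ch in data.decode("utf-8"):
        encoded = ch.encode("utf-8")
        out += PAD * (width - len(encoded)) + encoded
    return bytes(out)

def substr_codepoints(data: bytes, start: int, length: int, width: int = 4) -> bytes:
    """Code-point-level substr() with a 1-based start, like awk's substr()."""
    if PAD in data:
        raise ValueError("padding byte already present: binary or corrupted input?")
    padded = pad_to_fixed_width(data, width)
    begin = (start - 1) * width          # plain multiplication, no scanning
    chunk = padded[begin:begin + length * width]
    return chunk.replace(PAD, b"")       # throw away the padding again
```

Once every code point has the same width, the start and end offsets are simple products, which is the whole point of the trick.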

for non-unicode aware versions of awk,
gawk -b/ LC_ALL=C /mawk/mawk2 'BEGIN {
reUTF8="([\\000-\\177]|" \
"[\\302-\\337][\\200-\\277]|" \
"\\340[\\240-\\277][\\200-\\277]|" \
"\\355[\\200-\\237][\\200-\\277]|" \
"[\\341-\\354\\356-\\357][\\200-\\277]" \
"[\\200-\\277]|\\360[\\220-\\277]" \
"[\\200-\\277][\\200-\\277]|" \
"[\\361-\\363][\\200-\\277][\\200-\\277]" \
"[\\200-\\277]|\\364[\\200-\\217]" \
"[\\200-\\277][\\200-\\277])" }'
Set this regex. You should be able to get the same total UTF8-compliant character count as gnu-wc -lcm reports, even for purely binary files like mp3s, mp4s, or compressed gz/xz/zip archives. As long as your data itself is UTF8-compliant, this will count it, as specified in Unicode 13.
Your locale settings don't matter here whatsoever, nor does your platform, OS version, awk version, or awk variant.
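For readers who want to try the same pattern outside awk, here is a sketch of that regex transcribed from octal to hex as a Python bytes pattern. The transcription and the name `count_utf8_chars` are mine; bytes that are not part of a well-formed sequence are simply skipped rather than reported.

```python
import re

# One alternative per row of the well-formed UTF-8 byte sequence table.
UTF8_CHAR = re.compile(
    rb"[\x00-\x7F]"                          # ASCII
    rb"|[\xC2-\xDF][\x80-\xBF]"              # 2-byte sequences
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"          # 3-byte, excluding overlongs
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"   # 3-byte, general case
    rb"|\xED[\x80-\x9F][\x80-\xBF]"          # 3-byte, excluding surrogates
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"       # 4-byte, excluding overlongs
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"           # 4-byte, general case
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"       # 4-byte, up to U+10FFFF
)

def count_utf8_chars(data: bytes) -> int:
    """Count well-formed UTF-8 characters, ignoring any other bytes."""
    return sum(1 for _ in UTF8_CHAR.finditer(data))
```

On valid UTF-8 input the count should agree with the length of the decoded string.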
$ echo; time pvE0 < MV84/*BLITZE*webm | gwc -lcm
in0: 449MiB 0:00:10 [44.4MiB/s] [44.4MiB/s] [================================================>] 100%
1827289 250914815 471643928
real 0m10.188s
user 0m10.075s
sys 0m0.352s
$ echo; time pvE0 < MV84/*BLITZE*webm | mawk2x 'BEGIN { FS = "^$"} { bytes += lengthB0(); chars += lengthC0(); } END { print --NR, chars+NR, bytes+NR }'
in0: 449MiB 0:00:16 [27.0MiB/s] [27.0MiB/s] [================================================>] 100%
1827289=250914815=471643928
real 0m16.756s
user 0m16.621s
sys 0m0.449s
the file being tested is a 449 MB .webm music video clip from youtube that's 3840x2160 VP9 + Opus codecs. not too shabby for an interpreted scripting language to be this close to compiled C-binaries.
And it's only this slow for the hideously long regex to account for invalid bytes. If you're extremely sure your data is fully UTF8 compliant text, you can further optimize that regex so that mawk2 can go faster than both gnu-wc and bsd-wc :
$ brc; time pvE0 < "${m3t}" | awkwc4m
in0: 1.85GiB 0:00:14 [ 128MiB/s] [ 128MiB/s] [================================================>] 100%
12,494,275 lines 1,285,316,715 utf8 (349,725,658 uc) 1,891.656 MB ( 1983544693) /dev/stdin
real 0m14.753s <—- Custom Bash function that's entirely AWK
$ brc; time pvE0 < "${m3t}" |gwc -lcm
in0: 1.85GiB 0:00:28 [67.3MiB/s] [67.3MiB/s] [================================================>] 100%
12494275 1285316715 1983544693
real 0m28.165s <—— GNU WC
$ brc; time pvE0 < "${m3t}" |wc -lcm
in0: 1.85GiB 0:00:22 [85.5MiB/s] [85.5MiB/s] [================================================>] 100%
12494275 1285316715
real 0m22.181s <—— BSD WC
ps : "${m3t}" is a 1.85GB flat .txt file that's 12.5 million rows, and 13 fields each, filled to the brim with multibyte unicode characters (349.7 million of them).
gawk -e (in unicode mode) will complain about that regex. To circumvent that annoyance, use this regex which is the same as the one above, but expanded out to make gawk -e happy
([\000-\177]|((\302|\303|\304|\305|\306|\307|\310|\311|\312|\313|\314|\315|\316|\317|\320|\321|\322|\323|\324|\325|\326|\327|\330|\331|\332|\333|\334|\335|\336|\337)|(\340)(\240|\241|\242|\243|\244|\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257|\260|\261|\262|\263|\264|\265|\266|\267|\270|\271|\272|\273|\274|\275|\276|\277)|(\355)(\200|\201|\202|\203|\204|\205|\206|\207|\210|\211|\212|\213|\214|\215|\216|\217|\220|\221|\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|\234|\235|\236|\237))(\200|\201|\202|\203|\204|\205|\206|\207|\210|\211|\212|\213|\214|\215|\216|\217|\220|\221|\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|\234|\235|\236|\237|\240|\241|\242|\243|\244|\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257|\260|\261|\262|\263|\264|\265|\266|\267|\270|\271|\272|\273|\274|\275|\276|\277)|((\341|\342|\343|\344|\345|\346|\347|\350|\351|\352|\353|\354|\356|\357)|(\360)(\220|\221|\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|\234|\235|\236|\237|\240|\241|\242|\243|\244|\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257|\260|\261|\262|\263|\264|\265|\266|\267|\270|\271|\272|\273|\274|\275|\276|\277)|(\361|\362|\363)(\200|\201|\202|\203|\204|\205|\206|\207|\210|\211|\212|\213|\214|\215|\216|\217|\220|\221|\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|\234|\235|\236|\237|\240|\241|\242|\243|\244|\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257|\260|\261|\262|\263|\264|\265|\266|\267|\270|\271|\272|\273|\274|\275|\276|\277)|(\364)(\200|\201|\202|\203|\204|\205|\206|\207|\210|\211|\212|\213|\214|\215|\216|\217))(\200|\201|\202|\203|\204|\205|\206|\207|\210|\211|\212|\213|\214|\215|\216|\217|\220|\221|\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|\234|\235|\236|\237|\240|\241|\242|\243|\244|\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257|\260|\261|\262|\263|\264|\265|\266|\267|\270|\271|\272|\273|\274|\275|\276|\277){2})

== update = 9-20-21 ========
so turns out even the pre-slicing isn't necessary at all.
gawk -e 'BEGIN { ORS = ":";
a0 = a = "\354\236\274";
n = 1; # this # is for how many bytes
# you'd like to see
b1 = b = \
sprintf("%.*s",n + 1,a = "\301" a);
sub("^"b, "", a)
sub(/^\301/,"", b)
sub("\236|\270|\271|\272|\273|\274|\275|\276|\277",":&", a)
# for that string,
# chain up every byte in \x80-\xBF range,
# but make sure not to tag on "( )" at the 2 ends.
# that will make the regex a lot slower,
# for reasons unclear to me
printf(":" a0 "|" b1 "|" b ORS a "|") } ' | odview
yielding this output
: 잼 ** ** | 301 354 | 354 : 236 : 274 |
072 354 236 274 174 301 354 174 354 072 236 072 274 174
: ? 9e ? | ? ? | ? : 9e : ? |
58 236 158 188 124 193 236 124 236 58 158 58 188 124
3a ec 9e bc 7c c1 ec 7c ec 3a 9e 3a bc 7c
voila ~~ using only sprintf() and [g]sub(), every individual byte is at ur fingertip, even when in unicode code, without needing to use arrays at all.
===========================
since we're on the topic of awk and UTF8, a quick tip share (only on the multi-byte part):
if you're in gawk unicode-aware mode, and wanna access individual bytes of just a few utf8 chars (e.g. performing URL encoding, analyze them individually, or like packing a DWORD32), but don't wanna use the cost-heavy approach of gsub(//,"&"SUBSEP) then splitting into an array, a quick-n-dirty method is just
gsub(/\302|\303|\304|\305|\306|\307|\310|\311|\312\
|\313|\314|\315|\316|\317|\320|\321|\322|\323|\324
|\325|\326|\327|\330|\331|\332|\333|\334|\335|\336
|\337|\340|\341|\342|\343|\344|\345|\346|\347|\350
|\351|\352|\353|\354|\355|\356|\357|\360|\361|\362
|\363|\364/, "&\300")
잼 ** ** = 354 *300*<---236 274
354 236 274 075 354 300 236 274
? 9e ? = ? ? 9e ?
236 158 188 61 236 192 158 188
ec 9e bc 3d ec c0 9e bc
Basically, "slicing" properly encoded UTF8 characters right between the leading byte and the trailing ones. In my personal trial-and-error, i find the 13 bytes illegal within UTF8 (xC0 xC1 xF5-xFF) to be best suited for this task.
say original var is called b3. then use
b2 = sprintf("%.3s",b3)
to extract out \354 \300 \236.
sub(b2,"",b3)
so now b3 will only have \274.
b1 = sprintf("%.1s", b2)
b1 will now just be \354
sub(b1"\300","",b2)
and finally, b2 will actually just be the 2nd byte of \236
The reason for this painfully tedious process is that 1 gsub doubling every byte, then another full array split() plus 3 more array entry lookups, can be slightly slow. If you wanna count bytes first,
lenBytes = match($0, /$/) - 1;
# i only recently discovered
# this trick that works decently well
that match() trick even works for a random collection of bytes that bear no resemblance to Unicode, and gawk is very happy to give you the exact result. That's the only meaningful way to run match( ) against random bytes and not get an error message from gawk (the other being match($0,/^/), but that's quite useless). Try .* or . or .+ and all of them will end up erroring about a bad character in the locale.
** don't use index( ). if you need exact positions, then just split into array.
And if you need to do byte-level substring
Don't directly use substr() for random bytes in gawk unicode-mode.
Use sprintf("%.53s",b3) instead.
Before slicing, that syntax gives you 53 unicode characters.
After slicing, it's 53 bytes from start of string.
i even chain them up myself as if they're gensub() even though it's good ole' sub() :
if (sub(reANY340357,"&\301",z)||3==b) {
sub((x=sprintf("%.1s",(y=sprintf("%.3s",z))sub(y,"",z)))"\301","",y)
And once you're done with everything you need, a quick gsub(/\300|\301/, "") will restore you the proper UTF8 string.
Hope this is useful =)

Note : The code in this answer can be used to detect valid UTF-8 multi-byte characters. It will also fail if there are invalid UTF-8 byte sequences. However, it does not guarantee that your file is intended to be UTF-8. All valid UTF-8 code is also valid CP1252, but not all CP1252 is valid UTF-8.
So it seems this may be a bit of a niche problem. For me, that means time to resort to C. This should work, but, in the spirit of the question, I won't accept it in case someone can come up with an awk solution.
Here is my C solution I called hasmultibyte:
#include <stdio.h>
#include <stdlib.h>
void check_for_multibyte(FILE* in)
{
    int c = 0;
    while ((c = getc(in)) != EOF) {
        /* Floating continuation byte */
        if ((c & 0xC0) == 0x80)
            exit(5);
        /* utf8 multi-byte start */
        if ((c & 0xC0) == 0xC0) {
            int continuations = 1;
            switch (c & 0xF0) {
            case 0xF0:
                continuations = 3;
                break;
            case 0xE0:
                continuations = 2;
            }
            int i = 0;
            for (; i < continuations; ++i)
                if ((getc(in) & 0xC0) != 0x80)
                    exit(5);
            exit(0);
        }
    }
}
int main (int argc, char** argv)
{
    FILE* in = stdin;
    int i = 1;
    do {
        if (i != argc) {
            in = fopen(argv[i], "r");
            if (!in) {
                perror(argv[i]);
                exit(EXIT_FAILURE);
            }
        }
        check_for_multibyte(in);
        if (in != stdin)
            fclose(in);
    } while (++i < argc);
    return 5;
}
In the shell environment, you could then use it like this:
if hasmultibyte file.txt; then
...
fi
It will also read from stdin if no file is provided, if you want to use it at the end of a pipeline:
if cat file.txt | hasmultibyte; then
...
fi
TEST
Here is a test of the program. I created 3 files with the name Hernández in it:
name_ascii.txt - Uses a instead of á.
name_cp1252.txt - Encoded in CP1252
name_utf-8.txt - Encoded in UTF-8 (default)
The � you see is due to the invalid UTF-8 that the terminal is expecting. It is, in fact the character á in CP1252.
> file name_*
name_ascii.txt: ASCII text
name_cp1252.txt: ISO-8859 text
name_utf-8.txt: UTF-8 Unicode text
> cat name_*
Hernandez
Hern�ndez
Hernández
> hasmultibyte name_ascii.txt && echo multibyte
> hasmultibyte name_cp1252.txt && echo multibyte
> hasmultibyte name_utf-8.txt && echo multibyte
multibyte
Update
This code has been updated from the original. It has been changed to read the first byte of a multibyte character and read how many bytes the character should be. This can be determined as follows.
first byte  number of bytes
110xxxxx    2
1110xxxx    3
11110xxx    4
This method is more reliable and will reduce inaccuracies. The original method searched for a byte of the form 11xxxxxx and checked the next byte for a continuation byte (10xxxxxx). That will produce a false positive given something like â„x in a CP1252 file. In binary, this is 11100010 10000100 01111000. The first byte claims a character of 3 bytes, the second is a continuation byte, but the third is not. This is not a valid UTF-8 sequence.
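The lead-byte table and the â„x counterexample can be checked with a small Python sketch. It mirrors the idea of the C code, not its exact behavior: the bit masks here are slightly stricter (a byte of the form 11111xxx is rejected instead of being treated as a 4-byte lead), and both function names are invented for illustration.

```python
def sequence_length(first_byte: int) -> int:
    """Number of bytes a UTF-8 character should have, judged by its first byte."""
    if (first_byte & 0x80) == 0x00:
        return 1  # 0xxxxxxx
    if (first_byte & 0xE0) == 0xC0:
        return 2  # 110xxxxx
    if (first_byte & 0xF0) == 0xE0:
        return 3  # 1110xxxx
    if (first_byte & 0xF8) == 0xF0:
        return 4  # 11110xxx
    return 0      # continuation byte or invalid lead byte

def starts_with_multibyte(data: bytes) -> bool:
    """True if data begins with a complete multibyte sequence."""
    n = sequence_length(data[0]) if data else 0
    if n < 2 or len(data) < n:
        return False
    # Every byte after the lead must be a continuation byte (10xxxxxx).
    return all((b & 0xC0) == 0x80 for b in data[1:n])
```

With this check, the bytes 0xE2 0x84 0x78 are rejected because the third byte is not a continuation byte, while 0xE2 0x84 0xA2 is accepted.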
Additional testing
> # create files
> echo "â„¢" | iconv -f UTF-8 -t CP1252 > 3byte.txt
> echo "Ââ„¢" | iconv -f UTF-8 -t CP1252 > 3byte_fail.txt
> echo "â„x" | iconv -f UTF-8 -t CP1252 > 3byte_fail2.txt
> hasmultibyte 3byte.txt; echo $?
0
> hasmultibyte 3byte_fail.txt; echo $?
5
> hasmultibyte 3byte_fail2.txt; echo $?
5

Related

How to transform a string into an int equivalent in a deterministic way with gawk 5?

I am facing a case where I need to transform a string to an int equivalent with gawk5.
This transformation must be deterministic.
My first, naive, approach is to convert each letter of the string to its equivalent position in the latin alphabet and then concat the results back into a string.
For example:
my_string = "AB"
A = 1
B = 2
my_int=12
However, this has several downsides:
Very long strings may generate an integer that goes beyond maximum integer size.
What to do in case of special characters, symbols, etc. ?
This requires me to hold a table of each character position in the alphabet.
So, basically, it's a no go.
What is a good and robust method to generate an integer from a string with gawk5 ?
PS: Some will comment that gawk may not be the tool for that, and they may be right and I am aware of that. But this is for a personal project that should include only awk if possible ;)
If your string contains only ASCII characters, no newlines, and if you use GNU awk, the following simply converts each character into its 3-digits ASCII code:
$ echo "abc" | awk -vFS= '
BEGIN {for(i=0;i<128;i++) c[sprintf("%c",i)]=i}
{for(i=1;i<=NF;i++) printf("%03d",c[$i])}'
097098099
Of course this expands the string by a factor of 3, which can be sub-optimal. If you know that your string contains only ASCII characters in the 32-127 range you can reduce this factor to 2:
$ echo "abc" | awk -vFS= '
BEGIN {for(i=32;i<128;i++) c[sprintf("%c",i)]=i-32}
{for(i=1;i<=NF;i++) printf("%02d",c[$i])}'
656667
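To cross-check the 3-digit encoding outside awk, here is a small Python sketch. It is only a comparison aid, since the question asks for gawk; the function names are hypothetical. The second helper shows an alternative that reads the string's bytes as one big-endian integer, which is deterministic but treats a leading NUL byte as insignificant.

```python
def to_digit_string(text: str) -> str:
    """Concatenate the 3-digit decimal ASCII code of every character."""
    return "".join(f"{byte:03d}" for byte in text.encode("ascii"))

def to_big_int(text: str) -> int:
    """Interpret the UTF-8 bytes of the string as a big-endian integer.

    Assumption: the string does not start with NUL, otherwise two different
    strings could map to the same integer.
    """
    return int.from_bytes(text.encode("utf-8"), "big")
```

Python integers are unbounded, so the overflow concern from the question does not show up here; in gawk the digit-string form sidesteps it by staying a string.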

removing unconventional field separators (^@^@^@) in a text file [duplicate]

I have a text file containing unwanted null characters (ASCII NUL, \0). When I try to view it in vi I see ^@ symbols, interleaved in normal text. How can I:
Identify which lines in the file contain null characters? I have tried grepping for \0 and \x0, but this did not work.
Remove the null characters? Running strings on the file cleaned it up, but I'm just wondering if this is the best way?
I’d use tr:
tr < file-with-nulls -d '\000' > file-without-nulls
If you are wondering if input redirection in the middle of the command arguments works, it does. Most shells will recognize and deal with I/O redirection (<, >, …) anywhere in the command line, actually.
Use the following sed command for removing the null characters in a file.
sed -i 's/\x0//g' null.txt
this solution edits the file in place, important if the file is still being used. passing -i'ext' creates a backup of the original file with 'ext' suffix added.
A large number of unwanted NUL characters, say one every other byte, indicates that the file is encoded in UTF-16 and that you should use iconv to convert it to UTF-8.
I discovered the following, which prints out which lines, if any, have null characters:
perl -ne '/\000/ and print;' file-with-nulls
Also, an octal dump can tell you if there are nulls:
od file-with-nulls | grep ' 000'
If the lines in the file end with \r\n\000 then what works is to delete the \n\000 then replace the \r with \n.
tr -d '\n\000' <infile | tr '\r' '\n' >outfile
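Both halves of the question (finding the affected lines and removing the NULs) can also be sketched in Python. This is just an illustration of the idea with made-up function names; it works on a bytes object already read from the file.

```python
def lines_with_nul(data: bytes) -> list:
    """Return the 1-based numbers of lines that contain a NUL byte."""
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if b"\x00" in line
    ]

def strip_nul(data: bytes) -> bytes:
    """Remove every NUL byte, like tr -d '\\000'."""
    return data.replace(b"\x00", b"")
```

If nearly every other byte is NUL, the file is probably UTF-16 and should be converted rather than stripped.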
Here is example how to remove NULL characters using ex (in-place):
ex -s +"%s/\%x00//g" -cwq nulls.txt
and for multiple files:
ex -s +'bufdo!%s/\%x00//g' -cxa *.txt
For recursivity, you may use globbing option **/*.txt (if it is supported by your shell).
This is useful for scripting, since sed's -i option is a non-standard (non-POSIX) extension.
See also: How to check if the file is a binary file and read all the files which are not?
I used:
recode UTF-16..UTF-8 <filename>
to get rid of zeroes in file.
I faced the same error with:
import codecs as cd
f=cd.open(filePath,'r','ISO-8859-1')
I solved the problem by changing the encoding to utf-16
f=cd.open(filePath,'r','utf-16')
Remove a trailing null character at the end of a PDF file using PHP. This is independent of OS.
This script uses PHP to remove a trailing NULL value at the end of a binary file, solving a crashing issue that was triggered by the NULL value. You can edit this script to remove all NULL characters, but seeing it done once will help you understand how this works.
Backstory
We were receiving PDF's from a 3rd party that we needed to upload to our system using a PDF library. In the files being sent to us, there was a null value that was sometimes being appended to the PDF file. When our system processed these files, files that had the trailing NULL value caused the system to crash.
Originally we were using sed but sed behaves differently on Macs and Linux machines. We needed a platform independent method to extract the trailing null value. Php was the best option. Also, it was a PHP application so it made sense :)
This script performs the following operation:
Take the binary file, convert it to hex (binary files don't like being exploded on new lines or carriage returns), explode the string using CRLF (0d0a) as the delimiter, pop the last member of the array if its value is null (00), implode the array using CRLF, and process the file.
//In this case we are getting the file as a string from another application.
// We use this line to get a sample bad file.
$fd = file_get_contents($filename);
//We trim leading and tailing whitespace and convert the string into hex
$bin2hex = trim(bin2hex($fd));
//We create an array using CRLF (0d0a) as the delimiter
$bin2hex_ex = explode('0d0a', $bin2hex);
//look at the last element. if the last element is equal to 00 we pop it off
$end = end($bin2hex_ex);
if($end === '00') {
array_pop($bin2hex_ex);
}
//we implode the array using CRLF (0d0a) as the glue
$bin2hex = implode('0d0a', $bin2hex_ex);
//the new string no longer has the null character at the EOF
$fd = hex2bin($bin2hex);
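The core of that operation, dropping a single trailing NUL, can be shown in a few lines of Python. This is a simplified sketch and not a port of the PHP script: it removes only the NUL byte and leaves the preceding CRLF in place, and the function name is hypothetical.

```python
def strip_trailing_nul(data: bytes) -> bytes:
    """Drop one trailing NUL byte if present; otherwise return data unchanged."""
    if data.endswith(b"\x00"):
        return data[:-1]
    return data
```

Working on the raw bytes avoids the hex round trip and cannot match a delimiter that straddles two bytes.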

Converting pack to perl6

I would like to convert the following from perl5 to perl6,
$salt = pack "C*", map {int rand 256} 1..16;
It creates a string of 16 characters where each character has a randomly picked value from 0 to 255. Perl5 doesn't assign any semantics to those characters, so they could be bytes, Unicode Code Points, or something else.
I think I can get by with
$salt = (map {(^256).pick.chr},^16).join;
But I got stuck on using pack, here is my attempt,
use experimental :pack;
my $salt = pack("C*",(map {(^256).pick} , ^16)).decode('utf-8');
say $salt;
say $salt.WHAT;
and results can be either an error,
Malformed termination of UTF-8 string
in block <unit> at test.p6 line 3
or something like,
j
(Str)
My line of thought is that packing the integer List would return a Buf then decoding that should produce the required Str.
Update:
As suggested in the comments and an answer, Buf is the correct object to use. Now to follow up on the pack part,
perl6 -e 'use experimental :pack; my $salt = pack("C*",(map {(^256).pick} , ^16));say $salt;say $salt.WHAT;'
Buf:0x<7D>
(Buf)
that only packed one unit
On the other hand, using P5pack (suggested by Scimon) returns an error
perl6 -e 'use P5pack; my $salt = pack("C*",(map {(^256).pick} , ^16));say $salt;say $salt.WHAT;'
Cannot convert string to number: base-10 number must begin with valid digits or '.' in '⏏*' (indicated by ⏏)
in sub one at /home/david/.rakudobrew/moar-master/install/share/perl6/site/sources/D326BD5B05A67DBE51C279B9B9D9B448C6CDC401 (P5pack) line 166
in sub pack at /home/david/.rakudobrew/moar-master/install/share/perl6/site/sources/D326BD5B05A67DBE51C279B9B9D9B448C6CDC401 (P5pack) line 210
in block <unit> at -e line 1
Update 2:
I didn't spot the difference.
perl6 -e 'say (map {(^256).pick}, ^16).WHAT;'
(Seq)
perl6 -e 'say Buf.new((^256).roll(16)).WHAT;'
(Buf)
Now make them lists,
perl6 -e 'use experimental :pack; my $salt = pack("C*",(Buf.new((^256).roll(16)).list));say $salt;say $salt.WHAT;'
Buf:0x<39>
(Buf)
And
perl6 -e 'use P5pack; my $salt = pack("C*",Buf.new((^256).roll(16)).list);say $salt;say $salt.WHAT;'
Cannot convert string to number: base-10 number must begin with valid digits or '.' in '⏏*' (indicated by ⏏)
in sub one at /home/david/.rakudobrew/moar-master/install/share/perl6/site/sources/D326BD5B05A67DBE51C279B9B9D9B448C6CDC401 (P5pack) line 166
in sub pack at /home/david/.rakudobrew/moar-master/install/share/perl6/site/sources/D326BD5B05A67DBE51C279B9B9D9B448C6CDC401 (P5pack) line 210
in block <unit> at -e line 1
Reference:
Buffers and Binary IO
A first approach to pack/unpack in Perl 6
Thanks in advance for the help.
As ikegami says in a comment to your question, you really should use a Buf, which is basically a “string” of bytes.
my $salt = Buf.new((^256).roll(16));
You can write this to a file with something like:
spurt 'foo', $salt, :bin;
or encode it in base-64 with:
use MIME::Base64;
my $encoded = MIME::Base64.encode($salt);
But if you need this to be reasonably secure, have a look at Crypt::Random
use Crypt::Random;
my $salt = crypt_random_buf(16);
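As a cross-language point of comparison only (this is Python, not Perl 6), the same distinction between a byte buffer and a string shows up in Python's standard library. The sketch below uses `secrets.token_bytes` for the salt and `struct.pack` with a repeated `B` format as a rough analogue of `pack "C*"`; the wrapper names are invented.

```python
import secrets
import struct

def make_salt(size: int = 16) -> bytes:
    """Return `size` random bytes from the OS's cryptographic source."""
    return secrets.token_bytes(size)

def pack_unsigned_chars(values) -> bytes:
    """Pack integers in 0..255 into a byte buffer, one byte per value."""
    values = list(values)
    # "16B" means sixteen unsigned chars; the count is spelled out explicitly.
    return struct.pack(f"{len(values)}B", *values)
```

As with Buf, the result is a sequence of bytes with no text encoding attached, so there is nothing to decode unless the bytes are known to be text.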
The easiest way is probably to use https://modules.perl6.org/dist/P5pack:cpan:ELIZABETH which allows you to use Perl5 pack syntax.
Apparently "C" x 16 works but not "C*". Don't ask, I don't know why either. :-D
perl6 -e 'use experimental :pack; my $salt = pack("C" x 16,(Buf.new((^256).roll(16)).list));say $salt;say $salt.WHAT;'
Buf:0x<64 71 D4 E6 E6 AD 7B 1C DD A2 62 CC DD DA F3 08>
(Buf)
On the other hand, unpack does work.
perl6 -e 'use experimental :pack; my $salt = pack("C" x 16,(Buf.new((^256).roll(16)).list)).unpack("C*");say $salt;say $salt.WHAT;'
(35 101 155 237 153 126 109 193 94 105 70 111 59 51 131 233)
(List)
All in all, mscha's answer is P6'ish and neat. It was a silly round trip to pack a list of Buf to get a Buf.
As for the other pack conversions, note the two key points from here,
Here is the difference between Perl 5 and Perl 6 pack/unpack:
Perl 5                   Perl 6
pack(List) --> Str       pack(List) --> Buf
unpack(Str) --> List     unpack(Buf) --> List
And
some Perl 5 template rules assume an uncomplicated two-way street between Buf and Str. There simply is no real distinction in Perl 5 between Buf and Str, and Perl 5 makes use of that quite a bit.
Edit: fix typo, s/masha/mscha/;

How to remove a specific pattern of junk values from a file using awk or sed?

I have two types of pattern in my xml file which I want to remove without disturbing any other meaningful patterns.
testname="#TEST-Loop${c}- 05030502956 #TEST - verify that the Handler returns an error indicating â~@~\call barredâ~@~]." enabled="true">
I want to change it to
testname="#TEST-Loop${c}- 05030502956 #TEST - verify that the Handler returns an error indicating call barred." enabled="true">
I tried the code below, but it didn't work
awk '{if(match($0,/#TEST.*" enabled="true">$/))
gsub(/â~@~\\/,"");
gsub(/â~@~\]/,"");
print}' $file >> tmp.jmx && mv tmp.jmx $file
The pattern you are attempting to replace looks like a mangled UTF-8 character viewed in some legacy 8-bit encoding. Because you don't specify which encoding that is, we have to do a fair amount of guesswork.
You are asking about Unix tools, so this answer assumes that you are using some U*x derivative or have access to similar tools on your local box (Cygwin?)
To find the actual bytes in the string you want to replace, you can do something like
bash$ grep -o '...~@~...' -m1 "$file" |
> od -Ax -tx1o1
0000000 67 20 e2 7e 40 7e 5c 63 61 0a
147 040 342 176 100 176 134 143 141 012
000000a
I use od for portability reasons; you might prefer hexdump or xxd or some other tool. The output includes both hex and octal, as octal is preferred in Awk but hex is ubiquitous in programming otherwise. I keep a couple of characters of context around the match in case â would in fact be stored in a multibyte encoding in your sample, but here, in this somewhat speculative example, it turns out it is represented by the single byte 0xE2 (octal 342). (This would identify your terminal encoding as Latin-1 or some close relative; maybe one of the CP125x Windows encodings.)
Armed with this information, we can proceed with
awk '{ gsub(/\342~@~./, "") }1' "$file"
to replace the pesky character sequence, or perhaps
sed $'s/\xe2~@~.//g' "$file"
which assumes your shell is Bash or some near-compatible which allows you to use C-style strings $'...' -- alternatively, if you know your sed dialect supports a particular notation for unprintable characters, you can use that, but that's even less portable.
(If your sed supports the -i option, or you have GNU Awk 4.1 or later with its -i inplace extension, you can replace the file in-place, i.e. have the script replace the file with a modified version without the need for redirection or temporary files. Again, this has portability issues.)
I want to emphasize that we cannot guess your encoding so your question should ideally include this information. See further the Stack Overflow character-encoding tag wiki for guidance on what to include in a question like this.
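The byte-level replacement can also be sketched in Python, working on bytes so that no encoding has to be guessed. The pattern follows the od dump above (0xE2 0x7E 0x40 0x7E plus one more byte); the names `MANGLED` and `clean` are made up, and the sample line is abbreviated from the question.

```python
import re

# 0xE2, then "~@~", then any single byte (the trailing "\" or "]").
MANGLED = re.compile(rb"\xe2~@~.")

def clean(line: bytes) -> bytes:
    """Remove every occurrence of the mangled five-byte sequence."""
    return MANGLED.sub(b"", line)
```

Reading and writing the file in binary mode ("rb" and "wb") keeps every other byte of the file untouched.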

Returning uint Value from Char in gawk

I'm trying to get value of an ASCII char I receive via RS232 to convert them into binary like values.
Example:
0xFF-->########
0x01--> #
0x02--> #
...
My Problem is to get the value of ASCII chars higher than 127.
Test-Code to get the int value:
echo -e "\xFF" | gawk -l ordchr -e '{printf("%c : %i", ord($0),ord($0))}'
Return:
� : -1
Test-Code 2:
echo -e "\x61" | gawk -l ordchr -e '{printf("%c : %i", ord($0),ord($0))}'
Return:
a : 97
So my solution to convert the values into unsigned int, is like this:
if(ord($0)<0)
{
new_char=ord($0)+256;
}
else new_char = ord($0)+0
But I wanted to know if there was a way to cast directly an int as uint in gawk.
Later I tried to write my own ord() function.
#!/bin/bash
echo -e "\xFF" | awk 'BEGIN {_ord_init()}
{
printf("%s : %d\n", $0, ord($0))
}
function _ord_init( i, t)
{
for (i=0; i <= 255; i++) {
t = sprintf("%c", i)
_ord_[t] = i
}
}
function ord(str, c)
{
# only first character is of interest
c = substr(str, 1, 1)
return _ord_[c]
}'
0xFF returns:
� : 0
0x61 returns:
a : 97
Can someone explain me the behavior?
I'm using:
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4-p1, GNU MP 6.1.1)
But I wanted to know if there was a way to cast directly an int as uint in gawk.
Actually, any string in awk is, in the end, a number.
Strings are converted to numbers and numbers are converted to strings,
if the context of the awk program demands it. [...] A string is
converted to a number by interpreting any numeric prefix of the string
as numerals: "2.5" converts to 2.5, "1e3" converts to 1,000, and
"25fix" has a numeric value of 25. Strings that can’t be interpreted
as valid numbers convert to zero. source
Let's make a quick test:
BEGIN {
print 0xff
print 0xff + 0
print 0xff +0.0
print "0xff"
}
# 255
# 255
# 255
# 0xff
So a hexadecimal constant in gawk source code is read as a non-negative number, while the string "0xff" is left as a string. Casting an int to uint is a trickier question: in two's complement, a negative 8-bit value is reinterpreted as unsigned by adding 256 to it, which is what your workaround already does. But you should not normally need to do so in awk.
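A minimal sketch of that unsigned conversion, assuming the value arrives as a signed 8-bit number in the range -128..127 (as ord() reports it in the question). The name to_u8 is made up for this illustration; it is not a gawk built-in.

```shell
# to_u8 is a hypothetical helper name, not part of awk or gawk.
to_u8() {
    awk -v n="$1" '
        # Map a signed 8-bit value (-128..127) onto 0..255.
        # The double modulo keeps non-negative inputs unchanged.
        function to_u8(v) { return (v % 256 + 256) % 256 }
        BEGIN { print to_u8(n) }'
}

to_u8 -1    # prints 255
to_u8 97    # prints 97
```

This only uses plain awk arithmetic, so it should behave the same in gawk and mawk.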
Remember that conversion is made as a call to sprintf() and you may control it via the CONVFMT variable:
CONVFMT
A string that controls the conversion of numbers to strings
(see section Conversion of Strings and Numbers). It works by being
passed, in effect, as the first argument to the sprintf() function
(see section String-Manipulation Functions). Its default value is
"%.6g". CONVFMT was introduced by the POSIX standard. source
Remember that locale settings may affect the way the conversion is performed, especially the decimal separator. The details are out of scope here.
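To make the CONVFMT behaviour quoted above concrete, here is a small sketch. The function name convfmt_demo and the sample values are only for illustration, and LC_ALL=C is set to keep the decimal separator predictable.

```shell
# convfmt_demo is a hypothetical name used only for this illustration.
convfmt_demo() {
    LC_ALL=C awk 'BEGIN {
        CONVFMT = "%.2f"
        x = 3.14159
        s = x ""        # concatenation forces a number-to-string conversion
        print s         # prints 3.14
        n = 255
        print (n "")    # integral values are converted as integers: prints 255
    }'
}

convfmt_demo
```

Note that CONVFMT only matters for non-integral numbers, so a byte value such as 255 comes through unchanged.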
Can someone explain me the behavior?
I can't actually reproduce it, but I suspect this line of code:
# only first character is of interest
c = substr(str, 1, 1)
In your example, the first char is always 0 and the output should always be the same. I'm testing this online.
Here is another example of my own:
BEGIN {
a = 0xFF
b = 0x61
printf("a: %d %f %X %s %c\n", a,a,a,a,a)
printf("b: %d %f %X %s %c\n", b,b,b,b,b)
}
# a: 255 255.000000 FF 255 ÿ
# b: 97 97.000000 61 97 a
Run gawk in binary mode (gawk -b) to stop it from stitching bytes together into UTF-8 code points. Split the input on the empty string; each element of the resulting array will then be exactly one byte wide.
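A rough sketch of the byte-mode idea, applied to the lookup-table ord() from the question. It assumes gawk or mawk and uses LC_ALL=C rather than gawk -b so that it is not gawk-specific; byte_ord and ord_table are names invented for this example. As far as I understand gawk's documented %c behaviour, in a UTF-8 locale sprintf("%c", 255) yields the two-byte encoding of U+00FF rather than the single byte 0xFF, which would explain why a byte-indexed table misses there.

```shell
# byte_ord (hypothetical name): print the unsigned value of the first byte
# of each input line. LC_ALL=C makes awk treat every byte as one character.
byte_ord() {
    LC_ALL=C awk '
        BEGIN {
            # Build a byte -> number lookup table (NUL is skipped).
            for (i = 1; i <= 255; i++)
                ord_table[sprintf("%c", i)] = i
        }
        { print ord_table[substr($0, 1, 1)] }'
}

printf '\377\n' | byte_ord   # prints 255
printf 'a\n'    | byte_ord   # prints 97
```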
For the other way around, just pre-build an array from 0 to 255. Gawk doesn't have to stop there at all. In my routine gawk startup sequence, I run that same custom ord sequence from 0x3134F all the way back down to zero (around 200k entries). The reason to do it backwards is that, for whatever reason, some code points come out as an IDENTICAL character that gawk can't differentiate. Going in reverse ensures the lowest-numbered code point is the one assigned to it. For this I run gawk in its regular UTF-8 mode.
For your scenario I would just pre-build a lookup table mapping the four-hex-digit strings 0000 through FFFF to their integer values; then, for each byte pair 0xZZ 0xWW, look up ZZWW in that table and get back an integer.
If you just try ord() on 128 to 255 it usually won't work like that, because U+0080 (128) is where UTF-8 starts needing 2 bytes per code point; U+0800 is where 3 bytes begin and U+10000 is where 4 bytes begin. I'm not too familiar with the encodings that extend ASCII to 256 values; they usually require iconv or similar to convert to UTF-8 first.
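Those thresholds can be written down as a small sketch. The function name utf8_len is hypothetical, and decimal constants are used because hexadecimal constants in awk source are not portable outside gawk.

```shell
# utf8_len (hypothetical name): number of bytes UTF-8 needs for a code point,
# given as a decimal number.
utf8_len() {
    awk -v cp="$1" 'BEGIN {
        if      (cp < 128)   print 1   # U+0000..U+007F
        else if (cp < 2048)  print 2   # U+0080..U+07FF   (0x800   = 2048)
        else if (cp < 65536) print 3   # U+0800..U+FFFF   (0x10000 = 65536)
        else                 print 4   # U+10000 and above
    }'
}

utf8_len 65       # prints 1
utf8_len 233      # prints 2
utf8_len 8364     # prints 3
utf8_len 128512   # prints 4
```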
A quick note: if you have raw UTF-8 bytes and want to figure out how many code points they make up, just delete every byte in the range 0x80 - 0xBF. The length() (in bytes) of what remains is the number of code points.
In decimal terms, the 256 byte values fall into 4 ranges of 64:
000 - 063 - ASCII
064 - 127 - ASCII
128 - 191 - UTF-8 multi-byte continuation bytes (0x80 - 0xBF)
192 - 255 - the leading byte of a UTF-8 multi-byte character
That looks hideous. Luckily, octal comes to the rescue: the 0x80 - 0xBF range is just \200-\277, and you can use it in any of awk's regexes (including FS, RS, etc.). I spent time hand-coding the UTF-8 algorithm with all that bit-shifting before realizing, much later, that I didn't need it to reach my end goal.
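The counting trick described above might be sketched like this. It assumes the input is valid UTF-8 and that awk runs in byte mode (forced here with LC_ALL=C); count_codepoints is a made-up name, and line terminators are not counted.

```shell
# count_codepoints (hypothetical name): count UTF-8 code points on stdin
# by deleting continuation bytes and measuring what is left.
count_codepoints() {
    LC_ALL=C awk '
        {
            gsub(/[\200-\277]/, "")   # drop continuation bytes 0x80-0xBF
            n += length($0)           # one remaining byte per code point
        }
        END { print n + 0 }'
}

# "h", U+00E9, "llo", space, U+20AC: 10 bytes, 7 code points
printf 'h\303\251llo \342\202\254\n' | count_codepoints   # prints 7
```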
You can easily beat the system's built-in wc -m command at counting UTF-8 code points by combining the logic above with mawk2. On my 2.5-year-old laptop, against a 1.83 GB flat text file FILLED with Unicode all over, I got it down to approximately 19 seconds to count 1.29 billion UTF-8 code points, using just awk.
I've run into the same problem myself. I ended up first writing a detector for whether gawk is running in Unicode mode or byte mode (check whether length() of a three-byte octal sequence that makes up one UTF-8 code point returns 1 or 3).
Then, when it sees gawk in Unicode mode, it runs a custom shell command from within gawk, uses the Unix printf to print bytes 128-255, and reads them back into a gawk array. If you need it I can paste the code sometime (but it's SUPER hideous, so I hope I won't get dinged for its lack of elegance).
This is needed because bytes like C0, C1, or FF simply never occur in valid UTF-8, so no matter what combination you attempt, you cannot generate all 256 byte values within gawk in Unicode mode. Another way to do it would be to pre-build that byte sequence, store it as a hex string using something like xxd -ps, and convert it back at runtime, but that's admittedly slower.
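For what it's worth, the detector part can be sketched in one line. The variable name detect_prog is invented for this example, and the result of the first call depends on your locale and on which awk you have (mawk, as far as I know, always counts bytes).

```shell
# detect_prog (hypothetical name): "\342\202\254" is the three-byte UTF-8
# encoding of U+20AC, so length() is 1 in Unicode mode and 3 in byte mode.
detect_prog='BEGIN { print (length("\342\202\254") == 1 ? "unicode" : "bytes") }'

awk "$detect_prog"             # result depends on locale and awk implementation
LC_ALL=C awk "$detect_prog"    # prints bytes
```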