Perl6 IO::Socket::Async truncates data - raku

I'm rewriting my P5 socket server in P6 using IO::Socket::Async, but the received data is truncated by one character at the end, and that character shows up at the start of the next connection. Someone from the Perl 6 Facebook group (Jonathan Worthington) pointed out that this might be because strings and bytes are handled very differently in P6. Quoted:
In Perl 6, strings and bytes are handled very differently. Of note, strings work at grapheme level. When receiving Unicode data, it's not only possible that a multi-byte sequence will be split over packets, but also a multi-codepoint sequence. For example, one packet might have the letter "a" at the end, and the next one would be a combining acute accent. Therefore, it can't safely pass on the "a" until it's seen how the next packet starts.
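You can see this grapheme-level behavior directly (a quick demo, not from the original thread):
say "a\c[COMBINING ACUTE ACCENT]".chars;   # 1: a single grapheme
say "a\c[COMBINING ACUTE ACCENT]".codes;   # 2: two codepoints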
My P6 is running on MoarVM
https://pastebin.com/Vr8wqyVu
use Data::Dump;
use experimental :pack;
my $socket = IO::Socket::Async.listen('0.0.0.0', 7000);
react {
    whenever $socket -> $conn {
        my $line = '';
        whenever $conn {
            say "Received --> " ~ $_;
            # numeric comparison (>=); 'ge' would compare as strings
            $conn.print: translate($_) if $_.chars >= 100;
            $conn.close;
        }
    }
    CATCH {
        default {
            say .^name, ': ', .Str;
            say "handled in $?LINE";
        }
    }
}
sub translate($raw is copy) {                # 'is copy' so s/// can modify it
    my $rawdata = $raw;
    $raw ~~ s:g/^\s+ | \s+$//;               # remove leading/trailing whitespace
    my $minus_checksum = substr($raw, 0, *-2);
    my $our_checksum  = generateChecksum($minus_checksum);
    my $data_checksum = substr($raw, *-2);   # the checksum sent with the data
    # say $our_checksum;
    return $our_checksum;
}
sub generateChecksum($minus_checksum) {
    # turn string into Blob
    my Blob $blob = $minus_checksum.encode('utf-8');
    # unpack Blob into a list of byte values
    my @array = $blob.unpack("C*");
    # XOR each byte value into the checksum
    my $dec = 0;
    $dec +^= $_ for @array;
    # only take 2 digits
    $dec = sprintf("%02d", $dec) if $dec ~~ /^\d$/;
    $dec = '0' ~ $dec if $dec ~~ /^<[a..fA..F]>$/;
    $dec = uc $dec;
    # convert it to hex
    my $hex = sprintf '%02x', $dec;
    return uc $hex;
}
Result
Received --> $$0116AA861013034151986|10001000181123062657411200000000000010235444112500000000.600000000345.4335N10058.8249E00015
Received --> 0
Received --> $$0116AA861013037849727|1080100018112114435541120000000000000FBA00D5122500000000.600000000623.9080N10007.8627E00075
Received --> D
Received --> $$0108AA863835028447675|18804000181121183810421100002A300000100900000000.700000000314.8717N10125.6499E00022
Received --> 7
Received --> $$0108AA863835028447675|18804000181121183810421100002A300000100900000000.700000000314.8717N10125.6499E00022
Received --> 7
Received --> $$0108AA863835028447675|18804000181121183810421100002A300000100900000000.700000000314.8717N10125.6499E00022
Received --> 7
Received --> $$0108AA863835028447675|18804000181121183810421100002A300000100900000000.700000000314.8717N10125.6499E00022
Received --> 7

First of all, TCP connections are streams, so there are no guarantees that the "messages" that are sent will be received as equivalent "messages" on the receiving end. Things that are sent can be split up or merged as part of normal TCP behavior, even before Perl 6 behavior is considered. Anything that wants a "messages" abstraction needs to build it on top of the TCP stream (for example, by sending data as lines, or by sending a size in bytes followed by the data).
In Perl 6, the data arriving over the socket is exposed as a Supply. A whenever $conn { } is short for whenever $conn.Supply { } (the whenever will coerce whatever it is given into a Supply). The default Supply is a character one, decoded as UTF-8 into a stream of Perl 6 Str. As noted in the answer you already received, strings in Perl 6 work at grapheme level, so it will keep back a character in case the next thing that arrives over the network is a combining character. This is the "truncation" that you are experiencing. (There are some things which can never be combined. For example, \n can never have a combining character placed on it. This means that line-oriented protocols won't encounter this kind of behavior, and can be implemented as simply whenever $conn.Supply.lines { }.)
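For a line-oriented protocol, that can be as simple as this (a minimal sketch, assuming messages are \n-terminated):
react {
    whenever IO::Socket::Async.listen('0.0.0.0', 7000) -> $conn {
        # .lines only emits complete lines, so nothing is held back
        whenever $conn.Supply.lines -> $line {
            say "Received --> $line";
        }
    }
}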
There are a couple of options available:
Do whenever $conn.Supply(:bin) { }, which will deliver binary Blob objects corresponding to what the OS passed to the VM. Those can then be .decode'd as wanted (see the sketch after this list). This is probably your best bet.
Specify an encoding that does not support combining characters, for example whenever $conn.Supply(:enc('latin-1')) { }. (However, note that since \r\n is one grapheme, a message ending in \r would still be held back in case the next packet starts with \n.)
In both cases, it's still possible for messages to be split up during transmission, but these will (entirely and mostly, respectively) avoid the keep-one-back requirement that grapheme normalization entails.
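A minimal sketch of the binary approach from the first option (message reassembly and error handling omitted):
react {
    whenever IO::Socket::Async.listen('0.0.0.0', 7000) -> $conn {
        whenever $conn.Supply(:bin) -> $buf {
            # $buf is a Blob, exactly as received from the OS; in a real
            # protocol you would reassemble complete messages before decoding
            say "Received --> " ~ $buf.decode('utf-8');
        }
    }
}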

Related

Managing EOF in Trx Library

I am using the TRX library to process ISO8583 messages. The raw data I receive ends with an EOF character, but that last byte is not removed from the buffer because it's not defined in the packager, and it's causing an issue when parsing the next transaction. How do I manage this?
And when sending a response back, how do I add the EOF character?
Normally, this kind of stuff (protocol characters) is removed before decoding ISO8583 data.
For example, you read 100 bytes from the socket, and it's ISO data plus an EOF character. You would remove the EOF character and run the 99 bytes of ISO data through a decoder.
And the reverse is true when you send data: you encode your data first, then add the EOF character. The resulting byte array goes into the socket.
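As a sketch (in Raku, and independent of the TRX library; the 0x04 terminator byte is an assumption, substitute whatever EOF character your host actually sends):
constant EOF-BYTE = 0x04;   # assumption: adjust to your protocol

# strip the trailing EOF byte before decoding
sub unframe(Blob $raw) {
    $raw.elems && $raw[*-1] == EOF-BYTE
        ?? $raw.subbuf(0, $raw.elems - 1)   # drop the terminator
        !! $raw;
}

# append the EOF byte after encoding a response
sub frame(Blob $iso) {
    $iso ~ Blob.new(EOF-BYTE);
}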
Sorry, I don't know anything about TRX library, but, hopefully, general advice helps you somewhat.

Is it possible to interpolate Array values in token?

I'm working on homoglyphs module and I have to build regular expression that can find homoglyphed text corresponding to ASCII equivalent.
So for example I have character with no homoglyph alternatives:
my $f = 'f';
and character that can be obfuscated:
my @o = 'o', 'о', 'ο'; # ASCII o, Cyrillic o, Greek omicron
I can easily build regular expression that will detect homoglyphed phrase 'foo':
say 'Suspicious!' if $text ~~ / $f @o @o /;
But how should I compose such regular expression if I don't know the value to detect in compile time? Let's say I want to detect phishing that contains homoglyphed 'cash' word in messages. I can build sequence with all the alternatives:
my @lookup = ['c', 'с', 'ϲ', 'ς'], ['a', 'а', 'α'], 's', 'h'; # arbitrary runtime length
Now obviously following solution cannot "unpack" array elements into the regular expression:
/ @lookup / # doing LTM, not searching elements in sequence
I can workaround this by manually quoting each element and compose text representation of alternatives to get string that can be evaluated as regular expression. And build token from that using string interpolation:
my $regexp-ish = textualize( @lookup ); # string "[ 'c' | 'с' | 'ϲ' | 'ς' ] [ 'a' | 'а' | 'α' ] 's' 'h'"
my $token = token { <$regexp-ish> }
But that is quite error-prone.
Is there any cleaner solution to compose regular expression on the fly from arbitrary amount of elements not known at compile time?
The Unicode::Security module implements confusables by using the Unicode consortium tables. It's actually not using regular expressions, just looking up different characters in those tables.
I'm not sure this is the best approach to use.
I haven't implemented a confusables[1] module yet in Intl::, though I do plan on getting around to it eventually. Here are two different ways I could imagine such a token looking.[2]
my token confusable($source) {
    :my $i = 0;                                       # create a counter var
    [
        <?{                                           # succeed only if
            my $a = self.orig.substr: self.pos + $i, 1;  # the test character A
            my $b = $source.substr: $i++, 1;             # the source character B and
            so $a eq $b                               # are the same or
            || $a eq %*confusables{$b}.any;           # the A is one of B's confusables
        }>
        .                                             # because we succeeded, consume a char
    ] ** {$source.chars}                              # repeat for each grapheme in the source
}
Here I used the dynamic hash %*confusables, which would be populated in some way — that will depend on your module, and it may not even necessarily be dynamic (for example, having the signature :($source, %confusables), or referencing a module variable, etc.).
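For instance, it could be populated in the calling scope right before matching (hypothetical entries, not real confusables data):
my %*confusables =
    'o' => <о ο>,    # Cyrillic o, Greek omicron
    'c' => <с ϲ ς>;  # Cyrillic es, lunate sigma, final sigma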
You can then have your code work as follows:
say $foo ~~ /<confusable: 'foo'>/
This is probably the best way to go about things, as it will give you a lot more control — I took a peek at your module and it's clear you want to enable 2-to-1 glyph relationships, and eventually you'll probably want to be running code directly over the characters.
If you are okay with just 1-to-1 relationships, you can go with a much simpler token:
my token confusable($source) {
    :my @chars = $source.comb;         # split the source
    @(                                 # match the array based on
        |(                             # a slip of
            %confusables{@chars.head}  # the confusables
            // Empty                   # (or nothing, if none)
        ),                             #
        @chars.shift                   # and the char itself
    )                                  #
    ** {$source.chars}                 # repeating for each source char
}
The @(…) structure lets you effectively create an ad hoc array to be interpolated. In this case, we just slip in the confusables alongside the original character, and that's that. You have to be careful, though, because a non-existent hash item will return the type object (Any), and that messes things up here (hence // Empty).
In either case, you'll want to use arguments with your token, as constructing regexes on the fly is fraught with potential gotchas and interpolation errors.
[1] Unicode calls homographs both "visually similar characters" and "confusables".
[2] The dynamic hash here, %confusables, could be populated any number of ways, and may not necessarily need to be dynamic, as it could be passed via the arguments (using a signature like :($source, %confusables)) or referenced from a module variable.

Split a BibTeX author field into parts

I am trying to parse a BibTeX author field using the following grammar:
use v6;
use Grammar::Tracer;
# Extract BibTeX author parts from string. The parts are separated
# by a comma and optional space around the comma
grammar Author {
    token TOP {
        <all-text>
    }
    token all-text {
        [<author-part> [[\s* ',' \s*] || [\s* $]]]+
    }
    token author-part {
        [<-[\s,]> || [\s* <!before ','>]]+
    }
}
my $str = "Rockhold, Mark L";
my $result = Author.parse( $str );
say $result;
Output:
TOP
| all-text
| | author-part
| | * MATCH "Rockhold"
| | author-part
But here the program hangs (I have to press CTRL-C to abort).
I suspect the problem is related to the negative lookahead assertion. I tried to remove it, and then the program does not hang anymore, but then I am also not able to extract the last part "Mark L" with an internal space.
Note that for debugging purposes, the Author grammar above is a simplified version of the one used in my actual program.
The expression [\s* <!before ','>] may not make any progress. Since it's in a quantifier, it will be retried again and again (but not move forward), resulting in the hang observed.
Such a construct will reliably hang at the end of the string; doing [\s* <!before ',' || $>] fixes it by making the lookahead fail at the end of the string also (being at the end of the string is a valid way to not be before a ,).
At least for this simple example, it looks like the whole author-part token could just be <-[,]>+, but perhaps that's an oversimplification for the real problem that this was reduced from.
Glancing at all-text, I'd also point out the % quantifier modifier which makes matching comma-separated (or anything-separated, really) things easier.
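Putting those suggestions together, a sketch of the simplified grammar (using the <-[,]>+ oversimplification mentioned above):
grammar Author {
    token TOP      { <all-text> }
    # '+ %' requires a separator (comma with optional whitespace)
    # between consecutive author-parts
    token all-text { <author-part>+ % [\s* ',' \s*] }
    token author-part { <-[,]>+ }
}
say Author.parse("Rockhold, Mark L");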

Why does Redis not work with requirepass directive?

I want to set a password to connect to a Redis server.
The appropriate way to do that is using the requirepass directive in the configuration file.
http://redis.io/commands/auth
However, after setting the value, I get this upon restarting Redis:
Stopping redis-server: redis-server.
Starting redis-server: Segmentation fault (core dumped)
failed
Why is that?
The password length is limited to 512 characters.
In redis.h:
#define REDIS_AUTHPASS_MAX_LEN 512
In config.c:
} else if (!strcasecmp(argv[0],"requirepass") && argc == 2) {
    if (strlen(argv[1]) > REDIS_AUTHPASS_MAX_LEN) {
        err = "Password is longer than REDIS_AUTHPASS_MAX_LEN";
        goto loaderr;
    }
    server.requirepass = zstrdup(argv[1]);
}
Now, the parsing mechanism of the configuration file is quite basic. All the lines are split using the sdssplitargs function of the sds (string management) library. This function interprets specific sequences of characters, such as:
single and double quotes
\x hex digits
special characters such as \n, \r, \t, \b, \a
Here the problem is that your password contains a lone double quote character. The parsing fails because there is no matching double quote at the end of the string. In that case, the sdssplitargs function returns a NULL pointer. The core dump occurs because this pointer is not properly checked in the config.c code:
/* Split into arguments */
argv = sdssplitargs(lines[i],&argc);
sdstolower(argv[0]);
This is a bug that should be filed IMO.
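A guard along these lines would avoid the crash (a sketch of the kind of fix needed, not the actual upstream patch):
/* Split into arguments */
argv = sdssplitargs(lines[i],&argc);
if (argv == NULL) {
    /* unbalanced quotes: report a config error instead of crashing */
    err = "Unbalanced quotes in configuration line";
    goto loaderr;
}
sdstolower(argv[0]);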
A simple workaround would be to replace the double quote character (or any other interpreted character) with a hexadecimal escape sequence (i.e. \x22 for the double quote).
Although not documented, it seems there are limitations on the password value, particularly regarding which characters it contains, not its length.
I tried with 160 characters (just digits) and it works fine.
This
9hhNiP8MSHZjQjJAWE6PmvSpgVbifQKCNXckkA4XMCPKW6j9YA9kcKiFT6mE
works too. But this
#hEpj6kNkAeYC3}#:M(:$Y,GYFxNebdH<]8dC~NLf)dv!84Z=Tua>>"A(=A<
does not.
So, Redis does not support some or all of the "special characters".
Just nailed this one by URL-encoding the password first:
php: urlencode('crazy&char\'s^pa$$wor|]');
-or-
js: encodeURIComponent("crazy&char's^pa$$wor|]");
The encoded value can then be used anywhere and sent to the Redis server over (usually) TCP.

Can NMEA values contain '*' (asterisks)?

I am trying to create NMEA-compatible proprietary sentences, which may contain arbitrary strings.
The usual format for an NMEA sentence with checksum is:
$GPxxx,val1,val2,...,valn*ck<cr><lf>
where * marks the start of a 2-digit checksum.
My question is: Can any of the value fields contain a * character themselves?
It would seem possible for a parser to wait for the final <cr><lf>, then to look back at the previous 3 characters to find the checksum if present (rather than just waiting for the first * in the sentence). However I don't know if the standard allows it.
Are there other characters which may cause problems?
The two ASCII characters to be careful with are $, which has to be at the start, and *, which precedes the checksum. Anyone else parsing your custom NMEA wouldn't expect to find either of those characters anywhere else. Some parsers, when they hit a $, assume that a new line has started. With serial port communication, characters sometimes get lost in transit, and that's why there's a $ start-of-sentence marker.
If you're going to make your own NMEA commands, it is customary to start them with P followed by a 3-character code indicating the manufacturer or company creating the proprietary message, so you could use $PSQU. Note that although it is recommended that NMEA commands be 5 characters long, there are proprietary messages out there from various hardware and software manufacturers that are anywhere from 4 to 7 characters long.
Obviously if you're writing your own parser you can do what you like.
This website is rather useful:
http://www.gpsinformation.org/dale/nmea.htm
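For reference, the two checksum digits are the XOR of every character between the $ and the *. A minimal Raku sketch for emitting a proprietary sentence (the $PSQU code is just the example suggested above):
sub nmea-sentence(*@fields) {
    my $payload = ('PSQU', |@fields).join(',');
    # XOR the byte values of everything between '$' and '*'
    my $ck = [+^] $payload.encode('ascii').list;
    sprintf '$%s*%02X', $payload, $ck;
}
say nmea-sentence('val1', 'val2');   # prints $PSQU,val1,val2 plus '*' and two hex digits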
If you're extending the protocol yourself (based on "proprietary") - then sure, you can put in anything you like. I would stick to ASCII, but go wild within those bounds. (Obviously, you need to come up with your own $GPxxx so as not to clash with existing messages. Perhaps a new header $SQUEL, ...)
By definition, a proprietary message will not be NMEA-compatible.
A standard parser listening to an NMEA stream should ignore anything that doesn't match what it thinks is 'good' data. That means a checksum error, or any massively corrupted message like it would think your new message is with some random *s thrown in.
If you are merely emitting an existing message type, then a * in a value field doesn't make sense and should be ignored, but you run the risk of major issues if the checksum happens to be correct and the parser doesn't understand the payload.