Perl6 grammars: match full line

I've just started exploring Perl 6 grammars. How can I write a token "line" that matches everything between the beginning of a line and its end? I've tried the following without success:
my $txt = q:to/EOS/;
row 1
row 2
row 3
EOS
grammar sample {
token TOP {
<line>
}
token line {
^^.*$$
}
}
my $match = sample.parse($txt);
say $match<line>[0];

I can see two problems in your grammar. The first is the token line: ^^ and $$ anchor to the start and end of a line, but . also matches newlines, so .* can run across several lines in between. To illustrate, let's use a plain regex, without a grammar, first:
my $txt = q:to/EOS/;
row 1
row 2
row 3
EOS
if $txt ~~ m/^^.*$$/ {
say "match";
say $/;
}
Running that, the output is:
match
「row 1
row 2
row 3」
As you can see, the regex matches more than desired. That is not the main problem, though: because of ratcheting, the same pattern does not match at all when it is declared as a token:
my $txt = q:to/EOS/;
row 1
row 2
row 3
EOS
my regex r {^^.*$$};
if $txt ~~ &r {
say "match regex";
say $/;
} else {
say "does not match regex";
}
my token t {^^.*$$};
if $txt ~~ &t {
say "match token";
say $/;
} else {
say "does not match token";
}
Running that, the output is:
match regex
「row 1
row 2
row 3」
does not match token
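The likely reason is that a token implies ratcheting (no backtracking): .* consumes the whole string, including the final newline, and $$ does not match at that point, because the end of the last line is just before the trailing newline. A regex can back off one character and succeed; a token cannot. What you want instead is to match everything except a newline, which is \N*.
The following grammar mostly solves your issue: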
grammar sample {
token TOP {<line>}
token line {\N+}
}
However, TOP asks for a single <line>, so it can only ever match the first line (and .parse, which has to consume the whole string, fails on your three-line text). What you probably want is a line followed by optional vertical whitespace, repeated several times. (Your string ends with a newline, but I guess you would also like to capture the last line when there is no trailing newline.)
my $txt = q:to/EOS/;
row 1
row 2
row 3
EOS
grammar sample {
token TOP {[<line>\v?]*}
token line {\N+}
}
my $match = sample.parse($txt);
for $match<line> -> $l {
say $l;
}
The output of that script is:
「row 1」
「row 2」
「row 3」
Also, two really useful modules for working with and debugging grammars are Grammar::Tracer and Grammar::Debugger. Just use them at the beginning of the script. Tracer shows a colorful tree of the matching done by your grammar. Debugger lets you watch it match step by step in real time.

Your original approach can be made to work via:
grammar sample {
token TOP { <line>+ %% \n }
token line { ^^ .*? $$ }
}
Personally, I would not try to anchor line and would use \N instead, as already suggested.

my $txt = q:to/EOS/;
row 1
row 2
row 3
EOS
grammar sample {
token TOP {
<line>+
}
token line {
\N+ \n
}
}
my $match = sample.parse($txt);
say $match<line>[0];
Or if you can be specific about the line:
grammar sample {
token TOP {
<line>+
}
rule line {
\w+ \d
}
}

my $txt = q:to/EOS/;
row 1
row 2
row 3
EOS
grammar sample {
token TOP { <line> }
token line { .* }
}
for $txt.lines -> $line {
## A single line of text....
say $line;
## Parse line of text to find match obj...
my $match = sample.parse($line);
say $match<line>;
}

Related

awk does not get multiple matches in a line with match

AWK has the match(s, r [, a]) function which, according to the manual, is capable of recording all occurring patterns into array "a":
...If array a is provided, a is cleared and then elements 1 through n are filled with the portions of s that match the corresponding parenthesized subexpression in r. The 0'th element of a contains the portion of s matched by the entire regular expression r. Subscripts a[n, "start"], and a[n, "length"] provide the starting index in the string and length respectively, of EACH matching substring.
I expect that the following line:
echo 123412341234 | awk '{match($0,"1",arr); print arr[0] arr[1] arr[2];}'
prints 111
But in fact "match" ignores all other matches except the first one.
Could someone please tell me the proper syntax to populate "arr" with all occurrences of "1"?
match only finds the first match and stops there. You will have to run match in a loop, or else split the input on anything that is not 1:
echo '123412341234' | awk -F '[^1]+' '{print $1 $2 $3}'
111
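The "run match in a loop" option mentioned above could look roughly like the sketch below. It sticks to POSIX awk (match, RSTART, RLENGTH, substr), so it should not need gawk. The shell function name all_matches and the awk variables rest and out are made up for illustration.

```shell
# all_matches STRING REGEX: hypothetical helper that prints every match of
# REGEX in STRING, concatenated, by calling match() repeatedly.
all_matches() {
  printf '%s\n' "$1" | awk -v re="$2" '
  {
    rest = $0                    # part of the line not yet searched
    out = ""
    while (match(rest, re)) {
      if (RLENGTH == 0) break    # guard: an empty match would loop forever
      out = out substr(rest, RSTART, RLENGTH)
      rest = substr(rest, RSTART + RLENGTH)   # continue after this match
    }
    print out
  }'
}

all_matches 123412341234 1       # prints 111
```

Each pass keeps only the text after the previous match, which is why match() finds the next occurrence instead of the first one again.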
Or using split in gnu-awk:
echo '123412341234' | awk 'split($0, a, /1/, m) {print m[1] m[2] m[3]}'
111
I would use GNU AWK's patsplit function for that task in the following way. Let the content of file.txt be
123412341234
then
awk '{patsplit($0,arr,"1");print arr[1] arr[2] arr[3]}' file.txt
gives output
111
Explanation: patsplit is a function that gives an effect similar to using the FPAT variable. It finds all matches of the 3rd argument in the string given as the 1st argument and puts them into the array given as the 2nd argument (clearing it first if it is not empty). Note that the 1st match goes under key 1, the 2nd under key 2, the 3rd under key 3 and so on (there is nothing under key 0).
(tested in GNU Awk 5.0.1)
If substitution is allowed, you can use gsub here. Try the following awk code:
awk '{gsub(/[^1]+/,"")} 1' Input_file
patsplit() is basically the same as wrapping each match of the desired regex in a custom pair of separators before splitting, which is what anysplit() emulates here, while being UTF-8 friendly.
echo "123\uC350abc:\uF8FF:|\U1F921#xyz" |
mawk2x '{ print ("\t\f"($0)"\n")>>(STDERR)
anysplit($_, reFLEX_UCode8 "|[[-_!-/3-?]",___=2,__)
OFS="\t"
for(_ in __) { if (!(_%___)) {
printf(" matched_items[ %2d ] = # %-2d = \42%s\42\n",
_,_/___,__[_])
} } } END { printf(ORS) }'
123썐abc::|🤡#xyz
matched_items[ 2 ] = # 1 = "3썐"
matched_items[ 4 ] = # 2 = "::"
matched_items[ 6 ] = # 3 = "🤡#"
Behind the scenes, anysplit() is not all that complicated either.
xs3pFS is a 3-byte string, \301\032\365, that I assumed would be extremely rare even in binary data.
gsub(patRE, xs3pFS ((pat=="&")?"\\":"") "&" xs3pFS,_)
gsub(xs3pFS "("xs3pFS")+", "",_)
return split(_, ar8, xs3pFS)
Splitting the input string this way puts all the desired items at even-numbered array indices and the rest of the string at odd-numbered indices.
This is somewhat similar to the seps array (the 4th argument) of gawk's split() and patsplit(), the difference being that the matches and the separators, whichever way you want to see them, end up in the same array.
When you print out every cell in the array, you'll see:
_SEPS_[ 1 ] = # 1 = "123"
matched_items[ 2 ] = # 1 = "썐"
_SEPS_[ 3 ] = # 2 = "abc"
matched_items[ 4 ] = # 2 = "::"
_SEPS_[ 5 ] = # 3 = "|"
matched_items[ 6 ] = # 3 = "🤡#"
_SEPS_[ 7 ] = # 4 = "xyz"
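The wrap-then-split idea described above can be tried in plain POSIX awk with a much simpler sketch. This is not the author's anysplit(): the function name wrap_split, the single-byte sentinel \001 (standing in for xs3pFS) and the space-separated output are all assumptions made for illustration, and the step that collapses adjacent sentinels is left out.

```shell
# wrap_split STRING REGEX: hypothetical helper. Wraps every match of REGEX
# in a sentinel byte, splits on that byte, and prints the even-numbered
# pieces (the matches), separated by spaces.
wrap_split() {
  printf '%s\n' "$1" | awk -v re="$2" '
  {
    sep = "\001"                  # assumed never to occur in the input
    s = $0
    gsub(re, sep "&" sep, s)      # "&" stands for the matched text
    n = split(s, parts, sep)      # odd indices: rest, even indices: matches
    for (i = 2; i <= n; i += 2)
      printf "%s%s", (i > 2 ? " " : ""), parts[i]
    print ""
  }'
}

wrap_split "a1b22c333" "[0-9]+"   # prints 1 22 333
```

Because every match is surrounded by the sentinel on both sides, the pieces alternate strictly between non-matching text and matches, even when the string starts with a match (the first piece is then empty).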

Is Perl 6's uncuddled else a special case for statement separation?

From the syntax doc:
A closing curly brace followed by a newline character implies a statement separator, which is why you don't need to write a semicolon after an if statement block.
if True {
say "Hello";
}
say "world";
That's fine and what was going on with Why is this Perl 6 feed operator a “bogus statement”?.
However, how does this rule work for an uncuddled else? Is this a special case?
if True {
say "Hello";
}
else {
say "Something else";
}
say "world";
Or, how about the with-orwith example:
my $s = "abc";
with $s.index("a") { say "Found a at $_" }
orwith $s.index("b") { say "Found b at $_" }
orwith $s.index("c") { say "Found c at $_" }
else { say "Didn't find a, b or c" }
The documentation you found was not completely correct. The documentation has been updated and is now correct. It now reads:
Complete statements ending in bare blocks can omit the trailing semicolon, if no additional statements on the same line follow the block's closing curly brace }.
...
For a series of blocks that are part of the same if/elsif/else (or similar) construct, the implied separator rule only applies at the end of the last block of that series.
Original answer:
Looking at the grammar for if in nqp and Rakudo, it seems that an if/elsif/else set of blocks gets parsed out together as one control statement.
Rule for if in nqp
rule statement_control:sym<if> {
<sym>\s
<xblock>
[ 'elsif'\s <xblock> ]*
[ 'else'\s <else=.pblock> ]?
}
(https://github.com/perl6/nqp/blob/master/src/NQP/Grammar.nqp#L243, as of August 5, 2017)
Rule for if in Rakudo
rule statement_control:sym<if> {
$<sym>=[if|with]<.kok> {}
<xblock(so ~$<sym>[0] ~~ /with/)>
[
[
| 'else'\h*'if' <.typed_panic: 'X::Syntax::Malformed::Elsif'>
| 'elif' { $/.typed_panic('X::Syntax::Malformed::Elsif', what => "elif") }
| $<sym>='elsif' <xblock>
| $<sym>='orwith' <xblock(1)>
]
]*
{}
[ 'else' <else=.pblock(so ~$<sym>[-1] ~~ /with/)> ]?
}
(https://github.com/rakudo/rakudo/blob/nom/src/Perl6/Grammar.nqp#L1450 as of August 5, 2017)

What does it mean when an awk script has code outside curly braces?

Going through an awk tutorial, I came across this line
substr($0,20,5) == "HELLO" {print}
which prints a line if there is a "HELLO" string starting at the 20th character.
I thought an awk script had to start with curly braces and use an 'if' for this to work, but it works without them.
Can someone explain how it is evaluated?
If you have:
{ action }
...then that action runs on every line. By contrast, if you have:
condition { action }
...then that action runs only against lines for which the condition is true.
Finally, if you have only a condition, then the default action is print:
NR % 2 == 0
...will thus print every other line.
You can similarly have multiple pairs in a single script:
condition1 { action1 }
condition2 { action2 }
{ unconditional_action }
...and can also have BEGIN and END blocks, which run at the start and end of execution.
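As a rough illustration of how these rule forms combine in one script, here is a small sketch. The function name demo_rules and the four input lines are made up for the example.

```shell
# demo_rules: hypothetical demo that feeds four lines to an awk script using
# BEGIN, a pattern without an action, a pattern with an action, and END.
demo_rules() {
  printf 'alpha\nbeta\ngamma\ndelta\n' | awk '
    BEGIN       { print "start" }           # runs once, before any input
    NR % 2 == 0                             # no action: the line is printed
    /^g/        { print "g-line: " $0 }     # runs only for lines starting with g
    END         { print "lines: " NR }      # runs once, after all input
  '
}

demo_rules
# prints:
# start
# beta
# g-line: gamma
# delta
# lines: 4
```

Every rule is tested against every line in order, so a single line can trigger several rules or none.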

How to remove space and the specific character in string

Below is the input.
!{ID=34, ID2=35}
>
!{ID=99, ID2=23}
>
!{ID=18, ID2=87}
<
I am trying to produce the following result. That is, I want to remove the spaces and the '{' and '}' characters, and check whether the next line is '>' or '<'.
The input above repeats. I also need to parse the '>' and '<' characters so that I can put the parsed string (YES or NO) into a database.
ID=34,ID=35#YES#NO
ID=99,ID=23#YES#NO
ID=18,ID=87#NO#YES
I thought I could replace the spaces with nothing using the sub function, but the result shows:
1#YES#NO
Can you let me know what is wrong?
If possible, please show me how to remove '{' and '}' as well.
I would appreciate an awk file version instead of a one-liner.
BEGIN {
VALUES = ""
L_EXIST = "NO"
R_EXIST = "NO"
}
/!/ { VALUES = gsub(" ", "", $0);
getline;
if ($1 == ">") L_EXIST = "YES";
else if ($1 == "<") R_EXIST = "YES";
print VALUES"#"L_EXIST"#"R_EXIST
}
END {
}
Given your sample input:
$ cat file
!{ID=34, ID2=35}
>
!{ID=99, ID2=23}
>
!{ID=18, ID2=87}
<
This script produces the desired output:
BEGIN { FS="[}{=, \n]+"; RS="!" }
NR > 1 { printf "ID=%d,ID=%d#%s\n", $3, $5, ($6==">"?"YES#NO":"NO#YES") }
The Field Separator is set to consume the spaces, newlines and other characters between the parts of the record that you're interested in (the newline has to be listed so that > or < ends up as a field of its own). The Record Separator is set to !, so that each pair of lines is treated as a single record.
The first record is empty (the start of the first line, up to the first !), so we only process the ones after that. The output is constructed using printf, with a ternary to determine the last part (I assume that there are only two options, > or <).
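As for what went wrong in the question's script: gsub modifies its target in place and returns the number of substitutions, so VALUES = gsub(...) stores the count (1) rather than the cleaned string. The YES/NO flags are also never reset between records. A minimal sketch of the asker's own approach with those two points fixed might look like this. The function name fix_ids and the variable names are illustrative, and the sketch keeps ID2 as it appears in the input instead of renaming it to ID.

```shell
# fix_ids: hypothetical wrapper around a corrected version of the script.
# It reads the "!{...}" / ">" or "<" line pairs from standard input.
fix_ids() {
  awk '
    /^!/ {
      values = $0
      gsub(/[!{} ]/, "", values)   # edits values in place; the return value is only a count
      left = "NO"; right = "NO"    # reset the flags for every record
      if ((getline marker) > 0) {  # read the following line into marker
        if (marker == ">") left = "YES"
        else if (marker == "<") right = "YES"
      }
      print values "#" left "#" right
    }
  '
}

printf '!{ID=34, ID2=35}\n>\n!{ID=99, ID2=23}\n>\n!{ID=18, ID2=87}\n<\n' | fix_ids
# prints:
# ID=34,ID2=35#YES#NO
# ID=99,ID2=23#YES#NO
# ID=18,ID2=87#NO#YES
```

The awk program between the single quotes can be saved to a file and run with awk -f if a script file is preferred over a one-liner.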
Let's say you have this input:
input.txt
!{ID=34, ID2=35}
!{ID=36, ID2=37}
>
You can use the following awk command
awk -F'[!{}, ]' 'NR>1{yn="NO";if($1==">")yn="YES";print l"#"yn}{l=$3","$5}' input.txt
to produce this output:
ID=34,ID2=35#NO
ID=36,ID2=37#YES

Lex : line with one character but spaces

I have lines like:
" a"
"a "
" a "
I would like to match all of these examples (with lex), but I don't know how to express the beginning of the line.
I'm not totally sure what exactly you're looking for, but the regex symbol to specify matching the beginning of a line in a lex definition is the caret:
^
If I understand correctly, you're trying to pull the "a" out as the token, but you don't want to grab any of the whitespace? If this is the case, then you just need something like the following:
[\n\t\r ]+ {
// do nothing
}
"a" {
assignYYText( yylval );
return aToken;
}