Raku grammar problem when trying to define a grammar for Markdown

I'm using the following code to parse a subset of Markdown text:
#!/usr/bin/perl6
use v6;
my @tests = '', #should print OK
"\n\n\n", #should print OK
'alpha', #should print OK
'1234', #should print OK
'alpha123', #should print OK
'a simple test without a heading and whitespaces and 123 numbers', #should print OK
'a simple test without a heading with punctuation, whitespaces and 123 numbers.', #should print OK
'a simple test without a heading with punctuation, whitespaces, 123 numbers, quotes\' ", braces (){}<>[], \/.', #should print OK
'a simple test with a # in the middle', #should print OK
"#heading", #should print FAILED
"#heading ", #should print FAILED
"#heading\n", #should print OK
"#heading \n", #should print OK
"#heading\nand a text", #should print OK
"#heading\nand\na\ntext\nwith\nnewlines", #should print OK
"#heading1\nand text\n#heading2\nand text" #should print OK
;
grammar markdown_ltd {
    rule TOP {
        [ <text> | <headingandtext>+ ]
    }
    rule text {
        <textline>*<lasttextline>?
    }
    rule textline {
        <textlinecontent>? [\n]
    }
    rule lasttextline {
        <textlinecontent>
    }
    rule textlinecontent {
        <[\N]-[\#]> [\N]*
    }
    rule headingandtext {
        <heading> [\n] <text>
    }
    rule heading {
        [\#] [\N]+
    }
}
for @tests -> $test {
    my $result = markdown_ltd.parse($test);
    say "\nTesting '" ~ $test ~ "' " ~ ($result ?? 'OK' !! 'FAILED');
}
The purpose of this grammar is to detect headings and regular text.
However, there is an error somewhere in the code, so the included tests do not behave as expected. When I run the script, I get the following output:
Testing '' OK
Testing '
' OK
Testing 'alpha' FAILED
Testing '1234' FAILED
Testing 'alpha123' FAILED
Testing 'a simple test without a heading and whitespaces and 123 numbers' OK
Testing 'a simple test without a heading with punctuation, whitespaces and 123 numbers.' OK
Testing 'a simple test without a heading with punctuation, whitespaces, 123 numbers, quotes' ", braces (){}<>[], \/.' OK
Testing 'a simple test with a # in the middle' OK
Testing '#heading' FAILED
Testing '#heading ' FAILED
Testing '#heading
' FAILED
Testing '#heading
' FAILED
Testing '#heading
and a text' FAILED
Testing '#heading
and
a
text
with
newlines' FAILED
Testing '#heading1
and text
#heading2
and text' FAILED
Which differs from the output I expected.

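Thanks @donaldh for pointing me in the right direction!
Changing all occurrences of rule to token fixed the issue. A rule enables :sigspace, so whitespace written in the pattern becomes an implicit <.ws> match; a token does not, which suits a grammar where newlines and spaces have to be matched explicitly.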

Related

Add #y to the end of the string if it contains x

I have a file with the following contents:
# cat.txt
AB-1 text: foo
AB-1 test3: test cat dog
AB-1 test4: abc
# cat2.txt
AB-4 test: qwerty
AB-5 test2: Foo bar
AB-6 abc: Dog
and try to get the following from it, if the string contains x then add #y to its end:
# cat.txt
AB-1 text: foo#foo #text
AB-1 test3: test cat dog#animal
AB-1 test4: abc
# cat2.txt
AB-4 test: qwerty
AB-5 test2: Foo bar#foo
AB-6 abc: Dog#animal
I ran into the following problems:
ignore strings starting with #
two keywords in one line is possible
case sensitive
I solved the last two points, though I am not sure my approach is logical and proper. Anyway:
awk '{print $0 (tolower($0) ~ /foo/ ? "#foo " : "" ) (tolower($0) ~ /cat|dog/ ? "#animal " : "") (tolower($0) ~ /text/ ? "#text " : "")}' ./file.txt
At this point I get the following result:
# cat.txt#animal
AB-1 text: foo#foo #text
AB-1 test3: test cat dog#animal
AB-1 test4: abc
# cat2.txt#animal
AB-4 test: qwerty
AB-5 test2: Foo bar#foo
AB-6 abc: Dog#animal
The result is pretty close to what is needed, but the following points still cause concern:
As you can see, I am using tolower($0) ~ to make the match case insensitive. I use macOS and have not succeeded with the -v IGNORECASE=1 flag or BEGIN{IGNORECASE = 1}; for some reason neither works in my setup. Is it possible to improve this?
Is there any way to ignore lines that start with #?
You may use this awk:
awk '/^#/ {   # if line starts with #
    print     # print it
    next      # skip to next line
}
{
    lcr = tolower($0)
    print $0 (lcr ~ /foo/ ? "#foo " : "" ) \
        (lcr ~ /cat|dog/ ? "#animal " : "") (lcr ~ /text/ ? "#text " : "")
}' file
# cat.txt
AB-1 text: foo#foo #text
AB-1 test3: test cat dog#animal
AB-1 test4: abc
# cat2.txt
AB-4 test: qwerty
AB-5 test2: Foo bar#foo
AB-6 abc: Dog#animal
Regarding this part of the question:
"I use macOS and have not succeeded using -v IGNORECASE=1 flag or BEGIN{IGNORECASE = 1}, for some reason it also does not work in my concept. Is it possible to improve this?"
IGNORECASE is a GNU AWK specific feature, so if it does not work on your machine you are not using GNU AWK. Use awk --version to check which version of AWK you are actually using. macappstore.org suggests that gawk exists for macOS, but I am not able to test it.
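For comparison, the same tagging logic can be written without relying on any awk-specific case handling. This is only a minimal Python sketch, assuming the keyword-to-tag table below mirrors the three patterns from the awk answers above; the names TAGS and tag_line are hypothetical and not part of the original scripts.

```python
import re

# Keyword patterns and the tag each one appends.
# Assumption: these mirror the three patterns used in the awk answers.
TAGS = [
    (re.compile(r"foo", re.IGNORECASE), "#foo "),
    (re.compile(r"cat|dog", re.IGNORECASE), "#animal "),
    (re.compile(r"text", re.IGNORECASE), "#text "),
]


def tag_line(line):
    """Return the line with matching tags appended.

    Lines starting with '#' are returned unchanged.
    """
    if line.startswith("#"):
        return line
    return line + "".join(tag for pattern, tag in TAGS if pattern.search(line))


if __name__ == "__main__":
    sample = [
        "# cat.txt",
        "AB-1 text: foo",
        "AB-1 test3: test cat dog",
        "AB-1 test4: abc",
        "AB-6 abc: Dog",
    ]
    for line in sample:
        print(tag_line(line))
```

Like the awk version, each tag keeps its trailing space, so a line with two keywords ends in "#foo #text ".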

awk script is not running the middle block

the following script will only run the BEGIN and END blocks:
#!/bin/awk -f
BEGIN {print "Hello, World!"}
{ print "Don't Panic" }
END { print "and we're panicking... I told you not to panic. Did you miss that part?" }
and the output is:
$ awk -f joint.awk .
Hello, World!
and we're panicking... I told you not to panic. Did you miss that part?
the expected output is:
$ awk -f joint.awk .
Hello, World!
Don't Panic
and we're panicking... I told you not to panic. Did you miss that part?
what's odd is that when I change the middle block to print $1, instead of printing a piece of text, it runs as expected when I pass a file in.
The middle block, which has no pattern, runs once per line of input on stdin (or in your input file, if one is explicitly named). In the question the only input named is ., a directory, which presumably supplies no lines at all.
Thus, how many times Don't Panic gets printed depends on how much input there is.
This can be tested with the following code:
awkScript=$(cat <<'EOF'
BEGIN {print "Hello, World!"}
{ print "Don't Panic" }
END { print "and we're panicking... I told you not to panic. Did you miss that part?" }
EOF
)
echo "Testing with no input:"
awk "$awkScript" </dev/null
echo
echo "Testing with one line of input:"
awk "$awkScript" <<<"One line of input"
echo
echo "Testing with two lines of input:"
awk "$awkScript" <<<$'First line\nSecond line'
...which emits as output:
Testing with no input:
Hello, World!
and we're panicking... I told you not to panic. Did you miss that part?
Testing with one line of input:
Hello, World!
Don't Panic
and we're panicking... I told you not to panic. Did you miss that part?
Testing with two lines of input:
Hello, World!
Don't Panic
Don't Panic
and we're panicking... I told you not to panic. Did you miss that part?
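The control flow that awk applies implicitly can be spelled out as an explicit loop. The following is only a rough Python model of that behaviour, not how awk is implemented; the function name run_awk_like is hypothetical and the END message is shortened.

```python
import io


def run_awk_like(stream):
    """Mimic awk's BEGIN / main / END flow and return the printed lines."""
    out = ["Hello, World!"]                # BEGIN: runs once, before any input
    for _record in stream:                 # main block: runs once per input line
        out.append("Don't Panic")
    out.append("and we're panicking...")   # END: runs once, after input ends
    return out


if __name__ == "__main__":
    for text in ("", "One line of input\n", "First line\nSecond line\n"):
        print(run_awk_like(io.StringIO(text)))
```

With empty input the loop body never executes, which matches the output seen in the question.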

Using awk to process html-related Gift-format Moodle questions

This is basically an awk question, but it is about processing data for the Moodle Gift format, thus the tags.
I want to show HTML code in a question (Moodle "test" activity), but I need to replace < and > with the corresponding entities, as otherwise they will be interpreted as "real" HTML and not printed.
However, I want to be able to type the question with regular code and post-process the file before importing it as Gift into Moodle.
I thought awk would be the perfect tool to do this.
Say I have this (invalid as such) Moodle/gift question:
::q1::[html]This is a question about HTML:
<pre>
<p>some text</p>
</pre>
and some tag:<code><img></code>
{T}
What I want is a script that translates this into a valid gift question:
::q1::[html]This is a question about HTML:
<pre>
&lt;p&gt;some text&lt;/p&gt;
</pre>
and some tag:<code>&lt;img&gt;</code>
{T}
Key point: replace < and > with &lt; and &gt; when:
inside a <pre>-</pre> block (assuming those tags are alone on a line)
between <code> and </code>, with an arbitrary string in between.
For the first part, I'm fine. I have a shell script calling awk (gawk, actually).
awk -f process_src2gift.awk $1.src >$1.gift
with process_src2gift.awk:
BEGIN { print "// THIS IS A GENERATED FILE !" }
{
    if( $1=="<pre>" ) # opening a "code" block
    {
        code=1;
        print $0;
    }
    else
    {
        if( $1=="</pre>" ) # closing a "code" block
        {
            code=0;
            print $0;
        }
        else
        { # if "code block", replace < > by html entities
            if( code==1 )
            {
                gsub(">","\\&gt;");
                gsub("<","\\&lt;");
            }
            print $0;
        }
    }
}
END { print "// END" }
However, I'm stuck with the second requirement..
Questions:
Is it possible to add code to my awk script to process the HTML inside the <code> tags? Any ideas? I thought about using sed but I didn't see how to do that.
Maybe awk isn't the right tool for that? I'm open to suggestions for other (standard Linux) tools.
Answering own question.
I found a solution by doing a two step awk process:
first step as described in question
second step by defining <code> or </code> as field delimiter, using a regex, and process the string replacement on second argument ($2).
The shell file becomes:
echo "Step 1"
awk -f process_src2gift.awk $1.src >$1.tmp
echo "Step 2"
awk -f process_src2gift_2.awk $1.tmp >$1.gift
rm $1.tmp
And the second awk file (process_src2gift_2.awk) will be:
BEGIN { FS="[<][/]?[c][o][d][e][>]"; }
{
    gsub(">","\\&gt;",$2);
    gsub("<","\\&lt;",$2);
    if( NF >= 3 )
        print $1 "<code>" $2 "</code>" $3
    else
        print $0
}
Of course, there are limitations:
no attributes in the <code> tag
only one pair <code></code> in the line
probably others...
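If a tool other than awk is acceptable, both steps can be done in one pass, and the "only one pair per line" limitation goes away. The following is a minimal Python sketch under the same assumptions as the question (<pre> and </pre> alone on a line, no attributes on <code>); the names escape_markup and escape_gift are hypothetical.

```python
import re

# Non-greedy match so that several <code>...</code> pairs on one line
# are handled separately. Assumption: <code> carries no attributes.
CODE_RE = re.compile(r"<code>(.*?)</code>")


def escape_markup(text):
    """Replace < and > with their HTML entities."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


def escape_gift(lines):
    """Escape markup inside <pre> blocks and <code>...</code> spans."""
    out = []
    in_pre = False
    for line in lines:
        stripped = line.strip()
        if stripped == "<pre>":        # opening a "code" block
            in_pre = True
            out.append(line)
        elif stripped == "</pre>":     # closing a "code" block
            in_pre = False
            out.append(line)
        elif in_pre:                   # inside the block: escape everything
            out.append(escape_markup(line))
        else:                          # elsewhere: escape only inside <code>
            out.append(CODE_RE.sub(
                lambda m: "<code>" + escape_markup(m.group(1)) + "</code>",
                line))
    return out


if __name__ == "__main__":
    source = [
        "::q1::[html]This is a question about HTML:",
        "<pre>",
        "<p>some text</p>",
        "</pre>",
        "and some tag:<code><img></code>",
        "{T}",
    ]
    print("\n".join(escape_gift(source)))
```

The sketch leaves every other Gift special character untouched, exactly as the awk scripts do.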

Awk iterating without a loop construct

I was reading a tutorial on awk scripting and observed this strange behaviour: why does this awk script, when executed, ask for a number repeatedly even without a loop construct like while or for? If we enter CTRL+D (EOF) it stops prompting for another number.
#!/bin/awk -f
BEGIN {
print "type a number";
}
{
print "The square of ", $1, " is ", $1*$1;
print "type another number";
}
END {
print "Done"
}
Please explain this behaviour of the above awk script
awk keeps processing lines until end of file is reached. Since in this case the input (STDIN) does not end as long as you keep entering numbers or hitting Enter, the main block keeps running, which looks like an endless loop.
When you hit CTRL+D you signal to the awk script that EOF has been reached, thereby exiting the loop.
Try this and enter 0 to exit:
BEGIN {
print "type a number";
}
{
if($1==0)
exit;
print "The square of ", $1, " is ", $1*$1;
print "type another number";
}
END {
print "Done"
}
From the famous The AWK Programming Language:
If you don't provide a input file to the awk script on the command line, awk will apply the program to whatever you type next on your terminal until you type an end-of-file signal (control-d on Unix systems).

Break down JSON string in simple perl or simple unix?

OK, so I have this:
{"status":0,"id":"7aceb216d02ecdca7ceffadcadea8950-1","hypotheses":[{"utterance":"hello how are you","confidence":0.96311796}]}
and at the moment I'm using this shell command to decode it and get the string I need:
echo $x | grep -Po '"utterance":.*?[^\\]"' | sed -e s/://g -e s/utterance//g -e 's/"//g'
but this only works when grep is compiled with Perl regex (-P) support. Since the script I use to get that JSON string is written in Perl, is there any way to do this same decoding in a simple Perl script or a simpler Unix command, or better yet, C or Objective-C?
The script I'm using to get the JSON is here, http://pastebin.com/jBGzJbMk and if you want a file to use, download http://trevorrudolph.com/a.flac
How about:
perl -MJSON -nE 'say decode_json($_)->{hypotheses}[0]{utterance}'
in script form:
use JSON;
while (<>) {
print decode_json($_)->{hypotheses}[0]{utterance}, "\n"
}
Well, I'm not sure if I can deduce what you are after correctly, but this is a way to decode that JSON string in perl.
Of course, you'll need to know the data structure in order to get the data you need. The line that prints the "utterance" string is commented out in the code below.
use strict;
use warnings;
use Data::Dumper;
use JSON;
my $json = decode_json
q#{"status":0,"id":"7aceb216d02ecdca7ceffadcadea8950-1","hypotheses":[{"utterance":"hello how are you","confidence":0.96311796}]}#;
#print $json->{'hypotheses'}[0]{'utterance'};
print Dumper $json;
Output:
$VAR1 = {
'status' => 0,
'hypotheses' => [
{
'utterance' => 'hello how are you',
'confidence' => '0.96311796'
}
],
'id' => '7aceb216d02ecdca7ceffadcadea8950-1'
};
Quick hack:
while (<>) {
say for /"utterance":"?(.*?)(?<!\\)"/;
}
Or as a one-liner:
perl -lnwe 'print for /"utterance":"(.+?)(?<!\\)"/g' inputfile.txt
The one-liner is troublesome if you happen to be using Windows, since " is interpreted by the shell.
Quick hack#2:
This will hopefully go through any hash structure and find keys.
my $json = decode_json $str;
say find_key($json, 'utterance');
sub find_key {
    my ($ref, $find) = @_;
    if (ref $ref) {
        if (ref $ref eq 'HASH' and defined $ref->{$find}) {
            return $ref->{$find};
        } else {
            # "values $ref" (autoderef) was removed in Perl 5.24, so dereference explicitly
            for (ref $ref eq 'HASH' ? values %$ref : ref $ref eq 'ARRAY' ? @$ref : ()) {
                my $found = find_key($_, $find);
                if (defined $found) {
                    return $found;
                }
            }
        }
    }
    return;
}
Based on the naming, it's possible to have multiple hypotheses. This prints the utterance of each hypothesis:
echo '{"status":0,"id":"7aceb216d02ecdca7ceffadcadea8950-1","hypotheses":[{"utterance":"hello how are you","confidence":0.96311796}]}' | \
perl -MJSON::XS -n000E'
say $_->{utterance}
for @{ JSON::XS->new->decode($_)->{hypotheses} }'
Or as a script:
use feature qw( say );
use JSON::XS;
my $json = '{"status":0,"id":"7aceb216d02ecdca7ceffadcadea8950-1","hypotheses":[{"utterance":"hello how are you","confidence":0.96311796}]}';
say $_->{utterance}
for @{ JSON::XS->new->decode($json)->{hypotheses} };
If you don't want to use any modules from CPAN and try a regex instead there are multiple variants you can try:
# JSON is on a single line:
$json = '{"other":"stuff","hypo":[{"utterance":"hi, this is \"bob\"","moo":0}]}';
# RegEx with negative look behind:
# Match everything up to a double quote without a Backslash in front of it
print "$1\n" if ($json =~ m/"utterance":"(.*?)(?<!\\)"/)
This regex works if there is only one utterance. It doesn't matter what else is in the string around it, since it only searches for the double quoted string following the utterance key.
For a more robust version you could add whitespace where necessary/possible and make the . in the RegEx match newlines: m/"utterance"\s*:\s*"(.*?)(?<!\\)"/s
If you have multiple entries for the utterance/confidence hash (object), inconsistent letter case, or weird formatting of the JSON string, try this:
# weird JSON:
$json = <<'EOJSON';
{
"status":0,
"id":"an ID",
"hypotheses":[
{
"UtTeraNcE":"hello my name is \"Bob\".",
"confidence":0.0
},
{
'utterance' : 'how are you?',
"confidence":0.1
},
{
"utterance"
: "
thought
so!
",
"confidence" : 0.9
}
]
}
EOJSON
# RegEx with alternatives:
print "$1\n" while ( $json =~ m/["']utterance["']\s*:\s*["'](([^\\"']|\\.)*)["']/gis);
The main part of this RegEx is "(([^\\"]|\\.)*)". Description in detail as extended regex:
/
["'] # opening quotes
( # start capturing parentheses for $1
( # start of grouping alternatives
[^\\"'] # anything that's not a backslash or a quote
| # or
\\. # a backslash followed by anything
) # end of grouping
* # in any quantity
) # end capturing parentheses
["'] # closing quotes
/xgs
If you have many data sets and speed is a concern you can add the o modifier to the regex and use character classes instead of the i modifier. You can suppress the capturing of the alternatives to $2 with clustering parenthesis (?:pattern). Then you get this final result:
m/["'][uU][tT][tT][eE][rR][aA][nN][cC][eE]["']\s*:\s*["']((?:[^\\"']|\\.)*)["']/gos
Yes, sometimes perl looks like a big explosion in a bracket factory ;-)
Just stumbled upon another nice method of doing this. I finally found out how to access the Mac OS X JavaScript engine from the command line. Here's the script:
alias jsc='/System/Library/Frameworks/JavaScriptCore.framework/Versions/A/Resources/jsc'
x='{"status":0,"id":"7aceb216d02ecdca7ceffadcadea8950-1","hypotheses":[{"utterance":"hello how are you","confidence":0.96311796}]}'
jsc -e "print(${x}['hypotheses'][0]['utterance'])"
Ugh, yes, I came up with another answer. I'm studying Python, and its literal syntax for lists and dicts is the same format as this JSON, so I just made this one-liner for when your variable is x:
python -c "print(${x}['hypotheses'][0]['utterance'])"
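Pasting the JSON text into Python source only works while the JSON uses literals that Python shares (it breaks on true, false or null), and it executes whatever the string contains. A safer variant, sketched here with the standard json module, parses the text instead; the function name first_utterance is hypothetical.

```python
import json


def first_utterance(raw):
    """Return the utterance of the first hypothesis in the JSON text."""
    data = json.loads(raw)
    return data["hypotheses"][0]["utterance"]


if __name__ == "__main__":
    x = ('{"status":0,"id":"7aceb216d02ecdca7ceffadcadea8950-1",'
         '"hypotheses":[{"utterance":"hello how are you","confidence":0.96311796}]}')
    print(first_utterance(x))
```

From the shell, the same function could be fed from a pipe by reading sys.stdin instead of a hard-coded string.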
Figured it out for Unix, but would love to see your Perl, C and Objective-C answers...
echo $X | sed -e 's/.*utterance//' -e 's/confidence.*//' -e s/://g -e 's/"//g' -e 's/,//g'
:D
shorter copy of the same sed:
echo $X | sed -e 's/.*utterance//;s/confidence.*//;s/://g;s/"//g;s/,//g'