How to commit to an alternation branch in a Raku grammar token?

Suppose I have a grammar with the following tokens
token paragraph {
(
|| <header>
|| <regular>
)
\n
}
token header { ^^ '---' '+'**1..5 ' ' \N+ }
token regular { \N+ }
The problem is that a line starting with ---++Foo will be parsed as a regular paragraph because there is no space before "Foo". I'd like to fail the parse in this case, i.e. somehow "commit" to this branch of the alternation, e.g. after seeing --- I want to either parse the header successfully or fail the match completely.
How can I do this? The only way I see is to use a negative lookahead assertion before <regular> to check that it does not start with ---, but this looks rather ugly and impractical, considering that my actual grammar has many more than just these 2 branches. Is there some better way? Thanks in advance!

If I understood your question correctly, you could do something like this:
token header {
^^ '---' [
|| '+'**1..5 ' ' \N+
|| { die "match failed near position $/.pos()" }
]
}
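For readers more at home in Java, the same "commit once the prefix has been seen" idea can be sketched outside a grammar. This is only an illustration under assumptions: the names CommitSketch and classify are made up, and each line is classified on its own rather than as part of a full parse.

```java
import java.util.regex.Pattern;

final class CommitSketch {
    // What must follow "---" in a header: 1 to 5 '+', one space, then at least one non-newline character.
    private static final Pattern HEADER_REST = Pattern.compile("\\+{1,5} [^\\n]+");

    /** Classifies one line (without its trailing newline) as "header" or "regular". */
    static String classify(String line) {
        if (line.startsWith("---")) {
            // Committed: after the prefix there is no falling back to "regular".
            if (HEADER_REST.matcher(line.substring(3)).matches()) {
                return "header";
            }
            throw new IllegalStateException("match failed after '---' in: " + line);
        }
        if (line.isEmpty()) {
            throw new IllegalArgumentException("a regular paragraph needs at least one character");
        }
        return "regular";
    }

    public static void main(String[] args) {
        System.out.println(classify("---++ Foo"));  // prints "header"
        System.out.println(classify("plain text")); // prints "regular"
        classify("---++Foo");                       // throws IllegalStateException
    }
}
```

The point of the design is that the failure is raised inside the committed branch, so no other alternative ever gets a chance to accept the line.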


How to get the line of a token in parse rules?

I have searched everywhere and can't find a solution. I am new to ANTLR and for an assignment, I need to print out (using similar syntax that I have below) an error message when my parser comes across an unidentified token with the line number and token. The documentation for antlr4 says line is an attribute for Token objects that gives "[t]he line number on which the token occurs, counting from 1; translates to a call to getLine. Example: $ID.line."
I attempted to implement this in the following chunk of code:
not_valid : not_digit { System.out.println("Line " + $not_digit.line + " Contains Unrecognized Token " + $not_digit.text); };
not_digit : ~( DIGIT );
But I keep getting the error
unknown attribute line for rule not_digit in $not_digit.line
My first thought was that I was applying an attribute for a lexer token to a parser rule, because the documentation splits Token and Rule attributes into two different tables. So I changed the code to:
not_valid : NOT_DIGIT { System.out.println("Line " + $NOT_DIGIT.line + " Contains Unrecognized Token " + $NOT_DIGIT.text); };
NOT_DIGIT : ~ ( DIGIT ) ;
and also
not_valid : NOT_DIGIT { System.out.println("Line " + $NOT_DIGIT.line + " Contains Unrecognized Token " + $NOT_DIGIT.text); };
NOT_DIGIT : ~DIGIT ;
like the example in the documentation, but both times I got the error
rule reference DIGIT is not currently supported in a set
I'm not sure what I am missing. All my searches turn up how to do this in Java outside of the parser, but I need to work in the action code (I think that's what it is called) in the parser.
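A block like { ... } is called an action. You embed target-specific code in it, so if you're working with Java, you need to write Java between { and }. In your first attempt, not_digit is a parser rule, and a rule reference has no line attribute; it does have start and stop tokens, so the line is available through the start token. The second error appears because DIGIT inside a lexer rule is a rule reference, and ~ can only negate a literal set of characters such as [0-9].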
A quick demo:
grammar T;
parse
: not_valid EOF
;
not_valid
: r=not_digit
{
System.out.printf("line=%s, charPositionInLine=%s, text=%s\n",
$r.start.getLine(),
$r.start.getCharPositionInLine(),
$r.start.getText()
);
}
;
not_digit
: ~DIGIT
;
DIGIT
: [0-9]
;
OTHER
: ~[0-9]
;
Test it with the code:
String source = "a";
TLexer lexer = new TLexer(CharStreams.fromString(source));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.parse();
which will print:
line=1, charPositionInLine=0, text=a

warning use of undefined constant.. this will throw an error in future version of php

This should be enough for someone to correct my issue - I'm very much a newbie at this.
It's a short bit of code to strip spaces from the ends of strings submitted in forms.
The warning message is saying "Use of undefined constant mystriptag - assumed 'mystriptag' (this will throw an error..."
How should I change this?
function mystriptag($item)
{
$item = strip_tags($item);
}
array_walk($_POST, mystriptag);
function t_area($str){
$order = array("\r\n", "\n", "\r");
$replace = ', ';
$newstr = str_replace($order, $replace, $str);
return $newstr;
}
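You must put the function name in quotes: array_walk expects a callable, and an unquoted mystriptag is read as a constant, which is what the warning is complaining about.
So the correct line would be:
array_walk($_POST, 'mystriptag');
Note that mystriptag as written only changes its local copy of $item; declare the parameter as &$item if the stripped values should end up in $_POST.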

deserialization issue with char '\'

Does Json.NET have a built-in method that would escape special characters? The JSON strings I receive from vendors contain \ and double quotes (").
If not, what is the best way to escape the special characters before invoking JsonConvert.DeserializeObject(myjsonString)?
My sample json string
{
"EmailAddresses": [
{
"EmailAddress": "N\A"
}
]
}
Pasting this in json lint results in
Parse error on line 4:
... "EmailAddress": "N\A",
-----------------------^
Expecting 'STRING', 'NUMBER', 'NULL', 'TRUE', 'FALSE', '{', '['
VB.NET code
instanceofmytype = JsonConvert.DeserializeObject(Of myType)(myJsonString)
Exception: Newtonsoft.Json.JsonReaderException: Bad JSON escape sequence:
The JSON is not valid: a \ must be followed by one of the following: "\/bfnrtu. Since it's followed by A, Json.NET chokes (as it ought to). The source of your JSON should be fixed. If this is not an option, you can make a guess to fix it yourself, e.g.
myStr = Regex.Replace(myStr, "\\(?=[^""\\/bfnrtu])", "\\")
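As a rough Java analogue of the same guess, here is a minimal sketch; the class and method names (BackslashRepair, repair) are invented for illustration. It doubles every backslash that is not followed by a character JSON allows after a backslash. Like the one-liner above it is a heuristic: an already escaped backslash followed by an ordinary character (for example \\A) would be mangled, so fixing the source of the JSON remains the better option.

```java
import java.util.regex.Pattern;

final class BackslashRepair {
    // A backslash followed by a character that is NOT one of " \ / b f n r t u.
    private static final Pattern STRAY_BACKSLASH =
            Pattern.compile("\\\\(?=[^\"\\\\/bfnrtu])");

    /** Doubles stray backslashes so the text can be read by a JSON parser. */
    static String repair(String json) {
        // In a Java replacement string a literal backslash must itself be escaped,
        // so this source literal yields two backslashes in the output.
        return STRAY_BACKSLASH.matcher(json).replaceAll("\\\\\\\\");
    }

    public static void main(String[] args) {
        String broken = "{\"EmailAddress\": \"N\\A\"}";
        System.out.println(repair(broken)); // prints {"EmailAddress": "N\\A"}
    }
}
```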
You shouldn't have to worry about it. JSON.NET handles a lot of nice things for you. It should just work.
Have you tried it?

JavaCC: Defining a *password* token or grammar rule

I'm using JavaCC to simulate a small part of the SQL grammar, and I'm having a problem with defining a password.
I'm writing grammar rules for a
CREATE USER user_name IDENTIFIED BY a_password
statement, and I'm stuck. A password can be almost ANYTHING, such as asdkj*!##, !#%^%ASDjnkj, or _ASDJLJK##&. Note that in Oracle, it's totally legal to enter your password without single quote marks ('). I could solve this problem easily if the quote marks were compulsory, but unfortunately they're not.
I've tried many ways to define a token/grammar rule for this password, but it didn't work as I expected, the latest rule I've tried is:
TOKEN : {
< S_PASSWORD: ( < DIGIT > | < LETTER > |< S_PASSCHAR >)+ >
| <#S_PASSCHAR : "!"|"#"|"#"|"$"|"%"|"^"|"&"|"*" >
| <#LETTER: ["a"-"z", "A"-"Z", "_"]>
| <#DIGIT: ["0" - "9"]>
}
But since < S_PASSWORD > can match ANYTHING, any other token that I defined earlier will be matched by it, and I always get a JavaCC warning like this:
Warning: "#" cannot be matched as a string literal token at line 33515, column 13. It will be matched as < S_PASSWORD >.
There are similar suggestions from my friends, but they didn't work either.
Can someone help me with this?
Assuming there is some lexical way to tell where the password begins and ends, you can use lexical states. For example, if the sequence IDENTIFIED BY is only ever followed by spaces that are followed by a password, you make a state machine so that IDENTIFIED transitions from DEFAULT to S0. In S0 spaces are skipped and BY transitions to S1. In S1 spaces are skipped and a sequence of password characters is a PASSWORD token; the PASSWORD token transitions back to DEFAULT. Of course this only works if IDENTIFIED BY can only ever be followed by a password. Also, in S0 you need to be able to deal with all the normal stuff, so most of your token rules should apply in both states S0 and DEFAULT but transition to DEFAULT. See the FAQ for more on lexical states.
If BY is only ever followed by a password, then it is even easier, as you don't need S0.
Edit
Here are some example rules. If the keyword BY is only ever followed by a password, you only need two states
TOKEN : { <BY : "BY"> : S1 }
<S1> TOKEN : { <PASSWORD : ( <PASSWORDCHAR> )+ > : DEFAULT }
<DEFAULT, S1> SKIP : { " " } // Stays in the same state.
If you can use IDENTIFIED followed by BY then you need three states
<DEFAULT, S0> TOKEN : { <CREATE : "CREATE"> : DEFAULT } // And similar for most token rules.
<DEFAULT, S0> TOKEN : { <IDENTIFIED : "IDENTIFIED"> : S0 }
<S0> TOKEN : { <BY : "BY"> : S1 } // BYs that follow IDENTIFIED.
<DEFAULT> TOKEN : { <BY : "BY"> : DEFAULT } // BYs that don't follow IDENTIFIED.
<S1> TOKEN : { <PASSWORD : ( <PASSWORDCHAR> )+ > : DEFAULT }
<DEFAULT, S0, S1> SKIP : { " " } // Stays in the same state.
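To see how the three states interact, here is a hand-written Java sketch of the state machine described above. It is not JavaCC output; the names (LexicalStateSketch, tokenize, the IDENT label) are made up, and it assumes that tokens are separated by spaces and that any run of non-space characters counts as a password.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LexicalStateSketch {
    enum State { DEFAULT, S0, S1 }

    // Hypothetical keyword list; a real grammar would have many more.
    private static final List<String> KEYWORDS = Arrays.asList("CREATE", "USER", "BY");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        State state = State.DEFAULT;
        // Spaces are skipped in every state, so splitting on them is enough here.
        for (String lexeme : input.trim().split(" +")) {
            if (lexeme.isEmpty()) {
                continue;
            }
            if (state == State.S1) {
                // In S1 whatever comes next is the password; then return to DEFAULT.
                tokens.add("PASSWORD(" + lexeme + ")");
                state = State.DEFAULT;
            } else if (state == State.S0 && lexeme.equals("BY")) {
                tokens.add("BY");
                state = State.S1;
            } else if (lexeme.equals("IDENTIFIED")) {
                tokens.add("IDENTIFIED");
                state = State.S0;
            } else {
                // Ordinary tokens are accepted in DEFAULT and S0 and lead back to DEFAULT.
                tokens.add(KEYWORDS.contains(lexeme) ? lexeme : "IDENT(" + lexeme + ")");
                state = State.DEFAULT;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("CREATE USER scott IDENTIFIED BY !#%^%ASDjnkj"));
        System.out.println(tokenize("ORDER BY name"));
    }
}
```

A BY that does not follow IDENTIFIED leaves the machine in DEFAULT, so the word after it is lexed normally rather than swallowed as a password.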

Antlr syntactic predicate no matching

I have the following grammar:
rule : (PATH)=> (PATH) SLASH WORD
{System.out.println("file: " + $WORD.text + " path: " + $PATH.text);};
WORD : ('a'..'z')+;
SLASH : '/';
PATH : (WORD SLASH)* WORD;
but it does not work for a string like "a/b/c/filename".
I thought I could solve this "path"-problem with the syntactic predicate feature. Maybe I am doing something wrong here and I have to redefine the grammar. Any suggestion for this problem?
You must understand that a syntactic predicate will not cause the parser to give the lexer some sort of direction w.r.t. what token the parser would "like" to retrieve. A syntactic predicate is used to force the parser to look ahead in an existing token stream to resolve ambiguities (emphasis on 'existing': the parser has no control over which tokens are created!).
The lexer operates independently from the parser, creating tokens in a systematic way:
it tries to match as many characters as possible;
whenever 2 (or more) rules match the same amount of characters, the rule defined first will get precedence over the rule(s) defined later.
So in your case, given the input "a/b/c/filename", the lexer will greedily match the entire input as a single PATH token.
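The two rules above can be tried out with a toy longest-match tokenizer in plain Java. This is a sketch of the behaviour as described here, not of ANTLR's actual implementation; the class name MaximalMunchSketch is invented, and the three patterns are regex translations of WORD, SLASH and PATH from the grammar.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MaximalMunchSketch {
    // Rules kept in definition order, mirroring WORD, SLASH, PATH.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("WORD", Pattern.compile("[a-z]+"));
        RULES.put("SLASH", Pattern.compile("/"));
        RULES.put("PATH", Pattern.compile("([a-z]+/)*[a-z]+"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strictly greater: on a tie the rule defined first keeps the token.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestName = rule.getKey();
                    bestLen = m.end() - pos;
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no rule matches at position " + pos);
            }
            tokens.add(bestName + "(" + input.substring(pos, pos + bestLen) + ")");
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a/b/c/filename")); // prints [PATH(a/b/c/filename)]
        System.out.println(tokenize("abc"));            // prints [WORD(abc)]
    }
}
```

The parser never sees separate WORD and SLASH tokens for this input, which is why the predicate in the question has nothing to choose between.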
If you want to get the file name, either retrieve it from the PATH:
rule : PATH
{
String file = $PATH.text.substring($PATH.text.lastIndexOf('/') + 1);
System.out.println("file: " + file + ", path: " + $PATH.text);
}
;
WORD : ('a'..'z')+;
SLASH : '/';
PATH : (WORD SLASH)* WORD;
or create a parser rule that matches a path:
rule : dir WORD
{
System.out.println("file: " + $WORD.text + ", dir: " + $dir.text);
}
;
dir : (WORD SLASH)+;
WORD : ('a'..'z')+;
SLASH : '/';