JavaCC grammar - parse everything to the end of the file

I am using FeatureBNF (and so in essence I am using JavaCC) to try and write a grammar that will produce a (very) simple parser to parse Gherkin files.
An example Gherkin file:
Feature: Calculator
In order to avoid silly mistakes
As a math idiot
I want to be told the sum of two numbers
Scenario: Add two numbers
Given I have entered 50 into the calculator
And I have also entered 70 into the calculator
When I press add
Then the result should be 120 on the screen
All I want to do, to begin with, is parse this into a Feature that has the name Calculator and a Body, which is the entirety of the rest of the file.
However, I have struggled with the part where I'm trying to read the rest of the file into the Body. I think this is partly because there are no 'natural' delimiters marking where one section ends; the end is denoted only by a newline.
Trying the following grammar:
<DEFAULT> TOKEN :
{
<FEATURE: "Feature: " >
| <#LETTER: ["\u0027","\u0041"-"\u005a","\u005f","\u0061"-"\u007a"] >
| <FEATURE_NAME: (<LETTER>)+ >
| <NEWLINE: ("\r\n" | "\n\r" | "\r" | "\n") >
| <TEXT : ~[] >
}
GRAMMARSTART
Feature :
<FEATURE> FeatureName <NEWLINE>
Body
<EOF>
;
FeatureName: <FEATURE_NAME>;
Body: (<TEXT>)*;
I get the error:
[java] java.lang.reflect.InvocationTargetException
... lots of stack trace removed...
[java] Caused by: cide.gparser.ParseException: Encountered "\r\n" (5) at line 2, column 1.
[java] Was expecting one of:
[java] <EOF>
[java] <TEXT> ...
I have been able to achieve what I want by adding some delimiters in to the Gherkin file and using lexical states, like so:
Feature: Calculator #TITLEEND
#BODYSTART
In order to avoid silly mistakes
As a math idiot
I want to be told the sum of two numbers
Scenario: Add two numbers
Given I have entered 50 into the calculator
And I have also entered 70 into the calculator
When I press add
Then the result should be 120 on the screen
#BODYEND
With the following relevant parts of the grammar:
<DEFAULT, IN_BODY> SPECIAL_TOKEN : {
" " | "\t" | "\n" | "\r" | "\f"
}
<DEFAULT> TOKEN : {
<FEATURE: "Feature: " >
| <#LETTER: ["\u0027", "\u0041"-"\u005a", "\u005f", "\u0061"-"\u007a"] >
| <FEATURE_NAME: (<LETTER>)+ >
| <ENDFEATURETITLE: "#TITLEEND" >
}
<DEFAULT> TOKEN : { <BODYSTART : "#BODYSTART"> : IN_BODY }
<IN_BODY> TOKEN : { <TEXT : ~[] > }
<IN_BODY> TOKEN : { <BODYEND : "#BODYEND"> : DEFAULT }
GRAMMARSTART
Feature:
<FEATURE> FeatureName <ENDFEATURETITLE>
Body
<EOF>;
FeatureName: <FEATURE_NAME>;
Body: <BODYSTART> Text <BODYEND>;
Text: (<TEXT>)*;
But I am sure I must be missing something and would like to be able to achieve this without having to annotate the feature files. What is a better way to do this?
SIDE NOTE
FeatureBNF builds on top of JavaCC and outputs a grammar file for JavaCC to process. I am completely new to both FeatureBNF and JavaCC, but they seem similar enough that I hope this question might be applicable to JavaCC gurus. (FeatureBNF uses JavaCC syntax for the lexical specifications and then its own format for the grammar's production rules.)

Based on your grammar, you can switch states after the first newline, so the following lexical grammar will suffice:
<DEFAULT> TOKEN : {
<FEATURE: "Feature: " >
| <#LETTER: ["\u0027", "\u0041"-"\u005a", "\u005f", "\u0061"-"\u007a"] >
| <FEATURE_NAME: (<LETTER>)+ >
| <ENDFEATURETITLE: "#TITLEEND" >
| <NEWLINE: ("\r\n" | "\n\r" | "\r" | "\n") > : IN_BODY
}
<IN_BODY> TOKEN : { <TEXT : ~[] > }
Now the syntactic grammar is
Feature:
<FEATURE> FeatureName <NEWLINE>
Body
<EOF>;
FeatureName: <FEATURE_NAME>;
Body: (<TEXT>)*;
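To make the state switch concrete, here is a minimal plain-Java sketch of the same idea: match the header line, and treat everything after the first line terminator as the body. It is only an illustration, not what JavaCC or FeatureBNF generates; the class name FeatureSplit and the method parse are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FeatureSplit {
    // Mirrors the grammar: "Feature: " FEATURE_NAME NEWLINE, then arbitrary text up to EOF.
    // The character class corresponds to the #LETTER token (apostrophe, A-Z, underscore, a-z).
    private static final Pattern FEATURE = Pattern.compile(
            "Feature: ([A-Za-z_']+)(?:\\r\\n|\\n\\r|\\r|\\n)(.*)", Pattern.DOTALL);

    /** Returns {name, body}; throws if the input does not start with a feature header line. */
    public static String[] parse(String source) {
        Matcher m = FEATURE.matcher(source);
        if (!m.matches()) {
            throw new IllegalArgumentException("Expected 'Feature: <name>' followed by a newline");
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("Feature: Calculator\nIn order to avoid silly mistakes\nAs a math idiot\n");
        System.out.println("name = " + parts[0]);
        System.out.println("body = " + parts[1]);
    }
}
```

As in the IN_BODY state, nothing after the first newline is interpreted: newlines and keywords such as Scenario: simply become part of the body.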

Use map object in karate framework

I am trying to create a scenario where:
Scenario Outline: Create a request
Given print 'reason=<reason>, detail=<detail>, metainfo=<metainfo>'
When call create_request
Then match response.message == "#notnull"
* call json_to_proto request
* print 'response \n', response
Examples:
| reason | detail  | metainfo     |
| test   | Testing | { foo: bar } |
My concern is that metainfo is defined as a map, "metainfo": "#(karate.get('metainfo', {}))". How do I set values for it? The current logic gives me the error: org.graalvm.polyglot.PolyglotException: Expect a map object but found...
Please read this section of the docs: https://github.com/karatelabs/karate#scenario-outline-enhancements
You can use JSON like this. And note that you don't need the <foo> placeholder system. Normal variables work:
Scenario Outline: ${payload.foo}
* match payload == { foo: 'bar' }
Examples:
| payload! |
| { foo: 'bar' } |

How to commit to an alternation branch in a Raku grammar token?

Suppose I have a grammar with the following tokens
token paragraph {
(
|| <header>
|| <regular>
)
\n
}
token header { ^^ '---' '+'**1..5 ' ' \N+ }
token regular { \N+ }
The problem is that a line starting with ---++Foo will be parsed as a regular paragraph because there is no space before "Foo". I'd like to fail the parse in this case, i.e. somehow "commit" to this branch of the alternation, e.g. after seeing --- I want to either parse the header successfully or fail the match completely.
How can I do this? The only way I see is to use a negative lookahead assertion before <regular> to check that it does not start with ---, but this looks rather ugly and impractical, considering that my actual grammar has many more than just these 2 branches. Is there some better way? Thanks in advance!
If I understood your question correctly, you could do something like this:
token header {
^^ '---' [
|| '+'**1..5 ' ' \N+
|| { die "match failed near position $/.pos()" }
]
}
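The same "commit" idea, sketched in plain Java for readers who do not know Raku: once the prefix --- has been seen, a malformed header raises an error instead of falling through to the regular branch. The class name ParagraphParser and the method classify are invented for this illustration.

```java
public class ParagraphParser {
    /**
     * Classifies one line as "header" or "regular".
     * Once the line starts with "---" we are committed to the header branch:
     * a malformed header throws instead of falling back to "regular".
     */
    public static String classify(String line) {
        if (line.startsWith("---")) {
            int i = 3;
            while (i < line.length() && line.charAt(i) == '+') {
                i++;
            }
            int plusCount = i - 3;
            boolean wellFormed = plusCount >= 1 && plusCount <= 5
                    && i + 1 < line.length()   // room for the space plus at least one more character
                    && line.charAt(i) == ' ';
            if (!wellFormed) {
                throw new IllegalStateException("match failed near position " + i);
            }
            return "header";
        }
        if (line.isEmpty()) {
            throw new IllegalStateException("empty line is not a paragraph");
        }
        return "regular";
    }

    public static void main(String[] args) {
        System.out.println(classify("---++ Foo"));   // header
        System.out.println(classify("plain text"));  // regular
        try {
            classify("---++Foo");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```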

How to get the line of a token in parse rules?

I have searched everywhere and can't find a solution. I am new to ANTLR and for an assignment, I need to print out (using similar syntax that I have below) an error message when my parser comes across an unidentified token with the line number and token. The documentation for antlr4 says line is an attribute for Token objects that gives "[t]he line number on which the token occurs, counting from 1; translates to a call to getLine. Example: $ID.line."
I attempted to implement this in the following chunk of code:
not_valid : not_digit { System.out.println("Line " + $not_digit.line + " Contains Unrecognized Token " $not_digit.text)};
not_digit : ~( DIGIT );
But I keep getting the error
unknown attribute line for rule not_digit in $not_digit.line
My first thought was that I was applying an attribute for a lexer token to a parser rule, because the documentation splits Token and Rule attributes into two different tables. So I changed the code to:
not_valid : NOT_DIGIT { System.out.println("Line " + $NOT_DIGIT.line + " Contains Unrecognized Token " $NOT_DIGIT.text)};
NOT_DIGIT : ~ ( DIGIT ) ;
and also
not_valid : NOT_DIGIT { System.out.println("Line " + $NOT_DIGIT.line + " Contains Unrecognized Token " $NOT_DIGIT.text)};
NOT_DIGIT : ~DIGIT ;
like the example in the documentation, but both times I got the error
rule reference DIGIT is not currently supported in a set
I'm not sure what I am missing. All my searches turn up how to do this in Java outside of the parser, but I need to work in the action code (I think that's what it is called) in the parser.
A block like { ... } is called an action. You embed target-specific code in it, so if you're working with Java, you need to write Java between { and }.
A quick demo:
grammar T;
parse
: not_valid EOF
;
not_valid
: r=not_digit
{
System.out.printf("line=%s, charPositionInLine=%s, text=%s\n",
$r.start.getLine(),
$r.start.getCharPositionInLine(),
$r.start.getText()
);
}
;
not_digit
: ~DIGIT
;
DIGIT
: [0-9]
;
OTHER
: ~[0-9]
;
Test it with the code:
String source = "a";
TLexer lexer = new TLexer(CharStreams.fromString(source));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.parse();
which will print:
line=1, charPositionInLine=0, text=a

Turn absolute file paths and line numbers in the tool output into hyperlinks

This is an example output:
/usr/local/bin/node /usr/local/bin/elm-make src/elm/Main.elm --output=builds/main.js
-- TYPE MISMATCH ---------------------------------------------- src/elm/Main.elm
The type annotation for `init` does not match its definition.
35| init : Maybe Route.Location -> ( Model, Cmd Msg )
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The type annotation is saying:
Maybe Route.Location -> ( { route : Maybe Route.Location }, Cmd Msg )
But I am inferring that the definition has this type:
Maybe Route.Location
-> ( { route : Maybe Route.Location -> Route.Model }, Cmd a )
Detected errors in 1 module.
Process finished with exit code 1
This is the regex that I came up with:
http://regexr.com/3egqu
However, creating an output filter out of it doesn't work.
Thus far, I only know that the following works: ------ ($FILE_PATH$)
And it turns the file path into a link.
Help me find a way to include the line numbers into the links.
Here's what I've come up with;
First,
elm-make --report json
outputs the build errors in structured JSON;
$ elm-make --report json src/main.elm
[{"tag":"unused import","overview":"Module `Bootstrap.CDN` is unused.","details":"Best to remove it. Don't save code quality for later!","region":{"start":{"line":3,"column":1},"end":{"line":3,"column":28}},"type":"warning","file":"src/main.elm"}]
Now you can pipe that output through jq to reformat it:
elm-make src/main.elm --report json --output ./public/app.js | \
jq '.[] | { type: .type, file: .file, line: .region.start.line|tostring, column: .region.start.column|tostring, tag: .tag, details: .details }' | \
jq --raw-output '. | "[" + (.type|ascii_upcase) + "] " + .file + ":" + .line + ":" + .column + " " + .tag + " -- " + .details + "\n"'
which gives you reformatted output:
[WARNING] src/main.elm:9:1 unused import -- Best to remove it. Don't save code quality for later!
[WARNING] src/main.elm:17:1 missing type annotation -- I inferred the type annotation so you can copy it into your code:
main : Program Never Model Main.Msg
You can then pick this up in IntelliJ using the format
$FILE_PATH$:$LINE$:$COLUMN$ $MESSAGE$
You then get to click on an error message to jump to the file, and the error text in a tooltip.
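To show which fields that format carries, here is a small Java sketch that pulls them out of one reformatted line with a regular expression. It is not how IntelliJ implements its output filters; the class name ErrorLine and the pattern are assumptions made for illustration, and the file part is assumed to contain no whitespace.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ErrorLine {
    // Optional "[TYPE] " prefix, then file:line:column, a space, and the message.
    private static final Pattern LINE =
            Pattern.compile("(?:\\[(\\w+)\\] )?(\\S+?):(\\d+):(\\d+) (.*)");

    /** Returns {type, file, line, column, message}, or null if the line does not match. */
    public static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4), m.group(5) };
    }

    public static void main(String[] args) {
        String[] f = parse("[WARNING] src/main.elm:9:1 unused import -- Best to remove it.");
        System.out.println(f[1] + " at line " + f[2] + ", column " + f[3]);
    }
}
```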

JavaCC: Defining a *password* token or grammar rule

I'm using JavaCC to simulate a small part of the SQL grammar, and I'm having a problem with defining a password.
I'm writing grammar rules for a
CREATE USER user_name IDENTIFIED BY a_password
statement, and I'm stuck, since a password can match ANYTHING, like asdkj*!##, !#%^%ASDjnkj, _ASDJLJK##&, etc. Note that in Oracle it's perfectly legal to enter a password without single quotation marks ('). I could solve this problem easily if the quotation marks were compulsory, but unfortunately they're not.
I've tried many ways to define a token/grammar rule for this password, but none worked as I expected. The latest rule I've tried is:
TOKEN : {
< S_PASSWORD: ( < DIGIT > | < LETTER > |< S_PASSCHAR >)+ >
| <#S_PASSCHAR : "!"|"#"|"#"|"$"|"%"|"^"|"&"|"*" >
| <#LETTER: ["a"-"z", "A"-"Z", "_"]>
| <#DIGIT: ["0" - "9"]>
}
But since < S_PASSWORD > can match ANYTHING, any other token that I defined earlier will be matched by it, and I always get a JavaCC warning like this:
Warning: "#" cannot be matched as a string literal token at line 33515, column 13. It will be matched as < S_PASSWORD >.
There are similar suggestions from my friends, but they didn't work either.
Can someone help me with this?
Assuming there is some lexical way to tell where the password begins and ends, you can use lexical states. For example, if the sequence IDENTIFIED BY is only ever followed by spaces that are followed by a password, you can make a state machine so that IDENTIFIED transitions from DEFAULT to S0. In S0 spaces are skipped and BY transitions to S1. In S1 spaces are skipped and a sequence of password characters is a PASSWORD token; the PASSWORD token transitions back to DEFAULT. Of course this only works if IDENTIFIED BY can only ever be followed by a password. Also, in S0 you need to be able to deal with all the normal stuff, so most of your token rules should apply in both states S0 and DEFAULT but transition to DEFAULT. See the FAQ for more on lexical states.
If BY is only ever followed by a password, then it is even easier, as you don't need S0.
Edit
Here are some example rules. If the keyword BY is only ever followed by a password, you only need two states
TOKEN : { <BY : "BY"> : S1 }
<S1> TOKEN : { <PASSWORD : ( <PASSWORDCHAR> )+ > : DEFAULT }
<DEFAULT, S1> SKIP : { " " } // Stays in the same state.
If you can use IDENTIFIED followed by BY then you need three states
<DEFAULT, S0> TOKEN : { <CREATE : "CREATE"> : DEFAULT } // And similar for most token rules.
<DEFAULT, S0> TOKEN : { <IDENTIFIED : "IDENTIFIED"> : S0 } // IDENTIFIED switches to S0.
<S0> TOKEN : { <BY : "BY"> : S1 } // BYs that follow IDENTIFIED
<DEFAULT> TOKEN : { <BY : "BY"> : DEFAULT } // BYs that don't follow IDENTIFIED.
<S1> TOKEN : { <PASSWORD : ( <PASSWORDCHAR> )+ > : DEFAULT }
<DEFAULT, S0, S1> SKIP : { " " } // Stays in the same state.
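For readers who find lexical states abstract, the two-state case (BY switches to S1, a password switches back to DEFAULT) can be mimicked by a hand-written tokenizer. This is only a sketch of the mechanism, not JavaCC output; the class name PasswordLexer and the WORD(...) and PASSWORD(...) labels are invented here, and it assumes tokens are separated by spaces.

```java
import java.util.ArrayList;
import java.util.List;

public class PasswordLexer {
    private enum State { DEFAULT, S1 }

    /** Splits the input on spaces; the word right after BY becomes a PASSWORD, whatever it contains. */
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            return tokens;
        }
        State state = State.DEFAULT;
        for (String word : trimmed.split(" +")) {   // spaces are skipped in every state
            if (state == State.S1) {
                tokens.add("PASSWORD(" + word + ")");
                state = State.DEFAULT;              // PASSWORD switches back to DEFAULT
            } else if (word.equals("BY")) {
                tokens.add("BY");
                state = State.S1;                   // BY switches to S1
            } else {
                tokens.add("WORD(" + word + ")");
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("CREATE USER bob IDENTIFIED BY !#%^%ASDjnkj"));
    }
}
```

Because the password is recognised only in S1, its pattern never competes with ordinary tokens in DEFAULT, which is what avoids the warning from the question.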