Antlr: ignore keywords in specific context - antlr

I'm constructing an English-like domain specific language with ANTLR. Its keywords are context-sensitive. (I know it sounds dirty, but it makes a lot of sense for the non-programmer target users.) For example, the usual logical operators such as or and not are to be treated as identifiers when surrounded in brackets, [like this and this]. My current approach looks like this:
bracketedStatement
: '[' bracketedWord+ ']'
;
bracketedWord
: (~(']')+
;
This, when combined with lexical definitions such as the following:
AND: 'and' ;
OR: 'or' ;
Produces the warning"Decision can match input such as "{AND..PROCESS, RPAREN..'with'}" using multiple alternatives: 1, 2". I'm clearly creating ambiguity for ANTLR, but I don't know how to resolve it. How do I fix this?

For anyone who finds this, check out this stack overflow question. It clarifies how to use the negation symbol correctly.

Related

Parsing annotations within comments / Negative matching of strings in regexps

I'm defining the syntax of the Move IR. The test suite for this language includes various annotations to enable testing. I need to treat comments of this form specially:
//! new-transaction
// check: "Keep(ABORTED { code: 123,"
This file is an example arithmetic_operators_u8.mvir.
So far, I've got this working by disallowing ordinary single-line comments.
module MOVE-ANNOTATION-SYNTAX-CONCRETE
imports INT-SYNTAX
syntax #Layout ::= r"([\\ \\n\\r\\t])" // Whitespace
syntax Annotation ::= "//!" "new-transaction" [klabel(NewTransaction), symbol]
syntax Check ::= "//" "check:" "\"Keep(ABORTED { code:" Int ",\"" [klabel(CheckCode), symbol]
endmodule
module MOVE-ANNOTATION-SYNTAX-ABSTRACT
imports INT-SYNTAX
syntax Annotation ::= "#NewTransaction" [klabel(NewTransaction), symbol]
syntax Check ::= #CheckCode(Int) [klabel(CheckCode), symbol]
endmodule
I'd like to also be able to use ordinary comments.
As a first step, I was able to change the Layout to allow commits only if the begin with a ! using r"(\\/\\/[^!][^\\n\\r]*)"
I'd like to exclude all comments that start with either //! or // check: from comments. What's a good way of implementing this?
Where can I find documentation for the regular expression language that K uses?
K uses flex for its scanner, and thus for its regular expression language. As a result, you can find documentation on its regular expression language here.
You want a regular expression that expresses that comments can't start with ! or check:, but flex doesn't support negative lookahead or not patterns, so you will have to exhaustively enumerate all the cases of comments that don't start with those sequence of characters. It's a bit tedious, sadly.
For reference, here is a (simplified) regular expression drawn from the syntax of C that represents all pragmas that don't start with STDC. It should give you an idea of how to proceed:
#pragma([:space:]*)([^S].*|S|S[^T].*|ST|ST[^D].*|STD|STD[^C].*|STDC[_a-zA-Z0-9].*)?$

ANTLR4: parse number as identifier instead as numeric literal

I have this situation, of having to treat integer as identifier.
Underlying language syntax (unfortunately) allows this.
grammar excerpt:
grammar Alang;
...
NLITERAL : [0-9]+ ;
...
IDENTIFIER : [a-zA-Z0-9_]+ ;
Example code, that has to be dealt with:
/** declaration block **/
Method 465;
...
In above code example, because NLITERAL has to be placed before IDENTIFIER, parser picks 465 as NLITERAL.
What is a good way to deal with such a situations?
(Ideally, avoiding application code within grammar, to keep it runtime agnostic)
I found similar questions on SO, not exactly helpful though.
There's no good way to make 465 produce either an NLITERAL token or an IDENTIFIER token depending on context (you might be able to use lexer modes, but that's probably not a good fit for your needs).
What you can do rather easily though, is to allow NLITERALs in addition to IDENTIFIERS in certain places. So you could define a parser rule
methodName: IDENTIFIER | NLITERAL;
and then use that rule instead of IDENTIFIER where appropriate.

ANTLR4 : clean grammar and tree with keywords (aliases ?)

I am looking for a solution to a simple problem.
The example :
SELECT date, date(date)
FROM date;
This is a rather stupid example where a table, its column, and a function all have the name "date".
The snippet of my grammar (very simplified) :
simple_select
: SELECT selected_element (',' selected_element) FROM from_element ';'
;
selected_element
: function
| REGULAR_WORD
;
function
: REGULAR_WORD '(' function_argument ')'
;
function_argument
: REGULAR_WORD
;
from_element
: REGULAR_WORD
;
DATE: D A T E;
FROM: F R O M;
SELECT: S E L E C T;
REGULAR_WORD
: (SIMPLE_LETTER) (SIMPLE_LETTER | '0'..'9')*
;
fragment SIMPLE_LETTER
: 'a'..'z'
| 'A'..'Z'
;
DATE is a keyword (it is used somewhere else in the grammar).
If I want it to be recognised by my grammar as a normal word, here are my solutions :
1) I add it everywhere I used REGULAR_WORD, next to it.
Example :
selected_element
: function
| REGULAR_WORD
| DATE
;
=> I don't want this solution. I don't have only "DATE" as a keyword, and I have many rules using REGULAR_WORD, so I would need to add a list of many (50+) keywords like DATE to many (20+) parser rules : it would be absolutely ugly.
PROS: make a clean tree
CONS: make a dirty grammar
2) I use a parser rule in between to get all those keywords, and then, I replace every occurrence of REGULAR_WORD by that parser rule.
Example :
word
: REGULAR_WORD
| DATE
;
selected_element
: function
| word
;
=> I do not want this solution either, as it adds one more parser rule in the tree and polluting the informations (I do not want to know that "date" is a word, I want to know that it's a selected_element, a function, a function_argument or a from_element ...
PROS: make a clean grammar
CONS: make a dirty tree
Either way, I have a dirty tree or a dirty grammar. Isn't there a way to have both clean ?
I looked for aliases, parser fragment equivalent, but it doesn't seem like ANTLR4 has any ?
Thank you, have a nice day !
There are four different grammars for SQL dialects in the Antlr4 grammar repository and all four of them use your second strategy. So it seems like there is a consensus among Antlr4 sql grammar writers. I don't believe there is a better solution given the design of the Antlr4 lexer.
As you say, that leads to a bit of noise in the full parse tree, but the relevant non-terminal (function, selected_element, etc.) is certainly present and it does not seem to me to be very difficult to collapse the unit productions out of the parse tree.
As I understand it, when Antlr4 was being designed, a decision was made to only automatically produce full parse trees, because the design of condensed ("abstract") syntax trees is too idiosyncratic to fit into a grammar DSL. So if you find an AST more convenient, you have the responsibility to generate one yourself. That's generally straight-forward although it involves a lot of boilerplate.
Other parser generators do have mechanisms which can handle "semireserved keywords". In particular, the Lemon parser generator, which is part of the Sqlite project, includes a %fallback declaration which allows you to specify that one or more tokens should be automatically reclassified in a context in which no grammar rule allows them to be used. Unfortunately, Lemon does not generate Java parsers.
Another similar option would be to use a parser generator which supports "scannerless" parsing. Such parsers typically use algorithms like Earley/GLL/GLR, capable of parsing arbitrary CFGs, to get around the need for more lookahead than can conveniently be supported in fixed-lookahead algorithms such as LALR(1).
This is the socalled keywords-as-identifiers problem and has been discussed many times before. For instance I asked a similar question already 6 years ago in the ANTLR mailing list. But also here at Stackoverflow there are questions touching this area, for instance Trying to use keywords as identifiers in ANTLR4; not working.
Terence Parr wrote a wiki article for ANTLR3 in 2008 that shortly describes 2 possible solutions:
This grammar allows "if if call call;" and "call if;".
grammar Pred;
prog: stat+ ;
stat: keyIF expr stat
| keyCALL ID ';'
| ';'
;
expr: ID
;
keyIF : {input.LT(1).getText().equals("if")}? ID ;
keyCALL : {input.LT(1).getText().equals("call")}? ID ;
ID : 'a'..'z'+ ;
WS : (' '|'\n')+ {$channel=HIDDEN;} ;
You can make those semantic predicates more efficient by intern'ing those strings so that you can do integer comparisons instead of string compares.
The other alternative is to do something like this
identifier : KEY1 | KEY2 | ... | ID ;
which is a set comparison and should be faster.
Normally, as #rici already mentioned, people prefer the solution where you keep all keywords in an own rule and add that to your normal identifier rule (where such a keyword is allowed).
The other solution in the wiki can be generalized for any keyword, by using a lookup table/list in an action in the ID lexer rule, which is used to check if a given string is a keyword. This solution is not only slower, but also sacrifies clarity in your parser grammar, since you can no longer use keyword tokens in your parser rules.

Solving a reduce/reduce conflict

If I have a grammar where a certain expression can match two productions, I will obviously have a reduce/reduce conflict with yacc. Specifically, say I have two productions (FirstProduction and SecondProduction) where both of them could be TOKEN END.
Then yacc will not be able to know what to reduce TOKEN END to (FirstProduction or SecondProduction). However, I want to make it so that yacc prioritises FirstProduction in this situation. How can I achieve that?
Note that both FirstProduction and SecondProduction could be a great deal of things and that Body is the only place in the grammar where these conflict.
Also, I do know that in these situations, yacc will choose the first production that was declared in the grammar. However, I want to avoid having any reduce/reduce warnings.
You can refactor the grammar to not allow the second list to start with something that could be part of the first list:
Body: FirstProductionList SecondProductionList
| FirstProductionList
;
FirstProductionList: FirstProductionList FirstProduction
| /* empty */
;
SecondProductionList: SecondProductionList SecondProduction
| NonFirstProduction
;
NonFirstProduction is any production that is unique to SecondProduction, and marks the transition from reducing FirstProdutions to SecondProductions
Bison has no way to explicitly mark one production as preferred over another; the only such mechanism is precedence relations, which resolve shift/reduce conflicts. As you say, the file order provides an implicit priority. You can suppress the warning with an %expect declaration; unfortunately, that only lets you tell bison how many conflicts to expect, and not which conflicts.

Bison input analyzer - basic question on optional grammar and input interpretation

I am very new to Flex/Bison, So it is very navie question.
Pardon me if so. May look like homework question - but I need to implement project based on below concept.
My question is related to two parts,
Question 1
In Bison parser, How do I provide rules for optional input.
Like, I need to parse the statment
Example :
-country='USA' -state='INDIANA' -population='100' -ratio='0.5' -comment='Census study for Indiana'
Here the ratio token can be optional. Similarly, If I have many tokens optional, then How do I provide the grammar in the parser for the same?
My code looks like,
%start program
program : TK_COUNTRY TK_IDENTIFIER TK_STATE TK_IDENTIFIER TK_POPULATION TK_IDENTIFIER ...
where all the tokens are defined in the lexer. Since there are many tokens which are optional, If I use "|" then there will be many different ways of input combination possible.
Question 2
There are good chance that the comment might have quotes as part of the input, so I have added a token -tag which user can provide to interpret the same,
Example :
-country='USA' -state='INDIANA' -population='100' -ratio='0.5' -comment='Census study for Indiana$'s population' -tag=$
Now, I need to reinterpret Indiana$'s as Indiana's since -tag=$.
Please provide any input or related material for to understand these topic.
Q1: I am assuming we have 4 possible tokens: NAME , '-', '=' and VALUE
Then the grammar could look like this:
attrs:
attr attrs
| attr
;
attr:
'-' NAME '=' VALUE
;
Note that, unlike you make specific attribute names distinguished tokens, there is no way to say "We must have country, state and population, but ratio is optional."
This would be the task of that part of the program that analyses the data produced by the parser.
Q2: I understand this so, that you think of changing the way lexical analysis works while the parser is running. This is not a good idea, at least not for a beginner. Have you even started to think about lexical analysis, as opposed to parsing?