Negated lexer rules/tokens - antlr

I am trying to match (and ignore) c-style block comments. To me the sequence is (1) /* followed by (2) anything other than /* or */ until (3) */.
BLOCK_COMMENT_START
: "/*"
;
BLOCK_COMMENT_END
: "*/"
;
BLOCK_COMMENT
: BLOCK_COMMENT_START ( ~( BLOCK_COMMENT_START | BLOCK_COMMENT_END ) )* BLOCK_COMMENT_END {
// again, we want to skip the entire match from the lexer stream
$setType( Token.SKIP );
}
;
But Antlr does not think like I do ;)
sql-stmt.g:121:34: This subrule cannot be inverted. Only subrules of the form:
(T1|T2|T3...) or
('c1'|'c2'|'c3'...)
may be inverted (ranges are also allowed).
So the error message is a little cryptic, but I think it is trying to say that only ranges, single-char alts or token alts can be negated. But isn't that what I have? Both BLOCK_COMMENT_START and BLOCK_COMMENT_END are tokens. What am I missing?
Thanks a lot for any help.

Related

antlr how can I skip a token until I need it

Is there a way to skip tokens until I need them? In order to be more clear, here is a grammar that is as close as I could get to what I want:
grammar example;
file : statement* EOF ;
statement : ID EOL
| '{' (EOL statement*)? '}' EOL
;
EOL : ('\r'? '\n' | '\r') -> skip ;
WHITESPACE : [ \t]+ -> skip ;
Hopefully my intent is clear: all whitespace (including newlines) is skipped under normal circumstances, but I can demand the presence of a newline whenever I want, so
foo
{
bar
}
baz
would fit the grammar, but not
foo {
bar
} baz
or
foo bar
{
baz
}
Is there a way to do this, or do I just have to put a lot of EOL*'s in my grammar?
Not long ago I answered another question that needed the same kind of mechanism.
See here for further details.
Basically you achieve this by providing your own, custom TokenStream that implements a mechanism of either skipping whitespace or feeding it into the parser depending on it's setting.

ANTLR - how to skip missing tokens in a 'for' loop

I'm developing a 'toy' language to learn antlr.
My construct for a for loop look like this.
for(4,10){
//program expressions
};
I have a grammar that I think works, but it's a little ugly. Specifically I'm not sure that I've handled the semantically unimportant tokens very well.
For example, the comma in the middle there appears as a token, but it's unimportant to the parser, it just needs the 2 and the 3 for the loop bounds. This means when I see the child() elements for the parts of the loop token, I have to skip the unimportant ones.
You can probably see this best if you examine the ANTLR viewer and look at the parse tree for this. The red arrows point to the tokens I think are redundant.
Feel like I should be making more use of the skip() feature than I am, but I can't see how to insert into the grammar for the tokens at this level.
loop: 'for(' foridxitem ',' foridxitem '){' (programexpression)+ '}';
foridxitem: NUM #ForIndexNumÌ
|
var #ForIndexVar;
The short answer is Antlr produces a parse-tree, so there will always be cruft to step over or otherwise ignore when walking the tree.
The longer answer is that there is a tension between skipping cruft in the lexer and producing tokens of limited syntactic value that are nonetheless necessary for writing unambiguous rules.
For example, you identify for( as a candidate for skipping, yet is probably syntactically required. Conversely, the parameters comma could be truly without syntactic meaning. So, you might clean it up in the lexer (and parser) this way:
FOR: 'for(' -> pushMode(params) ;
ENDLOOP: '}' ;
WS: .... -> skip() ;
mode params;
NUM: .... ;
VAR: .... ;
COMMA: ',' -> skip() ;
ENDPARAMS: '){' -> skip(), popMode() ;
P_WS: .... -> skip() ;
Your parer rule then becomes
loop: FOR foridxitem* programexpression+ ENDLOOP ;
foridxitem: NUM | VAR ;
programexpression: .... ;
That should clean up the tree a fair bit.

ANTLR4 Negative lookahead workaround?

I'm using antlr4 and I'm trying to make a parser for Matlab. One of the main issue there is the fact that comments and transpose both use single quotes. What I was thinking of a solution was to define the STRING lexer rule in somewhat the following manner:
(if previous token is not ')','}',']' or [a-zA-Z0-9]) than match '\'' ( ESC_SEQ | ~('\\'|'\''|'\r'|'\n') )* '\'' (but note I do not want to consume the previous token if it is true).
Does anyone knows a workaround this problem, as it does not support negative lookaheads?
You can do negative lookahead in ANTLR4 using _input.LA(-1) (in Java, see how to resolve simple ambiguity or ANTLR4 negative lookahead in lexer).
You can also use lexer mode to deal with this kind of stuff, but your lexer had to be defined in its own file. The idea is to go from a state that can match some tokens to another that can match new ones.
Here is an example from ANTLR4 lexer documentation:
// Default "mode": Everything OUTSIDE of a tag
COMMENT : '<!--' .*? '-->' ;
CDATA : '<![CDATA[' .*? ']]>' ;
OPEN : '<' -> pushMode(INSIDE) ;
...
XMLDeclOpen : '<?xml' S -> pushMode(INSIDE) ;
...
// ----------------- Everything INSIDE of a tag ------------------ ---
mode INSIDE;
CLOSE : '>' -> popMode ;
SPECIAL_CLOSE: '?>' -> popMode ; // close <?xml...?>
SLASH_CLOSE : '/>' -> popMode ;

How to consume text until newline in ANTLR?

How do you do something like this with ANTLR?
Example input:
title: hello world
Grammar:
header : IDENT ':' REST_OF_LINE ;
IDENT : 'a'..'z'+ ;
REST_OF_LINE : ~'\n'* '\n' ;
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
(I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.)
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
You must understand that the lexer operates independently from the parser. No matter what the parser would "like" to match at a certain time, the lexer simply creates tokens following some strict rules:
try to match tokens from top to bottom in the lexer rules (rules defined first are tried first);
match as much text as possible. In case 2 rules match the same amount of text, the rule defined first will be matched.
Because of rule 2, your REST_OF_LINE will always "win" from the IDENT rule. The only time an IDENT token will be created is when there's no more \n at the end. That is what's going wrong with your grammars: the error messages states that it expects a IDENT token, which isn't found (but a REST_OF_LINE token is produced).
I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.
You can't just define tokens (lexer rules) you want to apply to the header of a file. These tokens will also apply to the rest of the more complex file. Perhaps you should pre-process the header separately from the rest of the file?
antlr parsing is usually done in 2 steps.
1. construct your ast
2. define your grammer
pseudo code (been a few years since I played with antlr) - AST:
WORD : 'a'..'z'+ ;
SEPARATOR : ':';
SPACE : ' ';
pseudo code - tree parser:
header: WORD SEPARATOR WORD (SPACE WORD)+
Hope that helps....

How can I construct a clean, Python like grammar in ANTLR?

G'day!
How can I construct a simple ANTLR grammar handling multi-line expressions without the need for either semicolons or backslashes?
I'm trying to write a simple DSLs for expressions:
# sh style comments
ThisValue = 1
ThatValue = ThisValue * 2
ThisOtherValue = (1 + 2 + ThisValue * ThatValue)
YetAnotherValue = MAX(ThisOtherValue, ThatValue)
Overall, I want my application to provide the script with some initial named values and pull out the final result. I'm getting hung up on the syntax, however. I'd like to support multiple line expressions like the following:
# Note: no backslashes required to continue expression, as we're in brackets
# Note: no semicolon required at end of expression, either
ThisValueWithAReallyLongName = (ThisOtherValueWithASimilarlyLongName
+AnotherValueWithAGratuitouslyLongName)
I started off with an ANTLR grammar like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL!?
;
empty_line
: NL;
assignment
: ID '=' expr
;
// ... and so on
It seems simple, but I'm already in trouble with the newlines:
warning(200): StackOverflowQuestion.g:11:20: Decision can match input such as "NL" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Graphically, in org.antlr.works.IDE:
Decision Can Match NL Using Multiple Alternatives http://img.skitch.com/20090723-ghpss46833si9f9ebk48x28b82.png
I've kicked the grammar around, but always end up with violations of expected behavior:
A newline is not required at the end of the file
Empty lines are acceptable
Everything in a line from a pound sign onward is discarded as a comment
Assignments end with end-of-line, not semicolons
Expressions can span multiple lines if wrapped in brackets
I can find example ANTLR grammars with many of these characteristics. I find that when I cut them down to limit their expressiveness to just what I need, I end up breaking something. Others are too simple, and I break them as I add expressiveness.
Which angle should I take with this grammar? Can you point to any examples that aren't either trivial or full Turing-complete languages?
I would let your tokenizer do the heavy lifting rather than mixing your newline rules into your grammar:
Count parentheses, brackets, and braces, and don't generate NL tokens while there are unclosed groups. That'll give you line continuations for free without your grammar being any the wiser.
Always generate an NL token at the end of file whether or not the last line ends with a '\n' character, then you don't have to worry about a special case of a statement without a NL. Statements always end with an NL.
The second point would let you simplify your grammar to something like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL
;
empty_line
: NL
;
assignment
: ID '=' expr
;
How about this?
exprlist
: (expr)? (NL+ expr)* NL!? EOF!
;
expr
: assignment | ...
;
assignment
: ID '=' expr
;
I assume you chose to make NL optional, because the last statement in your input code doesn't have to end with a newline.
While it makes a lot of sense, you are making life a lot harder for your parser. Separator tokens (like NL) should be cherished, as they disambiguate and reduce the chance of conflicts.
In your case, the parser doesn't know if it should parse "assignment NL" or "assignment empty_line". There are many ways to solve it, but most of them are just band-aides for an unwise design choice.
My recommendation is an innocent hack: Make NL mandatory, and always append NL to the end of your input stream!
It may seem a little unsavory, but in reality it will save you a lot of future headaches.