ANTLR - how to skip missing tokens in a 'for' loop

I'm developing a 'toy' language to learn antlr.
My construct for a for loop looks like this:
for(4,10){
//program expressions
};
I have a grammar that I think works, but it's a little ugly. Specifically I'm not sure that I've handled the semantically unimportant tokens very well.
For example, the comma in the middle appears as a token, but it's unimportant to the parser; it just needs the 4 and the 10 for the loop bounds. This means that when I look at the child() elements for the parts of the loop, I have to skip the unimportant ones.
You can probably see this best if you examine the ANTLR viewer and look at the parse tree for this. The red arrows point to the tokens I think are redundant.
I feel like I should be making more use of the skip feature than I am, but I can't see how to insert it into the grammar for the tokens at this level.
loop: 'for(' foridxitem ',' foridxitem '){' (programexpression)+ '}';
foridxitem: NUM #ForIndexNum
|
var #ForIndexVar;

The short answer is Antlr produces a parse-tree, so there will always be cruft to step over or otherwise ignore when walking the tree.
The longer answer is that there is a tension between skipping cruft in the lexer and producing tokens of limited syntactic value that are nonetheless necessary for writing unambiguous rules.
For example, you identify for( as a candidate for skipping, yet it is probably syntactically required. Conversely, the comma between the parameters could be truly without syntactic meaning. So, you might clean it up in the lexer (and parser) this way:
FOR: 'for(' -> pushMode(params) ;
ENDLOOP: '}' ;
WS: .... -> skip ;
mode params;
NUM: .... ;
VAR: .... ;
COMMA: ',' -> skip ;
ENDPARAMS: '){' -> skip, popMode ;
P_WS: .... -> skip ;
Your parser rule then becomes
loop: FOR foridxitem* programexpression+ ENDLOOP ;
foridxitem: NUM | VAR ;
programexpression: .... ;
That should clean up the tree a fair bit.
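To see what the mode-based lexer above does to the token stream, here is a rough Python sketch of the same idea. It is not ANTLR and only imitates the push/pop behaviour with a mode stack; the STMT rule is a hypothetical stand-in for whatever the loop body contains, and the NUM and VAR patterns are assumptions, since the original rules are elided. Unlike ANTLR, it picks the first matching rule rather than the longest match.

```python
import re

# Rules per mode: (token name, regex, action).
# Actions: "emit", "skip", "push:<mode>" (emit, then enter mode), "skip+pop".
RULES = {
    "default": [
        ("FOR", r"for\(", "push:params"),
        ("ENDLOOP", r"\}", "emit"),
        ("WS", r"\s+", "skip"),
        ("STMT", r"[^\s}]+", "emit"),  # hypothetical stand-in for body tokens
    ],
    "params": [
        ("NUM", r"\d+", "emit"),            # assumed pattern
        ("VAR", r"[A-Za-z_]\w*", "emit"),   # assumed pattern
        ("COMMA", r",", "skip"),
        ("ENDPARAMS", r"\)\{", "skip+pop"),
        ("P_WS", r"\s+", "skip"),
    ],
}


def tokenize(text):
    """Return (name, text) pairs, dropping every token whose rule skips."""
    modes = ["default"]
    pos = 0
    tokens = []
    while pos < len(text):
        # First matching rule of the current mode wins.
        for name, pattern, action in RULES[modes[-1]]:
            m = re.compile(pattern).match(text, pos)
            if m:
                break
        else:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        pos = m.end()
        if action.startswith("push:"):
            tokens.append((name, m.group()))
            modes.append(action[len("push:"):])
        elif action == "skip+pop":
            modes.pop()
        elif action == "emit":
            tokens.append((name, m.group()))
        # action == "skip": consume the text and emit nothing
    return tokens


print(tokenize("for(4,10){ x }"))
```

The comma and the '){' never reach the parser, so the loop node only has FOR, the two index items, the body, and ENDLOOP as children.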

Related

Conditionally skipping an ANTLR lexer rule based on current line number

I have this pair of rules in my ANTLR lexer grammar, which match the same pattern, but have mutually exclusive predicates:
MAGIC: '#' ~[\r\n]* {getLine() == 1}? ;
HASH_COMMENT: '#' ~[\r\n]* {getLine() != 1}? -> skip;
When I look at the tokens in the ANTLR Preview, I see:
So it seems like the predicate isn't being used, and regardless of the line I'm on, the token comes out as MAGIC.
I also tried a different approach to try and work around this:
tokens { MAGIC }
HASH_COMMENT: '#' ~[\r\n]* {if (getLine() == 1) setType(MAGIC); else skip();};
But now, both come out as HASH_COMMENT:
I really expected the first attempt using two predicates to work, so that was surprising, but now it seems like the action doesn't work either, which is even more odd.
How do I make this work?
I'd rather not try to match "#usda ..." as a different token because that comment could occur further down the file, and it should be treated as a normal comment unless it's on the first line.
I would not try to force semantics in the parse step. The letter combination is a HASH_COMMENT, period.
Instead I would handle that as normal syntax and handle anything special you might need in the step after parsing. For example:
document: HASH_COMMENT? content EOF;
This way you define an optional HASH_COMMENT before any content, which you can interpret as MAGIC later without needing such a token type. It might not be on line one, but it comes before anything else, which resembles real documents better, since they can have whitespace before the hash comment.
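As a rough illustration of handling this after parsing, the sketch below inspects the first token of an already-lexed document and decides whether it is the magic line. The Token class, the find_magic helper and the "#usda" prefix check are assumptions made for this example; they are not part of ANTLR's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Hypothetical minimal token: a type name and the matched text."""
    type: str
    text: str


def find_magic(tokens: List[Token], prefix: str = "#usda") -> Optional[str]:
    """Return the leading hash comment if it looks like a magic line, else None."""
    if not tokens:
        return None
    first = tokens[0]
    if first.type == "HASH_COMMENT" and first.text.startswith(prefix):
        return first.text
    return None


doc = [Token("HASH_COMMENT", "#usda 1.0"), Token("ID", "x")]
print(find_magic(doc))
```

A hash comment that appears later in the stream is never inspected, so it stays an ordinary comment.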

Writing parser rules sensitive to whitespace while skipping WS from the lexer

I am having some troubles in handling whitespace. In the following excerpt of a grammar, I set up the lexer so that the parser skips whitespace:
ENTITY_VAR
: 'user'
| 'resource'
;
INT : DIGIT+ | '-' DIGIT+ ;
ID : LETTER (LETTER | DIGIT | SPECIAL)* ;
ENTITY_ID : '__' ENTITY_VAR ('_w_' ID)?;
NEWLINE : '\r'? '\n';
WS : [ \t\r\n]+ -> skip; // skip spaces, tabs, newlines
fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
fragment SPECIAL : ('_' | '#' );
The problem is, I would like to match against variable names of the form ENTITY_ID such that the matched string does not contain any whitespace. It would be sufficient to write it as a lexer rule as I did here, but I'd like to do it with a parser rule instead, because I want direct access to the two tokens ENTITY_VAR and ID individually from my code, rather than having them squeezed together into a single ENTITY_ID token.
Any ideas, please?
Basically any solution which let me access directly ENTITY_VAR and ID would suit me, both by leaving ENTITY_ID as a lexer rule or moving it to the parser.
There are several approaches I can think of (in no particular order):
1. Emit several tokens from the rule ENTITY_ID. See ANTLR4: How to inject tokens for an inspiration
2. Allow whitespace in the parser and check afterwards
3. Use the single token and split in code
4. Use the single token and modify the token stream before passing it to the parser. I.e. lex, modify the ENTITY_ID tokens and split them into several other tokens, then pass this stream to the parser
5. Don't skip whitespace and when dealing with these "extra tokens" check if they are within an ENTITY_ID part (=> is error) or not (=> ignore error).
6. Don't skip whitespace and add "WS*" everywhere in your grammar where whitespace is allowed (ok if the grammar is not too large).
7. Insert predicates in the parser rule that check whether there is whitespace in between.
8. Create a "trap" rule like this:
INVALID_ENTITY_ID : '__' WS+ ENTITY_VAR WS? ('_w_' WS? ID)?
| '__' WS? ENTITY_VAR WS+ ('_w_' WS? ID)?
| '__' WS? ENTITY_VAR WS? ('_w_' WS+ ID)
;
This will catch invalid ENTITY_IDs, since the match is longer than any of the parts that would otherwise be lexed as individual tokens.
I'd go with 2, if it doesn't alter the parse in the "non error" case, i.e. no code is interpreted differently by allowing whitespace.
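For the "use the single token and split in code" option, a minimal sketch could look like the following. It is written in Python for brevity and mirrors the ENTITY_ID lexer rule from the question with a regular expression; the function name split_entity_id is made up for this example.

```python
import re

# Mirrors: ENTITY_ID : '__' ENTITY_VAR ('_w_' ID)? ;
# with ENTITY_VAR = 'user' | 'resource'
# and ID = LETTER (LETTER | DIGIT | '_' | '#')*
ENTITY_ID_RE = re.compile(r"__(user|resource)(?:_w_([A-Za-z][A-Za-z0-9_#]*))?")


def split_entity_id(text):
    """Split the text of an ENTITY_ID token into (entity_var, id_or_None)."""
    m = ENTITY_ID_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a valid ENTITY_ID: {text!r}")
    return m.group(1), m.group(2)


print(split_entity_id("__user_w_abc1"))
print(split_entity_id("__resource"))
```

Because the lexer has already guaranteed that the token contains no whitespace, the second pass only has to pull the two parts apart.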
As far as I managed to understand by browsing the documentation, it doesn't look like something like that is feasible.
Parser rules seem to work just on the default channel, so I can't send WS to channel(HIDDEN) and then recover it just for a single parser rule.
On the other hand, an author of antlr explains here that it's not possible to break down any token since version 4.
Even though I don't like it at all, it seems that the fastest way is to match it in the lexer (as in the code from the question), and then re-parse the whole string again from Java.
Still, any other better option or correction to my conclusions is welcome.
Hooking two parsers in a sort of pipeline, as your own answer suggests, is a sound and simple design/solution, and I'm pretty sure ANTLR is capable of helping with that.
I don't know how far the ANTLR folks have gone in their work on stream/feed parsing. But adopting a two-pass strategy should be efficient enough, as the first pass would just be lexing a regular language, which is O(c * N) over the size of the input with a very small c.
If you want a single pass that costs O(k * N) (with a large k), you could consider PEG, for which there are implementations in Java (which I haven't tried).

Antlr Lexer exclude a certain pattern

In Antlr Lexer, How can I achieve parsing a token like this:
A word that contains any non-space character but not '.{' inside it. The best I can come up with is using a semantic predicate.
WORD: WL+ {!getText().contains(".{")}?;
WL: ~[ \n\r\t];
I'm a bit worried about using a semantic predicate, though, because WORD will be lexed millions of times and I would think a semantic predicate will hurt performance.
This is coming from the requirement that I need to parse something like:
TOKEN_ONE.{TOKEN_TWO}
while TOKEN_ONE can include . and { among its characters.
I'm using Antlr 4.
You need to limit your predicate evaluation to the case immediately following a . in the input.
WORD
: ( ~[. \t\r\n]
| '.' {_input.LA(1)!='{'}?
)+
;
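The effect of that rule can be checked outside ANTLR with an equivalent regular expression, where the predicate on the character after '.' becomes a negative lookahead. This is only a Python sketch of the matching behaviour, not of the ANTLR lexer itself.

```python
import re

# Any run of characters that are not a dot or whitespace,
# plus dots that are not immediately followed by '{'.
WORD_RE = re.compile(r"(?:[^. \t\r\n]|\.(?!\{))+")

print(WORD_RE.match("a.b.{c}").group())
print(WORD_RE.findall("foo.bar baz.{qux}"))
```

The lookahead is only evaluated when the current character is a dot, which is the same restriction the predicate placement achieves in the grammar.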
How about rephrasing your question to the equivalent "A word contains any character except whitespace or dot or left brace-bracket."
Then the lexer rule is just:
WORD: ~[ \n\r\t.{]+ ;

Yacc "rule useless due to conflicts"

I need some help with yacc.
I'm working on an infix/postfix translator. The infix-to-postfix part was really easy, but I'm having some issues with the postfix-to-infix translation.
Here's an example of what I was going to do (just to translate a simple ab+c- or abc+-):
exp: num {printf("+ ");} exp '+'
| num {printf("- ");} exp '-'
| exp {printf("+ ");} num '+'
| exp {printf("- ");} num '-'
|/* empty*/
;
num: number {printf("%d ", $1);}
;
Obviously it doesn't work, because I'm asking for an action (the printfs) before the actual body, so when compiling I get many
warning: rule useless in parser due to conflict
The problem is that the printfs are exactly where I need them (or my output won't be an infix expression). Is there a way to keep the print actions right there and let yacc identify which one it needs to use?
Basically, no there isn't. The problem is that to resolve what you've got, yacc would have to have an unbounded amount of lookahead. This is… problematic given that yacc is a fairly simple-minded tool, so instead it takes a (bad) guess and throws out some of your rules with a warning. You need to change your grammar so yacc can decide what to do with a token with only a very small amount of lookahead (a single token IIRC). The usual way to do this is to attach the interpretations of the values to the tokens and either use a post-action or, more practically, build a tree which you traverse as a separate step (doing print out of an infix expression from its syntax tree is trivial).
Note that when you've got warnings coming out of yacc, that typically means that your grammar is wrong and that the resulting parser will do very unexpected things. Refine it until you get no warnings from that stage at all. That is, treat grammar warnings as errors; anything else and you'll be sorry.
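The tree-building approach mentioned above can be sketched independently of yacc. The Python below is only an illustration of the idea: operands and operators are reduced onto a stack to form a tree, and the infix text is produced in a separate traversal. The tuple-based node layout and the function names are choices made for this example.

```python
def postfix_to_tree(tokens):
    """Build a tree from postfix tokens; a node is an int or (op, left, right)."""
    stack = []
    for tok in tokens:
        if tok in ("+", "-"):
            if len(stack) < 2:
                raise ValueError(f"operator {tok!r} is missing an operand")
            right = stack.pop()
            left = stack.pop()
            stack.append((tok, left, right))
        else:
            stack.append(int(tok))
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]


def to_infix(node, top=True):
    """Render the tree as infix text, parenthesizing every nested operation."""
    if isinstance(node, int):
        return str(node)
    op, left, right = node
    text = f"{to_infix(left, False)} {op} {to_infix(right, False)}"
    return text if top else f"({text})"


print(to_infix(postfix_to_tree("1 2 + 3 -".split())))
print(to_infix(postfix_to_tree("1 2 3 + -".split())))
```

In a yacc grammar the same split applies: the rule actions only build nodes after their operands have been reduced, and the printing happens once the whole expression has been parsed.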

How can I construct a clean, Python like grammar in ANTLR?

G'day!
How can I construct a simple ANTLR grammar handling multi-line expressions without the need for either semicolons or backslashes?
I'm trying to write a simple DSL for expressions:
# sh style comments
ThisValue = 1
ThatValue = ThisValue * 2
ThisOtherValue = (1 + 2 + ThisValue * ThatValue)
YetAnotherValue = MAX(ThisOtherValue, ThatValue)
Overall, I want my application to provide the script with some initial named values and pull out the final result. I'm getting hung up on the syntax, however. I'd like to support multiple line expressions like the following:
# Note: no backslashes required to continue expression, as we're in brackets
# Note: no semicolon required at end of expression, either
ThisValueWithAReallyLongName = (ThisOtherValueWithASimilarlyLongName
+AnotherValueWithAGratuitouslyLongName)
I started off with an ANTLR grammar like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL!?
;
empty_line
: NL;
assignment
: ID '=' expr
;
// ... and so on
It seems simple, but I'm already in trouble with the newlines:
warning(200): StackOverflowQuestion.g:11:20: Decision can match input such as "NL" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Graphically, in org.antlr.works.IDE:
Decision Can Match NL Using Multiple Alternatives http://img.skitch.com/20090723-ghpss46833si9f9ebk48x28b82.png
I've kicked the grammar around, but always end up with violations of expected behavior:
A newline is not required at the end of the file
Empty lines are acceptable
Everything in a line from a pound sign onward is discarded as a comment
Assignments end with end-of-line, not semicolons
Expressions can span multiple lines if wrapped in brackets
I can find example ANTLR grammars with many of these characteristics. I find that when I cut them down to limit their expressiveness to just what I need, I end up breaking something. Others are too simple, and I break them as I add expressiveness.
Which angle should I take with this grammar? Can you point to any examples that aren't either trivial or full Turing-complete languages?
I would let your tokenizer do the heavy lifting rather than mixing your newline rules into your grammar:
Count parentheses, brackets, and braces, and don't generate NL tokens while there are unclosed groups. That'll give you line continuations for free without your grammar being any the wiser.
Always generate an NL token at the end of file whether or not the last line ends with a '\n' character, then you don't have to worry about a special case of a statement without a NL. Statements always end with an NL.
The second point would let you simplify your grammar to something like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL
;
empty_line
: NL
;
assignment
: ID '=' expr
;
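The two tokenizer suggestions above can be sketched as follows. This is a hand-written Python tokenizer rather than an ANTLR lexer, and the token names and patterns are assumptions for the example: it suppresses NL while any bracket is open and guarantees that the stream ends with an NL.

```python
import re

TOKEN_RE = re.compile(r"""
    (?P<COMMENT>\#[^\n]*)      # sh-style comment, discarded
  | (?P<NL>\r?\n)
  | (?P<WS>[ \t]+)
  | (?P<NUM>\d+)
  | (?P<ID>[A-Za-z_]\w*)
  | (?P<OP>[-+*/=,()\[\]{}])
""", re.VERBOSE)

OPENERS, CLOSERS = "([{", ")]}"


def tokenize(source):
    """Return (kind, text) pairs with NL suppressed inside brackets."""
    tokens, depth, pos = [], 0, 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        pos = m.end()
        kind, text = m.lastgroup, m.group()
        if kind in ("WS", "COMMENT"):
            continue
        if kind == "NL":
            if depth == 0:  # newlines inside an open group are not statement ends
                tokens.append(("NL", "\n"))
            continue
        if kind == "OP" and text in OPENERS:
            depth += 1
        elif kind == "OP" and text in CLOSERS:
            depth = max(depth - 1, 0)
        tokens.append((kind, text))
    # Always finish with an NL so every statement is terminated.
    if not tokens or tokens[-1][0] != "NL":
        tokens.append(("NL", "\n"))
    return tokens


print([kind for kind, _ in tokenize("A = (1 +\n 2)\nB = A")])
```

With a token stream like this, the grammar can require NL after every assignment without a special case for the last line.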
How about this?
exprlist
: (expr)? (NL+ expr)* NL!? EOF!
;
expr
: assignment | ...
;
assignment
: ID '=' expr
;
I assume you chose to make NL optional, because the last statement in your input code doesn't have to end with a newline.
While it makes a lot of sense, you are making life a lot harder for your parser. Separator tokens (like NL) should be cherished, as they disambiguate and reduce the chance of conflicts.
In your case, the parser doesn't know if it should parse "assignment NL" or "assignment empty_line". There are many ways to solve it, but most of them are just band-aids for an unwise design choice.
My recommendation is an innocent hack: Make NL mandatory, and always append NL to the end of your input stream!
It may seem a little unsavory, but in reality it will save you a lot of future headaches.