What are Recursive Lexer Rules in ANTLR4 good for? - antlr

I'm just getting started with ANTLR4 and found that lexer rules can be recursive (https://github.com/antlr/antlr4/blob/master/doc/lexer-rules.md#recursive-lexer-rules). What is the use case for recursive lexer rules as opposed to recursive parser rules? Don't you lose all the structural information when the recursion happens in the lexer?

As mentioned in the comments by sepp2k: besides nested string literals, nested comments are one of the use cases:
COMMENT
: '/*' ( COMMENT | . )*? '*/'
;
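To see what the recursion buys you, here is a minimal Python sketch (not ANTLR code) that imitates the COMMENT rule above: it recurses whenever it meets a nested '/*', so the whole nested comment ends up as one flat chunk of text, which is exactly what a single lexer token is. The function name match_nested_comment is made up for this illustration.

```python
def match_nested_comment(text, start=0):
    """Return the index just past the comment opening at `start`, or -1.

    Imitates COMMENT : '/*' ( COMMENT | . )*? '*/' ;
    by recursing whenever a nested '/*' is encountered.
    """
    if not text.startswith("/*", start):
        return -1
    i = start + 2
    while i < len(text):
        if text.startswith("*/", i):      # closing delimiter of this level
            return i + 2
        if text.startswith("/*", i):      # nested comment: recurse
            i = match_nested_comment(text, i)
            if i == -1:
                return -1                 # nested comment was never closed
        else:
            i += 1                        # any other character
    return -1                             # ran out of input


source = "/* outer /* inner */ still outer */ x = 1;"
end = match_nested_comment(source)
print(source[:end])
```

The result is only a start and end position, with no tree for the inner comment. That is fine for comments, which the parser usually never looks at anyway.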

Related

Did "!", "^" and "$" have a special meaning in ANTLR3?

I don't have any prior knowledge of ANTLR (I recently learned a little about ANTLR4), but I have to translate an old grammar to a newer version. Eclipse tells me that there are no viable alternatives for those characters and shows the syntax error " '!' came as a complete surprise to me".
I already deleted those characters and it does not seem to be a problem, but maybe they had a special function in ANTLR3.
Thanks in advance.
global_block:
DATABASE! IDENTIFIER!
| GLOBALS! define_section!+ END! GLOBALS!
| GLOBALS! STRING!
;
main_block: MAIN sequence? END em=MAIN
-> ^(MAIN MAIN '(' ')' sequence? $em)
;
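!, ^ and -> all relate to tree construction and rewriting in ANTLR3 (a ! suffix drops the element from the tree, a ^ suffix makes it a root node, and -> starts a rewrite rule): https://theantlrguy.atlassian.net/wiki/spaces/ANTLR3/pages/2687090/Tree+construction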
ANTLR4 does not support it (v4 has listeners and visitors for tree traversal, but no rewriting anymore). Just remove all of these ! and -> ... in parser rules (do not remove the -> ... inside lexer rules like -> channel(...), which is still supported in v4).
So in your case, these rules would be valid in ANTLR4:
global_block
: DATABASE IDENTIFIER
| GLOBALS define_section+ END GLOBALS
| GLOBALS STRING
;
main_block
: MAIN sequence? END MAIN
;
The $ can still be used in ANTLR4: it references labeled sub-rules or tokens:
expression
: lhs=expression operator=(PLUS | MINUS) rhs=expression
| NUMBER
;
so that in an embedded code block, you can do: $lhs.someField.someMethod(). In your case, you can also just remove them because they are probably only used in the tree rewrite rules.
EDIT
kaby76 has a GitHub page with some instructions for converting grammars to ANTLR4: https://github.com/kaby76/AntlrVSIX/blob/master/doc/Import.md#antlr3

ANTLR with non-greedy rules

I would like to have the following grammar (part of it):
expression
:
expression 'AND' expression
| expression 'OR' expression
| StringSequence
;
StringSequence
:
StringCharacters
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\\]
| EscapeSequence
;
It should match things like "a b c d f" (without the quotes), as well as things like "a AND b AND c".
The problem is that my rule StringSequence is greedy, and consumes the OR/AND as well. I've tried different approaches but couldn't get my grammar to work in the correct way. Is this possible with ANTLR4? Note that I don't want to put quotes around every string. Putting quotes works fine because the rule becomes non greedy, i.e.:
StringSequence
: '"' StringCharacters? '"'
;
You have no whitespace rule, so StringCharacter matches everything except quote and backslash characters (plus the escape sequence). Include a whitespace rule so that AND/OR are matched as individual tokens. Additionally, I recommend defining lexer rules for the string literals ('AND', 'OR') instead of embedding them in the parser rule(s). This way you not only get descriptive names for the tokens (instead of auto-generated ones), you can also better control the match order.
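The effect of that advice can be tried outside ANTLR. The Python sketch below is only an illustration under assumed names (tokenize, KEYWORDS and the STRING token type are invented here): whitespace separates words first, so AND and OR show up as words of their own, and consecutive non-keyword words are then merged back into one string, roughly what a parser rule over word tokens would do.

```python
import re

KEYWORDS = {"AND", "OR"}


def tokenize(text):
    """Split `text` into (type, value) pairs.

    Whitespace separates words, so 'AND' and 'OR' are seen as tokens of
    their own; runs of non-keyword words are merged into one STRING token.
    Runs of whitespace inside a STRING are normalized to a single space.
    """
    tokens = []
    run = []                               # pending non-keyword words
    for word in re.findall(r"\S+", text):
        if word in KEYWORDS:
            if run:
                tokens.append(("STRING", " ".join(run)))
                run = []
            tokens.append((word, word))
        else:
            run.append(word)
    if run:
        tokens.append(("STRING", " ".join(run)))
    return tokens


print(tokenize("a b c d f"))
print(tokenize("a AND b c OR d"))
```

Because the comparison is done on whole words, an input such as ANDROID is not mistaken for the keyword AND.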
Here is another, admittedly naive, solution:
StringSequence :
(StringCharacter | NotAnd | NotOr)+
;
fragment NotAnd :
'AN' ~'D'
| 'A' ~'N'
;
fragment NotOr:
'O' ~('R')
;
fragment StringCharacter :
~('O'|'A')
;
This gets a bit more complex once whitespace rules are added. Another solution would be a semantic predicate that looks ahead and prevents keywords from being consumed.

How do I ignore arbitrary stuff inside braces in ANTLR?

I am trying to write a config file grammar and get ANTLR4 to handle it. I am quite new to ANTLR (this is my first project with it).
Largely, I understand what needs to be done (or at least I think I do) for most of the config file grammar, but the files that I will be reading will have arbitrary C code inside of curly braces. Here is an example:
#DEVICE: servo "servos are great"
#ACTION: turnRight "turning right is fun"
{
arbitrary C source code goes here;
some more arbitrary C source code;
}
#ACTION: secondAction "this is another action"
{
some more code;
}
And it could be many of those. I can't seem to get it to understand that I want to just ignore (without skipping) the source code. Here is my grammar so far:
/**
ANTLR4 grammar for practicing
*/
grammar practice;
file: (devconfig)*
;
devconfig: devid (action)+
;
devid: DEV_HDR (COMMENT)?
;
action: ACTN_HDR '{' C_BLOCK '}'
;
DEV_HDR: '#DEVICE: ' ALPHA+(IDCHAR)*
;
fragment
ALPHA: [a-zA-Z]
;
fragment
IDCHAR: ALPHA
| [0-9]
| '_'
;
COMMENT: '"' .*? '"'
;
ACTN_HDR: '#ACTION: ' ACTION_ID
;
fragment
ACTION_ID: ALPHA+(IDCHAR)*
;
C_BLOCK: WHAT DO I PUT HERE?? -> channel(HIDDEN)
;
WS: [ \t\n\r]+ -> skip
;
The problem is that whatever I put in the C_BLOCK lexer rule seems to break the whole grammar. For example, if I use .*? -> channel(HIDDEN), it doesn't work at all (and ANTLR reports an error along the lines of ".*? can match the empty string"). What should I put there instead, so that the C code is ignored by the parser but remains accessible later (i.e., it is not skipped)?
Your C_BLOCK rule can be defined just like the usual multi-line comment rule found in so many languages. Make the curly braces part of the rule too:
C_BLOCK: '{' .*? '}' -> channel(HIDDEN);
If you need to nest blocks, you can write something like:
C_BLOCK: '{' .*? C_BLOCK? .*? '}' -> channel(HIDDEN);
or maybe:
C_BLOCK:
'{' (
C_BLOCK
| .
)*?
'}'
;
(untested).
Update: changed the code to use the non-greedy Kleene operator, as suggested in a comment.
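Since the rules above are marked untested, it may help to check the intended behavior against a plain reference implementation. The following Python sketch is not ANTLR code and the name extract_blocks is invented for this example. A depth counter stands in for the recursive C_BLOCK reference. Braces that appear inside C string literals or comments are not handled here, and the simple lexer rules above do not account for them either.

```python
def extract_blocks(text):
    """Return the contents of every top-level '{ ... }' block in `text`.

    The depth counter goes up on '{' and down on '}'; a block ends when
    the counter returns to 0. Braces inside C strings or comments are
    NOT treated specially in this sketch.
    """
    blocks = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i + 1              # first character after '{'
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                blocks.append(text[start:i])
    return blocks


config = (
    '#ACTION: turnRight "fun"\n'
    "{ if (x) { y(); } }\n"
    '#ACTION: second "more"\n'
    "{ z(); }\n"
)
for body in extract_blocks(config):
    print(repr(body))
```

The two extracted bodies correspond to what would end up inside the two C_BLOCK tokens, minus the outer braces.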

Antlr Lexer exclude a certain pattern

In an ANTLR lexer, how can I match a token like this:
A word that contains any non-space character but does not contain '.{'. The best I can come up with is using a semantic predicate.
WORD: WL+ {!getText().contains(".{")}?;
WL: ~[ \n\r\t];
I'm a bit worried about using a semantic predicate, though, because WORD will be lexed millions of times and I expect the predicate to hurt performance.
This is coming from the requirement that I need to parse something like:
TOKEN_ONE.{TOKEN_TWO}
where TOKEN_ONE itself can include . and { among its characters.
I'm using Antlr 4.
You need to limit your predicate evaluation to the case immediately following a . in the input.
WORD
: ( ~[. \t\r\n]
| '.' {_input.LA(1)!='{'}?
)+
;
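To make the intent of this rule concrete, here is a rough Python equivalent using a negative lookahead in place of the predicate. It is only a sketch for experimenting with inputs, not generated lexer code, and the names WORD and first_word are chosen for this example.

```python
import re

# One or more characters that are either
#   - not a dot and not space/tab/CR/LF, or
#   - a dot that is NOT followed by '{' (negative lookahead).
WORD = re.compile(r"(?:[^. \t\r\n]|\.(?!\{))+")


def first_word(text):
    """Return the longest WORD match at the start of `text`, or None."""
    m = WORD.match(text)
    return m.group(0) if m else None


print(first_word("TOKEN_ONE.{TOKEN_TWO}"))
print(first_word("a.b{c.{d}"))
```

As in the lexer rule, the lookahead is only evaluated when the current character is a dot, so ordinary characters cost nothing extra.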
How about rephrasing your question to the equivalent "A word contains any character except whitespace, dot, or left brace"?
Then the lexer rule is just:
WORD: ~[ \n\r\t.{]+ ;

How can I construct a clean, Python like grammar in ANTLR?

G'day!
How can I construct a simple ANTLR grammar handling multi-line expressions without the need for either semicolons or backslashes?
I'm trying to write a simple DSL for expressions:
# sh style comments
ThisValue = 1
ThatValue = ThisValue * 2
ThisOtherValue = (1 + 2 + ThisValue * ThatValue)
YetAnotherValue = MAX(ThisOtherValue, ThatValue)
Overall, I want my application to provide the script with some initial named values and pull out the final result. I'm getting hung up on the syntax, however. I'd like to support multiple line expressions like the following:
# Note: no backslashes required to continue expression, as we're in brackets
# Note: no semicolon required at end of expression, either
ThisValueWithAReallyLongName = (ThisOtherValueWithASimilarlyLongName
+AnotherValueWithAGratuitouslyLongName)
I started off with an ANTLR grammar like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL!?
;
empty_line
: NL;
assignment
: ID '=' expr
;
// ... and so on
It seems simple, but I'm already in trouble with the newlines:
warning(200): StackOverflowQuestion.g:11:20: Decision can match input such as "NL" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Graphically, in org.antlr.works.IDE:
Decision Can Match NL Using Multiple Alternatives http://img.skitch.com/20090723-ghpss46833si9f9ebk48x28b82.png
I've kicked the grammar around, but I always end up violating one of the expected behaviors:
- A newline is not required at the end of the file
- Empty lines are acceptable
- Everything in a line from a pound sign onward is discarded as a comment
- Assignments end with end-of-line, not semicolons
- Expressions can span multiple lines if wrapped in brackets
I can find example ANTLR grammars with many of these characteristics. I find that when I cut them down to limit their expressiveness to just what I need, I end up breaking something. Others are too simple, and I break them as I add expressiveness.
Which angle should I take with this grammar? Can you point to any examples that aren't either trivial or full Turing-complete languages?
I would let your tokenizer do the heavy lifting rather than mixing your newline rules into your grammar:
1. Count parentheses, brackets, and braces, and don't generate NL tokens while there are unclosed groups. That'll give you line continuations for free without your grammar being any the wiser.
2. Always generate an NL token at the end of file whether or not the last line ends with a '\n' character, then you don't have to worry about a special case of a statement without a NL. Statements always end with an NL.
The second point would let you simplify your grammar to something like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL
;
empty_line
: NL
;
assignment
: ID '=' expr
;
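The two tokenizer ideas from this answer can be sketched in a few lines of Python. This is a hand-written illustration rather than an ANTLR lexer; the token names (ID, NUM, OP, NL) and the function name lex are assumptions made for the example. It suppresses NL while any bracket is open and makes sure the token stream always ends with an NL.

```python
import re

TOKEN_RE = re.compile(r"""
    (?P<COMMENT>\#[^\n]*)          # sh-style comment, dropped
  | (?P<NL>\n)
  | (?P<WS>[ \t\r]+)               # dropped
  | (?P<ID>[A-Za-z_][A-Za-z_0-9]*)
  | (?P<NUM>\d+)
  | (?P<OP>[-+*/=(),\[\]{}])
""", re.VERBOSE)

OPENERS, CLOSERS = "([{", ")]}"


def lex(source):
    """Yield (type, text) pairs. NL is suppressed inside open brackets,
    and one NL is emitted before EOF if the last line lacks it."""
    depth = 0
    last = "NL"                            # treat start of input as a fresh line
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        pos = m.end()
        kind, text = m.lastgroup, m.group()
        if kind in ("COMMENT", "WS"):
            continue
        if kind == "OP":
            if text in OPENERS:
                depth += 1
            elif text in CLOSERS:
                depth = max(depth - 1, 0)
        if kind == "NL" and depth > 0:
            continue                       # line continuation inside brackets
        last = kind
        yield kind, text
    if last != "NL":
        yield "NL", "\n"                   # guarantee a terminating NL


script = "A = (1 +\n     2)  # comment\nB = A * 2"
print([kind for kind, _ in lex(script)])
```

With a token stream like this, the grammar never sees the newline inside the parentheses, and every assignment is followed by exactly one NL even though the script has no trailing newline.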
How about this?
exprlist
: (expr)? (NL+ expr)* NL!? EOF!
;
expr
: assignment | ...
;
assignment
: ID '=' expr
;
I assume you chose to make NL optional, because the last statement in your input code doesn't have to end with a newline.
While it makes a lot of sense, you are making life a lot harder for your parser. Separator tokens (like NL) should be cherished, as they disambiguate and reduce the chance of conflicts.
In your case, the parser doesn't know if it should parse "assignment NL" or "assignment empty_line". There are many ways to solve it, but most of them are just band-aids for an unwise design choice.
My recommendation is an innocent hack: Make NL mandatory, and always append NL to the end of your input stream!
It may seem a little unsavory, but in reality it will save you a lot of future headaches.