skipping whitespace in ANTLR v4 depending on context

In my ANTLR4 grammar, I would like to skip whitespace in general, in order to keep the grammar as simple as possible. For this purpose I use the lexer rule WS : [ \t\r\n]+ -> skip;.
However, there may be certain sections in the input where whitespace matters. One example is tables that are either tab-delimited or where the spaces must be counted to find out which number belongs to which column.
If I could switch off skipping the whitespace in between some begin and end symbols (table{ ... }), this would be perfect. Is it possible?
If not, are there other solutions to switch between different lexer rules depending on the context?

Take a look at context-sensitive tokens with lexical modes. It's covered in more depth in the book "The Definitive ANTLR 4 Reference" -- Chapter 12. I think you should be able to pull it off with this.
Declare a rule that will change to the "skip spaces mode", and back to the default one.
OPEN: '<' -> mode(SKIP_SPACES);
mode SKIP_SPACES;
CLOSE: '>' -> mode(DEFAULT_MODE);
WS : [ \t\r\n]+ -> skip;
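For the table{ ... } case from the question, a rough sketch could invert this: keep skipping whitespace in the default mode and make it significant only inside a table mode. All grammar, mode and token names below are made up, and the placeholder rules stand in for whatever the real format needs:

lexer grammar TableAwareLexer;   // hypothetical name

TABLE_START : 'table{' -> pushMode(TABLE);
WS          : [ \t\r\n]+ -> skip;        // whitespace does not matter out here
WORD        : [a-zA-Z]+;                 // placeholder for the normal tokens

mode TABLE;
TABLE_END   : '}' -> popMode;
TAB         : '\t';                      // significant inside a table
SPACES      : ' '+;
TABLE_NL    : '\r'? '\n';
NUMBER      : [0-9]+;
CELL_TEXT   : ~[ \t\r\n}]+;              // anything else inside a cell

The parser (or host code) can then count TAB and SPACES tokens per row to decide which column a NUMBER belongs to.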

Conditionally skipping an ANTLR lexer rule based on current line number

I have this pair of rules in my ANTLR lexer grammar, which match the same pattern, but have mutually exclusive predicates:
MAGIC: '#' ~[\r\n]* {getLine() == 1}? ;
HASH_COMMENT: '#' ~[\r\n]* {getLine() != 1}? -> skip;
When I look at the tokens in the ANTLR Preview, the token comes out as MAGIC regardless of which line I'm on, so it seems like the predicate isn't being used at all.
I also tried a different approach to try and work around this:
tokens { MAGIC }
HASH_COMMENT: '#' ~[\r\n]* {if (getLine() == 1) setType(MAGIC); else skip();};
But now, both come out as HASH_COMMENT.
I really expected the first attempt using two predicates to work, so that was surprising, but now it seems like the action doesn't work either, which is even more odd.
How do I make this work?
I'd rather not try to match "#usda ..." as a different token because that comment could occur further down the file, and it should be treated as a normal comment unless it's on the first line.
I would not try to force semantics in the parse step. The letter combination is a HASH_COMMENT, period.
Instead I would handle that as normal syntax and handle anything special you might need in the step after parsing. For example:
document: HASH_COMMENT? content EOF;
This way you define a possible HASH_COMMENT (which you might interpret as MAGIC later, without needing such a token type) before any content. It might not be on line one, but it comes before anything else, which resembles a real document better, since whitespace may precede the hash comment.
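A rough, self-contained sketch of that shape (the grammar name, the content rule and the OTHER token are placeholders for the real language; whether the leading HASH_COMMENT is the "magic" line is then decided in host code, e.g. by looking at its text or line number):

grammar Usda;   // hypothetical name

document     : HASH_COMMENT? content EOF;   // possible leading comment, interpreted as "magic" later
content      : (HASH_COMMENT | OTHER)*;     // stand-in for the real content rules

HASH_COMMENT : '#' ~[\r\n]*;                // always the same token type, no predicates needed
WS           : [ \t\r\n]+ -> skip;
OTHER        : ~[ \t\r\n#]+;                // placeholder token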

How to tokenize Java8 program using antlr

Currently I am using the Java8.g4 grammar for Java 8 from this repo:
https://github.com/antlr/grammars-v4
However, I was wondering how I can modify the Java8.g4 file to make sure that, when I encounter multiple consecutive new lines, I only tokenize one of them?
Referring to Parsing Newlines, EOF as End-of-Statement Marker with ANTLR3, I can add newlines to the parse tree by adding NEWLINE: ('\r\n'|'\n'|'\r'); to the .g4 file. However, if I have multiple new lines, multiple NEWLINE tokens will be produced and added to the tree, which is not what I want.
Hope someone can help me out!
Thanks
I guess you mean that the whitespace is not kept in the token list produced by the lexer, right? This happens when whitespace is skipped in the grammar. Look for the WS rule, something like
WS: [ \t\r\n\u000C]+ -> skip;
and change it to
WS: [ \t\r\n\u000C]+ -> channel(HIDDEN);
This way the whitespace is kept on the hidden channel and you can read it via the CommonTokenStream instance, but it does not get in the way (just like with skip).
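If the goal from the question is to see at most one token per run of blank lines, a small sketch (these are not the rules that ship in Java8.g4, just an illustration) could collapse consecutive line breaks in the lexer and keep everything on the hidden channel, so the existing parser rules stay untouched:

NEWLINE : ('\r'? '\n' | '\r')+ -> channel(HIDDEN);   // a whole run of line breaks becomes one token
WS      : [ \t\u000C]+        -> channel(HIDDEN);    // remaining whitespace

If NEWLINE should instead show up in the parse tree, leave it on the default channel, but then every parser rule that a line break may cross has to mention it.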

antlr reverse phrase match issue

I have a rule like this,
BLOCK_COMMENT
: ('/*' ~[!] .*? '*/' | '/**/') -> channel(HIDDEN);
But when I try to match this line,
/**/and /**/1=1
The and symbol is HIDDEN as well. Since ANTLR is greedy, it matched the last occurrence of */, and it ends up with only one BLOCK_COMMENT (I was expecting two).
So, I will need something that matches not '*/', and the BLOCK_COMMENT rule should become:
'/*' then not '*/' then '*/'
Anyone know what rules can match not '*/'?
First, here is a quote from the book "The Definitive ANTLR 4 Reference" on the ~ operator in lexer rules:
~x: Match any single character not in the set described by x. Set x can be a single character literal, a range, or a subrule set like ~('x'|'y'|'z') or ~[xyz].
So basically we can't use something like ~'*/'.
Since you need to interpret the comments themselves as well, the best way to do it IMHO is with lexer modes.
...
COMMENT_START : '/*' -> mode (COMMENT_MODE);
mode COMMENT_MODE;
COMMENT_END : '*/' -> mode (DEFAULT_MODE);
//match anything else that you need in this mode
...
I have assumed that you only have one mode in addition to the default one. Of course if you have more of them, you can also use popMode and pushMode.
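As a sketch of that idea (a variant of the fragment above; the ~[!] special case and the rest of the lexer are left out), the comment body can be collected with more, so the whole comment is still emitted as a single token on the hidden channel:

COMMENT_START : '/*' -> more, mode(COMMENT_MODE);

mode COMMENT_MODE;
BLOCK_COMMENT : '*/' -> channel(HIDDEN), mode(DEFAULT_MODE);  // the entire /* ... */ becomes one hidden token
COMMENT_CHAR  : .    -> more;                                 // keep collecting the comment body

With this, /**/and /**/1=1 should produce two hidden BLOCK_COMMENT tokens, while and and 1=1 stay visible to the parser.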

Writing parser rules sensitive to whitespace while skipping WS from the lexer

I am having some trouble handling whitespace. In the following excerpt of a grammar, I set up the lexer so that whitespace is skipped:
ENTITY_VAR
: 'user'
| 'resource'
;
INT : DIGIT+ | '-' DIGIT+ ;
ID : LETTER (LETTER | DIGIT | SPECIAL)* ;
ENTITY_ID : '__' ENTITY_VAR ('_w_' ID)?;
NEWLINE : '\r'? '\n';
WS : [ \t\r\n]+ -> skip; // skip spaces, tabs, newlines
fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
fragment SPECIAL : ('_' | '#' );
The problem is, I would like to match against variables names of the form ENTITY_ID such that the matched string does not have any whitespace. It would be sufficient to write it as a lexer rule as I did here, but the thing is that I'd like to do it with a parser rule instead, because I want to have direct access to those two tokens ENTITY_VAR and ID individually from my code, and not squeeze them back together in a whole token ENTITY_ID.
Any ideas, please?
Basically any solution which let me access directly ENTITY_VAR and ID would suit me, both by leaving ENTITY_ID as a lexer rule or moving it to the parser.
There are several approaches I can think of (in no particular order):
1. Emit several tokens from the rule ENTITY_ID. See ANTLR4: How to inject tokens for inspiration.
2. Allow whitespace in the parser and check afterwards.
3. Use the single token and split it in code.
4. Use the single token and modify the token stream before passing it to the parser, i.e. lex, modify the ENTITY_ID tokens by splitting them into several other tokens, then pass this stream to the parser.
5. Don't skip whitespace, and when dealing with these "extra tokens" check whether they are within an ENTITY_ID part (=> error) or not (=> ignore them).
6. Don't skip whitespace and add "WS*" everywhere in your grammar where whitespace is allowed (OK if the grammar is not too large).
7. Insert predicates in the parser rule that check whether there is whitespace between the parts.
8. Create a "trap" rule like this:
   INVALID_ENTITY_ID : '__' WS+ ENTITY_VAR WS? ('_w_' WS? ID)?
                     | '__' WS? ENTITY_VAR WS+ ('_w_' WS? ID)?
                     | '__' WS? ENTITY_VAR WS? ('_w_' WS+ ID)
                     ;
   This will catch invalid ENTITY_IDs, since the match is longer than the individual parts, which would then also come out as separate tokens.
I'd go with 2, if it doesn't alter the parse in the "non error" case, i.e. no code is interpreted differently by allowing whitespace.
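A rough sketch of option 2, assuming a combined grammar and that the ENTITY_ID lexer rule is dropped so the parts reach the parser as separate tokens (the rule name entityId is made up):

entityId : '__' ENTITY_VAR ('_w_' ID)? ;
// Whitespace between the parts is silently accepted here, because WS is skipped.
// Afterwards, e.g. in a parse-tree listener, compare each token's getStopIndex() + 1
// with the next token's getStartIndex() and report an error if they are not adjacent.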
As far as I managed to understand by browsing the documentation, it doesn't look like something like that is feasible.
Parser rules seem to work just on the default channel, so I can't send WS to channel(HIDDEN) and then recover it just for a single parser rule.
On the other hand, an author of ANTLR explains here that it's not possible to break a token into smaller tokens since version 4.
Even though I don't like it at all, it seems that the fastest way is to match it in the lexer (as in the code from the question) and then re-parse the whole string again from Java.
Still, any other better option or correction to my conclusions is welcome.
Hooking two parsers together in a sort of pipeline, as your own answer suggests, is a sound and simple design/solution, and I'm pretty sure ANTLR is capable of helping with that.
I don't know how far the ANTLR folks have gone in their work on stream/feed parsing. But adopting a two-pass strategy should be efficient enough, as the first pass would be just lexing a regular language, which is O(c * N) over the size of the input with a very small c.
If you want a single pass that costs O(k * N) (with a large k), you could consider PEG, for which there are implementations in Java (which I haven't tried).

Simple Island Grammar in ANTLR 4: Token Recognition Error

Apparently, I wasn't able to deduce the answer to my problem from existing posts on token recognition errors with island grammars here, so I hope someone can give me advice on how to do this correctly.
Basically, I am trying to write a language that contains preprocessor directives. I narrowed my problem down to a very simple example. In my example language, the following should be valid syntax:
##some preprocessor text
PRINT some regular text
When parsing the code, I want to be able to identify the tokens "some preprocessor text", "PRINT" and "some regular text".
This is the parser grammar:
parser grammar myp;
root: (preprocessor | command)*;
preprocessor: PREPROC PREPROCLINE;
command: PRINT STRINGLINE;
This is the lexer grammar:
lexer grammar myl;
PREPROC: '##' -> pushMode(PREPROC_MODE);
PRINT: 'PRINT' -> pushMode(STRING_MODE);
WS: [ \t\r\n] -> skip;
mode PREPROC_MODE;
PREPROCLINE: (~[\r\n])*[\r\n]+ -> popMode;
mode STRING_MODE;
STRINGLINE: (~[\r\n])*[\r\n]+ -> popMode;
When I parse the above example code, I get the following error:
line 1:2 extraneous input 'some preprocessor text\r\n' expecting PREPROCLINE
line 2:5 token recognition error at: ' some regular text'
This error occurs regardless of whether the line "WS: [ \t\r\n] -> skip;" is included in the lexer grammar or not. I guess that if I delimited the tokens PREPROCLINE and STRINGLINE with quotes instead of line endings, it would work (at least I have successfully implemented regular quoted strings in other languages). But in this particular language, I really want to have the strings without quotes.
Any help on why this error is occurring or how to implement a preprocessor language with unquoted strings is very appreciated.
Thanks
Updated: First, the recognition errors are because your parser needs to reference the lexer tokens. Add the options block to your parser:
options {
tokenVocab=MyLexer;
}
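With the grammar names from the question, that would look roughly like this; tokenVocab has to name the lexer grammar (myl here), and the lexer has to be generated first so its .tokens file can be found:

parser grammar myp;

options { tokenVocab = myl; }   // must match the lexer grammar's name

root         : (preprocessor | command)*;
preprocessor : PREPROC PREPROCLINE;
command      : PRINT STRINGLINE;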
Second, when you generate your lexer/parser, be aware that the warnings usually need to be considered and corrected before proceeding.
Finally, these are all working alternatives, once you add the options block.
XXXX: (~[\r\n])*[\r\n]+ -> popMode;
is a bit cleaner as:
XXXX: .*? '\r'? '\n' -> popMode;
To not include the line endings in the token, try
XXXX: ~[\r\n]+ -> popMode;
(the line break itself is then handled by the WS rule once the lexer is back in the default mode).
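Putting the pieces together, the lexer might end up looking roughly like this (a sketch using the "no line endings" variant; note that the blank right after PRINT will end up at the start of STRINGLINE and may need trimming in host code):

lexer grammar myl;

PREPROC : '##'    -> pushMode(PREPROC_MODE);
PRINT   : 'PRINT' -> pushMode(STRING_MODE);
WS      : [ \t\r\n]+ -> skip;

mode PREPROC_MODE;
PREPROCLINE : ~[\r\n]+ -> popMode;   // the line break itself is skipped by WS back in the default mode

mode STRING_MODE;
STRINGLINE  : ~[\r\n]+ -> popMode;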