ANTLR treating multiple EOLs as one? - antlr

I want to parse a language in which statements are separated by EOLs. I tried this in the lexer grammar (copied from an example in the docs):
EOL : ('\r'? '\n')+ ; // any number of consecutive linefeeds counts as a single EOL
and then used this in the parser grammar:
stmt_sequence : (stmt EOL)* ;
The parser rejected code with statements separated by one or more blank lines.
However, this was successful:
EOL : '\r'? '\n' ;
stmt_sequence : (stmt EOL+)* ;
I'm an ANTLR newbie. It seems like both should work. Is there something about greedy/nongreedy lexer scanning that I don't understand?
I tried this with both 3.2 and 3.4; I'm running the ANTLR IDE in Eclipse Indigo on OS X 10.6.
Thanks.

The error was not in the original grammar; but in the input data. I was using an editor (in Eclipse) that automatically inserted tabs after an EOL, so my "blank lines" were not really blank.
I modified the grammar as follows:
fragment SPACE: ' ' | '\t';
EOL : ( '\r'? '\n' SPACE* )+;
This grammar works as expected.
The lesson here is that one must be careful with white spaces. The lexer may see white spaces in the input that the parser does not see (because it has already been sent to the hidden channel).

Related

Handle strings starting with whitespaces

I'm trying to create an ANTLR v4 grammar with the following set of rules:
1.In case a line starts with #, it is considered a label:
#label
2.In case the line starts with cmd, it is treated as a command
cmd param1 param2
3.If a line starts with a whitespace, it is considered a string. All the text should be extracted. Strings can be multiline, so they end with an empty line
A long string with multiline support
and any special characters one can imagine.
<-empty line here->
4.Lastly, in case a line starts with anything but whitespace, # and cmd, it's first word should be considered a heading.
Heading A long string with multiline support
and any special characters one can imagine.
<-empty line here->
It was easy to handle lables and commands. But I am clueless about strings and headings.
What is the best way to separate whitespace word whitespace whatever doubleNewline and whatever doubleNewline? I've seen a lot of samples with whitespaces, but none of them works with both random text and newlines. I don't expect you to write actual code for me. Suggesting an approach will do.
Something like this should do the trick:
lexer grammar DemoLexer;
LABEL
: '#' [a-zA-Z]+
;
CMD
: 'cmd' ~[\r\n]+
;
STRING
: ' ' .*? NL NL
;
HEADING
: ( ~[# \t\r\nc] | 'c' ~'m' | 'cm' ~'d' ).*? NL NL
;
SPACE
: [ \t\r\n] -> skip
;
OTHER
: .
;
fragment NL
: '\r'? '\n'
| '\r'
;
This does not mandate the "beginning of the line" requirement. If that is something you want, you'll have to add semantic predicates to your grammar, which ties it to a target language. For Java, that would look like this:
LABEL
: {getCharPositionInLine() == 0}? '#' [a-zA-Z]+
;
See:
Semantic predicates in ANTLR4?
https://github.com/antlr/antlr4/blob/master/doc/predicates.md

Antlr4 extraneous input even though it's necessary?

I am new to writing Antlr grammars and I'm having some trouble. I have simplified my issue down to the following:
Here is my grammar:
grammar Test;
prog : stmt* ;
stmt : INT NEWLINE ;
INT : [0-9]+ ;
NEWLINE : '\r'? '\n' ;
My input.txt is simply 0\r\n (not the literals '\', 'r', '\', 'n', but the ASCII codes 0x0D 0x0A
I get the following error:
line 1:1 extraneous input '\r\n' expecting {<EOF>, INT}
Why does it expect another INT without first matching a NEWLINE when the stmt rule requires that the INT be followed by a NEWLINE?
If it's relevant, I'm using PyCharm and the Antlr4 plugin to do the grammar generation.
As it turns out, the issue was related to Python's odd behavior when running modules that import modules that have been altered (hint: they're not automatically reloaded as they should be). The solution was to simply restart the interpreter when you change the grammar.
I'm sorry that the original question didn't have enough information for anyone else to figure out that answer.

How do I ignore arbitrary stuff inside braces in ANTLR?

I am trying to write a config file grammar and get ANTLR4 to handle it. I am quite new to ANTLR (this is my first project with it).
Largely, I understand what needs to be done (or at least I think I do) for most of the config file grammar, but the files that I will be reading will have arbitrary C code inside of curly braces. Here is an example:
Something like:
#DEVICE: servo "servos are great"
#ACTION: turnRight "turning right is fun"
{
arbitrary C source code goes here;
some more arbitrary C source code;
}
#ACTION: secondAction "this is another action"
{
some more code;
}
And it could be many of those. I can't seem to get it to understand that I want to just ignore (without skipping) the source code. Here is my grammar so far:
/**
ANTLR4 grammar for practicing
*/
grammar practice;
file: (devconfig)*
;
devconfig: devid (action)+
;
devid: DEV_HDR (COMMENT)?
;
action: ACTN_HDR '{' C_BLOCK '}'
;
DEV_HDR: '#DEVICE: ' ALPHA+(IDCHAR)*
;
fragment
ALPHA: [a-zA-Z]
;
fragment
IDCHAR: ALPHA
| [0-9]
| '_'
;
COMMENT: '"' .*? '"'
;
ACTN_HDR: '#ACTION: ' ACTION_ID
;
fragment
ACTION_ID: ALPHA+(IDCHAR)*
;
C_BLOCK: WHAT DO I PUT HERE?? -> channel(HIDDEN)
;
WS: [ \t\n\r]+ -> skip
;
The problem is that whatever I put in the C_BLOCK lexer rule seems to screw up the whole thing - like if I put .*? -> channel(HIDDEN), it doesn't seem to work at all (of course, there is an error when using ANTLR on the grammar to the tune of ".*? can match the empty string" - but what should I put there if not that, so that it ignores the C code, but in such a way that I can access it later (i.e., not skipping it)?
Your C_BLOCK rule can be defined just like the usual multi line comment rule is done in so many languages. Make the curly braces part of the rule too:
C_BLOCK: CURLY .*? CURLY -> channel(HIDDEN);
If you need to nest blocks you write something like:
C_BLOCK: CURLY .*? C_BLOCK? .*? CURLY -> channel(HIDDEN);
or maybe:
C_BLOCK:
CURLY (
C_BLOCK
| .
)*?
CURLY
;
(untested).
Update: changed code to use the non-greedy kleene operator as suggested by a comment.

Antlr generated Java doesn't match Antlr IDE

I have a grammar that accepts key / value pairs that appear one per line. The values may be multi-line.
The Eclipse plug-in ANTLR IDE works correctly and accepts a valid test string. However, the generated Java does not accept the same string.
Here is the grammar:
message: block4 ;
block4: STARTBLOCK '4' COLON expr4+ ENDBLOCK ;
expr4: NEWLINE (COLON key COLON expr | '-')+;
key: FIELDVALUE* ;
expr: FIELDVALUE* ;
NEWLINE : ('\n'|'\r') ;
FIELDVALUE : (~('-'|COLON|ENDBLOCK|STARTBLOCK))+;
COLON : ':' ;
STARTBLOCK : '{' ;
ENDBLOCK : '}' ;
ANTLR IDE parses this correctly:
Don't squint... It is dividing up key/expression pairs whether they are single-line values (like 23B / CRED) or multiline values (like 59 / /13212312\r\nRECEIVER NAME S.A\r\n).
Here is the input string:
{4:
:20:007505327853
:23B:CRED
:32A:050902JPY3520000,
:33B:JPY3520000,
:50K:EUROXXXEI
:52A:FEBXXXM1
:53A:MHCXXXJT
:54A:FOOBICXX
:59:/13212312
RECEIVER NAME S.A
:70:FUTURES
:71A:SHA
:71F:EUR12,00
:71F:EUR2,34
-}
When Eclipse runs anltr-3.4-complete.jar on the grammar, it generates SwiftTinyLexer.java and SwiftTinyParser.java. The lexer lexes them into 35 tokens, starting with:
STARTBLOCK
4
COLON
FIELDVALUE
COLON
I would like token 4 to be an expr4 rather than a FIELDVALUE (and the IDE seems to agree with me). But since it is a FIELDVALUE, the parser is choking on that token with line 1:3 required (...)+ loop did not match anything at input '\r\n'.
Why is there a difference between the way that anltr 3.4 and ANTLR IDE 2.1.2.201108281759 lex the same string?
Is there a way to fix the grammar so that it matches expr4 before it matches FIELDVALUE?
The IDE input string has a single \n while the Java test code is getting a Windows-style \r\n.
I changed NEWLINE by adding a "1 or more," that is from
NEWLINE : ('\n'|'\r') ;
to
NEWLINE : ('\n'|'\r')+ ;
This allowed the parse go forward without the lexical error, and now it makes sense why the IDE behaved differently from generated Java: They were getting slightly different input strings.

Parsing Newlines, EOF as End-of-Statement Marker with ANTLR3

My question is in regards to running the following grammar in ANTLRWorks:
INT :('0'..'9')+;
SEMICOLON: ';';
NEWLINE: ('\r\n'|'\n'|'\r');
STMTEND: (SEMICOLON (NEWLINE)*|NEWLINE+);
statement
: STMTEND
| INT STMTEND
;
program: statement+;
I get the following results with the following input (with program as the start rule), regardless of which newline NL (CR/LF/CRLF) or integer I choose:
"; NL" or "32; NL" parses without error.
";" or "45;" (without newlines) result in EarlyExitException.
"NL" by itself parses without error.
"456 NL", without the semicolon, results in MismatchedTokenException.
What I want is for a statement to be terminated by a newline, semicolon, or semicolon followed by newline, and I want the parser to eat as many contiguous newlines as it can on a termination, so "; NL NL NL NL" is just one termination, not four or five. Also, I would like the end-of-file case to be a valid termination as well, but I don't know how to do that yet.
So what's wrong with this, and how can I make this terminate nicely at EOF? I'm completely new to all of parsing, ANTLR, and EBNF, and I haven't found much material to read on it at a level somewhere in between the simple calculator example and the reference (I have The Definitive ANTLR Reference, but it really is a reference, with a quick start in the front which I haven't yet got to run outside of ANTLRWorks), so any reading suggestions (besides Wirth's 1977 ACM paper) would be helpful too. Thanks!
In case of input like ";" or "45;", the token STMTEND will never be created.
";" will create a single token: SEMICOLON, and "45;" will produce: INT SEMICOLON.
What you (probably) want is that SEMICOLON and NEWLINE never make it to real tokens themselves, but they will always be a STMTEND. You can do that by making them so called "fragment" rules:
program: statement+;
statement
: STMTEND
| INT STMTEND
;
INT : '0'..'9'+;
STMTEND : SEMICOLON NEWLINE* | NEWLINE+;
fragment SEMICOLON : ';';
fragment NEWLINE : '\r' '\n' | '\n' | '\r';
Fragment rules are only available for other lexer rules, so they will never end up in parser (production) rules. To emphasize: the grammar above will only ever create either INT or STMTEND tokens.