I have a grammar that accepts key / value pairs that appear one per line. The values may be multi-line.
The Eclipse plug-in ANTLR IDE works correctly and accepts a valid test string. However, the generated Java does not accept the same string.
Here is the grammar:
message: block4 ;
block4: STARTBLOCK '4' COLON expr4+ ENDBLOCK ;
expr4: NEWLINE (COLON key COLON expr | '-')+;
key: FIELDVALUE* ;
expr: FIELDVALUE* ;
NEWLINE : ('\n'|'\r') ;
FIELDVALUE : (~('-'|COLON|ENDBLOCK|STARTBLOCK))+;
COLON : ':' ;
STARTBLOCK : '{' ;
ENDBLOCK : '}' ;
ANTLR IDE parses this correctly:
Don't squint... It is dividing up key/expression pairs whether they are single-line values (like 23B / CRED) or multiline values (like 59 / /13212312\r\nRECEIVER NAME S.A\r\n).
Here is the input string:
{4:
:20:007505327853
:23B:CRED
:32A:050902JPY3520000,
:33B:JPY3520000,
:50K:EUROXXXEI
:52A:FEBXXXM1
:53A:MHCXXXJT
:54A:FOOBICXX
:59:/13212312
RECEIVER NAME S.A
:70:FUTURES
:71A:SHA
:71F:EUR12,00
:71F:EUR2,34
-}
When Eclipse runs antlr-3.4-complete.jar on the grammar, it generates SwiftTinyLexer.java and SwiftTinyParser.java. The lexer splits the input into 35 tokens, starting with:
STARTBLOCK
4
COLON
FIELDVALUE
COLON
I would like token 4 to be an expr4 rather than a FIELDVALUE (and the IDE seems to agree with me). But since it is a FIELDVALUE, the parser is choking on that token with line 1:3 required (...)+ loop did not match anything at input '\r\n'.
Why is there a difference between the way that ANTLR 3.4 and ANTLR IDE 2.1.2.201108281759 lex the same string?
Is there a way to fix the grammar so that it matches expr4 before it matches FIELDVALUE?
The IDE input string has a single \n while the Java test code is getting a Windows-style \r\n.
I changed NEWLINE by adding a "1 or more," that is from
NEWLINE : ('\n'|'\r') ;
to
NEWLINE : ('\n'|'\r')+ ;
This allowed the parse to go forward without the lexical error, and now it makes sense why the IDE behaved differently from the generated Java: they were getting slightly different input strings.
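The effect of that change can be illustrated without ANTLR. The following is only a sketch: it uses java.util.regex to approximate the two NEWLINE rules, and the class name, method name, and shortened input strings are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NewlineTokenCount {
    // Counts how often the given pattern matches in the input. This approximates
    // how many NEWLINE tokens a lexer rule of the same shape would emit.
    static int countNewlineTokens(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String unixInput = "{4:\n:20:007505327853\n-}";        // what the IDE saw
        String windowsInput = "{4:\r\n:20:007505327853\r\n-}";  // what the Java test saw

        // Approximates NEWLINE : ('\n'|'\r') ;
        System.out.println(countNewlineTokens("[\\n\\r]", unixInput));     // 2
        System.out.println(countNewlineTokens("[\\n\\r]", windowsInput));  // 4

        // Approximates NEWLINE : ('\n'|'\r')+ ;
        System.out.println(countNewlineTokens("[\\n\\r]+", unixInput));    // 2
        System.out.println(countNewlineTokens("[\\n\\r]+", windowsInput)); // 2
    }
}
```

With the single-character rule, every \r\n pair becomes two NEWLINE tokens, but expr4 expects exactly one NEWLINE before each COLON. With the + version both inputs yield the same token count.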
Related
I'm trying to create an ANTLR v4 grammar with the following set of rules:
1. In case a line starts with #, it is considered a label:
#label
2. In case the line starts with cmd, it is treated as a command:
cmd param1 param2
3. If a line starts with whitespace, it is considered a string. All the text should be extracted. Strings can be multiline, so they end with an empty line:
A long string with multiline support
and any special characters one can imagine.
<-empty line here->
4. Lastly, in case a line starts with anything but whitespace, # and cmd, its first word should be considered a heading:
Heading A long string with multiline support
and any special characters one can imagine.
<-empty line here->
It was easy to handle labels and commands. But I am clueless about strings and headings.
What is the best way to separate whitespace word whitespace whatever doubleNewline and whatever doubleNewline? I've seen a lot of samples with whitespaces, but none of them works with both random text and newlines. I don't expect you to write actual code for me. Suggesting an approach will do.
Something like this should do the trick:
lexer grammar DemoLexer;
LABEL
: '#' [a-zA-Z]+
;
CMD
: 'cmd' ~[\r\n]+
;
STRING
: ' ' .*? NL NL
;
HEADING
: ( ~[# \t\r\nc] | 'c' ~'m' | 'cm' ~'d' ).*? NL NL
;
SPACE
: [ \t\r\n] -> skip
;
OTHER
: .
;
fragment NL
: '\r'? '\n'
| '\r'
;
This does not enforce the "beginning of the line" requirement. If that is something you want, you'll have to add semantic predicates to your grammar, which ties it to a target language. For Java, that would look like this:
LABEL
: {getCharPositionInLine() == 0}? '#' [a-zA-Z]+
;
See:
Semantic predicates in ANTLR4?
https://github.com/antlr/antlr4/blob/master/doc/predicates.md
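The first-character dispatch that the lexer rules above encode can also be written out in plain Java, which may help when reasoning about which rule should win for a given block. This is a hand-written sketch, not ANTLR output: the class and enum names are hypothetical, it assumes the input has already been split into blocks that start at column 0, and it treats a leading tab as whitespace (the STRING rule above only checks for a space).

```java
final class BlockClassifier {
    enum Kind { LABEL, CMD, STRING, HEADING }

    // Classifies one block of text by how its first line starts.
    // Assumption: each block begins at column 0 of a line.
    static Kind classify(String block) {
        if (block.isEmpty()) {
            throw new IllegalArgumentException("empty block");
        }
        if (block.startsWith("#")) {
            return Kind.LABEL;
        }
        if (block.startsWith("cmd")) {
            return Kind.CMD;
        }
        char first = block.charAt(0);
        if (first == ' ' || first == '\t') {
            return Kind.STRING;
        }
        // Anything else: the first word is the heading.
        return Kind.HEADING;
    }

    public static void main(String[] args) {
        System.out.println(classify("#label"));                         // LABEL
        System.out.println(classify("cmd param1 param2"));              // CMD
        System.out.println(classify(" A long string\nwith two lines")); // STRING
        System.out.println(classify("Heading A long string"));          // HEADING
    }
}
```

Note that the order of the checks matters in the same way the order of the lexer rules does: the cmd test has to run before the fallback to HEADING.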
I'm implementing a simple PseudoCode language with ANTLR4, this is my current grammar:
// Define a grammar called PseudoCode
grammar PseudoCode;
prog : FUNCTION SIGNATURE '(' ')'
| FUNCTION SIGNATURE '{' VARB '}' ;
param: VARB | VARB ',' param ;
assignment: VARB '=' NUMBER ;
FUNCTION: 'function' ;
VARB: [a-z0-9]+ ;
SIGNATURE: [a-zA-Z0-9]+ ;
NUMBER: [0-9]+ | [0-9]+ '.' [0-9]+ ;
WS: [ \t\r\n]+ -> skip ;
The problem appears after generating and compiling the parser and lexer, when I run grun PseudoCode prog -tree with an input such as: function bla{bleh}
I keep on getting the following error:
line 1:9 no viable alternative at input 'functionbla'
Can someone point out what is wrong with my grammar?
bla is a VARB, not a SIGNATURE, because it matches both rules and VARB comes first in the grammar. The way you defined your lexer rules, an identifier can only be matched as a SIGNATURE if it contains capital letters.
The simplest solution to this problem would be to have a single lexer rule for identifiers and then use that everywhere where you currently use SIGNATURE or VARB. If you want to disallow capital letters in certain places, you could simply check for this condition in an action or listener, which would also allow you to produce clearer error messages than syntax errors (e.g. "capital letters are not allowed in variable names").
If you absolutely do need capital letters in variable names to be syntax errors, you could define one rule for identifiers with capital letters and one without. Then you could use ID_WITH_CAPITALS | ID_LOWER_CASE_ONLY in places where you want to allow both and ID_LOWER_CASE_ONLY in cases where you only want to allow lower case letters.
PS: You'll also want to make sure that your identifier rule does not match numbers (which both VARB and SIGNATURE currently do). Currently NUMBER tokens will only be generated for numbers with a decimal point.
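The "single identifier rule plus a check" suggestion could look roughly like the sketch below. It assumes a single hypothetical lexer rule for identifiers and shows only the check itself in plain Java; the class name, method name, and message format are invented for illustration, and in a real project the method would be called from a listener or action with the token's text and position.

```java
import java.util.Optional;

final class IdentifierCheck {
    // Returns an error message if the identifier text contains a capital letter,
    // or an empty Optional if the name is acceptable as a variable name.
    static Optional<String> checkVariableName(String name, int line, int column) {
        for (int i = 0; i < name.length(); i++) {
            if (Character.isUpperCase(name.charAt(i))) {
                return Optional.of("line " + line + ":" + column
                        + " capital letters are not allowed in variable names: '" + name + "'");
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(checkVariableName("bleh", 1, 13).isPresent()); // false
        System.out.println(checkVariableName("Bleh", 1, 13).isPresent()); // true
    }
}
```

Because the check runs after parsing, the parser accepts both spellings and the user sees a targeted message instead of "no viable alternative".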
This is my first crack at parser generators, and, consequently ANTLR. I'm using ANTLR v4 trying to generate a simple practice parser for Morse Code with the following extra rules:
A letter (e.g., ... [the letter 's']) can be denoted as capitalized if a '^' precedes it
ex.: ^... denotes a capital 'S'
Special characters can be embedded in parentheses
ex.: (#)
Each encoded entity will be separated by whitespace
So I could encode the following sentence:
ABC a#b.com
as (with corresponding letters shown underneath):
^.- ^-... ^-.-. ( ) .- (#) -... (.) -.-. --- --
A B C ' ' a '#' b '.' c o m
Particularly note the two following entities: ( ) (which denotes a space) and (.) (which denotes a period).
There is mainly one thing that I'm finding hard to wrap my head around: the same token can take on different meanings depending on whether it is in parentheses or not. That is, I want to tell ANTLR that I want to discard whitespace, yet not in the ( ) case. Also, a Morse Code character can consist of dots-and-dashes (periods-and-dashes), yet I don't want to consider the period in (.) as "any character".
Here is the grammar I have got so far:
grammar MorseCode;
file: entity*;
entity:
special
| morse_char;
special: '(' SPECIAL ')';
morse_char: '^'? (DOT_OR_DASH)+;
SPECIAL : .; // match any character
DOT_OR_DASH : ('.' | '-');
WS : [ \t\r\n]+ -> skip; // we don't care about whitespace (or do we?)
When I try it against the following input:
^... --- ...(#)
I get the following output (from grun ... -tokens):
[#0,0:0='^',<1>,1:0]
[#1,1:1='.',<4>,1:1]
...
[#15,15:14='<EOF>',<-1>,1:15]
line 1:1 mismatched input '.' expecting DOT_OR_DASH
It seems there is trouble with ambiguity between SPECIAL and DOT_OR_DASH?
It seems like your (#) syntax behaves like a quoted string in other programming languages. I would start by defining SPECIAL as:
SPECIAL : '(' .*? ')';
To ensure that . . and .. are actually different, you can use this:
SYMBOL : [.-]+;
Then you can define your ^ operator:
CARET : '^';
With these three tokens (and leaving WS as-is), you can simplify your parser rules significantly:
file
: entity* EOF
;
entity
: morse_char
| SPECIAL
;
morse_char
: CARET? SYMBOL
;
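To see which tokens these three rules would produce, here is a rough stand-in written with java.util.regex instead of ANTLR. It is only an approximation under stated assumptions: the class and method names are hypothetical, and the regex alternation is tried in order rather than by longest match, which gives the same result here only because each alternative starts with a different character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MorseTokens {
    // One named group per token type; WS is matched but not emitted (like "-> skip").
    private static final Pattern TOKEN = Pattern.compile(
            "(?<SPECIAL>\\(.*?\\))|(?<SYMBOL>[.-]+)|(?<CARET>\\^)|(?<WS>[ \\t\\r\\n]+)",
            Pattern.DOTALL);

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("unexpected character at index " + pos);
            }
            if (m.group("SPECIAL") != null) {
                tokens.add("SPECIAL:" + m.group());
            } else if (m.group("SYMBOL") != null) {
                tokens.add("SYMBOL:" + m.group());
            } else if (m.group("CARET") != null) {
                tokens.add("CARET:" + m.group());
            } // otherwise whitespace: skipped
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [CARET:^, SYMBOL:..., SYMBOL:---, SYMBOL:..., SPECIAL:(#)]
        System.out.println(tokenize("^... --- ...(#)"));
    }
}
```

Since the parentheses are part of the SPECIAL token, the space in ( ) and the period in (.) never reach the WS or SYMBOL alternatives.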
I want to parse a language in which statements are separated by EOLs. I tried this in the lexer grammar (copied from an example in the docs):
EOL : ('\r'? '\n')+ ; // any number of consecutive linefeeds counts as a single EOL
and then used this in the parser grammar:
stmt_sequence : (stmt EOL)* ;
The parser rejected code with statements separated by one or more blank lines.
However, this was successful:
EOL : '\r'? '\n' ;
stmt_sequence : (stmt EOL+)* ;
I'm an ANTLR newbie. It seems like both should work. Is there something about greedy/nongreedy lexer scanning that I don't understand?
I tried this with both 3.2 and 3.4; I'm running the ANTLR IDE in Eclipse Indigo on OS X 10.6.
Thanks.
The error was not in the original grammar but in the input data. I was using an editor (in Eclipse) that automatically inserted tabs after an EOL, so my "blank lines" were not really blank.
I modified the grammar as follows:
fragment SPACE: ' ' | '\t';
EOL : ( '\r'? '\n' SPACE* )+;
This grammar works as expected.
The lesson here is that one must be careful with whitespace. The lexer may see whitespace in the input that the parser does not see (because it has already been sent to the hidden channel).
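The difference between the two EOL rules on a "blank" line that actually contains a tab can be checked with a small regex experiment. This is a sketch only: the regular expressions approximate the lexer rules, and the class name, method name, and sample input are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EolMatchLength {
    // Returns how many characters the pattern consumes at the start of the input,
    // or 0 if it does not match there.
    static int matchedLength(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.end() : 0;
    }

    public static void main(String[] args) {
        // A line break, an auto-inserted tab, another line break, then the next statement.
        String input = "\n\t\nnext";

        // Approximates EOL : ('\r'? '\n')+ ;  stops at the tab
        System.out.println(matchedLength("(\\r?\\n)+", input));        // 1

        // Approximates EOL : ( '\r'? '\n' SPACE* )+ ;  swallows the tab and second break
        System.out.println(matchedLength("(\\r?\\n[ \\t]*)+", input)); // 3
    }
}
```

With the first rule the parser sees EOL, then (after the tab is hidden) a second EOL, which (stmt EOL)* does not allow; with the second rule it sees a single EOL.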
How do you do something like this with ANTLR?
Example input:
title: hello world
Grammar:
header : IDENT ':' REST_OF_LINE ;
IDENT : 'a'..'z'+ ;
REST_OF_LINE : ~'\n'* '\n' ;
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
(I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.)
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
You must understand that the lexer operates independently from the parser. No matter what the parser would "like" to match at a certain time, the lexer simply creates tokens following some strict rules:
1. try to match tokens from top to bottom in the lexer rules (rules defined first are tried first);
2. match as much text as possible. In case 2 rules match the same amount of text, the rule defined first will be matched.
Because of rule 2, your REST_OF_LINE will always "win" over the IDENT rule. The only time an IDENT token will be created is when there's no more \n at the end. That is what's going wrong with your grammar: the error message states that it expects an IDENT token, which isn't found (but a REST_OF_LINE token is produced).
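The two rules can be mimicked with a few lines of Java to see why REST_OF_LINE wins. This is a simplified model rather than ANTLR's actual implementation: it uses java.util.regex patterns as stand-ins for the lexer rules, ignores the implicit ':' token, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LongestMatchDemo {
    // Returns the name of the rule that matches the most text at the start of the
    // input. On a tie the rule defined first wins, because a later rule only
    // replaces the current best when it is strictly longer.
    static String firstTokenType(Map<String, Pattern> rules, String input) {
        String bestRule = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            if (m.lookingAt() && m.end() > bestLength) {
                bestRule = rule.getKey();
                bestLength = m.end();
            }
        }
        return bestRule; // null if no rule matches
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the rules in definition order.
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("IDENT", Pattern.compile("[a-z]+"));
        rules.put("REST_OF_LINE", Pattern.compile("[^\\n]*\\n"));

        System.out.println(firstTokenType(rules, "title: hello world\n")); // REST_OF_LINE
        System.out.println(firstTokenType(rules, "title"));                // IDENT
    }
}
```

IDENT matches only the 5 characters of title, while REST_OF_LINE matches the whole line, so the lexer never hands the parser the IDENT it is waiting for.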
I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.
You can't just define tokens (lexer rules) you want to apply to the header of a file. These tokens will also apply to the rest of the more complex file. Perhaps you should pre-process the header separately from the rest of the file?
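One way to pre-process the header separately is to split it off in ordinary code before the rest of the file goes to the ANTLR-generated parser. The sketch below assumes a single-line key: value header; the class and method names are hypothetical.

```java
final class HeaderPreprocessor {
    // Splits a header line such as "title: hello world" at the first colon
    // and returns { key, value } with surrounding whitespace trimmed.
    static String[] splitHeader(String line) {
        int colon = line.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("not a header line: " + line);
        }
        String key = line.substring(0, colon).trim();
        String value = line.substring(colon + 1).trim();
        return new String[] { key, value };
    }

    public static void main(String[] args) {
        String[] header = splitHeader("title: hello world\n");
        System.out.println(header[0]); // title
        System.out.println(header[1]); // hello world
    }
}
```

The remaining input can then be lexed with rules that do not need a catch-all token like REST_OF_LINE.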
ANTLR parsing is usually done in 2 steps.
1. Construct your AST
2. Define your grammar
Pseudo code (it has been a few years since I played with ANTLR) - AST:
WORD : 'a'..'z'+ ;
SEPARATOR : ':';
SPACE : ' ';
pseudo code - tree parser:
header : WORD SEPARATOR (SPACE WORD)+ ;
Hope that helps....