ANTLR - Checking for a String's "contruction" - antlr

Currently working with ANTLR and found out something interesting that's not working as I intended.
I try to run something along the lines of "test 10 cm" through my grammar and it fails, however "test 10 c m" works as the previous should. The "cm" portion of the code is what I call "wholeunit" in my grammar and it is as follows:
wholeunit :
siunit
| unitmod siunit
| wholeunit NUM
| wholeunit '/' wholeunit
| wholeunit '.' wholeunit
;
What it's doing right now is the "unitmod siunit" portion of the rule where unitmod = c and siunit = m .
What I'd like to know is how would I make it so the grammar would still follow the rule "unitmod siunit" without the need for a space in the middle, I might be missing something huge. (Yes, I have spaces and tabs marked to be skipped)

Probable cause is "cm" being considered another token together (possibly same token type as "test"), rather than "c" and "m" as separate tokens.
Remember that in ANTLR lexer, the rule matching the longest input wins.
One solution would possibly be to make the wholeunit a lexer rule rather than parser rule, and make sure it's above the rule that matches any word (like "test") - if same input can be matched by multiple rules, ANTLR selects the first rule in order they're defined in.

Related

Conditionally skipping an ANTLR lexer rule based on current line number

I have this pair of rules in my ANTLR lexer grammar, which match the same pattern, but have mutually exclusive predicates:
MAGIC: '#' ~[\r\n]* {getLine() == 1}? ;
HASH_COMMENT: '#' ~[\r\n]* {getLine() != 1}? -> skip;
When I look at the tokens in the ANTLR Preview, I see:
So it seems like the predicate isn't being used, and regardless of the line I'm on, the token comes out as MAGIC.
I also tried a different approach to try and work around this:
tokens { MAGIC }
HASH_COMMENT: '#' ~[\r\n]* {if (getLine() == 1) setType(MAGIC); else skip();};
But now, both come out as HASH_COMMENT:
I really expected the first attempt using two predicates to work, so that was surprising, but now it seems like the action doesn't work either, which is even more odd.
How do I make this work?
I'd rather not try to match "#usda ..." as a different token because that comment could occur further down the file, and it should be treated as a normal comment unless it's on the first line.
I would not try to force semantics in the parse step. The letter combination is a HASH_COMMENT, period.
Instead I would handle that as normal syntax and handle anything special you might need in the step after parsing. For example:
document: HASH_COMMENT? content EOF;
This way you define a possible HASH_COMMENT (which you might interpret as MAGIC later, without using such a token type) before any content. Might not be line one, but before anything else (which resembles real document better, where you can have whitespaces before your hash comment).

ANTLR parser for alpha numeric words which may have whitespace in between

First I tried to identify a normal word and below works fine:
grammar Test;
myToken: WORD;
WORD: (LOWERCASE | UPPERCASE )+ ;
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT: '0'..'9' ;
WHITESPACE : (' ' | '\t')+;
Just when I added below parser rule just beneath "myToken", even my WORD tokens weren't getting recognised with input string as "abc"
ALPHA_NUMERIC_WS: ( WORD | DIGIT | WHITESPACE)+;
Does anyone have any idea why is that?
This is because ANTLR's lexer matches "first come, first serve". That means it will tray to match the given input with the first specified (in the source code) rule and if that one can match the input, it won't try to match it with the other ones.
In your case ALPHA_NUMERIC_WS does match the same content as WORD (and more) and because it is specified before WORD, WORD will never be used to match the input as there is no input that can be matched by WORD that can't be matched by the first processed ALPHA_NUMERIC_WS. (The same applies for the WS and the DIGIT) rule.
I guess that what you want is not to create a ALPHA_NUMERIC_WS-token (as is done by specifying it as a lexer rule) but to make it a parser rule instead so it then can be referenced from another parsre rule to allow an arbitrary sequence of WORDs, DIGITs and WSs.
Therefore you'd want to write it like this:
alpha_numweric_ws: ( WORD | DIGIT | WHITESPACE)+;
If you actually want to create the respective token you can either remove the following rules or you need to think about what a lexer's job is and where to draw the line between lexer and parser (You need to redesign your grammar in order for this to work).

ANTLR 4.5 - Mismatched Input 'x' expecting 'x'

I have been starting to use ANTLR and have noticed that it is pretty fickle with its lexer rules. An extremely frustrating example is the following:
grammar output;
test: FILEPATH NEWLINE TITLE ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
This grammar will not match something like:
c:\test.txt
x
Oddly if I change TITLE to be TITLE: 'x' ; it still fails this time giving an error message saying "mismatched input 'x' expecting 'x'" which is highly confusing. Even more oddly if I replace the usage of TITLE in test with FILEPATH the whole thing works (although FILEPATH will match more than I am looking to match so in general it isn't a valid solution for me).
I am highly confused as to why ANTLR is giving such extremely strange errors and then suddenly working for no apparent reason when shuffling things around.
This seems to be a common misunderstanding of ANTLR:
Language Processing in ANTLR:
The Language Processing is done in two strictly separated phases:
Lexing, i.e. partitioning the text into tokens
Parsing, i.e. building a parse tree from the tokens
Since lexing must preceed parsing there is a consequence: The lexer is independent of the parser, the parser cannot influence lexing.
Lexing
Lexing in ANTLR works as following:
all rules with uppercase first character are lexer rules
the lexer starts at the beginning and tries to find a rule that matches best to the current input
a best match is a match that has maximum length, i.e. the token that results from appending the next input character to the maximum length match is not matched by any lexer rule
tokens are generated from matches:
if one rule matches the maximum length match the corresponding token is pushed into the token stream
if multiple rules match the maximum length match the first defined token in the grammar is pushed to the token stream
Example: What is wrong with your grammar
Your grammar has two rules that are critical:
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
Each match, that is matched by TITLE will also be matched by FILEPATH. And FILEPATH is defined before TITLE: So each token that you expect to be a title would be a FILEPATH.
There are two hints for that:
keep your lexer rules disjunct (no token should match a superset of another).
if your tokens intentionally match the same strings, then put them into the right order (in your case this will be sufficient).
if you need a parser driven lexer you have to change to another parser generator: PEG-Parsers or GLR-Parsers will do that (but of course this can produce other problems).
This was not directly OP's problem, but for those who have the same error message, here is something you could check.
I had the same Mismatched Input 'x' expecting 'x' vague error message when I introduced a new keyword. The reason for me was that I had placed the new key word after my VARNAME lexer rule, which assigned it as a variable name instead of as the new keyword. I fixed it by putting the keywords before the VARNAME rule.

Antlr 3 keywords and identifiers colliding

Surprise, I am building an SQL like language parser for a project.
I had it mostly working, but when I started testing it against real requests it would be handling, I realized it was behaving differently on the inside than I thought.
The main issue in the following grammar is that I define a lexer rule PCT_WITHIN for the language keyword 'pct_within'. This works fine, but if I try to match a field like 'attributes.pct_vac', I get the field having text of 'attributes.ac' and a pretty ANTLR error of:
line 1:15 mismatched character u'v' expecting 'c'
GRAMMAR
grammar Select;
options {
language=Python;
}
eval returns [value]
: field EOF
;
field returns [value]
: fieldsegments {print $field.text}
;
fieldsegments
: fieldsegment (DOT (fieldsegment))*
;
fieldsegment
: ICHAR+ (USCORE ICHAR+)*
;
WS : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
ICHAR : ('a'..'z'|'A'..'Z');
PCT_CONTAINS : 'pct_contains';
USCORE : '_';
DOT : '.';
I have been reading everything I can find on the topic. How the Lexer consumes stuff as it finds it even if it is wrong. How you can use semantic predication to remove ambiguity/how to use lookahead. But everything I read hasn't helped me fix this issue.
Honestly I don't see how it even CAN be an issue. I must be missing something super obvious because other grammars I see have Lexer rules like EXISTS but that doesn't cause the parser to take a string like 'existsOrNot' and spit out and IDENTIFIER with the text of 'rNot'.
What am I missing or doing completely wrong?
Convert your fieldsegment parser rule into a lexer rule. As it stands now it will accept input like
"abc
_ abc"
which is probably not what you want. The keyword "pct_contains" won't be matched by this rule since it is defined separately. If you want to accept the keyword in certain sequences as regular identifier you will have to include it in the accepted identifier rule.

ANTLR Parser Question

I'm trying to parse a number of text records where elements in a record are separated by a '+' char, and where the entire record is terminated by a '#' char. For example E1+E2+E3+E4+E5+E6#
Individual elements can be required or optional. If an element is optional, its value is simply missing. For example, if E2 were missing, the input string would be: E1++E3+E4+E5+E6#.
When dealing with empty trailing elements, however, the separator char ('+') may be missing as well. If, for example, the last 3 elements were missing, the string could be: E1+E2+E3#, but it could also be:
E1+E2+E3+++#
I have tried the following rule in Antlr:
'R1' 'E1 + E2 + E3' '+'? 'E4'? '+'? 'E5'? '+'? 'E6'? '#
but Antlr complains that it's ambiguous which of course is correct (every token following E3 could be E4, E5 or E6). The input syntax is fixed (it's from a legacy mainframe system), so I was wondering if anybody has a solution to this problem ?
An alternative would be to specify all the different permutations in the rule, but that would be a major task.
Best regards and thanks,
Michael
That task sounds like excessive overkill for ANTLR, any reason you're just not splitting the string into an array using the '+' as a separator?
If it's coming from a mainframe, it most likely was intended to be processed in a trivial way.
e.g.,
C++ : http://www.cplusplus.com/reference/clibrary/cstring/strtok/
PHP : http://us3.php.net/manual/en/function.explode.php
Java: http://java.sun.com/javase/6/docs/api/java/lang/String.html#split%28java.lang.String%29
C# : http://msdn.microsoft.com/en-us/library/system.string.split%28VS.71%29.aspx
Just a thought.
If this is ambiguous, it's likely because your Es all have the same format (a more complicated case would be that your Es all just start with the same k characters where k is your lookahead, but I'm going to assume that's not the case. If it is, this will still work; it will just require an extra step.)
So it looks like you can have up to 6 Es and up to 5 +s. We'll say a "segment" is an optional E followed by a + - you can have 5 segments, and an optional trailing E.
This grammar can be represented roughly like this (imperfect ANTLR syntax since I'm not very familiar with it):
r : (e_opt? PLUS){1,5} e_opt? END
e_opt : E // whatever your E is
PLUS : '+'
END : '#'
If ANTLR doesn't support anything like {1,5} then this is the same as:
(e_opt? PLUS) ((e_opt? PLUS) ((e_opt? PLUS) ((e_opt? PLUS) (e_opt? PLUS)?)?)?)?
which is not that clean, so maybe there is a nicer way to do it.