ANTLR behaviour with conflicting tokens - antlr

How is ANTLR lexer behavior defined in the case of conflicting tokens?
Let me explain what I mean by "conflicting" tokens.
For example, assume that the following is defined:
INT_STAGE : '1'..'6';
INT : '0'..'9'+;
There is a conflict here, because after reading a sequence of digits, the lexer would not know whether there is one INT or many INT_STAGE tokens (or different combinations of both).
After a test, it looks like that if INT is defined after INT_STAGE, the lexer would prefer to find INT_STAGE, but maybe not INT then? Otherwise, no INT_STAGE would ever be found.
Another example would be:
FOOL: ' fool'
FOO: 'foo'
ID : ('a'..'z'|'A'..'Z'|'_'|'%') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'%')*;
I was told that this is the "right" order to recognize all the tokens:
while reading "fool" the lexer will find one FOOL token and not FOO ID or something else.

The following logic applies:
the lexer matches as much characters as possible
if after applying rule 1, there are 2 or more rules that match the same amount of characters, the rule defined first will "win"
Taking this into account, the input "1", "2", ..., "6" is tokenized as an INT_STAGE: both INT_STAGE and INT match the same amount of characters, but INT_STAGE is defined first.
The input "12" is tokenized as a INT since it matches the most characters.
I was told that this is the "right" order to recognize all the tokens: while reading "fool" the lexer will find one FOOL token and not FOO ID or something else.
That is correct.

Related

Antlr4 extremely simple grammar failing

Antlr4 has always been a kind of love-hate relationship for me, but I am currently a bit perplexed. I have started creating a grammar to my best knowledge and then wanted to test it and it didnt work at all. I then reduced it a lot to just a bare minimum example and I managed to make it not work. This is my grammar:
grammar SwiftMtComponentFormat;
separator : ~ZERO EOF;
ZERO : '0';
In my understanding it should anything except a '0' and then expect the end of the file. I have been testing it with the single character input '1' which I had expected to work. However this is what happens:
If i change the ~ZEROto ZERO and change my input from 1 to 0 it actually perfectly matches... For some reason the simple negation does not seem to work. I am failing to understand what the reason here is...
In a parser rule ~ZERO matches any token that is not a ZERO token. The problem in your case is that ZERO is the only type of token that you defined at all, so any other input will lead to a token recognition error and not get to the parser at all. So if you enter the input 1, the lexer will discard the 1 with a token recognition error and the parser will only see an empty token stream.
To fix this, you can simply define a lexer rule OTHER that matches any character not matched by previous lexer rules:
OTHER: .;
Note that this definition has to go after the definition of ZERO - otherwise it would match 0 as well.
Now the input 1 will produce an OTHER token and ~ZERO will match that token. Of course, you could now replace ~ZERO with OTHER and it wouldn't change anything, but once you add additional tokens, ~ZERO will match those as well whereas OTHER would not.

ANTLR recognize single character

I'm pretty sure this isn't possible, but I want to ask just in case.
I have the common ID token definition:
ID: LETTER (LETTER | DIG)*;
The problem is that in the grammar I need to parse, there are some instructions in which you have a single character as operand, like:
a + 4
but
ab + 4
is not possible.
So I can't write a rule like:
sum: (INT | LETTER) ('+' (INT | LETTER))*
Because the lexer will consider 'a' as an ID, due to the higher priority of ID. (And I can't change that priority because it wouldn't recognize single character IDs then)
So I can only use ID instead of LETTER in that rule. It's ugly because there shouldn't be an ID, just a single letter, and I will have to do a second syntactic analysis to check that.
I know that there's nothing to do about it, since the lexer doesn't understand about context. What I'm thinking that maybe there's already built-in ANTLR4 is some kind of way to check the token's length inside the rule. Something like:
sum: (INT | ID{length=1})...
I would also like to know if there are some kind of "token alias" so I can do:
SINGLE_CHAR is alias of => ID
In order to avoid writing "ID" in the rule, since that can be confusing.
PD: I'm not parsing a simple language like this one, this is just a little example. In reality, an ID could also be a string, there are other tokens which can only be a subset of letters, etc... So I think I will have to do that second analysis anyways after parsing the entry to check that syntactically is legal. I'm just curious if something like this exists.
Checking the size of an identifier is a semantic problem and should hence be handled in the semantic phase, which usually follows the parsing step. Parse your input with the usual ID rule and check in the constructed parse tree the size of the recognized ids (and act accordingly). Don't try to force this kind of decision into your grammar.

Rule with identical string token twice

Using yacc, I want to parse text like
begin foo ... end foo
The string foo is not known at compile time and there can be different
such strings in the same input.
So far, the only option I see is to check for syntactical correctness after parsing:
block : BEGIN IDENT something END IDENT
{ if (strcmp($2, $5) != 0) yyerror("Mismatch"); }
This feels wrong. The parser should already detect the errors. Is there something built-in to yacc?
yacc only knows about tokens which the lexer can identify. Since those are identical, the lexer could only improve this case by using states.
That is, you could tell lex to remember that it saw a BEGIN and to count the tokens itself, and return a different type of IDENT (and do the checking there).
However, yacc is better suited to this sort of thing, so the answer to the original question is "no", there is no better solution.

No spaces between tokens in ANTLR

I'm writing a very simple subset of a C# grammar as an exercise.
However, I have a rule which whitespaces are giving me some troubles.
I want to distinguish the following:
int a;
int? b;
Where the the first is a "regular" int type and the second is a nullable int type.
However, with my current grammar I'm not being able to parse this.
type : typeBase x='?' -> { x == null } typeBase
-> ^('?' typeBase)
;
typeBase : 'int'
| 'float'
;
The thing is that whith these rules, it only works with a whitespace before '?', like this:
int ? a;
Which I'd don't want.
Any ideas?
1) Your definition of whitespace seems to be flawed ... the grammar you present should accept "int?" and "int ?". Maybe you should take a look of the definition of whitespace.
2) If you want to disallow "int ? a" you can define extra tokens 'int?' and 'float?' ... normally you allow whitespace to appear between every token, so you have to make it one token.

Bison input analyzer - basic question on optional grammar and input interpretation

I am very new to Flex/Bison, So it is very navie question.
Pardon me if so. May look like homework question - but I need to implement project based on below concept.
My question is related to two parts,
Question 1
In Bison parser, How do I provide rules for optional input.
Like, I need to parse the statment
Example :
-country='USA' -state='INDIANA' -population='100' -ratio='0.5' -comment='Census study for Indiana'
Here the ratio token can be optional. Similarly, If I have many tokens optional, then How do I provide the grammar in the parser for the same?
My code looks like,
%start program
program : TK_COUNTRY TK_IDENTIFIER TK_STATE TK_IDENTIFIER TK_POPULATION TK_IDENTIFIER ...
where all the tokens are defined in the lexer. Since there are many tokens which are optional, If I use "|" then there will be many different ways of input combination possible.
Question 2
There are good chance that the comment might have quotes as part of the input, so I have added a token -tag which user can provide to interpret the same,
Example :
-country='USA' -state='INDIANA' -population='100' -ratio='0.5' -comment='Census study for Indiana$'s population' -tag=$
Now, I need to reinterpret Indiana$'s as Indiana's since -tag=$.
Please provide any input or related material for to understand these topic.
Q1: I am assuming we have 4 possible tokens: NAME , '-', '=' and VALUE
Then the grammar could look like this:
attrs:
attr attrs
| attr
;
attr:
'-' NAME '=' VALUE
;
Note that, unlike you make specific attribute names distinguished tokens, there is no way to say "We must have country, state and population, but ratio is optional."
This would be the task of that part of the program that analyses the data produced by the parser.
Q2: I understand this so, that you think of changing the way lexical analysis works while the parser is running. This is not a good idea, at least not for a beginner. Have you even started to think about lexical analysis, as opposed to parsing?