Use of Flex and Bison - grammar

I'm a newbie with Flex and Bison. I have tried to write a Flex lexical scanner and then a Bison grammar, but I have run into the following problem:
a word can sometimes match several different definitions in the Flex scanner, and I would like Bison to work out from its grammar which Flex definition to choose.
For example, if the word abc can be seen as category1 or category2 in Flex, I would like Bison to choose category1 if it appears without syntax error as category1 in the Bison grammar and incorrect as category2; but if it appears as a syntax error when it is category1 and not as category2, then Flex should classify it as category2.
Is there a way to do this? Or am I totally misunderstanding Flex and Bison?

This situation typically arises with what are often called "semi-reserved" words, or what are called "contextual keywords" in C#. In bison/flex, these are a pain to deal with. (Lemon has an undocumented feature where you can define a fallback for a token using the %fallback directive, which is perfect for this use case; you simply make IDENTIFIER the fallback for any contextually reserved token.)
With some work, you might be able to achieve the same effect by defining non-terminals like:
identifier : IDENTIFIER | VAR | ADD | REMOVE | DYNAMIC | GLOBAL | ...
/* VAR is special in a local-variable-type: */
local_variable_type_identifier : IDENTIFIER | ADD | REMOVE | DYNAMIC | GLOBAL | ...
You can probably find the places you need to customize by using identifier throughout at first. Then resolve each conflict that involves a reduction to identifier by replacing identifier, at that point in the grammar, with a restricted non-terminal that excludes the semi-reserved words participating in the conflict.
It's not great, but it's the best approach I know.

Flex supports 'start states' and 'exclusive start states', which might allow you to achieve the effect you want. If you can tell in advance that the context is such that abc should be category1, then you can tell Flex to enter a state in which abc is classified as category1, while in other states it is classified as category2. Don't forget to switch the state back when you're done with the special state. This sort of technique can be used to treat selected words as keywords in some contexts and as identifiers in others. Usually, though, you have the lexical analyzer always classify a word the same way (e.g. as token KW_ABC) and let the grammar get on with using that token.
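To make the state-switching idea concrete, here is a minimal sketch in Python rather than Flex. It only imitates what a start state does. The trigger word "declare", the state names, and the token names are all made up for illustration and are not from the question.

```python
import re

# Hypothetical states, standing in for Flex start states.
INITIAL, AFTER_DECLARE = "INITIAL", "AFTER_DECLARE"

def tokenize(text):
    """Classify whitespace-separated words, with 'abc' depending on the state."""
    state = INITIAL
    tokens = []
    for word in re.findall(r"\S+", text):
        if state == AFTER_DECLARE:
            # Special context: here 'abc' is category1.
            kind = "CATEGORY1" if word == "abc" else "WORD"
            state = INITIAL  # switch back once the special context ends
        elif word == "declare":
            # Assumed trigger that tells us the context in advance.
            kind = "DECLARE"
            state = AFTER_DECLARE
        elif word == "abc":
            # Everywhere else 'abc' is category2.
            kind = "CATEGORY2"
        else:
            kind = "WORD"
        tokens.append((kind, word))
    return tokens
```

Calling tokenize("declare abc abc x") classifies the first abc as CATEGORY1 and the second as CATEGORY2. The point is that the scanner decides from what it has already seen, not from what the parser would accept next.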

To reiterate Jonathan Leffler's above comment of Jan 13 at 19:39, you are trying to parse a context-sensitive language with parser-generator tools built for context-free grammars. You need to re-think the grammar or re-think your choice of parser-generator tools -- what you are doing is the equivalent of trying to use a screwdriver to hammer in a nail.
If it were me, I would go back to the books and the Interwebs to review handling of context-sensitive grammar parsing.

Related

how do I resolve this antlr ambiguity?

I have a 4000 line text file which is parsing slowly, taking perhaps 3 minutes. I am running the Intellij Antlr plugin. When I look at the profiler, I see this:
The time being consumed is the largest of all rules, by a factor of 15 or so. That's ok, the file is full of things I actually don't care about (hence 'trash'). However, the profiler says words_and_trash is ambiguous but I don't know why. Here are the productions in question. (There are many others of course...):
I have no idea why this is ambiguous. The parser isn't complaining about so_much_trash and I don't think word, trash, and OPEN_PAREN overlap.
What's my strategy for solving this ambiguity?
It's ambiguous because, given your two alternatives for words_and_trash, anything that matches the first alternative could also match the second alternative (that's the definition of ambiguity in this context).
It appears you might be using a technique common in other grammar tools to handle repetition. ANTLR can do this like so:
words_and_trash: so_much_trash+;
so_much_trash: word
| trash
| OPEN_PAREN words_and_trash CLOSE_PAREN
;
You might also find the following video useful: ANTLR4 Intellij Plugin -- Parser Preview, Parse Tree, and Profiling. It's by the author of ANTLR, and covers ambiguities.

ANTLR4: parse number as identifier instead of as numeric literal

I have this situation, of having to treat integer as identifier.
Underlying language syntax (unfortunately) allows this.
grammar excerpt:
grammar Alang;
...
NLITERAL : [0-9]+ ;
...
IDENTIFIER : [a-zA-Z0-9_]+ ;
Example code, that has to be dealt with:
/** declaration block **/
Method 465;
...
In the above code example, because NLITERAL has to be placed before IDENTIFIER, the parser picks 465 as NLITERAL.
What is a good way to deal with such situations?
(Ideally, avoiding application code within grammar, to keep it runtime agnostic)
I found similar questions on SO, not exactly helpful though.
There's no good way to make 465 produce either an NLITERAL token or an IDENTIFIER token depending on context (you might be able to use lexer modes, but that's probably not a good fit for your needs).
What you can do rather easily, though, is allow NLITERALs in addition to IDENTIFIERs in certain places. So you could define a parser rule
methodName: IDENTIFIER | NLITERAL;
and then use that rule instead of IDENTIFIER where appropriate.
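As a rough illustration of what that rule buys you, here is a hand-written Python sketch of a parser step that accepts either token kind as the method name. The token names METHOD and SEMI are assumptions for the example; only IDENTIFIER and NLITERAL come from the grammar excerpt.

```python
# Token kinds accepted where the grammar says: methodName: IDENTIFIER | NLITERAL;
NAME_KINDS = {"IDENTIFIER", "NLITERAL"}

def parse_method_decl(tokens):
    """Parse the (kind, text) triple METHOD methodName SEMI; return the name text.

    METHOD and SEMI are hypothetical token names used only in this sketch.
    """
    if len(tokens) != 3:
        raise SyntaxError("expected: Method <name> ;")
    (k0, _), (k1, name), (k2, _) = tokens
    if k0 != "METHOD" or k2 != "SEMI":
        raise SyntaxError("expected: Method <name> ;")
    if k1 not in NAME_KINDS:
        raise SyntaxError(f"{k1} cannot be used as a method name")
    return name
```

The lexer keeps producing NLITERAL for 465 everywhere; only the places that go through the name rule accept it as a name.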

ANTLR recognize single character

I'm pretty sure this isn't possible, but I want to ask just in case.
I have the common ID token definition:
ID: LETTER (LETTER | DIG)*;
The problem is that in the grammar I need to parse, there are some instructions in which you have a single character as operand, like:
a + 4
but
ab + 4
is not possible.
So I can't write a rule like:
sum: (INT | LETTER) ('+' (INT | LETTER))*
Because the lexer will consider 'a' as an ID, due to the higher priority of ID. (And I can't change that priority because it wouldn't recognize single character IDs then)
So I can only use ID instead of LETTER in that rule. It's ugly because there shouldn't be an ID, just a single letter, and I will have to do a second syntactic analysis to check that.
I know that there's nothing to be done about it, since the lexer knows nothing about context. What I'm wondering is whether ANTLR4 already has some built-in way to check the token's length inside the rule. Something like:
sum: (INT | ID{length=1})...
I would also like to know if there is some kind of "token alias" so I can do:
SINGLE_CHAR is alias of => ID
In order to avoid writing "ID" in the rule, since that can be confusing.
PS: I'm not parsing a simple language like this one; this is just a little example. In reality, an ID could also be a string, there are other tokens which can only be a subset of letters, etc. So I think I will have to do that second analysis anyway after parsing the input, to check that it is syntactically legal. I'm just curious whether something like this exists.
Checking the size of an identifier is a semantic problem and should hence be handled in the semantic phase, which usually follows the parsing step. Parse your input with the usual ID rule and check in the constructed parse tree the size of the recognized ids (and act accordingly). Don't try to force this kind of decision into your grammar.
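A minimal sketch of such a post-parse check, written in plain Python and not tied to the ANTLR runtime. The Token class and the function name are invented for illustration; in a real project you would walk the parse tree ANTLR gives you instead of a flat list.

```python
from dataclasses import dataclass

@dataclass
class Token:
    kind: str  # "ID", "INT" or "+" in this sketch
    text: str

def check_sum_operands(tokens):
    """Return an error message for every ID operand that is not one character long."""
    errors = []
    for tok in tokens:
        if tok.kind == "ID" and len(tok.text) != 1:
            errors.append(f"operand '{tok.text}' must be a single letter")
    return errors
```

For the tokens of a + 4 this returns an empty list, and for ab + 4 it reports the operand ab. The grammar stays simple and the length rule lives in one place.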

Solving a reduce/reduce conflict

If I have a grammar where a certain expression can match two productions, I will obviously have a reduce/reduce conflict with yacc. Specifically, say I have two productions (FirstProduction and SecondProduction) where both of them could be TOKEN END.
Then yacc will not be able to know what to reduce TOKEN END to (FirstProduction or SecondProduction). However, I want to make it so that yacc prioritises FirstProduction in this situation. How can I achieve that?
Note that both FirstProduction and SecondProduction could be a great deal of things and that Body is the only place in the grammar where these conflict.
Also, I do know that in these situations, yacc will choose the first production that was declared in the grammar. However, I want to avoid having any reduce/reduce warnings.
You can refactor the grammar to not allow the second list to start with something that could be part of the first list:
Body: FirstProductionList SecondProductionList
| FirstProductionList
;
FirstProductionList: FirstProductionList FirstProduction
| /* empty */
;
SecondProductionList: SecondProductionList SecondProduction
| NonFirstProduction
;
NonFirstProduction is any production that is unique to SecondProduction, and marks the transition from reducing FirstProductions to reducing SecondProductions.
Bison has no way to explicitly mark one production as preferred over another; the only such mechanism is precedence relations, which resolve shift/reduce conflicts. As you say, the file order provides an implicit priority. You can suppress the warning with an %expect declaration; unfortunately, that only lets you tell bison how many conflicts to expect, and not which conflicts.

Bison input analyzer - basic question on optional grammar and input interpretation

I am very new to Flex/Bison, so this is a very naive question.
Pardon me if so. It may look like a homework question, but I need to implement a project based on the concept below.
My question is related to two parts,
Question 1
In a Bison parser, how do I provide rules for optional input?
For example, I need to parse the statement
Example :
-country='USA' -state='INDIANA' -population='100' -ratio='0.5' -comment='Census study for Indiana'
Here the ratio token can be optional. Similarly, if many tokens are optional, how do I write the grammar for that in the parser?
My code looks like,
%start program
program : TK_COUNTRY TK_IDENTIFIER TK_STATE TK_IDENTIFIER TK_POPULATION TK_IDENTIFIER ...
where all the tokens are defined in the lexer. Since many tokens are optional, if I use "|" there will be many different possible input combinations.
Question 2
There is a good chance that the comment will contain quotes as part of the input, so I have added a -tag token which the user can provide so that they are interpreted correctly,
Example :
-country='USA' -state='INDIANA' -population='100' -ratio='0.5' -comment='Census study for Indiana$'s population' -tag=$
Now, I need to reinterpret Indiana$'s as Indiana's since -tag=$.
Please provide any input or related material to help me understand these topics.
Q1: I am assuming we have 4 possible tokens: NAME , '-', '=' and VALUE
Then the grammar could look like this:
attrs:
attr attrs
| attr
;
attr:
'-' NAME '=' VALUE
;
Note that, unless you make specific attribute names distinguished tokens, there is no way to say "We must have country, state and population, but ratio is optional."
This would be the task of that part of the program that analyses the data produced by the parser.
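Here is a rough sketch of what that later analysis step might look like, in Python for brevity. It assumes the parser hands over a list of (name, value) pairs, and the split into required and optional names is my guess from the example input, not something the question states.

```python
# Assumed split, guessed from the example command line in the question.
REQUIRED = {"country", "state", "population"}
OPTIONAL = {"ratio", "comment"}

def validate_attrs(pairs):
    """Check (name, value) pairs from the parser; return (attrs, errors)."""
    seen = {}
    errors = []
    for name, value in pairs:
        if name not in REQUIRED | OPTIONAL:
            errors.append(f"unknown attribute: {name}")
        elif name in seen:
            errors.append(f"duplicate attribute: {name}")
        else:
            seen[name] = value
    # Report required attributes that never appeared, in a stable order.
    for name in sorted(REQUIRED - set(seen)):
        errors.append(f"missing attribute: {name}")
    return seen, errors
```

With this split, the grammar accepts the attributes in any order and in any number, and leaving out ratio is fine while leaving out state is reported as an error.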
Q2: As I understand it, you are thinking of changing the way lexical analysis works while the parser is running. This is not a good idea, at least not for a beginner. Have you even started to think about lexical analysis, as opposed to parsing?