type3-only lexers in ANTLR4? - antlr

I'm thinking about using ANTLR in my lecture on formal languages since its input language is pretty clean and easy to learn.
Since I am not an expert in ANTLR, I tried some standard examples to get familiar with its syntax, error messages, etc.
Doing so, I found out that:
lexer grammar KFG;
R : 'a'R'b' | 'ab';
is a valid lexer that can be executed e.g. by:
echo "aaabbb" | grun KFG tokens -tokens
Since the grammar is context-free (it describes a non-regular language), it should only be parsable by a parser and not by a lexer.
Is there any way to force ANTLR to accept only type 3 grammars for lexers?
Cheers,
Alex

Is there any way to force ANTLR to accept only type 3 grammars for lexers?
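AFAIK, no, that is not possible. ANTLR 4 lexer rules are allowed to be recursive by design (for example, to match nested comments), and I know of no option that restricts them to regular languages.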

Related

Modifying ANTLR v4 auto-generated lexer?

So I am writing a small language and I am using ANTLR v4 as my tool. ANTLR autogenerates lexer and parser files when you compile your grammar file (.g4). I am using javac, by the way. I want my language to have no semicolons, and the way I want to do this is: if an identifier or ")" is the last token on a line, the lexer automatically inserts the semicolon (similar to what the Go language does). How would I approach something like this? There are also things like the ATN (which I think is an augmented transition network) and the DFA (which I think is a deterministic finite automaton) in the lexer file; I don't understand them or how they relate to the lexing process. Any help is appreciated. (By the way, I am still working on the grammar file, so I don't have it fully completed.)
Several points here: the ATN and the DFA are internal structures for parser + lexer and not something you would touch to change parsing behavior. Also, it's not clear to me why you want to have the lexer insert a semicolon at some point. What exactly do you want to accomplish by that (don't say: to make semicolons optional in the parser, I mean the underlying reason).
If you want to accept a command without a trailing semicolon you can make that optional:
assignment: (simpleAssignment | complexAssignment) SEMI?;
The parser will give you the content of the assignment rule regardless of whether there is a trailing semicolon or not. Is that what you want?
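For reference, the Go-style rule described in the question can be sketched as a filter over the token stream. This is only a rough illustration in Python, not ANTLR code: the `(type, text)` token shape, the token type names (`ID`, `RPAREN`, `NEWLINE`, `SEMI`) and the function `insert_semicolons` are all hypothetical. In ANTLR 4 the same idea would more likely live in a subclass or wrapper of the generated lexer than in hand edits to the generated file.

```python
from typing import List, Tuple

# Hypothetical token shape for this sketch: (token type, matched text).
Token = Tuple[str, str]

# Assumed set of token types that may end a statement at a line break.
TERMINATORS = {"ID", "RPAREN"}


def insert_semicolons(tokens: List[Token]) -> List[Token]:
    """Turn a NEWLINE into SEMI when the previous token can end a
    statement; drop every other NEWLINE."""
    out: List[Token] = []
    for tok in tokens:
        if tok[0] == "NEWLINE":
            if out and out[-1][0] in TERMINATORS:
                out.append(("SEMI", ";"))
            continue  # the NEWLINE itself never reaches the parser
        out.append(tok)
    # Treat end of input like a final line break.
    if out and out[-1][0] in TERMINATORS:
        out.append(("SEMI", ";"))
    return out


if __name__ == "__main__":
    stream = [("ID", "x"), ("ASSIGN", "="), ("ID", "y"), ("NEWLINE", "\n"),
              ("ID", "f"), ("LPAREN", "("), ("NEWLINE", "\n"),
              ("ID", "a"), ("RPAREN", ")")]
    print(" ".join(text for _, text in insert_semicolons(stream)))
```

With a filter like this in front of the parser, the grammar itself can keep requiring SEMI, and the line break after "(" does not end the statement.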

How can I show that this grammar is ambiguous?

I want to prove that this grammar is ambiguous, but I'm not sure how I am supposed to do that. Do I have to use parse trees?
S -> if E then S | if E then S else S | begin S L | print E
L -> end | ; S L
E -> i
You can show it is ambiguous if you can find a string that parses more than one way (the parentheses below only mark the two groupings; they are not part of the string):
if i then ( if i then print i else print i ; )
if i then ( if i then print i ) else print i ;
This happens to be the classic "dangling else" ambiguity. Googling your tag(s), title & grammar gives other hits.
However, if you don't happen to guess at an ambiguous string then googling your tag(s) & title:
how can i prove that this grammar is ambiguous?
There is no easy method for proving a context-free grammar ambiguous -- in fact,
the question is undecidable, by reduction from the Post correspondence problem.
You can put the grammar into a parser generator which supports all context-free grammars, a context-free general parser generator. Generate the parser, then parse a string which you think is ambiguous and find out by looking at the output of the parser.
A context-free general parser generator generates parsers which produce all derivations in polynomial time. Examples of such parser generators include SDF2, Rascal, DMS, Elkhound, ART. There is also a backtracking version of yacc (btyacc) but I don't think it does it in polynomial time. Usually the output is encoded as a graph where alternative trees for sub-sentences are encoded with a nested set of alternative trees.
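If no general parser generator is at hand, the same check can be done by brute force for a small grammar like the one in the question. The following is a minimal Python sketch, not one of the tools named above: `GRAMMAR` transcribes the question's rules, and `count_parses` is a hypothetical helper that counts distinct parse trees of a whitespace-separated token string. It assumes the grammar has no left recursion and no empty productions, which holds here because every alternative starts with a terminal.

```python
from functools import lru_cache

# The grammar from the question; any symbol that is not a key is a terminal.
GRAMMAR = {
    "S": [["if", "E", "then", "S"],
          ["if", "E", "then", "S", "else", "S"],
          ["begin", "S", "L"],
          ["print", "E"]],
    "L": [["end"], [";", "S", "L"]],
    "E": [["i"]],
}


def count_parses(text: str, start: str = "S") -> int:
    """Count the distinct parse trees of a whitespace-separated token string."""
    tokens = tuple(text.split())

    @lru_cache(maxsize=None)
    def derive(symbol, i, j):
        # Number of parse trees for tokens[i:j] rooted at `symbol`.
        if symbol not in GRAMMAR:  # terminal: must match exactly one token
            return 1 if j == i + 1 and tokens[i] == symbol else 0
        return sum(sequence(tuple(rhs), i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def sequence(symbols, i, j):
        # Number of ways the symbol sequence can derive tokens[i:j].
        if not symbols:
            return 1 if i == j else 0
        head, rest = symbols[0], symbols[1:]
        # Every symbol consumes at least one token (no empty productions).
        return sum(derive(head, i, k) * sequence(rest, k, j)
                   for k in range(i + 1, j + 1))

    return derive(start, 0, len(tokens))


if __name__ == "__main__":
    print(count_parses("if i then if i then print i else print i"))
```

A count greater than one for some string demonstrates ambiguity; a count of one for the strings you tried proves nothing, in line with the undecidability remark above.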

Antlr Arrow Syntax

I found this syntax in an Antlr parser for bash:
file_descriptor
: DIGIT -> ^(FILE_DESCRIPTOR DIGIT)
| DIGIT MINUS -> ^(FILE_DESCRIPTOR_MOVE DIGIT);
What does the -> syntax do?
What is it called such that I can google it to read about it?
The 'Definitive Guide to Antlr4' only has one page about it. It refers to "lexer command", but it never names the operator. The usage in the book differs from the usage in the bash parser.
In ANTLR3, -> is used in parser rules and signifies a tree rewrite rule, which is no longer supported in ANTLR4.
In ANTLR4, -> is used in lexer rules to attach lexer commands such as skip, more, type(...), channel(...), pushMode(...) or popMode; it has nothing to do with the old v3 functionality.

Antlr rule for matching filename

I am looking for a good way to match a filename in Antlr.
The filename could be DOS or Unix style.
If you have a good solution to that, feel free to ignore the rest of this question because it is just my newbie attempt at solving the problem and I am probably way off. I have included it because some people like to see sample code.
For purposes of discussion, here is what I am thinking. This is not my actual grammar; all I am interested in for this discussion is filename parsing, so I reduced the sample to something that is still somewhat meaningful in that context.
Lexer.g4:
lexer grammar Lexer;
K_COPY : C O P Y ;
FILEPATH: [-.a-zA-Z0-9:/\\]+;
Parser.g4:
parser grammar Parser;
options { tokenVocab=Lexer; }
commandfile: (statement NEWLINE)* EOF;
statement : copy_stmt
;
copy_stmt: K_COPY left=filepath right=filepath
;
// Add characters as we make rules as to what characters are valid:
filepath: FILEPATH;
That is what I am thinking but I am new to Antlr so I wanted to get some feedback before I proceed.
Using Antlr for this project is already decided, and a good part of the project is already working in Antlr, so I am only looking for Antlr-based solutions.
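One quick way to sanity-check the FILEPATH character class before wiring it into the grammar is to try the same set in a regular expression. The sketch below uses Python's `re` module with the backslash written as an escaped `\\`; the helper name `is_filepath` and the sample paths are made up for illustration. It only checks the character set and says nothing about how ANTLR chooses between FILEPATH and other lexer rules such as K_COPY.

```python
import re

# Same character set as the FILEPATH rule: dash, dot, letters, digits,
# colon, forward slash and (escaped) backslash, one or more times.
FILEPATH = re.compile(r"[-.a-zA-Z0-9:/\\]+")


def is_filepath(text: str) -> bool:
    """Return True when the whole string consists of FILEPATH characters."""
    return FILEPATH.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in [r"C:\dir\file.txt", "/usr/local/bin/tool-1.0", "my file.txt"]:
        print(sample, is_filepath(sample))
```

The last sample is rejected because the class contains no space, so paths with spaces or other punctuation would need quoting or a wider character set.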

Do independent rules influence one another?

When I was debugging my grammar for C# I noticed something very unusual: some inputs that are not accepted by a full grammar are being accepted by the same grammar with some independent rules deleted. I could not find a logical explanation. For example:
CS - this grammar does not accept the input a<a<a>><EOF>
CS' - and this grammar which is basically the same as CS but with some independent rules deleted (rules are not reordered) does accept a<a<a>><EOF>
As you can see both grammars start with the rule start: namespaceOrTypeName EOF; and therefore they should call the same set of rules (CS will never call those rules that are deleted in CS'). I spent a day debugging this, deleting or adding new rules, but couldn't find a flaw in the logic. Any help would be of use, thank you.
Unicode
EDIT:
After changing the start rule in CS to start: Identifier EOF; the grammar starts rejecting the input method, which is normally accepted when only the Identifier rules are defined. So I guess, since there is a rule attributeTarget: ...| 'method' | ..., that after compiling the grammar some phrases get reserved, such as 'method' in this case, but I'm still not sure if that's the case.
The first grammar includes the overloadableBinaryOperator rule which implicitly defines the >> token. Since >> is a 2-character token, the lexer will never treat the input >> as two separate 1-character tokens >, >. If you open the grammar in ANTLRWorks 2, you'll see a warning indicator for each implicitly-defined token. You should remove all of these warnings by:
Creating explicit lexer rules for every token you intend to appear in the input.
Only using the syntax 'new' in a parser rule if a corresponding lexer rule exists for the literal 'new'.
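The effect of the implicit >> token can be reproduced with a toy longest-match tokenizer. This is a simplified Python sketch, not ANTLR's actual lexer algorithm: `tokenize` is a hypothetical helper that, at each position, takes the longest literal from a given list.

```python
from typing import List


def tokenize(text: str, literals: List[str]) -> List[str]:
    """Split text into tokens, always taking the longest matching literal."""
    # Try longer literals first so that '>>' wins over '>'.
    ordered = sorted(literals, key=len, reverse=True)
    tokens: List[str] = []
    pos = 0
    while pos < len(text):
        for lit in ordered:
            if text.startswith(lit, pos):
                tokens.append(lit)
                pos += len(lit)
                break
        else:
            raise ValueError(f"no token matches at position {pos}")
    return tokens


if __name__ == "__main__":
    # Without a '>>' literal the two closing brackets stay separate.
    print(tokenize("a<a<a>>", ["a", "<", ">"]))
    # Once '>>' exists anywhere in the grammar, it swallows both of them.
    print(tokenize("a<a<a>>", ["a", "<", ">", ">>"]))
```

In the second call the parser never sees two separate > tokens, which is why a rule that looks unrelated can change whether a<a<a>> is accepted.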