Antlr rule for matching filename - antlr

I am looking for a good way to match a filename in Antlr.
The filename could be DOS or Unix style.
If you have a good solution to that, feel free to ignore the rest of this question, because it is just my newbie attempt at solving the problem and I am probably way off. I have included it because some people like to see sample code.
For purposes of discussion, here is what I am thinking. This is not my actual grammar; all I am interested in for this discussion is filename parsing, so I reduced the sample to something meaningful in that context.
Lexer.g4:
lexer grammar Lexer;
K_COPY : C O P Y ;
FILEPATH: [-.a-zA-Z0-9:/\\]+;
Parser.g4:
parser grammar Parser;
options { tokenVocab=Lexer; }
commandfile: (statement NEWLINE)* EOF;
statement : copy_stmt
;
copy_stmt: K_COPY left=filepath right=filepath
;
// Add characters as we make rules as to what characters are valid:
filepath: FILEPATH;
That is what I am thinking but I am new to Antlr so I wanted to get some feedback before I proceed.
Using Antlr for this project is already decided, and a good part of the project is already working in Antlr, so I am only looking for Antlr-based solutions.
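One way to try out a candidate character class before committing it to the lexer is to prototype it as a regular expression. The sketch below is Python rather than Antlr and mirrors the FILEPATH rule from the sample, with the backslash written in escaped form. The names FILEPATH_RE and is_filepath are made up for illustration.

```python
import re

# Same characters as the FILEPATH rule: '-', '.', letters, digits, ':', '/', and '\'.
# The backslash has to be escaped inside the character class.
FILEPATH_RE = re.compile(r"[-.a-zA-Z0-9:/\\]+")


def is_filepath(text):
    """Return True if the whole string consists of allowed path characters."""
    return FILEPATH_RE.fullmatch(text) is not None
```

With this class, a DOS path such as C:\temp\file.txt and a Unix path such as /usr/local/bin/tool-1.0 are both accepted, while names containing spaces or underscores are rejected until those characters are added.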

Related

Modifying ANTLR v4 auto-generated lexer?

So I am writing a small language and I am using ANTLR v4 as my tool. ANTLR autogenerates lexer and parser files when you compile your grammar file (.g4). I am using javac, by the way. I want my language to have no semicolons, and the way I want to do this is: if there is an identifier or ")" as the last token in a line, the lexer will automatically insert the semicolon (similar to what the Go language does). How would I approach something like this? There are also things like the ATN (which I think is an augmented transition network) and the DFA (which I think is a deterministic finite automaton) in the lexer file, and I don't understand them or how they relate to the lexing process. Any help is appreciated. (By the way, I am still working on the grammar file, so I don't have it fully completed.)
Several points here: the ATN and the DFA are internal structures for parser + lexer and not something you would touch to change parsing behavior. Also, it's not clear to me why you want to have the lexer insert a semicolon at some point. What exactly do you want to accomplish by that (don't say: to make semicolons optional in the parser, I mean the underlying reason).
If you want to accept a command without a trailing semicolon you can make that optional:
assignment: (simpleAssignment | complexAssignment) SEMI?;
The parser will give you the content of the assignment rule regardless of whether there is a trailing semicolon or not. Is that what you want?
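For reference, the rule described in the question (add a semicolon when a line ends in an identifier or ")") can be sketched as a pass over a list of tokens. This is plain Python, not ANTLR runtime code, and the token type names IDENT, NEWLINE and SEMI are assumptions made up for the example. In an ANTLR lexer the same idea would typically be hooked into the generated lexer's token emission, which is not shown here.

```python
def ends_statement(token):
    """True for the tokens that trigger semicolon insertion at end of line."""
    kind, text = token
    return kind == "IDENT" or text == ")"


def insert_semicolons(tokens):
    """Take (type, text) pairs; drop NEWLINE tokens and add SEMI where needed."""
    result = []
    previous = None
    for token in tokens:
        if token[0] == "NEWLINE":
            # A line break after an identifier or ')' ends the statement.
            if previous is not None and ends_statement(previous):
                result.append(("SEMI", ";"))
        else:
            result.append(token)
        previous = token
    # The last line may not be followed by a NEWLINE token.
    if previous is not None and previous[0] != "NEWLINE" and ends_statement(previous):
        result.append(("SEMI", ";"))
    return result
```

A line that ends in an operator, such as "y +", gets no semicolon, so the statement continues on the next line.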

Upgrading Grammar file to Antlr4

I am upgrading my Antlr grammar file to the latest Antlr4.
I have converted most of the file but am stuck on syntax differences that I can't figure out. The 3 such differences are:
equationset: equation* EOF!;
equation: variable ASSIGN expression -> ^(EQUATION variable expression)
;
orExpression
: andExpression ( OR^ andExpression )*
;
In the first one, the error is due to !. I am not sure whether EOF and EOF! are the same or not. Removing ! resolves the error, but I want to be sure that this is the correct fix.
In the 2nd rule, -> and ^ are giving errors. I am not sure what the Antlr4 equivalent is.
In the 3rd rule, ^ is giving an error. Removing it fixes the error, but I can't find any migration guide that explains what the equivalent should be.
Can you please give me the Antlr4 equivalent of these 3 rules and briefly explain the difference? A reference to any other resource where I can find the answer is OK as well.
Thanks in advance.
Many of the ANTLR3 grammars contain syntax tree manipulations which are no longer supported with ANTLR4 (now we get a parse tree instead of a syntax tree). What you see here is exactly that.
EOF! means EOF should be matched but not appear in the AST. Since there is no AST anymore you cannot change that, so remove the exclamation mark.
The construct -> ^(EQUATION variable expression) rewrites the AST created by the equation rule. Since there is no AST anymore you cannot change that, so remove that part.
Finally, OR^ determines that the OR operator should become the root of the generated AST. Since there is no AST anymore ..., you got the point now :-)
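The three removals described above can be sketched as a small text transformation. This is a naive Python helper (the name strip_antlr3_tree_ops is made up), not an official migration tool: it assumes rewrites without nested parentheses and would also touch a ! or ^ that directly follows a word character inside an action or literal, so the output needs a manual review.

```python
import re


def strip_antlr3_tree_ops(rule):
    """Remove ANTLR3 tree-construction syntax from the text of one rule."""
    # Drop a rewrite such as "-> ^(EQUATION variable expression)".
    rule = re.sub(r"\s*->\s*\^\([^()]*\)", "", rule)
    # Drop the "!" and "^" suffix operators that follow a token or rule name.
    rule = re.sub(r"(?<=\w)[!^]", "", rule)
    return rule
```

Applied to the three rules from the question, this yields "equationset: equation* EOF;", "equation: variable ASSIGN expression" followed by the closing ";", and an orExpression rule with a plain OR.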

Parsing a G4 file to generate doc / schema

I realize this question is a bit meta, but I essentially want to parse an ANTLR4 grammar (an actual .g4 file) to then generate documentation and other artifacts based on the grammar (not an instance of the grammar).
For example, consider the example Java grammar that contains this rule:
compilationUnit
: packageDeclaration? importDeclaration* typeDeclaration* EOF
;
I want to be able to parse the Java.g4 file and produce documentation that says "A compilationUnit contains an optional packageDeclaration, 0 or more importDeclarations, and 0 or more typeDeclarations". Or perhaps I want to produce an XSD with a data type called "compilationUnit" that contains "packageDeclaration", "importDeclaration", and "typeDeclaration" elements (with proper cardinality set).
What is the best way of accomplishing something like this? Is it to create a target (even though the goal isn't to create lexers/parsers), or is it to use the example antlr4 grammar to parse the g4 file, or is it something else?
Thanks!
This would be a very typical use of ANTLR, and convenient given the existing ANTLR 4 grammar.
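To give an idea of the kind of output involved, here is a deliberately small Python sketch that turns the body of a flat rule into the sentence from the question. It uses a regular expression instead of a real parse of the .g4 file, so it only handles a plain sequence of elements with ?, * or + suffixes (no alternatives, groups or literals). The function name describe_rule and the wording table are assumptions made for the example.

```python
import re

# Wording for each cardinality suffix.
SUFFIX_WORDS = {
    "": "exactly one",
    "?": "an optional",
    "*": "zero or more",
    "+": "one or more",
}


def describe_rule(name, body):
    """Describe a flat sequence of rule elements in plain English."""
    parts = []
    for element, suffix in re.findall(r"([A-Za-z_]\w*)([?*+]?)", body):
        if element == "EOF":
            continue  # end-of-file marker, not part of the content
        parts.append(f"{SUFFIX_WORDS[suffix]} {element}")
    return f"A {name} contains " + ", ".join(parts)
```

A full solution would walk the parse tree produced from the grammar file instead of using a regular expression, so that nested groups and alternatives keep their structure.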

type3-only lexers in ANTLR4?

I'm thinking about using ANTLR in my lecture on formal languages since its input language is pretty clean and easy to learn.
Since I am not an expert in using ANTLR, I tried some standard examples to get familiar with its syntax, error messages etc.
Doing so, I found out that:
lexer grammar KFG;
R : 'a'R'b' | 'ab';
is a valid lexer that can be executed e.g. by:
echo "aaabbb" | grun KFG tokens -tokens
Since the grammar is context-free, it should only be parsable by a parser and not a lexer.
Is there any way to force ANTLR to accept only type 3 grammars for lexers?
Cheers,
Alex
Is there any way to force ANTLR to accept only type 3 grammars for lexers?
AFAIK, no, that is not possible.
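A possible workaround outside ANTLR is to check the grammar text yourself before handing it to the tool. The Python sketch below is an assumption of how such a check could look, not an ANTLR feature: it flags every lexer rule that can reach itself through rule references. A rule set without such cycles can be expanded into plain regular expressions, so the check is conservative (it also rejects right-recursive rules that are still regular) and it ignores modes, actions and predicates. The rule splitting is naive and breaks on a ';' inside a literal.

```python
import re


def recursive_lexer_rules(grammar_text):
    """Return the names of lexer rules that can reach themselves."""
    references = {}
    # Naive rule splitter: NAME ':' body ';'
    for name, body in re.findall(r"\b([A-Z]\w*)\s*:(.*?);", grammar_text, flags=re.S):
        body = re.sub(r"'(?:\\.|[^'\\])*'", " ", body)     # drop string literals
        body = re.sub(r"\[(?:\\.|[^\]\\])*\]", " ", body)  # drop character sets
        references[name] = set(re.findall(r"\b[A-Z]\w*", body))

    def reaches(current, target, seen):
        # Depth-first search along rule references.
        for ref in references.get(current, ()):
            if ref == target:
                return True
            if ref not in seen:
                seen.add(ref)
                if reaches(ref, target, seen):
                    return True
        return False

    return sorted(name for name in references if reaches(name, name, set()))
```

For the KFG grammar from the question the function reports the rule R, while a lexer built only from non-recursive rules and fragments yields an empty list.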

Lexing space separated words in ANTLR3 where some words are keywords

I am working on a project that involves transforming part of speech tagged text into an ANTLR3 AST with phrases as nodes of the AST.
The input to ANTLR looks like:
DT-THE The NN dog VBD sat IN-ON on DT-THE the NN mat STOP .
i.e. (tag token)+ where neither the tag nor the token contains white space.
Is the following a good way of lexing this?
WS : (' ')+ {skip();};
TOKEN : (~' ')+;
The grammar then has entries like the following to describe the lowest level of the AST:
dtTHE:'DT-THE' TOKEN -> ^('DT-THE' TOKEN);
nn:'NN' TOKEN -> ^('NN' TOKEN);
(and 186 more of these!)
This approach seems to work but results in a ~9000 line Java lexer and takes a large amount of memory to build (~2 GB), hence I was wondering whether this is the optimal way of solving this problem.
Could you combine the TAG space TOKEN into a single AST tree? Then you could pass both the TAG and TOKEN into your source code for handling. If the Java code used to handle the resulting tree is very similar between the various TAGs, then you could perhaps simplify the ANTLR with the trade-off of a bit more complication in your Java code.
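As a rough illustration of the "handle the tag in your own code" side of that trade-off, here is a Python sketch (not ANTLR or Java, and the function name pair_tags is made up) that pairs each tag with its token. With a single generic tag/token rule, the per-tag decisions would move into code like this instead of 188 grammar rules.

```python
def pair_tags(text):
    """Split '(tag token)+' input into a list of (tag, token) pairs."""
    words = text.split()
    if len(words) % 2 != 0:
        raise ValueError("expected an even number of whitespace-separated words")
    # Even positions are tags, odd positions are the tokens they label.
    return list(zip(words[0::2], words[1::2]))
```

For the sample input this gives pairs such as ("DT-THE", "The") and ("NN", "dog"), which the tree-building code can then dispatch on by tag.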