I want to write a minimalistic XML parser for my timetabling application. I do not want to use any existing XML libraries or parsers, because they would be less efficient for my use (I only need to recognize a few tags). Hence I have decided to write a parser using lex and yacc.
Is there any way that I can call the functions declared in the .h file created by lex and yacc directly from my own code, rather than writing the application code in the yacc file itself?
The functions exported by your lex and yacc generated programs are minimal. The parser is invoked by calling yyparse. It calls yylex in the lexer. Everything else can be outside.
It is convenient and customary to have some parsing support routines in the lex and yacc files themselves (helpers which are called by lexing and parsing actions, and not by anything else). But not the application logic. (Except for very trivial textbook examples for Yacc.)
Related
I'm working on a parser in C++ whose generated source file is getting large and complex. Since each rule gets its own class, I was wondering whether, for rules that do not depend on each other, there's a way to instruct the antlr tool to generate the C++ code in separate .cpp files.
Regards,
JZ
I was wondering if there's a way to instruct the antlr tool to generate the C++ code in separate .cpp files
No, that is not possible out of the box.
This post about the antlr simple example shows how to create and use a grammar for Java.
However, this intermixes the grammar and the Java source code in the Exp.g source.
My question is: is it possible to decouple the grammar file from the target language, so that one grammar file can be used to generate lexers/parsers for multiple targets (Java, Scala, C++, etc.)?
It depends mostly on why target code is used in the grammar. If it is only action code that does something with the found tokens (e.g. building a symbol table or an alternative tree representation), then it is indeed no problem to remove such native code and do the processing afterwards (using a parse tree walker or visitor).
However, predicates are different. They are used to guide the parser and also require native code. What you can do is move all the native code into a base class from which your generated parser derives. You then only need to rewrite this base class in each target language, and the grammar stays mostly free of native code (except for a single function call, which invokes the native code).
This approach has the further advantage that the grammar needs no additional library reference (#include in C/C++, import in other languages). Such a reference is itself native code and would prevent the grammar from being used for multiple targets.
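The base-class idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: `GrammarSupportBase`, `GeneratedParserStandIn`, `isTypeName` and `currentTokenText` are hypothetical names, and the "generated" parser is a hand-written stand-in so that the code compiles without the ANTLR runtime. In a real project the grammar would name the base class through ANTLR's `superClass` grammar option, and the generated parser would derive from it.

```java
import java.util.Set;

// Hand-written support class: rewrite this once per target language.
// In a real ANTLR project it would extend the runtime's Parser class; here it
// stands alone (an assumption) so the sketch compiles without the ANTLR jar.
abstract class GrammarSupportBase {
    private final Set<String> typeNames;

    protected GrammarSupportBase(Set<String> typeNames) {
        this.typeNames = typeNames;
    }

    // The single call the grammar would make from a predicate,
    // for example: {isTypeName()}?
    protected boolean isTypeName() {
        return typeNames.contains(currentTokenText());
    }

    // Provided by the parser: text of the current lookahead token.
    protected abstract String currentTokenText();
}

// Hypothetical stand-in for the parser class ANTLR would generate.
final class GeneratedParserStandIn extends GrammarSupportBase {
    private final String[] tokens;
    private int pos = 0;

    GeneratedParserStandIn(Set<String> typeNames, String[] tokens) {
        super(typeNames);
        this.tokens = tokens;
    }

    @Override
    protected String currentTokenText() {
        return pos < tokens.length ? tokens[pos] : "";
    }

    // Mimics a rule guarded by the predicate: classify one token, then consume it.
    String classifyNext() {
        String kind = isTypeName() ? "type" : "identifier";
        pos++;
        return kind;
    }
}
```

With this split, the grammar only contains the predicate call, and porting to another target means rewriting the small support class rather than touching the grammar.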
I am wondering if I can generate an ANTLR grammar from Java source code. I want to do some kind of research project, and for now I am just exploring different open source tools to see which one is best.
For ANTLR, do I always have to write a grammar and pass it to the ANTLR?
Is there a way to generate grammar from an existing Java source code?
Not easily. ANTLR generates a recursive descent parser from your grammar, encoding the tests into procedural code along with lots of other bookkeeping.
Knowing how the code is generated, you might be able to take it apart but you'll have to reach into the middle of generated statements and that isn't easy without a full parser for the generated language. (Hint: regex won't work).
I don't see much point in this exercise. Why don't you just use the original grammar?
I've been learning ANTLR for a few days now. My goal in learning it was to be able to generate parsers and lexers, and then personally hand-translate them from Java into my target language (which is none of C/C++/Java/C#/Python; no tool has support for it). I chose ANTLR because of this line from its About page: ANTLR is widely used because it's easy to understand, powerful, flexible, generates human-readable output[...]
In learning this tool, I decided to start with a simple lexer for a simple grammar: JSON. However, once I generated the .java file for this lexer using ANTLR4 I was caught completely off guard. I got a huge mess of far-from-human-readable serialized code, followed by:
public static final ATN _ATN =
    ATNSimulator.deserialize(_serializedATN.toCharArray());
static {
    _decisionToDFA = new DFA[_ATN.getNumberOfDecisions()];
}
A few Google searches were unable to provide me a way to disable this behavior.
Is there a way to disable this behavior and produce human-readable code, or am I going to have to hand-write my lexers and parsers for this target programming language?
ANTLR 4 uses a new algorithm for prediction. Terence Parr is currently working on a tech report describing the algorithm in detail. The human-readable output refers to the generated parsers.
ANTLR 4 lexers use a DFA recognizer for a massive speed and memory usage improvement over previous releases of ANTLR. For parsers, the _ATN field is a data structure used within calls to adaptivePredict (you'll notice lines in the generated code calling that method).
You won't be able to manually translate the generated Java code of an ANTLR 4 lexer to another programming language. You might be able to manually translate the code of a generated parser provided the grammar is strictly LL(1) (i.e. the generated code does not contain any calls to adaptivePredict). However, you will lose the error recovery ability that draws from information encoded in the serialized ATN.
As a project assignment, I need to parse plain C source code from Java and generate AST output. As a starting point, I am using the file c.jj that I found among the grammar files at
http://java.net/projects/javacc/sources/svn/
but I found that it only has syntactic and lexical actions and no real semantics for parsing C source. Is there some other source that incorporates typedefs, variables, function constructs, and include files?
You could go looking for a complete grammar. Will you learn much this way?
You could ask your lecturer which would impress them more: implementing some small subset of the C grammar by writing your own rules, or searching Google for an alternative, complete set of rules.
I trust that writing your own rules, or even your own hand-crafted parser, will be a more useful exercise, even if it only parses expressions.
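To give an idea of how small such a hand-crafted parser can be, here is a minimal sketch of a recursive descent parser for arithmetic expressions that produces an AST in prefix notation. The grammar in the comment, the class name `ExprAstParser`, and the string-based AST are assumptions made for illustration; none of them come from c.jj, and real C expressions need many more operators and precedence levels.

```java
// Hand-written recursive descent parser for a small expression subset.
// Grammar (an illustrative assumption, not taken from c.jj):
//   expr   := term   (('+' | '-') term)*
//   term   := factor (('*' | '/') factor)*
//   factor := NAME_OR_NUMBER | '(' expr ')'
final class ExprAstParser {
    private final String src;
    private int pos = 0;

    private ExprAstParser(String src) {
        this.src = src;
    }

    // Parses the whole input and returns the AST in prefix notation.
    static String parse(String text) {
        ExprAstParser p = new ExprAstParser(text);
        String ast = p.expr();
        p.skipSpaces();
        if (p.pos != text.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return ast;
    }

    private String expr() {
        String left = term();
        while (true) {
            if (accept('+')) left = "(+ " + left + " " + term() + ")";
            else if (accept('-')) left = "(- " + left + " " + term() + ")";
            else return left;
        }
    }

    private String term() {
        String left = factor();
        while (true) {
            if (accept('*')) left = "(* " + left + " " + factor() + ")";
            else if (accept('/')) left = "(/ " + left + " " + factor() + ")";
            else return left;
        }
    }

    private String factor() {
        if (accept('(')) {
            String inner = expr();
            if (!accept(')')) {
                throw new IllegalArgumentException("')' expected at " + pos);
            }
            return inner;
        }
        skipSpaces();
        int start = pos;
        // A run of letters, digits and underscores is treated as one operand.
        while (pos < src.length()
                && (Character.isLetterOrDigit(src.charAt(pos)) || src.charAt(pos) == '_')) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("operand expected at " + pos);
        }
        return src.substring(start, pos);
    }

    // Consumes c if it is the next non-space character.
    private boolean accept(char c) {
        skipSpaces();
        if (pos < src.length() && src.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) {
            pos++;
        }
    }
}
```

For example, `ExprAstParser.parse("1+2*3")` returns `(+ 1 (* 2 3))`. Each grammar rule maps to one method, which is what makes this style easy to extend rule by rule.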