ANTLR4: What is the best approach to implement C-like include file handling?

I am implementing a lexer/parser for the real-time language OpenPEARL. For better structuring of my test suite I want to implement include file handling similar to C/C++. The parser itself uses visitors. What would be the best approach to implement this? One thing that concerns me: when instantiating a nested parser, the included file does not need to contain a complete program, depending on where it is included.
Cheers
Marcel

I can't speak for ANTLR, but in general one implements a C-like preprocessor in the lexer.
You accomplish this by having a stack of input streams, with the base of the stack being the source file. You read input from the stream on top of the stack.
When an include is encountered in the lexer, a new stream is pushed on top of the stack, and reading continues (now from the new stream). When a stream encounters EOF, you pop the stack and continue; if the stack is empty, the lexer emits an EOF token.
You can abuse these streams to implement macros. On a macro call, simply push a new stream that represents the macro body. When you encounter a macro parameter name, push a stream containing the argument supplied for that parameter in the macro call.
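Below is a minimal, lexer-agnostic Java sketch of that stream stack (the class and method names are made up for illustration); the actual ANTLR wiring, e.g. saving and restoring the lexer's CharStream, needs some extra state handling:

import java.io.IOException;
import java.io.Reader;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: a character source that always reads from the stream
// on top of a stack, so include files and macro bodies can be pushed onto it.
public class IncludeStack {
    private final Deque<Reader> stack = new ArrayDeque<>();

    public IncludeStack(Reader mainFile) {
        stack.push(mainFile);
    }

    // Push a newly opened include file, or a StringReader holding a macro body.
    public void push(Reader included) {
        stack.push(included);
    }

    // Read one character; pop finished streams until data remains or the stack is empty.
    public int read() throws IOException {
        while (!stack.isEmpty()) {
            int c = stack.peek().read();
            if (c != -1) {
                return c;
            }
            stack.pop().close(); // this stream hit EOF: resume the enclosing one
        }
        return -1; // real end of input: the lexer should now emit its EOF token
    }
}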

I have seen implementations where include handling has been done in the (parser) grammar. Doing it in the lexer like Ira suggests is certainly possible, but with some extra work.
However, a full C-like preprocessor involves more than simply switching input streams: it also needs macro handling, line splicing, trigraph handling, charizing and stringizing, plus an evaluator for #if/#ifdef directives. I have implemented all of that in my Windows Resource File Parser, which was written for ANTLR 2.7 and hence needs an update, but it is certainly good for getting ideas.
In that project I handle include files outside of the normal ANTLR parsing chain, which more closely follows the preprocessor approach you often see for C/C++.

Related

Throw Custom Exception From ANTLR4 Grammar File

I have a grammar file which parses a specific file type. Now I need a simple thing.
I have a parser rule, and when the input tokens don't satisfy that rule I need to throw my own custom exception.
That is, when my input file produces an "extraneous input" error because the parser expects something that the input file doesn't have, I want to throw an exception in this scenario.
Is this possible?
If yes, how?
If no, is there a workaround?
I'm a beginner at this.
grammar Test;
exampleParserRule : [a-z]+ ;
My input file contains 12345. Now I need to throw a custom exception.
For parsing issues such as this, ANTLR will, internally, throw certain exceptions, but they are caught and handled by an ErrorListener. By default, ANTLR will hook up a ConsoleErrorListener that just formats and writes error messages to the console. But it will then continue processing, attempting to recover from the error, and sync back up with your input. This is something you want in a parser. It’s not very useful to have a parser just report the first problem it encounters and then exception out.
You can implement your own ErrorListener (there’s a BaseErrorListener class you can subclass). There you can handle the reported error yourself (the method you override provides a lot of detailed information about the error) and produce whatever message you’d like. (You can also do things like collect all the errors in a list, keep track of error levels, etc.)
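For example, a minimal sketch of such a listener (the class name and message format are just placeholders) that collects errors in a list instead of printing them:

import java.util.ArrayList;
import java.util.List;
import org.antlr.v4.runtime.BaseErrorListener;
import org.antlr.v4.runtime.RecognitionException;
import org.antlr.v4.runtime.Recognizer;

// Collects syntax errors instead of writing them to the console.
public class CollectingErrorListener extends BaseErrorListener {
    private final List<String> errors = new ArrayList<>();

    @Override
    public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
                            int line, int charPositionInLine,
                            String msg, RecognitionException e) {
        errors.add("line " + line + ":" + charPositionInLine + " " + msg);
    }

    public List<String> getErrors() { return errors; }
}

You would attach it with parser.removeErrorListeners() followed by parser.addErrorListener(new CollectingErrorListener()); the same works on the lexer.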
In short, you probably don’t want a different exception, you want a different error message.
Sometimes, depending on how difficult it is to sort out your particular situation for a custom message, it’s really better to look for it in a Listener that processes the parse tree that ANTLR hands back. (Note: a very common path beginners take is to try to get everything into the grammar. It’s going to be hard to get really nice error messages if you do this. ANTLR is pretty good with error messages, but, as a generalized tool, it’s just more likely that you can produce more meaningful messages yourself.)
Just try to get ANTLR to produce a parse tree that accurately reflects the structure of your input. Then you can walk the ParseTree with a validation Listener of your own, producing your own messages.
Another “trick” that doesn’t occur to many ANTLR devs (early on), is that the grammar doesn’t HAVE to ONLY include rules for valid input. If there’s some particular input you want to give a more helpful error message for, you can add a rule to match that (invalid) input, and when you encounter that Context in your validation listener, generate an error message specific to that construct.
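As a rough sketch of that idea (the generated base class, parser class, and context name below are hypothetical and depend entirely on your own grammar), a validation listener handling such a deliberately-invalid construct might look like:

import java.util.ArrayList;
import java.util.List;

// Hypothetical: TestBaseListener and BadAssignmentContext would be generated by ANTLR
// from your grammar; the "badAssignment" rule exists only to match invalid input.
public class ValidationListener extends TestBaseListener {
    private final List<String> errors = new ArrayList<>();

    @Override
    public void enterBadAssignment(TestParser.BadAssignmentContext ctx) {
        errors.add("line " + ctx.getStart().getLine()
                 + ": the left-hand side of an assignment must be a variable");
    }

    public List<String> getErrors() { return errors; }
}

You would then walk the tree with ParseTreeWalker.DEFAULT.walk(new ValidationListener(), tree) and report whatever the listener collected.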
BTW… [a-z]+ would almost always be a Lexer rule. If you don’t yet understand the difference between lexer and parser rules, or the processing pipeline ANTLR uses to tokenize an input stream, and then parse a token stream, do yourself a favor and get a firm understanding of those basics. ANTLR is going to be very confusing without that basic understanding. It’s pretty simple, but very important to understand.
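For reference, the usual pipeline (assuming ANTLR has generated TestLexer and TestParser from a grammar named Test, with exampleParserRule as the start rule) looks roughly like this:

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

public class PipelineDemo {
    public static void main(String[] args) {
        CharStream input = CharStreams.fromString("abc");   // raw characters
        TestLexer lexer = new TestLexer(input);              // characters -> tokens
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);          // tokens -> parse tree
        ParseTree tree = parser.exampleParserRule();         // invoke the start rule
        System.out.println(tree.toStringTree(parser));
    }
}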
You can do this in your grammar:
grammar Test;
@header {
package $your.$package;
import $your.$package.$yourExceptionClass;
}
exampleParserRule : [a-z]+ ;
catch [RecognitionException re] {
    // reporting and recovering are what ANTLR's default error strategy does anyway
    _errHandler.reportError(this, re);
    _errHandler.recover(this, re);
    // then surface the failure as your own (unchecked) exception type
    throw new $yourExceptionClass(re.getMessage(), re);
}
It's up to you whether you really want to reportError (which logs to the console), recover, etc., but these are the defaults, so it may be good to keep them.
Also, you may want to build a more human-readable error message (for example from re.getOffendingToken() and re.getExpectedTokens()).
If you need to do more complex work, follow mike-cargal's advice above.

Read a single character or byte from stdIn without waiting for newline in SML

I encountered a problem while working with the TextIO structure,
because every input waits for a newline character and for the buffer to fill...
How can I work with BinIO and stdIn to solve that problem?
Any helpful input is appreciated.
BTW: I am using MLton, so there is nothing more than the usual standard libs.
As a last resort, you could write it yourself in C, and then call it from SML using the foreign function interface. You can find out more info about MLton's FFI here: http://mlton.org/ForeignFunctionInterface
I encountered a problem while working with the TextIO structure, because every input waits for a newline character and for the buffer to fill... How can I work with BinIO and stdIn to solve that problem?
BinIO, like TextIO, implements buffered I/O. (They both implement the IMPERATIVE_IO signature.) For unbuffered I/O, you need to go "down" a level in abstraction, and use an implementation of PRIMITIVE_IO or POSIX_IO.
Specifically, Posix.IO.readVec lets you read unbufferedly from a file descriptor. (In the case of standard input, the file descriptor is Posix.FileSys.stdin.)
However, if your standard input is from the console (as opposed to being redirected from a file, or taken from a pipe, or whatnot), then there's a very good chance that the console only provides input to MLton after the user hits Enter. Using Posix.IO will bypass the line-buffering functionality that MLton provides, but if you also need to bypass your console's line buffering, then you'll likely need to use special C libraries (specific to your operating system), with the foreign function interface that Matt mentions in his answer.

Using ANTLR4 lexing for Code Completion in Netbeans Platform

I am using ANTLR4 to parse code in my Netbeans Platform application. I have successfully implemented syntax highlighting using ANTLR4 and Netbeans mechanisms.
I have also implemented simple code completion for two of my tokens. At the moment I am using a simple implementation from a tutorial, which searches for a whitespace and starts the completion process from there. This works, but it forces the user to type a whitespace before starting code completion.
My question: is it possible or even contemplated using ANTLR's lexer to determine which tokens are currently read from the input to determine the correct completion item?
I would appreciate any pointer in the right direction to improve this behaviour.
Not really an answer, but I do not have enough reputation points to post comments.
is it possible or even contemplated using ANTLR's lexer to determine which tokens are currently read from the input to determine the correct completion item?
Have a look here: http://www.antlr3.org/pipermail/antlr-interest/2008-November/031576.html
and here: https://groups.google.com/forum/#!topic/antlr-discussion/DbJ-2qBmNk0
Bear in mind that the first post was written in 2008 and the current ANTLR v4 is very different from the one available at the time, which is why Sam’s opinion on this topic appears to have evolved.
My personal experience: most of what you are asking is probably doable with ANTLR, but you would have to know ANTLR very well. A more straightforward option is to use ANTLR to gather information about the context and use your own heuristics to decide what needs to be shown in that context.
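For instance (MyLangLexer here is a hypothetical stand-in for whatever lexer ANTLR generates for your language), you can run just the lexer over the editor contents and look at the token under the caret to drive such heuristics:

import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Token;

public class CompletionContext {
    // Returns the token covering the caret offset, or null if none does.
    public static Token tokenAtCaret(String documentText, int caretOffset) {
        MyLangLexer lexer = new MyLangLexer(CharStreams.fromString(documentText));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill(); // lex the whole document up front
        for (Token t : tokens.getTokens()) {
            if (t.getStartIndex() <= caretOffset && caretOffset <= t.getStopIndex() + 1) {
                return t; // t.getType() tells you which completion items make sense here
            }
        }
        return null;
    }
}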
The ANTLRv3 grammar https://sourceware.org/git/?p=frysk.git;a=blob_plain;f=frysk-core/frysk/expr/CExpr.g;hb=HEAD implements context sensitive completion of C expressions (no macros).
For instance, if fed the string:
a_struct->a<tab>
it would just list the fields of "a_struct" starting with "a" (the tab could, technically, be any character or marker).
The technique it used was to:
modify a C grammar to recognize both IDENT and IDENT_TAB tokens
for IDENT_TAB capture the partial expression AST and "TOKEN_TAB" and throw them back to 'main' (there are hacks to help capture the AST)
'main' then performs a type-eval on the partial expression (computing the expression's type, not its value) and uses that to expand TOKEN_TAB
The same technique, while not exactly ideal, can certainly be used in ANTLR v4.

equivalent of nevow.tags.raw for twisted.web.template

I'm trying to port pydoctor to twisted.web.template and have hit a pretty basic problem: pydoctor uses epydoc to render docstrings into HTML but I can't see a way to include this HTML in the generated page without escaping. What can I do?
There is, somewhat intentionally, no way to insert HTML into the page without parsing; twisted.web.template is a bit more of a stickler about producing correct output than nevow was.
There are a couple of ways around this.
Ultimately, your HTML is going to some kind of output stream. You could simply insert a renderer that returns a pair of Deferred objects, and does a .write to the underlying stream after the first one fires but before the second. Kind of gross, but it effectively expresses your intent :).
You can simply re-parse the HTML output of epydoc using XMLString or similar, so that twisted.web.template can write it out correctly. This will "waste" a little bit of CPU, but in my opinion it will be worth it for (A) the stress-test it will give t.w.t and (B) the guarantee (presuming that t.w.t is correct) that you're emitting valid HTML.
As I was writing this answer, however, I realized that point 2 isn't generally possible with arbitrary HTML with the current public API of twisted.web.template. Ideally, you could use html5lib to parse this stuff, and then just dump the parsed input into your document tree.
If you don't mind mucking around with private API, you could probably hook up html5lib's SAX support to the internal SAX parser that we use to load templates.
Of course, the real solution is to fix the ticket you already filed, so you don't have to use private API outside of Twisted itself...

Switch input stream in lex and yacc

I insist on using lex and not flex.
I am developing an API in lex like the one that exists in flex (yy_switch_buffer, yy_create_buffer, ...), offering the possibility to switch between multiple buffers.
This is the main difficulty for me so far:
For example, when I encounter a #include token I should switch the buffer to the included file. So first I should interrupt the current parsing action (I tried fclose(yyin), which failed: the parser completes the whole current yyin). That is not good, because I need to parse the included file to store structures (for example) used in the main file.
The question is: how can I immediately interrupt the parser? Is it enough to define a new buffer using yyin = fopen(somefile, "r"); ?
It is going to be tough to handle it, if it is doable at all. AFAIK, Lex only allows you to switch input streams on EOF (real or simulated) when it calls yywrap().
Maybe you can fake things so that when you find the 'include' directive, you fake an EOF on the current stream and then have yywrap() fix things so that the new input comes from the included file, and then when you reach EOF on the included file, you have yywrap() restore input from the original input stream at the original position. Clearly, this works for nested includes (if it works at all) unless you arbitrarily restrict the number of levels of inclusion.
There's no portable way to do this with POSIX lex -- different implementations have different internal arrangements of how they deal with and buffer input, and during lexing, may have read ahead of the currently processing token and buffered a bunch of the input. So you need to get it to save what it has currently buffered and switch to a different input, and then restore the buffered stuff (so it will be read next) after you're done with the #include or whatever. This is precisely what flex's buffer management calls are for, but if you insist on using lex, you'll need to (effectively) port these routines to understand the internals of whatever versions of lex you need to support.
The solution to the "included input files" problem is part of the flex documentation, which provides an example of how to switch between flex inputs: ftp://ftp.gnu.org/old-gnu/Manuals/flex-2.5.4/html_mono/flex.html#SEC12 ("Multiple Input Buffers")
You can find the flex tool ported to Windows here: http://sourceforge.net/projects/winflexbison/