I'm writing a preprocessor for my language. In the preprocessor I've output a line that wasn't in the source file. This causes any error messages that Anltr creates to be incremented by one line.
The Lexer handles the line count so I'm wondering if there is a way for the parser to tell the lexer to decrement the line count, or to ignore a specific newline.
I'm also open to other suggestions on how to work around this.
The only constraint I have is putting the extra line inline with the existing code. I'd prefer to keep it on it's own line to keep my parsing sane.
What if you remove the pre-processor line and replace it with the new line instead of adding a new one? Then parse the output of the pre-processor.
Related
So i am writing a small language and i am using antlrv4 as my tool. Antlr autogenerates lexer and parser files when u compile your grammar file(.g4). I am using javac btw. I want my language to have no semicolons and the way i want to do this is: if there is an identifier or ")" as the last token in a line, the lexer will automatically put the semicolon(Similar to what "go" language does). How would i approach something like this? There are other things like ATN(which i think is augmented transition network) and dfa(which i think is deterministic finite automaton) in the lexer file which i don't understand or how they relate to the lexing process?. Any help is appreciated. (btw i am still working on the grammar file so i don't have that fully completed).
Several points here: the ATN and the DFA are internal structures for parser + lexer and not something you would touch to change parsing behavior. Also, it's not clear to me why you want to have the lexer insert a semicolon at some point. What exactly do you want to accomplish by that (don't say: to make semicolons optional in the parser, I mean the underlying reason).
If you want to accept a command without a trailing semicolon you can make that optional:
assignment: simpleAssignment | complexAssignment SEMI?;
The parser will give you the content of the assignment rule regardless whether there is a trailing semicolon or not. Is that what you want?
Probably quite a niche question, but I believe in the power of a big community: Is it possible to set up jEdit in way, that it automatically inserts a comment character (//, #, ... depending on the edit mode) at the beginning of a new line, if the line before the wrap was a comment?
Sample:
# This is a comment spanning multiple lines. If I continue to type here, it
# wraps around automatically, but I have to manually add a `#` to each line.
If I continue to type after the . the third line should start with the # automatically. I searched in the plugin repository but could not find anything related.
Background: jEdit has the concepct of soft and hard wrap. While soft wrap only breaks lines visually at a character limit, it does not insert line breaks in the file. Hard wrap on the other hand inserts \n into the file at the desired character count.
This is not exactly what you want: I use the macros Enter_with_Prefix.bsh to automatically insert the prefix (e.g., #, //) at the beginning of the new line.
Description copied from Enter_with_Prefix.bsh:
Enter_with_Prefix.bsh - a Beanshell macro for jEdit
that starts a new line continuing any recognized
sequence that started the previous. For example,
if the previous line beings with "1." the next will
be prefixed with "2.". It supports alpha lists (a., b., etc...),
bullet lists (+, =, *, etc..), comments, Javadocs,
Java import statements, e-mail replies (>, |, :),
and is easy to extend with new sequence types. Suggested
shortcut for this macro is S+ENTER (SHIFT+ENTER).
The TestDriver in ANTLRWorks2 seems kind of finicky about when it'll accept a grammer without and explicit EOF and when it will not. The Hello grammar in the ANTLR4 Getting Started Guide doesn't use EOF anywhere, so I inferred that it's better to avoid explicit EOF if possible.
What is the best practice for using EOF? When do you actually need it?
You should include an explicit EOF at the end of your entry rule any time you are trying to parse an entire input file. If you do not include the EOF, it means you are not trying to parse the entire input, and it's acceptable to parse only a portion of the input if it means avoiding a syntax error.
For example, consider the following rule:
file : item*;
This rule means "Parse as many item elements as possible, and then stop." In other words, this rule will never attempt to recover from a syntax error because it will always assume that the syntax error is part of some syntactic construct that's beyond the scope of the file rule. Syntax errors will not even be reported, because the parser will simply stop.
If instead I had the following rule:
file : item* EOF;
In means "A file consists exactly of a sequence of zero-or-more item elements." If a syntax error is reached while parsing an item element, this rule will attempt to recover from (and report) the syntax error and continue because the EOF is required and has not yet been reached.
For rules where you are only trying to parse a portion of the input, ANTLR 4 often works, but not always. The following issue describes a technical problem where ANTLR 4 does not always make the correct decision if the EOF is omitted.
https://github.com/antlr/antlr4/issues/118
Unfortunately the performance impact of this change is substantial, so until that is resolved there will be edge cases that do not behave as you expect.
I'm implementing a small shell, and using lex&yacc to parse command. Lex reads command from stdin and yacc execute the command after yyparse.
The problem is, when there is a syntax error, yacc prompt an error and parse from the begining. In this case, cmd1 >>> cmd2 leads to running cmd2 becuase >>> is a syntax error.
My question is how to discard the rest of current command after encounting a syntax error?
If you want to write an interactive language with a prompt that lets users enter expressions, it's a bad idea to simply use yacc on the entire input stream. Yacc might get confused about something on one line and then misinterpret subsequent lines. For instance, the user might have an unbalanced parenthesis on the first line. or a string literal which is not closed, and then yacc will just keep consuming subsequent lines of the input, looking to close the construct.
It's better to gather the line of input from the user, and then parse that as one unit. The end of the line then simply the end of the input as far as Yacc is concerned.
If you're using lex, there are ways to redirect lex to read characters from a buffer in memory instead of from a FILE * stream. Look for documentation on the YY_INPUT macro, which you can define in a Lex file to basically specify the code that Lex uses for obtaining input characters.
Analogy time: Using a scanner developed with lex/yacc for directly handling interactive user input is a little bit like using scanf for handling user input. Whereas capturing a line into a buffer and then parsing it is more like using sscanf. Quote:
It's perfectly appropriate to parse strings with sscanf (as long as the return value is checked), because it's so easy to regain control, restart the scan, discard the input if it didn't match, etc. [comp.lang.c FAQ, 12.20].
ANTLR gives me the following error when my input file has either no newline at the EOF, or more than one.
line 0:-1 mismatched input '' expecting NEWLINE
How would I go about taking into account the possibilities of having multiple or no newlines at the end of the input file. Preferably I'd like to account for this in the grammar.
The rule:
parse
: (Token LineBreak)+ EOF
;
only parses a stream of tokens, separated by exactly one line break, ending with exactly one line break.
While the rule:
parse
: Token (LineBreak+ Token)* LineBreak* EOF
;
parses a stream of tokens separated by one or more line breaks, ending with zero, one or more line breaks.
But do you really need to make the line breaks visible in the parser? Couldn't you put them on a "hidden channel" instead?
If this doesn't answer your question, you'll have to post your grammar (you can edit your original question for that).
HTH