Explanation for the following grammar written in ANTLR 4 - antlr

I have a sample grammar written in ANTLR 4
query : select from ';' !? EOF!
I have understood
query : select from ';'
how it works
What does !? EOF! means in the grammar and how it works?

The exclamation marks is used in ANTLR v3 grammars to denote that a certain node should be omitted from the generated AST. Since ANTLR v4 does not have AST's, this construct is no longer used.
In both v3 and v4, the ? denotes that a rule (lexer or parser) is optional and EOF means the end-of-file constant.
To summarize ';'!? means: optionally match a ';' and exclude it from the AST. And EOF! means: match the end-of-file and exclude this token from the AST.
So, the v3 parser rule:
query : select from ';'!? EOF!
should look like this in a v4 grammar:
query : select from ';'? EOF

Related

ANTLR4 problem with balanced parentheses grammar

I'm doing some experiments with ANTLR4 with this grammar:
srule
: '(' srule ')'
| srule srule
| '(' ')';
this grammar is for the language of balanced parentheses.
The problem is that when I run antlr with this string: (()))(
This string is obviously wrong but antlr simply return this AST:
It seems to stop when it finds the wrong parenthesis, but no error message returns. I would like to know more about this behavior. Thank you
The parser recognises (())) and then stops. If you want to force the parser to consume all tokens, "anchor" your test rule with the EOF token:
parse_all
: srule EOF
;
Btw, it's always a good idea to include the EOF token in the entry point (entry rule) of your grammar.

What is the antlr4 (v-4.1) equivalent form of the following grammar rule (written for antlr3 (v-3.2))?

What is the antlr4 (v-4.1) equivalent form of the following grammar rule (written for antlr3 (v-3.2))?
text
: tag => (tag)!
| outsidetag
;
The following is invalid in ANTLR 3:
text
: tag => (tag)!
| outsidetag
;
You probably meant the following:
text
: (tag)=> (tag)!
| outsidetag
;
where ( ... )=> is a syntactic predicate, which has no ANTLR4 equivalent: simply remove them. As 280Z28 mentioned (and also explained in the previous link): the lack of syntactic predicates is not a feature that was removed from ANTLR 4. It's a workaround for a weakness in ANTLR 3's prediction algorithm that no longer applies to ANTLR 4.
The exlamation mark in v3 denotes to removal of a rule in the generated AST. Since ANTLR4 does not produce AST's, also just remove the exclamation mark.
So, the v4 equivalent would look like this:
text
: tag
| outsidetag
;

ANTLR 4 lexer subrule order

Does the order of choices among lexer subrules matter in ANTLR4? For example, is there any difference between the following rules?
STRING: '"' ('\\"' | .)*? '"';
STRING: '"' (. | '\\"')*? '"';
The first lexical rule can match the whole of such an input as: "abc\"def". the second will match only part of it, that is, "abc\", and then report error for the rest character sequence.
Antlr generated lexer matches the subrule first which defined first. I have tested them on Antlr 4.

How to consume text until newline in ANTLR?

How do you do something like this with ANTLR?
Example input:
title: hello world
Grammar:
header : IDENT ':' REST_OF_LINE ;
IDENT : 'a'..'z'+ ;
REST_OF_LINE : ~'\n'* '\n' ;
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
(I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.)
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
You must understand that the lexer operates independently from the parser. No matter what the parser would "like" to match at a certain time, the lexer simply creates tokens following some strict rules:
try to match tokens from top to bottom in the lexer rules (rules defined first are tried first);
match as much text as possible. In case 2 rules match the same amount of text, the rule defined first will be matched.
Because of rule 2, your REST_OF_LINE will always "win" from the IDENT rule. The only time an IDENT token will be created is when there's no more \n at the end. That is what's going wrong with your grammars: the error messages states that it expects a IDENT token, which isn't found (but a REST_OF_LINE token is produced).
I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.
You can't just define tokens (lexer rules) you want to apply to the header of a file. These tokens will also apply to the rest of the more complex file. Perhaps you should pre-process the header separately from the rest of the file?
antlr parsing is usually done in 2 steps.
1. construct your ast
2. define your grammer
pseudo code (been a few years since I played with antlr) - AST:
WORD : 'a'..'z'+ ;
SEPARATOR : ':';
SPACE : ' ';
pseudo code - tree parser:
header: WORD SEPARATOR WORD (SPACE WORD)+
Hope that helps....

Match lowercase with ANTLR

I use ANTLRWorks for a simple grammar:
grammar boolean;
// [...]
lowercase_string
: ('a'..'z')+ ;
However, the lowercase_string doesn't match foobar according to the Interpreter (MismatchedSetException(10!={}). Ideas?
You can't use the .. operator inside parser rules like that. To match the range 'a' to 'z', create a lexer rule for it (lexer rules start with a capital).
Try it like this:
lowercase_string
: Lower+
;
Lower
: 'a'..'z'
;
or:
lowercase_string
: Lower
;
Lower
: 'a'..'z'+
;
Also see this previous Q&A: Practical difference between parser rules and lexer rules in ANTLR?