What is the antlr4 (v-4.1) equivalent form of the following grammar rule (written for antlr3 (v-3.2))?

What is the antlr4 (v-4.1) equivalent form of the following grammar rule (written for antlr3 (v-3.2))?
text
: tag => (tag)!
| outsidetag
;

The following is invalid in ANTLR 3:
text
: tag => (tag)!
| outsidetag
;
You probably meant the following:
text
: (tag)=> (tag)!
| outsidetag
;
where ( ... )=> is a syntactic predicate. Syntactic predicates have no ANTLR4 equivalent: simply remove them. As 280Z28 mentioned (and as explained in the previous link), syntactic predicates are not a feature that was removed from ANTLR 4; they were a workaround for a weakness in ANTLR 3's prediction algorithm that no longer applies to ANTLR 4.
The exclamation mark in v3 excludes the matched rule or token from the generated AST. Since ANTLR4 does not produce ASTs, simply remove the exclamation mark as well.
So, the v4 equivalent would look like this:
text
: tag
| outsidetag
;

Related

Did "!", "^" and "$" have a special meaning in ANTLR3?

I don't have any prior knowledge of ANTLR (I recently learned a little about ANTLR4), but I have to translate an old grammar to a newer version. Eclipse tells me that there are no viable alternatives for those characters and shows the syntax error " '!' came as a complete surprise to me".
I already deleted those characters and it does not seem to be a problem, but maybe they had a special function in ANTLR3.
Thanks in advance.
global_block:
DATABASE! IDENTIFIER!
| GLOBALS! define_section!+ END! GLOBALS!
| GLOBALS! STRING!
;
main_block: MAIN sequence? END em=MAIN
-> ^(MAIN MAIN '(' ')' sequence? $em)
;
^ and -> are related to tree rewriting: https://theantlrguy.atlassian.net/wiki/spaces/ANTLR3/pages/2687090/Tree+construction
ANTLR4 does not support it (v4 has listeners and visitors for tree traversal, but no rewriting anymore). Just remove all of these ! and -> ... in parser rules (do not remove the -> ... inside lexer rules like -> channel(...), which is still supported in v4).
So in your case, these rules would be valid in ANTLR4:
global_block
: DATABASE IDENTIFIER
| GLOBALS define_section+ END GLOBALS
| GLOBALS STRING
;
main_block
: MAIN sequence? END MAIN
;
The $ can still be used in ANTLR4: it references labeled sub-rules or tokens:
expression
: lhs=expression operator=(PLUS | MINUS) rhs=expression
| NUMBER
;
so that in an embedded code block you can write $lhs.someField.someMethod(). In your case, you can also just remove the $ references, because they are probably only used in the tree rewrite rules.
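The removals described above (the suffix operators ! and ^, plus the -> ... rewrites in parser rules) can be mechanised for simple grammars. The following is a minimal Python sketch, not an official conversion tool: the function name strip_v3_tree_operators is hypothetical, and it assumes the grammar contains no embedded actions or predicates (a != inside { ... } would be damaged). It should only be run on ANTLR3 input, since it would also strip ANTLR4 lexer commands such as -> channel(...). Labels like em=MAIN are kept, since they remain valid in v4.

```python
import re

# One token is either a quoted literal, the rewrite arrow, or any single character.
TOKEN = re.compile(r"'(?:\\.|[^'\\])*'|->|.", re.DOTALL)

def strip_v3_tree_operators(grammar):
    """Drop '!' and '^' operators and '-> ...' rewrites outside quoted literals.

    Assumption: no embedded actions or predicates in the grammar text.
    """
    out = []
    skipping = False  # True while inside a '-> ...' rewrite
    depth = 0         # parenthesis depth inside the rewrite
    for match in TOKEN.finditer(grammar):
        tok = match.group()
        if skipping:
            if tok == "(":
                depth += 1
            elif tok == ")" and depth > 0:
                depth -= 1
            elif depth == 0 and tok in ("|", ";", ")"):
                # The rewrite ends at the next alternative, rule end,
                # or the close of an enclosing subrule.
                skipping = False
                out.append(tok)
            continue
        if tok == "->":
            skipping = True
            depth = 0
        elif tok in ("!", "^"):
            continue  # AST operators have no meaning in ANTLR4
        else:
            out.append(tok)
    return "".join(out)

if __name__ == "__main__":
    v3_rule = "main_block: MAIN sequence? END em=MAIN\n-> ^(MAIN MAIN '(' ')' sequence? $em)\n;"
    print(strip_v3_tree_operators(v3_rule))
```

Quoted literals are matched as whole tokens, so a '!' or '(' inside a literal is left untouched. Anything beyond this (syntactic predicates, options such as output=AST) still needs manual cleanup.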
EDIT
kaby76 has a Github page with some instructions for converting grammars to ANTLR4: https://github.com/kaby76/AntlrVSIX/blob/master/doc/Import.md#antlr3

Antlr Arrow Syntax

I found this syntax in an Antlr parser for bash:
file_descriptor
: DIGIT -> ^(FILE_DESCRIPTOR DIGIT)
| DIGIT MINUS -> ^(FILE_DESCRIPTOR_MOVE DIGIT);
What does the -> syntax do?
What is it called such that I can google it to read about it?
The 'Definitive Guide to Antlr4' only has one page about it. It refers to "lexer command", but it never names the operator. The usage in the book differs from the usage in the bash parser.
In ANTLR3, -> is used in parser rules and signifies a tree rewrite rule, which is no longer supported in ANTLR4.
In ANTLR4, the -> is used in lexer rules and has nothing to do with the old v3 functionality.

ANTLR and Literate Programming

I'm trying to write a parser for a simple, literate language -- with structure similar to PHP. The source might look something like:
blurb blurb blurb
[[ if mode == 5 ]]
then blurb blurb blurb
[[ else ]]
else blurb blurb blurb
[[ end ]]
The non-code sections -- those not nested in [[ ]] -- don't follow any syntax rules. It's just natural language.
However, I'm not sure how to write grammar rules to match the non-code text. I'd welcome any help on how I might do this!
You can treat the non-code text like comments.
To indicate whether a block is code or comment, you can introduce some special symbols, e.g. /* blub blub */ or something like that.
Your parser grammar could then look like this:
program : (if_statement | non_code)* ;
if_statement : '[[' 'if' expression ']]'
...
expression : var OPERATOR var ;
var : LITERAL ;
non_code : '/*' any_text* '*/' ;
any_text : LITERAL | SPECIAL_CHAR ;
with the following lexer rules:
SPECIAL_CHAR : '-' | '+' ....
OPERATOR : '<' | '>' ....
LITERAL : (CHAR | DIGIT)+ ;
fragment CHAR : 'A'..'Z' | 'a'..'z' ;
fragment DIGIT : '0'..'9' ;
EDIT due to comment:
OK, then maybe you can try some kind of preprocessing or chaining of parsers. I did something similar some time ago. In your case I would scan the input string with a simple regex, look for the code parts, and then internally wrap the non-code parts in some kind of tag.
Input:
blub blub blah
[[ if express ]]
blah blah blub
--> Preprocess
<non-code>blub blub blah</non-code>
[[ if express ]]
<non-code>blah blah blub</non-code>
--> Parsing using ANTLR Parser and Lexer
You may also want to look at TreeParser, where you can reduce your input grammar to the parts you wish to evaluate by leaving out unnecessary tokens.
It looks like the ANTLR folks identified this task long ago. I guess what I'm trying to build is an island grammar, where islands of syntax appear within a sea of text that has no rules applied.
Chapter 12 of Parr's Definitive ANTLR 4 Reference led me to a solution, which involves switching between sublexers when I hit a delimiter.
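To make the idea of switching between sublexers concrete, here is a hand-written Python sketch of a two-mode tokenizer. It is not ANTLR output and does not use ANTLR's mode syntax; the token names (TEXT, OPEN, CLOSE, ID, INT, OP) and the tiny code-mode vocabulary are assumptions chosen for illustration. In TEXT mode everything up to the next [[ is one token; in CODE mode ordinary tokens are recognised until ]] switches back.

```python
import re

# Token vocabulary for the "island" (code) mode; purely illustrative.
CODE_TOKEN = re.compile(
    r"(?P<WS>\s+)|(?P<ID>[A-Za-z_]\w*)|(?P<INT>\d+)|(?P<OP>==|[<>])"
)

def tokenize(src):
    """Return (kind, text) pairs, switching modes at the [[ and ]] delimiters."""
    tokens = []
    mode = "TEXT"
    pos = 0
    while pos < len(src):
        if mode == "TEXT":
            # Sea of text: swallow everything up to the next opening delimiter.
            end = src.find("[[", pos)
            if end == -1:
                end = len(src)
            if end > pos:
                tokens.append(("TEXT", src[pos:end]))
            if end < len(src):
                tokens.append(("OPEN", "[["))
                mode = "CODE"
                end += 2
            pos = end
        else:
            # Island of code: a closing delimiter returns to text mode.
            if src.startswith("]]", pos):
                tokens.append(("CLOSE", "]]"))
                mode = "TEXT"
                pos += 2
                continue
            match = CODE_TOKEN.match(src, pos)
            if match is None:
                raise ValueError("unexpected character at offset %d" % pos)
            if match.lastgroup != "WS":
                tokens.append((match.lastgroup, match.group()))
            pos = match.end()
    return tokens

if __name__ == "__main__":
    for token in tokenize("blurb [[ if mode == 5 ]] then blurb"):
        print(token)
```

The parser then only ever sees a flat token stream in which each run of natural language is a single TEXT token, so no grammar rule has to describe its contents.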
In looking at the way GHC manages literate Haskell files, I think the best approach may be a preprocessing step that "deliterates" the source by turning the non-code sections into something that is more formally specified.
Maybe I have an emit function that takes as a parameter the non-code text. I can preprocess the source with something like:
src.gsub(/(\A|\]\])(.*?)(\Z|\[\[)/m, 'emit(\2)')
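The same preprocessing idea can be sketched in Python. This version swaps the single substitution for a split on the [[ ... ]] delimiters, which makes it easy to tell code from text by position. The function name deliterate is hypothetical, emit is the assumed output function mentioned above, and quoting the text with repr is an illustrative choice rather than a requirement of the target language.

```python
import re

# The capturing group makes split() return text and code alternately:
# [text, code, text, code, ..., text]
CODE_BLOCK = re.compile(r"\[\[(.*?)\]\]", re.DOTALL)

def deliterate(src):
    """Return a list of statements: code kept as-is, text wrapped in emit(...)."""
    statements = []
    for index, part in enumerate(CODE_BLOCK.split(src)):
        chunk = part.strip()
        if not chunk:
            continue  # skip empty runs between adjacent code blocks
        if index % 2 == 1:
            statements.append(chunk)  # odd positions hold code
        else:
            statements.append("emit(" + repr(chunk) + ")")  # even positions hold text
    return statements

if __name__ == "__main__":
    source = (
        "blurb blurb blurb\n"
        "[[ if mode == 5 ]]\n"
        "then blurb blurb blurb\n"
        "[[ else ]]\n"
        "else blurb blurb blurb\n"
        "[[ end ]]\n"
    )
    for statement in deliterate(source):
        print(statement)
```

The resulting statement list is fully "formal" input, so the downstream grammar only needs rules for if, else, end and emit(...).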

Explanation for the following grammar written in ANTLR 4

I have a sample grammar written in ANTLR 4
query : select from ';' !? EOF!
I understand how this part works:
query : select from ';'
What does !? EOF! mean in the grammar, and how does it work?
The exclamation mark is used in ANTLR v3 grammars to denote that a certain node should be omitted from the generated AST. Since ANTLR v4 does not have ASTs, this construct is no longer used.
In both v3 and v4, the ? denotes that the preceding element (in a lexer or parser rule) is optional, and EOF is the built-in end-of-file token.
To summarize, ';'!? means: optionally match a ';' and exclude it from the AST. And EOF! means: match the end-of-file and exclude this token from the AST.
So, the v3 parser rule:
query : select from ';'!? EOF!
should look like this in a v4 grammar:
query : select from ';'? EOF

Antlr 3 keywords and identifiers colliding

Surprise, I am building an SQL-like language parser for a project.
I had it mostly working, but when I started testing it against real requests it would be handling, I realized it was behaving differently on the inside than I thought.
The main issue in the following grammar is that I define a lexer rule PCT_WITHIN for the language keyword 'pct_within'. This works fine, but if I try to match a field like 'attributes.pct_vac', I get the field having text of 'attributes.ac' and a pretty ANTLR error of:
line 1:15 mismatched character u'v' expecting 'c'
GRAMMAR
grammar Select;
options {
language=Python;
}
eval returns [value]
: field EOF
;
field returns [value]
: fieldsegments {print $field.text}
;
fieldsegments
: fieldsegment (DOT (fieldsegment))*
;
fieldsegment
: ICHAR+ (USCORE ICHAR+)*
;
WS : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
ICHAR : ('a'..'z'|'A'..'Z');
PCT_CONTAINS : 'pct_contains';
USCORE : '_';
DOT : '.';
I have been reading everything I can find on the topic. How the Lexer consumes stuff as it finds it even if it is wrong. How you can use semantic predication to remove ambiguity/how to use lookahead. But everything I read hasn't helped me fix this issue.
Honestly, I don't see how it even CAN be an issue. I must be missing something super obvious, because other grammars I see have lexer rules like EXISTS, but that doesn't cause the parser to take a string like 'existsOrNot' and spit out an IDENTIFIER with the text of 'rNot'.
What am I missing or doing completely wrong?
Convert your fieldsegment parser rule into a lexer rule. As it stands now it will accept input like
"abc
_ abc"
which is probably not what you want. The keyword "pct_contains" won't be matched by this rule since it is defined separately. If you want to accept the keyword in certain sequences as a regular identifier, you will have to include it in the accepted identifier rule.
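The effect of matching a whole field segment as one token can be illustrated outside ANTLR. The Python sketch below is not ANTLR's lexing algorithm; it is a hand-written stand-in under the assumption that an identifier is letters optionally joined by underscores (mirroring the fieldsegment rule). The names lex, IDENT and KEYWORDS are hypothetical. Each identifier is matched in full first, and only then compared against the keyword set, so 'pct_vac' can never be mistaken for a partial 'pct_contains'.

```python
import re

KEYWORDS = {"pct_contains"}  # reserved words, taken from the grammar above

# IDENT mirrors the fieldsegment rule: letters, optionally joined by underscores.
TOKEN = re.compile(
    r"(?P<WS>\s+)|(?P<IDENT>[A-Za-z]+(?:_[A-Za-z]+)*)|(?P<DOT>\.)"
)

def lex(src):
    """Return (kind, text) pairs; identifiers are matched whole, then classified."""
    tokens = []
    pos = 0
    while pos < len(src):
        match = TOKEN.match(src, pos)
        if match is None:
            raise ValueError("unexpected character at offset %d" % pos)
        kind, text = match.lastgroup, match.group()
        if kind == "IDENT" and text in KEYWORDS:
            kind = "PCT_CONTAINS"  # the full text equals a keyword
        if kind != "WS":
            tokens.append((kind, text))
        pos = match.end()
    return tokens

if __name__ == "__main__":
    print(lex("attributes.pct_vac"))
    print(lex("pct_contains"))
```

Because whitespace is no longer allowed inside an identifier, input such as "abc", newline, "_ abc" is also rejected here instead of being glued together by the parser.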