Include certain escapement symbols into ANTLR Lexer rules - antlr

I'm creating a parser in ANTLR4 and Python. Below are the lexer rules I created in ANTLR.
VARIABLE_ID : [$][a-zA-Z][a-zA-Z0-9_]*;
ARRAY_ID : [*][a-zA-Z][a-zA-Z0-9_]*;
STRINGCONST : ["][/|:.a-zA-Z0-9 ]+["];
WS : [ \r\t\f\n]+ -> skip;
I am looking at the STRINGCONST rule and trying to add symbols such as - and ~. However, since they are special characters that need escaping, ANTLR just throws errors. I've tried escaping them with themselves, but I haven't been able to get that to work.
Is there a way to include them in the STRINGCONST rule? The basic idea is that a string should be any run of characters between two " marks; however, I'm happy to limit it to what's currently in the rule as long as I can get - and ~ in there as well.

You can escape chars by adding a \ in front of them:
STRINGCONST : ["] [/|:.a-zA-Z0-9 \-~]+ ["];
Note that ~ has no special meaning inside a character class (only outside one), so ~ doesn't need to be escaped.
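Since the question mentions Python, here is a small sketch that uses Python's re module (not ANTLR) to illustrate the same escaping rule. It assumes that Python's character-class syntax is close enough to ANTLR's for this purpose: - must be escaped (or placed where it cannot form a range), while ~ is an ordinary character inside [...]. The names STRINGCONST, ANY_STRING and is_stringconst are made up for this illustration.

```python
import re

# Same character class as the STRINGCONST rule above, with "-" escaped and "~" left as-is.
STRINGCONST = re.compile(r'"[/|:.a-zA-Z0-9 \-~]+"')

# Broader variant (an assumption about what "any character between two quotes" means):
# everything except a double quote is allowed between the quotes.
ANY_STRING = re.compile(r'"[^"]*"')


def is_stringconst(text):
    """Return True if the whole text is a string constant under the restricted rule."""
    return STRINGCONST.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ['"a-b~c"', '"path/to:file.txt"', '"a_b"']:
        print(sample, is_stringconst(sample))
```

The last sample is rejected because _ is not part of the character class, which matches the restricted rule in the answer.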

Related

Did "!", "^" and "$" have a special meaning in ANTLR3?

I don't have any prior knowledge of ANTLR (I recently learned a little about ANTLR4), but I have to translate an old grammar to a newer version. Eclipse tells me that there are no viable alternatives for those characters and shows the syntax error " '!' came as a complete surprise to me".
I already deleted those characters and it does not seem to be a problem, but maybe they had a special function in ANTLR3.
Thanks in advance.
global_block:
DATABASE! IDENTIFIER!
| GLOBALS! define_section!+ END! GLOBALS!
| GLOBALS! STRING!
;
main_block: MAIN sequence? END em=MAIN
-> ^(MAIN MAIN '(' ')' sequence? $em)
;
!, ^ and -> are related to tree construction and rewriting: https://theantlrguy.atlassian.net/wiki/spaces/ANTLR3/pages/2687090/Tree+construction
ANTLR4 does not support these (v4 has listeners and visitors for tree traversal, but no rewriting anymore). Just remove all of these ! and -> ... in parser rules (do not remove the -> ... inside lexer rules, like -> channel(...), which is still supported in v4).
So in your case, these rules would be valid in ANTLR4:
global_block
: DATABASE IDENTIFIER
| GLOBALS define_section+ END GLOBALS
| GLOBALS STRING
;
main_block
: MAIN sequence? END MAIN
;
The $ can still be used in ANTLR4: it is used to reference labeled sub-rules or tokens:
expression
: lhs=expression operator=(PLUS | MINUS) rhs=expression
| NUMBER
;
so that in an embedded code block you can write $lhs.someField.someMethod(). In your case, you can also just remove them because they are probably only used in the tree rewrite rules.
EDIT
kaby76 has a Github page with some instructions for converting grammars to ANTLR4: https://github.com/kaby76/AntlrVSIX/blob/master/doc/Import.md#antlr3
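For a large grammar, the manual cleanup described in this answer could be roughed out with a script. The following is only a sketch under strong assumptions: it is regex-based, it is meant to be applied to the text of parser rules only (it would also strip lexer commands such as -> channel(...)), and it assumes that quoted literals in the rule do not contain |, ;, ! or ^. The function name strip_v3_tree_operators is hypothetical, not part of any ANTLR tool.

```python
import re


def strip_v3_tree_operators(rule):
    """Remove ANTLR3 tree operators from the text of ONE parser rule (rough sketch)."""
    # Drop rewrite parts: everything from "->" up to the next "|" or ";".
    rule = re.sub(r"->[^|;]*", "", rule)
    # Drop "!" and "^" when they directly follow a reference, literal, or suffix operator.
    rule = re.sub(r"(?<=[\w)'+*?])[!^]", "", rule)
    # Normalise whitespace so the result fits on one line.
    return re.sub(r"\s+", " ", rule).strip()


if __name__ == "__main__":
    v3_rule = """main_block: MAIN sequence? END em=MAIN
    -> ^(MAIN MAIN '(' ')' sequence? $em)
    ;"""
    print(strip_v3_tree_operators(v3_rule))
```

The label em=MAIN is left in place because labels are still valid in ANTLR4; it can be removed by hand if nothing references $em anymore.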

Partially skip characters in ANTLR4

I'm trying to match the following phrase:
<svg/onload="alert(1);">
And I need the tokens to be like:
'<svg', 'onload="alert(1);"', '>'
So basically I need to skip the / in the <svg/onload part. But the skip phrase is not allowed here:
Attribute
: ('/' -> skip) Identifier '=' StringLiteral?
;
The error was
error(133): HTML.g4:35:11: ->command in lexer rule Attribute must be last element of single outermost alt
Any ideas?
The error message pretty much tells you what the problem is. The skip command has to be at the end of the rule. You cannot skip intermediate tokens, but only entire rules.
However, I wonder why you want to skip the slash. Why not just let the lexer scan everything (it has to anyway) and then ignore the tokens you don't need? Also I wouldn't use a lexer rule, but a parser rule, to allow arbitrary spaces between elements.
Try the lexer's setText(getText().replace("/", "")), or any other manipulation of the matched text.
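The "scan everything, then ignore the tokens you don't need" suggestion can be sketched in plain Python. This is not ANTLR code: it is a hand-written tokenizer built on the re module, and the token names TAG_OPEN, SLASH, ATTRIBUTE, TAG_CLOSE and WS as well as the function names are assumptions made for this illustration.

```python
import re

# Hypothetical token definitions for the sample input; order matters only for ties.
TOKEN_SPEC = [
    ("TAG_OPEN", r"<[A-Za-z]+"),
    ("SLASH", r"/"),
    ("ATTRIBUTE", r'[A-Za-z]+=(?:"[^"]*")?'),
    ("TAG_CLOSE", r">"),
    ("WS", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

IGNORED = {"SLASH", "WS"}  # tokens that are scanned but dropped afterwards


def tokenize(text):
    """Scan the whole input and return a list of (token_type, token_text) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def significant_texts(text):
    """Return the token texts after filtering out the ignored token types."""
    return [value for kind, value in tokenize(text) if kind not in IGNORED]


if __name__ == "__main__":
    print(significant_texts('<svg/onload="alert(1);">'))
```

The slash is still recognised as its own token, so nothing has to be skipped in the middle of a rule; it is simply filtered out before the tokens are used.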

Optional Prefix in ANTLR parser/lexer

I'm trying to use ANTLR4 to parse input strings that are described by a grammar like:
grammar MyGrammar;
parse : PREFIX? SEARCH;
PREFIX
: [0-9]+ ':'
;
SEARCH
: .+
;
e.g. valid input strings include:
0: maracujá
apple
3:€53.60
1: 10kg
2:chilli pepper
But the SEARCH rule always matches the whole string - whether it has a prefix or not.
I understand this is because the ANTLR4 lexer gives preference to the rules that match the longest string. Therefore the SEARCH rule matches all input, not giving the PREFIX rule a chance.
And the non-greedy version (i.e. SEARCH : .+? ;) has the same problem because (as I understand) it's only non-greedy within the rule - and the SEARCH rule doesn't have any other parts to constrain it.
If it helps, I could constrain the SEARCH text to exclude ':' but I really would prefer it recognise anything else - unicode characters, symbols, numbers, space etc.
I've read "Lexer to handle lines with line number prefix", but in that case the body of the string (after the prefix) is significantly more constrained.
Note: SEARCH text might have a structure to it - like €53.00 and 10kg above (which I'd also like ANTLR4 to parse) or it might just be free text - like apple, maracujá and chilli pepper above. But I've tried to simplify so I can solve the problem of extracting the PREFIX first.
ANTLR does lexing before parsing. The lexer prefers long matches and SEARCH tokens match every PREFIX token and even any character appended to it, so your complete line is matched by SEARCH.
To prevent this, keep the lexer rules disjoint, or at least make sure the tokens do not subsume each other.
parse : prefix? search;
search: (WORD | NUMBER)+;
prefix: NUMBER ':';
NUMBER : [0-9]+;
WORD : (~[0-9:])+;
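The longest-match behaviour described above can be imitated with a few lines of Python. This is a simplified model, not the ANTLR runtime: at each position every rule is tried, the longest match wins, and ties go to the rule listed first. The rule lists OVERLAPPING and DISJOINT and the token name COLON (standing in for the implicit ':' token) are assumptions made for this sketch.

```python
import re


def tokenize(rules, text):
    """Longest-match tokenizer: rules is an ordered list of (name, regex) pairs."""
    compiled = [(name, re.compile(pattern, re.DOTALL)) for name, pattern in rules]
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Strictly longer matches replace the current best; ties keep the earlier rule.
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


# The rules from the question: SEARCH subsumes PREFIX.
OVERLAPPING = [("PREFIX", r"[0-9]+:"), ("SEARCH", r".+")]

# The rules from the answer: no token can swallow another one.
DISJOINT = [("NUMBER", r"[0-9]+"), ("COLON", r":"), ("WORD", r"[^0-9:]+")]

if __name__ == "__main__":
    print(tokenize(OVERLAPPING, "3:€53.60"))
    print(tokenize(DISJOINT, "3:€53.60"))
```

With the overlapping rules the whole line comes back as a single SEARCH token, while the disjoint rules yield NUMBER, COLON, WORD, NUMBER, WORD, NUMBER, which the parser rules prefix and search can then group.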

Antlr 3 keywords and identifiers colliding

Surprise, I am building an SQL-like language parser for a project.
I had it mostly working, but when I started testing it against real requests it would be handling, I realized it was behaving differently on the inside than I thought.
The main issue in the following grammar is that I define a lexer rule PCT_CONTAINS for the language keyword 'pct_contains'. This works fine, but if I try to match a field like 'attributes.pct_vac', the field ends up with the text 'attributes.ac' and I get this ANTLR error:
line 1:15 mismatched character u'v' expecting 'c'
GRAMMAR
grammar Select;
options {
language=Python;
}
eval returns [value]
: field EOF
;
field returns [value]
: fieldsegments {print $field.text}
;
fieldsegments
: fieldsegment (DOT (fieldsegment))*
;
fieldsegment
: ICHAR+ (USCORE ICHAR+)*
;
WS : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
ICHAR : ('a'..'z'|'A'..'Z');
PCT_CONTAINS : 'pct_contains';
USCORE : '_';
DOT : '.';
I have been reading everything I can find on the topic: how the lexer consumes input as it finds it even if it is wrong, and how you can use semantic predicates or lookahead to remove ambiguity. But nothing I have read has helped me fix this issue.
Honestly, I don't see how it even CAN be an issue. I must be missing something super obvious, because other grammars I see have lexer rules like EXISTS, and that doesn't cause the parser to take a string like 'existsOrNot' and spit out an IDENTIFIER with the text 'rNot'.
What am I missing or doing completely wrong?
Convert your fieldsegment parser rule into a lexer rule. As it stands now it will accept input like
"abc
_ abc"
which is probably not what you want. The keyword "pct_contains" won't be matched by this rule since it is defined separately. If you want to accept the keyword in certain sequences as regular identifier you will have to include it in the accepted identifier rule.
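To illustrate why lexing a whole field segment as one token avoids the problem, here is a Python sketch of a related, commonly used technique: match the complete identifier first, then look it up in a keyword table. This is not the ANTLR3 fix itself, and the names KEYWORDS, IDENT, TOKEN_RE and lex are assumptions introduced for this example.

```python
import re

# Keyword table: identifiers whose full text appears here get the keyword token type.
KEYWORDS = {"pct_contains": "PCT_CONTAINS"}

# One pattern per token; IDENT covers a whole field segment such as "pct_vac".
TOKEN_RE = re.compile(r"(?P<IDENT>[A-Za-z]+(?:_[A-Za-z]+)*)|(?P<DOT>\.)|(?P<WS>\s+)")


def lex(text):
    """Return (token_type, token_text) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        kind, value = match.lastgroup, match.group()
        pos = match.end()
        if kind == "WS":
            continue
        if kind == "IDENT":
            # Only an exact, whole-word match counts as a keyword.
            kind = KEYWORDS.get(value, "IDENT")
        tokens.append((kind, value))
    return tokens


if __name__ == "__main__":
    print(lex("attributes.pct_vac"))
    print(lex("pct_contains"))
```

Because the identifier is consumed as a unit before any keyword decision is made, pct_vac can never be mistaken for the beginning of pct_contains.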

complex AST rewrite rule in ANTLR

This follows my earlier question about AST rewrite rules and the divide-into-groups technique, "AST rewrite rule with " * +" in antlr".
I have trouble with AST generation in ANTLR again :). Here is my ANTLR code:
start : noun1+=n (prep noun2+=n (COMMA noun3+=n)*)*
-> ^(NOUN $noun1) (^(PREP prep) ^(NOUN $noun2) ^(NOUN $noun3)*)*
;
n : 'noun1'|'noun2'|'noun3'|'noun4'|'noun5';
prep : 'and'|'in';
COMMA : ',';
Now, with the input "noun1 and noun2, noun3 in noun4, noun5", I got the following unexpected AST:
Compare with the "Parse Tree" in ANTLRWorks:
I think the $noun3 variable holds the list of all "n" matched by "COMMA noun3+=n". Consequently, ^(NOUN $noun3)* emits all of those "n" without separating which "n" actually belongs to which "prep".
Is there any way to get that separation in "(^(PREP prep) ^(NOUN $noun2) ^(NOUN $noun3))"? All I want is for the AST to match the "Parse Tree" in ANTLRWorks exactly, but without the COMMA tokens.
Thanks for help !
Getting the separation that you want is easiest if you break up the start rule. Here's an example (without writing COMMAs to the AST):
start : prepphrase //one prepphrase is required.
(COMMA! prepphrase)* //"COMMA!" means "match a COMMA but don't write it to the AST"
;
prepphrase: noun1=n //You can use "noun1=n" instead of "noun1+=n" when you're only using it to store one value
(prep noun2=n)?
-> ^(NOUN $noun1) ^(PREP prep)? ^(NOUN $noun2)?
;
A prepphrase is a noun that may be followed by a preposition with another noun. The start rule looks for comma-separated prepphrases.
The output appears like the parse tree image, but without the commas.
If you prefer explicitly writing out ASTs with -> or if you don't like syntax like COMMA!, you can write the start rule like this instead. The two different forms are functionally equivalent.
start : prepphrase //one prepphrase is required.
(COMMA prepphrase)*
-> prepphrase+ //write each prepphrase, which doesn't include commas
;
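The grouping produced by the restructured rules can be mimicked with a tiny hand-written parser in Python. This is only a sketch of the idea (one function per rule, commas consumed but not emitted), not generated ANTLR code; the node tuples and the names tokenize, parse_prepphrase and parse_start are assumptions for this illustration.

```python
import re

NOUNS = {"noun1", "noun2", "noun3", "noun4", "noun5"}
PREPS = {"and", "in"}


def tokenize(text):
    """Split the input into words and commas."""
    return re.findall(r"[A-Za-z0-9]+|,", text)


def parse_prepphrase(tokens, pos):
    """prepphrase: noun (prep noun)?  Returns (nodes, next_position)."""
    if pos >= len(tokens) or tokens[pos] not in NOUNS:
        raise SyntaxError(f"expected a noun at token {pos}")
    nodes = [("NOUN", tokens[pos])]
    pos += 1
    if pos < len(tokens) and tokens[pos] in PREPS:
        if pos + 1 >= len(tokens) or tokens[pos + 1] not in NOUNS:
            raise SyntaxError(f"expected a noun after {tokens[pos]!r}")
        nodes += [("PREP", tokens[pos]), ("NOUN", tokens[pos + 1])]
        pos += 2
    return nodes, pos


def parse_start(text):
    """start: prepphrase (COMMA prepphrase)*  The commas are matched but not stored."""
    tokens = tokenize(text)
    phrase, pos = parse_prepphrase(tokens, 0)
    phrases = [phrase]
    while pos < len(tokens) and tokens[pos] == ",":
        phrase, pos = parse_prepphrase(tokens, pos + 1)
        phrases.append(phrase)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return phrases


if __name__ == "__main__":
    for phrase in parse_start("noun1 and noun2, noun3 in noun4, noun5"):
        print(phrase)
```

Each inner list corresponds to one prepphrase, so every noun stays attached to its own preposition instead of being collected into one shared list.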