I'm new to antlr3, and I'm trying to write a lexer that accept '+' and '-' as a special symbol but when see '++' operator it should treat it as a error but I don't know how to implement it, now with below specification it tokenize '++' as two token '+' and '+'.
SPECIALSYMBOL: ('+'|'-');
Keep your SPECIALSYMBOL as it is and handle the case in the parser rules : if you do not allow repeated SPECIALSYMBOL in your rules, a ++ should produce an error.
Related
I'm implementing a simple PseudoCode language with ANTLR4, this is my current grammar:
// Define a grammar called PseudoCode
grammar PseudoCode;
prog : FUNCTION SIGNATURE '(' ')'
| FUNCTION SIGNATURE '{' VARB '}' ;
param: VARB | VARB ',' param ;
assignment: VARB '=' NUMBER ;
FUNCTION: 'function' ;
VARB: [a-z0-9]+ ;
SIGNATURE: [a-zA-Z0-9]+ ;
NUMBER: [0-9]+ | [0-9]+ '.' [0-9]+ ;
WS: [ \t\r\n]+ -> skip ;
The problem is after compiling and generating the Parser, Lexer, etc... and then running with grun PseudoCode prog -tree with the input being for example: function bla{bleh}
I keep on getting the following error:
line 1:9 no viable alternative at input 'functionbla'
Can someone point out what is wrong with my grammar?
bla is a VARB, not a SIGNATURE, because it matches both rules and VARB comes first in the grammar. The way you defined your lexer rules, an identifier can only be matched as a SIGNATURE if it contains capital letters.
The simplest solution to this problem would be to have a single lexer rule for identifiers and then use that everywhere where you currently use SIGNATURE or VARB. If you want to disallow capital letters in certain places, you could simply check for this condition in an action or listener, which would also allow you to produce clearer error messages than syntax errors (e.g. "capital letters are not allowed in variable names").
If you absolutely do need capital letters in variable names to be syntax errors, you could define one rule for identifiers with capital letters and one without. Then you could use ID_WITH_CAPITALS | ID_LOWER_CASE_ONLY in places where you want to allow both and ID_LOWER_CASE_ONLY in cases where you only want to allow lower case letters.
PS: You'll also want to make sure that your identifier rule does not match numbers (which both VARB and SIGNATURE currently do). Currently NUMBER tokens will only be generated for numbers with a decimal point.
I'm making a custom language support plugin according to this tutorial and I'm stuck with a few .bnf concepts. Let's say I want to parse a simple calculator language that supports +,-,*,/,unary -, and parentheses. Here's what I currently have:
Flex:
package com.intellij.circom;
import com.intellij.lexer.FlexLexer;
import com.intellij.psi.tree.IElementType;
import com.intellij.circom.psi.CircomTypes;
import com.intellij.psi.TokenType;
%%
%class CircomLexer
%implements FlexLexer
%unicode
%function advance
%type IElementType
%eof{ return;
%eof}
WHITESPACE = [ \n\r\t]+
NUMBER = [0-9]+
%%
{WHITESPACE} { return TokenType.WHITE_SPACE; }
{NUMBER} { return CircomTypes.NUMBER; }
Bnf:
{
parserClass="com.intellij.circom.parser.CircomParser"
extends="com.intellij.extapi.psi.ASTWrapperPsiElement"
psiClassPrefix="Circom"
psiImplClassSuffix="Impl"
psiPackage="com.intellij.circom.psi"
psiImplPackage="com.intellij.circom.psi.impl"
elementTypeHolderClass="com.intellij.circom.psi.CircomTypes"
elementTypeClass="com.intellij.circom.psi.CircomElementType"
tokenTypeClass="com.intellij.circom.psi.CircomTokenType"
}
expr ::=
expr ('+' | '-') expr
| expr ('*' | '/') expr
| '-' expr
| '(' expr ')'
| literal;
literal ::= NUMBER;
First it complains that expr is recursive. How do I rewrite it to not be recursive? Second, when I try to compile and run it, it freezes idea test instance when trying to parse this syntax, looks like an endless loop.
Calling the grammar files "BNF" is a bit misleading, since they are actually modified PEG (parsing expression grammar) format, which allows certain extended operators, including grouping, repetition and optionality, and ordered choice (which is semantically different from the regular definition of |).
Since the underlying technology is PEG, you cannot use left-recursive rules. Left-recursion will cause an infinite loop in the parser, unless the code generator refuses to generate left-recursive code. Fortunately, repetition operators are available so you only need recursion for syntax involving parentheses, and that's not left-recursion so it presents no problem.
As far as I can see from the documentation I found, grammar kit does not provide for operator precedence declarations. If you really need to produce a correct parse taking operator-precedence into account, you'll need to use multiple precedence levels. However, if your only use case is syntax highlighting, you probably do not require a precisely accurate parse, and it would be sufficient to do something like the following:
expr ::= unary (('+' | '-' | '*' | '/') unary)*
unary ::= '-'* ( '(' expr ')' | literal )
(For precise parsing, you'd need to split expr above into two precedence levels, one for additive operators and another for multiplicative. But I suggest not doing that unless you intend to use the parse for evaluation or code-generation.)
Also, you almost certainly require some lexical rule to recognise the various operator characters and return appropriate single character tokens.
I'm using ANTLR to generate recognizer for a java-like language and the following rules are used to recognize generic types:
referenceType
: singleType ('.' singleType)*
;
singleType
: Identifier typeArguments?
;
typeArguments
: '<' typeArgument (',' typeArgument)* '>'
;
typeArgument
: referenceType
;
Now, for the following input statement, ANTLR produces the 'no viable alternative' error.
Iterator<Entry<K,V>> i = entrySet().iterator();
However, if I put a space between the two consecutive '>' characters, no error is produced. It seams that ANTLR cannot distinguish between the above rule and the rule used to recognize shift expressions, but I don't know how to modify the grammar to resolve this ambiguity. Any help would be appreciated.
You probably have a rule like the following in the lexer:
RightShift : '>>';
For ANTLR to recognize >> as either two > characters or one >> operator, depending on context, you'll need to instead place your shift operator in the parser:
rightShift : '>' '>';
If your language includes the >>> or >>= operators, those would need to be moved to the parser as well.
To validate that x > > y isn't allowed, you'll want to make a pass over the resulting parse tree (ANTLR 4) or AST (ANTLR 3) to verify that the two > characters parsed by the rightShift parser rule appear in sequence.
280Z28 is probably right in his diagnosis that you have a rule like
RightShift : '>>';
An alternative solution is to explicitly include the possibility of a trailing >> in your parser. (I have seen this in other grammars, but only in LALR.)
typeArguments
: ('<' typeArgument (',' typeArgument)* '>') |
('<' typeArgument ',' referenceType '<' typeArgument RightShift );
;
In Antlr3, that will need to be left factored.
Whether this is clearer or having a second pass that validates your right shift operator depends on how often you need to use this.
Is there NOT logic in ANTLR? Im basically trying to negate a rule that i have and was wondering if its possible, also is there AND logic?
#larsmans already supplied the answer, I just like to give an example of the legal negations in ANTLR rules (since it happens quite a lot that mistakes are made with them).
The negation operator in ANTLR is ~ (tilde). Inside lexer rules, the ~ negates a single character:
NOT_A : ~'A';
matches any character except 'A' and:
NOT_LOWER_CASE : ~('a'..'z');
matches any character except a lowercase ASCII letter. The lats example could also be written as:
NOT_LOWER_CASE : ~LOWER_CASE;
LOWER_CASE : 'a'..'z';
As long as you negate just a single character, it's valid to use ~. It is invalid to do something like this:
INVALID : ~('a' | 'aa');
because you can't negate the string 'aa'.
Inside parser rules, negation does not work with characters, but on tokens. So the parse rule:
parse
: ~B
;
A : 'a';
B : 'b';
C : 'c';
does not match any character other than 'b', but matches any token other than the B token. So it'd match either token A (character 'a') or token C (character 'c').
The same logic applies to the . (DOT) operator:
inside lexer rules it matches any character from the set \u0000..\uFFFF;
inside parser rules it matches any token (any lexer rule).
ANTLR produces parsers for context-free languages (CFLs). In that context, not would translate to complement and and to intersection. However, CFLs aren't closed under complement and intersection, i.e. not(rule) is not necessarily a CFG rule.
In other words, it's impossible to implement not and and in a sane way, so they're not supported.
I am working on an Antlr grammar in which single quotes are used both as operators and in string literals, something like:
operand: DIGIT | STRINGLIT | reference;
expression: operand SQUOTE;
STRINGLIT: '\'' ~('\\'|'\'')* '\'';
Expressions like 1' parse correctly, but when there is input that matches ~('\\'|'\'')* after the quote, such as 1'+2, the lexer attempts to match STRINGLIT and fails. I'd like to be able to recover and emit SQUOTE. Any ideas how to accomplish it?
Thanks.
After testing this grammar a bit in Antlrworks, I think that your problem is that you are being too restrictive here. Antlr would be able to parse 1'+2 if you had a rule that accepted something after it has seen operand SQUOTE. Since ANTLR doesn't know what to do with the +2, it throws an exception.