Antlr grammar with single quotes doubling as operators

I am working on an Antlr grammar in which single quotes are used both as operators and in string literals, something like:
operand: DIGIT | STRINGLIT | reference;
expression: operand SQUOTE;
STRINGLIT: '\'' ~('\\'|'\'')* '\'';
Expressions like 1' parse correctly, but when there is input after the quote that matches ~('\\'|'\'')*, such as 1'+2, the lexer attempts to match STRINGLIT and fails. I'd like to be able to recover and emit SQUOTE. Any ideas on how to accomplish this?
Thanks.

After testing this grammar a bit in ANTLRWorks, I think your problem is that you are being too restrictive here. ANTLR would be able to parse 1'+2 if you had a rule that accepted something after it has seen operand SQUOTE. As it stands, ANTLR doesn't know what to do with the +2, so it throws an exception.
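For example, a sketch of that idea (with an assumed PLUS token; this is an illustration, not a tested fix) could be:
expression : operand SQUOTE (PLUS operand)? ;   // allow something to follow operand SQUOTE
SQUOTE     : '\'' ;
PLUS       : '+' ;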

Related

ANTLR4 Best practice on token ambiguities: Lexer predicate, or Parser tree walker

I have a question about a certain ambiguity I am encountering in a grammar I am currently working on. Here is the problem, in brief. Consider these two inputs:
1010
0101
In isolation, in my grammar the first input is interpreted as a decimal number, the second as an octal due to the leading zero.
However, if the preceding character to each of these sequences is a % then both would be interpreted as a binary number. This wouldn't be a problem if we stopped there.
Now, let's say we encountered a 5 before the %. What would happen? Does my grammar consider each of these valid input:
5%1010
5%0101
The answer is "Yes!" The rightmost sequences of 1s and 0s simply revert back to decimal and octal, respectively, and the % is a modulo operator.
This wouldn't be a problem if expressions in my grammar only consisted of digits, but unfortunately that is not the case, as any number of non-digit tokens could stand in for the 5 in the example above, such as variables, braces, and even other symbols like parentheses and minus signs.
The solution I have come to in ANTLR is simply to have an expression rule where one of the alternatives concatenates an expression and a binary number, so you have:
expr
: expr Binary
| expr '%' expr
| Integer
| Octal
| Binary
;
Integer
: '0'
| [1-9] [0-9]*
;
Octal
: '0' [0-7]+
;
Binary
: '%' [01]+
;
I then leave it up to my visitor to actually "pull apart" the right-hand side of the expression type above (the expr Binary one) and properly calculate the modulo, which means I essentially have to "re-tokenize" the % and the following digits.
I guess my question is: Is this the best solution given my case? I fully accept it if so, but I am curious if others have had to resort to things like these.
I cooked up a lexer predicate to do some crazy lookaheads (and lookbehinds) in the input, but my instinct was that this felt wrong, as I was essentially hand-parsing rather than leveraging the tool itself to give me what I needed to work with.

How do I properly parse Regex in ANTLR

I want to parse this
VALID_EMAIL_REGEX = /\A[\w+\-.]+#[a-z\d\-]+(\.[a-z]+)*\.[a-z]+\z/i
and other variations of course of regular expressions.
Does someone know how to do this properly?
Thanks in advance.
Edit: I tried throwing all the regex symbols and characters into one lexer rule like this:
REGEX: ( DIV | 'i' | '#' | '[' | ']' | '+' | '.' | '*' | '-' | '\\' | '(' | ')' | 'A' | 'w' | 'a' | 'z' | 'Z' );
and then make a parser rule like this:
regex_assignment: (REGEX)+ ;
but there are recognition errors (extraneous input). This is definitely because these symbols are of course already used in other rules.
The thing is, I actually don't need to process these regex assignments; I just want them to be recognized correctly without errors. Does anyone have an approach for this in ANTLR? A solution that just recognizes this as a regex and, for example, skips it would suffice for me.
Unfortunately, there is no regex grammar yet in the ANTLR grammar repository, but similar questions have come up before, e.g. Regex Grammar. Once you have the (E)BNF you can convert it to ANTLR. Alternatively, you can use the BNF grammar to check whether your own grammar rules are correctly defined. Simply throwing together all possible input characters in a single rule won't work.
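If the goal is only to recognize such literals so the rest of the grammar can ignore them, one rough approach is a single lexer rule that swallows the whole /.../flags literal. This is just a sketch in ANTLR 4 syntax (REGEX_LITERAL is an assumed name, and it leaves open how you disambiguate the leading / from a division operator, e.g. with a predicate or by context):
REGEX_LITERAL : '/' ( '\\' . | ~[/\\\r\n] )+ '/' [a-z]* ;   // escaped pairs or non-slash characters, then optional flags
Append -> skip to the rule if you really just want to drop the literal instead of handing it to the parser.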

Solve ambiguity in my grammar with LALR parser

I'm using whittle to parse a grammar, but I'm running into the classical LALR ambiguity problem. My grammar looks like this (simplified):
<comment> ::= '{' <string> '}' # string enclosed in braces
<tag> ::= '[' <name> <quoted-string> ']' # [tagname "tag value"]
<name> ::= /[A-Za-z_]+/ # subset of all printable chars
<quoted-string> ::= '"' <string> '"' # string enclosed in quotes
<string> ::= /[:print:]/ # regex for all printable chars
The problem, of course, is <string>. It contains all printable characters and is therefore very greedy. Since it's an LALR parser, it tries to parse a <name> as a <string> and everything breaks. The grammar complicates things because it uses different string delimiters for different things, which is why I tried to make the <string> rule in the first place.
Is there a canonical way to normalize this grammar to make it LALR compliant, if it's even possible?
This is not "the classical LALR ambiguity problem", whatever that might be. It is simply an error in the lexical specification of the language.
I took a quick glance at the Whittle readme, but it didn't bear any resemblance to the grammar in the OP. So I'm assuming that the text in the OP is conceptual rather than literal, and the fact that it includes the obviously incorrect
<string> ::= /[:print:]/ # regex for all printable chars
is just a typo.
Better would have been /[:print:]*/, assuming that Ruby lets you get away with [:print:] rather than the Posix-standard [[:print:]].
But that wouldn't be correct either because lexing (usually) matches the longest possible string, and consequently that will gobble up the closing quote and any following text.
So the correct solution for quoted-string is to write it out correctly:
<quoted-string> ::= /"[^"]*"/
or even
<quoted-string> ::= /"([^\\"]|\\.)*"/
# any number of characters other than quote or escape, or escaped pairs
You might have other ideas about how to escape internal double quotes; those are just examples. In both cases, you need to postprocess the token in order to (at least) strip the double quotes and possibly interpret escape sequences. That's just the way it goes.
Your comment sequences present a more difficult issue, assuming that your intention was that a comment might include nested braces (e.g. {This comment {with this} ends here}), because the nested brace syntax is not regular and thus cannot be matched with a regular expression. Of course, very few "regular expression" libraries are really regular these days, and I don't know if Ruby contains some sort of brace-counting extension, such as Lua's pattern syntax. The nested brace syntax is certainly context-free, but to actually parse it you need to lexically analyze the contents of the outer {...} in a different way than the rest of the program.
It is this latter observation, and not any weakness in the LALR algorithm, that is causing you pain, and I'd say that this is a weakness with the (mostly undocumented afaics) lexical analysis section of whittle. In a flex-generated lexer, for example, it would be normal to use start conditions to separate the lexical environments (program / quoted string / braced comment), and the parser would then have no ambiguity.
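For what it's worth, the same start-condition idea can be sketched with ANTLR 4 lexer modes, since ANTLR is the tool used elsewhere on this page. This is an illustration of the technique with made-up rule names, not a drop-in fix for whittle:
lexer grammar ModesSketch;
LBRACKET : '[' ;
RBRACKET : ']' ;
NAME     : [A-Za-z_]+ ;
QUOTED   : '"' (~["\\] | '\\' .)* '"' ;   // quoted strings handled entirely in the lexer
OPEN     : '{' -> pushMode(COMMENT) ;     // entering a braced comment switches lexical environment
WS       : [ \t\r\n]+ -> skip ;

mode COMMENT;
NESTED   : '{' -> pushMode(COMMENT) ;     // nested braces just push the mode again
CLOSE    : '}' -> popMode ;
TEXT     : ~[{}]+ ;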
Hope that helps.

Antlr Lexer exclude a certain pattern

In the ANTLR lexer, how can I parse a token like this:
A word that contains any non-whitespace character but not '.{' inside it. The best I can come up with is using a semantic predicate:
WORD: WL+ {!getText().contains(".{")}?;
WL: ~[ \n\r\t];
I'm a bit worried about using a semantic predicate, though, because WORD here will be lexed millions of times, and I would think a semantic predicate will hurt performance.
This is coming from the requirement that I need to parse something like:
TOKEN_ONE.{TOKEN_TWO}
where TOKEN_ONE can include . and { among its characters.
I'm using Antlr 4.
You need to limit your predicate evaluation to the case immediately following a . in the input.
WORD
 : ( ~[. \t\r\n]                  // any character except a dot or whitespace
   | '.' {_input.LA(1)!='{'}?     // a dot is allowed as long as the next character is not '{'
   )+
 ;
How about rephrasing your question to the equivalent "A word contains any character except whitespace or dot or left brace-bracket."
Then the lexer rule is just:
WORD: ~[ \n\r\t.{]+ ;

Antlr 3 keywords and identifiers colliding

Surprise: I am building an SQL-like language parser for a project.
I had it mostly working, but when I started testing it against real requests it would be handling, I realized it was behaving differently on the inside than I thought.
The main issue in the following grammar is that I define a lexer rule PCT_CONTAINS for the language keyword 'pct_contains'. This works fine, but if I try to match a field like 'attributes.pct_vac', I get the field having text of 'attributes.ac' and a pretty ANTLR error of:
line 1:15 mismatched character u'v' expecting 'c'
GRAMMAR
grammar Select;
options {
language=Python;
}
eval returns [value]
: field EOF
;
field returns [value]
: fieldsegments {print $field.text}
;
fieldsegments
: fieldsegment (DOT (fieldsegment))*
;
fieldsegment
: ICHAR+ (USCORE ICHAR+)*
;
WS : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
ICHAR : ('a'..'z'|'A'..'Z');
PCT_CONTAINS : 'pct_contains';
USCORE : '_';
DOT : '.';
I have been reading everything I can find on the topic: how the lexer consumes input as it finds it even if it is wrong, how you can use semantic predicates to remove ambiguity, and how to use lookahead. But nothing I have read has helped me fix this issue.
Honestly, I don't see how it even CAN be an issue. I must be missing something super obvious, because other grammars I see have lexer rules like EXISTS, but that doesn't cause the parser to take a string like 'existsOrNot' and spit out an IDENTIFIER with the text 'rNot'.
What am I missing or doing completely wrong?
Convert your fieldsegment parser rule into a lexer rule. As it stands now it will accept input like
"abc
_ abc"
which is probably not what you want. The keyword "pct_contains" won't be matched by this rule since it is defined separately. If you want to accept the keyword in certain places as a regular identifier, you will have to include it in the identifier rule.
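A rough sketch of what that change could look like, in ANTLR 3 syntax (FIELDSEGMENT is an assumed name, and this is an illustration rather than tested code):
field
    : FIELDSEGMENT (DOT FIELDSEGMENT)*   // each segment is now a single token, so stray whitespace inside a segment is no longer accepted
    ;
PCT_CONTAINS : 'pct_contains';           // keyword stays defined before the identifier-style rule
FIELDSEGMENT : ICHAR+ (USCORE ICHAR+)*;
DOT          : '.';
fragment ICHAR  : 'a'..'z'|'A'..'Z';
fragment USCORE : '_';
If the keyword should also be usable as an ordinary field segment, that is where the last point above comes in: add a parser rule such as fieldsegment : FIELDSEGMENT | PCT_CONTAINS ; and use it inside field.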