How do I properly parse Regex in ANTLR - antlr

I want to parse this
VALID_EMAIL_REGEX = /\A[\w+\-.]+#[a-z\d\-]+(\.[a-z]+)*\.[a-z]+\z/i
and other variations of course of regular expressions.
Does someone know how to do this properly?
Thanks in advance.
Edit: I tried throwing in all regex signs and chars in one lexer rule like this
REGEX: ( DIV | ('i') | ('#') | ('[') | (']') | ('+') | ('.') | ('*') | ('-') | ('\\') | ('(') | (')') |('A') |('w') |('a') |('z') |('Z')
//|('w')|('a'));
and then make a parser rule like this:
regex_assignment: (REGEX)+
but there are recognition errors(extraneous input). This is definetly because these signs are ofc used in other rules before.
The thing is I actually don't need to process these regex assignments, I just want it to be recognized correctly without errors. Does anyone have an approach for this in ANTLR? For me a solution would suffice, that just recognzies this as regex and skips it for example.

Unfortunately, there is no regex grammar yet in the ANTLR grammar repository, but similar questions have come up before, e.g. Regex Grammar. Once you have the (E)BNF you can convert that to ANTLR. Or alternatively, you can use the BNF grammar to check your own grammar rules to see if they are correctly defined. Simply throwing together all possible input chars in a single rule won't work.

Related

Optional rule (myRule?) vs rule and empty alternative ((myRule | ))

In the ANTLRv4 grammar that one can find in the grammars-v4 repository (https://github.com/antlr/grammars-v4/blob/master/antlr4/ANTLRv4Parser.g4) the optional rule ebnfSuffix is:
sometimes matched using ebnfSuffix?, see lexerElement
sometimes matched using (ebnfSuffix | ), see element.
I was indeed asking to myself, and here as well, if the two have slightly different meaning.
The grammars-v4 repository has another example in https://github.com/antlr/grammars-v4/blob/master/cql3/CqlParser.g4 of the same two patterns with respect to beginBatch rule used has optional element or together with an empty alternative.
EDIT: I've added here the part of the grammar I'm referring to as suggested:
lexerElement
: labeledLexerElement ebnfSuffix? <-- case 1: optional rule
| lexerAtom ebnfSuffix?
| lexerBlock ebnfSuffix?
| actionBlock QUESTION?
;
element
: labeledElement (ebnfSuffix |) <-- case 2: block with empty alternative
| atom (ebnfSuffix |)
| ebnf
| actionBlock QUESTION?
;
Both ebnfSuffix? and (ebnfSuffix | ) result in exactly the same behaviour: they (greedily) optionally match ebnfSuffix.
The fact that they're both being used in a grammar could be because it was translated from some spec (or other grammar) that used that notation and that notation didn't have the ? operator, but that's just guessing.
Personally I'd just use ebnfSuffix?.

ANTLR - Checking for a String's "contruction"

Currently working with ANTLR and found out something interesting that's not working as I intended.
I try to run something along the lines of "test 10 cm" through my grammar and it fails, however "test 10 c m" works as the previous should. The "cm" portion of the code is what I call "wholeunit" in my grammar and it is as follows:
wholeunit :
siunit
| unitmod siunit
| wholeunit NUM
| wholeunit '/' wholeunit
| wholeunit '.' wholeunit
;
What it's doing right now is the "unitmod siunit" portion of the rule where unitmod = c and siunit = m .
What I'd like to know is how would I make it so the grammar would still follow the rule "unitmod siunit" without the need for a space in the middle, I might be missing something huge. (Yes, I have spaces and tabs marked to be skipped)
Probable cause is "cm" being considered another token together (possibly same token type as "test"), rather than "c" and "m" as separate tokens.
Remember that in ANTLR lexer, the rule matching the longest input wins.
One solution would possibly be to make the wholeunit a lexer rule rather than parser rule, and make sure it's above the rule that matches any word (like "test") - if same input can be matched by multiple rules, ANTLR selects the first rule in order they're defined in.

Is there an automatic way to simplify EBNF grammars?

I want to handle the Scala grammar in order to create a data structure and I want to create a graph to model it. The problem that I have is that I want to get rid of some syntax sugar of the EBNF grammar in order to parse it in an efficient way.
For example if I have the following rule:
Expr ::= (Bindings | id | '_') '=>' Expr | Expr1
I want the corresponding rules for it:
Expr ::= LeftExpr '=>' Expr | Expr1
LeftExpr ::= Bindings | id | '_'
What I would really appreciate would be a tool which can automatically convert those rules in order to have left-associativity rules with only OR operators in order to avoid doing it manually. I think it's possible a thing like that, maybe without some semantic sense of the new added rules, e.g. it's the same if LeftExpr is Foo or whatever other name.
I looked at web sources but the argument is so confusing that I cannot find the right answer. Any help is appreciated, also just a link (really) related to the problem. Thank you

Antlr grammar with single quotes doubling as operators

I am working on an Antlr grammar in which single quotes are used both as operators and in string literals, something like:
operand: DIGIT | STRINGLIT | reference;
expression: operand SQUOTE;
STRINGLIT: '\'' ~('\\'|'\'')* '\'';
Expressions like 1' parse correctly, but when there is input that matches ~('\\'|'\'')* after the quote, such as 1'+2, the lexer attempts to match STRINGLIT and fails. I'd like to be able to recover and emit SQUOTE. Any ideas how to accomplish it?
Thanks.
After testing this grammar a bit in Antlrworks, I think that your problem is that you are being too restrictive here. Antlr would be able to parse 1'+2 if you had a rule that accepted something after it has seen operand SQUOTE. Since ANTLR doesn't know what to do with the +2, it throws an exception.

How do I add parenthesis to this rule?

I have a left-recursive rule like the following:
EXPRESSION : EXPRESSION BINARYOP EXPRESSION | UNARYOP EXPRESSION | NUMBER;
I need to add parenthesis to it but I'm not sure how to make a left parenthesis depend on a matching right parenthesis yet still optional. Can someone show me how? (Or am I trying to do entirely too much in lexing, and should I leave some or all of this to the parsing?)
You could add a recursive rule:
EXPRESSION : EXPRESSION BINARYOP EXPRESSION
| UNARYOP EXPRESSION
| NUMBER
| OPENPARENS EXPRESSION CLOSEPARENS
;
Yes, you're trying to do too much in the lexer. Here's how to get around the left-recursive rules:
http://www.antlr.org/wiki/display/ANTLR3/Expression+evaluator (see how the parser rule expr trickles down to the rule atom and then get called recursively from atom again)
HTH