I'm trying to build a parser for a DSL I'm building using Irony in .NET and found a problem I can't find a way around. Since it handles BNF I think that any BNF solution will be of help.
I have the following input:
$10 yesterday at drug store
With the following grammar:
<expr> :== unit | expr + unit
<unit> :== money | date | location
<date> : == yesterday|today|tomorrow
<location> :== .* | <preposition> .*
<preposition> :== at
<money> :== ((\$)?\d*((\.*)\d*)*\,?\d{1,2})
It works like a charm with this input. I get exactly the result I wanted which is:
Money Amount: 10
Date: Yesterday
Location: Drug Store
However, if I change the order of the input as following
$10 at drug store yesterday
because of reduce steps it fails to give me the same output. The output becomes:
Money amount: 10
Location: Drug Store Yesterday
I was wondering if there is way to make sure that Location (which is a really broad regex match) is only evaluated when all the other tokens are captured and nothing else is left.
Any help is appreciated.
Thanks!
Edit: Updated title according to suggestion
Besides the fact that this is not a general answer to BNF ambiguity I was able to solve my problem with Irony by creating a new Terminal.
So if anyone else encounters this problem, the code for the new Terminal (while not added to main Irony project) can be found in this link: http://irony.codeplex.com/discussions/269483
Thanks
Related
Currently working with ANTLR and found out something interesting that's not working as I intended.
I try to run something along the lines of "test 10 cm" through my grammar and it fails, however "test 10 c m" works as the previous should. The "cm" portion of the code is what I call "wholeunit" in my grammar and it is as follows:
wholeunit :
siunit
| unitmod siunit
| wholeunit NUM
| wholeunit '/' wholeunit
| wholeunit '.' wholeunit
;
What it's doing right now is the "unitmod siunit" portion of the rule where unitmod = c and siunit = m .
What I'd like to know is how would I make it so the grammar would still follow the rule "unitmod siunit" without the need for a space in the middle, I might be missing something huge. (Yes, I have spaces and tabs marked to be skipped)
Probable cause is "cm" being considered another token together (possibly same token type as "test"), rather than "c" and "m" as separate tokens.
Remember that in ANTLR lexer, the rule matching the longest input wins.
One solution would possibly be to make the wholeunit a lexer rule rather than parser rule, and make sure it's above the rule that matches any word (like "test") - if same input can be matched by multiple rules, ANTLR selects the first rule in order they're defined in.
I want to parse this
VALID_EMAIL_REGEX = /\A[\w+\-.]+#[a-z\d\-]+(\.[a-z]+)*\.[a-z]+\z/i
and other variations of course of regular expressions.
Does someone know how to do this properly?
Thanks in advance.
Edit: I tried throwing in all regex signs and chars in one lexer rule like this
REGEX: ( DIV | ('i') | ('#') | ('[') | (']') | ('+') | ('.') | ('*') | ('-') | ('\\') | ('(') | (')') |('A') |('w') |('a') |('z') |('Z')
//|('w')|('a'));
and then make a parser rule like this:
regex_assignment: (REGEX)+
but there are recognition errors(extraneous input). This is definetly because these signs are ofc used in other rules before.
The thing is I actually don't need to process these regex assignments, I just want it to be recognized correctly without errors. Does anyone have an approach for this in ANTLR? For me a solution would suffice, that just recognzies this as regex and skips it for example.
Unfortunately, there is no regex grammar yet in the ANTLR grammar repository, but similar questions have come up before, e.g. Regex Grammar. Once you have the (E)BNF you can convert that to ANTLR. Or alternatively, you can use the BNF grammar to check your own grammar rules to see if they are correctly defined. Simply throwing together all possible input chars in a single rule won't work.
In the link https://cwiki.apache.org/Hive/languagemanual-udf.html, it is clearly mentioned that A && B is same as A AND B. But when i tried to use && in one of my hive query, it was not working (I am using hive-0.9.0-cdh4.1.2).
Sample Input:
12 23
2 6
Table schema as test(a int, b int). When I performed SELECT CASE WHEN (a<10 && b<10) THEN a+b END FROM test;, I got an exception message saying "FAILED: ParseException line 1:24 cannot recognize input near '&' 'b' '<' in expression specification".
Expected Output :
NULL
8
But when I replaced && with AND, it gave correct result. I want to know why this happened. Any help much appreciated! Thanks in advance.
I had run into this issue sometime ago and tried to do some research which remained incomplete because of other things. But what I found was that this is some parsing related issue. To be precise, ANTLR is not able to provide suitable parser to parse this query.
You could probably try using the latest version of Hive which has an upgraded version of ANTLR. But i'm still not sure that it will work even with that. To be frank, I wonder if && and || have ever worked.
Following a very interesing discussion with Bart Kiers on parsing a noisy datastream with ANTLR, I'm ending up with another problem...
The aim is still the same : only extracting useful information with the following grammar,
VERB : 'SLEEPING' | 'WALKING';
SUBJECT : 'CAT'|'DOG'|'BIRD';
INDIRECT_OBJECT : 'CAR'| 'SOFA';
ANY : . {skip();};
parse
: sentenceParts+ EOF
;
sentenceParts
: SUBJECT VERB INDIRECT_OBJECT
;
a sentence like it's 10PM and the Lazy CAT is currently SLEEPING heavily on the SOFA in front of the TV. will produce the following
This is perfect and it's doing exactly what I want.. from a big sentence, I'm extracting only the words that had a sense for me.... But the, I founded the following error. If somewhere in the text I'm introducing a word that begin exactly like a token, I'm ending up with a MismathedTokenException or a noViableException
it's 10PM and the Lazy CAT is currently SLEEPING heavily,
with a DOGGY bag, on the SOFA in front of the TV.
produce an error :
DOGGY is interpreted as the beginning for DOG which is also a part of the TOKEN SUBJECT and the lexer is lost... How could I avoid this without defining DOGGY as a special token... I would have like the parser to understand DOGGY as a word in itself.
Well, it seems that adding this ANY2 :'A'..'Z'+ {skip();}; solves my problem !
I'm trying to parse a number of text records where elements in a record are separated by a '+' char, and where the entire record is terminated by a '#' char. For example E1+E2+E3+E4+E5+E6#
Individual elements can be required or optional. If an element is optional, its value is simply missing. For example, if E2 were missing, the input string would be: E1++E3+E4+E5+E6#.
When dealing with empty trailing elements, however, the separator char ('+') may be missing as well. If, for example, the last 3 elements were missing, the string could be: E1+E2+E3#, but it could also be:
E1+E2+E3+++#
I have tried the following rule in Antlr:
'R1' 'E1 + E2 + E3' '+'? 'E4'? '+'? 'E5'? '+'? 'E6'? '#
but Antlr complains that it's ambiguous which of course is correct (every token following E3 could be E4, E5 or E6). The input syntax is fixed (it's from a legacy mainframe system), so I was wondering if anybody has a solution to this problem ?
An alternative would be to specify all the different permutations in the rule, but that would be a major task.
Best regards and thanks,
Michael
That task sounds like excessive overkill for ANTLR, any reason you're just not splitting the string into an array using the '+' as a separator?
If it's coming from a mainframe, it most likely was intended to be processed in a trivial way.
e.g.,
C++ : http://www.cplusplus.com/reference/clibrary/cstring/strtok/
PHP : http://us3.php.net/manual/en/function.explode.php
Java: http://java.sun.com/javase/6/docs/api/java/lang/String.html#split%28java.lang.String%29
C# : http://msdn.microsoft.com/en-us/library/system.string.split%28VS.71%29.aspx
Just a thought.
If this is ambiguous, it's likely because your Es all have the same format (a more complicated case would be that your Es all just start with the same k characters where k is your lookahead, but I'm going to assume that's not the case. If it is, this will still work; it will just require an extra step.)
So it looks like you can have up to 6 Es and up to 5 +s. We'll say a "segment" is an optional E followed by a + - you can have 5 segments, and an optional trailing E.
This grammar can be represented roughly like this (imperfect ANTLR syntax since I'm not very familiar with it):
r : (e_opt? PLUS){1,5} e_opt? END
e_opt : E // whatever your E is
PLUS : '+'
END : '#'
If ANTLR doesn't support anything like {1,5} then this is the same as:
(e_opt? PLUS) ((e_opt? PLUS) ((e_opt? PLUS) ((e_opt? PLUS) (e_opt? PLUS)?)?)?)?
which is not that clean, so maybe there is a nicer way to do it.