Following a very interesting discussion with Bart Kiers on parsing a noisy datastream with ANTLR, I've run into another problem...
The aim is still the same: extracting only the useful information with the following grammar,
VERB : 'SLEEPING' | 'WALKING';
SUBJECT : 'CAT'|'DOG'|'BIRD';
INDIRECT_OBJECT : 'CAR'| 'SOFA';
ANY : . {skip();};
parse
: sentenceParts+ EOF
;
sentenceParts
: SUBJECT VERB INDIRECT_OBJECT
;
a sentence like it's 10PM and the Lazy CAT is currently SLEEPING heavily on the SOFA in front of the TV. will produce just CAT SLEEPING SOFA.
This is perfect and it's doing exactly what I want: from a big sentence, I'm extracting only the words that have a meaning for me. But then I found the following error. If somewhere in the text I introduce a word that begins exactly like a token, I end up with a MismatchedTokenException or a NoViableAltException
it's 10PM and the Lazy CAT is currently SLEEPING heavily,
with a DOGGY bag, on the SOFA in front of the TV.
produces an error:
DOGGY is interpreted as the beginning of DOG, which is one of the alternatives of the token SUBJECT, and the lexer is lost... How could I avoid this without defining DOGGY as a special token? I would have liked the lexer to understand DOGGY as a word in itself.
Well, it seems that adding ANY2 : 'A'..'Z'+ {skip();}; solves my problem!
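For reference, the full grammar with the extra rule would look something like this (a sketch, assuming ANTLR3's {skip();} action syntax and a hypothetical grammar name; the keyword rules stay first so that, at equal length, they win over ANY2, while ANY2 swallows longer words like DOGGY because the longest match wins):

grammar Noisy;

parse           : sentenceParts+ EOF;
sentenceParts   : SUBJECT VERB INDIRECT_OBJECT;

VERB            : 'SLEEPING' | 'WALKING';
SUBJECT         : 'CAT' | 'DOG' | 'BIRD';
INDIRECT_OBJECT : 'CAR' | 'SOFA';
ANY2            : 'A'..'Z'+ {skip();}; // matches DOGGY whole, so DOG never fires inside it
ANY             : . {skip();};         // discards everything else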
I am looking for a good way to match a filename in Antlr.
The filename could be DOS or Unix style.
If you have a good solution to that, feel free to ignore the rest of this question, because it is just my newbie attempt at solving the problem and I am probably way off. I have included it because some people like to see sample code.
For purposes of discussion, here is what I am thinking. This is not my actual grammar; all I am interested in for this discussion is filename parsing, so I reduced the sample to something that is still meaningful in that context.
Lexer.g4:
lexer grammar Lexer;

K_COPY   : C O P Y ;            // fragments C, O, P, Y assumed defined elsewhere
FILEPATH : [-.a-zA-Z0-9:/\\]+ ; // '\\' escapes the backslash so DOS-style paths work
NEWLINE  : '\r'? '\n' ;         // the parser below references NEWLINE
WS       : [ \t]+ -> skip ;     // assumed: spaces separate COPY and its operands
Parser.g4:
parser grammar Parser;
options { tokenVocab=Lexer; }
commandfile: (statement NEWLINE)* EOF;
statement : copy_stmt
;
copy_stmt: K_COPY left=filepath right=filepath
;
// Add characters as we make rules as to what characters are valid:
filepath: FILEPATH;
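For testing, I imagine driving the split grammars with something like this (a sketch for the ANTLR4 Java target, assuming the grammars are renamed to, say, CopyLexer and CopyParser so the generated classes don't collide with the runtime's own Lexer and Parser class names):

import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;

public class TryCopy {
    public static void main(String[] args) {
        // one COPY statement: Unix-style source, DOS-style target
        CopyLexer lexer = new CopyLexer(CharStreams.fromString("COPY /src/a.txt C:\\dest\\a.txt\n"));
        CopyParser parser = new CopyParser(new CommonTokenStream(lexer));
        System.out.println(parser.commandfile().toStringTree(parser));
    }
}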
That is what I am thinking, but I am new to Antlr, so I wanted to get some feedback before I proceed.
That I am using Antlr for this project is already decided, and a good part of this project is already working in Antlr, so I am only looking for Antlr-based solutions.
I am having some trouble handling whitespace. In the following excerpt of a grammar, I set up the lexer so that the parser skips whitespace:
ENTITY_VAR
: 'user'
| 'resource'
;
INT : DIGIT+ | '-' DIGIT+ ;
ID : LETTER (LETTER | DIGIT | SPECIAL)* ;
ENTITY_ID : '__' ENTITY_VAR ('_w_' ID)?;
NEWLINE : '\r'? '\n';
WS : [ \t\r\n]+ -> skip; // skip spaces, tabs, newlines
fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
fragment SPECIAL : ('_' | '#' );
The problem is, I would like to match variable names of the form ENTITY_ID such that the matched string does not contain any whitespace. Writing it as a lexer rule, as I did here, would be sufficient, but I'd like to do it with a parser rule instead, because I want direct access to the two tokens ENTITY_VAR and ID individually from my code, rather than having them squeezed back together into a single ENTITY_ID token.
Any ideas, please?
Basically, any solution which lets me access ENTITY_VAR and ID directly would suit me, whether by leaving ENTITY_ID as a lexer rule or by moving it to the parser.
There are several approaches I can think of (in no particular order):
Emit several tokens from the rule ENTITY_ID. See ANTLR4: How to inject tokens for inspiration
Allow whitespace in the parser and check afterwards
Use the single token and split it in code (see the sketch after this list)
Use the single token and modify the token stream before passing it to the parser. I.e. lex, modify the ENTITY_ID tokens and split them into several other tokens, then pass this stream to the parser
Don't skip whitespace, and when dealing with these "extra tokens" check if they are within an ENTITY_ID part (=> error) or not (=> ignore them).
Don't skip whitespace and add "WS*" everywhere in your grammar where whitespace is allowed (ok if the grammar is not too large).
Insert predicates in the parser rule that check whether there is whitespace between the tokens.
Create a "trap" rule like this:
INVALID_ENTITY_ID : '__' WS+ ENTITY_VAR WS? ('_w_' WS? ID)?
| '__' WS? ENTITY_VAR WS+ ('_w_' WS? ID)?
| '__' WS? ENTITY_VAR WS? ('_w_' WS+ ID)
;
This will catch invalid ENTITY_IDs, since the trap rule matches a longer input than the valid parts, which will then still be emitted as individual tokens.
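For option 3, the splitting itself is plain string work once you hold the ENTITY_ID token's text; a minimal Java sketch (class and variable names are hypothetical):

final class EntityIdSplitter {
    // text is the text of an ENTITY_ID token, such as "__user_w_someId"
    static String[] split(String text) {
        String body = text.substring(2);      // strip the leading '__'
        int sep = body.indexOf("_w_");
        String entityVar = sep < 0 ? body : body.substring(0, sep);
        String id = sep < 0 ? null : body.substring(sep + 3);
        return new String[] { entityVar, id };
    }
}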
I'd go with 2, if it doesn't alter the parse in the "non error" case, i.e. if no code is interpreted differently by allowing whitespace; the adjacency check it needs is sketched below.
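Assuming the ANTLR4 Java runtime, checking that two tokens were adjacent in the input is a one-liner over their character indexes (a sketch of a helper to call from a listener or visitor):

import org.antlr.v4.runtime.Token;

final class TokenGaps {
    // true iff no characters (e.g. skipped whitespace) sit between the two tokens
    static boolean adjacent(Token left, Token right) {
        return left.getStopIndex() + 1 == right.getStartIndex();
    }
}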
As far as I managed to understand by browsing the documentation, it doesn't look like something like that is feasible.
Parser rules seem to work just on the default channel, so I can't send WS to channel(HIDDEN) and then recover it just for a single parser rule.
On the other hand, one of the ANTLR authors explains here that it's not been possible to break down a token since version 4.
Even though I don't like it at all, it seems that the fastest way is to match it in the lexer (as in the code from the question) and then re-parse the whole string again from Java.
Still, any other better option or correction to my conclusions is welcome.
Hooking two parsers together in a sort of pipeline, as your own answer suggests, is a sound and simple design/solution, and I'm pretty sure ANTLR is capable of helping with that.
I don't know how far the ANTLR folks have gone in their work on stream/feed parsing. But adopting a two-pass strategy should be efficient enough, as the first pass would just be lexing a regular language, which is O(c * N) over the size of the input with a very small c.
If you want a single pass that costs O(k * N) (with a large k), you could consider PEG, for which there are implementations in Java (which I haven't tried).
Surprise: I am building an SQL-like language parser for a project.
I had it mostly working, but when I started testing it against real requests it would be handling, I realized it was behaving differently on the inside than I thought.
The main issue in the following grammar is that I define a lexer rule PCT_CONTAINS for the language keyword 'pct_contains'. This works fine, but if I try to match a field like 'attributes.pct_vac', I get a field with the text 'attributes.ac' and a pretty ANTLR error of:
line 1:15 mismatched character u'v' expecting 'c'
GRAMMAR
grammar Select;
options {
language=Python;
}
eval returns [value]
: field EOF
;
field returns [value]
: fieldsegments {print $field.text}
;
fieldsegments
: fieldsegment (DOT (fieldsegment))*
;
fieldsegment
: ICHAR+ (USCORE ICHAR+)*
;
WS : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
ICHAR : ('a'..'z'|'A'..'Z');
PCT_CONTAINS : 'pct_contains';
USCORE : '_';
DOT : '.';
I have been reading everything I can find on the topic: how the lexer consumes input as it finds it, even if it is wrong, and how you can use semantic predicates to remove ambiguity / use lookahead. But nothing I have read has helped me fix this issue.
Honestly, I don't see how it even CAN be an issue. I must be missing something super obvious, because other grammars I see have lexer rules like EXISTS, yet that doesn't cause the parser to take a string like 'existsOrNot' and spit out an IDENTIFIER with the text 'rNot'.
What am I missing or doing completely wrong?
Convert your fieldsegment parser rule into a lexer rule. As it stands now, it will accept input like
"abc
_ abc"
which is probably not what you want. The keyword "pct_contains" won't be matched by this rule, since it is defined separately. If you want to accept the keyword as a regular identifier in certain sequences, you will have to include it in the accepted identifier rule.
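Concretely, the conversion could look like this (a sketch in the question's ANTLR3 setup; FIELDSEGMENT is a hypothetical name, and 'pct_vac' now lexes as one FIELDSEGMENT because the longest match wins, while 'pct_contains' on its own still becomes the keyword token):

PCT_CONTAINS : 'pct_contains';
FIELDSEGMENT : ICHAR+ (USCORE ICHAR+)*;

fragment ICHAR  : ('a'..'z'|'A'..'Z');
fragment USCORE : '_';

fieldsegments
    : FIELDSEGMENT (DOT FIELDSEGMENT)*
    ;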
I'm trying to build a parser for a DSL using Irony in .NET, and I've found a problem I can't find a way around. Since Irony handles BNF, I think any BNF solution will be of help.
I have the following input:
$10 yesterday at drug store
With the following grammar:
<expr> :== unit | expr + unit
<unit> :== money | date | location
<date> :== yesterday|today|tomorrow
<location> :== .* | <preposition> .*
<preposition> :== at
<money> :== ((\$)?\d*((\.*)\d*)*\,?\d{1,2})
It works like a charm with this input. I get exactly the result I wanted which is:
Money Amount: 10
Date: Yesterday
Location: Drug Store
However, if I change the order of the input as follows,
$10 at drug store yesterday
because of reduce steps it fails to give me the same output. The output becomes:
Money amount: 10
Location: Drug Store Yesterday
I was wondering if there is a way to make sure that Location (which is a really broad regex match) is only evaluated once all the other tokens are captured and nothing else is left.
Any help is appreciated.
Thanks!
While this is not a general answer to BNF ambiguity, I was able to solve my problem in Irony by creating a new Terminal.
So if anyone else encounters this problem, the code for the new Terminal (while not yet added to the main Irony project) can be found at this link: http://irony.codeplex.com/discussions/269483
Thanks
I am working on a project that involves transforming part of speech tagged text into an ANTLR3 AST with phrases as nodes of the AST.
The input to ANTLR looks like:
DT-THE The NN dog VBD sat IN-ON on DT-THE the NN mat STOP .
i.e. (tag token)+ where neither the tag nor the token contains white space.
Is the following a good way of lexing this:
WS : (' ')+ {skip();};
TOKEN : (~' ')+;
The grammar then has entries like the following to describe the lowest level of the AST:
dtTHE:'DT-THE' TOKEN -> ^('DT-THE' TOKEN);
nn:'NN' TOKEN -> ^('NN' TOKEN);
(and 186 more of these!)
This approach seems to work, but it results in a ~9000-line Java lexer and takes a large amount of memory to build (~2 GB), hence I was wondering whether this is the optimal way of solving the problem.
Could you combine the TAG space TOKEN into a single AST subtree? Then you could pass both the TAG and the TOKEN into your source code for handling. If the Java code used to handle the resulting tree is very similar between the various TAGs, then you could perhaps simplify the ANTLR grammar with the trade-off of a bit more complication in your Java code.
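A sketch of that idea in the question's ANTLR3 notation (the rule names are hypothetical); the tag token becomes the root of each subtree, and the Java side branches on the tag's text instead of on 186 distinct rules:

sentence : tagged+ EOF;
tagged   : tag=TOKEN word=TOKEN -> ^($tag $word);

WS    : (' ')+ {skip();};
TOKEN : (~' ')+;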