How do I exclude characters / symbols using ANTLR grammar?

I'm trying to write a grammar for various time formats (12:30, 0945, 1:30-2:45, ...) using ANTLR. So far it works like a charm as long as I don't type in characters that haven't been defined in the grammar file.
I'm using the following JUnit test for example:
final CharStream stream = new ANTLRStringStream("12:40-1300,15:123-18:59");
final TimeGrammarLexer lexer = new TimeGrammarLexer(stream);
final CommonTokenStream tokenStream = new CommonTokenStream(lexer);
final TimeGrammarParser parser = new TimeGrammarParser(tokenStream);
try {
    final timeGrammar_return tree = parser.timeGrammar();
    fail();
} catch (final Exception e) {
    assertNotNull(e);
}
An Exception gets thrown (as expected) because "15:123" isn't valid.
If I try ("15:23a") though, no exception gets thrown and ANTLR treats it like a valid input.
Now if I define characters in my grammar, ANTLR seems to notice them and I once again get the exception I want:
CHAR: ('a'..'z')|('A'..'Z');
But how do I exclude umlauts, symbols and other stuff a user is able to type in (äöü{%&<>!)? So basically I'm looking for some kind of syntax that says: match everything BUT "0..9,:-"

...
So basically I'm looking for some kind of syntax that says: match everything BUT "0..9,:-"
The following rule matches any single character except a digit, a comma (,), a colon (:) or a hyphen (-):
Foo
: ~('0'..'9' | ',' | ':' | '-')
;
(the ~ negates single characters inside lexer-rules)
But you might want to post your entire grammar: I get the impression there are some other things you're not doing as they should have been done. Your call.
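For illustration only, the same negated set can be mirrored in plain Java, for example to check input before it reaches the lexer. This is a minimal sketch: the class TimeChars and its methods are hypothetical names, not part of ANTLR or of the grammar in the question.

```java
final class TimeChars {
    // Mirrors ~('0'..'9' | ',' | ':' | '-'): true for any character outside the allowed set.
    static boolean isForbidden(char c) {
        boolean allowed = (c >= '0' && c <= '9') || c == ',' || c == ':' || c == '-';
        return !allowed;
    }

    // Returns the index of the first forbidden character, or -1 if there is none.
    static int firstForbidden(String input) {
        for (int i = 0; i < input.length(); i++) {
            if (isForbidden(input.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstForbidden("12:40-1300,15:23")); // -1
        System.out.println(firstForbidden("15:23a"));           // 5
        System.out.println(firstForbidden("1\u00e4:30"));       // 1 (the umlaut)
    }
}
```

A check like this only covers the character set; whether "15:123" is a valid time is still up to the grammar.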

You can define a lexer rule that matches all the characters that you do not want. If this token is not used in any of your parser rules, ANTLR will throw a NoViableAltException.
For unicode this could look like this:
UTF8 : ('\u0000'..'\u002A' // control characters, space, ! to *
| '\u002E'..'\u002F' // . /
| '\u003B'..'\u00FF' // ; < = > ? @ as well as letters, brackets and other Latin-1 characters
)
;

Related

Can a parser fail silently?

May ANTLR-generated parsers fail silently? That is, can they omit a diagnostic when they do not recognise the input?
Using a very small grammar for a demonstration and using defaults only for ANTLR, these are the contrasting observations:
When sending input to the usual test rig for the grammar below, I am noticing two things:
the parsers recognize valid input (actions show that), OK;
however, the recognisers seem to “accept” certain invalid(?) inputs, meaning there is no diagnosis. V3 and v4 parsers behave similarly. The issue, if there is an issue, appears when there are characters ('1') missing at the front of an input for stat, provided that prior to this input another input of just a NEWLINE had been sent.
This is the v4 grammar:
grammar Simp;
prog : stat+ ;
stat : '1' '+' '1' NEWLINE
| NEWLINE
;
NEWLINE : [\r]?[\n] ;
The v3 grammar is the same, mutatis mutandis.
Some runs using v4; class TestSimp4 is the usual test rig as in the book(s),
see below:
% printf "1+1\n" |java -classpath "antlr-4.11.1-complete.jar:." TestSimp4
% printf "+1\n" |java -classpath "antlr-4.11.1-complete.jar:." TestSimp4
line 1:0 extraneous input '+' expecting {'1', NEWLINE}
line 1:2 mismatched input '\n' expecting '+'
% printf "\n+1\n" |java -classpath "antlr-4.11.1-complete.jar:." TestSimp4
%
The first two invocations' results I had expected. I had expected the last invocation to visibly fail, though. Correct?
Looking at the generated SimpParser.java, the silent exit seems to follow from the generated code, as outlined below. But should it be that way? I am thinking that ANTLR just stops before recognising invalid input here, but it shouldn't just stop.
Question: Is this silent failure rather to be expected? Have I overlooked something like a greediness setting for grammar tokens with a + suffix?
Some code analysis.
Referring to the loop that calls stat() (in the
prog() procedure):
The v3 parser sets a counter variable to >= 1 on successfully matching the initial NEWLINE. The effect is that EarlyExitException is then not thrown on later inputs; the parser just breaks out of the loop.
The v4 parser similarly calls _input.LA(1) and then just terminates
the loop whenever that call’s result cannot be at the start of stat.
(So no recovery?)
The test rig:
import org.antlr.v4.runtime.*;

class TestSimp4 {
    public static void main(String[] args) throws Exception {
        final CharStream subject = CharStreams.fromStream(System.in);
        final TokenSource tknzr = new SimpLexer(subject);
        final CommonTokenStream ts = new CommonTokenStream(tknzr);
        final SimpParser parser = new SimpParser(ts);
        parser.prog();
    }
}
So another paraphrase of my question would be: “How does one
create ANTLR parsers such that they will always say YES or NO?”
Your 3rd test input, \n+1\n, does not produce an error because you're telling it to recognize the production/rule stat once or more. And prog successfully matches the input \n and then stops. If you want the entire input (token stream) to be consumed, "anchor" your prog rule with the EOF token:
prog : stat+ EOF;
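To make the difference concrete without the ANTLR runtime, here is a small hand-written recognizer for the same two rules. It is only a sketch under simplifying assumptions: SimpRecognizer is a hypothetical class, NEWLINE is reduced to "\n", and there is no error reporting or recovery as in a generated parser.

```java
final class SimpRecognizer {
    private final String in;
    private int pos = 0;

    SimpRecognizer(String in) {
        this.in = in;
    }

    // stat : '1' '+' '1' NEWLINE | NEWLINE ;
    private boolean stat() {
        if (in.startsWith("1+1\n", pos)) { pos += 4; return true; }
        if (in.startsWith("\n", pos))    { pos += 1; return true; }
        return false;
    }

    // Lookahead check that decides whether the stat+ loop continues.
    private boolean canStartStat() {
        return pos < in.length() && (in.charAt(pos) == '1' || in.charAt(pos) == '\n');
    }

    // requireEof == false models  prog : stat+ ;
    // requireEof == true  models  prog : stat+ EOF ;
    boolean prog(boolean requireEof) {
        if (!stat()) return false;              // stat+ needs at least one stat
        while (canStartStat()) {
            if (!stat()) return false;
        }
        return !requireEof || pos == in.length();
    }

    static boolean accepts(String input, boolean requireEof) {
        return new SimpRecognizer(input).prog(requireEof);
    }

    public static void main(String[] args) {
        System.out.println(accepts("1+1\n", true));    // true
        System.out.println(accepts("\n+1\n", false));  // true: stops after the first NEWLINE
        System.out.println(accepts("\n+1\n", true));   // false: '+' is left over before EOF
    }
}
```

Without the EOF check, the loop simply ends when the next character cannot start a stat, which is the "silent" behaviour seen with the third input.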

What is the ANTLR4 equivalent of a ! in a lexer rule?

I'm working on converting an old ANTLR 2 grammar to ANTLR 4, and I'm having trouble with the string rule.
STRING :
'\''!
(
~('\'' | '\\' | '\r' | '\n')
)*
'\''!
;
This creates a STRING token whose text contains the contents of the string, but does not contain the starting and ending quotes, because of the ! symbol after the quote literals.
ANTLR 4 chokes on the ! symbol, ('!' came as a complete surprise to me (AC0050)) but if I leave it off, I end up with tokens that contain the quotes, which is not what I want. What's the correct way to port this to ANTLR 4?
Antlr4 generally treats tokens as being immutable, at least in the sense that there is no support for a language neutral equivalent of !.
Perhaps the simplest way to accomplish the equivalent is:
string : str=STRING { Strings.unquote($str); } ;
STRING : SQuote ~[\r\n\\']* SQuote ;
fragment SQuote : '\'' ;
where Strings.unquote is:
public static void unquote(Token token) {
CommonToken ct = (CommonToken) token;
String text = ct.getText();
text = .... unquote it ....
ct.setText(text);
}
The reason for using a parser rule is that attribute references are not (currently) supported in the lexer. Still, it could be done in the lexer rule; it would just require a bit more effort to get at the token.
An alternative to modifying the token text is to implement a custom token with custom fields and methods. See this answer if of interest.
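The unquoting step in Strings.unquote above is left elided in the answer. As one possible reading, and assuming the token text always starts and ends with a single quote as the STRING rule implies, the string-level part could be sketched like this. QuoteText.stripQuotes is a hypothetical helper, not an ANTLR API.

```java
final class QuoteText {
    // Removes one leading and one trailing single quote, if both are present.
    // Text that is not quoted on both ends is returned unchanged.
    static String stripQuotes(String text) {
        int n = text.length();
        if (n >= 2 && text.charAt(0) == '\'' && text.charAt(n - 1) == '\'') {
            return text.substring(1, n - 1);
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(stripQuotes("'abc'"));  // abc
        System.out.println(stripQuotes("abc"));    // abc
        System.out.println(stripQuotes("''").isEmpty()); // true
    }
}
```

The result would then be passed to setText on the token, as in the answer's own code.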
I believe in ANTLR4 your problem can be solved using lexical modes and lexer commands.
Here is an example from there that I think does exactly what you need (although for double quotes but it's an easy fix):
lexer grammar Strings;
LQUOTE : '"' -> more, mode(STR) ;
WS : [ \r\t\n]+ -> skip ;
mode STR;
STRING : '"' -> mode(DEFAULT_MODE) ; // token we want parser to see
TEXT : . -> more ; // collect more text for string

antlr4 multiline string parsing

If I have a ONELINE_STRING fragment rule in an antlr4 lexer that identifies a simple quoted string on one line, how can I create a more general STRING rule in the lexer that will concatenate adjacent ONELINE_STRING's (ie, separated only by whitespace and/or comments) as long as they each start on a different line?
ie,
"foo" "bar"
would be parsed as two STRING tokens, "foo" followed by "bar"
while:
"foo"
"bar"
would be seen as one STRING token: "foobar"
For clarification: I generally want the parser to recognize adjacent strings as separate, and whitespace and comments to be ignored by the parser. However, if the last non-whitespace sub-token on a line is a string, and the first non-whitespace sub-token on the next line is also a string, then the separate strings should be concatenated into one long string. This is a means of specifying potentially very long strings without having to put the whole thing on one line.
This would be very straightforward if I wanted all adjacent string sub-tokens to be concatenated, as they are in C, but for my purposes I only want concatenation to occur when the string sub-tokens start on different lines. This concatenation should be invisible to any rule in the parser that might use a string.
This is why I was thinking it might be better to situate the rule inside the lexer instead of the parser. I'm not wholly opposed to doing this in the parser, though, in which case all the parsing rules that might have referred to a STRING token would instead refer to the parser string rule whenever they want a string.
Sample1:
"desc" "this sample will parse as two strings."
Sample3 (note, 'output' is a keyword in the language):
output "this is a very long line that I've explicitly made so that it does not "
"easily fit on just one line, so it gets split up into separate ones for "
"ease of reading, but the parser should see it all as one long string. "
"This example will parse as if the output command had been followed by "
"only a single string, even though it is composed of multiple string "
"fragments, all of which should be invisible to the parser.%n";
Both of these examples should be accepted as valid by the parser. The former is an example of a declaration, while the latter is an example of an imperative statement in the language.
Addendum:
I had originally been thinking that this would need to be done in the lexer because, although newlines are supposed to be ignored by the parser like all other whitespace, a multiline string is actually sensitive to the presence of newlines, and I did not think that the parser could perceive that.
However, I have been thinking that it may be possible to have the ONELINE_STRING as a lexer rule, and have a general 'string' parser rule which detects adjacent ONELINE_STRINGS, using a predicate between strings to detect if the next ONELINE_STRING token is starting on a different line than the previous one, and if so, it should invisibly concatenate them so that its text is indistinguishable from a string that had been specified all on one line. I am unsure of the logistics of how this would be implemented, however.
Okay, I have it.
I need to have the string recognizer in the parser, as some of you have suggested. The trick is to use lexer modes in the lexer.
So in the Lexer file I have this:
BEGIN_STRING : '"' -> pushMode(StringMode);
mode StringMode;
END_STRING: '"'-> popMode;
STRING_LITERAL_TEXT : ~[\r\n%"];
STRING_LITERAL_ESCAPE_QUOTE : '%"' { setText("\""); };
STRING_LITERAL_ESCAPE_PERCENT: '%%' { setText("%"); };
STRING_LITERAL_ESCAPE_NEWLINE : '%n'{ setText("\n"); };
UNTERMINATED_STRING: { _input.LA(1) == '\n' || _input.LA(1) == '\r' || _input.LA(1) == EOF}? -> popMode;
And in the parser file I have this:
string returns [String text] locals [int line] : a=stringLiteral { $line = $a.line; $text=$a.text;}
({_input.LT(1)!=null && _input.LT(1).getLine()>$line}?
a=stringLiteral { $line = $a.line; $text+=$a.text; })*
;
stringLiteral returns [int line, String text]: BEGIN_STRING {$text = "";}
(a=(STRING_LITERAL_TEXT
| STRING_LITERAL_ESCAPE_NEWLINE
| STRING_LITERAL_ESCAPE_QUOTE
| STRING_LITERAL_ESCAPE_PERCENT
) {$text+=$a.text;} )*
stringEnd { $line = $BEGIN_STRING.line; }
;
stringEnd: END_STRING #string_finish
| UNTERMINATED_STRING #string_hang
;
The string rule thus concatenates adjacent string literals as long as they are on different lines. The stringEnd rule needs an event handler for when a string literal is not terminated correctly so that the parser can report a syntax error, but the string is otherwise treated as if it had been closed correctly.
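The line comparison that the string rule's predicate performs can be tried out in isolation. The following is a sketch only: StringMerge and Lit are hypothetical stand-ins for the parser rule and for a string literal with its starting line, and the input is assumed to be a run of string literals with nothing but whitespace or comments between them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class StringMerge {
    // Hypothetical stand-in for one string literal: unquoted text plus its starting line.
    static final class Lit {
        final String text;
        final int line;

        Lit(String text, int line) {
            this.text = text;
            this.line = line;
        }
    }

    // Appends a literal to the previous one when it starts on a later line;
    // otherwise it begins a new string.
    static List<String> merge(List<Lit> lits) {
        List<String> out = new ArrayList<>();
        StringBuilder current = null;
        int lastLine = -1;
        for (Lit lit : lits) {
            if (current != null && lit.line > lastLine) {
                current.append(lit.text);
            } else {
                if (current != null) {
                    out.add(current.toString());
                }
                current = new StringBuilder(lit.text);
            }
            lastLine = lit.line;
        }
        if (current != null) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(merge(Arrays.asList(new Lit("foo", 1), new Lit("bar", 1)))); // [foo, bar]
        System.out.println(merge(Arrays.asList(new Lit("foo", 1), new Lit("bar", 2)))); // [foobar]
    }
}
```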
EDIT: Sorry, have not read your requirements fully. The following approach would match both examples not only the desired one. Have to think about it...
The simplest way would be to do this in the parser. And I see no point that would require this to be done in the lexer.
multiString : singleString +;
singleString : ONELINE_STRING;
ONELINE_STRING: ...; // no fragment!
WS : ... -> skip;
Comment : ... -> skip;
As already mentioned, the (IMO) better way would be to handle this inside the parser. But here's a way to handle it in the lexer:
STRING
: SINGLE_STRING ( LINE_CONTINUATION SINGLE_STRING )*
;
HIDDEN
: ( SPACE | LINE_BREAK | COMMENT ) -> channel(HIDDEN)
;
fragment SINGLE_STRING
: '"' ~'"'* '"'
;
fragment LINE_CONTINUATION
: ( SPACE | COMMENT )* LINE_BREAK ( SPACE | COMMENT )*
;
fragment SPACE
: [ \t]
;
fragment LINE_BREAK
: [\r\n]
| '\r\n'
;
fragment COMMENT
: '//' ~[\r\n]+
;
Tokenizing the input:
"a" "b"
"c"
"d"
"e"
"f"
would create the following 5 tokens:
"a"
"b"
"c"\n"d"
"e"
"f"
However, if the token included a comment:
"c" // comment
"d"
then you'd need to strip this "// comment" from the token yourself at a later stage. The lexer will not be able to put this substring on a different channel, or skip it.
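One way such a later clean-up stage might look is sketched below. It assumes the token text follows the STRING rule above (pieces delimited by double quotes with no escape sequences, and comments starting with //) and that the caller wants the concatenated contents without quotes. MergedString.contents is a hypothetical helper.

```java
final class MergedString {
    // Concatenates the contents of every "..." piece in the token text,
    // skipping // comments and whitespace between the pieces.
    static String contents(String tokenText) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < tokenText.length()) {
            char c = tokenText.charAt(i);
            if (c == '"') {
                int end = tokenText.indexOf('"', i + 1);
                if (end < 0) {
                    break; // unterminated piece; not expected for a complete STRING token
                }
                out.append(tokenText, i + 1, end);
                i = end + 1;
            } else if (tokenText.startsWith("//", i)) {
                // Skip the comment up to, but not including, the line break.
                while (i < tokenText.length()
                        && tokenText.charAt(i) != '\n'
                        && tokenText.charAt(i) != '\r') {
                    i++;
                }
            } else {
                i++; // space or line break between pieces
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(contents("\"c\" // comment\n\"d\"")); // cd
    }
}
```

Scanning piece by piece, rather than searching for quotes globally, keeps a quote character inside a comment from being mistaken for the start of a string.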

Antlr 3 keywords and identifiers colliding

Surprise, I am building an SQL like language parser for a project.
I had it mostly working, but when I started testing it against real requests it would be handling, I realized it was behaving differently on the inside than I thought.
The main issue in the following grammar is that I define a lexer rule PCT_CONTAINS for the language keyword 'pct_contains'. This works fine, but if I try to match a field like 'attributes.pct_vac', I get the field having text of 'attributes.ac' and a pretty ANTLR error of:
line 1:15 mismatched character u'v' expecting 'c'
GRAMMAR
grammar Select;
options {
language=Python;
}
eval returns [value]
: field EOF
;
field returns [value]
: fieldsegments {print $field.text}
;
fieldsegments
: fieldsegment (DOT (fieldsegment))*
;
fieldsegment
: ICHAR+ (USCORE ICHAR+)*
;
WS : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
ICHAR : ('a'..'z'|'A'..'Z');
PCT_CONTAINS : 'pct_contains';
USCORE : '_';
DOT : '.';
I have been reading everything I can find on the topic. How the Lexer consumes stuff as it finds it even if it is wrong. How you can use semantic predication to remove ambiguity/how to use lookahead. But everything I read hasn't helped me fix this issue.
Honestly I don't see how it even CAN be an issue. I must be missing something super obvious because other grammars I see have Lexer rules like EXISTS but that doesn't cause the parser to take a string like 'existsOrNot' and spit out an IDENTIFIER with the text of 'rNot'.
What am I missing or doing completely wrong?
Convert your fieldsegment parser rule into a lexer rule. As it stands now it will accept input like
"abc
_ abc"
which is probably not what you want. The keyword "pct_contains" won't be matched by this rule since it is defined separately. If you want to accept the keyword in certain sequences as a regular identifier, you will have to include it in the accepted identifier rule.

antlr3 - read closure value to a variable

I would like to parse and read a closure value in a simple text line like this:
1 !something
line
: (NUMBER EXCLAMATION myText=~('\r\n')*)
{ myFunction($myText.text); }
;
NUMBER
: '0'..'9'+;
EXCLAMATION
: '!';
What I get in the myText variable is just the final 'g' of 'something' because, as can be seen in the generated code, myText is overwritten in a while loop for each occurrence of ~('\r\n').
My question is: is there any elegant way to read the 'something' value into the variable 'myText'?
TIA
Inside parser rules, the ~ does not negate characters, but tokens. So ~('\r\n') would match any token other than the literal '\r\n' token (in your example, that would be a NUMBER or EXCLAMATION).
The lexer cannot be "driven" by the parser: after the parser has matched a NUMBER and an EXCLAMATION, you can't tell the lexer to produce tokens other than the ones it has already produced. The lexer will always produce tokens based on some simple rules, regardless of what the parser "needs".
In other words: you can't handle this in the parser.