ANTLR Grammar to parse an Asterisk-delimited input

I am attempting to use ANTLR (v4) to create a parser for an asterisk-delimited list encapsulated by START and END markers.
START**na**na**aa*aa*a*asdfaaa*aaDDFdasa*aaaffdda*aa*aassda*ataaaaaaaaa*a*a*aEND
Where a normal input string would be something like:
START*na*na*aa*aa*a*asdfaaa*aaDDFdasa*aaaffdda*aa*aassda*ataaaaaaaaa*a*a*aEND
I would still need to allow spaces, tabs, and null/empty fields (basically any characters except START, END, and * between the asterisks).
That includes things like ** * * *asdf fdsa* * asdf *
Here is my grammar so far:
parseIt: ENTRY ;
ENTRY : 'START*' FIELD_SET 'END' ;
fragment Delim : '*' ;
fragment Data : (ANY | WS)* ;
fragment FIELD_SET : Data (Delim Data|Delim)* ;
I can recognize simple input (like the first example I gave), but am having trouble recognizing tokens that have spaces or special characters between the asterisks.

I’m pretty sure you could handle this with a RegEx and capture groups, but if you really want to use ANTLR…
The following works:
grammar asterisks;
parseIt: 'START' dataItem* 'END' EOF;
dataItem: Delim Data?;
Delim : '*' ;
Data : ~[*]+ {!(
(getText().endsWith("E") && _input.LA(1) == (int) 'N' && _input.LA(2) == (int) 'D') ||
(getText().endsWith("EN") && _input.LA(1) == (int) 'D') ||
(getText().endsWith("END")))}?;
and gives the following parse tree (for your first input): [parse tree image]
Unfortunately for you, given the way the lexer works, a simple lexer rule like Data : ~[*]+ will match aEND in preference to your implied END lexer rule. The ANTLR lexer uses the rule that matches the longest sequence of input characters, and Data : ~[*]+ matches aEND while END only matches END (the ANTLR lexer also doesn't look ahead for token matches). As a result, the rather tortured semantic predicate is the only way to disallow a token that is a stream of characters ending with END.
(Note: Semantic predicates are target-language specific, and this predicate is for Java. Other targets would require the equivalent in that target language.)
Another approach would be to check whether your input ends with END and, if so, remove it prior to parsing with this grammar (which therefore no longer expects an END token):
grammar asterisks;
parseIt: 'START' dataItem* EOF;
dataItem: Delim Data?;
Delim : '*' ;
Data : ~[*]+;
This avoids the END token problem by just removing it from the input stream. Given that it's the very end of the stream, this might be simpler.
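For comparison, the regular-expression idea mentioned at the top of this answer could look roughly like the Python sketch below. It is only an illustration: the helper name split_fields is made up here, and it assumes the whole input is a single START...END record.

```python
import re

# Hypothetical helper, not part of ANTLR or of the grammars above.
# The greedy (.*) runs up to the last END in the input.
RECORD = re.compile(r"START(.*)END", re.DOTALL)

def split_fields(text):
    """Return the asterisk-delimited fields between START and END."""
    match = RECORD.fullmatch(text)
    if match is None:
        raise ValueError("input is not wrapped in START ... END")
    body = match.group(1)
    if body == "":
        return []  # no fields at all
    if not body.startswith("*"):
        raise ValueError("expected '*' right after START")
    # Drop the leading delimiter, then split; empty fields are kept.
    return body[1:].split("*")

fields = split_fields("START*na**a b* *aEND")
print(fields)
```

Because the fields are produced by a plain split, empty fields and fields containing only whitespace come through unchanged.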

Related

Lex matching doesn't enter recursive rule as expected

I am trying to match words between # characters. Here is my attempt:
init : (TEXT | HASH | placeholder) init? EOF ;
placeholder : HASH lexeme HASH ;
lexeme : LEXEME;
HASH : '#' ;
LEXEME : [a-zA-Z0-9-_]+ ;
TEXT : ~'#'+ ;
My input string: "The good text with a #LEXEME#followed# by hashes of death#############"
And the resulting ParseTree:
I'm expecting the "followed" word to be parsed as a TEXT in the next recursive init but it looks like it's parsed in the same init iteration, thus not recognized. This happens every time a pattern like #letters#letters# is encountered.
How do I solve this?
It looks like you want the #s to mark the start and stop of your placeholders (aka LEXEMEs). You could do that by breaking the grammar into a Lexer grammar and a Parser grammar:
lexer grammar HashLexer
;
HASH: '#' -> mode(PLACEHOLDER_MODE);
TEXT: ~'#'+;
mode PLACEHOLDER_MODE
;
LEXEME: [a-zA-Z0-9\-_]+;
HASH_TERM: '#' -> mode(DEFAULT_MODE);
parser grammar HashParser
;
options {
tokenVocab = HashLexer;
}
init: (TEXT | placeholder)* EOF;
placeholder: HASH LEXEME? HASH_TERM;
When I try to parse your input "The good text with a #LEXEME#followed# by hashes of death#############" however, I get the following token stream:
[#0,0:20='The good text with a ',<TEXT>,1:0]
[#1,21:21='#',<HASH>,1:21]
[#2,22:27='LEXEME',<LEXEME>,1:22]
[#3,28:28='#',<HASH_TERM>,1:28]
[#4,29:36='followed',<TEXT>,1:29]
[#5,37:37='#',<HASH>,1:37]
[#6,39:40='by',<LEXEME>,1:39]
[#7,42:47='hashes',<LEXEME>,1:42]
[#8,49:50='of',<LEXEME>,1:49]
[#9,52:56='death',<LEXEME>,1:52]
[#10,57:57='#',<HASH_TERM>,1:57]
[#11,58:58='#',<HASH>,1:58]
[#12,59:59='#',<HASH_TERM>,1:59]
[#13,60:60='#',<HASH>,1:60]
[#14,61:61='#',<HASH_TERM>,1:61]
[#15,62:62='#',<HASH>,1:62]
[#16,63:63='#',<HASH_TERM>,1:63]
[#17,64:64='#',<HASH>,1:64]
[#18,65:65='#',<HASH_TERM>,1:65]
[#19,66:66='#',<HASH>,1:66]
[#20,67:67='#',<HASH_TERM>,1:67]
[#21,68:68='#',<HASH>,1:68]
[#22,69:69='#',<HASH_TERM>,1:69]
[#23,70:70='\n',<TEXT>,1:70]
[#24,71:70='<EOF>',<EOF>,2:0]
The # after followed switches the lexer into PLACEHOLDER_MODE, so " by hashes of death" is lexed in PLACEHOLDER_MODE and generates token recognition errors for the characters (here, the spaces) that do not match the LEXEME rule. And you get the following parse tree: [parse tree image]
This seems to be the correct interpretation of your input. Assuming that #s act like ( and ) to bracket some input, you're going to get situations like this when they're not matched up correctly. The only solution to that would be to relax the grammar quite a bit and handle more of the validation in a listener/visitor.
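To make the mode switching easier to follow, here is a small Python simulation of the two lexer modes. It is a sketch of the behaviour described above, not ANTLR's actual implementation; the function name tokenize and the (position, character) error format are assumptions made for this example.

```python
import re

TEXT = re.compile(r"[^#]+")            # TEXT   : ~'#'+
LEXEME = re.compile(r"[A-Za-z0-9_-]+")  # LEXEME : [a-zA-Z0-9\-_]+

def tokenize(text):
    """Simulate DEFAULT_MODE / PLACEHOLDER_MODE; returns (tokens, errors)."""
    tokens, errors = [], []
    in_placeholder = False  # False = DEFAULT_MODE, True = PLACEHOLDER_MODE
    pos = 0
    while pos < len(text):
        if text[pos] == "#":
            # '#' toggles the mode, like `-> mode(...)` in the lexer grammar.
            tokens.append(("HASH_TERM" if in_placeholder else "HASH", "#"))
            in_placeholder = not in_placeholder
            pos += 1
            continue
        rule, name = (LEXEME, "LEXEME") if in_placeholder else (TEXT, "TEXT")
        m = rule.match(text, pos)
        if m:
            tokens.append((name, m.group()))
            pos = m.end()
        else:
            # No rule matches in this mode: record an error, skip one character.
            errors.append((pos, text[pos]))
            pos += 1
    return tokens, errors

tokens, errors = tokenize("a #X#followed# by death##")
for token in tokens:
    print(token)
print("errors:", errors)
```

As in the token stream above, the word after a closed placeholder is lexed as TEXT, while the spaces that follow an unmatched # end up as errors because only LEXEME and HASH_TERM exist in that mode.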

antlr4 multiline string parsing

If I have a ONELINE_STRING fragment rule in an antlr4 lexer that identifies a simple quoted string on one line, how can I create a more general STRING rule in the lexer that will concatenate adjacent ONELINE_STRINGs (i.e., separated only by whitespace and/or comments) as long as they each start on a different line?
i.e.,
"foo" "bar"
would be parsed as two STRING tokens, "foo" followed by "bar"
while:
"foo"
"bar"
would be seen as one STRING token: "foobar"
For clarification: The idea is that while I generally want the parser to be able to recognize adjacent strings as separate, and whitespace and comments to be ignored by the parser, I want to use the idea that if the last non-whitespace sub-token on a line was a string, and the first sub-token on the next line that is not all whitespace is also a string, then the separate strings should be concatenated into one long string as a means of specifying potentially very long strings without having to put the whole thing on one line. This is very straightforward if I were wanting all adjacent string sub-tokens to be concatenated, as they are in C... but for my purposes, I only want concatenation to occur when the string sub-tokens start on different lines. This concatenation should be invisible to any rule in the parser that might use a string. This is why I was thinking it might be better to situate the rule inside the lexer instead of the parser, but I'm not wholly opposed to doing this in the parser, and all the parsing rules which might have referred to a STRING token would instead refer to the parser string rule whenever they want a string.
Sample1:
"desc" "this sample will parse as two strings.
Sample3 (note, 'output' is a keyword in the language):
output "this is a very long line that I've explicitly made so that it does not "
"easily fit on just one line, so it gets split up into separate ones for "
"ease of reading, but the parser should see it all as one long string. "
"This example will parse as if the output command had been followed by "
"only a single string, even though it is composed of multiple string "
"fragments, all of which should be invisible to the parser.%n";
Both of these examples should be accepted as valid by the parser. The former is an example of a declaration, while the latter is an example of an imperative statement in the language.
Addendum:
I had originally been thinking that this would need to be done in the lexer because, although newlines are supposed to be ignored by the parser like all other whitespace, a multiline string is actually sensitive to the presence of newlines, and I did not think that the parser could perceive that.
However, I have been thinking that it may be possible to have the ONELINE_STRING as a lexer rule, and have a general 'string' parser rule which detects adjacent ONELINE_STRINGS, using a predicate between strings to detect if the next ONELINE_STRING token is starting on a different line than the previous one, and if so, it should invisibly concatenate them so that its text is indistinguishable from a string that had been specified all on one line. I am unsure of the logistics of how this would be implemented, however.
Okay, I have it.
I need to have the string recognizer in the parser, as some of you have suggested. The trick is to use lexer modes in the lexer.
So in the Lexer file I have this:
BEGIN_STRING : '"' -> pushMode(StringMode);
mode StringMode;
END_STRING: '"'-> popMode;
STRING_LITERAL_TEXT : ~[\r\n%"];
STRING_LITERAL_ESCAPE_QUOTE : '%"' { setText("\""); };
STRING_LITERAL_ESCAPE_PERCENT: '%%' { setText("%"); };
STRING_LITERAL_ESCAPE_NEWLINE : '%n'{ setText("\n"); };
UNTERMINATED_STRING: { _input.LA(1) == '\n' || _input.LA(1) == '\r' || _input.LA(1) == EOF}? -> popMode;
And in the parser file I have this:
string returns [String text] locals [int line] : a=stringLiteral { $line = $a.line; $text=$a.text;}
({_input.LT(1)!=null && _input.LT(1).getLine()>$line}?
a=stringLiteral { $line = $a.line; $text+=$a.text; })*
;
stringLiteral returns [int line, String text]: BEGIN_STRING {$text = "";}
(a=(STRING_LITERAL_TEXT
| STRING_LITERAL_ESCAPE_NEWLINE
| STRING_LITERAL_ESCAPE_QUOTE
| STRING_LITERAL_ESCAPE_PERCENT
) {$text+=$a.text;} )*
stringEnd { $line = $BEGIN_STRING.line; }
;
stringEnd: END_STRING #string_finish
| UNTERMINATED_STRING #string_hang
;
The string rule thus concatenates adjacent string literals as long as they are on different lines. The stringEnd rule needs an event handler for when a string literal is not terminated correctly so that the parser can report a syntax error, but the string is otherwise treated as if it had been closed correctly.
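The concatenation logic of the string rule can be illustrated outside ANTLR with a short Python sketch. It assumes the string literals arrive as (line, text) pairs with nothing but whitespace or comments between them; the function name merge_strings is hypothetical.

```python
def merge_strings(literals):
    """literals: list of (line, text) pairs in source order.

    A literal is appended to the previous one only when it starts on a
    later line; otherwise it begins a new string.
    """
    merged = []
    last_line = None
    for line, text in literals:
        if merged and line > last_line:
            merged[-1] += text      # different line: concatenate
        else:
            merged.append(text)     # same line (or first literal): new string
        last_line = line
    return merged

print(merge_strings([(1, "foo"), (1, "bar")]))
print(merge_strings([(1, "foo"), (2, "bar")]))
```

This mirrors the predicate in the parser rule, which compares the line of the next token with the line recorded for the previous string literal.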
EDIT: Sorry, I had not read your requirements fully. The following approach would match both examples, not only the desired one. I'll have to think about it...
The simplest way would be to do this in the parser. And I see no point that would require this to be done in the lexer.
multiString : singleString +;
singleString : ONELINE_STRING;
ONELINE_STRING: ...; // no fragment!
WS : ... -> skip;
Comment : ... -> skip;
As already mentioned, the (IMO) better way would be to handle this inside the parser. But here's a way to handle it in the lexer:
STRING
: SINGLE_STRING ( LINE_CONTINUATION SINGLE_STRING )*
;
HIDDEN
: ( SPACE | LINE_BREAK | COMMENT ) -> channel(HIDDEN)
;
fragment SINGLE_STRING
: '"' ~'"'* '"'
;
fragment LINE_CONTINUATION
: ( SPACE | COMMENT )* LINE_BREAK ( SPACE | COMMENT )*
;
fragment SPACE
: [ \t]
;
fragment LINE_BREAK
: [\r\n]
| '\r\n'
;
fragment COMMENT
: '//' ~[\r\n]+
;
Tokenizing the input:
"a" "b"

"c"
"d"

"e"

"f"
would create the following 5 tokens:
"a"
"b"
"c"\n"d"
"e"
"f"
However, if the token would include a comment:
"c" // comment
"d"
then you'd need to strip this "// comment" from the token yourself at a later stage. The lexer will not be able to put this substring on a different channel, or skip it.
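That later clean-up step could be as small as the following Python sketch, which keeps only the quoted pieces of a merged token. The function name string_value is made up, and the sketch assumes that comments never contain a double quote (the SINGLE_STRING fragment above has no escape sequences, so the quoted pieces themselves cannot contain one).

```python
import re

# Mirrors the fragment SINGLE_STRING : '"' ~'"'* '"'
SINGLE_STRING = re.compile(r'"([^"]*)"')

def string_value(token_text):
    """Join the contents of every quoted piece in a merged STRING token."""
    return "".join(SINGLE_STRING.findall(token_text))

print(string_value('"c" // comment\n"d"'))
```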

ANTLR v4: Same character has different meaning in different contexts

This is my first crack at parser generators, and, consequently ANTLR. I'm using ANTLR v4 trying to generate a simple practice parser for Morse Code with the following extra rules:
A letter (e.g., ... [the letter 's']) can be denoted as capitalized if a '^' precedes it
ex.: ^... denotes a capital 'S'
Special characters can be embedded in parentheses
ex.: (#)
Each encoded entity will be separated by whitespace
So I could encode the following sentence:
ABC a#b.com
as (with corresponding letters shown underneath):
^.- ^-... ^-.-. ( ) .- (#) -... (.) -.-. --- --
A   B     C     ' ' a  '#' b    '.' c    o   m
Particularly note the two following entities: ( ) (which denotes a space) and (.) (which denotes a period).
There is mainly one thing that I'm finding hard to wrap my head around: The same token can take on different meanings depending on whether it is in parentheses or not. That is, I want to tell ANTLR that I want to discard whitespace, yet not in the ( ) case. Also, a Morse Code character can consist of dots-and-dashes (periods-and-dashes), yet I don't want to consider the period in (.) as "any character".
Here is the grammar I have got so far:
grammar MorseCode;
file: entity*;
entity:
special
| morse_char;
special: '(' SPECIAL ')';
morse_char: '^'? (DOT_OR_DASH)+;
SPECIAL : .; // match any character
DOT_OR_DASH : ('.' | '-');
WS : [ \t\r\n]+ -> skip; // we don't care about whitespace (or do we?)
When I try it against the following input:
^... --- ...(#)
I get the following output (from grun ... -tokens):
[#0,0:0='^',<1>,1:0]
[#1,1:1='.',<4>,1:1]
...
[#15,15:14='<EOF>',<-1>,1:15]
line 1:1 mismatched input '.' expecting DOT_OR_DASH
It seems there is trouble with ambiguity between SPECIAL and DOT_OR_DASH?
It seems like your (#) syntax behaves like a quoted string in other programming languages. I would start by defining SPECIAL as:
SPECIAL : '(' .*? ')';
To ensure that . . and .. are actually different, you can use this:
SYMBOL : [.-]+;
Then you can define your ^ operator:
CARET : '^';
With these three tokens (and leaving WS as-is), you can simplify your parser rules significantly:
file
: entity* EOF
;
entity
: morse_char
| SPECIAL
;
morse_char
: CARET? SYMBOL
;
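To see how these three tokens carve up the input, here is a rough Python imitation of the lexer using one regular expression per rule. It is only a sketch: the function name tokenize is an assumption, and Python's alternation picks the first alternative that matches rather than the longest, which happens to be equivalent here because each rule starts with different characters.

```python
import re

TOKEN = re.compile(r"""
    (?P<SPECIAL>\(.*?\))   # like SPECIAL : '(' .*? ')'
  | (?P<SYMBOL>[.-]+)      # like SYMBOL  : [.-]+
  | (?P<CARET>\^)          # like CARET   : '^'
  | (?P<WS>[ \t\r\n]+)     # like WS, which is skipped
""", re.VERBOSE | re.DOTALL)

def tokenize(text):
    """Return (token_name, text) pairs, dropping whitespace tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

tokens = tokenize("^... --- ...(#) ( ) (.)")
for token in tokens:
    print(token)
```

The space in ( ) and the period in (.) stay inside a SPECIAL token, so neither the whitespace rule nor SYMBOL ever sees them.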

How to consume text until newline in ANTLR?

How do you do something like this with ANTLR?
Example input:
title: hello world
Grammar:
header : IDENT ':' REST_OF_LINE ;
IDENT : 'a'..'z'+ ;
REST_OF_LINE : ~'\n'* '\n' ;
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
(I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.)
Regarding "It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT":
You must understand that the lexer operates independently from the parser. No matter what the parser would "like" to match at a certain time, the lexer simply creates tokens following some strict rules:
1. try to match tokens from top to bottom in the lexer rules (rules defined first are tried first);
2. match as much text as possible. In case 2 rules match the same amount of text, the rule defined first will be matched.
Because of rule 2, your REST_OF_LINE will always "win" over the IDENT rule. The only time an IDENT token will be created is when there's no more \n at the end. That is what's going wrong with your grammar: the error message states that it expects an IDENT token, which isn't found (a REST_OF_LINE token is produced instead).
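The two lexer rules above can be imitated in a few lines of Python to show why REST_OF_LINE wins. This is a simplified model, not ANTLR's real algorithm; the RULES table and the function next_token are made up for illustration, with COLON standing in for the implicit ':' token.

```python
import re

# Lexer rules in definition order (mirrors the grammar in the question).
RULES = [
    ("COLON", re.compile(r":")),
    ("IDENT", re.compile(r"[a-z]+")),
    ("REST_OF_LINE", re.compile(r"[^\n]*\n")),
]

def next_token(text, pos=0):
    """Pick the rule with the longest match; ties go to the earlier rule."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strictly longer only, so an earlier rule keeps a tie.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token("title: hello world\n"))
print(next_token("title"))
```

With the trailing newline present, IDENT can match only the five characters of title while REST_OF_LINE matches the whole line, so the parser never receives an IDENT token.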
Regarding "I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file":
You can't just define tokens (lexer rules) you want to apply to the header of a file. These tokens will also apply to the rest of the more complex file. Perhaps you should pre-process the header separately from the rest of the file?
ANTLR parsing is usually done in two steps:
1. define your tokens (lexer rules)
2. define your grammar (parser rules)
Pseudocode (it has been a few years since I played with ANTLR), lexer rules:
WORD : 'a'..'z'+ ;
SEPARATOR : ':';
SPACE : ' ';
Pseudocode, parser rule:
header: WORD SEPARATOR SPACE* WORD (SPACE WORD)* ;
Hope that helps....

How can I construct a clean, Python like grammar in ANTLR?

G'day!
How can I construct a simple ANTLR grammar handling multi-line expressions without the need for either semicolons or backslashes?
I'm trying to write a simple DSL for expressions:
# sh style comments
ThisValue = 1
ThatValue = ThisValue * 2
ThisOtherValue = (1 + 2 + ThisValue * ThatValue)
YetAnotherValue = MAX(ThisOtherValue, ThatValue)
Overall, I want my application to provide the script with some initial named values and pull out the final result. I'm getting hung up on the syntax, however. I'd like to support multiple line expressions like the following:
# Note: no backslashes required to continue expression, as we're in brackets
# Note: no semicolon required at end of expression, either
ThisValueWithAReallyLongName = (ThisOtherValueWithASimilarlyLongName
+AnotherValueWithAGratuitouslyLongName)
I started off with an ANTLR grammar like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL!?
;
empty_line
: NL;
assignment
: ID '=' expr
;
// ... and so on
It seems simple, but I'm already in trouble with the newlines:
warning(200): StackOverflowQuestion.g:11:20: Decision can match input such as "NL" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Graphically, in org.antlr.works.IDE:
Decision Can Match NL Using Multiple Alternatives http://img.skitch.com/20090723-ghpss46833si9f9ebk48x28b82.png
I've kicked the grammar around, but always end up with violations of expected behavior:
A newline is not required at the end of the file
Empty lines are acceptable
Everything in a line from a pound sign onward is discarded as a comment
Assignments end with end-of-line, not semicolons
Expressions can span multiple lines if wrapped in brackets
I can find example ANTLR grammars with many of these characteristics. I find that when I cut them down to limit their expressiveness to just what I need, I end up breaking something. Others are too simple, and I break them as I add expressiveness.
Which angle should I take with this grammar? Can you point to any examples that aren't either trivial or full Turing-complete languages?
I would let your tokenizer do the heavy lifting rather than mixing your newline rules into your grammar:
Count parentheses, brackets, and braces, and don't generate NL tokens while there are unclosed groups. That'll give you line continuations for free without your grammar being any the wiser.
Always generate an NL token at the end of file whether or not the last line ends with a '\n' character; then you don't have to worry about the special case of a statement without an NL. Statements always end with an NL.
The second point would let you simplify your grammar to something like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL
;
empty_line
: NL
;
assignment
: ID '=' expr
;
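The two tokenizer rules suggested in this answer (suppress NL inside unclosed brackets, and always emit an NL at end of file) could be prototyped like this in Python. It is a sketch under assumptions: the token names ID, NUM and OP and the function tokenize are invented for the example, and only round parentheses are counted.

```python
import re

TOKEN = re.compile(r"""
    (?P<COMMENT>\#[^\n]*)
  | (?P<NL>\n)
  | (?P<WS>[ \t\r]+)
  | (?P<ID>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<NUM>[0-9]+)
  | (?P<OP>[-+*/=,()])
""", re.VERBOSE)

def tokenize(text):
    tokens = []
    depth = 0  # number of currently unclosed '('
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r}")
        kind, value = m.lastgroup, m.group()
        pos = m.end()
        if kind in ("WS", "COMMENT"):
            continue
        if kind == "NL":
            if depth == 0:          # inside brackets a line break is ignored
                tokens.append(("NL", "\n"))
            continue
        if value == "(":
            depth += 1
        elif value == ")":
            depth = max(0, depth - 1)
        tokens.append((kind, value))
    # Always end with NL, so the last statement needs no special case.
    tokens.append(("NL", "\n"))
    return tokens

print([kind for kind, _ in tokenize("A = (1 +\n 2)")])
```

If the input already ends with a newline, the extra NL simply shows up as an empty_line in the grammar above.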
How about this?
exprlist
: (expr)? (NL+ expr)* NL!? EOF!
;
expr
: assignment | ...
;
assignment
: ID '=' expr
;
I assume you chose to make NL optional, because the last statement in your input code doesn't have to end with a newline.
While it makes a lot of sense, you are making life a lot harder for your parser. Separator tokens (like NL) should be cherished, as they disambiguate and reduce the chance of conflicts.
In your case, the parser doesn't know if it should parse "assignment NL" or "assignment empty_line". There are many ways to solve it, but most of them are just band-aids for an unwise design choice.
My recommendation is an innocent hack: Make NL mandatory, and always append NL to the end of your input stream!
It may seem a little unsavory, but in reality it will save you a lot of future headaches.