antlr4 multiline string parsing - antlr

If I have a ONELINE_STRING fragment rule in an antlr4 lexer that identifies a simple quoted string on one line, how can I create a more general STRING rule in the lexer that will concatenate adjacent ONELINE_STRING's (ie, separated only by whitespace and/or comments) as long as they each start on a different line?
ie,
"foo" "bar"
would be parsed as two STRING tokens, "foo" followed by "bar"
while:
"foo"
"bar"
would be seen as one STRING token: "foobar"
For clarification: The idea is that while I generally want the parser to be able to recognize adjacent strings as separate, and whitespace and comments to be ignored by the parser, I want to use the idea that if the last non-whitespace sub-token on a line was a string, and the first sub-token on the next line that is not all whitespace is also a string, then the separate strings should be concatenated into one long string as a means of specifying potentially very long strings without having to put the whole thing on one line. This is very straightforward if I were wanting all adjacent string sub-tokens to be concatenated, as they are in C... but for my purposes, I only want concatenation to occur when the string sub-tokens start on different lines. This concatenation should be invisible to any rule in the parser that might use a string. This is why I was thinking it might be better to situate the rule inside the lexer instead of the parser, but I'm not wholly opposed to doing this in the parser, and all the parsing rules which might have referred to a STRING token would instead refer to the parser string rule whenever they want a string.
Sample1:
"desc" "this sample will parse as two strings.
Sample3 (note, 'output' is a keyword in the language):
output "this is a very long line that I've explicitly made so that it does not "
"easily fit on just one line, so it gets split up into separate ones for "
"ease of reading, but the parser should see it all as one long string. "
"This example will parse as if the output command had been followed by "
"only a single string, even though it is composed of multiple string "
"fragments, all of which should be invisible to the parser.%n";
Both of these examples should be accepted as valid by the parser. The former is an example of a declaration, while the latter is an example of an imperative statement in the language.
Addendum:
I had originally been thinking that this would need to be done in the lexer because although newlines are supposed to be ignored by the parser, like all other whitespace, a multiline string is actually sensitive to the presence of newlines I did not think that the parser could perceive that.
However, I have been thinking that it may be possible to have the ONELINE_STRING as a lexer rule, and have a general 'string' parser rule which detects adjacent ONELINE_STRINGS, using a predicate between strings to detect if the next ONELINE_STRING token is starting on a different line than the previous one, and if so, it should invisibly concatenate them so that its text is indistinguishable from a string that had been specified all on one line. I am unsure of the logistics of how this would be implemented, however.
Okay, I have it.
I need to have the string recognizer in the parser, as some of you have suggested. The trick is to use lexer modes in the lexer.
So in the Lexer file I have this:
BEGIN_STRING : '"' -> pushMode(StringMode);
mode StringMode;
END_STRING: '"'-> popMode;
STRING_LITERAL_TEXT : ~[\r\n%"];
STRING_LITERAL_ESCAPE_QUOTE : '%"' { setText("\""); };
STRING_LITERAL_ESCAPE_PERCENT: '%%' { setText("%"); };
STRING_LITERAL_ESCAPE_NEWLINE : '%n'{ setText("\n"); };
UNTERMINATED_STRING: { _input.LA(1) == '\n' || _input.LA(1) == '\r' || _input.LA(1) == EOF}? -> popMode;
And in the parser file I have this:
string returns [String text] locals [int line] : a=stringLiteral { $line = $a.line; $text=$a.text;}
({_input.LT(1)!=null && _input.LT(1).getLine()>$line}?
a=stringLiteral { $line = $a.line; $text+=$a.text; })*
;
stringLiteral returns [int line, String text]: BEGIN_STRING {$text = "";}
(a=(STRING_LITERAL_TEXT
| STRING_LITERAL_ESCAPE_NEWLINE
| STRING_LITERAL_ESCAPE_QUOTE
| STRING_LITERAL_ESCAPE_PERCENT
) {$text+=$a.text;} )*
stringEnd { $line = $BEGIN_STRING.line; }
;
stringEnd: END_STRING #string_finish
| UNTERMINATED_STRING #string_hang
;
The string rule thus concatenates adjacent string literals as long as they are on different lines. The stringEnd rule needs an event handler for when a string literal is not terminated correctly so that the parser can report a syntax error, but the string is otherwise treated as if it had been closed correctly.

EDIT: Sorry, have not read your requirements fully. The following approach would match both examples not only the desired one. Have to think about it...
The simplest way would be to do this in the parser. And I see no point that would require this to be done in the lexer.
multiString : singleString +;
singleString : ONELINE_STRING;
ONELINE_STRING: ...; // no fragment!
WS : ... -> skip;
Comment : ... -> skip;

As already mentioned, the (IMO) better way would be to handle this inside the parser. But here's a way to handle it in the lexer:
STRING
: SINGLE_STRING ( LINE_CONTINUATION SINGLE_STRING )*
;
HIDDEN
: ( SPACE | LINE_BREAK | COMMENT ) -> channel(HIDDEN)
;
fragment SINGLE_STRING
: '"' ~'"'* '"'
;
fragment LINE_CONTINUATION
: ( SPACE | COMMENT )* LINE_BREAK ( SPACE | COMMENT )*
;
fragment SPACE
: [ \t]
;
fragment LINE_BREAK
: [\r\n]
| '\r\n'
;
fragment COMMENT
: '//' ~[\r\n]+
;
Tokenizing the input:
"a" "b"
"c"
"d"
"e"
"f"
would create the following 5 tokens:
"a"
"b"
"c"\n"d"
"e"
"f"
However, if the token would include a comment:
"c" // comment
"d"
then you'd need to strip this "// comment" from the token yourself at a later stage. The lexer will not be able to put this substring on a different channel, or skip it.

Related

Handle strings starting with whitespaces

I'm trying to create an ANTLR v4 grammar with the following set of rules:
1.In case a line starts with #, it is considered a label:
#label
2.In case the line starts with cmd, it is treated as a command
cmd param1 param2
3.If a line starts with a whitespace, it is considered a string. All the text should be extracted. Strings can be multiline, so they end with an empty line
A long string with multiline support
and any special characters one can imagine.
<-empty line here->
4.Lastly, in case a line starts with anything but whitespace, # and cmd, it's first word should be considered a heading.
Heading A long string with multiline support
and any special characters one can imagine.
<-empty line here->
It was easy to handle lables and commands. But I am clueless about strings and headings.
What is the best way to separate whitespace word whitespace whatever doubleNewline and whatever doubleNewline? I've seen a lot of samples with whitespaces, but none of them works with both random text and newlines. I don't expect you to write actual code for me. Suggesting an approach will do.
Something like this should do the trick:
lexer grammar DemoLexer;
LABEL
: '#' [a-zA-Z]+
;
CMD
: 'cmd' ~[\r\n]+
;
STRING
: ' ' .*? NL NL
;
HEADING
: ( ~[# \t\r\nc] | 'c' ~'m' | 'cm' ~'d' ).*? NL NL
;
SPACE
: [ \t\r\n] -> skip
;
OTHER
: .
;
fragment NL
: '\r'? '\n'
| '\r'
;
This does not mandate the "beginning of the line" requirement. If that is something you want, you'll have to add semantic predicates to your grammar, which ties it to a target language. For Java, that would look like this:
LABEL
: {getCharPositionInLine() == 0}? '#' [a-zA-Z]+
;
See:
Semantic predicates in ANTLR4?
https://github.com/antlr/antlr4/blob/master/doc/predicates.md

Parse string antlr

I have strings as a parser rule rather than lexer because strings may contain escapes with expressions in them, such as "The variable is \(variable)".
string
: '"' character* '"'
;
character
: escapeSequence
| .
;
escapeSequence
: '\(' expression ')'
;
IDENTIFIER
: [a-zA-Z][a-zA-Z0-9]*
;
WHITESPACE
: [ \r\t,] -> skip
;
This doesn't work because . matches any token rather than any character, so many identifiers will be matched and whitespace will be completely ignored.
How can I parse strings that can have expressions inside of them?
Looking into the parser for Swift and Javascript, both languages that support things like this, I can't figure out how they work. From what I can tell, they just output a string such as "my string with (variables) in it" without actually being able to parse the variable as its own thing.
This problem can be approached using lexical modes by having one mode for the inside of strings and one (or more) for the outside. Seeing a " on the outside would switch to the inside mode and seeing a \( or " would switch back outside. The only complicated part would be seeing a ) on the outside: Sometimes it should switch back to the inside (because it corresponds to a \() and some times it shouldn't (when it corresponds to a plain ().
The most basic way to achieve this would be like this:
Lexer:
lexer grammar StringLexer;
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* ;
DQUOTE: '"' -> pushMode(IN_STRING);
LPAR: '(' -> pushMode(DEFAULT_MODE);
RPAR: ')' -> popMode;
mode IN_STRING;
TEXT: ~[\\"]+ ;
BACKSLASH_PAREN: '\\(' -> pushMode(DEFAULT_MODE);
ESCAPE_SEQUENCE: '\\' . ;
DQUOTE_IN_STRING: '"' -> type(DQUOTE), popMode;
Parser:
parser grammar StringParser;
options {
tokenVocab = 'StringLexer';
}
start: exp EOF ;
exp : '(' exp ')'
| IDENTIFIER
| DQUOTE stringContents* DQUOTE
;
stringContents : TEXT
| ESCAPE_SEQUENCE
| '\\(' exp ')'
;
Here we push the default mode every time we see a ( or \( and pop the mode every time we see a ). This way it will go back inside the string only if the mode on top of the stack is the string mode, which would only be the case if there aren't any unclosed ( left since the last \(.
This approach works, but has the downside that an unmatched ) will cause an empty stack exception rather than a normal syntax error because we're calling popMode on an empty stack.
To avoid this, we can add a member that tracks how deeply nested we are inside parentheses and doesn't pop the stack when the nesting level is 0 (i.e. if the stack is empty):
#members {
int nesting = 0;
}
LPAR: '(' {
nesting++;
pushMode(DEFAULT_MODE);
};
RPAR: ')' {
if (nesting > 0) {
nesting--;
popMode();
}
};
mode IN_STRING;
BACKSLASH_PAREN: '\\(' {
nesting++;
pushMode(DEFAULT_MODE);
};
(The parts I left out are the same as in the previous version).
This works and produces normal syntax errors for unmatched )s. However, it contains actions and is thus no longer language-agnostic, which is only a problem if you plan to use the grammar from multiple languages (and depending on the language, you might even be lucky and the code might be valid in all of your targeted languages).
If you want to avoid actions, the last alternative would be to have three modes: One for code that's outside of any strings, one for the inside of the string and one for the inside of \(). The third one will be almost identical to the outer one, except that it will push and pop the mode when seeing parentheses, whereas the outer one will not. To make both modes produce the same types of tokens, the rules in the third mode will all call type(). This will look like this:
lexer grammar StringLexer;
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* ;
DQUOTE: '"' -> pushMode(IN_STRING);
LPAR: '(';
RPAR: ')';
mode IN_STRING;
TEXT: ~[\\"]+ ;
BACKSLASH_PAREN: '\\(' -> pushMode(EMBEDDED);
ESCAPE_SEQUENCE: '\\' . ;
DQUOTE_IN_STRING: '"' -> type(DQUOTE), popMode;
mode EMBEDDED;
E_IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* -> type(IDENTIFIER);
E_DQUOTE: '"' -> pushMode(IN_STRING), type(DQUOTE);
E_LPAR: '(' -> type(LPAR), pushMode(EMBEDDED);
E_RPAR: ')' -> type(RPAR), popMode;
Note that we now can no longer use string literals in the parser grammar because string literals can't be used when multiple lexer rules are defined using the same string literal. So now we have to use LPAR instead of '(' in the parser and so on (we already had to do this for DQUOTE for the same reason).
Since this version involves a lot of duplication (especially as the amount of tokens rises) and prevents the use of string literals in the parser grammar, I generally prefer the version with the actions.
The full code for all three alternatives can also be found on GitHub.

What is the ANTLR4 equivalent of a ! in a lexer rule?

I'm working on converting an old ANTLR 2 grammar to ANTLR 4, and I'm having trouble with the string rule.
STRING :
'\''!
(
~('\'' | '\\' | '\r' | '\n')
)*
'\''!
;
This creates a STRING token whose text contains the contents of the string, but does not contain the starting and ending quotes, because of the ! symbol after the quote literals.
ANTLR 4 chokes on the ! symbol, ('!' came as a complete surprise to me (AC0050)) but if I leave it off, I end up with tokens that contain the quotes, which is not what I want. What's the correct way to port this to ANTLR 4?
Antlr4 generally treats tokens as being immutable, at least in the sense that there is no support for a language neutral equivalent of !.
Perhaps the simplest way to accomplish the equivalent is:
string : str=STRING { Strings.unquote($str); } ;
STRING : SQuote ~[\r\n\\']* SQuote ;
fragment SQuote : '\'' ;
where Strings.unquote is:
public static void unquote(Token token) {
CommonToken ct = (CommonToken) token;
String text = ct.getText();
text = .... unquote it ....
ct.setText(text);
}
The reason for using a parser rule is because attribute references are not (currently) supported in the lexer. Still, it could be done on the lexer rule - just would require a slight bit more effort to dig to the token.
An alternative to modifying the token text is to implement a custom token with custom fields and methods. See this answer if of interest.
I believe in ANTLR4 your problem can be solved using lexical modes and lexer commands.
Here is an example from there that I think does exactly what you need (although for double quotes but it's an easy fix):
lexer grammar Strings;
LQUOTE : '"' -> more, mode(STR) ;
WS : [ \r\t\n]+ -> skip ;
mode STR;
STRING : '"' -> mode(DEFAULT_MODE) ; // token we want parser to see
TEXT : . -> more ; // collect more text for string

How can I make lexer backtrack based on predicate in Antlr 3 (or 4)?

I have the following (simplified) problem in Antlr3.
I have a set of lexer rules for a grammar with special strings and regular strings. Both are single-quoted. Special strings conform to a pattern (for example, suppose they only contain letters). There's also a function that determines if a matched a string is special. Whitespace is ignored.
Now suppose for example that isSpecial returns true only for string "foo". If I'm looking at "'foo' 'bar' '123'", the lexer should produce 1 special string token for foo, and then 2 regular string tokens.
I am trying to do something like this:
SpecialLiteral : { isSpecial(getText()) }?=> '\'' (Letter)+ '\'' ;
StringLiteral : '\'' ( ~('\') )* '\'' ;
This doesn't work as getText() is not yet populated in gating predicate, so empty string is passed to isSpecial, and "'foo'" becomes StringLiteral.
If I change gating predicate to validating predicate, I have the correct getText in isSpecial, but when it returns false (for "'bar'", for example), lexer doesn't backtrack and make "'bar'" a StringLiteral, it just fails.
Is it possible to solve this problem? If I were to just edit the lexer manually, it seems like it should be doable. We know the position before mSpecialLiteral, so we can just go back there and then act like we do after gating predicate.
How to do it using the grammar/what is to be passed to isSpecial?
You could match both in the same rule, and then check after the literal has matched which type it is, and change the type to SpecialLiteral when isSpecial(getText()) is true:
grammar T;
...
tokens {
SpecialLiteral;
}
...
StringLiteral
: '\'' ( ~( '\'' ) )* '\''
{
if(isSpecial(getText())) {
$type = SpecialLiteral;
}
}
;

Antlr greedy-option

(I edited my question based on the first comment of #Bart Kiers - thank you!)
I have the following grammar:
SPACE : (' '|'\t'|'\n'|'\r')+ {$channel = HIDDEN;};
START : 'START:';
STRING_LITERAL : ('"' .* '"')+;
rule : START STRING_LITERAL;
and I want to parse languages like: 'START: "abcd" START: "img src="test.jpg""' (string literals could be inside string literals).
The grammar defined above does not work if there are string literals inside a string literal because for the language 'START: "img src="test.jpg""' the lexer translates it into the following tokens: START('START:') STRING_LITERAL("img src=") test.jpg.
Is there any way to define a grammar which is fine for my problem?
There are a couple of things wrong here:
you cannot use fragment rules inside parser rules. You grammar will never create a START token;
a . char (DOT-char) inside a parser rule matches any token, while inside a lexer rule, it matches any character;
if you let .* match greedily (and you had defined a proper lexer rule that matches a string literal), the input START: "abcd" START: "img src="test.jpg"" would then have one large string in it: "abcd" START: "img src="test.jpg"" (the first and the last quote would be matched).
So, you cannot embed string literals inside string literals using the same quotes. The lexer is not able to determine if a quote is meant to close the string, or if it's the start of a (new) embedded string. You will need to change that in your grammar.