How to parse a dynamic delimiter using ANTLR?

I am trying to parse the following command from Cisco IOS config:
banner exec <d> <message> <d>
where <d> is the delimiting character of user's choice—a pound sign (#), for example. The <message> cannot use the delimiting character in it.
It seems that I will need semantic predicates for this, but I couldn't figure out how to use them here.
Yang

As long as you know the delimiter in advance, you can use something like this. You can modify isDelimiter to support any single-character delimiter.
@lexer::members {
private boolean isDelimiter(int c) { return c == '#'; }
}
Message : Delimiter NotDelimiter* Delimiter;
fragment Delimiter : {isDelimiter(_input.LA(1))}? . ;
fragment NotDelimiter : {!isDelimiter(_input.LA(1))}? . ;
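
When the delimiter is not known in advance, the scanner has to learn it from the input: the first character after the keyword becomes the delimiter, and the message runs up to its next occurrence. The sketch below shows only that logic in plain Java, outside ANTLR; the class name BannerExec and the method parseMessage are made up for illustration, and it assumes the whole command is available as one string. Inside an ANTLR lexer, the same idea would mean storing the first character in a member field and comparing _input.LA(1) against it in the predicate.

```java
import java.util.Optional;

/** Hypothetical helper: extracts the message from "banner exec <d> <message> <d>". */
final class BannerExec {
    private static final String PREFIX = "banner exec";

    /** Returns the message, or empty if the text is not a well-formed banner command. */
    static Optional<String> parseMessage(String line) {
        if (!line.startsWith(PREFIX)) {
            return Optional.empty();
        }
        int i = PREFIX.length();
        int afterKeyword = i;
        // Skip the whitespace that separates the keyword from the delimiter.
        while (i < line.length() && Character.isWhitespace(line.charAt(i))) {
            i++;
        }
        if (i == afterKeyword || i >= line.length()) {
            return Optional.empty(); // no separating whitespace, or no delimiter at all
        }
        char delimiter = line.charAt(i);            // chosen by the user, learned at run time
        int close = line.indexOf(delimiter, i + 1); // the first repeat ends the message
        if (close < 0) {
            return Optional.empty();                // unterminated banner
        }
        return Optional.of(line.substring(i + 1, close));
    }

    public static void main(String[] args) {
        System.out.println(parseMessage("banner exec #Authorized users only#"));
        System.out.println(parseMessage("banner exec ^Lab router #3^"));
    }
}
```

Because the message ends at the first repeat of the delimiter, the rule that the message cannot contain the delimiter is enforced automatically.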

Related

ANTLR how to create a "rest of it" token

I have a really simple DSL defined in ANTLR like this.
grammar Transformer;
fragment Digit : [0-9];
Amp:'\'';
Left:'(';
Right: ')';
Comma: ',';
Id: [A-Za-z][a-zA-Z0-9]+;
Int: '-'? Digit+;
WS: [\n\r\t]+ ->skip;
FuncStart: '>';
DataStart: '#';
parse: (datainput | function)+;
qoutedtext: Amp .*? Amp;
datainput: DataStart Id;
function: FuncStart Id Left param (Comma param)* Right;
param: (datainput|function|qoutedtext|Int);
When parsing this text
#Id;>ToUpper(#Name);ThisShouldEndUpAsAToken>FillLeft(#EmpNo,20,'abc')
This is the "tree" I get:
The tree looks as expected, except that I am not able to catch the ThisShouldEndUpAsAToken text as a token.
I know that I do not have any rule in the grammar that would do that now, but I'm not able to figure out how to write one.
HEEELP :)
How about changing your parse rule like this:
parse: (datainput | function | Id)+;
(Your test input is sprinkled with ; that shouldn't parse. Are you sure that's the input you're parsing?)

Grammar to negate two like characters in a lexer rule inside a single quoted string

ANTLR 4:
I need to support a single quoted string literal with escaped characters AND the ability to use double curly braces as an 'escape sequence' that will need additional parsing. So both of these examples need to be supported. I'm not so worried about the second example because that seems trivial if I can get the first to work and not match double curly brace characters.
1. 'this is a string literal with an escaped\' character'
2. 'this is a string {{functionName(x)}} literal with double curlies'
StringLiteral
: '\'' (ESC | AnyExceptDblCurlies)*? '\'' ;
fragment
ESC : '\\' [btnr\'\\];
fragment
AnyExceptDblCurlies
: '{' ~'{'
| ~'{' .;
I've done a lot of research on this and understand that you can't negate multiple characters, and have even seen a similar approach work in Bart's answer in this post...
Negating inside lexer- and parser rules
But what I'm seeing is that in example 1 above, the escaped single quote is not being recognized and I get a parser error that it cannot match ' character'.
If I alter the string literal token rule to the following, it works...
StringLiteral
: '\'' (ESC | .)*? '\'' ;
Any ideas how to handle this scenario better? I can deduce that the escaped character is getting matched by AnyExceptDblCurlies instead of ESC, but I'm not sure how to solve this problem.
To parse the template definition out of the string pretty much requires handling in the parser. Use lexer modes to distinguish between string characters and the template name.
Parser:
options {
tokenVocab = TesterLexer ;
}
test : string EOF ;
string : STRBEG ( SCHAR | template )* STREND ; // allow multiple templates per string
template : TMPLBEG TMPLNAME TMPLEND ;
Lexer:
STRBEG : Squote -> pushMode(strMode) ;
mode strMode ;
STRESQ : Esqote -> type(SCHAR) ; // predeclare SCHAR in tokens block
STREND : Squote -> popMode ;
TMPLBEG : DBrOpen -> pushMode(tmplMode) ;
STRCHAR : . -> type(SCHAR) ;
mode tmplMode ;
TMPLEND : DBrClose -> popMode ;
TMPLNAME : ~'}'+ ; // '+' rather than '*' so the token can never match the empty string
fragment Squote : '\'' ;
fragment Esqote : '\\\'' ;
fragment DBrOpen : '{{' ;
fragment DBrClose : '}}' ;
Updated to correct the TMPLNAME rule, add main rule and options block.
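
To make the mode switching easier to follow, here is a hand-written scanner in plain Java that mimics the three modes. It is only a sketch of the idea, not ANTLR output: the class ModeScanner and its token labels are invented for illustration, it handles a single quoted literal, and it uses fixed mode transitions where the grammar uses pushMode/popMode.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical scanner that mirrors the default, strMode and tmplMode lexer modes. */
final class ModeScanner {
    private enum Mode { DEFAULT, STR, TMPL }

    /** Tokenizes one quoted literal; labels follow the lexer rule names. */
    static List<String> scan(String in) {
        List<String> out = new ArrayList<>();
        Mode mode = Mode.DEFAULT;
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            switch (mode) {
                case DEFAULT:
                    if (c != '\'') {
                        throw new IllegalArgumentException("expected ' at index " + i);
                    }
                    out.add("STRBEG"); mode = Mode.STR; i++;
                    break;
                case STR:
                    // Two-character matches are tried first, like longest-match in a lexer.
                    if (in.startsWith("\\'", i)) {
                        out.add("SCHAR(\\')"); i += 2;          // escaped quote stays in the string
                    } else if (in.startsWith("{{", i)) {
                        out.add("TMPLBEG"); mode = Mode.TMPL; i += 2;
                    } else if (c == '\'') {
                        out.add("STREND"); mode = Mode.DEFAULT; i++;
                    } else {
                        out.add("SCHAR(" + c + ")"); i++;
                    }
                    break;
                case TMPL:
                    if (in.startsWith("}}", i)) {
                        out.add("TMPLEND"); mode = Mode.STR; i += 2;
                    } else {
                        int end = i;
                        while (end < in.length() && in.charAt(end) != '}') {
                            end++;
                        }
                        if (end == i) {
                            throw new IllegalArgumentException("stray } at index " + i);
                        }
                        out.add("TMPLNAME(" + in.substring(i, end) + ")");
                        i = end;
                    }
                    break;
            }
        }
        if (mode != Mode.DEFAULT) {
            throw new IllegalStateException("unterminated literal");
        }
        return out;
    }
}
```

Calling ModeScanner.scan on the second example string yields a TMPLBEG, TMPLNAME, TMPLEND run in the middle of the SCHAR tokens, which is the shape the template parser rule expects.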

Antlr4: How can I match end of lines inside multiline comments?

I have to create a program that counts lines of code ignoring those inside a comment. I'm a newbie working with Antlr, and after trying a lot, the nearest I came to a solution is this erroneous grammar:
grammar Comments;
comment : startc content endc;
startc : '/*';
endc : '*/';
content : newline | contenttext;
contenttext : CONTENTCHARS+;
newline : '\r\n';
CONTENTCHARS
: ~'*' '/'
| ~'/' .
;
WS : [ \r\t]+ -> skip;
If I try with /*hello\r\nworld*/ the parser recognizes this, which is erroneous:
In order to count lines, the parser needs to detect newline characters, inside and outside multiline comments. I think my problem is that I don't know how to say "match everything inside /* and */ except \r\n".
Please, can you point me in the right direction? Any help will be appreciated.
Solution
Let's simplify your grammar! We will skip whitespace and comments at the lexer stage, and with the comments the unwanted newlines inside them. The COMMENT rule matches a /* ... */ comment, whether it sits on one line or spans several, and simply skips it.
Next, we introduce a counter variable for counting NEWLINE tokens. Because a skipped COMMENT takes the newlines inside it along, only newlines outside comments become NEWLINE tokens, and those are the ones the content rule sees.
Whenever we encounter a NEWLINE token, we increment the counter.
grammar Comments;
@lexer::members {
int counter = 0;
}
WS : [ \r\t]+ -> skip;
COMMENT : '/*' .*? '*/' NEWLINE? -> skip;
TEXT : [a-zA-Z0-9]+;
NEWLINE : '\r'? '\n' { System.out.println("Newlines so far: " + (++counter)); } ;
content: (TEXT | COMMENT | NEWLINE )* EOF;
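
For comparison, the same counting rule can be written as a small hand-rolled scanner. This is a plain-Java sketch, not generated by ANTLR; the class LineCounter and its method are hypothetical names. It follows the grammar above in two respects: newlines inside /* ... */ are not counted, and a newline that directly follows the closing */ is swallowed together with the comment (the NEWLINE? part of the COMMENT rule).

```java
/** Hypothetical helper: counts newlines that lie outside block comments. */
final class LineCounter {

    static int countNewlinesOutsideComments(String src) {
        int count = 0;
        boolean inComment = false;
        int i = 0;
        while (i < src.length()) {
            if (!inComment && src.startsWith("/*", i)) {
                inComment = true;
                i += 2;
            } else if (inComment && src.startsWith("*/", i)) {
                inComment = false;
                i += 2;
                // Like COMMENT's optional NEWLINE: drop one line break right after the comment.
                if (src.startsWith("\r\n", i)) {
                    i += 2;
                } else if (src.startsWith("\n", i)) {
                    i += 1;
                }
            } else {
                if (!inComment && src.charAt(i) == '\n') {
                    count++; // a newline in ordinary code
                }
                i++;
            }
        }
        return count;
    }
}
```

Whether the newline right after a comment should count depends on how you define a line of code; removing the inner if/else makes the scanner count it.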

ANTLR4: lexer rule for: Any string as long as it doesn't contain these two side-by-side characters?

Is there any way to express this in ANTLR4:
Any string as long as it doesn't contain the asterisk immediately
followed by a forward slash?
This doesn't work: (~'*/')* as ANTLR throws this error: multi-character literals are not allowed in lexer sets: '*/'
This works but isn't correct: (~[*/])* as it prohibits a string containing the individual character * or /.
I had a similar problem; my solution was: ( ~'*' | ( '*'+ ~[/*]) )* '*'*.
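
The lexer expression above translates almost symbol for symbol into a java.util.regex pattern, which makes it easy to try out on sample strings. The sketch below is a plain-Java check, not ANTLR; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: tests whether a string is free of the sequence "*" followed by "/". */
final class NoStarSlash {
    // Same shape as the lexer rule: ( ~'*' | '*'+ ~[/*] )* '*'*
    //   [^*]        any character except '*'
    //   \*+[^/*]    a run of '*' followed by something that is neither '/' nor '*'
    //   \**         trailing stars at the very end
    private static final Pattern NO_STAR_SLASH =
            Pattern.compile("(?:[^*]|\\*+[^/*])*\\**");

    static boolean isFreeOfStarSlash(String s) {
        return NO_STAR_SLASH.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isFreeOfStarSlash("a*b/c"));
        System.out.println(isFreeOfStarSlash("a*/b"));
    }
}
```

In application code, !s.contains("*/") does the same job more simply; the regex is useful mainly for checking that the lexer expression accepts and rejects what you expect.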
The closest I can come is to put the test in the parser instead of the lexer. That's not exactly what you're asking for, but it does work.
The trick is to use a semantic predicate before any string that must be tested for any Evil Characters. The actual testing is done in Java.
grammar myTest;
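@header
{
import java.util.*;
}
@parser::members
{
    // Despite its name, this returns true when the input is clean (contains no "*/"),
    // so the predicates below pass only for acceptable strings.
    boolean hasEvilCharacters(String input)
    {
        return !input.contains("*/");
    }
}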
// Mimics a very simple sentence, such as:
// I am clean.
// I have evil char*/acters.
myTest
: { hasEvilCharacters(_input.LT(1).getText()) }? String
(Space { hasEvilCharacters(_input.LT(1).getText()) }? String)*
Period EOF
;
String
: ('A'..'Z' | 'a'..'z')+
;
Space
: ' '
;
Period
: '.'
;
Tested with ANTLR 4.4 via the TestRig in ANTLRWorks 2 in NetBeans 8.0.1.
If the disallowed sequences are few, there is a solution that needs no parser or lexer actions:
grammar NotParser;
program
: (starslash | notstarslash)+
;
notstarslash
: NOT_STAR_SLASH
;
starslash
: STAR_SLASH
;
STAR_SLASH
: '*'+ '/'
;
NOT_STAR_SLASH
: (F_NOT_STAR_SLASH | F_STAR_NOT_SLASH) +
;
fragment F_NOT_STAR_SLASH
: ~('*'|'/')
;
fragment F_STAR_NOT_SLASH
: '*'+ ~('*'|'/')
| '*'+ EOF
| '/'
;
The idea is to compose the token from:
1. all characters that are neither '*' nor '/';
2. runs of '*' that are not followed by '/', and any single '/'.
The remaining rules deal with special situations (multiple '*' followed by '/', or trailing '*').

Special character handling in ANTLR lexer

I wrote the following grammar for a string variable declaration. A string is anything between single quotes, but there must be a way to include a single quote in the string value by escaping it with the $ character.
grammar test;
options
{
language = Java;
}
tokens
{
VAR = 'VAR';
END_VAR = 'END_VAR';
}
var_declaration: VAR string_type_declaration END_VAR EOF;
string_type_declaration: identifier ':=' string;
identifier: ID;
string: STRING_VALUE;
STRING_VALUE: '\'' ('$\''|.)* '\'';
ID: LETTER+;
WSFULL:(' ') {$channel=HIDDEN;};
fragment LETTER: (('a'..'z') | ('A'..'Z'));
This grammar doesn't work. If I run the var_declaration rule on this input:
VAR A :='$12.2' END_VAR
I get a MismatchedTokenException.
But this code works fine for string_type_declaration rule:
A :='$12.2'
Your STRING_VALUE isn't properly tokenized. Inside the loop ( ... )*, the $ expects a single quote after it, but the string in your input, '$12.2', doesn't have a quote after $. You should make the single quote optional ('$' '\''? | .)*. But now your alternative in the loop, the ., will also match a single quote: better let it match anything other than a single quote and $:
STRING_VALUE
: '\'' ( '$' '\''? | ~('$' | '\'') )* '\''
;
resulting in the following parse tree:
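
To sanity-check the corrected rule without regenerating the lexer, its shape can be written as a java.util.regex pattern. This is a plain-Java sketch with made-up names. The regex engine backtracks where a generated lexer may not, so the two can disagree on edge cases such as a $ immediately before the closing quote.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: regex counterpart of the corrected STRING_VALUE rule. */
final class DollarEscapedString {
    // '\'' ( '$' '\''? | ~('$' | '\'') )* '\''
    //   \$'?     a '$', optionally followed by the quote it escapes
    //   [^$']    any character other than '$' and the quote
    private static final Pattern STRING_VALUE =
            Pattern.compile("'(?:\\$'?|[^$'])*'");

    static boolean isStringValue(String s) {
        return STRING_VALUE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isStringValue("'$12.2'"));
        System.out.println(isStringValue("'it$'s'"));
    }
}
```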