Antlr linphone line 2:0 required (...)+ loop did not match anything at character 'V' - antlr

I make a simple copy of the .g file in linphone sip decode antlr file to make things clear.
My problem is when I use this file to decode sip's firstline it will failed sometimes.
The file is shown below:
grammar sipmessage;
options{
language = Java;
output = AST;
}
#lexer::header{
package org.meri.antlr_step_by_step.parsers;
import java.util.HashMap;
}
#parser::header{
package org.meri.antlr_step_by_step.parsers;
import java.util.HashMap;
}
status_line :
sip_version
sptab status_code {System.out.println($status_code.text);}
sptab reason_phrase {System.out.println($reason_phrase.text);}
CRLF
;
sip_version
: word;// 'SIP/' DIGIT '.' DIGIT;
reason_phrase
: ~(CRLF)*;
status_code
: extension_code;
word
:
(alphanum | mark | '%'
| PLUS | '`' |
LAQUOT | RAQUOT |
COLON | '\\' | DQUOTE | SLASH | '[' | ']' | '?' | '{' | '}'
)+
;
extension_code : DIGIT+;
alphanum : alpha | DIGIT ;
mark : '-' | '_' | '.' | '!' | '~' | STAR | '\'' | LPAREN | RPAREN ;
alpha : HEX_CHAR | COMMON_CHAR;
COMMON_CHAR : 'g'..'z' | 'G'..'Z' ;
HEX_CHAR: 'a'..'f' |'A'..'F';
RPAREN : ')';
LPAREN : '(';
STAR : '*';
DIGIT : '0'..'9' ;
PLUS: '+';
COLON : ':';
RAQUOT : '>';
LAQUOT : '<';
DQUOTE : '"';
SLASH : '/';
SP : ' ';
CRLF : '\r\n';
HTAB : ' ';
LWS : (SP* CRLF)? SP+ ; //linear whitespace
sptab : (SP|HTAB)+;
MyProblem is :when decode str eg:
"SIP/2.0 200 OK.\r\nVia: SIP/2.0/TCP 192.168.26.116:46448;alias;branch=z9hG4bK.TteIOuQeu;rport\r\n"
It can print firstline's status code and reason right.That's 200 and OK.
But if I add some spaces before the first \r\n before the Via that is,
"SIP/2.0 200 OK. \r\nVia: SIP/2.0/TCP 192.168.26.116:46448;alias;branch=z9hG4bK.TteIOuQeu;rport\r\n"
The result is so wrong.
the reason printed out will become "OK.ia: SIP/2.0/TCP 192.168.26.116:46448aliasbranchz9hG4bK.TteIOuQeurport"
And I got a warning:"line 2:0 required (...)+ loop did not match anything at character 'V'"
Can somebody tell me why ,I am not quite good at English and Antlr.
Thanks in advance!

Related

How to make antlr find invalid input throw exception

I have the following grammar:
grammar Expr;
expr : '-' expr # unaryOpExpr
| expr ('*'|'/'|'%') expr # mulDivModuloExpr
| expr ('+'|'-') expr # addSubExpr
| '(' expr ')' # nestedExpr
| IDENTIFIER '(' fnArgs? ')' # functionExpr
| IDENTIFIER # identifierExpr
| DOUBLE # doubleExpr
| LONG # longExpr
| STRING # string
;
fnArgs : expr (',' expr)* # functionArgs
;
IDENTIFIER : [_$a-zA-Z][_$a-zA-Z0-9]* | '"' (ESC | ~ ["\\])* '"';
LONG : [0-9]+;
DOUBLE : [0-9]+ '.' [0-9]*;
WS : [ \t\r\n]+ -> skip ;
STRING: '"' (~["\\\r\n] | ESC)* '"';
fragment ESC : '\\' (['"\\/bfnrt] | UNICODE) ;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
MINUS : '-' ;
MUL : '*' ;
DIV : '/' ;
MODULO : '%' ;
PLUS : '+' ;
// math function
MAX: 'MAX';
when I enter following text,It should be effective
-1.1
bug when i enter following text:
-1.1ffff
I think it should report an error, bug antlr didn't do it, antlr captures the previous "-1.1", discard "ffff",
but i want to change this behavior, didn't discard invalid token, but throw exception,report
detection invalid token.
So what should i do, Thanks for your advice
Are you using expr as your main rule? if so make another rule, call it something like parse or program and simply write it like this:
parse: expr EOF;
This will make antlr not ignore trailing tokens that don't make sense, and actually throw an error.

Antlr4 ignoring newlines at all but one point

I am writing a parser for a scripting language, and using antlr 4.5.3 for the purpose.
grammar VSE;
chunk
: block* EOF
;
block
: var '=' exp
| functioncall
;
var
: NAME
| var '[' exp ']'
| var '.' var
;
exp
: number
| string
| var
| functioncall
| <assoc=right> exp exp //concat
;
functioncall
: NAME '(' (exp)? (',' exp)* ')'
| var '.' functioncall
;
string
: '"' (~('"' | '\\' | '\r' | '\n') | '\\' ('"' | '\\'))* '"'
;
NAME
: [a-zA-Z_][a-zA-Z_0-9]*
;
number
: INT | HEX | FLOAT
;
INT
: Digit+
;
HEX
: '0' [xX] [0-9a-fA-F]+
;
FLOAT
: Digit* '.' Digit+
;
Digit
: [0-9]
;
WS
: [ \t\u000C\r\n]+ -> skip
;
However, while testing it, I found a variable assignment like var = something followed by some function call in next line leads to a concat statement. (My concat statement is a variable followed by another like var = var1 var2) I understand that antlr is skipping ALL the new lines in favor of line continuation, but I'd like to add the condition that if there is a new line between two exps, it would treat them as two separate blocks instead of a concat statement. i.e.
var = var2
functioncall(var)
These should be two separate blocks instead of concat statement.
Is there any way to do this?
Does the following rule suitable for you?
block
: var '=' exp NEW_LINE
| functioncall NEW_LINE
;
NEW_LINE: '\r'? '\n'
WS
: [ \t]+ -> skip
;
In another case you should use Semantic Predicates or very unclear grammar.

Precedence in Antlr using parentheses

We are developing a DSL, and we're facing some problems:
Problem 1:
In our DSL, it's allowed to do this:
A + B + C
but not this:
A + B - C
If the user needs to use two or more different operators, he'll need to insert parentheses:
A + (B - C) or (A + B) - C.
Problem 2:
In our DSL, the most precedent operator must be surrounded by parentheses.
For example, instead of using this way:
A + B * C
The user needs to use this:
A + (B * C)
To solve the Problem 1 I've got a snippet of ANTLR that worked, but I'm not sure if it's the best way to solve it:
sumExpr
#init {boolean isSum=false;boolean isSub=false;}
: {isSum(input.LT(2).getText()) && !isSub}? multExpr('+'^{isSum=true;} sumExpr)+
| {isSub(input.LT(2).getText()) && !isSum}? multExpr('-'^{isSub=true;} sumExpr)+
| multExpr;
To solve the Problem 2, I have no idea where to start.
I appreciate your help to find out a better solution to the first problem and a direction to solve the seconde one. (Sorry for my bad english)
Below is the grammar that we have developed:
grammar TclGrammar;
options {
output=AST;
ASTLabelType=CommonTree;
}
#members {
public boolean isSum(String type) {
System.out.println("Tipo: " + type);
return "+".equals(type);
}
public boolean isSub(String type) {
System.out.println("Tipo: " + type);
return "-".equals(type);
}
}
prog
: exprMain ';' {System.out.println(
$exprMain.tree == null ? "null" : $exprMain.tree.toStringTree());}
;
exprMain
: exprQuando? (exprDeveSatis | exprDeveFalharCaso)
;
exprDeveSatis
: 'DEVE SATISFAZER' '{'! expr '}'!
;
exprDeveFalharCaso
: 'DEVE FALHAR CASO' '{'! expr '}'!
;
exprQuando
: 'QUANDO' '{'! expr '}'!
;
expr
: logicExpr
;
logicExpr
: boolExpr (('E'|'OU')^ boolExpr)*
;
boolExpr
: comparatorExpr
| emExpr
| 'VERDADE'
| 'FALSO'
;
emExpr
: FIELD 'EM' '[' (variable_lista | field_lista) comCruzamentoExpr? ']'
-> ^('EM' FIELD (variable_lista+)? (field_lista+)? comCruzamentoExpr?)
;
comCruzamentoExpr
: 'COM CRUZAMENTO' '(' FIELD ';' FIELD (';' FIELD)* ')' -> ^('COM CRUZAMENTO' FIELD+)
;
comparatorExpr
: sumExpr (('<'^|'<='^|'>'^|'>='^|'='^|'<>'^) sumExpr)?
| naoPreenchidoExpr
| preenchidoExpr
;
naoPreenchidoExpr
: FIELD 'NAO PREENCHIDO' -> ^('NAO PREENCHIDO' FIELD)
;
preenchidoExpr
: FIELD 'PREENCHIDO' -> ^('PREENCHIDO' FIELD)
;
sumExpr
#init {boolean isSum=false;boolean isSub=false;}
: {isSum(input.LT(2).getText()) && !isSub}? multExpr('+'^{isSum=true;} sumExpr)+
| {isSub(input.LT(2).getText()) && !isSum}? multExpr('-'^{isSub=true;} sumExpr)+
| multExpr
;
multExpr
: funcExpr(('*'^|'/'^) funcExpr)?
;
funcExpr
: 'QUANTIDADE'^ '('! FIELD ')'!
| 'EXTRAI_TEXTO'^ '('! FIELD ';' INTEGER ';' INTEGER ')'!
| cruzaExpr
| 'COMBINACAO_UNICA' '(' FIELD ';' FIELD (';' FIELD)* ')' -> ^('COMBINACAO_UNICA' FIELD+)
| 'EXISTE'^ '('! FIELD ')'!
| 'UNICO'^ '('! FIELD ')'!
| atom
;
cruzaExpr
: operadorCruzaExpr ('CRUZA COM'^|'CRUZA AMBOS'^) operadorCruzaExpr ondeExpr?
;
operadorCruzaExpr
: FIELD('('!field_lista')'!)?
;
ondeExpr
: ('ONDE'^ '('!expr')'!)
;
atom
: FIELD
| VARIABLE
| '('! expr ')'!
;
field_lista
: FIELD(';' field_lista)?
;
variable_lista
: VARIABLE(';' variable_lista)?
;
FIELD
: NONCONTROL_CHAR+
;
NUMBER
: INTEGER | FLOAT
;
VARIABLE
: '\'' NONCONTROL_CHAR+ '\''
;
fragment SIGN: '+' | '-';
fragment NONCONTROL_CHAR: LETTER | DIGIT | SYMBOL;
fragment LETTER: LOWER | UPPER;
fragment LOWER: 'a'..'z';
fragment UPPER: 'A'..'Z';
fragment DIGIT: '0'..'9';
fragment SYMBOL: '_' | '.' | ',';
fragment FLOAT: INTEGER '.' '0'..'9'*;
fragment INTEGER: '0' | SIGN? '1'..'9' '0'..'9'*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {skip();}
;
This is similar to not having operator precedence at all.
expr
: funcExpr
( ('+' funcExpr)*
| ('-' funcExpr)*
| ('*' funcExpr)*
| ('/' funcExpr)*
)
;
I think the following should work. I'm assuming some lexer tokens with obvious names.
expr: sumExpr;
sumExpr: onlySum | subExpr;
onlySum: atom ( PLUS onlySum )?;
subExpr: onlySub | multExpr;
onlySub: atom ( MINUS onlySub )? ;
multExpr: atom ( STAR atomic )? ;
parenExpr: OPEN_PAREN expr CLOSE_PAREN;
atom: FIELD | VARIABLE | parenExpr
The only* rules match an expression if it only has one type of operator outside of parentheses. The *Expr rules match either a line with the appropriate type of operators or go to the next operator.
If you have multiple types of operators, then they are forced to be inside parentheses because the match will go through atom.

ANTLR: Why the invalid input could match the grammar definition

I've written a very simple grammar definition for a calculation expression:
grammar SimpleCalc;
options {
output=AST;
}
tokens {
PLUS = '+' ;
MINUS = '-' ;
MULT = '*' ;
DIV = '/' ;
}
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
ID : ('a'..'z' | 'A' .. 'Z' | '0' .. '9')+ ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { Skip(); } ;
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
start: expr EOF;
expr : multExpr ((PLUS | MINUS)^ multExpr)*;
multExpr : atom ((MULT | DIV)^ atom )*;
atom : ID
| '(' expr ')' -> expr;
I've tried the invalid expression ABC &* DEF by start but it passed. It looks like the & charactor is ignored. What's the problem here?
Actually your invalid expression ABC &= DEF hasn't been passed; it causes NoViableAltException.

How do I distinguish a very keyword-like token from a keyword using ANTLR?

I am having trouble distinguishing a keyword from a non-keyword when a grammar allows the non-keyword to have a similar "look" to the keyword.
Here's the grammar:
grammar Query;
options {
output = AST;
backtrack = true;
}
tokens {
DefaultBooleanNode;
}
// Parser
startExpression : expression EOF ;
expression : withinExpression ;
withinExpression
: defaultBooleanExpression
(WSLASH^ NUMBER defaultBooleanExpression)*
defaultBooleanExpression
: (queryFragment -> queryFragment)
(e=queryFragment -> ^(DefaultBooleanNode $defaultBooleanExpression $e))*
;
queryFragment : unquotedQuery ;
unquotedQuery : UNQUOTED | NUMBER ;
// Lexer
WSLASH : ('W'|'w') '/';
NUMBER : Digit+ ('.' Digit+)? ;
UNQUOTED : UnquotedStartChar UnquotedChar* ;
fragment UnquotedStartChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
| ':' | '"' | '/' | '(' | ')' | '[' | ']'
| '{' | '}' | '-' | '+' | '~' | '&' | '|'
| '!' | '^' | '?' | '*' )
;
fragment UnquotedChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
| ':' | '"' | '(' | ')' | '[' | ']' | '{'
| '}' | '~' | '&' | '|' | '!' | '^' | '?'
| '*' )
;
fragment EscapeSequence
: '\\'
( 'u' HexDigit HexDigit HexDigit HexDigit
| ~( 'u' )
)
;
fragment Digit : ('0'..'9') ;
fragment HexDigit : ('0'..'9' | 'a'..'f' | 'A'..'F') ;
WHITESPACE : ( ' ' | '\r' | '\t' | '\u000C' | '\n' ) { skip(); };
I have simplified it enough to get rid of the distractions but I think removing any more would remove the problem.
A slash is permitted in the middle of an unquoted query fragment.
Boolean queries in particular have no required keyword.
A new syntax (e.g. W/3) is being introduced but I'm trying not to affect existing queries which happen to look similar (e.g. X/Y)
Due to '/' being valid as part of a word, ANTLR appears to be giving me "W/3" as a single token of type UNQUOTED instead of it being a WSLASH followed by a NUMBER.
Due to the above, I end up with a tree like: DefaultBooleanNode(DefaultBooleanNode(~first clause~, "W/3"), ~second clause~), whereas what I really wanted was WSLASH(~first clause~, "3", ~second clause~).
What I would like to do is somehow write the UNQUOTED rule as "what I have now, but not matching ~~~~", but I'm at a loss for how to do that.
I realise that I could spell it out in full, e.g.:
Any character from UnquotedStartChar except 'w', followed by the rest of the rule
'w' followed by any character from UnquotedChar except '/', followed by the rest of the rule
'w/' followed by any character from UnquotedChar except digits
...
However, that would look awful. :)
When a lexer generated by ANTLR "sees" that certain input can be matched by more than 1 rule, it chooses the longest match. If you want a shorter match to take precedence, you'll need to merge all the similar rules into one and then check with a gated sematic predicate if the shorter match is ahead or not. If the shorter match is ahead, you change the type of the token.
A demo:
Query.g
grammar Query;
tokens {
WSlash;
}
#lexer::members {
private boolean ahead(String text) {
for(int i = 0; i < text.length(); i++) {
if(input.LA(i + 1) != text.charAt(i)) {
return false;
}
}
return true;
}
}
parse
: (t=. {System.out.printf("\%-10s \%s\n", tokenNames[$t.type], $t.text);} )* EOF
;
NUMBER
: Digit+ ('.' Digit+)?
;
UNQUOTED
: {ahead("W/")}?=> 'W/' { $type=WSlash; /* change the type of the token */ }
| {ahead("w/")}?=> 'w/' { $type=WSlash; /* change the type of the token */ }
| UnquotedStartChar UnquotedChar*
;
fragment UnquotedStartChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
| ':' | '"' | '/' | '(' | ')' | '[' | ']'
| '{' | '}' | '-' | '+' | '~' | '&' | '|'
| '!' | '^' | '?' | '*' )
;
fragment UnquotedChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
| ':' | '"' | '(' | ')' | '[' | ']' | '{'
| '}' | '~' | '&' | '|' | '!' | '^' | '?'
| '*' )
;
fragment EscapeSequence
: '\\'
( 'u' HexDigit HexDigit HexDigit HexDigit
| ~'u'
)
;
fragment Digit : '0'..'9';
fragment HexDigit : '0'..'9' | 'a'..'f' | 'A'..'F';
WHITESPACE : (' ' | '\r' | '\t' | '\u000C' | '\n') { skip(); };
Main.java
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
QueryLexer lexer = new QueryLexer(new ANTLRStringStream("P/3 W/3"));
QueryParser parser = new QueryParser(new CommonTokenStream(lexer));
parser.parse();
}
}
To run the demo on *nix/MacOS:
java -cp antlr-3.3.jar org.antlr.Tool Query.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
or on Windows:
java -cp antlr-3.3.jar org.antlr.Tool Query.g
javac -cp antlr-3.3.jar *.java
java -cp .;antlr-3.3.jar Main
which will print the following:
UNQUOTED P/3
WSlash W/
NUMBER 3
EDIT
To eliminate the warning when using the WSlash token in a parser rule, simply add an empty fragment rule to your grammar:
fragment WSlash : /* empty */ ;
It's a bit of a hack, but that's how it's done. No more warnings.