Add quoted strings support to Antlr3 grammar - antlr

I'm trying to implement a grammar for parsing queries. Single query consists of items where each item can be either name or name-ref.
name is either mystring (only letters, no spaces) or "my long string" (letters and spaces, always quoted). name-ref is very similar to name and the only difference is that it should start with ref: (ref:mystring, ref:"my long string"). Query should contain at least 1 item (name or name-ref).
Here's what I have:
NAME: ('a'..'z')+;
REF_TAG: 'ref:';
SP: ' '+;
name: NAME;
name_ref: REF_TAG name;
item: name | name_ref;
query: item (SP item)*;
This grammar demonstrates what I basically need to get and the only feature is that it doesn't support long quoted strings (it works fine for names that doesn't have spaces).
SHORT_NAME: ('a'..'z')+;
LONG_NAME: SHORT_NAME (SP SHORT_NAME)*;
REF_TAG: 'ref:';
SP: ' '+;
Q: '"';
short_name: SHORT_NAME;
long_name: LONG_NAME;
name_ref: REF_TAG (short_name | (Q long_name Q));
item: (short_name | (Q long_name Q)) | name_ref;
query: item (SP item)*;
But that doesn't work. Any ideas what's the problem? Probably, that's important: my first query should be treated as 3 items (3 names) and "my first query" is 1 item (1 long_name).

ANTLR's lexer matches greedily: that is why input like my first query is being tokenized as LONG_NAME instead of 3 SHORT_NAMEs with spaces in between.
Simply remove the LONG_NAME rule and define it in the parser rule long_name.
The following grammar:
SHORT_NAME : ('a'..'z')+;
REF_TAG : 'ref:';
SP : ' '+;
Q : '"';
short_name : SHORT_NAME;
long_name : Q SHORT_NAME (SP SHORT_NAME)* Q;
name_ref : REF_TAG (short_name | (Q long_name Q));
item : short_name | long_name | name_ref;
query : item (SP item)*;
will parse the input:
my first query "my first query" ref:mystring
as follows:
However, you could also tokenize a quoted name in the lexer and strip the quotes from it with a bit of custom code. And removing spaces from the lexer could also be an option. Something like this:
SHORT_NAME : ('a'..'z')+;
LONG_NAME : '"' ~'"'* '"' {setText(getText().substring(1, getText().length()-1));};
REF_TAG : 'ref:';
SP : ' '+ {skip();};
name_ref : REF_TAG (SHORT_NAME | LONG_NAME);
item : SHORT_NAME | LONG_NAME | name_ref;
query : item+ EOF;
which would parse the same input as follows:
Note that the actual token LONG_NAME will be stripped of its start- and end-quote.

Here's a grammar that should work for your requirements:
SP: ' '+;
SHORT_NAME: ('a'..'z')+;
LONG_NAME: '"' SHORT_NAME (SP SHORT_NAME)* '"';
REF: 'ref:' (SHORT_NAME | LONG_NAME);
item: SHORT_NAME | LONG_NAME | REF;
query: item (SP item)*;
If you put this at the top:
grammar Query;
#members {
public static void main(String[] args) throws Exception {
QueryLexer lex = new QueryLexer(new ANTLRFileStream(args[0]));
CommonTokenStream tokens = new CommonTokenStream(lex);
QueryParser parser = new QueryParser(tokens);
try {
TokenSource ts = parser.getTokenStream().getTokenSource();
Token tok = ts.nextToken();
while (EOF != (tok.getType())) {
System.out.println("Got a token: " + tok);
tok = ts.nextToken();
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
You should see the lexer break everything apart nicely (I hope ;-) )
hi there "long name" ref:shortname ref:"long name"
Should give:
Got a token: [#-1,0:1='hi',<6>,1:0]
Got a token: [#-1,2:2=' ',<7>,1:2]
Got a token: [#-1,3:7='there',<6>,1:3]
Got a token: [#-1,8:8=' ',<7>,1:8]
Got a token: [#-1,9:19='"long name"',<4>,1:9]
Got a token: [#-1,20:20=' ',<7>,1:20]
Got a token: [#-1,21:33='ref:shortname',<5>,1:21]
Got a token: [#-1,34:34=' ',<7>,1:34]
Got a token: [#-1,35:49='ref:"long name"',<5>,1:35]
I'm not 100% sure what the problem is with your grammar, but I suspect the issue relates to your definition of a LONG_NAME without the quotes. Perhaps you can see what the distinction is?

Related

Ignore spaces, but allow text with spaces

I need to write a simple antlr4 grammar for expressions like this:
{paramName=simple text} //correct
{ paramName = simple text} //correct
{bad param=text} //incorrect
First two expression is almost equal. The difference is a space before and after parameter name. Third is incorrect, spaces not allowed in parameter name. I write a grammar:
grammar Test;
prog : '{' paramName '=' paramValue '}' ;
paramName : PARAM_NAME ;
paramValue : TEXT_WITH_SPACES ;
PARAM_NAME : [A-Za-zА-Яа-я_] [A-Za-zА-Яа-я_0-9]* ;
TEXT_WITH_SPACES : (LETTERS_EN|' ')+ ;
WS : [ ]+ -> skip;
fragment LETTERS_EN : ([A-Za-z]) ;
So, the task is ignore spaces around parameter name, but allow spaces in parameter value. But when I add a space inside rule TEXT_WITH_SPACES, my second expression highlight as icorrect.
screenshot
What can I do? Thank you in advance!
Ignore all spaces, but consider them to be "end of word", and allow more words in the value:
grammar Test;
prog : '{' paramName '=' paramValue '}' ;
paramName : WORD ;
paramValue : WORD+ ;
WORD : [A-Za-zА-Яа-я_] [A-Za-zА-Яа-я_0-9]* ;
WS : [ ]+ -> skip;
Update: To preserve spaces in the value:
grammar Test;
prog : '{' paramName '=' paramValue '}' ;
paramName : WORD ;
paramValue : WORD | MULTIWORD ;
MULTIWORD : WORD ((' ')+ WORD)* ;
WORD : [A-Za-zА-Яа-я_] [A-Za-zА-Яа-я_0-9]* ;
WS : [ ]+ -> skip;
This is based on MULTIWORD matching multiple words with nothing but space in between them, and other cases being matched by sequence of WORD and WS.

How to fix extraneous input ' ' expecting, in antlr4

Hello when running antlr4 with the following input i get the following error
image showing problem
[
I have been trying to fix it by doing some changes here and there but it seems it only works if I write every component of whileLoop in a new line.
Could you please tell me what i am missing here and why the problem persits?
grammar AM;
COMMENTS :
'{'~[\n|\r]*'}' -> skip
;
body : ('BODY' ' '*) anything | 'BODY' 'BEGIN' anything* 'END' ;
anything : whileLoop | write ;
write : 'WRITE' '(' '"' sentance '"' ')' ;
read : 'READ' '(' '"' sentance '"' ')' ;
whileLoop : 'WHILE' expression 'DO' ;
block : 'BODY' anything 'END';
expression : 'TRUE'|'FALSE' ;
test : ID? {System.out.println("Done");};
logicalOperators : '<' | '>' | '<>' | '<=' | '>=' | '=' ;
numberExpressionS : (NUMBER numberExpression)* ;
numberExpression : ('-' | '/' | '*' | '+' | '%') NUMBER ;
sentance : (ID)* {System.out.println("Sentance");};
WS : [ \t\r\n]+ -> skip ;
NUMBER : [0-9]+ ;
ID : [a-zA-Z0-9]* ;
**`strong text`**
Your lexer rules produce conflicts:
body : ('BODY' ' '*) anything | 'BODY' 'BEGIN' anything* 'END' ;
vs
WS : [ \t\r\n]+ -> skip ;
The critical section is the ' '*. This defines an implicit lexer token. It matches spaces and it is defined above of WS. So any sequence of spaces is not handled as WS but as implicit token.
If I am right putting tabs between the components of whileloop will work, also putting more than one space between them should work. You should simply remove ' '*, since whitespace is to be skipped anyway.

Antlr grammar for parsing simple expression

I would like to parse following expresion with antlr4
termspannear ( xxx, xxx , 5 , true )
termspannear ( xxx, termspannear ( xxx, xxx , 5 , true ) , 5 , true )
Where termspannear functions can be nested
Here is my grammar:
//Define a gramar to parse TermSpanNear
grammar TermSpanNear;
start : TERMSPAN ;
TERMSPAN : TERMSPANNEAR | 'xxx' ;
TERMSPANNEAR: 'termspannear' OPENP BODY CLOSEP ;
BODY : TERMSPAN COMMA TERMSPAN COMMA SLOP COMMA ORDERED ;
COMMA : ',' ;
OPENP : '(' ;
CLOSEP : ')' ;
SLOP : [0-9]+ ;
ORDERED : 'true' | 'false' ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
After running:
antlr4 TermSpanNear.g4
javac TermSpanNear*.java
grun TermSpanNear start -gui
termspannear ( xxx, xxx , 5 , true )
^D![enter image description here][1]
line 1:0 token recognition error at: 'termspannear '
line 1:13 extraneous input '(' expecting TERMSPAN
and the tree looks like:
Can someone help me with this grammar ?
So the parsed tree contains all params and and also nesting works
NOTE:
After suggestion by I rewrote it to
//Define a gramar to parse TermSpanNear
grammar TermSpanNear;
start : termspan EOF;
termspan : termspannear | 'xxx' ;
termspannear: 'termspannear' '(' body ')' ;
body : termspan ',' termspan ',' SLOP ',' ORDERED ;
SLOP : [0-9]+ ;
ORDERED : 'true' | 'false' ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
I think now it works
I'm geting the following trees:
For
termspannear ( xxx, xxx , 5 , true )
For
termspannear ( xxx, termspannear ( xxx, xxx , 5 , true ) , 5 , true )
You're using way too many lexer rules.
When you're defining a token like this:
BODY : TERMSPAN COMMA TERMSPAN COMMA SLOP COMMA ORDERED ;
then the tokenizer (lexer) will try to create the (single!) token: xxx,xxx,5,true. E.g. it does not allow any space in between it. Lexer rules (the ones starting with a capital) should really be the "atoms" of your language (the smallest parts). Whenever you start creating elements like a body, you glue atoms together in parser rules, not in lexer rules.
Try something like this:
grammar TermSpanNear;
// parser rules (the elements)
start : termpsan EOF ;
termpsan : termpsannear | 'xxx' ;
termpsannear : 'termspannear' OPENP body CLOSEP ;
body : termpsan COMMA termpsan COMMA SLOP COMMA ORDERED ;
// lexer rules (the atoms)
COMMA : ',' ;
OPENP : '(' ;
CLOSEP : ')' ;
SLOP : [0-9]+ ;
ORDERED : 'true' | 'false' ;
WS : [ \t\r\n]+ -> skip ;

Antlr grammar won't work

My grammar won't accept inputs like "x = 4, xy = 1"
grammar Ex5;
prog : stat+;
stat : ID '=' NUMBER;
LETTER : ('a'..'z'|'A'..'Z');
DIGIT : ('0'..'9');
ID : LETTER+;
NUMBER : DIGIT+;
WS : (' '|'\t'| '\r'? '\n' )+ {skip();};
What do i have to do to make it accept a multitude of inputs like ID = NUMBER??
Thank you in advance.
You'll have to account for the comma, ',', inside your grammar. Also, since you (most probably) don't want LETTER and DIGIT tokens to be created since they are only used in other lexer rules, you should make these fragments:
grammar Ex5;
prog : stat (',' stat)*;
stat : ID '=' NUMBER;
ID : LETTER+;
NUMBER : DIGIT+;
fragment LETTER : 'a'..'z' | 'A'..'Z';
fragment DIGIT : '0'..'9';
WS : (' '|'\t'| '\r'? '\n')+ {skip();};

ANTLR grammar error

I'm trying to built C-- compiler using ANTLR 3.4.
Full set of the grammar listed here,
program : (vardeclaration | fundeclaration)* ;
vardeclaration : INT ID (OPENSQ NUM CLOSESQ)? SEMICOL ;
fundeclaration : typespecifier ID OPENP params CLOSEP compoundstmt ;
typespecifier : INT | VOID ;
params : VOID | paramlist ;
paramlist : param (COMMA param)* ;
param : INT ID (OPENSQ CLOSESQ)? ;
compoundstmt : OPENCUR vardeclaration* statement* CLOSECUR ;
statementlist : statement* ;
statement : expressionstmt | compoundstmt | selectionstmt | iterationstmt | returnstmt;
expressionstmt : (expression)? SEMICOL;
selectionstmt : IF OPENP expression CLOSEP statement (options {greedy=true;}: ELSE statement)?;
iterationstmt : WHILE OPENP expression CLOSEP statement;
returnstmt : RETURN (expression)? SEMICOL;
expression : (var EQUAL expression) | sampleexpression;
var : ID ( OPENSQ expression CLOSESQ )? ;
sampleexpression: addexpr ( ( LOREQ | LESS | GRTR | GOREQ | EQUAL | NTEQL) addexpr)?;
addexpr : mulexpr ( ( PLUS | MINUS ) mulexpr)*;
mulexpr : factor ( ( MULTI | DIV ) factor )*;
factor : ( OPENP expression CLOSEP ) | var | call | NUM;
call : ID OPENP arglist? CLOSEP;
arglist : expression ( COMMA expression)*;
Used lexer rules as following,
ELSE : 'else' ;
IF : 'if' ;
INT : 'int' ;
RETURN : 'return' ;
VOID : 'void' ;
WHILE : 'while' ;
PLUS : '+' ;
MINUS : '-' ;
MULTI : '*' ;
DIV : '/' ;
LESS : '<' ;
LOREQ : '<=' ;
GRTR : '>' ;
GOREQ : '>=' ;
EQUAL : '==' ;
NTEQL : '!=' ;
ASSIGN : '=' ;
SEMICOL : ';' ;
COMMA : ',' ;
OPENP : '(' ;
CLOSEP : ')' ;
OPENSQ : '[' ;
CLOSESQ : ']' ;
OPENCUR : '{' ;
CLOSECUR: '}' ;
SCOMMENT: '/*' ;
ECOMMENT: '*/' ;
ID : ('a'..'z' | 'A'..'Z')+/*(' ')*/ ;
NUM : ('0'..'9')+ ;
WS : (' ' | '\t' | '\n' | '\r')+ {$channel = HIDDEN;};
COMMENT: '/*' .* '*/' {$channel = HIDDEN;};
But I try to save this it give me the error,
error(211): /CMinusMinus/src/CMinusMinus/CMinusMinus.g:33:13: [fatal] rule expression has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
|---> expression : (var EQUAL expression) | sampleexpression;
1 error
How can I resolve this problem?
As already mentioned: your grammar rule expression is ambiguous: both alternatives in that rule start, or can be, a var.
You need to "help" your parser a bit. If the parse can see a var followed by an EQUAL, it should choose alternative 1, else alternative 2. This can be done by using a syntactic predicate (the (var EQUAL)=> part in the rule below).
expression
: (var EQUAL)=> var EQUAL expression
| sampleexpression
;
More about predicates in this Q&A: What is a 'semantic predicate' in ANTLR?
The problem is this:
expression : (var EQUAL expression) | sampleexpression;
where you either start with var or sampleexpression. But sampleexpression can be reduced to var as well by doing sampleexpression->addExpr->MultExpr->Factor->var
So there is no way to find a k-length predicate for the compiler.
You can as suggested by the error message set backtrack=true to see whether this solves your problem, but it might lead not to the AST - parsetrees you would expect and might also be slow on special input conditions.
You could also try to refactor your grammar to avoid such recursions.