Ignore spaces, but allow text with spaces - antlr

I need to write a simple antlr4 grammar for expressions like this:
{paramName=simple text} //correct
{ paramName = simple text} //correct
{bad param=text} //incorrect
The first two expressions are almost identical; the only difference is the spaces before and after the parameter name. The third is incorrect, because spaces are not allowed in a parameter name. I wrote this grammar:
grammar Test;
prog : '{' paramName '=' paramValue '}' ;
paramName : PARAM_NAME ;
paramValue : TEXT_WITH_SPACES ;
PARAM_NAME : [A-Za-zА-Яа-я_] [A-Za-zА-Яа-я_0-9]* ;
TEXT_WITH_SPACES : (LETTERS_EN|' ')+ ;
WS : [ ]+ -> skip;
fragment LETTERS_EN : ([A-Za-z]) ;
So the task is to ignore spaces around the parameter name but allow spaces in the parameter value. But when I add a space to the TEXT_WITH_SPACES rule, my second expression is highlighted as incorrect.
What can I do? Thank you in advance!

Ignore all spaces, but consider them to be "end of word", and allow more words in the value:
grammar Test;
prog : '{' paramName '=' paramValue '}' ;
paramName : WORD ;
paramValue : WORD+ ;
WORD : [A-Za-zА-Яа-я_] [A-Za-zА-Яа-я_0-9]* ;
WS : [ ]+ -> skip;
Update: To preserve spaces in the value:
grammar Test;
prog : '{' paramName '=' paramValue '}' ;
paramName : WORD ;
paramValue : WORD | MULTIWORD ;
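WORD : [A-Za-zА-Яа-я_] [A-Za-zА-Яа-я_0-9]* ;
MULTIWORD : WORD ((' ')+ WORD)* ;
WS : [ ]+ -> skip;
This relies on MULTIWORD matching multiple words with nothing but spaces between them; all other cases are matched as a sequence of WORD and WS tokens. WORD is listed before MULTIWORD because both rules match a single word with the same length, and ANTLR then picks the rule that appears first in the grammar, so a lone parameter name stays a WORD.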
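A rough way to check how the lexer would split the sample inputs is to simulate ANTLR's matching policy (the longest match wins, and ties go to the rule declared first) in plain Python. This is only a sketch, not ANTLR itself: the `tokenize` helper and the token names `LBRACE`, `EQ` and `RBRACE` are made up for illustration, and it assumes WORD is declared before MULTIWORD.

```python
import re

WORD = r"[A-Za-zА-Яа-я_][A-Za-zА-Яа-я_0-9]*"

# Order matters: on equal match length the earlier rule wins.
RULES = [
    ("LBRACE", r"\{"),
    ("RBRACE", r"\}"),
    ("EQ", r"="),
    ("WORD", WORD),
    ("MULTIWORD", WORD + r"(?: +" + WORD + r")*"),
    ("WS", r" +"),
]


def tokenize(text, rules, skip=("WS",)):
    """Longest match wins; ties go to the rule listed first (hypothetical helper)."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name not in skip:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("{ paramName = simple text}", RULES))
print(tokenize("{bad param=text}", RULES))
```

In the first input `paramName` comes out as WORD and `simple text` as one MULTIWORD token. In the second, `bad param` becomes a MULTIWORD, which the parser rule `paramName : WORD` would reject, as intended.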

Related

ANTLR4: wrong lexer rule matches

I'm at the very beginning of learning ANTLR4 lexer rules. My goal is to create a simple grammar for Java properties files. Here is what I have so far:
lexer grammar PropertiesLexer;
LineComment
: ( LineCommentHash
| LineCommentExcl
)
-> skip
;
fragment LineCommentHash
: '#' ~[\r\n]*
;
fragment LineCommentExcl
: '!' ~[\r\n]*
;
fragment WrappedLine
: '\\'
( '\r' '\n'?
| '\n'
)
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> skip
;
Key
: KeyLetterStart
( KeyLetter
| Escaped
)*
;
fragment KeyLetterStart
: ~[ \t\r\n:=]
;
fragment KeyLetter
: ~[\t\r\n:=]
;
fragment Escaped
: '\\' .?
;
Equal
: ( '\\'? ':'
| '\\'? '='
)
;
Value
: ValueLetterBegin
( ValueLetter
| Escaped
| WrappedLine
)*
;
fragment ValueLetterBegin
: ~[ \t\r\n]
;
fragment ValueLetter
: ~ [\r\n]+
;
Whitespace
: [ \t]+
-> skip
;
My test file is this one:
# comment 1
# comment 2
#
.key1= value1
key2\:sub=value2
key3 \= value3
key4=value41\
value42
# comment3
#comment4
key=value
When I run grun, I get the following output:
[#0,30:42='.key1= value1',<Value>,4:0]
[#1,45:60='key2\:sub=value2',<Value>,5:0]
[#2,63:76='key3 \= value3',<Value>,6:0]
[#3,81:102='key4=value41\\r\nvalue42',<Value>,8:0]
[#4,130:138='key=value',<Value>,13:0]
[#5,141:140='<EOF>',<EOF>,14:0]
I don't understand why the Value definition is matched. When commenting out the Value definition, however, it recognizes the Key and Equal definitions:
[#0,30:34='.key1',<Key>,4:0]
[#1,35:35='=',<Equal>,4:5]
[#2,37:42='value1',<Key>,4:7]
[#3,45:49='key2\',<Key>,5:0]
[#4,50:50=':',<Equal>,5:5]
[#5,51:53='sub',<Key>,5:6]
[#6,54:54='=',<Equal>,5:9]
[#7,55:60='value2',<Key>,5:10]
[#8,63:68='key3 \',<Key>,6:0]
[#9,69:69='=',<Equal>,6:6]
[#10,71:76='value3',<Key>,6:8]
[#11,81:84='key4',<Key>,8:0]
[#12,85:85='=',<Equal>,8:4]
[#13,86:93='value41\',<Key>,8:5]
[#14,96:102='value42',<Key>,9:0]
[#15,130:132='key',<Key>,13:0]
[#16,133:133='=',<Equal>,13:3]
[#17,134:138='value',<Key>,13:4]
[#18,141:140='<EOF>',<EOF>,14:0]
But how can I get it to recognize the Key, Equal and Value definitions together?
ANTLR's lexer rules match as many characters as possible, which is why you're seeing all these Value tokens being created (they match the most characters).
Lexical modes seem like a good fit to use here. Something like this:
lexer grammar PropertiesLexer;
COMMENT
: [!#] ~[\r\n]* -> skip
;
KEY
: ( '\\' ~[\r\n] | ~[\r\n\\=:] )+
;
EQUAL
: [=:] -> pushMode(VALUE_MODE)
;
NL
: [\r\n]+ -> skip
;
mode VALUE_MODE;
VALUE
: ( ~[\\\r\n] | '\\' . )+
;
END_VALUE
: [\r\n]+ -> skip, popMode
;
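To get a feel for what the mode switch does, here is a small Python simulation of the two-mode lexer above. It is a sketch under assumptions, not ANTLR: the `MODES` table, the `tokenize` helper and the `SAMPLE` text are illustrative, the regexes are hand translations of the lexer rules, and the sample uses `\n` line endings only.

```python
import re

# (token name, regex, action); action is None, "skip", "push" or "pop".
MODES = {
    "DEFAULT": [
        ("COMMENT", r"[!#][^\r\n]*", "skip"),
        ("KEY", r"(?:\\[^\r\n]|[^\r\n\\=:])+", None),
        ("EQUAL", r"[=:]", "push"),
        ("NL", r"[\r\n]+", "skip"),
    ],
    "VALUE_MODE": [
        # '\\' . in ANTLR: a backslash followed by any character, newline included.
        ("VALUE", r"(?:[^\\\r\n]|\\[\s\S])+", None),
        ("END_VALUE", r"[\r\n]+", "pop"),
    ],
}


def tokenize(text):
    """Longest match within the current mode; ties go to the earlier rule."""
    stack, tokens, pos = ["DEFAULT"], [], 0
    while pos < len(text):
        best = None
        for name, pattern, action in MODES[stack[-1]]:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[2]):
                best = (name, action, m.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        name, action, end = best
        if action == "push":
            stack.append("VALUE_MODE")
        elif action == "pop":
            stack.pop()
        if action in (None, "push"):  # "skip" and "pop" emit no token
            tokens.append((name, text[pos:end]))
        pos = end
    return tokens


SAMPLE = "# comment 1\nkey1= value1\nkey2\\:sub=value2\nkey4=value41\\\nvalue42\n"
for token in tokenize(SAMPLE):
    print(token)
```

Because VALUE only exists in VALUE_MODE, it can no longer swallow a whole line in the default mode. In this sketch the value keeps its leading space (`' value1'`); trimming it would need an extra whitespace rule in the value mode.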

How to fix extraneous input ' ' expecting, in antlr4

Hello, when running antlr4 on my input I get the "extraneous input ' ' expecting" error from the title.
I have been trying to fix it by making changes here and there, but it only seems to work if I write every component of whileLoop on a new line.
Could you please tell me what I am missing here and why the problem persists?
grammar AM;
COMMENTS :
'{'~[\n|\r]*'}' -> skip
;
body : ('BODY' ' '*) anything | 'BODY' 'BEGIN' anything* 'END' ;
anything : whileLoop | write ;
write : 'WRITE' '(' '"' sentance '"' ')' ;
read : 'READ' '(' '"' sentance '"' ')' ;
whileLoop : 'WHILE' expression 'DO' ;
block : 'BODY' anything 'END';
expression : 'TRUE'|'FALSE' ;
test : ID? {System.out.println("Done");};
logicalOperators : '<' | '>' | '<>' | '<=' | '>=' | '=' ;
numberExpressionS : (NUMBER numberExpression)* ;
numberExpression : ('-' | '/' | '*' | '+' | '%') NUMBER ;
sentance : (ID)* {System.out.println("Sentance");};
WS : [ \t\r\n]+ -> skip ;
NUMBER : [0-9]+ ;
ID : [a-zA-Z0-9]* ;
Your lexer rules produce conflicts:
body : ('BODY' ' '*) anything | 'BODY' 'BEGIN' anything* 'END' ;
vs
WS : [ \t\r\n]+ -> skip ;
The critical part is the ' '*. It defines an implicit lexer token that matches a single space, and that token is defined above WS. So a single space is not handled as WS (and skipped) but is emitted as the implicit token.
If I am right, putting tabs between the components of whileLoop will work, and putting more than one space between them should work too. You should simply remove ' '*, since whitespace is skipped anyway.
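The effect can be reproduced with a small Python simulation of the lexer's policy (longest match wins, ties go to the earlier rule). This is a sketch under assumptions: the `tokenize` and `names` helpers are hypothetical, implicit literal tokens are assumed to be ordered before the explicit lexer rules, and ID uses `+` instead of the original `*` so that it never matches an empty string.

```python
import re


def tokenize(text, rules, skip=("WS",)):
    """Longest match wins; ties go to the rule listed first (hypothetical helper)."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name not in skip:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


def names(tokens):
    return [name for name, _ in tokens]


# Literals used in parser rules become implicit tokens ahead of the named rules.
IMPLICIT = [("'WHILE'", r"WHILE"), ("'DO'", r"DO"), ("'TRUE'", r"TRUE")]
EXPLICIT = [("WS", r"[ \t\r\n]+"), ("ID", r"[a-zA-Z0-9]+")]

with_space_literal = IMPLICIT + [("' '", r" ")] + EXPLICIT
without_space_literal = IMPLICIT + EXPLICIT

print(names(tokenize("WHILE TRUE DO", with_space_literal)))
print(names(tokenize("WHILE  TRUE\tDO", with_space_literal)))
print(names(tokenize("WHILE TRUE DO", without_space_literal)))
```

With the `' '` literal present, a single space ties with WS and is emitted as its own token, which the parser does not expect. Two spaces or a tab are longer than (or not matched by) the literal, so WS wins and they are skipped. Without the literal, all three inputs reduce to the WHILE, TRUE, DO tokens.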

ParserRule matching the wrong token

I'm trying to learn a bit of ANTLR4 and define a grammar for a 4GL language.
This is what I've got:
compileUnit
:
typedeclaration EOF
;
typedeclaration
:
ID LPAREN DATATYPE INT RPAREN
;
DATATYPE
:
DATATYPE_ALPHANUMERIC
| DATATYPE_NUMERIC
;
DATATYPE_ALPHANUMERIC
:
'A'
;
DATATYPE_NUMERIC
:
'N'
;
fragment
DIGIT
:
[0-9]
;
fragment
LETTER
:
[a-zA-Z]
;
INT
:
DIGIT+
;
ID
:
LETTER
(
LETTER
| DIGIT
)*
;
LPAREN
:
'('
;
RPAREN
:
')'
;
WS
:
[ \t\f]+ -> skip
;
What I want to be able to parse:
TEST (A10)
what I get:
typedeclaration:1:6: mismatched input 'A10' expecting DATATYPE
I am however able to write:
TEST (A 10)
Why do I need to put whitespace here? LPAREN DATATYPE works on its own, so no space is needed in between. INT RPAREN works as well.
Why is a space needed between DATATYPE and INT? I'm a bit confused on that one.
I guess that it's matching ID because it's the "longest" match, but there must be some way to force to be lazier here, right?
You should exclude the 'A' and 'N' chars from the first position of ID. As @CoronA noticed, ANTLR matches the longest possible token (the ID 'A10' is longer than the DATATYPE_ALPHANUMERIC 'A'). Also read about the priority rules. Try the following grammar:
grammar expr;
compileUnit
: typedeclaration EOF
;
typedeclaration
: ID LPAREN datatype INT RPAREN
;
datatype
: DATATYPE_ALPHANUMERIC
| DATATYPE_NUMERIC
;
DATATYPE_ALPHANUMERIC
: 'A'
;
DATATYPE_NUMERIC
: 'N'
;
INT
: DIGIT+
;
ID
: [b-mo-zB-MO-Z] (LETTER | DIGIT)*
;
LPAREN
: '('
;
RPAREN
: ')'
;
WS
: [ \t\f]+ -> skip
;
fragment
DIGIT
: [0-9]
;
fragment
LETTER
: [a-zA-Z]
;
Alternatively, you can use the following grammar without the ID restriction. Data types are recognized before letters because their rules come first. This version is less clear, though:
grammar expr;
compileUnit
: typedeclaration EOF
;
typedeclaration
: id LPAREN datatype DIGIT+ RPAREN
;
id
: (datatype | LETTER) (datatype | LETTER | DIGIT)*
;
datatype
: DATATYPE_ALPHANUMERIC
| DATATYPE_NUMERIC
;
DATATYPE_ALPHANUMERIC: 'A';
DATATYPE_NUMERIC: 'N';
// List with another Data types.
LETTER: [a-zA-Z];
LPAREN
: '('
;
RPAREN
: ')'
;
WS
: [ \t\f]+ -> skip
;
DIGIT
: [0-9]
;

Guide or approval for ANTLR example

I have an AlgebraRelacional.g4 file with this. I need to read a file with a syntax like a CSV file, put the content in some memory tables and then resolve relational algebra operations with that. Can you tell me if I am doing it right?
Example data file to read:
cod_buy(char);name_suc(char);Import(int);date_buy(date)
“P-11”;”DC Med”;900;01/03/14
“P-14”;”Center”;1500;02/05/14
Current ANTLR grammar:
grammar AlgebraRelacional;
SEL : '\u03C3'
;
PRO : '\u220F'
;
UNI : '\u222A'
;
DIF : '\u002D'
;
PROC : '\u0058'
;
INT : '\u2229'
;
AND : 'AND'
;
OR : 'OR'
;
NOT : 'NOT'
;
EQ : '='
;
DIFERENTE : '!='
;
MAYOR : '>'
;
MENOR : '<'
;
SUMA : '+'
;
MULTI : '*'
;
IPAREN : '('
;
DPAREN : ')'
;
COMA : ','
;
PCOMA : ';'
;
Comillas: '"'
;
file : hdr row+ ;
hdr : row ;
row : field (',' field)* '\r'? '\n' ;
field : TEXT | STRING | ;
TEXT : ~[,\n\r"]+ ;
STRING : '"' ('""'|~'"')* '"' ;
I suggest that you read this document (http://is.muni.cz/th/208197/fi_b/bc_thesis.pdf). It contains useful information about how to write a parser for relational algebra. It does not use ANTLR, but you only have to translate the grammar from BNF to EBNF.

Antlr grammar for parsing simple expression

I would like to parse the following expressions with antlr4:
termspannear ( xxx, xxx , 5 , true )
termspannear ( xxx, termspannear ( xxx, xxx , 5 , true ) , 5 , true )
Where termspannear functions can be nested
Here is my grammar:
//Define a gramar to parse TermSpanNear
grammar TermSpanNear;
start : TERMSPAN ;
TERMSPAN : TERMSPANNEAR | 'xxx' ;
TERMSPANNEAR: 'termspannear' OPENP BODY CLOSEP ;
BODY : TERMSPAN COMMA TERMSPAN COMMA SLOP COMMA ORDERED ;
COMMA : ',' ;
OPENP : '(' ;
CLOSEP : ')' ;
SLOP : [0-9]+ ;
ORDERED : 'true' | 'false' ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
After running:
antlr4 TermSpanNear.g4
javac TermSpanNear*.java
grun TermSpanNear start -gui
termspannear ( xxx, xxx , 5 , true )
^D
line 1:0 token recognition error at: 'termspannear '
line 1:13 extraneous input '(' expecting TERMSPAN
and the tree looks like:
Can someone help me with this grammar? The parse tree should contain all parameters, and nesting should work too.
NOTE:
After a suggestion, I rewrote it to:
//Define a gramar to parse TermSpanNear
grammar TermSpanNear;
start : termspan EOF;
termspan : termspannear | 'xxx' ;
termspannear: 'termspannear' '(' body ')' ;
body : termspan ',' termspan ',' SLOP ',' ORDERED ;
SLOP : [0-9]+ ;
ORDERED : 'true' | 'false' ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
I think it works now.
I'm getting the following trees:
For
termspannear ( xxx, xxx , 5 , true )
For
termspannear ( xxx, termspannear ( xxx, xxx , 5 , true ) , 5 , true )
You're using way too many lexer rules.
When you're defining a token like this:
BODY : TERMSPAN COMMA TERMSPAN COMMA SLOP COMMA ORDERED ;
then the tokenizer (lexer) will try to create a (single!) token such as xxx,xxx,5,true. That is, it does not allow any spaces inside it. Lexer rules (the ones starting with a capital letter) should really be the "atoms" of your language (the smallest parts). Whenever you start creating elements like a body, you glue atoms together in parser rules, not in lexer rules.
Try something like this:
grammar TermSpanNear;
// parser rules (the elements)
start : termspan EOF ;
termspan : termspannear | 'xxx' ;
termspannear : 'termspannear' OPENP body CLOSEP ;
body : termspan COMMA termspan COMMA SLOP COMMA ORDERED ;
// lexer rules (the atoms)
COMMA : ',' ;
OPENP : '(' ;
CLOSEP : ')' ;
SLOP : [0-9]+ ;
ORDERED : 'true' | 'false' ;
WS : [ \t\r\n]+ -> skip ;
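The difference between a composite lexer rule and lexer "atoms" can be sketched in Python. This is an approximation under assumptions: `BODY_AS_TOKEN` is a regex stand-in for the non-nested case of the original BODY lexer rule, and `atoms` is a hypothetical helper that mimics the small lexer rules plus the skipped WS.

```python
import re

# Rough equivalent of the lexer rule
#   BODY : TERMSPAN COMMA TERMSPAN COMMA SLOP COMMA ORDERED ;
# for the non-nested case: one token, so no whitespace can appear inside it.
BODY_AS_TOKEN = re.compile(r"xxx,xxx,[0-9]+,(?:true|false)")

# The "atoms": whitespace is matched but dropped, everything else is kept.
ATOM = re.compile(r"[ \t\r\n]+|(termspannear|xxx|true|false|[0-9]+|[(),])")


def is_single_body_token(text):
    return BODY_AS_TOKEN.fullmatch(text) is not None


def atoms(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = ATOM.match(text, pos)
        if m is None:
            raise ValueError(f"no atom matches at position {pos}")
        if m.group(1) is not None:  # group 1 is empty for skipped whitespace
            tokens.append(m.group(1))
        pos = m.end()
    return tokens


print(is_single_body_token("xxx,xxx,5,true"))
print(is_single_body_token("xxx, xxx , 5 , true"))
print(atoms("termspannear ( xxx, xxx , 5 , true )"))
```

The composite pattern accepts only the space-free spelling, while the atom-based split yields the same ten tokens regardless of spacing and leaves the grouping to the parser rules.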