eliminate extra spaces in a given ANTLR grammar


For any grammar I create in ANTLR, is it possible to parse the input and have the parser eliminate any extra spaces in it? A simple example:
int x=5;
If I write
int x = 5 ;
I would like the text to change to int x=5; without the extra spaces. Can the parser return the original text without extra spaces?

Can the parser return the original text without extra spaces?
Yes, you need to define a lexer rule that captures these spaces and then skip() them:
Space
: (' ' | '\t') {skip();}
;
which will cause spaces and tabs to be ignored.
PS. I'm assuming you're using Java as the target language. The skip() can be different in other targets (Skip() for C#, for example). You may also want to include \r and \n chars in this rule.
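To see what skipping does without running ANTLR at all, here is a minimal hand-written scanner in plain Java. It is only a sketch of the idea: the class and method names (SkipSpacesSketch, tokenize) are made up for illustration and are not part of ANTLR, and the token categories are deliberately crude.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ANTLR: a hand-written scanner that drops
// whitespace the way a skipped lexer rule does.
final class SkipSpacesSketch {

    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
                i++; // equivalent of skip(): consume the character, emit nothing
            } else if (Character.isLetterOrDigit(c) || c == '_') {
                int start = i;
                while (i < source.length()
                        && (Character.isLetterOrDigit(source.charAt(i)) || source.charAt(i) == '_')) {
                    i++;
                }
                tokens.add(source.substring(start, i)); // identifier, keyword or number
            } else {
                tokens.add(String.valueOf(c)); // single-character token such as '=' or ';'
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("int x  =  5 ;")); // prints [int, x, =, 5, ;]
    }
}
```

Because the whitespace never reaches the token list, anything that later prints the tokens decides the spacing itself.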
EDIT
Let's say your language only consists of a couple of variable declarations. Assuming you know the basics of ANTLR, the following grammar should be easy to understand:
grammar T;
parse
: stat* EOF
;
stat
: Type Identifier '=' Int ';'
;
Type
: 'int'
| 'double'
| 'boolean'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
Int
: '0'..'9'+
;
Space
: (' ' | '\t' | '\n' | '\r')+ {skip();}
;
And you're parsing the source:
int x = 5 ; double y =5;boolean z = 0 ;
which you'd like to change into:
int x=5;
double y=5;
boolean z=0;
Here's a way to embed code in your grammar and let the parser rules return custom objects (Strings, in this case):
grammar T;
parse returns [String str]
@init{StringBuilder buffer = new StringBuilder();}
@after{$str = buffer.toString();}
: (stat {buffer.append($stat.str).append('\n');})* EOF
;
stat returns [String str]
: Type Identifier '=' Int ';'
{$str = $Type.text + " " + $Identifier.text + "=" + $Int.text + ";";}
;
Type
: 'int'
| 'double'
| 'boolean'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
Int
: '0'..'9'+
;
Space
: (' ' | '\t' | '\n' | '\r')+ {skip();}
;
Test it with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source = "int x = 5 ; double y =5;boolean z = 0 ;";
ANTLRStringStream in = new ANTLRStringStream(source);
TLexer lexer = new TLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
TParser parser = new TParser(tokens);
System.out.println("Result:\n"+parser.parse());
}
}
which produces:
Result:
int x=5;
double y=5;
boolean z=0;
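For comparison, roughly the same normalization can be sketched in plain Java with a regular expression standing in for the lexer. This is not the ANTLR approach from the answer, just an illustration of the "tokenize, then print with your own spacing" idea. The class name DeclarationFormatterSketch is hypothetical, and the sketch assumes the input is well-formed (it does no error reporting).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the generated lexer and parser; assumes valid input.
final class DeclarationFormatterSketch {

    // Identifier or keyword, integer, or one of the two punctuation tokens.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_][A-Za-z_0-9]*|[0-9]+|[=;]");

    static String format(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            tokens.add(m.group()); // whitespace between matches is simply never collected
        }
        StringBuilder out = new StringBuilder();
        // Every statement is exactly five tokens: Type Identifier '=' Int ';'
        for (int i = 0; i + 4 < tokens.size(); i += 5) {
            out.append(tokens.get(i)).append(' ')
               .append(tokens.get(i + 1)).append('=')
               .append(tokens.get(i + 3)).append(";\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(format("int x = 5 ; double y =5;boolean z = 0 ;"));
    }
}
```

The grammar-based version is the better choice once the language grows, since the fixed five-token assumption here breaks as soon as a second statement form is added.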

Related

Antlr4 ignoring newlines at all but one point

I am writing a parser for a scripting language, and using antlr 4.5.3 for the purpose.
grammar VSE;
chunk
: block* EOF
;
block
: var '=' exp
| functioncall
;
var
: NAME
| var '[' exp ']'
| var '.' var
;
exp
: number
| string
| var
| functioncall
| <assoc=right> exp exp //concat
;
functioncall
: NAME '(' (exp)? (',' exp)* ')'
| var '.' functioncall
;
string
: '"' (~('"' | '\\' | '\r' | '\n') | '\\' ('"' | '\\'))* '"'
;
NAME
: [a-zA-Z_][a-zA-Z_0-9]*
;
number
: INT | HEX | FLOAT
;
INT
: Digit+
;
HEX
: '0' [xX] [0-9a-fA-F]+
;
FLOAT
: Digit* '.' Digit+
;
Digit
: [0-9]
;
WS
: [ \t\u000C\r\n]+ -> skip
;
However, while testing it, I found that a variable assignment like var = something followed by a function call on the next line is parsed as a concat statement. (My concat statement is one variable followed by another, like var = var1 var2.) I understand that ANTLR is skipping ALL the newlines in favor of line continuation, but I'd like to add the condition that if there is a newline between two exps, they are treated as two separate blocks instead of a concat statement, i.e.
var = var2
functioncall(var)
These should be two separate blocks instead of concat statement.
Is there any way to do this?

Would the following rules work for you?
block
: var '=' exp NEW_LINE
| functioncall NEW_LINE
;
NEW_LINE: '\r'? '\n' ;
WS
: [ \t]+ -> skip
;
Otherwise you would have to use semantic predicates or a much less readable grammar.
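The effect of keeping line breaks as tokens while still discarding spaces and tabs can be illustrated outside ANTLR. The following plain-Java sketch is an assumption-laden simplification: "tokens" are just whitespace-separated words, the end of input also closes a block (the grammar above would require a trailing NEW_LINE), and the class name NewlineBlocksSketch is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: spaces and tabs are dropped, line breaks end a block.
final class NewlineBlocksSketch {

    static List<List<String>> blocks(String source) {
        List<List<String>> result = new ArrayList<>();
        for (String line : source.split("\r?\n")) {      // NEW_LINE: '\r'? '\n'
            List<String> current = new ArrayList<>();
            for (String token : line.trim().split("[ \t]+")) { // WS: [ \t]+ -> skip
                if (!token.isEmpty()) {
                    current.add(token);
                }
            }
            if (!current.isEmpty()) {
                result.add(current); // blank lines produce no block
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [[var, =, var2], [functioncall(var)]]
        System.out.println(blocks("var = var2\nfunctioncall(var)"));
    }
}
```

With the line break acting as a terminator, var2 and functioncall(var) can no longer be glued together into one concat expression.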

Using Antlr to parse formulas with multiple locales

I'm very new to Antlr, so forgive what may be a very easy question.
I am creating a grammar which parses Excel-like formulas and it needs to support multiple locales based on the list separator (, for en-US) and decimal separator (. for en-US). I would prefer not to choose between separate grammars to parse with based on locale.
Can I modify or inherit from the CommonTokenStream class to accomplish this, or is there another way to do this? Examples would be helpful.
I am using the Antlr v4.5.0-alpha003 NuGet package in my VS2015 C# project.

What you can do is add a locale (or custom separator and grouping characters) to your lexer, and put a semantic predicate in front of the lexer rules that inspects these characters, so that the tokens are matched dynamically.
I don't have ANTLR and C# running here, but the Java demo should be pretty similar:
grammar LocaleDemo;
@lexer::header {
import java.text.DecimalFormatSymbols;
import java.util.Locale;
}
@lexer::members {
private char decimalSeparator = '.';
private char groupingSeparator = ',';
public LocaleDemoLexer(CharStream input, Locale locale) {
this(input);
DecimalFormatSymbols dfs = new DecimalFormatSymbols(locale);
this.decimalSeparator = dfs.getDecimalSeparator();
this.groupingSeparator = dfs.getGroupingSeparator();
}
}
parse
: .*? EOF
;
NUMBER
: D D? ( DG D D D )* ( DS D+ )?
;
OTHER
: .
;
fragment D : [0-9];
fragment DS : {_input.LA(1) == decimalSeparator}? . ;
fragment DG : {_input.LA(1) == groupingSeparator}? . ;
To test the grammar above, run this class:
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;
import java.util.Locale;
public class Main {
private static void tokenize(String input, Locale locale) {
LocaleDemoLexer lexer = new LocaleDemoLexer(new ANTLRInputStream(input), locale);
System.out.printf("\ninput='%s', locale=%s, tokens:\n", input, locale);
for (Token t : lexer.getAllTokens()) {
System.out.printf(" %-10s '%s'\n", LocaleDemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
public static void main(String[] args) throws Exception {
tokenize("1.23", Locale.ENGLISH);
tokenize("1.23", Locale.GERMAN);
tokenize("12.345.678,90", Locale.ENGLISH);
tokenize("12.345.678,90", Locale.GERMAN);
}
}
which would print:
input='1.23', locale=en, tokens:
NUMBER '1.23'
input='1.23', locale=de, tokens:
NUMBER '1'
OTHER '.'
NUMBER '23'
input='12.345.678,90', locale=en, tokens:
NUMBER '12.345'
OTHER '.'
NUMBER '67'
NUMBER '8'
OTHER ','
NUMBER '90'
input='12.345.678,90', locale=de, tokens:
NUMBER '12.345.678,90'
Related Q&A's:
What is a 'semantic predicate' in ANTLR?
What does "fragment" mean in ANTLR?
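The separator lookup from the lexer constructor can also be tried on its own. The sketch below builds a java.util.regex pattern with the same shape as the NUMBER rule, D D? ( DG D D D )* ( DS D+ )?, from a locale's DecimalFormatSymbols. It is only an approximation for experimenting: a regex is swapped in for the predicated lexer rules, and the names LocaleNumberSketch, numberPattern and isNumber are hypothetical.

```java
import java.text.DecimalFormatSymbols;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical regex counterpart of the NUMBER lexer rule, for quick experiments.
final class LocaleNumberSketch {

    // Shape of the rule: D D? ( DG D D D )* ( DS D+ )?
    static Pattern numberPattern(Locale locale) {
        DecimalFormatSymbols dfs = new DecimalFormatSymbols(locale);
        String ds = Pattern.quote(String.valueOf(dfs.getDecimalSeparator()));
        String dg = Pattern.quote(String.valueOf(dfs.getGroupingSeparator()));
        return Pattern.compile("[0-9][0-9]?(?:" + dg + "[0-9]{3})*(?:" + ds + "[0-9]+)?");
    }

    // True only if the whole text is one number in the given locale.
    static boolean isNumber(String text, Locale locale) {
        return numberPattern(locale).matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNumber("12.345.678,90", Locale.GERMAN));  // true
        System.out.println(isNumber("12.345.678,90", Locale.ENGLISH)); // false
    }
}
```

This mirrors the token listing above: the German-formatted input is a single number only under the German locale.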

As a follow-up to Bart's answer, this is the grammar I created with his suggestions:
grammar ExcelScript;
@lexer::header
{
using System;
using System.Globalization;
}
@lexer::members
{
private Int32 listseparator = 44; // UTF16 value for comma
private Int32 decimalseparator = 46; // UTF16 value for period
/// <summary>
/// Creates a new lexer object
/// </summary>
/// <param name="input">The input stream</param>
/// <param name="locale">The locale to use in parsing numbers</param>
/// <returns>A new lexer object</returns>
public ExcelScriptLexer (ICharStream input, CultureInfo locale)
: this(input)
{
this.listseparator = Convert.ToInt32(locale.TextInfo.ListSeparator[0]);
this.decimalseparator = Convert.ToInt32(locale.NumberFormat.NumberDecimalSeparator[0]);
// special case for 8 locales where the list separator is a , and the number separator is a , too
// Excel uses semicolon for list separator, so we will too
if (this.listseparator == 44 && this.decimalseparator == 44)
this.listseparator = 59; // UTF16 value for semicolon
}
}
/*
* Parser Rules
*/
formula
: numberLiteral
| Identifier
| '=' expression
;
expression
: primary # PrimaryExpression
| Identifier arguments # FunctionCallExpression
| ('+' | '-') expression # UnarySignExpression
| expression ('*' | '/' | '%') expression # MulDivModExpression
| expression ('+' | '-') expression # AddSubExpression
| expression ('<=' | '>=' | '>' | '<') expression # CompareExpression
| expression ('=' | '<>') expression # EqualCompareExpression
;
primary
: '(' expression ')' # ParenExpression
| literal # LiteralExpression
| Identifier # IdentifierExpression
;
literal
: numberLiteral # NumberLiteralRule
| booleanLiteral # BooleanLiteralRule
;
numberLiteral
: IntegerLiteral
| FloatingPointLiteral
;
booleanLiteral
: TrueKeyword
| FalseKeyword
;
arguments
: '(' expressionList? ')'
;
expressionList
: expression (ListSeparator expression)*
;
/*
* Lexer Rules
*/
AddOperator : '+' ;
SubOperator : '-' ;
MulOperator : '*' ;
DivOperator : '/' ;
PowOperator : '^' ;
EqOperator : '=' ;
NeqOperator : '<>' ;
LeOperator : '<=' ;
GeOperator : '>=' ;
LtOperator : '<' ;
GtOperator : '>' ;
ListSeparator : {_input.La(1) == listseparator}? . ;
DecimalSeparator : {_input.La(1) == decimalseparator}? . ;
TrueKeyword : [Tt][Rr][Uu][Ee] ;
FalseKeyword : [Ff][Aa][Ll][Ss][Ee] ;
Identifier
: Letter (Letter | Digit)*
;
fragment Letter
: [A-Z_a-z]
;
fragment Digit
: [0-9]
;
IntegerLiteral
: '0'
| [1-9] [0-9]*
;
FloatingPointLiteral
: [0-9]+ DecimalSeparator [0-9]* Exponent?
| DecimalSeparator [0-9]+ Exponent?
| [0-9]+ Exponent
;
fragment Exponent
: ('e' | 'E') ('+' | '-')? ('0'..'9')+
;
WhiteSpace
: [ \t]+ -> channel(HIDDEN)
;

Precedence in Antlr using parentheses

We are developing a DSL, and we're facing some problems:
Problem 1:
In our DSL, it's allowed to do this:
A + B + C
but not this:
A + B - C
If the user needs to use two or more different operators, he'll need to insert parentheses:
A + (B - C) or (A + B) - C.
Problem 2:
In our DSL, the operator with the highest precedence must be surrounded by parentheses.
For example, instead of using this way:
A + B * C
The user needs to use this:
A + (B * C)
To solve the Problem 1 I've got a snippet of ANTLR that worked, but I'm not sure if it's the best way to solve it:
sumExpr
@init {boolean isSum=false;boolean isSub=false;}
: {isSum(input.LT(2).getText()) && !isSub}? multExpr('+'^{isSum=true;} sumExpr)+
| {isSub(input.LT(2).getText()) && !isSum}? multExpr('-'^{isSub=true;} sumExpr)+
| multExpr;
To solve Problem 2, I have no idea where to start.
I would appreciate help finding a better solution to the first problem and a direction for solving the second one. (Sorry for my bad English.)
Below is the grammar that we have developed:
grammar TclGrammar;
options {
output=AST;
ASTLabelType=CommonTree;
}
@members {
public boolean isSum(String type) {
System.out.println("Tipo: " + type);
return "+".equals(type);
}
public boolean isSub(String type) {
System.out.println("Tipo: " + type);
return "-".equals(type);
}
}
prog
: exprMain ';' {System.out.println(
$exprMain.tree == null ? "null" : $exprMain.tree.toStringTree());}
;
exprMain
: exprQuando? (exprDeveSatis | exprDeveFalharCaso)
;
exprDeveSatis
: 'DEVE SATISFAZER' '{'! expr '}'!
;
exprDeveFalharCaso
: 'DEVE FALHAR CASO' '{'! expr '}'!
;
exprQuando
: 'QUANDO' '{'! expr '}'!
;
expr
: logicExpr
;
logicExpr
: boolExpr (('E'|'OU')^ boolExpr)*
;
boolExpr
: comparatorExpr
| emExpr
| 'VERDADE'
| 'FALSO'
;
emExpr
: FIELD 'EM' '[' (variable_lista | field_lista) comCruzamentoExpr? ']'
-> ^('EM' FIELD (variable_lista+)? (field_lista+)? comCruzamentoExpr?)
;
comCruzamentoExpr
: 'COM CRUZAMENTO' '(' FIELD ';' FIELD (';' FIELD)* ')' -> ^('COM CRUZAMENTO' FIELD+)
;
comparatorExpr
: sumExpr (('<'^|'<='^|'>'^|'>='^|'='^|'<>'^) sumExpr)?
| naoPreenchidoExpr
| preenchidoExpr
;
naoPreenchidoExpr
: FIELD 'NAO PREENCHIDO' -> ^('NAO PREENCHIDO' FIELD)
;
preenchidoExpr
: FIELD 'PREENCHIDO' -> ^('PREENCHIDO' FIELD)
;
sumExpr
@init {boolean isSum=false;boolean isSub=false;}
: {isSum(input.LT(2).getText()) && !isSub}? multExpr('+'^{isSum=true;} sumExpr)+
| {isSub(input.LT(2).getText()) && !isSum}? multExpr('-'^{isSub=true;} sumExpr)+
| multExpr
;
multExpr
: funcExpr(('*'^|'/'^) funcExpr)?
;
funcExpr
: 'QUANTIDADE'^ '('! FIELD ')'!
| 'EXTRAI_TEXTO'^ '('! FIELD ';' INTEGER ';' INTEGER ')'!
| cruzaExpr
| 'COMBINACAO_UNICA' '(' FIELD ';' FIELD (';' FIELD)* ')' -> ^('COMBINACAO_UNICA' FIELD+)
| 'EXISTE'^ '('! FIELD ')'!
| 'UNICO'^ '('! FIELD ')'!
| atom
;
cruzaExpr
: operadorCruzaExpr ('CRUZA COM'^|'CRUZA AMBOS'^) operadorCruzaExpr ondeExpr?
;
operadorCruzaExpr
: FIELD('('!field_lista')'!)?
;
ondeExpr
: ('ONDE'^ '('!expr')'!)
;
atom
: FIELD
| VARIABLE
| '('! expr ')'!
;
field_lista
: FIELD(';' field_lista)?
;
variable_lista
: VARIABLE(';' variable_lista)?
;
FIELD
: NONCONTROL_CHAR+
;
NUMBER
: INTEGER | FLOAT
;
VARIABLE
: '\'' NONCONTROL_CHAR+ '\''
;
fragment SIGN: '+' | '-';
fragment NONCONTROL_CHAR: LETTER | DIGIT | SYMBOL;
fragment LETTER: LOWER | UPPER;
fragment LOWER: 'a'..'z';
fragment UPPER: 'A'..'Z';
fragment DIGIT: '0'..'9';
fragment SYMBOL: '_' | '.' | ',';
fragment FLOAT: INTEGER '.' '0'..'9'*;
fragment INTEGER: '0' | SIGN? '1'..'9' '0'..'9'*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {skip();}
;

This is similar to not having operator precedence at all.
expr
: funcExpr
( ('+' funcExpr)*
| ('-' funcExpr)*
| ('*' funcExpr)*
| ('/' funcExpr)*
)
;

I think the following should work. I'm assuming some lexer tokens with obvious names.
expr: sumExpr;
sumExpr: onlySum | subExpr;
onlySum: atom ( PLUS onlySum )?;
subExpr: onlySub | multExpr;
onlySub: atom ( MINUS onlySub )? ;
multExpr: atom ( STAR atom )? ;
parenExpr: OPEN_PAREN expr CLOSE_PAREN;
atom: FIELD | VARIABLE | parenExpr;
The only* rules match an expression if it only has one type of operator outside of parentheses. The *Expr rules match either a line with the appropriate type of operators or go to the next operator.
If you have multiple types of operators, then they are forced to be inside parentheses because the match will go through atom.
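The rule "one kind of operator per parenthesis level" can be checked with a small recursive-descent routine, which may help when testing a grammar like the one above. This is a sketch under stated assumptions, not the grammar from the answer: operands are plain alphanumeric names, all whitespace is removed up front, any single operator may repeat in a chain, and the class name SingleOperatorChecker is made up.

```java
// Hypothetical validator: accepts an expression only if each parenthesis level
// uses a single kind of operator.
final class SingleOperatorChecker {
    private final String src;
    private int pos;

    private SingleOperatorChecker(String src) {
        this.src = src.replaceAll("\\s+", ""); // simplification: whitespace is dropped
    }

    static boolean isValid(String expression) {
        SingleOperatorChecker c = new SingleOperatorChecker(expression);
        return c.expr() && c.pos == c.src.length();
    }

    // expr : atom (op atom)*   where every op in the chain must be the same
    private boolean expr() {
        if (!atom()) return false;
        char chainOp = 0;
        while (pos < src.length() && "+-*/".indexOf(src.charAt(pos)) >= 0) {
            char op = src.charAt(pos++);
            if (chainOp == 0) chainOp = op;
            else if (op != chainOp) return false; // mixed operators need parentheses
            if (!atom()) return false;
        }
        return true;
    }

    // atom : NAME | '(' expr ')'
    private boolean atom() {
        if (pos < src.length() && src.charAt(pos) == '(') {
            pos++;
            if (!expr()) return false;
            if (pos >= src.length() || src.charAt(pos) != ')') return false;
            pos++;
            return true;
        }
        int start = pos;
        while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
        return pos > start;
    }

    public static void main(String[] args) {
        System.out.println(isValid("A + B + C"));   // true
        System.out.println(isValid("A + B * C"));   // false
        System.out.println(isValid("A + (B * C)")); // true
    }
}
```

Since no operator ever binds tighter than another, the same check covers both problems from the question: A + B - C and A + B * C are rejected for the same reason.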

ANTLR Template translator match part of grammar

I wrote a grammar for a language, and now I want to handle some syntactic-sugar constructs. For that, I was thinking of writing a template translator.
The problem is I want my template grammar to translate only some constructions of the language and leave the rest as it is.
For example:
I have this as input:
class Main {
int a[10];
}
and I want to translate that into something like:
class Main {
Array a = new Array(10);
}
Ideally I would like to do something like this in ANTLR:
grammar Translator;
options { output=template;}
decl
: TYPE ID '[' INT ']' -> template(name = {$ID.text}, size ={$INT.text})
"Array <name> = new Array(<size>);"
;
I would like it to leave the rest of the input that doesn't match rule decl as it is.
How can I achieve this in ANTLR without writing the full grammar for the language ?

I would simply handle such things in the parser grammar.
Assuming you're constructing an AST in your parser grammar, I guess you'll have a rule to parse input like Array a = new Array(10); similar to:
decl
: TYPE ID '=' expr ';' -> ^(DECL TYPE ID expr)
;
where expr eventually matches a term like this:
term
: NUMBER
| 'new' ID '(' (expr (',' expr)*)? ')' -> ^('new' ID expr*)
| ...
;
To account for your short-hand declaration int a[10];, all you have to do is expand decl like this:
decl
: TYPE ID '=' expr ';' -> ^(DECL TYPE ID expr)
| TYPE ID '[' expr ']' ';' -> ^(DECL 'Array' ID ^(NEW ARRAY expr))
;
which will rewrite the input int a[10]; into an AST that is exactly the same as the AST created for input Array a = new Array(10);.
EDIT
Here's a small working demo:
grammar T;
options {
output=AST;
}
tokens {
ROOT;
DECL;
NEW='new';
INT='int';
ARRAY='Array';
}
parse
: decl+ EOF -> ^(ROOT decl+)
;
decl
: type ID '=' expr ';' -> ^(DECL type ID expr)
| type ID '[' expr ']' ';' -> ^(DECL ARRAY ID ^(NEW ARRAY expr))
;
expr
: Number
| NEW type '(' (expr (',' expr)*)? ')' -> ^(NEW type expr*)
;
type
: INT
| ARRAY
| ID
;
ID : ('a'..'z' | 'A'..'Z')+;
Number : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n') {skip();};
which can be tested with the class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "Array a = new Array(10); int a[10];";
TLexer lexer = new TLexer(new ANTLRStringStream(src));
TParser parser = new TParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
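If writing even a partial grammar is more than you need, the one construct from the question can be rewritten with a regular expression while everything else passes through untouched. To be clear, this swaps in a different technique (plain text substitution, not the AST rewrite shown above), and it is only a sketch: it assumes the element type and name are plain letters, it will also fire inside comments and string literals, and the class name ArraySugarRewriter is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical text-level rewriter; not a parser, so it cannot tell code from comments.
final class ArraySugarRewriter {

    // TYPE ID '[' INT ']' ';' with optional whitespace between the tokens.
    // Group 1 is the variable name, group 2 is the size; the element type is discarded.
    private static final Pattern DECL =
            Pattern.compile("\\b[A-Za-z]+\\s+([A-Za-z]+)\\s*\\[\\s*([0-9]+)\\s*\\]\\s*;");

    static String rewrite(String source) {
        return DECL.matcher(source).replaceAll("Array $1 = new Array($2);");
    }

    public static void main(String[] args) {
        System.out.println(rewrite("class Main {\n    int a[10];\n}"));
    }
}
```

The grammar-based rewrite remains the more robust option, because it only fires where a declaration is actually allowed.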

ANTLR parsing Java Properties

I'm trying to pick up ANTLR by writing a grammar for Java Properties. I've hit an issue and would appreciate some help.
Java Properties has somewhat strange escape handling. For example,
key1=1=Key1
key\=2==
results in key-value pairs in Java runtime as
KEY VALUE
=== =====
key1 1=Key1
key=2 =
So far, this is the closest I can get to mimicking it: folding the '=' and the value into one single token.
grammar Prop;
file : (pair | LINE_COMMENT)* ;
pair : ID VALUE ;
ID : (~('='|'\r'|'\n') | '\\=')* ;
VALUE : '=' (~('\r'|'\n'))*;
CARRIAGE_RETURN
: ('\r'|'\n') + {$channel=HIDDEN;}
;
LINE_COMMENT
: '#' ~('\r'|'\n')* ('\r'|'\n'|EOF)
;
Are there any suggestions for a better implementation?
Thanks a lot

It's not as easy as that. You can't handle much at the lexing level because many things depend on a certain context. So at the lexing level, you can only match single characters and construct key and values in parser rules. Also, the = and : as possible key-value separators and the fact that these characters can be the start of a value, makes them a pain in the butt to translate into a grammar. The easiest would be to include these (possible) separator chars in your value-rule and after matching the separator and value together, strip the separator chars from it.
A small demo:
JavaProperties.g
grammar JavaProperties;
parse
: line* EOF
;
line
: Space* keyValue
| Space* Comment eol
| Space* LineBreak
;
keyValue
: key separatorAndValue eol
{
// Replace all escaped `=` and `:`
String k = $key.text.replace("\\:", ":").replace("\\=", "=");
// Remove the separator, if it exists
String v = $separatorAndValue.text.replaceAll("^\\s*[:=]\\s*", "");
// Remove all escaped line breaks with trailing spaces
v = v.replaceAll("\\\\(\r?\n|\r)[ \t\f]*", "").trim();
System.out.println("\nkey : `" + k + "`");
System.out.println("value : `" + v + "`");
}
;
key
: keyChar+
;
keyChar
: AlphaNum
| Backslash (Colon | Equals)
;
separatorAndValue
: (Space | Colon | Equals) valueChar+
;
valueChar
: AlphaNum
| Space
| Backslash LineBreak
| Equals
| Colon
;
eol
: LineBreak
| EOF
;
Backslash : '\\';
Colon : ':';
Equals : '=';
Comment
: ('!' | '#') ~('\r' | '\n')*
;
LineBreak
: '\r'? '\n'
| '\r'
;
Space
: ' '
| '\t'
| '\f'
;
AlphaNum
: 'a'..'z'
| 'A'..'Z'
| '0'..'9'
;
The grammar above can be tested with the class:
Main.java
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRFileStream("test.properties");
JavaPropertiesLexer lexer = new JavaPropertiesLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
JavaPropertiesParser parser = new JavaPropertiesParser(tokens);
parser.parse();
}
}
and the input file:
test.properties
key1 = value 1
key2:value 2
key3 :value3
ke\:\=y4=v\
a\
l\
u\
e 4
key\=5==
key6 value6
to produce the following output:
key : `key1`
value : `value 1`
key : `key2`
value : `value 2`
key : `key3`
value : `value3`
key : `ke:=y4`
value : `value 4`
key : `key=5`
value : `=`
key : `key6`
value : `value6`
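The string clean-up done inside the keyValue action can be pulled out and exercised without generating a parser. The sketch below copies the three replacement steps from that action into two static methods; the class and method names (PropertyTextCleanup, cleanKey, cleanValue) are made up for illustration.

```java
// Hypothetical extraction of the post-processing from the keyValue rule's action.
final class PropertyTextCleanup {

    // Replace all escaped ':' and '=' inside a key.
    static String cleanKey(String rawKey) {
        return rawKey.replace("\\:", ":").replace("\\=", "=");
    }

    // Remove the leading separator (if any), then drop escaped line breaks
    // together with the whitespace that starts the continuation line.
    static String cleanValue(String rawSeparatorAndValue) {
        String v = rawSeparatorAndValue.replaceAll("^\\s*[:=]\\s*", "");
        return v.replaceAll("\\\\(\r?\n|\r)[ \t\f]*", "").trim();
    }

    public static void main(String[] args) {
        System.out.println(cleanKey("ke\\:\\=y4"));                     // ke:=y4
        System.out.println(cleanValue("=v\\\n  a\\\n  l\\\n  u\\\n  e 4")); // value 4
    }
}
```

Only the first separator character is stripped, which is why the text == after key\=5 is left with the value =.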
Realize that my grammar is just an example: it does not account for all valid properties files (sometimes backslashes should be ignored, there are no Unicode escapes, and many characters are missing in the key and value). For a complete specification of properties files, see:
http://download.oracle.com/javase/6/docs/api/java/util/Properties.html#load%28java.io.Reader%29
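When extending the grammar, java.util.Properties itself can serve as a reference to compare against. The following sketch loads the two lines from the question through Properties.load(Reader); the wrapper class PropertiesOracle and its load method are hypothetical names, and the checked IOException is wrapped only to keep the example short.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper: uses the JDK's own loader as the expected behaviour.
final class PropertiesOracle {

    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory reader
        }
        return props;
    }

    public static void main(String[] args) {
        Properties p = load("key1=1=Key1\nkey\\=2==\n");
        System.out.println(p.getProperty("key1"));  // 1=Key1
        System.out.println(p.getProperty("key=2")); // =
    }
}
```

Feeding the same file to both the generated parser and this helper makes it easy to spot inputs the grammar does not handle yet.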
