ANTLR: do not match if a certain character follows

The following code is completely valid in the V programming language:
fn main() {
a := 1.
b := .1
println("$a $b")
for i in 0..10 {
println(i)
}
}
I want to write a lexer for syntax coloring of such files. 1. and .1 should be matched by the FloatNumber fragment, while the .. in the for loop should be matched by a punctuation rule. The problem is that my FloatNumber implementation already matches 0. and .10 from the 0..10, and I have no idea how to tell it not to match if a . follows (or precedes) it. Slightly simplified (leaving possible underscores aside), my grammar looks like this:
fragment FloatNumber
: ( Digit+ ('.' Digit*)? ([eE] [+-]? Digit+)?
| Digit* '.' Digit+ ([eE] [+-]? Digit+)?
)
;
fragment Digit
: [0-9]
;

You will have to introduce a predicate that checks that no . follows when matching a float that ends in a dot, such as 1. in the input 1..10.
The following rules:
Plus
: '+'
;
FloatLiteral
: Digit+ '.' {_input.LA(1) != '.'}?
| Digit* '.' Digit+
;
Int
: Digit+
;
Range
: '..'
;
given the input "1.2 .3 4. 5 6..7 8.+9", will produce the following tokens:
FloatLiteral `1.2`
FloatLiteral `.3`
FloatLiteral `4.`
Int `5`
Int `6`
Range `..`
Int `7`
FloatLiteral `8.`
Plus `+`
Int `9`
Code inside a predicate is target specific. The predicate above ({_input.LA(1) != '.'}?) works with the Java target.
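To experiment with the lookahead idea outside of ANTLR, a minimal Python sketch may help. It is not ANTLR-generated code: the regular expression, the tokenize helper, and the whitespace skipping are assumptions made for illustration. The negative lookahead (?!\.) plays the role of the {_input.LA(1) != '.'}? predicate.

```python
import re

# Alternatives are tried in order, so FloatLiteral comes before Int.
# (?!\.) refuses to accept "digits + dot" when another dot follows,
# which leaves "6..7" to be split into Int, Range, Int.
TOKEN_RE = re.compile(r"""
    (?P<FloatLiteral> \d+\.(?!\.)\d* | \.\d+ )
  | (?P<Int>   \d+ )
  | (?P<Range> \.\. )
  | (?P<Plus>  \+ )
  | (?P<WS>    \s+ )   # assumption: whitespace is skipped
""", re.VERBOSE)


def tokenize(text):
    """Return a list of (token_type, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Calling tokenize("1.2 .3 4. 5 6..7 8.+9") should yield the same token sequence as listed above, with 6..7 split into Int, Range, Int.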

Related

Antlr 4.6.1 not generating errorNodes for inputstream

I have a simple grammar like :
grammar CellMath;
equation : expr EOF;
expr
: '-'expr #UnaryNegation // unary minus
| expr op=('*'|'/') expr #MultiplicativeOp // MultiplicativeOperation
| expr op=('+'|'-') expr #AdditiveOp // AdditiveOperation
| FLOAT #Float // Floating Point Number
| INT #Integer // Integer Number
| '(' expr ')' #ParenExpr // Parenthesized Expression
;
MUL : '*' ;
DIV : '/' ;
ADD : '+' ;
SUB : '-' ;
FLOAT
: DIGIT+ '.' DIGIT*
| '.' DIGIT+
;
INT : DIGIT+ ;
fragment
DIGIT : [0-9] ; // match single digit
//fragment
//ATSIGN : [#];
WS : [ \t\r\n]+ -> skip ;
ERRORCHAR : . ;
I am not able to get an exception when a special character appears in the middle of an expression, i.e. for input of the form
[{Number}{SPLChar}{Chars}]
For example:
"123#abc",
"123&abc".
I expect an exception to be thrown.
For the input stream 123#abc I expect the same error that ANTLR Lab reports for this input.
But in my case the output is '123' without any errors.
I'm using the listener pattern. Error nodes are simply ignored: they never reach VisitTerminal([NotNull] ITerminalNode node) or VisitErrorNode([NotNull] IErrorNode node) in the BaseListener class. I have also overridden all the methods of the BaseErrorListener class, and none of them is called either.
Thanks in advance for your help.

Token matching, but it shouldn't

I am trying to parse Windows header files to extract function prototypes. Microsoft being Microsoft means that the function prototypes are not in a regular, easily parseable format. Routine arguments usually, but not always, are annotated with Microsoft's Structured Annotation Language, which starts with an identifier that begins and ends with an underscore, and may have an underscore in the middle. A SAL identifier may be followed by parentheses and contain a variety of compile-time checks, but I don't care about SAL stuff. Routines are generally annotated with an access specifier, which is usually something like WINAPI, APIENTRY, etc., but there may be more than one. There are cases where the arguments are specified only by their types, too. Sheesh!
My grammar looks like this:
//
// Parse C function declarations from a header file
//
grammar FuncDef;
//
// Parser rules
//
start :
func_def+
;
func_def :
'extern'? ret_type = IDENTIFIER access = access_spec routine = IDENTIFIER '(' arg_list* ')' ';'
;
sal_statement :
SAL_NAME SAL_EXPR?
;
access_spec :
('FAR' | 'PASCAL' | 'WINAPI' | 'APIENTRY' | 'WSAAPI' | 'WSPAPI')?
;
argument :
sal_statement type = IDENTIFIER is_pointer = '*'? arg = IDENTIFIER
;
arg_list :
argument (',' argument)*
;
hex_number :
'0x' HEX_DIGIT+
;
//
// Lexer rules
//
INTEGER : Digit+;
HEX_DIGIT : [a-fA-F0-9];
SAL_NAME : '_' Capital (Letter | '_')+? '_'; // Restricted form of IDENTIFIER, so it must be first
IDENTIFIER : Id_chars+;
SAL_EXPR : '(' ( ~( '(' | ')' ) | SAL_EXPR )* ')'; // We don't care about anything within a SAL expression, so eat everything within matched and nested parentheses
CPP_COMMENT : '//' .*? '\r'? '\n' -> channel (HIDDEN);
C_COMMENT : '/*' .*? '*/' -> channel (HIDDEN);
WS : [ \t\r\n]+ -> skip; // Ignore all whitespace
fragment Id_chars : Letter | Digit | '_' | '$';
fragment Capital : [A-Z];
fragment Letter : [a-zA-Z];
fragment Digit : [0-9];
I am using the TestRig, and providing the following input:
PVOID WINAPI routine ();
PVOID WINAPI routine (type param);
extern int PASCAL FAR __WSAFDIsSet(SOCKET fd, fd_set FAR *);
// comment
/*
Another comment*/
int
WSPAPI
WSCSetApplicationCategory(
_Out_writes_bytes_to_(nNumberOfBytesToRead, *lpNumberOfBytesRead) LPBYTE lpBuffer,
_In_ DWORD PathLength,
_In_reads_opt_(ExtraLength) LPCWSTR Extra,
_When_(pbCancel != NULL, _Pre_satisfies_(*pbCancel == FALSE))
DWORD ExtraLength,
_In_ DWORD PermittedLspCategories,
_Out_opt_ DWORD * pPrevPermLspCat,
_Out_ LPINT lpErrno
);
I'm getting this output:
[#0,0:4='PVOID',<IDENTIFIER>,1:0]
[#1,6:11='WINAPI',<'WINAPI'>,1:6]
[#2,13:19='routine',<IDENTIFIER>,1:13]
[#3,21:22='()',<SAL_EXPR>,1:21]
[#4,23:23=';',<';'>,1:23]
[#5,28:32='PVOID',<IDENTIFIER>,3:0]
[#6,34:39='WINAPI',<'WINAPI'>,3:6]
[#7,41:47='routine',<IDENTIFIER>,3:13]
[#8,49:60='(type param)',<SAL_EXPR>,3:21]
[#9,61:61=';',<';'>,3:33]
[#10,66:71='extern',<'extern'>,5:0]
[#11,73:75='int',<IDENTIFIER>,5:7]
[#12,77:82='PASCAL',<'PASCAL'>,5:11]
[#13,84:86='FAR',<'FAR'>,5:18]
[#14,88:99='__WSAFDIsSet',<IDENTIFIER>,5:22]
[#15,100:124='(SOCKET fd, fd_set FAR *)',<SAL_EXPR>,5:34]
[#16,125:125=';',<';'>,5:59]
[#17,130:141='// comment\r\n',<CPP_COMMENT>,channel=1,7:0]
[#18,142:162='/*\r\nAnother comment*/',<C_COMMENT>,channel=1,8:0]
[#19,167:169='int',<IDENTIFIER>,11:0]
[#20,172:177='WSPAPI',<'WSPAPI'>,12:0]
[#21,180:204='WSCSetApplicationCategory',<IDENTIFIER>,13:0]
[#22,205:568='(\r\n _Out_writes_bytes_to_(nNumberOfBytesToRead, *lpNumberOfBytesRead) LPBYTE lpBuffer,\r\n _In_ DWORD PathLength,\r\n _In_reads_opt_(ExtraLength) LPCWSTR Extra,\r\n _When_(pbCancel != NULL, _Pre_satisfies_(*pbCancel == FALSE))\r\nDWORD ExtraLength,\r\n _In_ DWORD PermittedLspCategories,\r\n _Out_opt_ DWORD * pPrevPermLspCat,\r\n _Out_ LPINT lpErrno\r\n )',<SAL_EXPR>,13:25]
[#23,569:569=';',<';'>,22:5]
[#24,572:571='<EOF>',<EOF>,23:0]
line 1:21 mismatched input '()' expecting '('
line 3:21 mismatched input '(type param)' expecting '('
line 5:18 extraneous input 'FAR' expecting IDENTIFIER
line 5:34 mismatched input '(SOCKET fd, fd_set FAR *)' expecting '('
line 13:25 mismatched input '(\r\n _Out_writes_bytes_to_(nNumberOfBytesToRead, *lpNumberOfBytesRead) LPBYTE lpBuffer,\r\n _In_ DWORD PathLength,\r\n _In_reads_opt_(ExtraLength) LPCWSTR Extra,\r\n _When_(pbCancel != NULL, _Pre_satisfies_(*pbCancel == FALSE))\r\nDWORD ExtraLength,\r\n _In_ DWORD PermittedLspCategories,\r\n _Out_opt_ DWORD * pPrevPermLspCat,\r\n _Out_ LPINT lpErrno\r\n )' expecting '('
What I don't understand is why is SAL_EXPR matching on lines 3 and 8? It should only match something if SAL_NAME matches first.
Why doesn't SAL_NAME match at line 22?
What I don't understand is why is SAL_EXPR matching on lines 3 and 8? It should only match something if SAL_NAME matches first.
The lexer doesn't know anything about parser rules; it operates on the input characters only. It cannot know that it "should only match something if SAL_NAME matches first".
The best approach is probably to keep this logic out of the lexer, i.e. to decide in the parser, not in the lexer, whether parenthesized input is a SAL expression or something else.
Here is the functioning grammar:
//
// Parse C function declarations from a header file
//
grammar FuncDef;
//
// Parser rules
//
start :
func_def+
;
func_def :
'extern'? ret_type = IDENTIFIER access = access_spec* routine = IDENTIFIER '(' arg_list* ')' ';'
;
sal_statement :
SAL_NAME sal_expr?
;
sal_expr :
'(' ( ~( '(' | ')' ) | sal_expr )* ')' // We don't care about anything within a SAL expression, so eat everything within matched and nested parentheses
;
access_spec :
('FAR' | 'PASCAL' | 'WINAPI' | 'APIENTRY' | 'WSAAPI' | 'WSPAPI')
;
argument :
sal_statement? type = IDENTIFIER access_spec? is_pointer = '*'? arg = IDENTIFIER?
| sal_statement
;
arg_list :
argument (',' argument)*
;
//
// Lexer rules
//
SAL_NAME : '_' Capital (Letter | '_')+ '_'; // Restricted form of IDENTIFIER, so it must be first
IDENTIFIER : Id_chars+;
CPP_COMMENT : '//' .*? '\r'? '\n' -> channel (HIDDEN);
C_COMMENT : '/*' .*? '*/' -> channel (HIDDEN);
OPERATORS : [&|=!><];
WS : [ \t\r\n]+ -> skip; // Ignore all whitespace
fragment Id_chars : Letter | Digit | '_' | '$';
fragment Capital : [A-Z];
fragment Letter : [a-zA-Z];
fragment Digit : [0-9];
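To see why a SAL_EXPR token rule swallows every argument list, it can help to imitate a context-free, longest-match lexer by hand. The following Python sketch is an illustration only, not ANTLR's implementation: tokenize, balanced_group_end, and the sal_expr_in_lexer flag are hypothetical names, and the rule set is reduced to identifiers, a few literals, and the parenthesized group.

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z0-9_$]+")
LITERALS = ("(", ")", ";", ",", "*")


def balanced_group_end(text, pos):
    """Return the index just past a balanced '(...)' group starting at pos, else None."""
    if not text.startswith("(", pos):
        return None
    depth = 0
    for i in range(pos, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return i + 1
    return None  # unbalanced parentheses


def tokenize(text, sal_expr_in_lexer):
    """Longest-match tokenizer; it has no idea which token the parser expects."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        candidates = []  # (end offset, token type) for every rule that matches here
        m = IDENTIFIER.match(text, pos)
        if m:
            candidates.append((m.end(), "IDENTIFIER"))
        for lit in LITERALS:
            if text.startswith(lit, pos):
                candidates.append((pos + len(lit), f"'{lit}'"))
        if sal_expr_in_lexer:
            end = balanced_group_end(text, pos)
            if end is not None:
                candidates.append((end, "SAL_EXPR"))
        if not candidates:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        end, kind = max(candidates, key=lambda c: c[0])  # the longest match wins
        tokens.append((kind, text[pos:end]))
        pos = end
    return tokens
```

With sal_expr_in_lexer=True, the input routine (type param); produces a single SAL_EXPR token for the whole argument list, mirroring the token dump in the question. With the flag set to False, the parentheses and the identifiers between them arrive as separate tokens, which is what the parser-level sal_expr rule relies on.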

Antlr4 Grammar for Function Application

I'm trying to write a simple lambda calculus grammar (shown below). The issue I am having is that function application seems to be treated as right associative instead of left associative, e.g. "f 1 2" is parsed as (f (1 2)) instead of ((f 1) 2). ANTLR has an assoc option for tokens, but I don't see how that helps here since there is no operator for function application. Does anyone see a solution?
LAMBDA : '\\';
DOT : '.';
OPEN_PAREN : '(';
CLOSE_PAREN : ')';
fragment ID_START : [A-Za-z+\-*/_];
fragment ID_BODY : ID_START | DIGIT;
fragment DIGIT : [0-9];
ID : ID_START ID_BODY*;
NUMBER : DIGIT+ (DOT DIGIT+)?;
WS : [ \t\r\n]+ -> skip;
parse : expr EOF;
expr : variable #VariableExpr
| number #ConstantExpr
| function_def #FunctionDefinition
| expr expr #FunctionApplication
| OPEN_PAREN expr CLOSE_PAREN #ParenExpr
;
function_def : LAMBDA ID DOT expr;
number : NUMBER;
variable : ID;
Thanks!
This breaks 4.1's pattern matcher for left recursion. I believe it has been cleaned up in the main branch; try downloading the latest master and building it. Currently 4.1 generates:
expr[int _p]
: ( {} variable
| number
| function_def
| OPEN_PAREN expr CLOSE_PAREN
)
(
{2 >= $_p}? expr
)*
;
for that rule. The expr reference in the loop is actually expr[0], which isn't right.
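For comparison, the intended left-associative behaviour can be sketched by hand as "one atom followed by zero or more atoms", folding to the left. This Python sketch is an assumption-laden illustration, not ANTLR's generated parser: parse, parse_expr, and parse_atom are hypothetical helpers, and only identifiers, numbers, and parentheses are handled (no lambda abstraction).

```python
def parse_atom(tokens, pos):
    """Parse a single variable/number or a parenthesized expression."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        node, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return node, pos + 1
    if tok == ")":
        raise SyntaxError("unexpected ')'")
    return tok, pos + 1


def parse_expr(tokens, pos=0):
    """expr : atom atom* ; each further atom is applied to everything parsed so far."""
    node, pos = parse_atom(tokens, pos)
    while pos < len(tokens) and tokens[pos] != ")":
        arg, pos = parse_atom(tokens, pos)
        node = ("app", node, arg)  # left fold: ((f 1) 2)
    return node, pos


def parse(source):
    """Split on whitespace and parentheses, then parse the whole input."""
    tokens = source.replace("(", " ( ").replace(")", " ) ").split()
    node, pos = parse_expr(tokens)
    if pos != len(tokens):
        raise SyntaxError("unexpected ')'")
    return node
```

Here parse("f 1 2") nests the first application inside the second, i.e. ((f 1) 2), which is the associativity the question asks for.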

Using floats in range and grammar mistakes?

I am trying to convert a LALR grammar to LL using ANTLR and I am running into a few problems. So far, converting the expressions to a top-down approach has been straightforward. The problem arises when I include ranges such as (1..10) and, with floats, (1.0..10.0).
I have tried to use the answer to the question linked below, but it does not even run correctly with my code, let alone handle a range of floats, i.e. (float..float).
Float literal and range parameter in ANTLR
Attached is a sample of my grammar that just focuses on this issue.
grammar Test;
options {
language = Java;
output = AST;
}
parse: 'in' rangeExpression ';'
;
rangeExpression : expression ('..' expression)?
;
expression : addingExpression (('=='|'!='|'<='|'<'|'>='|'>') addingExpression)*
;
addingExpression : multiplyingExpression (('+'|'-') multiplyingExpression)*
;
multiplyingExpression : unaryExpression
(('*'|'/'|'div') unaryExpression)*
;
unaryExpression: ('+'|'-')* primitiveElement;
primitiveElement : literalExpression
| id ('.' id)?
| '(' expression ')'
;
literalExpression : NUMBER
| BOOLEAN_LITERAL
| 'infinity'
;
id : IDENTIFIER
;
// L E X I C A L R U L E S
Range
: '..'
;
NUMBER
: (DIGITS Range) => DIGITS {$type=DIGITS;}
| (FloatLiteral) => FloatLiteral {$type=FloatLiteral;}
| DIGITS {$type=DIGITS;}
;
// fragments
fragment FloatLiteral : Float;
fragment Float
: DIGITS ( options {greedy = true; } : '.' DIGIT* EXPONENT?)
| '.' DIGITS EXPONENT?
| DIGITS EXPONENT
;
BOOLEAN_LITERAL : 'false'
| 'true'
;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment LETTER : ('a'..'z' | 'A'..'Z' | '_') ;
fragment DIGITS: DIGIT+;
fragment DIGIT : '0'..'9';
fragment EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
Is there any reason why it does not even accept:
in 10;
or
in 10.0;
Thanks in advance!
The following things are not correct:
- you're never matching a FloatLiteral in your literalExpression rule
- in every alternative of NUMBER you're changing the type of the token, therefore a NUMBER token will never be created
Something like this should work for both 11..22 and 1.1..2.2:
...
literalExpression : INT
| BOOLEAN_LITERAL
| FLOAT
| 'infinity'
;
id : IDENTIFIER
;
// L E X I C A L R U L E S
Range
: '..'
;
INT
: (DIGITS Range)=> DIGITS
| DIGITS (('.' DIGITS EXPONENT? | EXPONENT) {$type=FLOAT;})?
;
BOOLEAN_LITERAL : 'false'
| 'true'
;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment LETTER : ('a'..'z' | 'A'..'Z' | '_') ;
fragment DIGITS: DIGIT+;
fragment DIGIT : '0'..'9';
fragment EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment FLOAT : ;
To your question about handling (1.0 .. 10.0):
Notice that parser rule primitiveElement defines an alternative as '(' expression ')', but rule expression can never reach rule rangeExpression.
Consider redefining expression and rangeExpression like so:
expression : rangeExpression
;
rangeExpression : compExpression ('..' compExpression)?
;
compExpression : addingExpression (('=='|'!='|'<='|'<'|'>='|'>') addingExpression)*
;
This ensures that the expression rule sits above all forms of expressions and will work as expected in parentheses.
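The effect of putting expression above rangeExpression can be tried out with a small hand-written recursive-descent parser. The Python sketch below rests on several assumptions: the Parser class and its method names are hypothetical, the comparison level (compExpression) is left out for brevity, and only numbers, +, -, .. and parentheses are supported.

```python
import re

# A number only absorbs a dot when a digit follows, so "1..10" lexes as 1, "..", 10.
TOKEN_RE = re.compile(r"\s*(\d+(?:\.\d+)?|\.\.|[()+\-])")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected or 'a token'}, got {tok!r}")
        self.pos += 1
        return tok

    def expression(self):  # top rule: every kind of expression is reachable from here
        return self.range_expression()

    def range_expression(self):
        low = self.adding_expression()
        if self.peek() == "..":
            self.take("..")
            return ("range", low, self.adding_expression())
        return low

    def adding_expression(self):
        node = self.primitive()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.primitive())
        return node

    def primitive(self):
        if self.peek() == "(":
            self.take("(")
            node = self.expression()  # parentheses re-enter at the top rule
            self.take(")")
            return node
        tok = self.take()
        try:
            return float(tok)
        except ValueError:
            raise SyntaxError(f"expected a number, got {tok!r}") from None


def parse(text):
    parser = Parser(tokenize(text))
    node = parser.expression()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return node
```

Because primitive calls expression rather than a lower-level rule, parse("(1.0..10.0)") is accepted and returns a range node; if it called adding_expression instead, the .. inside the parentheses would be rejected.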

antlr 2 rule ambiguity

DECIMAL_LITERAL : ('0' | '1'..'9' ('0'..'9')*) (INTEGER_TYPE_SUFFIX)? ;
FLOATING_POINT_LITERAL
: ('0'..'9')+
(
DOT ('0'..'9')* (EXPONENT)? (FLOAT_TYPE_SUFFIX)?
| EXPONENT (FLOAT_TYPE_SUFFIX)?
| FLOAT_TYPE_SUFFIX
)
| DOT ('0'..'9')+ (EXPONENT)? (FLOAT_TYPE_SUFFIX)?
;
DECIMAL_LITERAL matches integer literals in the C language and FLOATING_POINT_LITERAL matches float literals. But when the lexer meets a float such as 3.44, the 3 is matched by the DECIMAL_LITERAL rule.
What can I do to make it recognize the float literal?
You combine the rules into one lexer rule and then change the type based on whether you see the DOT or not. This should give you an idea, although it's not exactly equivalent to what you had written above.
DECIMAL_LITERAL
: ('0'..'9')+
(
DOT { _ttype = FLOATING_POINT_LITERAL; } ('0'..'9')* (EXPONENT)? (FLOAT_TYPE_SUFFIX)?
| EXPONENT (FLOAT_TYPE_SUFFIX)?
| FLOAT_TYPE_SUFFIX
)
| DOT { _ttype = FLOATING_POINT_LITERAL; } ('0'..'9')+ (EXPONENT)? (FLOAT_TYPE_SUFFIX)?
;
For a more complete example see my C grammar at http://www.antlr3.org/grammar/cgram/
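The "one rule, decide the type while scanning" idea can also be sketched outside of ANTLR. The Python function below is a hypothetical illustration, not a translation of the grammar above: scan_number and DIGITS are made-up names, type suffixes are omitted, and (as an assumption of this sketch) an exponent also switches the type to FLOATING_POINT_LITERAL.

```python
DIGITS = "0123456789"


def scan_number(text, pos=0):
    """Scan one numeric literal at pos; return (token_type, lexeme)."""
    start = pos
    kind = "DECIMAL_LITERAL"  # default type, changed once a float feature is seen

    while pos < len(text) and text[pos] in DIGITS:
        pos += 1
    has_int_digits = pos > start

    # Optional fraction: "3.", "3.44" and ".5" are floats; a lone "." is not a number.
    if pos < len(text) and text[pos] == ".":
        end = pos + 1
        while end < len(text) and text[end] in DIGITS:
            end += 1
        if has_int_digits or end > pos + 1:
            kind = "FLOATING_POINT_LITERAL"
            pos = end

    if pos == start:
        raise ValueError(f"no number at offset {start}")

    # Optional exponent: only consumed when at least one digit follows it.
    if pos < len(text) and text[pos] in "eE":
        end = pos + 1
        if end < len(text) and text[end] in "+-":
            end += 1
        digits_start = end
        while end < len(text) and text[end] in DIGITS:
            end += 1
        if end > digits_start:
            kind = "FLOATING_POINT_LITERAL"
            pos = end

    return kind, text[start:pos]
```

The point of the design is that the scanner never has to choose between two competing rules up front: 3.44 and 344 start down the same path, and the token type is settled only after the characters following the digits have been inspected.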