Token matching, but it shouldn't - antlr
I am trying to parse Windows header files to extract function prototypes. Microsoft being Microsoft means that the function prototypes are not in a regular, easily parseable format. Routine arguments usually, but not always, are annotated with Microsoft's Structured Annotation Language, which starts with an identifier that begins and ends with an underscore, and may have an underscore in the middle. A SAL identifier may be followed by parentheses and contain a variety of compile-time checks, but I don't care about SAL stuff. Routines are generally annotated with an access specifier, which is usually something like WINAPI, APIENTRY, etc., but there may be more than one. There are cases where the arguments are specified only by their types, too. Sheesh!
My grammar looks like this:
//
// Parse C function declarations from a header file
//
grammar FuncDef;
//
// Parser rules
//
start :
func_def+
;
func_def :
'extern'? ret_type = IDENTIFIER access = access_spec routine = IDENTIFIER '(' arg_list* ')' ';'
;
sal_statement :
SAL_NAME SAL_EXPR?
;
access_spec :
('FAR' | 'PASCAL' | 'WINAPI' | 'APIENTRY' | 'WSAAPI' | 'WSPAPI')?
;
argument :
sal_statement type = IDENTIFIER is_pointer = '*'? arg = IDENTIFIER
;
arg_list :
argument (',' argument)*
;
hex_number :
'0x' HEX_DIGIT+
;
//
// Lexer rules
//
INTEGER : Digit+;
HEX_DIGIT : [a-fA-F0-9];
SAL_NAME : '_' Capital (Letter | '_')+? '_'; // Restricted form of IDENTIFIER, so it must be first
IDENTIFIER : Id_chars+;
SAL_EXPR : '(' ( ~( '(' | ')' ) | SAL_EXPR )* ')'; // We don't care about anything within a SAL expression, so eat everything within matched and nested parentheses
CPP_COMMENT : '//' .*? '\r'? '\n' -> channel (HIDDEN);
C_COMMENT : '/*' .*? '*/' -> channel (HIDDEN);
WS : [ \t\r\n]+ -> skip; // Ignore all whitespace
fragment Id_chars : Letter | Digit | '_' | '$';
fragment Capital : [A-Z];
fragment Letter : [a-zA-Z];
fragment Digit : [0-9];
I am using the TestRig, and providing the following input:
PVOID WINAPI routine ();
PVOID WINAPI routine (type param);
extern int PASCAL FAR __WSAFDIsSet(SOCKET fd, fd_set FAR *);
// comment
/*
Another comment*/
int
WSPAPI
WSCSetApplicationCategory(
_Out_writes_bytes_to_(nNumberOfBytesToRead, *lpNumberOfBytesRead) LPBYTE lpBuffer,
_In_ DWORD PathLength,
_In_reads_opt_(ExtraLength) LPCWSTR Extra,
_When_(pbCancel != NULL, _Pre_satisfies_(*pbCancel == FALSE))
DWORD ExtraLength,
_In_ DWORD PermittedLspCategories,
_Out_opt_ DWORD * pPrevPermLspCat,
_Out_ LPINT lpErrno
);
I'm getting this output:
[#0,0:4='PVOID',<IDENTIFIER>,1:0]
[#1,6:11='WINAPI',<'WINAPI'>,1:6]
[#2,13:19='routine',<IDENTIFIER>,1:13]
[#3,21:22='()',<SAL_EXPR>,1:21]
[#4,23:23=';',<';'>,1:23]
[#5,28:32='PVOID',<IDENTIFIER>,3:0]
[#6,34:39='WINAPI',<'WINAPI'>,3:6]
[#7,41:47='routine',<IDENTIFIER>,3:13]
[#8,49:60='(type param)',<SAL_EXPR>,3:21]
[#9,61:61=';',<';'>,3:33]
[#10,66:71='extern',<'extern'>,5:0]
[#11,73:75='int',<IDENTIFIER>,5:7]
[#12,77:82='PASCAL',<'PASCAL'>,5:11]
[#13,84:86='FAR',<'FAR'>,5:18]
[#14,88:99='__WSAFDIsSet',<IDENTIFIER>,5:22]
[#15,100:124='(SOCKET fd, fd_set FAR *)',<SAL_EXPR>,5:34]
[#16,125:125=';',<';'>,5:59]
[#17,130:141='// comment\r\n',<CPP_COMMENT>,channel=1,7:0]
[#18,142:162='/*\r\nAnother comment*/',<C_COMMENT>,channel=1,8:0]
[#19,167:169='int',<IDENTIFIER>,11:0]
[#20,172:177='WSPAPI',<'WSPAPI'>,12:0]
[#21,180:204='WSCSetApplicationCategory',<IDENTIFIER>,13:0]
[#22,205:568='(\r\n _Out_writes_bytes_to_(nNumberOfBytesToRead, *lpNumberOfBytesRead) LPBYTE lpBuffer,\r\n _In_ DWORD PathLength,\r\n _In_reads_opt_(ExtraLength) LPCWSTR Extra,\r\n _When_(pbCancel != NULL, _Pre_satisfies_(*pbCancel == FALSE))\r\nDWORD ExtraLength,\r\n _In_ DWORD PermittedLspCategories,\r\n _Out_opt_ DWORD * pPrevPermLspCat,\r\n _Out_ LPINT lpErrno\r\n )',<SAL_EXPR>,13:25]
[#23,569:569=';',<';'>,22:5]
[#24,572:571='<EOF>',<EOF>,23:0]
line 1:21 mismatched input '()' expecting '('
line 3:21 mismatched input '(type param)' expecting '('
line 5:18 extraneous input 'FAR' expecting IDENTIFIER
line 5:34 mismatched input '(SOCKET fd, fd_set FAR *)' expecting '('
line 13:25 mismatched input '(\r\n _Out_writes_bytes_to_(nNumberOfBytesToRead, *lpNumberOfBytesRead) LPBYTE lpBuffer,\r\n _In_ DWORD PathLength,\r\n _In_reads_opt_(ExtraLength) LPCWSTR Extra,\r\n _When_(pbCancel != NULL, _Pre_satisfies_(*pbCancel == FALSE))\r\nDWORD ExtraLength,\r\n _In_ DWORD PermittedLspCategories,\r\n _Out_opt_ DWORD * pPrevPermLspCat,\r\n _Out_ LPINT lpErrno\r\n )' expecting '('
What I don't understand is why is SAL_EXPR matching on lines 3 and 8? It should only match something if SAL_NAME matches first.
Why doesn't SAL_NAME match at line 22?
What I don't understand is why is SAL_EXPR matching on lines 3 and 8? It should only match something if SAL_NAME matches first.
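The lexer doesn't know anything about parser rules; it operates on the input characters only. It cannot know that it "should only match something if SAL_NAME matches first". At each position it simply picks the rule that matches the most characters, so a parenthesized argument list is lexed as one long SAL_EXPR token rather than a short '(' token.
The best approach is probably to keep this logic out of the lexer, i.e. decide in the parser, not in the lexer, whether the input is a SAL expression or something else in parentheses.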
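To make the longest-match behaviour concrete, here is a rough Python sketch (not ANTLR itself) of a tokenizer that, like the original grammar, has a token-level rule for balanced parentheses. The helper names balanced_end and tokenize are made up for illustration, and keywords such as WINAPI are simply treated as identifiers here.

```python
import re

IDENT = re.compile(r"[A-Za-z0-9_$]+")   # same characters as the Id_chars fragment
PUNCT = "();,*"                          # single-character literal tokens


def balanced_end(text, pos):
    """Return the index just past a balanced (...) group starting at pos, else None."""
    if text[pos] != "(":
        return None
    depth = 0
    for i in range(pos, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return i + 1
    return None  # unbalanced: the SAL_EXPR-like rule cannot match here


def tokenize(text):
    """Emit (type, text) pairs, always taking the longest match at each position."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        candidates = []  # (end_index, token_type)
        end = balanced_end(text, pos)
        if end is not None:
            candidates.append((end, "SAL_EXPR"))
        if text[pos] in PUNCT:
            candidates.append((pos + 1, repr(text[pos])))
        m = IDENT.match(text, pos)
        if m:
            candidates.append((m.end(), "IDENTIFIER"))
        if not candidates:
            raise ValueError(f"no rule matches at offset {pos}")
        end, kind = max(candidates, key=lambda c: c[0])  # longest match wins
        tokens.append((kind, text[pos:end]))
        pos = end
    return tokens


for kind, lexeme in tokenize("PVOID WINAPI routine (type param);"):
    print(kind, repr(lexeme))
```

The whole argument list comes out as a single SAL_EXPR token, because that candidate is longer than the one-character '(' candidate. No amount of parser-side context can change that decision.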
Here is the functioning grammar:
//
// Parse C function declarations from a header file
//
grammar FuncDef;
//
// Parser rules
//
start :
func_def+
;
func_def :
'extern'? ret_type = IDENTIFIER access = access_spec* routine = IDENTIFIER '(' arg_list* ')' ';'
;
sal_statement :
SAL_NAME sal_expr?
;
sal_expr :
'(' ( ~( '(' | ')' ) | sal_expr )* ')' // We don't care about anything within a SAL expression, so eat everything within matched and nested parentheses
;
access_spec :
('FAR' | 'PASCAL' | 'WINAPI' | 'APIENTRY' | 'WSAAPI' | 'WSPAPI')
;
argument :
sal_statement? type = IDENTIFIER access_spec? is_pointer = '*'? arg = IDENTIFIER?
| sal_statement
;
arg_list :
argument (',' argument)*
;
//
// Lexer rules
//
SAL_NAME : '_' Capital (Letter | '_')+ '_'; // Restricted form of IDENTIFIER, so it must be first
IDENTIFIER : Id_chars+;
CPP_COMMENT : '//' .*? '\r'? '\n' -> channel (HIDDEN);
C_COMMENT : '/*' .*? '*/' -> channel (HIDDEN);
OPERATORS : [&|=!><];
WS : [ \t\r\n]+ -> skip; // Ignore all whitespace
fragment Id_chars : Letter | Digit | '_' | '$';
fragment Capital : [A-Z];
fragment Letter : [a-zA-Z];
fragment Digit : [0-9];
Related
Antlr 4.6.1 not generating errorNodes for inputstream
I have a simple grammar like:
grammar CellMath;
equation : expr EOF;
expr
    : '-'expr #UnaryNegation // unary minus
    | expr op=('*'|'/') expr #MultiplicativeOp // MultiplicativeOperation
    | expr op=('+'|'-') expr #AdditiveOp // AdditiveOperation
    | FLOAT #Float // Floating Point Number
    | INT #Integer // Integer Number
    | '(' expr ')' #ParenExpr // Parenthesized Expression
    ;
MUL : '*' ;
DIV : '/' ;
ADD : '+' ;
SUB : '-' ;
FLOAT : DIGIT+ '.' DIGIT* | '.' DIGIT+ ;
INT : DIGIT+ ;
fragment DIGIT : [0-9] ; // match single digit
//fragment
//ATSIGN : [#];
WS : [ \t\r\n]+ -> skip ;
ERRORCHAR : . ;
I am not able to get an exception when a special character appears in the middle of an expression ([{Number}{SPLChar}{Chars}]), e.g. "123#abc" or "123&abc". I expect an exception to be thrown. For example, for the input stream 123#abc I expect an error, just as ANTLR Lab reports one. But in my case the output is '123' without any errors.
I'm using the Listener pattern. Error nodes are just ignored and never pass through VisitTerminal([NotNull] ITerminalNode node) / VisitErrorNode([NotNull] IErrorNode node) in the BaseListener class. I have also overridden all the BaseErrorListener methods, and none of them is reached either. Thanks in advance for your help.
Antlr4 mismatched input '<' expecting '<' with (seemingly) no lexer ambiguity
I cannot seem to figure out what antlr is doing here in this grammar. I have a grammar that should match an input like:
i,j : bool;
setvar : set<bool>;
i > 5;
j < 10;
But I keep getting an error telling me that "line 3:13 mismatched input '<' expecting '<'". This tells me there is some ambiguity in the lexer, but I only use '<' in a single token. Here is the grammar:
//// Parser Rules
grammar MLTL1;
start: block*;
block: var_list ';' | expr ';' ;
var_list: IDENTIFIER (',' IDENTIFIER)* ':' type ;
type: BASE_TYPE | KW_SET REL_LT BASE_TYPE REL_GT ;
expr: expr REL_OP expr | '(' expr ')' | IDENTIFIER | INT ;
//// Lexical Spec
// Types
BASE_TYPE: 'bool' | 'int' | 'float' ;
// Keywords
KW_SET: 'set' ;
// Op groups for precedence
REL_OP: REL_EQ | REL_NEQ | REL_GT | REL_LT | REL_GTE | REL_LTE ;
// Relational ops
REL_EQ: '==' ;
REL_NEQ: '!=' ;
REL_GT: '>' ;
REL_LT: '<' ;
REL_GTE: '>=' ;
REL_LTE: '<=' ;
IDENTIFIER : LETTER (LETTER | DIGIT)* ;
INT : SIGN? NONZERODIGIT DIGIT* | '0' ;
fragment SIGN : [+-] ;
fragment DIGIT : [0-9] ;
fragment NONZERODIGIT : [1-9] ;
fragment LETTER : [a-zA-Z_] ;
COMMENT : '#' ~[\r\n]* -> skip;
WS : [ \t\r\n]+ -> channel(HIDDEN);
I tested the grammar to see what tokens it is generating for the test input above using this python:
from antlr4 import InputStream, CommonTokenStream
import MLTL1Lexer
import MLTL1Parser
input="""
i,j : bool;
setvar: set<bool>;
i > 5;
j < 10;
"""
lexer = MLTL1Lexer.MLTL1Lexer(InputStream(input))
stream = CommonTokenStream(lexer)
stream.fill()
tokens = stream.getTokens(0,100)
for t in tokens:
    print(str(t.type) + " " + t.text)
parser = MLTL1Parser.MLTL1Parser(stream)
parse_tree = parser.start()
print(parse_tree.toStringTree(recog=parser))
And noticed that both '>' and '<' were assigned the same token value despite being two different tokens. Am I missing something here?
(There may be more than just these two instances, but...) Change REL_OP and BASE_TYPE to parser rules (i.e. make them lowercase). As you've used them, you're effectively turning many of your intended lexer rules into fragments. It's important to understand that tokens are the "atoms" of your grammar: when you combine several of them into another lexer rule, that outer rule becomes the token type. (If you had used grun to dump the tokens, you would have seen them identified as REL_OP tokens.)
With the changes below, your sample input works just fine.
grammar MLTL1 ;
start: block*;
block: var_list ';' | expr ';';
var_list: IDENTIFIER (',' IDENTIFIER)* ':' type;
type: baseType | KW_SET REL_LT baseType REL_GT;
expr: expr rel_op expr | '(' expr ')' | IDENTIFIER | INT;
//// Lexical Spec
// Types
baseType: 'bool' | 'int' | 'float';
// Keywords
KW_SET: 'set';
// Op groups for precedence
rel_op: REL_EQ | REL_NEQ | REL_GT | REL_LT | REL_GTE | REL_LTE;
// Relational ops
REL_EQ: '==';
REL_NEQ: '!=';
REL_GT: '>';
REL_LT: '<';
REL_GTE: '>=';
REL_LTE: '<=';
IDENTIFIER: LETTER (LETTER | DIGIT)*;
INT: SIGN? NONZERODIGIT DIGIT* | '0';
fragment SIGN: [+-];
fragment DIGIT: [0-9];
fragment NONZERODIGIT: [1-9];
fragment LETTER: [a-zA-Z_];
COMMENT: '#' ~[\r\n]* -> skip;
WS: [ \t\r\n]+ -> channel(HIDDEN);
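The effect can be imitated outside ANTLR. The following is only a Python sketch of a lexer that picks the longest match and, on a tie, the rule listed first, which is how ANTLR resolves such conflicts. The names make_lexer, BROKEN and FIXED are hypothetical, and the rule sets are trimmed down (no keywords or types) to keep it short.

```python
import re


def make_lexer(rules):
    """Build a lexer from ordered (name, pattern) pairs: longest match, first rule on ties."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]

    def lex(text):
        tokens, pos = [], 0
        while pos < len(text):
            if text[pos].isspace():
                pos += 1
                continue
            best = None  # (length, rule name)
            for name, rx in compiled:
                m = rx.match(text, pos)
                # Only a strictly longer match replaces the current best,
                # so on equal length the earlier rule is kept.
                if m and m.end() > pos and (best is None or m.end() - pos > best[0]):
                    best = (m.end() - pos, name)
            if best is None:
                raise ValueError(f"no rule matches at offset {pos}")
            tokens.append((best[1], text[pos:pos + best[0]]))
            pos += best[0]
        return tokens

    return lex


IDENT = r"[A-Za-z_][A-Za-z0-9_]*"

# REL_OP is a token rule listed before REL_LT and REL_GT, so it wins every tie.
BROKEN = [("REL_OP", r"==|!=|>=|<=|>|<"), ("REL_GT", r">"), ("REL_LT", r"<"),
          ("IDENTIFIER", IDENT)]

# With rel_op moved to the parser, only the individual operator tokens remain.
FIXED = [("REL_EQ", r"=="), ("REL_NEQ", r"!="), ("REL_GTE", r">="), ("REL_LTE", r"<="),
         ("REL_GT", r">"), ("REL_LT", r"<"), ("IDENTIFIER", IDENT)]

print(make_lexer(BROKEN)("set<bool>"))
print(make_lexer(FIXED)("set<bool>"))
```

With BROKEN, both '<' and '>' come out typed as REL_OP, so a parser rule asking for REL_LT can never be satisfied. With FIXED, they come out as REL_LT and REL_GT.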
How to make antlr find invalid input throw exception
I have the following grammar:
grammar Expr;
expr
    : '-' expr # unaryOpExpr
    | expr ('*'|'/'|'%') expr # mulDivModuloExpr
    | expr ('+'|'-') expr # addSubExpr
    | '(' expr ')' # nestedExpr
    | IDENTIFIER '(' fnArgs? ')' # functionExpr
    | IDENTIFIER # identifierExpr
    | DOUBLE # doubleExpr
    | LONG # longExpr
    | STRING # string
    ;
fnArgs : expr (',' expr)* # functionArgs ;
IDENTIFIER : [_$a-zA-Z][_$a-zA-Z0-9]* | '"' (ESC | ~ ["\\])* '"';
LONG : [0-9]+;
DOUBLE : [0-9]+ '.' [0-9]*;
WS : [ \t\r\n]+ -> skip ;
STRING: '"' (~["\\\r\n] | ESC)* '"';
fragment ESC : '\\' (['"\\/bfnrt] | UNICODE) ;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
MINUS : '-' ;
MUL : '*' ;
DIV : '/' ;
MODULO : '%' ;
PLUS : '+' ;
// math function
MAX: 'MAX';
When I enter the following text, it is accepted, as it should be:
-1.1
But when I enter the following text:
-1.1ffff
I think it should report an error, but ANTLR doesn't. ANTLR captures the leading "-1.1" and discards "ffff". I want to change this behavior so that the invalid token is not discarded and an exception reporting the invalid token is thrown instead. What should I do? Thanks for your advice.
Are you using expr as your main rule? If so, make another rule, call it something like parse or program, and simply write it like this:
parse: expr EOF;
This stops ANTLR from ignoring trailing tokens that don't fit the grammar, so it reports an error for them instead.
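Why the EOF anchor matters can be shown with a tiny hand-written parser. This is just a Python sketch for a cut-down version of the expr rule (unary minus, numbers, identifiers). The function names parse_without_eof and parse_with_eof are invented for the example, and the error message only imitates ANTLR's wording.

```python
import re

TOKEN = re.compile(
    r"\s*(?:(?P<DOUBLE>\d+\.\d*)|(?P<LONG>\d+)"
    r"|(?P<IDENTIFIER>[_$A-Za-z][_$A-Za-z0-9]*)|(?P<MINUS>-))"
)


def tokenize(text):
    """Split text into (type, text) pairs and append an explicit EOF token."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unrecognized character at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    tokens.append(("EOF", ""))
    return tokens


def parse_expr(tokens, i=0):
    """expr : '-' expr | DOUBLE | LONG | IDENTIFIER ; returns the index after the expr."""
    kind = tokens[i][0]
    if kind == "MINUS":
        return parse_expr(tokens, i + 1)
    if kind in ("DOUBLE", "LONG", "IDENTIFIER"):
        return i + 1
    raise SyntaxError(f"unexpected token {tokens[i]}")


def parse_without_eof(text):
    """Start rule 'expr': stops after one expression, the rest is never looked at."""
    tokens = tokenize(text)
    return tokens[:parse_expr(tokens)]


def parse_with_eof(text):
    """Start rule 'expr EOF': anything left after the expression is an error."""
    tokens = tokenize(text)
    end = parse_expr(tokens)
    if tokens[end][0] != "EOF":
        raise SyntaxError(f"extraneous input {tokens[end][1]!r} expecting <EOF>")
    return tokens[:end]


print(parse_without_eof("-1.1ffff"))
try:
    parse_with_eof("-1.1ffff")
except SyntaxError as err:
    print("error:", err)
```

The first call happily returns the tokens for -1.1 and never complains about ffff. The second one fails because the token after the expression is not EOF.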
Using Antlr to parse formulas with multiple locales
I'm very new to Antlr, so forgive what may be a very easy question. I am creating a grammar which parses Excel-like formulas and it needs to support multiple locales based on the list separator (, for en-US) and decimal separator (. for en-US). I would prefer not to choose between separate grammars to parse with based on locale. Can I modify or inherit from the CommonTokenStream class to accomplish this, or is there another way to do this? Examples would be helpful. I am using the Antlr v4.5.0-alpha003 NuGet package in my VS2015 C# project.
What you can do is add a locale (or custom separator and grouping characters) to your lexer, and add a semantic predicate before the lexer rule that inspects your custom separator and grouping characters and matches these tokens dynamically.
I don't have ANTLR and C# running here, but the Java demo should be pretty similar:
grammar LocaleDemo;
@lexer::header {
  import java.text.DecimalFormatSymbols;
  import java.util.Locale;
}
@lexer::members {
  private char decimalSeparator = '.';
  private char groupingSeparator = ',';
  public LocaleDemoLexer(CharStream input, Locale locale) {
    this(input);
    DecimalFormatSymbols dfs = new DecimalFormatSymbols(locale);
    this.decimalSeparator = dfs.getDecimalSeparator();
    this.groupingSeparator = dfs.getGroupingSeparator();
  }
}
parse : .*? EOF ;
NUMBER : D D? ( DG D D D )* ( DS D+ )? ;
OTHER : . ;
fragment D : [0-9];
fragment DS : {_input.LA(1) == decimalSeparator}? . ;
fragment DG : {_input.LA(1) == groupingSeparator}? . ;
To test the grammar above, run this class:
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;
import java.util.Locale;
public class Main {
  private static void tokenize(String input, Locale locale) {
    LocaleDemoLexer lexer = new LocaleDemoLexer(new ANTLRInputStream(input), locale);
    System.out.printf("\ninput='%s', locale=%s, tokens:\n", input, locale);
    for (Token t : lexer.getAllTokens()) {
      System.out.printf(" %-10s '%s'\n", LocaleDemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
    }
  }
  public static void main(String[] args) throws Exception {
    tokenize("1.23", Locale.ENGLISH);
    tokenize("1.23", Locale.GERMAN);
    tokenize("12.345.678,90", Locale.ENGLISH);
    tokenize("12.345.678,90", Locale.GERMAN);
  }
}
which would print:
input='1.23', locale=en, tokens:
 NUMBER     '1.23'
input='1.23', locale=de, tokens:
 NUMBER     '1'
 OTHER      '.'
 NUMBER     '23'
input='12.345.678,90', locale=en, tokens:
 NUMBER     '12.345'
 OTHER      '.'
 NUMBER     '67'
 NUMBER     '8'
 OTHER      ','
 NUMBER     '90'
input='12.345.678,90', locale=de, tokens:
 NUMBER     '12.345.678,90'
Related Q&A's: What is a 'semantic predicate' in ANTLR? What does "fragment" mean in ANTLR?
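For readers without an ANTLR setup at hand, the same idea (choosing the separator characters at lexer construction time) can be approximated in a few lines of Python. This is a sketch only: the SEPARATORS table is a hard-coded stand-in for DecimalFormatSymbols, and make_tokenizer is a made-up helper, not part of any library.

```python
import re

# Hypothetical stand-in for DecimalFormatSymbols: locale -> (decimal, grouping).
SEPARATORS = {"en": (".", ","), "de": (",", ".")}


def make_tokenizer(locale):
    """Return a tokenizer whose NUMBER rule uses the separators of the given locale."""
    ds, dg = SEPARATORS[locale]
    # Mirrors NUMBER : D D? ( DG D D D )* ( DS D+ )? ;
    number = re.compile(rf"\d\d?(?:{re.escape(dg)}\d\d\d)*(?:{re.escape(ds)}\d+)?")

    def tokenize(text):
        tokens, pos = [], 0
        while pos < len(text):
            m = number.match(text, pos)
            if m:
                tokens.append(("NUMBER", m.group()))
                pos = m.end()
            else:
                # Mirrors OTHER : . ; (any single character not starting a number)
                tokens.append(("OTHER", text[pos]))
                pos += 1
        return tokens

    return tokenize


for loc in ("en", "de"):
    for sample in ("1.23", "12.345.678,90"):
        print(loc, sample, make_tokenizer(loc)(sample))
```

The separators are baked into the pattern when the tokenizer is created, which plays the role of the predicates in the DS and DG fragments. For these four inputs the token sequences agree with the Java output above.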
As a follow-up to Bart's answer, this is the grammar I created with his suggestions:
grammar ExcelScript;
@lexer::header {
using System;
using System.Globalization;
}
@lexer::members {
    private Int32 listseparator = 44; // UTF16 value for comma
    private Int32 decimalseparator = 46; // UTF16 value for period
    /// <summary>
    /// Creates a new lexer object
    /// </summary>
    /// <param name="input">The input stream</param>
    /// <param name="locale">The locale to use in parsing numbers</param>
    /// <returns>A new lexer object</returns>
    public ExcelScriptLexer (ICharStream input, CultureInfo locale) : this(input)
    {
        this.listseparator = Convert.ToInt32(locale.TextInfo.ListSeparator[0]);
        this.decimalseparator = Convert.ToInt32(locale.NumberFormat.NumberDecimalSeparator[0]);
        // special case for 8 locales where the list separator is a , and the number separator is a , too
        // Excel uses semicolon for list separator, so we will too
        if (this.listseparator == 44 && this.decimalseparator == 44)
            this.listseparator = 59; // UTF16 value for semicolon
    }
}
/*
 * Parser Rules
 */
formula : numberLiteral | Identifier | '=' expression ;
expression
    : primary # PrimaryExpression
    | Identifier arguments # FunctionCallExpression
    | ('+' | '-') expression # UnarySignExpression
    | expression ('*' | '/' | '%') expression # MulDivModExpression
    | expression ('+' | '-') expression # AddSubExpression
    | expression ('<=' | '>=' | '>' | '<') expression # CompareExpression
    | expression ('=' | '<>') expression # EqualCompareExpression
    ;
primary
    : '(' expression ')' # ParenExpression
    | literal # LiteralExpression
    | Identifier # IdentifierExpression
    ;
literal
    : numberLiteral # NumberLiteralRule
    | booleanLiteral # BooleanLiteralRule
    ;
numberLiteral : IntegerLiteral | FloatingPointLiteral ;
booleanLiteral : TrueKeyword | FalseKeyword ;
arguments : '(' expressionList? ')' ;
expressionList : expression (ListSeparator expression)* ;
/*
 * Lexer Rules
 */
AddOperator : '+' ;
SubOperator : '-' ;
MulOperator : '*' ;
DivOperator : '/' ;
PowOperator : '^' ;
EqOperator : '=' ;
NeqOperator : '<>' ;
LeOperator : '<=' ;
GeOperator : '>=' ;
LtOperator : '<' ;
GtOperator : '>' ;
ListSeparator : {_input.La(1) == listseparator}? . ;
DecimalSeparator : {_input.La(1) == decimalseparator}? . ;
TrueKeyword : [Tt][Rr][Uu][Ee] ;
FalseKeyword : [Ff][Aa][Ll][Ss][Ee] ;
Identifier : Letter (Letter | Digit)* ;
fragment Letter : [A-Z_a-z] ;
fragment Digit : [0-9] ;
IntegerLiteral : '0' | [1-9] [0-9]* ;
FloatingPointLiteral : [0-9]+ DecimalSeparator [0-9]* Exponent? | DecimalSeparator [0-9]+ Exponent? | [0-9]+ Exponent ;
fragment Exponent : ('e' | 'E') ('+' | '-')? ('0'..'9')+ ;
WhiteSpace : [ \t]+ -> channel(HIDDEN) ;
Lexer to handle lines with line number prefix
I'm writing a parser for a language that looks like the following:
L00<<identifier>>
L10<<keyword>>
L250<<identifier>>
<<identifier>>
That is, each line may or may not start with a line number of the form Lxxx.. ('L' followed by one or more digits), followed by an identifier or a keyword. Identifiers are standard [a-zA-Z_][a-zA-Z0-9_]* and the number of digits following the L is not fixed. Spaces between the line number and the following identifier/keyword are optional (and not present in most cases). My current grammar looks like:
// Parser rules
commands : command*;
command : LINE_NUM? keyword NEWLINE | LINE_NUM? IDENTIFIER NEWLINE;
keyword : KEYWORD_A | KEYWORD_B | ... ;
// Lexer rules
fragment INT : [0-9]+;
LINE_NUM : 'L' INT;
KEYWORD_A : 'someKeyword';
KEYWORD_B : 'reservedWord';
...
IDENTIFIER : [a-zA-Z_][a-zA-Z0-9_]*
However, this results in all lines that begin with a line number being tokenized as IDENTIFIERs. Is there a way to properly tokenize this input using an ANTLR grammar?
You need to add a semantic predicate to IDENTIFIER:
IDENTIFIER
    : {getCharPositionInLine() != 0 || _input.LA(1) != 'L' || !Character.isDigit(_input.LA(2))}?
      [a-zA-Z_] [a-zA-Z0-9_]*
    ;
You could also avoid semantic predicates by using lexer modes.
//
// Default mode is active at the beginning of a line
//
LINE_NUM : 'L' [0-9]+ -> pushMode(NotBeginningOfLine) ;
KEYWORD_A : 'someKeyword' -> pushMode(NotBeginningOfLine);
KEYWORD_B : 'reservedWord' -> pushMode(NotBeginningOfLine);
IDENTIFIER
    : ( 'L'
      | 'L' [a-zA-Z_] [a-zA-Z0-9_]*
      | [a-zA-KM-Z_] [a-zA-Z0-9_]*
      )
      -> pushMode(NotBeginningOfLine)
    ;
NL : ('\r' '\n'? | '\n');
mode NotBeginningOfLine;
NotBeginningOfLine_NL : ('\r' '\n'? | '\n') -> type(NL), popMode;
NotBeginningOfLine_KEYWORD_A : KEYWORD_A -> type(KEYWORD_A);
NotBeginningOfLine_KEYWORD_B : KEYWORD_B -> type(KEYWORD_B);
NotBeginningOfLine_IDENTIFIER : [a-zA-Z_] [a-zA-Z0-9_]* -> type(IDENTIFIER) ;
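The idea shared by both variants (an 'L' followed by digits counts as a line number only at the very start of a line) is easy to try out in plain Python. The sketch below is an approximation, not a translation of either grammar: tokenize_line is a hypothetical helper, it handles one line at a time, and it only recognizes a line number at column 0 with no leading whitespace.

```python
import re

LINE_NUM = re.compile(r"L[0-9]+")
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
KEYWORDS = {"someKeyword": "KEYWORD_A", "reservedWord": "KEYWORD_B"}


def tokenize_line(line):
    """Tokenize one line; 'L' + digits is a LINE_NUM only at column 0."""
    tokens, pos = [], 0
    while pos < len(line):
        if line[pos] in " \t":
            pos += 1
            continue
        # Column check: this plays the role of the predicate / the default mode.
        m = LINE_NUM.match(line, pos) if pos == 0 else None
        if m:
            kind = "LINE_NUM"
        else:
            m = IDENTIFIER.match(line, pos)
            if not m:
                raise ValueError(f"unexpected character at column {pos}")
            # A whole-word keyword match is reclassified; anything else is an identifier.
            kind = KEYWORDS.get(m.group(), "IDENTIFIER")
        tokens.append((kind, m.group()))
        pos = m.end()
    return tokens


for sample in ("L00foo", "L10someKeyword", "L250bar", "Label"):
    print(sample, tokenize_line(sample))
```

Without the column check, the identifier pattern would match all of L10someKeyword as one token, which is the longest-match behaviour the question ran into.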