I am trying to parse Windows header files to extract function prototypes. Microsoft being Microsoft means that the function prototypes are not in a regular, easily parseable format. Routine arguments usually, but not always, are annotated with Microsoft's source-code annotation language (SAL). A SAL annotation starts with an identifier that begins and ends with an underscore and may have underscores in the middle. A SAL identifier may be followed by parentheses containing a variety of compile-time checks, but I don't care about the SAL stuff. Routines are generally annotated with an access specifier, which is usually something like WINAPI, APIENTRY, etc., but there may be more than one. There are cases where the arguments are specified only by their types, too. Sheesh!
My grammar looks like this:
//
// Parse C function declarations from a header file
//
grammar FuncDef;
//
// Parser rules
//
start :
func_def+
;
func_def :
'extern'? ret_type = IDENTIFIER access = access_spec routine = IDENTIFIER '(' arg_list* ')' ';'
;
sal_statement :
SAL_NAME SAL_EXPR?
;
access_spec :
('FAR' | 'PASCAL' | 'WINAPI' | 'APIENTRY' | 'WSAAPI' | 'WSPAPI')?
;
argument :
sal_statement type = IDENTIFIER is_pointer = '*'? arg = IDENTIFIER
;
arg_list :
argument (',' argument)*
;
hex_number :
'0x' HEX_DIGIT+
;
//
// Lexer rules
//
INTEGER : Digit+;
HEX_DIGIT : [a-fA-F0-9];
SAL_NAME : '_' Capital (Letter | '_')+? '_'; // Restricted form of IDENTIFIER, so it must be first
IDENTIFIER : Id_chars+;
SAL_EXPR : '(' ( ~( '(' | ')' ) | SAL_EXPR )* ')'; // We don't care about anything within a SAL expression, so eat everything within matched and nested parentheses
CPP_COMMENT : '//' .*? '\r'? '\n' -> channel (HIDDEN);
C_COMMENT : '/*' .*? '*/' -> channel (HIDDEN);
WS : [ \t\r\n]+ -> skip; // Ignore all whitespace
fragment Id_chars : Letter | Digit | '_' | '$';
fragment Capital : [A-Z];
fragment Letter : [a-zA-Z];
fragment Digit : [0-9];
I am using the TestRig, and providing the following input:
PVOID WINAPI routine ();
PVOID WINAPI routine (type param);
extern int PASCAL FAR __WSAFDIsSet(SOCKET fd, fd_set FAR *);
// comment
/*
Another comment*/
int
WSPAPI
WSCSetApplicationCategory(
_Out_writes_bytes_to_(nNumberOfBytesToRead, *lpNumberOfBytesRead) LPBYTE lpBuffer,
_In_ DWORD PathLength,
_In_reads_opt_(ExtraLength) LPCWSTR Extra,
_When_(pbCancel != NULL, _Pre_satisfies_(*pbCancel == FALSE))
DWORD ExtraLength,
_In_ DWORD PermittedLspCategories,
_Out_opt_ DWORD * pPrevPermLspCat,
_Out_ LPINT lpErrno
);
I'm getting this output:
[#0,0:4='PVOID',<IDENTIFIER>,1:0]
[#1,6:11='WINAPI',<'WINAPI'>,1:6]
[#2,13:19='routine',<IDENTIFIER>,1:13]
[#3,21:22='()',<SAL_EXPR>,1:21]
[#4,23:23=';',<';'>,1:23]
[#5,28:32='PVOID',<IDENTIFIER>,3:0]
[#6,34:39='WINAPI',<'WINAPI'>,3:6]
[#7,41:47='routine',<IDENTIFIER>,3:13]
[#8,49:60='(type param)',<SAL_EXPR>,3:21]
[#9,61:61=';',<';'>,3:33]
[#10,66:71='extern',<'extern'>,5:0]
[#11,73:75='int',<IDENTIFIER>,5:7]
[#12,77:82='PASCAL',<'PASCAL'>,5:11]
[#13,84:86='FAR',<'FAR'>,5:18]
[#14,88:99='__WSAFDIsSet',<IDENTIFIER>,5:22]
[#15,100:124='(SOCKET fd, fd_set FAR *)',<SAL_EXPR>,5:34]
[#16,125:125=';',<';'>,5:59]
[#17,130:141='// comment\r\n',<CPP_COMMENT>,channel=1,7:0]
[#18,142:162='/*\r\nAnother comment*/',<C_COMMENT>,channel=1,8:0]
[#19,167:169='int',<IDENTIFIER>,11:0]
[#20,172:177='WSPAPI',<'WSPAPI'>,12:0]
[#21,180:204='WSCSetApplicationCategory',<IDENTIFIER>,13:0]
[#22,205:568='(\r\n _Out_writes_bytes_to_(nNumberOfBytesToRead, *lpNumberOfBytesRead) LPBYTE lpBuffer,\r\n _In_ DWORD PathLength,\r\n _In_reads_opt_(ExtraLength) LPCWSTR Extra,\r\n _When_(pbCancel != NULL, _Pre_satisfies_(*pbCancel == FALSE))\r\nDWORD ExtraLength,\r\n _In_ DWORD PermittedLspCategories,\r\n _Out_opt_ DWORD * pPrevPermLspCat,\r\n _Out_ LPINT lpErrno\r\n )',<SAL_EXPR>,13:25]
[#23,569:569=';',<';'>,22:5]
[#24,572:571='<EOF>',<EOF>,23:0]
line 1:21 mismatched input '()' expecting '('
line 3:21 mismatched input '(type param)' expecting '('
line 5:18 extraneous input 'FAR' expecting IDENTIFIER
line 5:34 mismatched input '(SOCKET fd, fd_set FAR *)' expecting '('
line 13:25 mismatched input '(\r\n _Out_writes_bytes_to_(nNumberOfBytesToRead, *lpNumberOfBytesRead) LPBYTE lpBuffer,\r\n _In_ DWORD PathLength,\r\n _In_reads_opt_(ExtraLength) LPCWSTR Extra,\r\n _When_(pbCancel != NULL, _Pre_satisfies_(*pbCancel == FALSE))\r\nDWORD ExtraLength,\r\n _In_ DWORD PermittedLspCategories,\r\n _Out_opt_ DWORD * pPrevPermLspCat,\r\n _Out_ LPINT lpErrno\r\n )' expecting '('
What I don't understand is why is SAL_EXPR matching on lines 3 and 8? It should only match something if SAL_NAME matches first.
Why doesn't SAL_NAME match at line 22?
What I don't understand is why is SAL_EXPR matching on lines 3 and 8? It should only match something if SAL_NAME matches first.
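The lexer knows nothing about parser rules; it operates on the input characters alone, so it cannot know that SAL_EXPR "should only match something if SAL_NAME matches first". When several lexer rules can match at the current position, the rule that matches the most characters wins, which is why '()' and '(type param)' came out as SAL_EXPR tokens rather than as a '(' token.
The simplest fix is probably to move this logic out of the lexer: let the parser, not the lexer, decide whether a parenthesized group is a SAL expression or something else.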
Here is the functioning grammar:
//
// Parse C function declarations from a header file
//
grammar FuncDef;
//
// Parser rules
//
start :
func_def+
;
func_def :
'extern'? ret_type = IDENTIFIER access = access_spec* routine = IDENTIFIER '(' arg_list* ')' ';'
;
sal_statement :
SAL_NAME sal_expr?
;
sal_expr :
'(' ( ~( '(' | ')' ) | sal_expr )* ')' // We don't care about anything within a SAL expression, so eat everything within matched and nested parentheses
;
access_spec :
('FAR' | 'PASCAL' | 'WINAPI' | 'APIENTRY' | 'WSAAPI' | 'WSPAPI')
;
argument :
sal_statement? type = IDENTIFIER access_spec? is_pointer = '*'? arg = IDENTIFIER?
| sal_statement
;
arg_list :
argument (',' argument)*
;
//
// Lexer rules
//
SAL_NAME : '_' Capital (Letter | '_')+ '_'; // Restricted form of IDENTIFIER, so it must be first
IDENTIFIER : Id_chars+;
CPP_COMMENT : '//' .*? '\r'? '\n' -> channel (HIDDEN);
C_COMMENT : '/*' .*? '*/' -> channel (HIDDEN);
OPERATORS : [&|=!><];
WS : [ \t\r\n]+ -> skip; // Ignore all whitespace
fragment Id_chars : Letter | Digit | '_' | '$';
fragment Capital : [A-Z];
fragment Letter : [a-zA-Z];
fragment Digit : [0-9];
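To illustrate why the original lexer turned '()' into a single SAL_EXPR token, here is a minimal plain-Java sketch of longest-match tokenizing. It is not ANTLR-generated code: the class LongestMatchDemo and its helper methods are hypothetical names, and the three candidate rules only approximate IDENTIFIER, the old SAL_EXPR lexer rule, and a one-character literal.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of "longest match wins"; not produced by ANTLR.
class LongestMatchDemo {
    // Length of a balanced "( ... )" group starting at pos, or 0 if there is none.
    static int nestedParens(String s, int pos) {
        if (pos >= s.length() || s.charAt(pos) != '(') return 0;
        int depth = 0;
        for (int i = pos; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') depth++;
            else if (c == ')' && --depth == 0) return i - pos + 1;
        }
        return 0; // parentheses never balanced
    }

    // Length of a run of identifier characters starting at pos (roughly Id_chars+).
    static int identifier(String s, int pos) {
        int i = pos;
        while (i < s.length()
                && (Character.isLetterOrDigit(s.charAt(i)) || s.charAt(i) == '_' || s.charAt(i) == '$')) {
            i++;
        }
        return i - pos;
    }

    // At each position, try every rule and keep the one that consumes the most characters.
    static List<String> tokenize(String s) {
        List<String> out = new ArrayList<>();
        String[] names = {"IDENTIFIER", "SAL_EXPR", "CHAR"};
        int pos = 0;
        while (pos < s.length()) {
            if (Character.isWhitespace(s.charAt(pos))) { pos++; continue; }
            int[] lens = {identifier(s, pos), nestedParens(s, pos), 1};
            int best = 0;
            for (int k = 1; k < lens.length; k++) {
                if (lens[k] > lens[best]) best = k; // ties go to the earlier rule
            }
            out.add(names[best] + ":" + s.substring(pos, pos + lens[best]));
            pos += lens[best];
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("routine (type param);"));
    }
}
```

For the input routine (type param); the sketch yields IDENTIFIER:routine, SAL_EXPR:(type param), CHAR:; which mirrors the token dump in the question: the balanced-parentheses rule always consumes more than the one-character '(' literal, so the parser never sees a '('.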
So this is my grammar:
grammar Test;
prog: stmt_list;
stmt_list
: stmt_list stmt ';'
| stmt ';'
;
stmt
: assignment
| bind
;
assignment: 'var' IDENTIFIER ('=' | '+=' | '-=' | '*=' | '/=') expression;
type
: IDENTIFIER
| primitiveType
;
primitiveType
: 'int'
| 'float'
| 'string'
| 'bool'
;
expression
: atom
| expression ('*' | '/') expression
| expression ('+' | '-') expression
;
atom
: '(' expression ')'
| IDENTIFIER
| INT
| STRING
;
IDENTIFIER: [A-z_][A-z_0-9]*;
INT: [1-9][0-9]*;
STRING: '"' [A-z] '"';
WS: [\t\r\n]+ -> channel(HIDDEN);
The grammar compiles with ANTLR without problems. When I test it with grun, the parse works, but I get a "token recognition error" at every space. For example, with this input:
var a = b + c;
I get:
line 1:3 token recognition error at: ' '
line 1:5 token recognition error at: ' '
line 1:7 token recognition error at: ' '
line 1:9 token recognition error at: ' '
line 1:11 token recognition error at: ' '
Besides this everything works but it would still be nice if I could get rid of these messages.
You're only sending tabs and line-break characters to the hidden channel, not spaces.
Instead of:
WS: [\t\r\n]+ -> channel(HIDDEN);
do:
WS: [ \t\r\n]+ -> channel(HIDDEN);
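The error columns in the question line up exactly with the spaces in the input. A small plain-Java check makes that visible; the class WhitespaceClassCheck and its method uncovered are hypothetical names for this illustration and are not part of ANTLR. It lists the columns of whitespace characters that a given character set fails to cover.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: which whitespace columns does a WS character set miss?
class WhitespaceClassCheck {
    // Returns the 0-based columns of whitespace characters in `line`
    // that do not occur in `wsChars`.
    static List<Integer> uncovered(String line, String wsChars) {
        List<Integer> cols = new ArrayList<>();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (Character.isWhitespace(c) && wsChars.indexOf(c) < 0) {
                cols.add(i);
            }
        }
        return cols;
    }

    public static void main(String[] args) {
        // Character set of the original rule: tab, CR, LF
        System.out.println(uncovered("var a = b + c;", "\t\r\n"));  // [3, 5, 7, 9, 11]
        // Character set of the fixed rule: space, tab, CR, LF
        System.out.println(uncovered("var a = b + c;", " \t\r\n")); // []
    }
}
```

The first call reports columns 3, 5, 7, 9 and 11, the same positions as the five "token recognition error" messages.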
I have a grammar to parse some source code:
document
: header body_block* EOF
-> body_block*
;
header
: header_statement*
;
body_block
: '{' block_contents '}'
;
block_contents
: declaration_list
| ... other things ....
It's legal for a document to have a header without a body or a body without a header.
If I try to parse a document that looks like
int i;
then ANTLR complains that it found int when it was expecting EOF. This is true, but I'd like it to say that it was expecting {. That is, if the input contains something between the header and the EOF that's not a body_block, then I'd like to suggest to the user that they meant to enclose that text inside a body_block.
I've made a couple almost working attempts at this that I can post if that's illuminating, but I'm hoping that I've just missed something easy.
Not pretty, but something like this would do it:
body_block
: ('{')=> '{' block_contents '}'
| t=.
{
// declare the message outside the if/else so it is still in scope at the throw
String message;
if(!$t.text.equals("{")) {
message = "expected a '{' on line " + $t.line + " near '" + $t.text + "'";
}
else {
message = "encountered a '{' without a '}' on line " + $t.line;
}
throw new RuntimeException(message);
}
;
(not tested, may contain syntax errors!)
So, whenever '{' ... '}' is not matched, it falls through to the . alternative (see note 1) and produces a more understandable error message.
Note 1: a . in a parser rule matches any token, not any character!
I wrote a PHP5 parser in ANTLR 3.4, which is almost ready, but I cannot handle one of PHP's tricky features. My problem is with the precedence of the assignment operator. According to the PHP manual, assignment is almost at the end of the precedence list: only and, xor, or and , come after it.
But there is a note on that manual page which says:
Although = has a lower precedence than most other operators, PHP will
still allow expressions similar to the following: if (!$a = foo()), in
which case the return value of foo() is put into $a.
The small example in the note isn't a problem for my parser; I can handle it as a special case in the assignment rule.
But there is more complex code, e.g.:
if ($a && $b = func()) {}
My parser fails here, because it first recognizes $a && $b and then cannot deal with the rest of the condition. This is because && has higher precedence than =.
If I put brackets around the right side of &&:
if ($a && ($b = func())) {}
then the parser recognizes the structure correctly.
The operators are built the way the ANTLR book recommends: the base expressions come first, and each level of operators follows in turn.
Is there any way to handle this precedence jumping?
Don't look at it as an assignment statement; think of it as an assignment expression. Put this assignment expression "below" the unary expressions, so that it binds more tightly than the unary operators do:
grammar T;
options {
output=AST;
}
tokens {
BLOCK;
FUNC_CALL;
EXPR_LIST;
}
parse
: stat* EOF!
;
stat
: assignment ';'!
| if_stat
;
assignment
: Var '='^ expr
;
if_stat
: If '(' expr ')' block -> ^(If expr block)
;
block
: '{' stat* '}' -> ^(BLOCK stat*)
;
expr
: or_expr
;
or_expr
: and_expr ('||'^ and_expr)*
;
and_expr
: unary_expr ('&&'^ unary_expr)*
;
unary_expr
: '!'^ assign_expr
| '-'^ assign_expr
| assign_expr
;
assign_expr
: Var ('='^ atom)*
| atom
;
atom
: Num
| func_call
;
func_call
: Id '(' expr_list ')' -> ^(FUNC_CALL Id expr_list)
;
expr_list
: (expr (',' expr)*)? -> ^(EXPR_LIST expr*)
;
If : 'if';
Num : '0'..'9'+;
Var : '$' Id;
Id : ('a'..'z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
If you'd now parse the source:
if (!$a = foo()) { $a = 1 && 2; }
if ($a && $b = func()) { $b = 2 && 3; }
if ($a = baz() && $b) { $c = 3 && 4; }
the following AST would get constructed:
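As a rough textual stand-in for the tree shapes, the layering of the expression rules can be mimicked by a hand-written recursive-descent parser in plain Java. This is only a sketch under assumptions: the class AssignExprSketch, its regex tokenizer and the prefix-notation output are hypothetical and are not what ANTLR generates, and only the expression rules (or_expr down to atom) are covered, not if_stat or block.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical recursive-descent mirror of or_expr / and_expr / unary_expr / assign_expr / atom.
class AssignExprSketch {
    private final List<String> toks = new ArrayList<>();
    private int pos = 0;

    AssignExprSketch(String src) {
        // Var, Id, Num, the two-character operators, then single-character symbols
        Matcher m = Pattern.compile("\\$[a-z]+|[a-z]+|[0-9]+|&&|\\|\\||[!\\-=(),]").matcher(src);
        while (m.find()) toks.add(m.group());
    }

    private String peek() { return pos < toks.size() ? toks.get(pos) : ""; }
    private String next() { return toks.get(pos++); }

    // Parses one expression and returns it in prefix form, e.g. (&& $a $b).
    static String parse(String src) {
        AssignExprSketch p = new AssignExprSketch(src);
        String tree = p.orExpr();
        if (p.pos != p.toks.size()) throw new IllegalArgumentException("unexpected token: " + p.peek());
        return tree;
    }

    // or_expr : and_expr ('||' and_expr)*
    String orExpr() {
        String left = andExpr();
        while (peek().equals("||")) { next(); left = "(|| " + left + " " + andExpr() + ")"; }
        return left;
    }

    // and_expr : unary_expr ('&&' unary_expr)*
    String andExpr() {
        String left = unaryExpr();
        while (peek().equals("&&")) { next(); left = "(&& " + left + " " + unaryExpr() + ")"; }
        return left;
    }

    // unary_expr : '!' assign_expr | '-' assign_expr | assign_expr
    String unaryExpr() {
        if (peek().equals("!") || peek().equals("-")) {
            String op = next();
            return "(" + op + " " + assignExpr() + ")";
        }
        return assignExpr();
    }

    // assign_expr : Var ('=' atom)* | atom
    String assignExpr() {
        if (peek().startsWith("$")) {
            String left = next();
            while (peek().equals("=")) { next(); left = "(= " + left + " " + atom() + ")"; }
            return left;
        }
        return atom();
    }

    // atom : Num | func_call
    String atom() {
        String t = next();
        if (t.matches("[0-9]+")) return t;
        if (t.matches("[a-z]+") && peek().equals("(")) {
            next(); // consume '('
            StringBuilder args = new StringBuilder();
            while (!peek().equals(")")) {
                if (args.length() > 0) {
                    if (!next().equals(",")) throw new IllegalArgumentException("expected ','");
                    args.append(' ');
                }
                args.append(orExpr());
            }
            next(); // consume ')'
            return t + "(" + args + ")";
        }
        throw new IllegalArgumentException("unexpected token: " + t);
    }

    public static void main(String[] args) {
        System.out.println(parse("!$a = foo()"));
        System.out.println(parse("$a && $b = func()"));
        System.out.println(parse("$a = baz() && $b"));
    }
}
```

With this layering the three conditions come out as (! (= $a foo())), (&& $a (= $b func())) and (&& (= $a baz()) $b). In the last case the assignment groups first, because the right-hand side of = is only an atom in this grammar.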
I want to parse something like this in my lexer:
( begin expression )
where expressions are also surrounded by parentheses. It isn't important what is in the expression; I just want everything between the (begin and the matching ) as a single token. An example would be:
(begin
(define x (+ 1 2)))
So the text of the token should be (define x (+ 1 2)))
Something like
PROGRAM : LPAREN BEGIN .* RPAREN;
does (obviously) not work, because as soon as the lexer sees a ")" it thinks the rule is over, but I need the matching parenthesis for this.
How can I do that?
Inside lexer rules, you can invoke rules recursively, so that's one way to solve this. Another approach is to keep track of the number of open and close parentheses and let a gated semantic predicate keep the loop going as long as the counter is greater than zero.
A demo:
T.g
grammar T;
parse
: BeginToken {System.out.println("parsed :: " + $BeginToken.text);} EOF
;
BeginToken
@init{int open = 1;}
: '(' 'begin' ( {open > 0}?=> // keep repeating `( ... )*` as long as open > 0
( ~('(' | ')') // match anything other than a parenthesis
| '(' {open++;} // match a '(' and increase the var `open`
| ')' {open--;} // match a ')' and decrease the var `open`
)
)*
;
Main.java
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String input = "(begin (define x (+ (- 1 3) 2)))";
TLexer lexer = new TLexer(new ANTLRStringStream(input));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.parse();
}
}
java -cp antlr-3.3-complete.jar org.antlr.Tool T.g
javac -cp antlr-3.3-complete.jar *.java
java -cp .:antlr-3.3-complete.jar Main
parsed :: (begin (define x (+ (- 1 3) 2)))
Note that you'll need to beware of string literals inside your source that might include parentheses:
BeginToken
@init{int open = 1;}
: '(' 'begin' ( {open > 0}?=> // ...
( ~('(' | ')' | '"') // ...
| '(' {open++;} // ...
| ')' {open--;} // ...
| '"' ... // TODO: define a string literal here
)
)*
;
or comments that may contain parentheses.
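The counter idea, including the string-literal caveat, can be tried out in plain Java without ANTLR. The following is a minimal sketch under assumptions: the class BeginTokenScanner and its method scan are hypothetical names, string literals are assumed to be double-quoted with backslash escapes, and comments are not handled.

```java
// Hypothetical stand-alone version of the parenthesis counter; not ANTLR-generated code.
class BeginTokenScanner {
    // Returns the text from "(begin" up to and including its matching ')',
    // or null if the input does not start with "(begin" or never balances.
    static String scan(String s) {
        String prefix = "(begin";
        if (!s.startsWith(prefix)) return null;
        int open = 1;            // the '(' of "(begin" is already open
        int i = prefix.length();
        while (i < s.length() && open > 0) {
            char c = s.charAt(i++);
            if (c == '(') {
                open++;
            } else if (c == ')') {
                open--;
            } else if (c == '"') {
                // skip a string literal so parentheses inside it are not counted
                while (i < s.length() && s.charAt(i) != '"') {
                    if (s.charAt(i) == '\\') i++; // assumed escape: skip the escaped character
                    i++;
                }
                if (i >= s.length()) return null; // unterminated string literal
                i++; // consume the closing quote
            }
        }
        return open == 0 ? s.substring(0, i) : null;
    }

    public static void main(String[] args) {
        System.out.println(scan("(begin (define x (+ (- 1 3) 2)))"));
        System.out.println(scan("(begin (print \")\")) trailing text"));
    }
}
```

The second call shows the point of the extra alternative: the ")" inside the string literal does not close anything, so the scan stops at the real matching parenthesis and leaves the trailing text alone.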
The suggestion with the predicate uses some language-specific code (Java, in this case). An advantage of calling a lexer rule recursively is that you don't have custom code in your lexer:
BeginToken
: '(' Spaces? 'begin' Spaces? NestedParens Spaces? ')'
;
fragment NestedParens
: '(' ( ~('(' | ')') | NestedParens )* ')'
;
fragment Spaces
: (' ' | '\t')+
;
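For comparison, the recursive rule can be mirrored one-to-one by a recursive method. This is a plain-Java sketch, not ANTLR output; the class NestedParensMatcher and its methods are hypothetical names chosen to match the rule names above.

```java
// Hypothetical recursive mirror of BeginToken / NestedParens / Spaces; not ANTLR-generated code.
class NestedParensMatcher {
    // NestedParens : '(' ( ~('(' | ')') | NestedParens )* ')'
    // Returns the index just past the match, or -1 if there is no match at pos.
    static int nestedParens(String s, int pos) {
        if (pos >= s.length() || s.charAt(pos) != '(') return -1;
        int i = pos + 1;
        while (i < s.length() && s.charAt(i) != ')') {
            if (s.charAt(i) == '(') {
                i = nestedParens(s, i);   // recurse for the inner group
                if (i < 0) return -1;
            } else {
                i++;                      // any character other than a parenthesis
            }
        }
        return i < s.length() ? i + 1 : -1;
    }

    // Spaces? : skips spaces and tabs, returns the new position.
    static int spaces(String s, int pos) {
        while (pos < s.length() && (s.charAt(pos) == ' ' || s.charAt(pos) == '\t')) pos++;
        return pos;
    }

    // BeginToken : '(' Spaces? 'begin' Spaces? NestedParens Spaces? ')'
    // Returns the matched text, or null if the input does not start with a BeginToken.
    static String beginToken(String s) {
        if (!s.startsWith("(")) return null;
        int i = spaces(s, 1);
        if (!s.startsWith("begin", i)) return null;
        i = spaces(s, i + "begin".length());
        i = nestedParens(s, i);
        if (i < 0) return null;
        i = spaces(s, i);
        if (i >= s.length() || s.charAt(i) != ')') return null;
        return s.substring(0, i + 1);
    }

    public static void main(String[] args) {
        System.out.println(beginToken("(begin (define x (+ (- 1 3) 2)))"));
    }
}
```

As in the grammar, Spaces covers only spaces and tabs, so an input with a line break between begin and the nested expression would not match unless the whitespace rule is widened.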