Lexer to handle lines with line number prefix - antlr

I'm writing a parser for a language that looks like the following:
L00<<identifier>>
L10<<keyword>>
L250<<identifier>>
<<identifier>>
That is, each line may or may not start with a line number of the form Lxxx ('L' followed by one or more digits), followed by an identifier or a keyword. Identifiers are standard [a-zA-Z_][a-zA-Z0-9_]* and the number of digits following the L is not fixed. Spaces between the line number and the following identifier/keyword are optional (and absent in most cases).
My current grammar looks like:
// Parser rules
commands : command*;
command : LINE_NUM? keyword NEWLINE
| LINE_NUM? IDENTIFIER NEWLINE;
keyword : KEYWORD_A | KEYWORD_B | ... ;
// Lexer rules
fragment INT : [0-9]+;
LINE_NUM : 'L' INT;
KEYWORD_A : 'someKeyword';
KEYWORD_B : 'reservedWord';
...
IDENTIFIER : [a-zA-Z_][a-zA-Z0-9_]*;
However, this results in all lines beginning with a LINE_NUM token being tokenized as IDENTIFIERs.
Is there a way to properly tokenize this input using an ANTLR grammar?

You need to add a semantic predicate to IDENTIFIER:
IDENTIFIER
    : {getCharPositionInLine() != 0
       || _input.LA(1) != 'L'
       || !Character.isDigit(_input.LA(2))}?
      [a-zA-Z_] [a-zA-Z0-9_]*
    ;
(getCharPositionInLine() is the generated lexer's own method, not a CharStream method; the predicate lets IDENTIFIER match only when the token does not start at column 0 with 'L' followed by a digit.)
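The three-way check in this predicate can be exercised outside ANTLR. Below is a minimal plain-Java sketch of the same decision (the helper name isIdentifierStart and the string-based lookahead are illustrative, not ANTLR API): an identifier may start at the current position unless we are at column 0 and the next two characters look like a line-number prefix.

```java
public class LineNumPredicate {
    // Mirrors the lexer predicate: an IDENTIFIER may start here unless we are
    // at column 0 and the next two characters look like a line-number prefix
    // ('L' followed by a digit).
    public static boolean isIdentifierStart(int charPositionInLine, String lookahead) {
        char la1 = lookahead.length() > 0 ? lookahead.charAt(0) : '\0';
        char la2 = lookahead.length() > 1 ? lookahead.charAt(1) : '\0';
        return charPositionInLine != 0
                || la1 != 'L'
                || !Character.isDigit(la2);
    }
}
```

So at the start of a line, "L10" is claimed by LINE_NUM, while "Label" (no digit after the L) or an 'L'-prefixed token later in the line still falls through to IDENTIFIER.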
You could also avoid semantic predicates by using lexer modes.
//
// Default mode is active at the beginning of a line
//
LINE_NUM
: 'L' [0-9]+ -> pushMode(NotBeginningOfLine)
;
KEYWORD_A : 'someKeyword' -> pushMode(NotBeginningOfLine);
KEYWORD_B : 'reservedWord' -> pushMode(NotBeginningOfLine);
IDENTIFIER
: ( 'L'
| 'L' [a-zA-Z_] [a-zA-Z0-9_]*
| [a-zA-KM-Z_] [a-zA-Z0-9_]*
)
-> pushMode(NotBeginningOfLine)
;
NL : ('\r' '\n'? | '\n');
mode NotBeginningOfLine;
NotBeginningOfLine_NL : ('\r' '\n'? | '\n') -> type(NL), popMode;
NotBeginningOfLine_KEYWORD_A : KEYWORD_A -> type(KEYWORD_A);
NotBeginningOfLine_KEYWORD_B : KEYWORD_B -> type(KEYWORD_B);
NotBeginningOfLine_IDENTIFIER
: [a-zA-Z_] [a-zA-Z0-9_]* -> type(IDENTIFIER)
;

Related

Semantic Predicate affected scope

Please consider the following grammar, which gives me unexpected behavior:
lexer grammar TLexer;
WS : [ \t]+ -> channel(HIDDEN) ;
NEWLINE : '\n' -> channel(HIDDEN) ;
ASTERISK : '*' ;
SIMPLE_IDENTIFIER : [a-zA-Z_] [a-zA-Z0-9_$]* ;
NUMBER : [0-9] [0-9_]* ;
and
parser grammar TParser;
options { tokenVocab=TLexer; }
seq_input_list :
level_input_list | edge_input_list ;
level_input_list :
( level_symbol_any )+ ;
edge_input_list :
( level_symbol_any )* edge_symbol ;
level_symbol_any :
{getCurrentToken().getText().matches("[0a]")}? ( NUMBER | SIMPLE_IDENTIFIER ) ;
edge_symbol :
SIMPLE_IDENTIFIER | ASTERISK ;
The input 0 * is parsed fine, but 0 f is not recognized by the parser (no viable alternative at input 'f'). If I change the ordering of the rules in seq_input_list, both inputs are recognized.
My question is whether this is indeed an ANTLR issue or whether I misunderstand the usage of semantic predicates. I would expect the input 0 f to be recognized as (seq_input_list (edge_input_list (level_symbol_any NUMBER) (edge_symbol SIMPLE_IDENTIFIER))).
Thank you in advance!
Julian

Token matching, but it shouldn't

I am trying to parse Windows header files to extract function prototypes. Microsoft being Microsoft means that the function prototypes are not in a regular, easily parseable format. Routine arguments usually, but not always, are annotated with Microsoft's Structured Annotation Language, which starts with an identifier that begins and ends with an underscore, and may have an underscore in the middle. A SAL identifier may be followed by parentheses and contain a variety of compile-time checks, but I don't care about SAL stuff. Routines are generally annotated with an access specifier, which is usually something like WINAPI, APIENTRY, etc., but there may be more than one. There are cases where the arguments are specified only by their types, too. Sheesh!
My grammar looks like this:
//
// Parse C function declarations from a header file
//
grammar FuncDef;
//
// Parser rules
//
start :
func_def+
;
func_def :
'extern'? ret_type = IDENTIFIER access = access_spec routine = IDENTIFIER '(' arg_list* ')' ';'
;
sal_statement :
SAL_NAME SAL_EXPR?
;
access_spec :
('FAR' | 'PASCAL' | 'WINAPI' | 'APIENTRY' | 'WSAAPI' | 'WSPAPI')?
;
argument :
sal_statement type = IDENTIFIER is_pointer = '*'? arg = IDENTIFIER
;
arg_list :
argument (',' argument)*
;
hex_number :
'0x' HEX_DIGIT+
;
//
// Lexer rules
//
INTEGER : Digit+;
HEX_DIGIT : [a-fA-F0-9];
SAL_NAME : '_' Capital (Letter | '_')+? '_'; // Restricted form of IDENTIFIER, so it must be first
IDENTIFIER : Id_chars+;
SAL_EXPR : '(' ( ~( '(' | ')' ) | SAL_EXPR )* ')'; // We don't care about anything within a SAL expression, so eat everything within matched and nested parentheses
CPP_COMMENT : '//' .*? '\r'? '\n' -> channel (HIDDEN);
C_COMMENT : '/*' .*? '*/' -> channel (HIDDEN);
WS : [ \t\r\n]+ -> skip; // Ignore all whitespace
fragment Id_chars : Letter | Digit | '_' | '$';
fragment Capital : [A-Z];
fragment Letter : [a-zA-Z];
fragment Digit : [0-9];
I am using the TestRig, and providing the following input:
PVOID WINAPI routine ();
PVOID WINAPI routine (type param);
extern int PASCAL FAR __WSAFDIsSet(SOCKET fd, fd_set FAR *);
// comment
/*
Another comment*/
int
WSPAPI
WSCSetApplicationCategory(
_Out_writes_bytes_to_(nNumberOfBytesToRead, *lpNumberOfBytesRead) LPBYTE lpBuffer,
_In_ DWORD PathLength,
_In_reads_opt_(ExtraLength) LPCWSTR Extra,
_When_(pbCancel != NULL, _Pre_satisfies_(*pbCancel == FALSE))
DWORD ExtraLength,
_In_ DWORD PermittedLspCategories,
_Out_opt_ DWORD * pPrevPermLspCat,
_Out_ LPINT lpErrno
);
I'm getting this output:
[#0,0:4='PVOID',<IDENTIFIER>,1:0]
[#1,6:11='WINAPI',<'WINAPI'>,1:6]
[#2,13:19='routine',<IDENTIFIER>,1:13]
[#3,21:22='()',<SAL_EXPR>,1:21]
[#4,23:23=';',<';'>,1:23]
[#5,28:32='PVOID',<IDENTIFIER>,3:0]
[#6,34:39='WINAPI',<'WINAPI'>,3:6]
[#7,41:47='routine',<IDENTIFIER>,3:13]
[#8,49:60='(type param)',<SAL_EXPR>,3:21]
[#9,61:61=';',<';'>,3:33]
[#10,66:71='extern',<'extern'>,5:0]
[#11,73:75='int',<IDENTIFIER>,5:7]
[#12,77:82='PASCAL',<'PASCAL'>,5:11]
[#13,84:86='FAR',<'FAR'>,5:18]
[#14,88:99='__WSAFDIsSet',<IDENTIFIER>,5:22]
[#15,100:124='(SOCKET fd, fd_set FAR *)',<SAL_EXPR>,5:34]
[#16,125:125=';',<';'>,5:59]
[#17,130:141='// comment\r\n',<CPP_COMMENT>,channel=1,7:0]
[#18,142:162='/*\r\nAnother comment*/',<C_COMMENT>,channel=1,8:0]
[#19,167:169='int',<IDENTIFIER>,11:0]
[#20,172:177='WSPAPI',<'WSPAPI'>,12:0]
[#21,180:204='WSCSetApplicationCategory',<IDENTIFIER>,13:0]
[#22,205:568='(\r\n _Out_writes_bytes_to_(nNumberOfBytesToRead, *lpNumberOfBytesRead) LPBYTE lpBuffer,\r\n _In_ DWORD PathLength,\r\n _In_reads_opt_(ExtraLength) LPCWSTR Extra,\r\n _When_(pbCancel != NULL, _Pre_satisfies_(*pbCancel == FALSE))\r\nDWORD ExtraLength,\r\n _In_ DWORD PermittedLspCategories,\r\n _Out_opt_ DWORD * pPrevPermLspCat,\r\n _Out_ LPINT lpErrno\r\n )',<SAL_EXPR>,13:25]
[#23,569:569=';',<';'>,22:5]
[#24,572:571='<EOF>',<EOF>,23:0]
line 1:21 mismatched input '()' expecting '('
line 3:21 mismatched input '(type param)' expecting '('
line 5:18 extraneous input 'FAR' expecting IDENTIFIER
line 5:34 mismatched input '(SOCKET fd, fd_set FAR *)' expecting '('
line 13:25 mismatched input '(\r\n _Out_writes_bytes_to_(nNumberOfBytesToRead, *lpNumberOfBytesRead) LPBYTE lpBuffer,\r\n _In_ DWORD PathLength,\r\n _In_reads_opt_(ExtraLength) LPCWSTR Extra,\r\n _When_(pbCancel != NULL, _Pre_satisfies_(*pbCancel == FALSE))\r\nDWORD ExtraLength,\r\n _In_ DWORD PermittedLspCategories,\r\n _Out_opt_ DWORD * pPrevPermLspCat,\r\n _Out_ LPINT lpErrno\r\n )' expecting '('
What I don't understand is why SAL_EXPR matches on lines 3 and 8. It should only match something if SAL_NAME matches first.
Why doesn't SAL_NAME match at line 22?
"What I don't understand is why is SAL_EXPR matching on lines 3 and 8? It should only match something if SAL_NAME matches first."
The lexer doesn't know anything about parser rules; it operates on the input only. It cannot know that SAL_EXPR "should only match something if SAL_NAME matches first".
The best approach is probably not to put this logic into the lexer at all, i.e. to decide whether the input is a SAL expression or just something else in parentheses in the parser, not in the lexer.
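That maximal-munch behaviour can be illustrated with a toy longest-match scanner in plain Java (no ANTLR; the rule set below is a simplified, hypothetical stand-in for the grammar's lexer rules, and the SAL_EXPR pattern ignores nesting):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy longest-match tokenizer: at each position every rule is tried and the
// longest match wins (ties go to the rule listed first) -- exactly the policy
// an ANTLR lexer follows, with no regard to what the parser currently expects.
public class MaximalMunch {
    static final String[][] RULES = {
        {"SAL_NAME", "_[A-Z][a-zA-Z_]+_"},
        {"IDENTIFIER", "[a-zA-Z0-9_$]+"},
        {"SAL_EXPR", "\\([^()]*\\)"},      // simplified: nesting not handled
        {"WS", "[ \\t\\r\\n]+"},
    };

    public static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestLen = 0;
            for (String[] rule : RULES) {
                Matcher m = Pattern.compile(rule[1]).matcher(input)
                                   .region(pos, input.length());
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestName = rule[0];
                    bestLen = m.end() - pos;
                }
            }
            if (bestName == null)
                throw new IllegalStateException("no rule matches at position " + pos);
            if (!bestName.equals("WS")) out.add(bestName);   // whitespace is skipped
            pos += bestLen;
        }
        return out;
    }
}
```

tokenize("routine (type param)") yields [IDENTIFIER, SAL_EXPR]: the parenthesised argument list becomes a SAL_EXPR token even though no SAL_NAME precedes it, which is exactly what the token dump above shows.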
Here is the functioning grammar:
//
// Parse C function declarations from a header file
//
grammar FuncDef;
//
// Parser rules
//
start :
func_def+
;
func_def :
'extern'? ret_type = IDENTIFIER access = access_spec* routine = IDENTIFIER '(' arg_list* ')' ';'
;
sal_statement :
SAL_NAME sal_expr?
;
sal_expr :
'(' ( ~( '(' | ')' ) | sal_expr )* ')' // We don't care about anything within a SAL expression, so eat everything within matched and nested parentheses
;
access_spec :
('FAR' | 'PASCAL' | 'WINAPI' | 'APIENTRY' | 'WSAAPI' | 'WSPAPI')
;
argument :
sal_statement? type = IDENTIFIER access_spec? is_pointer = '*'? arg = IDENTIFIER?
| sal_statement
;
arg_list :
argument (',' argument)*
;
//
// Lexer rules
//
SAL_NAME : '_' Capital (Letter | '_')+ '_'; // Restricted form of IDENTIFIER, so it must be first
IDENTIFIER : Id_chars+;
CPP_COMMENT : '//' .*? '\r'? '\n' -> channel (HIDDEN);
C_COMMENT : '/*' .*? '*/' -> channel (HIDDEN);
OPERATORS : [&|=!><];
WS : [ \t\r\n]+ -> skip; // Ignore all whitespace
fragment Id_chars : Letter | Digit | '_' | '$';
fragment Capital : [A-Z];
fragment Letter : [a-zA-Z];
fragment Digit : [0-9];

Antlr Skip text outside tag

I'm trying to skip/ignore the text outside a custom tag:
This text is a unique token to skip <?compo \5+5\ ?> also this <?compo \1+1\ ?>
I tried with the follow lexer:
TAG_OPEN : '<?compo ' -> pushMode(COMPOSER);
mode COMPOSER;
TAG_CLOSE : ' ?>' -> popMode;
NUMBER_DIGIT : '1'..'9';
ZERO : '0';
LOGICOP
: OR
| AND
;
COMPAREOP
: EQ
| NE
| GT
| GE
| LT
| LE
;
WS : ' ';
NEWLINE : ('\r\n'|'\n'|'\r');
TAB : ('\t');
...
and parser:
instructions
: (TAG_OPEN statement TAG_CLOSE)+?;
statement
: if_statement
| else
| else_if
| if_end
| operation_statement
| mnemonic
| comment
| transparent;
But it doesn't work (I tested it using the IntelliJ tester on the rule "instructions")...
I have also added a skip rule outside the "COMPOSER" mode:
TEXT_SKIP : TAG_CLOSE .*? (TAG_OPEN | EOF) -> skip;
But I don't get any results...
Can someone help me?
EDIT:
I changed "instructions" and now the parse tree is correctly built for every instruction of every tag:
instructions : (.*? TAG_OPEN statement TAG_CLOSE .*?)+;
But I get a token recognition error for the characters outside the tags...
Below is a quick demo that worked for me.
Lexer grammar:
lexer grammar CompModeLexer;
TAG_OPEN
: '<?compo' -> pushMode(COMPOSER)
;
OTHER
: . -> skip
;
mode COMPOSER;
TAG_CLOSE
: '?>' -> popMode
;
OPAR
: '('
;
CPAR
: ')'
;
INT
: '0'
| [1-9] [0-9]*
;
LOGICOP
: 'AND'
| 'OR'
;
COMPAREOP
: [<>!] '='
| [<>=]
;
MULTOP
: [*/%]
;
ADDOP
: [+-]
;
SPACE
: [ \t\r\n\f] -> skip
;
Parser grammar:
parser grammar CompModeParser;
options {
tokenVocab=CompModeLexer;
}
parse
: tag* EOF
;
tag
: TAG_OPEN statement TAG_CLOSE
;
statement
: expr
;
expr
: '(' expr ')'
| expr MULTOP expr
| expr ADDOP expr
| expr COMPAREOP expr
| expr LOGICOP expr
| INT
;
A test with the input This text is a unique token to skip <?compo 5+5 ?> also this <?compo 1+1 ?> parsed without errors, producing one tag subtree per <?compo ... ?> section (the parse-tree image is not reproduced here).
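The pushMode/popMode mechanics can be sketched as a two-state scanner in plain Java (a hypothetical simplification that only extracts tag contents; a real ANTLR lexer of course produces proper tokens inside the mode):

```java
import java.util.ArrayList;
import java.util.List;

// Two-state sketch of the lexer-mode idea: in the default mode every character
// is skipped until '<?compo' is seen (pushMode); inside the COMPOSER mode
// characters are collected until '?>' (popMode). Assumes well-formed tags.
public class TagScanner {
    public static List<String> extractTags(String input) {
        List<String> contents = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inTag = false;               // false = default mode, true = COMPOSER
        int pos = 0;
        while (pos < input.length()) {
            if (!inTag) {
                if (input.startsWith("<?compo", pos)) { inTag = true; pos += 7; }
                else pos++;                  // OTHER : . -> skip
            } else if (input.startsWith("?>", pos)) {
                contents.add(current.toString().trim());
                current.setLength(0);
                inTag = false;               // popMode
                pos += 2;
            } else {
                current.append(input.charAt(pos++));
            }
        }
        return contents;
    }
}
```

Everything outside the tags is consumed one character at a time and thrown away, which is precisely what the catch-all OTHER : . -> skip rule does in the default mode.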
I found another solution (not as elegant as the previous one):
Create a generic TEXT token in the general context (so outside the tag's mode)
TEXT : ( ~[<] | '<' ~[?])+ -> skip;
Create a parser rule for handle a generic text
code
: TEXT
| (TEXT? instruction TEXT?)+;
Create a parser rule for handle an instruction
instruction
: TAG_OPEN statement TAG_CLOSE;

ANTLR4 using HIDDEN channel causes errors while using skip does not

In my grammar I use:
WS: [ \t\r\n]+ -> skip;
when I change this to use HIDDEN channel:
WS: [ \t\r\n]+ -> channel(HIDDEN);
I receive errors (extraneous input ' ' ...) that I did not receive while using 'skip'.
I thought that skipping and sending to a channel do not differ in terms of the content passed to the parser.
Below you can find a code excerpt in which the parser is executed:
CharStream charStream = new ANTLRInputStream(formulaString);
FormulaLexer lexer = new FormulaLexer(charStream);
BufferedTokenStream tokens = new BufferedTokenStream(lexer);
FormulaParser parser = new FormulaParser(tokens);
ParseTree tree = parser.startRule();
StartRuleVisitor startRuleVisitor = new StartRuleVisitor();
startRuleVisitor.visit(tree);
VariableVisitor variableVisitor = new VariableVisitor(tokens);
variableVisitor.visit(tree);
And a grammar itself:
grammar Formula;
startRule
: variable RELATION_OPERATOR integer
;
integer
: DIGIT+
;
identifier
: (LETTER | DIGIT) ( DIGIT | LETTER | '_' | '.')+
;
tableId
: 'T_' (identifier | WILDCARD)
;
rowId
: 'R_' (identifier | WILDCARD)
;
columnId
: 'C_' (identifier | WILDCARD)
;
sheetId
: 'S_' (identifier | WILDCARD)
;
variable
: L_CURLY_BRACKET cellIdComponent (COMMA cellIdComponent)+ R_CURLY_BRACKET
;
cellIdComponent
: tableId | rowId | columnId | sheetId
;
COMMA
: ','
;
RELATION_OPERATOR
: EQ
;
WILDCARD
: 'NNN'
;
L_CURLY_BRACKET
: '{'
;
R_CURLY_BRACKET
: '}'
;
LETTER
: ('a' .. 'z') | ('A' .. 'Z')
;
DIGIT
: ('0' .. '9')
;
EQ
: '='
| 'EQ' | 'eq'
;
WS
: [ \t\r\n]+ -> channel(HIDDEN)
;
String I try to parse:
{T_C 00.01, R_010, C_010} = 1
Output I get with channel(HIDDEN) used:
line 1:4 extraneous input ' ' expecting {'_', '.', LETTER, DIGIT}
line 1:11 extraneous input ' ' expecting {'T_', 'R_', 'C_', 'S_'}
line 1:18 extraneous input ' ' expecting {'T_', 'R_', 'C_', 'S_'}
line 1:27 extraneous input ' ' expecting RELATION_OPERATOR
line 1:29 extraneous input ' ' expecting DIGIT
But if I change channel(HIDDEN) to 'skip' there are no errors.
What is more, I have observed that for a more complex grammar than this one I get 'no viable alternative at input...' when I use channel(HIDDEN), and once again the errors disappear with 'skip'.
Do you know what may be the cause of this?
You should use CommonTokenStream instead of BufferedTokenStream. See the BufferedTokenStream description on GitHub:
This token stream ignores the value of {@link Token#getChannel}. If your
parser requires the token stream filter tokens to only those on a particular
channel, such as {@link Token#DEFAULT_CHANNEL} or
{@link Token#HIDDEN_CHANNEL}, use a filtering token stream such as
{@link CommonTokenStream}.
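What CommonTokenStream adds over BufferedTokenStream is essentially channel filtering. Here is a minimal plain-Java model of that difference (toy Tok class, not ANTLR's Token):

```java
import java.util.ArrayList;
import java.util.List;

// BufferedTokenStream hands every token to the parser, including the HIDDEN
// ones, so the parser trips over whitespace. CommonTokenStream filters the
// stream down to a single channel (the default one) before the parser sees it.
public class ChannelFilterDemo {
    static final int DEFAULT_CHANNEL = 0, HIDDEN = 1;

    static class Tok {
        final String text;
        final int channel;
        Tok(String text, int channel) { this.text = text; this.channel = channel; }
    }

    // What CommonTokenStream effectively does before feeding the parser:
    public static List<String> onChannel(List<Tok> tokens, int channel) {
        List<String> out = new ArrayList<>();
        for (Tok t : tokens)
            if (t.channel == channel) out.add(t.text);
        return out;
    }
}
```

With the ANTLR runtime the actual fix is a one-line change in the code excerpt above: replace BufferedTokenStream tokens = new BufferedTokenStream(lexer); with CommonTokenStream tokens = new CommonTokenStream(lexer);.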

antlr4 grammar errors when parsing

I have the following grammar:
grammar Token;
prog: (expr NL?)+ EOF;
expr: '[' type ']';
type : typeid ':' value;
typeid : 'TXT' | 'ENC' | 'USR';
value: Text | INT;
INT : '0' | [1-9] [0-9]*;
//WS : [ \t]+;
WS : [ \t\n\r]+ -> skip ;
NL: '\r'? '\n';
Text : ~[\]\[\n\r"]+ ;
and the text I need to parse is something like this below
[TXT:look at me!]
[USR:19700]
[TXT:, can I go there?]
[ENC:124124]
[TXT:this is needed for you to go...]
I need to split this text, but I get some errors when I run grun.bat Token prog -gui -trace -diagnostics
enter prog, LT(1)=[
enter expr, LT(1)=[
consume [#0,0:0='[',<3>,1:0] rule expr
enter type, LT(1)=TXT:look at me!
enter typeid, LT(1)=TXT:look at me!
line 1:1 mismatched input 'TXT:look at me!' expecting {'TXT', 'ENC', 'USR'}
... much more ...
What is wrong with my grammar? Please help me!
You must understand that tokens are not created based on what the parser is trying to match. The lexer tries to match as many characters as possible (independently of the parser!): your Text token should be defined differently.
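To see the longest-match rule at work on this exact input: after the '[' token, both the implicit 'TXT' literal and the Text rule can match, but Text matches all the way to the closing bracket, so it wins. A small sketch using java.util.regex to measure match lengths (the patterns mirror the grammar's rules):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Measures how many characters a rule (expressed as a regex) would consume
// from the start of the remaining input -- the quantity an ANTLR lexer
// maximizes when choosing between competing rules.
public class LongestMatch {
    public static int matchLength(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.end() : 0;
    }
}
```

On the remaining input "TXT:look at me!]", the 'TXT' literal consumes 3 characters while the Text pattern ~[\]\[\n\r"]+ consumes 15, so the lexer emits one 15-character Text token and the parser never sees a 'TXT' keyword.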
You could make text a parser rule instead, and match single-character tokens like this:
grammar Token;
prog : expr+ EOF;
expr : '[' type ']';
type : typeid ':' value;
typeid : 'TXT' | 'ENC' | 'USR';
value : text | INT;
text : CHAR+;
INT : '0' | [1-9] [0-9]*;
WS : [ \t\n\r]+ -> skip ;
CHAR : ~[\[\]\r\n];