using antlr parse message, how to resolve ambiguous literal - antlr

I am a newbie with ANTLR and try to parse a World Meteorological Organization (WMO) messages using ANTLR. A message like this: “AVB 12 CVD A12”。This is my grammar:
grammar a;
rule : aaa bbb? ccc ddd;
aaa: char char char ;
bbb: Digit Digit ;
ccc: ('+'|'-')? char 'V' char;
ddd: 'A' Digit Digit;
char : 'A'|'V'| Char;
Char: [A-Z];
Digit: [0-9];
WS: [ \t\n\r=] ->skip;
and it works! But the lexer tokenizes just a single char from the input and I don't know another method. Can anyone suggest a better approach?

It would clean things up a bit to recognize most of these as tokens.
I don't know the semantics of what you're trying to parse, so I don't know what would be appropriate names. (I'll use L# an p# for Lexer and Parser rules accordingly.
grammar a;
rule : L1 L2? L3 L4;
fragment CHAR: [A-Z];
fragment DIGIT: [0-9];
L3: ('+'|'-')? CHAR 'V' CHAR; // place before L1 for to take precedence for "*V*" chars
L1: CHAR CHAR CHAR ;
L2: DIGIT DIGIT ;
L4: 'A' Digit Digit;
WS: [ \t\n\r=] ->skip;
Once you give the tokens meaningful names, the grammar (and the generated *Context classes will be easier to deal with than if you treat each character as a token.

Related

How do I parse PDF strings with nested string delimiters in antlr?

I'm working on parsing PDF content streams. Strings are delimited by parentheses but can contain nested unescaped parentheses. From the PDF Reference:
A literal string shall be written as an arbitrary number of characters enclosed in parentheses. Any characters may appear in a string except unbalanced parentheses (LEFT PARENHESIS (28h) and RIGHT PARENTHESIS (29h)) and the backslash (REVERSE SOLIDUS (5Ch)), which shall be treated specially as described in this sub-clause. Balanced pairs of parentheses within a string require no special treatment.
EXAMPLE 1:
The following are valid literal strings:
(This is a string)
(Strings may contain newlines
and such.)
(Strings may contain balanced parentheses ( ) and special characters (*!&}^% and so on).)
It seems like pushing lexer modes onto a stack would be the thing to handle this. Here's a stripped-down version of my lexer and parser.
lexer grammar PdfStringLexer;
Tj: 'Tj' ;
TJ: 'TJ' ;
NULL: 'null' ;
BOOLEAN: ('true'|'false') ;
LBRACKET: '[' ;
RBRACKET: ']' ;
LDOUBLEANGLE: '<<' ;
RDOUBLEANGLE: '>>' ;
NUMBER: ('+' | '-')? (INT | FLOAT) ;
NAME: '/' ID ;
// A sequence of literal characters enclosed in parentheses.
OPEN_PAREN: '(' -> more, pushMode(STR) ;
// Hexadecimal data enclosed in angle brackets
HEX_STRING: '<' [0-9A-Za-z]+ '>' ;
fragment INT: DIGIT+ ; // match 1 or more digits
fragment FLOAT: DIGIT+ '.' DIGIT* // match 1. 39. 3.14159 etc...
| '.' DIGIT+ // match .1 .14159
;
fragment DIGIT: [0-9] ; // match single digit
// Accept all characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;
WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters
mode STR;
LITERAL_STRING : ')' -> popMode ;
STRING_OPEN_PAREN: '(' -> more, pushMode(STR) ;
TEXT : . -> more ;
parser grammar PdfStringParser;
options { tokenVocab=PdfStringLexer; }
array: LBRACKET object* RBRACKET ;
dictionary: LDOUBLEANGLE (NAME object)* RDOUBLEANGLE ;
string: (LITERAL_STRING | HEX_STRING) ;
object
: NULL
| array
| dictionary
| BOOLEAN
| NUMBER
| string
| NAME
;
content : stat* ;
stat
: tj
;
tj: ((string Tj) | (array TJ)) ; // Show text
When I process this file:
(Oliver’s Army) Tj
((What’s So Funny ’Bout) Peace, Love, and Understanding) Tj
I get this error and parse tree:
line 2:24 extraneous input ' Peace, Love, and Understanding)' expecting 'Tj'
So maybe pushMode doesn't push duplicate modes onto the stack. If not, what would be the way to handle nested parentheses?
Edit
I left out the instructions regarding escape sequences within the string:
Within a literal string, the REVERSE SOLIDUS is used as an escape character. The character immediately following the REVERSE SOLIDUS determines its precise interpretation as shown in Table 3. If the character following the REVERSE SOLIDUS is not one of those shown in Table 3, the REVERSE SOLIDUS shall be ignored.
Table 3 lists \n, \r, \t, \b backspace (08h), \f formfeed (FF), \(, \), \\, and \ddd character code ddd (octal)
An end-of-line marker appearing within a literal string without a preceding REVERSE SOLIDUS shall be treated as a byte value of (0Ah), irrespective of whether the end-of-line marker was a CARRIAGE RETURN (0Dh), a LINE FEED (0Ah), or both.
EXAMPLE 2:
(These \
two strings \
are the same.)
(These two strings are the same.)
EXAMPLE 3:
(This string has an end-of-line at the end of it.
)
(So does this one.\n)
Should I use this STRING definition:
STRING
: '(' ( ~[()]+ | STRING )* ')'
;
without modes and deal with escape sequences in my code or create a lexer mode for strings and deal with escape sequences in the grammar?
You could do this with lexical modes, but in this case it's not really needed. You could simply define a lexer rule like this:
STRING
: '(' ( ~[()]+ | STRING )* ')'
;
And with escape sequences, you could try:
STRING
: '(' ( ~[()\\]+ | ESCAPE_SEQUENCE | STRING )* ')'
;
fragment ESCAPE_SEQUENCE
: '\\' ( [nrtbf()\\] | [0-7] [0-7] [0-7] )
;

Error in array declaration grammar

%token DIGIT RETURN IDENTIFIER COLON COMMA ELSE IF NL KEYWORD BR READ WRITE WHILE EQUAL
%start y2
%left '-'
%left '+'
%right '='
%%
stmt1:KEYWORD IDENTIFIER X1 //for initialization.
;
y2:stmt1 stmt2 //y2 is starting variable
|
;
X1:COLON {printf(" for int a/ char a");}
|'['DIGIT']'COLON {printf("for array declarations");}
;
stmt2:KEYWORD IDENTIFIER"("stmt3")"stmt5 {printf("for functions");}
|
;
stmt3:KEYWORD IDENTIFIER X2
|
;
X2:stmt4 {printf("for parameter int/char");}
|"["DIGIT"]"COLON {printf("for parameter int arr[]/char arr[]");} //in this production parser is not responding
;
stmt4:COMMA stmt3 {printf("to have multiple arguments");}
|
;
%%
I am parsing string int a[10];
but it instead of parsing,
execute yyerror() every time.
This code parses int a; single statement char a; also.
Make sure that your lexer returns '[' when it sees an open bracket.
Mixing single-quoted tokens like '[' and named tokens like COLON is confusing and suggests you are copy-and-pasting from different sources, rather than actually designing a program. Since the lexer and parser must agree on the handling of tokens, this form of creating programs is error-prone. I recommend using single-quoted single-character tokens throughout, since it is more readable and simplifies the lexer.
With respect to X2, there is a difference between '[' and "[". You probably want the first one. The same problem is found in stmt2, which uses​ "(" and ")" instead of the single-quoted versions.

ANTLR parse strings (keep whitespaces) and parse normal identifiers

I am trying to use ANTLR4 to parse source files. One thing I need to do is that a string literal contains all kinds of characters and possibly white spaces while normal identifiers contains only English characters and digits (white spaces are thrown away).
I use the following antlr grammar rules (the minimal example), but it doesn't work as expected.
grammar parseString;
rules
: stringRule+
;
stringRule
: formatString
| idString
;
formatString
: STRING_DOUBLEQUOTE STRING STRING_DOUBLEQUOTE
;
idString
: (NONTERM | TERM)
;
// LEXER
STRING_DOUBLEQUOTE
: '"' ;
DIGITS
: DIGIT+
;
TERM
: UPPERCHAR CHAR+
;
NONTERM
: LOWERCHAR CHAR+
;
fragment
CHAR
: LOWERCHAR
| UPPERCHAR
| DIGIT
| '-'
| '_'
;
fragment
DIGIT
: [0-9]
;
fragment
LOWERCHAR
: [a-z]
;
fragment
UPPERCHAR
: [A-Z]
;
WS
: (' ' | '\t' | '\r' | '\n')+ -> skip
; // skip spaces, tabs, newlines
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
STRING
: ~('"')*
;
For the test cases that I use,
Test
HelloWorld
"$this is a string"
"*this is another string!"
I got the error line 1:0 extraneous input 'Test\nHelloWorld\n' expecting {'"', TERM, NONTERM}. And the last two lines of the 'formatString' are correctly parsed. But for the first two lines, since the newline characters ('\n') haven't got thrown away, thus they are not matched to 'idString'. I am wondering what I did wrong.
Your STRING rule will match anything but quotes so will scarf just about anything. That is way too loose. You will need a much tighter definition of exactly what distinguishes a STRING from the others I think. Once it's in ~'"'* it will scarf until '"'.
Yes there is a problem in this grammar. the token STRING matchs 'Test\nHelloWorld\n'. It will put everything in this token, but there is no rule that takes just the TOKEN STRING.
Think about changing the token STRING.

terminal/datatype/parser rules in xtext

I'm using xtext 2.4.
What I want to do is a SQL-like syntax.
The things confuse me are I'm not sure which things should be treated as terminal/datatype/parser rules. So far my grammar related to MyTerm is:
Model:
(terms += MyTerm ';')*
;
MyTerm:
constant=MyConstant | variable?='?'| collection_literal=CollectionLiteral
;
MyConstant
: string=STRING
| number=MyNumber
| date=MYDATE
| uuid=UUID
| boolean=MYBOOLEAN
| hex=BLOB
;
MyNumber:
int=SIGNINT | float=SIGNFLOAT
;
SIGNINT returns ecore::EInt:
'-'? INT
;
SIGNFLOAT returns ecore::EFloat:
'-'? INT '.' INT;
;
CollectionLiteral:
=> MapLiteral | SetLiteral | ListLiteral
;
MapLiteral:
'{' {MapLiteral} (entries+=MapEntry (',' entries+=MapEntry)* )? '}'
;
MapEntry:
key=MyTerm ':' value=MyTerm
;
SetLiteral:
'{' {SetLiteral} (values+=MyTerm (',' values+=MyTerm)* )+ '}'
;
ListLiteral:
'[' {ListLiteral} ( values+=MyTerm (',' values+=MyTerm)* )? ']'
;
terminal MYDATE:
'0'..'9' '0'..'9' '0'..'9' '0'..'9' '-'
'0'..'9' '0'..'9' '-'
'0'..'9' '0'..'9'
;
terminal HEX:
'a'..'h'|'A'..'H'|'0'..'9'
;
terminal UUID:
HEX HEX HEX HEX HEX HEX HEX HEX '-'
HEX HEX HEX HEX '-'
HEX HEX HEX HEX '-'
HEX HEX HEX HEX '-'
HEX HEX HEX HEX HEX HEX HEX HEX HEX HEX HEX HEX
;
terminal BLOB:
'0' ('x'|'X') HEX+
;
terminal MYBOOLEAN returns ecore::EBoolean:
'true' | 'false' | 'TRUE' | 'FALSE'
;
Few questions:
How to define integer with sign? If I define another terminal rule terminal SIGNINT: '-'? '0'..'9'+;, antlr will complain about INT becoming unreachable. Therefore I define it as a datatype rule SIGNINT: '-'? INT; Is this the correct way to do it?
How to define float with sign? I did exactly the same as define integer with sign, SIGNFLOAT: '-'? INT '.' INT;, not sure if this is correct as well.
How to define a date rule? I want to use a parser rule to store year/month/day info in fields, but define it as MyDate: year=INT '-' month=INT '-' date=INT; antlr will complain Decision can match input such as "RULE_INT '-' RULE_INT '-' RULE_INT" using multiple alternatives: 2, 3
As a result, alternative(s) 3 were disabled for that input
I also have some other rules like
the following
RelationCompare:
name=ID compare=COMPARE term=MyTerm
;
but a=4 won't be a valid RelationCompare because a and 4 will be treat as HEXs. I found this because if I change the relation to j=44 then it works. In this post it said terminal rule defined eariler will shadow those defined later. However, if I redefine terminal ID in my grammar, whether put it in front or after of terminal HEX, antlr will conplain The following token definitions can never be matched because prior tokens match the same input: RULE_HEX,RULE_MYBOOLEAN. This problem happens in k=0x00b as well. k=0xaab is valid but k=0x00b is not.
Any suggestion?
How do you define an integer with sign?
Treat it as two separate tokens '-' and INT, and use a parser rule instead of a lexer rule.
How do you define a float with sign?
Treat it as two separate tokens '-' and FLOAT, and use a parser rule instead of a lexer rule.
How do you define a date rule?
Treat it as five separate tokens and use a parser rule instead of a lexer rule.
I don't know the answer to the last question since this is in Xtext as opposed to just ANTLR.
Later I found the original antlr grammar for what I want to do therefore I simply translate the antlr grammar to xtext grammar. Here is how I defining those basic types:
terminal fragment A: 'a'|'A';
...
terminal fragment Z: 'z'|'Z';
terminal fragment DIGIT: '0'..'9';
terminal fragment LETTER: ('a'..'z'|'A'..'Z');
terminal fragment HEX: ('a'..'f'|'A'..'F'|'0'..'9');
terminal fragment EXPONENT: E ('+'|'-')? DIGIT+;
terminal INTEGER returns ecore::EInt: '-'? DIGIT+;
terminal FLOAT returns ecore::EFloat: INTEGER EXPONENT | INTEGER '.' DIGIT* EXPONENT?;
terminal BOOLEAN: T R U E | F A L S E;
The Date rule in original grammar is treated as a string.
About rules name (Rules: Antlr Grammar => xtext Grammar)
parser rule: starting with lowercase => rules starting with uppercase (each will be a Java Class)
terminal rule: starting with uppercase => using all uppercase with terminal prefix
fragment terminal rule: fragment ID => terminal fragment ID
In antlr a list of arguments is defined like this:
functionArgs
: '(' ')'
| '(' t1=term ( ',' tn=term )* ')'
;
The corresponding xtext grammar is:
FunctionArgs
: '(' ')'
| '(' ts+=Term (',' ts+=Term )* ')'
;
For those parser rules with an argument enclosed by [ ]
properties[PropertyDefinitions props]
: property[props] (K_AND property[props])*
;
Most of the time they could be moved to the left hand side
Properties
: props+=Property (K_AND props+=Property)*
;
Now it's working as expected.

How can my ANTLR lexer match a token made of characters that are subset of another kind of token?

I have what I think is a simple ANTLR question. I have two token types: ident and special_ident. I want my special_ident to match a single letter followed by a single digit. I want the generic ident to match a single letter, optionally followed by any number of letters or digits. My (incorrect) grammar is below:
expr
: special_ident
| ident
;
special_ident : LETTER DIGIT;
ident : LETTER (LETTER | DIGIT)*;
LETTER : 'A'..'Z';
DIGIT : '0'..'9';
When I try to check this grammar, I get this warning:
Decision can match input such as "LETTER DIGIT" using multiple alternatives: 1, 2.
As a result, alternative(s) 2 were disabled for that input
I understand that my grammar is ambiguous and that input such as A1 could match either ident or special_ident. I really just want the special_ident to be used in the narrowest of cases.
Here's some sample input and what I'd like it to match:
A : ident
A1 : special_ident
A1A : ident
A12 : ident
AA1 : ident
How can I form my grammar such that I correctly identify my two types of identifiers?
Seems that you have 3 cases:
A
AN
A(A|N)(A|N)+
You could classify the middle one as special_ident and the other two as ident; seems that should do the trick.
I'm a bit rusty with ANTLR, I hope this hint is enough. I can try to write out the expressions for you but they could be wrong:
long_ident : LETTER (LETTER | DIGIT) (LETTER | DIGIT)+
special_ident : LETTER DIGIT;
ident : LETTER | long_ident;
Expanding on Carl's thought, I would guess you have four different cases:
A
AN
AA(A|N)*
AN(A|N)+
Only option 2 should be token special_ident and the other three should be ident. All tokens can be identified by syntax alone. Here is a quick grammar I was able to test in ANTLRWorks and it appeared to work properly for me. I think Carl's might have one bug when trying to check AA , but getting you 99% there is a huge benefit, so this is only a minor modification to his quick thought.
prog
: (expr WS)+ EOF;
expr
: special_ident {System.out.println("Found special_ident:" + $special_ident.text + "\n");}
| ident {System.out.println("Found ident:" + $ident.text + "\n");}
;
special_ident : LETTER DIGIT;
ident : LETTER
|LETTER DIGIT (LETTER|DIGIT)+
|LETTER LETTER (LETTER|DIGIT)*;
LETTER : 'A'..'Z';
DIGIT : '0'..'9';
WS
: (' '|'\t'|'\n'|'\r')+;