How can my ANTLR lexer match a token made of characters that are a subset of another kind of token?

I have what I think is a simple ANTLR question. I have two token types: ident and special_ident. I want my special_ident to match a single letter followed by a single digit. I want the generic ident to match a single letter, optionally followed by any number of letters or digits. My (incorrect) grammar is below:
expr
: special_ident
| ident
;
special_ident : LETTER DIGIT;
ident : LETTER (LETTER | DIGIT)*;
LETTER : 'A'..'Z';
DIGIT : '0'..'9';
When I try to check this grammar, I get this warning:
Decision can match input such as "LETTER DIGIT" using multiple alternatives: 1, 2.
As a result, alternative(s) 2 were disabled for that input
I understand that my grammar is ambiguous and that input such as A1 could match either ident or special_ident. I really just want the special_ident to be used in the narrowest of cases.
Here's some sample input and what I'd like it to match:
A : ident
A1 : special_ident
A1A : ident
A12 : ident
AA1 : ident
How can I form my grammar such that I correctly identify my two types of identifiers?

Seems that you have 3 cases:
A
AN
A(A|N)(A|N)+
You could classify the middle one as special_ident and the other two as ident; seems that should do the trick.
I'm a bit rusty with ANTLR, I hope this hint is enough. I can try to write out the expressions for you but they could be wrong:
long_ident : LETTER (LETTER | DIGIT) (LETTER | DIGIT)+;
special_ident : LETTER DIGIT;
ident : LETTER | long_ident;

Expanding on Carl's thought, I would guess you have four different cases:
A
AN
AA(A|N)*
AN(A|N)+
Only case 2 should be the token special_ident; the other three should be ident. All of them can be distinguished by syntax alone. Here is a quick grammar that I tested in ANTLRWorks, and it appeared to work properly for me. I think Carl's version has one bug when checking AA (his long_ident requires at least three characters, so AA matches nothing), but getting you 99% of the way there is a huge benefit, so this is only a minor modification of his idea.
prog
: (expr WS)+ EOF;
expr
: special_ident {System.out.println("Found special_ident:" + $special_ident.text + "\n");}
| ident {System.out.println("Found ident:" + $ident.text + "\n");}
;
special_ident : LETTER DIGIT;
ident : LETTER
|LETTER DIGIT (LETTER|DIGIT)+
|LETTER LETTER (LETTER|DIGIT)*;
LETTER : 'A'..'Z';
DIGIT : '0'..'9';
WS
: (' '|'\t'|'\n'|'\r')+;
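The four cases above can also be checked outside ANTLR. The following is a minimal sketch in plain Python, not part of the original answer: the function name `classify` and the two regular expressions are illustrative assumptions that mirror the `special_ident` and `ident` rules, so the sample inputs from the question can be verified quickly.

```python
import re

# Hypothetical helper patterns mirroring the grammar rules above.
# special_ident : LETTER DIGIT
SPECIAL_IDENT = re.compile(r"[A-Z][0-9]")
# ident : LETTER | LETTER DIGIT (LETTER|DIGIT)+ | LETTER LETTER (LETTER|DIGIT)*
IDENT = re.compile(r"[A-Z]|[A-Z][0-9][A-Z0-9]+|[A-Z][A-Z][A-Z0-9]*")


def classify(text):
    """Return 'special_ident', 'ident', or None for a single word."""
    if SPECIAL_IDENT.fullmatch(text):
        return "special_ident"
    if IDENT.fullmatch(text):
        return "ident"
    return None


if __name__ == "__main__":
    for word in ["A", "A1", "A1A", "A12", "AA1", "AA"]:
        print(word, ":", classify(word))
```

Because the alternatives are disjoint, the order of the two checks does not matter here; that disjointness is what removes the ambiguity warning in the grammar.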

Related

Parsing strings with embedded multi-line control character sequences

I am writing a compiler for the realtime programming language PEARL.
PEARL supports strings with embedded control character sequences, for example:
'some text'\1B 1B 1B\'some more text'.
The control character sequence is prefixed with '\ and ends with \'.
Inside the control sequence are two-digit numbers, each of which specifies a control character.
In the above example the resulting string would be
'some textESCESCESCsome more text'
ESC stands for the non-printable ASCII escape character.
Furthermore, newlines are allowed inside the control character sequence to build multi-line strings, for example:
'some text'\1B
1B
1B\'some more text'.
which results in the same string as above.
grammar stringliteral;
tokens {
CHAR,CHARS,CTRLCHARS,ESC,WHITESPACE,NEWLINE
}
stringLiteral: '\'' CHARS? '\'' ;
fragment
CHARS: CHAR+ ;
fragment
CHAR: CTRLCHARS | ~['\n\r] ;
fragment
ESC: '\'\\' ;
fragment
CTRLCHARS: ESC ~['] ESC;
WHITESPACE: (' ' | '\t')+ -> channel(HIDDEN);
NEWLINE: ( '\r' '\n'? | '\n' ) -> channel(HIDDEN);
The lexer/parser above behaves very strangely: it accepts only strings of the form 'x' and ignores multiple characters and the control character sequence.
I am probably overlooking something obvious. Any hint or idea on how to solve this issue is welcome!
I have now corrected the grammar according to the hints from Mike:
grammar stringliteral;
tokens {
STRING
}
stringLiteral: STRING;
STRING: '\'' ( '\'' '\\' | '\\' '\'' | . )*? '\'';
There is still a problem with the recognition of the end of the control char sequence:
The input 'A STRING'\CTRL\'' produces the errors
line 1:10 token recognition error at: '\'
line 1:11 token recognition error at: 'C'
line 1:12 token recognition error at: 'T'
line 1:13 token recognition error at: 'R'
line 1:14 token recognition error at: 'L'
line 1:15 token recognition error at: '\'
Any ideas? By the way, we are using ANTLR v4.5.

There are multiple issues with this grammar:
You cannot use a fragment lexer rule in a parser rule.
Your string rule is a parser rule, so it is subject to the automatic whitespace removal you defined with your WHITESPACE and NEWLINE rules.
You have no rule to accept a control char sequence like \1B 1B 1B.
The third point especially is a real problem, since you don't know where your control sequence ends (unless this was just a typo and you actually meant \1B \1B \1B).
In any case, don't deal with escape sequences in your lexer (except for the minimum handling required to make the rule work, i.e. handling of the \' sequence). Your rule just needs to match the entire text, and you can figure out the escape sequences in your semantic phase:
STRING: '\'' ('\\' '\'' | . )*? '\'';
Note that *? is the non-greedy operator, which stops at the first closing quote char. Without it, the lexer would continue to match all following (escaped and non-escaped) quote chars in the same string rule (greedy behavior). Additionally, the string rule is now a lexer rule, which is not affected by the whitespace skipping.
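Decoding the control sequences in the semantic phase could look roughly like the sketch below. It is plain Python rather than ANTLR action code, and the function name `decode_pearl_string` is a hypothetical helper. It assumes the format described in the question: a sequence opens with '\ and closes with \', and contains two-digit hexadecimal codes separated by blanks or newlines.

```python
import re

# Assumption: a control sequence is '\ ... \' and holds only hex digits
# and whitespace (blanks or newlines), as described in the question.
_CTRL_SEQUENCE = re.compile(r"'\\([0-9A-Fa-f\s]*)\\'")


def decode_pearl_string(token_text):
    """Turn the raw text of a string token into the string value."""
    if len(token_text) < 2 or token_text[0] != "'" or token_text[-1] != "'":
        raise ValueError("not a quoted string literal")
    body = token_text[1:-1]  # drop the enclosing quotes

    def replace(match):
        codes = match.group(1).split()  # split on blanks and newlines
        for code in codes:
            if len(code) != 2:
                raise ValueError("control code must have two digits: " + code)
        return "".join(chr(int(code, 16)) for code in codes)

    return _CTRL_SEQUENCE.sub(replace, body)


if __name__ == "__main__":
    raw = r"'some text'\1B 1B 1B\'some more text'"
    print(repr(decode_pearl_string(raw)))
```

Keeping this logic out of the lexer means the token rule only has to find where the literal ends; malformed control codes can then be reported with a proper error message instead of a token recognition error.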

I solved the problem with this grammar snippet, adapting the appropriate rules from the latest Java grammar example:
StringLiteral
: '\'' StringCharacters? '\''
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~['\\\r\n]
| EscapeSequence
;
fragment
EscapeSequence
: '\'\\' (HexEscape| ' ' | [\r\n])* '\\\''
;
fragment
HexEscape
: B4Digit B4Digit
;
fragment
B4Digit
: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F'
;

ANTLR parse strings (keep whitespaces) and parse normal identifiers

I am trying to use ANTLR4 to parse source files. A string literal may contain all kinds of characters, including white space, while a normal identifier contains only English letters and digits (white space is thrown away).
I use the following antlr grammar rules (the minimal example), but it doesn't work as expected.
grammar parseString;
rules
: stringRule+
;
stringRule
: formatString
| idString
;
formatString
: STRING_DOUBLEQUOTE STRING STRING_DOUBLEQUOTE
;
idString
: (NONTERM | TERM)
;
// LEXER
STRING_DOUBLEQUOTE
: '"' ;
DIGITS
: DIGIT+
;
TERM
: UPPERCHAR CHAR+
;
NONTERM
: LOWERCHAR CHAR+
;
fragment
CHAR
: LOWERCHAR
| UPPERCHAR
| DIGIT
| '-'
| '_'
;
fragment
DIGIT
: [0-9]
;
fragment
LOWERCHAR
: [a-z]
;
fragment
UPPERCHAR
: [A-Z]
;
WS
: (' ' | '\t' | '\r' | '\n')+ -> skip
; // skip spaces, tabs, newlines
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
STRING
: ~('"')*
;
For the test cases that I use,
Test
HelloWorld
"$this is a string"
"*this is another string!"
I get the error line 1:0 extraneous input 'Test\nHelloWorld\n' expecting {'"', TERM, NONTERM}. The last two lines (the formatString cases) are parsed correctly. For the first two lines, however, the newline characters ('\n') are not thrown away, so the lines are not matched as idString. What did I do wrong?

Your STRING rule matches anything but quotes, so it will scarf up just about anything. That is way too loose. I think you will need a much tighter definition of exactly what distinguishes a STRING from the other tokens. Once the lexer is in ~'"'* it will scarf until the next '"'.

Yes, there is a problem in this grammar: the token STRING matches 'Test\nHelloWorld\n'. The lexer puts everything into this token, but no parser rule accepts a bare STRING token.
Think about changing the STRING token.
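The reason STRING wins is that an ANTLR 4 lexer picks the rule that matches the longest stretch of input, and prefers the rule defined first only when two rules tie. The sketch below imitates that selection in plain Python. The rule table is a simplified, assumed translation of the grammar in the question (DIGITS and LINE_COMMENT are left out), and `first_token` is a hypothetical helper, not an ANTLR API.

```python
import re

# Simplified stand-ins for the lexer rules, in definition order.
RULES = [
    ("STRING_DOUBLEQUOTE", re.compile(r'"')),
    ("TERM", re.compile(r"[A-Z][A-Za-z0-9_-]+")),
    ("NONTERM", re.compile(r"[a-z][A-Za-z0-9_-]+")),
    ("WS", re.compile(r"[ \t\r\n]+")),
    ("STRING", re.compile(r'[^"]*')),
]


def first_token(text, rules=RULES):
    """Pick the longest match; on a tie the earlier rule wins."""
    best_name, best_length = None, 0
    for name, pattern in rules:
        match = pattern.match(text)
        # Strictly greater, so an earlier rule keeps the win on a tie.
        if match and len(match.group()) > best_length:
            best_name, best_length = name, len(match.group())
    return best_name, text[:best_length]


if __name__ == "__main__":
    source = 'Test\nHelloWorld\n"$this is a string"'
    print(first_token(source))
```

With the test input from the question, TERM can match only `Test`, while STRING can match everything up to the first quote, so STRING is chosen. That is the `'Test\nHelloWorld\n'` token reported in the error message.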

terminal/datatype/parser rules in xtext

I'm using xtext 2.4.
What I want to do is a SQL-like syntax.
What confuses me is that I'm not sure which things should be treated as terminal, datatype, or parser rules. So far, my grammar related to MyTerm is:
Model:
(terms += MyTerm ';')*
;
MyTerm:
constant=MyConstant | variable?='?'| collection_literal=CollectionLiteral
;
MyConstant
: string=STRING
| number=MyNumber
| date=MYDATE
| uuid=UUID
| boolean=MYBOOLEAN
| hex=BLOB
;
MyNumber:
int=SIGNINT | float=SIGNFLOAT
;
SIGNINT returns ecore::EInt:
'-'? INT
;
SIGNFLOAT returns ecore::EFloat:
'-'? INT '.' INT
;
CollectionLiteral:
=> MapLiteral | SetLiteral | ListLiteral
;
MapLiteral:
'{' {MapLiteral} (entries+=MapEntry (',' entries+=MapEntry)* )? '}'
;
MapEntry:
key=MyTerm ':' value=MyTerm
;
SetLiteral:
'{' {SetLiteral} (values+=MyTerm (',' values+=MyTerm)* )+ '}'
;
ListLiteral:
'[' {ListLiteral} ( values+=MyTerm (',' values+=MyTerm)* )? ']'
;
terminal MYDATE:
'0'..'9' '0'..'9' '0'..'9' '0'..'9' '-'
'0'..'9' '0'..'9' '-'
'0'..'9' '0'..'9'
;
terminal HEX:
'a'..'h'|'A'..'H'|'0'..'9'
;
terminal UUID:
HEX HEX HEX HEX HEX HEX HEX HEX '-'
HEX HEX HEX HEX '-'
HEX HEX HEX HEX '-'
HEX HEX HEX HEX '-'
HEX HEX HEX HEX HEX HEX HEX HEX HEX HEX HEX HEX
;
terminal BLOB:
'0' ('x'|'X') HEX+
;
terminal MYBOOLEAN returns ecore::EBoolean:
'true' | 'false' | 'TRUE' | 'FALSE'
;
Few questions:
How do I define an integer with a sign? If I define another terminal rule, terminal SIGNINT: '-'? '0'..'9'+;, ANTLR complains that INT becomes unreachable. Therefore I defined it as a datatype rule, SIGNINT: '-'? INT;. Is this the correct way to do it?
How do I define a float with a sign? I did exactly the same as for the signed integer, SIGNFLOAT: '-'? INT '.' INT;, but I'm not sure whether this is correct either.
How do I define a date rule? I want to use a parser rule to store the year/month/day info in fields, but if I define it as MyDate: year=INT '-' month=INT '-' date=INT;, ANTLR complains: Decision can match input such as "RULE_INT '-' RULE_INT '-' RULE_INT" using multiple alternatives: 2, 3
As a result, alternative(s) 3 were disabled for that input
I also have some other rules, like the following:
RelationCompare:
name=ID compare=COMPARE term=MyTerm
;
but a=4 won't be a valid RelationCompare because a and 4 will be treated as HEXs. I found this out because if I change the relation to j=44, it works. It was said in this post that a terminal rule defined earlier will shadow those defined later. However, if I redefine terminal ID in my grammar, whether I put it before or after terminal HEX, ANTLR complains: The following token definitions can never be matched because prior tokens match the same input: RULE_HEX,RULE_MYBOOLEAN. The same problem occurs with k=0x00b: k=0xaab is valid but k=0x00b is not.
Any suggestion?

How do you define an integer with sign?
Treat it as two separate tokens '-' and INT, and use a parser rule instead of a lexer rule.
How do you define a float with sign?
Treat it as two separate tokens '-' and FLOAT, and use a parser rule instead of a lexer rule.
How do you define a date rule?
Treat it as five separate tokens and use a parser rule instead of a lexer rule.
I don't know the answer to the last question since this is in Xtext as opposed to just ANTLR.
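To illustrate the "separate tokens plus a parser rule" idea independently of Xtext, here is a small sketch in plain Python. The names `tokenize`, `parse_signed_int`, and `parse_date` are hypothetical; the tokenizer only knows digit runs, '-' and '.', and silently ignores any other character, which a real lexer would not do.

```python
import re

# The "lexer": only unsigned digit runs and single punctuation tokens.
TOKEN = re.compile(r"[0-9]+|[-.]")


def tokenize(text):
    return TOKEN.findall(text)


def parse_signed_int(tokens):
    """Parser-level rule: '-'? INT"""
    negative = bool(tokens) and tokens[0] == "-"
    rest = tokens[1:] if negative else tokens
    if len(rest) != 1 or not rest[0].isdigit():
        raise ValueError("expected '-'? INT")
    return -int(rest[0]) if negative else int(rest[0])


def parse_date(tokens):
    """Parser-level rule: INT '-' INT '-' INT (five tokens)."""
    if len(tokens) != 5 or tokens[1] != "-" or tokens[3] != "-":
        raise ValueError("expected INT '-' INT '-' INT")
    year, month, day = tokens[0], tokens[2], tokens[4]
    if not (year.isdigit() and month.isdigit() and day.isdigit()):
        raise ValueError("expected INT '-' INT '-' INT")
    return {"year": int(year), "month": int(month), "day": int(day)}


if __name__ == "__main__":
    print(parse_signed_int(tokenize("-42")))
    print(parse_date(tokenize("2013-05-17")))
```

Since the lexer never has to decide whether a '-' belongs to a number, a date, or a subtraction, no token rule can shadow another one; the decision is made from context in the parser.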

Later I found the original ANTLR grammar for what I want to do, so I simply translated the ANTLR grammar into an Xtext grammar. Here is how I define those basic types:
terminal fragment A: 'a'|'A';
...
terminal fragment Z: 'z'|'Z';
terminal fragment DIGIT: '0'..'9';
terminal fragment LETTER: ('a'..'z'|'A'..'Z');
terminal fragment HEX: ('a'..'f'|'A'..'F'|'0'..'9');
terminal fragment EXPONENT: E ('+'|'-')? DIGIT+;
terminal INTEGER returns ecore::EInt: '-'? DIGIT+;
terminal FLOAT returns ecore::EFloat: INTEGER EXPONENT | INTEGER '.' DIGIT* EXPONENT?;
terminal BOOLEAN: T R U E | F A L S E;
The Date rule in original grammar is treated as a string.
About rule names (ANTLR grammar => Xtext grammar):
parser rule: starting with lowercase => rules starting with uppercase (each will be a Java Class)
terminal rule: starting with uppercase => using all uppercase with terminal prefix
fragment terminal rule: fragment ID => terminal fragment ID
In antlr a list of arguments is defined like this:
functionArgs
: '(' ')'
| '(' t1=term ( ',' tn=term )* ')'
;
The corresponding xtext grammar is:
FunctionArgs
: '(' ')'
| '(' ts+=Term (',' ts+=Term )* ')'
;
For those parser rules with an argument enclosed by [ ]
properties[PropertyDefinitions props]
: property[props] (K_AND property[props])*
;
Most of the time they could be moved to the left hand side
Properties
: props+=Property (K_AND props+=Property)*
;
Now it's working as expected.

How to parse and split alphabet characters and numbers from a string using ANTLR grammar

I have a grammar which parses alphabet characters and numbers separately:
grammar Demo;
options
{
language = C;
}
program : process+
;
process : Alphanumeric {printf("\%s",$Alphanumeric.text->chars);}
;
Alphanumeric : (Alphabet | Number)+
;
fragment Alphabet : ('a'..'z')+
;
fragment Number : ('0'..'9')+
;
Suppose the input is 'a10' or 'b10': the printf statement would display a10 or b10, but I want the alphabetic character and the number to be split, i.e. a and 10 should be separate, because I need to compare 'a' with another string and save the number next to 'a' or 'b' etc. in a table.
To be precise, a10 has to be split into a for the comparison and 10 for storage, and I should be able to fetch the letter and the number separately.
How do I define a grammar for something like this?

You need to expose Alphabet and Number to your parser separately, which means they should be top-level rules in the lexer (not fragment rules). As a result, Alphanumeric will also become a parser rule:
alphanumeric : (Alphabet | Number)+
;
Alphabet : ('a'..'z')+
;
Number : ('0'..'9')+
;
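The effect of promoting Alphabet and Number to separate tokens can be previewed with a short sketch in plain Python. The function name `split_alphanumeric` is a hypothetical helper, and the regular expression is an assumed translation of the two lexer rules above.

```python
import re

# One alternative per lexer rule: Alphabet : ('a'..'z')+ ; Number : ('0'..'9')+ ;
TOKEN = re.compile(r"(?P<Alphabet>[a-z]+)|(?P<Number>[0-9]+)")


def split_alphanumeric(text):
    """Return (rule name, text) pairs, one per token."""
    tokens = []
    position = 0
    while position < len(text):
        match = TOKEN.match(text, position)
        if match is None:
            raise ValueError("unexpected character at index %d" % position)
        tokens.append((match.lastgroup, match.group()))
        position = match.end()
    return tokens


if __name__ == "__main__":
    print(split_alphanumeric("a10"))
```

Each pair corresponds to one token the parser would see, so the letter part can be compared and the number part stored independently.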

Fixed number format in ANTLR

How to specify a fixed digit number in antlr grammar?
I want to parse a line which contains fields of fixed number of characters. Each field is a number.
0034|9056|4567|0987|-2340| +345|1000
The above line is a sample line. | indicates the field boundaries (it will not be in the actual file; it is shown here just to indicate the boundaries).
The fields can include blank characters and +/- signs.

I'd keep the lexer grammar as simple as possible and just match zero or more spaces followed by an optional sign followed by a number in your parser grammar. After matching that, check (in your parser grammar) if the "width" of the field is correct.
An example grammar:
line
: field ('|' field)*
;
field
: Spaces? ('+' | '-')? Number // validate if 'field' is correct in this rule
;
Number
: '0'..'9'+
;
Spaces
: ' '+
;
And a possible validation scheme could look like:
line
: field ('|' field)*
;
field
@init{int length = 0;}
: (Spaces {length += $Spaces.text.length();})?
(('+' | '-') {length++;})? Number {length += $Number.text.length(); if(length != 4) {/* do something */}}
;
Number
: '0'..'9'+
;
Spaces
: ' '+
;
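The same "match loosely, then validate the width" idea can be sketched in plain Python. The helper names `parse_field` and `parse_line` are hypothetical, the '|' separators are kept as in the example grammar above, and the list of widths is an assumption read off the sample line in the question.

```python
import re

# Mirrors the parser rule: Spaces? ('+' | '-')? Number
FIELD = re.compile(r"( *)([+-]?)([0-9]+)")


def parse_field(text, width):
    """Match the field loosely, then check its total width."""
    match = FIELD.fullmatch(text)
    if match is None:
        raise ValueError("malformed field: %r" % text)
    if len(text) != width:
        raise ValueError("field %r is not %d characters wide" % (text, width))
    return int(match.group(2) + match.group(3))


def parse_line(line, widths):
    fields = line.split("|")
    if len(fields) != len(widths):
        raise ValueError("expected %d fields, got %d" % (len(widths), len(fields)))
    return [parse_field(field, width) for field, width in zip(fields, widths)]


if __name__ == "__main__":
    sample = "0034|9056|4567|0987|-2340| +345|1000"
    # Assumed widths, taken from the sample line in the question.
    print(parse_line(sample, [4, 4, 4, 4, 5, 5, 4]))
```

Here the width counts leading blanks, the sign, and the digits together, which is the quantity a fixed-width format cares about.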

What about the following:
INT : ('+'|'-')? ('0'..'9')+;
