ANTLR: parse strings (keeping whitespace) and normal identifiers

I am trying to use ANTLR4 to parse source files. A string literal may contain any kind of character, including whitespace that must be preserved, while a normal identifier contains only English letters and digits (whitespace around identifiers is thrown away).
I use the following ANTLR grammar (a minimal example), but it doesn't work as expected.
grammar parseString;
rules
: stringRule+
;
stringRule
: formatString
| idString
;
formatString
: STRING_DOUBLEQUOTE STRING STRING_DOUBLEQUOTE
;
idString
: (NONTERM | TERM)
;
// LEXER
STRING_DOUBLEQUOTE
: '"' ;
DIGITS
: DIGIT+
;
TERM
: UPPERCHAR CHAR+
;
NONTERM
: LOWERCHAR CHAR+
;
fragment
CHAR
: LOWERCHAR
| UPPERCHAR
| DIGIT
| '-'
| '_'
;
fragment
DIGIT
: [0-9]
;
fragment
LOWERCHAR
: [a-z]
;
fragment
UPPERCHAR
: [A-Z]
;
WS
: (' ' | '\t' | '\r' | '\n')+ -> skip
; // skip spaces, tabs, newlines
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
STRING
: ~('"')*
;
For the test cases that I use,
Test
HelloWorld
"$this is a string"
"*this is another string!"
I get the error line 1:0 extraneous input 'Test\nHelloWorld\n' expecting {'"', TERM, NONTERM}. The last two lines are correctly parsed as 'formatString'. The first two lines, however, are not matched as 'idString': the newline characters ('\n') are not thrown away and end up inside a single token. What am I doing wrong?

Your STRING rule matches anything except a double quote, so it will swallow just about any input. That is far too loose. The lexer prefers the longest match, and ~'"'* keeps consuming characters until it reaches a '"', so on the first two lines it beats TERM, NONTERM, and WS. You need a much tighter definition of what distinguishes a STRING from the other tokens.

Yes, there is a problem in this grammar: the token STRING matches 'Test\nHelloWorld\n'. Everything up to the first quote ends up in that one token, but no parser rule accepts a bare STRING token at that position.
Think about changing the STRING token.
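To make the longest-match behaviour described in the answers concrete, here is a minimal Python sketch that imitates it with regular expressions. It is not ANTLR: the regexes are hand-written stand-ins for the lexer rules in the question, and TIGHTER_RULES (a STRING token that includes its own quotes) is only one hypothetical way to tighten the rule, not something proposed in the answers.

```python
import re

IDENT_TAIL = r"[A-Za-z0-9_-]+"  # stand-in for CHAR+ in the question's grammar

COMMON = [
    ("DIGITS", r"[0-9]+"),
    ("TERM", r"[A-Z]" + IDENT_TAIL),
    ("NONTERM", r"[a-z]" + IDENT_TAIL),
    ("WS", r"[ \t\r\n]+"),
    ("LINE_COMMENT", r"//[^\r\n]*"),
]
# Rules as in the question: STRING is "anything that is not a double quote".
ORIGINAL_RULES = [("STRING_DOUBLEQUOTE", r'"')] + COMMON + [("STRING", r'[^"]*')]
# Hypothetical tighter variant: the quotes are part of the STRING token itself.
TIGHTER_RULES = COMMON + [("STRING", r'"[^"]*"')]
SKIPPED = {"WS", "LINE_COMMENT"}


def tokenize(rules, text):
    """Pick the longest match at each position; on a tie the earlier rule wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_text = None, ""
        for name, pattern in rules:
            match = re.compile(pattern).match(text, pos)
            if match and len(match.group()) > len(best_text):
                best_name, best_text = name, match.group()
        if best_name is None:
            raise ValueError("no rule matches at position %d" % pos)
        pos += len(best_text)
        if best_name not in SKIPPED:
            tokens.append((best_name, best_text))
    return tokens


SAMPLE = 'Test\nHelloWorld\n"$this is a string"\n'
print(tokenize(ORIGINAL_RULES, SAMPLE)[0])  # the first token swallows both identifiers
print(tokenize(TIGHTER_RULES, SAMPLE))      # identifiers and the string are separated
```

With the original rules the first token is a STRING covering both identifier lines, which is the token named in the error message. With the tighter variant the two identifiers come out as separate TERM tokens.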

Related

Parsing strings with embedded multi-line control character sequences

I am writing a compiler for the realtime programming language PEARL.
PEARL supports strings with embedded control character sequences, e.g.
'some text'\1B 1B 1B\'some more text'.
The control character sequence is prefixed with '\ and ends with \'.
Inside the control sequence are two-digit hexadecimal numbers, each of which specifies one control character.
In the above example the resulting string would be
'some textESCESCESCsome more text'
ESC stands for the non-printable ASCII escape character.
Furthermore, newlines are allowed inside the control character sequence, so multi-line strings can be built, e.g.
'some text'\1B
1B
1B\'some more text'.
which results in the same string as above.
grammar stringliteral;
tokens {
CHAR,CHARS,CTRLCHARS,ESC,WHITESPACE,NEWLINE
}
stringLiteral: '\'' CHARS? '\'' ;
fragment
CHARS: CHAR+ ;
fragment
CHAR: CTRLCHARS | ~['\n\r] ;
fragment
ESC: '\'\\' ;
fragment
CTRLCHARS: ESC ~['] ESC;
WHITESPACE: (' ' | '\t')+ -> channel(HIDDEN);
NEWLINE: ( '\r' '\n'? | '\n' ) -> channel(HIDDEN);
The lexer/parser above behaves very strangely: it accepts only strings of the form 'x' and ignores multiple characters and the control character sequence.
I am probably overlooking something obvious. Any hint or idea on how to solve this issue is welcome!
I have now corrected the grammar according to the hints from Mike:
grammar stringliteral;
tokens {
STRING
}
stringLiteral: STRING;
STRING: '\'' ( '\'' '\\' | '\\' '\'' | . )*? '\'';
There is still a problem with the recognition of the end of the control char sequence:
The input 'A STRING'\CTRL\'' produces the errors
Line 1:10 token recognition error at: '\'
line 1:11 token recognition error at: 'C'
line 1:12 token recognition error at: 'T'
line 1:13 token recognition error at: 'R'
line 1:14 token recognition error at: 'L'
line 1:15 token recognition error at: '\'
Any idea? By the way, we are using ANTLR v4.5.
There are multiple issues with this grammar:
1. You cannot use a fragment lexer rule in a parser rule.
2. Your string rule is a parser rule, so it is subject to the automatic whitespace removal you defined with your WHITESPACE and NEWLINE rules.
3. You have no rule that accepts a control char sequence like \1B 1B 1B.
The third point in particular is a real problem, since you don't know where your control sequence ends (unless this was just a typo and you actually meant \1B \1B \1B).
In any case, don't deal with escape sequences in your lexer, except for the minimum handling required to make the rule work, i.e. handling of the \' sequence. Your rule just needs to match the entire text, and you can work out the escape sequences in your semantic phase:
STRING: '\'' ('\\' '\'' | . )*? '\'';
Note *? is the non-greedy operator to stop at the first closing quote char. Without that the lexer would continue to match all following (escaped and non-escaped) quote chars in the same string rule (greedy behavior). Additionally, the string rule is now a lexer rule, which is not affected by the whitespace skipping.
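As an illustration of what "figuring out escape sequences in the semantic phase" could look like, here is a minimal Python sketch. It assumes the token text arrives as one string including its outer quotes, that a control sequence is delimited by '\ and \' as described in the question, and that each code is a two-digit hexadecimal number (as in the 1B example). The names decode_pearl_string and CONTROL_SEQUENCE are made up for this sketch.

```python
import re

# A control sequence starts with '\ and ends with \' (per the question).
CONTROL_SEQUENCE = re.compile(r"'\\([^'\\]*)\\'")
HEX_CODE = re.compile(r"[0-9A-Fa-f]{2}")


def decode_pearl_string(token_text):
    """Strip the outer quotes and replace control sequences by their characters."""
    if len(token_text) < 2 or token_text[0] != "'" or token_text[-1] != "'":
        raise ValueError("not a quoted string: %r" % token_text)
    body = token_text[1:-1]

    def replace(match):
        chars = []
        # Codes may be separated by blanks or newlines, so split on whitespace.
        for code in match.group(1).split():
            if not HEX_CODE.fullmatch(code):
                raise ValueError("bad control code: %r" % code)
            chars.append(chr(int(code, 16)))
        return "".join(chars)

    return CONTROL_SEQUENCE.sub(replace, body)


single_line = "'some text'\\1B 1B 1B\\'some more text'"
multi_line = "'some text'\\1B\n1B\n1B\\'some more text'"
print(repr(decode_pearl_string(single_line)))
print(decode_pearl_string(single_line) == decode_pearl_string(multi_line))
```

Both inputs decode to the same string, with three ESC characters (code 1B) between the two text parts.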
I solved the problem with this grammar snippet, adapting the appropriate rules from the latest Java grammar example:
StringLiteral
: '\'' StringCharacters? '\''
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~['\\\r\n]
| EscapeSequence
;
fragment
EscapeSequence
: '\'\\' (HexEscape| ' ' | [\r\n])* '\\\''
;
fragment
HexEscape
: B4Digit B4Digit
;
fragment
B4Digit
: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F'
;

How do I parse PDF strings with nested string delimiters in antlr?

I'm working on parsing PDF content streams. Strings are delimited by parentheses but can contain nested unescaped parentheses. From the PDF Reference:
A literal string shall be written as an arbitrary number of characters enclosed in parentheses. Any characters may appear in a string except unbalanced parentheses (LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS (29h)) and the backslash (REVERSE SOLIDUS (5Ch)), which shall be treated specially as described in this sub-clause. Balanced pairs of parentheses within a string require no special treatment.
EXAMPLE 1:
The following are valid literal strings:
(This is a string)
(Strings may contain newlines
and such.)
(Strings may contain balanced parentheses ( ) and special characters (*!&}^% and so on).)
It seems like pushing lexer modes onto a stack would be the thing to handle this. Here's a stripped-down version of my lexer and parser.
lexer grammar PdfStringLexer;
Tj: 'Tj' ;
TJ: 'TJ' ;
NULL: 'null' ;
BOOLEAN: ('true'|'false') ;
LBRACKET: '[' ;
RBRACKET: ']' ;
LDOUBLEANGLE: '<<' ;
RDOUBLEANGLE: '>>' ;
NUMBER: ('+' | '-')? (INT | FLOAT) ;
NAME: '/' ID ;
// A sequence of literal characters enclosed in parentheses.
OPEN_PAREN: '(' -> more, pushMode(STR) ;
// Hexadecimal data enclosed in angle brackets
HEX_STRING: '<' [0-9A-Za-z]+ '>' ;
fragment INT: DIGIT+ ; // match 1 or more digits
fragment FLOAT: DIGIT+ '.' DIGIT* // match 1. 39. 3.14159 etc...
| '.' DIGIT+ // match .1 .14159
;
fragment DIGIT: [0-9] ; // match single digit
// Accept all characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;
WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters
mode STR;
LITERAL_STRING : ')' -> popMode ;
STRING_OPEN_PAREN: '(' -> more, pushMode(STR) ;
TEXT : . -> more ;
parser grammar PdfStringParser;
options { tokenVocab=PdfStringLexer; }
array: LBRACKET object* RBRACKET ;
dictionary: LDOUBLEANGLE (NAME object)* RDOUBLEANGLE ;
string: (LITERAL_STRING | HEX_STRING) ;
object
: NULL
| array
| dictionary
| BOOLEAN
| NUMBER
| string
| NAME
;
content : stat* ;
stat
: tj
;
tj: ((string Tj) | (array TJ)) ; // Show text
When I process this file:
(Oliver’s Army) Tj
((What’s So Funny ’Bout) Peace, Love, and Understanding) Tj
I get this error:
line 2:24 extraneous input ' Peace, Love, and Understanding)' expecting 'Tj'
So maybe pushMode doesn't push duplicate modes onto the stack. If not, what would be the way to handle nested parentheses?
Edit
I left out the instructions regarding escape sequences within the string:
Within a literal string, the REVERSE SOLIDUS is used as an escape character. The character immediately following the REVERSE SOLIDUS determines its precise interpretation as shown in Table 3. If the character following the REVERSE SOLIDUS is not one of those shown in Table 3, the REVERSE SOLIDUS shall be ignored.
Table 3 lists \n, \r, \t, \b backspace (08h), \f formfeed (FF), \(, \), \\, and \ddd character code ddd (octal)
An end-of-line marker appearing within a literal string without a preceding REVERSE SOLIDUS shall be treated as a byte value of (0Ah), irrespective of whether the end-of-line marker was a CARRIAGE RETURN (0Dh), a LINE FEED (0Ah), or both.
EXAMPLE 2:
(These \
two strings \
are the same.)
(These two strings are the same.)
EXAMPLE 3:
(This string has an end-of-line at the end of it.
)
(So does this one.\n)
Should I use this STRING definition:
STRING
: '(' ( ~[()]+ | STRING )* ')'
;
without modes and deal with escape sequences in my code or create a lexer mode for strings and deal with escape sequences in the grammar?
You could do this with lexical modes, but in this case it's not really needed. You could simply define a lexer rule like this:
STRING
: '(' ( ~[()]+ | STRING )* ')'
;
And with escape sequences, you could try:
STRING
: '(' ( ~[()\\]+ | ESCAPE_SEQUENCE | STRING )* ')'
;
fragment ESCAPE_SEQUENCE
: '\\' ( [nrtbf()\\] | [0-7] [0-7] [0-7] )
;
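The recursive STRING rule above amounts to counting nesting depth while skipping escaped characters. The following Python sketch shows that idea outside ANTLR; it only finds where a literal string ends and does not decode the escapes. The function name scan_literal_string is made up for this illustration.

```python
def scan_literal_string(text, start=0):
    """Return the balanced '(...)' literal that begins at text[start]."""
    if start >= len(text) or text[start] != "(":
        raise ValueError("expected '(' at position %d" % start)
    depth = 0
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # A backslash escapes the next character, so \( and \) do not nest.
            i += 2
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
        i += 1
    raise ValueError("unbalanced parentheses")


line = "((What’s So Funny ’Bout) Peace, Love, and Understanding) Tj"
print(scan_literal_string(line))
print(scan_literal_string(r"(a \( b) Tj"))
```

For the second line of the test file, the whole outer pair of parentheses is returned as one string, which is the behaviour the question was after.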

Is it possible to distinguish escape sequences in lexer in Antlr4?

I would like to match sequences like \' and \" as lexer elements
ESCAPESEQUECE :
'\\\"' |
'\\\''
;
while also distinguish individual quotes when they are not escaped
SINGLEQUOTE:
'\''
;
DOUBLEQUOTE:
'\"'
;
The final goal is to recognize MySQL-like strings with the parser.
Is this possible, and is it the correct way?
Answer
Yes, it is totally possible by having separate tokens.
Example
grammar escp;
SINGLE: '\'';
DOUBLE: '\"';
ESCAPED : '\\"' | '\\\'';
char: SINGLE | DOUBLE;
escaped : ESCAPED;
program: (char | escaped)+;
The AST for input string '\"'"\"""'\'\"\' will be:

Fixed number format in ANTLR

How do I specify a number with a fixed number of digits in an ANTLR grammar?
I want to parse a line that contains fields with a fixed number of characters. Each field is a number.
0034|9056|4567|0987|-2340| +345|1000
The above line is a sample. The | characters indicate field boundaries; they are not in the actual file and are shown here only to mark the boundaries.
A field can include blank characters and a + or - sign.
I'd keep the lexer grammar as simple as possible and just match zero or more spaces followed by an optional sign followed by a number in your parser grammar. After matching that, check (in your parser grammar) if the "width" of the field is correct.
An example grammar:
line
: field ('|' field)*
;
field
: Spaces? ('+' | '-')? Number // validate if 'field' is correct in this rule
;
Number
: '0'..'9'+
;
Spaces
: ' '+
;
And a possible validation scheme could look like:
line
: field ('|' field)*
;
field
@init{int length = 0;}
: (Spaces {length += $Spaces.text.length();})?
('+' | '-')? Number {length += $Number.text.length(); if(length != 4) {/* do something */}}
;
Number
: '0'..'9'+
;
Spaces
: ' '+
;
What about the following:
INT : ('+'|'-')? ('0'..'9')+;
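For comparison, here is a minimal Python sketch of the underlying task, cutting a line into fixed-width numeric fields and validating each one. This is plain slicing, not the ANTLR approach from the answers. The widths list is an assumption read off the sample line in the question (the real record layout is not given), and parse_fixed_width is a made-up name.

```python
import re

# Optional leading blanks, an optional sign, then at least one digit.
FIELD = re.compile(r" *[+-]?[0-9]+")


def parse_fixed_width(line, widths):
    """Slice the line into fields of the given widths and convert each to int."""
    if len(line) != sum(widths):
        raise ValueError("expected %d characters, got %d" % (sum(widths), len(line)))
    values, pos = [], 0
    for width in widths:
        field = line[pos:pos + width]
        if not FIELD.fullmatch(field):
            raise ValueError("bad field %r at column %d" % (field, pos))
        values.append(int(field))  # int() tolerates the leading blanks and sign
        pos += width
    return values


# The sample line without the '|' markers; the widths are assumed from the sample.
sample = "0034905645670987-2340 +3451000"
widths = [4, 4, 4, 4, 5, 5, 4]
print(parse_fixed_width(sample, widths))
```

A field such as "+ 345" (blank after the sign) or an all-blank field is rejected by the FIELD pattern rather than silently accepted.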

How can my ANTLR lexer match a token made of characters that are subset of another kind of token?

I have what I think is a simple ANTLR question. I have two token types: ident and special_ident. I want my special_ident to match a single letter followed by a single digit. I want the generic ident to match a single letter, optionally followed by any number of letters or digits. My (incorrect) grammar is below:
expr
: special_ident
| ident
;
special_ident : LETTER DIGIT;
ident : LETTER (LETTER | DIGIT)*;
LETTER : 'A'..'Z';
DIGIT : '0'..'9';
When I try to check this grammar, I get this warning:
Decision can match input such as "LETTER DIGIT" using multiple alternatives: 1, 2.
As a result, alternative(s) 2 were disabled for that input
I understand that my grammar is ambiguous and that input such as A1 could match either ident or special_ident. I really just want the special_ident to be used in the narrowest of cases.
Here's some sample input and what I'd like it to match:
A : ident
A1 : special_ident
A1A : ident
A12 : ident
AA1 : ident
How can I form my grammar such that I correctly identify my two types of identifiers?
Seems that you have 3 cases:
A
AN
A(A|N)(A|N)+
You could classify the middle one as special_ident and the other two as ident; seems that should do the trick.
I'm a bit rusty with ANTLR, I hope this hint is enough. I can try to write out the expressions for you but they could be wrong:
long_ident : LETTER (LETTER | DIGIT) (LETTER | DIGIT)+;
special_ident : LETTER DIGIT;
ident : LETTER | long_ident;
Expanding on Carl's thought, I would guess you have four different cases:
A
AN
AA(A|N)*
AN(A|N)+
Only case 2 should be recognized as special_ident; the other three should be ident. All of them can be distinguished by syntax alone. Here is a quick grammar that I tested in ANTLRWorks, and it appeared to work properly for me. I think Carl's version has one bug: it does not match AA, because long_ident requires at least three characters. Still, getting you 99% of the way there is a huge benefit, so this is only a minor modification of his idea.
prog
: (expr WS)+ EOF;
expr
: special_ident {System.out.println("Found special_ident:" + $special_ident.text + "\n");}
| ident {System.out.println("Found ident:" + $ident.text + "\n");}
;
special_ident : LETTER DIGIT;
ident : LETTER
|LETTER DIGIT (LETTER|DIGIT)+
|LETTER LETTER (LETTER|DIGIT)*;
LETTER : 'A'..'Z';
DIGIT : '0'..'9';
WS
: (' '|'\t'|'\n'|'\r')+;
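The four cases can be checked quickly outside ANTLR. This Python sketch mirrors the alternatives of special_ident and ident from the grammar above as regular expressions and runs them against the sample inputs from the question. The classify function is a made-up helper for illustration.

```python
import re

SPECIAL_IDENT = re.compile(r"[A-Z][0-9]")  # case 2: one letter, one digit
IDENT = re.compile(
    r"[A-Z]"                      # case 1: a single letter
    r"|[A-Z][0-9][A-Z0-9]+"       # case 4: letter, digit, then at least one more
    r"|[A-Z][A-Z][A-Z0-9]*"       # case 3: two letters, then anything
)


def classify(text):
    """Return 'special_ident', 'ident', or None if the text is neither."""
    if SPECIAL_IDENT.fullmatch(text):
        return "special_ident"
    if IDENT.fullmatch(text):
        return "ident"
    return None


for sample in ["A", "A1", "A1A", "A12", "AA1", "AA"]:
    print(sample, ":", classify(sample))
```

Because the ident alternatives never match exactly one letter followed by one digit, the two patterns do not overlap, which is what removes the ambiguity reported in the warning.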