How do I parse PDF strings with nested string delimiters in antlr? - pdf

I'm working on parsing PDF content streams. Strings are delimited by parentheses but can contain nested unescaped parentheses. From the PDF Reference:
A literal string shall be written as an arbitrary number of characters enclosed in parentheses. Any characters may appear in a string except unbalanced parentheses (LEFT PARENHESIS (28h) and RIGHT PARENTHESIS (29h)) and the backslash (REVERSE SOLIDUS (5Ch)), which shall be treated specially as described in this sub-clause. Balanced pairs of parentheses within a string require no special treatment.
EXAMPLE 1:
The following are valid literal strings:
(This is a string)
(Strings may contain newlines
and such.)
(Strings may contain balanced parentheses ( ) and special characters (*!&}^% and so on).)
It seems like pushing lexer modes onto a stack would be the thing to handle this. Here's a stripped-down version of my lexer and parser.
lexer grammar PdfStringLexer;
Tj: 'Tj' ;
TJ: 'TJ' ;
NULL: 'null' ;
BOOLEAN: ('true'|'false') ;
LBRACKET: '[' ;
RBRACKET: ']' ;
LDOUBLEANGLE: '<<' ;
RDOUBLEANGLE: '>>' ;
NUMBER: ('+' | '-')? (INT | FLOAT) ;
NAME: '/' ID ;
// A sequence of literal characters enclosed in parentheses.
OPEN_PAREN: '(' -> more, pushMode(STR) ;
// Hexadecimal data enclosed in angle brackets
HEX_STRING: '<' [0-9A-Za-z]+ '>' ;
fragment INT: DIGIT+ ; // match 1 or more digits
fragment FLOAT: DIGIT+ '.' DIGIT* // match 1. 39. 3.14159 etc...
| '.' DIGIT+ // match .1 .14159
;
fragment DIGIT: [0-9] ; // match single digit
// Accept all characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;
WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters
mode STR;
LITERAL_STRING : ')' -> popMode ;
STRING_OPEN_PAREN: '(' -> more, pushMode(STR) ;
TEXT : . -> more ;
parser grammar PdfStringParser;
options { tokenVocab=PdfStringLexer; }
array: LBRACKET object* RBRACKET ;
dictionary: LDOUBLEANGLE (NAME object)* RDOUBLEANGLE ;
string: (LITERAL_STRING | HEX_STRING) ;
object
: NULL
| array
| dictionary
| BOOLEAN
| NUMBER
| string
| NAME
;
content : stat* ;
stat
: tj
;
tj: ((string Tj) | (array TJ)) ; // Show text
When I process this file:
(Oliver’s Army) Tj
((What’s So Funny ’Bout) Peace, Love, and Understanding) Tj
I get this error and parse tree:
line 2:24 extraneous input ' Peace, Love, and Understanding)' expecting 'Tj'
So maybe pushMode doesn't push duplicate modes onto the stack. If not, what would be the way to handle nested parentheses?
Edit
I left out the instructions regarding escape sequences within the string:
Within a literal string, the REVERSE SOLIDUS is used as an escape character. The character immediately following the REVERSE SOLIDUS determines its precise interpretation as shown in Table 3. If the character following the REVERSE SOLIDUS is not one of those shown in Table 3, the REVERSE SOLIDUS shall be ignored.
Table 3 lists \n, \r, \t, \b backspace (08h), \f formfeed (FF), \(, \), \\, and \ddd character code ddd (octal)
An end-of-line marker appearing within a literal string without a preceding REVERSE SOLIDUS shall be treated as a byte value of (0Ah), irrespective of whether the end-of-line marker was a CARRIAGE RETURN (0Dh), a LINE FEED (0Ah), or both.
EXAMPLE 2:
(These \
two strings \
are the same.)
(These two strings are the same.)
EXAMPLE 3:
(This string has an end-of-line at the end of it.
)
(So does this one.\n)
Should I use this STRING definition:
STRING
: '(' ( ~[()]+ | STRING )* ')'
;
without modes and deal with escape sequences in my code or create a lexer mode for strings and deal with escape sequences in the grammar?

You could do this with lexical modes, but in this case it's not really needed. You could simply define a lexer rule like this:
STRING
: '(' ( ~[()]+ | STRING )* ')'
;
And with escape sequences, you could try:
STRING
: '(' ( ~[()\\]+ | ESCAPE_SEQUENCE | STRING )* ')'
;
fragment ESCAPE_SEQUENCE
: '\\' ( [nrtbf()\\] | [0-7] [0-7] [0-7] )
;

Related

Parsing strings with embedded multi line control character seuqences

I am writing a compiler for the realtime programming language PEARL.
PEARL supports strings with embedded control character sequence like this e.g.
'some text'\1B 1B 1B\'some more text'.
The control character sequence is prefixed with '\ and ends with \'.
Inside the control sequence are two digits numbers, which specify the control character.
In the above example the resulting string would be
'some textESCESCESCsome more text'
ESC stands for the non-printable ASCII escape character.
Furthermore inside the control char sequence are newline allowed to build multi line strings like e.g.
'some text'\1B
1B
1B\'some more text'.
which results in the same string as above.
grammar stringliteral;
tokens {
CHAR,CHARS,CTRLCHARS,ESC,WHITESPACE,NEWLINE
}
stringLiteral: '\'' CHARS? '\'' ;
fragment
CHARS: CHAR+ ;
fragment
CHAR: CTRLCHARS | ~['\n\r] ;
fragment
ESC: '\'\\' ;
fragment
CTRLCHARS: ESC ~['] ESC;
WHITESPACE: (' ' | '\t')+ -> channel(HIDDEN);
NEWLINE: ( '\r' '\n'? | '\n' ) -> channel(HIDDEN);
The lexer/parser above behaves very strangely, because it accepts only
string in the form 'x' and ignores multiple characters and the control chars sequence.
Probably I am overseeing something obvious. Any hint or idea how to solves this issue is welcome!
I have now corrected the grammar according the hints from Mike:
grammar stringliteral;
tokens {
STRING
}
stringLiteral: STRING;
STRING: '\'' ( '\'' '\\' | '\\' '\'' | . )*? '\'';
There is still a problem with the recognition of the end of the control char sequence:
The input 'A STRING'\CTRL\'' produces the errors
Line 1:10 token recognition error at: '\'
line 1:11 token recognition error at: 'C'
line 1:12 token recognition error at: 'T'
line 1:13 token recognition error at: 'R'
line 1:14 token recognition error at: 'L'
line 1:15 token recognition error at: '\'
Any idea? Btw: We are using antlr v 4.5.
There are multiple issues with this grammar:
You cannot use a fragment lexer rule in a parser rule.
Your string rule is a parser rule, so it's subject to automatic whitespace removal you defined with your WHITESPACE and NEWLINE rules.
You have no rule to accept a control char sequence like \1B 1B 1B.
Especially the third point is a real problem, since you don't know where your control sequence ends (unless this was just a typo and you actually meant: \1B \1B \1B.
In any case, don't deal with escape sequences in your lexer (except the minimum handling required to make the rule work, i.e. handling of the \' sequence. You rule just needs to parse the entire text and you can figure out escape sequences in your semantic phase:
STRING: '\' ('\\' '\'' | . )*? '\'';
Note *? is the non-greedy operator to stop at the first closing quote char. Without that the lexer would continue to match all following (escaped and non-escaped) quote chars in the same string rule (greedy behavior). Additionally, the string rule is now a lexer rule, which is not affected by the whitespace skipping.
I solved the problem with this grammar snippet by adapting the approriate rules from the lates java grammar example:
StringLiteral
: '\'' StringCharacters? '\''
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~['\\\r\n]
| EscapeSequence
;
fragment
EscapeSequence
: '\'\\' (HexEscape| ' ' | [\r\n])* '\\\''
;
fragment
HexEscape
: B4Digit B4Digit
;
fragment
B4Digit
: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F'
;

Is it possible to distinguish escape sequences in lexer in Antlr4?

I would like to match sequences like \' and \" as lexer elements
ESCAPESEQUECE :
'\\\"' |
'\\\''
;
while also distinguish individual quotes when they are not escaped
SINGLEQUOTE:
'\''
;
DOUBLEQUOTE:
'\"'
;
The final goal it to recognize MySQL like strings with parser.
Is this possible / correct way?
Answer
Yes, it is totally possible by having separate tokens.
Example
grammar escp;
SINGLE: '\'';
DOUBLE: '\"';
ESCAPED : '\\"' | '\\\'';
char: SINGLE | DOUBLE;
escaped : ESCAPED;
program: (char | escaped)+;
The AST for input string '\"'"\"""'\'\"\' will be:

ANTLR parse strings (keep whitespaces) and parse normal identifiers

I am trying to use ANTLR4 to parse source files. One thing I need to do is that a string literal contains all kinds of characters and possibly white spaces while normal identifiers contains only English characters and digits (white spaces are thrown away).
I use the following antlr grammar rules (the minimal example), but it doesn't work as expected.
grammar parseString;
rules
: stringRule+
;
stringRule
: formatString
| idString
;
formatString
: STRING_DOUBLEQUOTE STRING STRING_DOUBLEQUOTE
;
idString
: (NONTERM | TERM)
;
// LEXER
STRING_DOUBLEQUOTE
: '"' ;
DIGITS
: DIGIT+
;
TERM
: UPPERCHAR CHAR+
;
NONTERM
: LOWERCHAR CHAR+
;
fragment
CHAR
: LOWERCHAR
| UPPERCHAR
| DIGIT
| '-'
| '_'
;
fragment
DIGIT
: [0-9]
;
fragment
LOWERCHAR
: [a-z]
;
fragment
UPPERCHAR
: [A-Z]
;
WS
: (' ' | '\t' | '\r' | '\n')+ -> skip
; // skip spaces, tabs, newlines
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
STRING
: ~('"')*
;
For the test cases that I use,
Test
HelloWorld
"$this is a string"
"*this is another string!"
I got the error line 1:0 extraneous input 'Test\nHelloWorld\n' expecting {'"', TERM, NONTERM}. And the last two lines of the 'formatString' are correctly parsed. But for the first two lines, since the newline characters ('\n') haven't got thrown away, thus they are not matched to 'idString'. I am wondering what I did wrong.
Your STRING rule will match anything but quotes so will scarf just about anything. That is way too loose. You will need a much tighter definition of exactly what distinguishes a STRING from the others I think. Once it's in ~'"'* it will scarf until '"'.
Yes there is a problem in this grammar. the token STRING matchs 'Test\nHelloWorld\n'. It will put everything in this token, but there is no rule that takes just the TOKEN STRING.
Think about changing the token STRING.

terminal/datatype/parser rules in xtext

I'm using xtext 2.4.
What I want to do is a SQL-like syntax.
The things confuse me are I'm not sure which things should be treated as terminal/datatype/parser rules. So far my grammar related to MyTerm is:
Model:
(terms += MyTerm ';')*
;
MyTerm:
constant=MyConstant | variable?='?'| collection_literal=CollectionLiteral
;
MyConstant
: string=STRING
| number=MyNumber
| date=MYDATE
| uuid=UUID
| boolean=MYBOOLEAN
| hex=BLOB
;
MyNumber:
int=SIGNINT | float=SIGNFLOAT
;
SIGNINT returns ecore::EInt:
'-'? INT
;
SIGNFLOAT returns ecore::EFloat:
'-'? INT '.' INT;
;
CollectionLiteral:
=> MapLiteral | SetLiteral | ListLiteral
;
MapLiteral:
'{' {MapLiteral} (entries+=MapEntry (',' entries+=MapEntry)* )? '}'
;
MapEntry:
key=MyTerm ':' value=MyTerm
;
SetLiteral:
'{' {SetLiteral} (values+=MyTerm (',' values+=MyTerm)* )+ '}'
;
ListLiteral:
'[' {ListLiteral} ( values+=MyTerm (',' values+=MyTerm)* )? ']'
;
terminal MYDATE:
'0'..'9' '0'..'9' '0'..'9' '0'..'9' '-'
'0'..'9' '0'..'9' '-'
'0'..'9' '0'..'9'
;
terminal HEX:
'a'..'h'|'A'..'H'|'0'..'9'
;
terminal UUID:
HEX HEX HEX HEX HEX HEX HEX HEX '-'
HEX HEX HEX HEX '-'
HEX HEX HEX HEX '-'
HEX HEX HEX HEX '-'
HEX HEX HEX HEX HEX HEX HEX HEX HEX HEX HEX HEX
;
terminal BLOB:
'0' ('x'|'X') HEX+
;
terminal MYBOOLEAN returns ecore::EBoolean:
'true' | 'false' | 'TRUE' | 'FALSE'
;
Few questions:
How to define integer with sign? If I define another terminal rule terminal SIGNINT: '-'? '0'..'9'+;, antlr will complain about INT becoming unreachable. Therefore I define it as a datatype rule SIGNINT: '-'? INT; Is this the correct way to do it?
How to define float with sign? I did exactly the same as define integer with sign, SIGNFLOAT: '-'? INT '.' INT;, not sure if this is correct as well.
How to define a date rule? I want to use a parser rule to store year/month/day info in fields, but define it as MyDate: year=INT '-' month=INT '-' date=INT; antlr will complain Decision can match input such as "RULE_INT '-' RULE_INT '-' RULE_INT" using multiple alternatives: 2, 3
As a result, alternative(s) 3 were disabled for that input
I also have some other rules like
the following
RelationCompare:
name=ID compare=COMPARE term=MyTerm
;
but a=4 won't be a valid RelationCompare because a and 4 will be treat as HEXs. I found this because if I change the relation to j=44 then it works. In this post it said terminal rule defined eariler will shadow those defined later. However, if I redefine terminal ID in my grammar, whether put it in front or after of terminal HEX, antlr will conplain The following token definitions can never be matched because prior tokens match the same input: RULE_HEX,RULE_MYBOOLEAN. This problem happens in k=0x00b as well. k=0xaab is valid but k=0x00b is not.
Any suggestion?
How do you define an integer with sign?
Treat it as two separate tokens '-' and INT, and use a parser rule instead of a lexer rule.
How do you define a float with sign?
Treat it as two separate tokens '-' and FLOAT, and use a parser rule instead of a lexer rule.
How do you define a date rule?
Treat it as five separate tokens and use a parser rule instead of a lexer rule.
I don't know the answer to the last question since this is in Xtext as opposed to just ANTLR.
Later I found the original antlr grammar for what I want to do therefore I simply translate the antlr grammar to xtext grammar. Here is how I defining those basic types:
terminal fragment A: 'a'|'A';
...
terminal fragment Z: 'z'|'Z';
terminal fragment DIGIT: '0'..'9';
terminal fragment LETTER: ('a'..'z'|'A'..'Z');
terminal fragment HEX: ('a'..'f'|'A'..'F'|'0'..'9');
terminal fragment EXPONENT: E ('+'|'-')? DIGIT+;
terminal INTEGER returns ecore::EInt: '-'? DIGIT+;
terminal FLOAT returns ecore::EFloat: INTEGER EXPONENT | INTEGER '.' DIGIT* EXPONENT?;
terminal BOOLEAN: T R U E | F A L S E;
The Date rule in original grammar is treated as a string.
About rules name (Rules: Antlr Grammar => xtext Grammar)
parser rule: starting with lowercase => rules starting with uppercase (each will be a Java Class)
terminal rule: starting with uppercase => using all uppercase with terminal prefix
fragment terminal rule: fragment ID => terminal fragment ID
In antlr a list of arguments is defined like this:
functionArgs
: '(' ')'
| '(' t1=term ( ',' tn=term )* ')'
;
The corresponding xtext grammar is:
FunctionArgs
: '(' ')'
| '(' ts+=Term (',' ts+=Term )* ')'
;
For those parser rules with an argument enclosed by [ ]
properties[PropertyDefinitions props]
: property[props] (K_AND property[props])*
;
Most of the time they could be moved to the left hand side
Properties
: props+=Property (K_AND props+=Property)*
;
Now it's working as expected.

Fixed number format in ANTLR

How to specify a fixed digit number in antlr grammar?
I want to parse a line which contains fields of fixed number of characters. Each field is a number.
0034|9056|4567|0987|-2340| +345|1000
The above line is a sample line. | indicates field boundaries (which will not be in the actual file. shown here just to indicate the boundary).
The fields can include blank characters +/-
I'd keep the lexer grammar as simple as possible and just match zero or more spaces followed by an optional sign followed by a number in your parser grammar. After matching that, check (in your parser grammar) if the "width" of the field is correct.
An example grammar:
line
: field ('|' field)*
;
field
: Spaces? ('+' | '-')? Number // validate if 'field' is correct in this rule
;
Number
: '0'..'9'+
;
Spaces
: ' '+
;
And a possible validation scheme could look like:
line
: field ('|' field)*
;
field
#init{int length = 0;}
: (Spaces {length += $Spaces.text.length();})?
('+' | '-')? Number {length += $Number.text.length(); if(length != 4) {/* do something */}}
;
Number
: '0'..'9'+
;
Spaces
: ' '+
;
What about the following:
INT : ('+'|'-')? ('0'..'9')+;