BNF for comma separated sequence? - grammar

I can't come up with a BNF grammar for a possibly empty sequence of characters separated by commas, where the sequence must not start or end with a comma.
So this is OK:
<--- Empty sequence is ok!
A
A,B
A,B,C
This is NOT ok:
A,
,A
A,,B
AB
The empty case throws me off. What I got so far is:
<char-seq> ::= <empty> | <char> , <char-seq> | <char>
but this produces strings like A, :-(

The empty char sequence is what gives you the trouble. You need a rule that matches a non-empty sequence to be separate from the rule that matches both an empty and a non-empty one, like this:
<char-seq> ::= <empty> | <non-empty-char-seq>
<non-empty-char-seq> ::= <char> | <char> , <non-empty-char-seq>

<char-seq> ::= <empty> | <chars>
<chars> ::= <char> | <char> , <chars>
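For a quick sanity check, the two-rule grammar can be hand-translated into a small recursive recognizer. This is only a sketch in Python: the function names are made up for illustration, and it assumes <char> means a single ASCII letter, which the question never pins down.

```python
def is_char(c: str) -> bool:
    # Assumption: <char> is one ASCII letter (the grammar leaves <char> undefined).
    return len(c) == 1 and c.isascii() and c.isalpha()


def matches_chars(s: str) -> bool:
    # <chars> ::= <char> | <char> , <chars>
    if not s or not is_char(s[0]):
        return False
    if len(s) == 1:
        return True
    return s[1] == "," and matches_chars(s[2:])


def matches_char_seq(s: str) -> bool:
    # <char-seq> ::= <empty> | <chars>
    return s == "" or matches_chars(s)


for candidate in ["", "A", "A,B", "A,B,C", "A,", ",A", "A,,B", "AB"]:
    print(repr(candidate), matches_char_seq(candidate))
```

The empty string is only accepted by the top-level rule, so the recursion in matches_chars can never end on a trailing comma.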

Only return fields that contain numbers or special characters EXCEPT . Error

In Redshift I want to return fields that contain numbers or special characters EXCEPT . (anything other than a-z and A-Z)
The following gets me anything that contains a number but I need to extend this to any special character except full stop (.)
SELECT DISTINCT name
FROM table
WHERE name ~ '[0-9]'
I need something like:
SELECT DISTINCT name
FROM table
WHERE name ~ '[0-9]' OR name ~'[,#';:#~[]{}etcetc'
Sample Data:
name
john
joh1n1
j!ohn!
jo!h2n
joh.n
jo.&hn
j.3ohn
j.$9ohn
Expected Output:
name
joh1n1
j!ohn!
jo!h2n
jo.&hn
j.3ohn
j.$9ohn
You may use
WHERE name !~ '^[[:alpha:].]+$'
Here, all records that do not consist of only alphabetic or dot symbols will be returned. ^ matches the start of a string position, [[:alpha:].]+ matches one or more letters or dots and $ matches the end of string position.
If it is for PostgreSQL you may use
WHERE name SIMILAR TO '%[^[:alpha:].]%'
The SIMILAR TO operator accepts POSIX character classes and bracket expressions and wildcards, too, and requires a full string match. So, % allows any chars before any 1 char other than letter or dot ([^[:alpha:].]), and then there may also be any other chars till the end of the string.
You can do:
SELECT DISTINCT name FROM table WHERE name !~* '[a-z]'
This means: match on names that do not contain any alphabetic character.
Operator !~* means:
Does not match regular expression, case insensitive
Edit based on the provided sample data and expected results.
If you want to match on names that contain at least one character other than an alphabetic character or a dot, then you can do:
select * from mytable where name ~* '[^a-z.]'
Demo on DB Fiddle:
with mytable(name) as (values
('john'),
('joh1n1'),
('j!ohn!'),
('jo!h2n'),
('joh.n'),
('jo.&hn'),
('j.3ohn'),
('j.$9ohn')
)
select * from mytable where name ~* '[^a-z.]'
| name |
| :------ |
| joh1n1 |
| j!ohn! |
| jo!h2n |
| jo.&hn |
| j.3ohn |
| j.$9ohn |
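The character-class idea is easy to check outside the database. Below is a small sketch using Python's re module rather than Redshift: Python has no [[:alpha:]] class, so the letters are spelled out as A-Za-z, and the names list is just the sample data from the question.

```python
import re

# At least one character that is neither an ASCII letter nor a dot,
# mirroring name ~* '[^a-z.]' from the answer above.
NOT_LETTER_OR_DOT = re.compile(r"[^A-Za-z.]")

names = ["john", "joh1n1", "j!ohn!", "jo!h2n", "joh.n", "jo.&hn", "j.3ohn", "j.$9ohn"]

# Keep only the names that contain such a character.
matching = [name for name in names if NOT_LETTER_OR_DOT.search(name)]
print(matching)
# ['joh1n1', 'j!ohn!', 'jo!h2n', 'jo.&hn', 'j.3ohn', 'j.$9ohn']
```

Only john and joh.n are dropped, which agrees with the expected output in the question.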

How do I parse PDF strings with nested string delimiters in antlr?

I'm working on parsing PDF content streams. Strings are delimited by parentheses but can contain nested unescaped parentheses. From the PDF Reference:
A literal string shall be written as an arbitrary number of characters enclosed in parentheses. Any characters may appear in a string except unbalanced parentheses (LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS (29h)) and the backslash (REVERSE SOLIDUS (5Ch)), which shall be treated specially as described in this sub-clause. Balanced pairs of parentheses within a string require no special treatment.
EXAMPLE 1:
The following are valid literal strings:
(This is a string)
(Strings may contain newlines
and such.)
(Strings may contain balanced parentheses ( ) and special characters (*!&}^% and so on).)
It seems like pushing lexer modes onto a stack would be the thing to handle this. Here's a stripped-down version of my lexer and parser.
lexer grammar PdfStringLexer;
Tj: 'Tj' ;
TJ: 'TJ' ;
NULL: 'null' ;
BOOLEAN: ('true'|'false') ;
LBRACKET: '[' ;
RBRACKET: ']' ;
LDOUBLEANGLE: '<<' ;
RDOUBLEANGLE: '>>' ;
NUMBER: ('+' | '-')? (INT | FLOAT) ;
NAME: '/' ID ;
// A sequence of literal characters enclosed in parentheses.
OPEN_PAREN: '(' -> more, pushMode(STR) ;
// Hexadecimal data enclosed in angle brackets
HEX_STRING: '<' [0-9A-Za-z]+ '>' ;
fragment INT: DIGIT+ ; // match 1 or more digits
fragment FLOAT: DIGIT+ '.' DIGIT* // match 1. 39. 3.14159 etc...
| '.' DIGIT+ // match .1 .14159
;
fragment DIGIT: [0-9] ; // match single digit
// Accept all characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;
WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters
mode STR;
LITERAL_STRING : ')' -> popMode ;
STRING_OPEN_PAREN: '(' -> more, pushMode(STR) ;
TEXT : . -> more ;
parser grammar PdfStringParser;
options { tokenVocab=PdfStringLexer; }
array: LBRACKET object* RBRACKET ;
dictionary: LDOUBLEANGLE (NAME object)* RDOUBLEANGLE ;
string: (LITERAL_STRING | HEX_STRING) ;
object
: NULL
| array
| dictionary
| BOOLEAN
| NUMBER
| string
| NAME
;
content : stat* ;
stat
: tj
;
tj: ((string Tj) | (array TJ)) ; // Show text
When I process this file:
(Oliver’s Army) Tj
((What’s So Funny ’Bout) Peace, Love, and Understanding) Tj
I get this error and parse tree:
line 2:24 extraneous input ' Peace, Love, and Understanding)' expecting 'Tj'
So maybe pushMode doesn't push duplicate modes onto the stack. If not, what would be the way to handle nested parentheses?
Edit
I left out the instructions regarding escape sequences within the string:
Within a literal string, the REVERSE SOLIDUS is used as an escape character. The character immediately following the REVERSE SOLIDUS determines its precise interpretation as shown in Table 3. If the character following the REVERSE SOLIDUS is not one of those shown in Table 3, the REVERSE SOLIDUS shall be ignored.
Table 3 lists \n, \r, \t, \b backspace (08h), \f formfeed (FF), \(, \), \\, and \ddd character code ddd (octal)
An end-of-line marker appearing within a literal string without a preceding REVERSE SOLIDUS shall be treated as a byte value of (0Ah), irrespective of whether the end-of-line marker was a CARRIAGE RETURN (0Dh), a LINE FEED (0Ah), or both.
EXAMPLE 2:
(These \
two strings \
are the same.)
(These two strings are the same.)
EXAMPLE 3:
(This string has an end-of-line at the end of it.
)
(So does this one.\n)
Should I use this STRING definition:
STRING
: '(' ( ~[()]+ | STRING )* ')'
;
without modes and deal with escape sequences in my code or create a lexer mode for strings and deal with escape sequences in the grammar?
You could do this with lexical modes, but in this case it's not really needed. You could simply define a lexer rule like this:
STRING
: '(' ( ~[()]+ | STRING )* ')'
;
And with escape sequences, you could try:
STRING
: '(' ( ~[()\\]+ | ESCAPE_SEQUENCE | STRING )* ')'
;
fragment ESCAPE_SEQUENCE
: '\\' ( [nrtbf()\\] | [0-7] [0-7] [0-7] )
;
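If you end up handling strings in your own code instead of in the grammar, the balancing logic is just a depth counter. Here is a minimal Python sketch, assuming the input is already a decoded str and that a backslash always protects the single character after it (it does not decode \ddd or line continuations). read_literal_string is a made-up helper name, not part of ANTLR or any PDF library.

```python
def read_literal_string(text: str, start: int = 0) -> tuple[str, int]:
    """Return (raw string including the outer parentheses, index just past it)."""
    if start >= len(text) or text[start] != "(":
        raise ValueError("expected '(' at start")
    depth = 0
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return text[start:i + 1], i + 1
        i += 1
    raise ValueError("unbalanced parentheses")


line = "((What's So Funny 'Bout) Peace, Love, and Understanding) Tj"
raw, end = read_literal_string(line)
print(raw)        # the whole outer string, nested parentheses included
print(line[end:])  # what is left for the operator: ' Tj'
```

Escaped parentheses such as \( and \) never touch the counter, so they cannot unbalance the string.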

Antlr4 parser not parsing reassignment statement correctly

I've been creating a grammar parser using Antlr4 and wanted to add variable reassignment (without having to declare a new variable).
I've tried changing the reassignment statement to be an expression, but that didn't change anything.
Here's a shortened version of my grammar:
grammar MyLanguage;
program: statement* EOF;
statement
: expression EOC
| variable EOC
| IDENTIFIER ASSIGNMENT expression EOC
;
variable: type IDENTIFIER (ASSIGNMENT expression)?;
expression
: STRING
| INTEGER
| IDENTIFIER
| expression MATH expression
| ('+' | '-') expression
;
MATH: '+' | '-' | '*' | '/' | '%' | '//' | '**';
ASSIGNMENT: MATH? '=';
EOC: ';';
WHITESPACE: [ \t\r\n]+ -> skip;
STRING: '"' (~[\u0000-\u0008\u0010-\u001F"] | [\t])* '"' | '\'' (~[\u0000-\u0008\u0010-\u001F'] | [\t])* '\'';
INTEGER: '0' | ('+' | '-')? [1-9][0-9]*;
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]*;
type: 'str';
If anything else might be relevant, please ask.
So I tried to parse
str test = "empty";
test = "not empty";
which worked, but when I tried (part of the Fibonacci function)
temp = n1;
n1 = n1 + n2;
n2 = temp;
it got an error and parsed it as
temp = n1; //statement
n1 = n1 //statement - <missing ';'>
+n2; //statement
n2 = temp; //statement
Your problem has nothing to do with assignment statements. Additions simply don't work at all - whether they're part of an assignment or not. So the simplest input to get the error would be x+y;. If you print the token stream for that input (using grun with the -tokens option for example), you'll get the following output:
[#0,0:0='x',<IDENTIFIER>,1:0]
[#1,1:1='+',<'+'>,1:1]
[#2,2:2='y',<IDENTIFIER>,1:2]
[#3,3:3=';',<';'>,1:3]
[#4,4:3='<EOF>',<EOF>,1:4]
line 1:1 no viable alternative at input 'x+'
Now compare this to x*y;, which works fine:
[#0,0:0='x',<IDENTIFIER>,1:0]
[#1,1:1='*',<MATH>,1:1]
[#2,2:2='y',<IDENTIFIER>,1:2]
[#3,3:3=';',<';'>,1:3]
[#4,4:3='<EOF>',<EOF>,1:4]
The important difference here is that * is recognized as a MATH token, but + isn't. It's recognized as a '+' token instead.
This happens because you introduced a separate '+' (and '-') token type in the alternative | ('+' | '-') expression. So whenever the lexer sees a + it produces a '+' token, not a MATH token, because string literals in parser rules take precedence over named lexer rules.
If you turn MATH into a parser rule math (or maybe mathOperator) instead, all of the operators will be literals and the problem will go away. That said, you probably don't want a single rule for all math operators because that doesn't give you the precedence you want, but that's a different issue.
PS: Something like x+1 still won't work because it will see +1 as a single INTEGER token. You can fix that by removing the leading + and - from the INTEGER rule (that way x = -2 would be parsed as a unary minus applied to the integer 2 instead of just the integer -2, but that's not a problem).
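To see why the sign in INTEGER matters, here is a toy longest-match tokenizer in Python. It is not ANTLR and not the grammar above; it only models the "longest match wins, earlier rule breaks ties" behaviour with a hypothetical tokenize helper and a cut-down rule set (PLUS, INTEGER, IDENTIFIER).

```python
import re


def tokenize(text, rules):
    """Toy lexer: the longest match wins; on a tie the earlier rule wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_lexeme = None, ""
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly longer only, so an earlier rule keeps a tie.
            if m and len(m.group()) > len(best_lexeme):
                best_name, best_lexeme = name, m.group()
        if best_name is None:
            raise ValueError(f"cannot tokenize at position {pos}")
        tokens.append((best_name, best_lexeme))
        pos += len(best_lexeme)
    return tokens


IDENT = ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*")
SIGNED = [("PLUS", r"\+"), ("INTEGER", r"0|[+-]?[1-9][0-9]*"), IDENT]
UNSIGNED = [("PLUS", r"\+"), ("INTEGER", r"0|[1-9][0-9]*"), IDENT]

print(tokenize("x+1", SIGNED))
# [('IDENTIFIER', 'x'), ('INTEGER', '+1')]
print(tokenize("x+1", UNSIGNED))
# [('IDENTIFIER', 'x'), ('PLUS', '+'), ('INTEGER', '1')]
```

With the signed rule, +1 is a longer match than + alone, so the operator never reaches the parser as its own token.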

ANTLR: Match unescaped characters?

I've got a rule like,
charGroup
: '[' .+ ']';
But I'm guessing that'll match something like [abc\]. Assuming I want it to match only unescaped ]s, how do I do that? In a regular expression I'd use a negative look-behind.
Edit: I'd also like it to be ungreedy/lazy if possible. So as to match only [a] in [a][b].
You probably wanted to do something like:
charGroup
: '[' ('\\' . | ~('\\' | ']'))+ ']'
;
where ~('\\' | ']') matches a single character other than \ and ]. Note that you can only negate single characters! There's no such thing as ~('ab'). Another mistake often made is that negating inside parser rules does not negate a character, but a token instead. An example might be in order:
foo : ~(A | D);
A : 'a';
B : 'b';
C : 'c';
D : ~A;
Now parser rule foo matches either token B or token C (so only the characters 'b' and 'c') while lexer rule D matches any character other than 'a'.
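The same "escaped pair, or any character except \ and ]" idea can be tried out with an ordinary regular expression. The sketch below uses Python's re module purely as an illustration (CHAR_GROUP is a made-up name); it is not ANTLR syntax.

```python
import re

# '[' then one or more of:
#   a backslash followed by any character, or
#   any single character other than backslash and ']'
# then the closing ']'.
CHAR_GROUP = re.compile(r"\[(?:\\.|[^\\\]])+\]")

print(CHAR_GROUP.findall("[a][b]"))
# ['[a]', '[b]']
print(CHAR_GROUP.match(r"[abc\]]").group())
# [abc\]]
print(CHAR_GROUP.match(r"[abc\]"))
# None
```

Because ] is excluded from the inner alternatives, the match stops at the first unescaped ], so no lazy quantifier or look-behind is needed.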
I'd use a negative look-behind
Isn't that unnecessarily complex? How about:
charGroup
: '[' ('\\]' | .)+ ']';

How can my ANTLR lexer match a token made of characters that are subset of another kind of token?

I have what I think is a simple ANTLR question. I have two token types: ident and special_ident. I want my special_ident to match a single letter followed by a single digit. I want the generic ident to match a single letter, optionally followed by any number of letters or digits. My (incorrect) grammar is below:
expr
: special_ident
| ident
;
special_ident : LETTER DIGIT;
ident : LETTER (LETTER | DIGIT)*;
LETTER : 'A'..'Z';
DIGIT : '0'..'9';
When I try to check this grammar, I get this warning:
Decision can match input such as "LETTER DIGIT" using multiple alternatives: 1, 2.
As a result, alternative(s) 2 were disabled for that input
I understand that my grammar is ambiguous and that input such as A1 could match either ident or special_ident. I really just want the special_ident to be used in the narrowest of cases.
Here's some sample input and what I'd like it to match:
A : ident
A1 : special_ident
A1A : ident
A12 : ident
AA1 : ident
How can I form my grammar such that I correctly identify my two types of identifiers?
Seems that you have 3 cases:
A
AN
A(A|N)(A|N)+
You could classify the middle one as special_ident and the other two as ident; seems that should do the trick.
I'm a bit rusty with ANTLR, I hope this hint is enough. I can try to write out the expressions for you but they could be wrong:
long_ident : LETTER (LETTER | DIGIT) (LETTER | DIGIT)+;
special_ident : LETTER DIGIT;
ident : LETTER | long_ident;
Expanding on Carl's thought, I would guess you have four different cases:
A
AN
AA(A|N)*
AN(A|N)+
Only option 2 should be the token special_ident; the other three should be ident. All tokens can be identified by syntax alone. Here is a quick grammar that I tested in ANTLRWorks, and it appeared to work properly. I think Carl's might have one bug when checking AA, but getting you 99% of the way there is a huge benefit, so this is only a minor modification to his quick thought.
prog
: (expr WS)+ EOF;
expr
: special_ident {System.out.println("Found special_ident:" + $special_ident.text + "\n");}
| ident {System.out.println("Found ident:" + $ident.text + "\n");}
;
special_ident : LETTER DIGIT;
ident : LETTER
|LETTER DIGIT (LETTER|DIGIT)+
|LETTER LETTER (LETTER|DIGIT)*;
LETTER : 'A'..'Z';
DIGIT : '0'..'9';
WS
: (' '|'\t'|'\n'|'\r')+;
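Since the sample inputs are small, the intended classification can be checked outside ANTLR as well. Here is a rough Python sketch (classify and the two pattern names are hypothetical): try the narrow special_ident shape first and fall back to ident.

```python
import re

# Exactly one letter followed by exactly one digit.
SPECIAL_IDENT = re.compile(r"[A-Z][0-9]")
# One letter followed by any number of letters or digits.
IDENT = re.compile(r"[A-Z][A-Z0-9]*")


def classify(text: str) -> str:
    # Check the narrowest case first so that A1 is not swallowed by ident.
    if SPECIAL_IDENT.fullmatch(text):
        return "special_ident"
    if IDENT.fullmatch(text):
        return "ident"
    raise ValueError(f"not an identifier: {text!r}")


for sample in ["A", "A1", "A1A", "A12", "AA1"]:
    print(sample, ":", classify(sample))
```

This gives the same split as the grammar above: only the letter-digit pair is special_ident, and everything else that starts with a letter is ident.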