ANTLR3 Dynamic quotes in lexer - antlr

I need to match something like the Perl regexp matcher
m/my regex!*/
where the quotes can be any character from a range. So the above is the same as
m%my regex!*%
A naive guess of a lexer rule would be
REGEX: 'm' quote=. (~(quote))* quote;
but that does not work, because the latter quote is not referring to the quote= but to some rule.
I can do it with a lot of own code, like
REGEX: 'm' quote=. { ... implement the loop and final match myself ... } ;
but somehow I think there should be a canonical way to do such things.

... but somehow I think there should be a canonical way to do such things.
There is not. You'll have to do this with custom code.

Take a look at PL/SQL parser (here). Oracle also supports those Perl style quoted strings.
Like:
q':select * from employees where last_name = 'smith':'
Use the custom code as an example. (It contains C and Java implementation).
Maybe in your case it can be even simplified.
Ivan

Related

Why this notation for the Lexer production in antlr?

In the following SQL lexer:
https://github.com/tshprecher/antlr_psql/blob/master/antlr4/PostgreSQLLexer.g4
It defines true as:
TRUE : T R U E;
Why the capitals spaced out like that instead of just TRUE: 'TRUE' ? What's the reasoning for that notation? Does T refer to another production or something and that's why it's spelled like that?
These single letters are (fragment) lexer rules too. Check the grammar out! This way you can define case-insensitive keywords. This was the usual approach for case-insensitivity until this was built into ANTLR4 in version 4.10.

Modifiying ANTLR v4 auto-generated lexer?

So i am writing a small language and i am using antlrv4 as my tool. Antlr autogenerates lexer and parser files when u compile your grammar file(.g4). I am using javac btw. I want my language to have no semicolons and the way i want to do this is: if there is an identifier or ")" as the last token in a line, the lexer will automatically put the semicolon(Similar to what "go" language does). How would i approach something like this? There are other things like ATN(which i think is augmented transition network) and dfa(which i think is deterministic finite automaton) in the lexer file which i don't understand or how they relate to the lexing process?. Any help is appreciated. (btw i am still working on the grammar file so i don't have that fully completed).
Several points here: the ATN and the DFA are internal structures for parser + lexer and not something you would touch to change parsing behavior. Also, it's not clear to me why you want to have the lexer insert a semicolon at some point. What exactly do you want to accomplish by that (don't say: to make semicolons optional in the parser, I mean the underlying reason).
If you want to accept a command without a trailing semicolon you can make that optional:
assignment: simpleAssignment | complexAssignment SEMI?;
The parser will give you the content of the assignment rule regardless whether there is a trailing semicolon or not. Is that what you want?

Parse WHERE condition with regular expressions

I was wondering if anyone can help me with a regular expression - not my strongest point - to parse the WHERE part of a SQL statement. I need to extract the column names, either in "column" or "table.column" format. I'm using MySQL and PHP.
For example, parsing:
(table.column_a = '1') OR (table.column_a = '0'))
AND (date_column < '2014-07-03')
AND column_c LIKE '%my search string%'
should yield
table.column_a
table.column_b
date_column
column_c
Edit: clarification - the strings will be parsed in PHP with preg_* functions!
Thank you!
Assuming you are not doing this in SQL, you can use a regex like this:
[A-Za-z._]+(?=[ ]*(?:[<>=]|LIKE))
See regex demo.
This would work in Notepad++ and many languages.
Explanation
[A-Za-z._]+ matches the characters in your word (if you want to add digits, add 0-9
The lookahead (?=[ ]*(?:[<>=]|LIKE)) asserts that what follows is optional spaces (the brackets are optional, they make the space stand out), then one of the characters in this class [<>=] (an operator OR | LIKE
You can add operators inside [<>=], or tag them at the end with another alternation, e.g. |REGEXP
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind

Write regex for pattern like W00001

I am new to Regular Expressions and any help is highly appreciated.
Pattern like W00000,W00001,W00002,W00004
Must begin with W
Each string before comma must be six characters
String can only be repeated four times
Comma in between
Must not begin or end with comma
I tried below pattern and some others, like (^[W]{1}\d{5}){1,4}'), and none of them work correctly:
Select 'X' from dual Where REGEXP_LIKE ('W12342','(^[W]{1}\d{5})(?<!,)$')
My understanding is that the OP is saying the match should fail if the string begins or ends with a comma, not just that the preceding or trailing commas shouldn't match, so anchors are needed. Also, based on the regex he attempted, I infer that a single group, such as W00000, should match. So, I think the regex should be this, if the characters following the W must always be digits:
^W[:digit:]{5}(,W[:digit:]{5}){0,3}$
Or this, if they can be something other than digits:
^W[^,]{5}(,W[^,]{5}){0,3}$
UPDATE:
The OP posted the following comment:
I am on Oracle 11g and [:digit:] doesn't work. When I replace it with [0-9] it then works fine.
According to the documentation, Oracle 11g conforms to the POSIX regex standard and should be able to use POSIX character classes such as [:digit:]. However, I noticed in the docs that Oracle 11g does support Perl-style backslash character class abbreviations, which I didn't think was the case when I originally wrote this answer. In that case, the following should work:
^W\d{5}(,W\d{5}){0,3}$
Well in that case, you can do this:
(W[^,]{5},){3}W[^,]{5}
If I understood correctly, this should do it!
^W[0-9]{5}(,W[0-9]{5}){0,3}$
One W12345 pattern, maybe followed by one to 3 ,W12345 blocks.
Edit1: Adding ^$ to fail if there is a comma
Edit2: Fix class, since it fails on Oracle 11g

Regex in Objective-C and regexpal

I created a regex expression and tested with a string in Rexpal.
Then, i tested it in Objective-C with the same string and i get no result.
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:#"^Page(\\d+| \\d+)(:| :|)$" options:NSRegularExpressionCaseInsensitive error:&regError];
As you can see, i did add another '\' char before the 'd', but i get no result at all.
If i change the regex expression from "^Page(\d+| \d+)(:| :|)$" to "Page(\d+| \d+)(:| :|)", i get way too many results. It's like my 'AND' statements were understoud as 'OR' statements. Anybody got an idea of what is happening?
EDIT :
For the regex expression "^page(\d+| \d+)(:| :|)$" for the string "page 15 :", will return me with 3 solutions "page 15 :", "15", and ":". I only want the first one. Like i said, it's like my AND is transformed in a OR/AND. I would like the number and the semi-colon (or not like my regex says) to always be attached to 'page'
Turn on multi-line option. then anchors ^$ will mean begining and end of line.
Instead, by default, ^$ mean begin/end of entire string.
In RxPal you can see Match at line-breaks (m) option is checked.
edit
If you are getting too much sub-expression data, then you should replace the
context into cluster groups. (..) -> (?: ..).
This is an 'extended' context.
If you can't do that, then just go with the data in group 0, which is the entire match, and ignore the rest. Not sure how to do this.
As pointed out by sln (he solved that issue), you don't match anything in C because you have to turn multiline option on (with m), and you do match in regexpal because it is on.
Regarding your regex, it could be improved with ^(Page\s*\d+\s*:?\s*)$. The question mark means that the preceding character doesn't have to be here, the \s matches any whitespace-type character (whitespace, tab, etc).
Regarding your selecting issue, parenthesis in regex are what catches variables. So if you do (Page( \d+|\d+)) you'll have two different variables. What you wanted was (Page(?: \d+|\d+)), since (?: ) counts as parenthesis not assigning any value. But | aren't usually used when a simple ? does the trick.