Markup parser failing - antlr

For a markup language I'm trying to parse, I decided to give parser generation a try with ANTLR. I'm new to the field, and I'm messing something up.
My grammar is
grammar Test;
DIGIT : ('0'..'9');
LETTER : ('A'..'Z');
SLASH : '/';
restriction
: ('E' ap)
| ('L' ap)
| 'N';
ap : LETTER LETTER LETTER;
car : LETTER LETTER;
fnum : DIGIT DIGIT DIGIT DIGIT? LETTER?;
flt : car fnum?;
message : 'A' (SLASH flt)? (SLASH restriction)?;
which does exactly what I want, when I give it an input string A/KK543/EPOS. When I give it A/KL543/EPOS however, it fails (MismatchedTokenException(9!=5)). It seems like some sort of conflict; it wants to generate restriction on the first L, so it seems I'm doing something wrong in the language definition, but I can't properly find out what.

For the input "A/KK543/EPOS", the following tokens are created:
'A' 'A'
SLASH '/'
LETTER 'K'
LETTER 'K'
DIGIT '5'
DIGIT '4'
DIGIT '3'
SLASH '/'
'E' 'E'
LETTER 'P'
LETTER 'O'
LETTER 'S'
But for the input "A/KL543/EPOS", these are created:
'A' 'A'
SLASH '/'
LETTER 'K'
'L' 'L'
DIGIT '5'
DIGIT '4'
DIGIT '3'
SLASH '/'
'E' 'E'
LETTER 'P'
LETTER 'O'
LETTER 'S'
As you can see, the char 'L' does not get tokenized as a LETTER. For the literal tokens 'A', 'E', 'L' and 'N' inside your parser rules, ANTLR automatically creates separate lexer rules that are placed before all other lexer rules. This causes your lexer to look like this behind the scenes:
A : 'A';
E : 'E';
L : 'L';
N : 'N';
DIGIT : '0'..'9';
LETTER : 'A'..'Z';
SLASH : '/';
Therefore, a single 'A', 'E', 'L' or 'N' will never become a LETTER token. This is simply how ANTLR works. If you want to match them as letters, you'll need to create a parser rule letter and let it match these tokens too. Something like this:
message
: A (SLASH flt)? (SLASH restriction)?
;
flt
: car fnum?
;
fnum
: DIGIT DIGIT DIGIT DIGIT? letter?
;
restriction
: E ap
| L ap
| N
;
ap
: letter letter letter
;
car
: letter letter
;
letter
: A
| E
| L
| N
| LETTER
;
A : 'A';
E : 'E';
L : 'L';
N : 'N';
DIGIT : '0'..'9';
LETTER : 'A'..'Z';
SLASH : '/';
which will parse the input "A/KL543/EPOS" correctly.
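The tokenization problem can be reproduced with a small simulation. The following Python sketch is not ANTLR itself; it only mimics the rule ordering described in this answer, under the assumption that every token here is a single character, so the first rule that matches a character decides its type. The names RULES and tokenize are made up for the illustration.

```python
import re

# Implicit literal rules come first, mirroring the lexer listing in the answer.
RULES = [
    ("'A'", r"A"),
    ("'E'", r"E"),
    ("'L'", r"L"),
    ("'N'", r"N"),
    ("DIGIT", r"[0-9]"),
    ("LETTER", r"[A-Z]"),
    ("SLASH", r"/"),
]

def tokenize(text):
    """Return (token_type, text) pairs; the first rule that matches wins."""
    tokens = []
    for ch in text:
        for name, pattern in RULES:
            if re.fullmatch(pattern, ch):
                tokens.append((name, ch))
                break
        else:
            raise ValueError(f"no lexer rule matches {ch!r}")
    return tokens

for name, ch in tokenize("A/KL543/EPOS"):
    print(name, repr(ch))
```

In this simulation the 'L' in "KL" comes out with the literal token type 'L' rather than LETTER, which is why a parser rule such as letter has to accept the literal tokens as well.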

Related

REGEXP_REPLACE for spark.sql()

I need to write a REGEXP_REPLACE query for a spark.sql() job.
If the value follows the pattern below, only the text before the first hyphen should be extracted and assigned to the target column 'name'; if the pattern doesn't match, the entire 'name' should be returned.
Pattern:
Values should be hyphen-delimited. Any characters can be present before the first hyphen (numbers, letters, special characters or even spaces).
The first hyphen should be followed by exactly 2 words, separated by a hyphen (each word can contain only numbers and letters; special characters and blanks are not allowed).
The two words should be followed by one or more digits, followed by a hyphen.
The last portion should be one or more digits only.
For Example:
if name = abc45-dsg5-gfdvh6-9890-7685, output of REGEXP_REPLACE = abc45
if name = abc, output of REGEXP_REPLACE = abc
if name = abc-gf5-dfg5-asd5-98-00, output of REGEXP_REPLACE = abc-gf5-dfg5-asd5-98-00
I have
spark.sql("SELECT REGEXP_REPLACE(name , '-[^-]+-\\w{2}-\\d+-\\d+$','',1,1,'i') AS name").show();
But it does not work.
Use
^([^-]*)(-[a-zA-Z0-9]+){2}-[0-9]+-[0-9]+$
Replace with $1. If $1 does not work, use \1. If \1 does not work, use \\1.
EXPLANATION
--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [^-]*                    any character except: '-' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (                        group and capture to \2 (2 times):
--------------------------------------------------------------------------------
    -                        '-'
--------------------------------------------------------------------------------
    [a-zA-Z0-9]+             any character of: 'a' to 'z', 'A' to
                             'Z', '0' to '9' (1 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
  ){2}                     end of \2 (NOTE: because you are using a
                           quantifier on this capture, only the LAST
                           repetition of the captured pattern will be
                           stored in \2)
--------------------------------------------------------------------------------
  -                        '-'
--------------------------------------------------------------------------------
  [0-9]+                   any character of: '0' to '9' (1 or more
                           times (matching the most amount possible))
--------------------------------------------------------------------------------
  -                        '-'
--------------------------------------------------------------------------------
  [0-9]+                   any character of: '0' to '9' (1 or more
                           times (matching the most amount possible))
--------------------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
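The behaviour of this expression on the three example values from the question can be checked outside Spark. The sketch below uses Python's re module purely as a stand-in; Spark SQL evaluates regexp_replace with Java regular expressions, so treat this as an approximation of the matching logic, not as the Spark query. The function name extract_name is hypothetical.

```python
import re

# Same expression as in the answer, unchanged.
PATTERN = re.compile(r"^([^-]*)(-[a-zA-Z0-9]+){2}-[0-9]+-[0-9]+$")

def extract_name(name):
    """Keep only group 1 when the whole value matches; otherwise return it unchanged."""
    return PATTERN.sub(r"\1", name)

for value in ("abc45-dsg5-gfdvh6-9890-7685", "abc", "abc-gf5-dfg5-asd5-98-00"):
    print(value, "->", extract_name(value))
```

Because the expression is anchored with ^ and $, a value that does not fit the whole pattern produces no match at all, so the substitution leaves it untouched.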

How to make a regex of alpha numeric in objective c

How can I make a regex that requires a string to contain both letters and numbers? If it contains only letters or only numbers, it should return false.
Eg:
123swift -> true
swift123 -> true
1231 -> false
swift -> false
My regex:
[a-z]|[0-9]
Use
^(?=.*?[A-Za-z])(?=.*?[0-9])[0-9A-Za-z]+$
Or, a presumably more efficient version:
^(?=[^A-Za-z]*[A-Za-z])(?=[^0-9]*[0-9])[0-9A-Za-z]+$
Explanation:
NODE                     EXPLANATION
--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
    [A-Za-z]                 any character of: 'A' to 'Z', 'a' to 'z'
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
    [0-9]                    any character of: '0' to '9'
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
  [0-9A-Za-z]+             any character of: '0' to '9', 'A' to 'Z',
                           'a' to 'z' (1 or more times (matching the
                           most amount possible))
--------------------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
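Both expressions can be tried against the four sample strings from the question. The sketch below uses Python's re module only for illustration; the question targets Objective-C, where the same patterns would be passed to the platform's regex API instead. The names SIMPLE, NEGATED and has_letters_and_digits are made up for this example.

```python
import re

# The two expressions from the answer, unchanged.
SIMPLE = re.compile(r"^(?=.*?[A-Za-z])(?=.*?[0-9])[0-9A-Za-z]+$")
NEGATED = re.compile(r"^(?=[^A-Za-z]*[A-Za-z])(?=[^0-9]*[0-9])[0-9A-Za-z]+$")

def has_letters_and_digits(text, pattern=SIMPLE):
    """True only if text is alphanumeric and contains at least one letter and one digit."""
    return pattern.match(text) is not None

for sample in ("123swift", "swift123", "1231", "swift"):
    print(sample, has_letters_and_digits(sample), has_letters_and_digits(sample, NEGATED))
```

Each lookahead checks one requirement without consuming input, so the final character class still has to match the whole string.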

How to write a lexer rule for UUID v4 in ANTLR4?

How to write a lexer rule for UUID v4 in ANTLR4?
UUIDV4: [0-9a-fA-F]{8}'-'[0-9a-fA-F]{4}'-'[0-9a-fA-F]{4}'-'[0-9a-fA-F]{4}'-'[0-9a-fA-F]{12};;
I am also importing another grammar where I have the following rule
WS
: [ \t\n\r] + -> skip
;
I don't want to allow any spaces before and after dashes in UUID V4 while satisfying the WS rule. How can I do that?
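ANTLR does not have a {...} quantifier, so you will have to write out the repetitions yourself. Because UUIDV4 is a single lexer rule, it only matches contiguous characters, so the skipped WS rule cannot let spaces slip in around the dashes. Something like this should do it: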
UUIDV4
: HEX_4 HEX_4 '-' HEX_4 '-' HEX_4 '-' HEX_4 '-' HEX_4 HEX_4 HEX_4
;
fragment HEX_4
: HEX HEX HEX HEX
;
fragment HEX
: [0-9a-fA-F]
;
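The way the fragments expand can be illustrated by building the equivalent regular expression by hand. This Python sketch is only an analogy for the lexer rule above, with HEX and HEX_4 as plain strings standing in for the fragments; is_uuid is a hypothetical helper. Like the lexer rule, it checks the 8-4-4-4-12 shape only and does not inspect the version digit.

```python
import re

# String stand-ins for the lexer fragments HEX and HEX_4.
HEX = "[0-9a-fA-F]"
HEX_4 = HEX * 4

# 8-4-4-4-12 hex digits, written as repeated HEX_4 groups joined by dashes.
UUID = re.compile("-".join([HEX_4 * 2, HEX_4, HEX_4, HEX_4, HEX_4 * 3]))

def is_uuid(text):
    """True if text has the UUID shape with no characters around the dashes."""
    return UUID.fullmatch(text) is not None

print(is_uuid("123e4567-e89b-42d3-a456-426614174000"))
print(is_uuid("123e4567 - e89b-42d3-a456-426614174000"))
```

The second call is rejected because the pattern, like a single lexer token, has no place for whitespace between the hex groups and the dashes.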

Antlr4 unexpectedly stops parsing expression

I'm developing a simple calculator with the formula grammar:
grammar Formula ;
expr : <assoc=right> expr POW expr # pow
| MINUS expr # unaryMinus
| PLUS expr # unaryPlus
| expr PERCENT # percent
| expr op=(MULTIPLICATION|DIVISION) expr # multiplyDivide
| expr op=(PLUS|MINUS) expr # addSubtract
| ABS '(' expr ')' # abs
| '|' expr '|' # absParenthesis
| MAX '(' expr ( ',' expr )* ')' # max
| MIN '(' expr ( ',' expr )* ')' # min
| '(' expr ')' # parenthesis
| NUMBER # number
| '"' COLUMN '"' # column
;
MULTIPLICATION: '*' ;
DIVISION: '/' ;
PLUS: '+' ;
MINUS: '-' ;
PERCENT: '%' ;
POW: '^' ;
ABS: [aA][bB][sS] ;
MAX: [mM][aA][xX] ;
MIN: [mM][iI][nN] ;
NUMBER: [0-9]+('.'[0-9]+)? ;
COLUMN: (~[\r\n"])+ ;
WS : [ \t\r\n]+ -> skip ;
"column a"*"column b" input gives me following tree as expected:
But "column a" * "column b" input unexpectedly stops parsing:
What am I missing?
Your WS rule is broken by the COLUMN rule, which has a higher precedence. More precisely, the issue is that ~[\r\n"] matches space characters too.
"column a"*"column b" lexes as follows: '"' COLUMN '"' MULTIPLICATION '"' COLUMN '"'
"column a" * "column b" lexes as follows: '"' COLUMN '"' COLUMN '"' COLUMN '"'
Yes, "space star space" got lexed as a COLUMN token because that's how ANTLR lexer rules work: longer token matches get priority.
As you can see, this token stream does not match the expr rule as a whole, so expr matches as much as it could, which is '"' COLUMN '"'.
Declaring a lexer rule that consists only of a negated set, as you did, is always a bad idea, because it also swallows characters that other rules are meant to match. Having separate '"' tokens doesn't feel right to me either.
What you should have done is to include the quotes in the COLUMN rule as they're logically part of the token:
COLUMN: '"' (~["\r\n])* '"';
Then remove the standalone quotes from your parser rule. You can either unquote the text later when you process the parse tree, or change the token emission logic in the lexer to change the underlying value of the token.
And in order to not ignore trailing input, add another rule which will make sure you've consumed the whole input:
formula: expr EOF;
Then use this rule as your entry rule instead of expr when calling your parser.
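The effect of the longest-match rule, and of moving the quotes into COLUMN, can be sketched with a toy lexer. This is a Python simulation under simplifying assumptions, not ANTLR: only the rules relevant to the two inputs are modelled, the implicit '"' token is called QUOTE here, and the names lex, ORIGINAL and FIXED are made up for the example.

```python
import re

def lex(text, rules):
    """Toy lexer: the longest match wins, ties go to the rule listed first."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_text = None, ""
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two matches are equally long.
            if m and len(m.group()) > len(best_text):
                best_name, best_text = name, m.group()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # mimic "-> skip"
            tokens.append((best_name, best_text))
        pos += len(best_text)
    return tokens

# Subset of the question's lexer; the implicit '"' literal is listed first.
ORIGINAL = [
    ("QUOTE", r'"'),
    ("MULTIPLICATION", r"\*"),
    ("COLUMN", r'[^\r\n"]+'),
    ("WS", r"[ \t\r\n]+"),
]

# Same subset with the quotes made part of COLUMN, as suggested in the answer.
FIXED = [
    ("MULTIPLICATION", r"\*"),
    ("COLUMN", r'"[^"\r\n]*"'),
    ("WS", r"[ \t\r\n]+"),
]

source = '"column a" * "column b"'
print(lex(source, ORIGINAL))
print(lex(source, FIXED))
```

With the original rules the three characters between the quoted names form one COLUMN token, because that match is longer than the single space WS could take. With the quoted COLUMN rule, the spaces are skipped and the star is lexed as MULTIPLICATION.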
But "column a" * "column b" input unexpectedly stops parsing
If I run your grammar with ANTLR 4.6, it does not stop parsing: it parses the whole file and highlights in pink what the parser can't match (in that display, the dots represent spaces).
There is also an important error message:
line 1:10 mismatched input ' * ' expecting {<EOF>, '*', '/', '+', '-', '%', '^'}
As soon as you get a "mismatched" error, add -tokens to grun.
With "column a"*"column b" :
$ grun Formula expr -tokens -diagnostics t1.text
[#0,0:0='"',<'"'>,1:0]
[#1,1:8='column a',<COLUMN>,1:1]
[#2,9:9='"',<'"'>,1:9]
[#3,10:10='*',<'*'>,1:10]
[#4,11:11='"',<'"'>,1:11]
[#5,12:19='column b',<COLUMN>,1:12]
[#6,20:20='"',<'"'>,1:20]
[#7,22:21='<EOF>',<EOF>,2:0]
With "column a" * "column b":
$ grun Formula expr -tokens -diagnostics t2.text
[#0,0:0='"',<'"'>,1:0]
[#1,1:8='column a',<COLUMN>,1:1]
[#2,9:9='"',<'"'>,1:9]
[#3,10:12=' * ',<COLUMN>,1:10]
[#4,13:13='"',<'"'>,1:13]
[#5,14:21='column b',<COLUMN>,1:14]
[#6,22:22='"',<'"'>,1:22]
[#7,24:23='<EOF>',<EOF>,2:0]
line 1:10 mismatched input ' * ' expecting {<EOF>, '*', '/', '+', '-', '%', '^'}
You immediately see that " * " is interpreted as COLUMN.
Many questions about matching input with lexer rules have been asked in the last few days:
extraneous input
ordering
greedy
ambiguity
expression
So many, in fact, that Lucas posted a self-answered question just to summarize the whole problem: disambiguate.

Oracle replace a character not followed by another character

I am attempting to replace all of the &'s in a string with &amp unless the & is followed by lt, apos, gt or quot.
Running this statement
select
regexp_replace('&lt &apos &gt &quot &','&(^lt|^gt|^quot|^apos)','&amp')
however results in no changes to the string.
The output I would be looking for is
'&lt &apos &gt &quot &amp'
A direct and efficient solution (but difficult to write, read and maintain) is:
set define off
(in case you are using a front-end that uses & to mark substitution variables)
then
with
inputs ( inp_str ) as (
select '&lt &apos &gt &quot &' from dual union all
select 'Hello, World!' from dual union all
select '' from dual union all
select '7 &lt 10 &and &&quot' from dual
)
select inp_str,
regexp_replace(inp_str,
'&($|[^lagq]|(g|l)([^t]|$)|a($|[^p]|p($|[^o]|o($|[^s])))|q($|[^u]|u($|[^o]|o($|[^t]))))',
'&amp\1') as new_str
from inputs;
Explanation: (partial...) This will replace every & with &amp, with a few exceptions. The & will be replaced if:
It is followed by the end of the string ($), or
It is followed by any character other than l, a, g or q; or
It is followed by g or l, which is then followed by a character other than t, or by the end of string ($); or
It is followed by a, followed by the end of string, by any character other than p, or by the letter p followed by the end of string, or .........
Output (from my inputs):
INP_STR                      NEW_STR
---------------------------- ----------------------------
&lt &apos &gt &quot &        &lt &apos &gt &quot &amp
Hello, World!                Hello, World!

7 &lt 10 &and &&quot         7 &lt 10 &ampand &amp&quot

4 rows selected.
(Note: I always include an empty string and a string with no ampersands among the inputs, to verify that the query works correctly on them too.)
These codes look much like HTML entity names, but the ending semi-colons are missing... making it less clear where a name ends.
In the following solution I assume that these entities cannot be followed immediately by a letter, a digit or an underscore.
When a & is followed by such a character, it is considered an entity, and not touched. Only the other & are replaced.
select regexp_replace('&lt &apos &gt &quot &', '&(\W|$)', '&amp\1') from dual;
The \W|$ matches either with a character that is not a letter, digit or underscore, or with the end of the string.
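The two approaches give different results on some inputs, which a quick experiment makes visible. The sketch below runs both expressions through Python's re module as a stand-in for Oracle's REGEXP_REPLACE; it assumes the two engines agree on these particular patterns and inputs, which holds for the sample output shown earlier but is not guaranteed in general. The names EXPLICIT, NON_WORD, escape_explicit and escape_non_word are hypothetical.

```python
import re

# First answer: spell out every way an '&' can fail to start lt, gt, apos or quot.
EXPLICIT = re.compile(
    r"&($|[^lagq]|(g|l)([^t]|$)|a($|[^p]|p($|[^o]|o($|[^s])))"
    r"|q($|[^u]|u($|[^o]|o($|[^t]))))"
)

# Second answer: replace only an '&' followed by a non-word character or the end.
NON_WORD = re.compile(r"&(\W|$)")

def escape_explicit(text):
    return EXPLICIT.sub(r"&amp\1", text)

def escape_non_word(text):
    return NON_WORD.sub(r"&amp\1", text)

for sample in ("&lt &apos &gt &quot &", "Hello, World!", "", "7 &lt 10 &and &&quot"):
    print(repr(sample), repr(escape_explicit(sample)), repr(escape_non_word(sample)))
```

On the last sample the first expression rewrites "&and" while the second leaves it alone, since the second treats any "&" followed by a word character as an entity.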