ANTLR 3 ambiguity

I am trying to write some simple rules and I get this ambiguity:
rule: field1 field2; //ambiguity between nsf1 and nsf2 even if I use lookahead k=4
field1: nsf1 | whatever1...;
field2: nsf2 | whatever2...;
nsf1: 'N' 'S' 'F' '1'; //meaning: no such field 1
nsf2: 'N' 'S' 'F' '2'; //meaning: no such field 2
I understand the ambiguity, but I don't understand why lookahead doesn't solve this.
I have a simple solution but I don't like it:
rule: (nsf1 (nsf2 | whatever2))
| (whatever1 (nsf2 | whatever2));
Does anybody have a more elegant solution?
Thanks a lot,
Chris

I couldn't reproduce your problem, but all I could do was guess what the rules for 'whatever1' and 'whatever2' were. Can you post a more complete grammar?
However, there's nothing in the grammar that couldn't be done entirely with lexer tokens rather than parser rules. Try capitalizing all of the rule names to turn them into lexer tokens and see if that helps.


BigQuery - Regex to match a pattern after a known string (positive lookbehind alternative)

I need to extract 8 digits after a known string:
| MyString | Extract: |
| ---------------------------- | -------- |
| mypasswordis 12345678 | 12345678 |
| # mypasswordis 12345678 | 12345678 |
| foobar mypasswordis 12345678 | 12345678 |
I can do this with regex like:
(?<=mypasswordis.*)[0-9]{8}
However, when I want to do this in BigQuery using the REGEXP_EXTRACT command, I get the error message, "Cannot parse regular expression: invalid perl operator: (?<".
I searched through the re2 library and saw there doesn't seem to be an equivalent for positive lookbehind.
Is there any way I can do this using other methods? Something like
SELECT REGEXP_EXTRACT(MyString, r"(?<=mypasswordis.*)[0-9]{8}")
You need a capturing group here to extract a part of a pattern, see the REGEXP_EXTRACT docs you linked to:
If the regular expression contains a capturing group, the function returns the substring that is matched by that capturing group. If the expression does not contain a capturing group, the function returns the entire matching substring.
Also, the .* pattern is too costly, you only need to match whitespace between the word and the digits.
In general, to "convert" a (?<=mypasswordis).* pattern with a positive lookbehind, you can use mypasswordis(.*).
In this case, you can use
SELECT REGEXP_EXTRACT(MyString, r"mypasswordis\s*([0-9]{8})")
Or just
SELECT REGEXP_EXTRACT(MyString, r"mypasswordis\s*([0-9]+)")
See the re2 regex online test.
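The capturing-group rewrite can be tried outside BigQuery. The sketch below uses Python's `re` module as a stand-in. That is an assumption: BigQuery uses RE2, not Python's engine, but this pattern only uses syntax both accept. The helper name `extract_digits` is made up for the example, and the rows are the samples from the question.

```python
import re

# Pattern from the answer: match the known prefix, capture only the digits.
PATTERN = re.compile(r"mypasswordis\s*([0-9]{8})")

def extract_digits(text):
    """Return the captured 8 digits, or None when the pattern does not match."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

rows = [
    "mypasswordis 12345678",
    "# mypasswordis 12345678",
    "foobar mypasswordis 12345678",
]
results = [extract_digits(row) for row in rows]
print(results)
```

The prefix is still part of the match; only the group is returned, which is what replaces the lookbehind.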
Avoid regexp where you can; it's quite slow. Try SUBSTR and INSTR instead, for example:
SELECT SUBSTR(MyString, INSTR(MyString,'mypasswordis') + LENGTH('mypasswordis')+1)
Otherwise, Wiktor Stribiżew's answer is probably the right one.
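For comparison, here is the same find-and-slice idea as a small Python sketch (not BigQuery SQL; `extract_after` is a hypothetical helper name). It assumes exactly one separator character follows the marker, and it returns the whole remainder of the string rather than exactly 8 digits.

```python
def extract_after(text, marker="mypasswordis"):
    """Return everything after `marker` plus one separator character, or None if absent."""
    start = text.find(marker)  # 0-based index; -1 when the marker is missing
    if start == -1:
        return None
    # Skip the marker and the single character that follows it,
    # mirroring INSTR(...) + LENGTH(...) + 1 in the SQL above.
    return text[start + len(marker) + 1:]

print(extract_after("foobar mypasswordis 12345678"))
```

Note the explicit check for a missing marker; without it the slice would start at an arbitrary position.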
Use REGEXP_REPLACE instead to match what you don't want and delete that:
REGEXP_REPLACE(str, r'^.*mypasswordis ', '')
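The delete-what-you-don't-want approach looks like this in Python's `re` (again only a stand-in for BigQuery; `strip_prefix` is a hypothetical name):

```python
import re

def strip_prefix(text):
    """Delete everything up to and including 'mypasswordis ' (with its trailing space)."""
    return re.sub(r"^.*mypasswordis ", "", text)

print(strip_prefix("# mypasswordis 12345678"))
```

One thing to watch in this sketch: when the marker is absent nothing is replaced, so the input comes back unchanged rather than empty.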

Q: ANTLR 4 Grammar recognition of whole odd value not only the last digit

I'm trying to write a grammar for a calculator, but it has to work only for odd numbers.
For example, it currently behaves like this:
If I put 123 the result is 123.
If I put 1234 the result is 123, and the token recognition error is reported at: 4, but it should be at: 1234.
Here is my grammar:
grammar G;
DIGIT: ('0'..'9') * ('1' | '3' | '5' | '7'| '9');
operator : ('+' | '-' | '*' | ':');
result: DIGIT operator (DIGIT | result);
Specifically, I want the whole of 1234 to be recognized as an error, not only the last digit.
The way that tokenization works is that it tries to find the longest prefix of the input that matches any of your regular expressions and then produces the appropriate token, consuming that prefix. So when the input is 1234, it sees 123 as the longest prefix that matches the DIGIT pattern (which should really be called ODD_INT or something) and produces the corresponding token. Then it sees the remaining 4 and produces an error because no rule matches it.
Note that it's not necessarily only the last digit that produces the error. For the input 1324, it would produce a DIGIT token for 13 and then a token recognition error for 24.
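The longest-prefix behaviour described above can be imitated with a toy tokenizer. This is a Python sketch, not ANTLR's lexer: `tokenize` and `DIGIT_RUN` are hypothetical names, and reporting the rest of a digit run as one error span is a simplification chosen to match the examples in this answer.

```python
import re

ODD_INT = re.compile(r"[0-9]*[13579]")  # same shape as the DIGIT rule above
DIGIT_RUN = re.compile(r"[0-9]+")       # used only to size the error span

def tokenize(text):
    """Split `text` into ('ODD_INT', s) tokens and ('ERROR', s) spans."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = ODD_INT.match(text, pos)
        if match:
            # Greedy matching with backtracking yields the longest prefix
            # that still ends in an odd digit.
            tokens.append(("ODD_INT", match.group()))
            pos = match.end()
        else:
            # Simplification: report the rest of the digit run (or a single
            # character) as one error span.
            run = DIGIT_RUN.match(text, pos)
            end = run.end() if run else pos + 1
            tokens.append(("ERROR", text[pos:end]))
            pos = end
    return tokens

print(tokenize("1234"))
print(tokenize("1324"))
```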
So how can you get the behaviour that you want? One approach would be to rewrite your pattern to match all sequences of digits and then use a semantic predicate to verify that the number is odd. The way that semantic predicates work on lexer rules is that it first takes the longest prefix that matches the pattern (without taking into account the predicate) and then checks the predicate. If the predicate is false, it moves on to the other patterns - it does not try to match the same pattern to a smaller input to make the predicate return true. So for the input 1234, the pattern would match the entire number and then the predicate would return false. Then it would try the other patterns, none of which match, so you'd get a token recognition error for the full number.
ODD_INT: ('0'..'9') + { Integer.parseInt(getText()) % 2 == 1 }?;
The downside of this approach is that you'll need to write some language-specific code (and if you're not using Java, you'll need to adjust the above code accordingly).
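To make the match-first, check-afterwards order concrete, here is a toy model in Python. It only imitates the behaviour described above and is not how ANTLR is implemented; `lex_odd_ints` is a hypothetical name, and skipping whitespace is an addition for the sake of the example.

```python
import re

INT = re.compile(r"[0-9]+")  # matches every integer, odd or even

def lex_odd_ints(text):
    """Match whole integers first, then apply the oddness check to the full lexeme."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = INT.match(text, pos)
        if match and int(match.group()) % 2 == 1:
            tokens.append(("ODD_INT", match.group()))
            pos = match.end()
        elif match:
            # The check failed: the whole number is rejected. There is no
            # retry with a shorter prefix such as 123.
            tokens.append(("ERROR", match.group()))
            pos = match.end()
        else:
            tokens.append(("ERROR", text[pos]))
            pos += 1
    return tokens

print(lex_odd_ints("123 1234"))
```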
Alternatively, you could just recognize all integers in the lexer - not just odd ones - and then check whether they're odd later during semantic analysis.
If you do want to check the oddness using patterns only, you can also work around the problem by defining rules for both odd and even integers:
ODD_INT: ('0'..'9') * ('1' | '3' | '5' | '7'| '9');
EVEN_INT: ('0'..'9') * ('0' | '2' | '4' | '6'| '8');
This way for an input like 1234, the longest match would always be 1234, not 123. It's just that this would match the EVEN_INT pattern, not ODD_INT. So you wouldn't get a token recognition error, but, if you consistently only use ODD_INT in the grammar, you would get an error saying that an ODD_INT was expected, but an EVEN_INT found.
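A rough Python model of the two-rule variant, assuming the longest match wins among the rules. `next_token` and `expect_odd` are hypothetical helpers standing in for the lexer and for a parser step that only accepts ODD_INT; the error text is invented for the sketch.

```python
import re

RULES = [
    ("ODD_INT", re.compile(r"[0-9]*[13579]")),
    ("EVEN_INT", re.compile(r"[0-9]*[02468]")),
]

def next_token(text, pos=0):
    """Return (name, lexeme) for the longest match among RULES, or None."""
    best = None
    for name, pattern in RULES:
        match = pattern.match(text, pos)
        if match and (best is None or len(match.group()) > len(best[1])):
            best = (name, match.group())
    return best

def expect_odd(text):
    """Hypothetical parser step: accept only an ODD_INT token."""
    token = next_token(text)
    if token is None or token[0] != "ODD_INT":
        found = token[0] if token else "nothing"
        raise ValueError(f"expected ODD_INT but found {found}")
    return token[1]

print(next_token("1234"))
```

The whole of 1234 ends up in one token, so the complaint moves from the lexer to the parser.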

REGEXP_REPLACE for name and surname masking

select REGEXP_REPLACE('Tina Frederich Piedro', '\w+', '*') from table;
I'm using \w+ but it returns * * *. What is the correct regex for the expected output?
Input:
Tina Frederich Piedro
Expected output:
T*** F******** P*****
This is not a general solution, but it might work in your case. You can replace the lower case letters with '*'s:
select REGEXP_REPLACE('Tina Frederich Piedro', '[a-z]', '*', 1, 0, 'c')
The 'c' is for a case-sensitive replace.
I'm by no means a regexp expert, so I had to split the answer into two stages. I imagine someone more capable can combine the two steps. But this does the trick, and it also accounts for upper-case letters in the middle of names and for punctuation, for example O'Brian.
select
regexp_replace(lowers_done,'\*[A-Z]','**') first_letters_only
from
(
select
regexp_replace('Tina McDonald O''Brian','[a-z]|[[:punct:]]','*') lowers_done
from
dual
)
Output:
T*** M******* O******
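The same two stages can be checked quickly in Python. This is only a sketch of the logic, not Oracle SQL: Python's `re` has no `[[:punct:]]` class, so `string.punctuation` is used here as an approximation, and `mask_name` is a hypothetical helper name.

```python
import re
import string

# Escaped ASCII punctuation, safe to embed inside a character class.
PUNCT = re.escape(string.punctuation)

def mask_name(full_name):
    """Mask everything except the first letter of each word."""
    # Stage 1: replace lower-case letters and punctuation with '*'.
    lowers_done = re.sub(rf"[a-z{PUNCT}]", "*", full_name)
    # Stage 2: an upper-case letter right after a '*' sits mid-name, so mask it too.
    return re.sub(r"\*[A-Z]", "**", lowers_done)

print(mask_name("Tina McDonald O'Brian"))
```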
If it's for security reasons, you can use the DBMS_REDACT package to apply a masking pattern to sensitive information; see the Oracle documentation for the package.
I'm aware it's not a regexp solution, and furthermore this functionality may be subject to additional licensing from Oracle, but it is the solution Oracle suggests for PCI-compliant handling of sensitive data.

ParseKit equivalent for ABNF construct <any A except B>

I want to translate a given ABNF grammar into a valid ParseKit grammar. Actually I'm trying to find a solution for this kind of statement:
tag = 1*<any Symbol except "C">
with
Symbol = "A" / "B" / "C" / "D" // a lot more symbols here...
The symbol definition is simplified for this question and normally contains a lot of special characters.
My current solution is to hard code all allowed symbols for tag, like
tag = ('A' | 'B' | 'D')+;
But what I really want is something like a "without operator"
tag = Symbol \ 'C';
Is there any construct that allows me to keep my symbol list and define some excludes?
Developer of ParseKit here.
Yes, there is a feature for exactly this. Here's an example:
allItems = 'A' | 'B' | 'C' | 'D';
someItems = allItems - 'C';
Use the - operator.
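If it helps to see the idea outside ParseKit, the exclusion is just a set difference. The following is a Python analogy only (not ParseKit syntax); `some_items`, `tag` and `is_tag` are names picked for the sketch.

```python
import re

all_items = {"A", "B", "C", "D"}
some_items = all_items - {"C"}  # same idea as: allItems - 'C'

# tag = 1*<any Symbol except "C">, expressed as a character class.
tag = re.compile("[" + "".join(re.escape(s) for s in sorted(some_items)) + "]+")

def is_tag(text):
    """True when `text` is one or more allowed symbols and nothing else."""
    return tag.fullmatch(text) is not None

print(is_tag("ABDA"), is_tag("ABC"))
```

The full symbol list stays in one place and the exclusion is stated once, which is the point of the `-` operator.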

ANTLR: how to convert a BNF/EBNF grammar into an ANTLR grammar?

I have to generate a parser for CSV data. I managed to write the BNF/EBNF for CSV data, but I don't know how to convert it into an ANTLR grammar (ANTLR being a parser generator). For example, in EBNF we write:
[{header entry}newline]newline
but when I write this in ANTLR to generate a parser, it gives an error and does not accept the brackets. I am not an expert in ANTLR; can anyone help?
"Hi, I have to generate a parser for CSV data ..."
In most languages I know, there already exists a decent 3rd party CSV parser. So, chances are that you're reinventing the wheel.
"For example, in EBNF we write [{header entry}newline]newline"
The equivalent in ANTLR would look like this:
((header entry)* newline)? newline
In other words:
                 | (E)BNF | ANTLR
-----------------+--------+------
'a' zero or once | [a]    | a?
'a' zero or more | {a}    | a*
'a' once or more | a {a}  | a+
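The first two rows of the table are mechanical enough to script. Below is a deliberately naive Python sketch (`ebnf_to_antlr` is a hypothetical helper): it rewrites brackets character by character, so it assumes the rule contains no quoted literals with brackets in them, and it does not collapse `a {a}` into `a+`.

```python
# Map each EBNF bracket to its ANTLR counterpart.
EBNF_TO_ANTLR = {
    "[": "(", "]": ")?",  # optional: [a]  ->  (a)?
    "{": "(", "}": ")*",  # repetition: {a}  ->  (a)*
}

def ebnf_to_antlr(rule):
    """Rewrite EBNF option and repetition brackets as ANTLR sub-rules."""
    return "".join(EBNF_TO_ANTLR.get(ch, ch) for ch in rule)

print(ebnf_to_antlr("[{header entry}newline]newline"))
```

Apart from whitespace, the result for the rule from the question is the ANTLR form shown earlier.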
Note that you can group rules by using parentheses (sub-rules is what they're called):
'a' 'b'+
matches: ab, abb, abbb, ..., while:
('a' 'b')+
matches: ab, abab, ababab, ...
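The effect of the grouping is the same as in ordinary regular expressions, which makes it easy to try out. A small Python analogy (plain `re`, not ANTLR; the names are made up for the sketch):

```python
import re

without_group = re.compile(r"ab+")   # like: 'a' 'b'+
with_group = re.compile(r"(?:ab)+")  # like: ('a' 'b')+

def accepts(pattern, text):
    """True when the whole of `text` matches `pattern`."""
    return pattern.fullmatch(text) is not None

for text in ["ab", "abb", "abab"]:
    print(text, accepts(without_group, text), accepts(with_group, text))
```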