Q: ANTLR 4 Grammar recognition of whole odd value not only the last digit - grammar

I'm trying to write a grammar for a calculator, but it has to work only for odd numbers.
For example, it currently behaves like this:
If I put 123 the result is 123.
If I put 1234 the result is 123, with a token recognition error at: 4, but the error should be at: 1234.
Here is my grammar:
grammar G;
DIGIT: ('0'..'9') * ('1' | '3' | '5' | '7'| '9');
operator : ('+' | '-' | '*' | ':');
result: DIGIT operator (DIGIT | result);
Specifically, I want the whole of 1234 to be recognized as an error, not only the last digit.

The way that tokenization works is that it tries to find the longest prefix of the input that matches any of your regular expressions and then produces the appropriate token, consuming that prefix. So when the input is 1234, it sees 123 as the longest prefix that matches the DIGIT pattern (which should really be called ODD_INT or something) and produces the corresponding token. Then it sees the remaining 4 and produces an error because no rule matches it.
Note that it's not necessarily only the last digit that produces the error. For the input 1324, it would produce a DIGIT token for 13 and then a token recognition error for 24.
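The longest-prefix behaviour can be imitated outside ANTLR. The following Python sketch is only an illustration, not how ANTLR is implemented: the tokenize helper is hypothetical, it handles digits only (no operators or whitespace), and its error recovery skips one character at a time, which is simpler than what ANTLR reports.

```python
import re

# Same shape as the DIGIT rule: any digits, ending in an odd digit.
ODD_INT = re.compile(r"[0-9]*[13579]")

def tokenize(text):
    """Split text into ("ODD_INT", lexeme) and ("ERROR", char) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        # The greedy match yields the longest prefix that fits the pattern.
        match = ODD_INT.match(text, pos)
        if match:
            tokens.append(("ODD_INT", match.group()))
            pos = match.end()
        else:
            # No rule matches here: report the character and move on.
            tokens.append(("ERROR", text[pos]))
            pos += 1
    return tokens

print(tokenize("123"))
print(tokenize("1234"))
```

For 1234 the sketch yields an ODD_INT token for 123 followed by an error for 4, which is the behaviour described in the question.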
So how can you get the behaviour that you want? One approach would be to rewrite your pattern to match all sequences of digits and then use a semantic predicate to verify that the number is odd. The way that semantic predicates work on lexer rules is that it first takes the longest prefix that matches the pattern (without taking into account the predicate) and then checks the predicate. If the predicate is false, it moves on to the other patterns - it does not try to match the same pattern to a smaller input to make the predicate return true. So for the input 1234, the pattern would match the entire number and then the predicate would return false. Then it would try the other patterns, none of which match, so you'd get a token recognition error for the full number.
ODD_INT: ('0'..'9') + { Integer.parseInt(getText()) % 2 == 1 }?;
The downside of this approach is that you'll need to write some language-specific code (and if you're not using Java, you'll need to adjust the above code accordingly).
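To make the "match first, check the predicate afterwards" order concrete, here is a rough Python imitation. It is a sketch under assumptions, not ANTLR's algorithm: tokenize_with_predicate is a hypothetical name, and the input is assumed to contain digits only.

```python
import re

# Match every run of digits, odd or even.
INT = re.compile(r"[0-9]+")

def tokenize_with_predicate(text):
    """Match the longest digit run, then accept it only if it is odd."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = INT.match(text, pos)
        if match is None:
            tokens.append(("ERROR", text[pos]))
            pos += 1
            continue
        lexeme = match.group()
        # The predicate runs after the lexeme is fixed; there is no retry
        # with a shorter prefix, so the whole number is rejected.
        kind = "ODD_INT" if int(lexeme) % 2 == 1 else "ERROR"
        tokens.append((kind, lexeme))
        pos = match.end()
    return tokens

print(tokenize_with_predicate("123"))
print(tokenize_with_predicate("1234"))
```

Here the error covers the full lexeme 1234 rather than only the trailing 4.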
Alternatively, you could just recognize all integers in the lexer - not just odd ones - and then check whether they're odd later during semantic analysis.
If you do want to check the oddness using patterns only, you can also work around the problem by defining rules for both odd and even integers:
ODD_INT: ('0'..'9') * ('1' | '3' | '5' | '7'| '9');
EVEN_INT: ('0'..'9') * ('0' | '2' | '4' | '6'| '8');
This way for an input like 1234, the longest match would always be 1234, not 123. It's just that this would match the EVEN_INT pattern, not ODD_INT. So you wouldn't get a token recognition error, but, if you consistently only use ODD_INT in the grammar, you would get an error saying that an ODD_INT was expected, but an EVEN_INT found.
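A small Python sketch of the two-rule idea, assuming a hypothetical longest_token helper that tries every rule and keeps the longest match (Python's own alternation picks the first alternative that matches, so the comparison is done by hand):

```python
import re

# One pattern per lexer rule, mirroring ODD_INT and EVEN_INT above.
RULES = [
    ("ODD_INT", re.compile(r"[0-9]*[13579]")),
    ("EVEN_INT", re.compile(r"[0-9]*[02468]")),
]

def longest_token(text, pos=0):
    """Return (rule name, lexeme) for the longest match at pos, or None."""
    best = None
    for name, pattern in RULES:
        match = pattern.match(text, pos)
        if match and (best is None or match.end() > best[1].end()):
            best = (name, match)
    if best is None:
        return None
    return (best[0], best[1].group())

print(longest_token("123"))
print(longest_token("1234"))
```

The whole of 1234 now comes back as a single EVEN_INT token, so rejecting it is left to whatever consumes the tokens.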

Related

Oracle SQL regexp_substr number extraction behavior

In a sense I've answered my own question, but I'm trying to understand the answer better:
When using regexp_substr (in oracle) to extract the first occurrence of a number (either single or multi digits), how/why do the modifiers * and + impact the results? Why does + provide the behavior I'm looking for and * does not? * is my default usage in most regular expressions so I was surprised it didn't suit my need.
For example, in the following:
select test,
regexp_substr(TEST,'\d') Pattern1,
regexp_substr(TEST,'\d*') Pattern2,
regexp_substr(TEST,'\d+') Pattern3
from (
select '123 W' TEST from dual
union
select 'W 123' TEST from dual
);
The use of regexp_substr(TEST,'\d*') returns a null value for the input "W 123". Since 'zero or more' digits exist in the string, I'm confused by this behavior. I'm also confused about why it does work on the string '123 W'.
My understanding is that * means zero or more occurrences of the element it follows and + means one or more occurrences of the preceding element. In the example provided for Pattern2 (\d*), why does it successfully capture "123" from "123 W" but not take 123 from "W 123"? Zero or more occurrences of a digit do exist; they just don't exist at the beginning of the string. Is there additional [implied] logic attached to using *?
Note: I looked around for a while trying to find similar questions that helped me capture the '123' from 'W 123', but the closest I found were variations of regexp_replace, which would not meet my needs.
So the regexp_count indicates there are FOUR substrings that match the \d* pattern.
The third of those is the '123'. The implication is that the first and second are derived from the W and space and what you have is a zero length result that 'consumes' one character of the source string.
select test,
regexp_count(TEST,'\d*') Pattern2_c,
regexp_substr(TEST,'\d*') Pattern2,
regexp_substr(TEST,'\d*',1,1) Pattern2_1,
regexp_substr(TEST,'\d*',1,2) Pattern2_2,
regexp_substr(TEST,'\d*',1,3) Pattern2_3,
regexp_substr(TEST,'\d*',1,4) Pattern2_4
from (select '123 W' TEST from dual
union
select 'W 123' TEST from dual
);
Oracle treats a zero-length string as NULL, which is why the zero-length matches show up as null.
The result doesn't "feel" right, but then if you ask a computer deep philosophical questions about how many zero length substrings are contained in a string, I wouldn't bet on any answer.
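The same kind of listing can be produced with Python's re module. This is only a sketch: Python uses a different regex engine from Oracle, engines may count empty matches differently, and digit_star_matches is a hypothetical helper name.

```python
import re

def digit_star_matches(text):
    """List (start position, matched text) for every match of \\d* in text."""
    return [(m.start(), m.group()) for m in re.finditer(r"\d*", text)]

for sample in ("123 W", "W 123"):
    print(repr(sample), digit_star_matches(sample))
```

For 'W 123' the zero-length matches at the W and at the space come before the '123' match, which is the situation the regexp_count result points to.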
After thinking through this, it actually makes sense. The pattern \d* is saying to match a digit zero or more times. The problem here is that the beginning of the string will always match this pattern, because of the zero or more times.
If the string begins with a number, then the match will include those digits, so given 123 W, the pattern matches 123. However, given the string W 123, the pattern also matches at the beginning, but it matches against 0 characters. This is why you get a NULL result.
This is a general regex thing and not an Oracle thing. You have to be careful with the * quantifier.
Here are two regex fiddle examples to illustrate this, using the string W 123:
\d+ shows 1 match on 123
\d* shows 1 match on nothing (i.e. the beginning of the string)
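For readers without a regex fiddle at hand, the first-match behaviour can be checked with a few lines of Python. The first_match helper is a hypothetical name, and Python's re module stands in for the fiddle here, not for Oracle.

```python
import re

def first_match(pattern, text):
    """Return ((start, end), matched text) of the first match, or None."""
    m = re.search(pattern, text)
    return (m.span(), m.group()) if m else None

text = "W 123"
print(first_match(r"\d+", text))  # first run of one or more digits
print(first_match(r"\d*", text))  # zero digits already match at position 0
```

The \d* search stops at position 0 with an empty match, so it never reaches the digits further along.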

regular expression using oracle REGEXP_INSTR

I want to use the REGEXP_INSTR function in a query to search for any match for user input, but I don't know how to write the regular expression. For example, it should match any value that includes the word car followed by an unspecified number of letters/numbers/spaces and then the word Paterson. Can anyone please help me with writing this regex?
Ok, so let's break this down.
"any value that includes the word car"
I surmise from this that the word car doesn't need to be at the start of the string, therefore I would start the format string with...
'^.*'
Here the '^' character means the start of the string, the '.' means any character and '*' means 0 or more of the preceding character. So zero or more of any character after the start of the string.
Then the word 'car', so...
'^.*car'
Next up...
"followed by unspecified numbers of letters/numbers/spaces"
I'm guessing that unspecified means zero or more. This is very similar to what we did to identify any characters that might come before 'car'. As before, the '.' means any character.
'^.*car.*'
However, if unspecified means one or more, then you can use '+' in place of '*'
"then the word Paterson"
I'm going to assume that as this is the end of the description, there are no more characters after 'Paterson'.
'^.*car.*Paterson$'
The '$' symbol means that the 'n' of 'Paterson' must be at the end of the string.
Code example:
select
REGEXP_INSTR('123456car1234ABCDPaterson', '^.*car.*Paterson$') as rgx
from dual
Output
RGX
----------
1
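The finished pattern can also be tried outside the database. Below is a minimal Python sketch; the instr helper is hypothetical and only mimics the "1-based position, or 0 when there is no match" convention seen in the output above. The second and third sample strings are made up for illustration.

```python
import re

PATTERN = re.compile(r"^.*car.*Paterson$")

def instr(text):
    """Return the 1-based start of the match, or 0 when nothing matches."""
    m = PATTERN.search(text)
    return m.start() + 1 if m else 0

print(instr("123456car1234ABCDPaterson"))  # 'car' ... 'Paterson' at the end
print(instr("car Paterson Avenue"))        # text after 'Paterson'
print(instr("Paterson car"))               # wrong order
```

Because the pattern is anchored with ^ and starts with .*, any successful match begins at the first character.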

SQL - need help in parsing text of a field

I have a select query that fetches a field with complex data. I need to parse that data into a specified format. Please help with your expertise:
selected string = complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I
expected output - PB|I
Please help me in writing a sql regular expression to accomplish this output.
The first step in figuring out the regular expression is to be able to describe it in plain language. Based on what we know (and as others have said, more info is really needed) from your post, some assumptions have to be made.
I'd take a stab at it by describing it like this, which is based on the sample data you provided: I want the sets of one or more characters that follow the equal signs but not including the following space or end of the line. The output should be these sets of characters, separated by a pipe, in the order they are encountered in the string when reading from left to right. My assumptions are based on your test data: only 2 equal signs exist in the string and the last data element is not followed by a space but by the end of the line. A regular expression can be built using that info, but you also need to consider other facts which would change the regex.
Could there be more than 2 equal signs?
Could there be an empty data element after the equal sign?
Could the data set after the equal sign contain one or more spaces?
All these affect how the regex needs to be designed. All that said, and based on the data provided and the assumptions as stated, next I would build a regex that describes the string (really translating from the plain language to the regex language), grouping around the data sets we want to preserve, then replace the string with those data sets separated by a pipe.
with tbl(str) as (
  select 'complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I' from dual
)
select regexp_replace(str, '^.*=([^ ]+).*=([^ ]+)$', '\1|\2') result from tbl;
RESU
----
PB|I
The match regex explained:
^        match the beginning of the line
.        any character
*        zero or more of the preceding element (here, any character)
=        followed by an equal sign
(        start remembered group 1
[^ ]+    which is a set of one or more characters that are not a space
)        end remembered group 1
.*=      followed by any number of any characters, ending in an equal sign
([^ ]+)  followed by the second remembered group of non-space characters
$        followed by the end of the line
The replace string explained:
\1       the first remembered group
|        a pipe character
\2       the second remembered group
Keep in mind this answer is for your exact sample data as shown, and may not work in all cases. You need to analyse the data you will be working with. At any rate, these steps should get you started on breaking down the problem when faced with a challenging regex. The important thing is to consider all types of data and patterns (or NULLs) that could be present and allow for all cases in the regex so you return accurate data.
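The same match-and-replace can be tried in Python to experiment with other input strings. This is a sketch, not the Oracle implementation: extract_pair is a hypothetical name, and Python's re.sub stands in for regexp_replace.

```python
import re

def extract_pair(text):
    """Replace the whole string with the two remembered groups joined by a pipe."""
    return re.sub(r"^.*=([^ ]+).*=([^ ]+)$", r"\1|\2", text)

sample = "complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I"
print(extract_pair(sample))
```

If the string does not fit the pattern at all, re.sub returns it unchanged, so unexpected data shows up as an unmodified value.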
Edit: Check this out, it parses all the values right after the equal signs and allows for nulls:
with tbl(str) as (
  select 'a=zz|complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I - testing|test1=|test2=test2 - testing' from dual
)
select regexp_substr(str, '=([^ |]*)( |||$)', 1, level, null, 1) output, level
from tbl
connect by level <= regexp_count(str, '=')
ORDER BY level;

OUTPUT                    LEVEL
-------------------- ----------
zz                            1
PB                            2
I                             3
                              4
test2                         5
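A rough Python equivalent of this extraction, for experimenting with the pattern: values_after_equals is a hypothetical helper, and it uses only the capturing part of the regex, since re.findall already walks through every match without needing CONNECT BY.

```python
import re

def values_after_equals(text):
    """Return what follows each '=' up to the next space or pipe (may be empty)."""
    return re.findall(r"=([^ |]*)", text)

sample = ("a=zz|complexType|ChannelCode=PB - Phone In A Box|"
          "IncludeExcludeIndicator=I - testing|test1=|test2=test2 - testing")
for level, value in enumerate(values_after_equals(sample), start=1):
    print(level, repr(value))
```

The empty fourth value corresponds to test1=, where nothing follows the equal sign.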

Postgresql : Pattern matching of values starting with "IR"

If I have table contents that look like this:
id | value
------------
1 |CT 6510
2 |IR 52
3 |IRAB
4 |IR AB
5 |IR52
I need to get only those rows whose contents start with "IR" followed by a number (spaces ignored). That means I should get the values:
2 |IR 52
5 |IR52
because they start with "IR" and the next non-space character is a digit, unlike IRAB, which also starts with "IR" but has "A" as the next character. I've only been able to query everything starting with IR, so the other IR rows also appear.
select * from public.record where value ilike 'ir%'
How do I do this? Thanks.
You can use the operator ~, which performs regular expression matching.
For example:
SELECT * from public.record where value ~ '^IR ?\d';
Add an asterisk to perform case-insensitive matching.
SELECT * from public.record where value ~* '^ir ?\d';
The symbols mean:
^: beginning of the string
?: the character before (here a space) is optional; use ' *' instead to allow any number of spaces
\d: any single digit, equivalent to [0-9]
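To check the pattern against the sample rows without a database, a small Python sketch follows. The starts_with_ir_number helper is a hypothetical name, and Python's re module is used as a stand-in for the ~ and ~* operators.

```python
import re

# Sample rows copied from the question.
rows = [(1, "CT 6510"), (2, "IR 52"), (3, "IRAB"), (4, "IR AB"), (5, "IR52")]

def starts_with_ir_number(value, ignore_case=False):
    """True when value starts with IR, an optional space, then a digit."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(r"^IR ?\d", value, flags) is not None

matching = [row for row in rows if starts_with_ir_number(row[1])]
print(matching)
```

Passing ignore_case=True plays the role of ~*, so a value such as ir52 would also be accepted.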
See for more info: Regular Expression Match Operators
See also this question, very informative: difference-between-like-and-in-postgres

PostgreSQL String search for partial patterns removing extraneous characters

Looking for a simple SQL (PostgreSQL) regular expression or similar solution (maybe soundex) that will allow a flexible search, so that dashes, spaces and such are ignored and only the raw characters are compared against the table.
Currently using:
SELECT * FROM Productions WHERE part_no ~* '%search_term%'
If the user types UTR-1, it fails to bring up UTR1 or UTR 1 stored in the database.
Matches do not happen when a part_no has a dash and the user omits this character (or vice versa).
Example: a search for part UTR-1 should find all of the matches below.
UTR1
UTR --1
UTR 1
Any suggestions?
You may well find the official, built-in (from 8.3 at least) full-text search capabilities in PostgreSQL worth looking at:
http://www.postgresql.org/docs/8.3/static/textsearch.html
For example:
It is possible for the parser to produce overlapping tokens from the
same piece of text.
As an example, a hyphenated word will be reported both as the entire word
and as each component:
SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
      alias      |               description                |     token
-----------------+------------------------------------------+---------------
 numhword        | Hyphenated word, letters and digits      | foo-bar-beta1
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_asciipart | Hyphenated word part, all ASCII          | bar
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | beta1
SELECT *
FROM Productions
WHERE REGEXP_REPLACE(part_no, '[^[:alnum:]]', '')
= REGEXP_REPLACE('UTR-1', '[^[:alnum:]]', '')
Create an index on REGEXP_REPLACE(part_no, '[^[:alnum:]]', '') for this to work fast.
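The normalize-both-sides idea behind the REGEXP_REPLACE query can be sketched in Python as follows. The normalize and search helpers are hypothetical, the character class [^0-9A-Za-z] is an ASCII-only approximation of [^[:alnum:]], and UTX-1 is a made-up part number added as a non-match.

```python
import re

def normalize(part_no):
    """Strip everything except ASCII letters and digits."""
    return re.sub(r"[^0-9A-Za-z]", "", part_no)

def search(term, part_numbers):
    """Return the stored part numbers whose normalized form equals the term's."""
    key = normalize(term)
    return [p for p in part_numbers if normalize(p) == key]

stored = ["UTR1", "UTR --1", "UTR 1", "UTX-1"]
print(search("UTR-1", stored))
```

Like the SQL version, the comparison is case-sensitive; lower-casing both sides inside normalize would relax that.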