Impala Regex - find specific sequence of characters, with delimiters between them, some are not letters, digits or underscore - sql

I am new to regex and need to search a string field in Impala for multiple matches of this exact sequence of characters: ~FC* followed by 11 more * characters, which may or may not have letters/digits between them (the asterisks are basically delimiters in this string field). The 12th * (counting the one in ~FC* as #1) should be immediately followed by Y~.
Since the asterisks are not letters or digits, I am unsure how to search for these delimiters properly.
This is my SQL so far:
select
regexp_extract(col_name, '(~FC\\*).*(\\*Y~)', 1) as "pattern_found"
from db.table
where id = 123456789
limit 1
data returned:
pattern_found
--------------
~FC*
(~FC\\*) in Impala SQL returns ~FC*, which is great (I got it from my other question).
I have been trying (~FC\\*).*(\\*Y~), which obviously isn't counting the number of asterisks, but it is also not picking up the Y.
This is a test string, it has 2 occurrences:
N4*CITY*STATE*2155446*2120~FC*C*IND*30*MC*blah blah fjdgfeufh*27*0*****Y~FC*Z*IND*39*MC*jhlkfhfudfgsdkufgkusgfn*23*0*****Y~
The results should be these 2, which share an overlapping ~ between them. I will settle for at least the first being found if both cannot be.
~FC*C*IND*30*MC*blah blah fjdgfeufh*27*0*****Y~
~FC*Z*IND*39*MC*jhlkfhfudfgsdkufgkusgfn*23*0*****Y~
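A likely reason the query above returns only ~FC* is the last argument of regexp_extract: index 1 asks for the first parenthesised group, while index 0 asks for the whole match. The sketch below reproduces this outside the database with Python's re module, which is an assumption made for illustration (Impala's regex engine is not identical to Python's). It also shows that the greedy .* runs to the last *Y~ in the string, so the whole match merges both records.

```python
import re

# Test string from the question: two ~FC...Y~ records sharing one "~".
text = (
    "N4*CITY*STATE*2155446*2120~FC*C*IND*30*MC*blah blah fjdgfeufh*27*0*****Y"
    "~FC*Z*IND*39*MC*jhlkfhfudfgsdkufgkusgfn*23*0*****Y~"
)

# Same pattern as in the query, written with single backslashes because
# no SQL string-literal escaping is involved here.
match = re.search(r"(~FC\*).*(\*Y~)", text)

group_1 = match.group(1)      # first parenthesised group only
whole_match = match.group(0)  # everything the pattern matched

print(group_1)
print(whole_match)
```

Here group 1 is just the ~FC* prefix, and the whole match spans from the first ~FC* to the end of the string.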

I figured out a solution, but I am happy to learn of a better way to accomplish this.
This is what worked in Impala SQL; it needed parentheses and double-escaped backslashes for all the asterisks:
(~FC\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*Y)
Full SQL:
select
regexp_extract(col_name, '(~FC\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*Y)', 1) as "pattern_found"
from db.table
where id = 123456789
limit 1
And here is the RegexDemo without the additional syntax needed for Impala SQL.
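One possibly shorter way to write the same idea is a counted repetition, {11}, instead of spelling out the delimiter group eleven times. The sketch below checks that form against the test string using Python's re module (an assumption for illustration). The lookahead (?=~) used to recover both overlapping records is a Python feature; to my understanding Impala's RE2-based engine does not support lookahead, so only the {11} part is likely to carry over, and it would still need the doubled backslashes in an Impala string literal.

```python
import re

# Test string from the question: two ~FC...Y~ records sharing one "~".
text = (
    "N4*CITY*STATE*2155446*2120~FC*C*IND*30*MC*blah blah fjdgfeufh*27*0*****Y"
    "~FC*Z*IND*39*MC*jhlkfhfudfgsdkufgkusgfn*23*0*****Y~"
)

# ~FC* followed by exactly 11 more "*"-terminated fields, then Y.
# The lookahead (?=~) requires the trailing "~" without consuming it,
# so the next record can still start on that same "~".
pattern = re.compile(r"~FC\*(?:[^*]*\*){11}Y(?=~)")

# Re-attach the "~" that the lookahead checked but did not consume.
records = [m.group(0) + "~" for m in pattern.finditer(text)]

for record in records:
    print(record)
```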

Related

Extracting string between two characters in sql oracle database

I need to extract a string that will be located between two characters, always following the same pattern.
sample string:
A CRN_MOB_H_001 a--> <AVLB>
AVLB is what I want to extract. The whole string will always have the same pattern, and everything before the < is irrelevant to me.
The string will always have the same pattern:
Some string with possible special characters such as <>, although very unlikely so, it can be ignored if too complicated
a space
then -->
a space
and then the part that is interesting <XXXXXXX>
The XXXXXXX representing the part I want to extract
thank you for your time.
I have tried several things, could not get anywhere I wanted.
Please try this REGEXP_SUBSTR(), which selects what is in the angled brackets when they occur at the end of the string.
Note the WITH clause just sets up test data and is a good way to supply data for people to help you here.
WITH tbl(str) AS (
SELECT 'A CRN_MOB_H_001 a--> <AVLB>' FROM dual
)
SELECT REGEXP_SUBSTR(str, '.*<(.*)>$', 1, 1, NULL, 1) DATA
FROM tbl;
DATA
----
AVLB
1 row selected.
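To see how the greedy .* in that pattern behaves when the leading text itself contains angle brackets, here is a small sketch of the same regex in Python's re module. The helper name extract_last_bracketed and the second sample string are hypothetical, and Python's engine is only an approximation of Oracle's.

```python
import re

def extract_last_bracketed(s):
    """Return the text inside the final <...> when it ends the string, else None."""
    # Greedy .* pushes the "<" as far right as possible, so earlier
    # angle brackets in the string are skipped over.
    m = re.search(r".*<(.*)>$", s)
    return m.group(1) if m else None

print(extract_last_bracketed("A CRN_MOB_H_001 a--> <AVLB>"))
print(extract_last_bracketed("x <y> a--> <AVLB>"))
```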

Using regexp_like in Oracle to match on multiple string conditions using a range of values

I have a field in my Oracle DB that contains codes and from this I need to pull multiple values using a range of values.
As an example I need to pull all codes in the range C00.0 - C39.9 i.e. begins with C, the second character can be 0-3, third character is 0-9, followed by a "." and then the last digit is 0-9 e.g.
CODES
-----
C00.0
C10.4
C15.8
C39.8
The example above is for one pattern, I have multiple patterns to match on, here is another example
C50.011-C69.92
Again, starts with C, second character is 5-6, third is 0-9, fourth is ".", fifth is 0-9, sixth is 1-2 etc.
I have tried the following but my pipe function doesn't appear to pick up the second condition and therefore I am only getting results for the first condition '^[C][0-3][0-9][.][0-9]':
SELECT DISTINCT CODES
FROM
TABLE
WHERE REGEXP_LIKE (CODES, '^[C][0-3][0-9][.][0-9]|
^[C][4][0-3][.][0-9]|
^[C][4][A][.][0-9]|
^[C][4][4-9][.][0-9]|
^[C][4][9][.][A][0-9]|
^[C][5-6][0-9][.][0-9][1-9]|
^[C][7][0-5][.][0-9]|
^[C][7][A-B][.][0-8]')
ORDER BY CODES
I would be very grateful if anyone could make a suggestion on how I can pull the additional patterns.
You have newlines in the pattern -- in other words, your attempt at readability is causing the problem. You can just remove them, although I would probably factor out common elements:
WHERE REGEXP_LIKE (CODES, '^[C]([0-3][0-9][.][0-9]|[4][0-3][.][0-9]|[4][A][.][0-9]|[4][4-9][.][0-9]|[4][9][.][A][0-9]|[5-6][0-9][.][0-9][1-9]|[7][0-5][.][0-9]|[7][A-B][.][0-8])')
I think you also want $ at the end.
If you want readability, you could use or:
SELECT DISTINCT CODES
FROM TABLE
WHERE REGEXP_LIKE (CODES, '^[C][0-3][0-9][.][0-9]') OR
REGEXP_LIKE (CODES, '^[C][4][0-3][.][0-9]') OR
. . .
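The effect of a line break inside the pattern can be reproduced outside Oracle. In the sketch below (Python's re module, used as an assumption for illustration, with only two of the alternatives kept for brevity), the newline becomes a literal character at the start of the second alternative, so a single-line code can never satisfy it.

```python
import re

# Pattern split across two lines: the newline is now part of the
# second alternative, which therefore requires a newline character.
broken = "^C[0-3][0-9][.][0-9]|\n^C4[0-3][.][0-9]"

# Same two alternatives on one line, with the common "C" factored out
# and "$" added so nothing may follow the code.
fixed = "^C([0-3][0-9][.][0-9]|4[0-3][.][0-9])$"

codes = ["C10.4", "C41.2", "C99.9"]
broken_hits = [c for c in codes if re.search(broken, c)]
fixed_hits = [c for c in codes if re.search(fixed, c)]

print(broken_hits)
print(fixed_hits)
```

With the newline in place only the first alternative ever matches, which is the symptom described in the question.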
Here is a regex pattern for what you want to match here:
^C[0-3][0-9][.][0-9]$
Demo
This would match the range of C00.0 - C39.9. If you want to match other ranges, then you would need an alternation with another pattern to cover those ranges.
Applying this to your current query:
SELECT DISTINCT CODES
FROM yourTable
WHERE REGEXP_LIKE (CODES, '^C[0-3][0-9][.][0-9]$');

SQL - need help in parsing text of a field

I have a select query and it fetches a field with complex data. I need to parse that data in specified format. please help with your expertise:
selected string = complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I
expected output - PB|I
Please help me in writing a sql regular expression to accomplish this output.
The first step in figuring out the regular expression is to be able to describe it plain language. Based on what we know (and as others have said, more info is really needed) from your post, some assumptions have to be made.
I'd take a stab at it by describing it like this, which is based on the sample data you provided: I want the sets of one or more characters that follow the equal signs but not including the following space or end of the line. The output should be these sets of characters, separated by a pipe, in the order they are encountered in the string when reading from left to right. My assumptions are based on your test data: only 2 equal signs exist in the string and the last data element is not followed by a space but by the end of the line. A regular expression can be built using that info, but you also need to consider other facts which would change the regex.
Could there be more than 2 equal signs?
Could there be an empty data element after the equal sign?
Could the data set after the equal sign contain one or more spaces?
All these affect how the regex needs to be designed. All that said, and based on the data provided and the assumptions as stated, next I would build a regex that describes the string (really translating from the plain language to the regex language), grouping around the data sets we want to preserve, then replace the string with those data sets separated by a pipe.
with tbl(str) as (
select 'complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I' from dual
)
select regexp_replace(str, '^.*=([^ ]+).*=([^ ]+)$', '\1|\2') result from tbl;
RESU
----
PB|I
The match regex explained:
^ Match the beginning of the line
. followed by any character
* followed by 0 or more 'any characters' (refers to the previous character class)
= followed by an equal sign
( start remembered group 1
[^ ]+ which is a set of one or more characters that are not a space
) end remembered group one
.*= followed by any number of any characters but ending in an equal sign
([^ ]+) followed by the second remembered group of non-space characters
$ followed by the end of the line
The replace string explained:
\1 The first remembered group
| a pipe character
\2 The second remembered group
Keep in mind this answer is for your exact sample data as shown, and may not work in all cases. You need to analyse the data you will be working with. At any rate, these steps should get you started on breaking down the problem when faced with a challenging regex. The important thing is to consider all types of data and patterns (or NULLs) that could be present and allow for all cases in the regex so you return accurate data.
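The match-and-replace step described above can be tried outside the database. This is a minimal sketch using Python's re.sub as a stand-in for regexp_replace (an assumption for illustration; the two engines differ in details, but this pattern uses only features common to both).

```python
import re

s = "complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I"

# Group 1: non-space characters after the first "=".
# Group 2: non-space characters after the last "=", up to end of line.
result = re.sub(r"^.*=([^ ]+).*=([^ ]+)$", r"\1|\2", s)

print(result)
```

The leading ^.* first tries the last equal sign, fails to find a second one after it, and backtracks to the earlier one, which is why group 1 ends up holding PB.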
Edit: Check this out, it parses all the values right after the equal signs and allows for nulls:
with tbl(str) as (
select 'a=zz|complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I - testing|test1=|test2=test2 - testing' from dual
)
select regexp_substr(str, '=([^ |]*)( |||$)', 1, level, null, 1) output, level
from tbl
connect by level <= regexp_count(str, '=')
ORDER BY level;
OUTPUT                    LEVEL
-------------------- ----------
zz                            1
PB                            2
I                             3
                              4
test2                         5
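The same "every value after an equal sign, allowing empty ones" idea can be sketched with Python's re.findall, which plays the role of the CONNECT BY LEVEL loop here. This is an illustration under assumptions: the terminator group is written as a space, an escaped literal pipe, or end of line, which is my reading of the intent rather than a copy of the Oracle pattern.

```python
import re

s = ("a=zz|complexType|ChannelCode=PB - Phone In A Box|"
     "IncludeExcludeIndicator=I - testing|test1=|test2=test2 - testing")

# After each "=", capture zero or more characters that are neither a
# space nor a pipe; the value must end at a space, a pipe, or the end.
values = re.findall(r"=([^ |]*)(?: |\||$)", s)

print(values)
```

The empty string in fourth position corresponds to test1=, matching the blank row at level 4 in the output above.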

SQL : REGEX MATCH - Character followed by numbers inside quotes

I have a column in sql which holds value inside double quotes like "P1234567" , "P1234" etc..
I need to identify only values which start with the letter P and are followed by exactly seven digits. I tried where column like '"P[0-9][0-9][0-9][0-9][0-9][0-9][0-9]"' but it doesn't seem to work.
Can someone please correct me or point me to a thread which can help me out?
Thanks
Standard SQL has no regex support, but most SQL engines have regex extensions added to them on top of the standard SQL. So, for example, if you're using MySQL then you'd do this:
... WHERE column REGEXP '^"P[0-9]{7}"'
And if you're using Postgres then that would be:
... WHERE column ~ '^"P[0-9]{7}"'
(updated to match the double-quote part of the question, I'd misunderstood that to begin with)
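The patterns above anchor the start of the value but not the end, so a longer value that merely begins with a quoted P-code could also pass. A fully anchored check can be sketched in Python as follows; the names P_CODE and is_p_code are hypothetical, and in MySQL or Postgres the equivalent would presumably be a trailing $ in the pattern.

```python
import re

# A double quote, "P", exactly seven digits, and a closing double quote.
P_CODE = re.compile(r'"P[0-9]{7}"')

def is_p_code(value):
    # fullmatch requires the whole value to match, not just a prefix.
    return P_CODE.fullmatch(value) is not None

print(is_p_code('"P1234567"'))
print(is_p_code('"P1234"'))
```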
How about using length and isnumeric:
Select
*
from
mytable
where
mycolumn like '"P%'
and len(mycolumn) = 10 --2 chars for quotes + 1 for 'P' + 7 for the digits
and isnumeric(substring(mycolumn, 3, 7))=1
This answer is for SQL Server; other DBMSs may use a different syntax for the length function.

Finding first and second word in a string in SQL Developer

How can I find the first word and second word in a string separated by unknown number of spaces in SQL Developer? I need to run a query to get the expected result.
String:
Hello Monkey this is me
Different sentences have different number of spaces between the first and second word and I need a generic query to get the result.
Expected Result:
Hello
Monkey
I have managed to find the first word using substr and instr. However, I do not know how to find the second word due to the unknown number of spaces between the first and second word.
select substr((select ltrim(sentence) from table1),1,
(select (instr((select ltrim(sentence) from table1),' ',1,1)-1)
from table1))
from table1
Since you seem to want them as separate result rows, you could use a simple common table expression to duplicate the rows, once with the full row, then with the first word removed. Then all you have to do is get the first word from each;
WITH cte AS (
SELECT value FROM table1
UNION ALL
SELECT SUBSTR(TRIM(value), INSTR(TRIM(value), ' ')) FROM table1
)
SELECT SUBSTR(TRIM(value), 1, INSTR(TRIM(value), ' ') -1) word
FROM cte
Note that this very simple example assumes that there is a second word, if there isn't, NULL will be returned for both words.
An SQLfiddle to test with.
While Joachim Isaksson's answer is a robust and fast approach, you can also consider splitting the string and selecting from the resulting set of pieces. This is just meant as a hint for another approach, in case your requirements change (e.g. more than two string pieces).
You could then split by the regex /[ ]+/ and so get the words between the blanks.
Find more about splitting here: How do I split a string so I can access item x?
This will strongly depend on the SQL dialect you are using.
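The split-on-blanks idea can be sketched outside SQL. The example below uses Python's re.split with the /[ ]+/ pattern mentioned above; the function name first_two_words is hypothetical, and how you would express the split inside your database depends on the dialect.

```python
import re

def first_two_words(sentence):
    """Return (first, second) word; either is None when it is missing."""
    # Drop leading/trailing blanks, then split on runs of one or more spaces.
    words = re.split(r"[ ]+", sentence.strip(" "))
    first = words[0] if words[0] else None
    second = words[1] if len(words) > 1 else None
    return first, second

print(first_two_words("Hello     Monkey this is me"))
```

Because the split consumes whole runs of spaces, the number of blanks between the first and second word does not matter.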
Try this with REGEXP_SUBSTR:
SELECT
REGEXP_SUBSTR(sentence,'\w+\s+'),
REGEXP_SUBSTR(sentence,'\s+(\w+)\s'),
REGEXP_SUBSTR(sentence,'\s+(\w+)\s+(\w+)'),
REGEXP_SUBSTR(REGEXP_SUBSTR(sentence,'\s+(\w+)\s+(\w+)'),'\w+$'),
REGEXP_SUBSTR(sentence,'\s+(\w+)\s+$')
FROM table1;
result:
1 2 3 4 5
Hello Monkey Monkey this this is_me
To learn more about REGEXP_SUBSTR, refer to Using Regular Expressions With Oracle Database.
Test with SQL Fiddle: http://sqlfiddle.com/#!4/8e9ef/9
If you only want to get the first and the second word, use REGEXP_INSTR to get second word start position :
SELECT
REGEXP_SUBSTR(sentence,'\w+\s+') AS FIRST,
REGEXP_SUBSTR(sentence,'\w+\s',REGEXP_INSTR(sentence,'\w+\s+')+length(REGEXP_SUBSTR(sentence,'\w+\s+'))) AS SECOND
FROM table1;