Oracle: regexp for a complicated case - sql

I have a table, and one of the columns contains a string with items separated by semicolons (;).
I want to selectively transfer the data to a new table based on the pattern of the string.
For example, it may look like
16;;14;30;24;11;13;14;14;10;13;18;15;18;24;13/18;11;;23;12;;19;10;;11;26;;;42;26;38/39;12;;;;;;;11;;;;;;;;;;;;;;;
or
11;;11;11;11;11;11;11;11;11;11;11;11;11;11;11;11;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
I don't care about what's between the semicolons, but I care about which positions contain items. For example, if I only want the 1st, 3rd, 4th position to contain items, I would allow the following...
32;;14;18/12;;;;;;;;; or 32;;14;18/12;;;;55;;;;11;;;;;;;
This one down below is not okay because the 3rd position does not hold any value.
32;;;18/12;;;;;;;;;
If regexp works for this, then I can use merge into to move the desired records to the target table. If this cannot be done, I'll have to process each record in Java, and selectively insert the records to the new table.
source table:
id | StringValue | count
target table:
id | StringValue | count
The sql that I have in mind:
merge into you_target_table tt
using ( select StringValue, count
from source_table where REGEXP_LIKE ( StringValue, 'some pattern')
) st
on ( st.StringValue = tt.StringValue and st.count=tt.count )
when not matched then
insert (id, StringValue , count)
values (someseq.nextval, st.value1, st.count)
when matched then
update
set tt.count = tt.count + st.count;
Also, I'm certain that every StringValue in the source table is unique, so what comes after when matched then is not important, but I think the syntax requires me to have something there.

For each position where you want a value, put [^;]+;. That matches one or more characters that are not ;, followed by a ;. For each position you don't care about, put [^;]*;. That is almost the same, except that there may be no characters at all before the ;. Anchor the whole thing to the beginning of the string with ^.
So for your 1st, 3rd and 4th position example you'd get:
^[^;]+;[^;]*;[^;]+;[^;]+;
In a query that'd look like:
SELECT *
FROM elbat
WHERE regexp_like(nmuloc, '^[^;]+;[^;]*;[^;]+;[^;]+;');
It may be further improved by putting the subexpressions in a group, that is, putting parentheses around them, and using quantifiers: a number in curly braces after the group. For example, ([^;]+;){2} matches two consecutive positions that are not empty. Your example would be shortened to:
^[^;]+;[^;]*;([^;]+;){2}
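The rule above (a mandatory slot is [^;]+; and a don't-care slot is [^;]*;) can be checked outside the database. The following is a minimal Python sketch, not Oracle code; the helper name build_position_pattern is hypothetical, and it assumes Python's re module treats this simple pattern the same way Oracle's REGEXP_LIKE does.

```python
import re

def build_position_pattern(required_positions):
    """Return an anchored regex that demands a non-empty item at each listed 1-based position."""
    last = max(required_positions)
    parts = []
    for pos in range(1, last + 1):
        # '+' forces at least one non-semicolon character; '*' also accepts an empty slot.
        parts.append('[^;]+;' if pos in required_positions else '[^;]*;')
    return re.compile('^' + ''.join(parts))

pattern = build_position_pattern({1, 3, 4})
print(pattern.pattern)
print(bool(pattern.search('32;;14;18/12;;;;;;;;;')))   # 3rd position filled
print(bool(pattern.search('32;;;18/12;;;;;;;;;')))     # 3rd position empty
```

The generated pattern string can be pasted into the REGEXP_LIKE call. Note that it only requires the listed positions to be filled; it does not forbid items in other positions.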

While sticky bit's answer is totally correct, there is another similar but perhaps more readable solution:
SELECT *
FROM elbat
WHERE regexp_substr(nmuloc, '(.*?)(;|$)', 1, 1, '', 1) is not null
AND regexp_substr(nmuloc, '(.*?)(;|$)', 1, 3, '', 1) is not null
AND regexp_substr(nmuloc, '(.*?)(;|$)', 1, 4, '', 1) is not null;
Pros:
clearly states which position numbers must not be null
uses one universal pattern for any condition, so there is no need to change the regex
can use any regex as the delimiter, not only a single character
actually extracts the item, so you can test it further with any function
Cons:
rather verbose
n times slower, where n is the number of conditions
slower still (up to 2 times) because of backtracking on each non-delimiter character
However, in my experience this efficiency difference is minor unless the query runs against billions of rows. And even then, disk reads would consume most of the time.
How it's made:
(.*?)(;|$) - lazily searches for any character sequence (possibly zero-length) ended with delimiter or end of string
1 - position at which to start the search. 1 is the default; it is given only to reach the next parameter
1, 3 or 4 - occurrence of the pattern to return
'' - match_parameter. Can be used to set the matching mode, but here it is given only to reach the last parameter
1 - subexpression number. Makes regexp_substr return only the first capturing group, that is (.*?), i.e. the item itself without the delimiter.
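The parameter walk-through above can be mirrored in a few lines of Python. This is only a sketch: nth_item and has_items are hypothetical names, None stands in for Oracle's NULL (Oracle treats an empty string as NULL), and the handling of empty matches is assumed to line up with regexp_substr for these inputs.

```python
import re
from itertools import islice

# Same pattern as in the query: a lazy item followed by a delimiter or end of string.
ITEM = re.compile(r'(.*?)(;|$)')

def nth_item(value, n):
    """Return the n-th (1-based) semicolon-separated item, or None if it is empty or missing."""
    match = next(islice(ITEM.finditer(value), n - 1, None), None)
    if match is None:
        return None
    # An empty first group becomes None, playing the role of NULL.
    return match.group(1) or None

def has_items(value, positions):
    """True when every listed position holds a non-empty item."""
    return all(nth_item(value, p) is not None for p in positions)

print(has_items('32;;14;18/12;;;;;;;;;', [1, 3, 4]))
print(has_items('32;;;18/12;;;;;;;;;', [1, 3, 4]))
```

Because the item itself is returned, further checks (numeric range, length, and so on) can be applied to it, which is the last advantage listed under Pros.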

Related

Underscore and LEFT function

I have a column that has values that look like the following:
17_data...
18_data...
1801151...data
The data in this column isn't the cleanest, so I am trying to use the LEFT function to identify the rows that have the 2017 year followed by an underscore: LEFT(column, 3) = '17[_]'. This doesn't return a single row. So to troubleshoot, I added this WHERE clause to the SELECT statement to see what was getting returned, and I got the value 175 where the actual first three characters are "17_".
Why is this, and how can I structure my WHERE clause to pick up those rows?
When you tried adding a WHERE clause with the condition LEFT(column, 3) = '17[_]', it was doomed to fail. The = operator performs an exact comparison: both sides must be equal. That is, it would look for rows whose first 3 characters (LEFT(column, 3)) are equal to 17[_], which is 5 characters: one, seven, bracket, underscore, bracket. A 3-character text will never exactly match a 5-character text.
You should have written simply:
WHERE LEFT(column, 3) = '17_'
I guess that you've got the idea for adding a bracket from reading about LIKE patterns. LIKE operator allows you to look for strings contained at start/end/middle of the data.
WHERE column LIKE 'mom%' - starts with mom
WHERE column LIKE '%dad' - ends with dad
and so on. LIKE supports '%', meaning "and then text of any length", and also "_", meaning "and then just one character". This creates a problem: when you want to say "starts with _mom", you cannot write
WHERE column LIKE '_mom%'
because it would also match 9mom, Bmom, and so on, since _ means 'any single character'. That's why in such cases, and only in LIKE, you have to write the underscore in brackets (the bracket form is SQL Server syntax; other databases typically use an ESCAPE clause instead):
WHERE column LIKE '[_]mom%' - starts with _mom
Knowing that, it's obvious that you could construct your 'starts with 17_' with LIKE as well:
SELECT column1, column2, ..., columnN
FROM sometable
WHERE column LIKE '17[_]%'
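To see why the bracket matters in LIKE but breaks an = comparison, here is a small Python sketch that translates a LIKE pattern into a regular expression. The functions like_to_regex and like are hypothetical helpers written for illustration; they cover only %, _ and literal bracket sets such as [_], and they ignore collation and case-insensitivity.

```python
import re

def like_to_regex(pattern):
    """Translate a LIKE pattern (%, _, literal [..] sets only) into a compiled regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == '%':
            out.append('.*')            # any text of any length
        elif ch == '_':
            out.append('.')             # exactly one character
        elif ch == '[' and ']' in pattern[i + 1:]:
            end = pattern.index(']', i + 1)
            out.append('[' + re.escape(pattern[i + 1:end]) + ']')
            i = end                     # skip past the closing bracket
        else:
            out.append(re.escape(ch))   # ordinary literal character
        i += 1
    return re.compile(''.join(out), re.DOTALL)

def like(value, pattern):
    """True when the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None

print(like('17_data', '17[_]%'))    # literal underscore required
print(like('175data', '17[_]%'))
print(like('175data', '17_%'))      # bare _ accepts any character
print('17_data'[:3] == '17[_]')     # 3 characters never equal 5 characters
```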

SQL - need help in parsing text of a field

I have a select query that fetches a field with complex data. I need to parse that data into a specified format. Please help with your expertise:
selected string = complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I
expected output - PB|I
Please help me in writing a sql regular expression to accomplish this output.
The first step in figuring out the regular expression is to be able to describe it in plain language. Based on what we know from your post (and as others have said, more info is really needed), some assumptions have to be made.
I'd take a stab at it by describing it like this, which is based on the sample data you provided: I want the sets of one or more characters that follow the equal signs but not including the following space or end of the line. The output should be these sets of characters, separated by a pipe, in the order they are encountered in the string when reading from left to right. My assumptions are based on your test data: only 2 equal signs exist in the string and the last data element is not followed by a space but by the end of the line. A regular expression can be built using that info, but you also need to consider other facts which would change the regex.
Could there be more than 2 equal signs?
Could there be an empty data element after the equal sign?
Could the data set after the equal sign contain one or more spaces?
All these affect how the regex needs to be designed. All that said, and based on the data provided and the assumptions as stated, next I would build a regex that describes the string (really translating from the plain language to the regex language), grouping around the data sets we want to preserve, then replace the string with those data sets separated by a pipe.
SQL> with tbl(str) as (
2 select 'complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I' from dual
3 )
4 select regexp_replace(str, '^.*=([^ ]+).*=([^ ]+)$', '\1|\2') result from tbl;
RESU
----
PB|I
The match regex explained:
^ Match the beginning of the line
. followed by any character
* followed by 0 or more 'any characters' (refers to the previous character class)
= followed by an equal sign
( start remembered group 1
[^ ]+ which is a set of one or more characters that are not a space
) end remembered group one
.*= followed by any number of any characters but ending in an equal sign
([^ ]+) followed by the second remembered group of non-space characters
$ followed by the end of the line
The replace string explained:
\1 The first remembered group
| a pipe character
\2 the second remembered group
Keep in mind this answer is for your exact sample data as shown, and may not work in all cases. You need to analyse the data you will be working with. At any rate, these steps should get you started on breaking down the problem when faced with a challenging regex. The important thing is to consider all types of data and patterns (or NULLs) that could be present and allow for all cases in the regex so you return accurate data.
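The same match and replace strings can be tried quickly in Python before running them against the table. This is a sketch only: channel_and_indicator is a hypothetical name, and it assumes Python's re.sub backtracks the same way as Oracle's regexp_replace for this pattern. As with regexp_replace, a string that does not match is returned unchanged.

```python
import re

def channel_and_indicator(value):
    """Keep the non-space run after each of the two equal signs, joined by a pipe."""
    return re.sub(r'^.*=([^ ]+).*=([^ ]+)$', r'\1|\2', value)

sample = 'complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I'
print(channel_and_indicator(sample))
```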
Edit: Check this out, it parses all the values right after the equal signs and allows for nulls:
SQL> with tbl(str) as (
2 select 'a=zz|complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I - testing|test1=|test2=test2 - testing' from dual
3 )
4 select regexp_substr(str, '=([^ |]*)( |\||$)', 1, level, null, 1) output, level
5 from tbl
6 connect by level <= regexp_count(str, '=')
7 ORDER BY level;
OUTPUT                    LEVEL
-------------------- ----------
zz                            1
PB                            2
I                             3
                              4
test2                         5
SQL>
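For comparison, a Python sketch of the same idea: collect whatever follows each equal sign up to the next space, pipe, or end of string. The name values_after_equals is hypothetical. Unlike the Oracle query, which shows NULL for the empty element, Python returns an empty string in that slot.

```python
import re

def values_after_equals(value):
    """Return the run of non-space, non-pipe characters after every equal sign."""
    # The character class already stops at a space or pipe, so no trailing group is needed here.
    return re.findall(r'=([^ |]*)', value)

sample = ('a=zz|complexType|ChannelCode=PB - Phone In A Box|'
          'IncludeExcludeIndicator=I - testing|test1=|test2=test2 - testing')
for level, item in enumerate(values_after_equals(sample), start=1):
    print(level, repr(item))
```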

Comparing fields when a field has data in between 2 characters that match the field being compared

I have code that looks like this:
left outer join
gme_batch_header bh
on
substr(ln.lot_number,instr(ln.lot_number,'(') + 1,
instr(ln.lot_number,')') - instr(ln.lot_number,'(') - 1)
=
bh.batch_no
It works fine, but I have come across a few lot numbers that contain two sections in parentheses. How would I compare what is between the second set of parentheses? Here is an example of the data in the lot number field:
E142059-307-SCRAP-(74055)
This one works with the code,
58LF-3-B-2-2-2 (SCRAP)-(61448)
This one tries comparing SCRAP with the batch no, which isn't correct. It needs to be the 61448.
The result is always the last item in parenthesis.
After more research, I actually got it to work with this code:
substr(ln.lot_number,instr(ln.lot_number,'(',-1) + 1, instr(ln.lot_number,')',-1) - instr(ln.lot_number,'(',-1) - 1)
Assuming SQL2005+, and it is always the last occurrence you want, then I would suggest finding the last instance of a ( in your query and substring to there. To get the last instance you could use something like:
REVERSE(SUBSTRING(REVERSE(lot_number),0,CHARINDEX('(',REVERSE(lot_number))))
If your version of Oracle supports regular expressions try this:
substr(regexp_substr(ln.lot_number,'[0-9]+\)$'),1,length(regexp_substr(ln.lot_number,'[0-9]+\)$'))-1)
Explanation:
regexp_substr(scrap_row,'[0-9]+\)$') ==> finds the digits immediately before a closing parenthesis at the end of the string. It returns the digits, but the result includes the closing parenthesis.
To remove the closing parenthesis, pass the result through substr and take the characters from the first one up to one character before the end of the string.
Query for analysis:
with scrap
as (select '58LF-3-B-2-2-2 (SCRAP)-(61448)' as scrap_row from dual)
select scrap_row,
regexp_substr(scrap_row,'[0-9]+\)$') as regex_substring,
length(regexp_substr(scrap_row,'[0-9]+\)$')) as length_regex_substring,
substr(regexp_substr(scrap_row,'[0-9]+\)$'),1,length(regexp_substr(scrap_row,'[0-9]+\)$'))-1) as regex_sans_parenthesis
from scrap
If you have 11g, this will do it pretty simply by using the subgroup argument of regexp_substr() and constructing the regex appropriately:
SQL> with tbl(data) as
(
select 'E142059-307-SCRAP-(74055)' from dual
union
select '58LF-3-B-2-2-2 (SCRAP)-(61448)' from dual
)
select data from tbl
where regexp_substr(data, '\((\d+)\)$', 1, 1, NULL, 1)
= '61448';
DATA
------------------------------
58LF-3-B-2-2-2 (SCRAP)-(61448)
The regular expression can be read as:
\( - Search for a literal left paren
( - Start a remembered subgroup
\d+ - followed by 1 or more digits
) - End remembered subgroup
\) - followed by a literal right paren
$ - at the end of the line.
The regexp_substr function arguments are:
Source - the source string
Pattern - The regex pattern to look for
position - Position in the string to start looking for the pattern
occurrence - If the pattern occurs multiple times, which occurrence you want
match_params - See the docs, not used here
subexpression - which subexpression to use (the remembered group)
So in English, look for a series of 1 or more digits surrounded by parens, where it occurs at the end of the line and save the digit part only to use to compare. IMHO a lot easier to follow/maintain than nested instr(), substr().
For reusability, make a function called get_last_number_in_parens() that contains this code and takes the string to search as its argument. This way the logic is encapsulated and can be reused by people who may not be comfortable with regular expressions but can still benefit from their power. It also leaves only one place to maintain the code. Then call it like this:
select data from tbl
where get_last_number_in_parens(data) = '61448';
How easy is that?!
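To illustrate the encapsulation idea, here is what such a helper could look like in Python. It is a sketch, not the PL/SQL function itself: get_last_number_in_parens borrows the name suggested above, while last_parens_content is a hypothetical analogue of the instr(..., -1) approach from the question.

```python
import re

def get_last_number_in_parens(value):
    """Digits inside the parentheses that close the string, or None if there are none."""
    match = re.search(r'\((\d+)\)$', value)
    return match.group(1) if match else None

def last_parens_content(value):
    """Text between the last '(' and the last ')', found by searching from the end."""
    start = value.rfind('(')
    end = value.rfind(')')
    if start == -1 or end < start:
        return None
    return value[start + 1:end]

for lot in ('E142059-307-SCRAP-(74055)', '58LF-3-B-2-2-2 (SCRAP)-(61448)'):
    print(lot, get_last_number_in_parens(lot), last_parens_content(lot))
```

The two helpers differ on purpose: the regex version insists on digits at the very end of the string, while the position-based version returns whatever the last pair of parentheses contains.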
Hello, you can check with this code. It works whatever the condition may be.
SELECT SUBSTR('58LF-3-B-2-2-2-(61448)',instr('58LF-3-B-2-2-2-(61448)','(',-1)+1,LENGTH('58LF-3-B-2-2-2-(61448)')-instr('58LF-3-B-2-2-2-(61448)','(',-1)-1)
FROM dual;
SELECT SUBSTR('58LF-3-B-2-2-2 (SCRAP)-(61448)',instr('58LF-3-B-2-2-2 (SCRAP)-(61448)','(',-1)+1,LENGTH('58LF-3-B-2-2-2 (SCRAP)-(61448)')-instr('58LF-3-B-2-2-2 (SCRAP)-(61448)','(',-1)-1)
FROM dual;
Output
==================================
61448
==================================

Count with multiple where conditions in MS Access

I have the query below;
Select count(*) as poor
from records where deviceId='00019' and type='Poor' and timestamp between #14-Sep-2012 01:01:01# and #24-Sep-2012 01:01:01#
table is like;
id, deviceId, type, timestamp
data is like;
1, '00019', 'Poor', '19-Sep-2012 01:01:01'
2, '00019', 'Poor', '19-Sep-2012 01:01:01'
3, '00019', 'Poor', '19-Sep-2012 01:01:01'
4, '00019', 'Poor', '19-Sep-2012 01:01:01'
I am trying to count the devices with a specific type.
Please help. Access always returns wrong data: it is returning 1 while 00019 has 4 entries for Poor.
Type and timestamp are both reserved words, so enclose them in square brackets in your query like this: [type] and [timestamp]. I doubt those reserved words are the cause of your problem, but it's hard to predict exactly when reserved words will cause query problems, so just rule out this possibility by using the square brackets.
Beyond that, stored text values sometimes contain extra non-visible characters. Check the lengths of the stored text values to see whether any are longer than expected.
SELECT
Len(deviceId) AS LenOfDeviceId,
Len([type]) AS LenOfType,
Len([timestamp]) AS LenOfTimestamp
FROM records;
In comments you mentioned spaces (ASCII value 32) in your stored values. I had been thinking we were dealing with other non-printable/invisible characters. If you have one or more actual space characters at the beginning and/or end of a stored deviceId value, the Trim() function will discard them. So this query will give you different length numbers in the two columns:
SELECT
Len(deviceId) AS LenOfDeviceId,
Len(Trim(deviceId)) AS LenOfDeviceId_NoSpaces
FROM records;
If the stored values can also include spaces within the string (not just at the beginning and/or end), Trim() will not remove those. In that case, you could use the Replace() function to discard all the spaces. Note however a query which uses Replace() must be run from inside an Access application session --- you can't use it from Java code.
SELECT
Len(deviceId) AS LenOfDeviceId,
Len(Replace(deviceId, ' ', '')) AS LenOfDeviceId_NoSpaces
FROM records;
If that query returns the same length numbers in both columns, then we are not dealing with actual space characters (ASCII value 32) ... but some other type of character(s) which look "space-like".
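The three length checks above can be pictured with a short Python sketch. The function length_report is a hypothetical helper, and the sample values are made up for illustration; strip(' ') and replace(' ', '') stand in for Access's Trim() and Replace() and remove only ordinary spaces.

```python
def length_report(value):
    """Return (raw length, length without outer spaces, length without any spaces)."""
    return (len(value), len(value.strip(' ')), len(value.replace(' ', '')))

samples = {
    'trailing space': '00019 ',
    'inner space': '00 019',
    'non-breaking space': '00019\u00a0',   # looks like a space but is not ASCII 32
}
for label, value in samples.items():
    print(label, length_report(value))
```

When all three numbers are equal but larger than expected, as in the last sample, the extra character is something other than a plain space.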
If you want to count devices with specific type irrespective of deviceids then use this:
Select count(*) as excellent
from records where type='Poor'
If you want to count devices with specific deviceid irrespective of types then use this:
Select count(*) as excellent
from records where deviceId='00019'

Reading a part of a alpha numeric string in SQL

I have a table with one column " otname "
table1.otname contains multiple rows of alpha-numeric string resembling the following data sample:
11.10.32.12.U.A.F.3.2.21.249.1
2001.1.1003.8281.A.LE.P.P
2010.1.1003.8261.A.LE.B.B
I want to read the fourth number in every string ( part of the string in bold ) and write a query in Oracle 10g to read its description stored in another table. My dilemma is writing the first part of the query, i.e. choosing the fourth number of every string in a table.
My second query will be something like this:
select description_text from table2 where sncode = 8281 -- fourth part of the data sample in every string
Many thanks.
novice
Works with 9i+:
WITH portion AS (
SELECT SUBSTR(t.otname,
INSTR(t.otname, '.', 1, 3) + 1,
INSTR(t.otname, '.', 1, 4) - INSTR(t.otname, '.', 1, 3) - 1) AS sncode
FROM table1 t)
SELECT t.description_text
FROM table2 t
JOIN portion p ON p.sncode = t.sncode
The use of SUBSTR should be obvious; INSTR is used to find the location of the period (.), starting at the first character in the string (parameter value 1), on its 3rd and 4th appearance in the string. Because the third argument of SUBSTR is a length rather than an end position, the position of the 3rd period is subtracted from the position of the 4th, minus one more so that the trailing period is excluded. Test this first to be sure you're getting the right values:
SELECT SUBSTR(t.otname,
INSTR(t.otname, '.', 1, 3) + 1,
INSTR(t.otname, '.', 1, 4) - INSTR(t.otname, '.', 1, 3) - 1) AS sncode
FROM table1 t
I used subquery factoring so the substring happens before you join to the second table. It can be done as a subquery, but subquery factoring is faster.
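The position arithmetic is easy to get wrong, because SUBSTR takes a length as its third argument rather than an end position. The Python sketch below emulates 1-based INSTR and SUBSTR to show the difference; instr, substr and fourth_field are hypothetical helpers, not Oracle functions.

```python
def instr(s, sub, occurrence):
    """1-based position of the n-th occurrence of sub in s, or 0 if there is none."""
    pos = -1
    for _ in range(occurrence):
        pos = s.find(sub, pos + 1)
        if pos == -1:
            return 0
    return pos + 1

def substr(s, start, length):
    """Substring of the given length starting at a 1-based position (start >= 1)."""
    return s[start - 1:start - 1 + length]

def fourth_field(otname):
    """Text between the 3rd and 4th period."""
    third = instr(otname, '.', 3)
    fourth = instr(otname, '.', 4)
    return substr(otname, third + 1, fourth - third - 1)

name = '2001.1.1003.8281.A.LE.P.P'
print(fourth_field(name))
# Passing the 4th period's position as the length overshoots the field:
print(substr(name, instr(name, '.', 3) + 1, instr(name, '.', 4)))
```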
Newer versions of oracle (including 10g) have various regular expression functions. So you can do something like this:
where sncode = to_number(regexp_replace(otname, '^(\d+\.\d+\.\d+\.(\d+))?.+$', '\2'))
This matches 3 sets of digits-followed-by-a-dot and a fourth grouped set of digits, followed by the rest of the string, and returns a string consisting of all that entirely replaced by the second group (the fourth set of digits).
Here's a complete query (if I understood your description of the two tables correctly):
select t2.description_text
from table1 t1, table2 t2
where t2.sncode = to_number(regexp_replace(t1.otname, '^(\d+\.\d+\.\d+\.(\d+))?.+$', '\2'))
Another slightly shorter alternative regex:
where t2.sncode = to_number(regexp_replace(t1.otname, '^((\d+\.){3}(\d+))?.+$', '\3'))
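The shorter regex can be exercised against the sample rows with a Python sketch. The name fourth_number is hypothetical, and the code assumes Python's re module handles the optional group the way Oracle does: when the string does not start with four dotted numbers, the group is skipped and the replacement is empty, which is returned here as None.

```python
import re

# Group 3 is the fourth run of digits; the outer group is optional so .+ can still match.
FOURTH = re.compile(r'^((\d+\.){3}(\d+))?.+$')

def fourth_number(otname):
    """Return the fourth dot-separated number as an int, or None if the prefix is not numeric."""
    digits = FOURTH.sub(r'\3', otname)
    return int(digits) if digits else None

for row in ('11.10.32.12.U.A.F.3.2.21.249.1',
            '2001.1.1003.8281.A.LE.P.P',
            '2010.1.1003.8261.A.LE.B.B'):
    print(row, fourth_number(row))
```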