Querying substrings against a list of values - sql

I'm reading from a dataset that I unfortunately don't have the access to modify. It has concatenated strings of values, and I want to select records for which any of those substrings (as split by a given character) matches any of the values in a specific list. I'll be passing the queries in via Python, so it won't be compared against a static list.
For example, the table looks like:
CrappyColumn
-----------
1;2
4
1
2;1
1;3
2
And I might want to return anything that has 2 or 4 in it. So, my result should be:
1;2
4
2
2;1
I have played with regexp_substr and gotten something that actually works; however, it just runs indefinitely (as much as 10 minutes before I give up) when I run it on the full dataset (which only includes about three thousand records with values that are often a couple hundred characters long). I need something that works in a reasonable amount of time for repeated execution.
I realize that--even with a variable comparison list--I could just write my Python code to parse the list and construct multiple LIKE statements, but that seems inefficient, and I assume that there is a better way.
And here's what I've done that takes too long:
SELECT DISTINCT CrappyColumn
FROM
(SELECT DISTINCT CrappyColumn, regexp_substr(CrappyColumn, '[^;]+', 1, LEVEL) as UGH
FROM CrappyTable
CONNECT BY regexp_substr(CrappyColumn, '[^;]+', 1, LEVEL) IS NOT NULL)
WHERE UGH IN ('2', '4')
Is there a better, faster, cleaner way to accomplish this?
EDIT - RESOLUTION:
Thanks to vkp's help, here is what I implemented:
regexp_like(SITE_ID, '^(2|4)(:)|(:)(2|4)(:)|(:)(2|4)$|^(2|4)$')
I modified it for my final product, so that it can handle strings of more than one character--by changing [2|4] to (2|4). This works in cases of searching for numbers that aren't single-digit.

You can use like:
select t.*
from crappytable t
where ';' || crappycolumn || ';' like '%;2;%' or
';' || crappycolumn || ';' like '%;4;%';
You seem to know that storing lists of values in a single column is a bad idea, so I'll spare the harangue ;)
EDIT:
If you don't like like, you can use regexp_like() like this:
where regexp_like(';' || crappycolumn || ';', ';2;|;4;')

A simpler method would be to use regexp_like to check if the list has 2 or 4 in it.
select *
from tablename
where regexp_like(crappycolumn,'^[2|4][^0-9]|[^0-9][2|4][^0-9]|[^0-9][2|4]$|^[2|4]$')
^[2|4][^0-9] - Starts with 2 or 4 not followed by a digit.
[^0-9][2|4][^0-9] - 2 or 4 not succeeded or preceded by a digit.
[^0-9][2|4]$ - Ends with 2 or 4 not preceded by a digit.
^[2|4]$ - 2 or 4 is the only character in the string.

Another form of regexp_like(). This regex looks for 2 or 4 only when proceeded by the beginning of the line or a semi-colon and when followed by a semi-colon or the end of the line:
SQL> with crappy_tbl(crappy_col) as (
select '1;2' from dual union
select '4' from dual union
select '1' from dual union
select '2;1' from dual union
select '1;3' from dual union
select '2' from dual union
select '22;;44;' from dual
)
select crappy_col
from crappy_tbl
where regexp_like(crappy_col, '(^|;)(2|4)(;|$)');
CRAPPY_
-------
1;2
2
2;1
4
SQL>

Related

Search a pattern from comma seperated parameters in plsql

My Parameter to a procedure lv_ip := 'MNS-GC%|CS,MIB-TE%|DC'
My cursor query should search for records that start with 'MNS-GC%' and 'MIB-TE%'.
Select id, date,program,program_start_date
from table_1
where program like 'MNS-GC%' or program LIKE 'MIB-TE%'
Please suggest ways to read it from the parameter and an alternative to LIKE.
Since you mention you want to preserve what's on the right side of the pipe, and want to be able to process parameters dynamically, here's a way to parse multi-delimited data that could give you some ideas using a CTE.
The table called 'tbl' just sets up your original data. tbl_comma contains that data split on the comma. The final query splits that data into name/value pairs.
Hopefully this will help give you some ideas even though it's not the exact answer you are looking for.
COLUMN ID FORMAT a3
COLUMN PROGRAM FORMAT a10
COLUMN part2 FORMAT a6
-- Original data
WITH tbl(ID, DATA) AS (
SELECT 1, 'MNS-GC%|CS,MIB-TE%|DC' FROM dual UNION ALL
SELECT 2, 'MNS-GC%|CS,MIB-TE%|DC,MIB-TA%|AB,MIB-TB%|BC' FROM dual
),
tbl_comma(ID, CASE) AS (
SELECT ID,
REGEXP_SUBSTR(DATA, '(.*?)(,|$)', 1, LEVEL, NULL, 1) CASE
FROM tbl
CONNECT BY REGEXP_SUBSTR(DATA, '(.*?)(,|$)', 1, LEVEL) IS NOT NULL
AND PRIOR ID = ID
AND PRIOR SYS_GUID() IS NOT NULL
)
--SELECT * FROM tbl_comma;
-- Parse into name/value pairs
SELECT ID,
REGEXP_REPLACE(CASE, '^(.*)\|.*', '\1') PROGRAM,
REGEXP_REPLACE(CASE, '.*\|(.*)$', '\1') PART2
FROM tbl_comma;
ID PROGRAM PART2
--- ---------- ------
1 MNS-GC% CS
1 MIB-TE% DC
2 MNS-GC% CS
2 MIB-TE% DC
2 MIB-TA% AB
2 MIB-TB% BC
6 rows selected.
If you're stuck with that input and the structure is fixed, with each comma-separated element having a pipe-delimited value, you could possibly convert that string to a regular expression pattern, and then use regexp_like to pattern-match:
select id, date, program, program_start_date
from table_1
where regexp_like(
program,
'^(' || rtrim(regexp_replace(lv_ip, '%\|.*?(,|$)', '|'), '|') || ')')
With your example parameter, the
'^(' || rtrim(regexp_replace(lv_ip, '%\|.*?(,|$)', '|'), '|') || ')'
would generate the pattern
^(MNS-GC|MIB-TE)
i.e. looking for either of those strings at the start of the program value.
db<>fiddle
Alternatively you could split the input up yourself, with instr and substr, and - since the number of elements may vary - create a dynamic query using them. That might be faster than using regular expression, but might be harder to maintain.
What would the regexp be to match CS|DC
It depends how you plan to use those values, but if you're looking for some column exactly matching one of them, then you could do something similar with:
'^(' || ltrim(regexp_replace(l_ip, '(^|,)[^|]*', null), '|') || ')$'
which with your input string would generate the pattern
^(CS|DC)$
But if you need to match the corresponding values as pairs - so the equivalent of something like:
where (program like 'MNS-GC%' and some_col = 'CS')
or (program like 'MIB-TE%' and some_col = 'DC')
... then you'd need to extract them as pairs, as #Gary_W has shown.

Remove 2 characters in oracle sql

I have a column that contains 12 digits but user wants only to generate a 10 digits.
I tried the trim, ltrim function but nothing work. Below are the queries I tried.
ltrim('10', 'column_name')
ltrim('10', column_name)
ltrim(10, column_name)
For example I have a column that contains a 12 digit number
100000000123
100000000456
100000000789
and the expected result I want is
0000000123
0000000456
0000000789
To extract the last 10 characters of an input string, regardless of how long the string is (so this will work if some inputs have 10 characters, some 12, and some 15 characters), you could use negative starting position in substr:
substr(column_name, -10)
For example:
with
my_table(column_name) as (
select '0123401234' from dual union all
select '0001112223334' from dual union all
select '12345' from dual union all
select '012345012345' from dual
)
select column_name, substr(column_name, -10) as substr
from my_table;
COLUMN_NAME SUBSTR
------------- ----------
0123401234 0123401234
0001112223334 1112223334
12345
012345012345 2345012345
Note in particular the third example. The input has only 5 digits, so obviously you can't get a 10 digit number from it. The result is NULL (undefined).
Note also that if you use something like substr(column_name, 3) you will get just '345' in that case; most likely not the desired result.
try to use SUBSTR(column_name, 2)

PLSQL - order by string with REGEX

I'm trying to sort the result set of a query where the row is VARCHAR2.
I've tried using just:
ORDER BY
UPPER(SERVER_NAME) ASC
But I get inconstant results, for example:
120157
777555
AKO
a20064
Elilikes
kagan
1200165_DAVID
As you can see, 1200165_DAVID appears last, in addition, I tried using a regular expression like so:
ORDER BY
(CASE WHEN REGEXP_LIKE(UPPER(SERVER_NAME), '^[0-9]+$') THEN 1 ELSE 2 END) ASC,
UPPER(SERVER_NAME) ASC
But I get the same results, I would like to get the following ordring is possible:
120157
1200165_DAVID
777555
a20064
AKO
Elilikes
kagan
Please advise.
Three things.
First: Why do you want 1200165_DAVID to appear AFTER 120157? It should appear before it, if you order alphabetically.
Second: Running your query on your test data, I get the correct result. So I am inclined to believe either your query is different from what you reported, or there is some other error somewhere.
Third: You may have who-knows-what characters in your data. Selecting str and dump(str) side by side (or whatever the name of your expression; I like to use str in my test data) to see what characters are in each string. Look especially at those that seem to be sorted "out of order".
with
inputs ( str ) as (
select '120157' from dual union all
select '777555' from dual union all
select 'AKO' from dual union all
select 'a20064' from dual union all
select 'Elilikes' from dual union all
select 'kagan' from dual union all
select '1200165_DAVID' from dual
)
select str from inputs
order by upper(str);
STR
-------------
1200165_DAVID
120157
777555
a20064
AKO
Elilikes
kagan
7 rows selected.
This is too long for a comment.
Your data would appear to not be all characters that you recognize. In particular, the first character is suspicious.
I would suggest that you run a query like this:
select ASCII(SUBSTR(server_name, 1, 1)) as first_char-ascii,
'|' || SUBSTR(server_name, 1, 1) || '|' as first_char,
COUNT(*), min(server_name), max(server_name)
from t
group by SUBSTR(server_name, 1, 1)
order by count(*) asc;
Then you will see what characters are actually at the beginning of the string. My guess is you will find at least one interesting character. You will then need to modify the data (or the query) to handle that.

to_number from char sql

I have to select only the IDs which have only even digits (an ID looks like: p19 ,p20 etc). That is, p20 is good (both 2 and 0 are even digits); p18 is not.
I thought to use substr to get each number from the IDs and then see if it's even .
select from profs
where to_number(substr(id_prof,2,2))%2=0 and to_number(substr(id_prof,3,2))%2=0;
IF you need all rows consist of 'p' in beginning and even digits on tail It should look like:
select *
from profs
where regexp_like (id_prof, '^p[24680]+$');
with
profs ( prof_id ) as (
select 'p18' from dual union all
select 'p24' from dual union all
select 'p53' from dual
)
-- End of test data; what is above this line is NOT part of the solution.
-- The solution (SQL query) begins here.
select *
from profs
where length(prof_id) = length(translate(prof_id, '013579', '0'));
PROF_ID
-------
p24
This solution should work faster than anything using regular expressions. All it does is to replace 0 with itself and DELETE all odd digits from the input string. (The '0' is included due to a strange but documented behavior of translate() - the third argument can't be empty). If the length of the input string doesn't change after the translation, that means the input string didn't have any odd digits.
where mod(to_number(regexp_replace(id_prof, '[^[:digit:]]', '')),2) = 0

Is it possible to query a comma separated column for a specific value?

I have (and don't own, so I can't change) a table with a layout similar to this.
ID | CATEGORIES
---------------
1 | c1
2 | c2,c3
3 | c3,c2
4 | c3
5 | c4,c8,c5,c100
I need to return the rows that contain a specific category id. I starting by writing the queries with LIKE statements, because the values can be anywhere in the string
SELECT id FROM table WHERE categories LIKE '%c2%';
Would return rows 2 and 3
SELECT id FROM table WHERE categories LIKE '%c3%' and categories LIKE '%c2%'; Would again get me rows 2 and 3, but not row 4
SELECT id FROM table WHERE categories LIKE '%c3%' or categories LIKE '%c2%'; Would again get me rows 2, 3, and 4
I don't like all the LIKE statements. I've found FIND_IN_SET() in the Oracle documentation but it doesn't seem to work in 10g. I get the following error:
ORA-00904: "FIND_IN_SET": invalid identifier
00904. 00000 - "%s: invalid identifier"
when running this query: SELECT id FROM table WHERE FIND_IN_SET('c2', categories); (example from the docs) or this query: SELECT id FROM table WHERE FIND_IN_SET('c2', categories) <> 0; (example from Google)
I would expect it to return rows 2 and 3.
Is there a better way to write these queries instead of using a ton of LIKE statements?
You can, using LIKE. You don't want to match for partial values, so you'll have to include the commas in your search. That also means that you'll have to provide an extra comma to search for values at the beginning or end of your text:
select
*
from
YourTable
where
',' || CommaSeparatedValueColumn || ',' LIKE '%,SearchValue,%'
But this query will be slow, as will all queries using LIKE, especially with a leading wildcard.
And there's always a risk. If there are spaces around the values, or values can contain commas themselves in which case they are surrounded by quotes (like in csv files), this query won't work and you'll have to add even more logic, slowing down your query even more.
A better solution would be to add a child table for these categories. Or rather even a separate table for the catagories, and a table that cross links them to YourTable.
You can write a PIPELINED table function which return a 1 column table. Each row is a value from the comma separated string. Use something like this to pop a string from the list and put it as a row into the table:
PIPE ROW(ltrim(rtrim(substr(l_list, 1, l_idx - 1),' '),' '));
Usage:
SELECT * FROM MyTable
WHERE 'c2' IN TABLE(Util_Pkg.split_string(categories));
See more here: Oracle docs
Yes and No...
"Yes":
Normalize the data (strongly recommended) - i.e. split the categorie column so that you have each categorie in a separate... then you can just query it in a normal faschion...
"No":
As long as you keep this "pseudo-structure" there will be several issues (performance and others) and you will have to do something similar to:
SELECT * FROM MyTable WHERE categories LIKE 'c2,%' OR categories = 'c2' OR categories LIKE '%,c2,%' OR categories LIKE '%,c2'
IF you absolutely must you could define a function which is named FIND_IN_SET like the following:
CREATE OR REPLACE Function FIND_IN_SET
( vSET IN varchar2, vToFind IN VARCHAR2 )
RETURN number
IS
rRESULT number;
BEGIN
rRESULT := -1;
SELECT COUNT(*) INTO rRESULT FROM DUAL WHERE vSET LIKE ( vToFine || ',%' ) OR vSET = vToFind OR vSET LIKE ('%,' || vToFind || ',%') OR vSET LIKE ('%,' || vToFind);
RETURN rRESULT;
END;
You can then use that function like:
SELECT * FROM MyTable WHERE FIND_IN_SET (categories, 'c2' ) > 0;
For the sake of future searchers, don't forget the regular expression way:
with tbl as (
select 1 ID, 'c1' CATEGORIES from dual
union
select 2 ID, 'c2,c3' CATEGORIES from dual
union
select 3 ID, 'c3,c2' CATEGORIES from dual
union
select 4 ID, 'c3' CATEGORIES from dual
union
select 5 ID, 'c4,c8,c5,c100' CATEGORIES from dual
)
select *
from tbl
where regexp_like(CATEGORIES, '(^|\W)c3(\W|$)');
ID CATEGORIES
---------- -------------
2 c2,c3
3 c3,c2
4 c3
This matches on a word boundary, so even if the comma was followed by a space it would still work. If you want to be more strict and match only where a comma separates values, replace the '\W' with a comma. At any rate, read the regular expression as:
match a group of either the beginning of the line or a word boundary, followed by the target search value, followed by a group of either a word boundary or the end of the line.
As long as the comma-delimited list is 512 characters or less, you can also use a regular expression in this instance (Oracle's regular expression functions, e.g., REGEXP_LIKE(), are limited to 512 characters):
SELECT id, categories
FROM mytable
WHERE REGEXP_LIKE('c2', '^(' || REPLACE(categories, ',', '|') || ')$', 'i');
In the above I'm replacing the commas with the regular expression alternation operator |. If your list of delimited values is already |-delimited, so much the better.