How to select the most matching subString to another String - sql

Let's say the full string is
The following example examines the string, looking for the first substring bounded by comas
and the substring is
substing bounded
Is there any way to check, using SQL, whether the full string contains a substring that matches at least 90%,
like "substing bounded" and "substring bounded" in my example?
The substring may consist of several words, so I can't simply split the full string into words.

First transform your text into a table of words. You'll find a lot of posts on this topic on SO, e.g. here
You'll have to adjust the list of delimiter characters to extract the words only.
This is a sample query
with t1 as (select 1 rn, 'The following example examines the string, looking for the first substring bounded by comas' col from dual ),
t2 as (select rownum colnum from dual connect by level < 16 /* (max) number of words */),
t3 as (select t1.rn, t2.colnum, rtrim(ltrim(regexp_substr(t1.col,'[^ ,]+', 1, t2.colnum))) col from t1, t2
where regexp_substr(t1.col, '[^ ,]+', 1, t2.colnum) is not null)
select * from t3;
COL
----------
The
following
example
examines
...
In the next step, use the Levenshtein distance to get the closest word.
with t1 as (select 1 rn, 'The following example examines the string, looking for the first substring bounded by comas' col from dual ),
t2 as (select rownum colnum from dual connect by level < 16 /* (max) number of words */),
t3 as (select t1.rn, t2.colnum, rtrim(ltrim(regexp_substr(t1.col,'[^ ,]+', 1, t2.colnum))) col from t1, t2
where regexp_substr(t1.col, '[^ ,]+', 1, t2.colnum) is not null)
select col, str, UTL_MATCH.EDIT_DISTANCE(col, str) distance
from t3
cross join (select 'commas' str from dual)
order by 3;
COL STR DISTANCE
---------- ------ ----------
comas commas 1
for commas 5
examines commas 6
...
Check the definition of the Levenshtein distance and define a threshold on the distance to get your candidate words.
To match independently of word boundaries, simply scan through your input and take every substring whose length is that of your match string, adjusted for the expected difference, e.g. by adding some 10%.
You may limit the candidates by keeping only those substrings that start on a word boundary. The rest is the same distance calculation.
with txt as (select 'The following example examines the string, looking for the first substring bounded by comas' txt from dual),
str as (select 'substing bounded' str from dual),
t1 as (select substr(txt, rownum, (select length(str) * 1.1 from str)) substr, /* add 10% length for the match */
(select str from str) str
from txt connect by level < (select length(txt) from txt) - (select length(str) from str))
select SUBSTR, STR,
UTL_MATCH.EDIT_DISTANCE(SUBSTR, STR) distance
from t1
order by 3;
SUBSTR STR DISTANCE
-------------------- ---------------- ----------
substring bounded substing bounded 1
ubstring bounded substing bounded 3
substring bounde substing bounded 3
t substring bound substing bounded 5
...
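To express the 90% requirement from the question directly, UTL_MATCH also provides EDIT_DISTANCE_SIMILARITY, which returns a normalized similarity between 0 and 100. The following is a minimal, untested sketch that reuses the sliding-window idea above; the alias sub and the cut-off of 90 are assumptions of mine, not part of the original answer.

```sql
-- Sketch: keep only the windows that are at least 90% similar to the search string.
with txt as (select 'The following example examines the string, looking for the first substring bounded by comas' txt from dual),
str as (select 'substing bounded' str from dual),
t1 as (select substr(txt.txt, level, length(str.str) * 1.1) sub, /* SUBSTR truncates the length to an integer */
              str.str
       from txt cross join str
       connect by level <= length(txt.txt) - length(str.str) + 1)
select sub, str,
       utl_match.edit_distance_similarity(sub, str) similarity
from t1
where utl_match.edit_distance_similarity(sub, str) >= 90 /* assumed threshold */
order by similarity desc;
```

The window "substring bounded" differs from the search string by a single edit, so it should pass the filter, while the shifted windows should fall below it.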

Experiment with the SOUNDEX function.
I haven't tested this but this might help you on your way:
WITH strings AS (
select regexp_substr('The following example examines the string, looking for the first substring bounded by comas','[^ ]+', 1, level) ss
from dual
connect by regexp_substr('The following example examines the string, looking for the first substring bounded by comas', '[^ ]+', 1, level) is not null
)
SELECT ss
FROM strings
WHERE SOUNDEX(ss) = SOUNDEX( 'commas' ) ;
The REGEXP_SUBSTR with CONNECT BY splits the long string into words (by space); amend the delimiter list as required to include punctuation marks etc.
Here we are relying on the built-in SOUNDEX matching our expectations.

Related

SQL: using regexp_substr or regexp_extract, looking for the regex pattern that will only return the string between one character and a space

The row I am trying to parse from is a series of string values separated only by spaces. Sample below:
TX:123 SP:XapZNsyeS INST:456123
I need to use either regexp_substr or regexp_extract to return only values for the string that appears after "TX:" or "SP:", etc. So essentially an expression that only captures the string after a string (e.g. "TX:") and before a space (" ").
Here's one way to split on 2 delimiters. This works on Oracle 12c as you included the Oracle regexp-substr tag. Using a with statement, first set up the original data, then split on a space or the end of the line, then break into name-value pairs.
WITH tbl_original_data(ID, str) AS (
SELECT 1, 'TX:123 SP:XapZNsyeS INST:456123' FROM dual UNION ALL
SELECT 2, 'MI:321 SP:MfeKLgkrJ INST:654321' FROM dual
),
tbl_split_on_space(ID, ELEMENT) AS (
SELECT ID,
REGEXP_SUBSTR(str, '(.*?)( |$)', 1, LEVEL, NULL, 1)
FROM tbl_original_data
CONNECT BY REGEXP_SUBSTR(str, '(.*?)( |$)', 1, LEVEL) IS NOT NULL
AND PRIOR ID = ID
AND PRIOR SYS_GUID() IS NOT NULL
)
--SELECT * FROM tbl_split_on_space;
SELECT ID,
REGEXP_REPLACE(ELEMENT, '^(.*):.*', '\1') NAME,
REGEXP_REPLACE(ELEMENT, '.*:(.*)$', '\1') VALUE
FROM tbl_split_on_space;
ID NAME VALUE
---------- ---------- ----------
1 TX 123
1 SP XapZNsyeS
1 INST 456123
2 MI 321
2 SP MfeKLgkrJ
2 INST 654321
6 rows selected.
EDIT: Realizing this answer is a little more than was asked for, here's a simplified answer to return one element. Don't forget to allow for either a space or the end of the line as the terminator, in case your element is at the end of the line.
WITH tbl_original_data(ID, str) AS (
SELECT 1, 'TX:123 SP:XapZNsyeS INST:456123' FROM dual
)
SELECT REGEXP_SUBSTR(str, '.*?TX:(.*?)( |$)', 1, 1, NULL, 1) TX_VALUE
FROM tbl_original_data;
TX_VALUE
--------
123
1 row selected.

How to select the list of words containing a particular substring as part of a SQL query (oracle)?

I'm trying to return the list of "words" (separated by spaces) containing a certain substring within a string as part of an Oracle Sql query. Would like to return the result as a comma separated list. Separate rows for each match would also work.
Example String in [text_col] field:
some words 123-asdf-789A and also this one 456-asdf-555A more words etc.
Desired result: 123-asdf-789A, 456-asdf-555A
This is what I have so far but it only returns the first result and the fact that it's two separate regular expressions makes it difficult to concatenate all matches as I would like to do.
CONCAT(REGEXP_SUBSTR(text_col, ''(([^[:space:]]+)\asdf)'', 1, 1, ''i'', 1),
REGEXP_SUBSTR(text_col, ''\asdf([^[:space:]]+)'', 1, 1, ''i'', 1))
You can use some regexp functions together as :
with tab(str) as
(
select 'some words 123-asdf-789A and also this one 456-asdf-555A more words etc' from dual
), t as
(
select regexp_substr(str,'[^[:space:]]+',1,level) as str, level as lvl
from tab
connect by level <= regexp_count(str,'[^[:space:]]+')
)
select listagg(str,',') within group (order by lvl) as "Result"
from t
where regexp_like(str,'-');
Result
---------------------------------
123-asdf-789A,456-asdf-555A
Demo
First split by spaces (through the [:space:] POSIX class), keep the words containing a dash character, and finally concatenate them with the listagg() function.
Use a recursive sub-query factoring clause and iterate through all the matches concatenating the string as you go:
Oracle Setup:
CREATE TABLE test_data ( value ) AS
SELECT 'some words 123-asdf-789A and also this one 456-asdf-555A more words etc.' FROM DUAL UNION ALL
SELECT 'some words without the expected sub-string' FROM DUAL UNION ALL
SELECT 'asdf asdf-123 456-asdf 78-asdf-90' FROM DUAL
Query:
WITH matches ( value, idx, cnt, match ) AS (
SELECT value,
0,
REGEXP_COUNT( value, '\S*asdf\S*' ),
CAST( NULL AS VARCHAR2(4000) )
FROM test_data
UNION ALL
SELECT value,
idx + 1,
cnt,
CASE idx WHEN 0 THEN '' ELSE match || ' ' END
|| REGEXP_SUBSTR( value, '\S*asdf\S*', 1, idx + 1 )
FROM matches
WHERE idx < cnt
)
SELECT value, match
FROM matches
WHERE idx = cnt;
Output:
VALUE | MATCH
:----------------------------------------------------------------------- | :--------------------------------
some words without the expected sub-string | null
some words 123-asdf-789A and also this one 456-asdf-555A more words etc. | 123-asdf-789A 456-asdf-555A
asdf asdf-123 456-asdf 78-asdf-90 | asdf asdf-123 456-asdf 78-asdf-90
db<>fiddle here

Oracle String Conversion

I need help converting the following string into a required format. I will have several values as below. Is there an easy way to do this using REGEXP or something better?
Current format coming from column A
Region[Envionment Lead|||OTC|||06340|||List Program|||TX|||Z3452|||Souther Region 05|||M7894|||California Divison|||Beginning]
Region[Coding Analyst|||BA|||04561|||Water Bridge|||CA|||M8459|||West Region 09|||K04956|||East Division|||Supreme]
Required Format of column A
Region[actingname=Envionment Lead,commonid=OTC,insturmentid=06340,commonname=List Program]
Region[actingname=Coding Analyst,commonid=BA,insturmentid=04561,commonname=Water Bridge]
revised data
**Column data**
Region[Coding Analyst|||BA|||reg pro|||04561|||08/16/2011|||Board member|||AZ|||06340|||Whiter Bridge|||CA|||M0673|||West Region 09|||K04956|||East Division|||Supreme]
**required Data**
{actingname=06340, actingid=M0673, insturmentid=BA, insturmentname=Coding Analyst, commonname=West Region 09, stdate=08/16/2011, linnumber=04561, linstate=CA, linname=Supreme}
The issue is getting the 10th, 11th, 12th and 15th positions of the string. I can get anything below the 10th position, but not position 10 or higher. Can you please tell me what I am missing here?
'{actingname=\8,actingid=\11,insturmentid=\2,insturmentname=\1,commonname=\12, stdate=\5,linnumber=4,linstate=10,linname=15}'--Here the 10th, 11th, 12th and 15th positions are not being fetched
I used REGEXP_REPLACE
SELECT REGEXP_REPLACE(
'Region[Envionment Lead|||OTC|||06340|||List Program|||TX|||Z3452|||Souther Region 05|||M7894|||California Divison|||Beginning]',
'^Region\[([[:alpha:][:space:][:digit:]]*)\|\|\|([[:alpha:]]*)\|\|\|([[:digit:]]*)\|\|\|([[:alpha:][:space:][:digit:]]*).*',
'Region[actingname=\1,commonid=\2,instrumentid=\3,commonname=\4]') as replaced
FROM dual
or like an update it would be
UPDATE table1
SET col1 = REGEXP_REPLACE(
col1,
'^Region\[([[:alpha:][:space:][:digit:]]*)\|\|\|([[:alpha:]]*)\|\|\|([[:digit:]]*)\|\|\|([[:alpha:][:space:][:digit:]]*).*',
'Region[actingname=\1,commonid=\2,instrumentid=\3,commonname=\4]')
You can use regexp_substr and listagg consecutively
with t1(str1) as
(
select 'Region[Coding Analyst|||BA|||04561|||Water Bridge]' from dual
), t2(str2) as
(
select 'actingname,commonid,insturmentid,commonname' from dual
), t3 as
(
select regexp_substr(str1, '[^|||]+', 1, level) str1,
regexp_substr(str2, '[^,]+', 1, level)||'=' str2,
level as lvl
from t1
cross join t2
connect by level <= regexp_count(str1, '[^|||]+')
), t4 as
(
select case when lvl = 1 then
replace(str1,'[','['||str2)
else
str2||str1
end as str, lvl
from t3
)
select listagg(str,',') within group (order by lvl) as "Result String" from t4;
Result String
----------------------------------------------------------------------------------------
Region[actingname=Coding Analyst,commonid=BA,insturmentid=04561,commonname=Water Bridge]
P.S. I used the second string as the sample and kept only its first four substrings separated by triple pipes, because there are only four labels (each followed by an equals sign) to pair them with.
Demo
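On the revised question about positions 10 and above: Oracle documents REGEXP_REPLACE backreferences only in the form \1 through \9, so \10, \11 and so on are not available in the replacement string. One possible workaround, sketched here untested, is to pick each field by its position with the occurrence argument of REGEXP_SUBSTR. The CTE names t and body are mine, and the sketch assumes that no field is empty and that no field contains a '|' character.

```sql
-- Sketch: address fields by position instead of by backreference.
with t (str) as (
  select 'Region[Coding Analyst|||BA|||reg pro|||04561|||08/16/2011|||Board member|||AZ|||06340|||Whiter Bridge|||CA|||M0673|||West Region 09|||K04956|||East Division|||Supreme]' from dual
), body (s) as (
  -- strip the leading 'Region[' and the trailing ']'
  select regexp_replace(str, '^Region\[(.*)\]$', '\1') from t
)
select '{actingname='       || regexp_substr(s, '[^|]+', 1, 8)
    || ', actingid='        || regexp_substr(s, '[^|]+', 1, 11)
    || ', insturmentid='    || regexp_substr(s, '[^|]+', 1, 2)
    || ', insturmentname='  || regexp_substr(s, '[^|]+', 1, 1)
    || ', commonname='      || regexp_substr(s, '[^|]+', 1, 12)
    || ', stdate='          || regexp_substr(s, '[^|]+', 1, 5)
    || ', linnumber='       || regexp_substr(s, '[^|]+', 1, 4)
    || ', linstate='        || regexp_substr(s, '[^|]+', 1, 10)
    || ', linname='         || regexp_substr(s, '[^|]+', 1, 15)
    || '}' as "Result String"
from body;
```

The position numbers follow the mapping given in the question. If empty fields can occur, the '[^|]+' pattern will skip them and shift the positions, so a different tokenizer would be needed.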
this will work:
select substr(regexp_replace(regexp_replace(regexp_replace
(regexp_replace(regexp_replace("col1",'\[','[actingname='),
'\|\|\|',',commonid=',1,1,'i'),
'\|\|\|',',insturmentid=',1,1,'i'),
'\|\|\|',',commonname=',1,1,'i'),
'\|',']',1,1,'i'),
1,regexp_instr(regexp_replace(regexp_replace(regexp_replace
(regexp_replace(regexp_replace("col1",'\[','[actingname='),
'\|\|\|',',commonid=',1,1,'i'),
'\|\|\|',',insturmentid=',1,1,'i'),
'\|\|\|',',commonname=',1,1,'i'),
'\|',']',1,1,'i'),'\]')-1 )||']'
from Table1;
check:
http://sqlfiddle.com/#!4/3ddfa0/11
thanks!!!!!!

Regexp_replace processing result

I have a string with groups of numbers, and I'd like to make each group a constant length. Right now I use two regexp_replace calls: the first prepends 10 zeros to each group and the second cuts each group down to its last 10 digits:
with s(txt) as ( select '1030123:12031:1341' from dual)
select regexp_replace(
regexp_replace(txt, '(\d+)','0000000000\1')
,'\d+(\d{10})','\1') from s ;
But I'd like to use only one regex, something like
regexp_replace(txt, '(\d+)',lpad('\1',10,'0'))
But it doesn't work, because lpad is executed before the regexp. Do you have any ideas?
With a slightly different approach, you can try the following:
with s(id, txt) as
(
select rownum, txt
from (
select '1030123:12031:1341' as txt from dual union all
select '1234:0123456789:1341' from dual
)
)
SELECT listagg(lpad(regexp_substr(s.txt, '[^:]+', 1, lines.column_value), 10, '0'), ':') within group (order by column_value) txt
FROM s,
TABLE (CAST (MULTISET
(SELECT LEVEL FROM dual CONNECT BY instr(s.txt, ':', 1, LEVEL - 1) > 0
) AS sys.odciNumberList )) lines
group by id
TXT
-----------------------------------
0001030123:0000012031:0000001341
0000001234:0123456789:0000001341
This uses CONNECT BY to split every string on the separator ':', then uses LPAD to pad each value to 10 characters, and finally aggregates the padded values back into one string per row.
This works for non-empty sequences (e.g. 123::456)
with s(txt) as ( select '1030123:12031:1341' from dual)
select regexp_replace (regexp_replace (txt,'(\d+)',lpad('0',10,'0') || '\1'),'0*(\d{10})','\1')
from s
;

Comma-delimited fields in a csv file in plsql

I have
WHILE INSTR (l_buffer, ',', 1, l_col_no) != 0
which checks whether the l_buffer is comma delimited and enters the loop.
Now I have a file with values
CandidateNumber,rnumber,title,OrganizationCode,OrganizationName,JobCode,JobName
10223,1600003B,Admin Officer,00000004,"Org Land, Inc.",ORGA03,ORGA03 HR & Admin
In this file it is considering "Org Land, Inc." as two words because of , in between. Is there a way to treat this as one by using Instr or anything?
Horrible idea. If you are forced to use character-delimited strings, the least you should be able to require is that the delimiter be a character that is all but guaranteed not to appear in regular field values.
The problem you raised can be solved. I show below a solution - probably not close to the most efficient, but at least it shouldn't be difficult to follow the logic. I intentionally chose an example (the fifth string) to demonstrate how it can fail. I assumed any commas between a pair of double-quotes (an opening one and a closing one) should become "invisible" - treated as if they were not delimiters, but part of the field value. That breaks if a double-quote is used in a way different from the "usual" - see my sample string #5. It will also break on any other "natural" uses of comma (where they are not meant as a delimiter) - for example, what if you have a field with a value of $1,000.00? Now you need to "escape" that comma too. One could probably come up with at least ten more similar situations - are you going to code around all of them?
Now, for my own learning and practice, I pretended the ONLY way a comma may need to be "escaped" (to become invisible to the tokenization process) is if it is enclosed between an opening and a closing double-quote (determined simply by ordering: a double-quote with an odd count from the beginning of the string is an opening one, and a double-quote with an even count is a closing one). Here is the solution; test strings at the top, including a few to test proper treatment of nulls, and the output following immediately after.
Good luck!
with test_strings (r, s) as (
select 1, 'abdc, ronfn 0003, "ABC, Inc.", 9939' from dual union all
select 2, 'New Delhi' from dual union all
select 3, null from dual union all
select 4, ',' from dual union all
select 5, 'If needed, use double quote("), OK?' from dual
),
t (r, s) as (
select r, ',' || s || ',' from test_strings
),
ct (r, nc, nq) as (
select r, regexp_count(s, ','), regexp_count(s, '"') from t
),
c (r, pos) as (
select t.r, instr(t.s, ',', 1, level) from t join ct on t.r = ct.r
connect by level <= ct.nc and t.r = prior t.r and prior sys_guid() is not null
),
q (r, pos) as (
select t.r, instr(t.s, '"', 1, level) from t join ct on t.r = ct.r
connect by level <= ct.nq and t.r = prior t.r and prior sys_guid() is not null
),
p (r, pos_from, pos_to, rn) as (
select r, pos, lead(pos) over (partition by r order by pos),
row_number() over (partition by r order by pos) from c
where mod((select count(1) from q where q.r = c.r and q.pos != 0
and q.pos < c.pos), 2) = 0
)
select p.r as string_number, p.rn as token_number,
substr(t.s, p.pos_from + 1, p.pos_to - p.pos_from - 1) as token
from t join p on t.r = p.r
where p.pos_to is not null
order by string_number, token_number
;
Results:
STRING_NUMBER TOKEN_NUMBER TOKEN
------------- ------------ --------------------
1 1 abdc
1 2 ronfn 0003
1 3 "ABC, Inc."
1 4 9939
2 1 New Delhi
3 1
4 1
4 2
5 1 If needed
9 rows selected.
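For comparison, a shorter regular-expression tokenizer is sometimes used for the same odd/even quoting assumption. This is a different technique from the query above and only an untested sketch: it treats either a double-quoted run or a run of non-comma characters as one token. It assumes that no field is empty, that a quoted field starts immediately after the comma, and that quotes are never escaped inside a field. The name csv_line is mine.

```sql
-- Sketch: one token is either "..." (commas allowed inside) or a run of non-commas.
-- In SQL*Plus you may need SET DEFINE OFF because of the '&' in the sample data.
with csv_line (s) as (
  select '10223,1600003B,Admin Officer,00000004,"Org Land, Inc.",ORGA03,ORGA03 HR & Admin' from dual
)
select level as token_number,
       regexp_substr(s, '"[^"]*"|[^,]+', 1, level) as token
from csv_line
connect by level <= regexp_count(s, '"[^"]*"|[^,]+');
```

Under those assumptions "Org Land, Inc." should come back as a single token, still wrapped in its double quotes. Empty fields are silently skipped, which is the main reason to prefer the position-based query above when nulls matter.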
Use Notepad++ and change all commas to ';'. Before that, use a regular expression to change all commas between double quotes to, let's say, '#'. Then use Ctrl+H to replace ',' with ';' and finally '#' with ','.