Find string match to Oracle table using regex - sql

I have an Oracle stored procedure on an Oracle 12c database that receives a company_name input. From that company_name, I need to find and flag Federal institutions. To accomplish that, I have a table (TBL_FED_KEY) with one column (KEY_1) of keywords. The table contains nearly 50 values like:
ARMY
FEDERAL
AIR FORCE
VETERANS
HOMELAND SECURITY
INDIAN HOSPITAL
WILL ROGERS
To give you an idea of the company_name string that could be passed through to the procedure, here are examples:
US Army - Munson Health Center
Federal Bureau of Prisons,BOP/DOJ-
Hickam Air Force Base Pharmacy
Minnesota Veterans Home Pharmacy
P.H.S. Indian Hospital
Will Rogers Health Center
What Oracle SQL can be used to match the incoming company_name against TBL_FED_KEY.KEY_1? I've tried multiple variations of REGEXP_INSTR but I can't seem to get anything to work 100%. Is REGEXP_INSTR even the best tool to accomplish this?
Thanks!

You could just use LIKE (here company_name is the procedure's input parameter):
select f.*
from TBL_FED_KEY f
where lower(company_name) like '%' || lower(f.KEY_1) || '%'

It seems you want a case-insensitive match against those strings. If so, use the REGEXP_LIKE() function with the case-insensitive ('i') match option:
SELECT *
FROM TBL_FED_KEY
WHERE REGEXP_LIKE(company_name,key_1,'i')
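If partial-word hits are a concern, the same REGEXP_LIKE idea can be tightened so that a keyword only matches as a whole word. This is a minimal sketch, not a definitive rule: it assumes the keywords contain no regular-expression metacharacters (true for the sample values), and :company_name is a bind variable standing in for the procedure's input.

```sql
-- Sketch: whole-word, case-insensitive keyword match.
-- Assumption: key_1 values contain only letters and spaces (no regex metacharacters).
-- :company_name is a placeholder for the procedure's input parameter.
select k.key_1
from   tbl_fed_key k
where  regexp_like(:company_name,
                   '(^|[^[:alnum:]])' || k.key_1 || '([^[:alnum:]]|$)',
                   'i');
```

The pattern requires the keyword to be preceded by the start of the string or a non-alphanumeric character, and followed by a non-alphanumeric character or the end of the string.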

I am not sure what the procedure is supposed to do after it "flags" the company as federal vs. not. I would instead write it as a function as shown below (but you can easily reuse most of the code in a procedure, if needed).
Then I illustrate how the function can be used directly in SQL. You can also use it in PL/SQL if needed, but in most cases you don't. Note - the same idea can be implemented exclusively in SQL, resulting in faster execution, since you don't need PL/SQL at all. Important - even in plain SQL, this should be implemented via a semi join, as I demonstrated, for faster execution.
Setup:
create table tbl_fed_key (key_1 varchar2(200));
insert into tbl_fed_key
select 'ARMY' from dual union all
select 'FEDERAL' from dual union all
select 'AIR FORCE' from dual union all
select 'VETERANS' from dual union all
select 'HOMELAND SECURITY' from dual union all
select 'INDIAN HOSPITAL' from dual union all
select 'WILL ROGERS' from dual
;
commit;
Function code:
create or replace function is_federal_institution(company_name varchar2)
  return varchar2
  deterministic  -- only safe while tbl_fed_key stays unchanged between calls
as
  is_fed varchar2(1);
begin
  -- 'Y' if any keyword occurs in the name (case-insensitive), otherwise 'N'
  select case when exists ( select key_1
                            from   tbl_fed_key
                            where  instr(upper(company_name), upper(key_1)) > 0
                          )
              then 'Y' else 'N' end
  into   is_fed
  from   dual;
  return is_fed;
end;
/
SQL test:
with
inputs (str) as (
select 'Joe and Bob Army Supply Store' from dual union all
select 'Mary Poppins Indian Hospital' from dual union all
select 'Bridge Association of NYC' from dual union all
select 'Will Rogers Garden' from dual union all
select 'First Federal Bank NA' from dual
)
select str, is_federal_institution(str) as is_federal
from inputs
;
STR                            IS_FEDERAL
------------------------------ ----------
Joe and Bob Army Supply Store  Y
Mary Poppins Indian Hospital   Y
Bridge Association of NYC      N
Will Rogers Garden             Y
First Federal Bank NA          Y
As you can see, I threw in a few false positives - to illustrate the important fact that this "technological" solution is only partial. A human will still need to review the individual hits, if accuracy is important.
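The pure-SQL variant mentioned in this answer could look roughly like the sketch below. The inputs CTE is a hypothetical stand-in for whatever table or query supplies the company names; the EXISTS subquery plays the role of the semi join, so the keyword search can stop at the first hit for each name.

```sql
-- Sketch: flag federal institutions in plain SQL, without a PL/SQL function.
-- "inputs" is a hypothetical stand-in for the real source of company names.
with
  inputs (str) as (
    select 'Joe and Bob Army Supply Store' from dual union all
    select 'Bridge Association of NYC'     from dual
  )
select i.str,
       case when exists ( select null
                          from   tbl_fed_key k
                          where  instr(upper(i.str), upper(k.key_1)) > 0
                        )
            then 'Y' else 'N' end as is_federal
from   inputs i;
```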

Related

how to find char in column value

I have two tables
table with all country codes like KZ,US,RU
table tranzactions with terminal location like
(Starbucks 1500 Broadway *Near Times Square US)
(CoffeBoom KZ Mendekulova district *Near Dostyk plaza)
and I want select
country code number , code str , location terminal name
like
398 | KZ | CoffeBoom KZ Mendekulova district *Near Dostyk plaza
840 | US | Starbucks 1500 Broadway *Near Times Square US
I also want to avoid false matches where the terminal location name merely contains a country code inside a longer word. For example, in 'Gucci Moscow Redkzsuzin district RU' the letters 'KZ' and 'UZ' appear inside 'Redkzsuzin', but I want to select only 'RU'.
You can try building a regular expression that incorporates the code_str column. The query below does that: it looks for the country code preceded by the beginning of the string or a space, and followed by a space or the end of the string, and returns the matching rows. However, expect both false positives and false negatives, as you are searching free-form text. Any occurrence matching that pattern will be returned even if it is not actually a valid code, and valid ones can be missed as well. For example, it will not find the row:
982,'US', 'Starbucks 618 Miracle Mile, Chicago, IL, USA'
You may need to work out a better definition of what you are searching for.
with tranzactions (country_code_number , code_str , location_terminal_name) as
(select 398,'KZ', 'CoffeBoom KZ Mendekulova district *Near Dostyk plaza' from dual union all
select 840,'US', 'Starbucks 1500 Broadway *Near Times Square US' from dual union all
select 982,'US', 'Starbucks 618 Miracle Mile, Chicago, IL, USA' from dual
)
select * from tranzactions
where regexp_like(location_terminal_name, '(^| )' || code_str || '( |$)' );
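Since the question describes two separate tables, the same pattern can also serve as a join condition. The sketch below assumes a hypothetical countries table (country_code_number, code_str) and a tranzactions table holding only the location text; both are modelled as CTEs here so the query runs on its own.

```sql
-- Sketch: join a country-code table to free-text terminal locations.
-- Table and column names are assumptions based on the question's description.
with countries (country_code_number, code_str) as (
  select 398, 'KZ' from dual union all
  select 840, 'US' from dual union all
  select 643, 'RU' from dual
),
tranzactions (location_terminal_name) as (
  select 'CoffeBoom KZ Mendekulova district *Near Dostyk plaza' from dual union all
  select 'Starbucks 1500 Broadway *Near Times Square US'        from dual union all
  select 'Gucci Moscow Redkzsuzin district RU'                  from dual
)
select c.country_code_number, c.code_str, t.location_terminal_name
from   tranzactions t
join   countries c
  on   regexp_like(t.location_terminal_name, '(^| )' || c.code_str || '( |$)');
```

Because the match is case-sensitive and requires a space or string boundary on both sides, the lowercase "kz" inside "Redkzsuzin" is not treated as a country code.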

SQL Masking A Mapping Field In The Query

I am creating a view to extract data from a table and load that data into a fixed file which will be loaded into a system. The view will map the table column to a particular format.
There is one column, Account_Number, which needs to be masked as the column has sensitive information.
My logic to mask the value is to shift the number to the next place in numberline.
so, if the number is 0 then 1, 4 then 5, etc. I am not able to come with the logic in the view itself.
Any help would be appreciated.
CREATE OR REPLACE FORCE EDITIONABLE VIEW "Schema1"."VW_ActiveTraders" ("FUND", "NAME", "CITY", "ACN") AS
Select
TD_Fund as FUND,
Name as NAME,
City as CITY,
Account_Number as ACN
FROM Trader1 -- Table Name
Account Number
023457456
123456789
012345678
Masked Account Number
134568567
012345678
123456789
Please note that Account Number column has more than 1000 entries.
You may use TRANSLATE to shift the numbers
with dt as (
select '023457456' ACN from dual union all
select '123456789' ACN from dual union all
select '012345678' ACN from dual)
select ACN,
TRANSLATE(ACN,'0123456789','1234567890') as ACN_WEAK_MASK
from dt;
ACN ACN_WEAK_
--------- ---------
023457456 134568567
123456789 234567890
012345678 123456789
But note that this is not real masking of sensitive information. It is very easy to unmask the information and get the original account ID.
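To illustrate how weak this is, swapping the two TRANSLATE arguments reverses the shift. A minimal sketch using the first masked value from the output above:

```sql
-- Sketch: undo the digit shift by swapping the from/to character lists.
select translate('134568567', '1234567890', '0123456789') as acn_unmasked
from   dual;
-- returns 023457456, the original account number
```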
An often used masking is e.g. 012345678 gets ******678.
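That kind of partial mask can be sketched with SUBSTR and LPAD. The sketch assumes every account number has at least three characters; shorter values would need separate handling.

```sql
-- Sketch: keep the last three characters and replace the rest with '*'.
-- Assumption: each ACN is at least three characters long.
with dt as (
  select '023457456' acn from dual union all
  select '123456789' acn from dual union all
  select '012345678' acn from dual
)
select acn,
       lpad(substr(acn, -3), length(acn), '*') as acn_masked
from   dt;
-- e.g. 012345678 becomes ******678
```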
@MarmiteBomber @Stilgar - Thanks so much for the clarification and help on the answer.
I just tweaked the query and it ran successfully.
Changed Query
------------------------------------------------------------------------------------------
CREATE OR REPLACE FORCE EDITIONABLE VIEW "Schema1"."VW_ActiveTraders" ("FUND", "NAME", "CITY", "ACN") AS
Select
TD_Fund as FUND,
Name as NAME,
City as CITY,
--Account_Number as ACN
TRANSLATE(Account_Number,'0123456789','1234567890') as ACN
FROM Trader1 -- Table Name
------------------------------------------------------------------------------------------

Stored procedure

I have 4 tables to be used in the procedure
business(abnnumber,name)
business_industry(abnnumber,industryid)
industry(industryid,unionid)
trade_union(unionid)
I was assigned to get trade union title in one line and all the businesses ABNNUMBER and business name in different lines using stored procedure.
What I tried is:
CREATE [OR REPLACE] PROCEDURE INDUSTRY_INFORMATION
(P_INDUSTRYID in integer,
P_UNIONTITLE OUT VARCHAR2,
P_BUSINESSNAME OUT VARCHAR2) AS
BEGIN
SELECT TRADE_UNION.UNIONTITLE, BUSINESS.BUSINESSNAME INTO
P_UNIONTITLE,P_BUSINESSNAME
FROM BUSINESS inner join BUSINESS_INDUSTRY ON
BUSINESS.ABNNUMBER=BUSINESS_INDUSTRY.ABNNUMBER
INNER JOIN INDUSTRY ON BUSINESS_INDUSTRY.INDUSTRYID=INDUSTRY.INDUSTRYID
INNER JOIN TRADE_UNION ON INDUSTRY.UNIONID=TRADE_UNION.UNIONID;
END;
Sample data is in the link http://www.mediafire.com/file/8c4dwn4n88n8a42/strd_procedure.txt
Required output should be
UNIONTITLE (one line)
ABNNUMBER BUSINESS NAME (next line)
I suspect that you need something like this:
create or replace procedure industry_info is
begin
for r in (
select tu.uniontitle ut,
listagg('['||b.abnnumber||'] '||b.businessname, ', ')
within group (order by b.businessname) blist
from business b
join business_industry bi on b.abnnumber = bi.abnnumber
join industry i on bi.industryid = i.industryid
join trade_union tu on i.unionid = tu.unionid
group by tu.uniontitle )
loop
dbms_output.put_line(r.ut);
dbms_output.put_line(r.blist);
dbms_output.put_line('-----');
end loop;
end;
The LISTAGG function is available in Oracle 11g Release 2 or later.
Output:
Cleaners' Union
[12345678912] Consolidated Proerty Services, [12345678929] Gold Cleaning Services, [12345678926] Home Cleaning Services, [12345678924] Shine Cleaning
-----
Construction Workers' Union
[12345678920] Build a House, [12345678919] Construction Solutions, [12345678922] Joe's Rubbish Removal, [12345678918] Leak and Roof Repair, [12345678928] Muscle Rubbish Removals
-----
Electricians' Union
[12345678916] Change the Fuse Electricals, [12345678921] Hire a Wire, [12345678917] Vicky Electricals
-----
Movers' Union
[12345678913] Kohlan Movers, [12345678925] Moveit
-----
Mowers' Union
[12345678923] Do it Right Mowers, [12345678911] James Mowers and Landscape
-----
Plumbers' Union
[12345678927] 24X7 Plumbing Service, [12345678915] Anytime Plumbers, [12345678914] Pumbers Delivered
-----
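If the businesses should each appear on their own line, and the output should be limited to one industry as in the question's P_INDUSTRYID parameter, a variation without LISTAGG might look like the sketch below. The procedure name industry_info_by_id is hypothetical, and the column names (uniontitle, businessname) are taken from the queries above rather than from a confirmed schema.

```sql
-- Sketch: print the union title once, then one business per line.
-- industry_info_by_id is a hypothetical name; columns are assumed from the thread.
create or replace procedure industry_info_by_id (p_industryid in integer) is
  l_last_title varchar2(4000);
begin
  for r in (
    select tu.uniontitle, b.abnnumber, b.businessname
    from   business b
    join   business_industry bi on b.abnnumber = bi.abnnumber
    join   industry i           on bi.industryid = i.industryid
    join   trade_union tu       on i.unionid = tu.unionid
    where  i.industryid = p_industryid
    order  by tu.uniontitle, b.businessname )
  loop
    -- emit the title only when it changes
    if l_last_title is null or l_last_title <> r.uniontitle then
      dbms_output.put_line(r.uniontitle);
      l_last_title := r.uniontitle;
    end if;
    dbms_output.put_line(r.abnnumber || ' ' || r.businessname);
  end loop;
end;
/
```

In SQL*Plus or SQL Developer, server output has to be enabled (for example with set serveroutput on) before the lines become visible.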

Fetch object of specified level-type from data hierarchy (Oracle 12 SQL)

My data-set inside table object_type_t looks something like the following:
OBJ_ID  PARENT_OBJ  OBJECT_TYPE  OBJECT_DESC
------  ----------  -----------  --------------
ES01    <null>      ESTATE       Bucks Estate
BUI01   ES01        BUILDING     Leisure Centre
BUI02   ES01        BUILDING     Fire Station
BUI03   <null>      BUILDING     Housing Block
SQ01    BUI01       ROOM         Squash Court
BTR01   BUI02       ROOM         Bathroom
AP01    BUI03       APARTMENT    Flat No. 1
AP02    BUI03       APARTMENT    Flat No. 2
BTR02   AP01        ROOM         Bathroom
BDR01   AP01        ROOM         Bedroom
BTR03   AP02        ROOM         Bathroom
SHR01   BTR01       OBJECT       Shower
SHR02   BTR02       OBJECT       Shower
SHR03   BTR03       OBJECT       Shower
Which in practical, hierarchical terms, looks something like this:
ES01
|--> BUI01
|    |--> SQ01
|--> BUI02
     |--> BTR01
          |--> SHR01
=======
BUI03
|--> AP01
|    |--> BTR02
|    |    |--> SHR02
|    |--> BDR01
|--> AP02
     |--> BTR03
          |--> SHR03
I know how to use hierarchical queries, such as CONNECT BY PRIOR. I'm also aware of how to find the root of the tree via connect_by_root. But what I am looking to do is find a given "level" of a tree (i.e. not the root level, but rather the "BUILDING" level of a given object).
So for example, I would like to be able to query out every object in the hierarchy, which belongs to BUI01.
And then in reverse, given an object ID, I would like to be able to query out the associated (say) ROOM object_id for that object.
Things would be much easier if I could associate each OBJECT_TYPE with a given level. But as you see from the above example, BUILDING does not always appear at level 1 in the hierarchy.
My initial idea is to fetch the data into an intermediate tabular format (perhaps a materialized view) which would look like the following. This would allow me to find the data I want by simple SQL queries on the materialized view:
OBJ_ID  OBJECT_DESC     ESTATE_OBJ  BUILDING_OBJ  ROOM_OBJ
------  --------------  ----------  ------------  --------
ES01    Bucks Estate    ES01
BUI01   Leisure Centre  ES01        BUI01
BUI02   Fire Station    ES01        BUI02
BUI03   Housing Block               BUI03
SQ01    Squash Court    ES01        BUI01         SQ01
BTR01   Bathroom        ES01        BUI02         BTR01
AP01    Flat No. 1                  BUI03
AP02    Flat No. 2                  BUI03
BTR02   Bathroom                    BUI03         BTR02
BDR01   Bedroom                     BUI03         BDR01
BTR03   Bathroom                    BUI03         BTR03
SHR01   Shower          ES01        BUI02         BTR01
SHR02   Shower                      BUI03         BTR02
SHR03   Shower                      BUI03         BTR03
But (short of writing PL/SQL, which I would like to avoid), I haven't been able to concisely structure a query which would achieve this tabular format.
Does anyone know how I can do this? Can it be done?
Solutions must be executable in Oracle 12c.
Additionally: Performance is important, since my underlying data structure contains several hundred-thousand lines, and structures can be quite deep. So faster solutions will be preferred over slower ones :-)
Thanks for your help, in advance.
If I correctly understand your need, maybe you can avoid the tabular view, directly querying your table;
Say you want to find all the objects belonging to BUI01, you can try:
with test(OBJ_ID, PARENT_OBJ, OBJECT_TYPE, OBJECT_DESC) as
(
select 'ES01','','ESTATE','Bucks Estate' from dual union all
select 'BUI01','ES01','BUILDING','Leisure Centre' from dual union all
select 'BUI02','ES01','BUILDING','Fire Station' from dual union all
select 'BUI03','','BUILDING','Housing Block' from dual union all
select 'SQ01','BUI01','ROOM','Squash Court' from dual union all
select 'BTR01','BUI02','ROOM','Bathroom' from dual union all
select 'AP01','BUI03','APARTMENT','Flat No. 1' from dual union all
select 'AP02','BUI03','APARTMENT','Flat No. 2' from dual union all
select 'BTR02','AP01','ROOM','Bathroom' from dual union all
select 'BDR01','AP01','ROOM','Bedroom' from dual union all
select 'BTR03','AP02','ROOM','Bathroom' from dual union all
select 'SHR01','BTR01','OBJECT','Shower' from dual union all
select 'SHR02','BTR02','OBJECT','Shower' from dual union all
select 'SHR03','BTR03','OBJECT','Shower' from dual
)
select OBJECT_TYPE, OBJ_ID, OBJECT_DESC
from test
connect by prior obj_id = parent_obj
start with obj_ID = 'BUI01'
This considers BUI01 belonging to itself; if you don't want this, you can modify the query in quite simple way to cut off your starting value.
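One way to cut off the starting value is to filter on the LEVEL pseudocolumn, since the start row is always at level 1. A minimal sketch, written against the object_type_t table from the question rather than the inline test data:

```sql
-- Sketch: all descendants of BUI01, excluding BUI01 itself.
-- The WHERE clause is applied after the hierarchy has been built.
select object_type, obj_id, object_desc
from   object_type_t
where  level > 1
start with obj_id = 'BUI01'
connect by prior obj_id = parent_obj;
```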
On the opposite way, say you're looking for the room in which SHR01 is, you can try with the following; it's basically the same recursive idea, but in ascending order, instead of descending the tree:
with test(OBJ_ID, PARENT_OBJ, OBJECT_TYPE, OBJECT_DESC) as
(...
)
SELECT *
FROM (
select OBJECT_TYPE, OBJ_ID, OBJECT_DESC
from test
connect by obj_id = PRIOR parent_obj
start with obj_ID = 'SHR01'
)
WHERE object_type = 'ROOM'
In both cases, you only scan your table once, without any other structure; this way, this has a chance to be fast enough.
The desired output has 3 columns which are determined by object types. In general this could be extended with more columns, one for each possible value for the field object_type. Even with the given example data, one could imagine an additional column apartment_obj.
To make this generic without the need to self-join the table as many times as there are object type values, one could use a combination of CONNECT BY and PIVOT clauses:
SELECT *
FROM (
SELECT obj_id,
object_desc,
CONNECT_BY_ROOT obj_id AS pivot_col_value,
CONNECT_BY_ROOT object_type AS pivot_col_name
FROM object_type_t
-- skip the START WITH clause to get all connected pairs
CONNECT BY parent_obj = PRIOR obj_id
)
PIVOT (
MAX(pivot_col_value) AS obj
FOR (pivot_col_name) IN (
'ESTATE' AS estate,
'BUILDING' AS building,
'ROOM' AS room
)
);
The FOR ... IN clause has a hard-coded list of names of the desired columns -- without the _obj suffix, as that gets added during the pivot transformation.
Oracle does not allow this list to be dynamically retrieved. NB: there is an exception to this rule when using the PIVOT XML syntax, but there you get the XML output in one column, which you would then need to parse. That would be rather inefficient.
The sub-query with the CONNECT BY clause does not have a START WITH clause, so every record is taken as a starting point and its descendants are produced from there. Together with the CONNECT_BY_ROOT selection, this produces a full list of all connected pairs, where the distance between the two in the hierarchy can be anything. The PIVOT then groups implicitly by the deeper of the two (obj_id and object_desc), so you get all ancestors of that node (including the node itself). And those ancestors are then pivoted into columns.
The CONNECT BY sub-query could also be written in a way that traverses the hierarchy backwards. The output is the same, but there may be a performance difference. If there is, I would expect this variation to perform better, but I did not test this on large datasets:
SELECT *
FROM (
SELECT CONNECT_BY_ROOT obj_id AS obj_id,
CONNECT_BY_ROOT object_desc AS object_desc,
obj_id AS pivot_col_value,
object_type AS pivot_col_name
FROM object_type_t
-- Connect in backward direction:
CONNECT BY obj_id = PRIOR parent_obj
)
PIVOT (
MAX(pivot_col_value) AS obj
FOR (pivot_col_name) IN (
'ESTATE' AS estate,
'BUILDING' AS building,
'ROOM' AS room
)
);
Note that in this variant the CONNECT_BY_ROOT returns the deeper node of the pair, because of the opposite traversal.
Alternative based on self-joins (previous answer)
You could use this query:
SELECT t1.obj_id,
t1.object_desc,
CASE 'ESTATE'
WHEN t1.object_type THEN t1.obj_id
WHEN t2.object_type THEN t2.obj_id
WHEN t3.object_type THEN t3.obj_id
END estate_obj,
CASE 'BUILDING'
WHEN t1.object_type THEN t1.obj_id
WHEN t2.object_type THEN t2.obj_id
WHEN t3.object_type THEN t3.obj_id
END building_obj,
CASE 'ROOM'
WHEN t1.object_type THEN t1.obj_id
WHEN t2.object_type THEN t2.obj_id
WHEN t3.object_type THEN t3.obj_id
END room_obj
FROM object_type_t t1
LEFT JOIN object_type_t t2 ON t2.obj_id = t1.parent_obj
LEFT JOIN object_type_t t3 ON t3.obj_id = t2.parent_obj
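Note that three table instances only reach two levels up from each row. In the sample data, SHR02 sits three levels below BUI03 (SHR02, BTR02, AP01, BUI03), so its building would not be found. A sketch of the same idea extended by one more self-join, shown for all three output columns:

```sql
-- Sketch: one extra self-join (t4) so that ancestors three levels up are covered.
-- Deeper trees would need further joins; this does not scale to arbitrary depth.
SELECT t1.obj_id,
       t1.object_desc,
       CASE 'ESTATE'
         WHEN t1.object_type THEN t1.obj_id
         WHEN t2.object_type THEN t2.obj_id
         WHEN t3.object_type THEN t3.obj_id
         WHEN t4.object_type THEN t4.obj_id
       END estate_obj,
       CASE 'BUILDING'
         WHEN t1.object_type THEN t1.obj_id
         WHEN t2.object_type THEN t2.obj_id
         WHEN t3.object_type THEN t3.obj_id
         WHEN t4.object_type THEN t4.obj_id
       END building_obj,
       CASE 'ROOM'
         WHEN t1.object_type THEN t1.obj_id
         WHEN t2.object_type THEN t2.obj_id
         WHEN t3.object_type THEN t3.obj_id
         WHEN t4.object_type THEN t4.obj_id
       END room_obj
FROM object_type_t t1
LEFT JOIN object_type_t t2 ON t2.obj_id = t1.parent_obj
LEFT JOIN object_type_t t3 ON t3.obj_id = t2.parent_obj
LEFT JOIN object_type_t t4 ON t4.obj_id = t3.parent_obj;
```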
With many thanks to @trincot for the inspiration, I've carved out the following solution. It's not extremely quick on production data, but it does work on an arbitrarily deep tree. The only way in which this is not dynamic is that one has to choose in advance which levels of the tree are to be extracted, and an additional column must be added to capture that data.
The principle is that one can build a sys_connect_by_path column, and use regular expressions to extract the required level-data from there.
WITH base_data (obj_id, parent_obj, object_type, object_desc) AS (
SELECT 'ES01','','ESTATE','Bucks Estate' FROM dual union all
SELECT 'BUI01','ES01','BUILDING','Leisure Centre' FROM dual union all
SELECT 'BUI02','ES01','BUILDING','Fire Station' FROM dual union all
SELECT 'BUI03','','BUILDING','Housing Block' FROM dual union all
SELECT 'SQ01','BUI01','ROOM','Squash Court' FROM dual union all
SELECT 'BTR01','BUI02','ROOM','Bathroom' FROM dual union all
SELECT 'AP01','BUI03','APARTMENT','Flat No. 1' FROM dual union all
SELECT 'AP02','BUI03','APARTMENT','Flat No. 2' FROM dual union all
SELECT 'BTR02','AP01','ROOM','Bathroom' FROM dual union all
SELECT 'BDR01','AP01','ROOM','Bedroom' FROM dual union all
SELECT 'BTR03','AP02','ROOM','Bathroom' FROM dual union all
SELECT 'SHR01','BTR01','OBJECT','Shower' FROM dual union all
SELECT 'SHR02','BTR02','OBJECT','Shower' FROM dual union all
SELECT 'SHR03','BTR03','OBJECT','Shower' FROM dual ),
obj_hierarchy AS (
SELECT object_type, obj_id, object_desc, parent_obj, sys_connect_by_path(object_type||':'||obj_id,'/')||'/' r_path
FROM base_data
START WITH parent_obj IS null
CONNECT BY PRIOR obj_id = parent_obj
)
SELECT obj_id, object_desc,
CASE
WHEN instr(h.r_path, 'ESTATE:') > 1
THEN regexp_replace (h.r_path,'.*/ESTATE:([^/]+).*$', '\1')
ELSE ''
END obj_estate,
CASE
WHEN instr(h.r_path, 'BUILDING:') > 1
THEN regexp_replace (h.r_path,'.*/BUILDING:([^/]+).*$', '\1')
ELSE ''
END obj_building,
CASE
WHEN instr(h.r_path, 'ROOM:') > 1
THEN regexp_replace (h.r_path,'.*/ROOM:([^/]+).*$', '\1')
ELSE ''
END obj_room
FROM obj_hierarchy h
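As a possible simplification, REGEXP_SUBSTR with a sub-expression argument returns NULL when the pattern does not match, which makes the CASE/INSTR guard unnecessary. The sketch below uses two hand-written paths in the same format as r_path; it is an alternative formulation, not the query from this answer. One difference to be aware of: it picks the first occurrence of a type in the path, whereas the greedy REGEXP_REPLACE pattern picks the last.

```sql
-- Sketch: extract level IDs from a sys_connect_by_path-style string.
-- The two sample paths are written by hand in the r_path format used above.
with obj_hierarchy (obj_id, r_path) as (
  select 'SHR02', '/BUILDING:BUI03/APARTMENT:AP01/ROOM:BTR02/OBJECT:SHR02/' from dual union all
  select 'BUI01', '/ESTATE:ES01/BUILDING:BUI01/' from dual
)
select obj_id,
       -- 6th argument = 1: return only the first captured group
       regexp_substr(r_path, '/ESTATE:([^/]+)',   1, 1, null, 1) as obj_estate,
       regexp_substr(r_path, '/BUILDING:([^/]+)', 1, 1, null, 1) as obj_building,
       regexp_substr(r_path, '/ROOM:([^/]+)',     1, 1, null, 1) as obj_room
from   obj_hierarchy;
```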

Fuzzy text searching in Oracle

I have a large Oracle DB table which contains street names for a whole country, which has 600000+ rows. In my application, I take an address string as input and want to check whether specific substrings of this address string matches one or many of the street names in the table, such that I can label that address substring as the name of a street.
Clearly, this should be a fuzzy text matching problem, there is only a small chance that the substring I query has an exact match with the street names in DB table. So there should be some kind of fuzzy text matching approach. I am trying to read the Oracle documentation at http://docs.oracle.com/cd/B28359_01/text.111/b28303/query.htm in which CONTAINS and CATSEARCH search operators are explained. But these seem to be used for more complex tasks like searching a match for the given string in documents. I just want to do that for a column of a table.
What do you suggest me in this case, does Oracle have support for such kind of fuzzy text matching queries?
UTL_MATCH contains methods for matching strings and comparing their similarity. The edit distance, also known as the Levenshtein Distance, might be a good place to start. Since one string is a substring it may help to compare the edit distance relative to the size of the strings.
--Addresses that are most similar to each substring.
select substring, address, edit_ratio
from
(
--Rank edit ratios.
select substring, address, edit_ratio
,dense_rank() over (partition by substring order by edit_ratio desc) edit_ratio_rank
from
(
--Calculate edit ratio - edit distance relative to string sizes.
select
substring,
address,
(length(address) - UTL_MATCH.EDIT_DISTANCE(substring, address))/length(substring) edit_ratio
from
(
--Fake addresses (from http://names.igopaygo.com/street/north_american_address)
select '526 Burning Hill Big Beaver District of Columbia 20041' address from dual union all
select '5206 Hidden Rise Whitebead Michigan 48426' address from dual union all
select '2714 Noble Drive Milk River Michigan 48770' address from dual union all
select '8325 Grand Wagon Private Sleeping Buffalo Arkansas 72265' address from dual union all
select '968 Iron Corner Wacker Arkansas 72793' address from dual
) addresses
cross join
(
--Address substrings.
select 'Michigan' substring from dual union all
select 'Not-So-Hidden Rise' substring from dual union all
select '123 Fake Street' substring from dual
)
order by substring, edit_ratio desc
)
)
where edit_ratio_rank = 1
order by substring, address;
These results are not great but hopefully this is at least a good starting point. It should work with any language. But you'll still probably want to combine this with some language- or locale- specific comparison rules.
SUBSTRING           ADDRESS                                                 EDIT_RATIO
------------------  ------------------------------------------------------  ----------
123 Fake Street     526 Burning Hill Big Beaver District of Columbia 20041  0.5333
Michigan            2714 Noble Drive Milk River Michigan 48770              1
Michigan            5206 Hidden Rise Whitebead Michigan 48426               1
Not-So-Hidden Rise  5206 Hidden Rise Whitebead Michigan 48426               0.5
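UTL_MATCH also offers normalized scores (0 to 100), which may be easier to threshold when the candidate is compared against street names of similar length rather than whole addresses. A minimal sketch; the street_names CTE and the misspelled candidate 'Hiden Rise' are made up for illustration:

```sql
-- Sketch: rank street names by similarity to a (misspelled) candidate.
-- street_names and the candidate string are hypothetical sample data.
with street_names (name) as (
  select 'Hidden Rise'  from dual union all
  select 'Burning Hill' from dual union all
  select 'Noble Drive'  from dual
)
select name,
       utl_match.edit_distance_similarity(upper(name), upper('Hiden Rise')) as ed_sim,
       utl_match.jaro_winkler_similarity(upper(name), upper('Hiden Rise'))  as jw_sim
from   street_names
order  by jw_sim desc;
```

Which score and which cut-off work best would have to be tuned on real data.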
You could make use of the SOUNDEX function available in Oracle databases. SOUNDEX computes a short phonetic code (a letter followed by three digits) for a text string. This can be used to find strings which sound similar and thus reduce the number of string comparisons.
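A minimal sketch of that pre-filtering idea follows. The streets CTE is a hypothetical stand-in for the real street-name table, and 'Hiden Rise' is a made-up misspelled input. Since the code only covers the leading sounds of a string, it is a coarse filter, and the remaining candidates would still need a finer comparison.

```sql
-- Sketch: keep only street names whose SOUNDEX code equals that of the input.
-- streets is a hypothetical stand-in for the real street-name table.
with streets (street_name) as (
  select 'Hidden Rise' from dual union all
  select 'Noble Drive' from dual union all
  select 'Iron Corner' from dual
)
select street_name
from   streets
where  soundex(street_name) = soundex('Hiden Rise');
```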
Edited:
If SOUNDEX is not suitable for your local language, you can ask Google for a phonetic signature or phonetic matching function which performs better. This function has to be evaluated once per new table entry and once for every query. Therefore, it does not need to reside in Oracle.
Example: A Turkish SOUNDEX is promoted here.
To increase the matching quality, the street name spelling should be unified in a first step. This could be done by applying a set of rules:
Simplified example rules:
Convert all characters to lowercase
Remove "str." at the end of a name
Remove "drv." at the end of a name
Remove "place" at the end of a name
Remove "ave." at the end of a name
Sort names with multiple words alphabetically
Drop auxiliary words like "of", "and", "the", ...
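The first few rules can be sketched directly in SQL. The example below only covers lowercasing and removal of the listed suffixes; sorting the words and dropping auxiliary words are left out because they need more than a single expression. The streets CTE is hypothetical sample data.

```sql
-- Sketch: normalize street names (lowercase, strip selected trailing suffixes).
-- Only part of the rule set above is implemented here.
with streets (name) as (
  select 'Noble Drv.'  from dual union all
  select 'Park Place'  from dual union all
  select 'Hidden Rise' from dual
)
select name,
       regexp_replace(lower(trim(name)),
                      '[[:space:]]+(str\.|drv\.|place|ave\.)$',
                      '') as normalized_name
from   streets;
```

The same expression would have to be applied to both the stored names and the incoming search string so that the two sides are compared in the same normalized form.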