Suggestions for Querying Database for Names - sql

I have an Oracle database that, like many, has a table containing biographical information. I would like to search this table by name in a "natural" way.
The table has forename and surname fields and, currently, I am using something like this:
select id, forename, surname
from mytable
where upper(forename) like '%JOHN%'
and upper(surname) like '%SMITH%';
This works, but it can be very slow because the indices on this table obviously can't help with a leading wildcard. Also, users will usually be searching for people based on what they are told over the phone -- including a huge number of non-English names -- so it would be nice to do some phonetic analysis as well.
As such, I have been experimenting with Oracle Text:
create index forenameFTX on mytable(forename) indextype is ctxsys.context;
create index surnameFTX on mytable(surname) indextype is ctxsys.context;
select score(1)+score(2) relevance,
id,
forename,
surname
from mytable
where contains(forename,'!%john%',1) > 0
and contains(surname,'!%smith%',2) > 0
order by relevance desc;
This has the advantage of using the Soundex algorithm as well as full text indices, so it should be a little more efficient. (Although my anecdotal results show it to be pretty slow!) The only apprehensions I have about this are:
Firstly, the text indices need to be refreshed in some meaningful way. Using on commit would be too slow and might interfere with how the frontend software -- which is out of my control -- interacts with the database, so this requires some thought (see the sync sketch below).
Secondly, the results returned by Oracle aren't sorted very naturally; I'm not really sure about this score function. For example, my development data shows "Jonathan Peter Jason Smith" at the top -- fine -- but also "Jane Margaret Simpson" at the same level as "John Terrance Smith".
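On the first point, a middle ground I'm considering (a sketch, not a settled design) is to drop the on-commit sync and instead run CTX_DDL.SYNC_INDEX from a periodic job:

begin
    ctx_ddl.sync_index('forenameFTX');
    ctx_ddl.sync_index('surnameFTX');
end;
/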
I'm thinking that removing the leading wildcard might improve performance without degrading the results since, in real life, you would never search for a chunk from the middle of a name (a sketch of the prefix-only form is below). Otherwise, I'm open to ideas... This scenario must have been implemented ad nauseam! Can anyone suggest a better approach to what I'm doing/considering now?
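For illustration, the prefix-only form (soundex operator omitted for clarity) would be:

select score(1)+score(2) relevance,
       id,
       forename,
       surname
from mytable
where contains(forename, 'john%', 1) > 0
and contains(surname, 'smith%', 2) > 0
order by relevance desc;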
Thanks :)

I have come up with a solution which works pretty well, following the suggestions in the comments -- particularly X-Zero's suggestion of creating a table of Soundexes. In my case, I can create new tables, but altering the existing schema is not allowed!
So, my process is as follows:
Create a new table with columns ID, token, sound and position, with the primary key over (ID, sound, position) and an additional index over (ID, sound).
Go through each person in the biographical table:
Concatenate their forename and surname.
Change the codepage to us7ascii, so accented characters are normalised. This is because the Soundex algorithm doesn't work with accented characters.
Convert all non-alphabetic characters into whitespace and consider this the boundary between tokens.
Tokenise this string and insert into the table the token (in lowercase), the Soundex of the token and the position the token comes in the original string; associate this with ID.
Like so:
declare
    nameString varchar2(82);
    token      varchar2(40);
    posn       integer;
    cursor myNames is
        select id,
               forename || ' ' || surname person_name
        from mypeople;
begin
    for person in myNames
    loop
        nameString := trim(
                          utl_i18n.escape_reference(
                              regexp_replace(
                                  regexp_replace(person.person_name, '[^[:alpha:]]', ' '),
                                  '\s+', ' '),
                              'us7ascii')
                      ) || ' ';
        posn := 1;
        while nameString is not null
        loop
            token := substr(nameString, 1, instr(nameString, ' ') - 1);
            insert into personsearch values (person.id, lower(token), soundex(token), posn);
            nameString := substr(nameString, instr(nameString, ' ') + 1);
            posn := posn + 1;
        end loop;
    end loop;
end;
/
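For reference, the personsearch table used above could be created with DDL along these lines (a sketch; the column sizes are assumptions):

create table personsearch (
    id       number       not null,
    token    varchar2(40),
    sound    varchar2(4),  -- soundex() returns a four-character code
    position integer      not null,
    constraint personsearch_pk primary key (id, sound, position)
);

create index personsearch_idx on personsearch (id, sound);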
So, for example, "Siân O'Conner" gets tokenised into "sian" (position 1), "o" (position 2) and "conner" (position 3) and those three entries, with their Soundex, get inserted into personsearch along with their ID.
To search, we do the same process: tokenise the search criteria and then return results where the Soundexes and relative positions match. We order by the position and then the Levenshtein distance (ld) from the original search for each token, in turn.
This query, for example, will search against two tokens (i.e., pre-tokenised search string):
with searchcriteria as (
    select 'john' token1,
           'smith' token2
    from dual)
select alpha.id,
       mypeople.forename || ' ' || mypeople.surname
from personsearch alpha
join mypeople
    on mypeople.id = alpha.id
join personsearch beta
    on beta.id = alpha.id
    and beta.position > alpha.position
join searchcriteria
    on 1 = 1
where alpha.sound = soundex(searchcriteria.token1)
and beta.sound = soundex(searchcriteria.token2)
order by alpha.position,
         ld(alpha.token, searchcriteria.token1),
         beta.position,
         ld(beta.token, searchcriteria.token2),
         alpha.id;
To search against an arbitrary number of tokens, we would need to use dynamic SQL: joining the search table as many times as there are tokens, where the position field in each joined table must be greater than the position of the previously joined table... I plan to write a function to do this -- as well as the search string tokenisation -- which will return a table of IDs. However, I'm just posting this here so you get the idea :)
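For what it's worth, here is a rough sketch of how that dynamic SQL could be generated (the function name is hypothetical, and executing the result with a variable number of bind values would need DBMS_SQL rather than EXECUTE IMMEDIATE):

create or replace function build_search_sql(p_token_count in integer)
return varchar2
is
    v_sql varchar2(4000);
begin
    -- one copy of the search table per token, chained on id and position
    v_sql := 'select t1.id from personsearch t1';
    for i in 2 .. p_token_count loop
        v_sql := v_sql
              || ' join personsearch t' || i
              || ' on t' || i || '.id = t1.id'
              || ' and t' || i || '.position > t' || (i - 1) || '.position';
    end loop;
    -- one soundex bind per token
    v_sql := v_sql || ' where t1.sound = soundex(:1)';
    for i in 2 .. p_token_count loop
        v_sql := v_sql || ' and t' || i || '.sound = soundex(:' || i || ')';
    end loop;
    return v_sql;
end;
/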
As I say, this works pretty well: it returns good results pretty quickly. Even searching for "John Smith", once cached by the server, runs in less than 0.2s, returning over 200 rows... I'm pretty pleased with it and will be looking to put it into production. The only issues are:
The precalculation of tokens takes a while, but it's a one-off process, so not too much of a problem. A related problem, however, is that a trigger needs to be put on the mypeople table to insert/update/delete tokens in the search table whenever the corresponding operation is performed on mypeople. This may slow down the system; but as this should only happen during a few periods in a year, perhaps a better solution would be to rebuild the search table on a scheduled basis (see the sketch after this list).
No stemming is being done, so the Soundex algorithm only matches on full tokens. For example, a search for "chris" will not return any "christopher"s. A possible solution to this is to only store the Soundex of the stem of the token, but calculating the stem is not a simple problem! This will be a future upgrade, possibly using the hyphenation engine used by TeX...
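On the scheduled-rebuild idea above: assuming the tokenisation block were wrapped in a stored procedure (hypothetical name rebuild_personsearch), a nightly DBMS_SCHEDULER job would be a simple sketch like this:

begin
    dbms_scheduler.create_job(
        job_name        => 'REBUILD_PERSONSEARCH',
        job_type        => 'STORED_PROCEDURE',
        job_action      => 'rebuild_personsearch',  -- hypothetical wrapper procedure
        repeat_interval => 'FREQ=DAILY;BYHOUR=2',   -- rebuild at 02:00 each day
        enabled         => true);
end;
/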
Anyway, hope that helps :) Comments welcome!
EDIT: My full solution (write-up and implementation) is now here, using Metaphone and the Damerau-Levenshtein distance.

Related

Conditionally delete item inside an Array Field PostgreSQL

I'm building a kind of dictionary app and I have a table for storing words like below:
id | surface_form | examples
-----------------------------------------------------------------------
1 | sounds | {"It sounds as though you really do believe that",
| | "A different bell begins to sound midnight"}
Where surface_form is of type CHARACTER VARYING and examples is an array field of CHARACTER VARYING
Since the examples are generated automatically from another API, they might not contain the exact surface_form. Now I want to keep in examples only those sentences that contain the exact surface_form. For instance, in the given example, only the first sentence is kept, as it contains "sounds"; the second should be omitted, as it only contains "sound".
The problem is that I'm stuck on how to write a query and/or PL/pgSQL stored procedure to update the examples column so that it only has the desired sentences.
This query skips unwanted array elements:
select id, array_agg(example) new_examples
from a_table, unnest(examples) example
where surface_form = any(string_to_array(example, ' '))
group by id;
id | new_examples
----+----------------------------------------------------
1 | {"It sounds as though you really do believe that"}
(1 row)
Use it in update:
with corrected as (
select id, array_agg(example) new_examples
from a_table, unnest(examples) example
where surface_form = any(string_to_array(example, ' '))
group by id
)
update a_table
set examples = new_examples
from corrected
where examples <> new_examples
and a_table.id = corrected.id;
Test it in rextester.
Maybe you have to change the table design. This is what PostgreSQL's documentation says about the use of arrays:
Arrays are not sets; searching for specific array elements can be a sign of database misdesign. Consider using a separate table with a row for each item that would be an array element. This will be easier to search, and is likely to scale better for a large number of elements.
Documentation:
https://www.postgresql.org/docs/current/static/arrays.html
The most compact solution (but not necessarily the fastest) is to write a function that you pass a regular expression and an array and which then returns a new array that only contains the items matching the regex.
create function get_matching(p_values text[], p_pattern text)
returns text[]
as
$$
declare
    l_result  text[] := '{}'; -- make sure it's not null
    l_element text;
begin
    foreach l_element in array p_values loop
        -- adjust this condition to whatever you want
        if l_element ~ p_pattern then
            l_result := l_result || l_element;
        end if;
    end loop;
    return l_result;
end;
$$
language plpgsql;
The if condition is only an example; you need to adjust it to whatever you actually store in the surface_form column. Maybe you need to test on word boundaries with the regex, or a simple strpos() would do -- your question is unclear about that.
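For instance, a word-boundary test could use PostgreSQL's \m (start of word) and \M (end of word) regex escapes -- hypothetical usage with the data from the question:

select get_matching(
           array['It sounds as though you really do believe that',
                 'A different bell begins to sound midnight'],
           '\msounds\M');
-- returns {"It sounds as though you really do believe that"}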
Cleaning up the table then becomes as simple as:
update the_table
set examples = get_matching(examples, surface_form);
But the whole approach seems flawed to me. It would be a lot more efficient if you stored the examples in a properly normalized data model.
In SQL, you have to remember two things:
Tuple elements are immutable, but rows are mutable via updates.
SQL is declarative, not procedural.
So you cannot "conditionally" "delete" a value from an array. You have to think about the question differently. You have to create a new array following a specification. That specification can conditionally include values (using case statements). Then you can overwrite the tuple with the new array.
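A minimal sketch of that idea, reusing the names from the question (here a where clause plays the role of the specification; a case expression inside the aggregate would work as well):

update a_table
set examples = (
    select coalesce(array_agg(e), '{}')           -- build the new array
    from unnest(a_table.examples) as e            -- from the old one
    where position(a_table.surface_form in e) > 0 -- the conditional spec
);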
Looks like one way could be to update the array with the elements that are valid, by doing a select using like or some regular expression.
https://www.postgresql.org/docs/current/static/arrays.html
If you want to keep only the elements of the array that contain surface_form, you can keep the entries where substring(...) is not null.
First you unnest the array, keep only the items that match, and then array_agg the remaining items.
Here is a little query you can run to test without any table.
SELECT
    id,
    surface_form,
    (SELECT array_agg(examples_matching)
     FROM unnest(surfaces.examples) AS examples_matching
     WHERE substring(examples_matching, surfaces.surface_form) IS NOT NULL)
FROM
    (SELECT
         1 AS id,
         'example' :: TEXT AS surface_form,
         ARRAY ['example form', 'test test', 'second example form'] :: TEXT [] AS examples
    ) surfaces;
You can select the data into a temp table using unnest, then update the temp table with an update query on the row number. Merge the values back using array_agg, and update the original table with this merged value.
For example, suppose you create a temp table Temp (id int, element character varying). Then update the Temp table and re-nest it; finally, update the original table.
Here is a query you can try directly in an editor:
CREATE TEMP TABLE IF NOT EXISTS temp_element (
    id bigint,
    element character varying
) WITH (OIDS);
TRUNCATE TABLE temp_element;

insert into temp_element
select row_number() over (order by p), p
from (select unnest(ARRAY['It sounds as though you really do believe that',
                          'A different bell begins to sound midnight']) as p) t;

update temp_element
set element = 'It sounds as though you really'
where element = 'It sounds as though you really do believe that';

-- update table
select array_agg(r) from (select element from temp_element) r

Improving performance on an alphanumeric text search query

I have a table with millions of records; I'm just posting sample data here. I'm looking to get only the Endorsement rows, by using LIKE or RIGHT, but there is no difference between them in execution time. Is there any good way to get the data in less time when dealing with alphanumeric data? I have 4.4M records in the table. Suggestions welcome.
declare @t table (val varchar(50))
insert into @t(val) values
('0-1AB11BC11yerw123Endorsement'),
('0-1AB114578Endorsement'),
('0-1BC11BC11yerw122553Endorsement'),
('0-1AB11BC11yerw123newBusiness'),
('0-1AB114578newBusiness'),
('0-1BC11BC11yerw122553newBusiness'),
('0-1AB11BC11yerw123Renewal'),
('0-1AB114578Renewal'),
('0-1BC11BC11yerw122553Renewal')

SELECT * FROM @t where RIGHT(val,11) = 'Endorsement'
SELECT * FROM @t where val like '%Endorsement%'
Imagine you'd have to find names in a telephone book that end with a certain string. All you could do is read every single name and compare. It doesn't help you at all to see where the names with A, B, C, etc. start, because you are not interested in the initial characters of the names but only in the last characters instead. Well, the only thing you could do to speed this up is ask some friends to help you and each person scans a range of pages only. In a DBMS it is the same. The DBMS performs a full table scan and does this parallelized if possible.
If however you had a telephone book listing the words backwards, so you'd see which words end with A, B, C, etc., that sure would help. In SQL Server: Create a computed column on the reverse string:
alter table t add reverse_val as reverse(val);
And add an index:
create index idx_reverse_val on t(reverse_val);
Then query the string with LIKE. The DBMS should notice that it can use the index for speeding up the search process.
select * from t where reverse_val like reverse('Endorsement') + '%';
Having said this, it seems strange that you are interested in the end of your strings at all. In a good database you store atomic information, e.g. you would not store a person's name and birthdate in the same column ('John Miller 12.12.2000'), but in separate columns instead. Sure, it does happen that you store names and want to look for names starting with, ending with, containing substrings, but this is a rare thing after all. Check your column and think about whether its content should be separate columns instead. If you had the string ('Endorsement', 'Renewal', etc.) in a separate column, this would really speed up the lookup, because all you'd have to do is ask where val = 'Endorsement' and with an index on that column this is a super-simple task for the DBMS.
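For instance, a persisted computed column extracting the trailing keyword (the column and index names here are made up for the sketch) would turn the search into a plain indexed equality test:

alter table t add transaction_type as
    (case
         when val like '%Endorsement' then 'Endorsement'
         when val like '%newBusiness' then 'newBusiness'
         when val like '%Renewal'     then 'Renewal'
     end) persisted;

create index idx_transaction_type on t(transaction_type);

select * from t where transaction_type = 'Endorsement';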
try charindex or patindex:
SELECT *
FROM @t t
WHERE CHARINDEX('endorsement', t.val) > 0

SELECT *
FROM @t t
WHERE PATINDEX('%endorsement%', t.val) > 0
CREATE TABLE tbl
(val varchar(50));
insert into tbl(val)values
('0-1AB11BC11yerw123Endorsement'),
('0-1AB114578Endorsement'),
('0-1BC11BC11yerw122553Endorsement'),
('0-1AB11BC11yerw123newBusiness'),
('0-1AB114578newBusiness'),
('0-1BC11BC11yerw122553newBusiness'),
('0-1AB11BC11yerw123Renewal'),
('0-1AB114578Renewal'),
('0-1BC11BC11yerw122553Renewal');
CREATE CLUSTERED INDEX inx
ON dbo.tbl(val)
SELECT * FROM tbl where val like '%Endorsement';
-- LIKE '%Endorsement' will give better performance: it utilizes the index more efficiently than RIGHT(val,11)

Oracle Text Context-Index returns all rows for contains() query with wildcard

We use Oracle Text in an Oracle Database (11.2.0.4.0) to perform full-text search over stored documents, as well as over multiple columns in our database.
For these multi-column indexes we noticed that some double-sided wildcard queries return the wrong number of results: the whole table!
Our application translates a user's query into a double-sided wildcard query (e.g. "york" -> "%york%") and passes it to the contains operator.
We re-ran this on the database and could reproduce it.
Consider, for example, a table containing cities where the full-text index spans all columns: Zip-Code, Cityname, State and Country:
select * from city where contains(cityname, '%york%')>0
The following query arguments seem to return a wrong number of results (all rows):
%s%
%i%
%d%
%c%
What I checked already:
Interestingly, the non-working queries are all format-arguments in C. But I have not been able to find these as keywords or special operators in the Oracle Text documentation.
I checked that the stop word list does not contain these queries.
I set a custom lexer and turned on the "mixed case" option for it, which seems to fix the issue for lowercase queries, but the issue persists for uppercase queries (%S%).
The score operator returns a value of 6 for the rows that should not match:
select cityname, state, zip, score(1) from city where contains(cityname, '%s%', 1)>0
---------------------------------
|Cityname |State|Zip | Score(1)|
|-------------------------------|
|La Cibourg|NE |2332| 6 | - WRONG
|Morlon |FR |1638| 6 | - WRONG
|Leuk Stadt|VS |3953| 12 | - Correct row
---------------------------------
Do you know any (mis-)configuration that can cause this?
Update
The exact version is 11.2.0.4.0, with Patch 18842982 applied.
The script to create the table and index is below:
drop table city_copy;
create table city_copy (
city_nr number not null,
zip_code varchar2(60),
city_name varchar2(60),
state varchar2(60)
);
insert into city_copy
select 1, 2332, 'La Ciboug', 'NE' from dual
union all
select 2, 1638, 'Morlon', 'FR' from dual
union all
select 3, 3953, 'Leuk Stadt', 'VS' from dual;
commit;
exec ctxsys.ctx_ddl.drop_preference('CITY_MULTI');
exec ctxsys.ctx_ddl.create_preference('CITY_MULTI', 'MULTI_COLUMN_DATASTORE');
exec ctxsys.ctx_ddl.set_attribute('CITY_MULTI', 'COLUMNS', 'ZIP_CODE, CITY_NAME, STATE');
create index city_idx_ft on city_copy(zip_code)
indextype is ctxsys.context parameters ('datastore CITY_MULTI sync (on commit)');
The current settings for the default lexer are:
DEFAULT_LEXER COMPOSITE GERMAN
DEFAULT_LEXER MIXED_CASE YES
DEFAULT_LEXER ALTERNATE_SPELLING GERMAN
Our stoplist is unchanged from the default stoplist for German.
So, after quite a bit of research...
I am still not sure if it is a bug, but although my intuition said it was the lexer causing this behavior -- it is not.
Please add to the preference an attribute named DELIMITER with a value of NEWLINE:
exec ctx_ddl.set_attribute('CITY_MULTI', 'DELIMITER', 'NEWLINE');
That would solve your issue.
The default delimiter is COLUMN_NAME_TAG, which probably conflicts with too-short parameters (it is supposed to treat your data as if it were XML, and probably somewhere in how Oracle concatenates the text there are the single characters you were looking for).
It looks to me like, for the multi-column datastore, Oracle Text constructs for each row an XML document that contains the names of the columns, something like:
<XML>
<zip_code>2332</zip_code>
<city_name>La Ciboug</city_name>
<state>NE</state>
</XML>
and that XML is being indexed (or a structure similar to it).
So when looking for just S, the s from the word "state" matches in every row.
The NEWLINE delimiter changes the way the text is built, to:
2332
La Ciboug
NE
which is better in your case and for the way you search.
You can find more info about it here:
http://docs.oracle.com/cd/B19306_01/text.102/b14218/cdatadic.htm#i1006391
Good luck!

Matching sub string in a column

First I apologize for the poor formatting here.
Second I should say up front that changing the table schema is not an option.
So I have a table defined as follows:
Pin varchar
OfferCode varchar
Pin will contain data such as:
abc,
abc123
OfferCode will contain data such as:
123
123~124~125
I need a query to check for a count of a Pin/OfferCode combination, and when I say OfferCode, I mean an individual item delimited by the tilde.
For example, if there is one row that looks like abc, 123 and another that looks like abc, 123~124, and I search for a count of Pin=abc, OfferCode=123, I want to get a count of 2.
Obviously I can do a similar query to this:
SELECT count(1) from MyTable (nolock) where OfferCode like '%' + @OfferCode + '%' and Pin = @Pin
Using like here is very expensive, and I'm hoping there may be a more efficient way.
I'm also looking into using a split-string solution. I have a table-valued function SplitString(string, delim) that will return a table OutParam, but I'm not quite sure how to apply this to a table column vs a string. Would this even be worthwhile pursuing? It seems like it would be much more expensive, but I'm unable to get a working solution to compare to the like solution.
Your like/% solution is open to a bug if you ever have offer codes other than 3 digits (if there were offer codes 123 and 1234, searching for like '%123%' would return both, which is wrong). You can use your string function this way:
SELECT Pin, count(1)
FROM MyTable (nolock)
CROSS APPLY SplitString(OfferCode,'~') OutParam
WHERE OutParam.Value = @OfferCode and Pin = @Pin
GROUP BY Pin
If you have a relatively small table you can probably get away with this. If you are working with a large number of rows or encountering performance problems, it would be more effective to normalize it as RedFilter suggested.
using like here is very expensive and I'm hoping there may be a more efficient way
The efficient way is to normalize the schema and put each OfferCode in its own row.
Then your query is more like (although you may need to use an intersection table depending on your schema):
select count(*)
from MyTable
where OfferCode = @OfferCode
and Pin = @Pin
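A rough sketch of that normalized design (names invented for illustration; SplitString is the asker's existing function). Note there is no unique key on (Pin, OfferCode), since the original data can contain the same combination on several rows and the count should reflect that:

create table PinOfferCode (
    Pin       varchar(10) not null,
    OfferCode varchar(10) not null
);

create index idx_PinOfferCode on PinOfferCode (Pin, OfferCode);

-- populate once from the delimited column
insert into PinOfferCode (Pin, OfferCode)
select Pin, OutParam.Value
from MyTable
cross apply SplitString(OfferCode, '~') OutParam;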
Here is one way to use like for this problem, which is standard for getting exact matches when searching delimited strings while avoiding the '%123%' matches '123' and '1234' problem:
-- Create some test data
declare @table table (
    Pin varchar(10) not null
    , OfferCode varchar(100) not null
)
insert into @table select 'abc', '123'
insert into @table select 'abc', '123~124'
-- Mock some proc params
declare @Pin varchar(10) = 'abc'
declare @OfferCode varchar(10) = '123'
-- Run the actual query
select count(*) as Matches
from @table
where Pin = @Pin
-- Append delimiters to find exact matches
and '~' + OfferCode + '~' like '%~' + @OfferCode + '~%'
As you can see, we're adding the delimiters to the searched string, and also the search string in order to find matches, thus avoiding the bugs mentioned by other answers.
I highly doubt that a string splitting function will yield better performance over like, but it may be worth a test or two using some of the more recently suggested methods. If you still have unacceptable performance, you have a few options:
Updated:
Try an index on OfferCode (or on a computed, persisted column of '~' + OfferCode + '~'; see the sketch after this list). Contrary to the myth that SQL Server won't use an index with like and wildcards, this might actually help.
Check out full text search.
Create a normalized version of this table using a string splitter. Use this table to run your counts. Update this table according to some schedule or event (trigger, etc.).
If you have some standard search terms, pre-calculate the counts for these and store them on some regular basis.
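A sketch of the computed-column idea from the first option (names invented for illustration):

alter table MyTable add OfferCodeDelimited as ('~' + OfferCode + '~') persisted;
create index idx_OfferCodeDelimited on MyTable (OfferCodeDelimited);

select count(*) as Matches
from MyTable
where Pin = @Pin
  and OfferCodeDelimited like '%~' + @OfferCode + '~%';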
Actually, the LIKE condition is going to have much less cost than doing any sort of string manipulation and comparison.
http://www.simple-talk.com/sql/performance/the-seven-sins-against-tsql-performance/

Get all records that contain a number

Is it possible to write a query to get all records from a table where a certain field contains a numeric value?
Something like "select street from tbladdress where street like '%0%' or street like '%1%'" etc., only then with one function?
Try this
declare @t table(street varchar(50))
insert into @t
select 'this address is 45/5, Some Road' union all
select 'this address is only text'

select street from @t
where street like '%[0-9]%'
street
this address is 45/5, Some Road
Yes, but it will be inefficient, and probably slow, with a wildcard on the leading edge of the pattern
LIKE '%[0-9]%'
Searching for text within a column is horrendously inefficient and does not scale well (per-row functions, as a rule, all have this problem).
What you should be doing is trading disk space (which is cheap) for performance (which is never cheap) by creating a new column, hasNumerics for example, adding an index to it, then using an insert/update trigger to set it based on the data going into the real column.
This means the calculation is done only when the row is created or modified, not every single time you extract the data. Databases are almost always read far more often than they're written and using this solution allows you to amortize the cost of the calculation over many select statement executions.
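A rough T-SQL sketch of that approach (table, key and column names are assumptions for illustration):

alter table mytable add hasNumerics bit not null default 0;
create index idx_hasNumerics on mytable (hasNumerics);
go
create trigger trg_mytable_hasNumerics
on mytable
after insert, update
as
begin
    set nocount on;
    -- recompute the flag only for the rows just touched
    update m
    set hasNumerics = case when m.street like '%[0-9]%' then 1 else 0 end
    from mytable m
    join inserted i on i.id = m.id; -- assumes an id key column
end;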
Then, when you want your data, just use:
select * from mytable where hasNumerics = 1; -- or true or ...
and watch it leave a regular expression query or like '%...%' monstrosity in its dust.
To fetch rows that contain no alphabetic characters at all (e.g. only numbers, punctuation and spaces), use this query:
select street
from tbladdress
where upper(street) = lower(street)
Works in Oracle.
I found this solution: "select street from tbladresse with(nolock) where patindex('%[0-9]%', street) > 0".
It took me 2 minutes to search 3 million rows on an unindexed field.