Improving performance on an alphanumeric text search query - sql

I have a table with millions of records (about 4.4M); I'm just posting sample data here. I'm trying to fetch only the Endorsement rows, using either LIKE or RIGHT, but there is no difference between them in execution time. Is there a better way to get the data in less time when dealing with alphanumeric data like this? Suggestions welcome.
declare @t table (val varchar(50))

insert into @t (val) values
('0-1AB11BC11yerw123Endorsement'),
('0-1AB114578Endorsement'),
('0-1BC11BC11yerw122553Endorsement'),
('0-1AB11BC11yerw123newBusiness'),
('0-1AB114578newBusiness'),
('0-1BC11BC11yerw122553newBusiness'),
('0-1AB11BC11yerw123Renewal'),
('0-1AB114578Renewal'),
('0-1BC11BC11yerw122553Renewal')

SELECT * FROM @t where RIGHT(val, 11) = 'Endorsement'
SELECT * FROM @t where val like '%Endorsement%'

Imagine you'd have to find names in a telephone book that end with a certain string. All you could do is read every single name and compare. It doesn't help you at all to see where the names with A, B, C, etc. start, because you are not interested in the initial characters of the names but only in the last characters instead. Well, the only thing you could do to speed this up is ask some friends to help you and each person scans a range of pages only. In a DBMS it is the same. The DBMS performs a full table scan and does this parallelized if possible.
If however you had a telephone book listing the words backwards, so you'd see which words end with A, B, C, etc., that sure would help. In SQL Server: Create a computed column on the reverse string:
alter table t add reverse_val as reverse(val);
And add an index:
create index idx_reverse_val on t(reverse_val);
Then query the string with LIKE. The DBMS should notice that it can use the index for speeding up the search process.
select * from t where reverse_val like reverse('Endorsement') + '%';
Having said this, it seems strange that you are interested in the end of your strings at all. In a good database you store atomic information, e.g. you would not store a person's name and birthdate in the same column ('John Miller 12.12.2000'), but in separate columns instead. Sure, it does happen that you store names and want to look for names starting with, ending with, containing substrings, but this is a rare thing after all. Check your column and think about whether its content should be separate columns instead. If you had the string ('Endorsement', 'Renewal', etc.) in a separate column, this would really speed up the lookup, because all you'd have to do is ask where val = 'Endorsement' and with an index on that column this is a super-simple task for the DBMS.
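For illustration, here is a minimal sketch of that separate-column idea, assuming the transaction type is always the trailing word of val; the column and index names (txn_type, idx_txn_type) are made up:
-- one-time: add a column for the transaction type and backfill it
alter table t add txn_type varchar(20);

update t set txn_type =
    case
        when val like '%Endorsement' then 'Endorsement'
        when val like '%newBusiness' then 'newBusiness'
        when val like '%Renewal'     then 'Renewal'
    end;

create index idx_txn_type on t(txn_type);

-- the lookup becomes a plain, index-friendly equality
select * from t where txn_type = 'Endorsement';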

try charindex or patindex:
SELECT *
FROM @t t
WHERE CHARINDEX('endorsement', t.val) > 0

SELECT *
FROM @t t
WHERE PATINDEX('%endorsement%', t.val) > 0

CREATE TABLE tbl
(val varchar(50));
insert into tbl(val)values
('0-1AB11BC11yerw123Endorsement'),
('0-1AB114578Endorsement'),
('0-1BC11BC11yerw122553Endorsement'),
('0-1AB11BC11yerw123newBusiness'),
('0-1AB114578newBusiness'),
('0-1BC11BC11yerw122553newBusiness'),
('0-1AB11BC11yerw123Renewal'),
('0-1AB114578Renewal'),
('0-1BC11BC11yerw122553Renewal');
CREATE CLUSTERED INDEX inx
ON dbo.tbl(val)
SELECT * FROM tbl where val like '%Endorsement';
-- LIKE '%Endorsement' will give better performance; it can make use of the index more efficiently than RIGHT(val, 11)

Related

Best way to index a SQL table to find best matching string

Let's say I have a SQL table with an int PK column and an nvarchar(max). In the nvarchar(max) column, I have a bunch of table entries that are all like this:
SOME_PEOPLE_LIKE_APPLES
SOME_PEOPLE_LIKE_APPLES_ON_TUESDAY
SOME_PEOPLE_LIKE_APPLES_ON_THE_MOON
SOME_PEOPLE_LIKE_APPLES_ON_THE_MOON_CAFE
SOME_PEOPLE_LIKE_APPLES_ON_THE_RIVER
.
.
.
SOME_ANTS_HATE_SYRUP
SOME_ANTS_HATE_SYRUP_WITH_STRAWBERRIES
There are millions of these rows. Let's say my goal is to find the row with the most overlap for an input searchTerm. So in this case, if I input SOME_PEOPLE_LIKE_APPLES_ON_THE_MOON_MOUNTAIN, the returned entry would be the third entry from the table above, SOME_PEOPLE_LIKE_APPLES_ON_THE_MOON.
I have a SPROC that does this very naively; it goes through the entire table as follows:
SELECT DISTINCT phrase, len(phrase) l, [id] FROM X WHERE @searchTerm LIKE phrase + '%'
-- phrase is the row entry being searched against
-- @searchTerm is the phrase we're searching for
I then ORDER BY length and pick the TOP only
Would there be a way to speed this up, perhaps by doing some indexing?
If this is confusing, think of it as tableRowEntry + wildcard = searchTerm
I'm on MSSQL 2008 if that makes any difference
If there is an index on your NVARCHAR column, a LIKE 'Something%' search will be able to use it and should be pretty fast.
If there is a wildcard at the beginning you are out of luck. But - in your case - this should work.
You might use an indexed, persisted computed column storing the length of the string. In that case you can reduce the workload enormously by filtering out all strings which are too short or too long.
If there are certain words in your search terms which appear often but not everywhere, you might use side columns again and filter like AND IncludePEOPLE=1 AND IncludeMOON=1.
UPDATE
Here is an example
CREATE TABLE Phrase(ID INT IDENTITY
,Phrase NVARCHAR(100)
,PhraseLength AS LEN(Phrase) PERSISTED);
CREATE INDEX IX_Phrase_Phrase ON Phrase(Phrase);
CREATE INDEX IX_Phrase_PhraseLength ON Phrase(PhraseLength);
INSERT INTO Phrase
VALUES
('SOME_PEOPLE_LIKE_APPLES')
,('SOME_PEOPLE_LIKE_APPLES_ON_TUESDAY')
,('SOME_PEOPLE_LIKE_APPLES_ON_THE_MOON')
,('SOME_PEOPLE_LIKE_APPLES_ON_THE_MOON_CAFE')
,('SOME_PEOPLE_LIKE_APPLES_ON_THE_RIVER')
,('SOME_ANTS_HATE_SYRUP')
,('SOME_ANTS_HATE_SYRUP_WITH_STRAWBERRIES');
DECLARE @SearchTerm NVARCHAR(100)=N'SOME_PEOPLE_LIKE_APPLES_ON_THE_MOON_MOUNTAIN';
--This uses the index (checked against execution plan)
SELECT TOP 1 *
FROM Phrase
WHERE @SearchTerm LIKE Phrase + '%'
ORDER BY PhraseLength DESC;
--This might be even better, check with your high row count.
SELECT TOP 1 *
FROM Phrase
WHERE Phrase = LEFT(@SearchTerm, PhraseLength)
ORDER BY PhraseLength DESC;
GO
--Clean-Up
DROP TABLE Phrase;
The best solution here is to create a full-text search index:
https://msdn.microsoft.com/en-us/library/ms142571.aspx
Full-text search is optimized for this task; once the index is created you can use full-text queries with the CONTAINS predicate to find matches efficiently:
SELECT DISTINCT phrase, len(phrase) l, [id] FROM X WHERE CONTAINS(phrase, @searchTerm)
Full-text search not only allows custom optimization through query hints like OPTIMIZE FOR, it also supports operators like AND and OR within the search terms, handles stopwords, and offers a variety of other text-searching goodies, like finding spelling variations of the same word automatically, filtering by relevance, etc.
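A minimal sketch of the setup, assuming table X has a unique single-column index named PK_X to use as the full-text key (catalog and variable names here are illustrative):
-- create a default full-text catalog and index the phrase column
CREATE FULLTEXT CATALOG PhraseCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON dbo.X(phrase) KEY INDEX PK_X ON PhraseCatalog;
GO
-- prefix-term query: rows whose phrase starts with the given words
DECLARE @searchTerm NVARCHAR(200) = N'"SOME_PEOPLE_LIKE_APPLES*"';
SELECT DISTINCT phrase, LEN(phrase) AS l, [id]
FROM X
WHERE CONTAINS(phrase, @searchTerm);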

Matching sub string in a column

First I apologize for the poor formatting here.
Second I should say up front that changing the table schema is not an option.
So I have a table defined as follows:
Pin varchar
OfferCode varchar
Pin will contain data such as:
abc,
abc123
OfferCode will contain data such as:
123
123~124~125
I need a query to check for a count of a Pin/OfferCode combination and when I say OfferCode, I mean an individual item delimited by the tilde.
For example, if there is one row that looks like abc, 123 and another that looks like abc, 123~124, and I search for a count of Pin=abc, OfferCode=123, I want to get a count = 2.
Obviously I can do a similar query to this:
SELECT count(1) from MyTable (nolock) where OfferCode like '%' + @OfferCode + '%' and Pin = @Pin
using like here is very expensive and I'm hoping there may be a more efficient way.
I'm also looking into using a split string solution. I have a table-valued function SplitString(string, delim) that will return table OutParam, but I'm not quite sure how to apply this to a table column vs a string. Would this even be worthwhile pursuing? It seems like it would be much more expensive, but I'm unable to get a working solution to compare to the like solution.
Your like/% solution is open to a bug if you had offer codes other than 3 digits (if there was offer code 123 and 1234, searching for like '%123%' would return both, which is wrong). You can use your string function this way:
SELECT Pin, count(1)
FROM MyTable (nolock)
CROSS APPLY SplitString(OfferCode,'~') OutParam
WHERE OutParam.Value = @OfferCode and Pin = @Pin
GROUP BY Pin
If you have a relatively small table you can probably get away with this. If you are working with a large number of rows or encountering performance problems, it would be more effective to normalize it as RedFilter suggested.
using like here is very expensive and I'm hoping there may be a more efficient way
The efficient way is to normalize the schema and put each OfferCode in its own row.
Then your query is more like (although you may need to use an intersection table depending on your schema):
select count(*)
from MyTable
where OfferCode = @OfferCode
and Pin = @Pin
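A rough sketch of what that normalization could look like, using a hypothetical PinOfferCode table and assuming the SplitString function returns a column named Value (as it is used elsewhere on this page):
-- hypothetical normalized table: one row per Pin/OfferCode pair
CREATE TABLE PinOfferCode (
    Pin       varchar(10) NOT NULL,
    OfferCode varchar(10) NOT NULL,
    CONSTRAINT PK_PinOfferCode PRIMARY KEY (Pin, OfferCode)
);

-- one-time backfill from the delimited column via the string splitter
INSERT INTO PinOfferCode (Pin, OfferCode)
SELECT DISTINCT t.Pin, s.Value
FROM MyTable t
CROSS APPLY SplitString(t.OfferCode, '~') s;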
Here is one way to use like for this problem, which is standard for getting exact matches when searching delimited strings while avoiding the '%123%' matches '123' and '1234' problem:
-- Create some test data
declare @table table (
    Pin varchar(10) not null
    , OfferCode varchar(100) not null
)
insert into @table select 'abc', '123'
insert into @table select 'abc', '123~124'
-- Mock some proc params
declare @Pin varchar(10) = 'abc'
declare @OfferCode varchar(10) = '123'
-- Run the actual query
select count(*) as Matches
from @table
where Pin = @Pin
-- Append delimiters to find exact matches
and '~' + OfferCode + '~' like '%~' + @OfferCode + '~%'
As you can see, we're adding the delimiters to the searched string and also to the search string in order to find matches, thus avoiding the bugs mentioned by other answers.
I highly doubt that a string splitting function will yield better performance over like, but it may be worth a test or two using some of the more recently suggested methods. If you still have unacceptable performance, you have a few options:
Updated:
Try an index on OfferCode (or on a computed persisted column of '~' + OfferCode + '~'). Contrary to the myth that SQL Server won't use an index with like and wildcards, this might actually help; see the sketch after this list.
Check out full text search.
Create a normalized version of this table using a string splitter. Use this table to run your counts. Update this table according to some schedule or event (trigger, etc.).
If you have some standard search terms, pre-calculate the counts for these and store them on some regular basis.
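As a sketch of the first option (column and index names are made up): wrap OfferCode in delimiters once, persist it, and index it so the LIKE at least scans a narrow index rather than the base table:
-- hypothetical persisted computed column with the delimiters baked in
ALTER TABLE MyTable ADD OfferCodeDelimited AS ('~' + OfferCode + '~') PERSISTED;
CREATE INDEX IX_MyTable_OfferCodeDelimited ON MyTable (OfferCodeDelimited) INCLUDE (Pin);

-- the count query can then target the computed column directly
SELECT COUNT(*) AS Matches
FROM MyTable
WHERE Pin = @Pin
  AND OfferCodeDelimited LIKE '%~' + @OfferCode + '~%';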
Actually, the LIKE condition is going to have much less cost than doing any sort of string manipulation and comparison.
http://www.simple-talk.com/sql/performance/the-seven-sins-against-tsql-performance/

Get first or second values from a comma separated value in SQL

I have a column that stores data like (42,12). Now I want to fetch 42 or 12 (two different select queries). I have searched and found some similar but much more complex scenarios. Is there any easy way of doing it? I am using MSSQL Server 2005.
Given that there will always be only two values and they will be integers
The reason you have this problem is because the database (which you may not have any control over), violates first normal form. Among other things, first normal form says that each column should hold a single value, not multiple values. This is bad design.
Now, having said this, the first solution that pops into my head is to write a UDF that parses the value in this column, based on the delimiter and returns either the first or second value.
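A minimal sketch of such a UDF (the name GetCsvPart is made up), assuming the column always holds exactly two comma-separated integers:
CREATE FUNCTION dbo.GetCsvPart (@Val varchar(50), @Part int)
RETURNS int
AS
BEGIN
    -- @Part = 1 returns the value before the comma, @Part = 2 the value after it
    RETURN CASE @Part
        WHEN 1 THEN CAST(LEFT(@Val, CHARINDEX(',', @Val) - 1) AS int)
        WHEN 2 THEN CAST(SUBSTRING(@Val, CHARINDEX(',', @Val) + 1, LEN(@Val)) AS int)
    END;
END;
GO
-- usage: SELECT dbo.GetCsvPart('42,12', 1);  -- returns 42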
You can try something like this
DECLARE @Table TABLE(
    Val VARCHAR(50)
)
INSERT INTO @Table (Val) SELECT '42,12'
SELECT *,
    CAST(LEFT(Val, CHARINDEX(',', Val) - 1) AS INT) FirstValue,
    CAST(RIGHT(Val, LEN(Val) - CHARINDEX(',', Val)) AS INT) SecondValue
FROM @Table
You can use something like this:
SELECT SUBSTRING_INDEX(field, ',', 1)
Note: SUBSTRING_INDEX is a MySQL function, so it won't work as-is on SQL Server, and it's not an efficient way of doing things in an RDBMS anyway. Consider normalizing your database.

Sqlite: SQL to find the most complete prefix

I have a sqlite table containing records of variable length number prefixes. I want to be able to find the most complete prefix against another variable length number in the most efficient way:
eg. The table contains a column called prefix with the following numbers:
1. 1234
2. 12345
3. 123456
What would be an efficient SQLite query to find the second record as the most complete match against 12345999?
Thanks.
A neat trick here is to reverse a LIKE clause -- rather than saying
WHERE prefix LIKE '...something...'
as you would often do, turn the prefix into the pattern by appending a % to the end and comparing it to your input as the fixed string. Order by length of prefix descending, and pick the top 1 result.
I've never used Sqlite before, but just downloaded it and this works fine:
sqlite> CREATE TABLE whatever(prefix VARCHAR(100));
sqlite> INSERT INTO WHATEVER(prefix) VALUES ('1234');
sqlite> INSERT INTO WHATEVER(prefix) VALUES ('12345');
sqlite> INSERT INTO WHATEVER(prefix) VALUES ('123456');
sqlite> SELECT * FROM whatever WHERE '12345999' LIKE (prefix || '%')
ORDER BY length(prefix) DESC LIMIT 1;
output:
12345
Personally I use the following method; it will use indexes.
The list ('1','12','123','1234','12345','123459','1234599','12345999','123459999') of all prefixes of the input number should be generated by the client:
SELECT * FROM whatever WHERE prefix in
('1','12','123','1234','12345','123459','1234599','12345999','123459999')
ORDER BY length(prefix) DESC LIMIT 1;
select foo, 1 quality from bar where foo like '123%'
union
select foo, 2 quality from bar where foo like '1234%'
order by quality desc limit 1
I haven't tested it, but the idea would work in other dialects of SQL
A couple of assumptions:
You are joining with some other table, so you want to know the largest variable-length prefix for each record in the table you are joining with.
Your table of prefixes is actually more than just the three you provide in your example... otherwise you could hardcode the logic and move on.
prefix_table.prefix
1234
12345
123456
etc.
foo.field
12345999
123999
select
    a.field,
    b.prefix,
    max(length(b.prefix)) as prefix_length
from
    foo a
    inner join prefix_table b on b.prefix = substr(a.field, 1, length(b.prefix))
group by
    a.field
-- grouping by a.field only: SQLite returns the b.prefix from the row with the max length,
-- so each field comes back with its single longest matching prefix
Note that this is untested but logically should make sense.
Without resorting to a specialized index, the best performing strategy may be to hunt for the answer.
Issue a LIKE query for each possible prefix, starting with the longest. Stop once you get rows returned.
It's certainly not the prettiest way to achieve what you want, but as opposed to the other suggestions, indexes will be considered by the query planner. As always, it depends on your actual data; in particular, on how many rows are in your table and how long the average hunt will be.
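A sketch of the hunt against the example data, assuming an index on prefix; the candidate prefixes of 12345999 (longest first) would be generated by the application, and plain equality is used here instead of LIKE since each candidate is a complete value:
-- each statement is issued in turn until one returns a row
SELECT prefix FROM whatever WHERE prefix = '12345999' LIMIT 1; -- no hit
SELECT prefix FROM whatever WHERE prefix = '1234599'  LIMIT 1; -- no hit
SELECT prefix FROM whatever WHERE prefix = '123459'   LIMIT 1; -- no hit
SELECT prefix FROM whatever WHERE prefix = '12345'    LIMIT 1; -- hit: stop here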

Get all records that contain a number

Is it possible to write a query to get all records from a table where a certain field contains a numeric value?
Something like "select street from tbladdress where street like '%0%' or street like '%1%'" etc. etc.
Only with a single function instead?
Try this
declare @t table(street varchar(50))
insert into @t
select 'this address is 45/5, Some Road' union all
select 'this address is only text'
select street from @t
where street like '%[0-9]%'
street
this address is 45/5, Some Road
Yes, but it will be inefficient, and probably slow, with a wildcard on the leading edge of the pattern
LIKE '%[0-9]%'
Searching for text within a column is horrendously inefficient and does not scale well (per-row functions, as a rule, all have this problem).
What you should be doing is trading disk space (which is cheap) for performance (which is never cheap) by creating a new column, hasNumerics for example, adding an index to it, then using an insert/update trigger to set it based on the data going into the real column.
This means the calculation is done only when the row is created or modified, not every single time you extract the data. Databases are almost always read far more often than they're written and using this solution allows you to amortize the cost of the calculation over many select statement executions.
Then, when you want your data, just use:
select * from mytable where hasNumerics = 1; -- or true or ...
and watch it leave a regular expression query or like '%...%' monstrosity in its dust.
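A rough sketch of that approach in T-SQL; the column, index, and trigger names are made up, the table is assumed to have an id primary key, and a persisted computed column would be an alternative to the trigger:
-- one-time: add the flag column and index it
ALTER TABLE mytable ADD hasNumerics bit NOT NULL DEFAULT 0;
CREATE INDEX IX_mytable_hasNumerics ON mytable (hasNumerics);
GO
-- keep the flag in sync whenever rows are inserted or updated
CREATE TRIGGER trg_mytable_hasNumerics
ON mytable
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE t
    SET hasNumerics = CASE WHEN t.street LIKE '%[0-9]%' THEN 1 ELSE 0 END
    FROM mytable t
    JOIN inserted i ON i.id = t.id;   -- assumes an id primary key
END;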
To fetch rows that contain no letters at all (only digits and other non-alphabetic characters), you can use this query:
select street
from tbladdress
where upper(street) = lower(street)
Works in Oracle.
I found this solution:
select street from tbladdress with (nolock) where patindex('%[0-9]%', street) > 0
(PATINDEX returns the position of the first match, so compare with > 0 to find a digit anywhere in the string; = 1 would only match rows where the digit is the very first character.) It took me 2 minutes to search 3 million rows on an unindexed field.