I have a SQL Server table with around 10 columns containing various identifiers, both alphanumeric and numeric. I am writing a procedure which will allow a substring match to be performed across an arbitrary subset of those columns. For example, "the value in column B contains substring bSub AND the value in column D contains substring dSub AND the value in column G contains substring gSub".
The following works, but is blisteringly slow:
SELECT * FROM Table T
WHERE
(@aSub IS NULL OR T.A LIKE CONCAT('%', @aSub, '%')) AND
(@bSub IS NULL OR T.B LIKE CONCAT('%', @bSub, '%')) AND
...
(@jSub IS NULL OR T.J LIKE CONCAT('%', @jSub, '%'))
Is there another way to structure this query which would be more performant? Or any techniques to speed things up? I believe that indexes won't help due to the substring match LIKE('%...).
In general, I'd say that string matching like this is always relatively slow.
With that being said, there are a couple of things you could try:
Change LIKE to CHARINDEX. Since you don't actually match patterns, I suspect CHARINDEX is a bit more performant; a sketch of this rewrite is shown after this list.
Instead of checking OR @a IS NULL etc., build the query dynamically from the parameters that are actually not NULL. Something like:
declare @sql nvarchar(max) = 'select .... WHERE 1 = 1'
if @a is not null
    set @sql = @sql + ' and CHARINDEX(@a, a) > 0'
if @b is not null
    set @sql = @sql + ' and CHARINDEX(@b, b) > 0'
...
exec sp_executesql @sql, N'@a nvarchar(100), @b nvarchar(100)', @a = @a, @b = @b...
This would only check relevant columns.
Create specific indexes for all (or the most frequently searched) columns, with the other needed columns as INCLUDEs. This might or might not help, but if you're lucky you can avoid a full clustered index scan, which is a lot more data to travel through than a few specific columns. This step is a bit tricky and might not help if SQL Server decides it would rather scan the clustered index anyway.
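For the CHARINDEX suggestion, a minimal sketch of the rewritten static query (parameter and column names are taken from the question, and [Table] stands in for the real table name; whether this actually beats LIKE needs to be measured on real data):
SELECT T.*
FROM [Table] AS T   -- [Table] is the question's placeholder name
WHERE (@aSub IS NULL OR CHARINDEX(@aSub, T.A) > 0)   -- CHARINDEX > 0 means "A contains @aSub"
  AND (@bSub IS NULL OR CHARINDEX(@bSub, T.B) > 0)
  -- ... repeat for the remaining columns ...
  AND (@jSub IS NULL OR CHARINDEX(@jSub, T.J) > 0);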
I believe that indexes won't help due to the substring match LIKE('%...).
Yep, sadly this is true and there's no easy way around this. You could do something advanced like Trigram indexing but that's not a trivial task.
Without some sort of filter you need to perform string evaluation and filtering on every relevant column for every row. This is no way to treat your optimizer.
So, if any of your columns are blank or null you can skip that row. With this in mind you can pre-filter these rows out. Here's some sample data to illustrate my point.
--==== Sample data
DECLARE @t TABLE
(
    ID INT IDENTITY,
    A VARCHAR(10),
    B VARCHAR(10),
    C VARCHAR(10),
    BlankOrNull AS IIF(NULLIF(A,'') IS NULL OR
                       NULLIF(B,'') IS NULL OR
                       NULLIF(C,'') IS NULL, 1, 0) PERSISTED,
    INDEX ix_blanknulls CLUSTERED(BlankOrNull, ID)
);

INSERT @t
VALUES('00ABC','00ABCD00','777'),('ABCD','XXXXX','XZZZ'),('8','9',''),(NULL,NULL,NULL),
      ('ABC','XXXX',NULL),('ABCD-000','XXXXX00','XZZZ00');
Note the indexed computed column that I added, named BlankOrNull. If any row contains a blank or NULL column you are done with that row; no further evaluation is needed.
Without the indexed/computed column you would do this:
--==== Sample Variables
DECLARE
    @aSub VARCHAR(10) = 'ABC',
    @bSub VARCHAR(10) = 'XXX',
    @cSub VARCHAR(10) = 'ZZZ';

SELECT t.*
FROM (VALUES(@aSub,@bSub,@cSub)) AS s(A,B,C)
JOIN @t AS t
  ON  CHARINDEX(ISNULL(@aSub,t.A),t.A) > 0
  AND CHARINDEX(ISNULL(@bSub,t.B),t.B) > 0
  AND CHARINDEX(ISNULL(@cSub,t.C),t.C) > 0;
With the aforementioned index in place you can add a filter like so:
SELECT t.*
FROM (VALUES(@aSub,@bSub,@cSub)) AS s(A,B,C)
JOIN @t AS t
  ON  CHARINDEX(ISNULL(@aSub,t.A),t.A) > 0
  AND CHARINDEX(ISNULL(@bSub,t.B),t.B) > 0
  AND CHARINDEX(ISNULL(@cSub,t.C),t.C) > 0
WHERE t.BlankOrNull = 0;
Now let's run both and compare execution plans:
Next compare the actual number of rows each query processes:
Leveraging the index on the computed column, we are able to process half the rows (based on my sample data; I don't know what the real stuff looks like).
I am passing in a comma-delimited list of values that I need to compare to the database
Here is an example of the values I'm passing in:
#orgList = "1123, 223%, 54%"
To use the wildcard I think I have to do LIKE but the query runs a long time and only returns 14 rows (the results are correct, but it's just taking forever, probably because I'm using the join incorrectly)
Can I make it better?
This is what I do now:
declare @tempTable Table (SearchOrg nvarchar(max))
insert into @tempTable
select * from dbo.udf_split(@orgList) as split
-- this splits the values at the comma and puts them in a temp table
-- then I do a join on the main table and the temp table to do a like on it....
-- but I think it's not right because it's too long.
select something
from maintable gt
join #tempTable tt on gt.org like tt.SearchOrg
where
    AYEAR = ISNULL(@year, ayear)
    and (AYEAR >= ISNULL(@yearR1, ayear) and ayear <= ISNULL(@yearR2, ayear))
    and adate = ISNULL(@Date, adate)
    and (adate >= ISNULL(@dateR1, adate) and adate <= ISNULL(@DateR2, adate))
The final result would be all rows where maintable.org is 1123, or starts with 223, or starts with 54.
The reason for my date craziness is that sometimes the stored procedure only checks for a year, sometimes for a year range, sometimes for a specific date and sometimes for a date range... everything that's not used is passed in as null.
Maybe the problem is there?
Try something like this:
Declare @tempTable Table
(
    -- Since the column is a varchar(10), you don't want to use nvarchar here.
    SearchOrg varchar(20)
);

INSERT INTO @tempTable
SELECT * FROM dbo.udf_split(@orgList);
SELECT
something
FROM
maintable gt
WHERE
some where statements go here
And
Exists
(
SELECT 1
FROM @tempTable tt
WHERE gt.org Like tt.SearchOrg
)
Such a dynamic query with optional filters and LIKE driven by a table (!) is very hard to optimize because almost nothing is statically known. The optimizer has to create a very general plan.
You can do two things to speed this up by orders of magnitude:
Play with OPTION (RECOMPILE). If the compile times are acceptable this will at least deal with all the optional filters (but not with the LIKE table).
Do code generation and EXEC sp_executesql the code. Build a query with all LIKE clauses inlined into the SQL so that it looks like this: WHERE a LIKE @like0 OR a LIKE @like1 ... (not sure if you need OR or AND). This allows the optimizer to get rid of the join and just execute a normal predicate.
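A rough sketch of the second option against this question's tables (patterns are inlined as escaped literals rather than numbered parameters for brevity, STRING_AGG assumes SQL Server 2017 or later, and only one of the optional date/year filters is shown):
DECLARE @sql nvarchar(max);

-- Build one LIKE per search term so the optimizer sees the concrete patterns
SELECT @sql =
      N'SELECT something
        FROM maintable gt
        WHERE ('
    + STRING_AGG(N'gt.org LIKE ''' + REPLACE(tt.SearchOrg, '''', '''''') + N'''', N' OR ')
    + N')
          AND gt.AYEAR = ISNULL(@year, gt.AYEAR)'   -- add the remaining optional filters the same way
FROM @tempTable tt;

-- With untrusted input, prefer generating @like0, @like1, ... parameters instead of literals
EXEC sp_executesql @sql, N'@year int', @year = @year;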
Your query may be difficult to optimize. Part of the question is what is in the where clause. You probably want to filter these first, and then do the join using like. Or, you can try to make the join faster, and then do a full table scan on the results.
SQL Server should optimize a like statement of the form 'abc%' -- that is, where the wildcard is at the end. (See here, for example.) So, you can start with an index on maintable.org. Fortunately, your examples meet this criteria. However, if you have '%abc' -- the wildcard comes first -- then the optimization won't work.
For the index to work best, it might also need to take into account the conditions in the where clause. In other words, adding the index is suggestive, but the rest of the query may preclude the use of the index.
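A hypothetical index along those lines (the INCLUDE list is an assumption; it should carry whatever columns the query actually filters on or returns):
-- Lets LIKE '223%' style predicates seek on org; AYEAR/adate come along for the WHERE clause
CREATE NONCLUSTERED INDEX IX_maintable_org
    ON maintable (org)
    INCLUDE (AYEAR, adate);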
And, let me add, the best solution for these types of searches is to use the full text search capability in SQL Server (see here).
First I apologize for the poor formatting here.
Second I should say up front that changing the table schema is not an option.
So I have a table defined as follows:
Pin varchar
OfferCode varchar
Pin will contain data such as:
abc,
abc123
OfferCode will contain data such as:
123
123~124~125
I need a query to check for a count of a Pin/OfferCode combination and when I say OfferCode, I mean an individual item delimited by the tilde.
For example, if there is one row that looks like abc, 123 and another that looks like abc, 123~124, and I search for a count of Pin=abc, OfferCode=123, I want to get a count = 2.
Obviously I can do a similar query to this:
SELECT count(1) from MyTable (nolock) where OfferCode like '%' + @OfferCode + '%' and Pin = @Pin
using like here is very expensive and I'm hoping there may be a more efficient way.
I'm also looking into using a split string solution. I have a table-valued function SplitString(string, delim) that will return a table OutParam, but I'm not quite sure how to apply this to a table column vs a string. Would this even be worthwhile pursuing? It seems like it would be much more expensive, but I'm unable to get a working solution to compare to the LIKE solution.
Your like/% solution is open to a bug if you had offer codes other than 3 digits (if there were offer codes 123 and 1234, searching for like '%123%' would return both, which is wrong). You can use your string function this way:
SELECT Pin, count(1)
FROM MyTable (nolock)
CROSS APPLY SplitString(OfferCode,'~') OutParam
WHERE OutParam.Value = @OfferCode and Pin = @Pin
GROUP BY Pin
If you have a relatively small table you can probably get away with this. If you are working with a large number of rows or encountering performance problems, it would be more effective to normalize it as RedFilter suggested.
using like here is very expensive and I'm hoping there may be a more efficient way
The efficient way is to normalize the schema and put each OfferCode in its own row.
Then your query is more like (although you may need to use an intersection table depending on your schema):
select count(*)
from MyTable
where OfferCode = @OfferCode
and Pin = @Pin
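A minimal sketch of such an intersection table, populated once with the splitter mentioned in the question (the table and constraint names here are invented; the count query above would then seek on the primary key):
-- Hypothetical intersection table: one row per Pin/OfferCode pair
CREATE TABLE PinOfferCode
(
    Pin       varchar(10) NOT NULL,
    OfferCode varchar(10) NOT NULL,
    CONSTRAINT PK_PinOfferCode PRIMARY KEY (Pin, OfferCode)
);

-- Populate it from the delimited column using the question's SplitString function
INSERT INTO PinOfferCode (Pin, OfferCode)
SELECT DISTINCT t.Pin, s.Value
FROM MyTable t
CROSS APPLY SplitString(t.OfferCode, '~') s;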
Here is one way to use like for this problem, which is standard for getting exact matches when searching delimited strings while avoiding the '%123%' matches '123' and '1234' problem:
-- Create some test data
declare @table table (
    Pin varchar(10) not null
    , OfferCode varchar(100) not null
)

insert into @table select 'abc', '123'
insert into @table select 'abc', '123~124'

-- Mock some proc params
declare @Pin varchar(10) = 'abc'
declare @OfferCode varchar(10) = '123'

-- Run the actual query
select count(*) as Matches
from @table
where Pin = @Pin
    -- Append delimiters to find exact matches
    and '~' + OfferCode + '~' like '%~' + @OfferCode + '~%'
As you can see, we're adding the delimiters to both the searched string and the search string in order to find matches, thus avoiding the bugs mentioned by other answers.
I highly doubt that a string splitting function will yield better performance over like, but it may be worth a test or two using some of the more recently suggested methods. If you still have unacceptable performance, you have a few options:
Updated:
Try an index on OfferCode (or on a computed persisted column of '~' + OfferCode + '~'). Contrary to the myth that SQL Server won't use an index with like and wildcards, this might actually help; a sketch appears after this list.
Check out full text search.
Create a normalized version of this table using a string splitter. Use this table to run your counts. Update this table according to some schedule or event (trigger, etc.).
If you have some standard search terms, pre-calculate the counts for these and store them on some regular basis.
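For the first option, a hedged sketch of the computed-column variant (the column and index names are invented; check the execution plan to confirm the index is actually picked up for this pattern):
-- Persisted computed column that pre-wraps the list in delimiters, plus an index on it
ALTER TABLE MyTable ADD OfferCodeDelimited AS ('~' + OfferCode + '~') PERSISTED;
CREATE NONCLUSTERED INDEX IX_MyTable_OfferCodeDelimited
    ON MyTable (OfferCodeDelimited) INCLUDE (Pin);

-- Same search as above, but against the narrower indexed column
SELECT COUNT(*) AS Matches
FROM MyTable
WHERE Pin = @Pin
  AND OfferCodeDelimited LIKE '%~' + @OfferCode + '~%';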
Actually, the LIKE condition is going to have much less cost than doing any sort of string manipulation and comparison.
http://www.simple-talk.com/sql/performance/the-seven-sins-against-tsql-performance/
On several SQL queries I need to check if a field starts with a character.
There are several ways to do it, which one is better in performance/standard?
I usually use
tb.field LIKE 'C%'
but I can also use
LEFT(LTRIM(tb.Field),1) = 'C'
I know well the uses of each case, but not in terms of performance.
I'd go with the first one, LIKE 'C%'; it'll use an index on the field if there is one, rather than having to do a full table scan.
If you really need to include the whitespace LTRIM trimming in the query, you could create a persisted computed column with the value LEFT(LTRIM(tb.Field), 1) and put an index on it.
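A rough sketch of that computed-column idea, using the tb/Field names from the question (the column and index names are invented):
-- Persisted first non-blank character, which can back an equality seek
ALTER TABLE tb ADD FieldFirstChar AS LEFT(LTRIM(Field), 1) PERSISTED;
CREATE NONCLUSTERED INDEX IX_tb_FieldFirstChar ON tb (FieldFirstChar);

-- The search then becomes a plain, seekable equality
SELECT *
FROM tb
WHERE FieldFirstChar = 'C';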
LIKE 'C%' is going to perform better than a LEFT(LTRIM()).
The LIKE predicate can still use a supporting index to get at the data you're looking for.
However, when SQL Server encounters LEFT(LTRIM(tb.Field), 1) = 'C', the database can't determine what you mean. In order to perform a match, SQL Server must scan every row, LTRIM the data and then examine the first character. The end result is, most likely, a full table scan.
The first query is just a bit faster than the other. I've measured it with my query speed measurement script.
Try it yourself:
DECLARE @Measurements TABLE(
    MeasuredTime INT NOT NULL
)

DECLARE @ExecutionTime INT
DECLARE @TimesMeasured INT
SET @TimesMeasured = 0

WHILE @TimesMeasured < 1000
BEGIN
    DECLARE @StartTime DATETIME
    SET @StartTime = GETDATE()

    -- your select, or whatever query you want to measure

    INSERT INTO @Measurements
    SELECT DATEDIFF(millisecond, @StartTime, GETDATE())

    SET @TimesMeasured = @TimesMeasured + 1
END

DECLARE @AvgTime INT
SELECT @AvgTime = AVG(MeasuredTime) FROM @Measurements
SELECT @AvgTime AS AvgMilliseconds
UPDATE 2:
In one of the rows, in the column closed_by, it contains a null. If I replace the null with text, the query starts working, but it will not work with null. So it seems null is the problem, but the query should return rows which have null too, as pqr does not equal null.
UPDATE 1:
I have also tried set @user = 'pqr', but it makes no difference. It still returns 0 rows, when it should be returning 1 row, as 1 of the rows does not contain pqr.
ORIGINAL QUESTION:
I am trying to return rows which do not contain the id provided:
declare @user varchar(3)
set @user = 'pqr'
select * from table1 where not( closed_by like @user )
closed_by contains data like
abc,def,ghi,jkl,mno
But this gives me no errors and returns no data; it should be returning a single row, as pqr is not in one of the rows.
Not sure what I am doing wrong.
You may want to check the syntax of the LIKE operator - it accepts a pattern, and so you would need to use something like this instead:
declare @user varchar(5)
set @user = '%pqr%'
The '%' is a wildcard and matches any string of zero or more characters.
FYI - SQL Server won't be able to use indexes with a LIKE pattern that starts with a wildcard, so you may find that your query performs badly with large data sets.
NULL is unknown and therefore it is unknown whether it is like your pattern or not. You can solve this easily by using:
DECLARE @user VARCHAR(5) = '%pqr%';
SELECT ... WHERE COALESCE(closed_by, '') NOT LIKE @user;
And in fact to be more accurate you probably want:
DECLARE @user VARCHAR(7) = '%,pqr,%';
SELECT ... WHERE COALESCE(',' + closed_by + ',', '') NOT LIKE @user;
This way, if you later have data like 'trv,pqrs,ltt' it won't return a false positive.
However, having said all of that, have you considered storing this in a more normalized fashion? Storing comma-separated lists can be very problematic, and I can assure you this won't be the last challenge you face dealing with data structured this way.
You need to include a wildcard character:
declare
    @user varchar(5)
set @user = '%pqr%'

select *
from table1
where isnull(closed_by, '') not like @user
You need to use the % and _ wildcard characters when using LIKE. Without them you actually just have WHERE NOT (closed_by = @user).
Also, be careful of accidental matches. For example, LIKE '%a%' would match your example record. For such cases, I tend to ensure that the comma-delimited lists also have commas at the start and end, such as: ',abc,def,ghi,jkl,mno,' LIKE '%,ghi,%'
But, moreover, you're using a relational database. You would be better off with each entry as its own record in a normalised structure. Although this gives 1:many relationships rather than 1:1 relationships, you get the benefits of indexes and much more flexibility in your queries. (Your LIKE example can't use an index.)
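A hypothetical normalised layout for closed_by (all names here are invented, and table1 is assumed to have an id key):
-- One row per (table1 row, user) instead of a comma-separated list
CREATE TABLE table1_closed_by
(
    table1_id INT        NOT NULL,  -- assumed key of table1
    closed_by VARCHAR(3) NOT NULL,
    CONSTRAINT PK_table1_closed_by PRIMARY KEY (table1_id, closed_by)
);

-- "Rows not closed by @user" becomes an indexable NOT EXISTS; @user is plain 'pqr', no wildcards
SELECT t.*
FROM table1 t
WHERE NOT EXISTS (SELECT 1
                  FROM table1_closed_by c
                  WHERE c.table1_id = t.id
                    AND c.closed_by = @user);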
REPLY TO UPDATE 2:
Be careful of how you assume NULL logic works.
The result of NULL LIKE '%pqr%' is NULL
The result of NOT (NULL) is NULL
You need to change your code to use WHERE NOT (ISNULL(closed_by, '') LIKE '%pqr%')
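A tiny self-contained demonstration of that three-valued logic:
DECLARE @closed_by VARCHAR(100) = NULL;   -- the problem row's value

SELECT
    CASE WHEN @closed_by LIKE '%pqr%'                   THEN 'true' ELSE 'unknown/false' END AS LikeResult,
    CASE WHEN NOT (@closed_by LIKE '%pqr%')             THEN 'true' ELSE 'unknown/false' END AS NotLikeResult,
    CASE WHEN NOT (ISNULL(@closed_by, '') LIKE '%pqr%') THEN 'true' ELSE 'unknown/false' END AS FixedResult;
-- Returns: unknown/false, unknown/false, true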
Try this:
select * from table1 where closed_by not like @user
And you may need to add the appropriate '%' characters to tell SQL Server which portion of the value to search. For example 'pqr%'
It sounds to me that you really are looking for equivalence, and not wildcard matches. Try this:
select * from table1 where closed_by <> @user
We have a database with a bunch of wide tables (40-80 columns each) and just found a bug that introduced NULL values into about 500 of the records. The NULL values can appear in any of the columns (all are integer columns, see image below) but these NULL values are causing issues with one of our reporting systems that cannot be changed easily. We need to replace the NULL values with a specific static value (in this case 99), but since this change has to be made on a per-column basis for over 250 different columns I would rather not write individual TSQL scripts updating each column one by one.
My brain is too fried right now to think up a clever solution, so my question is: how can I perform this task on all columns of a table (or better yet, multiple tables) using a simple and readable SQL query? I can isolate the records easily enough using a chain of WHERE (Answer_1 IS NULL) OR (Answer_2 IS NULL) OR ... or even by AdministrationID numbers for each table, but this trick won't work when updating, as the WHERE clause filters per row, not per column. Any advice?
Here is a sample query showing a few of the records from 4 different tables:
There isn't any convention to this -- if you want to only process records where respective columns are NULL, you need to use:
WHERE Answer_1 IS NULL
OR Answer_2 IS NULL
OR ...
But you could use this in the UPDATE statement:
UPDATE YOUR_TABLE
SET col1 = COALESCE(col1, 99),
col2 = COALESCE(col2, 99),
col3 = ...
The logic is that the value will be updated to 99 only if the column value is NULL, because of how COALESCE works: it returns the first non-NULL value, processing the list provided from left to right.
Just poll the sys.columns table for each table and create some dynamic SQL... It's brute force, but it saves you from having to write all the T-SQL out.
For example:
DECLARE @TableName AS VARCHAR(255)
SET @TableName = 'ReplaceWithYourTableName'

SELECT 'UPDATE ' + @TableName + ' SET ' + CAST(Name AS VARCHAR(255)) + ' = 99
        WHERE ' + CAST(Name AS VARCHAR(255)) + ' IS NULL'
FROM sys.columns
WHERE object_id = OBJECT_ID(@TableName)
  AND system_type_id = 56 -- ints only
Since you have to do this all over the place, I wrote some JavaScript to help you build the SQL. Cut and paste this into your browser's address bar to get your SQL.
javascript:sql='update your table set ';x=0;while(x <= 40){sql += 'answer_'+x+ ' = coalesce(answer_'+x+',99),\n';x++;};alert(sql);
I don't like the idea of manipulating the data itself for the purpose of reporting. If you change the NULL values to 99 just to make your reporting easier, then I consider that data corrupted. What if there are other consumers apart from reporting that need the genuine data?
I would rather write an intelligent query for the report. For example, if you use ISNULL(columnname, 99), it would return 99 whenever the column value is NULL.
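For instance, the report query could make the substitution at read time instead of rewriting the stored data (column names borrowed from the question's example):
-- NULLs show up as 99 only in the report output; the underlying rows stay untouched
SELECT AdministrationID,
       ISNULL(Answer_1, 99) AS Answer_1,
       ISNULL(Answer_2, 99) AS Answer_2
       -- ...and so on for the remaining answer columns
FROM YOUR_TABLE;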