Does Adding Indexes speed up String Wildcard % searches? - sql

We are conducting a wildcard search on a database table with a string column. Does creating a non-clustered index on that column help with wildcard searches? Will it improve performance?
CREATE TABLE [dbo].[Product](
[ProductId] [int] NOT NULL,
[ProductName] [varchar](250) NOT NULL,
[ModifiedDate] [datetime] NOT NULL,
...
CONSTRAINT [PK_ProductId] PRIMARY KEY CLUSTERED
(
[ProductId] ASC
)
)
Proposed Index:
CREATE NONCLUSTERED INDEX [IX_Product_ProductName] ON [dbo].[Product] ([ProductName])
for this query
select * from dbo.Product where ProductName like '%furniture%'
Currently using Microsoft SQL Server 2019.

Creating a normal index will not help (*), but a full-text index will, though you would have to change your query to something like this:
select * from dbo.Product where CONTAINS(ProductName, 'furniture')
(* -- well, it can be slightly helpful, in that it can reduce a scan over every row and column in your table into a scan over merely every row and only the relevant columns. However, it will not achieve the orders of magnitude performance boost that we normally expect from indexes that turn scans into single seeks.)
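For reference, a minimal full-text setup for this table might look something like the sketch below. The catalog name is illustrative, the Full-Text Search feature must be installed, and the key index is the existing PK_ProductId primary key:
-- illustrative full-text setup; requires the SQL Server Full-Text Search feature
CREATE FULLTEXT CATALOG ProductCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON [dbo].[Product] ([ProductName])
    KEY INDEX [PK_ProductId];
-- word search without a leading-wildcard LIKE
SELECT * FROM dbo.Product WHERE CONTAINS(ProductName, 'furniture');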

For a double-ended wildcard search as shown, an index cannot help you by restricting the rows SQL Server has to look at - a full scan will be carried out. But it can help with the amount of data that has to be retrieved from disk.
In ProductName like '%furniture%', ProductName could start or end with any string, so no index can reduce the rows that have to be inspected.
However, if a row in your Product table is 1,000 characters wide and you have 10,000 rows, a table scan has to read roughly 10,000 * 1,000 characters. If you have an index on ProductName, and ProductName is only 50 characters, a scan of that index only has to read roughly 10,000 * 50.
Note: if the query were a single-ended wildcard search such as 'furniture%' (with the % only at the end), then the proposed index would certainly help, as shown below.
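For illustration, a trailing-wildcard query like the one below can seek on the 'furniture' prefix of the proposed index (listing only the indexed and clustering-key columns keeps the query covered and avoids key lookups):
-- can use an index seek on IX_Product_ProductName
SELECT ProductId, ProductName
FROM dbo.Product
WHERE ProductName LIKE 'furniture%';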

First, you can use full-text search (FTS) to find words within sentences, including prefix matches (words beginning with a given string).
For words ending with, or containing, a given string you can use a rotated-string indexing technique:
CREATE TABLE T_WRD
(WRD_ID BIGINT IDENTITY PRIMARY KEY,
WRD_WORD VARCHAR(64) COLLATE Latin1_General_100_BIN NOT NULL UNIQUE,
WRD_DROW AS REVERSE(WRD_WORD) PERSISTED NOT NULL UNIQUE,
WRD_WORD2 VARCHAR(64) COLLATE Latin1_General_100_CI_AI NOT NULL) ;
GO
CREATE TABLE T_WORD_ROTATE_STRING_WRS
(WRD_ID BIGINT NOT NULL REFERENCES T_WRD (WRD_ID),
WRS_ROTATE SMALLINT NOT NULL,
WRD_ID_PART BIGINT NOT NULL REFERENCES T_WRD (WRD_ID),
PRIMARY KEY (WRD_ID, WRS_ROTATE));
GO
CREATE OR ALTER TRIGGER E_I_WRD
ON T_WRD
FOR INSERT
AS
SET NOCOUNT ON;
-- splitting words
WITH R AS
(
SELECT WRD_ID, TRIM(WRD_WORD) AS WRD_WORD, 0 AS ROTATE
FROM INSERTED
UNION ALL
SELECT WRD_ID, RIGHT(WRD_WORD, LEN(WRD_WORD) -1), ROTATE + 1
FROM R
WHERE LEN(WRD_WORD) > 1
)
SELECT *
INTO #WRD
FROM R;
-- inserting missing words
INSERT INTO T_WRD (WRD_WORD, WRD_WORD2)
SELECT WRD_WORD, LOWER(WRD_WORD) COLLATE SQL_Latin1_General_CP1251_CI_AS
FROM #WRD
WHERE WRD_WORD NOT IN (SELECT WRD_WORD
FROM T_WRD);
-- inserting cross reference words
INSERT INTO T_WORD_ROTATE_STRING_WRS
SELECT M.WRD_ID, ROTATE, D.WRD_ID
FROM #WRD AS M
JOIN T_WRD AS D
ON M.WRD_WORD = D.WRD_WORD
WHERE NOT EXISTS(SELECT 1/0
FROM T_WORD_ROTATE_STRING_WRS AS S
WHERE S.WRD_ID = M.WRD_ID
AND S.WRS_ROTATE = ROTATE);
GO
Now you can insert into the first table all the words you want from your sentences, and find them by suffix or by substring by querying those two tables.
As an example, inserting a word:
WITH
T AS (SELECT 'électricité' AS W)
INSERT INTO T_WRD (WRD_WORD, WRD_WORD2)
SELECT W, LOWER(CAST(W AS VARCHAR(64)) COLLATE SQL_Latin1_General_CP1251_CI_AS) AS W2
FROM T;
You can now use:
SELECT * FROM T_WRD;
SELECT * FROM T_WORD_ROTATE_STRING_WRS;
to see the generated words and rotations used to find those partial words.
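A sketch of how suffix and substring searches might then be written against these two tables (the search strings are just examples):
-- words ending with 'cité': seek on the persisted reversed column
SELECT WRD_WORD
FROM T_WRD
WHERE WRD_DROW LIKE REVERSE('cité') + '%';
-- words containing 'ctri': seek the rotated (suffix) words starting with 'ctri',
-- then follow the rotation table back to the original words
SELECT DISTINCT W.WRD_WORD
FROM T_WRD AS S
JOIN T_WORD_ROTATE_STRING_WRS AS R ON R.WRD_ID_PART = S.WRD_ID
JOIN T_WRD AS W ON W.WRD_ID = R.WRD_ID
WHERE S.WRD_WORD LIKE 'ctri%';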

It depends on the optimizer. LIKE usually requires a full table scan. If the optimizer can scan an index for matches, it will do an index scan, which is faster than a full table scan.
If the optimizer does not choose an index scan, you can force it to use an index. You must measure execution times to determine whether the index scan actually decreases search time.
Use with (index(index_name)) to force use of an index, e.g.
select * from t1 with (index(t1i1)) where v1 like '456%'
SQL Server Index - Any improvement for LIKE queries?
If you use a %search% pattern, the optimizer will always perform a full scan.
Another technique for speeding up searches is to use substrings and exact-match comparisons.
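A sketch of that idea, assuming searches always target a known leading fragment of the value (the computed column and index names below are made up for illustration):
-- persisted, deterministic computed fragment that can be indexed and compared with =
ALTER TABLE dbo.Product
    ADD ProductNameFragment AS SUBSTRING(ProductName, 1, 9) PERSISTED;
CREATE INDEX IX_Product_ProductNameFragment ON dbo.Product (ProductNameFragment);
SELECT * FROM dbo.Product WHERE ProductNameFragment = 'furniture';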

Yes, the part before the first % can be matched against the index. However, if your pattern starts with %, a full scan will be performed instead.

Related

Optimize SQL Query (If possible) using CONVERT(INT, SUBSTRING( and LEN FUNCTION

My situation is as follows:
I have these tables:
CREATE TABLE [dbo].[HeaderResultPulser]
(
[Id] BIGINT IDENTITY (1, 1) NOT NULL,
[ReportNumber] CHAR(255) NOT NULL,
[ReportDescription] CHAR(255) NOT NULL,
[CatalogNumber] NCHAR(255) NOT NULL,
[WorkerName] NCHAR(255) DEFAULT ('') NOT NULL,
[LastCalibrationDate] DATETIME NOT NULL,
[NextCalibrationDate] DATETIME NOT NULL,
[MachineNumber] INT NOT NULL,
[EditTime] DATETIME NOT NULL,
[Age] NCHAR(255) DEFAULT ((1)) NOT NULL,
[Current] INT DEFAULT ((-1)) NOT NULL,
[Time] BIGINT DEFAULT ((-1)) NOT NULL,
[MachineName] NVARCHAR(MAX) DEFAULT ('') NOT NULL,
[BatchNumber] NVARCHAR(MAX) DEFAULT ('') NOT NULL,
CONSTRAINT [PK_HeaderResultPulser]
PRIMARY KEY CLUSTERED ([Id] ASC)
);
CREATE TABLE [dbo].[ResultPulser]
(
[Id] BIGINT IDENTITY (1, 1) NOT NULL,
[ReportNumber] CHAR(255) NOT NULL,
[BatchNumber] CHAR(255) NOT NULL,
[DateTime] DATETIME NOT NULL,
[Ocv] FLOAT(53) NOT NULL,
[OcvMin] FLOAT(53) NOT NULL,
[OcvMax] FLOAT(53) NOT NULL,
[Ccv] FLOAT(53) NOT NULL,
[CcvMin] FLOAT(53) NOT NULL,
[CcvMax] FLOAT(53) NOT NULL,
[Delta] BIGINT NOT NULL,
[DeltaMin] BIGINT NOT NULL,
[DeltaMax] BIGINT NOT NULL,
[CurrentFail] BIT DEFAULT ((0)) NOT NULL,
[NumberInTest] INT NOT NULL
);
For every row in HeaderResultPulser I have multiple rows in ResultPulser.
My key is [HeaderResultPulser].[ReportNumber], which I use to get the related rows in ResultPulser; for each report number there are many rows with the same [ResultPulser].[ReportNumber],
each with a different [ResultPulser].[NumberInTest] value.
For example: in the ResultPulser table the data can look like this:
ReportNumber | NumberInTest
-------------+-------------
0000006211 | 1
0000006211 | 2
0000006211 | 3
0000006211 | 4
0000006211 | 5
0000006211 | 6
0000006212 | 1
0000006212 | 2
0000006212 | 3
0000006212 | 4
0000006212 | 5
NumberInTest can be 200, 500, 10000 and sometimes even more.
The report number column contains two values: the first 7 characters are the machine number and the rest is an incrementing number.
For example, 0000006212 is [0000006][212] == [the machine number][the incrementing number].
My query, for example:
select
[HeaderResultPulser].[ReportNumber],
max(NumberInTest) as TotalCells
from
ResultPulser, HeaderResultPulser
where
((([ResultPulser].[ReportNumber] like '0000006%' and
CONVERT(INT, SUBSTRING([ResultPulser].[ReportNumber], 8, LEN([ResultPulser].[ReportNumber]))) BETWEEN '211' AND '815')
and ([HeaderResultPulser].[ReportNumber] = [ResultPulser].[ReportNumber])))
group by
[HeaderResultPulser].[ReportNumber]
I actually want all the rows for machine number 0000006 whose incrementing number is between 211 and 815 (inclusive).
This query takes about 6-7 seconds.
There is a lot of data (hundreds of millions to billions of rows in table ResultPulser, and possibly much more in the future), and there can be tens of thousands of rows in the HeaderResultPulser table.
The select itself only returns a few hundred rows, in the worst case a thousand or maybe two thousand, but to compute max(NumberInTest) the query has to read far more rows from ResultPulser (it can reach a few million).
Is there any way to optimize my query? Or with this much data does it simply have to take this long? (Is that just the way it is?)
The way you are doing joins is no longer standard. It's also hard to read, and dangerous if you ever need to use left joins. Instead of joining this way:
select *
from T1, T2
where T1.column = T2.column
Use ANSI-92 join syntax instead:
select *
from T1
join T2 on T1.column = T2.column
You said that your "key" was ReportNumber. Why isn't that declared in your schema? It sounds like you want a unique constraint on HeaderResultPulser.ReportNumber, and a foreign key on the ResultPulser table, such that ReportNumber references HeaderResultPulser (ReportNumber).
Since your report number column seems to contain two different values, your table is not in First Normal Form. This is making things difficult for you. Why not split the two parts of the "report number" into two different columns when the data is entered? This will significantly improve your query performance, because you no longer need to perform an expression against the data in the table at query time to separate the ReportNumber into atomic values.
Your comment says that the first 7 characters of the ReportNumber are the MachineNumber. But you already have MachineNumber in the HeaderResultPulser table. So why not just add a separate column for Increment? If you still need ReportNumber to exist as a column, you can make it a computed column, as the concatenation of MachineNumber and Increment.
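A sketch of what that could look like, assuming a new integer Increment column is added and populated alongside the existing data (ReportNumberComputed is a made-up name for illustration):
alter table HeaderResultPulser add Increment int;
-- ReportNumber rebuilt from its parts: 7-digit machine number + incrementing number
alter table HeaderResultPulser
add ReportNumberComputed as right('0000000' + cast(MachineNumber as varchar(7)), 7)
                            + cast(Increment as varchar(10));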
If you don't want to touch the "existing" schema, we can do a similar thing in reverse. Your query will not be completely sargable unless you can do something to the schema, because you have to perform some kind of expression on the data in the ReportNumber column. But maybe you have the option to use a calculated column to do this up front:
alter table HeaderResultPulser
add Increment as right(ReportNumber, len(rtrim(ReportNumber)) - 7);
Now we have the increment as a column in its own right. But it's still being calculated at query time, because it's not persisted. We can make it persisted:
alter table HeaderResultPulser
add Increment as right(ReportNumber, len(rtrim(ReportNumber)) - 7) persisted;
We can also index a computed column. Since your required expression is deterministic and precise (see Indexes on Computed Columns), we don't actually have to mark it as persisted:
alter table HeaderResultPulser
add Increment as right(ReportNumber, len(rtrim(ReportNumber)) - 7);
create index ix_headerreportpulser_increment on HeaderReportPulser(Increment);
You could do a similar set of operations to create Increment and MachineNumber on the ResultPulser table. If you always want to use both values, create an index on the combination (MachineNumber, Increment).
The biggest performance gain might come from eliminating the outer group by, using a correlated subquery or lateral join:
select hrp.[ReportNumber],
(select max(rp.NumberInTest)
from ResultPulser rp
where rp.ReportNumber = hrp.ReportNumber and
right(rp.ReportNumber, 3) between '211' and '815'
) as TotalCells
from HeaderResultPulser hrp
where hrp.ReportNumber like '0000006%';
Your logic looks like it only wants the last three characters of the ReportNumber, so I simplified it. I'm not 100% sure that is the case -- it just seems reasonable. Regardless, there is no need to convert the values to integers and then compare them as strings. Similar logic can be used even for longer report numbers.
You also want an index on ResultPulser(ReportNumber, NumberInTest):
create index idx_resultpulser_reportnumber_numberintest on ResultPulser(ReportNumber, NumberInTest)
EDIT:
Actually, I notice that the report number matches between the two tables. So this seems simplest:
select hrp.[ReportNumber],
(select max(rp.NumberInTest)
from ResultPulser rp
where rp.ReportNumber = hrp.ReportNumber
) as TotalCells
from HeaderResultPulser hrp
where hrp.ReportNumber >= '0000006211' and
hrp.ReportNumber <= '0000006815';
You still want to be sure you have the above index on ResultPulser.
If the ReportNumber is not a fixed 10 digits, then you can use:
where hrp.ReportNumber >= '0000006211' and
hrp.ReportNumber <= '0000006815' and
len(hrp.ReportNumber) = 10
This should also use the index and return exactly what you want.
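If HeaderResultPulser.ReportNumber is not already backed by the unique constraint suggested above, an index on it lets the outer range predicate seek as well (the index name is just illustrative):
create index idx_headerresultpulser_reportnumber on HeaderResultPulser(ReportNumber)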
Performance optimization of any query depends on many factors, including the environment in which you host and run the query. Hardware and software play an important part in optimizing heavy database queries. In your case you can look into the following things:
Use ANSI-92 JOIN syntax instead of the implicit comma join,
e.g.
select *
from T1
join T2 on T1.column = T2.column
Put indexes on columns like
[ReportNumber]
[NumberInTest]
Note: you may need an index for each joined column that is not already a primary key. For example:
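Something along these lines (the index names are illustrative):
CREATE INDEX IX_ResultPulser_ReportNumber_NumberInTest ON ResultPulser (ReportNumber, NumberInTest);
CREATE INDEX IX_HeaderResultPulser_ReportNumber ON HeaderResultPulser (ReportNumber);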
Remember that MAX over many rows can be heavy, and that could be the main cost in your query.
Finally, you can look further into optimizing your query syntax using the following online tool, where you can specify your actual query and environment:
https://www.eversql.com/
Hope it helps.
If you really want to optimize performance, I propose adding a bit of logic beyond SQL structures.
Is it possible that a particular value of ReportNumber is present in table ResultPulser but not in table HeaderResultPulser? If not (and I suppose it is not), there is no reason to join table HeaderResultPulser.
Then, I propose to take advantage of the fact that the condition on ReportNumber can be expressed equivalently without splitting it into substrings. For your example, the condition
([ResultPulser].[ReportNumber] like '0000006%' and
CONVERT(INT, SUBSTRING([ResultPulser].[ReportNumber], 8,
LEN([ResultPulser].[ReportNumber]))) BETWEEN '211' AND '815')
is equivalent to:
([ResultPulser].[ReportNumber] BETWEEN '0000006211' and '0000006815')
So the proposal is:
Create index on table ResultPulser(ReportNumber, NumberInTest)
Use selections similar to this:
select ReportNumber, max(NumberInTest) as TotalCells
from ResultPulser
where
ReportNumber BETWEEN '0000006211' and '0000006815'
group by
ReportNumber
(Please add brackets or double quotes and capitalization as necessary for MS SQL Server and your taste.)
I would expect a good database to execute this query with index-only access, which is optimal from an execution point of view.
Performance depends not only on the execution path, but also on setup and hardware. Please make sure that your database has enough cache and fast disk access. Concurrent load is also very important.
Simply splitting the field ReportNumber into [the machine number] and [the incrementing number] will probably not improve performance of the query in the form proposed above. But it may be very convenient for other forms of access (other WHERE clauses), and it will reflect the structure of the data. Even more important: it will release you from imposed limits. Currently you have 3 digits for [the incrementing number]. Are you sure it will never be necessary to have more than 999 of them for a single [the machine number]?
Why does the field ReportNumber have type char(255), when only 10 characters are used? char(255) has fixed length, so it is a terrible waste of space; only database compression can help. Used space has a strong influence on performance - please consider the remark above about the database cache.
If both of these fields, [the machine number] and [the incrementing number], are integers, why not split ReportNumber and use an integer type for them?
Side remark: the field names suggest that you are after the total number of rows in table ResultPulser that belong to a single entry in table HeaderResultPulser. The proposed query will deliver this only if the numbers in NumberInTest are consecutive, without gaps. If that is not guaranteed, you have to count the rows rather than take the maximum.
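A minimal sketch of that counting variant, reusing the same range predicate:
-- count the related rows instead of relying on MAX(NumberInTest) being gap-free
select ReportNumber, count(*) as TotalCells
from ResultPulser
where ReportNumber BETWEEN '0000006211' and '0000006815'
group by ReportNumber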

Count(*) on VARCHAR Index with blank NVARCHAR or NULL check results in double the rows returned

I have a table with a VARCHAR column and an index on it. Whenever a SELECT COUNT(*) is done on this table that has a check for COLUMN = N'' OR COLUMN IS NULL it returns double the number of rows. SELECT * with the same where clause will return the correct number of records.
After reading this article: https://sqlquantumleap.com/2017/07/10/impact-on-indexes-when-mixing-varchar-and-nvarchar-types/ and doing some testing, I believe the collation of the column and the implicit conversion are not at fault (at least not directly). The collation of the column is Latin1_General_CI_AS.
The database is on SQL Server 2012, and I've tested on 2016 as well.
I've created a test script (below) that will demonstrate this problem. In doing so, I believe that it may be related to data paging, as it needed a bit of data in the table for it to occur.
CREATE TABLE [dbo].TEMP
(
ID [varchar](50) COLLATE Latin1_General_CI_AS NOT NULL,
[DATA] [varchar](200) COLLATE Latin1_General_CI_AS NULL,
[TESTCOLUMN] [varchar](50) COLLATE Latin1_General_CI_AS NULL,
CONSTRAINT [PK_TEMP] PRIMARY KEY CLUSTERED ([ID] ASC)
)
GO
CREATE NONCLUSTERED INDEX [I_TEMP_TESTCOLUMN] ON dbo.TEMP (TESTCOLUMN ASC)
GO
DECLARE @ROWS AS INT = 40;
WITH NUMBERS (NUM) AS
(
SELECT 1 AS NUM
UNION ALL
SELECT NUM + 1 FROM NUMBERS WHERE NUM < @ROWS
)
INSERT INTO TEMP (ID, DATA)
SELECT NUM, '1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901324561234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890'
FROM NUMBERS
SELECT @ROWS AS EXPECTED, COUNT(*) AS ACTUALROWS
FROM TEMP
GO
SELECT COUNT(*) AS INVALIDINDEXSEARCHCOUNT
FROM TEMP
WHERE (TESTCOLUMN = N'' OR TESTCOLUMN IS NULL)
GO
DROP TABLE TEMP
I'm able to modify the database to some extent (I won't be able to change data, or change the column to disallow NULL). Unfortunately, I am not able to modify the code doing the search. Can anyone identify a way to get the correct COUNT(*) results returned?
TLDR: This is a bug in the product (reported here).
The poor practice that exposes this bug is mismatched datatypes (varchar column being compared to nvarchar) - on SQL collations this would just cause an implicit cast of the column to nvarchar and a full scan.
On Windows collations this can still result in a seek. This is generally a useful performance optimisation but here you have hit an edge case...
More Detail: use the below setup...
CREATE TABLE dbo.TEMP
(
ID INT IDENTITY PRIMARY KEY,
[TESTCOLUMN] [varchar](50) COLLATE Latin1_General_CI_AS NULL INDEX [I_TEMP_TESTCOLUMN],
Filler AS CAST('X' AS CHAR(8000)) PERSISTED
)
--Add 7 rows where TESTCOLUMN is NOT NULL
INSERT dbo.TEMP([TESTCOLUMN]) VALUES ('aardvark'), ('badger'),
('badges'), ('cat'),
('dog'), ('elephant'),
('zebra');
--Add 49 rows where TESTCOLUMN is NULL
INSERT dbo.TEMP([TESTCOLUMN])
SELECT NULL
FROM dbo.TEMP T1 CROSS JOIN dbo.TEMP T2
Then first look at the actual execution plan for
SELECT COUNT(*)
FROM dbo.TEMP
WHERE TESTCOLUMN = N'badger'
OPTION (RECOMPILE)
In SQL collations the implicit cast to nvarchar would make the predicate entirely unsargable. With Windows collations SQL Server is able to add apparatus to the plan where a compute scalar calls an internal function GetRangeThroughConvert(N'badger',N'badger',(62)) and the resulting values end up being fed into a nested loops join to give start and end points for an index seek. (The article "Dynamic Seeks and Hidden Implicit Conversions" has some more details about this plan shape.)
It is not exposed in the execution plan what the range start and end values are that this internal function returns but it is possible to see them if you happen to have a SQL Server build available where the short lived query_trace_column_values extended event has not been disabled. In the case above the function returns (badger, badgeS, 62) and these values are used in the index seek. As I added a row with the value "badges" in this case the seek ends up reading one more row than strictly necessary and the residual predicate retains only the one for "badger".
Now try
SELECT COUNT(*)
FROM dbo.TEMP
WHERE TESTCOLUMN = N''
OPTION (RECOMPILE)
The GetRangeThroughConvert function appears to give up when asked to provide a range for an empty string and outputs (null, null, 0).
Each null here indicates that that end of the range is unbounded, so effectively the index seek just ends up reading the whole index from first row to last.
The actual execution plan shows that the index seek read all 56 rows, but the residual predicate did the job of removing all those not matching TESTCOLUMN = N'' (so the operator returns zero rows).
In general the seek predicate used here seems to act like a prefix search (e.g. the seek [TESTCOLUMN] = N'A' will read at least all rows starting with A with the residual predicate doing the equality check) so my expectations for empty string here would not be high in the first place but Paul White indicates that the range being seeked here is likely a bug anyway.
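A quick way to observe that prefix-like behaviour with the setup above is to compare the rows read by the seek (visible in the actual plan) with the rows returned; the count itself will be 0:
-- the seek range covers at least the rows starting with 'a' (e.g. 'aardvark');
-- the residual predicate then removes them, so zero rows are returned
SELECT COUNT(*)
FROM dbo.TEMP
WHERE TESTCOLUMN = N'a'
OPTION (RECOMPILE)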
When you add the OR predicate to the query the execution plan changes.
It now ends up getting two outer rows to the nested loops join and so ends up doing two seeks (two executions of the seek operator on the inside of the nested loops).
One for the TESTCOLUMN = N'' case and one for the TESTCOLUMN IS NULL case. The values used for the TESTCOLUMN = N'' branch are still calculated through the GetRangeThroughConvert call (as this is the only way SQL Server can do a seek for this mismatched datatype case) so still have the expanded range including NULL.
The problem is that the residual predicate on the index seek now also changes.
It is now
CONVERT_IMPLICIT(nvarchar(50),[tempdb].[dbo].[TEMP].[TESTCOLUMN],0)=N''
OR [tempdb].[dbo].[TEMP].[TESTCOLUMN] IS NULL
The previous residual predicate of
CONVERT_IMPLICIT(nvarchar(50),[tempdb].[dbo].[TEMP].[TESTCOLUMN],0)=N''
would not be suitable as this would incorrectly remove the rows with NULL that need to be retained for the OR TESTCOLUMN IS NULL branch.
This means that when the seek for the N'' branch is done it still ends up reading all the rows with NULL as before but the residual predicate no longer is fit for purpose at removing these.
It might also seem a bit of a miss that the merge interval in the problem plan does not merge the overlapping ranges for the index seeks.
I assume this does not happen due to the different flag values from the two branches: Expr1014 has a value of 60 for the IS NULL branch and 0 for the = N'' branch.
In my test, which was on SQL 2019, when one removes the N and just compares against '' or null, the double counting goes away.
SELECT COUNT(*) AS ACTUALROWS
FROM TEMP
WHERE (TESTCOLUMN = '' OR TESTCOLUMN IS NULL)
The N identifier indicating Unicode is inappropriate anyway, as the search column is not of type NVARCHAR. If the test column were of type NVARCHAR, the count would be correct.
Eric Kassan's answer is correct:
The column in the table is VARCHAR, but you are searching as if the column is NVARCHAR.
These are two different datatypes, so the column should be changed to NVARCHAR, or the query should be changed by removing the N.
Why the result is doubled up when joining different datatypes is interesting, but that was not the question. :)

Where Clause Index Scan - Index Seek

I have the table below:
CREATE TABLE Test
(
Id int IDENTITY(1,1) NOT NULL,
col1 varchar(37) NULL,
testDate datetime NULL
)
insert Test (col1)
select null
go 700000
insert Test (col1)
select cast(NEWID() as varchar(37))
go 300000
And the indexes below:
create clustered index CIX on Test(ID)
create nonclustered index IX_RegularIndex on Test(col1)
create nonclustered index IX_RegularDateIndex on Test(testDate)
When I query on my table:
SET STATISTICS IO ON
select * from Test where col1=NEWID()
select * from Test where TestDate=GETDATE()
The first query does an index scan whereas the second does an index seek. I expected both of them to use an index seek. Why does the first one do an index scan?
There is an implicit convert generated because the NEWID() function returns a value of the uniqueidentifier datatype, which is different from the VARCHAR datatype declared for the column.
Just try hovering your mouse over the SELECT operator of the plan, where there is a "warning" sign.
Because of the mismatch between the compared datatypes, the optimizer can't use the column statistics to estimate how many rows match the NEWID() value.
And because of the implicit convert, the optimizer decides that it is better to go and get all the rows (hence the SCAN), then pass them through the FILTER operation, where it converts the value of col1 to a uniqueidentifier and removes the rows that do not match the filter condition.
GETDATE(), by contrast, returns a datetime value, which is the same datatype as your testDate column, so no conversion is needed and the values can be compared as they are.
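A common workaround, if such a comparison is really needed, is to cast the uniqueidentifier value to the column's type so the comparison stays varchar-to-varchar and the index on col1 can be used for a seek (a sketch only, the variable name is illustrative):
-- the value is converted once, not the column, so IX_RegularIndex can be seeked
DECLARE @id varchar(37) = CAST(NEWID() AS varchar(37));
SELECT * FROM Test WHERE col1 = @id;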

Execution Plan shows a sort - but can't figure a way around because of query

Here is my query -- the last query is what is causing me pain:
The address.postcode field is a varchar(14) and you can see the input format the user sends in.
DECLARE @ZipCode NVARCHAR(MAX) = ('06409;06471;11763;06443;06371;11949;11946;11742')
IF OBJECT_ID('tempdb..#ZipCodes') IS NOT NULL DROP TABLE #ZipCodes;
CREATE TABLE #ZipCodes (
Zipcode NVARCHAR(6)
)
INSERT INTO #ZipCodes ( Zipcode )
SELECT zip.Token + '%'
FROM DMS.fn_SplitList(@ZipCode, ';') zip
CREATE NONCLUSTERED INDEX [idx_Zip] ON #ZipCodes (Zipcode)
IF OBJECT_ID('tempdb..#ZipCodesConstituents') IS NOT NULL DROP TABLE #ZipCodesConstituents;
CREATE TABLE #ZipCodesConstituents (
ConstituentID UNIQUEIDENTIFIER
, PostCode NVARCHAR(12)
)
CREATE NONCLUSTERED INDEX [idx_ZipCodesConstituents] ON #ZipCodesConstituents (ConstituentID, PostCode)
INSERT INTO #ZipCodesConstituents ( ConstituentID, PostCode )
SELECT a.CONSTITUENTID
, a.POSTCODE
FROM #ZipCodes zip
JOIN DMS.address a
ON a.POSTCODE LIKE zip.Zipcode
where a.ISPRIMARY = 1
I am trying to attach the execution plan, but not having any luck...
Basically this section of the code has an estimated cost of 61.9%,
and the Sort alone is 61.5%.
I tried to reproduce the behavior, but in my tests I can't force a sort operator into the last INSERT. What I do see are the following two issues.
You create the index and only afterwards insert into the table. This may be harmful. Not in all cases, but it may force the sort in your query, as SQL Server may sort the incoming rows to maintain index order during the insert.
You're using UNIQUEIDENTIFIER for your keys. This may be useful in some ways, but I think in your case a simple IDENTITY(1,1) column would be enough, wouldn't it? A UNIQUEIDENTIFIER in your index will cause heavy fragmentation. It might not be the best way to solve this.
I tested both variants with a test set of 100,000 rows.
These are my results, measured in execution cost:
In all cases, the costs are significantly lower if the index is created after the INSERT into #ZipCodesConstituents.
Using an IDENTITY instead of a UNIQUEIDENTIFIER boosts performance further.
It would be wise to add an index on your Address table if you run this sort of query regularly.
Here are the measurements in cost-points (cp) - the lower the better:
UniqueIdentifier + Index before: 20 cp for the Insert.
UniqueIdentifier + Index after Insert: 8 cp for the Insert + 7 for the Index (where the SORT occurs).
Identity + Index before: 18 cp for the Insert
Identity + Index after: 7 cp for the Insert + 6 for the Index
Identity + Index after + Index on Address: 3 for the Insert + 6 for the Index
And the winner is: Identity Column + Index and maybe an Index on your Address too.
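A sketch of that winning variant, reusing the temp table and query from the question (only the key column and the order of operations change):
IF OBJECT_ID('tempdb..#ZipCodesConstituents') IS NOT NULL DROP TABLE #ZipCodesConstituents;
CREATE TABLE #ZipCodesConstituents (
Id INT IDENTITY(1,1) PRIMARY KEY
, ConstituentID UNIQUEIDENTIFIER
, PostCode NVARCHAR(12)
)
INSERT INTO #ZipCodesConstituents ( ConstituentID, PostCode )
SELECT a.CONSTITUENTID
, a.POSTCODE
FROM #ZipCodes zip
JOIN DMS.address a
ON a.POSTCODE LIKE zip.Zipcode
WHERE a.ISPRIMARY = 1
-- build the nonclustered index only after the rows are in place
CREATE NONCLUSTERED INDEX [idx_ZipCodesConstituents] ON #ZipCodesConstituents (ConstituentID, PostCode)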
The index I used to boost both of the statements (UniqueIdentifier and Identity) is this one:
CREATE NONCLUSTERED INDEX [NCI_Adress_IsPrimary_Postcode]
ON Address ([IsPrimary],[PostCode])
INCLUDE ([Constituentid])
In my test case it took 13 cp to build it. If you only run this once, it won't be helpful! If you run this statement often, or even a few times a day or week, it may be useful for you.
Hopefully this will solve your problems.

Why Postgresql searches Text index faster than Int index?

CREATE TABLE index_test
(
id int PRIMARY KEY NOT NULL,
text varchar(2048) NOT NULL,
value int NOT NULL
);
CREATE INDEX idx_index_value ON index_test ( value );
CREATE INDEX idx_index_value_and_text ON index_test ( value, text );
CREATE INDEX idx_index_text_and_value ON index_test ( text, value );
CREATE INDEX idx_index_text ON index_test ( text );
The table is populated with 10,000 random rows; the 'value' column has integers from 0 to 100, and the 'text' column has random 128-bit md5 hashes. Sorry for the bad column names.
My searches are:
select * from index_test r where r.value=56;
select * from index_test r where r.value=56 and r.text='dfs';
select * from index_test r where r.text='sdf';
Whatever search I run...
whether only the single-column indexes on 'text' and/or 'value' are present,
or whether the combined ('text' and 'value' together) indexes are present,
... I always see the following picture:
The search on the integer column 'value' is
slower
made up of 2 steps: a *Bitmap Index Scan on idx_index_value* followed by a *Bitmap Heap Scan on index_test*
The search on the varchar column 'text' is
faster
always a plain index scan
Why is searching for a string easier than searching for an integer?
Why do the search plans differ in that way?
Are there similar situations where this effect can be reproduced and be useful to developers?
As the text is a hash, unique by definition, there will be only one row among the 10k rows of the table matching that text.
The value 56 will exist about 100 times within the 10k rows, and those rows will be scattered all over the table. So the planner first goes to the index and finds the pages where those rows are. Then it visits each of those scattered pages to retrieve the rows.
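You can see the difference directly by comparing the two plans, for example:
-- ~100 scattered matching rows: bitmap index scan + bitmap heap scan
EXPLAIN ANALYZE SELECT * FROM index_test r WHERE r.value = 56;
-- at most one matching row: plain index scan
EXPLAIN ANALYZE SELECT * FROM index_test r WHERE r.text = 'sdf';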