How to combine potential indexes - SQL

Some "missing index" code (see below) I got from internet searches is listing a lot of potential missing indexes for a particular table. Literally it's saying that I need 30 indexes. I already had 8 before running the code. Most experts state that a table should average 5. Can I combine a majority of these missing indexes so that it covers most of the tables indexing needs?
For example:
These two indexes are similar enough that it seems like they could be combined. But can they?
CREATE INDEX [NCI_12345] ON [DB].[dbo].[someTable]
([PatSample], [StatusID], [Sub1Sample])
INCLUDE ([PatID], [ProgID], [CQINumber]);

CREATE INDEX [NCI_2535_2534] ON [DB].[dbo].[someTable]
([PatSample], [SecRestOnly])
INCLUDE ([CQINumber]);
If I combine them it'd look like this:
CREATE INDEX [NCI_12345] ON [DB].[dbo].[someTable]
([PatSample], [StatusID], [Sub1Sample], [SecRestOnly])
INCLUDE ([PatID], [ProgID], [CQINumber]);
NOTE: I just took the first statement and added [SecRestOnly] to it.
QUESTION: Would combining these satisfy both index needs? And if not, how would a heavily used table with lots of fields ever have just 5 indexes?
Here's the code used to get "missing indexes":
SELECT
migs.avg_total_user_cost * (migs.avg_user_impact / 100.0) *
(migs.user_seeks + migs.user_scans) AS improvement_measure,
LEFT (PARSENAME(mid.STATEMENT, 1), 32) as TableName,
'CREATE INDEX [NCI_' + CONVERT (VARCHAR, mig.index_group_handle) + '_'
+ CONVERT (VARCHAR, mid.index_handle)
+ '_' + LEFT (PARSENAME(mid.STATEMENT, 1), 32) + ']'
+ ' ON ' + mid.STATEMENT
+ ' (' + ISNULL (mid.equality_columns,'')
+ CASE WHEN mid.equality_columns IS NOT NULL AND mid.inequality_columns IS NOT NULL THEN ',' ELSE '' END
+ ISNULL (mid.inequality_columns, '')
+ ')'
+ ISNULL (' INCLUDE (' + mid.included_columns + ')', '') AS create_index_statement,
migs.*, mid.database_id, mid.[object_id]
FROM [sys].dm_db_missing_index_groups mig
INNER JOIN [sys].dm_db_missing_index_group_stats migs ON migs.group_handle = mig.index_group_handle
INNER JOIN [sys].dm_db_missing_index_details mid ON mig.index_handle = mid.index_handle
WHERE migs.avg_total_user_cost * (migs.avg_user_impact / 100.0) * (migs.user_seeks + migs.user_scans) > 10
ORDER BY migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans) DESC;

The example you gave will not produce the desired result. The index on ([PatSample], [SecRestOnly]) will optimize a search condition such as "PatSample = val1 AND SecRestOnly = val2". The combined index will not, because other key columns ([StatusID], [Sub1Sample]) sit between those two columns. The key to remember is that a multi-column index can only optimize a set of equality predicates when the searched columns form a leftmost, consecutive prefix of the index key.
Given that, it follows that if you have one index on (col1, col2) and another on (col1, col2, col3), the former is redundant.
How many indexes to have is a trade-off between update performance and search performance. More indexes slow down insert/update/delete, but give the query optimizer more options for optimizing searches. Given your example: does your application frequently search on "SecRestOnly" by itself? If so, it would be better to have an index with "SecRestOnly" alone or as the leading column of a multi-column index. If searches rarely use that column, it may be reasonable not to have such an index. The sketch below illustrates the leftmost-prefix rule with the indexes from the question.
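To make that concrete, here is a minimal sketch using the proposed combined index (the @p, @s, @r parameters are hypothetical placeholders for search values):
-- combined index key: ([PatSample], [StatusID], [Sub1Sample], [SecRestOnly])

-- this query CAN seek on the combined index: the predicate columns
-- ([PatSample], [StatusID]) are a leftmost prefix of the index key
SELECT [PatID], [CQINumber]
FROM [DB].[dbo].[someTable]
WHERE [PatSample] = @p AND [StatusID] = @s;

-- this query CANNOT seek on [SecRestOnly] via the combined index:
-- [StatusID] and [Sub1Sample] sit between [PatSample] and [SecRestOnly],
-- so the engine can only seek on [PatSample] and must filter the rest
SELECT [CQINumber]
FROM [DB].[dbo].[someTable]
WHERE [PatSample] = @p AND [SecRestOnly] = @r;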

Related

How to optimize Impala query to combine LIKE with IN (literally or effectively)?

I need to try and optimize a query in Impala SQL that does partial string matches on about 60 different strings, against two columns in a database of 50+ billion rows. The values in these two columns are encrypted and have to be decrypted with a user defined function (in Java) to do the partial string match. So query would look something like:
SELECT decrypt_function(column_A), decrypt_function(column_B) FROM myTable WHERE ((decrypt_function(column_A) LIKE '%' + partial_string_1 + '%') OR (decrypt_function(column_B) LIKE '%' + partial_string_1 + '%')) OR ((decrypt_function(column_A) LIKE '%' + partial_string_2 + '%') OR (decrypt_function(column_B) LIKE '%' + partial_string_2 + '%')) OR ... [up to partial_string_60]
What I really want to do is decrypt the two column values I'm comparing with, once for each row and then compare that value with all the partial strings, then go onto the next row etc (for 55 billion rows). Is that possible somehow? Can there be a subquery that assigns the decrypted column value to a variable before using that to do the string comparison to each of the 60 strings? Then go onto the next row...
Or is some other optimization possible? E.g. using 'IN', so ... WHERE (decrypt_function(column_A) IN ('%' + partial_string_1 + '%', '%' + partial_string_2 + '%', ... , '%' + partial_string_60 + '%')) OR (decrypt_function(column_B) IN ('%' + partial_string_1 + '%', '%' + partial_string_2 + '%', ... , '%' + partial_string_60 + '%'))
Thanks
Use a subquery. Also, regexp_like can take many patterns concatenated with OR (|), so you can check all the alternatives in a single regexp, though you may need to split it into several function calls if the pattern string gets too long:
select ColA, ColB
from
( -- decrypt once per row in the subquery
  SELECT decrypt_function(column_A) as ColA, decrypt_function(column_B) as ColB
  FROM myTable
) as s
where
-- put the most frequent substrings first in the regexp
regexp_like(ColA, 'partial_string_1|partial_string_2|partial_string_3') -- add more
OR
regexp_like(ColB, 'partial_string_1|partial_string_2|partial_string_3')
In Hive use this syntax:
where ColA rlike 'partial_string_1|partial_string_2|partial_string_3'
OR ColB rlike 'partial_string_1|partial_string_2|partial_string_3'
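If the single concatenated pattern gets too long, the same check can be split across several regexp_like calls ORed together, as mentioned above. A sketch using the question's placeholder names (note that any regex metacharacters inside the partial strings would need escaping):
where
-- first batch of alternatives
regexp_like(ColA, 'partial_string_1|partial_string_2|partial_string_3')
-- further batches, continuing up to partial_string_60
OR regexp_like(ColA, 'partial_string_4|partial_string_5|partial_string_6')
OR regexp_like(ColB, 'partial_string_1|partial_string_2|partial_string_3')
OR regexp_like(ColB, 'partial_string_4|partial_string_5|partial_string_6')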

Typical long-running query?

I inherited a poorly designed SQL Server implementation, with horrible database schemas, and several pitifully slow queries/views that take hours, or even days, to execute.
I'm curious: What might an experienced DBA/SQL programmer consider to be an unusually long time for a query to take? In my professional experience I'm not used to seeing queries (for views or reports, etc) that take more than maybe an hour or two to run. We have several here that take 1-2 days or more!
(This database has no relationships, no Primary Keys, no Foreign Keys, hardly any indexes, duplicate data all over the place, old data that shouldn't even be in the tables, temp tables everywhere, and so on...ugh!)
What do you consider within the realm of acceptable or normal for a lengthy query process?
I'm trying to do a sanity check to determine how awful this database really is...
This isn't an answer to your question, it's just easier to offer this script to you here than in a comment.
You might want to run this query to see what SQL Server thinks are the missing indexes it needs to perform better (a short-term measure while you migrate to your new schema and database). DO NOT BLINDLY APPLY THESE INDEXES. They are merely suggestions, identified by SQL Server itself as potentially useful while it runs. You might select one or two, perhaps tweaking the include columns, and AFTER TESTING, apply some of them to help speed up your existing system (I forget where this query came from, so I'm sadly unable to credit the original author):
SELECT
migs.avg_total_user_cost * (migs.avg_user_impact / 100.0) * (migs.user_seeks + migs.user_scans) AS improvement_measure,
(migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans)) AS [cumulative_impact],
OBJECT_NAME(OBJECT_ID) as TableName,
'CREATE INDEX [missing_index_' + CONVERT (varchar, mig.index_group_handle) + '_' + CONVERT (varchar, mid.index_handle)
+ '_' + LEFT (PARSENAME(mid.statement, 1), 32) + ']'
+ ' ON ' + mid.statement
+ ' (' + ISNULL (mid.equality_columns,'')
+ CASE WHEN mid.equality_columns IS NOT NULL AND mid.inequality_columns IS NOT NULL THEN ',' ELSE '' END
+ ISNULL (mid.inequality_columns, '')
+ ')'
+ ISNULL (' INCLUDE (' + mid.included_columns + ')', '') AS create_index_statement,
migs.*, mid.database_id, mid.[object_id]
FROM sys.dm_db_missing_index_groups mig
INNER JOIN sys.dm_db_missing_index_group_stats migs ON migs.group_handle = mig.index_group_handle
INNER JOIN sys.dm_db_missing_index_details mid ON mig.index_handle = mid.index_handle
WHERE migs.avg_total_user_cost * (migs.avg_user_impact / 100.0) * (migs.user_seeks + migs.user_scans) > 10
AND database_id = DB_ID()
ORDER BY migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans) DESC
Further, you can use the following script to find the longest-running queries (along with their SQL plans). There's a TON of information here, so play with the ORDER BY to bring different kinds of issues to the top. For instance, the longest-running queries might run only once or twice, while one that runs thousands of times might not take so long per execution but might consume far more resources all told:
SELECT [st].[text],
       [qp].[query_plan],
       [qs].*
FROM (SELECT TOP 50 *
      FROM sys.dm_exec_query_stats
      ORDER BY total_worker_time DESC
     ) AS [qs]
CROSS APPLY sys.dm_exec_sql_text ([qs].sql_handle) AS [st]
CROSS APPLY sys.dm_exec_query_plan ([qs].plan_handle) AS [qp]
-- note: the worker/elapsed time columns are reported in microseconds
WHERE ([qs].max_worker_time > 300 OR [qs].max_elapsed_time > 300)
AND [qs].execution_count > 1
ORDER BY min_elapsed_time DESC, max_elapsed_time DESC;
These DMVs only hold data collected since the instance last started; stopping and restarting SQL Server erases it all. So these queries are only really valuable on systems that have been up and running under real-world load for a while.
I hope you find these useful.

LIKE query in SQL taking time

So I've looked around to try to find some posts on this, and there are many (e.g., Like Query 1 and Like Query 2), but none that address my specific question (that I could find).
I have two tables with around 5,000,000+ records each, and I am returning search results from these tables as:
SELECT A.ContactFirstName, A.ContactLastName
FROM Customer.CustomerDetails AS A WITH (nolock)
WHERE (A.ContactFirstName + ' ' + A.ContactLastName LIKE '%' + 'a' + '%')
UNION
SELECT C.ContactFirstName, C.ContactLastName
FROM Customer.Contacts AS C WITH (nolock)
WHERE (C.ContactFirstName + ' ' + C.ContactLastName LIKE '%' + 'a' + '%')
My problem is it is taking around 1 minute to execute.
Please suggest the best practice to improve performance. Thanks in advance.
NOTE: No missing indexes.
when you use "LIKE '%xxx%'" index are not used that why your query is slow i think. When you use "LIKE 'xxx%')" index is used (if an index exist on column of course. >Other proble you do a like on concatenante column, i dont knwo if index is used in this case. And why do a 'xxx' + ' ' + 'yyy' like 'z%', just do 'xxx' like 'z%' its the same. You can try to modify your query like this
SELECT A.ContactFirstName, A.ContactLastName
FROM Customer.CustomerDetails AS A WITH (NOLOCK)
-- searching each column directly avoids the concatenation
WHERE A.ContactFirstName LIKE '%a%' OR A.ContactLastName LIKE '%a%'
UNION
SELECT C.ContactFirstName, C.ContactLastName
FROM Customer.Contacts AS C WITH (NOLOCK)
-- a prefix pattern ('a%') is sargable and can use an index on the column
WHERE C.ContactFirstName LIKE 'a%'
Use CHARINDEX, which can improve the performance of the search: it returns the position of the first match in the string and doesn't keep searching for any further matches.
DECLARE @Search VARCHAR(10) = 'a';

SELECT A.ContactFirstName, A.ContactLastName
FROM Customer.CustomerDetails AS A WITH (NOLOCK)
-- CHARINDEX returns 0 when there is no match, so test for > 0
WHERE CHARINDEX(@Search, A.ContactFirstName + ' ' + A.ContactLastName) > 0

Advanced SQL query to identify missing data

I am currently dealing with a SQL Server table 'suburb' which has a suburb_id column and an adjacent_suburb_ids column. The adjacent_suburb_ids column is a comma-separated string of other suburb_ids.
I have found that some of the records are not reciprocating -
e.g., "SuburbA" has "SuburbB"'s id in adjacent_suburb_ids, but "SuburbB" does not have "SuburbA"'s id in adjacent_suburb_ids.
I need to identify all the suburbs which are not reciprocating the adjacent_suburbs, can I do this with a SQL query?
Please do not comment on the data/table structure as it is not in my control and I can't change it.
Assuming I'm understanding your question correctly, you can join the table to itself using the LIKE and NOT LIKE operators:
select s.suburb_id, s2.suburb_id as s2id
from suburb s
join suburb s2 on
s.suburb_id <> s2.suburb_id
and ',' + s2.adjacent_suburb_ids + ',' like
'%,' + cast(s.suburb_id as varchar(10)) + ',%'
and ',' + s.adjacent_suburb_ids + ',' not like
'%,' + cast(s2.suburb_id as varchar(10)) + ',%'
You need to concatenate a comma before and after to do a search within the set; a small worked example is sketched below. And yes, if you had the chance, you should consider normalizing the data.
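To see why the wrapping commas matter, here is a minimal sketch with hypothetical sample data matching the question's description:
-- hypothetical sample data: suburb 1 lists 2 and 3; suburb 2 lists only 3
CREATE TABLE suburb (suburb_id INT, adjacent_suburb_ids VARCHAR(100));
INSERT INTO suburb VALUES (1, '2,3'), (2, '3'), (3, '1,2');

-- wrapping each list as ',2,3,' means a pattern like '%,1,%' can't
-- accidentally match ids such as 11 or 21;
-- the query above returns (2, 1): suburb 1 lists suburb 2 as adjacent,
-- but suburb 2 does not list suburb 1 back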

How to speed up my spatial search in SQL Server?

I have a database with about 1 million places (coordinates) placed out on the Earth. My web site has a map (Google Maps) that lets users find those places by zooming in on the map.
The database is a SQL Server 2008 R2 and I have created a spatial column for the location of each marker.
Problem is I need to cut down query time drastically. An example is a map area covering a few square kilometers which returns maybe 20000 points - that query takes about 6 seconds of CPU time on a very fast quad core processor.
I construct a shape out of the visible area of the map, like this:
DECLARE @shape GEOGRAPHY = geography::STGeomFromText('POLYGON((' +
CONVERT(varchar, @ne_lng) + ' ' + CONVERT(varchar, @sw_lat) + ', ' +
CONVERT(varchar, @ne_lng) + ' ' + CONVERT(varchar, @ne_lat) + ', ' +
CONVERT(varchar, @sw_lng) + ' ' + CONVERT(varchar, @ne_lat) + ', ' +
CONVERT(varchar, @sw_lng) + ' ' + CONVERT(varchar, @sw_lat) + ', ' +
CONVERT(varchar, @ne_lng) + ' ' + CONVERT(varchar, @sw_lat) + '))', 4326)
And the query then makes the selection based on this:
@shape.STIntersects(MyTable.StartPoint) = 1
a) I have made sure the index is really used (checked the actual execution plan). Also tried with index hints.
b) I have also tried querying by picking everything in a specific distance from the center of the map. It's a little bit better, but it still takes many seconds.
The spatial index looks like this:
CREATE SPATIAL INDEX [IX_MyTable_Spatial] ON [dbo].[MyTable]
(
[MyPoint]
)USING GEOGRAPHY_GRID
WITH (
GRIDS =(LEVEL_1 = MEDIUM,LEVEL_2 = MEDIUM,LEVEL_3 = MEDIUM,LEVEL_4 = MEDIUM),
CELLS_PER_OBJECT = 16, PAD_INDEX = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
What can be done to dramatically improve this search? Should I have a geometry-based index instead? Or are there other settings for the index that are badly chosen (they are the default ones)?
EDIT:
I ended up not using SQL Server spatial indexes at all. Since I only need to do simple searches within a rectangle of the map, using decimal columns and plain <= and >= comparisons is much faster, and entirely sufficient for the purpose (a sketch of that approach is below). Thanks everyone for helping me!
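For reference, a minimal sketch of that bounding-box approach; the Lat/Lng column names and the supporting index are assumptions, since the post doesn't show the final schema:
-- assumed decimal latitude/longitude columns with a supporting index
-- CREATE INDEX IX_MyTable_LatLng ON dbo.MyTable (Lat, Lng);

SELECT *
FROM dbo.MyTable
WHERE Lat BETWEEN @sw_lat AND @ne_lat   -- plain range comparisons
  AND Lng BETWEEN @sw_lng AND @ne_lng;  -- instead of spatial predicates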
SQL Server 2008 (and later) supports spatial indexes.
See: http://technet.microsoft.com/en-us/library/bb895373.aspx
for a list of functions that can be used whilst still being able to use an index.
If you use any other function, T-SQL will not be able to use an index, killing performance.
See: http://technet.microsoft.com/en-us/library/bb964712.aspx
For general info on spatial indexes.
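As an illustration, the commonly recommended index-friendly form puts the indexed column first and compares the result explicitly to 1 (a sketch using the names from the question):
-- the predicate shape the optimizer recognizes for a spatial index seek
SELECT *
FROM dbo.MyTable
WHERE MyTable.StartPoint.STIntersects(@shape) = 1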
Have you tried using an index hint? For example:
SELECT * FROM [dbo].[TABLENAME] WITH (INDEX([INDEX_NAME]))
WHERE
[TABLENAME].StartPoint.STIntersects(@shape) = 1