How to speed up this simple query - sql

With SourceTable having > 15MM records and Bad_Phrase having > 3K records, the following query takes almost 10 hours to run on SQL Server 2005 SP4.
UPDATE [SourceTable]
SET Bad_Count = (SELECT COUNT(*)
                 FROM Bad_Phrase
                 WHERE [SourceTable].Name LIKE '%' + Bad_Phrase.PHRASE + '%')
In English, for each row this query counts how many of the phrases listed in Bad_Phrase appear as a substring of the [Name] column in SourceTable, and stores that result in the Bad_Count column.
I would like some suggestions on how to have this query run considerably faster.

For lack of a better idea, here is one:
I don't know if SQL Server natively supports parallelizing an UPDATE statement, but you can try to do it yourself manually by partitioning the work that needs to be done.
For instance, if you can run the following two UPDATE statements in parallel, either manually or from a small app, I'd be curious to see whether you can bring down your total processing time.
UPDATE [SourceTable]
SET Bad_Count = (
    SELECT COUNT(*)
    FROM Bad_Phrase
    WHERE [SourceTable].Name LIKE '%' + Bad_Phrase.PHRASE + '%'
)
WHERE Name < 'm'

UPDATE [SourceTable]
SET Bad_Count = (
    SELECT COUNT(*)
    FROM Bad_Phrase
    WHERE [SourceTable].Name LIKE '%' + Bad_Phrase.PHRASE + '%'
)
WHERE Name >= 'm'
So the 1st update statement takes care of updating all the rows whose names start with the letters a-l, and the 2nd statement takes care of m-z.
It's just an idea, and you can try splitting this into smaller chunks and more parallel update statements, depending on the capacity of your SQL Server machine.

Sounds like your query is scanning the whole table. Do your tables have proper indexes on them? Putting an index on columns that appear in a WHERE clause is a good place to start. You can also get the cost of the query in SQL Server Management Studio: "Display Estimated Execution Plan" or, if you're willing to wait, "Include Actual Execution Plan" are both buttons in the query window. The cost will provide insight into what is taking so long and may steer you towards writing faster queries.

You are updating the table using a subquery against the same table; every row update will scan the whole table, and that can cause excessive execution time. I think it is better to first insert all the data into a #temp table and then use the #temp table in your UPDATE statement. You can also JOIN the source table and the temp table.
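As a rough sketch of that idea, assuming SourceTable has a unique Id key (a hypothetical column name; substitute your real primary key), you might pre-compute the counts once and then update through a join:

-- Sketch only: pre-aggregate the counts into a temp table, then update via a JOIN.
-- Assumes a unique Id column on SourceTable (hypothetical); adjust to your schema.
SELECT s.Id,
       Bad_Count = COUNT(b.PHRASE)
INTO   #Counts
FROM   [SourceTable] s
LEFT JOIN Bad_Phrase b
       ON s.Name LIKE '%' + b.PHRASE + '%'
GROUP BY s.Id;

UPDATE s
SET    s.Bad_Count = c.Bad_Count
FROM   [SourceTable] s
JOIN   #Counts c ON c.Id = s.Id;

The expensive LIKE comparisons still have to happen, but splitting the aggregation from the UPDATE keeps the long-running scan out of the update itself and lets the final UPDATE be a simple join.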

Related

SQL Server Select Query Is Slow

I have approx 820,000 records in my SQL Server table and it is taking 5 seconds to select the data from the table. The table has one clustered index on a time column that could be NULL (as of now it does not contain any NULL values). Why is it taking 5 to 6 seconds to fetch only this many records?
What do you mean by 'select the data'? If you are fetching that many records in Management Studio (displaying all the records), most of those 6 seconds are consumed by rendering all the rows. If that is the case, just insert the records into a temp table instead; it will be much faster.
I recommend the following:
1. Check whether you have clustered and non-clustered indexes on your columns (the easiest way, I think, is sp_help NameTable).
2. When you use SELECT, always name the columns explicitly (never use SELECT * FROM ...).
3. If you are using SSMS, check the SQL execution plan; with this tool you can do a quick review of your T-SQL and see the cost of each query you run.
4. Don't use CONVERT(...) in the WHERE clause, for example WHERE Convert(int, NameColum) = 100 (see the sketch below).
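To illustrate point 4: wrapping the column in CONVERT makes the predicate non-sargable, so an index on that column cannot be used for a seek. A minimal sketch with hypothetical table and column names:

-- Non-sargable: the function applied to the column forces every row to be evaluated.
SELECT Id, NameColumn
FROM   dbo.MyTable
WHERE  CONVERT(int, NameColumn) = 100;

-- Sargable: compare the column directly with a value of its own type,
-- so an index on NameColumn can be used for a seek.
SELECT Id, NameColumn
FROM   dbo.MyTable
WHERE  NameColumn = '100';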

Query is very slow when we put a where clause on the total selected data by query

I am running a query which is selecting data on the basis of joins between 6-7 tables. When I execute the query it is taking 3-4 seconds to complete. But when I put a where clause on the fetched data it's taking more than one minute to execute. My query is fetching large amounts of data so I can't write it here but the situation I faced is explained below:
Select Category,x,y,z
from
(
---Sample Query
) as a
it's only taking 3-4 seconds to execute. But
Select Category,x,y,z
from
(
---Sample Query
) as a
where category Like 'Spart%'
is taking more than 2-3 minutes to execute.
Why is it taking more time to execute when I use the where clause?
It's impossible to say exactly what the issue is without seeing the full query. It is likely that the optimiser is pushing the WHERE into the "Sample query" in a way that is not performant. This could possibly be resolved by updating statistics on the tables, but an easier option is to insert the whole result into a temporary table and filter from there.
Select Category,x,y,z
INTO #temp
from
(
---Sample Query
) as a
SELECT * FROM #temp WHERE category Like 'Spart%'
This will force the optimiser to tackle it in the logical order of pulling your data together before applying the WHERE to the end result. You might like to consider indexing the temp table's category field also.
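If you do index the temp table, a minimal sketch might look like this (whether it pays off depends on how many rows end up in #temp; the index name is made up):

-- Hypothetical: index the filter column on the temp table before querying it.
CREATE NONCLUSTERED INDEX IX_temp_category ON #temp (category);

SELECT * FROM #temp WHERE category LIKE 'Spart%';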
If you're using MS SQL, checking the actual execution plan in Management Studio may already suggest an index to create.
In any case, you should add the "Category" column to the index used by the query.
If you don't have an index on that table, create one composed of "Category" and the other columns used in the joins or the WHERE clause (a rough sketch follows).
Bear in mind that even with a LIKE 'text%' clause you could still end up with an index scan rather than an index seek.
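A rough sketch of that kind of composite index; the table name, join key and included columns are hypothetical placeholders for the ones your query actually uses:

-- Hypothetical composite index: the filtered column first, then the join key,
-- with the other selected columns carried along as included (covering) columns.
CREATE NONCLUSTERED INDEX IX_MyTable_Category
    ON dbo.MyTable (Category, JoinKeyId)
    INCLUDE (x, y, z);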

Exclude Certain Records with Certain Values SQL Select

I've used the solution (by AdaTheDev) from the thread below as it relates to the same question:
How to exclude records with certain values in sql select
But when applying the same query to over 40,000 records, the query takes too long to process (>30 mins). Is there another way that's efficient for querying certain records with certain values (the same question as in the above Stack Overflow thread)? I've attempted to use the query below and still no luck:
SELECT StoreId
FROM sc
WHERE StoreId NOT IN (
SELECT StoreId
FROM StoreClients
Where ClientId=5
);
Thanks --
You could use EXISTS:
SELECT StoreId
FROM sc
WHERE NOT EXISTS (
SELECT 1
FROM StoreClients cl
WHERE sc.StoreId = cl.StoreId
AND cl.ClientId = 5
)
Make sure to create an index on the StoreId column. Your query could also benefit from an index on ClientId.
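A minimal sketch of those indexes, assuming StoreClients is the table being probed (the index names are made up; adjust to your schema):

-- Hypothetical indexes to support the NOT EXISTS probe above.
CREATE NONCLUSTERED INDEX IX_StoreClients_StoreId  ON dbo.StoreClients (StoreId);
CREATE NONCLUSTERED INDEX IX_StoreClients_ClientId ON dbo.StoreClients (ClientId, StoreId);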
40,000 rows isn't too big for SQL server. The other thread is suggesting some good queries too. If I understand correctly, your problem right now is the performance.
To improve the performance, you may need to add a new index to your table, or just update the statistics on your table.
If you are running this query in SQL Server Management Studio for testing and performance tuning, make sure that you select "Include Actual Execution Plan". This will make the query take longer to execute, but the resulting execution plan can show you where the time is spent.
It usually suggests adding some indexes to improve the performance, which is easy to follow.
If adding new indexes is not possible, then you have to spend some time learning how to read the execution plan's diagram and find the slow areas.

SQL "WITH" Performance and Temp Table (possible "Query Hint" to simplify)

Given the example queries below (Simplified examples only)
DECLARE @DT int; SET @DT=20110717; -- yes this is an INT
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=@DT
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
and ...
DECLARE @DT int; SET @DT=20110717;
BEGIN TRY DROP TABLE #LargeData END TRY BEGIN CATCH END CATCH; -- dump any possible table.
SELECT * -- This is a MASSIVE table indexed on dt field
INTO #LargeData -- put smaller results into temp
FROM mydata
WHERE dt=@DT;
WITH Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM #LargeData
)
SELECT * FROM Ordered
Both produce the same results: a limited, ranked list of values based on a field's data.
When these queries get considerably more complicated (many more tables, lots of criteria, multiple levels of "with" table aliases, etc.) the bottom query executes MUCH faster than the top one, sometimes on the order of 20x-100x faster.
The Question is...
Is there some kind of query HINT or other SQL option that would tell SQL Server to perform the same kind of optimization automatically, or some other format that would involve a cleaner approach (trying to keep the format as much like query 1 as possible)?
Note that the "ranking" or secondary query is just fluff for this example; the actual operations performed don't really matter much.
This is sort of what I was hoping for (or similar but the idea is clear I hope). Remember this query below does not actually work.
DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=@DT
**OPTION (USE_TEMP_OR_HARDENED_OR_SOMETHING) -- EXAMPLE ONLY**
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
EDIT: Important follow up information!
If in your sub query you add
TOP 999999999 -- improves speed dramatically
Your query will behave in a similar fashion to the earlier version that uses a temp table. I found the execution times improved in almost exactly the same fashion, WHICH IS FAR SIMPLER than using a temp table and is basically what I was looking for.
However
TOP 100 PERCENT -- does NOT improve speed
Does NOT perform in the same fashion (you must use the static Number style TOP 999999999 )
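For concreteness, here is how the first query from the question might look with that trick applied; this is just the question's own query with the added TOP, nothing else changed:

DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
    SELECT TOP 999999999 *   -- the "static number" TOP that changes the plan
    FROM mydata
    WHERE dt=@DT
), Ordered AS (
    SELECT TOP 10 *
    , ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
    FROM LargeData
)
SELECT * FROM Ordered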
Explanation:
From what I can tell from the actual execution plan of the query in both formats (the original one with normal CTEs and the one where each subquery has TOP 999999999):
The normal query joins everything together as if all the tables were in one massive query, which is what is expected. The filtering criteria are applied almost at the join points in the plan, which means many more rows are being evaluated and joined together all at once.
In the version with TOP 999999999, the actual execution plan clearly separates the subqueries from the main query in order to apply the TOP statement's action, thus forcing creation of an in-memory "Bitmap" of the subquery that is then joined to the main query. This appears to do exactly what I wanted, and in fact it may even be more efficient, since servers with large amounts of RAM will be able to do the query execution entirely in memory without any disk IO. In my case we have 280 GB of RAM, so well more than could ever really be used.
Not only can you use indexes on temp tables, but they also allow the use of statistics and hints. I can find no reference to being able to use statistics in the documentation on CTEs, and it says specifically that you can't use hints.
When the choice is between temp tables and table variables, temp tables are often the most performant way to go for a large data set, even when you don't use indexes (possibly because statistics are used to develop the plan), and I suspect the implementation of the CTE is more like the table variable than the temp table.
I think the best thing to do, though, is to see how the execution plans differ, to determine whether it is something that can be fixed.
What exactly is your objection to using the temp table when you know it performs better?
The problem is that in the first query the SQL Server query optimizer is able to generate a good query plan. In the second query a good query plan can't be generated, because you're inserting the values into a new temporary table. My guess is there is a full table scan going on somewhere that you're not seeing.
What you may want to do in the second query is insert the values into the #LargeData temporary table like you already do and then create a non-clustered index on the "valuefield" column. This might help to improve your performance.
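A minimal sketch of that suggestion, assuming the #LargeData table from the second query above (the index name is made up):

-- Hypothetical: index the column the ranking sorts on, descending to match the ORDER BY.
CREATE NONCLUSTERED INDEX IX_LargeData_valuefield ON #LargeData (valuefield DESC);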
It is quite possible that SQL is optimizing for the wrong value of the parameters.
There are a couple of options
Try using OPTION (RECOMPILE). There is a cost to this, as it recompiles the query every time, but if different plans are needed it might be worth it.
You could also try using OPTION (OPTIMIZE FOR (@DT = SomeRepresentativeValue)). The risk with this is that you might pick the wrong value.
See I Smell a Parameter! from The SQL Server Query Optimization Team blog
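Applied to the query from the question, the two options might look like this; OPTIMIZE FOR needs a literal, and 20110717 here is just the example value from the question, not necessarily a representative one for your data:

DECLARE @DT int; SET @DT=20110717;

-- Option 1: recompile every time so the plan fits the actual parameter value.
SELECT *
FROM mydata
WHERE dt=@DT
OPTION (RECOMPILE);

-- Option 2: always optimize for a chosen representative value.
SELECT *
FROM mydata
WHERE dt=@DT
OPTION (OPTIMIZE FOR (@DT = 20110717));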

SQL massive performance difference using SELECT TOP x even when x is much higher than selected rows

I'm selecting some rows from a table valued function but have found an inexplicable massive performance difference by putting SELECT TOP in the query.
SELECT col1, col2, col3 etc
FROM dbo.some_table_function
WHERE col1 = @parameter
--ORDER BY col1
is taking upwards of 5 or 6 mins to complete.
However
SELECT TOP 6000 col1, col2, col3 etc
FROM dbo.some_table_function
WHERE col1 = @parameter
--ORDER BY col1
completes in about 4 or 5 seconds.
This wouldn't surprise me if the returned set of data were huge, but the particular query involved returns ~5000 rows out of 200,000.
So in both cases, the whole of the table is processed, as SQL Server continues to the end in search of 6000 rows which it will never get to. Why the massive difference then? Is this something to do with the way SQL Server allocates space in anticipation of the result set size (the TOP 6000 thereby giving it a low requirement which is more easily allocated in memory)?
Has anyone else witnessed something like this?
Thanks
Table valued functions can have a non-linear execution time.
Let's consider a function equivalent of this query:
SELECT (
SELECT SUM(mi.value)
FROM mytable mi
WHERE mi.id <= mo.id
)
FROM mytable mo
ORDER BY
mo.value
This query (that calculates the running SUM) is fast at the beginning and slow at the end, since on each row from mo it should sum all the preceding values which requires rewinding the rowsource.
Time taken to calculate SUM for each row increases as the row numbers increase.
If you make mytable large enough (say, 100,000 rows, as in your example) and run this query you will see that it takes considerable time.
However, if you apply TOP 5000 to this query you will see that it completes in much less than 1/20 of the time required for the full table.
Most probably, something similar happens in your case too.
To say something more definitely, I need to see the function definition.
Update:
SQL Server can push predicates into the function.
For instance, I just created this TVF:
CREATE FUNCTION fn_test()
RETURNS TABLE
AS
RETURN (
SELECT *
FROM master
);
These queries:
SELECT *
FROM fn_test()
WHERE name = @name
SELECT TOP 1000 *
FROM fn_test()
WHERE name = @name
yield different execution plans (the first one uses clustered scan, the second one uses an index seek with a TOP)
I had the same problem: a simple query joining five tables and returning 1000 rows took two minutes to complete. When I added "TOP 10000" to it, it completed in less than one second. It turned out that the clustered index on one of the tables was heavily fragmented.
After rebuilding the index the query now completes in less than a second.
Your TOP has no ORDER BY, so it's simply the same as SET ROWCOUNT 6000 first. An ORDER BY would require all rows to be evaluated first, and that would take a lot longer.
If dbo.some_table_function is an inline table-valued UDF, then it's simply a macro that gets expanded, so it returns the first 6000 rows, as mentioned, in no particular order.
If the UDF is a multi-statement one, then it's a black box and will always pull in the full dataset before filtering. I don't think this is happening here.
Not directly related, but another SO question on TVFs
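To make the inline vs multi-statement distinction concrete, here is a minimal sketch of each form (the function, table and column names are made up):

-- Inline TVF: a single RETURN (SELECT ...) that is expanded like a macro into the
-- calling query, so outer predicates and TOP can be pushed into it.
CREATE FUNCTION dbo.fn_orders_inline (@CustomerId int)
RETURNS TABLE
AS
RETURN (
    SELECT OrderId, OrderDate, Amount
    FROM dbo.Orders
    WHERE CustomerId = @CustomerId
);
GO
-- Multi-statement TVF: fills a table variable first, so the caller sees a black box
-- and the full result is materialized before any outer filtering.
CREATE FUNCTION dbo.fn_orders_multi (@CustomerId int)
RETURNS @result TABLE (OrderId int, OrderDate datetime, Amount money)
AS
BEGIN
    INSERT INTO @result (OrderId, OrderDate, Amount)
    SELECT OrderId, OrderDate, Amount
    FROM dbo.Orders
    WHERE CustomerId = @CustomerId;
    RETURN;
END;
GO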
You may be running into something as simple as caching here - perhaps (for whatever reason) the "TOP" query is cached? Using an index that the other isn't?
In any case, the best way to satisfy your curiosity is to examine the full execution plan for both queries. You can do this right in SQL Server Management Studio and it'll tell you EXACTLY what operations are being performed and how long each is predicted to take.
All SQL implementations are quirky in their own way - SQL Server's no exception. These kind of "whaaaaaa?!" moments are pretty common. ;^)
It's not necessarily true that the whole table is processed if col1 has an index.
The SQL optimization will choose whether or not to use an index. Perhaps your "TOP" is forcing it to use the index.
If you are using the MSSQL Query Analyzer (The name escapes me) hit Ctrl-K. This will show the execution plan for the query instead of executing it. Mousing over the icons will show the IO/CPU usage, I believe.
I bet one is using an index seek, while the other isn't.
If you have a generic client:
SET SHOWPLAN_ALL ON;
GO
select ...;
go
see http://msdn.microsoft.com/en-us/library/ms187735.aspx for details.
I think Quassnoi's suggestion seems very plausible. By adding TOP 6000 you are implicitly giving the optimizer a hint that a fairly small subset of the 200,000 rows is going to be returned. The optimizer then uses an index seek instead of a clustered index scan or table scan.
Another possible explanation could be caching, as Jim Davis suggests. This is fairly easy to rule out by running the queries again. Try running the one with TOP 6000 first.