Identify query run time - sql

I have a query that starts returning results very fast, within seconds. But when I want to fetch all the rows it takes several hours.
If my definition of how long a query takes to run is to fetch all rows, how can one measure this besides actually fetching all the rows?
Would select count (*) on all rows be a good indicator on how long it would take to fetch all rows?

select count(*) will likely do a table scan to return the total number of records.
Depending on what is in the table and how it is indexed, count(*) would most likely return faster than running a select *.
You could run some baselines on your table by using set statistics time on and set statistics io on.
I would also suggest running with client statistics.
Also, try running a top 100, 1000, 10000 with the above turned on.
When I performance tune, I like to look at both the actual execution plan and the estimated execution plan.
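A minimal sketch of such a baseline in T-SQL, assuming a hypothetical table dbo.my_table (substitute your own). In SSMS the timings and read counts show up on the Messages tab:

```sql
-- dbo.my_table is a hypothetical name; substitute your own table.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Fetch progressively larger slices and compare the reported elapsed times.
SELECT TOP (100)   * FROM dbo.my_table;
SELECT TOP (1000)  * FROM dbo.my_table;
SELECT TOP (10000) * FROM dbo.my_table;

-- Reads every row (or an index over them) but returns a single value.
SELECT COUNT(*) FROM dbo.my_table;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

If elapsed time grows roughly in proportion to the number of rows fetched, extrapolating from these slices gives a rough estimate for the full result set. The count(*) timing leaves out the cost of transferring the rows to the client, so it is probably best treated as a lower bound.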

Related

Simple select from table takes 24 seconds in SQL Server 2014

I have a table named [cwbOrder] that currently has 1,277,469 rows. I am using SQL Server 2014 and I am doing these tests on a UAT environment; on production this query takes a little bit longer.
If I try selecting all of the rows like using:
SELECT * FROM cwbOrder
It takes 24 seconds to retrieve all of the data from the table. I have read about how important it is to index columns used in predicates (WHERE), but I still cannot understand how a simple select can take 24 seconds.
Using this table in other more complex queries generates a lot of extra workload for the query, although I have created the JOINs on indexed columns. Additionally I have selected only 2 columns from this table then JOINED it to another table and this operation still takes a significantly long amount of time. As an example please consider the below query:
Below I have attached the index structure of both tables, to illustrate the matter:
PK_cwbOrder is the index on the id_cwbOrder column in the cwbOrder table.
Edit 1: I have added the execution plan for the query in which I join the cwbOrder table with the cwbAction table.
Is there any way, considering the information above, that I can make this query faster?
There are many reasons why such a select could be slow:
The row size or the number of rows could be very large, so transferring the result to the client takes a long time.
Other operations on the table could have locks on the table.
The database server or network could be very busy.
The "table" could really be a view that is running a complicated query.
You can test different aspects. For instance:
SELECT TOP 10 <one column here>
FROM cwbOrder o
This returns a very small result set and reads just a small part of the table. This reads the entire table but returns a small result set:
SELECT COUNT(*)
FROM cwbOrder o
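Two of the other possibilities listed above (a view behind the name, or blocking by other sessions) can be checked with catalog and dynamic management views. This is a sketch only; it assumes you have permission to read sys.dm_exec_requests (VIEW SERVER STATE on most setups):

```sql
-- Is cwbOrder a table or a view? type_desc is USER_TABLE or VIEW.
SELECT name, type_desc
FROM sys.objects
WHERE name = 'cwbOrder';

-- Requests that are currently blocked by another session.
SELECT session_id, blocking_session_id, wait_type, wait_time
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;
```

Run the second query from a separate session while the slow select is executing.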

Fastest way to do SELECT * WHERE not null

I'm wondering what is the fastest way to get all non null rows. I've thought of these :
SELECT * FROM table WHERE column IS NOT NULL
SELECT * FROM table WHERE column = column
SELECT * FROM table WHERE column LIKE '%'
(I don't know how to measure execution time in SQL and/or Hive, and from repeatedly trying on a 4M lines table in pgAdmin, I get no noticeable difference.)
You will not notice any difference in performance when running those queries on Hive, because these operations are quite simple and run on mappers that execute in parallel.
Initializing and starting mappers takes far more time than any possible difference in execution time between these queries, and it adds a lot of variability to the total execution time, because mappers may be waiting for resources and not running at all.
But you can try to measure time, see this answer about how to measure execution time: https://stackoverflow.com/a/44872319/2700344
SELECT * FROM table WHERE column IS NOT NULL is more straightforward (understandable/readable), though all of the queries are correct.
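Since the question mentions pgAdmin, one way to time the three variants on PostgreSQL (not Hive) is EXPLAIN ANALYZE, which executes the statement and reports planning and execution time. The names my_table and my_column are placeholders, and the LIKE variant assumes a text column:

```sql
-- PostgreSQL only; my_table and my_column are hypothetical names.
EXPLAIN ANALYZE SELECT * FROM my_table WHERE my_column IS NOT NULL;
EXPLAIN ANALYZE SELECT * FROM my_table WHERE my_column = my_column;
EXPLAIN ANALYZE SELECT * FROM my_table WHERE my_column LIKE '%';
```

Run each statement several times, since the first run may include the cost of reading pages from disk.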

Reduce execution time on simple select query

SELECT sub_id, quo_id
FROM cos_emails WITH (nolock)
WHERE quo_id = 999624 AND sub_id = 771336
This query executes in 50 seconds and returns only one record. There are 16747425 records in the table.
How to reduce execution time?
First of all
Show us the execution plan
I am pretty sure that there are missing indexes.
If a table scan or a clustered index scan is occurring, the query will consume a lot of resources and hence take time to return the result.
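If the execution plan does show a scan, an index on the two filtered columns is the usual first thing to try. A sketch, with a made-up index name:

```sql
-- IX_cos_emails_quo_sub is a hypothetical index name.
CREATE NONCLUSTERED INDEX IX_cos_emails_quo_sub
    ON cos_emails (quo_id, sub_id);

-- Re-run the original query and compare the plan and timing.
SELECT sub_id, quo_id
FROM cos_emails WITH (nolock)
WHERE quo_id = 999624 AND sub_id = 771336;
```

Because the query selects only the two indexed columns, this index should be able to answer it without touching the base table.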

Why select Top clause could lead to long time cost

The following query takes forever to finish. But if I remove the top 10 clause, it finishes rather quickly. big_table_1 and big_table_2 are 2 tables with 10^5 records.
I used to believe that top clause will reduce the time cost, but it's apparently not here. Why???
select top 10 ServiceRequestID
from
(
(select *
from big_table_1
where big_table_1.StatusId=2
) cap1
inner join
big_table_2 cap2
on cap1.ServiceRequestID = cap2.CustomerReferenceNumber
)
There are other Stack Overflow discussions on this same topic (links at bottom). As noted in the comments above, it might have something to do with indexes and the optimizer getting confused and using the wrong one.
My first thought is that you are doing a select top serviceid from (select *....) and the optimizer may have difficulty pushing the query down to the inner queries and making use of the index.
Consider rewriting it as
select top 10 ServiceRequestID
from big_table_1 cap1
inner join big_table_2 cap2
on cap1.servicerequestid = cap2.customerreferencenumber
and cap1.statusid = 2
In your query, the database is probably trying to merge the results and return them and THEN limit it to the top 10 in the outer query. In the above query the database will only have to gather the first 10 results as results are being merged, saving loads of time. And if servicerequestID is indexed, it will be sure to use it. In your example, the query is looking for the servicerequestid column in a result set that has already been returned in a virtual, unindexed format.
Hope that makes sense. While hypothetically the optimizer is supposed to take whatever format we put SQL in and figure out the best way to return values every time, the truth is that the way we put our SQL together can really impact the order in which certain steps are done on the DB.
SELECT TOP is slow, regardless of ORDER BY
Why is doing a top(1) on an indexed column in SQL Server slow?
I had a similar problem with a query like yours. The query ordered but without the top clause took 1 sec, same query with top 3 took 1 minute.
I found that using a variable for the TOP value made it work as expected.
The code for your case:
declare @top int = 10;
select top (@top) ServiceRequestID
from
(
(select *
from big_table_1
where big_table_1.StatusId=2
) cap1
inner join
big_table_2 cap2
on cap1.ServiceRequestID = cap2.CustomerReferenceNumber
)
I can't explain why, but I can give an idea:
try adding SET ROWCOUNT 10 before your query. It helped me in some cases. Bear in mind that this is a session-level setting, so you have to set it back (SET ROWCOUNT 0 removes the limit) after running your query.
Explanation:
SET ROWCOUNT: Causes SQL Server to stop processing the query after the specified number of rows are returned.
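A sketch of that suggestion applied to the query from the question. The join is written in its flattened form, and the column is qualified with cap1 on the assumption that ServiceRequestID lives in big_table_1:

```sql
SET ROWCOUNT 10;   -- stop after 10 rows for the statements that follow

select cap1.ServiceRequestID
from big_table_1 cap1
inner join big_table_2 cap2
    on cap1.ServiceRequestID = cap2.CustomerReferenceNumber
where cap1.StatusId = 2;

SET ROWCOUNT 0;    -- 0 turns the limit off again
```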
This can also depend on what you mean by "finished". If "finished" means you start seeing some display on a GUI, that does not necessarily mean the query has completed executing. It can mean that the results are beginning to stream in, not that the streaming is complete. When you wrap this into a subquery, the outer query can't really do its processing until all the results of the inner query are available:
the outer query depends on the time it takes to return the last row of the inner query before it can "finish"
running the inner query independently may only require waiting until the first row is returned before seeing any results
In Oracle, there were "first_rows" and "all_rows" hints that were somewhat related to manipulating this kind of behaviour. AskTom discussion.
If the inner query takes a long time between generating the first row and generating the last row, then this could be an indicator of what is going on. As part of the investigation, I would take the inner query and modify it to use a grouping function (or an ordering) to force it to process all rows before any result can be returned. I would then use that as a measure of how long the inner query really takes, for comparison with the time the outer query takes.
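One way to do that, sketched here against the inner query from the question, is to wrap it in a COUNT(*) so that nothing can be returned until every row has been produced:

```sql
-- Counting over the inner query forces every row to be produced
-- before anything comes back, so elapsed time reflects the full run.
SET STATISTICS TIME ON;

SELECT COUNT(*) AS inner_rows
FROM (
    SELECT *
    FROM big_table_1
    WHERE big_table_1.StatusId = 2
) AS cap1;

SET STATISTICS TIME OFF;
```

The optimizer may satisfy the count from a narrower index than the one the real query would use, so the figure is a rough guide rather than an exact measurement.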
Drifting off topic a bit, it might be interesting to try simulating something like this in Oracle: create a Pipelined function to stream back numbers; stream back a few (say 15), then spin for a while before streaming back more.
Use a JDBC client to executeQuery against the pipelined function. The Oracle Statement fetchSize is 10 by default. Loop and print the results with a timestamp. See if the results stagger. I could not test this with PostgreSQL (RETURN NEXT), since Postgres does not stream the results from the function.
Oracle Pipelined Function
A pipelined table function returns a row to its invoker immediately
after processing that row and continues to process rows. Response time
improves because the entire collection need not be constructed and
returned to the server before the query can return a single result
row. (Also, the function needs less memory, because the object cache
need not materialize the entire collection.)
Postgresql RETURN NEXT
Note: The current implementation of RETURN NEXT and RETURN QUERY
stores the entire result set before returning from the function, as
discussed above. That means that if a PL/pgSQL function produces a
very large result set, performance might be poor: data will be written
to disk to avoid memory exhaustion, but the function itself will not
return until the entire result set has been generated. A future
version of PL/pgSQL might allow users to define set-returning
functions that do not have this limitation.
JDBC Default Fetch Sizes
statement.setFetchSize(100);
When debugging things like this I find that the quickest way to figure out how SQL Server "sees" the two queries is to look at their query plans. Hit CTRL-L in SSMS in the query view and the results will show what logic it will use to build your results when the query is actually executed.
SQL Server maintains statistics about the data in your tables, e.g. histograms of the number of rows with data in certain ranges. It gathers and uses these statistics to try to predict the "best" way to run queries against those tables. For example, it might have data that suggests for some inputs a particular subquery might be expected to return 1M rows, while for other inputs the same subquery might return 1000 rows. This can lead it to choose different strategies for building the results, say using a table scan (exhaustively search the table) instead of an index seek (jump right to the desired data). If the statistics don't adequately represent the data, the "wrong" strategy can be chosen, with results similar to what you're experiencing. I don't know if that's the problem here, but that's the kind of thing I would look for.
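To see which statistics exist on a table and when they were last refreshed, something along these lines should work (big_table_1 is taken from the question; adjust the schema if it is not the default):

```sql
-- List the statistics objects on big_table_1 and their last update time.
SELECT s.name AS stats_name,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('big_table_1');
```

A last_updated value that is old relative to how much the data has changed suggests the optimizer may be working from a stale picture.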
If you want to compare the performance of your two queries, you have to run them under the same conditions (with clean memory buffers) and collect numeric statistics.
Run this batch for each query to compare execution time and statistics results.
(Do not run it on a production environment.)
DBCC FREEPROCCACHE
GO
CHECKPOINT
GO
DBCC DROPCLEANBUFFERS
GO
SET STATISTICS IO ON
GO
SET STATISTICS TIME ON
GO
-- your query here
GO
SET STATISTICS TIME OFF
GO
SET STATISTICS IO OFF
GO
I've just had to investigate a very similar issue.
SELECT TOP 5 *
FROM t1 JOIN t2 ON t2.t1id = t1.id
WHERE t1.Code = 'MyCode'
ORDER BY t2.id DESC
t1 has 100K rows and t2 has 20M rows. The average number of rows from the joined tables for a t1.Code is about 35K. The actual result set is only 3 rows, because t1.Code = 'MyCode' only matches 2 rows, which only have 3 corresponding rows in t2. Stats are up to date.
With the TOP 5 as above the query takes minutes, with the TOP 5 removed the query returns immediately.
The plans with and without the TOP are completely different.
The plan without the TOP uses an index seek on t1.Code, finds 2 rows, then nested loop joins 3 rows via an index seek on t2. Very quick.
The plan with the TOP uses an index scan on t2 giving 20M rows, then nested loop joins 2 rows via an index seek on t1.Code, then applies the top operator.
What I think makes my TOP plan so bad is that the rows being picked from t1 and t2 are some of the newest rows (largest values for t1.id and t2.id). The query optimiser has assumed that picking the first 5 rows from an evenly distributed average resultset will be quicker than the non-TOP approach. I tested this theory by using a t1.code from the very earliest rows and the response is sub-second using the same plan.
So the conclusion, in my case at least, is that the problem is a result of uneven data distribution that is not reflected in the stats.
TOP does not sort the results to my knowledge unless you use order by.
So my guess would be, as someone had already suggested, that the query isn't taking longer to execute. You simply start seeing the results faster when you don't have TOP in the query.
Try using @sql_mommy's query, but make sure you have the following:
To get your query to run faster, you could create an index on servicerequestid and statusid in big_table_1 and an index on customerreferencenumber in big_table_2. If you create nonclustered indexes, you should get an index-only plan with very fast results.
If I remember correctly, the TOP results will be in the same order as the index you use on big_table_1, but I'm not sure.
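A sketch of those two indexes. The index names are made up, and putting StatusId first (so the equality filter can seek) is my own choice rather than something stated above:

```sql
-- Index names are hypothetical.
CREATE NONCLUSTERED INDEX IX_big_table_1_status_request
    ON big_table_1 (StatusId, ServiceRequestID);

CREATE NONCLUSTERED INDEX IX_big_table_2_customer_ref
    ON big_table_2 (CustomerReferenceNumber);
```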
Gísli
It might be a good idea to compare the execution plans between the two. Your statistics might be out of date. If you see a difference between the actual execution plans, there is your difference in performance.
In most cases, you would expect better performance with the top 10. In your case, performance is worse. If this is the case, you will not only see a difference between the execution plans, but you will also see a difference between the number of returned rows in the estimated execution plan and in the actual execution plan, leading to the poor decision by the SQL engine.
Try again after recomputing your statistics (and while you're at it, rebuilding indices).
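In T-SQL that could look roughly like this, using the table names from the question:

```sql
-- Rebuild all indexes on both tables (can be expensive on large tables).
-- A rebuild also refreshes the statistics that belong to those indexes.
ALTER INDEX ALL ON big_table_1 REBUILD;
ALTER INDEX ALL ON big_table_2 REBUILD;

-- Recompute the remaining statistics by scanning every row.
UPDATE STATISTICS big_table_1 WITH FULLSCAN;
UPDATE STATISTICS big_table_2 WITH FULLSCAN;
```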
Also check if it helps to take out the where big_table_1.StatusId=2 and instead go for
select top 10 ServiceRequestID
from big_table_1 as cap1 INNER JOIN
big_table_2 as cap2
ON cap1.ServiceRequestID = cap2.CustomerReferenceNumber
WHERE cap1.StatusId=2
I find this format much more readable, though it should optimise to the same execution plan (there is a remote possibility that it doesn't). The returned end result will be identical regardless.

SQL massive performance difference using SELECT TOP x even when x is much higher than selected rows

I'm selecting some rows from a table valued function but have found an inexplicable massive performance difference by putting SELECT TOP in the query.
SELECT col1, col2, col3 etc
FROM dbo.some_table_function
WHERE col1 = @parameter
--ORDER BY col1
is taking upwards of 5 or 6 mins to complete.
However
SELECT TOP 6000 col1, col2, col3 etc
FROM dbo.some_table_function
WHERE col1 = @parameter
--ORDER BY col1
completes in about 4 or 5 seconds.
This wouldn't surprise me if the returned set of data were huge, but the particular query involved returns ~5000 rows out of 200,000.
So in both cases, the whole of the table is processed, as SQL Server continues to the end in search of 6000 rows which it will never get to. Why the massive difference then? Is this something to do with the way SQL Server allocates space in anticipation of the result set size (the TOP 6000 thereby giving it a low requirement which is more easily allocated in memory)?
Has anyone else witnessed something like this?
Thanks
Table valued functions can have a non-linear execution time.
Let's consider function equivalent for this query:
SELECT (
SELECT SUM(mi.value)
FROM mytable mi
WHERE mi.id <= mo.id
)
FROM mytable mo
ORDER BY
mo.value
This query (that calculates the running SUM) is fast at the beginning and slow at the end, since on each row from mo it should sum all the preceding values which requires rewinding the rowsource.
Time taken to calculate SUM for each row increases as the row numbers increase.
If you make mytable large enough (say, 100,000 rows, as in your example) and run this query you will see that it takes considerable time.
However, if you apply TOP 5000 to this query you will see that it completes much faster than 1/20 of the time required for the full table.
Most probably, something similar happens in your case too.
To say something more definitely, I need to see the function definition.
Update:
SQL Server can push predicates into the function.
For instance, I just created this TVF:
CREATE FUNCTION fn_test()
RETURNS TABLE
AS
RETURN (
SELECT *
FROM master
);
These queries:
SELECT *
FROM fn_test()
WHERE name = @name
SELECT TOP 1000 *
FROM fn_test()
WHERE name = @name
yield different execution plans (the first one uses clustered scan, the second one uses an index seek with a TOP)
I had the same problem, a simple query joining five tables returning 1000 rows took two minutes to complete. When I added "TOP 10000" to it it completed in less than one second. It turned out that the clustered index on one of the tables was heavily fragmented.
After rebuilding the index the query now completes in less than a second.
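Fragmentation can be checked before deciding to rebuild. A sketch, assuming a hypothetical table dbo.my_table:

```sql
-- dbo.my_table is a hypothetical name; substitute the table you suspect.
SELECT i.name AS index_name,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.my_table'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id
   AND i.index_id  = ps.index_id;

-- Rebuild every index on the table if fragmentation is high.
ALTER INDEX ALL ON dbo.my_table REBUILD;
```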
Your TOP has no ORDER BY, so it's simply the same as running SET ROWCOUNT 6000 first. An ORDER BY would require all rows to be evaluated first, and it would take a lot longer.
If dbo.some_table_function is an inline table-valued UDF, then it's simply a macro that's expanded, so it returns the first 6000 rows, as mentioned, in no particular order.
If the UDF is multi-statement, then it's a black box and will always pull in the full dataset before filtering. I don't think this is happening.
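For reference, the two kinds of table-valued function differ in shape as sketched below. The function names, dbo.my_table and the column types are all placeholders:

```sql
-- Inline TVF: a single RETURN (SELECT ...); expanded into the calling query.
CREATE FUNCTION dbo.fn_inline_example()
RETURNS TABLE
AS
RETURN (
    SELECT col1, col2
    FROM dbo.my_table
);
GO

-- Multi-statement TVF: fills a table variable, then returns it.
CREATE FUNCTION dbo.fn_multi_example()
RETURNS @result TABLE (col1 int, col2 int)
AS
BEGIN
    INSERT INTO @result (col1, col2)
    SELECT col1, col2
    FROM dbo.my_table;
    RETURN;
END;
GO
```

With the inline form, a WHERE or TOP in the calling query can be folded into the function body by the optimizer. With the multi-statement form, the table variable is populated in full before the caller filters it.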
Not directly related, but another SO question on TVFs
You may be running into something as simple as caching here - perhaps (for whatever reason) the "TOP" query is cached? Using an index that the other isn't?
In any case, the best way to quench your curiosity is to examine the full execution plan for both queries. You can do this right in SQL Server Management Studio and it'll tell you EXACTLY what operations are being performed and how long each is predicted to take.
All SQL implementations are quirky in their own way, and SQL Server is no exception. These kinds of "whaaaaaa?!" moments are pretty common. ;^)
It's not necessarily true that the whole table is processed if col1 has an index.
The SQL optimizer will choose whether or not to use an index. Perhaps your "TOP" is forcing it to use the index.
If you are using the MSSQL Query Analyzer (the name escapes me), hit Ctrl-L. This will show the estimated execution plan for the query instead of executing it. Mousing over the icons will show the IO/CPU usage, I believe.
I bet one is using an index seek, while the other isn't.
If you have a generic client:
SET SHOWPLAN_ALL ON;
GO
select ...;
GO
SET SHOWPLAN_ALL OFF;
GO
see http://msdn.microsoft.com/en-us/library/ms187735.aspx for details.
I think Quassnoi's suggestion seems very plausible. By adding TOP 6000 you are implicitly giving the optimizer a hint that a fairly small subset of the 200,000 rows is going to be returned. The optimizer then uses an index seek instead of a clustered index scan or table scan.
Another possible explanation could be caching, as Jim Davis suggests. This is fairly easy to rule out by running the queries again. Try running the one with TOP 6000 first.