I have a question about LIMIT/TOP. As I understand before we get only rows from the limit, the whole table is processed.
so if I write Select * from TABLE limit 2, first the whole table is processed and then it is cut.
Is there a way to cut it before it gets processed? So for example "take 2 random rows". So then I don't query the whole table, but only a part of it.
I hope this question makes sense to you. I will appreciate your help!
In the execution plan tree a LIMIT node will stop processing the child nodes as soon as it's complete; i.e., when it receives the maximum number of rows from the child nodes (in your case 2 rows).
This will be very effective in terms of performance and response time if the child nodes are pipelined, reducing the cost drastically. For example:
select * from t limit 2;
If the child nodes are materialized then the subbranch will be entirely processed before limiting, and the cost and response time won't be significantly affected. For example:
select * from t order by rand() limit 2;
MySQL Limit clause used select statement is used to restrict the number of rows returns from the result set, rather than fetching the whole set from table.
If you use Select * from TABLE limit 2 it will give result set in random order. It better to used Limit clause with criteria so you can increase the performance on table.
For example:
SELECT * FROM TABLE
WHERE column_name >30
ORDER BY column_name DESC
LIMIT 5;
I have a query that returns results very fast, seconds. But when I want to fetch all the rows it takes several hours.
If my definition of how long a query takes to run is to fetch all rows, how can one measure this besides actually fetching all the rows?
Would select count (*) on all rows be a good indicator on how long it would take to fetch all rows?
select count(*) is going to likely do a table scan to return the total number of records.
Depending what is in the table and how it is indexed the count(*) would most likely return faster then running a select *
You could run some baselines on your table by using set statistics time on and set statistics io on.
I would also suggest running with client statistics.
Also, try running a top 100, 1000, 10000 with the above turned on.
When I performance tune I like to look at actual execution plan and estimated execution plan
I have a pipelined function . I have two sql statements as below.
The first one is as select * from table
and the second one is a select count(*) from table.
SELECT *
FROM table (es_feed_api_da.invoice_daily ('10-sep-2014'));
SELECT count(*)
FROM table (es_feed_api_da.invoice_daily ('10-sep-2014'));
i am running the two queries in toad.
I find that the second one (select count(*)) takes relatively more time than first one(select *)
Can someone please explain the reason to me..
Thanks
i am running the two queries in toad
I find that the second one (select count(*)) takes relatively more time than first one(select *) Can someone please explain the reason to me.
It is quite obvious that SELECT * would be faster than SELECT COUNT(*) because you are executing it on TOAD which is a GUI based client tool and gives you only first few rows(like only 50 rows in SQL Developer) when you simply project/select the rows. The time elapsed would keep increasing as and when you fetch more rows by scrolling down the query result.
On the other hand, when you do a SELECT COUNT(*), it must count all the rows in the table as opposed to SELECT * which only returns the first few rows in TOAD.
I don't have TOAD, but I can demonstrate the behaviour in SQL Developer.
Output of SELECT * ONLY first 50 rows:
After scrolling down to 500 rows:
The time taken to fetch further rows will increase as and when you scroll down further.
I have a table with 226 million rows that has a varchar2(2000) column. The first 10 characters are indexed using a functional index SUBSTR("txtField",1,10).
I am running a query such as this:
select count(1)
from myTable
where SUBSTR("txtField",1,10) = 'ABCDEFGHIJ';
The value does not exist in the database so the return in "0".
The explain plan shows that the operation performed is "INDEX (RANGE SCAN)" which I would assume and the cost is 4. When I run this query it takes on average 114 seconds.
If I change the query and force it to not use the index:
select count(1)
from myTable
where SUBSTR("txtField",1,9) = 'ABCDEFGHI';
The explain plan shows the operation will be a "TABLE ACCESS (FULL)" which makes sense. The cost is 629,000. When I run this query it takes on average 103 seconds.
I am trying to understand how scanning an index can take longer than reading every record in the table and performing the substr function on a field.
Followup:
There are 230M+ rows in the table and the query returns 17 rows; I selected a new value that is in the database. Initially I was executing with a value that was not in the database and returned zero rows. It seems to make no difference.
Querying for information on the index yields:
CLUSTERING_FACTOR=201808147
LEAF_BLOCKS=1131660
I am running the query with AUTOTRACE ON and the gather_plan_statistics and will add those results when they are available.
Thanks for all the suggestions.
There's a lot of possibilities.
You need to look at the actual execution plan, though.
You can run the query with the /*+ gather_plan_statistics */ hint, and then execute:
select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
You should also look into running a trace/tkprof to see what is actually happening - your DBA should be able to assist you with this.
The following query takes forever to finish. But if I remove the top 10 clause, it finishs rather quickly. big_table_1 and big_table_2 are 2 tables with 10^5 records.
I used to believe that top clause will reduce the time cost, but it's apparently not here. Why???
select top 10 ServiceRequestID
from
(
(select *
from big_table_1
where big_table_1.StatusId=2
) cap1
inner join
big_table_2 cap2
on cap1.ServiceRequestID = cap2.CustomerReferenceNumber
)
There are other stackoverflow discussions on this same topic (links at bottom). As noted in the comments above it might have something to do with indexes and the optimizer getting confused and using the wrong one.
My first thought is that you are doing a select top serviceid from (select *....) and the optimizer may have difficulty pushing the query down to the inner queries and making using of the index.
Consider rewriting it as
select top 10 ServiceRequestID
from big_table_1
inner join big_table_2 cap2
on cap1.servicerequestid = cap2.customerreferencenumber
and big_table_1.statusid = 2
In your query, the database is probably trying to merge the results and return them and THEN limit it to the top 10 in the outer query. In the above query the database will only have to gather the first 10 results as results are being merged, saving loads of time. And if servicerequestID is indexed, it will be sure to use it. In your example, the query is looking for the servicerequestid column in a result set that has already been returned in a virtual, unindexed format.
Hope that makes sense. While hypothetically the optimizer is supposed to take whatever format we put SQL in and figure out the best way to return values every time, the truth is that the way we put our SQL together can really impact the order in which certain steps are done on the DB.
SELECT TOP is slow, regardless of ORDER BY
Why is doing a top(1) on an indexed column in SQL Server slow?
I had a similar problem with a query like yours. The query ordered but without the top clause took 1 sec, same query with top 3 took 1 minute.
I saw that using a variable for the top it worked as expected.
The code for your case:
declare #top int = 10;
select top (#top) ServiceRequestID
from
(
(select *
from big_table_1
where big_table_1.StatusId=2
) cap1
inner join
big_table_2 cap2
on cap1.ServiceRequestID = cap2.CustomerReferenceNumber
)
I cant explain why but I can give an idea:
try adding SET ROWCOUNT 10 before your query. It helped me in some cases. Bear in mind that this is a scope setting so you have to set it back to its original value after running your query.
Explanation:
SET ROWCOUNT: Causes SQL Server to stop processing the query after the specified number of rows are returned.
This can also depend on what you mean by "finished". If "finished" means you start seeing some display on a gui, that does not necessarily mean the query has completed executing. It can mean that the results are beginning to stream in, not that the streaming is complete. When you wrap this into a subquery, the outer query can't really do it's processing until all the results of the inner query are available:
the outer query is dependent on the length of time it takes to return the last row of the inner query before it can "finish"
running the inner query independently may only requires waiting until the first row is returned before seeing any results
In Oracle, there were "first_rows" and "all_rows" hints that were somewhat related to manipulating this kind of behaviour. AskTom discussion.
If the inner query takes a long time between generating the first row and generating the last row, then this could be an indicator of what is going on. As part of the investigation, I would take the inner query and modify it to have a grouping function (or an ordering) to force processing all rows before a result can be returned. I would use this as a measure of how long the inner query really takes for comparison to the time in the outer query takes.
Drifting off topic a bit, it might be interesting to try simulating something like this in Oracle: create a Pipelined function to stream back numbers; stream back a few (say 15), then spin for a while before streaming back more.
Used a jdbc client to executeQuery against the pipelined function. The Oracle Statement fetchSize is 10 by default. Loop and print the results with a timestamp. See if the results stagger. I could not test this with Postgresql (RETURN NEXT), since Postgres does not stream the results from the function.
Oracle Pipelined Function
A pipelined table function returns a row to its invoker immediately
after processing that row and continues to process rows. Response time
improves because the entire collection need not be constructed and
returned to the server before the query can return a single result
row. (Also, the function needs less memory, because the object cache
need not materialize the entire collection.)
Postgresql RETURN NEXT
Note: The current implementation of RETURN NEXT and RETURN QUERY
stores the entire result set before returning from the function, as
discussed above. That means that if a PL/pgSQL function produces a
very large result set, performance might be poor: data will be written
to disk to avoid memory exhaustion, but the function itself will not
return until the entire result set has been generated. A future
version of PL/pgSQL might allow users to define set-returning
functions that do not have this limitation.
JDBC Default Fetch Sizes
statement.setFetchSize(100);
When debugging things like this I find that the quickest way to figure out how SQL Server "sees" the two queries is to look at their query plans. Hit CTRL-L in SSMS in the query view and the results will show what logic it will use to build your results when the query is actually executed.
SQL Server maintains statistics about the data your tables, e.g. histograms of the number of rows with data in certain ranges. It gathers and uses these statistics to try to predict the "best" way to run queries against those tables. For example, it might have data that suggests for some inputs a particular subquery might be expected to return 1M rows, while for other inputs the same subquery might return 1000 rows. This can lead it to choose different strategies for building the results, say using a table scan (exhaustively search the table) instead of an index seek (jump right to the desired data). If the statistics don't adequately represent the data, the "wrong" strategy can be chosen, with results similar to what you're experiencing. I don't know if that's the problem here, but that's the kind of thing I would look for.
If you want to compare performances of your two queries, you have to run these two queries in the same situation ( with clean memory buffers ) and have mumeric statistics
Run this batch for each query to compare execution time and statistics results
(Do not run it on a production environment) :
DBCC FREEPROCCACHE
GO
CHECKPOINT
GO
DBCC DROPCLEANBUFFERS
GO
SET STATISTICS IO ON
GO
SET STATISTICS TIME ON
GO
-- your query here
GO
SET STATISTICS TIME OFF
GO
SET STATISTICS IO OFF
GO
I've just had to investigate a very similar issue.
SELECT TOP 5 *
FROM t1 JOIN t2 ON t2.t1id = t1.id
WHERE t1.Code = 'MyCode'
ORDER BY t2.id DESC
t1 has 100K rows, t2 20M rows, The average number of rows from the joined tables for a t1.Code is about 35K. The actual resultset is only 3 rows because t1.Code = 'MyCode' only matches 2 rows which only have 3 corresponding rows in t2. Stats are up-to-date.
With the TOP 5 as above the query takes minutes, with the TOP 5 removed the query returns immediately.
The plans with and without the TOP are completely different.
The plan without the TOP uses an index seek on t1.Code, finds 2 rows, then nested loop joins 3 rows via an index seek on t2. Very quick.
The plan with the TOP uses an index scan on t2 giving 20M rows, then nested loop joins 2 rows via an index seek on t1.Code, then applies the top operator.
What I think makes my TOP plan so bad is that the rows being picked from t1 and t2 are some of the newest rows (largest values for t1.id and t2.id). The query optimiser has assumed that picking the first 5 rows from an evenly distributed average resultset will be quicker than the non-TOP approach. I tested this theory by using a t1.code from the very earliest rows and the response is sub-second using the same plan.
So the conclusion, in my case at least, is that the problem is a result of uneven data distribution that is not reflected in the stats.
TOP does not sort the results to my knowledge unless you use order by.
So my guess would be, as someone had already suggested, that the query isn't taking longer to execute. You simply start seeing the results faster when you don't have TOP in the query.
Try using #sql_mommy query, but make sure you have the following:
To get your query to run faster, you could create an index on servicerequestid and statusid in big_table_1 and an index on customerreferencenumber in big_table_2. If you create unclustered indexes, you should get an index only plan with very fast results.
If I remember correctly, the TOP results will be in the same order as the index you us on big_table_1, but I'm not sure.
GĂsli
It might be a good idea to compare the execution plans between the two. Your statistics might be out of date. If you see a difference between the actual execution plans, there is your difference in performance.
In most cases, you would expect better performance in the top 10. In your case, performance is worse. If this is the case you will not only see a difference between the execution plans, but you will also see a difference in the number of returned rows in the estimated execution plan and the actual execution plan, leading to the poor decission by the SQL engine.
Try again after recomputing your statistics (and while you're at it, rebuilding indices)
Also check if it helps to take out the where big_table_1.StatusId=2 and instead go for
select top 10 ServiceRequestID
from big_table_1 as cap1 INNER JOIN
big_table_2 as cap2
ON cap1.ServiceRequestID = cap2.CustomerReferenceNumber
WHERE cap1.StatusId=2
I find this format much more readable, though it should (though remotely possibly it doesn't) optimise to the same execution plan. The returned endresult will be identical regardless