Two queries with equal results take vastly different times to execute? - sql

I'm currently using Redshift. I was trying to execute a query to calculate a column called id_number (data type INTEGER) from a VARCHAR column called id, to speed up further queries by using id_number instead of id.
Here is the first query I tried:
select rank() over (order by id) id_number, id, sid1, sid2
from table
limit 10000
However, noticing that this query was taking quite some time, I tried the next query:
with A as (
select id, sid1, sid2
from table
limit 10000
)
select rank() over (order by id) id_number, id, sid1, sid2
from A
which was over in a flash.
How is it that the second query took so much less time to execute, when the two queries seem to do the exact same thing?
If it is because of the position of limit 10000, how does the position of the limit contribute to the difference in execution time?

Your two queries are quite different.
The first one has to sort the complete table to get the rank() and then emits the first 10000 rows of the result (with no particular ordering enforced).
The second one selects 10000 rows (without a particular ordering enforced) and then sorts those to calculate rank() on them.
If the table is significantly larger than 10000 rows, it is unsurprising that the first query, which has to sort it all, is much slower.
Look at the EXPLAIN output to understand this better (the EXPLAIN (ANALYZE, BUFFERS) form is PostgreSQL; Redshift supports plain EXPLAIN).
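A minimal sketch of that comparison (my_table is a stand-in name, since table itself is a reserved word):
explain
select rank() over (order by id) id_number, id, sid1, sid2
from my_table
limit 10000;
-- expect a sort over the entire table, then the limit

explain
with A as (
select id, sid1, sid2
from my_table
limit 10000
)
select rank() over (order by id) id_number, id, sid1, sid2
from A;
-- expect the limit first, then a sort over only 10000 rows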

Slow performance with NOT ANY and ORDER BY in same query

It seems like there is a strange performance hit when running a query that includes both NOT 'some string' = ANY(array_column) and an ORDER BY clause in the same query.
The following is a simplified table structure illustrating the behavior, where taggers is an array of UUIDs (v4):
CREATE TABLE IF NOT EXISTS "doc"."test" (
    "id" STRING,
    "last_active" TIMESTAMP,
    "taggers" ARRAY(STRING)
)
The taggers array can grow somewhat large, with maybe hundreds and in some cases thousands of individual strings.
The following queries are all very performant and resolve within 0.03 seconds:
SELECT id FROM test ORDER BY last_active DESC LIMIT 10;
SELECT id FROM test WHERE NOT ('da10187a-408d-4dfc-ae46-857fd23a574a' = ANY(taggers)) LIMIT 10;
SELECT id FROM test WHERE ('da10187a-408d-4dfc-ae46-857fd23a574a' = ANY(taggers)) ORDER BY last_active DESC LIMIT 10;
However, including both parts makes the query jump to around 2-3 seconds:
SELECT id FROM test WHERE NOT ('da10187a-408d-4dfc-ae46-857fd23a574a' = ANY(taggers)) ORDER BY last_active LIMIT 10;
What's very strange is that, of the previous list of fast queries, the last one is almost exactly the same as the slow one, just without the negation. Negating the ANY by itself is also very fast. It's only when the negated ANY is combined with ORDER BY and LIMIT that things slow down. Any help would be greatly appreciated.
The query with only ORDER BY doesn't apply any filtering, so of course it's much faster.
The query that only has the NOT ... ANY() filter, without ORDER BY, applies the filter to just a small number of records, stopping as soon as the LIMIT (10 in this case) is reached.
The last query (filtering with NOT ... ANY plus ORDER BY) is significantly slower because it has to do much more work: it has to apply the filter to all records of the table, then sort them, and finally return the first 10 (LIMIT).
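If most rows survive the NOT ... ANY filter, one possible mitigation is to bound the set being filtered and sorted with an extra range condition, so both steps touch far fewer rows. This is only a sketch, not the original poster's solution; the cutoff value is a placeholder, and whether it helps depends on the data distribution:
SELECT id
FROM test
WHERE last_active > '2020-01-01'  -- placeholder cutoff, not from the question
  AND NOT ('da10187a-408d-4dfc-ae46-857fd23a574a' = ANY(taggers))
ORDER BY last_active DESC
LIMIT 10;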

Is there any way to calculate the total number of rows returned from a dynamic query in a Common Table Expression (CTE) or subquery?

We are in the process of optimizing our database. Most of our stored procedures use CTEs because they give us high performance with our table structure. We have mostly dynamic queries that return different results according to different conditions. We hold all the data in a CTE and check conditions; that was not the problem, but we also need the total number of rows returned by each query, and calculating this takes a lot of time. A temporary table or table variable is not suitable in our case, as inserting the data into it takes a lot of time. We have a structure like the following:
With t(fields) as
(select field1, field2, .......
ROW_NUMBER() OVER (order by some column) as row...
from some table and lots of
inner and left joins
where some condition),
rowTotal(RowTotal) as
(select max(row) from t)
select * from t, RowTotal
where condition for paging
But max(row) takes a lot of time; if I remove it, the query returns data within 100ms. I tried Count(*), Count(SomeField), and many others; they work but take a lot of time. How can I get the total number of rows from the CTE within a few milliseconds? No aggregate function will work for me. Is there any other way to calculate the row total, like @@ROWCOUNT? Thanks in advance for any help.
If you are after the total number of rows from the inner query, you can add this as a column to your select using COUNT(*) with an OVER (PARTITION BY ...) clause.
With t(fields) as
(select COUNT(*) OVER (PARTITION BY 1) AS TotalRows,
field1,field2.......
ROW_NUMBER() OVER (order by some column) as row...
from some table ...
This should give you a count of the total rows in t as the first column of t.
I don't know that this is the fastest way to get the result you want, but it works for me on thousands of returned records and avoids extra select queries to find the count separately.
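As a sketch of how that plays out with paging (table and column names are placeholders; COUNT(*) OVER () is equivalent to the PARTITION BY 1 form above):
With t as
(select COUNT(*) OVER () AS TotalRows,
ROW_NUMBER() OVER (order by SomeColumn) as row,
field1, field2
from SomeTable)
select * from t
where row between 11 and 20 -- page 2 with a page size of 10; TotalRows rides along on every row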

MAX vs Top 1 - which is better?

I had to review some code and came across something someone had done differently. I can't think of a reason why my way is better, and it probably isn't, so: which is better/safer/more efficient?
SELECT MAX(a_date) FROM a_table WHERE a_primary_key = 5 GROUP BY event_id
OR
SELECT TOP 1 a_date FROM a_table WHERE a_primary_key = 5 ORDER BY a_date
I would have gone with the 2nd option, but I'm not sure why, and if that's right.
1) When there is a clustered index on the table and the column to be queried, both the MAX() operator and the query SELECT TOP 1 will have almost identical performance.
2) When there is no clustered index on the table and the column to be queried, the MAX() operator offers the better performance.
Reference: http://www.johnsansom.com/performance-comparison-of-select-top-1-verses-max/
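As a sketch, the kind of index those two cases refer to might look like this (index and column names are assumed from the queries above):
CREATE INDEX ix_a_table_key_date ON a_table (a_primary_key, a_date);
-- lets both MAX(a_date) and TOP 1 ... ORDER BY a_date locate the answer without a full scan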
Performance is generally similar, if your table is indexed.
Worth considering though: TOP usually only makes sense if you're ordering your results (otherwise, top of what?).
Ordering a result requires more processing.
MAX/MIN doesn't always require ordering. (It just depends, but often you don't need ORDER BY or GROUP BY, etc.)
In your two examples, I'd expect the speed and execution plan to be very similar. You can always turn to your stats to make sure, but I doubt the difference would be significant.
They are different queries.
The first one returns many records (the biggest a_date for each event_id found within a_primary_key = 5)
The second one returns one record (the smallest a_date found within a_primary_key = 5).
For the queries to have the same result you would need:
SELECT MAX(a_date) FROM a_table WHERE a_primary_key = 5
SELECT TOP 1 a_date FROM a_table WHERE a_primary_key = 5 ORDER BY a_date DESC
The best way to know which is faster is to check the query plan and do your benchmarks. There are many factors that would affect the speed, such as table/heap size, etc. And even different versions of the same database may be optimized to favor one query over the other.
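On SQL Server, for example, a quick sketch of such a benchmark:
SET STATISTICS TIME ON;
SELECT MAX(a_date) FROM a_table WHERE a_primary_key = 5;
SELECT TOP 1 a_date FROM a_table WHERE a_primary_key = 5 ORDER BY a_date DESC;
SET STATISTICS TIME OFF;
-- compare the CPU time and elapsed time reported for each statement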
I ran MAX and TOP on a table with 2,000,000+ records, and found that TOP gives faster results with ORDER BY than the MAX or MIN function.
So the best way is to execute both of your queries repeatedly for a while and check the elapsed time for each.
MAX and TOP function differently. Your first query will return the maximum value found for a_date that has a_primary_key = 5, for each different event_id found. The second query will simply grab the first a_date with a_primary_key = 5 found in the result set.
To add to the otherwise brilliant responses noting that the queries do very different things indeed, I'd like to point out that the results will also be very different if there are no rows matching the criteria in the select.
SELECT MAX() will return one row with a NULL value.
SELECT TOP 1 will return zero rows.
These are very different things.
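A quick sketch, assuming a key value that matches nothing:
SELECT MAX(a_date) FROM a_table WHERE a_primary_key = -1;
-- one row, containing NULL

SELECT TOP 1 a_date FROM a_table WHERE a_primary_key = -1 ORDER BY a_date DESC;
-- zero rows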
I ran an experiment and got a Clustered Index Scan cost of 98% when I used an aggregate like MIN/MAX, but when I used TOP and ORDER BY, the Clustered Index Scan cost was reduced to 45%. When it comes to querying large datasets, the TOP and ORDER BY combination will be less expensive and will give faster results.

total number of rows of a query

I have a very large query that is supposed to return only the top 10 results:
select top 10 ProductId from .....
The problem is that I also want the total number of results that match the criteria, without the 'top 10', but at the same time it's considered unacceptable to return all rows (we are talking about roughly 100,000 results).
Is there a way to get the total number of rows matched by the previous query, either within it or afterwards, without running it again?
PS: please no temp tables of 100,000 rows :))
Dump the count in a variable and return that:
declare @count int
select @count = count(*) from ..... --same where clause as your query
--now you add that to your query..of course it will be the same for every row..
select top 10 ProductId, @count as TotalCount from .....
Assuming that you're already using an ORDER BY clause (to properly define which rows the "TOP 10" are), you could also add a ROW_NUMBER call with the opposite sort order and pick the highest value returned.
E.g., the following:
select top 10 *, ROW_NUMBER() OVER (order by id desc) from sysobjects order by id
Has a final column with values 2001, 2000, 1999, etc, descending. And the following:
select COUNT(*) from sysobjects
Confirms that there are 2001 rows in sysobjects.
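Applied to the original query, the same trick might look like this sketch (table name and criteria are invented; the window order must be the exact reverse of the outer ORDER BY, over a unique column):
select top 10
ProductId,
ROW_NUMBER() OVER (order by ProductId desc) as rows_remaining
from Products -- hypothetical table
where Price > 100 -- hypothetical criteria
order by ProductId
-- rows_remaining on the first row returned equals the total number of matching rows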
I suppose you could hack it with a union select
select top 10 ... from ... where ...
union
select count(*) from ... where ...
For you to get away with this type of hack, you will need to add fake columns to the count query so it returns the same number of columns as the main query. For example:
select top 10 id, first_name from people
union
select count(*), '' as first_name from people
I don't recommend using this solution. Using two separate queries is how it should be done.
Generally speaking, no - the reasoning is as follows:
If(!) the query planner can make use of TOP 10 to return only 10 rows, then the RDBMS will not even know the exact number of rows that satisfy the full criteria; it just gets the TOP 10.
Therefore, when you want to find the count of all rows satisfying the criteria, you are not running the query a second time, but a first time.
Having said that, proper indexes might make both queries execute pretty fast.
Edit
MySQL has SQL_CALC_FOUND_ROWS, which returns the number of rows the query would have returned if no LIMIT had been applied - googling for an equivalent in MS SQL points to an analytic-SQL and CTE variant; see this forum (though I'm not sure either would qualify as running it only once, but feel free to check - and let us know).
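For reference, a minimal MySQL sketch of that feature (table and filter are invented; SQL_CALC_FOUND_ROWS is deprecated as of MySQL 8.0.17):
SELECT SQL_CALC_FOUND_ROWS ProductId
FROM Products -- hypothetical table
WHERE Price > 100 -- hypothetical criteria
ORDER BY ProductId
LIMIT 10;

SELECT FOUND_ROWS(); -- total rows the query would have returned without the LIMIT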

Max and Min Time query

How do I show the max time in the first row and the min time in the second row, for Access, using VB6?
What about:
SELECT time_value
FROM (SELECT MIN(time_column) AS time_value FROM SomeTable
UNION
SELECT MAX(time_column) AS time_value FROM SomeTable
) AS t
ORDER BY time_value DESC;
That should do the job unless there are no rows in SomeTable (or your DBMS does not support the notation).
Simplifying per suggestion in comments - thanks!
SELECT MIN(time_column) AS time_value FROM SomeTable
UNION
SELECT MAX(time_column) AS time_value FROM SomeTable
ORDER BY time_value DESC;
If you can accept the two values coming back in a single row from one query, you may improve the performance of the query using:
SELECT MIN(time_column) AS min_time,
MAX(time_column) AS max_time
FROM SomeTable;
A really good optimizer might be able to deal with both halves of the UNION version in one pass over the data (or index), but it is quite easy to imagine an optimizer tackling each half of the UNION separately and processing the data twice. If there is no index on the time column to speed things up, that could involve two table scans, which would be much slower than a single table scan for the two-value, one-row query (if the table is big enough for such things to matter).