Is there any difference in execution time between RANK() vs. DENSE_RANK() vs. ROW_NUMBER()? I understand that the use cases for all three are different. But I wanted to know if it is already established that one would take more time than the others?
Personally, in a few trials, I observed that DENSE_RANK() took more time than RANK(). Is there a reason for this, or any documentation that could help me understand it better?
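For context, the three functions differ in how they number tied values, which is worth seeing side by side before worrying about timing. The sketch below uses Python's sqlite3 (SQLite 3.25+ supports these window functions); the table and data are made up, and any timing difference you measure here says nothing about other engines.

```python
import sqlite3

# Illustrative sketch: how RANK, DENSE_RANK and ROW_NUMBER number ties.
# Table "scores" and its data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INT)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [("a", 90), ("b", 90), ("c", 80)])

rows = conn.execute("""
    SELECT name,
           RANK()       OVER (ORDER BY score DESC) AS rnk,   -- gaps after ties
           DENSE_RANK() OVER (ORDER BY score DESC) AS drnk,  -- no gaps
           ROW_NUMBER() OVER (ORDER BY score DESC) AS rn     -- always unique
    FROM scores
""").fetchall()

for name, rnk, drnk, rn in rows:
    print(name, rnk, drnk, rn)
```

Both 90-score rows get rank 1 and dense rank 1, but distinct row numbers; the 80-score row gets rank 3 (a gap) versus dense rank 2.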
Related
I've seen answers to this question for other databases (MySQL, SQL Server, etc.) but not for PostgreSQL. So, is COUNT(1) or COUNT(*) faster/better for selecting the row count of a table?
Benchmarking the difference
The last time I benchmarked the difference between COUNT(*) and COUNT(1) for PostgreSQL 11.3, I found that COUNT(*) was about 10% faster. The explanation by Vik Fearing at the time was that the constant expression 1 (or at least its nullability) is evaluated for the entire count loop. I haven't checked whether this has been fixed in PostgreSQL 14.
Don't worry about this in real world queries
However, you shouldn't worry about such a performance difference. The 10% difference was measurable in a benchmark, but I doubt you could consistently measure it in an ordinary query. Also, ideally, all SQL vendors would optimise the two in the same way, given that 1 is a constant expression and can thus be eliminated. As mentioned in the above article, I couldn't find any difference in any of the other RDBMS I tested (MySQL, Oracle, SQL Server), and I wouldn't expect there to be any.
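If you do want to measure this yourself, the shape of such a micro-benchmark looks like the sketch below. Note the big caveat: this runs on SQLite, not PostgreSQL, so it cannot reproduce the PostgreSQL-specific 10% gap; it only shows that the two forms agree and how a timing comparison is set up.

```python
import sqlite3
import timeit

# Hedged benchmark sketch: SQLite stands in for PostgreSQL here, so the
# numbers are not comparable to the article's; table "t" is made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INT)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100_000)])

count_star = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
count_one = conn.execute("SELECT COUNT(1) FROM t").fetchone()[0]

# Time each form repeatedly; averages of repeated runs smooth out noise.
t_star = timeit.timeit(
    lambda: conn.execute("SELECT COUNT(*) FROM t").fetchone(), number=50)
t_one = timeit.timeit(
    lambda: conn.execute("SELECT COUNT(1) FROM t").fetchone(), number=50)
print(f"COUNT(*): {t_star:.4f}s  COUNT(1): {t_one:.4f}s")
```

Whatever the timings show, both forms return the same row count, since COUNT(1) counts rows where the constant 1 is non-null, i.e. all of them.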
Will the function row_number() always sort the same data in the same way?
No. Ordering in SQL is not stable: when rows have equal values for the sort key, there is no guarantee that an analytic function or ORDER BY will return them in the same order from one execution to the next.
You can always add a unique id as the last key in the sort to make it reproducible.
EDIT:
Note: the non-reproducibility of ORDER BY is part of the SQL standard, and the Oracle documentation does not specify otherwise. In general, ordering is not stable in databases for equivalent key values, and I would expect row_number() to behave the same way.
If you need things in a particular order, you can add rowid to the order by clause (see here). In fact, rowid may solve your problem without row_number().
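The tie-breaking advice above can be sketched as follows. This uses Python's sqlite3 for illustration, with an `id` primary key standing in for Oracle's rowid; appending the unique key to the ORDER BY makes the numbering deterministic even when the main sort key has duplicates.

```python
import sqlite3

# Sketch: a unique column as the last sort key makes ROW_NUMBER()
# reproducible. Table "t" and its data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, grp TEXT)")
conn.executemany("INSERT INTO t (grp) VALUES (?)", [("x",), ("x",), ("y",)])

rows = conn.execute("""
    SELECT id, grp,
           ROW_NUMBER() OVER (ORDER BY grp, id) AS rn  -- id breaks ties in grp
    FROM t
    ORDER BY rn
""").fetchall()
print(rows)
```

With only `ORDER BY grp`, the two "x" rows could legitimately swap numbers between runs; with `id` appended, every run assigns the same numbers.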
I am trying to eliminate duplicate rows from my select by person ID (see here).
I got a solution using Analytic function:
SELECT PersonID, LastName, FirstName, RecordId, RecordType
FROM (SELECT PersonID, LastName, FirstName, RecordId, RecordType,
             ROW_NUMBER() OVER (PARTITION BY PersonID ORDER BY RecordType ASC) AS rn
      FROM test_records) t
WHERE rn = 1
I would like to understand whether using this analytic function will be more expensive than just running two consecutive queries:
SELECT distinct PersonID from test_records;
Then for each PersonID (java code or plsql):
SELECT * from test_records where PersonID =X and rownum = 1;
Will it be correct to compare the explain plans and the costs?
Will it be correct to add the costs of the two queries and compare the sum to the cost of the analytic-function query?
Thank you!
A couple of general rules to be mindful of:
Prefer using the builtin functions for analytics. Since they're native, the CBO can do a lot of behind-the-scenes magic to speed things up.
Avoid making multiple queries if you can. The overhead of sending the query from your application will really start to add up and cause a lot of performance issues. If you're doing it in PL/SQL, the penalty is reduced, but it's still less efficient than a single query.
Based on what you've posted, I would recommend you use the analytic function. However, I'm not sure what you're trying to accomplish in this query, but it doesn't look like either approach is going to be very good. I don't know if this is possible, but you might want to change up your schema if you can.
It seems like you're storing the data in a really nasty way. From your other question, it looks like you don't have a way to put a good index on the table. No indexing combined with these analytic functions will dramatically reduce the scalability of that table. If you put a few thousand rows in there, you're going to be seeing some awfully long-running queries.
The right answer is to try both methods and compare them in your environment. I do note that the two methods do not produce the same results. The first query gives the "first" RecordType. The second gives an arbitrary row (and I'm assuming row_num should really be rownum).
Each has benefits. From the perspective of SQL alone, the second method will use fewer Oracle resources. Alas, this will (I am almost 100% sure) be overcome by the expense of running lots and lots of queries. Don't forget the looping logic and all the rest as well.
Why is the first method better? First, it is only one query so it incurs overhead of running a query only once. Second, it doesn't require a lot of extra non-SQL code for looping and so on. Third, the query can run in parallel. Fourth, Oracle analytic functions are usually pretty fast.
There might be some cases where the second method is better. For instance, if you have 1,000,000 records and only one person, then the second will definitely be faster. So, it is not a slam-dunk as to which is better. But for most distributions of data, I'd go with the first method.
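To make the comparison concrete, here is a sketch of both methods against the question's test_records table, using Python's sqlite3 for illustration (the data is made up, and ORDER BY ... LIMIT 1 stands in for Oracle's rownum = 1 so that the loop returns the same "first RecordType" row rather than an arbitrary one):

```python
import sqlite3

# Hedged sketch comparing the single analytic query with the per-person
# loop. Table and column names follow the question; the data is invented.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE test_records
                (PersonID INT, LastName TEXT, FirstName TEXT,
                 RecordId INT, RecordType TEXT)""")
conn.executemany("INSERT INTO test_records VALUES (?,?,?,?,?)", [
    (1, "Doe", "Jane", 10, "A"), (1, "Doe", "Jane", 11, "B"),
    (2, "Roe", "John", 20, "B"),
])

# Method 1: one analytic query, one round trip.
analytic = conn.execute("""
    SELECT PersonID, RecordId, RecordType
    FROM (SELECT PersonID, RecordId, RecordType,
                 ROW_NUMBER() OVER (PARTITION BY PersonID
                                    ORDER BY RecordType) AS rn
          FROM test_records)
    WHERE rn = 1
""").fetchall()

# Method 2: one query per PersonID -- the N+1 round trips the answers
# above warn about.
looped = []
for (pid,) in conn.execute("SELECT DISTINCT PersonID FROM test_records"):
    looped.append(conn.execute(
        """SELECT PersonID, RecordId, RecordType FROM test_records
           WHERE PersonID = ? ORDER BY RecordType LIMIT 1""",
        (pid,)).fetchone())

print(sorted(analytic), sorted(looped))
```

Both methods return one row per person, but method 2 issues 1 + N queries, which is where the per-query overhead accumulates as the person count grows.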
With Denali's CTP 3 release, we have more analytic functions, of which I am interested in two:
a) First_Value
b) Last_Value
I understand that FIRST_VALUE returns the first value based on the partition and ORDER BY clause, while LAST_VALUE returns the last value based on the partition and ORDER BY clause.
But in what practical situations will they be useful? A sample real-world situation would help me understand this.
These functions can help you get other information from the resultset without using complicated self-joins, derived tables, etc. For example, let's say you have twenty forum messages in a table, and you want to know who started the thread, and who posted the last response. They are ordered by date/time, so while MIN() & MAX() can help you identify when the first & last posts occurred, they can't tell you who those authors were unless you went out and got that additional information somehow. Even that can be complicated - if you don't have a natural or artificial key column for example (you could join on an identity column), you might be tempted to join on the date/time values, which are not guaranteed to be unique...
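The forum example can be sketched like this, using Python's sqlite3 (SQLite also implements these window functions; the table and column names are made up). One gotcha worth showing: the default window frame ends at the current row, so LAST_VALUE needs an explicit frame to see the whole partition.

```python
import sqlite3

# Hedged sketch: who started a thread and who posted last, in one pass,
# without self-joins. Table "posts" and its data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (thread INT, author TEXT, posted_at TEXT)")
conn.executemany("INSERT INTO posts VALUES (?,?,?)", [
    (1, "alice", "2023-01-01 09:00"),
    (1, "bob",   "2023-01-01 10:00"),
    (1, "carol", "2023-01-02 08:00"),
])

row = conn.execute("""
    SELECT DISTINCT thread,
           FIRST_VALUE(author) OVER
               (PARTITION BY thread ORDER BY posted_at) AS started_by,
           -- default frame stops at the current row, so LAST_VALUE must
           -- be given the full partition explicitly:
           LAST_VALUE(author) OVER
               (PARTITION BY thread ORDER BY posted_at
                ROWS BETWEEN UNBOUNDED PRECEDING
                         AND UNBOUNDED FOLLOWING) AS last_reply_by
    FROM posts
""").fetchone()
print(row)
```

MIN(posted_at) and MAX(posted_at) could find *when* the first and last posts happened, but only the window functions hand back the authors alongside them without extra joins.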
I have these Queries:
WITH CTE(comno) AS
  (SELECT DISTINCT comno = ErpEnterpriseId FROM company)
SELECT id = ROW_NUMBER() OVER (ORDER BY comno), comno FROM CTE

SELECT comno = ErpEnterpriseId, RowNo = ROW_NUMBER() OVER (ORDER BY ErpEnterpriseId)
FROM company
GROUP BY ErpEnterpriseId
SELECT erpEnterpriseId, ROW_NUMBER() OVER(ORDER BY erpEnterpriseId) AS RowNo
FROM
(
SELECT DISTINCT erpEnterpriseId
FROM Company
) x
All three of them return identical costs and actual execution plans. Why, and how so?
It's all down to the query optimizer, which will try to optimize the query you enter into the most efficient execution plan (i.e. several different queries can be optimized down to the SAME plan that is estimated to be most efficient).
The main thing you should do when trying to optimise a query and find which version performs best is to just try them and compare performance. Run a SQL Profiler trace to see what the duration/reads are for each version. I usually run each version of a query 3 times and take an average to compare, each time clearing the execution plan and data caches to prevent skewed results.
It's worth having a read of this MSDN article on the optimizer.
Simple, the optimizer is probably turning all your statements into the same statement.
Just like in English, in which there are many ways to say the same thing, all three of those queries are asking for the same data. The SQL Engine (the query optimizer) knows that and is smart enough to know what you are asking.
More to the point, the engine has information that you don't have (or likely don't): how the data is organized and indexed. It uses this information to make its own decision about the BEST way to get the data, and that's what it is doing.
Although there are ways to override the optimizer, unless you really know what you are doing, you will probably only hurt performance. So your best option is to write the queries in whatever way makes the most sense to you (or other humans) for readability and maintainability.
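The equivalence of the three query shapes can be checked directly: they all ask for the distinct values with a row number attached. The sketch below runs all three against Python's sqlite3 (standing in for SQL Server, with the T-SQL `alias =` syntax rewritten as standard `AS`); the table and data are made up.

```python
import sqlite3

# Hedged sketch: three syntactically different queries, one logical
# question. Table "Company" and its contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Company (ErpEnterpriseId INT)")
conn.executemany("INSERT INTO Company VALUES (?)", [(2,), (1,), (2,), (3,)])

q_cte = """WITH cte(comno) AS (SELECT DISTINCT ErpEnterpriseId FROM Company)
           SELECT ROW_NUMBER() OVER (ORDER BY comno) AS id, comno FROM cte"""
q_grp = """SELECT ROW_NUMBER() OVER (ORDER BY ErpEnterpriseId) AS RowNo,
                  ErpEnterpriseId
           FROM Company GROUP BY ErpEnterpriseId"""
q_sub = """SELECT ROW_NUMBER() OVER (ORDER BY ErpEnterpriseId) AS RowNo,
                  ErpEnterpriseId
           FROM (SELECT DISTINCT ErpEnterpriseId FROM Company)"""

results = [sorted(conn.execute(q).fetchall()) for q in (q_cte, q_grp, q_sub)]
print(results[0])
```

All three produce the same rows, which is exactly why a decent optimizer is free to compile them to the same plan.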