Oracle Cost of analytic function vs double fetch - sql

I am trying to eliminate duplicate rows from my select by person ID (see here).
I got a solution using an analytic function:
SELECT PersonID, LastName, FirstName, RecordId, RecordType
FROM (SELECT PersonID, LastName, FirstName, RecordId, RecordType,
ROW_NUMBER() OVER (PARTITION BY PersonID ORDER BY RecordType ASC) AS rn
FROM test_records) t
WHERE rn = 1
I would like to understand whether it will be more expensive to use this analytic function than to just run two consecutive queries:
SELECT distinct PersonID from test_records;
Then, for each PersonID (in Java code or PL/SQL):
SELECT * from test_records where PersonID =X and rownum = 1;
Would it be correct to compare the explain plans and their costs?
Would it be correct to add the costs of the two queries and compare the total to the analytic function's cost?
Thank you!

A couple of general rules to be mindful of:
Prefer using the builtin functions for analytics. Since they're native, the CBO can do a lot of behind-the-scenes magic to speed things up.
Avoid making multiple queries if you can. The overhead of sending the query from your application will really start to add up and cause a lot of performance issues. If you're doing it in PL/SQL, the penalty is reduced, but it's still less efficient than a single query.
Based on what you've posted, I would recommend you use the analytic function. However, I'm not sure what you're trying to accomplish in this query, but it doesn't look like either approach is going to be very good. I don't know if this is possible, but you might want to change up your schema if you can.
It seems like you're storing the data in a really nasty way. From your other question, it looks like you don't have a way to put a good index on the table. No indexing combined with these analytic functions will dramatically reduce the scalability of that table. If you put a few thousand rows in there, you're going to be seeing some awfully long-running queries.

The right answer is to try both methods and compare them in your environment. I do note that the two methods do not produce the same results. The first query gives the "first" RecordType. The second gives an arbitrary row (and I'm assuming row_num should really be rownum).
Each has benefits. From the perspective of SQL only, the second method will use fewer Oracle resources. Alas, this will (I am almost 100% sure) be overcome by the expense of running lots and lots of queries. Don't forget the looping logic and all the rest as well.
Why is the first method better? First, it is only one query so it incurs overhead of running a query only once. Second, it doesn't require a lot of extra non-SQL code for looping and so on. Third, the query can run in parallel. Fourth, Oracle analytic functions are usually pretty fast.
There might be some cases where the second method is better. For instance, if you have 1,000,000 records and only one person, then the second will definitely be faster. So, it is not a slam-dunk as to which is better. But for most distributions of data, I'd go with the first method.
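For what it's worth, here is a minimal PL/SQL sketch of the second method (assuming nothing beyond the test_records table from the question), just to make the per-query overhead concrete - every loop iteration is a separate query execution:
DECLARE
  v_rec test_records%ROWTYPE;
BEGIN
  FOR p IN (SELECT DISTINCT PersonID FROM test_records) LOOP
    -- One extra query execution per person; this is the overhead
    -- that the single analytic query avoids.
    SELECT *
      INTO v_rec
      FROM test_records
     WHERE PersonID = p.PersonID
       AND ROWNUM = 1;
    -- ... process v_rec here; note the RecordType you get is arbitrary ...
  END LOOP;
END;
/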

Related

How can I solve a performance issue in a SQL query?

All developers know that IN and DISTINCT create issues for SQL queries. A former colleague, who no longer works at my company, created the query below. Please take a look at the code. How can I tune my query for better performance?
SELECT xxx
, COUNT(DISTINCT Id) Count
FROM Test (NOLOCK)
WHERE IsDeleted = 0
AND xxx IN
(
SELECT CAST(value AS INT)
FROM STRING_SPLIT(#ProductIds, ',')
)
GROUP BY xxx
All developers know that IN and DISTINCT create issues for SQL queries.
This is not necessarily true. They can hurt performance, but sometimes they are necessary.
The IN is probably not a big deal. It gets evaluated once. If you have another way to pass in a list -- say using a temporary table -- that is better.
The COUNT(DISTINCT id) is suspicious. I would expect id to already be unique. If so, then just use COUNT(*).
The WITH (NOLOCK) is not recommended unless you really know what you are doing. Working with data that might be inconsistent is dangerous.
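As a rough sketch of the temporary-table idea (the question's #ProductIds is written here as a string variable @ProductIds; all other names come from the query above):
-- Materialize the id list once, then join to it instead of using IN.
CREATE TABLE #ProductIdList (ProductId INT PRIMARY KEY);

INSERT INTO #ProductIdList (ProductId)
SELECT DISTINCT CAST(value AS INT)
FROM STRING_SPLIT(@ProductIds, ',');

SELECT t.xxx,
       COUNT(DISTINCT t.Id) AS IdCount
FROM Test AS t
JOIN #ProductIdList AS p ON p.ProductId = t.xxx
WHERE t.IsDeleted = 0
GROUP BY t.xxx;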
I have used Sentry One Plan Explorer to help find the tuning points of queries I am having performance issues with:
https://www.sentryone.com/plan-explorer
First you need to decide what good performance is in your environment, then find the worst parts of the query and optimize those first.
Last, consider how you are storing your data and look for places where it makes sense to add an index.
You could also create an index on the xxx column.
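For example (just a sketch; the filter and the included column are assumptions based on the query above):
-- Index on xxx, filtered to the rows the query actually reads.
CREATE NONCLUSTERED INDEX IX_Test_xxx
ON Test (xxx)
INCLUDE (Id)
WHERE IsDeleted = 0;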

How to avoid duplicated SELECT phrases in SQL (MariaDB)

I am working with a small MariaDB database. To extract time intervals per user, I use the following query:
SELECT
SUM(TIMESTAMPDIFF(SECOND,Activity.startTime,Activity.endTime)) AS seconds,
TIME_FORMAT(SEC_TO_TIME(SUM(TIMESTAMPDIFF(SECOND,Activity.startTime,Activity.endTime))),'%Hh %im %ss') AS formattedTime,
User.name
FROM Activity
INNER JOIN User ON User.id = Activity.userID
GROUP BY User.id
ORDER BY seconds DESC;
I have to select the time as plain seconds (... AS seconds) to be able to order the results by it, as can be seen in my query.
However, I also want MariaDB to format the time interval, for that I use the TIME_FORMAT function. The problem is, I have to duplicate the whole SUM(...) phrase inside the TIME_FORMAT call again. This doesn't seem very elegant. Will MariaDB recognize the duplication and calculate the SUM only once? Also, is there a way to get the same result without duplicating the SUM?
I figured this should be possible with a nested query construct like so:
SELECT
innerQuery.name,
innerQuery.seconds,
TIME_FORMAT(SEC_TO_TIME(innerQuery.seconds), '%Hh %im')
FROM (
-- Do the sum here, once.
) AS innerQuery
ORDER BY innerQuery.seconds DESC;
Is this the best way to do it / "ok" to do?
Note: I don't need the raw seconds in the result, only the formatted time is needed.
I'd appreciate help, thanks.
Alas, there isn't a really good solution. When you use a subquery, MariaDB materializes the subquery (as does MySQL). Your query is rather complex and there is a lot of I/O happening anyway, so the additional materialization may not be important.
Repeating the expression is really more an issue of aesthetics than performance. The expression will be re-executed multiple times. But, the real expense of doing aggregations is the file sort for the group by (or whatever method is used). Doing the sum() twice is not a big deal (unless you are calling a really expensive function as well as the aggregation function).
Other database engines do not automatically materialize subqueries, so using a subquery in other databases is usually the recommended approach. In MariaDB/MySQL, I would guess that repeating the expression is more efficient, although you can try both on your data and report back.
In this case, you don't need the raw values. The formatted value will work correctly in the ORDER BY.
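If you do want to benchmark the derived-table form, here is a sketch with the placeholder filled in (using only the tables and columns from your original query):
SELECT
    innerQuery.name,
    TIME_FORMAT(SEC_TO_TIME(innerQuery.seconds), '%Hh %im %ss') AS formattedTime
FROM (
    SELECT
        User.name,
        SUM(TIMESTAMPDIFF(SECOND, Activity.startTime, Activity.endTime)) AS seconds
    FROM Activity
    INNER JOIN User ON User.id = Activity.userID
    GROUP BY User.id
) AS innerQuery
ORDER BY innerQuery.seconds DESC;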
Your subquery idea is likely to be slower because of all the overhead in having two queries.
This is a Rule of Thumb: It takes far more effort for MySQL to fetch a row than to evaluate expressions in the row. With that rule, duplicate expressions are not a burden.

Speed of paged queries in Oracle

This is a never-ending topic for me and I'm wondering if I might be overlooking something. Essentially I use two types of SQL statements in an application:
Regular queries with a "fallback" limit
Sorted and paged queries
Now, we're talking about some queries against tables with several million records, joined to 5 more tables with several million records. Clearly, we hardly ever want to fetch all of them; that's why we have the above two methods to limit user queries.
Case 1 is really simple. We just add an additional ROWNUM filter:
WHERE ...
AND ROWNUM < ?
That's quite fast, as Oracle's CBO will take this filter into consideration for its execution plan and will probably apply a FIRST_ROWS operation (similar to the one enforced by the /*+FIRST_ROWS*/ hint).
Case 2, however, is a bit trickier with Oracle, as there is no LIMIT ... OFFSET clause as in other RDBMSs. So we nest our "business" query in a technical wrapper like this:
SELECT outer.* FROM (
SELECT * FROM (
SELECT inner.*, ROWNUM as RNUM, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... USER SORTED business query ...]
) inner
)
WHERE ROWNUM < ?
) outer
WHERE outer.RNUM > ?
Note that the TOTAL_ROWS field is calculated so that we know how many pages there will be, even without fetching all the data. Now this paging query is usually quite satisfying. But every now and then (as I said, when querying 5M+ records, possibly including non-indexed searches), it runs for 2-3 minutes.
EDIT: Please note, that a potential bottleneck is not so easy to circumvent, because of sorting that has to be applied before paging!
I'm wondering, is that state-of-the-art simulation of LIMIT ... OFFSET, including TOTAL_ROWS in Oracle, or is there a better solution that will be faster by design, e.g. by using the ROW_NUMBER() window function instead of the ROWNUM pseudo-column?
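Roughly, the alternative I have in mind would look like this (just a sketch; the ORDER BY inside the window functions has to repeat the sort of the business query):
SELECT *
FROM (
  SELECT q.*,
         ROW_NUMBER() OVER (ORDER BY [... user sort ...]) AS RNUM,
         COUNT(*) OVER () AS TOTAL_ROWS
  FROM (
    [... USER SORTED business query ...]
  ) q
)
WHERE RNUM > ?
  AND RNUM <= ?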
The main problem with Case 2 is that in many cases the whole query result set has to be obtained and then sorted before the first N rows can be returned - unless the ORDER BY columns are indexed and Oracle can use the index to avoid a sort. For a complex query and a large set of data this can take some time. However there may be some things you can do to improve the speed:
Try to ensure that no functions are called in the inner SQL - these may get called 5 million times just to return the first 20 rows. If you can move these function calls to the outer query, they will be called far less often.
Use a FIRST_ROWS_n hint to nudge Oracle into optimising for the fact that you will never return all the data.
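For example (just a sketch; 20 is an assumed page size):
SELECT /*+ FIRST_ROWS(20) */ outer.* FROM (
  [... rest of the paging wrapper unchanged ...]
) outer
WHERE outer.RNUM > ?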
EDIT:
Another thought: you are currently presenting the user with a report that could return thousands or millions of rows, but the user is never realistically going to page through them all. Can you not force them to select a smaller amount of data e.g. by limiting the date range selected to 3 months (or whatever)?
You might want to trace the query that takes a long time and look at its explain plan. Most likely the performance bottleneck comes from the TOTAL_ROWS calculation: Oracle has to read all the data even if you only fetch one row. This is a common problem that all RDBMSs face with this type of query, and no implementation of TOTAL_ROWS will get around it.
The radical way to speed up this type of query is to forgo the TOTAL_ROWS calculation and just display that there are additional pages. Do your users really need to know that they can page through 52,486 pages? An estimate may be sufficient; that's what Google search does, for example: it estimates the number of pages instead of actually counting them.
Designing an accurate and efficient estimation algorithm might not be trivial.
A "LIMIT ... OFFSET" is pretty much syntactic sugar. It might make the query look prettier, but if you still need to read the whole of a data set and sort it and get rows "50-60", then that's the work that has to be done.
If you have an index in the right order, then that can help.
It may perform better to run two queries instead of trying to count() and return the results in the same query. Oracle may be able to answer the count() without sorting or joining to all of the tables (join elimination based on declared foreign key constraints). This is what we generally do in our application: for performance-critical statements, we write a separate query that we know will return the correct count, as we can sometimes do better than Oracle.
Alternatively, you can make a tradeoff between performance and recency of the data. Bringing back the first 5 pages is going to be nearly as quick as bringing back the first page. So you could consider storing the results from 5 pages in a temporary table along with an expiry date for the information. Take the result from the temporary table if valid. Put a background task in to delete the expired data periodically.
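A bare-bones sketch of that caching idea (all names are made up; an ordinary table is used rather than a true temporary table so the cache survives across sessions):
-- Cache the first few pages per search, with an expiry timestamp.
CREATE TABLE page_cache (
    search_key  VARCHAR2(200) NOT NULL,
    rnum        NUMBER        NOT NULL,
    payload     VARCHAR2(4000),          -- whatever the report rows contain
    expires_at  DATE          NOT NULL,
    CONSTRAINT page_cache_pk PRIMARY KEY (search_key, rnum)
);
-- Serve a page from the cache while it is still valid:
SELECT payload
FROM page_cache
WHERE search_key = :search_key
  AND rnum BETWEEN :low AND :high
  AND expires_at > SYSDATE
ORDER BY rnum;
-- Background job: purge expired entries periodically.
DELETE FROM page_cache WHERE expires_at <= SYSDATE;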

SQL: using count() vs. keeping a separate field

I have a question regarding SQL performance and was hoping someone would have the answer.
I have the database table tbl_users and I want to get the total number of users I have. I could write it as SELECT COUNT(*) FROM tbl_users. I presume such a query would have performance implications were I to have a handful of users vs. several million of them. (So, assumption #1 is that the more rows I have, the more resources this query will consume.)
In this particular case I need to run this query at a relatively high frequency and each time I need to get up-to-date data (so, caching is not an option).
Assuming my assumption #1 is correct, I then thought of structuring it the following way:
create tbl_stats with a field userCounter
each time there is an insert in tbl_users, userCounter is updated +1
each time I need to get my user count, I can pull that one field from tbl_stats
Now, I realize that by doing it this way, the data in userCounter is technically a duplicate, which is bad form.
So, will my first query (assuming millions of rows of data) consume enough resources to warrant implementing my alternative design? If so (or possibly so), is my alternative design consistent with best practices?
If your table is indexed, which it almost certainly will be, then the performance of select count(*) probably will not be as bad as you might anticipate - even if you have millions of rows.
But, if it does become a concern, then rather than roll your own solution, look into using an indexed view.
I have a database table with almost 5 million records; the following query returns in less than a second:
select count(userID) from tblUsers
This query returns in 2 seconds
select count(*) from tblUsers
I'd personally just go with select count() rather than creating a duplicate field
I think this is one of those scenarios where you really need to measure the performance to make a good decision. I would wager that a simple COUNT() isn't going to create enough latency that you would need to implement your proposed work-around.
If you are worried I would encapsulate your COUNT() in a function or stored procedure so you can quickly swap it out later if performance does become a problem.
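For example (SQL Server syntax, purely as a sketch; the procedure name is made up):
-- Callers ask the procedure for the count, so the counting strategy
-- can be swapped later without touching application code.
CREATE PROCEDURE dbo.GetUserCount
AS
BEGIN
    SET NOCOUNT ON;
    SELECT COUNT(*) AS UserCount
    FROM dbo.tbl_users;
END;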
On some systems you can ask the system to maintain the counts for you. For example, in SQL Server you can have an indexed view on the count:
create view vwCountUsers
with schema binding
as
select count_big(*) as count
from dbo.tbl_users;
create clustered index cdxCountUsers on vwCountUsers (count);
The system will maintain the count for you, and it will always be available at nearly no cost.
If you have a desperate need and a real business case for up to the minute accurate counts, then the trigger would be the way to go. Just make sure it caters for all multi-user issues such as concurrency and transactions.
It could become a bottleneck because instead of 5 transactions being able to insert into a new table, they will queue up waiting to update the userCounter table, and you may even get deadlocks.
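If you do go the trigger route, a minimal sketch would be something like this (SQL Server syntax; assumes a single-row tbl_stats as described in the question, and ignores deletes):
-- Add the number of newly inserted rows to the running counter.
CREATE TRIGGER trg_tbl_users_insert_count
ON tbl_users
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE tbl_stats
    SET userCounter = userCounter + (SELECT COUNT(*) FROM inserted);
END;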
There are other options for less accurate counts, but if you want an accurate count there are very few other choices. I'll try to think of some:
You could partition the data and, in userCounter, store a count by day. If the data only gets added for the current day, do a SELECT SUM(dailycount) FROM counter plus a SELECT COUNT(*) FROM table WHERE date = today.
You could at least use the nolock or readpast options to lessen resource usage:
select * from tbl with (readpast)
select * from tbl with (nolock)
There are some things it makes sense to precalculate for performance reasons (complex calculations over years of data); that's much of the reason data warehouses exist, to speed up reporting. SELECT COUNT(*) is generally not one of them if you have any indexing on the table at all. There are far worse performance problems to solve than that. I get 1 second to return the count on a table with 13 million rows.
I'm all for writing code that is more likely to perform well than the alternative (avoiding correlated subqueries, using set-based operations instead of cursors, having sargable where clauses), but this is a micro-optimization that should not be addressed until there is a real performance problem.

Can I trust Execution plans?

I have these Queries:
With CTE(comno) as
(select distinct comno=ErpEnterpriseId from company)
select id=Row_number() over(order by comno),comno from cte
select comno=ErpEnterpriseId,RowNo=Row_number() over (order by erpEnterpriseId) from company group by ErpEnterpriseId
SELECT erpEnterpriseId, ROW_NUMBER() OVER(ORDER BY erpEnterpriseId) AS RowNo
FROM
(
SELECT DISTINCT erpEnterpriseId
FROM Company
) x
All three of them return identical costs and actual execution plans. Why and how is that?
It's all down to the query optimizer - it will try to turn the query you enter into the most efficient execution plan (i.e. several different queries could be optimized down to the SAME plan that is estimated to be most efficient).
The main thing you should do when trying to optimise a query and find which version performs best is to just try them and compare performance. Run a SQL Profiler trace to see what the duration/reads are for each version. I usually run each version of a query 3 times to get an average to compare, each time clearing the execution plan and data cache to prevent skewed results.
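(On a test server, clearing those caches between runs is typically done with something like the following; don't do this on a production server.)
CHECKPOINT;               -- write dirty pages to disk first
DBCC DROPCLEANBUFFERS;    -- clear the data cache
DBCC FREEPROCCACHE;       -- clear the plan cache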
It's worth having a read of this MSDN article on the optimizer.
Simple, the optimizer is probably turning all your statements into the same statement.
Just like in English, in which there are many ways to say the same thing, all three of those queries are asking for the same data. The SQL Engine (the query optimizer) knows that and is smart enough to know what you are asking.
Even more appropriately, the engine has information that you don't (or likely don't) have - how the data is organized and indexed. It uses this information to make its own decision about what the BEST way to get the data is, and that's what it is doing.
Although there are ways to override the optimizer, unless you really know what you are doing, you will probably only hurt performance. So your best option is to write the queries in whatever way makes the most sense to you (or other humans) for readability and maintainability.