Why might a join on a subquery be slow? What could be done to make it faster? (SQL)

This is a data science interview question.
My understanding of subqueries is that, especially with correlated subqueries where the inner query depends on the outer query, the subquery requires one or more values to be passed to it by the outer query before it can be resolved. This means that the subquery has to be processed multiple times, once for each row in the outer query.
In particular, in this case, if the inner and outer queries return M and N rows respectively, the total run time could be O(M*N).
So in general, that would be my answer for why running a subquery could be slow, but am I missing anything else that pertains to joining on a subquery? Also I'm not really sure what could be done to make it faster.
I would of course appreciate any tips or help.
Thanks!

I think that your answer is essentially correct: subqueries are slow if they are correlated. Uncorrelated subqueries are only evaluated a single time.
What can be done to speed this up: correlated subqueries can often be rewritten as joins, and join queries can be executed much faster!
If you use a good RDBMS, the optimizer is often able to rewrite a correlated subquery into a join query (though not in all cases). If you use a simple RDBMS, there is either no optimizer at all or the optimizer is not very advanced (i.e., it cannot unnest subqueries into join queries). In those cases, you need to rewrite your query by hand.
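As a hedged sketch of what such a hand rewrite can look like (the table and column names here are invented for illustration):

-- correlated form: a naive executor re-evaluates the inner query once per outer row
select c.id, c.name
from customers c
where exists (select 1
              from orders o
              where o.customer_id = c.id
                and o.total > 100)

-- equivalent join form, which most engines can execute with a hash or merge join
select distinct c.id, c.name
from customers c
join orders o
  on o.customer_id = c.id
where o.total > 100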

Wow - what an open ended question! I'm not sure how far outside the box they want you to think, but some possible reasons:
Criteria too broad
The criteria for your query may be too broad, there may be extra clauses you could add that would reduce the sheer amount of data the RDBMS has to process.
Lack of indexes
If there aren't any indexes on the pertinent columns, the RDBMS may have to resort to full table scans which could be slow.
Stale stats
If statistics haven't been updated for a while, the RDBMS may not have the full picture of the skew of the data which can affect the execution time massively.
Physical arrangement of database
If the indexes and tables are on the same physical drive(s), this can create IO contention.
Parallelism
The RDBMS may not be set up correctly for parallelism meaning that the RDBMS may not be making the best use of the available hardware.
Scheduling
The time when the query is run can affect the execution time. Would the query be better run out of hours?
Data changes
Data changes can affect the skew of the data and in rare cases create Cartesian products. On large databases there should be full traceability of data, at row level at least, to help track down data issues.
Locking
Related to high levels of use is the issue of locking. If you require clean reads, there may be contention on the required data which could slow down the query.
Misleading execution plans
You may have pulled the execution plans, but these don't always tell the full story. Cost is a function of CPU and IO, but your system may be more bound on one than the other. Some RDBMSs have settings that can force the optimiser to skew the cost towards one side or the other to produce better plans.
Static data not being cached
If you have some static data that you're recalculating each time, this will affect the cost. Such data should be stored in an indexed or temporary table to reduce the amount of work the RDBMS needs to do.
Query simply too complex
Whilst the query may read perfectly well to you, if you can break it up into chunks using temporary tables or the like, it could perform significantly better (see the sketch after this answer).
I'm going to stop there as I could easily spend the rest of the day adding to this, but hopefully this gives you a flavour.
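For the "break it into chunks" point above, here is a hedged T-SQL-style sketch; the tables, columns, and index name are made up for illustration:

-- materialise the heavy aggregation once into a temporary table
select o.customer_id, sum(o.total) as order_total
into #customer_totals
from orders o
where o.order_date >= '2024-01-01'
group by o.customer_id;

-- index the temp table so the final join can seek instead of scan
create index ix_ct_customer on #customer_totals (customer_id);

-- the main query now joins to a much smaller, pre-aggregated set
select c.name, ct.order_total
from customers c
join #customer_totals ct
  on ct.customer_id = c.customer_id;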

Does a second join ever improve query performance?

I found a legacy sproc in our system that includes a redundant second join for "performance reasons."
But if I look at the query execution plan it appears the second join actually increases the query cost by ~30%.
Does a second join ever improve performance? What might the original author have been thinking of when he or she added the second join?
A second join can improve performance. Joins can be used for filtering, which reduces the number of rows.
This would be particularly true if the join keys are indexed, so the filtering can take place very quickly.
The reduced result set can also speed aggregations and sorting.
If the result sets are exactly the same, the second join might improve performance by enabling a better execution plan. However, there may be other ways to achieve the same goal.
If the additional table being joined to contained an index that could be used in the query, while the other tables did not... then you might get a performance boost by virtue of being able to use the index instead of a full scan.
My guess is that you'd be better off just adding the index to one of the existing tables and avoiding the extra join, but I guess it would depend on the specific situation?
That's the only benefit I can think of (assuming you don't actually need to select any of the fields)!
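A hedged sketch of that filtering idea (the tables here are invented): the inner join acts as a filter even though nothing from the joined table is selected, and an index on its key lets that filtering happen early.

select o.order_id, o.total
from orders o
join active_accounts a
  on a.account_id = o.account_id   -- assumes an index on active_accounts.account_id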

Is there ever a scenario in SQL where an index would be detrimental?

Can an index ever hurt? That's my whole question. I'm curious.
Query optimizers will simply ignore indexes that are irrelevant to a query. But they still have to spend some microseconds during query optimization, considering whether each index should be used.
The more indexes you have on a table, the more complex the optimizer's job to analyze which is the best one to use. In some rare cases, the optimization phase could actually be more costly than the query execution.
I worked on a case recently helping a client using MySQL 5.6, in which some new sophisticated query optimization features caused the query to use 100% CPU during optimization. Basically, it caused the optimizer to estimate the benefit of thousands of permutations of index choices, like a chess-playing program looking ahead several moves.
To solve this problem, we changed some configuration variables, effectively making MySQL 5.6's optimizer skip its new features and be dumber about optimal index choice, like it was in MySQL 5.5. This solved the CPU spike issue in that case.
That case was exceptional because the query was very complex, and they had many indexes.
This case was also very specific to one version of one brand of RDBMS. But other brands of database may have similar edge cases.
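The answer doesn't say which variables were changed, so the lines below are only an illustration of the kind of knobs MySQL 5.6 exposes for this; the specific settings are assumptions, not the actual fix:

-- illustrative only: real MySQL variables, but not necessarily the ones changed in that case
set session optimizer_search_depth = 4;                              -- cap the join-order search
set session optimizer_switch = 'semijoin=off,materialization=off';   -- disable 5.6 subquery optimizations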
Yes, indexes can hurt. First, there is overhead in maintaining the index during inserts, updates, and deletes. This overhead can be detrimental, particularly in high-volume transactional environments.
Indexes may also be used incorrectly. For instance, the following query can be quite hard to optimize:
select t.*
from some_table t
where col1 > 'x'
order by col2
when there are two indexes, one on col1 and the other on col2.
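For concreteness, those two indexes might be defined as follows (the index names are made up):

create index ix_col1 on some_table (col1);
create index ix_col2 on some_table (col2);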
One approach is to use the col1 index to fetch all the appropriate rows. Then use a sort for the order by. Another approach is to use the col2 index for the ordering and then filter one row at a time.
Which approach is better depends on the data, and it can be hard for an optimizer to always make the right decision. This is a case where having a second index can mean that the wrong index is used for optimization.
In general, indexes do help with query optimization and for many systems, the additional overhead of maintaining them is negligible. But, this doesn't mean that they are always helpful.

SQL - Join Aggregated query or Aggregate/Sum after join?

I have a hard time figuring out what is best, or if there is any difference at all;
however, I have not found any material to help my understanding of this,
so I will ask this question, if not for me, then for others who might end up in the same situation.
The question is about aggregating a sub-query before or after a join. In my specific situation the sub-query is rather slow due to fragmented data and a bad normalization procedure.
I have a main query that is highly complex and a sub-query that is built from 3 small queries combined using UNION (which removes duplicate records).
I only need a single value from this sub-query (for each line), so at some point I will end up summing this value (together with grouping the necessary control data with it so I can join).
What will have the greatest impact:
summing the sub-query before the join and then joining with the aggregated version, or
leaving the data raw and then summing the value together with the rest of the main query?
Remember that there are thousands of records that will be summed for each line,
and the data is not native but built, and therefore may reside in memory
(that is just a guess from the query optimizer's perspective).
Usually I keep the group-by inside the subquery (referred to as an "inline view" in Oracle lingo).
This way the query is much more simple and clear.
Also I believe the execution plan is more efficient, because the data set to be aggregated is smaller and the resulting set of join keys is also smaller.
This is not a definitive answer though. If the row source that you are joining to the inline view has few matching rows, you may find that an early join reduces the aggregation effort.
The right answer is: benchmark the queries for your particular data set.
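To make the two options concrete, here is a hedged sketch with invented table and column names:

-- option 1: aggregate inside the inline view, then join the smaller result
select m.id, m.descr, d.amt
from main m
join (select detail_id, sum(amount) as amt
      from detail
      group by detail_id) d
  on d.detail_id = m.id

-- option 2: join the raw detail rows first, then aggregate
select m.id, m.descr, sum(d.amount) as amt
from main m
join detail d
  on d.detail_id = m.id
group by m.id, m.descr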
I think in such a general case there is no right or wrong way to do it. The performance of a query like the one that you describe depends on many different factors:
what kind of join are you actually doing (and what algorithm is used in the background)
is the data to be joined small enough to fit into the memory of the machine joining it?
what query optimizations are available, i.e. which DBMS you use (Oracle, MS SQL Server, MySQL, ...)
...
For your case I simply suggest benchmarking. I'm sorry if that does not seem like a satisfactory answer, but it is the way to go in many performance questions...
So set up a simple test using both your approaches and some test data, then pick whatever is faster.

Pre-fetching row counts before query - performance

I recently answered this question based on my experience:
Counting rows before proceeding to actual searching
but I'm not 100% satisfied with the answer I gave.
The question is basically this: Can I get a performance improvement by running a COUNT over a particular query before deciding to run the query that brings back the actual rows?
My intuition is this: you will only save the I/O and wire time associated with retrieving the data instead of the count because to count the data, you need to actually find the rows. The possible exception to this is when the query is a simple function of the indexes.
My question then is this: Is this always true? What other exception cases are there? From a pure performance perspective, in what cases would one want to do a COUNT before running the full query?
First, the answer to your question is highly dependent on the database.
I cannot think of a situation when doing a COUNT() before a query will shorten the overall time for both the query and the count().
In general, doing a count will pre-load tables and indexes into the page cache. Assuming the data fits in memory, this will make the subsequent query run faster (although not much faster if you have fast I/O and the database does read-ahead page reading). However, you have just shifted the time frame to the COUNT(), rather than reducing overall time.
To shorten the overall time (including the run time of the COUNT()) would require changing the execution plan. Here are two ways this could theoretically happen:
A database could update statistics as a table is read in, and these statistics, in turn, change the query plan for the main query.
A database could change the execution plan based on whether tables/indexes are already in the page cache.
Although theoretically possible, I am not aware of any database that does either of these.
You could imagine that intermediate results could be stored, but this would violate the dynamic nature of SQL databases. That is, updates/inserts could occur on the tables between the COUNT() and the query. A database engine could not both maintain integrity and keep such intermediate results.
Doing a COUNT() has disadvantages, relative to speeding up the subsequent query. The query plan for the COUNT() might be quite different from the query plan for the main query. Your example with indexes is one case. Another case would be in a columnar database, where different vertical partitions of the data do not need to be read.
Yet another case would be a query such as:
select t.*, r.val
from some_table t
left outer join ref r
  on t.refID = r.refID
and refID is a unique index on the ref table. This join can be eliminated for a count, since there are no duplicates and all records in t are used. However, the join is clearly needed to return the actual rows. Once again, whether a SQL optimizer recognizes and acts on this situation is entirely the decision of the writers of the database, but the join could theoretically be optimized away for the COUNT().
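As a hedged illustration of that join elimination (using the same made-up table names as above):

-- for the count, the left outer join to ref on its unique key neither adds nor removes rows,
-- so in principle
select count(*)
from some_table t
left outer join ref r
  on t.refID = r.refID
-- could be reduced to
select count(*)
from some_table t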

performance - single join select vs. multiple simple selects

What is better as far as performance goes?
There is only one way to know: Time it.
In general, I think a single join enables the database to do a lot of optimizations, as it can see all the tables it needs to scan, overhead is reduced, and it can build up the result set locally.
Recently, I had about 100 select-statements which I changed into a JOIN in my code. With a few indexes, I was able to go from 1 minute running time to about 0.6 seconds.
Do not try to write your own join loop as a bunch of selects. Your database server has many clever algorithms for doing joins. Further, your database server can use statistics and estimated cost of access to dynamically pick a join algorithm.
The database server's join algorithm is -- usually -- better than anything you might concoct. Database servers know more about physical I/O, caching and what-not.
This allows you to focus on your problem domain.
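A hedged sketch of the difference (invented tables again): the "bunch of selects" pattern issues one query per row, while the join lets the server do the work in one pass.

-- N+1 pattern: one select for the parent rows, then one select per row in a client-side loop
select id, customer_id
from orders
where order_date = '2024-01-01'
-- ...then, for each customer_id returned:
--   select name from customers where id = ?

-- single join: the server chooses the join algorithm and returns one result set
select o.id, c.name
from orders o
join customers c
  on c.id = o.customer_id
where o.order_date = '2024-01-01'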
A single join will usually outperform multiple single selects. However, there are too many different cases that fit your question. It isn't wise to lump them together under a single simple rule.
More important, a single join will usually be easier for the next programmer to understand and to revise, provided that you and the next programmer "speak the same language" when you use SQL. I'm talking about the language of sets of tuples.
And equally important is that database physical design and query design need to focus first on the questions that will result in a ten-for-one speed improvement, not on a 10% speed improvement. If you were doing thousands of simple selects versus a single join, you might get a ten-for-one advantage. If you are doing three or four simple selects, you won't see a big improvement one way or the other.
One thing to consider, besides what has been said, is that the selects will return more data through the network than the joins probably will. If the network connection is already a bottleneck, this could make it much worse, especially if this is done frequently. That said, your best bet in any performance situation is to test, test, test.
It all depends on how the database will optimize the joins, and the use of indexes.
I had a slow and complex query with lots of joins. Then I subdivided it into 2 or 3 less complex queries. The performance gain was astonishing.
But in the end, "it depends": you have to know where the bottleneck is.
As has been said before, there is no right answer without context.
The answer to this is dependent on (off the top of my head):
the amount of joining
the type of joining
indexing
the amount of re-use you could have for any of the separate pieces to be joined
the amount of data to be processed
the server setup
etc.
If you are using SQL Server (I am not sure if this is available with other RDBMSs), I would suggest that you include an execution plan with your query results. This will give you the ability to see exactly how your queries are being executed and what is causing any bottlenecks.
Until you know what SQL Server is actually doing I wouldn't hazard a guess about which query is better.
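For example, in SQL Server you could surface timing and I/O details for a test run of each variant (these are standard T-SQL session settings; the queries themselves are up to you):

set statistics io on;     -- report logical reads per statement
set statistics time on;   -- report CPU and elapsed time per statement
-- run each candidate query here and compare the Messages output
set statistics time off;
set statistics io off;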
If your database has lots of data and there are multiple joins, then please use indexing for better performance.
If there are left/right outer joins in this case, then use multiple selects.
It all depends on your db size, your query, the indexes (which include primary and foreign keys also)... One cannot reach a yes/no conclusion on your question.