Performance Comparison of Multiple Join and Multiple Subqueries - sql

I know this type of question has been asked before, but I couldn't find one with my exact problem.
I'll try to give an exaggerated example.
Let's say that we want to find companies with at least one employee older than 40 and at least one customer younger than 20.
The query my colleague wrote for this problem looks like this:
SELECT DISTINCT c.NAME FROM COMPANY c
LEFT JOIN EMPLOYEE e ON c.COMPANY_ID = e.COMPANY_ID
LEFT JOIN CUSTOMER u ON c.COMPANY_ID = u.COMPANY_ID
WHERE e.AGE > 40 AND u.AGE < 20
I'm new to databases, but looking at this query (as if it were a time-complexity problem), it will create an unnecessarily huge intermediate result: employeeAmount × customerAmount rows for each company. For example, a company with 1,000 matching employees and 1,000 matching customers would contribute 1,000,000 rows before DISTINCT collapses them.
So, I re-wrote the query:
SELECT c.NAME FROM COMPANY c
WHERE EXISTS (SELECT * FROM EMPLOYEE e WHERE e.AGE > 40 AND c.COMPANY_ID = e.COMPANY_ID )
AND EXISTS (SELECT * FROM CUSTOMER u WHERE u.AGE < 20 AND c.COMPANY_ID = u.COMPANY_ID)
I do not know whether this query is worse, since it runs two subqueries for each company.
I know there may be better ways to write this. For example, writing two separate subqueries for the two age conditions and then INTERSECT'ing them may be better (see the sketch below). But I really want to learn whether there is something wrong with either, or both, of these two queries.
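A sketch of that set-based alternative, assuming a database that supports INTERSECT (Oracle, SQL Server, PostgreSQL, etc.):
SELECT c.NAME FROM COMPANY c
WHERE c.COMPANY_ID IN (
    SELECT e.COMPANY_ID FROM EMPLOYEE e WHERE e.AGE > 40
    INTERSECT
    SELECT u.COMPANY_ID FROM CUSTOMER u WHERE u.AGE < 20
)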
Note: You can increase the number of joins/subqueries. For example: "we want to find companies with at least one employee older than 40, at least one customer younger than 20, and at least one order bigger than $1000".
Thanks.

The EXISTS version should have much better performance in general, especially if you have indexes on COMPANY_ID in each of the child tables (EMPLOYEE and CUSTOMER).
Why? The JOIN version creates an intermediate result containing, for each company, all its employees over 40 combined with all its customers under 20. That could be quite large if these groups are large for a particular company. Then the query does additional work to remove duplicates.
There might be some edge cases where the first version performs fine. I would expect this, for instance, if either of the groups were empty -- no employees ever over 40 or no customers ever under 20. Then the intermediate result set is empty and removing duplicates is unnecessary. For the general case, though, I recommend EXISTS.
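For reference, a minimal sketch of such indexes (the index names are made up):
-- Composite indexes let each EXISTS probe seek on COMPANY_ID and
-- evaluate the AGE predicate from the index alone.
CREATE INDEX ix_employee_company_age ON EMPLOYEE (COMPANY_ID, AGE);
CREATE INDEX ix_customer_company_age ON CUSTOMER (COMPANY_ID, AGE);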

To know what really happens in your environment, with your database settings and your data, you need to compare actual execution plans (not just EXPLAIN PLAN, which gives only the estimated plan). Only the actual execution plan can show the resources the query used, such as CPU and I/O, in addition to the detailed steps Oracle took (full table scans, joins, etc.).
Try:
ALTER SESSION SET statistics_level = ALL;
<your query>
SELECT * FROM TABLE(dbms_xplan.display_cursor(NULL, NULL, 'ALLSTATS LAST'));
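If you can edit the query text, an equivalent alternative (a sketch, not the only way) is the gather_plan_statistics hint, which collects the same row-source statistics for a single statement:
SELECT /*+ gather_plan_statistics */ c.NAME FROM COMPANY c
WHERE EXISTS (SELECT * FROM EMPLOYEE e WHERE e.AGE > 40 AND c.COMPANY_ID = e.COMPANY_ID)
AND EXISTS (SELECT * FROM CUSTOMER u WHERE u.AGE < 20 AND c.COMPANY_ID = u.COMPANY_ID);
SELECT * FROM TABLE(dbms_xplan.display_cursor(NULL, NULL, 'ALLSTATS LAST'));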
Do not assume, just test.

Related

Rewrite the following SQL query to make it more efficient / improve its execution, and explain the reasons for that

Relational Schema:
City(cityID, nameCity, nbInhabitants)
Company(companyID, companyName, nbEmployees, cityID) cityID: FK(City)
Given the following statistics:
• City contains 4 000 tuples with 20 tuples per page
• Company contains 200 000 tuples with 15 tuples per page
Now rewrite the following query to improve its execution, and explain the reasons for those benefits:
SELECT DISTINCT companyID
FROM City NATURAL JOIN Company
WHERE nbEmployees >= 5000
AND nameCity = 'Lisboa'
Thank you so much
First, I don't think the SELECT DISTINCT is needed. Why would companyID be duplicated in the Company table? Also, I would expect the join to bring in only one city.
Avoid NATURAL JOIN. It is just bugs waiting to happen: you don't know which join keys are being used, and it doesn't even use properly declared foreign-key relationships.
Let me assume this is your query:
select c.CompanyId
from Company c join
City ci
on c.cityId = ci.cityId
where c.nbEmployees >= 5000 and ci.nameCity = 'Lisboa';
You have two approaches to optimization. I would first suggest indexes on Company(nbEmployees, cityID) and City(cityID, nameCity).
If you have lots of companies with at least 5,000 employees and very few in Lisbon, then the alternative indexing strategy is City(nameCity, cityID) and Company(cityID, nbEmployees, companyID), as sketched below.
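A sketch of those two strategies as DDL (index names are illustrative):
-- Strategy 1: lead with the employee-count filter
CREATE INDEX ix_company_emp_city ON Company (nbEmployees, cityID);
CREATE INDEX ix_city_id_name ON City (cityID, nameCity);
-- Strategy 2: lead with the city filter, if 'Lisboa' is the more selective predicate
CREATE INDEX ix_city_name_id ON City (nameCity, cityID);
CREATE INDEX ix_company_city_emp ON Company (cityID, nbEmployees, companyID);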

Put many columns in group by clause in Oracle SQL

In an Oracle 11g database, suppose we have tables CUSTOMER and PAYMENT as follows:
Customer
CUSTOMER_ID | CUSTOMER_NAME | CUSTOMER_AGE | CUSTOMER_CREATION_DATE
------------+---------------+--------------+------------------------
001         | John          | 30           | 1 Jan 2017
002         | Jack          | 10           | 2 Jan 2017
003         | Jim           | 50           | 3 Jan 2017
Payment
CUSTOMER_ID | PAYMENT_ID | PAYMENT_AMOUNT
------------+------------+----------------
001         | 900        | 100.00
001         | 901        | 200.00
001         | 902        | 300.00
003         | 903        | 999.00
We want to write an SQL statement to get all columns from table CUSTOMER together with the sum of all payments of each customer. There are many possible ways to do this, but I would like to ask which of the following is better.
Solution 1
SELECT C.CUSTOMER_ID
, MAX(C.CUSTOMER_NAME) CUSTOMER_NAME
, MAX(C.CUSTOMER_AGE) CUSTOMER_AGE
, MAX(C.CUSTOMER_CREATION_DATE) CUSTOMER_CREATION_DATE
, SUM(P.PAYMENT_AMOUNT) TOTAL_PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN PAYMENT P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
GROUP BY C.CUSTOMER_ID;
Solution 2
SELECT C.CUSTOMER_ID
, C.CUSTOMER_NAME
, C.CUSTOMER_AGE
, C.CUSTOMER_CREATION_DATE
, SUM(P.PAYMENT_AMOUNT) PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN PAYMENT P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
GROUP BY C.CUSTOMER_ID, C.CUSTOMER_NAME, C.CUSTOMER_AGE, C.CUSTOMER_CREATION_DATE
Please notice in Solution 1 that I use MAX not because I actually want the maximum, but because I want "ONE" row from the columns which I know are equal for all rows with the same CUSTOMER_ID.
In Solution 2, on the other hand, I avoid putting the misleading MAX in the SELECT part by putting the columns in the GROUP BY part instead.
With my current knowledge, I prefer Solution 1, because it is more important to comprehend the logic in the GROUP BY part than in the SELECT part: I would put only a set of unique keys there to express the intention of the query, so the application can infer the expected number of rows. But I don't know about the performance.
I ask this question because I am reviewing a code change to a big SQL statement that puts 50 columns in the GROUP BY clause because the author wants to avoid the MAX function in the SELECT part. I know we could refactor the query to avoid putting the irrelevant columns in both the GROUP BY and SELECT parts, but please set that option aside, because it would affect the application logic and require more time for testing.
Update
I have just tested both versions of my big query, as everyone suggested. The query is complex: it has 69 lines, involves more than 20 tables, and its execution plan is more than 190 lines, so I don't think this is the place to show it.
My production data is quite small for now, about 4,000 customers, and the query was run against the whole database. Only the CUSTOMER table and a few reference tables show TABLE ACCESS FULL in the execution plan; the other tables are accessed through indexes. The execution plans of the two versions differ slightly in the aggregation operation used (HASH GROUP BY vs. SORT AGGREGATE) in some places.
Both versions take about 13 minutes; there is no significant difference.
I have also tested simplified versions similar to the SQL in this question. Both versions have exactly the same execution plan and elapsed time.
With the current information, I think the most reasonable conclusion is that the result is unpredictable unless you test, since the optimizer does the job. I would appreciate any information that confirms or refutes this idea.
Another option is
SELECT C.CUSTOMER_ID
, C.CUSTOMER_NAME
, C.CUSTOMER_AGE
, C.CUSTOMER_CREATION_DATE
, P.PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN (
    SELECT CUSTOMER_ID, SUM(PAYMENT_AMOUNT) PAYMENT_AMOUNT
    FROM PAYMENT
    GROUP BY CUSTOMER_ID
) P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
To decide which of the three is better, just test them and look at the execution plans.
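In Oracle, for example, you can inspect the estimated plan of each candidate like this (a sketch using the derived-table version above):
EXPLAIN PLAN FOR
SELECT C.CUSTOMER_ID, C.CUSTOMER_NAME, C.CUSTOMER_AGE, C.CUSTOMER_CREATION_DATE, P.PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN (
    SELECT CUSTOMER_ID, SUM(PAYMENT_AMOUNT) PAYMENT_AMOUNT
    FROM PAYMENT
    GROUP BY CUSTOMER_ID
) P ON (P.CUSTOMER_ID = C.CUSTOMER_ID);
SELECT * FROM TABLE(dbms_xplan.display);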
Neither. Do the sum on payment, then join the results.
select c.*, p.total_payment -- c.* gets all columns from table alias c without typing them all out
from Customer c
left join -- left join in case you want to include customers with no payments
(
    select customer_id, sum(payment_amount) as total_payment
    from Payment
    group by customer_id
) p
on p.customer_id = c.customer_id
Solution 1 is costly. Even though the optimizer could avoid the unnecessary sorting, at some point you will be forced to add indexes/constraints over irrelevant columns to improve performance. Not a good practice in the long term.
Solution 2 is the Oracle way. The Oracle documentation states that the SELECT list may contain only aggregate functions and columns that appear in the GROUP BY clause. Oracle's engineers had valid reasons for that; however, it does not apply to every other RDBMS. In some of them you can simply write GROUP BY c.CUSTOMER_ID and all will be fine, as the sketch below shows.
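For instance, the following is accepted by MySQL (depending on the ONLY_FULL_GROUP_BY setting) and by PostgreSQL when CUSTOMER_ID is the primary key, but is rejected by Oracle 11g:
SELECT C.CUSTOMER_ID, C.CUSTOMER_NAME, C.CUSTOMER_AGE, C.CUSTOMER_CREATION_DATE,
       SUM(P.PAYMENT_AMOUNT) PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN PAYMENT P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
GROUP BY C.CUSTOMER_ID;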
For the sake of code readability, a --comment would be cheaper. In general, not embracing a platform's principles has a cost: more code, weird code, memory, disk space, performance, etc.
In Solution 1 the query will apply the MAX function to each of those columns. I don't know exactly how MAX works internally, but I assume it has to examine every value in each group to pick the largest (best-case scenario). It is kind of a time bomb: when your table gets bigger, this query will get worse very fast. So if you are concerned about performance, you should pick Solution 2. It looks messier, but it will be better for the application.

How can I improve a mostly "degenerate" inner join?

This is Oracle 11g.
I have two tables whose relevant columns are shown below (I have to take the tables as given -- I cannot change the column datatypes):
CREATE TABLE USERS
(
UUID VARCHAR2(36),
DATA VARCHAR2(128),
ENABLED NUMBER(1)
);
CREATE TABLE FEATURES
(
USER_UUID VARCHAR2(36),
FEATURE_TYPE NUMBER(4)
);
The tables express the concept that a user can be assigned a number of features. The (USER_UUID, FEATURE_TYPE) combination is unique.
I have two very similar queries I am interested in. The first one, expressed in English, is "return the UUIDs of enabled users who are assigned feature X". The second one is "return the UUIDs and DATA of enabled users who are assigned feature X". The USERS table has about 5,000 records and the FEATURES table has about 40,000 records.
I originally wrote the first query naively as:
SELECT u.UUID FROM USERS u
JOIN FEATURES f ON f.USER_UUID=u.UUID
WHERE f.FEATURE_TYPE=X and u.ENABLED=1
and that had lousy performance. As an experiment I tried to see what would happen if I didn't care about whether or not a user was enabled and that inspired me to try:
SELECT USER_UUID FROM FEATURES WHERE FEATURE_TYPE=X
and that ran very quickly. That in turn inspired me to try
(SELECT USER_UUID FROM FEATURES WHERE FEATURE_TYPE=X)
INTERSECT
(SELECT UUID FROM USERS WHERE ENABLED=1)
That didn't run as quickly as the second query, but ran much more quickly than the first.
After more thinking I realized that in the case at hand every user or almost every user was assigned at least one feature, which meant that the join condition was always or almost always true, which meant that the inner join completely or mostly degenerated into a cross join. And since 5,000 x 40,000 = 200,000,000 that is not a good thing. Obviously the INTERSECT version would be dealing with many fewer rows which presumably is why it is significantly faster.
Question: Is INTERSECT really the way go to in this case or should I be looking at some other type of join?
I wrote the second query, the one that also needs to return DATA, similarly to the very first one:
SELECT u.UUID, u.DATA FROM USERS u
JOIN FEATURES f ON f.USER_UUID=u.UUID
WHERE f.FEATURE_TYPE=X and u.ENABLED=1
But it would seem I can't do the INTERSECT trick here because there's no column in FEATURES that matches the DATA column.
Question: How can I rewrite this to avoid the degenerate join problem and perform like the query that doesn't return DATA?
I would intuitively use the EXISTS clause:
SELECT u.UUID
FROM USERS u
WHERE u.ENABLED=1
AND EXISTS (SELECT 1 FROM FEATURES f where f.FEATURE_TYPE=X and f.USER_UUID=u.UUID)
or similarly:
SELECT u.UUID, u.DATA
FROM USERS u
WHERE u.ENABLED=1
AND EXISTS (SELECT 1 FROM FEATURES f where f.FEATURE_TYPE=X and f.USER_UUID=u.UUID)
This way you can select every field from USERS since there is no need for INTERSECT anymore (which was a rather good choice for the 1st case, IMHO).
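If the EXISTS probe needs to be fast, the usual companion is a composite index on the correlated and filtered columns (the index name is made up; since the (USER_UUID, FEATURE_TYPE) combination is unique, it can even be a unique index):
CREATE UNIQUE INDEX ux_features_type_user ON FEATURES (FEATURE_TYPE, USER_UUID);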

SQL Server cross join performance

I have a table with 14,091 rows (two columns; let's say first name and last name). I also have a calendar table with 553 rows of just dates (the first of each month). I do a cross join to get every combination of first name, last name, and first of month, because that is my requirement. This takes just over a minute.
Is there anything I can do to make this faster, or can a cross join never get any faster, as I suspect?
People Table
first_name varchar2(100)
last_name varchar2(1000)
Dates Table
dt DateTime
select a.first_name, a.last_name, b.dt
from people a, dates b
It will be slow because it builds every possible combination: 14,091 × 553 rows. It is not going to be fast unless you add an index or use an inner join instead.
Yeah, it takes over a minute. Let's get this clear: you are talking about 14,091 × 553 rows, which is 7,792,323, roughly 7.8 million rows, and you are loading them into a data table (which is not known for performance).
Want to see slow? Put them into a grid. THEN you will see slow.
The requirements make no sense in a table. None. Absolutely none.
And no, there is no way to speed up the loading of 7.8 million rows into a data structure that is not meant to hold that amount of data.

SQL Maximum number of doctors in a department

My problem is this: I have a table named
Doctor(id, name, department)
and another table named
department(id, name).
A doctor is associated with a department (only one department, not more).
I have to write a query returning the department with the maximum number of doctors associated with it.
I am not sure how to proceed. I feel like I need to use a nested query, but I just started and I'm really bad at this.
I think it should be something like this, but again I'm not really sure, and I can't figure out what to put in the nested query:
SELECT department.id
FROM (SELECT FROM WHERE) , department d, doctor doc
WHERE doc.id = d.id
A common approach to the "find the ABC with the maximum number of XYZs" problem in SQL is as follows:
1. Query for a list of ABCs that includes each ABC's count of XYZs.
2. Order the list in descending order according to the count of XYZs.
3. Limit the result to a single item; that top item is the one you need.
In your case, you can do it like this (I am assuming MySQL syntax for taking the top row):
SELECT *
FROM department dp
ORDER BY (SELECT COUNT(*) FROM doctor d WHERE d.department_id=dp.id) DESC
LIMIT 1
You can use GROUP BY:
Select top (1) department.id, count(*) as numberOfDocs
from department
inner join Doctor on Doctor.department = department.id
Group by department.id
Order by count(*) desc
I generally avoid using subqueries in MySQL because of a well-known bug: due to it, MySQL executes the inner query once for every row of the outer query. So if you have 10 departments, the doctor subquery would be executed 10 times. The bug may have been fixed in MySQL 5.6. In this particular case the number of departments may not be large, so performance may not be your main concern, but the following solution should work for MySQL and be much better optimized. The answer by dasblinkenlight is almost the same; it just got ahead of me :). Note, however, that MySQL does not support the TOP keyword.
select dep.id, dep.name, count(doc.id) as dep_group_count
from Doctor doc
join department dep on doc.department = dep.id
group by doc.department
order by dep_group_count desc
limit 1