Is there any way to simplify these two subqueries into one? - sql

I have the call
SELECT *,
(SELECT first_name||' '||last_name FROM users WHERE user_id=U.invited_by) AS inviter,
(SELECT first_name FROM users WHERE user_id=U.invited_by) AS inviter_first
FROM users AS U
and that works. But as you can see, the two subqueries both retrieve data from the same row. Is there any way to merge the two SELECT calls into one and still get the same results?

You have to do a join. Since you are joining back to the same table use an alias.
SELECT U.*,
i_table.first_name||' '||i_table.last_name AS inviter,
i_table.first_name as inviter_first
FROM users as U
LEFT JOIN users as i_table on i_table.user_id=U.invited_by
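Note that this changes your query from running 2 correlated subqueries per row to running 1 joined query. Without an index on user_id, each subquery may have to scan the whole table, so the original does about 2n scans of n rows, which is O(n^2).
If you have an index on user_id you should see a large improvement in performance.
If you don't, the join should still be a lot faster, since the database can match the rows in roughly one pass over each side (for example with a hash join), which is O(n).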
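To check that the join really returns the same values as the two subqueries, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are made up for illustration, and SQLite stands in for whatever database the question is actually using.

```python
import sqlite3

# In-memory database with a few hypothetical users; user 1 was invited by nobody.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id    INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name  TEXT,
        invited_by INTEGER
    );
    INSERT INTO users VALUES
        (1, 'Ann', 'Lee',  NULL),
        (2, 'Bob', 'Ray',  1),
        (3, 'Cy',  'Diaz', 1),
        (4, 'Dee', 'Fox',  3);
""")

# Original form: two correlated subqueries per row.
subquery_sql = """
    SELECT U.user_id,
           (SELECT first_name||' '||last_name FROM users WHERE user_id = U.invited_by) AS inviter,
           (SELECT first_name FROM users WHERE user_id = U.invited_by) AS inviter_first
    FROM users AS U
    ORDER BY U.user_id
"""

# Rewritten form: one self-join with an alias.
join_sql = """
    SELECT U.user_id,
           i_table.first_name||' '||i_table.last_name AS inviter,
           i_table.first_name AS inviter_first
    FROM users AS U
    LEFT JOIN users AS i_table ON i_table.user_id = U.invited_by
    ORDER BY U.user_id
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
join_rows = conn.execute(join_sql).fetchall()
print(subquery_rows == join_rows)
```

The LEFT JOIN matters here: a user with no inviter still appears, with NULL in both inviter columns, exactly as with the subqueries.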

Related

Performance of Join vs Subquery in Where Clause (HIVE)

Can someone please help me understand which approach would be the most efficient.
The first table users_of_interest_table has one column users that has ~1,000 unique user IDs.
The second table app_logs_table has a users column as well as an app_log column. The table has more than 1 billion rows and over 10 million unique users.
What is the most efficient way to get all the app log data for the users in users_of_interest? Here is what I have come up with so far.
Option 1: Use Inner Join
SELECT
u.users, a.app_logs
FROM
users_of_interest_table u
INNER JOIN
app_logs_table a
ON
u.users = a.users
Option 2: Subquery in Where Clause
SELECT
a.users, a.app_logs
FROM
app_logs_table a
WHERE
a.users IN (SELECT u.users FROM users_of_interest_table u)
The community advises the use of the JOIN clause, but in some tests that I have done, the IN clause has been more efficient.
You must run the test yourself on your own data. On SQL Server the SQL Server Profiler tool can be used for this; in Hive, EXPLAIN shows the plan for each query.

Understanding when to use a subquery over a join

I seem to be missing something. I keep reading that you should use a join instead of a sub-select in most articles I read. However running a quick experiment myself shows a big win for the sub-query when it comes down to execution time.
Trying to get all first names of people that have made a bid (I presume the tables speak for themselves) results in the following.
This join takes 10 seconds
select U.firstname
from Bid B
inner join [User] U on U.userName = B.[user]
This query with sub-query takes 3 seconds
select firstname
from [User]
where userName in (select [user] from bid)
Why is my experiment not in line with what I keep reading everywhere or am I missing something?
Experimenting further, I found that execution times are the same after adding DISTINCT to both.
They're not the same thing. In the query with joins you can potentially multiply rows or have rows entirely removed from the results.
Inner Join removes rows on non-matched keys. It also multiplies rows on any matched keys that repeat in either one or both tables being joined. Inner Join therefore goes through the additional step of multiplying and removing rows.
The subquery you used is a plain SELECT with no WHERE filter and no join of its own, so it is as fast as a simple SELECT. The outer query then only checks whether each [User] row has a match in it, so each user is returned at most once, however many bids they have made.
Some may argue that outer joins return NULLs similar to subqueries, but they can still multiply rows. Hence, subqueries and joins are not the same thing.
In the queries you provided, you want to use the 2nd query (the one with the subquery), since it doesn't multiply rows: each user who has made a bid appears once.
Good Read for Subquery vs Inner Join
https://www.essentialsql.com/subquery-versus-inner-join/
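The row multiplication described above is easy to see on a tiny data set. This is a sketch only: it uses Python's sqlite3 module rather than SQL Server, the bid_id and amount columns and all sample rows are assumptions, and the table and column names are quoted with double quotes instead of square brackets.

```python
import sqlite3

# Hypothetical data: Ann has made three bids, Bob one, Cy none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "User" (userName TEXT PRIMARY KEY, firstname TEXT);
    CREATE TABLE Bid (bid_id INTEGER PRIMARY KEY, "user" TEXT, amount INTEGER);
    INSERT INTO "User" VALUES ('ann1', 'Ann'), ('bob2', 'Bob'), ('cy3', 'Cy');
    INSERT INTO Bid ("user", amount) VALUES
        ('ann1', 10), ('ann1', 20), ('ann1', 30), ('bob2', 5);
""")

# Inner join: one output row per matching bid, so Ann appears three times.
join_rows = conn.execute("""
    SELECT U.firstname
    FROM Bid B
    INNER JOIN "User" U ON U.userName = B."user"
    ORDER BY U.firstname
""").fetchall()

# IN subquery: one output row per user who has at least one bid.
in_rows = conn.execute("""
    SELECT firstname
    FROM "User"
    WHERE userName IN (SELECT "user" FROM Bid)
    ORDER BY firstname
""").fetchall()

# DISTINCT collapses the duplicates produced by the join.
distinct_join_rows = conn.execute("""
    SELECT DISTINCT U.firstname
    FROM Bid B
    INNER JOIN "User" U ON U.userName = B."user"
    ORDER BY U.firstname
""").fetchall()

print(len(join_rows), len(in_rows), len(distinct_join_rows))
```

This also fits the observation in the question that the timings converge once DISTINCT is added: at that point both queries are asked for the same set of rows. Note that DISTINCT on firstname would also merge two different users who share a first name, which the IN form does not do.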

SQL COUNT(col) vs extra logging column... efficiency?

I can't seem to find much information about this.
I have a table to log users comments. I have another table to log likes / dislikes from other users for each comment.
Therefore, when selecting this data to be displayed on a web page, there is a complex query requiring joins and subqueries to count all likes / dislikes.
My example is a query someone kindly helped me with on here to achieve the required results:
SELECT comments.comment_id, comments.descr, comments.created, usrs.usr_name,
(SELECT COUNT(*) FROM comment_likers WHERE comment_id=comments.comment_id AND liker=1)likes,
(SELECT COUNT(*) FROM comment_likers WHERE comment_id=comments.comment_id AND liker=0)dislikes,
comment_likers.liker
FROM comments
INNER JOIN usrs ON ( comments.usr_id = usrs.usr_id )
LEFT JOIN comment_likers ON ( comments.comment_id = comment_likers.comment_id
AND comment_likers.usr_id = $usrID )
WHERE comments.topic_id=$tpcID
ORDER BY comments.created DESC;
However, if I added a likes and dislikes column to the COMMENTS table and created a trigger to automatically increment / decrement these columns as likes get inserted / deleted / updated to the LIKER table then the SELECT statement would be more simple and more efficient than it is now. I am asking, is it more efficient to have this complex query with the COUNTS or to have the extra columns and triggers?
And to generalise, is it more efficient to COUNT or to have an extra column for counting when being queried on a regular basis?
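Your query is inefficient. You can eliminate those subqueries, which should improve performance.
Your two subqueries can be replaced by sums over a join to all of the comment's votes (liker is 1 for a like and 0 for a dislike):
sum(all_likers.liker) likes,
sum(abs(all_likers.liker - 1)) dislikes,
Making the whole query this (the sums need a GROUP BY, and the current user's own vote comes from a second join to comment_likers, assuming at most one vote per user per comment):
SELECT comments.comment_id, comments.descr, comments.created, usrs.usr_name,
COALESCE(sum(all_likers.liker), 0) likes,
COALESCE(sum(abs(all_likers.liker - 1)), 0) dislikes,
my_liker.liker
FROM comments
INNER JOIN usrs ON comments.usr_id = usrs.usr_id
LEFT JOIN comment_likers all_likers ON comments.comment_id = all_likers.comment_id
LEFT JOIN comment_likers my_liker ON comments.comment_id = my_liker.comment_id
AND my_liker.usr_id = $usrID
WHERE comments.topic_id=$tpcID
GROUP BY comments.comment_id, comments.descr, comments.created, usrs.usr_name, my_liker.liker
ORDER BY comments.created DESC;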
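One thing worth verifying with the SUM-based form is that it gives the same counts as the COUNT(*) subqueries, including for comments that have no votes at all. The sketch below does that with Python's sqlite3 module on made-up rows; the tables are cut down to the columns that matter, and the alias names are assumptions.

```python
import sqlite3

# Hypothetical data: comment 1 has 2 likes and 1 dislike, comment 2 has
# 1 dislike, comment 3 has no votes at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comments (comment_id INTEGER PRIMARY KEY, descr TEXT);
    CREATE TABLE comment_likers (comment_id INTEGER, usr_id INTEGER, liker INTEGER);
    INSERT INTO comments VALUES (1, 'first'), (2, 'second'), (3, 'no votes');
    INSERT INTO comment_likers VALUES
        (1, 10, 1), (1, 11, 1), (1, 12, 0),
        (2, 10, 0);
""")

# Form from the question: two correlated COUNT(*) subqueries per comment.
subquery_sql = """
    SELECT c.comment_id,
           (SELECT COUNT(*) FROM comment_likers
             WHERE comment_id = c.comment_id AND liker = 1) AS likes,
           (SELECT COUNT(*) FROM comment_likers
             WHERE comment_id = c.comment_id AND liker = 0) AS dislikes
    FROM comments c
    ORDER BY c.comment_id
"""

# Aggregated form: join to all votes once and sum per comment.
# COALESCE turns the NULL that SUM returns for a vote-less comment into 0.
grouped_sql = """
    SELECT c.comment_id,
           COALESCE(SUM(l.liker), 0)     AS likes,
           COALESCE(SUM(1 - l.liker), 0) AS dislikes
    FROM comments c
    LEFT JOIN comment_likers l ON l.comment_id = c.comment_id
    GROUP BY c.comment_id
    ORDER BY c.comment_id
"""

subquery_counts = conn.execute(subquery_sql).fetchall()
grouped_counts = conn.execute(grouped_sql).fetchall()
print(grouped_counts)
```

The join used for summing must cover every vote on the comment. If it is restricted to a single user's vote, the sums can only ever be 0 or 1.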

Queries within queries: Is there a better way?

As I build bigger, more advanced web applications, I'm finding myself writing extremely long and complex queries. I tend to write queries within queries a lot because I feel making one call to the database from PHP is better than making several and correlating the data.
However, anyone who knows anything about SQL knows about JOINs. Personally, I've used a JOIN or two before, but quickly stopped when I discovered using subqueries because it felt easier and quicker for me to write and maintain.
Commonly, I'll do subqueries that may contain one or more subqueries from related tables.
Consider this example:
SELECT
(SELECT username FROM users WHERE records.user_id = user_id) AS username,
(SELECT last_name||', '||first_name FROM users WHERE records.user_id = user_id) AS name,
in_timestamp,
out_timestamp
FROM records
ORDER BY in_timestamp
Rarely, I'll do subqueries after the WHERE clause.
Consider this example:
SELECT
user_id,
(SELECT name FROM organizations WHERE (SELECT organization FROM locations WHERE records.location = location_id) = organization_id) AS organization_name
FROM records
ORDER BY in_timestamp
In these two cases, would I see any sort of improvement if I decided to rewrite the queries using a JOIN?
As more of a blanket question, what are the advantages/disadvantages of using subqueries or a JOIN? Is one way more correct or accepted than the other?
In simple cases, the query optimiser should be able to produce identical plans for a simple join versus a simple sub-select.
But in general (and where appropriate), you should favour joins over sub-selects.
Plus, you should avoid correlated subqueries (queries in which the inner expression refers to the outer one), as they are effectively a for loop within a for loop. In most cases a correlated subquery can be written as a join.
JOINs are preferable to separate [sub]queries.
If the subselect (AKA subquery) is not correlated to the outer query, it's very likely the optimizer will scan the table(s) in the subselect once because the value isn't likely to change. When you have correlation, like in the example provided, single-pass optimization becomes very unlikely. In the past, it's been believed that correlated subqueries execute RBAR: Row By Agonizing Row. With a JOIN, the same result can be achieved while ensuring a single pass over the table.
This is a proper re-write of the query provided:
SELECT u.username,
u.last_name||', '|| u.first_name AS name,
r.in_timestamp,
r.out_timestamp
FROM RECORDS r
LEFT JOIN USERS u ON u.user_id = r.user_id
ORDER BY r.in_timestamp
...because the subselect can return NULL if the user_id doesn't exist in the USERS table. Otherwise, you could use an INNER JOIN:
SELECT u.username,
u.last_name ||', '|| u.first_name AS name,
r.in_timestamp,
r.out_timestamp
FROM RECORDS r
JOIN USERS u ON u.user_id = r.user_id
ORDER BY r.in_timestamp
Derived tables/inline views are also possible using JOIN syntax.
a) I'd start by pointing out that the two are not necessarily interchangeable. Nesting as you have requires there to be 0 or 1 matching values, otherwise you will get an error. A join has no such requirement and may exclude the record or introduce more depending on your data and type of join.
b) In terms of performance, you will need to check the query plans, but your nested examples are unlikely to be more efficient than a table join. Typically subqueries are executed once per row, but that very much depends on your database, unique constraints, foreign keys, NOT NULL constraints, etc. Maybe the DB can rewrite them more efficiently, but joins can use a wider variety of techniques, drive the data from different tables, etc., because they do different things (though you may not observe any difference in your output depending on your data).
c) Most DB aware programmers I know would look at your nested queries and rewrite using joins, subject to the data being suitably 'clean'.
d) Regarding "correctness" - I would favour joins backed up with proper constraints on your data where necessary (e.g. a unique user ID). You as a human may make certain assumptions but the DB engine cannot unless you tell it. The more it knows, the better job it (and you) can do.
Joins will in most cases be much faster.
Let's take an example.
Let's use your first query:
SELECT
(SELECT username FROM users WHERE records.user_id = user_id) AS username,
(SELECT last_name||', '||first_name FROM users WHERE records.user_id = user_id) AS name,
in_timestamp,
out_timestamp
FROM records
ORDER BY in_timestamp
Now suppose there are 100 rows in records and 100 rows in users, and assume there is no index on user_id.
As written, your query works like this:
For each record
Scan up to 100 rows in users to find the username
Scan up to 100 rows in users to find the last name and first name
So we read up to 100 * 100 * 2 = 20,000 rows from users. Is that really worth it? An index on user_id makes things better, but it is still two lookups per record.
Now consider a join (a nested loop join does almost the same work as above, so consider a hash join):
It works like this:
Make a hash map of users.
For each record
Find the matching user in the hash map, which is much faster than looping to find it.
So clearly, joins should be favoured.
NOTE: With only 100 rows the optimiser may produce identical plans; the point is to analyse how the choice can affect performance.
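The difference between the two strategies can be sketched in plain Python by counting how much work each one does. This is only an illustration of the idea, not how any particular database implements its joins; the function names and the sample data are made up.

```python
def nested_loop_lookup(records, users):
    """For each record, scan the user list until the matching user_id is found."""
    comparisons = 0
    result = []
    for rec in records:
        name = None
        for user in users:
            comparisons += 1
            if user["user_id"] == rec["user_id"]:
                name = user["username"]
                break
        result.append((rec["record_id"], name))
    return result, comparisons


def hash_lookup(records, users):
    """Build a dict keyed by user_id once, then probe it once per record."""
    by_id = {user["user_id"]: user["username"] for user in users}
    probes = 0
    result = []
    for rec in records:
        probes += 1
        result.append((rec["record_id"], by_id.get(rec["user_id"])))
    return result, probes


# 100 users and 100 records, each record pointing at a different user.
users = [{"user_id": i, "username": f"user{i}"} for i in range(100)]
records = [{"record_id": i, "user_id": 99 - i} for i in range(100)]

loop_result, loop_comparisons = nested_loop_lookup(records, users)
hash_result, hash_probes = hash_lookup(records, users)

print(loop_comparisons, hash_probes)
```

Both functions return the same pairs, but the scan stops early on a match and still needs thousands of comparisons, while the dict version does one probe per record after a single pass to build the map. The query in the question does the scanning work twice, once per subquery.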

Select Users with more Items

Each user HAS MANY photos and HAS MANY comments. I would like to order users by SUM(number_of_photos, number_of_comments)
Can you suggest me the SQL query?
GROUP BY with JOINs works more efficiently than dependent subqueries (in all relational DBs I know):
Select Users.* From Users
Left Join Photos On (Photos.user_id = Users.id)
Left Join Comments On (Comments.user_id = Users.id)
Group By Users.id
Order By (Count(Distinct Photos.id) + Count(Distinct Comments.id))
with some assumptions on the tables (e.g. an id primary key in each of them). The Distinct is needed because joining both tables pairs every photo of a user with every comment, which would otherwise inflate both counts.
Select * From Users U
Order By (Select Count(*) From Photos
Where userId = U.UserId) +
(Select Count(*) From Comments
Where userId = U.UserId)
EDIT: Although every query using subqueries can also be written using joins, which will be faster is not a simple question, and it is irrelevant unless the system is experiencing performance problems.
1) Both constructions must be translated by the query optimizer into a query plan which includes some type of correlated join, be it a nested loop join, hash-join, merge join, or whatever. And it's entirely possible, (even likely), that they will both result in the same query plan.
NOTE: This is because the entire SQL Statement is translated into a single query plan. The subqueries do NOT get their own, individual query plans as though they were being executed in isolation.
What query plan and what type of joins are used will depend on the data structure and the data in each specific situation. The only way to tell which is faster is to try both, in controlled environments, and measure the performance... but,
2) Unless the system is experiencing an issue with performance (unacceptably poor performance), clarity is more important. And for problems like the one described above (where none of the data attributes in the "other" tables are required in the output of the SQL statement), a subquery is much clearer in describing the function and purpose of the SQL than a join with GROUP BYs would be.
I think that the accepted solutions would be problematic from a performance standpoint, assuming you have many users, photos, and comments. Your query runs two separate select statements for every row in the user table.
What you want to do is synthesize a query using ActiveRecord that looks like this:
SELECT u.*, COUNT(DISTINCT c.id) + COUNT(DISTINCT p.id) AS total_count
FROM users u LEFT JOIN photos p ON u.id = p.user_id
LEFT JOIN comments c ON u.id = c.user_id
GROUP BY u.id
ORDER BY total_count DESC
The join should be much more efficient. Using left joins ensures that even if a user has no comments or photos they will still be included in the results. The DISTINCT is needed because joining both tables pairs every photo of a user with every comment, which would otherwise inflate both counts.
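A caveat with joining both photos and comments in one query is that each of a user's photos gets paired with each of their comments, so a plain COUNT over the joined rows overcounts. The sketch below compares plain and DISTINCT counts using Python's sqlite3 module; the sample rows, the name column, and the totals helper are assumptions for illustration.

```python
import sqlite3

# Hypothetical data: ann has 2 photos and 3 comments, bob has 1 photo,
# cy has 1 comment.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users    (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE photos   (id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO photos (user_id)   VALUES (1), (1), (2);
    INSERT INTO comments (user_id) VALUES (1), (1), (1), (3);
""")


def totals(count_expr):
    """Run the two-LEFT-JOIN query with the given counting expression."""
    sql = f"""
        SELECT u.name, {count_expr} AS total_count
        FROM users u
        LEFT JOIN photos p   ON u.id = p.user_id
        LEFT JOIN comments c ON u.id = c.user_id
        GROUP BY u.id
        ORDER BY total_count DESC, u.name
    """
    return conn.execute(sql).fetchall()


# Plain COUNT sees 2 x 3 = 6 joined rows for ann and counts each side 6 times.
plain_counts = totals("COUNT(c.id) + COUNT(p.id)")

# COUNT(DISTINCT ...) counts each photo and each comment once.
distinct_counts = totals("COUNT(DISTINCT c.id) + COUNT(DISTINCT p.id)")

print(plain_counts)
print(distinct_counts)
```

On this data the ordering happens to be the same either way, but the totals are not, and with other data the inflated counts can change the order. Users with only photos or only comments are unaffected, which makes the problem easy to miss in testing.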
If I were to assume that you had a count of comments and a count of photos (user.number_of_photos, user.number_of_comments; as seen above), it would be simple:
Select user_id from user order by number_of_photos + number_of_comments DESC
In Ruby On Rails:
User.find(:all, :order => '((SELECT COUNT(*) FROM photos WHERE user_id=users.id) + (SELECT COUNT(*) FROM comments WHERE user_id=users.id)) DESC')