SELECT DISTINCT query taking too long SQL

This is my code below; it takes a very long time to execute, and adding SELECT DISTINCT makes it much slower.
What I'm trying to do is get the unique companies that satisfy these conditions and also calculate how many teams each company has (given by team_id, which is assigned to each user in the auth_user table).
Any help would be amazing; I want to learn how to write better SQL queries. I know that GROUP BY is the better way to do this, but I can't seem to get it to work.
SELECT DISTINCT u.company_id, c.name, c.company_type, c.office_location, (SELECT (COUNT(DISTINCT u.team_id)) FROM auth_user u WHERE u.company_id = c.id GROUP BY u.company_id) as number_of_teams, s.status, h.auto_renewal
FROM auth_user u, companies_company c, subscriptions_subscription s, hubspot_company h
WHERE u.company_id = c.id
AND s.company_id = c.id
AND h.myagi_id = c.id
ORDER BY u.company_id ASC

First of all refactor your query to use the 1992 JOIN syntax instead of your grandpa's comma-join syntax. (I'm a grandpa and I jumped at using JOIN as soon as it became available.)
SELECT DISTINCT u.company_id, c.name, c.company_type, c.office_location,
count_of_teams_TODO,
s.status, h.auto_renewal
FROM auth_user u
JOIN companies_company c ON u.company_id = c.id
JOIN subscriptions_subscription s ON s.company_id = c.id
JOIN hubspot_company h ON h.myagi_id = c.id
ORDER BY u.company_id ASC;
Then, I believe each user belongs to one team; that is, each user has one value of auth_user.team_id. And you want your result set to show how many teams each company has.
So substitute COUNT(DISTINCT u.team_id) AS teams for my count_of_teams_TODO placeholder. There's no need for a subquery or for SELECT DISTINCT. But the aggregate function COUNT() needs a GROUP BY, and we want to group by company, status, and auto_renewal, which gives this:
SELECT c.id AS company_id, c.name, c.company_type, c.office_location,
COUNT(DISTINCT u.team_id) AS teams,
s.status, h.auto_renewal
FROM auth_user u
JOIN companies_company c ON u.company_id = c.id
JOIN subscriptions_subscription s ON s.company_id = c.id
JOIN hubspot_company h ON h.myagi_id = c.id
GROUP BY c.id, c.name, c.company_type, c.office_location, s.status, h.auto_renewal
ORDER BY c.id ASC;
And that should do it. Study up on GROUP BY and aggregate functions. Every second you spend learning those concepts better will help you.
As far as performance goes, get this working and then ask another question. Tag it with query-optimization and read this before you ask it.
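To see COUNT(DISTINCT ...) with GROUP BY in action, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names follow the question, but the schema is trimmed to two tables and the sample rows are invented for illustration; the real database engine and schema are not known.

```python
import sqlite3

# In-memory database with a cut-down, assumed version of the question's schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies_company (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE auth_user (id INTEGER PRIMARY KEY, company_id INTEGER, team_id INTEGER);
    INSERT INTO companies_company VALUES (1, 'Acme'), (2, 'Globex');
    -- Acme: three users on two teams; Globex: two users on one team.
    INSERT INTO auth_user VALUES
        (1, 1, 10), (2, 1, 10), (3, 1, 11),
        (4, 2, 20), (5, 2, 20);
""")

# One output row per company; DISTINCT inside COUNT collapses users on the same team.
rows = conn.execute("""
    SELECT c.id AS company_id, c.name, COUNT(DISTINCT u.team_id) AS teams
    FROM auth_user u
    JOIN companies_company c ON u.company_id = c.id
    GROUP BY c.id, c.name
    ORDER BY c.id
""").fetchall()

print(rows)  # [(1, 'Acme', 2), (2, 'Globex', 1)]
```

GROUP BY already yields one row per company, so SELECT DISTINCT is not needed on top of it.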

Related

Why doesn't this work with an ON clause, but does with a WHERE clause?

Julia just finished conducting a coding contest, and she needs your help assembling the leaderboard! Write a query to print the respective hacker_id and name of hackers who achieved full scores for more than one challenge. Order your output in descending order by the total number of challenges in which the hacker earned a full score. If more than one hacker received full scores in same number of challenges, then sort them by ascending hacker_id.
My solution that doesn't work:
SELECT h.hacker_id, h.name
FROM Hackers h
JOIN Challenges c
ON h.hacker_id = c.hacker_id
JOIN Difficulty d
ON c.difficulty_level = d.difficulty_level
JOIN Submissions s
ON s.score = d.score
-- no where clause
GROUP BY h.hacker_id, h.name
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC, h.hacker_id;
Solution that works:
select h.hacker_id, h.name
from Submissions as s
join Hackers as h
on s.hacker_id = h.hacker_id
join Challenges as c
on s.challenge_id = c.challenge_id
join Difficulty as d
on c.Difficulty_level = d.Difficulty_level
-- the only real difference:
where s.score = d.score
group by h.hacker_id, h.name
having count(*) > 1
order by count(*) desc, h.hacker_id;
Why does having s.score = d.score in the WHERE clause make the query work, but having it in an ON clause as part of an INNER JOIN make it not work (on HackerRank.com where the query comes from)? I thought for INNER JOINs it didn't matter, because the optimizer rearranges them at will?
How do I know when to use something like s.score = d.score (or whatever the columns are) in a WHERE clause and not in an ON clause as part of an INNER JOIN?

Why would your code be correct? The problem is not ON vs WHERE.
You:
ON h.hacker_id = c.hacker_id
ON c.difficulty_level = d.difficulty_level
ON s.score = d.score
Them:
on s.hacker_id = h.hacker_id
on s.challenge_id = c.challenge_id
on c.Difficulty_level = d.Difficulty_level
where s.score = d.score
You have h & c for hacker_id but they have h & s.
You are missing challenge_id.
You are correct that in a sequence of inner/cross joins conjuncts can appear anywhere among the ON & WHERE as long as aliases are in scope.
CROSS JOIN vs INNER JOIN in SQL
Is there any rule of thumb to construct SQL query from a human-readable description?
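The point about conjuncts can be checked on a toy data set. This is a sketch using Python's built-in sqlite3 module; the table layout is guessed from the queries above and the rows are made up. Once the joins themselves are correct, moving s.score = d.score between WHERE and ON does not change the result of an inner join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Difficulty (difficulty_level INTEGER, score INTEGER);
    CREATE TABLE Challenges (challenge_id INTEGER, hacker_id INTEGER, difficulty_level INTEGER);
    CREATE TABLE Submissions (submission_id INTEGER, hacker_id INTEGER,
                              challenge_id INTEGER, score INTEGER);
    INSERT INTO Difficulty VALUES (1, 20), (2, 30);
    INSERT INTO Challenges VALUES (100, 1, 1), (101, 1, 2);
    INSERT INTO Submissions VALUES
        (1, 7, 100, 20),   -- hacker 7, full score
        (2, 7, 101, 30),   -- hacker 7, full score
        (3, 8, 100, 10),   -- hacker 8, partial score
        (4, 8, 101, 30);   -- hacker 8, full score
""")

# Same joins in both queries; only the placement of s.score = d.score differs.
in_where = """
    SELECT s.hacker_id, COUNT(*)
    FROM Submissions s
    JOIN Challenges c ON s.challenge_id = c.challenge_id
    JOIN Difficulty d ON c.difficulty_level = d.difficulty_level
    WHERE s.score = d.score
    GROUP BY s.hacker_id
    ORDER BY s.hacker_id
"""
in_on = """
    SELECT s.hacker_id, COUNT(*)
    FROM Submissions s
    JOIN Challenges c ON s.challenge_id = c.challenge_id
    JOIN Difficulty d ON c.difficulty_level = d.difficulty_level
                     AND s.score = d.score
    GROUP BY s.hacker_id
    ORDER BY s.hacker_id
"""
where_rows = conn.execute(in_where).fetchall()
on_rows = conn.execute(in_on).fetchall()

print(where_rows)  # [(7, 2), (8, 1)]
print(on_rows)     # [(7, 2), (8, 1)]
```

With outer joins the placement does matter, so this equivalence only holds for inner and cross joins.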

SQL query to find the top 3 in a category

Calling all sql enthusiasts!
Quick info: using PostgreSQL.
I have a query that returns the maximum number of likes for a user per category. What I want now is to show the top 3 users with the most likes per category.
A helpful resource was using this example to solve the problem:
select type, variety, price
from fruits
where (
select count(*) from fruits as f
where f.type = fruits.type and f.price <= fruits.price
) <= 2;
I understand this, but my query is using joins and I am also a beginner, so I was not able to use this information effectively.
Down to business, this is my query for returning the MAX likes for a user per category.
SELECT category, username, MAX(post_likes) FROM (
SELECT c.name category, u.username username, SUM(p.like_count) post_likes, COUNT(*) post_num
FROM categories c
JOIN topics t ON c.id = t.category_id
JOIN posts p ON t.id = p.topic_id
JOIN users u ON u.id = p.user_id
GROUP BY c.name, u.username) AS leaders
WHERE post_likes > 0
GROUP BY category, username
HAVING MAX(post_likes) >= (SELECT SUM(p.like_count)
FROM categories c
JOIN topics t ON c.id = t.category_id
JOIN posts p ON t.id = p.topic_id
JOIN users u ON u.id = p.user_id WHERE c.name = leaders.category
GROUP BY u.username order by sum desc limit 1)
ORDER BY MAX(post_likes) DESC;
Any and all help would be greatly appreciated. I am having a difficult time wrapping my head around this problem. Thanks!

If you want the most likes per category, use window functions:
SELECT cu.*
FROM (SELECT c.name as category, u.username as username,
SUM(p.like_count) as post_likes, COUNT(*) as post_num,
ROW_NUMBER() OVER (PARTITION BY c.name ORDER BY SUM(p.like_count) DESC) as seqnum
FROM categories c JOIN
topics t
ON c.id = t.category_id JOIN
posts p
ON t.id = p.topic_id JOIN
users u
ON u.id = p.user_id
GROUP BY c.name, u.username
) cu
WHERE seqnum <= 3;
This returns at most three rows per category, even if there are ties. If you want tied users to be included, consider DENSE_RANK() or RANK() instead of ROW_NUMBER().
Also, use as for column aliases in the SELECT list. Although optional, one day you will leave out a comma and be grateful that you are in the habit of using as.
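A small runnable sketch of the top-N-per-group pattern, using Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer). To keep it short, the four joined tables are flattened into one hypothetical posts table with invented rows, and the aggregation is done in an inner subquery before the ranking.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical flattened table standing in for categories/topics/posts/users.
    CREATE TABLE posts (category TEXT, username TEXT, like_count INTEGER);
    INSERT INTO posts VALUES
        ('sql', 'ann', 5), ('sql', 'ann', 4), ('sql', 'bob', 7),
        ('sql', 'cat', 1), ('sql', 'dan', 3),
        ('css', 'ann', 2), ('css', 'eve', 6);
""")

rows = conn.execute("""
    SELECT category, username, post_likes
    FROM (SELECT category, username, post_likes,
                 -- Number the users inside each category, most likes first.
                 ROW_NUMBER() OVER (PARTITION BY category
                                    ORDER BY post_likes DESC) AS seqnum
          FROM (SELECT category, username, SUM(like_count) AS post_likes
                FROM posts
                GROUP BY category, username) AS totals) AS ranked
    WHERE seqnum <= 3
    ORDER BY category, seqnum
""").fetchall()

for row in rows:
    print(row)
# ('css', 'eve', 6)
# ('css', 'ann', 2)
# ('sql', 'ann', 9)
# ('sql', 'bob', 7)
# ('sql', 'dan', 3)
```

The 'css' category has only two users, so it contributes two rows; 'cat' is dropped from 'sql' as the fourth-ranked user.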

SQL Query NOT IN, EXIST

Schemas
Movie(title, year, director, budget, earnings)
Actor(stagename, realname, birthyear)
ActedIn(stagename, title, year, pay)
CanWorkWith(stagename, director)
I need to find all the actors (stagename and realname) that have never worked in a movie that has made a profit (earnings > budget). So, finding all the bad actors :P
SELECT A.stagename, A.realname
FROM Actor A
WHERE A.stagename NOT IN
(SELECT B.stagename
FROM ActedIN B
WHERE EXIST
(SELECT *
FROM Movie M
WHERE M.earnings > M.budget AND M.title = B.title AND M.year))
Would this find all the actors whose stagename does not appear in the second query? Second query will find all stagenames that acted in movies that made a profit.
Is this correct?

I think you could simplify it a bit, see below:
SELECT DISTINCT A.stagename, A.realname
FROM Actor A
WHERE NOT EXISTS
(SELECT *
FROM Actor B
, Movie M
, ActedIn X
WHERE M.Title = X.Title
AND X.StageName = B.StageName
AND M.earnings > M.budget
AND M.year = X.Year
AND A.StageName = B.StageName)

SELECT
a.stagename,
a.realname
FROM
Actor a
LEFT JOIN
ActedIn b ON a.stagename = b.stagename
LEFT JOIN
Movie c ON b.title = c.title
AND b.year = c.year
AND c.earnings > c.budget
GROUP BY
a.stagename,
a.realname
HAVING
COUNT(c.title) = 0
-No subqueries
-Accounts for actors who never acted in a movie yet
-HAVING COUNT(c.title) = 0 keeps only actors with no profitable movie at all
-Access to aggregate functions if needed.

That will work, but just do a join between ActedIn and Movie rather than using EXISTS.
Possibly also an outer join may be faster rather than the NOT IN clause, but you would need to run explain plans to be sure.

That would do it. You could also write it like:
SELECT A.stagename, A.realname, SUM(B.pay) AS totalpay
FROM Actor A
INNER JOIN ActedIn B ON B.stagename = A.stagename
LEFT JOIN Movie M ON M.title = B.title AND M.year = B.year AND M.earnings > M.budget
GROUP BY A.stagename, A.realname
HAVING COUNT(M.title) = 0
ORDER BY totalpay DESC
It basically takes the movies that made a profit and uses that as a left join condition; an actor is kept only when none of their roles matched a profitable movie. Because of the INNER JOIN, actors who have never acted at all are left out.
I've also added the total pay of said bad actors and ranked them from best to worst paid ;-)

Yes, you have the right idea for using NOT IN, but the keyword is EXISTS (not EXIST) and you're missing half a boolean condition in the second subquery's WHERE clause. I think you intend to use AND M.year = B.year
WHERE M.earnings > M.budget AND M.title = B.title AND M.year = B.year
You can also do this with a LEFT JOIN, looking for NULL in the right side of the join. This may be faster than the subquery.
SELECT
A.stagename,
A.realname
FROM Actor A
LEFT OUTER JOIN (ActedIn B
INNER JOIN Movie M ON B.title = M.title AND B.year = M.year AND M.earnings > M.budget)
ON A.stagename = B.stagename
WHERE
/* NULL B.stagename indicates the actor was in no profitable movie */
B.stagename IS NULL
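One thing worth checking with any of the join-based variants is which rows the IS NULL test actually sees. Below is a sketch with Python's built-in sqlite3 module, using the question's schema with invented rows. It compares NOT EXISTS, a grouped form with HAVING COUNT(M.title) = 0, and a per-row IS NULL test after two chained LEFT JOINs. On this data the per-row test also returns an actor who has one hit and one flop, because the flop row has no profitable match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movie (title TEXT, year INTEGER, director TEXT, budget INTEGER, earnings INTEGER);
    CREATE TABLE Actor (stagename TEXT, realname TEXT, birthyear INTEGER);
    CREATE TABLE ActedIn (stagename TEXT, title TEXT, year INTEGER, pay INTEGER);
    INSERT INTO Movie VALUES ('Hit', 2000, 'D1', 10, 50), ('Flop', 2001, 'D2', 20, 5);
    INSERT INTO Actor VALUES ('ace', 'A', 1970), ('bud', 'B', 1980), ('cy', 'C', 1990);
    -- ace: one hit and one flop; bud: flop only; cy: never acted.
    INSERT INTO ActedIn VALUES ('ace', 'Hit', 2000, 5), ('ace', 'Flop', 2001, 5),
                               ('bud', 'Flop', 2001, 1);
""")

def names(sql):
    """Run a query and return the first column as a list."""
    return [row[0] for row in conn.execute(sql)]

not_exists = names("""
    SELECT A.stagename FROM Actor A
    WHERE NOT EXISTS (SELECT 1
                      FROM ActedIn X
                      JOIN Movie M ON M.title = X.title AND M.year = X.year
                      WHERE X.stagename = A.stagename AND M.earnings > M.budget)
    ORDER BY A.stagename
""")

grouped = names("""
    SELECT A.stagename FROM Actor A
    LEFT JOIN ActedIn X ON X.stagename = A.stagename
    LEFT JOIN Movie M ON M.title = X.title AND M.year = X.year
                     AND M.earnings > M.budget
    GROUP BY A.stagename
    HAVING COUNT(M.title) = 0      -- no profitable movie matched any role
    ORDER BY A.stagename
""")

per_row_null = names("""
    SELECT DISTINCT A.stagename FROM Actor A
    LEFT JOIN ActedIn X ON X.stagename = A.stagename
    LEFT JOIN Movie M ON M.title = X.title AND M.year = X.year
                     AND M.earnings > M.budget
    WHERE M.title IS NULL          -- true for ANY role without a profitable match
    ORDER BY A.stagename
""")

print(not_exists)    # ['bud', 'cy']
print(grouped)       # ['bud', 'cy']
print(per_row_null)  # ['ace', 'bud', 'cy']
```

NOT EXISTS also sidesteps a well-known trap of NOT IN: if the subquery can return a NULL stagename, NOT IN matches no rows at all.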

Need hints on seemingly simple SQL query

I'm trying to do something like:
SELECT c.id, c.name, COUNT(orders.id)
FROM customers c
JOIN orders o ON o.customerId = c.id
However, SQL will not allow the COUNT function. The error given at execution is that c.Id is not valid in the select list because it isn't in the group by clause or isn't aggregated.
I think I know the problem, COUNT just counts all the rows in the orders table. How can I make a count for each customer?
EDIT
Full query, but it's in dutch... This is what I tried:
select k.ID,
Naam,
Voornaam,
Adres,
Postcode,
Gemeente,
Land,
Emailadres,
Telefoonnummer,
count(*) over (partition by k.id) as 'Aantal bestellingen',
Kredietbedrag,
Gebruikersnaam,
k.LeverAdres,
k.LeverPostnummer,
k.LeverGemeente,
k.LeverLand
from klanten k
join bestellingen on bestellingen.klantId = k.id
No errors but no results either..

When using an aggregate function like that, you need to group by any columns that aren't aggregates:
SELECT c.id, c.name, COUNT(o.id)
FROM customers c
JOIN orders o ON o.customerId = c.id
GROUP BY c.id, c.name

If you really want to be able to select all of the columns in Customers without specifying the names (please read this blog post in full for reasons to avoid this, and easy workarounds), then you can do this lazy shorthand instead:
;WITH o AS
(
SELECT CustomerID, OrderCount = COUNT(*)
FROM dbo.Orders GROUP BY CustomerID
)
SELECT c.*, o.OrderCount
FROM dbo.Customers AS c
INNER JOIN o
ON c.id = o.CustomerID;
EDIT for your real query
SELECT
k.ID,
k.Naam,
k.Voornaam,
k.Adres,
k.Postcode,
k.Gemeente,
k.Land,
k.Emailadres,
k.Telefoonnummer,
[Aantal bestellingen] = o.klantCount,
k.Kredietbedrag,
k.Gebruikersnaam,
k.LeverAdres,
k.LeverPostnummer,
k.LeverGemeente,
k.LeverLand
FROM klanten AS k
INNER JOIN
(
SELECT klantId, klantCount = COUNT(*)
FROM dbo.bestellingen
GROUP BY klantId
) AS o
ON k.id = o.klantId;
I think this solution is much cleaner than grouping by all of the columns. Grouping on the orders table first and then joining once to each customer row is likely to be much more efficient than joining first and then grouping.
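As a quick check that the two shapes agree, here is a sketch with Python's built-in sqlite3 module on invented data. The customers/orders names follow the question; SQLite does not accept the SQL Server alias = expression form, so the aliases use AS here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customerId INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    -- Ann has two orders, Bob has one, Cy has none.
    INSERT INTO orders VALUES (1, 1), (2, 1), (3, 2);
""")

# Join first, then group by every non-aggregated column.
join_then_group = conn.execute("""
    SELECT c.id, c.name, COUNT(o.id) AS order_count
    FROM customers c
    JOIN orders o ON o.customerId = c.id
    GROUP BY c.id, c.name
    ORDER BY c.id
""").fetchall()

# Group the orders first, then join one summary row to each customer.
group_then_join = conn.execute("""
    SELECT c.id, c.name, o.order_count
    FROM customers c
    JOIN (SELECT customerId, COUNT(*) AS order_count
          FROM orders
          GROUP BY customerId) AS o ON o.customerId = c.id
    ORDER BY c.id
""").fetchall()

print(join_then_group)  # [(1, 'Ann', 2), (2, 'Bob', 1)]
print(group_then_join)  # [(1, 'Ann', 2), (2, 'Bob', 1)]
```

Both use an inner join, so a customer without orders (Cy) is absent from either result.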

The following will count the orders per customer without the need to group the overall query by customer.id. But this also means that for customers with more than one order, that count will be repeated for each order.
SELECT c.id, c.name, COUNT(o.id) over (partition by c.id)
FROM customers c
JOIN orders o ON o.customerId = c.id

SQL Join and Count can't GROUP BY correctly?

So let's say I want to select the ID of all my blog posts and then a count of the comments associated with that blog post, how do I use GROUP BY or ORDER BY so that the returned list is in order of number of comments per post?
I have this query, which returns the data but not in the order I want. Changing the GROUP BY makes no difference:
SELECT p.ID, count(c.comment_ID)
FROM wp_posts p, wp_comments c
WHERE p.ID = c.comment_post_ID
GROUP BY c.comment_post_ID;

I'm not familiar with pre-SQL92 syntax, so I'll express it in a way that I'm familiar with:
SELECT c.comment_post_ID, COUNT(c.comment_ID)
FROM wp_comments c
GROUP BY c.comment_post_ID
ORDER BY COUNT(c.comment_ID) -- ASC or DESC
What database engine are you using? In SQL Server, at least, there's no need for a join unless you're pulling more data from the posts table. With a join:
SELECT p.ID, COUNT(c.comment_ID)
FROM wp_posts p
JOIN wp_comments c ON c.comment_post_ID = p.ID
GROUP BY p.ID
ORDER BY COUNT(c.comment_ID)

SELECT p.ID, count(c.comment_ID) AS comment_count
FROM wp_posts p, wp_comments c
WHERE p.ID = c.comment_post_ID
GROUP BY p.ID
ORDER BY comment_count DESC;

Some posts probably have no related rows in the comments table, so group by the post ID and use a LEFT JOIN; those posts then still appear, with a count of 0. Please also learn the JOIN syntax; it is very helpful and makes queries like this easier to get right.
SELECT p.ID, count(c.comment_ID)
FROM wp_posts p
LEFT JOIN wp_comments c ON (p.ID = c.comment_post_ID)
GROUP BY p.ID
ORDER BY count(c.comment_ID) DESC
I also encountered that kind of situation in my SQL query journeys :)
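Here is a small sketch of the LEFT JOIN version with the ordering the question asks for, run through Python's built-in sqlite3 module. The two tables are reduced to the columns used above and the rows are invented; the comment_count alias is an illustrative name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE wp_posts (ID INTEGER PRIMARY KEY);
    CREATE TABLE wp_comments (comment_ID INTEGER PRIMARY KEY, comment_post_ID INTEGER);
    INSERT INTO wp_posts VALUES (1), (2), (3);
    -- Post 1 has three comments, post 2 has one, post 3 has none.
    INSERT INTO wp_comments VALUES (1, 1), (2, 1), (3, 1), (4, 2);
""")

rows = conn.execute("""
    SELECT p.ID, COUNT(c.comment_ID) AS comment_count
    FROM wp_posts p
    LEFT JOIN wp_comments c ON c.comment_post_ID = p.ID
    GROUP BY p.ID
    ORDER BY comment_count DESC, p.ID   -- GROUP BY forms the groups; ORDER BY sorts them
""").fetchall()

print(rows)  # [(1, 3), (2, 1), (3, 0)]
```

COUNT(c.comment_ID) skips the NULL produced by the LEFT JOIN, which is why post 3 shows 0 rather than 1; COUNT(*) would report 1 for it.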
