Combine a CROSS JOIN and a LEFT JOIN - sql

I have two tables named author and commit_metrics. Both of them have an id field. Author has author_name and author_email. Commit_metrics has author_id and author_date.
I am trying to write a query that will get the number of commits that each author had in a given week, even if that number is 0. Here's what I have so far:
SELECT a.id, a.author_name, a.author_email, c.week_num, COUNT(c.id)
FROM author AS a
CROSS JOIN generate_series(1, 610) AS s(n)
LEFT JOIN (SELECT c.id,
c.author_id,
c.author_date,
WEEK_NUMBER(c.author_date) AS week_num
FROM commit_metrics c) AS c ON s.n = c.week_num AND a.id = c.author_id
WHERE c.week_num IS NOT NULL
GROUP BY a.id, a.author_name, a.author_email, c.week_num
ORDER BY c.week_num DESC, a.author_name;
WEEK_NUMBER is a function I wrote for this query:
CREATE OR REPLACE FUNCTION WEEK_NUMBER(date TIMESTAMP) RETURNS INTEGER AS
$$
SELECT TRUNC(DATE_PART('day', date - '2008-01-01') / 7)::INTEGER;
$$ LANGUAGE SQL;
Currently, the query works like a charm with one major caveat. It doesn't properly calculate 0 when the author made no commits in a given week. I'm not sure why it doesn't. When I do the query with just the FROM and CROSS JOIN, it properly prints the many thousand combined authors/weeks. However, when I add the LEFT JOIN, it loses any week where the author did not make a commit.
Any help would be greatly appreciated. I'm open to doing away with the generate_series call if it's unnecessary.
Also, I found this post, but I don't think it's helpful for my case.

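Although you are using a LEFT JOIN, the condition "WHERE c.week_num IS NOT NULL" filters out every author/week combination that has no matching commit, which effectively turns the join back into an inner join. Remove that condition, and select and group by s.n instead of c.week_num so the week number is still available when there is no commit: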
SELECT a.id, a.author_name, a.author_email, s.n AS week_num, COUNT(c.id) AS commit_count
FROM author AS a
CROSS JOIN generate_series(1, 610) AS s(n)
LEFT JOIN (SELECT c.id,
c.author_id,
c.author_date,
WEEK_NUMBER(c.author_date) AS week_num
FROM commit_metrics c) AS c ON s.n = c.week_num AND a.id = c.author_id
GROUP BY a.id, a.author_name, a.author_email, s.n
ORDER BY s.n DESC, a.author_name;

Your WHERE clause removes the rows in which the commit_metrics columns are NULL, and those are exactly the rows the LEFT JOIN produces when an author has no commits in a given week. Remove that condition from the WHERE clause to get your desired output.
If you need a WHERE clause to eliminate some of the CROSS JOIN rows based on your data, put the CROSS JOIN and that WHERE in a sub-select and LEFT JOIN commit_metrics to it, or write more careful logic in the current WHERE clause so that it does not discard the NULL rows.
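To see the effect in isolation, here is a minimal sketch that uses Python's built-in sqlite3 module rather than PostgreSQL. Everything in it is made up for illustration: the sample rows, the precomputed week_num column (instead of the WEEK_NUMBER function), and the three-row weeks table that stands in for generate_series.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, author_name TEXT);
    CREATE TABLE weeks (n INTEGER PRIMARY KEY);  -- hypothetical stand-in for generate_series
    CREATE TABLE commit_metrics (id INTEGER PRIMARY KEY, author_id INTEGER, week_num INTEGER);
    INSERT INTO author VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO weeks VALUES (1), (2), (3);
    INSERT INTO commit_metrics VALUES (10, 1, 1), (11, 1, 1), (12, 2, 3);
""")

# Same shape as the query in the question; only the WHERE clause varies.
BASE = """
    SELECT a.author_name, s.n AS week_num, COUNT(c.id) AS commit_count
    FROM author AS a
    CROSS JOIN weeks AS s
    LEFT JOIN commit_metrics AS c
           ON c.week_num = s.n AND c.author_id = a.id
    {where}
    GROUP BY a.id, a.author_name, s.n
    ORDER BY s.n, a.author_name
"""

# The filter on the right-hand table throws away the NULL-extended rows.
with_filter = conn.execute(BASE.format(where="WHERE c.week_num IS NOT NULL")).fetchall()
# Without it, every author/week pair survives and COUNT(c.id) yields 0 where needed.
without_filter = conn.execute(BASE.format(where="")).fetchall()

print(with_filter)
print(without_filter)
```

With the filter only the author/week pairs that have commits come back; without it all six author/week pairs are returned, including the zero-commit ones.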

Remove the filtering condition. Also, a subquery is not needed, and you want to select, group by and order by s.n instead of c.week_num:
SELECT a.id, a.author_name, a.author_email, s.n AS week_num, COUNT(c.id)
FROM author a CROSS JOIN
generate_series(1, 610) AS s(n) LEFT JOIN
commit_metrics c
ON s.n = WEEK_NUMBER(c.author_date) AND a.id = c.author_id
GROUP BY a.id, a.author_name, a.author_email, s.n
ORDER BY s.n DESC, a.author_name;

Related

SQL alternative to Cross apply

I have a requirement to return all the records from the left table for every match in the right table.
A sample query is below. The temp table #Dates_Test in the query holds the dates of the past week.
For each record in Employee, I have to return one row for every date in #Dates_Test that falls between MvIn_DT and MvOut_DT (7 rows if all of them match). I can get the expected output using CROSS APPLY, but I am looking for alternatives to CROSS APPLY. Thanks in advance.
Dates_test Result set:
Expected output:
Query:
SELECT c.Name
,c.ID
,COUNT(DISTINCT c.ID) AS [ID_Count]
,t2.DATE AS SvcDate
INTO #Test
FROM Employee c
CROSS APPLY (
SELECT [Date]
FROM #Dates_Test t
WHERE t.DATE BETWEEN c.MvIn_DT
AND c.MvOut_DT
) t2
WHERE c.[State] = 'NY'
GROUP BY t2.DATE
,c.ID
This would more normally be written using JOIN:
SELECT c.Name, c.ID,
COUNT(DISTINCT c.ID) AS [ID_Count],
t.[Date] AS SvcDate
INTO #Test
FROM Employee c JOIN
#Dates_Test t
ON t.[Date] BETWEEN c.MvIn_DT AND c.MvOut_DT
WHERE c.[State] = 'NY'
GROUP BY t.[Date], c.Name, c.ID ;
If you are hoping for an improvement in performance, though, the two versions will probably perform about the same.
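The range join itself can be tried out on toy data. The sketch below uses Python's built-in sqlite3 module instead of SQL Server (SQLite has no CROSS APPLY, so only the JOIN form is shown), drops the INTO and the aggregate, and uses invented sample rows with dates stored as ISO strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data; ISO date strings compare correctly as text.
    CREATE TABLE Employee (ID INTEGER, Name TEXT, State TEXT, MvIn_DT TEXT, MvOut_DT TEXT);
    CREATE TABLE Dates_Test (Date TEXT);
    INSERT INTO Employee VALUES
        (1, 'Ann', 'NY', '2024-01-01', '2024-01-31'),
        (2, 'Bo',  'NY', '2024-01-03', '2024-01-04'),
        (3, 'Cy',  'NJ', '2024-01-01', '2024-01-31');
    INSERT INTO Dates_Test VALUES ('2024-01-02'), ('2024-01-03'), ('2024-01-04');
""")

# One output row per employee per date that falls inside the employee's range.
rows = conn.execute("""
    SELECT c.Name, c.ID, t.Date AS SvcDate
    FROM Employee AS c
    JOIN Dates_Test AS t
      ON t.Date BETWEEN c.MvIn_DT AND c.MvOut_DT
    WHERE c.State = 'NY'
    ORDER BY c.ID, t.Date
""").fetchall()

for row in rows:
    print(row)
```

Ann matches all three dates, Bo matches two, and Cy is excluded by the State filter.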

JOIN only one row from second table and if no rows exist return null

In this query I need to show all records from the left table and only the records from the right table where the result is the highest date.
Current query:
SELECT a.*, c.*
FROM users a
INNER JOIN payments c
ON a.id = c.user_ID
INNER JOIN
(
SELECT user_ID, MAX(date) maxDate
FROM payments
GROUP BY user_ID
) b ON c.user_ID = b.user_ID AND
c.date = b.maxDate
WHERE a.package = 1
This returns all records where the join is valid, but I need to show all users and if they didn't make a payment yet the fields from the payments table should be null.
I could use a union to show the other rows:
SELECT a.*, c.*
FROM users a
INNER JOIN payments c
ON a.id = c.user_ID
INNER JOIN
(
SELECT user_ID, MAX(date) maxDate
FROM payments
GROUP BY user_ID
) b ON c.user_ID = b.user_ID AND
c.date = b.maxDate
WHERE a.package = 1
union
SELECT a.*, c.*
FROM users a
--here I would need to join with payments table to get the columns from the payments table,
--but where the user doesn't have a payment yet
WHERE a.package = 1
The option to use the union doesn't seem like a good solution, but that's what I tried.
So, in other words, you want a list of users and the last payment for each.
You can use OUTER APPLY instead of INNER JOIN to get the last payment for each user. The performance might be better and it will work the way you want regarding users with no payments.
SELECT a.*, b.*
FROM users a
OUTER APPLY ( SELECT * FROM payments c
WHERE c.user_id = a.id
ORDER BY c.date DESC
FETCH FIRST ROW ONLY ) b
WHERE a.package = 1;
Here is a generic version of the same concept that does not require your tables (for other readers). It gives a list of database users and the most recently modified object for each user. You can see it properly includes users that have no objects.
SELECT a.*, b.*
FROM all_users a
OUTER APPLY ( SELECT * FROM all_objects b
WHERE b.owner = a.username
ORDER BY b.last_ddl_time desc
FETCH FIRST ROW ONLY ) b
I like the answer from @Matthew McPeak, but OUTER APPLY requires Oracle 12c or higher and isn't very idiomatic Oracle, historically anyway. Here's a straight LEFT OUTER JOIN version:
SELECT *
FROM users a
LEFT OUTER JOIN
(
-- retrieve the list of payments for just those payments that are the maxdate per user
SELECT payments.*
FROM payments
JOIN (SELECT user_id, MAX(date) maxdate
FROM payments
GROUP BY user_id
) maxpayment_byuser
ON maxpayment_byuser.maxdate = payments.date
AND maxpayment_byuser.user_id = payments.user_id
) b ON a.ID = b.user_ID
If performance is an issue, you may find the following faster; the price of keeping it simple is an extra "maxdate" column in the output.
SELECT *
FROM users a
LEFT OUTER JOIN
(
-- retrieve the list of payments for just those payments that are the maxdate per user
SELECT *
FROM (
SELECT payments.*,
MAX(date) OVER (PARTITION BY user_id) maxdate
FROM payments
) max_payments
WHERE date = maxdate
) b ON a.ID = b.user_ID
A generic approach using row_number() is very useful for "highest date" or "most recent" or similar conditions:
SELECT
*
FROM users a
LEFT OUTER JOIN (
-- determine the row corresponding to "most recent"
SELECT
payments.*
, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date DESC) is_recent
FROM payments
) b ON a.ID = b.user_ID
AND b.is_recent = 1
(reversing the ORDER BY within the over clause also enables "oldest")
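The row_number() pattern can be checked on a toy dataset. This is a minimal sketch using Python's built-in sqlite3 module rather than Oracle; it assumes the bundled SQLite is 3.25 or newer (needed for window functions), and the table contents and the paid_on and amount column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data: ann has two payments, bo has none, cy is on another package.
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, package INTEGER);
    CREATE TABLE payments (id INTEGER PRIMARY KEY, user_id INTEGER, paid_on TEXT, amount INTEGER);
    INSERT INTO users VALUES (1, 'ann', 1), (2, 'bo', 1), (3, 'cy', 2);
    INSERT INTO payments VALUES
        (100, 1, '2024-01-05', 10),
        (101, 1, '2024-02-05', 20),
        (102, 3, '2024-01-09', 30);
""")

latest = conn.execute("""
    SELECT a.id, a.name, b.paid_on, b.amount
    FROM users AS a
    LEFT JOIN (
        -- number each user's payments, newest first
        SELECT p.*,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY paid_on DESC) AS is_recent
        FROM payments AS p
    ) AS b
      ON b.user_id = a.id AND b.is_recent = 1
    WHERE a.package = 1   -- filter on the left table, so unmatched users are kept
    ORDER BY a.id
""").fetchall()

for row in latest:
    print(row)
```

Because the is_recent = 1 test sits in the ON clause rather than in WHERE, the user without payments still appears, with NULL (None in Python) in the payment columns.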

Get the row with max(timestamp)

I need to select most recently commented articles, with the last comment for each article, i.e. other columns of the row which contains max(c.created):
SELECT a.id, a.title, a.text, max(c.created) AS cid, c.text?
FROM subscriptions s
JOIN articles a ON a.id=s.article_id
JOIN comments c ON c.article_id=a.id
WHERE s.user_id=%d
GROUP BY a.id, a.title, a.text
ORDER BY max(c.created) DESC LIMIT 10;
Postgres tells me that I have to put c.text into GROUP BY. Obviously, I don't want to do this. min/max doesn't fit too. I don't have idea, how to select this.
Please advise.
In PostgreSQL, DISTINCT ON is probably the optimal solution for this kind of query:
SELECT DISTINCT ON (a.id)
a.id, a.title, a.text, c.created, c.text
FROM subscriptions s
JOIN articles a ON a.id = s.article_id
JOIN comments c ON c.article_id = a.id
WHERE s.user_id = %d
ORDER BY a.id, c.created DESC
This retrieves articles with their latest comment and the associated additional columns.
Explanation, links and a benchmark in this closely related answer.
To get the latest 10, wrap this in a subquery:
SELECT *
FROM (
SELECT DISTINCT ON (a.id)
a.id, a.title, a.text, c.created, c.text
FROM subscriptions s
JOIN articles a ON a.id = s.article_id
JOIN comments c ON c.article_id = a.id
WHERE s.user_id = 12
ORDER BY a.id, c.created DESC
) x
ORDER BY created DESC
LIMIT 10;
Alternatively, you could use window functions in combination with standard DISTINCT:
SELECT DISTINCT
a.id, a.title, a.text
,first_value(c.created) OVER w AS c_created
,first_value(c.text) OVER w AS c_text
FROM subscriptions s
JOIN articles a ON a.id = s.article_id
JOIN comments c ON c.article_id = a.id
WHERE s.user_id = 12
WINDOW w AS (PARTITION BY c.article_id ORDER BY c.created DESC)
ORDER BY c_created DESC
LIMIT 10;
This works, because DISTINCT (unlike aggregate functions) is applied after window functions.
You'd have to test which is faster. I'd guess the last one is slower.

SQL Return only where more than one join

Not sure how to ask this as I'm a bit of a database noob,
What I want to do is the following.
table tb_Company
table tb_Division
I want to return companies that have more than one division and I don't know how to do the where clause.
SELECT dbo.tb_Company.CompanyID, dbo.tb_Company.CompanyName,
dbo.tb_Division.DivisionName FROM dbo.tb_Company INNER JOIN dbo.tb_Division ON
dbo.tb_Company.CompanyID = dbo.tb_Division.DivisionCompanyID
Any help or links much appreciated.
You'll need another JOIN where you only return companies having more than one division by using a GROUP BY and a HAVING clause.
You can read up on grouping here:
"Groups a selected set of rows into a set of summary rows by the values of one or more columns or expressions. One row is returned for each group. Aggregate functions in the SELECT clause list provide information about each group instead of individual rows."
SELECT dbo.tb_Company.CompanyID
, dbo.tb_Company.CompanyName
, dbo.tb_Division.DivisionName
FROM dbo.tb_Company
INNER JOIN dbo.tb_Division ON dbo.tb_Company.CompanyID = dbo.tb_Division.DivisionCompanyID
INNER JOIN (
SELECT DivisionCompanyID
FROM dbo.tb_Division
GROUP BY
DivisionCompanyID
HAVING COUNT(*) > 1
) d ON d.DivisionCompanyID = dbo.tb_Company.CompanyID
Another alternative, if you only need the companies and not the division names:
SELECT c.CompanyID, c.CompanyName, COUNT(*) AS DivisionCount
FROM dbo.tb_Company c
INNER JOIN dbo.tb_Division d ON c.CompanyID = d.DivisionCompanyID
GROUP BY c.CompanyID, c.CompanyName
HAVING COUNT(*) > 1
How about?
WITH COUNTED AS
(
SELECT C.CompanyID, C.CompanyName, D.DivisionName,
COUNT(*) OVER(PARTITION BY C.CompanyID) AS Cnt
FROM dbo.tb_Company C
INNER JOIN dbo.tb_Division D ON C.CompanyID = D.DivisionCompanyID
)
SELECT *
FROM COUNTED
WHERE Cnt > 1
With the other solutions (which join to the Division table twice), a company can be returned with only a single division under heavy insert load.
If a row is inserted into the Division table between the time the first join is evaluated and the time the second join (with the GROUP BY/HAVING) is evaluated, the first Division join will return a single row, while the second one will see a count of 2.
How about...
SELECT dbo.tb_Company.CompanyID,
dbo.tb_Company.CompanyName
FROM dbo.tb_Company
WHERE (SELECT COUNT(*)
FROM dbo.tb_Division
WHERE dbo.tb_Company.CompanyID =
dbo.tb_Division.DivisionCompanyID) > 1;
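For readers who want to experiment, the GROUP BY/HAVING join from the first answer can be run against toy data. The sketch below uses Python's built-in sqlite3 module instead of SQL Server, so the dbo. schema prefix is dropped; the sample rows and the DivisionID column are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data: Acme has two divisions, Solo has one, Empty has none.
    CREATE TABLE tb_Company (CompanyID INTEGER PRIMARY KEY, CompanyName TEXT);
    CREATE TABLE tb_Division (DivisionID INTEGER PRIMARY KEY,
                              DivisionCompanyID INTEGER,
                              DivisionName TEXT);
    INSERT INTO tb_Company VALUES (1, 'Acme'), (2, 'Solo'), (3, 'Empty');
    INSERT INTO tb_Division VALUES (1, 1, 'East'), (2, 1, 'West'), (3, 2, 'Only');
""")

multi = conn.execute("""
    SELECT c.CompanyID, c.CompanyName, d.DivisionName
    FROM tb_Company AS c
    JOIN tb_Division AS d ON d.DivisionCompanyID = c.CompanyID
    JOIN (
        -- companies that own more than one division
        SELECT DivisionCompanyID
        FROM tb_Division
        GROUP BY DivisionCompanyID
        HAVING COUNT(*) > 1
    ) AS m ON m.DivisionCompanyID = c.CompanyID
    ORDER BY c.CompanyID, d.DivisionName
""").fetchall()

for row in multi:
    print(row)
```

Only Acme's two divisions are listed; the single-division and zero-division companies are filtered out by the join to the grouped subquery.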

Simple SQL question about getting rows and associated counts

this oughta be an easy one.
My question is very similar to this one; basically, I've got a table of posts, a table of comments with a foreign key for the post_id, and a table of votes with a foreign key for the post id. I'd like to do a single query and get back a result set containing one row per post, along with the count of associated comments and votes.
From the question I've linked to above, it seems that for getting a table back containing just a row for each post and a comment count, this is the right approach:
SELECT a.ID, a.Title, COUNT(c.ID) AS NumComments
FROM Articles a
LEFT JOIN Comments c ON c.ParentID = a.ID
GROUP BY a.ID, a.Title
I thought adding vote count would be as easy as adding another left join, as in
SELECT a.ID, a.Title, COUNT(c.ID) AS NumComments, COUNT(v.id) AS NumVotes
FROM Articles a
LEFT JOIN Comments c ON c.ParentID = a.ID
LEFT JOIN Votes v ON v.ParentID = a.ID
GROUP BY a.ID, a.Title
but I'm getting bad numbers back. What am I missing?
SELECT
a.ID,
a.Title,
COUNT(DISTINCT c.ID) AS NumComments,
COUNT(DISTINCT v.id) AS NumVotes
FROM
Articles a
LEFT JOIN Comments c ON c.ParentID = a.ID
LEFT JOIN Votes v ON v.ParentID = a.ID
GROUP BY
a.ID,
a.Title
SELECT id, title,
(
SELECT COUNT(*)
FROM comments c
WHERE c.ParentID = a.ID
) AS NumComments,
(
SELECT COUNT(*)
FROM votes v
WHERE v.ParentID = a.ID
) AS NumVotes
FROM articles a
try:
COUNT(DISTINCT c.ID) AS NumComments
You are thinking in trees, not recordsets.
In the recordset you get each Comment and each Vote returned multiple times, combined with each other. Run the query without the GROUP BY and the COUNT to see what I mean.
The solution is simple: use COUNT(DISTINCT c.ID) and COUNT(DISTINCT v.ID).
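The multiplication is easy to reproduce. Here is a minimal sketch using Python's built-in sqlite3 module, with invented sample rows: one article with 2 comments and 3 votes, and one article with neither.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Articles (ID INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE Comments (ID INTEGER PRIMARY KEY, ParentID INTEGER);
    CREATE TABLE Votes (ID INTEGER PRIMARY KEY, ParentID INTEGER);
    INSERT INTO Articles VALUES (1, 'first'), (2, 'second');
    INSERT INTO Comments VALUES (1, 1), (2, 1);        -- 2 comments on article 1
    INSERT INTO Votes VALUES (1, 1), (2, 1), (3, 1);   -- 3 votes on article 1
""")

# {d} is either empty or the keyword DISTINCT.
QUERY = """
    SELECT a.ID, COUNT({d} c.ID) AS NumComments, COUNT({d} v.ID) AS NumVotes
    FROM Articles AS a
    LEFT JOIN Comments AS c ON c.ParentID = a.ID
    LEFT JOIN Votes AS v ON v.ParentID = a.ID
    GROUP BY a.ID
    ORDER BY a.ID
"""

# Two independent joins pair every comment with every vote: 2 x 3 = 6 rows for article 1.
plain = conn.execute(QUERY.format(d="")).fetchall()
# DISTINCT collapses the repeated IDs back to the real counts.
distinct = conn.execute(QUERY.format(d="DISTINCT")).fetchall()

print(plain)
print(distinct)
```

The plain COUNT reports 6 for both columns on the first article, while COUNT(DISTINCT ...) reports the true 2 and 3.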