Count rows after joining three tables in PostgreSQL - sql

Suppose I have three tables in PostgreSQL:
table1 - id1, a_id, updated_by_id
table2 - id2, a_id, updated_by_id
Users - id, display_name
Suppose I am using the following query:
select count(t1.id1) from table1 t1
left join table2 t2 on (t1.a_id=t2.a_id)
full outer join users u1 on (t1.updated_by_id=u1.id)
full outer join users u2 on (t2.updated_by_id=u2.id)
where u1.id=100;
I get 50 as count.
Whereas with:
select count(t1.id1) from table1 t1
left join table2 t2 on (t1.a_id=t2.a_id)
full outer join users u1 on (t1.updated_by_id=u1.id)
full outer join users u2 on (t2.updated_by_id=u2.id)
where u2.id=100;
I get only 25 as count.
What is my mistake in the second query? What can I do to get the same count?
My requirement is that there is a single user table, referenced by multiple tables. I want to take the complete list of users and get the count of ids from different tables.
However, only the table I join on directly returns the proper count; the rest of them don't. Can anybody suggest a way to modify my second query to get the proper count?

To simplify your logic, aggregate first, join later.
Guessing missing details, this query would give you the exact count, how many times each user was referenced in table1 and table2 respectively for all users:
SELECT *
FROM users u
LEFT JOIN (
SELECT updated_by_id AS id, count(*) AS t1_ct
FROM table1
GROUP BY 1
) t1 USING (id)
LEFT JOIN (
SELECT updated_by_id AS id, count(*) AS t2_ct
FROM table2
GROUP BY 1
) t2 USING (id);
In particular, avoid multiple 1-n relationships multiplying each other when joined together:
Two SQL LEFT JOINS produce incorrect result
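To make the multiplication effect concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The sample rows and user names are made up for illustration, and the subqueries are joined with ON rather than USING purely to keep the SQLite version simple:

```python
import sqlite3

# Hypothetical sample data loosely following the question's schema.
# SQLite stands in for PostgreSQL here.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, display_name TEXT);
    CREATE TABLE table1 (id1 INTEGER PRIMARY KEY, updated_by_id INTEGER);
    CREATE TABLE table2 (id2 INTEGER PRIMARY KEY, updated_by_id INTEGER);
    INSERT INTO users  VALUES (100, 'alice'), (200, 'bob');
    INSERT INTO table1 (updated_by_id) VALUES (100), (100), (100);
    INSERT INTO table2 (updated_by_id) VALUES (100), (100);
""")

# Joining both 1-n relationships at once: user 100 has 3 x 2 = 6 joined rows,
# so both counts are inflated to 6.
naive = con.execute("""
    SELECT u.id, count(t1.id1), count(t2.id2)
    FROM users u
    LEFT JOIN table1 t1 ON t1.updated_by_id = u.id
    LEFT JOIN table2 t2 ON t2.updated_by_id = u.id
    GROUP BY u.id
    ORDER BY u.id
""").fetchall()

# Aggregating first yields at most one row per user on each side,
# so the joins cannot multiply each other.
aggregated = con.execute("""
    SELECT u.id, t1.t1_ct, t2.t2_ct
    FROM users u
    LEFT JOIN (SELECT updated_by_id AS id, count(*) AS t1_ct
               FROM table1 GROUP BY updated_by_id) t1 ON t1.id = u.id
    LEFT JOIN (SELECT updated_by_id AS id, count(*) AS t2_ct
               FROM table2 GROUP BY updated_by_id) t2 ON t2.id = u.id
    ORDER BY u.id
""").fetchall()

print(naive)       # inflated counts for user 100
print(aggregated)  # independent counts; NULL (None) for users without rows
```

Users without any referencing rows come back with NULL counts from the aggregated version; wrap the columns in COALESCE(..., 0) if zeros are preferred.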
To retrieve a single or few users only, LATERAL joins will be faster (Postgres 9.3+):
SELECT *
FROM users u
LEFT JOIN LATERAL (
SELECT count(*) AS t1_ct
FROM table1
WHERE updated_by_id = u.id
) t1 ON true
LEFT JOIN LATERAL (
SELECT count(*) AS t2_ct
FROM table2
WHERE updated_by_id = u.id
) t2 ON true
WHERE u.id = 100;
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Explain perceived difference
The particular mismatch you report is due to the specifics of a FULL OUTER JOIN:
First, an inner join is performed. Then, for each row in T1 that does
not satisfy the join condition with any row in T2, a joined row is
added with null values in columns of T2. Also, for each row of T2 that
does not satisfy the join condition with any row in T1, a joined row
with null values in the columns of T1 is added.
So you get NULL values appended on the respective other side for missing matches. count() does not count NULL values. So you can get a different result depending on whether you filter on u1.id=100 or u2.id=100.
This is just to explain, you don't need a FULL JOIN here. Use the presented alternatives instead.
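The NULL-handling of count() can be seen in isolation with a small sketch. It uses Python's sqlite3 module and a LEFT JOIN instead of PostgreSQL and a FULL JOIN (an intentional simplification, since the null-extended rows behave the same way for counting); the data is hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY);
    CREATE TABLE table1 (id1 INTEGER PRIMARY KEY, updated_by_id INTEGER);
    INSERT INTO users VALUES (100), (200);
    INSERT INTO table1 (updated_by_id) VALUES (100), (100);
""")

# User 200 has no match in table1, so the outer join adds one row
# whose table1 columns are all NULL.
rows = con.execute("""
    SELECT u.id,
           count(*)      AS all_rows,      -- counts the null-extended row too
           count(t1.id1) AS non_null_ids   -- skips rows where id1 is NULL
    FROM users u
    LEFT JOIN table1 t1 ON t1.updated_by_id = u.id
    GROUP BY u.id
    ORDER BY u.id
""").fetchall()

print(rows)
```

For user 200 the two counts differ, which is the same mechanism that makes count(t1.id1) sensitive to which side of an outer join the filter is applied to.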

Related

Inner join returns more records than original tables

I'm trying to count the data records from Hive table t1 that have profile_emails that appear in Hive table t2. Multiple records can have the same profile_email in t1, but t2.profile_email is unique. I would expect a result count of < 11,681,830 (since some t1.profile_emails are not in t2). Instead it massively blows up. How is this possible with an inner join?? (and how do I fix it?)
select count(*) from t1;
#11,681,830
select count(*) from t2;
#1,661,773
SELECT count (*) FROM t1
inner JOIN t2 ON t1.profile_email = t2.profile_email
#1,519,465,221

Left Join with Distinct Clause

Below is my insert query.
INSERT INTO /*+ APPEND*/ TEMP_CUSTPARAM(CUSTNO, RATING)
SELECT DISTINCT Q.CUSTNO, NVL(((NVL(P.RATING,0) * '10.0')/100),0) AS RATING
FROM TB_ACCOUNTS Q LEFT JOIN TB_CUSTPARAM P
ON P.TEXT_PARAM IN (SELECT DISTINCT PRDCD FROM TB_ACCOUNTS)
AND P.TABLENAME='TB_ACCOUNTS' AND P.COLUMNNAME='PRDCD';
In the previous version of the query, the join condition was P.TEXT_PARAM=Q.PRDCD, but the insert into TEMP_CUSTPARAM failed due to a violation of the unique constraint on CUSTNO.
The insert query is taking ages to complete. Would like to know how to use distinct with LEFT JOIN statement.
Thanks.
SELECT T1.Col1, T2.Col2 FROM Table1 T1
Left JOIN
(SELECT DISTINCT Id, Col2 FROM Table2
) T2 ON T2.Id = T1.Id
You are missing criteria to join TB_ACCOUNTS records with their related TB_ACCOUNTS/PRDCD TB_CUSTPARAM records and thus cross join them instead. I guess you want:
INSERT /*+ APPEND */ INTO TEMP_CUSTPARAM(CUSTNO, RATING)
SELECT DISTINCT
Q.CUSTNO,
NVL(P.RATING, 0) * 0.1 AS RATING
FROM TB_ACCOUNTS Q
LEFT JOIN TB_CUSTPARAM P ON P.TEXT_PARAM = Q.PRDCD
AND P.TABLENAME = 'TB_ACCOUNTS'
AND P.COLUMNNAME = 'PRDCD';
If the query is taking ages to complete, check the execution plan first. If you see a Cartesian join on two non-trivial tables, the query should probably be revisited.
Then ask yourself what you expect from the query.
Do you expect one record per CUSTNO? Or can a customer have more than one rating?
One rating per customer would make sense from a business point of view. To get a unique customer list with ratings:
1) First get a unique CUSTNO. Note that this is generally not done with a DISTINCT clause; if there are several rows per customer, use a filter predicate instead, e.g. one selecting the most recent row.
2) Then join to the rating table.

Join SQL Tables with Unique Data (Not same number of columns!)

How can I join three or four SQL tables that DO NOT have an equal amount of rows while ensuring that there are no duplicates of a primary/foreign key?
Structure:
Table1: id, first_name, last_name, email
Table2: id (independent of id in table 1), name, location, table1_id, table2_id
Table3: id, name, location
I want all of the data from table 1, then all of the data from table 2 corresponding with the table1_id without duplicates.
Kind of tricky for a new guy...
Not sure what you want to do with Table3.
A LEFT JOIN returns all records from the LEFT table, and the matched records from the right table. If there is no match (from the right side), then the result is NULL.
For example:
SELECT * FROM Table1 AS t
LEFT JOIN Table2 AS tt
ON t.id = tt.table1_id
The LEFT table refers to the table statement before the LEFT JOIN, and the RIGHT table refers to the table statement after the LEFT JOIN. If you want to add in Table3 as well, use the same logic:
SELECT * FROM Table1 AS t
LEFT JOIN Table2 AS tt
ON t.id = tt.table1_id
LEFT JOIN Table3 AS ttt
ON t.id = ttt.id
Note that I use aliases for the tables (via AS), so that I can refer to a specific table more easily. For example, t refers to Table1, tt refers to Table2, and ttt refers to Table3.
Joins are often used in SQL, therefore it is useful to look into: INNER JOIN, RIGHT JOIN, FULL JOIN, and SELF JOIN, as well.
Hope this helps.
Good luck with learning!
You will want to use a LEFT JOIN:
SELECT * FROM table1 LEFT JOIN table2 ON Table1.ID = Table2.table1_id

Is it possible to restrict the results of an outer join?

I've got a scenario where I need to do a join across three tables.
table #1 is a list of users
table #2 contains users who have trait A
table #3 contains users who have trait B
If I want to find all the users who have trait A or trait B (in one simple sql) I think I'm stuck.
If I do a regular join, the people who don't have trait A won't show up in the result set to see if they have trait B (and vice versa).
But if I do an outer join from table 1 to tables 2 and 3, I get all the rows in table 1 regardless of the rest of my where clause specifying a requirement against tables 2 or 3.
Before you come up with multiple sqls and temp tables and whatnot, this program is far more complex, this is just the simple case. It dynamically creates the sql based on lots of external factors, so I'm trying to make it work in one sql.
I expect there are combinations of in or exists that will work, but I was hoping for some thing simple.
But basically the outer join will always yield all results from table 1, yes?
SELECT *
FROM table1
LEFT OUTER
JOIN table2
ON ...
LEFT OUTER
JOIN table3
ON ...
WHERE NOT (table2.pk IS NULL AND table3.pk IS NULL)
or if you want to be sneaky:
WHERE COALESCE(table2.pk, table3.pk) IS NOT NULL
but for your case, I simply suggest:
SELECT *
FROM table1
WHERE table1.pk IN (SELECT fk FROM table2)
OR table1.pk IN (SELECT fk FROM table3)
or the possibly more efficient:
SELECT *
FROM table1
WHERE table1.pk IN (SELECT fk FROM table2 UNION SELECT fk FROM table3)
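As a quick check that the outer-join filter and the IN/UNION form select the same users, here is a small sketch using Python's sqlite3 module. The join conditions (fk referencing table1.pk) and the sample rows are assumptions, since the original leaves the ON clauses elided:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table1 (pk INTEGER PRIMARY KEY);              -- all users
    CREATE TABLE table2 (pk INTEGER PRIMARY KEY, fk INTEGER);  -- trait A
    CREATE TABLE table3 (pk INTEGER PRIMARY KEY, fk INTEGER);  -- trait B
    INSERT INTO table1 VALUES (1), (2), (3), (4);
    INSERT INTO table2 (fk) VALUES (1), (2);
    INSERT INTO table3 (fk) VALUES (2), (3);
""")

# Outer joins keep every user; the WHERE clause then drops users
# that matched neither trait table.
outer_join = con.execute("""
    SELECT table1.pk
    FROM table1
    LEFT OUTER JOIN table2 ON table2.fk = table1.pk
    LEFT OUTER JOIN table3 ON table3.fk = table1.pk
    WHERE NOT (table2.pk IS NULL AND table3.pk IS NULL)
    ORDER BY table1.pk
""").fetchall()

# Same filter expressed as a membership test against the union of both traits.
in_union = con.execute("""
    SELECT pk
    FROM table1
    WHERE pk IN (SELECT fk FROM table2 UNION SELECT fk FROM table3)
    ORDER BY pk
""").fetchall()

print(outer_join, in_union)
```

With at most one trait row per user, both forms agree. If a trait table can hold several rows per user, the outer-join form would repeat that user, while the IN form would not.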
If you really just want the list of users that have one trait or the other, then:
SELECT userid FROM users
WHERE userid IN (SELECT userid FROM trait_a UNION SELECT userid FROM trait_b)
Regarding outerjoin specifically, longneck's answer looks like what I was in the midst of writing.
I think you could do a UNION here.
May I suggest:
SELECT columnList FROM Table1 WHERE UserID IN (SELECT UserID FROM Table2)
UNION
SELECT columnList FROM Table1 WHERE UserID IN (SELECT UserID FROM Table3)
Would something like this work? Keep in mind depending on the size of the tables left outer joins can be very expensive with regards to performance.
Select *
from table1
where userid in (Select t.userid
From table1 t
left outer join table2 t2 on t.userid=t2.userid and t2.AttributeA is not null
left outer join table3 t3 on t.userid=t3.userid and t3.AttributeB is not null
group by t.userid
having count(t2.userid) > 0 or count(t3.userid) > 0)
If all you want is the ids of the users then
SELECT UserId From Table2
UNION
SELECT UserId From Table3
is totally sufficient.
If you want more information from Table1 on these users, you can join the query above to Table1:
SELECT <list of columns from Table1>
FROM Table1 JOIN (
SELECT UserId FROM Table2
UNION
SELECT UserId FROM Table3) u ON Table1.UserID = u.UserID

How can a LEFT OUTER JOIN return more records than exist in the left table?

I have a very basic LEFT OUTER JOIN to return all results from the left table and some additional information from a much bigger table. The left table contains 4935 records yet when I LEFT OUTER JOIN it to an additional table the record count is significantly larger.
As far as I'm aware it is absolute gospel that a LEFT OUTER JOIN will return all records from the left table with matched records from the right table and null values for any rows which cannot be matched, as such it's my understanding that it should be impossible to return more rows than exist in the left table, but it's happening all the same!
SQL Query follows:
SELECT SUSP.Susp_Visits.SuspReason, SUSP.Susp_Visits.SiteID
FROM SUSP.Susp_Visits LEFT OUTER JOIN
DATA.Dim_Member ON SUSP.Susp_Visits.MemID = DATA.Dim_Member.MembershipNum
Perhaps I have made a mistake in the syntax or my understanding of LEFT OUTER JOIN is incomplete, hopefully someone can explain how this could be occurring?
The LEFT OUTER JOIN will return all records from the LEFT table joined with the RIGHT table where possible.
If there are matches though, it will still return all rows that match, therefore, one row in LEFT that matches two rows in RIGHT will return as two ROWS, just like an INNER JOIN.
EDIT:
In response to your edit, I've just had a further look at your query and it looks like you are only returning data from the LEFT table. Therefore, if you only want data from the LEFT table, and you only want one row returned for each row in the LEFT table, then you have no need to perform a JOIN at all and can just do a SELECT directly from the LEFT table.
Table1 Table2
_______ _________
1 2
2 2
3 5
4 6
SELECT Table1.Id,
Table2.Id
FROM Table1
LEFT OUTER JOIN Table2 ON Table1.Id=Table2.Id
Results:
1,null
2,2
2,2
3,null
4,null
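The example above can be reproduced with a short script. This is a sketch using Python's sqlite3 module in place of the original database; only the table contents shown above are used:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Table1 (Id INTEGER);
    CREATE TABLE Table2 (Id INTEGER);
    INSERT INTO Table1 VALUES (1), (2), (3), (4);
    INSERT INTO Table2 VALUES (2), (2), (5), (6);
""")

# Id 2 matches two rows in Table2, so it appears twice in the result;
# unmatched left rows appear once with NULL (None) on the right.
rows = con.execute("""
    SELECT Table1.Id, Table2.Id
    FROM Table1
    LEFT OUTER JOIN Table2 ON Table1.Id = Table2.Id
    ORDER BY Table1.Id
""").fetchall()

print(len(rows), rows)
```

The left table has four rows, yet the join returns five, because one left row has two matches on the right.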
It isn't impossible. The number of records in the left table is the minimum number of records it will return. If the right table has two records that match to one record in the left table, it will return two records.
In response to your postscript, that depends on what you would like.
You are getting (possibly) multiple rows for each row in your left table because there are multiple matches for the join condition. If you want your total results to have the same number of rows as there are in the left part of the query, you need to make sure your join conditions cause a 1-to-1 match.
Alternatively, depending on what you actually want, you can use aggregate functions (for example, if you just want a string from the right part, you could generate a column that is a comma-delimited string of the right-side results for that left row).
If you are only looking at 1 or 2 columns from the outer join you might consider using a scalar subquery since you will be guaranteed 1 result.
Each record from the left table will be returned as many times as there are matching records on the right table -- at least 1, but could easily be more than 1.
Could it be a one to many relationship between the left and right tables?
LEFT OUTER JOIN, just like INNER JOIN (normal join), will return as many results for each row in the left table as it finds matches in the right table. Hence you can have a lot of results: up to N x M, where N is the number of rows in the left table and M is the number of rows in the right table.
With a LEFT OUTER JOIN, the number of results is always guaranteed to be at least N.
If you need just any one row from the right side
SELECT SuspReason, SiteID FROM(
SELECT SUSP.Susp_Visits.SuspReason, SUSP.Susp_Visits.SiteID, ROW_NUMBER()
OVER(PARTITION BY SUSP.Susp_Visits.SiteID) AS rn
FROM SUSP.Susp_Visits
LEFT OUTER JOIN DATA.Dim_Member ON SUSP.Susp_Visits.MemID = DATA.Dim_Member.MembershipNum
) AS t
WHERE rn=1
or just
SELECT SUSP.Susp_Visits.SuspReason, SUSP.Susp_Visits.SiteID
FROM SUSP.Susp_Visits WHERE EXISTS(
SELECT 1 FROM DATA.Dim_Member WHERE SUSP.Susp_Visits.MemID = DATA.Dim_Member.MembershipNum
)
Pay attention if you have a WHERE clause on the "right side" table of a query containing a left outer join.
If no record on the right side satisfies the WHERE clause, the corresponding record of the "left side" table will not appear in the result of your query.
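The effect is easy to see by placing the same predicate once in the WHERE clause and once in the ON clause. The following is a minimal sketch with Python's sqlite3 module; the tables left_t and right_t and the status column are hypothetical names chosen for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE left_t  (id INTEGER PRIMARY KEY);
    CREATE TABLE right_t (left_id INTEGER, status TEXT);
    INSERT INTO left_t  VALUES (1), (2), (3);
    INSERT INTO right_t VALUES (1, 'active'), (2, 'closed');
""")

# Predicate in WHERE: evaluated after the join, so null-extended rows
# (and rows with a non-matching status) are filtered out.
filter_in_where = con.execute("""
    SELECT l.id, r.status
    FROM left_t l
    LEFT OUTER JOIN right_t r ON r.left_id = l.id
    WHERE r.status = 'active'
    ORDER BY l.id
""").fetchall()

# Predicate in ON: only restricts which right rows can match,
# so every left row is still returned.
filter_in_on = con.execute("""
    SELECT l.id, r.status
    FROM left_t l
    LEFT OUTER JOIN right_t r ON r.left_id = l.id AND r.status = 'active'
    ORDER BY l.id
""").fetchall()

print(filter_in_where)
print(filter_in_on)
```

Moving the predicate into the ON clause is the usual way to keep all left rows while still restricting the right side.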
It seems as though there are multiple rows in the DATA.Dim_Member table per SUSP.Susp_Visits row.
If multiple (x) rows in Dim_Member are associated with a single row in Susp_Visits, there will be x rows in the result set.
Since the left table contains 4935 records, I suspect you want your results to return 4935 records. Try this:
create table table1
(siteID int,
SuspReason int)
create table table2
(siteID int,
SuspReason int)
insert into table1(siteID, SuspReason) values
(1, 678),
(1, 186),
(1, 723)
insert into table2(siteID, SuspReason) values
(1, 678),
(1, 965)
select distinct t1.siteID, t1.SuspReason
from table1 t1 left join table2 t2 on t1.siteID = t2.siteID and t1.SuspReason = t2.SuspReason
union
select distinct t2.siteID, t2.SuspReason
from table1 t1 right join table2 t2 on t1.siteID = t2.siteID and t1.SuspReason = t2.SuspReason
The only way your query can return more rows than the left table (SUSP.Susp_Visits in your case) is that the condition (SUSP.Susp_Visits.MemID = DATA.Dim_Member.MembershipNum) matches multiple rows in the right table, DATA.Dim_Member. In other words, there are multiple rows in DATA.Dim_Member with identical values of DATA.Dim_Member.MembershipNum. You can verify this by executing the query below, which lists the duplicated values:
select DATA.Dim_Member.MembershipNum, count(DATA.Dim_Member.MembershipNum) from DATA.Dim_Member group by DATA.Dim_Member.MembershipNum having count(DATA.Dim_Member.MembershipNum) > 1
Simply, LEFT OUTER JOIN is the Cartesian product within each join key, along with the unmatched rows of the left table
(i.e. for each key_x that has N records in table_L and M records in table_R the result will have N*M records if M>0, or N records if M=0)
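The per-key rule can be checked against an actual join. The sketch below uses Python's sqlite3 module; the helper expected_left_join_rows and the sample keys are hypothetical and only serve to illustrate the formula:

```python
import sqlite3
from collections import Counter


def expected_left_join_rows(left_keys, right_keys):
    """Row count predicted by the rule: N*M per key if M > 0, else N."""
    right = Counter(right_keys)
    # Each left row contributes M rows when its key has M > 0 matches,
    # otherwise exactly one null-extended row.
    return sum(max(right[k], 1) for k in left_keys)


left_keys = ["a", "a", "b", "c"]   # key a: N=2, b: N=1, c: N=1
right_keys = ["a", "a", "a", "b"]  # key a: M=3, b: M=1, c: M=0

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table_L (k TEXT)")
con.execute("CREATE TABLE table_R (k TEXT)")
con.executemany("INSERT INTO table_L VALUES (?)", [(k,) for k in left_keys])
con.executemany("INSERT INTO table_R VALUES (?)", [(k,) for k in right_keys])

actual = con.execute("""
    SELECT count(*)
    FROM table_L l
    LEFT OUTER JOIN table_R r ON l.k = r.k
""").fetchone()[0]

predicted = expected_left_join_rows(left_keys, right_keys)
print(predicted, actual)  # 2*3 + 1*1 + 1 = 8 for both
```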