Inner join returns more records than original tables - sql

I'm trying to count the data records from Hive table t1 that have profile_emails that appear in Hive table t2. Multiple records can have the same profile_email in t1, but t2.profile_email is unique. I would expect a result count of < 11,681,830 (since some t1.profile_emails are not in t2). Instead it massively blows up. How is this possible with an inner join?? (and how do I fix it?)
select count(*) from t1;
#11,681,830
select count(*) from t2;
#1,661,773
SELECT count (*) FROM t1
inner JOIN t2 ON t1.profile_email = t2.profile_email
#1,519,465,221

Related

Comparing base table value with second table's sum of value with group by

I have two tables:
One is base table and second is transaction table. I want to compare base table value with second table's sum of value with group by.
Table1(T1Id,Amount1,...)
Tabe2(T2Id,T1ID,Amount2)
I want those rows from table 1 WHere SUM of Table2's SUM( Amount2) is greater or equal table1's Amount1.
*T1ID is in relation with both tables
* The SQL query have many joins with other table for data retriving.
One approach uses a join:
SELECT t1.T1Id, t1.Amount1
FROM Table1 t1
INNER JOIN Table2 t2
ON t1.T1Id = t2.T1ID
GROUP BY
t1.T1Id, t1.Amount1
HAVING
SUM(t2.Amount2) >= t1.Amount1;
We can also try doing this via a correlated subquery:
SELECT t1.T1Id, t1.Amount1
FROM Table1 t1
WHERE t1.Amount1 <= (SELECT SUM(t2.Amount2) FROM Table2 t2
WHERE t1.T1Id = t2.T1ID);
I would use something similar to the query below:
SELECT
a.T1Id, a.Amount1, SUM(b.Amount2)
FROM Table1 a
INNER JOIN Table2 b on b.T1Id = a.T1Id
GROUP BY a.T1Id, a.Amount1
HAVING SUM(b.Amount2) >= a.Amount1;
Basically what the query above does is give you the ID, Amount from table 1 and the summed amount from table 2. The HAVING clause at the end of query filters out those records where the summed amount from the second table is smaller than the amount from the first one.
If you want to add further table joins to the query, you can do so by adding as many joins as you wish. I would recommend having a referenced ID for each table you are joining in the Table1 table.

Count rows after joining three tables in PostgreSQL

Suppose I have three tables in PostgreSQL:
table1 - id1, a_id, updated_by_id
table2 - id2, a_id, updated_by_id
Users - id, display_name
Suppose I am using the using the following query:
select count(t1.id1) from table1 t1
left join table2 t2 on (t1.a_id=t2.a_id)
full outer join users u1 t1.updated_by_id=u1.id)
full outer join users u2 t2.updated_by_id=u2.id)
where u1.id=100;
I get 50 as count.
Whereas with:
select count(t1.id1) from table1 t1
left join table2 t2 on (t1.a_id=t2.a_id)
full outer join users u1 t1.updated_by_id=u1.id)
full outer join users u2 t2.updated_by_id=u2.id)
where u2.id=100;
I get only 25 as count.
What is my mistake in the second query? What can I do to get the same count?
My requirement is that there is a single user table, referenced by multiple tables. I want to take the complete list of users and get the count of ids from different tables.
But the table on which I have joined alone returns the proper count but rest of them don't return the proper count. Can anybody suggest a way to modify my second query to get the proper count?
To simplify your logic, aggregate first, join later.
Guessing missing details, this query would give you the exact count, how many times each user was referenced in table1 and table2 respectively for all users:
SELECT *
FROM users u
LEFT JOIN (
SELECT updated_by_id AS id, count(*) AS t1_ct
FROM table1
GROUP BY 1
) t1 USING (id)
LEFT JOIN (
SELECT updated_by_id AS id, count(*) AS t2_ct
FROM table2
GROUP BY 1
) t2 USING (id);
In particular, avoid multiple 1-n relationships multiplying each other when joined together:
Two SQL LEFT JOINS produce incorrect result
To retrieve a single or few users only, LATERAL joins will be faster (Postgres 9.3+):
SELECT *
FROM users u
LEFT JOIN LATERAL (
SELECT count(*) AS t1_ct
FROM table1
WHERE updated_by_id = u.id
) ON true
LEFT JOIN LATERAL (
SELECT count(*) AS t2_ct
FROM table2
WHERE updated_by_id = u.id
) ON true
WHERE u.id = 100;
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Explain perceived difference
The particular mismatch you report is due to the specifics of a FULL OUTER JOIN:
First, an inner join is performed. Then, for each row in T1 that does
not satisfy the join condition with any row in T2, a joined row is
added with null values in columns of T2. Also, for each row of T2 that
does not satisfy the join condition with any row in T1, a joined row
with null values in the columns of T1 is added.
So you get NULL values appended on the respective other side for missing matches. count() does not count NULL values. So you can get a different result depending on whether you filter on u1.id=100 or u2.id=100.
This is just to explain, you don't need a FULL JOIN here. Use the presented alternatives instead.

Do volatile tables truncate results by default?

I have two queries that supposed to bring equivalent results. However the second query gives only partial results (less than 10 % of the total).
First query gives more than 4 million rows
SELECT id, amount
FROM table1 t1 LEFT OUTER JOIN table2 t2 ON t1.id = t2.id;
Second give only 18 thousand records
CREATE VOLATILE TABLE vt AS
(
SELECT id, amount
FROM table1 t1 LEFT OUTER JOIN table2 t2 ON t1.id = t2.id;
)
WITH DATA
NO PRIMARY INDEX
ON COMMIT PRESERVE ROWS;
SELECT *
FROM vt ;
Why does the second query give less records ???
When you do a SHOW TABLE vt; you'll notice that it's created as a SET table, which doesn't store duplicate rows. There are only 18 thousand distinct (id,amount) combinations.
Either add DISTINCT to your first Select or use CREATE MULTISET VOLATILE TABLE.

In SQL Server, How to join two table having different column name and different data row?

Suppose I have two tables T1 & T2. I want resulting output table as O1 with help of a SQL query
T1 {SName}
T2 {VName}
O1 {SName, VName}
It seems like you want a cross join of the two tables to include all combinations of rows in T1 and T2. Try this:
Select SName, VName From T1 Inner Join T2 On 1=1
The number of rows you will get is the product of the number of rows in T1 and T2 each.
if you're not joining on anything and want a table of all possible combinations:
select SName, VName
into O1
from T1 cross join T2
if you have a column to join on:
select SName, VName
into O1
from T1 inner join T2 on T1.col = T2.col
Select record from T1 and T2 based on filtering criteria and then insert record in table O1 , use below query to create table O1 and inserting those records.
INSERT INTO O1(SNAME, VNAME)
SELECT(T1.SNAME, T2.VNAME)
From T1 Inner Join T2 On 1=1 //i.e WHERE T1.id=T2.T1_id
use WHERE to filter records
there are several ways of doing it.one easy way is:
select T1.SName,T2.VName from T1,T2;
i don't know if i am right, you can use cross join.
if you have two tables Players and GameScores then
SELECT* INTO O1 FROM GameScores CROSS JOIN Players will return all records where each row from the first table is combined with each row from the second table. Which also mean CROSS JOIN returns the Cartesian product of the sets of rows from the joined tables.
to get the above result as
T1 {SName}
T2 {VName}
O1 {SName, VName}
(SELECT * FROM T1,T2) as O1;
will definitely work if both the table have single value if not then you can select the
Particular column like T1.SName,T2.VName

SQL SELECT across two tables

I am a little confused as to how to approach this SQL query.
I have two tables (equal number of records), and I would like to return a column with which is the division between the two.
In other words, here is my not-working-correctly query:
SELECT( (SELECT v FROM Table1) / (SELECT DotProduct FROM Table2) );
How would I do this? All I want it a column where each row equals the same row in Table1 divided by the same row in Table2. The resulting table should have the same number of rows, but I am getting something with a lot more rows than the original two tables.
I am at a complete loss. Any advice?
It sounds like you have some kind of key between the two tables. You need an Inner Join:
select t1.v / t2.DotProduct
from Table1 as t1
inner join Table2 as t2
on t1.ForeignKey = t2.PrimaryKey
Should work. Just make sure you watch out for division by zero errors.
You didn't specify the full table structure so I will assume a common ID column to link rows in the tables.
SELECT table1.v/table2.DotProduct
FROM Table1 INNER JOIN Table2
ON (Table1.ID=Table2.ID)
You need to do a JOIN on the tables and divide the columns you want.
SELECT (Table1.v / Table2.DotProduct) FROM Table1 JOIN Table2 ON something
You need to substitue something to tell SQL how to match up the rows:
Something like: Table1.id = Table2.id
In case your fileds are both integers you need to do this to avoid integer math:
select t1.v / (t2.DotProduct*1.00)
from Table1 as t1
inner join Table2 as t2
on t1.ForeignKey = t2.PrimaryKey
If you have multiple values in table2 relating to values in table1 you need to specify which to use -here I chose the largest one.
select t1.v / (max(t2.DotProduct)*1.00)
from Table1 as t1
inner join Table2 as t2
on t1.ForeignKey = t2.PrimaryKey
Group By t1.v