SQL join without losing rows - sql

I have 2 tables with the same schema of userID, category, count. I need a query to sum the count of each userID/category pair. Sometimes a pair will exist in one table and not the other. I'm having trouble doing a join without losing the rows where a userID/category pair only exists in 1 table. This is what I'm trying (without success):
select a.user, a.category, count=a.count+b.count
from #temp1 a join #temp2 b
on a.user = b.user and a.category = b.category
Example:
Input:
user category count
id1 catB 3
id2 catG 9
id3 catW 17
user category count
id1 catB 1
id2 catM 5
id3 catW 13
Desired Output:
user category count
id1 catB 4
id2 catG 9
id2 catM 5
id3 catW 30
Update: "count" is not the actual column name. I just used it for the sake of this example, and I forgot it's a reserved word.

You need to:
Use a full outer join so you don't drop rows present in one table and not the other
Coalesce counts prior to addition, because 0 + NULL = NULL
Also, because COUNT and USER are keywords, I would recommend escaping them.
So, using all of these guidelines, your query becomes:
SELECT COALESCE(a.[user], b.[user]) AS [user],
COALESCE(a.category, b.category) AS category,
COALESCE(a.[count],0) + COALESCE(b.[count],0) AS [count]
FROM #temp1 AS a
FULL OUTER JOIN #temp2 AS b
ON a.[user] = b.[user] AND
a.category = b.category

One way to approach this is with a full outer join:
select coalesce(a."user", b."user") as "user",
coalesce(a.category, b.category) as category,
coalesce(a."count", 0) + coalesce(b."count", 0) as "count"
from #temp1 a full outer join
#temp2 b
on a."user" = b."user" and
a.category = b.category;
When using full outer join, you have to be careful because the key fields can be NULL when there is a match in only one table. As a result, the select tends to have a lot of coalesce()s (or similar constructs).
Another way is using a union all query with aggregation:
select "user", category, SUM("count") as "count"
from ((select "user", category, "count"
from #temp1
) union all
(select "user", category, "count"
from #temp2
)
) t
group by "user", category
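As a quick check of the union all approach, here is a minimal sketch that runs it against the sample data from the question. It assumes an in-memory SQLite database driven from Python instead of SQL Server temp tables, and the names temp1, temp2, user_id and cnt are stand-ins chosen to avoid the keyword problem.

```python
import sqlite3

# In-memory SQLite stands in for the two temp tables (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE temp1 (user_id TEXT, category TEXT, cnt INTEGER);
    CREATE TABLE temp2 (user_id TEXT, category TEXT, cnt INTEGER);
    INSERT INTO temp1 VALUES ('id1','catB',3), ('id2','catG',9), ('id3','catW',17);
    INSERT INTO temp2 VALUES ('id1','catB',1), ('id2','catM',5), ('id3','catW',13);
""")

# Stack both tables, then sum per user/category pair.
# Pairs that exist in only one table survive because nothing is joined.
rows = conn.execute("""
    SELECT user_id, category, SUM(cnt) AS total
    FROM (SELECT user_id, category, cnt FROM temp1
          UNION ALL
          SELECT user_id, category, cnt FROM temp2) AS t
    GROUP BY user_id, category
    ORDER BY user_id, category
""").fetchall()

for row in rows:
    print(row)
```

The printed rows match the desired output in the question, including the id2/catG and id2/catM pairs that appear in only one table.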

Related

Count the Same Column in Two Different Tables

I am looking for a way to count for the same column in two different tables.
So I have two tables, table1 and table2. They both have the column "category". I want to find a way to count category for these two tables and show as the result below.
I know how to do this individually by
select category, count(category) as cnt from table1
group by category
order by cnt desc
select category, count(category) as cnt from table2
group by category
order by cnt desc
Not sure how to combine the two into one.
The expected result should be like below. Please note there are some "category" in table1 but not in table2 or vice versa, for example category c and d.
table1 table2
a 4 2
b 4 3
c 3
d 4
One method is full join:
select coalesce(t1c.category, t2c.category) as category,
t1c.t1_cnt, t2c.t2_cnt
from (select category, count(*) as t1_cnt
from table1
group by category
) t1c full join
(select category, count(*) as t2_cnt
from table2
group by category
) t2c
on t1c.category = t2c.category;
You need to be very careful that you aggregate before doing the join.
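To see why the order matters, here is a small sketch using category a from the question (4 rows in table1, 2 rows in table2). It assumes an in-memory SQLite database driven from Python, and it uses an inner join rather than a full join so that it runs on older SQLite versions; the row-multiplication effect is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (category TEXT);
    CREATE TABLE table2 (category TEXT);
    INSERT INTO table1 VALUES ('a'), ('a'), ('a'), ('a');
    INSERT INTO table2 VALUES ('a'), ('a');
""")

# Joining the raw rows first pairs every table1 row with every table2 row.
joined_first = conn.execute("""
    SELECT t1.category, COUNT(*)
    FROM table1 t1
    JOIN table2 t2 ON t1.category = t2.category
    GROUP BY t1.category
""").fetchall()

# Aggregating first leaves one row per category on each side of the join.
aggregated_first = conn.execute("""
    SELECT t1c.category, t1c.t1_cnt, t2c.t2_cnt
    FROM (SELECT category, COUNT(*) AS t1_cnt FROM table1 GROUP BY category) t1c
    JOIN (SELECT category, COUNT(*) AS t2_cnt FROM table2 GROUP BY category) t2c
      ON t1c.category = t2c.category
""").fetchall()

print(joined_first)      # 4 x 2 = 8 joined rows for category a
print(aggregated_first)  # the separate counts, 4 and 2
```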

SUM with LEFT JOIN gives me the wrong result

So I have :
CREATE TABLE A (id INT,type int,amount int);
INSERT INTO A (id,type,amount) VALUES (1,0,25);
INSERT INTO A (id,type,amount) VALUES (2,0,25);
INSERT INTO A (id,type,amount) VALUES (3,1,10);
CREATE TABLE B (id INT,A_ID int,txt text);
INSERT INTO B (id,A_id,txt) VALUES (1,1,'abc');
INSERT INTO B (id,A_id,txt) VALUES (2,1,'def');
INSERT INTO B (id,A_id,txt) VALUES (3,2,'xxx');
I run this query:
SELECT min(A.id), SUM(A.amount), COUNT(B.id) FROM A
LEFT JOIN B ON A.id = B.A_id
GROUP BY A.type
I get :
min(A.id) SUM(A.amount) COUNT(B.id)
1 75 3
3 10 0
But I'm instead expecting to get :
min(A.id) SUM(A.amount) COUNT(B.id)
1 50 3
3 10 0
Can someone help? What is the best way to achieve this exact result ?
I want to GROUP BY type, get the SUM of the grouped A.amount, and get the COUNT of all B rows that reference those A rows through the foreign key.
here is the repro : https://www.db-fiddle.com/f/esu13uGLcgFDpX7aEQRMJR/0 please RUN sql code.
EDIT to add more detail: I know the result is correct if I remove the GROUP BY; then we can see
1, 50, 2
2, 25, 1
But I expect the result above. What is the best way to achieve it? I want to SUM the amount per type, then count all B rows related to the grouped A rows.
Just a shorter version of the solution. It counts B_IDs first in the inner query, so I need to Sum the counts in the outer query.
SELECT min(A.id), SUM(A.amount), COALESCE(SUM(Bcount.Bid), 0) FROM A
LEFT JOIN (select count(id) as Bid, A_id from B group by A_id) as Bcount
ON A.id = Bcount.A_id
GROUP BY A.type
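The effect of counting B first can be checked with the data from the question. This is a minimal sketch, assuming an in-memory SQLite database driven from Python rather than the MySQL fiddle; the COALESCE around the summed count is added here so that a type with no B rows reports 0 instead of NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INT, type INT, amount INT);
    INSERT INTO A VALUES (1,0,25), (2,0,25), (3,1,10);
    CREATE TABLE B (id INT, A_id INT, txt TEXT);
    INSERT INTO B VALUES (1,1,'abc'), (2,1,'def'), (3,2,'xxx');
""")

# Original query: A.id = 1 matches two B rows, so its amount is summed twice.
naive = conn.execute("""
    SELECT MIN(A.id), SUM(A.amount), COUNT(B.id)
    FROM A
    LEFT JOIN B ON A.id = B.A_id
    GROUP BY A.type
    ORDER BY A.type
""").fetchall()

# Pre-aggregated B: at most one joined row per A row, so amounts are not repeated.
fixed = conn.execute("""
    SELECT MIN(A.id), SUM(A.amount), COALESCE(SUM(Bcount.Bid), 0)
    FROM A
    LEFT JOIN (SELECT A_id, COUNT(id) AS Bid FROM B GROUP BY A_id) AS Bcount
      ON A.id = Bcount.A_id
    GROUP BY A.type
    ORDER BY A.type
""").fetchall()

print(naive)  # [(1, 75, 3), (3, 10, 0)]
print(fixed)  # [(1, 50, 3), (3, 10, 0)]
```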
This can happen when you SUM across a 1-N relation.
The matching records can multiply the result.
For example, when 1 record in A is joined with 2 records in B, the join returns A's amount twice before the GROUP BY, so the SUM doubles A.amount.
A way around that is to use sub-queries that join one-to-one.
And a COUNT DISTINCT can be used to count unique ids.
So this is just a way to get the SUM of A correct.
SELECT
q1.type,
q1.min_id,
q2.amount,
COALESCE(q1.totalB, 0) as totalB
FROM
(
SELECT
A.type,
MIN(A.id) AS min_id,
COUNT(DISTINCT B.id) AS totalB
FROM A
LEFT JOIN B ON B.A_id = A.id
GROUP BY A.type
) AS q1
JOIN
(
SELECT
type,
SUM(amount) AS amount
FROM A
GROUP BY type
) AS q2 ON q2.type = q1.type
The SQL was tested on MySQL, but it is standard ANSI SQL that should run on almost any RDBMS, including MS SQL Server.
One way of doing this would be to use ROW_NUMBER():
WITH CTE AS (SELECT A.id AS Aid,
A.[type],
A.amount,
B.id AS bid,
txt,
ROW_NUMBER() OVER (PARTITION BY A.id ORDER BY B.id) AS RN
FROM A
LEFT JOIN B ON A.id = B.A_ID)
SELECT MIN(Aid) AS Min_A_ID,
SUM(CASE RN WHEN 1 THEN amount END) AS Amount,
COUNT(bid) AS BCount
FROM CTE
GROUP BY [type];
I also recommend getting rid of that text datatype and using varchar(MAX).

How should I query these tables?

That's the database I have:
This is the (first) Offer-table with articles and the respective ID:
This is the (second) Bid-Table with the offered articles:
I have to query the numbers of the articles that have received the same number of orders (bids).
So I want to output this:
ID1 ID2 Number_of_Orders
1 2 2
1 5 2
2 5 2
I tried to join it into inline views:
SELECT DISTINCT * FROM
(SELECT BID.ID as ID1 FROM OFFER
INNER JOIN BID ON OFFER.ID=BID.ID
GROUP BY BID.ID) v1,
(SELECT BID.ID as ID2 FROM OFFER
INNER JOIN BID ON OFFER.ID=BID.ID
GROUP BY BID.ID) v2,
(SELECT COUNT(GID) as NUMBER_OF_ORDERS FROM BID
INNER JOIN OFFER ON OFFER.ID=BID.ID
GROUP BY BID.ID
) v3;
but I do not know how I should output the two IDs under the condition that they have the same number of orders (bids).
You seem to want to count the bids for each ID, and then do a self-join on that result to find matches:
with cte (id, number_of_bids) as (
select id, count(*)
from bid
group by id
)
select c1.id as id1, c2.id as id2, c1.number_of_bids
from cte c1
join cte c2
on c2.number_of_bids = c1.number_of_bids
and c2.id > c1.id
order by id1, id2;
ID1 ID2 NUMBER_OF_BIDS
---------- ---------- --------------
1 2 2
1 5 2
2 5 2
The CTE just gets the number of bids for each ID with simple aggregation. (You could do it with inline views instead of a CTE, but you'd be counting them twice, once in each inline view.)
Then the main query joins that CTE to itself on the aggregated number_of_bids being equal, and also on the second ID being higher than the first, which eliminates duplicates. Without that you'd see a row where ID1 was 5 and ID2 was 2, i.e. the reverse of the last of the three rows you want (and the same for the other two), plus each ID/count matched with itself.
You don't need to join to the offer table, since you aren't using any data from it.
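The self-join on counts can be tried out with a small sketch. The question's table contents were not preserved, so the bid rows below are hypothetical, chosen only so that IDs 1, 2 and 5 each have two bids and ID 3 has one; the gid column name is taken from the question's attempt, and SQLite driven from Python stands in for the original database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bid (id INTEGER, gid INTEGER);
    -- Hypothetical data: ids 1, 2 and 5 get two bids each, id 3 gets one.
    INSERT INTO bid VALUES (1,10), (1,11), (2,12), (2,13), (3,14), (5,15), (5,16);
""")

pairs = conn.execute("""
    WITH cte (id, number_of_bids) AS (
        SELECT id, COUNT(*) FROM bid GROUP BY id
    )
    SELECT c1.id AS id1, c2.id AS id2, c1.number_of_bids
    FROM cte c1
    JOIN cte c2
      ON c2.number_of_bids = c1.number_of_bids
     AND c2.id > c1.id          -- drops reversed pairs and self-matches
    ORDER BY id1, id2
""").fetchall()

for row in pairs:
    print(row)
```

ID 3 never shows up because no other ID shares its count of one.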
You can simply inner join these two tables and add a condition such as table1.bidPrice = table2.bidPrice.

How to compare two tables in Hive based on counts

I have below hive tables
Table_1
ID
1
1
2
Table_2
ID
1
2
2
I am comparing two tables based on count of ID in both tables, I need the output like below
ID
1 - 2 records in Table 1 and 1 record in Table 2
2 - 1 record in Table 1 and 2 records in Table 2
Table_1 is the parent table.
I am using the queries below:
select count(*),ID from Table_1 group by ID;
select count(*),ID from Table_2 group by ID;
Just do a full outer join on your queries with the on condition as X.id = Y.id, and then select * from the resultant table checking for nulls on either side.
select coalesce(X.id, Y.id) as id,
concat(coalesce(X.cnt1, 0), ' entries in table 1, ', coalesce(Y.cnt2, 0), ' entries in table 2')
from (select count(*) as cnt1, id from Table_1 group by id) X
full outer join (select count(*) as cnt2, id from Table_2 group by id) Y
on X.id = Y.id;
Try This. You may use a case statement to check if it should be record / records etc.
SELECT m.id,
CONCAT (COALESCE(a.ct, 0), ' record in table 1, ', COALESCE(b.ct, 0),
' record in table 2')
FROM (SELECT id
FROM table_1
UNION
SELECT id
FROM table_2) m
LEFT JOIN (SELECT Count(*) AS ct,
id
FROM table_1
GROUP BY id) a
ON m.id = a.id
LEFT JOIN (SELECT Count(*) AS ct,
id
FROM table_2
GROUP BY id) b
ON m.id = b.id;
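Here is a rough sketch of the same union-plus-left-joins idea run against the question's data. It assumes SQLite driven from Python rather than Hive, and it does the "record / records" wording in Python instead of a CASE expression; the describe helper is made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (id INTEGER);
    CREATE TABLE table_2 (id INTEGER);
    INSERT INTO table_1 VALUES (1), (1), (2);
    INSERT INTO table_2 VALUES (1), (2), (2);
""")

# Build the full list of ids first, then attach each table's per-id count.
rows = conn.execute("""
    SELECT m.id, COALESCE(a.ct, 0), COALESCE(b.ct, 0)
    FROM (SELECT id FROM table_1 UNION SELECT id FROM table_2) m
    LEFT JOIN (SELECT id, COUNT(*) AS ct FROM table_1 GROUP BY id) a ON m.id = a.id
    LEFT JOIN (SELECT id, COUNT(*) AS ct FROM table_2 GROUP BY id) b ON m.id = b.id
    ORDER BY m.id
""").fetchall()

def describe(n):
    """Hypothetical helper: singular for exactly one record, plural otherwise."""
    return f"{n} record" if n == 1 else f"{n} records"

lines = [
    f"{id_} - {describe(c1)} in table 1 and {describe(c2)} in table 2"
    for id_, c1, c2 in rows
]
print("\n".join(lines))
```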
You could use this Python program to do a full comparison of 2 Hive tables:
https://github.com/bolcom/hive_compared_bq
If you want a quick comparison just based on counts, then pass the "--just-count" option (you can also specify the group by column with "--group-by-column").
The script also allows you to visually see all the differences on all rows and all columns if you want a complete validation.

Value present in more than one table

I have 3 tables. All of them have a column - id. I want to find if there is any value that is common across the tables. Assuming that the tables are named a, b and c, if id value 3 is present in a and b, there is a problem. The query can/should exit at the first such occurrence. There is no need to probe further. What I have now is something like
( select id from a intersect select id from b )
union
( select id from b intersect select id from c )
union
( select id from a intersect select id from c )
Obviously, this is not very efficient. Database is PostgreSQL, version 9.0
id is not unique in the individual tables. It is OK to have duplicates in the same table. But if a value is present in just 2 of the 3 tables, that also needs to be flagged and there is no need to check for existence in the third table, or check if there are more such values. One value, present in more than one table, and I can stop.
Although id is not unique within any given table, it should be unique across the tables; a union of distinct id should be unique, so:
select id from (
select distinct id from a
union all
select distinct id from b
union all
select distinct id from c) x
group by id
having count(*) > 1
Note the use of union all, which preserves duplicates (plain union removes duplicates).
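A small sketch of this query on made-up data, assuming SQLite driven from Python rather than PostgreSQL 9.0. The ids are hypothetical: 1 is duplicated inside a (which is allowed), and 4 appears in both b and c (which should be flagged).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER);
    CREATE TABLE c (id INTEGER);
    -- Hypothetical data: 1 repeats within a, 4 is shared by b and c.
    INSERT INTO a VALUES (1), (1), (2);
    INSERT INTO b VALUES (3), (4);
    INSERT INTO c VALUES (4), (5);
""")

# DISTINCT collapses repeats inside each table, so a count above 1
# can only come from the same id appearing in different tables.
dupes = conn.execute("""
    SELECT id FROM (
        SELECT DISTINCT id FROM a
        UNION ALL
        SELECT DISTINCT id FROM b
        UNION ALL
        SELECT DISTINCT id FROM c) x
    GROUP BY id
    HAVING COUNT(*) > 1
""").fetchall()

print(dupes)  # [(4,)]
```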
I would suggest a simple join:
select a.id
from a join
b
on a.id = b.id join
c
on a.id = c.id
limit 1;
If you have a query that uses union or group by (or order by, but that is not relevant here), then you need to process all the data before returning a single row. A join can start returning rows as soon as the first values are found.
An alternative, but similar method is:
select a.id
from a
where exists (select 1 from b where a.id = b.id) and
exists (select 1 from c where a.id = c.id);
If a is the smallest table and id is indexed in b and c, then this could be quite fast.
Try this
select id from
(
select distinct id, 1 as t from a
union all
select distinct id, 2 as t from b
union all
select distinct id, 3 as t from c
) as t
group by id having count(t)=3
It is OK to have duplicates in the same table.
The query can/should exit at the first such occurrence.
SELECT 'OMG!' AS danger_bill_robinson
WHERE EXISTS (SELECT 1
FROM a,b,c -- maybe there is a place for old-style joins ...
WHERE a.id = b.id
OR a.id = c.id
OR c.id = b.id
);
Update: it appears the optimiser does not like Cartesian joins with 3 OR conditions. The query below is a bit faster:
SELECT 'WTF!' AS danger_bill_robinson
WHERE exists (select 1 from a JOIN b USING (id))
OR exists (select 1 from a JOIN c USING (id))
OR exists (select 1 from c JOIN b USING (id))
;
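The pairwise EXISTS form can be wrapped in a small check like the sketch below. It assumes SQLite driven from Python rather than PostgreSQL, and has_shared_id, make_db and the sample ids are all hypothetical names and data introduced for illustration.

```python
import sqlite3

def has_shared_id(conn):
    """Return True if any id appears in at least two of the tables a, b, c."""
    row = conn.execute("""
        SELECT EXISTS (SELECT 1 FROM a JOIN b USING (id))
            OR EXISTS (SELECT 1 FROM a JOIN c USING (id))
            OR EXISTS (SELECT 1 FROM c JOIN b USING (id))
    """).fetchone()
    return bool(row[0])

def make_db(a_ids, b_ids, c_ids):
    """Build an in-memory database with tables a, b, c holding the given ids."""
    conn = sqlite3.connect(":memory:")
    for name, ids in (("a", a_ids), ("b", b_ids), ("c", c_ids)):
        conn.execute(f"CREATE TABLE {name} (id INTEGER)")
        conn.executemany(f"INSERT INTO {name} VALUES (?)", [(i,) for i in ids])
    return conn

print(has_shared_id(make_db([1, 1, 2], [3, 4], [4, 5])))  # True: 4 is in b and c
print(has_shared_id(make_db([1, 1, 2], [3], [5])))        # False: nothing is shared
```

Each EXISTS only needs one matching row, which fits the requirement that the check can stop at the first occurrence.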