Sql join does not return some rows with groupBy - sql

I am trying to learn sql.I do some practices.I created a table which called Student.
Id | Name | Amount
1 | Jone | 100
2 | Jack | 200
3 | Emily | 300
4 |Haaland | 500
7 |Ted | 700
I also created Orders table like that:
Id | Name | Amount | Dıscount
1 | Jone | 100 | 10
2 | Jack | 112 | 20
3 | Emily | 300 | 30
4 |Haaland | 500 | 50
5 |Jack | 88 | 12
7 |Ted | 150 | 235
My query is:
select a1.Id Id ,a1.Name Name, a1.Amount Amount , sum(a2.discount)
from student a1
left outer join orders a2
on a1.Id=a2.Id
and a1.Name=a2.Name
and a1.Amount = a2.Amount
group by a1.Id, a1.Name, a1.Amount
Result:
Id | Name | Amount | Dıscount
1 | Jone | 100 | 10
3 | Emily | 300 | 30
4 |Haaland | 500 | 50
2 | Jack | 200 | null
7 | Ted | 700 | null
I get null value for the jack row.I have to use a1.Amount=a2.Amount because I remove amount constraint Ted'discount also appears.
Expected Result :
Id | Name | Amount | Dıscount
1 | Jone | 100 | 10
3 | Emily | 300 | 30
4 |Haaland | 500 | 50
2 | Jack | 200 | 32
7 | Ted |700 | null

I think the logic you want is to pre-aggregate the orders of each name in a subquery, then join by name and amount:
select s.id , s.name, s.amount, o.discount
from student s
left join (
select name, sum(amount) amount, sum(discount) discount
from orders
group by name
) o on o.name = s.name and o.amount = s.amount

What is the confusion? In one row you have:
id name amount
2 Jack 200
And in the other:
id name amount
2 Jack 112
Your join requires equality on all three columns. The amounts don't match, so there is no match for Jack's row and the amount is null.
Your question is not clear on what you actually want to do, so I'll stop here.

The amount for Jack does not match (200 in Student, 88 and 112 in Orders), so nothing can be joined ON a1.Amount = a2.Amount for that record. However, Please be advised that even if one of the values in Amount does match, the GROUP BY function will still not know which Amount you want associated with 'Jack'.

Related

Trying to join a table of individuals to a table of couples, give a family ID and not time out the server

I have one table with fake individual tax records like so (one row per filer):
T1:
+-------+---------+---------+
| Person| Spouse | Income |
+-------+---------+---------+
| 1 | 2 | 34000 |
| 2 | 1 | 10000 |
| 3 | NULL | 97000 |
| 4 | 6 | 11000 |
| 5 | NULL | 25000 |
| 6 | 4 | 100000 |
+-------+---------+---------+
I have a second table which has tax 'families', a single individual or married couple (one line per tax 'family').
T1_Family:
+-------- -+-------+---------+
| Family_id| Person| Spouse |
+-------- -+-------+---------+
| 2 | 2 | 1 |
| 3 | 3 | NULL |
| 5 | 5 | NULL |
| 6 | 6 | 4 |
+------ ---+-------+---------+
Family = max(Person) within a couple
The idea of joining the two is for example, to sum the income of 2 people in one tax family (aggregate to the family level).
So, I've tried the following:
select *
into family_table
from
(
(select * from T1_family)a
join
(select * from T1)b
on a.family = b.person **or a.spouse = b.person**
)
where family_id is not null and person is not null
What I should get (and I do get when I select 1 random couple) is one line per individual where I can then group by family_id and sum income, pension contributions, etc. BUT SQL times out before the tables can be joined. The part in bold is what's slowing down the process but I'm not sure what else to do.
Is there an easier way to group by family?
It is simpler to put the data on one row:
select a.*, p.income as person_income, s.income as spouse_income
into family_table
from t1_family a left join
t1 p
on a.person = p.person lef tjoin
t1 s
on a.spouse = s.person;
Of course, you can add them together as well.

SQL: Cascading conditions on Join

I have found a few similar questions to this on SO but nothing which applies to my situation.
I have a large dataset with hundreds of millions of rows in Table 1 and am looking for the most efficient way to run the following query. I am using Google BigQuery but I think this is a general SQL question applicable to any DBMS?
I need to apply an owner to every row in Table 1. I want to join in the following priority:
1: if item_id matches an identifier in Table 2
2: if no item_id matches try match on item_name
3: if no item_id or item_name matches try match on item_division
4: if no item_division matches, return null
Table 1 - Datapoints:
| id | item_id | item_name | item_division | units | revenue
|----|---------|-----------|---------------|-------|---------
| 1 | xyz | pen | UK | 10 | 100
| 2 | pqr | cat | US | 15 | 120
| 3 | asd | dog | US | 12 | 105
| 4 | xcv | hat | UK | 11 | 140
| 5 | bnm | cow | UK | 14 | 150
Table 2 - Identifiers:
| id | type | code | owner |
|----|---------|-----------|-------|
| 1 | id | xyz | bob |
| 2 | name | cat | dave |
| 3 | division| UK | alice |
| 4 | name | pen | erica |
| 5 | id | xcv | fred |
Desired output:
| id | item_id | item_name | item_division | units | revenue | owner |
|----|---------|-----------|---------------|-------|---------|-------|
| 1 | xyz | pen | UK | 10 | 100 | bob | <- id
| 2 | pqr | cat | US | 15 | 120 | dave | <- code
| 3 | asd | dog | US | 12 | 105 | null | <- none
| 4 | xcv | hat | UK | 11 | 140 | fred | <- id
| 5 | bnm | cow | UK | 14 | 150 | alice | <- division
My attempts so far have involved multiple joining the table onto itself and I fear it is becoming hugely inefficient.
Any help much appreciated.
Another option for BigQuery Standard SQL
#standardSQL
SELECT ARRAY_AGG(a)[OFFSET(0)].*,
ARRAY_AGG(owner
ORDER BY CASE
WHEN type = 'id' THEN 1
WHEN type = 'name' THEN 2
WHEN type = 'division' THEN 3
END
LIMIT 1
)[OFFSET(0)] owner
FROM Datapoints a
JOIN Identifiers b
ON (a.item_id = b.code AND b.type = 'id')
OR (a.item_name = b.code AND b.type = 'name')
OR (a.item_division = b.code AND b.type = 'division')
GROUP BY a.id
ORDER BY a.id
It leaves out entries which k=have no owners - like in below result (id=3 is out as it has no owner)
Row id item_id item_name item_division units revenue owner
1 1 xyz pen UK 10 100 bob
2 2 pqr cat US 15 120 dave
3 4 xcv hat UK 11 140 fred
4 5 bnm cow UK 14 150 alice
I am using the following query (thanks #Barmar) but want to know if there is a more efficient way in Google BigQuery:
SELECT a.*, COALESCE(b.owner,c.owner,d.owner) owner FROM datapoints a
LEFT JOIN identifiers b on a.item_id = b.code and b.type = 'id'
LEFT JOIN identifiers c on a.item_name = c.code and c.type = 'name'
LEFT JOIN identifiers d on a.item_division = d.code and d.type = 'division'
I'm not sure if BigQuery optimizes today a query like this - but at least you would be writing a query that gives strong hints to not run the subqueries when not needed:
#standardSQL
SELECT COALESCE(
null
, (SELECT MIN(payload)
FROM `githubarchive.year.2016`
WHERE actor.login=a.user)
, (SELECT MIN(payload)
FROM `githubarchive.year.2016`
WHERE actor.id = SAFE_CAST(user AS INT64))
)
FROM (SELECT '15229281' user) a
4.2s elapsed, 683 GB processed
{"action":"started"}
For example, the following query took a long time to run, but BigQuery could optimize its execution massively in the future (depending on how frequently users needed an operation like this):
#standardSQL
SELECT COALESCE(
"hello"
, (SELECT MIN(payload)
FROM `githubarchive.year.2016`
WHERE actor.login=a.user)
, (SELECT MIN(payload)
FROM `githubarchive.year.2016`
WHERE actor.id = SAFE_CAST(user AS INT64))
)
FROM (SELECT actor.login user FROM `githubarchive.year.2016` LIMIT 10) a
114.7s elapsed, 683 GB processed
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello

SQL select values sum by same ID

here is my table called "Employee"
eID | name |
==============
1 | Mike |
2 | Josh |
3 | Mike |
And table called "Sells"
sID | eID | | price |
=========================
1 | 1 | | 8 |
2 | 3 | | 9 |
3 | 3 | | 5 |
4 | 1 | | 4 |
5 | 1 | | 3 |
This should be my expected result: returns the total income per employee
name | Income |
==================
Mike | 15 |
Josh | 0 |
Mike | 14 |
Actually, I know use the query "SUM...GROUP BY..." to get the incomes of 15 and 14, but I don't know how to get the income of 0 which is not shown on the "Sells" table.
Could someone give me some help? Thanks a lot.
You just need to use a left outer join, so you can get the sum for missing values too. You could use case expression to deal with null values
SELECT e.name,
COALESCE(SUM(price), 0) as Income
FROM employees e
LEFT OUTER JOIN sells s
ON e.eid = s.eid
GROUP BY e.eid, e.name
Edited: case expression is not needed. I put coalesce on the return of sum fuction, in order to deal with missing values (SUM over an empty set returns NULL)

Using COUNT in nested SELECTS

I'm trying to count the number of instances of a value in a column and return it in another column. I've successfully done the calculation with a hard coded value but can't figure out how to get a variable from the main SELECT into the COUNT.
In the sample data at the bottom, CommissionRate shows the number of times the DealInstanceOID value in that row shows up in the DealInstanceOID column. My hard coded solution works for all of these values, but getting this to happen dynamically is mystifying me. The DealInstanceOID variable is out of the scope of the nested SELECT and I'm unsure of how to work around that.
I posted this question earlier and had some complaints about how my tables are joined - wasn't able to get any more feedback from those posters and I am reposting as they suggested.
SELECT
D.DealOID
DD.DealInstanceOID
, CommissionRate = (SELECT (DealInstanceOID COUNT(*) FROM dbo.DealDetail WHERE DealInstanceOID = 4530))
, Commission = CONVERT(MONEY,C.Commission,1)
FROM dbo.Book AS B WITH(NOLOCK)
INNER JOIN Contract as C WITH(NOLOCK) ON B.BookOID = C.BookOID
INNER JOIN Deal as D WITH(NOLOCK)ON C.ContractOID = D.ContractOID
INNER JOIN DealInstance DI WITH(NOLOCK) ON DI.DealOID = D.DealOID
INNER JOIN DealDetail AS DD WITH(NOLOCK)ON DD.DealInstanceOID = DI.DealInstanceOID
GROUP BY
DD.DealInstanceOID
, D.DealOID
, C.Commission
, B.BookOID
ORDER BY DealOID ASC
DealOID |Commission |CommissionRate|Commission/Rate|DealInstanceOID
101 | $1000 | 5 | $200.00 | 4530
101 | $1000 | 5 | $200.00 | 4530
101 | $1000 | 5 | $200.00 | 4530
101 | $1000 | 5 | $200.00 | 4530
101 | $1000 | 5 | $200.00 | 4530
101 | $5000 | 6 | $833.33 | 4531
102 | $5000 | 6 | $833.33 | 4531
102 | $5000 | 6 | $833.33 | 4531
102 | $5000 | 6 | $833.33 | 4531
102 | $5000 | 6 | $833.33 | 4531
102 | $5000 | 6 | $833.33 | 4531
103 | $6000 | 3 | $2,000.00 | 4540
103 | $6000 | 3 | $2,000.00 | 4540
103 | $6000 | 3 | $2,000.00 | 4540
Two problems with your scalar sub-select statement. One is a syntax error and the other is referencing. Fix them as follows:
CommissionRate = (SELECT COUNT(*) FROM dbo.DealDetail as s WHERE s.DealInstanceOID = dd.DealInstanceOID)
You should be able to reference it by the table alias and column name:
...
WHERE dbo.DealDetail.DealInstanceOID = DD.DealInstanceOID))
...
Using a join instead of a select for every row.
The key is to use a CTE. As you can see below the result is two selects against the database as opposed to one for every row.
WITH counts AS
( -- This will give a virtual table (CTE) with ID and count
SELECT DealInstanceOID, COUNT(*) as C
FROM dbo.DealDetail
GROUP BY DealInstanceOID
)
SELECT
D.DealOID,
DI.DealInstanceOID,
counts.C AS CommissionRate,
CONVERT(MONEY,C.Commission,1) AS Commission
FROM dbo.Book AS B WITH(NOLOCK)
JOIN Contract as C WITH(NOLOCK) ON B.BookOID = C.BookOID
JOIN Deal as D WITH(NOLOCK)ON C.ContractOID = D.ContractOID
JOIN DealInstance DI WITH(NOLOCK) ON DI.DealOID = D.DealOID
JOIN counts ON DI.DealInstanceOID = counts.DealInstanceOID
GROUP BY DI.DealInstanceOID, D.DealOID, C.Commission, B.BookOID
ORDER BY DealOID ASC

Left Join COUNT on tables

I have 2 tables:
puid | personid | ptitle
----------------------------
1 | 200 | richard
2 | 201 | swiss
suid | personidref | stitle
----------------------------
1 | 200 | alf
2 | 201 | lando
3 | 200 | willis
4 | 201 | luke
5 | 201 | kojak
6 | 200 | r2-d2
7 | 201 | jabba
I am trying to left join with a count of table two. I have tried to figure out to use generate_series or sub selects but I cant noodle the syntax.
In english: show me each unique person in table one with a count of each entry in table two.
example output:
puid | personid | ptitle | count
---------------------------------
1 | 200 | richard | 3
2 | 201 | swiss | 4
Is this is simple subquery, is generate_series the right tool for the job?
select *
from
t1
left join
(
select personidref, count(*) total
from t2
group by personidref
) s using(personidref)
order by puid
Notice that doing the aggregation before joining probably has a performance gain over doing it after.