I have two huge tables avg_rent and avg_sale. They contain average prices for apartments broken down by location, apartment size and other factors. The data in those tables may be incomplete.
For example in table avg_sale I may have:
id | apartment_size_id | county | city | median_sale
100 | 1 | 1 | 4 | 800
101 | 4 | 1 | 4 | 600
102 | 6 | 1 | 4 | 650
And in table avg_rent I may have:
id | apartment_size_id | county | city | median_rent
300 | 1 | 1 | 4 | 300
301 | 2 | 1 | 4 | 250
302 | 3 | 1 | 4 | 200
303 | 4 | 1 | 4 | 250
305 | 6 | 1 | 4 | 200
I want to create a SQL query or plpqsql function that would aggregate median_sale, median_rent and apartment_size_id columns and fill in missing data with -1 or something. In case of the example would return this (there are total of 6 size categories):
apartment_size_id | median_rent | median_sale
1 | 300 | 800
2 | 250 | -1
3 | 200 | -1
4 | 250 | 600
5 | -1 | -1
6 | 200 | 650
How can I do this?
You would use left join for this, assuming you have a table of apartment sizes:
select a.apartment_size_id, coalesce(r.median_rent, -1) as median_rent,
coalesce(s.median_sales, -1) as median_sales
from apartment_sizes a left join
avg_rent r
on a.apartment_size_id = r.apartment_size_id and
r.county = 1 and r.city = 4 left join
avg_sale s
on a.apartment_size_id = s.apartment_size_id and
s.county = 1 and s.city = 4;
This also assumes that you want the information for a single county/city pair.
I would recommend that you represent the missing values using NULL rather than -1, unless you have a good reason for choosing -1.
You can do this with full outer join and COALESCE
select
r.apartment_size_id,
COALESCE(r.median_rent, -1) as median_rent,
COALESCE(s.median_sale, -1) as median_sale
from avg_rent r
FULL OUTER JOIN avg_sale s
on r.apartment_size_id = s.apartment_size_id
This query definitely gives only those apartment_size_id present in either avg_rent and avg_sale
If you have a apartment table which is having all apartment_size_id info then you can do the same with left join and COALESCE
select
a.apartment_size_id,
COALESCE(r.median_rent, -1) as median_rent,
COALESCE(s.median_sale, -1) as median_sale
from apartment a
LEFT JOIN avg_rent r on a.apartment_size_id = r.apartment_size_id
LEFT JOIN avg_sale s on a.apartment_size_id = s.apartment_size_id
sql fiddle demo
Related
I am trying to learn sql.I do some practices.I created a table which called Student.
Id | Name | Amount
1 | Jone | 100
2 | Jack | 200
3 | Emily | 300
4 |Haaland | 500
7 |Ted | 700
I also created Orders table like that:
Id | Name | Amount | Dıscount
1 | Jone | 100 | 10
2 | Jack | 112 | 20
3 | Emily | 300 | 30
4 |Haaland | 500 | 50
5 |Jack | 88 | 12
7 |Ted | 150 | 235
My query is:
select a1.Id Id ,a1.Name Name, a1.Amount Amount , sum(a2.discount)
from student a1
left outer join orders a2
on a1.Id=a2.Id
and a1.Name=a2.Name
and a1.Amount = a2.Amount
group by a1.Id, a1.Name, a1.Amount
Result:
Id | Name | Amount | Dıscount
1 | Jone | 100 | 10
3 | Emily | 300 | 30
4 |Haaland | 500 | 50
2 | Jack | 200 | null
7 | Ted | 700 | null
I get null value for the jack row.I have to use a1.Amount=a2.Amount because I remove amount constraint Ted'discount also appears.
Expected Result :
Id | Name | Amount | Dıscount
1 | Jone | 100 | 10
3 | Emily | 300 | 30
4 |Haaland | 500 | 50
2 | Jack | 200 | 32
7 | Ted |700 | null
I think the logic you want is to pre-aggregate the orders of each name in a subquery, then join by name and amount:
select s.id , s.name, s.amount, o.discount
from student s
left join (
select name, sum(amount) amount, sum(discount) discount
from orders
group by name
) o on o.name = s.name and o.amount = s.amount
What is the confusion? In one row you have:
id name amount
2 Jack 200
And in the other:
id name amount
2 Jack 112
Your join requires equality on all three columns. The amounts don't match, so there is no match for Jack's row and the amount is null.
Your question is not clear on what you actually want to do, so I'll stop here.
The amount for Jack does not match (200 in Student, 88 and 112 in Orders), so nothing can be joined ON a1.Amount = a2.Amount for that record. However, Please be advised that even if one of the values in Amount does match, the GROUP BY function will still not know which Amount you want associated with 'Jack'.
I have found a few similar questions to this on SO but nothing which applies to my situation.
I have a large dataset with hundreds of millions of rows in Table 1 and am looking for the most efficient way to run the following query. I am using Google BigQuery but I think this is a general SQL question applicable to any DBMS?
I need to apply an owner to every row in Table 1. I want to join in the following priority:
1: if item_id matches an identifier in Table 2
2: if no item_id matches try match on item_name
3: if no item_id or item_name matches try match on item_division
4: if no item_division matches, return null
Table 1 - Datapoints:
| id | item_id | item_name | item_division | units | revenue
|----|---------|-----------|---------------|-------|---------
| 1 | xyz | pen | UK | 10 | 100
| 2 | pqr | cat | US | 15 | 120
| 3 | asd | dog | US | 12 | 105
| 4 | xcv | hat | UK | 11 | 140
| 5 | bnm | cow | UK | 14 | 150
Table 2 - Identifiers:
| id | type | code | owner |
|----|---------|-----------|-------|
| 1 | id | xyz | bob |
| 2 | name | cat | dave |
| 3 | division| UK | alice |
| 4 | name | pen | erica |
| 5 | id | xcv | fred |
Desired output:
| id | item_id | item_name | item_division | units | revenue | owner |
|----|---------|-----------|---------------|-------|---------|-------|
| 1 | xyz | pen | UK | 10 | 100 | bob | <- id
| 2 | pqr | cat | US | 15 | 120 | dave | <- code
| 3 | asd | dog | US | 12 | 105 | null | <- none
| 4 | xcv | hat | UK | 11 | 140 | fred | <- id
| 5 | bnm | cow | UK | 14 | 150 | alice | <- division
My attempts so far have involved multiple joining the table onto itself and I fear it is becoming hugely inefficient.
Any help much appreciated.
Another option for BigQuery Standard SQL
#standardSQL
SELECT ARRAY_AGG(a)[OFFSET(0)].*,
ARRAY_AGG(owner
ORDER BY CASE
WHEN type = 'id' THEN 1
WHEN type = 'name' THEN 2
WHEN type = 'division' THEN 3
END
LIMIT 1
)[OFFSET(0)] owner
FROM Datapoints a
JOIN Identifiers b
ON (a.item_id = b.code AND b.type = 'id')
OR (a.item_name = b.code AND b.type = 'name')
OR (a.item_division = b.code AND b.type = 'division')
GROUP BY a.id
ORDER BY a.id
It leaves out entries which k=have no owners - like in below result (id=3 is out as it has no owner)
Row id item_id item_name item_division units revenue owner
1 1 xyz pen UK 10 100 bob
2 2 pqr cat US 15 120 dave
3 4 xcv hat UK 11 140 fred
4 5 bnm cow UK 14 150 alice
I am using the following query (thanks #Barmar) but want to know if there is a more efficient way in Google BigQuery:
SELECT a.*, COALESCE(b.owner,c.owner,d.owner) owner FROM datapoints a
LEFT JOIN identifiers b on a.item_id = b.code and b.type = 'id'
LEFT JOIN identifiers c on a.item_name = c.code and c.type = 'name'
LEFT JOIN identifiers d on a.item_division = d.code and d.type = 'division'
I'm not sure if BigQuery optimizes today a query like this - but at least you would be writing a query that gives strong hints to not run the subqueries when not needed:
#standardSQL
SELECT COALESCE(
null
, (SELECT MIN(payload)
FROM `githubarchive.year.2016`
WHERE actor.login=a.user)
, (SELECT MIN(payload)
FROM `githubarchive.year.2016`
WHERE actor.id = SAFE_CAST(user AS INT64))
)
FROM (SELECT '15229281' user) a
4.2s elapsed, 683 GB processed
{"action":"started"}
For example, the following query took a long time to run, but BigQuery could optimize its execution massively in the future (depending on how frequently users needed an operation like this):
#standardSQL
SELECT COALESCE(
"hello"
, (SELECT MIN(payload)
FROM `githubarchive.year.2016`
WHERE actor.login=a.user)
, (SELECT MIN(payload)
FROM `githubarchive.year.2016`
WHERE actor.id = SAFE_CAST(user AS INT64))
)
FROM (SELECT actor.login user FROM `githubarchive.year.2016` LIMIT 10) a
114.7s elapsed, 683 GB processed
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
I can't understand why there is the difference in results between two queries:
1) select distinct sum(lot.lot_size) over (partition by lot.detail_id) buying_size, lot.f_detail_id
from buying
join lot on lot.lot_id = buying.lot_id
where exists (select 1
from selling s
join buying b on b.buying_id = s.buying_buying_id
where s.deal_id = 123456
and buying.deal_id = b.deal_id
and s.selling_detailid = buying.buying_detailid)
and buying.buying_status <> 'Canceled'
result:
|buying_size|f_detail_id |
|-----------|------------|
| 105 | 1 |
| 200 | 2 |
| 75 | 3 |
| 225 | 4 |
| 300 | 5 |
2) select distinct *
from (select sum(lot.lot_size) over (partition by lot.detail_id) buying_size, lot.f_detail_id
from buying
join lot on lot.lot_id = buying.lot_id
where exists (select 1
from selling s
join buying b on b.buying_id = s.buying_buying_id
where s.deal_id = 123456
and buying.deal_id = b.deal_id
and s.selling_detailid = buying.buying_detailid)
and buying.buying_status <> 'Canceled')
result:
|buying_size|f_detail_id |
|-----------|------------|
| 105 | 1 |
| 200 | 2 |
| 75 | 3 |
| 225 | 4 |
| 150 | 5 |
I know that this query might be written by another way with using GROUP BY but the main thing I'm interested in is why the result of analytical function depends on DISTINCT.
I need help with this.
Tbl: WarehouseInventory
Date | DelRec | ProductId | Quantity
2015-09-10 | 110 | 1 | 100
2015-09-12 | 111 | 1 | 100
2015-09-12 | 111 | 2 | 200
2015-09-12 | 111 | 3 | 300
Tbl: Withdrawals
Date | ID | ProductId | Quantity | CustomerId
2015-09-11 | 1 | 1 | 400 | 2
2015-09-12 | 1 | 1 | 100 | 1
2015-09-12 | 2 | 2 | 200 | 1
2015-09-12 | 3 | 3 | 300 | 1
Tbl: Customers
Customer Id | Name
1 | Somebody
2 | Someone
The output should be like this
DelRec | Date Added | ProductId | Stocked | Withdrawn | Customer
110 | 2015-09-10 | 1 | 100 | 0 | NULL
0 | 2015-09-11 | 1 | 0 | 400 | Someone
111 | 2015-09-12 | 1 | 100 | 100 | Somebody
111 | 2015-09-12 | 2 | 200 | 200 | Somebody
111 | 2015-09-12 | 3 | 300 | 300 | Somebody
This is what I have come up so far and it's giving me a wrong output
select wi.DateAdded as 'Date Added', max(wi.DeliveryReceipt) as 'Delivery Receipt', wi.ProductId as 'Product',
max(isnull(wi.Quantity, 0)) as 'Stocked', max(isnull(w.Quantity, 0)) as 'Withdrawn', e.Customers as 'Customer'
from WarehouseInventory wi
cross join Withdrawals w
cross join Customer e
group by wi.DateAdded, wi.ProductId, e.Customers, wi.DeliveryReceipt, w.ProductId
Basically, I need to join the two tables on the date and product and if there is a null value in one of the tables, just make it 0. I appreciate your help.
You can use a FULL OUTER JOIN:
SELECT DelRec,
COALESCE(wi.[Date], wd.[Date]) AS Date_Added,
COALESCE(wi.ProductId, wd.ProductId) AS ProductId,
COALESCE(wi.Quantity, 0) AS Stocked,
COALESCE(wd.Quantity, 0) AS Withdrawn,
c.Name AS Customer
FROM WarehouseInventory AS wi
FULL OUTER JOIN Withdrawals AS wd
ON wi.[Date] = wd.[Date] AND wi.ProductId = wd.ProductId
LEFT JOIN Customers AS c ON c.[Customer Id] = wd.CustomerId
ORDER BY Date_Added
You have a few inconsistencies between your example table and your query, but here's the basic gist:
You want to FULL OUTER JOIN your Warehouse delivery (A) and Withdrawal (B) tables on both product and date
Make sure you coalesce(A, B) for both date and product
Sum the quantities from each table, then coalesce outside each aggregate to get zeros (since one column can be all nulls).
Here:
select
coalesce(wi.DateAdded, w.date) as 'Date Added',
max(wi.DeliveryReceipt) as 'Delivery Receipt',
coalesce(wi.ProductId, w.productId) as 'Product',
coalesce(sum(wi.Quantity), 0) as 'Stocked',
coalesce(sum(w.Quantity), 0) as 'Withdrawn',
e.name as 'Customer'
from WarehouseInventory wi
full outer join Withdrawals w on w.date = wi.dateadded and w.productId = wi.productId
left join Customer e on e.customerId = w.customerId
group by
coalesce(wi.DateAdded, w.date),
coalesce(wi.ProductId, w.productId),
e.name
I'm having some trouble joining the contents of two tables. Here's the current situation and the table should be shown as result
Table a
id | income
----------
1 | 100
2 | 200
3 | 300
4 | 400
5 | 500
Table b
id | outcome
----------
1 | 10
2 | 20
6 | 60
7 | 70
3 | 30
Result table
id | income | outcome | balance
--------------------------------
1 | 100 | 10 | 100-10=90
2 | 200 | 20 | 200-20=180
3 | 300 | 30 | 300-30=270
4 | 400 | 0 | 400-0=400
5 | 500 | 0 | 500-0=500
6 | 0 | 60 | 0-60=-60
7 | 0 | 70 | 0-70=-70
every id should be disctinct in the result table
income and outcome should be shown in the result table. if the id is not in one of the table income or outcome should be 0
calculation of balance column: income-outcome
Please provide code, without using the statement
where id not in (select id from ...)
because not in statement is not supported by the system I am using.
Thank you very much for your help
You can do what you want using a full outer join:
select coalesce(a.id, b.id) as id,
coalesce(a.income, 0) as income,
coalesce(b.outcome, 0) as outcome,
(coalesce(a.income, 0) - coalesce(b.outcome, 0)) as balance
from tablea a full outer join
tableb b
on a.id = b.id;
You don't specify the database. Full outer join is ANSI-standard and supported by most databases.