How to GROUP BY and aggregate fields after JOINS in Query - sql

I have the following data which I got from the following query:
# | date                | quantity | name     | season_id | contract_id | signing_date
1 | 2016-07-01 00:00:00 | 3        | John Doe | 4         | 3000        | 2016-10-20
2 | 2021-07-28 00:00:00 | 14       | John Doe | 5         | 3541        | 2021-01-28
3 | 2016-08-15 00:00:00 | 10       | John Doe | 5         | 3000        | 2016-10-20
4 | 2016-08-02 00:00:00 | 5        | John Doe | 5         | 1528        | 2016-03-02
WITH ws AS (select date, quantity,
name, season_id, contract_id, contract.signing_date
FROM warehouse_state
JOIN inventory ON inventory.id = warehouse_state.inventory_id
JOIN owner ON owner.inventory_id = warehouse_state.id
JOIN season ON season.id = owner.season_id
JOIN contract ON contract.id = warehouse_contract.contract_id
GROUP BY date, quantity, name, season.id, contract.id, signing_date)
Now, I am having trouble aggregating the ws records based on dates.
Let's say I want a SUM of quantity, grouped by date, counting only the rows whose date is before the contract's signing_date. I'm not sure how to proceed with this. It can probably be done in a single query without a WITH x AS clause, or by actually using the CTE like this:
SELECT * FROM ws
LEFT JOIN contract on contract.contract_id = ws.contract_id
-- Here set following condition: for any ws record that has `date` before `signing_date`, SUM quantity and return aggregate
Expected output:
contract_id | signing_date | quantity | name
3000        | 2016-10-20   | 18       | John Doe
3541        | 2021-01-28   | 18       | John Doe
1528        | 2021-01-28   | 0        | John Doe
In the expected output, quantity is a SUM and the records are grouped by contract. In the first record, rows #1, #3, and #4 were aggregated because their date values are before the signing_date of contract 3000. Even though row #4 does not have the same contract_id, it is also aggregated because its date is before the signing date of contract 3000. Similarly, when grouped by contract 3541, row #2 is excluded from the aggregation because its date is not before the signing_date of contract 3541.
Any suggestions? Thanks

Does that SQL really compile? I ask because you reference a warehouse_contract table that I don't see joined anywhere.
Also, you are grouping on all columns, which is essentially a "select distinct." Is that what you meant to do?
That aside, assuming your joins are correct and a couple of other assumptions, I'm going to sub them all with "< your tables and joins >." I think all you want is a simple aggregate. No need for a CTE (with clause).
select
date, sum (quantity)
FROM
< your tables and joins >
where
date < signing_date
GROUP BY
date
Alternatively, you can see the total quantity for all dates AND the total quantity before the contract date using a filter:
select
date, sum (quantity) as total_quantity,
sum (quantity) filter (where date < signing_date) as qty_before_contract_sign
FROM
< your tables and joins >
GROUP BY
date
If you wanted to see the other columns as well, then you want a windowing function. Let me know if that's the case and I can demonstrate.
-- EDIT 9/7/22 --
Based on your update, I think this is what you want:
select
contract_id, contract.signing_date, sum (quantity) as quantity,
name
FROM warehouse_state
JOIN inventory ON inventory.id = warehouse_state.inventory_id
JOIN owner ON owner.inventory_id = warehouse_state.id
JOIN season ON season.id = owner.season_id
JOIN contract ON contract.id = warehouse_contract.contract_id
where
date < contract.signing_date
GROUP BY
contract_id, contract.signing_date, name
But the one gotcha is that contract 1528 will not show up in this output, since all of its rows are filtered out by the where condition.
I'm not fond of this, but you could move the condition into a filter clause to overcome this... maybe there's a better solution.
select
contract_id, contract.signing_date,
coalesce (sum (quantity) filter (where date < contract.signing_date), 0) as quantity,
name
FROM warehouse_state
JOIN inventory ON inventory.id = warehouse_state.inventory_id
JOIN owner ON owner.inventory_id = warehouse_state.id
JOIN season ON season.id = owner.season_id
JOIN contract ON contract.id = warehouse_contract.contract_id
GROUP BY
contract_id, contract.signing_date, name
Also, my output does not match yours, but I'm hoping that's because of sample data.
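For what it's worth, here is a minimal, self-contained sketch of one way to reproduce the asker's expected quantities. It runs on Python's built-in sqlite3 module rather than the asker's database. The table name ws_rows is hypothetical and simply holds the four sample rows (standing in for the result of the joins). It also assumes that each contract should be compared against every row with the same name, which is one reading of the question, not something the original schema confirms. A CASE expression inside SUM stands in for the FILTER clause so the sketch also runs on older SQLite builds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical flat table holding the asker's four sample rows.
CREATE TABLE ws_rows (
    date TEXT, quantity INTEGER, name TEXT,
    contract_id INTEGER, signing_date TEXT
);
INSERT INTO ws_rows VALUES
    ('2016-07-01',  3, 'John Doe', 3000, '2016-10-20'),
    ('2021-07-28', 14, 'John Doe', 3541, '2021-01-28'),
    ('2016-08-15', 10, 'John Doe', 3000, '2016-10-20'),
    ('2016-08-02',  5, 'John Doe', 1528, '2016-03-02');
""")

# Pair every distinct contract with every row of the same name (assumption),
# then sum only the quantities dated before that contract's signing_date.
rows = conn.execute("""
SELECT c.contract_id,
       c.signing_date,
       COALESCE(SUM(CASE WHEN w.date < c.signing_date
                         THEN w.quantity END), 0) AS quantity,
       c.name
FROM (SELECT DISTINCT contract_id, signing_date, name FROM ws_rows) AS c
JOIN ws_rows AS w ON w.name = c.name
GROUP BY c.contract_id, c.signing_date, c.name
ORDER BY c.contract_id
""").fetchall()

for row in rows:
    print(row)
```

Because the condition lives inside the aggregate instead of the WHERE clause, a contract with no qualifying rows (1528 here) still appears, with a quantity of 0.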

Related

Finding a min() date for one column and then using this to join with other tables that have a date LESS than this date

In short, I have two tables:
(1) pharmacy_claims (columns: user_id, date_service, claim_id, record_id, prescription)
(2) medical_claims (columns: user_id, date_service, provider, npi, cost)
I want to find user_id's in (1) that have a certain prescription value, find their earliest date_service (e.g. min(date_service)) and then use these user_id's with their earliest date of service as a cohort to pull all of their associated data from (2). Basically I want to find all of their medical_claims data PRIOR to the first time they were prescribed a given prescription in pharmacy_claims.
pharmacy_claims looks something like this:
user_id | prescription | date_service
1 a 2018-05-01
1 a 2018-02-11
1 a 2019-10-11
1 b 2018-07-12
2 a 2019-01-02
2 a 2019-03-10
2 c 2018-04-11
3 c 2019-05-26
So for instance, if I was interested in prescription = 'a', I would only want user_id 1 and 2 returned, with dates 2018-02-11 and 2019-01-02, respectively. Then I would want to pull user_id 1 and 2 from the medical_claims, and get all of their data PRIOR to these respective dates.
The way I tried to go about this was to build a temp table from pharmacy_claims holding the user_id's that have a given medication, and then left join it back to that table to create a cohort of user_id's with a date_service.
Here's what I did:
(1) Pulled all of the relevant data from the main pharmacy claims table:
CREATE TABLE user.temp_pharmacy_claims AS
SELECT user_id, claim_id, record_id, date_service
FROM dw.pharmacyclaims
WHERE date_service between '2018-01-01' and '2019-08-31'
This results in ~50,000 user_id's
(2) Created a table with just the user_id's and min(date_service):
CREATE TABLE user.temp_pharmacy_claims_index AS
SELECT distinct user_id, min(date_service) AS Min_Date
FROM user.temp_pharmacy_claims
GROUP BY 1
(3) Created a final table (to get the desired cohort):
CREATE TABLE user.temp_pharmacy_claims_final_index AS
SELECT a.userid
FROM user.temp_pharmacy_claims a
LEFT JOIN user.temp_pharmacy_claims_index b
ON a.user = b.user
WHERE a.date_service < Min_Date
However, this gets me 0 results when there should be a few thousand. Is this set up correctly? It's probably not the most efficient approach, but it looks sound to me, so not sure what's going on.
I think you just want a correlated subquery:
select mc.*
from medical_claims mc
where mc.date_service < (select min(pc.date_service)
from pharmacy_claims pc
where pc.user_id = mc.user_id and
pc.prescription = ?
);
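As a rough check of the correlated-subquery idea, the sketch below loads the question's pharmacy_claims sample into an in-memory SQLite database via Python's sqlite3 module. The medical_claims rows and provider names are made up for illustration, since the question gives none. The `?` placeholder is bound to the prescription of interest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pharmacy_claims (user_id INTEGER, prescription TEXT, date_service TEXT);
INSERT INTO pharmacy_claims VALUES
    (1, 'a', '2018-05-01'), (1, 'a', '2018-02-11'), (1, 'a', '2019-10-11'),
    (1, 'b', '2018-07-12'), (2, 'a', '2019-01-02'), (2, 'a', '2019-03-10'),
    (2, 'c', '2018-04-11'), (3, 'c', '2019-05-26');

-- Hypothetical medical claims, invented for this sketch.
CREATE TABLE medical_claims (user_id INTEGER, date_service TEXT, provider TEXT);
INSERT INTO medical_claims VALUES
    (1, '2018-01-05', 'clinic A'), (1, '2018-03-01', 'clinic B'),
    (2, '2018-12-31', 'clinic C'), (2, '2019-01-02', 'clinic C'),
    (3, '2018-01-01', 'clinic D');
""")

# For each medical claim, compare its date with that user's earliest
# pharmacy claim for the given prescription.
prior_claims = conn.execute("""
SELECT mc.user_id, mc.date_service, mc.provider
FROM medical_claims AS mc
WHERE mc.date_service < (SELECT MIN(pc.date_service)
                         FROM pharmacy_claims AS pc
                         WHERE pc.user_id = mc.user_id
                           AND pc.prescription = ?)
ORDER BY mc.user_id, mc.date_service
""", ("a",)).fetchall()

for row in prior_claims:
    print(row)
```

User 3 never had prescription 'a', so the subquery yields NULL for that user and the comparison filters those rows out, which matches the cohort the question describes.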

An aggregation is affecting results in a major way

I seem to be getting duplicates as a result of this query. The only analysis I want to do is the sum of calls/the total orders, and to be able to see how many support_tickets were generated from orders within an order range, up to a call_date. Very simple, but surprisingly complex to code up. Here is my attempt. I have also tried to change the below into a union, but still get wrong aggregate results.
The query:
SELECT marketing_code,
count(order_code) order_code_count,
order_date,
sum(support_ticket_call) call_count,
call_date
FROM
(select distinct marketing_code, order_code, order_date from table1) a
left join
(select count(call_ids) as support_ticket_Call, call_date
FROM table2 group by call_date) b
on b.order_ID_code = a.order_id_code
group by marketing_code, order_date, call_date
Please note, the call can happen at a much later date than the order. The order date is in table 1, but not in table 2; the call_date is in table 2, but not in table 1. Also, in the data, the marketing code is either AB16 or AB17.
Sample data:
Marketing code order_code_count call_count call_date order_date
AB16 30 45 2016-01-01 2015-12-27
AB17 13 17 2016-01-02 2015-12-29
AB16 24 29 2016-01-02 2016-01-01
The sum of support ticket calls should be lower than the order count.
You join your tables by order_id_code, but in the right part of your join you count all calls from one day. This doesn't seem right. Try something like this:
select
marketing_code,
count(order_code) order_code_count,
order_date,
count(call_ids) call_count,
call_date
from
table1 a left join table2 b on b.order_ID_code = a.order_id_code
group by
marketing_code, order_date, call_date
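To see how the counts behave after the join, here is a small sketch using Python's sqlite3 module. The column names (order_code as the join key, call_id) and all rows are hypothetical, since the question does not show the real schema. One detail worth noting: a plain COUNT over the joined rows counts an order once per matching call, so this sketch uses COUNT(DISTINCT ...) for the order count. That is a variation on the query above, not necessarily what the answerer intended.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema and rows, for illustration only.
CREATE TABLE table1 (marketing_code TEXT, order_code TEXT, order_date TEXT);
CREATE TABLE table2 (call_id TEXT, order_code TEXT, call_date TEXT);
INSERT INTO table1 VALUES
    ('AB16', 'o1', '2015-12-27'),
    ('AB16', 'o2', '2015-12-27'),
    ('AB17', 'o3', '2015-12-29');
-- Order o1 generated two support calls on the same day.
INSERT INTO table2 VALUES
    ('c1', 'o1', '2016-01-01'),
    ('c2', 'o1', '2016-01-01');
""")

result = conn.execute("""
SELECT a.marketing_code,
       a.order_date,
       b.call_date,
       COUNT(DISTINCT a.order_code) AS order_code_count,  -- each order once
       COUNT(b.call_id)             AS call_count         -- NULLs not counted
FROM table1 AS a
LEFT JOIN table2 AS b ON b.order_code = a.order_code
GROUP BY a.marketing_code, a.order_date, b.call_date
""").fetchall()

for row in result:
    print(row)
```

Orders without any call land in a group whose call_date is NULL and whose call_count is 0, because COUNT ignores NULL values.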

Only joining rows where the date is less than the max date in another field

Let's say I have two tables. One table containing employee information and the days that employee was given a promotion:
Emp_ID Promo_Date
1 07/01/2012
1 07/01/2013
2 07/19/2012
2 07/19/2013
3 08/21/2012
3 08/21/2013
And another table with every day employees closed a sale:
Emp_ID Sale_Date
1 06/12/2013
1 06/30/2013
1 07/15/2013
2 06/15/2013
2 06/17/2013
2 08/01/2013
3 07/31/2013
3 09/01/2013
I want to join the two tables so that I only include sales dates that are less than the maximum promotion date. So the result would look something like this
Emp_ID Sale_Date Promo_Date
1 06/12/2013 07/01/2012
1 06/30/2013 07/01/2012
1 06/12/2013 07/01/2013
1 06/30/2013 07/01/2013
And so on for the rest of the Emp_IDs. I tried doing this using a left join, something to the effect of
left join SalesTable on PromoTable.EmpID = SalesTable.EmpID and Sale_Date
< max(Promo_Date) over (partition by Emp_ID)
But apparently I can't use aggregates in joins, and I already know that I can't use them in the where statement either. I don't know how else to proceed with this.
The maximum promotion date is:
select emp_id, max(promo_date)
from promotions
group by emp_id;
There are various ways to get the sales before that date, but here is one way:
select s.*
from sales s
where s.sales_date < (select max(promo_date)
from promotions p
where p.emp_id = s.emp_id
);
Gordon's answer is right on! Alternatively, you could also do an inner join to a subquery to achieve your desired output like this:
SELECT s.emp_id
,s.sales_date
,t.promo_date
FROM sales s
INNER JOIN (
SELECT emp_id
,max(promo_date) AS promo_date
FROM promotions
GROUP BY emp_id
) t ON s.emp_id = t.emp_id
AND s.sales_date < t.promo_date;
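The join-to-subquery approach can be tried quickly with Python's sqlite3 module. The sketch below uses the question's sample rows, with the dates rewritten in ISO format (an assumption made here so that plain string comparison orders them correctly in SQLite) and the table and column names taken from the answers rather than the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE promotions (emp_id INTEGER, promo_date TEXT);
CREATE TABLE sales (emp_id INTEGER, sales_date TEXT);
INSERT INTO promotions VALUES
    (1, '2012-07-01'), (1, '2013-07-01'),
    (2, '2012-07-19'), (2, '2013-07-19'),
    (3, '2012-08-21'), (3, '2013-08-21');
INSERT INTO sales VALUES
    (1, '2013-06-12'), (1, '2013-06-30'), (1, '2013-07-15'),
    (2, '2013-06-15'), (2, '2013-06-17'), (2, '2013-08-01'),
    (3, '2013-07-31'), (3, '2013-09-01');
""")

# Aggregate first (one max promo_date per employee), then join the sales
# to that derived table; the aggregate never appears in the ON clause.
sales_before_promo = conn.execute("""
SELECT s.emp_id, s.sales_date, t.promo_date
FROM sales AS s
INNER JOIN (SELECT emp_id, MAX(promo_date) AS promo_date
            FROM promotions
            GROUP BY emp_id) AS t
        ON s.emp_id = t.emp_id
       AND s.sales_date < t.promo_date
ORDER BY s.emp_id, s.sales_date
""").fetchall()

for row in sales_before_promo:
    print(row)
```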

SQL Inner Join query

I have the following table structures:
cust_info (columns: cust_id, cust_name)
bill_info (columns: bill_id, cust_id, bill_amount, bill_date)
paid_info (columns: paid_id, bill_id, paid_amount, paid_date)
Now my output should display the records whose bill_date falls between two dates (e.g. 1 Jan 2013 to 1 Feb 2013), with one row per bill, as follows:
cust_name | bill_id | bill_amount | tpaid_amount | bill_date | balance
where tpaid_amount is total paid for particular bill_id
For example, for bill id abcd, bill_amount is 10000 and the user pays 2000 the first time and 3000 the second time, which means the paid_info table contains two entries for the same bill_id:
bill_id | paid_amount
abcd 2000
abcd 3000
so, tpaid_amount = 2000 + 3000 = 5000 and balance = 10000 - tpaid_amount = 10000 - 5000 = 5000
Is there any way to do this with a single query (inner joins)?
You'd want to join the 3 tables, then group them by bill ids and other relevant data, like so.
-- the select line, as well as getting your columns to display, is where you'll work
-- out your computed columns, or what are called aggregate functions, such as tpaid and balance
SELECT c.cust_name, p.bill_id, b.bill_amount, SUM(p.paid_amount) AS tpaid, b.bill_date, b.bill_amount - SUM(p.paid_amount) AS balance
-- joining up the 3 tables here on the id columns that point to the other tables
FROM cust_info c INNER JOIN bill_info b ON c.cust_id = b.cust_id
INNER JOIN paid_info p ON p.bill_id = b.bill_id
-- between pretty much does what it says
WHERE b.bill_date BETWEEN '2013-01-01' AND '2013-02-01'
-- in group by, we not only need to join rows together based on which bill they're for
-- (bill_id), but also any column we want to select in SELECT.
GROUP BY c.cust_name, p.bill_id, b.bill_amount, b.bill_date
A quick overview of group by: It will take your result set and smoosh rows together, based on where they have the same data in the columns you give it. Since each bill will have the same customer name, amount, date, etc, we are fine to group by those as well as the bill id, and we'll get a record for each bill. If we wanted to group it by p.paid_amount, though, since each payment would have a different one of those (possibly), you'd get a record for each payment as opposed to for each bill, which isn't what you'd want. Once group by has smooshed these rows together, you can run aggregate functions such as SUM(column). In this example, SUM(p.paid_amount) totals up all the payments that have that bill_id to work out how much has been paid. For more information, please look at W3Schools chapter on group by in their SQL tutorials.
Hope I've understood this correctly and that this helps you.
This will do the trick;
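select
cust_info.cust_name,
bill_info.bill_id,
bill_info.bill_amount,
-- coalesce turns the NULL sum of an unpaid bill into 0
coalesce(sum(paid_info.paid_amount), 0) as tpaid_amount,
bill_info.bill_date,
bill_info.bill_amount - coalesce(sum(paid_info.paid_amount), 0) as balance
from
cust_info
left outer join bill_info
on cust_info.cust_id = bill_info.cust_id
left outer join paid_info
on bill_info.bill_id = paid_info.bill_id
where
bill_info.bill_date between X and Y -- X and Y are the two bill_date bounds
group by
cust_info.cust_name,
bill_info.bill_id,
bill_info.bill_amount,
bill_info.bill_date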
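Here is a minimal sketch of the three-table aggregation, run against an in-memory SQLite database through Python's sqlite3 module. The customer name, the extra bills efgh and ijkl, and all dates are invented for illustration. Only bill abcd with payments of 2000 and 3000 comes from the question. It assumes an unpaid bill should still be listed with a paid total of 0, which is why paid_info is left-joined and the sum is wrapped in COALESCE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cust_info (cust_id INTEGER, cust_name TEXT);
CREATE TABLE bill_info (bill_id TEXT, cust_id INTEGER,
                        bill_amount INTEGER, bill_date TEXT);
CREATE TABLE paid_info (paid_id INTEGER, bill_id TEXT,
                        paid_amount INTEGER, paid_date TEXT);
INSERT INTO cust_info VALUES (1, 'Alice');            -- hypothetical customer
INSERT INTO bill_info VALUES
    ('abcd', 1, 10000, '2013-01-10'),                 -- from the question
    ('efgh', 1,  4000, '2013-01-20'),                 -- hypothetical, unpaid
    ('ijkl', 1,   500, '2013-03-05');                 -- hypothetical, out of range
INSERT INTO paid_info VALUES
    (1, 'abcd', 2000, '2013-01-15'),
    (2, 'abcd', 3000, '2013-01-25');
""")

bills = conn.execute("""
SELECT c.cust_name,
       b.bill_id,
       b.bill_amount,
       COALESCE(SUM(p.paid_amount), 0)                 AS tpaid_amount,
       b.bill_date,
       b.bill_amount - COALESCE(SUM(p.paid_amount), 0) AS balance
FROM cust_info AS c
INNER JOIN bill_info AS b ON b.cust_id = c.cust_id
LEFT JOIN paid_info AS p ON p.bill_id = b.bill_id     -- keep unpaid bills
WHERE b.bill_date BETWEEN ? AND ?
GROUP BY c.cust_name, b.bill_id, b.bill_amount, b.bill_date
ORDER BY b.bill_id
""", ("2013-01-01", "2013-02-01")).fetchall()

for row in bills:
    print(row)
```

With an inner join to paid_info instead, the unpaid bill efgh would drop out of the result entirely, so the join type is worth choosing deliberately.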

Optimizing Query With Subselect

I'm trying to generate a sales report which lists each product + total sales in a given month. It's a little tricky because the prices of products can change throughout the month. For example:
Between Jan-01 and Jan-15, my company sells 50 Widgets at a cost of $10 each
Between Jan-16 and Jan-31, my company sells 50 more Widgets at a cost of $15 each
The total sales of Widgets for January = (50 * 10) + (50 * 15) = $1250
This setup is represented in the database as follows:
Sales table
Sale_ID ProductID Sale_Date
1 1 2009-01-01
2 1 2009-01-01
3 1 2009-01-02
...
50 1 2009-01-15
51 1 2009-01-16
52 1 2009-01-17
...
100 1 2009-01-31
Prices table
Product_ID Sale_Date Price
1 2009-01-01 10.00
1 2009-01-16 15.00
When a price is defined in the prices table, it is applied to all products sold with the given ProductID from the given SaleDate going forward.
Basically, I'm looking for a query which returns data as follows:
Desired output
Sale_ID ProductID Sale_Date Price
1 1 2009-01-01 10.00
2 1 2009-01-01 10.00
3 1 2009-01-02 10.00
...
50 1 2009-01-15 10.00
51 1 2009-01-16 15.00
52 1 2009-01-17 15.00
...
100 1 2009-01-31 15.00
I have the following query:
SELECT
Sale_ID,
Product_ID,
Sale_Date,
(
SELECT TOP 1 Price
FROM Prices
WHERE
Prices.Product_ID = Sales.Product_ID
AND Prices.Sale_Date <= Sales.Sale_Date
ORDER BY Prices.Sale_Date DESC
) as Price
FROM Sales
This works, but is there a more efficient query than a nested sub-select?
And before you point out that it would just be easier to include "price" in the Sales table, I should mention that the schema is maintained by another vendor and I'm unable to change it. And in case it matters, I'm using SQL Server 2000.
If you start storing start and end dates, or create a view that includes the start and end dates (you can even create an indexed view) then you can heavily simplify your query. (provided you are certain there are no range overlaps)
SELECT
Sale_ID,
Product_ID,
Sale_Date,
Price
FROM Sales
JOIN Prices ON Prices.Product_ID = Sales.Product_ID
AND Sale_Date > StartDate AND Sale_Date <= EndDate
-- careful not to use BETWEEN: it includes both ends
Note:
A technique along these lines will allow you to do this with a view. Note, if you need to index the view, it will have to be juggled around quite a bit ..
create table t (d datetime)
insert t values(getdate())
insert t values(getdate()+1)
insert t values(getdate()+2)
go
create view myview
as
select start = isnull(max(t2.d), '1975-1-1'), finish = t1.d from t t1
left join t t2 on t1.d > t2.d
group by t1.d
select * from myview
start finish
----------------------- -----------------------
1975-01-01 00:00:00.000 2009-01-27 11:12:57.383
2009-01-27 11:12:57.383 2009-01-28 11:12:57.383
2009-01-28 11:12:57.383 2009-01-29 11:12:57.383
It's good to avoid these types of correlated subqueries. Here's a classic technique for such cases.
SELECT
s.Sale_ID,
s.Product_ID,
s.Sale_Date,
p1.Price
FROM Sales AS s
LEFT JOIN Prices AS p1 ON s.Product_ID = p1.Product_ID
AND s.Sale_Date >= p1.Sale_Date
LEFT JOIN Prices AS p2 ON s.Product_ID = p2.Product_ID
AND s.Sale_Date >= p2.Sale_Date
AND p2.Sale_Date > p1.Sale_Date
WHERE p2.Price IS NULL -- no later applicable price exists, so p1 is the most recent
Use a left outer join on the pricing table as p2, and look for a NULL record demonstrating that the matched product-price record found in p1 is the most recent on or before the sales date.
(I would have inner-joined the first price match, but if there is none, it's nice to have the product show up anyway so you know there's a problem.)
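The double left join can be checked on a cut-down version of the question's data. The sketch below uses Python's sqlite3 module instead of SQL Server 2000, keeps only four of the hundred sales (the rows on either side of the price change), and uses the Product_ID spelling for both tables. Those choices are assumptions of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sales (Sale_ID INTEGER, Product_ID INTEGER, Sale_Date TEXT);
CREATE TABLE Prices (Product_ID INTEGER, Sale_Date TEXT, Price REAL);
-- A subset of the question's sales, around the 2009-01-16 price change.
INSERT INTO Sales VALUES
    (1, 1, '2009-01-01'), (50, 1, '2009-01-15'),
    (51, 1, '2009-01-16'), (100, 1, '2009-01-31');
INSERT INTO Prices VALUES
    (1, '2009-01-01', 10.00), (1, '2009-01-16', 15.00);
""")

# p1 matches every price in effect on or before the sale date; p2 looks for
# a later price that is also in effect. Keeping only rows where no such p2
# exists leaves the most recent applicable price in p1.
priced_sales = conn.execute("""
SELECT s.Sale_ID, s.Product_ID, s.Sale_Date, p1.Price
FROM Sales AS s
LEFT JOIN Prices AS p1
       ON p1.Product_ID = s.Product_ID
      AND p1.Sale_Date <= s.Sale_Date
LEFT JOIN Prices AS p2
       ON p2.Product_ID = s.Product_ID
      AND p2.Sale_Date <= s.Sale_Date
      AND p2.Sale_Date > p1.Sale_Date
WHERE p2.Product_ID IS NULL
ORDER BY s.Sale_ID
""").fetchall()

for row in priced_sales:
    print(row)
```

A sale dated before any price was defined would still be returned, with a NULL price, because both joins are outer joins.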
Are you actually running into performance problems or are you just anticipating them? I would implement this exactly as you have, were my hands tied from a schema-modification standpoint as yours are.
I agree with Sean. The code you have written is very clean and understandable. If you are having performance issues, then take the extra effort to make the code faster. Otherwise, you are making the code more complex for no reason. Nested sub-selects are extremely useful when used judiciously.
The combination of Product_ID and Sale_Date is your foreign key. Try a select-join on Product_ID, Sale_Date.