SUM with left outer join gets inflated result - sql

The following query gives me the MRR (monthly recurring revenue) for my customer:
with dims as (
select distinct subscription_id, country_name, product_name from revenue
where site_id = '18XLsHIVSJg' and subscription_id is not null
)
select to_date('2022-07-01') as occurred_date,
count(distinct srm.subscription_id) as subscriptions,
count(distinct srm.receiver_contact) as subscribers,
sum(srm.baseline_mrr) as mrr_srm
from subscription_revenue_mart srm
join dims d on d.subscription_id = srm.subscription_id
where srm.site_id = '18XLsHIVSJg'
-- MRR as of the day before ie June 30th
and to_date(srm.creation_date) < '2022-07-01'
-- Counting the subscriptions active after July 1st
and ((srm.subscription_status = 'SUBL.A') or
-- Counting the subscriptions canceled/deactivated after July 1st
(srm.subscription_status = 'SUBL.C' and (srm.deactivation_date >= '2022-07-01') or (srm.canceled_date >= '2022-07-01')) ) group by 1;
I get a total of $5922.15 but I need to add data from another table to capture upgrades/downgrades a customer makes on a product subscription. Using the same approach as above, I can query my "change" table thusly:
select subscription_id, sum(mrr_change_amount) mrr_change_amount,max(subscription_event_date) subscription_event_date from subscription_revenue_mart_change srmc
where site_id = '18XLsHIVSJg'
and to_date(srmc.creation_date) < '2022-07-01'
and ((srmc.subscription_status = 'SUBL.A')
or (srmc.subscription_status = 'SUBL.C' and (srmc.deactivation_date >= '2022-07-01') or (srmc.canceled_date >= '2022-07-01')))
group by 1;
I get a total of $3635.47
When I combine both queries into one, I get an inflated result:
with dims as (
select distinct subscription_id, country_name, product_name from revenue
where site_id = '18XLsHIVSJg' and subscription_id is not null
),
change as (
select subscription_id, sum(mrr_change_amount) mrr_change_amount,
-- there can be multiple changes per subscription
max(subscription_event_date) subscription_event_date from subscription_revenue_mart_change srmc
where site_id = '18XLsHIVSJg'
and to_date(srmc.creation_date) < '2022-07-01'
and ((srmc.subscription_status = 'SUBL.A')
or (srmc.subscription_status = 'SUBL.C' and (srmc.deactivation_date >= '2022-07-01') or (srmc.canceled_date >= '2022-07-01')))
group by 1
)
select to_date('2022-07-01') as occurred_date,
count(distinct srm.subscription_id) as subscriptions,
count(distinct srm.receiver_contact) as subscribers,
-- See comment RE: LEFT OUTER join
sum(coalesce(c.mrr_change_amount,srm.baseline_mrr)) as mrr
from subscription_revenue_mart srm
join dims d
on d.subscription_id = srm.subscription_id
-- LEFT OUTER join required for customers that never made a change
left outer join change c
on srm.subscription_id = c.subscription_id
where srm.site_id = '18XLsHIVSJg'
and to_date(srm.creation_date) < '2022-07-01'
and ((srm.subscription_status = 'SUBL.A')
or (srm.subscription_status = 'SUBL.C' and (srm.deactivation_date >= '2022-07-01') or (srm.canceled_date >= '2022-07-01'))) group by 1;
It should be $9557.62 ie (5922.15 + $3635.47) but the query outputs $16116.91, which is wrong.
I think the explode-implode syndrome may cause this.
I had designed my "change" CTE to prevent this by aggregating all the relevant fields but it's not working.
Can someone provide pointers on the best way to work around this issue?

It would help if you gave us sample data too, but I see a problem here:
sum(coalesce(c.mrr_change_amount,srm.baseline_mrr)) as mrr
Why COALESCE? That will give you one of the 2 numbers, but I guess what you want is:
sum(ifnull(c.mrr_change_amount, 0) + srm.baseline_mrr) as mrr
That's the best I can offer with what you've given us.

Related

Calculate rolling year totals in sql

I am gathering something that is essentially am "enrollment date" for users. The "enrollment date" is not stored in the database (for a reason too long to explain here), so I have to deduce it from the data. I then want to reuse this CTE in numerous places throughout another query to gather values such as "total orders 1 year before enrollment" and "total orders 1 year after enrollment".
I haven't gotten this code to run, as it's much more complex in my actual data set (this code is paraphrased from the actual code) and I have a feeling it's not the best way to do this. As you can see, my date conditionals are mostly just placeholders, but I think it should be obvious what I am trying to do.
That said, I think this would mostly work. My question is, is there a better way to do this? Additionally, could I combine the rolling year before and rolling year after into one table somehow? (maybe window functions)? This is part of a much bigger query, so the more consolidation I could do, the better it would seem.
For what it's worth, the subquery to derive the "enrollment date" is also more complex than shown here.
With enroll as (Select
user_id,
MIN(date) as e_date
FROM `orders` o
WHERE (subscribed = True)
group by user_id
)
Select*
from users
left join (select
user_id,
SUM(total_paid)
from orders where date > (select enroll.e_date where user_id = user_id) AND date < (select enroll.e_date where user_id = user_id + 365 days)
and order_type = 'special'
group by user_id
) as rolling_year_after on rolling_year_after.user_id = users.user_id
left join (select
user_id,
SUM(total_paid)
from orders where date < (select enroll.e_date where user_id = user_id) and date > (select enroll.e_date where user_id = user_id - 365 days)
and order_type = 'special'
group by user_id
) as rolling_year_before on rolling_year_before.user_id = users.user_id
Maybe something like this, not sure if its more performant, but looks a bit cleaner:
With enroll as (Select
user_id,
MIN(date) as e_date
FROM `orders` o
WHERE (subscribed = True)
group by user_id
)
, rolling_year as (
select
user_id,
SUM(CASE WHEN date between enroll.edate and enroll.edate + 365 days then (total_paid) else 0 end) as rolling_year_after,
SUM(CASE WHEN date between enroll.edate - 365 days and enroll.edate then (total_paid) else 0 end) as rolling_year_before
from orders
left join enroll
on order.user_id = enroll.user_id
where order_type = 'special'
group by user_id
)
Select *
from users
left join rolling_year
on users.user_id = rolling_year.user_id

SELECT list expression references column integration_start_date which is neither grouped nor aggregated at

I'm facing an issue with the following query. It gave me this error [SELECT list expression references column integration_start_date which is neither grouped nor aggregated at [34:63]]. In particular, it points to the first 'when' in the result table, which I don't know how to fix. This is on BigQuery if that helps. I see everything is written correctly or I could be wrong. Seeking for help.
with plan_data as (
select format_date("%Y-%m-%d",last_day(date(a.basis_date))) as invoice_date,
a.sponsor_id as sponsor_id,
b.company_name as sponsor_name,
REPLACE(SUBSTR(d.meta,STRPOS(d.meta,'merchant_id')+12,13),'"','') as merchant_id,
a.state as plan_state,
date(c.start_date) as plan_start_date,
a.employee_id as square_employee_id,
date(
(select min(date)
from glproductionview.stats_sponsors
where sponsor_id = a.sponsor_id and sponsor_payroll_provider_identifier = 'square' and date >= c.start_date) )
as integration_start_date,
count(distinct a.employee_id) as eligible_pts_count, --pts that are in active plan and have payroll activities (payroll deductions) in the reporting month
from glproductionview.payroll_activities as a
left join glproductionview.sponsors as b
on a.sponsor_id = b.id
left join glproductionview.dc_plans as c
on a.plan_id = c.id
left join glproductionview.payroll_connections as d
on a.sponsor_id = d.sponsor_id and d.provider_identifier = 'rocket' and a.company_id = d.payroll_id
where a.payroll_provider_identifier = 'rocket'
and format_date("%Y-%m",date(a.basis_date)) = '2021-07'
and a.amount_cents > 0
group by 1,2,3,4,5,6,7,8
order by 2 asc
)
select invoice_date,
sponsor_id,
sponsor_name,
eligible_pts_count,
case
when eligible_pts_count <= 5 and date_diff(current_date(),integration_start_date, month) <= 12 then 20
when eligible_pts_count <= 5 and date_diff(current_date(),integration_start_date, month) > 12 then 15
when eligible_pts_count > 5 and date_diff(current_date(),integration_start_date, month) <= 12 then count(distinct square_employee_id)*4
when eligible_pts_count > 5 and date_diff(current_date(),integration_start_date, month) > 12 then count(distinct square_employee_id)*3
else 0
end as fees
from plan_data
group by 1,2,3,4;

Summing Two Columns from Two Tables with Two Dates

I work for a CPG company and need to create a report that compares the previous month's delivered units to the next month's forecast. (Simply, our forecasting tool screws up occasionally and this will help identify when the forecast is off.)
My issue is my SQL query is summing forecast sales correctly, but the sum of total delivered is not respecting the dates I have in my WHERE clause -- it's summing total delivered for as far back as the query can reach.
Here is my query:
SELECT
DelUnits.Customer, DelUnits.ObsText01,
FinalFcst.SKU, FinalFcst.Customer,
SUM(DelUnits.Value) AS TotalDelivered,
SUM(FinalFcst.FinalFcst) AS ForecastSales
FROM
DelUnits
LEFT JOIN
FinalFcst ON DelUnits.Customer = FinalFcst.Customer
WHERE
(FinalFcst.DT >= '2018-01-01' and FinalFcst.DT <= '2018-01-31')
AND (DelUnits.Date >= '2017-12-01' and DelUnits.Date <= '2017-12-31')
AND DelUnits.ObsText01 = '10_LB'
AND FinalFcst.SKU = '10_LB'
GROUP BY
DelUnits.Customer, DelUnits.ObsText01, FinalFcst.SKU, FinalFcst.Customer
Again, the query seems to work correctly for the final forecast (summing the forecast between 1/1/18 - 1/31/18) but sums the entire delivery history for a customer. I don't understand why it won't sum the delivery history for just 12/1/17 - 12/31/17.
Thank you for your help!
Presumably, there is only one row for FinalFcst. So, either include it in the GROUP BY clause or use MAX() instead of SUM():
max(FinalFcst.FinalFcst) as ForecastSales
One way to achieve this is to calculate TotalDelivered and ForecastSales in 2 different queries and then join them together.
Try this:
SELECT DelUnits.customer,
DelUnits.obstext01,
FinalFcst.sku,
FinalFcst.customer,
totaldelivered,
forecastsales
FROM (SELECT customer,
obstext01,
Sum(value) AS TotalDelivered
FROM delunits
WHERE date >= '2017-12-01'
AND date <= '2017-12-31'
AND obstext01 = '10_LB'
GROUP BY customer,
obstext01) DelUnits
LEFT JOIN (SELECT customer,
sku,
Sum(finalfcst) AS ForecastSales
FROM finalfcst
WHERE dt >= '2018-01-01'
AND dt <= '2018-01-31'
AND sku = '10_LB'
GROUP BY customer,
sku) FinalFcst ON DelUnits.customer = FinalFcst.customer
You have a many to many relationship between the tables. Ultimately you need to SUM() one table before joining to the other to create a one to many relationship, or you end up duplicating records.
My favorite approach is a derived table:
SELECT C.Customer,
C.ObsText01,
FC.SKU,
C.TotalDelivered,
SUM(FC.FinalFcst) ForecastSales
FROM (SELECT SUM(Value) TotalDelivered, Customer, ObsText01
FROM DelUnits
WHERE Date >= '2017-12-01' AND Date <= '2017-12-31'
AND ObsText01 = '10_LB'
GROUP BY Customer) C
LEFT JOIN FinalFcst FC ON C.Customer = FC.Customer
AND FC.DT >= '2018-01-01'
AND FC.DT <= '2018-01-31'
AND FC.SKU = '10_LB'
GROUP BY C.Customer, C.ObsText01, FC.SKU, C.TotalDelivered
A couple things: Added your forecast table filters to the join predicate, since having those in the WHERE will create an INNER JOIN out of your LEFT JOIN. Also removed FC.Customer from the select and the group since it is redundant with C.Customer.
Maybe you could try to create a temp table to calculate the delivery history. I am not sure of the SQL Server verbiage, but something like this:
WITH DEL_HIST AS
(SELECT DelUnits.Customer,
DelUnits.ObsText01,
sum(DelUnits.Value) as TotalDelivered,
FROM DelUnits
Where(DelUnits.Date >= '2017-12-01' and DelUnits.Date <= '2017-12-31')
and DelUnits.ObsText01 = '10_LB'
Group By DelUnits.Customer, DelUnits.ObsText01)
SELECT
DEL_HIST.Customer,
DEL_HIST.ObsText01,
FinalFcst.SKU,
FinalFcst.Customer,
DEL_HIST.TotalDelivered,
sum(FinalFcst.FinalFcst) as ForecastSales
FROM DEL_HIST
left join FinalFcst ON DelUnits.Customer = FinalFcst.Customer
Where (FinalFcst.DT >= '2018-01-01' and FinalFcst.DT <= '2018-01-31')
and FinalFcst.SKU = '10_LB'
Group By DelUnits.Customer, DelUnits.ObsText01, FinalFcst.SKU, FinalFcst.Customer

SQL Query to show order of work orders

First off sorry for the poor subject line.
EDIT: The Query here duplicates OrderNumbers I am needing the query to NOT duplicate OrderNumbers
EDIT: Shortened the question and provided a much cleaner question
I have a table that has a record of all of the work orders that have been performed. there are two types of orders. Installs and Trouble Calls. My query is to find all of the trouble calls that have taken place within 30 days of an install and match that trouble call (TC) to the proper Install (IN). So the Trouble Call date has to happen after the install but no more than 30 days after. Additionally if there are two installs and two trouble calls for the same account all within 30 days and they happen in order the results have to reflect that. The problem I am having is I am getting an Install order matching to two different Trouble Calls (TC) and a Trouble Call(TC) that is matching to two different Installs(IN)
In the example on SQL Fiddle pay close attention to the install order number 1234567810 and the Trouble Call order number 1234567890 and you will see the issue I am having.
http://sqlfiddle.com/#!3/811df/8
select b.accountnumber,
MAX(b.scheduleddate) as OriginalDate,
b.workordernumber as OriginalOrder,
b.jobtype as OriginalType,
MIN(a.scheduleddate) as NewDate,
a.workordernumber as NewOrder,
a.jobtype as NewType
from (
select workordernumber,accountnumber,jobtype,scheduleddate
from workorders
where jobtype = 'TC'
) a join
(
select workordernumber,accountnumber,jobtype,scheduleddate
from workorders
where jobtype = 'IN'
) b
on a.accountnumber = b.accountnumber
group by b.accountnumber,
b.scheduleddate,
b.workordernumber,
b.jobtype,
a.accountnumber,
a.scheduleddate,
a.workordernumber,
a.jobtype
having MIN(a.scheduleddate) > MAX(b.scheduleddate) and
DATEDIFF(day,MAX(b.scheduleddate),MIN(a.scheduleddate)) < 31
Example of what I am looking for the results to look like.
Thank you for any assistance you can provide in setting me on the correct path.
You were actually very close. I realized that what you really want is the MIN() TC date that is greater than each install date for that account number so long as they are 30 days or less apart.
So really you need to group by the install dates from your result set excluding WorkOrderNumbers still. Something like:
SELECT a.AccountNumber, MIN(a.scheduleddate) TCDate, b.scheduleddate INDate
FROM
(
SELECT WorkOrderNumber, ScheduledDate, JobType, AccountNumber
FROM workorders
WHERE JobType = 'TC'
) a
INNER JOIN
(
SELECT WorkOrderNumber, ScheduledDate, JobType, AccountNumber
FROM workorders
WHERE JobType = 'IN'
) b
ON a.AccountNumber = b.AccountNumber
WHERE b.ScheduledDate < a.ScheduledDate
AND DATEDIFF(DAY, b.ScheduledDate, a.ScheduledDate) <= 30
GROUP BY a.AccountNumber, b.AccountNumber, b.ScheduledDate
This takes care of the dates and AccountNumbers, but you still need the WorkOrderNumbers, so I joined the workorders table back twice, once for each type.
NOTE: I assume that each workorder has a unique date for each account number. So, if you have workorder 1 ('TC') for account 1 done on '1/1/2015' and you also have workorder 2 ('TC') for account 1 done on '1/1/2015' then I can't guarantee that you will have the correct WorkOrderNumber in your result set.
My final query looked like this:
SELECT
aggdata.AccountNumber, inst.workordernumber OriginalWorkOrderNumber, inst.JobType OriginalJobType, inst.ScheduledDate OriginalScheduledDate,
tc.WorkOrderNumber NewWorkOrderNumber, tc.JobType NewJobType, tc.ScheduledDate NewScheduledDate
FROM (
SELECT a.AccountNumber, MIN(a.scheduleddate) TCDate, b.scheduleddate INDate
FROM
(
SELECT WorkOrderNumber, ScheduledDate, JobType, AccountNumber
FROM workorders
WHERE JobType = 'TC'
) a
INNER JOIN
(
SELECT WorkOrderNumber, ScheduledDate, JobType, AccountNumber
FROM workorders
WHERE JobType = 'IN'
) b
ON a.AccountNumber = b.AccountNumber
WHERE b.ScheduledDate < a.ScheduledDate
AND DATEDIFF(DAY, b.ScheduledDate, a.ScheduledDate) <= 30
GROUP BY a.AccountNumber, b.AccountNumber, b.ScheduledDate
) aggdata
LEFT OUTER JOIN workorders tc
ON aggdata.TCDate = tc.ScheduledDate
AND aggdata.AccountNumber = tc.AccountNumber
AND tc.JobType = 'TC'
LEFT OUTER JOIN workorders inst
ON aggdata.INDate = inst.ScheduledDate
AND aggdata.AccountNumber = inst.AccountNumber
AND inst.JobType = 'IN'
select in1.accountnumber,
in1.scheduleddate as OriginalDate,
in1.workordernumber as OriginalOrder,
'IN' as OriginalType,
tc.scheduleddate as NewDate,
tc.workordernumber as NewOrder,
'TC' as NewType
from
workorders in1
out apply (Select min(in2.scheduleddate) as scheduleddate from workorders in2 Where in2.jobtype = 'IN' and in1.accountnumber=in2.accountnumber and in2.scheduleddate>in1.scheduleddate) ins
join workorders tc on tc.jobtype = 'TC' and tc.accountnumber=in1.accountnumber and tc.scheduleddate>in1.scheduleddate and (ins.scheduleddate is null or tc.scheduleddate<ins.scheduleddate) and DATEDIFF(day,in1.scheduleddate,tc.scheduleddate) < 31
Where in1.jobtype = 'IN'

How do I use calendar exceptions to generate accurate schedules using GTFS?

I'm having trouble figuring out the GTFS query to obtain the next 20 schedules for a given stop ID and a given direction.
I know the stop ID, the trip direction ID, the time (now) and the date (today)
I wrote
SELECT DISTINCT ST.departure_time FROM stop_times ST
JOIN trips T ON T._id = ST.trip_id
JOIN calendar C ON C._id = T.service_id
JOIN calendar_dates CD on CD.service_id = T.service_id
WHERE ST.stop_id = 3377699724118483
AND T.direction_id = 0
AND ST.departure_time >= "16:00:00"
AND
(
( C.start_date <= 20140607 AND C.end_date >= 20140607 AND C.saturday= 1 ) // regular service today
AND ( ( CD.date != 20140607 ) // no exception today
OR ( CD.date = 20140607 AND CD.exception_type = 1 ) // or ADDED exception today
)
)
ORDER BY stopTimes.departure_time LIMIT 20
This results in no record being found.
If a remove the last part, dealgin with the CD tables (i.e. the removed or added exceptions), it works perfectly fine.
So I think I'm miswriting the check on the exceptions.
As written above with // comments, I want to check that
today is in a regular service (from checking the calendar table)
there is no removal exception for today (or in this case the trips corresponding to this service id are not included in the computation)
if there is added exception for today, the corresponding trips shall be included in the computation
can you help me with that ?
I'm fairly certain it's not possible to do what you're trying to do with only a single SELECT statement, due to the design of the calendar and calendar_dates tables.
What I do is use a second, inner query to build the set of active service IDs on the requested date, then join the outer query against this set to include only results relevant for that date. Try this:
SELECT DISTINCT ST.departure_time FROM stop_times ST
JOIN trips T ON T._id = ST.trip_id
JOIN (SELECT _id FROM calendar
WHERE start_date <= 20140607
AND end_date >= 20140607
AND saturday = 1
UNION
SELECT service_id FROM calendar_dates
WHERE date = 20140607
AND exception_type = 1
EXCEPT
SELECT service_id FROM calendar_dates
WHERE date = 20140607
AND exception_type = 2
) ASI ON ASI._id = T.service_id
WHERE ST.stop_id = 3377699724118483
AND T.direction_id = 0
AND ST.departure_time >= "16:00:00"
ORDER BY ST.departure_time
LIMIT 20