Accumulating values until up to date - sql

I'm working on an order system where orders come in. For the analytics department I want to build a view that accumulates all sales for a given day.
That is not an issue, I got the working query for that. More complicated is a second number where I want to show the accumulated sales to that day.
Meaning if I have $100 of sales on Feb 1 the column should show $100. If I have $200 of sales on Feb 2 that column should show $300 and so on.
This is what I came up with so far:
select
date_trunc('day', o.created_at) :: date,
sum(o.value) sales_for_day,
count(o.accepted_at) as num_of_orders_for_day,
-- sales_for_month_to_date
-- num_of_orders_for_month_to_date
from
orders o
where
status = 'accepted'
group by
date_trunc('day', o.accepted_at);

Just use window functions:
select date_trunc('day', o.created_at) :: date,
sum(o.value) as sales_for_day,
count(o.accepted_at) as num_of_orders_for_day,
sum(sum(o.value)) over (partition by date_trunc('month', o.accepted_at order by min(o.created_at)) as sales_for_month_to_date
sum(count(*)) over (partition by date_trunc('month', o.accepted_at order by min(o.created_at)) as num_of_orders_for_month_to_date
from orders o
where status = 'accepted'
group by date_trunc('day', o.accepted_at);
Based on the comments in your code, I surmise that you want month-to-date numbers, so this also partitions by month.

Related

How to join partitioned table with another one

Sorry for the newbie question, but I'm really having trouble with the following issue:
Say, I have this code in place:
WITH active_pass AS (SELECT DATE_TRUNC(fr.day, MONTH) AS month, id,
CASE
WHEN SUM(fr.imps) > 100 THEN 1
WHEN SUM(fr.imps) < 100 THEN 0
END AS active_or_passive
FROM table1 AS fr
WHERE day between (CURRENT_DATE() - 730) AND (CURRENT_DATE() - EXTRACT(DAY FROM CURRENT_DATE()))
GROUP BY month, id
ORDER BY month desc),
# summing the score for each customer (sum for the whole year)
active_pass_assigned AS (SELECT id, month,
SUM(SUM(active_or_passive)) OVER (PARTITION BY id ORDER BY month rows BETWEEN 3 PRECEDING AND 1 PRECEDING) AS trailing_act
FROM active_pass AS a
GROUP BY month, id
ORDER BY MONTH desc)
What it does is it creates a trailing total over the last 3 months to see how many of those last 3 month the customer was active. However, I have no idea how to join with the next table to get a sum of revenue that said client generated. What I tried is this:
SELECT c.id, DATE_TRUNC(day, MONTH) AS month, SUM(revenue) AS Rev, name
FROM table2 AS c
JOIN active_pass_assigned AS a
ON c.id = a.id
WHERE day between (CURRENT_DATE() - 365) AND (CURRENT_DATE() - EXTRACT(DAY FROM CURRENT_DATE()))
GROUP BY month, id, name
ORDER BY month DESC
However, it returns waaay higher values for Revenue than the actual ones and I have no idea why. Furthermore, could you please tell me how to join those two tables together so that I only get the customer's revenue on the months his activity was equal to 3?

How to Calculate Full/Repeat Retention in BigQuery SQL

I am trying to calculate a "rolling retention" or "repeat retention" (Not sure what the appropriate name for this is), but a scenario where I only want to count the proportion of users who place an order every single month consecutively.
So if 10 users place an order in Jan 2020, and 5 of them come back in Feb, that would equal a 50% retention.
Now for March, I only want to consider the 5 users who ordered in February, still taking note of the total January cohort size.
So if 2 users from February come back in March, retention for March will be 2/10 = 20%. If a user from Jan who didn't return in Feb, places an order in March, they will not be included in the calculation for March, because they did not return in February.
Basically, this retention will progressively decrease to 0% and can never increase.
Here is what I have done so far:
WITH first_order AS (SELECT
customerEmail,
MIN(orderedat) as firstOrder,
FROM fact AS fact
GROUP BY 1 ),
cohort_data AS (SELECT
first_order.customerEmail,
orderedAt as order_month,
MIN(FORMAT_DATE("%y-%m (%b)", date(firstorder))) as cohort_month,
FROM first_order as first_order
LEFT JOIN fact as fact
ON first_order.customeremail = fact.customeremail
GROUP BY 1,2, FACT.orderedAt),
cohort_count AS (select cohort_month, count(distinct customeremail) AS total_cohort_count FROM cohort_data GROUP BY 1 )
SELECT
cd.cohort_month,
date_trunc(date(cd.order_month), month) as order_month,
total_cohort_count,
count(distinct cd.customeremail) as total_repeat
FROM cohort_data as cd
JOIN cohort_data as last_month
ON cd.customeremail= last_month.customeremail
and date(cd.order_month) = date_add(date(last_month.order_month), interval 1 month)
LEFT JOIN cohort_count AS cc
on cd.cohort_month = cc.cohort_month
GROUP BY 1,2,3
ORDER BY cohort_month, order_month ASC
Here is the result. I'm not sure where I got it wrong but the numbers are too small and the retention increases in some months which shouldn't be.
I did an INNER JOIN in the last query so I could compare the previous month to the current month, but it didn't work exactly how I wanted.
Sample Data:
I'd appreciate any help
I would start with one row per customer per month. Then, I would enumerate the customer/months and keep only those with no gaps . . . and aggregate:
with customer_months as (
select customer_email,
date_trunc(ordered_at, month) as yyyymm,
min(date_trunc(ordered_at, month)) over (partition by customer_email) as first_yyyymm
from cohort_data
group by 1, 2
)
select first_yyyymm, yyyymm, count(*)
from (select cm.*,
row_number() over (partition by custoemr_email order by yyyymm) as seqnum
from customer_months cm
) cm
where yyyymm = date_add(first_yyyymm, interval seqnum - 1 month)
group by 1, 2
order by 1, 2;

Sum of unique customers in rolling trailing 30d window displayed by week

I'm working in SQL Workbench.
I'd like to track every time a unique customer clicks the new feature in trailing 30 days, displayed week over week. An example of the data output would be as follows:
Week 51: Reflects usage through the end of week 51 (Dec 20th) - 30 days. aka Nov 20-Dec 20th
Week 52: Reflects usage through the end of week 52 (Dec 31st) - 30 days. aka Dec 1 - Dec 31st.
Say there are 22MM unique customer clicks that occurred from Nov 20-Dec 20th. Week 51 data = 22MM.
Say there are 25MM unique customer clicks that occurred from Dec 1-Dec 31st. Week 52 data = 25MM. The customer uniqueness is only relevant to that particular week. Aka, if a customer clicks twice in Week 51 they're only counted once. If they click once in Week 51 and once in Week 52, they are counted once in each week.
Here is what I have so far:
select
min_e_date
,sum(count(*)) over (order by min_e_date rows between unbounded preceding and current row) as running_distinct_customers
from (select customer_id, min(DATE_TRUNC('week', event_date)) as min_e_date
from final
group by 1
) c
group by
min_e_date
I don't think a rolling count is the right way to go. As I add in additional parameters (country, subscription), the rolling count doesn't distinguish between them - the figures just get added to the prior row.
Any suggestions are appreciated!
edit Additional data below. Data collection begins on 11/23. No data precedes that date.
You can get the count of distinct customers per week like so:
select date_trunc('week', event_date) as week_start,
count(distinct customer_id) cnt
from final
group by 1
Now if you want a rolling sum of that count(say, the current week and the three preceding weeks), you can use window functions:
select date_trunc('week', event_date) as week_start,
count(distinct customer_id) cnt,
sum(count(distinct customer_id)) over(
order by date_trunc('week', event_date)
range between 3 week preceding and current row
) as rolling_cnt
from final
group by 1
Rolling distinct counts are quite difficult in RedShift. One method is a self-join and aggregation:
select t.date,
count(distinct case when tprev.date >= t.date - interval '6 day' then customer_id end) as trailing_7,
count(distinct customer_id) as trailing_30
from t join
t tprev
on tprev.date >= t.date - interval '29 day' and
tprev.date <= t.date
group by t.date;
If you can get this to work, you can just select every 7th row to get the weekly values.
EDIT:
An entirely different approach is to use aggregation and keep track of when customers enter and end time periods of being counted. This is a pain with two different time frames. Here is what it looks like for one.
The idea is to
Create an enter/exit record for each record being counted. The "exit" is n days after the enter.
Summarize these into periods of activity for each customer. So, there is one record with an enter and exit date. This is a type of gaps-and-islands problem.
Unpivot this result to count +1 for a customer being counted and -1 for a customer not being counted.
Do a cumulative sum of this count.
The code looks something like this:
with cd as (
select customer_id, date,
lead(date) over (partition by customer_id order by date) as next_date,
sum(sum(inc)) over (partition by customer_id order by date) as cnt
from ((select t.customer_id, t.date, 1 as inc
from t
) union all
(select t.customer_id, t.date + interval '7 day', -1
from t
)
) tt
),
cd2 as (
select customer_id, min(date) as enter_date, max(date) as exit_date
from (select cd.*,
sum(case when cnt = 0 then 1 else 0 end) over (partition by customer_id order by date) as grp
from (select cd.*,
lag(cnt) over (partition by customer_id order by date) as prev_cnt
from cd
) cd
) cd
group by customer_id, grp
having max(cnt) > 0
)
select dte, sum(sum(inc)) over (order by dte)
from ((select customer_id, enter_date as dte, 1 as inc
from cd2
) union all
(select customer_id, exit_date as dte, -1 as inc
from cd2
)
) cd2
group by dte;

Is it possible to look at two consecutive rows and determine the difference in time between the two using SQL?

I am relatively new to SQL, so please bear with me! I am trying to see how many customers make a purchase after being dormant for two years. Relevant fields include cust_id and purchase_date (there can be several observations for the same cust_id but with different dates). I am using Redshift for my SQL scripts.
I realize I cannot put the same thing in for the DATEDIFF parameters (it just doesn't make any sense), but I am unsure what else to do.
SELECT *
FROM tickets t
LEFT JOIN d_customer c
ON c.cust_id = t.cust_id
WHERE DATEDIFF(year, t.purchase_date, t.purchase_date) between 0 and 2
ORDER BY t.cust_id, t.purchase_date
;
I think you want lag(). To get the relevant tickets:
SELECT t.*
FROM (SELECT t.*,
LAG(purchase_date) OVER (PARTITION BY cust_id ORDER BY purchase_date) as prev_pd
FROM tickets t
) t
WHERE prev_pd < purchase_date - interval '2 year';
If you want the number of customers, use count(distinct):
SELECT COUNT(DISTINCT cust_id)
FROM (SELECT t.*,
LAG(purchase_date) OVER (PARTITION BY cust_id ORDER BY purchase_date) as prev_pd
FROM tickets t
) t
WHERE prev_pd < purchase_date - interval '2 year';
Note that these do not use DATEDIFF(). This counts the number of boundaries between two date values. So, 2018-12-31 and 2019-01-01 have a difference of 1 year.

SQL: aggregation (group by like) in a column

I have a select that group by customers spending of the past two months by customer id and date. What I need to do is to associate for each row the total amount spent by that customer in the whole first week of the two month time period (of course it would be a repetition for each row of one customer, but for some reason that's ok ). do you know how to do that without using a sub query as a column?
I was thinking using some combination of OVER PARTITION, but could not figure out how...
Thanks a lot in advance.
Raffaele
Query:
select customer_id, date, sum(sales)
from transaction_table
group by customer_id, date
If it's a specific first week (e.g. you always want the first week of the year, and your data set normally includes January and February spending), you could use sum(case...):
select distinct customer_id, date, sum(sales) over (partition by customer_ID, date)
, sum(case when date between '1/1/15' and '1/7/15' then Sales end)
over (partition by customer_id) as FirstWeekSales
from transaction_table
In response to the comments below; I'm not sure if this is what you're looking for, since it involves a subquery, but here's my best shot:
select distinct a.customer_id, date
, sum(sales) over (partition by a.customer_ID, date)
, sum(case when date between mindate and dateadd(DD, 7, mindate)
then Sales end)
over (partition by a.customer_id) as FirstWeekSales
from transaction_table a
left join
(select customer_ID, min(date) as mindate
from transaction_table group by customer_ID) b
on a.customer_ID = b.customer_ID