SQL manipulation of a table (aggregate and grouping)

I would like to make a daily query (using BigQuery) to compare the sums of different metrics between yesterday and today. A sample dataset looks like this:
Assuming today is 23 Dec 2019, the query will aggregate the different metrics (revenue, cost, profit) for each customer for 23 Dec (today) and 22 Dec (yesterday). If sum(yesterday)/sum(today) is not within the threshold of 0.5-1.5, the row will be labelled as anomalous.
The query will be run daily and the new results will simply be appended. Ideally the final table would look like this:
My main concern is that I am able to do this for one metric only (i.e. revenue), but I am not sure how to apply it to all metrics (and also make the query more efficient). This is the code I have written:
SELECT cust_id,
       SUM(CASE WHEN date = DATE_ADD(CURRENT_DATE(), INTERVAL -1 DAY)
                THEN revenue
           END) AS sum_yesterday,
       SUM(CASE WHEN date = DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY)
                THEN revenue
           END) AS sum_today,
       SUM(CASE WHEN date = DATE_ADD(CURRENT_DATE(), INTERVAL -1 DAY)
                THEN revenue
           END) / SUM(CASE WHEN date = DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY)
                           THEN revenue
                      END) AS ratio
FROM `dataset`
GROUP BY cust_id
The code gives me:
Apologies in advance for the lack of clarity in the question; I am new to this and am not sure how to phrase it more accurately.

My suggestion would be to put the source data in an Excel pivot table (move the Values group to the rows to get the desired view).
If you want to stick with SQL, however, you need to unpivot the rows first by putting each measure in a separate row, and then group the intermediate results, like this:
WITH unpivoted AS
(
    SELECT
        date
      , 'revenue' AS metrics
      , SUM( revenue ) AS amount
      , cust_id
    FROM `dataset`
    GROUP BY date, cust_id
    UNION ALL
    SELECT
        date
      , 'cost' AS metrics
      , SUM( cost ) AS amount
      , cust_id
    FROM `dataset`
    GROUP BY date, cust_id
    -- add more desired metrics
)
SELECT
    date AS date_generated
  , cust_id
  , metrics
  , SUM( CASE WHEN date = DATE_ADD( CURRENT_DATE() , INTERVAL 0 DAY ) THEN amount END ) AS today
  , SUM( CASE WHEN date = DATE_ADD( CURRENT_DATE() , INTERVAL -1 DAY ) THEN amount END ) AS yesterday
  ...
FROM unpivoted
WHERE date >= DATE_ADD(CURRENT_DATE(), INTERVAL -1 DAY )
  AND date <= DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY )
GROUP BY date, cust_id, metrics

You can summarize the data and then use lag() or a join to bring in the previous day's data:
with t as (
      select cust_id, date,
             sum(revenue) as revenue,
             sum(cost) as cost,
             sum(profit) as profit
      from dataset
      where date >= date_add(current_date, interval -1 day)
      group by cust_id, date
     )
select today.cust_id,
       today.revenue, today.cost, today.profit,
       yesterday.revenue as revenue_yesterday,
       yesterday.cost as cost_yesterday,
       yesterday.profit as profit_yesterday
from t today left join
     t yesterday
     on yesterday.cust_id = today.cust_id and
        yesterday.date = date_add(current_date, interval -1 day)
where today.date = current_date;
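The lag() variant mentioned above would look roughly like this; a sketch assuming BigQuery Standard SQL and reusing the same summary CTE (QUALIFY filters rows after the window functions have run):
with t as (
      select cust_id, date,
             sum(revenue) as revenue,
             sum(cost) as cost,
             sum(profit) as profit
      from dataset
      where date >= date_add(current_date, interval -1 day)
      group by cust_id, date
     )
select cust_id, date, revenue, cost, profit,
       lag(revenue) over (partition by cust_id order by date) as revenue_yesterday,
       lag(cost) over (partition by cust_id order by date) as cost_yesterday,
       lag(profit) over (partition by cust_id order by date) as profit_yesterday
from t
where true  -- BigQuery requires a WHERE, GROUP BY or HAVING clause alongside QUALIFY
qualify date = current_date;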

You can unpivot the columns first and then group the results. After that, you might need to use LAG() to show data from one day and the previous one in the same row.
WITH unpivoted AS
(
SELECT
date,
'revenue' AS metrics,
SUM( revenue ) AS amount,
cust_id
FROM
`dataset`
GROUP BY
date, metrics, cust_id
UNION ALL
SELECT
date,
'cost' AS metrics,
SUM( cost ) AS amount,
cust_id
FROM
`dataset`
GROUP BY
date, metrics, cust_id
UNION ALL
SELECT
date,
'profit' AS metrics,
SUM( profit ) AS amount,
cust_id
FROM
`dataset`
GROUP BY
date, metrics, cust_id
)
SELECT
date as date_generated,
metrics,
cust_id,
LAG(SUM( amount )) OVER (PARTITION BY cust_id, metrics ORDER BY date) yesterday,
SUM( amount ) AS today,
LAG(SUM( amount )) OVER (PARTITION BY cust_id, metrics ORDER BY date) / SUM(amount) as ratio,
CASE WHEN LAG(SUM( amount )) OVER (PARTITION BY cust_id, metrics ORDER BY date) / SUM(amount)<0.5 then 'TRUE'
WHEN LAG(SUM( amount )) OVER (PARTITION BY cust_id, metrics ORDER BY date) / SUM(amount)>1.5 then 'TRUE'
WHEN LAG(SUM( amount )) OVER (PARTITION BY cust_id, metrics ORDER BY date) / SUM(amount) is NULL then 'TRUE'
ELSE 'FALSE'
END as anomalous
FROM
unpivoted
WHERE date >= DATE_ADD(CURRENT_DATE(), INTERVAL -1 DAY ) AND date <= DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY )
GROUP BY
date_generated, cust_id, metrics
ORDER BY
date_generated, metrics, cust_id
Note that my solution is limited to the current day and the previous day (today and yesterday) only because of the WHERE clause; if you widen or drop that filter, the same query can aggregate metrics across more than two days.
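As for running this daily and appending the results (as the question asks), one option is to wrap such a query in an INSERT into a results table and schedule it (BigQuery supports scheduled queries). A minimal sketch for the revenue metric only, assuming BigQuery Standard SQL and a pre-created table `project.dataset.anomaly_results` (hypothetical name); the other metrics would be added via UNION ALL, as in the answers above:
INSERT INTO `project.dataset.anomaly_results`
    (date_generated, cust_id, metrics, yesterday, today, ratio, anomalous)
SELECT
    date_generated, cust_id, metrics, yesterday, today, ratio,
    -- a NULL ratio (missing day or zero denominator) is treated as anomalous, as in the answer above
    COALESCE(ratio NOT BETWEEN 0.5 AND 1.5, TRUE) AS anomalous
FROM (
    SELECT
        CURRENT_DATE() AS date_generated,
        cust_id,
        'revenue' AS metrics,
        SUM(IF(date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY), revenue, NULL)) AS yesterday,
        SUM(IF(date = CURRENT_DATE(), revenue, NULL)) AS today,
        SAFE_DIVIDE(
            SUM(IF(date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY), revenue, NULL)),
            SUM(IF(date = CURRENT_DATE(), revenue, NULL))) AS ratio
    FROM `dataset`
    WHERE date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) AND CURRENT_DATE()
    GROUP BY cust_id
    -- UNION ALL the same block for cost and profit
);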

Related

Sum of unique customers in rolling trailing 30d window displayed by week

I'm working in SQL Workbench.
I'd like to track every time a unique customer clicks the new feature in the trailing 30 days, displayed week over week. An example of the data output would be as follows:
Week 51: Reflects usage through the end of week 51 (Dec 20th) - 30 days, i.e. Nov 20-Dec 20th.
Week 52: Reflects usage through the end of week 52 (Dec 31st) - 30 days, i.e. Dec 1-Dec 31st.
Say there are 22MM unique customer clicks that occurred from Nov 20-Dec 20th. Week 51 data = 22MM.
Say there are 25MM unique customer clicks that occurred from Dec 1-Dec 31st. Week 52 data = 25MM. The customer uniqueness is only relevant to that particular week. That is, if a customer clicks twice in Week 51 they're only counted once; if they click once in Week 51 and once in Week 52, they are counted once in each week.
Here is what I have so far:
select min_e_date,
       sum(count(*)) over (order by min_e_date rows between unbounded preceding and current row) as running_distinct_customers
from (select customer_id, min(DATE_TRUNC('week', event_date)) as min_e_date
      from final
      group by 1
     ) c
group by min_e_date
I don't think a rolling count is the right way to go. As I add in additional parameters (country, subscription), the rolling count doesn't distinguish between them - the figures just get added to the prior row.
Any suggestions are appreciated!
Edit: additional data below. Data collection begins on 11/23; no data precedes that date.
You can get the count of distinct customers per week like so:
select date_trunc('week', event_date) as week_start,
count(distinct customer_id) cnt
from final
group by 1
Now if you want a rolling sum of that count (say, the current week and the three preceding weeks), you can use window functions:
select date_trunc('week', event_date) as week_start,
       count(distinct customer_id) cnt,
       sum(count(distinct customer_id)) over(
           order by date_trunc('week', event_date)
           range between interval '3 weeks' preceding and current row
       ) as rolling_cnt
from final
group by 1
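On the concern about additional parameters: the window can be partitioned, so each country (or subscription) keeps its own rolling figure. A rough sketch on top of the query above, where country is an assumed column:
select date_trunc('week', event_date) as week_start,
       country,
       count(distinct customer_id) cnt,
       sum(count(distinct customer_id)) over(
           partition by country
           order by date_trunc('week', event_date)
           range between interval '3 weeks' preceding and current row
       ) as rolling_cnt
from final
group by 1, 2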
Rolling distinct counts are quite difficult in Redshift. One method is a self-join and aggregation:
select t.date,
       count(distinct case when tprev.date >= t.date - interval '6 day' then tprev.customer_id end) as trailing_7,
       count(distinct tprev.customer_id) as trailing_30
from t join
     t tprev
     on tprev.date >= t.date - interval '29 day' and
        tprev.date <= t.date
group by t.date;
If you can get this to work, you can just select every 7th row to get the weekly values.
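Selecting every 7th row could then be done with row_number(); a sketch that wraps the query above, ordering descending so the most recent day is always among the kept rows:
with daily_counts as (
      select t.date,
             count(distinct case when tprev.date >= t.date - interval '6 day' then tprev.customer_id end) as trailing_7,
             count(distinct tprev.customer_id) as trailing_30
      from t join
           t tprev
           on tprev.date >= t.date - interval '29 day' and
              tprev.date <= t.date
      group by t.date
     )
select date, trailing_7, trailing_30
from (select dc.*,
             row_number() over (order by date desc) as seqnum
      from daily_counts dc
     ) dc
where seqnum % 7 = 1;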
EDIT:
An entirely different approach is to use aggregation and keep track of when customers enter and exit time periods of being counted. This is a pain with two different time frames, so here is what it looks like for one.
The idea is to:
1. Create an enter/exit record for each record being counted. The "exit" is n days after the enter.
2. Summarize these into periods of activity for each customer, so there is one record per period with an enter and an exit date. This is a type of gaps-and-islands problem.
3. Unpivot this result to count +1 for a customer starting to be counted and -1 for a customer no longer being counted.
4. Do a cumulative sum of this count.
The code looks something like this:
with cd as (
      select customer_id, date,
             lead(date) over (partition by customer_id order by date) as next_date,
             sum(sum(inc)) over (partition by customer_id order by date) as cnt
      from ((select t.customer_id, t.date, 1 as inc
             from t
            ) union all
            (select t.customer_id, t.date + interval '7 day', -1
             from t
            )
           ) tt
      group by customer_id, date
     ),
     cd2 as (
      select customer_id, min(date) as enter_date, max(date) as exit_date
      from (select cd.*,
                   sum(case when prev_cnt = 0 then 1 else 0 end) over (partition by customer_id order by date) as grp
            from (select cd.*,
                         lag(cnt) over (partition by customer_id order by date) as prev_cnt
                  from cd
                 ) cd
           ) cd
      group by customer_id, grp
      having max(cnt) > 0
     )
select dte, sum(sum(inc)) over (order by dte) as active_customers
from ((select customer_id, enter_date as dte, 1 as inc
       from cd2
      ) union all
      (select customer_id, exit_date as dte, -1 as inc
       from cd2
      )
     ) cd2
group by dte;

Finding DAU/MAU ratio in SQL

There's a table with customerID, timestamp, and activity columns, and I found the DAU (daily active users) and MAU (monthly active users) from this table. Now I need to find DAU/MAU. The problem is that I got DAU and MAU as two separate queries, as they need to be grouped by day and month respectively.
Also, DAU would be a table, since it's grouped by day and would have 30 rows. MAU is just a single number. How can I find DAU/MAU, which is apparently a ratio?
My query for DAU
select date, count(distinct customerID) as dau
from table
where extract(month from timestamp) = 1 and extract(year from timestamp) = 2020
and activity = 'opened_the_app'
group by date
This gives me the DAU for all 31 days in the month of January.
Similarly, I found the MAU by grouping by month, which gives me a single value for January.
How can I find the DAU/MAU ratio for January?
You can join them together:
select d.*, d.dau * 1.0 / m.mau
from (select date, count(distinct customerID) as dau
from table
where timestamp >= '2020-01-01' and
timestamp < '2020-02-01' and
activity = 'opened_the_app'
group by date
) d cross join
(select count(distinct customerID) as mau
from table
where timestamp >= '2020-01-01' and
timestamp < '2020-02-01' and
activity = 'opened_the_app'
) m
You can also derive it from the DAU table itself, but only if each customer is active on at most one day in the month; in that case the MAU equals the sum of the DAU (otherwise customers active on several days are double-counted, and the join above is the safer option):
select date, dau * 1.0 / sum(dau) over () as dau_mau_ratio
from (
    select date, count(distinct customerID) as dau
    from table
    where extract(month from timestamp) = 1 and extract(year from timestamp) = 2020
      and activity = 'opened_the_app'
    group by date
) dau_table

How to calculate average and latest cost of a product over a date range in the same SQL query

I have a table with a product and its cost over a time range. I need to calculate the average cost over the period, with the latest cost to date also considered in the average, and I need to fetch the current cost. How can I achieve both in the same query?
Input Table
I am looking for output like:
product | average_cost | current_cost
(average cost is (cost * days at that cost) / total days till today's date)
You can use date arithmetic and conditional aggregation:
select product,
       sum(cost * datediff(day, beg_date, (case when end_date > getdate() then getdate() else end_date end))) /
           sum(datediff(day, beg_date, (case when end_date > getdate() then getdate() else end_date end))) as average_cost,
       max(case when end_date > getdate() then cost end) as current_cost
from t
group by product;

I need a query to compare one Saturday's total sales with the rest of the year's average Saturday's total sales

My data set's fields are ts, quantity, unit_price.
I first need to run sum(quantity * unit_price) to get my sales number.
ts (timestamp) is formatted like this: 2019-01-15 14:55:00 UTC
Is this what you want?
select avg(case when dte = ? then total end) as sales_your_date,
       avg(case when dte <> ? then total end) as sales_other
from (select date(t.ts) as dte, sum(t.quantity * t.unit_price) as total
      from t
      where ts >= timestamp('2018-01-01') and
            ts < timestamp('2019-01-01')
      group by dte
     ) t
where extract(dayofweek from dte) = 7 -- Saturday in BigQuery (1 = Sunday)
This is not much different from your previous question. The same idea works, just with aggregating the data first.
? is for the date you care about.
Below is for BigQuery Standard SQL
#standardSQL
SELECT DATE(ts) AS sale_date, quantity * unit_price AS sale_total,
ROUND((SUM(quantity * unit_price) OVER() - quantity * unit_price) / (COUNT(1) OVER() - 1), 2) AS sale_rest_average
FROM `project.dataset.table`
WHERE EXTRACT(DAYOFWEEK FROM DATE(ts)) = 7
AND EXTRACT(YEAR FROM DATE(ts)) = 2018
In case your timestamp field is of TIMESTAMP data type (vs STRING), you can use just
WHERE EXTRACT(DAYOFWEEK FROM ts) = 7
AND EXTRACT(YEAR FROM ts) = 2018

Day-level calculation in Oracle SQL

I have two years of static data, with columns stats_date and P2P_volume. Initially I created the following query for a single day in Oracle SQL Developer:
select '1' as KPI_ID, 'P2P' as KPI_DESC, '22-MAR-17' as dates,
       (sum(case when STATS_DATE between add_months('22-MAR-17', 0) - 13
                                     and add_months('22-MAR-17', 0) - 7 then P2P_VOLUME else 0 end)) LAST_WEEK_VOLUME,
       (sum(case when STATS_DATE between add_months('22-MAR-17', 0) - 6
                                     and add_months('22-MAR-17', 0) then P2P_VOLUME else 0 end)) THIS_WEEK_VOLUME
from table
My problem is that I want to create a dynamic query which gives me last_week_volume and this_week_volume date-wise for two years, rather than for a single day.
In the absence of a complete set of sample data and requirements, here are my assumptions:
- There is one row per day per KPI.
- The definition of current week / previous week can be satisfied by using the 'IW' date format mask.
This solution uses a subquery to calculate the sum of the measures for each week. This feeds into the main query, which uses the analytic lag() function to show the totals for the current week and the previous week.
with cte as (
select kpi
, to_char(static_date, 'YYYY') as yr
, to_char(static_date, 'IW') as wk
, sum(volume) as wk_volume
, sum(value) as wk_value
, sum(revenue) as wk_revenue
from t23
group by kpi,to_char(static_date, 'YYYY'), to_char(static_date, 'IW')
)
select kpi
, yr||'-'||wk as year_wk
, wk_volume as curr_wk_volume
, lag(wk_volume) over (order by yr, wk) as prev_wk_volume
, wk_value as curr_wk_value
, lag(wk_value) over (order by yr, wk) as prev_wk_value
, wk_revenue as curr_wk_revenue
, lag(wk_revenue) over (order by yr, wk) as prev_wk_revenue
from cte
order by 1, 2
/
There is a SQL Fiddle demo here.