How to find the number of purchases over time intervals in SQL

I'm using Redshift (Postgres-based) and pandas to do my work. I'm trying to get the number of user actions; let's say purchases, to make it easier to understand. I have a table, purchases, that holds the following data:
user_id | timestamp  | price
1       | 2015-02-01 | 200
1       | 2015-02-02 | 50
1       | 2015-02-10 | 75
Ultimately I would like the number of purchases per user over certain time intervals, such as:
userid, 28-14_days, 14-7_days, 7
Here is what I have so far (I'm aware I don't have an upper limit on the dates):
SELECT DISTINCT x_days.user_id,
       SUM(x_days.purchases) AS x_num,
       SUM(y_days.purchases) AS y_num,
       x_days.x_date, y_days.y_date
FROM (
    SELECT purchases.user_id,
           COUNT(purchases.user_id) AS purchases,
           DATE(purchases.timestamp) AS x_date
    FROM purchases
    WHERE purchases.timestamp > (current_date - INTERVAL '%(x_days_ago)s day')
      AND purchases.max_value > 200
    GROUP BY DATE(purchases.timestamp), purchases.user_id
) AS x_days
JOIN (
    SELECT purchases.user_id,
           COUNT(purchases.user_id) AS purchases,
           DATE(purchases.timestamp) AS y_date
    FROM purchases
    WHERE purchases.timestamp > (current_date - INTERVAL '%(y_days_ago)s day')
      AND purchases.max_value > 200
    GROUP BY DATE(purchases.timestamp), purchases.user_id
) AS y_days
ON x_days.user_id = y_days.user_id
GROUP BY x_days.user_id, x_days.x_date, y_days.y_date
params = {'x_days_ago': x_days_ago, 'y_days_ago': y_days_ago}
where these are set in Python/pandas:
x_days_ago = 14
y_days_ago = 7
But this didn't work out exactly as planned:
user_id x_num y_num x_date y_date
0 5451772 1 1 2015-02-10 2015-02-10
1 5026678 1 1 2015-02-09 2015-02-09
2 6337993 2 1 2015-02-14 2015-02-13
3 6204432 1 3 2015-02-10 2015-02-11
4 3417539 1 1 2015-02-11 2015-02-11
Even though I don't have an upper date bound (so x effectively searches from 14 days ago to now and y from 7 days ago to now, meaning the windows overlap), in some cases y_num is higher than x_num.
Can anyone help me either fix this or give me a better way?
Thanks!

It might not be the most efficient answer, but you can generate each sum with a sub-select:
WITH
summed AS (
SELECT user_id, day, COUNT(1) AS purchases
FROM (SELECT user_id, DATE(timestamp) AS day FROM purchases) AS _
GROUP BY user_id, day
),
users AS (SELECT DISTINCT user_id FROM purchases)
SELECT user_id,
(SELECT SUM(purchases) FROM summed
WHERE summed.user_id = users.user_id
AND day >= DATE(NOW() - interval ' 7 days')) AS days_7,
(SELECT SUM(purchases) FROM summed
WHERE summed.user_id = users.user_id
AND day >= DATE(NOW() - interval '14 days')) AS days_14
FROM users;
(This was tested in Postgres, not in Redshift; but the Redshift documentation suggests that both WITH and DISTINCT are supported.) I would have liked to do this with a window, to obtain rolling sums; but it's a little onerous without generate_series().
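A single-pass alternative is conditional aggregation, which avoids the join and the per-user sub-selects entirely. This is only a sketch (untested on Redshift) against the asker's purchases table, with the 7- and 14-day windows hard-coded and the max_value filter omitted:
SELECT user_id,
       -- each CASE counts a row only when it falls inside that window
       COUNT(CASE WHEN timestamp >= current_date - INTERVAL '7 day' THEN 1 END) AS days_7,
       COUNT(CASE WHEN timestamp >= current_date - INTERVAL '14 day' THEN 1 END) AS days_14
FROM purchases
GROUP BY user_id;
Since both windows are counted from the same rows, days_14 can never be smaller than days_7, which sidesteps the overlap confusion in the original join.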

Related

Retrieve Customers with a Monthly Order Frequency greater than 4

I am trying to optimize the below query to fetch all customers in the last three months who have a monthly order frequency of 4 or more in each of those months.
Customer ID | Feb | Mar | Apr
0001        | 4   | 5   | 6
0002        | 3   | 2   | 4
0003        | 4   | 2   | 3
In the above table, only the customer with Customer ID 0001 should be picked, as they consistently have 4 or more orders per month.
Below is a query I have written, which pulls all customers with an average purchase frequency of 4 in the last 90 days, but it does not check for a consistent purchase frequency of 4 or more in each of the last three months.
Query:
SELECT distinct lines.customer_id Customer_ID, (COUNT(lines.order_id)/90) PurchaseFrequency
from fct_customer_order_lines lines
LEFT JOIN product_table product
ON lines.entity_id= product.entity_id
AND lines.vendor_id= product.vendor_id
WHERE LOWER(product.country_code)= "IN"
AND lines.date >= DATE_SUB(CURRENT_DATE() , INTERVAL 90 DAY )
AND lines.date < CURRENT_DATE()
GROUP BY Customer_ID
HAVING PurchaseFrequency >=4;
I tried to use window functions; however, I'm not sure if they are needed in this case.
I would sum the orders per month instead of computing the average, and then retrieve those who have that sum greater than or equal to 4 in each of the last three months.
Also, I think you should select your interval using "month(CURRENT_DATE()) - 3" instead of using a window of 90 days. Of course, if needed, you should handle the case where current_date falls in Jan-Feb-Mar, and in that case go back to Oct-Nov-Dec of the previous year.
I'm not familiar with Google BigQuery so I can't write your query, but I hope this helps.
So I've found the solution to this using the WITH operator, as below:
WITH filtered_orders AS (
select
distinct customer_id ID,
extract(MONTH from date) Order_Month,
count(order_id) CountofOrders
from customer_order_lines lines
where EXTRACT(YEAR FROM date) = 2022 AND EXTRACT(MONTH FROM date) IN (2,3,4)
group by ID, Order_Month
having CountofOrders>=4)
select distinct ID
from filtered_orders
group by ID
having count(Order_Month) =3;
Hope this helps!
An option could be to first count the orders by month and then filter for users whose purchases are above your threshold in every month:
WITH ORDERS_BY_MONTH AS (
SELECT
DATE_TRUNC(lines.date, MONTH) PurchaseMonth,
lines.customer_id Customer_ID,
COUNT(lines.order_id) PurchaseFrequency
FROM fct_customer_order_lines lines
LEFT JOIN product_table product
ON lines.entity_id= product.entity_id
AND lines.vendor_id= product.vendor_id
WHERE LOWER(product.country_code)= "IN"
AND lines.date >= DATE_SUB(CURRENT_DATE() , INTERVAL 90 DAY )
AND lines.date < CURRENT_DATE()
GROUP BY PurchaseMonth, Customer_ID
)
SELECT
Customer_ID,
AVG(PurchaseFrequency) AvgPurchaseFrequency
FROM ORDERS_BY_MONTH
GROUP BY Customer_ID
HAVING COUNT(1) = COUNTIF(PurchaseFrequency >= 4)
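Two notes on this approach. The final HAVING works because COUNT(1) counts every month the customer appears in, while COUNTIF(PurchaseFrequency >= 4) counts only the months that clear the threshold; requiring them to be equal keeps exactly the customers who clear it in every month. Separately, a fixed 90-day window can clip the earliest month; a month-aligned filter (a sketch, assuming lines.date is a DATE) would be:
WHERE lines.date >= DATE_SUB(DATE_TRUNC(CURRENT_DATE(), MONTH), INTERVAL 3 MONTH)
  AND lines.date < DATE_TRUNC(CURRENT_DATE(), MONTH)
This covers the three complete calendar months before the current one, including across a year boundary.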

Postgresql Sum statistics by month for a single year

I have the following table in the database:
item_id, integer
item_name, character varying
price, double precision
user_id, integer
category_id, integer
date, date
item_id | item_name    | price | user_id | category_id | date
1       | Pizza        | 2.99  | 1       | 2           | '2020-01-01'
2       | Cinema       | 5     | 1       | 3           | '2020-01-01'
3       | Cheeseburger | 4.99  | 1       | 2           | '2020-01-01'
4       | Rental       | 100   | 1       | 1           | '2020-01-01'
Now I want to get statistics for the total price for each month of a year. It should support all items as well as a single category, both for all time and for a specified time period. For example, using this
SELECT EXTRACT(MONTH from date), COALESCE(SUM(price), 0)
FROM item_table
WHERE user_id = 1 AND category_id = 3 AND date BETWEEN '2020-01-01' AND '2021-01-01'
GROUP BY date_part
ORDER BY date_part;
I expect to obtain this:
date_part | total
1         | 5
2         | 0
3         | 0
...       | ...
12        | 0
However, I get this:
date_part | total
1         | 5
1) How can I get a zero value for months where no items in the specified category are found? (Right now those months are just skipped.)
2) The above example gives the statistics for the selected category within some time period. For all my purposes I would need to write 3 more queries (all time and all categories / all time, single category / single year, all categories). Is there a single query for all these cases (when some parameters like category_id or date are null)?
You can get the "empty" months by doing a right join against a table that contains the month numbers and moving the WHERE criteria into the JOIN criteria:
-- Create a temporary "table" for month numbers
WITH months AS (SELECT * FROM generate_series(1, 12) AS t(n))
SELECT months.n as date_part, COALESCE(SUM(price), 0) AS total
FROM item_table
RIGHT JOIN months ON EXTRACT(MONTH from date) = months.n
AND user_id = 1 AND category_id = 3 AND "date" BETWEEN '2020-01-01' AND '2021-01-01'
GROUP BY months.n
ORDER By months.n;
I'm not quite sure what you want from your second part, but you could take a look at Grouping Sets, Cube and Rollup.
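For the second part, one common pattern is to make each filter optional with a NULL check, so a single statement covers all the combinations. A sketch (untested; $1 and $2 are hypothetical bind parameters for the category and year):
WITH months AS (SELECT * FROM generate_series(1, 12) AS t(n))
SELECT months.n AS date_part, COALESCE(SUM(price), 0) AS total
FROM item_table
RIGHT JOIN months ON EXTRACT(MONTH from date) = months.n
    AND user_id = 1
    AND ($1 IS NULL OR category_id = $1)              -- NULL means all categories
    AND ($2 IS NULL OR EXTRACT(YEAR FROM date) = $2)  -- NULL means all time
GROUP BY months.n
ORDER BY months.n;
Passing NULL for a parameter disables that filter, so the same query serves the single-category, all-category, single-year, and all-time cases.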

Aggregating two values in same select statement. Second aggregation is decreasing in value for each row for some reason

I'm currently trying to aggregate two values simultaneously in one select statement; however, the second aggregated value is decreasing for some reason. I know what I'm doing is wrong, but I don't understand why it's wrong (assuming the problem is in the very last code block). I'm mainly just trying to better understand what's going on, and why it's happening.
I already have a corrected query that works (at the bottom).
Note: the query and outputs are simplified; please ignore any syntax issues. Additionally, in the real query I need to keep the subscription_start_date field in until the end.
Query with issue (very last block):
WITH max_product_user_count AS (
-- The total count is obtained when "days" = 0
SELECT
subscription_start_date,
datediff('days', subscription_start_date, subscription_date) AS days,
product,
num_users AS total_user_count
FROM users
WHERE days = 0
),
daily_product_user_count AS (
-- As "days" go up, the number of subscribers for each start date/product type decreases
SELECT
subscription_start_date,
datediff('days', subscription_start_date, subscription_date) AS days,
product,
num_users AS daily_user_count
FROM users
WHERE days IN (0,5,14,21,30,33,60)
)
-- Trying to aggregate by product and day, across all subscription start dates
SELECT
d.product,
d.days,
SUM(daily_user_count) AS daily_count,
SUM(total_user_count) AS total_count
FROM daily_product_user_count d
INNER JOIN max_product_user_count m ON d.subscription_start_date = m.subscription_start_date
AND d.product = m.product
GROUP BY 1,2
ORDER BY 1,2
Current Output:
PRODUCT DAYS DAILY_COUNT TOTAL_COUNT
product_1 0 10000 10000
product_1 5 99231 99781
product_1 14 96124 98123
product_1 21 85123 96441
product_1 30 23412 94142
product_1 33 12931 92111
product_1 60 10231 90123
Expected Output:
PRODUCT DAYS DAILY_COUNT TOTAL_COUNT
product_1 0 10000 10000
product_1 5 99231 10000
product_1 14 96124 10000
product_1 21 85123 10000
product_1 30 23412 10000
product_1 33 12931 10000
product_1 60 10231 10000
Updated correct query:
WITH max_product_user_count AS (
SELECT
subscription_start_date,
datediff('days', subscription_start_date, subscription_date) AS days,
product,
num_users AS total_user_count
FROM users
WHERE days = 0
),
max_user_count_aggregation AS (
SELECT
product,
SUM(total_user_count) AS total_count
FROM max_product_user_count
GROUP BY 1
),
daily_product_user_count AS (
SELECT
subscription_start_date,
datediff('days', subscription_start_date, subscription_date) AS days,
product,
num_users AS daily_user_count
FROM users
WHERE days IN (0,5,14,21,30,33,60)
),
daily_user_count_aggregation AS (
SELECT
product,
days,
SUM(daily_user_count) AS daily_count
FROM daily_product_user_count
GROUP BY 1, 2
)
SELECT
d.product,
d.days,
daily_count,
total_count
FROM daily_user_count_aggregation d
INNER JOIN max_user_count_aggregation m ON d.product = m.product
ORDER BY 1,2
If I understand what you are trying to do, the query is way more complicated than necessary. I think this does what you want:
SELECT datediff('days', subscription_start_date, subscription_date) AS days,
product,
SUM(num_users) FILTER (WHERE days IN (0, 5, 14, 21, 30, 33, 60)) AS daily_user_count,
SUM(num_users) FILTER (WHERE days = 0) AS total_user_count
FROM users
GROUP BY days, product;
I would advise you to ask a new question, explaining the logic you want to implement and providing reasonable sample data and desired results.
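For what it's worth, the likely cause of the decreasing TOTAL_COUNT in the original query: max_product_user_count has one row per (subscription_start_date, product) cohort, and the inner join only keeps the cohorts that also have a row at a given days offset. The newest cohorts haven't reached the larger offsets yet, so as days grows, fewer cohorts survive the join and SUM(total_user_count) shrinks. Aggregating each side by product before joining, as the corrected query does, removes that dependency.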

7-day user count: Big-Query self-join to get date range and count?

My Google Firebase event data is integrated into BigQuery, and I'm trying to fetch from it one of the metrics Firebase gives me automatically: the 1-day, 7-day, and 28-day user counts.
The 1-day count is quite straightforward:
SELECT
"1-day" as period,
events.event_date,
count(distinct events.user_pseudo_id) as uid
FROM
`your_path.events_*` as events
WHERE events.event_name = "session_start"
group by events.event_date
with a neat result like
period event_date uid
1-day 20190609 5
1-day 20190610 7
1-day 20190611 5
1-day 20190612 7
1-day 20190613 37
1-day 20190614 73
1-day 20190615 52
1-day 20190616 36
But to me it gets complicated when I try to count, for each day, how many unique users I had in the previous 7 days.
From the above query, I know my target value for day 20190616 is 142, obtained by filtering to those 7 days and removing the group by condition.
The solution I tried is a direct self join (and variations of it that didn't change the result):
SELECT
"7-day" as period,
events.event_date,
count(distinct user_events.user_pseudo_id) as uid
FROM
`your_path.events_*` as events,
`your_path.events_*` as user_events
WHERE user_events.event_name = "session_start"
and PARSE_DATE("%Y%m%d", events.event_date) between DATE_SUB(PARSE_DATE("%Y%m%d", user_events.event_date), INTERVAL 7 DAY) and PARSE_DATE("%Y%m%d", user_events.event_date) #one day in the first table should correspond to 7 days worth of events in the second
and events.event_date = "20190616" #fixed date to check
group by events.event_date
Now, I know I'm barely setting any join conditions, but if anything I expected that to produce cross joins and huge results. Instead, the count this way is 70, which is a lot lower than expected. Furthermore, I can set INTERVAL 2 DAY and the result does not change.
I'm clearly doing something very wrong here, but I also thought that the way I'm doing it is very rudimentary, and there must be a smarter way to accomplish this.
I have checked Calculating a current day 7 day active user with BigQuery?, but the explicit cross join there is with event_dim, whose definition I'm unsure about.
I also checked the solution provided at Rolling 90 days active users in BigQuery, improving performance (DAU/MAU/WAU), as suggested in a comment.
That solution seemed sound at first, but it has problems for the more recent days. Here's the query using COUNT(DISTINCT) that I adapted to my case:
SELECT DATE_SUB(event_date, INTERVAL i DAY) date_grp
, COUNT(DISTINCT user_pseudo_id) unique_90_day_users
, COUNT(DISTINCT IF(i<29,user_pseudo_id,null)) unique_28_day_users
, COUNT(DISTINCT IF(i<8,user_pseudo_id,null)) unique_7_day_users
, COUNT(DISTINCT IF(i<2,user_pseudo_id,null)) unique_1_day_users
FROM (
SELECT PARSE_DATE("%Y%m%d",event_date) as event_date, user_pseudo_id
FROM `your_path_here.events_*`
WHERE EXTRACT(YEAR FROM PARSE_DATE("%Y%m%d",event_date))=2019
GROUP BY 1, 2
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp
and here is the result for the latest days (consider that the data starts on May 23rd), where you can see that the result is wrong:
row_num date_grp 90-day 28-day 7-day 1-day
114 2019-06-16 273 273 273 210
115 2019-06-17 78 78 78 78
So for the last days, the 90-day, 28-day, and 7-day counts only consider that very same day instead of all the days before.
It's not possible for the 90-day count on June 17th to be 78 if the 1-day count on June 16th was higher.
This is AN answer to my own question.
My means are rudimentary, as I'm not extremely familiar with BQ shortcuts and some advanced functions, but the result is correct.
I hope others will be able to follow up with better queries.
#standardSQL
WITH dates AS (
SELECT i as event_date
FROM UNNEST(GENERATE_DATE_ARRAY('2019-05-24', CURRENT_DATE(), INTERVAL 1 DAY)) i
)
, ptd_dates as (
SELECT DISTINCT "90-day" as day_category, FORMAT_DATE("%Y%m%d",event_date) AS event_date, FORMAT_DATE("%Y%m%d",DATE_SUB(event_date, INTERVAL i-1 DAY)) as ptd_date
FROM dates,
UNNEST(GENERATE_ARRAY(1, 90)) i
UNION ALL
SELECT distinct "28-day" as day_category, FORMAT_DATE("%Y%m%d",event_date) AS event_date, FORMAT_DATE("%Y%m%d",DATE_SUB(event_date, INTERVAL i-1 DAY)) as ptd_date
FROM dates,
UNNEST(GENERATE_ARRAY(1, 29)) i
UNION ALL
SELECT distinct "7-day" as day_category, FORMAT_DATE("%Y%m%d",event_date) AS event_date, FORMAT_DATE("%Y%m%d",DATE_SUB(event_date, INTERVAL i-1 DAY)) as ptd_date
FROM dates,
UNNEST(GENERATE_ARRAY(1, 7)) i
UNION ALL
SELECT distinct "1-day" as day_category, FORMAT_DATE("%Y%m%d",event_date) AS event_date, FORMAT_DATE("%Y%m%d",event_date) as ptd_date
FROM dates
)
SELECT event_date,
sum(IF(day_category="90-day",unique_ptd_users,null)) as count_90_day ,
sum(IF(day_category="28-day",unique_ptd_users,null)) as count_28_day,
sum(IF(day_category="7-day",unique_ptd_users,null)) as count_7_day,
sum(IF(day_category="1-day",unique_ptd_users,null)) as count_1_day
from (
SELECT ptd_dates.day_category
, ptd_dates.event_date
, COUNT(DISTINCT user_pseudo_id) unique_ptd_users
FROM ptd_dates,
`your_path_here.events_*` events,
unnest(events.event_params) e_params
WHERE ptd_dates.ptd_date = events.event_date
GROUP BY ptd_dates.day_category
, ptd_dates.event_date)
group by event_date
order by 1,2,3
As per the suggestion from ECris, I first defined a calendar table to use: it contains 4 categories of PTDs (periods to date). Each is generated from basic elements, so it should scale linearly, and since it doesn't query the event dataset it has no gaps.
The join is then made against events, where the join condition shows how, for each date, I count distinct users across all the related days in the period.
The results are correct.
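A side note on why the original self-join likely undercounted: with events.event_date pinned to "20190616", the condition events.event_date BETWEEN DATE_SUB(user_events.event_date, INTERVAL 7 DAY) AND user_events.event_date forces user_events.event_date into the seven days starting on 20190616, not ending there, so the query was counting the following week; that also explains why shrinking the interval barely changed the result. A more compact rolling count (a sketch, untested, assuming the same events_* schema) cross joins distinct user-days against a generated calendar and counts conditionally:
#standardSQL
WITH daily AS (
  -- one row per user per active day
  SELECT PARSE_DATE("%Y%m%d", event_date) AS event_date, user_pseudo_id
  FROM `your_path_here.events_*`
  WHERE event_name = "session_start"
  GROUP BY 1, 2
)
SELECT day,
  -- count a user only if one of their active days falls in the trailing week
  COUNT(DISTINCT IF(d.event_date BETWEEN DATE_SUB(day, INTERVAL 6 DAY) AND day,
                    d.user_pseudo_id, NULL)) AS users_7_day
FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2019-05-23', CURRENT_DATE())) AS day
CROSS JOIN daily d
GROUP BY day
ORDER BY day
The cross join costs calendar-days times user-days, so it is fine for modest volumes but worth benchmarking against the calendar-table approach above.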

Google Big Query SQL - Get most recent unique value by date

#EDIT - Following the comments, I have rephrased my question.
I have a BigQuery table that I want to use to get some KPIs of my application.
In this table, I save each create or update as a new line in order to keep a better history.
So I have the same data several times, each with a different state.
Example of the table:
uuid |status |date
––––––|–––––––––––|––––––––––
3 |'inactive' |2018-05-12
1 |'active' |2018-05-10
1 |'inactive' |2018-05-08
2 |'active' |2018-05-08
3 |'active' |2018-05-04
2 |'inactive' |2018-04-22
3 |'inactive' |2018-04-18
We can see that we have multiple entries for each uuid, each with a different state.
What I would like to get:
I would like to have the number of currently 'active' entries (so there must be no later 'inactive' entry with the same uuid). And to complicate everything, I need this total per day.
So for each day: the number of 'active' entries, including those from previous days.
So with this example I should have this result :
date | actives
____________|_________
2018-05-02 | 0
2018-05-03 | 0
2018-05-04 | 1
2018-05-05 | 1
2018-05-06 | 1
2018-05-07 | 1
2018-05-08 | 2
2018-05-09 | 2
2018-05-10 | 3
2018-05-11 | 3
2018-05-12 | 2
I've actually managed to get the right number of actives for one day. But my problem is getting the results for each day.
What I've tried:
I'm stuck with two solutions that each return a different error.
First solution:
WITH
dates AS(
SELECT GENERATE_DATE_ARRAY(
DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), CURRENT_DATE(), INTERVAL 1 DAY)
arr_dates )
SELECT
i_date date,
(
SELECT COUNT(uuid)
FROM (
SELECT
uuid, status, date,
RANK() OVER(PARTITION BY uuid ORDER BY date DESC) rank
FROM users
WHERE
PARSE_DATE("%Y-%m-%d", FORMAT_DATETIME("%Y-%m-%d",date)) <= i_date
)
WHERE
status = 'active'
and rank = 1
## rank is the condition which causes the error
) users
FROM
dates, UNNEST(arr_dates) i_date
ORDER BY i_date;
The SELECT with the RANK() OVER correctly returns the users, with a rank column that lets me know which entry is the last one for each uuid.
But when I try this, I get a:
Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN. error, because of the rank = 1 condition.
Second solution:
WITH
dates AS(
SELECT GENERATE_DATE_ARRAY(
DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), CURRENT_DATE(), INTERVAL 1 DAY)
arr_dates )
SELECT
i_date date,
(
SELECT
COUNT(t1.uuid)
FROM
users t1
WHERE
t1.date = (
SELECT MAX(t2.date)
FROM users t2
WHERE
t2.uuid = t1.uuid
## Here that's the i_date condition which causes problem
AND PARSE_DATE("%Y-%m-%d", FORMAT_DATETIME("%Y-%m-%d", t2.date)) <= i_date
)
AND status='active' ) users
FROM
dates,
UNNEST(arr_dates) i_date
ORDER BY i_date;
Here, the inner select is working too, correctly returning the number of active users for a given day.
But the problem comes when I try to use i_date to retrieve data across the multiple days.
And here I get a LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join. error...
Which solution is more likely to succeed? What should I change?
And, if my way of storing the data isn't good, how should I proceed in order to keep a precise history?
Below is for BigQuery Standard SQL
#standardSQL
SELECT date, COUNT(DISTINCT uuid) total_active
FROM `project.dataset.table`
WHERE status = 'active'
GROUP BY date
-- ORDER BY date
Update to address your "rephrased" question :o)
Below example is using dummy data from your question
#standardSQL
WITH `project.dataset.users` AS (
SELECT 3 uuid, 'inactive' status, DATE '2018-05-12' date UNION ALL
SELECT 1, 'active', '2018-05-10' UNION ALL
SELECT 1, 'inactive', '2018-05-08' UNION ALL
SELECT 2, 'active', '2018-05-08' UNION ALL
SELECT 3, 'active', '2018-05-04' UNION ALL
SELECT 2, 'inactive', '2018-04-22' UNION ALL
SELECT 3, 'inactive', '2018-04-18'
), dates AS (
SELECT day FROM UNNEST((
SELECT GENERATE_DATE_ARRAY(MIN(date), MAX(date))
FROM `project.dataset.users`
)) day
), active_users AS (
SELECT uuid, status, date first, DATE_SUB(next_status.date, INTERVAL 1 DAY) last FROM (
SELECT uuid, date, status, LEAD(STRUCT(status, date)) OVER(PARTITION BY uuid ORDER BY date ) next_status
FROM `project.dataset.users` u
)
WHERE status = 'active'
)
SELECT day, COUNT(DISTINCT uuid) actives
FROM dates d JOIN active_users u
ON day BETWEEN first AND IFNULL(last, day)
GROUP BY day
-- ORDER BY day
with result
Row day actives
1 2018-05-04 1
2 2018-05-05 1
3 2018-05-06 1
4 2018-05-07 1
5 2018-05-08 2
6 2018-05-09 2
7 2018-05-10 3
8 2018-05-11 3
9 2018-05-12 2
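The key step above is the active_users CTE: LEAD(STRUCT(status, date)) pairs every row with that uuid's next status change, so each 'active' row becomes a validity interval [first, last] that ends the day before the next change (or stays open-ended, handled by IFNULL(last, day) in the join). Joining the calendar against those intervals then counts each uuid on exactly the days it was active.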
I think this -- or something similar -- will do what you want:
SELECT day,
coalesce(running_actives, 0) - coalesce(running_inactives, 0)
FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2015-05-11'), DATE('2018-06-29'), INTERVAL 1 DAY)
) AS day left join
(select date, sum(countif(status = 'active')) over (order by date) as running_actives,
sum(countif(status = 'inactive')) over (order by date) as running_inactives
from t
group by date
) a
on a.date = day
order by day;
The exact solution depends on whether the "inactive" is inclusive of the day (as above) or takes effect the next day. Either is handled the same way, by using cumulative sums of actives and inactives and then taking the difference.
In order to get data for all days, this generates the days using arrays and unnest(). If you have data on all days, that step may be unnecessary.