I have created a basic database (picture attached) Database, I am trying to find the following:
"Median total amount spent per user in each calendar month"
I tried the following, but getting errors:
SELECT
user_id,
AVG(total_per_user)
FROM (SELECT user_id,
ROW_NUMBER() over (ORDER BY total_per_user DESC) AS desc_total,
ROW_NUMBER() over (ORDER BY total_per_user ASC) AS asc_total
FROM (SELECT EXTRACT(MONTH FROM created_at) AS calendar_month,
user_id,
SUM(amount) AS total_per_user
FROM transactions
GROUP BY calendar_month, user_id) AS total_amount
ORDER BY user_id) AS a
WHERE asc_total IN (desc_total, desc_total+1, desc_total-1)
GROUP BY user_id
;
In Postgres, you could just use aggregate function percentile_cont():
select
user_id,
percentile_cont(0.5) within group(order by total_per_user) median_total_per_user
from (
select user_id, sum(amount) total_per_user
from transactions
group by date_trunc('month', created_at), user_id
) t
group by user_id
Note that date_trunc() is probably closer to what you want than extract(month from ...) - unless you do want to sum amounts of the same month for different years together, which is not how I understood your requirement.
Just use percentile_cont(). I don't fully understand the question. If you want the median of the monthly spending, then:
SELECT user_id,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_per_user
ROW_NUMBER() over (ORDER BY total_per_user DESC) AS desc_total,
ROW_NUMBER() over (ORDER BY total_per_user ASC) AS asc_total
FROM (SELECT DATE_TRUNC('month', created_at) AS calendar_month,
user_id, SUM(amount) AS total_per_user
FROM transactions t
GROUP BY calendar_month, user_id
) um
GROUP BY user_id;
There is a built-in function for median. No need for fancier processing.
Related
How can I count the new and existing users by categories and years?
For instance, during 2015-2020 if someone bought a product in category_A in 2016 first, it will be counted as a new uesr in 2016 in category_A although this user bought a product in category_B in 2015.
Table_1 (Columns: product_name, date, category, sales, user_id)
Want to get the result as bleow
One approach uses two levels of aggregation:
select extract(year from mindate) yr, category, count(*) num_new
from (
select user_id, category, min(date) mindate
from table_1
group by user_id, category
) t
group by extract(year from mindate)
The subquery retrieves the first purchase date of each user by category. Then, the outer query aggregates by the year of that date.
If you want the count of current users as well, then it is a bit different. You can use a window function in the subquery rather than aggregation, then count distinct values in the outer query:
select extract(year from mindate) yr, category,
countdistinctif(user_id, date = mindate) num_new,
countdistinct(user_id) num_total
from (
select date, user_id, category, min(date) over(partition by user_id, category) mindate
from table_1
) t
group by extract(year from mindate)
Below is for BigQuery Standard SQL
#standardSQL
WITH temp AS (
SELECT *,
0 = COUNT(1) OVER(
PARTITION BY user_id, category
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
) new_user
FROM `project.dataset.table_1`
ORDER BY date, user_id
)
SELECT EXTRACT(YEAR FROM date) AS year,
category,
COUNT(DISTINCT IF(new_user, user_id, NULL)) AS num_new,
COUNT(DISTINCT IF(new_user, NULL, user_id)) AS num_existing
FROM temp
GROUP BY year, category
I would like to modify the query below so it only keeps the highest VISIT_flag value grouped by CUSTOMER_ID, TRANS_TO_DATE and then average VISIT_flag by CUSTOMER_ID.
I'm having challenges figuring out how to take the maximum DENSE_RANK() value and aggregate by taking the average.
(
SELECT
CUSTOMER_ID,
TRANS_TO_DATE ,
DENSE_RANK() OVER( PARTITION BY CUSTOMER_ID, TRANS_TO_DATE ORDER BY HOUR - RN) VISIT_flag
from (
SELECT
CUSTOMER_ID,
TRANS_TO_DATE,
TO_NUMBER(REGEXP_SUBSTR(HOUR,'\d+$')) HOUR,
ROW_NUMBER() OVER( PARTITION BY CUSTOMER_ID, TRANS_TO_DATE ORDER BY TO_NUMBER(REGEXP_SUBSTR(HOUR,'\d+$')) ) as RN
FROM mstr_clickstream
GROUP BY CUSTOMER_ID, TRANS_TO_DATE, REGEXP_SUBSTR(HOUR,'\d+$')
)
ORDER BY CUSTOMER_ID, TRANS_TO_DATE
Following your logic in order to get the last VISIT_flag, meaning the last "visit" occured within a day, you must order (within the DENSE_RANK) descending. Though descending is solving the problem of getting the last visit, you cannot calculate the average visits of customer because the VISIT_flag will always be 1. So to bypass this issue you must declare a second DENSE_RANK with the same partition by and ascending order by in order to quantify the visits of the day and calculate your average. So the derived query
SELECT customer_id,AVG(quanitify) FROM (
SELECT
customer_id,
trans_to_date ,
DENSE_RANK() OVER( PARTITION BY CUSTOMER_ID, TRANS_TO_DATE ORDER BY HOUR DESC, RN DESC, rownum) VISIT_flag,
DENSE_RANK() OVER( PARTITION BY CUSTOMER_ID, TRANS_TO_DATE ORDER BY HOUR ASC, RN ASC, rownum) quanitify FROM (
SELECT
CUSTOMER_ID,
TRANS_TO_DATE,
TO_NUMBER(REGEXP_SUBSTR(HOUR,'\d+$')) HOUR,
ROW_NUMBER() OVER( PARTITION BY CUSTOMER_ID, TRANS_TO_DATE ORDER BY TO_NUMBER(REGEXP_SUBSTR(HOUR,'\d+$')) ) as RN
FROM mstr_clickstream
GROUP BY CUSTOMER_ID, TRANS_TO_DATE, REGEXP_SUBSTR(HOUR,'\d+$') )) WHERE VISIT_flag = 1 GROUP BY customer_id
Now to be honest the above query can be implemented with easier way without using DENSE_RANK. The above query makes sense only if you remove GROUP BY customer_id from outer query and AVG calculation and you want to get information about the last visit.
In any case you may find below the easier way
SELECT CUSTOMER_ID,AVG(cnt) avg_visits FROM (
SELECT CUSTOMER_ID, TRANS_TO_DATE, count(*) cnt FROM (
SELECT
CUSTOMER_ID,
TRANS_TO_DATE,
TO_NUMBER(REGEXP_SUBSTR(HOUR,'\d+$')) HOUR,
ROW_NUMBER() OVER( PARTITION BY CUSTOMER_ID, TRANS_TO_DATE ORDER BY TO_NUMBER(REGEXP_SUBSTR(HOUR,'\d+$')) ) as RN
FROM mstr_clickstream
GROUP BY CUSTOMER_ID, TRANS_TO_DATE, REGEXP_SUBSTR(HOUR,'\d+$'))
GROUP BY CUSTOMER_ID, TRANS_TO_DATE) GROUP BY CUSTOMER_ID
P.S. i always include rownnum in dense_rank order by statement, in order to prevent exceptional cases (that always exists one to the database :D ) of having the same transaction_time. This will produce two records with the same dense_rank and might an issue to the application that uses the query data.
Consider the following relation
column measured_at holds thousands of different timestamps and column cell_id holds the number of the cell tower used at each timestamp. I want to query for each day saved in measured_at, which cell tower has the most occurences (used the most at that day, here is time irrelevant, only the date is to query). This probably can be done using window functions, but I want to do it using only aggregate functions and simple queries.
an output should look like for example:
cell_id measured_at
27997442 2015-12-22
for the above example because on 22-12-2015 tower number 27997442 has been used the most.
You can use aggregation and distinct on. To get the counts:
select date_trunc(date, measured_at) as dte, cell_id, count(*) as cnt
from t
group by dte, cell_id
And then extend this for only one value:
select distinct on (date_trunc(date, measured_at)) date_trunc(date, measured_at) as dte, cell_id, count(*) as cnt
from t
group by dte, cell_id
order by date_trunc(date, measured_at), count(*) desc;
Of course, you can use window functions as well -- and that is a better approach if you want to get ties as well:
select dte, cell_id, cnt
from (select date_trunc(date, measured_at) as dte, cell_id, count(*) as cnt,
rank() over (partition by date_trunc(date, measured_at) order by count(*) desc) as seqnum
from t
group by dte, cell_id
) dc
where seqnum = 1;
I would like to run the following query in BigQuery, ideally as efficiently as possible. The idea is that I have all of these rows corresponding to tests (taken daily) by millions of users and I want to determine, of the users who have been active for over a year, how much each user has improved.
"Improvement" in this case is the average of the first N subtracted from the last N.
In this example, N is 30. (I've also added in the where cnt >= 100 part because I don't want to consider users who took a test a long time ago and just came back to try once more.)
select user_id,
avg(score) filter (where seqnum_asc <= 30) as first_n_avg,
avg(score) filter (where seqnum_desc <= 30) as last_n_avg
from (select *,
row_number() over (partition by user_id order by created_at) as seqnum_asc,
row_number() over (partition by user_id order by created_at desc) as seqnum_desc,
count(*) over (partition by user_id) as cnt
from tests
) t
where cnt >= 100
group by user_id
having max(created_at) >= min(created_at) + interval '1 year';
Just use conditional aggregation and fix the date functions:
select user_id,
avg(case when seqnum_asc <= 30 then score end) as first_n_avg,
avg(case when seqnum_desc <= 30 then score end) as last_n_avg
from (select *,
row_number() over (partition by user_id order by created_at) as seqnum_asc,
row_number() over (partition by user_id order by created_at desc) as seqnum_desc,
count(*) over (partition by user_id) as cnt
from tests
) t
where cnt >= 100
group by user_id
having max(created_at) >= timestamp_add(min(created_at), interval 1 year);
The function in the having clause could be timetamp_add(), datetime_add(), or date_add(), depending on the type of created_at.
I'm attempting to find the number of new users per month by product type. However, I continue to receive an error requesting cnt to be used in an aggregate function.
SELECT EXTRACT(MONTH FROM date) AS month
FROM (SELECT users.date,
COUNT(*) OVER(PARTITION BY product_type) AS cnt FROM users) AS u
GROUP BY month
ORDER BY cnt DESC;
That seems like a very strange construct. Here is a method that doesn't use window functions:
select date_trunc('month', date) as yyyymm, product_id, count(*)
from (select distinct on (u.userid) u.*
from users u
order by u.userid, u.date
) u
group by date_trunc('month', date), product_id
order by yyyymm, product_id;