I have transaction data like this:
User_ID   Purchase_Date
12345     2022-08-02
12231     2022-06-25
12231     2022-07-15
13421     2022-07-12
23132     2022-05-02
15231     2022-04-09
I want to calculate a monthly rolling unique count of users that updates on a weekly basis. Each week must be a full week running from Monday to Sunday.
Here is the desired output:
Unique_User_ID_Count   start_week_date   end_week_date
403                    2022-07-04        2022-07-31
562                    2022-06-27        2022-07-24
312                    2022-06-20        2022-07-17
and so on; the data goes back 3 years.
Using the code below, I am able to get the first row of the desired output, but I am not sure how to get rows 2 and 3 (and so on, going back 3 years).
SELECT
  COUNT(DISTINCT user_id) AS Unique_User_ID_Count,
  MIN(Purchase_Date) AS start_week_date,
  MAX(Purchase_Date) AS end_week_date
FROM table
WHERE Purchase_Date >= DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH), WEEK(MONDAY))
  AND Purchase_Date <= DATE_TRUNC(CURRENT_DATE() - 6, WEEK(SUNDAY))
Any help is appreciated
You could use CTEs to compute the auxiliary data you need. With your starting dataset, I would do the following:
with data as (
  select
    User_ID,
    Purchase_Date,
    DATE_TRUNC(Purchase_Date, WEEK(MONDAY)) as start_week_date,
    DATE_ADD(DATE_TRUNC(Purchase_Date, WEEK(MONDAY)), INTERVAL 6 DAY) as end_week_date
  from your_database
)
select distinct
  count(distinct User_ID) over (partition by start_week_date, end_week_date) as Unique_User_ID_Count,
  start_week_date,
  end_week_date
from data
That should work.
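If the window has to cover a full month of Monday-to-Sunday weeks (28 days) and advance one week at a time, as in the desired output, a minimal sketch of one way to extend this is below. It assumes the same placeholder table your_database with User_ID and Purchase_Date; the 3-year lookback and the 28-day window length are taken from the question.
-- A sketch, not a drop-in solution: generate the Monday of every completed week
-- over the last 3 years, then count distinct users in the 28-day window that
-- ends on that week's Sunday.
with week_starts as (
  select week_start
  from unnest(GENERATE_DATE_ARRAY(
         DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 3 YEAR), WEEK(MONDAY)),
         DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY), WEEK(MONDAY)),
         INTERVAL 7 DAY)) as week_start
)
select
  count(distinct t.User_ID) as Unique_User_ID_Count,
  DATE_SUB(w.week_start, INTERVAL 21 DAY) as start_week_date,
  DATE_ADD(w.week_start, INTERVAL 6 DAY) as end_week_date
from week_starts w
join your_database t
  on t.Purchase_Date between DATE_SUB(w.week_start, INTERVAL 21 DAY)
                         and DATE_ADD(w.week_start, INTERVAL 6 DAY)
group by w.week_start
order by start_week_date desc
Each generated week start produces one output row whose window runs from three Mondays earlier up to that week's Sunday, matching rows like 2022-07-04 to 2022-07-31 in the desired output.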
I think what you need is something like this:
SELECT
    DATEADD(DAY, 1 - DATEPART(WEEKDAY, DateField) + 1, CONVERT(INT, DateField)) AS week_start,
    DATEADD(DAY, 1 - DATEPART(WEEKDAY, DateField) + 7, CONVERT(INT, DateField)) AS week_end,
    COUNT(*) AS row_count
FROM Table1
GROUP BY
    DATEADD(DAY, 1 - DATEPART(WEEKDAY, DateField) + 1, CONVERT(INT, DateField)),
    DATEADD(DAY, 1 - DATEPART(WEEKDAY, DateField) + 7, CONVERT(INT, DateField))
If the data is big, I'd convert the date to a day number, integer-divide by 7, and group on that, which I think groups the same weeks together; but then you'll have a bit more work on the frontend to turn those week numbers back into dates.
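For illustration only, a minimal T-SQL sketch of that integer-division grouping, assuming the same placeholder table Table1 and DATETIME column DateField:
-- Days since 1900-01-01 (a Monday), integer-divided by 7, give a Monday-to-Sunday
-- week bucket without any DATEPART calls.
SELECT
    DATEDIFF(DAY, '19000101', DateField) / 7 AS week_bucket,
    COUNT(*) AS row_count
FROM Table1
GROUP BY DATEDIFF(DAY, '19000101', DateField) / 7;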
Edit: I've updated the examples to reflect the logic; before, they only reflected the data structure.
Having data like this:
id timestamp
3 2022-10-01 12:45:47 UTC
3 2022-10-01 12:45:27 UTC
3 2022-10-01 12:45:17 UTC
1 2022-09-29 15:26:40 UTC
2 2022-09-29 13:15:38 UTC
1 2022-09-29 12:08:28 UTC
2 2022-09-26 16:17:15 UTC
(Basically, every id can have a lot of timestamps throughout a single day over months of time.)
I would like to have the average time between each pair of consecutive timestamps within a week, so something like:
id week averageTimeSec
3 2022-09-26 15 (2022-10-01 12:45:27 - 2022-10-01 12:45:17 = 10 sec, 2022-10-01 12:45:47 - 2022-10-01 12:45:27 = 20 sec)
2 2022-09-26 248303 (2022-09-29T13:15:38Z - 2022-09-26T16:17:15Z = 248303 sec)
1 2022-09-26 11892 (2022-09-29 15:26:40 - 2022-09-29 12:08:28 = 11892 sec)
The idea is to see the frequency of these events generated over a long period of time. Say, two months ago some ID would generate an event on average every 100 seconds, one month ago - 50 seconds, and so on.
I don't normally work with BigQuery or SQL, and a task like this tripped me up. I can imagine an approach I would use to solve the problem using InfluxDB's Flux, but it looks like that knowledge helps me none when it comes to BigQuery... I've started reading the BigQuery documentation and so far I don't have a proper idea of how to achieve the desired result. If someone could point me in the right direction, I would greatly appreciate it.
What I have managed to achieve is something like this (it just shows the average number of events per ID within each week, not the actual time between the events):
SELECT week, AVG(connectionCount)
FROM (SELECT id,
TIMESTAMP_TRUNC(timestamp, WEEK) week,
COUNT(timestamp) connectionCount
FROM `allEvents`
GROUP BY id, week
ORDER BY id, week)
GROUP BY week
ORDER BY week
Solution
Thanks to https://stackoverflow.com/users/356506/daryl-wenman-bateson I think I have exactly what I needed. Here is a full example with data:
WITH testData AS (
SELECT '3' as id, TIMESTAMP('2022-10-01 12:45:47 UTC') as timestamp UNION ALL
SELECT '3', TIMESTAMP('2022-10-01 12:45:27 UTC') UNION ALL
SELECT '3', TIMESTAMP('2022-10-01 12:45:17 UTC') UNION ALL
SELECT '1', TIMESTAMP('2022-09-29 15:26:40 UTC') UNION ALL
SELECT '2', TIMESTAMP('2022-09-29 13:15:38 UTC') UNION ALL
SELECT '1', TIMESTAMP('2022-09-29 12:08:28 UTC') UNION ALL
SELECT '2', TIMESTAMP('2022-09-26 16:17:15 UTC')
)
SELECT
id,
week,
AVG(timeDifference) diff
FROM
(
SELECT
id,
timestamp,
TIMESTAMP_TRUNC(timestamp, ISOWEEK) week,
UNIX_SECONDS(timestamp) - LAG(UNIX_SECONDS(timestamp)) OVER (PARTITION BY id ORDER BY timestamp) AS timeDifference
FROM
testData
)
GROUP BY
id,
week
ORDER BY
diff
Here is the output: average gaps of 15 seconds for id 3, 11892 seconds for id 1 and 248303 seconds for id 2, matching the desired result above.
Use LAG to find the previous time, partitioned by id and ordered by the timestamp column, and then average the result.
SELECT id,
       week,
       AVG(timedifference) AS diff
FROM
(
    SELECT id,
           event,
           TIMESTAMP_TRUNC(event, WEEK) AS week,
           UNIX_SECONDS(event) - LAG(UNIX_SECONDS(event)) OVER (PARTITION BY id ORDER BY event) AS timedifference
    FROM `DataSet.TimeDif`
)
GROUP BY id,
         week
The following result is returned for your data
(note: numbers are large because your events are several days apart and shown in seconds)
The question I have is very similar to the question here, but I am using Presto SQL (on AWS Athena) and couldn't find information on loops in Presto.
To reiterate the issue, I want a query that:
Given a table that contains: Day, Number of Items for this Day
Returns: Day, Average Items for the Last 7 Days before "Day"
So if I have a table that has data from Dec 25th to Jan 25th, my output table should have data from Jan 1st to Jan 25th, and for each day from Jan 1st to 25th it will be the average number of items from the last 7 days.
Is it possible to do this with Presto?
Maybe you can try this one.
The calendar common table expression (CTE) is used to generate the dates between the two ends of the date range.
with calendar as (
select date_generated
from (
values (sequence(date'2021-12-25', date'2022-01-25', interval '1' day))
) as t1(date_array)
cross join unnest(date_array) as t2(date_generated)),
The temp CTE is used to build, for each date, a date group containing that date and the previous 6 days.
temp as (select c1.date_generated as date_groups
, format_datetime(c2.date_generated, 'yyyy-MM-dd') as dates
from calendar c1, calendar c2
where c2.date_generated between c1.date_generated - interval '6' day and c1.date_generated
and c1.date_generated >= date'2021-12-25' + interval '6' day)
Output for this part:
date_groups   dates
2022-01-01    2021-12-26
2022-01-01    2021-12-27
2022-01-01    2021-12-28
2022-01-01    2021-12-29
2022-01-01    2021-12-30
2022-01-01    2021-12-31
2022-01-01    2022-01-01
The last part joins the day column from your table with each date and then groups by the date group.
select temp.date_groups as day
, avg(your_table.num_of_items) avg_last_7_days
from your_table
join temp on your_table.day = temp.dates
group by 1
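Assembled into a single statement (your_table, day and num_of_items are the placeholder names used above), the full query would look roughly like this:
with calendar as (
    select date_generated
    from (
        values (sequence(date'2021-12-25', date'2022-01-25', interval '1' day))
    ) as t1(date_array)
    cross join unnest(date_array) as t2(date_generated)
),
temp as (
    select c1.date_generated as date_groups,
           format_datetime(c2.date_generated, 'yyyy-MM-dd') as dates
    from calendar c1, calendar c2
    where c2.date_generated between c1.date_generated - interval '6' day and c1.date_generated
      and c1.date_generated >= date'2021-12-25' + interval '6' day
)
select temp.date_groups as day,
       avg(your_table.num_of_items) as avg_last_7_days
from your_table
join temp on your_table.day = temp.dates
group by 1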
You want a running average (AVG OVER)
select
day, amount,
avg(amount) over (order by day rows between 6 preceding and current row) as avg_amount
from mytable
order by day
offset 6;
I tried many different variations of getting the "running average" (which, thanks to Thorsten's answer, I now know is what I was looking for), but couldn't get exactly the output I wanted together with the other columns in my table (which weren't included in my original question). This ended up working:
SELECT day, <other columns>, avg(amount) OVER (
    PARTITION BY <other columns>
    ORDER BY date(day) ASC
    ROWS 6 PRECEDING) AS avg_7_days_amount
FROM table
ORDER BY date(day) ASC
If you have a table like this:
Name        Data type
UserID      INT
StartDate   DATETIME
EndDate     DATETIME
With data like this:
UserID   StartDate             EndDate
21       2021-01-02 00:00:00   2021-01-02 23:59:59
21       2021-01-03 00:00:00   2021-01-04 15:42:00
24       2021-01-02 00:00:00   2021-01-06 23:59:59
And you want to calculate the number of users represented on each day of a week, with a result like this:
Year   Week   NumberOfTimes
2021   1      8
2021   2      10
2021   3      4
Basically I want to do a SELECT like this:
SELECT YEAR(dateColumn) AS yearname, WEEK(dateColumn) AS weekname, COUNT(somecolumn)
GROUP BY YEAR(dateColumn), WEEK(dateColumn)
The problem I have is with the start and end dates: if a row spans several days, I want the user counted on each of those days. Preferably I don't want the same user counted twice on the same day. There are millions of rows that are constantly being deleted and added, so speed is key.
The database is MS-SQL 2019
I would suggest a recursive CTE:
with cte as (
      select userid, startdate, enddate
      from t
      union all
      select userid, dateadd(day, 7, startdate),
             enddate
      from cte
      where startdate < enddate and
            datepart(week, startdate) <> datepart(week, enddate)
     )
select year(startdate), datepart(week, startdate), count(*)
from cte
group by year(startdate), datepart(week, startdate)
option (maxrecursion 0);
The CTE expands the data by repeatedly adding 7 days to the start date, so each row contributes one date per week that it spans. For example, the row (24, 2021-01-02, 2021-01-06) produces a second row starting 2021-01-09, which (with default week settings) falls in the same week as the end date.
The guard in the recursive part handles the situation where the end date is in the same week as the last start date. The solution assumes that the dates are all in the same year, which seems quite reasonable given the sample data; there are other ways to handle that edge case.
You need to join each row with the relevant calendar weeks.
Create a calendar table with columns for year and week, including the start and end date of each week. See here for an example of how to create one, and make sure you index those columns.
Then you can cross-join like this
SELECT
    c.Year AS yearname,
    c.Week AS weekname,
    COUNT(t.UserID) AS NumberOfTimes
FROM YourTable t
JOIN CalendarWeek c ON c.StartDate >= t.StartDate AND c.EndDate <= t.EndDate
GROUP BY c.Year, c.Week
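If you specifically need the per-day semantics from the question (a user counted once for every day their interval covers, but never twice on the same day), a hedged variation is to join against a day-level calendar instead and de-duplicate user/day pairs first. CalendarDay (one row per date in a Date column) and YourTable are assumed names here, not something from the original post:
-- A sketch only: count each user at most once per covered day, then roll up to weeks.
SELECT YEAR(x.[Date]) AS yearname,
       DATEPART(WEEK, x.[Date]) AS weekname,
       COUNT(*) AS NumberOfTimes
FROM (
    SELECT DISTINCT t.UserID, d.[Date]          -- one row per user per covered day
    FROM YourTable t
    JOIN CalendarDay d
      ON d.[Date] >= CAST(t.StartDate AS date)
     AND d.[Date] <= CAST(t.EndDate AS date)
) AS x
GROUP BY YEAR(x.[Date]), DATEPART(WEEK, x.[Date]);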
Suppose you have a table like:
id subscription_start subscription_end segment
1 2016-12-01 2017-02-01 87
2 2016-12-01 2017-01-24 87
...
And wish to generate a temporary table with months.
One way would be to encode the month date as:
with months as (
select
'2016-12-01' as 'first',
'2016-12-31' as 'last'
union
select
'2017-01-01' as 'first',
'2017-01-31' as 'last'
...
) select * from months;
So that I have an output table like:
first_day last_day
2017-01-01 2017-01-31
2017-02-01 2017-02-28
2017-03-01 2017-03-31
I would like to generate a temporary table with a custom interval (as above), without manually encoding all the dates.
Say the interval is 12 months per year, for as many years as there are in the db.
I'd like a general approach to compute the months table with the same output as above.
Or one may adjust the range to a custom interval (months split a year into 12 parts, but one may want to split time into a custom interval of days).
To start, I was thinking of using a recursive query like:
with months(id, first_day, last_day, month) as (
select
id,
first_day,
last_day,
0
where
subscriptions.first_day = min(subscriptions.first_day)
union all
select
id,
first_day,
last_day,
months.month + 1
from
subscriptions
left join months on cast(
strftime('%m', datetime(subscriptions.subscription_start)) as int
) = months.month
where
months.month < 13
)
select
*
from
months
where
month = 1;
but it does not do what I'd expect: here I was attempting to select the first row from the table with the minimum date, and then to populate a table at one-month intervals, ranging from 1 to 12. For each month, I was comparing the string date field of my table (e.g. 2017-03-01 gives 3, i.e. March).
The query above does not work as intended and also seems a bit complicated. For the sake of learning, which alternative would you propose to create a temporary table of months without manually coding the intervals?
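For reference, a minimal sketch of the usual SQLite approach with a recursive CTE, assuming the subscriptions table shown above; the one-month step can be swapped for any other interval of days:
-- Generate one row per month from the earliest subscription_start up to the current month.
WITH RECURSIVE months(first_day) AS (
    SELECT date(MIN(subscription_start), 'start of month') FROM subscriptions
    UNION ALL
    SELECT date(first_day, '+1 month') FROM months
    WHERE first_day < date('now', 'start of month')
)
SELECT first_day,
       date(first_day, '+1 month', '-1 day') AS last_day
FROM months;
The 'start of month', '+1 month' and '-1 day' modifiers are built into SQLite's date() function, so the month boundaries never need to be written out by hand.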
I'm using Redshift (Postgres) and pandas to do my work. I'm trying to get the number of user actions; let's say purchases, to make it easier to understand. I have a table, purchases, that holds the following data:
user_id, timestamp,  price
1,       2015-02-01, 200
1,       2015-02-02, 50
1,       2015-02-10, 75
Ultimately I would like the number of purchases within certain time windows, such as:
userid, 28-14_days, 14-7_days, 7
Here is what I have so far; I'm aware I don't have an upper limit on the dates:
SELECT DISTINCT x_days.user_id, SUM(x_days.purchases) AS x_num, SUM(y_days.purchases) AS y_num,
x_days.x_date, y_days.y_date
FROM
(
SELECT purchases.user_id, COUNT(purchases.user_id) as purchases,
DATE(purchases.timestamp) as x_date
FROM purchases
WHERE purchases.timestamp > (current_date - INTERVAL '%(x_days_ago)s day') AND
purchases.max_value > 200
GROUP BY DATE(purchases.timestamp), purchases.user_id
) AS x_days
JOIN
(
SELECT purchases.user_id, COUNT(purchases.user_id) as purchases,
DATE(purchases.timestamp) as y_date
FROM purchases
WHERE purchases.timestamp > (current_date - INTERVAL '%(y_days_ago)s day') AND
purchases.max_value > 200
GROUP BY DATE(purchases.timestamp), purchases.user_id) AS y_days
ON
x_days.user_id = y_days.user_id
GROUP BY
x_days.user_id, x_days.x_date, y_days.y_date
params={'x_days_ago':x_days_ago, 'y_days_ago':y_days_ago}
where these are set in python/pandas
x_days_ago = 14
y_days_ago = 7
But this didn't work out exactly as planned:
user_id x_num y_num x_date y_date
0 5451772 1 1 2015-02-10 2015-02-10
1 5026678 1 1 2015-02-09 2015-02-09
2 6337993 2 1 2015-02-14 2015-02-13
3 6204432 1 3 2015-02-10 2015-02-11
4 3417539 1 1 2015-02-11 2015-02-11
Even though I don't have an upper date to look between (so x is effectively searching from 14 days to now and y is 7 days to now, meaning overlap), in some cases y is higher.
Can anyone help me either fix this or give me a better way?
Thanks!
It might not be the most efficient answer, but you can generate each sum with a sub-select:
WITH
summed AS (
SELECT user_id, day, COUNT(1) AS purchases
FROM (SELECT user_id, DATE(timestamp) AS day FROM purchases) AS _
GROUP BY user_id, day
),
users AS (SELECT DISTINCT user_id FROM purchases)
SELECT user_id,
(SELECT SUM(purchases) FROM summed
WHERE summed.user_id = users.user_id
AND day >= DATE(NOW() - interval ' 7 days')) AS days_7,
(SELECT SUM(purchases) FROM summed
WHERE summed.user_id = users.user_id
AND day >= DATE(NOW() - interval '14 days')) AS days_14
FROM users;
(This was tested in Postgres, not in Redshift; but the Redshift documentation suggests that both WITH and DISTINCT are supported.) I would have liked to do this with a window, to obtain rolling sums; but it's a little onerous without generate_series().
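For what it's worth, a sketch of an alternative that avoids the correlated sub-selects by using conditional aggregation; it assumes only the purchases table and the 7- and 14-day cutoffs from the question:
-- Count each user's purchases falling inside each lookback window in a single pass.
SELECT
    user_id,
    COUNT(CASE WHEN timestamp >= current_date - INTERVAL '7 day'  THEN 1 END) AS days_7,
    COUNT(CASE WHEN timestamp >= current_date - INTERVAL '14 day' THEN 1 END) AS days_14
FROM purchases
GROUP BY user_id;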