SQL: last 7 Days Calculations based on date - sql

The table below contains the count of users who signed up on each day. I am looking to populate the Total_Users column.
Logic: it contains the user count between Signupdate-14 and Signupdate-7.
For example: the row for 15/01/2020 contains the user count between 1/1/2020 and 1/7/2020.
Signupdate | Users | Total_Users (b/w D-14 & D-7)
1/1/2020   | 20    | 60
2/1/2020   | 30    | 80
3/1/2020   | 10    | 90
---        | --    | --
---        | --    | --
15/1/2020  | 30    | 120
16/1/2020  | 10    | 40

SELECT Signupdate
, Users
,SUM(CASE
WHEN Signupdate BETWEEN to_date(Signupdate,'DDMMYYYY')-14 and to_date(Signupdate,'DDMMYYYY')-7
THEN Users END) AS 'Total_Users'
FROM
This assumes that the Users column is of a numeric type.

Assuming you have a row for each date, you would use window functions with a windowing clause. I'm not sure if Redshift supports window frames with intervals, but this is the basic logic:
select t.*,
       sum(users) over (order by signupdate
                        range between interval '14 day' preceding and interval '7 day' preceding
                       ) as total_users
from t;
If not, you can turn the date into a number and use that:
select t.*,
       sum(users) over (order by diff
                        range between 14 preceding and 7 preceding
                       ) as total_users
from (select t.*,
             datediff(day, date '2000-01-01', signupdate) as diff
      from t
     ) t;
I am guessing you want a complete week. However, D-14 through D-7 inclusive is 8 days.
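To make the off-by-one concrete, here is a minimal Python sketch of the same trailing-window sum. The function name `trailing_window_sum` and the flat 10-users-per-day data are made up for illustration; only the D-14 / D-7 bounds come from the question.

```python
from datetime import date, timedelta

def trailing_window_sum(daily_users, day, start_offset=14, end_offset=7):
    """Sum users whose signup date lies in [day - start_offset, day - end_offset]."""
    lo = day - timedelta(days=start_offset)
    hi = day - timedelta(days=end_offset)
    return sum(n for d, n in daily_users.items() if lo <= d <= hi)

# Hypothetical data: 10 users on every day of January 2020.
daily_users = {date(2020, 1, 1) + timedelta(days=i): 10 for i in range(31)}

# D-14 .. D-7 inclusive covers 1 Jan .. 8 Jan, i.e. 8 days.
inclusive = trailing_window_sum(daily_users, date(2020, 1, 15))
# D-14 .. D-8 inclusive covers 1 Jan .. 7 Jan, i.e. exactly one week.
one_week = trailing_window_sum(daily_users, date(2020, 1, 15), 14, 8)
print(inclusive, one_week)
```

If a 7-day window is what you are after, the SQL frame would end at 8 preceding rather than 7 preceding.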

Related

SQL counting distinct users over a growing timeframe

I don't think I properly titled this, but in essence I'm wanting to be able to count distinct users but have those previous distinct users be considered as time goes on. As an example, say we have a dataset of user purchases over time:
Date | User
-----------------
2/3/22 | A
2/4/22 | B
2/22/22 | C
3/2/22 | A
3/4/22 | D
3/15/22 | A
4/30/22 | B
Generally, if I were to count distincts grouped by months as would be normal we would get:
Date | Count
-----------------
2/1/22 | 3
3/1/22 | 2
4/1/22 | 1
But what I'm really wanting to see would be how the total number of distinct users increases over the time period.
Date | Count
-----------------
2/1/22 | 3
3/1/22 | 4
4/1/22 | 4
As such it would be 3 distinct users for the first month. Then 4 for the second month considering the total number of distinct users grew by one with the addition of "D" while "A" isn't counted because it was already recognized as a distinct user in the previous month. The third month would then still be 4 because no new distinct user performed an action that month.
Any help would be greatly appreciated (even if it is just a better title so that it reaches more people more appropriately haha)
Here's a solution based on a running sum in Postgres that should translate well to Vertica.
select date_trunc('month', "Date") as "Date"
,sum(count(case rn when 1 then 1 end)) over (order by date_trunc('month', "Date")) as "Count"
from (
select "Date"
,"User"
,row_number() over(partition by "User" order by "Date") as rn
from t
) t
group by date_trunc('month', "Date")
order by "Date"
Date | Count
-----------------
2022-02-01 00:00:00 | 3
2022-03-01 00:00:00 | 4
2022-04-01 00:00:00 | 4
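The idea behind the query (count each user only in the month of their first appearance, then take a running sum) can be sketched in plain Python. This is only an illustration of the logic; the function name `cumulative_distinct_by_month` is made up, and the data is the sample from the question.

```python
from datetime import date

purchases = [
    (date(2022, 2, 3), "A"), (date(2022, 2, 4), "B"), (date(2022, 2, 22), "C"),
    (date(2022, 3, 2), "A"), (date(2022, 3, 4), "D"), (date(2022, 3, 15), "A"),
    (date(2022, 4, 30), "B"),
]

def cumulative_distinct_by_month(rows):
    """Return [(month_start, distinct users seen so far)] in month order."""
    # Month of each user's first purchase (the rn = 1 rows in the SQL).
    first_seen = {}
    for day, user in sorted(rows):
        first_seen.setdefault(user, day.replace(day=1))
    # Every month that has at least one purchase, even with no new users.
    months = sorted({day.replace(day=1) for day, _user in rows})
    result, running = [], 0
    for month in months:
        running += sum(1 for m in first_seen.values() if m == month)
        result.append((month, running))
    return result

growth = cumulative_distinct_by_month(purchases)
print(growth)
```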

Finding total session time of a user in postgres

I am trying to create a query that will give me a column of total time logged in for each month for each user.
username | auth_event_type | time | credential_id
Joe | 1 | 2021-11-01 09:00:00 | 44
Joe | 2 | 2021-11-01 10:00:00 | 44
Jeff | 1 | 2021-11-01 11:00:00 | 45
Jeff | 2 | 2021-11-01 12:00:00 | 45
Joe | 1 | 2021-11-01 12:00:00 | 46
Joe | 2 | 2021-11-01 12:30:00 | 46
Joe | 1 | 2021-12-06 14:30:00 | 47
Joe | 2 | 2021-12-06 15:30:00 | 47
The auth_event_type column specifies whether the event was a login (1) or logout (2) and the credential_id indicates the session.
I'm trying to create a query that would have an output like this:
username | year_month | total_time
Joe | 2021-11 | 1:30
Jeff | 2021-11 | 1:00
Joe | 2021-12 | 1:00
How would I go about doing this in postgres? I am thinking it would involve a window function? If someone could point me in the right direction that would be great. Thank you.
Solution 1 partially working
Not sure that window functions will help you in your case, but aggregate functions will :
WITH list AS
(
SELECT username
, date_trunc('month', time) AS year_month
, max(time ORDER BY time) - min(time ORDER BY time) AS session_duration
FROM your_table
GROUP BY username, date_trunc('month', time), credential_id
)
SELECT username
, to_char (year_month, 'YYYY-MM') AS year_month
, sum(session_duration) AS total_time
FROM list
GROUP BY username, year_month
The first part of the query computes the duration of each session by aggregating the login/logout times per username and credential_id; the second part sums these durations per username and year_month. This query works well as long as the login time and logout time are in the same month, but it fails when they aren't.
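A rough Python equivalent of Solution 1 shows both the expected result and the failure mode. The function name and the "Ann" session are hypothetical; the other rows are the sample data from the question.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def total_time_solution_1(events):
    """events: (username, auth_event_type, time, credential_id) tuples."""
    # Group by username, month and session, like the GROUP BY in the CTE.
    groups = defaultdict(list)
    for username, _event_type, time, credential_id in events:
        groups[(username, time.strftime("%Y-%m"), credential_id)].append(time)
    # Sum max(time) - min(time) per username and month.
    totals = defaultdict(timedelta)
    for (username, year_month, _cred), times in groups.items():
        totals[(username, year_month)] += max(times) - min(times)
    return dict(totals)

events = [
    ("Joe", 1, datetime(2021, 11, 1, 9, 0), 44),
    ("Joe", 2, datetime(2021, 11, 1, 10, 0), 44),
    ("Jeff", 1, datetime(2021, 11, 1, 11, 0), 45),
    ("Jeff", 2, datetime(2021, 11, 1, 12, 0), 45),
    ("Joe", 1, datetime(2021, 11, 1, 12, 0), 46),
    ("Joe", 2, datetime(2021, 11, 1, 12, 30), 46),
    ("Joe", 1, datetime(2021, 12, 6, 14, 30), 47),
    ("Joe", 2, datetime(2021, 12, 6, 15, 30), 47),
]
# Hypothetical session that crosses a month boundary (not in the original data).
crossing = [
    ("Ann", 1, datetime(2021, 11, 30, 23, 0), 50),
    ("Ann", 2, datetime(2021, 12, 1, 1, 0), 50),
]

totals = total_time_solution_1(events)
broken = total_time_solution_1(crossing)
print(totals)
print(broken)  # the login and logout land in different groups
```

Because the login and the logout of the crossing session fall into different month groups, each group has a single timestamp and contributes a zero duration.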
Solution 2 fully working
In order to calculate the total_time per username and per month whatever the login time and logout time are, we can use a time range approach which intersects the session ranges [login_time, logout_time) with the monthly ranges [monthly_start_time, monthly_end_time) :
WITH monthly_range AS
(
SELECT to_char(m.month_start_date, 'YYYY-MM') AS month
, tsrange(m.month_start_date, m.month_start_date+ interval '1 month' ) AS monthly_range
FROM
( SELECT generate_series(min(date_trunc('month', time)), max(date_trunc('month', time)), '1 month') AS month_start_date
FROM your_table
) AS m
), session_range AS
(
SELECT username
, tsrange(min(time ORDER BY auth_event_type), max(time ORDER BY auth_event_type)) AS session_range
FROM your_table
GROUP BY username, credential_id
)
SELECT s.username
, m.month
, sum(upper(p.period) - lower(p.period)) AS total_time
FROM monthly_range AS m
INNER JOIN session_range AS s
ON s.session_range && m.monthly_range
CROSS JOIN LATERAL (SELECT s.session_range * m.monthly_range AS period) AS p
GROUP BY s.username, m.month
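The range intersection used in Solution 2 amounts to cutting each session at month boundaries. A minimal Python sketch, assuming sessions are already paired as (username, login, logout); the helper names and the "Ann" session are hypothetical:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def next_month(ts):
    """First instant of the month following ts."""
    first = ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    return (first + timedelta(days=32)).replace(day=1)

def split_by_month(username, login, logout):
    """Yield (username, 'YYYY-MM', duration) for each month the session overlaps."""
    start = login
    while start < logout:
        end = min(logout, next_month(start))
        yield username, start.strftime("%Y-%m"), end - start
        start = end

def total_time(sessions):
    totals = defaultdict(timedelta)
    for username, login, logout in sessions:
        for user, year_month, duration in split_by_month(username, login, logout):
            totals[(user, year_month)] += duration
    return dict(totals)

sessions = [
    ("Joe", datetime(2021, 11, 1, 9, 0), datetime(2021, 11, 1, 10, 0)),
    ("Joe", datetime(2021, 11, 1, 12, 0), datetime(2021, 11, 1, 12, 30)),
    # Hypothetical session crossing a month boundary.
    ("Ann", datetime(2021, 11, 30, 23, 0), datetime(2021, 12, 1, 1, 0)),
]
per_month = total_time(sessions)
print(per_month)
```

Here the two-hour "Ann" session is credited one hour to each month, which is what the `*` range intersection does in the query.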
Use the window function lag(), partitioned by credential_id and ordered by time, e.g.
WITH j AS (
SELECT username, time, age(time, LAG(time) OVER w)
FROM t
WINDOW w AS (PARTITION BY credential_id ORDER BY time
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
)
SELECT username, to_char(time,'yyyy-mm'),sum(age) FROM j
GROUP BY 1,2;
Note: the frame ROWS BETWEEN 1 PRECEDING AND CURRENT ROW is pretty much optional in this case, but it is considered a good practice to keep window functions as explicit as possible, so that in the future you don't have to read the docs to figure out what your query is doing.
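For comparison, the lag() approach can be mimicked in Python by remembering the previous timestamp within each credential_id. This is a sketch only; the function name is made up and the rows are a subset of the question's sample plus one hypothetical cross-month session.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from itertools import groupby

def total_time_with_lag(events):
    """events: (username, auth_event_type, time, credential_id) tuples."""
    totals = defaultdict(timedelta)
    ordered = sorted(events, key=lambda e: (e[3], e[2]))  # credential_id, time
    for _cred, rows in groupby(ordered, key=lambda e: e[3]):
        previous = None  # plays the role of LAG(time)
        for username, _event_type, time, _credential_id in rows:
            if previous is not None:
                # The gap is credited to the month of the current (logout) row.
                totals[(username, time.strftime("%Y-%m"))] += time - previous
            previous = time
    return dict(totals)

events = [
    ("Joe", 1, datetime(2021, 11, 1, 9, 0), 44),
    ("Joe", 2, datetime(2021, 11, 1, 10, 0), 44),
    ("Joe", 1, datetime(2021, 11, 1, 12, 0), 46),
    ("Joe", 2, datetime(2021, 11, 1, 12, 30), 46),
    # Hypothetical session crossing a month boundary.
    ("Ann", 1, datetime(2021, 11, 30, 23, 0), 50),
    ("Ann", 2, datetime(2021, 12, 1, 1, 0), 50),
]
lag_totals = total_time_with_lag(events)
print(lag_totals)
```

Note that a session spanning two months is attributed entirely to the month of its logout row, so this variant is simplest when sessions never cross a month boundary.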

Counting number of orders per customer

I have a table with the following columns: date, customers_id, and orders_id (unique).
I want to add a column in which, for each orders_id, I can see how many times the given customer has already placed an order during the previous year.
e.g. this is what it would look like:
customers_id | orders_id | date | order_rank
2083 | 4725 | 2018-08-314 | 1
2573 | 4773 | 2018-09-035 | 1
3393 | 3776 | 2017-09-11 | 1
3393 | 4172 | 2018-01-09 | 2
3393 | 4655 | 2018-08-17 | 3
I'm doing this in BigQuery, thank you!
Use count(*) with a window frame. Ideally, you would use an interval. But BigQuery doesn't (yet) support that syntax. So convert to a number:
select t.*,
       count(*) over (partition by customers_id
                      order by unix_date(date)
                      range between 364 preceding and current row
                     ) as order_rank
from t;
This treats a year as 365 days, which seems suitable for most purposes.
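As a sanity check of the 364-preceding frame, here is a small Python sketch that counts, for each order, the same customer's orders in the trailing 365 days (current day included). The function name and order 9001 are hypothetical; the other three rows are customer 3393 from the question.

```python
from datetime import date, timedelta

def orders_in_trailing_year(orders, window_days=364):
    """orders: (customers_id, orders_id, order_date). Returns {orders_id: count}."""
    result = {}
    for customer, order_id, day in orders:
        lo = day - timedelta(days=window_days)
        # Same customer, dated within [day - 364, day]; same-day orders count too.
        result[order_id] = sum(
            1 for c, _o, d in orders if c == customer and lo <= d <= day
        )
    return result

orders = [
    (3393, 3776, date(2017, 9, 11)),
    (3393, 4172, date(2018, 1, 9)),
    (3393, 4655, date(2018, 8, 17)),
    (3393, 9001, date(2018, 9, 20)),  # hypothetical extra order
]
counts = orders_in_trailing_year(orders)
print(counts)
```

The hypothetical fourth order still gets a count of 3, because by then the 2017-09-11 order has dropped out of the window.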
I suggest that you use the over clause and restrict the data in your where clause. You don't really need a window frame for your case. If you consider one year to be the period from 365 days in the past until now, this is going to work:
select t.*,
       count(*) over (partition by customers_id
                      order by date
                     ) as c
from `your-table` t
where date > DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY)
order by customers_id, c
If you need a specific year, for example 2019, you can do something like:
select t.*,
       count(*) over (partition by customers_id
                      order by date
                     ) as c
from `your-table` t
where date between cast("2019-01-01" as date) and cast("2019-12-31" as date)
order by customers_id, c

Moving average last 30 days

I want to find the number of unique users active in the last 30 days. I want to calculate this for today, but also for days in the past. The dataset contains user ids, dates and events triggered by the user saved in BigQuery. A user is active by opening a mobile app triggering the event session_start. Example of the unnested dataset.
| resettable_device_id | date | event |
------------------------------------------------------
| xx | 2017-06-09 | session_start |
| yy | 2017-06-09 | session_start |
| xx | 2017-06-11 | session_start |
| zz | 2017-06-11 | session_start |
I found a solution which suits my problem:
BigQuery: how to group and count rows within rolling timestamp window?
My BigQuery script so far:
#standardSQL
WITH daily_aggregation AS (
SELECT
PARSE_DATE("%Y%m%d", event_dim.date) AS day,
COUNT(DISTINCT user_dim.device_info.resettable_device_id) AS unique_resettable_device_ids
FROM `ANDROID.app_events_*`,
UNNEST(event_dim) AS event_dim
WHERE event_dim.name = "session_start"
GROUP BY day
)
SELECT
day,
unique_resettable_device_ids,
SUM(unique_resettable_device_ids)
OVER(ORDER BY UNIX_SECONDS(TIMESTAMP(day)) DESC ROWS BETWEEN 2592000 PRECEDING AND CURRENT ROW) AS unique_ids_rolling_30_days
FROM daily_aggregation
ORDER BY day
This script results in the following table:
| day | unique_resettable_device_ids | unique_ids_rolling_30_days |
------------------------------------------------------------------------
| 2018-06-05 | 1807 | 2614 |
| 2018-06-06 | 711 | 807 |
| 2018-06-07 | 96 | 96 |
The problem is that the column unique_ids_rolling_30_days is just a cumulative sum of the column unique_resettable_device_ids. How can I fix the rolling window function in my script?
"The problem is that the column unique_ids_rolling_30_days is just a cumulative sum of the column unique_resettable_device_ids."
Of course, as that's exactly what the code
SUM(unique_resettable_device_ids)
OVER(ORDER BY UNIX_SECONDS(TIMESTAMP(day)) DESC ROWS BETWEEN 2592000 PRECEDING AND CURRENT ROW) AS unique_ids_rolling_30_days
is asking for.
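To see why, it helps to simulate the two frame types in Python. The sketch below is an illustration, not BigQuery code: `rows_frame_desc` mimics ROWS BETWEEN n PRECEDING AND CURRENT ROW over a descending order, and `range_frame` mimics a value-based frame over days. Both function names are made up; the three daily counts are taken from the question's output.

```python
from datetime import date, timedelta

daily = [(date(2018, 6, 5), 1807), (date(2018, 6, 6), 711), (date(2018, 6, 7), 96)]

def rows_frame_desc(daily, preceding_rows):
    """ROWS frame: counts physical rows before the current one, newest first."""
    ordered = sorted(daily, reverse=True)
    out = {}
    for i, (day, _n) in enumerate(ordered):
        frame = ordered[max(0, i - preceding_rows): i + 1]
        out[day] = sum(n for _d, n in frame)
    return out

def range_frame(daily, preceding_days):
    """Value-based frame: every row within preceding_days before the current day."""
    out = {}
    for day, _n in daily:
        lo = day - timedelta(days=preceding_days)
        out[day] = sum(n for d, n in daily if lo <= d <= day)
    return out

as_written = rows_frame_desc(daily, 2592000)  # 2592000 rows, not seconds
by_value = range_frame(daily, 29)
print(as_written)
print(by_value)
```

With ROWS, 2592000 means "up to 2592000 rows back", so every row sums everything before it in the sort order. Even with a value-based frame, though, adding up daily distinct counts counts a user once per active day, so it is still not a count of unique users in the window.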
Check out https://stackoverflow.com/a/49866033/132438 where the question asks about specifically counting uniques in a rolling window: Turns out it's a very slow operation given how much memory it requires.
The solution for this when you want a rolling count of uniques: Go for approximate results.
From the linked answer:
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, HLL_COUNT.MERGE(sketch) unique_90_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
, COUNT(*) window_days
FROM (
SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
HAVING window_days=90
ORDER BY date_grp
Working solution for a weekly calculation of the number of active users in the last 30 days.
#standardSQL
WITH days AS (
SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY('2018-01-01', CURRENT_DATE(), INTERVAL 1 WEEK)) AS day
), periods AS (
SELECT
DATE_SUB(days.day, INTERVAL 30 DAY) AS StartDate,
days.day AS EndDate FROM days
)
SELECT
periods.EndDate AS Day,
COUNT(DISTINCT user_dim.device_info.resettable_device_id) as resettable_device_ids
FROM `ANDROID.app_events_*`,
UNNEST(event_dim) AS event_dim
CROSS JOIN periods
WHERE
PARSE_DATE("%Y%m%d", event_dim.date) BETWEEN periods.StartDate AND periods.EndDate
AND event_dim.name = "session_start"
GROUP BY Day
ORDER BY Day DESC
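The logic of that query (for each end date, collect the distinct device ids seen between end date minus 30 days and the end date) can be sketched in Python as follows. The function name `active_users` is made up; the rows are the sample from the question.

```python
from datetime import date, timedelta

events = [
    ("xx", date(2017, 6, 9), "session_start"),
    ("yy", date(2017, 6, 9), "session_start"),
    ("xx", date(2017, 6, 11), "session_start"),
    ("zz", date(2017, 6, 11), "session_start"),
]

def active_users(events, end_day, window_days=30):
    """Exact count of distinct devices with a session_start in the window."""
    start_day = end_day - timedelta(days=window_days)
    # Both bounds are inclusive, like BETWEEN StartDate AND EndDate in the query,
    # so the window actually spans window_days + 1 calendar days.
    return len({
        device for device, day, name in events
        if name == "session_start" and start_day <= day <= end_day
    })

print(active_users(events, date(2017, 6, 11)))  # xx, yy and zz are all in range
```

Unlike a sum of daily counts, the set makes sure that "xx" is counted once even though it was active on two days.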

Select first & last date in window

I'm trying to select first & last date in window based on month & year of date supplied.
Here is example data:
F.rates
| id | c_id | date | rate |
---------------------------------
| 1 | 1 | 01-01-1991 | 1 |
| 1 | 1 | 15-01-1991 | 0.5 |
| 1 | 1 | 30-01-1991 | 2 |
.................................
| 1 | 1 | 01-11-2014 | 1 |
| 1 | 1 | 15-11-2014 | 0.5 |
| 1 | 1 | 30-11-2014 | 2 |
Here is pgSQL SELECT I came up with:
SELECT c_id, first_value(date) OVER w, last_value(date) OVER w FROM F.rates
WINDOW w AS (PARTITION BY EXTRACT(YEAR FROM date), EXTRACT(MONTH FROM date), c_id
ORDER BY date ASC)
Which gives me a result pretty close to what I want:
| c_id | first_date | last_date |
----------------------------------
| 1 | 01-01-1991 | 15-01-1991 |
| 1 | 01-01-1991 | 30-01-1991 |
.................................
Should be:
| c_id | first_date | last_date |
----------------------------------
| 1 | 01-01-1991 | 30-01-1991 |
.................................
For some reason last_value(date) returns a different value for every record in a window, which makes me think that I'm misunderstanding how windows in SQL work. It's as if SQL forms a new window for each row it iterates through, rather than one window per YEAR and MONTH for the entire table.
So could anyone be kind and explain whether I'm wrong and how I can achieve the result I want?
There is a reason why I'm not using MAX/MIN with a GROUP BY clause. My next step would be to retrieve the associated rates for the dates I selected, like:
| c_id | first_date | last_date | first_rate | last_rate | avg rate |
-----------------------------------------------------------------------
| 1 | 01-01-1991 | 30-01-1991 | 1 | 2 | 1.1 |
.......................................................................
If you want your output to become grouped into a single (or just fewer) row(s), you should use simple aggregation (i.e. GROUP BY), if avg_rate is enough:
SELECT c_id, min(date), max(date), avg(rate)
FROM F.rates
GROUP BY c_id, date_trunc('month', date)
More about window functions in PostgreSQL's documentation:
But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities.
...
There is another important concept associated with window functions: for each row, there is a set of rows within its partition called its window frame. Many (but not all) window functions act only on the rows of the window frame, rather than of the whole partition. By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause. When ORDER BY is omitted the default frame consists of all rows in the partition.
...
There are options to define the window frame in other ways ... See Section 4.2.8 for details.
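The effect of the default frame can be reproduced with a small script. This sketch uses Python's sqlite3 module as a stand-in for PostgreSQL, which is an assumption on my part: it needs SQLite 3.25 or later for window functions, and strftime('%Y-%m', date) replaces the EXTRACT calls from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rates (c_id INTEGER, date TEXT, rate REAL)")
conn.executemany(
    "INSERT INTO rates VALUES (?, ?, ?)",
    [(1, "1991-01-01", 1.0), (1, "1991-01-15", 0.5), (1, "1991-01-30", 2.0)],
)

# Default frame: start of partition up to the current row.
default_frame = conn.execute("""
    SELECT date,
           last_value(date) OVER (PARTITION BY c_id, strftime('%Y-%m', date)
                                  ORDER BY date)
    FROM rates ORDER BY date
""").fetchall()

# Explicit frame covering the whole partition.
full_frame = conn.execute("""
    SELECT date,
           last_value(date) OVER (PARTITION BY c_id, strftime('%Y-%m', date)
                                  ORDER BY date
                                  ROWS BETWEEN UNBOUNDED PRECEDING
                                           AND UNBOUNDED FOLLOWING)
    FROM rates ORDER BY date
""").fetchall()

print(default_frame)
print(full_frame)
```

With the default frame each row is the last row of its own frame, so last_value() simply echoes the current date; with the explicit frame every row reports 1991-01-30. The rows are still not collapsed, which is why the answers here suggest aggregation.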
EDIT:
If you want to collapse your data (min/max aggregation) and collect more columns than those listed in GROUP BY, you have 2 choices:
The SQL way
Select the min/max value(s) in a sub-query, then join their original rows back (but this way you have to deal with the fact that the min/max-ed column(s) are usually not unique):
SELECT c_id,
min first_date,
max last_date,
first.rate first_rate,
last.rate last_rate,
avg avg_rate
FROM (SELECT c_id, min(date), max(date), avg(rate)
FROM F.rates
GROUP BY c_id, date_trunc('month', date)) agg
JOIN F.rates first ON agg.c_id = first.c_id AND agg.min = first.date
JOIN F.rates last ON agg.c_id = last.c_id AND agg.max = last.date
PostgreSQL's DISTINCT ON
DISTINCT ON is typically meant for this task, but it relies heavily on ordering (only one extremum can be searched for this way at a time). Note that the DISTINCT ON expressions must match the leftmost ORDER BY expressions:
SELECT DISTINCT ON (c_id, date_trunc('month', date))
       c_id,
       date first_date,
       rate first_rate
FROM F.rates
ORDER BY c_id, date_trunc('month', date), date
You can join this query with other aggregated sub-queries of F.rates, but at this point (if you really need both the minimum and the maximum, and in your case even an average) the standard SQL way is more suitable.
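For reference, the target result (first/last date, the rates on those dates, and the average per month) is easy to state in Python, which can help when checking either SQL variant. This is a sketch of the expected output only, with a made-up function name; the rows are the January 1991 sample from the question.

```python
from collections import defaultdict
from datetime import date

def monthly_summary(rows):
    """rows: (c_id, date, rate). Returns {(c_id, month_start): summary dict}."""
    groups = defaultdict(list)
    for c_id, day, rate in rows:
        groups[(c_id, day.replace(day=1))].append((day, rate))
    summary = {}
    for key, items in groups.items():
        items.sort()  # by date, so the first and last items are the extremes
        rates = [rate for _day, rate in items]
        summary[key] = {
            "first_date": items[0][0],
            "last_date": items[-1][0],
            "first_rate": items[0][1],
            "last_rate": items[-1][1],
            "avg_rate": sum(rates) / len(rates),
        }
    return summary

rows = [
    (1, date(1991, 1, 1), 1.0),
    (1, date(1991, 1, 15), 0.5),
    (1, date(1991, 1, 30), 2.0),
]
january = monthly_summary(rows)[(1, date(1991, 1, 1))]
print(january)
```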
Windowing functions aren't appropriate for this. Use aggregate functions instead.
select
c_id, date_trunc('month', date)::date,
min(date) first_date, max(date) last_date
from rates
group by c_id, date_trunc('month', date)::date;
c_id | date_trunc | first_date | last_date
------+------------+------------+------------
1 | 2014-11-01 | 2014-11-01 | 2014-11-30
1 | 1991-01-01 | 1991-01-01 | 1991-01-30
create table rates (
id integer not null,
c_id integer not null,
date date not null,
rate numeric(2, 1),
primary key (id, c_id, date)
);
insert into rates values
(1, 1, '1991-01-01', 1),
(1, 1, '1991-01-15', 0.5),
(1, 1, '1991-01-30', 2),
(1, 1, '2014-11-01', 1),
(1, 1, '2014-11-15', 0.5),
(1, 1, '2014-11-30', 2);