SQL Server: count distinct every 30 minutes or more

We have an activity database that records user interaction with a website, storing a log that includes values such as Time1, session_id and customer_id, e.g.:
2022-05-12 08:00:00|11|1
2022-05-12 08:20:00|11|1
2022-05-12 08:30:01|11|1
2022-05-12 08:14:00|22|2
2022-05-12 08:18:00|22|2
2022-05-12 08:16:00|33|1
2022-05-12 08:50:00|33|1
I need to have two separate queries:
Query #1: I need to count sessions multiple times if they have a log spanning 30 minutes or more, grouping them by session on a daily basis.
For example: initially count = 0.
For session_id = 11, it starts at 08:00 and the last time with the same session_id is 08:30 -- count = 1.
For session_id = 22, it starts at 08:14 and the last time with the same session_id is 08:18 -- the count stays at 1 since it was less than 30 minutes.
I tried this query, but it didn't work:
select count(session_id)
from table1
where #datetime between Time1 and dateadd(minute, 30, Time1);
Expected result:
Query #2: it's an extension of the above query, where I need the unique customers per day whose sessions were 30 minutes or more.
For example: from the above table I will have two unique customers on May 8th
Expected result
For the Time1 column, the input is in timestamp format; when I show it in the output, I will group it on a daily basis.

This is a two-level aggregation (GROUP BY) problem. You need to start with a subquery to get the first and last timestamp of each session.
SELECT MIN(Time1) start_time,
       MAX(Time1) end_time,
       session_id, customer_id
FROM table1
GROUP BY session_id, customer_id
Next you need to use the subquery like this:
SELECT COUNT(session_id),
       COUNT(DISTINCT customer_id),
       CAST(start_time AS DATE)
FROM (
    SELECT MIN(Time1) start_time,
           MAX(Time1) end_time,
           session_id, customer_id
    FROM table1
    GROUP BY session_id, customer_id
) a
WHERE DATEDIFF(MINUTE, start_time, end_time) >= 30
GROUP BY CAST(start_time AS DATE);
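As a check against the sample rows above: only sessions 11 (08:00:00 to 08:30:01) and 33 (08:16:00 to 08:50:00) span 30 minutes or more, and both belong to customer_id 1, so the query would return a single row for that day in the form count_sessions|count_distinct_customers|day:
2|1|2022-05-12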


Querying data from last 30 days and 24 hours from different tables

I am currently storing data in PostgreSQL that is displayed back to the user in a chart based on the last 24 hours, 30 days and 3 months.
To get the last 24 hours' worth of data, I just run the following code when the user requests it:
SELECT COUNT(*)
FROM page_visits
WHERE page_id = '1111'
AND created_at >= NOW() - '1 day'::INTERVAL
I run a cron job every night to aggregate data for that day and store it in a different table (table_day), which only contains aggregated data.
So when the user requests the last 30 days' worth of data, I can run code similar to the one above to get this month's data. However, it does not include today's data, as that is not yet aggregated and stored in table_day.
So how can I run a query that gets the last 1 month worth of data from table page_visits and aggregated 24 hour data from table_day?
Or is this approach of storing data of different intervals completely wrong?
I intend to do something similar for monthly data, where a cron job runs at the end of every month to aggregate that month's data and stores it in table_month.
And the same question repeats: how can I query previous months' data from table_month and aggregate this month's data from table_day in a single query?
page_visits
id | page_id | created_at
1  | 1111    | 2021-12-02T04:55:26.779Z
2  | 1442    | 2021-12-02T02:25:32.219Z
3  | 1111    | 2021-12-02T04:55:26.214Z
table_day
id | page_id | visit_count | created_at
1  | 1111    | 2001        | 2021-13-02T04:55:26.779Z
2  | 1442    | 103         | 2021-13-02T02:25:32.219Z
3  | 1111    | 4024        | 2021-14-02T04:55:26.214Z
If you aggregate the data into the table_day table at the end of every day, then you can use those aggregates to build the monthly data and store it in a table_month table, and then reuse the monthly data to build the quarterly data in a table_quarter table (assuming table_month and table_quarter have the same shape as table_day).
Every end of day:
-- Aggregate today's raw visits into one row per page (run before midnight).
INSERT INTO table_day (page_id, visit_count, created_at)
SELECT page_id, count(*), date_trunc('day', NOW())
FROM page_visits
WHERE created_at >= date_trunc('day', NOW())
GROUP BY page_id;
Every end of calendar month:
-- Roll the current month's daily rows up into one row per page.
INSERT INTO table_month (page_id, visit_count, created_at)
SELECT page_id, sum(visit_count), date_trunc('month', NOW())
FROM table_day
WHERE created_at >= date_trunc('month', NOW())
GROUP BY page_id;
Every end of calendar quarter:
-- Roll the current quarter's monthly rows up into one row per page.
INSERT INTO table_quarter (page_id, visit_count, created_at)
SELECT page_id, sum(visit_count), date_trunc('quarter', NOW())
FROM table_month
WHERE created_at >= date_trunc('quarter', NOW())
GROUP BY page_id;
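To answer the combined-query part directly: one way (a sketch, assuming the page_visits and table_day schemas shown above) is to UNION ALL the pre-aggregated daily rows with an on-the-fly aggregate of today's raw rows:
-- Last 30 days for one page: pre-aggregated days plus today's not-yet-aggregated visits.
SELECT SUM(visits) AS visits_last_30_days
FROM (
    SELECT visit_count AS visits
    FROM table_day
    WHERE page_id = '1111'
      AND created_at >= NOW() - interval '30 days'
    UNION ALL
    SELECT count(*) AS visits
    FROM page_visits
    WHERE page_id = '1111'
      AND created_at >= date_trunc('day', NOW())
) AS combined;
The same pattern extends to months: take the already-aggregated months from table_month and add the current month's rows from table_day.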

Getting counts for overlapping time periods

I have a table data in PostgreSQL with this structure:
created_at customer_email status
2020-12-31 xxx#gmail.com opened
...
2020-12-24 yyy#gmail.com delivered
2020-12-24 xxx#gmail.com opened
...
2020-12-17 zzz#gmail.com opened
2020-12-10 xxx#gmail.com opened
2020-12-03 hhh#gmail.com enqueued
2020-11-27 xxx#gmail.com opened
...
2020-11-20 rrr#gmail.com opened
2020-11-13 ttt#gmail.com opened
There are many rows for each day.
Basically, for the current week (2021-W01) I need the count of unique emails with status "opened" within the last 90 days, and likewise for every week before that.
Desired output:
period active
2021-W01 1539
2020-W53 1480
2020-W52 1630
2020-W51 1820
2020-W50 1910
2020-W49 1890
2020-W48 2000
How can I do that?
Window functions would come to mind. Alas, those don't allow DISTINCT aggregations.
Instead, get distinct counts from a LATERAL subquery:
WITH weekly_dist AS (
    SELECT DISTINCT date_trunc('week', created_at) AS wk, customer_email
    FROM tbl
    WHERE status = 'opened'
)
SELECT to_char(t.wk, 'YYYY"-W"IW') AS period, ct.active
FROM (
    SELECT generate_series(date_trunc('week', min(created_at) + interval '1 week')
                         , date_trunc('week', now()::timestamp)
                         , interval '1 week') AS wk
    FROM tbl
) t
LEFT JOIN LATERAL (
    SELECT count(DISTINCT customer_email) AS active
    FROM weekly_dist d
    WHERE d.wk >= t.wk - interval '91 days'
    AND d.wk < t.wk
) ct ON true;
I operate with timestamp, not timestamptz, which might make a corner-case difference.
The CTE weekly_dist reduces the set to distinct "opened" emails. This step is strictly optional, but increases performance significantly if there can be more than a few duplicates per week.
The derived table t generates a timestamp for the beginning of each week, from the earliest entry in the table up to "now". This way I make sure no week is skipped, even if there are no rows for it. See:
PostgreSQL: running count of rows for a query 'by minute'
Generating time series between two dates in PostgreSQL
But I do skip the first week since I count active emails before each start of the week.
Then LEFT JOIN LATERAL to a subquery computing the distinct count for the 90-day time-range. To be precise, I deduct 91 days, and exclude the start of the current week. This happens to fall in line with the weekly pre-aggregated data from the CTE. Be wary of that if you shift bounds.
Finally, to_char(t.wk, 'YYYY"-W"IW') is a compact expression to get your desired format for week numbers. Details in the manual here.
You can combine the date_part() function with a group by like this:
SELECT
    DATE_PART('year', created_at)::varchar || '-W' || DATE_PART('week', created_at)::varchar,
    SUM(CASE WHEN status = 'opened' THEN 1 ELSE 0 END)
FROM your_table
GROUP BY 1
ORDER BY 1 DESC

Define user's sessions (sql)

I have an event table (user_id, timestamp). I need to write a query to define a user session (every user can have more than one session and every session can have >= 1 event). A session is considered complete after 30 minutes of user inactivity.
The output table should have the following format: (user_id, start_session, end_session). I wrote part of the query, but I have no idea what to do next.
select
    t.user_id,
    t.ts start_session,
    t.next_ts
from (
    select
        user_id,
        ts,
        DATEDIFF(SECOND, lag(ts, 1) OVER (partition by user_id order by ts), ts) next_ts
    from events_tabl
) t
You want a cumulative sum to identify the sessions and then aggregation:
select user_id, session_id, min(ts) as start_session, max(ts) as end_session
from (select e.*,
             sum(case when prev_ts > dateadd(minute, -30, ts)
                      then 0 else 1
                 end) over (partition by user_id order by ts) as session_id
      from (select e.*,
                   lag(ts) over (partition by user_id order by ts) as prev_ts
            from events_tabl e
           ) e
     ) e
group by user_id, session_id;
Note that I changed the date/time logic from using datediff() to a direct comparison of the times. datediff() counts the number of "boundaries" between two times. So, there is 1 hour between 12:59 a.m. and 1:01 a.m. -- but zero hours between 1:01 a.m. and 1:59 a.m.
Although handling the diffs at the second level produces similar results, you can run into occasions where you are working with seconds or milliseconds -- but the time spans are too long to fit into an integer. Overflow errors. It is just easier to work directly with the date/time values.
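For illustration, a quick sketch of that boundary-counting behavior (arbitrary example dates, not from the question):
SELECT DATEDIFF(HOUR, '2021-01-01 00:59', '2021-01-01 01:01') AS crosses_one_boundary,  -- returns 1
       DATEDIFF(HOUR, '2021-01-01 01:01', '2021-01-01 01:59') AS crosses_no_boundary;   -- returns 0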

How to use BigQuery Analytic Functions to calculate time between timestamped rows?

I have a data set that represents analytics events like:
Row timestamp account_id type
1 2018-11-14 21:05:40 UTC abc start
2 2018-11-14 21:05:40 UTC xyz another_type
3 2018-11-26 22:01:19 UTC xyz start
4 2018-11-26 22:01:23 UTC abc start
5 2018-11-26 22:01:29 UTC xyz some_other_type
11 2018-11-26 22:13:58 UTC xyz start
...
With some number of account_ids. I need to find the average time between start records per account_id.
I'm trying to use analytic functions as described here. My end goal would be a table like:
Row account_id avg_time_between_events_mins
1 xyz 53
2 abc 47
3 pqr 65
...
My best attempt -- based on this post -- looks like this:
WITH events AS (
    SELECT
        COUNTIF(type = 'start' AND account_id = 'abc') OVER (ORDER BY timestamp) AS diff,
        timestamp
    FROM `myproject.dataset.events`
    WHERE account_id = 'abc'
)
SELECT
    min(timestamp) AS start_time,
    max(timestamp) AS next_start_time,
    ABS(timestamp_diff(min(timestamp), max(timestamp), MINUTE)) AS minutes_between
FROM events
GROUP BY diff
This calculates the time between each start event and the last non-start event prior to the next start event for a specific account_id.
I tried to use PARTITION and a WINDOW FRAME CLAUSE like this:
WITH events AS (
    SELECT
        COUNT(*) OVER (PARTITION BY account_id ORDER BY timestamp ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) AS diff,
        timestamp
    FROM `myproject.dataset.events`
    WHERE type = 'start'
)
SELECT
    min(timestamp) AS start_time,
    max(timestamp) AS next_start_time,
    ABS(timestamp_diff(min(timestamp), max(timestamp), MINUTE)) AS minutes_between
FROM events
GROUP BY diff
But I got a nonsense result table. Can anyone walk me through how I would write and reason about a query like this?
You don't really need analytic functions for this:
select account_id,
       timestamp_diff(max(timestamp), min(timestamp), MINUTE) / nullif(count(*) - 1, 0) as avg_time_between_events_mins
from `myproject.dataset.events`
where type = 'start'
group by account_id;
This is the timestamp of the most recent start minus the oldest, divided by one less than the number of starts. That is the average time between starts.
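If you do want to reason about it with an analytic function, as the question asks, one sketch (assuming the same events table) computes each gap to the previous start with lag() and then averages the gaps per account:
select account_id,
       avg(mins_since_prev_start) as avg_time_between_events_mins
from (
    select account_id,
           timestamp_diff(timestamp,
                          lag(timestamp) over (partition by account_id order by timestamp),
                          MINUTE) as mins_since_prev_start
    from `myproject.dataset.events`
    where type = 'start'
) gaps
where mins_since_prev_start is not null  -- first start per account has no previous gap
group by account_id;
This should match the grouped query above up to minute truncation, since the gaps sum to max(timestamp) - min(timestamp) and there are count(*) - 1 of them.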

Count distinct sql and break by day with timestamp over midnight

I have a time series of data that has a trip_id and time stamp. I'm trying to write a SQL query to give me the number of unique trip_id's that occur on one day.
The problem is that trips extend across midnight; as the next day comes, the trip is treated as a new distinct value and counted twice by this code: select date(Timestamp), COUNT(DISTINCT trip_id). Any help or a pointer in the right direction would be very much appreciated.
Data:
trip_id Timestamp
47585 "2015-11-05 09:22:23"
16935 "2015-11-05 12:34:28"
16935 "2015-11-05 20:40:28"
16935 "2015-11-05 23:09:24"
16935 "2015-11-05 23:21:58"
16935 "2015-11-06 00:22:05"
15434 "2015-11-06 21:23:28"
Desired Outcome
date count
2015-11-05 2
2015-11-06 1
Use the minimum of the timestamp for each trip:
select dte, count(*)
from (select trip_id, min(date_trunc('day', timestamp)) as dte
      from t
      group by trip_id
     ) t
group by dte
order by dte;
That is, count each trip on the day it begins. For the sample data, trips 47585 and 16935 begin on 2015-11-05 and trip 15434 begins on 2015-11-06, which matches the desired outcome.