We have an activity database that records user interactions with a website, storing a log with columns such as Time1, session_id and customer_id, e.g.:
2022-05-12 08:00:00|11|1
2022-05-12 08:20:00|11|1
2022-05-12 08:30:01|11|1
2022-05-12 08:14:00|22|2
2022-05-12 08:18:00|22|2
2022-05-12 08:16:00|33|1
2022-05-12 08:50:00|33|1
I need to have two separate queries:
Query #1: I need a daily count of sessions whose logged activity spans 30 minutes or more (the same customer can contribute several sessions).
For example: Initially count=0
For session_id = 11, it starts at 08:00 and the last log with the same session_id is at 08:30:01 -- count=1
For session_id = 22, it starts at 08:14 and the last log with the same session_id is at 08:18 -- still count=1, since the span is less than 30 minutes
I tried this query, but it didn't work
select
count(session_id)
from
table1
where
#datetime between Time1 and dateadd(minute, 30, Time1);
Expected result:
Query #2: it's an extension of the above query, where I need the number of unique customers per day whose sessions were 30 minutes or more.
For example: from the above table I will have one unique customer on May 12th, since both qualifying sessions (11 and 33) belong to customer_id = 1.
Expected result
The Time1 column is stored as a timestamp; in the output I group it on a daily basis.
This is a two-level aggregation (GROUP BY) problem. You need to start with a subquery to get the first and last timestamp of each session.
SELECT MIN(Time1) AS start_time,
       MAX(Time1) AS end_time,
       session_id, customer_id
FROM table1
GROUP BY session_id, customer_id
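Against the sample data, this inner query would yield:
start_time          end_time            session_id customer_id
2022-05-12 08:00:00 2022-05-12 08:30:01 11         1
2022-05-12 08:14:00 2022-05-12 08:18:00 22         2
2022-05-12 08:16:00 2022-05-12 08:50:00 33         1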
Next you need to use the subquery like this:
SELECT COUNT(session_id) AS sessions,
       COUNT(DISTINCT customer_id) AS unique_customers,
       CAST(start_time AS DATE) AS day
FROM (
    SELECT MIN(Time1) AS start_time,
           MAX(Time1) AS end_time,
           session_id, customer_id
    FROM table1
    GROUP BY session_id, customer_id
) a
WHERE DATEDIFF(MINUTE, start_time, end_time) >= 30
GROUP BY CAST(start_time AS DATE);
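Against the sample data, sessions 11 (30 minutes) and 33 (34 minutes) qualify while session 22 (4 minutes) does not, so this returns 2 sessions and 1 unique customer for 2022-05-12.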
I am currently storing data in PostgreSQL that is displayed back to the user in a chart based on the last 24 hours, 30 days and 3 months.
To get the last 24 hours' worth of data, I just run the following query when the user requests it:
SELECT COUNT(*)
FROM page_visits
WHERE page_id = '1111'
AND created_at >= NOW() - '1 day'::INTERVAL
I run a cron job every night to aggregate that day's data and store it in a different table (table_day), which only contains aggregated data.
So when the user requests the last 30 days' worth of data, I can run a query similar to the one above against table_day to get this month's data. However, it does not include today's data, as today has not yet been aggregated and stored in table_day.
So how can I run a single query that combines the last month's worth of aggregated data from table_day with today's not-yet-aggregated data from page_visits?
Or is this approach of storing data at different intervals completely wrong?
I intend to do something similar for monthly data, where a cron job runs at the end of every month to aggregate that month's data and store it in table_month.
And the same question repeats: how can I, in a single query, read the two previous months from table_month and aggregate the current month's data from table_day?
page_visits
id  page_id  created_at
1   1111     2021-12-02T04:55:26.779Z
2   1442     2021-12-02T02:25:32.219Z
3   1111     2021-12-02T04:55:26.214Z
Table_day
id  page_id  visit_count  created_at
1   1111     2001         2021-12-13T04:55:26.779Z
2   1442     103          2021-12-13T02:25:32.219Z
3   1111     4024         2021-12-14T04:55:26.214Z
If you aggregate the data into the table_day table at the end of every day, then you can use those daily aggregates to calculate the monthly data and store it in the table_month table, and then you can reuse the monthly data to calculate the quarterly data and store it in the table_quarter table.
Every end of day:
INSERT INTO table_day
SELECT page_id, date_trunc('day', now()) AS day, count(*) AS count
FROM page_visits
WHERE created_at >= date_trunc('day', now())
GROUP BY page_id;
Every end of calendar month:
INSERT INTO table_month
SELECT page_id, date_trunc('month', now()) AS month, sum(count) AS count
FROM table_day
WHERE day >= date_trunc('month', now())
GROUP BY page_id;
Every end of calendar quarter:
INSERT INTO table_quarter
SELECT page_id, extract(year from now()) || '-' || extract(quarter from now()) AS quarter, sum(count) AS count
FROM table_month
WHERE month >= date_trunc('quarter', now())
GROUP BY page_id;
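To answer the question of combining the pre-aggregated days with today's raw rows, one option is a UNION ALL over both tables. A minimal sketch, assuming the table_day columns shown above (visit_count, created_at) and summing a single page's visits over the last 30 days:

SELECT sum(cnt) AS visits_last_30_days
FROM (
    -- full days already aggregated by the nightly cron job
    SELECT visit_count AS cnt
    FROM table_day
    WHERE page_id = '1111'
      AND created_at >= date_trunc('day', now()) - interval '30 days'
    UNION ALL
    -- today's rows, not yet aggregated
    SELECT count(*) AS cnt
    FROM page_visits
    WHERE page_id = '1111'
      AND created_at >= date_trunc('day', now())
) combined;

The same pattern extends to the quarterly case by unioning table_month with table_day.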
I have a table of data in PostgreSQL with this structure:
created_at  customer_email  status
2020-12-31  xxx@gmail.com   opened
...
2020-12-24  yyy@gmail.com   delivered
2020-12-24  xxx@gmail.com   opened
...
2020-12-17  zzz@gmail.com   opened
2020-12-10  xxx@gmail.com   opened
2020-12-03  hhh@gmail.com   enqueued
2020-11-27  xxx@gmail.com   opened
...
2020-11-20  rrr@gmail.com   opened
2020-11-13  ttt@gmail.com   opened
There are many rows for each day.
Basically I need a row 2021-W01 for the current week with the count of unique emails that had status "opened" within the last 90 days, and likewise for every week before that.
Desired output:
period active
2021-W01 1539
2020-W53 1480
2020-W52 1630
2020-W51 1820
2020-W50 1910
2020-W49 1890
2020-W48 2000
How can I do that?
Window functions would come to mind. Alas, those don't allow DISTINCT aggregations.
Instead, get distinct counts from a LATERAL subquery:
WITH weekly_dist AS (
SELECT DISTINCT date_trunc('week', created_at) AS wk, customer_email
FROM tbl
WHERE status = 'opened'
)
SELECT to_char(t.wk, 'YYYY"-W"IW') AS period, ct.active
FROM (
SELECT generate_series(date_trunc('week', min(created_at) + interval '1 week')
, date_trunc('week', now()::timestamp)
, interval '1 week') AS wk
FROM tbl
) t
LEFT JOIN LATERAL (
SELECT count(DISTINCT customer_email) AS active
FROM weekly_dist d
WHERE d.wk >= t.wk - interval '91 days'
AND d.wk < t.wk
) ct ON true;
I operate with timestamp, not timestamptz, which might make a corner-case difference.
The CTE weekly_dist reduces the set to distinct "opened" emails per week. This step is strictly optional, but it increases performance significantly if there can be more than a few duplicates per week.
The derived table t generates a timestamp for the beginning of each week since the earliest entry in the table up to "now". This way I make sure no week is skipped, even if there are no rows for it. See:
PostgreSQL: running count of rows for a query 'by minute'
Generating time series between two dates in PostgreSQL
But I do skip the first week since I count active emails before each start of the week.
Then LEFT JOIN LATERAL to a subquery computing the distinct count for the 90-day time-range. To be precise, I deduct 91 days, and exclude the start of the current week. This happens to fall in line with the weekly pre-aggregated data from the CTE. Be wary of that if you shift bounds.
Finally, to_char(t.wk, 'YYYY"-W"IW') is a compact expression to get your desired format for week numbers. Details in the manual.
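For example, applied to the Monday that starts ISO week 1 of 2021:

SELECT to_char(timestamp '2021-01-04', 'YYYY"-W"IW');  -- 2021-W01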
You can combine the date_part() function with a GROUP BY like this:
SELECT
    DATE_PART('year', created_at)::varchar || '-W' || DATE_PART('week', created_at)::varchar AS period,
    SUM(CASE WHEN status = 'opened' THEN 1 ELSE 0 END) AS active
FROM
    your_table
GROUP BY 1
ORDER BY 1 DESC
Note that this counts all "opened" rows per calendar week; unlike the query above, it neither deduplicates emails nor applies the trailing 90-day window.
I have an event table (user_id, timestamp). I need to write a query that defines user sessions: every user can have more than one session, every session can have >= 1 event, and a gap of 30 minutes of inactivity completes a session.
The output table should have the following format: (user_id, start_session, end_session). I wrote part of the query, but I have no idea what to do next.
select t.user_id,
       t.ts start_session,
       t.next_ts
from (select user_id,
             ts,
             DATEDIFF(SECOND, lag(ts, 1) OVER (partition by user_id order by ts), ts) next_ts
      from events_tabl
     ) t
You want a cumulative sum to identify the sessions and then aggregation:
select user_id, session_id, min(ts) as start_session, max(ts) as end_session
from (select e.*,
             sum(case when prev_ts > dateadd(minute, -30, ts)
                      then 0 else 1
                 end) over (partition by user_id order by ts) as session_id
      from (select e.*,
                   lag(ts) over (partition by user_id order by ts) as prev_ts
            from events_tabl e
           ) e
     ) e
group by user_id, session_id;
Note that I changed the date/time logic from datediff() to a direct comparison of the times. datediff() counts the number of "boundaries" crossed between two times, so there is 1 hour between 12:59 a.m. and 1:01 a.m., but zero hours between 1:01 a.m. and 1:59 a.m.
Although handling the diffs at the second level produces similar results, you can run into occasions where you are working with seconds or milliseconds but the time spans are too long to fit into an integer, which causes overflow errors. It is just easier to work directly with the date/time values.
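A minimal illustration of the boundary counting (SQL Server syntax, with made-up times):

select datediff(hour, '2022-01-01 00:59', '2022-01-01 01:01');  -- 1: one hour boundary is crossed
select datediff(hour, '2022-01-01 01:01', '2022-01-01 01:59');  -- 0: no boundary is crossed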
I have a data set that represents analytics events like:
Row timestamp account_id type
1 2018-11-14 21:05:40 UTC abc start
2 2018-11-14 21:05:40 UTC xyz another_type
3 2018-11-26 22:01:19 UTC xyz start
4 2018-11-26 22:01:23 UTC abc start
5 2018-11-26 22:01:29 UTC xyz some_other_type
11 2018-11-26 22:13:58 UTC xyz start
...
There are some number of account_ids, and I need to find the average time between start records per account_id.
I'm trying to use analytic functions as described here. My end goal would be a table like:
Row account_id avg_time_between_events_mins
1 xyz 53
2 abc 47
3 pqr 65
...
My best attempt, based on this post, looks like this:
WITH
events AS (
SELECT
COUNTIF(type = 'start' AND account_id='abc') OVER (ORDER BY timestamp) as diff,
timestamp
FROM
`myproject.dataset.events`
WHERE
account_id='abc')
SELECT
min(timestamp) AS start_time,
max(timestamp) AS next_start_time,
ABS(timestamp_diff(min(timestamp), max(timestamp), MINUTE)) AS minutes_between
FROM
events
GROUP BY
diff
This calculates the time between each start event and the last non-start event prior to the next start event for a specific account_id.
I tried to use PARTITION and a WINDOW FRAME CLAUSE like this:
WITH
events AS (
SELECT
COUNT(*) OVER (PARTITION BY account_id ORDER BY timestamp ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) as diff,
timestamp
FROM
`myproject.dataset.events`
WHERE
type = 'start')
SELECT
min(timestamp) AS start_time,
max(timestamp) AS next_start_time,
ABS(timestamp_diff(min(timestamp), max(timestamp), MINUTE)) AS minutes_between
FROM
events
GROUP BY
diff
But I got a nonsense result table. Can anyone walk me through how I would write and reason about a query like this?
You don't really need analytic functions for this:
select account_id,
       timestamp_diff(max(timestamp), min(timestamp), MINUTE) / nullif(count(*) - 1, 0) as avg_time_between_events_mins
from `myproject.dataset.events`
where type = 'start'
group by account_id;
This is the timestamp of the most recent start minus that of the oldest, divided by one less than the number of starts. That is the average time between the starts.
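For example, account xyz in the sample data has start events at 2018-11-26 22:01:19 and 2018-11-26 22:13:58, so the result is 12 minutes divided by (2 - 1), i.e. 12.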
I have a time series of data with a trip_id and a timestamp. I'm trying to write a SQL query to give me the number of unique trip_ids that occur on each day.
The problem is that trips extend across midnight; as the next day comes, the trip is treated as a new distinct value and counted twice by select date(Timestamp), COUNT(DISTINCT trip_id). Any help or a pointer in the right direction would be very much appreciated.
Data:
trip_id Timestamp
47585 "2015-11-05 09:22:23"
16935 "2015-11-05 12:34:28"
16935 "2015-11-05 20:40:28"
16935 "2015-11-05 23:09:24"
16935 "2015-11-05 23:21:58"
16935 "2015-11-06 00:22:05"
15434 "2015-11-06 21:23:28"
Desired Outcome
date count
2015-11-05 2
2015-11-06 1
Use the minimum of the timestamp for each trip:
select dte, count(*)
from (select trip_id, min(date_trunc('day', timestamp)) as dte
from t
group by trip_id
) t
group by dte
order by dte;
That is, count each trip on the day it begins.
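For the sample data, trip 16935 begins on 2015-11-05 and is counted only there; together with trips 47585 (2015-11-05) and 15434 (2015-11-06), that yields the desired counts of 2 and 1.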