Calculating Fast Product Usage Intervals (Funnel Conversions) in Vertica SQL

I've got a massive dataset (~30 billion rows):
host_id | usr_id | src_id | vis_num | event_ts
where each user belongs to a parent host and can visit a page from a source (src_id), the source being, say, their phone, tablet, or computer (the actual device type is unidentifiable). The column vis_num is the ordinal number of visits per source per user per host. The column event_ts captures the timestamp of each visit per source per user per host. An example data set for one host might look like this:
host_id | usr_id | src_id | vis_num | event_ts
----------------------------------------------------------------
100 | 10 | 05 | 1 | 2017-08-01 14:52:34
100 | 10 | 05 | 1 | 2017-08-01 14:56:00
100 | 10 | 05 | 1 | 2017-08-01 14:58:09
100 | 10 | 05 | 2 | 2017-08-01 17:08:10
100 | 10 | 05 | 2 | 2017-08-01 17:16:07
100 | 10 | 05 | 2 | 2017-08-01 17:23:25
100 | 10 | 72 | 1 | 2017-07-29 20:03:01
100 | 10 | 72 | 1 | 2017-07-29 20:04:10
100 | 10 | 72 | 2 | 2017-07-29 20:45:17
100 | 10 | 72 | 2 | 2017-07-29 20:56:46
100 | 10 | 72 | 3 | 2017-07-30 09:30:15
100 | 10 | 72 | 3 | 2017-07-30 09:34:19
100 | 10 | 72 | 4 | 2017-08-01 18:16:57
100 | 10 | 72 | 4 | 2017-08-01 18:26:00
100 | 10 | 72 | 5 | 2017-08-02 07:53:33
100 | 22 | 43 | 1 | 2017-07-06 11:45:48
100 | 22 | 43 | 1 | 2017-07-06 11:46:12
100 | 22 | 43 | 2 | 2017-07-07 08:41:11
For each source id, a change in visit number implies a log-off time and a subsequent log-on time. Note that activity from different sources may overlap in time.
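In other words, each visit can be collapsed to a log-on/log-off pair; a minimal sketch against the same data (using tableA, the name I use in my attempts below):
-- One row per visit: the first event is the log-on, the last the log-off.
select host_id,
       usr_id,
       src_id,
       vis_num,
       min(event_ts) as logon_ts,
       max(event_ts) as logoff_ts
from tableA
group by host_id, usr_id, src_id, vis_num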
My goal is to calculate how many (non-new) users logged in at least twice within some time interval, say 45 days. My ultimate end goal is:
1) Identify all users who repeated the critical event at least twice within a certain time period (45 days).
2) For those users, measure the length of time they took between completing the event the first and second time.
3) Plot a cumulative distribution function – i.e., the percentage of users who performed the second event over different time intervals.
4) Identify the time interval at which 80% of users have completed the second event; this is your product usage interval. (A sketch of steps 1 and 2 follows the reference below; steps 3 and 4 are sketched after my attempts.)
Page 23 of:
http://usdatavault.com/library/Product-Analytics-Playbook-vol1-Mastering_Retention.pdf
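For steps 1 and 2, I imagine something along these lines (a minimal sketch, not something I have run at scale; it treats the first event_ts of each visit as the critical event and measures each user's gap between their first and second visit):
-- Gap between each user's first and second visit, capped at 45 days.
with visit_starts as (
    -- first event per visit = the critical event
    select host_id,
           usr_id,
           min(event_ts) as visit_start
    from tableA
    group by host_id, usr_id, src_id, vis_num
),
ranked as (
    select host_id,
           usr_id,
           visit_start,
           row_number() over (partition by host_id, usr_id
                              order by visit_start) as rn
    from visit_starts
)
select host_id,
       usr_id,
       max(case when rn = 2 then visit_start end)
         - max(case when rn = 1 then visit_start end) as first_to_second
from ranked
where rn <= 2
group by host_id, usr_id
having count(*) = 2
   and max(case when rn = 2 then visit_start end)
         - max(case when rn = 1 then visit_start end) <= interval '45 days'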
Here is what I've tried:
with new_users as (
    select host_id || ' ' || usr_id as host_usr_id,
           min(event_ts) as first_login_date
    from tableA
    group by 1
),
time_diffs as (
    select a.host_id || ' ' || a.usr_id as host_usr_id,
           a.usr_id,
           a.src_id,
           a.event_ts,
           a.vis_num,
           b.first_login_date,
           case when lag(a.vis_num) over
                       (partition by a.host_id, a.usr_id, a.src_id
                        order by a.event_ts) <> a.vis_num
                then a.event_ts - lag(a.event_ts) over
                       (partition by a.host_id, a.usr_id, a.src_id
                        order by a.event_ts)
                else null
           end as time_diff
    from tableA a
    left join new_users b
      on b.host_usr_id = a.host_id || ' ' || a.usr_id
    where a.event_ts > current_date - interval '45 days'
      and a.event_ts > b.first_login_date + interval '45 days'
)
select count(distinct case when time_diff < interval '45 days'
                            and event_ts > first_login_date + interval '45 days'
                           then host_usr_id end) as cnt_45
from time_diffs
I've tried multiple other (very different) queries (see below), but performance is definitely an issue here. Joining on date intervals is also a new concept to me. Any help is appreciated.
Another approach:
with new_users as (
    select host_id,
           usr_id,
           min(event_ts) as first_login_date
    from tableA
    group by 1, 2
),
x_day_twice as (
    select a.host_id,
           a.usr_id,
           a.src_id,
           max(a.vis_num) - min(a.vis_num) + 1 as num_logins
    from tableA a
    left join new_users b
      on a.host_id || ' ' || a.usr_id = b.host_id || ' ' || b.usr_id
     and a.event_ts > b.first_login_date + interval '45 days'
    where a.event_ts >= current_timestamp - interval '1 day' - interval '45 days'
      and b.first_login_date < current_date - 1 - 45
    group by 1, 2, 3
)
select count(distinct case when num_logins > 1
                           then host_id || ' ' || usr_id end)
from x_day_twice
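For steps 3 and 4, given per-user gaps (user_gaps below is a hypothetical relation holding the first_to_second interval per user, e.g. the output of a query like the sketch above without the 45-day cap), I picture the cumulative distribution and the 80% cutoff roughly like this:
-- CDF over days-to-second-visit, and the first day where it reaches 80%.
with gaps_days as (
    select ceil(extract(epoch from first_to_second) / 86400.0) as days_to_second
    from user_gaps  -- hypothetical: one row per user who repeated the event
),
cdf as (
    select days_to_second,
           sum(count(*)) over (order by days_to_second) * 1.0
             / sum(count(*)) over () as cum_pct  -- cumulative share of users
    from gaps_days
    group by days_to_second
)
select min(days_to_second) as product_usage_interval
from cdf
where cum_pct >= 0.8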

Related

Calculating values from the same table but a different period

I have a readings table with the following definition.
Column | Type | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+--------------------------------------
id | integer | | not null | nextval('readings_id_seq'::regclass)
created_at | timestamp without time zone | | not null |
type | character varying(50) | | not null |
device | character varying(100) | | not null |
value | numeric | | not null |
It has data such as:
id | created_at | type | device | value
----+---------------------+--------+--------+-------
1 | 2021-05-11 04:00:00 | weight | 1 | 100
2 | 2021-05-10 03:00:00 | weight | 2 | 120
3 | 2021-05-10 04:00:00 | weight | 1 | 120
4 | 2021-05-10 03:00:00 | weight | 1 | 124
5 | 2021-05-01 22:43:47 | weight | 1 | 130
6 | 2021-05-01 15:00:48 | weight | 1 | 140
7 | 2021-05-01 13:00:48 | weight | 2 | 160
Desired Output
Given a device and a type, I would like the max and min value from the past 7 days for each matched row (the row itself excluded). If there's nothing in the past 7 days, then it should be 0.
id | created_at | type | device | value | min | max
----+---------------------+--------+--------+-------+-----+-----
1 | 2021-05-11 04:00:00 | weight | 1 | 100 | 120 | 124
3 | 2021-05-10 04:00:00 | weight | 1 | 120 | 124 | 124
4 | 2021-05-10 03:00:00 | weight | 1 | 124 | 0 | 0
5 | 2021-05-01 22:43:47 | weight | 1 | 130 | 140 | 140
6 | 2021-05-01 15:00:48 | weight | 1 | 140 | 0 | 0
I have created a db-fiddle.
You can use a lateral left join for this requirement, like below:
select t1.id,
       t1.created_at,
       t1.type,
       t1.device,
       t1.value,
       min(coalesce(t2.value, 0)),
       max(coalesce(t2.value, 0))
from readings t1
left join lateral (
    select *
    from readings
    where id != t1.id
      and created_at between t1.created_at - interval '7 day' and t1.created_at
      and device = t1.device
      and type = t1.type
) t2 on true
where t1.device = '1'    -- change the device
  and t1.type = 'weight' -- change the type
group by 1, 2, 3, 4, 5
order by 1
Considering the comments, here is the PostgreSQL query:
select readings.id, readings.type, readings.device, readings.created_at, readings.value,
min(COALESCE(m_readings.value,0)) min, max(COALESCE(m_readings.value,0)) max
from readings LEFT JOIN readings m_readings
ON m_readings.type =readings.type
AND m_readings.device =readings.device
AND m_readings.id > readings.id
AND date( m_readings.created_at) between (date(readings.created_at)-7) and date(readings.created_at)
group by readings.id, readings.type, readings.device, readings.created_at, readings.value
order by readings.id;
Explanation: we make a LEFT JOIN between each record of readings and the other records of readings that have the same type and device but not the same id, keeping only the records from the previous 7 days. Then, for each type/device, we group to get the max and min value over those 7 days.
You should be using window functions for this!
select r.*,
max(value) over (partition by device, type
order by created_at
range between interval '7 day' preceding and interval '1 second' preceding
),
min(value) over (partition by device, type
order by created_at
range between interval '7 day' preceding and interval '1 second' preceding
)
from readings r;
The above returns NULL values when there are no values -- and that makes more sense to me than 0. But if you really want 0, just use COALESCE():
select r.*,
coalesce(max(value) over (partition by device, type
order by created_at
range between interval '7 day' preceding and interval '1 second' preceding
), 0),
coalesce(min(value) over (partition by device, type
order by created_at
range between interval '7 day' preceding and interval '1 second' preceding
), 0)
from readings r;
In addition to being more concise, this is easier to read and should have better performance than other methods.

I get different results using between condition with dates including milliseconds precision in MariaDB

I have the following Sales table with columns productName Varchar(50) and saleDate bigint(20).
Let's suppose it stores 100 records.
Examples:
productName saleDate
----------- ------------
TV 1601555334694
Radio 1603585354888
...
The saleDate column stores a Unix timestamp with millisecond precision. Then I run the following queries to get the number of sales in October 2020:
-- This returns a result of 70
SELECT COUNT(*)
FROM (
    SELECT productName,
           DATE_FORMAT(SUBSTRING(DATE_ADD((FROM_UNIXTIME(SUBSTRING(saleDate, 1, 10))
                                             - INTERVAL (10 + 5*60) MINUTE),
                                          INTERVAL SUBSTRING(saleDate, 11, 13) SECOND_MICROSECOND),
                                 1, 23),
                       '%Y-%m-%d 00:00:00') AS saleDate
    FROM Sales
) s
WHERE s.saleDate between '2020-10-01 00:00:00' and '2020-10-31 23:59:59.999'
-- This returns a result of 20
SELECT COUNT(*)
FROM (
    SELECT productName,
           DATE_FORMAT(SUBSTRING(DATE_ADD((FROM_UNIXTIME(SUBSTRING(saleDate, 1, 10))
                                             - INTERVAL (10 + 5*60) MINUTE),
                                          INTERVAL SUBSTRING(saleDate, 11, 13) SECOND_MICROSECOND),
                                 1, 23),
                       '%Y-%m-%d 00:00:00') AS saleDate
    FROM Sales
) s
WHERE s.saleDate between '2020-10-01 00:00:00.000' and '2020-10-31 23:59:59.999'
I'm subtracting 5 hours and 10 minutes from the date in both queries; it's a system requirement. So when I filter the begin date with .000, the result changes. Shouldn't the result be 70 in both cases? I'm using MariaDB 10.2.13.
I would recommend translating the interval bounds to unix timestamps instead of proceeding the other way around. This is simpler, and much more efficient: the where predicate is SARGable (meaning that it may take advantage of an index on saledate), while, in your original query, the entire column needs to be converted before it can be filtered.
Also, using half-open intervals saves you from handling the trailing milliseconds.
So:
select count(*)
from sales
where saledate >= unix_timestamp('2020-10-01') * 1000
and saledate < unix_timestamp('2020-11-01') * 1000
If you want to offset by 5 hours and 10 minutes, then that's simple arithmetic:
select count(*)
from sales
where saledate >= (unix_timestamp('2020-10-01') + 5 * 60 * 60 + 10 * 60) * 1000
and saledate < (unix_timestamp('2020-11-01') + 5 * 60 * 60 + 10 * 60) * 1000
The problem is that your DATE_FORMAT() format string doesn't include the milliseconds. So if the formatted value of saleDate is exactly 2020-10-01 00:00:00, it won't satisfy the BETWEEN condition, because the string 2020-10-01 00:00:00 sorts before 2020-10-01 00:00:00.000.
Either add the milliseconds to the format string ('%Y-%m-%d 00:00:00.000'), or remove the milliseconds from the times you use in BETWEEN.
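A minimal demonstration of the string comparison at play (DATE_FORMAT() returns a string, so the BETWEEN above compares strings, not datetimes):
-- '2020-10-01 00:00:00' is a proper prefix of '2020-10-01 00:00:00.000',
-- so it sorts before it and fails the >= side of BETWEEN:
SELECT '2020-10-01 00:00:00' >= '2020-10-01 00:00:00.000' AS with_millis,    -- 0
       '2020-10-01 00:00:00' >= '2020-10-01 00:00:00'     AS without_millis; -- 1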
Just something to think about: in the subquery the formatted string is aliased as saleDate, shadowing the original column, so later references bind to the alias rather than the base column...
SELECT * FROM ints;
+---+
| i |
+---+
| 0 |
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
+---+
SELECT 99-i j FROM ints ORDER BY i;
+----+
| j |
+----+
| 99 |
| 98 |
| 97 |
| 96 |
| 95 |
| 94 |
| 93 |
| 92 |
| 91 |
| 90 |
+----+
SELECT 99-i i FROM ints ORDER BY i;
+----+
| i |
+----+
| 90 |
| 91 |
| 92 |
| 93 |
| 94 |
| 95 |
| 96 |
| 97 |
| 98 |
| 99 |
+----+

Pad rows with a default value if not found in PostgreSQL

I want to return the last 7 days of user_activity, but for the empty days I want to add 0 as the value.
Say I have this table
actions | id  | date
----------------------------
67      | 123 | 2019-07-07
90      | 123 | 2019-07-09
100     | 123 | 2019-07-10
50      | 123 | 2019-07-13
30      | 123 | 2019-07-15
and this should be the expected output for the last 7 days:
actions | id  | date
----------------------------
90      | 123 | 2019-07-09
100     | 123 | 2019-07-10
0       | 123 | 2019-07-11  <--- padded
0       | 123 | 2019-07-12  <--- padded
50      | 123 | 2019-07-13
0       | 123 | 2019-07-14  <--- padded
30      | 123 | 2019-07-15
Here is my query so far; I can only get the last 7 days, and I'm not sure how to add the default values:
SELECT *
FROM user_activity
WHERE action_day > CURRENT_DATE - INTERVAL '7 days'
ORDER BY uid, action_day
You may left join your table with generate_series. First you need a set with one row per distinct id and day; that set can then be left joined with the main table.
WITH days AS (
    SELECT id, dt
    FROM (SELECT DISTINCT id FROM user_activity) AS ids
    CROSS JOIN generate_series(
        CURRENT_DATE - interval '7 days',
        CURRENT_DATE,
        interval '1 day') AS dt
)
SELECT coalesce(u.actions, 0) AS actions,
       d.id,
       d.dt
FROM days d
LEFT JOIN user_activity u
       ON u.id = d.id AND u.action_day = d.dt

Query to get maximum value based on timestamp every 4 hour

I have a SQL table that stores data every 15 minutes, but I want to fetch the maximum value in each 4-hour window.
This is my Actual table:
+----+----+----+-------------------------+
| Id | F1 | F2 | timestamp |
+----+----+----+-------------------------+
| 1 | 24 | 30 | 2019-03-25 12:15:00.000 |
| 2 | 22 | 3 | 2019-03-25 12:30:00.000 |
| 3 | 2 | 4 | 2019-03-25 12:45:00.000 |
| 4 | 5 | 35 | 2019-03-25 13:00:00.000 |
| 5 | 18 | 23 | 2019-03-25 13:15:00.000 |
| ' | ' | ' | ' |
| 16 | 21 | 34 | 2019-03-25 16:00:00.000 |
+----+----+----+-------------------------+
The Output I am looking for is:
+----+----+----+
| Id | F1 | F2 |
+----+----+----+
| 1  | 24 | 35 |  <- 1st 4 hours
| 2  | 35 | 25 |  <- next 4 hours
+----+----+----+
I used this query:
select max(F1) as F1,
max(F2) as F2
from table
where timestamp>='2019/3/26 12:00:01'
and timestamp<='2019/3/26 16:00:01'
and it returns the max for the first 4 hours, but when I widen the range from 4 to 8 hours it still gives me one max value rather than two (one per 4 hours).
I did try with the group by clause but wasn't able to get the expected result.
This should work
SELECT Max(f1),
Max(f2), datepart(hh,timestamp), convert(date,timestamp)
FROM TABLE
WHERE datepart(hh,timestamp)%4 = 0
AND timestamp>='2019/3/26 12:00:01'
AND timestamp<='2019/3/26 16:00:01'
GROUP BY datepart(hh,timestamp), convert(date,timestamp)
ORDER BY convert(date,timestamp) asc
Here is a relatively simple method:
select convert(date, timestamp) as dte,
(datepart(hour, timestamp) / 4) * 4 as hour,
max(F1) as F1,
max(F2) as F2
from table
group by convert(date, timestamp), (datepart(hour, timestamp) / 4) * 4;
This puts the date and hour into separate columns; you can use dateadd() to put them in one column.
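For instance, a sketch of that single-column variant (same placeholder table/column names as the query above):
-- dateadd() rebuilds one bucket-start timestamp per 4-hour group.
select dateadd(hour,
               (datepart(hour, timestamp) / 4) * 4,
               cast(convert(date, timestamp) as datetime)) as bucket_start,
       max(F1) as F1,
       max(F2) as F2
from table
group by dateadd(hour,
                 (datepart(hour, timestamp) / 4) * 4,
                 cast(convert(date, timestamp) as datetime));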
Try this query:
declare #startingDatetime datetime = '2017-10-04 12:00:00';
select grp, max(F1) F1, max(F2) F2
from (
select datediff(hour, #startingDatetime, [timestamp]) / 4 grp, *
from MyTable
where [timestamp] > #startingDatetime
) a group by grp

Querying all past and future round birthdays

I have the birthdates of users in a table and want to display a list of round birthdays for the next n years (starting from an arbitrary date x), which looks like this:
+----------------------------------------------------------------------------------------+
| Name | id | birthdate | current_age | birthday | year | month | day | age_at_date |
+----------------------------------------------------------------------------------------+
| User 1 | 1 | 1958-01-23 | 59 | 2013-01-23 | 2013 | 1 | 23 | 55 |
| User 2 | 2 | 1988-01-29 | 29 | 2013-01-29 | 2013 | 1 | 29 | 25 |
| User 3 | 3 | 1963-02-12 | 54 | 2013-02-12 | 2013 | 2 | 12 | 50 |
| User 1 | 1 | 1958-01-23 | 59 | 2018-01-23 | 2018 | 1 | 23 | 60 |
| User 2 | 2 | 1988-01-29 | 29 | 2018-01-29 | 2018 | 1 | 29 | 30 |
| User 3 | 3 | 1963-02-12 | 54 | 2018-02-12 | 2018 | 2 | 12 | 55 |
| User 1 | 1 | 1958-01-23 | 59 | 2023-01-23 | 2023 | 1 | 23 | 65 |
| User 2 | 2 | 1988-01-29 | 29 | 2023-01-29 | 2023 | 1 | 29 | 35 |
| User 3 | 3 | 1963-02-12 | 54 | 2023-02-12 | 2023 | 2 | 12 | 60 |
+----------------------------------------------------------------------------------------+
As you can see, I want to "wrap around" and not only show the next upcoming round birthday, which is easy, but also historical and far-future data.
The core idea of my current approach is the following: via generate_series I generate all dates from 1900 to 2100 and join them to the users by matching the day and month of the birthdate. Based on that, I calculate the age at each date and finally select only those birthdays that are round (divisible by 5) and yield a nonnegative age.
WITH
test_users(id, name, birthdate) AS (
VALUES
(1, 'User 1', '23-01-1958' :: DATE),
(2, 'User 2', '29-01-1988'),
(3, 'User 3', '12-02-1963')
),
dates AS (
SELECT
s AS date,
date_part('year', s) AS year,
date_part('month', s) AS month,
date_part('day', s) AS day
FROM generate_series('01-01-1900' :: TIMESTAMP, '01-01-2100' :: TIMESTAMP, '1 days' :: INTERVAL) AS s
),
birthday_data AS (
SELECT
id AS member_id,
test_users.birthdate AS birthdate,
(date_part('year', age((test_users.birthdate)))) :: INT AS current_age,
date :: DATE AS birthday,
date_part('year', date) AS year,
date_part('month', date) AS month,
date_part('day', date) AS day,
ROUND(extract(EPOCH FROM (dates.date - birthdate)) / (60 * 60 * 24 * 365)) :: INT AS age_at_date
FROM test_users, dates
WHERE
dates.day = date_part('day', birthdate) AND
dates.month = date_part('month', birthdate) AND
dates.year >= date_part('year', birthdate)
)
SELECT
test_users.name,
bd.*
FROM test_users
LEFT JOIN birthday_data bd ON bd.member_id = test_users.id
WHERE
bd.age_at_date % 5 = 0 AND
bd.birthday BETWEEN NOW() - INTERVAL '5' YEAR AND NOW() + INTERVAL '10' YEAR
ORDER BY bd.birthday;
My current approach seems to be very inefficient and rather complicated: it takes >100 ms. Does anybody have an idea for a more compact and performant query? I am using PostgreSQL 9.5.3. Thank you!
Maybe try to join with generate_series:
create table bday(id serial, name text, dob date);
insert into bday (name, dob) values ('a', '08-21-1972'::date);
insert into bday (name, dob) values ('b', '03-20-1974'::date);
select * from bday ,
lateral( select generate_series( (1950-y)/5 , (2010-y)/5)*5 + y as year
from (select date_part('year',dob)::integer as y) as t2
) as t1;
For each entry, this will generate the round-birthday years between 1950 and 2010. You can add a where clause to exclude people born after 2010 (they can't have a round birthday in range), or people born before 1850 (they are unlikely...).
--
Edit (after your edit):
So your generate_series creates 365 or so rows per year; over 100 years that is more than 30,000, and they get joined to each user (3 users => about 100,000 rows). My query generates only the rows for the years needed: in 100 years that is 20 rows, i.e. 20 rows per user.
Dividing by 5 ensures that the series starts at a round birthday: (1950-y)/5 counts how many round birthdays fall before 1950. A person born in 1941 needs to skip 1941 and 1946, but has a round birthday in 1951; that is the difference (9 years) divided by 5, plus 1 to account for the 0th step. If the person is born after 1950 that number is negative, and greatest(-1, ...) + 1 gives 0, starting the series at the actual birth year.
But actually it should be
select * from bday ,
lateral( select generate_series( greatest(-1,(1950-y)/5)+1, (2010-y)/5)*5 + y as year
from (select date_part('year',dob)::integer as y) as t2
) as t1;
(you may use greatest(0, ...) + 1 instead if you want to start at age 5 rather than 0)
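A quick worked check of that arithmetic for someone born in 1941 (plain integer division, as Postgres does for integers):
-- (1950-1941)/5 = 9/5 = 1 round birthday skipped before 1950;
-- greatest(-1, 1) + 1 = 2, and 2*5 + 1941 = 1951, the first year in range.
select greatest(-1, (1950 - 1941) / 5) + 1              as first_step, -- 2
       (greatest(-1, (1950 - 1941) / 5) + 1) * 5 + 1941 as first_year, -- 1951
       ((2010 - 1941) / 5) * 5 + 1941                   as last_year;  -- 2006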