Pad rows with a default value if not found, PostgreSQL - sql

I want to return the last 7 days of user_activity, but for the days with no activity I want 0 as the value.
Say I have this table:
actions | id | date
------------------------
67 | 123 | 2019-07-07
90 | 123 | 2019-07-09
100 | 123 | 2019-07-10
50 | 123 | 2019-07-13
30 | 123 | 2019-07-15
and this should be the expected output for the last 7 days:
actions | id | date
------------------------
90 | 123 | 2019-07-09
100 | 123 | 2019-07-10
0 | 123 | 2019-07-11 <--- padded
0 | 123 | 2019-07-12 <--- padded
50 | 123 | 2019-07-13
0 | 123 | 2019-07-14 <--- padded
30 | 123 | 2019-07-15
Here is my query so far; I can get the last 7 days,
but I'm not sure how to add the default values:
SELECT *
FROM user_activity
WHERE action_day > CURRENT_DATE - INTERVAL '7 days'
ORDER BY uid, action_day

You may LEFT JOIN your table with generate_series. First you need a set that pairs every distinct id with every date in the range; that set can then be joined against the main table.
WITH days AS (
    SELECT id, dt
    FROM (
        SELECT DISTINCT id FROM user_activity
    ) AS ids
    CROSS JOIN generate_series(
        CURRENT_DATE - interval '7 days',
        CURRENT_DATE, interval '1 day') AS dt
)
SELECT
    coalesce(u.actions, 0) AS actions,
    d.id,
    d.dt
FROM days d
LEFT JOIN user_activity u ON u.id = d.id AND u.action_day = d.dt
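Note that a series from CURRENT_DATE - interval '7 days' through CURRENT_DATE inclusive produces 8 dates, and generate_series with interval arguments yields timestamps rather than dates. If you want exactly the last 7 days as plain dates, a small variant of the same query (still assuming the same user_activity columns) is:
WITH days AS (
    SELECT ids.id, d.dt::date AS dt
    FROM (SELECT DISTINCT id FROM user_activity) AS ids
    CROSS JOIN generate_series(
        CURRENT_DATE - interval '6 days',
        CURRENT_DATE, interval '1 day') AS d(dt)
)
SELECT coalesce(u.actions, 0) AS actions, d.id, d.dt
FROM days d
LEFT JOIN user_activity u ON u.id = d.id AND u.action_day = d.dt
ORDER BY d.id, d.dt;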

Related

Calculating values from the same table but a different period

I have a readings table with the following definition.
Column | Type | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+--------------------------------------
id | integer | | not null | nextval('readings_id_seq'::regclass)
created_at | timestamp without time zone | | not null |
type | character varying(50) | | not null |
device | character varying(100) | | not null |
value | numeric | | not null |
It has data such as:
id | created_at | type | device | value
----+---------------------+--------+--------+-------
1 | 2021-05-11 04:00:00 | weight | 1 | 100
2 | 2021-05-10 03:00:00 | weight | 2 | 120
3 | 2021-05-10 04:00:00 | weight | 1 | 120
4 | 2021-05-10 03:00:00 | weight | 1 | 124
5 | 2021-05-01 22:43:47 | weight | 1 | 130
6 | 2021-05-01 15:00:48 | weight | 1 | 140
7 | 2021-05-01 13:00:48 | weight | 2 | 160
Desired Output
Given a device and a type, I would like the max and min value from the past 7 days for each matched row (the current row itself excluded). If there's nothing in the past 7 days, then it should be 0.
id | created_at | type | device | value | min | max
----+---------------------+--------+--------+-------+-----+-----
1 | 2021-05-11 04:00:00 | weight | 1 | 100 | 120 | 124
3 | 2021-05-10 04:00:00 | weight | 1 | 120 | 124 | 124
4 | 2021-05-10 03:00:00 | weight | 1 | 124 | 0 | 0
5 | 2021-05-01 22:43:47 | weight | 1 | 130 | 140 | 140
6 | 2021-05-01 15:00:48 | weight | 1 | 140 | 0 | 0
I have created a db-fiddle.
You can use a LEFT JOIN LATERAL for this requirement, like below:
select
    t1.id,
    t1.created_at,
    t1.type,
    t1.device,
    t1.value,
    min(coalesce(t2.value, 0)) as min,
    max(coalesce(t2.value, 0)) as max
from readings t1
left join lateral
    ( select *
      from readings
      where id != t1.id
        and created_at between t1.created_at - interval '7 day' and t1.created_at
        and device = t1.device
        and type = t1.type
    ) t2 on true
where t1.device = '1'    -- Change the device
  and t1.type = 'weight' -- Change the type
group by 1, 2, 3, 4, 5
order by 1
Considering the comments, here is the SQL:
select readings.id, readings.type, readings.device, readings.created_at, readings.value,
       min(coalesce(m_readings.value, 0)) as min,
       max(coalesce(m_readings.value, 0)) as max
from readings
left join readings m_readings
       on m_readings.type = readings.type
      and m_readings.device = readings.device
      and m_readings.id > readings.id
      and date(m_readings.created_at) between date(readings.created_at) - 7 and date(readings.created_at)
group by readings.id, readings.type, readings.device, readings.created_at, readings.value
order by readings.id;
Explanation: we make a LEFT JOIN between each record of readings and the other records of readings that have the same type and device but a different id, keeping only the records from the previous 7 days (note that m_readings.id > readings.id relies on higher ids being older readings, as in the sample data). Then, for each type/device, we group to get the max and min value over those 7 days.
You should be using window functions for this!
select r.*,
max(value) over (partition by device, type
order by created_at
range between interval '7 day' preceding and interval '1 second' preceding
),
min(value) over (partition by device, type
order by created_at
range between interval '7 day' preceding and interval '1 second' preceding
)
from readings r;
The above returns NULL values when there are no values -- and that makes more sense to me than 0. But if you really want 0, just use COALESCE():
select r.*,
coalesce(max(value) over (partition by device, type
order by created_at
range between interval '7 day' preceding and interval '1 second' preceding
), 0),
coalesce(min(value) over (partition by device, type
order by created_at
range between interval '7 day' preceding and interval '1 second' preceding
), 0)
from readings r;
In addition to being more concise, this is easier to read and should have better performance than other methods.
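If the table is large, an index matching the window's partitioning and ordering may help the window-function version (a suggestion based on general PostgreSQL practice, not something stated in the answers above):
create index idx_readings_device_type_ts on readings (device, type, created_at);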

How can I insert data from one table into another table provided some specific conditions are satisfied

Logic: If today is Monday (per the 'time' table), the data present in S should be inserted into M (along with a sent_day column holding today's date).
If today is not Monday, the dates of the current week (unique week_id) should be checked against M. If any of these dates are already present in M, then S should not be inserted into M; if none of them are, then S should be inserted into M.
time
+------------+------------+----------------+
| cal_dt | cal_day | week_id |
+------------+------------+----------------+
| 2020-03-23 | Monday | 123 |
| 2020-03-24 | Tuesday | 123 |
| 2020-03-25 | Wednesday | 123 |
| 2020-03-26 | Thursday | 123 |
| 2020-03-27 | Friday | 123 |
| 2020-03-30 | Monday | 124 |
| 2020-03-31 | Tuesday | 124 |
+------------+------------+----------------+
M
+------------+----------+-------+
| sent_day | item | price |
+------------+----------+-------+
| 2020-03-11 | pen | 10 |
| 2020-03-11 | book | 50 |
| 2020-03-13 | Eraser | 5 |
| 2020-03-13 | sharpner | 5 |
+------------+----------+-------+
S
+----------+-------+
| item | price |
+----------+-------+
| pen | 25 |
| book | 20 |
| Eraser | 10 |
| sharpner | 3 |
+----------+-------+
My attempt so far:
Insert INTO M
SELECT
    CASE WHEN (SELECT cal_day FROM time WHERE cal_dt = current_date) = 'Monday' THEN s.*
    ELSE
        (CASE WHEN (SELECT cal_dt FROM time WHERE wk_id = (SELECT wk_id FROM time WHERE cal_dt = current_date)) NOT IN (SELECT DISTINCT sent_day FROM M) THEN 1 ELSE 0 END)
    THEN s.* ELSE END
FROM s
I would do this in two separate INSERT statements:
The first condition ("if today is Monday") is quite easy:
insert into m (sent_day, item, price)
select current_date, item, price
from s
where exists (select *
from "time"
where cal_dt = current_date
and cal_day = 'Monday');
I find storing both the date and the weekday a bit confusing, as the weekday can easily be extracted from the date. For the test "if today is Monday" it's actually not necessary to consult the "time" table at all (in PostgreSQL, extract(dow from ...) returns 0 for Sunday through 6 for Saturday, so Monday is 1):
insert into m (sent_day, item, price)
select current_date, item, price
from s
where extract(dow from current_date) = 1;
The second part is a bit more complicated, but if I understand it correctly, it should be something like this:
insert into m (sent_day, item, price)
select current_date, item, price
from s
where not exists (select *
from m
where m.sent_day in (select cal_dt
from "time" t
where cal_dt = current_date
and cal_day <> 'Monday'));
If you just want a single INSERT statement, you could simply do a UNION ALL between the two selects:
insert into m (sent_day, item, price)
select current_date, item, price
from s
where extract(dow from current_date) = 1
union all
select current_date, item, price
from s
where not exists (select *
from m
where m.sent_day in (select cal_dt
from "time" t
where cal_dt = current_date
and cal_day <> 'Monday'));
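If the check really needs to cover every date of the current week (via week_id, as described in the question) rather than just today, a variant along these lines might work (a sketch only, assuming the week_id column from the question's time table):
insert into m (sent_day, item, price)
select current_date, item, price
from s
where extract(dow from current_date) <> 1
  and not exists (select *
                  from m
                  join "time" t on t.cal_dt = m.sent_day
                  where t.week_id = (select week_id
                                     from "time"
                                     where cal_dt = current_date));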

SQL interpolating missing values for a specific date range - with some conditions

There are some similar questions on the site, but I believe mine warrants a new post because there are specific conditions that need to be incorporated.
I have a table with monthly intervals, structured like this:
+----+--------+--------------+--------------+
| ID | amount | interval_beg | interval_end |
+----+--------+--------------+--------------+
| 1 | 10 | 12/17/2017 | 1/17/2018 |
| 1 | 10 | 1/18/2018 | 2/18/2018 |
| 1 | 10 | 2/19/2018 | 3/19/2018 |
| 1 | 10 | 3/20/2018 | 4/20/2018 |
| 1 | 10 | 4/21/2018 | 5/21/2018 |
+----+--------+--------------+--------------+
I've found that sometimes there is a month of data missing around the end/beginning of the year where I know it should exist, like this:
+----+--------+--------------+--------------+
| ID | amount | interval_beg | interval_end |
+----+--------+--------------+--------------+
| 2 | 10 | 10/14/2018 | 11/14/2018 |
| 2 | 10 | 11/15/2018 | 12/15/2018 |
| 2 | 10 | 1/17/2019 | 2/17/2019 |
| 2 | 10 | 2/18/2019 | 3/18/2019 |
| 2 | 10 | 3/19/2019 | 4/19/2019 |
+----+--------+--------------+--------------+
What I need is a statement that will:
Identify where this year-end period is missing (but not find missing
months that aren't at the beginning/end of the year).
Create this interval by using the length of an existing interval for
that ID (maybe using the mean interval length for the ID?). I could create the interval from the "gap" between the previous and next interval, except that won't work if I'm missing an interval at the beginning or end of the ID's record (i.e. if the record starts at, say, 1/16/2015, I need the amount for 12/15/2014-1/15/2015).
Interpolate an 'amount' for this interval using the mean daily
'amount' from the closest existing interval.
The end result for the sample above should look like:
+----+--------+--------------+--------------+
| ID | amount | interval_beg | interval_end |
+----+--------+--------------+--------------+
| 2 | 10 | 10/14/2018 | 11/14/2018 |
| 2 | 10 | 11/15/2018 | 12/15/2018 |
| 2 | 10 | 12/16/2018 | 1/16/2019 |
| 2 | 10 | 1/17/2019 | 2/17/2019 |
| 2 | 10 | 2/18/2019 | 3/18/2019 |
+----+--------+--------------+--------------+
A 'nice to have' would be a flag indicating that this value is interpolated.
Is there a way to do this efficiently in SQL? I have written a solution in SAS, but have a need to move it to SQL, and my SAS solution is very inefficient (optimization isn't a goal, so any statement that does what I need is fantastic).
EDIT: I've made an SQLFiddle with my example table here:
http://sqlfiddle.com/#!18/8b16d
You can use a sequence of CTEs to build up the data for the missing periods. In this query, the first CTE (EOYS) generates all the end-of-year dates (YYYY-12-31) relevant to the table; the second (INTERVALS) computes the average interval length for each ID; and the third (MISSING) attempts to find the start (from t2) and end (from t3) dates of the adjoining intervals for any missing (indicated by t1.ID IS NULL) end-of-year interval. The output of this CTE is then used in an INSERT ... SELECT query to add the missing interval records to the table, generating the missing dates by adding/subtracting the interval length to the end/start date of the adjacent interval as necessary.
First though we add the interp column to indicate if a row was interpolated:
ALTER TABLE Table1 ADD interp TINYINT NOT NULL DEFAULT 0;
This sets interp to 0 for all existing rows. Then we can do the INSERT, setting interp for all those rows to 1:
WITH EOYS AS (
SELECT DISTINCT DATEFROMPARTS(DATEPART(YEAR, interval_beg), 12, 31) AS eoy
FROM Table1
),
INTERVALS AS (
SELECT ID, AVG(DATEDIFF(DAY, interval_beg, interval_end)) AS interval_len
FROM Table1
GROUP BY ID
),
MISSING AS (
SELECT e.eoy,
ids.ID,
i.interval_len,
COALESCE(t2.amount, t3.amount) AS amount,
DATEADD(DAY, 1, t2.interval_end) AS interval_beg,
DATEADD(DAY, -1, t3.interval_beg) AS interval_end
FROM EOYS e
CROSS JOIN (SELECT DISTINCT ID FROM Table1) ids
JOIN INTERVALS i ON i.ID = ids.ID
LEFT JOIN Table1 t1 ON ids.ID = t1.ID
AND e.eoy BETWEEN t1.interval_beg AND t1.interval_end
LEFT JOIN Table1 t2 ON ids.ID = t2.ID
AND DATEADD(MONTH, -1, e.eoy) BETWEEN t2.interval_beg AND t2.interval_end
LEFT JOIN Table1 t3 ON ids.ID = t3.ID
AND DATEADD(MONTH, 1, e.eoy) BETWEEN t3.interval_beg AND t3.interval_end
WHERE t1.ID IS NULL
)
INSERT INTO Table1 (ID, amount, interval_beg, interval_end, interp)
SELECT ID,
amount,
COALESCE(interval_beg, DATEADD(DAY, -interval_len, interval_end)) AS interval_beg,
COALESCE(interval_end, DATEADD(DAY, interval_len, interval_beg)) AS interval_end,
1 AS interp
FROM MISSING
This adds the following rows to the table:
ID amount interval_beg interval_end interp
2 10 2017-12-05 2018-01-04 1
2 10 2018-12-16 2019-01-16 1
2 10 2019-12-28 2020-01-27 1
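To review the interpolated rows alongside the originals afterwards (just a quick sanity check, not part of the fix), list the table ordered by ID and interval start and inspect the interp column:
SELECT * FROM Table1 ORDER BY ID, interval_beg;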

Querying all past and future round birthdays

I have the birthdates of users in a table and want to display a list of round birthdays for the next n years (starting from an arbitrary date x), which looks like this:
+----------------------------------------------------------------------------------------+
| Name | id | birthdate | current_age | birthday | year | month | day | age_at_date |
+----------------------------------------------------------------------------------------+
| User 1 | 1 | 1958-01-23 | 59 | 2013-01-23 | 2013 | 1 | 23 | 55 |
| User 2 | 2 | 1988-01-29 | 29 | 2013-01-29 | 2013 | 1 | 29 | 25 |
| User 3 | 3 | 1963-02-12 | 54 | 2013-02-12 | 2013 | 2 | 12 | 50 |
| User 1 | 1 | 1958-01-23 | 59 | 2018-01-23 | 2018 | 1 | 23 | 60 |
| User 2 | 2 | 1988-01-29 | 29 | 2018-01-29 | 2018 | 1 | 29 | 30 |
| User 3 | 3 | 1963-02-12 | 54 | 2018-02-12 | 2018 | 2 | 12 | 55 |
| User 1 | 1 | 1958-01-23 | 59 | 2023-01-23 | 2023 | 1 | 23 | 65 |
| User 2 | 2 | 1988-01-29 | 29 | 2023-01-29 | 2023 | 1 | 29 | 35 |
| User 3 | 3 | 1963-02-12 | 54 | 2023-02-12 | 2023 | 2 | 12 | 60 |
+----------------------------------------------------------------------------------------+
As you can see, I want to "wrap around" and not only show the next upcoming round birthday (which is easy), but also historical and far-future data.
The core idea of my current approach is the following: via generate_series I generate all dates from 1900 till 2100 and join them to the users by matching the day and month of the birthdate. Based on that, I calculate the age at each date and finally select only those birthdays that are round (divisible by 5) and yield a nonnegative age.
WITH
test_users(id, name, birthdate) AS (
VALUES
(1, 'User 1', '23-01-1958' :: DATE),
(2, 'User 2', '29-01-1988'),
(3, 'User 3', '12-02-1963')
),
dates AS (
SELECT
s AS date,
date_part('year', s) AS year,
date_part('month', s) AS month,
date_part('day', s) AS day
FROM generate_series('01-01-1900' :: TIMESTAMP, '01-01-2100' :: TIMESTAMP, '1 days' :: INTERVAL) AS s
),
birthday_data AS (
SELECT
id AS member_id,
test_users.birthdate AS birthdate,
(date_part('year', age((test_users.birthdate)))) :: INT AS current_age,
date :: DATE AS birthday,
date_part('year', date) AS year,
date_part('month', date) AS month,
date_part('day', date) AS day,
ROUND(extract(EPOCH FROM (dates.date - birthdate)) / (60 * 60 * 24 * 365)) :: INT AS age_at_date
FROM test_users, dates
WHERE
dates.day = date_part('day', birthdate) AND
dates.month = date_part('month', birthdate) AND
dates.year >= date_part('year', birthdate)
)
SELECT
test_users.name,
bd.*
FROM test_users
LEFT JOIN birthday_data bd ON bd.member_id = test_users.id
WHERE
bd.age_at_date % 5 = 0 AND
bd.birthday BETWEEN NOW() - INTERVAL '5' YEAR AND NOW() + INTERVAL '10' YEAR
ORDER BY bd.birthday;
My current approach seems very inefficient and rather complicated: it takes >100 ms. Does anybody have an idea for a more compact and performant query? I am using PostgreSQL 9.5.3. Thank you!
Maybe try to join against a generate_series:
create table bday(id serial, name text, dob date);
insert into bday (name, dob) values ('a', '08-21-1972'::date);
insert into bday (name, dob) values ('b', '03-20-1974'::date);
select *
from bday,
lateral ( select generate_series((1950 - y) / 5, (2010 - y) / 5) * 5 + y as year
          from (select date_part('year', dob)::integer as y) as t2
        ) as t1;
This will, for each entry, generate the round-birthday years between 1950 and 2010.
You can add a WHERE clause to exclude people born after 2010 (they can't have a round birthday in range),
or exclude people born before 1850 (they are unlikely...).
--
Edit (after your edit):
So your generate_series creates 360+ rows per annum. In 100 years that is over 30,000 rows, and they get joined to each user (3 users => about 100,000 rows).
My query generates only rows for the years needed: in 100 years that is 20 rows.
That means 20 rows per user.
Dividing by 5 ensures that the series steps land on round birthdays.
(1950 - y) / 5 calculates how many round birthdays there were before 1950.
A person born in 1941 needs to skip 1941 and 1946, but has a round birthday in 1951. So that is the difference (9 years) divided by 5, and then actually plus 1 to account for the 0th round birthday (the birth year itself).
If the person is born after 1950 the number is negative, and greatest(-1, ...) + 1 gives 0, starting at the actual birth year.
But actually it should be
select *
from bday,
lateral ( select generate_series(greatest(-1, (1950 - y) / 5) + 1, (2010 - y) / 5) * 5 + y as year
          from (select date_part('year', dob)::integer as y) as t2
        ) as t1;
(you can use greatest(0, ...) + 1 instead if you want to start at age 5 rather than age 0)
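To turn the generated year back into a full birthday row, the month and day can be re-attached with make_date (available since PostgreSQL 9.4; the extra columns here are my additions, not part of the answer above):
select b.name, b.id, b.dob,
       -- note: a Feb 29 birthdate would need special handling in non-leap years
       make_date(t1.year, date_part('month', b.dob)::int, date_part('day', b.dob)::int) as birthday,
       t1.year - date_part('year', b.dob)::int as age_at_date
from bday b,
lateral ( select generate_series(greatest(-1, (1950 - y) / 5) + 1, (2010 - y) / 5) * 5 + y as year
          from (select date_part('year', b.dob)::integer as y) as t2
        ) as t1
order by birthday;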

Calculating Fast Product Usage Intervals (Funnel Conversions) in Vertica SQL

I've got a massive dataset (~30 billion rows):
host_id | usr_id | src_id | vis_num | event_ts
where any user from their parent host can visit a page on a source (src_id), where the source is, say, their phone, tablet, or computer (unidentifiable). The column vis_num is the ordered number of visits per source per user per host. The column event_ts captures the timestamp of each visit per source per user per host. An example data set for one host might look like this:
host_id | usr_id | src_id | vis_num | event_ts
----------------------------------------------------------------
100 | 10 | 05 | 1 | 2017-08-01 14:52:34
100 | 10 | 05 | 1 | 2017-08-01 14:56:00
100 | 10 | 05 | 1 | 2017-08-01 14:58:09
100 | 10 | 05 | 2 | 2017-08-01 17:08:10
100 | 10 | 05 | 2 | 2017-08-01 17:16:07
100 | 10 | 05 | 2 | 2017-08-01 17:23:25
100 | 10 | 72 | 1 | 2017-07-29 20:03:01
100 | 10 | 72 | 1 | 2017-07-29 20:04:10
100 | 10 | 72 | 2 | 2017-07-29 20:45:17
100 | 10 | 72 | 2 | 2017-07-29 20:56:46
100 | 10 | 72 | 3 | 2017-07-30 09:30:15
100 | 10 | 72 | 3 | 2017-07-30 09:34:19
100 | 10 | 72 | 4 | 2017-08-01 18:16:57
100 | 10 | 72 | 4 | 2017-08-01 18:26:00
100 | 10 | 72 | 5 | 2017-08-02 07:53:33
100 | 22 | 43 | 1 | 2017-07-06 11:45:48
100 | 22 | 43 | 1 | 2017-07-06 11:46:12
100 | 22 | 43 | 2 | 2017-07-07 08:41:11
For each source id, a change in visit number implies a log-off time and a subsequent log-on time. Note that activity from different sources may overlap in time.
My goal is to calculate how many (non-new) users logged in at least twice within some time interval, say 45 days. Ultimately I want to:
1) Identify all users who repeated the critical event at least twice within a certain time period (45 days).
2) For those users, measure the length of time they took between completing the event the first and second time.
3) Plot a cumulative distribution function – i.e., the percentage of users who performed the second event over different time intervals.
4) Identify the time interval at which 80% of users have completed the second event—this is your product usage interval.
Page 23 of:
http://usdatavault.com/library/Product-Analytics-Playbook-vol1-Mastering_Retention.pdf
Here is what I've tried:
with new_users as (
select host_id || ' ' || usr_id as host_usr_id,
min(event_ts) as first_login_date
from tableA
group by 1
)
,
time_diffs as (
select a.host_id || ' ' || a.usr_id as host_usr_id,
a.usr_id,
a.src_id,
a.event_ts,
a.vis_num,
b.first_login_date,
case when lag(a.vis_num) over (partition by a.host_id, a.usr_id, a.src_id
                               order by a.event_ts) <> a.vis_num
     then a.event_ts - lag(a.event_ts) over (partition by a.host_id, a.usr_id, a.src_id
                                             order by a.event_ts)
     else null
end as time_diff
from tableA a
left join new_users b
on b.host_usr_id = a.host_id || ' ' || a.usr_id
where a.event_date > current_date - interval '45 days'
and a.event_date > b.first_login_date + interval '45 days'
)
select count(distinct case when time_diff < interval '45 days'
                             and event_ts > first_login_date + interval '45 days'
                            then host_usr_id end) as cnt_45
from time_diffs
I've tried multiple other (very different) queries (see below), but performance is definitely an issue here. Joining on date intervals is also a new concept to me. Any help is appreciated.
Another approach:
with new_users as (
select host_id,
usr_id,
min(event_ts) as first_login_date
from tableA
group by 1,2
),
x_day_twice as (
select a.host_id,
a.usr_id,
a.src_id,
max(a.vis_num) - min(a.vis_num) + 1 as num_logins
from tableA a
left join new_users b
on a.host_id || ' ' || a.usr_id = b.host_id || ' ' || b.usr_id
and a.event_ts > b.first_login_date + interval '45 days'
where event_ts >= current_timestamp - interval '1 days' - interval '45 days'
  and first_login_date < current_date - 1 - 45
group by 1, 2, 3
)
select count(distinct case when num_logins > 1
then host_id || ' ' || usr_id end)
from x_day_twice
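For what it's worth, one way to avoid joining on date intervals entirely is to reduce each visit number to its first timestamp and then compare consecutive session starts with LAG (a sketch only, reusing the question's table and column names; treating "logged in at least twice within 45 days" as "two consecutive session starts at most 45 days apart" is my assumption):
with session_starts as (
    -- one row per session: the first event of each visit number
    select host_id, usr_id, src_id, vis_num,
           min(event_ts) as session_start
    from tableA
    group by 1, 2, 3, 4
),
gaps as (
    -- compare each session start to the user's previous one
    select host_id, usr_id, session_start,
           lag(session_start) over (partition by host_id, usr_id
                                    order by session_start) as prev_start
    from session_starts
)
select count(distinct host_id || ' ' || usr_id) as cnt_45
from gaps
where prev_start is not null
  and session_start <= prev_start + interval '45 days';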