How to get a count of new values per day in Postgres - sql

I have the following schema -
Date UserID
"2021-07-29" 1
"2021-07-29" 2
"2021-07-30" 1
"2021-07-30" 4
"2021-08-01" 2
"2021-08-01" 2
It contains the dates of some event, along with the user who triggered that event.
I need to get a count of all the NEW users who triggered the event on every given day until today, ignoring users who have triggered the event in the past.
So after running the query, results would look like this
Date Count
"2021-07-29" 2
"2021-07-30" 1
"2021-08-01" 0
Because on the 29th, users 1 and 2, whom I had never seen before, triggered it.
On the 30th, user 4, whom I had never seen before, triggered it.
On the 1st, only user 2 triggered it, and I've seen that user before, so they are ignored.

You can use a window function to get the first date for each user. Then use conditional aggregation:
select date, count(*) filter (where seqnum = 1) as num_new_users
from (select t.*,
             row_number() over (partition by userid order by date) as seqnum
      from t
     ) t
group by date;
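To check this against the sample data from the question, here is a minimal sketch that runs the same idea through Python's built-in sqlite3 module instead of Postgres. The table name t and the columns date and userid follow the answer; the FILTER clause is swapped for an equivalent SUM(CASE ...) so it also runs on older SQLite builds, and window functions are assumed to be available (SQLite 3.25 or later).

```python
import sqlite3

# Sample rows from the question: (date, userid)
ROWS = [
    ("2021-07-29", 1),
    ("2021-07-29", 2),
    ("2021-07-30", 1),
    ("2021-07-30", 4),
    ("2021-08-01", 2),
    ("2021-08-01", 2),
]

# seqnum = 1 marks the first row ever seen for a user;
# summing that flag per date counts the new users of that day.
QUERY = """
SELECT date,
       SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END) AS num_new_users
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY userid ORDER BY date) AS seqnum
      FROM t
     ) AS s
GROUP BY date
ORDER BY date
"""


def new_users_per_day(rows):
    """Load rows into an in-memory table and return (date, new_user_count) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (date TEXT, userid INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(new_users_per_day(ROWS))
```

With the sample rows this gives 2, 1 and 0 for the three dates, matching the expected output in the question. Because row_number() flags exactly one row per user, a user who fires the event several times on their first day is still counted once.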

Use the window function FIRST_VALUE in a subquery (or CTE) to get the first trigger for each user and in the outer query count if it's equal to the current date:
SELECT dt, count(*) FILTER (WHERE first_trigger = dt)
FROM (
  SELECT *, FIRST_VALUE(dt) OVER w first_trigger FROM t
  WINDOW w AS (PARTITION BY userid ORDER BY dt
               RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
  ORDER BY dt) j
GROUP BY dt;

Use MIN() window function to get the min date for each user and then aggregate and count for each date only the min dates:
SELECT Date, SUM((Date = min_date)::int) Count
FROM (
  SELECT *, MIN(date) OVER (PARTITION BY UserID) min_date
  FROM tablename
) t
GROUP BY Date;
Or:
SELECT Date, COUNT(*) FILTER (WHERE Date = min_date) Count
FROM (
  SELECT *, MIN(date) OVER (PARTITION BY UserID) min_date
  FROM tablename
) t
GROUP BY Date;
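One caveat worth checking with the MIN() approach: it counts every row that falls on a user's first day, so a user who triggers the event twice on that first day is counted twice. The sample data only has a duplicate on a later day, so it does not show this. Below is a hedged sketch run on SQLite through Python's sqlite3 module (not Postgres); user 5 is a hypothetical addition to the sample rows, and COUNT(DISTINCT CASE ...) is just one possible way to count users rather than rows.

```python
import sqlite3

ROWS = [
    ("2021-07-29", 1), ("2021-07-29", 2),
    ("2021-07-30", 1), ("2021-07-30", 4),
    ("2021-08-01", 2), ("2021-08-01", 2),
    # Hypothetical addition: user 5 triggers the event twice on their first day.
    ("2021-08-01", 5), ("2021-08-01", 5),
]

SUBQUERY = "SELECT *, MIN(date) OVER (PARTITION BY userid) AS min_date FROM tablename"

# Counts rows, so duplicates on a user's first day are counted more than once.
PER_ROW = f"""
SELECT date, SUM(CASE WHEN date = min_date THEN 1 ELSE 0 END)
FROM ({SUBQUERY}) AS s
GROUP BY date
ORDER BY date
"""

# Counts distinct users whose first day is this date.
PER_USER = f"""
SELECT date, COUNT(DISTINCT CASE WHEN date = min_date THEN userid END)
FROM ({SUBQUERY}) AS s
GROUP BY date
ORDER BY date
"""


def run(query, rows=ROWS):
    """Run one of the queries above against an in-memory copy of the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tablename (date TEXT, userid INTEGER)")
    conn.executemany("INSERT INTO tablename VALUES (?, ?)", rows)
    result = conn.execute(query).fetchall()
    conn.close()
    return result


print(run(PER_ROW))   # 2021-08-01 is reported as 2 (user 5 counted twice)
print(run(PER_USER))  # 2021-08-01 is reported as 1
```

If the table can never contain the same user twice on the same date, both variants return the same numbers and the simpler one is fine.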

Related

Hive/SQL How do you access the value of the column which you just computed for previous rows?

I have a table uv_user_date that looks like this:
It's basically a user login table that shows the cumulative login days, partitioned by user_id.
The column pre shows the previous login date for each login record.
Based on this I want to compute the consecutive login days for each user record.
The answer should be:
My idea is, for each record:
if(uv_date - pre = 1 day)
then consecutive login days is the last consecutive login days + 1
else
1
But I am having trouble accessing the previous row's consecutive login days value.
The code would be:
SELECT *,
if(pre = date_add(uv_date, -1), last(consecutive_days) + 1, 1) consecutive_days
FROM uv_user_date
Is there any way to get the value of last(consecutive_days)?
First, find the date difference between each login and the previous one.
tbl1:
select *,
       if(pre is null, 1, datediff(uv_date, pre)) as diff
from your_table
Then take the difference between the cumulative sum of diff and accumulative_uv_date for each user_id. This value stays constant within a run of consecutive days, so you can use it as a group key.
tbl2:
select *,
       sum(diff) over (partition by user_id order by uv_date rows between unbounded preceding and current row) - accumulative_uv_date as rnk
from tbl1
Finally, count the consecutive days within each group:
select user_id, uv_date, rnk,
       row_number() over (partition by user_id, rnk order by uv_date) as consecutive_days
from tbl2
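The three steps can be chained as CTEs. Here is a rough sketch that runs them on SQLite through Python's sqlite3 module rather than Hive, so datediff() is replaced by a julianday() subtraction. The rows are hypothetical, since the original table is not shown; only the column names (user_id, uv_date, pre, accumulative_uv_date) come from the question.

```python
import sqlite3

# Hypothetical rows: (user_id, uv_date, pre, accumulative_uv_date)
ROWS = [
    (1, "2021-01-01", None, 1),
    (1, "2021-01-02", "2021-01-01", 2),
    (1, "2021-01-04", "2021-01-02", 3),
    (1, "2021-01-05", "2021-01-04", 4),
    (2, "2021-01-03", None, 1),
]

QUERY = """
WITH tbl1 AS (
    -- days since the previous login; 1 for a user's first login
    SELECT *,
           CASE WHEN pre IS NULL THEN 1
                ELSE CAST(julianday(uv_date) - julianday(pre) AS INTEGER)
           END AS diff
    FROM uv_user_date
),
tbl2 AS (
    -- running sum of diff minus the login counter is constant within a streak
    SELECT *,
           SUM(diff) OVER (PARTITION BY user_id ORDER BY uv_date
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
             - accumulative_uv_date AS rnk
    FROM tbl1
)
SELECT user_id, uv_date,
       ROW_NUMBER() OVER (PARTITION BY user_id, rnk ORDER BY uv_date) AS consecutive_days
FROM tbl2
ORDER BY user_id, uv_date
"""


def consecutive_days(rows):
    """Return (user_id, uv_date, consecutive_days) for each login row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE uv_user_date "
        "(user_id INTEGER, uv_date TEXT, pre TEXT, accumulative_uv_date INTEGER)"
    )
    conn.executemany("INSERT INTO uv_user_date VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in consecutive_days(ROWS):
    print(row)
```

For user 1 the streak counter goes 1, 2, then resets to 1 on 2021-01-04 (one day was skipped) and continues with 2 on 2021-01-05.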

Postgres DB query to get the count, and first and last ids by date in a single query

I have the following db structure.
table
-----
id (uuids)
date (TIMESTAMP)
I want to write a query in postgres (actually cockroachdb which uses the postgres engine, so postgres query should be fine).
The query should return a count of records between two dates, the id of the record with the earliest date, and the id of the record with the latest date within that range.
So the query should return the following:
count, id(of the earliest record in the range), id (of the latest record in the range)
thanks.
You can use row_number() twice, then conditional aggregation:
select
    no_records,
    min(id) filter(where rn_asc = 1) first_id,
    max(id) filter(where rn_desc = 1) last_id
from (
    select
        id,
        count(*) over() no_records,
        row_number() over(order by date asc) rn_asc,
        row_number() over(order by date desc) rn_desc
    from mytable
    where date >= ? and date < ?
) t
where 1 in (rn_asc, rn_desc)
group by no_records
The question marks represent the (inclusive) start and (exclusive) end of the date interval.
Of course, if ids are always increasing, simple aggregation is sufficient:
select count(*), min(id) first_id, max(id) last_id
from mytable
where date >= ? and date < ?
Unfortunately, Postgres doesn't support first_value() as an aggregation function. One method is to use arrays:
select count(*),
(array_agg(id order by date asc))[1] as first_id,
(array_agg(id order by date desc))[1] as last_id
from t
where date >= ? and date <= ?
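To see why plain min(id)/max(id) is not enough when ids are not ordered by date (as with uuids), here is a small sketch of the row_number() variant. It runs on SQLite through Python's sqlite3 module, not on Postgres or CockroachDB, so filter(...) is replaced by MAX(CASE ...). The ids and timestamps are made up for illustration, and the table name mytable is the one used in the first answer.

```python
import sqlite3

# Hypothetical rows: (id, date). The ids deliberately do not sort like the dates.
ROWS = [
    ("id-b", "2021-03-01 10:00:00"),
    ("id-a", "2021-03-05 09:00:00"),
    ("id-c", "2021-03-03 12:00:00"),
    ("id-d", "2021-04-01 00:00:00"),  # outside the queried range
]

QUERY = """
SELECT no_records,
       MAX(CASE WHEN rn_asc = 1 THEN id END) AS first_id,
       MAX(CASE WHEN rn_desc = 1 THEN id END) AS last_id
FROM (SELECT id,
             COUNT(*) OVER () AS no_records,
             ROW_NUMBER() OVER (ORDER BY date ASC) AS rn_asc,
             ROW_NUMBER() OVER (ORDER BY date DESC) AS rn_desc
      FROM mytable
      WHERE date >= ? AND date < ?
     ) AS s
WHERE 1 IN (rn_asc, rn_desc)
GROUP BY no_records
"""


def count_first_last(rows, start, end):
    """Return (count, earliest id, latest id) for start <= date < end, or None if empty."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (id TEXT, date TEXT)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?)", rows)
    row = conn.execute(QUERY, (start, end)).fetchone()
    conn.close()
    return row


print(count_first_last(ROWS, "2021-03-01", "2021-04-01"))
```

Here the earliest record is id-b and the latest is id-a, whereas min(id) and max(id) would have returned id-a and id-c. Note that an empty range yields no row at all with this query.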

Vertica Analytic function to count instances in a window

Let's say I have a dataset with two columns: ID and timestamp. My goal is to return IDs that have at least n timestamps in any 30 day window.
Here is an example:
ID Timestamp
1 '2019-01-01'
2 '2019-02-01'
3 '2019-03-01'
1 '2019-01-02'
1 '2019-01-04'
1 '2019-01-17'
So, let's say I want to return a list of IDs that have 3 timestamps in any 30 day window.
Given above, my resultset would just be ID = 1. I'm thinking some kind of windowing function would accomplish this, but I'm not positive.
Any chance you could help me write a query that accomplishes this?
A relatively simple way to do this involves lag()/lead():
select t.*
from (select t.*,
             lead(timestamp, 2) over (partition by id order by timestamp) as timestamp_2
      from t
     ) t
where datediff(day, timestamp, timestamp_2) <= 30;
The lead() looks two rows ahead, i.e. at the third timestamp in a series starting from the current row. The where clause checks whether that timestamp is within 30 days of the current one. The result is the rows where this occurs.
If you just want the ids, then:
select distinct id
from (select t.*,
             lead(timestamp, 2) over (partition by id order by timestamp) as timestamp_2
      from t
     ) t
where datediff(day, timestamp, timestamp_2) <= 30;
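The queries above hard-code n = 3. For a general n, the same idea should work with lead(timestamp, n - 1). Here is a minimal sketch on the sample data, run on SQLite through Python's sqlite3 module rather than Vertica, so datediff() becomes a julianday() subtraction. The column name ts and the helper ids_with_n_events are my own naming, not from the question.

```python
import sqlite3

# Sample rows from the question: (id, timestamp)
ROWS = [
    (1, "2019-01-01"),
    (2, "2019-02-01"),
    (3, "2019-03-01"),
    (1, "2019-01-02"),
    (1, "2019-01-04"),
    (1, "2019-01-17"),
]


def ids_with_n_events(rows, n, days=30):
    """Return ids having at least n timestamps no more than `days` days apart."""
    offset = int(n) - 1  # the n-th event is n - 1 rows ahead of the current one
    if offset < 0:
        raise ValueError("n must be at least 1")
    # offset is a validated integer, so formatting it into the SQL text is safe here.
    query = f"""
    SELECT DISTINCT id
    FROM (SELECT id, ts,
                 LEAD(ts, {offset}) OVER (PARTITION BY id ORDER BY ts) AS ts_n
          FROM t
         ) AS s
    WHERE julianday(ts_n) - julianday(ts) <= ?
    ORDER BY id
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, ts TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = [r[0] for r in conn.execute(query, (days,))]
    conn.close()
    return result


print(ids_with_n_events(ROWS, 3))
```

Rows near the end of each id's series have no row n - 1 steps ahead, so lead() returns NULL there and the comparison filters them out.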

Running count distinct

I am trying to see how the cumulative number of subscribers changed over time based on unique email addresses and date they were created. Below is an example of a table I am working with.
I am trying to turn it into the table below. Email 1@gmail.com was created twice and I would like to count it once. I cannot figure out how to generate the Running count distinct column.
Thanks for the help.
I would usually do this using row_number():
select date, count(*),
       sum(count(*)) over (order by date),
       sum(sum(case when seqnum = 1 then 1 else 0 end)) over (order by date)
from (select t.*,
             row_number() over (partition by email order by date) as seqnum
      from t
     ) t
group by date
order by date;
This is similar to the version using lag(). However, I get nervous using lag if the same email appears multiple times on the same date.
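For reference, here is a small sketch of the row_number() approach on made-up data (the original table is not shown, so the dates and addresses below are hypothetical). It runs on SQLite through Python's sqlite3 module, and it splits the nested sum(sum(...)) over (...) into a per-day CTE followed by running sums, which I find easier to read; the result should be the same.

```python
import sqlite3

# Hypothetical rows: (date, email). a@example.com is created twice.
ROWS = [
    ("2021-01-01", "a@example.com"),
    ("2021-01-01", "b@example.com"),
    ("2021-01-02", "a@example.com"),
    ("2021-01-02", "c@example.com"),
    ("2021-01-03", "c@example.com"),
]

QUERY = """
WITH daily AS (
    -- per day: rows created, and rows that are the first ever for their email
    SELECT date,
           COUNT(*) AS created,
           SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END) AS first_seen
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY email ORDER BY date) AS seqnum
          FROM t
         ) AS s
    GROUP BY date
)
SELECT date,
       created,
       SUM(created) OVER (ORDER BY date) AS running_total,
       SUM(first_seen) OVER (ORDER BY date) AS running_distinct
FROM daily
ORDER BY date
"""


def running_counts(rows):
    """Return (date, created, running_total, running_distinct) per day."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (date TEXT, email TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in running_counts(ROWS):
    print(row)
```

With these rows the running total reaches 5 while the running distinct count stops at 3, since a@example.com and c@example.com each appear twice.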
Getting the total count and cumulative count is straightforward. To get the cumulative distinct count, use lag() to check whether the email has an earlier row, and if so set the flag to 0 so that the row is ignored in the running sum.
select distinct dt
      ,count(*) over(partition by dt) as day_total
      ,count(*) over(order by dt) as cumsum
      ,sum(flag) over(order by dt) as cumdist
from (select t.*
            ,case when lag(dt) over(partition by email order by dt) is not null then 0 else 1 end as flag
      from tbl t
     ) t
Here is a solution that uses neither sum() over nor lag(), and still produces the correct results.
Hence it may be simpler to read and maintain.
select
t1.date_created,
(select count(*) from my_table where date_created = t1.date_created) emails_created,
(select count(*) from my_table where date_created <= t1.date_created) cumulative_sum,
(select count( distinct email) from my_table where date_created <= t1.date_created) running_count_distinct
from
(select distinct date_created from my_table) t1
order by 1

How to take only one entry from a table based on an offset to a date column value

I have a requirement to get values from a table based on an offset condition on a date column.
For example, in the table attached below, if any dates in the effectivedate column fall within 15 days of each other, I should return only the first one.
So my expected result would be as below:
Here, for policy A1234, it returns the 6/18/16 entry and skips the 6/12/16 entry, as the offset between these 2 dates is within 15 days and I took the latest one from the list.
If you want to group rows together that are within 15 days of each other, then you have a variant of the gaps-and-islands problem. I would recommend lag() and cumulative sum for this version:
select polno, min(effectivedate), max(expirationdate)
from (select t.*,
             -- start a new group unless the previous row is within 15 days
             sum(case when prev_ed >= dateadd(day, -15, effectivedate)
                      then 0 else 1
                 end) over (partition by polno order by effectivedate) as grp
      from (select t.*,
                   lag(expirationdate) over (partition by polno order by effectivedate) as prev_ed
            from t
           ) t
     ) t
group by polno, grp;